The last set of lightning talks for Code4libBC.
Speeding up Digital Preservation with a Graphics Card, Alex Garnett, SFU
- GPU-accelerated computing: graphics cards are very powerful nowadays, and many organizations have figured out how to put them to work on general-purpose computation
- GPUs are much more powerful than CPUs, but very specialized for video and similarly parallel workloads
- applied to NVIDIA for a hardware grant and received a Titan X video card
- looked at different projects such as FFmpeg for encoding video into archive-friendly formats
- the only change to the workflow needed to move work from the CPU to the GPU is using a different encoder (see the sketch after this list)
- in benchmarks, the GPU was about 10x faster than the CPU, while also freeing up CPU time
- the only problem is that most software, such as Archivematica, usually runs in a VM and doesn't have GPU access/acceleration
- VirtualBox does not support GPU passthrough, but Amazon does offer GPU instances
- could have the video file processed somewhere else by using a different command
- there have been a lot of aborted efforts to bring GPU acceleration to this kind of tooling
- also looking into the Tesseract OCR library
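A minimal sketch of the "just swap the encoder" point above, assuming an ffmpeg build with NVENC support and an NVIDIA GPU; the file names, codec, and bitrate settings are illustrative (H.264 access copies), not the exact commands from the talk.

```python
# Sketch: the CPU and GPU paths differ only in the encoder name passed to -c:v.
# Assumes ffmpeg compiled with NVENC support; paths and settings are placeholders.
import subprocess

def encode(input_path, output_path, use_gpu=False):
    encoder = "h264_nvenc" if use_gpu else "libx264"  # GPU vs CPU H.264 encoder
    cmd = [
        "ffmpeg", "-y",
        "-i", input_path,
        "-c:v", encoder,
        "-b:v", "5M",      # target video bitrate
        "-c:a", "copy",    # pass the audio stream through untouched
        output_path,
    ]
    subprocess.run(cmd, check=True)

# encode("master.mov", "access_cpu.mp4", use_gpu=False)
# encode("master.mov", "access_gpu.mp4", use_gpu=True)
```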
Scripting Named Entity Recognition (NER) to pluck names, organizations and locations from text, Peter Tyrrell, Andornot
- a discovery interface backed by a Solr index service
- a lot of metadata massaging happens, all via a shell-script workflow, including OCR and media format handling
- what do you do when you have unstructured text (e.g. PDFs, Word docs, plain text)? there is only very bare-bones metadata, e.g. title, author, keywords
- wanted to pluck out names (people, organizations) to create new access points
- used the Stanford Natural Language Processing NER, which is Java-based
- the NER tool takes an input file and recognizes the entities in it; a second step outputs files grouped by entity category (see the sketch below)
- demo ensued
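A minimal sketch of the kind of scripting described, assuming nltk plus a downloaded copy of the Stanford NER jar and the English 3-class model; the paths are placeholders and the whitespace tokenizer is deliberately naive.

```python
# Sketch: pluck PERSON / ORGANIZATION / LOCATION tokens out of plain text with
# Stanford NER via NLTK's wrapper, then group them by entity category.
# Assumes nltk, Java, and the Stanford NER download; paths below are placeholders.
from collections import defaultdict
from nltk.tag.stanford import StanfordNERTagger

tagger = StanfordNERTagger(
    "stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz",  # model
    "stanford-ner/stanford-ner.jar",                                   # NER jar
)

def entities_by_category(text):
    """Return a dict mapping entity category to the tokens tagged with it."""
    grouped = defaultdict(list)
    for token, label in tagger.tag(text.split()):  # naive whitespace tokenization
        if label != "O":  # "O" marks tokens that are not part of any entity
            grouped[label].append(token)
    return grouped

# for category, tokens in entities_by_category(open("input.txt").read()).items():
#     print(category, tokens)
```

Each category's tokens could then be written out to a separate file, as in the second step described above.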
PCDM: A Data Model and a Community Model, Justin Simpson
- Portland Common Data Model; originally came from Hydra, but generalized for any Fedora use
- compared to Dublin Core
- DC started ~20 years ago
- in the Hydra community, people found their data models were not compatible with each other
- UCSD proposed a model to the Hydra community; with collaboration from the community it evolved into the Fedora Community Data Model, which turned into PCDM, hosted by DuraSpace
- UCSD focused on properties, but with linked data in mind and a model that would work for others in the Hydra community
- Hydra's technical metadata application profile is modelled after the Europeana and DPLA MAPs
- Islandora has done parallel work with PCDM
- the ontology now exists as an RDF schema
- a hard problem, but with a well-defined scope; everyone wanted to solve it and had some experience; they developed a shared understanding, out in the open
- a good example of collaboration between what some might see as competing communities (i.e. Hydra vs. Islandora)
- a small set of classes and properties that lets very different entities fit the model (see the sketch below)
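A minimal sketch of how small the model is in practice, using rdflib and the published PCDM namespace (http://pcdm.org/models#); the example resource URIs are made up for illustration.

```python
# Sketch: a Collection with one member Object, which has one File, expressed
# with the PCDM ontology via rdflib. Example URIs are invented for illustration.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

PCDM = Namespace("http://pcdm.org/models#")

g = Graph()
g.bind("pcdm", PCDM)

collection = URIRef("http://example.org/collections/photos")
obj = URIRef("http://example.org/objects/photo-001")
master = URIRef("http://example.org/files/photo-001.tif")

g.add((collection, RDF.type, PCDM.Collection))
g.add((obj, RDF.type, PCDM.Object))
g.add((master, RDF.type, PCDM.File))
g.add((collection, PCDM.hasMember, obj))   # Collections aggregate Objects
g.add((obj, PCDM.hasFile, master))         # Objects aggregate Files

print(g.serialize(format="turtle"))
```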
Built to grow: scalability factors to consider before commencing your next digital library software project, Marcus Barnes, SFU
- hard to predict how much scale is needed
- but what can you do, and what should you consider?
- scalability is the ability to handle an increased workload without adding resources to a system, or to handle an increased workload by repeatedly applying a cost-effective strategy for extending the system's capacity
- starting points: use modern programming techniques, get the best hardware infrastructure possible, modularize, do ongoing monitoring, and run a scalability audit
- optimize code/hardware, distribute key components to dedicated hardware as needed
- share knowledge, useful resources, real-world experience
- scalability audit specifically for library systems software?
- technical scalability and organizational scalability
Lunch
Might even have time to catch a quick nap before breakouts this afternoon.