The last set of lightning talks for Code4libBC.
Speeding up Digital Preservation with a Graphics Card, Alex Garnett, SFU
- GPU-accelerated computing: graphics cards are very powerful nowadays, and many organizations have figured out how to put them to work on general-purpose computation
- GPUs are much more powerful than CPUs, but very specialized for video and similarly parallel workloads
- applied to NVIDIA for a hardware grant and received a Titan X video card
- looked at different projects such as FFmpeg for encoding video into archive-friendly formats
- the only change to the workflow needed to move work from the CPU to the GPU is using a different encoder (see the sketch after this list)
- in benchmarks, the GPU was about 10x faster than the CPU, while also freeing up CPU time
- the only problem is that most software, such as Archivematica, usually runs in a VM and doesn't have GPU access/acceleration
- VirtualBox does not support GPU passthrough, but Amazon does offer GPU instances
- could have the video file processed somewhere else by using a different command
- there have been a lot of aborted efforts to bring GPU acceleration to this kind of tooling
- also looking into the Tesseract OCR library
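A minimal sketch of the "just swap the encoder" point above, assuming an ffmpeg build with NVENC support and an NVIDIA GPU; the file names, codec, and bitrate settings are illustrative (H.264 access copies), not the exact commands from the talk.

```python
# Sketch: the CPU and GPU paths differ only in the encoder name passed to -c:v.
# Assumes ffmpeg compiled with NVENC support; paths and settings are placeholders.
import subprocess

def encode(input_path, output_path, use_gpu=False):
    encoder = "h264_nvenc" if use_gpu else "libx264"  # GPU vs CPU H.264 encoder
    cmd = [
        "ffmpeg", "-y",
        "-i", input_path,
        "-c:v", encoder,
        "-b:v", "5M",      # target video bitrate
        "-c:a", "copy",    # pass the audio stream through untouched
        output_path,
    ]
    subprocess.run(cmd, check=True)

# encode("master.mov", "access_cpu.mp4", use_gpu=False)
# encode("master.mov", "access_gpu.mp4", use_gpu=True)
```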
Scripting Named Entity Recognition (NER) to pluck names, organizations and locations from text, Peter Tyrrell, Andornot
- a discovery interface backed by a Solr index service
- a lot of metadata massaging happens, all via a shell-script workflow, including OCR and media format handling
- what do you do when you have unstructured text (e.g. PDFs, Word docs, plain text)? there is only very bare-bones metadata, e.g. title, author, keywords
- wanted to pluck out names (people, organizations) to create new access points
- used the Stanford Natural Language Processing NER, which is Java-based
- the NER tool takes an input file and recognizes the entities in it; a second step outputs files grouped by entity category (see the sketch below)
- demo ensued
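A minimal sketch of the kind of scripting described, assuming nltk plus a downloaded copy of the Stanford NER jar and the English 3-class model; the paths are placeholders and the whitespace tokenizer is deliberately naive.

```python
# Sketch: pluck PERSON / ORGANIZATION / LOCATION tokens out of plain text with
# Stanford NER via NLTK's wrapper, then group them by entity category.
# Assumes nltk, Java, and the Stanford NER download; paths below are placeholders.
from collections import defaultdict
from nltk.tag.stanford import StanfordNERTagger

tagger = StanfordNERTagger(
    "stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz",  # model
    "stanford-ner/stanford-ner.jar",                                   # NER jar
)

def entities_by_category(text):
    """Return a dict mapping entity category to the tokens tagged with it."""
    grouped = defaultdict(list)
    for token, label in tagger.tag(text.split()):  # naive whitespace tokenization
        if label != "O":  # "O" marks tokens that are not part of any entity
            grouped[label].append(token)
    return grouped

# for category, tokens in entities_by_category(open("input.txt").read()).items():
#     print(category, tokens)
```

Each category's tokens could then be written out to a separate file, as in the second step described above.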
PCDM: A Data Model and a Community Model, Justin Simpson
- Portland Common Data Model; originally came from Hydra, but generalized for any Fedora use
- compared to Dublin Core
- DC started ~20 years ago
- in the Hydra community, people found their data models were not compatible with each other
- UCSD proposed a model to the Hydra community; with collaboration from the community it evolved into the Fedora Community Data Model, which turned into PCDM, hosted by DuraSpace
- UCSD focused on properties, but with linked data in mind and a model that would work for others in the Hydra community
- Hydra's technical metadata application profile is modelled after the Europeana and DPLA MAPs
- Islandora has done parallel work with PCDM
- the ontology now exists as an RDF schema
- a hard problem, but with a well-defined scope; everyone wanted to solve it and had some experience; they developed a shared understanding, out in the open
- a good example of collaboration between what some might see as competing communities (i.e. Hydra vs. Islandora)
- a small set of classes and properties that lets very different entities fit the model (see the sketch below)
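A minimal sketch of how small the model is in practice, using rdflib and the published PCDM namespace (http://pcdm.org/models#); the example resource URIs are made up for illustration.

```python
# Sketch: a Collection with one member Object, which has one File, expressed
# with the PCDM ontology via rdflib. Example URIs are invented for illustration.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

PCDM = Namespace("http://pcdm.org/models#")

g = Graph()
g.bind("pcdm", PCDM)

collection = URIRef("http://example.org/collections/photos")
obj = URIRef("http://example.org/objects/photo-001")
master = URIRef("http://example.org/files/photo-001.tif")

g.add((collection, RDF.type, PCDM.Collection))
g.add((obj, RDF.type, PCDM.Object))
g.add((master, RDF.type, PCDM.File))
g.add((collection, PCDM.hasMember, obj))   # Collections aggregate Objects
g.add((obj, PCDM.hasFile, master))         # Objects aggregate Files

print(g.serialize(format="turtle"))
```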
Built to grow: scalability factors to consider before commencing your next digital library software project, Marcus Barnes, SFU
- hard to predict how much scale is needed
- but what can you do, and what should you consider?
- scalability is the ability to handle an increased workload without adding resources to a system, or to handle an increased workload by repeatedly applying a cost-effective strategy for extending the system's capacity
- starting points: use modern programming techniques, get the best hardware infrastructure possible, modularize, do ongoing monitoring, and run a scalability audit
- optimize code/hardware, distribute key components to dedicated hardware as needed
- share knowledge, useful resources, real-world experience
- scalability audit specifically for library systems software?
- technical scalability and organizational scalability
Lunch
Might even have time to catch a quick nap before breakouts this afternoon.