Code4libBC Day 2: Lightning Talks Notes

Notes from day 2 of Code4lib BC talks.

Supplejack Triclops Architecture – Daniel Sifton

harvester and parser
Mongo, Rails, Solr, built on SFU IT clusters
aggregates metadata using OAI, etc.; libraries, galleries, museums, etc. but also YouTube and other services with API in cases where organizations don’t have their own IR
can define new schemas to map to, roles for specified fields
make calls to API to harvest metadata
results can be filtered by specified roles/fields, and mapped
documentation available, but have added to documentation
Supplejack developed by Boost based in New Zealand and Philippines
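
As a rough illustration of the harvest-and-map idea in these notes, the Python sketch below parses a small OAI-PMH ListRecords response (Dublin Core) and maps each record onto a target schema. This is not Supplejack code (Supplejack itself is Rails-based); the sample XML, the FIELD_MAP dictionary, and the map_records function are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Standard OAI-PMH and Dublin Core namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical mapping from Dublin Core elements to fields in a target schema.
FIELD_MAP = {"title": "title", "creator": "author", "identifier": "source_url"}

# Made-up response standing in for what a harvester would fetch over HTTP.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Harbour view</dc:title>
          <dc:creator>Unknown photographer</dc:creator>
          <dc:identifier>https://example.org/items/1</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def map_records(xml_text, field_map=FIELD_MAP):
    """Map each harvested record's Dublin Core values onto the target fields."""
    root = ET.fromstring(xml_text)
    mapped = []
    for record in root.iterfind(".//oai:record", NS):
        item = {}
        for dc_name, target in field_map.items():
            values = [
                el.text.strip()
                for el in record.iterfind(f".//dc:{dc_name}", NS)
                if el.text
            ]
            if values:
                item[target] = values  # keep lists: DC elements are repeatable
        mapped.append(item)
    return mapped


for item in map_records(SAMPLE_RESPONSE):
    print(item)
```

Values are kept as lists because Dublin Core elements can repeat; a real harvester would also need to follow resumption tokens across pages of results.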

Open Source, Open Access, Open Data – Maryann Kempthorne

more and more often, vendors are telling us that libraries will be open in the future
open isn’t just open textbooks
feel like doing less open source than 5 years ago
open repositories coming with digital libraries
open educational resources
open source and open access libraries collections: e.g. nnels.ca
open data sets e.g. FRDR
AMICUS no longer open data; enclosure of open data; sold to OCLC
asking the questions: is the library open? who has open source software in their library ecology? open data?
digital asset management spaces

Linking video and audio in the Hansard – Mike Sinclair

Technical Operations Officer at Hansard, Legislative Assembly of BC
essentially a software developer
trying to streamline and improve, and make Hansard more open
Hansard = official report of debates (broadcast, webcast, transcriptions), legislative chamber and committee rooms, Chamber proceedings, Parliamentary Committee meetings
video search: pain points: 4-8 hour meetings, information dense, very popular, difficult problem
online text is easy to browse and scan, but no way to connect video and transcript
developed a new video search
MongoDB based search
easier to navigate
can search by keyword, which searches closed captioning and transcript, which is indexed with video
search results will return video and excerpt of transcript, and start playing
jwplayer as video player
did alignment using forced alignment
video will also take you to portion of transcript and vice versa
pipeline: start in Word; transform with XSLT; tagged XML is sent to a server running a forced alignment controller, which talks to the video server and gets the ID for the video, strips the tags from the text, fires up a cloud machine, goes through all the meetings to find the video, converts the audio to WAV, and uses Kaldi to timestamp the text based on probability of match, output as JSON with 95-97% accuracy; the result is transformed back to XML, then HTML
results in transcript XML and linked HTML
- going forward, want to do closer integration: scrolling transcript while watching video
- proper database underlying entire infrastructure
- single point of search for all Hansard products
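
To make the keyword-to-video link concrete, here is a minimal Python sketch. It assumes the forced alignment step yields one JSON entry per word with a start time in seconds; the notes do not describe the actual format used at Hansard, so the ALIGNMENT_JSON sample and the search_transcript function are hypothetical.

```python
import json

# Hypothetical alignment output: one entry per word with its start time (seconds).
ALIGNMENT_JSON = """[
  {"word": "The", "start": 12.0},
  {"word": "budget", "start": 12.4},
  {"word": "debate", "start": 12.9},
  {"word": "resumes", "start": 13.5},
  {"word": "after", "start": 14.1},
  {"word": "the", "start": 14.4},
  {"word": "budget", "start": 14.6},
  {"word": "vote", "start": 15.2}
]"""


def search_transcript(words, keyword, context=2):
    """Return (start_seconds, excerpt) for each occurrence of keyword."""
    hits = []
    needle = keyword.lower()
    for i, entry in enumerate(words):
        if entry["word"].lower().strip(".,;:!?") == needle:
            first = max(0, i - context)
            last = min(len(words), i + context + 1)
            excerpt = " ".join(w["word"] for w in words[first:last])
            hits.append((entry["start"], excerpt))
    return hits


words = json.loads(ALIGNMENT_JSON)
for start, excerpt in search_transcript(words, "budget"):
    # A player could seek to `start` so the video begins at the matched word.
    print(f"{start:7.1f}s  {excerpt}")
```

Each hit carries both a transcript excerpt and a time offset, which is the pairing a search result needs in order to show text and start the video at the right place.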

Getting Started Automating your EZproxy – James Fournie

always changing the config: change database subscriptions, vendors change, IT changes; in plain text files
title, URL, DJ; many commented out as disabled
some are pretty crazy
directory with old versions
why is this happening? who changed it, when, what, why
git or any other revision control system shows that
EZproxy has the docs directory, which controls the login screen
method 1: quickest way: turn it into git repository
check the ezproxy directory for changes
problem: a lot of private files, which don’t really change, but can use gitignore
some problems: git commits probably won’t be attributed to you unless you use --author; extra commands which could be forgotten; single point of failure; may accidentally overwrite changes
distributed repo using git server, which may be difficult to set up, but what if you accidentally commit something private? how to deploy to the ezproxy server?
updated EZproxy and RHEL
Ansible: no specialized software to install on the server, just on the “management” node; can run on linux or Mac, can control just about any server including Windows, Python-based, glorified programming language around YAML files and SSH, similar to Puppet or Chef
running it locally, but can build a pipeline to automate it; can send you a message on slack and some other cool things
public ecosystem e.g. GitLab, Gogs
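
Once the config is under revision control, small scripts can help review it. The Python sketch below lists each database stanza in a config file and whether it is commented out. The sample config is invented and heavily simplified (real stanzas carry many more directives), and list_stanzas is a hypothetical helper, not part of EZproxy or of the approach shown in the talk.

```python
# Invented, simplified sample in the style of an EZproxy config file.
SAMPLE_CONFIG = """\
Title Example Journals
URL https://journals.example.com/
DJ example.com

# Title Old Database
# URL http://olddb.example.org/
# DJ example.org
"""


def list_stanzas(config_text):
    """Return (title, url, enabled) for each Title/URL pair found."""
    stanzas = []
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        enabled = not line.startswith("#")   # '#' marks a commented-out line
        line = line.lstrip("#").strip()
        directive, _, value = line.partition(" ")
        key = directive.lower()
        if key in ("title", "t"):
            current = {"title": value.strip(), "url": None, "enabled": enabled}
            stanzas.append(current)
        elif key in ("url", "u") and current is not None and current["url"] is None:
            current["url"] = value.strip()
    return [(s["title"], s["url"], s["enabled"]) for s in stanzas]


for title, url, enabled in list_stanzas(SAMPLE_CONFIG):
    print("enabled " if enabled else "disabled", title, url)
```

One known limitation of this sketch: a prose comment that happens to begin with the word "Title" would be misread as a disabled stanza.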

Smallest Possible Library Online Tools “SPLOT” – Scott Leslie

ed tech world, built a SPLOT: acronym randomly generates different phrase
instead of building the swiss army knife that does nothing well trying to do everything
so built tiny tools that work well for a specific purpose
what would be the library equivalent? that would work for non-technical library people
ideas: online SIP tester, z39.50 tester, system control/semantic identifier API.
what little tool would make your life easier?

Break

Source: kuribo. (2008). Red Panda. https://www.flickr.com/photos/kuribo/2162319728/ CC BY-SA 2.0

We have detected spider/robot activity on our site – Calvin Mah

Slides

EZproxy license violation notices
who is able to deal with these emails before you get them?
detect this type of activity before you get the emails from the vendors
already knew before got the email from vendor
blocks whole EZproxy server from accessing that e-resource
cause? compromised login credentials; report to IT services, who force a password reset; repeat offenders get locked out and meet with the IT services security team; usually a keystroke logger or something similar
proactive monitoring script = tripwire; passive log checking for abusive usage
recognize pattern of attack: bulk downloading, many resources simultaneously
detecting the attack: check last 10,000 lines of log every 10 minutes using cron job, separate log by individual sessions, in each session: check for number of domains visited and number of times a different domain was visited
signals: number of domain switches, number of accesses per unit of time, % of 404s (15%+); some vendors have a tripwire token
block ID in ezproxy, kill session, email library staff
key takeaway: block ID in user.txt file, not IP addresses, have good relationship with IT services b/c need to be responsive
false positives: test it with emailing self only and no blocking, adjusted thresholds
posted on github
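
The per-session checks described above can be sketched in Python as follows. This is not the script from the talk: the three-field log format ("session status url"), the domain-switch threshold, and the flag_sessions function are assumptions for illustration. Only the 15% figure for 404s comes from the notes.

```python
from collections import defaultdict
from urllib.parse import urlsplit

MAX_DOMAIN_SWITCHES = 3   # hypothetical threshold, tune against real logs
MAX_404_RATIO = 0.15      # the notes mention 15%+ of 404s as a signal

# Made-up log lines in an assumed "session status url" format.
SAMPLE_LOG = [
    "sessA 200 https://db1.example.com/article/1",
    "sessA 200 https://db1.example.com/article/2",
    "sessB 200 https://db1.example.com/a",
    "sessB 200 https://db2.example.com/b",
    "sessB 404 https://db3.example.com/c",
    "sessB 200 https://db1.example.com/d",
    "sessB 200 https://db4.example.com/e",
    "sessB 404 https://db2.example.com/f",
]


def flag_sessions(log_lines, max_switches=MAX_DOMAIN_SWITCHES, max_404=MAX_404_RATIO):
    """Group hits by session and flag sessions that look like bulk downloading."""
    sessions = defaultdict(list)
    for line in log_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed format
        session_id, status, url = parts
        sessions[session_id].append((urlsplit(url).hostname, status))

    flagged = {}
    for session_id, hits in sessions.items():
        hosts = [host for host, _ in hits]
        # Count how often consecutive requests jump to a different domain.
        switches = sum(1 for a, b in zip(hosts, hosts[1:]) if a != b)
        ratio_404 = sum(1 for _, status in hits if status == "404") / len(hits)
        if switches > max_switches or ratio_404 >= max_404:
            flagged[session_id] = {
                "domains": len(set(hosts)),
                "switches": switches,
                "ratio_404": round(ratio_404, 2),
            }
    return flagged


print(flag_sessions(SAMPLE_LOG))
```

In a cron job, the input could be limited to the most recent lines with collections.deque(log_file, maxlen=10000). Running with email-only reporting first, as the notes suggest, gives a chance to adjust the thresholds before any blocking happens.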

Using Islandora to centralize access to distributed content – Mark Jordan

use cases: multiple systems e.g. Islandora and DSpace but want single search; want collection curated e.g. in Zotero that is imported into Islandora
ways to get content: harvest via OAI-PMH (typically use cron job to automate), allow adding batches of objects from CSV or RIS, add individual objects
for harvested items, will link back to the original site
batch load: using RIS file that was collected in Zotero, uploaded zip via webform
can also load individual files via webform
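
For the batch-load case, the Python sketch below shows roughly what reading a Zotero RIS export involves. It is not the Islandora module's code; the sample records and the parse_ris function are made up for illustration, and continuation lines are ignored for brevity.

```python
import re

# RIS lines look like "TI  - Some title": a two-character tag, two spaces, a hyphen.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")

# Made-up records in RIS format, as a reference manager might export them.
SAMPLE_RIS = """\
TY  - JOUR
AU  - Doe, Jane
AU  - Roe, Richard
TI  - An example article
PY  - 2017
UR  - https://example.org/article/1
ER  - 
TY  - BOOK
AU  - Smith, Alex
TI  - An example book
PY  - 2015
ER  - 
"""


def parse_ris(text):
    """Return one dict per record, mapping each RIS tag to a list of values."""
    records = []
    current = {}
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if not match:
            continue  # this sketch ignores blank and continuation lines
        tag, value = match.group(1), match.group(2).strip()
        if tag == "ER":          # ER marks the end of a record
            if current:
                records.append(current)
            current = {}
        else:
            current.setdefault(tag, []).append(value)
    return records


for record in parse_ris(SAMPLE_RIS):
    print(record["TY"][0], record["TI"][0], "; ".join(record.get("AU", [])))
```

Each parsed record could then become one object in a batch, with repeatable tags such as AU kept as lists.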

Systems Projects in Special Libraries – Charles Hogg

highly specialized resources and searches
system projects
scale is very small, limited client base
clients are highly demanding with full service, specialization, customization
funding budgets are small
specialization: very specific clientele with specific needs, demands, associations
may not be able to partner because may have conflicting mandates
Community News Alert Service: standard programmed queries from Infomart e.g. mention particular MLA, reviewed by staff
GALLOP Portal: single search to Legislative Library digital collections in Canada; collections hosted locally, indexed and searched through the portal
Newsletter: major requirement is metrics on use and engagement, using third party to track engagement is a challenge, wanted to offer more services highlighting materials and services, planning to move to Mailchimp
Ebooks and Kindle Services: create and manage Kindle accounts, use them to push PDFs (load to account), and add on-demand loading; not universally available; uses burner accounts so that no one knows what someone else has been reading
Reference statistics: created simple list in SharePoint linked to Access database
Client Tracking and Introduction: first impressions and first contact an essential part of client, new hires and staff moves not always communicated, library has its own system of identifying new users and tracing engagement

OCR tools for non-Latin text: Lessons from the Digital Himalaya Project – Rebecca Dickson

Digital Himalaya project: started at University of Cambridge; collection for materials about the Himalayan region
tie all the collections into one search as part of open collections at UBC
there are numerous types of formats including maps, etc.; presentation focus on text
difficulty is script, writing systems using Devanagari (15 languages) and Tibetan (3 languages)
using ABBYY FineReader was great for Latin-based scripts, but not for other scripts
British Library project: doing OCR for Bangla; best tool ended up being Google Docs
Google Docs can recognize the script and works well with layout
workflow we wanted: uploaded jpgs, OCR in Google Drive, spit out spreadsheet with transcript by row for each image
worked with script
issues: runtime limits mean scalability problems & truly horrifying workarounds; reliability: will this even be possible a year from now?
alternatives? Tesseract okay with Nepali but not others because not enough training data; Google Cloud Vision API most effective tool but proprietary
how accurate does OCR need to be to effectively support full text search, teaching & research, computational text analysis, etc.? how hard would it be to compile training data for unsupported languages like Tibetan?
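
One common way to put a number on "how accurate" is character error rate (CER): the edit distance between a hand-corrected reference and the OCR output, divided by the length of the reference. The Python sketch below is an illustration, not something presented in the talk; the function names and sample strings are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# Made-up example: the OCR output drops one vowel sign from a Devanagari word.
reference = "नेपाल"
ocr_output = "नपाल"
print(f"CER: {character_error_rate(reference, ocr_output):.2f}")
```

Note that this counts Unicode code points, so in Devanagari and Tibetan a single visual cluster can contribute several characters; counting grapheme clusters instead would give different numbers.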

Through the Looking Glass: A Parallel World of Print Collections Data – Jean Blackburn

using GreenGlass; normally don’t have access to this, but it is part of the shared print (last copy) project
tool was meant to support that kind of project
OCLC has acquired the company and service, loaded data from OCLC
SPAN (Shared Print Archive Network) Monograph Project by COPPUL
can compare what is rarely held elsewhere or unique to institution e.g. UA/UBC, rest of COPPUL
looking at using tool for local collection management, especially weeding
can see how many and which items have 0 use, and which other libraries have
have some interesting visualizations
main problem is that based on snapshot, not realtime

Teen Summer Reaching Challenge – Allison Trumble

teens do things and get points
was being done using google form, but not scalable
last year: WordPress with BadgeOS; but it stopped being supported
other solutions? ways to make this work?

End of Talks

Lunch time now and that’s all the talks. Hopefully see you next year!

Source: x@ray. (2007). pika. https://www.flickr.com/photos/xaray/1541286910/ CC BY-NC-ND 2.0
