Code4libBC Day 2: Lightning Talks Notes

Notes from day 2 of Code4lib BC talks.

Supplejack Triclops Architecture – Daniel Sifton

  • harvester and parser
  • Mongo, Rails, Solr, built on SFU IT clusters
  • aggregates metadata using OAI, etc.; libraries, galleries, museums, etc. but also YouTube and other services with API in cases where organizations don’t have their own IR
  • can define new schemas to map to, roles for specified fields
  • make calls to API to harvest metadata
  • results can be filtered by specified roles/fields, and mapped
  • documentation available, but have added to documentation
  • Supplejack developed by Boost based in New Zealand and Philippines
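The harvest-and-map step noted above can be sketched roughly as follows. This is not Supplejack's own code (Supplejack is a Rails application); it is a minimal Python illustration that assumes an OAI-PMH response carrying Dublin Core records. The `FIELD_MAP` names, the sample record, and `parse_records` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard OAI-PMH and Dublin Core namespace URIs.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical mapping: target schema field -> source Dublin Core element.
FIELD_MAP = {
    "title": "dc:title",
    "creator": "dc:creator",
    "landing_url": "dc:identifier",
}

# Made-up sample response so the sketch runs without network access.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample map of the Fraser Valley</dc:title>
          <dc:creator>Example, A.</dc:creator>
          <dc:identifier>https://example.org/items/1</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text, field_map=FIELD_MAP):
    """Map each harvested record onto the target schema fields."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        mapped = {
            "source_id": rec.findtext("oai:header/oai:identifier", namespaces=NS)
        }
        for target, source in field_map.items():
            # Keep every value, since Dublin Core elements are repeatable.
            mapped[target] = [el.text for el in rec.iterfind(f".//{source}", NS)]
        records.append(mapped)
    return records


if __name__ == "__main__":
    for record in parse_records(SAMPLE):
        print(record)
```

Changing the target schema then only means editing the mapping table, which is roughly the idea behind defining new schemas to map to.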

Open Source, Open Access, Open Data – Maryann Kempthorne

  • more and more often, vendors are telling us that libraries will be open in the future
  • open isn’t just open textbooks
  • feel like doing less open source than 5 years ago
  • open repositories coming with digital libraries
  • open educational resources
  • open source and open access libraries collections: e.g. nnels.ca
  • open data sets e.g. FRDR
  • AMICUS no longer open data; enclosure of open data; sold to OCLC
  • asking the questions: is the library open? who has open source software in their library ecology? open data?
  • digital asset management spaces

Linking video and audio in the Hansard – Mike Sinclair

  • Technical Operations Officer at Hansard, Legislative Assembly of BC
  • essentially a software developer
  • trying to streamline and improve, and make Hansard more open
  • Hansard = official report of debates (broadcast, webcast, transcriptions), legislative chamber and committee rooms, Chamber proceedings, Parliamentary Committee meetings
  • video search: pain points: 4-8 hour meetings, information dense, very popular, difficult problem
  • online text is easy to browse and scan, but no way to connect video and transcript
  • developed a new video search
  • MongoDB based search
  • easier to navigate
  • can search by keyword, which searches closed captioning and transcript, which is indexed with video
  • search results will return video and excerpt of transcript, and start playing
  • jwplayer as video player
  • did alignment using forced alignment
  • video will also take you to portion of transcript and vice versa
  • workflow: start in Word; transform with XSLT into tagged XML, which is sent to a server running a forced alignment controller; the controller talks to the video server to get the ID for the video, takes the text without tags, fires up a cloud machine, goes through all the meetings to find the video, converts the audio to WAV, and uses Kaldi to timestamp the text based on probability of match; the output is JSON with 95-97% accuracy, which is transformed back to XML and then HTML
  • results in transcript XML and linked HTML
    • going forward, want to do closer integration: scrolling transcript while watching video
    • proper database underlying entire infrastructure
    • single point of search for all Hansard products
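Once each transcript word has a timestamp, keyword search can return both an excerpt and a point in the video to start playing. The sketch below assumes a simplified, hypothetical JSON shape (a list of words with start times in seconds); the aligner's real output format is not described in the talk, and `search_transcript` is an illustrative name.

```python
import json

# Hypothetical alignment output: one entry per word with its start time.
ALIGNMENT_JSON = """[
  {"word": "The", "start": 10.0},
  {"word": "member", "start": 10.4},
  {"word": "for", "start": 10.9},
  {"word": "Surrey", "start": 11.1},
  {"word": "asked", "start": 11.8},
  {"word": "about", "start": 12.2},
  {"word": "ferries", "start": 12.6},
  {"word": "and", "start": 13.3},
  {"word": "ferries", "start": 13.5},
  {"word": "funding.", "start": 14.1}
]"""


def search_transcript(words, keyword, context=3):
    """Return the start time and surrounding excerpt for each keyword hit."""
    needle = keyword.lower()
    hits = []
    for i, entry in enumerate(words):
        # Ignore trailing punctuation when comparing words.
        if entry["word"].lower().strip(".,;:?!") == needle:
            first = max(0, i - context)
            excerpt = " ".join(w["word"] for w in words[first:i + context + 1])
            hits.append({"start": entry["start"], "excerpt": excerpt})
    return hits


if __name__ == "__main__":
    for hit in search_transcript(json.loads(ALIGNMENT_JSON), "ferries", context=2):
        print(hit)
```

A video player could seek to each hit's `start` value, and the same timestamps could drive a transcript that scrolls along with the video.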

Getting Started Automating your EZproxy – James Fournie

  • always changing the config: change database subscriptions, vendors change, IT changes; in plain text files
  • title, URL, DJ; many commented out as disabled
  • some are pretty crazy
  • directory with old versions
  • why is this happening? who changed it, when, what, why
  • git or any other revision control system shows that
  • EZproxy has the docs directory, which controls the login screen
  • method 1: quickest way: turn it into git repository
  • check the ezproxy directory for changes
  • problem: a lot of private files, which don’t really change, but can use gitignore
  • some problems: git commits probably won’t be attributed to you unless you use --author; extra commands which could be forgotten, single point of failure, may accidentally override
  • distributed repo using git server, which may be difficult to set up, but what if you accidentally commit something private? how to deploy to the ezproxy server?
  • updated EZproxy and RHEL
  • Ansible: no specialized software to install on the server, just on the “management” node; can run on linux or Mac, can control just about any server including Windows, Python-based, glorified programming language around YAML files and SSH, similar to Puppet or Chef
  • running it locally, but can build a pipeline to automate it; can send you a message on slack and some other cool things
  • public ecosystem e.g. GitLab, GOGS
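The title/URL/DJ stanzas mentioned above, including the ones commented out as disabled, are easy to audit with a small script before committing a change. This is a simplified sketch: it only understands the `Title`, `URL` and `DomainJavaScript` directives (and their `T`, `U`, `DJ` short forms), and it assumes a commented-out line is a disabled directive. The sample config and `parse_stanzas` are made up for illustration.

```python
# Made-up sample in the style of an EZproxy config.txt.
SAMPLE_CONFIG = """\
Title Example Journals
URL https://journals.example.com
DJ example.com

# Title Old Database
# URL https://old.example.org
# DJ example.org
"""


def parse_stanzas(text):
    """Group Title/URL/DJ lines into stanzas and note which are disabled."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        disabled = line.startswith("#")
        if disabled:
            line = line.lstrip("#").strip()
        if not line:
            continue
        directive, _, value = line.partition(" ")
        key = directive.lower()
        if key in ("t", "title"):
            # A Title line starts a new stanza.
            current = {"title": value, "url": None, "domains": [], "disabled": disabled}
            stanzas.append(current)
        elif current is not None and key in ("u", "url"):
            current["url"] = value
        elif current is not None and key in ("dj", "domainjavascript"):
            current["domains"].append(value)
    return stanzas


if __name__ == "__main__":
    for stanza in parse_stanzas(SAMPLE_CONFIG):
        state = "disabled" if stanza["disabled"] else "active"
        print(f"{stanza['title']}: {stanza['url']} ({state})")
```

A check like this could run as a step in an automated pipeline, for example to warn when an active stanza has no URL.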

Smallest Possible Library Online Tools “SPLOT” – Scott Leslie

  • ed tech world, built a SPLOT: acronym randomly generates different phrase
  • instead of building the swiss army knife that does nothing well trying to do everything
  • so built tiny tools that work well for a specific purpose
  • what would be the library equivalent? that would work for non-technical library people
  • ideas: online SIP tester, z39.50 tester, system control/semantic identifier API.
  • what little tool would make your life easier?

Break

red panda eating snack
Source: kuribo. (2008). Red Panda. https://www.flickr.com/photos/kuribo/2162319728/ CC BY-SA 2.0

We have detected spider/robot activity on our site – Calvin Mah

Slides

  • EZproxy license violation notices
  • who is able to deal with these emails before you get them?
  • detect this type of activity before you get the emails from the vendors
  • already knew before got the email from vendor
  • blocks whole EZproxy server from accessing that e-resource
  • cause? compromised login credentials, report to IT services: force password reset; repeat offenders get locked out and meet with IT services security team, usually keystroke logger or something similar
  • proactive monitoring script = tripwire; passive log checking for abusive usage
  • recognize pattern of attack: bulk downloading, many resources simultaneously
  • detecting the attack: check last 10,000 lines of log every 10 minutes using cron job, separate log by individual sessions, in each session: check for number of domains visited and number of times a different domain was visited
  • thresholds: number of domain switches, number of accesses per unit of time, 404 rate of 15% or more; some vendors have a tripwire token
  • block ID in ezproxy, kill session, email library staff
  • key takeaway: block ID in user.txt file, not IP addresses, have good relationship with IT services b/c need to be responsive
  • false positives: test it with emailing self only and no blocking, adjusted thresholds
  • posted on github
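The per-session checks described above can be sketched as follows. This is not the script posted on GitHub; it is a minimal illustration that assumes the log lines have already been parsed into (session, domain, HTTP status) tuples. Only the 15% threshold for 404s comes from the talk; the other thresholds, the sample data, and `flag_sessions` are hypothetical.

```python
from collections import defaultdict

# Hypothetical pre-parsed log entries: (session id, domain, HTTP status).
SAMPLE_ENTRIES = [
    ("a", "journals.example.com", 200),
    ("a", "journals.example.com", 200),
    ("a", "journals.example.com", 200),
    ("a", "journals.example.com", 200),
    ("b", "one.example.com", 200),
    ("b", "two.example.com", 404),
    ("b", "three.example.com", 200),
    ("b", "four.example.com", 200),
    ("b", "one.example.com", 404),
    ("b", "two.example.com", 200),
    ("b", "three.example.com", 200),
    ("b", "four.example.com", 200),
]


def flag_sessions(entries, max_domains=3, max_switches=5, max_404_rate=0.15):
    """Return {session: [reasons]} for sessions that look like bulk downloading."""
    by_session = defaultdict(list)
    for session, domain, status in entries:
        by_session[session].append((domain, status))

    flagged = {}
    for session, hits in by_session.items():
        domains = [domain for domain, _ in hits]
        # Count how often consecutive requests move to a different domain.
        switches = sum(1 for a, b in zip(domains, domains[1:]) if a != b)
        rate_404 = sum(1 for _, status in hits if status == 404) / len(hits)

        reasons = []
        if len(set(domains)) > max_domains:
            reasons.append("many domains")
        if switches > max_switches:
            reasons.append("frequent domain switching")
        if rate_404 >= max_404_rate:
            reasons.append("high 404 rate")
        if reasons:
            flagged[session] = reasons
    return flagged


if __name__ == "__main__":
    print(flag_sessions(SAMPLE_ENTRIES))
```

Following the advice on false positives, a first deployment would only email the results and leave blocking switched off until the thresholds have been tuned.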

Using Islandora to centralize access to distributed content – Mark Jordan

  • use cases: multiple systems e.g. Islandora and DSpace but want single search; want collection curated e.g. in Zotero that is imported into Islandora
  • ways to get content: harvest via OAI-PMH (typically use cron job to automate), allow adding batches of objects from CSV or RIS, add individual objects
  • for harvested items, will link back to the original site
  • batch load: using RIS file that was collected in Zotero, uploaded zip via webform
  • can also load individual files via webform
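For the batch-load path, an RIS export from Zotero has to be split into one record per object before anything can be created. The sketch below is not the Islandora module's code; it is a minimal Python illustration of reading RIS tag lines (two-character tag, two spaces, hyphen, value, with `ER` closing each record). The sample records and `parse_ris` are made up.

```python
import re

# An RIS line looks like "TY  - JOUR": two-character tag, two spaces, hyphen.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")

# Made-up sample export with two records.
SAMPLE_RIS = """TY  - JOUR
AU  - Example, Alice
AU  - Sample, Bob
TI  - A hypothetical article
PY  - 2017
ER  -
TY  - BOOK
TI  - A hypothetical book
ER  -
"""


def parse_ris(text):
    """Split RIS text into records, each a dict of tag -> list of values."""
    records = []
    current = {}
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if not match:
            continue  # Skip blank lines and anything that is not a tag line.
        tag, value = match.group(1), match.group(2).strip()
        if tag == "ER":
            # End of record: store it and start a new one.
            if current:
                records.append(current)
            current = {}
        else:
            # Tags such as AU repeat, so keep every value in a list.
            current.setdefault(tag, []).append(value)
    return records


if __name__ == "__main__":
    for record in parse_ris(SAMPLE_RIS):
        print(record.get("TY"), record.get("TI"))
```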

Systems Projects in Special Libraries – Charles Hogg

  • highly specialized resources and searches
  • system projects
  • scale is very small, limited client base
  • clients are highly demanding with full service, specialization, customization
  • funding budgets are small
  • specialization: very specific clientele with specific needs, demands, associations
  • may not be able to partner because may have conflicting mandates
  • Community News Alert Service: standard programmed queries from Infomart e.g. mention particular MLA, reviewed by staff
  • GALLOP Portal: single search to Legislative Library digital collections in Canada; collections hosted locally, indexed and searched through the portal
  • Newsletter: major requirement is metrics on use and engagement, using third party to track engagement is a challenge, wanted to offer more services highlighting materials and services, planning to move to mailchimp
  • Ebooks and Kindle Services: create and manage kindle accounts, use it to push PDFs (load to account), and add on-demand loading; not universally available; uses burner accounts so that no one else knows what someone has been reading
  • Reference statistics: created simple list in SharePoint link to Access database
  • Client Tracking and Introduction: first impressions and first contact an essential part of client, new hires and staff moves not always communicated, library has its own system of identifying new users and tracing engagement

OCR tools for non-Latin text: Lessons from the Digital Himalaya Project – Rebecca Dickson

  • Digital Himalaya project: started at University of Cambridge; collection for materials about the Himalayan region
  • tie all the collections into one search as part of open collections at UBC
  • there are numerous types of formats including maps, etc.; presentation focus on text
  • difficulty is script, writing systems using Devanagari (15 languages) and Tibetan (3 languages)
  • using ABBYY FineReader was great for Latin based scripts, but not for other scripts
  • British Library project: doing OCR for Bangla; best tool ended up being Google Docs
  • Google Docs can recognize the script and works well with layout
  • workflow we wanted: uploaded jpgs, OCR in Google Drive, spit out spreadsheet with transcript by row for each image
  • worked with script
  • issues: runtime limits mean scalability problems & truly horrifying workarounds; reliability: will this even be possible a year from now?
  • alternatives? Tesseract okay with Nepali but not others because not enough training data; Google Cloud Vision API most effective tool but proprietary
  • how accurate does OCR need to be to effectively support full text search, teaching & research, computational text analysis, etc.? how hard would it be to compile training data for unsupported languages like Tibetan?
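One common way to put a number on "how accurate" is character error rate (CER): the edit distance between a hand-checked reference transcription and the OCR output, divided by the length of the reference. The talk did not say how accuracy was measured, so this is only a sketch of that general metric. Note the assumption that Python strings are compared by Unicode code point, so a Devanagari vowel sign counts as its own character rather than as part of a visual syllable.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, by Unicode code point."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    # The OCR output here is missing one vowel sign from the reference.
    print(character_error_rate("नेपाल", "नेपल"))
```

Scoring a small hand-corrected sample per language this way would give a basis for comparing tools against each other.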

Through the Looking Glass: A Parallel World of Print Collections Data – Jean Blackburn

  • using GreenGlass; normally don’t have access to this, but do as part of a shared print (last copy) project
  • tool was meant to support that kind of project
  • OCLC has acquired the company and service, loaded data from OCLC
  • SPAN (Shared Print Archive Network) Monograph Project by COPPUL
  • can compare what is rarely held elsewhere or unique to institution e.g. UA/UBC, rest of COPPUL
  • looking at using tool for local collection management, especially weeding
  • can see how many and which items have 0 use, and which other libraries hold them
  • have some interesting visualizations
  • main problem is that based on snapshot, not realtime

Teen Summer Reading Challenge – Allison Trumble

  • teens do things and get points
  • was being done using google form, but not scalable
  • last year: WordPress with badge OS; but stopped being supported
  • other solutions? ways to make this work?

End of Talks

Lunch time now and that’s all the talks. Hopefully see you next year!

pika
Source: x@ray. (2007). pika. https://www.flickr.com/photos/xaray/1541286910/ CC BY-NC-ND 2.0