digital collections – Page 3 – Learning (Lib)Tech

Code4lib Day 1: Lightning Talks

Cynthia Ng – RULA Bookfinder

Link to the full write-up

Julien Gibert – Turning a Solr Response into a RDF file

Theses.fr
Sorry, this went by me, plus I was busy running back to my seat

Bill Dueber – Datamart Report Generator at UMich

actually talking about spreadsheets
want to support data-driven decision making, but it’s boring, and canned reports tend not to do it
can end up in substring hell
solution: build data warehouse
took the Aleph Oracle COBOL store, removed the insanity, and put it in another Oracle database
funds and inventory reports now possible
running 20-25 reports a week
more than when we ran it by hand, and saves lots of time

Jonathan Rochkind – bento_search

Ruby on Rails gem
external search services e.g. Google books
federated e.g. primo, eds, ebscohost, scopus, worldcat, google books
can use whatever you want, just need to add it
can customize to have link resolver
github.com/jrochkind/sample_megasearch/
much more functionality

Masao Takaku – saveMLAK project for two years

came out of the effort to save museum, library, archive, kominkan (community centre) after the big earthquake
gather information of facilities in damaged area using a wiki
coordinate activities to rebuild
efforts are still continuing

Jon Stroop – Loris Image Server

define syntax for image access
can specify width/height, part of image, quality
Talk link
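
A rough sketch of what that syntax looks like, assuming it is the IIIF Image API URL pattern (identifier/region/size/rotation/quality.format) that Loris implements. The helper names and the example server are mine, and the accepted quality values differ between API versions, so treat "default" as a placeholder:

```python
from urllib.parse import quote


def region_box(x, y, w, h):
    """Part of the image, in pixels: 'x,y,w,h'."""
    return f"{x},{y},{w},{h}"


def size_box(width=None, height=None):
    """Requested size: 'w,' or ',h' or 'w,h'; 'full' when neither is given."""
    if width is None and height is None:
        return "full"
    w = "" if width is None else width
    h = "" if height is None else height
    return f"{w},{h}"


def image_url(base, identifier, region="full", size="full",
              rotation=0, quality="default", fmt="jpg"):
    """Assemble an image request URL in the IIIF-style segment order."""
    # The identifier is a single path segment, so slashes must be escaped.
    ident = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{ident}/{region}/{size}/{rotation}/{quality}.{fmt}"
```

So image_url(base, "a/b.jp2", region=region_box(0, 0, 100, 200), size=size_box(width=50)) asks for the top-left 100x200 pixels scaled to 50 pixels wide.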

Ross Singer – How are you managing copyright?

lazy attempt at crowd-sourced business development
copyright is complicated
there are standard licenses, but then there are a lot of exclusions and exceptions
still, roughly the same model
management already being done in some capacity by the universities
but in US/Canada there is fair dealing and fair use
Slides

Eric Nord – Candybars for Bugs

Harold B. Lee Library
worked on maps in library
pop up map
will give candy bar if found error
only had to give away 18
have a ‘report a problem’ with this item
builds on the idea of empowering the patron

Megan O’Neill Kudzia – Games for Pedagogy in the Library

working with faculty
a lot of interest, but no opportunity to talk about it
purchasing games on an ask basis
working out how to make accessible, in catalogue
licensing issues for PC/console games

Geoffrey Boushey – GEDI Reference App for InterLibrary Loan

General Electronic Document Interchange (ISO Standard)
used by Ariel
headers added to a file when sent from one institution to another
basis for making an easy to use tool so different ILL systems can communicate with each other
on Github

George Campbell – three.js: 3D Objects in the browser

used to have to use flash or flip through images
can now use interactive 3D graphics
can scale, add text/images, move

John Sarnowski – Audio Archiving with Full Text Search

ResCarta Toolkit
display and play audio
add metadata
use conversion tool
embeds into XML portion
final file can then be searched
words can be highlighted just like a text file

That’s the end of Day 1! Join us tomorrow. Time for a nap.

Code4Lib Day 1: Afternoon Notes

Practical Relevance Ranking for 10 million books

Tom Burton-West, University of Michigan Library

Search Challenges

multilingual, 400+ languages
OCR quality varies
very long documents
books are different from what they normally have

Relevance Ranking

how to score, weigh
default algorithm ranks very short documents very high
needed to tune/customize parameters
average document size is ~30 times larger
did prelim testing with Solr4 and didn’t see the same problem, but need more testing
dirty OCR complicates things, as well as language
occurrence of words in specific chapters vs. whole book – should we index parts of books?
similar issue with other objects e.g. bound journals, dictionaries & encyclopedias
difficulty too is inconsistent metadata, breakdowns of articles/chapters/etc. will be inconsistent
creating a testing plan and adding click logs
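
To see why short documents can float to the top, here is a minimal sketch using two textbook formulas: the classic Lucene-style length normalization (1/sqrt of the number of terms) and the BM25 term-frequency component, where the b parameter controls how much document length matters. The notes don't say which parameters were actually tuned, so the numbers and function names below are illustrative assumptions only:

```python
import math


def classic_score(tf, doc_len):
    """Classic TF-IDF style term score without the idf factor:
    sqrt(tf) multiplied by a length norm of 1/sqrt(doc_len)."""
    return math.sqrt(tf) * (1.0 / math.sqrt(doc_len))


def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 term-frequency component (idf left out).
    b=0 ignores document length; b=1 applies full length normalization."""
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * norm)


if __name__ == "__main__":
    # A 10-word record with one hit vs. a 100,000-word book with 500 hits.
    print("classic, short:", classic_score(1, 10))
    print("classic, book: ", classic_score(500, 100_000))
    # Lowering b softens the penalty on the long book.
    for b in (0.75, 0.25, 0.0):
        print("bm25 book, b =", b, bm25_tf(500, 100_000, 30_000, b=b))
```

With the classic formula the 10-word record outscores the book even though the book mentions the term 500 times, which is the kind of behaviour that needs tuning when every document is book-length.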

n Characters in Search of an Author

Jay Luker, IT Specialist, Smithsonian Astrophysics Data System
Slides

Goal of a search is to match user input to metadata. e.g. author names

Building the next generation of the ADS (ADS 2.0). Trying to increase recall without sacrificing precision.

Requirements

match UTF-8 e.g. matching ASCII version to versions with diacritics/markings
match more or less information e.g. first name initial but without triggering substring matching
need to work with hand curated synonyms e.g. pseudonyms, maiden/married name

Solving the Problem

normalization – strip out punctuation, rearrange name parts – based on whether a comma is entered
generate name part variations to whatever can be realistically expected
transliteration – use index introspection for list of synonyms
expand user queries at each step:
1. user searches
2. normalize
3. name part vars
4. transliteration
5. name parts vars of transliterated entries
6. curated synonyms
7. transliteration of anything added
8. name part variations to catch everything
9. assembled into large boolean query
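
The expansion steps above can be sketched in a few lines of Python. This is my own condensed approximation, not the ADS code: transliteration is replaced by simple diacritic stripping (the talk used index introspection instead), the synonym table is a hypothetical hand-made dict, and the field name "author" is assumed:

```python
import unicodedata


def strip_diacritics(text):
    """Stand-in for transliteration: drop combining marks (u-umlaut becomes u)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(name):
    """Return (last, [given names]); a comma means 'Last, First' order."""
    cleaned = name.replace(".", " ").strip()
    if "," in cleaned:
        last, _, rest = cleaned.partition(",")
    else:
        parts = cleaned.split()
        if not parts:
            return "", []
        last, rest = parts[-1], " ".join(parts[:-1])
    return " ".join(last.split()).lower(), [g.lower() for g in rest.split()]


def name_variations(last, given):
    """Full given names, first name only, all initials, first initial."""
    if not given:
        return [last]
    forms = [" ".join(given), given[0],
             " ".join(g[0] for g in given), given[0][0]]
    out = []
    for form in forms:
        variant = f"{last}, {form}"
        if variant not in out:
            out.append(variant)
    return out


def expand(user_input, synonyms=None):
    """Normalize, add curated synonyms, then vary and fold every name."""
    last, given = normalize(user_input)
    key = f"{last}, {' '.join(given)}" if given else last
    names = [(last, given)]
    names += [normalize(s) for s in (synonyms or {}).get(key, [])]
    variants = []
    for l, g in names:
        for v in name_variations(l, g):
            for candidate in (v, strip_diacritics(v)):
                if candidate not in variants:
                    variants.append(candidate)
    return variants


def to_boolean_query(variants, field="author"):
    """Exact phrase clauses, so an initial never turns into substring matching."""
    return " OR ".join(f'{field}:"{v}"' for v in variants)
```

For example, to_boolean_query(expand("Hans Müller")) produces one OR clause per spelling, with and without the umlaut, and with the first name as given or as an initial.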

Implementation

Python/JavaScript prototype
actual – Solr/Lucene

Evolving Towards a Consortium MARCR BIBFRAME Redis Datastore

Jeremy Nelson, Colorado College, jeremy.nelson@coloradocollege.edu
Sheila Yeh, University of Denver

Presentation Slides

I think this presentation speaks for itself.

Journal Article: Building a Library App Portfolio with Redis and Django

Hybrid Archival Collections Using Blacklight and Hydra

Adam Wead, Rock and Roll Hall of Fame and Museum
Presentation

The centre of everything is the Solr index. Blacklight puts everything into Solr. Library materials are easy enough, but archival collections use EAD, which describes many items (not just one item, as is typical of MARC).

Extended Blacklight to search EAD

index collections and single items from a collection
search results include books, entire collections, and items from collections

Digital Content

kept in Fedora – objects described using Rubys
use Hydra to manage the content in Fedora – manages RDF relationships
indexes into Solr
Need to relate Fedora content to its archival collection
content originates from sources in collection, and part of series
collection metadata already exists in Solr
create RDF representations of collections
Hydra queries Solr for collection metadata
creates objects for series, subseries, items

Issues

terrible Solr performance for series, 500+ items
no EAD “round tripping” – EAD can go into Solr, but not back out
currently 60% complete

Citation search in SOLR and second-order operators

Roman Chyla, Astrophysics Data System

Sorry, I don’t have notes for this. My brain is a bit fried by this point. Will post link when I get it.

Break Time

Breakout Sessions – reports will be available on the wiki

Next Up – lightning talks

Code4Lib Day 1: Morning Notes

Was trying to do too many things this morning, so sorry if the notes are not complete.

ARCHITECTING ScholarSphere: How We Built a Repository App That Doesn’t Feel Like Yet Another Janky Old Repository App

Dan Coughlin, Penn State University
Mike Giarlo, Penn State University

Presentation Slides

Trying to make it less confusing without exposing what system it’s using.

Simple Metadata Management

building metadata widgets
required: title, creator, keyword, rights
hide most non-required, have ‘more’ link for rest
limited to a set number, with tooltip
use jQuery autocomplete to suggest authority vocabulary

Dashboard

list of uploaded files
list of files have access to

Background Jobs

I got lost here talking about Resque jobs, sorry
has tracebacks for

Permissions Widget

set visibility
share with specific people

Version Control

can restore previous versions

Social Features

not in initial requirements
profile
contributions – “trophies”
activity – follow/following

8 months to develop, but spent 2 months just doing usability and responding to feedback.

Available on GitHub.

Pitfall! Working with Legacy Born Digital Materials in Special Collections

Donald Mennerich, The New York Public Library
Mark A. Matienzo, Yale University Library

Presentation Slides

Disk Images Process

process
stream – digitized analog magnetic signal
sector – stream decoded using algorithm(s)
object
physical – entirety of device
logical

Pitfalls

formats mean different things
communities of practice use different kinds of container formats
no single solution

Quest for Access

delivery format
what allowed to be done with material
need usability testing

Pitfalls

no ideal single model
decisions through the life cycle have an impact on access
capacities of institution

Collection

faculty papers – 162 floppies
goal: “recover” backup into something useful with minimal changes, repeatable process
Vito Russo Papers
goal: preserve original, describe and arrange, access

Conclusions

Time consuming
acknowledge researchers
need to work on communities of practice

Hacking the DPLA

Nate Hill, Chattanooga Public Library, nathanielhill AT gmail.com
Sam Klein, Wikipedia

A rally to get involved.

It’s an API, and a community.

Examples

Biodiversity Heritage Library
Minnesota Digital

Events

Digital Public Library of America Appfest
Launch at Boston Public Library April 18-19

Documentation and API Creator is on GitHub.

EAD without XSLT: A Practical New Approach to Web-Based Finding Aids

Trevor Thornton, New York Public Library

Side note: EAD = Encoded Archival Description — a way of describing archival collection.

Project Goals

enable multiple presentations of the same data
support dynamic web apps
cross-collection search with component-level specificity in results, and faceting on common access points

Archives Data Management Application

system using Ruby on Rails + MySQL + Solr
based on existing infrastructure
stick with what they know
didn’t need to do anything more complex
key functionality: data import, search index, API

Core Models

collection: collection as we know it, may also be single volume
component: some collections at item level, some not
description: some data has descriptive attributes
access term

I just felt like I was copying the slides at this point, so I’ll try to get a link to the presentation slides instead.

The Avalon Media System: A Next Generation Hydra Head For Audio and Video Delivery

Michael Klein, Senior Software Developer, Northwestern University Library, michael.klein AT northwestern DOT edu
Nathan Rogers, Programmer/Analyst, Indiana University

Demo!

can upload from computer, but also shared dropbox
forced to enter some metadata

Avalon

is a stack
media streaming server

Content Processing

with Matterhorn
workflow pipeline – batch/unattended ingest – uploading one delimited file with names of files that should be related
pingbacks for status updates
caching of key metadata/images

Stream Security

support different types of streaming (for desktop & mobile) and authentication
use authentication tokens
half is media ID, add another half, whole thing is auth token
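
The notes don't record how the second half of the token is derived. One common way to build a token of this shape, and purely an assumption on my part rather than a description of Avalon, is to append an expiry time and an HMAC signature to the media ID, so the streaming server can verify the token without a database lookup:

```python
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical secret shared by the app and the streaming server


def make_token(media_id, expires_at, secret=SECRET):
    """First part is the media ID; the rest is an expiry plus a signature over both."""
    payload = f"{media_id}:{expires_at}".encode()
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return f"{media_id}-{expires_at}-{signature}"


def check_token(token, now, secret=SECRET):
    """Return the media ID if the token is genuine and unexpired, else None."""
    try:
        # Split from the right so media IDs containing '-' still work.
        media_id, expires_at, signature = token.rsplit("-", 2)
        expires = int(expires_at)
    except ValueError:
        return None
    payload = f"{media_id}:{expires_at}".encode()
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    return media_id if now <= expires else None
```

Here expires_at and now are plain integer timestamps; anything that changes the media ID or the expiry invalidates the signature.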

Lunch Time

code4libTO December Meetup Talks

BagIt Profiles – @ruebot

directory of data
bag has what you’re bagging, data, contact email/name, organization information, profile identifier (JSON via a URI)
pull in all the field values
validate
wrote a spec and sent it to the digital curation community
can look up profiles in the registry
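
As a rough illustration of the "pull in all the field values, validate" step: a profile is a JSON document whose "Bag-Info" section lists tags, whether each is required, and optionally the allowed values. The sketch below checks only that section against an already-parsed bag-info dictionary; the function name and the example profile are made up for illustration, and a real validator checks much more (manifests, serialization, and so on):

```python
EXAMPLE_PROFILE = {  # hypothetical profile, normally fetched as JSON from a URI
    "Bag-Info": {
        "Source-Organization": {"required": True, "values": ["York University"]},
        "Contact-Email": {"required": True},
        "External-Description": {"required": False},
    }
}


def validate_bag_info(bag_info, profile):
    """Return a list of problems; an empty list means the tags conform."""
    errors = []
    for field, rules in profile.get("Bag-Info", {}).items():
        value = bag_info.get(field)
        if value is None:
            if rules.get("required", False):
                errors.append(f"{field}: required tag is missing")
            continue
        allowed = rules.get("values")
        if allowed and value not in allowed:
            errors.append(f"{field}: {value!r} is not one of {allowed}")
    return errors
```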

Okay, I got a little lost, but you can see more on github.

Internet Archive Torrent Collections (iaTorrent) – @ruebot

see demo

Bookfinder – @TheRealArty & Steven

I will write this up later probably as a separate blog post, or maybe journal article

TPL’s Web Services Architecture: Understanding the Big Picture – @waharnum

many different systems that don’t easily communicate, which needs specialized knowledge even to do basic tasks
address the challenges by translation, simplification, standardization
Three tiers: Front End Systems (requests to back end) / TPL Web Services (REST) / Back End Systems (responds to front end)
Example: TPL Website -> Account Web Services -> Symphony Web Services (Symphony) – and back
can add new features and functions
helps to solve the challenges mentioned
also helps with reusability e.g. in addition to website, build mobile-friendly website, iPhone App
Might end up with:
- Front End (Website, mobile, App)
- Middle Tier (Account Web Services, ebook Web Services, online payment web services)
- Back End (symphony, overdrive, payment gateway, accounting systems)
other benefits:
- increase ease of knowledge transfer about how our systems work
- follow modern best practice approach to building interoperating systems
- reduce cost and integration time
- reduce learning time for new staff or consultants
metrics: wish had resources
bolting together a lot of things, not using a lot of custom code

Ladder (aka MyTPL 2) – @mjsuhonos

wanted to solve problem: discovery layers suck
problems:
- not scalable
- inflexible
- read-only
- expensive
goals:
- better than open source options (VuFind, Blacklight)
- cheaper (than proprietary)
- scalable as WorldCat
design:
- schema-free/multi-schema (e.g. Dublin Core)
- horizontally scalable (multi-node)
- modern OSS components
- simple data model (RDF)
Features:
- hierarchical relations
- clustering/de-duplication
- versioning
- real-time import & indexing
- multi-thread/process
- responsive UI
- fully multilingual (18/10)
- dynamic faceting
- dynamic mapping modification
- digital content storage (coming soon)
built on linked data
not a discovery layer; it’s an integration platform

Heritage U of T – @ajmcalorum

News Announcement and Promotional Video
previously not centralized: hard drives, flickr, etc.
need central repository for tri-campus initiative with search & discovery, preservation, long-term access to content and metadata, support for multiple formats (e.g. images, books, documents, video, exhibits)
Drupal + Solr (search) + Fedora Commons (collection management, batch ingesting, metadata crosswalk, digital preservation) == islandora (digital asset management system)
pilot: 8 parent collections (by format, by campus)
exhibits in Drupal, not through islandora/fedora commons
modules: internet archive book reader (OCR on the fly), galleria, colorbox
official launch: 2 weeks ago

That’s it! Food and drinks time!

Code4lib Day 1: Lightning Talks Notes

Al Cornish – XTF in 300 seconds (Slides in PDF)

technology developed and maintained by California Digital Library
supports the search/display of digital collections (images, PDFs, etc)
fully open source platform, based on Apache Lucene search toolkit
Java framework, runs in Tomcat or Jetty servlet engine
extensive customization possible through XSLT programming
user and developer group communication through Google Groups
search interface running on Solr with facets
can output in RSS
has a debug mode

Makoto Okamoto – saveMLAK (English)

Aid activities for the Great East Japan Earthquake through collaboration via wiki
input from museum, library, archive, kominkan = MLAK
20,000 data of damaged area
Information about places, damages, and relief support
Key Lessons
- build synergy with twitter
- have offline meet ups & training

Andrew Nagy – Vendors Suck

vendors aren’t really that bad
used to think vendors suck, and that they don’t know how to solve libraries’ problems
but working for a vendor allows you to make a greater impact on higher education, more so than from one university (he started to work for Serials Solutions)
libraries’ problems aren’t really that unique
together with the vendor, a difference can be made
call your vendors and talk to the product managers
if they blow you off, you’ve selected the wrong vendor
sometimes vendor solutions can provide a better fit

Andreas Orphanides – Heat maps

The library needed grad students to teach instructional sessions, but how to set schedule when classes have a very inflexible schedule? So, he used the data of 2 semesters of instructional sessions using date and start time, but there were inconsistent start times and duration. The question is how best to visualize the data.

heatmap package from clickheat
time of day – x-dimension
day of the week – y-dimension
could see patterns in way that you can’t in histogram or bar graph
heat map needn’t be spatial
heat maps can compare histogram-like data along a single dimension or scatter-like plot data to look for high density areas
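
The data side of such a heat map is just a grid of counts. A minimal sketch, assuming each session is a (start datetime, duration in minutes) pair; it bins sessions into half-hour slots so that inconsistent start times and durations still land in comparable cells. It does not use the heatmap package mentioned above, and the function name is mine:

```python
from datetime import datetime


def session_grid(sessions, slot_minutes=30):
    """Rows are days of the week (Monday=0), columns are time-of-day slots.
    Each cell counts the sessions that overlap that slot."""
    slots_per_day = (24 * 60) // slot_minutes
    grid = [[0] * slots_per_day for _ in range(7)]
    for start, minutes in sessions:
        begin = start.hour * 60 + start.minute
        first = begin // slot_minutes
        last = (begin + minutes - 1) // slot_minutes
        # Sessions running past midnight are clipped to the starting day.
        for slot in range(first, min(last, slots_per_day - 1) + 1):
            grid[start.weekday()][slot] += 1
    return grid


if __name__ == "__main__":
    demo = [(datetime(2012, 2, 6, 9, 10), 50), (datetime(2012, 2, 6, 9, 30), 75)]
    monday = session_grid(demo)[0]
    print(monday[18:22])  # slots covering 9:00 to 11:00
```

The resulting grid can be handed to any plotting tool that colours cells by value.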

Gabriel Farrell – ElasticSearch

similar to Solr
goes across servers
e.g. Free103Point9

Nettie Lagace from NISO

National Information Standards Organization (NISO)
work internationally
want to know: What environment or conditions are needed to identify and solve interoperability problems?

Eric Larson – Finding images in book page images

A lot of free books exist out there, but you don’t have the time to read them all. What if you just wanted to look at the images? Because a lot of books have great images.

He used curl to pull all those images out, then used ImageMagick to process the images. The processing steps:

Convert to greyscale
Contrast boost x8
Convert image to 1px by height
Sharpen image
Heavy-handed grayscaling
Convert to text
Look for long continuous line of black to pull pages with images

Code is on github
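
For a feel of the idea without ImageMagick, here is my own pure-Python stand-in (not the code linked above). It assumes a page is already a list of rows of greyscale values from 0 (black) to 255 (white); collapsing each row to its average plays the role of the 1px-wide image, and the thresholds are guesses:

```python
def collapse_to_column(page):
    """Average each row, giving a 1px-wide column the height of the page."""
    return [sum(row) / len(row) for row in page]


def longest_dark_run(column, threshold=128):
    """Length of the longest run of consecutive rows darker than the threshold."""
    best = current = 0
    for value in column:
        current = current + 1 if value < threshold else 0
        best = max(best, current)
    return best


def looks_like_image_page(page, min_fraction=0.2, threshold=128):
    """Flag a page when a dark band covers at least min_fraction of its height.
    Lines of text average out light, so only dense picture areas form long runs."""
    if not page:
        return False
    column = collapse_to_column(page)
    return longest_dark_run(column, threshold) >= min_fraction * len(column)
```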

Adam Wead – Blacklight at the Rock Hall

went live, soft launch about a month ago
broken down to the item level
find bugs he doesn’t know about for a beer!

Kelley McGrath – Finding Movies with FRBR & Facets

users are looking for movies, either particular movie or genre/topic
libraries describe publications e.g. date by DVD, not by movie
users care about versions e.g. Blu-Ray, language
Try the prototyped catalog
Hit list provides one result per movie, can filter by different facets
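
A toy version of that hit list, to show the grouping idea: catalogue records (one per publication) are grouped under the movie they are versions of, and facets filter the versions. The record fields here (title, movie_year, format, language) are hypothetical, not the prototype's actual schema:

```python
from collections import defaultdict


def group_by_movie(records):
    """Group publication-level records under a (title, original year) key."""
    works = defaultdict(list)
    for rec in records:
        works[(rec["title"].casefold(), rec["movie_year"])].append(rec)
    return dict(works)


def hit_list(records, **facets):
    """One hit per movie, keeping only versions that match every facet."""
    hits = []
    for (_, year), versions in group_by_movie(records).items():
        matching = [v for v in versions
                    if all(v.get(name) == value for name, value in facets.items())]
        if matching:
            hits.append({"title": matching[0]["title"], "year": year,
                         "versions": matching})
    return hits
```

Calling hit_list(records, format="Blu-ray") would return each movie once, listing only its Blu-ray versions.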

Bohyun Kim – Web Usability in terms of words

don’t over-rely on the context
but context is still necessary for understanding e.g. “mobile” – means on the go, what they want on the go
sometimes there is no better term e.g. “Interlibrary Loan”
brevity will cost you “tour” vs. “online tour”
Time ran out, but check out the rest of the slides

Simon Spero – Restriction Classes, Bitches

OWL:

lets you define properties
control what the property can apply to
control the values the property can take
provides an easy way to do this
provides a really confusing way to do this

The easy way is usually wrong!

When you define the domain (what the property can apply to) and the range, the definition applies to every use of the property. An alternative is Attempto.

Cynthia Ng – Processing & ProcessingJS

Processing: open source visual programming language
Processing.js: related project to make processing available through web browsers without plugins
While both tend to focus on data visualizations, digital art, and (in the case of PJS) games, there are educational oriented applications.
Examples:
- Kanji Compositing – allows visual breakdown of Japanese kanji characters, interact with parts, and see children.
- Primer on Bezier Curves – scroll down to see interactive (i.e. if you move points, replots on the fly) and animated graphs.
Obvious use might be instructional materials, but how might we apply it in this context? What other applications might we think of in the information organization world?

Since doing the presentation, I have already gotten one response by Dan Chudnov who did a quick re-rendering of newspaper data from OCR data. Still thinking on (best) use in libraries and other information organizations.

It’s over for today, but if you’d like more, do remember that there is a livestream and you can follow on twitter, #c4l12 or IRC.