metadata – Page 3 – Learning (Lib)Tech

Code4Lib Day 3: Closing Keynote – Gordon Dunsire

Granularity in Library Linked Open Data

Slides

Fractals

self-similar at all levels of granularity
each circle represents things that look very similar (snowflake-looking pattern but of different sizes)
characteristic of fractals
cannot determine level: all levels are equal, some more equal than others

Multi-Faceted Granularity

What is described by a bibliographic record? or a single statement?
What is the level of description? How complete is it? e.g. AACR2
How detailed is the schema used? How dumb? – especially relevant right now. The more detailed, the higher level of granularity possible.
Semantic constraints? Unconstrained?

Resource Description Framework – Linked Data

Triple: This resource | has intended audience | Juvenile
Subject / Predicate / Object
does each of these parts have granularity?
higher/lower level, but should talk about coarse- or fine-grained granularity
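A minimal sketch of the triple above as plain Python data. The identifiers `ex:resource1` and `ex:hasIntendedAudience` are made up for illustration and do not come from any real vocabulary.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # what the statement is about
    predicate: str  # the aspect being described
    obj: str        # the value of that aspect


# Hypothetical identifiers, for illustration only.
statement = Triple("ex:resource1", "ex:hasIntendedAudience", "Juvenile")


def describe(t: Triple) -> str:
    """Render a triple in the 'subject | predicate | object' form used in the notes."""
    return f"{t.subject} | {t.predicate} | {t.obj}"


print(describe(statement))
```

Each of the three parts is just a value here; the granularity questions in the talk are about what those values refer to, not about the data structure.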

Subject: What is the Statement About?

we can focus on describing an article / resource / work, then think about coarser or finer granularity:
- coarser: consortium collection / RDF map
- library collection / digital collection
- super-aggregate journal title / journal index
- aggregate: issue / festschrift
- focus: describing an article / resource / work
- component: section / graphics / page
- sub-component: paragraph / markup
- finer: word rdf/xml
- uri / node

Predicate: What is the Aspect Described?

similar coarse/fine breakdown:
- membership category
- access to resource
- access to content
- suitability rating
- audience and usage
- audience
- audience of audio-visual material
diagram: possible audience map (partial) – unconstrained version to avoid collisions of isbd/dct/schema/rda/m21/frbrer
different links can be made while still retaining proper semantic links
currently constructing just one giant graph

What is the Aspect Described?

coarse to fine:
- resource record
- manifestation record
- title and s.o.r
- title statement
- title of manifestation
- title word
- first word of title
why do librarians need so many titles? Why not just use dublin core title and be done with it? Because we need it to do our work e.g. spine title to browse
title = string identifier
RDA: what to do with this? how do we apply these needs?
possible semantic map (partial) – I won’t even try to reproduce this
need to take into account domains and ranges
make it more difficult, but more powerful

Semantic Reasoning: The Sub-Property Ladder

this is where the graph becomes useful and property
machines can’t reason, so we’re defining the semantics such that we can give the rules to machines to process our data
semantic rule:
- if property1 sub-property of property2;
- then data triple: resource property1 “string”
- implies data triple: resource property2 “string”
otherwise, data triple remains the same
simple enough for computer to carry out
doesn’t matter how complex the map actually is, because it can still do it in matters of seconds
machine entailment: isbd: “has title proper” (finer) -> dct: “has title” (coarser)
might sound simple, but it is making a computer do inference
“dumb(ing)-up”: data has been lost, but it is still meaningful – moved from one schema to another
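The semantic rule above is small enough to sketch in a few lines of Python. This is an illustration, not the speaker's implementation: the prefixed names are shorthand labels rather than real URIs, `ex:book1` is a made-up resource, and the map assumes each property has at most one parent.

```python
# Sub-property map: finer property -> coarser property.
# Prefixed names are shorthand labels for illustration, not real URIs.
SUB_PROPERTY_OF = {
    "isbd:hasTitleProper": "dct:title",
}


def entail(triples, sub_property_of):
    """Return the input triples plus every triple implied by the sub-property map."""
    result = set(triples)
    frontier = list(triples)
    while frontier:
        subject, prop, value = frontier.pop()
        parent = sub_property_of.get(prop)
        if parent is None:
            continue  # no rule applies: the data triple remains the same
        implied = (subject, parent, value)
        if implied not in result:
            result.add(implied)
            frontier.append(implied)  # keep climbing the ladder
    return result


# Hypothetical data triple.
data = {("ex:book1", "isbd:hasTitleProper", "Granularity in Library Linked Open Data")}
for triple in sorted(entail(data, SUB_PROPERTY_OF)):
    print(triple)
```

The original fine-grained triple is kept and the coarser one is added, so nothing is thrown away in the source data; the loss only shows up for a consumer that reads the coarser property alone.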

Data Triples from Multiple Schema / Entailed from Sub-Property Map / From Property Domains

frbrer: “has intended audience” – “primary school”
isbd: “has note on use or audience” – “for ages 5-9”
rda: “intended audience (work)” – “for children aged 7-“
m21: “target audience” -> m21terms: -> “Juvenile”
definition attached to the vocabulary
also talking about granularity
can map the sub-property to top level of unc: “has note on use or audience”
“is a” frbrer: “work”, isbd: “resource”, rda: “work” – rda and frbr schema actually separate, not semantically linked even though vocabulary is similar and RDA is based on FRBR
once stabilized can be drawn from each other
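The "is a" statements entailed from property domains can be sketched the same way: if a property has a declared domain, any resource used as its subject is inferred to belong to that class. The property and class labels below are shorthand based on the notes, not real URIs, and `ex:item1` is a made-up resource.

```python
# Property -> domain class. Labels are shorthand for illustration only.
PROPERTY_DOMAIN = {
    "frbrer:hasIntendedAudience": "frbrer:Work",
    "isbd:hasNoteOnUseOrAudience": "isbd:Resource",
    "rda:intendedAudienceWork": "rda:Work",
}


def entail_types(triples, property_domain):
    """Infer (subject, 'rdf:type', class) for each triple whose property has a domain."""
    inferred = set()
    for subject, prop, _value in triples:
        domain = property_domain.get(prop)
        if domain is not None:
            inferred.add((subject, "rdf:type", domain))
    return inferred


# Hypothetical data: the same resource described with two schemas.
data = [
    ("ex:item1", "frbrer:hasIntendedAudience", "primary school"),
    ("ex:item1", "isbd:hasNoteOnUseOrAudience", "for ages 5-9"),
]
for triple in sorted(entail_types(data, PROPERTY_DOMAIN)):
    print(triple)
```

The same resource ends up typed as both a FRBRer work and an ISBD resource, which is the kind of cross-schema statement the talk says has to be handled with care when the schemas are not semantically linked.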

What is the Aspect Described?

coarser to finer:
- creator
- author
- screenwriting
- animation screenwriting
- children’s cartoon screenwriting
different controlled vocabulary
graph of RDA for author/creator/screenwriting in relation to work and agent
graph of same thing, but for dc for creator and agent
what is the semantic relationship between the dct creator and the rda creator?
marcrel author maps to dc contributor, not creator – what is the relationship between rda author and marcrel author?
decision from 2005, needs to be reappraised and reviewed
relationship between dc creator and dc contributor?
how does lcsh “screenwriters” fit?

Machine-Generated Granularity

also has issues
e.g. full-text indexing: down to the word level
BabelNet: A very large multilingual ontology
can get quite complex and granular

User-Generated Granularity

users can actually generate useful metadata
can use statistical methods to remove extremes and come back with consensus
going to cause granularity problems e.g. “OK for my kids (7 and 9)”, “Too childish for me (age 14)”
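One simple statistical method for removing extremes is a trimmed mean. The sketch below is an assumption about what such a method could look like, not something shown in the talk; the sample ages are invented.

```python
from statistics import mean


def trimmed_consensus(values, trim_fraction=0.2):
    """Drop the lowest and highest `trim_fraction` of values, then average the rest."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)


# Invented data: the minimum age users suggest for one resource.
suggested_ages = [1, 7, 8, 8, 9, 9, 10, 40]
print(trimmed_consensus(suggested_ages))
```

This only works once free-text comments such as the examples above have been reduced to a comparable value, which is exactly where the granularity problem comes in.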

KISS

keep it simple, stupid
keep it simple and stupid?
data model is very simple: triples!
in terms of complexity, actually very simple
but metadata content is complex
and therefore, resource discovery is complex
complex structure of application of simple rules, similar in the hard sciences and math
simplicity is elegance

AAA

Anyone can say anything about any thing
someone will say something about every thing
in every conceivable way
and then constrained linguistically

OWA

open world assumption: the absence of a statement is not a statement of non-existence

Will it get so granular that it becomes too complex?

And the rest is science

Break Time

tiny octopus — How Fine Can an Octopus be?

Code4Lib Day 2: Morning Notes

REST IS Your Mobile Strategy

Richard Wolf, University of Illinois at Chicago
Slides
Raw Material

REST

Representational State Transfer – a methodology developed alongside HTTP 1.1
clients request representations of resources from servers – typically a document
basically turns into an API

Examples

Twitter
New York Times – Congress API
Chicago Transit

iOS Development

need to know: Xcode, Objective-C, Cocoa Touch, Provisioning
Xcode – Apple’s development environment, like Visual Studio or Eclipse
Objective-C – strict superset of C
Cocoa Touch – frameworks to talk to iOS, similar to Ruby on Rails
UIKit
Provisioning Portal – annoying paperwork

OCLC Classify API

give it an item, tell you how it’s classified including call number

Process

Use Rested – a Mac tool that grabs API information and gives you the raw output
Xcode – create a new basic project
go from XML to Objective-C
use RestKit – maps XML to Objective-C
use PaintCode – create GUI
hire an artist
Apple App Review Process
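The talk did the XML-to-object step with RestKit in Objective-C. The same idea can be sketched in Python with the standard library. The XML below is a hypothetical response body invented for this example; the real Classify output uses a different layout, and the `Classification` class and its fields are assumptions.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical response body; the real Classify XML is laid out differently.
SAMPLE_XML = """
<classification>
  <title>Example Title</title>
  <author>Example, Author</author>
  <callNumber scheme="DDC">025.3</callNumber>
</classification>
"""


@dataclass
class Classification:
    title: str
    author: str
    scheme: str
    call_number: str


def parse_classification(xml_text: str) -> Classification:
    """Map the (hypothetical) XML document onto a plain Python object."""
    root = ET.fromstring(xml_text.strip())
    call = root.find("callNumber")
    if call is None:
        raise ValueError("response has no callNumber element")
    return Classification(
        title=root.findtext("title", default=""),
        author=root.findtext("author", default=""),
        scheme=call.get("scheme", ""),
        call_number=(call.text or "").strip(),
    )


print(parse_classification(SAMPLE_XML))
```

Once the response is an ordinary object, the GUI layer only deals with fields like `call_number` and never touches the raw XML.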

Librobot App

in the store by April 2nd

Why REST Matters – What are the Major Milestones

math formula – importance of technology can be determined by the amount of money involved in a court case
Personal Computers
The Internet
Mobility
Build an API – ask for ideas, and apps will come.

Take Away

you have interesting data
make an API
If we build it, they will come for it!

All Teh Metadatas Re-Revisited

Esme Cowles, UC San Diego Library
Matt Critchlow, UC San Diego Library
Bradley Westbrook, UC San Diego Library

Continues the story from last year.

Needs

more consistent data
maintain syntax of hierarchical subjects
improve support for complex objects
align more strongly with the digital libraries community – most important

User Stories

to understand requirements of administration and researchers

Sorry, I had to take a brain break and got a little lost. I’m also going to blame twitter and IRC for distracting me. Take a look at the slides:

Implementation

DAMS Repository – new version of lightweight repository, with APIs
Manager – separate and uses the API
Public Access System – new frontend in Hydra, great community

Timeline

release in summer
code now available on Github

Browser/Javascript Integration Testing with Ruby

Jessie Keck, Stanford University
Slides

The Problem

needed to test JavaScript
especially since using progressive enhancement
site works without JavaScript, then more features with JavaScript
mistakes happen e.g. killed navigation

Some Solution(s)

Watir == Web Application Testing in Ruby
built on watir-webdriver
Capybara – RSpec/Cucumber driver
ability to test responsive design
webkit integration available
personally like Capybara syntax (vs. Watir)
automated tests catch JavaScript bugs e.g. automatically test that facets are working

Gotchas

might want to use Watir Rails
transactional fixtures

Linked Open Communism: Better discovery through data dis- and re- aggregation

Corey A Harper, New York University

How to shut up about linked data and actually build something.

Context

context, the narrative of the library/archive
user stories

Death of Browse

discovery systems don’t use authority control
browse broken as UI design
rich data in authorities disconnected

The Idea Implemented

take EAD records, blow them up, take headings to match MARC records
pull people, corporations, and topics – pull info from DBpedia
index in Solr
slower than would like
On Github but is buggy

Solr Update

Erik Hatcher, LucidWorks

Sorry, but we don’t use Solr, and I think anyone really interested can look up information on the update, e.g. Apache Solr Release Notes

Check out the slides:

Break Time

Ask Anything

Who’s faculty? Half Faculty – small handful who care about being faculty

Planning and pilot phase of bringing together resources of all types. How to decide what to use and where to start?

Normalizing records from MARC to Solr. Want help with format.

How many have library degrees? 2/3 do, 1/3 don’t

Code4Lib – archiving our stuff? Talk to Mark/anarchivist. Mailing list is archived on the university server. Mirrored on post. Regular basis, dumped to media forward.

Goals of BIBFRAME? Replacing/superseding MARC.

First-timers to c4lcon? majority of room. All? < 20

Anyone collecting social media on behalf of user community or collection building purposes? Going to be a lightning talk tomorrow.

Anyone from a theology library? ~5 ppl

Want to know successful examples of gamification to support information literacy by @maccabeelevine e.g. Lemontree

Glossary of technology and stacks. On code4lib wiki? A guide for the perplexed. We can work on it.

Who is using graph databases? 2-3 ppl

Using DSpace? 25-30 FedoraCommons? 25-30 Hydra? 10-15

This conference working for you? Almost everyone.

What do people think of the wiki? One idea is to move it over to github code4lib account.

From the federal government? 3

Anyone interested in integrating TSM into Solr? anarchivist says he knows people

How many non-library degree people considering getting one? 2

How many have project managers as their title? ~12 Public? 5 Academic? rest

CodeRead – looking at PyMARC (sp?). Anyone else looking into this?

Didn’t get all the questions, but that’s most of them.

Lunch Time

Code4Lib Day 1: Afternoon Notes

Practical Relevance Ranking for 10 million books

Tom Burton-West, University of Michigan Library

Search Challenges

multilingual, 400+ languages
OCR quality varies
very long documents
books are different from what they normally have

Relevance Ranking

how to score, weigh
default algorithm ranks very short documents very high
needed to tune/customize parameters
average document size is ~30 times larger
did prelim testing with Solr4 and didn’t see the same problem, but need more testing
dirty OCR complicates things, as well as language
occurrence of words in specific chapters vs. whole book – should we index parts of books?
similar issue with other objects e.g. bound journals, dictionaries & encyclopedias
difficulty too is inconsistent metadata, breakdowns of articles/chapters/etc. will be inconsistent
creating a testing plan and adding click logs
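A rough illustration of why very short documents can outrank whole books. It assumes a simplified scoring in the style of classic Lucene TF-IDF (term frequency damped by a square root, multiplied by a length norm of 1/sqrt(length)), with idf and boosts left out. The numbers are invented and this is not the tuning HathiTrust actually applied.

```python
import math


def simple_score(term_freq: int, doc_length: int) -> float:
    """Simplified per-term score: sqrt(tf) * 1/sqrt(document length in terms)."""
    return math.sqrt(term_freq) * (1.0 / math.sqrt(doc_length))


# Invented examples: a pamphlet mentioning the term twice,
# and a long book mentioning it 200 times.
short_doc = simple_score(term_freq=2, doc_length=100)
long_doc = simple_score(term_freq=200, doc_length=100_000)

print("short document scores higher:", short_doc > long_doc)
```

With this kind of length normalization the book needs far more occurrences than it realistically has to catch up, which is why the length-related parameters have to be tuned for a collection of very long documents.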

n Characters in Search of an Author

Jay Luker, IT Specialist, Smithsonian Astrophysics Data System
Slides

Goal of a search is to match user input to metadata. e.g. author names

Building the next generation of the ADS 2.0. Trying to increase recall without sacrificing precision.

Requirements

match UTF-8 e.g. matching ASCII version to versions with diacritics/markings
match more or less information e.g. first name initial but without triggering substring matching
need to work with hand curated synonyms e.g. pseudonyms, maiden/married name

Solving the Problem

normalization – strip out punctuation, rearrange name parts – based on whether a comma is entered
generate name part variations to whatever can be realistically expected
transliteration – use index introspection for list of synonyms
expand user queries at each step:
1. user searches
2. normalize
3. name part vars
4. transliteration
5. name parts vars of transliterated entries
6. curated synonyms
7. transliteration of anything added
8. name part variations to catch everything
9. assembled into large boolean query
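A compressed sketch of the expansion steps above. It is not the ADS implementation: transliteration is replaced here by plain ASCII folding of diacritics (the talk used index introspection), the synonym table is a made-up example, and the `author` field name is an assumption.

```python
import re
import unicodedata

# Hypothetical hand-curated synonyms (e.g. maiden/married names).
SYNONYMS = {"smith, jane": ["jones, jane"]}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation (except the comma), return 'last, first' form."""
    name = unicodedata.normalize("NFC", name).lower()
    cleaned = re.sub(r"[^\w\s,]", " ", name)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    if "," in cleaned:
        last, _, first = cleaned.partition(",")
    else:
        # No comma entered: assume the final token is the surname.
        first, _, last = cleaned.rpartition(" ")
    return f"{last.strip()}, {first.strip()}".rstrip(", ")


def part_variations(name: str) -> set:
    """'smith, jane' should also match the initial form 'smith, j'."""
    last, _, first = name.partition(", ")
    variants = {name}
    if first:
        variants.add(f"{last}, {first[0]}")
    return variants


def ascii_fold(name: str) -> str:
    """Stand-in for transliteration: drop combining marks (e.g. u-umlaut becomes u)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand(user_input: str) -> list:
    names = part_variations(normalize(user_input))
    names |= {ascii_fold(n) for n in names}
    for n in list(names):
        for synonym in SYNONYMS.get(n, []):
            names |= part_variations(synonym)
    names |= {ascii_fold(n) for n in names}  # fold anything the synonyms added
    return sorted(names)


def to_query(names, field="author") -> str:
    """Assemble the variants into one large boolean OR query."""
    return " OR ".join(f'{field}:"{n}"' for n in names)


print(to_query(expand("Jane Smith")))
```

Matching whole quoted variants such as `"smith, j"` is what lets the initial form match without falling back to substring matching.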

Implementation

Python/JavaScript prototype
actual – Solr/Lucene

Evolving Towards a Consortium MARCR BIBFRAME Redis Datastore

Jeremy Nelson, Colorado College, jeremy.nelson@coloradocollege.edu
Sheila Yeh, University of Denver

Presentation Slides

I think this presentation speaks for itself.

Journal Article: Building a Library App Portfolio with Redis and Django

Hybrid Archival Collections Using Blacklight and Hydra

Adam Wead, Rock and Roll Hall of Fame and Museum
Presentation

The centre of everything is the Solr index; Blacklight puts everything into Solr. Library materials are easy enough, but archival collections use EAD, which describes many items (not just one item, as is typical of MARC).

Extended Blacklight to search EAD

index collections and single items from a collection
search results include books, entire collections, and items from collections

Digital Content

kept in Fedora – objects described using Rubys
use Hydra to manage the content in Fedora – manages RDF relationships
indexes into Solr
need to relate Fedora content to its archival collection
content originates from sources in collection, and part of series
collection metadata already exists in Solr
create RDF representations of collections
Hydra queries Solr for collection metadata
creates objects for series, subseries, items

Issues

terrible Solr performance for series, 500+ items
no EAD “round tripping” – EAD can go into Solr, but not back out
currently 60% complete

Citation search in SOLR and second-order operators

Roman Chyla, Astrophysics Data System

Sorry, I don’t have notes for this. My brain is a bit fried by this point. Will post link when I get it.

Break Time

Breakout Sessions – reports will be available on the wiki

Next Up – lightning talks

Code4Lib Day 1: Morning Notes

Was trying to do too many things this morning, so sorry if the notes are not complete.

ARCHITECTING ScholarSphere: How We Built a Repository App That Doesn’t Feel Like Yet Another Janky Old Repository App

Dan Coughlin, Penn State University
Mike Giarlo, Penn State University

Presentation Slides

Trying to make it less confusing without exposing what system it’s using.

Simple Metadata Management

building metadata widgets
required: title, creator, keyword, rights
hide most non-required, have ‘more’ link for rest
limited to a set number, with tooltip
use jQuery autocomplete to suggest authority vocabulary

Dashboard

list of uploaded files
list of files you have access to

Background Jobs

I got lost here talking about Resque jobs, sorry
has tracebacks for

Permissions Widget

set visibility
share with specific people

Version Control

can restore previous versions

Social Features

not in initial requirements
profile
contributions – “trophies”
activity – follow/following

8 months to develop, but spent 2 months just doing usability and responding to feedback.

Available on GitHub.

Pitfall! Working with Legacy Born Digital Materials in Special Collections

Donald Mennerich, The New York Public Library
Mark A. Matienzo, Yale University Library

Presentation Slides

Disk Images Process

process
stream – digitized analog magnetic signal
sector – stream decoded using algorithm(s)
object
physical – entirety of device
logical

Pitfalls

formats mean different things
communities of practice use different kinds of container formats
no single solution

Quest for Access

delivery format
what allowed to be done with material
need usability testing

Pitfalls

no ideal single model
decisions through the life cycle have an impact on access
capacities of institution

Collection

faculty papers – 162 floppies
goal: “recover” backup into something useful with minimal changes, repeatable process
Vito Russo Papers
goal: preserve original, describe and arrange, access

Conclusions

Time consuming
acknowledge researchers
need to work on communities of practice

Hacking the DPLA

Nate Hill, Chattanooga Public Library, nathanielhill AT gmail.com
Sam Klein, Wikipedia

A rally to get involved.

It’s an API, and a community.

Examples

Biodiversity Heritage Library
Minnesota Digital

Events

Digital Public Library of America Appfest
Launch at Boston Public Library April 18-19

Documentation and API Creator is on GitHub.

EAD without XSLT: A Practical New Approach to Web-Based Finding Aids

Trevor Thornton, New York Public Library

Side note: EAD = Encoded Archival Description, a way of describing archival collections.

Project Goals

enable multiple presentations of the same data
support dynamic web apps
cross-collection search with component-level specificity in results, and faceting on common access points

Archives Data Management Application

system using Ruby on Rails + MySQL + Solr
based on existing infrastructure
stick with what they know
didn’t need to do anything more complex
key functionality: data import, search index, API

Core Models

collection: collection as we know it, may also be single volume
component: some collections at item level, some not
description: some data has descriptive attributes
access term

I just felt like I was copying the slides at this point, so I’ll try to get a link to the presentation slides instead.

The Avalon Media System: A Next Generation Hydra Head For Audio and Video Delivery

Michael Klein, Senior Software Developer, Northwestern University Library, michael.klein AT northwestern DOT edu
Nathan Rogers, Programmer/Analyst, Indiana University

Demo!

can upload from computer, but also shared dropbox
forced to enter some metadata

Avalon

is a stack
media streaming server

Content Processing

with Matterhorn
workflow pipeline – batch/unattended ingest – uploading one delimited file with names of files that should be related
pingbacks for status updates
caching of key metadata/images

Stream Security

support different types of streaming (for desktop & mobile) and authentication
use authentication tokens
half is media ID, add another half, whole thing is auth token
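The notes do not say how the second half of the token is produced. One common way to build a token of that shape is to append an HMAC signature of the media ID, sketched below. The secret key, function names, and the `-` separator are all assumptions for illustration and not Avalon's actual scheme.

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"  # hypothetical secret shared with the streaming server


def make_token(media_id: str, key: bytes = SECRET_KEY) -> str:
    """First half: the media ID. Second half: an HMAC-SHA256 signature of it."""
    signature = hmac.new(key, media_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{media_id}-{signature}"


def check_token(token: str, key: bytes = SECRET_KEY) -> bool:
    """Recompute the signature for the media ID and compare in constant time."""
    media_id, sep, signature = token.rpartition("-")
    if not sep:
        return False  # no separator, so this cannot be one of our tokens
    expected = hmac.new(key, media_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8"))


token = make_token("media:123")  # hypothetical media ID
print(check_token(token))
```

A token built this way can be checked by the streaming server without a database lookup, as long as it holds the same key.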

Lunch Time

code4libTO December Meetup Talks

BagIt Profiles – @ruebot

directory of data
bag has what you’re bagging, data, contact email/name, organization information, profile identifier (JSON via a URI)
pull in all the field values
validate
wrote a spec and sent it to the digital curation community
can look up profiles in the registry
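A simplified sketch of the validate step: check a bag's bag-info fields against a profile. The profile below is a cut-down, hypothetical example embedded as a string; a real profile is fetched as JSON from the profile identifier URI and covers more than the `Bag-Info` section handled here.

```python
import json

# Cut-down, hypothetical profile for illustration only.
PROFILE_JSON = """
{
  "Bag-Info": {
    "Source-Organization": {"required": true, "values": ["Example University"]},
    "Contact-Email": {"required": true},
    "External-Description": {"required": false}
  }
}
"""


def validate_bag_info(bag_info: dict, profile: dict) -> list:
    """Return a list of problems; an empty list means the bag-info passes."""
    errors = []
    for field, rules in profile.get("Bag-Info", {}).items():
        value = bag_info.get(field)
        if value is None:
            if rules.get("required", False):
                errors.append(f"missing required field: {field}")
            continue
        allowed = rules.get("values")
        if allowed and value not in allowed:
            errors.append(f"unexpected value for {field}: {value!r}")
    return errors


profile = json.loads(PROFILE_JSON)
bag_info = {"Source-Organization": "Example University"}  # Contact-Email is missing
print(validate_bag_info(bag_info, profile))
```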

Okay, I got a little lost, but you can see more on github.

Internet Archive Torrent Collections (iaTorrent) – @ruebot

see demo

Bookfinder – @TheRealArty & Steven

I will write this up later probably as a separate blog post, or maybe journal article

TPL’s Web Services Architecture: Understanding the Big Picture – @waharnum

many different systems that don’t easily communicate, which needs specialized knowledge even to do basic tasks
address the challenges by translation, simplification, standardization
Three tiers: Front End Systems (requests to back end) / TPL Web Services (REST) / Back End Systems (responds to front end)
Example: TPL Website -> Account Web Services -> Symphony Web Services (Symphony) – and back
can add new features and functions
helps to solve the challenges mentioned
also helps with reusability e.g. in addition to website, build mobile-friendly website, iPhone App
Might end up with:
- Front End (Website, mobile, App)
- Middle Tier (Account Web Services, ebook Web Services, online payment web services)
- Back End (symphony, overdrive, payment gateway, accounting systems)
other benefits:
- increase ease of knowledge transfer about how our systems work
- follow modern best practice approach to building interoperating systems
- reduce cost and integration time
- reduce learning time for new staff or consultants
metrics: wish had resources
bolting together a lot of things, not using a lot of custom code
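The three-tier idea described above can be sketched with two small classes. Everything here is hypothetical: the class names, the record format returned by the back end, and the fields in the simplified account are invented to show the translation role of the middle tier, not TPL's actual services.

```python
class SymphonyBackend:
    """Stand-in for a back-end system with its own awkward record format."""

    def fetch_patron(self, barcode: str) -> dict:
        # Invented data in an invented vendor-style layout.
        return {"USER_ID": barcode, "NAME": "DOE, JANE", "CHARGES": "3"}


class AccountService:
    """Middle tier: translates and simplifies back-end data into one standard shape."""

    def __init__(self, backend):
        self.backend = backend

    def get_account(self, barcode: str) -> dict:
        raw = self.backend.fetch_patron(barcode)
        last, _, first = raw["NAME"].partition(", ")
        return {
            "id": raw["USER_ID"],
            "name": f"{first} {last}".title(),
            "items_out": int(raw["CHARGES"]),
        }


def render_for_website(account: dict) -> str:
    """A front end only needs to understand the middle tier's simplified shape."""
    return f"{account['name']} has {account['items_out']} items checked out"


service = AccountService(SymphonyBackend())
print(render_for_website(service.get_account("12345")))
```

A mobile site or an app could call the same `AccountService` without knowing anything about the back-end format, which is the reusability point made above.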

Ladder (aka MyTPL 2) – @mjsuhonos

wanted to solve problem: discovery layers suck
problems:
- not scalable
- inflexible
- read-only
- expensive
goals:
- better than open source options (VuFind, Blacklight)
- cheaper (than proprietary)
- scalable as WorldCat
design:
- schema-free/multi-schema (e.g. Dublin Core)
- horizontally scalable (multi-node)
- modern OSS components
- simple data model (RDF)
Features:
- hierarchical relations
- clustering/de-duplication
- versioning
- real-time import & indexing
- multi-thread/process
- responsive UI
- fully multilingual (18/10)
- dynamic faceting
- dynamic mapping modification
- digital content storage (coming soon)
built on linked data
not a discovery layer; it’s an integration platform

Heritage U of T – @ajmcalorum

News Announcement and Promotional Video
previously not centralized: hard drives, flickr, etc.
need central repository for tri-campus initiative with search & discovery, preservation, long-term access to content and metadata, support for multiple formats (e.g. images, books, documents, video, exhibits)
Drupal + Solr (search) + Fedora Commons (collection management, batch ingesting, metadata crosswalk, digital preservation) == islandora (digital asset management system)
pilot: 8 parent collections (by format, by campus)
exhibits in Drupal, not through islandora/fedora commons
modules: internet archive book reader (OCR on the fly), galleria, colorbox
official launch: 2 weeks ago

That’s it! Food and drinks time!