digital repository – Learning (Lib)Tech

Access 2014: Day 1 Notes

The notes from Day 1, other than the opening keynote.

Access 2013: It’s dangerous to go alone! How about we do this!?

Steve Marks, Nick Ruest, Graham Stewart & Amaz Taufique

Everyone has many of the same needs when looking at digital collections: digitization of collections, mixed types of content, preservation, etc.

Code4Lib Day 1: Morning Notes

Was trying to do too many things this morning, so sorry if the notes are not complete.

ARCHITECTING ScholarSphere: How We Built a Repository App That Doesn’t Feel Like Yet Another Janky Old Repository App

Dan Coughlin, Penn State University
Mike Giarlo, Penn State University

Presentation Slides

Trying to make it less confusing without exposing what system it’s using.

Simple Metadata Management

building metadata widgets
required: title, creator, keyword, rights
hide most non-required, have ‘more’ link for rest
limited to a set number, with tooltip
use jQuery autocomplete to suggest authority vocabulary
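
To make the autocomplete idea concrete, here is a minimal sketch of the kind of lookup a suggestion widget could call. It is not ScholarSphere's code: the function name `suggest` and the sample terms are made up for illustration, and a real authority vocabulary would come from an external source.

```python
def suggest(prefix, vocabulary, limit=5):
    """Return up to `limit` vocabulary terms starting with `prefix` (case-insensitive)."""
    needle = prefix.strip().lower()
    if not needle:
        # Nothing typed yet, so there is nothing to suggest.
        return []
    matches = [term for term in vocabulary if term.lower().startswith(needle)]
    return sorted(matches, key=str.lower)[:limit]


# Hypothetical authority terms, for illustration only.
SUBJECTS = [
    "Digitization",
    "Metadata",
    "Digital preservation",
    "Repositories",
    "Digital libraries",
]

print(suggest("digi", SUBJECTS))
```

The front-end widget would send what the user has typed so far and render whatever list comes back.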

Dashboard

list of uploaded files
list of files you have access to

Background Jobs

I got lost here during the talk about Resque jobs, sorry
has tracebacks for

Permissions Widget

set visibility
share with specific people

Version Control

can restore previous versions

Social Features

not in initial requirements
profile
contributions – “trophies”
activity – follow/following

8 months to develop, but spent 2 months just doing usability and responding to feedback.

Available on GitHub.

Pitfall! Working with Legacy Born Digital Materials in Special Collections

Donald Mennerich, The New York Public Library
Mark A. Matienzo, Yale University Library

Disk Images Process

process
stream – digitized analog magnetic signal
sector – stream decoded using algorithm(s)
object
physical – entirety of device
logical

Pitfalls

formats mean different things
communities of practice use different kinds of container formats
no single solution

Quest for Access

delivery format
what is allowed to be done with the material
need usability testing

Pitfalls

no ideal single model
decisions through the life cycle have an impact on access
capacities of institution

Collection

faculty papers – 162 floppies
goal: “recover” backup into something useful with minimal changes, repeatable process
Vito Russo Papers
goal: preserve original, describe and arrange, access

Conclusions

Time consuming
acknowledge researchers
need to work on communities of practice

Hacking the DPLA

Nate Hill, Chattanooga Public Library, nathanielhill AT gmail.com
Sam Klein, Wikipedia

A rally to get involved.

It’s an API, and a community.

Examples

Biodiversity Heritage Library
Minnesota Digital

Events

Digital Public Library of America Appfest
Launch at Boston Public Library April 18-19

Documentation and API Creator is on GitHub.

EAD without XSLT: A Practical New Approach to Web-Based Finding Aids

Trevor Thornton, New York Public Library

Side note: EAD = Encoded Archival Description, a way of describing archival collections.

Project Goals

enable multiple presentations of the same data
support dynamic web apps
cross-collection search with component-level specificity in results, and faceting on common access points

Archives Data Management Application

system using Ruby on Rails + MySQL + Solr
based on existing infrastructure
stick with what they know
didn’t need to do anything more complex
key functionality: data import, search index, API

Core Models

collection: collection as we know it, may also be single volume
component: some collections at item level, some not
description: some data has descriptive attributes
access term
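
A rough sketch of how the four core models might relate, written as Python dataclasses rather than the Rails models the presenters actually used. All field names here are my guesses, not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AccessTerm:
    term: str
    term_type: str  # assumed field, e.g. "subject" or "name"


@dataclass
class Description:
    element: str  # assumed field, e.g. "scopecontent"
    value: str


@dataclass
class Component:
    title: str
    level: str  # assumed field, e.g. "series", "file", "item"
    descriptions: List[Description] = field(default_factory=list)
    access_terms: List[AccessTerm] = field(default_factory=list)
    children: List["Component"] = field(default_factory=list)


@dataclass
class Collection:
    title: str
    components: List[Component] = field(default_factory=list)
    descriptions: List[Description] = field(default_factory=list)
    access_terms: List[AccessTerm] = field(default_factory=list)


def count_components(components):
    """Count components at every level of nesting."""
    return sum(1 + count_components(c.children) for c in components)
```

Keeping descriptions and access terms as separate records attached to both collections and components is what would allow component-level search results and faceting on shared access points.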

I just felt like I was copying the slides at this point, so I’ll try to get a link to the presentation slides instead.

The Avalon Media System: A Next Generation Hydra Head For Audio and Video Delivery

Michael Klein, Senior Software Developer, Northwestern University Library, michael.klein AT northwestern DOT edu
Nathan Rogers, Programmer/Analyst, Indiana University

Demo!

can upload from computer, but also shared dropbox
forced to enter some metadata

Avalon

is a stack
media streaming server

Content Processing

with Matterhorn
workflow pipeline – batch/unattended ingest – uploading one delimited file with names of files that should be related
pingbacks for status updates
caching of key metadata/images
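
The batch ingest idea (one delimited file naming the files that belong together) could look something like the sketch below. The two-column layout (item title, file name) is my assumption; I did not catch the actual manifest format Avalon expects.

```python
import csv
import io
from collections import defaultdict


def group_manifest(text, delimiter=","):
    """Group file names by the item they belong to.

    Assumes each row is: item title, file name (a hypothetical layout).
    """
    groups = defaultdict(list)
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or incomplete rows
        groups[row[0].strip()].append(row[1].strip())
    return dict(groups)


# Made-up manifest, for illustration only.
MANIFEST = """Spring Concert,concert_part1.mp4
Spring Concert,concert_part2.mp4
Guest Lecture,lecture.mp3
"""

print(group_manifest(MANIFEST))
```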

Stream Security

support different types of streaming (for desktop & mobile) and authentication
use authentication tokens
half is media ID, add another half, whole thing is auth token
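
One way to picture the "media ID plus another half" token is below. The talk did not say what the second half is; using an HMAC signature of the media ID is purely my assumption, and `make_token` / `verify_token` are hypothetical names, not Avalon functions.

```python
import hashlib
import hmac


def make_token(media_id: str, secret: bytes) -> str:
    """Build a token: the media ID, then a signature derived from it (assumed scheme)."""
    signature = hmac.new(secret, media_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{media_id}-{signature}"


def verify_token(token: str, secret: bytes):
    """Return the media ID if the token's signature checks out, otherwise None."""
    media_id, sep, signature = token.rpartition("-")
    if not sep:
        return None  # no separator, so this cannot be one of our tokens
    expected = hmac.new(secret, media_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return media_id if hmac.compare_digest(signature, expected) else None
```

The streaming server would only need the shared secret to check that a requested stream matches the token it was handed.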

Lunch Time

Access 2012 Day 2: Morning Notes

Zero to 50K in Three Weeks: Building a Digital Repository from Scratch, Fast

by Brianne Selman

Decided within a day to build a digital repository. While they had previously thought about digitizing materials, there had been no repository to put them in. Note: No web programming support in house.

Process:

looked into archival standard
brainstormed
call out and identification of potential content
invited public in to scan personal artifacts (e.g. postcards) of local history
quick preparation of a budget and project plan (to keep money in the library)
met with collector and historian to talk about content and how it would be displayed
first priority: collaboration on images (to show off knowledge)
met with scanning consultant to provide and discuss preliminary metadata
met with director and head of IT

At the three week mark, had not spent anything, but had created plan, which convinced

Hurdles:

ITS Expenditure Request
Software RFP (set evaluation matrix with extra weighting on OAI, etc.)
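
An evaluation matrix with extra weighting boils down to a weighted average. The criteria, weights, and scores below are invented for illustration; the talk only mentioned that OAI support was weighted more heavily.

```python
def weighted_score(scores, weights):
    """Weighted average of per-criterion scores."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


# Hypothetical weights: OAI support counts three times as much as the others.
WEIGHTS = {"oai": 3, "usability": 1, "cost": 1}

# Hypothetical vendor scores on a 1-5 scale.
vendor_a = {"oai": 5, "usability": 3, "cost": 2}
vendor_b = {"oai": 2, "usability": 5, "cost": 5}

print(weighted_score(vendor_a, WEIGHTS))
print(weighted_score(vendor_b, WEIGHTS))
```

With these made-up numbers, the vendor that is stronger on OAI comes out ahead even though the other scores higher on two of the three criteria.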

Purchased software and paid for scanning ahead of time. Ended up with CONTENTdm and, at this point, have done some scanning, added controlled vocabularies, test PDFs, contacted Canadiana.

Still to come:

workflow for future collections
identification of additional materials
Local History Nights
Collectors’ Scanning Days
Digitization days for public
Local History Talks

Prototype Demo

Zero to 50k in Three Weeks on Prezi

Open Source OCR for Large Collections of Scanned Documents

by Art Rhyno

Newspaper Death Watch – The state of newspapers.

Removing Barriers to Discovery

Currently, most old newspaper issues on microfilm. This is not accessible!

OCR:

Commercial: Abbyy
Open Source: Tesseract (can add own symbols)

Even with top of the line commercial software, low accuracy. In open source, need to do some Gaussian pre-processing first.
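
I am reading "Gaussian pre-processing" as Gaussian smoothing of the scan before OCR, which is my interpretation and may not be exactly what was shown. A small pure-Python sketch on a grayscale image held as a list of rows (a real pipeline would use an imaging library):

```python
import math


def gaussian_kernel(sigma, radius):
    """1-D Gaussian weights, normalised to sum to 1."""
    weights = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping at the image edges."""
    radius = len(kernel) // 2
    blurred = []
    for row in image:
        width = len(row)
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, weight in enumerate(kernel):
                source_x = min(max(x + k - radius, 0), width - 1)
                acc += row[source_x] * weight
            new_row.append(acc)
        blurred.append(new_row)
    return blurred


def transpose(image):
    return [list(column) for column in zip(*image)]


def gaussian_blur(image, sigma=1.0, radius=2):
    """Separable Gaussian blur: one horizontal pass, then one vertical pass."""
    kernel = gaussian_kernel(sigma, radius)
    horizontal = blur_rows(image, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))
```

Smoothing spreads isolated specks (dust, film grain) into their neighbours so they are less likely to be read as punctuation or broken letters.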

Line Segment Detector to help separate columns and Olena to help with pre-processing.

Python has good image support, then use MapReduce and Hadoop Streaming to coordinate tasks and machines (but use very odd ports).
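
Hadoop Streaming runs ordinary scripts that read lines on standard input and write tab-separated key/value lines on standard output, with the reducer receiving its input sorted by key. Below is a sketch of that shape with the streams replaced by plain lists so it can be run locally. The directory-per-issue path layout is hypothetical, and the actual OCR call is left out; the reducer only counts pages per issue.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'issue<TAB>page path' for each scanned page path."""
    for line in lines:
        path = line.strip()
        if not path:
            continue
        issue = path.rsplit("/", 1)[0]  # assumes pages live in a folder per issue
        yield f"{issue}\t{path}"


def reducer(lines):
    """Count pages per issue; expects lines already sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for issue, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{issue}\t{sum(1 for _ in group)}"


# Made-up page paths, for illustration only.
PAGES = [
    "1912-05-02/page1.tif",
    "1912-05-01/page1.tif",
    "1912-05-01/page2.tif",
]

# Simulate the shuffle/sort step that Hadoop performs between map and reduce.
shuffled = sorted(mapper(PAGES))
print(list(reducer(shuffled)))
```

In a real job each function would sit in its own script reading `sys.stdin`, and the mapper is where the per-page OCR work would happen.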

Abbyy works well if images vary and there is no consistent approach to cleaning, you have a non-flexible Windows environment, can do processing on one station, and have a one-off project that needs to get done in a hurry.

Tesseract Mods on Github

Break Time

Ignite Talks

Are in a separate blog post.

Cooking with Chef at the University of Toronto Libraries: Automated Deployment of Web Applications in a Library Context

by Graham Stewart

Not really about hardware, instead, focus on efficient web operations and services to users:

fast
reliable
highly available
useful

Technology used:

open source tools
Linux, KVM
web apps
others (didn’t catch them)

Chef:

configuration management for infrastructure automation, as code
ensures servers running specified programs with specified configurations
chef-server stores information about your environment
chef-client gets information from chef-server about what it should do, how it is configured, and what other nodes need to know

Chef Components:

recipes: perform specific task(s), mostly install
attributes: data about chef clients
templates: files used to dynamically generate content, frequently for config files (can execute Ruby code)
Cookbooks: modules
Roles: collection of recipes, other roles, and attributes; can be building blocks of an application
Data bags: data about the infrastructure that exists outside nodes e.g. user accounts
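
To show how roles act as building blocks, here is a small model of expanding a run list into an ordered list of recipes. Chef itself is configured in Ruby; this Python sketch only mimics the `recipe[...]` / `role[...]` notation, and the role and recipe names are made up.

```python
def expand_run_list(run_list, roles, _seen_roles=None):
    """Flatten a run list of recipe[...] and role[...] entries into recipe names."""
    seen_roles = set() if _seen_roles is None else _seen_roles
    recipes = []
    for entry in run_list:
        kind, _, name = entry.partition("[")
        name = name.rstrip("]")
        if kind == "recipe":
            if name not in recipes:
                recipes.append(name)
        elif kind == "role":
            if name in seen_roles:
                continue  # already expanded; also guards against role cycles
            seen_roles.add(name)
            for recipe in expand_run_list(roles[name], roles, seen_roles):
                if recipe not in recipes:
                    recipes.append(recipe)
        else:
            raise ValueError(f"unrecognised run list entry: {entry!r}")
    return recipes


# Hypothetical roles, for illustration only.
ROLES = {
    "base": ["recipe[ntp]", "recipe[users]"],
    "web": ["role[base]", "recipe[nginx]"],
}

print(expand_run_list(["role[web]", "recipe[ntp]"], ROLES))
```

A node assigned the "web" role ends up with the base recipes first and the web server recipe last, with nothing applied twice.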

One of the best parts of Chef is the community. Very active with conference, wiki, etc. Can use Ruby.

Why Useful?

server configs similar
never do anything twice
“easily” recover infrastructure: separate config from data and applications
end of monolithic, critical, fragile server
there is more than one way to do it, but best to do it consistently
can start another project right away

Problems

complicated: steep learning curve
potential for big fail
a bit bleeding edge: very aggressive release schedule
acquisition potential

Interest to the Library Community

IT no longer the roadblock
leads to greater cooperation

More notes on the Access 2012 Live Blog.