Access 2012 Day 2: Morning Notes

Zero to 50K in Three Weeks: Building a Digital Repository from Scratch, Fast

by Brianne Selman

Decided within a day to build a digital repository. While the library had previously thought about digitizing materials, there had been no repository to put them in. Note: no web programming support in house.

Process:

  • looked into archival standards
  • brainstormed
  • put out a call for and identified potential content
  • invited the public in to scan personal artifacts (e.g. postcards) of local history
  • quickly prepared a budget and project plan (to keep money in the library)
  • met with a collector and historian to talk about content and how it would be displayed
  • first priority: collaboration on images (to show off knowledge)
  • met with a scanning consultant to provide and discuss preliminary metadata
  • met with the director and head of IT

At the three-week mark, had not spent anything, but had created a plan, which proved convincing.

Hurdles:

  • ITS Expenditure Request
  • Software RFP (set evaluation matrix with extra weighting on OAI, etc.)

Purchased software and paid for scanning ahead of time. Ended up with CONTENTdm; at this point, some scanning has been done, controlled vocabularies added, test PDFs created, and Canadiana contacted.

Still to come:

  • workflow for future collections
  • identification of additional materials
  • Local History Nights
  • Collectors’ Scanning Days
  • Digitization days for public
  • Local History Talks

Prototype Demo

Zero to 50k in Three Weeks on Prezi

Open Source OCR for Large Collections of Scanned Documents

by Art Rhyno

Newspaper Death Watch – The state of newspapers.

Removing Barriers to Discovery

Currently, most old newspaper issues are only on microfilm. This is not accessible!

OCR:

  • Commercial: Abbyy
  • Open Source: Tesseract (can add own symbols)

Even with top-of-the-line commercial software, accuracy is low. With open source, some Gaussian pre-processing is needed first.

Line Segment Detector to help separate columns and Olena to help with pre-processing.

Python has good image support; then use MapReduce and Hadoop Streaming to coordinate tasks and machines (but note that Hadoop uses some very odd ports).
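As a rough illustration of the pre-processing step, here is a minimal Gaussian smoothing pass in pure Python over a grayscale page image (represented as a list of rows of pixel values). This is not the talk's actual pipeline — that used tools like Olena — and a real workflow would read scanned TIFFs through an imaging library before handing pages to Tesseract:

```python
import math

def gaussian_kernel(sigma, radius=None):
    """Build a normalized 1-D Gaussian kernel."""
    if radius is None:
        radius = max(1, int(3 * sigma))
    kernel = [math.exp(-(x * x) / (2 * sigma * sigma))
              for x in range(-radius, radius + 1)]
    total = sum(kernel)
    return [k / total for k in kernel]

def gaussian_smooth(image, sigma=1.0):
    """Separable Gaussian blur: a horizontal pass, then a vertical pass.

    `image` is a 2-D grayscale image as a list of rows; edges are
    handled by clamping coordinates to the image border.
    """
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    h, w = len(image), len(image[0])

    def clamp(v, lo, hi):
        return max(lo, min(hi, v))

    # Horizontal pass: convolve each row with the 1-D kernel.
    tmp = [[sum(kernel[k] * row[clamp(x + k - radius, 0, w - 1)]
                for k in range(len(kernel)))
            for x in range(w)]
           for row in image]
    # Vertical pass: convolve each column of the intermediate result.
    return [[sum(kernel[k] * tmp[clamp(y + k - radius, 0, h - 1)][x]
                 for k in range(len(kernel)))
             for x in range(w)]
            for y in range(h)]
```

Smoothing like this suppresses scanner noise and speckle before binarization, which is what makes it useful ahead of OCR.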

Abbyy works well if:

  • images vary and there is no consistent approach to cleaning
  • you have an inflexible Windows environment
  • processing can be done on one station
  • it is a one-off project that needs to get done in a hurry

Tesseract Mods on GitHub

Break Time

Ignite Talks

Are in a separate blog post.

Cooking with Chef at the University of Toronto Libraries: Automated Deployment of Web Applications in a Library Context

by Graham Stewart

Not really about hardware; instead, the focus is on efficient web operations and services to users:

  • fast
  • reliable
  • highly available
  • useful

Technology used:

  • open source tools
  • Linux, KVM
  • web apps
  • others (didn’t catch them)

Chef:

  • configuration management for infrastructure automation, as code
  • ensures servers running specified programs with specified configurations
  • chef-server stores information about your environment
  • chef-client gets information from chef-server about what it should do, how it should be configured, and what other nodes it needs to know about

Chef Components:

  • recipes: perform specific task(s), mostly installs
  • attributes: data about Chef clients (nodes)
  • templates: files used to dynamically generate content, frequently config files (can execute Ruby code)
  • cookbooks: modules that package recipes, attributes, and templates
  • roles: collections of recipes, other roles, and attributes; can be the building blocks of an application
  • data bags: data about the infrastructure that exists outside nodes, e.g. user accounts

One of the best parts of Chef is the community. Very active with conference, wiki, etc. Can use Ruby.

Why Useful?

  • server configs stay similar
  • never do anything twice
  • “easily” recover infrastructure: separates config from data and applications
  • the end of the monolithic, critical, fragile server
  • there is more than one way to do it, but it is best to do it consistently
  • can start another project right away

Problems

  • complicated: steep learning curve
  • potential for big fail
  • a bit bleeding edge: very aggressive release schedule
  • acquisition potential

Interest to the Library Community

  • IT no longer the roadblock
  • leads to greater cooperation

More notes on the Access 2012 Live Blog.