Access 2012 Day 2: Morning Notes

Zero to 50K in Three Weeks: Building a Digital Repository from Scratch, Fast

by Brianne Selman

Decided within a day to build a digital repository. While the library had previously thought about digitizing materials, there had been no repository to put them in. Note: no web programming support in house.

Process:

  • looked into archival standards
  • brainstormed
  • put out a call for and identified potential content
  • invited the public in to scan personal artifacts (e.g. postcards) of local history
  • quickly prepared a budget and project plan (to keep money in the library)
  • met with a collector and historian to talk about content and how it would be displayed
  • first priority: collaboration on images (to show off knowledge)
  • met with a scanning consultant to provide and discuss preliminary metadata
  • met with the director and head of IT

At the three-week mark, had not spent anything, but had created a plan, which proved convincing.

Hurdles:

  • ITS Expenditure Request
  • Software RFP (set evaluation matrix with extra weighting on OAI, etc.)

Purchased software and paid for scanning ahead of time. Ended up with CONTENTdm; at this point, some scanning has been done, controlled vocabularies added, test PDFs created, and Canadiana contacted.

Still to come:

  • workflow for future collections
  • identification of additional materials
  • Local History Nights
  • Collectors’ Scanning Days
  • Digitization days for public
  • Local History Talks

Prototype Demo

Zero to 50k in Three Weeks on Prezi

Open Source OCR for Large Collections of Scanned Documents

by Art Rhyno

Newspaper Death Watch – The state of newspapers.

Removing Barriers to Discovery

Currently, most old newspaper issues are only on microfilm. This is not accessible!

OCR:

  • Commercial: Abbyy
  • Open Source: Tesseract (can add own symbols)

Even with top-of-the-line commercial software, accuracy is low. With open source, some Gaussian pre-processing is needed first.

Line Segment Detector to help separate columns and Olena to help with pre-processing.

Python has good image support; then use MapReduce and Hadoop Streaming to coordinate tasks and machines (but note that Hadoop uses some very odd ports).
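As a rough illustration of the pre-processing step, here is a minimal Gaussian smoothing pass in pure Python over a grayscale page image (represented as a list of rows of pixel values). This is not the talk's actual pipeline — that used tools like Olena — and a real workflow would read scanned TIFFs through an imaging library before handing pages to Tesseract:

```python
import math

def gaussian_kernel(sigma, radius=None):
    """Build a normalized 1-D Gaussian kernel."""
    if radius is None:
        radius = max(1, int(3 * sigma))
    kernel = [math.exp(-(x * x) / (2 * sigma * sigma))
              for x in range(-radius, radius + 1)]
    total = sum(kernel)
    return [k / total for k in kernel]

def gaussian_smooth(image, sigma=1.0):
    """Separable Gaussian blur: a horizontal pass, then a vertical pass.

    `image` is a 2-D grayscale image as a list of rows; edges are
    handled by clamping coordinates to the image border.
    """
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    h, w = len(image), len(image[0])

    def clamp(v, lo, hi):
        return max(lo, min(hi, v))

    # Horizontal pass: convolve each row with the 1-D kernel.
    tmp = [[sum(kernel[k] * row[clamp(x + k - radius, 0, w - 1)]
                for k in range(len(kernel)))
            for x in range(w)]
           for row in image]
    # Vertical pass: convolve each column of the intermediate result.
    return [[sum(kernel[k] * tmp[clamp(y + k - radius, 0, h - 1)][x]
                 for k in range(len(kernel)))
             for x in range(w)]
            for y in range(h)]
```

Smoothing like this suppresses scanner noise and speckle before binarization, which is what makes it useful ahead of OCR.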

Abbyy works well if:

  • images vary and there is no consistent approach to cleaning
  • you have an inflexible Windows environment
  • processing can be done on one station
  • it is a one-off project that needs to get done in a hurry

Tesseract Mods on GitHub

Break Time

Ignite Talks

Are in a separate blog post.

Cooking with Chef at the University of Toronto Libraries: Automated Deployment of Web Applications in a Library Context

by Graham Stewart

Not really about hardware; instead, the focus is on efficient web operations and services to users:

  • fast
  • reliable
  • highly available
  • useful

Technology used:

  • open source tools
  • Linux, KVM
  • web apps
  • others (didn’t catch them)

Chef:

  • configuration management for infrastructure automation, as code
  • ensures servers running specified programs with specified configurations
  • chef-server stores information about your environment
  • chef-client gets information from chef-server about what it should do, how it should be configured, and what other nodes it needs to know about

Chef Components:

  • recipes: perform specific task(s), mostly installs
  • attributes: data about Chef clients (nodes)
  • templates: files used to dynamically generate content, frequently config files (can execute Ruby code)
  • cookbooks: modules that package recipes, attributes, and templates
  • roles: collections of recipes, other roles, and attributes; can be the building blocks of an application
  • data bags: data about the infrastructure that exists outside nodes, e.g. user accounts

One of the best parts of Chef is the community. Very active with conference, wiki, etc. Can use Ruby.

Why Useful?

  • server configs stay similar
  • never do anything twice
  • “easily” recover infrastructure: separates config from data and applications
  • the end of the monolithic, critical, fragile server
  • there is more than one way to do it, but it is best to do it consistently
  • can start another project right away

Problems

  • complicated: steep learning curve
  • potential for big fail
  • a bit bleeding edge: very aggressive release schedule
  • acquisition potential

Interest to the Library Community

  • IT no longer the roadblock
  • leads to greater cooperation

More notes on the Access 2012 Live Blog.