Zero to 50K in Three Weeks: Building a Digital Repository from Scratch, Fast
by Brianne Selman
Decided within a day to build a digital repository. While the library had previously thought about digitizing materials, there had been no repository to put them in. Note: no web programming support in house.
- looked into archival standards
- call out and identification of potential content
- invited public in to scan personal artifacts (e.g. postcards) of local history
- quick preparation of a budget and project plan (to keep money in the library)
- met with collector and historian to talk about content and how it would be displayed
- first priority: collaboration on images (to show off knowledge)
- met with scanning consultant to provide and discuss preliminary metadata
- met with director and head of IT
At the three-week mark, had not spent anything, but had created a plan, which was convincing enough to lead to:
- ITS Expenditure Request
- Software RFP (set evaluation matrix with extra weighting on OAI, etc.)
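A weighted evaluation matrix like the one mentioned for the RFP can be sketched in a few lines. This is only an illustration: the criteria, weights, product names, and ratings below are hypothetical, since the talk only says that OAI support received extra weighting.

```python
# Hypothetical criteria and weights; only the extra weight on OAI comes from the talk.
WEIGHTS = {"oai_support": 3, "ease_of_use": 1, "cost": 1, "hosting": 1}


def weighted_score(ratings, weights=WEIGHTS):
    """Sum rating * weight over every criterion; a missing rating counts as 0."""
    return sum(weight * ratings.get(name, 0) for name, weight in weights.items())


def rank_products(candidates, weights=WEIGHTS):
    """Return (product, score) pairs, highest weighted score first."""
    scored = [(name, weighted_score(ratings, weights))
              for name, ratings in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up ratings on a 1-5 scale for two made-up products.
candidates = {
    "Product A": {"oai_support": 5, "ease_of_use": 3, "cost": 2, "hosting": 3},
    "Product B": {"oai_support": 2, "ease_of_use": 5, "cost": 5, "hosting": 4},
}

ranking = rank_products(candidates)
for name, score in ranking:
    print(name, score)
```

With these made-up numbers, Product B has the higher unweighted total, but the extra weight on OAI support puts Product A first, which is the kind of effect the weighting is meant to have.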
Purchased software and paid for scanning ahead of time. Ended up with ContentDM and at this point, done some scanning, added controlled vocabularies, test PDFs, contacted Canadiana.
Still to come:
- workflow for future collections
- identification of additional materials
- Local History Nights
- Collectors’ Scanning Days
- Digitization days for public
- Local History Talks
Open Source OCR for Large Collections of Scanned Documents
by Art Rhyno
Newspaper Death Watch – The state of newspapers.
Removing Barriers to Discovery
Currently, most old newspaper issues are on microfilm. This is not accessible!
- Commercial: Abbyy
- Open Source: Tesseract (can add own symbols)
Even with top-of-the-line commercial software, accuracy is low. With the open source option, some Gaussian pre-processing is needed first.
Python has good image support; MapReduce and Hadoop Streaming can then be used to coordinate tasks and machines (but they use very odd ports).
Abbyy works well if the images vary and there is no consistent approach to cleaning, if you have a non-flexible Windows environment, if the processing can be done on one station, or if it is a one-off project that needs to get done in a hurry.
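The notes do not say exactly which Gaussian step was used. A common one is a Gaussian blur to smooth speckle noise before OCR, and the sketch below assumes that reading. It is written in plain Python on a list-of-lists grayscale image so that it is self-contained; in practice an imaging library would do this much faster. The function names and default parameters are illustrative.

```python
import math


def gaussian_kernel(sigma, radius):
    """Return 1-D Gaussian weights of length 2 * radius + 1 that sum to 1."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indexes at the borders."""
    radius = len(kernel) // 2
    blurred = []
    for row in image:
        width = len(row)
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, weight in enumerate(kernel):
                xi = min(max(x + k - radius, 0), width - 1)
                acc += weight * row[xi]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred


def transpose(image):
    """Swap rows and columns."""
    return [list(column) for column in zip(*image)]


def gaussian_blur(image, sigma=1.0, radius=2):
    """Separable Gaussian blur: one pass along rows, one along columns."""
    kernel = gaussian_kernel(sigma, radius)
    horizontal = blur_rows(image, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))


# A 5x5 black page with a single bright speck in the middle.
page = [[0] * 5 for _ in range(5)]
page[2][2] = 255
smoothed = gaussian_blur(page)
```

After the blur, the isolated speck is spread over its neighbours and its peak value drops, which is the intended effect on scanning noise.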
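Hadoop Streaming lets the map and reduce steps be ordinary scripts that read lines and write tab-separated key/value lines. The sketch below shows that shape for an OCR job. The input layout (issue id, tab, image path), the function names, and `placeholder_ocr` are all assumptions made for illustration; the placeholder stands in for whatever OCR call is actually used.

```python
from itertools import groupby


def placeholder_ocr(image_path):
    """Hypothetical stand-in for a real OCR call on one page image."""
    return "text of " + image_path


def map_pages(lines, ocr=placeholder_ocr):
    """Mapper: 'issue_id<TAB>image_path' in, 'issue_id<TAB>page_text' out."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        issue_id, image_path = line.split("\t", 1)
        # Tabs and newlines in the text would break the line-based records.
        text = ocr(image_path).replace("\t", " ").replace("\n", " ")
        yield issue_id + "\t" + text


def reduce_issues(sorted_lines):
    """Reducer: join the page texts of each issue; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for issue_id, group in groupby(pairs, key=lambda pair: pair[0]):
        yield issue_id + "\t" + " ".join(text for _, text in group)


pages = [
    "1901-05-02\tscan_0002.tif\n",
    "1901-05-01\tscan_0001.tif\n",
    "1901-05-02\tscan_0003.tif\n",
]
mapped = sorted(map_pages(pages))  # sorted() stands in for Hadoop's shuffle and sort
issues = list(reduce_issues(mapped))
```

In a real streaming job the mapper and reducer would be separate scripts fed from standard input, and Hadoop would do the sorting between them. Here the whole line is sorted, so pages happen to come out in order within each issue; a real job should not rely on that without configuring it.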
Tesseract Mods on Github
Are in a separate blog post.
Cooking with Chef at the University of Toronto Libraries: Automated Deployment of Web Applications in a Library Context
by Graham Stewart
Not really about hardware; instead, the focus is on efficient web operations and services to users:
- highly available
- open source tools
- Linux, KVM
- web apps
- others (didn’t catch them)
- configuration management for infrastructure automation, as code
- ensures servers running specified programs with specified configurations
- chef-server stores information about your environment
- chef-client gets information from chef-server about what it should do, how it should be configured, and what other nodes need to know
- recipes: perform specific task(s), mostly install
- attributes: data about chef clients
- templates: files used to dynamically generate content, frequently for config files (can execute Ruby code)
- Cookbooks: modules
- Roles: collection of recipes, other roles, and attributes; can be building blocks of an application
- Data bags: data about the infrastructure that exists outside nodes e.g. user accounts
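To make the vocabulary above more concrete, here is a toy model in plain Python of how roles can nest and how a template is filled in from attributes. It is not Chef's Ruby DSL or API: the role names, recipe names, and attribute values are hypothetical, `string.Template` only stands in for Chef's templates, and skipping repeated recipes is a simplification made by this sketch.

```python
from string import Template

# Hypothetical roles: each is a list of recipes and/or other roles.
ROLES = {
    "base": ["recipe[ntp]", "recipe[users]"],
    "web_app": ["role[base]", "recipe[apache]", "recipe[users]"],
}


def expand_run_list(items, roles, seen_roles=None):
    """Flatten nested roles into an ordered list of recipe names, no repeats."""
    seen_roles = set() if seen_roles is None else seen_roles
    recipes = []
    for item in items:
        kind, name = item.rstrip("]").split("[", 1)
        if kind == "role":
            if name in seen_roles:
                continue  # already expanded; also guards against cycles
            seen_roles.add(name)
            expanded = expand_run_list(roles[name], roles, seen_roles)
        else:
            expanded = [name]
        for recipe in expanded:
            if recipe not in recipes:
                recipes.append(recipe)
    return recipes


def render_template(template_text, attributes):
    """Fill $placeholders in a config template from a dict of attributes."""
    return Template(template_text).substitute(attributes)


run_list = expand_run_list(["role[web_app]"], ROLES)
config = render_template(
    "Listen $port\nServerName $hostname\n",
    {"port": 8080, "hostname": "repo.example.org"},
)
print(run_list)
print(config)
```

The point of the sketch is the building-block idea: a node is given one role, and the recipes it needs and the config files it gets follow from data rather than from steps typed by hand on each server.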
One of the best parts of Chef is the community. Very active, with a conference, wiki, etc. Can use Ruby.
- server configs similar
- never do anything twice
- “easily” recover infrastructure: separate config from data and applications
- end of monolithic, critical, fragile server
- there is more than one way to do it, but best to do it consistently
- can start another project right away
- complicated: steep learning curve
- potential for big fail
- a bit bleeding edge: very aggressive release schedule
- acquisition potential
Interest to the Library Community
- IT no longer the roadblock
- leads to greater cooperation
More notes on the Access 2012 Live Blog.