Access 2014: Day 1 Notes
The notes from Day 1, other than the opening keynote.
Tag: digital repository
Access 2013: It’s dangerous to go alone! How about *we* do this!?
- Steve Marks, Nick Ruest, Graham Stewart & Amaz Taufique
Everyone has many of the same needs when looking at digital collections: digitization of collections, mixed types of content, preservation, etc.
Code4Lib Day 1: Morning Notes
Was trying to do too many things this morning, so sorry if the notes are not complete.
ARCHITECTING ScholarSphere: How We Built a Repository App That Doesn’t Feel Like Yet Another Janky Old Repository App
- Dan Coughlin, Penn State University
- Mike Giarlo, Penn State University
Trying to make it less confusing without exposing what system it’s using.
Simple Metadata Management
- building metadata widgets
- required: title, creator, keyword, rights
- hide most non-required, have ‘more’ link for rest
- limited to a set number, with tooltip
- use jQuery autocomplete to suggest authority vocabulary
- list of uploaded files
- list of files the user has access to
- I got lost here while they were talking about Resque jobs, sorry
- has tracebacks for
- set visibility
- share with specific people
- can restore previous versions
- not in initial requirements
- contributions – “trophies”
- activity – follow/following
8 months to develop, but spent 2 months just doing usability and responding to feedback.
Available on GitHub.
Pitfall! Working with Legacy Born Digital Materials in Special Collections
- Donald Mennerich, The New York Public Library
- Mark A. Matienzo, Yale University Library
Disk Images Process
- stream – digitized analog magnetic signal
- sector – stream decoded using algorithm(s)
- physical – entirety of device
- formats mean different things
- communities of practice use different kinds of container formats
- no single solution
Quest for Access
- delivery format
- what allowed to be done with material
- need usability testing
- no ideal single model
- decisions through the life cycle have an impact on access
- capacities of institution
- faculty papers – 162 floppies
- goal: “recover” backup into something useful with minimal changes, repeatable process
- Vita Russo Papers
- goal: preserve original, describe and arrange, access
- Time consuming
- acknowledge researchers
- need to work on communities of practice
Hacking the DPLA
- Nate Hill, Chattanooga Public Library, nathanielhill AT gmail.com
- Sam Klein, Wikipedia
A rally to get involved.
It’s an API, and a community.
- Biodiversity Heritage Library
- Minnesota Digital
- Digital Public Library of America Appfest
- Launch at Boston Public Library April 18-19
Documentation and API Creator is on GitHub.
EAD without XSLT: A Practical New Approach to Web-Based Finding Aids
- Trevor Thornton, New York Public Library
Side note: EAD = Encoded Archival Description, a way of describing archival collections.
- enable multiple presentations of the same data
- support dynamic web apps
- cross-collection search with component-level specificity in results, and faceting on common access points
Archives Data Management Application
- system using Ruby on Rails + MySQL + Solr
- based on existing infrastructure
- stick with what they know
- didn’t need to do anything more complex
- key functionality: data import, search index, API
- collection: collection as we know it, may also be single volume
- component: some collections at item level, some not
- description: some data has descriptive attributes
- access term
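To make the collection/component/access term idea a bit more concrete, here is a minimal Python sketch of how a nested collection could be flattened into one search document per component, which is roughly what "component-level specificity" in search results needs. This is not the presenter's code (the actual system is Rails + MySQL + Solr); the class names, field names, and the choice to have components inherit collection-level access terms are all my own assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Component:
    """A node in a collection's hierarchy (series, folder, item, ...)."""
    title: str
    access_terms: List[str] = field(default_factory=list)
    children: List["Component"] = field(default_factory=list)


@dataclass
class Collection:
    """A collection as we know it; may also be a single volume."""
    title: str
    access_terms: List[str] = field(default_factory=list)
    components: List[Component] = field(default_factory=list)


def index_documents(collection: Collection) -> Iterator[dict]:
    """Yield one flat search document per component, so a search hit can
    point at a specific component instead of the whole collection."""

    def walk(component: Component, path: List[str]) -> Iterator[dict]:
        here = path + [component.title]
        yield {
            "collection": collection.title,
            "path": " > ".join(here),
            # Assumption: components inherit collection-level access terms
            # so that faceting on common access points still finds them.
            "access_terms": sorted(
                set(collection.access_terms) | set(component.access_terms)
            ),
        }
        for child in component.children:
            yield from walk(child, here)

    for top in collection.components:
        yield from walk(top, [])
```

Each yielded dictionary is the kind of flat record a search index could take in, with the `path` field keeping track of where the component sits in the hierarchy.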
I just felt like I was copying the slides at this point, so I’ll try to get a link to the presentation slides instead.
The Avalon Media System: A Next Generation Hydra Head For Audio and Video Delivery
- Michael Klein, Senior Software Developer, Northwestern University Library, michael.klein AT northwestern DOT edu
- Nathan Rogers, Programmer/Analyst, Indiana University
- can upload from computer, but also shared dropbox
- forced to enter some metadata
- is a stack
- media streaming server
- with Matterhorn
- workflow pipeline – batch/unattended ingest – uploading one delimited file with names of files that should be related
- pingbacks for status updates
- caching of key metadata/images
- support different types of streaming (for desktop & mobile) and authentication
- use authentication tokens
- half is media ID, add another half, whole thing is auth token
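The "half media ID, half something else" token idea can be sketched in Python roughly as below. I did not catch how Avalon actually builds the second half, so this is only one plausible approach under my own assumptions: the second half is an HMAC signature over the media ID and a user name, and `SECRET_KEY`, `make_token`, and `check_token` are hypothetical names, not anything from the Avalon code.

```python
import hashlib
import hmac

# Hypothetical secret shared by the web app and the streaming server.
SECRET_KEY = b"change-me"


def make_token(media_id: str, user: str) -> str:
    """First half is the media ID; second half is a signature tying it to a user."""
    message = f"{media_id}:{user}".encode("utf-8")
    signature = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    return f"{media_id}.{signature}"


def check_token(token: str, user: str) -> bool:
    """Recompute the token from its first half and compare in constant time."""
    media_id, _, _ = token.rpartition(".")
    expected = make_token(media_id, user)
    return hmac.compare_digest(token.encode("utf-8"), expected.encode("utf-8"))
```

The streaming server only needs the shared secret to check a token, so it never has to call back to the web app to decide whether a request for a given media ID is allowed.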
Access 2012 Day 2: Morning Notes
Zero to 50K in Three Weeks: Building a Digital Repository from Scratch, Fast
by Brianne Selman
Decided within a day to build a digital repository. While they had previously thought about digitizing materials, there had been no repository to put them in. Note: No web programming support in house.
- looked into archival standard
- call out and identification of potential content
- invited public in to scan personal artifacts (e.g. postcards) of local history
- quick preparation of a budget and project plan (to keep money in the library)
- met with collector and historian to talk about content and how it would be displayed
- first priority: collaboration on images (to show off knowledge)
- met with scanning consultant to provide and discuss preliminary metadata
- met with director and head of IT
At the three-week mark, had not spent anything, but had created a plan, which convinced
- ITS Expenditure Request
- Software RFP (set evaluation matrix with extra weighting on OAI, etc.)
Purchased software and paid for scanning ahead of time. Ended up with CONTENTdm and, at this point, have done some scanning, added controlled vocabularies, tested PDFs, and contacted Canadiana.
Still to come:
- workflow for future collections
- identification of additional materials
- Local History Nights
- Collectors’ Scanning Days
- Digitization days for public
- Local History Talks
Zero to 50k in Three Weeks on Prezi
Open Source OCR for Large Collections of Scanned Documents
by Art Rhyno
Newspaper Death Watch – The state of newspapers.
Removing Barriers to Discovery
Currently, most old newspaper issues are on microfilm. This is not accessible!
- Commercial: Abbyy
- Open Source: Tesseract (can add own symbols)
Even with top of the line commercial software, low accuracy. In open source, need to do some Gaussian pre-processing first.
Line Segment Detector to help separate columns and Olena to help with pre-processing.
Python has good image support, then use MapReduce and Hadoop Streaming to coordinate tasks and machines (but use very odd ports).
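For anyone who has not seen Hadoop Streaming: a mapper is just a program that reads lines on standard input and writes tab-separated key/value lines on standard output. Here is a minimal Python sketch of what an OCR mapper could look like under that model. It is not the code from the talk; `map_pages` is a name I made up, and the `ocr` argument is a hypothetical stand-in for whatever actually runs the OCR engine (for example, a wrapper around Tesseract).

```python
import sys
from typing import Callable, Iterable, Iterator


def map_pages(lines: Iterable[str], ocr: Callable[[str], str]) -> Iterator[str]:
    """Turn each input line (one image path per line) into 'path<TAB>text'."""
    for line in lines:
        path = line.strip()
        if not path:
            continue  # ignore blank lines in the input split
        try:
            text = ocr(path)
        except Exception as exc:
            # One unreadable page should not take down the whole job.
            sys.stderr.write(f"skipping {path}: {exc}\n")
            continue
        # Streaming records are line-based, so collapse the text onto one line.
        yield f"{path}\t{' '.join(text.split())}"
```

In a real streaming job the script would loop over `map_pages(sys.stdin, some_ocr_function)` and print each record, and Hadoop would take care of handing out input lines to the machines.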
Abbyy works well if the images vary and there is no consistent approach to cleaning, you have a non-flexible Windows environment, you can do the processing on one station, and it is a one-off project that needs to get done in a hurry.
Tesseract Mods on GitHub
Are in a separate blog post.
Cooking with Chef at the University of Toronto Libraries: Automated Deployment of Web Applications in a Library Context
by Graham Stewart
Not really about hardware, instead, focus on efficient web operations and services to users:
- highly available
- open source tools
- Linux, KVM
- web apps
- others (didn’t catch them)
- configuration management for infrastructure automation, as code
- ensures servers running specified programs with specified configurations
- chef-server stores information about your environment
- chef-client gets information from chef-server about what it should do, how it is configured, and what other nodes need to know
- recipes: perform specific task(s), mostly install
- attributes: data about chef clients
- templates: files used to dynamically generate content, frequently for config files (can execute Ruby code)
- Cookbooks: modules
- Roles: collection of recipes, other roles, and attributes; can be building blocks of an application
- Data bags: data about the infrastructure that exists outside nodes e.g. user accounts
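The part that made this click for me is the combination of attributes + templates + "ensure the server is in the specified state". Chef itself uses a Ruby DSL, so the following is not Chef code, just a small Python sketch of the underlying idea: render a config file from a template and some attributes, and only touch the file if it differs from the desired content. The function name `converge_file` and the sample template are my own inventions.

```python
import os
import tempfile
from string import Template


def converge_file(path: str, template: str, attributes: dict) -> bool:
    """Bring the file at `path` to the desired content; return True if it changed."""
    desired = Template(template).substitute(attributes)
    try:
        with open(path, encoding="utf-8") as handle:
            if handle.read() == desired:
                return False  # already in the desired state, so do nothing
    except FileNotFoundError:
        pass  # file does not exist yet, fall through and create it
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(desired)
    return True


# Usage: the first run writes the file, the second run is a no-op.
config_path = os.path.join(tempfile.mkdtemp(), "app.conf")
config_template = "listen $port\nserver_name $hostname\n"
node_attributes = {"port": 8080, "hostname": "web1"}
print(converge_file(config_path, config_template, node_attributes))
print(converge_file(config_path, config_template, node_attributes))
```

Running the same step twice and getting no change the second time is what makes it safe to run configuration over and over on every server.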
One of the best parts of Chef is the community. Very active with conference, wiki, etc. Can use Ruby.
- server configs similar
- never do anything twice
- “easily” recover infrastructure: separate config from data and applications
- end of monolithic, critical, fragile server
- there is more than one way to do it, but best to do it consistently
- can start another project right away
- complicated: steep learning curve
- potential for big fail
- a bit bleeding edge: very aggressive release schedule
- acquisition potential
Interest to the Library Community
- IT no longer the roadblock
- leads to greater cooperation
More notes on the Access 2012 Live Blog.