Practical Relevance Ranking for 10 million books
- Tom Burton-West, University of Michigan Library
Search Challenges
- multilingual, 400+ languages
- OCR quality varies
- very long documents
- books are different from the shorter documents most search engines are tuned for
Relevance Ranking
- how to score and weight documents
- default algorithm ranks very short documents very high (see the sketch after this list)
- needed to tune/customize parameters
- average document size is ~30 times larger than in typical search collections
- did preliminary testing with Solr 4 and didn’t see the same problem, but more testing is needed
- dirty OCR complicates things, as does the mix of languages
- occurrence of words in specific chapters vs. whole book – should we index parts of books?
- similar issue with other objects e.g. bound journals, dictionaries & encyclopedias
- another difficulty is inconsistent metadata; breakdowns into articles/chapters/etc. will be inconsistent
- creating a testing plan and adding click logs
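Not from the talk, but a minimal sketch of why classic tf-idf length normalization favors very short documents, and how a BM25-style normalization dampens the effect. The field sizes, term counts, and parameters below are purely illustrative.

```python
import math

def classic_lengthnorm_score(tf, num_terms):
    """Classic Lucene-style tf-idf contribution (idf omitted):
    sqrt(tf) * 1/sqrt(doc_length). A very short document gets a huge
    length norm, so one match can outrank many matches in a long book."""
    return math.sqrt(tf) * (1.0 / math.sqrt(num_terms))

def bm25_score(tf, num_terms, avg_len, k1=1.2, b=0.75):
    """BM25 term-frequency saturation with length normalization (idf omitted)."""
    norm = k1 * ((1 - b) + b * (num_terms / avg_len))
    return (tf * (k1 + 1)) / (tf + norm)

# Illustrative numbers only: a 50-term pamphlet with one match vs. a
# 100,000-term book with 40 matches, in a collection averaging 100,000 terms.
print(classic_lengthnorm_score(tf=1, num_terms=50))          # ~0.141
print(classic_lengthnorm_score(tf=40, num_terms=100_000))    # ~0.020
print(bm25_score(tf=1, num_terms=50, avg_len=100_000))       # ~1.69
print(bm25_score(tf=40, num_terms=100_000, avg_len=100_000)) # ~2.14
```

With the classic normalization the 50-term pamphlet outranks the 100,000-term book despite a single match; the BM25-style scoring reverses that here.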
n Characters in Search of an Author
- Jay Luker, IT Specialist, Smithsonian Astrophysics Data System
- Slides
The goal of a search is to match user input to metadata, e.g. author names.
Building the next generation of ADS, ADS 2.0. Trying to increase recall without sacrificing precision.
Requirements
- match UTF-8, e.g. matching the ASCII version of a name to versions with diacritics/markings (see the folding sketch after this list)
- match more or less information, e.g. a first-name initial, without triggering substring matching
- need to work with hand-curated synonyms, e.g. pseudonyms, maiden/married names
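A minimal sketch of the diacritics requirement (my illustration, not ADS code): Unicode NFKD decomposition followed by dropping combining marks lets an ASCII query match accented forms.

```python
import unicodedata

def ascii_fold(name: str) -> str:
    """Decompose to NFKD and drop combining marks so 'Gómez' matches 'Gomez'.
    Characters with no decomposition (e.g. 'ø', 'ß') would need explicit mappings."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

assert ascii_fold("Gómez, José") == "Gomez, Jose"
assert ascii_fold("Müller") == "Muller"
```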
Solving the Problem
- normalization – strip out punctuation, rearrange name parts based on whether a comma is entered
- generate name-part variations covering whatever can realistically be expected
- transliteration – use index introspection for a list of synonyms
- expand user queries at each step:
- user searches
- normalize
- name part vars
- transliteration
- name parts vars of transliterated entries
- curated synonyms
- transliteration of anything added
- name part variations to catch everything
- assembled into a large boolean query (see the sketch below)
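A rough sketch of that expansion pipeline, with toy stand-ins for the normalization, variation, and transliteration steps (my own illustration; the real system's rules are far richer):

```python
import re
import unicodedata

def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and put the name into 'last, first' order."""
    name = re.sub(r"[.\-']", " ", name.lower()).strip()
    if "," not in name:                       # assume 'first last' was entered
        parts = name.split()
        name = f"{parts[-1]}, {' '.join(parts[:-1])}"
    return re.sub(r"\s+", " ", name)

def name_part_variations(name: str) -> set[str]:
    """Generate plausible variants, e.g. full first name vs. initial."""
    last, _, first = name.partition(", ")
    variants = {name}
    if first:
        variants.add(f"{last}, {first[0]}")   # 'luker, jay' -> 'luker, j'
    return variants

def transliterate(name: str) -> set[str]:
    """Stand-in for transliteration/synonyms drawn from index introspection."""
    folded = "".join(c for c in unicodedata.normalize("NFKD", name)
                     if not unicodedata.combining(c))
    return {name, folded}

def expand(user_input: str, curated_synonyms: dict[str, set[str]]) -> str:
    names = {normalize(user_input)}
    names |= {v for n in names for v in name_part_variations(n)}
    names |= {t for n in names for t in transliterate(n)}
    names |= {v for n in names for v in name_part_variations(n)}
    names |= {s for n in names for s in curated_synonyms.get(n, set())}
    names |= {t for n in names for t in transliterate(n)}
    names |= {v for n in names for v in name_part_variations(n)}
    # assemble everything into one large boolean OR query on an 'author' field
    return " OR ".join(f'author:"{n}"' for n in sorted(names))

print(expand("José Gómez", curated_synonyms={}))
```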
Implementation
- Python/JavaScript prototype
- actual – Solr/Lucene
Evolving Towards a Consortium MARCR BIBFRAME Redis Datastore
- Jeremy Nelson, Colorado College, jeremy.nelson@coloradocollege.edu
- Sheila Yeh, University of Denver
I think this presentation speaks for itself.
Journal Article: Building a Library App Portfolio with Redis and Django
Hybrid Archival Collections Using Blacklight and Hydra
- Adam Wead, Rock and Roll Hall of Fame and Museum
- Presentation
At the centre of everything is the Solr index; Blacklight searches against Solr. Library materials are easy enough, but archival collections use EAD, which describes many items (not just one item, as is typical of a MARC record).
Extended Blacklight to search EAD
- index collections and single items from a collection
- search results include books, entire collections, and items from collections
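A hedged sketch of what that indexing might look like: flatten a finding aid into one Solr document for the collection plus one per item, each item carrying a pointer back to its parent collection. The field names and identifiers here are hypothetical, not Blacklight's actual schema.

```python
def ead_to_solr_docs(ead):
    """Flatten a parsed EAD finding aid into one Solr doc per collection and per item."""
    docs = [{
        "id": ead["id"],
        "format": "Archival Collection",
        "title": ead["title"],
    }]
    for item in ead["items"]:
        docs.append({
            "id": f'{ead["id"]}-{item["id"]}',
            "format": "Archival Item",
            "title": item["title"],
            "collection_id": ead["id"],       # lets item results link back to the collection
            "collection_title": ead["title"],
        })
    return docs

docs = ead_to_solr_docs({
    "id": "ARC-0001",
    "title": "Concert Posters Collection",
    "items": [{"id": "c01-1", "title": "1969 festival poster"}],
})
# These dicts could then be posted to Solr, e.g. with pysolr: solr.add(docs)
```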
Digital Content
- kept in Fedora – objects described using Ruby
- use Hydra to manage the content in Fedora – manages RDF relationships
- indexes into Solr
- Need to relate Fedora content to its archival collection
- content originates from sources in collection, and part of series
- collection metadata already exists in Solr
- create RDF representations of collections
- Hydra queries Solr for collection metadata
- creates objects for series, subseries, items
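A small rdflib sketch of the kind of RDF relationships involved; the PIDs and predicates below are made up, and Hydra/ActiveFedora define their own models.

```python
from rdflib import Graph, Namespace, URIRef

# Hypothetical identifiers and predicates for illustration only.
DCTERMS = Namespace("http://purl.org/dc/terms/")

g = Graph()
item = URIRef("info:fedora/demo:item-1234")
series = URIRef("info:fedora/demo:series-7")
collection = URIRef("info:fedora/demo:collection-42")

g.add((item, DCTERMS.isPartOf, series))        # digital object belongs to a series
g.add((series, DCTERMS.isPartOf, collection))  # series belongs to the collection

print(g.serialize(format="turtle"))
```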
Issues
- terrible Solr performance for series with 500+ items
- no EAD “round tripping” – EAD can go into Solr, but not back out
- currently 60% complete
Citation search in SOLR and second-order operators
- Roman Chyla, Astrophysics Data System
Sorry, I don’t have notes for this. My brain is a bit fried by this point. Will post link when I get it.
Break Time
Breakout Sessions – reports will be available on the wiki
Next Up – lightning talks