Learning (Lib)Tech

Stories from my Life as a Technologist


Code4Lib Day 1: Afternoon Notes

Practical Relevance Ranking for 10 million books

  • Tom Burton-West, University of Michigan Library

Search Challenges

  • multilingual, 400+ languages
  • OCR quality varies
  • very long documents
  • books are different from the documents search engines typically handle

Relevance Ranking

  • how to score, weigh
  • default algorithm ranks very short documents very high
  • needed to tune/customize parameters
  • average document size is ~30 times larger than in a typical collection
  • did preliminary testing with Solr 4 and didn’t see the same problem, but more testing is needed
  • dirty OCR complicates things, as well as language
  • occurrence of words in specific chapters vs. whole book – should we index parts of books?
  • similar issue with other objects e.g. bound journals, dictionaries & encyclopedias
  • another difficulty is inconsistent metadata; breakdowns into articles/chapters/etc. will be inconsistent
  • creating a testing plan and adding click logs
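A rough illustration (my own, not from the talk) of why a default algorithm ranks very short documents very high: Lucene’s classic similarity multiplies scores by a length norm of 1/√(number of terms), which heavily boosts brief records over full-text books.

```python
import math

def classic_length_norm(num_terms: int) -> float:
    """Lucene's classic (pre-BM25) length normalization:
    shorter fields get a larger multiplier, so a 10-term
    catalogue record can outscore a 100,000-term book
    for the same matching term."""
    return 1.0 / math.sqrt(num_terms)

short_doc = classic_length_norm(10)       # brief record
long_doc = classic_length_norm(100_000)   # full-text book
print(short_doc / long_doc)  # short doc is boosted ~100x
```

This is why tuning or replacing the length-normalization parameters matters when the average document is orders of magnitude longer than usual.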

n Characters in Search of an Author

  • Jay Luker, IT Specialist, Smithsonian Astrophysics Data System
  • Slides

Goal of search is to match user input to metadata, e.g. author names.

Building the next generation of ADS, ADS 2.0. Trying to increase recall without sacrificing precision.

Requirements

  • match UTF-8, e.g. matching the ASCII version of a name to versions with diacritics/markings
  • match more or less information e.g. first name initial but without triggering substring matching
  • need to work with hand curated synonyms e.g. pseudonyms, maiden/married name
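The UTF-8 requirement above — matching an ASCII query against names with diacritics — can be sketched with Unicode normalization (my illustration, not ADS’s actual code): decompose each character, then drop the combining marks.

```python
import unicodedata

def ascii_fold(name: str) -> str:
    """Strip diacritics so an ASCII query matches the accented form:
    decompose to NFKD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(ascii_fold("Gómez, José"))  # -> "Gomez, Jose"
```

In Solr this job is usually done at index/query time by an ASCII-folding filter, but the underlying idea is the same.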

Solving the Problem

  • normalization – strip out punctuation, rearrange name parts – based on whether a comma is entered
  • generate name part variations to whatever can be realistically expected
  • transliteration – use index introspection for list of synonyms
  • expand user queries at each step:
    1. user searches
    2. normalize
    3. name part vars
    4. transliteration
    5. name parts vars of transliterated entries
    6. curated synonyms
    7. transliteration of anything added
    8. name part variations to catch everything
    9. assembled into large boolean query
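The expansion steps above could be sketched like this (my own hypothetical illustration — ADS’s prototype was Python, but this is not their code): generate name-part variations, then OR everything into one large boolean query.

```python
def name_variations(name: str) -> set[str]:
    """Generate plausible variants of a 'Last, First' name:
    initial-only forms and the bare surname (hypothetical sketch)."""
    last, _, first = (p.strip() for p in name.partition(","))
    variants = {name}
    if first:
        variants.add(f"{last}, {first[0]}")   # initial only
        variants.add(f"{last}, {first[0]}.")  # initial with period
        variants.add(last)                    # surname alone
    return variants

def to_boolean_query(field: str, variants: set[str]) -> str:
    """Assemble the expanded variants into one large boolean query,
    the final step of the expansion pipeline."""
    return " OR ".join(f'{field}:"{v}"' for v in sorted(variants))

print(to_boolean_query("author", name_variations("Luker, Jay")))
```

Transliteration and curated synonyms would feed additional variants into the same set before the query is assembled.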

Implementation

  • Python/JavaScript prototype
  • actual – Solr/Lucene

Evolving Towards a Consortium MARCR BIBFRAME Redis Datastore

  • Jeremy Nelson, Colorado College, jeremy.nelson@coloradocollege.edu
  • Sheila Yeh, University of Denver

Presentation Slides

I think this presentation speaks for itself.

Journal Article: Building a Library App Portfolio with Redis and Django

Hybrid Archival Collections Using Blacklight and Hydra

  • Adam Wead, Rock and Roll Hall of Fame and Museum
  • Presentation

At the centre of everything is the Solr index; Blacklight puts everything into Solr. Library materials are easy enough, but archival collections use EAD with many items (not just one item, as is typical of MARC).

Extended Blacklight to search EAD

  • index collections and single items from a collection
  • search results include books, entire collections, and items from collections
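One way to index both whole collections and single items in the same Solr core (my sketch with hypothetical field names, not the Rock Hall’s actual schema) is a level field plus a parent pointer, so search results can mix books, collections, and items while staying distinguishable.

```python
# Hypothetical Solr documents: collections and their items share one
# index, distinguished by a level field and linked by a parent id.
collection_doc = {
    "id": "ARC-0001",
    "level_s": "collection",
    "title_t": "Concert Posters Collection",
}
item_doc = {
    "id": "ARC-0001-i042",
    "level_s": "item",
    "parent_id_s": "ARC-0001",  # ties the item back to its collection
    "title_t": "1969 Fillmore East poster",
}

def is_top_level(doc: dict) -> bool:
    """A facet or filter on the level field separates entire
    collections from items within collections in result lists."""
    return doc["level_s"] == "collection"

print(is_top_level(collection_doc), is_top_level(item_doc))
```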

Digital Content

  • kept in Fedora – objects described using Ruby
  • use Hydra to manage the content in Fedora – manages RDF relationships
  • indexes into Solr
  • need to relate Fedora content to its archival collection
  • content originates from sources in collection, and part of series
  • collection metadata already exists in Solr
  • create RDF representations of collections
  • Hydra queries Solr for collection metadata
  • creates objects for series, subseries, items

Issues

  • terrible Solr performance for series with 500+ items
  • no EAD “round tripping” – EAD can go into Solr, but not back out
  • currently 60% complete

Citation search in SOLR and second-order operators

  • Roman Chyla, Astrophysics Data System

Sorry, I don’t have notes for this. My brain is a bit fried by this point. I’ll post a link when I get it.

Break Time

Breakout Sessions – reports will be available on the wiki

Next Up – lightning talks


Author: Cynthia Ng. Posted on February 12, 2013 (updated October 12, 2024). Categories: Events, Technology. Tags: code4lib, digital collections, metadata, search.
