Afternoon of Day 1 of Code4lib 2014.
Structured Data NOW: seeding schema.org in library systems – Dan Scott
Bags of words are hard.
Consistent flaw:
Wanted to make my library system part of the Semantic web, but
* XML
etc.
Schema.org was introduced
- offer simple vocabulary for short tail of results (events, products, people)
- enable normals to add markup without experts, with lots of examples
- enable search engines to aggregate data and apply better disambiguation and relevance strategies
Baby Steps
- Evergreen was publishing simplistic title/author/keyword via microdata
- OCLC WorldCat also started publishing rich, heavily extended schema.org via JSON
- If your holdings are not in OCLC, you’re not linked to Google Books
Iterate Towards Linked Data
Begin enriching data using web standards
* persistent URIs
* HTML5
* RDFa (or microdata) expressing schema.org
* sitemaps listing all the URIs of interest
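As a rough illustration of the last item, a sitemap can be generated from a list of persistent record URIs. This is a minimal sketch using only the Python standard library; the `catalogue.example.org` URIs and the `build_sitemap` helper are made up for illustration and are not from the talk.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(uris):
    """Return a sitemap XML string with one <url><loc> entry per URI."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for uri in uris:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = uri
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical persistent URIs for two catalogue records
record_uris = [
    "https://catalogue.example.org/record/1",
    "https://catalogue.example.org/record/2",
]
sitemap_xml = build_sitemap(record_uris)
print(sitemap_xml)
```

A real catalogue would likely need to split the output across several files and a sitemap index once the number of records gets large.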
W3C Library Linked Data Incubator Group Report says many of the same things, so go read it.
Reality Check
Ronallo found American academic libraries published under 10k schema.org instances in total.
RDFa Lite
RDFa Lite is pared down to just 5 attributes. Microdata is a roughly equivalent form of inline markup. Both provide information on type and property.
Test with structured data extracted from a page.
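To make the type and property idea concrete, here is a small sketch that renders a record page fragment with RDFa Lite attributes (`vocab`, `typeof`, `property`, `resource`; the fifth attribute, `prefix`, is not needed here). The `book_to_rdfa` helper, the sample values, and the choice of `Book`/`Person` types are assumptions for illustration, not markup shown in the talk.

```python
from html import escape


def book_to_rdfa(title, author, record_uri):
    """Render a minimal HTML fragment describing a book with RDFa Lite."""
    return (
        f'<div vocab="http://schema.org/" typeof="Book" '
        f'resource="{escape(record_uri)}">\n'
        f'  <h1 property="name">{escape(title)}</h1>\n'
        f'  <span property="author" typeof="Person">'
        f'<span property="name">{escape(author)}</span></span>\n'
        f'</div>'
    )


# Hypothetical record values
fragment = book_to_rdfa(
    "An Example Title",
    "Example, Author",
    "https://catalogue.example.org/record/1",
)
print(fragment)
```

The resulting fragment can then be pasted into a structured data testing tool to check what gets extracted.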
SchemaBibEx
Look at what needed to be extended to bring into schema.org proper. The idea was to make it for all articles and library items.
Mapping
Mapping holdings to schema.org offers
* seller = library
* sku = call number
* serialNumber = barcode
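The mapping above could be sketched as a function that turns one holding into a schema.org `Offer`. Serializing as JSON-LD is an assumption made here to keep the example short (the talk focused on RDFa and microdata), and the `holding_to_offer` helper and sample values are hypothetical.

```python
import json


def holding_to_offer(library_name, call_number, barcode):
    """Map one library holding onto a schema.org Offer (as a JSON-LD dict)."""
    return {
        "@context": "http://schema.org",
        "@type": "Offer",
        "seller": {"@type": "Library", "name": library_name},  # seller = library
        "sku": call_number,       # sku = call number
        "serialNumber": barcode,  # serialNumber = barcode
    }


# Hypothetical holding
offer = holding_to_offer("Example University Library", "Z699 .E93 2014", "30007001234567")
print(json.dumps(offer, indent=2))
```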
Periodicals
* article type
* periodical extension with PublicationIssue, PublicationVolume, Periodical, Book
* currently under consideration by schema.org
Status
Stopped making new extensions, and looking at best practices, documentation, etc.
Now being published by Koha, VuFind, and about to be published in Evergreen.
Next Generation Catalogue – RDF as a Basis for New Services – Anne-Lena Westrum, Benjamin Rokseth, Asgeir Rekkavik, and Petter Goksøyr Åsen
4 years ago, we were living inside the black box of the ILS.
One example: a search for an author returned 851 results when there should have been 40.
70% of material is in the stacks, so users rely on the OPAC to find what the library has.
In 2017, will be in a new building. Open digital mediation centre.
Have chosen to move away from MARC. User centred services
Active shelves = physical touchscreen device. Shelf reads RFID and presents information that is relevant to the book, e.g. reviews, stuff by the same author (only one edition instead of all), similar books
Collected book recommendations around the country into one RDF store, connected to books via ISBN. Can query database for recommended books.
Move From Black Box to Open System Architecture
Started preparation. System makes user choose specific edition of a book.
> This kind of user experience is like going to a library and being helped by a librarian who is a complete idiot
Need to add common sense to the system.
MARC2RDF
* open source tool kit
* conversion from MARC bibliographic data to RDF statements
* enrich data with external content from various APIs and linked open data e.g. cover images, book reviews
* can control multiple groups of data and multiple mapping files
* can add conditional choices
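The general idea of a mapping-driven conversion can be sketched as below. This is not the MARC2RDF toolkit itself: the dict-based record, the `MAPPING` table, the `marc_to_triples` helper, and the choice of schema.org predicates are all assumptions for illustration.

```python
SCHEMA = "http://schema.org/"

# Hypothetical mapping "file": (MARC tag, subfield code) -> RDF predicate
MAPPING = {
    ("245", "a"): SCHEMA + "name",
    ("100", "a"): SCHEMA + "author",
    ("020", "a"): SCHEMA + "isbn",
}


def marc_to_triples(record_uri, fields, mapping=MAPPING):
    """Return (subject, predicate, object) triples for every mapped subfield."""
    triples = []
    for (tag, code), values in fields.items():
        predicate = mapping.get((tag, code))
        if predicate is None:
            continue  # fields without a mapping are skipped
        for value in values:
            # Conditional clean-up: drop trailing ISBD punctuation and
            # ignore values that end up empty
            cleaned = value.strip().rstrip(" /:,")
            if cleaned:
                triples.append((record_uri, predicate, cleaned))
    return triples


# Hypothetical record, keyed by (tag, subfield code)
record = {
    ("245", "a"): ["An example title /"],
    ("100", "a"): ["Example, Author,"],
    ("500", "a"): ["A general note."],
}
triples = marc_to_triples("https://catalogue.example.org/record/1", record)
for triple in triples:
    print(triple)
```

Enrichment from external APIs would add further triples about the same subject URI after this step.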
RDF2MARC
* Still going to need MARC records for several purposes e.g. circulation, ILL
More Like This: Approaches to Recommending Related Items using Subject Headings – Kevin Beswick
Wanted recommendations for more serendipitous discovery, in part because the library uses an ASRS (bookbot).
Did it based on subject headings: most headings, most subject terms, weighted subject terms.
Built with Python/Flask App, Solr/SolrMARC.
The most headings and most terms algorithms looked to be producing decent recommendations (first headings too few results). The right weighting differs based on subject or user interests, which makes it impossible to set without user input.
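A rough in-memory sketch of the two unweighted approaches follows. It assumes "most headings" counts whole subject headings shared with the target item and "most terms" counts shared subdivisions after splitting headings on `--`; the talk's actual implementation ran on Solr, and the function names and sample catalogue here are hypothetical.

```python
def split_terms(headings):
    """Break headings like 'Libraries -- Automation' into lowercase terms."""
    return {
        term.strip().lower()
        for heading in headings
        for term in heading.split("--")
        if term.strip()
    }


def score_most_headings(target, candidate):
    """Number of complete subject headings the two items share."""
    return len(set(target) & set(candidate))


def score_most_terms(target, candidate):
    """Number of individual subject terms the two items share."""
    return len(split_terms(target) & split_terms(candidate))


def recommend(target_id, catalogue, score, limit=5):
    """Rank other items by score (highest first), dropping zero scores."""
    target = catalogue[target_id]
    scored = [
        (score(target, headings), item_id)
        for item_id, headings in catalogue.items()
        if item_id != target_id
    ]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda p: (-p[0], p[1]))
    return [item_id for _, item_id in ranked[:limit]]


# Hypothetical catalogue: item id -> subject headings
catalogue = {
    "a": ["Libraries -- Automation", "Metadata"],
    "b": ["Libraries -- Automation"],
    "c": ["Libraries -- History"],
    "d": ["Cooking"],
}
by_headings = recommend("a", catalogue, score_most_headings)
by_terms = recommend("a", catalogue, score_most_terms)
print(by_headings, by_terms)
```

In this toy data, matching on terms surfaces item "c" (which shares only the term "libraries"), while matching on whole headings does not, which is one way the two approaches can return different numbers of results.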
Tested algorithms using blind ranking and qualitative comments on result sets of 10. Most subject terms (esp. longer/more headings) was better than most headings (better for shorter/fewer headings), but wanted less in the 0-5 range. Found that gov docs and fiction get thematic recommendations that shelf browse can’t achieve.
Found a lot of duplicate titles (different editions, print & electronic). Poorly assigned subject headings can cause issues. Interface considerations include integrating recommendations into an item's full record, 5 at a time.
Takeaways
- overall algorithms perform decently but could improve
- but depends on how your items are catalogued
- still under active development