Code4Lib Day 3: Closing Keynote – Gordon Dunsire
Granularity in Library Linked Open Data
Fractals
- self-similar at all levels of granularity
- each circle represents a set of things that look very similar (a snowflake-like pattern, but at different sizes)
- characteristic of fractals
- cannot determine level: all levels are equal, some more equal than others
Multi-Faceted Granularity
- What is described by a bibliographic record? or a single statement?
- What is the level of description? How complete is it? e.g. AACR2
- How detailed is the schema used? How dumb? – especially relevant right now. The more detailed, the higher level of granularity possible.
- Semantic constraints? Unconstrained?
Resource Description Framework – Linked Data
- Triple: This resource | has intended audience | Juvenile
- Subject / Predicate / Object
- do each of these parts have granularity?
- not higher/lower levels – better to talk of coarse-grained or fine-grained granularity (a minimal triple sketch follows)
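For concreteness, here is that triple expressed with rdflib in Python; the URIs are made-up placeholders, not the talk's actual vocabulary.

# One RDF triple: subject, predicate, object (placeholder URIs).
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
#      subject          predicate               object
g.add((EX.thisResource, EX.hasIntendedAudience, Literal("Juvenile")))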
Subject: What is the Statement About?
- we can focus on describing an article / resource / work, then think about coarser or finer granularity:
- coarser: consortium collection / RDF map
- library collection / digital collection
- super-aggregate journal title / journal index
- aggregate: issue / festschrift
- focus: describing an article / resource / work
- component: section / graphics / page
- sub-component: paragraph / markup
- finer: word / rdf/xml
- uri / node
Predicate: What is the Aspect Described?
- similar coarse/fine breakdown:
- membership category
- access to resource
- access to content
- suitability rating
- audience and usage
- audience
- audience of audio-visual material
- diagram: possible audience map (partial) – unconstrained version to avoid collisions of isbd/dct/schema/rda/m21/frbrer
- different links can be made while still retaining proper semantic links
- currently constructing just one giant graph
What is the Aspect Described?
- coarse to fine:
- resource record
- manifestation record
- title and statement of responsibility (s.o.r.)
- title statement
- title of manifestation
- title word
- first word of title
- why do librarians need so many titles? Why not just use the Dublin Core title and be done with it? Because we need them to do our work, e.g. a spine title to browse
- title = string identifier
- RDA: what to do with this? how do we apply these needs?
- possible semantic map (partial) – I won’t even try to reproduce this
- need to take into account names and ranges
- makes it more difficult, but more powerful
Semantic Reasoning: The Sub-Property Ladder
- this is where the graph becomes useful and powerful
- machines can’t reason, so we define the semantics such that we can give rules to machines to process our data
- semantic rule:
- if property1 sub-property of property2;
- then data triple: resource property1 “string”
- implies data triple: resource property2 “string”
- otherwise, data triple remains the same
- simple enough for a computer to carry out (see the sketch below)
- doesn’t matter how complex the map actually is, because it can still do it in a matter of seconds
- machine entailment: isbd: “has title proper” (finer) -> dct: “has title” (coarser)
- might sound simple, but making a computer do inference is not trivial
- ‘dumbing-up’: data has been lost, but the result is still meaningful – moved from one schema to another
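A minimal sketch of that sub-property rule with rdflib. The ISBD property URI here is a placeholder I made up; dct:title is the real Dublin Core term.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, RDFS

ISBD = Namespace("http://example.org/isbd/")  # placeholder, not the real ISBD namespace
EX = Namespace("http://example.org/")
g = Graph()

# schema triple: property1 is a sub-property of property2
g.add((ISBD.hasTitleProper, RDFS.subPropertyOf, DCTERMS.title))

# data triple using the finer-grained property
g.add((EX.doc1, ISBD.hasTitleProper, Literal("Granularity in Library Linked Data")))

# the rule: resource property1 "string" implies resource property2 "string"
for p1, _, p2 in list(g.triples((None, RDFS.subPropertyOf, None))):
    for s, _, o in list(g.triples((None, p1, None))):
        g.add((s, p2, o))

# the coarser dct triple has been entailed ("dumbing-up")
assert (EX.doc1, DCTERMS.title, Literal("Granularity in Library Linked Data")) in g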
Data Triples from Multiple Schemas / Entailed from Sub-Property Map / from Property Domains
- frbrer: “has intended audience” – “primary school”
- isbd: “has note on use or audience” – “for ages 5-9”
- rda: “intended audience (work)” – “for children aged 7-“
- m21: “target audience” -> m21terms: “Juvenile”
- definition attached to the vocabulary
- also talking about granularity
- can map each sub-property to the top-level unconstrained property unc: “has note on use or audience” (see the sketch below)
- “is a” frbrer: “work”, isbd: “resource”, rda: “work” – the RDA and FRBR schemas are actually separate, not semantically linked, even though the vocabulary is similar and RDA is based on FRBR
- once stabilized, they can be drawn from each other
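Continuing the same rdflib approach, a sketch (all URIs are placeholders) of mapping audience properties from two schemas up to one unconstrained property, so that a single coarse-grained query finds both statements.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS

FRBRER = Namespace("http://example.org/frbrer/")  # placeholder
ISBD = Namespace("http://example.org/isbd/")      # placeholder
UNC = Namespace("http://example.org/unc/")        # placeholder
EX = Namespace("http://example.org/")

g = Graph()
# both fine-grained properties map up to the unconstrained top-level one
g.add((FRBRER.hasIntendedAudience, RDFS.subPropertyOf, UNC.hasNoteOnUseOrAudience))
g.add((ISBD.hasNoteOnUseOrAudience, RDFS.subPropertyOf, UNC.hasNoteOnUseOrAudience))

g.add((EX.book, FRBRER.hasIntendedAudience, Literal("primary school")))
g.add((EX.book, ISBD.hasNoteOnUseOrAudience, Literal("for ages 5-9")))

# same entailment loop as in the previous sketch
for p1, _, p2 in list(g.triples((None, RDFS.subPropertyOf, None))):
    for s, _, o in list(g.triples((None, p1, None))):
        g.add((s, p2, o))

for _, _, note in g.triples((EX.book, UNC.hasNoteOnUseOrAudience, None)):
    print(note)  # "primary school" and "for ages 5-9"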
What is the Aspect Described?
- coarser to finer:
- creator
- author
- screenwriting
- animation screenwriting
- children’s cartoon screenwriting
- different controlled vocabulary
- graph of RDA for author/creator/screenwriting in relation to work and agent
- graph of same thing, but for dc for creator and agent
- what is the semantic relationship between the dct creator and the rda creator?
- marcrel author maps to dc contributor, not creator – what is the relationship between rda author and marcrel author?
- decision from 2005, needs to be reappraised and reviewed
- relationship between dc creator and dc contributor?
- how does lcsh “screenwriters” fit?
Machine-Generated Granularity
- also has issues
- e.g. full-text indexing: down to the word level (see the index sketch below)
- BabelNet: A very large multilingual ontology
- can get quite complex and granular
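As a toy illustration of word-level granularity, a minimal inverted index in Python – not any particular indexing software, just the idea of mapping each word back to where it occurs.

from collections import defaultdict

def build_index(docs):
    """Map each word to (document id, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return dict(index)

index = build_index({"rec1": "granularity in library linked data"})
print(index["granularity"])  # [('rec1', 0)]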
User-Generated Granularity
- users can actually generate useful metadata
- can use statistical methods to remove extremes and come back with a consensus (a trimmed-mean sketch follows this list)
- going to cause granularity problems e.g. “OK for my kids (7 and 9)”, “Too childish for me (age 14)”
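A sketch of one such statistical method: a simple trimmed mean over user-suggested ages. The numbers are illustrative, not from the talk.

def trimmed_mean(ratings, trim=1):
    """Drop the `trim` lowest and highest values, then average the rest."""
    kept = sorted(ratings)[trim:-trim] if len(ratings) > 2 * trim else ratings
    return sum(kept) / len(kept)

# e.g. suggested ages for a title, with outliers at both ends
print(trimmed_mean([5, 7, 7, 8, 9, 14]))  # 7.75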
KISS
- keep it simple, stupid
- keep it simple and stupid?
- data model is very simple: triples!
- in terms of complexity, actually very simple
- but metadata content is complex
- and therefore, resource discovery is complex
- complex structures arise from the application of simple rules, as in the hard sciences and math
- simplicity is elegance
AAA
- Anyone can say anything about any thing
- someone will say something about every thing
- in every conceivable way
- and then constrained linguistically
OWA
- open world assumption: the absence of a statement is not a statement of non-existence
Will it get so granular that it becomes too complex?
And the rest is science
Break Time

Access 2012 Day 1: Afternoon Notes
Adventures in Linked Data: Building a Connected Research Environment
by Lisa Goddard
Linked data doesn’t just accommodate collaboration, it enforces collaboration. Need a framework that can handle a lot of data and scale.
Text data is really messy, because it doesn’t fit into a single category. Linked data should allow all of this.
Identify Top Level Entities
Main types of entities to mint URIs for include:
- people
- places
- events
- documents
- annotations
- books
- organizations
Abstract away from implementation details to make it manageable in the long term.
Canonical URIs mean that one ‘link’ is actually three, depending on format, via content negotiation.
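A sketch of how that can work, using Python's Bottle framework (my choice for illustration; the route, names, and representations are invented): one canonical URI serving three formats based on the Accept header.

from bottle import request, response, route, run

@route("/person/1")
def person():
    # One canonical URI; the representation depends on the Accept header.
    accept = request.headers.get("Accept", "")
    if "application/rdf+xml" in accept:
        response.content_type = "application/rdf+xml"
        return "<rdf:RDF>...</rdf:RDF>"  # RDF for machines
    if "application/json" in accept:
        return {"name": "Example Person"}  # Bottle serializes dicts to JSON
    return "<html><body>Example Person</body></html>"  # HTML for humans

# run(host="localhost", port=8080)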
Define Relationships
Through RDF, make machine readable definitions.
Linked data is basically an accessibility initiative for machines.
Use ontologies to provide definitions for entities, relationships, and impose rules.
An ontology is for life.
Ontology searches are available, such as Linked Open Vocabularies (LOV), e.g. foaf:Person (Class) – friend of a friend
Tie the entity to its class using rdf:type, and add relationships such as creator, which then results in a data model.
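A minimal rdflib sketch of the above: typing a minted URI as foaf:Person and linking it to a document via dcterms:creator. The FOAF and Dublin Core terms are real; the entity URIs are placeholders.

from rdflib import Graph, Namespace
from rdflib.namespace import DCTERMS, FOAF, RDF

EX = Namespace("http://example.org/")
g = Graph()

g.add((EX.person1, RDF.type, FOAF.Person))        # the entity is a foaf:Person
g.add((EX.letter1, DCTERMS.creator, EX.person1))  # relationship: creator

print(g.serialize(format="turtle"))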
CWRC Writer
Provides a way to create a document with an interface for tagging in XML, where you can select entities from an existing authority file, from the web (using APIs), or as custom entries. You can then add relations.
Quick Comment
This looks like a really neat tool to easily add XML tags in a document. Would want to see it integrated into a standard document writer, much like RefWorks does through Write’n’Cite. I’m definitely looking forward to seeing this move forward.
Big Data, Answers, and Civil Rights
If you want volume, velocity, and variety, it’s actually very expensive.
Efficiency means lower costs, new uses, but more demand and consumption.
Big data is about abundance. The number of ways we can do things with this data has exploded.
We live in a world of abundant, instant, ubiquitous information. We evolved to seek peer approval. It all comes down to who is less dumb.
We look for confirmation rather than the truth.
The more we get confirmation, the greater the polarization.
Abundant data has changed the way we live and think.
The Problem with Big Data
Polarization can lead to an increase in prejudice. You don’t know when you’re not contacted. We are increasingly moving from a culture of convictions to a culture of evidence.
Genius says “possibly”: it finds patterns and inspires hypotheses; reason demands testing, but stays open to change.
Correlation is so good at predicting that it looks like convincing facts, but they’re just guesses.
See also: Big data, big apple, big ethics by Alistair Croll
Break Time
BiblioBox: A Library in Box
Inspired by PirateBox, which allows people to share media anonymously within a community using a standalone wifi router (not connected to the Internet). People in the same place like to share stuff.
LibraryBox then simplified it by taking out the chat and upload functions.
Dedicated ebook device that allows browsing and searching of the collection.
Components (a minimal serving sketch follows the list):
- Unix based file server using a wifi access point and small flash drive.
- Ebooks using OPDS metadata format.
- SQLite database
- API module usually available in language of choice e.g. Python
- Bottle – web development framework in Python
- Mako Templating – templating in Python
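A minimal sketch of how those pieces could fit together on the serving side. The database filename, table, and columns are my assumptions, not BiblioBox's actual code.

import sqlite3
from bottle import route, run

DB = "bibliobox.db"  # assumed filename and schema

@route("/books")
def list_books():
    con = sqlite3.connect(DB)
    rows = con.execute("SELECT title, author FROM books ORDER BY title").fetchall()
    con.close()
    return {"books": [{"title": t, "author": a} for t, a in rows]}

# run(host="0.0.0.0", port=80)  # served over the box's wifi access point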
Adding books is much more complex than serving them – consider, for example, the author authority file. We want to automate extracting metadata from ePub files, but there is no good module for reading ePub files in Python.
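One workaround: an ePub is just a zip containing an OPF metadata file, so the standard library can pull out the Dublin Core fields. A rough sketch; it hard-codes a common OPF path rather than reading container.xml to find it.

import xml.etree.ElementTree as ET
import zipfile

DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core namespace used in OPF

def epub_metadata(path, opf_path="OEBPS/content.opf"):
    """Pull basic Dublin Core fields out of an ePub's OPF file."""
    with zipfile.ZipFile(path) as z:
        root = ET.fromstring(z.read(opf_path))
    meta = {}
    for field in ("title", "creator", "language"):
        el = root.find(f".//{DC}{field}")
        if el is not None:
            meta[field] = el.text
    return meta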
User View
Add catalogue to ebook app. It then looks like a store, where you can browse by title or author.
Available on GitHub.
Question Answering, Serendipity, and the Research Process of Scholars in the Humanities
by Kim Martin
Serendipity occurs when there is a prepared mind that notices a piece that helps them solve a problem. It allows discovery and thinking outside of the box.
Chance is recognized as an important part of the historical research process.
A shelf browser of some sort in the catalogue can be useful, but what we really need is a system that allows personalization and in-depth searching. Researchers typically do not leave their offices; they just use search engines.
Visualizations, such as tag clouds, could allow more serendipitous browsing.
More notes on the Access 2012 live blog.
Code4lib Day 2: Lightning Talks
Scott Hanrath – Zotero and SHERPA/RoMEO API mashup
- quick and dirty way to filter a collection of articles by publisher policies
- use the Zotero and SHERPA/RoMEO APIs to tag articles with publisher policies (see the sketch below)
- work flow?
- zotero plugin?
- Code on github
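A guess at the RoMEO side of the mashup. The legacy v2.9 endpoint, parameter, and element names are from memory of the old API and should be treated as assumptions; the ISSN is a placeholder.

import xml.etree.ElementTree as ET
import requests

def romeo_colour(issn):
    """Return the RoMEO 'colour' (archiving policy class) for a journal."""
    resp = requests.get("http://www.sherpa.ac.uk/romeo/api29.php",
                        params={"issn": issn})
    root = ET.fromstring(resp.content)
    colour = root.find(".//romeocolour")  # element name per the old API docs
    return colour.text if colour is not None else None

# zotero_tag = f"romeo:{romeo_colour('1234-5678')}"  # e.g. "romeo:green"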
David Walker – Basic Learning Tool Interoperability (LTI) Protocol
- Need LMS to pull all the relevant library information, items, etc.
- In LMS, register library tool as if it were a native building block
- When inserted into a course, it makes a little iframe of the tool
- Hidden form elements post to the tool with course data and security (OAuth) – see the signing sketch below
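A sketch of the signing step using Python's oauthlib. The key, secret, and URL are placeholders; the field names are the standard basic-launch LTI parameters.

from urllib.parse import urlencode
from oauthlib.oauth1 import Client, SIGNATURE_TYPE_BODY

params = {
    "lti_message_type": "basic-lti-launch-request",
    "lti_version": "LTI-1p0",
    "resource_link_id": "course-123-library-tool",
    "context_id": "course-123",  # course data passed along to the tool
}
client = Client("tool_key", client_secret="tool_secret",
                signature_type=SIGNATURE_TYPE_BODY)
uri, headers, body = client.sign(
    "https://library.example.edu/lti/launch",
    http_method="POST",
    body=urlencode(params),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)
# `body` now carries the oauth_* fields and signature; these become the
# hidden form elements posted from the LMS iframe.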
Peter Murray – Introducing FOSS4LIB.org
- Lyrasis’ response to survey on what librarians wanted
- open source adopters are still in the early adopters stage
- thus, website was created
- determine whether OSS is right for the library including cost
- help to select software
- Call to action: register packages, releases, events, providers
Mark Matienzo – I’ve Got Good News
- C4L11: fiwalk with me – using open source digital forensics software to support pre-ingest work
- update of work since then
- pluggable
- could integrate anything
- two working plugins: virus scanner, file format identification against PRONOM
- Code on github
- BitCurator
Mike Durbin – Edge Cases – Digitizing and delivering undescribed items in EAD
- should automate as much of the workflow as possible
- items selected for digitization, scanned, a spreadsheet created/updated with ID and sequence, image files named according to ID/sequence (see the naming sketch below)
- put it in for automated processing including quality control, files pushed into master file archive, ingested into Fedora, and e-mail is sent to collection manager
- Finally, publication
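The naming step might look something like this; the filename pattern is my assumption, not the actual convention used.

def image_filename(item_id, sequence):
    """Name a scan from the spreadsheet's ID and sequence columns."""
    return f"{item_id}_{sequence:04d}.tif"

print(image_filename("mss1234-b2-f7", 3))  # mss1234-b2-f7_0003.tif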
Ryuuji Yoshimoto – Introducing CALIL.JP, scraping/mashup all of OPACs in JAPAN! PDF Slides
- OPACs have no API
- so start scraping OPACs, fighting with dirty HTML
- 2 months to scrape 200+ OPACs
- CALIL.JP
- realtime holdings through the CALIL API by ISBN, returning XML or JSON (see the lookup sketch below)
- item information from amazon and Google
- now have many third-party apps e.g. browser extension
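A sketch of a holdings lookup against the CALIL check API. The endpoint and parameters are as I recall them from the public docs and should be verified; an application key is required, and the real API may need polling for slow OPACs.

import requests

def check_holdings(isbn, system_id, appkey):
    """Ask CALIL which libraries in a system hold a given ISBN."""
    resp = requests.get("https://api.calil.jp/check", params={
        "appkey": appkey,       # CALIL application key
        "isbn": isbn,
        "systemid": system_id,  # e.g. a municipal library system id
        "format": "json",
        "callback": "no",       # plain JSON rather than JSONP
    })
    return resp.json()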
Kåre Fiedler Christiansen – Chucking all the software components in a library together to present recorded radio and tv
- built MPEG -> streaming server
- website -> cool design
- cool design, website, streaming server, access control -> cool website
- except lawyers, oh noes!
- PDF Slides
Joel Richard – Introducing Macaw: Metadata Collection Tool for Book-like things
- digitizing lots of book-like things including pamphlets
- most libraries sent to Internet Archive then to the Biodiversity Heritage Library
- but some items too large to fit on usual scanning hardware
- had to use camera, but had to add metadata
- Macaw collects metadata but doesn’t really do workflow
- two views: thumbnails and list
- take data from wherever, using Z39.50 or CSV, into Macaw
- custom export from Macaw, including Internet Archive, the library
- each piece is modular
- Code on Google Code
Rachel Frick – LOD-LAM Incubator Project
- Linked Open Data for Library, Archives, and Museums
- lightweight approach in terms of funding and consultation
- timeline: March – May = recruit panel, fundraising, open comment
Mao Tsunekawa – Project Shizuku : Making Friends in Libraries
- Shizuku 2.0
- a software development project supporting encounters among library users
- not recommending books, recommending users instead
- visualize circulation data for finding other users reading the same books
- can share history of reading books
- developing Baron which allows searching OPAC and then making friends
Keith Folsom – Archivists’ Toolkit Database Server on an Amazon EC2 Instance
- multi-institutional
- hosting on small instance of amazon
- Ubuntu/MySQL
- single open port
- download kit with PuTTY
- going out of pilot
Rebecca Jones – Call for Services
- Innovative Interfaces
- provide SQL access
- working on RESTful services
- What services would people like to have?
- Live Beta in March
Code4lib Day 2 Morning: Notes & TakeAways
I didn’t take full notes on all the presentations. I like to just sit back and listen to some of the presentations, especially if there are a lot of visuals, but I do have a few notes.
Full Notes for the following sessions:
- Discovering Digital Library User Behavior with Google Analytics
- How People Search the Library from a Single Search Box
- Stack View: A Library Browsing Tool
Building Research Applications with Mendeley
by William Gunn, Mendeley
- Number of tweets a PLoS article gets is a better predictor of number of citations than impact factor.
- Mendeley makes science more collaborative and transparent. Great to organize papers and then extract and aggregate research data in the cloud.
- Can use impact factor as a relevance ranking tool.
- Linked data is by citation right now, but they now have tag co-occurrences, etc.
- Link to slides.
NoSQL Bibliographic Records: Implementing a Native FRBR Datastore with Redis
No notes. Instead, have the link to the presentation complete with what looks like speaker notes.
Ask Anything!
- Things not taught in library school: all the important things, social skills, go talk to the professor directly if you want to get into CS classes.
- Memento project and UK Archives inserting content for their 404s.
- In response to librarians lamenting loss of physical books, talk to faculty in digital humanities to present data mining etc., look at ‘train based’ circulations, look at ebook stats.
- Take a look at libcatcode.org for library cataloguers learning to code, as well as Code Year hosted by Codecademy.
Code4lib Day 1 Morning: HTML5, Microdata and Schema.org (and other takeaways)
I did not take notes on everything in part because some of it was very technical and it can be hard to do notes, but here are some takeaways from the morning:
- Version Control: use it – Git or Mercurial. It doesn’t need to be code; it can be data too. – Description and Slides
- Take library data and make it available to users, can’t expect them to search for it.
- Linked Data doesn’t need to be a huge project. Start small.
- Why RDF? It’s flexible with easy addition of new attributes or classes, and works cleanly with an iterative approach.
HTML5 Microdata and Schema.org
Other than getting good ranking, we need to provide rich results, i.e. rich snippets. Some digital collections have been providing rich snippets already, such as NCSU Libraries.
How do we get this?
- embedded semantic markup
- HTML5 Semantics include nav, header, article, section, footer
- HTML5 Microdata is a syntax for annotating content to communicate meaning of data to machines
- similar to RDFa and other embedded metadata syntaxes
- Microdata comes back as tree based JSON and allows for DOM API
For example:
<div itemscope itemtype="http://schema.org/Organization" itemref="logo">
  <a itemprop="url" href="http://code4lib.org/">
    <span itemprop="name">Code4Lib</span>
  </a>
</div>
where: scope = about something
type = type of item
prop = properties
For the user, there is no difference as display is the same. This provides a complete data model.
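Roughly, the markup above extracts to an item like this, shown here as a Python literal following the tree-based JSON shape; properties contributed by the itemref'd "logo" element are omitted.

# Extracted microdata item: one type, a dict of property-value lists.
item = {
    "type": ["http://schema.org/Organization"],
    "properties": {
        "url": ["http://code4lib.org/"],
        "name": ["Code4Lib"],
    },
}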
Schema.org is a one-stop shop for vocabulary in describing items on the web.
Apologies, I did not take extensive notes on it, but to read more, check out the slides below or the Code4lib article he wrote.