Access 2012 Day 1: Afternoon Notes

Adventures in Linked Data: Building a Connected Research Environment

by Lisa Goddard

Linked data doesn’t just accommodate collaboration; it enforces it. We need a framework that can handle a lot of data and scale.

Text data is really messy because it doesn’t fit into a single category. Linked data should be able to accommodate all of this.

Identify Top Level Entities

The main types of entities to mint URIs for include:

  • people
  • places
  • events
  • documents
  • annotations
  • books
  • organizations
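
Minting URIs for these entity types might look like the sketch below. The base domain and slug scheme are assumptions, not part of the talk:

```python
import re

BASE = "http://example.org/id"  # hypothetical base domain

def mint_uri(entity_type, label):
    """Mint a stable, human-readable URI for a top-level entity."""
    slug = re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")
    return f"{BASE}/{entity_type}/{slug}"

uri = mint_uri("people", "Lisa Goddard")
# e.g. http://example.org/id/people/lisa-goddard
```

Keeping the minting logic in one place abstracts the URI scheme away from any particular application.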

Abstract away from implementation details to make it manageable in the long term.

Canonical URIs mean that one ‘link’ is actually three, depending on the format requested through content negotiation.
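
The content-negotiation idea can be sketched as a small dispatch function: the server inspects the Accept header on the canonical URI and hands back the matching representation. The MIME types and paths here are illustrative assumptions:

```python
# One canonical URI, three format-specific documents: the server picks
# the representation that matches the client's Accept header.
REPRESENTATIONS = {
    "text/html": "/doc/42.html",
    "application/rdf+xml": "/doc/42.rdf",
    "application/json": "/doc/42.json",
}

def negotiate(accept_header):
    """Map an Accept header to the representation URI to serve."""
    for mime in accept_header.split(","):
        mime = mime.split(";")[0].strip()  # drop q-values like ";q=0.9"
        if mime in REPRESENTATIONS:
            return REPRESENTATIONS[mime]
    return REPRESENTATIONS["text/html"]  # sensible default for browsers
```

A real server would issue a 303 redirect to the chosen document rather than return a path.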

Define Relationships

Through RDF, make machine-readable definitions.

Linked data is basically an accessibility initiative for machines.

Use ontologies to provide definitions for entities, relationships, and impose rules.

An ontology is for life.

Ontology search services are available, such as Linked Open Vocabularies (LOV), e.g. foaf:Person (a class from the Friend of a Friend vocabulary).

Tie an entity to its class using rdf:type, and link entities with properties such as creator. The result is a data model.
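
As a sketch of that data model, here are the triples as plain Python tuples (the rdf:type and FOAF URIs are real vocabulary terms; the example.org entity URIs are made up):

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
DC_CREATOR = "http://purl.org/dc/terms/creator"

triples = [
    # rdf:type ties the entity to its class...
    ("http://example.org/id/people/lisa-goddard", RDF_TYPE, FOAF_PERSON),
    # ...and a relationship such as dcterms:creator links entities together.
    ("http://example.org/id/documents/notes-2012", DC_CREATOR,
     "http://example.org/id/people/lisa-goddard"),
]

def objects(subject, predicate):
    """Query the toy triple store for matching objects."""
    return [o for s, p, o in triples if s == subject and p == predicate]
```

A real system would use an RDF library and a triple store, but the shape of the data is exactly this.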

CWRC Writer

Provides an interface for creating a document and tagging it in XML, where entities can be drawn from an existing authority file, the web (via APIs), or custom entries. You can then add relations between them.

Slides

Quick Comment

This looks like a really neat tool to easily add XML tags in a document. Would want to see it integrated into a standard document writer, much like RefWorks does through Write’n’Cite. I’m definitely looking forward to seeing this move forward.

Big Data, Answers, and Civil Rights

Alistair Croll

If you want volume, velocity, and variety, it’s actually very expensive.

Efficiency means lower costs, new uses, but more demand and consumption.

Big data is about abundance. The number of ways we can do things with this data has exploded.

We live in a world of abundant, instant, ubiquitous information. We evolved to seek peer approval. It all comes down to who is less dumb.

We look for confirmation rather than the truth.

The more we get confirmation, the greater the polarization.

Abundant data has changed the way we live and think.

The Problem with Big Data

Polarization can lead to an increase in prejudice. You don’t know when you’re not contacted. We’re increasingly moving from a culture of convictions to a culture of evidence.

Genius says ‘possibly’: it finds patterns and inspires hypotheses; reason demands testing, but stays open to change.

Correlation is so good at predicting that it looks like convincing fact, but it’s just guesses.

See also: Big data, big apple, big ethics by Alistair Croll

Break Time

BiblioBox: A Library in a Box

by David Fiander

Inspired by PirateBox, which allows people to share media anonymously within a community using a standalone wifi router (not connected to the Internet). People in the same place like to share stuff.

LibraryBox then simplified this by taking out the chat and upload functions.

Dedicated ebook device that allows browsing and searching of the collection.

Components:

  • Unix-based file server using a wifi access point and a small flash drive.
  • Ebooks using OPDS metadata format.
  • SQLite database
  • API module usually available in language of choice e.g. Python
  • Bottle – framework for web developing in Python
  • Mako Templating – templating in Python
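
OPDS is an Atom-based catalogue format, so serving the collection amounts to generating feed entries. A minimal sketch with the standard library follows; the titles, paths, and feed name are made up:

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

def opds_feed(books):
    """Build a minimal OPDS acquisition feed (Atom) for a list of books."""
    ET.register_namespace("", ATOM)
    feed = ET.Element(f"{{{ATOM}}}feed")
    ET.SubElement(feed, f"{{{ATOM}}}title").text = "BiblioBox Catalogue"
    for book in books:
        entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
        ET.SubElement(entry, f"{{{ATOM}}}title").text = book["title"]
        author = ET.SubElement(entry, f"{{{ATOM}}}author")
        ET.SubElement(author, f"{{{ATOM}}}name").text = book["author"]
        # OPDS acquisition link: where the reading app downloads the epub
        ET.SubElement(entry, f"{{{ATOM}}}link", {
            "rel": "http://opds-spec.org/acquisition",
            "href": book["href"],
            "type": "application/epub+zip",
        })
    return ET.tostring(feed, encoding="unicode")

xml = opds_feed([{"title": "Moby-Dick", "author": "Herman Melville",
                  "href": "/books/moby-dick.epub"}])
```

A full feed would also carry ids, update timestamps, and navigation links per the OPDS spec, but this is the core shape an ebook app consumes.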

Adding books is much more complex than serving them; for example, maintaining the author authority file. He wants to automate extracting metadata from ePub files, but there is no good module for reading ePub files in Python.
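
Since an ePub is just a zip file with Dublin Core metadata in its OPF package document, a rough extractor needs only the standard library. This is a sketch, not the project's actual code; a tiny in-memory ePub stands in for a real file:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def epub_metadata(epub_file):
    """Read title and creator from an ePub's OPF package document."""
    with zipfile.ZipFile(epub_file) as zf:
        # META-INF/container.xml points at the OPF file holding the metadata
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).get("full-path")
        opf = ET.fromstring(zf.read(opf_path))
        return {
            "title": opf.find(".//dc:title", NS).text,
            "creator": opf.find(".//dc:creator", NS).text,
        }

# Build a minimal in-memory ePub to exercise the parser
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("META-INF/container.xml",
        '<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        '<rootfiles><rootfile full-path="content.opf"/></rootfiles></container>')
    zf.writestr("content.opf",
        '<package xmlns:dc="http://purl.org/dc/elements/1.1/">'
        '<metadata><dc:title>Moby-Dick</dc:title>'
        '<dc:creator>Herman Melville</dc:creator></metadata></package>')

meta = epub_metadata(buf)
```

Real-world ePubs add multiple creators, encodings, and malformed packages, which is presumably where the pain he describes comes in.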

User View

Add catalogue to ebook app. It then looks like a store, where you can browse by title or author.

Available on GitHub.

Question Answering, Serendipity, and the Research Process of Scholars in the Humanities

by Kim Martin

Serendipity occurs when there is a prepared mind that notices a piece that helps them solve a problem. It allows discovery and thinking outside of the box.

Chance is recognized as an important part of the historical research process.

A shelf browser of some sort in the catalogue can be useful, but what we really need is a system that allows personalization and in-depth searching. Researchers typically just don’t leave their offices; they use search engines instead.

Visualizations, such as tag clouds, could allow more serendipitous browsing.

More notes on the Access 2012 live blog.

Access 2012 Day 1: Ignite Talk – Social Feed Manager

To collect social media data (especially Twitter), researchers are doing this manually (possibly by proxy).


Some paid options to collect the data:

  • DataSift
  • Gnip
  • Topsy

Friendly, but not cheap, and more than what we need. Still need tools to collect, process, etc.

What researchers ask for:

  • specific users, keywords
  • historic time periods
  • basic values: user, date, text, counts
  • delimited files to import

We can do this free with APIs.
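
Flattening API results to the basic values researchers ask for is a few lines of work. The field names below follow the Twitter REST API v1.1 JSON of the day, but treat the exact structure as an assumption:

```python
import csv
import io

def tweets_to_csv(tweets):
    """Flatten tweet dicts to the basic values: user, date, text, counts."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["user", "date", "text", "retweet_count"])
    for t in tweets:
        writer.writerow([t["user"]["screen_name"], t["created_at"],
                         t["text"], t["retweet_count"]])
    return out.getvalue()

# Sample record shaped like a v1.1 API response (values invented)
sample = [{"user": {"screen_name": "dchud"},
           "created_at": "Sat Oct 20 15:00:00 +0000 2012",
           "text": "Ignite talk: Social Feed Manager",
           "retweet_count": 3}]
csv_text = tweets_to_csv(sample)
```

The delimited output imports straight into the spreadsheets researchers actually work in.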

Built Social Feed Manager with these features:

  • Users by Item Count with temporal graphs
  • Details on user
  • can export to CSV files
  • hashtag queries by 10 minutes
  • search function with 1000

Free on GitHub:

  • python/django
  • user timelines, filter, sample, search
  • simple display with export for user timelines

Leaves out:

  • historical tweets
  • tweets beyond the last 3,200 per user

By @dchud


Code4lib Day 2 Morning: Notes & Takeaways

I didn’t take full notes on all the presentations. I like to just sit back and listen to some of the presentations, especially if there are a lot of visuals, but I do have a few notes.

Full Notes for the following sessions:

Building Research Applications with Mendeley

by William Gunn, Mendeley

  • Number of tweets a PLoS article gets is a better predictor of number of citations than impact factor.
  • Mendeley makes science more collaborative and transparent. Great to organize papers and then extract and aggregate research data in the cloud.
  • Can use impact factor as a relevance ranking tool.
  • Linked Data right now by citation, but now have tag co-occurrences, etc.
  • Link to slides.

NoSQL Bibliographic Records: Implementing a Native FRBR Datastore with Redis

No notes. Instead, have the link to the presentation complete with what looks like speaker notes.

Ask Anything!

  • Things not taught in library school: all the important things, social skills, go talk to the professor directly if you want to get into CS classes.
  • Memento project and UK Archives inserting content for their 404s.
  • In response to librarians lamenting loss of physical books, talk to faculty in digital humanities to present data mining etc., look at ‘train based’ circulations, look at ebook stats.
  • Take a look at libcatcode.org for library cataloguers learning to code, as well as Code Year hosted by Codecademy.

Code4lib Pre-Conference: Microsoft Research (MSR)

Future Technology

So the first half of the tour was the confidential, non-disclosure part, where the group I was part of basically got information on Microsoft’s research trends and some of their results. We then got to play with some of the prototypes they have been working on: technology they see coming to market in 5-10 years. To get a general sense of what might have been included, take a look at the Future Productivity Vision video they released recently:

http://www.youtube.com/watch?v=a6cNdhOKwi0

Microsoft Research (MSR) at Building 99

The research division focuses on core computer science research into fundamental aspects of computing. The products of their research include papers, patents, and prototypes. They supplement staff and resources through scholarly research partnerships with academia. The focus is mostly on applied projects.

ChronoZoom

  • to be released in March
  • working with Berkeley and a couple of other universities
  • prototype to help in research and teaching cross-discipline
  • no details beyond that as we were told to keep this one under wraps, but check out the link for more information

F#

  • practical, functional-first programming language that allows you to write simple code to solve complex problems
  • in the .NET family, fully supported by Microsoft Visual Studio
  • multi-paradigm: can use different models, e.g. object-oriented
  • interoperable: doesn’t work in isolation, can use all of .NET framework

Simplicity: Functional Data

  • simple code, strongly typed
  • Example 1: let swap (x, y) = (y, x)  vs. (in C#) Tuple<U,T> Swap<T,U>(Tuple<T,U> t) { return new Tuple<U,T>(t.Item2, t.Item1); }
  • Example 2: let reduce f (x, y, z) = f x + f y + f z  vs. (in C#) int Reduce<T>(Func<T,int> f, Tuple<T,T,T> t) { return f(t.Item1) + f(t.Item2) + f(t.Item3); }

Simplicity: Functions as Values

  • can define function inline
  • can define own units of measure, and enforce conversions

Example:

  • type Command = Command of (Rover -> unit)
  • let BrakeCommand = Command(fun rover -> rover.Accelerate(-1.0))
  • let TurnLeftCommand = Command(fun rover -> rover.Rotate(-90.0<degs>))

Some Other Features

  • built-in run parallel and asynchronous
  • can use traditionally, compile and run OR interactively, execute on the fly
  • x |> f – apply f to x
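
The pipe operator `x |> f` just means “apply f to x”, which lets calls read left to right. A rough Python analogue for comparison (the `pipe` helper is made up, not standard):

```python
from functools import reduce

def pipe(value, *functions):
    """Thread a value through functions left to right, like F#'s |> operator."""
    return reduce(lambda acc, f: f(acc), functions, value)

# x |> sorted |> last  in F# style becomes:
result = pipe([3, 1, 2], sorted, lambda xs: xs[-1])
```

The win is readability: the data flows in the same direction you read the code.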

There was more, but I honestly couldn’t copy that quickly and didn’t understand every detail. If you’re interested, you can try F# in the browser, which includes an interactive tutorial, or download it from the tools and resources page. To learn more about what people are doing with it, take a look at F# Snippets.

F# 3.0

While 2.0 excels at analytical programming, solving computationally complex problems, 3.0 is an accelerator for data-complex problems by bringing information to your fingertips.

Basically, you can load a database (through URI) and while you program, you can see a full list of all the data elements that are available.

For example, after defining a type by loading the Netflix database, typing “netflix.” brings up a list of the fields (e.g. Movies) available in the database.
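
Type providers surface the live schema while you type; the nearest standard-library analogy is introspecting a schema at run time, e.g. with sqlite3. The table and columns here are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, year INTEGER, rating REAL)")

# What an F# type provider surfaces at edit time, fetched here at run time:
columns = [row[1] for row in conn.execute("PRAGMA table_info(movies)")]
```

The F# 3.0 difference is that this discovery happens in the editor, with compile-time checking, rather than in running code.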

Layerscape

  • geoscience tool
  • can download and run for free
  • have the ability to bring a lot of time-sensitive data and use GPU to create visualization
  • talks to WorldWide Telescope (WWT) through an API
  • also has a custom ribbon plugin for Excel so non-programmers can view data in WWT
  • can also create custom tours including text and audio, which then exports into videos. Note: The data is included in the tour so that people can see the data – check out the Seismicity Samoa and Tohoku example video we saw (requires Silverlight)

Microsoft Audio Visual Indexing Service (MAVIS)

  • keyword search in audio/video files with speech
    • speech recognition technologies used to ‘crack’ audio files
  • Microsoft Research technology: word-level lattice indexing
    • 30-60% accuracy improvement over indexing automatic transcripts – right now, 80% of content, 85%+ accuracy
    • can provide closed caption which can also be edited later
    • index word alternatives – robust to recognizer errors
    • index timing – navigate to exact point in video and provides timeline of where the phrase is spoken
    • tune-able – queries from ‘give me something’ to ‘dig deeper to find it’
  • compute-intensive speech recognition done in Azure
  • no need to invest in H/W infrastructure
  • front end user search integrated with SQL server
    • search infrastructure is the same as full text indexing in SQL
  • SOAP based API
    • allows integration of media search results in other applications e.g. text search
  • need at least 500 hours of transcribed data in order to train the program for other languages
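
Indexing word alternatives with timings, as described above, can be sketched as an inverted index from word to (time, confidence) hits. The structure and numbers below are illustrative only, not MAVIS internals:

```python
from collections import defaultdict

def build_lattice_index(lattice):
    """Index every recognizer alternative so queries survive recognition errors."""
    index = defaultdict(list)
    for time, alternatives in lattice:
        for word, confidence in alternatives:
            index[word].append((time, confidence))
    return index

# Each audio segment keeps alternative hypotheses, not just the top guess
lattice = [
    (12.4, [("recognize", 0.81), ("wreck a nice", 0.12)]),
    (13.1, [("speech", 0.90), ("beach", 0.07)]),
]
index = build_lattice_index(lattice)
hits = index["speech"]  # timings let you jump to the exact point in the audio
```

Keeping the low-confidence alternatives is what makes the index robust to recognizer errors, and thresholding on confidence gives the ‘give me something’ vs. ‘dig deeper’ tuning.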

MAVIS Architecture

Great for libraries and archives to pull content from digitized audio and video in formats that are becoming obsolete or degrading.

Microsoft Academic Search

  • free academic search engine
  • structures unstructured data
  • 38+ publications including non-public data
  • can search or browse by domain to see top authors, publications, journals, keywords, organizations
  • for recognized terms e.g. Bone Marrow can see term occurrence, definition context from full text indexes, top authors, conferences, journals, etc.
  • can search for person and see their publications, but then with disambiguation, and then a profile with list of publications, citations, visualization of coauthors, citers
  • can see organization profiles and how they compare to others including Venn diagram of publication keywords
  • can pull most of the visualizations and embed into a website
  • RSS feed for each element
  • full API also available and get results in JSON or XML via SOAP
  • site interface allows crowd sourcing to edit information e.g. if disambiguation of publications is wrong (though right now, only with Live account, working on OpenID)

This strikes me as Google Scholar but with more functions, visualizations, and linked data. Right now, not a lot has been indexed, but I can see this as a much better version of Google Scholar.

Being Green > Swag You’ll Probably Throw Away

Finally, at the end of the night, one of the staff presented on why he’s anti-swag, so instead of giving MS swag away, we got the opportunity to take home an epiphyte complete with care package. Unfortunately, I can’t take it home across the border so I found someone to adopt it.

Epiphyte complete with care package