Code4Lib 2014: Day 3 Morning Presentations
Presentations for Day 3 of Code4Lib 2014.
Tag: search
Code4Lib 2014: Day 3 Lightning Talks
Lightning talks on Day 3 of Code4Lib.
Code4Lib 2014: Day 2 Lightning Talks
Lightning Talks on Day 2 of Code4Lib 2014.
Code4Lib 2014 Day 1: Lightning Talks
ResCarta Foundation
Since last year,
* automatic audio transcription
* but can edit audio afterwards
* internationalization
* facet filtering
* software.rescarta.org
How many times have you been to code4lib?
Michael Giarlo
* many first timers and returnees
SLiMS
- using open source software
- motivated to initiate knowledge sharing
- features integration
- many libraries can’t afford ILS
- spreading, active community across Indonesia
- community of volunteers
- truly free OSS for library automation
- bottom up approach
Harvard Library Lab
Bobbi Fox
* got grant funded
* developed the Harvard Library Lab
* lots of projects that librarians actually want e.g. inscriptio (carrel reservations), class request tool, link-o-matic, course reserves unleashed, PDS mobile web API, do we own this? (check if the library has this based on ISBN)
GeoHydra: GIS in the Digital Library
- full service repository for GIS data
- treat as first class objects
- metadata / wrangling -> curation -> delivery -> discovery
- “Wrangle the data until it submits”
- Blacklight Maps where you can plot the data
- have different workflow and deal with different schema
- I’m sorry but I did not understand most of that…
Logs are Your Friend
- logs should be up to date
- should look at log when
- logs logging similar things should be similar
- consider that you might analyze your logs e.g. how big was the file it was doing a checksum on?
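To make the last two points concrete, here is a minimal sketch of my own (not something shown in the talk). It assumes a hypothetical fixity job that logs `key=value` pairs, so that every job logging a checksum produces the same shape of line and the file size is there when you later want to analyze it. The logger name, field names and log format are all made up for illustration.

```python
import hashlib
import logging
import sys

# Hypothetical shared format: every job uses the same fields in the same order.
LOG_FORMAT = "%(asctime)s %(levelname)s %(name)s %(message)s"


def checksum_with_logging(data: bytes, name: str, logger: logging.Logger) -> str:
    """Compute a SHA-256 checksum and log the details needed for later analysis."""
    digest = hashlib.sha256(data).hexdigest()
    # key=value pairs keep the line easy to parse; size_bytes answers
    # "how big was the file it was doing a checksum on?"
    logger.info("action=checksum file=%s size_bytes=%d sha256=%s", name, len(data), digest)
    return digest


if __name__ == "__main__":
    logging.basicConfig(stream=sys.stdout, level=logging.INFO, format=LOG_FORMAT)
    checksum_with_logging(b"example payload", "example.bin", logging.getLogger("fixity"))
```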
Solr Browse & Sort
- faceted heading browser with cross references in Solr
- sort on multi-valued fields
- did some demo/showing
End of the Day
Code4lib Day 1: Lightning Talks
Cynthia Ng – RULA Bookfinder
- Link to the full write-up
Julien Gibert – Turning a Solr Response into a RDF file
- Theses.fr
- Sorry, this went by me, plus I was busy running back to my seat
Bill Dueber – Datamart Report Generator at UMich
- actually talking about spreadsheets
- want to support data-driven decision making, but it’s boring, and canned reports tend not to do it
- can end up in substring hell
- solution: build data warehouse
- took Aleph Oracle COBOL store, removed insanity and put it in another Oracle database
- funds and inventory reports now possible
- running 20-25 reports a week
- more than when we ran it by hand, and saves lots of time
Jonathan Rochkind – bento_search
- Ruby on Rails gem
- external search services e.g. Google books
- federated e.g. primo, eds, ebscohost, scopus, worldcat, google books
- can use whatever you want, just need to add it
- can customize to have link resolver
- github.com/jrochkind/sample_megasearch/
- much more functionality
Masao Takaku – saveMLAK project for two years
- came out of the effort to save museum, library, archive, kominkan (community centre) after the big earthquake
- gather information of facilities in damaged area using a wiki
- coordinate activities to rebuild
- efforts are still continuing
Jon Stroop – Loris Image Server
- define syntax for image access
- can specify width/height, part of image, quality
- Talk link
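As far as I know, Loris implements the IIIF Image API, where the region, size, rotation and quality are all path segments of the image URL. The sketch below only shows the general shape of that syntax as I understand it; the server URL and identifier are hypothetical, and the accepted quality keywords depend on the API version, so check the spec before relying on any of the defaults.

```python
from urllib.parse import quote


def image_url(base, identifier, region="full", size="full",
              rotation="0", quality="native", fmt="jpg"):
    """Build a URL of the form base/identifier/region/size/rotation/quality.format."""
    # The identifier is percent-encoded so that slashes or spaces in it
    # cannot be mistaken for path separators.
    encoded = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{encoded}/{region}/{size}/{rotation}/{quality}.{fmt}"


if __name__ == "__main__":
    # Hypothetical server and identifier: a 500x500 pixel region from the
    # top-left corner, scaled to 200 pixels wide.
    print(image_url("https://images.example.org/loris", "page 1.jp2",
                    region="0,0,500,500", size="200,"))
```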
Ross Singer – How are you managing copyright?
- lazy attempt at crowd-sourced business development
- copyright is complicated
- there are standard licenses, but then there are a lot of exclusions and exceptions
- still, roughly the same model
- management already being done in some capacity by the universities
- but in US/Canada there is fair dealing and fair use
- Slides
Eric Nord – Candybars for Bugs
- Harold B. Lee Library
- worked on maps in library
- pop up map
- will give candy bar if found error
- only had to give away 18
- have a ‘report a problem’ with this item
- builds the idea to power the patron
Megan O’Neill Kudzia – Games for Pedagogy in the Library
- working with faculty
- a lot of interest, but no opportunity to talk about it
- purchasing games on an ask basis
- working out how to make accessible, in catalogue
- licensing issues for PC/console games
Geoffrey Boushey – GEDI Reference App for InterLibrary Loan
- General Electronic Document Interchange (ISO Standard)
- used by Ariel
- headers added to a file when sent from one institution to another
- basis for making an easy to use tool so different ILL systems can communicate with each other
- on Github
George Campbell – three.js: 3D Objects in the browser
- used to have to use flash or flip through images
- can now use interactive 3D graphics
- can scale, add text/images, move
John Sarnowski – Audio Archiving with Full Text Search
- ResCarta Toolkit
- display and play audio
- add metadata
- use conversion tool
- embeds into XML portion
- final file can then be searched
- words can be highlighted just like a text file
That’s the end of Day 1! Join us tomorrow. Time for a nap.
Code4Lib Day 1: Afternoon Notes
Practical Relevance Ranking for 10 million books
- Tom Burton-West, University of Michigan Library
Search Challenges
- multilingual, 400+ languages
- OCR quality varies
- very long documents
- books are different from what they normally have
Relevance Ranking
- how to score, weigh
- default algorithm ranks very short documents very high
- needed to tune/customize parameters
- average document size is ~30 times larger
- did prelim testing with Solr4 and didn’t see the same problem, but need more testing
- dirty OCR complicates things, as well as language
- occurrence of words in specific chapters vs. whole book – should we index parts of books?
- similar issue with other objects e.g. bound journals, dictionaries & encyclopedias
- difficulty too is inconsistent metadata, breakdowns of articles/chapters/etc. will be inconsistent
- creating a testing plan and adding click logs
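To see why document length matters so much here, I put together a small sketch. It uses the term-frequency part of BM25, which is my choice for illustration (my notes do not record which similarity or which parameters were actually tuned). The `b` parameter controls how strongly a document is penalized for being longer than average; the document lengths and term counts are made up.

```python
def bm25_tf(tf: float, doc_len: float, avg_doc_len: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 term-frequency component; b sets the strength of length normalization."""
    # norm < 1 for shorter-than-average documents, > 1 for longer ones.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * norm)


if __name__ == "__main__":
    avg_len = 30_000  # made-up average length, in words
    for b in (0.75, 0.0):
        short_doc = bm25_tf(tf=1, doc_len=100, avg_doc_len=avg_len, b=b)
        long_book = bm25_tf(tf=2, doc_len=100_000, avg_doc_len=avg_len, b=b)
        print(f"b={b}: short={short_doc:.3f} long={long_book:.3f}")
```

With full-strength normalization the very short document wins even though the book mentions the term twice; with `b=0` length is ignored and the book wins. Somewhere in between is what the tuning is about.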
n Characters in Search of an Author
- Jay Luker, IT Specialist, Smithsonian Astrophysics Data System
- Slides
Goal of a search is to match user input to metadata. e.g. author names
Building the next generation of the ADS 2.0. Trying to increase recall without sacrificing precision.
Requirements
- match UTF-8 e.g. matching ASCII version to versions with diacritics/markings
- match more or less information e.g. first name initial but without triggering substring matching
- need to work with hand curated synonyms e.g. pseudonyms, maiden/married name
Solving the Problem
- normalization – strip out punctuation, rearrange name parts – based on whether a comma is entered
- generate name part variations to whatever can be realistically expected
- transliteration – use index introspection for list of synonyms
- expand user queries at each step:
- user searches
- normalize
- name part vars
- transliteration
- name parts vars of transliterated entries
- curated synonyms
- transliteration of anything added
- name part variations to catch everything
- assembled into large boolean query
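The expansion steps above can be sketched roughly as follows. This is my own simplified reconstruction, not the ADS code: it folds diacritics to ASCII instead of looking up transliterations in the index, it compresses the repeated passes into one loop, and the `author` field name and the `synonyms` dictionary are hypothetical.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Strip combining marks, e.g. 'Müller' -> 'Muller' (a stand-in for transliteration)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(name: str) -> str:
    """Return 'last, given names' in lower case with stray punctuation removed."""
    cleaned = "".join(ch for ch in name if ch.isalnum() or ch in ", -").strip()
    if "," in cleaned:
        last, _, rest = cleaned.partition(",")
    else:
        # No comma entered: assume 'given names ... last' order.
        parts = cleaned.split()
        if not parts:
            return ""
        last, rest = parts[-1], " ".join(parts[:-1])
    return f"{last.strip()}, {' '.join(rest.split())}".lower().rstrip(", ")


def name_part_variations(normalized: str) -> set:
    """Variants with given names dropped or reduced to initials."""
    last, _, given = normalized.partition(", ")
    variants = {normalized}
    parts = given.split()
    if parts:
        variants.add(f"{last}, {parts[0]}")                        # first given name only
        variants.add(f"{last}, {parts[0][0]}")                     # first initial only
        variants.add(f"{last}, {' '.join(p[0] for p in parts)}")   # all initials
    return variants


def expand_query(user_input: str, synonyms: dict) -> str:
    """Expand one author name into a large boolean OR query."""
    names = {normalize(user_input)}
    names |= {fold_to_ascii(n) for n in names}
    for n in list(names):
        names |= set(synonyms.get(n, ()))  # hand-curated synonyms
    variants = set()
    for n in names:
        variants |= name_part_variations(n)
        variants |= name_part_variations(fold_to_ascii(n))
    return " OR ".join(f'author:"{v}"' for v in sorted(variants))


if __name__ == "__main__":
    print(expand_query("Müller, Hans", {}))
```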
Implementation
- Python/JavaScript prototype
- actual – Solr/Lucene
Evolving Towards a Consortium MARCR BIBFRAME Redis Datastore
- Jeremy Nelson, Colorado College, jeremy.nelson@coloradocollege.edu
- Sheila Yeh, University of Denver
I think this presentation speaks for itself.
Journal Article: Building a Library App Portfolio with Redis and Django
Hybrid Archival Collections Using Blacklight and Hydra
- Adam Wead, Rock and Roll Hall of Fame and Museum
- Presentation
The centre of everything is the Solr index. Blacklight puts everything into Solr. Library materials are easy enough, but archival collections use EAD with many items (not just one item as is typical of MARC).
Extended Blacklight to search EAD
- index collections and single items from a collection
- search results include books, entire collections, and items from collections
Digital Content
- kept in Fedora – objects described using Rubys
- use Hydra to manage the content in Fedora – manages RDF relationships
- indexes into Solr
- Need to relate Fedora content to its archival collection
- content originates from sources in collection, and part of series
- collection metadata already exists in Solr
- create RDF representations of collections
- Hydra queries Solr for collection metadata
- creates objects for series, subseries, items
Issues
- terrible Solr performance for series, 500+ items
- no EAD “round tripping” – EAD can go into Solr, but not back out
- currently 60% complete
Citation search in SOLR and second-order operators
- Roman Chyla, Astrophysics Data System
Sorry, I don’t have notes for this. My brain is a bit fried by this point. Will post link when I get it.
Break Time
Breakout Sessions – reports will be available on the wiki
Next Up – lightning talks

WordPress Development: Lessons Learned & Downsides
After 8 months, I have finally finished with WordPress development. I definitely learnt a lot, especially in terms of how the back end works and some more PHP.
Lessons Learned
The most important one:
know more PHP than I did.
Admittedly, I knew very little. While I have some experience programming, I only took a 2 day course in PHP. Not having to look up every little thing would have saved me invaluable time.
The other big one was definitely:
know more WordPress.
The documentation is obviously written for programmers (in most cases, those familiar with WordPress). So once again, I spent a lot of time looking things up. In this case, it was even more difficult because I usually had to rely on a couple of different tutorials and piece things together, making things work through trial and error.
Of course, I didn’t have much choice. And if there is one really good way to learn something, it is to be thrown into it and make it happen.
Plugins
WordPress could really use some improvements though. One area is definitely in the plugins area. There is little to no cooperation between plugin authors, so there may be anywhere from zero to fifty plugins that do similar things, but all work differently and are of varying quality.
One of the reasons I’ve been posting a lot of plugin reviews is not only for my own records, but also in the hope that it’ll save other people time looking through the mass of plugins. Unfortunately, because plugins come and go like the wind, plugin reviews become out of date very quickly.
Search
The one other thing I wish WordPress would improve is their search. While the site search uses Google, the plugin search is pretty bad and so is the internal built-in WordPress search. For the plugin search, you cannot refine your search in any way, and the sorting doesn’t seem to work properly.
The built-in WordPress site search (and dashboard pages/posts search) is also pretty bad. It’s organized by date and there is a plugin that allows you to sort by title, but it does full text searching and does no relevance ranking whatsoever. If it even did the minimum of “do these words match words in the title, if yes, put those higher” then that alone would be a huge improvement.
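To show how small that minimum is, here is a rough sketch of the idea. It is written in Python rather than PHP and is not WordPress code; the post structure and the weight of 10 for a title match are arbitrary assumptions.

```python
def rank_posts(query: str, posts: list) -> list:
    """Order posts so that title matches outrank body-only matches."""
    terms = query.lower().split()
    scored = []
    for post in posts:
        title_words = post["title"].lower().split()
        body_words = post["body"].lower().split()
        title_hits = sum(1 for t in terms if t in title_words)
        body_hits = sum(1 for t in terms if t in body_words)
        score = 10 * title_hits + body_hits  # arbitrary weight favouring titles
        if score:
            scored.append((score, post))
    # Python's sort is stable, so posts with equal scores keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [post for _, post in scored]


if __name__ == "__main__":
    posts = [
        {"title": "Holiday hours", "body": "search tips inside"},
        {"title": "Search tips", "body": "how to use the catalogue"},
    ]
    for post in rank_posts("search tips", posts):
        print(post["title"])
```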
Conclusion
While I think WordPress is a great platform (and it’s open source!), there is definitely room for improvement and it may not be the right platform for everyone. In comparison, for example, I get the impression that Drupal has a more cooperative and supportive community with better plugin support and development. On the other hand, I find WordPress easier to teach users.
If I had to do it again, I would definitely have taken the time to learn more about the overall WordPress framework and how different parts fit into the puzzle before diving into making the theme.
Stop Living in a Bubble: Privacy & Tracking of Google and Others
With Google’s new policy in effect, there is currently no shortage of news articles and blog posts about how to protect your information from Google. I think it’s great that people are becoming more aware of the effects of how one big company can track you, but this has been going on for many years, just never in one nice neat package as Google is talking about now. [Too long? Skip to the Summary at the bottom]
It’s Not Just Google
While zdnet.com and many others focus specifically on Google, it was just recently in the news that Target figured out a teenage girl was pregnant before her father did, and the NYTimes did a piece on how it’s not just Target, but any and every corporation you shop with. Mind you, if you shop at various stores for various people, it might be harder for one single company to track you, but online is a whole other world.
Living in a Bubble
Online is different, because you can be tracked from one website to another. Particularly when you’re signed in, every search you do is put into your history. Even when you’re not signed in, you’ll be tracked by IP address (but on the up side, rarely does anyone have a truly static IP at home or at work). Your results will be skewed based on personalized data: not just the ads, but the search results as well. dontbubble.us provides a nicely illustrated explanation of how it works and why it’s important.
Big Brother (and Everyone Else) is Tracking You
Online is also different because it’s not just Google tracking you. Trackers are built into sites and follow you around the web to build a profile of your behaviour (and very few sites do not have this). Check out donttrack.us for another illustrated explanation, but if you really want to see how insidious behavioural trackers are, take a look at Collusion, which will give you a demo of a short journey on the web from IMDB to news sites.
What to Do
So how do we protect ourselves from all of this? Live in a cave. No really, practically speaking, there is no way to prevent being tracked and having personal information stored some way or another. It’s no secret that every app and every site that has access will keep information on you and many will sell it to advertisers.
Nevertheless, while it’s virtually impossible to prevent tracking altogether, you can prevent advertisers from building a profile about you to a larger or lesser degree.
Opt Out of Google History
Just about everyone has covered this, and zdnet.com provides a nice summary with lots of links, but here are some direct links:
- Remove Google Search History
- Remove YouTube Search/Viewed History
- Remove Google Latitude Location History (just login, choose Disable, and save)
- Opt Out of Google’s Ads Personalization (just login, and hit Opt Out)
You could also of course, delete your Google account completely and not use any Google products. (Just saying.)
Browser Plugins
Plugins are nothing new as a way to help manage privacy and security in browsers. At the bottom of donttrack.us, there is a list of browser plugins you can consider. Some of these are only supported by one or two browsers, but similar plugins are available for other browsers. In particular, I use:
- Ghostery (remember to check all the trackers; by default it frequently doesn’t check any)
- Adblock Plus
- NoScript
For greater anonymity, add HTTPS Everywhere and Tor. Not on the donttrack list is: TrackerBlock for Firefox, and Internet Explorer.
Browser Settings & Options
Changing some of your security and privacy settings in your browser will also help. The farther down the list, the more extreme you get, but they’re there to consider.
- Change your default search engine
- I use duckduckgo, which doesn’t track or bubble and has a neat !bang syntax. The drop down next to the search icon also gives you options for searches it doesn’t have built-in like images and news. (Plus it has an awesome logo)
- Just set it once. If you’re in doubt, here’s the ‘search URL’ to enter: https://duckduckgo.com/?q=
- Do not allow sites to track physical location
- Disable Third-party Cookies
- Disable Cookies Altogether (optional: add exceptions for sites you visit frequently and want auto-login)
- Do not allow local data to be set
- Clear all data when closing the browser
- Browse privately – use InPrivate (IE), Private Browsing (Firefox, Safari), Incognito (Chrome), Private Tab (Opera) – and set it as the default (if possible)
Opera actually has a great guide to security and privacy covering a lot of Opera settings on one handy page.
Change Your Browsing Habits
Admittedly, I find it hard to do without Google products entirely, since I have a Gmail account (including googletalk) and use Google Reader (if someone has suggestions on an alternative that is just as good, I’d love to hear it). Nevertheless, at work, I will log into Google with one browser while using a different browser for everything else. At home, googletalk pops up email in my default browser, so I make sure to log out when I’m done.
On the more extreme side of things, you can set up your workflow such that nothing is stored locally; check out a blog post on microcosm about browsing privately.
Non-Techsavvy Friendly
While some of these options are great for those who are tech-savvy enough, many of these options will create barriers for those who would prefer some protection but with the same experience as before. In those cases, I recommend:
- All the Google History stuff
- Adblock/plus
- Ghostery + making sure common sites are not blocked e.g. facebook, twitter
- Changing the default engine in the address and search bar
- Do not allow tracking of physical location
- Disable third-party cookies + making sure common sites are added to exceptions e.g. bookmarklets
Of course, it’s all about the individual. If they can handle NoScript (which is fairly easy to use once you’re taught), that’s great. The problem is always if the user encounters an error or some functionality that isn’t working properly because it’s being blocked. It’s great if they’re willing to call you and you can talk them through it on the phone, but otherwise, we all know how frustrating it can be for something to not work like we think it should.
Summary
Some key takeaways if you thought that was a bit long to read through.
- Remove and opt out of all Google history and personalization
- Install some easy to use plugins, and adjust your browser settings
- Most of all: Use duckduckgo.com for your default search engine
Code4lib Day 2: How People Search the Library from a Single Search Box
by Cory Lown, North Carolina State University
While there is only one search box, typically there are multiple tabs, which is especially true of academic libraries.
- 73% of searches from the home page start from the default tab
- which was actually opposite of usability tests
Home grown federated search includes:
- catalog
- articles
- journals
- databases
- best bets (60 hand crafted links based on most frequent queries e.g. Web of Science)
- spelling suggestions
- loaded links
- FAQs
- smart subjects
Show top 3-4 results with link to full interface.
Search Stats
From Fall 2010 and Spring 2011, ~739k searches and 655k click-throughs
By section:
- 7.8% best bets (sounds very little, but actually a lot for 60 links)
- 41.5% articles, 35.2% books and media, 5.5% journals, ~10% everything else
- 23% looking for other things, e.g. library website
- for articles: 70% first 3 results, other 30% see all results
- catalogue use is fairly stable over time, but article searching peaks at the end of term
How do you make use of these results?
Top search terms are fairly stable over time. You can make the top queries work well for people (~37k) by using the best bets.
Single/default search signals that our search tools will just work.
It’s important to consider what the default search box doesn’t do, and doubly important to rescue people when they hit that point.
Dynamic results drive traffic. When a few actual results were shown, use of the catalogue for books went up a lot compared to only suggesting that people use the catalogue.
Collecting Data
Custom log is being used right now by tracking searches (timestamp, action, query, referrer URL) and tracking click-throughs. An alternative might be to use Google Analytics.
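A custom log like that does not need to be complicated. Here is a minimal sketch of my own (not NCSU’s code), assuming a CSV log with the four fields mentioned and two hypothetical action values, `search` and `click`. The referrer URLs are made up.

```python
import csv
import io
from datetime import datetime, timezone

FIELDS = ["timestamp", "action", "query", "referrer"]


def log_event(writer, action: str, query: str, referrer: str) -> None:
    """Append one search or click-through event to the log."""
    writer.writerow({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "query": query,
        "referrer": referrer,
    })


def click_through_rate(rows) -> float:
    """Click-throughs per search across the logged rows."""
    searches = sum(1 for r in rows if r["action"] == "search")
    clicks = sum(1 for r in rows if r["action"] == "click")
    return clicks / searches if searches else 0.0


if __name__ == "__main__":
    buffer = io.StringIO()  # stands in for a log file on disk
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    log_event(writer, "search", "web of science", "https://library.example.edu/")
    log_event(writer, "click", "web of science", "https://library.example.edu/search")
    buffer.seek(0)
    print(click_through_rate(list(csv.DictReader(buffer))))
```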
For more, see the slides below or read the C&RL Article Preprint.
Code4lib Day 1: Lightning Talks Notes
Al Cornish – XTF in 300 seconds (Slides in PDF)
- technology developed and maintained by California Digital Library
- supports the search/display of digital collections (images, PDFs, etc)
- fully open source platform, based on Apache Lucene search toolkit
- Java framework, runs in Tomcat or Jetty servlet engine
- extensive customization possible through XSLT programming
- user and developer group communication through Google Groups
- search interface running on Solr with facets
- can output in RSS
- has a debug mode
Makoto Okamoto – saveMLAK (English)
- Aid activities for the Great East Japan Earthquake through collaboration via wiki
- input from museum, library, archive, kominkan = MLAK
- 20,000 data of damaged area
- Information about places, damages, and relief support
- Key Lessons
- build synergy with twitter
- have offline meet ups & training
Andrew Nagy – Vendors Suck
- vendors aren’t really that bad
- used to think vendors suck, and that they don’t know how to solve libraries’ problems
- but working for a vendor allows you to make a greater impact on higher education, more so than from one university (he started to work for Serials Solutions)
- libraries’ problems aren’t really that unique
- together with the vendor, a difference can be made
- call your vendors and talk to the product managers
- if they blow you off, you’ve selected the wrong vendor
- sometimes vendor solutions can provide a better fit
Andreas Orphanides – Heat maps
The library needed grad students to teach instructional sessions, but how to set schedule when classes have a very inflexible schedule? So, he used the data of 2 semesters of instructional sessions using date and start time, but there were inconsistent start times and duration. The question is how best to visualize the data.
- heatmap package from clickheat
- time of day – x-dimension
- day of the week – y-dimension
- could see patterns in way that you can’t in histogram or bar graph
- heat map needn’t be spatial
- heat maps can compare histogram-like data along a single dimension or scatter-like plot data to look for high density areas
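The binning behind a heat map like this can be sketched in a few lines. This is my own illustration, not the code from the talk, and it does not use the clickheat package: it counts a session in every hour slot it overlaps (one way to cope with inconsistent start times and durations) and prints a crude text rendering. The hour range and sample sessions are made up.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def build_grid(sessions, first_hour=8, last_hour=20):
    """Count sessions per (weekday, hour) cell; each session is (start, duration_minutes)."""
    counts = Counter()
    for start, minutes in sessions:
        end_minute = start.hour * 60 + start.minute + minutes
        hour = start.hour
        # Mark every hour slot the session overlaps.
        while hour * 60 < end_minute:
            counts[(start.weekday(), hour)] += 1
            hour += 1
    # Rows are days of the week (y), columns are hours of the day (x).
    return [[counts[(d, h)] for h in range(first_hour, last_hour)] for d in range(7)]


def render(grid):
    """Render the grid as text, using darker characters for busier cells."""
    shades = " .:*#"
    peak = max(max(row) for row in grid) or 1
    lines = []
    for day, row in zip(DAYS, grid):
        cells = "".join(shades[round(c / peak * (len(shades) - 1))] for c in row)
        lines.append(f"{day} {cells}")
    return "\n".join(lines)


if __name__ == "__main__":
    sessions = [
        (datetime(2012, 2, 6, 9, 30), 90),
        (datetime(2012, 2, 6, 10, 0), 60),
        (datetime(2012, 2, 8, 14, 15), 50),
    ]
    print(render(build_grid(sessions)))
```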
Gabriel Farrell – ElasticSearch
- similar to Solr
- goes across servers
- e.g. Free103Point9
Nettie Lagace from NISO
- National Information Standards Organization (NISO)
- work internationally
- want to know: What environment or conditions are needed to identify and solve interoperability problems?
Eric Larson – Finding images in book page images
A lot of free books exist out there, but you don’t have the time to read them all. What if you just wanted to look at the images? Because a lot of books have great images.
He used curl to pull all those images out, then used ImageMagick to manage the images. The processing steps:
- Convert to greyscale
- Contrast boost x8
- Convert image to 1px by height
- Sharpen image
- Heavy-handed grayscaling
- Convert to text
- Look for long continuous line of black to pull pages with images
Code is on github
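The last step is the part I found clever, so here is a small sketch of it. This is my own reconstruction, not the code from the talk: it assumes the page has already been squashed to a single column of pixels and thresholded to black (1) or white (0), and the 20% cut-off is a made-up value.

```python
def longest_black_run(column):
    """Return the length of the longest run of consecutive black pixels (1 = black)."""
    longest = current = 0
    for pixel in column:
        current = current + 1 if pixel else 0
        longest = max(longest, current)
    return longest


def looks_like_image_page(column, min_fraction=0.2):
    """Flag a page when one black run covers at least min_fraction of its height."""
    return bool(column) and longest_black_run(column) >= min_fraction * len(column)


if __name__ == "__main__":
    # Lines of text alternate with white space; an illustration is one long dark block.
    text_page = [1, 1, 0, 0] * 25
    image_page = [0] * 10 + [1] * 60 + [0] * 30
    print(looks_like_image_page(text_page), looks_like_image_page(image_page))
```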
Adam Wead – Blacklight at the Rock Hall
- went live, soft launch about a month ago
- broken down to the item level
- find bugs he doesn’t know about for a beer!
Kelley McGrath – Finding Movies with FRBR & Facets
- users are looking for movies, either particular movie or genre/topic
- libraries describe publications e.g. date by DVD, not by movie
- users care about versions e.g. Blu-Ray, language
- Try the prototyped catalog
- Hit list provides one result per movie, can filter by different facets
Bohyun Kim – Web Usability in terms of words
- don’t over rely on the context
- but context is still necessary for understanding e.g. “mobile” – means on the go, what they want on the go
- sometimes there is no better term e.g. “Interlibrary Loan”
- brevity will cost you “tour” vs. “online tour”
- Time ran out, but check out the rest of the slides
Simon Spero – Restriction Classes, Bitches
OWL:
- lets you define properties
- control what the property can apply to
- control the values the property can take
- provides an easy way to do this
- provides a really confusing way to do this
The easy way is usually wrong!
When you define what a property can apply to (its domain) and its range, the definition applies to every use of the property. An alternative is Attempto.
Cynthia Ng – Processing & ProcessingJS
- Processing: open source visual programming language
- Processing.js: related project to make processing available through web browsers without plugins
- While both tend to focus on data visualizations, digital art, and (in the case of PJS) games, there are educational oriented applications.
- Examples:
- Kanji Compositing – allows visual breakdown of Japanese kanji characters, interact with parts, and see children.
- Primer on Bezier Curves – scroll down to see interactive (i.e. if you move points, replots on the fly) and animated graphs.
- Obvious use might be instructional materials, but how might we apply it in this context? What other applications might we think of in the information organization world?
Since doing the presentation, I have already gotten one response by Dan Chudnov who did a quick re-rendering of newspaper data from OCR data. Still thinking on (best) use in libraries and other information organizations.
It’s over for today, but if you’d like more, do remember that there is a livestream and you can follow on twitter, #c4l12 or IRC.