Mozilla Festival Day 1: Notes from Disassembling the world’s worst data wrapper: PDFs

It’s no secret that PDFs are a terrible way to distribute data, so here are some tips and tools to help extract data and information from PDFs.


For extracting data in tables. There is an online version, and a version is also available to download and run locally.

If you have any issues, try the other detection mode.

Data can be exported to CSV and some other formats. The PDF must already contain text-based characters; the tool won’t do OCR for you.

You can use the online version to select the area you want, export the resulting script, and then copy that script into the command line.

If you can, use the local version: it is much faster and has more options.


Command line application that dumps text from a PDF and attempts to preserve the layout (with the layout switch), but you generally need regular expressions to parse the information afterwards.
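To give a rough idea of what that regular expression step can look like, here is a minimal Python sketch. The sample text, the column names, and the `parse_rows` helper are all made up for illustration; real layout-preserved output will need its own pattern.

```python
import re

# Hypothetical sample of layout-preserved text dumped from a PDF.
SAMPLE = """\
Branch            Visits    Loans
Central           12,450    8,310
North Side         3,205    1,987
Total             15,655   10,297
"""

# A name (words separated by single spaces), then a gap of two or more
# spaces, then two numeric columns that may contain thousands separators.
ROW = re.compile(
    r"^(?P<name>\S+(?: \S+)*)\s{2,}(?P<visits>[\d,]+)\s{2,}(?P<loans>[\d,]+)\s*$"
)


def parse_rows(text):
    """Return one dict per line that looks like a data row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header, blank line, or anything that does not fit
        rows.append({
            "name": match.group("name"),
            "visits": int(match.group("visits").replace(",", "")),
            "loans": int(match.group("loans").replace(",", "")),
        })
    return rows
```

Calling `parse_rows(SAMPLE)` skips the header line because its second column is not numeric. Relying on runs of two or more spaces as the column separator is an assumption that only holds when the layout was preserved in the text dump.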


Another command line tool that will extract text from PDF.


Comes with a tool called pdfimages that will extract images from PDF files.
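If you want to run pdfimages from a script instead of typing it by hand, a small Python wrapper could look like the sketch below. It assumes `pdfimages` is installed and on your PATH; the helper names and the file paths in the usage note are hypothetical. The `-j` option asks pdfimages to write embedded JPEG images out as JPEG files.

```python
import subprocess


def build_pdfimages_command(pdf_path, output_prefix):
    """Build the argument list: pdfimages -j <PDF file> <image root>."""
    return ["pdfimages", "-j", str(pdf_path), str(output_prefix)]


def extract_images(pdf_path, output_prefix):
    """Run pdfimages and raise CalledProcessError if it exits with an error."""
    command = build_pdfimages_command(pdf_path, output_prefix)
    subprocess.run(command, check=True)
```

For example, `extract_images("report.pdf", "images/report")` should write the extracted files using `images/report` as the start of each file name, provided the `images` directory already exists.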


More notes on the session etherpad.

Author: Cynthia

Technologist, Librarian, Metadata and Technical Services expert, Educator, Mentor, Web Developer, UXer, Accessibility Advocate, Documentarian
