Mozilla Festival Day 1: Notes from Disassembling the world’s worst data wrapper: PDFs

It’s no secret that PDFs are a terrible way to distribute data, so here are some tips and tools for extracting data and information from them.

Tabula

For extracting data from tables. There is an online version at try.tabula.technology, and also a version you can download and run locally.

If a table does not come out cleanly, try the other detection mode.

Data can be exported to CSV and some other formats. The PDF must already contain text-based characters; Tabula won’t do OCR for you.

You can use the online version to select the area you want, export the selection as a script, and paste that script into the command line.

If you can, use the local version: it is much faster and has more options.
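To give a feel for what a command-line Tabula run looks like, here is a minimal Python sketch that builds and runs a tabula-java command for one selected area. It is an illustration and not the script Tabula itself exports: the jar location (tabula.jar), the function names and the sample coordinates are assumptions, and the flags (-a, -p, -f, -l, -t) are the ones documented for recent tabula-java releases, so older builds may differ.

```python
import subprocess

# Hypothetical location of the tabula-java jar; adjust to your download.
TABULA_JAR = "tabula.jar"


def tabula_command(pdf_path, area, page, lattice=False, jar=TABULA_JAR):
    """Build a tabula-java command line for one rectangular selection.

    area is (top, left, bottom, right) in PDF points, the order that
    tabula-java's -a/--area option expects.
    """
    top, left, bottom, right = area
    cmd = [
        "java", "-jar", jar,
        "-a", f"{top},{left},{bottom},{right}",
        "-p", str(page),
        "-f", "CSV",
    ]
    # -l (lattice) relies on ruling lines between cells;
    # -t (stream) relies on whitespace between columns.
    cmd.append("-l" if lattice else "-t")
    cmd.append(pdf_path)
    return cmd


def extract_table(pdf_path, area, page, lattice=False):
    """Run tabula-java and return the CSV text it writes to stdout."""
    cmd = tabula_command(pdf_path, area, page, lattice)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Calling extract_table("report.pdf", (10, 20, 300, 400), page=2) would return the CSV for that region, assuming Java and the jar are available. Switching lattice on or off is one way to try the other detection mode when the first result looks wrong.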

pdftotext

Command line application that dumps the text from a PDF and attempts to preserve the layout (with the -layout switch), but you will generally need regular expressions to parse the information.
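As a rough sketch of that workflow, the Python below calls pdftotext with -layout and then picks table rows out of the text with a regular expression. The row shape (a name, an integer count and a decimal amount separated by two or more spaces) is a made-up example, as are the function names; a real document will need its own pattern.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Return the text of a PDF, keeping the physical layout.

    Requires the pdftotext binary on PATH; "-" sends output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hypothetical row shape: a name, then an integer count, then a decimal
# amount, with columns separated by runs of two or more spaces.
ROW = re.compile(
    r"^\s*(?P<name>\S.*?)\s{2,}(?P<count>\d+)\s{2,}(?P<amount>\d+\.\d{2})\s*$"
)


def parse_rows(text):
    """Yield (name, count, amount) for every line that matches ROW."""
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            yield match["name"], int(match["count"]), float(match["amount"])
```

Something like list(parse_rows(pdf_to_layout_text("report.pdf"))) would then give the rows as tuples. Lines that do not fit the pattern, such as headers and page footers, are simply skipped, which is usually what you want but can also hide rows the pattern fails to anticipate.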

mudraw

Another command line tool that will extract text from a PDF.

pdftk

A related tool, pdfimages, will extract images from PDF files. It is part of the Poppler utilities (the same package as pdftotext).
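A small Python sketch of driving pdfimages from a script, assuming the pdfimages binary is on PATH. The function names and the output directory layout are illustrative choices. The -j flag asks pdfimages to write JPEG-encoded images as .jpg files; without it, everything comes out as PPM or PBM.

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, out_prefix, keep_jpeg=True):
    """Build a pdfimages command; output files are named <prefix>-NNN.<ext>."""
    cmd = ["pdfimages"]
    if keep_jpeg:
        # -j: save images stored as JPEG in the PDF as .jpg files.
        cmd.append("-j")
    cmd += [str(pdf_path), str(out_prefix)]
    return cmd


def extract_images(pdf_path, out_dir):
    """Extract every image in the PDF into out_dir and return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Use the PDF's base name as the prefix for the extracted files.
    prefix = out_dir / Path(pdf_path).stem
    subprocess.run(pdfimages_command(pdf_path, prefix), check=True)
    return sorted(out_dir.glob(f"{prefix.name}-*"))
```

For scanned documents, the extracted page images are what you would feed to a separate OCR step.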

Notes

More notes on the session etherpad.