It’s no secret that PDFs are a terrible way to distribute data, so here are some tips and tools for extracting data and information from them.
Tabula
For extracting data from tables. An online version is available at try.tabula.technology, and there is also a version you can download and run locally.
If you have any issues, try the other detection mode.
Data can be exported to CSV and some other formats. The PDF must already contain text-based characters; Tabula won’t do OCR for you.
You can use the online version to select the area you want, export the resulting script, and paste it into the command line.
If you are able to run it, the local version will be much faster and has more options.
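As a rough sketch of what a scripted extraction might look like, the Python below builds a command line for tabula-java, the command-line engine behind Tabula. The jar file name `tabula.jar`, the helper names, and the example area coordinates are assumptions for illustration; the script Tabula exports for your own selection is the one to trust.

```python
import subprocess


def tabula_command(pdf_path, out_csv, area=None, page="1", lattice=False,
                   jar="tabula.jar"):
    """Build a tabula-java command as a list of arguments.

    `jar` is a hypothetical path; point it at wherever your tabula-java
    jar actually lives.
    """
    cmd = ["java", "-jar", jar,
           "--pages", str(page),
           "--format", "CSV",
           "--outfile", out_csv]
    # Two detection modes: --lattice for tables drawn with ruling lines,
    # --stream for tables laid out with whitespace only.
    cmd.append("--lattice" if lattice else "--stream")
    if area is not None:
        # area is (top, left, bottom, right), measured in PDF points
        cmd += ["--area", ",".join(str(v) for v in area)]
    cmd.append(pdf_path)
    return cmd


def run_tabula(*args, **kwargs):
    """Run the command; requires Java and the jar to be available."""
    subprocess.run(tabula_command(*args, **kwargs), check=True)
```

Calling `tabula_command("report.pdf", "out.csv", lattice=True)` switches to the other detection mode without changing anything else, which is handy when the first attempt produces scrambled columns.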
pdftotext
Command-line application that dumps text from a PDF and attempts to preserve the layout (with the -layout switch), but you will generally need regular expressions to parse the information.
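A minimal sketch of that workflow, assuming pdftotext is installed and on your PATH: dump the text with -layout, then pick out rows with a regular expression. The row pattern here (a label, two or more spaces, then an amount with two decimals) is purely illustrative, and the function names are made up for this example; you will need to adapt the pattern to whatever your PDF actually contains.

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Return the text of a PDF using pdftotext -layout ('-' means stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# Hypothetical row shape: "<label>  <amount>", columns separated by 2+ spaces.
ROW = re.compile(r"^\s*(?P<name>\S.*?)\s{2,}(?P<amount>[\d,]+\.\d{2})\s*$")


def parse_rows(text):
    """Extract (label, amount) pairs from layout-preserved text."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            amount = float(match.group("amount").replace(",", ""))
            rows.append((match.group("name"), amount))
    return rows
```

Lines that do not fit the pattern, such as headers and page footers, are simply skipped, so it is worth spot-checking the output against the original PDF.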
mudraw
Another command-line tool that will extract text from a PDF.
pdftk
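A toolkit for splitting, merging, and otherwise manipulating PDF files. To extract images from PDF files, use pdfimages, which ships with the same Poppler (or Xpdf) utilities as pdftotext rather than with pdftk itself.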
Notes
More notes on the session etherpad.