Tips on Converting PDF to an Accessible Document

I’ve talked about making documents accessible and the editing guidelines, but the more editing I do, the more I realize I save a lot of time because I don’t do all my editing manually. Some of these tips might also help when editing after converting from EPUB and other ebook formats.

Which Converter to Use

While I briefly spoke about which PDF to document converter I use in a previous post, I have since tried a couple of others and wanted to touch on pros and cons here.

Adobe Acrobat is well known as the all-in-one, best PDF converter, editor, etc. (which I don’t dispute), but if all you’re using it for is compressing/optimizing PDFs and converting PDF to DOC/DOCX format, then consider using SmallPDF. If you want a desktop version (available for Windows only), you can click through their links to buy it (for a reasonable price) from the company that created the software.

Even though I’ve discovered a free solution that does exactly what Acrobat does, I don’t use it often, favouring calibre instead.

Acrobat and the like do whatever it takes to make the document look exactly like the PDF.

The big upside is that it looks the same. The big downside is that the converter will use a million and one column breaks, section breaks, page layouts, styles, and whatever else it has at its disposal to do that. The result tends to be very messy.

For example, the converter unfortunately does not put page numbers and running headers/footers into the header/footer, but simply at the top/bottom of the page, using section breaks and formatting to force them into the right place.

As a result, this is what I would recommend:
1) Use calibre to convert to HTMLZ.
2) Unzip the HTMLZ.
3) Open the HTML file in your document editor of choice (if your document editor does not support opening HTML files, consider using pandoc to convert to a supported format).
4) Edit the document.
5) Remember to break links to the images (so that they are embedded into the document).
6) Use Acrobat or similar to convert the few pages that calibre did not handle well (e.g. index), and copy/paste what you need into your working document.
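
If you would rather script steps 1 and 2 than do them by hand, here is a minimal sketch in Python. It assumes calibre's command-line tool ebook-convert is installed and on your PATH, and it relies on an HTMLZ file being an ordinary zip archive. The function names and file names are only illustrations.

```python
import subprocess
import zipfile
from pathlib import Path


def convert_pdf_to_htmlz(pdf_path: Path, htmlz_path: Path) -> None:
    """Step 1: run calibre's ebook-convert.

    The output format is chosen from the extension of the output file,
    so the second path should end in .htmlz.
    """
    subprocess.run(["ebook-convert", str(pdf_path), str(htmlz_path)], check=True)


def unzip_htmlz(htmlz_path: Path, out_dir: Path) -> list[Path]:
    """Step 2: extract the HTMLZ archive and return the HTML files found in it."""
    with zipfile.ZipFile(htmlz_path) as archive:
        archive.extractall(out_dir)
    return sorted(out_dir.rglob("*.html"))
```

For example, calling convert_pdf_to_htmlz(Path("book.pdf"), Path("book.htmlz")) and then unzip_htmlz(Path("book.htmlz"), Path("book_htmlz")) should leave you with a folder containing the HTML file and its images, ready for step 3.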

The problem occurs when you have a multi-column layout in the PDF, which calibre does not handle well at all. Obviously, instead of just a few pages, you will need Acrobat or similar to convert the whole PDF. I suggest that you clean up the document as much as possible (removing headers/footers, replacing section breaks with page breaks where possible, etc.). Hopefully, the rest of this post will give you ideas on how best to do this.

Fiction vs. Non-Fiction

Before I get into the actual tips, one of the major considerations is whether pagination matters. With fiction, I tend to remove all page numbers and page breaks, because they simply interrupt the flow of the text. However, in many non-fiction documents, you have notes, appendices, and an index. Page numbers need to match so that the reader can find what they are looking for.

Everything I talk about will be with the latter in mind, but for documents where page numbers do not matter, it’s simpler. For example, instead of replacing running headers with page breaks, you can just replace them with nothing.

A Quick Note on Editing Software

I will be focusing on how to do everything in Microsoft Word (should work in any version) and using its syntax, but all of these tips should apply to LibreOffice and OpenOffice if you install the Alt. Search plugin.

Use Split Screen

Remember to use the split screen (mouse over the bit between the scroll bar and the ribbon/menu and your cursor should change) as much as you need to. Use it whenever you need to constantly refer to the table of contents, or simply to another part of the document (such as when adding endnotes).
[Image: Microsoft Word split screen option]

Inserting Page Breaks Using Running Headers/Footers

You will need to look through your converted document to see what the running headers and/or footers look like, so that you can replace them with page breaks. For books, you will pretty much always find a combination of page number and text. Typically, it will look something like this (where ^p is a paragraph mark):

5^p
Introduction^p

If you have a running header like this, you’ll pretty much have to do each chapter separately. I also recommend that you start with the highest number of digits. Example (where ^# means any single digit, and ^m means a page break):

Find: ^#^#^#^pbook or chapter title^p
Replace: ^m

Then do it with two digits (^#^#) and finally, just one.

A more efficient method would be to use wildcards and look for 1-3 digits, but depending on the direction you’re searching in Word, this may or may not work as intended. If you want to try it, I suggest putting your cursor where you want to start searching (after the contents), then searching “Down”. With the wildcards option (Word’s version of regular expressions) checked, you can use:

Find: [0-9]{1,3}^13book or chapter title^13
Replace: ^m

Note: ^p and many other find/replace codes are not accepted when using wildcards, so you need to use ^13 in place of ^p.
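
If you are working on a plain text export rather than in Word, the same idea can be sketched with a regular expression in Python. This is only an illustration, not what Word does internally. It assumes each paragraph mark is a newline and uses a form feed character to stand in for Word's ^m page break. The function name is made up.

```python
import re

# Assumption: a form feed stands in for Word's manual page break (^m).
PAGE_BREAK = "\f"


def replace_running_headers(text: str, title: str, replacement: str = PAGE_BREAK) -> str:
    """Replace 'page number, newline, title, newline' with a page break marker.

    The ^ anchor (with re.MULTILINE) makes sure the 1-3 digits start a line,
    so there is no need to run separate passes for 3, 2, and 1 digits.
    """
    pattern = re.compile(r"^\d{1,3}\n" + re.escape(title) + r"\n", flags=re.MULTILINE)
    return pattern.sub(replacement, text)


sample = "end of page four.\n5\nIntroduction\nStart of page five.\n"
print(repr(replace_running_headers(sample, "Introduction")))
```

For documents where pagination does not matter, passing replacement="" removes the running header without leaving a page break behind.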

Marking (Chapter) Headers

Sometimes you may be lucky enough to get a different style for your chapter headings. If you search for “Chapter” and more than one has a specific style, it’s likely all your chapter headings have the same style. You can then mark all your chapter headings with Heading 1. Example:

Find: (style) chap-header
Replace: (style) Heading 1

This is a good way to mark other headings too if possible.

Check the Navigation sidebar to see how much of your text has now been marked as a Level One heading. Hopefully it’s only the text you want. If it included other headers, remember that you can use the outline view, select to view only headers, select the headers you want to change, and move them up or down one or more levels.

[Image: Microsoft Word outline view]

Also check just before each chapter heading that no page numbers or other pieces of the running header/footer have been left over, since they tend to be a little different before major headings.

Removing Extra Paragraph Breaks

Usually, when converting from PDF in calibre, the end of every line will have a paragraph break. You may prefer to copy/paste text from an Acrobat or similar conversion, or from a plain text file. I find that takes more time than just getting rid of the extra paragraph breaks.

First we want to make the hyphenated words into a single word. I have found that two words with a hyphen in between have a space before the paragraph mark, so we can do this:

Find: (-) ^13
Replace: \1

Note: When using wildcards, much like regex, what you put in parentheses can be grouped, and then “replaced” with that group. In the example above, we’re basically saying put what’s in group 1 back to where you found it.

Now we need to rejoin the words that were split across two lines with a hyphen but are actually one word, using:

Find: -^13
Replace: (nothing)

Do check that this is the case in your document. Sometimes neither the compound words nor the split single words have a space before the paragraph mark, in which case you unfortunately have to step through the matches and choose manually when to replace instead of doing it all at once.

For everything else, we are going to assume that a paragraph break belongs only after a line that ends with a period, closing quotation mark, exclamation mark, or question mark. So we look for all lines that do not end with one of those and remove the break. Using the wildcard option, we can do this:

Find: ([!.!\"\”\?] )^13 (style: Normal)
Replace: \1

Note: The extra backslashes are there to “escape” the characters, to say “this character, not whatever special meaning it might have when using wildcards”. You will also notice there is the straight double quotation mark as well as the “smart” or curved closing quotation mark.

Using the “Normal” style (or whichever style your body text uses) will make sure we don’t touch the headers we’ve marked.
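
For anyone cleaning up a plain text file instead, the three passes above translate roughly into the Python sketch below. It assumes each paragraph mark is a newline. Plain text has no styles, so unlike the Word version it cannot skip headings. The function name is just for illustration.

```python
import re


def join_broken_lines(text: str) -> str:
    """Apply the three find/replace passes in the same order as in Word."""
    # 1) Compound words: hyphen, space, line break. Keep the hyphen, drop the rest.
    text = re.sub(r"- \n", "-", text)
    # 2) Single words split across lines: hyphen then line break. Drop both.
    text = re.sub(r"-\n", "", text)
    # 3) A line ending in a space whose last character is not . ! " ” or ?
    #    is assumed to continue on the next line, so remove the line break.
    text = re.sub(r'([^.!"”?] )\n', r"\1", text)
    return text


sample = "a well- \nknown con-\nverter that breaks \nlines. \nNext paragraph starts \nhere.\n"
print(join_broken_lines(sample))
```

As in Word, the order matters: the pass for hyphen plus space has to run before the pass for a bare hyphen, and both before the general pass.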

Images and Descriptions

To easily find images, you can find ^g (stands for graphic). To add or to check that alt text has been added for all the images that need it, you can leave the “Alt text” box open as you continue to find images.

[Image: Microsoft Word Alt Text box]

This method does not work for anchored images. The easiest way to embed your image inline is to cut and paste it into an image editing program, then copy/paste it back (because it will get rid of other formatting as well). Otherwise, look for the option to position the image inline with text. You can also copy the image from your PDF if necessary.
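
As a final check, you could also list the images that still have no alt text with a short script. The sketch below is in Python and makes a few assumptions: the document is saved as .docx (which is a zip archive), the alt text is stored in the descr attribute of each drawing's wp:docPr element, and only the main body (word/document.xml) is inspected, not headers or footers. It also reports shapes and charts, since they use the same element. The function name is made up.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the wp: prefix used for drawings placed in a Word document.
WP_NS = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"


def images_missing_alt_text(docx_path: str) -> list[str]:
    """Return the names of drawings in the main document body without alt text."""
    with zipfile.ZipFile(docx_path) as docx:
        root = ET.fromstring(docx.read("word/document.xml"))
    missing = []
    # wp:docPr sits under both wp:inline and wp:anchor,
    # so anchored (floating) images are included too.
    for doc_pr in root.iter(f"{{{WP_NS}}}docPr"):
        if not (doc_pr.get("descr") or "").strip():
            missing.append(doc_pr.get("name", "(unnamed)"))
    return missing
```

Calling images_missing_alt_text("book.docx") on a hypothetical file returns names such as "Picture 2", which you can then look for in Word's Selection Pane.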

Tables, Footnotes, and Endnotes

Tables, footnotes, and endnotes usually need to be reinserted (even when using Acrobat and similar to convert).

If you have a lot of (big) tables, then consider using tabula. You can also copy and paste the data into Excel and manipulate the data so that it is in the correct order (row and column). Then you can select all the cells in your table (in Word) and paste the data back into the table.

With footnotes and endnotes, unfortunately, it has to be done manually. However, here is a quick tip: to move from a note to its reference, right-click the note (when you’re in the endnote section) and choose “Go to endnote”. To move from a reference to its note, double-click the endnote reference (in the body of the text). Now that you’ve removed the page numbers, the easiest way to find the references is to search for numbers.

Page Numbers, Section Breaks, and Page Layout

After all of that, insert the page numbers in the header, then use a section break where necessary to change the page numbering.

You will also need a section break if you’re switching the number of columns. Avoid using multiple columns whenever you can, even if the original is laid out in multiple columns. There are cases where you need multiple columns to fit the amount of text on a page (such as the index). In these cases, still try to stick to using “Section Break (Next Page)” instead of “Section Break (Continuous)”.

The simpler the layout, the better for your reader.

Make Less Work

If you find that you’re doing something repeatedly, then you can likely find a way to do it faster.

And if anyone is doing something even more efficiently, I’d love to hear about it.