Hack 81. Edit PDF Code Freely


Take control of PDF code by mastering its XREF table.

[Hack #80] revealed the hackable plain text behind PDF. Here we edit this PDF text and then use pdftk [Hack #79] to cover our tracks. pdftk can also compress the page streams when we're done.

An unsuitable text editor can quietly damage your PDF. Test your text editor by opening a PDF, saving it to a new file, and then opening this new file in Acrobat or Reader. If your editor corrupted the PDF's data, Acrobat or Reader should display a brief warning before displaying the PDF. Sometimes, however, this warning flashes by too quickly to notice. Once Acrobat or Reader has repaired the PDF, it displays the file as if nothing had happened.

Since Acrobat and Reader aren't the most reliable tools for testing PDFs, you should consider some alternatives. The free command-line pdfinfo program from the Xpdf project (http://www.foolabs.com/xpdf/) can tell you whether a PDF is damaged. The Multivalent Tools (http://multivalent.sourceforge.net/Tools/index.html) also provide a free PDF validator.
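A byte-for-byte comparison is another way to run the editor test described above, and it does not depend on spotting a warning. The following is a minimal Python sketch, not part of the original hack; the function names first_difference and editor_preserved_file are hypothetical. It assumes you saved the PDF from your editor without making any changes.

```python
from pathlib import Path
from typing import Optional


def first_difference(original: bytes, resaved: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (a, b) in enumerate(zip(original, resaved)):
        if a != b:
            return offset
    if len(original) != len(resaved):
        # One file is a prefix of the other; they diverge where the shorter ends.
        return min(len(original), len(resaved))
    return None


def editor_preserved_file(original_path: str, resaved_path: str) -> bool:
    """True when the editor's untouched, re-saved copy matches byte for byte."""
    original = Path(original_path).read_bytes()
    resaved = Path(resaved_path).read_bytes()
    return first_difference(original, resaved) is None
```

If editor_preserved_file returns False, the offset reported by first_difference shows where the editor first altered the data, which is often a changed line ending or a re-encoded binary byte.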

If you need a good text editor, try gVim [Hack #82].


First, uncompress your PDF's page streams [Hack #80]:

pdftk  mydoc.pdf  output  mydoc.uncompressed.pdf  uncompress

Then, open this new PDF in your text editor. Locate your page of interest by searching for the text /pdftk_pageNum N, where N is the number of your page (the first page is 1, not 0). This text was added to the page dictionaries by pdftk.

Find the /Contents key in your page's dictionary. It is probably mapped to an indirect object reference of the form m n R, where m is the object number and n is the generation number (usually 0). Locate this indirect object by searching for the text m n obj. This will take you to a stream or to an array of streams. If it is an array, look up any of its referenced streams the same way.
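The search steps above can be sketched in Python for readers who would rather script the lookup than do it by hand. This is an illustrative sketch only: the helper names and the SAMPLE data are hypothetical, and the regular expressions assume an uncompressed file laid out the way pdftk writes it, with no stream that happens to contain the text "endobj".

```python
import re
from typing import List, Optional, Tuple


def find_page_object(pdf: bytes, page_num: int) -> Optional[bytes]:
    """Return the text of the object that holds /pdftk_pageNum page_num."""
    # (?!\d) keeps page 1 from matching pages 10, 11, and so on.
    marker = re.search(rb"/pdftk_pageNum\s+%d(?!\d)" % page_num, pdf)
    if marker is None:
        return None
    # The page dictionary sits between the nearest preceding "m n obj"
    # and the next "endobj".
    prefix = pdf[: marker.start()]
    starts = [m.start() for m in re.finditer(rb"\d+\s+\d+\s+obj\b", prefix)]
    end = pdf.find(b"endobj", marker.end())
    if not starts or end == -1:
        return None
    return pdf[starts[-1] : end]


def contents_refs(page_obj: bytes) -> List[Tuple[int, int]]:
    """Return the (m, n) pairs that the page's /Contents key refers to."""
    single = re.search(rb"/Contents\s+(\d+)\s+(\d+)\s+R", page_obj)
    if single:
        return [(int(single.group(1)), int(single.group(2)))]
    array = re.search(rb"/Contents\s*\[(.*?)\]", page_obj, re.S)
    if array:
        pairs = re.findall(rb"(\d+)\s+(\d+)\s+R", array.group(1))
        return [(int(m), int(n)) for m, n in pairs]
    return []


def find_object(pdf: bytes, m: int, n: int) -> Optional[bytes]:
    """Return indirect object 'm n obj' through its closing endobj."""
    # (?<!\d) keeps "2 0 obj" from matching inside "12 0 obj".
    start = re.search(rb"(?<!\d)%d\s+%d\s+obj\b" % (m, n), pdf)
    if start is None:
        return None
    end = pdf.find(b"endobj", start.end())
    if end == -1:
        return None
    return pdf[start.start() : end + len(b"endobj")]


# Hypothetical, heavily simplified stand-in for an uncompressed PDF body.
SAMPLE = (
    b"4 0 obj\n<< /Type /Page /Contents 12 0 R /pdftk_pageNum 1 >>\nendobj\n"
    b"5 0 obj\n<< /Type /Page /Contents 2 0 R /pdftk_pageNum 10 >>\nendobj\n"
    b"12 0 obj\n<< /Length 17 >>\nstream\nBT (storey) Tj ET\nendstream\nendobj\n"
    b"2 0 obj\n<< /Length 19 >>\nstream\nBT (page ten) Tj ET\nendstream\nendobj\n"
)
```

For example, contents_refs(find_page_object(SAMPLE, 1)) yields the pair (12, 0), and find_object(SAMPLE, 12, 0) returns the stream object that draws page 1. On a real file you would read the bytes with open(path, "rb") rather than use SAMPLE.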

Now you should be looking at a stream of PDF drawing operations that describe your page. These operations and their interactions are best understood by studying the PDF Reference [Hack #98]. However, if your page has a lot of text on it, you can probably pick that text out of the stream. One example of a legal change to page text is changing [(gr)17.7(oup)] to [(grip)], or (storey) to (story). Any text inside parentheses like this is fair game. So, change something and save your work.
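To make the bracketed example concrete: in an array such as [(gr)17.7(oup)], the parenthesized pieces are the text and the number between them is a small spacing adjustment, which is why the word reads as "group". The Python sketch below pulls the strings out of such an array and builds a replacement. The names tj_strings and literal_string are hypothetical, and the sketch assumes a font whose string bytes read as plain text and strings without unescaped nested parentheses.

```python
import re

# Matches one PDF literal string, allowing backslash escapes such as \( and \).
# Assumption: no unescaped nested parentheses inside the string.
_LITERAL = re.compile(rb"\(((?:\\.|[^\\()])*)\)")


def tj_strings(array: bytes) -> list:
    """Return the raw string pieces of a text array, skipping the numbers."""
    return [match.group(1) for match in _LITERAL.finditer(array)]


def literal_string(text: bytes) -> bytes:
    """Wrap text in parentheses, escaping backslashes and parentheses."""
    escaped = (
        text.replace(b"\\", b"\\\\")
        .replace(b"(", b"\\(")
        .replace(b")", b"\\)")
    )
    return b"(" + escaped + b")"


# The example from the text: read the old word, then build its replacement.
old_word = b"".join(tj_strings(b"[(gr)17.7(oup)]"))
new_array = b"[" + literal_string(b"grip") + b"]"
```

Here old_word is b"group" and new_array is b"[(grip)]". Collapsing the array into a single string discards the spacing adjustment, which is harmless for a small edit like this one.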

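Editing PDF at the text level typically corrupts the XREF lookup table at the end of the file, because the table records each object's byte offset and any edit that changes the file's length shifts the objects that follow it. Repair your edited PDF using pdftk like so: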

pdftk  mydoc.edited.pdf  output  mydoc.fixed.pdf

Or, if you want to compress the output and remove the /pdftk_pageNum entries, add compress to the end like so:

pdftk  mydoc.edited.pdf  output  mydoc.fixed.pdf  compress
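To see why the repair step is needed, it helps to measure how far objects move after an edit. The Python sketch below is illustrative only: ORIGINAL is a hypothetical, heavily simplified stand-in for a PDF body, and object_offsets and shifted_objects are made-up helper names. It shows that shortening (storey) to (story) moves every later object by one byte, so any stored offset for those objects would be stale.

```python
import re
from typing import Dict, Tuple


def object_offsets(pdf: bytes) -> Dict[Tuple[int, int], int]:
    """Map (object number, generation) to the byte offset of 'm n obj'."""
    offsets: Dict[Tuple[int, int], int] = {}
    for match in re.finditer(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b", pdf):
        key = (int(match.group(1)), int(match.group(2)))
        # Keep the first occurrence of each object.
        offsets.setdefault(key, match.start())
    return offsets


def shifted_objects(before: bytes, after: bytes) -> Dict[Tuple[int, int], int]:
    """Return how many bytes each object moved between two versions."""
    old, new = object_offsets(before), object_offsets(after)
    return {
        key: new[key] - old[key]
        for key in old
        if key in new and new[key] != old[key]
    }


# Hypothetical two-object file body, before and after a one-letter edit.
ORIGINAL = (
    b"1 0 obj\n<< /Length 17 >>\nstream\nBT (storey) Tj ET\nendstream\nendobj\n"
    b"2 0 obj\n<< /Type /Catalog >>\nendobj\n"
)
EDITED = ORIGINAL.replace(b"(storey)", b"(story)")

moved = shifted_objects(ORIGINAL, EDITED)
```

In this example moved is {(2, 0): -1}: object 2 now starts one byte earlier than before. The /Length value of the edited stream is also one byte too large. An edit that keeps the byte count the same, such as swapping one letter for another, moves nothing.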

Open your new PDF in Reader and view your page. Do you see the change you made? If it was in the middle of a paragraph, you might be surprised to find that the paragraph hasn't rewrapped to fit your altered word. Most PDFs have no concept of a paragraph, so how could it?

This procedure is an unlikely way to fix typos. We put it to better use in [Hack #82].