Hack 80 Decipher and Navigate PDF at the Text Level

Turn obfuscated PDF code into transparent data so you can work with it directly.

PDF uses an element framework for organizing data. When editing PDFs at the text level, it helps to know how to navigate these nodes. The data itself is usually compressed and unreadable. pdftk [Hack #79] can uncompress these streams, making the PDF more interesting to read and much more hackable.

First, uncompress your PDF document using pdftk:

pdftk mydoc.pdf output mydoc.uncompressed.pdf uncompress

Next, fire up your text editor. A good text editor enables you to inspect any document at its lowest level by reading its bytes right off the disk. Not all text editors can handle the mix of human-readable text and machine-readable binary data that PDF contains. Others can read and display this data but can't write it back properly. I recommend using gVim [Hack #82].

Get the full story on PDF by reading the specification at http://partners.adobe.com/asn/acrobat/sdk/public/docs/PDFReference15_v5.pdf.

Open a PDF in your text editor and you will find some plain-text data and some unreadable binary data. All of this data is organized using a few basic objects. The PDF Reference 1.5 section 3.2 describes these in detail. Here is a quick key to get you started.

Names: / ...: A slash indicates the beginning of a name. Examples include /Type and /Page. Most names have very specific meanings prescribed by the PDF Reference. They are never compressed or encrypted.
Strings: ( ...): Strings are enclosed by parentheses. An example is (Now is the time). You use them for holding plain-text data in annotations and bookmarks. You can encrypt them but you can't compress them. Mind escaped characters, e.g., \), \(, or \\.
Dictionaries: << key1 value1 key2 value2 ... >>: Dictionaries map keys to values. Keys must be names and values can be anything, even dictionaries or arrays.
Arrays: [ object1 object2 ... ]: Arrays represent a list of objects. All PDF objects are part of one big tree, interconnected by arrays and dictionaries.
Streams: << ... >> stream ...endstream: Most PDF data is stored in streams. Dictionary data precedes the stream data and holds information about the stream, such as its length and encoding. stream and endstream bracket the actual stream data. Streams are used to hold bitmap images and page-drawing instructions, among other things. Use pdftk to make compressed page streams readable. Some streams use PDF objects (dictionaries, strings, arrays, etc.) to represent information.
Indirect object references: m n R: Indirect object references allow an object to be referenced in one place (or many places) and described in another. The reference is a pair of numbers followed by the letter R, such as: 3528 0 R. You find them in dictionaries and arrays. To locate the object referenced by m n R, search for m n obj.
Indirect object identifiers: m n obj ... endobj: An indirect object is any object that is preceded by the identifying m n obj, where m and n are numbers that uniquely identify the object. Another object can then reference the indirect object by simply invoking m n R, described earlier.
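To make the "search for m n obj" step concrete, here is a minimal Python sketch that follows indirect references in uncompressed PDF text. The SAMPLE fragment is hand-written for illustration, and the helper names find_object and find_references are hypothetical. The regular expressions assume plain, uncompressed text; binary stream data that happens to contain similar byte sequences could produce false matches.

```python
import re

# A tiny hand-written fragment in the style of an uncompressed PDF
# (illustrative only; not a complete, valid PDF file).
SAMPLE = b"""1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [ 3 0 R ] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /Contents 4 0 R >>
endobj
"""


def find_object(data: bytes, num: int, gen: int = 0):
    """Return the text between 'num gen obj' and 'endobj', or None if absent."""
    # The lookbehind keeps a search for '3 0 obj' from matching '13 0 obj'.
    pattern = re.compile(
        rb"(?<!\d)%d\s+%d\s+obj\b(.*?)endobj" % (num, gen), re.DOTALL
    )
    match = pattern.search(data)
    return match.group(1).strip() if match else None


def find_references(body: bytes):
    """List (object number, generation) pairs for every 'm n R' in body."""
    return [(int(m), int(n)) for m, n in re.findall(rb"(\d+)\s+(\d+)\s+R\b", body)]


if __name__ == "__main__":
    page = find_object(SAMPLE, 3)
    print(page)
    print(find_references(page))
```

Running it prints the dictionary of object 3 and the two references it contains, each of which you could pass back to find_object to keep walking the tree.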

Dictionaries tend to be the most interesting objects. They represent things such as pages and annotations. You can tell what a dictionary describes by checking its /Type and /Subtype keys. Conversely, you can find something in a PDF by searching on its type. For example, you can find each page in a PDF by searching for the text /Page. For annotations, search for /Annot, and for images, /Image.
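One caveat when searching by type: a plain text search for /Page also matches /Pages, the name used by page-tree nodes. The following Python sketch shows one way to count dictionaries of an exact type. The SAMPLE fragment and the helper name count_type are illustrative assumptions, and the pattern assumes the type name directly follows the /Type key, which is common but not guaranteed.

```python
import re

# Hand-written fragment for illustration (not a complete PDF).
SAMPLE = b"""2 0 obj
<< /Type /Pages /Kids [ 3 0 R 4 0 R ] /Count 2 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R >>
endobj
4 0 obj
<< /Type /Page /Parent 2 0 R >>
endobj
5 0 obj
<< /Type /Annot /Subtype /Link >>
endobj
"""


def count_type(data: bytes, type_name: bytes) -> int:
    """Count occurrences of '/Type /<type_name>' where the name matches exactly."""
    # The lookahead requires whitespace, a PDF delimiter, or end of data after
    # the name, so b"Page" does not match the longer name /Pages.
    pattern = re.compile(
        rb"/Type\s*/" + re.escape(type_name) + rb"(?![^\s()<>\[\]{}/%])"
    )
    return len(pattern.findall(data))


if __name__ == "__main__":
    print("plain search for /Page:", SAMPLE.count(b"/Page"))
    print("exact /Type /Page:", count_type(SAMPLE, b"Page"))
    print("exact /Type /Pages:", count_type(SAMPLE, b"Pages"))
```

In a text editor the same idea applies: search for /Page followed by a space or delimiter rather than for the bare text.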

At the end of the PDF file is the XREF lookup table. It gives the byte offset for every indirect object in the PDF file. This allows rapid random access to PDF pages and other data. Text-level PDF editing can corrupt the XREF table, which breaks the PDF. [Hack #81] solves this problem.
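To see how the byte offsets work, here is a minimal Python sketch that locates the table through the startxref keyword near the end of the file and reads its entries. It assumes a classic plain-text table rather than a PDF 1.5 cross-reference stream, and it reads only the last table in the file. The names build_sample and read_xref are hypothetical, and the sample file is assembled in code purely for illustration.

```python
def build_sample() -> bytes:
    """Assemble a tiny illustrative file with a classic xref table."""
    objects = [
        b"1 0 obj\n<< /Type /Catalog >>\nendobj\n",
        b"2 0 obj\n(Now is the time)\nendobj\n",
    ]
    out = b"%PDF-1.4\n"
    offsets = []
    for obj in objects:
        offsets.append(len(out))  # byte offset where this object starts
        out += obj
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always a free entry
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return out


def read_xref(data: bytes) -> dict:
    """Map object number -> byte offset for the in-use entries of the table."""
    # The last 'startxref' keyword is followed by the table's byte offset.
    pos = data.rfind(b"startxref")
    if pos < 0:
        raise ValueError("no startxref keyword found")
    xref_pos = int(data[pos + len(b"startxref"):].split()[0])
    lines = data[xref_pos:].splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("offset does not point at a classic xref table")
    offsets = {}
    i = 1
    while i < len(lines) and lines[i].strip() != b"trailer":
        # Each subsection starts with 'first_object_number entry_count'.
        first, count = (int(x) for x in lines[i].split())
        for k in range(count):
            off, _gen, kind = lines[i + 1 + k].split()
            if kind == b"n":  # 'n' marks an in-use object, 'f' a free one
                offsets[first + k] = int(off)
        i += 1 + count
    return offsets


if __name__ == "__main__":
    data = build_sample()
    for num, off in sorted(read_xref(data).items()):
        print(num, off, data[off:off + 7])
```

Each offset printed points at the first byte of the matching m n obj line. Adding or deleting even one byte earlier in the file shifts everything after it, so the recorded offsets stop matching, which is how text-level editing breaks the table.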