Hack 64 Get and Set PDF Metadata


Add document information to your PDF, even without using Acrobat.

Traditional metadata includes things such as your document's title, authors, and ISBN. But you can add anything you want, such as the document's revision number, category, internal ID, or expiration date. PDF can store this information in two different ways: using the PDF's Info dictionary [Hack #80] or using an embedded Extensible Metadata Platform (XMP) stream. When you change the PDF's title, authors, subject, or keywords using Acrobat, as shown in Figure 5-13, it updates both of these resources. Acrobat 6 also enables you to export or import PDF XMP datafiles. Visit http://www.adobe.com/products/xmp/ to learn about Adobe's XMP.

Figure 5-13. Viewing or changing a PDF's basic metadata in Acrobat

In Acrobat 6, view and update metadata by selecting File → Document Properties... → Description, or Advanced → Document Metadata.... In Acrobat 5, select File → Document Properties → Summary. Save your PDF after making changes to the metadata.

Our pdftk [Hack #79] currently reads and writes only the metadata in a PDF's Info dictionary. However, it does not restrict you to just the title, authors, subject, and keywords: you can add custom metadata fields to a PDF as needed, which solves the basic problem of embedding information into a PDF document. pdftk is free software.

Xpdf's (http://www.foolabs.com/xpdf/) pdfinfo reports a PDF's Info dictionary contents, its XMP stream, and other document data. pdfinfo is free software.

5.15.1 Get Document Metadata

To create a plain-text report of PDF metadata, use pdftk's dump_data operation. It will also report PDF bookmarks and page labels, among other things. The command looks like this:

pdftk  mydoc.pdf  dump_data output  mydoc.data.txt

Metadata will be represented as key/value pairs, like so:

InfoKey: Creator
InfoValue: Acrobat PDFMaker 6.0 for Word
InfoKey: Title
InfoValue: Brian Eno: His Music and the Vertical Color of Sound
InfoKey: Author
InfoValue: Eric Tamm
InfoKey: Producer
InfoValue: Acrobat Distiller 6.0.1 (Windows)
InfoKey: ModDate
InfoValue: D:20040420234132-07'00'
InfoKey: CreationDate
InfoValue: D:20040420234045-07'00'
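If you want to process this report in a script, the key/value pairs are easy to collect. The following is a minimal Python sketch, not part of pdftk: the function names parse_info_pairs and parse_pdf_date are hypothetical, and the date parser assumes the full D:YYYYMMDDHHmmSS form with a timezone offset, as in the sample above. Other date variants would need extra handling.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_info_pairs(report: str) -> dict:
    """Collect InfoKey/InfoValue pairs from dump_data text into a dict."""
    info = {}
    pending_key = None
    for raw_line in report.splitlines():
        line = raw_line.strip()
        if line.startswith("InfoKey:"):
            pending_key = line[len("InfoKey:"):].strip()
        elif line.startswith("InfoValue:") and pending_key is not None:
            info[pending_key] = line[len("InfoValue:"):].strip()
            pending_key = None
        # Any other line (bookmarks, page labels, blanks) is ignored.
    return info


# Assumption: only the full form with an offset is handled, e.g.
# D:20040420234132-07'00'
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})([+-])(\d{2})'(\d{2})'?$"
)


def parse_pdf_date(value: str) -> datetime:
    """Convert a PDF date string into a timezone-aware datetime."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported PDF date: {value!r}")
    year, month, day, hour, minute, second = (
        int(part) for part in match.group(1, 2, 3, 4, 5, 6)
    )
    offset = timedelta(hours=int(match.group(8)), minutes=int(match.group(9)))
    if match.group(7) == "-":
        offset = -offset
    return datetime(year, month, day, hour, minute, second,
                    tzinfo=timezone(offset))
```

For example, after reading mydoc.data.txt into a string, parse_info_pairs(text)["Author"] would return the author, and parse_pdf_date applied to the ModDate value gives a datetime you can compare or reformat.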

Another tool for reporting PDF metadata is pdfinfo, which is part of the Xpdf project (http://www.foolabs.com/xpdf/). In addition to metadata, it also reports page sizes, page count, and PDF permissions [Hack #52]. Running pdfinfo mydoc.pdf yields a report such as this:

Title:          Brian Eno: His Music and the Vertical Color of Sound
Author:         Eric Tamm
Creator:        Acrobat PDFMaker 6.0 for Word
Producer:       Acrobat Distiller 6.0.1 (Windows)
CreationDate:   04/20/04 23:40:45
ModDate:        04/22/04 14:39:30
Tagged:         no
Pages:          216
Encrypted:      no
Page size:      522 x 756 pts
File size:      1126904 bytes
Optimized:      yes
PDF version:    1.4

Use pdfinfo's options to fine-tune its behavior. Use its -meta option to view a PDF's XMP stream.
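The default pdfinfo report is a list of "Label: value" lines, so a script can read it by splitting each line at the first colon. This is a rough Python sketch under that assumption. The helper names parse_pdfinfo and page_size_points are hypothetical, and the page-size helper assumes the "W x H pts" layout shown in the sample report.

```python
def parse_pdfinfo(report: str) -> dict:
    """Turn a pdfinfo text report into a dict of label -> value strings."""
    fields = {}
    for line in report.splitlines():
        # Split at the FIRST colon only, so values that contain colons
        # (titles, times) stay intact.
        label, sep, value = line.partition(":")
        if sep and label.strip():
            fields[label.strip()] = value.strip()
    return fields


def page_size_points(fields: dict) -> tuple:
    """Return (width, height) in points from a 'Page size' entry."""
    # Assumes the layout 'W x H pts'; anything after that is ignored.
    width, _x, height, _unit = fields["Page size"].split()[:4]
    return float(width), float(height)
```

With the report above, parse_pdfinfo(report)["Pages"] is the string "216", which you can convert with int() as needed.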

5.15.2 Set Document Metadata

pdftk can take a plain-text file of these same key/value pairs and update a PDF's Info dictionary to match. Currently, it does not update the PDF's XMP stream. The command would look like this:

pdftk  mydoc.pdf  update_info  new_info.txt  output  mydoc.updated.pdf

This will add or modify the Info keys given by new_info.txt. Note that the output PDF filename must be different from the input. To remove a key/value pair, pass in the key with an empty value, like so:

InfoKey: MyDataKey
InfoValue:

Use pdftk to strip all Info and XMP metadata from a document by copying its pages into a new PDF, like so:

pdftk A=mydoc.pdf cat A output mydoc.no_metadata.pdf

The PDF specification defines several Info fields. Be careful to use these only as described in the specification. They are Title, Author, Subject, Keywords, Creator, Producer, CreationDate, ModDate, and Trapped.
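If you maintain metadata in a script, you can generate the text file that update_info reads rather than editing it by hand. The sketch below is one possible approach in Python, not a pdftk feature: build_update_info and custom_keys are hypothetical helpers, and using None to mean "remove this key" is a convention chosen here for illustration. The list of standard keys is the one given above.

```python
# Info fields defined by the PDF specification, as listed above.
STANDARD_INFO_KEYS = {
    "Title", "Author", "Subject", "Keywords", "Creator",
    "Producer", "CreationDate", "ModDate", "Trapped",
}


def build_update_info(changes: dict) -> str:
    """Render key/value changes as InfoKey/InfoValue text for update_info.

    A value of None (or an empty string) produces an empty InfoValue,
    which asks pdftk to remove that key.
    """
    lines = []
    for key, value in changes.items():
        text = "" if value is None else str(value)
        if "\n" in key or "\n" in text:
            raise ValueError("keys and values must be single-line")
        lines.append(f"InfoKey: {key}")
        lines.append(f"InfoValue: {text}" if text else "InfoValue:")
    return "\n".join(lines) + "\n"


def custom_keys(changes: dict) -> list:
    """List the keys that are not standard Info fields."""
    return sorted(key for key in changes if key not in STANDARD_INFO_KEYS)
```

Write the returned string to new_info.txt and pass that file to the update_info command shown earlier. Checking custom_keys first is a cheap way to catch a misspelled standard field before it becomes a custom one.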