Hack 70 Create an HTML Table of Contents from PDF Bookmarks


Give web surfers an inviting HTML gateway into your PDF.

When browsing the Web, I usually groan at the sight of a PDF link. You have probably experienced it, too. My research has brought me to the point where I must download a large PDF before I can proceed. The problem isn't so much with the PDF file as with my inability to gauge just how much this PDF might help me before I commit to downloading it.

The PDF author might have even gone to great lengths to ensure a good, online read, with nice, clear fonts, navigational bookmarks, and page-at-a-time byte serving for quick, random access. But I can't tell that from looking at this PDF link. Chances are that I'll click and wait, and wait. When it finally opens, I'll probably need to flip, page by page, through illegible text looking for a clue that this tome will help me somehow. I might never find out, especially because I have a dozen other possible lines of inquiry I am pursuing at the same time.

Don't let this happen to your online PDF. If your PDF has bookmarks, use this hack to create an HTML table of contents that hyperlinks every heading directly to its PDF page (see Figure 5-16).

Figure 5-16. An HTML table of contents, which links readers directly to PDF topics


This kind of random access into an online PDF is convenient only if the PDF is linearized and the web server is configured for byte serving [Hack #67]. Without both of these, your readers must download the entire document before viewing a single page.


5.21.1 Create a PDF Table of Contents in HTML with pdftk and pdftoc

pdftk [Hack #79] can report on PDF data, including bookmarks. pdftoc converts this plain-text report into HTML. Visit http://www.pdfhacks.com/pdftoc/ and download pdftoc-1.0.zip. Unzip, and move pdftoc.exe to a convenient location, such as C:\Windows\system32\. On other platforms, build pdftoc from the source code.

Use pdftk to grab the bookmark data from your PDF, like so:

pdftk  mydoc.pdf  dump_data output mydoc_data.txt
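The resulting report is plain text with one "Key: value" pair per line. To show what a converter has to work with, here is a minimal Python sketch that pulls the bookmark entries out of such a report. It assumes each bookmark appears as BookmarkTitle, BookmarkLevel, and BookmarkPageNumber lines, in that order; the function name parse_bookmarks and the sample report are made up for illustration, and this is not pdftoc's actual source.

```python
def parse_bookmarks(report_text):
    """Collect (level, page, title) tuples from pdftk dump_data text.

    Assumption: each bookmark is reported as BookmarkTitle, then
    BookmarkLevel, then BookmarkPageNumber. Other keys are ignored.
    """
    bookmarks = []
    title = level = None
    for line in report_text.splitlines():
        # Split on the first colon only, so titles may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        if key == "BookmarkTitle":
            title, level = value, None
        elif key == "BookmarkLevel":
            level = int(value)
        elif key == "BookmarkPageNumber" and title is not None and level is not None:
            bookmarks.append((level, int(value), title))
            title = level = None
    return bookmarks


# Hypothetical report text, trimmed to a few lines for illustration.
sample = """\
InfoKey: Title
InfoValue: My Document
NumberOfPages: 12
BookmarkTitle: Chapter 1: Getting Started
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkTitle: Installation
BookmarkLevel: 2
BookmarkPageNumber: 3
"""

for level, page, title in parse_bookmarks(sample):
    print("  " * (level - 1) + f"{title} (page {page})")
```

Each tuple carries everything an HTML table of contents needs: the nesting depth, the target page, and the heading text.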

Next, use pdftoc to convert this plain-text report into HTML:

pdftoc  mydoc.pdf < mydoc_data.txt  >  mydoc_toc.html

Alternatively, you can run these two steps together, like so:

pdftk  mydoc.pdf  dump_data | pdftoc  mydoc.pdf  >  mydoc_toc.html

The first argument to pdftoc is the document location that you want pdftoc to use in its hyperlinks. The previous example assumes that mydoc.pdf and mydoc_toc.html will be in the same directory. You can also give a relative path to your PDF, like so:

pdftoc  ../pdf/mydoc.pdf  <  mydoc_data.txt  >  mydoc_toc.html

or a full URL:

pdftoc  http://pdfhacks.com/pdf/mydoc.pdf  <  mydoc_data.txt  >  mydoc_toc.html
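To make the role of the document location concrete, the following Python sketch builds a nested HTML list in which every link is the location you supply plus a page fragment. It is an illustration only: the function name toc_html and the (level, page, title) tuples are hypothetical, the "#page=N" link form is an assumption based on the open parameter that most PDF viewers honor, and pdftoc's real output may differ. It also assumes bookmark levels start at 1 and deepen one step at a time, as they do in a PDF outline tree.

```python
from html import escape


def toc_html(bookmarks, pdf_location):
    """Render (level, page, title) tuples as nested <ul> lists.

    pdf_location is copied verbatim into every link, so it can be a
    bare filename, a relative path, or a full URL.
    """
    parts = []
    depth = 0
    for level, page, title in bookmarks:
        if level > depth:
            # Going deeper: open a list inside the still-open <li>.
            parts.append("<ul>" * (level - depth))
        else:
            # Same level or shallower: close the previous item,
            # then close one nested list per level we climb out of.
            parts.append("</li>" + "</ul></li>" * (depth - level))
        depth = level
        href = escape(f"{pdf_location}#page={page}", quote=True)
        parts.append(f'<li><a href="{href}">{escape(title)}</a>')
    if depth:
        # Close the last item and every list that is still open.
        parts.append("</li>" + "</ul></li>" * (depth - 1) + "</ul>")
    return "\n".join(parts)


# Hypothetical bookmarks for illustration.
demo = [(1, 1, "Introduction"), (2, 3, "Setup & Use"), (1, 5, "Reference")]
print(toc_html(demo, "../pdf/mydoc.pdf"))
```

Because the location is simply prepended to each fragment, switching from a relative path to a full URL changes every link in the table of contents at once.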

Once readers enter the PDF, they can use its bookmarks for further navigation. To ensure they see your bookmarks, set your PDF to display them upon opening [Hack #62].

You can also add a download link [Hack #68] on the web page that prompts the user to save the PDF on her local disk. As a courtesy to the user, mention the download file size, too.