Add a search feature to your print edition.
Creating a good index for a document is a difficult job, usually performed by professionals. However, an automatically generated index can still be very helpful. Use automatic keywords [Hack #19] or select your own keywords. This hack will locate their pages, build a reference, and then create PDF pages that you can append to your document, as shown in Figure 5-5. It even uses your PDF's page labels (also known as logical page numbering) to ensure trouble-free lookup.
Download and install pdftotext [Hack #19], our kw_index [Hack #19], and pdftk [Hack #79]. You must also have enscript (Windows users can visit http://gnuwin32.sf.net/packages/enscript.htm) and ps2pdf, which comes with Ghostscript [Hack #39]. Our kw_index package includes the kw_catcher and page_refs programs (and source code) that we use in the following sections.
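Before you begin, it can save time to confirm that every tool is actually on your PATH. The following is a minimal sketch for a Bourne-compatible shell; the helper name missing_tools is our own invention for illustration and is not part of the kw_index package.

```shell
#!/bin/sh
# missing_tools: print the name of each given command that is not on the PATH.
# (Hypothetical helper for illustration; not part of kw_index.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# The tools this hack relies on.
missing=`missing_tools pdftk pdftotext kw_catcher page_refs enscript ps2pdf`
if [ -n "$missing" ]; then
  echo "Not found on PATH:" $missing
else
  echo "All tools found."
fi
```

If anything is reported missing, install it or adjust your PATH before continuing.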
First, set your PDF's logical page numbering [Hack #62] to match your document's page numbering. Then, use pdftk to dump this information into a text file, like so:
pdftk mydoc.pdf dump_data output mydoc.data.txt
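It is worth checking that the page labels really made it into the data file, because the index depends on them. The sketch below counts the page-label ranges in a pdftk data dump read from standard input. The helper name count_label_ranges and the sample dump are illustrative assumptions; the exact set of records can vary between pdftk versions, so treat the sample as a rough picture rather than literal output.

```shell
#!/bin/sh
# count_label_ranges: count the page-label ranges in a pdftk data dump
# read from standard input. (Hypothetical helper for illustration.)
count_label_ranges() {
  # grep -c prints 0 but exits nonzero when nothing matches; ignore that.
  grep -c '^PageLabelNewIndex:' || :
}

# Illustrative sample: roman numerals for the front matter,
# then arabic numerals starting at the fifth physical page.
count_label_ranges <<'EOF'
NumberOfPages: 12
PageLabelNewIndex: 1
PageLabelStart: 1
PageLabelNumStyle: LowercaseRomanNumerals
PageLabelNewIndex: 5
PageLabelStart: 1
PageLabelNumStyle: DecimalArabicNumerals
EOF
# prints: 2
```

On your own document, run count_label_ranges < mydoc.data.txt. A result of 0 suggests the PDF has no logical page numbering yet.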
Next, convert your PDF to plain text with pdftotext:
pdftotext mydoc.pdf mydoc.txt
Create a keyword list [Hack #19] from mydoc.txt using kw_catcher, like so:
kw_catcher 12 keywords_only mydoc.txt > mydoc.kw.txt
Edit mydoc.kw.txt to remove duds and add missing keywords. At present, only one keyword is allowed per line. If two or more keywords are adjacent in mydoc.txt, our page_refs program will assemble them into phrases.
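Hand editing tends to leave stray blank lines, duplicates, and the occasional two-word entry. The following sketch tidies a keyword list and flags lines that break the one-keyword-per-line rule. Both helper names are hypothetical, and the sketch keeps the original line order in case it matters to page_refs.

```shell
#!/bin/sh
# tidy_keywords: trim leading/trailing blanks, drop empty lines, and
# remove duplicate keywords while keeping the original order.
tidy_keywords() {
  awk '{ sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "") } $0 != "" && !seen[$0]++'
}

# multiword_lines: print any line that holds more than one word.
multiword_lines() {
  awk 'NF > 1'
}

# Small demonstration on made-up keywords.
printf 'bookmark\n\n  page label \nbookmark\n' | tidy_keywords
# prints:
# bookmark
# page label
```

For example, run tidy_keywords < mydoc.kw.txt > mydoc.kw.clean.txt (an output name chosen here for illustration), then multiword_lines < mydoc.kw.clean.txt to list entries that still need splitting.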
Now pull all these together to create a text index using page_refs:
page_refs mydoc.txt mydoc.kw.txt mydoc.data.txt > mydoc.index.txt
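To get a feel for what the lookup involves, note that pdftotext ends each page of its output with a form feed character, so a page number can be recovered by counting form feeds. The sketch below is not how page_refs is implemented; it is a simplified illustration that reports physical page numbers only, matches case-sensitively, and ignores page labels and phrase assembly. The helper name pages_for_keyword is hypothetical.

```shell
#!/bin/sh
# pages_for_keyword: read pdftotext output on standard input and print
# the physical page numbers on which the given keyword appears.
# (Illustrative only; page_refs does considerably more.)
pages_for_keyword() {
  awk -v kw="$1" '
    BEGIN { RS = "\f" }                      # one record per page
    index($0, kw) { printf "%s%d", sep, NR; sep = ", " }
    END { print "" }
  '
}

# Three made-up pages separated by form feeds.
printf 'intro to bookmarks\fpage labels\fmore bookmarks\f' \
  | pages_for_keyword bookmarks
# prints: 1, 3
```

Running pages_for_keyword on mydoc.txt for a keyword or two gives a quick way to spot-check entries in mydoc.index.txt, keeping in mind that the index reports page labels rather than physical page numbers.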
Finally, create a PDF from mydoc.index.txt using enscript and ps2pdf:
enscript --columns 2 --font 'Times-Roman@10' \
--header '|INDEX' --header-font 'Times-Bold@14' \
--margins 54:54:36:54 --word-wrap --output - mydoc.index.txt \
| ps2pdf - mydoc.index.pdf
Of course, the thing to do is to wrap this procedure into a tidy script. Copy the following Bourne shell script into a file named make_index.sh, and make it executable with chmod 700 make_index.sh. Windows users can get a Bourne shell by installing MSYS [Hack #97].
#!/bin/sh
# make_index.sh, version 1.0
# usage: make_index.sh <PDF filename> <page window>
# requires: pdftk, kw_catcher, page_refs,
#           pdftotext, enscript, ps2pdf
#
# by Ross Presser, Imtek.com
# adapted by Sid Steward
# http://www.pdfhacks.com/kw_index/

fname=`basename $1 .pdf`
pdftk ${fname}.pdf dump_data output ${fname}.data.txt && \
pdftotext ${fname}.pdf ${fname}.txt && \
kw_catcher $2 keywords_only ${fname}.txt \
| page_refs ${fname}.txt - ${fname}.data.txt \
| enscript --columns 2 --font 'Times-Roman@10' \
--header '|INDEX' --header-font 'Times-Bold@14' \
--margins 54:54:36:54 --word-wrap --output - \
| ps2pdf - ${fname}.index.pdf
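The script does not check its arguments, so a mistyped filename only surfaces as an error from one of the tools. If you prefer an early, readable failure, a guard along the following lines could be pasted near the top of make_index.sh. This is a sketch, not part of the original script: the function name check_args is hypothetical, and it assumes the page window is a whole number, as in the examples in this hack.

```shell
#!/bin/sh
# check_args: validate the two arguments expected by make_index.sh.
# Returns 0 when they look usable, 1 (with a message) otherwise.
# (Hypothetical addition; not part of the original script.)
check_args() {
  if [ $# -ne 2 ]; then
    echo "usage: make_index.sh <PDF filename> <page window>" >&2
    return 1
  fi
  case "$1" in
    *.pdf) ;;
    *) echo "error: $1 does not end in .pdf" >&2; return 1 ;;
  esac
  if [ ! -f "$1" ]; then
    echo "error: $1 not found" >&2
    return 1
  fi
  case "$2" in
    ''|*[!0-9]*)
      echo "error: page window must be a whole number" >&2
      return 1 ;;
  esac
  return 0
}

# Inside make_index.sh you would then call:
#   check_args "$@" || exit 1
```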
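Run the script from the directory that holds your PDF, because it discards any directory part of the filename you give it. Pass the name of your PDF document and the kw_catcher window size to make_index.sh like so: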
make_index.sh mydoc.pdf 12
The script will create a document index named mydoc.index.pdf. Review this index and append it to your PDF document [Hack #51] if you desire. The script also creates two intermediate files: mydoc.data.txt and mydoc.txt. If the PDF index is faulty, review these intermediate files for problems. Delete them when you are satisfied with the PDF index.
The second argument to make_index.sh controls the keyword detection sensitivity. Smaller numbers yield fewer keywords at the risk of omitting some keywords; larger numbers admit more keywords and also more noise. [Hack #19] discusses this parameter and the kw_catcher program that uses it.