Hack 23 Identify Related PDFs


Analyze word frequency to find relationships between PDFs.

Organizing a large collection into categories requires firsthand familiarity with every document, a level of care that is rarely practical. Even then, some documents inevitably get filed into the wrong categories.

Here is a pair of Bourne shell scripts that measure the similarity between two PDF documents. You can use them to help categorize PDFs, to help identify misfiled documents, or to suggest related material to your readers. Their logic is easy to reproduce using any scripting language. To install the Bourne shell on Windows, see [Hack #97].

They use the following command-line tools: pdftotext [Hack #19], sed (Windows users visit http://gnuwin32.sf.net/packages/sed.htm), sort, uniq, cat, and wc (Windows users visit http://gnuwin32.sf.net/packages/textutils.htm). These tools are available on most platforms. Here are some brief descriptions of what the tools do:


pdftotext
    Converts PDF to plain text

sed
    Filters text and makes substitutions

sort
    Sorts lines of text files

uniq
    Removes duplicate lines from a sorted file

cat
    Concatenates files

wc
    Prints the number of bytes, words, and lines in a file

The first script, wordlist.sh, takes the filename of a PDF and creates a text file that contains a sorted list of each word that occurs at least twice in the document. Save this script to your disk as wordlist.sh and then apply chmod 700 to it, if necessary:

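#!/bin/sh
# Usage: wordlist.sh file.pdf
# Writes every word that occurs at least twice to file.pdf.words.txt.
# Note: \n in the first sed replacement may not be understood by every sed.
pdftotext "$1" - | \
sed 's/ /\n/g' | \
sed 's/[^A-Za-z]//g' | \
sed '/^$/d' | \
sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/' | \
sort | \
uniq -d > "$1.words.txt"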

First, it converts the PDF to text. Next, it puts each word on its own line. Then, it removes any nonalphabetic characters, so you'll becomes youll. It removes all blank lines and then converts all characters to lowercase. It sorts the words and then creates a list of individual words that appear at least twice. The output filename is the same as the input filename, except the extension .words.txt is added.
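To watch these steps work without a PDF at hand, you can feed a sample sentence through the same pipeline. The following is a minimal sketch, not part of the original scripts: the function name repeated_words and the sample sentence are made up for illustration, and tr stands in for the sed newline substitution and the sed y command, on the assumption that tr behaves more consistently across platforms.

```shell
#!/bin/sh
# repeated_words (hypothetical helper): read text on stdin and print each
# word that occurs at least twice, mirroring wordlist.sh minus pdftotext.
repeated_words() {
    tr ' ' '\n' |               # put each word on its own line
    sed 's/[^A-Za-z]//g' |      # drop nonalphabetic characters
    sed '/^$/d' |               # drop blank lines
    tr 'A-Z' 'a-z' |            # convert to lowercase
    sort |
    uniq -d                     # keep one copy of each repeated word
}

# Sample sentence standing in for pdftotext output.
echo "You'll hear the Sound, and the sound you'll make." | repeated_words
```

The sample prints sound, the, and youll, one per line. Words that appear only once, such as hear and make, are dropped.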

If you call wordlist.sh like this:

wordlist.sh mydoc1.pdf

it creates a text file named mydoc1.pdf.words.txt. For example, the word list for Brian Eno: His Music and the Vertical Color of Sound (http://www.pdfhacks.com/eno/) includes:

anything
anyway
anywhere
apart
aperiodic
aphorisms
apollo
apparatus
apparent
apparently
appeal
appear

The second script, percent_overlap.sh, compares two word lists and reports what percentage of words they share. If you compare a document to itself, its overlap is 100%. The percentage is calculated against the length of the shorter word list, so if you compare one chapter of a long document to the entire document, the script also reports a 100% overlap.

Any two totally unrelated documents might still overlap by 35%. This makes sense, because all documents in the same language use many of the same words. Two unrelated novels might have considerable overlap. Two unrelated technical documents would not.

In this next Bourne shell script, note that we use backtick characters (`), not apostrophes ('). The backtick character usually shares its key with the tilde (~) on your keyboard.


Save this script to your disk as percent_overlap.sh and then apply chmod 700 to it, if necessary:

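#!/bin/sh
# Usage: percent_overlap.sh list1.words.txt list2.words.txt
# Prints the percentage of the shorter word list that the other list shares.
num_words_1=`cat "$1" | wc -l`
num_words_2=`cat "$2" | wc -l`
# Each list holds a word only once, so duplicates in the merged list are shared words.
num_common_words=`sort "$1" "$2" | uniq -d | wc -l`

if [ $num_words_1 -lt $num_words_2 ]
then echo $(( 100 * $num_common_words/$num_words_1 ))
else echo $(( 100 * $num_common_words/$num_words_2 ))
fi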

Run percent_overlap.sh like this, and it returns the overlap between the two documents as a single number (in this example, the overlap is 38%):

$ percent_overlap.sh mydoc1.pdf.words.txt mydoc2.pdf.words.txt

38
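To see how the number is derived, it can help to work through the arithmetic on two tiny word lists. The sketch below uses made-up lists and hypothetical filenames (book.words.txt and chapter.words.txt) in a temporary directory, and it assumes mktemp is available on your system. It repeats the calculation from percent_overlap.sh one step at a time.

```shell
#!/bin/sh
# Worked example with two tiny, made-up word lists (hypothetical data).
workdir=`mktemp -d`
printf '%s\n' apparent appeal music sound tape > "$workdir/book.words.txt"
printf '%s\n' music sound studio > "$workdir/chapter.words.txt"

# A word that appears twice in the merged, sorted stream is in both lists.
# tr strips the padding that some wc implementations add.
common=`sort "$workdir/book.words.txt" "$workdir/chapter.words.txt" | uniq -d | wc -l | tr -d ' '`
n1=`wc -l < "$workdir/book.words.txt" | tr -d ' '`
n2=`wc -l < "$workdir/chapter.words.txt" | tr -d ' '`

# Divide by the length of the shorter list.
if [ $n1 -lt $n2 ]; then shorter=$n1; else shorter=$n2; fi
percent=$(( 100 * $common / $shorter ))

echo "common=$common shorter=$shorter overlap=$percent%"
rm -r "$workdir"
```

Two of the three words in the shorter list also appear in the longer one, so the script prints common=2 shorter=3 overlap=66%. Shell arithmetic works on integers, so 66.67 is truncated to 66 rather than rounded.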

If you do this on multiple documents, you can see a variety of relationships emerge. For example, Table 2-1 shows the overlaps between various documents on my computer.

Table 2-1. The results of comparing various documents with percent_overlap.sh

                                          A    B    C    D    E    F    G
A=PDF Reference, 1.4                    100   98   65   36   48   50   35
B=PDF Reference, 1.5                     98  100   67   37   51   52   34
C=PostScript Reference, Third Edition    65   67  100   38   47   49   36
D=The ANSI C++ Specification             36   37   38  100   38   40   25
E=Corporate Annual Report #1             48   51   47   38  100   62   49
F=Corporate Annual Report #2             50   52   49   40   62  100   52
G=Brian Eno Book by Eric Tamm            35   34   36   25   49   52  100
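One way to gather numbers like these is to loop over every pair of word lists in a directory. The following is a sketch under stated assumptions rather than the method used to build the table: the function names percent_overlap and overlap_rows are hypothetical, the calculation is copied into a function so the example is self-contained, and the demo lists are made up. Like the script it mirrors, it assumes no word list is empty.

```shell
#!/bin/sh
# percent_overlap LIST1 LIST2: the same calculation as percent_overlap.sh,
# wrapped in a function (hypothetical name) so this sketch stands alone.
percent_overlap() {
    n1=`cat "$1" | wc -l`
    n2=`cat "$2" | wc -l`
    common=`sort "$1" "$2" | uniq -d | wc -l`
    if [ $n1 -lt $n2 ]
    then echo $(( 100 * $common / $n1 ))
    else echo $(( 100 * $common / $n2 ))
    fi
}

# overlap_rows DIR: print "list1 list2 percent" for every ordered pair of
# *.words.txt files in DIR.
overlap_rows() {
    for a in "$1"/*.words.txt; do
        for b in "$1"/*.words.txt; do
            pct=`percent_overlap "$a" "$b"`
            echo "${a##*/} ${b##*/} $pct"
        done
    done
}

# Demo with two made-up word lists in a temporary directory.
demo=`mktemp -d`
printf '%s\n' font page stream > "$demo/a.pdf.words.txt"
printf '%s\n' font page report revenue > "$demo/b.pdf.words.txt"
overlap_rows "$demo"
rm -r "$demo"
```

Each output line gives one cell of the comparison. Because the percentage is always taken against the shorter list, comparing A to B gives the same number as comparing B to A, which is why Table 2-1 is symmetric.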