Analyze word frequency to find relationships between PDFs.
Organizing a large collection into categories requires firsthand familiarity with every document. That level of care usually isn't possible, and in any case some documents inevitably get filed into the wrong categories.
Here is a pair of Bourne shell scripts that measure the similarity between two PDF documents. You can use them to help categorize PDFs, to help identify misfiled documents, or to suggest related material to your readers. Their logic is easy to reproduce using any scripting language. To install the Bourne shell on Windows, see [Hack #97] .
They use the following command-line tools: pdftotext [Hack #19], sed (Windows users visit http://gnuwin32.sf.net/packages/sed.htm), sort, uniq, cat, and wc (Windows users visit http://gnuwin32.sf.net/packages/textutils.htm). These tools are available on most platforms. Here are some brief descriptions of what the tools do:
pdftotext: Converts PDF to plain text
sed: Filters text and makes substitutions
sort: Sorts lines of text files
uniq: Removes duplicate lines from a sorted file
cat: Concatenates files
wc: Prints the number of bytes, words, and lines in a file
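Before running the scripts, you can confirm that each of these tools is on your PATH. This is just a quick check using the POSIX command -v builtin, not part of the hack itself:

```shell
#!/bin/sh
# Report any of the required tools that cannot be found on the PATH
for tool in pdftotext sed sort uniq cat wc; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done
```

If the loop prints nothing, you are ready to go.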
The first script, wordlist.sh, takes the filename of a PDF and creates a text file that contains a sorted list of each word that occurs at least twice in the document. Save this script to your disk as wordlist.sh and then apply chmod 700 to it, if necessary:
#!/bin/sh
pdftotext "$1" - | \
sed 's/ /\n/g' | \
sed 's/[^A-Za-z]//g' | \
sed '/^$/d' | \
sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/' | \
sort | \
uniq -d > "$1".words.txt
First, it converts the PDF to text. Next, it puts each word on its own line. Then, it removes any nonalphabetic characters, so you'll becomes youll. It removes all blank lines and then converts all characters to lowercase. It sorts the words and then creates a list of individual words that appear at least twice. The output filename is the same as the input filename, except the extension .words.txt is added.
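To see what each stage does, you can feed the same pipeline a line of sample text in place of pdftotext output. The sentence here is made up purely for illustration:

```shell
# Run the wordlist.sh pipeline on a sample sentence instead of a PDF
echo "Hello world, hello World hello" | \
sed 's/ /\n/g' | \
sed 's/[^A-Za-z]//g' | \
sed '/^$/d' | \
sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/' | \
sort | \
uniq -d
```

The output is the two words that appear at least twice: hello and world, each on its own line. (Note that using \n in a sed replacement, as this script does, works with GNU sed; some other sed implementations treat it differently.)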
If you call wordlist.sh like this:
wordlist.sh mydoc1.pdf
it creates a text file named mydoc1.pdf.words.txt. For example, the word list for Brian Eno: His Music and the Vertical Color of Sound (http://www.pdfhacks.com/eno/) includes:
anything
anyway
anywhere
apart
aperiodic
aphorisms
apollo
apparatus
apparent
apparently
appeal
appear
The second script, percent_overlap.sh, compares two word lists and reports what percentage of words they share. If you compare a document to itself, its overlap is 100%. The percentage is calculated against the length of the shorter word list, so comparing a single chapter to the entire long document it came from would also report a 100% overlap.
Given any two totally unrelated documents, their overlap still might be 35%. This also makes sense, because all documents in the same language share many common words. Two unrelated fiction novels might have considerable overlap; two unrelated technical documents would not.
Save this script to your disk as percent_overlap.sh and then apply chmod 700 to it, if necessary:
#!/bin/sh
num_words_1=`cat "$1" | wc -l`
num_words_2=`cat "$2" | wc -l`
num_common_words=`sort "$1" "$2" | uniq -d | wc -l`
if [ $num_words_1 -lt $num_words_2 ]
then
  echo $(( 100 * $num_common_words / $num_words_1 ))
else
  echo $(( 100 * $num_common_words / $num_words_2 ))
fi
Run percent_overlap.sh like this, and it returns the overlap between the two documents as a single number (in this example, the overlap is 38%):
$ percent_overlap.sh mydoc1.pdf.words.txt mydoc2.pdf.words.txt
38
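With word lists for several documents in hand, a small wrapper loop can compute the overlap for every pair. Here is one possible sketch that inlines the percent_overlap.sh logic; the two tiny word lists it creates are stand-ins for real wordlist.sh output:

```shell
#!/bin/sh
# Sketch: print the overlap for every pair of word lists in the
# current directory. The sample lists below are fabricated for
# demonstration; in practice they would come from wordlist.sh.
printf 'apple\nbanana\ncherry\n' > doc1.words.txt
printf 'banana\ncherry\ndate\nelder\n' > doc2.words.txt

for a in *.words.txt; do
  for b in *.words.txt; do
    # process each unordered pair exactly once
    if expr "$a" \< "$b" >/dev/null; then
      n1=`wc -l < "$a"`
      n2=`wc -l < "$b"`
      common=`sort "$a" "$b" | uniq -d | wc -l`
      if [ "$n1" -lt "$n2" ]; then min=$n1; else min=$n2; fi
      echo "$a $b $(( 100 * common / min ))"
    fi
  done
done
```

For the two sample lists, the common words are banana and cherry, the shorter list has three words, and the loop prints an overlap of 66 (100 * 2 / 3, rounded down by integer arithmetic).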
If you do this on multiple documents, you can see a variety of relationships emerge. For example, Table 2-1 shows the overall overlaps between various documents on my computer.
Table 2-1. Percent overlap between word lists

|                                       | A   | B   | C   | D   | E   | F   | G   |
|---------------------------------------|-----|-----|-----|-----|-----|-----|-----|
| A=PDF Reference, 1.4                  | 100 | 98  | 65  | 36  | 48  | 50  | 35  |
| B=PDF Reference, 1.5                  | 98  | 100 | 67  | 37  | 51  | 52  | 34  |
| C=PostScript Reference, Third Edition | 65  | 67  | 100 | 38  | 47  | 49  | 36  |
| D=The ANSI C++ Specification          | 36  | 37  | 38  | 100 | 38  | 40  | 25  |
| E=Corporate Annual Report #1          | 48  | 51  | 47  | 38  | 100 | 62  | 49  |
| F=Corporate Annual Report #2          | 50  | 52  | 49  | 40  | 62  | 100 | 52  |
| G=Brian Eno Book by Eric Tamm         | 35  | 34  | 36  | 25  | 49  | 52  | 100 |