Analyze word frequency to find relationships between PDFs.
Organizing a large collection into categories requires firsthand familiarity with every document. That level of care usually isn't possible, and in any case some documents inevitably get filed into the wrong categories.
Here is a pair of Bourne shell scripts that measure the similarity between two PDF documents. You can use them to help categorize PDFs, to help identify misfiled documents, or to suggest related material to your readers. Their logic is easy to reproduce using any scripting language. To install the Bourne shell on Windows, see [Hack #97] .
They use the following command-line tools: pdftotext [Hack #19], sed (Windows users visit http://gnuwin32.sf.net/packages/sed.htm), sort, uniq, cat, and wc (Windows users visit http://gnuwin32.sf.net/packages/textutils.htm). These tools are available on most platforms. Here are some brief descriptions of what the tools do:
pdftotext: Converts PDF to plain text
sed: Filters text and makes substitutions
sort: Sorts lines of text files
uniq: Removes duplicate lines from a sorted file
cat: Concatenates files
wc: Prints the number of bytes, words, and lines in a file
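Before running the scripts, you can confirm that each of these tools is on your PATH. This is just a quick check using the POSIX command -v builtin, not part of the hack itself:

```shell
#!/bin/sh
# Report any of the required tools that cannot be found on the PATH
for tool in pdftotext sed sort uniq cat wc; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done
```

If the loop prints nothing, you are ready to go.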
The first script, wordlist.sh, takes the filename of a PDF and creates a text file that contains a sorted list of each word that occurs at least twice in the document. Save this script to your disk as wordlist.sh and then apply chmod 700 to it, if necessary:
#!/bin/sh
pdftotext "$1" - | \
sed 's/ /\n/g' | \
sed 's/[^A-Za-z]//g' | \
sed '/^$/d' | \
sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/' | \
sort | \
uniq -d > "$1".words.txt
First, it converts the PDF to text. Next, it puts each word on its own line. Then, it removes any nonalphabetic characters, so you'll becomes youll. It removes all blank lines and then converts all characters to lowercase. It sorts the words and then creates a list of individual words that appear at least twice. The output filename is the same as the input filename, except the extension .words.txt is added.
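To see what each stage does, you can feed the same pipeline a line of sample text in place of pdftotext output. The sentence here is made up purely for illustration:

```shell
# Run the wordlist.sh pipeline on a sample sentence instead of a PDF
echo "Hello world, hello World hello" | \
sed 's/ /\n/g' | \
sed 's/[^A-Za-z]//g' | \
sed '/^$/d' | \
sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/' | \
sort | \
uniq -d
```

The output is the two words that appear at least twice: hello and world, each on its own line. (Note that using \n in a sed replacement, as this script does, works with GNU sed; some other sed implementations treat it differently.)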
If you call wordlist.sh like this:
wordlist.sh mydoc1.pdf
it creates a text file named mydoc1.pdf.words.txt. For example, the word list for Brian Eno: His Music and the Vertical Color of Sound (http://www.pdfhacks.com/eno/) includes:
anything
anyway
anywhere
apart
aperiodic
aphorisms
apollo
apparatus
apparent
apparently
appeal
appear
The second script, percent_overlap.sh, compares two word lists and reports what percentage of words they share. If you compare a document to itself, its overlap is 100%. The percentage is calculated against the length of the shorter word list, so comparing a single chapter to the entire long document it came from would also report a 100% overlap.
Given any two totally unrelated documents, their overlap still might be 35%. This also makes sense, because all documents in the same language share many common words. Two unrelated fiction novels might have considerable overlap; two unrelated technical documents would not.
Save this script to your disk as percent_overlap.sh and then apply chmod 700 to it, if necessary:
#!/bin/sh
num_words_1=`cat "$1" | wc -l`
num_words_2=`cat "$2" | wc -l`
num_common_words=`sort "$1" "$2" | uniq -d | wc -l`
if [ $num_words_1 -lt $num_words_2 ]
then
  echo $(( 100 * $num_common_words / $num_words_1 ))
else
  echo $(( 100 * $num_common_words / $num_words_2 ))
fi
Run percent_overlap.sh like this, and it returns the overlap between the two documents as a single number (in this example, the overlap is 38%):
$ percent_overlap.sh mydoc1.pdf.words.txt mydoc2.pdf.words.txt
38
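With word lists for several documents in hand, a small wrapper loop can compute the overlap for every pair. Here is one possible sketch that inlines the percent_overlap.sh logic; the two tiny word lists it creates are stand-ins for real wordlist.sh output:

```shell
#!/bin/sh
# Sketch: print the overlap for every pair of word lists in the
# current directory. The sample lists below are fabricated for
# demonstration; in practice they would come from wordlist.sh.
printf 'apple\nbanana\ncherry\n' > doc1.words.txt
printf 'banana\ncherry\ndate\nelder\n' > doc2.words.txt

for a in *.words.txt; do
  for b in *.words.txt; do
    # process each unordered pair exactly once
    if expr "$a" \< "$b" >/dev/null; then
      n1=`wc -l < "$a"`
      n2=`wc -l < "$b"`
      common=`sort "$a" "$b" | uniq -d | wc -l`
      if [ "$n1" -lt "$n2" ]; then min=$n1; else min=$n2; fi
      echo "$a $b $(( 100 * common / min ))"
    fi
  done
done
```

For the two sample lists, the common words are banana and cherry, the shorter list has three words, and the loop prints an overlap of 66 (100 * 2 / 3, rounded down by integer arithmetic).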
If you do this on multiple documents, you can see a variety of relationships emerge. For example, Table 2-1 shows the overall overlaps between various documents on my computer.
Table 2-1. Percent overlap between word lists

|                                       | A   | B   | C   | D   | E   | F   | G   |
|---------------------------------------|-----|-----|-----|-----|-----|-----|-----|
| A=PDF Reference, 1.4                  | 100 | 98  | 65  | 36  | 48  | 50  | 35  |
| B=PDF Reference, 1.5                  | 98  | 100 | 67  | 37  | 51  | 52  | 34  |
| C=PostScript Reference, Third Edition | 65  | 67  | 100 | 38  | 47  | 49  | 36  |
| D=The ANSI C++ Specification          | 36  | 37  | 38  | 100 | 38  | 40  | 25  |
| E=Corporate Annual Report #1          | 48  | 51  | 47  | 38  | 100 | 62  | 49  |
| F=Corporate Annual Report #2          | 50  | 52  | 49  | 40  | 62  | 100 | 52  |
| G=Brian Eno Book by Eric Tamm         | 35  | 34  | 36  | 25  | 49  | 52  | 100 |