Hack 7 Copy Data from PDF Pages


Extract data from PDF files and use it in your own documents or spreadsheets.

Copying data from one electronic document and pasting it into another should be painless and predictable, like the process shown in Figure 1-7. Trying to copy data from a PDF, however, can be frustrating. The solution for Acrobat 6 and Adobe Reader users (on Windows, anyway) comes from an unlikely source: Acrobat 5.

Figure 1-7. TAPS faithfully copying formatted text and tables using Acrobat or Reader

Acrobat 5 includes the excellent TAPS text/table selection plug-in. Acrobat 6 does not. Because Acrobat plug-ins are modular, you can copy the TAPS folder (named Table) from the Acrobat 5 plug_ins folder [Hack #4] and paste it into the Acrobat 6 plug_ins folder. Voilà! Don't have Acrobat 5? The TAPS license permits liberal distribution, so visit http://www.pdfhacks.com/TAPS/ to view the license and download a copy. Don't have Acrobat 6, either? Use Adobe Reader instead. TAPS works in both Acrobat and Reader. Who would have guessed?

1.8.1 Adobe Reader 5 and 6

Adobe Reader gives you a single, simple Text Select tool that works well on single lines of text but not on tables or paragraphs. Sometimes it selects more text than you want. For greater control, hold down the Alt key (Version 6) or the Ctrl key (Version 5) and drag out a selection rectangle. Multiline paragraphs copied with this tool do not preserve their flow: pasted into Word, each line becomes its own paragraph. Yuck!
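If you are stuck with text that was already copied with the simple Text Select tool, a small script can rejoin the broken lines. The following is a minimal sketch, not part of TAPS or Reader. The function name reflow is hypothetical, and it assumes that a blank line marks a real paragraph break and that every other line break is an unwanted hard wrap. It does not repair words hyphenated across lines.

```python
import re


def reflow(pasted_text):
    """Join hard-wrapped lines into paragraphs.

    Assumption: blank lines separate real paragraphs; every other
    line break is treated as an unwanted hard wrap.
    """
    # Split on one or more blank lines to find paragraph boundaries.
    paragraphs = re.split(r"\n\s*\n", pasted_text.strip())
    # Inside each paragraph, collapse all whitespace runs (including
    # newlines) into single spaces.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```

Text copied without any blank lines comes back as one long paragraph, so check the result before pasting it into your document.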

You need the TAPS plug-in, which copies paragraphs and tables with fidelity. Copy the entire Table folder from your Acrobat 5 plug-ins directory (e.g., C:\Program Files\Adobe\Acrobat 5.0\Acrobat\plug_ins\Table) into your Reader plug-ins directory (e.g., C:\Program Files\Adobe\Acrobat 6.0\Reader\plug_ins). Restart Reader.
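The copy step above can also be scripted, which is handy if you maintain several machines. This is a minimal sketch: the function name install_plugin is hypothetical, and the folder locations are whatever your own installation uses (the paths in the text are only examples). It refuses to guess at missing folders and leaves an existing copy untouched.

```python
import shutil
from pathlib import Path


def install_plugin(source_dir, plugins_dir):
    """Copy a plug-in folder into a plug_ins directory.

    Returns the path of the installed folder. Both arguments are
    supplied by the caller; nothing here detects where Acrobat or
    Reader is installed.
    """
    source = Path(source_dir)
    target_root = Path(plugins_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"plug-in folder not found: {source}")
    if not target_root.is_dir():
        raise FileNotFoundError(f"plug_ins directory not found: {target_root}")

    destination = target_root / source.name
    if destination.exists():
        # Already installed; leave the existing copy untouched.
        return destination

    # Copy the whole folder, including any subfolders.
    shutil.copytree(source, destination)
    return destination
```

With the example paths from the text, the call would be install_plugin(r"C:\Program Files\Adobe\Acrobat 5.0\Acrobat\plug_ins\Table", r"C:\Program Files\Adobe\Acrobat 6.0\Reader\plug_ins"). You still need to restart Reader afterward.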

If you don't have Acrobat 5, visit http://www.pdfhacks.com/TAPS/ and download Acrobat_5_TAPS.zip. Unzip, and then move the resulting TAPS folder into your Reader plug_ins directory. Restart Reader. You'll now have the Table/Formatted Text Select Tool, as shown in Figure 1-8.
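The unzip step can be scripted as well. The sketch below assumes you have already downloaded the archive yourself; the function name unzip_into is hypothetical, and the layout of the archive (a single top-level folder) is an assumption based on the description above, so the function reports the top-level names it extracted for you to check.

```python
import zipfile
from pathlib import Path


def unzip_into(zip_path, plugins_dir):
    """Extract a downloaded archive directly into a plug_ins directory.

    Returns the sorted top-level names found in the archive so the
    caller can confirm which folder was created.
    """
    target_root = Path(plugins_dir)
    if not target_root.is_dir():
        raise FileNotFoundError(f"plug_ins directory not found: {target_root}")

    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(target_root)

    # Zip entries always use forward slashes, whatever the platform.
    return sorted({name.split("/")[0] for name in names if name})
```

Extracting straight into the plug_ins directory replaces the separate unzip-and-move steps. Restart Reader once the folder is in place.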

Figure 1-8. TAPS adding the Table/Formatted Text Select Tool under your Select Text button

The next section provides tips on how to use TAPS.

1.8.2 Acrobat 5

Acrobat 5 provides the same simple Text Select tool that Reader has. Use this basic tool for copying small amounts of unformatted text, as described previously in this hack.

For copying large amounts of formatted text, use the Table/Formatted Text Select (a.k.a. TAPS) tool. You can use it on paragraphs, columns, and tables. It preserves paragraph flow and text styles. Check its preferences (Edit → Preferences → Table/Formatted Text...) to be sure you are getting the best performance for your purposes.

Activate the TAPS tool, then click and drag a rectangle around the text you want copied. Release the mouse and your rectangle turns into a resizable zone. There are two types of zones: Table (blue) and Text (green). If the tool's autodetection creates the wrong type of zone, right-click the zone and a context menu opens where you can configure it manually.

Copy the selection to the clipboard or drag-and-drop it into your target program.

1.8.3 Acrobat 6

Something went wrong with Acrobat 6 text selection. Adobe dropped the Table/Formatted Text Select tool (a.k.a. TAPS) and added the Select Table tool (a.k.a. TablePicker). This new tool is slow and performs poorly on many PDFs.

The solution is to get a copy of TAPS and install it into Acrobat 6. Section 1.8.1 explains how to find and install TAPS. Section 1.8.2 explains how to use TAPS.

A PDF owner can secure his document to prevent others from copying the document's text. In such cases, the text selection tools will be disabled. See [Hack #52] for a discussion on PDF security.

1.8.4 Selecting Text from Scanned Pages

If your document pages are bitmap images instead of text, try using Acrobat's Paper Capture OCR tool. It will convert page images into live text, though the quality of the conversion varies with the clarity of the bitmap image. You can tell when a page is a bitmap image by activating the Text Select tool and then selecting all text (Edit → Select All). If the page has any text on it, the tool will highlight it. If nothing gets highlighted, yet the page appears to contain text, it is probably a bitmap image.

Sometimes, page text is created using vector drawings. This kind of text is not live text (so you can't copy it) and it also does not respond to OCR.

Acrobat 6 users can begin capturing a PDF by selecting Document → Paper Capture → Start Capture.... Unlike Acrobat 5, Acrobat 6 has no built-in limit on the number of pages you can OCR.

Acrobat 5 users (on Windows) must download the Paper Capture plug-in from Adobe. Select Tools → Download Paper Capture Plug-in, and a web page will open with instructions and a download link. Or, download it directly from http://www.adobe.com/support/downloads/detail.jsp?ftpID=1907. This plug-in will OCR only 50 pages per PDF document.
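If a scanned document is longer than 50 pages, one possible approach is to split it into smaller PDFs and capture each piece separately. That workaround is an assumption on our part, not something the plug-in documents, and the splitting itself must be done with another tool. The sketch below only works out the page ranges; the function name capture_batches is hypothetical.

```python
def capture_batches(total_pages, limit=50):
    """Return 1-based, inclusive (first, last) page ranges.

    Each range covers at most `limit` pages. The default of 50
    matches the per-document limit described above.
    """
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [
        (start, min(start + limit - 1, total_pages))
        for start in range(1, total_pages + 1, limit)
    ]
```

For a 120-page scan, capture_batches(120) gives the ranges 1-50, 51-100, and 101-120.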