Hack 8 Convert PDF Documents to Word


Automatically scrape clipboard data into a new Word document.

In general, PDFs aren't as smart as they appear. Unless they are tagged [Hack #34], they have no concept of paragraph, table, or column. This becomes a problem only when you must create a new document using material from an old document. Ideally, you would use the old document's source file, or maybe even its HTML edition. This isn't always possible, however. Sometimes you have only a PDF to work with.

1.9.1 Save As . . . DOC, RTF, HTML

Adobe Acrobat 6 enables you to convert your PDF to many different formats with the Save As . . . dialog. These filters work best when the PDF is tagged. Try one to see if it suits your requirements. Adobe Reader enables you to convert your PDF to text by selecting File → Save As Text . . . .

If your PDF is not tagged, Acrobat uses an inference engine to assemble the letters into words and the words into paragraphs. It tries to detect and create tables. It works best on documents with very simple formatting. Tables and formatted pages generally don't survive.

1.9.2 The Human Touch

Fully automatic conversion of PDF to a structured format such as Word's DOC is not generally possible: the converter would have to infer every paragraph, table, and column from page layout alone. One workaround is to break the problem down to the point where the automation has a chance. The TAPS tool [Hack #7] works well because you meet the automation halfway. You tell it where the table is and it creates a table from the given data. This approach can be scaled to fit the larger problem of converting entire documents.
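The same halfway idea can be tried inside Word itself: you select the rows, and Word builds the table. The following is a minimal sketch, not how TAPS works internally. It assumes the selected text is tab-delimited with one row per paragraph, which text copied from a PDF often is not until you insert the tabs yourself. The macro name SelectionToTable is hypothetical.

```vba
Option Explicit

' Hypothetical helper; not part of TAPS or AutoPasteLoop.
' Assumes the current selection holds tab-delimited text,
' one table row per paragraph.
Sub SelectionToTable()
    ' Refuse to run on an insertion point or a non-text selection
    If Selection.Type <> wdSelectionNormal Then
        MsgBox "Select the tab-delimited rows first."
        Exit Sub
    End If

    ' Let Word split each paragraph into cells at the tab characters
    Selection.ConvertToTable Separator:=wdSeparateByTabs
End Sub
```

You decide where the table begins and ends; Word only has to split rows into cells, which is a problem small enough to automate reliably.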

1.9.3 Scrape the Clipboard into a New Document with AutoPasteLoop

Copy/Paste works fine for a few items, but it grows cumbersome when processing several pages of data. AutoPasteLoop is a Word macro that watches the clipboard for new data and then immediately pastes it into your new document. Instead of copy/paste, copy/paste, copy/paste, you can just copy, copy, copy. Word automatically pastes, pastes, pastes.

Scott Tupaj has ported AutoPasteLoop to OpenOffice. Download the code from http://www.pdfhacks.com/autopaste/.


Create a new Word macro named AutoPasteLoop in Normal.dot and program it like this:

'AutoPasteLoop, version 1.0
'Visit: http://www.pdfhacks.com/autopaste/
'
'Start AutoPasteLoop from MS Word and switch to Adobe Reader or Acrobat.
'Copy the material you want, and AutoPasteLoop will automatically
'paste it into the target Word document.  When you are done, switch back
'to MS Word and AutoPasteLoop will stop.

Option Explicit

' declare Win32 API functions that we need;
' Sleep returns nothing, so it is declared as a Sub
' (64-bit Office also requires the PtrSafe keyword on each Declare)
Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
Declare Function GetForegroundWindow Lib "user32" () As Long
Declare Function GetOpenClipboardWindow Lib "user32" () As Long
Declare Function GetClipboardOwner Lib "user32" () As Long

Sub AutoPasteLoop()
    'the HWND of the application we're pasting into (MS Word)
    Dim AppHwnd As Long
    'assume that we are executed from the target app.
    AppHwnd = GetForegroundWindow()

    'keep track of whether the user switches out
    'of the target application (MS Word).
    Dim SwitchedApp As Boolean
    SwitchedApp = False

    'reset this to stop looping
    Dim KeepLooping As Boolean
    KeepLooping = True

    'the HWND of our target document; GetClipboardOwner returns the
    'HWND of the app. that most recently owned the clipboard;
    'changing the clipboard's contents (Cut) makes us the "owner"
    '
    'note that "owning" the clipboard doesn't mean that it's locked
    '
    Dim DocHwnd As Long
    Selection.TypeText Text:="abc"
    Selection.MoveLeft Unit:=wdCharacter, Count:=3, Extend:=wdExtend
    Selection.Cut
    DocHwnd = GetClipboardOwner()

    Do While KeepLooping
        Sleep 200 'milliseconds; 200 msec == 1/5 sec

        'if the user switches away from the target
        'application and then switches back, stop looping
        '
        Dim ActiveHwnd As Long
        ActiveHwnd = GetForegroundWindow()
        If ActiveHwnd = AppHwnd Then
            If SwitchedApp Then KeepLooping = False
        Else
            SwitchedApp = True
        End If

        'if the clipboard owner has changed, then somebody else
        'has put something on it; if the clipboard resource isn't
        'locked (GetOpenClipboardWindow), then paste its contents
        'into our document; use Copy to change the clipboard owner
        'back to DocHwnd
        '
        If GetClipboardOwner() <> DocHwnd And _
        GetOpenClipboardWindow() = 0 Then
            Selection.Paste
            Selection.MoveLeft Unit:=wdCharacter, Count:=1, Extend:=wdExtend
            Selection.Copy
            Selection.Collapse wdCollapseEnd
        End If
    Loop
End Sub

1.9.4 Running AutoPasteLoop

Open a new Word document. Start AutoPasteLoop by opening the Macros dialog box (Tools → Macro → Macros . . . ), selecting the macro name AutoPasteLoop, and clicking Run. While the loop is running, you cannot interact with Word. Stop the loop by switching to another application and then switching back to Word.

Start the loop. Switch to Acrobat (or Reader) and use its tools to individually select and copy its columns, tables, paragraphs, and images. Switch back to Word and you should find all of your selections pasted into the new document. Start AutoPasteLoop again if you want to copy more material.

1.9.5 Hacking AutoPasteLoop

Add content filters or your own inference logic to the AutoPasteLoop macro. Use your knowledge of the input documents to tailor the loop, so it creates documents that require less postprocessing.
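As one example of a content filter, the sketch below joins the hard line breaks that often appear when a paragraph is copied from a PDF one line at a time. It is an illustration, not part of AutoPasteLoop 1.0. The name UnwrapPastedText is hypothetical, and the filter assumes that each copied selection is a single paragraph, so every paragraph mark inside it is unwanted.

```vba
' Hypothetical filter; not part of AutoPasteLoop 1.0.
' Replaces every paragraph mark inside Pasted with a space, so text
' that arrived as several short lines becomes one paragraph.
Sub UnwrapPastedText(ByVal Pasted As Range)
    With Pasted.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = "^p"                'paragraph mark
        .Replacement.Text = " "
        .Forward = True
        .Wrap = wdFindStop          'do not continue past the range
        .MatchWildcards = False
        .Execute Replace:=wdReplaceAll
    End With
End Sub
```

To try it, record where the paste begins and pass the pasted range to the filter. In the macro's paste block, the Selection.Paste line would become something like this:

```vba
Dim StartPos As Long
StartPos = Selection.Start
Selection.Paste
'leave the final character alone so a trailing paragraph mark survives
If Selection.Start - StartPos > 1 Then
    UnwrapPastedText ActiveDocument.Range(StartPos, Selection.Start - 1)
End If
```

Because each paragraph mark is swapped for a single space, the text length does not change and the rest of the loop (MoveLeft, Copy, Collapse) can stay as it is.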

AutoPasteLoop isn't just a PDF hack. It works with any program that can copy content to the clipboard.