Automatically scrape clipboard data into a new Word document.
In general, PDFs aren't as smart as they appear. Unless they are tagged [Hack #34], they have no concept of paragraph, table, or column. This becomes a problem only when you must create a new document using material from an old document. Ideally, you would use the old document's source file, or maybe even its HTML edition. This isn't always possible, however. Sometimes you have only a PDF to work with.
Adobe Acrobat 6 enables you to convert your PDF to many different formats with the Save As . . . dialog. These filters work best when the PDF is tagged. Try one to see if it suits your requirements. Adobe Reader enables you to convert your PDF to text by selecting File → Save As Text . . . .
If your PDF is not tagged, Acrobat uses an inference engine to assemble the letters into words and the words into paragraphs. It tries to detect and create tables. It works best on documents with very simple formatting. Tables and formatted pages generally don't survive.
Fully automatic conversion of PDF to a structured format such as Word's DOC is not generally possible because the problem is too big. One workaround is to break the problem down to the point where the automation has a chance. The TAPS tool [Hack #7] works well because you meet the automation halfway. You tell it where the table is and it creates a table from the given data. This approach can be scaled to fit the larger problem of converting entire documents.
Copy/Paste works fine for a few items, but it grows cumbersome when processing several pages of data. AutoPasteLoop is a Word macro that watches the clipboard for new data and then immediately pastes it into your new document. Instead of copy/paste, copy/paste, copy/paste, you can just copy, copy, copy. Word automatically pastes, pastes, pastes.
Create a new Word macro named AutoPasteLoop in Normal.dot and program it like this:
'AutoPasteLoop, version 1.0
'Visit: http://www.pdfhacks.com/autopaste/
'
'Start AutoPasteLoop from MS Word and switch to Adobe Reader or Acrobat.
'Copy the material you want, and AutoPasteLoop will automatically
'paste it into the target Word document. When you are done, switch back
'to MS Word and AutoPasteLoop will stop.

Option Explicit

' declare Win32 API functions that we need
' (Sleep returns nothing, so it is declared as a Sub)
Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
Declare Function GetForegroundWindow Lib "user32" () As Long
Declare Function GetOpenClipboardWindow Lib "user32" () As Long
Declare Function GetClipboardOwner Lib "user32" () As Long

Sub AutoPasteLoop()

  'the HWND of the application we're pasting into (MS Word)
  Dim AppHwnd As Long
  'assume that we are executed from the target app.
  AppHwnd = GetForegroundWindow()

  'keep track of whether the user switches out
  'of the target application (MS Word).
  Dim SwitchedApp As Boolean
  SwitchedApp = False

  'reset this to stop looping
  Dim KeepLooping As Boolean
  KeepLooping = True

  'the HWND of our target document; GetClipboardOwner returns the
  'HWND of the app that most recently owned the clipboard;
  'changing the clipboard's contents (Cut) makes us the "owner"
  '
  'note that "owning" the clipboard doesn't mean that it's locked
  '
  Dim DocHwnd As Long
  Selection.TypeText Text:="abc"
  Selection.MoveLeft Unit:=wdCharacter, Count:=3, Extend:=wdExtend
  Selection.Cut
  DocHwnd = GetClipboardOwner()

  Do While KeepLooping

    Sleep 200 'milliseconds; 200 msec == 1/5 sec

    'if the user switches away from the target
    'application and then switches back, stop looping
    '
    Dim ActiveHwnd As Long
    ActiveHwnd = GetForegroundWindow()
    If ActiveHwnd = AppHwnd Then
      If SwitchedApp Then KeepLooping = False
    Else
      SwitchedApp = True
    End If

    'if the clipboard owner has changed, then somebody else
    'has put something on it; if the clipboard resource isn't
    'locked (GetOpenClipboardWindow), then paste its contents
    'into our document; use Copy to change the clipboard owner
    'back to DocHwnd
    '
    If GetClipboardOwner() <> DocHwnd And _
       GetOpenClipboardWindow() = 0 Then
      Selection.Paste
      Selection.MoveLeft Unit:=wdCharacter, Count:=1, Extend:=wdExtend
      Selection.Copy
      Selection.Collapse wdCollapseEnd
    End If

  Loop

End Sub
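The Declare statements in the listing are written for 32-bit Word. On newer versions of Office that use VBA7, and on 64-bit Office in particular, API declarations need the PtrSafe keyword and window handles are pointer-sized. The following is a minimal sketch of how the declarations might be adapted with conditional compilation. The HwndT type alias is a hypothetical name introduced here and is not part of the original macro. If you use it, the three handle variables in AutoPasteLoop (AppHwnd, DocHwnd, and ActiveHwnd) would be declared As HwndT.Value or, more simply, As LongPtr under VBA7.

```vba
Option Explicit

#If VBA7 Then
    'VBA7 (Office 2010 and later): PtrSafe is required on 64-bit Office,
    'and HWND values are pointer-sized, so they are returned as LongPtr
    Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
    Declare PtrSafe Function GetForegroundWindow Lib "user32" () As LongPtr
    Declare PtrSafe Function GetOpenClipboardWindow Lib "user32" () As LongPtr
    Declare PtrSafe Function GetClipboardOwner Lib "user32" () As LongPtr

    'hypothetical helper type, so handle variables are spelled one way
    Type HwndT
        Value As LongPtr
    End Type
#Else
    'older, 32-bit-only VBA: same declarations as in the listing
    Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
    Declare Function GetForegroundWindow Lib "user32" () As Long
    Declare Function GetOpenClipboardWindow Lib "user32" () As Long
    Declare Function GetClipboardOwner Lib "user32" () As Long

    Type HwndT
        Value As Long
    End Type
#End If
```

The rest of the loop should not need to change, because it only compares handles with one another and with zero.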
Open a new Word document. Start AutoPasteLoop by opening the Macros dialog box (Tools → Macro → Macros . . . ), selecting the macro name AutoPasteLoop, and clicking Run. While the loop is running, you cannot interact with Word. Stop the loop by switching to another application and then switching back to Word.
Start the loop. Switch to Acrobat (or Reader) and use its tools to individually select and copy its columns, tables, paragraphs, and images. Switch back to Word and you should find all of your selections pasted into the new document. Start AutoPasteLoop again if you want to copy more material.
Add content filters or your own inference logic to the AutoPasteLoop macro. Use your knowledge of the input documents to tailor the loop, so it creates documents that require less postprocessing.
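As one illustration of such a filter, here is a minimal sketch that tidies text copied from a PDF, where every visual line usually ends in a hard line break. It assumes that each selection you copy is a single paragraph of plain text. CleanPastedText and PasteFiltered are hypothetical names introduced for this example; they are not part of the original macro.

```vba
Option Explicit

'Hypothetical filter: rejoins words hyphenated across line breaks and
'turns the remaining line breaks into spaces, yielding one paragraph.
'Limitation: a genuine hyphen at the end of a line is also removed.
Function CleanPastedText(ByVal s As String) As String
    'normalize CR+LF, bare LF, and Word's manual line break to CR
    s = Replace(s, vbCrLf, vbCr)
    s = Replace(s, vbLf, vbCr)
    s = Replace(s, vbVerticalTab, vbCr)
    'drop trailing line breaks
    Do While Len(s) > 0 And Right$(s, 1) = vbCr
        s = Left$(s, Len(s) - 1)
    Loop
    'rejoin hyphenated words, then join the remaining lines
    s = Replace(s, "-" & vbCr, "")
    s = Replace(s, vbCr, " ")
    CleanPastedText = s
End Function

'Hypothetical replacement for the bare Selection.Paste call:
'pastes, then filters the pasted range if it is plain text.
Sub PasteFiltered()
    Dim startPos As Long
    startPos = Selection.Start
    Selection.Paste

    Dim rng As Range
    Set rng = ActiveDocument.Range(startPos, Selection.End)

    'leave images and tables alone; rewriting .Text would destroy them
    If rng.InlineShapes.Count = 0 And rng.Tables.Count = 0 Then
        'note: assigning .Text discards the pasted character formatting
        rng.Text = CleanPastedText(rng.Text) & vbCr
    End If

    'put the insertion point after the pasted material
    rng.Collapse wdCollapseEnd
    rng.Select
End Sub
```

To try it, replace the Selection.Paste line inside the loop with a call to PasteFiltered. The lines that follow it in the loop (MoveLeft, Copy, Collapse) still work, because they only need one character before the insertion point to copy. For example, CleanPastedText("implemen-" & vbCr & "tation of" & vbLf & "the loop" & vbCrLf) returns "implementation of the loop".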
AutoPasteLoop isn't just a PDF hack. It works with any program that can copy content to the clipboard.