Hack 95 Script Acrobat Using Perl on Windows


Install Perl and use it instead of Visual Basic to drive Acrobat.

Depending on your tastes or requirements, you might want to use the Perl scripting language instead of Visual Basic [Hack #94] to program Acrobat. Perl can access the same Acrobat OLE interface used by Visual Basic to manipulate PDFs. Perl is well documented, is widely supported, and has been extended with an impressive collection of modules. A Perl installer for Windows is freely available from ActiveState.

We'll describe how to install the ActivePerl package from ActiveState, and then we'll use an example to show how to access Acrobat's OLE interface using Perl.
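Before the full example, here is a minimal sketch of what that access looks like, assuming Acrobat is installed and ActivePerl's Win32::OLE module is available. The `acrobat_status` function name and the `eval`/`require` guard are our own additions for illustration; the guard simply lets the script report a readable message on a machine where Win32::OLE cannot be loaded. `AcroExch.App` and `GetNumAVDocs` are the same names used in Example 7-2.

```perl
use strict;
use warnings;

# Hypothetical helper: describe whether Acrobat can be reached over OLE.
sub acrobat_status {
    # Load Win32::OLE at run time so a missing module is not fatal.
    my $have_ole = eval { require Win32::OLE; 1 };
    return "Win32::OLE is not available on this machine"
        unless $have_ole;

    # AcroExch.App is the OLE program ID for the Acrobat application object.
    my $app = Win32::OLE->new('AcroExch.App');
    return "Could not reach Acrobat: " . Win32::OLE->LastError()
        unless $app;

    return "Acrobat is reachable; open documents: " . $app->GetNumAVDocs();
}

print acrobat_status(), "\n";
```

If this prints a document count, the OLE connection works and the longer script should be able to run as well.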

Acrobat OLE documentation comes with the Acrobat SDK [Hack #98]. Look for IACOverview.pdf and IACReference.pdf. Acrobat Distiller also has an OLE interface, which is documented in DistillerAPIReference.pdf.


7.4.1 Install Perl on Windows

The ActivePerl installer for Windows is freely available from http://www.ActiveState.com/Products/ActivePerl/. Download the installer and run it. ActivePerl comes with excellent documentation, which you can access by selecting Start → Programs → ActiveState ActivePerl 5.8 → Documentation.

ActivePerl also includes the OLE Browser, shown in Figure 7-8, which enables you to browse the OLE servers available on your machine (Start → Programs → ActiveState ActivePerl 5.8 → OLE-Browser). The OLE Browser is an HTML file that must be opened in Internet Explorer to work properly.

Figure 7-8. The OLE Browser, which you can use to discover OLE servers available on your machine


7.4.2 The Code

In this example, the Perl script uses Acrobat to read annotation data (e.g., sticky notes) from the currently open PDF. The script formats this data as HTML and writes it to stdout.

Copy the script in Example 7-2 into a file named SummarizeComments.pl. You can download this code from http://www.pdfhacks.com/summarize/.

Example 7-2. Perl code for summarizing comments
# SummarizeComments.pl ver. 1.0
use strict;
use Win32::OLE;

# connect to Acrobat; stop with a readable message if that fails
my $app = Win32::OLE->new("AcroExch.App")
  or die "Could not connect to Acrobat: " . Win32::OLE->LastError() . "\n";

if( 0< $app->GetNumAVDocs ) { # a PDF is open in Acrobat
  # open the HTML document
  print "<html>\n<head>\n<title>PDF Comments Summary</title>\n</head>\n<body>\n";
  my $found_notes_b= 0;

  # get the active PDF and drill down to its PDDoc
  my $avdoc= $app->GetActiveDoc;
  my $pddoc= $avdoc->GetPDDoc;

  # iterate over pages
  my $num_pages= $pddoc->GetNumPages;
  for( my $ii= 0; $ii< $num_pages; ++$ii ) {

    my $pdpage= $pddoc->AcquirePage( $ii );
    if( $pdpage ) {

      # iterate over annotations (e.g., sticky notes)
      my $page_head_b= 0;
      my $num_annots= $pdpage->GetNumAnnots;
      for( my $jj= 0; $jj< $num_annots; ++$jj ) {

        my $annot= $pdpage->GetAnnot( $jj );
        # Pop-up annots give us duplicate contents
        if( $annot->GetContents ne '' and
          $annot->GetSubtype ne 'Popup' ) {

          if( !$page_head_b ) { # output the page number
            print "<h2>Page: " . ($ii+ 1) . "</h2>\n";
            $page_head_b= 1;
          }

          # output the annotation title and format it a little
          print "<p><i>" . $annot->GetTitle . "</i></p>\n";

          # output the note text; replace carriage returns
          # with paragraph breaks
          my $comment= $annot->GetContents;
          $comment =~ s/\r/<\/p>\n<p>/g;
          print "<p>" . $comment . "</p>\n";

          $found_notes_b= 1;
        }
      }
    }
  }
  if( !$found_notes_b ) {
    print "<h3>No Notes Found in PDF</h3>\n";
  }

  # close the HTML document
  print "</body>\n</html>\n";
}
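
The script above prints annotation titles and contents directly into the HTML, so a note containing characters such as < or & could produce malformed output. One possible refinement, offered as a sketch rather than as part of the original script, is to escape the text first. The helper names `escape_html` and `note_to_html` are hypothetical, and unlike Example 7-2 this version also treats line feeds as paragraph breaks.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in HTML.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first so later entities stay intact
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Hypothetical helper: escape a note, then turn line breaks into
# paragraph breaks, as Example 7-2 does for carriage returns.
sub note_to_html {
    my ($note) = @_;
    my $html = escape_html($note);
    $html =~ s/\r\n?|\n/<\/p>\n<p>/g;
    return "<p>$html</p>\n";
}

print note_to_html("Is x < y here?\rSee Tom & Ann's draft.");
```

To use it in the script, you could pass `$annot->GetTitle` through `escape_html` and replace the three lines that build and print `$comment` with a single `print note_to_html( $annot->GetContents );`.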

7.4.3 Running the Hack

Open a PDF in Acrobat, as shown in Figure 7-6, and then run this script from the command line by typing:

C:\> perl SummarizeComments.pl > comments.html

It will take a few seconds to complete. When it is done, you can open comments.html in your browser to see a summary of the PDF's comments, as shown in Figure 7-9.

Figure 7-9. The PDF Comments in Mozilla after extraction via SummarizeComments.pl


As noted in [Hack #94], this example demonstrates the relationships between several fundamental PDF objects.
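
To make that chain of objects explicit, the following sketch walks from the application object to the active AVDoc, its PDDoc, each PDPage, and each annotation, then tallies annotations by subtype. It uses only the OLE methods that appear in Example 7-2; the function names `tally_subtypes` and `collect_subtypes` and the `eval`/`require` guard are our own illustrative additions.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each subtype name appears.
sub tally_subtypes {
    my @subtypes = @_;
    my %count;
    $count{$_}++ for @subtypes;
    return \%count;
}

# Hypothetical helper: walk app -> AVDoc -> PDDoc -> PDPage -> annotation
# and return the subtype of every annotation in the active document.
sub collect_subtypes {
    my ($app) = @_;
    my @subtypes;
    return @subtypes unless $app->GetNumAVDocs() > 0;

    my $pddoc = $app->GetActiveDoc()->GetPDDoc();
    for my $page_num ( 0 .. $pddoc->GetNumPages() - 1 ) {
        my $pdpage = $pddoc->AcquirePage($page_num) or next;
        for my $annot_num ( 0 .. $pdpage->GetNumAnnots() - 1 ) {
            push @subtypes, $pdpage->GetAnnot($annot_num)->GetSubtype();
        }
    }
    return @subtypes;
}

# Only talk to Acrobat when Win32::OLE can be loaded.
if ( eval { require Win32::OLE; 1 } ) {
    my $app = Win32::OLE->new('AcroExch.App');
    if ($app) {
        my $count = tally_subtypes( collect_subtypes($app) );
        print "$_: $count->{$_}\n" for sort keys %$count;
    }
}
```

With a PDF open in Acrobat, this prints one line per annotation subtype found in the document, which is a quick way to see what kinds of annotations it contains.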