Hack 84 Tailor PDF Text at Serve-Time


Create a PDF template that you can populate as it is served.

Sometimes a PDF needs to include dynamic information. For example, you could fashion the cover of your personalized PDF sales brochure [Hack #89] to include the customer's name: "Created for Mary Jane Doe on March 15, 2004." To do this, let's use what we know about modifying PDF text in a plain-text editor [Hack #80] to create a PDF template. Then we'll fill in this template using a web server script.

The overall process resembles [Hack #83]. Instead of PDF links, you will add placeholders to the PDF's page streams. As the PDF is served, these placeholders are replaced with your data.

6.12.1 Create the PDF

Design the document using your favorite authoring application. Add placeholder text where you want the dynamic data to appear. Placeholders should share a common prefix, such as textbeg_ (as in textbeg_customer), so that you can find them all with a single search later. Style this text to taste, but align it to the left (not the center). Before creating a PDF, be careful with the placeholder fonts to avoid results such as the one in Figure 6-14.

Figure 6-14. Acrobat displaying parentheses around "Jane" as empty rectangles, because we omitted them from our alphabet soup


Whichever font you choose for your placeholder, you must make sure the font gets adequately embedded into the PDF [Hack #43]. An embedded font is often subset, which means it includes only the characters that are used in your document. If your placeholder text uses a Type 1 font, you can configure Distiller to not subset this font [Hack #43]. If your placeholder text uses a TrueType or OpenType font, you must be sure that every character you might need occurs in your document. To be safe, create a separate page that includes every letter in the alphabet, every number, and every punctuation mark you'll need. Set this alphabet soup to the font of your placeholder.

Print to PDF and delete this alphabet page.
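
One way to produce the alphabet-soup text is to generate it rather than type it. The following is a minimal sketch, assuming that printable ASCII (character codes 32 through 126) covers everything your data might contain; the function name alphabet_soup is purely illustrative. If your data can include accented or other non-ASCII characters, you would need to add those to the page yourself.

```php
<?php
// Hypothetical helper: returns every printable ASCII character
// (codes 32 through 126), ready to paste into the alphabet page.
function alphabet_soup()
{
  $soup= '';
  for( $code= 32; $code<= 126; ++$code ) {
    $soup.= chr( $code );
  }
  return $soup;
}

// print the characters in rows of 32 so they are easy to copy
echo chunk_split( alphabet_soup(), 32, "\n" );
```

Paste the output into the extra page, set it in the placeholder's font, and remember that parentheses are part of the set (the omission behind Figure 6-14).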

6.12.2 Convert the PDF into a Template

Prepare the PDF for text editing with pdftk [Hack #79] like this (if you use gVim and our plug-in [Hack #82] to edit PDF, this step isn't necessary):

pdftk mydoc.pdf output mydoc.plain.pdf uncompress

Open the results in your editor and search for your placeholder text. If you can't find it, search on its page number (e.g., pageNum 5) and then dig down [Hack #81] to find the page stream that has your placeholder. Distiller probably split it into pieces; e.g., textbeg_customer might end up as [(text)5(b)-1.7(eg_cust)5(o)-1.7(mer)].

When creating PDF with Ghostscript, text that uses TrueType fonts ends up getting a strange, custom encoding. This means your PDF code will be incomprehensible. The solution is to use Type 1 fonts in your document instead of TrueType.


Make a few changes to this page stream. First, repair your placeholder text so that grep can find it. So:

[(text)5(b)-1.7(eg_cust)5(o)-1.7(mer)]TJ

becomes:

[(textbeg_customer)]TJ

Or, if your string ends in Tj, such as this:

(Created for textbeg_customer on textbeg_date)Tj

rewrite it like this, adding square brackets and changing the Tj at the end to TJ:

[(Created for textbeg_customer on textbeg_date)]TJ
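
For simple cases, the repair described above can be scripted. The sketch below assumes that the TJ array contains only literal (parenthesized) strings and numeric kerning adjustments, with no hex strings and no unescaped nested parentheses; the function name merge_tj_array is hypothetical. Discarding the kerning numbers changes the letter spacing slightly, just as the hand edit does.

```php
<?php
// Hypothetical helper: collapse "[(te)5(xt)-1.7(beg)]TJ" into "[(textbeg)]TJ"
// by discarding the kerning numbers between the string pieces.
// Assumes literal (parenthesized) strings only; hex strings are not handled.
function merge_tj_array( $line )
{
  return preg_replace_callback(
    '/\[((?:\((?:\\\\.|[^\\\\()])*\)|-?[\d.]++|\s++)+)\]\s*TJ/',
    function( $mm ) {
      // pull out each literal string body, honoring backslash escapes
      preg_match_all( '/\(((?:\\\\.|[^\\\\()])*)\)/', $mm[1], $parts );
      return '[('.implode( '', $parts[1] ).')]TJ';
    },
    $line );
}

echo merge_tj_array( '[(text)5(b)-1.7(eg_cust)5(o)-1.7(mer)]TJ' ), "\n";
```

Lines that do not contain a TJ array of this shape pass through unchanged, so the function can be applied to each line of a page stream in turn.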

Next, isolate each placeholder on its own line, if necessary. So, the previous example becomes:

[(Created for )
(textbeg_customer)
( on )
(textbeg_date)]TJ

Finally, pad the placeholders with asterisks (*). Add enough asterisks so that the placeholder is longer than any possible data you might write there. Padding the previous example would look like this:

[(Created for )
(textbeg_customer***********************************)
( on )
(textbeg_date**********************)]TJ

Save and close your altered PDF.

What happens to excess padding when the file is served? Our script replaces it with whitespace outside of the PDF string, so it won't be rendered on the page. The preceding example might look like this, after it is served by our script:

[(Created for )
(Mary Jane Doe)
( on )
(March 15, 2004)                    ]TJ

where (Mary Jane Doe) and (March 15, 2004) are followed by numerous space characters.
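
The padding rule can be stated as a small function: the filled-in string plus its trailing spaces occupies exactly as many bytes as the padded placeholder did, so nothing after it in the file moves. This is a sketch of the idea only; the function name fill_placeholder is hypothetical, and it assumes the value has already been escaped for use in a PDF string.

```php
<?php
// Hypothetical helper: replace a padded placeholder such as
// "(textbeg_date*****)" with "(value)" followed by spaces, keeping the
// total byte count unchanged. $value is assumed to be escaped already.
function fill_placeholder( $placeholder, $value )
{
  $room= strlen( $placeholder )- 2; // bytes between the parentheses
  if( strlen( $value )> $room ) {
    // too long for the padding: truncate rather than shift later bytes
    $value= substr( $value, 0, $room );
  }
  // pad on the right, outside the closing parenthesis
  return str_pad( '('.$value.')', strlen( $placeholder ), ' ' );
}

$before= '(textbeg_date**********************)';
$after= fill_placeholder( $before, 'March 15, 2004' );
echo $after, "|\n"; // the bar marks where the padding ends
```

Truncation can cut an escape sequence in half, which is one more reason to pad each placeholder generously.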


6.12.3 Add Placeholder Offsets to the PDF

If you used gVim and our plug-in to edit the PDF, now you must uncompress the PDF. If you did not use gVim, now you must repair the PDF's XREF table and stream lengths. One command accomplishes both tasks:

pdftk mydoc.plain.pdf output mydoc.pdfsrc uncompress

From this point on, you should not treat the file like a PDF, and this pdfsrc extension will remind you.

Find the byte offsets to your placeholders with grep (Windows users visit http://gnuwin32.sf.net/packages/grep.htm or install MSYS [Hack #97] to get grep):

ssteward@armand:~$ grep -ab textbeg mydoc.pdfsrc

9202:(textbeg_customer***************************)

9247:(textbeg_date***************************)]TJ

11793:(textbeg_customer***************************)

In your text editor, add one line for each offset to the beginning of your pdfsrc file. Each line should look like this:

#-dataname-dataoffset

The dataname is used in the following script code to identify the data to be written into the PDF. In this example, customer will be replaced with the customer's name. For example, here is how the preceding grep output would appear at the beginning of a pdfsrc file:

#-customer-9202

#-date-9247

#-customer-11793

%PDF-1.3...

After adding these lines, do not modify the PDF with pdftk, gVim, or Acrobat. The pdfsrc extension should remind you to not treat this file like a PDF. Altering the PDF could invalidate these byte offsets.
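
If you would rather not copy the offsets by hand, the grep step and the header lines can be produced together. The following is a sketch under a few assumptions: lines end with a newline character, each placeholder sits on its own line, and data names consist only of letters and digits. The function name placeholder_header and the sample data are illustrative.

```php
<?php
// Hypothetical helper: build the "#-name-offset" lines for every line of
// $pdf that contains a placeholder. Offsets are the byte position of the
// start of the line, which is what "grep -ab" reports.
function placeholder_header( $pdf, $prefix= 'textbeg_' )
{
  $header= '';
  $offset= 0;
  foreach( explode( "\n", $pdf ) as $line ) {
    if( preg_match( '/'.preg_quote( $prefix, '/' ).'([A-Za-z0-9]+)/', $line, $mm ) ) {
      $header.= '#-'.$mm[1].'-'.$offset."\n";
    }
    $offset+= strlen( $line )+ 1; // +1 for the newline removed by explode
  }
  return $header;
}

// a tiny stand-in for the contents of a real pdfsrc file
$pdf= "%PDF-1.3\n[(Created for )\n(textbeg_customer*****)\n( on )\n(textbeg_date*****)]TJ\n";
echo placeholder_header( $pdf );
```

The offsets are measured before the header exists, so the header must be prepended to the unmodified file contents, for example by writing placeholder_header( $pdf ).$pdf to the pdfsrc file.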

6.12.4 The Code

This example PHP script, alter_pdf_text_example.php, opens a pdfsrc file, reads the offset data we added, and then serves the PDF. As it serves the PDF, it replaces the placeholders with the given text. Note how the replacement text is escaped using escape_pdf_string.

<?php
// alter_pdf_text_example.php, version 1.0
// http://www.pdfhacks.com/dynamic_text/

// the filename of the source PDF file, which
// contains placeholders for our dynamic text
$pdfsrc_fn= './cover.pdfsrc';

// the data we will place into the PDF text
$customer_text= "Mary Jane Doe";
$date_text= "March 15, 2004";

function escape_pdf_string( $ss )
{
  $ss_esc= '';
  $ss_len= strlen( $ss );
  for( $ii= 0; $ii< $ss_len; ++$ii ) {
    $code= ord( $ss[$ii] );
    if( $code== 0x28 ||  // open paren
        $code== 0x29 ||  // close paren
        $code== 0x5c )   // backslash
      {
        $ss_esc.= chr(0x5c).$ss[$ii]; // escape the character w/ backslash
      }
    else if( $code< 32 || 126< $code ) {
      $ss_esc.= sprintf( "\\%03o", $code ); // use an octal code
    }
    else {
      $ss_esc.= $ss[$ii];
    }
  }
  return $ss_esc;
}

// open the source PDF file, which contains placeholders;
// binary mode keeps the byte offsets intact on every platform
$fp= @fopen( $pdfsrc_fn, 'rb' );
if( $fp ) {

  if( !empty($_GET['debug']) ) {
    header("Content-Type: text/plain"); // debug
  }
  else {
    header('Content-Type: application/pdf');
  }

  $pdf_offset= 0;
  $text_offsets= array( );

  // iterate over first lines of pdfsrc file to load $text_offsets
  while( ($cc= fgets($fp, 1024))!== false ) {
    if( $cc[0]== '#' ) { // one of our comments
      // pad the result so that a malformed comment yields empty fields
      list($comment, $name, $offset)=
        array_pad( explode( '-', $cc, 3 ), 3, '' );

      if( $name== 'customer' ) {
        $text_offsets[(int)$offset]=
          escape_pdf_string( $customer_text );
      }
      else if( $name== 'date' ) {
        $text_offsets[(int)$offset]=
          escape_pdf_string( $date_text );
      }
      else { // default
        $text_offsets[(int)$offset]=
          escape_pdf_string( '[ERROR]' );
      }
    }
    else { // finished with our comments
      echo $cc;
      // while the loop below handles the byte at PDF offset k,
      // $pdf_offset holds k+1: the offset of the byte that follows
      $pdf_offset= strlen($cc)+ 1;

      break;
    }
  }

  // sort by increasing offsets
  ksort( $text_offsets, SORT_NUMERIC );
  reset( $text_offsets );

  $output_text_line_b= false;
  $output_text_b= false;
  $closed_string_b= false;

  // get the first offset/TEXT pair; key( ) returns null
  // once the array is exhausted
  $offset= key( $text_offsets );
  $text= (string)current( $text_offsets );
  next( $text_offsets );
  $text_ii= 0;
  $text_len= strlen($text);

  // iterate over rest of file
  while( ($cc= fgetc($fp))!== false ) {

    if( $output_text_line_b && $cc== '(' ) {
      // we have reached the beginning of our TEXT
      $output_text_line_b= false;
      $output_text_b= true;

      echo '(';
    }
    else if( $output_text_b ) {
      if( $cc== ')' ) { // finished with this TEXT
        if( $closed_string_b ) {
          // string has already been capped; pad
          echo ' ';
        }
        else {
          echo ')';
        }

        // get next offset/TEXT pair
        $offset= key( $text_offsets );
        $text= (string)current( $text_offsets );
        next( $text_offsets );
        $text_ii= 0;
        $text_len= strlen($text);

        // reset
        $output_text_b= false;
        $closed_string_b= false;
      }
      else if( $text_ii< $text_len ) {
        // output one character of $text
        echo $text[$text_ii++];
      }
      else if( $text_ii== $text_len ) {
        // done with $text, so cap this string
        echo ')';
        $closed_string_b= true;
        $text_ii++;
      }
      else {
        echo ' '; // replace padding with space
      }
    }
    else {
      // output this character
      echo $cc;

      if( $offset=== $pdf_offset ) {
        // the next byte starts a line in pdfsrc where
        // our TEXT should be; begin a lookout for '('
        $output_text_line_b= true;
      }
    }

    ++$pdf_offset;
  }

  fclose( $fp );
}
else { // file open failure
  echo 'Error: failed to open: '.$pdfsrc_fn;
}
?>

6.12.5 Running the Hack

IndigoPerl users (see Section 6.2.2 in [Hack #74]) can copy alter_pdf_text_example.php into C:\indigoperl\apache\htdocs\pdf_hacks\ along with a PDF template named cover.pdfsrc. Point your browser to http://localhost/pdf_hacks/alter_pdf_text_example.php, and a PDF should appear. All instances of textbeg_customer should be replaced with "Mary Jane Doe," and all instances of textbeg_date should be replaced with "March 15, 2004." Naturally, you will need to adapt this script to your own purposes.
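
As one example of such an adaptation, the hardcoded values could come from the request and the server clock instead. This is only a sketch: the helper request_text, the query parameter name customer, the fallback text, and the 35-byte limit are all assumptions made for illustration. Choose a limit that matches the padding in your own template.

```php
<?php
// Hypothetical helper: read one text value from request parameters,
// falling back to $default and trimming to $max_len bytes so that it
// cannot outgrow the placeholder's padding.
function request_text( $params, $key, $default, $max_len )
{
  $value= isset( $params[$key] ) && is_string( $params[$key] )
    ? trim( $params[$key] ) : '';
  if( $value=== '' ) {
    $value= $default;
  }
  return substr( $value, 0, $max_len );
}

// these two lines would replace the hardcoded values in the script
$customer_text= request_text( $_GET, 'customer', 'Valued Customer', 35 );
$date_text= date( 'F j, Y' ); // e.g., "March 15, 2004"
```

The limit applies to the raw text. Escaping adds a backslash before each parenthesis or backslash and expands other bytes to octal codes, so the escaped value can still exceed the padding, in which case the serving script truncates it.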