Recipe 20.4 Converting ASCII to HTML

20.4.1 Problem

You want to convert ASCII text to HTML. For example, you have mail you want to display intelligently on a web page.

20.4.2 Solution

Use the simple little encoding filter in Example 20-3.

Example 20-3. text2html
  #!/usr/bin/perl -w -p00
  # text2html - trivial html encoding of normal text
  # -p means apply this script to each record.
  # -00 means that a record is now a paragraph
  
  use HTML::Entities;
  $_ = encode_entities($_, "\200-\377");
  
  if (/^\s/) {
      # Paragraphs beginning with whitespace are wrapped in <PRE> 
      s{(.*)$}        {<PRE>\n$1</PRE>\n}s;           # indented verbatim
  } else {
      s{^(>.*)}       {$1<BR>}gm;                     # quoted text
      s{<URL:(.*?)>}    {<A HREF="$1">$1</A>}gs       # embedded URL  (good)
                      ||
      s{(http:\S+)}   {<A HREF="$1">$1</A>}gs;        # guessed URL   (bad)
      s{\*(\S+)\*}    {<STRONG>$1</STRONG>}g;         # this is *bold* here
      s{\b_(\S+)\_\b} {<EM>$1</EM>}g;                 # this is _italics_ here
      s{^}            {<P>\n};                        # add paragraph tag 
  }
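
To see what these rules do to one paragraph, you can wrap the same substitutions in a subroutine and feed it a string. What follows is only a sketch under that assumption, not part of text2html itself: the name markup_paragraph is made up for illustration, and the asterisks in the bold pattern are backslashed so that they match literally.

```perl
#!/usr/bin/perl
# Sketch only: markup_paragraph is a hypothetical helper that applies
# the rules from Example 20-3 to one paragraph held in a string.
use strict;
use warnings;
use HTML::Entities;

sub markup_paragraph {
    my ($para) = @_;
    $para = encode_entities($para, "\200-\377");     # high-bit chars only
    if ($para =~ /^\s/) {
        $para =~ s{(.*)$}{<PRE>\n$1</PRE>\n}s;        # indented verbatim
    } else {
        $para =~ s{^(>.*)}{$1<BR>}gm;                 # quoted text
        $para =~ s{<URL:(.*?)>}{<A HREF="$1">$1</A>}gs        # embedded URL
            || $para =~ s{(http:\S+)}{<A HREF="$1">$1</A>}gs; # guessed URL
        $para =~ s{\*(\S+)\*}{<STRONG>$1</STRONG>}g;  # *bold*
        $para =~ s{\b_(\S+)\_\b}{<EM>$1</EM>}g;       # _italics_
        $para =~ s{^}{<P>\n};                         # add paragraph tag
    }
    return $para;
}

print markup_paragraph("This is *bold* and _italic_ text.\n");
```

The sample call should print a <P> line followed by the sentence with bold wrapped in <STRONG> tags and italic wrapped in <EM> tags. A paragraph that starts with whitespace takes the other branch and comes back inside <PRE> tags, untouched by the inline rules.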

20.4.3 Discussion

Converting arbitrary plain text to HTML has no general solution because there are too many conflicting ways to represent formatting information. The more you know about the input, the better you can format it.

For example, if you knew that you would be fed a mail message, you could add this block, placed after the use HTML::Entities line so that encode_entities is already defined, to format the mail headers:

BEGIN {
    print "<TABLE>";
    $_ = encode_entities(scalar <>);
    s/\n\s+/ /g;  # continuation lines
    while ( /^(\S+?:)\s*(.*)$/gm ) {                # parse heading
        print "<TR><TH ALIGN='LEFT'>$1</TH><TD>$2</TD></TR>\n";
    }
    print "</TABLE><HR>";
}
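
If you want to try the header formatting away from the filter, the same steps can be applied to a string instead of reading from <>. This is a minimal sketch under that assumption: headers_to_table is a hypothetical name, and the sample header is invented for the demonstration.

```perl
#!/usr/bin/perl
# Sketch only: headers_to_table is a hypothetical helper that mirrors
# the BEGIN block but takes the header paragraph as a string.
use strict;
use warnings;
use HTML::Entities;

sub headers_to_table {
    my ($header) = @_;
    $header = encode_entities($header);   # default set: escapes <, >, &, quotes
    $header =~ s/\n\s+/ /g;               # join continuation lines
    my $html = "<TABLE>";
    while ($header =~ /^(\S+?:)\s*(.*)$/gm) {     # one field per line
        $html .= "<TR><TH ALIGN='LEFT'>$1</TH><TD>$2</TD></TR>\n";
    }
    return $html . "</TABLE><HR>";
}

# Invented sample header with one continuation line.
my $sample = "From: Ann <ann\@example.com>\n"
           . "Subject: a long\n subject line\n";
print headers_to_table($sample), "\n";
```

Here encode_entities is called without a second argument, so the angle brackets around the address become &lt; and &gt; before they reach the table. The continuation line is folded into the Subject field by the time the loop sees it.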

The CPAN module HTML::TextToHTML has options for headers, footers, indentation, tables, and more.

20.4.4 See Also

The documentation for the CPAN modules HTML::Entities and HTML::TextToHTML