Recipe 20.18 Parsing HTML

20.18.1 Problem

You need to extract complex information from a web page or pages. For example, you want to extract news stories from web sites like CNN.com or news.bbc.co.uk.

20.18.2 Solution

Use regular expressions for data that's well identified:

# story is everything from <!-- story --> to <!-- /story -->
if ($html =~ m{<!-- story -->(.*?)<!-- /story -->}s) {
  my $story = $1;
  # ...
} else {
  warn "No story found in the page";
}

But for tables and data identifiable only by complex patterns of HTML, use a parser:

use HTML::TokeParser;

my $parser = HTML::TokeParser->new($FILENAME)
    or die "Can't open $FILENAME: $!\n";
while (my $token = $parser->get_token( )) {
    my $type = $token->[0];
    if    ($type eq 'S')  { ... }   # start tag
    elsif ($type eq 'E')  { ... }   # end tag
    elsif ($type eq 'T')  { ... }   # text
    elsif ($type eq 'C')  { ... }   # comment
    elsif ($type eq 'D')  { ... }   # declaration
    elsif ($type eq 'PI') { ... }   # processing instruction
    else { die "$type isn't a valid HTML token type" }
}

20.18.3 Discussion

Regular expressions are a convenient way to extract information from HTML. However, as the complexity of the HTML and the amount of information to be extracted go up, the maintainability of the regular expressions goes down. For a few well-defined fields, regular expressions are fine. For anything else, use a proper parser.

As an example of processing HTML with regular expressions, let's get the list of recent O'Reilly book releases. The list is found on http://www.oreilly.com/catalog/new.html, but there's also a navigation bar and a list of upcoming releases, so we can't simply extract all links.

The relevant HTML from the page looks like this:

<!--  New titles  -->
<h3>New Titles</h3>
<ul><li><a href="netwinformian/">.NET Windows Forms in a
Nutshell</a> <em>(March)</em></li><li><a href="actscrptpr/">
ActionScript for Flash MX Pocket Reference</a> <em>(March)</em>
</li><li><a href="abcancer/">After Breast Cancer</a> <em>(March)
...
<li><a href="samba2/">Using Samba, 2nd Edition</a> <em>(February)
</em></li><li><a href="vbscriptian2/">VBScript in a Nutshell, 2nd
Edition</a> <em>(March)</em></li><li><a href="tpj2/">Web, Graphics
& Perl/Tk</a> <em>(March)</em></li></ul></td>
<td valign="top">
<!--  Upcoming titles  -->

In fact, it's even uglier than this at the time of this writing: there are no newlines in the list of new books, so it's all on one long line. Fortunately, this turns out to be comparatively simple to match. First we extract the HTML for the new titles, and then we extract the individual book links, using the tags of each list item to anchor the regular expression:

my ($new_titles) = $html =~ m{<!--  New titles  -->(.*?)<!--  Upcoming titles  -->}s
  or die "Couldn't find new titles HTML";

# Match against $new_titles, not $_. The comments inside the pattern avoid
# naming capture variables, because variables interpolate even in /x comments.
while ($new_titles =~ m{<li>   # list item
         <a\ href="
         ([^\"]+)        # capture 1: link to book, everything to next quote
         \">
         ([^<]+)         # capture 2: book title, everything up to </a>
         </a>\ <em>\(
         ([^)]+)         # capture 3: month, everything in the parentheses
        }gx) {
  printf("%-10s%s\n", $3, $2); # could use $1 if we wanted
}

This produces output like:

March     .NET Windows Forms in a Nutshell
March     ActionScript for Flash MX Pocket Reference
March     After Breast Cancer
...
February  Using Samba, 2nd Edition
March     VBScript in a Nutshell, 2nd Edition
March     Web, Graphics & Perl/Tk

Regular expressions are difficult for this problem because they force you to work at the level of characters. The CPAN module HTML::TokeParser treats your HTML file as a series of HTML-y things: starting tags, closing tags, text, comments, etc. It can also decode entities for you (its get_text method does so automatically), so you rarely have to convert &amp; back into & by hand.
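To make the comparison concrete, here is a minimal sketch of the same book extraction done with tokens instead of one large pattern. The sample markup is a shortened, hand-written stand-in for the listing above, not the live page, and the @books structure is an illustrative choice. It relies on the get_tag and get_trimmed_text methods of HTML::TokeParser, which skip ahead to a named tag and return whitespace-collapsed, entity-decoded text.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Shortened sample modeled on the listing above (an assumption, not the real page).
my $new_titles = <<'HTML';
<ul><li><a href="netwinformian/">.NET Windows Forms in a
Nutshell</a> <em>(March)</em></li><li><a href="tpj2/">Web, Graphics
&amp; Perl/Tk</a> <em>(March)</em></li></ul>
HTML

my $parser = HTML::TokeParser->new(\$new_titles)
    or die "Can't parse HTML";

my @books;
while (my $tag = $parser->get_tag("a")) {          # skip ahead to the next <a>
    my $href = $tag->[1]{href};                    # element 1 is the attribute hash
    next unless defined $href;
    my $title = $parser->get_trimmed_text("/a");   # text up to </a>, entities decoded
    $parser->get_tag("em") or last;                # the month follows in <em>
    my $month = $parser->get_trimmed_text("/em");  # e.g. "(March)"
    $month =~ s/^\(|\)$//g;                        # drop the surrounding parentheses
    push @books, { href => $href, title => $title, month => $month };
}

printf("%-10s%s\n", $_->{month}, $_->{title}) for @books;
```

Because get_trimmed_text collapses runs of whitespace, the title that spans two source lines comes back on one line, something the regular expression version would need an extra step to achieve.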

The argument to the new constructor of HTML::TokeParser is either a filename, a filehandle (or any object providing a read method), or a reference to the HTML text to be parsed:

$parser = HTML::TokeParser->new("foo.html") or die;
$parser = HTML::TokeParser->new(*STDIN) or die;
$parser = HTML::TokeParser->new(\$html) or die;
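As a small illustration of the constructor forms, the following sketch parses a string held in memory and then shows what happens when a filename can't be opened. The HTML string and the filename no-such-file.html are made up for the example, and the file is assumed not to exist.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Parse from a reference to a string already in memory.
my $html   = '<html><body><p>One</p><p>Two</p></body></html>';
my $parser = HTML::TokeParser->new(\$html)
    or die "Can't create a parser for the in-memory HTML";

my $start_tags = 0;
while (my $token = $parser->get_token) {
    $start_tags++ if $token->[0] eq 'S';   # count start tags only
}
print "start tags: $start_tags\n";

# A filename that can't be opened makes new return false, with the reason in $!.
# "no-such-file.html" is a hypothetical name assumed not to exist.
my $missing = HTML::TokeParser->new("no-such-file.html");
print "couldn't open file: $!\n" unless $missing;
```

Checking the return value, as the Solution does with or die, is what keeps a mistyped filename from surfacing later as a confusing method call on an undefined value.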

Each time you invoke get_token on the parser object, you get back a reference to an array. The first element in the array is a string identifying what type of token you have: start tag, end tag, etc. The rest of the array varies depending on what type of token it is. The four types of tokens that most people are interested in are starting tags, ending tags, text, and comments.

Starting tags have four more values in the token array: the tag name (in lowercase), a reference to a hash of attributes (lowercased attribute name as key), a reference to an array containing lowercased attribute names in the order they appeared in the tag, and a string containing the opening tag as it appeared in the text of the document. Parsing the following HTML:

<IMg SRc="/perl6.jpg" ALT="Steroidal Camel">

creates a token like this:

[ 'S',
  'img',
  { "src" => "/perl6.jpg",
    "alt" => "Steroidal Camel"
  },
  [ "src", "alt" ],
  '<IMg SRc="/perl6.jpg" ALT="Steroidal Camel">'
]
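The sketch below unpacks a start tag token into its five parts and walks the attributes in source order using the attribute-name array. The variable names ($attr, $attrseq, $raw, @described) are illustrative choices, not names required by the module.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html   = '<IMg SRc="/perl6.jpg" ALT="Steroidal Camel">';
my $parser = HTML::TokeParser->new(\$html) or die "Can't parse HTML";

my @described;
while (my $token = $parser->get_token) {
    next unless $token->[0] eq 'S';
    # type, tag name, attribute hash, attribute order, original source text
    my (undef, $tag, $attr, $attrseq, $raw) = @$token;
    # Hash keys have no order, so use $attrseq to keep the order from the source.
    my $pairs = join ', ', map { "$_=$attr->{$_}" } @$attrseq;
    push @described, "$tag [$pairs] from $raw";
}
print "$_\n" for @described;
```

The tag and attribute names come back in lowercase, while the final element preserves the original capitalization, which is useful if you need to copy a tag through unchanged.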

Since ending tags have fewer possibilities than opening tags, it follows that their tokens have a simpler structure. A token for an end tag contains "E" (identifying it as an end tag), the lowercased name of the tag being closed (e.g., "body"), and the tag as it appeared in the source (e.g., "</BODY>").
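Start and end tokens together are enough to follow the nesting of a document. This sketch builds an indented outline from an invented snippet. It assumes every start tag has a matching end tag, so elements such as br or img that are never closed would throw off the indentation.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html   = '<HTML><BODY><p>Hi <b>there</b></p></BODY></HTML>';
my $parser = HTML::TokeParser->new(\$html) or die "Can't parse HTML";

my $depth = 0;
my @outline;
while (my $token = $parser->get_token) {
    my ($type, $name) = @$token;
    if ($type eq 'S') {
        push @outline, ('  ' x $depth) . "<$name>";
        $depth++;
    }
    elsif ($type eq 'E') {
        $depth-- if $depth > 0;
        my $raw = $token->[2];   # the end tag exactly as written in the source
        push @outline, ('  ' x $depth) . "</$name> (written as $raw)";
    }
}
print "$_\n" for @outline;
```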

A text token has three values: "T" (to identify it as a text token), the text, and a flag that is true when the text must be left as is, such as the contents of a <script> or <style> element (decode entities only if this flag is false).

use HTML::Entities qw(decode_entities);

if ($token->[0] eq "T") {
    my $text = $token->[1];
    decode_entities($text) unless $token->[2];
    # do something with $text
}
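To show the flag in action, the following sketch separates ordinary text from literal text. The sample HTML is invented, and the names $visible and $literal are illustrative. Text tokens are concatenated rather than assumed to arrive as one piece per element, since the parser may deliver a run of text in several chunks.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Invented sample: one ordinary paragraph and one <script> element.
my $html   = '<p>Fish &amp; chips</p><script>if (a &amp;&amp; b) {}</script>';
my $parser = HTML::TokeParser->new(\$html) or die "Can't parse HTML";

my ($visible, $literal) = ('', '');
while (my $token = $parser->get_token) {
    next unless $token->[0] eq 'T';
    my ($text, $is_literal) = @{$token}[1, 2];
    if ($is_literal) {
        $literal .= $text;          # script content: leave entities alone
    } else {
        decode_entities($text);     # modifies $text in place
        $visible .= $text;
    }
}
print "visible: $visible\n";
print "literal: $literal\n";
```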

Even simpler, a comment token contains only "C" (to indicate that it is a comment) followed by the comment text.
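Comment tokens make it possible to redo the story extraction from the Solution without a regular expression over the whole page. The sketch below is one way to do it, using an invented page. It strips the <!-- and --> delimiters from the comment text if they are present, so it does not depend on whether the token carries them.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Invented page with the story wrapped in marker comments.
my $html = '<p>nav</p><!-- story --><h1>Headline</h1>'
         . '<p>Body text.</p><!-- /story --><p>footer</p>';
my $parser = HTML::TokeParser->new(\$html) or die "Can't parse HTML";

my ($in_story, $story) = (0, '');
while (my $token = $parser->get_token) {
    my $type = $token->[0];
    if ($type eq 'C') {
        (my $comment = $token->[1]) =~ s/^<!--|-->$//g;  # drop delimiters if present
        $comment =~ s/^\s+|\s+$//g;                      # trim surrounding whitespace
        $in_story = 1 if $comment eq 'story';
        $in_story = 0 if $comment eq '/story';
        next;
    }
    next unless $in_story;
    # Text tokens keep their text at index 1; other tokens keep the
    # original source text in their last element.
    $story .= $type eq 'T' ? $token->[1] : $token->[-1];
}
print "$story\n";
```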

For a detailed introduction to parsing with tokens, see Perl & LWP by Sean Burke (O'Reilly).