You have data in an HTML table, and you would like to turn that into a Perl data structure. For example, you want to monitor changes to an author's CPAN module list.
Use the HTML::TableContentParser module from CPAN:
    use HTML::TableContentParser;

    $tcp    = HTML::TableContentParser->new;
    $tables = $tcp->parse($HTML);

    foreach $table (@$tables) {
        @headers = map { $_->{data} } @{ $table->{headers} };
        # attributes of table tag available as keys in hash
        $table_width = $table->{width};

        foreach $row (@{ $table->{rows} }) {
            # attributes of tr tag available as keys in hash
            foreach $col (@{ $row->{cells} }) {
                # attributes of td tag available as keys in hash
                $data = $col->{data};
            }
        }
    }
The HTML::TableContentParser module converts all tables in the HTML document into a Perl data structure. As with HTML tables, there are three layers of nesting in the data structure: the table, the row, and the data in that row.
Each table, row, and data tag is represented as a hash reference. The hash keys correspond to attributes of the tag that defined that table, row, or cell. In addition, the value for a special key gives the contents of the table, row, or cell. In a table, the value for the rows key is a reference to an array of rows. In a row, the cells key points to an array of cells. In a cell, the data key holds the HTML contents of the data tag.
For example, take the following table:
    <table width="100%" bgcolor="#ffffff">
      <tr>
        <td>Larry &amp; Gloria</td>
        <td>Mountain View</td>
        <td>California</td>
      </tr>
      <tr>
        <td><b>Tom</b></td>
        <td>Boulder</td>
        <td>Colorado</td>
      </tr>
      <tr>
        <td>Nathan &amp; Jenine</td>
        <td>Fort Collins</td>
        <td>Colorado</td>
      </tr>
    </table>
The parse method returns this data structure:
    [
      {
        'width'   => '100%',
        'bgcolor' => '#ffffff',
        'rows'    => [
          {
            'cells' => [
              { 'data' => 'Larry &amp; Gloria' },
              { 'data' => 'Mountain View' },
              { 'data' => 'California' },
            ],
            'data' => "\n "
          },
          {
            'cells' => [
              { 'data' => '<b>Tom</b>' },
              { 'data' => 'Boulder' },
              { 'data' => 'Colorado' },
            ],
            'data' => "\n "
          },
          {
            'cells' => [
              { 'data' => 'Nathan &amp; Jenine' },
              { 'data' => 'Fort Collins' },
              { 'data' => 'Colorado' },
            ],
            'data' => "\n "
          }
        ]
      }
    ]
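Once you have a structure of this shape, reaching individual values is ordinary reference chasing. The following is a minimal sketch that hard-codes a trimmed copy of such a structure (rather than calling parse) and flattens one table into rows of cell strings. The helper name table_to_grid is our own invention, not part of the module.

```perl
use strict;
use warnings;

# Trimmed, hard-coded copy of a parsed table so the sketch runs
# without fetching or parsing any HTML.
my $tables = [
    {
        width => '100%',
        rows  => [
            { cells => [ { data => 'Larry &amp; Gloria' },
                         { data => 'Mountain View' },
                         { data => 'California' } ] },
            { cells => [ { data => '<b>Tom</b>' },
                         { data => 'Boulder' },
                         { data => 'Colorado' } ] },
        ],
    },
];

# table_to_grid is a hypothetical helper: it returns a reference to an
# array of rows, each row a reference to an array of cell strings.
sub table_to_grid {
    my ($table) = @_;
    my @grid;
    foreach my $row (@{ $table->{rows} || [] }) {
        push @grid, [ map { $_->{data} } @{ $row->{cells} || [] } ];
    }
    return \@grid;
}

my $grid = table_to_grid($tables->[0]);
print join(" | ", @$_), "\n" for @$grid;
```

A single cell can also be addressed directly, for example $tables->[0]{rows}[1]{cells}[0]{data} for the first cell of the second row.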
The data values still contain tags and entities. If you don't want the tags and entities, remove them by hand using techniques from Recipe 20.6.
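As a rough illustration of that cleanup, here is a dependency-free sketch that strips tags with a naive regex and decodes only four common entities. Both the helper name cell_text and the tiny entity table are assumptions made for the example; this is not how the module or Recipe 20.6 does it, and the regex will misfire on HTML with ">" inside attribute values.

```perl
use strict;
use warnings;

# Only four entities are handled; an assumption made to keep the
# sketch free of non-core modules.
my %ENTITY = (amp => '&', lt => '<', gt => '>', quot => '"');

# cell_text is a hypothetical helper, not part of HTML::TableContentParser.
sub cell_text {
    my ($html) = @_;
    $html =~ s{<[^>]*>}{}g;                        # naive tag removal
    $html =~ s{&(amp|lt|gt|quot);}{$ENTITY{$1}}g;  # decode known entities
    $html =~ s{^\s+|\s+$}{}g;                      # trim surrounding whitespace
    return $html;
}

print cell_text('<b>Tom</b>'), "\n";          # Tom
print cell_text('Larry &amp; Gloria'), "\n";  # Larry & Gloria
```

Tags are removed before entities are decoded so that text such as &lt;b&gt; survives as a literal "<b>". For the full set of entities, decode_entities from HTML::Entities is the more complete choice.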
Example 20-11 fetches a particular CPAN author's page and displays in plain text the modules they own. You could use this as part of a system that notifies you when your favorite CPAN authors do something new.
    #!/usr/bin/perl -w
    # dump-cpan-modules-for-author - display modules a CPAN author owns
    use LWP::Simple;
    use URI;
    use HTML::TableContentParser;
    use HTML::Entities;
    use strict;

    our $URL = shift || 'http://search.cpan.org/author/TOMC/';

    my $tables  = get_tables($URL);
    my $modules = $tables->[4];             # 5th table holds module data

    foreach my $r (@{ $modules->{rows} }) {
        my ($module_name, $module_link, $status, $description)
            = parse_module_row($r, $URL);
        print "$module_name <$module_link>\n\t$status\n\t$description\n\n";
    }

    sub get_tables {
        my $URL  = shift;
        my $page = get($URL);
        defined $page or die "Couldn't fetch $URL\n";   # get returns undef on failure
        my $tcp  = HTML::TableContentParser->new;
        return $tcp->parse($page);
    }

    sub parse_module_row {
        my ($row, $URL) = @_;
        my ($module_html, $module_link, $module_name, $status, $description);

        # extract cells
        $module_html = $row->{cells}[0]{data};  # link and name in HTML
        $status      = $row->{cells}[1]{data};  # status string and link
        $description = $row->{cells}[2]{data};  # description only
        $status =~ s{<.*?>}{ }g;  # naive link removal, works on this simple HTML

        # separate module link and name from html
        ($module_link, $module_name) = $module_html =~ m{href="(.*?)".*?>(.*)<}i;
        $module_link = URI->new_abs($module_link, $URL);  # resolve relative links

        # decode entities in place
        decode_entities($module_name);
        decode_entities($description);

        return ($module_name, $module_link, $status, $description);
    }
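To turn a listing like this into a change notifier, you would compare the module names from one run with those from the previous run. The sketch below shows only that comparison step, under the assumption that you have already saved the earlier list somewhere. The helper name diff_modules and the module names are made up for illustration.

```perl
use strict;
use warnings;

# diff_modules is a hypothetical helper: given references to the old and
# new lists of module names, it returns (\@added, \@removed), each sorted.
sub diff_modules {
    my ($old, $new) = @_;
    my %was = map { $_ => 1 } @$old;
    my %is  = map { $_ => 1 } @$new;
    my @added   = sort grep { !$was{$_} } keys %is;
    my @removed = sort grep { !$is{$_}  } keys %was;
    return (\@added, \@removed);
}

# Made-up module names, for illustration only.
my @last_run = qw(Foo-Bar Baz-Qux);
my @this_run = qw(Foo-Bar Quux-Corge);

my ($added, $removed) = diff_modules(\@last_run, \@this_run);
print "new: $_\n"  for @$added;
print "gone: $_\n" for @$removed;
```

How the previous list is stored (a flat file, a DBM file, and so on) is left open here.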
The documentation for the CPAN module HTML::TableContentParser; http://search.cpan.org