You want to create a Rich Site Summary (RSS) file, or read one produced by another application.
Use the CPAN module XML::RSS to read an existing RSS file:
    use XML::RSS;

    my $rss = XML::RSS->new;
    $rss->parsefile($RSS_FILENAME);

    my @items = @{$rss->{items}};
    foreach my $item (@items) {
        print "title: $item->{'title'}\n";
        print "link: $item->{'link'}\n\n";
    }
To create an RSS file:
    use XML::RSS;

    my $rss = XML::RSS->new(version => $VERSION);
    $rss->channel(title       => $CHANNEL_TITLE,
                  link        => $CHANNEL_LINK,
                  description => $CHANNEL_DESC);
    $rss->add_item(title       => $ITEM_TITLE,
                   link        => $ITEM_LINK,
                   description => $ITEM_DESC,
                   name        => $ITEM_NAME);
    print $rss->as_string;
There are at least four variations of RSS extant: 0.9, 0.91, 1.0, and 2.0. At the time of this writing, XML::RSS understood all but RSS 2.0. Each version has different capabilities, so methods and parameters depend on which version of RSS you're using. For example, RSS 1.0 supports RDF and uses the Dublin Core metadata (http://dublincore.org/). Consult the documentation for what you can and cannot call.
XML::RSS uses XML::Parser to parse the RSS. Unfortunately, not all RSS files are well-formed XML, let alone valid. The XML::RSSLite module on CPAN offers a looser approach to parsing RSS: it uses regular expressions and is much more forgiving of incorrect XML.
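To see why a regex-based approach tolerates markup that a strict parser rejects, here is a minimal sketch in plain Perl. It is not XML::RSSLite's actual implementation; the `extract_items` function and the sample feed are hypothetical, and only illustrate pulling fields out with patterns instead of requiring well-formed XML.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull title/link pairs out of RSS-like text with
# regular expressions. An illustration only, not XML::RSSLite's code.
sub extract_items {
    my ($text) = @_;
    my @items;
    # Match each <item>...</item> span, ignoring whatever surrounds it.
    while ($text =~ m{<item\b[^>]*>(.*?)</item>}gis) {
        my $body = $1;
        my %item;
        for my $field (qw(title link)) {
            # Capture the field's text, dropping surrounding whitespace.
            if ($body =~ m{<$field\b[^>]*>\s*(.*?)\s*</$field>}is) {
                $item{$field} = $1;
            }
        }
        push @items, \%item;
    }
    return @items;
}

# Sample input with an unescaped ampersand and an unclosed <channel>,
# either of which makes a strict XML parser reject the document.
my $feed = <<'END';
<rss><channel>
<item><title>Fish & chips</title><link>http://example.com/1</link></item>
<item><title> Second story </title>
<link>http://example.com/2</link></item>
END

for my $item (extract_items($feed)) {
    print "title: $item->{title}\n";
    print "link:  $item->{link}\n\n";
}
```

The trade-off is that a pattern-based reader accepts almost anything, so it can also return partial or misleading items from input that is badly damaged.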
Example 22-13 uses XML::RSSLite and LWP::Simple to download The Guardian's RSS feed and print out the items whose titles contain the keywords we're interested in.
    #!/usr/bin/perl -w
    # guardian-list -- list Guardian articles matching keyword

    use XML::RSSLite;
    use LWP::Simple;
    use strict;

    # list of keywords we want
    my @keywords = qw(perl internet porn iraq bush);

    # get the RSS
    my $URL = 'http://www.guardian.co.uk/rss/1,,,00.xml';
    my $content = get($URL);
    die "Couldn't fetch $URL\n" unless defined $content;

    # parse the RSS
    my %result;
    parseRSS(\%result, \$content);

    # build the regex from keywords
    my $re = join "|", @keywords;
    $re = qr/\b(?:$re)\b/i;

    # print report of matching items
    foreach my $item (@{ $result{items} }) {
        my $title = $item->{title};
        # normalize whitespace: collapse runs, then trim both ends
        $title =~ s{\s+}{ }g;
        $title =~ s{^\s+}{};
        $title =~ s{\s+$}{};
        if ($title =~ /$re/) {
            print "$title\n\t$item->{link}\n\n";
        }
    }
The following is sample output from Example 22-13:
    UK troops to lead Iraq peace force
            http://www.guardian.co.uk/Iraq/Story/0,2763,989318,00.html?=rss

    Shia cleric challenges Bush plan for Iraq
            http://www.guardian.co.uk/Iraq/Story/0,2763,989364,00.html?=rss
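The matching step in Example 22-13 can be tried without a network connection. The sketch below uses made-up titles and a hypothetical `clean_title` helper. It also passes each keyword through `quotemeta`, an addition not in the original program, so that any regex metacharacters in a keyword are matched literally.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keywords to look for (a shortened, illustrative list).
my @keywords = qw(perl internet iraq bush);

# Build one case-insensitive alternation. quotemeta is an addition here:
# it keeps metacharacters in a keyword from being read as regex syntax.
my $alternation = join "|", map { quotemeta($_) } @keywords;
my $re = qr/\b(?:$alternation)\b/i;

# Hypothetical helper: collapse runs of whitespace and trim both ends.
sub clean_title {
    my ($title) = @_;
    $title =~ s{\s+}{ }g;
    $title =~ s{^\s+}{};
    $title =~ s{\s+$}{};
    return $title;
}

# Made-up titles standing in for the items of a parsed feed.
my @titles = (
    "  UK troops to lead\n Iraq peace force ",
    "Bushfires threaten suburbs",
    "Shia cleric challenges Bush plan for Iraq",
);

# Keep only the cleaned titles that contain a whole-word keyword.
my @matches = grep { /$re/ } map { clean_title($_) } @titles;
print "$_\n" for @matches;
```

Because the pattern is anchored with `\b` on both sides, "Bushfires" is not treated as a match for the keyword "bush"; only whole words count.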
We can combine this with XML::RSS to generate a new RSS feed from the filtered items. It would be easier, of course, to do it all with XML::RSS, but this way you get to see both modules in action. Example 22-14 shows the finished program.
    #!/usr/bin/perl -w
    # guardian-filter -- filter the Guardian's RSS feed by keyword

    use XML::RSSLite;
    use XML::RSS;
    use LWP::Simple;
    use strict;

    # list of keywords we want
    my @keywords = qw(perl internet porn iraq bush);

    # get the RSS
    my $URL = 'http://www.guardian.co.uk/rss/1,,,00.xml';
    my $content = get($URL);
    die "Couldn't fetch $URL\n" unless defined $content;

    # parse the RSS
    my %result;
    parseRSS(\%result, \$content);

    # build the regex from keywords
    my $re = join "|", @keywords;
    $re = qr/\b(?:$re)\b/i;

    # make new RSS feed
    my $rss = XML::RSS->new(version => '0.91');
    $rss->channel(title       => $result{title},
                  link        => $result{link},
                  description => $result{description});

    foreach my $item (@{ $result{items} }) {
        my $title = $item->{title};
        # normalize whitespace: collapse runs, then trim both ends
        $title =~ s{\s+}{ }g;
        $title =~ s{^\s+}{};
        $title =~ s{\s+$}{};
        if ($title =~ /$re/) {
            $rss->add_item(title => $title, link => $item->{link});
        }
    }

    print $rss->as_string;
Here's an example of the RSS feed it produces:
    <?xml version="1.0" encoding="UTF-8"?>

    <!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
                "http://my.netscape.com/publish/formats/rss-0.91.dtd">

    <rss version="0.91">

    <channel>
    <title>Guardian Unlimited</title>
    <link>http://www.guardian.co.uk</link>
    <description>Intelligent news and comment throughout the day from The Guardian newspaper</description>

    <item>
    <title>UK troops to lead Iraq peace force</title>
    <link>http://www.guardian.co.uk/Iraq/Story/0,2763,989318,00.html?=rss</link>
    </item>

    <item>
    <title>Shia cleric challenges Bush plan for Iraq</title>
    <link>http://www.guardian.co.uk/Iraq/Story/0,2763,989364,00.html?=rss</link>
    </item>

    </channel>
    </rss>
See also: the documentation for the CPAN modules XML::RSS and XML::RSSLite.