9.2 Other Outputs and Selective Parsing

In all of our examples so far, we have written the parsed feed to a file handle for use inside another web page. This is just the start. We could use the same basic structure to output to just about anything that handles text. Example 9-4 is a script that sends the top headline of a feed to a mobile phone via the Short Message Service (SMS). It uses the WWW::SMS module, sending the message through the first free web-based SMS gateway that works.

Example 9-4. rsssms.pl sends the first headline title to a mobile phone via SMS

#!/usr/local/bin/perl
use strict;
use warnings;
use LWP::Simple;
use XML::Simple;
use WWW::SMS;
   
# Take the command line arguments: feed URL first, then the full mobile number
die "Usage: perl rsssms.pl <feed URL> <mobile number>\n" unless @ARGV == 2;
my ($url, $number) = @ARGV;

# Retrieve the feed, or die gracefully
my $feed_to_parse = get($url) or die "I can't get the feed you want";

# Parse the XML. ForceArray keeps 'item' an array even if the feed has one item
my $parser = XML::Simple->new(  );
my $rss = $parser->XMLin($feed_to_parse, ForceArray => ['item']);

# Get the data we want
my $message = "NEWSFLASH:: $rss->{'channel'}->{'item'}->[0]->{'title'}";
   
# Send the message through the first gateway that accepts it
my @gateway = WWW::SMS->gateways(  );
my $sms = WWW::SMS->new($number, $message);
foreach my $gateway (@gateway) {
     if ($sms->send($gateway)) {
          print "Message sent!\n";
          last;
     } else {
          print "Error: $WWW::SMS::Error\n";
     }
}

You can use the script in Example 9-4 from the command line or crontab like so:

perl rsssms.pl http://full.urlof/feed.xml 123456789

You can see how one might set this up in crontab to send the latest news at the desired interval. But how about using the system status module, mod_systemstatus, to automatically detect and inform you of system failures? Perhaps you could use something like Example 9-5.

Example 9-5. mod_systemstatusSMS.pl

#!/usr/local/bin/perl
   
use strict;
use warnings;
use LWP::Simple;
use XML::Simple;
use WWW::SMS;
   
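# Take the command line arguments: feed URL first, then the full mobile number
die "Usage: perl mod_systemstatusSMS.pl <feed URL> <mobile number>\n"
    unless @ARGV == 2;
my ($url, $number) = @ARGV;

# Retrieve the feed, or die gracefully
my $feed_to_parse = get($url) or die "I can't get the feed you want";

# Parse the XML. ForceArray keeps 'item' an array even if the feed has one item
my $parser = XML::Simple->new(  );
my $rss = $parser->XMLin($feed_to_parse, ForceArray => ['item']);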
   
# Initialise the $message
my $message = '';

# Look for downed servers, skipping items that lack the ss:responding element
foreach my $item (@{$rss->{'item'}}) {
    next unless defined $item->{'ss:responding'}
        and $item->{'ss:responding'} eq 'false';
    $message .= "Emergency! $item->{'title'} is down. ";
}
   
# Send the message through the first gateway that accepts it
if ($message) {
     my @gateway = WWW::SMS->gateways(  );
     my $sms = WWW::SMS->new($number, $message);
     foreach my $gateway (@gateway) {
          if ($sms->send($gateway)) {
               print "Message sent!\n";
               last;     # stop here, or every gateway would send a copy
          } else {
               print "Error: $WWW::SMS::Error\n";
          }
     }
}

Again, run from cron, this little beastie will let you monitor hundreds of machines (as long as they are generating the correct RSS) and inform you of a server outage via your mobile phone.

This combination of selective parsing, interesting output methods, and cron allows us to do many things with RSS feeds that a more comprehensive system may well inhibit. Monitoring a list of feeds for mentions of keywords is simple, as is using RSS feeds of stock prices to alert you to falls in the market. Combining these techniques with Publish and Subscribe systems (discussed in Chapter 12) gives us an even greater ability to monitor the world. Want an IRC channel to be notified of any new weblog postings? No problem. Want an SMS whenever the phrase "Free Beer" appears in your local feeds? Again, no problem.
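As a rough illustration of the keyword monitoring just described, the sketch below picks out the items that mention a phrase. The subroutine name matching_titles and the sample data are invented for this example. The $rss structure is assumed to have the shape XML::Simple produces for an RSS 0.9x or 2.0 feed parsed with ForceArray => ['item'], and is built by hand here so the sketch runs without fetching anything.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Return the titles of all items whose title or description mentions $phrase.
# (matching_titles is a hypothetical helper, not part of any module.)
sub matching_titles {
    my ($rss, $phrase) = @_;
    my @hits;
    foreach my $item (@{ $rss->{'channel'}->{'item'} || [] }) {
        # Either element may be missing, so default to an empty string
        my $title = defined $item->{'title'}       ? $item->{'title'}       : '';
        my $desc  = defined $item->{'description'} ? $item->{'description'} : '';
        # \Q...\E treats the phrase as literal text; /i ignores case
        push @hits, $title if "$title $desc" =~ /\Q$phrase\E/i;
    }
    return @hits;
}

# A hand-built structure standing in for a parsed feed (illustrative data only)
my $rss = { channel => { item => [
    { title => 'Council meeting moved', description => 'Now on Tuesday.' },
    { title => 'Pub quiz tonight',      description => 'Free beer for the winners.' },
] } };

my $message = join '; ', matching_titles($rss, 'Free Beer');
print "ALERT: $message\n" if $message;
```

The resulting $message could then be handed to WWW::SMS exactly as in Example 9-5.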

9.2.1 Transforming RSS with XSLT

The transformation of RSS into another form of XML, using XSLT, is not very common at the moment, but it may soon have its time in the sun. This is because RSS (especially RSS 1.0, with its complicated relationships and masses of metadata) can be reproduced in many useful ways.

While the examples in this book have been text-based and mostly XHTML, there is no reason we cannot render RSS into an SVG graphic, a PDF (via the Apache FOP tool), an MMS-SMIL message for new-generation mobile phones, or any of the hundreds of other XML-based systems. XSLT and the arcane art of writing XSLT style sheets to take care of all of this is a subject too large for this book to cover in detail; for that, check out O'Reilly's XSLT, by Doug Tidwell.

Nevertheless, I will show you some nifty stuff. Example 9-6 is an XSLT style sheet that transforms an RSS 1.0 feed into the XHTML we produced in Example 9-2.

Example 9-6. Transforming RSS 1.0 into XHTML fragments

<?xml version="1.0"?>
   
<xsl:stylesheet version = '1.0'
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rss="http://purl.org/rss/1.0/"
exclude-result-prefixes="rss rdf"
>
<xsl:output method="html"/>
   
<xsl:template match="/">
 <div class="channellink">
  <a href="{rdf:RDF/rss:channel/rss:link}">
   <xsl:value-of select="rdf:RDF/rss:channel/rss:title"/>
  </a>
 </div>
 <div class="linkentries">
  <ul>
   <xsl:apply-templates select="rdf:RDF/rss:item"/>
  </ul>
 </div>
</xsl:template>
   
<xsl:template match="rss:item">
 <li>
  <a href="{rss:link}">
   <xsl:value-of select="rss:title"/>
  </a>
 </li>
</xsl:template>
   
</xsl:stylesheet>

Again, just like the parsing code in Example 9-1, it is easy to extend this style sheet to take the modules into account. Example 9-7 extends Example 9-6 to look for the description, dc:creator, and dc:date elements. The changes are the dc namespace declaration, the extra prefix in exclude-result-prefixes, and the new elements inside the template that formats each entry.

Example 9-7. Making the XSLT style sheet more useful

<?xml version="1.0"?>
   
<xsl:stylesheet version = '1.0'
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rss="http://purl.org/rss/1.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
exclude-result-prefixes="rss rdf dc"
>
<xsl:output method="html"/>
   
<xsl:template match="/">
 <div class="channellink">
  <a href="{rdf:RDF/rss:channel/rss:link}">
   <xsl:value-of select="rdf:RDF/rss:channel/rss:title"/>
  </a>
 </div>
 <div class="linkentries">
  <ul>
   <xsl:apply-templates select="rdf:RDF/rss:item"/>
  </ul>
 </div>
</xsl:template>
   
<xsl:template match="rss:item">
 <li>
  <a href="{rss:link}"><xsl:value-of select="rss:title"/></a>
   <ul>
    <li><xsl:value-of select="rss:description"/></li>
    <li>
     <xsl:text>Written by: </xsl:text>
     <xsl:value-of select="dc:creator"/>
    </li>
    <li>
     <xsl:text>Written on: </xsl:text>
     <xsl:value-of select="dc:date"/>
    </li>
   </ul>
 </li>
</xsl:template>
   
</xsl:stylesheet>

9.2.2 Client-Side Inclusion

As mentioned at the beginning of this chapter, client-side inclusion is the way to go if you are setting up a third-party parsing service or hosting the majority of the site on a server that forbids server-side scripting. Doing this is very simple. All you need to do is create a script that returns a piece of JavaScript that displays the necessary XHTML.

To do this, just wrap each line of the XHTML that your ordinary script would produce in a document.writeln( ) function:

document.writeln("<h1>This is the heading</h1>");

and have the script return this document as the result of a call by the script element from the HTML document. So, the HTML document contains this line:

<script type="text/javascript" src="PATH TO PARSING SCRIPT APPENDED WITH FEED URL"></script>

The CGI script returns the document.writeln script; the browser executes it and then parses the resulting XHTML.

The upshot of this technique is that you can start a third-party RSS-parsing service with little effort. All you need to do is distribute the URL of the CGI script you are using and tell people to append the URL of the feed they want to the end of it. Give them the resulting script element to insert into their site code, and everyone is in business:

<script type="text/javascript" src="http://www.bensparsers.com?feed=http://bensfeed.com/index.xml"></script>
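The wrapping step described above might be sketched in Perl as follows. The subroutine name xhtml_to_js and the sample fragment are invented for illustration; in a real CGI script the feed URL would come from the query string and the fragment from your usual parsing code. The escaping shown (backslashes, double quotes, and closing tags) is one cautious choice, not the only one.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Wrap each line of an XHTML fragment in a document.writeln( ) call.
# (xhtml_to_js is a hypothetical helper name.)
sub xhtml_to_js {
    my ($xhtml) = @_;
    my @js;
    foreach my $line (split /\n/, $xhtml) {
        $line =~ s/\\/\\\\/g;     # escape backslashes first
        $line =~ s/"/\\"/g;       # then double quotes
        $line =~ s{</}{<\\/}g;    # keep "</" from ending an inline script early
        push @js, qq{document.writeln("$line");};
    }
    return join("\n", @js) . "\n";
}

# A fixed fragment stands in for the parsed feed (illustrative data only)
my $fragment = qq{<div class="channellink">\n<a href="http://example.org/">Example</a>\n</div>};

# The header tells the browser that JavaScript, not HTML, is coming back
print "Content-type: text/javascript\n\n";
print xhtml_to_js($fragment);
```

Escaping matters here because feed titles routinely contain quotation marks, and an unescaped one would end the JavaScript string early and break the whole page fragment.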

9.2.3 Server-Side Inclusion

The more powerful method is server-side inclusion (SSI). It allows you to parse the feed using any technique and any language you like, and it allows greater flexibility for how the feed is used.

Let's look at an example of how it works. Example 9-8 shows an XHTML page with a server-side include directive.

Example 9-8. An XHTML page with a server-side include

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>An Example of an SSI</title>
</head>
<body>
<h1>This here is a News Feed from a really good site</h1>
<!--#include file="parsedfeed.html" -->
</body>
</html>

If the server is set up correctly, it will import the contents of parsedfeed.html and insert them in place of the SSI directive whenever it serves the page in Example 9-8.

So, by parsing RSS files into XHTML and saving them to disk, we can use SSI to place them within our existing XHTML page, apply formatting to change the way they look via the site's CSS style sheet, and present them to the end user.
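One practical detail when saving the parsed feed to disk is that the web server may read the file while your script is still writing it. A minimal sketch of one way around this, assuming a hypothetical helper called save_fragment and a placeholder fragment, is to write to a temporary file in the same directory and then rename it over the target:

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Write $xhtml to $target by way of a temporary file in the same directory,
# so the web server never includes a half-written fragment.
# (save_fragment is a hypothetical helper name.)
sub save_fragment {
    my ($target, $xhtml) = @_;
    my $tmp = "$target.tmp.$$";          # $$ is this process's ID
    open(my $out, '>', $tmp) or die "Can't write $tmp: $!";
    print {$out} $xhtml;
    close($out) or die "Can't close $tmp: $!";
    rename($tmp, $target) or die "Can't rename $tmp to $target: $!";
    return $target;
}

# The fragment here is placeholder text; normally it comes from the parser
save_fragment('parsedfeed.html',
    "<ul>\n<li><a href=\"http://example.org/\">Example headline</a></li>\n</ul>\n");
```

On most Unix filesystems a rename within one directory replaces the old file in a single step, which is why the temporary file sits next to the target rather than in /tmp.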

9.2.3.1 Enabling server-side includes within Apache 1.3.x

Turning on server-side includes within Apache is straightforward, but it involves delving into places where a wrong move can make a nasty mess. Have a coffee, then concentrate. (N.B.: this section discusses Apache Version 1.3.x. Apache's configuration structure may change in later versions. Consult the documentation online at http://www.apache.org.)

To permit SSI on your server, you must have the following directive either in your httpd.conf file, or in a .htaccess file:

Options +Includes

This tells Apache that you want to permit files to be parsed for SSI directives. Of course, real-world installations are more complicated than that: most Apache installations have multiple Options directives set, in some cases one for each directory. You will most likely want to apply the Options directive to the specific directory in which you want SSI enabled, that is, the directory holding the document in which you want to include the RSS feeds.

Example 9-9 shows the relevant section of the httpd.conf file for my own server.

Example 9-9. A section of an Apache httpd.conf file that allows for CGI and SSI

<Directory "/usr/local/apache/htdocs/rss">
Options ExecCGI Includes
DirectoryIndex index.shtml
</Directory>

Note that this configuration defines the directory's index page as index.shtml, because it is not a good idea to make your server seek out SSI directives in every page it serves. Rather, you should tell it to look for SSI directives solely in pages that end with a certain file extension, by adding the following lines to your httpd.conf file:

AddType text/html .shtml
AddHandler server-parsed .shtml

This makes Apache search any file ending in .shtml (the traditional extension for such things) for SSI directives and replace them with their associated files before serving them to the end user.

This approach has a disadvantage: if you want to add SSI directives to an existing page, you have to rename that page, which breaks every existing link to it. So, if you're retrofitting a site with RSS, the alternative is to use the XBitHack directive within your httpd.conf file:

XBitHack on

XBitHack tells Apache to parse files for SSI directives if the files have the execute bit set. So, to add SSI directives to an existing page, rather than having to change the filename, you just need to make the file executable using chmod.

How Often to Read the Feed

RSS feeds do change, it is true. People update their sites at all times of the day or night, and it would be lovely to have the very latest headlines. Currently, however, it is not a good idea to keep requesting a new RSS feed every few minutes. Etiquette and convention limit our requests for a new file to once every 60 minutes, unless the feed's publisher has specifically said that we can grab it more often, or unless they are using Publish and Subscribe (see Chapter 12).

In many cases, even requesting the feed every hour is too much. Feeds that change only once a day require downloading only once a day. It's a simple courtesy to pay attention to these conventions.
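One simple way to honor the hourly convention, even if cron runs your script more often, is to check the age of the local copy before asking for a new one. The sketch below assumes a hypothetical helper named feed_is_stale and a cached copy called feed.xml; neither comes from a module.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# True if $file is missing or was last modified more than $max_age seconds ago.
# (feed_is_stale is a hypothetical helper name.)
sub feed_is_stale {
    my ($file, $max_age) = @_;
    my $mtime = (stat $file)[9];         # modification time, undef if no file
    return 1 unless defined $mtime;      # no cached copy yet
    return (time - $mtime) > $max_age ? 1 : 0;
}

my $cache = 'feed.xml';   # hypothetical local copy of the feed
if (feed_is_stale($cache, 60 * 60)) {
    print "Cache is older than an hour (or missing): fetch the feed again\n";
} else {
    print "Cache is fresh: reuse it and leave the publisher's server alone\n";
}
```

When the cache is stale, LWP::Simple's mirror($url, $file) is a convenient way to refetch, because it sends an If-Modified-Since header and leaves the local file untouched if the feed has not changed.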

Now all that remains is to write the server-side include. Apache's SSI abilities are quite powerful, but we need to concern ourselves only with a limited subset here. If you're curious, take a look at http://httpd.apache.org/docs/howto/ssi.html.

9.2.3.2 Server-side includes with Microsoft IIS

Microsoft's Internet Information Services (IIS) server package comes with server-side includes enabled: by default, it will process any file ending in .stm, .shtm, or .shtml. However, files will be processed only if they're inside directories with Scripts or Execute access permissions.

To set these permissions:

1. Open My Computer, select the directory in which you want to allow SSI, and right-click to open its property menu.
2. On the Security property menu, select the Windows account for which you want to change permissions.
3. Under Permissions, select the types of access for the selected user or group. Use Allow to specifically allow access and Deny to specifically deny access. For more choices, click Advanced.