Hack 93 Use Cocoon to Create a Well-Formed View of a Web Page, Then Scrape It for Data


Cocoon is a popular web development framework from Apache.

To an XML hacker, the Web is a frustrating place. Little islands of well-formed XML content are awash in vast seas of "tag soup" in the form of malformed HTML documents [Hack #49]. Using a technique known as screen-scraping, it's possible to extract information from these pages, relying on knowledge of specific markup practices or document structures to pick out the data items from amongst the presentation elements.

Generally, screen-scraping involves using text processing tools like Perl that ignore the markup completely. However, there are ways to apply screen-scraping techniques using XML tools such as XSLT. The benefit is that you can go a little further with the tools immediately at hand without having to context switch to another environment, which is good news when you're after fast results.

This hack demonstrates how to use elements of Cocoon (http://cocoon.apache.org/), an open source XML processing framework, to create a well-formed view of any web page and then apply XSLT to the results to extract some useful metadata.

7.4.1 Cocoon in 60 Seconds

For the uninitiated, Cocoon is an open source XML processing framework written in Java. It has a loyal community that is continually enhancing its functionality, but at the core is a very simple but powerful design pattern: the pipeline.

A Cocoon pipeline consists of three basic components: a Generator, responsible for providing the data to be processed; a Transformer that performs some useful processing on that data; and, ultimately, a Serializer that assembles the results. The interface used to glue these components together is the SAX API: a Generator produces SAX events, which are fed into Transformers, which in turn deliver events further down the pipeline until they're finally delivered to a Serializer that constructs the resulting document.

Cocoon is bundled with many different implementations of each of these components, with the most common being: the XML Generator, which is an XML parser; the XSLT Transformer, which applies an XSLT transform to the data passing through the pipeline; and the XML Serializer, which turns the SAX events back into an XML document.

However, among the other varied generators available in Cocoon is the HTML Generator, which is capable of turning any HTML page into well-formed XML that can then be processed by other components in a pipeline. The HTML Generator achieves this using JTidy (http://jtidy.sourceforge.net/), a Java port of the command-line tool HTML Tidy [Hack #22].

Cocoon runs as a web application, making it a quick and simple way to publish XML data using XSLT. An individual instance of Cocoon uses a configuration file called sitemap.xmap that describes the required processing pipelines and binds each of them to a particular request URL that will trigger its processing.
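As a rough illustration of the Generator, Transformer, and Serializer pattern, a single pipeline entry in a sitemap might look like the following sketch. The request pattern report.xml and the files data/report.xml and report.xsl are hypothetical names invented for this example, and the fragment assumes that a generator named file, a transformer named xslt, and a serializer named xml have been declared in the components section of the same sitemap.

```xml
<!-- Hypothetical fragment: belongs inside <map:pipelines> in sitemap.xmap -->
<map:pipeline>
  <!-- Bind the pipeline to a request for report.xml (invented name) -->
  <map:match pattern="report.xml">
    <!-- Generator: parse an XML file and emit SAX events -->
    <map:generate type="file" src="data/report.xml"/>
    <!-- Transformer: apply a stylesheet to the events passing through -->
    <map:transform type="xslt" src="report.xsl"/>
    <!-- Serializer: turn the resulting events back into an XML document -->
    <map:serialize type="xml"/>
  </map:match>
</map:pipeline>
```

The sitemap used in this hack follows the same shape, but swaps in the HTML Generator as the source of SAX events.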

7.4.2 Running the Hack

To run this hack you'll need to download and install Cocoon from http://cocoon.apache.org. Cocoon is available only as a source distribution, but installation and setup are straightforward.

First, ensure that you have Java installed and a JAVA_HOME environment variable pointing to the location of the installation. Then, after unpacking the source distribution, change into the newly created directory and execute the following:

./build.sh

./cocoon.sh servlet

This will build the Cocoon application and then start it up as a standalone service that will be available at http://localhost:8888/. Consult the Cocoon documentation for more information on tweaking the build as well as how to install Cocoon into an existing servlet container. For the rest of this hack we'll refer to the location of the Cocoon installation as $COCOON_HOME.

The first step toward implementing this hack is to configure Cocoon using a sitemap. Copy the file in Example 7-10, sitemap.xmap, into the directory $COCOON_HOME/build/webapp. You'll find sitemap.xmap in the working directory where you unzipped the file archive that came with the book.

Be sure to back up the existing file named sitemap.xmap in $COCOON_HOME/build/webapp if you want to try out Cocoon demos later.


Example 7-10. sitemap.xmap
<map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/1.0">

 <map:components>

 

 <map:generators default="html">

   <map:generator name="html" 

        src="org.apache.cocoon.generation.HTMLGenerator">

   </map:generator>

  </map:generators>

   

  <map:transformers default="xslt">

   <map:transformer name="xslt" 

        src="org.apache.cocoon.transformation.TraxTransformer">

         <use-request-parameters>false</use-request-parameters>

   </map:transformer>

  </map:transformers>

   

  <map:serializers default="xml">

   <map:serializer mime-type="text/xml" 

                   name="xml" 

                   src="org.apache.cocoon.serialization.XMLSerializer"/>

  </map:serializers>

   

  <map:matchers default="param">

   <map:matcher name="param"

        src="org.apache.cocoon.matching.WildcardRequestParameterMatcher">

        <parameter-name>url</parameter-name>

   </map:matcher>

  </map:matchers>

   

  <map:selectors />

   

  <map:actions/>

   

  <map:pipes default="caching">

   <map:pipe name="caching" 

       src="org.apache.cocoon.components.pipeline.impl.CachingProcessingPipeline"/>

  </map:pipes>

   

 </map:components>

   

 <map:views/>

 

 <map:resources/>

 

 <map:action-sets/>

   

 <map:pipelines>            

  <map:pipeline>

   <map:match pattern="**">

    <map:generate type="html" src="{1}"/>           

    <map:transform src="extractMetadata.xsl">

     <map:parameter name="url" value="{1}"/>

    </map:transform>

    <map:serialize type="xml"/>

   </map:match>

  </map:pipeline>

 </map:pipelines>

   

</map:sitemap>

A sitemap consists of two main sections. The first portion of the sitemap is a series of component definitions that declare the different kinds of Generator, Transformer, and Serializer components that will be available for use by pipelines described later in the sitemap. Each component is named so that it can be referred to later; it's possible to declare an implementation as the default for a particular component type.

In this instance there are three component definitions:

  • The HTML Generator, which will be responsible for fetching and parsing the required HTML document, applying JTidy to ensure that it's well-formed.

  • The XSLT Transformer, which is responsible for invoking a stylesheet to process the content.

  • The XML Serializer, which will produce the resulting document that will be delivered in response to the request.

The other component worth mentioning is the Matcher. This is used to bind an HTTP request to a particular pipeline that will be used to generate the response. A Matcher uses a wildcard or regular expression to select a given pipeline based on some aspect of the request. In this case, we're using a Matcher that tests for a request parameter named url.

The second half of a sitemap consists of the pipeline definition, which combines the declared components to perform some useful processing. In this simple example there is only a single pipeline definition:

<map:pipeline>

 <map:match pattern="**">

  <map:generate src="{1}"/>

  <map:transform src="extractMetadata.xsl">

   <map:parameter name="url" value="{1}"/>

  </map:transform>

  <map:serialize/>

 </map:match>

</map:pipeline>

The pipeline will match any incoming request with a url parameter and will then take the following steps:

  1. Generate data for the pipeline by accessing the web page referenced in the url parameter. The HTML Generator will internally run the content of the page through JTidy to generate well-formed HTML.

  2. Transform the results using a stylesheet called extractMetadata.xsl, passing the original URL as a stylesheet parameter.

  3. Serialize the results of the transform as an XML document, which will be returned as the response.
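As a variation on these steps, the match pattern can be tightened so that only certain addresses are fetched. The sketch below is not part of the original hack. It assumes that the param matcher applies its wildcard pattern to the value of the url parameter, as described above, and that {1} holds whatever the ** wildcard matched, so the http:// prefix has to be added back by hand.

```xml
<map:pipeline>
  <!-- Only accept requests whose url parameter starts with http:// -->
  <map:match pattern="http://**">
    <!-- {1} is the text matched by **, so restore the prefix -->
    <map:generate src="http://{1}"/>
    <map:transform src="extractMetadata.xsl">
      <map:parameter name="url" value="http://{1}"/>
    </map:transform>
    <map:serialize/>
  </map:match>
</map:pipeline>
```

A request whose url parameter does not begin with http:// would then fail to match any pipeline, which is one simple way to keep the service from being pointed at other kinds of resources.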

The XSLT stylesheet in Example 7-11, extractMetadata.xsl, should be copied from the working directory and stored in $COCOON_HOME/build/webapp.

Example 7-11. extractMetadata.xsl
<xsl:stylesheet version="1.0"

                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"

                xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

                xmlns:dc="http://purl.org/dc/elements/1.1/"

                xmlns:xhtml="http://www.w3.org/1999/xhtml">

   

<xsl:output method="xml" indent="yes"/>

   

<xsl:param name="url"/>

   

<xsl:variable name="lcletters">abcdefghijklmnopqrstuvwxyz</xsl:variable>

<xsl:variable name="ucletters">ABCDEFGHIJKLMNOPQRSTUVWXYZ</xsl:variable>

   

<xsl:template match="xhtml:body"/>

   

<xsl:template match="xhtml:html">

   <xsl:apply-templates select="xhtml:head"/>

</xsl:template>

   

<xsl:template match="xhtml:head">

   <rdf:Description rdf:about="{$url}">

      <xsl:apply-templates select="xhtml:title"/>

      <xsl:apply-templates select="xhtml:meta"/>

   </rdf:Description>

</xsl:template>

   

<xsl:template match="xhtml:title">

   <dc:title><xsl:value-of select="."/></dc:title>

</xsl:template>

   

<xsl:template match="xhtml:meta">

   <xsl:variable name="name">

      <xsl:choose>

         <xsl:when test="@http-equiv">

            <xsl:value-of select="translate(@http-equiv, $ucletters, 

                                $lcletters)"/>

         </xsl:when>

         <xsl:when test="@name">

            <xsl:value-of select="translate(@name, $ucletters, $lcletters)"/>

         </xsl:when>

      </xsl:choose>

   </xsl:variable>

   <xsl:choose>

      <xsl:when test="$name = 'content-type' or $name='dc.format'">

         <dc:format><xsl:value-of select="@content"/></dc:format>

      </xsl:when>

      <xsl:when test="$name = 'content-language' or $name='dc.language'">

         <dc:language><xsl:value-of select="@content"/></dc:language>

      </xsl:when>            

      <xsl:when test="$name = 'description' or $name = 'dc.description'">

         <dc:description><xsl:value-of select="@content"/></dc:description>

      </xsl:when>

      <xsl:when test="$name = 'keywords' or $name = 'dc.subject'">

         <dc:subject><xsl:value-of select="@content"/></dc:subject>

      </xsl:when>

      <xsl:when test="$name = 'copyright' or $name = 'dc.rights'">

         <dc:rights><xsl:value-of select="@content"/></dc:rights>

      </xsl:when>      

      <xsl:when test="$name = 'dc.title'">

         <dc:title><xsl:value-of select="@content"/></dc:title>

      </xsl:when>

      <xsl:when test="$name = 'dc.publisher'">

         <dc:publisher><xsl:value-of select="@content"/></dc:publisher>

      </xsl:when>

      <xsl:when test="$name = 'dc.date'">

         <dc:date><xsl:value-of select="@content"/></dc:date>

      </xsl:when>

      <xsl:when test="$name = 'dc.creator'">

         <dc:creator><xsl:value-of select="@content"/></dc:creator>

      </xsl:when>

      <xsl:when test="$name = 'dc.type'">

         <dc:type><xsl:value-of select="@content"/></dc:type>

      </xsl:when>      

      <xsl:when test="$name = 'dc.contributor'">

         <dc:contributor><xsl:value-of select="@content"/></dc:contributor>

      </xsl:when>     

      <xsl:when test="$name = 'dc.coverage'">

         <dc:coverage><xsl:value-of select="@content"/></dc:coverage>

      </xsl:when>                 

      <xsl:otherwise/>

   </xsl:choose>

</xsl:template>

   

</xsl:stylesheet>

The stylesheet is capable of processing any well-formed HTML page to extract some useful metadata, which it writes out as an RDF description of the page. (RDF is the W3C's standard framework for capturing metadata about web resources.) The Dublin Core project (in the namespace http://purl.org/dc/elements/1.1/) defines a number of standard properties that can be used to describe a web resource using RDF. These properties cover simple items such as title, creator, and so forth.

The Dublin Core project also defines a standard way to embed those properties in an HTML document using the meta element. All this stylesheet essentially does is extract that metadata from this standard location (as well as a few other, common, nonstandard ones) to build an appropriate RDF document.

The stylesheet itself is straightforward, consisting primarily of a large conditional block that tests for the presence of different items of metadata, emitting the appropriate RDF property if found.
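To make these conventions concrete, here is a hypothetical page head of the kind the stylesheet is written for. Every value is invented for illustration, and the markup is shown as well-formed XHTML in the XHTML namespace, on the assumption that this is what the stylesheet receives once the HTML Generator has tidied the page.

```xml
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>An Example Page</title>
    <!-- Declares the DC prefix used in the meta names below -->
    <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/"/>
    <!-- Dublin Core properties embedded with the meta element -->
    <meta name="DC.title" content="An Example Page"/>
    <meta name="DC.creator" content="A. N. Author"/>
    <meta name="DC.date" content="2004-01-01"/>
    <!-- Common nonstandard locations that the stylesheet also checks -->
    <meta name="description" content="A page used to illustrate the hack."/>
    <meta http-equiv="Content-Language" content="en"/>
  </head>
  <body>
    <p>The body is ignored by the stylesheet.</p>
  </body>
</html>
```

Because the stylesheet lowercases each name before testing it, DC.title and dc.title are treated alike. For a head like this one, the templates would produce dc:title (once from the title element and once from the DC.title entry), dc:creator, dc:date, dc:description, and dc:language properties. The link element is skipped because only the title and meta children of head are processed.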

With both the sitemap and stylesheet in place, it's now possible to try out the hack. Make sure that Cocoon is running and try a URL such as:

http://localhost:8888/tidy?url=http://hacks.oreilly.com

This will trigger the pipeline and should deliver an RDF document like the one shown in Example 7-12.

Example 7-12. Output from Cocoon hack
<?xml version="1.0" encoding="ISO-8859-1"?>

<rdf:Description xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 

                 xmlns:xhtml="http://www.w3.org/1999/xhtml" 

                 xmlns:dc="http://purl.org/dc/elements/1.1/" 

                 rdf:about="http://hacks.oreilly.com">

   <dc:title>hacks.oreilly.com -- O'Reilly Hacks Series</dc:title>

   

   <dc:description>Hacks are tools, tips, and tricks that help users 

   solve problems. They are aimed at intermediate-level power users 

   and scripters. Each book is a collection of 100 article-length 

   hacks, and each one provides detailed examples that show how to 

   solve practical problems. Got a hack? Share it with us.

   </dc:description>

   

</rdf:Description>

Substitute any web address for the value of the url parameter to process a different page. Substitute another stylesheet in the pipeline definition to perform a more complex transformation.

7.4.3 Extending the Hack

There are several ways that this hack could be extended. One example is to exploit more of Cocoon's functionality to build a full-fledged application or web service that harvests some or even all of its data by scraping web pages and other data sources.

The example stylesheet is also fairly generic. It attempts to provide some useful basic metadata about any web page. However, in some cases the required data may be part of the actual page body, requiring a more complex transform. This extension can be used to extract data from services that don't currently offer an XML interface.
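As a minimal sketch of a transform that looks inside the page body, the following stylesheet lists every hyperlink on a page. It is not part of the original hack: the links and link element names are invented for this example, and it assumes, like extractMetadata.xsl, that the tidied page arrives in the XHTML namespace.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xhtml="http://www.w3.org/1999/xhtml"
                exclude-result-prefixes="xhtml">

  <xsl:output method="xml" indent="yes"/>

  <!-- The address of the page, passed in from the sitemap -->
  <xsl:param name="url"/>

  <!-- Wrap the results in a single root element (hypothetical name) -->
  <xsl:template match="/">
    <links source="{$url}">
      <xsl:apply-templates select="//xhtml:a[@href]"/>
    </links>
  </xsl:template>

  <!-- Emit one entry per anchor: its target and its link text -->
  <xsl:template match="xhtml:a">
    <link href="{@href}">
      <xsl:value-of select="normalize-space(.)"/>
    </link>
  </xsl:template>

</xsl:stylesheet>
```

Saved alongside the sitemap under a name of your choosing, a stylesheet like this could be referenced from the src attribute of the map:transform element in place of extractMetadata.xsl. Real pages usually call for more specific match patterns, tied to the markup habits of the site being scraped.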

Extracting data using only XSLT can be quite tricky. By adopting Cocoon as the basic framework, it's possible to take advantage of additional features as you require them: for example, writing a custom Transformer component to process the data using the SAX API rather than relying on just XSLT. The mark of any good framework is that there's room for growth.

- Leigh Dodds