3.4 Documents Describing Documents

Many XML documents contain metadata, information about themselves that helps search engines categorize them. But not everyone takes advantage of metadata. And unless you're using an exhaustive program that spiders through an entire document collection, it's difficult to summarize the set and choose a particular article from it. Making matters worse, some kinds of documents, such as sound and graphics files, cannot describe themselves at all. To address these problems, a class of documents evolved that specializes in describing other documents.

To fully describe different kinds of documents, these markup languages have some interesting features in common. They record when documents were last updated, using standard time formats. They label the content type, be it text, image, sound, or something else. They may contain text descriptions for a user to peruse. For international documents, they may track the language encodings. Also interesting is the way documents are uniquely identified: either by a physical address or by some nonphysical identifier.

3.4.1 Describing Media

RSS, short for Rich Site Summary (or Really Simple Syndication, depending on whom you talk to), was created by Netscape Corp. to describe content on web sites. Netscape wanted to make a customizable portal that let readers subscribe to particular subject areas, or channels. Each time readers returned to the site, they would see updates on their favorite topics, saving them the trouble of hunting around for this news on their own. Thus was born the service known as content aggregation.

Since the time when there were a few big content aggregators like Netscape and Userland, the landscape has shifted to include hundreds of smaller, more granular services. Instead of subscribing to channels that mix together lots of different sources, you can subscribe to individual sites for an even higher level of customization. Everything from the BBC to a swarm of one-person weblogs is at your disposal. Publishing has never been easier.

RSS works like the cover of a magazine, beckoning to you from the newsstand. Splashed all over the cover graphic are the titles of articles, like "Lose Weight with the Ice Cream Diet" and "Ten Things the Government Doesn't Want You to Know About UFOs." At a glance, you can decide whether you must have this issue or National Geographic instead. It saves you time and keeps the newsstand owner from yelling at you for reading without buying.

There are a few different models of publishing with RSS. The pull model is where a content aggregator checks an RSS file periodically to see if anything has been updated, pulling in new articles as they appear. In the push model, also called publish and subscribe, the information source informs the content aggregator when it has something new to offer. In both cases, RSS serves as a menu for the aggregator (or the user logged into it) to decide whether the articles are of interest.
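
As a rough illustration of how a feed can accommodate both models, RSS 2.0 defines two optional channel elements: ttl, which suggests how many minutes a pull-style aggregator may cache the feed before checking it again, and cloud, which names a service where aggregators can register for update notifications. The sketch below uses a made-up channel; the host name, path, and procedure name are placeholders, not a real service.

```xml
<rss version="2.0">
  <channel>
    <title>Example Channel</title>
    <link>http://www.example.org/</link>
    <description>A hypothetical feed used only to show ttl and cloud.</description>
    <!-- pull model: a hint that aggregators may cache this feed for
         60 minutes before checking it again -->
    <ttl>60</ttl>
    <!-- push model: a hypothetical notification service where an
         aggregator can register to be told when the feed changes -->
    <cloud domain="rpc.example.org" port="80" path="/RPC2"
           registerProcedure="pleaseNotify" protocol="xml-rpc"/>
  </channel>
</rss>
```

Either way, the aggregator still reads the same channel and item descriptions to decide what is worth fetching.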

Example 3-11 shows a sample of RSS describing a fictional web site. The owner of a web site registers this file with every aggregator that should list the site, hoping the descriptions will convince people that it's interesting enough to subscribe to.

Example 3-11. RSS describing a web site
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<rss version="2.0">
  <channel>
    <title>Lifestyles of the Foolhardy</title>
    <link>http://www.foolhardy.org/</link>
    <description>
Incredibly bold or just plain stupid? Tips and tricks for folks who
just don't have time to think about safety.
    </description>
    <language>en-us</language>
    <copyright>Copyright 2002 Liv Dangerously</copyright>
    <lastBuildDate>Fri, 20 Sep 2002 11:05:02 GMT</lastBuildDate>
    <managingEditor>liv@foolhardy.org (Liv Dangerously)</managingEditor>
    <item>
      <title>Using a Hair Dryer in the Bathtub.</title>
      <pubDate>Fri, 20 Sep 2002 11:05:02 GMT</pubDate>
      <description>
Don't wait till bathtime is over to dry your hair. Save time and do
both at once.
      </description>
    </item>
    <item>
      <description>
Sounds of someone falling down the stairs; a brave soul proving that
rollerskates aren't just for flat surfaces.
      </description>
      <pubDate>Fri, 20 Sep 2002 10:28:19 GMT</pubDate>
      <enclosure url="http://www.foolhardy.org/sounds/stairs.mp3"
                 length="44456" type="audio/mpeg"/>
    </item>
  </channel>
</rss>

The first thing inside the document element rss is a channel element giving a general overview of the site. In it, we find:

  • Descriptive text including a short title and longer paragraph.

  • Administrative details: contact information, link to the main page, copyright.

  • Language identifier en-us, which means that it uses the American variant of English (more about language encodings in Chapter 9).

  • Time of last update, using a standard time format (RFC 822) recognized all over the Internet.

After these channel-level details comes a series of item elements, nested inside channel, one for each item on the site. This example has two: a text document and a sound file. Each has a corresponding item element, which can contain a title, a text description, a link to the resource, and the date it was posted.

The sound file has an additional element, enclosure, which provides some details about the format. The type attribute gives the content type, audio/mpeg. The format of this description comes from the Multipurpose Internet Mail Extensions (MIME) standard (RFC 2046).

For more about RSS, see Ben Hammersley's Content Syndication with RSS (O'Reilly & Associates, 2003).

3.4.2 Templates

Describing documents is also the job of the transformation language XSLT. But in this case, we're talking about documents in the future. XSLT generates new documents from old ones, following rules in a transformation stylesheet. For each element in the source document, there will be a rule dictating what to do with it and its content. The rule can be explicit (defined by you) or implicit (not finding a specific rule, the processor falls back on a default one).

These rules are encoded in an XSLT document using an ingenious mechanism called a template, which is a sample of a piece of the result document. Some blanks in the template still need to be filled in, but otherwise, you can see by looking at the template how the future document will look.

Here is a typical XSLT stylesheet with a couple of templates:


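<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- first template: how to process a <para> element -->
  <xsl:template match="para">
    <p><xsl:apply-templates/></p>
  </xsl:template>

  <!-- second template: how to process a <note> element -->
  <xsl:template match="note">
    <div class="note"><xsl:apply-templates/></div>
  </xsl:template>
</xsl:stylesheet>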

The first template tells the XSLT processor what to output when it comes across a para element in the source document. The second is a rule for note elements. In each case, the template element's contents mirror a piece of the document to be created.

Note the use of the namespace prefix xsl: in some of the elements. This is how the XSLT processor can tell the difference between markup to be obeyed as instructions and markup to be output. In other words, if there is no xsl: prefix, the markup is treated as data and carried to the output document as is. Some of the instruction elements, like xsl:apply-templates, control the flow of processing, making the XSLT processor recurse through the source document looking for more elements to transform.
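
To see how literal markup and instructions combine, consider a small hypothetical source document. The section element is invented for this illustration and deliberately has no template of its own. The sketch assumes that the para template wraps its content in an HTML p element, that the note template wraps its content in the div shown above, and that both call xsl:apply-templates to process their children.

```xml
<!-- hypothetical source document -->
<section>
  <note>
    <para>Wear a helmet when skating down the stairs.</para>
  </note>
</section>
```

Because no rule matches section, the processor falls back on the default rule, which simply moves on to the element's children. Whitespace and the XML declaration aside, the result would look something like this:

```xml
<!-- result: the unprefixed markup from the templates, filled in
     with text from the source -->
<div class="note">
  <p>Wear a helmet when skating down the stairs.</p>
</div>
```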

In some ways, XML schemas are similar to XSLT. They use templates to describe parts of documents as they should be, instead of as they will be. In other words, a schema is a test to determine whether a document can be labeled a valid instance of a language.
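
For comparison, here is a minimal sketch of a schema for the same two elements, written in W3C XML Schema. The content models are assumptions made for illustration: a para holds only text, and a note holds one or more para elements.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- a para may contain only character data (assumed for this sketch) -->
  <xs:element name="para" type="xs:string"/>

  <!-- a note must contain at least one para and nothing else -->
  <xs:element name="note">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="para" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Where the stylesheet says what a note will become, the schema says what a note must look like to be accepted.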

Templates are a good design mechanism for documents describing documents because they are modular and easy to understand. Instead of looking at the whole document, you only have to imagine one element at a time. Templates can be imported from other files and mixed to add more flexibility.
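
As a sketch of that modularity, an XSLT stylesheet can pull in rules from another file with xsl:import and then add or override individual templates. The file name common-blocks.xsl and the aside output element are hypothetical.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- xsl:import has to come before any other top-level element;
       common-blocks.xsl is a hypothetical file of shared templates -->
  <xsl:import href="common-blocks.xsl"/>

  <!-- a local rule for note; it takes precedence over any rule for
       note that the imported file may define -->
  <xsl:template match="note">
    <aside class="note"><xsl:apply-templates/></aside>
  </xsl:template>
</xsl:stylesheet>
```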

XML schemas are discussed in greater detail in Chapter 4, and XSLT in Chapter 7.