10.2 Streams and Events

The stream approach treats XML content as a pipeline. As it rushes past, you have one chance to work with it, no look-ahead or look-behind. It is fast and efficient, allowing you to work with enormous files in a short time, but depends on simple markup that closely follows the order of processing.

In programming jargon, a stream is a sequence of data chunks to be processed. A file, for example, is a sequence of characters (one or more bytes each, depending on the encoding). A program using this data can open a filehandle to the file, creating a character stream, and it can choose to read in data in chunks of whatever size it chooses. Streams can be dynamically generated too, whether from another program, received over a network, or typed in by a user. A stream is an abstraction, making the source of the data irrelevant for the purpose of processing.

To summarize, here are a stream's important qualities:

It consists of a sequence of data fragments.
The order of fragments transmitted is significant.
The source of data (e.g., file or program output) is not important.

XML streams are more clumpy than character streams, which are just long sequences of characters. An XML stream emits a series of tokens or events, signals that denote changes in markup status. For example, an element has at least three events associated with it: the start tag, the content, and the end tag. The XML stream is constructed as it is read, so events happen in lexical order. The content of an element will always come after the start tag, and the end tag will follow that.

Parsers can assemble this kind of stream very quickly and efficiently thanks to XML's parser-friendly design. Other formats often require some look-ahead or complex lookup tables before processing can begin. For example, SGML does not have a rule requiring nonempty elements to have an end tag. To know when an element ends requires sophisticated reasoning by the parser, making code more complex, slowing down processing speed, and increasing memory usage.

You might wonder why an XML stream does not package up complete elements for processing. The reason is that XML is hierarchical. Elements are nested, so it is not possible to separate them into discrete packages in a stream. In fact, it would resemble the tree method, handing out exactly one element, the root of the document assembled into a single data structure.

The event model of processing is quite simple. There are only a few event types to keep track of, including element tags, character data, comments, processing instructions, and the boundaries of the document itself. Let us look an example of how a parser might slice up a document into an XML event stream. Consider the data file in Example 10-1.

Example 10-1. A simple XML document with lots of markup types

<recipe>
  <name>peanut butter and jelly sandwich</name>
  <!-- add picture of sandwich here -->
  <ingredients>
    <ingredient>Gloppy™ brand peanut butter</ingredient>
    <ingredient>bread</ingredient>
    <ingredient>jelly</ingredient>
  </ingredients>
  <instructions>
    <step>Spread peanut butter on one slice of bread.</step>
    <step>Spread jelly on the other slice of bread.</step>
    <step>Put bread slices together, with peanut butter and
  jelly touching.</step>
  </instructions>
</recipe>

A stream-generating parser would report these events:

A document start
A start tag for the recipe element
A start tag for the name element
The piece of text "peanut butter and jelly sandwich"
An end tag for the name element
A comment with the text "add picture of sandwich here"
A start tag for the ingredients element
A start tag for the ingredient element
The text "Gloppy"
A reference to the entity trade
The text "brand peanut butter"
An end tag for the ingredient element

...and so on, until the final eventthe end of the documentis reached.

Somewhere between chopping up a stream into tokens and processing the tokens is a layer one might call an event dispatcher. It branches the processing depending on the type of token. The code that deals with a particular token type is called an event handler. There could be a handler for start tags, another for character data, and so on. A common technique is to create a function or subroutine for each event type and register it with the parser as a call-back, something that gets called when a given event occurs.

Streams are good for a wide variety of XML processing tasks. Programs that use streams will be fast and able to handle very large documents. The code will be simple and fit the source data like a glove. Where streams fail are situations in which data is so complex that it requires a lot of searching around. For example, XSLT jumps from element to element in an order that may not match the lexical order at all. When that is the case, we prefer to use the tree model.