10.8 SAX

The Simple API for XML (SAX) is one of the first, and currently the most popular, methods for working with XML data. It evolved from discussions on the XML-DEV mailing list and, shepherded by David Megginson,[1] was quickly shaped into a useful specification.

[1] David Megginson maintains a web page about SAX at http://www.saxproject.org.

The first incarnation, called SAX Level 1 (or just SAX1), supports elements, attributes, and processing instructions. It doesn't handle some other things like namespaces or CDATA sections, so the second iteration, SAX2, was devised, adding support for just about any event you can imagine in generic XML. Since there's no good reason not to use SAX2, you can assume that SAX2 is what we are talking about when we say "SAX."

SAX was originally developed in Java in a package called org.xml.sax. As a consequence, most of the literature about SAX is Java-centric and assumes that is the environment you will be working in. Furthermore, there is no formal specification for SAX in any programming language but Java. Analogs in other languages exist, such as XML::SAX in Perl, but they are not bound by the official SAX description. In practice, they are whatever their developer communities decide they should be.

David Megginson has made SAX public domain and has allowed anyone to use the name. An unfortunate consequence is that many implementations are really just "flavors" of SAX and do not match in every detail. This is especially true for SAX in other programming languages where the notion of strict compliance would not even make sense. This is kind of like the plethora of Unix flavors out today; they seem much alike, but have some big differences under the surface.

10.8.1 Drivers

SAX describes a universal interface that any SAX-aware program can use, no matter where the data is coming from. Figure 10-1 shows how this works. Your program is at the right. It contacts the ParserFactory object to request a parser that will serve up a stream of SAX events. The factory finds a parser and starts it running, routing the SAX stream to your program through the interface.

Figure 10-1. ParserFactory

The workhorse of SAX is the SAX driver. A SAX driver is any program that implements the SAX2 XMLReader interface. It may include a parser that reads XML directly, or it may just be a wrapper for another parser to adapt it to the interface. It may even be a converter, transmuting data of one kind (say, SQL queries) into XML. From your program's point of view, the source doesn't matter, because it is all packaged in the same way.
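To make the "converter" idea concrete, here is a minimal sketch of a driver that reads no XML at all. It assumes plain lines of text as the input (a stand-in for something like query results) and reports each line as a row element. The class name LineDriver, the element names rows and row, and the helper countRows are all hypothetical. The sketch borrows XMLFilterImpl from org.xml.sax.helpers only as a convenient base class that already implements the XMLReader interface.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical driver: turns lines of plain text into a stream of SAX events.
class LineDriver extends XMLFilterImpl {

  // Assumes the InputSource carries a character stream.
  @Override
  public void parse(InputSource input) throws SAXException, IOException {
    ContentHandler out = getContentHandler();
    if (out == null) return;                    // nobody is listening
    Attributes none = new AttributesImpl();     // these elements carry no attributes
    BufferedReader in = new BufferedReader(input.getCharacterStream());
    out.startDocument();
    out.startElement("", "rows", "rows", none);
    String line;
    while ((line = in.readLine()) != null) {
      out.startElement("", "row", "row", none);
      out.characters(line.toCharArray(), 0, line.length());
      out.endElement("", "row", "row");
    }
    out.endElement("", "rows", "rows");
    out.endDocument();
  }

  // Feed text through the driver and count the row elements it reports.
  static int countRows(String text) {
    final int[] rows = {0};
    LineDriver driver = new LineDriver();
    driver.setContentHandler(new DefaultHandler() {
      @Override
      public void startElement(String uri, String name,
                               String qName, Attributes atts) {
        if ("row".equals(qName)) rows[0]++;
      }
    });
    try {
      driver.parse(new InputSource(new StringReader(text)));
    } catch (SAXException | IOException e) {
      throw new RuntimeException(e);
    }
    return rows[0];
  }

  public static void main(String[] args) {
    System.out.println(countRows("Sugar Froot Snaps\nDecoder ring\n"));
  }
}
```

The handler on the receiving end cannot tell that the events did not come from an XML parser, which is the point of the common interface.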

The SAX driver calls subroutines that you supply to handle various events. These call-backs fall into four categories, usually grouped into objects:

  • Content handler (called the document handler in SAX1)

  • Entity resolver

  • DTD handler

  • Error handler

To use a SAX driver, you need to create some or all of these handler classes and pass them to the driver so it can call their call-back routines. The content handler is the minimal requirement, providing methods that deal with element tags, attributes, processing instructions, and character data. The others override default behavior of the core API. To ensure that your handler classes are written correctly, the Java version of SAX includes interfaces, program constructs that describe methods to be implemented in a class.
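As a rough illustration of passing handlers to a driver, the sketch below registers one object in all four roles. It relies on the helper class DefaultHandler, which implements all four interfaces with do-nothing methods. The class and method names are hypothetical, and the driver is obtained through JAXP's SAXParserFactory, an assumption made only so that the sketch runs on a stock JDK without extra configuration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name; not one of the numbered examples.
class HandlerRegistration {

  // Parse the XML text with a single handler object filling all four roles,
  // then hand back the driver so the caller can inspect it.
  static XMLReader parseWithAllHandlers(String xml, DefaultHandler h) {
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      XMLReader xr = factory.newSAXParser().getXMLReader();
      xr.setContentHandler(h);    // elements, attributes, text, PIs
      xr.setEntityResolver(h);    // external entity lookup
      xr.setDTDHandler(h);        // notation and unparsed entity declarations
      xr.setErrorHandler(h);      // warnings and errors
      xr.parse(new InputSource(new StringReader(xml)));
      return xr;
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    DefaultHandler h = new DefaultHandler();
    XMLReader xr = parseWithAllHandlers("<doc/>", h);
    System.out.println(xr.getContentHandler() == h);
  }
}
```

In a real program you would subclass DefaultHandler and override only the call-backs you care about, or supply a separate object for each role.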

The characters method of the content handler may be called multiple times for the same text node, as SAX drivers are allowed to split text into smaller pieces. Your code will need to anticipate this and stitch text together if necessary.
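One way to do the stitching is to append every chunk to a buffer and only look at the result when the element ends. The following is a minimal sketch under some simplifying assumptions: it only keeps text for elements without child elements, it trims surrounding whitespace, and the names TextStitcher and collect are made up for illustration. The driver comes from JAXP's SAXParserFactory so the sketch runs on a stock JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that reassembles text split across characters() calls.
class TextStitcher extends DefaultHandler {

  private final StringBuilder buffer = new StringBuilder();
  private final List<String> texts = new ArrayList<>();

  @Override
  public void startElement(String uri, String name,
                           String qName, Attributes atts) {
    buffer.setLength(0);                 // start fresh for each element
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    buffer.append(ch, start, length);    // may be only part of the text
  }

  @Override
  public void endElement(String uri, String name, String qName) {
    String text = buffer.toString().trim();
    if (!text.isEmpty()) texts.add(qName + "=" + text);
    buffer.setLength(0);
  }

  // Parse the XML text and return "element=text" for each element with text.
  static List<String> collect(String xml) {
    TextStitcher h = new TextStitcher();
    try {
      XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      xr.setContentHandler(h);
      xr.parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
    return h.texts;
  }

  public static void main(String[] args) {
    System.out.println(collect("<box><name>Sugar &amp; Froot Snaps</name></box>"));
  }
}
```

Entity references such as &amp; are a typical place where a driver splits text, so the buffer is what keeps the name in one piece.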

The entity resolver overrides the default method for resolving external entity references. Ordinarily, it is assumed that you just want all external entity references resolved automatically, and the driver tries to comply, but in some cases, entity resolution has to be handled specially. For example, a resource located in a database would require that you write a routine to extract the data, since it is an application-specific process.
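Here is a minimal sketch of such a resolver. A Map stands in for the database, and the system identifier used as the lookup key is a made-up URL; both are assumptions for illustration, as are the names DatabaseEntityResolver and expand. Returning null from resolveEntity tells the driver to fall back on its default resolution.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical resolver that serves entity text from a lookup table.
class DatabaseEntityResolver implements EntityResolver {

  private final Map<String, String> table;   // stand-in for a real database

  DatabaseEntityResolver(Map<String, String> table) {
    this.table = table;
  }

  @Override
  public InputSource resolveEntity(String publicId, String systemId) {
    String content = table.get(systemId);
    if (content == null) return null;        // let the driver handle it
    InputSource src = new InputSource(new StringReader(content));
    src.setSystemId(systemId);
    return src;
  }

  // Parse the XML text and return all character data, entities expanded.
  static String expand(String xml, Map<String, String> table) {
    final StringBuilder text = new StringBuilder();
    try {
      XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      xr.setEntityResolver(new DatabaseEntityResolver(table));
      xr.setContentHandler(new DefaultHandler() {
        @Override
        public void characters(char[] ch, int start, int length) {
          text.append(ch, start, length);
        }
      });
      xr.parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
    return text.toString();
  }

  public static void main(String[] args) {
    String id = "http://example.com/entities/motto";   // hypothetical identifier
    String xml = "<!DOCTYPE doc [<!ENTITY motto SYSTEM \"" + id + "\">]>"
               + "<doc>&motto;</doc>";
    System.out.println(expand(xml,
        Collections.singletonMap(id, "Part of a complete breakfast")));
  }
}
```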

The core API doesn't create events for lexical structures like CDATA sections, comments, and DOCTYPE declarations. If your environment provides the lexical handler extension (LexicalHandler in the Java package org.xml.sax.ext), you can write a handler for them. If not, then you should just assume that CDATA sections are treated as regular character data, comments are stripped out, and DOCTYPE declarations are out of your reach.
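In Java, these lexical events are available through the LexicalHandler extension interface in org.xml.sax.ext, which a driver accepts as a property rather than through a set method. The sketch below is one way to use it, assuming the driver supports the standard lexical-handler property. It extends the helper class DefaultHandler2, and the names LexicalLogger and log are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical handler that records lexical events the core API skips.
class LexicalLogger extends DefaultHandler2 {

  private final List<String> events = new ArrayList<>();

  @Override
  public void startDTD(String name, String publicId, String systemId) {
    events.add("startDTD:" + name);      // the DOCTYPE declaration
  }

  @Override
  public void comment(char[] ch, int start, int length) {
    events.add("comment:" + new String(ch, start, length));
  }

  @Override
  public void startCDATA() {
    events.add("startCDATA");
  }

  @Override
  public void endCDATA() {
    events.add("endCDATA");
  }

  // Parse the XML text and return the lexical events in document order.
  static List<String> log(String xml) {
    LexicalLogger h = new LexicalLogger();
    try {
      XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      xr.setContentHandler(h);
      // Lexical handlers are registered as a property, not with a setter.
      xr.setProperty("http://xml.org/sax/properties/lexical-handler", h);
      xr.parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
    return h.events;
  }

  public static void main(String[] args) {
    System.out.println(log("<!DOCTYPE doc><doc><!--hi--><![CDATA[a<b]]></doc>"));
  }
}
```

The text inside the CDATA section still arrives through the ordinary characters call-back; startCDATA and endCDATA only mark its boundaries.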

The error handler gives the programmer a graceful way to deal with unexpected, nasty situations like a badly formed document or an entity that cannot be resolved. Unless you want an angry mob of users breaking down your door, you had better put in some good error checking code.
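A minimal sketch of an error handler follows. The ErrorHandler interface has three call-backs: warning, error, and fatalError. This version, with the hypothetical names CollectingErrorHandler and check, simply records each problem with its line number and stops parsing on a fatal error. What a real application does with the problems is up to you.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical handler that collects problems instead of printing them.
class CollectingErrorHandler implements ErrorHandler {

  final List<String> problems = new ArrayList<>();

  @Override
  public void warning(SAXParseException e) {
    problems.add("warning at line " + e.getLineNumber() + ": " + e.getMessage());
  }

  @Override
  public void error(SAXParseException e) {
    problems.add("error at line " + e.getLineNumber() + ": " + e.getMessage());
  }

  @Override
  public void fatalError(SAXParseException e) throws SAXException {
    problems.add("fatal at line " + e.getLineNumber() + ": " + e.getMessage());
    throw e;                              // give up on this document
  }

  // Parse the XML text and return whatever problems were reported.
  static List<String> check(String xml) {
    CollectingErrorHandler eh = new CollectingErrorHandler();
    try {
      XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      xr.setErrorHandler(eh);
      xr.parse(new InputSource(new StringReader(xml)));
    } catch (SAXException e) {
      // already recorded by fatalError(); nothing more to do here
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
    return eh.problems;
  }

  public static void main(String[] args) {
    for (String p : check("<doc><a></doc>")) System.out.println(p);
  }
}
```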

10.8.2 A Java Example: Element Counter

In this first example, we will use SAX to create a Java program that counts elements in a document. We start by creating a class that manages the parsing process, shown in Example 10-2.

Example 10-2. Contents of SAXCounter.java
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;
import org.xml.sax.InputSource;
import java.io.FileReader;

public class SAXCounter {

  public SAXCounter () {
  }

  public static void main (String args[]) throws Exception {
    XMLReader xr = XMLReaderFactory.createXMLReader();  // request a SAX driver
    SAXCounterHandler h = new SAXCounterHandler();     // create a handler
    xr.setContentHandler(h);                // register it with the driver
    FileReader r = new FileReader(args[0]);
    xr.parse(new InputSource(r));
  }
}

This class sets up the SAX environment and requests a SAX driver from the parser factory XMLReaderFactory. Then it creates a handler and registers it with the driver via the setContentHandler( ) method. Finally, it reads a file (supplied on the command line) and parses it. Because I am trying to keep this example short, I will not register an error handler, although ordinarily this would be a mistake.

The next step is to write the handler class, shown in Example 10-3.

Example 10-3. Contents of SAXCounterHandler.java
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.Attributes;

public class SAXCounterHandler extends DefaultHandler {

  private int elements;

  public SAXCounterHandler () {
  }

  // handle a start-of-document event
  public void startDocument () {
    System.out.println("Starting to parse...");
    elements = 0;
  }

  // handle an end-of-document event
  public void endDocument () {
    System.out.println("All done!");
    System.out.println("There were " + elements + " elements.");
  }

  // handle a start-of-element event
  public void startElement (String uri, String name,
                            String qName, Attributes atts) {
    System.out.println("starting element (" + qName + ")");
    if (!"".equals(uri))        // print the namespace only if there is one
      System.out.println("  namespace: " + uri);
    System.out.println("  number of attributes: " + atts.getLength());
  }

  // handle an end-of-element event
  public void endElement (String uri, String name, String qName) {
    elements ++;
    System.out.println("ending element (" + qName + ")");
  }

  // handle a characters event (may fire more than once per text node)
  public void characters (char ch[], int start, int length) {
    System.out.println("CDATA: " + length + " characters.");
  }
}

This class implements five types of events:

Start of document

Initialize the elements counter and print a message.

End of document

Print the number of elements counted.

Start of element

Output the qualified name of the element, the namespace URI, and the number of attributes.

End of element

Increment the element counter and print a message.

Characters

Print the number of characters in the chunk of text received.

Any other events are handled by the superclass DefaultHandler.

We run the program on the data in Example 10-4.

Example 10-4. Contents of test.xml
<?xml version="1.0"?>
<bcb:breakfast-cereal-box xmlns:bcb="http://www.grubblythings.com/">
  <bcb:name>Sugar Froot Snaps</bcb:name>
  <bcb:graphic file="bcbcover.tif"/>
  <bcb:prize>Decoder ring</bcb:prize>
</bcb:breakfast-cereal-box>

The full command is:

java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser SAXCounter test.xml

The -D option sets the property org.xml.sax.driver to the Xerces parser. This is necessary because my Java environment does not have a default SAX driver. Here is the output:

Starting to parse...
starting element (bcb:breakfast-cereal-box)
  namespace: http://www.grubblythings.com/
  number of attributes: 0
CDATA: 3 characters.
starting element (bcb:name)
  namespace: http://www.grubblythings.com/
  number of attributes: 0
CDATA: 17 characters.
ending element (bcb:name)
CDATA: 3 characters.
starting element (bcb:graphic)
  namespace: http://www.grubblythings.com/
  number of attributes: 1
ending element (bcb:graphic)
CDATA: 3 characters.
starting element (bcb:prize)
  namespace: http://www.grubblythings.com/
  number of attributes: 0
CDATA: 12 characters.
ending element (bcb:prize)
CDATA: 1 characters.
ending element (bcb:breakfast-cereal-box)
All done!
There were 4 elements.

There you have it. Living up to its name, SAX is uncomplicated and wonderfully easy to use. It does not try to do too much, instead offloading the work onto your handler program. It works best when the processing of a document follows the order of its elements and a single pass through it is sufficient. One common task for event processors is to assemble tree structures, which brings us to the next topic, the tree processing API known as DOM.
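To give a feel for how an event processor can assemble a tree, here is a minimal sketch that keeps a stack of open elements and attaches each new element to the one on top. The Node class and the names TreeBuilder and build are hypothetical, and the result is far simpler than a real DOM: just element names, child lists, and text.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that turns a SAX event stream into a simple tree.
class TreeBuilder extends DefaultHandler {

  static class Node {
    final String name;
    final List<Node> children = new ArrayList<>();
    final StringBuilder text = new StringBuilder();
    Node(String name) { this.name = name; }
  }

  private final Deque<Node> open = new ArrayDeque<>();   // elements not yet closed
  private Node root;

  @Override
  public void startElement(String uri, String name,
                           String qName, Attributes atts) {
    Node n = new Node(qName);
    if (open.isEmpty()) root = n;            // first element is the root
    else open.peek().children.add(n);        // otherwise attach to the parent
    open.push(n);
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    open.peek().text.append(ch, start, length);
  }

  @Override
  public void endElement(String uri, String name, String qName) {
    open.pop();
  }

  // Parse the XML text and return the root of the assembled tree.
  static Node build(String xml) {
    TreeBuilder h = new TreeBuilder();
    try {
      XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      xr.setContentHandler(h);
      xr.parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
    return h.root;
  }

  public static void main(String[] args) {
    Node box = build("<box><name>Sugar Froot Snaps</name><graphic/></box>");
    System.out.println(box.name + " has " + box.children.size() + " children");
  }
}
```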