Hack 97 Processing XML with SAX

SAX is the de facto standard XML parser interface for Java. You learn how to use it here with a simple SAX application written in Java.

The Simple API for XML (SAX) is a streaming, event-based API for XML (http://www.saxproject.org). It was (and continues to be) developed by members of the xml-dev mailing list (http://www.xml.org/xml/xmldev.shtml). Discussion of a uniform API for XML parsers began on xml-dev in December 1997 (http://lists.xml.org/archives/xml-dev/199712/msg00170.html) and resulted in the release of the first version of SAX in May 1998 (http://lists.xml.org/archives/xml-dev/199805/msg00226.html), with David Megginson as the chief maintainer (http://www.megginson.com/). SAX 2.0.1, which has namespace support, was released in January 2002, with David Brownell as the lead maintainer (http://lists.xml.org/archives/xml-dev/200201/msg01943.html).

SAX provides an interface to SAX parsers. As an event-based API, it munches on documents and reports parsing events along the way, usually in one fell swoop. These reports come directly to the application through callbacks. This is called push parsing. To receive these events, an application must implement event handlers (methods) from the SAX interfaces, such as startDocument() or startElement(), and register them with the parser. Without implementing and registering these handlers, a SAX application won't "see" the events that are pushed up from its underlying parser.
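To make the push model concrete, here is a minimal sketch that records every callback in the order the parser delivers it. The class name PushDemo and its trace() helper are hypothetical, and the sketch obtains its parser through JAXP's SAXParserFactory rather than the XMLReaderFactory used later in this hack.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records the events the parser pushes at it.
class PushDemo extends DefaultHandler {

    private final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() { events.add("startDocument"); }

    @Override
    public void startElement(String uri, String name, String qName, Attributes atts) {
        events.add("start:" + qName);
    }

    @Override
    public void endElement(String uri, String name, String qName) {
        events.add("end:" + qName);
    }

    @Override
    public void endDocument() { events.add("endDocument"); }

    // Parses an XML string and returns the callbacks in arrival order.
    static List<String> trace(String xml) {
        try {
            PushDemo handler = new PushDemo();
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);   // register the handler, or nothing is seen
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.events;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(trace("<time><hour>11</hour></time>"));
    }
}
```

The application never asks for the next event; it only supplies methods and waits for the parser to call them.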

Pull parsing, on the other hand, allows you to pull events on demand. Examples of pull parsers include the C# XmlReader [Hack #98], Paul Prescod's Python pull parser (http://www.prescod.net/python/pulldom.html), Aleksander Slominski's XML pull parser (http://www.extreme.indiana.edu/xgws/xsoap/xpp/), and the Streaming API for XML (StAX), which is a pull parser API just now emerging from the Sun Java Specification Request, JSR 173 (http://www.jcp.org/en/jsr/detail?id=173).
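For contrast, here is a rough sketch of pull parsing with StAX, assuming a Java runtime that includes the javax.xml.stream package (Java SE 6 and later). The class name PullDemo and the firstText() helper are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: pulls events until it finds the element it wants.
class PullDemo {

    // Returns the text of the first element named `wanted`, or null if absent.
    // Assumes the wanted element contains only text, no child elements.
    static String firstText(String xml, String wanted) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // The application asks for the next event; nothing is pushed at it.
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT
                            && reader.getLocalName().equals(wanted)) {
                        return reader.getElementText();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<time><hour>11</hour><minute>59</minute></time>";
        System.out.println(firstText(xml, "minute"));
    }
}
```

Because the loop is in the application's hands, it can stop reading as soon as it has what it needs.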


SAX was originally written in Java and continues to be maintained in Java, but it is also available in other languages, such as C++, Pascal, and Perl (http://www.saxproject.org/?selected=langs). This hack demonstrates a simple Java program that uses SAX.

7.8.1 A Little Help from SAX

First, have a look at the document blob.xml:

<time timezone="PST"><hour>11</hour><minute>59</minute><second>59</second><meridiem>p.m.</meridiem></time>

Not much to look at, is it? It's just a blob of elements with only one attribute, no pretty whitespace between elements, no XML declaration, and no comments. Having elements crammed together is not a big problem from a processing standpoint, except that it gives me a headache when I'm looking at it.

When I was first learning Java a few years ago, I searched high and low for simple SAX programs, ones that were reduced down to something I could grasp. I didn't have much luck finding such programs, so I decided to write a few of my own. Example 7-21 is a short SAX program, Poco.java. This program does some readily discernible things, just right for someone getting up to speed with SAX. It will also help us do something interesting with blob.xml.

Example 7-21. Poco.java
import org.xml.sax.XMLReader;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class Poco extends DefaultHandler {

    private int depth = -1;
    private static String parser = "org.apache.crimson.parser.XMLReaderImpl";

    public static void main (String[] args) throws Exception {

        XMLReader reader = XMLReaderFactory.createXMLReader(parser);
        reader.setContentHandler(new Poco());
        reader.parse(args[0]);

    }

    public void startDocument() {
        System.out.println("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n");
        System.out.println("<!-- processed with Poco -->");
    }

    public void startElement (String uri, String name,
            String qName, Attributes atts) {
        depth++;
        if (depth > 0)
            System.out.print(" ");
        System.out.print("<" + qName + ">");
        if (depth == 0)
            System.out.println();
    }

    public void endElement (String uri, String name, String qName) {
        System.out.print("</" + qName + ">");
        if (depth == 1)
            System.out.println();
        depth--;
    }

    public void characters (char ch[], int start, int length)
    {
        for (int i = start; i < start + length; i++) {
            System.out.print(ch[i]);
        }
    }

}

Compiling this program as shown here requires Java version 1.4 or later:

javac Poco.java

Because 1.4 has JAXP built in, you don't have to place a SAX JAR file (such as sax2r2.jar, available from http://sourceforge.net/projects/sax/) in the classpath. When the program is compiled, you can run it like this:

java Poco blob.xml

or like this in Windows:

java Poco file:///C:/Hacks/examples/blob.xml

The results of processing blob.xml with Poco.class are shown in Example 7-22.

Example 7-22. Results of processing blob.xml with Poco.class
<?xml version="1.0" encoding="ISO-8859-1"?>

<!-- processed with Poco -->
<time>
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
</time>

An XML declaration and comment are added to the top of the resulting document. All the elements from the source file are copied, properly indented, and sent to standard output, including their character data content. The attribute on the time element, however, is not processed and so is excluded from the output. Now let's talk about how all this happened.
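If you wanted to keep the attributes, startElement() could walk the Attributes object it receives. The following is only a sketch of that idea, not part of Poco.java: the class AttrDemo and its startTags() helper are hypothetical, and attribute values are written without escaping.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical variant of Poco's startElement() that also copies attributes.
class AttrDemo extends DefaultHandler {

    private final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String name, String qName, Attributes atts) {
        out.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            // Values are copied as-is; real code would escape &, <, and ".
            out.append(' ').append(atts.getQName(i))
               .append("=\"").append(atts.getValue(i)).append('"');
        }
        out.append('>');
    }

    // Returns only the start tags of the document, attributes included.
    static String startTags(String xml) {
        try {
            AttrDemo handler = new AttrDemo();
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.out.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(startTags("<time timezone=\"PST\"><hour>11</hour></time>"));
    }
}
```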

On line 1 of Example 7-21, the program imports the XMLReader interface from the package org.xml.sax, then on line 4 imports the class XMLReaderFactory from org.xml.sax.helpers. Line 13 creates an XML reader for SAX using the factory. Creating the reader is number one on your list of things to do when writing a SAX program.

The createXMLReader() method takes as an argument a string that represents a Java class name. This class name is the entry point for the underlying XML parser. In Java 1.4, JAXP's default XML parser is Crimson, identified with the class name org.apache.crimson.parser.XMLReaderImpl. If you call createXMLReader() with no argument, you can pass in a class name for the parser using the -D command-line option and the system property org.xml.sax.driver. For example, you could use the -D option on the command line, like this:

java -Dorg.xml.sax.driver=org.apache.crimson.parser.XMLReaderImpl Poco blob.xml

and get the same results as placing the class name in the program itself. You might choose the Xerces parser instead of Crimson. In this case, use this command line:

java -cp .;xercesImpl.jar -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser Poco blob.xml

This command assumes that xercesImpl.jar is in the current directory (download it from http://xml.apache.org/xerces2-j/download.cgi).
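On a Java runtime that does not bundle Crimson, a hardcoded Crimson class name will fail to load. One alternative, sketched here under the assumption that JAXP's javax.xml.parsers package is available, is to let SAXParserFactory pick whatever parser the platform provides. The class ReaderDemo and its newReader() method are hypothetical names.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper: obtains an XMLReader without naming a parser class.
class ReaderDemo {

    static XMLReader newReader() {
        try {
            // The factory locates the platform's default SAX parser.
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("no usable SAX parser found", e);
        }
    }

    public static void main(String[] args) {
        // Prints the class name of whichever parser was chosen.
        System.out.println(newReader().getClass().getName());
    }
}
```

The reader returned this way accepts setContentHandler() and parse() calls just like the one created in Poco.java.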

The DefaultHandler class (line 3) is also from the helpers package. It implements four SAX interfaces (ContentHandler, DTDHandler, EntityResolver, and ErrorHandler) and is the default base class for SAX2 event handlers. ContentHandler itself contains only the signatures for event handlers such as startDocument() and startElement(); DefaultHandler provides no-op implementations of these and other methods. The Poco class extends the DefaultHandler class (line 6), and the call to setContentHandler() on line 14 registers an instance of Poco as the content event handler. Without this content handler, all reported events are quietly ignored.

The rubber hits the road when the parse() method is called on line 15. The argument (args[0]) is a string from the command line that names the file to parse; SAX treats it as a system identifier (a URI). parse() also accepts an argument of type InputSource (http://www.saxproject.org/apidoc/org/xml/sax/InputSource.html), which can wrap a system identifier, a byte stream, or a character stream.
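As a rough illustration of the InputSource route, the following sketch feeds the parser first from a character stream and then from a byte stream, collecting whatever character data it reports. The class SourceDemo and its helpers are hypothetical, and the parser comes from JAXP's SAXParserFactory.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: gathers character data from any InputSource.
class SourceDemo extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    private static String collect(InputSource source) {
        try {
            SourceDemo handler = new SourceDemo();
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(source);
            return handler.text.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    // Character stream: the text is already decoded.
    static String fromChars(String xml) {
        return collect(new InputSource(new StringReader(xml)));
    }

    // Byte stream: the parser works out the encoding itself (UTF-8 here).
    static String fromBytes(String xml) {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        return collect(new InputSource(new ByteArrayInputStream(bytes)));
    }

    public static void main(String[] args) {
        String xml = "<time><hour>11</hour><minute>59</minute></time>";
        System.out.println(fromChars(xml));
        System.out.println(fromBytes(xml));
    }
}
```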

Poco.java provides working implementations for four methods: startDocument() (line 19), startElement() (line 24), endElement() (line 34), and characters() (line 41). SAX uses callbacks, which are registered to handle certain events when encountered; hence we call these methods event handlers. If we don't implement them in our program, the no-op versions inherited from DefaultHandler still get called at runtime, but nothing apparent happens! Only by implementing the event handlers do we get any action.

startDocument() writes an XML declaration (line 20) and a comment (line 21) to standard output. startElement() writes a start tag, and endElement() writes an end tag. The only reason the Attributes interface is imported (line 2) is to satisfy the required method signature for startElement(), whose fourth argument is of type Attributes.

Both startElement() and endElement() use the depth variable (line 8) to determine element depth and to add whitespace appropriately, but this is not a general solution because it only works for a depth of 0 or 1! For a solid technique on handling element depth and whitespace, see David Megginson's DataWriter.java, available at http://megginson.com/Software/xml-writer-0.2.zip. characters() (line 44) simply prints any characters it encounters.
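To give a feel for what a more general approach involves, here is a small sketch that indents to any depth. It is not Megginson's DataWriter, just a simplified stand-in with a hypothetical class name (IndentDemo), and it assumes data-oriented documents in which an element holds either text or child elements, never both.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: one space of indentation per level of depth.
class IndentDemo extends DefaultHandler {

    private final StringBuilder out = new StringBuilder();
    private int depth = 0;
    private boolean lastWasStart = false;   // true while no child element has closed

    private void indent() {
        for (int i = 0; i < depth; i++) out.append(' ');
    }

    @Override
    public void startElement(String uri, String name, String qName, Attributes atts) {
        if (out.length() > 0) out.append('\n');
        indent();
        out.append('<').append(qName).append('>');
        depth++;
        lastWasStart = true;
    }

    @Override
    public void endElement(String uri, String name, String qName) {
        depth--;
        if (!lastWasStart) {        // element had children: end tag gets its own line
            out.append('\n');
            indent();
        }
        out.append("</").append(qName).append('>');
        lastWasStart = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String chunk = new String(ch, start, length);
        if (!chunk.trim().isEmpty()) out.append(chunk);   // skip whitespace-only chunks
    }

    static String pretty(String xml) {
        try {
            IndentDemo handler = new IndentDemo();
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.out.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(pretty("<a><b><c>1</c></b><d>2</d></a>"));
    }
}
```

The key difference from Poco.java is that the indentation is computed from the current depth instead of being tested against the fixed values 0 and 1.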

The program is admittedly weak in its exception handling. It only does the minimum by throwing Exception from main(). SAXException and SAXParseException are both imported by DefaultHandler, which Poco extends. A more responsible program (and therefore a more complex one) would use try/catch blocks to handle the exceptions intelligently. I have chosen to keep this program simple so it is easier to understand.
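As one possible shape for that more responsible program, the sketch below catches SAXParseException separately so it can report where the parser gave up. The class SafeParse and its check() method are hypothetical, and the parser is again obtained through JAXP's SAXParserFactory.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: reports parse problems instead of just dying.
class SafeParse {

    // Returns "ok" or a one-line description of what went wrong.
    static String check(String xml) {
        try {
            DefaultHandler handler = new DefaultHandler();
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            // DefaultHandler.fatalError() rethrows, so the error reaches the catch below.
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return "ok";
        } catch (SAXParseException e) {
            // Must come before SAXException, which is its superclass.
            return "parse error at line " + e.getLineNumber()
                + ", column " + e.getColumnNumber() + ": " + e.getMessage();
        } catch (SAXException | ParserConfigurationException e) {
            return "SAX error: " + e.getMessage();
        } catch (IOException e) {
            return "I/O error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(check("<time><hour>11</hour></time>"));
        System.out.println(check("<time><hour>11</time>"));
    }
}
```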


Poco.java is only the beginning of what you can do, but it should give you a fairly good understanding of the basics of SAX programming in Java.

7.8.2 See Also

  • Karl Waclawek's SAX for .NET: http://sf.net/projects/saxdotnet

  • SAX API reference: http://www.saxproject.org/apidoc/overview-summary.html

  • SAX2, by David Brownell (O'Reilly)

  • The Book of SAX, by W. Scott Means and Michael A. Bodie (No Starch Press)