Tim Bray, lead editor of XML 1.0, calls pull parsing "the way to go in the future." Like event-based parsing, it's fast, memory efficient, streamable, and read-only. The difference is in how the application and parser interact. SAX implements what we call push parsing. The parser pushes events at the program, requiring it to react. The parser doesn't store any state information, the contextual clues that help in deciding how to handle a given piece of markup, so the application has to store this information itself.
Pull parsing is just the opposite. The program takes control and tells the parser when to fetch the next item. Instead of reacting to events, it proactively seeks out events. This allows the developer more freedom in designing data handlers, and greater ability to catch invalid markup. Consider the following example XML:
<catalog>
  <product id="ronco-728">
    <name>Widget</name>
    <price>19.99</price>
  </product>
  <product id="acme-229">
    <name>Gizmo</name>
    <price>28.98</price>
  </product>
</catalog>
It is easy to write a SAX program to read this XML and build a data structure. The following code assembles an array of products composed of instances of this class:
class Product {
    String name;
    String price;
}
Here is the code to do it:
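import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class CatalogHandler extends DefaultHandler {
    StringBuffer cdata = new StringBuffer();
    Product[] catalog = new Product[10];
    int index;    // slot of the product currently being read

    public void startDocument() {
        index = 0;
    }

    public void startElement(String uri, String local, String raw,
                             Attributes attrs) throws SAXException {
        cdata.setLength(0);    // discard text collected before this tag
    }

    public void characters(char ch[], int start, int length)
            throws SAXException {
        cdata.append(ch, start, length);
    }

    // Note: local names are only reported by a namespace-aware parser.
    public void endElement(String uri, String local, String raw)
            throws SAXException {
        if ("product".equals(local)) {
            index++;
        } else if ("name".equals(local)) {
            current().name = cdata.toString();
        } else if ("price".equals(local)) {
            current().price = cdata.toString();
        } else if (!"catalog".equals(local)) {
            throw new SAXException("Unexpected element: " + local);
        }
    }

    // Create the Product for the current slot the first time it is needed.
    private Product current() {
        if (catalog[index] == null) {
            catalog[index] = new Product();
        }
        return catalog[index];
    }
}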
The program maintains a little bit of state information in the form of an index variable. As this counter increments, it stores data from the next product in the next slot. Thus it builds a growing list of products in its catalog array.
At first glance, this program seems to be adequate. It will handle a data file that is valid, but if you throw some bad markup at it, it will do strange things. Imagine what would happen if you gave it this data file:
<catalog>
  <product id="grigsby-123">
    <name>Woofinator</name>
  </product>
  <price>8.77</price>
</catalog>
Oops. The price element is not inside the product as it should be. The program we wrote will not catch the mistake. Instead, it will save the product data for the Woofinator, without a price, then increment the index. When the parser finally reaches the price, it will be too late to insert it into that product's slot. Clearly, this ought to be a validation error, but our program is not smart enough to catch it.
To protect against problems like this, we could add a test for a missing price element, or an extra one outside the product element. But then we would have to insert tests everywhere and the code would get ugly quickly. A better solution is provided by pull parsing.
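To make the cost of those extra tests concrete, here is a minimal sketch of the bookkeeping a SAX handler would need for just two rules: name and price may appear only inside a product, and every product must have both. The class name GuardedHandler, its check method, and the error messages are hypothetical, chosen for illustration. Storing the data is left out so that only the checks remain, and the handler compares qualified names because the default JAXP parser is not namespace-aware.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: validates structure only, stores nothing.
class GuardedHandler extends DefaultHandler {
    private boolean inProduct;   // between <product> and </product>?
    private boolean sawName;     // has the current product had a <name>?
    private boolean sawPrice;    // has the current product had a <price>?

    @Override
    public void startElement(String uri, String local, String raw,
                             Attributes attrs) throws SAXException {
        if ("product".equals(raw)) {
            inProduct = true;
            sawName = false;
            sawPrice = false;
        } else if (("name".equals(raw) || "price".equals(raw)) && !inProduct) {
            throw new SAXException("<" + raw + "> found outside <product>");
        }
    }

    @Override
    public void endElement(String uri, String local, String raw)
            throws SAXException {
        if ("name".equals(raw)) {
            sawName = true;
        } else if ("price".equals(raw)) {
            sawPrice = true;
        } else if ("product".equals(raw)) {
            if (!sawName || !sawPrice) {
                throw new SAXException("<product> is missing a name or price");
            }
            inProduct = false;
        }
    }

    // Parse a document held in a string; throws SAXException if a rule is broken.
    static void check(String xml) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new GuardedHandler());
    }
}
```

Three flags cover two rules. Each additional rule (no duplicate prices, no products outside the catalog, and so on) would need more state of the same kind, spread across both callbacks.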
The following program uses the XMLPULL API (see http://www.xmlpull.org/) in a recursive descent style of processing:
import java.io.FileReader;
import java.io.IOException;
import java.util.Vector;
import org.kxml2.io.KXmlParser;
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserException;

public class test {

    public static void main(String[] args)
            throws IOException, XmlPullParserException {
        Vector products = new Vector();
        try {
            XmlPullParser parser = new KXmlParser();
            parser.setInput(new FileReader(args[0]));

            // The document must open with <catalog>.
            parser.nextTag();
            parser.require(XmlPullParser.START_TAG, null, "catalog");

            // Every child of <catalog> must be a <product>.
            while (parser.nextTag() != XmlPullParser.END_TAG) {
                Product newProduct = readProduct(parser);
                products.add(newProduct);
            }
            parser.require(XmlPullParser.END_TAG, null, "catalog");

            // Nothing may follow </catalog>.
            parser.next();
            parser.require(XmlPullParser.END_DOCUMENT, null, null);
        } catch (Exception e) {
            e.printStackTrace();
        }

        System.out.println("Products:");
        int count = products.size();
        for (int i = 0; i < count; i++) {
            Product report = (Product) products.get(i);
            System.out.println("Name: " + report.name);
            System.out.println("Price: " + report.price);
        }
    }

    // Reads one <product> element; the parser must be on its start tag.
    static public Product readProduct(XmlPullParser parser)
            throws IOException, XmlPullParserException {
        parser.require(XmlPullParser.START_TAG, null, "product");
        String productName = null;
        String price = null;

        while (parser.nextTag() != XmlPullParser.END_TAG) {
            parser.require(XmlPullParser.START_TAG, null, null);
            String name = parser.getName();
            String text = parser.nextText();   // leaves the parser on the end tag
            if (name.equals("name"))
                productName = text;
            else if (name.equals("price"))
                price = text;
            parser.require(XmlPullParser.END_TAG, null, name);
        }
        parser.require(XmlPullParser.END_TAG, null, "product");

        Product newProduct = new Product();
        newProduct.name = productName;
        newProduct.price = price;
        return newProduct;
    }
}
Pull parsing is quickly becoming a favorite of developers. Current implementations include Microsoft's .NET XML libraries, the Streaming API for XML (StAX), XMLPULL, and NekoPull. Sun is standardizing a pull API for Java through JSR-173.
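For comparison, the same recursive descent reader can be sketched with StAX, which has shipped with Java SE as the javax.xml.stream package since Java 6. This is an illustrative sketch rather than a port endorsed by either API: the class name StaxCatalog and its read method are hypothetical, the document is passed in as a string for brevity, and, unlike the XMLPULL listing, unknown children of a product are rejected outright.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical StAX counterpart of the XMLPULL catalog reader.
class StaxCatalog {
    static class Product {
        String name;
        String price;
    }

    // Reads a whole <catalog> document; throws if the structure is wrong.
    static List<Product> read(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<Product> products = new ArrayList<>();

        reader.nextTag();
        reader.require(XMLStreamConstants.START_ELEMENT, null, "catalog");
        // Every child of <catalog> must be a <product>.
        while (reader.nextTag() != XMLStreamConstants.END_ELEMENT) {
            products.add(readProduct(reader));
        }
        reader.require(XMLStreamConstants.END_ELEMENT, null, "catalog");
        return products;
    }

    // Reads one <product>; the reader must be positioned on its start tag.
    static Product readProduct(XMLStreamReader reader) throws XMLStreamException {
        reader.require(XMLStreamConstants.START_ELEMENT, null, "product");
        Product product = new Product();
        while (reader.nextTag() != XMLStreamConstants.END_ELEMENT) {
            String name = reader.getLocalName();
            String text = reader.getElementText();  // leaves reader on the end tag
            if ("name".equals(name)) {
                product.name = text;
            } else if ("price".equals(name)) {
                product.price = text;
            } else {
                throw new XMLStreamException("Unexpected element: " + name,
                        reader.getLocation());
            }
        }
        reader.require(XMLStreamConstants.END_ELEMENT, null, "product");
        return product;
    }
}
```

Given the misplaced price from the Woofinator file, the loop in read hands the stray price element to readProduct, whose first require call fails with an XMLStreamException. The structure of the code mirrors the structure of the document, so the check comes from the control flow rather than from extra flags.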