Hack 30 Look at XML Documents Through the Lens of the XML Information Set


The XML Information Set or Infoset (http://www.w3.org/TR/xml-infoset) is a recommendation from the W3C that describes an abstract data set whose definitions can be used to describe well-formed XML documents (documents don't have to be valid). These definitions are set forth so that other W3C specs can use the same terminology and not trip over each other's shoelaces.

An infoset normally describes the result of parsing an XML document, but it can also be constructed by other means, such as through an API like the Document Object Model (DOM) (http://www.w3.org/TR/xml-infoset/#intro.synthetic). In everyday use, though, you rarely hear folks talk about the structures in XML documents using the terms defined in this spec.
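As a rough illustration of a synthetic infoset, the following Python sketch builds a small tree in memory with the standard library's xml.dom.minidom, without parsing any XML text, and then reads back a few element properties. Python and minidom are stand-ins chosen for this sketch, not tools this hack relies on. The namespace name and prefix are borrowed from the sample output later in this hack, and build_time_document is a hypothetical helper name.

```python
from xml.dom.minidom import getDOMImplementation

# Namespace name borrowed from the sample output in this hack.
TIME_NS = "http://www.wyeast.net/time"


def build_time_document():
    """Build a tiny DOM tree in memory; no XML text is parsed (hypothetical helper)."""
    impl = getDOMImplementation()
    # createDocument(namespaceURI, qualifiedName, doctype) also creates the root element.
    doc = impl.createDocument(TIME_NS, "tz:time", None)
    hour = doc.createElementNS(TIME_NS, "tz:hour")
    hour.appendChild(doc.createTextNode("11"))
    doc.documentElement.appendChild(hour)
    return doc


doc = build_time_document()
root = doc.documentElement

# DOM attribute names on the right, approximate infoset property names on the left.
print("[namespace name]:", root.namespaceURI)
print("[local name]:", root.localName)
print("[prefix]:", root.prefix)
print("[children]:", [child.nodeName for child in root.childNodes])
```

The DOM attributes used here (namespaceURI, localName, prefix, childNodes) only approximate the corresponding infoset properties; the spec remains the authority on what each property means.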

The infoset defines 11 types of information items, each with its own set of properties. The following list briefly outlines these information items and their associated properties:


Document information item

Properties: all declarations processed, base URI, character encoding scheme, children, document element, notations, standalone, unparsed entities, version


Element information item

Properties: attributes, base URI, children, in-scope namespaces, local name, namespace attributes, namespace name, parent, prefix


Attribute information item

Properties: attribute type, local name, namespace name, normalized value, owner element, prefix, references, specified


Processing instruction information item

Properties: base URI, content, notation, parent, target


Unexpanded entity reference information item

Properties: declaration base URI, name, parent, public identifier, system identifier


Character information item

Properties: character code, element content whitespace, parent


Comment information item

Properties: content, parent


Document type declaration information item

Properties: children, parent, public identifier, system identifier


Unparsed entity information item

Properties: declaration base URI, name, notation, notation name, public identifier, system identifier


Notation information item

Properties: declaration base URI, name, public identifier, system identifier


Namespace information item

Properties: namespace name, prefix
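Several of these items and properties have rough counterparts in a DOM tree. The Python sketch below, which assumes the standard library's xml.dom.minidom as a stand-in parser, walks a parsed document and labels document, comment, and element nodes with infoset-style property names. SAMPLE is a stand-in reconstructed from the report shown later in this hack, not the actual prefix.xml (its attribute values are assumptions), and describe and walk are hypothetical helper names.

```python
from xml.dom import Node, XMLNS_NAMESPACE
from xml.dom.minidom import parseString

# Stand-in document reconstructed from the report in this hack;
# the attribute values ("PST", "true") are assumptions.
SAMPLE = """<!-- a time instant -->
<tz:time timezone="PST" xmlns:tz="http://www.wyeast.net/time">
 <tz:hour>11</tz:hour>
 <tz:minute>59</tz:minute>
 <tz:second>59</tz:second>
 <tz:meridiem>p.m.</tz:meridiem>
 <tz:atomic signal="true"/>
</tz:time>
"""


def describe(node):
    """Return (item kind, properties) for a DOM node, or None if not covered here."""
    if node.nodeType == Node.DOCUMENT_NODE:
        return "Document", {"document element": node.documentElement.nodeName}
    if node.nodeType == Node.COMMENT_NODE:
        return "Comment", {"content": node.data}
    if node.nodeType == Node.ELEMENT_NODE:
        attrs = list(node.attributes.values())
        return "Element", {
            "namespace name": node.namespaceURI,
            "local name": node.localName,
            "prefix": node.prefix,
            # The infoset keeps xmlns declarations apart from ordinary attributes.
            "attributes": [a.name for a in attrs
                           if a.namespaceURI != XMLNS_NAMESPACE],
            "namespace attributes": [a.name for a in attrs
                                     if a.namespaceURI == XMLNS_NAMESPACE],
        }
    return None  # text and other node types are skipped in this sketch


def walk(node):
    """Yield descriptions of the covered nodes in document order."""
    described = describe(node)
    if described is not None:
        yield described
    for child in node.childNodes:
        yield from walk(child)


for kind, props in walk(parseString(SAMPLE)):
    print(f"{kind} information item")
    for name, value in props.items():
        print(f"  [{name}]: {value}")
    print()
```

This covers only three of the 11 item types and a handful of properties; it is meant to show how the vocabulary in the list maps onto a familiar tree, not to be a complete infoset report.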

If you need help understanding the meanings behind the individual information items and properties, consult the spec; there isn't enough space in this little hack to explain them all.

To help you see what the infoset describes, the file archive includes infoset.xsl, an XSLT 2.0 stylesheet that reports on the information items in a document. I used XSLT 2.0 because it has more facilities for this kind of work than XSLT 1.0. Even so, infoset.xsl is only a partial implementation of infoset reporting.

To use the stylesheet, you need an XSLT 2.0 processor, such as Saxon 8.0 or later (http://saxon.sourceforge.net). Saxon 8.0 isn't a complete XSLT 2.0/XPath 2.0 implementation, but it's getting closer. Download and unzip Saxon, and place saxon8.jar in the working directory where you installed the archive of files that came with the book. You'll need Java Version 1.4 or later, too.

You can apply this stylesheet to any XML document, as demonstrated here:

java -jar saxon8.jar prefix.xml infoset.xsl

Your results will be as follows:

Comment information item (1)

[content]:  a time instant

[parent]: /

   

Document information item

[document element]: time

[base URI]: file:/C:/Hacks/examples/115959p.m.

   

Element information item (document element)

[namespace]: http://www.wyeast.net/time

[local name]: time

[prefix]: tz

[children]:

[attributes]: timezone

[base URI]: file:/C:/Hacks/examples/115959p.m.

   

Element information item (1)

[namespace]: http://www.wyeast.net/time

[local name]: hour

[prefix]: tz

[children]: 11

[attributes]:

[parent]: tz:time

[base URI]: file:/C:/Hacks/examples/11

   

Element information item (2)

[namespace]: http://www.wyeast.net/time

[local name]: minute

[prefix]: tz

[children]: 59

[attributes]:

[parent]: tz:time

[base URI]: file:/C:/Hacks/examples/59

   

Element information item (3)

[namespace]: http://www.wyeast.net/time

[local name]: second

[prefix]: tz

[children]: 59

[attributes]:

[parent]: tz:time

[base URI]: file:/C:/Hacks/examples/59

   

Element information item (4)

[namespace]: http://www.wyeast.net/time

[local name]: meridiem

[prefix]: tz

[children]: p.m.

[attributes]:

[parent]: tz:time

[base URI]: file:/C:/Hacks/examples/p.m.

   

Element information item (5)

[namespace]: http://www.wyeast.net/time

[local name]: atomic

[prefix]: tz

[children]:

[attributes]: signal

[parent]: tz:time

[base URI]: file:/C:/Hacks/examples/