DOM is a recommendation by the World Wide Web Consortium (W3C). Designed as a language-neutral interface to an in-memory representation of an XML document, DOM has implementations in Java, ECMAScript,[2] Perl, and other languages.
[2] A standards-friendly language patterned after JavaScript.
While SAX defines an interface of handler methods, the DOM specification calls for a number of classes, each with an interface of methods that affect a particular type of XML markup. Thus, every object instance manages a portion of the document tree, providing accessor methods to add, remove, or modify nodes and data. These objects are typically created by a factory object, making it a little easier for programmers who only have to initialize the factory object themselves.
In DOM, every piece of XML (the element, text, comment, etc.) is a node represented by a node object. The Node class is extended by more specific classes that represent the types of XML markup, including Element, Attr (attribute), ProcessingInstruction, Comment, EntityReference, Text, CDATASection, and Document. These classes are the building blocks of every XML tree in DOM.
The standard also calls for a couple of classes that serve as containers for nodes, convenient for shuttling XML fragments from place to place. These classes are NodeList, an ordered list of nodes, like all the children of an element; and NamedNodeMap, an unordered set of nodes. These objects are frequently required as arguments or given as return values from methods. Note that these objects are all live, meaning that any changes done to them will immediately affect the nodes in the document itself, rather than a copy.
When naming these classes and their methods, DOM merely specifies the outward appearance of an implementation and leaves the internal specifics up to the developer. Particulars like memory management, data structures, and algorithms are not addressed at all, as those issues may vary among programming languages and the needs of users. This is like describing a key so a locksmith can make a lock that it will fit into; you know the key will unlock the door, but you have no idea how it really works. Specifically, the outward appearance makes it easy to write extensions to legacy modules so they can comply with the standard, but it does not guarantee efficiency or speed.
DOM is a very large standard, and you will find that implementations vary in their level of compliance. To make things worse, the standard has not one, but two (soon to be three) levels. DOM Level 1 (DOM1) has been around since 1998, DOM Level 2 (DOM2) emerged more recently, and the W3C is already working on a third. The main difference between Levels 1 and 2 is that the latter adds support for namespaces. If you aren't concerned about namespaces, then DOM1 should be suitable for your needs.
In this section, I describe the interfaces specified in DOM.
The Document class controls the overall document, creating new objects when requested and maintaining high-level information such as references to the document type declaration and the root element.
Here are the properties for the Document class:
doctype: The document type declaration (DTD).
documentElement: The root element of the document.
Here are the methods for the Document class:
createElement, createTextNode, createComment, createCDATASection, createProcessingInstruction, createAttribute, createEntityReference: Each generates a new node object of the corresponding type.
createElementNS, createAttributeNS: Generates a new element or attribute node object with a specified namespace URI and qualified name.
createDocumentFragment: Creates a container object for a document's subtree.
getElementsByTagName: Returns a NodeList of all elements having a given tag name at any level of the document.
getElementsByTagNameNS: Returns a NodeList of all elements having a given namespace URI and local name. The asterisk character (*) matches any local name or any namespace, allowing you to find all elements in a given namespace.
getElementById: Returns a reference to the node that has a specified ID attribute.
importNode: Creates a new node that is the copy of a node from another document. Acts like a "copy to the clipboard" operation for importing markup.
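To make the factory role of the Document object concrete, here is a minimal sketch. It assumes Python's standard xml.dom.minidom module as the DOM implementation (the text above does not prescribe one), and the sample markup and the namespace URI are invented for illustration.

```python
from xml.dom.minidom import parseString

# Sample markup invented for this illustration.
doc = parseString("<book><chapter><title>One</title></chapter></book>")
root = doc.documentElement          # the root element, <book>

# The document acts as the factory for new nodes.
chapter = doc.createElement("chapter")
title = doc.createElement("title")
title.appendChild(doc.createTextNode("Two"))
chapter.appendChild(title)
root.appendChild(chapter)

# Namespace-aware creation takes a namespace URI and a qualified name.
NS = "http://example.com/ns"        # hypothetical namespace URI
note = doc.createElementNS(NS, "ex:note")
root.appendChild(note)

# importNode copies a node from another document; the source is untouched.
other = parseString("<appendix><title>Extras</title></appendix>")
imported = doc.importNode(other.documentElement, True)   # True = deep copy
root.appendChild(imported)

# Collect the text of every <title>, in document order.
titles = [t.firstChild.data for t in doc.getElementsByTagName("title")]
notes = doc.getElementsByTagNameNS(NS, "note")
print(titles)
print(len(notes))
```

Note that the imported node belongs to the receiving document, while the original stays in place in its own document.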
The DocumentFragment class is used to contain a document fragment. Its children are (zero or more) nodes representing the tops of XML trees. This class contrasts with Document, which has at most one child element, the document root, plus metadata like the document type. In this respect, DocumentFragment's content is not well-formed, though it must obey the XML well-formedness rules in all other respects (no illegal characters in text, etc.).
No specific methods or properties are defined; use the generic node methods to access data.
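The following sketch shows one way a fragment can shuttle several sibling nodes into a document at once. It assumes Python's xml.dom.minidom; the element names and text are made up for the example.

```python
from xml.dom.minidom import parseString

doc = parseString("<chapter><title>Origins</title></chapter>")

# Build a fragment holding two top-level siblings (not well-formed on its own).
frag = doc.createDocumentFragment()
for words in ("First paragraph.", "Second paragraph."):
    para = doc.createElement("para")
    para.appendChild(doc.createTextNode(words))
    frag.appendChild(para)

count_before = len(frag.childNodes)

# Appending the fragment moves its children, not the fragment itself.
doc.documentElement.appendChild(frag)

result = doc.documentElement.toxml()
print(result)
```

After the append, the fragment is left empty because its children now live in the document tree.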
The DocumentType class contains all the information in the document type declaration at the beginning of the document, except the specifics about an external DTD. Thus, it names the root element and any declared entities or notations in the internal subset.
No specific methods are defined for this class, and its properties are public but read-only.
Here are the properties for the DocumentType class:
name: The name of the root element.
entities: A NamedNodeMap of entity declarations.
notations: A NamedNodeMap of notation declarations.
internalSubset: The internal subset of the DTD represented as a string.
publicId: The public identifier of the DTD's external subset.
systemId: The system identifier of the DTD's external subset.
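As a small illustration of reading these properties, the sketch below parses a document with a DOCTYPE line and inspects the resulting DocumentType node. It assumes Python's xml.dom.minidom, whose parser records the identifiers without fetching the external DTD; the sample document is invented.

```python
from xml.dom.minidom import parseString

# A document with a public and a system identifier in its DOCTYPE.
source = (
    '<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN" '
    '"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">'
    "<chapter><title>Origins</title></chapter>"
)
doc = parseString(source)
doctype = doc.doctype               # the DocumentType node

print(doctype.name)                 # name of the root element
print(doctype.publicId)
print(doctype.systemId)
```

The entities, notations, and internalSubset properties are read the same way, though how completely they are populated depends on the parser.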
All node types inherit from the class Node. Any properties or methods common to all node types can be accessed through this class. A few properties, such as the value of the node, are undefined for some node types, like Element. The generic methods of this class are useful in some programming contexts, such as when writing code that processes nodes of different types. At other times, you'll know in advance what type you're working with, and you should use the specific class's methods instead.
All properties but nodeValue and prefix are read-only.
Here are the properties for the Node class:
nodeName: A property that is defined for elements, attributes, and entities. In the context of elements this property would be the tag's name.
nodeValue: A property defined for attributes, text nodes, CDATA nodes, processing instructions, and comments.
nodeType: One of the following types of nodes: Element, Attr, Text, CDATASection, EntityReference, Entity, ProcessingInstruction, Comment, Document, DocumentType, DocumentFragment, or Notation.
parentNode: A reference to the parent of this node.
childNodes: An ordered list of references to children of this node (if any).
firstChild, lastChild: References to the first and last of the node's children (if any).
previousSibling, nextSibling: The node immediately preceding or following this one, respectively.
attributes: An unordered list (NamedNodeMap) of nodes that are attributes of this one (if any).
ownerDocument: A reference to the object containing the whole document. It's useful when you need to generate a new node.
namespaceURI: A namespace URI if this node has a namespace prefix; otherwise it is null.
prefix: The namespace prefix associated with this node.
Here are the methods for the Node class:
insertBefore: Inserts a node before a reference child node.
replaceChild: Swaps a child node with a new one you supply, giving you the old one in return.
appendChild: Adds a new node to the end of this node's list of children.
hasChildNodes: True if there are children of this node; otherwise, it is false.
cloneNode: Returns a duplicate copy of this node. It provides an alternate way to generate nodes. All properties will be identical except for parentNode, which will be null, and, for a shallow copy, childNodes, which will be empty. Cloned elements will have the same attributes as the original. If the argument deep is set to true, then the node and all its descendants will be copied.
hasAttributes: Returns true if this node has defined attributes.
isSupported: Returns true if this implementation supports a specific feature.
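The generic Node methods can be exercised without knowing much about the specific node classes. The sketch below assumes Python's xml.dom.minidom and an invented list document; it inserts, appends, replaces, and clones nodes.

```python
from xml.dom.minidom import parseString

def make_item(doc, text):
    """Helper (hypothetical, for this example): build <item>text</item>."""
    item = doc.createElement("item")
    item.appendChild(doc.createTextNode(text))
    return item

doc = parseString("<list><item>b</item><item>c</item></list>")
root = doc.documentElement
first = root.firstChild                       # <item>b</item>

root.insertBefore(make_item(doc, "a"), first) # children: a b c
root.appendChild(make_item(doc, "d"))         # children: a b c d

# replaceChild hands back the node that was swapped out.
old = root.replaceChild(make_item(doc, "B"), first)

shallow = root.cloneNode(False)               # element only, no children
deep = root.cloneNode(True)                   # element plus all descendants

print(root.toxml())
print(shallow.hasChildNodes(), len(deep.childNodes))
```

The shallow clone keeps the element's name and attributes but none of its children, while the deep clone reproduces the whole subtree; neither clone has a parent until you insert it somewhere.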
The NodeList class is a container for an ordered list of nodes. It is "live," meaning that any changes to the nodes it references will appear in the document immediately.
Here are the properties for the NodeList class:
length: Returns an integer indicating the number of nodes in the list.
Here are the methods for the NodeList class:
item: Given an integer value n, returns a reference to the nth node in the list, starting at zero.
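Here is a brief sketch of the NodeList interface and of what "live" means in practice. It assumes Python's xml.dom.minidom, where an element's childNodes list reflects later changes to the tree; how consistently lists are kept live varies between implementations, so treat this as an illustration rather than a guarantee.

```python
from xml.dom.minidom import parseString

doc = parseString("<list><item>a</item><item>b</item></list>")
root = doc.documentElement

children = root.childNodes            # a NodeList
length_before = children.length
first_text = children.item(0).firstChild.data

# Change the tree after obtaining the list...
root.appendChild(doc.createElement("item"))

# ...and the same list object reflects the change.
length_after = children.length
print(length_before, first_text, length_after)
```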
The NamedNodeMap class is an unordered set of nodes designed to allow access to nodes by name. An alternate access by index is also provided for enumerations, but no order is implied.
Here are the properties for the NamedNodeMap class:
length: Returns an integer indicating the number of nodes in the set.
Here are the methods for the NamedNodeMap class:
getNamedItem, setNamedItem: Retrieves or adds a node using the node's nodeName property as the key.
removeNamedItem: Takes a node with the specified name out of the set and returns it.
item: Given an integer value n, returns a reference to the nth node in the set. Note that this method does not imply any order and is provided only for unique enumeration.
getNamedItemNS: Retrieves a node based on a namespace-qualified name (a namespace URI and a local name).
removeNamedItemNS: Takes an item out of the set and returns it, based on its namespace-qualified name.
setNamedItemNS: Adds a node to the set using its namespace-qualified name.
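The most common NamedNodeMap you will meet is an element's attribute set. The sketch below assumes Python's xml.dom.minidom and an invented para element; because the set is unordered, the names are sorted before they are compared.

```python
from xml.dom.minidom import parseString

doc = parseString('<para id="p1" role="intro" lang="en"/>')
para = doc.documentElement
attrs = para.attributes                       # a NamedNodeMap

count = attrs.length
role = attrs.getNamedItem("role").value

# Enumerate by index; the order is not meaningful, so sort the names.
names = sorted(attrs.item(i).name for i in range(attrs.length))

# Remove one attribute node and add another.
removed = attrs.removeNamedItem("lang")
revision = doc.createAttribute("revision")    # hypothetical attribute name
revision.value = "2"
attrs.setNamedItem(revision)

print(count, role, names, removed.value, para.getAttribute("revision"))
```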
The CharacterData class extends Node to facilitate access to certain types of nodes that contain character data, such as Text, CDATASection, Comment, and ProcessingInstruction. Specific classes like Text inherit from this class.
Here are the properties for the CharacterData class:
data: The character data itself.
length: The number of characters in the data.
Here are the methods for the CharacterData class:
appendData: Appends a string of character data to the end of the data property.
substringData: Extracts and returns a segment of the data property from offset to offset + count.
insertData: Inserts a string inside the data property at the location given by offset.
deleteData: Removes count characters from the data property, starting at offset.
replaceData: Replaces count characters of the data property, starting at offset, with a new string that you provide.
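The offset-based methods are easiest to follow on a short string. This sketch assumes Python's xml.dom.minidom and walks one text node through each method in turn, recording the data after every step.

```python
from xml.dom.minidom import parseString

doc = parseString("<para/>")
text = doc.createTextNode("wood sprite")
doc.documentElement.appendChild(text)

steps = []

text.appendData("s")                 # add to the end
steps.append(text.data)              # "wood sprites"

piece = text.substringData(5, 6)     # 6 characters starting at offset 5

text.insertData(0, "the ")           # insert at the front
steps.append(text.data)              # "the wood sprites"

text.deleteData(0, 4)                # drop the 4 characters just inserted
steps.append(text.data)              # "wood sprites"

text.replaceData(0, 4, "water")      # swap "wood" for "water"
steps.append(text.data)              # "water sprites"

print(piece, steps, text.length)
```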
Element is the most common type of node you will encounter. An element can contain other nodes and has attribute nodes.
Here are the properties for the Element class:
tagName: The name of the element.
Here are the methods for the Element class:
getAttribute, getAttributeNode: Returns the value of an attribute, or a reference to the attribute node, with a given name.
setAttribute, setAttributeNode: Adds a new attribute to the element's list or replaces an existing attribute of the same name.
removeAttribute, removeAttributeNode: Removes an attribute with a given name from the element's list. removeAttributeNode returns the removed attribute node.
getElementsByTagName: Returns a NodeList of descendant elements that match a name.
normalize: Collapses adjacent text nodes. You should use this method whenever you add new text nodes to ensure that the structure of the document remains the same, without erroneous extra children.
getAttributeNS: Retrieves an attribute value based on its namespace-qualified name (the namespace URI plus the local name).
getAttributeNodeNS: Gets an attribute's node by using its namespace-qualified name.
getElementsByTagNameNS: Returns a NodeList of elements among this element's descendants that match a namespace-qualified name.
hasAttribute: Returns true if this element has an attribute with a given name.
hasAttributeNS: Returns true if this element has an attribute with a given namespace-qualified name.
removeAttributeNS: Removes an attribute from this element's list, based on its namespace-qualified name.
setAttributeNS: Adds a new attribute to the element's list, given a namespace-qualified name and a value.
setAttributeNodeNS: Adds a new attribute node to the element's list with a namespace-qualified name.
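The attribute accessors and normalize can be tried out in a few lines. The sketch below assumes Python's xml.dom.minidom; the section markup and the attribute values are invented, and the XLink namespace URI is used only as a familiar example of a namespace.

```python
from xml.dom.minidom import parseString

doc = parseString(
    '<section id="s1"><title>Origins</title>'
    "<para>One</para><para>Two</para></section>"
)
sec = doc.documentElement

# Plain attribute access by name.
sec.setAttribute("role", "appendix")
had_role = sec.hasAttribute("role")
sec.removeAttribute("role")

# Namespace-aware access: set with a qualified name, get with a local name.
XLINK = "http://www.w3.org/1999/xlink"
sec.setAttributeNS(XLINK, "xlink:href", "origins.xml")
href = sec.getAttributeNS(XLINK, "href")

paras = sec.getElementsByTagName("para")

# Adding a text node leaves two adjacent text nodes until normalize() runs.
title = sec.getElementsByTagName("title")[0]
title.appendChild(doc.createTextNode(" of the Sprite"))
before = len(title.childNodes)
title.normalize()
after = len(title.childNodes)

print(sec.tagName, sec.getAttribute("id"), href, len(paras), before, after)
```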
The Attr class represents attributes.
Here are the properties for the Attr class:
name: The attribute's name.
specified: True if the program or the document explicitly set the attribute. If it was set in the DTD as a default and not reset anywhere else, then it will be false.
value: The attribute's value as a string. (The same content is also available through the attribute node's child text nodes.)
ownerElement: The element to which this attribute belongs.
The Text class represents text.
Here are the methods for the Text class:
splitText: Breaks the text node into two adjacent text nodes, each with part of the original text content. The first node contains text from the beginning of the original node up to, but not including, a character whose position is given by offset. The second node has the rest of the original node's content. This method is useful for inserting a new element inside a span of text.
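Inserting an element in the middle of a run of text is the typical use of splitText. The sketch below assumes Python's xml.dom.minidom; the anchor element is a hypothetical marker chosen for the example.

```python
from xml.dom.minidom import parseString

doc = parseString("<title>Habits of the Wood Sprite</title>")
title = doc.documentElement
text = title.firstChild

# Split before the character at offset 14 (the "W" of "Wood").
tail = text.splitText(14)

# The new element goes between the two halves.
marker = doc.createElement("anchor")   # hypothetical element name
title.insertBefore(marker, tail)

result = title.toxml()
print(result)
```

The original node keeps the first part of the text, and the node returned by splitText holds the remainder as its next sibling.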
CDATASection is like a text node, but protects its contents from being parsed. It may contain markup characters (<, &) that would be illegal in text nodes. Use generic Node methods to access data.
The ProcessingInstruction class represents processing instructions.
Here are the properties for the ProcessingInstruction class:
target: The target value for the node.
data: The data value for the node.
The Comment class represents comment nodes. Use the generic Node methods to access the data.
An EntityReference node is a reference to an entity defined by an Entity node. Sometimes the parser will be configured to resolve all entity references into their values for you. If that option is disabled, the parser should create this node. No explicit methods force resolution, but some actions on the node may have that side effect.
The Entity class provides access to an entity in the document, based on information in an entity declaration in the DTD.
Here are the properties for the Entity class:
publicId: A public identifier for the resource (if the entity is external to the document).
systemId: A system identifier for the resource (if the entity is external to the document).
notationName: If the entity is unparsed, its notation reference is listed here.
Notation represents a notation declaration appearing in the DTD.
Here are the properties for the Notation class:
publicId: A public identifier for the notation.
systemId: A system identifier for the notation.
Perl is quite different from Java. It was not designed from the outset to be object oriented. That functionality was added later in kind of an ad hoc manner. Perl is loose with type checking and rather idiomatic. For these reasons, it is not always taken seriously by XML pundits.
Yet Perl is a fixture in the World Wide Web, being the original duct tape that holds web sites together. It has a huge following and excellent support in books and online resources, and it's very easy to get started using it. For small, quick-and-dirty utilities that achieve fast results, it simply cannot be beat. Having cut my teeth in the text processing world of publishing, I found Perl to be a boon.
Including a Perl example to contrast with Java gives us a nice range of programming environments to showcase XML development strategies. If you are developing a large, complex system, you will likely want to consider Java for its robustness and strong object-oriented programming capabilities. If you want a small tool for simple tasks in shaping your XML files, then Perl would be a great candidate.
The example I propose for using DOM is a small application that fixes a simple problem. When I used to prepare DocBook-XML documents for formatting, I found there were a few common structural errors that would cause problems in the formatting software. One of these was the tendency of busy indexing specialists to insert <indexterm> elements inside titles. It is an easy mistake to make, and just as easy to fix.
Now I will show you how to go about solving this problem with Perl. My favorite parser in Perl is Matt Sergeant's XML::LibXML. It is an interface to the C library libxml2, which is incredibly fast and reliable. This module also implements most of the DOM2 specification and adds XPath node-fetching capability. In this portion of the script, we set up the parser and use it to assemble DOM trees out of files from the command line:
    use XML::LibXML;

    my $parser = new XML::LibXML;      # a parser object

    # This table gives us the ability to test the type of
    # most common nodes. It is not a complete list, but these are
    # the ones we are most likely to encounter (and care about
    # for this example).
    my %nodeTypes = (
        element => 1, attribute => 2, text => 3,
        cdatasection => 4, entityref => 5, entitynode => 6,
        procinstruc => 7, comment => 8, document => 9
    );

    # Loop through the arguments on the command line, feeding them to
    # the parser as filenames. After testing that parsing was successful,
    # apply the map_proc_to_elems subroutine to the document node to
    # make the needed fixes. Finally, write the XML back out to the file.
    foreach my $fileName ( @ARGV ) {
        my $docRef;
        eval{ $docRef = $parser->parse_file( $fileName ); };
        die( "Parser error: $@" ) if( $@ );
        map_proc_to_elems( \&fix_iterms, $docRef );
        open( OUT, ">$fileName" ) or die( "Can't write $fileName" );
        print OUT $docRef->toString();
        close OUT;
    }
After instantiating the parser, we created a hash table that maps English words for node types to the numeric codes used in the parser. This will give us the ability to test what kind of node we are looking at when we traverse through the file.
In the loop below that declaration, we take filenames from the command line argument list (@ARGV) and feed them to the parser. The eval{ } statement catches any parse errors, which we detect in the following die( ) statement. The parser puts helpful error messages in $@ to indicate what may have confused the parser. If all goes well, the parser will return a reference to the top of the DOM tree, specifically an XML::LibXML::Document object.
The map_proc_to_elems( ) is a yet-to-be-written subroutine that will apply a procedure (also not yet written) to nodes in the DOM tree. This is where the real work will take place in the program. It makes changes directly to the object tree, so all we have to do is print it out as text with the toString( ) method.
Now let us dig into the map_proc_to_elems( ) routine. The purpose of this function is to map a procedure to every element in the document:
    sub map_proc_to_elems {
        my( $proc, $nodeRef ) = @_;
        my $nodeType = $nodeRef->nodeType;
        if( $nodeType == $nodeTypes{document} ) {
            map_proc_to_elems( $proc, $nodeRef->getDocumentElement );
        } elsif( $nodeType == $nodeTypes{element} ) {
            &$proc( $nodeRef );
            foreach my $childNodeRef ( $nodeRef->getChildnodes ) {
                map_proc_to_elems( $proc, $childNodeRef );
            }
        }
    }
You start it with the document node or any element and it will visit every element in that subtree, recursing on the children and their children and so on. Testing the node's type ensures that we never apply the procedure to anything that isn't an element. The procedure to be applied comes in the form of a subroutine reference, which we dereference and call only when the current node is an element. When the current node is the document node, the routine simply recurses into the root element. For any other case, the subroutine just returns without doing anything.
Driving this traversal are the methods getDocumentElement( ), which obtains the root element, and getChildnodes( ),[3] which returns a list of child nodes in the order they appear in the document.
[3] No, that lowercase "n" is not a typo.
Now we turn our attention to the subroutine that performs the fix on elements. It is called fix_iterms( ) because it moves indexterm elements out of title elements where they would cause trouble. We could just as easily substitute this procedure with another that does something else to elements. That is the beauty of this program: it can be quickly re-engineered to do any task on elements you want. Here it is:
    sub fix_iterms {
        my $nodeRef = shift;

        # test: is this an indexterm?
        return unless( $nodeRef->nodeName eq 'indexterm' );

        # test: is the parent a title?
        my $parentNodeRef = $nodeRef->parentNode;
        return unless( $parentNodeRef->nodeName eq 'title' );

        # If we get this far, we must be
        # looking at an indexterm inside a title.
        # Therefore, remove this indexterm and
        # stick it just after the parent (title).
        $parentNodeRef->removeChild( $nodeRef );
        my $ancestorNodeRef = $parentNodeRef->parentNode;
        $ancestorNodeRef->insertAfter( $nodeRef, $parentNodeRef );
    }
At the top of the procedure are lines that select which element to process. Since this procedure is called for every element, we have to weed out the ones we don't want to touch. The first test determines whether the element is an <indexterm> and, if it is not, returns immediately. The next two lines examine the parent of this element, aborting unless it is of type title. If processing gets past these two tests, we know this must be an indexterm inside a title.
The processing that follows removes the offending indexterm element from its parent's list of children and inserts it into the list of its parent's parent's children, just after the parent. So the indexterm goes from being a child of title to being its sibling, positioned immediately after it. This puts the element where it will do no harm to the formatter and will still be seen by an index generator later.
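For readers who want to compare environments, the same traversal and fix can be sketched with Python's standard xml.dom.minidom module. This is an illustrative port under that assumption, not part of the Perl program: the function fix_document and the sample markup are invented here, and since the standard DOM has no insertAfter method, the sketch inserts before the title's next sibling instead.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

def map_proc_to_elems(proc, node):
    """Apply proc to every element in the subtree rooted at node."""
    if node.nodeType == Node.DOCUMENT_NODE:
        map_proc_to_elems(proc, node.documentElement)
    elif node.nodeType == Node.ELEMENT_NODE:
        proc(node)
        # Copy the child list first, because proc may move children around.
        for child in list(node.childNodes):
            map_proc_to_elems(proc, child)

def fix_iterms(node):
    """Move an indexterm out of a title, placing it just after the title."""
    if node.nodeName != "indexterm":
        return
    parent = node.parentNode
    if parent is None or parent.nodeName != "title":
        return
    parent.removeChild(node)
    # Inserting before the next sibling is equivalent to inserting after;
    # a reference of None appends to the end of the list.
    parent.parentNode.insertBefore(node, parent.nextSibling)

def fix_document(xml_text):
    """Parse a string, apply the fix, and return the root as a string."""
    doc = parseString(xml_text)
    map_proc_to_elems(fix_iterms, doc)
    return doc.documentElement.toxml()

sample = (
    "<chapter><title><indexterm><primary>a</primary></indexterm>Habits"
    "<indexterm><primary>b</primary></indexterm></title>"
    "<para>text</para></chapter>"
)
print(fix_document(sample))
```

Because each indexterm lands immediately after the title, two indexterms from the same title end up in reverse order, just as in the Perl version.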
Wasn't that simple? Example 10-5 shows the complete program.
    #!/usr/bin/perl

    use XML::LibXML;

    my $parser = new XML::LibXML;
    my %nodeTypes = (
        element => 1, attribute => 2, text => 3,
        cdatasection => 4, entityref => 5, entitynode => 6,
        procinstruc => 7, comment => 8, document => 9
    );

    foreach my $fileName ( @ARGV ) {
        my $docRef;
        eval{ $docRef = $parser->parse_file( $fileName ); };
        die( "Parser error: $@" ) if( $@ );
        map_proc_to_elems( \&fix_iterms, $docRef );
        open( OUT, ">$fileName" ) or die( "Can't write $fileName" );
        print OUT $docRef->toString();
        close OUT;
    }

    sub map_proc_to_elems {
        my( $proc, $nodeRef ) = @_;
        my $nodeType = $nodeRef->nodeType;
        if( $nodeType == $nodeTypes{document} ) {
            map_proc_to_elems( $proc, $nodeRef->getDocumentElement );
        } elsif( $nodeType == $nodeTypes{element} ) {
            &$proc( $nodeRef );
            foreach my $childNodeRef ( $nodeRef->getChildnodes ) {
                map_proc_to_elems( $proc, $childNodeRef );
            }
        }
    }

    sub fix_iterms {
        my $nodeRef = shift;
        return unless( $nodeRef->nodeName eq 'indexterm' );
        my $parentNodeRef = $nodeRef->parentNode;
        return unless( $parentNodeRef->nodeName eq 'title' );
        $parentNodeRef->removeChild( $nodeRef );
        my $ancestorNodeRef = $parentNodeRef->parentNode;
        $ancestorNodeRef->insertAfter( $nodeRef, $parentNodeRef );
    }
Now, let's make sure this thing works. Here is a sample data file, before processing:
    <chapter>
    <title><indexterm><primary>wee creatures</primary></indexterm>
    Habits of the Wood Sprite
    <indexterm><primary>woodland faeries</primary></indexterm></title>
    <indexterm>
    <primary>sprites</primary>
    <secondary>woodland</secondary>
    </indexterm>
    <para>The wood sprite likes to hang around rotting piles of wood and is easily dazzled by bright lights.</para>
    <section>
    <title><indexterm><primary>little people</primary></indexterm>
    Origins</title>
    <para>No one really knows where they came from.</para>
    <indexterm><primary>magical folk</primary></indexterm>
    </section>
    </chapter>
I have placed indexterms in various places, both inside and outside titles to see which ones are affected. Here is the result, after running the script on it:
    <?xml version="1.0"?>
    <chapter>
    <title>Habits of the Wood Sprite</title><indexterm><primary>woodland faeries</primary></indexterm>
    <indexterm><primary>wee creatures</primary></indexterm>
    <indexterm>
    <primary>sprites</primary>
    <secondary>woodland</secondary>
    </indexterm>
    <para>The wood sprite likes to hang around rotting piles of wood and is easily dazzled by bright lights.</para>
    <section>
    <title>Origins</title><indexterm><primary>little people</primary></indexterm>
    <para>No one really knows where they came from.</para>
    <indexterm><primary>magical folk</primary></indexterm>
    </section>
    </chapter>
The indexterms have been moved out of the titles as we expected. Other indexterms have not been affected. The other contents in titles are still there, unchanged, including some extra space that abutted the indexterm elements. In short, it worked!
Perl works well for most of my XML needs. Historically, it has had a few issues with character encodings, but these problems are gradually going away as Perl adopts multibyte characters and adds support for Unicode. Check out http://www.cpan.org for a huge list of modules that do everything with XML including XSLT, XPath, DOM, SAX, and more.
You will also want to check out Python, which many people tout as superior in its object-oriented support. It is quickly growing in popularity, though it will be a while before it can match Perl's wealth of libraries.