5.3 DOM

DOM is one of PHP 5's two major XML processing extensions. This section introduces DOM, providing an overview of how it organizes information. It also demonstrates how to turn XML documents into DOM objects and vice versa.

5.3.1 About DOM

There is only one way to read XML into a tree using PHP 4: DOM. DOM, short for Document Object Model, is a W3C standard describing a platform- and language-neutral interface for interacting with XML and other structured documents. Once a document has been read into a tree, DOM provides a series of utility functions to recurse through the branches and pick out the nodes and data that you want.

It's easier to parse XML into a tree than to use a streaming parser such as SAX. When you read XML into a tree, you can move through the document in a way that is similar to navigating PHP data structures such as multi-dimensional arrays or objects that have subobjects. However, because the whole tree is held in memory at once, DOM can use a lot of memory if your document is large.

PHP 5's DOM utilities have undergone a complete rewrite. If you used the DOM functions in PHP 4, you know that PHP's DOM support has largely been sketchy and incomplete. The original implementation did not conform at all to the W3C naming conventions, thus partially defeating the purpose of a language-neutral API. Although PHP 4.3 unveiled an improved and more compliant set of DOM functions, there are still holes and memory leaks.

On top (or perhaps because) of all this, a large "EXPERIMENTAL" tag had seemingly been permanently placed upon PHP 4's DOM functions. When the warning "The behavior of this extension, including the names of its functions and anything else documented about this extension, may change without notice in a future release of PHP. Use this extension at your own risk" appears at the top of documentation, it does not engender comfort.

Happily, all that has changed in PHP 5. The new DOM extension not only has updated internals, but you now interact with it in the standard way, and it has a few new features, such as validation. Still, the entire DOM specification is quite large and complex, and not all features are available yet. But what has been implemented is done correctly and is consistent with other languages.

Unfortunately, if you've written any applications that use the old DOM extension, they won't work with PHP 5. You must update them.

5.3.2 Turning XML Documents into DOM objects

Before you can do anything DOM-related in PHP 5, you need to create a new instance of the DOM object, called DOMDocument:

$dom = new DOMDocument;

Now you can load XML into $dom. DOM differentiates between XML stored as a string and XML stored in a file. To read from a string, call loadXML( ); to read from a file, call load( ):

$dom = new DOMDocument;

// read from a string

$dom->loadXML('<string>I am XML</string>');

// read from a file

$dom->load('address-book.xml');

The DOM load( ) method, like all XML input and output methods, actually works for more than just files. It really works with streams, so it can read from HTTP or write to FTP. See Chapter 8 for more information about streams.
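As a rough illustration of the two entry points, the following sketch reads the same document once from a string and once from a file. The temporary file and the variable names are assumptions made for this example, and sys_get_temp_dir( ) requires a later PHP 5 release than the one this chapter was written against.

```php
<?php
// Sample data only; the temporary file stands in for a real XML file.
$xml  = '<string>I am XML</string>';
$path = tempnam(sys_get_temp_dir(), 'dom');
file_put_contents($path, $xml);

// Read the document from a string...
$fromString = new DOMDocument;
$fromString->loadXML($xml);

// ...and read the same document from a file.
$fromFile = new DOMDocument;
$fromFile->load($path);

// Both objects now hold the same tree.
print $fromString->documentElement->tagName . "\n";
print $fromFile->documentElement->firstChild->nodeValue . "\n";

// Remove the temporary file.
unlink($path);
```

Whichever method you call, the resulting DOMDocument behaves the same way afterward.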

If DOM encounters problems reading the XML (for example, the XML is not well-formed, your file does not exist, or you try to pass in an array instead of a string), DOM emits a warning. In some cases, such as a failure of the DOM extension to create a new DOM object or safe_mode blocking off access to the file, it returns false instead.

This example tries to load a string that's not well-formed XML:

$dom = new DOMDocument;

// read non well-formed XML document

$dom->loadXML('I am not XML');

It causes DOM to give a PHP Warning that begins like this:

PHP Warning:  DOMDocument::loadXML( ): Start tag expected, '<' not found
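If you would rather inspect parse problems in code than have them appear as warnings, one possible approach is sketched below. It assumes PHP 5.1 or later, where libxml_use_internal_errors( ) and libxml_get_errors( ) are available, and it assumes that loadXML( ) returns false when parsing fails; the variable names are illustrative only.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$loaded = $dom->loadXML('I am not XML');

// Copy the collected messages, then reset the error buffer
// and restore the previous error-handling mode.
$messages = array();
foreach (libxml_get_errors() as $error) {
    $messages[] = trim($error->message);
}
libxml_clear_errors();
libxml_use_internal_errors($previous);

if (!$loaded) {
    print "Could not parse XML:\n";
    foreach ($messages as $message) {
        print "  $message\n";
    }
}
```

Restoring the previous setting keeps the change local, so other code that relies on warnings is not affected.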

Whitespace is considered significant in XML, so spaces between tags are considered text nodes. For example, there are five nodes inside the person element:

<person>

  <firstname>Rasmus</firstname>

  <lastname>Lerdorf</lastname>

</person>

It looks like there are only two elements, firstname and lastname, but there are actually three additional text nodes. They're hard to see because they're whitespace. They occur between the opening person tag and the opening firstname tag, the closing firstname tag and opening lastname tag, and the closing lastname and closing person tag.

However, removing the whitespace makes the document hard for humans to read. Happily, you can tell DOM to ignore whitespace:

$dom = new DOMDocument;

// Whitespace is no longer significant

$dom->preserveWhiteSpace = false;

$dom->loadXML('<string>I am XML</string>');

Setting the preserveWhiteSpace attribute to false makes DOM skip over any text nodes that contain only spaces, tabs, returns, or other whitespace.
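To see the effect, the sketch below counts the children of a small person element with and without whitespace preservation. The sample document and variable names are made up for this illustration.

```php
<?php
// A person element with line breaks and indentation between its children.
$xml = "<person>\n"
     . "  <firstname>Rasmus</firstname>\n"
     . "  <lastname>Lerdorf</lastname>\n"
     . "</person>";

// Default behavior: whitespace-only text nodes are kept.
$withWhitespace = new DOMDocument;
$withWhitespace->loadXML($xml);
$kept = $withWhitespace->documentElement->childNodes->length;

// With preserveWhiteSpace disabled, those text nodes are skipped.
$withoutWhitespace = new DOMDocument;
$withoutWhitespace->preserveWhiteSpace = false;
$withoutWhitespace->loadXML($xml);
$skipped = $withoutWhitespace->documentElement->childNodes->length;

print "preserveWhiteSpace = true:  $kept child nodes\n";    // 5
print "preserveWhiteSpace = false: $skipped child nodes\n"; // 2
```

Note that the property must be set before the document is loaded; changing it afterward does not alter a tree that has already been built.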

5.3.3 DOM Nodes

DOM organizes XML documents into nodes. You can use DOM to retrieve the text stored in a node, find a node's children, insert another node at that location, and so forth.

Figure 5-1 shows how DOM represents the beginning of the address book.

Figure 5-1. A DOM representation of an XML address book

Accessing the root element

The root element of an XML document is stored as the documentElement property of a DOM object:

$dom = new DOMDocument;

$dom->preserveWhiteSpace = false;

$dom->load('address-book.xml');

$root = $dom->documentElement;

The $root variable now holds a pointer to the document root.

Navigating through nodes

DOM has a whole set of tree iteration properties that allow you to explicitly move from one element to another. In PHP 4, these are object methods, but they're object properties in PHP 5.
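As a minimal sketch of moving explicitly from node to node, the code below uses the standard DOM properties firstChild, lastChild, nextSibling, previousSibling, and parentNode on a small, made-up person document.

```php
<?php
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->loadXML('<person><firstname>Rasmus</firstname>'
            . '<lastname>Lerdorf</lastname></person>');

$person = $dom->documentElement;
$first  = $person->firstChild;   // the firstname element
$last   = $person->lastChild;    // the lastname element

print $first->nodeName . "\n";                   // firstname
print $first->nextSibling->nodeName . "\n";      // lastname
print $last->previousSibling->nodeName . "\n";   // firstname
print $first->parentNode->nodeName . "\n";       // person
```

Because these are properties rather than methods, there are no parentheses after their names.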

The easiest way to process all of a node's children is with a foreach upon its childNodes. For example, to process all the person elements in the address book:

$dom = new DOMDocument;

$dom->preserveWhiteSpace = false;

$dom->load('address-book.xml');

$root = $dom->documentElement;

foreach ($root->childNodes as $person) {

    // work with each $person node here

}

The childNodes attribute is not an array, but a DOMNodeList object. The item( ) method allows you to access individual items, and the length property tells you the number of items in the list.

This code is equivalent to the foreach loop:

$people = $root->childNodes;

for ($i = 0; $i < $people->length; $i++) {

    $person = $people->item($i);

    // work with each $person node here

}

The first element lives in position 0, the second in 1, and so on.
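The following sketch runs both styles of loop over the same DOMNodeList and collects the id attribute of each person. The two-person document is a cut-down stand-in for the address book, and the array names are illustrative.

```php
<?php
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->loadXML('<address-book><person id="1"/><person id="2"/></address-book>');

$people = $dom->documentElement->childNodes;

// Iterate with foreach.
$viaForeach = array();
foreach ($people as $person) {
    $viaForeach[] = $person->getAttribute('id');
}

// Iterate by position with item( ) and length.
$viaItem = array();
for ($i = 0; $i < $people->length; $i++) {
    $viaItem[] = $people->item($i)->getAttribute('id');
}

print implode(',', $viaForeach) . "\n";   // 1,2
print implode(',', $viaItem) . "\n";      // 1,2
```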

Table 5-1 contains the complete list of properties and what they do.

Table 5-1. DOM iteration properties

PHP 5 property     PHP 4 method           Description
parentNode         parent_node( )         The node above the current node
childNodes         child_nodes( )         A list of nodes below the current node
firstChild         first_child( )         The "first" node below the current node
lastChild          last_child( )          The "last" node below the current node
previousSibling    previous_sibling( )    The node "before" the current node
nextSibling        next_sibling( )        The node "after" the current node

Determining node types

libxml2 has 21 different types of nodes. The most frequently encountered types are elements, attributes, and text. The nodeType property holds a number describing the node.

For instance, the documentElement is always an element:

$root = $dom->documentElement;

print $root->nodeType;

1

Table 5-2 lists libxml2's node types.

Table 5-2. libxml2's numeric node types

Node type                      Number
Element                        1
Attribute                      2
Text                           3
CDATA section                  4
Entity reference               5
Entity                         6
PI (Processing Instruction)    7
Comment                        8
Document                       9
Document type                  10
Document fragment              11
Notation                       12
HTML document                  13
DTD                            14
Element declaration            15
Attribute declaration          16
Entity declaration             17
Namespace declaration          18
XInclude start                 19
XInclude end                   20
DocBook document               21

Accessing text nodes

DOM never makes any assumptions about how your data is organized or what you wish to do with it. If you have a snippet of XML that looks like this:

<firstname>Rasmus</firstname>

DOM does not assume that your primary interest is the string Rasmus. To DOM, Rasmus is just the text portion of a child node associated with the node for the firstname element.

Here's how to access Rasmus:

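// load in XML

$rasmus = new DOMDocument;

$rasmus->loadXML('<firstname>Rasmus</firstname>');

// locate the firstname element

$firstname = $rasmus->getElementsByTagName('firstname')->item(0);

// a DOM longhand way: the first item in the list of children

print $firstname->childNodes->item(0)->nodeValue;

// a DOM shorthand way of saying the same thing

print $firstname->firstChild->nodeValue;

// a PHP 5 shorthand property

// *NOT* portable to DOM implementations that predate DOM Level 3

print $firstname->textContent;

// yet another way, because this is the root element

print $rasmus->documentElement->firstChild->nodeValue;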

DOM does not couple the element with the text wrapped by its tags. Therefore, you must ask the node for its first child. This gives you the text node holding Rasmus. However, you can't print the node, because it's an object, not a string. To access the text portion of a text node, you need to grab its nodeValue.

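PHP 5's DOM implementation also provides the textContent property. For an element whose only child is a text node, as here, it returns the same string as firstChild->nodeValue; for an element with nested children, it returns the concatenated text of all its descendants. The name is shorter, but textContent was introduced in DOM Level 3, so it is not portable to DOM implementations that stop at Level 2.

Accessing element nodes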

DOM stores an element's name in the tagName property. This code loops through a person element from the address book and prints out the names of all the elements and the values of their first children:

$dom = new DOMDocument;

$dom->preserveWhiteSpace = false;

$dom->load('address-book.xml');

$person = $dom->documentElement->firstChild;

foreach ($person->childNodes as $field) {

    if ($field->nodeType == 1) {

        print "$field->tagName: {$field->firstChild->nodeValue}\n";

    }

}

firstname: Rasmus

lastname: Lerdorf

city: Sunnyvale

state: CA

email: rasmus@php.net

The $person object holds the person node, and its children are the address book fields.

Inside the foreach, you need to check the nodeType to make sure you have an element node. All elements in libxml2 have a nodeType of 1. If you skip this check, the loop also tries to process the comment node inside person, because DOM does not ignore comments.
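Instead of comparing against a bare number, you can compare against the named constants that PHP's DOM extension defines, such as XML_ELEMENT_NODE and XML_COMMENT_NODE. The sketch below uses a shortened, made-up person element; the $fields and $skipped arrays are illustrative names.

```php
<?php
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->loadXML('<person id="1"><!--Rasmus Lerdorf-->'
            . '<firstname>Rasmus</firstname>'
            . '<lastname>Lerdorf</lastname></person>');

$fields  = array();   // element name => text value
$skipped = array();   // node types that were not elements

foreach ($dom->documentElement->childNodes as $node) {
    if ($node->nodeType == XML_ELEMENT_NODE) {
        $fields[$node->tagName] = $node->firstChild->nodeValue;
    } else {
        // The comment node ends up here with type XML_COMMENT_NODE.
        $skipped[] = $node->nodeType;
    }
}

foreach ($fields as $name => $value) {
    print "$name: $value\n";
}
```

The constants carry the same numeric values as the libxml2 node types, so the comparison behaves exactly like the numeric check.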

5.3.4 Turning DOM Objects into XML Documents

To take a DOM document and convert it back into XML, you have two options: save( ) and saveXML( ). The first method saves a document to a file; the other returns a string representation of the document, which you can print out or store in a variable.

$dom->save('address-book.xml');

print $dom->saveXML( );

As always, you must have write permission for the directory in which you're saving the file.
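The two methods can be compared side by side with a sketch like the following. It writes to a temporary file rather than a real document, and it relies on save( ) returning the number of bytes written (or false on failure); the variable names are assumptions made for this example.

```php
<?php
$dom = new DOMDocument;
$dom->loadXML('<string>I am XML</string>');

// saveXML( ) returns the document as a string.
$asString = $dom->saveXML();

// save( ) writes the document to a file; a temporary file is used here.
$path  = tempnam(sys_get_temp_dir(), 'dom');
$bytes = $dom->save($path);

if ($bytes === false) {
    print "Could not write $path\n";
} else {
    print "Wrote $bytes bytes\n";
}

// Read the file back for comparison, then clean up.
$onDisk = file_get_contents($path);
unlink($path);

print $asString;
```

Checking the return value of save( ) is a simple way to detect a permissions problem without parsing warning text.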

If you disable the preserveWhiteSpace attribute, your XML ends up as a single line:

$dom = new DOMDocument;

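$dom->preserveWhiteSpace = false;

$dom->load('address-book.xml');

print $dom->saveXML( );

<address-book><person id="1"><!--Rasmus Lerdorf--><firstname>Rasmus</fi

rstname><lastname>Lerdorf</lastname><city>Sunnyvale</city><state>CA</st

ate><email>rasmus@php.net</email></person><person id="2"><!--Zeev Suras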

ki--><firstname>Zeev</firstname><lastname>Suraski</lastname><city>Tel A


To prevent this, set the formatOutput attribute to true:

$dom = new DOMDocument;

$dom->preserveWhiteSpace = false;

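$dom->formatOutput = true;

$dom->load('address-book.xml');

print $dom->saveXML( );

<?xml version="1.0"?>

<address-book>

  <person id="1">

<!--Rasmus Lerdorf-->

    <firstname>Rasmus</firstname>

    <lastname>Lerdorf</lastname>

    <city>Sunnyvale</city>

    <state>CA</state>

    <email>rasmus@php.net</email>

  </person>

  <person id="2">

<!--Zeev Suraski-->

    <firstname>Zeev</firstname>

    <lastname>Suraski</lastname>

    <city>Tel Aviv</city>

  </person>

</address-book>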

Now the elements are indented. However, since libxml2 does not indent comments, those nodes remain on the left.
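The difference is easy to reproduce on a small scale. This sketch serializes one made-up person element twice, first with formatOutput enabled and then with it disabled; the variable names are illustrative.

```php
<?php
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->loadXML('<person id="1"><firstname>Rasmus</firstname>'
            . '<lastname>Lerdorf</lastname></person>');

// Indented output: each child element goes on its own line.
$dom->formatOutput = true;
$pretty = $dom->saveXML();

// Compact output: the whole tree follows the XML declaration on one line.
$dom->formatOutput = false;
$compact = $dom->saveXML();

print $pretty;
print $compact;
```

The formatOutput setting is consulted when the document is serialized, so you can switch it on or off between calls to saveXML( ).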