6.2 Finding Nodes

There are still a few cultures on earth who can name their ancestors back ten generations or further. "Here is Sam, the son of Ben, the son of Andrew, the son of..." This chain of generations helps establish the identity of a person, showing that he or she is a member of such and such a clan or related to another person through some shared great-great-uncle.

XPath, too, uses chains of steps, except that they are steps in an XML tree rather than an actual family tree. The terms "child" and "parent" are still applicable. A location path is a chain of location steps that get you from one point in a document to another. If the path begins with an absolute position (say, the root node), then we call it an absolute path. Otherwise, it is called a relative path because it starts from a place not yet determined.

A location step has three parts: an axis that describes the direction to travel, a node test that specifies what kinds of nodes are applicable, and a set of optional predicates that use Boolean (true/false) tests to winnow down the candidates even further.
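To make the three parts concrete, here is a minimal Python sketch that splits a single location step, written in the full axis::node-test form, into its axis, node test, and predicates. The helper name parse_step and the regular expression are illustrative assumptions, not part of XPath or of any library, and the sketch does not cope with predicates that themselves contain square brackets.

```python
import re

# Hypothetical helper, for illustration only: split one location step
# into (axis, node test, list of predicate expressions).
STEP = re.compile(
    r"^(?:(?P<axis>[a-z-]+)::)?"        # optional axis followed by ::
    r"(?P<test>[^\[\]]+)"               # node test, up to the first [
    r"(?P<preds>(?:\[[^\[\]]*\])*)$"    # zero or more [predicate] groups
)

def parse_step(step):
    match = STEP.match(step)
    if match is None:
        raise ValueError(f"cannot parse step: {step!r}")
    axis = match.group("axis") or "child"   # child is the default axis
    predicates = re.findall(r"\[([^\[\]]*)\]", match.group("preds"))
    return axis, match.group("test"), predicates

full = parse_step("child::para[@role='note'][1]")
bare = parse_step("para")
```

Here full holds the axis child, the node test para, and two predicates, while bare shows the child axis being filled in when none is written.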

The axis is a keyword that specifies a direction you can travel from any node. You can go up through ancestors, down through descendants, or linearly through siblings. Table 6-1 lists all the types of node axes. In an actual path, axis names are written entirely in lowercase (for example, ancestor-or-self).

Table 6-1. Node axes

Axis type

Matches

Ancestor

All nodes above the context node, including the parent, grandparent, and so on up to the root node.

Ancestor-or-self

All the ancestors of the context node, plus the context node itself.

Attribute

Attributes of the context node.

Child

Children of the context node.

Descendant

Children of the context node, plus their children, and so on down to the leaves of the subtree.

Descendant-or-self

All the descendants of the context node, plus the context node itself.

Following

Nodes that follow the context node at any level in the document. This does not include any descendants of the context node, but does include its following siblings and their descendants.

Following-sibling

Nodes that follow the context node at the same level (i.e., that share the same parent as the context node).

Namespace

All the namespace nodes of an element.

Parent

The parent of the context node.

Preceding

Nodes that occur before the context node at any level in the document. This does not include any ancestors of the context node, but does include its preceding siblings and their descendants.

Preceding-sibling

Nodes that occur before the context node at the same level (i.e., that share the same parent as the context node).

Self

The context node itself.
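As a rough illustration of how some of these axes behave, the following Python sketch walks a small DOM tree built with the standard xml.dom.minidom module. The function names and the tiny book document are assumptions made for this example only; a real XPath engine is not implemented this way, and attribute and namespace nodes are ignored.

```python
from xml.dom import minidom

# A tiny document invented for this example (no whitespace between tags).
DOC = ("<book><chapter><para>one</para><para>two</para></chapter>"
       "<appendix><para>three</para></appendix></book>")

def ancestors(node):
    """ancestor axis: parent, grandparent, and so on up to the root node."""
    result = []
    node = node.parentNode
    while node is not None:
        result.append(node)
        node = node.parentNode
    return result

def descendants(node):
    """descendant axis: children, their children, ... in document order."""
    result = []
    for child in node.childNodes:
        result.append(child)
        result.extend(descendants(child))
    return result

def following_siblings(node):
    """following-sibling axis: later nodes that share the same parent."""
    result = []
    node = node.nextSibling
    while node is not None:
        result.append(node)
        node = node.nextSibling
    return result

def following(node):
    """following axis: everything after the context node in document
    order, leaving out the context node's own descendants."""
    result = []
    while node is not None:
        for sibling in following_siblings(node):
            result.append(sibling)
            result.extend(descendants(sibling))
        node = node.parentNode   # then continue from the parent's siblings
    return result

doc = minidom.parseString(DOC)
first_para = doc.getElementsByTagName("para")[0]
ancestor_names = [n.nodeName for n in ancestors(first_para)]
following_names = [n.nodeName for n in following(first_para)]
```

In minidom the Document object plays the part of the XPath root node, which is why it appears as the last ancestor of the first para.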

After the axis comes a node test parameter, joined to the axis by a double colon (::). A name can be used in place of an explicit node type, in which case the node type is inferred from the axis. For the attribute axis, the node is assumed to be an attribute, and for the namespace axis, the node is assumed to be a namespace. For all other axes, the node is assumed to be an element. If no axis is specified, the axis is assumed to be child and the node is assumed to be of type element. Table 6-2 lists the node tests.

Table 6-2. Node tests

Term

Matches

/

The root node: not the root element but the node containing the root element and any comments or processing instructions that precede or follow it.

node( )

Matches any node. For example, the step attribute::node( ) would select all the attributes of the context node.

*

In the attribute axis, any attribute. In the namespace axis, any namespace. In all other axes, any element.

crabcake

In the attribute axis, the attribute named crabcake of the context node. In a namespace axis, it's a namespace called crabcake. In all other axes, any element named crabcake.

text( )

Any text node.

processing-instruction( )

Any processing instruction.

processing-instruction('for-web')

Any processing instruction with target for-web.

comment( )

Any comment node.

Location path steps are chained together using the slash (/) character. Each step gets you a little closer to the node you want to locate. It's sort of like giving directions to a restaurant ("Go to Davis Square, head down College Avenue; at the Powderhouse rotary, turn left and you'll see a great Vietnamese restaurant"). For example, to get from the root node to a para element inside a section inside a chapter inside a book, a path might look like this:

book/chapter/section/para
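The way each step narrows the search can be sketched in Python with the standard xml.etree.ElementTree module. The small book document and the walk helper are assumptions made for this illustration. Note that ElementTree evaluates paths relative to an element, so starting at the book element takes the place of the first step.

```python
import xml.etree.ElementTree as ET

# A small document invented for this example.
book = ET.fromstring(
    "<book>"
    "<chapter><section><para>A</para><para>B</para></section></chapter>"
    "<chapter><section><para>C</para></section></chapter>"
    "</book>"
)

# Let the library evaluate the remaining steps in one go.
paras = book.findall("chapter/section/para")
texts = [p.text for p in paras]

def walk(context, steps):
    """Hypothetical helper: apply child steps one at a time. Each step
    replaces the current node set with the matching children."""
    nodes = [context]
    for name in steps:
        nodes = [child for node in nodes for child in node if child.tag == name]
    return nodes

by_hand = [p.text for p in walk(book, ["chapter", "section", "para"])]
```

Both approaches collect the same three para elements; the hand-written loop simply makes visible that every step starts from the nodes found by the step before it.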

Spelled out in full, with an explicit axis on every step, this syntax can be verbose; XPath defines some handy shortcuts, as listed in Table 6-3.

Table 6-3. Location path shortcuts

Pattern

Matches

@role

Matches an attribute named role. This is equivalent to attribute::role.

.

The context node. This is equivalent to self::node( ).

/*

Matches the document element. Any location path that starts with slash (/) is an absolute path, with the first step representing the root node. The next step is *, which matches any element.

parent::*/following-sibling::para

Matches all para elements that are following siblings of the context node's parent.

..

Matches the parent node. The double dot (..) is shorthand for parent::node( ).

.//para

Matches any element of type para that is a descendant of the current node. The double slash (//) is shorthand for /descendant-or-self::node( )/.

//para

Matches any <para> descending from the root node. In other words, it matches all paras anywhere in the document. A location path starting with a double slash (//) is assumed to begin at the root node.

../*

Matches all sibling elements (and the context node if it is an element).
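One way to check your reading of these shortcuts is to expand them back into the long form. The Python sketch below does this with plain string handling. The function names expand_step and expand_path are assumptions for illustration, and the sketch deliberately ignores predicates, so it should only be fed simple paths such as the ones in Table 6-3.

```python
def expand_step(step):
    """Hypothetical helper: rewrite one abbreviated step in long form."""
    if step == ".":
        return "self::node()"
    if step == "..":
        return "parent::node()"
    if step.startswith("@"):
        return "attribute::" + step[1:]
    if "::" in step:
        return step                 # an axis is already spelled out
    return "child::" + step         # child is the default axis

def expand_path(path):
    """Hypothetical helper: expand a simple path without predicates."""
    # '//' abbreviates a whole step, /descendant-or-self::node()/
    path = path.replace("//", "/descendant-or-self::node()/")
    prefix = "/" if path.startswith("/") else ""   # keep absolute paths absolute
    steps = [s for s in path.split("/") if s]
    return prefix + "/".join(expand_step(s) for s in steps)

expanded = {p: expand_path(p) for p in ["@role", ".//para", "//para", "../*", "/*"]}
```

For instance, expanded[".//para"] comes out as self::node()/descendant-or-self::node()/child::para, which spells out why the shortcut reaches para elements at any depth below the context node.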

To see how axis and node tests can be used to retrieve nodes, let's now look at some examples. Consider the sample document in Example 6-1.

Example 6-1. A sample XML document
<quotelist>
  <quotation style="wise" id="q1">
    <text>Expect nothing; be ready for everything.</text>
    <source>Samurai chant</source>
  </quotation>
  <quotation style="political" id="q2">
    <text>If one morning I walked on top of the water across the Potomac
    River, the headline that afternoon would read "President Can't
    Swim".</text>
    <source>Lyndon B. Johnson</source>
  </quotation>
  <quotation style="silly" id="q3">
    <?human laugh?>
    <text>What if the hokey-pokey IS what it's all about?</text>
  </quotation>
  <quotation style="wise" id="q4">
    <text>If they give you ruled paper, write the other way.</text>
    <source>Juan Ramon Jiminez</source>
  </quotation>
  <!-- the checkbook is mightier than the sword? -->
  <quotation style="political" id="q5">
    <text>Banking establishments are more dangerous than standing
    armies.</text>
    <source>Thomas Jefferson</source>
  </quotation>
</quotelist>

Table 6-4 shows some location paths and what they would return.

Table 6-4. Location path examples

Path

Matches

/quotelist/child::node( )

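All the quotation elements plus the XML comment. (If the parser preserves the whitespace between elements, the whitespace-only text nodes are selected too.)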

/quotelist/quotation

All the quotation elements.

/*/*

All the quotation elements.

//comment( )/following-sibling::*/@style

The style attribute of the last quotation element.

id('q1')/parent::*

The document element, quotelist, which is the parent of the first quotation element.

id('q2')/..

The document element.

id('q1')/ancestor-or-self::*

The document element and the first quotation element.

id('q3')/self::aphorism

Nothing! The first step does match the third quotation element, but the next step invalidates it because it's looking for an element of type aphorism. In a context where you don't know what type the element is, this is a good way to test it.

//processing-instruction( )/../following::source

The source elements from the last two quotation elements.

Note that the id( ) step will only work on attributes that have been declared to be type ID in a DTD. It is this declaration that tells a validating parser to require an attribute to have a unique value.
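A few of these paths can be tried out in Python with the standard xml.etree.ElementTree module, which supports only a subset of XPath. The sketch below rests on several assumptions: the sample is a condensed copy of Example 6-1 with shortened quotation texts, the id( ) function is replaced by an attribute predicate because ElementTree does not provide it, and paths are written relative to the quotelist element. ElementTree's default parser also discards comments and processing instructions, so the paths that rely on them are left out.

```python
import xml.etree.ElementTree as ET

# Condensed copy of Example 6-1 (quotation texts shortened).
SAMPLE = """<quotelist>
  <quotation style="wise" id="q1">
    <text>Expect nothing; be ready for everything.</text>
    <source>Samurai chant</source>
  </quotation>
  <quotation style="political" id="q2">
    <text>President Can't Swim.</text>
    <source>Lyndon B. Johnson</source>
  </quotation>
  <quotation style="silly" id="q3">
    <?human laugh?>
    <text>What if the hokey-pokey IS what it's all about?</text>
  </quotation>
  <quotation style="wise" id="q4">
    <text>Write the other way.</text>
    <source>Juan Ramon Jiminez</source>
  </quotation>
  <!-- the checkbook is mightier than the sword? -->
  <quotation style="political" id="q5">
    <text>Banking establishments are dangerous.</text>
    <source>Thomas Jefferson</source>
  </quotation>
</quotelist>"""

root = ET.fromstring(SAMPLE)   # the quotelist document element

# /quotelist/quotation and /*/* : both select the five quotation elements
by_name = root.findall("quotation")
by_wildcard = root.findall("*")

# id('q2')/.. : an attribute predicate stands in for id()
parent_of_q2 = root.find(".//*[@id='q2']/..")

# id('q3')/self::aphorism : emulate the self test by comparing tag names
q3 = root.find("quotation[@id='q3']")
as_aphorism = [q3] if q3.tag == "aphorism" else []
```

Here parent_of_q2 is the quotelist element itself, and as_aphorism stays empty because the third quotation is not an aphorism element.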

If the axis and node type aren't sufficient to narrow down the selection, you can use one or more predicates. A predicate is a Boolean expression enclosed within square brackets ([ ]). Every node that passes this test (in addition to the node test and axis specifier) is included in the final node set. Nodes that fail the test (the predicate evaluates to false) are not. Table 6-5 shows some examples.

Table 6-5. XPath predicates

Path

Matches

//quotation[@id="q3"]/text

The text element in the third quotation element. This is an example of an equality test, where the string value of the attribute is matched against another string. You can also test numerical and Boolean values.

//quotation[source]

All the quotation elements but the third, which doesn't have a source element. Here, the presence of a child element source is evaluated; if at least one node matching it is found, the value of the test is true, otherwise false.

//quotation[not(source)]

The third quotation element. not( ) is true if there are no source elements.

//*[@id="q2"]/preceding-sibling::*/source

The source element with the content "Samurai chant."

//*[source='Thomas Jefferson']/text

The text element in the last quotation element.

//*[source='Thomas Jefferson'][@id='q7']

Nothing! Successive predicates are combined like a Boolean and: both have to be true or the path fails. Since no element in the sample has the id q7, no element passes both tests, and we are left with nothing.

/*/*[position( )=last( )]

The last quotation element. Within a predicate, position( ) returns the position of the node being tested among the candidates selected by the current step, and last( ) returns the total number of candidates (in this case, 5).

//quotation[position( )!=2]

All quotation elements but the second one.

//quotation[4]

The fourth quotation element. A number alone in the predicate is shorthand for a position( ) test, so this path is equivalent to //quotation[position( )=4].

//quotation[@style='silly' or @style='wise']

The first, third, and fourth quotation elements. The or keyword acts as a Boolean or operator.
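Several of these predicates can be checked in Python with the standard xml.etree.ElementTree module, under some assumptions: the sample is a condensed copy of Example 6-1 with shortened texts, paths are written relative to the quotelist element, and the helper ids exists only for this example. ElementTree understands only a subset of XPath, so not( ) and the or test on the style attribute are emulated with ordinary Python conditions rather than evaluated as XPath.

```python
import xml.etree.ElementTree as ET

# Condensed copy of Example 6-1 (quotation texts shortened).
SAMPLE = """<quotelist>
  <quotation style="wise" id="q1">
    <text>Expect nothing.</text><source>Samurai chant</source>
  </quotation>
  <quotation style="political" id="q2">
    <text>President Can't Swim.</text><source>Lyndon B. Johnson</source>
  </quotation>
  <quotation style="silly" id="q3">
    <text>What if the hokey-pokey IS what it's all about?</text>
  </quotation>
  <quotation style="wise" id="q4">
    <text>Write the other way.</text><source>Juan Ramon Jiminez</source>
  </quotation>
  <quotation style="political" id="q5">
    <text>Banking establishments.</text><source>Thomas Jefferson</source>
  </quotation>
</quotelist>"""

root = ET.fromstring(SAMPLE)

def ids(elements):
    """Hypothetical helper: list the id attribute of each element."""
    return [e.get("id") for e in elements]

# //quotation[@id="q3"]/text : equality test on an attribute
third_text = root.find(".//quotation[@id='q3']/text").text

# //quotation[source] : keep quotations that have a source child
with_source = ids(root.findall(".//quotation[source]"))

# //quotation[not(source)] : emulated, since not() is unsupported here
without_source = ids(q for q in root.iter("quotation") if q.find("source") is None)

# //*[source='Thomas Jefferson'] : test the string value of a child
jefferson = ids(root.findall(".//*[source='Thomas Jefferson']"))

# /*/*[position()=last()] and //quotation[4] : positional predicates
last_one = ids(root.findall("quotation[last()]"))
fourth = ids(root.findall(".//quotation[4]"))

# or test on the style attribute, emulated in Python
silly_or_wise = ids(q for q in root.iter("quotation")
                    if q.get("style") in ("silly", "wise"))
```

A full XPath processor would accept the paths from the table directly; the emulated lines are there only because of the limits of this particular module.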