Recipe 22.6 Finding Elements and Text Within an XML Document

22.6.1 Problem

You want to get to a specific part of the XML; for example, the href attribute of an a tag whose contents are an img tag with alt text containing the word "monkey".

22.6.2 Solution

Use XML::LibXML and construct an XPath expression to find nodes you're interested in:

use XML::LibXML;

my $parser = XML::LibXML->new;
my $doc = $parser->parse_file($FILENAME);
my @nodes = $doc->findnodes($XPATH_EXPRESSION);

22.6.3 Discussion

Example 22-9 shows how you would print all the titles in the book XML from Example 22-1.

Example 22-9. xpath-1

#!/usr/bin/perl -w

use XML::LibXML;

my $parser = XML::LibXML->new;
my $doc = $parser->parse_file("books.xml");

# find title elements
my @nodes = $doc->findnodes("/books/book/title");

# print the text in the title elements
foreach my $node (@nodes) {
  print $node->firstChild->data, "\n";
}

The difference between DOM's getElementsByTagName and findnodes is that the former identifies elements only by their name. An XPath expression specifies a set of steps that the XPath engine takes to find nodes you're interested in. In Example 22-9 the XPath expression says "start at the top of the document, go into the books element, go into the book element, and then go into the title element."

The difference is important. Consider this XML document:

<message>
  <header><to>Tom</to><from>Nat</from></header>
  <body>
    <order><to>555 House St, Mundaneville</to>
           <product>Fish sticks</product>
    </order>
  </body>
</message>

There are two to elements here: one in the header and one in the body. If we said $doc->getElementsByTagName("to"), we'd get both to elements. The XPath expression "/message/header/to" restricts output to the to element in the header.
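To see the difference concretely, here is a minimal sketch that runs both lookups against the message document. The XML is inlined with parse_string so the program is self-contained; the variable names are ours, chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# The message document from the text, inlined so the sketch is self-contained.
my $xml = <<'XML';
<message>
  <header><to>Tom</to><from>Nat</from></header>
  <body>
    <order><to>555 House St, Mundaneville</to>
           <product>Fish sticks</product>
    </order>
  </body>
</message>
XML

my $doc = XML::LibXML->new->parse_string($xml);

# Name-based lookup: every to element, wherever it sits in the tree.
my @by_name = $doc->getElementsByTagName("to");

# Path-based lookup: only the to element directly under /message/header.
my @by_path = $doc->findnodes("/message/header/to");

print "getElementsByTagName found ", scalar(@by_name), " element(s)\n";
print "findnodes found ", scalar(@by_path), " element(s): ",
      $by_path[0]->textContent, "\n";
```

The name-based call returns both to elements, while the path-based call returns only the one holding "Tom".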

XPath expressions are like regular expressions that operate on XML structure instead of text. As with regular expressions, there are a lot of things you can specify in XPath expressions, far more than the simple "find this child node and go into it" that we've been doing.
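As one illustration of what predicates can do, the case from the Problem statement (the href attribute of an a tag that contains an img tag whose alt text includes "monkey") might be sketched as follows. The markup and filenames here are made up for the example, and we assume well-formed XML with no namespace declaration; a namespaced XHTML document would need its prefix registered before the same expression could match.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical markup, invented for this sketch: well-formed, no namespaces.
my $xml = <<'XML';
<html>
  <body>
    <a href="monkey.html"><img src="m.png" alt="A monkey eating"/></a>
    <a href="cat.html"><img src="c.png" alt="A sleeping cat"/></a>
    <a href="text.html">monkey (text only, no image)</a>
  </body>
</html>
XML

my $doc = XML::LibXML->new->parse_string($xml);

# Select a elements that have an img child whose alt attribute contains
# "monkey", then step to the href attribute of each matching a element.
my @attrs = $doc->findnodes('//a[img[contains(@alt, "monkey")]]/@href');

# Each result is an attribute node; value() returns its string value.
my @hrefs = map { $_->value } @attrs;

print "$_\n" for @hrefs;
```

Note that contains is case-sensitive, so alt text reading "Monkey" would not match this expression as written.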

Let's return to the books file and add another entry:

<book id="4">
  <!-- Perl Cookbook -->
  <title>Perl Cookbook</title>
  <edition>2</edition>
  <authors>
    <author>
      <firstname>Nathan</firstname>
      <lastname>Torkington</lastname>
    </author>
    <author>
      <firstname>Tom</firstname>
      <lastname>Christiansen</lastname>
    </author>
  </authors>
  <isbn>123-345-678-90</isbn>
</book>

To identify all books by Tom Christiansen, we need simply say:

my @nodes = $doc->findnodes("/books/book/authors/author/
        firstname[text()='Tom']/../
        lastname[text()='Christiansen']/
        ../../../title/text()");

foreach my $node (@nodes) {
  print $node->data, "\n";
}

We find the author with firstname equal to "Tom" and lastname equal to "Christiansen", then back out to the enclosing book element, step into its title element, and get its text child nodes. Another way to write the backing out is "head out until you find the book element again":

my @nodes = $doc->findnodes("/books/book/authors/author/
      firstname[text()='Tom']/../
      lastname[text()='Christiansen']/
      ancestor::book/title/text()");
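A third way to phrase the same query, offered here as a sketch rather than as the one right answer, is to put the whole author test in a predicate on book, so there is no backing out at all. The inlined document below is a cut-down stand-in for the books file: the first entry follows the book shown above, and the second (id 99) is invented purely so that there is a "Tom" who should not match.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Cut-down stand-in for the books file; the second book is hypothetical.
my $xml = <<'XML';
<books>
  <book id="4">
    <title>Perl Cookbook</title>
    <authors>
      <author><firstname>Nathan</firstname><lastname>Torkington</lastname></author>
      <author><firstname>Tom</firstname><lastname>Christiansen</lastname></author>
    </authors>
  </book>
  <book id="99">
    <title>Hypothetical Book</title>
    <authors>
      <author><firstname>Tom</firstname><lastname>Example</lastname></author>
    </authors>
  </book>
</books>
XML

my $doc = XML::LibXML->new->parse_string($xml);

# Predicate form: keep only books having an author whose firstname and
# lastname both match, then step into the title of each such book.
my @by_predicate = map { $_->textContent } $doc->findnodes(
    q{/books/book[authors/author[firstname='Tom' and lastname='Christiansen']]/title}
);

# The ancestor:: form from the text, for comparison.
my @by_ancestor = map { $_->data } $doc->findnodes(
    q{/books/book/authors/author/firstname[text()='Tom']/../lastname[text()='Christiansen']/ancestor::book/title/text()}
);

print "predicate: $_\n" for @by_predicate;
print "ancestor:  $_\n" for @by_ancestor;
```

Both expressions select only "Perl Cookbook" here. The predicate form returns title elements rather than text nodes, which is why it reads them with textContent instead of data.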

XPath is a very powerful system, and we have barely scratched the surface of it. For details on XPath, see XPath and XPointer, by John E. Simpson (O'Reilly), or the W3C specification at http://www.w3.org/TR/xpath. Advanced users should look at the XML::LibXML::XPathContext module (also available from CPAN), which lets you write your own XPath functions in Perl.
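As a rough sketch of what a custom XPath function might look like, the following registers a function with an XPathContext object and calls it from an expression. The function name match_nodes and the inlined document are our own inventions for illustration, and we assume a version of XML::LibXML::XPathContext that provides registerFunction and passes node-set arguments as XML::LibXML::NodeList objects; check the module's documentation for the exact argument and return conventions.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical two-book document, invented for this sketch.
my $xml = <<'XML';
<books>
  <book><title>Perl Cookbook</title></book>
  <book><title>Some Other Book</title></book>
</books>
XML

my $doc = XML::LibXML->new->parse_string($xml);
my $xpc = XML::LibXML::XPathContext->new($doc);

# match_nodes(nodeset, pattern) is a made-up function name: it returns the
# nodes from the node set whose text matches the pattern, ignoring case.
$xpc->registerFunction('match_nodes', sub {
    my ($nodelist, $pattern) = @_;
    my $re = qr/$pattern/i;    # interpolation stringifies the argument
    my $result = XML::LibXML::NodeList->new;
    for my $node ($nodelist->get_nodelist) {
        $result->push($node) if $node->textContent =~ $re;
    }
    return $result;
});

# Call the custom function from an XPath expression.
my @titles = map { $_->textContent }
             $xpc->findnodes('match_nodes(/books/book/title, "perl")');

print "$_\n" for @titles;
```

Doing the match in Perl gives the expression access to regular expressions, which XPath 1.0 itself does not provide.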

22.6.4 See Also

The documentation for the modules XML::LibXML and XML::LibXML::XPathContext; http://www.w3.org/TR/xpath; XPath and XPointer