Hack 23 Transform Documents with XQuery


XQuery is a new language under development by the W3C that's designed to query collections of XML data. XQuery provides a mechanism to efficiently and easily extract data from XML documents or from any data source that can be viewed as XML, such as relational databases.

XQuery (http://www.w3.org/XML/Query) provides a powerful mechanism to pull XML content from multiple sources and dynamically generate new content using a programmer-friendly declarative language. The XQuery code in Example 2-12 (shakes.xqy) formats in XHTML a list of unique speakers in each act of Shakespeare's play Hamlet. The hamlet.xml file can be found at http://www.oasis-open.org/cover/bosakShakespeare200.html.

Example 2-12. A simple XQuery to search Shakespeare (shakes.xqy)
<html><head/><body>
{
  for $act in doc("hamlet.xml")//ACT
  let $speakers := distinct-values($act//SPEAKER)
  return
    <span>
      <h1>{ $act/TITLE/text() }</h1>
      <ul>
      {
        for $speaker in $speakers
        return <li>{ $speaker }</li>
      }
      </ul>
    </span>
}
</body></html>

This example demonstrates an XQuery FLWOR (pronounced flower) expression. The name comes from the five possible clauses of the expression: for, let, where, order by, and return. Example 2-12 says that for every ACT element appearing at any level in the hamlet.xml file, let the $speakers variable equal the distinct values of all the SPEAKER elements found under that instance of ACT. Then, for every pairing of $act and $speakers, return the $act's TITLE text in an h1 element, followed by a ul list with each speaker in an li element.

XML is a native data type of XQuery and can be used in queries directly without quoted strings, objects, or other tricks. You separate XML elements from enclosed expressions using curly braces. Example 2-13 shows the query result (using ellipses to shorten the output).

Example 2-13. Shakespeare speakers
<html>
  <span>
    <h1>ACT I</h1>
    <ul>
      <li>BERNARDO</li><li>FRANCISCO</li><li>HORATIO</li> ...
    </ul>
  </span><span>
    <h1>ACT II</h1>
    <ul>
      <li>LORD POLONIUS</li><li>REYNALDO</li><li>OPHELIA</li> ...
    </ul>
  </span><span>
    <h1>ACT III</h1>
    <ul>
      <li>KING CLAUDIUS</li><li>ROSENCRANTZ</li><li>GUILDENSTERN</li> ...
    </ul>
  </span><span>
    <h1>ACT IV</h1>
    <ul>
      <li>KING CLAUDIUS</li><li>QUEEN GERTRUDE</li><li>HAMLET</li> ...
    </ul>
  </span><span>
    <h1>ACT V</h1>
    <ul>
      <li>First Clown</li><li>Second Clown</li><li>HAMLET</li> ...
    </ul>
  </span>
</html>
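
Example 2-12 uses only three of the five FLWOR clauses. As a rough sketch of how where and order by fit in, the following variation keeps only the acts with more than ten distinct speakers and lists them from most to fewest. The threshold of ten and the wording of the list items are illustrative assumptions, not part of the original example.

```xquery
(: Sketch only: the threshold (10) and the output wording are hypothetical :)
<ul>
{
  for $act in doc("hamlet.xml")//ACT
  let $speakers := distinct-values($act//SPEAKER)
  where count($speakers) > 10               (: filter tuples :)
  order by count($speakers) descending      (: sort the surviving tuples :)
  return <li>{ $act/TITLE/text() }: { count($speakers) } speakers</li>
}
</ul>
```

The where clause discards tuples before they are sorted, and the return clause runs once for each tuple that remains.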

Example 2-14 (speakers.xqy) demonstrates a more advanced form of the query. This longer query pulls content from multiple source documents and beautifies the speaker names so that they're always printed in standard case (capitalized first letters only).

Example 2-14. An XQuery with multiple inputs and beautified output
declare function local:singleWordCase($name as xs:string) 
    as xs:string {
  if ($name = "") then "" else
  let $first := substring($name, 1, 1)
  let $rest := substring($name, 2)
  let $firstUpper := upper-case($first)
  let $restLower := lower-case($rest)
  return concat($firstUpper, $restLower)
};

declare function local:multiWordCase($name as xs:string)
    as xs:string {
  string-join(
    let $words := tokenize($name, "\s+")
    for $word in $words
    return local:singleWordCase($word)
  , " ")
};

<html><head/><body>
{
  for $file in ("all_well.xml", "dream.xml", "hamlet.xml", "lear.xml",
                "macbeth.xml", "merchant.xml", "much_ado.xml", 
                "r_and_j.xml")
  let $play := doc($file)
  let $speakers := distinct-values($play//SPEAKER)
  order by $play/PLAY/TITLE/text()
  return
    <span>
      <h1>{ $play/PLAY/TITLE/text() }</h1>
      <ul>
      {
        for $speaker in $speakers
        let $speakerPretty := local:multiWordCase($speaker)
        order by $speakerPretty
        return
        <li>{ $speakerPretty }</li>
      }
      </ul>
    </span>
}
</body></html>

The top portion of the query defines two functions to handle the conversion of names to standard case. The first function, singleWordCase(), placed in the special local namespace, takes an xs:string source name and returns an xs:string that is the input parameter converted to standard case. Typing is optional in XQuery. When used, typing is based on XML Schema types (http://www.w3.org/TR/xquery/#id-types).

The first line of the function short-circuits: if $name is the empty string, the expression evaluates to the empty string; otherwise, the else branch is evaluated. Assuming $name is non-empty, we bind $first to its first character and $rest to the remainder, uppercase $first, lowercase $rest, and return the concatenation. Note that the return keyword here does not return a value from the function the way a return statement would; it is simply the final clause of a FLWOR expression. A better name for it might have been do.

XQuery Expressions

XQuery is a functional language consisting entirely of expressions. There are no statements, even though some of the keywords imply statement-like behaviors. To execute a function, the expression within the body gets evaluated and its value returned. Thus, to write a function to double an input value, you simply write:

declare function local:doubler($x) { $x * 2 };

To write a full query that says Hello World, you write the expression:

"Hello World"

That's probably the simplest Hello World program you've ever seen.
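
As a small sketch of how these pieces combine, the query below declares a typed version of the doubler function and then calls it from a FLWOR expression. The type annotations and the input sequence (1, 2, 3) are additions made for illustration.

```xquery
(: Typed variant of the sidebar's function; the annotations are optional :)
declare function local:doubler($x as xs:integer) as xs:integer {
  $x * 2
};

(: The query body is a single expression whose value is the sequence 2, 4, 6 :)
for $n in (1, 2, 3)
return local:doubler($n)
```

Nothing is executed as a statement here: the function body is an expression, and so is the whole query.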


The second function, multiWordCase(), tokenizes the input string on whitespace (\s is the regular-expression pattern for a whitespace character and the + modifier means "one or more"). It then calls singleWordCase() on every word returned by that tokenization, and the string-join() function joins the results, placing a single space between each reformatted word.
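
To see the building blocks in isolation, here is a minimal sketch that calls the standard functions used by the two helpers directly. The sample strings are illustrative; the comments show the value of each expression.

```xquery
(: Each item in this sequence is one standalone function call :)
(
  substring("HAMLET", 1, 1),               (: "H"; positions are 1-based :)
  lower-case(substring("HAMLET", 2)),      (: "amlet" :)
  tokenize("LORD  POLONIUS", "\s+"),       (: "LORD", "POLONIUS" :)
  string-join(("Lord", "Polonius"), " ")   (: "Lord Polonius" :)
)
```

Because the pattern is \s+, the run of two spaces in the third call counts as a single separator, so no empty token appears between the two words.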

The query body executes against eight plays that have been named explicitly. For every $file in the list, we bind the $play variable to the document node for that filename. Then we use distinct-values() to compute the unique speakers in the $play. The order by clause of the FLWOR expression sorts the tuples (ordered sequences of values) coming out of the for and let clauses alphabetically by the play's title text. The return clause is evaluated once for each tuple and prints the play title followed by the list of unique speakers in the play, beautified and sorted alphabetically. The result appears in Example 2-15.

Example 2-15. More Shakespeare speakers
<html>
  <span>
    <h1>A Midsummer Night's Dream</h1>
    <ul>
      <li>All</li><li>Bottom</li><li>Cobweb</li><li>Demetrius</li> ...
    </ul>
  </span><span>
    <h1>All's Well That Ends Well</h1>
    <ul>
      <li>All</li><li>Bertram</li><li>Both</li><li>Both</li>
      <li>Clown</li> ...
    </ul>
  </span><span>
    <h1>Much Ado about Nothing</h1>
    <ul>
      <li>Antonio</li><li>Balthasar</li><li>Beatrice</li>
      <li>Benedick</li> ...
    </ul>
  </span><span>
    <h1>The Merchant of Venice</h1>
    <ul>
      <li>All</li><li>Antonio</li><li>Arragon</li>
      <li>Balthasar</li> ...
    </ul>
  </span><span>
    <h1>The Tragedy of Hamlet, Prince of Denmark</h1>
    <ul>
      <li>All</li><li>Bernardo</li><li>Captain</li>
      <li>Cornelius</li> ...
    </ul>
  </span><span>
    <h1>The Tragedy of King Lear</h1>
    <ul>
      <li>Albany</li><li>Burgundy</li><li>Captain</li>
      <li>Cordelia</li> ...
    </ul>
  </span><span>
    <h1>The Tragedy of Macbeth</h1>
    <ul>
      <li>All</li><li>Angus</li><li>Attendant</li>
      <li>Banquo</li> ...
    </ul>
  </span><span>
    <h1>The Tragedy of Romeo and Juliet</h1>
    <ul>
      <li/><li>Abraham</li><li>Apothecary</li>
      <li>Balthasar</li> ...
    </ul>
  </span>
</html>
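
Example 2-15 lists Both twice under All's Well That Ends Well and includes an empty li element under Romeo and Juliet. One plausible cause, though the source files would need to be checked to confirm it, is that distinct-values() runs on the raw names before they are converted to standard case, so names that differ only in case or stray whitespace survive as separate values. The sketch below normalizes first and removes duplicates afterward. The function name local:standardCase, the use of normalize-space(), and the filter on empty names are all assumptions added for illustration, not part of Example 2-14.

```xquery
(: Hypothetical compact helper: trims and collapses whitespace, then
   capitalizes the first letter of each word and lowercases the rest :)
declare function local:standardCase($name as xs:string) as xs:string {
  string-join(
    for $word in tokenize(normalize-space($name), " ")
    return concat(upper-case(substring($word, 1, 1)),
                  lower-case(substring($word, 2)))
  , " ")
};

let $play := doc("r_and_j.xml")
(: Normalize every name first, then remove duplicates :)
let $speakers := distinct-values(
  for $s in $play//SPEAKER
  return local:standardCase(string($s)))
for $speaker in $speakers
where $speaker ne ""          (: skip SPEAKER elements with no text :)
order by $speaker
return <li>{ $speaker }</li>
```

The trade-off is that the helper now runs once per SPEAKER element rather than once per distinct name, which costs more work on large plays.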

Development of XQuery 1.0 is not yet complete. As of this writing, the W3C specification documents are in Last Call (http://www.w3.org/XML/Query#specs). It looks like there will be a second Last Call before the specifications proceed to candidate recommendation (with two more formal stages after that). The example code shown here was written against the Last Call draft from November 2003.

2.14.1 See Also

  • You can find pointers to the XQuery specifications, online articles, mailing lists, and a community Wiki at http://www.xquery.com

- Jason Hunter