3.5 Query Language Extensions


Let's take a quick look at some useful extensions eXist adds to standard XPath to enable you to efficiently use the database.

3.5.1 Specifying the Input Document Set

Since a database may contain an unlimited set of documents, eXist's query engine needs two additional functions to determine the set of documents against which an expression will be evaluated: document() and collection(). document() accepts a single document name, a list of document names, or a wildcard as its argument. The wildcard (*) selects all documents in the database. The collection() function specifies the collection whose documents are to be included in query evaluation. For example:

collection('/db/shakespeare')//SCENE[ SPEECH[ SPEAKER='JULIET']]/TITLE

The root collection of the database is always called /db. By default, documents found in subcollections below the specified collection are included, so the preceding expression does not need to spell out the full path to /db/shakespeare/plays.

3.5.2 Querying Text

The XPath standard defines only a few limited functions for searching for a given string inside the character content of a node, which is a weak point if you have to search through documents containing larger sections of text. For many types of documents, the standard functions will not yield satisfying results. For example, you might remember having read something about "XML" and "databases" in some chapter of a book, but you may not be sure exactly where it was. Using standard XPath, you could try a query like:

//chapter[ contains(., 'XML') and contains(., 'databases')]

Still, you can't be sure to find all matches: for example, "databases" might have been written with a capital letter at the start of a sentence. Also, query execution will probably be quite slow for large sets of documents, because the XPath engine has to scan the entire character content of all chapters and their descendant nodes in all books to find matches.
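The case-sensitivity pitfall is easy to reproduce outside XPath. The following Python sketch is only an analogy: the contains helper is a hypothetical stand-in that mimics the substring semantics of XPath's contains() function, and the sample chapter text is invented for illustration.

```python
def contains(text: str, needle: str) -> bool:
    """Mimic XPath contains(): a case-sensitive substring test."""
    return needle in text


# Invented sample text; "Databases" starts a sentence, so it is capitalized.
chapter = "Databases can store XML natively. Some are tuned for large texts."

# The query from the text: both substrings must occur, with exact case.
strict_match = contains(chapter, "XML") and contains(chapter, "databases")

# Lowercasing both sides finds the chapter, but every character is still scanned.
lowered = chapter.lower()
relaxed_match = contains(lowered, "xml") and contains(lowered, "databases")
```

Here strict_match comes out false even though the chapter clearly discusses both topics, while relaxed_match is true. Normalizing case fixes the missed match but does nothing about the cost of scanning all of the text.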

The solution: eXist offers two additional operators and several extension functions that provide efficient, index-based access to the full-text content of nodes. For example, you might remember that the words "XML" and "database" were mentioned near each other but not in the same paragraph. With eXist, you could query:

//section[ near(., 'XML database', 50)]

This query will return all sections containing both keywords in the given order and with fewer than 50 words between them. Besides making query formulation easier and in many cases more exact, using eXist's full-text search extensions instead of standard XPath expressions yields much better query performance. The query engine will process the previous query based entirely on indexing information. We will have a closer look at how this works later.
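To make the ordering and distance conditions concrete, here is a minimal Python sketch of the matching rule as described above. It is not eXist's implementation, which works from its index rather than from raw text. The tokenize and near helpers, the simple tokenizer, and the reading of "words between them" as the gap between neighbouring terms are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a deliberate simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def near(text, terms, max_distance):
    """True if the terms occur in order, with fewer than max_distance
    words between each pair of neighbouring terms."""
    tokens = tokenize(text)
    wanted = tokenize(terms)
    if not wanted:
        return False

    def search(term_index, previous_pos):
        # All terms placed: the text matches.
        if term_index == len(wanted):
            return True
        for pos, token in enumerate(tokens):
            if token != wanted[term_index]:
                continue
            if previous_pos is not None:
                gap = pos - previous_pos - 1  # words strictly between the two terms
                if pos <= previous_pos or gap >= max_distance:
                    continue
            if search(term_index + 1, pos):
                return True
        return False

    return search(0, None)


# Invented sample text: six words separate "XML" and "database".
section = "XML is a markup language. A native database stores it efficiently."
in_order = near(section, "XML database", 50)
wrong_order = near(section, "database XML", 50)
```

With this sample, in_order is true and wrong_order is false, because the second call asks for "database" to come before "XML".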

In cases where the order and distance of search terms are not important, eXist offers two other operators for simple keyword queries. The following XPath expression will select the scene in the cavern from Shakespeare's The Tragedy of Macbeth:

//SCENE[ SPEECH[ SPEAKER &= 'witch' and LINE &= 'fenny snake']]

&= is a special text search operator. It selects context nodes containing all of the space-separated terms in the argument on the right. To find nodes containing any of the terms, the |= operator is provided. For example, we may use the subexpression LINE |= 'fenny snake' in the preceding query to get all lines containing either "fenny" or "snake".
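The difference between the two operators comes down to "all terms" versus "any term". The Python sketch below models only that distinction. The keywords, contains_all, and contains_any helpers are hypothetical names, the tokenizer is far simpler than eXist's, and the line "A snake in the grass" is invented for contrast.

```python
import re


def keywords(text):
    """Return the set of lowercase word tokens in the text (simplified tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def contains_all(text, terms):
    """Rough analogue of &=: every space-separated term must occur."""
    return keywords(terms) <= keywords(text)


def contains_any(text, terms):
    """Rough analogue of |=: at least one of the terms must occur."""
    return bool(keywords(terms) & keywords(text))


lines = [
    "Fillet of a fenny snake,",
    "In the cauldron boil and bake;",
    "A snake in the grass",  # invented line, matches only one term
]

all_hits = [line for line in lines if contains_all(line, "fenny snake")]
any_hits = [line for line in lines if contains_any(line, "fenny snake")]
```

Only the first line ends up in all_hits, whereas any_hits also picks up the invented third line, which contains "snake" but not "fenny".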

Note that eXist's default keyword tokenizer will treat dates, floating point numbers, and any character sequence containing at least one digit as a single keyword. The operators accept simple wildcards, for example, 'witch*' will select 'witch' as well as 'witches'. To match more complex string patterns, regular expression syntax is supported through the match, match-all, and match-any functions. For example, to find all lines containing "live", "lives", as well as "life", you may use the following expression:

//SPEECH[match-all(LINE, 'li[fv]e[s]')]

The match-all and match-any functions search on keywords in much the same way as the &= and |= operators, while match corresponds to the contains function. More information on this topic is available in the eXist documentation.

3.5.3 Outstanding Features

eXist's XPath query engine currently implements major parts of the standard, though at the time of writing it is not yet complete. Only the abbreviated XPath syntax is supported, so axis specifiers such as preceding-sibling and following-sibling, for example, have yet to be implemented. However, the existing functionality covers the most commonly needed XPath expressions. Work is under way in the project to rewrite the query processor. In the long run, we would like to replace XPath with an XQuery-based implementation.


