8.7 Querying the Repository

  Previous section   Next section

In this section, we will describe how to query the XML repository.

8.7.1 Ad Hoc SQL Queries

Once we have uploaded some documents into the database, we can start experimenting with queries. For example, the query in Listing 8.68 will return a list of the element names that appear, with their x and y coordinates and namespace prefixes (if any).

Listing 8.68 SQL Query
SELECT n.x, n.y, e.ns_prefix, e.local_name
  FROM element_name e, node n
 WHERE n.doc_id = 1 AND n.node_id = e.node_id;

The results for wml-example.xml (document ID = 1, if we loaded this document first) are shown in Listing 8.69.

Listing 8.69 Results for wml-example.xml
x    y    ns_prefix local_name
---- ---- --------- ----------
2    49   NULL      wml
5    34   NULL      card
8    19   NULL      p
11   12   NULL      img
15   16   NULL      br
22   31   NULL      p
25   28   NULL      a
37   46   NULL      card
40   43   NULL      p
(9 row(s) affected)

A variation on this theme is shown in Listing 8.70, where we return the element names (qualified with a namespace prefix, if any) with a count of their occurrences in the specified document (note the use of a group by expression).

Listing 8.70 Modified Query
SELECT ISNULL(e.ns_prefix + ':', '') + e.local_name AS qName,
       COUNT(1) AS occurrences
  FROM element_name e, node n
 WHERE n.doc_id = 1 AND n.node_id = e.node_id
 GROUP BY ISNULL(e.ns_prefix + ':', '') + e.local_name;

The results are shown in Listing 8.71.

Listing 8.71 Results for Modified Query
qName    occurrences
-------- -----------
a        1
br       1
card     2
img      1
p        3
wml      1
(6 row(s) affected)

For our simple example, the results are pretty obvious; there are two cards, one link, three paragraphs, and so on. However, for larger documents, the query can be very informative; for example, in an XHTML document, we can quickly determine how many paragraph elements there are, or how many images.

The query in Listing 8.72 will return a list of the attribute names that appear, with their x and y coordinates and namespace prefixes (if any).

Listing 8.72 More Informative Query
SELECT n.x, n.y, a.attribute_id, a.ns_prefix, a.local_name
  FROM attribute_name a, node n
 WHERE n.doc_id = 1 AND n.node_id = a.node_id;

The results for our example are shown in Listing 8.73.

Listing 8.73 Results for More Informative Query
x    y    attribute_id ns_prefix local_name
---- ---- ------------ --------- ----------
5    34   1            NULL      id
5    34   2            NULL      title
5    34   3            NULL      newcontext
8    19   1            NULL      align
11   12   1            NULL      src
11   12   2            NULL      alt
11   12   3            NULL      align
25   28   1            NULL      href
37   46   1            NULL      id
37   46   2            NULL      title
40   43   1            NULL      align
(11 row(s) affected)

To retrieve the text leaves for this document, run the query shown in Listing 8.74.

Listing 8.74 Text Leaves Query
SELECT n.x, t.leaf_id, t.leaf_text
  FROM text_leaf t, node n
 WHERE n.doc_id = 1 AND n.node_id = t.node_id

The results for our example are shown in Listing 8.75 (note that I've ignored whitespace nodes for brevity?there are 14 whitespace entries in addition to these three rows).

Listing 8.75 Results for Text Leaves Query
x    leaf_id leaf_text
---- ------- ---------------------------
17   1       Content of the first card.
26   1       Next card.
41   1       Content of the second card.

To return the text content of a specific element node (e.g., the second card element), see Listing 8.76.

Listing 8.76 Text Content Query
SELECT n.x, t.leaf_text
  FROM text_leaf t, node n
 WHERE n.doc_id = 1
   AND n.node_id = t.node_id
 AND n.x BETWEEN 37 AND 46;

where 37 and 46 are the x and y coordinates of the element node. The results are shown in Listing 8.77 (I've ignored whitespace again).

Listing 8.77 Results for Text Content Query
x    leaf_id leaf_text
---- ------- ---------------------------
41   1       Content of the second card.

We can generalize the last query to return, for a given document, all text nodes of elements with a specified name (in this case, all <a> elements), as shown in Listing 8.78.

Listing 8.78 Generalized Query
SELECT n2.x, n2.y, e.local_name, t.leaf_id, t.leaf_text
  FROM node n1, node n2, element_name e, text_leaf t
 WHERE n1.doc_id = 1
   AND n1.node_id = e.node_id
   AND e.local_name = 'a'
   AND n2.doc_id = 1
   AND n2.node_id = t.node_id
   AND n2.x BETWEEN n1.x AND n1.y;

In our example, there is only one <a> element, as shown in Listing 8.79.

Listing 8.79 Generalized Query Result
x   y   local_name leaf_id leaf_text
--- --- ---------- ------- ---------
26  27  a          1       Next card
(1 row(s) affected)

The queries in this section represent just a tiny sample of the queries we can dream up now that we have parsed XML into our relational database.

8.7.2 Searching for Text

The reader may have noticed a potential pitfall for applications that need to provide full text searching capabilities for document-centric XML. If we intend to run queries on large text nodes?for example, to find all text nodes that contain a certain word or words?we need to be mindful that the content may well be split over leaves. One solution would be to have the leaf_text column in our text_leaf table of type text rather than a varchar (this would also require some changes to the stored procedures that access this table, since the syntax for handling this type of content will differ). Another solution would be to build an indexing mechanism that creates a full text index at parse time, with references to the nodes that contain the text. Either approach will allow searches to weight hits according to information about the containing nodes (e.g., a word found in text content of a title element might be worth more than the same text found in a paragraph element).

8.7.3 Some More Stored Procedures

Time to introduce some more stored procedures that may prove useful in a number of applications. The first, rep_s_parent, selects the parent node of a specified node (identified by a document ID and x coordinate). This procedure makes use of the Nested Sets property that the parent of a given node (x, y) has the largest x' in the set defined by x' , x . y' (see Listing 8.80).

Listing 8.80 Create rep_s_parent Procedure
CREATE PROCEDURE dbo.rep_s_parent
  @doc_id  t_doc_id,
  @x_child t_xy_index AS
 SELECT n1.node_id,
   FROM node n1
  WHERE n1.doc_id = @doc_id
    AND n1.x = (SELECT MAX(n2.x)
                  FROM node n2
                 WHERE n2.doc_id = @doc_id
                   AND n2.x < @x_child
                   AND n2.y > @x_child)

The next procedure, rep_s_children, shown in Listing 8.81 returns the child nodes of a specified node (again identified by a document ID and x coordinate). Descendants are easy (they are identified by x, x', y; we have already exploited this property in rep_serialise_nodes), but immediate children are a bit harder to determine; the trick here is a correlated subquery that limits the results of the first section of the WHERE clause (the set of all descendants) to just those for which the specified node is the parent.

Listing 8.81 Create rep_s_children Procedure
CREATE PROCEDURE dbo.rep_s_children
  @doc_id   t_doc_id,
  @x_source t_xy_index AS
 SELECT n1.node_id,
   FROM node n1
  WHERE n1.doc_id = @doc_id
    AND n1.x > @x_source
    AND n1.y < (SELECT y
                  FROM node
                 WHERE doc_id = @doc_id
                   AND x = @x_source)
    AND @x_source  = (SELECT MAX(x)
                        FROM node n2
                       WHERE n2.doc_id = @doc_id
                         AND n2.x < n1.x
                         AND n2.y > n1.y)
  ORDER BY n1.x

For our example document, we can (for example) ask for the parent node of the <p> element with x = 40, which produces the result shown in Listing 8.82.

Listing 8.82 Result for rep_s_parent with Two Parameters
rep_s_parent 2, 40
node_id x   y   node_type_id
------- --- --- ------------
52      37  46  1
(1 row(s) affected)

As expected, this is the x coordinate of the second <card> element.

8.7.4 Generating XPath Expressions

The W3C's XPath recommendation provides a very useful syntax for querying XML documents. While a full implementation of XPath is beyond the scope of this chapter, we can at least make a start. We will create a stored procedure rep_gen_xpath (see Listing 8.83) that returns an XPath expression for the path between two nodes (identified by their x values: x_source and x_target) in a specified document. There is no error checking, and we need to specify valid x coordinates for two element nodes, or for the document node (x = 1) and an element node.

First we need to select (into a temporary table) the data needed to construct the path expression, and for each element in the path, we may also need a sequence number (if there is more than one element with the same name at the same level in the fragment).

Listing 8.83 Create rep_gen_xpath Procedure
[View full width]
CREATE PROCEDURE dbo.rep_gen_xpath
                 @doc_id   t_doc_id,
                 @x_source t_xy_index,
                 @x_target t_xy_index
-- Returns an XPath expression for a given target element node (starting at the source 
graphics/ccc.gifelement node)
 -- Local variables
 DECLARE @seq_no         INTEGER
 DECLARE @this_element   t_leaf_text
 DECLARE @this_ns_prefix t_leaf_text
 DECLARE @xpath_expr     t_xpath_maxlen
 -- Build a temporary table
 SELECT n1.x,
        (SELECT COUNT(1) + 1
           FROM node n3,
                element_name en2
          WHERE n3.doc_id = @doc_id
            AND n3.node_type_id = 1
            AND (SELECT MAX(n4.x)
                   FROM node n4
                  WHERE n4.doc_id = @doc_id
                    AND n4.node_type_id IN (1, 9)
                    AND n4.x < n3.x
                    AND n4.y > n3.x) = (SELECT MAX(n5.x)
                                          FROM node n5
                                         WHERE n5.doc_id = @doc_id
                                           AND n5.node_type_id IN (1, 9)
                                           AND n5.x < n1.x
                                           AND n5.y > n1.x)
            AND n3.node_id = en2.node_id
            AND en2.local_name = en1.local_name
            AND n3.x < n1.x) AS seq_no,
        (SELECT COUNT(1)
           FROM node n6,
                element_name en3
          WHERE n6.doc_id = @doc_id
            AND n6.node_type_id = 1
            AND (SELECT MAX(n7.x)
                   FROM node n7
                  WHERE n7.doc_id = @doc_id
                    AND n7.node_type_id IN (1, 9)
                    AND n7.x < n6.x
                    AND n7.y > n6.x) = (SELECT MAX(n8.x)
                                          FROM node n8
                                         WHERE n8.doc_id = @doc_id
                                           AND n8.node_type_id IN (1, 9)
                                           AND n8.x < n1.x
                                           AND n8.y > n1.x)
            AND n6.node_id = en3.node_id
            AND en3.local_name = en1.local_name) AS total_count,
   INTO #results
   FROM node n1,
        element_name en1
  WHERE n1.doc_id = @doc_id
    AND n1.x <= @x_target
    AND n1.y > @x_target
    AND n1.x >= @x_source
    AND n1.node_type_id = 1
    AND n1.node_id = en1.node_id
  ORDER BY n1.x
 -- Null sequence numbers if element is the only one of its type
 UPDATE #results
    SET seq_no = NULL
  WHERE seq_no = 1
    AND total_count = 1

Next we create a cursor to walk over the results set, building the XPath expression as we step through the rows, as shown in Listing 8.84.

Listing 8.84 Cursor to Walk the Results Set
 -- Initialize XPath expr (as a slash if we are starting from
 -- the root element or the document node)
 IF @x_source <= (SELECT MIN(x) FROM node
                   WHERE doc_id = @doc_id AND node_type_id = 1)
   SELECT @xpath_expr = '/'
   SELECT @xpath_expr = ''
-- Create a cursor to walk through #results in depth order
 SELECT seq_no, ns_prefix, local_name FROM #results ORDER BY x
 OPEN xpath_generator
 -- Loop over the entries
 FETCH NEXT FROM xpath_generator
  INTO @seq_no, @this_ns_prefix, @this_element
  -- Append the element name, seq no (if required) and a slash
  SELECT @xpath_expr = @xpath_expr
     + ISNULL((@this_ns_prefix + ':'), '')
     + @this_element + ISNULL('[' + CONVERT(VARCHAR(10), @seq_no)
     + ']', '') + '/'
  -- Continue
  FETCH NEXT FROM xpath_generator
   INTO @seq_no, @this_ns_prefix, @this_element
 -- Remove the final slash
 SELECT @xpath_expr = LEFT(@xpath_expr, DATALENGTH(@xpath_expr) - 1)
 -- Tidy up
 CLOSE xpath_generator
 DEALLOCATE xpath_generator
 DROP TABLE #results
 -- Return the XPath expression
 SELECT 'XPath(' + CONVERT(VARCHAR(4), @doc_id) + ', '
        + CONVERT(VARCHAR(4), @x_source) + ', '
        + CONVERT(VARCHAR(4), @x_target) + ')' AS parameter,
        RTRIM(@xpath_expr) AS value

We can execute this procedure for our example document. For example, to obtain the path between the document node and the <p> element of the second <card> element would produce the result shown in Listing 8.85.

Listing 8.85 Result for rep_gen_xpath with Three Parameters
rep_gen_xpath 2, 1, 40
parameter        value
---------------  --------------
XPath(2, 1, 40)  /wml/card[2]/p
(1 row(s) affected)

Note the sequence number 2 for the <card> element ("/wml/card/p" would match more than one node; the sequence number allows us to be specific).


Part IV: Applications of XML