In this section, we will describe how to query the XML repository.
Once we have uploaded some documents into the database, we can start experimenting with queries. For example, the query in Listing 8.68 will return a list of the element names that appear, with their x and y coordinates and namespace prefixes (if any).
SELECT n.x, n.y, e.ns_prefix, e.local_name FROM element_name e, node n WHERE n.doc_id = 1 AND n.node_id = e.node_id;
The results for wml-example.xml (document ID = 1, if we loaded this document first) are shown in Listing 8.69.
x y ns_prefix local_name ---- ---- --------- ---------- 2 49 NULL wml 5 34 NULL card 8 19 NULL p 11 12 NULL img 15 16 NULL br 22 31 NULL p 25 28 NULL a 37 46 NULL card 40 43 NULL p (9 row(s) affected)
A variation on this theme is shown in Listing 8.70, where we return the element names (qualified with a namespace prefix, if any) with a count of their occurrences in the specified document (note the use of a group by expression).
SELECT ISNULL(e.ns_prefix + ':', '') + e.local_name AS qName, COUNT(1) AS occurrences FROM element_name e, node n WHERE n.doc_id = 1 AND n.node_id = e.node_id GROUP BY ISNULL(e.ns_prefix + ':', '') + e.local_name;
The results are shown in Listing 8.71.
qName occurrences -------- ----------- a 1 br 1 card 2 img 1 p 3 wml 1 (6 row(s) affected)
For our simple example, the results are pretty obvious; there are two cards, one link, three paragraphs, and so on. However, for larger documents, the query can be very informative; for example, in an XHTML document, we can quickly determine how many paragraph elements there are, or how many images.
The query in Listing 8.72 will return a list of the attribute names that appear, with their x and y coordinates and namespace prefixes (if any).
SELECT n.x, n.y, a.attribute_id, a.ns_prefix, a.local_name FROM attribute_name a, node n WHERE n.doc_id = 1 AND n.node_id = a.node_id;
The results for our example are shown in Listing 8.73.
x y attribute_id ns_prefix local_name ---- ---- ------------ --------- ---------- 5 34 1 NULL id 5 34 2 NULL title 5 34 3 NULL newcontext 8 19 1 NULL align 11 12 1 NULL src 11 12 2 NULL alt 11 12 3 NULL align 25 28 1 NULL href 37 46 1 NULL id 37 46 2 NULL title 40 43 1 NULL align (11 row(s) affected)
To retrieve the text leaves for this document, run the query shown in Listing 8.74.
SELECT n.x, t.leaf_id, t.leaf_text FROM text_leaf t, node n WHERE n.doc_id = 1 AND n.node_id = t.node_id
The results for our example are shown in Listing 8.75 (note that I've ignored whitespace nodes for brevity?there are 14 whitespace entries in addition to these three rows).
x leaf_id leaf_text ---- ------- --------------------------- 17 1 Content of the first card. 26 1 Next card. 41 1 Content of the second card.
To return the text content of a specific element node (e.g., the second card element), see Listing 8.76.
SELECT n.x, t.leaf_text FROM text_leaf t, node n WHERE n.doc_id = 1 AND n.node_id = t.node_id AND n.x BETWEEN 37 AND 46;
where 37 and 46 are the x and y coordinates of the element node. The results are shown in Listing 8.77 (I've ignored whitespace again).
x leaf_id leaf_text ---- ------- --------------------------- 41 1 Content of the second card.
We can generalize the last query to return, for a given document, all text nodes of elements with a specified name (in this case, all <a> elements), as shown in Listing 8.78.
SELECT n2.x, n2.y, e.local_name, t.leaf_id, t.leaf_text FROM node n1, node n2, element_name e, text_leaf t WHERE n1.doc_id = 1 AND n1.node_id = e.node_id AND e.local_name = 'a' AND n2.doc_id = 1 AND n2.node_id = t.node_id AND n2.x BETWEEN n1.x AND n1.y;
In our example, there is only one <a> element, as shown in Listing 8.79.
x y local_name leaf_id leaf_text --- --- ---------- ------- --------- 26 27 a 1 Next card (1 row(s) affected)
The queries in this section represent just a tiny sample of the queries we can dream up now that we have parsed XML into our relational database.
The reader may have noticed a potential pitfall for applications that need to provide full text searching capabilities for document-centric XML. If we intend to run queries on large text nodes?for example, to find all text nodes that contain a certain word or words?we need to be mindful that the content may well be split over leaves. One solution would be to have the leaf_text column in our text_leaf table of type text rather than a varchar (this would also require some changes to the stored procedures that access this table, since the syntax for handling this type of content will differ). Another solution would be to build an indexing mechanism that creates a full text index at parse time, with references to the nodes that contain the text. Either approach will allow searches to weight hits according to information about the containing nodes (e.g., a word found in text content of a title element might be worth more than the same text found in a paragraph element).
Time to introduce some more stored procedures that may prove useful in a number of applications. The first, rep_s_parent, selects the parent node of a specified node (identified by a document ID and x coordinate). This procedure makes use of the Nested Sets property that the parent of a given node (x, y) has the largest x' in the set defined by x' , x . y' (see Listing 8.80).
CREATE PROCEDURE dbo.rep_s_parent @doc_id t_doc_id, @x_child t_xy_index AS SELECT n1.node_id, n1.x, n1.y, n1.node_type_id FROM node n1 WHERE n1.doc_id = @doc_id AND n1.x = (SELECT MAX(n2.x) FROM node n2 WHERE n2.doc_id = @doc_id AND n2.x < @x_child AND n2.y > @x_child) GO
The next procedure, rep_s_children, shown in Listing 8.81 returns the child nodes of a specified node (again identified by a document ID and x coordinate). Descendants are easy (they are identified by x, x', y; we have already exploited this property in rep_serialise_nodes), but immediate children are a bit harder to determine; the trick here is a correlated subquery that limits the results of the first section of the WHERE clause (the set of all descendants) to just those for which the specified node is the parent.
CREATE PROCEDURE dbo.rep_s_children @doc_id t_doc_id, @x_source t_xy_index AS SELECT n1.node_id, n1.x, n1.y, n1.node_type_id FROM node n1 WHERE n1.doc_id = @doc_id AND n1.x > @x_source AND n1.y < (SELECT y FROM node WHERE doc_id = @doc_id AND x = @x_source) AND @x_source = (SELECT MAX(x) FROM node n2 WHERE n2.doc_id = @doc_id AND n2.x < n1.x AND n2.y > n1.y) ORDER BY n1.x GO
For our example document, we can (for example) ask for the parent node of the <p> element with x = 40, which produces the result shown in Listing 8.82.
rep_s_parent 2, 40 GO node_id x y node_type_id ------- --- --- ------------ 52 37 46 1 (1 row(s) affected)
As expected, this is the x coordinate of the second <card> element.
The W3C's XPath recommendation provides a very useful syntax for querying XML documents. While a full implementation of XPath is beyond the scope of this chapter, we can at least make a start. We will create a stored procedure rep_gen_xpath (see Listing 8.83) that returns an XPath expression for the path between two nodes (identified by their x values: x_source and x_target) in a specified document. There is no error checking, and we need to specify valid x coordinates for two element nodes, or for the document node (x = 1) and an element node.
First we need to select (into a temporary table) the data needed to construct the path expression, and for each element in the path, we may also need a sequence number (if there is more than one element with the same name at the same level in the fragment).
CREATE PROCEDURE dbo.rep_gen_xpath @doc_id t_doc_id, @x_source t_xy_index, @x_target t_xy_index AS BEGIN -- Returns an XPath expression for a given target element node (starting at the source element node) -- Local variables DECLARE @seq_no INTEGER DECLARE @this_element t_leaf_text DECLARE @this_ns_prefix t_leaf_text DECLARE @xpath_expr t_xpath_maxlen -- Build a temporary table SELECT n1.x, n1.y, (SELECT COUNT(1) + 1 FROM node n3, element_name en2 WHERE n3.doc_id = @doc_id AND n3.node_type_id = 1 AND (SELECT MAX(n4.x) FROM node n4 WHERE n4.doc_id = @doc_id AND n4.node_type_id IN (1, 9) AND n4.x < n3.x AND n4.y > n3.x) = (SELECT MAX(n5.x) FROM node n5 WHERE n5.doc_id = @doc_id AND n5.node_type_id IN (1, 9) AND n5.x < n1.x AND n5.y > n1.x) AND n3.node_id = en2.node_id AND en2.local_name = en1.local_name AND n3.x < n1.x) AS seq_no, (SELECT COUNT(1) FROM node n6, element_name en3 WHERE n6.doc_id = @doc_id AND n6.node_type_id = 1 AND (SELECT MAX(n7.x) FROM node n7 WHERE n7.doc_id = @doc_id AND n7.node_type_id IN (1, 9) AND n7.x < n6.x AND n7.y > n6.x) = (SELECT MAX(n8.x) FROM node n8 WHERE n8.doc_id = @doc_id AND n8.node_type_id IN (1, 9) AND n8.x < n1.x AND n8.y > n1.x) AND n6.node_id = en3.node_id AND en3.local_name = en1.local_name) AS total_count, en1.ns_prefix, en1.local_name INTO #results FROM node n1, element_name en1 WHERE n1.doc_id = @doc_id AND n1.x <= @x_target AND n1.y > @x_target AND n1.x >= @x_source AND n1.node_type_id = 1 AND n1.node_id = en1.node_id ORDER BY n1.x -- Null sequence numbers if element is the only one of its type UPDATE #results SET seq_no = NULL WHERE seq_no = 1 AND total_count = 1
Next we create a cursor to walk over the results set, building the XPath expression as we step through the rows, as shown in Listing 8.84.
-- Initialize XPath expr (as a slash if we are starting from -- the root element or the document node) IF @x_source <= (SELECT MIN(x) FROM node WHERE doc_id = @doc_id AND node_type_id = 1) SELECT @xpath_expr = '/' ELSE SELECT @xpath_expr = '' -- Create a cursor to walk through #results in depth order DECLARE xpath_generator CURSOR LOCAL FORWARD_ONLY STATIC READ_ONLY FOR SELECT seq_no, ns_prefix, local_name FROM #results ORDER BY x OPEN xpath_generator -- Loop over the entries FETCH NEXT FROM xpath_generator INTO @seq_no, @this_ns_prefix, @this_element WHILE @@FETCH_STATUS = 0 BEGIN -- Append the element name, seq no (if required) and a slash SELECT @xpath_expr = @xpath_expr + ISNULL((@this_ns_prefix + ':'), '') + @this_element + ISNULL('[' + CONVERT(VARCHAR(10), @seq_no) + ']', '') + '/' -- Continue FETCH NEXT FROM xpath_generator INTO @seq_no, @this_ns_prefix, @this_element END -- Remove the final slash SELECT @xpath_expr = LEFT(@xpath_expr, DATALENGTH(@xpath_expr) - 1) -- Tidy up CLOSE xpath_generator DEALLOCATE xpath_generator DROP TABLE #results -- Return the XPath expression SELECT 'XPath(' + CONVERT(VARCHAR(4), @doc_id) + ', ' + CONVERT(VARCHAR(4), @x_source) + ', ' + CONVERT(VARCHAR(4), @x_target) + ')' AS parameter, RTRIM(@xpath_expr) AS value END GO
We can execute this procedure for our example document. For example, to obtain the path between the document node and the <p> element of the second <card> element would produce the result shown in Listing 8.85.
rep_gen_xpath 2, 1, 40 GO parameter value --------------- -------------- XPath(2, 1, 40) /wml/card/p (1 row(s) affected)
Note the sequence number 2 for the <card> element ("/wml/card/p" would match more than one node; the sequence number allows us to be specific).