SQL Server provides both a SQL-centric and an XML-centric way to generate XML from relational data. This section describes the SQL-centric approach.
Programmers that are familiar with writing SQL select queries want to easily generate XML from their query result. Unfortunately, such a serialization into XML can be done many different ways. SQL Server therefore provides different modes for the serialization with different levels of complexity and XML-authoring capability. All three modes are provided via a new select clause called FOR XML. The modes are raw, auto, nested, and explicit. The syntax of the FOR XML clause is ([ ] indicates optional, | indicates alternative):
FOR XML (raw | (auto | nested) [, elements] | explicit) [, binary base64] [, xmldata]
All modes basically map rows to elements and column values to attributes. The optional directive elements changes the mapping of all column values to subelements in the auto and nested modes (the explicit mode has column-specific control over the mapping). The optional directive binary base64 is required in the raw and explicit modes if a binary column is returned, such as binary, varbinary, or image. It indicates for all modes that the binary data should be returned inline in the XML document in base64 encoding. In the auto and nested modes, a direct query will be generated if no mode is specified. The optional directive xmldata will generate an inline schema using the XML-Data Reduced schema language as part of the result that describes the structure and data types of the XML query result.
Before we describe the modes in detail, we need to understand some of the general architectural design decisions that are common for all the current FOR XML modes.
The first requirement for the current implementation was to not impact the database system's relational engine. Therefore the serialization process has to happen as a post-processing step on the resulting rowset after the query execution is done. Thus the serialization process is not part of the general query processing (see Figure 7.2). As a consequence to the current implementation, FOR XML query results cannot be assigned to columns but need to be returned directly to the OLEDB provider. It also means that information about the lineage of the data in case of nonprimary key-foreign key joins is lost, since we cannot tell if the master data in the join comes from one or multiple rows.
The second requirement wants to avoid the caching of large XML fragments on the server. In order to avoid such caching, the serialization rules for hierarchical results in the auto, nested, and explicit modes therefore require that rows containing parent data need to be directly followed by their children and children's children data.
Furthermore, due to the open-world assumption of XML (in contrast to the closed-world assumption of relational systems), relational NULL values are serialized by the absence of the instance value. Finally, when a row is mapped to an element, a query returning multiple top rows will generate an XML fragment. To make it into a well-formed XML document, a root element needs to be added via the template mechanism or via the root property of SQLOLEDB.
Later Web releases provide both client-side (in the SQLXML provider) and server-side processing of FOR XML. Since not all information of the auto mode is available on the client side, a new mode called nested replaces the auto mode.
The raw mode is the simplest mode. It performs a so-called canonical mapping where any row of the query result is mapped into an element with the name row and any column value that is not null into an attribute value of the attribute with the column name. For example, the query in Listing 7.1 may return the results in Listing 7.2.
SELECT CustomerID, OrderID FROM Customers LEFT OUTER JOIN Orders ON Customers.CustomerID = Orders.CustomerID FOR XML raw
<row CustomerID="ALFKI" OrderID="10643" /> <row CustomerID="ALFKI" OrderID="10692" /> <row CustomerID="ANATR" OrderID="10308" /> <row CustomerID="FISSA" />
Since the query results do not contain nested rowsets, the raw mode only returns flat XML documents where the hierarchy of the data is lost. However, it works with any SQL query, and the serialization process is very efficient.
The auto and nested modes apply heuristics on the returned rowset to determine nesting of the data. They basically map each row to an element while using the table alias as the element name. Nesting is determined by taking schema-level lineage information provided by the SQL Server query processor into account.
Basically, the left-to-right appearance of a table alias in the SELECT clause determines the nesting. Columns of aliases that are already placed in the hierarchy are grouped together even if they appear interspersed with columns of other aliases. Computed and constant columns are associated with the deepest hierarchy so far encountered (or with the top level of the first alias). These heuristics together with the loss of the instance level lineage make it impossible to provide differently typed sibling elements: The generated hierarchy will be a simple hierarchy that will introduce a new level for every new table alias.
Due to the streaming requirement, the serialization process then looks at each row that arrives from the query processor, opens a new hierarchy level for the level where all the ancestor data is unchanged, previously closing any lower hierarchies of the sibling. The hierarchy serialization of the auto mode together with the lin-eage issue of the first requirement means that multiple, indistinguishable parents will be merged to one parent and that parents without children and parents with children without properties will be represented as parents with children without properties. The serialization process also means that if children are not directly following their parent, a duplicate parent will be generated where it reappears in the rowset stream.
For example, the auto mode query in Listing 7.3 may return the results in Listing 7.4.
SELECT Customers.CustomerID, OrderID FROM Customers LEFT OUTER JOIN Orders ON Customers.CustomerID = Orders.CustomerID ORDER BY Customers.CustomerID FOR XML auto
<Customers CustomerID="ALFKI"> <Orders OrderID="10643" /> <Orders OrderID="10692" /> </Customers> <Customers CustomerID="ANATR"> <Orders OrderID="10308" /> </Customers> <Customers CustomerID="FISSA"> <Orders /> </Customers>
Note the order by clause to group all children with their parent and the empty Orders element.
The explicit mode allows generating arbitrary XML without any of the auto mode limitations. However, the explicit mode expects that the query be explicitly authored to return the rowset in a specific format. This format, commonly known as a universal table format, provides enough information to generate arbitrary XML. In particular, the explicit mode can generate arbitrary tree structured hierarchies, collapse or hoist hierarchical levels independently of the involved tables, and can generate IDREFS type collection attributes.
The general format and approach is best explained with an example. Explaining every detail of the explicit mode is beyond the scope of this chapter. Therefore, the reader is referred to the documentation for the details. The goal is to generate an XML document of the form shown in Listing 7.5.
<Customer cid="ALFKI"> <name>Alfreds Futterkiste</name> <Order oid="O-10643" /> <Order oid="O-10692" /> </Customer> <Customer cid="BOLID"> <name>Bolido Comidas preparadas</name> <Order oid="O-10326" /> </Customer>
Note that while this example still consists only of a simple hierarchy, one Customer column has to be mapped to an attribute and the other one to a subelement, a task that cannot be accomplished by the auto mode. In order to generate XML of this format, the explicit mode expects a universal table of the format shown in Table 7.1.
Each row corresponds to an element (with the exception of IDREFS where each row is an element of the list). The columns Tag and Parent are used to encode the hierarchy levels for each row (if the parent tag is 0 or NULL, the tag is the top level). The column names encode the mapping of the hierarchy levels to the element name; in the given example in Table 7.1, level 1 corresponds to an element of name Customer. The column names also encode the name of the attribute (or subelement) of the values in that column as well as additional information such as whether the value is a subelement or some other information (such as IDREFS, CDATA section, etc.).
The serialization process takes each row, and based on the tag level and the parent tag, determines what level the element is. It uses the information encoded in the column name to only generate the column attributes and subelements for the current level. Thus all other columns can contain NULL. Due to the streaming requirement, children have to immediately follow their parent; thus the key field columns of the parent often contain the key values like in the example in Table 7.1 because they were used to group the children with their parent element. However, the explicit mode cares only about the order and does not care about the parent's key value.
In principle, the explicit mode does not care about the query that generates the universal table format. One could create a temporary table of this format, insert the data, and then perform an explicit mode query over the temporary table that guarantees the right grouping of children and parents to generate the XML. This however, would most likely not perform. Thus, the best way today to generate this format by means of a single query is to issue a selection for each level, union all them together, and use an order by statement to group children under their parents. This basically generates a left outer join where each join partner is placed into its own vertical and horizontal partition of the rowset. Thus the query for generating the universal table and therefore the XML in our example would look like Listing 7.6.
SELECT 1 as Tag, NULL as Parent, CustomerID AS "Customer!1!cid", CompanyName AS "Customer!1!name!element", NULL AS "Order!2!oid" FROM Customers WHERE CustomerID = 'ALFKI' OR CustomerID='BOLID' UNION ALL SELECT 2, 1, Customers.CustomerID, NULL, 'O-'+CAST(Orders.OrderID AS varchar(32)) FROM Customers INNER JOIN Orders ON Customers.CustomerID=Orders.CustomerID WHERE Customers.CustomerID = 'ALFKI' OR Customers.CustomerID='BOLID' ORDER BY "Customer!1!cid" FOR XML explicit
Users who prefer a simpler way to formulate these queries can use the XML view and XPath mechanism (explained shortly) to generate such explicit mode queries under the covers.