At the time we were conducting these evaluations, the public discussion of comparing XML data management approaches was poor. Most work was done in the field of semi-structured data, where database management systems and query languages were developed. The first publications that explored the role of relational database systems in storing and retrieving XML documents were presented by researchers from INRIA (Florescu and Kossmann 1999b) and from the University of Wisconsin (Shanmugasundaram et al. 1999).
In the meantime, several research teams have studied this subject and published their results. Their work can be divided into three categories. The first category contains studies in additional improvements of the mapping of XML document structures into data and object models. Researchers in the second category are developing benchmarks for XML data management systems based on existing benchmarks for the common database management systems like relational and object-oriented. The work in the third category focuses on guidelines for benchmarking XML databases to support the decisions of users and of developers of XML data management systems.
Most of the published work deals with the mapping between XML document structures and relations. Due to the extreme difference between the highly nested, hierarchical structure of XML documents and the flat table structure of relational databases, several substantially different approaches have been developed. The object-oriented view of XML documents provided by the Document Object Model suggests storing XML documents in object-oriented databases but also bringing object-relational databases into play.
The so-called edge approach was intensively studied at INRIA by D. Florescu and D. Kossmann (Florescu and Kossmann 1999b) and AT&T Labs by A. Deutsch, M. Fernandez, and D. Suciu (Deutsch, Fernandez, Suciu 1999). It is derived from the graphical representation of semi-structured data (Abiteboul 1997; Buneman 1997), which maps the labels of the data?corresponding to XML tags?to labels of the graph. To do this, the elements in a document get identification numbers and are mapped to the nodes of the tree. The element names appear as labels of the edges between these nodes. The sequence of the elements of one level can be determined by ordering the edges. The attributes that belong to an element and which possess a name and a value are represented by an edge bearing their names from the element node to the value node. The mapping of this tree to relations is represented quite easily now. The edges have a source, a target, and a name, and they are ordered. This information can be stored in a table. The key is composed of the values of the source and the number.
As with our typed DOM implementation, the single large element table can be split into many small tables using the edge approach. The tables receive the name of the element. Here, too, the key is formed from the value of the source and the number. The target of a table entry is either a value or the identification number of a node that occurs in one of the tables as the source. As the identification number is only a part of the key and the table is not uniquely defined, the target field cannot contain a foreign key. For the extraction of a complete document, all tables must be unified into one table, as with the typed DOM approach. The Select statement must determine the equality of target and source values of the edges with the names of nested elements.
A different approach was published by J. Shanmugasundaram et al. (Shanmugasundaram et al. 1999), who first established the relationship between XML elements?defined by the Document Type Definition and applied as XML tags?and relations. The database schema is created from the DTD, which is done by building a so-called DTD graph and constructing an element graph from it. When the schema is created, several inlining techniques that flatten nested elements can be applied to avoid the fragmentation of the XML document over several tables. The authors developed benchmarks to compare the different inlining techniques but were not able to relate to other approaches at this early stage of research.
The idea of J. Shanmugasundaram et al. was adopted and developed by a number of other researchers. We only mention a few of them (Bourret 2002; Kappel et al. 2001; Lee and Chu 2000; Mani et al. 2001; Williams et al. 2001). They all use the typed approach while improving and extending the original.
One publication that does not follow the typed approach is XRel?a path-based approach to storing XML documents in relational databases. This work was done by researchers from the Nara Institute of Science and Technology in Japan together with IBM Japan (see Yoshikawa et al. 2001). The idea is to store the textual content of an XML document in a table together with a link to the position of the text in the document tree. The position is expressed by the path from the root to the element in the tree together with the information where the text starts and ends within the document. Four tables are necessary: the path table that contains all paths of the document tree; the element table that contains a relationship to the path table, the start and end position of the element in the XML document, and a number for ordering issues; the text table that contains a relationship to the path table, the start and end position of the element in the XML document, and the value (i.e., the text itself); the attribute table that contains a relationship to the path table, the start and end position of the element in the XML document, and the value of the attribute. All tables have a docID attribute referring to the XML document. Although this interesting technique is compared to the edge approach of D. Florescu and D. Kossmann, and is very successful in querying the database, no figures were published for storing and reconstructing the complete XML document.
Object-relational databases allow database developers to define their own data types and to use them as scopes for relations. Looking at a relation as a data type, relations can then be defined via relations?that is, tables can be stored in tables. This seems to be the ideal way for storing XML documents, which can then nest elements by embedding tables in tables or defining object types and assigning them to attributes of tables. The problem of mapping element types to tables, attributes, or even object types has encouraged several research teams to study the subject.
A simple method maps element types with complex content to object types and element types with PCDATA content to attributes of the parent element. This may cause a strong reduction of the relational portion of the object-relational database system and raises the question whether an object-oriented database system might be more appropriate. Additionally, the document structure is reflected in the database schema?the nesting of the elements determines the nesting of the tables of linked object types. A modification of the document structure also forces a modification of the database schema and makes previously stored, valid documents invalid.
More complex methods use the relational portion of object-relational database systems more intensively (see Runapongsa and Patel 2002). They apply the mapping described by J. Shanmugasundaram et al. (Shanmugasundaram et al. 1999) and define a new type, the XADT (XML Abstract Data Type). Attributes of type XADT can store a fragment of an XML document. The operations of the XADT provide the programmer with data contained in the attribute value of type XADT.
It seems that additional research is required to prove the contribution of object-relational database techniques to the storage and retrieval of XML documents.
Although several content management systems?especially commercial systems?are based on object-oriented databases, only minor discussions and publications are known. There might be several reasons for this. First, applying object-oriented database techniques is less difficult compared to relational and object-relational techniques. Second, publications on this subject would have to disclose technical details not suited for publication. Finally, the absence of a query language has hindered the development of tests to compare these implementations.
Nevertheless, several technical papers indicate that XML documents are mapped to persistent object trees. Thereby, the nontyped approach is applied. Sometimes, DTD and XML schemas are implemented additionally to support validation of documents. Implementing the nontyped approach, the database schema does not depend on the DTD or the XML schema and allows importation of different DTDs and changes to the imported DTDs. Special indexing techniques like B-trees are used to guarantee fast access to parts of the object tree.
The combination of XML and directory servers is rarely discussed. Only a few projects use directory servers to organize huge quantities of XML documents. As described by K. Polatschek (Polatschek 2000), Tagesspiegel Online stores articles as XML documents in files and organizes these within a directory server. The organization of the documents is stored, not their content. Yet, the latter is done by P. Marrón et al. (Marrón and Lausen 2001) who propose a mapping between a DOM tree and a directory information tree. It corresponds to our approach described earlier.
Even today, performance evaluations of applications processing XML documents in databases are rare. F. Tian et al. (Tian et al. 2000) published performance evaluations of four storage approaches including relational databases, object-oriented data bases, and a file system. They implement the edge approach of Florescu and Kossmann, use an object-oriented database with support by a B-tree implementation and files parsed by an XML parser. The data are created from the DBLP (Database systems and Logic Programming) database of bibliographical entries and have a size of 65MB. The benchmark contains the reconstruction of the original document, selection queries, join queries, and updates. A comparison to other performance tests is difficult due to the nonstandardized benchmark.
Today, interest in performance evaluations of XML data management systems is growing, and the first benchmarks have been published. The XOO7 benchmark (Bressan, Dobbie 2001) is derived from the OO7 benchmark for measuring the performance of object-oriented database management systems. The benchmark contains a DTD and parameters to control the production of XML documents. There are eight queries taken from OO7 and an additional five queries covering special XML aspects. The benchmark has been used to compare a public domain database management system for semi-structured data, a file system, and a commercial object-relational database system.
XMach-1 is a multiuser benchmark developed by T. Böhme and E. Rahm (Böhme and Rahm 2001). Documents and queries are derived from Web applications. The documents contain structured data and structured text. They are produced by assigning values to a number of parameters and by randomly generating text. Thus, the sizes of the documents vary. Four different-sized databases with 10,000 to 10,000,000 documents can be created. Böhme and Rahm propose eight query operations covering reconstruction of a complete document, text search, path traversal, test of efficient index implementation, and application of group-by, join, sort, as well as three manipulation operations including insertion of a document, deletion of a document, and update of an element in a document.
The Michigan benchmark (Runapongsa et al. 2002), which is discussed in Chapter 18 of this book, was developed to support engineers in designing improved XML-processing engines. It is inspired by the Wisconsin benchmark for measuring the performance of relational database management systems. Among the 49 queries are several that hit the core of an XML database implementation. The benchmark contains parameters to generate XML documents of different sizes that are characterized by the number and fanout of the nodes and the depth of the document trees. The benchmark was applied to compare a native XML database management system developed at the University of Michigan and a commercial object-relational database management system.
The XMark project is designed to provide a benchmark suite that allows users and developers to gain insights into the characteristics of their XML repositories. It is described by A. Schmidt et al. (Schmidt, Waas, Kersten, Florescu, Manolescu et al. 2001). The project has produced guidelines for developing XML database benchmarks that comprise ten general challenges for performance analysis (as described in Schmidt, Waas, Kersen, Florescu, Carey et al. 2001). These challenges cover
Bulk loading of XML documents into the database
Reconstruction of the original documents
Path traversal to retrieve parts of the document
Casting of text to data types
Handling of missing elements
Ordering of elements
References between elements
Join operations on values
Construction of large results
Containment and full text search
With this list, Schmidt et al. make an important contribution but do not emphasize the data-centric view of XML documents. Challenge 10 focuses on the aspect that documents can contain large quantities of text that have to be searched for words. Challenges 1 and 2 also contribute to the document-centric view so far as documents are created, stored, and either completely reconstructed or exist in parts. We also have to consider that the standardized query language XQuery is no more than a query language and not a manipulation language. It does not contain operations like insert and update. Therefore, manipulation of XML documents could probably do without this functionality. Updating the document could be done by means of an XML-authoring tool. Thus, reconstruction and repeated storage of XML documents gain special importance that should be supported by efficient storage techniques.