19.4 Benchmarking Specification

This work focuses on the use of directory servers as XML data management systems and on a comparison with relational, object-oriented, and native XML database system approaches. The native XML database system includes a proprietary query language to access the database; for each of the other databases, we have to implement the access ourselves. The standardization of an XML query language has only just been completed, and because implementations of the final version are not yet available, we rely on general requirements that the storage of XML documents has to meet. These requirements were published by the WWW Consortium (in Maier 1998) and can serve as the provisional basis for storage procedures until the officially adopted query language is implemented.

XML documents should be stored in a database, and it must be possible to modify them in part. In addition, XML query languages have to fulfill the following requirements:

A document or parts of a document can be searched for, using the structure, the content, or attribute values.
A complete XML document or parts of an XML document can be extracted.
A document can be reduced to parts by omitting subelements.
Parts can be restructured to create a new document.
Elements can be combined to create a document.

Although these statements refer to XML query languages, they make clear the kind of performance that is required of storage technologies.
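To make the five requirements concrete, the following minimal sketch runs each of them against an in-memory document using Python's standard xml.etree.ElementTree module. It is only an illustration: the sample document, its values, and the catalog element are invented here, and none of the benchmarked systems is involved.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names follow the chapter's
# personnel/professor/name/course example, the values are invented.
DOC = """<personnel>
  <professor id="p1"><name>Smith</name><course>Databases</course></professor>
  <professor id="p2"><name>Jones</name><course>Networks</course></professor>
</personnel>"""

root = ET.fromstring(DOC)

# Requirement 1: search by structure, by content, and by attribute value.
by_structure = root.findall("./professor/course")
by_content = [p for p in root.findall("professor")
              if p.findtext("name") == "Jones"]
by_attribute = root.findall("./professor[@id='p1']")

# Requirement 2: extract a part of the document as text.
fragment = ET.tostring(by_attribute[0], encoding="unicode").strip()

# Requirement 3: reduce a document by omitting subelements
# (here every course element is dropped from a copy).
reduced = copy.deepcopy(root)
for prof in reduced.findall("professor"):
    for course in prof.findall("course"):
        prof.remove(course)

# Requirements 4 and 5: restructure and combine parts into a new document.
catalog = ET.Element("catalog")  # hypothetical new root element
for course in by_structure:
    catalog.append(copy.deepcopy(course))
catalog_text = ET.tostring(catalog, encoding="unicode")
```

Each storage technology in the following sections has to offer an equivalent of these operations on persistent data rather than on an in-memory tree.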

19.4.1 Benchmarking a Relational Database

In the case of relational databases, we implemented the typed and the nontyped approach, indexing the parentNode in both cases. To insert and extract the XML documents into and from the database, we used pure SQL. However, some database management system providers offer additional statements to extract hierarchical structures from relations.

XML documents are stored by traversing the DOM tree and inserting every node represented by its identification number and the reference to its parent node into the appropriate table. In the typed approach, each element is inserted in its own element table. In the nontyped approach, the tag name also has to be stored in the unique element table common to all elements.
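The storing step for the nontyped approach can be sketched as follows. This is a minimal illustration, not the benchmarked implementation: it assumes SQLite as the database, a single hypothetical table named node, text nodes stored in the same table, and identification numbers assigned in document order. Attributes are omitted for brevity.

```python
import sqlite3
from xml.dom import minidom

# Invented sample document using the chapter's element names.
DOC = ("<personnel><professor><name>Smith</name>"
       "<course>Databases</course></professor></personnel>")

def store_nontyped(conn, xml_text):
    """Traverse the DOM tree and insert one row per element or text node."""
    # Hypothetical schema: one table for all elements, so the tag name
    # has to be stored alongside the id and the parent reference.
    conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, "
                 "parent INTEGER, tag TEXT, value TEXT)")
    # Index on the parent reference, as described for both approaches.
    conn.execute("CREATE INDEX node_parent ON node (parent)")
    next_id = 0

    def visit(dom_node, parent_id):
        nonlocal next_id
        if dom_node.nodeType == dom_node.ELEMENT_NODE:
            row = (dom_node.tagName, None)
        elif dom_node.nodeType == dom_node.TEXT_NODE:
            row = (None, dom_node.data)
        else:
            return  # comments and other node types are skipped in this sketch
        next_id += 1
        node_id = next_id
        conn.execute(
            "INSERT INTO node (id, parent, tag, value) VALUES (?, ?, ?, ?)",
            (node_id, parent_id, *row))
        for child in dom_node.childNodes:
            visit(child, node_id)

    visit(minidom.parseString(xml_text).documentElement, None)

conn = sqlite3.connect(":memory:")
store_nontyped(conn, DOC)
rows = conn.execute(
    "SELECT id, parent, tag, value FROM node ORDER BY id").fetchall()
```

In the typed approach, the tag column would disappear and each element type would get its own table instead.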

The extracting procedure iteratively collects all parts of an XML document, starting with the personnel entry. In the nontyped approach, it has to find all entries in the element table whose parent node is identical to the identification number of the current element. When an element is selected from the element table, it is also put on a stack that then contains all open tags in the right order. If the procedure does not find a subelement, it writes the end tag and removes the corresponding open tag from the top of the stack. The typed approach would have to search for the subelements of an element in all element tables. Because this is very expensive, we consult the DTD to decide which tables to search. If the procedure has to find the subelements of professor, it searches only the tables name and course. This is the only difference between the typed and the nontyped approach.
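The stack-based extraction can be sketched for the single-table case as follows. The table layout (a hypothetical node table with id, parent, tag, and value columns) and the use of SQLite are assumptions made for this illustration; ordering children by id presumes that identification numbers follow document order.

```python
import sqlite3
from xml.sax.saxutils import escape

def extract(conn, root_id):
    """Rebuild the document text iteratively, keeping open tags on a stack."""
    def children(node_id):
        # All entries whose parent reference equals the given id.
        return conn.execute(
            "SELECT id, tag, value FROM node WHERE parent = ? ORDER BY id",
            (node_id,)).fetchall()

    root_tag = conn.execute(
        "SELECT tag FROM node WHERE id = ?", (root_id,)).fetchone()[0]
    out = [f"<{root_tag}>"]
    # Each stack item is an open tag plus its children not yet written.
    stack = [(root_tag, children(root_id))]
    while stack:
        tag, pending = stack[-1]
        if not pending:
            # No further subelement: write the end tag and pop the open tag.
            out.append(f"</{tag}>")
            stack.pop()
            continue
        node_id, child_tag, value = pending.pop(0)
        if child_tag is None:
            out.append(escape(value))        # text node
        else:
            out.append(f"<{child_tag}>")     # element: becomes the new top
            stack.append((child_tag, children(node_id)))
    return "".join(out)

# Invented sample rows: (id, parent, tag, value).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, "
             "parent INTEGER, tag TEXT, value TEXT)")
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", [
    (1, None, "personnel", None), (2, 1, "professor", None),
    (3, 2, "name", None), (4, 3, None, "Smith"),
    (5, 2, "course", None), (6, 5, None, "Databases"),
])
document = extract(conn, 1)
```

A typed variant would replace the single children query with one query per candidate table, with the candidates taken from the DTD.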

Deleting complete XML documents is similar to extracting them: starting with the root element, all subelements are iteratively identified and deleted.
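A minimal sketch of this deletion, again assuming SQLite and a hypothetical single node table with id and parent columns, might look like this:

```python
import sqlite3

def delete_document(conn, root_id):
    """Delete the root entry and, iteratively, every entry below it."""
    frontier = [root_id]
    deleted = 0
    while frontier:
        node_id = frontier.pop()
        # Identify the subelements before removing the entry itself.
        frontier.extend(row[0] for row in conn.execute(
            "SELECT id FROM node WHERE parent = ?", (node_id,)))
        deleted += conn.execute(
            "DELETE FROM node WHERE id = ?", (node_id,)).rowcount
    return deleted

# Invented sample data: two documents stored in the same table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, "
             "parent INTEGER, tag TEXT)")
conn.executemany("INSERT INTO node VALUES (?, ?, ?)", [
    (1, None, "personnel"), (2, 1, "professor"), (3, 2, "name"),
    (4, None, "personnel"), (5, 4, "professor"),
])
removed = delete_document(conn, 1)
remaining = [row[0] for row in conn.execute("SELECT id FROM node ORDER BY id")]
```

Only the entries reachable from the given root are removed; the second document stays untouched.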

The statements for extracting and replacing parts of XML documents have to be transformed into SQL statements that select database entries and replace them or even insert new entries. The latter could affect the identification number of the following entries. Therefore, the replacing procedure has to adjust the identification numbers of all entries that are sibling nodes of the replaced node.

19.4.2 Benchmarking an Object-Oriented Database

The object-oriented database environment we used provides special classes to make a DOM tree persistent. This is done by traversing the DOM tree that is built by the DOM parser. Every node of the tree has to be converted into a persistent node that is automatically written into the object-oriented database. The persistent DOM implementation uses the nontyped DOM implementation.

To extract a complete document from the database, the DOM tree is restored beginning at the root node, is transformed into text, and is output by using an XML serializer class.

Deleting a complete XML document by deleting the root node appeared to be very fast but was not effective: the objects were not removed from the disk. Deleting the tree node by node was not successful either.

Selecting parts of documents could not be implemented by using the search function, because the search functions on the persistent tree showed very poor performance. Instead, it had to be done by reconstructing the DOM tree and searching for the document parts in that tree.

Replacing document parts was done directly on the persistent DOM tree, even though this includes a search for the part that has to be replaced. The alternative, replacing the part on a reconstruction of the DOM tree, would require making the new tree persistent and therefore deleting the old tree.

19.4.3 Benchmarking a Directory Server

To store an XML document in a directory server, we traverse the DOM tree, create entries, and insert them in the directory information tree.
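The mapping from DOM nodes to directory entries can be illustrated without a running server. In the sketch below a plain dictionary stands in for the directory information tree; the naming attribute nodeId, the base DN, and the attribute names tag, order, text, and attrs are all hypothetical and do not reflect the schema actually used in the benchmark.

```python
from xml.dom import minidom

# Invented sample document using the chapter's element names.
DOC = ('<personnel><professor id="p1"><name>Smith</name>'
       '<course>Databases</course></professor></personnel>')

def to_entries(xml_text, base_dn="ou=xml,o=example"):
    """Traverse the DOM tree and create one entry per element."""
    entries = {}  # DN -> attributes; stands in for the directory tree
    counter = 0

    def visit(element, parent_dn, order):
        nonlocal counter
        counter += 1
        # A child's DN is its own RDN followed by the DN of its parent.
        dn = f"nodeId={counter},{parent_dn}"
        text = "".join(child.data for child in element.childNodes
                       if child.nodeType == child.TEXT_NODE)
        entries[dn] = {
            "tag": element.tagName,
            "order": order,  # position among siblings, kept for later sorting
            "text": text,    # mixed content is flattened in this sketch
            "attrs": dict(element.attributes.items()),
        }
        position = 0
        for child in element.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                position += 1
                visit(child, dn, position)

    visit(minidom.parseString(xml_text).documentElement, base_dn, 1)
    return entries

entries = to_entries(DOC)
```

With a real directory server, each dictionary item would become an add operation, issued parent first so that every entry's superior already exists.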

A complete XML document is extracted by selecting the entries in the directory information tree. The selection method of LDAP allows the return of the whole tree in a result set that is unfortunately not ordered. After storing an XML document in the directory server, the sequence of entries in the result set will be identical to the sequence of elements. But the first update of an entry will no longer preserve that sequence. Ordering the result set containing the whole tree would be very expensive. We decided to select just the set of child nodes of an entry, order this small set, and process it by printing the element name as well as attribute value pairs, and then recursively select the set of child nodes.
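The child-by-child extraction can be sketched as follows. A dictionary again stands in for the directory, and search_one_level is a hypothetical stand-in for a one-level LDAP search that returns entries in no guaranteed order. The entry attributes (tag, order, text, attrs) are assumptions made for this illustration.

```python
from xml.sax.saxutils import escape, quoteattr

BASE = "nodeId=1,ou=xml,o=example"
# Invented entries, deliberately listed out of document order.
ENTRIES = {
    f"nodeId=4,nodeId=2,{BASE}":
        {"tag": "course", "order": 2, "text": "Databases", "attrs": {}},
    f"nodeId=3,nodeId=2,{BASE}":
        {"tag": "name", "order": 1, "text": "Smith", "attrs": {}},
    f"nodeId=2,{BASE}":
        {"tag": "professor", "order": 1, "text": "", "attrs": {"id": "p1"}},
    BASE:
        {"tag": "personnel", "order": 1, "text": "", "attrs": {}},
}

def search_one_level(entries, base_dn):
    """Return the direct children of base_dn, in arbitrary order."""
    suffix = "," + base_dn
    return [(dn, entry) for dn, entry in entries.items()
            if dn.endswith(suffix) and "," not in dn[:-len(suffix)]]

def serialize(entries, dn):
    """Print an entry, then order and recursively process its child set."""
    entry = entries[dn]
    attrs = "".join(f" {name}={quoteattr(value)}"
                    for name, value in entry["attrs"].items())
    parts = [f"<{entry['tag']}{attrs}>", escape(entry["text"])]
    # Sorting only the small set of children avoids ordering the whole tree.
    children = sorted(search_one_level(entries, dn),
                      key=lambda item: item[1]["order"])
    for child_dn, _ in children:
        parts.append(serialize(entries, child_dn))
    parts.append(f"</{entry['tag']}>")
    return "".join(parts)

document = serialize(ENTRIES, BASE)
```

The cost of this design is one search per element, traded against a single expensive sort of the complete result set.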

LDAP also provides a rich set of methods to search for entries based on their distinguished names and on the specification of filters.

Parts of documents have to be replaced by deleting the entries representing the document part and inserting the new entries. This will cause an update of entries that follow in the sequence of the replaced part if the set of replacing entries is larger. The updating is restricted to entries in the subtree of the node that represents the replaced part.

19.4.4 Benchmarking a Native XML Database

The XML database provides methods for inserting an XML document into the database and extracting the complete document from the database. It also implements an XML query and updating language that allows us to express the statements of our benchmarking specification.
