18.3 Benchmark Data Set

In this section, we first discuss characteristics of XML data sets that can have a significant impact on the performance of query operations. Then we present the schema and the generation algorithms for the benchmark data.

18.3.1 A Discussion of the Data Characteristics

In the relational paradigm, the primary data characteristics are the selectivity of attributes (important for simple selection operations) and the join selectivity (important for join operations). In the XML paradigm, several complicating characteristics must be considered as discussed in the sections that follow.

Depth and Fanout

Depth and fanout are two structural parameters important to tree-structured data. The depth of the data tree can have a significant performance impact when we are computing containment relationships that include an indirect containment between ancestor and descendant and a direct containment between parent and child. It is possible to have multiple nodes at different levels satisfying the ancestor and the descendant predicates. Similarly, the fanout of the node tree can affect the way in which the DBMS stores the data and answers queries that are based on selecting children in a specific order (for example, selecting the last child).

One potential way of testing fanout and depth is to generate a number of distinct data sets with different values for each of these parameters and then run queries against each data set. The drawback of this approach is that the large number of data sets makes the benchmark harder to run and understand. In this chapter, our approach is to create a base benchmark data set of a depth of 16. Then, using a "level" attribute of an element, we can restrict the scope of the query to data sets of certain depth, thereby quantifying the impact of the depth of the data tree.

To study the impact of fanout, we generate the data set in the following way. There are 16 levels in the tree, and each level has a fanout of 2, except levels 5, 6, 7, and 8. Levels 5, 6, and 7 have a fanout of 13, whereas level 8 has a fanout of 1/13 (at level 8 every thirteenth node has a single child). This variation in fanout is designed to permit queries that measure the effect of the fanout factor. For instance, the number of nodes is 2,704 for nodes at levels 7 and 9. Nodes at level 7 have a fanout of 13, whereas nodes at level 9 have a fanout of 2. Queries against these two levels can be used to measure the impact of fanout. The distribution of nodes is shown in Table 18.1.

Data Set Granularity

To keep the benchmark simple, we chose a single large document tree as the default data set. If it is important to understand the effect of document granularity, one can modify the benchmark data set to treat each node at a given level as the root of a distinct document. One can compare the performance of queries on this modified data set against queries on the original data set.

Scaling

A good benchmark needs to be able to scale in order to measure the performance of databases on a variety of platforms. In the relational model, scaling a benchmark data set is easy?we simply increase the number of tuples. However, with XML, there are many scaling options, such as increasing the number of nodes, depth, or fanout. We would like to isolate the effect of the number of nodes from the effects due to other structural changes, such as depth and fanout. We achieve this by keeping the tree depth constant for all scaled versions of the data set and changing the number of fanouts of nodes at only a few levels.

The default data set, which was described earlier, is called DSx1. This data set has about 728K nodes, arranged in a tree of a depth of 16 and a fanout of 2 for all levels except levels 5, 6, 7, and 8, which have fanouts of 13, 13, 13, 1/13, respectively. From this data set we generate two additional "scaled-up" data sets, called DSx10 and DSx100 such that the numbers of nodes in these data sets are approximated 10 and 100 times the number of nodes in the base data set, respectively. We achieve this scaling factor by varying the fanout of the nodes at levels 5?8. For the data set DSx10 levels 5?7 have a fanout of 39, whereas level 8 has a fanout of 1/39. For the data set DSx100 levels 5?7 have a fanout of 111, whereas level 8 has a fanout of 1/111. The total number of nodes in the data sets DSx10 and DSx100 is 7,180K and 72,350K, respectively (which translates into a scale factor of 9.9x and 99.4x, respectively).

In the design of the benchmark data set, we deliberately keep the fanout of the bottom few levels of the tree constant. This design implies that the percentage of nodes in the lower levels of the tree (levels 9?16) is nearly constant across all the data sets. This allows us to easily express queries that focus on a specified percentage of the total number of nodes in the database. For example, to select approximately 1/16 of all the nodes, irrespective of the scale factor, we use the predicate aLevel = 13.

18.3.2 Schema of Benchmark Data

The construction of the benchmark data is centered on the element type BaseType. Each BaseType element has the following attributes:

aUnique1: A unique integer generated by traversing the entire data tree in a breadth-first manner. This attribute also serves as the element identifier.
aUnique2: A unique integer generated randomly.
aLevel: An integer set to store the level of the node.
aFour: An integer set to aUnique2 mod 4.
aSixteen: An integer set to aUnique1 + aUnique2 mod 16. Note that this attribute is set to aUnique1 + aUnique2 mod 16 instead of aUnique2 mod 16 to avoid a correlation between the predicate on this attribute and one on either aFour or aSixtyFour.
aSixtyFour: An integer set to aUnique2 mod 64.
aString: A string approximately 32 bytes in length.

The content of each BaseType element is a long string that is approximately 512 bytes in length. The generation of the element content and the string attribute aString is described further below.

In addition to the attributes listed above, each BaseType element has two sets of subelements. The first is of type BaseType. The number of repetitions of this subelement is determined by the fanout of the parent element, as described in Table 18.1. The second subelement is an OccasionalType and can occur either 0 or 1 time. The presence of the OccasionalType element is determined by the value of the attribute aSixtyFour of the parent element. A BaseType element has a nested (leaf) element of type OccasionalType if the aSixtyFour attribute has the value 0. An OccasionalType element has content that is identical to the content of the parent but has only one attribute, aRef. The OccasionalType element refers to the BaseType node with aUnique1 value equal to the parent's aUnique1-11 (the reference is achieved by assigning this value to aRef attribute). In the case where no BaseType element has the parent's aUnique1-11 value (e.g., top few nodes in the tree), the OccasionalType element refers to the root node of the tree.

The XML Schema specification of the benchmark data is shown in Listing 18.1.

Listing 18.1 Benchmark Specification in XML Schema

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  targetNamespace="http://www.eecs.umich.edu/db/mbench/bm.xsd"
  xmlns="http://www.eecs.umich.edu/db/mbench/bm.xsd"
  elementFormDefault="qualified">
 <xsd:complexType name="BaseType" mixed="true">
  <xsd:sequence>
   <xsd:element name="eNest" type="BaseType" minOccurs="0">
    <xsd:key name="aU1PK">
     <xsd:selector xpath=".//eNest"/>
     <xsd:field xpath="@ aUnique1"/>
    </xsd:key>
    <xsd:unique name="aU2">
     <xsd:selector xpath=".//eNest"/>
     <xsd:field xpath="@aUnique2"/>
    </xsd:unique>
   </xsd:element>
   <xsd:element name="eOccasional" type="OccasionalType"
        minOccurs="0" maxOccurs="1">
    <xsd:keyref name="aU1FK" refer="aU1PK">
    <xsd:selector xpath="../eOccasional"/>
    <xsd:field xpath="@aRef"/>
    </xsd:keyref>
   </xsd:element>
  </xsd:sequence>
  <xsd:attributeGroup ref="BaseTypeAttrs"/>
 </xsd:complexType>
 <xsd:complexType name="OccassionalType">
  <xsd:simpleContent>
   <xsd:extension base="xsd:string">
    <xsd:attribute name="aRef" type="xsd:integer" use="required"/>
   </xsd:extension>
  </xsd:simpleContent>
 </xsd:complexType>
 <xsd:attributeGroup name="BaseTypeAttrs">
  <xsd:attribute name="aUnique1" type="xsd:integer" use="required"/>
  <xsd:attribute name="aUnique2" type="xsd:integer" use="required"/>
  <xsd:attribute name="aLevel" type="xsd:integer" use="required"/>
  <xsd:attribute name="aFour" type="xsd:integer" use="required"/>
  <xsd:attribute name="aSixteen" type="xsd:integer" use="required"/>
  <xsd:attribute name="aSixtyFour" type="xsd:integer" use="required"/>
  <xsd:attribute name="aString" type="xsd:string" use="required"/ >
 </xsd:attributeGroup>
</xsd:schema>

18.3.3 Generating the String Attributes and Element Content

The element content of each BaseType element is a long string. Since this string is meant to simulate a piece of text in a natural language, it is not appropriate to generate this string from a uniform distribution. Selecting pieces of text from real sources, however, involves many difficulties, such as how to maintain roughly constant size for each string, how to avoid idiosyncrasies associated with the specific source, and how to generate more strings as required for a scaled benchmark. Moreover, we would like to have benchmark results applicable to a wide variety of languages and domain vocabularies.

To obtain the string value that has a distribution similar to the distribution of a natural language text, we generate these long strings synthetically, in a carefully stylized manner. We begin by creating a pool of 2¹⁶ -1 (over sixty thousands) synthetic words. This is roughly twice the number of entries in the second edition of the Oxford English Dictionary. However, half the words that are used in the benchmark are "derived" words, produced by appending "ing" to the end of the word. The words are divided into 16 buckets, with exponentially growing bucket occupancy. Bucket i has 2^i-1 words. For example, the first bucket has only one word, the second has two words, the third has four words, and so on. The words are not meaningful in any language but simply contain information about the bucket from which they are drawn, and the word number in the bucket. For example, "15twentynineB14" indicates that this is the 1,529th word from the fourteenth bucket. To keep the size of the vocabulary in the last bucket at roughly 30,000 words, words in the last bucket are derived from words in the other buckets by adding the suffix "ing" (to get exactly 2¹⁵ words in the sixteenth bucket, we add the dummy word "oneB0ing").

The value of the long string is generated from the template shown in Listing 18.2, where "PickWord" is actually a placeholder for a word picked from the word pool described above. To pick a word for "PickWord", a bucket is chosen, with each bucket equally likely, and then a word is picked from the chosen bucket, with each word equally likely. Thus, we obtain a discrete Zipf distribution of parameter roughly 1. We use the Zipf distribution since it seems to reflect word occurrence probabilities accurately in a wide variety of situations. The value of aString attri-bute is simply the first line of the long string that is stored as the element content. Through the above procedures, we now have the data set that has the structure that facilitates the study of the impact of data characteristics on system performance and the element/attribute content that simulates a piece of text in a natural language.

Listing 18.2 Generation of the String Element Content

Sing a song of PickWord,
A pocket full of PickWord
Four and twenty PickWord
All baked in a PickWord.
When the PickWord was opened,
The PickWord began to sing;
Wasn't that a dainty PickWord
To set before the PickWord?
The King was in his PickWord,
Counting out his PickWord;
The Queen was in the PickWord
Eating bread and PickWord.
The maid was in the PickWord
Hanging out the PickWord;
When down came a PickWord,
And snipped off her PickWord!

Top