Several generators for synthetic XML data have been proposed (Aboulnaga et al. 2001; Barbosa et al. 2002). Aboulnaga et al. proposed a data generator that accepts as many as twenty parameters, allowing the user to control the properties of the generated data. Such a large number of parameters adds a level of complexity that may interfere with the ease of use of a data generator. Furthermore, this generator does not make the schema of the data available, which some systems could otherwise exploit. More recently, Barbosa et al. proposed a template-based XML data generator that can produce multiple tunable data sets. In contrast to these earlier generators, the data generator in the Michigan benchmark produces an XML data set designed to test the different XML data characteristics that may affect the performance of XML engines. In addition, it requires only a few parameters to vary the scalability of the data set, and the schema of the data set is made available for systems to exploit.
Three benchmarks have been proposed for evaluating the performance of XML data management systems (Böhme and Rahm 2001; Bressan and Dobbie 2001; Schmidt, Waas, Kersten, Florescu, Manolescu et al. 2001). XMach-1 (Böhme and Rahm 2001) and XMark (Schmidt, Waas, Kersten, Florescu, Manolescu et al. 2001) generate XML data that models data from particular Internet applications. In XMach-1, the data are based on a Web application that consists of text documents, schemaless data, and structured data. In XMark, the data are based on an Internet auction application that consists of relatively structured and data-oriented parts. XOO7 (Bressan and Dobbie 2001) is an XML version of the OO7 benchmark (Carey et al. 1993), which provides a comprehensive evaluation of OODBMS performance. The OO7 schema and instances are mapped into a Document Type Definition (DTD) and the corresponding XML data sets, and the eight OO7 queries are translated into the respective query languages of three query-processing engines: Lore (Goldman et al. 1999; McHugh et al. 1997), Kweelt (Sahuguet et al. 2000), and an ORDBMS. While each of these benchmarks provides an excellent measure of how a test system performs against data and queries in its targeted XML application, it is difficult to extrapolate the results to data sets and queries outside the targeted domain. Although the queries in these benchmarks are designed to test different performance aspects of XML engines, they cannot be used to observe how system performance changes as the XML data characteristics change. In contrast, our benchmark provides distinct queries for analyzing system performance with respect to different XML data characteristics, such as tree fanout and tree depth, and different query characteristics, such as predicate selectivity.
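To make these three characteristics concrete, the following sketch computes tree depth, maximum fanout, and the selectivity of a simple value predicate over a tiny XML document. The element names, attribute, and helper functions here are illustrative assumptions, not part of any of the benchmarks cited.

```python
import xml.etree.ElementTree as ET

# A toy document; element names and values are hypothetical.
doc = """
<root>
  <node val="1"><node val="2"/><node val="3"/></node>
  <node val="4"><node val="5"><node val="6"/></node></node>
</root>
"""
tree = ET.fromstring(doc)

def depth(elem):
    """Maximum depth of the tree rooted at elem (a leaf has depth 1)."""
    children = list(elem)
    return 1 if not children else 1 + max(depth(c) for c in children)

def max_fanout(elem):
    """Largest number of children of any element in the tree."""
    return max([len(elem)] + [max_fanout(c) for c in elem])

# Predicate selectivity: fraction of <node> elements with val > 3.
nodes = list(tree.iter("node"))
matched = [n for n in nodes if int(n.get("val")) > 3]
selectivity = len(matched) / len(nodes)

print(depth(tree), max_fanout(tree), selectivity)  # 4 2 0.5
```

Varying these three quantities independently, rather than fixing them to one application's shape, is what lets a benchmark attribute a performance change to a specific data or query characteristic.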
A desiderata document (Schmidt, Waas, Kersten, Florescu, Carey et al. 2001) for a benchmark for XML databases identifies the components and operations, and ten challenges, that an XML benchmark should address. Although the benchmark we propose is not a general-purpose benchmark, it meets the challenges that test performance-critical aspects of XML processing.