17.2 Benchmark Specification

Various domain-specific benchmarks have been developed because no single metric can measure the performance of computer systems for all applications. The Benchmark Handbook by Jim Gray (Gray 1993) has laid down the following four key criteria for a domain-specific benchmark:

  • Relevance: The benchmark must capture the characteristics of the system to be measured.

  • Portability: The benchmark should be easy to implement on many different systems.

  • Scalability: The benchmark should apply to databases of different sizes and scale from small to large computer systems.

  • Simplicity: The benchmark must be understandable; otherwise it will not be credible.

Fundamentally, a benchmark is used to measure the peak performance of a system. Different aspects of a system carry different weight in different domains: transaction cost is crucial in a network system, for example, while processing time and storage space are critical in a database system. A benchmark must therefore capture the characteristics of the system to be measured. In addition, systems vary in hardware and software support; some run on Windows, while others run on Linux. Consequently, a benchmark must be portable and scalable. Finally, a benchmark should be easy to understand, and its results easy to analyze.

There are many domain-specific benchmarks. For example, the Wisconsin benchmark (Bitton et al. 1983) is widely used to test the performance of relational query systems on simple relational operators; the AS3AP benchmark (Turbyfill et al. 1989) provides a more complete evaluation of relational database systems by incorporating features such as utility-function tests, mixed batch and interactive queries, and multiuser tests; the Set Query benchmark (O'Neil 1997) evaluates the ability of systems to process the complex queries typical of decision-support and data-mining applications; TPC-C is used for online transaction processing (OLTP); TPC-D, TPC-H, TPC-R, and APB-1 (OLAP 1998) are used for decision support and information retrieval; Sequoia (Stonebraker et al. 1993) is used for spatial data management, OO7 (Carey et al. 1993) for object-oriented databases, BUCKY (Carey et al. 1997) for object-relational databases, and the most recent TPC-W benchmark (see http://www.tpc.org/) for e-commerce. These benchmarks mainly evaluate query-processing performance. Each meets the key criteria for its domain: relevance, portability, simplicity, and scalability.

The design of a benchmark to evaluate XML management systems is a nontrivial task. XML is a self-describing data representation language that has emerged as the standard for electronic information interchange, so its potential uses cover a variety of complex scenarios. The objective of the benchmarks presented in this chapter is not to capture all possible uses of XML but rather to focus on the query-processing aspect of XML. In fact, for the sake of fairness and simplicity, the XML query-processing tools are evaluated in the simplest possible setup: locally stored data (no data transfer over a network) in a single-machine, single-user environment.
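
To make this measurement setup concrete, here is a minimal sketch (in Python, using only the standard library) of timing a single path query over a locally stored document in a single-machine, single-user environment. The file name benchmark.xml and the query .//item/name are hypothetical placeholders, not part of any benchmark discussed in this chapter.

  # Minimal single-user timing sketch: parse a local XML file once,
  # then time repeated runs of a simple path query (no network involved).
  import time
  import xml.etree.ElementTree as ET

  def run_query(doc_path: str, query: str, repetitions: int = 5) -> float:
      tree = ET.parse(doc_path)          # locally stored data
      root = tree.getroot()
      timings = []
      for _ in range(repetitions):
          start = time.perf_counter()
          results = root.findall(query)  # ElementTree's XPath subset
          timings.append(time.perf_counter() - start)
      print(f"{query}: {len(results)} matches, "
            f"best of {repetitions} runs = {min(timings):.4f}s")
      return min(timings)

  if __name__ == "__main__":
      run_query("benchmark.xml", ".//item/name")  # hypothetical document

Reporting the best of several runs reduces the influence of caching and operating-system noise on the measured query time.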

In the following sections, we discuss the various issues that arise when developing a benchmark for XML management systems. This involves designing a benchmark data set and a corresponding set of benchmark queries that adhere to the four criteria for benchmark design. We note that these criteria are interrelated, often to the point of conflicting, which can limit a benchmark's ability to adequately capture the performance of the systems under evaluation. For example, if the test data of an XML management system benchmark is too simple (simplicity), it cannot exercise XML's ability to represent complex structures (relevance). On the other hand, if the schema of the XML data is very complex (relevance), then some XML systems may not be able to store the data properly (portability), and it may be difficult to change the size of the data (scalability).
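
To illustrate how the scalability criterion is commonly reconciled with the others, the following sketch generates a data set whose schema stays fixed while a single scale factor controls the document size. The element names (catalog, item, and so on) and the generator itself are hypothetical; they do not belong to any benchmark described in this chapter.

  # Hypothetical generator: the schema is fixed, and the scale factor
  # alone determines how large the resulting XML document is.
  import xml.etree.ElementTree as ET

  def generate(scale: int, path: str = "benchmark.xml") -> None:
      root = ET.Element("catalog")
      for i in range(scale):
          item = ET.SubElement(root, "item", id=str(i))
          ET.SubElement(item, "name").text = f"item-{i}"
          ET.SubElement(item, "price").text = f"{(i % 100) + 0.99:.2f}"
      ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

  if __name__ == "__main__":
      generate(scale=1000)  # doubling the scale roughly doubles file size

Because the structure never changes with the scale factor, a system that can store the small instance can also store the large one, which keeps scalability from undermining portability.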

