This book is divided into five parts, each containing a coherent set of closely related chapters. The parts are self-contained and can be read in any order. The five parts are as follows:
Part I: Introduction
Part II: Native XML Databases
Part III: XML and Relational Databases
Part IV: Applications of XML
Part V: Performance and Benchmarks
The parts are summarized in the sections that follow.
This part contains a chapter that focuses on guidelines for achieving good grammar and style when modeling information using XML. Brandin, the author, argues that good grammar alleviates the need for redundant domain knowledge required for interpretation of XML by application programs. Good style, on the other hand, ensures improved application performance, especially when it comes to storing, retrieving, and managing information. The discussion offers insight into information-modeling patterns inherent in XML and common XML information-modeling pitfalls.
Two native XML database systems, Tamino and eXist, are covered in this part. In Chapter 2, Schöning provides an overview of Tamino's architecture and APIs before moving on to discussing its XML storage and indexing features. Querying, tool support, and access to data in other types of repositories are also described. The chapter offers a comprehensive discussion of the features that are of key importance during the development of an XML data management application.
In a similar fashion, Chapter 3 by Meier introduces the various features and APIs of the Open Source system eXist. However, in contrast with Chapter 2, the main focus is on how query processing works within the system. As a result, the author provides deeper insight into its indexing and storage architectures. Together, the two chapters offer a balanced discussion of both the high-level application-programming features of the two systems and the underlying indexing and storage mechanisms that support efficient query processing.
Finally, in Chapter 4, we have included an example of an embedded XML database system, Berkeley DB XML, which is based on the general-purpose embedded database engine Berkeley DB. Berkeley DB XML is able to store XML documents natively, and it provides indexing and an XPath query interface. Some of the capabilities of the product are demonstrated through code examples.
This part provides an interesting mix of products and approaches to XML data management in relational and object-relational database systems. Chapters 5, 6, and 7 discuss three commercial products: IBM DB2, Oracle9i, and MS SQL Server 2000, respectively, while Chapters 8 and 9 describe more general, roll-your-own strategies for relational and object-relational systems.
Chapter 5 by Benham highlights the technology and architecture of XML data management and information integration products from IBM. The focus is on the DB2 Universal Database and Xperanto. The former is the family of products providing relational and object-relational data management support for XML applications through the DB2 XML Extender, extended SQL, and support for Web services. The latter is the planned set of products and functions for addressing information integration requirements, which are aimed at complementing DB2 capabilities with additional support for XML and both structured and unstructured applications.
In Chapter 6, Hohenstein discusses similar features in Oracle9i: the use of Oracle's CLOB functionality and the OracleText cartridge for handling document-centric XML documents, and XMLType, a new object type based on the object-relational functionality in Oracle9i, for managing data-centric ones. He presents the Oracle SQL extensions for XML and provides examples of how to use them to build XML documents from relational data. Special features and tools for XML, such as URI (Uniform Resource Identifier) support, parsers, a class generator, and Java Beans encapsulating these features, are also described.
In Chapter 7, Rys covers a feature set, similar to the ones in Chapters 5 and 6, for MS SQL Server 2000. He focuses on scenarios involving exporting and importing structured XML data. Accordingly, he concentrates on the relevant building blocks, such as HTTP and SOAP access, queryable and updateable XML views, rowset views over XML, and XML serialization of relational results. Rowset views and XML serialization are aimed at providing XML support for users more familiar with the relational world. XML views, on the other hand, offer XML-based access to the database for users more comfortable with XML.
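The two relational-facing building blocks mentioned above, XML serialization of relational results and rowset views over XML, can be illustrated with a small generic sketch. It does not use SQL Server or any of its XML features; it relies only on SQLite and the Python standard library, and the table name `customer`, the element name `row`, and the helpers `rows_to_xml` and `xml_to_rows` are hypothetical names chosen for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(conn, query, root_tag):
    """Serialize a query result as XML: one <row> per record, one child per column."""
    cur = conn.execute(query)
    columns = [d[0] for d in cur.description]
    root = ET.Element(root_tag)
    for record in cur:
        row = ET.SubElement(root, "row")
        for name, value in zip(columns, record):
            # Values are written as plain text; type information is not preserved.
            ET.SubElement(row, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


def xml_to_rows(xml_text):
    """Expose a document of <row> elements as a list of tuples (a rowset view)."""
    return [tuple(col.text for col in row) for row in ET.fromstring(xml_text)]


# Hypothetical sample data, used only to exercise the two helpers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ada"), (2, "Lin")])

doc = rows_to_xml(conn, "SELECT id, name FROM customer ORDER BY id", "customers")
print(doc)
print(xml_to_rows(doc))
```

The round trip returns every value as a string, which hints at why real products attach schema or type annotations to such views.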
Collectively, Chapters 5, 6, and 7 furnish an interesting comparison of the functionality offered by the three commercial systems and the various similarities and differences in their XML data management approaches. In contrast, Chapters 8 and 9, by Edwards and Brown, respectively, focus on generic, vendor-independent solutions.
Edwards describes a generic architecture for storing XML documents in a relational database. The approach is aimed at avoiding vendor-specific database extensions and providing the database application programmer an opportunity to experiment with XML data storage without recourse to implementing much new technology. The database model is based on merging DOM with the Nested Sets Model, hence offering ease of navigation and the ability to store any well-formed XML document. This results in fast serialization and querying but at the expense of update performance.
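The idea behind the Nested Sets Model can be sketched in a few lines. The following is a minimal illustration, not the schema used in Chapter 8: the table `node`, its columns, and the helper functions are assumptions made for this example, and SQLite stands in for an arbitrary relational database. Each element receives a left and a right number during a depth-first walk, so that all descendants of an element fall strictly between its two numbers.

```python
import sqlite3
import xml.etree.ElementTree as ET


def number_nodes(elem, counter, rows):
    """Depth-first walk that assigns nested-set (left, right) bounds to each element."""
    left = counter[0]
    counter[0] += 1
    for child in elem:
        number_nodes(child, counter, rows)
    right = counter[0]
    counter[0] += 1
    rows.append((left, right, elem.tag, (elem.text or "").strip()))


def store_document(conn, xml_text):
    """Shred a well-formed XML document into one row per element."""
    conn.execute(
        "CREATE TABLE node (lft INTEGER PRIMARY KEY, rgt INTEGER, tag TEXT, text TEXT)"
    )
    rows = []
    number_nodes(ET.fromstring(xml_text), [1], rows)
    conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", rows)


def descendants(conn, tag):
    """Tags of all descendants of the first element named `tag`, in document order."""
    cur = conn.execute(
        "SELECT d.tag FROM node a JOIN node d ON d.lft > a.lft AND d.rgt < a.rgt "
        "WHERE a.lft = (SELECT MIN(lft) FROM node WHERE tag = ?) ORDER BY d.lft",
        (tag,),
    )
    return [row[0] for row in cur]


conn = sqlite3.connect(":memory:")
store_document(
    conn,
    "<book><part><chapter>Tamino</chapter><chapter>eXist</chapter></part><index/></book>",
)
print(descendants(conn, "part"))  # ['chapter', 'chapter']
```

A whole subtree is retrieved with a single range comparison, which is why querying and serialization are fast. Inserting a new element, by contrast, requires renumbering every bound to its right, which is the update cost noted above.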
While Edwards' architecture is aimed at supporting the traditional relational database programmer, Brown's approach seeks to exploit the advanced features offered by the object-relational model and respective extensions of most relational database systems. He discusses object-relational schema design based on introducing into the DBMS core types and operators equivalent to the ones standardized in XML. The key functionality required of the DBMS core is an extensible indexing system allowing the comparison operator for built-in SQL types to be overloaded. The new SQL 3 types thus defined act as a basis during the mapping of XPath expressions to SQL 3 queries over the schema.
This part presents several applications and case studies in XML data management ranging from bioinformatics, geographical and engineering data management, to customer services and cash flow improvement, through to large-scale distributed systems, data warehouses, and inductive database systems.
In Chapter 10, Direen and Jones discuss various challenges in bioinformatics data management and the role of XML as a means to capture and express complex biological information. They argue that the flexible and extensible information model employed by XML is well suited for the purpose and that database technology must exhibit the same characteristics if it is to keep in step with biological data management requirements. They discuss the role of the NeoCore XML management system in this context and the integration of a BLAST (Basic Local Alignment Search Tool) sequence search engine to enhance its ability to capture, manipulate, analyze, and grow the information pertaining to complex systems that make up living organisms.
Kowalski presents two case studies involving XML and IBM's DB2 Universal Database in Chapter 11. Her first case study is that of a customer services unit that needs to react to problems from the most important customers first. The second case study focuses on improving cash flow in a school by reducing the time for reimbursement from the Department of Education. The author presents the scenario and the particular problem to be solved for each case study, which is followed by an analysis identifying existing conditions preventing the solution of the problem. A description of how XML and DB2 have been used to devise an appropriate solution concludes each case study.
Chapter 12, by Eglin, Hendra, and Pentakalos, describes the design and implementation of the JEDMICS Open Access Interface, an EJB-based API that provides access to image data stored on a variety of storage media and metadata stored in a relational database. The JEDMICS system uses XML as a portable data exchange solution, and the authors discuss issues relating to its integration with the object-oriented core of the system and the relational database providing the persistent storage. A very interesting feature of the chapter is the authors' reflection on their experiences with a range of XML technologies such as DOM, JDOM, JAXB, XSLT, and Oracle XSU in the context of JEDMICS.
In Chapter 13, Wilson and her coauthors offer insight into the use of XML to enhance the GIDB (Geospatial Information Database) system to exchange geographical data over the Internet. They describe the integration of meteorological and oceanographic data, received remotely via the METCAST system, into GIDB. XML plays a key role here as it is utilized to express the data model catalog for METCAST. The authors also describe their implementation of the OpenGIS Web Map Server (WMS) specification to facilitate displaying georeferenced map layers from multiple WMS-compliant servers. Another interesting feature of this chapter is the implementation of the ability to read and write vector data using the OpenGIS Geography Markup Language (GML), an XML-based language standard for data interchange in Geographic Information Systems (GISs).
Rine sketches his vision of an Interstellar Space Wide Web in Chapter 14. He contrasts the issues relating to the development and deployment of such a facility with the problems encountered in today's World Wide Web. He mainly focuses on adapters as configuration mechanisms for large-scale, next-generation distributed systems and as the means to increase the reusability of software components and architectures in this context. His approach to solving the problem is a configuration model and network-aware runtime environment called Space Wide Web Adapter Configuration eXtensible Markup Language (SWWACXML). The language associated with the environment captures component interaction properties and network-level QoS constraints. Adapters are automatically generated from the SWWACXML specifications. This facilitates reuse because components are not tied to interactions or environments. Rine also discusses the role of the SWWACXML runtime system from this perspective as it supports automatic configuration and dynamic reconfiguration.
In Chapter 15, Meo and Psaila present an XML-based data model used to bridge the gap between various analysis models and the constraints they place on data representation, retrieval, and manipulation in inductive databases. XDM (XML for Data Mining) allows simultaneous representation of source raw data and patterns. It also represents the pattern definition resulting from the pattern derivation process, hence supporting pattern reuse by the inductive database system. One of the significant advantages of XML in this context is the ability to describe complex heterogeneous topologies such as trees and association rules. In addition, the inherent flexibility of XML makes it possible to extend the inductive database framework with new pattern models and data-mining operators resulting in an open system customizable to the needs of the analyst.
Chapter 16, the last chapter in this part, describes Baril and Bellahsene's experiences in designing and managing an XML data warehouse. They propose the use of a view model and a graphical tool for the warehouse specification. Views defined in the warehouse allow filtering and restructuring of XML sources. The warehouse is defined as a set of materialized views, and it provides a mediated schema that constitutes a uniform query interface. They also discuss mapping techniques to store XML data using a relational database system without redundancies and with optimized storage space. Finally, the DAWAX system implementing these concepts is presented.
XML database management systems face the same stringent efficiency and performance requirements as any other database technology. Therefore, the final part of this book is devoted to a discussion of benchmarks and performance analyses of such systems.
Chapter 17 focuses on the need to design and adopt benchmarks to allow comparative performance analyses of the fast-growing number of XML database management systems. Here Bressan and his colleagues describe three existing benchmarks for this purpose, namely XOO7, XMach-1, and XMark. They present the database and queries for each of the three benchmarks and compare them against four quality attributes: simplicity, relevance, portability, and scalability. The discussion is aimed at identifying challenges facing the definition of a complete benchmark for XML database management systems.
In Chapter 18, Patel and Jagadish describe a benchmark that is aimed at measuring lower-level operations than those described in Chapter 17. The inspiration for their work is the Wisconsin Benchmark that was used to measure the performance of relational database systems in the early 1980s.
Schmauch and Fellhauer describe a detailed performance analysis in Chapter 19. They compare the time and space consumed by a range of XML data management approaches: relational databases, object-oriented databases, directory servers, and native XML databases. XML documents are converted to DOM trees, hence reducing the problem to storing and extracting trees. Instead of using a particular benchmark, they derive their test suite from general requirements that the storage of XML documents has to meet. XML documents of different sizes are stored using the four types of systems, selected fragments and complete documents are extracted, and the disk space used and the performance are measured. As in Chapter 20, the authors offer a thorough set of empirical results. They also provide detailed insight into existing XML data management approaches using the four systems analyzed. Finally, the experiences presented in the chapter are used as a basis to derive guidelines for benchmarking XML data management systems.
In Chapter 20, Fong, Wong, and Fong present a comparative performance analysis of a native XML database and a relational database extended with XML data management features. They do not use any existing benchmarks but instead devise their own methodology and database. The key contribution of this chapter is a detailed set of empirical results presented as bar graphs.