9.1 Introduction

This chapter explores techniques for managing XML document data by exploiting the extensibility features of a modern Object-Relational Database Management System (ORDBMS). The motivation for this prototype system lies in the observation that new applications making use of XML are likely to coexist with preexisting information systems supported by SQL-centric (object-) relational databases. A further goal of the integrated data store described here is to investigate how to preserve all of the desirable quality-of-service features provided by an ORDBMS (read/write ACID transactions, scalability, standard client programming APIs such as JDBC and ODBC, and a declarative data language interface) without compromising the potency of XML as a driver of inter-business communications.

We describe our prototype in cookbook fashion, explaining what its ingredients are and how they were assembled. Production-quality implementations of these ideas could make use of any number of freely available software libraries from the Internet. While this prototype was developed on one object-relational DBMS (namely IBM Informix IDS 9.x), the functionality needed to support its key features is implemented in a number of other DBMS products, notably the Java-based ORDBMS Cloudscape and the open source PostgreSQL DBMS, suggesting that the techniques described in this chapter should work equally well in these other systems. In other words, this prototype describes an approach to managing XML that has quite broad utility and does not require any engineering investment on the part of the DBMS vendors. Information systems developers need not wait for vendors to catch up or tolerate proprietary "extensions" to standards that characterize many vendor offerings.

We begin this chapter with an overview of the kind of use-case scenario addressed by our system. XML research is taking a number of divergent paths. Some researchers are focused on the unstructured and partially structured forms of XML, with applications in areas such as content management and information retrieval. This prototype system focuses instead on XML's likely use as an enabler for e-business communication. Pan-enterprise information systems, typified by supply chain management infrastructure, will tend towards much larger numbers of relatively small and more rigorously structured XML "documents" (with and without predefined schema). Participating business organizations will use XML as a lingua franca to exchange structured messages containing quantitative information: what, how much, how many, where, and when. It is thought that e-businesses will be motivated to retain these messages in a queryable repository in order to perform post hoc analysis involving components of the messages for which they had no operational use at the time the message was originally received.

Next, we describe the fairly conventional architecture of our prototype. Our description emphasizes logical architecture: the nature of the software modules making up the system and the interfaces between them. Briefly, the prototype consists of a database schema that employs a number of SQL-3 user-defined types and functions, and a small number of programs that convert XML and XPath into relational data and SQL-3 and back again. This logical architecture can be mapped to several physical architectures. In a modern DBMS, Java class libraries and even compiled binary libraries, which are conventionally integrated within some middleware or client program, can be dynamically linked into the DBMS runtime and treated as a kind of sophisticated stored procedure. Alternatively, the same class libraries may be loaded into middleware or even client programs. Deciding what physical model to adopt is contingent on the specifics of the system under construction.

We then move on to describe several of the more interesting features of the prototype in more detail. The areas of focus are:

The design of the object-relational database schema used by the prototype. Practical applications will need to build upon the simpleminded approach described here, and we suggest ways to employ other ORDBMS techniques to accommodate certain additional features of the XML data model.
The functionality and design of the new types introduced into the ORDBMS in order to manage hierarchical structures efficiently. We emphasize how the approach taken here meets the requirements of XML XPath processing, how it differs from other proposals, and what is required of the DBMS to support it.
The algorithms used to convert XPath expressions into their equivalent SQL-3 given the schema design and user-defined types already introduced. XPath expressions constitute a sizeable portion of the XQuery specification. Because XML is conceived as a fundamentally hierarchical data model, getting XPath expressions to run efficiently presents a number of challenges to SQL query processors.
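
To make the three focus areas concrete, here is a minimal sketch, not the prototype's actual design. It assumes a single hypothetical `node` table, a dotted text label (such as `1.2.1`) standing in for a hierarchical user-defined type, and SQLite standing in for the ORDBMS. The function names `shred`, `load`, `xpath_to_sql`, and `descendants_sql` are illustrative only, and the translator handles nothing beyond plain child steps.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample message; element names are invented for illustration.
SAMPLE = (
    "<order><id>42</id>"
    "<item><sku>A1</sku><qty>3</qty></item>"
    "<item><sku>B2</sku><qty>5</qty></item>"
    "</order>"
)


def shred(xml_text):
    """Flatten an XML document into (label, path, text) rows.

    The label encodes each element's position in the tree: the root is "1",
    its second child is "1.2", that child's first child is "1.2.1", and so on.
    """
    rows = []

    def visit(elem, label, parent_path):
        path = parent_path + "/" + elem.tag
        rows.append((label, path, (elem.text or "").strip()))
        for position, child in enumerate(elem, start=1):
            visit(child, f"{label}.{position}", path)

    visit(ET.fromstring(xml_text), "1", "")
    return rows


def load(conn, rows):
    """Create the assumed one-table schema and insert the shredded rows."""
    conn.execute(
        "CREATE TABLE node (label TEXT PRIMARY KEY, path TEXT, text TEXT)"
    )
    conn.executemany("INSERT INTO node VALUES (?, ?, ?)", rows)


def xpath_to_sql(xpath):
    """Translate an absolute path of plain child steps, e.g. /order/item/qty."""
    steps = xpath.strip("/").split("/")
    if not all(step.isidentifier() for step in steps):
        raise ValueError("only plain child steps are supported in this sketch")
    # Text labels sort correctly only while every node has fewer than ten
    # children; a dedicated label type would not have that restriction.
    sql = "SELECT label, text FROM node WHERE path = ? ORDER BY label"
    return sql, ("/" + "/".join(steps),)


def descendants_sql(label):
    """All descendants of a node: every label that extends it with '.'."""
    sql = "SELECT label, path FROM node WHERE label LIKE ? ORDER BY label"
    return sql, (label + ".%",)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load(conn, shred(SAMPLE))
    for sql, params in (xpath_to_sql("/order/item/qty"), descendants_sql("1.2")):
        print(conn.execute(sql, params).fetchall())
```

In this sketch a hierarchical question ("which nodes lie beneath this one?") becomes a prefix test on the label rather than a recursive join, which hints at why a dedicated hierarchical type inside the DBMS could be attractive.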

Finally, we conclude this chapter with a review of the prototype and a summary of its key contributions.
