8.1 Introduction

The relational database is by far the most pervasive data storage paradigm encountered across all application domains in industry and academia. The majority of organizations with a substantial IT infrastructure will have successfully implemented some sort of relational database technology, and many of these will have teams of programmers, database administrators, and other technical staff to design, deploy, support, maintain, and evolve their relational database solutions. Even many nontechnical staff involved in the provision or use of database systems will have a reasonable appreciation of the capabilities and approximate structure of the underlying relational database platform. The relational data model is, in short, one of computer science's most prominent success stories, and one of the IT industry's biggest revenue earners.

In a typical application, a database server acts as an information repository for a user interface?the "client"?that runs on the user's desktop machine. In this "client/server" architecture, the client sends requests to create, retrieve, modify, or delete data stored on the server. More sophisticated applications may involve multiple client applications interacting with information partitioned across heterogeneous data servers. Furthermore, in a multitiered architecture, additional application layers (often acting as database proxy servers, information brokers, or business rule enforcers) may lie between the user interface and the database(s).

The advent of object-oriented approaches to programming may have encouraged innovative thinking in terms of data representation and manipulation, but?while many client applications are written using object-oriented programming languages?the underlying data storage technologies that serve these applications are still predominantly relational. The attractiveness (to the application designer at least) of object databases has to some extent been thwarted by the maturity, robustness, speed, security, and scalability of the many available commercial and Open Source relational database products, coupled with their increasing support for hybrid object-relational functionality (such as object persistence and native Java support).

The increasing use of the Internet as a platform for applications has added to the popularity of the relational database management system (RDBMS) as an information store, with many RDBMS vendors incorporating Web support into the feature sets of their product ranges. Likewise, the providers of content-serving Web application platforms (including Web-savvy scripting languages and other server-side technologies) have all addressed the issue of connection to relational data sources.

XML is well established as a platform-independent information exchange format. Thus far, it has predominantly been used for the encapsulation of data and metadata that will often (in part, at least) have originated from, or be destined for, a relational database. In the expanding information economy, many business-to-business (B2B) applications are exchanging data across the Internet in this way, often with no direct human intervention. The wide availability of standard techniques and tools for creating, parsing, validating, and transforming XML has greatly assisted the goal of systems interoperability, through the automated exchange and verification of XML-encoded data and the validation of its structure and content against agreed schemata.

The rapid take-up of XML is catalyzing a reappraisal of data storage approaches, and in particular the future role of RDBMS products in XML-compliant architectures. The move by many organizations to mobilize information in and out of their secure, proprietary, relational data stores through the medium of XML-compliant markup languages (often incorporating descriptive annotations and meta data) is highlighting the many benefits of standards-based interoperable solutions and?in certain situations?the advantages of the semi-structured paradigm over more rigid and inflexible data models.

RDBMS vendors have not been slow in incorporating XML-related functionality into their products, and most now offer some degree of XML support. There are (for example) SQL extensions in many products that allow data to be extracted from relational tables and recast in XML format. Those that permit the storage of XML data may offer one or more options as to the level of granularity of storage.

XML documents may be stored as single entities, rather like flat files stored on a file server. At first sight, this may seem a relatively pointless exercise (after all, we could just store flat files on a file server!), but it does have the benefit that user access permissions can be more closely integrated with other database object permissions.

XML documents may be stored as linked "chunks"?a useful approach if the XML data is "document centric"; that is, it has recognizable paragraphs of plain text or "mixed content" (where text is interspersed with markup).

XML documents may be broken down into small components that are stored in regular relational tables. This approach often requires an explicit mapping between the XML schema and the database tables?that is, between elements and tables, and between attributes and columns.

The latter strategy may prove the most satisfactory if the aim is to support schema-aware searches and low-level editing, but it is likely to prove expensive as regards the marshalling of data back into XML format (since this may involve a large number of table joins). In practice, our choice will be determined by our application's requirements with regard to querying, cross-linking, and document component reuse, among other factors.

Despite these innovations, there is little industry consensus as to the most appropriate way to support XML within the relational data model. Applications that bind too closely to product-specific database extensions run the risk of vendor tie-in and incompatibility with other products, and systems that involve more than one RDBMS platform may require substantially different interfaces to each. Some organizations are evaluating the native XML database technologies that are emerging, but many are reluctant to make a decision until these products have matured and reached the levels of scalability, performance, and cost-effectiveness of RDBMSs.

In this chapter, a simple generic architecture for the storage, manipulation, and retrieval of well-formed XML documents in relational databases is presented (it will be assumed that the reader has a working knowledge of XML, Java, JDBC, relational database concepts, SQL, and common database extensions such as stored procedures). The generic architecture of the repository has the benefit that it allows for rapid prototyping and experimentation, particularly for the reader who is uncertain of which XML schema(s) they will ultimately be using in their real world applications. While it may be less scalable than a schema-specific model (where there is an explicit mapping between XML fragments and database objects) and will doubtless be less functional than a native XML database, it does at least allow the RDBMS-savvy application developer to dip a toe in the waters of XML and gain hands-on experience of working with data in this format, and it should serve as a useful pointer towards the requirements for a fully functional enterprise-scalable system.

Since we can store any well-formed XML, we can, by definition, also store XML Schema Definitions (XSD) and Extensible Stylesheet Language Transformations (XSLT) alongside the documents to which they refer. Furthermore, since we are dealing with XML, should it become necessary to migrate all of the content at some later stage, we can simply export our data (as XML) and upload it to the desired platform.

Top