Data on the Web is unstructured, or has an incomplete, irregular, or frequently changing structure. XML is becoming the universal data exchange model on the Web. It has been shown that XML is well suited for representing semi-structured data. Compared to HTML, XML provides explicit data structuring, and data presentation is separated from data content. The aim of this chapter is to present a method for designing and managing an XML warehouse. We have designed and implemented a browser for defining XML views graphically, which simplifies and improves their specification. We have also proposed a strategy for storing XML data in a relational DBMS.
The need for information personalization or adaptation for various types of users is crucial in many Web applications, since the volume of gathered information is huge. Moreover, the data are heterogeneous and unstructured, or have an incomplete, irregular, or frequently changing structure. XML accounts for an important and increasing share of the data published on the Web. The W3C has proposed XSL (eXtensible Stylesheet Language), a language that provides a means for XML document restructuring. This language is designed to define style sheets over XML documents. However, XSL cannot be considered a view definition language, as its expressive power is insufficient. This is why we propose a view mechanism for XML data, which customizes and adapts the gathered information according to user requirements. Indeed, different users sharing XML data may want to see the same data differently.
Besides, views in a semi-structured (e.g., XML) environment can be used to provide (1) a unified view of heterogeneous data sources and (2) the means to add a structured interface on top of semi-structured data. This last feature makes query optimization on semi-structured data easier and makes it easier to use classical programming languages for application development. We have defined and implemented a view model for XML data. A view in the relational data model is a virtual relation that combines information from several base relations. In our approach, by contrast, a view is a "virtual" document that combines parts of different real documents. The resulting XML documents are stored in a repository, which provides a unified view of heterogeneous information sources and allows us to quickly answer user queries independently of the availability of the data sources. We call this repository an XML warehouse; it is built as a set of materialized views over multiple information sources.
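To make the idea of a view as a document that combines parts of different real documents more concrete, here is a minimal sketch in Python. It is not the chapter's view formalism: the two source documents, the element names, and the `build_view` function are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical source documents (names and structure are illustrative only).
BOOKS_XML = """<books>
  <book id="b1"><title>XML Basics</title><price>30</price></book>
  <book id="b2"><title>Data Warehousing</title><price>45</price></book>
</books>"""

REVIEWS_XML = """<reviews>
  <review book="b1"><rating>4</rating></review>
  <review book="b2"><rating>5</rating></review>
</reviews>"""


def build_view(books_xml: str, reviews_xml: str) -> ET.Element:
    """Build a materialized view that keeps each book's title and rating."""
    books = ET.fromstring(books_xml)
    reviews = ET.fromstring(reviews_xml)

    # Index ratings by the book identifier they refer to.
    ratings = {r.get("book"): r.findtext("rating") for r in reviews.iter("review")}

    view = ET.Element("view")
    for book in books.iter("book"):
        item = ET.SubElement(view, "item", id=book.get("id"))
        # Select only part of the source document: the price is left out.
        ET.SubElement(item, "title").text = book.findtext("title")
        rating = ratings.get(book.get("id"))
        if rating is not None:
            ET.SubElement(item, "rating").text = rating
    return view


view = build_view(BOOKS_XML, REVIEWS_XML)
print(ET.tostring(view, encoding="unicode"))
```

Because the result is itself an XML document, it can be stored and queried even when the original sources are unavailable, which is the property the warehouse relies on.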
Our system supports filtering documents and storing them in a DBMS. In this chapter, we focus on the part of the system that supports XML view specification and the mapping of views to relational tables in a MySQL database system.
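As a rough illustration of what storing an XML document in relational tables can look like, the sketch below uses a generic one-row-per-element layout. This is an assumption made for illustration and not the mapping developed in this chapter; the `element` table, its columns, and the `store_document` function are hypothetical, and an in-memory SQLite database stands in for MySQL so that the example is self-contained.

```python
import sqlite3
import xml.etree.ElementTree as ET


def store_document(conn: sqlite3.Connection, root: ET.Element) -> int:
    """Store one row per element, linked to its parent; return the root row id.

    Attributes are ignored in this simplified sketch.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS element ("
        "id INTEGER PRIMARY KEY, parent_id INTEGER, tag TEXT NOT NULL, text TEXT)"
    )

    def insert(node: ET.Element, parent_id):
        # Whitespace-only text is stored as NULL.
        text = (node.text or "").strip() or None
        cur = conn.execute(
            "INSERT INTO element (parent_id, tag, text) VALUES (?, ?, ?)",
            (parent_id, node.tag, text),
        )
        node_id = cur.lastrowid
        for child in node:
            insert(child, node_id)
        return node_id

    return insert(root, None)


conn = sqlite3.connect(":memory:")
doc = ET.fromstring("<item><title>XML Basics</title><rating>4</rating></item>")
root_id = store_document(conn, doc)

# Retrieve the children of the root element with an ordinary SQL query.
rows = conn.execute(
    "SELECT tag, text FROM element WHERE parent_id = ? ORDER BY id", (root_id,)
).fetchall()
print(rows)
```

Once the document is in tabular form, standard SQL can be used to query it, which is one motivation for storing the warehouse in a relational DBMS.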
The main contributions of this chapter are:
A general architecture for a data warehouse integrating XML data
A formalism for a data warehouse specification
A mapping to store the warehouse in a relational DBMS
A graphic tool implementing our approach: DAWAX
This chapter is organized as follows. Section 16.2, "Architecture," presents the general architecture of our system. Section 16.3, "Data Warehouse Specification," follows this. The next section, 16.4, "Managing the Metadata," presents the metadata defining the warehouse. Section 16.5, "Storage and Management of the Data Warehouse," contains the storage techniques for the warehouse in a relational database. Our system for designing and managing the data warehouse, DAWAX, is presented in section 16.6 where we also discuss implementation details. This is followed by section 16.7, "Related Work," and finally our conclusions.