15.1 Introduction


Information technology applications pose challenging new requirements for the database research field. These requirements are motivated by the widespread use of database applications and by the new and complex needs that these applications present. Furthermore, the volume of data grows every day and now reaches terabytes, while responses must remain fast and precise.

Examples of such applications are data warehouse analysis procedures, which perform several scans of huge databases in order to answer a single request. These queries ask the system to perform heavy computations based on the aggregation of data. They define a new class of database applications: OLAP (On-Line Analytical Processing) applications.

Data-mining applications are another example: like data warehouse applications, they work on huge volumes of data and require fast response times. They extract descriptive patterns from the data, that is, information about the data itself that is useful for decision making. Classifiers are one example. They assign the analyzed data to a set of classes, having previously learned the rules that allow the system to assign a data example to a class. These rules are learned from a subset of the data, the training set, whose examples have already been classified by an expert. Clustering applications perform a similar task, but without expert supervision and with a number of classes that is not given in advance.

Other examples are basket analysis applications, which extract patterns such as association rules and sequential patterns. These patterns represent the laws that govern the distribution of the data in the database. They are designed to give the user/analyst an intuitive explanation of the underlying laws that guide customers in their purchases.
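To make the notion of an association rule concrete, the following minimal Python sketch evaluates rules of the form "body => head" over a handful of purchase transactions. It relies on the standard definitions of support (the fraction of transactions containing all items of the rule) and confidence (the support of the rule divided by the support of its body). The transactions, the function names, and the threshold values are assumptions made for illustration only; they are not taken from this chapter.

```python
from itertools import permutations

# Hypothetical purchase transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(body, head, transactions):
    """Support of body and head together, divided by the support of the body."""
    joint = support(set(body) | set(head), transactions)
    return joint / support(body, transactions)


def single_item_rules(transactions, min_support, min_confidence):
    """Return (body, head, support, confidence) for every single-item rule
    body => head that satisfies both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for body, head in permutations(items, 2):
        s = support({body, head}, transactions)
        if s < min_support:
            continue  # also guarantees that the body has non-zero support
        c = confidence({body}, {head}, transactions)
        if c >= min_confidence:
            rules.append((body, head, s, c))
    return rules


rules = single_item_rules(transactions, min_support=0.5, min_confidence=0.9)
for body, head, s, c in rules:
    print(f"{body} => {head} (support={s:.2f}, confidence={c:.2f})")
```

With these thresholds only the rule butter => bread survives: both items appear together in half of the transactions, and every transaction that contains butter also contains bread. The sketch enumerates single-item rules exhaustively, which is adequate for a toy example but not for the database sizes discussed above.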

All these techniques share the goal of extracting data patterns from the database, in order to obtain a description that can be used as a prediction tool for future data. All of them are therefore potentially useful tools for the analyst, both for explaining the patterns themselves and for helping with the hard task of interpreting them. In this way, extracted patterns are themselves considered data to be analyzed (and not necessarily with the same analysis tool that was used to obtain them).

These are the motivations that inspired Imielinski and Mannila to launch the idea of an inductive database, a general-purpose database in which both the data and the patterns can be represented, retrieved, and manipulated together or separately (Imielinski and Mannila 1996). Inductive databases should help the analyst in the hard task of extracting knowledge from the database and subsequently interpreting it with the same suite of analysis tools. This process is known as the Knowledge Discovery process (KDD process for short). It consists of a sequence of data preprocessing steps, data-mining steps (extracting patterns), and post-processing steps (providing the interpretation of the extracted patterns). According to Imielinski and Mannila, inside the inductive database framework the knowledge discovery process becomes a simple sequence of queries, where each query is expressed in a specialized query language with high expressive power. With inductive databases, all the analysis techniques previously described should be integrated inside the same framework, the inductive database management system, so that they can be used and intermixed in the analysis process when needed.

However, the underlying analysis models are rather different and require data to be represented, retrieved, and manipulated in different ways. Classifiers and clustering procedures usually adopt a classification tree as their data model, while the basket analysis problem is solved by representing data with a set enumeration model. On the other hand, source raw data are very often represented in the relational data model (because they reside inside relational databases), whose simplicity does not provide an easy way to manage data represented in different models.

In this chapter, we explore the feasibility of using XML as the language for the representation and integration of the different models previously cited. XML is particularly suitable for this task, because it can represent at the same time, and in a flexible way, both the data schema and the data values. This allows more generality in the definition of patterns and in the representation of the adopted models. In particular, we propose here a new model called XDM (XML for Data Mining), specifically designed to be adopted inside the unifying framework of inductive databases. We show the features that make it suitable for inductive database applications. The first is that it allows source raw data and patterns to be represented at the same time in the model. The second is that it represents, together with the patterns, the pattern definition that results from the pattern derivation process. This is crucial for the phase of pattern interpretation and allows pattern reuse by the inductive database management system. As we explain later in this chapter, this fact may speed up future pattern extractions because it allows an incremental computation of patterns. Furthermore, we show that the use of XML allows the description of complex formats, such as trees, enabling the effective integration of several heterogeneous data-mining techniques and models in the same framework. Finally, we show that the framework can be easily extended with new data-mining operators. In this way, inductive databases really become open systems that are easily customizable according to the kinds of patterns in which the user/analyst is interested.
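The idea of keeping raw data, patterns, and the pattern definition in the same XML document can be illustrated with a short Python sketch based on the standard xml.etree.ElementTree module. All element and attribute names used here (inductive-db, data, pattern-definition, patterns, and so on) are hypothetical and chosen only for this illustration; they are not the XDM syntax, which is introduced in Section 15.3.

```python
import xml.etree.ElementTree as ET


def build_document():
    """Build one XML tree holding raw data, a pattern definition, and patterns.

    Every tag and attribute name below is an illustrative assumption,
    not actual XDM vocabulary.
    """
    root = ET.Element("inductive-db")

    # 1. Source raw data: two purchase transactions.
    data = ET.SubElement(root, "data", name="purchases")
    for tid, items in [("1", ["bread", "butter"]), ("2", ["milk"])]:
        transaction = ET.SubElement(data, "transaction", id=tid)
        for item in items:
            ET.SubElement(transaction, "item").text = item

    # 2. Pattern definition: records how the patterns were derived
    #    (which data, which operator, which parameters).
    definition = ET.SubElement(
        root, "pattern-definition",
        name="rules-1", source="purchases", operator="mine-rules",
    )
    ET.SubElement(definition, "parameter", name="min-support").text = "0.5"
    ET.SubElement(definition, "parameter", name="min-confidence").text = "0.9"

    # 3. Extracted patterns, linked back to their definition by name.
    patterns = ET.SubElement(root, "patterns", definition="rules-1")
    rule = ET.SubElement(patterns, "rule", support="0.5", confidence="1.0")
    ET.SubElement(rule, "body").text = "butter"
    ET.SubElement(rule, "head").text = "bread"
    return root


root = build_document()
xml_text = ET.tostring(root, encoding="unicode")

# Data and patterns can be retrieved together or separately from one document.
reparsed = ET.fromstring(xml_text)
transaction_count = len(reparsed.findall("data/transaction"))
rule_head = reparsed.find("patterns/rule/head").text
print(xml_text)
```

Because the patterns element refers to its definition by name, a later query could in principle check whether a requested extraction has already been performed on the same data with the same parameters, which is the kind of reuse discussed above.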

This chapter is organized as follows. In Section 15.2, "Past Work," we discuss the work that motivates the proposed model: We discuss the problem of the extraction and evaluation of association rules, and the problem of data classification. Then, we present the concept of inductive databases, and finally we present a previous proposal concerning both XML and data mining, namely the Predictive Model Markup Language (PMML).

The running example introduced in Section 15.2, "Past Work," is the basis for Section 15.3, "The Proposed Data Model: XDM," which presents XDM, our proposal for inductive databases. In particular, we first introduce the basic XDM concepts and notions, and then we show the application of XDM to concrete problems by exploiting the running example.

Section 15.4, "Benefits of XDM," presents the benefits of the new model. Section 15.5, "Toward Flexible and Open Systems," presents the inductive database system as an open and flexible framework. Section 15.6, "Related Work," discusses related work, and finally we draw our conclusions.

