15.6 Related Work


The topic addressed by this chapter spans two broad research areas: data mining and knowledge discovery, and XML data management. A large amount of work has been done in both areas, although both are rather recent (no more than ten years for data mining and no more than five years for XML data management). Therefore, it is difficult to be exhaustive when writing about this topic.

The research community considers R. Agrawal's paper (Agrawal et al. 1993a) the work that originated the field of data mining: In that work the authors demonstrated that it is possible to extract patterns from large raw data with acceptable performance. The papers that followed that work provided a large variety of algorithms for efficiently mining association rules (Agrawal et al. 1993b; Agrawal and Srikant 1994; Bayardo 1998; Han et al. 2000, to cite only the most widely known in the field of association rule mining) and other kinds of knowledge, such as classification models, sequential patterns, and so on (see, for example, Kamber et al. 1997; Li et al. 2001; Mehta et al. 1996; Srikant and Agrawal 1996; Quinlan 1993).

Then the problem of integrating data-mining algorithms and databases emerged, and several works addressed it. First of all, the problem was addressed from the language perspective in works by R. Meo et al. (Meo et al. 1996, 1998a), T. Imielinski et al. (Imielinski et al. 1996; Imielinski and Virmani 1998), and J. Han et al. (Han et al. 1996). These works propose different query languages for data mining based on the SQL syntax: The common idea is to extend SQL with specific constructs that enable the user to specify data-mining statements over relational databases in a declarative form. The main advantage of this approach is that data, patterns, and mining statements belong to the same framework, that is, the relational framework, where data are usually stored. P. L. Lanzi and G. Psaila (Lanzi and Psaila 1999; Psaila 2001) pushed this idea further, showing that the relational database framework could be effectively used to host several SQL-like data-mining operators, thus transforming the relational database framework into a relational database-mining framework.

The alternative approach to integrating data-mining technologies and databases works at the level of the algorithms rather than of the language: The data-mining algorithms themselves are integrated with the database. On this topic, we cite the works of R. Agrawal and K. Shim (Agrawal and Shim 1996) and R. Meo et al. (Meo et al. 1998b): The former gives an overview of the different solutions and the problems that arise when data-mining algorithms are integrated with relational databases; the latter shows that a specific design for algorithms can take advantage of the presence of the underlying relational database.

Finally, as far as the data-mining area is concerned, we recall that the idea of developing a unifying framework and system for data mining led to the definition of inductive databases. While the idea of an inductive database was first introduced by T. Imielinski and H. Mannila (Imielinski and Mannila 1996), J.-F. Boulicaut et al. (Boulicaut et al. 1998) were the first to define this notion formally, taking the MINE-RULE operator introduced by R. Meo et al. (Meo et al. 1996) as the first example of an inductive database query language; consequently, the work of P. L. Lanzi and G. Psaila (Lanzi and Psaila 1999; Psaila 2001) can be seen as confirming the idea that inductive databases and the relational database-mining framework are two sides of the same coin. Later, the work of J.-F. Boulicaut et al. (Boulicaut et al. 1999) gave a better formalization of the concept of inductive databases, although we think that the definition provided in that paper (and reported in this chapter) is still inadequate to fully support flexible data-mining systems.

As far as the research area related to XML is concerned, giving an exhaustive account is even more difficult. XML was introduced by a W3C recommendation. The language immediately attracted interest because it easily describes semi-structured and/or complex data, is an open format, and is suitable for data exchange over the Internet.

The research work on XML moved in several directions. At first, query languages for XML were proposed, with the aim of querying XML documents to extract information. XPath is a simple and powerful query language: Its simplicity makes it suitable to be incorporated inside other languages, such as XSLT. At the moment, the work on query languages for XML is converging around XQuery, the official W3C language for querying XML documents and generating other XML documents; however, XQuery is not yet stable, since its definition is still in progress.
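To give a flavor of how compact an XPath expression is, the following is a minimal sketch in Python. It relies on the limited XPath subset supported by the standard `xml.etree.ElementTree` module, not on a full XPath processor; the `CATALOG` document and the `titles_for_year` helper are hypothetical names introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
CATALOG = """<catalog>
  <paper year="1993"><title>First sample title</title></paper>
  <paper year="1996"><title>Second sample title</title></paper>
  <paper year="1996"><title>Third sample title</title></paper>
</catalog>"""


def titles_for_year(xml_text, year):
    """Return the titles of all <paper> elements with the given year."""
    root = ET.fromstring(xml_text)
    # Path expression: child <paper> elements of the root whose "year"
    # attribute equals the requested value, then their <title> children.
    path = f"./paper[@year='{year}']/title"
    return [title.text for title in root.findall(path)]


if __name__ == "__main__":
    print(titles_for_year(CATALOG, 1996))
```

A single path expression selects elements by position in the tree and by attribute value, which is what makes XPath easy to embed in host languages such as XSLT.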

As far as the connection between database systems and XML is concerned, some work has been done to explore the problem of storing and managing XML documents inside relational, object-relational, and object-oriented databases (see, for example, Klettke and Meyer 2000; Schmidt et al. 2000; Kappel et al. 2000). The common idea behind these works is to map XML documents into the relational or object-oriented structure; retrieval is then performed by translating queries over XML documents into queries over the corresponding database schema.
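The mapping idea can be sketched as follows. This is only an illustration under strong simplifying assumptions, not the method of any of the cited works: each XML element becomes one row of a single hypothetical `node` table (a generic "edge table" layout), and only absolute paths made of child steps, such as `/library/book/title`, are translated into SQL. The names `shred` and `path_to_sql` are introduced here for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def shred(conn, xml_text):
    """Store every element as one row: (id, parent id, tag, text content)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node ("
        "id INTEGER PRIMARY KEY, parent INTEGER, tag TEXT, content TEXT)"
    )

    def visit(elem, parent_id):
        cur = conn.execute(
            "INSERT INTO node (parent, tag, content) VALUES (?, ?, ?)",
            (parent_id, elem.tag, (elem.text or "").strip()),
        )
        node_id = cur.lastrowid  # id assigned by SQLite to the new row
        for child in elem:
            visit(child, node_id)

    visit(ET.fromstring(xml_text), None)


def path_to_sql(path):
    """Translate an absolute child-step path (/a/b/c) into SQL self-joins."""
    steps = [step for step in path.split("/") if step]
    if not steps:
        raise ValueError("path must contain at least one step")
    joins = ["node n0"]
    where = ["n0.parent IS NULL", "n0.tag = ?"]
    for i in range(1, len(steps)):
        # Each further step joins the table with itself along parent links.
        joins.append(f"JOIN node n{i} ON n{i}.parent = n{i - 1}.id")
        where.append(f"n{i}.tag = ?")
    last = len(steps) - 1
    sql = (
        f"SELECT n{last}.content FROM {' '.join(joins)} "
        f"WHERE {' AND '.join(where)} ORDER BY n{last}.id"
    )
    return sql, steps


if __name__ == "__main__":
    doc = "<library><book><title>A</title></book><book><title>B</title></book></library>"
    conn = sqlite3.connect(":memory:")
    shred(conn, doc)
    sql, params = path_to_sql("/library/book/title")
    print(sql)
    print([row[0] for row in conn.execute(sql, params)])
```

The sketch shows the trade-off of a generic mapping: any document can be stored without knowing its structure, but each step of a path query costs one self-join.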

From a commercial point of view, the best-known database system for storing collections of XML documents is Tamino by Software AG: It stores XML documents in their native format, so that collections of XML documents can be managed freely without knowing their structure in advance.

Finally, we recall that at the moment only two proposals bridge the gap between XML and data mining. The first one is the Predictive Model Markup Language (PMML; http://www.dmg.org/pmml-v2-0.htm). This proposal aims to create a standard format that enables different systems to exchange the patterns extracted from data sets by data-mining tools. Unfortunately, because of the goal for which PMML has been designed, PMML cannot be exploited for improving the notion of an inductive database. Instead, this is exactly the main goal of XDM. Certainly, the open nature of XDM may allow the use of PMML in order to represent patterns and models inside XDM.

The second proposal is introduced in a recent work by D. Braga et al. (Braga et al. 2002): This paper follows the path traced by the MINE-RULE operator (Meo et al. 1996, 1998a) to define an operator, based on the XQuery syntax, for mining association rules from XML documents. The clauses on which the operator is based, the MINE-RULE XDM statement, were illustrated in this chapter in Section 15.3.3, "Association Rules with XDM."

