In this section, we summarize the benefits achieved by XDM.
Flexibility: A major critique of the data-mining research field is that each technique has been developed on a data model that differs strongly from the models of other techniques, which makes it difficult to integrate different data-mining techniques in the same system. Furthermore, the need for database support has been widely recognized, but relational databases cannot easily manage data represented in other models (such as trees). With XDM, we exploit the flexible structure of XML to integrate any kind of data representation in the XDM database, including trees.
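As a rough illustration of this point, the following Python sketch stores a flat, table-like data item and a tree-shaped data item in the same XML document. The element and attribute names (store, item, row, node, leaf) are assumptions made for this example only; they are not the XDM schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical container element; the names here are illustrative, not XDM's.
store = ET.Element("store")

# A flat, relational-style data item: one element per row.
table = ET.SubElement(store, "item", name="purchases", kind="table")
for customer, product in [("c1", "boots"), ("c2", "jacket")]:
    ET.SubElement(table, "row", customer=customer, product=product)

# A tree-shaped data item: the XML nesting mirrors the tree structure directly.
tree = ET.SubElement(store, "item", name="credit_model", kind="tree")
root = ET.SubElement(tree, "node", test="income > 30000")
ET.SubElement(root, "leaf", branch="yes", label="low risk")
inner = ET.SubElement(root, "node", branch="no", test="age > 40")
ET.SubElement(inner, "leaf", branch="yes", label="low risk")
ET.SubElement(inner, "leaf", branch="no", label="high risk")

# Both items live in one document and can be serialized together.
print(ET.tostring(store, encoding="unicode"))
```

The tree needs no encoding into rows and foreign keys, which is the kind of effort a purely relational representation would require.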
Derivation dependencies: The derivation process is maintained inside the XDM database, which is important for two reasons. First, the database thereby maintains the meaning of each derived data item, which plays an important role when the data are subsequently interpreted. The lack of this information may be one of the most important limitations of the relational database-mining framework: relational databases do not record how a table was derived, so tables lose their meaning. Second, when a source data item changes, all data items derived from it are no longer valid and must be recomputed. If the derivation process is not maintained inside the database state, it is impossible to know which data items are still valid and which are not. Furthermore, this information may allow the exploitation of incremental computation techniques, which update derived data items by considering only the changes in the source data items.
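The second point can be sketched in a few lines of Python. The class below is a hypothetical helper, not part of XDM: it assumes that each derivation is recorded as an output item, the statement that produced it, and its input items, and it then finds every item that becomes invalid when a source item changes. The item and statement names are invented for the example.

```python
from collections import deque


class DerivationLog:
    """Hypothetical record of which statement produced each derived item."""

    def __init__(self):
        self.produced_by = {}  # derived item -> (statement, input items)
        self.consumers = {}    # item -> items derived directly from it

    def record(self, output, statement, inputs):
        """Store the fact that `statement` derived `output` from `inputs`."""
        self.produced_by[output] = (statement, tuple(inputs))
        for item in inputs:
            self.consumers.setdefault(item, set()).add(output)

    def invalidated_by(self, changed):
        """Return every item derived, directly or indirectly, from `changed`."""
        stale, queue = set(), deque([changed])
        while queue:
            current = queue.popleft()
            for derived in self.consumers.get(current, ()):
                if derived not in stale:
                    stale.add(derived)
                    queue.append(derived)
        return stale


log = DerivationLog()
log.record("frequent_itemsets", "mine itemsets", ["purchases"])
log.record("association_rules", "generate rules", ["frequent_itemsets"])
log.record("customer_clusters", "cluster", ["customers"])

stale = log.invalidated_by("purchases")
print(sorted(stale))  # ['association_rules', 'frequent_itemsets']
```

Only the items reachable from the changed source are marked as stale; customer_clusters remains valid. Because the producing statement is stored with each item, a system built along these lines would also know how to recompute the stale items.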
Open description: XDM is an open representation of data and derivation processes. This may be exploited by the user, since she/he can clearly read the data and the process descriptions. However, we think that advanced data-mining operators may better exploit this fact. Indeed, process descriptions can be considered as background knowledge about data mining, and new operators can use it to better perform new sophisticated derivations.
The major drawback of choosing XML as the basis for our unifying framework is the amount of space required by XML representations compared with flat text or binary representations. For instance, source data sets might grow significantly when they are described in XML format, because of the added markup (tags and attributes). Consequently, this problem must be taken into account for huge data sets.
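The size overhead is easy to observe on a small scale. The following Python sketch serializes the same synthetic records once as comma-separated text and once as XML, and compares the resulting sizes. The data set, the column names, and the element-per-field XML layout are assumptions made for this example; the actual overhead depends on the chosen XML layout and on the data.

```python
import csv
import io
import xml.etree.ElementTree as ET


def as_csv(rows, columns):
    """Serialize the rows as comma-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


def as_xml(rows, columns):
    """Serialize the rows as XML, one element per record and per field."""
    root = ET.Element("dataset")
    for row in rows:
        record = ET.SubElement(root, "record")
        for name, value in zip(columns, row):
            ET.SubElement(record, name).text = str(value)
    return ET.tostring(root, encoding="utf-8")


# Synthetic data set, invented for the comparison.
columns = ["transaction", "customer", "item", "price"]
rows = [(i, f"c{i % 50}", f"item{i % 200}", 10 + i % 90) for i in range(1000)]

csv_size = len(as_csv(rows, columns))
xml_size = len(as_xml(rows, columns))
print(f"CSV: {csv_size} bytes, XML: {xml_size} bytes, "
      f"ratio: {xml_size / csv_size:.1f}x")
```

In this layout every field value is wrapped in an opening and a closing tag, so the markup is repeated for each record, whereas the column names appear only once in the comma-separated version.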
Earlier, we briefly introduced the Predictive Model Mark-up Language (PMML). We said that this language, developed by the Data Mining Group (DMG), is the first attempt to define an XML language for the interchange of data-mining and knowledge discovery models among heterogeneous applications over the Internet. Here we compare the features of PMML with those of XDM, in order to highlight the differences and the benefits provided by XDM with respect to PMML-based solutions.
PMML is devised to be a standard communication format. For this reason, it is not suitable as the basis for an integrated database environment that unifies several complex data-mining and knowledge discovery tasks under the same framework. In fact, PMML is not aware of the concept of database state and cannot represent source data together with patterns.
PMML is devised to describe patterns. However, it does not take into account that patterns might be extracted by a variety of data-mining tools with different semantics. Even if these tools produce the same kinds of patterns, the meaning of those patterns can differ significantly from tool to tool.
PMML does not describe processes or multistep derivations. In fact, nothing is said about how to reuse the patterns generated by a mining step; this task is left to the specific tools that receive the PMML documents.
For these reasons, PMML is not a good data model for advanced data-mining systems, such as inductive databases. The main advantage of XDM, by contrast, is that it copes with notions such as database state, derivation of data items, description of complex patterns, and complex mining statements. In the next section, we will present a new generation of data-mining systems that can be developed on the basis of XDM.