Figure 42.37 shows the underlying data files that are generated for the cube. This is the Windows Explorer window of the directory that is created for the Analysis Services database (CompSales in this example). Do not modify or erase any of these files or subdirectories! You can also see subdirectories for each cube and for data mining models.
Security is straightforward in Analysis Services. For each database or cube, roles are identified with varying levels of granularity for users. It is these roles that are used when accessing the data in the cubes. The way it works is that a role is defined and then an individual user or group who is a member of that role (at the NT/Windows level) is assigned that role. Figure 42.38 shows the creation of a database role. You can see where the role will be enforced (such as at the server), and what rules should apply at a dimensional level. All users and groups of a database role have the same access unless it was overridden in cube roles of the same name.
With Analysis Services, the feature of defining OLAP data mining models has arrived. Data mining is the process of understanding potentially undiscovered characteristics or distributions of data. Data mining can be extremely useful for OLAP database design in that patterns or values might define different hierarchy levels or dimensions that were not previously known. As you create dimensions, you can even choose a data mining model as the basis for a dimension.
Basically, a data mining model is a virtual structure that represents the grouping and predictive analysis of relational or multidimensional data. It is composed of rules, patterns, and other statistical information of the data that it was analyzing. These are called cases. A case set is simply a way of viewing the physical data. Different case sets can be constructed from the same physical data. Basically, a case is defined from a particular point of view. If the algorithm you are using supports the view, you can use mining models to make predictions based on these findings.
Another aspect of a data mining model is that of using training data. This is a process that determines the relative importance of each attribute in a data mining model. It does this by recursively partitioning data into smaller groups until no more splitting can occur. During this partitioning process, information is gathered from the attributes used to determine the split. Probability can be established for each categorization of data in these splits. It is this type of data that can be used to help determine factors about other data that is utilizing these probabilities. This training data, in the form of dimensions, levels, member properties and measures, is used to process the OLAP data mining model and further define the data mining column structure for the case set.
In Analysis Services, Microsoft provides two primary types of data mining algorithms; decision trees and clustering. (Other algorithms are accessible via different providers.)
To create an OLAP data mining model, Analysis Services will use a source OLAP cube, a particular data mining technique/algorithm, case dimension and level, predicted entity, and training data. The source OLAP cube provides the information needed to create a case set for the data mining model. It will use a dimension and level that you will choose to establish key columns for the case sets. You then select the data mining technique (decision tree, clustering, or one from another provider). The case dimension and level provide a certain orientation for the data mining model into the cube for creating a case set. The predicted entity can be either a measure from the source OLAP cube, a member property of the case dimension and level, or any member of another dimension in the source OLAP cube.
The wizard can also create a new dimension for the source cube or a virtual cube based on the source cube. This enables users to query the data mining data model data just as they would query OLAP data.
By right-clicking on the Mining Model folder, a new mining model can be defined (Wizard). You will be given the option of either creating your mining model from relational data or OLAP data. Figure 42.39 shows this option. The purpose of the new mining model that this example will show you is that of determining product sales characteristics. It will be based on the information you built in your CompSales OLAP cube.
You will now select the source OLAP cube in which you want to base your data mining model. Your example will select the CompSalesMonthly cube within the CompSales databases (see Figure 42.40).
You must now select the data mining technique of either clustering, or decision trees. The following describes these algorithms in more detail:
Clustering?This algorithm finds natural groupings of data in a multidimensional space. It is useful when you want to see general groupings in your data, such as hot spots.
Decision trees?This algorithm chooses significant characteristics in the data and narrows down sets of data based on those characteristics until clear correlations are established. Decision trees are helpful when you want to make specific predictions based on characteristics of the source data, such as products that have similar sales patterns.
You will then select the case or point of view for the analysis. Figure 42.41 illustrates the case to be based on the product dimension and at the SKU level (individual product level).
You will now establish the training data for this mining model to be that of the case dimension (the default) and, optionally, at least one other piece of data on which to base it (see Figure 42.42).
The mining model is now specified and must be processed. When it has finished executing, you will be automatically taken into the Mining Model Editor. Here you can delve into the discovered characteristics of your mining model as well as change options and reprocess the model (see Figure 42.43).
Microsoft Data Transformation Services (DTS) provides a means to move data between sources. Data can be exported, validated, cleaned up, consolidated, transformed, and then imported into the destination.
You can combine multiple column values into a single calculated destination column or divide column values from a single source column into multiple destination columns. You might need to translate values in operational systems. For example, many OLTP systems use product codes that are stored as numeric data. Few people are willing to memorize an entire collection of product codes. An entry of 100235 for a type of shampoo in a product dimension table is useless to a VP of marketing who is interested in how much of that shampoo was sold in California in the past quarter.
Cleanup and validation of data is critical to the data's value in the data warehouse. The old saying "garbage in, garbage out" applies. If data is missing, redundant, or inconsistent, then high-level aggregations can be inaccurate, so you should at least know that these conditions exist. Perhaps the data is rejected for use in the warehouse until the source data can be reconciled. If the shampoo of interest to the VP is called "Shamp" in one database and "Shampoo" in another, then aggregations on either value do not produce complete information about the product.
The DTS package defines the steps in a transformation workflow. You can execute the steps serially and in combinations of serial, parallel, or conditional. For more information on Microsoft DTS, refer to Chapter 20, "Importing and Exporting SQL Server Data Using BCP and DTS."
The Repository is a sharable metadata information store that supports the data warehouse environment by maintaining technical information on datasources and targets, transformations and mapping (such as DTS packages), data cleanup and enhancement rules, and scheduling. Business metadata is stored for end users to facilitate use of the data warehouse. You can also store query and report definitions and documentation in the Repository.
The Repository is based on the Unified Modeling Language (UML) with Microsoft Extensions to provide a common data warehousing infrastructure. Object sharing and reuse is implemented in a visual component manager for Visual Studio.
The Repository is installed by default in the msdb database in the Workgroup and Enterprise Editions of SQL Server. A SQL server can contain multiple Repositories, but DTS supports only a single Repository database per server in SQL Server Enterprise Manager.
Recently, a data warehousing manager revealed her frustration with the OLAP performance she was getting (using HOLAP and MOLAP approaches) with Analytic Services. This was put to rest by relocating the OLAP physical storage components on a new Solid State disk device (a persistent memory device). The performance gains were tenfold. The price of this type of technology has dramatically been reduced within the last 12 months and the ease of transparently applying this type of solution to OLAP was a natural fit. It affected both the OLAP data population process as well as the day-to-day "what if" usage by the end users. Keep these types of surgical incisions in mind when faced with OLAP performance issues in this platform. They are easy to apply, the gains are huge, and the ROI is quickly attained.