7.3 Data Profiling Methodology

7.3 Data Profiling Methodology

Data profiling should follow a specific methodology to be most effective. The methodology provides for an orderly and logical progression of investigations that build information from one level to the next.

Data profiling methodology uses a bottom-up approach. It starts at the most atomic level of the data and moves to progressively higher levels of structure over the data. By doing this, problems at lower levels are found and can be factored into the analysis at the higher level. If a top-down approach is used, data inaccuracies at the lower levels may confuse the process and make it difficult to establish the true data rules. This is why a top-down approach will not work effectively in the face of data inaccuracies. Many analysts will correct data inaccuracies at each level before moving to a higher level. This is done to make the higher levels of data profiling more successful.

Process Steps

Figure 7.2 shows a diagram of the major steps in the data profiling methodology. Each step concentrates on a specific type of rule, builds or verifies the rule definitions, and then uses this to find discrepancies between the data and the rules.

Click To expand Figure 7.2: Data profiling steps.

Data profiling users often only go partway through the methodology. They find so much useful information in the first step or two they believe they do not need to go further. Eliminating steps in the process leaves open the opportunity for missing facts about inaccurate data. You will find fewer of the potential data inaccuracy facts that are there to be found. The ones you miss may be more useful than the ones you find.

The diagram is a general concept. In practice, a single inaccurate value can cause rule violations in all steps. That does not mean that only one instance of rule violation per inaccurate value is valuable to detect. Each rule violation has a unique business implication. Thus, a single inaccurate value can explode into multiple external problems. It is important to know all of the violations to understand the full potential of impacts possible from the inaccuracy.

Each of the steps, introduced in the sections that follow, is the topic of a separate chapter. The chapters go into greater detail with real examples of the rules and the type of investigative thought required to be effective.

Column Property Analysis

This type of analysis looks at the values stored in a single column independent of all other columns. It is the subject of Chapter 8. The metadata includes properties for each column that define what a valid value is within that column.

For example, a property might be that the DEPARTMENT_ID column is numeric and must be in the range of 100 to 999. A value of 010 would be output as an invalid value

The property list for each column is really a rule set involving only a single column. The more properties included, the more bad values that can be found. The more complete the property definitions, the more opportunity exists for finding bad values.

Too often analysts use the database column specifications as the properties list. This is generally too limited to find most invalid values. The database specifications usually include physical storage properties and are not refined or specific enough to be used for invalid value testing.

Sometimes analysts confuse the column property list with domain definitions. These are different objects. A column may be required to conform to the definition of one or more domains but have additional restrictions that apply only to that particular use of the domain. This concept is described in the next chapter.

Violations of properties discovered are all inaccurate values. However, not all inaccurate values are found through this process. Values that are invalid but for which you do not have a test are not found. Neither are values that are valid but not correct.

Structure Analysis

Structure analysis uses rules that define how columns relate to each other to form tables and how tables relate to each other to form business objects. A business object is the data that describes a single business event or thing. For example, the data for a single customer order is a business object. The order object may consist of multiple parts, such as an order header, customer information, delivery information, and product line items. Each part would consist of one or more columns. The parts are normally stored in separate data tables. The collection of data for all orders is a database or a part of a database. This is shown in Figure 7.3. A full description of structure analysis is provided in Chapter 9.

Click To expand
Figure 7.3: Example of a business object.

The structure rules define the columns that uniquely identify a specific business object or table within the business object. It also defines rules for how tables relate to each other. For example, the order has an ORDER_ID column that defines each order uniquely. The order line items are identified by the ORDER_NUMBER column plus the PRODUCT_ID column. There can be one to many line item parts for each order.

Structure analysis deals with primary keys, primary / foreign key pairs, redundant data columns, column synonyms, and other referential constraints. It also deals with the sticky problem of denormalized data sources.

This is a very important component of data profiling. Every business object has a natural structure. This structure is often not known nor reflected accurately in source system definitions. For example, denormalization is rarely documented anywhere.

Knowing the columns that define the connections between object tables and knowing the rules on how tables of business objects relate to each other is essential in attempting to move data to other data structures that may have a very different set of structure rules.

In addition to the definition of how parts of a business object relate to one another, it is also important to discover how one business object relates to other business objects. For example, the order object above may relate to a customer master business object through the CUSTOMER_NUMBER column. These interobject relationships are important to know when databases are being reengineered or migrated to packaged applications. These outward connections may need to be reengineered as well to maintain consistency.

Database management systems often provide support for structural rule enforcement. Relational systems are generally very good at this. However, there are many opportunities for rules not to be tested or defined to the DBMS system. Nondatabase systems data sources, such as flat files, can contain may cases of rule violations. Identifying these inaccuracies in the data is very important if the data is to be moved to a well-defined target database system. That system will most likely reject the load of the data whenever it encounters rule violations. Discovering these in advance can avoid costly backtracking in data movement processes.

Structure analysis is particularly useful for completing and correcting the metadata. In addition, violations identify specific cases where inaccurate data exists. However, because each structure rule involves more than one column or more than one row of a single table, violations only reduce the set of rows within which inaccurate data exists. They cannot identify which specific values within that set are wrong. The outputs describing the inaccurate data facts are the structure rule identifier, relation, and rowID sets within which the violation occurs.

Simple Data Rule Analysis

Once you have identified all of the invalid data values within columns and all of the structure rule violations, it is time to get more specific about rules that require that values across multiple columns within a business object be acceptable combinations of values. These are defined as data rules. A data rule is a rule that specifies a condition that must hold true across one or more columns at any point in time.

Data rules are a subset of business rules. Business rules include several types of rules that are not data rules. For example, a rule that says that IF LICENSE_TYPE =‘HEAVY_EQUIPMENT’ THEN CURRENT_DATE BIRTH_DATE > 21 YEARS is a data rule. A rule that says that IF CUSTOMER_TYPE = ‘GOLD’ THEN SET DISCOUNT = ‘10%’ is a process rule and not a data rule. Both of them are business rules.

The data profiling analyst and the business analysts would normally collect or speculate about rules that should hold true over the data. These are converted to executable logic and then tested against the data. This is truly an iterative process that does double duty: solidifying the set of rules in the metadata and identifying specific sets of rows in the data that violate them. The output about inaccurate data facts is again relation and rowID sets connected to rule identifiers.

Because data rules involve more than one value, they cannot tell which value within the error set is wrong, only that the combination of values is wrong. A full description of simple data rule analysis is provided in Chapter 10.

Complex Data Rule Analysis

More complex data rules require values to conform to the rule over multiple business objects. An example might be a rule that says that a specific customer cannot be both a retail customer and a commercial customer. This requires finding all of the customer records for each customer and then checking for this condition within the set of records for each customer.

The only difference between this set of rules and the set described for single business objects is that the amount of data needed to test the rule is greater, involving the data of multiple business objects and identified violations, such as inaccurate data that is hidden within a much larger set of data rows. However, this may not always be true. Sometimes the rule can isolate the set of rows that contain the inaccurate values to as few as one row.

It is often not productive to capture all rows that constitute the set of rows within which the inaccurate data exists and store it in the inaccurate data fact table. It generates too many values that have no analytical value. However, it is valuable for these cases to record the ruleID and the number of specific violations of the rule. A full description of complex data rule analysis is provided in Chapter 11.

Value Rule Analysis

Sometimes the existence of inaccurate data can be spotted through aggregation numbers that are obvious to the analyst as being unreasonable. For example, seeing the frequency of occurrence of each value in a column may indicate that one or more values have far too low or far too high a frequency. This may lead the analyst to the conclusion that data is not being entered correctly.

Often it is not possible to convert these into hard rules that say that a particular value should be precisely x% of the data or that it should fall in a specified range. The analyst should collect these types of queries from the business analysts and run them against the data, presenting the results back to the business analysts for reasonable checks.

Examples are cardinality, counts, sums, averages, medians, frequencies, standard deviations, and so on. Any computed value that can be used to test the collection of data for completeness or reasonable values can be used. A full description of value rule analysis is provided in Chapter 12.