10.2 The Process for Analyzing Simple Data Rules

Figure 10.1 shows the basic model for profiling business rules. This is different from the model for handling column properties and data structures. In those cases, discovery routines could be constructed to suggest rules, to augment those that were collected externally. In the case of data rules, there is no such discovery. The potential for data rules is almost infinite from a discovery point of view. It would not be practical or beneficial to try to discover data rules from the data.

Figure 10.1: Process for analyzing simple data rules.

Data rules have an enormous semantic component. You need to understand how the data relates to other data in the context of how the business chooses to operate. Semantics are not discoverable by a program.

Gathering Data

The first step is to gather the data rules you wish to apply to the data. Most metadata repositories do not have much in the way of data rules defined in them. There are multiple places to look for rules: source code of the applications that operate over the data, database stored procedures, business procedures surrounding the applications, and speculation using business and data experts.

Source Code Scavenging

Many, many data rules are captured in the thousands of lines of source code that constitute the application software for the data. These applications have code in them that screen for inaccurate data combinations or that set data values based on other values.

Commercial software exists that assists analysts in examining the tons of source code over an application and extracting discrete bits of logic that constitute a specific business rule. Some of these are data rules and some are process rules. An experienced analyst can fabricate testable data rules from them.

Source code scavenging can be very useful in understanding the rules that evolved over time. However, this approach is rarely used. One reason is that the analysts doing the data profiling are not educated in the programming languages used for many legacy systems. Another reason is that the process is very tedious and time consuming. Often there is not enough time to tackle the mountains of source code that surround legacy applications. An additional reason is that the source code may not be available. This is true for packaged applications or for some locally built applications where it just got lost.

The existence of source code that enforces a rule or executes a process rule would normally appear to be something that does not need to be tested. However, such is not the case. Violations can still exist in the data for a variety of reasons.

One reason is that the code you are looking at may not be the only path for the data to enter the database through. In this day of multiple data channels, there may be many different application paths for creating the same data. For example, an order may be placed by a data entry operator using a legacy green-screen application, or it may be placed using a new XML-based Internet application, or still further placed through a pseudo-legacy application built around a 4GL using EDI (Electronic Data Interchange) format as a data source. Each of these paths to the data may apply rules differently. More likely, some would apply a rule and some would not.

Another way for data to get into a violation state is through updates that come through a separate application path. Rules may be enforced on initial entry but not enforced when updates are made.

Another source is ad hoc queries coming from a database administrator or other person who knows how to use SQL to change things. This often gets used to handle emergency problem situations such as failure on load after a reorganization because of a constraint violation.

And yet another source is that the data in question may have been created before the source code was employed. When you look at source code, you are looking at only the code that is executed now, when new data is created. The data in the database may have been created months ago, when the current source code was not in use. If no attempt was made to bring all old data into conformance with a new rule, there may well be many violations of the rule.

Despite all of these difficulties, source code scavenging should always be considered as a step in the process of gathering data rules. If it is not reasonable to do so, fine. At least it should be considered for those cases where it is reasonable to do so.

Database-Stored Procedures

These encompass source code that is added to the database systems to enforce data conditions on data insert, update, or delete. These are sometimes called stored procedures, triggers, editprocs, and other things. This is a common structure in relational database systems and generally not available in older database systems. The concept is to add data qualifying logic to the database engine and have it deny any data entry to the database records that violate it.

Database procedures can be data rules or process rules. For example, an INSERT transaction may trigger the creation of other database records or determine the values of columns through rules or algorithms. A rule may set a CUSTOMER credit worthiness flag to high or low based on values in the INSERT transaction. Database procedures are more likely to hold true over all of the data because it is very difficult to backdoor the database and change things.

Scavenging rules from database-stored procedures is similar to scavenging from application source code. It requires time and an expert in the procedure language. Some of the older database implementations did not preserve the original source language statements for the procedure. The procedure as stored in the database catalog was a compiled pseudo-code of the logic, not the original statements. It is not understandable in that form. In these cases, the logic is embedded in the CREATE TABLE statements and may no longer be available.

You can generally expect to get very little in the way of data rules from this source. Even though the facility is available, the harshness of no violations allowed leads many a database administrator to drop them from the implementation. They can also severely slow down the execution of database calls with subsequent unacceptable response times to transaction applications. This means that they are not used a lot.

One of the problems seen is that an analyst may find a CREATE TABLE script that contains many data rules and assume they are being used, and further assume it is not necessary to test for them. However, the actual database may not have the rules active, because a database administrator may have turned them off. In spite of all of these possibilities, it may be worthwhile to look for them and to convert them into testable data rules.

Business Procedures

Many data rules are encapsulated in business procedures and not captured in the application logic. Instead, they are instructions to data entry personnel or to business clerks processing various transactions. For example, the assigning of an interest rate to customer loans may be a very distinct business rule, resulting in a potential for a data rule to determine the degree of conformance to the business rule or to find inaccurate data. It may not be executed in any programming language, just a rule for people entering data to follow. The businessperson must determine values externally to the programs and then enter the right value into the database. This leaves plenty of room for inaccurate computations, as well as inaccurate entry of results.

Once a data rule is captured from the business processes, it can be used to test the data for conformance. The fallout can be exceptions to the data rule, incidences of failure to follow the data rule, or inaccurate data entry. The data rule can also be used to automate the process and take much of the error out of operations.

Speculation

Another source is to examine the data columns and structure and create data rules using common sense. It can be very helpful to do this to create rules no one would expect to check for. The simple rules that just define commonsense relations between data columns are those that are often not defined and yet yield inaccurate data when tested.

Gathering data rules through speculation is best done in a group setting that includes data structure experts, business analysts, and subject matter experts. Speculation should be done as a last step in the gathering process and should include a review of rules gathered from other sources as well. Examination of the data rules by business analysts and subject matter experts will help improve the completeness and accuracy of the list.

A group setting will yield many rules, probably more than you want to test. The group should prioritize the rules to help sort out which to test.

All of the rules you gather should be recorded in the data profiling repository. These rules are the subject of rule testing. No data rules should be eliminated from the data profiling repository, whether they are used or not.

Testing Simple Data Rules

Once the data rules are gathered, they need to be tested against the profiling data. Testing can be done through a business rule engine, a data quality software product, or by executing query statements against the data. Each data rule should be executed and row identifiers returned for all rows involved in violations. This information should be stored in the data profiling repository so that the analyst investigating the violations can look back through all of the data surrounding each violation.

The reason you want all of this information is that violation of a rule does not reveal inaccurate values. It only reduces the amount of data within which the inaccurate data exists. As long as more than one value is used to test the rule, you cannot tell which value causes the violation. In addition to the values used in the rule execution, surrounding values in the same rows can be used to help research the violation to determine if it is a true violation, if violations are connected to some other factors, or whether it is an exception to the rule.

Because executing data rules against operational systems can be very processing intensive and disruptive, the testing of rules is often done using large samples of data and preferably on separate machines from those used for operational processing. The analyst needs to be able to execute, probe, and poke on the data to find as much information as possible about each data rule. Once the data rules are accepted as valid, they can then be executed against the real data source to find all violations.

Validation of Output Data

After executing the data rules, you need to reconvene the group of reviewers to determine if the output is reasonable. They need to consider whether the data rule formulation was correct, whether the execution was proper, and whether the output represents real violations.

Validation can cause data rules to be changed either because they were formulated wrong or because the output reveals new insights to the business-people. They may have thought a rule was in effect, but the sheer volume of violations leads them to investigate, and they find out that the data rule is not being used or was changed to something else.

It is important to document everything about the output. If data rule exceptions are found instead of inaccurate data, the reason for the exceptions should be documented as well as the number of such exceptions. This may lead to data issues that result in either training for businesspeople or strengthening of rule conformance processes.

Another important part of this step is to classify rules according to their importance as well as to their value in extracting issues. This is important because rules can be very expensive to execute in operational systems. They consume a lot of processing time and can negatively change the performance characteristics of transactions. Implementors need to make the proper decisions about if, where, and when rules are to be executed. They need input from the data profiling process to make these decisions intelligently.