Figure 10.1 shows the basic model for profiling business rules. It differs from the models for handling column properties and data structures, where discovery routines could be constructed to suggest rules and augment those collected externally. For data rules there is no such discovery: the space of potential rules is almost infinite, so attempting to discover them from the data would be neither practical nor beneficial.
Data rules have an enormous semantic component. You need to understand how the data relates to other data in the context of how the business chooses to operate. Semantics are not discoverable by a program.
The first step is to gather the data rules you wish to apply to the data. Most metadata repositories do not have much in the way of data rules defined in them. There are multiple places to look for rules: source code of the applications that operate over the data, database stored procedures, business procedures surrounding the applications, and speculation using business and data experts.
A great many data rules are captured in the thousands of lines of source code that constitute the application software for the data. These applications contain code that screens for inaccurate data combinations or that sets data values based on other values.
Commercial software exists that assists analysts in examining the large volumes of source code that make up an application and extracting the discrete bits of logic that constitute specific business rules. Some of these are data rules and some are process rules. An experienced analyst can fabricate testable data rules from them.
Source code scavenging can be very useful in understanding the rules that evolved over time. However, this approach is rarely used. One reason is that the analysts doing the data profiling are not trained in the programming languages used for many legacy systems. Another is that the process is tedious and time consuming; often there is not enough time to tackle the mountains of source code that surround legacy applications. A further reason is that the source code may simply not be available, as is the case for packaged applications or for locally built applications where it has been lost.
Despite all of these difficulties, source code scavenging should always be considered as a step in the process of gathering data rules. It will not be practical in every situation, but it should at least be weighed in those cases where it is.
These encompass source code added to the database system to enforce data conditions on insert, update, or delete. They go by names such as stored procedures, triggers, and editprocs. This is a common structure in relational database systems and generally not available in older database systems. The concept is to add data-qualifying logic to the database engine and have it reject any change that would put violating records into the database.
Database procedures can be data rules or process rules. For example, an INSERT transaction may trigger the creation of other database records or determine the values of columns through rules or algorithms. A rule may set a CUSTOMER creditworthiness flag to high or low based on values in the INSERT transaction. Database procedures are more likely to hold true over all of the data because it is very difficult to backdoor the database and change things.
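As a minimal sketch of both flavors, the following uses SQLite triggers (the table, columns, thresholds, and trigger names are all invented for illustration): one trigger derives a creditworthiness flag on INSERT in the style of a process rule, and another rejects rows that violate a data condition outright.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    id          INTEGER PRIMARY KEY,
    balance     REAL NOT NULL,
    credit_flag TEXT            -- set by the trigger below, not by the application
);

-- Process-style rule: derive the creditworthiness flag from the INSERT values.
-- The 1000 threshold is a made-up example.
CREATE TRIGGER customer_credit_flag
AFTER INSERT ON customer
BEGIN
    UPDATE customer
       SET credit_flag = CASE WHEN NEW.balance >= 1000 THEN 'high' ELSE 'low' END
     WHERE id = NEW.id;
END;

-- Data-style rule: refuse any row that violates the condition outright.
CREATE TRIGGER customer_balance_check
BEFORE INSERT ON customer
WHEN NEW.balance < 0
BEGIN
    SELECT RAISE(ABORT, 'balance may not be negative');
END;
""")

conn.execute("INSERT INTO customer (id, balance) VALUES (1, 2500)")
flag = conn.execute("SELECT credit_flag FROM customer WHERE id = 1").fetchone()[0]
print(flag)  # high

try:
    conn.execute("INSERT INTO customer (id, balance) VALUES (2, -50)")
except sqlite3.IntegrityError as e:
    print(e)  # balance may not be negative
```

Because the engine itself runs these triggers, the rules hold even for data entered outside the normal application paths, which is the point made above.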
Scavenging rules from database stored procedures is similar to scavenging from application source code. It requires time and an expert in the procedure language. Some older database implementations did not preserve the original source statements for a procedure: what was stored in the database catalog was compiled pseudo-code of the logic, not the original statements, and it is not understandable in that form. In these cases, the logic is embedded in the CREATE TABLE statements and may no longer be available.
You can generally expect to get very little in the way of data rules from this source. Even where the facility is available, the harshness of allowing no violations leads many a database administrator to drop these procedures from the implementation. They can also severely slow down the execution of database calls, with unacceptable response times for transaction applications as a result. For these reasons they are not widely used.
One problem is that an analyst may find a CREATE TABLE script that contains many data rules, assume they are being used, and further assume it is not necessary to test for them. However, the actual database may not have the rules active, because a database administrator may have turned them off. In spite of all of these possibilities, it may be worthwhile to look for them and to convert them into testable data rules.
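One way to guard against that assumption is to inspect the live database catalog rather than trusting the CREATE TABLE script on file. A minimal SQLite sketch (the loan table is hypothetical, and we pretend a script on file declared a trigger that the live database never got):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Suppose the CREATE TABLE script on file also declares an enforcement
# trigger, but the live database was built without it (or the DBA
# later dropped it). Only the table itself exists here.
conn.execute("CREATE TABLE loan (id INTEGER PRIMARY KEY, rate REAL)")

# Query the live catalog to see which triggers are actually in force.
active = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'trigger'"
).fetchall()
print(active)  # [] -- the rule in the script is not actually enforced
```

Other engines expose the same information through their own catalogs (for example, `information_schema` views); the principle is the same: verify against the running database, then test the rule against the data regardless.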
Many data rules are encapsulated in business procedures and never captured in application logic. Instead, they are instructions to data entry personnel or to business clerks processing various transactions. For example, assigning an interest rate to customer loans may follow a very distinct business rule, which in turn suggests a data rule for measuring conformance to that business rule or for finding inaccurate data. The rule may not be executed in any programming language; it is simply a rule for the people entering data to follow. The businessperson must determine values externally to the programs and then enter the right value into the database. This leaves plenty of room for inaccurate computations, as well as inaccurate entry of results.
Once a data rule is captured from the business processes, it can be used to test the data for conformance. The fallout can be exceptions to the data rule, incidences of failure to follow the data rule, or inaccurate data entry. The data rule can also be used to automate the process and take much of the error out of operations.
Another source is to examine the data columns and structure and create data rules using common sense. This can surface rules that no one would otherwise think to check for. The simple rules that merely define commonsense relationships between data columns are the ones that often go undefined and yet yield inaccurate data when tested.
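For example, a commonsense rule might state that a shipment date cannot precede its order date. A minimal Python sketch (the order data and field names are fabricated):

```python
from datetime import date

# Hypothetical order rows; the commonsense rule is that a shipment
# cannot happen before the order that requested it.
orders = [
    {"id": 101, "ordered": date(2023, 3, 1), "shipped": date(2023, 3, 4)},
    {"id": 102, "ordered": date(2023, 3, 2), "shipped": date(2023, 2, 27)},  # violation
    {"id": 103, "ordered": date(2023, 3, 5), "shipped": None},  # not yet shipped: rule N/A
]

violations = [
    o["id"] for o in orders
    if o["shipped"] is not None and o["shipped"] < o["ordered"]
]
print(violations)  # [102]
```

Nothing in a schema or an application listing would necessarily state this rule, yet testing it is cheap and routinely turns up inaccurate dates.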
Gathering data rules through speculation is best done in a group setting that includes data structure experts, business analysts, and subject matter experts. Speculation should be done as a last step in the gathering process and should include a review of rules gathered from other sources as well. Examination of the data rules by business analysts and subject matter experts will help improve the completeness and accuracy of the list.
A group setting will yield many rules, probably more than you want to test. The group should prioritize the rules to help sort out which to test.
All of the rules you gather should be recorded in the data profiling repository. These rules are the subject of rule testing. No data rules should be eliminated from the data profiling repository, whether they are used or not.
Once the data rules are gathered, they need to be tested against the profiling data. Testing can be done through a business rule engine, a data quality software product, or by executing query statements against the data. Each data rule should be executed and row identifiers returned for all rows involved in violations. This information should be stored in the data profiling repository so that the analyst investigating the violations can look back through all of the data surrounding each violation.
The reason you want all of this information is that violation of a rule does not reveal inaccurate values. It only reduces the amount of data within which the inaccurate data exists. As long as more than one value is used to test the rule, you cannot tell which value causes the violation. In addition to the values used in the rule execution, surrounding values in the same rows can be used to help research the violation to determine if it is a true violation, if violations are connected to some other factors, or whether it is an exception to the rule.
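A sketch of this pattern via sqlite3 follows. Note that the violation table keeps every column of each violating row, not just the two columns the rule tests, because the rule by itself only narrows down where the inaccurate value lives. The shipment table, the rule, and the identifier 'R17' are all invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shipment (
    id      INTEGER PRIMARY KEY,
    weight  REAL,     -- used by the rule
    freight REAL,     -- used by the rule
    carrier TEXT      -- surrounding value, kept to help research the violation
);
INSERT INTO shipment VALUES
    (1, 10.0, 25.0, 'ACME'),
    (2, 10.0,  0.0, 'ACME'),   -- violates: zero freight on a nonzero weight
    (3,  0.0,  0.0, 'NONE');

-- Hypothetical rule R17: weight > 0 implies freight > 0.
-- Store whole violating rows in a repository table for investigation;
-- either weight or freight (or both) could be the inaccurate value.
CREATE TABLE rule_violations AS
    SELECT 'R17' AS rule_id, *
      FROM shipment
     WHERE weight > 0 AND NOT (freight > 0);
""")

rows = conn.execute("SELECT rule_id, id FROM rule_violations").fetchall()
print(rows)  # [('R17', 2)]
```

The analyst can then join `rule_violations` back to related tables to look for connected factors or legitimate exceptions, as described above.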
Because executing data rules against operational systems can be very processing intensive and disruptive, the testing of rules is often done using large samples of data and preferably on separate machines from those used for operational processing. The analyst needs to be able to execute, probe, and poke on the data to find as much information as possible about each data rule. Once the data rules are accepted as valid, they can then be executed against the real data source to find all violations.
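One simple way to draw such a sample, sketched against SQLite (the table is invented, and `ORDER BY RANDOM()` is SQLite-specific; other engines offer TABLESAMPLE or similar mechanisms):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO txn VALUES (?, ?)",
                 [(i, float(i % 500)) for i in range(1, 10_001)])

# Draw a random sample rather than scanning the full operational table.
sample = conn.execute(
    "SELECT id, amount FROM txn ORDER BY RANDOM() LIMIT 1000"
).fetchall()
print(len(sample))  # 1000

# Probe the candidate rule (amounts must lie in [0, 500)) on the sample;
# only after analysts accept the rule is it run against the full source.
violations = [r for r in sample if not (0 <= r[1] < 500)]
```

Working from a copy or a sample this way keeps the probing and poking off the operational machines.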
After executing the data rules, you need to reconvene the group of reviewers to determine if the output is reasonable. They need to consider whether the data rule formulation was correct, whether the execution was proper, and whether the output represents real violations.
Validation can cause data rules to be changed, either because they were formulated incorrectly or because the output reveals new insights to the businesspeople. They may have thought a rule was in effect, but the sheer volume of violations leads them to investigate, and they find out that the data rule is not being used or was changed to something else.
It is important to document everything about the output. If data rule exceptions are found instead of inaccurate data, the reason for the exceptions should be documented as well as the number of such exceptions. This may lead to data issues that result in either training for businesspeople or strengthening of rule conformance processes.
Another important part of this step is to classify rules according to their importance as well as their value in extracting issues. This matters because rules can be very expensive to execute in operational systems: they consume a lot of processing time and can negatively change the performance characteristics of transactions. Implementers need to make the proper decisions about whether, where, and when rules are to be executed, and they need input from the data profiling process to make these decisions intelligently.