11.3 Profiling Complex Data Rules

The data profiling process is not dissimilar to that for the previous section. You formulate a process for each data rule and execute it against the data. The difference is that you need to ensure that the data used is complete, and the execution time required can be much higher.

Data Rule Execution

Multiple-object rules tend to be much more complicated in their processing logic. Often it is not possible to formulate them as an SQL query. It may require a special program or script of SQL statements to be written by an application developer to test a data rule.

Formulation of the process is very important. They must not only accurately reflect the data rule but be optimized. Some of the data rules will require comparing every record in a table to every other record in the table. If not carefully crafted, this type of process can run for a very long time. Because these rules can require complex logic to execute, it is also possible that the logic is flawed. This means that you should run tests on the data rule logic before you run it against the real database.

Some rules may require access to data from more than one database at the same time. This may involve a scheduling or access issue. If the data is extracted and staged for profiling, it is important that the data from all sources conform to the same time-boundary requirements.

Note how this step often requires collaboration and cooperation with application programmers and database administrators. Everything runs much more smoothly if these folks are involved in the process early and are excited about what they are being asked to do.

Types of Complex Data Rules

Data rules for multiple business objects can take a large variety of forms. This section presents some of the more common forms to provide examples of what to look for. There are two broad categories of complex data rules: data rules that deal with column values and data rules dealing with aggregation.

Dates And Time

One category of multiple business object data rules is where multiple instances of the same object type cannot share the same space at the same time. The obvious example of this is date-and-time space.

An example of this is rental records for equipment. The data rule is that a single piece of equipment cannot be rented out at the same time to more than one client. If the records show an overlap, an error exists in the data. The test is to group all records by equipment type. Then for each record you need to check all other records. The testing logic (assuming all records have start date/times earlier than end date/times) is as follows:

 IF RECORD2.END_DATE&TIME IS
			 EARLIER THAN RECORD1.START_DATE&TIME THEN OK ELSE IF
			 RECORD2.START_DATE&TIME IS LATER THAN RECORD1.END_DATE&TIME THEN OK
			 ELSE ERROR;

Sorting the records by OBJECT_IDENTIFIER and START_DATE&TIME makes execution more efficient. This example is true of anything that requires serial usage. For example, checking books out of a library, scheduling of conference rooms, billable-time consulting hours, billable time for lawyer charges, flying records for corporate jets, or expense records for executives or salesmen can all be checked for overlap inconsistencies using this logic.

This testing can go one step further for objects that should have a record for all or most time periods to compute the time period durations between any rental or checkout. If these are unexpectedly long, it may indicate either inaccurate dates in some of the records or missing records. Durations of checkout time periods can also be looked at to determine if they are too short to be reasonable.

An interesting application that demonstrates these principles is genealogy. Records that define a person in a family tree must satisfy a number of date and time relationships with others in the tree. For example, someone listed as a father of a person must have a reasonable duration of time between his birth date and the birth date of the child. The duration of each person's life must be reasonable. A child cannot be born after a parent is dead. In addition to these checks, it begs for checks for multiple fathers or mothers listed for the same person, as well as other relationships that make no sense.

Location

Time may not be the only factor that causes two objects to be incompatible. Location may be as well. For example, consider an inventory database that has a column for storage location down to the bin number. If you discovered two or more discrete items with the same physical location, this would indicate that an inaccuracy exists.

Other Types Of Exclusivity

There may be other columns that have an exclusion relationship across the rows of a database. For example, consider the SPOUSE column of a personnel record. You would not expect two employees to share the same spouse.

The way this error happens is that one employee gets a divorce and the spouse marries someone else in the same company. The first person fails to update personnel records to show the change in marital status. When the second person updates his record, the database has an incompatibility. This is an example of data decay that is surfaced through analysis of the SPOUSE column using a data rule.

Anytime you have the requirement that the value of one column cannot be shared with other objects and the column is not a structural column that would be enforced through database key processing, you probably have errors that have gone undetected. Profiling these columns will determine the extent of errors in the column and lead you to determining if corrective actions are needed or not.

Aggregations

Aggregations are tests not on individual column values but rather on a computation over a group of objects. The types of aggregation values that can be the subject of the rule are COUNT, SUM, AVERAGE, MEDIAN, STANDARD DEVIATION, or anything else that makes sense.

These are good rules for checking for completeness. For example, you may know that the number of trips taken by a truck should be greater than 10 per month. This is the minimum any truck would take. The rule might be

 GROUPBY TRUCK, MONTH IF
			 OUT-OF-SERVICE NOT YES THEN NUMBER_TRIPS GREATER THAN 10

Sometimes computations are made and checked against data from a different source. For example, accounting records may be required to cross-reference against data maintained at the departmental level.

Lookup

Another way of identifying the existence of inaccurate data is to check the data in a database with corresponding data in another database that should be the same. The simplest form of correlation is lookup. You have a column in the subject database that must contain only values that are in another file or database column. This is another form of verifying that the value is in the set of acceptable values. Reasons this would not be considered a column property (list of acceptable values) may be that the list is too long to encapsulate in metadata or that the list changes over time.

The next step up is a case in which the value in the subject database has a key and the key can be used to look up the corresponding value in another database. For example, an employee Social Security number on a 401K record can be correlated against the employee Social Security number in the HR database. In this case, there are two columns involved, employee number (or name) and Social Security number. If they do not correlate, the error may be in either (or both) columns.

Correlations can get more complicated. For example, in name and address error checking, you can use a database supplied by the U.S. Postal Service that identifies all valid combinations of street name and number, city, state, and Zip code. Name and address correlation is best done with specialized software designed exclusively for that purpose. This is so complicated and important that entire companies have emerged that provide this software.

The other source can either be more trusted or less trusted. If trusted, the lookup source is considered correct when they differ. Less trusted sources can also be used. If they deliver a difference and you have found an inaccurate data situation, you just do not know which one is wrong. In addition, remember that when verifying with less trusted sources you may get correlation while both values are wrong.

You need to be careful when establishing a correlation test. Sometimes the test is not valid but appears to be so. For example, an ORDER_DETAIL record may contain a UNIT_PRICE field, as does the INVENTORY record. What appears to be duplicate data may be tempting to use for accuracy correlation. However, the UNIT_PRICE in the order detail may be different for two reasons: it reflects the unit price on the date the order was placed, not on the current date, or the customer was given a special unit price just for that order. This underscores the need to understand your data thoroughly.