5.1 Turning Facts into Issues

Data quality investigations turn up facts. The primary job of the investigations is to identify inaccurate data. The data profiling process will produce inaccuracy facts that in some cases identify specific instances of wrong values. Other cases identify where wrong values exist but identification of which value is wrong is not known, and in yet other cases identify facts that raise suspicions about the presence of wrong values.

Facts are individually granular. This means that each rule has a list of violations. You can build a report that lists rules, the number of violations, and the percentage of tests performed (rows, objects, groups tested) that violated the rule. The violations can be itemized and aggregated.

Metrics

There is a strong temptation for quality groups to generate metrics about the facts and to "grade" a data source accordingly. Sometimes this is useful; sometimes not. Examples of metrics that can be gathered are

number of rows containing at least one wrong value
graph of errors found by data element
number of key violations (nonredundant primary keys, primary/foreign key orphans)
graph of data rules executed and number of violations returned
breakdown of errors based on data entry locations
breakdown of errors based on data creation date

The data profiling process can yield an interesting database of errors derived from a large variety of rules. A creative analyst can turn this into volumes of graphs and reports. You can invent an aggregation value that grades the entire data source. This can be a computed value that weights each rule based on its importance and the number of violations. You could say, for example, that this database has a quality rating of 7 on a scale of 10.

The Good

Metrics can be useful. One use is to demonstrate to management that the process is finding facts. The facts have little to no significance by themselves but can be circumstantial evidence that something is wrong with the data. When a data quality assurance department is trying to gain traction in a corporation, metrics can be a useful way to show progress.

Metrics can also be useful to show improvements. If data is profiled before and after corrective actions, the metrics can show whether the quality has improved or not.

Another use of metrics is to qualify data. Data purchased from outside the corporation, such as demographic data, can be subjected to a quick data profiling process when received. Metrics can then be applied to generate a qualifying grade for the data source. It can help determine if you want to use the data at all. This can be used to negotiate with the vendor providing the data. It can be the basis for penalties or rewards.

Qualification can also be done for internal data sources. For example, a data warehousing group can qualify data extracts from operational groups before they are applied to the central data warehouse.

The Bad

The downside of metrics is that they are not exact and they do not solve problems. In fact, they do not identify what the problems are; they only provide an indicator that problems exist.

Earlier chapters demonstrated that it is not possible to identify all inaccurate data even if you are armed with every possible rule the data should conform to. Consequently you cannot accurately estimate the percentage of inaccuracies that exist. The only thing you know for sure is that you found a specific number of inaccuracies. The bad news is that there are probably more; the good news is that you found these. If the number you find is significant, you know you have a problem.

Corrective actions have these potential consequences: they can prevent recurrence of some errors that you can detect, they can prevent recurrence of errors you cannot detect, and they can continue to pass errors through. It is also theoretically possible that you would introduce new errors that may or may not be detectable.

The conclusion is that data profiling techniques can show the presence of errors but cannot show the absence of errors nor the number of errors. Therefore, any metrics derived from the output of profiling are inexact. This does not make them useless. On the contrary, the errors found are true errors, and if there are enough of them you have uncovered true problems.

You might conclude from the previous discussion that the number of errors reported is understated. This would be great if it were true. However, poorly defined metrics can actually overstate the error condition. This occurs when a single inaccurate value triggers multiple rule violations. This is difficult to detect and impossible to quantify. When you consider that the majority of rules will find the presence of inaccurate data but will not pinpoint the offending values, you can see why it is difficult, if not impossible, to find the true number of inaccurate values.

Comparing metrics can also be misleading if the yardstick changes between profiling exercises. As analysts gain more knowledge about a data source, they will add to the rule set used to dig out inaccuracies. Comparing two result sets that are derived from different rule sets results in an apples-to-oranges comparison. All presentations of quality metrics need to provide disclaimers so that the readers can understand these dynamics.

The following is an example of preventing recurrence of errors you never detected. A medical clinic's internal system records a code for the medical procedure performed, as well as the gender of the patient. It is discovered in data profiling that procedures are being recorded that are not possible for the gender code recorded. These are inaccuracy facts.

However, the root cause is that the procedure codes are handwritten on paper forms and then sent to the data entry office. Many of them are illegible or missing. The data entry staff has no way of verifying the correct procedure and are motivated to get the data into the system rather than fix it. In addition to the procedure codes being invalid in the case of gender conflicts, there are probably many other procedure codes that are wrong. However, because they are valid procedure codes, they are not detected.

The remedy called for having the data entered directly online by the administrators of the doctors instead of transferring paper documents to a central data entry function. Because so many errors were noted, the new form displays a text description of the procedure when it is entered with a confirmation button. This helps the administrators confirm that they have entered the correct code.

Checks were put in for gender/procedure code conflicts, as well as other conflicts, such as invalid patient age/procedure code combinations. In addition, administrators were educated on the importance of correct procedure codes. Because of the better data entry procedures, the number of errors prevented not only included those that were detectable but many others that were not detectable through analysis.

An additional problem with metrics is that data quality assurance departments often believe that this is the end of their mission. They define their work product as the metrics. However, metrics do not define the source of problems nor the solutions. To improve data quality you need to follow through on getting improvements made. To hand the responsibility for this to other departments is a guarantee that the work items will sit low on priority lists of things to do and will not get done expeditiously. The data quality assurance department needs to track and drive the issues through to solution.

Metrics are not all bad. They are often a good shock factor for driving actions. When you give management a presentation that says the HR database records revealed 700 inaccurate values, this can raise eyebrows and produce a call for action. Knowing that you have 700 and that the real number is higher can be motivation enough.

Often a single fact is more shocking than statistical metrics. For example, telling management that a profiling exercise of the birth date of employees revealed that the youngest employee in the company has not been born yet and that the oldest was born before the Civil War is far more effective than a metric at getting across the point that improvements are needed now. (I did not make this up; it was an actual output of a data profiling exercise.)

Issues

The real output of the fact collection phase is a set of issues that define problems that need to be solved. A single statistic can result in an issue. For example, 30% of the purchase order fields have no supplier ID number. Alternatively, several facts can be grouped into one issue. For example, the customer name and address data is severely flawed: 5% of name fields have invalid names, 15% of address fields are inaccurate or blank, 12% of city fields are blank, 5% of city fields are misspelled, and 12% of Zip codes are invalid or blank. This single issue rolls up several inaccuracy facts into a single issue that needs to be addressed. Addressing each inaccuracy fact is an inefficient use of time.

Issues need to be recorded in a database within an issues tracking system. Each issue needs a narrative description of the findings and facts that are the basis for the issue. It is important to identify the facts and the data source so that comparisons can be correctly made during the monitoring phase. The information needed for the data source is the identification of the database used, whether samples or the entire database were used, the date of the extraction, and any other information that will help others understand what you extracted the facts from. In tracking the issues, all meetings, presentations, and decisions need to be recorded along with dates and persons present.