2.6 Finding Inaccurate Values

Some of the inaccurate values can be found, but probably not all of them. In most cases a high percentage of inaccurate values can be found if enough effort is spent searching for them.

Figure 2.2 shows the general potential of any collection of data. In any collection there will be some data that is accurate and some that is not. This is shown by the first column. Of the data that is not accurate, some of it is valid and some of it is invalid.

Figure 2.2: Chart of accurate/inaccurate values and those that are findable and fixable.

There are two ways you can go about finding inaccurate data: reverification and analysis. If you want to be 100% sure that all values are accurate, you must use reverification. Even then, you cannot achieve this for many types of data. The reverification process may not be possible for some data. Reverification is done by people, and they may make mistakes in the reverification process.

The maximum potential of analytical techniques also falls short of perfect even if you employ all analytical techniques available.

Reverification

Reverification means that you manually go back to the original source of the information and check every value. It is the only sure way to determine which values are right and which are wrong. The reason for this is that analytical techniques cannot tell if a valid value is accurate unless there is a secondary source of information to correlate the value against.

Reverification is not always possible. For example, if the data records temperatures as of a point in time, it is impossible to go back in time and measure them again.

Reverification is also susceptible to errors. You would generally think that verifiers would be more careful than the people who originally entered the data. However, this may not be true. In addition, the source of information that provided wrong information initially may continue to provide wrong information in the reverification. If someone lies about their age initially, they will probably lie about it in the reverification.

You would expect that a reverification would improve the quality of the data by some amount. It would catch more errors than it would create. However, you can never be sure that it catches them all.

Reverification takes a long time. If it were implemented as an operational technique to ensure the quality of the data before use, it would certainly violate the timeliness requirement for almost all uses.

The primary problem with reverification is that it is impractical for most databases. It is enormously time consuming and expensive. Reverification on databases that are continuously changing is extremely difficult if not downright impossible. For example, a web site that takes 10,000 orders per day is not likely to seek reverification of the information on every order. Companies are not willing or able to use this technique widely.

There are isolated cases for which reverification is practical. On small but important databases, it is not only possible but sometimes appropriate. For example, a database on capital equipment expenditures may use this technique. Another case is to use reverification on only part of the objects in a database. For example, an industrial supplier may verify order information on all orders over a certain amount of value or a certain amount of product.

Selective reverification may also be a technique used in monitoring a database. Random selection of transactions or records for reverification can be a good quality assessment tool either to spot developing problems or to instill confidence among users of the data. This improves the trust dimension of the quality of the data.
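The two selective approaches described above (reverifying every order over a certain value, and reverifying a random selection of the rest) can be sketched together. This is only an illustration: the function name `select_for_reverification`, the `id` and `amount` fields, the threshold, and the sampling rate are all hypothetical and not drawn from any particular system.

```python
import random

def select_for_reverification(orders, amount_threshold, sample_rate, seed=None):
    """Return (large_ids, sampled_ids) of orders to send for manual reverification.

    large_ids:   every order whose amount exceeds amount_threshold.
    sampled_ids: a random sample, at the given rate, of the remaining orders.
    """
    large = [o["id"] for o in orders if o["amount"] > amount_threshold]
    rest = [o["id"] for o in orders if o["amount"] <= amount_threshold]
    rng = random.Random(seed)  # a seed makes the selection repeatable
    k = min(len(rest), round(len(rest) * sample_rate))
    return sorted(large), sorted(rng.sample(rest, k))

# Hypothetical data: 100 orders with amounts 10, 20, ..., 1000.
orders = [{"id": i, "amount": i * 10} for i in range(1, 101)]
large_ids, sampled_ids = select_for_reverification(orders, 900, 0.10, seed=1)
```

The share of sampled records that fail reverification gives a rough estimate of the error rate in the unsampled remainder, which is what makes the random sample useful as a monitoring tool.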

Analytical Techniques

Analytical techniques involve the use of software in conjunction with the skills of a data or business analyst to search through databases to find the presence of inaccurate data. Analytical techniques can be used against transactions as they are occurring, against databases as they are changing, or against databases on a periodic basis.

There are five categories of analysis that can be applied to data: data element analysis, structural analysis, value correlation, aggregation correlation, and value inspection. These are individually explored in Chapters 8 through 12. Each of these contributes to finding the presence of inaccurate data. However, not all of them can pinpoint the actual wrong values. The sections that follow briefly describe these types of analysis.

Element Analysis

Element analysis involves looking at individual values in isolation to determine if they are valid. To do this you need a definition of what is valid and what is not. For example, a Social Security number must consist of nine numeric digits. If this is your only definition, you will find all values that are blank, contain non-numeric characters, or contain fewer or more than nine digits. However, you can go further in your definition. The government employs a scheme of assigning numbers that allows you to examine the value in more detail to determine if it is valid or not. Using the larger rule has the potential for finding more inaccurate values.

Element analysis can include examination based on data type, length, range of values, list of discrete values allowed, patterns allowed, and any other information that is helpful. The more definition you provide, the more potential you have for finding invalid values.
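The difference between a narrow and a larger definition can be shown with a small sketch. The basic rule below is the nine-digit definition from the text. The strict rule is an assumption added for illustration: it applies commonly documented constraints on the numbering scheme (area not 000, 666, or 900 through 999; group not 00; serial not 0000), which this text does not itself spell out. All function names are hypothetical.

```python
import re

# Basic rule from the text: exactly nine numeric digits.
NINE_DIGITS = re.compile(r"[0-9]{9}")

def is_valid_ssn_basic(value):
    """True if value is a string of exactly nine ASCII digits."""
    return isinstance(value, str) and NINE_DIGITS.fullmatch(value) is not None

def is_valid_ssn_strict(value):
    """Basic rule plus assumed constraints on area, group, and serial parts."""
    if not is_valid_ssn_basic(value):
        return False
    area, group, serial = value[:3], value[3:5], value[5:]
    # Assumption: these area ranges are never assigned.
    if area in ("000", "666") or area >= "900":
        return False
    return group != "00" and serial != "0000"

def find_invalid(values, rule):
    """Return the values that fail the given validity rule, in input order."""
    return [v for v in values if not rule(v)]

sample = ["123456789", "", "12345678", "12345678A", "000123456", "1234567890"]
basic_invalid = find_invalid(sample, is_valid_ssn_basic)
strict_invalid = find_invalid(sample, is_valid_ssn_strict)
```

With this sample, the basic rule rejects the blank, the short value, the value with a letter, and the long value. The strict rule rejects those four plus `000123456`, which passes the nine-digit test. Neither rule can say whether `123456789` is the right number for the person it is recorded against.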

A simple but often overlooked technique for element analysis is visual inspection. This method involves looking at values to see if they belong to the data element or not. For example, you could not write a programmatic rule to find invalid names among names that contain only valid characters. However, a human could look at the list and find wrong or questionable entries because of the superior human capability to deal with semantics and context. For example, names such as Donald Duck, xwdcgex, Don't know, and Wouldn't tell me would be spotted by the analyst and ruled to be invalid. You could not write a comprehensive enough rule to find these through a program.

Structural Analysis

Structural analysis techniques involve checking columns of values for uniqueness or consecutiveness, checking for orphans on collections of records with connecting parts, and checking for circular relationships. For example, no two employees can have the same employee number. Each employee record must have a corresponding emergency contact record, and each emergency contact record must belong to a valid employee record. An example of circular relationships would be that a part cannot be used in the construction of itself.

Structural analysis techniques are used on collections of records. They can identify clear inaccuracies in the data. They generally isolate the error to a small number of records. They do not identify the offending values. Reverification generally is required to pinpoint the actual offender.
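The three kinds of structural checks mentioned above (uniqueness, orphans, and circular relationships) can be sketched as follows. This is a minimal illustration over in-memory records; the function names, the `emp_no` field, and the sample parts list are hypothetical.

```python
from collections import Counter

def duplicate_keys(records, key):
    """Return key values that occur in more than one record."""
    counts = Counter(r[key] for r in records)
    return sorted(k for k, n in counts.items() if n > 1)

def orphans(children, parents, child_fk, parent_key):
    """Return child records whose foreign key matches no parent record."""
    parent_ids = {p[parent_key] for p in parents}
    return [c for c in children if c[child_fk] not in parent_ids]

def parts_in_cycles(components):
    """Return parts that are used, directly or indirectly, to build themselves.

    components maps each part to the parts used directly in its construction.
    """
    def reaches_itself(start):
        stack, seen = list(components.get(start, [])), set()
        while stack:
            part = stack.pop()
            if part == start:
                return True
            if part not in seen:
                seen.add(part)
                stack.extend(components.get(part, []))
        return False
    return sorted(p for p in components if reaches_itself(p))

employees = [{"emp_no": 1}, {"emp_no": 2}, {"emp_no": 2}, {"emp_no": 3}]
contacts = [{"emp_no": 1, "name": "A"}, {"emp_no": 9, "name": "B"}]
bom = {"bike": ["wheel", "frame"], "wheel": ["spoke"],
       "spoke": ["wheel"], "frame": []}

dup = duplicate_keys(employees, "emp_no")
orphan_contacts = orphans(contacts, employees, "emp_no", "emp_no")
no_contact = orphans(employees, contacts, "emp_no", "emp_no")
cyclic = parts_in_cycles(bom)
```

Calling `orphans` in both directions covers both halves of the employee and emergency contact rule. Note what the results do and do not say: the duplicate check reports that employee number 2 appears twice, but not which of the two records carries the wrong number.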

Value Correlation

Value correlation analysis involves checking collections of values against a rule that must hold true over the data. For example, if an EMPLOYEE_TYPE field indicates a part-time employee, the PAY_TYPE field must be hourly. This is a rule that applies to a subset of the records in the database. If it is violated, an inaccuracy exists in the data. However, you cannot tell which value is wrong, EMPLOYEE_TYPE or PAY_TYPE. You just know that one of them is wrong.

Value correlation can be very simple or very complex. It may involve only values in a single row, values in multiple rows of the same relation, or values that cross over rows of multiple tables.

There is generally a large number of rules that can be written for cross-checking values to ensure that the set of values represents a valid combination. When a rule is violated, you cannot tell which value is the offender unless the rule involves only one data element. In addition, rules can be satisfied with completely inaccurate data. In the previous example, if the employee record showed EMPLOYEE_TYPE as part-time and PAY_TYPE as hourly, it would be satisfied even though it is possible the EMPLOYEE_TYPE field is inaccurate (i.e., it is really a full-time employee). In fact, both EMPLOYEE_TYPE and PAY_TYPE can satisfy the rule, even though both are inaccurate.
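The part-time rule can be written as a single-row check. The sketch below assumes the fields are encoded as the strings "part-time", "full-time", "hourly", and "salaried"; those encodings, the `id` field, and the function name are illustrative only.

```python
def violates_part_time_rule(record):
    """True if a part-time employee is recorded with a non-hourly pay type."""
    return (record["EMPLOYEE_TYPE"] == "part-time"
            and record["PAY_TYPE"] != "hourly")

records = [
    {"id": 1, "EMPLOYEE_TYPE": "part-time", "PAY_TYPE": "hourly"},
    {"id": 2, "EMPLOYEE_TYPE": "part-time", "PAY_TYPE": "salaried"},
    {"id": 3, "EMPLOYEE_TYPE": "full-time", "PAY_TYPE": "salaried"},
]

# Ids of records in which the combination of values breaks the rule.
violations = [r["id"] for r in records if violates_part_time_rule(r)]
```

Only record 2 is flagged, and the check cannot say whether its EMPLOYEE_TYPE or its PAY_TYPE is the wrong value. Record 1 passes even if the employee is really full-time, which is the limitation described above.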

Aggregation Correlation

Aggregation correlation analysis is used to identify the presence of inaccuracies through examining aggregated values over large amounts of data. For example, a rule may say that the count of orders by region for a week should be greater than 100 and less than 1,000. Violations would indicate that either data is missing or orders have the wrong date on them.
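The weekly order-count rule might be checked along these lines. This is a sketch under stated assumptions: each order is a dictionary with hypothetical `region` and `order_date` fields, weeks are ISO calendar weeks, and the bounds are the 100 and 1,000 from the example.

```python
from collections import Counter
from datetime import date

def weekly_counts(orders):
    """Count orders per (region, ISO year, ISO week)."""
    counts = Counter()
    for o in orders:
        iso = o["order_date"].isocalendar()  # (ISO year, ISO week, weekday)
        counts[(o["region"], iso[0], iso[1])] += 1
    return counts

def out_of_range(counts, low=100, high=1000):
    """Return the groups whose count is not strictly between low and high."""
    return {k: n for k, n in counts.items() if not (low < n < high)}

# Hypothetical data: 150 orders in one region and 5 in another, same week.
orders = (
    [{"region": "east", "order_date": date(2024, 1, 3)} for _ in range(150)]
    + [{"region": "west", "order_date": date(2024, 1, 3)} for _ in range(5)]
)
flagged = out_of_range(weekly_counts(orders))
```

Here the "west" group is flagged for that week. The check reports only that something is wrong with the group, not which orders are missing or misdated. A region with no orders at all in a week produces no group and would need a separate check against the list of expected regions.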

Value Inspection

Values can often be computed that allow the analyst to inspect them and determine if they are reasonable or not. Value inspection differs from aggregation correlation in that it applies to cases where it is not possible to create a clear rule that defines the boundary between right and wrong. Small amounts of inaccurate data may not provide a clear indicator. However, clearly unreasonable data will jump out at the analyst.

For example, the distribution of values within a column may indicate that a problem exists. If 60% of the records indicate the same color of product even though you know that the colors should be approximately evenly divided over a set of 10 colors, the data is clearly inaccurate.
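A value distribution of this kind is easy to compute and hand to the analyst. The sketch below applies no rule and makes no judgment; it only produces the figures to be inspected. The function name and the sample colors are hypothetical.

```python
from collections import Counter

def value_shares(values):
    """Return each distinct value's share of the total, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: n / total for v, n in counts.most_common()}

# Hypothetical column of product colors.
colors = ["red"] * 60 + ["blue"] * 25 + ["green"] * 15
shares = value_shares(colors)

for color, share in shares.items():
    print(f"{color:<8}{share:6.1%}")
```

An analyst who expects the colors to be roughly evenly divided would see at once that one color accounts for 60% of the records. Deciding whether that is unreasonable still takes knowledge of the business, which is why this is inspection rather than a rule.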

Aggregation tests can be applied to distribution of values, counts, sums, averages, and medians. They can also be used to compare aggregations between groups such as between sales regions.

In summary, element analysis is used to find only invalid values. Structural analysis, value correlation, aggregation correlation, and value inspection are used to find the presence of inaccuracies among valid values. They cannot identify the offending values but can determine with certainty that some of the values are wrong.

Looks Like a Duck, Walks Like a Duck

Although it is clear that you cannot find all inaccurate data values in a database, there is clearly value in finding out what you can detect. If a database has inaccurate data within it, it is very likely that analytical techniques will surface enough of them to get a good picture of the health of the data. It is highly unlikely that all or most of the inaccurate values can hide out as valid values that successfully correlate through all rules. You will find enough inaccurate data, provided you do a thorough enough job of analysis.

You also do not need to worry about seeing only the tip of the iceberg. In most cases, the inaccuracies disclosed through analysis represent a significant portion of the bad values.

A quality improvement program will probably improve the accuracy of data that was not discoverable through analysis along with the inaccuracies that were discovered. This is a secondary effect. For example, suppose your analysis indicates that 8% of the records in a specific database have at least one wrong value. You implement changes targeting the areas you uncovered and then reassess to find that the new number is 1%. The actual improvement is probably greater than the 7 percentage points indicated because of the effect of improvements on values not discoverable through analysis. The result is that users may feel that a much bigger improvement has actually taken place. Figure 2.3 indicates cross-correlations in terms of the effects of improvements in the accuracy of data values.

Figure 2.3: Effects of improvements.