2.4 Distribution of Inaccurate Data

Wrong values are generally not distributed evenly throughout a database. The reasons for this are as follows:

  • Some data is more important than other data.

  • Some inaccurate data tends to get recognized and fixed when used.

  • How an element of data is used will affect the chances of inaccuracies being recognized.

  • Flaws in data acquisition processes are not equal for all elements.

In every database, some data elements are more important to an application than others. For example, in an orders database, the order number and customer number are more important than the order date. If the customer number is wrong, the error will be recognized very early and fixed. If the order date is wrong, the error may never be recognized or fixed. In an HR (human resources) database, the employee's Social Security number is more important than the last education level achieved. A Social Security number error will be recognized and fixed very early, whereas an error in the education level achieved will probably never be recognized or fixed. If errors occur frequently and in large numbers in important fields, a major issue erupts and the source of the errors is found and fixed.
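The uneven distribution described above can be made visible by counting rule violations per field. The following is a minimal sketch, not a definitive method: the records, the field names (`ssn`, `education_level`), the list of allowed education codes, and the function `invalid_rate_by_field` are all invented for illustration. Note also that a rule check can only detect invalid values; a value can pass the check and still be wrong, so these rates are at best a lower bound on inaccuracy.

```python
import re

# Hypothetical HR records; every field name and value here is invented.
records = [
    {"ssn": "123-45-6789", "education_level": "BS"},
    {"ssn": "987-65-4321", "education_level": ""},
    {"ssn": "555-12-3456", "education_level": "unknown"},
    {"ssn": "111-22-3333", "education_level": "MS"},
]

# Assumed set of allowed education codes (an illustration, not a standard).
VALID_EDUCATION = {"HS", "BS", "MS", "PHD"}

# One simple validity rule per field. These test format and membership only.
rules = {
    "ssn": lambda v: re.fullmatch(r"\d{3}-\d{2}-\d{4}", v) is not None,
    "education_level": lambda v: v in VALID_EDUCATION,
}


def invalid_rate_by_field(rows, field_rules):
    """Return the fraction of rows whose value fails the rule, per field."""
    rates = {}
    for field, is_valid in field_rules.items():
        bad = sum(1 for row in rows if not is_valid(row.get(field, "")))
        rates[field] = bad / len(rows) if rows else 0.0
    return rates


rates = invalid_rate_by_field(records, rules)
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} of values fail the validity rule")
```

In this invented sample, the heavily used field is clean while the secondary field fails its rule in half of the rows, which is the pattern the text describes.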

Another factor is how a data element is used. A field that is used in computing an amount or in updating inventory levels is more important than one that is merely descriptive and is only printed on reports. Errors in fields used for computation and aggregation generally produce visible clues, whereas errors in fields not used for these purposes generally go unrecognized.

The tendency for more important data elements to be more accurate is why quality problems rarely surface through the initiating transaction applications. The major problems with fields important to the users of those applications have already been recognized, and corrective action has been taken to make those fields accurate enough to satisfy the users' requirements.

The data inaccuracy problem surfaces when this data is moved and used for decision making. Many of the data elements used only to record secondary information about the transaction now become much more important. For example, trying to correlate promotions to educational levels requires that the "education level achieved" field be very accurate. This new use places a higher demand on this data element than the HR application ever did.

This is a major reason data suddenly appears to be awful even though the transaction applications have been running for years with no complaints. The new uses of the data place higher requirements for accuracy on some of the data elements than the transaction applications did.
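The point that the same data can satisfy one use and fail another can be sketched as a comparison of one set of measured accuracies against two sets of requirements. All numbers, use names, and field names below are hypothetical and chosen only to illustrate the idea; in practice, measuring the accuracy of a field is itself a difficult task.

```python
# Hypothetical measured accuracy per field (fraction of correct values).
measured_accuracy = {"ssn": 0.999, "education_level": 0.80}

# Hypothetical minimum accuracy that each use demands of each field.
required_accuracy = {
    "hr_transactions": {"ssn": 0.99, "education_level": 0.50},
    "promotion_analysis": {"ssn": 0.99, "education_level": 0.95},
}


def fields_below_requirement(measured, required):
    """List the fields whose measured accuracy is below the required minimum."""
    return sorted(
        field
        for field, minimum in required.items()
        if measured.get(field, 0.0) < minimum
    )


shortfalls = {
    use: fields_below_requirement(measured_accuracy, requirements)
    for use, requirements in required_accuracy.items()
}

for use, fields in shortfalls.items():
    status = ", ".join(fields) if fields else "none"
    print(f"{use}: fields below requirement: {status}")
```

The data has not changed between the two rows of output; only the requirement has. Under these assumed figures, the transaction use reports no shortfall while the decision support use flags `education_level`.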

Unfortunately, another dynamic comes into play regarding the chances of getting improvements made. The only way the data will come up to the level needed by the newer uses is for fundamental changes to occur all the way back to the transaction level. And yet, the farther away you get from the initiating application, the more difficult it is to get changes made. The people who own the data are satisfied with the quality and place low priority on complaints from decision support analysts. This situation screams out for data stewards and data quality controls.