Chapter 2: Definition of Accurate Data

To begin the discussion of data accuracy, it is important to first establish where accuracy fits into the larger picture of data quality.

2.1 Data Quality Definitions

Data quality is defined as follows: data has quality if it satisfies the requirements of its intended use. It lacks quality to the extent that it does not satisfy those requirements. In other words, data quality depends as much on the intended use as it does on the data itself. To satisfy the intended use, the data must be accurate, timely, relevant, complete, understood, and trusted.

Some examples will help in understanding the notion of data quality in the context of intended use. The sections that follow explore examples of each of the previously mentioned aspects of data quality.

Case 1: Accuracy

Consider a database that contains names, addresses, phone numbers, and e-mail addresses of physicians in the state of Texas. This database is known to have a number of errors: some records are wrong, some are missing, and some are obsolete. If you compare the database to the true population of physicians, it is expected to be 85% accurate.

If this database is to be used by the state of Texas to notify physicians of a new law regarding assisted suicide, it would certainly be considered to be of poor quality. In fact, it would be dangerous to use it for that intended purpose.

If this database were to be used by a new surgical device manufacturer to find potential customers, it would be considered high quality. Any such firm would be delighted to have a potential customer database that is 85% accurate. From it, they could conduct a telemarketing campaign to identify real sales leads with a completely acceptable success rate. The same database: for one use it has poor data quality, and for another it has high data quality.
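The contrast between the two uses can be sketched as a comparison between one measured accuracy and a requirement that differs per use. This is only an illustration: the 85% figure comes from the example above, while the threshold values and the names `fit_for_use` and `REQUIRED_ACCURACY` are assumptions chosen for the sketch, not figures from the text.

```python
# Minimal sketch: one database, two intended uses, two different verdicts.

DATABASE_ACCURACY = 0.85  # estimated accuracy from the physician example

# Hypothetical requirements; real thresholds would come from each use's owner.
REQUIRED_ACCURACY = {
    "legal_notification": 0.99,  # assumed: nearly every physician must be reached
    "sales_prospecting": 0.70,   # assumed: some wasted calls are acceptable
}


def fit_for_use(measured: float, required: float) -> bool:
    """Data has quality for a use if it meets that use's requirement."""
    return measured >= required


verdicts = {
    use: fit_for_use(DATABASE_ACCURACY, required)
    for use, required in REQUIRED_ACCURACY.items()
}

for use, ok in verdicts.items():
    print(f"{use}: {'high quality' if ok else 'poor quality'}")
```

The data never changes between the two checks; only the requirement does, which is the point of the example.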

Case 2: Timeliness

Consider a database containing sales information for a division of a company. This database contains three years' worth of data. However, the database is slow to become complete at the end of each month. Some units submit their information immediately, whereas others take several days to send in information. There are also a number of corrections and adjustments that flow in. Thus, for a period of time at the end of the accounting period, the content is incomplete. However, all of the data is correct when complete.

If this database is to be used to compute sales bonuses that are due on the 15th of the following month, it is of poor data quality even though the data in it is always eventually accurate. The data is not timely enough for the intended use.

However, if this database is to be used for historical trend analysis and to make decisions on altering territories, it is of excellent data quality as long as the user knows when all additions and changes are incorporated. Waiting for all of the data to get in is not a problem because its intended use is to make long-term decisions.
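Timeliness can be sketched the same way, as a comparison between the date the data becomes complete and the date a given use needs it. The dates below are hypothetical: the text says only that late submissions and corrections delay completion, so the assumed completion date of April 20 is an illustration, as is the name `is_timely`.

```python
from datetime import date


def is_timely(complete_on: date, needed_by: date) -> bool:
    """Data is timely for a use if it is complete by the time that use needs it."""
    return complete_on <= needed_by


# Assumed: March sales data is fully settled (late units, corrections) on April 20.
march_complete_on = date(2024, 4, 20)

bonus_deadline = date(2024, 4, 15)  # bonuses are due on the 15th of the following month
trend_review = date(2024, 6, 30)    # hypothetical date of a long-term territory review

print(is_timely(march_complete_on, bonus_deadline))  # False
print(is_timely(march_complete_on, trend_review))    # True
```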

Case 3: Relevance

Consider an inventory database that contains part numbers, warehouse locations, quantity on hand, and other information. However, it does not contain source information (where the parts came from). If a part is supplied by multiple suppliers, once the parts are received and put on the shelf there is no indication of which supplier the parts came from. The information in the database is always accurate and current. For normal inventory transactions and decision making, the database is certainly of high quality.

If a supplier reports that one of its shipments contained defective parts, this database is of no help in determining whether any of those parts are in the warehouse. For this intended use, the database is of poor quality because it does not contain a relevant element of information: the source of each part.

Case 4: Completeness

A database contains information on repairs done to capital equipment. However, it is known that repairs are sometimes done without the information about the repair being entered into the database. This is the result of a lack of concern on the part of the repair people and a lack of enforcement on the part of their supervisors. It is estimated that about 5% of the information is missing.

This database is probably a good-quality database for assessing the general health of capital equipment. Equipment that required a great deal of expense to maintain can be identified from the data. Unless the missing data is disproportionately skewed, the records are usable for all ordinary decisions.

However, trying to use it as a base for evaluating warranty compliance makes it a low-quality database. The missing transactions could easily tag an important piece of equipment as satisfying a warranty when in fact it does not.

Case 5: Understood

Consider a database containing orders from customers. A practice for handling complaints and returns is to create an "adjustment" order for backing out the original order and then writing a new order for the corrected information if applicable. This procedure assigns new order numbers to the adjustment and replacement orders.

For the accounting department, this is a high-quality database. All of the numbers come out in the wash. For a business analyst trying to determine trends in growth of orders by region, this is a poor-quality database. If the business analyst assumes that each order number represents a distinct order, his analysis will be all wrong. Someone needs to explain the practice and the methods necessary to unravel the data to get to the real numbers (if that is even possible after the fact).
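The gap between the two readings can be made concrete with a small sketch. It assumes a hypothetical record layout in which each adjustment or replacement order carries a `kind` and a `corrects` field pointing back to the original order number. The text does not say that such a link exists, and without one the unraveling may not be possible. Counting only originals is one possible reading of "distinct order", not the only one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Order:
    order_no: int
    region: str
    amount: float
    kind: str = "original"          # assumed values: original, adjustment, replacement
    corrects: Optional[int] = None  # hypothetical link to the original order number


orders = [
    Order(1001, "West", 500.0),
    Order(1002, "West", -500.0, "adjustment", corrects=1001),  # backs out 1001
    Order(1003, "West", 450.0, "replacement", corrects=1001),  # corrected order
    Order(1004, "East", 300.0),
]


def naive_order_count(orders: list) -> int:
    """What an analyst gets by treating every order number as a distinct order."""
    return len({o.order_no for o in orders})


def customer_order_count(orders: list) -> int:
    """Counts only original orders, ignoring adjustments and replacements."""
    return sum(1 for o in orders if o.kind == "original")


def net_amount(orders: list) -> float:
    """The accounting view: adjustments cancel out, so the total is still right."""
    return sum(o.amount for o in orders)


print(naive_order_count(orders))     # 4
print(customer_order_count(orders))  # 2
print(net_amount(orders))            # 750.0
```

The net amount is correct either way, which is why accounting sees a high-quality database, while the order count doubles unless the practice is understood.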

Case 6: Trusted

A new application is deployed to determine how many parts to order for machinery, and when to order them, based on past history and the time in service since the last replacement in the machines that use them. The original application had a programming error that caused it to order 10 times the amount actually required. The error went undetected until a large order was sent. A great deal of publicity ensued over the incident. The programming error was fixed, and the problem has not recurred.

The database was never wrong; the application was. The large order was actually placed and the database reflected the order as such.

Fearing a repeat of the incident, the maintenance chief has chosen to use neither the application nor the information within the database. He orders parts based on a small spreadsheet application he built to keep much of the same information, even though he often misses transactions and does not always know when new parts arrive in inventory.

Unless his confidence in the original application is restored, the database is of poor quality, even though it is entirely accurate. It is not serving its intended use due to a lack of believability.