Chapter 3: Sources of Inaccurate Data

Before we can assess data correctness we need to understand the various ways inaccurate values get into databases. There are many sources of data inaccuracies, and each contributes its own part to the total data quality problem. Understanding these sources will demonstrate the need for a comprehensive program of assessment, monitoring, and improvement. Having highly accurate data requires attention to all sources of inaccuracies and appropriate responses and tools for each.

Figure 3.1 shows the four general areas where inaccuracies occur. The first three cause inaccuracies in data within the databases, whereas the fourth area causes inaccuracies in the information products produced from the data. If you roll up all potential sources of errors, the interesting conclusion is that the most important use of the data (corporate decision making) is based on the rendition of the data that has the most inaccuracies.

Figure 3.1: Areas where inaccuracies occur.

3.1 Initial Data Entry

Most people assume that data inaccuracies are always the result of entering the wrong data at the beginning. This is certainly a major source of data inaccuracies but not the only source. Inaccurate data creation can be the result of mistakes, can result from flawed data entry processes, can be deliberate, or can be the result of system errors. By examining your systems from each of these perspectives, you can gain insight into whether they are designed to invite inaccurate data or to promote accurate data.

Data Entry Mistakes

The most common source of a data inaccuracy is that the person entering the data just plain makes a mistake. You intend to enter blue but enter bleu instead; you hit the wrong entry on a select list; you put a correct value in the wrong field. Much of operational data originates from a person. People make mistakes; we make them all the time. It is doubtful that anyone could fill out a hundred-field form without making at least one mistake.

A real-world example involves an automobile damage claims database in which the COLOR field was entered as text. Examination of the content of this field yielded 13 different spellings for the word beige. Some of these mistakes were typos. Others occurred because the entry person did not know how to spell the word. In some of the latter cases, the entry people thought they knew how to spell it, whereas in others they were not able or willing to look it up.
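The beige example suggests one way such misspellings might be caught after the fact: compare each free-text entry against a list of accepted values and map near misses to the closest one. The following is a minimal sketch using difflib from the Python standard library. The VALID_COLORS list, the normalize_color function, and the 0.6 similarity cutoff are illustrative assumptions, not part of the claims system described above.

```python
from difflib import get_close_matches

# Hypothetical list of accepted values for the COLOR field.
VALID_COLORS = ["beige", "black", "blue", "brown", "green", "red", "white"]


def normalize_color(raw, cutoff=0.6):
    """Return the accepted color closest to the raw entry, or None.

    The cutoff is an assumed similarity threshold between 0 and 1;
    a real system would tune it against its own data.
    """
    cleaned = raw.strip().lower()
    matches = get_close_matches(cleaned, VALID_COLORS, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With this list, normalize_color("biege") returns "beige", while an entry that resembles no accepted value returns None and can be routed to a person for review. Fuzzy matching of this kind can also map an entry to the wrong value, so it works best as an aid to review rather than as a silent correction.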

Flawed Data Entry Processes

A lot of data entry begins with a form. A person completes a form either on a piece of paper or on a computer screen. Form design has a lot to do with the amount of inaccurate data that ends up in the database. Form design should begin with a basic understanding of quality issues in order to avoid many of the mistakes commonly seen. For example, having someone select from a list of valid values instead of typing in a value can eliminate the misspellings previously cited.

Another common problem is fields on the form that confuse the user, which often leads to wrong entries. If a field is not commonly understood, or if its database definition is unconventional, the form needs to guide the user through entering a value. Sometimes the confusion lies in the way the field is described in its identifying text or in its positioning on the form. Form design should always be subjected to rigorous quality testing to find the fields for which a typical user would have difficulty knowing what to enter.

Data entry windows should have instructions available as HELP functions and should be user friendly in handling errors. Frustration in using a form can lead to deliberate mistakes that corrupt the database.

Forms are better completed by a trained entry person than by a one-time user. This is because the entry person can be taught how things should be entered, can become proficient in using the form mechanisms, and can be given feedback to improve the efficiency and accuracy of the data. A one-time user is always uncertain about what they are supposed to do on the form. Unfortunately, our society is moving by way of the Internet toward eliminating the middle person in the process and having end users complete forms directly. This places a much higher demand on quality form design.

The data entry process includes more than the forms that are filled out. It also includes the process that surrounds it. Forms are completed at a specific point or points in a process. Sometimes we have forms that are required to be completed when not all information is known or easily obtained at that point in the process. This will inevitably lead to quality problems.

A data entry process I helped design a number of years ago for military repair personnel illustrates the types of problems that can occur in data collection. The U.S. Navy has a database that collects detailed information on the repair and routine maintenance performed on all aircraft and on all major components of every ship. This database is intended to be used for a variety of purposes, from negotiating contracts with suppliers, to validating warranties, to designing new aircraft and ships.

When an aircraft carrier is in a combat situation, such as in Kuwait and Afghanistan, repairs are being made frequently. The repair crews are working around the clock and under a great deal of pressure to deal with a lot of situations that come up unexpectedly. Completing forms is the least of their concerns. They have a tendency to fix things and do the paperwork later. The amount of undocumented work piles up during the day, to be completed when a spare moment is available. By then the repair person has forgotten some of the work done or the details of some of the work and certainly is in a hurry to get it done and out of the way.

Another part of this problem comes in when the data is actually entered from the forms. The forms are coming out of a hectic, very messy environment. Some of the forms are torn; some have oil or other substances on them. The writing is often difficult to decipher. The person who created it is probably not available and probably would not remember much about it if available.

A database built from this system will have many inaccuracies in it. Many of the inaccuracies will be missing information or valid but wrong information. An innovative solution that involves wireless, handheld devices and employs voice recognition technology would vastly improve the completeness and accuracy of this database. I hope the U.S. Navy has made considerable improvements in the data collection processes for this application since I left. I trust they have.

The Null Problem

A special problem occurs in data entry when the information called for is not available. A data element has a value, an indicator that the value is not known, or an indicator that no value exists (or is applicable) for this element in this record. Have you ever seen an entry screen that had room for a value and two indicator boxes you could use for the case where there is no value? I haven't. Most form designs either mandate that a value be provided or allow it to be left blank. If left blank, you do not know the difference between value-not-known and no-value-applies.

When the form requires that an entry be available and the entry person does not have the value, there is a strong tendency to "fake it" by putting a wrong, but acceptable, value into the field. This is even unintentionally encouraged for selection lists that have a default value in the field to start with.

It would be better form design to introduce the notion of NOT KNOWN or NOT APPLICABLE for data elements that are not crucial to the transaction being processed. This would at least allow the entry people to enter accurately what they know and the users of the data to understand what is going on in the data.

It would make sense in some cases to allow the initial entry of data to record NOT KNOWN values and have the system trigger subsequent activities that would collect and update these fields after the fact. This is far better than having people enter false information or leaving something blank and not knowing if a value exists for the field or not.

An example of a data element that may be NOT KNOWN or NOT APPLICABLE is a driver's license number. If the field is left blank, you cannot tell if it was not known at the point of entry or whether the person it applies to does not have a driver's license. Failure to handle the possibility of information not being available at the time of entry and failure to allow for options to express what you do know about a value leads to many inaccuracies in data.
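The three states described in this section (a value, value-not-known, and no-value-applies) can be made explicit in the data structure itself. The following is a minimal sketch under stated assumptions: the names Availability, FieldEntry, and needs_follow_up are hypothetical and are not drawn from any particular system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Availability(Enum):
    PROVIDED = "provided"
    NOT_KNOWN = "not known"
    NOT_APPLICABLE = "not applicable"


@dataclass(frozen=True)
class FieldEntry:
    """A field value together with an explicit availability indicator."""
    status: Availability
    value: Optional[str] = None

    def __post_init__(self):
        has_value = self.value is not None and self.value.strip() != ""
        if self.status is Availability.PROVIDED and not has_value:
            raise ValueError("a PROVIDED entry needs a value")
        if self.status is not Availability.PROVIDED and has_value:
            raise ValueError("only a PROVIDED entry may carry a value")

    def needs_follow_up(self):
        # Only NOT_KNOWN entries should trigger later collection;
        # NOT_APPLICABLE means there is nothing to collect.
        return self.status is Availability.NOT_KNOWN
```

For the driver's license example, FieldEntry(Availability.NOT_KNOWN) records that the number still has to be collected, while FieldEntry(Availability.NOT_APPLICABLE) records that the person has no license. A blank field cannot express that difference.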

Deliberate Errors

Deliberate errors are those that occur when the person enters a wrong value on purpose. There are three reasons they do this:

  • They do not know the correct information.

  • They do not want you to know the correct information.

  • They get a benefit from entering the wrong information.

Do Not Know Correct Information

Not knowing the correct information occurs when the form requires a value for a field and the person wants or needs to complete the form but does not know the value to use. The form will not be complete without a value. The person does not believe the value is important to the transaction, at least not relative to what they are trying to do. The result is that they make up a value, enter the information, and go on.

Usually the information is not important to completing the transaction but may be important to other database users later on. For example, requiring the license plate number of your car when you register at a hotel has no effect on getting registered. However, it may be important when you leave your lights on and the hotel needs to find out whose car it is.

Do Not Wish To Give The Correct Information

The second source of deliberate errors is the person providing the data not wanting to give the correct information. This is becoming more and more common with data coming off the Internet and the emergence of CRM applications. Every company wants a database on all of its customers in order to tailor marketing programs. However, companies end up with a lot of incorrect data in their databases because the information they ask for is more than people are willing to provide or is perceived to be an invasion of privacy.

Examples of fields that people will lie about are age, height, weight, driver's license number, home phone number, marital status, annual income, and education level. People even lie about their name if it can get the result they want from the form without putting in their correct name. A common name appearing in many marketing databases is Mickey Mouse.

The problem with collecting data that is not directly required to complete the transaction is that the quality of these data elements tends to be low but is not immediately detected. It is only later, when you try to employ this data, that the inaccuracies show up and create problems.

Falsifying To Obtain A Benefit

The third case in which deliberate mistakes are made is where the entry person obtains an advantage in entering wrong data. Some examples from the real world illustrate this.

An automobile manufacturer receives claim forms for warranty repairs performed by dealers. Claims for some procedures are paid immediately, whereas claims for other procedures are paid in 60 days. The dealers figure out this scheme and deliberately lie about the procedures performed in order to get their money faster. The database incorrectly identifies the repairs made. Any attempt to use this database to determine failure rates would be a total failure. In fact, it was an attempt to use the data for this purpose that led to the discovery of the practice. It had been going on for years.

A bank gives branch bank employees a bonus for all new corporate accounts. A new division of a larger company opens an account with a local branch. If the bank employee determines that this is a sub-account of a larger, existing customer (the correct procedure), no bonus is paid upon opening the account. If, however, the account is opened as a new corporate customer (the wrong procedure), a bonus is paid.

An insurance company sells automobile insurance policies through independent insurance writers. In a metropolitan area, the insurance rate is determined by the ZIP code of the applicant. The agents figure out that if they falsify the ZIP CODE field on the initial application for applicants in high-cost ZIP codes, they can get the client on board at a lower rate. The transaction completes, the agent gets the commission, and the customer corrects the error when the renewal forms arrive a year later. The customer's rates subsequently go up as a result.

Data entry people are rated based on the number of documents entered per hour. They are not penalized for entering wrong information. This leads to a practice of entering data too fast, not attempting to resolve issues with input documents, and making up missing information. The operators who enter the poorest-quality data get the highest performance ratings.
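The arithmetic behind this last example is simple to illustrate. The sketch below compares raw throughput with a hypothetical alternative measure that counts only documents entered without error. The function name and the figures in the usage note are assumptions made for illustration, not a measure taken from the text.

```python
def accurate_documents_per_hour(documents_entered, hours_worked, documents_with_errors):
    """Hypothetical rating: documents entered without error, per hour worked."""
    if hours_worked <= 0:
        raise ValueError("hours_worked must be positive")
    if not 0 <= documents_with_errors <= documents_entered:
        raise ValueError("documents_with_errors must be between 0 and documents_entered")
    return (documents_entered - documents_with_errors) / hours_worked
```

Suppose one operator enters 120 documents in an hour with 30 of them containing errors, and another enters 100 with 2 containing errors. Rated on raw throughput, the first operator wins, 120 to 100. Rated on accurate documents per hour, the second wins, 98 to 90. A measure like this depends on errors actually being detected, which requires some form of sampling or verification.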

All of these examples demonstrate that company policy can encourage people to deliberately falsify information in order to obtain a personal benefit.

System Problems

Systems are too often blamed for mistakes when, after investigation, the mistakes turn out to be the result of a human error. Our computing systems have become enormously reliable over the years. However, database errors do occur because of system problems when the transaction systems are not properly designed.

Database systems have the notion of COMMIT. This means that changes to a database system resulting from an external transaction either get completely committed or completely rejected. Specific programming logic ensures that a partial transaction never occurs. In application designs, the user is generally made aware that a transaction has committed to the database.
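The all-or-nothing behavior of COMMIT can be sketched with Python's built-in sqlite3 module. The account table and the make_accounts, balances, and transfer functions are hypothetical, chosen only to show a transaction that is either fully committed or fully rolled back.

```python
import sqlite3


def make_accounts():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
    conn.commit()
    return conn


def balances(conn):
    """Return the current balances as a dict of id -> balance."""
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall())


def transfer(conn, from_id, to_id, amount):
    """Move amount between accounts; return True if committed, False if rolled back."""
    try:
        # The connection context manager commits if the block finishes
        # and rolls back every change in it if the block raises.
        with conn:
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, from_id))
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, to_id))
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE id = ?", (from_id,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False
```

A transfer of 30 from account 1 to account 2 commits both updates. A transfer of 500 fails the balance check and rolls back both updates, so neither account shows a partial change. The return value is what lets the application tell the user whether the transaction committed.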

In older systems, the transaction path from the person entering data to the database was very short. It usually consisted of a terminal passing information through a communications controller to a mainframe, where an application program made the database calls, performed a COMMIT, and sent a response back to the terminal. Terminals were either locally attached or accessed through an internal network.

Today, the transaction path can be very long and very complex. It is not unusual for a transaction to originate outside your corporation, on a PC connected over the Internet. The transaction flows through ISPs to an application server in your company. This server then passes messages to a database server, where the database calls are made. It is not unusual for multiple application servers to be in the path of the transaction. It is also not unusual for multiple companies to house application servers in the path. For example, Amazon passes transactions to other companies for "used book" orders.

The person entering the data is a nonprofessional, totally unfamiliar with the system paths. The paths themselves involve many parts, across many communication paths. If something goes wrong, such as a server going down, the person entering the information may not have any idea of whether the transaction occurred or not. If there is no procedure for them to find out, they often reenter the transaction, thinking it is not there, when in fact it is; or they do not reenter the transaction, thinking it happened, when in fact it did not. In one case, you have duplicate data; in the other, you have missing data.
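One commonly used design for this situation is to have the client attach a unique identifier to each transaction so that a retry can be recognized and the outcome can be queried. The following is a minimal in-memory sketch of that idea. The OrderStore class, its method names, and the new_transaction_id helper are hypothetical, and a real system would keep the identifiers in durable storage.

```python
import uuid


def new_transaction_id():
    """Generate a client-side identifier that accompanies every retry."""
    return str(uuid.uuid4())


class OrderStore:
    """Hypothetical server-side store that ignores repeated submissions."""

    def __init__(self):
        self._orders = {}

    def submit(self, transaction_id, order):
        # A resubmission with the same identifier does not create a duplicate.
        if transaction_id in self._orders:
            return "already recorded"
        self._orders[transaction_id] = order
        return "recorded"

    def status(self, transaction_id):
        # Lets an uncertain user find out whether the transaction occurred.
        return "recorded" if transaction_id in self._orders else "not found"

    def count(self):
        return len(self._orders)
```

If the response to the first submit is lost, the user can call status with the same identifier, or simply resubmit. In either case the store ends up with exactly one copy of the order, which addresses both the duplicate-data and the missing-data case.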

More attention must be paid to transaction system design in this new, complex world we have created. We came pretty close to cleaning up transaction failures in older "short path" systems but are now returning to this problem with the newer "long path" systems.

In summary, there are plenty of ways data inaccuracies can occur when data is initially created. Errors that occur innocently tend to be random and are difficult to correct. Errors that are deliberate or are the result of poorly constructed processes tend to leave clues around that can be detected by analytical techniques.
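As one illustration of the kind of clue an analytical technique can pick up, a made-up or defaulted value tends to appear far more often than any genuine value would. The sketch below flags values whose share of a column exceeds a threshold. The overrepresented_values function and the 20 percent threshold are assumptions chosen for illustration, and a flagged value is a candidate for investigation, not proof of an error.

```python
from collections import Counter


def overrepresented_values(values, max_share=0.2):
    """Return {value: share} for values whose share of the column exceeds max_share."""
    total = len(values)
    if total == 0:
        return {}
    counts = Counter(values)
    return {
        value: count / total
        for value, count in counts.items()
        if count / total > max_share
    }
```

Applied to a birth date column in which half the rows contain the same placeholder date, the function returns that date with a share of 0.5. An appropriate threshold depends on the field: a share that is suspicious for a birth date may be entirely normal for a state or country code.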