1.9 Data Quality Assurance Technology

Although information quality has remained at low levels or even degraded over the years, there has been progress in the technology for improving it. Although information quality technology is not yet considered a formal technology, its parts are coming together and will likely be recognized as such in the near future. The essential elements of the technology are

availability of experts and consultants
educational materials
methodologies
software tools

These factors combined allow a corporation to establish a data quality assurance program and realize substantial gain. It is important that these factors become established as standard methods that incorporate the best practices. This will allow the entire IT industry to use the emerging technology effectively and will allow rapid transfer of knowledge between individuals and organizations.

This does not mean that the technology will not evolve; not everything is yet known about this area. It means that changes to the set of tools should be judged as to whether they advance the technology before they are adopted.

Every manufacturing operation has a quality control department. Every accounting department has auditors. There are inspectors for construction sites at every stage of building. There are requirements for formal specification of construction and manufacturing before anything is built. Any serious software development organization has trained quality assurance professionals.

Information systems need the same formality: a group of people and processes dedicated to ensuring higher levels of quality. Every serious organization with a large IT operation needs a data quality assurance program. Such organizations need to require formal documentation of all information assets and sufficient information about them to satisfy all development and user requirements. They need inspectors, auditors, and development consultants. They need an established methodology to continuously monitor and improve the accuracy of data flowing through their information systems.

Availability of Experts and Consultants

Before any technology can take off, it needs the attention of a lot of smart people. When relational technology got its rocket start in the late 1970s and early 1980s, there was research going on in several corporate research organizations (most notably IBM) and in many universities (most notably the University of California at Berkeley). The vast majority of Ph.D. theses in computer science in that era had something to do with relational database technology. An enormous number of start-up companies appeared to exploit the new technology. I did a survey in 1982 and found over 200 companies that had or were building a relational database engine. Today, fewer than five of them have survived. However, those that did survive have been enormously successful.

Data quality has the attention of a few smart people, not the large group that is desirable for a new technology to emerge. However, the number is increasing every year. Many university research efforts are now addressing this topic. The most notable is the M.I.T. TDQM (total data quality management) research program. There are many more university research efforts being aimed at this field every year. In addition, technical conferences devoted to data and information quality are experiencing significant growth in attendance every year.

A number of consultant experts have emerged who are dedicating their careers to the data quality topic. The number increases every year. The quality of these consultants is superb. Corporations should not hesitate to take advantage of their knowledge and experience.

Educational Materials

There is a clear shortage of educational materials in the field of data and information quality. Materials need to be developed and included in standard college courses on computer science. Corporations need to provide education not only to those responsible for data quality assurance but to everyone who is involved in defining, building, executing, or monitoring information systems. There should also be education for consumers of information so that they can more effectively determine how to use information at their disposal and to provide effective requirements and feedback to system developers.

Books and articles are useful tools for education, and plenty of them are available. However, more specific training modules need to be developed and deployed for quality to become an important component of information systems.

Methodologies

There have emerged a number of methodologies for creating and organizing data quality assurance programs, for performing data quality assessments, and for ongoing data stewardship. These can be found in the various books available on data or information quality. This book provides its own methodology, based on data profiling technology, for consideration. More detailed methodologies need to be employed for profiling existing data stores and monitoring data quality in operational settings.

If data quality assurance programs are going to be successful, they must rally around standard methods for doing things that have been proven to work. They then need to employ them professionally over and over again.

Software Tools

There has been a paucity of software tools available to professionals to incorporate into data quality assurance programs. It is ironic that on the topic of data quality the software industry has been the least helpful. Part of the reason for this is that corporations have not been motivated to identify and solve quality problems and thus have not generated sufficient demand to foster the growth of successful software companies focusing on data quality.

More tools are emerging as the industry is waking up to the need for improving quality. You cannot effectively carry out a good program without detailed analysis and monitoring of data. The area of data accuracy specifically requires software to deal with the tons of data that should be looked at.

Metadata Repositories

The primary software tool for managing data quality is the metadata repository. Repositories have been around for a long time but have been poorly employed. Most IT departments have one or more repositories in place and use them with very little effectiveness. Most people would agree that the movement to establish metadata repositories as a standard practice has been a resounding failure. This is unfortunate, as the metadata repository is the one tool that is essential for gaining control over your data.

The failure of repository technology can be traced to a number of factors. The first is that implementations have been poorly defined, with only a vague concept of what they are there for. Often, the real information that people need from them is not included. They tend to dwell on schema definitions and not the more interesting information that people need to do their jobs. There has been a large mismatch between requirements and products.

A second failure is that no one took them seriously. There was never a serious commitment to them. Information system professionals did not use them in their daily jobs. It was not part of their standard tool set. It appeared to be an unnecessary step that stood in the way of getting tasks done.

A third failure is that they were never kept current. They were passive repositories that had no method for verifying that their content actually matched the information systems they were supposed to represent. It is ironic that repositories generally have the most inaccurate data within the information systems organization.

A fourth failure is that the standard repositories were engineered for data architects and not for the wider audience of people who can benefit from valuable information in an accurate metadata repository. The terminology is too technical, the information maintained is not what this wider audience needs, and access is too restricted.

Since corporations have never accepted the concept of an industry standard repository, most software products on the market deliver a proprietary repository that incorporates only that information needed to install and operate their product. The result is that there are dozens of isolated repositories sitting around that all contain different information, record information in unique ways, and have little, if any, ability to move information to other repositories. Even when this capability is provided, it is rarely used. Repository technology needs to be reenergized based on the requirements for establishing and carrying out an effective data quality assurance program.

Data Profiling

The second important need is analytical tools for data profiling. Data profiling has emerged as a major new technology. It employs analytical methods for looking at data for the purpose of developing a thorough understanding of the content, structure, and quality of the data. A good data profiling product can process very large amounts of data and, with the skills of the analyst, uncover all sorts of issues in the data that need to be addressed.

Data profiling is an indispensable tool for assessing data quality. It is also very useful for periodic checking of data to determine whether corrective measures are effective or to monitor the health of the data over time.

Data profiling uses two different approaches to examining data. One is discovery, whereby processes examine the data and discover characteristics from the data without the prompting of the analyst. In this regard it is performing data mining for metadata. This is extremely important to do because the data will take on a persona of its own and the analyst may be completely unaware of some of the characteristics. It is also helpful in addressing the problem that the metadata that normally exists for data is usually incorrect, incomplete, or both.
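As a rough illustration of discovery, the following Python sketch derives a few characteristics of a single column from the values alone, with no metadata supplied by the analyst. The function name profile_column, the chosen statistics, and the sample values are assumptions made for this example; commercial profiling products compute far more than this.

```python
from collections import Counter

def profile_column(values):
    """Discover basic characteristics of one column from its values alone."""
    # Treat None and blank strings as missing values.
    non_null = [v for v in values if v is not None and str(v).strip() != ""]
    lengths = [len(str(v)) for v in non_null]

    def looks_int(v):
        try:
            int(str(v))
            return True
        except ValueError:
            return False

    # Crude type inference: "integer" only if every non-null value parses.
    inferred = "integer" if non_null and all(looks_int(v) for v in non_null) else "text"
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "inferred_type": inferred,
        "top_values": Counter(non_null).most_common(3),
    }

# Hypothetical sample of a state-code column.
sample = ["CA", "NY", "ca", None, "CA", "", "TX"]
profile = profile_column(sample)
```

Here the profile reports four distinct values where an analyst might have expected three, because "ca" and "CA" are stored differently. That is the kind of characteristic discovery is meant to surface.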

The second approach to data profiling is assertive testing. The analyst poses conditions believed to be true about the data and then executes data rules that check for these conditions to see whether the data conforms. This is also a useful technique for determining how much the data differs from the expected. Assertive testing is normally done after discovery.
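Assertive testing can be sketched in a few lines of Python. The analyst states a condition as a predicate, and the rule is executed against every record to collect the ones that do not conform. The check_rule function, the employee records, and the 16-year threshold are all hypothetical and serve only to show the shape of the technique.

```python
def check_rule(records, name, predicate):
    """Run one data rule and collect the records that violate it."""
    violations = [r for r in records if not predicate(r)]
    return {"rule": name, "tested": len(records), "violations": violations}

# Hypothetical employee records; the second one is deliberately inconsistent.
employees = [
    {"id": 1, "hire_year": 1998, "birth_year": 1970},
    {"id": 2, "hire_year": 1985, "birth_year": 1990},  # hired before birth
    {"id": 3, "hire_year": 2001, "birth_year": 1975},
]

result = check_rule(
    employees,
    "hire year must be at least 16 years after birth year",
    lambda r: r["hire_year"] - r["birth_year"] >= 16,
)

# The share of violating records indicates how far the data is from the expected.
violation_rate = len(result["violations"]) / result["tested"]
```

Keeping each rule as a named predicate, rather than embedding it in application code, makes the set of rules easy to review and rerun.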

The output of data profiling will be accurate metadata plus information about data quality problems. One goal of data profiling is to establish the true metadata description of the data. In effect, it can correct the sins of the past.

Data profiling tools exist in the market and are getting better every year. They did not exist five years ago. Data profiling functions are being implemented as part of some older products, and some new products are also emerging that focus on this area. More companies are employing them every year and are consistently amazed at what they can learn from them.

Data Monitoring

The third need is for tools that monitor data quality. A data monitoring tool can be either transaction oriented or database oriented. If transaction oriented, the tool looks at individual transactions before they cause database changes. A database-oriented tool looks at an entire database periodically to find issues.

The goal of a transaction monitor is to screen for potential inaccuracies in the data in the transactions. The monitor must be built into the transaction system. XML transaction systems make this a much more plausible approach. For example, if IBM's MQ is the transaction system being employed, building an MQ node for screening data is very easy to do.

A potential problem with transaction monitors is that they can slow down processing if too much checking is done. If this is the result, they will tend not to be used very much. Another problem is that they are not effective in generating alerts where something is wrong but not sufficiently wrong to block the transaction from occurring. Transaction monitors need to be carefully designed and judiciously used so as to not impair the effectiveness of the transaction system.
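One way to picture a transaction screen is a small set of rules applied to each transaction before it reaches the database. The Python sketch below is only an assumption about how such a screen might be organized. In particular, the "block" and "warn" severities are one possible way to raise an alert without stopping the transaction, not a feature of any specific product. Keeping the rule list short is what keeps the cost on the transaction path low.

```python
def screen_transaction(txn, rules):
    """Apply each rule to one transaction; return (accepted, alerts).

    rules is a list of (description, predicate, severity) tuples, where
    severity is "block" or "warn". Both labels are assumptions of this sketch.
    """
    alerts = []
    accepted = True
    for description, predicate, severity in rules:
        if not predicate(txn):
            alerts.append((severity, description))
            if severity == "block":
                accepted = False  # serious enough to stop the transaction
    return accepted, alerts

# Hypothetical rules for an order-entry transaction.
order_rules = [
    ("quantity must be positive", lambda t: t["quantity"] > 0, "block"),
    ("quantity above 1000 is unusual", lambda t: t["quantity"] <= 1000, "warn"),
]

# Suspicious but allowed: the transaction proceeds and an alert is recorded.
ok, alerts = screen_transaction({"order_id": 17, "quantity": 5000}, order_rules)

# Clearly wrong: the transaction is blocked.
bad, bad_alerts = screen_transaction({"order_id": 18, "quantity": 0}, order_rules)
```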

Database monitors are useful for finding a broad range of problems and in performing overall quality assessment. Many issues are not visible in individual transactions but surface when looking at counts, distributions, and aggregations. In addition, many data rules that are not possible to use on individual transactions because of processing time become possible when processing is offline.
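The following Python sketch shows an issue that is invisible in any single transaction but surfaces in a distribution. It compares each value's share of a column against a baseline taken earlier and flags large shifts. The function names, the 10 percent tolerance, and the sample data are assumptions made for illustration.

```python
from collections import Counter

def value_shares(rows, column):
    """Return each distinct value's share of the rows for one column."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

def distribution_issues(baseline, current, tolerance=0.10):
    """Flag values whose share moved by more than `tolerance` since the baseline."""
    issues = []
    for value in sorted(set(baseline) | set(current)):
        delta = current.get(value, 0.0) - baseline.get(value, 0.0)
        if abs(delta) > tolerance:
            issues.append((value, round(delta, 2)))
    return issues

# Hypothetical snapshots of the same table taken a month apart.
last_month = [{"state": "CA"}] * 50 + [{"state": "NY"}] * 50
this_month = [{"state": "CA"}] * 45 + [{"state": "NY"}] * 25 + [{"state": "XX"}] * 30

issues = distribution_issues(
    value_shares(last_month, "state"),
    value_shares(this_month, "state"),
)
```

Every individual "XX" record might pass a transaction screen, yet the sudden appearance of that value in 30 percent of rows is a clear signal when the whole table is examined.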

Database monitors are also useful in examining collections of data being received at a processing point. For example, data feeds being purchased from an outside group can be fed through a database monitor to assess the quality of the submission.

The most effective data monitoring program uses a combination of transaction and database monitoring. It takes an experienced designer to understand when and where to apply specific rules. The technology of data quality monitors is not very advanced at this point. However, this is an area that will hopefully improve significantly over the next few years.

Data Cleansing Tools

Data cleansing tools are designed to examine data that exists to find data errors and to fix them. To find an error, you need rules. Once an error is found, either it can cause rejection of the data (usually the entire data object) or it can be fixed. To fix an error, there are only two possibilities: substitution of a synonym or correlation through lookup tables.

Substitution correction involves having a list of value pairs that associate a correct value for each known wrong value. These are useful for fixing misspellings or inconsistent representations. The known misspellings are listed with correct spellings. The multiple ways of representing a value are listed with the single preferred representation. These lists can grow over time as new misspellings or new ways of representing a value are discovered in practice.
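Substitution correction maps naturally onto a lookup table of value pairs. The Python sketch below assumes a small, hypothetical table for state names. A real list would be built up over time as new variants are found in practice.

```python
# Known wrong or variant values (lowercased) mapped to the preferred value.
# The entries are hypothetical examples.
SUBSTITUTIONS = {
    "calfornia": "California",
    "calif.": "California",
    "ca": "California",
    "n.y.": "New York",
    "ny": "New York",
}

def substitute(value, table=SUBSTITUTIONS):
    """Return the preferred representation if the value is a known variant."""
    key = value.strip().lower()
    # Values not in the table are returned unchanged rather than guessed at.
    return table.get(key, value)

cleaned = [substitute(v) for v in ["Calfornia", " NY ", "Texas", "CA"]]
```

Values that do not appear in the table pass through untouched, so the technique fixes only errors that have already been seen and recorded.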

Correlation requires a group of fields that must be consistent across values. A set of rules or lookup tables establishes the value sets that are acceptable. If a set of values from a database record is not in the set, the program looks for a set that matches most of the elements and then fixes the missing or incorrect part. The most common example of this is name and address fields. The correlation set is the government database of values that can go together (e.g., city, state, Zip code, and so on). In fact, there is little applicability of this type of scrubbing for anything other than name and address examination.
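Correlation can be sketched as a search for the reference row that agrees with the record on the most fields. In the Python example below, the two-row reference table stands in for a real postal database, and the min_matches threshold and the decision to reject ties are assumptions of this sketch rather than rules taken from any cleansing product.

```python
FIELDS = ("city", "state", "zip")

# Hypothetical reference table; a real one would come from a postal authority.
VALID_COMBINATIONS = [
    {"city": "Austin", "state": "TX", "zip": "78701"},
    {"city": "Boston", "state": "MA", "zip": "02108"},
]

def correlate(record, reference=VALID_COMBINATIONS, min_matches=2):
    """Repair a record from the single reference row that matches most fields."""
    if not reference:
        return record, "rejected"
    # Count how many fields of the record agree with each reference row.
    scores = [sum(record.get(f) == row[f] for f in FIELDS) for row in reference]
    best = max(scores)
    if best == len(FIELDS):
        return record, "valid"
    # Correct only when enough fields agree and the best match is unambiguous.
    if best >= min_matches and scores.count(best) == 1:
        row = reference[scores.index(best)]
        return {**record, **{f: row[f] for f in FIELDS}}, "corrected"
    return record, "rejected"

fixed, status = correlate({"city": "Austin", "state": "TX", "zip": "78710"})
unfixed, unfixed_status = correlate({"city": "Austin", "state": "MA", "zip": "99999"})
```

The first record agrees with the Austin row on two of three fields, so its Zip code is replaced. The second agrees with each row on only one field and is rejected instead of being guessed at.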

Database Management Systems

Database management systems (DBMSs) have always touted their abilities to promote correct data. Relational systems have implemented physical data typing, referential constraints, triggers, and procedures to help database designers put transaction screening, database screening, and cleansing into the database structure. The argument is that the DBMS is the right place to look for errors and fix data because it is the single point of entry of data to the database.

Database designers have found this argument useful for some things and not useful for others. The good designers are using the referential constraints. A good database design will employ primary key definitions, data type definitions, null rules, unique rules, and primary/foreign key pair designations to the fullest extent to make sure that data conforms to the expected structure.
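The structural rules named above can be tried out with Python's built-in sqlite3 module. The table and column names below are hypothetical. Two caveats apply to SQLite specifically: foreign keys are enforced only after PRAGMA foreign_keys = ON is issued, and ordinary column types are affinities rather than strict types, so data type enforcement is weaker than in most server DBMSs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT NOT NULL UNIQUE
    );
    CREATE TABLE purchase (
        purchase_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        quantity    INTEGER NOT NULL
    );
""")
conn.execute("INSERT INTO customer VALUES (1, 'a@example.com')")

def try_insert(sql, params):
    """Return None on success or the constraint error message on failure."""
    try:
        conn.execute(sql, params)
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)

errors = [
    try_insert("INSERT INTO customer VALUES (?, ?)", (2, "a@example.com")),  # duplicate email
    try_insert("INSERT INTO customer VALUES (?, ?)", (3, None)),             # null email
    try_insert("INSERT INTO purchase VALUES (?, ?, ?)", (1, 99, 5)),         # unknown customer
    try_insert("INSERT INTO purchase VALUES (?, ?, ?)", (2, 1, 5)),          # valid row
]
```

The first three inserts are refused by the unique rule, the null rule, and the primary/foreign key pair respectively, while the last one succeeds. None of this required procedural code, which is why structural constraints are the part of the DBMS argument that good designers accept.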

The problems with putting quality screens into the DBMS through procedures and triggers are many. First of all, the rules are buried in obscure code instead of being in a business rule repository. This makes them difficult to review and manage. A second problem is that all processing becomes part of the transaction path, thus slowing down response times. A third problem is that the point of database entry is often "too late" to clean up data, especially in Internet-based transaction systems. The proper way to treat data quality issues is to use a combination of DBMS structural support, transaction monitors, database monitors, and external data cleansing.