8.2 The Process for Profiling Columns

Figure 8.4 shows a schematic of what goes on when profiling columns. There are two goals to the process: (1) define the properties that determine what values are acceptable in the column and (2) find occurrences of violations of those properties (inaccurate values). This process looks at values in isolation. The examination concerns itself solely with the values stored in each column without any regard for how these values relate to other values in the business objects. That is covered in later steps of the data profiling process, as described in later chapters.

Figure 8.4: Column property analysis process.

Gathering Information

The available metadata needs to be captured in a data profiling repository used for the data profiling process. This includes as much information as is available about each column, regardless of whether it is complete or accurate. These are referred to as documented properties.

Sources of information can include programming specifications such as COBOL copybooks or PL/1 data definitions, database specifications such as DDL for relational systems, data entry screen specifications, data entry procedure manuals, data dictionary descriptions, and metadata repository information. The original metadata will provide a framework for identifying columns and their properties.

Discovery from Data

The data should be examined for each column to determine what is in it. This should be done independently from the metadata gathered. The reason for this is that gathering the information is generally not difficult to do, provided column boundaries can be found. Some properties of columns are difficult to discover without prior knowledge or discovery software. For example, automatically determining that the data type for a character column is really "date" is not easy to do without special software.

However, it is not rocket science to determine the cardinality, the range of values, whether the content is numeric or character, the range of the length of values, uniqueness, and the use of special-characters blanks, question marks, or strings that may indicate NULL. This should all be computed from the data.

You obviously cannot look at each data value in a column containing thousands or millions of instances to determine what you have. You need computational support. The ideal support is a program that computes the properties from the data and then allows the user to verify against the documented properties. If there is a mismatch, either the data is in error or the documented property is either inaccurate or incomplete.

If you lack a discovery-based program, you need to develop queries against the data for each column in order to determine the information you need to make a judgment whether the data is in conformance with the documented properties.

Trying to perform discovery of column properties against data that is not in tabular form is very difficult and is typically impractical. This is the essential argument for extraction of data to be profiled into tabular form as a part of the data profiling process.

Verification of Results

The next step is to match the content of the data to the documented properties. The matching process is both physical and logical. The goal is to determine if any differences are caused by inaccurate data or by inaccurate metadata. In looking at differences, there is a need to look at the results from two perspectives: inaccurate values and accurate values.

Values determined to be outside the scope of the documented properties would normally be considered inaccurate data. However, given the high suspicion about the accuracy of documented properties, the violations may suggest that the metadata is wrong. It is not at all unusual for data to be used for a fact that is different from the initial purpose of the column. It is also possible that the parameters are too restrictive, that the correct specification for valid values should be larger than specified in the properties. This examination may very well result in the metadata being revised as a result of investigating these differences.

Considering the accurate values is also useful in judging the metadata. For example, if a column is specified as having a length of 30 bytes and the discovery process determines that all entries are 1 byte in length, a strong suspicion is raised that the metadata is referring to some other column.

This process of matching the metadata to the discovered properties of the data should be done in a group setting. The data profiling analyst, business analysts, subject matter experts, and possibly data designers, database administrators, or application developers should be included. They need to approach the exercise from the point of view of rigorously determining what the correct property list is for each column and where it deviates from the discovered data and the extent of quality issues.

The output of this process is an accurate property list for each column. Another output may be a number of data quality issues surrounding situations that should be addressed to avoid future problems in using the data. The use of improperly defined encoding schemes, field overloading, inconsistent representations of NULL conditions, and the lack of domain standards across multiple columns are all good candidates for quality improvement issues.

Validation of Data

The last step is to process against the data one more time after the property lists are determined to be accurate. This allows identification and enumeration of all violations of the property lists. These are inaccurate data values.

These results need to be recorded in the data profiling repository as well. Ideally, these will be linked to each column specification in the repository.

This process is different from the discovery process previously described. In the discovery process you are computing characteristics. In the validation step you are computing the difference between the rules in the property list and the data. You may find that sample data is sufficient for discovery, especially on very large data sources, but that the entire data source needs to be used for validation.

It is a good practice to reconvene the data profiling review group used to determine the accurate metadata to review the outputs of the validation step. Additional confirmation can be achieved that the metadata is correct, as well as educating everyone on the amount of inaccurate values found in the data.

The results need to be recorded as inaccurate data facts. These facts are subsequently aggregated into data quality issues, as defined in Chapter 5.