Chapter 7: Data Profiling Overview

Overview

This chapter begins the examination of the most important technology available to the data quality assurance team: data profiling.

Note to the reader: This text uses the terms column and table throughout the data profiling chapters for consistency. Data profiling is applied to data from a wide variety of sources, which use different terminology for the same constructs. Consider table the equivalent of file, entity, relation, or segment, and column the equivalent of data element, attribute, or field.

The text uses the term data profiling repository to mean a place to record all of the information used in and derived from the data profiling process. Much of this information is metadata. However, I do not want to confuse the reader by referring to it as a metadata repository. A user could store this information in an existing metadata repository, provided it is robust enough to hold all of the types of information. Otherwise, the user could use the repository provided by a data profiling software vendor or build one. It is reasonable to expect that much of this information will subsequently be moved to an enterprise metadata repository after data profiling is complete.