7.1 Goals of Data Profiling

Data profiling is the application of data analysis techniques to existing data stores to determine the actual content, structure, and quality of the data. This purpose distinguishes it from data analysis aimed at deriving business information from data: data profiling derives information about the data itself.

Data profiling starts from the assumption that any available metadata describing rules for the correctness of the data is either wrong or incomplete. The process produces accurate metadata as an output by reverse-engineering metadata from the data itself and comparing the result with the proposed metadata.
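As a minimal sketch of this reverse-engineer-and-compare idea, the following Python derives a few simple properties from a column's values and reports where they disagree with the documented metadata. The function names, the choice of properties, and the sample column are illustrative assumptions, not part of any particular profiling product.

```python
def infer_column_metadata(values):
    """Derive simple metadata from the observed values of one column."""
    non_null = [v for v in values if v is not None]
    return {
        "nullable": len(non_null) < len(values),
        "max_length": max((len(str(v)) for v in non_null), default=0),
        "distinct_count": len(set(non_null)),
    }


def compare_metadata(documented, discovered):
    """Return each documented property that the data contradicts."""
    return {
        key: {"documented": documented[key], "discovered": discovered.get(key)}
        for key in documented
        if documented[key] != discovered.get(key)
    }


# Hypothetical column: the documentation claims NOT NULL and a length of 2.
documented = {"nullable": False, "max_length": 2}
state_codes = ["TX", "CA", None, "Calif", "TX"]

discovered = infer_column_metadata(state_codes)
mismatches = compare_metadata(documented, discovered)
```

Each entry in `mismatches` is a question for the analyst rather than an automatic verdict: either the documented rule is wrong or the data is.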

If data were perfectly accurate, you would need only the information derived from the data. However, because most data stores contain inaccuracies, the user must decide, wherever the two disagree, whether the data or the metadata is correct. When both are flawed, this can become a complex exercise.

Data profiling is a process of learning from the data. It employs discovery and analytical techniques to find characteristics of the data that a business analyst can then review to determine whether the data matches the business intent.

Once the proper definition of the data has been established, the data profiling methodology allows for computing the violations of the metadata that occur in the data. This provides both hard instances of inaccurate data and evidence of inaccurate data for which the actual wrong values cannot be determined.
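The two kinds of output can be illustrated with a small Python sketch. It assumes a hypothetical employee table with a `status` domain and a rule that a termination date cannot precede the hire date; the field names and rules are invented for illustration. A domain violation pinpoints the wrong value, whereas a cross-column violation shows only that at least one of the values involved is wrong.

```python
from datetime import date

VALID_STATUS = {"active", "terminated"}  # assumed value domain


def find_violations(rows):
    """Split rule violations into hard instances and evidence-only cases."""
    hard = []      # the specific wrong value is known
    evidence = []  # a rule is broken, but not which value is wrong
    for i, row in enumerate(rows):
        if row["status"] not in VALID_STATUS:
            hard.append((i, "status", row["status"]))
        end = row["termination_date"]
        if end is not None and end < row["hire_date"]:
            evidence.append((i, ("hire_date", "termination_date")))
    return hard, evidence


rows = [
    {"status": "active", "hire_date": date(2001, 3, 1), "termination_date": None},
    {"status": "actve", "hire_date": date(1999, 6, 15), "termination_date": None},
    {"status": "terminated", "hire_date": date(2002, 5, 1),
     "termination_date": date(2001, 5, 1)},
]

hard, evidence = find_violations(rows)
```

Here row 1 yields a hard instance (`"actve"` is not in the domain), while row 2 yields only evidence: one of the two dates is wrong, but the data alone cannot say which.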

Data profiling cannot find all inaccurate data. It can find only rule violations. These include invalid values, structural violations, and data rule violations. Some inaccurate data can pass all rule tests and yet still be wrong.
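A short sketch makes this limit concrete. The rules and records below are assumptions for illustration, and the `actual` record stands for real-world truth that a profiling process does not have access to.

```python
VALID_STATES = {"TX", "CA", "NY"}  # assumed value domain


def passes_rules(record):
    """Check the value domain for state and the format of the ZIP code."""
    return (
        record["state"] in VALID_STATES
        and len(record["zip"]) == 5
        and record["zip"].isdigit()
    )


recorded = {"state": "CA", "zip": "94105"}  # what the data store holds
actual = {"state": "TX", "zip": "75201"}    # hypothetical truth, unknown to profiling

rule_check_passed = passes_rules(recorded)
matches_reality = recorded == actual
```

The recorded values are valid in every sense the rules can test, so no violation is raised even though the record does not match reality.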

The data profiling process is directly applicable to the inside-out approach of assessing data quality. It is also useful beyond the data quality assessment function by providing valuable metadata and quality information to projects that are developing new uses of data or that are migrating data to different data structures.

Another goal of data profiling is to create a knowledge base of accurate information about your data. This information needs to be captured and retained in a formal data profiling repository. It needs to survive the data profiling process and serve many clients over time. It needs to be updated either as more is learned or as systems change. Data profiling may need to be repeated on critical data stores from time to time to ensure that the information continues to be accurate.