Part III: Data Profiling Technology

Chapter 7: Data Profiling Overview
Chapter 8: Column Property Analysis
Chapter 9: Structure Analysis
Chapter 10: Simple Data Rule Analysis
Chapter 11: Complex Data Rule Analysis
Chapter 12: Value Rule Analysis
Chapter 13: Summary

Many technologies apply to creating and maintaining an effective data quality assurance program. Professionals in this area need to be familiar with all of the primary ones: data profiling, data cleansing, metadata repositories, data filtering, and data monitoring. This book focuses on data accuracy, and the most important technology for data accuracy is data profiling. The remainder of this book is devoted to that single technology.

Data profiling is defined as the use of analytical techniques to discover the true structure, content, and quality of a collection of data. It uses as input both any known metadata about the data and the data itself. The output of data profiling is accurate metadata plus additional information on content and quality of the data.
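To make the definition concrete, the following is a minimal sketch of profiling a single column, assuming the values arrive as strings (or None) and that the only known metadata is a declared type and a declared maximum length. The function name `profile_column`, its parameters, and the dictionary keys are hypothetical choices for illustration, not part of any particular profiling tool. The sketch takes the declared metadata and the data as input, and returns discovered metadata plus a list of disagreements between the two.

```python
from collections import Counter


def _looks_like_int(text):
    """Return True if the string can be parsed as an integer."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def profile_column(values, declared_type=None, declared_max_length=None):
    """Profile one column of string values against its declared metadata.

    Assumption: values are strings or None; empty strings count as nulls.
    """
    non_null = [v for v in values if v is not None and v != ""]

    # Infer the type from the data itself rather than trusting the metadata.
    if non_null and all(_looks_like_int(v) for v in non_null):
        inferred_type = "integer"
    else:
        inferred_type = "string"

    lengths = [len(v) for v in non_null]
    profile = {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "inferred_type": inferred_type,
        "max_length": max(lengths) if lengths else 0,
        "most_common": Counter(non_null).most_common(3),
    }

    # Record each place where the data contradicts the declared metadata.
    issues = []
    if declared_type is not None and declared_type != inferred_type:
        issues.append(
            f"declared type {declared_type!r} but data looks like {inferred_type!r}"
        )
    if declared_max_length is not None and profile["max_length"] > declared_max_length:
        issues.append(
            f"declared max length {declared_max_length} "
            f"but found a value of length {profile['max_length']}"
        )
    profile["issues"] = issues
    return profile


if __name__ == "__main__":
    sample = ["10", "25", "", "abc", "25", None]
    result = profile_column(sample, declared_type="integer", declared_max_length=2)
    for key, value in result.items():
        print(key, "=", value)
```

In this sketch, a disagreement between declared and discovered metadata is only a fact to investigate: either the metadata is wrong or the data is inaccurate, and deciding which requires further analysis.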

For purposes of data quality assurance, data profiling is a process used to discover the presence of inaccurate data within a database. The more data profiling you do, the more inaccuracies you dig out. The information discovered becomes facts used to form data quality issues that need resolution.

Data profiling is a new technology that has emerged in the last few years. Before that, practitioners spent many years using ad hoc methods as a substitute for formal data profiling. Because they lacked a formal methodology and analytical tools designed specifically for data profiling, the process was time-consuming and relatively ineffective. Data profiling has since matured into a formal and effective technology that enables the inside-out approach to data quality assurance.

The next chapter provides an overview of the data profiling process. It outlines the basic methodology and describes each of the data profiling categories. It also discusses the difference between discovery and verification techniques.

The chapters that follow drill down on specific steps of the data profiling process. They include techniques used to accomplish the analysis, as well as examples that demonstrate the process and value.

This part is not an exhaustive treatment of data profiling; that would require a large book of its own. However, there is enough material here to develop a thorough awareness of the technology, understand how it works, and see how it can return value.

Data profiling applies directly to data quality assurance functions. It is also a valuable technology in developing new uses of data and in migrating data to other projects. When used in these projects, it can be viewed as either a development technology or a quality assurance technology. It provides an organized way of looking at data and delivering the metadata needed to make projects successful.