7.5 When Should Data Profiling Be Done?

Clearly, data profiling should be done on all data quality assessment projects, as well as on all IT projects that move data to a new structure or that migrate or consolidate data. If a data quality assurance department uncovers significant facts about a data source through the outside-in method, it should profile the data source after the fact to determine the extent of the inaccuracies and to discover any additional inaccuracy problems that may exist in the same data.

It is important to do all steps of the data profiling process whenever it is used. Analysts who think they can bypass a step because they understand the structure, or believe that rules are enforced by the application programs, will often be surprised by discoveries that go beyond what they expect. The biggest task in data profiling is generally getting started: gathering known metadata and getting the data extracted. Once you have done all of this and gone through the first step of data profiling, the other steps do not add much more time to the process. Ending early risks missing inaccuracies, whereas completing the remaining steps costs little.
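To make that first step concrete, the sketch below shows the kind of basic column-level evidence an initial profiling pass collects once the data has been extracted. It is only an illustration; the pandas-based approach, the file name, and the metrics chosen are assumptions, not prescribed by the text.

```python
import pandas as pd

def profile_column(series: pd.Series) -> dict:
    """Collect basic facts about one column: null rate, cardinality,
    range, and common values -- the raw evidence later steps build on."""
    non_null = series.dropna()
    return {
        "count": len(series),
        "null_rate": series.isna().mean(),
        "distinct_values": non_null.nunique(),
        "min": non_null.min() if not non_null.empty else None,
        "max": non_null.max() if not non_null.empty else None,
        "most_common": non_null.value_counts().head(5).to_dict(),
    }

# Hypothetical usage: profile every column of an extracted source table.
source = pd.read_csv("customer_extract.csv")  # hypothetical extract file
for name in source.columns:
    print(name, profile_column(source[name]))
```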

Important databases should be reprofiled periodically. The rationale is that changes to applications are occurring all the time. Industry experts have consistently estimated that production applications change at a rate of roughly 7% per year. Many of these changes have the potential to introduce new opportunities for generating inaccurate data. Other changes, such as business process changes or personnel changes, can also cause data accuracy to deteriorate.

Once data profiling has been done on a source, much of the initial work has already been done, making a reprofiling exercise go much faster. Data profiling should also be done on data sources after remedies have been implemented and enough time has passed for them to have an impact. This is a good way to measure the effectiveness of the remedies as well as to ensure that new problems have not been introduced.
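One simple way to use a reprofiling run for this purpose is to compare per-column error rates from the baseline profile with those from the post-remedy profile. The sketch below assumes each run produced such rates; the column names and values are hypothetical.

```python
# Hypothetical error rates per column from two profiling runs.
baseline = {"phone_number": 0.12, "birth_date": 0.04, "zip_code": 0.07}
after_fix = {"phone_number": 0.02, "birth_date": 0.04, "zip_code": 0.09}

for column, before in baseline.items():
    after = after_fix[column]
    trend = "improved" if after < before else "worse or unchanged"
    print(f"{column}: {before:.0%} -> {after:.0%} ({trend})")
# A column whose rate went up (zip_code here) flags a newly introduced problem.
```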

Data profiling of secondary, derivative data stores is also helpful. For example, data profiling the data warehouse can reveal problems that appear only at the data warehouse level. Aggregating and integrating data from multiple data sources can generate conditions that are illogical and discoverable only in the aggregation. For example, two data sources that maintain the same information at different levels of granularity will populate a data warehouse column with unusable data, yet each data source on its own would pass data profiling just fine.
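The granularity example can be illustrated with a small sketch. The table layout, values, and pandas usage here are assumptions made for illustration only: one source reports sales monthly, the other daily, and each looks reasonable in isolation.

```python
import pandas as pd

# Hypothetical sources maintaining "sales" at different granularities.
monthly = pd.DataFrame({"region": ["East"], "period": ["2024-01"], "sales": [90_000]})
daily = pd.DataFrame({"region": ["West"], "period": ["2024-01-15"], "sales": [3_100]})

# The warehouse column now mixes monthly totals with daily figures.
warehouse = pd.concat([monthly, daily], ignore_index=True)
print(warehouse["sales"].describe())
# Profiling the combined column shows values spanning orders of magnitude
# within the same "sales" column -- a condition visible only after
# aggregation, not in either source alone.
```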