The column property analysis process discovers a variety of facts about the data. Some of it is identifying specific, inaccurate values. In addition, you end up with a rigorous rule set for defining what is stored in each column. You also identify bad data design, bad data entry processes, and data representation inconsistencies.
It is important to collect all of the facts before you fabricate remedies. The totality of issues should be used to determine what to do with all of the facts. The facts support decisions. Some of the remedies that can be supported at this level of data profiling are examined in the sections that follow.
If the data is being profiled as part of a project, a judgment can now be made as to whether the data contains the information needed for intended uses and has the quality to support the requirements of the intended uses. This is not the final decision point but is an important one.
It is far better to scrap a project at this point than to spend millions of dollars developing a new system that cannot succeed because the data cannot support it. If the data is inconsistent with the intended use, sometimes the target system can be modified to work with the data available. Most packaged applications have a number of customization options that can be examined to determine if they can be used to conform more to the source systems.
Whether data profiling the data for stand-alone assessment or for use on a new project, the requirement to upgrade source systems can be examined at this point in regard to issues involving individual values. The data might suggest improvements to source data structure design, source data capture procedures and programs, or application program changes. Even when the purpose of data profiling is to match data to target systems, an important side effect can be the discovery of enough issues to warrant a secondary project to upgrade the source systems. This may not be just to help the new system but to add value for the source system applications.
Data profiling often reveals wide differences in data encoding across systems. This is an excellent time to review the various practices being used and to define standard methods of representing data. This can then be set as the standard and used for future application development, making changes to existing source systems, or building transformations to have the data land in target systems in the standard form.
Data profiling discovers, develops, and executes a lot of rules for the purpose of digging out inaccurate data values. These rules can be used to build data checkers that can be installed in operational systems to catch bad values at the source. Checkers can be built into data entry screens, into transaction applications, or into DBMS functions, or they can be used to check databases through batch processing on a periodic basis.
You never want to put all rules into the checkers. They would run forever and hopelessly clog up the operational systems. However, selective use of rules can improve overall accuracy considerably.
Checkers can include more than just rules. They can include synonym lists of frequently found wrong values and the correct values to change them to. This is particularly valuable in resolving misspelled words and inconsistent representation problems. The wrong values discovered in data profiling can be turned into synonym lists to be used to clean data.
Checkers are particularly good for qualifying data coming into the corporation in the form of data feeds. It not only catches errors but provides the basis for negotiating improvements from the provider of the data.
One of the most important outputs of profiling values is input to building data transformation routines for use in moving data. This applies to mapping valid values to corresponding valid values in the other system and to cleansing wrong values by substituting correct values.
It is very common for transformation routines not to include logic for handling the NULL condition. This is a source of many problems in trying to get data moved and aggregated. Data profiling should have identified all of the relevant information needed for proper handling and transformation of NULL values.
It is often valuable to clean up some of the data problems before moving to later steps of data profiling. This is because data inaccuracies and inconsistencies can distort the discovery processes of later steps. This is a choice the analyst must make carefully, because decisions made in later steps will be made on the transformed data, not on the original source data.
It is not recommended that you clean up data for later data profiling by eliminating rows with values that are inaccurate or by converting inaccurate values to NULL. This can distort follow-on steps in ways that are not helpful. Cleanup should be restricted to standardizing data representation, standardizing NULL condition representation, and changing misspelled words.
Meetings with management may also be indicated at this point in order to discuss data suitability for future use and to discuss general problems with data quality. This would normally be done after all data profiling is complete. However, the quality of the data may be so poor that aborting the data profiling process may make sense at this point. Poor-quality data will very likely frustrate attempts to perform subsequent data profiling steps.