7.2 General Model

Figure 7.1 shows the general model of a data profiling process. The primary input is the data; the primary output is facts about the data.

Figure 7.1: Data profiling model.

Participants

Data profiling is usually done with a single analyst or small team of analysts performing most of the analytical work, and several other participants adding value to that analysis. The data profiling analyst is generally part of the data quality assurance team. This person should be well versed on data structures, data architecture concepts, data modeling, and analysis techniques. The analyst is not the one who needs to know the business purpose and meaning of the data, although it helps if she does.

Business analysts who understand the applications that work with the data being profiled must play an important role. The data analyst will bring intermediate results to the business analysts for semantic interpretation. The business analyst is the real authority on right or wrong rules. The data will suggest many rules and relationships that need to be verified.

IT staff members that built the application, maintain it, or manage the physical data can also be useful. When it comes to rule definition and verification, the application developers can provide a great deal of insight. The database administrators are helpful in providing the data to be used in data profiling and in explaining some of the structural issues that may surface in the data profiling process.

Other staff may be helpful in understanding situations discovered in the data. For example, people who create data, people who use reports, or people who perform ad hoc queries against the data can also be used to explain mysteries that arise.

It is helpful to periodically gather people from several of these disciplines into a room for a group review of the facts discovered by the data profiling analyst. This is often the most productive way to get through the mysteries to the truth: match the experts to the facts determined from the data.

Some business analysts are more than capable of driving the data profiling process. The key is whether they understand the basic principles of data architecture. If they do not, they will have difficulty getting through the structural analysis. They need to understand concepts such as null, primary keys, functional dependencies, referential constraints, and entity relationship rules.

Some business analysts do understand these, and some do not. If they do, you have the best of all worlds: an analyst who can wade through the technology and who knows the business aspects of the data. If they do not, it is best that a partnership be created to get the work done.

Data Profiling Inputs

There are two inputs: metadata and data. The metadata defines what constitutes accurate data. If you had perfectly accurate and complete metadata, all you would need to do is compare the data to the metadata, and any violations are data inaccuracy facts. However, the metadata is almost always inaccurate and incomplete. This places a higher burden on attempts to use it with the data.

Data profiling depends heavily on the data. The data will tell you an enormous amount of information about your data if you analyze it enough.

Metadata

Metadata is essential for determining inaccuracies. You cannot tell if something is wrong unless you can define what being right is. That definition is metadata.

Appendix B shows a list of some of the information included in metadata. A data profiling repository should include all information the data management team and the business users require. This may include other information that is not included in that list.

Although you can be certain that what you know about the data before you start data profiling is going to be inaccurate and incomplete, you must start there. The existing information will provide a basic structure. This structure is essential to getting started.

The source of information starts with identification of the data source, the database or file technology it is stored in, and how to access the information. You need access to the data to perform data profiling. Access should be as unfettered and unfiltered as possible.

The next part is to collect metadata on the column-level layout of the data within the database or file system. How this is accomplished will generally be dependent on the data source. If the data is in a relational system, this information should be extracted from the relational database directory or catalog. If accessing an IMS or VSAM data source, you should look for a COBOL copybook or a PL/1 INCLUDE file that lays out the data. For other data sources, seek the appropriate source that gives you the layout that best represents the data according to the most likely extraction method.

Extract structural information that is available within the database management system. For example, in relational systems you can extract primary key, foreign key, and other referential constraint information. You can also examine any TRIGGER or STORED PROCEDURE logic embedded within the relational system to cull data filtering and validation rules being enforced. In IMS, the program specification block (PSB) gives insight into the hierarchical structure being enforced by IMS.

Seek out interface definitions to application programs that feed data to the data source. These can often add illuminating information about the expected content of the data. Collect any descriptions of the data that are available in existing data dictionaries, metadata repositories, or other documentation forms.

You may also want to collect application source code for programs that insert, delete, and update information in the database. These can be culled to extract data rules that are currently being enforced through the code. There are software packages available to assist you with extracting rules from source programs. Although this technique is possible, it is rarely done. Some reason for not doing this are that the volume of source code may be extremely high, extracting rules is a difficult task because typical programming logic does not easily isolate rule logic, and the technical staff performing data profiling typically is not trained in the programming languages used (usually COBOL).

Looking for external process rules may also be helpful. Sometimes data rules are enforced through human procedures instead of through program or database logic. Procedure manuals used by data entry personnel may contain valuable rules.

All of this information should be recorded in a common place: the data profiling metadata repository. Recording in a common format will facilitate dealing with many disparate data sources.

It may be useful at this point to verify some of the information with other technical people to determine the completeness of your data-gathering exercise and to identify areas where others suspect wrong or missing information. In addition to talking to business analysts, it is probably just as important to discuss your findings with database administrators or application programmers. In fact, it would be useful for the data profiling analyst to work with them throughout the data-gathering phase.

Data

The next step in the process is to extract some of the data from the data source. There are two reasons for doing this. Executing data profiling tasks can be CPU intensive, and most shops will resist doing this on operational systems. The interference with operational performance is normally prohibitive. The second reason is that you need to recast the data to a normalized form to effectively profile it. Data profiling can be done over data that is in less than third-normal form but may yield incorrect results. It is best to have it in third-normal form. However, you often do not know if it is in first-, second-, or third-normal form when you extract it. The data profiling process will reveal the form it is really in.

Extracting all of the data is clearly preferable to extracting only a sample. However, on very large data sources measuring in the hundreds of millions or billions of rows, using all of the data would be prohibitive for some of the analytical processes. Most data profiling is not done on supercomputers. In reality, the team may only have an NT box available. Some processes will be able to run in acceptable times over even very large amounts of data, whereas others will take a prohibitive amount of time to complete. On these large data sources, samples are generally more than sufficient to accomplish data profiling tasks.

Sampling is a very complex topic that can make data profiling accurate or inaccurate. Common mistakes in sampling are to draw too small of a sample, too large of a sample, having the sample biased through getting records that disproportionately represent only part of the population, or getting samples that violate structural rules. This book does not provide descriptions of sampling techniques. However, a practitioner should understand this topic well if mistakes are to be avoided. Sampling should be used only when necessary and only for data profiling steps that require it.

Extraction routines must be developed. These are normally easy to perform. However, they can become complicated when data structures include either of three common practices: field overloading, redefines, and repeating arrays (called OCCURS clauses in COBOL). These factors need to be identified and appropriate normalization performed in the extraction process to get a meaningful array of normalized files for data profiling.

Overloaded fields need to be broken into multiple columns. Redefines need to be analyzed to determine if they are the same facts being redefined by a different external view or whether they represent different facts based on the value in some other column. Repetitive arrays need to be analyzed to determine if they represent multiple occurrences of the same business fact or whether they represent truly distinct facts. These decisions all lead to a determination of the proper way to normalize. The last two of these decisions are semantic and cannot be determined from looking at the data definitions alone.

Some transformations are also helpful in the extraction process. Data that is bit encoded in the source can usually benefit from being converted to a character format in order to facilitate data profiling activities.

Sometimes extracting data generates errors or nonsense results. This generally occurs due to errors in the information provided that started the process. Often, COBOL copybooks inaccurately map the data as stored. Overloaded fields are often not documented. Sometimes the controlling values that determine redefinition mapping are not the ones actually used in the logic.

All errors need to be examined to determine why they did not extract correctly. Result data should be put into a displayable format and a reasonableness check performed to see if the data appears to map out according to the expected data layouts. Offset problems, columns in the wrong places, and so on will generally be visible in this cursory check.

The analyst may get into a repetitive process here by developing extraction routines, executing them, finding out that the results are nonsense, investigating the causes of the problems, adjusting the extraction routines, and trying again.

The data profiling analyst should verify all extraction logic and resolution of problems encountered with the keepers of the data source. This may be a database administrator or an application developer who knows the data source well. They can often shed light on the problems and catch mistaken choices made by the data profiling analyst.

The decisions to normalize REDEFINE or OCCURS structures are semantic decisions. They require interpretation of the intent of the data structure. To verify these decisions or even to make them in the first place, it may be necessary to consult with a business analyst.

Maps derived from programming materials often do not map correctly to the actual data. One study concluded that approximately 80% of COBOL copybooks incorrectly map to the data. Most IT shops I mentioned this to did not dispute this number.

Typical mismatches include cases in which a column documented in the copybook is subdivided into two or more columns in practice, cases in which field overloading is done and not documented (COBOL specifications do not have constructs for overloading except in the case of REDEFINES that cover only one form of overloading), and cases in which programmers use a single column as a displacement anchor and do as they wish with the space behind it.

I have also observed on multiple occasions cases in which space in the FILLER specification is used for real data. I recommend that analysts always look at the content of FILLER for possible useful but undocumented data.

All of this comes about because a programmer needs to make a change in a hurry to an operational system and does not want to trigger a massive recompilation of hundreds of programs. They simply figure out how to accomplish the change in selected program logic and bypass the correct method of performing the copybook update. The change is made with minimum disruption of the system, and the copybook is forever corrupted.

Many issues may appear in the extraction process. Discovery of the need to separate overloaded fields should be documented because this will need to be done every time this data is accessed for new uses. The need to transform bit-level encodings to character representations must be documented for the same reasons.

All of the normalization logic needs to be documented, as this is important to get an accurate table representation of the source information. Key columns should also be included in the normalized representations.

Sometimes extraction fails due to problems in the data. For example, a column may be defined as packed decimal in the source system documentation, whereas attempts to convert this to a numeric variable in the extract representation may fail if the column is sometimes used for character data instead. The extraction routines may throw away the records containing errors or may substitute a zero for the invalid value. If this situation is found, the errors detected and the values changed need to be documented. One of the goals of data profiling is to find inaccurate data. If inaccuracies are detected in the extraction process and rejected from the output or changed to valid values, the data profiling steps will not catch them. This must be included in the set of inaccurate data facts disclosed.

If too many extraction issues arise, it may indicate that the data is unfit for the planned use. If this is the case, the data profiling analyst needs to specifically document the issues and try to quantify the magnitude of offending data. Some projects will not get past the extraction process, and rightly so. When this happens, they need to seek alternate data sources or launch a reengineering of the data source and its applications to improve the accuracy of the data before it is subsequently extracted.

Extracting data for data profiling can also be viewed as developing a common extraction process for all subsequent batch extractions of the data. Data profiling will be performed on the output of extraction, not the original data source. Therefore, any conclusions about quality or identified needs for transformations and cleansing will be based on the extracted representation of the data.

If data profiling does its job well, the extraction routines should then be used for all subsequent draws of the data so that all logic built from the information developed through the data profiling process will be accurate. If you extract the data differently in subsequent project execution, the remedies developed through data profiling may be wrong or insufficient.

Data Profiling Outputs

The primary output of the data profiling process is best described as accurate, enriched metadata and facts surrounding discrepancies between the data and the accurate metadata. These facts are the evidence of inaccurate data and become the basis for issues formation and investigation.

Metadata

Metadata is both an input and an output. This is true because the data profiling process will both correct and expand the metadata gathered to start the process.

The data profiling process probably generates the most complete set of information ever assembled for this data. It is not only complete, it is accurate. It should be the basis for all subsequent development activities on projects that will use the data source that is profiled.

If purchased data profiling software is used, it will normally come with a data profiling repository to store this in. If you also maintain an enterprise metadata repository, you may want to copy the relevant information to that repository after profiling. This is also a good opportunity to determine how good the information is in your enterprise repository.

Data Quality Facts

The database of information gathered for data quality facts includes indentification of rows and values that failed to pass the many rules used in the data profiling process. This information should include

table and column name
property rule
invalid value list with frequency of each
table and rowID containing the violation
total number of rows tested

The rowID may be a list of rowIDs. There may be a number of rows that match the violation. If one rule violation encompasses multiple data rows, this should also be captured. Not all rules will identify a row set that can be recorded. As you will see later, some aggregation tests cover such a large amount of rows that detailing all rows accessed to generate the aggregate values would be impractical.

The collection of rule violations should be kept separate from the metadata. This set constitutes an instance of violations (those violations found in the specific extraction used to generate it). Over time, you may accumulate multiple result sets against the same metadata rule set. It may be helpful to put summary information about the result set in the data profiling repository.

Latency

Most people would claim that maintaining this information over time is impractical. If past history is any indication, they are correct. As much as we would all like to think we could create a true data profiling repository once and keep it current over time, the practical reality is that it never happens. It is therefore more practical to have a strategy of creating it the first time a data store becomes the subject of a quality assessment project or is used as a data source in any other type of project. It should then be upgraded through data profiling for any subsequent projects that require use of the information. Updating it should take considerably less work than starting over for each project that needs to use the date.