7.4 Analytical Methods Used in Data Profiling

Within each data profiling step there can be processes for discovery, assertion testing, or visual inspection. The analyst uses the outputs of these processes to make decisions. Each type of activity is explored in the sections that follow.

Discovery

Discovery includes processing that reveals facts about the data that the analyst may not know or suspect. In this regard it is similar to data mining techniques.

An example of discovery is to execute a program that analyzes the patterns in text columns to determine if they may be date values, coded values following a specific character encoding pattern, single text strings, multiple text strings, or multiple text strings connected with special characters such as = or :. If a dominant characterization is found over a high percentage of the data, the analyst is a long way toward establishing the true properties of the column. The discovered property may be different from the one implied in the documented data type (usually just CHARnn) or in the name or description of the column.
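One way such a pattern discovery pass might work is sketched below in Python. The pattern set and the helper names (characterize, discover_column_pattern) are hypothetical; a production profiler would use a far richer pattern library.

    import re
    from collections import Counter

    # Hypothetical pattern set, ordered from most to least specific.
    PATTERNS = [
        ("date",             re.compile(r"^\d{4}-\d{2}-\d{2}$")),
        ("code",             re.compile(r"^[A-Z]{2,4}\d{2,4}$")),
        ("key=value pairs",  re.compile(r"^\w+[=:]\w+([;,]\w+[=:]\w+)*$")),
        ("multiple strings", re.compile(r"^\S+(\s+\S+)+$")),
        ("single string",    re.compile(r"^\S+$")),
    ]

    def characterize(value):
        """Return the first pattern name the value matches, else 'other'."""
        for name, pattern in PATTERNS:
            if pattern.match(value.strip()):
                return name
        return "other"

    def discover_column_pattern(values):
        """Report the dominant characterization and its share of the column."""
        counts = Counter(characterize(v) for v in values)
        dominant, count = counts.most_common(1)[0]
        return dominant, count / len(values)

    # A column documented merely as CHAR(10) that actually holds dates.
    sample = ["2023-01-15", "2023-02-01", "2023-02-28", "N/A"]
    print(discover_column_pattern(sample))  # ('date', 0.75)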

Discovery can be applied to a number of variables that are common across columns. For example, it is easy to execute a range computation for all numeric columns and then compare the results against the expected ranges.

Using this same example, discovery can be more powerful than assertion testing because it exposes the true range rather than only exceptions to the expected range. If discovery returned a true range of 100 to 200, all well within the expected range of 0 to 10,000, the huge discrepancy would indicate something wrong with the expected range, or quite possibly that the column is used for a different business purpose than documented.
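A minimal sketch of this range comparison, with illustrative values and bounds:

    # Discover the actual range of a numeric column and compare it to the
    # expected range. The column values and bounds here are made up.
    def discover_range(values, expected_min, expected_max):
        actual_min, actual_max = min(values), max(values)
        print(f"expected {expected_min}..{expected_max}, "
              f"actual {actual_min}..{actual_max}")
        # Every value is valid, yet a far narrower actual range suggests the
        # expectation is wrong or the column serves a different purpose.
        if (actual_max - actual_min) < 0.1 * (expected_max - expected_min):
            print("warning: actual range is far narrower than expected")

    discover_range([117, 150, 183, 142, 200, 100], 0, 10_000)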

This same effect can be true of discrete value lists, text string lengths, and other factors where the actual list of values, although all valid, may indicate a problem in the definition of the data. A data quality problem may be hiding within the valid set that is exposed through discovery but not through verification. Simple examples are columns with all values being the same (essentially a NOT USED column) or columns with a very high incidence of one value that cannot be true in the real world, which indicates a problem in data capture. For these reasons, it is always better to discover facts about the data rather than to merely construct assertion tests.
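The following sketch shows how such checks might be automated. The 95% dominance threshold and the sample columns are assumptions chosen for illustration:

    from collections import Counter

    # Flag columns whose values, though individually valid, hint at a
    # problem: a single constant value (effectively NOT USED) or one value
    # dominating far beyond what the real world would produce.
    def flag_suspect_columns(table, dominance_threshold=0.95):
        flags = {}
        for column, values in table.items():
            top_value, top_count = Counter(values).most_common(1)[0]
            share = top_count / len(values)
            if share == 1.0:
                flags[column] = f"constant value {top_value!r} (NOT USED?)"
            elif share >= dominance_threshold:
                flags[column] = f"{share:.0%} of rows are {top_value!r}"
        return flags

    table = {
        "status": ["A"] * 100,                # constant: effectively unused
        "gender": ["M"] * 98 + ["F"] * 2,     # implausibly skewed capture
        "region": ["N", "S", "E", "W"] * 25,  # healthy spread
    }
    print(flag_suspect_columns(table))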

Discovery is also possible for complex tests, such as finding functional dependencies. This is very important for determining the true structure of the data if you intend to transform that structure and match it to another. If you only test the dependencies you think are true, you will miss others.
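A brute-force two-column dependency search can be sketched as follows. Real profilers prune the search space aggressively and also test multi-column determinants; this is only a conceptual illustration with made-up rows:

    from itertools import permutations

    # Column a functionally determines column b if no value of a maps to
    # two different values of b anywhere in the data.
    def find_dependencies(rows, columns):
        found = []
        for a, b in permutations(columns, 2):
            mapping = {}
            holds = True
            for row in rows:
                seen = mapping.setdefault(row[a], row[b])
                if seen != row[b]:
                    holds = False
                    break
            if holds:
                found.append((a, b))
        return found

    rows = [
        {"zip": "60601", "city": "Chicago",  "qty": 3},
        {"zip": "60601", "city": "Chicago",  "qty": 7},
        {"zip": "10001", "city": "New York", "qty": 3},
    ]
    print(find_dependencies(rows, ["zip", "city", "qty"]))
    # [('zip', 'city'), ('city', 'zip')]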

Assertion Testing

Assertion testing refers to processing that takes a known or suspected fact about the data and checks to see if the data conforms and, if not, reveals the instances where it does not. A good data profiler performs both types of investigation. Generally, discovery precedes assertion testing because discovery can uncover new rules.

An example of assertion testing is where you believe a column has a range of 0 to 10,000. You construct a query to test this assertion. The query returns all values that are not in this range.
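In code, such an assertion test is a one-line filter that returns the violations (a SQL query whose WHERE clause negates the rule would do the same job); the sample values here are illustrative:

    # Return every value that violates the asserted 0..10,000 range.
    def range_violations(values, low=0, high=10_000):
        return [v for v in values if not (low <= v <= high)]

    print(range_violations([500, -3, 9_999, 12_000]))  # [-3, 12000]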

Only rules involving single-column content or structural components that hold true over all of the data are reasonable candidates for discovery. Other data rules, whether simple or complex, have to be formulated and tested by crafting a query that exposes the violations. There is endless potential for such rules.

Data rules can come from existing documentation, application program logic, or from work sessions with business analysts. Some of these data rules will be conjecture until they are verified by data and business analysts.

All data rules need to be tested against the data to determine conformity. A lack of conformity needs to be investigated to determine whether the rule should be discarded, whether the rule should be modified, or whether the data is inaccurate. This is purely a semantic decision that should be made through collaboration between the data profiling analyst and the business analyst.

Again, rules can be written too broadly and hide problems that could have been exposed if the rule were more restrictive (data hiding inside the rule). For example, if a series of dates needs to be in order (BACKORDERED_DATE <= SHIPPED_DATE <= RECEIVED_DATE), it may be that for a significant number of orders all three dates are the same. All orders would pass the rule. However, it may be impossible or unreasonable for this condition to be true all of the time, or even a high percentage of the time. It may be necessary to create a second rule, NOT(BACKORDERED_DATE = SHIPPED_DATE = RECEIVED_DATE). This says that the dates cannot all be the same and isolates the rows where they are.

This second data rule may yield instances of accurate data (false positives) because sometimes having two of the dates the same is acceptable. However, the output can be reviewed to see if real inaccuracies are also reported from the data rule.
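Both rules can be sketched as simple predicates. The field names and sample orders below are hypothetical:

    from datetime import date

    # Rule 1: the broad ordering assertion.
    def dates_in_order(row):
        return row["backordered"] <= row["shipped"] <= row["received"]

    # Rule 2: isolate rows hiding inside rule 1 where all three dates
    # are identical (individually valid, but suspicious in bulk).
    def all_dates_equal(row):
        return row["backordered"] == row["shipped"] == row["received"]

    orders = [
        {"backordered": date(2023, 1, 1), "shipped": date(2023, 1, 5),
         "received": date(2023, 1, 9)},
        {"backordered": date(2023, 2, 1), "shipped": date(2023, 2, 1),
         "received": date(2023, 2, 1)},
    ]
    print(sum(dates_in_order(o) for o in orders))   # 2 -- both pass rule 1
    print(sum(all_dates_equal(o) for o in orders))  # 1 -- rule 2 flags one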

Visual Inspection

An additional form of analysis is visual inspection. Many of the clues to data inaccuracies are not easily formulated as boundaries, limits, or rules. Visual inspection of the data can lead an analyst to the conclusion that something is wrong with the data (possibly very wrong). Typical value computations that may be used are the following (the first two are sketched in code after the list):

  • frequency distribution of values over a column

  • sums or totals of values over a column, for an entire relation or for groups within a relation

  • comparing counts or totals from one data source to another

  • values with counts less than some interesting threshold

  • text strings
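As an illustration, the first two computations might be sketched as follows, using made-up column names:

    from collections import Counter

    rows = [
        {"region": "N", "amount": 120.0},
        {"region": "N", "amount": 80.0},
        {"region": "S", "amount": 80.0},
        {"region": "S", "amount": 80.0},
        {"region": "S", "amount": 80.0},
    ]

    # Frequency distribution of values over a column.
    print(Counter(r["region"] for r in rows))  # Counter({'S': 3, 'N': 2})

    # Totals of a value column for groups within the relation.
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
    print(totals)  # {'N': 200.0, 'S': 240.0}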

Text strings present an interesting example. Consider a free-form text column used to capture the name of a company. The properties description says that the column contains characters and can be from 1 to 30 characters in length. The description may also disallow leading blanks. However, embedded blanks would have to be allowed. It may be further refined to disallow other special characters, such as the following: #$%^&*(). This provides a reasonable filter against invalid entries.
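Such a filter might be written as a single regular expression. The exact disallowed character set below is an assumption for illustration:

    import re

    # 1 to 30 characters, no leading blank, embedded blanks allowed,
    # the characters #$%^&*() disallowed anywhere.
    NAME_FILTER = re.compile(r"^[^\s#$%^&*()][^#$%^&*()]{0,29}$")

    def passes_filter(name):
        return bool(NAME_FILTER.match(name))

    print(passes_filter("Acme Tools"))   # True
    print(passes_filter(" Acme Tools"))  # False -- leading blank
    print(passes_filter("Acme*Tools"))   # False -- disallowed character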

However, this filter will still allow many obviously invalid names. You cannot add a test that requires more than two characters, because it would disallow the company 12. Note that this name also includes a numeric character, so numeric characters cannot be disallowed either.

By visual inspection, you would be able to spot invalid names such as Don't know, Not provided, A company, 1235 main street, or Chicago, IL. These would all pass the properties verification test but not a visual inspection.

You cannot visually inspect all names in a five-million-record customer database. However, you can programmatically look for names that suggest errors, thus significantly lowering the scope of names to inspect. For example, your screen could look for names beginning with lowercase characters, containing numbers, or containing other special characters that might, but rarely do, appear in names. You can also look for names that appear more than once in the database, because entry operator conventions such as Don't know tend to appear multiple times.
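A sketch of such a screen follows; the heuristics and the sample names are illustrative only:

    import re
    from collections import Counter

    # Surface only suspicious names rather than inspecting all of them.
    def screen_names(names):
        counts = Counter(names)
        suspects = set()
        for name in names:
            if name[:1].islower():                         # starts lowercase
                suspects.add(name)
            elif re.search(r"\d", name):                   # contains a digit
                suspects.add(name)
            elif re.search(r"[^A-Za-z0-9&'.,\- ]", name):  # unusual character
                suspects.add(name)
            elif counts[name] > 1:                         # repeated entry convention
                suspects.add(name)
        return sorted(suspects)

    names = ["Acme Tools", "don't know", "Don't know", "Don't know",
             "1235 main street", "Baxter Inc"]
    print(screen_names(names))
    # ['1235 main street', "Don't know", "don't know"]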

Metadata Verification

The data profiling analyst needs to constantly review the metadata against testing results. Much of the metadata confirmation and development is a reflection of the business meaning of the data. Much of the interpretation of what constitutes inaccurate data is also semantic. The data profiling process will uncover a number of surprises from the data. Each surprise needs to be investigated to determine if the interpretation of the data given in the metadata is correct and the data wrong, or whether the data is right and the metadata is wrong. Input from others is generally needed to make these determinations.

Work groups consisting of the data profiling analyst, business analysts, and other subject matter experts should meet periodically to review the data as it evolves. These reviews can be done in a meeting or over the Internet. The important point is to use the computed data and compare it to metadata rules that are being proposed. Because the data is being used for the computations, there should be no disagreement about what is present, only discussion of what it means.

There is specific value in ensuring the completeness of each step before continuing to the next. This will become more evident as the process steps are reviewed. It is therefore not a good idea to wait until the data profiling analyst has completed all of the steps before involving the other team members.

Iterations and Backtracking

Although the methodology calls for logically progressing through the steps and trying to complete and verify the results in each step before going to the next, it is sometimes useful to backtrack. For example, structural analysis may reveal that the content of a specific column is not what you thought it was. This may require that you go back to value analysis and revisit that column before retrying structural analysis.

Software Support

Data profiling analytical techniques require special software to be effective. As you can see from this section, effective profiling depends on the ability to execute a wide variety of functions against large volumes of data. The functions are too pervasive to support tuning through typical database administration methods; you would need an excessive number of indexes on the data if you tuned for each process and rule.

Although you can perform some of the functions through standard query and reporting tools, you cannot perform all of them, and some of the most productive functions cannot be done through standard tools at all. You need to look for or develop specialized software to support strong data profiling activities.