9.3 The Process for Structure Analysis

The process for analyzing structure is the same as that for examining column properties. This is shown in Figure 9.6. The source of information is different and the analysis tools used are very different, but the overall process of collecting available metadata, looking at the data to fix inaccurate or missing metadata, and using the accurate definition of the metadata to expose inaccuracies is the same.

Figure 9.6: Structure analysis process.

One of the key differences is that when inaccuracies are flushed out, the specific values that are inaccurate are not determined. This is true because all structure rules relate either multiple values in a column or values in multiple columns. This means that any rule violation only shows you the narrowed set of data within which the inaccuracy exists, not which of the values in the set are inaccurate.
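To make this concrete, consider a minimal sketch in Python with pandas, using a hypothetical rule that ZIP determines CITY. The column names and sample values are illustrative assumptions; the point is that the violated group is as narrow as the rule can take you.

```python
# Sketch: checking a hypothetical functional dependency ZIP -> CITY.
# A violation isolates the conflicting group of rows, not the wrong value.
import pandas as pd

df = pd.DataFrame({
    "ZIP":  ["10001", "10001", "60601", "60601"],
    "CITY": ["New York", "New York", "Chicago", "Chicgo"],
})

# Group by the determinant; any group with more than one distinct
# dependent value violates the rule.
city_counts = df.groupby("ZIP")["CITY"].nunique()
for zip_code in city_counts[city_counts > 1].index:
    print(f"ZIP {zip_code} maps to multiple cities:")
    print(df[df["ZIP"] == zip_code])
    # One of these rows is inaccurate, but the rule alone cannot say which.
```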

Gathering Information

Structure information can be acquired in a variety of ways: from existing documentation, from metadata repositories, from data models such as entity-relationship diagrams, from database definitions, or by commonsense examination of data columns. The information from these sources can be, and often is, inaccurate and incomplete. Some of it can be believed; the rest can only be considered proposed structure.

Metadata Repositories

Metadata repositories will tell you some things about structure. They usually identify primary keys and foreign keys, and they define the basic affinities of columns to the table level. A good source of information will not only itemize tables, segments, or records (and the columns that fall within them) but will also include information on the business meaning of the business objects themselves. Even so, this information is generally lacking in detail.

Data Models

Formal data models may exist that describe the data structures in great detail. Modeling tools became popular in the mid- to late 1980s, and many organizations started using them to develop data structure designs. If those models are still around, they can describe the entities that should end up in profiling tables and the relationships that connect entities. These will identify the primary key of each table, foreign keys, and rules surrounding the connections.

Many of these models will be out of date. A common practice was to use them to get started; however, as the designs were tuned for implementation and performance, the changes were never put back into the models. This is particularly true where denormalization for performance reasons was a common practice. It is also likely that these models do not address natural keys, derived columns, or synonyms.

Information should be taken from them to use as a starting point for the data profiling process. The process will validate the information or drive modifications to show how the data really behaves.

Database Definitions

Most applications have information available in the form of program source files or database definitions that describe structure. For simple file systems such as VSAM, the pervasive COBOL copybook provides insights into record layouts, key definitions, and REDEFINES and OCCURS structures. For IMS, the PSB control block provides valuable information. For relational databases, the system catalog or directory yields much information. You should mine these sources and construct as much information as you can from them as the starting point for structure analysis.
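As one illustration, the sketch below pulls declared keys from the standard information_schema views that most relational databases expose. The view and column names vary by vendor, and the connection details are assumptions to adapt:

```python
# Sketch: harvesting declared keys from information_schema as starting
# metadata for structure analysis. Assumes a DB-API connection to a
# database exposing these standard views; adapt for your vendor.
def declared_keys(conn):
    cur = conn.cursor()
    cur.execute("""
        SELECT tc.table_name, kcu.column_name, tc.constraint_type
        FROM information_schema.table_constraints tc
        JOIN information_schema.key_column_usage kcu
          ON tc.constraint_name = kcu.constraint_name
        WHERE tc.constraint_type IN ('PRIMARY KEY', 'FOREIGN KEY')
        ORDER BY tc.table_name, tc.constraint_type
    """)
    return cur.fetchall()
```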

Many of these definitions can be believed because the database system enforces them for you. For example, a primary key defined to DB2 will certainly be a key, and the key values will be unique. The same is true of primary key/foreign key relationships. However, other functional dependencies or foreign keys may still exist in the data that are not defined to DB2.

It may be helpful at this point to compare the information gathered from external sources (such as metadata repositories and data models) to the database implementation. This is where some of the differences show up. If a data model shows a primary key/foreign key relationship that is not defined to the database system, it may be because the implementers did not want the processing overhead of enforcement. If that is true, there are probably several violations of the rule due to inaccurate data.
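A rule like this is easy to test directly. The sketch below counts orphaned child rows for a hypothetical unenforced relationship between an ORDERS table and an INVENTORY table joined on PRODUCT_ID (the table and column names are assumptions for illustration):

```python
# Sketch: counting violations of an unenforced foreign key.
# Hypothetical child table ORDERS referencing parent table INVENTORY.
def orphan_count(conn):
    cur = conn.cursor()
    cur.execute("""
        SELECT COUNT(*)
        FROM orders o
        LEFT JOIN inventory i ON o.product_id = i.product_id
        WHERE i.product_id IS NULL      -- child rows with no parent row
          AND o.product_id IS NOT NULL  -- ignore rows with no key at all
    """)
    return cur.fetchone()[0]
```

A nonzero count confirms the suspicion that skipping enforcement let inaccurate data in.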

Another example is discovering that a table was split in implementation although defined as a single table in the metadata. When this occurs and the database system does not relate the primary keys of the two tables as a primary key/foreign key pair, there will probably be instances in which there is a record in one table with no corresponding record in the other table. This again points to inaccuracies in the database.

Commonsense Speculation

Many of the structure elements are intuitively obvious to a student of database architecture. If there is a serious lack of information on structure, the analyst can speculate about structure components by merely looking at column definitions.

For example, a column called PRODUCT_IDENTIFIER in an INVENTORY table is probably a primary key. The same column in an ORDER table is almost certainly a foreign key.
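This kind of speculation can even be mechanized crudely. The sketch below flags key-like column names; the suffix list is an illustrative assumption, and every guess still has to be confirmed or refuted against the data:

```python
# Sketch: commonsense speculation from column names alone.
# Purely heuristic; the suffix list is an illustrative assumption.
KEY_SUFFIXES = ("_IDENTIFIER", "_ID", "_NUMBER", "_CODE")

def candidate_key_columns(column_names):
    """Flag columns whose names suggest they identify something."""
    return [c for c in column_names if c.upper().endswith(KEY_SUFFIXES)]

# PRODUCT_IDENTIFIER surfaces in both tables; whether it is a primary
# key (INVENTORY) or a foreign key (ORDER) depends on what business
# object each table represents, which the name alone cannot tell you.
print(candidate_key_columns(["PRODUCT_IDENTIFIER", "QUANTITY_ON_HAND"]))
print(candidate_key_columns(["ORDER_NUMBER", "PRODUCT_IDENTIFIER"]))
```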

The documentation you have usually calls out only the single primary key of a table and only those foreign keys that are enforced. Documentation rarely covers such things as derived columns, embedded denormalization, or unenforced foreign keys.

It is helpful to speculate on all structural possibilities so that the data profiling process can confirm them or demonstrate their nonconformance in the data. The analyst should develop a data model that describes how they think the data is structured. This will be very helpful in the data profiling process.

Discovery from Data

The process shown in Figure 9.6 includes a step for discovering structural information from the data. This sounds good but is particularly difficult without special software; you are not likely to write a program that digs out all of the functional dependencies in the data yourself.

There are two approaches to accomplishing this step. The first is to use special software that will discover all functional dependencies in the data and find all column pairs that appear to be synonyms. The second is to test the structural rules you believe to be true against the data and measure the degree of conformance.

Discovery of functional dependencies and synonym column pairs is clearly preferable because you can discover things you would not think to test. It is more efficient to say "These are the dependencies that are true in the data, and these are the ones the available metadata indicates should be true" and then analyze the difference.
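For a sense of what discovery involves, here is a deliberately naive sketch that brute-forces single-column dependencies with pandas. Commercial discovery software prunes the search space and handles multi-column determinants; this version exists only to show the shape of the problem:

```python
# Sketch: naive discovery of single-column functional dependencies.
# Real discovery tools prune the search space and consider multi-column
# determinants; this brute-force version checks column pairs only.
import pandas as pd
from itertools import permutations

def discover_fds(df: pd.DataFrame):
    fds = []
    for lhs, rhs in permutations(df.columns, 2):
        # lhs -> rhs holds if every lhs value maps to one rhs value.
        if (df.groupby(lhs)[rhs].nunique(dropna=False) <= 1).all():
            fds.append((lhs, rhs))
    return fds
```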

However, without special software, the best you can do is to test against the data. This requires that you do a thorough job of collecting and speculating on what may be true. The problem with this approach is that you leave open the possibility of missing a structure relationship that exists in the data but that you are not aware of. The missing relationships are often those that cause problems when mapping data to other structures, loading data to target systems, and reporting on data.

It may be necessary to employ the testing method against the data even when structure discovery software is used. This is because a structural component may not show up in the results of discovery if a sufficient amount of inaccurate data exists that masks the condition. This is not uncommon when examining data that does not have the benefit of database management system enforcement. This again supports the need to develop a complete list of dependencies you think may be true before you begin analysis.
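One hedge against this masking is to score a hypothesized dependency by its degree of conformance rather than demanding an exact match. A possible sketch, assuming the most common dependent value within each group is the correct one:

```python
# Sketch: degree of conformance for a hypothesized dependency lhs -> rhs.
# Assumes the most common rhs value within each lhs group is correct;
# every other row in the group counts as a violation.
import pandas as pd

def fd_conformance(df: pd.DataFrame, lhs: str, rhs: str) -> float:
    """Fraction of rows consistent with lhs -> rhs (1.0 = exact)."""
    agreeing = (
        df.groupby(lhs)[rhs]
          .agg(lambda s: s.value_counts(dropna=False).iloc[0])
          .sum()
    )
    return agreeing / len(df)
```

A score of, say, 0.98 suggests a true rule obscured by roughly 2 percent inaccurate data, whereas exact discovery would have rejected the dependency outright.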

Verification of Results

Mapping the results of discovery to the gathered metadata can be started by the data profiling analyst. A straightforward physical analysis showing which keys are supported in the data and which column pairs have strong value overlap is easy to do.
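The overlap half of that analysis can be as simple as comparing distinct value sets. A minimal sketch, assuming the two columns arrive as pandas Series:

```python
# Sketch: value overlap between two columns that may be synonyms.
# High overlap supports a possible primary key/foreign key pairing;
# low overlap argues against it.
import pandas as pd

def value_overlap(a: pd.Series, b: pd.Series) -> float:
    left, right = set(a.dropna()), set(b.dropna())
    if not left or not right:
        return 0.0
    return len(left & right) / len(left | right)  # Jaccard similarity
```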

However, verification needs to be done with business analysts and subject matter experts. Structural issues always boil down to the basic business definition of data structures. Tables need to correlate to real-world objects, and relationships between tables need to make clear business sense. Most business analysts are not schooled in data architecture concepts and terminology and therefore will have trouble following the discussions unless the data profiling analyst translates structure into terms that make sense to the audience. This may require building diagrams that show the functional structure groupings and the relationships between them.

Too often the data profiling analyst believes that structure issues do not require semantic interpretation and leaves the business experts out of the process. This is a big mistake and can lead to serious problems later on.