Chapter 8: Column Property Analysis

Analysis of column properties is the process of looking at individual, atomic values and determining whether they are valid or invalid. To do this, you need a definition of what is valid. This is in the metadata. It consists of a set of definitional rules to which the values need to conform.

In the industry, these types of rules are generally not referred to as rules. They are commonly referred to as domain definitions, column specifications, column descriptions, or column properties. However, they can all be considered rules because they provide a definition of what can be put into a column. For example, a column specification that the length of a column is 10 characters is basically a rule that says the maximum length is 10. All of the specifications for the content of an individual column are rules that restrict the values that can be considered valid. They are generally not expressed in a rule language or rule syntax.

8.1 Definitions

To understand this chapter you need to know how certain terms are defined and subsequently used. These terms are column, domain, properties, null, and field overloading. These terms are notorious for having different meanings, depending on the author or product using them. Defining how they are used in this text will reduce the risk of confusion. Figure 8.1 shows how these terms relate to one another.

Figure 8.1: Definitional elements.

Columns

The most basic element of a database is an individual column. In some systems this is called a field or an attribute. It refers to a place to hold a single fact about a single business object that is stored in the database.

Columns are the elements of a database that generally have the most definition. This is because the database or file systems require a layout of the records, rows, or segments. This layout describes each data column as it appears in the database. This is essential for the database system to work properly. This definition usually consists of a NAME entry followed by a physical description of the requirements for storing the data. The following is an example.

EMPLOYEE_NAME	CHARACTER 35
BIRTH_DATE	DATE
UNIT_PRICE	DECIMAL (5,2)

This basic definition is required for the database or file system to find the data when processing a query statement. It also provides the minimal definition so that the database system can properly store the data. The database system will generally allow any value in a column that fits the shape and size provided for it.

These descriptions are also needed for application programmers to know how to treat each column. This is often the only definition that exists for a column. The COBOL copybook file, the PL/1 INCLUDE file, and the relational database CREATE TABLE specifications are very often all you will find for columns (or for anything else, for that matter).

There may exist other documentation for columns. This would typically be in a data dictionary or metadata repository. These descriptions would provide for a more restricted view of what can be stored in a column plus text descriptions of the business meaning and intended use of the data. It may provide a different, more expressive NAME entry for the column. This would be done in cases in which the database systems or the programming language used to access data has restrictions on the length of a column name, thus making it impossible to make names expressive at the physical level.

Relational database systems generally provide for a short text and/or a long text to allow documentation of the meaning of a column. The short text is frequently used to provide a longer descriptive name for a column. These can be extracted from the DBMS catalog.

Looking at how a column is described in leading text on data entry forms can also be helpful in establishing the meaning beyond what is implied by the column name. To fully understand a column you generally need more information and rules than what is available. The available metadata is generally incomplete. The metadata you have can be inaccurate, even if it comes from a database or programming source code file. In fact, it is often inaccurate. You need to be skeptical about its accuracy.

Domains

A domain is a named object that defines the acceptable values for encoding one specific business fact. Domains are not columns. They are standards. They are standards that should be applied to the definition of business data objects. A column is a specific instance of a business fact being defined within a business object. The domain is a generic definition not attached to a physical instance of its use. A column is the physical instance of its use.

The concept of domains is to define the basic rules for encoding common business elements independently of the applications that will use them. If application developers are responsible enough to use them, they will significantly enhance the usability, understanding, and maintainability of the applications across the corporation. They will eliminate the enormously wasteful practice of each individual programmer or data designer inventing multiple ways of expressing the same business fact across the corporation.

A well-disciplined corporation will have a metadata repository that has domains defined and easily available to all developers. It will also have an application development quality assurance function to ensure that they are used and used properly. The metadata repository will link the domains to the columns that use them to enhance the ability to interpret application data and to easily map the impacts of proposed changes.

This sounds so good. And yet, few corporations actually do this. Examples are explored in the sections that follow.

Date Domain

A date domain may be defined as a date between the day 01/01/1900 and the day 12/31/2099 inclusive, of the format mm/dd/yyyy, where mm is the month, dd is the day of the month, and yyyy is the year. This is a typical date domain definition for the United States.

A column that contains dates such as ORDER_DATE may use the date domain with additional restrictions on the range. An example is 01/01/1996 through CURRENT_DATE.

An enterprise may define several date domains: USA format dates, European format dates, year of the emperor dates, and so on. If the enterprise deals with very old dates, they may need a date domain that goes back thousands of years and has the BC/AD designation as well. This may be required for dating art, for example, or for encoding historical or geological events.

The corporation ought to have the discipline to define date domains as needed and then construct column definitions containing them from the standard domains, and to impose additional restrictions at the column level. RDBMS systems all internalize dates for columns specifying the DATE data type, making the external format immaterial. Using this capability greatly eliminates many potential errors in handling dates.

In 1987, when IBM added date support to DB2, I decided to provide my customers with conversion routines to automatically convert dates contained in CHARACTER or INTEGER columns to the new DATE data type. To do this, I surveyed a limited customer base of about 10 DB2 users, asking them to provide me with the formats they were using in DB2 columns for expressing dates.

They returned 94 different formats. One customer alone contributed over 40 formats. Some were CHARACTER formats, and some were INTEGER formats. I could not have imagined that people could think up that many different ways of expressing the same fact. I would have expected information system organizations to be more disciplined about data representation. I was wrong.

In addition, we all know about the debacle of Y2K, in which just finding all the columns that had dates in them and then building conversion routines for the dozens of formats used cost the IT departments around the world hundreds of billions of dollars. If database systems had better date support earlier and/or if customers had better discipline earlier, the costs would have been billions of dollars less. A little discipline, please!

Zip Code Domain

This one is also instructive. The USA postal code calls for a Zip code on addresses in the format of nnnnn-nnnn, where n is a numeric character between 0 and 9, inclusive. However, the enterprise may decide that the shorter version is also acceptable, in that most people do not remember or provide the four digit extension. Thus, the standard for the enterprise may be that Zip codes are in one of two formats: nnnnn or nnnnn-nnnn. They may choose to represent the short version as nnnnn-0000 instead of nnnnn. They should not allow both short versions to be used.

However, if you want to include Canadian addresses, the previously described standard does not work. They use six characters, with the first three separated from the last three by a blank in the format cnc ncn, where n is a number between 0 and 9 and c is an alphabetic character from A to Z. They do not have an extension.

You can go beyond the United States and Canada because there are postal codes in most countries. However, they all have their own unique definitions.

You have two choices when defining domains for international addresses. You can define a domain strictly for each country, or you can generate a least common denominator domain definition that satisfies all of them. This might be that the Zip code is ALPHA/NUMERIC of variable length to a maximum of 10 characters. However, the least common denominator approach is not very useful in checking data for accuracy; just about anything will pass the test.

Unit Of Measure Domain

In this example, the unit of measure of various products produced or sold by the corporation is defined. In this case, the domain is an itemized list of values with their definitions. For example, you might use LB for pound, MT for metric ton, CT for carton, and EA for each.

The purpose of this domain is to create the proper definition for each of these so that individual data entry people do not make up their own versions and create a representation inconsistency problem. In fact, it is easy to build a pull-down list with acceptable values and their meanings.

The domain should encompass all units of measure that are needed for the corporation. This could also include units of measure that are not currently being used in anticipation of product line expansion in the future.

A few notes about domains. Domains should follow external standards wherever possible. There are countless standards groups in the world, many of which are restricted to one subject or one industry. These are very useful in finding standard domains that apply to your corporation.

An entire study of the topic of domain gets into micro-issues and macro-issues. A micro-issue is defining just what is meant by ALPHA, ALPHA-NUMERIC, and so on. It takes into consideration different alphabets and country code set issues. For example, the term alphacharacter means something different to someone in Russia than it does to someone in the United States.

Macro-issues include building hierarchies of named domains. In the Zip code example, the corporation may define a Zip code domain for each country. They could then combine country domains into superdomains for use by columns. For example, they might define a domain as USA-CANADA ZIP CODE, which inherits the domains of USA and CANADA. Another superdomain might be INTERNATIONAL ZIP CODE, which inherits the definitions from all Zip code domains. In this hierarchy, each subdomain may inherit the domains of individual character sets.

In this last example, the Zip codes are truly defined at a low enough level to identify invalid Zip codes from any country. However, it may require an excessive amount of computer processing cycles to do the validation.

Domains make all the sense in the world. It is so logical to take each business fact that you want to record information on and define precisely what constitutes correctness. But do not expect many to have done this. In practice, you will find many columns built from ad hoc domain definitions that have no similarity to other columns containing the same fact.

Property Lists

The term properties refers to rules for values that are permitted within a single column. Each column should have a base definition that includes one or more domain definitions.

The column property list is more than just the domain's restrictions. It also includes other factors. Figure 8.2 shows the basic list of some properties that are possible in a column. Some of these are domain properties; others are not. For example, the data type, length restrictions, range of acceptable values, and list of acceptable values would normally be found in the domain definition. Rules such as unique rule, consecutive rule, and null rule are not domain rules but additional rules imposed on a specific use of the domain.

Figure 8.2: Typical properties.

A unique rule says that each value in the column must be different from all other values in the column. A consecutive rule says that the column values must be unique but also that no values between the highest and lowest can be missing (for example, check numbers in a check-issued table). The consecutive rule generally applies to an integer column but may also apply to a single-byte alphabetic column.

A column can be defined in terms of a domain plus additional factors that apply only as the domain is used in that column, or it can fully define itself without a domain definition. For example, an ORDER_NUMBER domain might say that it is a five-digit number between 10001 and 99999. In the ORDER_HEADER table, there is a column named ORDER_NUMBER. Its properties are that it contains the ORDER_NUMBER domain plus a requirement for UNIQUE, NOT NULL, and CONSECUTIVE. However, the ORDER_DETAIL table also has a column called ORDER_NUMBER. In this table it again uses the ORDER_NUMBER domain, but the unique rule is NOT UNIQUE because you can have many details per order, the null rule is again NOT NULL, and the consecutive rule is NOT CONSECUTIVE because it permits multiple occurrences of the same number. There is another column called LAST_ORDER_NUMBER in the inventory table, which shows the last order received for that item of inventory. In this case, the business meaning is that it is only the most recent order received for this item. It again uses the ORDER_NUMBER domain. However, the properties are NOT UNIQUE, NULL permitted by using a blank because some items may have had no orders yet, and NOT CONSECUTIVE. This is shown in Figure 8.3.

Figure 8.3: Example of domain versus property definitions.

Column Names

One of the most important properties of a column is the column name. Having a name that indicates the nature of the content is extremely valuable in performing data profiling as well as in constructing other processes for using the data. Most analysts and application developers tend to overdepend on the name to indicate content. A name such as QUANTITY_AVAILABLE is more descriptive than a name such as AQ003BZ.

In some cases, column names are prefixed or postfixed with codes to indicate application or department. This reduces the number of characters available for meaningful semantics and makes the result less useful.

Columns sometimes have descriptive names and sometimes do not. As systems get larger and larger in the number of columns, the use of descriptive names becomes less likely. Many older application development tools, DBMSs, and file systems restrict the length of a name, making it very difficult to use descriptive names.

The analyst should always have a descriptive name associated with every column. This name may not be possible to use for application development but is very useful for metadata purposes. The analyst should generate a name if one is not available and should not worry about the length of the name.

The descriptive name should be as precise as possible. For example, a column named EMPLOYEE_NAME is much more descriptive than one named merely NAME. When you are working with hundreds or thousands of columns, having a more complete descriptive name is very helpful.

Null

The term null is one of the most controversial terms in database circles. There are many definitions and theories about how to handle the null condition, most of which the application development world completely ignores. When they ignore it in application design, the data entry people often invent their own conventions for dealing with the missing information, creating problems for data accuracy.

The null condition is the case where a value is not available for a column. A value may not be available for a number of reasons, such as

The originator of the data refuses to provide the value.
The entry person does not know the correct value.
The column is not applicable for that particular instance.
The value for the column is to be provided in a later step of the business process.

It is possible for applications to be designed such that each column could be entered as a value or a checklist for each of the reasons cited previously. The reason for no value could be recorded, and everyone would know how to interpret the value.

However, this never happens. What happens in practice is that the application stores a blank for character columns, stores zero for numeric columns, or uses the relational definition of NULL, which only allows one indicator for all conditions.

Other practices are found in legacy systems for which application designers decide to build a convention for NULL into a database or file system that does not support NULL indicators. In this case you see things such as?, Not provided, or 9999.99 put into the columns as values. The application programs "understand" that these mean no value provided.

Sometimes data entry people make up their own conventions. This occurs when the entry person does not know the value but the data entry procedure requires that something other than blank be provided.

Data profiling property lists need to record the conventions used for the NULL condition. If the conventions are bad design or can lead to quality problems, this should become a data quality issue. In the meantime, the data profiling process needs to be able to sort out nulls from real values.

Field Overloading

Field overloading is a term for a number of practices that are today considered bad data design but were commonly used in the past. It refers to the practice of using a single column for recording more than one business fact.

Field overloading is common in legacy systems. They were almost never a part of the original application design. However, as changes occurred in an application and the need for new columns arose, developers found it more convenient to piggyback columns than to create new columns. Creating new columns meant that record or segment lengths needed to be longer, which meant that databases needed to be unloaded and reloaded and that application programs not related to the change needed to be modified or at least recompiled. To avoid these expensive disruptions, they just overloaded fields that had unused bit capacity to handle the new information.

In addition to creating a difficult condition to detect and handle, the overloading was usually not documented. Because COBOL and PL/1 have no convenient way of specifying overloading, the programmers just did not document the fact at all. The splitting of the bits to separate columns was always done in the logic sections of the application programs (under the covers, so to speak). Following are some of the ways field overloading is done:

Let text columns contain multiple facts as words or keywords.
Steal unused bits of packed decimal column to use for binary information.
Use the sign bit of a numeric column that cannot be negative for binary information.
Use a column conditionally, meaning, for example, that if one column has a value of X, this column means this and otherwise that.

Column property analysis cannot be performed on overloaded columns without first splitting them into discrete, single-fact columns.