It is not the intention of this book to provide a thorough education on database structure rules. The definitions provided here are brief and intended only to clarify how terms are being used in this chapter. Definitions of the terms used in this section are needed for the same reason as the previous chapter: these terms tend to be used in different ways by different authors and database systems. Defining how they are used here may head off some confusion.
The rule discussions in this book assume that data is being represented in a table, row/column format. Many source systems do not represent data this way. However, to profile data it is helpful to extract the data into table format. This can almost always be done, although the effort may be considerable. The topic of extracting data is not covered in this book.
It is assumed that an extraction has been done and that all data is represented in table format. This is not to say it is properly normalized. It need only be in first-normal form. Normal forms are defined later in this chapter. The data profiling process should determine the degree to which it is normalized and demonstrate the structure needed for other normal forms.
A major structural element discussed in this chapter is the functional dependency. Functional dependencies are used to unlock the natural structure of business objects as defined by the data. A column in a table is considered a functional dependency of one or more other columns in the same table if it has only one value for any specific value set of the other columns.
Said another way, a functional dependency has a LEFT-HAND-SIDE (LHS) consisting of a set of columns that determines the value of a single column on the RIGHT-HAND-SIDE (RHS). A functional dependency is said to hold over a table if, for all rows where the values of the LHS columns match, the values in the RHS column are the same. This is shown in Figure 9.1.
The LHS of a functional dependency can consist of more than one column. In this case it takes the concatenated values of all columns to determine the LHS. When this is the case, it does not matter what the order is of the columns on the LHS.
Another property is that the columns on the LHS should be the minimum columns necessary to uniquely determine the value of the column on the RHS. Clearly, if A and B determine C, then A and B and D also determine C. However, the presence of D adds no additional meaning to the dependency.
Clearly, if EMPLOYEE_ID determined BIRTHDATE, then EMPLOYEE_ID and any other column would also determine BIRTHDATE. For example, EMPLOYEE_ID and HIREDATE also determine BIRTHDATE. HIREDATE should not be included in the dependency LHS definition because it is not necessary to uniquely define BIRTHDATE.
A number of functional dependencies can exist that share the same LHS but determine the value of different columns on the RHS. For ease of handling, these can be combined into a single functional dependency representation wherein the RHS includes all columns that have a functional dependency on the same LHS.
Functional dependencies are a metadata definition of structure in a table. Discovered functional dependencies are those that are true for the data in the table at the time discovery is performed. The metadata and discovered functional dependency sets may be different.
If the table has inaccurate data in it, a functional dependency set that is supposed to be true may be untrue for that collection of data. It is also possible for a functional dependency to exist in the data because it is a coincident. Semantically, it is allowed to be violated. However, none of the data populating the table at that time violates it.
Keys are a significant part of the structure definition of data. They identify rows within tables and they connect tables to other tables. Figure 9.2 is an example of keys that are described in this section.
Functional dependencies are the key to understanding the primary keys in a table, denormalized column sets embedded in a table, and derived columns within a table. If you knew all the functional dependencies, you would be able to determine all of these. In addition to keys defined through functional dependencies, there exists another key type, the foreign key, which is also important in structure analysis.
A primary key is a set of columns that uniquely defines each row of a table. A primary key has a functional dependency in which the LHS columns and the RHS columns include all of the columns in the table. The LHS is the primary key.
A table can have more than one consolidated functional dependency that satisfies the definition of a primary key. A fact often overlooked in data analysis is that many tables have two primary keys: a token key and a natural key.
The token key is usually columns that have been assigned system identifiers to uniquely define each row. The key may consist of multiple token columns. A natural key is a combination of values from the real world that uniquely identifies a row.
For example, in a personnel database the natural key would be some combination of the employee's name, home address, and birth data. You would allow two or more employees with the same name. They may even live at the same address (as, for example, in a father and son who share the same name but do not use Jr. and Sr. to distinguish themselves). However, once you put birth date into the natural key, you get uniqueness guaranteed. However, you would not want to use this primary key for processing: it is too long and contains personal information.
The SOCIAL_SECURITY_NUMBER is also a primary key candidate and is a single column. It is much better for system use. However, this is also personal information employees would not want to have used for identifying them on reports going around the corporation. Therefore, you generate a third primary key, the PERSON_ID, which is a unique number assigned to each employee when hired. This is a token key because it comes from the system and not from the real world.
This example shows three separate functional dependencies that satisfy the definition of a primary key for the same table. Only one would be specified as the primary key for data management purposes. However, as we will see later, the others have an analytical value in finding inaccurate data values.
A primary key may require all columns in the table to define each unique entry. For example, a table that describes an employee's college degrees may consist of columns for PERSON_ID, UNIVERSITY_NAME, DEGREE, and MAJOR. Because an employee may have multiple degrees from the same university, and may have multiple B.A. degrees from the same university, it takes all four to define each row uniquely. There is one functional dependency in the table that has no RHS. The term primary key is often referred to as the key to the table. That convention is used in this chapter.
When a consolidated functional dependency exists that does not include all columns in the table on both sides, it represents a denormalized key. It determines the values of some of the other columns but not all of them. All of the columns, including the LHS columns, may or may not also be part of another dependency that describes a true primary key candidate for the table.
In relational theory these should not exist. However, in practice they exist all the time.
An example is shown in Figure 9.2. The functional dependency where PRODUCT_ID determines the DESCRIPTION is a denormalized case. The DESCRIPTION could just as easily have been obtained by looking it up in the PRODUCT table. However, the database designer copies it to the ORDER_DETAIL table when orders are created in order to improve performance when processing queries against the order database.
Derived columns are those that exist as the RHS of a functional dependency and whose value can be determined from a rule on the values of the LHS. For example, if COLUMN_C = COLUMN_A + COLUMN_B, then COLUMN_C is a derived column and the functional dependency is COLUMN_A, COLUMN_B → COLUMN_C.
The rule may be a mathematical formula (such as UNIT_PRICE * QUANTITY = PRICE) or a business rule (such as that all customers in Chicago get a 10% discount and all others get only a 5% discount). The functional dependency that defines a derived column may include all attributes in the table (making it a key to the table) or subset (making it a denormalized key). What makes it a derived column is that you can compute the RHS by knowing the rule or formula and the values on the LHS. Derived columns are a special case of the other types of functional dependencies.
A foreign key is one or more columns in one table (the dependent table) that identifies a row in another table, the parent table. It is usually only one column but is sometimes more than one column. It is almost never the primary key of the dependent table. It is just another column in the row. For example, a DEPARTMENT table may have a column that identifies the department manager, such as DEPT_MANAGER_ID. Instead of having the manager's name and other information about the manager, it just includes a foreign key reference to the personnel database where this information is held. The foreign key DEPT_MANAGER_ID references the PERSON_ID column of the PERSONNEL table.
In this example, it may be that the PERSON_ID is also a primary key of the DEPARTMENT table. This would be true if every department had a manager and no two departments had the same manager. This would technically be a second key candidate for the table (in addition to DEPARTMENT_ID). However, if a person could be manager of two departments at the same time, it would not be a primary key.
The foreign key in a dependent table is a primary key in the parent table. Both tables may be the same table. For a single row, the foreign key will generally identify a different row of the same table. This must be true because it must identify a single row of the other table.
Foreign keys are used for multiple purposes. One is to connect the subtables of a single business object in the database. A functional dependency within the object may have multiple occurrences for each key to the primary object. The columns that do not have multiple occurrences are placed in a parent table with the key to the object. The columns within the functional dependency that can repeat are placed in a dependent table, where each row is one occurrence. The rows of the dependent table are connected to rows of the parent table through the foreign key column. For example, the ORDER_NUMBER is used to connect ORDER_HEADER information to ORDER_LINE_ITEM information.
The second use of foreign keys is to connect different business objects. An example is an ORDER_NUMBER used to connect to an INVOICE table. Orders and invoices are different business objects but have a relationship reflected in the ORDER_NUMBER foreign key.
Data has a natural structure that shows how columns relate to each other. Database theory defines several forms of data organization to guide a data designer to get to a proper structure. These are called normal forms. Theorists have defined six levels of normal forms. Each form is a more restrictive version of the previous form.
In practice, most data exists in first-, second-, or third-normal form. Although there is some theoretical interest in higher levels of normal forms, they add little value to operational handling of data. The goal of structure analysis is to determine the current form of data sources at the level of no normal-form first, second, or third level. Figure 9.3 shows an example of data in first-normal form and in third-normal form.
To be in first-normal form, a table must have the same columns for every row. Each column contains only a single value. The meaning of the values in the column are the same for all rows. A fact is represented in only one column. This means that you cannot have multiple columns for multiple occurrences of the same fact. Repeating groups are not allowed.
If the data in a table has REDEFINE definitions over one or more columns, if some columns are missing from some of the rows, or if columns are overloaded (as defined in Chapter 8), they are not in any normal form.
A table in second-normal form must satisfy the requirements of first-normal form. In addition, it must have a key. Each column in the table must depend only on all of the columns in the key, not a subset of the columns.
For example, in Figure 9.3, the initial parent table has a key of PRODUCT and PART. Each row identifies a single part used in the manufacturing of the product. However, the column DESCRIPTION is dependent on only the PART column and not the entire key. To get to second-normal form, you remove this situation by creating a second table of PART and DESCRIPTION. At this point the parent table is better off, because everything depends on the key. It still has a problem that keeps it from being in third-normal form.
Data is in third-normal form if it satisfies first- and second-normal form rules and no column is dependent on any nonkey column. In the example shown in Figure 9.3, the column DRAWING_CODE references where a part is used in a blueprint drawing of the product. The CODE_TYPE column shows what type of reference it is. Each drawing code only has one code type. This means that the CODE_TYPE needs to be broken out to get the parent table to third-normal form. This is shown in the figure as the three-table set. All three tables are in third-normal form.
A table is denormalized if it is not in third-normal form. This means that it contains one or more denormalized functional dependencies.
Denormalization is very common in database systems. There are a number of distinct cases of denormalization that can be used to describe the concept, as well as the motivations for having each of them.
The first case is the situation in which all data is included in a table. For example, a single table contains a subset of columns that includes one or more columns that are dependent on a denormalized key. The example shown in Figure 9.4 shows the situation in which ZIP_CODE determines CITY and STATE and is therefore denormalized in the table. It could be broken out into a separate table with ZIP_CODE as the key and CITY and STATE as the dependent columns. However, because these fields are always used together, they are merely embedded in the primary table and no such normalized table is created. This type of denormalization is common.
The second case shown in Figure 9.4 is the situation in which data is repeated in a table from another table. In this example, PRODUCT_DESCRIPTION is carried in the ORDER_LINE_ITEM table, along with the PRODUCT_ID. The description column is clearly redundant to the same information in the INVENTORY table. It is copied into the ORDER_ITEM table to improve performance on queries by not having to perform a JOIN.
The third case shows the situation in which a single table contains all fields of two tables. This is shown in Figure 9.3 for the first-normal form example. In this example, there are two denormalized functional dependency sets within the same table, causing a great deal of redundancy of data. The reason for doing this is again performance on query.
Denormalized tables always cause data to be repeated within the set of tables. This leaves open the possibility of inconsistencies arising between the various duplicates, due to inaccurate data entry or updating some but not all of the occurrences of the data. The possibility of inaccurate data rises, and the detection of it becomes more difficult.
A split table occurs when two tables have the same primary key but all other columns are different. In this case, the two tables could be combined into one. The database designer has divided the columns between the two tables for some purpose. It may be to improve performance by putting the highly referenced columns in one table and the less frequently referenced columns in the other. Another reason for splitting a table is to aid in managing the security of information by putting more sensitive information into one table and less sensitive information in another table. Some data architects consider this a violation of third-normal form because all columns that are dependent on the key are not in the same table.
Tables may be split into more than two tables. This practice can easily lead to additional denormalization by adding redundancy to avoid multiway JOINs in order to access all of the data needed by each client.
Two columns are synonyms of each other if they contain the same business fact. They may be using it for different reasons, but the content in each column of the synonym has the same business meaning. For example, a column called DEPARTMENT_NUMBER in the department's master database, a column called DEP_NO in the personnel database, and a column called DEPT_NUM in the project's database would all be synonyms of each other. They would constitute a synonym column set.
Having the same values does not qualify two columns as being synonyms for that reason alone. For example, a column of EMPLOYEE_HIRE_DATE would not be a synonym of the column ORDER_CREATED_DATE, even though they are both dates and both identify the date a business object was created. They are not synonyms because they represent the date of creation of different types of business objects. The values may be drawn from a common date domain. However, the fact that they come from a common physical domain does not make them synonyms. Both columns of a synonym pair can exist in the same table.
Synonym analysis is an important part of structure analysis because it identifies and validates foreign keys, identifies redundant data, and identifies representation inconsistencies between columns that record the same facts.
There are four distinct types of synonyms. This is shown in Figure 9.5. Each type has unique characteristics and, as will be shown later, has different analytical techniques for discovery and verification testing.
Primary key/foreign key pairs are synonyms. They are used to relate data between tables. This is a very common relationship in which one of the columns in the pair is a primary key of a table. It identifies the specific business object. The table containing the other column references the business object. In Figure 9.5, this is shown for the DEPARTMENT_ID column.
A single column that identifies an important business object, such as a department, can be used in numerous tables to provide a reference to department information. For example, the department identifier may appear in a PROJECT table, a PERSONNEL table, in financial tables for expenses and expense accounts, in BENEFIT tables, and many more. It is not uncommon for a single column in a master table to be the primary key component of hundreds of synonym pairs. In many corporations, the commonly used columns of DEPARTMENT_ID, PERSON_ID, and PRODUCT_ID typically connect hundreds of tables through individual synonym pairs.
Redundant data exists when a column is a synonym of another column in a different table. In one table, the column is dependent on the key. In the other table, the column is dependent on the corresponding foreign key. This is shown in Figure 9.5 for the PRODUCT_DESCRIPTION column. The two PRODUCT_DESCRIPTION columns are synonyms of each other. Neither is a key in its respective table, but both are dependent on the PRODUCT_ID columns in each table. The connection between individual values is indirectly made through the surrogate primary key/foreign key synonym. In this case, one of them could be eliminated without any loss of information. However, it would require adding a JOIN to query statements against the dependent table.
Redundant column synonyms always have a functional dependency for each column where the RHS is the synonym column and the LHS is part of a primary/foreign key synonym.
A third category of synonyms is where the pair has no structural relationship, unlike the previous two types. They are just the same business fact. Figure 9.5 shows this for the CITY columns. Each of these contains the same business fact: the name of a CITY. The occurrences have no relationship to each other.
It is important to distinguish domain commonality at the business level, not at the physical data type level. The previous example of date fields demonstrates a physical domain equivalence. Similarly, two columns containing integers such as QUANTITY_ORDERED and QUANTITY_ON_HAND share only a physical data type domain. These are of no interest whatever because they represent different business facts. They should not be considered synonyms. However, domains that identify specific business entities are valid domain synonyms. Examples of this are CITIES, STATES, UNIT_OF_MEASURE, and COLOR.
These synonyms have no structural value. They are worth identifying in order to study the representation of values in each column to determine if they are consistent. Identifying them is also useful in helping users of data connect data in queries. For example, if the CITY column is used in both the ORDER table and the SUPPLIER table, a user may want to see if there are any suppliers in the same city that have a large number of orders. To make that query work, the representation of city values must be consistent across the CITY column pairs.
Domain synonyms exist across the foreign key columns of multiple primary key/foreign key synonyms that share the same primary key column. For example, as shown in the first part of Figure 9.5, the two columns for DEPT_MGR and PROJECT_MEMBER are domain synonyms. They represent the same business fact (an employee) but have no structural meaning. These types of synonyms have no value because validation and representation consistency is accomplished through the associated primary key/foreign key synonyms.
Synonyms are also used to match columns across different data sources when trying to consolidate data. This is very common in projects designed to consolidate databases of corporations that have merged or to eliminate multiple departmental applications in favor of a single central application. In this case, every column in each source needs to be synonymed with columns in the other sources.