This section describes a variety of data rules as examples that can be formulated and tested. It classifies rules in order to demonstrate the areas where data rules are normally used. Once you gain a data rule orientation, you will acquire a knack for spotting potential data rules.
Data rule execution can be laborious and expensive. Rule engines may be used to do some of the work. The alternative is to formulate queries over the data for each data rule. This is greatly facilitated if you have put the data into a table form prior to data profiling. Data rule testing against primary data sources that are not in table form and that have not had messy problems resolved, such as column overloading, can be an enormously complex task.
The more data you use, the more machine resources will be burned. However, if you profile over too small a sample, you may not get the results you seek. It is best to use as much of the data as possible for data rule testing.
Data rules do not need to be executed singly. It is possible to combine several rules into a single executable by either using a rule-based engine that can do that for you or formulating complex query statements yourself. If you do this, you can reduce the cost and time of executing data rules, but you have to deal with the problem of separating output by rule (which rule caused which output). Sometimes this is obvious from looking at the data, and other times it is not so obvious.
Searching for data rules begins with a methodical examination of the business meaning of each column in a table. Some very obvious relations between columns surface when you do this. Some of them can be constructed solely on the basis of commonsense logic. Digging out deeper rules requires more extensive knowledge of the business practices of the corporation.
Columns with a DATE or TIME data type are a good place to start. Some business objects have many dates in them. These dates generally reflect a logical progression of activities regarding the object. For example, an employee has a BIRTH_DATE, HIRE_DATE, LAST_PROMOTION_DATE, RETIREMENT_DATE, and possibly others. A purchase order has an ORDER_DATE, BACK_ORDER_DATE, SHIPPED_DATE, RECEIVED_DATE, INVOICED_DATE, and potential others. These all have an ordering implied in their definitions. Date values that are out of order generally, if not always, indicate that at least one value is inaccurate.
Some objects also include TIME_OF_DAY columns that also can have an implied ordering. For example, START_TIME should always be earlier than STOP_TIME for data-recording repair activities in an auto repair garage, provided the repair activities do not cross over the end of a day.
These fields can be mapped out to determine the expected ordering and then one or more data rules formulated to test for conformance. If all of the dates are in the same row of the table, one query can find all violations. It is generally very easy to spot the violation by examining the output. Examples of validation rules are
SHIPPING_DATE IS NOT EARLIER THAN ORDER_DATE RECEIVED_DATE IS NOT EARLIER THAN ORDER_DATE RENEWAL_DATE IS 1 YEAR AFTER CONTRACT_DATE
Other dates that are important in rules are the CURRENT_DATE and a milestone date. Many date fields are not allowed to be later than the CURRENT_DATE. For example, an ORDER_DATE cannot be forward dated. A milestone date is one that sets a boundary for objects. For example, the date a company was founded would be a milestone date for ensuring that no employee had a start date that preceded it. These types of rules are generally not included in this section because they can usually be part of the column value definition, as described in Chapter 8. They generally involve only a single column.
A time duration is the computation of time between a pair of date and/or time columns. For example, the function YEARS(HIRE_DATE—BIRTHDATE) will determine how old you were when you were hired.
Some durations have boundaries set by the company. For example, a rule may exist that all orders sent to suppliers must be filled in 60 days or be automatically cancelled. A duration of more than 60 days would mean that either the rule was violated on purpose or one of the dates was wrong. Violations of duration rules generally indicate inaccurate data. One of the dates is wrong, causing the duration to be unreasonable. Examples of duration rules are
DAYS(SHIPPING_DATE_ORDER − DATE) LESS THAN 60 YEARS(HIRE_DATE − BIRTHDATE) GREATER THAN OR EQUAL TO 18
Duration rules can also check for order of values as discussed earlier. Negative duration values would indicate that values are not in the proper order.
Another place to look for simple data rules is to seek columns that divide the object type into subtypes. For example, GENDER divides employees into MALE and FEMALE, MARITAL_STATUS divides employees into MARRIED and NOT_MARRIED, ACCOUNT_TYPE may divide customers into RETAIL and COMMERCIAL, and INVENTORY_TYPE may divide inventory into NORMAL, TIME_SENSITIVE, RADIOACTIVE, and TEMPERATURE CONTROLLED.
This division of the objects into subgroups probably has an impact on the acceptable values in other columns. Some of the columns may be used only for one subtype. In this case they must be NOT_NULL or not blank for that subgroup, and NULL or blank for others. They may also restrict the permitted values in some columns based on the subgroup. Examples of object subgrouping columns rules are
SPOUSE_NAME IS 'BLANK' IF MARRIED_FLAG IS NOT YES DATE_OF_LAST_BIRTH IS NULL IF GENDER IS 'MALE' IF EMPLOYEE_TYPE IS 'PART_TIME' THEN PAY_TYPE IS 'NONEXEMPT'
Violations of rules such as these are almost always the result of inaccurate data entry. These types of errors can have large impacts on decision support routines that tend to use the subgrouping columns to define comparison groups.
A similar type of rule involves data objects that record multiple steps in a process. As each step is completed, more data is added to the object. Each of the points in the process defines a "state" the data can be in. Each state generally involves some columns that are applicable only to that state. This means that if the step has not completed yet for a state, the columns with affinity to that state should be NULL. If the step has completed, they should be NOT NULL.
Work flow rules are a subset of process rules. Data rules can be constructed from them to determine the correctness of values within each valid state.
As in the previous section, the database systems are not functionally sufficient to enforce this conditional NULL rule. A data rule is needed to ensure that all columns conform to the requirements of the current state.
The state may not affect only whether other columns are empty or not. It may also restrict the content of another column to a subset of the values permitted in value analysis.
Examples of work flow data objects are ORDERS (placed, shipped, received, returned, cancelled), LOAN_APPLICATION (requested, reviewed, accepted or rejected, made), and EMPLOYMENT_APPLICATION (received, reviewed, interviewed, background check, offered/rejected, employed). Examples of data rules for work flow process rules are
IF APPLICATION_STATUS IS "REVIEWED" THEN INTEREST_RATE IS NULL IF APPLICATION_STATUS IS "ACCEPTED" THEN INTEREST_RATE IS NOT NULL
Sometimes a column is determined by consideration of values from other columns through a business procedure or policy. Setting a customer rating or a risk factor on a security investment are examples. If the rules for setting these values are deterministic enough, they can be converted into data rules to test against the data. An example is
IF INVENTORY_USAGE_LAST6MONTHS IS 0 AND ONHAND IS GREATER THAN 0 THEN SALVAGE_FLAG=YES
Sometimes the business rule is vague, allowing discretion on a businessperson, and thus cannot be conveniently encoded as a data rule. However, a subset of the business rule may be absolute and can be encoded as a valid test. For example, a rule that defines a BAD customer as opposed to a GOOD customer may be very complex and allow some leeway. However, within the rule it may be an absolute that a customer with an OVERDUE amount in excess of $100,000 that is overdue by more than 90 days is a BAD customer. Even though some of the criteria for a BAD customer is open to discretion, this part of the rating process is not. The part that is an absolute can be converted into a data rule and tested.
The rules shown so far involve values contained in a single row. Sometimes an object has multiple rows for parts of the data object. For example, an ORDER has line item detail rows; a project has multiple people working on it. Rules can apply to these that involve looking at more than one row at a time but still within the boundaries of the single data object.
An example is testing for uniqueness across the PRODUCT_ID column of ORDER_LINE_ITEMS. This is a column that is not unique across the entire table but is expected to be unique across the rows of each order. In this example, it appears that only one column was being used. However, there were actually three columns used: PRODUCT_ID, ORDER_NUMBER, and LINE_ITEM_NUMBER. Another example of this is a data rule that says that multiple entries for the same employee for emergency contact cannot have the same telephone number.
Of course, additional data rules can be created that are not included in the samples shown. Many are more complex than those shown here. These are generally data rules dealing with business policies. Many times these rules are soft rules that are allowed to be violated at times. Examples are
EXPENSE LINE ITEMS FOR A SINGLE EXPENSE REPORT CAN- NOT INCLUDE DATES FOR MORE THAN ONE WEEK WHERE A WEEK IS DEFINED AS SUNDAY THROUGH SATURDAY AN ORDER CANNOT INCLUDE ITEMS THAT REQUIRE DIFFERENT SHIPPING MODES