Remedies to quality problems can range anywhere from simply holding a training class for data entry personnel to replacing an entire application. Without remedies, existing problems are likely to persist, if not get worse, and potential problems that have not yet occurred become more likely to occur.
Often the problems that exist in a database cannot be repaired. This is true when the number of errors makes it impractical to seek out and repair the inaccurate values. This is also true when it is no longer possible to obtain the correct information. The remedies are mostly designed to improve the quality of new data being entered into the databases as opposed to fixing the data that is already there.
There are a number of classical problems associated with this phase. The first is the trade-off of making quick improvements through patching an existing system versus taking a longer-term view of reengineering the data processes and application programs. The second is the trade-off between making changes to primary systems versus performing data cleansing to fix problems when moving data. Figure 5.3 lists some of the types of remedies that can be used for resolving issues.
Remedies are changes to systems that are designed to prevent data inaccuracies from occurring in the future, as well as to detect as many of them as possible when they do occur. The scope of changes includes data capture processes, primary applications that create and update data, processes that move data between databases, and applications that generate information products. In short, everything is fair game when designing remedies.
Improving data capture processes can include actions such as redesigning data entry windows and associated logic, training data entry people, and instituting feedback reporting of quality problems to data entry people. Many small items like these can make large improvements in the accuracy of data.
At the other extreme is altering the business processes that include data capture and update. Changes in who enters data and when they do it can improve the efficiency of the processes and the likelihood that the data will be accurate. Getting the entry of data closer to the real-world event, having fewer people involved in the process, and having the entry people trained on the intent of the application can all contribute to better data.
Business processes can be altered to add data verification through additional means in cases where it is warranted. Business processes can be altered to eliminate incentives to provide inaccurate data.
More automation can be brought to the entry process wherever appropriate. Use of bar coding, lookup of previously entered information, and voice capture of the verbal exchange between the person creating the data and the person entering it, kept for later replay and verification, are examples where automation can improve accuracy.
Defensive data checkers are software that assists in enforcing rules at the point of data entry to prevent invalid values, invalid combinations of valid values, and structural problems from getting into the database in the first place.
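The three kinds of rule named above can be illustrated with a minimal sketch of a checker that runs before a record is written. The employee record layout, the field names, and the status codes are hypothetical and chosen only for illustration.

```python
# Hypothetical domain for an encoded field and hypothetical required fields.
VALID_STATUS = {"ACTIVE", "TERMINATED"}
REQUIRED_FIELDS = ("employee_id", "status", "hire_date")


def check_employee(record: dict) -> list[str]:
    """Return the rule violations for one record; an empty list means it passes."""
    errors = []

    # Structural rule: required fields must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            errors.append(f"missing required field: {field}")

    # Value rule: an encoded field must hold a value from its domain.
    status = record.get("status")
    if status and status not in VALID_STATUS:
        errors.append(f"invalid status: {status}")

    # Combination rule: each value is valid alone, but not together.
    terminated_on = record.get("termination_date")
    if status == "ACTIVE" and terminated_on:
        errors.append("active employee has a termination date")
    if status == "TERMINATED" and not terminated_on:
        errors.append("terminated employee has no termination date")

    return errors
```

An entry screen or application server could call such a function and refuse to store the record until the returned list is empty.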
Rule checking can be performed in multiple places and through multiple means. Data entry screens can be designed to check for valid values for encoded fields and to enforce values for required fields. Application server code can take the data for a transaction and perform further rule testing for more stringent value testing and multivalued correlation testing. The database implementation can employ the support of the DBMS software to enforce many structural rules, such as primary key uniqueness, primary/foreign key constraints, and null rule enforcement. The use of a separate rule-checking component can be added to the transaction flow to perform additional data rule checking.
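As a sketch of the DBMS-level enforcement described above, the following uses Python's built-in sqlite3 module to declare primary key, foreign key, null, and value rules in the schema itself. The customer and orders tables and the helper names build_db and try_insert are assumptions made for this example, and other DBMS products differ in which rules they can enforce.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Create an in-memory database whose schema carries the structural rules."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE customer ("
        " customer_id INTEGER PRIMARY KEY,"   # primary key uniqueness
        " name TEXT NOT NULL)"                # null rule
    )
    conn.execute(
        "CREATE TABLE orders ("
        " order_id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL REFERENCES customer(customer_id),"  # foreign key
        " quantity INTEGER NOT NULL CHECK (quantity > 0))"                 # value rule
    )
    return conn


def try_insert(conn: sqlite3.Connection, sql: str, params: tuple) -> bool:
    """Run one insert; return False if the DBMS rejects it for breaking a rule."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

With this schema, an order that names a nonexistent customer or a second customer that reuses an existing key is rejected by the database no matter which application submitted it.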
A solution that might be chosen is to leave the application alone but change the database management system used in order to take advantage of a different DBMS's superior data-checking functions.
Data checkers can be moved more into the mainstream of the application. For example, several new Internet applications are checking the correlation of address information at the point of data capture and alerting the entry person when the various components are incompatible.
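A correlation check of this kind might look like the sketch below. The three-entry lookup table is a small illustrative sample and the function name is hypothetical; a production checker would consult complete postal reference data, in which a single ZIP code can also map to more than one acceptable city name.

```python
# Illustrative sample only; a real checker would use full postal reference data.
ZIP_REFERENCE = {
    "78701": ("Austin", "TX"),
    "10001": ("New York", "NY"),
    "94105": ("San Francisco", "CA"),
}


def check_address(city: str, state: str, zip_code: str) -> list[str]:
    """Return warnings when city, state, and ZIP code do not agree."""
    expected = ZIP_REFERENCE.get(zip_code)
    if expected is None:
        return [f"unknown ZIP code: {zip_code}"]

    expected_city, expected_state = expected
    warnings = []
    if state.strip().upper() != expected_state:
        warnings.append(f"ZIP {zip_code} belongs to {expected_state}, not {state}")
    if city.strip().casefold() != expected_city.casefold():
        warnings.append(f"ZIP {zip_code} belongs to {expected_city}, not {city}")
    return warnings
```

The function returns warnings rather than raising an error, which matches the idea of alerting the entry person and letting that person resolve the mismatch.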
Defensive checkers cannot prevent all inaccuracies from getting into the database. Inaccuracies still flow through in cases for which values are valid individually and in combination but are just plain wrong. It is also generally impractical to test rules that involve large sets of data to determine correlation correctness.
Data monitoring is the addition of programs that run periodically over the databases to check for conformance to rules that are not practical to execute at the transaction level. These programs can be used to off-load work from transaction checks when the performance of transactions is adversely affected by too much checking. Because they can check more rules than is practical in the transaction path, they can be helpful in spotting new problems in the data that did not occur before.
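A periodic monitor can be sketched as a set of named queries, each counting the rows that violate one rule, run as a batch job. The table names, column names, and the two rules below are hypothetical; both rules compare many rows at once, which is the kind of check that is usually too expensive for the transaction path.

```python
import sqlite3

# Each hypothetical rule is a query that counts violations across the whole table.
MONITOR_RULES = {
    "order total equals sum of line items": """
        SELECT COUNT(*) FROM orders o
        WHERE o.total <> (SELECT COALESCE(SUM(li.amount), 0)
                          FROM line_item li
                          WHERE li.order_id = o.order_id)
    """,
    "tax id is unique across customers": """
        SELECT COUNT(*) FROM (SELECT tax_id FROM customer
                              GROUP BY tax_id HAVING COUNT(*) > 1)
    """,
}


def run_monitor(conn: sqlite3.Connection, rules: dict = MONITOR_RULES) -> dict:
    """Run every rule query and return a violation count per rule name."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in rules.items()}
```

Storing the counts from each run makes it possible to see when a rule that was clean in earlier sweeps starts to produce violations.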
The use of data cleansing programs to identify and clean up data after it has been captured can also be a remedy. Data cleansing is often used between primary databases and derivative databases that have less tolerance for inaccuracies. Cleansing programs can also be used for cleaning up data in original source systems.
Data cleansing has been specifically useful for cleaning up name and address information. These types of fields tend to have the highest error rate at capture and the highest decay rates, but they are also the fields in which inaccuracies are easiest to detect and easiest to correct programmatically.
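The sketch below shows the flavor of programmatic address correction: it collapses whitespace, normalizes capitalization, and standardizes a few street suffixes. The suffix table and the function name are assumptions made for this example, and commercial cleansing tools go much further by validating against postal reference data.

```python
import re

# Hypothetical standardization table for a few common street suffixes.
SUFFIXES = {
    "street": "St", "st": "St",
    "avenue": "Ave", "ave": "Ave",
    "road": "Rd", "rd": "Rd",
}


def clean_address(raw: str) -> str:
    """Normalize spacing, capitalization, and street suffixes in one address line."""
    tokens = re.sub(r"\s+", " ", raw).strip().split(" ")
    cleaned = []
    for token in tokens:
        bare = token.strip(".,").lower()
        if bare in SUFFIXES:
            # Simplification: every matching token is treated as a street suffix,
            # so "St" meaning "Saint" would be handled incorrectly.
            cleaned.append(SUFFIXES[bare])
        elif bare.isalpha():
            cleaned.append(bare.capitalize())
        else:
            cleaned.append(bare.upper())  # house numbers and unit codes such as "4b"
    return " ".join(cleaned)
```

For example, clean_address("  123   main   STREET. ") returns "123 Main St".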
In extreme cases, the application that generates data can be overhauled or replaced. This is becoming more common as the solution to cases in which many data issues pile on the same data source.
Reengineering can apply to the primary databases where data is initially captured, as well as to the applications that extract, transform, and move the data to derivative data stores or to the derivative stores themselves.
This remedy rarely stands alone. All other remedies are specifically directed at solving a data quality problem. Reengineering generally will not be selected as a solution solely for data quality reasons. Data quality concerns become additional justification for making a change that has been justified by other drivers.
Remedies need to be devised with consideration for the cost and time to implement. Time to implement must include the time lag before it is likely any project would start. Many of these remedies require negotiation with development teams and scheduling against many other competing tasks.
This often leads to a staged approach to implementation involving data cleansing and monitoring early and reengineering of applications later. It may also lead to implementation of throwaway efforts in order to effect some short-term improvements while waiting for long-term projects to complete.
Too often projects initiated from these remedies end up on a to-do list and then get dropped or continue to get prioritized behind other projects. A reason for this is that they tend to be too granular and are not competitive against bigger projects that promise greater returns.
Issues management should strive for as many easy or short-term remedies as possible to obtain quick improvements. For example, training data entry people, changing screen designs, adding checker logic, or setting expectations are easy to do.
Data cleansing can also be introduced as a short-term remedy to fill the void while more substantive changes are made. Data cleansing should always be considered a temporary fix.
These are tricky matters to manage. One of the dangers is that the temporary improvements become permanent. Managers may think that because some improvements have been made, the problem is solved. They may think that data cleansing is a solution instead of a short-term coping mechanism.
This underlines the need to keep issues open as long as they are not fully addressed. If necessary, long-term remedies can be split off into separate issues for tracking.
This is also a reason to monitor the results of remedies implemented. After the short-term remedies are implemented, the profiling process should be repeated and the impacts reexamined. This allows quality problems and their associated impacts that remain after short-term remedies are implemented to be documented, sized, and used to justify the longer-term efforts.
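One way to size what remains after a short-term remedy is to apply the same rules to a sample taken before the remedy and a sample taken after it, then compare violation rates. This is a minimal sketch; the function names and the representation of a rule as a Python predicate are assumptions made for this example.

```python
def violation_rate(records, rule) -> float:
    """Fraction of records for which the rule predicate returns False."""
    records = list(records)
    if not records:
        return 0.0
    return sum(1 for record in records if not rule(record)) / len(records)


def compare_profiles(before, after, rules: dict) -> dict:
    """Map each rule name to its (before, after) violation rates."""
    return {
        name: (violation_rate(before, rule), violation_rate(after, rule))
        for name, rule in rules.items()
    }
```

A rule whose rate stays high after the short-term remedy is a candidate for documentation as a remaining problem and for use in justifying a longer-term effort.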
There is a real danger in this phase of overengineering remedies. A zealous data quality team can outline a number of measures that will have no chance of being implemented. It is important that the team performing the remedy recommendations include representatives from the IT and user organizations in order to avoid recommending something that will be rejected.
An example of overengineering is to require that all data rules discovered during the data profiling process be implemented as transaction checks or as periodic monitoring functions. Although this would catch many errors, in practice it has the potential of overloading the transaction path and causing performance problems. The rule set needs to be prioritized based on the probability of errors occurring and the importance of an inaccurate value. The high-risk rules should be added to the transaction path, moderate-risk rules should be added to periodic monitoring sweeps over the data, and low-risk rules should not be implemented. Periodic reprofiling of the data, possibly once a year, can check the rules not implemented to make sure they are not becoming more of a problem.
Note that a rule can be classified as high risk even though profiling indicates few if any violations have occurred. If the potential cost to the corporation of a violation is very high, it needs to be included in checkers even though there is no evidence it has already produced inaccurate data.
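The prioritization described above can be sketched as a small classification function. The text does not prescribe a formula, so scoring risk as probability multiplied by cost is an assumption of this example, as are the class name, the cost units, and all three threshold values.

```python
from dataclasses import dataclass


@dataclass
class DataRule:
    name: str
    error_probability: float  # estimated chance that a record violates the rule (0 to 1)
    violation_cost: float     # estimated cost of one violation, in arbitrary units


# Hypothetical thresholds; each organization would set its own.
HIGH_RISK = 100.0
MODERATE_RISK = 10.0
CRITICAL_COST = 100_000.0


def placement(rule: DataRule) -> str:
    """Decide where a rule is enforced, based on an assumed risk score."""
    # A very costly violation is high risk even if profiling has found none.
    if rule.violation_cost >= CRITICAL_COST:
        return "transaction check"
    risk = rule.error_probability * rule.violation_cost
    if risk >= HIGH_RISK:
        return "transaction check"
    if risk >= MODERATE_RISK:
        return "periodic monitoring"
    return "reprofile only"
```

The cost override comes first so that a rule with no observed violations but a very high potential cost still lands in the transaction path.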
Another example is to call for a major company reorganization to obtain more reliable data capture processes. This should not be considered a remedy unless an awful lot of evidence exists to justify it.
Organizations resist change, and change does not always produce the expected results. If there is a perception that little is to be gained, this type of recommendation will never be approved.
Similarly, recommendations that require major changes to high-availability applications are less likely to get approved. The disruption factor on a major application can cost a company tons of money if it is not managed properly. These types of changes are not accepted easily.
As more and more issues pass through the process, the team will learn more about what types of remedies are most effective and what types of remedies can more easily be adopted. What you learn can be converted into best practices that can be employed in all new system developments. This is a good way to improve the quality of data coming from new systems before a data quality problem even exists.
This is a part of the role of the data quality assurance department. It feeds into their role of preventing problems.