Developing a Backup and Recovery Strategy

A database can become unusable because of hardware or software failure, or both. You may, at one time or another, encounter storage problems, power interruptions, application failures, and different failure scenarios that require different recovery actions. It is important to protect your data against the possibility of loss by having a well-rehearsed recovery strategy in place. Some of the questions that you should answer when developing your recovery strategy are:

Will the database be recoverable?
Is version recovery sufficient, or must the database allow rollforward recovery?
How much time can be spent recovering the database?
How much time will pass between backup operations?
How much storage space can be allocated for backup copies and archived logs?
Will table space level backups be sufficient, or will full database backups be necessary?

Now the question is, Why back up? What are you trying to protect? What failures do you expect?

You should:

Always have a recovery plan in place (make sure the plan is well documented and practiced).
Identify failures that can occur and how to prevent/react to the failures.
- Disk failure: Use disk mirroring or Redundant Array of Independent Disks (RAID)
- CPU failure: Use a standby machine or failover machine
- Application failure: Use a backup image
Use DB2's backup and recovery utilities.

A database recovery strategy should ensure that all information is available when it is required for database recovery. It should include a regular schedule for taking backups and, in the case of partitioned database systems, should include backups when the system is scaled (when database partitions are added or dropped).

The overall strategy should also include procedures for recovering command scripts, applications, user-defined functions (UDFs), as well as stored procedure code in the operating system libraries, and load copies.

Different recovery methods are discussed in the sections that follow, and you will discover which recovery method is best suited to your business environment.

The concept of a database backup is the same as any other data backup: taking a copy of the data, then storing it on a different medium in case of failure or damage to the original. The simplest case of a backup involves shutting down the database to ensure that no further transactions occur, then simply backing it up. You can then rebuild the database if it becomes damaged or corrupted in some way.

The rebuilding of the database is called recovery.

Crash recovery is the automatic recovery of the database if a failure occurs before all of the changes that are part of one or more units of work are completed and committed or rolled back. This is done by rolling back incomplete transactions and ensuring completion of committed transactions that were still in memory when the crash occurred.
Version recovery is the restoration of a previous version of the database, using an image that was created during a backup operation.
Disaster recovery is the restoration of the entire database on another machine, using the full database backup and all of the archived logs for the database.
Rollforward recovery is the reapplication of transactions recorded in the database log files after a database or a table space backup image has been restored.

Each database includes logs, which are used to recover from application or system errors. In combination with the database backups, they are used to recover the consistency of the database right up to the point in time when the error occurred.

Data that is easily recreated can be stored in a non-recoverable database. This includes data from an outside source that is used for read-only applications and tables that are not often updated, for which the small amount of logging does not justify the added complexity of managing log files and rolling forward after a restore operation. Non-recoverable databases have both the LOGRETAIN and the USEREXIT database configuration parameters turned off. This means that the only logs that are kept are those required for crash recovery.

These logs are known as active logs, and they contain current transaction data. Version recovery using offline backups is the primary means of recovery for a non-recoverable database. An offline backup means that no other application can use the database when the backup operation is in progress. Such a database can be restored only offline. It is restored to the state it was in when the backup image was taken, and rollforward recovery is not supported.

Data that cannot be easily recreated should be stored in a recoverable database. This includes data whose source is destroyed after the data is loaded, data that is manually entered into tables, and data that is modified by application programs or users after it is loaded into the database. Recoverable databases have either the LOGRETAIN database configuration parameter set to RECOVERY, the USEREXIT database configuration parameter set to YES, or both.

Active logs are still available for crash recovery, but when the database is configured as a recoverable database, you also have the archived logs, which contain committed transaction data. Such a database can be restored only offline. It is restored to the state it was in when the backup image was taken. However, with rollforward recovery, you can roll the database forward (that is, past the time when the backup image was taken) by using the active and archived logs either to a specific point in time or to the end of the active logs.

Recoverable database backup operations can be performed either offline or online. An online backup means that other applications can connect to the database when the backup operation is in progress. Database restore and rollforward operations must always be performed offline. During an online backup operation, rollforward recovery ensures that all changes are captured and reapplied if that backup is restored.

If you have a recoverable database, you can back up, restore, and roll individual table spaces forward, rather than the entire database. When you back up a table space online, it is still available for use, and simultaneous updates are recorded in the logs. When you perform an online restore or rollforward operation on a table space, the table space itself is not available for use until the operation completes, but users can be allowed to access tables in other table spaces.

The Recovery History File

The recovery history file contains certain historical information about major actions that have been performed against a database. The information recorded in the recovery history file is used to assist with the recovery of the database in the event of a failure. The following is a list of the actions that will generate an entry in the history file:

Backing up the database or a table space
Restoring the database or a table space from a backup image
Performing a rollforward recovery operation on the database or a table space
Loading a table
Altering a table space's definition
Quiescing a table space
Reorganizing a table (REORG)
Updating statistics for a table (RUNSTATS)
Dropping a table

A recovery history file is also created automatically when a database is created. This file is located in the database directory. You cannot directly modify a recovery history file; however, you can delete entries from the recovery history file using the PRUNE HISTORY command. You can also use the REC_HIS_RETENTN database configuration parameter to specify the number of days that the recovery history file will be retained.

Recovery history file related to backup information for the database SAMPLE:

db2 list history backup all for sample

                         List History File for sample
Number of matching file entries = 1
Op   Obj Timestamp+Sequence   Type  Dev Earliest Log  Current Log   Backup ID
---  --- ----------------     ----  --- ------------  ------------  ---------
B    D   20020829090733001    F     D   S0000000.LOG  S0000000.LOG
--------------------------------------------------------------------------
Contains 2 tablespace(s):
00001 SYSCATSPACE
00002 USERSPACE1
--------------------------------------------------------------------------
00002 USERSPACE1
     Comment: DB2 BACKUP SAMPLE OFFLINE
Start Time: 20020829090733
  End Time: 20020829090756
------------------------------------------------------------------------
00001 Location: /data/dbbackup

Deciding How Often to Back Up

Your recovery plan should allow for regularly scheduled backup operations, because backing up a database requires time and system resources. Your plan may include a combination of full database backups, table space backups, and incremental backup operations.

You should take full database backups regularly, even if you archive the logs (which allows for rollforward recovery). It is more time consuming to rebuild a database from a collection of table space backup images than it is to recover the database from a full database backup image. Table space backup images are useful for recovering from an isolated disk failure or an application error.

You should also consider not overwriting backup images and logs, saving at least two full database backup images and their associated logs as an extra precaution.

If the amount of time needed to apply archived logs when recovering and rolling a very active database forward is a major concern, consider the cost of backing up the database more frequently. This reduces the number of archived logs you need to apply when rolling forward. You can initiate a backup operation while the database is either online or offline. If it is online, other applications or processes can connect to the database, as well as read and modify data while the backup operation is running. If the backup operation is running offline, other applications cannot connect to the database.

To reduce the amount of time that the database is not available, consider using online backup operations. Online backup operations are supported only if rollforward recovery is enabled. If rollforward recovery is enabled and you have a complete set of recovery logs, you can rebuild the database, should the need arise. You can use an online backup image for recovery only if you have the logs that span the time during which the backup operation was running.

Offline backup operations can be faster than online backu operations, because there is no contention for the data files.

The backup utility lets you back up selected table spaces. If you use DMS table spaces, you can store different types of data in their own table spaces to reduce the time required for backup operations. You can keep table data in one table space, long field and large object (LOB) data in another table space, and indexes in yet another table space. If you do this and a disk failure occurs, it is likely to affect only one of the table spaces. Restoring or rolling forward one of these table spaces will take less time than it would have taken to restore a single table space containing all of the data.

NOTE

If a table is split among table spaces, you must restore all of the table spaces for the table to roll forward to a point in time prior to the end of the logs.

You can also save time by taking backups of different table spaces at different times, as long as the frequency of changes to them are not the same. If long field or LOB data is not changed as frequently as the other data, you can back up these table spaces less frequently. If long field and LOB data are not required for recovery, you can also consider not backing up the table space that contains that data. If the LOB data can be reproduced from a separate source, choose the NOT LOGGED column option when creating or altering a table to include LOB columns.

When formulating a recovery plan, you should take these recovery costs and their impact on your business operations into account. Testing your overall recovery plan will assist you in determining whether the time required to recover the database is reasonable, given your business requirements. Following each test, you should determine whether the recovery time required is acceptable; if it is too long, you may want to increase the frequency with which you take a backup. If rollforward recovery is part of your strategy, this will reduce the number of logs that are archived between backups and, as a result, reduce the time required to roll the database forward after a restore operation.

Storage Considerations

When deciding which recovery method to use, consider the storage space required.

The version recovery method requires space to hold the backup copy of the database and the restored database. The rollforward recovery method requires space to hold the backup copy of the database or table spaces, the restored database, and the archived database logs.

If a table contains long field or LOB columns, you should consider placing this data into a separate table space. This will affect your storage space considerations, as well as your plan for recovery, as mentioned above. With a separate table space for long field and LOB data, and knowing the time required to back up long field and LOB data, you may decide to use a recovery plan that only occasionally saves a backup of this table space. You may also choose, when creating or altering a table, to include LOB columns, not to log changes to those columns. This will reduce the size of the required log space and the corresponding log archive space.

To prevent media failure from destroying a database and your ability to rebuild it, keep the database backup, the database logs, and the database itself on different devices. For this reason, it is highly recommended that you use the NEWLOGPATH configuration parameter to put database logs on a separate device, once the database is created.

The database logs can use up a large amount of storage. If you plan to use the rollforward recovery method, you must decide how to manage the archived logs. Your choices are the following:

Use a user exit program to copy these logs to another storage device in your environment.
Manually copy the logs to a storage device or directory other than the database log path directory after they are no longer in the active set of logs.

Keeping Related Data Together

As part of your database design, you will know the relationships that exist between tables. These relationships can be expressed at the application level, when transactions update more than one table; at the database level, where referential integrity exists between tables; or where triggers on one table affect another table. You should consider these relationships when developing a recovery plan. You will want to back up related sets of data together. Such sets can be established at either the table space or the database level. By keeping related sets of data together, you can recover to a point where all of the data is consistent. This is especially important if you want to be able to perform point-in-time rollforward recovery on table spaces.

Using Different Operating Systems

When working in an environment that has more than one operating system, you must consider that you cannot back up a database on one operating system, then restore that database on another operating system (such as UNIX to/from Linux or Windows).

In such cases, you should keep the recovery plans for each operating system separate and independent. There is, however, support for cross-platform backup and restore operations between operating systems with similar architectures, such as AIX, HP/UX, and Sun Solaris, and between 32-bit and 64-bit operating systems.

When you transfer the backup image between systems, you must transfer it in binary mode. The target system must have the same (or later) version of DB2 as the source system. Restore operations to a down-level system are not supported.

If you must move tables from one operating system to another and cross-platform backup and restore support is not available in your environment, you can use the db2move command, or the export utility followed by the import or the load utility.

Crash Recovery

Transactions or units of work against a database can be interrupted unexpectedly. If a failure occurs before all of the changes that are part of the unit of work are completed and committed, the database is left in an inconsistent and unusable state. Crash recovery is the process by which the database is moved back to a consistent and usable state. This is done by rolling back incomplete transactions and completing committed transactions that were still in memory when the crash occurred. When a database is in a consistent and usable state, it has attained what is known as a point of consistency.

If you want the rollback of incomplete units of work to be done automatically by the database manager, enable the automatic restart (AUTORESTART) database configuration parameter by setting it to ON; this is the default value. If you do not want automatic restart behavior, set the AUTORESTART database configuration parameter to OFF. As a result, you will need to issue the RESTART DATABASE command when a database failure occurs. If the database I/O was suspended before the crash occurred, you must specify the WRITE RESUME option of the RESTART DATABASE command in order for the crash recovery to continue. The administration notification log records when the database restart operation begins.

If crash recovery is applied to a database that is enabled for forward recovery (that is, the LOGRETAIN configuration parameter is set to RECOVERY, or the USEREXIT configuration parameter is set to ON) and an error occurs during crash recovery that is attributable to an individual table space, that table space will be taken offline and cannot be accessed until it is repaired. Crash recovery continues. At the completion of crash recovery, the other table spaces in the database will be accessible, and connections to the database can be established. However, if the table space that is taken offline is the table space that contains the system catalogs, it must be repaired before any connections will be permitted.

Recovering Damaged Table Spaces

A damaged table space has one or more containers that cannot be accessed. This is often caused by media problems that are either permanent (a bad disk) or temporary (an offline disk or an unmounted file system).

If the damaged table space is the system catalog table space, the database cannot be restarted. If the container problems cannot be fixed, leaving the original data intact, the only available options are:

To restore the database.
To restore the catalog table space. Table space restore is valid only for recoverable databases, because the database must be rolled forward.

If the damaged table space is not the system catalog table space, DB2 attempts to make as much of the database available as possible.

If the damaged table space is the only temporary table space, you should create a new temporary table space as soon as a connection to the database can be made. Once created, the new temporary table space can be used, and normal database operations requiring a temporary table space can resume.

You can, if you wish, drop the offline temporary table space. There are special considerations for table reorganization using a system temporary table space:

If the database or the database manager configuration parameter INDEXREC is set to RESTART, all invalid indexes must be rebuilt during database activation; this includes indexes from a reorganization that crashed during the build phase.
If there are incomplete reorganization requests in a damaged temporary table space, you may have to set the INDEXREC configuration parameter to ACCESS to avoid restart failures.

Recovering Table Spaces in Recoverable Databases

When crash recovery is necessary, a damaged table space will be taken offline and will not be accessible. It will be placed in rollforward pending state. A restart operation will succeed if there are no additional problems, and the damaged table space can be used again once you:

Fix the damaged containers without losing the original data (i.e., mount the file systems), then perform a table space rollforward operation to the end of the logs. The rollforward operation will first attempt to bring it from offline to normal state.
Perform a table space restore operation after fixing the damaged containers (with or without losing the original data), then a rollforward operation to the end of the logs or to a point in time.

Reducing the Impact of Media Failure

To reduce the probability of media failure causing a database problem and to simplify recovery from this type of failure if it does occur:

Mirror or duplicate the disks that hold the data and logs for important databases.
Use a RAID configuration, such as RAID level 5.
In a partitioned database environment, set up a procedure for handling the data and the logs on the catalog partition. Because this catalog partition is critical for maintaining the database:
- Ensure that it resides on redundant disks, i.e., mirrored or RAID-protected.
- Mirror or duplicate the disk as needed.
- Make frequent backups.
- Consider not placing user data on the catalog partition.

Protecting Against Disk Failure

If you are concerned about the possibility of damaged data or logs due to a disk crash, consider the use of some form of disk fault tolerance. Generally, this is accomplished through the use of a disk array or disk/file system mirroring.

A disk array is sometimes referred to simply as a RAID. Disk arrays can also be provided through hardware or software at the operating system or application level. The point of distinction between hardware and software disk arrays is how processing of I/O requests is handled.

For hardware disk arrays, I/O activity is managed by the disk controllers; for software disk arrays, this is done by the operating system/file system.

Hardware Disk Arrays

In a hardware disk array, multiple disks are used and managed by a disk controller (or controllers), complete with its own memory/storage. All of the logic required to manage the disks forming the array is contained on the disk controller; therefore, this implementation is independent of the operating system.

There are several types of RAID architecture, differing in function and performance, but RAID levels 1 and 5 are the most commonly used today.

RAID level 1 is also known as disk mirroring or duplexing.

Disk mirroring copies data from one disk to a second disk, using a single disk controller.
Disk duplexing is similar to disk mirroring, except that disks are attached to a second disk controller (such as two adapters).
With either of these technologies, data protection is very good: Either disk can fail, and data is still accessible from the other disk. With disk duplexing, a disk controller can also fail without compromising data protection.
Performance is also good, but this implementation requires twice the usual number of disks.

RAID level 5 involves data and parity striping by sectors, across a set of disks. Parity is interleaved with the data, rather than being stored on a dedicated drive. With this technology, data protection is also good: If any disk fails, the data can still be accessed by using the parity information from the other disks. Read performance is good, but write performance can be adversely affected. A RAID level 5 configuration requires a minimum of three identical disks. The amount of disk space required for overhead varies with the number of disks in the array. In the case of a RAID level 5 configuration with 5 disks, the space overhead is 20% (i.e., five disks plus a parity disk).

When using a RAID (but not a RAID level 0) disk array, a disk failure will not prevent you from accessing data on the array. When hot-pluggable or hot-swappable disks are used in the array, a replacement disk can be swapped with the failed disk while the array is in use. With RAID level 5, if two disks fail at the same time, all data is lost (but the probability of simultaneous disk failures is very small).

You might consider using a RAID level 1 hardware disk array or a software disk array for your logs, because this provides recoverability to the point of failure and offers good write performance, which is important for logs. In cases where reliability is critical (because time cannot be lost recovering data following a disk failure) and write performance is not so critical, consider using a RAID level 5 hardware disk array. Alternatively, if write performance is critical and the cost of additional disk space is not significant, consider a RAID level 1 hardware disk array for your data, as well as for your logs.

NOTE

You should consider using MIRRORLOGPATH to specify a secondary log path for the database to manage copies of the active log. It is recommended that you place the secondary log path on a separate physical disk (preferably one that is also on a different disk controller). That way, the disk controller cannot be a single point of failure.

Stripe Set

A software disk array accomplishes much the same as does a hardware disk array, but disk traffic is managed either by the operating system or by an application program running on the server. Like other programs, the software array must compete for CPU and system resources. This is not a good option for a CPU-constrained system, and it should be remembered that overall disk array performance is dependent on the server's CPU load and capacity.

A typical software disk array provides disk mirroring. Although redundant disks are required, a software disk array is comparatively inexpensive to implement, because costly disk controllers are not required.

CAUTION

Having the operating system boot drive in the disk array prevents your system from starting if that drive fails. If the drive fails before the disk array is running, the disk array cannot allow access to the drive. A boot drive should be separate from the disk array.

Reducing the Potential of Transaction Failure

To reduce the potential of a transaction failure, try to ensure:

Adequate disk space for the database logs.
Reliable communication links among the database partitions in a partitioned database environment.
Synchronization of the system clocks in a partitioned database environment.

Recovering from Transaction Failures in a Partitioned Database Environment

If a transaction failure occurs in a partitioned database environment, database recovery is usually necessary on both the failed database partition and any other database partitions that were participating in the transaction:

Crash recovery occurs on the failed database partition after the failure condition is corrected.
Database partition failure recovery on the other (still active) database partitions occurs immediately after the failure has been detected.

In a partitioned database environment, the database partition on which an application is submitted is the coordinator partition, and the first agent that works for the application is the coordinator agent. The coordinator agent is responsible for distributing work to other database partitions, and it keeps track of which ones are involved in the transaction. When the application issues a COMMIT statement for a transaction, the coordinator agent commits the transaction by using a two-phase commit protocol.

During the first phase, the coordinator partition distributes a PREPARE request to all the other database partitions that are participating in the transaction. These database partitions then respond with one of the following:

READ-ONLY indicates that no data change occurred at this database partition.
YES indicates that data change occurred at this database partition.
NO indicates that there is an error and the database partition is not prepared to commit.

If one of the database partitions responds with a NO, the transaction is rolled back. Otherwise, the coordinator partition begins the second phase.

During the second phase, the coordinator partition writes a COMMIT log record, then distributes a COMMIT request to all the database partitions that responded with a YES. After all the other database partitions have committed, they send an acknowledgement of the COMMIT to the coordinator partition. The transaction is complete when the coordinator agent has received all COMMIT acknowledgements from all the participating partitions. At this point, the coordinator agent writes a FORGET log record.

Transaction Failure Recovery on an Active Database Partition

If any database partition detects that another database partition is down, all work that is associated with the failed database partition is stopped:

If the still active database partition is the coordinator partition for an application and the application was running on the failed database partition (and not ready to COMMIT), the coordinator agent is interrupted to do failure recovery. If the coordinator agent is in the second phase of COMMIT processing, SQL0279N is returned to the application, which in turn loses its database connection. Otherwise, the coordinator agent distributes a ROLLBACK request to all other partitions participating in the transaction, and SQL1229N is returned to the application.
If the failed database partition was the coordinator partition for the application, agents that are still working for the application on the active partitions are interrupted to do failure recovery. The current transaction is rolled back locally on each database partition, unless it has been prepared and is waiting for the transaction outcome. In this situation, the transaction is left in doubt on the active database partitions, and the coordinator partition is not aware of this (because it is not available).
If the application was connected to the failed database partition (before it failed) but neither the local database partition nor the failed database partition is the coordinator partition, agents working for this application are interrupted. The coordinator partition will send either a ROLLBACK or a disconnect message to the other database partitions.

The transaction will be indoubt only on database partitions that are still active if the coordinator partition returns an SQL0279N.

Any process (such as an agent or deadlock detector) that attempts to send a request to the failed partition is informed that it cannot send the request.

Transaction Failure Recovery on the Failed Database Partition

If the transaction failure causes the database manager to end abnormally, you can issue the db2start command with the RESTART option to restart the database manager, once the database partition has been restarted. If you cannot restart the database partition, you can issue db2start to restart the database manager on a different partition.

If the database manager ends abnormally, database partitions may be left in an inconsistent state. To make them usable, crash recovery can be triggered on a database partition:

Explicitly, through the RESTART DATABASE command
Implicitly, through a CONNECT request when the AUTORESTART database configuration parameter has been set to ON

Crash recovery reapplies the log records in the active log files to ensure that the effects of all complete transactions are in the database. After the changes have been reapplied, all uncommitted transactions are rolled back locally, except for indoubt transactions. There are two types of indoubt transaction in a partitioned database environment:

On a database partition that is not the coordinator partition, a transaction is indoubt if it is prepared but not yet committed.
On the coordinator partition, a transaction is indoubt if it is committed but not yet logged as complete (that is, the FORGET record is not yet written). This situation occurs when the coordinator agent has not received all the COMMIT acknowledgements from all the database partitions that worked for the application.

Crash recovery attempts to resolve all the indoubt transactions by doing one of the following. The action that is taken depends on whether the database partition was the coordinator partition for an application:

If the database partition that restarted is not the coordinator partition for the application, it sends a query message to the coordinator agent to discover the outcome of the transaction.
If the database partition that restarted is the coordinator partition for the application, it sends a message to all the other agents (subordinate agents) that the coordinator agent is still waiting for COMMIT acknowledgements.

It is possible that crash recovery may not be able to resolve all the indoubt transactions (for example, some of the database partitions may not be available). In this situation, the SQL warning message SQL1061W is returned. Because indoubt transactions hold resources, such as locks and active log space, it is possible to get to a point where no changes can be made to the database because the active log space is being held up by indoubt transactions. For this reason, you should determine whether indoubt transactions remain after crash recovery and recover all database partitions that are required to resolve the indoubt transactions as quickly as possible.

If one or more partitions that are required to resolve an indoubt transaction cannot be recovered in time, and access is required to database partitions, you can manually resolve the indoubt transaction by making a heuristic decision. You can use the LIST INDOUBT TRANSACTIONS command to query, commit, and roll back the indoubt transaction on the database partition.

Identifying the Failed Database Partition

When a database partition fails, the application will typically receive one of the following SQLCODEs. The method for detecting which database manager failed depends on the SQLCODE received:

SQL0279N? This SQLCODE is received when a database partition involved in a transaction is terminated during COMMIT processing.
SQL1224N? This SQLCODE is received when the database partition that failed is the coordinator partition for the transaction.
SQL1229N? This SQLCODE is received when the database partition that failed is not the coordinator partition for the transaction.

Determining which database partition failed is a two-step process. The SQLCA associated with SQLCODE SQL1229N contains the database partition number of the server that detected the error in the sixth array position of the sqlerrd field.

NOTE

If multiple logical database partitions are being used on a single server, the failure of one logical database partition may cause other logical database partitions on the same single server to fail.

Recovering from the Failure of a Database Partition

To recover from the failure of a database partition:

Correct the problem that caused the failure.
Restart the database manager by issuing the db2start command from any database partition.
Restart the database by issuing the RESTART DATABASE command on the failed database partition.

Disaster Recovery

The term disaster recovery is used to describe the activities that need to be done to restore the database (in a remote location) in the event of a fire, earthquake, vandalism, or other catastrophic events. A plan for disaster recovery can include one or more of the following:

A remote site to be used in the event of an emergency
A different server on which to recover the database
Off-site storage of database backups and archived logs

If your plan for disaster recovery is to recover the entire database on another machine, you require at least one full database backup and all the archived logs for the database. You may choose to keep a standby database up to date by applying the logs to it as they are archived. Or you may choose to keep the database backup and log archives in the standby site and to perform the restore and rollforward operations only after a disaster has occurred. With a disaster, however, it is generally not possible to recover all of the transactions up to the time of the disaster, i.e., some of the active log files may be lost.

The usefulness of a table space backup for disaster recovery depends on the scope of the failure. Typically, disaster recovery requires that you restore the entire database; therefore, a full database backup should be kept at a standby site. Even if you have a separate backup image of every table space, you cannot use them to recover the database.

Both table space backups and full database backups can have a role to play in any disaster recovery plan. The DB2 facilities available for backing up, restoring, and rolling data forward provide a foundation for a disaster recovery plan. You should ensure that you have tested recovery procedures in place to protect your business.

Version Recovery

Version recovery is the restoration of a previous version of the database, using an image that was created during an offline database backup operation. You use this recovery method with non-recoverable databases (that is, databases for which you do not have archived logs). You can also use this method with recoverable databases by using the WITHOUT ROLLING FORWARD option on the RESTORE DATABASE command.

A database restore operation will rebuild the entire database, using a backup image created earlier. A database backup allows you to restore a database to a state identical to the state at the time that the backup was made. However, every unit of work from the time of the backup to the time of the failure is lost.

Using the version recovery method, you must schedule and perform full offline backups of the database on a regular basis.

In a partitioned database environment, the database is located across many database partitions. You must restore all partitions, and the backup images that you use for the restore database operation must all have been taken at the same time. Each database partition is backed up and restored separately. A backup of each database partition taken at the same time is known as a version backup.

NOTE

You need to restore the catalog partition first, then the remaining database partitions.

Rollforward Recovery

To use the rollforward recovery method, you must have taken a backup of the database and archived the logs (by setting to YES either the LOGRETAIN or the USEREXIT database configuration parameters, or both). Restoring the database and specifying the WITHOUT ROLLING FORWARD option is equivalent to using the version recovery method. The database is restored to a state identical to the one at the time that the offline backup image was made. If you restore the database and do not specify the WITHOUT ROLLING FORWARD option for the restore database operation, the database will be in rollforward pending state at the end of the restore operation. This allows rollforward recovery to take place.

NOTE

The WITHOUT ROLLING FORWARD option cannot be used if the database backup was taken online.

The two types of rollforward recovery to consider are:

Database rollforward recovery. In this type of rollforward recovery, transactions recorded in database logs are applied following the database restore operation. The database logs record all changes made to the database. This method completes the recovery of the database to its state at a particular point in time or to its state immediately before the failure (i.e., to the end of the active logs).
In a partitioned database environment, the database is located across many database partitions. If you are performing point-in-time rollforward recovery, all database partitions must be rolled forward to ensure that all partitions are at the same level. If you need to restore a single database partition, you can perform rollforward recovery to the end of the logs to bring it up to the same level as the other partitions in the database. Only recovery to the end of the logs can be used if one database partition is being rolled forward. Point-in-time recovery applies to all database partitions.
Table space rollforward recovery. If the database is enabled for forward recovery, it is also possible to back up, restore, and roll table spaces forward. To perform a table space restore and rollforward operation, you need a backup image of either the entire database (i.e., all of the table spaces) or one or more individual table spaces. You also need the log records that affect the table spaces that are to be recovered.

You can roll forward through the logs to one of two points:

The end of the logs
A particular point in time (called point-in-time recovery)

Table space rollforward recovery can be used in the following two situations:

After a table space restore operation, the table space is always in rollforward pending state, and it must be rolled forward. Invoke the ROLLFORWARD DATABASE command to apply the logs against the table spaces either to a point in time or to the end of the logs.
If one or more table spaces are in rollforward pending state after crash recovery, first correct the table space problem. In some cases, correcting the table space problem does not involve a restore database operation. For example, a power loss could leave the table space in rollforward pending state. A restore database operation is not required in this case. Once the problem with the table space is corrected, you can use the ROLLFORWARD DATABASE command to apply the logs against the table spaces to the end of the logs. If the problem is corrected before crash recovery, crash recovery may be sufficient to take the database to a consistent, usable state.

NOTE

If the table space in error contains the system catalog tables, you will not be able to start the database. You must restore the SYSCATSPACE table space, then perform rollforward recovery to the end of the logs.

In a partitioned database environment:

If you are rolling a table space forward to a point in time, you do not have to supply the list of database partitions on which the table space resides. DB2 submits the rollforward request to all partitions. This means the table space must be restored on all database partitions on which the table space resides.
If you are rolling a table space forward to the end of the logs, you must supply the list of database partitions if you do not want to roll the table space forward on all partitions. If you want to roll all table spaces (on all partitions) that are in rollforward pending state forward to the end of the logs, you do not have to supply the list of database partitions. By default, the database rollforward request is sent to all partitions.
You must restore catalog partition first.

Incremental Backup and Recovery

As the size of databases, and particularly data warehouses, continues to expand into the tens and hundreds of terabytes time and hardware resources required to back up and recover these databases are also growing substantially.

Full database and table space backups are not always the best approach when dealing with large databases, because the storage requirements for multiple copies of such databases are enormous. When only a small percentage of the data in a warehouse changes, it should not be necessary to back up the entire database or table space.

An incremental backup is a backup image that contains only pages that have been updated since the previous backup was taken. In addition to updated data and index pages, each incremental backup image also contains all of the initial database metadata (such as database configuration, table space definitions, database history, and so on) that is normally stored in full backup images.

Two types of incremental backup are supported:

Incremental? An incremental backup image is a copy of all database data that has changed since the most recent, successful, full backup operation. This is also known as a cumulative backup image, because a series of incremental backups taken over time will each have the contents of the previous incremental backup image. The predecessor of an incremental backup image is always the most recent successful full backup of the same object.
Delta? A delta, or incremental delta, backup image is a copy of all database data that has changed since the last successful backup (full, incremental, or delta) of the database or table space in question.

The key difference between incremental and delta backup images is their behavior when successive backups are taken of an object that is continually changing over time. Each successive incremental image contains the entire contents of the previous incremental image, plus any data that has changed or is new since the previous full backup was produced. Delta backup images contain only the pages that have changed since the previous image of any type was produced.

Combinations of database and table space incremental backups are permitted in both online and offline modes of operation. Be careful when planning your backup strategy, because combining database and table space incremental backups implies that the predecessor of a database backup (or a table space backup of multiple table spaces) is not necessarily a single image but could be a unique set of previous database and table space backups taken at different times.

To rebuild the database or the table space to a consistent state, the recovery process must begin with a consistent image of the entire object (database or table space) to be restored and must then apply each of the appropriate incremental backup images in the order described below.

To enable the tracking of database updates, DB2 supports a database configuration parameter, TRACKMOD, which can have one of two accepted values:

Set TRACKMOD to NO? incremental backup is not permitted with this configuration. Database page updates are not tracked or recorded in any way. This is the default value.
Set TRACKMOD to YES? incremental backup is permitted with this configuration. When update tracking is enabled, the change becomes effective at the first successful connection to the database. Before an incremental backup can be taken on a particular table space, a full backup of that table space is necessary.

NOTE

The tracking of updates to the database can have an impact on the runtime performance of transactions that update or insert data.

For SMS and DMS table spaces, the granularity of this tracking is at the table space level. In table space level tracking, a flag for each table space indicates whether there are pages in that table space that need to be backed up.

If no pages in a table space need to be backed up, the backup operation can skip that table space altogether.

Restoring from Incremental Backup Images

A restore operation from incremental backup images always consists of the following steps:

Identifying the incremental target image or the last image to be restored and request an incremental restore operation from the DB2 restore utility. The incremental target image is specified using the TAKEN AT parameter in the RESTORE DATABASE command.
Restoring the most recent full database or table space image to establish a baseline against which each of the subsequent incremental backup images can be applied.
Restoring each of the required full or table space incremental backup images, in the order in which they were produced, on top of the baseline image restored in Step 2.
Repeating Step 3 until the target image from Step 1 is read a second time. The target image is accessed twice during a complete incremental restore operation to ensure that the database is initially configured with the correct history, database configuration, and table space definitions for the database that will be created during the restore operation.

There are two ways to restore incremental backup images.

For a manual incremental restore, the RESTORE command must be issued once for each backup image that needs to be restored (as outlined in the steps above).
For an automatic incremental restore, the RESTORE command is issued only once, specifying the target image to be used. DB2 then uses the database history to determine the remaining required backup images and restores them.

Manual Incremental Restore Example

To restore a set of incremental backup images, using manual incremental restore, specify the target image using the TAKEN AT timestamp option of the RESTORE DATABASE command and follow the steps outlined above.

For example:

db2 restore db sample incremental taken at 20020829102021

db2 restore database sample incremental taken at <timestamp_1>
where <timestamp_1> points to the last incremental backup image (the target image) to be restored
db2 restore database sample incremental taken at <timestamp_2>
where <timestamp_2> points to the initial full database (or table space) image
db2 restore database sample incremental taken at <timestamp_X>
where <timestamp_X> points to each incremental backup image in creation sequence
Repeat Step 3, restoring each incremental backup image up to and including image <timestamp_1>

If you are using manual incremental restore for a database restore operation, and table space backup images have been produced, the table space images must be restored in the chronological order of their backup timestamps.

If you want to use manual incremental restore, the db2ckrst utility can be used to query the database history and generate a list of backup image timestamps needed for an incremental restore. A simplified restore syntax for a manual incremental restore is also generated. It is recommended that you keep a complete record of backups and use this utility only as a guide.

Automatic Incremental Restore Example

To restore a set of incremental backup images using automatic incremental restore, specify the TAKEN AT timestamp option on the RESTORE DATABASE command. Use the timestamp for the last image that you want to restore.

For example:

db2 restore db sample incremental automatic taken at 20020829102037

This will result in the restore utility performing each of the steps described at the beginning of this section, automatically and in sequence. During the initial phase of processing, the backup image with timestamp 20020829102037 is read, and the restore utility verifies that the database, its history, and the table space definitions exist and are valid.
During the second phase of processing, the database history is queried to build a chain of backup images required to perform the requested restore operation. If, for some reason, this is not possible, and DB2 is unable to build a complete chain of required images, the restore operation terminates, and an error message is returned. In this case, an automatic incremental restore will not be possible, and you will have to issue the RESTORE DATABASE command with the INCREMENTAL ABORT option. This will clean up any remaining resources so that you can proceed with a manual incremental restore.
During the third phase of processing, DB2 will restore each of the remaining backup images in the generated chain. If an error occurs during this phase, you will have to issue the RESTORE DATABASE command with the INCREMENTAL ABORT option to clean up any remaining resources. You will then have to determine whether the error can be resolved before you reissue the RESTORE command or attempt the manual incremental restore again.

NOTE

It is highly recommended that you not use the FORCE option of the PRUNE HISTORY command. The default operation of the PRUNE HISTORY command prevents you from deleting history entries that may be required for recovery from the most recent, full database backup image; but with the FORCE option, it is possible to delete entries that are required for an automatic restore operation.

Limitations to Automatic Incremental Restore

Scenario #1: Automatic incremental restore is unable to proceed

When a table space name has been changed since the backup operation you want to restore from the backup images, and you use the new name when you issue a table space level restore operation, the required chain of backup images from the database history will not be generated correctly with automatic incremental restore, and an error will occur (SQL2571N).

Example restore procedure:

Step #1: db2 backup db sample
Step #2: db2 backup db sample incremental
Step #3: db2 rename tablespace userspace1 to ts1
Step #4: db2 restore db sample tablespace (ts1) incremental automatic taken at <step #2 timestamp>

Limitation:

SQL2571N Automatic incremental restore is unable to proceed. Reason code: "3".

Solution:

Check restore order sequence.
Use manual incremental restore.

The following is sample output from automatic incremental restore:

[View full width]
db2 update db cfg for sample using trackmod

DB20000I  The UPDATE DATABASE CONFIGURATION command completed successfully.

db2 backup db sample to /data/dbbackup

Backup successful. The timestamp for this backup image is : 20020829090733
 This is the first timestamp

db2 list history backup all for sample

                         List History File for sample
Number of matching file entries = 1
Op   Obj  Timestamp+Sequence Type  Dev    Earliest Log   Current Log  Backup ID
---  ---  ------------------ ----  ---    ------------   -----------  ---------
          --
B    D    20020829090733001  F     D      S0000000.LOG   S0000000.LOG
------------------------------------------------------------------------
Contains 2 tablespace(s):
00001 SYSCATSPACE
00002 USERSPACE1
------------------------------------------------------------------------
    Comment: DB2 BACKUP SAMPLE OFFLINE
Start Time: 20020829090733
   End Time: 20020829090756
------------------------------------------------------------------------
00001 Location: /data/dbbackup