Database Configuration Parameter Tuning and Monitoring

The database configuration parameters have an effect on the database, as well as on all applications that access the database. The parameters listed in Table 8.3 have the most significant impact on the performance of the database and its applications.

Table 8.3. Database Configuration Parameter Descriptions
Parameter	Description
BUFFPAGE	Default Buffer Pool Size
LOGBUFSZ	Log Buffer Size
APPLHEAPSZ	Application Heap Size
SORTHEAP	Sorting
SHEAPTHRES_SHR	Sorting
LOCKLIST	Locking
MAXLOCKS	Locking
NUM_IOCLEANERS	Number of Asynchronous Page Cleaners
NUM_IOSERVERSW	Number of I/O Servers
MINCOMMIT	Number of Commits to Group
CATALOGCACHE_SZ	Catalog Cache Size
CHNGPGS_THRESH	Changed Pages Threshold
AVG_APPLS	Average Number of Concurrent Applications

Default Buffer Pool Size (BUFFPAGE)

Background

As discussed previously, a DB2 buffer pool is an area in memory into which database pages containing table rows or index keys are read and manipulated. The purpose of the buffer pool is to improve database system performance by caching frequently accessed pages in memory to eliminate the need for I/O to retrieve the pages because data can be accessed much faster from memory than from a disk. Therefore, the fewer times the database manager needs to read from or write to a disk, the better will be the performance. The configuration of the buffer pool (or buffer pools) is one of the most important tuning areas because it is here that most of the data manipulation takes place for applications connected to the database.

When a buffer pool is created or altered, its size can be explicitly specified or it can be set to use the default buffer pool size. To use the default buffer pool size, as specified by the BUFFPAGE database configuration parameter, the size of the buffer pool must be set to -1 when the buffer pool is created or altered.

If the database has a single large buffer pool, the default buffer pool size can be used. For a database with multiple buffer pools, the buffer pools will need to be sized independently, and attempting to use one size for all of the buffer pools will normally lead to suboptimal performance.

Configuring

To change the default buffer pool size and to increase the default buffer pool size, use the following commands:

update db cfg for <dbname> using BUFFPAGE bigger_value
alter bufferpool IBMDEFAULTBP size -1

Monitoring

Monitoring of the effectiveness of the database buffer pool or buffer pools is covered in detail in the Monitoring Buffer Pool Activity section of Chapter 2, Data Manipulation, so it will not be repeated here.

Automatic

Online

Yes

Log Buffer Size (LOGBUFSZ)

Background

Just as buffer pools help to improve database performance, the log buffer is used to improve logging performance by providing an area of memory for the DB2 engine to write log records to instead of writing directly to disk. DB2 uses a dual buffering technique so that, as it is asynchronously writing a log buffer to disk, DB2 can continue logging to the other log buffer. The log buffers are written to disk when any of the following conditions occur:

A transaction commits (or MINCOMMIT transactions commit).
The log buffer is full.
One second has elapsed since the last log buffer flush.

Configuring

Buffering of the log records results in more efficient log file I/O because the log records are written to disk less frequently, and multiple log records can be written with each write request. The default size of the log buffer is normally too small for most update workloads. When increasing the value of the log buffer, it is also important to consider the size of the database heap (DBHEAP) because the log buffer area is allocated from within the database heap.

Typically, the log buffer should be increased to a minimum value of 256 pages, as follows

update database cfg for <dbname> using LOGBUFSZ 256

to help improve overall performance, especially for online workloads.

Monitoring

The database logging activity and I/O effectiveness can be examined using a database snapshot. To capture a database snapshot and extract the information related to the logging activity, use the following:

  get snapshot for database on <database_name> | grep ?i "Log space"
or
  SELECT sec_log_used_top, tot_log_used_top, total_log_used, total_log_available
  FROM TABLE(SNAPSHOT_DATABASE('SAMPLE',-1 )) as SNAPSHOT_DATABASE

The output of the get snapshot command would look like the following:

Log space available to the database (Bytes)       = 4549916
Log space used by the database (Bytes)            = 550084
Maximum secondary log space used (Bytes)          = 0
Maximum total log space used (Bytes)              = 550084

The SQL statement above would produce the following output:

SEC_LOG_USED_TOP  TOT_LOG_USED_TOP  TOTAL_LOG_USED  TOTAL_LOG_AVAILABLE
----------------  ----------------  --------------  -------------------
                0            550084          550084              4549916

However, when examining the log space, it is important to ensure that there are enough primary log files configured to handle normal processing. Secondary logs should be used for exceptions only in cases where the amount of logging is expected to be heavier than normal, i.e., nightly batch processing. It is also important that there be sufficient log space available at all times to ensure that there is no danger of encountering a log full condition.

To determine the current amount of log space available (CLSA), use the following formula:

CLSA = Log space available to the database - Log space used by the database

or the following SQL statement:

SELECT total_log_available - total_log_used as "CurrentLogSpaceAvail"
  FROM TABLE(SNAPSHOT_DATABASE('SAMPLE',-1 ))
  as SNAPSHOT_DATABASE

To determine the minimum amount of log space available, use the following formula:

CLSA = Log space available to the database - Maximum total log space used

or the following SQL statement:

SELECT total_log_available - tot_log_used_top as "CurrentLogSpaceAvail"
  FROM TABLE(SNAPSHOT_DATABASE('SAMPLE',-1 ))
  as SNAPSHOT_DATABASE

To capture a database snapshot and extract the information related to the I/O effectiveness of the database logging, use the following:

[View full width]
  get snapshot for database on <database_name> | grep ?i "Log pages"
or
  SELECT log_reads, log_writes FROM TABLE(SNAPSHOT_DATABASE('SAMPLE',-1 )) as
 SNAPSHOT_DATABASE

We can use the database snapshot to determine whether the LOGBUFSZ parameter is optimal by looking at the lines shown in the following example:

Log pages read         = 0
Log pages written      = 12644

The ratio between the number of log pages read to the number of log pages written should be as small as possible. Ideally, there should not be any log pages read, and the number of log pages written will depend on the insert/update/delete workload on the database server. When there are a significant number of log pages read, it is an indication that the size of the log buffer should be increased.

Automatic

Online

Application Heap Size (APPLHEAPSZ)

Background

The application heap size (APPLHEAPSZ) defines the number of private memory pages available to be used by a DB2 UDB instance on behalf of each DB2 agent and/or subagent. This heap is used to store a copy of the currently executing sections of the access plan for the application associated with the DB2 agent or subagent.

NOTE

If the database is partitioned, the executing sections of the SQL statements for the agents and subagents will be stored in the application control heap (APP_CTL_HEAP_SZ), not in the application heap.

The application heap is allocated when an agent or subagent is initialized for an application. The amount of memory allocated will be only what is needed to process the request that has been given to the DB2 agent or subagent. When a DB2 agent or subagent requires additional application heap to be able to process larger SQL statements, DB2 will allocate additional memory, up to the maximum specified by the application heap size.

Configuring

For a database with small, relatively simple SQL, the default application heap size is normally adequate. However, for large, complex SQL, the default may not be large enough, and applications may run out of application heap and encounter errors. In DB2 UDB Version 7, the default value of the application heap was normally too low. However, in DB2 UDB Version 8, the default value has been increased to 256 pages.

If an application does encounter errors running out of application heap, the size of the application heap can be increased as follows:

update database cfg for <database_name> using applheapsz 1024

Monitoring

It is not possible to monitor the size of the application heap allocated within each DB2 agent or subagent directly. Therefore, in most cases, it is best to increase the application heap size, as above, and when an application receives an error indicating that there is not enough storage in the application heap to increase the value of this parameter. Because DB2 allocates only what is required, do not increase the size of the database heap by small amounts. If an error is encountered, double the size of the database heap and test to see whether the error has been eliminated.

Automatic

Online

Sorting (SORTHEAP, SHEAPTHRES_SHR)

Background

The sort heap (SORTHEAP) size specifies the maximum number of private memory pages to be used for private sorts or the maximum number of shared memory pages to be used for shared sorts. If the DB2 optimizer chooses to perform a private sort, the sort heap size affects agent private memory. If the DB2 optimizer chooses to perform a shared sort, the sort heap size affects the database-level shared memory. Each sort operation will have a separate sort heap that will be allocated as needed by DB2 where the underlying data will be sorted. Normally, DB2 will allocate a full sort heap. However, if directed by the optimizer, a smaller amount of memory than specified by the sort heap size may be allocated, using the information provided by the optimizer and the database statistics.

The sort heap threshold is an instance-level configuration parameter, as described previously. Private and shared sorts use memory from two different memory areas at the operating system level. The total size of the shared sort memory area is allocated and statically predetermined at the time of the first connection to (or activation of) a database based on the value of the shared sort heap threshold (SHEAPTHRES_SHR) parameter. The total size of the private sort memory area is allocated as needed and is not restricted in size. Because of the fundamental differences between shared and private memory, the sort heap threshold (SHEAPTHRES and SHEAPTHRES_SHR) parameters are applied differently for private and shared sorts:

For private sorts, the sort heap threshold parameter is an instance-wide soft limit on the total amount of memory that can be used by private sorts at any given time. When the total usage of private sort memory for a DB2 instance reaches this limit, the memory allocated for new private sort requests will be reduced by a factor of one half.
For shared sorts, the sort heap threshold parameter is a database-wide hard limit on the total amount of memory that can be used by shared sorts at any given time. When the total usage of shared sort memory for a DB2 instance reaches this limit, no further shared sort memory requests will be allowed until one of the currently executing shared sorts completes.

Configuring

The default value for SHEAPTHRES is quite low, especially for decision support databases. It is good practice to increase the value for SHEAPTHRES significantly because it can have a very dramatic effect on performance. However, increasing the SHEAPTHRES blindly can mask a real problem in the system or applications.

NOTE

By default, SHEAPTHRES_SHR is equal to SHEAPTHRES.

When determining an appropriate value for the SHEAPTHRES and SHEAPTHRES_SHR parameters, consider the following:

Hash joins and dynamic bitmaps used for index ANDing and star joins use sort heap memory. Increase the size of the SORTHEAP, SHEAPTHRES, and SHEAPTHRES_SHR when these techniques are used.
Increase the SHEAPTHRES when large private sorts are frequently required.
Increase the SHEAPTHRES_SHR when large shared sorts are frequently required.
The values of SHEAPTHRES and SHEAPTHRES_SHR need to be based on the value of the SORTHEAP, as well as on the average number of applications executing in the database.
- For example, if database monitoring shows that, on average, there are 12 applications concurrently executing in DB2, setting the SHEAPTHRES to 12?15 times the SORTHEAP would be a good starting point.
- Because shared sorts are less common than private sorts, setting the SHEAPTHRES_SHR to 5 times the SORTHEAP would be a good starting point.

To set the SHEAPTHRES configuration parameter, use the following command:

update dbm cfg using sheapthres 80000

To set the SHEAPTHRES_SHR configuration parameter, use the following command:

update db cfg for <database_name> using sheapthres_shr 30000

Once the database is operational, monitor the system to determine whether the sort heap threshold is being reached. However, sorts may also be overflowed without hitting the sort heap threshold if they require more memory than the configured sort heap size. These sort overflows may be eliminated by increasing the size of the sort heap, as follows:

update database cfg for <database_name> using sortheap 6000

Monitoring

Due to the impact of sorting on database performance, DB2 monitors a number of things in relation to sort activity.

Sorts that are started after the sort heap threshold has been reached may not get an entire SORTHEAP allocated and, therefore, have a much higher chance of overflowing. The number of sorts stated after the SHEAPTHRES has been reached is reported in the post threshold sorts database monitor element.

In addition to the number of piped sorts requested and accepted, the number of overflowed sorts is also available in the snapshot information. These are related to the sort heap, as well as the sort heap threshold, and they will be discussed in more detail when the sort heap is discussed later in this chapter.

To determine whether the sort heap threshold is being reached, take a database manager snapshot, using the following command:

get snapshot for database manager | grep ?i "Post threshold sorts"

This will capture a snapshot for the specified database and extract the monitor element concerned with the number of database files closed. This would produce output like the following:

Post threshold sorts                 =  16

This information can also be captured using an SQL statement, as follows:

SELECT post_threshold_sorts FROM TABLE(SNAPSHOT_DBM(-1 )) as SNAPSHOT_DBM

This would produce output like the following:

POST_THRESHOLD_SORTS
--------------------
                   16

If this value is excessive, the size of the sort heaps and/or sort heap threshold should be examined to determine whether they are sufficient. In addition, the applications should be examined to ensure that they are using appropriate indexes.

If the allocation of one more sort heap equals or exceeds the sort heap threshold, a piped sort cannot be performed, and any request for a piped sort will get rejected. A sort heap will be allocated to handle the sorting of the data, but it will be a reduced size.

The percentage of piped sorts requests that have been serviced by the database manager can be calculated by using the formula:

Percent Piped Sorts = (piped_sorts_accepted / piped_sorts_requested) * 100%

If this percentage of piped sorts is low, the sort performance could be improved by increasing the sort heap threshold. The number of piped sort requests that have been rejected by DB2 can be calculated using the following formula:

Piped Sorts Rejected = piped_sorts_requested - piped_sorts_accepted

A high number of rejected pipe sort requests may indicate that either the value of sort heap or sort heap threshold is too small to support the current workload.

Another indicator of sort performance is the percentage of post threshold sorts. DB2 allocates sort heaps at the beginning of sorts and at the beginning of any sort merge phase. If at any time during a sort a request to allocate a sort heap would exceed the SHEAPTHRES, the sort would be classified as a post threshold sort. The percentage of post threshold sorts is calculated using the following formula:

Percent Post Threshold Sorts = (post_threshold_sorts / sum of total_sorts ) * 100%

The total amount of private sort heap that is currently allocated can be monitored using the following:

  get snapshot for database manager | grep ?i "Private Sort heap allocated"
or
  SELECT sort_heap_allocated FROM TABLE(SNAPSHOT_DBM(-1 )) as SNAPSHOT_DBM

For an online database, it is also important to examine the number of sort operations per transaction (SPT), as well as the percentage of overflowed sorts (POS), as follows:

SPT = (Total Sorts) / (Commit statements attempted + Rollback statements attempted)

POS = (Sort overflows * 100 ) / (Total sorts)

Because online databases need to provide instantaneous results, the number of sort operations per transaction should be as low as possible. If the average number of sorts per transaction is three or more, this indicates that there are far too many sorts per transaction. If the percentage of overflowed sorts is greater than 3%, there may be serious and unexpected large sorts occurring. When this happens, increasing the SORTHEAP and/or SHEAPTHRES will likely only mask the underlying performance problem, not fix it. The best action to help correct these sorting issues is to capture dynamic SQL snapshots, look for poorly performing SQL statements, and add proper indexes as required.

See Appendix B for a sample script to parse the output of a dynamic SQL snapshot and output the information in ASCII-delimited format so that it can be read into a spreadsheet to be examined.

Automatic

Online

Yes

Locking (LOCKLIST, MAXLOCKS, LOCKTIMEOUT, DLCHKTIME)

Background

The lock list is an area of memory used to store all locks that are currently active within a database. The maximum storage for lock list database configuration parameter (LOCKLIST) indicates the amount of storage that is allocated to the lock list for each database. Each database has its own lock list that contains information about the locks held by all applications concurrently connected to the database. Locking is the mechanism that DB2 UDB uses to control concurrent access to data in the database by multiple applications. Within a DB2 database, locks can be obtained on both rows and tables but not on pages.

The amount of space required for a lock depends on whether DB2 is installed in 32-bit mode or 64-bit mode and on whether it is the first lock on a database object or a subsequent lock on a database object.

When DB2 is installed in 32-bit mode, each lock will require either 36 or 72 bytes of space in the database lock list, depending on whether other locks are held on the object:

72 bytes are required to hold a lock on an object that has no other locks held on it.
36 bytes are required to record a lock on an object that has an existing lock held on it.

When DB2 is installed in 64-bit mode, each lock will require either 56 or 112 bytes of space in the database lock list, depending on whether other locks are held on the object:

112 bytes are required to hold a lock on an object that has no other locks held on it.
56 bytes are required to record a lock on an object that has an existing lock held on it.

The maximum percentage of the lock list before escalation database configuration parameter (MAXLOCKS) parameter defines the percentage of the lock list that can be held by an application before DB2 UDB will perform lock escalation for the application. Lock escalation is the process of replacing a number of row locks with a single table lock, therefore reducing the number of locks in the database's lock list and making space available for new locks to be obtained. When the number of locks held by any one application reaches this percentage of the total lock list size, DB2 will perform lock escalation as follows:

Examine the lock list for the identified application to determine which object has the most row-level locks.
Request an equivalent table-level lock.
Release the row-level locks once the table lock is granted.

If, after replacing the row locks with a single table lock, the MAXLOCKS value is no longer exceeded for the application, lock escalation will stop. If not, lock escalation will continue on other database tables until the percentage of the lock list held by the application is below the MAXLOCKS configuration parameter.

NOTE

The MAXLOCKS configuration parameter multiplied by the MAXAPPLS configuration parameter cannot be less than 100.

Lock escalation will also occur if the lock list becomes full for any reason. When the lock list is full, DB2 will perform lock escalation as follows:

Examine the lock list to determine which application has the most locks.
Examine the lock list for the identified applications to determine which object has the most row-level locks.
Request an equivalent table-level lock.
Release the row-level locks once the table lock is granted.

If, after replacing the row locks with a single table lock, the lock list is still full (because other applications can still obtain locks while the escalation is occurring), lock escalation will continue following the same procedure as above. If the lock escalation causes space to become available in the lock list, lock escalation will stop.

Although the escalation process itself does not take much time, locking entire tables (versus locking individual rows) can cause a decrease in concurrency, and overall database performance may be impacted for subsequent accesses against the affected tables. Lock escalation can also cause deadlocks as the row locks are being converted to the table lock.

Deadlocks

When multiple applications are working with data in the same database, there are opportunities for a deadlock to occur between two or more applications, especially if lock escalations are occurring. A deadlock is created when one application is waiting for a lock that is held by another application, and that application is waiting for a lock held by the first application. Each of the waiting applications is locking data needed by another application, while also holding a lock held by the waiting application. Mutual waiting for other applications to release locks on the data leads to a deadlock condition. In this case, the applications could potentially wait forever until one of the other applications releases a lock on the held data.

Because applications do not voluntarily release locks on data that they need, DB2 UDB uses a deadlock detector process to detect and break deadlocks to allow application processing to continue. As its name suggests, the deadlock detector is started every period of time defined by the deadlock check time (DLCHKTIME), when it reads the database lock list and examines the information about the DB2 agents that are waiting on locks. If a deadlock condition is detected, the deadlock detector arbitrarily selects one of the applications involved in the deadlock as the victim. The victim application will be rolled back by DB2, and its locks will be released so that the other applications can continue. The victim application will receive an SQL 911 error code with a reason code of 2.

In Version 8, DB2 has enhanced the diagnostics available when a deadlock occurs, and the deadlock can now be debugged without the need to reproduce the condition with event monitors or snapshot monitoring turned on. When the diagnostic level is set to 4, a record will be written to the diagnostic log, indicating which application caused the deadlock condition, as well as the SQL statement it was executing. An example of the diagnostic record is below:

Request for lock "REC: (2, 13) RID 0000000B" in mode "..U" failed due to deadlock
Application caused the lock wait is "*LOCAL.DB2.00F888145716"
Statement: 7570 6461 7465 2074 3120 7365 7420 6331     update t1 set c1
3d32 3120 7768 6572 6520 6332 3d39                     =21 where c2=9

Based on the above diagnostic entry, we can determine that:

DB2 was attempting to acquire a record lock (REC) to execute this statement.
The lock mode requested was Update (U).
The application that caused the error had an application ID of "LOCAL.DB2.00F888145716".
The statement that caused the error was "update t1 set c1=21 where c2=9".

The LOCKTIMEOUT configuration parameter can be used to prevent applications from waiting indefinitely for locks. By default, the lock timeout is set to -1, which tells DB2 to wait indefinitely for locks. To time out the locks after a set period of time, the lock timeout parameter can be set to a specific value.

Configuring

The size of the lock list can be estimated as follows:

The minimum lock list size would occur if all (or almost all) of the locks exist on objects that already have locks. The maximum lock list size would occur if all locks are unique and exist on objects without any other locks. Therefore, the lower bound for the lock list would be calculated as follows:

[View full width]
MinLockList = (ALA * LS1 * MAXAPPLS) / 4096

Where:               ALA = Average # of locks per application
                     LS1 = Lock Size (either 36 or 56 bytes, depending on whether DB2 is
 running in 32-bit or 64-bit mode)

The upper bound for the lock list would be calculated as follows:

[View full width]
MaxLockList = (ALA * LS2 * MAXAPPLS) / 4096

Where:               ALA = Average # of locks per application
                     LS2 = Lock Size (either 72 or 112 bytes, depending on whether DB2 is
 running in 32-bit or 64-bit mode)

Then attempt to estimate the percentage of unique locks versus subsequent locks on the same object and use this percentage to choose a value between these extremes as the initial size of the lock list.

The default value for MAXLOCKS is 10%. This is normally too small and can cause unnecessary lock escalation. A value of 25?30% is normally a better balance between overall concurrency and minimizing lock escalations.

To configure the size of the lock list to be 500 pages and increase the MAXLOCKS to 30%, use the following commands:

update db cfg for <database_name> using LOCKLIST 500
update db cfg for <database_name> using MAXLOCKS 30

The default deadlock check time is 10 seconds (10,000 milliseconds), and the default lock timeout is indefinite (-1). To increase the deadlock check time to 30 seconds and enable locks to time out after waiting for 5 seconds, use the following commands:

update db cfg for <database_name> using DLCHKTIME 30000
update db cfg for <database_name> using LOCKTIMEOUT 5

Monitoring

Lock escalation will cause more table locks and fewer row-level locks, thus reducing concurrency within the database. In addition to reduced concurrency, lock escalation can also cause deadlocks to occur, which will result in transactions being rolled back. Due to the importance of locking on overall database performance, there is a database monitor specifically for locking. It is important to note that, because the lock snapshots capture timestamp information within the monitor elements, they are the most expensive (i.e., they have the largest overhead) of the database monitors.

As with other monitor information, the data can be captured using a database monitor snapshot or an SQL statement. However, it is important to note that the SQL snapshot acquires a lock while capturing the lock snapshot information, so in reality, the number of "real" locks on the system is one less than the number reported by the SQL statement. The following command will capture a database snapshot and return only the locking-related information for the database:

get snapshot for database on <database_name> | grep -i 'Lock'

The following SQL statement will provide the same information as the above command:

SELECT LOCKS_HELD, LOCK_WAITS, LOCK_WAIT_TIME,
LOCK_LIST_IN_USE, DEADLOCKS, LOCK_ESCALS, X_LOCK_ESCALS,
LOCKS_WAITING,  LOCK_TIMEOUTS, INT_DEADLOCK_ROLLBACKS
FROM TABLE(SNAPSHOT_DATABASE('SAMPLE',-1 )) as SNAPSHOT_DATABASE;

If the above command and SQL statement were executed at the same time, the output of the command would look like the following:

Locks held currently                      = 12
Lock waits                                = 0
Time database waited on locks (ms)        = 0
Lock list memory in use (Bytes)           = 2080
Deadlocks detected                        = 1
Lock escalations                          = 0
Exclusive lock escalations                = 0
Agents currently waiting on locks         = 0
Lock Timeouts                             = 0
Internal rollbacks due to deadlock        = 1

The output of the SQL statement would look like the following:

LOCKS_HELD          LOCK_WAITS       LOCK_WAIT_TIME
LOCK_LIST_IN_USE    DEADLOCKS        LOCK_ESCALS
LOCKS_WAITING       LOCK_TIMEOUTS    INT_DEADLOCK_ROLLBACKS    X_LOCK_ESCALS
----------------    -------------    ----------------------    -------------
13                  0                0                         1
2116                1                0
0                   0                0

1 record(s) selected.

Notice that there is an extra lock reported in the output of the SQL statement, as well as 36 extra bytes of lock list usage. This information was captured on a 32-bit DB2 instance where a lock will use 36 bytes of the lock list for a lock if there are already other locks on the same object.

Locks Held

In general, the number of locks held should be as low as possible because applications holding locks for long periods of time will hurt application concurrency. For a database with a large number of applications, there will normally be some locks held at any give time, especially for an online system with inserts, updates, or deletions.

When examining the number of locks held, it is important also to examine the number of currently connected applications. For a system with 100 connected users, 100 locks is not excessive. However, if there are only five applications connected to the database, each application is holding 20 locks on average, which can be excessive.

The average number of locks per applications (ALA) is calculated using:

ALA = Locks held currently / Applications connected currently

SELECT  (real (real(APPLS_CUR_CONS)) / (real(LOCK_WAIT_TIME)))
  as AverageLocksPerApp
  FROM TABLE(SNAPSHOT_DATABASE('SAMPLE',-1 )) as SNAPSHOT_DATABASE;

Lock wait time

The time that the database has waited on locks should be as low as possible because when DB2 is waiting for the lock, the applications are also waiting. A single lock wait may not seem too bad if we look only at the number of lock waits. However, if the lock waited for 5 minutes, it is very important to examine the reason for this and attempt to eliminate the source of the lock wait.

Lock list memory in use

If the amount of lock list in use is approaching the size of the lock list, this is an indication that lock escalation may be imminent. For normal operations, the lock list should not be more than 60% used. This leaves the remaining 40% of the lock list for exception processing. The percentage of the lock list used (PLU) is calculated using the following formula:

PLU = (Lock list memory in use / LOCKLIST) * 100%

Deadlocks detected

As discussed previously, when a deadlock is encountered, one of the applications involved in the deadlock will be rolled back and will need to redo the entire transaction. All deadlocks should be investigated and eliminated, if possible.

Lock escalations/exclusive lock escalations

Lock escalations cause decreased application concurrency and in many cases, the application may encounter lock waits and/or deadlocks while performing the lock escalation. These locking issues are even more likely with exclusive lock escalations because a number of exclusive row locks are converted to an exclusive table lock, and no other applications can access any row in the entire table.

Lock timeouts

The number of lock timeouts needs to be analyzed in conjunction with the setting for the LOCKTIMEOUT database configuration parameter. If the LOCKTIMEOUT parameter were configured to a low value to cause locks to time out quickly, the number of lock timeouts would normally be higher. If the LOCKTIMEOUT parameter is configured to a relatively high value, lock timeouts should not occur frequently, and when a lock timeout is encountered, it should be examined.

In Version 8, DB2 has enhanced the diagnostics available when a lock timeout occurs, and the lock timeout can now be debugged without the need to reproduce the condition with event monitors or snapshot monitoring turned on. When the diagnostic level is set to 4, a record will be written to the diagnostic log, indicating which application caused the lock timeout condition, as well as the SQL statement it was executing. An example of the diagnostic record is below:

Request for lock "TAB: (2, 13)" in mode ".IX" timed out
Application caused the lock wait is "*LOCAL.DB2.007340152709"
Statement: 7570 6461 7465 2074 3120 7365 7420 6331   update t1 set c1
3d63 312b 3531 3231 30                              =c1+51210

Based on the above diagnostic entry, we can determine that:

DB2 was attempting to acquire a table lock (TAB) to execute this statement.
The lock mode requested was Intent-eXclusive (IX).
The application that caused the error had an application ID of "LOCAL.DB2.007340152709".
The statement that caused the error was "update t1 set c1=c1+51210".

Automatic

Online

Yes

Number of Asynchronous Page Cleaners (NUM_IOCLEANERS)

Background

When a DB2 agent, acting on behalf of an application, needs to access table or index pages it will first look for the pages in the database buffer pool area. If the page cannot be found in the buffer pool area, it will be read from disk into the buffer pool. If the buffer pool is full, DB2 must select a "victim" page to be overwritten in the buffer pool. If the victim page is dirty (i.e., it was changed since it was read into the buffer pool and has not been written to disk), it must first be written to disk before it can be over written. During the write operation, the application must wait. To reduce the likelihood that the victim page will be dirty, DB2 uses page cleaners to asynchronously write dirty pages to disk before they are chosen as victims.

NOTE

An additional benefit of the page cleaners writing the dirty pages to disk is the reduction in the amount of work required in the event that there is a problem with DB2 and a database restart/recovery is required.

The number of asynchronous page cleaners database configuration parameter (NUM_IOCLEANERS) specifies the number of asynchronous page cleaner processes that can be run for the database. If this parameter is set to zero (0), there will be no page cleaners available for the database and as a result, the database agents will perform all of the page writes from the buffer pool to disk synchronously.

If the applications for a database consist primarily of transactions that update data, an increase in the number of cleaners will help improve the performance and will also reduce the recovery time from soft failures, such as power outages, because the contents of the database on disk will be more up to date at any given time.

How the Page Cleaners Are Triggered

The page cleaners can be triggered in three different ways.

Dirty page threshold

When a page in the buffer pool is changed, it is added to the buffer pool's dirty list. At this time, DB2 checks to see whether this addition to the dirty list exceeds the changed page threshold (aka. dirty page threshold) for the buffer pool. If the changed page threshold is exceeded, the page cleaners will be triggered.

The changed page threshold database configuration parameter (CHNGPGS_THRESH) represents the percentage of the buffer pool that can be dirty before the page cleaners are triggered.

LSN gap

When transactions are occurring against the database, they will be logged. To reduce the amount of work required in the event of a problem, DB2 will trigger the page cleaners as it writes to the log file(s).

The percentage of the log file reclaimed before the soft checkpoint database configuration parameter SOFTMAX represents the percentage of a log file that is written before the page cleaners are triggered.

Dirty page steals

When an agent requests a page that must be read from disk and DB2 chooses the victim page, if the page is dirty, the page must first be written to disk before it can be used to read the new page that the agent has requested. After a number of dirty victim pages have been selected, DB2 will automatically trigger the page cleaners to write the dirty pages to disk.

How the Page Cleaners Work

When the page cleaners are triggered, all of the page cleaners are triggered at the same time. They will each gather up to 400 pages from the dirty lists for the database buffer pools. The pages from the dirty list will then be written to disk, one page at a time, until the page cleaner has processed its assigned dirty pages. Once it has written all of the pages, it will check to see whether there are more pages to be written or whether there have been any new triggers. If so, it will gather a new list of pages to process; if not, it will wait for the next page cleaner trigger.

Configuring

Because all page cleaners are started whenever a page cleaner trigger is hit, having too many page cleaners can overwhelm the run queue on the server and cause a significant performance impact on the system. Therefore, as a rule of thumb, set the number of page cleaners equal to the number of CPUs in the database server.

For example, for a server with 16 CPUs, the following command will set the number of asynchronous page cleaners to 16:

update db cfg for <database_name> using NUM_IOCLEANERS 16

Monitoring

When examining the effectiveness of the asynchronous page cleaners, it is important to examine the ratio of asynchronous data and index page writes. The percentage of asynchronous data (PADW) and index page writes (PAIX) can be calculated using the following formulas:

PADW =  (Asynchronous pool data page writes / Buffer pool data writes) * 100%

PAIX =  (Asynchronous pool index page writes / Buffer pool index writes) * 100%

The number of page cleaners could potentially be reduced if:

PADW is close to 100%
PAIX is close to 100%

It is also important to understand which of the three I/O cleaner triggers is causing the page cleaners to be activated and to write the dirty pages from the buffer pools to disk. This information is available in the database snapshot information or through an SQL table function. To take a database snapshot and extract the entries that describe the page cleaner triggers in the database snapshot, use the following command:

get snapshot for database on <database_name> | grep -i 'cleaner triggers'

The SQL table function that will return the page cleaner triggers would look like the following:

SELECT DB_NAME,
  POOL_LSN_GAP_CLNS,
  POOL_DRTY_PG_STEAL_CLNS,
  POOL_DRTY_PG_THRSH_CLNS
  FROM TABLE(SNAPSHOT_DATABASE('SAMPLE',-1 ))
  as SNAPSHOT_DATABASE

The output of the get snapshot command would look like the following:

LSN Gap cleaner triggers                        = 142
Dirty page steal cleaner triggers               = 2
Dirty page threshold cleaner triggers           = 396

The output of the SQL function would look like the following:

DB_NAME  POOL_LSN_GAP_CLNS   POOL_DRTY_PG_STEAL_CLNS   POOL_DRTY_PG_THRSH_CLNS
-------  -----------------   -----------------------   -----------------------
SAMPLE                  142                         2                       396

1 record(s) selected.

In this case, the page cleaners were triggered by the "good" triggers (changed page threshold and/or LSN gap) well over 99% of the time. As was explained earlier, a dirty page steal trigger is done only after a number of pages have been synchronously written to disk and their associated clients forced to wait. If the number of "bad" page cleaner triggers (i.e., dirty page steal triggers) is more than a couple of percentage points of the total number of triggers, the values set for changed page threshold and soft max, as well as the number of page cleaners, should be examined.

The percentage of bad page cleaner triggers (PBPCT) is calculated as follows:

[View full width]
PBPCT = ((LSN gap cleaner triggers) / (Dirty page steal cleaner triggers + Dirty page
 threshold cleaner triggers + LSN gap cleaner triggers)) * 100%

Based on the snapshot information above, the PBPCT equals:

PBPCT = ((2) / (142 + 396 + 2)) * 100%

PBPCT = 0.37%

This ratio is very good and indicates that the system is primarily writing dirty pages to disk using the asynchronous page cleaners, and applications are not waiting for synchronous page writes. However, based on the following snapshot information for the page cleaner triggers, the PBPCT is much higher.

DB_NAME  POOL_LSN_GAP_CLNS   POOL_DRTY_PG_STEAL_CLNS   POOL_DRTY_PG_THRSH_CLNS
-------  -----------------   -----------------------   -----------------------
SAMPLE                   17                      2034                      1192

1 record(s) selected.

The PBPCT equals:

PBPCT = ((2034) / (17 + 1192 + 2034)) * 100%

PBPCT = 62.7%

In this case, the asynchronous page cleaners are rarely being triggered by the pool LSN gap trigger. This indicates that the database configuration parameter SOFTMAX may be set too high. To determine the value of the SOFTMAX configuration variable, use the command:

get db cfg for sample | grep ?i softmax

This returns the following:

Percent log file reclaimed before soft chckpt (SOFTMAX) = 100

In this case, the page cleaners are being triggered each time a log file is filled. Because this value is not abnormally high, next examine the log file size by using the command:

get db cfg for sample | grep ?i logfilsiz

This returns the following:

Log file size (4KB)                   (LOGFILSIZ) = 250000

The log file size for this database is 250,000 4-KB pages, or 1 GB. Therefore, the page cleaners are being triggered only after 1 GB of log information has been written. If the log file size cannot be reduced, the SOFTMAX configuration parameter can be reduced to cause the page cleaners to be triggered more frequently. To update the SOFTMAX configuration parameter to cause the page cleaners to trigger after 10% of a log has been written, use the following command:

update db cfg for sample using softmax 10

If the log files do not need to be this large and can be reduced, the log file size can be changed to 250 4K pages or 1 MB using the following command:

update db cfg for sample using logfilsiz 250

Asynchronous Pages per Write

When the page cleaners are triggered, it is important that they be writing to disk as efficiently as possible. Having the page cleaners triggered too infrequently and writing a large number of pages to disk will cause the system to slow down. Likewise, having the page cleaners triggered frequently but writing a small number of pages to disk is also inefficient.

The number of pages written per page cleaner trigger is not captured in any of the DB2 snapshots. However, the average number of pages written per asynchronous write request can be calculated, using the database base and buffer pool snapshot information. The average pages per asynchronous write (APPAW) can be calculated using the formula:

APPAW = ((Asynchronous pool data page writes + Asynchronous pool index
  page writes) / (Dirty page steal cleaner triggers +  Dirty page
  threshold cleaner triggers + LSN gap cleaner triggers))

Based on the following information from the database and buffer pool snapshots:

LSN Gap cleaner triggers                            = 142
Dirty page steal cleaner triggers                   = 2
Dirty page threshold cleaner triggers               = 396

Asynchronous pool data page writes                  = 167660
Asynchronous pool index page writes                 = 178944

the APPAW would be:

APPAW = (167660 + 178944) / (142 + 2 + 396)

APPAW = 641.9

In this case, the page cleaners wrote an average of 641.9 pages, or 2.5 MB, each time they were triggered. This value needs to be examined in the context of the size of the buffer pool that is being examined. For a 1-GB buffer pool, this is a small value, and perhaps the page cleaners are being triggered too aggressively. For a 100-MB buffer pool, this value is much more reasonable.

Automatic

Online

Number of I/O Servers (NUM_IOSERVERS)

Background

When a DB2 agent acting on behalf of an application needs to access table or index pages, it will first look for the pages in the database buffer pool area. If the page cannot be found in the buffer pool area, it will be read from disk into the buffer pool. I/O is very expensive and in this case, the agent cannot do anything but wait for the read request to finish before it can access the page. These page reads are typically done one page at a time and if the application also needs to access subsequent pages within the table or index, this is not an efficient method for reading the pages into the buffer pool.

In many situations, DB2 UDB can anticipate the pages that will be requested by an application and read them into the buffer pool before the agent actually attempts to access them. This is referred to as prefetching. Prefetching can improve the database performance because the pages will be found in the buffer pool when the agent accesses them, reducing or eliminating the time the application must wait for the page to be read from disk into the buffer pool. This is more relevant to DSS-type workloads that scan large indexes and tables than it is for OLTP-type workloads that involve less scanning and more random insert/update/delete activity.

Prefetching can be enabled by the DB2 optimizer when it is building the access plan for a statement and determines that it will be scanning a large portion of a table or index. It can also be enabled or triggered when DB2 is executing an access plan and detects that it has read a number of pages in sequence and will likely continue to do so. This is known as sequential detection and can be enabled or disabled using the database configuration parameter SEQDETECT.

Configuring

If the prefetch size is set as a multiple of the table space extent size (i.e., prefetch size = extent size x number of containers) for all of the table spaces in the database and all of the table spaces are being scanned at the same time, the number of prefetchers should be equal to the number of disks belonging to the database. However, if one or more of the table spaces has been set up using more aggressive prefetching (i.e., the prefetch size is a multiple of this value) and/or some of the table spaces are not being scanned at the same time as the others, the calculation becomes more complicated. To determine the number of prefetchers required in this case:

Determine the table spaces that will potentially be scanned at the same time.
For each of these table spaces, determine the number of prefetchers required to service a scan of it (based on the formulas above).
Sum these values to determine the total number of prefetchers required.

The number of prefetchers for the database can be set using the following command:

update db cfg for <database_name> using NUM_IOSERVERS 64

Monitoring

An important aspect of the prefetching performance that can be analyzed using the snapshot information is the amount of synchronous versus asynchronous I/O. The percentage of asynchronous read requests (or asynchronous read ratio, ARR) is calculated using the following formula:

ARR = ((Asynchronous data reads + Asynchronous index reads) / (Data
  logical reads + Index logical reads)) * 100%

The ARR can also be calculated by using the SQL table function as follows:

[View full width]
select BP_NAME,
  (INT(((FLOAT(pool_Async_data_Reads + pool_async_index_Reads)) / (FLOAT
(Pool_Index_L_Reads + Pool_data_L_Reads))) * 100))
  AS Asynch_Read_Ratio
  FROM TABLE(SNAPSHOT_BP('SAMPLE',-1 ))
  as SNAPSHOT_BP;

For the following buffer pool snapshot:

                            Bufferpool Snapshot

Bufferpool name                           = IBMDEFAULTBP
Database name                             = SAMPLE
Database path                             = /v1/db2/NODE0000/SQL00001/
Input database alias                      = SAMPLE
Buffer pool data logical reads            = 523956
Buffer pool data physical reads           = 33542
Buffer pool data writes                   = 288
Buffer pool index logical reads           = 257949
Buffer pool index physical reads          = 11323
Total buffer pool read time (ms)          = 12012
Total buffer pool write time (ms)         = 720
Asynchron