Example 5: Periodic Noticeable Slowdown of Database Response Time

Problem Description

CustomerC is running an eCommerce site with the backed data store on DB2. At various times throughout the day, the database response time for application requests becomes very long.

Problem Analysis and Resolution

Step 1: Examine the Database and Database Manager Configuration Parameters
Database manager configuration

The three parameters in the database manager configuration that stand out are highlighted in bold below. In this case, the application is mainly an online application; therefore, intra-partition parallelism should be disabled. Because there will be a large number of concurrent applications, the sort heap threshold (SHEAPTHRES) will need to be high; however, the existing value is sufficient.

Node type = Database Server with local clients

Database manager configuration release level            = 0x0a00
CPU speed (millisec/instruction)             (CPUSPEED) = 4.000000e-05

Max number of concurrently active databases     (NUMDB) = 8
Data Links support                          (DATALINKS) = NO
Federated Database System Support           (FEDERATED) = NO
Transaction processor monitor name        (TP_MON_NAME) =

Default charge-back account           (DFT_ACCOUNT_STR) =

Java Development Kit installation path       (JDK_PATH) = /wsdb/v81/bldsupp/AIX/jdk1.3.0

Diagnostic error capture level              (DIAGLEVEL) = 3
Notify Level                              (NOTIFYLEVEL) = 3
Diagnostic data directory path               (DIAGPATH) =

Default database monitor switches
Buffer pool                           (DFT_MON_BUFPOOL) = OFF
Lock                                     (DFT_MON_LOCK) = OFF
Sort                                     (DFT_MON_SORT) = OFF
Statement                                (DFT_MON_STMT) = OFF
Table                                   (DFT_MON_TABLE) = OFF
Timestamp                           (DFT_MON_TIMESTAMP) = ON
Unit of work                              (DFT_MON_UOW) = OFF
Monitor health of instance and databases   (HEALTH_MON) = OFF

SYSADM group name                        (SYSADM_GROUP) = BUILD
SYSCTRL group name                      (SYSCTRL_GROUP) =
SYSMAINT group name                    (SYSMAINT_GROUP) =

Database manager authentication        (AUTHENTICATION) = SERVER
Cataloging allowed without authority   (CATALOG_NOAUTH) = YES
Trust all clients                      (TRUST_ALLCLNTS) = YES
Trusted client authentication          (TRUST_CLNTAUTH) = CLIENT
Use SNA authentication                   (USE_SNA_AUTH) = NO
Bypass federated authentication            (FED_NOAUTH) = NO

Default database path                       (DFTDBPATH) = /home/dsnow

Database monitor heap size (4KB)          (MON_HEAP_SZ) = 90
Java Virtual Machine heap size (4KB)     (JAVA_HEAP_SZ) = 1024
Audit buffer size (4KB)                  (AUDIT_BUF_SZ) = 0
Size of instance shared memory (4KB)  (INSTANCE_MEMORY) = AUTOMATIC
Backup buffer default size (4KB)            (BACKBUFSZ) = 1024
Restore buffer default size (4KB)           (RESTBUFSZ) = 1024

Sort heap threshold (4KB)                  (SHEAPTHRES) = 80000

Directory cache support                     (DIR_CACHE) = YES

Application support layer heap size (4KB)   (ASLHEAPSZ) = 15
Max requester I/O block size (bytes)         (RQRIOBLK) = 32767
Query heap size (4KB)                   (QUERY_HEAP_SZ) = 1000
DRDA services heap size (4KB)            (DRDA_HEAP_SZ) = 128

Priority of agents                           (AGENTPRI) = SYSTEM
Max number of existing agents               (MAXAGENTS) = 200
Agent pool size                        (NUM_POOLAGENTS) = 100(calculated)
Initial number of agents in pool       (NUM_INITAGENTS) = 0
Max number of coordinating agents     (MAX_COORDAGENTS) = MAXAGENTS
Max no. of concurrent coordinating agents  (MAXCAGENTS) = MAX_COORDAGENTS
Max number of client connections      (MAX_CONNECTIONS) = MAX_COORDAGENTS

Keep fenced process                        (KEEPFENCED) = YES
Number of pooled fenced processes         (FENCED_POOL) = MAX_COORDAGENTS
Initialize fenced process with JVM     (INITFENCED_JVM) = NO
Initial number of fenced processes     (NUM_INITFENCED) = 0

 Index re-creation time                       (INDEXREC) = RESTART

Transaction manager database name         (TM_DATABASE) = 1ST_CONN
Transaction resync interval (sec)     (RESYNC_INTERVAL) = 180
SPM name                                     (SPM_NAME) =
SPM log size                          (SPM_LOG_FILE_SZ) = 256
SPM resync agent limit                 (SPM_MAX_RESYNC) = 20
SPM log path                             (SPM_LOG_PATH) =

TCP/IP Service name                          (SVCENAME) =
Discovery mode                               (DISCOVER) = SEARCH
Discovery communication protocols       (DISCOVER_COMM) =
Discover server instance                (DISCOVER_INST) = ENABLE

Maximum query degree of parallelism   (MAX_QUERYDEGREE) = ANY
Enable intra-partition parallelism     (INTRA_PARALLEL) = NO

No. of int. communication buffers(4KB)(FCM_NUM_BUFFERS) = 512
Database configuration for database sample

The three parameters in the database manager configuration that stand out are highlighted in bold below.

  • Since the applications will be performing a lot of inserts/updates/deletes, the log buffer size (LOGBUFSZ) will be important. However, the log buffer is sized adequately for this workload.

  • Because this is an online application, the sort list heap (SORTHEAP) should not be too large. When it is too large, the DB2 optimizer will tend to favor sorts over index scans, and with a large number of concurrent applications, sorting will be detrimental to performance.

  • In this case, the default buffer pool size (BUFFPAGE) is 1,000 4-KB pages. To determine the real size of the buffer pool, use the following statement:

    select * from syscat.bufferpools
    
    • In this case, there is only one buffer pool (IBMDEFAULTBP), and it has a size of 675,000 4-KB pages.

    • This is adequately sized for this workload.

Database configuration release level                    = 0x0a00
Database release level                                  = 0x0a00

Database territory                                      = US
Database code page                                      = 819
Database code set                                       = ISO8859-1
Database country/region code                            = 1

Dynamic SQL Query management           (DYN_QUERY_MGMT) = DISABLE

Discovery support for this database       (DISCOVER_DB) = ENABLE
Default query optimization class         (DFT_QUERYOPT) = 5
Degree of parallelism                      (DFT_DEGREE) = 1
Continue upon arithmetic exceptions   (DFT_SQLMATHWARN) = NO
Default refresh age                   (DFT_REFRESH_AGE) = 0
Number of frequent values retained     (NUM_FREQVALUES) = 10
Number of quantiles retained            (NUM_QUANTILES) = 20

Backup pending                                          = NO

Database is consistent                                  = NO
Rollforward pending                                     = NO
Restore pending                                         = NO

Multi-page file allocation enabled                      = NO

Log retain for recovery status                          = NO
User exit for logging status                            = NO

Data Links Token Expiry Interval (sec)      (DL_EXPINT) = 60
Data Links Write Token Init Expiry Intvl(DL_WT_IEXPINT) = 60
Data Links Number of Copies             (DL_NUM_COPIES) = 1
Data Links Time after Drop (days)        (DL_TIME_DROP) = 1
Data Links Token in Uppercase                (DL_UPPER) = NO
Data Links Token Algorithm                   (DL_TOKEN) = MAC0

Database heap (4KB)                            (DBHEAP) = 1200
Size of database shared memory (4KB)  (DATABASE_MEMORY) = AUTOMATIC
Catalog cache size (4KB)              (CATALOGCACHE_SZ) = (MAXAPPLS*4)
Log buffer size (4KB)                        (LOGBUFSZ) = 256
Utilities heap size (4KB)                (UTIL_HEAP_SZ) = 5000
Buffer pool size (pages)                     (BUFFPAGE) = 1000
Extended storage segments size (4KB)    (ESTORE_SEG_SZ) = 16000
Number of extended storage segments   (NUM_ESTORE_SEGS) = 0
Max storage for lock list (4KB)              (LOCKLIST) = 100

Max size of appl. group mem set (4KB) (APPGROUP_MEM_SZ) = 20000
Percent of mem for appl. group heap   (GROUPHEAP_RATIO) = 70
Max appl. control heap size (4KB)     (APP_CTL_HEAP_SZ) = 128

Sort heap thres for shared sorts (4KB) (SHEAPTHRES_SHR) = (SHEAPTHRES)
Sort list heap (4KB)                         (SORTHEAP) = 4096
SQL statement heap (4KB)                     (STMTHEAP) = 4096
Default application heap (4KB)             (APPLHEAPSZ) = 256
Package cache size (4KB)                   (PCKCACHESZ) = (MAXAPPLS*8)
Statistics heap size (4KB)               (STAT_HEAP_SZ) = 4384
Interval for checking deadlock (ms)         (DLCHKTIME) = 10000
Percent. of lock lists per application       (MAXLOCKS) = 10
Lock timeout (sec)                        (LOCKTIMEOUT) = -1

Changed pages threshold                (CHNGPGS_THRESH) = 60
Number of asynchronous page cleaners   (NUM_IOCLEANERS) = 10
Number of I/O servers                   (NUM_IOSERVERS) = 40
Index sort flag                             (INDEXSORT) = YES
Sequential detect flag                      (SEQDETECT) = YES
Default prefetch size (pages)         (DFT_PREFETCH_SZ) = 32

Track modified pages                         (TRACKMOD) = OFF

Default number of containers                            = 1
Default tablespace extentsize (pages)   (DFT_EXTENT_SZ) = 32

Max number of active applications            (MAXAPPLS) = AUTOMATIC
Average number of active applications       (AVG_APPLS) = 1
Max DB files open per application            (MAXFILOP) = 64

Log file size (4KB)                         (LOGFILSIZ) = 100000
Number of primary log files                (LOGPRIMARY) = 32
Number of secondary log files               (LOGSECOND) = 8
Changed path to log files                  (NEWLOGPATH) =
Path to log files                                       = /databases/sample/logs
Overflow log path                     (OVERFLOWLOGPATH) =
Mirror log path                         (MIRRORLOGPATH) =
First active log file                                   =
Block log on disk full                (BLK_LOG_DSK_FUL) = NO
Percent of max active log space by transaction(MAX_LOG) = 0
Num. of active log files for 1 active UOW(NUM_LOG_SPAN) = 0

Group commit count                          (MINCOMMIT) = 1
Percent log file reclaimed before soft chckpt (SOFTMAX) = 100
Log retain for recovery enabled             (LOGRETAIN) = OFF
User exit for logging enabled                (USEREXIT) = OFF

Auto restart enabled                      (AUTORESTART) = ON
Index re-creation time                       (INDEXREC) = SYSTEM (RESTART)
Default number of loadrec sessions    (DFT_LOADREC_SES) = 1
Number of database backups to retain   (NUM_DB_BACKUPS) = 12
Recovery history retention (days)     (REC_HIS_RETENTN) = 366
TSM management class                    (TSM_MGMTCLASS) =
TSM node name                            (TSM_NODENAME) =
TSM owner                                   (TSM_OWNER) =
TSM password                             (TSM_PASSWORD) =
Step 2: Check the System for I/O Bottlenecks, Excess Paging, or Other Processes/Applications Using Excess System Resources

This requires operating system tools such as vmstat, iostat, ps, and/or top to capture the memory, I/O, and process level information. It is important to capture the snapshots from these tools over a period of time, not taking just one snapshot, especially because the first line of the output of the vmstat and iostat tools contains average information since the server was started and is not really useful in analyzing a problem.

Step 3: Make Changes Based on Examination of Configuration Information

While the system is experiencing the slowdown, capture iostat output as follows:

iostat 5 > iostat.out

Next, examine the iostat.out file. Below is an entry from the iostat.out file.

Disks:        % tm_act     Kbps      tps    Kb_read   Kb_wrtn
hdisk0           2.6      13.2       7.3         19        21
hdisk1           8.3      99.2      37.3          3       208
hdisk2           8.3      99.2      37.3          3       208
hdisk3           8.3      99.2      37.3          3       208
hdisk4           8.3      99.2      37.3          3       208
hdisk5           8.3      99.2      37.3          3       208
hdisk6           8.3      99.2      37.3          3       208
hdisk7           8.3      99.2      37.3          3       208
cd0              0.0       0.0       0.0          0         0

The other entries were very similar to the above, so they are not shown here. In this case, the I/O activity is all writes and is spread across all of the disks on which the database resides, i.e., hdisk1 through hdisk7.

Next, run the top or ps aux command to determine which processes are currently executing that could be causing the I/O. The output of the ps aux command is below:

USER         PID %CPU %MEM   SZ  RSS    TTY STAT    STIME  TIME COMMAND
inst1      40970 10.0  2.0   40 23888      - A      Oct 03 572:36 db2pclnr
inst1      32776 10.0  2.0   40 23888      - A      Oct 03 570:33 db2pclnr
inst1      24582 10.0  2.0   40 23888      - A      Oct 03 576:07 db2pclnr
inst1      16388 10.0  2.0   40 23888      - A      Oct 03 538:13 db2pclnr
inst1     122974 10.0  0.0  320   324      - A      Oct 03 548:32 db2pclnr
inst1     294988 10.0  0.0  700   720      - A      Oct 03 513:02 db2pclnr
inst1      23109 10.0  2.0   52 23900      - A      Oct 03 522:04 db2pclnr
inst1     401514 10.0  2.0  184 24032      - A      Oct 03 509:06 db2pclnr
inst1      81940 10.0  2.0  116 23964      - A      Oct 03 511:07 db2pclnr
inst1     262242 10.0  0.0  936   940      - A      Oct 03 511:01 db2pclnr

The ps aux command shows that there are 10 db2pclnr processes running, using all of the available CPU cycles. The db2pclnr process is the I/O cleaner or page cleaner process that asynchronously flushes dirty pages from the buffer pools to disk.

NOTE

The ps aux command gives more information than just db2pclnr processes; however, only the relevant information is shown above.


Step 4: Determine Which I/O Cleaner Triggers Caused the Buffer Pool Flush

To determine which of the triggers for the page cleaners actually caused the I/O cleaners to start, a database snapshot is required. To capture the database snapshot, use the following command:

get snapshot for database on sample > sample.db.snap

To take a snapshot and extract just the lines with the page cleaner trigger information, use the following command:

get snapshot for database on sample | grep ?i 'triggers'

The output from the above command looks like the following:

LSN Gap cleaner triggers                 = 0
Dirty page steal cleaner triggers        = 0
Dirty page threshold cleaner triggers    = 1

Therefore, the I/O cleaners were triggered by the changed pages threshold (CHNGPGS_THRESH) database configuration parameter. Looking back at the database configuration information above, the CHNGPGS_THRESH configuration parameter is set to 60, as shown below:

Changed pages threshold                (CHNGPGS_THRESH) = 60

As discussed previously, there is only one buffer pool (IBMDEFAULTBP), and it has a size of 675,000 4-KB pages. Therefore, when the CHNGPGS_THRESH triggers the I/O cleaners, there are 675,000 x .6, or 405,000 dirty pages that need to be flushed to disk. When the I/O cleaners are triggered, all of the configured page cleaners (10, in this case) are started at the same time. Therefore, there will be 10 db2pclnr processes working to write 1.6 GB of data to disk.

Step 5: Reduce the Changed Pages Threshold Parameter Setting

To reduce the impact of the page cleaners when they are triggered, they need to be triggered more often but with fewer dirty pages to write to disk. This can be done by:

  • Reducing the changed pages threshold (CHNGPGS_THRESH)

  • Reducing the percent log file reclaimed before soft checkpoint (SOFTMAX)

Update the database configuration to reduce the CHNGPGS_THRESH and SOFTMAX, as follows:

update db cfg for sample using chngpgs_thresh 20
update db cfg for sample using softmax 50

This will cause the I/O cleaners to be triggered when half a log file is filled or when 20% of the buffer pool is occupied by dirty pages. This will trigger the I/O cleaners more frequently but with less work to perform when they are triggered.

Step 6: Retest

After reducing the CHNGPGS_THRESH and SOFTMAX, the Web servers were reconnected to the database, and the performance slowdown did not reoccur.

The page cleaner trigger information from the snapshots was recaptured and looked like the following:

LSN Gap cleaner triggers                    = 0
Dirty page steal cleaner triggers           = 0
Dirty page threshold cleaner triggers       = 4