CustomerC is running an eCommerce site with the backed data store on DB2. At various times throughout the day, the database response time for application requests becomes very long.
The three parameters in the database manager configuration that stand out are highlighted in bold below. In this case, the application is mainly an online application; therefore, intra-partition parallelism should be disabled. Because there will be a large number of concurrent applications, the sort heap threshold (SHEAPTHRES) will need to be high; however, the existing value is sufficient.
Node type = Database Server with local clients Database manager configuration release level = 0x0a00 CPU speed (millisec/instruction) (CPUSPEED) = 4.000000e-05 Max number of concurrently active databases (NUMDB) = 8 Data Links support (DATALINKS) = NO Federated Database System Support (FEDERATED) = NO Transaction processor monitor name (TP_MON_NAME) = Default charge-back account (DFT_ACCOUNT_STR) = Java Development Kit installation path (JDK_PATH) = /wsdb/v81/bldsupp/AIX/jdk1.3.0 Diagnostic error capture level (DIAGLEVEL) = 3 Notify Level (NOTIFYLEVEL) = 3 Diagnostic data directory path (DIAGPATH) = Default database monitor switches Buffer pool (DFT_MON_BUFPOOL) = OFF Lock (DFT_MON_LOCK) = OFF Sort (DFT_MON_SORT) = OFF Statement (DFT_MON_STMT) = OFF Table (DFT_MON_TABLE) = OFF Timestamp (DFT_MON_TIMESTAMP) = ON Unit of work (DFT_MON_UOW) = OFF Monitor health of instance and databases (HEALTH_MON) = OFF SYSADM group name (SYSADM_GROUP) = BUILD SYSCTRL group name (SYSCTRL_GROUP) = SYSMAINT group name (SYSMAINT_GROUP) = Database manager authentication (AUTHENTICATION) = SERVER Cataloging allowed without authority (CATALOG_NOAUTH) = YES Trust all clients (TRUST_ALLCLNTS) = YES Trusted client authentication (TRUST_CLNTAUTH) = CLIENT Use SNA authentication (USE_SNA_AUTH) = NO Bypass federated authentication (FED_NOAUTH) = NO Default database path (DFTDBPATH) = /home/dsnow Database monitor heap size (4KB) (MON_HEAP_SZ) = 90 Java Virtual Machine heap size (4KB) (JAVA_HEAP_SZ) = 1024 Audit buffer size (4KB) (AUDIT_BUF_SZ) = 0 Size of instance shared memory (4KB) (INSTANCE_MEMORY) = AUTOMATIC Backup buffer default size (4KB) (BACKBUFSZ) = 1024 Restore buffer default size (4KB) (RESTBUFSZ) = 1024 Sort heap threshold (4KB) (SHEAPTHRES) = 80000 Directory cache support (DIR_CACHE) = YES Application support layer heap size (4KB) (ASLHEAPSZ) = 15 Max requester I/O block size (bytes) (RQRIOBLK) = 32767 Query heap size (4KB) (QUERY_HEAP_SZ) = 1000 DRDA services heap size (4KB) (DRDA_HEAP_SZ) = 128 Priority of agents (AGENTPRI) = SYSTEM Max number of existing agents (MAXAGENTS) = 200 Agent pool size (NUM_POOLAGENTS) = 100(calculated) Initial number of agents in pool (NUM_INITAGENTS) = 0 Max number of coordinating agents (MAX_COORDAGENTS) = MAXAGENTS Max no. of concurrent coordinating agents (MAXCAGENTS) = MAX_COORDAGENTS Max number of client connections (MAX_CONNECTIONS) = MAX_COORDAGENTS Keep fenced process (KEEPFENCED) = YES Number of pooled fenced processes (FENCED_POOL) = MAX_COORDAGENTS Initialize fenced process with JVM (INITFENCED_JVM) = NO Initial number of fenced processes (NUM_INITFENCED) = 0 Index re-creation time (INDEXREC) = RESTART Transaction manager database name (TM_DATABASE) = 1ST_CONN Transaction resync interval (sec) (RESYNC_INTERVAL) = 180 SPM name (SPM_NAME) = SPM log size (SPM_LOG_FILE_SZ) = 256 SPM resync agent limit (SPM_MAX_RESYNC) = 20 SPM log path (SPM_LOG_PATH) = TCP/IP Service name (SVCENAME) = Discovery mode (DISCOVER) = SEARCH Discovery communication protocols (DISCOVER_COMM) = Discover server instance (DISCOVER_INST) = ENABLE Maximum query degree of parallelism (MAX_QUERYDEGREE) = ANY Enable intra-partition parallelism (INTRA_PARALLEL) = NO No. of int. communication buffers(4KB)(FCM_NUM_BUFFERS) = 512
The three parameters in the database manager configuration that stand out are highlighted in bold below.
Since the applications will be performing a lot of inserts/updates/deletes, the log buffer size (LOGBUFSZ) will be important. However, the log buffer is sized adequately for this workload.
Because this is an online application, the sort list heap (SORTHEAP) should not be too large. When it is too large, the DB2 optimizer will tend to favor sorts over index scans, and with a large number of concurrent applications, sorting will be detrimental to performance.
In this case, the default buffer pool size (BUFFPAGE) is 1,000 4-KB pages. To determine the real size of the buffer pool, use the following statement:
select * from syscat.bufferpools
In this case, there is only one buffer pool (IBMDEFAULTBP), and it has a size of 675,000 4-KB pages.
This is adequately sized for this workload.
Database configuration release level = 0x0a00 Database release level = 0x0a00 Database territory = US Database code page = 819 Database code set = ISO8859-1 Database country/region code = 1 Dynamic SQL Query management (DYN_QUERY_MGMT) = DISABLE Discovery support for this database (DISCOVER_DB) = ENABLE Default query optimization class (DFT_QUERYOPT) = 5 Degree of parallelism (DFT_DEGREE) = 1 Continue upon arithmetic exceptions (DFT_SQLMATHWARN) = NO Default refresh age (DFT_REFRESH_AGE) = 0 Number of frequent values retained (NUM_FREQVALUES) = 10 Number of quantiles retained (NUM_QUANTILES) = 20 Backup pending = NO Database is consistent = NO Rollforward pending = NO Restore pending = NO Multi-page file allocation enabled = NO Log retain for recovery status = NO User exit for logging status = NO Data Links Token Expiry Interval (sec) (DL_EXPINT) = 60 Data Links Write Token Init Expiry Intvl(DL_WT_IEXPINT) = 60 Data Links Number of Copies (DL_NUM_COPIES) = 1 Data Links Time after Drop (days) (DL_TIME_DROP) = 1 Data Links Token in Uppercase (DL_UPPER) = NO Data Links Token Algorithm (DL_TOKEN) = MAC0 Database heap (4KB) (DBHEAP) = 1200 Size of database shared memory (4KB) (DATABASE_MEMORY) = AUTOMATIC Catalog cache size (4KB) (CATALOGCACHE_SZ) = (MAXAPPLS*4) Log buffer size (4KB) (LOGBUFSZ) = 256 Utilities heap size (4KB) (UTIL_HEAP_SZ) = 5000 Buffer pool size (pages) (BUFFPAGE) = 1000 Extended storage segments size (4KB) (ESTORE_SEG_SZ) = 16000 Number of extended storage segments (NUM_ESTORE_SEGS) = 0 Max storage for lock list (4KB) (LOCKLIST) = 100 Max size of appl. group mem set (4KB) (APPGROUP_MEM_SZ) = 20000 Percent of mem for appl. group heap (GROUPHEAP_RATIO) = 70 Max appl. control heap size (4KB) (APP_CTL_HEAP_SZ) = 128 Sort heap thres for shared sorts (4KB) (SHEAPTHRES_SHR) = (SHEAPTHRES) Sort list heap (4KB) (SORTHEAP) = 4096 SQL statement heap (4KB) (STMTHEAP) = 4096 Default application heap (4KB) (APPLHEAPSZ) = 256 Package cache size (4KB) (PCKCACHESZ) = (MAXAPPLS*8) Statistics heap size (4KB) (STAT_HEAP_SZ) = 4384 Interval for checking deadlock (ms) (DLCHKTIME) = 10000 Percent. of lock lists per application (MAXLOCKS) = 10 Lock timeout (sec) (LOCKTIMEOUT) = -1 Changed pages threshold (CHNGPGS_THRESH) = 60 Number of asynchronous page cleaners (NUM_IOCLEANERS) = 10 Number of I/O servers (NUM_IOSERVERS) = 40 Index sort flag (INDEXSORT) = YES Sequential detect flag (SEQDETECT) = YES Default prefetch size (pages) (DFT_PREFETCH_SZ) = 32 Track modified pages (TRACKMOD) = OFF Default number of containers = 1 Default tablespace extentsize (pages) (DFT_EXTENT_SZ) = 32 Max number of active applications (MAXAPPLS) = AUTOMATIC Average number of active applications (AVG_APPLS) = 1 Max DB files open per application (MAXFILOP) = 64 Log file size (4KB) (LOGFILSIZ) = 100000 Number of primary log files (LOGPRIMARY) = 32 Number of secondary log files (LOGSECOND) = 8 Changed path to log files (NEWLOGPATH) = Path to log files = /databases/sample/logs Overflow log path (OVERFLOWLOGPATH) = Mirror log path (MIRRORLOGPATH) = First active log file = Block log on disk full (BLK_LOG_DSK_FUL) = NO Percent of max active log space by transaction(MAX_LOG) = 0 Num. of active log files for 1 active UOW(NUM_LOG_SPAN) = 0 Group commit count (MINCOMMIT) = 1 Percent log file reclaimed before soft chckpt (SOFTMAX) = 100 Log retain for recovery enabled (LOGRETAIN) = OFF User exit for logging enabled (USEREXIT) = OFF Auto restart enabled (AUTORESTART) = ON Index re-creation time (INDEXREC) = SYSTEM (RESTART) Default number of loadrec sessions (DFT_LOADREC_SES) = 1 Number of database backups to retain (NUM_DB_BACKUPS) = 12 Recovery history retention (days) (REC_HIS_RETENTN) = 366 TSM management class (TSM_MGMTCLASS) = TSM node name (TSM_NODENAME) = TSM owner (TSM_OWNER) = TSM password (TSM_PASSWORD) =
This requires operating system tools such as vmstat, iostat, ps, and/or top to capture the memory, I/O, and process level information. It is important to capture the snapshots from these tools over a period of time, not taking just one snapshot, especially because the first line of the output of the vmstat and iostat tools contains average information since the server was started and is not really useful in analyzing a problem.
While the system is experiencing the slowdown, capture iostat output as follows:
iostat 5 > iostat.out
Next, examine the iostat.out file. Below is an entry from the iostat.out file.
Disks: % tm_act Kbps tps Kb_read Kb_wrtn hdisk0 2.6 13.2 7.3 19 21 hdisk1 8.3 99.2 37.3 3 208 hdisk2 8.3 99.2 37.3 3 208 hdisk3 8.3 99.2 37.3 3 208 hdisk4 8.3 99.2 37.3 3 208 hdisk5 8.3 99.2 37.3 3 208 hdisk6 8.3 99.2 37.3 3 208 hdisk7 8.3 99.2 37.3 3 208 cd0 0.0 0.0 0.0 0 0
The other entries were very similar to the above, so they are not shown here. In this case, the I/O activity is all writes and is spread across all of the disks on which the database resides, i.e., hdisk1 through hdisk7.
Next, run the top or ps aux command to determine which processes are currently executing that could be causing the I/O. The output of the ps aux command is below:
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND inst1 40970 10.0 2.0 40 23888 - A Oct 03 572:36 db2pclnr inst1 32776 10.0 2.0 40 23888 - A Oct 03 570:33 db2pclnr inst1 24582 10.0 2.0 40 23888 - A Oct 03 576:07 db2pclnr inst1 16388 10.0 2.0 40 23888 - A Oct 03 538:13 db2pclnr inst1 122974 10.0 0.0 320 324 - A Oct 03 548:32 db2pclnr inst1 294988 10.0 0.0 700 720 - A Oct 03 513:02 db2pclnr inst1 23109 10.0 2.0 52 23900 - A Oct 03 522:04 db2pclnr inst1 401514 10.0 2.0 184 24032 - A Oct 03 509:06 db2pclnr inst1 81940 10.0 2.0 116 23964 - A Oct 03 511:07 db2pclnr inst1 262242 10.0 0.0 936 940 - A Oct 03 511:01 db2pclnr
The ps aux command shows that there are 10 db2pclnr processes running, using all of the available CPU cycles. The db2pclnr process is the I/O cleaner or page cleaner process that asynchronously flushes dirty pages from the buffer pools to disk.
The ps aux command gives more information than just db2pclnr processes; however, only the relevant information is shown above.
To determine which of the triggers for the page cleaners actually caused the I/O cleaners to start, a database snapshot is required. To capture the database snapshot, use the following command:
get snapshot for database on sample > sample.db.snap
To take a snapshot and extract just the lines with the page cleaner trigger information, use the following command:
get snapshot for database on sample | grep ?i 'triggers'
The output from the above command looks like the following:
LSN Gap cleaner triggers = 0 Dirty page steal cleaner triggers = 0 Dirty page threshold cleaner triggers = 1
Therefore, the I/O cleaners were triggered by the changed pages threshold (CHNGPGS_THRESH) database configuration parameter. Looking back at the database configuration information above, the CHNGPGS_THRESH configuration parameter is set to 60, as shown below:
Changed pages threshold (CHNGPGS_THRESH) = 60
As discussed previously, there is only one buffer pool (IBMDEFAULTBP), and it has a size of 675,000 4-KB pages. Therefore, when the CHNGPGS_THRESH triggers the I/O cleaners, there are 675,000 x .6, or 405,000 dirty pages that need to be flushed to disk. When the I/O cleaners are triggered, all of the configured page cleaners (10, in this case) are started at the same time. Therefore, there will be 10 db2pclnr processes working to write 1.6 GB of data to disk.
To reduce the impact of the page cleaners when they are triggered, they need to be triggered more often but with fewer dirty pages to write to disk. This can be done by:
Reducing the changed pages threshold (CHNGPGS_THRESH)
Reducing the percent log file reclaimed before soft checkpoint (SOFTMAX)
Update the database configuration to reduce the CHNGPGS_THRESH and SOFTMAX, as follows:
update db cfg for sample using chngpgs_thresh 20 update db cfg for sample using softmax 50
This will cause the I/O cleaners to be triggered when half a log file is filled or when 20% of the buffer pool is occupied by dirty pages. This will trigger the I/O cleaners more frequently but with less work to perform when they are triggered.
After reducing the CHNGPGS_THRESH and SOFTMAX, the Web servers were reconnected to the database, and the performance slowdown did not reoccur.
The page cleaner trigger information from the snapshots was recaptured and looked like the following:
LSN Gap cleaner triggers = 0 Dirty page steal cleaner triggers = 0 Dirty page threshold cleaner triggers = 4