Monday, 13 March 2017

Troubleshooting DB2 HADR-TSA hang issues.

Is your HADR-TSA configuration corrupted or hung, with none of the TSA commands executing? Take a deep breath and start thinking from the TSA perspective rather than the DB2 perspective.

You need to make sure the TSA resources are alive and running well... hold on... has that also been preventing your DB2 instance from starting?

The one and only reason the db2 instance is stuck is that the cluster manager is still set to TSA. Please verify it:

db2 get dbm cfg | grep -i clus
 Cluster manager                                         = TSA

How do you get rid of this? You need to run "db2haicu -disable" or "db2haicu -delete". But these commands are hanging for you too, so let's do the cleanup manually.

Step1 : Find out the TSA domain

Step2 : Remove the domain

        rmrpdomain -f <domain-name>
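The two steps above can be sketched as follows; `HADR_DOM` is just a placeholder for whatever domain name lsrpdomain reports in your environment:

```shell
# Step 1: list the RSCT peer domains and note the domain name and its state
lsrpdomain

# Step 2: force-remove the (possibly hung) domain; -f skips waiting
# for the domain to respond
rmrpdomain -f HADR_DOM

# Verify that no domain is left behind
lsrpdomain
```

Once lsrpdomain comes back empty, "db2haicu -delete" should respond normally.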

Now check the dbm config to see whether the cluster manager is still set to TSA. You can run "db2haicu -delete" and it should work fine. Now you can start the db2 instance safely, start the HADR processes, and deliver the environment to the business if they are waiting for it, because you are going to rebuild TSA online anyway while DB2 and HADR are up and running.

Now, let's look at the background processes / resources which are used by TSA.

Step1 : Check whether the RSCT daemons are active

lssrc -g rsct
Subsystem         Group            PID          Status
 ctcas            rsct             11403338     active
 ctrmc            rsct             20512964     active

You may want to recycle them using "stopsrc -g rsct" and "startsrc -g rsct".

Step2 : Check the resource manager's status

lssrc -g rsct_rm
Subsystem         Group            PID          Status
 IBM.HostRM       rsct_rm          11665650     active
 IBM.ServiceRM    rsct_rm          9437272      active
 IBM.DRM          rsct_rm          15990970     active
 IBM.ConfigRM     rsct_rm          20250724     active
 IBM.MgmtDomainRM rsct_rm          19792028     active
 IBM.StorageRM    rsct_rm          16646386     active
 IBM.TestRM       rsct_rm          19923128     active
 IBM.RecoveryRM   rsct_rm          16384102     active
 IBM.GblResRM     rsct_rm          11272240     active
 IBM.ERRM         rsct_rm                       inoperative
 IBM.LPRM         rsct_rm                       inoperative
 IBM.AuditRM      rsct_rm                       inoperative

When you have a problem, most of them will be inactive; you can recycle these processes too, with "stopsrc -g rsct_rm" and "startsrc -g rsct_rm". Even then, not all of the processes will become active until you have created the domain. So my suggestion is to start creating the domain with the "db2haicu" command-line utility and keep checking the resource manager status using "lssrc -g rsct_rm".

In my particular scenario, the IBM.StorageRM resource was always inactive, yet TSA was failing at the IBM.RecoveryRM resource even though that one was active. Here I learned something else: even if IBM.RecoveryRM is active, that does not mean it is really working. You can verify its status like this:

lssrc -ls IBM.RecoveryRM | grep -i "In Config State"

    => True means complete
    => False means still initializing, so not ready to service commands like lssam

In my case it was always False, but IBM.RecoveryRM was just a victim of the IBM.StorageRM issue. We opened a PMR with the RSCT team in the IBM lab, and they pointed out that the storage components had been uninstalled during OS patching work; we had not followed the best practices for OS patching in a TSA environment. Once we reinstalled them, our resources were back to work.

I think I have touched on how to navigate TSA hang troubleshooting. You will not always be able to resolve the problem yourself, but this approach will definitely save you tons of time.

Note:- This information is shared based on my knowledge and the experience.

Wednesday, 16 July 2014

IBM BLU : Know its limitations and restrictions

IBM BLU is really cool but let's explore its limitations and restrictions.

You cannot implement BLU if your DB2 server is running with any of the following:
1. PureScale environment
2. Database with the DPF feature
3. Database with HADR enabled
4. Database without automatic storage
5. Database without a UNICODE codeset
6. Tablespace without reclaimable storage
7. Windows operating system
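A quick sanity check of some of these prerequisites before planning a BLU (column-organized) deployment; TESTDB is a placeholder database name, and these are ordinary db2 CLP commands:

```shell
# The database codeset must be UTF-8 (UNICODE) for BLU
db2 connect to TESTDB
db2 get db cfg for TESTDB | grep -i "code set"

# Automatic storage must be in use: storage groups should exist (10.1+)
db2 "select sgname from syscat.stogroups"

# The HADR role must be STANDARD, i.e. HADR not enabled
db2 get db cfg for TESTDB | grep -i "HADR database role"
```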

The following features are not supported with IBM BLU:
1. Table replication, i.e. Q replication, SQL replication
2. Federated tables
3. CREATE INDEX
4. CREATE TRIGGER
5. CREATE / DECLARE GTT (global temporary tables)
6. Isolation levels RR and RS for queries
7. XA transactions

Note:- This information is shared based on my knowledge and the experience.

Monday, 24 February 2014

DB2 - Bufferpool Memory Protection

Buffer pool memory protection is not much discussed among DB2 enthusiasts, yet it is a very useful feature that enhances DB2's resilience to memory corruption. So I thought of bringing it to everyone's attention.

Description: DB2 takes advantage of a specific hardware feature of POWER6 servers (starting with AIX 5.3 TL6) called storage protection keys. It is formally known as "storage keys" in the DB2 world, and it protects memory using hardware keys at the kernel thread level. Storage key protection reduces buffer pool memory corruption problems and limits errors that might crash the database. Buffer pool memory protection works at a per-agent level; any particular agent has access to buffer pool pages only when that agent needs access.

How does it work : When an agent requires access to the buffer pools to perform its work, it is temporarily granted access to the buffer pool memory. When the agent no longer requires access to the buffer pools, access is revoked. This behavior ensures that agents are only allowed to modify buffer pool contents when needed, reducing the likelihood of buffer pool corruptions. Any illegal access to buffer pool memory results in a segmentation error.

You need to set the registry variable "DB2_MEMORY_PROTECT" to YES to enable this feature in the DB2 instance.
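Enabling it is a one-liner, sketched below, assuming you can take an instance restart window:

```shell
# Enable buffer pool memory protection (instance-level registry variable)
db2set DB2_MEMORY_PROTECT=YES

# The change takes effect only after an instance recycle
db2stop
db2start

# Verify the setting
db2set -all | grep -i MEMORY_PROTECT
```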

Why is it needed: There is a class of memory overlay problems where data is accidentally written over a memory page that is already occupied by a different set of data. This results in memory corruption and may crash the instance.

Recommendation: In my opinion we should enable this feature by setting the registry variable "DB2_MEMORY_PROTECT"; the protection is available at the hardware level anyway, so why not extend it to DB2 as well? This feature has been supported since DB2 9.5 on POWER6 servers with AIX 5.3 TL6.

Note:- This information is shared based on my knowledge and the experience.

Sunday, 24 March 2013

Performance Tuning of classic reorg

Before we talk about tuning the reorg process, let's first understand the different phases of REORG, as tuning is done at the level of individual phases.

Reorg on a table takes place in the following phases, described one by one below: SORT, BUILD, REPLACE, and INDEX RECREATE.

You can use the below query to monitor the progress of reorg :
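The original query did not survive here, so as a sketch, progress can be watched either with db2pd or via the SNAPTAB_REORG administrative view (verify the column names against your DB2 level); SAMPLE is a placeholder database name:

```shell
# Option 1: db2pd shows the current reorg phase and progress counters
db2pd -db SAMPLE -reorgs

# Option 2: SQL against the snapshot administrative view
db2 connect to SAMPLE
db2 "select substr(tabname,1,20) as tabname, reorg_phase,
            reorg_current_counter, reorg_max_counter, reorg_status
     from sysibmadm.snaptab_reorg"
```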


SORT is the first phase, where data is sorted according to an index if you specified one in the reorg statement or if a clustering index is defined on the table; the reorg then takes place in that sorted order. During this phase the compression dictionary is also recreated, if you provide the RESETDICTIONARY option in the reorg statement.

Tune it: first of all, we can avoid this phase completely by not specifying an index, so the table is not reorganized in a specific index order. Even if you do specify one, add the INDEXSCAN option; this way the SORT phase can complete very quickly. If you do not need to rebuild the dictionary, do not specify the RESETDICTIONARY option, and you get rid of the SORT/dictionary-rebuild work entirely.

BUILD is the second phase, where the entire table is rebuilt and any free space is reclaimed. This is the most time-consuming phase, since it operates on the table's entire data and reorganizes it. As a common practice, we use a temporary tablespace as the place where the reorganized copy is built, so that we do not blow up the table's own tablespace.

Tune it: if you have enough space in your table's tablespace, do NOT specify a temporary tablespace. This improves the performance of the next phase, REPLACE.

REPLACE is the phase where the reorganized table object is copied back to the tablespace where the table was created, when a temporary tablespace was specified for the reorg. This takes a while to complete.

Tune it: if you do not specify a temporary tablespace, the data reorganization takes place within the table's own tablespace. In this case the newly organized copy of the table simply updates the pointer and the old table object is dropped, which saves the entire time of copying the data from the temporary tablespace back to the table's tablespace. Just ensure that the table lives in a DMS tablespace to get this benefit; an SMS tablespace sees no effect, as it always copies the object from one location to another.

INDEX RECREATE is the last phase, where all the indexes are recreated after the reorg. There is nothing much to tune here as such; just make sure that after the table reorg you do not need to reorg the indexes explicitly again.

Note:- This information is shared based on my knowledge and the experience.

Saturday, 16 February 2013

Few Tips on Q-rep Performance Tuning

We spend most of our time tuning the business application but often ignore the system applications integrated with DB2, such as MQ replication. Poor performance of Q replication not only affects its own throughput but also weighs badly on the whole physical database server. Below I highlight the basic tuning parameters of the capture and apply programs and some maintenance for the Q-rep control tables.

Tune the parameters for Capture program

1. Keeping latency low costs CPU on the database server. If ultra-low latency is not your concern, adjust some of the configuration parameters:

=> sleep_interval : This sets how frequently the Q-Capture program polls the DB2 log for changes. Its default value is 0.5 second (500 milliseconds), which is usually very aggressive; relaxing it a little helps CPU very much. A reasonable value could be 1 or 2 seconds (1000 or 2000 milliseconds), depending on your environment.

=> commit_interval : Adjust this to change how frequently captured transactions are committed to the queue manager. Its default value is also 0.5 second. A good starting point is to match it to sleep_interval, unless you have an optimal memory_limit configured.

=> memory_limit : This memory is used by the Q-Capture program to build transactions in memory. If the transactions cannot fit into this memory, they spill over to disk, which slows down replication overall. Usually this happens when a big transaction needs to be captured, but it can also happen due to delayed values configured for sleep_interval or commit_interval. The default value is 500 MB on LUW. Before adjusting it, you can always check whether transactions have spilled to disk, using the query below:
select max(max_trans_size) from qasn.ibmqrep_capmon where trans_spilled > 0

2. Other parameters can also be considered for tuning but they are not as important as above :

=> max_message_size : This parameter determines the size of each message sent through the send queue. If it is small, messages are sent in chunks, which hurts capture performance and also apply performance, since the changes are applied in chunks too. The following guidelines can be used to set this parameter:

  •  One typical large transaction should fit into one message, so setting max_message_size slightly higher than the maximum size of a typical transaction should be sufficient.
  •  For very large transactions that exceed the value of max_message_size that you derived above, ensure that you set max_message_size such that at least one row of the transaction fits into one message.
  •  The value of max_message_size must be less than or equal to the WebSphere® MQ parameter MAXMSGL, which sets the maximum message size for a queue.

=> monitor_interval : This parameter sets how frequently Q-Capture inserts a row into the ibmqrep_capmon and ibmqrep_capqmon tables, which are used for monitoring the success and failure of the Q-Capture program. Its default value is 30 seconds, which is usually more frequent than necessary; you can relax it to 2 to 5 minutes, depending on your environment.

=> prune_interval : This sets how frequently the Q-Capture program prunes data from the IBMQREP_CAPMON, IBMQREP_CAPQMON, IBMQREP_SIGNAL and IBMQREP_CAPTRACE tables. The default value is 5 minutes; reducing it further means the pruning queries run more often and use system resources more frequently.

=> trans_batch_sz : This specifies that the Q-Capture program groups transactions into batches before they are replicated. Grouping small transactions into a single MQ message can reduce CPU consumption on the server. The default value is 1, which means no grouping; a higher value yields some grouping and can reduce the CPU overhead incurred each time Q-Capture puts a message on the send queue.
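Most of the capture parameters above can be changed on a running Q-Capture with asnqccmd chgparms; a sketch, where SRCDB and the ASN capture schema are placeholders for your environment:

```shell
# Relax the log-polling and commit frequency (values in milliseconds)
# to trade a little latency for less CPU
asnqccmd capture_server=SRCDB capture_schema=ASN chgparms sleep_interval=2000
asnqccmd capture_server=SRCDB capture_schema=ASN chgparms commit_interval=2000
```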

Tune the parameters for Apply program

=> num_apply_agents : This determines how many agent threads the Q-Apply program uses to apply transactions to the target tables. It can be tweaked to increase throughput while keeping CPU usage low. The default value is 16 on LUW, which is usually good for medium-volume replication workloads.

=> memory_limit : This parameter determines the amount of memory that the Q-Apply program can use as a buffer to process transactions from a receive queue. When this limit is reached, messages are not retrieved from the receive queue until the apply agents have applied the buffered transactions to the replicated tables. If a single transaction is large enough to fill up memory_limit, its rows are processed in multiple sets within the same transaction. This parameter must be at least three times larger than the max_message_size of the send queue.

=> monitor_interval : This parameter sets how frequently Q-Apply inserts a row into the ibmqrep_applymon table, which is used for monitoring the success and failure of the Q-Apply program. Its default value is 30 seconds, which is usually more frequent than necessary; you can relax it to 2 to 5 minutes, depending on your environment.

=> prune_interval : This sets how frequently the Q-Apply program prunes data from the IBMQREP_APPLYMON and IBMQREP_APPLYTRACE tables. The default value is 5 minutes; reducing it further means the pruning queries run more often and use system resources more frequently.

Note:- The above does not cover pruning of the IBMQREP_DONEMSG table (an internal table), which the apply program uses to record all transaction and administrative messages it has received. The frequency of pruning this table is controlled by the "PRUNE_METHOD" and "PRUNE_BATCH_SIZE" parameters.

=> prune_batch_size : This signifies the number of rows deleted from the ibmqrep_donemsg table in one commit scope when prune_method is 2. It does not hurt to set it to 1000.

=> prune_method : This signifies the pruning method for deleting old data: the ibmqrep_donemsg table is queried to verify that all the messages have been applied, and the corresponding rows are then deleted from it. The default value is "1", which means the pruning activity fires after every single row is processed; the other possible value is "2", which means the data is deleted after processing the number of rows defined in prune_batch_size.

Maintenance on Q-rep database objects 

Q-rep has its own control tables, which are actively used, especially on the target database where the apply program runs. Their maintenance is important; otherwise performance suffers badly.

=> Assign them a dedicated bufferpool and monitor its hit ratio regularly.

=> Generally these control tables are defined with APPEND ON. So any deletion on these tables leaves empty pages behind, which are only reclaimed after a REORG; regular REORGs are therefore needed.

=> Run runstats on a periodic basis, as it affects the internal queries executed by Q-rep for processing rows and other administrative purposes.
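A maintenance sketch for the apply-side control tables; TGTDB and the ASN schema are placeholders, and the table list is the subset mentioned above:

```shell
db2 connect to TGTDB

# Reclaim empty pages left by pruning, then refresh statistics
for t in IBMQREP_DONEMSG IBMQREP_APPLYMON IBMQREP_APPLYTRACE; do
  db2 "reorg table ASN.$t"
  db2 "runstats on table ASN.$t with distribution and detailed indexes all"
done
```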

Note:- This information is shared based on my knowledge and the experience.

Saturday, 9 February 2013

DB2 Automatic Storage

DB2 automatic storage was initially introduced in Version 8 FixPak 9, but it has been enabled by default only since Version 9. It was intended to make storage management easier, both from the storage perspective and for administration tasks such as backup, restore, and creating additional tablespaces. It applies only to DMS tablespaces and cannot be reverted once enabled.

Going by the name "automatic storage": it enables storage autogrow, and we no longer need to worry about defining container paths when adding tablespaces.

How do we enable automatic storage in a database?
It can be enabled when you create the database initially. The default has been AUTOMATIC STORAGE YES since DB2 Version 9.x.

CREATE DATABASE TEST (creates the storage path on the home directory)
CREATE DATABASE TEST ON '/fs1' (creates the storage path on /fs1)
CREATE DATABASE TEST ON '/fs1','/fs2' (creates the storage path on two filesystems, /fs1 and /fs2)

If you are creating the database on a version below 9.x, always specify AUTOMATIC STORAGE YES, like below:
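A sketch of the pre-9.x form, with TEST and /fs1 as placeholder names:

```shell
# Explicitly request automatic storage on older levels (8.2.2 / 8 FixPak 9+)
db2 "CREATE DATABASE TEST AUTOMATIC STORAGE YES ON '/fs1'"
```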

Note:- There is a restriction on the Windows operating system when defining the storage path: the automatic storage path must be a drive letter only, unless the DB2_CREATE_DB_ON_PATH registry variable is set to YES.

For example :
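The lost example presumably looked like this (run from the DB2 command window on Windows; drive letter is a placeholder):

```shell
# Storage path as a drive letter only -- the default Windows restriction
db2 "CREATE DATABASE TEST ON D:"
```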

There is another way to create a Windows database with a directory as the storage path, but then you also need to specify the database path, like below:
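A sketch with hypothetical Windows paths (DB2_CREATE_DB_ON_PATH must allow directory storage paths):

```shell
# Directory storage path plus an explicit database path
db2 "CREATE DATABASE TEST ON 'D:\db2data' DBPATH ON 'C:'"
```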


In case you do not want automatic storage on DB2 9.x and above, you need to explicitly specify AUTOMATIC STORAGE NO.
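For example (a sketch, TEST being a placeholder name):

```shell
db2 "CREATE DATABASE TEST AUTOMATIC STORAGE NO"
```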


Would you need to add a storage path later in the future? You can simply add storage to the database by using the below command.
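The command in question, with /fs3 as a placeholder path (on DB2 10.1 and later the equivalent is done via ALTER STOGROUP):

```shell
# While connected to the database, add a new automatic storage path
db2 "ALTER DATABASE ADD STORAGE ON '/fs3'"
```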


Would you need to drop a storage path? You can drop a storage path, and the data is moved off to the other storage paths after a tablespace rebalance.

For example :
Drop the storage :

Rebalance the tablespaces :
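Both steps sketched below; /fs1 and USERSPACE1 are placeholders, and the DROP STORAGE clause requires DB2 9.7 or later:

```shell
# Step 1: drop the storage path (it is marked "Not In Use" until drained)
db2 "ALTER DATABASE DROP STORAGE ON '/fs1'"

# Step 2: rebalance each automatic storage tablespace off the dropped path
db2 "ALTER TABLESPACE USERSPACE1 REBALANCE"
```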

Database restore of an automatic-storage-enabled database: restore is way simpler now than with earlier non-automatic storage, especially redirected restore. Some standard examples are given below:
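Sketched with placeholder names and paths:

```shell
# Plain restore: the storage paths stored in the backup image are kept
db2 "RESTORE DATABASE TEST"

# "Redirected" restore onto new storage paths -- with automatic storage
# this is a single command; no SET TABLESPACE CONTAINERS steps are needed
db2 "RESTORE DATABASE TEST ON '/newfs1','/newfs2' DBPATH ON '/newfs1'"
```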


Important information about database path & storage path :
The database path (which is where the database manager stores various control files for the database)
If TO or DBPATH ON is specified, this indicates the database path.
If ON is used but DBPATH ON is not specified with it, the first path listed with ON is used as the database path (in addition to it being a storage path).
If none of TO, ON, or DBPATH ON is specified, the dftdbpath database manager configuration parameter determines the database path.

The storage paths (where the database manager creates automatic storage table space containers)
If ON is specified, all of the paths listed are considered storage paths, and these paths are used instead of the ones stored within the backup image.
If ON is not specified, no change is made to the storage paths (the storage paths stored within the backup image are maintained).

Note:- This information is shared based on my knowledge and the experience.

Saturday, 15 December 2012

DB2 Memory Architecture

DB2 is a very interesting database product that can grow as large as you want, literally from a couple of MBs to multiple terabytes. Capacity depends entirely on (a) disk, where the real data is stored, and (b) memory allocation, which is used to process those terabytes of data for end users.

Let's talk about how DB2 manages memory: DB2 memory is allocated as shared memory and private memory. Shared memory is shared by all of the db2 agents and gets allocated as soon as the instance/database is started, whereas private memory is used by db agents individually.

Let's first understand shared memory. It is allocated at 3 different levels:
1. Instance
2. Database
3. Application

Instance-level shared memory is allocated when the instance is started and de-allocated as soon as you stop it. It is controlled through instance-level configuration parameters.

Database-level shared memory is allocated when the database is either activated explicitly or receives its first connection, and de-allocated when it is either deactivated explicitly or the last connection is reset. It is controlled through database-level configuration parameters.

Application-level shared memory is not allocated in every situation. It gets allocated when database agents (db connections) need to coordinate with each other. This happens in specific scenarios: when you have a multi-partitioned database, or when you have enabled the intra_parallel configuration in a single-partition database. In these scenarios, a single db agent forks multiple subagents that serve the request in parallel but still share the application memory allocated to their parent agent.

Private memory is often referred to as agent private memory. It is allocated for a database connection when it is assigned to a database agent. Its usage is very dynamic and depends entirely on how long that specific agent is busy processing requests. This memory is used to perform sorts, acquire locks, build SQL access plans, and so on.

You can monitor the usage of these memory allocations using the DB2 memory tracker (db2mtrk) utility and keep re-adjusting them based on database size growth and utilization.
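A db2mtrk sketch covering the levels discussed above:

```shell
# One-shot snapshot of instance- and database-level memory pools
db2mtrk -i -d

# Verbose view, refreshed every 30 seconds, to watch growth over time
db2mtrk -i -d -v -r 30
```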

Note:- This information is shared based on my knowledge and the experience.