Monday 13 March 2017

Troubleshooting DB2 HADR-TSA hanging issues

Is your HADR-TSA configuration corrupted? Hung? Are none of the TSA commands executing? Take a deep breath, and start thinking from the TSA perspective rather than the DB2 perspective.

You need to make sure the TSA resources are alive and running well... hold on... has that also been holding up your DB2 instances from starting?

The one and only reason the DB2 instance is stuck is that the cluster manager is still set to TSA. Please verify it.

db2 get dbm cfg | grep -i clus
 Cluster manager                                         = TSA
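
If you want to script this check instead of reading the output by eye, here is a minimal sketch. The function name cluster_manager_from_cfg is my own invention, and it assumes the line looks like the sample above ("Cluster manager = <value>"). It reads the configuration text on standard input, so you can try it on saved output first.

```shell
# Hypothetical helper: reads "db2 get dbm cfg" output on stdin and prints
# the value of the "Cluster manager" line (empty if no value is set).
# Assumes the "name = value" layout shown in the sample output.
cluster_manager_from_cfg() {
    awk -F'=' 'tolower($1) ~ /cluster manager/ {
        gsub(/^[ \t]+|[ \t]+$/, "", $2)   # trim whitespace around the value
        print $2
        exit
    }'
}

# Typical use on the DB2 server, as the instance owner:
#   db2 get dbm cfg | cluster_manager_from_cfg
```

If it prints TSA, the instance is still tied to the cluster manager and the cleanup applies.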


How do you get rid of this? Normally you would run "db2haicu -disable" or "db2haicu -delete". But these commands are hanging for you too. So, let's do the cleanup by hand.

Step1 : Find out the TSA domain
 
         lsrpdomain

Step2 : Remove the domain

        rmrpdomain -f <domain-name>

Now check the dbm config for the cluster manager. If it's still set to TSA, you can run "db2haicu -delete" and this time it should work fine. After that you can safely start the DB2 instance, start the HADR processes, and hand the environment over to the business if they are waiting for it, because you can rebuild TSA online later anyway while DB2 and HADR are up and running.
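
The whole cleanup can be written down as a checklist. The sketch below only prints the commands in the order described above; it deliberately executes nothing, since removing a domain is not something to automate blindly. The function name print_cleanup_plan and the example domain name are hypothetical, and db2start is simply the usual command for "start the db2 instance".

```shell
# Hypothetical helper: prints the cleanup commands, in order, for the
# given peer domain name. Nothing is executed; review the list and run
# each command by hand.
print_cleanup_plan() {
    domain=$1
    if [ -z "$domain" ]; then
        echo "usage: print_cleanup_plan <domain-name>" >&2
        return 1
    fi
    printf '%s\n' \
        "lsrpdomain" \
        "rmrpdomain -f $domain" \
        "db2 get dbm cfg | grep -i clus" \
        "db2haicu -delete" \
        "db2start"
}

# Example with a made-up domain name:
#   print_cleanup_plan hadr_domain
```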

Now, let's look at the background processes / resources which are used by TSA.

Step1 : Check whether the RSCT daemons are active

lssrc -g rsct
Subsystem         Group            PID          Status
 ctcas            rsct             11403338     active
 ctrmc            rsct             20512964     active


You may want to recycle them using "stopsrc -g rsct" and "startsrc -g rsct".


Step2 : Check the resource managers' status

lssrc -g rsct_rm
Subsystem         Group            PID          Status
 IBM.HostRM       rsct_rm          11665650     active
 IBM.ServiceRM    rsct_rm          9437272      active
 IBM.DRM          rsct_rm          15990970     active
 IBM.ConfigRM     rsct_rm          20250724     active
 IBM.MgmtDomainRM rsct_rm          19792028     active
 IBM.StorageRM    rsct_rm          16646386     active
 IBM.TestRM       rsct_rm          19923128     active
 IBM.RecoveryRM   rsct_rm          16384102     active
 IBM.GblResRM     rsct_rm          11272240     active
 IBM.ERRM         rsct_rm                       inoperative
 IBM.LPRM         rsct_rm                       inoperative
 IBM.AuditRM      rsct_rm                       inoperative
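
When the list is long, it helps to filter out everything that is healthy. Here is a minimal sketch that does so, assuming the column layout shown above (subsystem name first, status last). The function name inactive_subsystems is hypothetical.

```shell
# Hypothetical helper: reads "lssrc -g <group>" output on stdin and prints
# the name of every subsystem whose Status column is not "active".
# Assumes the layout above: name in the first column, status in the last.
inactive_subsystems() {
    awk '$1 != "Subsystem" && NF >= 3 && $NF != "active" { print $1 }'
}

# Typical use:
#   lssrc -g rsct_rm | inactive_subsystems
```

For the sample output above, this would list IBM.ERRM, IBM.LPRM and IBM.AuditRM.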


When you have a problem, most of them will be inactive (lssrc reports them as "inoperative"). You can recycle these processes too, using "stopsrc -g rsct_rm" and "startsrc -g rsct_rm". Even then, not all of the processes will become active until you have created the domain. So my suggestion would be to start creating the domain using the "db2haicu" command line utility and keep checking the resource manager status using "lssrc -g rsct_rm".

Specifically, in my scenario, the IBM.StorageRM resource manager was always inactive, while TSA was failing at IBM.RecoveryRM even though that one showed as active. I learned something else here: even if IBM.RecoveryRM is active, that does not mean it's really working. You can verify its status like this:

lssrc -ls IBM.RecoveryRM | grep -i "In Config State"

    => True means complete
    => False means still initializing, so not ready to service commands like lssam
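
The same check can be turned into a yes/no test for scripts. This is a minimal sketch with a hypothetical function name; it assumes only that the "In Config State" line contains the word True or False, as described above, and does not depend on the exact spacing.

```shell
# Hypothetical helper: reads "lssrc -ls IBM.RecoveryRM" output on stdin.
# Succeeds (exit status 0) only if the "In Config State" line says True.
recoveryrm_in_config_state() {
    grep -i "In Config State" | grep -qi "true"
}

# Typical use: only call lssam once RecoveryRM is ready to service it.
#   lssrc -ls IBM.RecoveryRM | recoveryrm_in_config_state && lssam
```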

In my case it was always False, but IBM.RecoveryRM was only a victim of the IBM.StorageRM issue. We opened a PMR with the RSCT team at the IBM lab, and they pointed out that the storage component had been uninstalled during OS patching work. Again, we had not followed the best practices for OS patching in a TSA environment. Once we reinstalled it, our resources were back to work.

I think I have covered how to navigate troubleshooting a TSA hang. You will not always be able to resolve it by yourself, but these checks will definitely save you tons of time.

References :
http://www-01.ibm.com/support/docview.wss?uid=swg21385581
http://www-01.ibm.com/support/docview.wss?uid=swg21236233
http://www-01.ibm.com/support/docview.wss?uid=swg21293701

Note: This information is shared based on my knowledge and experience.