11g R2 RAC: NODE EVICTION DUE TO CSSDAGENT STOPPING

In addition to the ocssd.bin process, which is responsible for, among other things, the network
and disk heartbeats, Oracle Clusterware 11g Release 2 uses two new monitoring processes,
cssdagent and cssdmonitor. Both run with the highest real-time scheduler priority and are also
able to fence a server.
– Find out the PID of cssdagent
[root@host02 lastgasp]# ps -ef |grep cssd |grep -v grep
root      5085     1  0 09:45 ?        00:00:00 /u01/app/11.2.0/grid/bin/cssdmonitor
root      5106     1  0 09:45 ?        00:00:00 /u01/app/11.2.0/grid/bin/cssdagent
grid      5136     1  0 09:45 ?        00:00:02 /u01/app/11.2.0/grid/bin/ocssd.bin
– Find out the scheduling priority of cssdagent
[root@host02 lastgasp]# chrt -p 5106
pid 5106's current scheduling policy: SCHED_RR
pid 5106's current scheduling priority: 99
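To see the scheduling class and real-time priority of all three CSS processes at once, the ps output can be filtered instead of calling chrt per PID. This is a minimal sketch, assuming the Linux procps version of ps with its cls and rtprio output columns; css_rt_rows is a hypothetical helper name, not an Oracle tool.

```shell
# Keep only the CSS daemons from rows shaped like: PID CLS RTPRIO COMMAND
# (the shape printed by: ps -eo pid,cls,rtprio,args).
# css_rt_rows is a hypothetical helper name used for illustration.
css_rt_rows() {
    awk '$4 ~ /(cssdagent|cssdmonitor|ocssd\.bin)$/ { print $1, $2, $3, $4 }'
}

# On a cluster node (assumes procps ps):
#   ps -eo pid,cls,rtprio,args | css_rt_rows
```

A class of RR with a priority of 99 in the third column corresponds to what chrt reports above for cssdagent.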
Because cssdagent and cssdmonitor run with a scheduling priority of 99, they can reset a server when:
• there is a problem with the ocssd.bin process
• there is a problem with the OS scheduler
• the CPU is starved
• the OS is locked up in a driver or hardware (e.g. an I/O call)
Both processes are also associated with an undocumented timeout: if the execution of either
process stops for more than 28 seconds, the node is evicted.
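Whether a planned pause is long enough to trigger the reset can be checked with simple arithmetic. The sketch below assumes the limit is the 28100 ms value that appears in the reboot advisory message in the alert log; pause_exceeds_limit is a hypothetical helper name.

```shell
# Succeeds (exit status 0) when a pause of $1 seconds is longer than
# a limit of $2 milliseconds. Hypothetical helper for illustration.
pause_exceeds_limit() {
    pause_ms=$(( $1 * 1000 ))
    [ "$pause_ms" -gt "$2" ]
}

# A 40 second pause against the assumed 28100 ms limit:
if pause_exceeds_limit 40 28100; then
    echo "pause is longer than the limit"
fi
```

A pause of 20 seconds would stay under the same limit, so the function would return a non-zero status.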
– Let us stop the execution of cssdagent for 40 seconds
[root@rac1 ~]# kill -STOP 5106; sleep 40; kill -CONT 5106
– Check the alert log of host01
– Node 2 (host02) is rebooted
[grid@host01 host01]$ tailf /u01/app/11.2.0/grid/log/host01/alerthost01.log
[ohasd(12412)]CRS-8011:reboot advisory message from host: host02, component: ag100946, with time stamp: L-2012-11-09-10:21:28.040
[ohasd(12412)]CRS-8013:reboot advisory message text: Rebooting after limit 28100 exceeded; disk timeout 28100, network timeout 27880, last heartbeat from CSSD at epoch seconds 352436647.013, 34280 milliseconds ago based on invariant clock value of 294678040
– Node 2 is rebooted and the network connection with it breaks
2012-11-09 10:21:45.671
[cssd(14493)]CRS-1612:Network communication with node host02 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.330 seconds
2012-11-09 10:21:53.923
[cssd(14493)]CRS-1611:Network communication with node host02 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 7.310 seconds
2012-11-09 10:21:59.845
[cssd(14493)]CRS-1610:Network communication with node host02 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.300 seconds
2012-11-09 10:22:02.587
[cssd(14493)]CRS-1632:Node host02 is being removed from the cluster in cluster incarnation 247848834
2012-11-09 10:22:02.717
[cssd(14493)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 .
2012-11-09 10:22:02.748
[crsd(14820)]CRS-5504:Node down event reported for node 'host02'.
2012-11-09 10:22:10.086
[crsd(14820)]CRS-2773:Server 'host02' has been removed from pool 'Generic'.
2012-11-09 10:22:10.086
[crsd(14820)]CRS-2773:Server 'host02' has been removed from pool 'ora.orcl'.
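The countdown to removal is easier to follow when the missed-heartbeat warnings are reduced to their key numbers. This is a minimal sketch using sed; it assumes each CRS message sits on a single line in the alert log, and heartbeat_countdown is a hypothetical helper name.

```shell
# Read alert log lines on stdin and print, for each CRS-1612/1611/1610
# warning: "<message code> <percent of timeout missed> <seconds until removal>".
# Assumes one message per line; heartbeat_countdown is a hypothetical name.
heartbeat_countdown() {
    sed -n 's/.*\(CRS-161[012]\):.*missing for \([0-9]*\)% of timeout interval\. *Removal of this node from cluster in \([0-9.]*\) seconds.*/\1 \2 \3/p'
}

# Usage on the surviving node:
#   heartbeat_countdown < /u01/app/11.2.0/grid/log/host01/alerthost01.log
```

Lines that are not missed-heartbeat warnings, such as the timestamps and the CRS-1632 removal message, are dropped from the output.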