In addition to the ocssd.bin process, which is responsible, among other things, for the network
and disk heartbeats, Oracle Clusterware 11g Release 2 uses two new monitoring processes,
cssdagent and cssdmonitor, which run with the highest real-time scheduler priority and are also
able to fence a server.
– Find out the PID of cssdagent
[root@host02 lastgasp]# ps -ef |grep cssd |grep -v grep
root 5085 1 0 09:45 ? 00:00:00 /u01/app/11.2.0/grid/bin/cssdmonitor
root 5106 1 0 09:45 ? 00:00:00 /u01/app/11.2.0/grid/bin/cssdagent
grid 5136 1 0 09:45 ? 00:00:02 /u01/app/11.2.0/grid/bin/ocssd.bin
– Find out the scheduling priority of cssdagent
[root@host02 lastgasp]# chrt -p 5106
pid 5106's current scheduling policy: SCHED_RR
pid 5106's current scheduling priority: 99
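The two lookups above can be combined so that the priority of every cssd-related process is listed in one pass. The following is only a minimal sketch, assuming a Linux host where pgrep and chrt are installed; the helper names parse_priority and show_sched are made up for this illustration and are not part of Oracle Clusterware.

```shell
#!/bin/sh
# parse_priority (hypothetical helper): read `chrt -p` output on stdin
# and print only the numeric scheduling priority.
parse_priority() {
    sed -n 's/.*scheduling priority: \([0-9][0-9]*\)$/\1/p'
}

# show_sched (hypothetical helper): print "PID PRIORITY" for every process
# whose command line matches the given pattern.
show_sched() {
    # pgrep -f matches against the full command line, so the pattern "cssd"
    # finds cssdagent, cssdmonitor and ocssd.bin alike.
    for pid in $(pgrep -f "$1"); do
        printf '%s %s\n' "$pid" "$(chrt -p "$pid" | parse_priority)"
    done
}

# Usage (as root on a cluster node):
#   show_sched cssd
```

On the node shown above, cssdagent (PID 5106) would be expected to appear with priority 99, matching the chrt output.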
Since cssdagent and cssdmonitor run with a scheduling priority of 99, they can reset a server in the following cases:
• there is a problem with the ocssd.bin process
• there is a problem with the OS scheduler
• there is CPU starvation
• the OS is locked up in a driver or hardware (e.g. an I/O call)
Both of them are also associated with an undocumented timeout. If the execution of these
processes stops for more than 28 seconds, the node is evicted.
– Stop the execution of cssdagent for 40 seconds
[root@rac1 ~]# kill -STOP 5106; sleep 40; kill -CONT 5106
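The STOP/CONT sequence above can be wrapped in a small function so that the pause length is an explicit parameter. This is a minimal sketch rather than a recommended tool; pause_process is a hypothetical name, and it should only be tried on a test cluster, because a pause longer than the timeout is expected to reboot the node.

```shell
#!/bin/sh
# pause_process PID SECONDS  (hypothetical helper)
# Suspend PID with SIGSTOP, wait SECONDS, then resume it with SIGCONT.
pause_process() {
    pid=$1
    secs=$2
    # kill -0 sends no signal; it only checks that the process exists
    # and that we are allowed to signal it.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "no such process: $pid" >&2
        return 1
    fi
    kill -STOP "$pid" || return 1
    sleep "$secs"
    kill -CONT "$pid"
}

# Usage, equivalent to the one-liner above:
#   pause_process 5106 40
```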
– Check the alert log of host01
– Node 2 is rebooted
[grid@host01 host01]$ tailf /u01/app/11.2.0/grid/log/host01/alerthost01.log
[ohasd(12412)]CRS-8011:reboot advisory message from host: host02, component: ag100946, with time stamp: L-2012-11-09-
[ohasd(12412)]CRS-8013:reboot advisory message text: Rebooting after limit 28100 exceeded; disk timeout 28100, network
timeout 27880, last heartbeat from CSSD at epoch seconds 352436647.013, 34280 milliseconds ago based on invariant clock
value of 294678040
– Node 2 is rebooted and the network connection with it breaks
[cssd(14493)]CRS-1612:Network communication with node host02 (2) missing for 50% of timeout interval. Removal of this node
from cluster in 14.330 seconds
[cssd(14493)]CRS-1611:Network communication with node host02 (2) missing for 75% of timeout interval. Removal of this node
from cluster in 7.310 seconds
[cssd(14493)]CRS-1610:Network communication with node host02 (2) missing for 90% of timeout interval. Removal of this node
from cluster in 2.300 seconds
[cssd(14493)]CRS-1632:Node host02 is being removed from the cluster in cluster incarnation 247848834
[cssd(14493)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 .
[crsd(14820)]CRS-5504:Node down event reported for node 'host02'.
[crsd(14820)]CRS-2773:Server 'host02' has been removed from pool 'Generic'.
[crsd(14820)]CRS-2773:Server 'host02' has been removed from pool 'ora.orcl'.
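The CRS-8013 message in the alert log shows why the node was reset: the last heartbeat from CSSD was 34280 milliseconds old, while the limit was 28100 milliseconds. The sketch below extracts both numbers and prints the difference. It assumes each CRS-8013 message arrives on a single line (the excerpt above is wrapped for display), and the helper name exceeded_by is hypothetical.

```shell
#!/bin/sh
# exceeded_by (hypothetical helper): read alert-log lines on stdin and, for
# each CRS-8013 message, print the limit, the age of the last heartbeat and
# the amount by which the limit was exceeded (all in milliseconds).
exceeded_by() {
    awk '
        /CRS-8013/ {
            limit = ""; age = ""
            for (i = 1; i <= NF; i++) {
                # "... after limit 28100 exceeded; ..."
                if ($i == "limit") limit = $(i + 1)
                # "... 34280 milliseconds ago ..."
                if ($i == "milliseconds" && $(i + 1) == "ago") age = $(i - 1)
            }
            if (limit != "" && age != "")
                printf "limit=%d age=%d over=%d\n", limit, age, age - limit
        }'
}

# Usage:
#   exceeded_by < /u01/app/11.2.0/grid/log/host01/alerthost01.log
```

For the message shown above, this prints limit=28100 age=34280 over=6180, i.e. the heartbeat was about 6 seconds past the limit when the reboot advisory was issued.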