11g R2 RAC: NODE EVICTION DUE TO MISSING NETWORK HEARTBEAT

In this post, I will demonstrate node eviction due to a missing network heartbeat, i.e. a node is evicted from the cluster if it can’t communicate with the other nodes in the cluster. To simulate this, I will stop the private network on one of the nodes and then examine the OCSSD and alert logs of the evicted node and of the surviving nodes.
Current scenario:
No. of nodes in the cluster  : 3
Names of the nodes      : host01, host02, host03
Name of the cluster database : orcl
I will stop the private network service on host03 so that it is evicted.
– Find out the name of the private network interface
[root@host03 ~]# oifcfg getif
eth0  192.9.201.0  global  public
eth1  10.0.0.0  global  cluster_interconnect
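If the interface name needs to be picked out by a script rather than by eye, the output above can be parsed. The following is a minimal sketch only: the function name find_interconnect is hypothetical, and it assumes the four whitespace-separated columns shown above (interface, subnet, scope, role).

```python
def find_interconnect(oifcfg_output):
    """Return the interfaces whose role includes cluster_interconnect.

    Assumes each line has four whitespace-separated columns:
    interface, subnet, scope, role (as in the output shown above).
    """
    names = []
    for line in oifcfg_output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or unexpected lines
        # Split on commas in case a role field lists more than one role.
        if "cluster_interconnect" in fields[3].split(","):
            names.append(fields[0])
    return names


sample = """eth0  192.9.201.0  global  public
eth1  10.0.0.0  global  cluster_interconnect"""

print(find_interconnect(sample))
```

For the output in this post, the sketch returns eth1, which is the interface stopped in the next step.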
– Stop the private network interface on host03 so that it can’t communicate with host01 and host02 and is evicted.
[root@host03 ~]# ifdown eth1
——————-
OCSSD log of host03
——————–
It can be seen that at 09:43:52 the CSSD process of host03 can’t communicate with host01 and host02.
Hence the voting disk timeout is set to the Short Disk Timeout (SDTO) = 27000 ms (27 secs)
2012-11-19 09:43:52.714: [    CSSD][843736976]clssnmPollingThread: node host01 (1) at 50% heartbeat fatal, removal in 14.880 seconds
2012-11-19 09:43:52.714: [    CSSD][843736976]clssnmPollingThread: node host01 (1) is impending reconfig, flag 132108, misstime 15120
2012-11-19 09:43:52.714: [    CSSD][843736976]clssnmPollingThread: node host02 (2) at 50% heartbeat fatal, removal in 14.640 seconds
2012-11-19 09:43:52.714: [    CSSD][843736976]clssnmPollingThread: node host02 (2) is impending reconfig, flag 132108, misstime 15360
2012-11-19 09:43:52.714: [    CSSD][843736976]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2012-11-19 09:43:52.927: [    CSSD][2833247120]clssnmSendingThread: sending status msg to all nodes
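A quick arithmetic check on the polling thread lines above: for each node, the "removal in" figure plus the "misstime" figure adds up to 30000 ms. That is consistent with a network heartbeat timeout (CSS misscount) of 30 seconds, although this is an inference from the numbers and not something the log states. The sketch below does the sum; the function name implied_timeout_ms and the regular expressions are my own, written against the line format shown here.

```python
import re

# Patterns written against the clssnmPollingThread lines shown above.
REMOVAL = re.compile(
    r"node (\S+) \(\d+\) at \d+% heartbeat fatal, removal in ([\d.]+) seconds"
)
MISSTIME = re.compile(
    r"node (\S+) \(\d+\) is impending reconfig, flag \d+, misstime (\d+)"
)


def implied_timeout_ms(log_text):
    """Return {node: time left until removal + misstime}, in milliseconds."""
    removal_ms = {}
    misstime_ms = {}
    for line in log_text.splitlines():
        m = REMOVAL.search(line)
        if m:
            removal_ms[m.group(1)] = round(float(m.group(2)) * 1000)
        m = MISSTIME.search(line)
        if m:
            misstime_ms[m.group(1)] = int(m.group(2))
    return {
        node: removal_ms[node] + misstime_ms[node]
        for node in removal_ms
        if node in misstime_ms
    }


sample = """\
2012-11-19 09:43:52.714: [    CSSD][843736976]clssnmPollingThread: node host01 (1) at 50% heartbeat fatal, removal in 14.880 seconds
2012-11-19 09:43:52.714: [    CSSD][843736976]clssnmPollingThread: node host01 (1) is impending reconfig, flag 132108, misstime 15120
2012-11-19 09:43:52.714: [    CSSD][843736976]clssnmPollingThread: node host02 (2) at 50% heartbeat fatal, removal in 14.640 seconds
2012-11-19 09:43:52.714: [    CSSD][843736976]clssnmPollingThread: node host02 (2) is impending reconfig, flag 132108, misstime 15360
"""

print(implied_timeout_ms(sample))
```

The same sum holds for the polling thread lines on host01 and host02 further down (14.500 s + 15500 ms and 14.950 s + 15050 ms).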
———————–
Alert log of host03
———————–
– At 09:43:52, the CSSD process of host03 identifies that it can’t communicate with CSSD on host01 and host02
[cssd(5124)]CRS-1612:Network communication with node host01 (1) missing for 50% of timeout interval.  Removal of this node from cluster in 14.880 seconds
2012-11-19 09:43:52.714
[cssd(5124)]CRS-1612:Network communication with node host02 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.640 seconds
2012-11-19 09:44:01.880
[cssd(5124)]CRS-1611:Network communication with node host01 (1) missing for 75% of timeout interval.  Removal of this node from cluster in 6.790 seconds
2012-11-19 09:44:01.880
[cssd(5124)]CRS-1611:Network communication with node host02 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 6.550 seconds
2012-11-19 09:44:06.536
[cssd(5124)]CRS-1610:Network communication with node host01 (1) missing for 90% of timeout interval.  Removal of this node from cluster in 2.780 seconds
2012-11-19 09:44:06.536
[cssd(5124)]CRS-1610:Network communication with node host02 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.540 seconds
2012-11-19 09:44:09.599
– At 09:44:09, the CSSD process of host03 announces that the node is going down to preserve cluster integrity (each timestamp line precedes its message)
[cssd(5124)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-19 09:44:16.697
[/u01/app/11.2.0/grid/bin/orarootagent.bin(5713)]CRS-5822:Agent ‘/u01/app/11.2.0/grid/bin/orarootagent_root’ disconnected from server. Details at (:CRSAGF00117:) in /u01/app/11.2.0/grid/log/host03/agent/crsd/orarootagent_root/orarootagent_root.log.
2012-11-19 09:44:16.193
[ctssd(5285)]CRS-2402:The Cluster Time Synchronization Service aborted on host host03. Details at (:ctsselect_mmg5_1: in /u01/app/11.2.0/grid/log/host03/ctssd/octssd.log.
2012-11-19 09:44:21.177
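When reading an excerpt like the one above, it helps to pair every message with its timestamp. The sketch below does that under one stated assumption: in the Clusterware alert log a timestamp line belongs to the message or messages that follow it, so the first message of a pasted excerpt may have no timestamp at all. The function name pair_entries is hypothetical.

```python
import re

# A line holding only a timestamp, e.g. "2012-11-19 09:43:52.714".
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}$")
# The message code that follows the "[process(pid)]" prefix.
CODE = re.compile(r"\](CRS-\d+):")


def pair_entries(alert_text):
    """Return (timestamp, CRS code) pairs from alert log text.

    Assumption: a timestamp line applies to the messages after it,
    until the next timestamp line appears.
    """
    entries = []
    current_ts = None  # stays None for messages before the first timestamp
    for raw in alert_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if TIMESTAMP.match(line):
            current_ts = line
            continue
        m = CODE.search(line)
        if m:
            entries.append((current_ts, m.group(1)))
    return entries


sample = """\
[cssd(5124)]CRS-1612:Network communication with node host01 (1) missing for 50% of timeout interval.  Removal of this node from cluster in 14.880 seconds
2012-11-19 09:43:52.714
[cssd(5124)]CRS-1612:Network communication with node host02 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.640 seconds
2012-11-19 09:44:01.880
[cssd(5124)]CRS-1611:Network communication with node host01 (1) missing for 75% of timeout interval.  Removal of this node from cluster in 6.790 seconds
"""

for ts, code in pair_entries(sample):
    print(ts, code)
```

Read this way, the 09:43:52.714 stamp on the host02 warning matches the time of the corresponding polling thread line in the OCSSD log of host03.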
——————–
OCSSD log of host01
——————–
– At 09:43:53, the CSSD process of host01 identifies that it can’t communicate with CSSD on host03
2012-11-19 09:43:53.340: [    CSSD][841635728]clssnmPollingThread: node host03 (3) at 50% heartbeat fatal, removal in 14.500 seconds
2012-11-19 09:43:53.340: [    CSSD][841635728]clssnmPollingThread: node host03 (3) is impending reconfig, flag 132110, misstime 15500
2012-11-19 09:43:53.340: [    CSSD][841635728]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
——————-
Alert log of host01
——————-
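– At 09:43:53 (per the OCSSD log above), CSSD on host01 reports the communication failure with host03; the 75% warning follows at 09:44:01, since each timestamp line in the alert log precedes its message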
[cssd(5308)]CRS-1612:Network communication with node host03 (3) missing for 50% of timeout interval.  Removal of this node from cluster in 14.500 seconds
2012-11-19 09:44:01.695
[cssd(5308)]CRS-1611:Network communication with node host03 (3) missing for 75% of timeout interval.  Removal of this node from cluster in 7.450 seconds
2012-11-19 09:44:07.666
[cssd(5308)]CRS-1610:Network communication with node host03 (3) missing for 90% of timeout interval.  Removal of this node from cluster in 2.440 seconds
2012-11-19 09:44:10.606
[cssd(5308)]CRS-1607:Node host03 is being evicted in cluster incarnation 32819913; details at (:CSSNM00007:) in /u01/app/11.2.0/grid/log/host01/cssd/ocssd.log.
2012-11-19 09:44:24.705
– At 09:44:24, OHASD process on host01 receives reboot message from host03
[ohasd(4941)]CRS-8011:reboot advisory message from host: host03, component: ag050107, with time stamp: L-2012-11-19-09:44:24.373
[ohasd(4941)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-11-19 09:44:24.705
[ohasd(4941)]CRS-8011:reboot advisory message from host: host03, component: mo050107, with time stamp: L-2012-11-19-09:44:24.376
[ohasd(4941)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-11-19 09:44:46.379
[cssd(5308)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02 .
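The timestamps in this alert log give a rough timeline of the eviction as seen from host01. The sketch below computes the elapsed time between two of them. It assumes each timestamp line belongs to the message that follows it, so 09:44:10.606 is read as the CRS-1607 eviction message and 09:44:46.379 as the CRS-1601 reconfiguration message. The helper name seconds_between is hypothetical.

```python
from datetime import datetime

# Timestamp layout used in the alert log lines above.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def seconds_between(start, end):
    """Return the number of seconds from start to end (both alert log timestamps)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


eviction_started = "2012-11-19 09:44:10.606"   # read as the CRS-1607 timestamp
reboot_advisory = "2012-11-19 09:44:24.705"    # CRS-8011 from host03
reconfig_complete = "2012-11-19 09:44:46.379"  # CRS-1601

print(seconds_between(eviction_started, reboot_advisory))
print(seconds_between(eviction_started, reconfig_complete))
```

Under that reading, host01 took roughly 14 seconds from starting the eviction to receiving the reboot advisory, and roughly 36 seconds to complete the reconfiguration with host01 and host02 as the active nodes.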
——————-
OCSSD log of host02
——————–
– At 09:43:52, CSSD process of host02 identifies communication failure with host03
2012-11-19 09:43:52.385: [    CSSD][841635728]clssnmPollingThread: node host03 (3) at 50% heartbeat fatal, removal in 14.950 seconds
2012-11-19 09:43:52.386: [    CSSD][841635728]clssnmPollingThread: node host03 (3) is impending reconfig, flag 394254, misstime 15050
2012-11-19 09:43:52.386: [    CSSD][841635728]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2012-11-19 09:43:52.733: [    CSSD][810166160]clssnmvSchedDiskThreads: DiskPingThread for voting file ORCL:ASMDISK01 sched delay 970 > margin 750 cur_ms 18331974 lastalive 18331004
——————–
Alert log of host02
———————
– As on host01, the alert log of host02 records the communication failure with host03; the 75% warning is logged at 09:44:01 (each timestamp line precedes its message)
[cssd(5284)]CRS-1612:Network communication with node host03 (3) missing for 50% of timeout interval.  Removal of this node from cluster in 14.950 seconds
2012-11-19 09:44:01.971
[cssd(5284)]CRS-1611:Network communication with node host03 (3) missing for 75% of timeout interval.  Removal of this node from cluster in 6.930 seconds
2012-11-19 09:44:06.750
[cssd(5284)]CRS-1610:Network communication with node host03 (3) missing for 90% of timeout interval.  Removal of this node from cluster in 2.920 seconds
2012-11-19 09:44:24.520
– At 09:44:24 (same as host01), the OHASD process on host02 receives the reboot message from host03
[ohasd(4929)]CRS-8011:reboot advisory message from host: host03, component: ag050107, with time stamp: L-2012-11-19-09:44:24.373
[ohasd(4929)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-11-19 09:44:24.520
[ohasd(4929)]CRS-8011:reboot advisory message from host: host03, component: mo050107, with time stamp: L-2012-11-19-09:44:24.376
[ohasd(4929)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-11-19 09:44:46.073
[cssd(5284)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02 .
————————————————————————————–
Your comments and suggestions are welcome!