In this post, I will demonstrate node eviction due to a missing disk heartbeat, i.e. a node will be evicted from the cluster if it cannot access the voting disk. To simulate this, I will stop the iSCSI service on one of the nodes and then scan the alert logs and ocssd logs of the various nodes.
Current scenario:
Number of nodes in the cluster: 3
Names of the nodes: host01, host02, host03
Name of the cluster database: orcl
I will stop the iSCSI service on host03 so that it is evicted.
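– Before the test, the cluster state and voting disk configuration can be recorded from any node (a minimal sketch, assuming the Grid Infrastructure tools are in root's PATH; in this setup the voting disks are on ORCL:ASMDISK01, ASMDISK02 and ASMDISK03)
[root@host01 ~]# olsnodes -n -s
[root@host01 ~]# crsctl query css votedisk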
— Stop the iSCSI service on host03 so that it can’t access the shared storage and hence the voting disks
[root@host03 ~]# service iscsi stop
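– Optionally, verify that host03 has lost access to the shared disks (a sketch, assuming iSCSI-backed ASMLib disks as in this setup; with the iSCSI service stopped, no active sessions should be reported and the ASM disks should no longer be listed)
[root@host03 ~]# iscsiadm -m session
[root@host03 ~]# oracleasm listdisks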
– Scan the alert log of host03. Note that I/O errors occur at 03:32:11
[root@host03 ~]# tailf /u01/app/11.2.0/grid/log/host03/alerthost03.log
— Note that the ocssd process of host03 is not able to access the voting disks
[cssd(5149)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK01; details at (:CSSNM00059:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-17 03:32:11.310
[cssd(5149)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK03; details at (:CSSNM00059:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-17 03:32:11.311
[cssd(5149)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK03; details at (:CSSNM00060:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-17 03:32:11.311
[cssd(5149)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK01; details at (:CSSNM00060:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-17 03:32:11.312
[cssd(5149)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK02; details at (:CSSNM00060:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-17 03:32:11.310
[cssd(5149)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK02; details at (:CSSNM00059:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
– ACFS can’t be accessed; the processes holding open references on the ACFS mount are terminated
[client(8048)]CRS-10001:ACFS-9112: The following process IDs have open references on /u01/app/oracle/acfsmount/11.2.0/sharedhome:
[client(8050)]CRS-10001:6323 6363 6391 6375 6385 6383 6402 6319 6503 6361 6377 6505 6389 6369 6335 6367 6333 6387 6871 6325 6381 6327 6496 6498 6552 6373 7278 6339 6400 6357 6500 6329 6365
[client(8052)]CRS-10001:ACFS-9113: These processes will now be terminated.
[client(8127)]CRS-10001:ACFS-9114: done.
[client(8136)]CRS-10001:ACFS-9115: Stale mount point /u01/app/oracle/acfsmount/11.2.0/sharedhome was recovered.
[client(8178)]CRS-10001:ACFS-9114: done.
[client(8183)]CRS-10001:ACFS-9116: Stale mount point ‘/u01/app/oracle/acfsmount/11.2.0/sharedhome’ was not recovered.
[client(8185)]CRS-10001:ACFS-9117: Manual intervention is required.
2012-11-17 03:33:34.050
[/u01/app/11.2.0/grid/bin/orarootagent.bin(5682)]CRS-5016:Process “/u01/app/11.2.0/grid/bin/acfssinglefsmount” spawned by agent
“/u01/app/11.2.0/grid/bin/orarootagent.bin” for action “start” failed: details at “(:CLSN00010:)” in “/u01/app/11.2.0/grid/log/host03/agent/crsd/orarootagent_root/orarootagent_root.log”
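– The state of the ACFS file system on host03 can also be inspected at this point (a sketch; the mount point is the one used in this setup, and acfsutil is typically under /sbin or the Grid home)
[root@host03 ~]# df -h /u01/app/oracle/acfsmount/11.2.0/sharedhome
[root@host03 ~]# acfsutil info fs /u01/app/oracle/acfsmount/11.2.0/sharedhome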
– Starting at 03:34, CSSD warns that no I/O to the voting disks has completed for 50%, 75% and then 90% of the maximum interval (disktimeout)
2012-11-17 03:34:10.718
[cssd(5149)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:ASMDISK01 will be considered not functional in 99190 milliseconds
2012-11-17 03:34:10.724
[cssd(5149)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:ASMDISK02 will be considered not functional in 99180 milliseconds
2012-11-17 03:34:10.724
[cssd(5149)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:ASMDISK03 will be considered not functional in 99180 milliseconds
2012-11-17 03:35:10.666
[cssd(5149)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file ORCL:ASMDISK01 will be considered not functional in 49110 milliseconds
2012-11-17 03:35:10.666
[cssd(5149)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file ORCL:ASMDISK02 will be considered not functional in 49110 milliseconds
2012-11-17 03:35:10.666
[cssd(5149)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file ORCL:ASMDISK03 will be considered not functional in 49110 milliseconds
2012-11-17 03:35:46.654
[cssd(5149)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file ORCL:ASMDISK01 will be considered not functional in 19060 milliseconds
2012-11-17 03:35:46.654
[cssd(5149)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file ORCL:ASMDISK02 will be considered not functional in 19060 milliseconds
2012-11-17 03:35:46.654
[cssd(5149)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file ORCL:ASMDISK03 will be considered not functional in 19060 milliseconds
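– The "maximum interval" referred to in these messages is the CSS disktimeout, 200 seconds by default in 11.2 (the network heartbeat timeout, misscount, defaults to 30 seconds). Both can be checked with crsctl from any node, for example:
[root@host01 ~]# crsctl get css disktimeout
[root@host01 ~]# crsctl get css misscount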
– At 03:36, the voting files are taken offline as they still can’t be accessed
[cssd(5149)]CRS-1604:CSSD voting file is offline: ORCL:ASMDISK01; details at (:CSSNM00058:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-17 03:36:10.596
[cssd(5149)]CRS-1604:CSSD voting file is offline: ORCL:ASMDISK02; details at (:CSSNM00058:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-17 03:36:10.596
[cssd(5149)]CRS-1604:CSSD voting file is offline: ORCL:ASMDISK03; details at (:CSSNM00058:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-17 03:36:10.596
– CSSD on host03 terminates (causing the node to reboot) because the number of voting disks available (0) is less than the minimum required (2): with 3 voting disks configured, a node must be able to access a strict majority, i.e. at least 2 of them
[cssd(5149)]CRS-1606:The number of voting files available, 0, is less than the minimum number of voting files required, 2, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log
2012-11-17 03:36:15.645
[ctssd(5236)]CRS-2402:The Cluster Time Synchronization Service aborted on host host03. Details at (:ctsselect_mmg5_1: in /u01/app/11.2.0/grid/log/host03/ctssd/octssd.log.
– Scan the ocssd log of host03
[root@host03 ~]# tailf /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log
– I/O fencing for the orcl database is carried out by CSSD at 03:32 (the same time at which host02 receives the message that orcl has failed on host03)
2012-11-17 03:32:10.356: [ CSSD][997865360]clssgmFenceClient: fencing client (0xaa14990), member 2 in group DBORCL, no share, death fence 1, SAGE fence 0
2012-11-17 03:32:10.356: [ CSSD][997865360]clssgmUnreferenceMember: global grock DBORCL member 2 refcount is 7
2012-11-17 03:32:10.356: [ CSSD][997865360]clssgmFenceProcessDeath: client (0xaa14990) pid 6337 undead
…..
2012-11-17 03:32:10.356: [ CSSD][997865360]clssgmFenceClient: fencing client (0xaa24250), member 4 in group DAALL_DB, no share, death fence 1, SAGE fence 0
…
2012-11-17 03:32:10.356: [ CSSD][997865360]clssgmFenceClient: fencing client (0xaa6db08), member 0 in group DG_LOCAL_DATA, same group share, death fence 1, SAGE fence 0
…
2012-11-17 03:32:10.357: [ CSSD][864708496]clssgmTermMember: Terminating member 2 (0xaa15920) in grock DBORCL
…
2012-11-17 03:32:10.358: [ CSSD][864708496]clssgmFenceCompletion: (0xaa46760) process death fence completed for process 6337, object type 3
..
2012-11-17 03:32:10.358: [ CSSD][864708496]clssgmFenceCompletion: (0xaa05758) process death fence completed for process 6337, object type 2
…
2012-11-17 03:32:10.359: [ CSSD][852125584]clssgmRemoveMember: grock DAALL_DB, member number 4 (0xaa05aa8) node number 3 state 0x0 grock type 2
…
2012-11-17 03:32:11.310: [ SKGFD][942172048]ERROR: -15(asmlib ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so op ioerror error I/O Error)
2012-11-17 03:32:11.310: [ CSSD][942172048](:CSSNM00059:)clssnmvWriteBlocks: write failed at offset 19 of ORCL:ASMDISK02
2012-11-17 03:32:11.310: [ SKGFD][973764496]ERROR: -15(asmlib ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so op ioerror error I/O Error)
2012-11-17 03:32:11.310: [ CSSD][973764496](:CSSNM00059:)clssnmvWriteBlocks: write failed at offset 19 of ORCL:ASMDISK03
…
2012-11-17 03:32:11.349: [ CSSD][997865360]clssgmFenceClient: fencing client (0xaa38ae0), member 2 in group DBORCL, same group share, death fence 1, SAGE fence 0
…
2012-11-17 03:32:11.349: [ CSSD][997865360]clssgmFenceClient: fencing client (0xaa5e128), member 0 in group DG_LOCAL_DATA, same group share, death fence 1, SAGE fence 0
…
2012-11-17 03:32:11.354: [ CSSD][908748688]clssnmvDiskAvailabilityChange: voting file ORCL:ASMDISK01 now offline
2012-11-17 03:32:11.354: [ CSSD][973764496]clssnmvDiskAvailabilityChange: voting file ORCL:ASMDISK03 now offline
2012-11-17 03:32:11.354: [ CSSD][931682192]clssnmvDiskAvailabilityChange: voting file ORCL:ASMDISK02 now offline
…
2012-11-17 03:32:12.038: [ CSSD][810166160]clssnmvSchedDiskThreads: DiskPingThread for voting file ORCL:ASMDISK02 sched delay 1610 > margin 1500 cur_ms 232074 lastalive 230464
2012-11-17 03:32:12.038: [ CSSD][810166160]clssnmvSchedDiskThreads: DiskPingThread for voting file ORCL:ASMDISK01 sched delay 1640 > margin 1500 cur_ms 232074 lastalive 230434
….
2012-11-17 03:32:12.223: [ CLSF][887768976]Closing handle:0xa746bc0
2012-11-17 03:32:12.223: [ SKGFD][887768976]Lib :ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so: closing handle 0xa746df8 for disk :ORCL:ASMDISK01:
2012-11-17 03:32:12.236: [ CLSF][921192336]Closing handle:0xa5cbbb0
2012-11-17 03:32:12.236: [ SKGFD][921192336]Lib :ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so: closing handle 0xa644fb8 for disk :ORCL:ASMDISK02:
…
2012-11-17 03:32:13.825: [ CSSD][997865360]clssnmvSchedDiskThreads: DiskPingThread for voting file ORCL:ASMDISK03 sched delay 3110 > margin 1500 cur_ms 233574 lastalive 230464
2012-11-17 03:32:13.825: [ CSSD][997865360]clssnmvSchedDiskThreads: DiskPingThread for voting file ORCL:ASMDISK02 sched delay 3110 > margin 1500 cur_ms 233574 lastalive 230464
2012-11-17 03:32:13.825: [ CSSD][997865360]clssnmvSchedDiskThreads: DiskPingThread for voting file ORCL:ASMDISK01 sched delay 3140 > margin 1500 cur_ms 233574
…
2012-11-17 03:36:10.638: [ CSSD][877279120]CALL TYPE: call ERROR SIGNALED: no CALLER: clssscExit
– Scan the alert log of host01
– Note that the reboot advisory message from host03 is received at 03:36:16
[root@host01 host01]# tailf /u01/app/11.2.0/grid/log/host01/alerthost01.log
[ohasd(4942)]CRS-8011:reboot advisory message from host: host03, component: mo031159, with time stamp: L-2012-11-17-03:36:16.705
[ohasd(4942)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-11-17 03:36:29.610
– After host03 reboots itself, network communication with host03 is lost
[cssd(5177)]CRS-1612:Network communication with node host03 (3) missing for 50% of timeout interval. Removal of this node from cluster in 14.060 seconds
2012-11-17 03:36:37.988
[cssd(5177)]CRS-1611:Network communication with node host03 (3) missing for 75% of timeout interval. Removal of this node from cluster in 7.050 seconds
2012-11-17 03:36:43.992
[cssd(5177)]CRS-1610:Network communication with node host03 (3) missing for 90% of timeout interval. Removal of this node from cluster in 2.040 seconds
2012-11-17 03:36:46.441
– After network communication can’t be re-established within the timeout interval, the node is removed from the cluster
[cssd(5177)]CRS-1632:Node host03 is being removed from the cluster in cluster incarnation 232819906
2012-11-17 03:36:46.572
[cssd(5177)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02 .
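– At this point the cluster membership and resource state can be confirmed from a surviving node (a sketch; host03 should show as inactive until it comes back up)
[root@host01 ~]# olsnodes -s
[root@host01 ~]# crsctl stat res -t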
– Scan the ocssd log of host01
– Note that the ocssd process of host01 notices the missing disk heartbeat from host03 at 03:32:16
[root@host01 cssd]# tailf /u01/app/11.2.0/grid/log/host01/cssd/ocssd.log
2012-11-17 03:32:16.352: [ CSSD][852125584]clssgmGrockOpTagProcess: clssgmCommonAddMember failed, member(-1/CLSN.ONS.ONSNETPROC[3]) on node(3)
2012-11-17 03:32:16.352: [ CSSD][852125584]clssgmGrockOpTagProcess: Operation(3) unsuccessful grock(CLSN.ONS.ONSNETPROC[3])
2012-11-17 03:32:16.352: [ CSSD][852125584]clssgmHandleMasterJoin: clssgmProcessJoinUpdate failed with status(-10)
….
2012-11-17 03:36:15.328: [ CSSD][810166160]clssnmDiscHelper: host03, node(3) connection failed, endp (0x319), probe((nil)), ninf->endp 0x319
2012-11-17 03:36:15.328: [ CSSD][810166160]clssnmDiscHelper: node 3 clean up, endp (0x319), init state 3, cur state 3
…
2012-11-17 03:36:15.329: [GIPCXCPT][852125584]gipcInternalDissociate: obj 0x96c7eb8 [0000000000001310] { gipcEndpoint : localAddr ‘gipc://host01:f278-d1bd-1509-2f25#10.0.0.1#20071′, remoteAddr ‘gipc://host03:gm_cluster01#10.0.0.3#58536′, numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x141d, pidPeer 0, flags 0x261e, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
– Scan the alert log of host02. Note that the reboot advisory message is received at 03:36:16
[root@host02 ~]# tailf /u01/app/11.2.0/grid/log/host02/alerthost02.log
– At 03:32, the CRSD process of host02 receives the message that the orcl database has failed on host03, as the datafiles of orcl are on the shared storage that host03 can no longer access
[crsd(5576)]CRS-2765:Resource ‘ora.orcl.db’ has failed on server ‘host03′.
2012-11-17 03:32:44.303
– The CRSD process of host02 receives the message that ACFS has failed on host03, as the shared storage can’t be accessed
[crsd(5576)]CRS-2765:Resource ‘ora.acfs.dbhome_1.acfs’ has failed on server ‘host03‘.
2012-11-17 03:36:16.981
– The ohasd process of host02 receives the reboot advisory message from host03
[ohasd(4916)]CRS-8011:reboot advisory message from host: host03, component: ag031159, with time stamp: L-2012-11-17-03:36:16.705
[ohasd(4916)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-11-17 03:36:16.981
[ohasd(4916)]CRS-8011:reboot advisory message from host: host03, component: mo031159, with time stamp: L-2012-11-17-03:36:16.705
[ohasd(4916)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-11-17 03:36:28.920
– The CSSD process of host02 detects missing network communication from host03, as host03 has rebooted itself
[cssd(5284)]CRS-1612:Network communication with node host03 (3) missing for 50% of timeout interval. Removal of this node from cluster in 14.420 seconds
2012-11-17 03:36:37.307
[cssd(5284)]CRS-1611:Network communication with node host03 (3) missing for 75% of timeout interval. Removal of this node from cluster in 7.410 seconds
2012-11-17 03:36:43.328
[cssd(5284)]CRS-1610:Network communication with node host03 (3) missing for 90% of timeout interval. Removal of this node from cluster in 2.400 seconds
– After network communication can’t be re-established within the timeout interval, the node is removed from the cluster
2012-11-17 03:36:46.297
[cssd(5284)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02 .
2012-11-17 03:36:46.470
[crsd(5576)]CRS-5504:Node down event reported for node ‘host03′.
2012-11-17 03:36:51.890
[crsd(5576)]CRS-2773:Server ‘host03′ has been removed from pool ‘Generic’.
2012-11-17 03:36:51.909
[crsd(5576)]CRS-2773:Server ‘host03′ has been removed from pool ‘ora.orcl’.
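– With host03 removed from the server pools, you can check from a surviving node which instances of orcl are still running (a sketch; run as the oracle user with the database home environment set)
[oracle@host02 ~]$ srvctl status database -d orcl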
– Note that the final reconfiguration message, showing all three nodes active, appears later, after host03 has rebooted and rejoined the cluster
[cssd(5284)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02 host03 .
– Scan the ocssd log of host02
– Note that the ocssd of host02 discovers that host03 is missing only after host03 has rebooted, at 03:36
[root@host02 ~]# tailf /u01/app/11.2.0/grid/log/host02/cssd/ocssd.log
2012-11-17 03:36:15.052: [ CSSD][810166160]clssnmDiscHelper: host03, node(3) connection failed, endp (0x22e), probe((nil)), ninf->endp 0x22e
2012-11-17 03:36:15.052: [ CSSD][810166160]clssnmDiscHelper: node 3 clean up, endp (0x22e), init state 3, cur state 3
…..
2012-11-17 03:36:15.052: [ CSSD][852125584]clssgmPeerDeactivate: node 3 (host03), death 0, state 0x1 connstate 0x1e
….
2012-11-17 03:36:28.920: [ CSSD][841635728]clssnmPollingThread: node host03 (3) at 50% heartbeat fatal, removal in 14.420 seconds
2012-11-17 03:36:28.920: [ CSSD][841635728]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2012-11-17 03:36:29.017: [ CSSD][810166160]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 474884 lastalive 474074
2012-11-17 03:36:29.017: [ CSSD][810166160]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 474884 lastalive 474074
2012-11-17 03:36:29.017: [ CSSD][810166160]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 474884 lastalive 474074
2012-11-17 03:36:29.908: [ CSSD][852125584]clssgmTagize: version(1), type(13), tagizer(0x80cf3ac)
2012-11-17 03:36:29.908: [ CSSD][852125584]clssgmHandleDataInvalid: grock HB+ASM, member 1 node 1, birth 1
2012-11-17 03:36:30.218: [ CSSD][810166160]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 475884 lastalive 475074
2012-11-17 03:36:30.218: [ CSSD][810166160]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 475884 lastalive 475074
2012-11-17 03:36:30.218: [ CSSD][810166160]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 475884 lastalive 475074
2012-11-17 03:36:31.408: [ CSSD][810166160]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 790 > margin 750 cur_ms 476864 lastalive 476074
2012-11-17 03:36:31.408: [ CSSD][810166160]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 790 > margin 750 cur_ms 476864 lastalive 476074
2012-11-17 03:36:31.408: [ CSSD][810166160]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 790 > margin 750 cur_ms 476864 lastalive 476074
2012-11-17 03:36:32.204: [ CSSD][831145872]clssnmSendingThread: sending status msg to all nodes
….
2012-11-17 03:36:46.161: [ CSSD][810166160]clssnmHandleSync: local disk timeout set to 27000 ms, remote disk timeout set to 27000
————————–
Related links:
11G R2 RAC Index
Node Eviction Due To Missing Network Heartbeat
Node Eviction Due To Member Kill Escalation
Node Eviction Due To CSSD Agent Stopping
Q: Can you tell me how you would troubleshoot this type of issue?
A: You need to find out the reason why the node is not able to access the voting disks and resolve it accordingly.
Regards
Anju Garg
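For example, on the evicted node (once it is back up) the storage path could be checked and restored along these lines (a rough sketch, assuming iSCSI-backed shared storage as in this setup):
[root@host03 ~]# service iscsi status
[root@host03 ~]# service iscsi start
[root@host03 ~]# crsctl query css votedisk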