11g R2 RAC: NODE EVICTION DUE TO MISSING DISK HEARTBEAT

In this post, I will demonstrate node eviction due to a missing disk heartbeat, i.e. a node being evicted from the cluster because it can't access the voting disks. To simulate it, I will stop the iSCSI service on one of the nodes and then scan the alert logs and ocssd logs of the various nodes.

Current scenario:
No. of nodes in the cluster  : 3
Names of the nodes           : host01, host02, host03
Name of the cluster database : orcl

I will stop the iSCSI service on host03 so that it is evicted.
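
Before breaking anything, the voting disk configuration can be confirmed from any node (this cluster uses three voting files on ASM disks, as the logs below will show):

[root@host01 ~]# crsctl query css votedisk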

– Stop the iSCSI service on host03 so that it can't access the shared storage and hence the voting disks

[root@host03 ~]# service iscsi stop
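
As an optional sanity check, confirm that the iSCSI service is really down before watching the logs:

[root@host03 ~]# service iscsi status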

Scan the alert log of host03

– Note that the I/O error occurs at 03:32:11

[root@host03 ~]# tailf /u01/app/11.2.0/grid/log/host03/alerthost03.log

– Note that the ocssd process of host03 is not able to access the voting disks

[cssd(5149)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK01; details at (:CSSNM00059:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.

2012-11-17 03:32:11.310

[cssd(5149)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK03; details at (:CSSNM00059:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.

2012-11-17 03:32:11.311

[cssd(5149)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK03; details at (:CSSNM00060:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.

2012-11-17 03:32:11.311

[cssd(5149)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK01; details at (:CSSNM00060:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.

2012-11-17 03:32:11.312

[cssd(5149)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK02; details at (:CSSNM00060:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.

2012-11-17 03:32:11.310

[cssd(5149)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK02; details at (:CSSNM00059:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.

– The ACFS file system can't be accessed, so the processes holding it open are terminated

[client(8048)]CRS-10001:ACFS-9112: The following process IDs have open references on /u01/app/oracle/acfsmount/11.2.0/sharedhome:

[client(8050)]CRS-10001:6323 6363 6391 6375 6385 6383 6402 6319 6503 6361 6377 6505 6389 6369 6335 6367 6333 6387 6871 6325 6381 6327 6496 6498 6552 6373 7278 6339 6400 6357 6500 6329 6365

[client(8052)]CRS-10001:ACFS-9113: These processes will now be terminated.

[client(8127)]CRS-10001:ACFS-9114: done.

[client(8136)]CRS-10001:ACFS-9115: Stale mount point /u01/app/oracle/acfsmount/11.2.0/sharedhome was recovered.

[client(8178)]CRS-10001:ACFS-9114: done.

[client(8183)]CRS-10001:ACFS-9116: Stale mount point ‘/u01/app/oracle/acfsmount/11.2.0/sharedhome’ was not recovered.

[client(8185)]CRS-10001:ACFS-9117: Manual intervention is required.

2012-11-17 03:33:34.050

[/u01/app/11.2.0/grid/bin/orarootagent.bin(5682)]CRS-5016:Process “/u01/app/11.2.0/grid/bin/acfssinglefsmount” spawned by agent

“/u01/app/11.2.0/grid/bin/orarootagent.bin” for action “start” failed: details at “(:CLSN00010:)” in “/u01/app/11.2.0/grid/log/host03/agent/crsd/orarootagent_root/orarootagent_root.log”

– By 03:34, the voting disks still can't be accessed and CSSD begins counting down the disk timeout (warnings at 50%, 75% and 90% of the maximum interval)

2012-11-17 03:34:10.718

[cssd(5149)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:ASMDISK01 will be considered not functional in 99190 milliseconds

2012-11-17 03:34:10.724

[cssd(5149)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:ASMDISK02 will be considered not functional in 99180 milliseconds

2012-11-17 03:34:10.724

[cssd(5149)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:ASMDISK03 will be considered not functional in 99180 milliseconds

2012-11-17 03:35:10.666

[cssd(5149)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file ORCL:ASMDISK01 will be considered not functional in 49110 milliseconds

2012-11-17 03:35:10.666

[cssd(5149)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file ORCL:ASMDISK02 will be considered not functional in 49110 milliseconds

2012-11-17 03:35:10.666

[cssd(5149)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file ORCL:ASMDISK03 will be considered not functional in 49110 milliseconds

2012-11-17 03:35:46.654

[cssd(5149)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file ORCL:ASMDISK01 will be considered not functional in 19060 milliseconds

2012-11-17 03:35:46.654

[cssd(5149)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file ORCL:ASMDISK02 will be considered not functional in 19060 milliseconds

2012-11-17 03:35:46.654

[cssd(5149)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file ORCL:ASMDISK03 will be considered not functional in 19060 milliseconds
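
These warnings count down against the CSS disk timeout; the remaining-time values above are consistent with the default disktimeout of 200 seconds, which can be verified from a healthy node:

[root@host01 ~]# crsctl get css disktimeout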

– The voting files are taken offline as they can't be accessed

[cssd(5149)]CRS-1604:CSSD voting file is offline: ORCL:ASMDISK01; details at (:CSSNM00058:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.

2012-11-17 03:36:10.596

[cssd(5149)]CRS-1604:CSSD voting file is offline: ORCL:ASMDISK02; details at (:CSSNM00058:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.

2012-11-17 03:36:10.596

[cssd(5149)]CRS-1604:CSSD voting file is offline: ORCL:ASMDISK03; details at (:CSSNM00058:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.

2012-11-17 03:36:10.596

– CSSD of host03 terminates (and the node reboots) because the number of voting disks available (0) is less than the minimum required, i.e. a majority of the three voting disks (2)

[cssd(5149)]CRS-1606:The number of voting files available, 0, is less than the minimum number of voting files required, 2, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log

2012-11-17 03:36:15.645

[ctssd(5236)]CRS-2402:The Cluster Time Synchronization Service aborted on host host03. Details at (:ctsselect_mmg5_1: in /u01/app/11.2.0/grid/log/host03/ctssd/octssd.log.

Scan the ocssd log of host03

[root@host03 ~]# tailf /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log

– I/O fencing for the orcl database is carried out by CSSD at 03:32 (the same time at which host02 receives the message that orcl has failed on host03)

2012-11-17 03:32:10.356: [    CSSD][997865360]clssgmFenceClient: fencing client (0xaa14990), member 2 in group DBORCL, no share, death fence 1, SAGE fence 0

2012-11-17 03:32:10.356: [    CSSD][997865360]clssgmUnreferenceMember: global grock DBORCL member 2 refcount is 7

2012-11-17 03:32:10.356: [    CSSD][997865360]clssgmFenceProcessDeath: client (0xaa14990) pid 6337 undead

…..

2012-11-17 03:32:10.356: [    CSSD][997865360]clssgmFenceClient: fencing client (0xaa24250), member 4 in group DAALL_DB, no share, death fence 1, SAGE fence 0

2012-11-17 03:32:10.356: [    CSSD][997865360]clssgmFenceClient: fencing client (0xaa6db08), member 0 in group DG_LOCAL_DATA, same group share, death fence 1, SAGE fence 0

2012-11-17 03:32:10.357: [    CSSD][864708496]clssgmTermMember: Terminating member 2 (0xaa15920) in grock DBORCL

2012-11-17 03:32:10.358: [    CSSD][864708496]clssgmFenceCompletion: (0xaa46760) process death fence completed for process 6337, object type 3

..

2012-11-17 03:32:10.358: [    CSSD][864708496]clssgmFenceCompletion: (0xaa05758) process death fence completed for process 6337, object type 2

2012-11-17 03:32:10.359: [    CSSD][852125584]clssgmRemoveMember: grock DAALL_DB, member number 4 (0xaa05aa8) node number 3 state 0x0 grock type 2

2012-11-17 03:32:11.310: [   SKGFD][942172048]ERROR: -15(asmlib ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so op ioerror error I/O Error)

2012-11-17 03:32:11.310: [    CSSD][942172048](:CSSNM00059:)clssnmvWriteBlocks: write failed at offset 19 of ORCL:ASMDISK02

2012-11-17 03:32:11.310: [   SKGFD][973764496]ERROR: -15(asmlib ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so op ioerror error I/O Error)

2012-11-17 03:32:11.310: [    CSSD][973764496](:CSSNM00059:)clssnmvWriteBlocks: write failed at offset 19 of ORCL:ASMDISK03

2012-11-17 03:32:11.349: [    CSSD][997865360]clssgmFenceClient: fencing client (0xaa38ae0), member 2 in group DBORCL, same group share, death fence 1, SAGE fence 0

2012-11-17 03:32:11.349: [    CSSD][997865360]clssgmFenceClient: fencing client (0xaa5e128), member 0 in group DG_LOCAL_DATA, same group share, death fence 1, SAGE fence 0

2012-11-17 03:32:11.354: [    CSSD][908748688]clssnmvDiskAvailabilityChange: voting file ORCL:ASMDISK01 now offline

2012-11-17 03:32:11.354: [    CSSD][973764496]clssnmvDiskAvailabilityChange: voting file ORCL:ASMDISK03 now offline

2012-11-17 03:32:11.354: [    CSSD][931682192]clssnmvDiskAvailabilityChange: voting file ORCL:ASMDISK02 now offline

2012-11-17 03:32:12.038: [    CSSD][810166160]clssnmvSchedDiskThreads: DiskPingThread for voting file ORCL:ASMDISK02 sched delay 1610 > margin 1500 cur_ms 232074 lastalive 230464

2012-11-17 03:32:12.038: [    CSSD][810166160]clssnmvSchedDiskThreads: DiskPingThread for voting file ORCL:ASMDISK01 sched delay 1640 > margin 1500 cur_ms 232074 lastalive 230434

….

2012-11-17 03:32:12.223: [    CLSF][887768976]Closing handle:0xa746bc0

2012-11-17 03:32:12.223: [   SKGFD][887768976]Lib :ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so: closing handle 0xa746df8 for disk :ORCL:ASMDISK01:

2012-11-17 03:32:12.236: [    CLSF][921192336]Closing handle:0xa5cbbb0

2012-11-17 03:32:12.236: [   SKGFD][921192336]Lib :ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so: closing handle 0xa644fb8 for disk :ORCL:ASMDISK02:

2012-11-17 03:32:13.825: [    CSSD][997865360]clssnmvSchedDiskThreads: DiskPingThread for voting file ORCL:ASMDISK03 sched delay 3110 > margin 1500 cur_ms 233574 lastalive 230464

2012-11-17 03:32:13.825: [    CSSD][997865360]clssnmvSchedDiskThreads: DiskPingThread for voting file ORCL:ASMDISK02 sched delay 3110 > margin 1500 cur_ms 233574 lastalive 230464

2012-11-17 03:32:13.825: [    CSSD][997865360]clssnmvSchedDiskThreads: DiskPingThread for voting file ORCL:ASMDISK01 sched delay 3140 > margin 1500 cur_ms 233574

2012-11-17 03:36:10.638: [    CSSD][877279120]CALL TYPE: call   ERROR SIGNALED: no   CALLER: clssscExit

Scan the alert log of host01

– Note that the reboot advisory message from host03 is received at 03:36:16

[root@host01 host01]# tailf /u01/app/11.2.0/grid/log/host01/alerthost01.log

[ohasd(4942)]CRS-8011:reboot advisory message from host: host03, component: mo031159, with time stamp: L-2012-11-17-03:36:16.705

[ohasd(4942)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS

2012-11-17 03:36:29.610

– After host03 reboots itself, network communication with host03 is lost

[cssd(5177)]CRS-1612:Network communication with node host03 (3) missing for 50% of timeout interval.  Removal of this node from cluster in 14.060 seconds

2012-11-17 03:36:37.988

[cssd(5177)]CRS-1611:Network communication with node host03 (3) missing for 75% of timeout interval.  Removal of this node from cluster in 7.050 seconds

2012-11-17 03:36:43.992

[cssd(5177)]CRS-1610:Network communication with node host03 (3) missing for 90% of timeout interval.  Removal of this node from cluster in 2.040 seconds

2012-11-17 03:36:46.441

– After network communication can't be re-established within the timeout interval, the node is removed from the cluster

[cssd(5177)]CRS-1632:Node host03 is being removed from the cluster in cluster incarnation 232819906

2012-11-17 03:36:46.572

[cssd(5177)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02 .
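
At this point the surviving nodes can confirm the new membership; the removal countdown above is governed by the CSS misscount (30 seconds by default on Linux), which can be checked alongside the node states:

[root@host01 ~]# olsnodes -s
[root@host01 ~]# crsctl get css misscount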

Scan the ocssd log of host01

– Note that the ocssd process of host01 detects the missing disk heartbeat from host03 at 03:32:16

[root@host01 cssd]# tailf /u01/app/11.2.0/grid/log/host01/cssd/ocssd.log



2012-11-17 03:32:16.352: [    CSSD][852125584]clssgmGrockOpTagProcess: clssgmCommonAddMember failed, member(-1/CLSN.ONS.ONSNETPROC[3]) on node(3)

2012-11-17 03:32:16.352: [    CSSD][852125584]clssgmGrockOpTagProcess: Operation(3) unsuccessful grock(CLSN.ONS.ONSNETPROC[3])

2012-11-17 03:32:16.352: [    CSSD][852125584]clssgmHandleMasterJoin: clssgmProcessJoinUpdate failed with status(-10)

….

2012-11-17 03:36:15.328: [    CSSD][810166160]clssnmDiscHelper: host03, node(3) connection failed, endp (0x319), probe((nil)), ninf->endp 0x319

2012-11-17 03:36:15.328: [    CSSD][810166160]clssnmDiscHelper: node 3 clean up, endp (0x319), init state 3, cur state 3

2012-11-17 03:36:15.329: [GIPCXCPT][852125584]gipcInternalDissociate: obj 0x96c7eb8 [0000000000001310] { gipcEndpoint : localAddr ‘gipc://host01:f278-d1bd-1509-2f25#10.0.0.1#20071′, remoteAddr ‘gipc://host03:gm_cluster01#10.0.0.3#58536′, numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x141d, pidPeer 0, flags 0x261e, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)


Scan the alert log of host02

– Note that the reboot advisory message is received at 03:36:16

[root@host02 ~]# tailf /u01/app/11.2.0/grid/log/host02/alerthost02.log

– At 03:32, the CRSD process of host02 receives a message that the orcl database has failed on host03, as the datafiles for orcl are on shared storage

[crsd(5576)]CRS-2765:Resource 'ora.orcl.db' has failed on server 'host03'.

2012-11-17 03:32:44.303

– The CRSD process of host02 receives a message that the ACFS resource has failed on host03, as the shared storage can't be accessed

[crsd(5576)]CRS-2765:Resource 'ora.acfs.dbhome_1.acfs' has failed on server 'host03'.

2012-11-17 03:36:16.981

– The ohasd process of host02 receives the reboot advisory message from host03

[ohasd(4916)]CRS-8011:reboot advisory message from host: host03, component: ag031159, with time stamp: L-2012-11-17-03:36:16.705

[ohasd(4916)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS

2012-11-17 03:36:16.981

[ohasd(4916)]CRS-8011:reboot advisory message from host: host03, component: mo031159, with time stamp: L-2012-11-17-03:36:16.705

[ohasd(4916)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS

2012-11-17 03:36:28.920

– The CSSD process of host02 detects missing network communication with host03, as host03 has rebooted itself

[cssd(5284)]CRS-1612:Network communication with node host03 (3) missing for 50% of timeout interval.  Removal of this node from cluster in 14.420 seconds

2012-11-17 03:36:37.307

[cssd(5284)]CRS-1611:Network communication with node host03 (3) missing for 75% of timeout interval.  Removal of this node from cluster in 7.410 seconds

2012-11-17 03:36:43.328

[cssd(5284)]CRS-1610:Network communication with node host03 (3) missing for 90% of timeout interval.  Removal of this node from cluster in 2.400 seconds

– After network communication can't be re-established within the timeout interval, the node is removed from the cluster

2012-11-17 03:36:46.297

[cssd(5284)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02 .

2012-11-17 03:36:46.470

[crsd(5576)]CRS-5504:Node down event reported for node 'host03'.

2012-11-17 03:36:51.890

[crsd(5576)]CRS-2773:Server 'host03' has been removed from pool 'Generic'.

2012-11-17 03:36:51.909

[crsd(5576)]CRS-2773:Server 'host03' has been removed from pool 'ora.orcl'.

[cssd(5284)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02 host03 .

Scan the ocssd log of host02

– Note that the ocssd process of host02 detects that host03 is missing only after host03 has rebooted, at 03:36

[root@host02 ~]# tailf /u01/app/11.2.0/grid/log/host02/cssd/ocssd.log

2012-11-17 03:36:15.052: [    CSSD][810166160]clssnmDiscHelper: host03, node(3) connection failed, endp (0x22e), probe((nil)), ninf->endp 0x22e

2012-11-17 03:36:15.052: [    CSSD][810166160]clssnmDiscHelper: node 3 clean up, endp (0x22e), init state 3, cur state 3

…..

2012-11-17 03:36:15.052: [    CSSD][852125584]clssgmPeerDeactivate: node 3 (host03), death 0, state 0x1 connstate 0x1e

….

2012-11-17 03:36:28.920: [    CSSD][841635728]clssnmPollingThread: node host03 (3) at 50% heartbeat fatal, removal in 14.420 seconds

2012-11-17 03:36:28.920: [    CSSD][841635728]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)

2012-11-17 03:36:29.017: [    CSSD][810166160]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 474884 lastalive 474074

2012-11-17 03:36:29.017: [    CSSD][810166160]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 474884 lastalive 474074

2012-11-17 03:36:29.017: [    CSSD][810166160]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 474884 lastalive 474074

2012-11-17 03:36:29.908: [    CSSD][852125584]clssgmTagize: version(1), type(13), tagizer(0x80cf3ac)

2012-11-17 03:36:29.908: [    CSSD][852125584]clssgmHandleDataInvalid: grock HB+ASM, member 1 node 1, birth 1

2012-11-17 03:36:30.218: [    CSSD][810166160]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 475884 lastalive 475074

2012-11-17 03:36:30.218: [    CSSD][810166160]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 475884 lastalive 475074

2012-11-17 03:36:30.218: [    CSSD][810166160]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 475884 lastalive 475074

2012-11-17 03:36:31.408: [    CSSD][810166160]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 790 > margin 750 cur_ms 476864 lastalive 476074

2012-11-17 03:36:31.408: [    CSSD][810166160]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 790 > margin 750 cur_ms 476864 lastalive 476074

2012-11-17 03:36:31.408: [    CSSD][810166160]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 790 > margin 750 cur_ms 476864 lastalive 476074

2012-11-17 03:36:32.204: [    CSSD][831145872]clssnmSendingThread: sending status msg to all nodes

….

2012-11-17 03:36:46.161: [    CSSD][810166160]clssnmHandleSync: local disk timeout set to 27000 ms, remote disk timeout set to 27000
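
To bring host03 back into the cluster after its reboot, restore access to the shared storage by starting the iSCSI service again; if the Clusterware stack does not come up on its own, it can be started manually and the cluster state verified (this assumes the crsctl binary from the Grid home is in root's PATH):

[root@host03 ~]# service iscsi start
[root@host03 ~]# crsctl start crs
[root@host03 ~]# crsctl check cluster -all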


Related links:

Home

11G R2 RAC Index
Node Eviction Due To Missing Network Heartbeat 
Node Eviction Due To Member Kill Escalation 
Node Eviction Due To CSSD Agent Stopping

11g R2 RAC: Reboot-less Node Fencing 
11g R2 RAC: Reboot-less Fencing With Missing Disk Heartbeat
11g R2 RAC: Reboot-less Fencing With Missing Network Heartbeat
 


Your comments and suggestions are welcome!