11g R2 RAC: REBOOT-LESS FENCING WITH MISSING DISK HEARTBEAT

In an earlier post, I discussed reboot-less node fencing, a feature introduced in 11.2.0.2. In this post, I will demonstrate reboot-less node fencing when the disk heartbeat is lost.

-- Check that the clusterware version is 11.2.0.3

[root@host02 ~]# crsctl query crs activeversion
Oracle Clusterware active version on the cluster is [11.2.0.3.0]

-- Check that both nodes in the cluster are active

[root@host02 ~]# olsnodes -s
host01  Active
host02  Active

-- Stop the iSCSI service on host02

[root@host02 ~]# service iscsi stop
Logging out of session [sid: 1, target: iqn.2006-01.com.openfiler:tsn.e55ea88d0212, portal: 192.9.201.182,3260]
Logout of [sid: 1, target: iqn.2006-01.com.openfiler:tsn.e55ea88d0212, portal: 192.9.201.182,3260]: successful
Stopping iSCSI daemon:

-- Alert log of host02

-- Note that instead of the node being rebooted, the CRSD resources are cleaned up

[cssd(2876)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK013; details at (:CSSNM00059:) 

...

[cssd(2876)]CRS-1606:The number of voting files available, 0, is less than the minimum number of voting files required, 1, resulting in CSSD termination to ensure data integrity; 

[cssd(2876)]CRS-1656:The CSS daemon is terminating due to a fatal error; 

[cssd(2876)]CRS-1652:Starting clean up of CRSD resources.
2013-10-09 11:04:30.795

...

[cssd(2876)]CRS-1654:Clean up of CRSD resources finished successfully.
2013-10-09 11:04:31.914
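
To pick this sequence out of a long alert log, the message codes shown above can be filtered. This is only a minimal sketch: the function name `fencing_events` is hypothetical, the code list is just the set of messages seen in this demonstration, and the alert log path in the comment is an assumption that depends on your Grid home.

```shell
# Hypothetical helper: read alert log text on stdin and print only the
# CSSD messages seen in this demonstration (I/O error on the voting file,
# CSSD termination, CRSD resource clean up).
fencing_events() {
    grep -E 'CRS-(1649|1606|1656|1652|1654):'
}

# Example usage (path is an assumption, adjust for your environment):
#   fencing_events < /u01/app/11.2.0/grid/log/host02/alerthost02.log
```

Seeing CRS-1652 followed by CRS-1654 with no reboot in between is what distinguishes reboot-less fencing from a classic node eviction.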

-- Check that the OHAS service is still up on host02

[root@host02 ~]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
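
The state above (OHAS online, CSS unreachable) can also be tested in a script. A minimal sketch, assuming the output of `crsctl check crs` is piped in on stdin; the function name `ohas_only` is hypothetical and it only looks for the two message codes shown above.

```shell
# Hypothetical helper: succeed (exit 0) when the input contains CRS-4638
# (OHAS is online) together with CRS-4530 (CSS cannot be contacted),
# i.e. the stack was brought down but the node itself was not rebooted.
ohas_only() {
    awk '/CRS-4638/ { ohas = 1 }
         /CRS-4530/ { css_down = 1 }
         END { exit !(ohas && css_down) }'
}

# Example usage on the fenced node:
#   crsctl check crs | ohas_only && echo "OHAS up, CSS down"
```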

-- Check that resources cssd, crsd and HAIP are down on host02

[root@host02 ~]# crsctl stat res -t -init
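--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------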
ora.asm
      1        ONLINE  OFFLINE                                                   
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE                                                   
ora.crf
      1        ONLINE  ONLINE       host02                                       
ora.crsd
      1        ONLINE  OFFLINE                                                   
ora.cssd
      1        ONLINE  OFFLINE                               STARTING            
ora.cssdmonitor
      1        ONLINE  ONLINE       host02                                       
ora.ctssd
      1        ONLINE  OFFLINE                                                   
ora.diskmon
      1        OFFLINE OFFLINE                                                   
ora.drivers.acfs
      1        ONLINE  ONLINE       host02                                       
ora.evmd
      1        ONLINE  OFFLINE                                                   
ora.gipcd
      1        ONLINE  ONLINE       host02                                       
ora.gpnpd
      1        ONLINE  ONLINE       host02                                       
ora.mdnsd
      1        ONLINE  ONLINE       host02
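
Rather than reading the table by eye, the resources whose TARGET is ONLINE but whose STATE is OFFLINE can be listed with a small filter. This is a sketch that assumes the two-line-per-resource layout shown above (resource name on one line, state row on the next); the function name `down_resources` is hypothetical.

```shell
# Hypothetical helper: read `crsctl stat res -t -init` output on stdin and
# print each resource whose TARGET is ONLINE while its STATE is OFFLINE.
down_resources() {
    awk '/^ora\./ { name = $1; next }            # remember the resource name
         name != "" && $2 == "ONLINE" && $3 == "OFFLINE" { print name }
         { name = "" }                           # state row consumed, reset'
}

# Example usage:
#   crsctl stat res -t -init | down_resources
```

For the output above this would list ora.asm, ora.cluster_interconnect.haip, ora.crsd, ora.cssd, ora.ctssd and ora.evmd, while ora.diskmon is skipped because its TARGET is OFFLINE.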

-- Check that host02 is no longer part of the cluster

[root@host01 cluster01]# olsnodes -s
host01  Active
host02  Inactive
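
On a larger cluster the same check can be scripted. A minimal sketch, assuming the two-column output of `olsnodes -s` shown above arrives on stdin; the function name `inactive_nodes` is hypothetical.

```shell
# Hypothetical helper: read `olsnodes -s` output on stdin and print the
# names of the nodes whose status column is "Inactive".
inactive_nodes() {
    awk '$2 == "Inactive" { print $1 }'
}

# Example usage from a surviving node:
#   olsnodes -s | inactive_nodes
```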

-- Restart the iSCSI service on host02

[root@host02 ~]# service iscsi start
iscsid dead but pid file exists
Turning off network shutdown. 

Starting iSCSI daemon:                                     [  OK  ]
                                                           [  OK  ]
Setting up iSCSI targets: Logging in to [iface: default, target: iqn.2006-01.com.openfiler:tsn.e55ea88d0212, portal: 192.9.201.182,3260]
Login to [iface: default, target: iqn.2006-01.com.openfiler:tsn.e55ea88d0212, portal: 192.9.201.182,3260]: successful
                                                           [  OK  ]

-- Alert log of host02

-- Note that once the iSCSI service is started, CSSD comes back up and host02 rejoins the cluster

[cssd(5481)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details 

...

2013-10-09 11:10:43.897
[cssd(5481)]CRS-1707:Lease acquisition for node host02 number 2 completed

2013-10-09 11:10:47.629
[cssd(5481)]CRS-1605:CSSD voting file is online: ORCL:ASMDISK013; details in /u01/app/11.2.0/grid/log/host02/cssd/ocssd.log.

2013-10-09 11:10:54.652
[cssd(5481)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02 .
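
The CRS-1601 message is the one that confirms the rejoin, so a script can look at the most recent one and check whether the node is listed. A sketch only, assuming the message format shown above ("Active nodes are ... ."); the function name `node_in_last_reconfig` is hypothetical.

```shell
# Hypothetical helper: read alert log text on stdin and succeed (exit 0)
# if the node given as $1 appears in the last CRS-1601 reconfiguration line.
node_in_last_reconfig() {
    awk -v node="$1" '
        /CRS-1601:/ { last = $0 }                  # keep the latest match
        END {
            sub(/.*Active nodes are /, "", last)   # drop everything before the list
            sub(/ *\.$/, "", last)                 # drop the trailing " ."
            n = split(last, nodes, " ")
            for (i = 1; i <= n; i++)
                if (nodes[i] == node) exit 0
            exit 1
        }'
}

# Example usage (path is an assumption, adjust for your environment):
#   node_in_last_reconfig host02 < /u01/app/11.2.0/grid/log/host02/alerthost02.log
```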

-- Check that resources HAIP, cssd and crsd have started on host02

[root@host02 ~]# crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       host02                   Started             
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       host02                                       
ora.crf
      1        ONLINE  ONLINE       host02                                       
ora.crsd
      1        ONLINE  ONLINE       host02                                       
ora.cssd
      1        ONLINE  ONLINE       host02                                       
ora.cssdmonitor
      1        ONLINE  ONLINE       host02                                       
ora.ctssd
      1        ONLINE  ONLINE       host02                   OBSERVER            
ora.diskmon
      1        OFFLINE OFFLINE                                                   
ora.drivers.acfs
      1        ONLINE  ONLINE       host02                                       
ora.evmd
      1        ONLINE  ONLINE       host02                                       
ora.gipcd
      1        ONLINE  ONLINE       host02                                       
ora.gpnpd
      1        ONLINE  ONLINE       host02                                       
ora.mdnsd
      1        ONLINE  ONLINE       host02

-- Check that host02 has joined the cluster

[root@host02 ~]# olsnodes -s
host01  Active
host02  Active

References:

http://ora-ssn.blogspot.in/2011/09/reboot-less-node-fencing-in-oracle.html
http://www.trivadis.com/uploads/tx_cabagdownloadarea/Trivadis_oracle_clusterware_node_fencing_v.pdf
http://www.vmcd.org/2012/03/11gr2-rac-rebootless-node-fencing/
