11g R2 RAC: REBOOT-LESS NODE FENCING

Prior to 11.2.0.2, when certain components required by Oracle RAC failed (e.g. the private interconnect or a voting disk), Oracle Clusterware prevented a split-brain by rebooting the affected server(s) immediately, without waiting for ongoing I/O operations to complete or for the file systems to be synchronized. As a result, non-cluster-aware applications were forcibly shut down. Moreover, during a reboot, resources need to be re-mastered across the surviving nodes. In a large cluster with many nodes, this can be a very expensive operation.

This mechanism was changed in version 11.2.0.2 (the first 11g Release 2 patch set).

After deciding which node to evict,

– the clusterware first attempts to clean up the failure within the cluster by killing only the offending process(es) on that node. In particular, I/O-generating processes are killed.

– If all Oracle resources/processes can be stopped and all I/O-generating processes can be killed,

  • clusterware resources are stopped on the node.
  • the Oracle High Availability Services daemon (OHASD) keeps trying to restart the Cluster Ready Services (CRS) stack.
  • once the conditions for starting the CRS stack are re-established, all relevant cluster resources on that node start automatically.

– If, for some reason, not all resources can be stopped or the I/O-generating processes cannot be stopped completely (e.g. they hang in kernel mode or in the I/O path),

  • Oracle Clusterware still performs a reboot or uses IPMI to forcibly evict the node from the cluster, as in earlier versions.
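
The decision described above can be summarized in a few lines of Python. This is only an illustrative sketch of the logic as explained in this post, not Oracle's implementation: the names NodeState, FencingAction and choose_fencing_action are hypothetical, and the two boolean flags are assumptions standing in for whatever checks the clusterware performs internally.

```python
from dataclasses import dataclass
from enum import Enum


class FencingAction(Enum):
    # Reboot-less path: stop the stack, leave the OS running, retry CRS start.
    RESTART_STACK = "restart_stack"
    # Fallback path: behave as before 11.2.0.2 (reboot or IPMI eviction).
    REBOOT_OR_IPMI = "reboot_or_ipmi"


@dataclass
class NodeState:
    # Hypothetical flags, not values exposed by Oracle Clusterware.
    resources_stopped: bool      # all Oracle resources/processes stopped
    io_processes_killed: bool    # all I/O-generating processes killed


def choose_fencing_action(state: NodeState) -> FencingAction:
    """Pick the fencing path for a node already selected for eviction."""
    if state.resources_stopped and state.io_processes_killed:
        # Clean-up succeeded, so the node itself does not need a reboot.
        return FencingAction.RESTART_STACK
    # Something could not be stopped (e.g. a process hanging in kernel mode).
    return FencingAction.REBOOT_OR_IPMI


if __name__ == "__main__":
    print(choose_fencing_action(NodeState(True, True)).value)
    print(choose_fencing_action(NodeState(True, False)).value)
```

The point of the sketch is that the reboot is now the fallback rather than the first reaction: it is chosen only when the clean-up on the evicted node is incomplete.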

This behavior change is particularly useful for non-cluster-aware applications: the data is protected by shutting down only the clusterware stack on the affected node, without rebooting the node itself.

I will demonstrate this functionality in two scenarios:

– Failure of the network heartbeat
– Failure of the disk heartbeat

