11g R2 RAC : VOTING DISK DEMYSTIFIED

                                      Voting disk in 11g

In this post, I will write about the voting disk: what it contains, who updates it, how it is used, where it is stored, and so on.
The voting disk is a key component of the clusterware, and its failure can make the cluster inoperable.
In RAC, the clusterware must know at any point in time which nodes are members of the cluster so that
- it can perform load balancing
- In case a node fails, it can perform failover of resources as defined in the resource profiles
- If a node joins, it can start resources on it as defined in OCR/OLR
- If a node joins, it can assign VIP to it in case GNS is in use
- If a node fails, it can execute callouts if defined
and so on
Hence, there must be a way by which clusterware can find out about the node membership.
  That is where the voting disk comes into the picture. It is the place where nodes mark their attendance. Consider an analogy in which a manager wants to find out which of his subordinates are present. He can just check the attendance register and assign them their tasks accordingly. Similarly, the CSSD process on every node makes entries in the voting disk to assert the membership of that node. The voting disk thus records node membership information. If the voting disks are lost, the entire Oracle 11g RAC clustered environment is adversely affected and an outage may result.
Also, communication between the nodes of a cluster is of paramount importance. Nodes that cannot communicate with other nodes should be evicted from the cluster. While marking their own presence, all nodes also register in the voting disk information about their ability to communicate with the other nodes. This is the disk heartbeat; the network heartbeat, by contrast, is exchanged between the nodes over the private interconnect. The CSSD process on each RAC node maintains its heartbeat in a block one OS block in size, in the hot block of the voting disk at a specific offset. The written block has a header area with the node name. The heartbeat counter increments on every write call, once per second. Thus the heartbeats of the various nodes are recorded at different offsets in the voting disk. In addition to maintaining its own disk block, each CSSD process also monitors the disk blocks maintained by the CSSD processes running on the other cluster nodes. Healthy nodes have continuous network and disk heartbeats exchanged between them. A break in the heartbeat indicates a possible error scenario. If a node's disk block is not updated within a short timeout period, that node is considered unhealthy and may be rebooted to protect the database information. In this case, a message to this effect is written in the kill block of the node. Each node reads its kill block once per second; if the kill block has been overwritten, the node commits suicide.
During a reconfiguration (a node joining or leaving), CSSD monitors all nodes and determines whether each node has a disk heartbeat, including those with no network heartbeat. If no disk heartbeat is detected, the node is declared dead.
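To make the disk heartbeat idea concrete, here is a minimal Python sketch of it. It is only a toy simulation, not how CSSD is implemented: the `HeartbeatBlock` class, the function names, and the 5-second timeout are all assumptions made for illustration. Each node bumps a counter in its own block once per simulated second, and a monitor reports any node whose block has not been updated within the timeout.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatBlock:
    """One node's heartbeat slot in a simulated voting disk (hypothetical layout)."""
    node_name: str           # header area: the node that owns this block
    counter: int = 0         # incremented on every heartbeat write
    last_write: float = 0.0  # simulated time of the last write, in seconds


def write_heartbeat(block, now):
    """Simulate one disk heartbeat write by the node that owns the block."""
    block.counter += 1
    block.last_write = now


def stale_nodes(blocks, now, timeout):
    """Return the names of nodes whose block was not updated within `timeout` seconds."""
    return sorted(b.node_name for b in blocks if now - b.last_write > timeout)


# Simulate 10 seconds: host01 writes every second, host02 stops after t=3.
blocks = {name: HeartbeatBlock(name) for name in ("host01", "host02")}
for t in range(1, 11):
    write_heartbeat(blocks["host01"], float(t))
    if t <= 3:
        write_heartbeat(blocks["host02"], float(t))

# The 5-second timeout is an illustrative value, not an Oracle default.
print(stale_nodes(blocks.values(), now=10.0, timeout=5.0))
```

In this run only host02 is reported, because its last write is 7 simulated seconds old while host01 wrote at t=10.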
What is stored in voting disk?
------------------------------
Voting disks contain static and dynamic data.
Static data : Info about nodes in the cluster
Dynamic data : Disk heartbeat logging
It holds important details about cluster node membership, such as
- which node is part of the cluster,
- who (node) is joining the cluster, and
- who (node) is leaving the cluster.
Why is voting disk needed?
--------------------------
The voting disk files are used by Oracle Clusterware as a health check:
- by CSS to determine which nodes are currently members of the cluster
- in concert with other cluster components such as CRS, to shut down, fence, or reboot one or more nodes whenever network communication is lost between nodes within the cluster, in order to prevent the dreaded split-brain condition, in which two or more instances attempt to control the RAC database. It thus protects the database information.
- by the CSS daemon to arbitrate with peers that it cannot see over the private interconnect in the event of an outage, allowing it to salvage the largest fully connected subcluster for further operation. It checks the voting disk to determine whether any other node in the cluster has failed. During this operation, the Node Monitor (NM) makes an entry in the voting disk to record its vote on availability. Similar operations are performed by the other instances in the cluster. The three voting disks configured also provide a method to determine who in the cluster should survive. For example, if one of the nodes has to be evicted because it is unresponsive, the node that can access two voting disks starts evicting the other node. NM alternates its action between the heartbeat and the voting disk to determine the availability of the other nodes in the cluster.
The voting disk is the key communication mechanism within Oracle Clusterware, where all nodes in the cluster read and write heartbeat information. The CSSD (Cluster Synchronization Services daemon) processes monitor the health of RAC nodes using two distinct heartbeats: the network heartbeat and the disk heartbeat. Healthy nodes have continuous network and disk heartbeats exchanged between them. A break in the heartbeat indicates a possible error scenario. A few different scenarios are possible with missing heartbeats:
1. Network heart beat is successful, but disk heart beat is missed.
2. Disk heart beat is successful, but network heart beat is missed.
3. Both heart beats failed.
In addition, with numerous nodes, other scenarios are possible too, for example:
1. Nodes have split into N sets of nodes, communicating within each set but not with members of the other sets.
2. Just one node is unhealthy.
Nodes with quorum will maintain active membership of the cluster and other node(s) will be fenced/rebooted.
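The split scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `surviving_subcluster` is hypothetical, node numbers stand in for nodes, and the tie-break (the set containing the lowest node number wins) is an assumption of this sketch rather than something taken from the text above.

```python
def surviving_subcluster(subclusters):
    """Pick the sub-cluster that keeps running after a network split.

    `subclusters` is a list of sets of node numbers; nodes inside one set can
    still talk to each other. The largest set wins. A tie is broken in favour
    of the set holding the lowest node number (an assumption of this sketch).
    """
    return max(subclusters, key=lambda s: (len(s), -min(s)))


# A 3-node cluster where node 3 has lost the interconnect to nodes 1 and 2.
split = [{1, 2}, {3}]
winner = surviving_subcluster(split)
evicted = set().union(*split) - winner

print("survives:", sorted(winner))
print("evicted: ", sorted(evicted))
```

Here nodes 1 and 2 form the largest fully connected sub-cluster, so node 3 is the one that gets fenced.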
Why should we have an odd number of voting disks?
-------------------------------------------------
An odd number of voting disks provides a method to determine who in the cluster should survive.
A node must be able to access more than half of the voting disks at any time. For example, take a two-node cluster with an even number of voting disks, say 2. Suppose Node1 can access only voting disk1 and Node2 can access only voting disk2. Then there is no common file where the clusterware can check the heartbeat of both nodes. If we have 3 voting disks and both nodes can access more than half of them, i.e. 2 voting disks, there will be at least one disk accessible by both nodes. The clusterware can use that disk to check the heartbeat of both nodes. Hence, each node should be able to access more than half of the voting disks. A node that cannot do so will have to be evicted from the cluster by another node that can access more than half of the voting disks, to maintain the integrity of the cluster. After the cause of the failure has been corrected and access to the voting disks has been restored, you can instruct Oracle Clusterware to recover the failed node and restore it to the cluster.
   Loss of half or more of your voting disks will cause the entire cluster to fail!
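The "more than half" rule is easy to check with a little arithmetic. The following Python sketch (the function names are mine, purely for illustration) computes the majority for a given number of voting disks and verifies by brute force that any two nodes that each see a majority must see at least one disk in common.

```python
from itertools import combinations


def majority(total_disks):
    """Smallest number of voting disks that is more than half of the total."""
    return total_disks // 2 + 1


def always_share_a_disk(total_disks):
    """True if any two nodes that each see a majority must see a common disk."""
    need = majority(total_disks)
    views = [set(c) for c in combinations(range(total_disks), need)]
    return all(a & b for a in views for b in views)


for n in (2, 3, 4, 5):
    print(
        f"{n} voting disks: need {majority(n)}, "
        f"can lose {n - majority(n)}, "
        f"common disk guaranteed: {always_share_a_disk(n)}"
    )

# With 2 disks and each node seeing only 1, the two views can be disjoint,
# which is exactly the problem described above.
print({0} & {1})  # empty set: no common disk
```

Note that 4 voting disks tolerate the loss of only 1, the same as 3, which is why an even number buys nothing.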
Where is voting disk stored?
----------------------------
 The voting disk is a shared disk that is accessed by all member nodes in the cluster during operation. Hence, the voting disks must be on shared, accessible storage.
- You should plan on allocating 280 MB for each voting disk file.
- Prior to 11g R2 RAC, it could be placed on
   . a raw device
   . a clustered filesystem supported by Oracle RAC, such as OCFS, Sun Cluster, or Veritas Cluster filesystem
- As of 11g R2 RAC, it can be placed on ASM disks. This simplifies management and improves performance, but it also raises a puzzle. For a node to join the cluster, it must be able to access the voting disk, but the voting disk is on ASM and ASM cannot be up until the node is up. To resolve this, Oracle ASM reserves several blocks at a fixed location on every Oracle ASM disk used for storing the voting disk. As a result, Oracle Clusterware can access the voting disks in ASM even if the ASM instance is down, and CSS can continue to maintain the Oracle cluster even if the ASM instance has failed. Because the physical location of the voting files on the ASM disks is fixed, the cluster stack does not rely on a running ASM instance to access the files. The location of the file is visible in the ASM disk header (dumping the file out of ASM with dd is quite easy):
oracle@rac1:~/ [+ASM1] kfed read /dev/sdf | grep -E 'vfstart|vfend'
kfdhdb.vfstart:                   96 ; 0x0ec: 0x00000060
kfdhdb.vfend:                    128 ; 0x0f0: 0x00000080

 - The voting disk is not striped but is placed as a whole on ASM disks.
 - If the disk containing the voting disk fails, Oracle ASM will choose another disk on which to store this data.
 - It eliminates the need for a third-party cluster volume manager.
 - It reduces the complexity of managing disk partitions for voting disks during Oracle Clusterware installations.
 - The voting disk needs to be mirrored: should it become unavailable, the cluster will come down. Hence, you should maintain multiple copies of the voting disks on separate disk LUNs so that you eliminate a single point of failure (SPOF) in your Oracle 11g RAC configuration.
- If the voting disk is stored on ASM, the multiplexing level of the voting disk is decided by the redundancy of the diskgroup.
Redundancy of the diskgroup    # of copies of voting disk    Minimum # of disks in the diskgroup
External                       1                             1
Normal                         3                             3
High                           5                             5
- If the voting disk is on a diskgroup with external redundancy, one copy of the voting file is stored on one disk in the diskgroup.
- If we store the voting disk on a diskgroup with normal redundancy, we should be able to tolerate the loss of one disk, i.e. even if we lose one disk, we should still have enough voting disks for the clusterware to continue. If the diskgroup had only 2 disks (the minimum required for normal redundancy), we could store only 2 copies of the voting disk on it. If we lost one disk, only one copy of the voting disk would be left and the clusterware could not continue, because to continue it must be able to access more than half of the voting disks, i.e. more than 2 * 1/2 = 1, i.e. both copies. Hence, to be able to tolerate the loss of one disk, we should have 3 copies of the voting disk on a diskgroup with normal redundancy. So a normal redundancy diskgroup holding the voting disk should have at least 3 disks in it.
- Similarly, if we store the voting disk on a diskgroup with high redundancy, 5 voting files are placed, each on one ASM disk, i.e. a high redundancy diskgroup should have at least 5 disks so that even if we lose 2 disks, the clusterware can continue.
 - Ensure that all the nodes participating in the cluster have read/write permissions on disks.
 - You can have up to a maximum of 15 voting disks. However, Oracle recommends that you do not go beyond five voting disks.
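The redundancy table above can be turned into a small lookup, which is handy for checking whether a diskgroup is big enough before moving the voting disk onto it. This is only a sketch of the rules stated in this post; the dictionary and function names are hypothetical and nothing here talks to ASM.

```python
# Number of voting files ASM places per diskgroup redundancy level,
# as given in the table in this post.
VOTING_FILES_BY_REDUNDANCY = {"external": 1, "normal": 3, "high": 5}


def disk_failures_tolerated(redundancy):
    """How many disks holding a voting file can be lost while a majority remains."""
    files = VOTING_FILES_BY_REDUNDANCY[redundancy]
    return (files - 1) // 2


def can_hold_voting_files(redundancy, disks_in_group):
    """Each voting file sits on its own disk, so the group needs that many disks."""
    return disks_in_group >= VOTING_FILES_BY_REDUNDANCY[redundancy]


for level in ("external", "normal", "high"):
    print(level, VOTING_FILES_BY_REDUNDANCY[level], disk_failures_tolerated(level))

# A normal redundancy diskgroup with only 2 disks is too small for 3 voting files.
print(can_hold_voting_files("normal", 2))
print(can_hold_voting_files("normal", 3))
```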
Backing up voting disk
----------------------
In previous versions of Oracle Clusterware you needed to back up the voting disks with the dd command. Starting with Oracle Clusterware 11g Release 2 you no longer need to back up the voting disks. The voting disk data is automatically backed up as part of the OCR. In fact, Oracle explicitly states that you should not use a tool like dd to back up or restore voting disks. Doing so can lead to the loss of the voting disk.
In those earlier versions, although the voting disk contents did not change frequently, you needed to back up the voting disk file every time
- you add or remove a node from the cluster or
- immediately after you configure or upgrade a cluster.
  A node in the cluster must be able to access more than half of the voting disks at any time. To tolerate the failure of n voting disks, at least 2n+1 voting disks must be configured. Therefore, it is strongly recommended that you configure an odd number of voting disks, such as 3 or 5.
Check the location of the voting disk
grid@host01$ crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   243ec3b2a3cf4fbbbfed6f20a1ef4319 (ORCL:ASMDISK01) [DATA]
Located 1 voting disk(s).
-- We can see that there is only one copy of the voting disk, on the data diskgroup, which has external redundancy.
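If you want to check this from a script rather than by eye, the output of crsctl query css votedisk is simple enough to parse. Below is a minimal Python sketch that works on the sample output shown above, pasted in as a string; the regular expression and the `parse_votedisks` name are my own, and the layout is assumed to match this sample.

```python
import re

# Sample output copied from the crsctl query css votedisk run above.
SAMPLE = """\
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   243ec3b2a3cf4fbbbfed6f20a1ef4319 (ORCL:ASMDISK01) [DATA]
Located 1 voting disk(s).
"""

# One data row: number, state, 32-hex-digit UID, (file name), [disk group].
ROW = re.compile(
    r"^\s*(?P<num>\d+)\.\s+(?P<state>\S+)\s+(?P<uid>[0-9a-f]{32})\s+"
    r"\((?P<file>[^)]*)\)\s+\[(?P<group>[^\]]*)\]\s*$"
)


def parse_votedisks(text):
    """Return one dict per voting disk row found in the command output."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append(match.groupdict())
    return rows


disks = parse_votedisks(SAMPLE)
for d in disks:
    print(d["state"], d["uid"], d["file"], d["group"])
print(len(disks), "voting disk(s) parsed")
```

The header, separator, and "Located" lines do not match the row pattern, so only the single voting disk row is returned.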
As I mentioned earlier, Oracle writes the voting files to the underlying disks at pre-designated locations so that it can read their contents when the cluster starts up.
Let's see that with an actual example, using the CSS logs. They are located at $ORACLE_HOME/log//cssd. Here is an excerpt from one of the logs. The line says that it found a "potential" voting file on one of the disks: 243ec3b2-a3cf4fbb-bfed6f20-a1ef4319.
grid@host01$ vi /u01/app/11.2.0/grid/log/host01/cssd/ocssd.log
-- Search for the string potential or the File Universal ID 243ec3……
2012-10-09 03:54:28.423: [    CSSD][986175376]clssnmvDiskVerify: Successful discovery for disk ORCL:ASMDISK01, UID 243ec3b2-a3cf4fbb-bfed6f20-a1ef4319,
Create another diskgroup, test, with normal redundancy and 2 disks.
Try to move the voting disk from the data diskgroup to the test diskgroup.
-- Fails, as we need at least 3 disks in the test diskgroup
[grid@host01 cssd]$ crsctl replace votedisk +test
Failed to create voting files on disk group test.
Change to configuration failed, but was successfully rolled back.
CRS-4000: Command Replace failed, or completed with errors.
Add another disk to the test diskgroup and mark it as a quorum disk. The quorum disk is one small disk (300 MB should be on the safe side here, since the voting file is only about 280 MB in size) that keeps one mirror of the voting file. The other two disks will each contain one voting file as well as all the other stripes of the database area, but the quorum disk will get only that one voting file.
Now try to move the voting disk from the data diskgroup to the test diskgroup.
-- Now the operation is successful
[grid@host01 cssd]$ crsctl replace votedisk +test
Successful addition of voting disk 00ce3c95c6534f44bfffa645a3430bc3.
Successful addition of voting disk a3751063aec14f8ebfe8fb89fccf45ff.
Successful addition of voting disk 0fce89ac35834f99bff7b04ccaaa8006.
Successful deletion of voting disk 243ec3b2a3cf4fbbbfed6f20a1ef4319.
Successfully replaced voting disk group with +test.
CRS-4266: Voting file(s) successfully replaced
-- Check ocssd.log and search for 00ce3c9……
grid@host01$ vi $ORACLE_HOME/log/host01/cssd/ocssd.log
2012-10-09 05:08:19.484: [    CSSD][997631888]  Listing unique IDs for 3 voting files:
2012-10-09 05:08:19.484: [    CSSD][997631888]    voting file 1: 00ce3c95-c6534f44-bfffa645-a3430bc3
2012-10-09 05:08:19.484: [    CSSD][997631888]    voting file 2: a3751063-aec14f8e-bfe8fb89-fccf45ff
2012-10-09 05:08:19.484: [    CSSD][997631888]    voting file 3: 0fce89ac35834f99bff7b04ccaaa8006
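Notice that the log prints some UIDs with hyphens while crsctl prints them without. A small Python sketch like the one below can pull the UIDs out of the log lines, strip the hyphens, and confirm that they are the same three files that crsctl replace votedisk reported as added. The log lines and UIDs are copied from the output above; the regular expression and function name are assumptions of this sketch.

```python
import re

# Log lines copied from the ocssd.log excerpt above.
LOG = """\
2012-10-09 05:08:19.484: [    CSSD][997631888]    voting file 1: 00ce3c95-c6534f44-bfffa645-a3430bc3
2012-10-09 05:08:19.484: [    CSSD][997631888]    voting file 2: a3751063-aec14f8e-bfe8fb89-fccf45ff
2012-10-09 05:08:19.484: [    CSSD][997631888]    voting file 3: 0fce89ac35834f99bff7b04ccaaa8006
"""

# UIDs reported by crsctl replace votedisk +test above.
ADDED = {
    "00ce3c95c6534f44bfffa645a3430bc3",
    "a3751063aec14f8ebfe8fb89fccf45ff",
    "0fce89ac35834f99bff7b04ccaaa8006",
}

VOTING_FILE = re.compile(r"voting file (\d+): ([0-9a-f-]+)\s*$")


def uids_from_log(text):
    """Map voting file number to its UID with hyphens removed."""
    found = {}
    for line in text.splitlines():
        match = VOTING_FILE.search(line)
        if match:
            found[int(match.group(1))] = match.group(2).replace("-", "")
    return found


uids = uids_from_log(LOG)
for number, uid in sorted(uids.items()):
    print(number, uid, "added by crsctl" if uid in ADDED else "unknown")
```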
I hope this information was useful.
Keep visiting the blog. Thanks for your time!

References:

Oracle 10g RAC Grid, Services & Clustering, by Murali Vallath

http://orainternals.wordpress.com/2010/10/29/whats-in-a-voting-disk/

17 thoughts on “11g R2 RAC : VOTING DISK DEMYSTIFIED”

  1. Hello Anju,
    We have a 3-node RAC on a RAID 10 SAN, but unfortunately it was configured with ASM external redundancy. What do you think about this configuration? What is the possibility of recovery if the file is corrupt? As far as I know, since we have RAID 10, the file can be recovered if a disk fails. But my concern is what happens if that particular file is corrupt.
    Your suggestions/comment regarding my concerns will be highly appreciated.

    Thanks,
    Pankaj

      1. Yes Anju, I am aware of that, but I just need to know: if I leave it as it is, will it be a disaster? Is this the worst configuration for a production RAC?

        Thanks,
        Pankaj

        1. Hi Pankaj,

          If you have oracle backup (RMAN, ocrconfig etc) of the corrupt file, you can recover corrupted file from its backup. In case you do not have oracle backup, then it can be recovered at RAID level .

          Regards
          Anju

  2. Hi Anju,

    I have had a doubt for quite a long time about the maximum number of voting disks, which can be 32 in Oracle.

    If that is the case, then since there should be an odd number of voting disks, why is the maximum 32, which is an even number?

    Regards,
    Rupesh Choudhary

    1. Hi Rupesh,

      Maximum no. of voting disks supported are 15.
      If you are storing voting disk on ASM, no. of voting disks is decided by redundancy of diskgroup.
      Hence maximum no. of voting disks supported = 5 on high redundancy disk group

      If voting disk is on raw device then you can have 15 voting disks on 15 disks.

      Regards
      Anju

      1. HI Anju,
        Thanks for your reply. I have gone through many docs, and I have read that Oracle can support up to 32 voting disks.
        I totally agree with your suggestion below:
        If you are storing voting disk on ASM, no. of voting disks is decided by redundancy of diskgroup.

        I will be thankful if you can clarify why the maximum is 32.
        Regards,
        Rupesh

  3. Hi,

    What would happen in the scenario of a 3 node RAC with 3 voting disks and the interconnect for each node goes down?
    If each node can still see the voting disks but not each other how is it decided which node(s) to evict?
    What are the rules for the number of voting disks for RAC’s with more than 2 nodes?

    Many thanks,
    John

    1. The node that takes control of the DB control file first remains in the cluster, and the others are evicted and rebooted.
      No. of VD should be odd and depends on the level of redundancy chosen.

  4. one of the best article on VD .. clearing many basic doubts.. your articles are worth reading everytime.

  5. Very good article. I have a doubt: recently, due to a SAN failover issue, all 3 voting disks were inaccessible and the nodes rebooted. We want to avoid this situation in future. Could you suggest how to manage our voting disks? We thought of some options, like
    1) Keep voting disks on different controllers(1 on NFS 2 on controller 1 and 2 on other controller 2)
    2) Currently the voting disks are on NFS, so if we move them to ASM, will that help?
    3) Keep voting disks on separate NFS mount points
    Please advise
    Thanks,
    Kanu

Your comments and suggestions are welcome!