Category Archives: 11g R2 RAC

11g R2 RAC: SERVER POOLS

                                        

In this post, I will explain Server Pools. 11g R2 RAC introduced server pools, which revolutionize cluster management and ease the life of DBAs. To understand server pools, let us first see how things were done earlier:

 

ADMIN MANAGED DATABASES
 In prior releases of Oracle RAC, the DBA would explicitly manage which nodes of a cluster would be used to run the various instances associated with a RAC database. Whenever a node was added/removed from the cluster, DBAs were responsible for adding/removing instances from a RAC database, including the creation of redo log threads and undo tablespaces. Additionally, database services would be manually configured with preferred and available nodes, which facilitated the balancing of connections and failover of the services in a RAC environment. This scheme worked well for a cluster with a small number of nodes.

 

NEED FOR POLICY MANAGED DATABASES
- As the number of nodes grew large, the DBA's tasks became very complicated.
- It was not possible to add more instances dynamically on newly added servers.
- It was not possible for additional instances of a database to start automatically as some servers in the cluster became idle.
- It was not possible for a service to become available on newly added nodes dynamically.

 

Now, from Oracle 11g R2 RAC, all these tasks are automated using server pools, a new feature of Oracle 11g R2 RAC. Server pools logically divide the cluster into small segments that can be used for varying workloads. Databases and services can now be defined to run in server pools, and the servers in the server pools can change dynamically.

 

ENTER SERVER POOLS
Oracle 11g R2 introduced server pools, which let us divide the cluster into smaller groups of servers. Each pool can be assigned the following attributes:

 

IMPORTANCE: 0 (default) to 1000 – Determines how servers are reconfigured among the various pools when a node joins or leaves the cluster. Using the IMPORTANCE attribute, you can ensure that a low-precedence workload does not consume resources needed by more important workloads.

 

MAX_SIZE: The maximum number of servers a server pool can contain. The default value is -1, which means that all the servers in the cluster can be a part of this pool.

 

MIN_SIZE: The minimum size of a server pool. If the number of servers contained in a server pool is below the number you specify in this attribute, then Oracle Clusterware automatically moves servers from other pools into this one until that number is met. The default value is 0.

 

ACL: A string in the format owner:user:rwx,pgrp:group:rwx,other::r-- that defines the owner of the server pool and the privileges granted to various operating system users and groups:
  • r: Read only
  • w: Modify attributes of the pool or delete it
  • x: Assign resources to this pool
By default, the identity of the client that creates the server pool is the owner.
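
As a hedged illustration of modifying this attribute (the pool name sp1 and the grid/oinstall OS accounts are assumptions, not taken from the demonstration later in this post), the ACL of an existing pool can be changed with crsctl modify serverpool:

[grid@host02 ~]$ crsctl modify serverpool sp1 -attr "ACL='owner:grid:rwx,pgrp:oinstall:rwx,other::r--'"
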
Oracle Clusterware can move servers between server pools automatically in order to maintain the configuration of the more important server pools. By default, two pools are created after installation of the cluster: the Generic pool and the Free pool.

 

Generic Pool: The servers in this pool are used to host Admin-managed databases.
  When you upgrade an existing version of the clusterware to 11.2, all the nodes get mapped to the built-in Generic pool. This is an internally managed server pool, and modification of its attributes is not allowed. When you create an Admin-managed database, it also becomes a part of the Generic pool as a child pool of it.

 

Free Pool: All the servers which have not been assigned to any other pool are part of this pool. When the clusterware starts, all the servers are initially assigned to this pool. As applications start, servers move from this pool to the Generic or user-defined pools dynamically. Like the Generic pool, this is also an internally managed pool, but some of its attributes, such as IMPORTANCE, can still be modified by the DBA.
In addition to the above pools, users can also define their own server pools.

 

Hosts are allocated to the pools in the following order:
     1. Generic Pool
     2. Explicitly created pool
     3. Free Pool

 

Servers are assigned to explicitly created pools according to the following rules:
1. Fill all server pools in order of importance until they meet their minimum.
2. Fill all server pools in order of importance until they meet their maximum.
3. By default, any leftover servers go to the Free pool.

 

If servers leave the cluster for any reason, a server reallocation may take place. If there are servers in the Free Pool and another server pool falls below its maximum value, a free server is allocated to the affected server pool. If there are no free servers, then server reallocation takes place only if a server pool falls below its minimum level. If that occurs, a server will be sourced from one of the following locations in the following order:

1. The server pool with the lowest importance that has more than its minimum number of servers
2. The server pool with the lowest importance that has at least one server and has lower importance than the affected server pool

Using these mechanisms, server pools maintain an optimal allocation of resources based on the number of servers currently available to the cluster.
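
Before the demonstration, note that the full attribute list of every pool (MIN_SIZE, MAX_SIZE, IMPORTANCE, ACL, and so on) can be displayed at any time. A minimal sketch, assuming the standard 11.2 command options (-p prints the configured attributes; the pool mypool referenced below is the one created in the demonstration):

[grid@host02 ~]$ crsctl status serverpool -p
[grid@host02 ~]$ srvctl config srvpool -g mypool
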

DEMONSTRATION:
 
Current scenario:
Nodes in the cluster : host01, host02, host03

 

- Let’s see the current status of server pools
#crsctl status serverpool
NAME=Free
ACTIVE_SERVERS=host01 host02 host03
NAME=Generic
ACTIVE_SERVERS=

 

- Create a server pool mypool with min = 2, max = 2, importance = 1
[root@host02 host02]# srvctl add srvpool -g mypool -l 2 -u 2 -i 1

 

- check that two servers have been allocated to mypool from free pool
[root@host02 host02]# crsctl stat serverpool
NAME=Free
ACTIVE_SERVERS=host01
NAME=Generic
ACTIVE_SERVERS=
NAME=ora.mypool
ACTIVE_SERVERS=host02 host03

 

- Add another server pool mypool1 with min = 2, max = 2, imp = 1
[root@host02 host02]# srvctl add srvpool -g mypool1 -l 2 -u 2 -i 1

 

- check that one  server has been allocated to mypool1 from free pool
 [root@host02 host02]# crsctl stat serverpool
NAME=Free
ACTIVE_SERVERS=
NAME=Generic
ACTIVE_SERVERS=
NAME=ora.mypool
ACTIVE_SERVERS=host02 host03
NAME=ora.mypool1
ACTIVE_SERVERS=host01

 

- Increase the importance of mypool1 to 2
[root@host02 host02]# srvctl modify srvpool -g mypool1 -i 2

 

- Check that one server has been moved from mypool (lower importance, 1) to mypool1 (higher importance, 2)
[root@host02 host02]# crsctl stat serverpool
NAME=Free
ACTIVE_SERVERS=
NAME=Generic
ACTIVE_SERVERS=
NAME=ora.mypool
ACTIVE_SERVERS=host03
NAME=ora.mypool1
ACTIVE_SERVERS=host01 host02

 

- Increase the importance of mypool to 3
[root@host02 host02]# srvctl modify srvpool -g mypool -i 3

 

- Check that one server has been moved from mypool1 to mypool
[root@host02 host02]# crsctl stat serverpool
NAME=Free
ACTIVE_SERVERS=
NAME=Generic
ACTIVE_SERVERS=
NAME=ora.mypool
ACTIVE_SERVERS=host02 host03
NAME=ora.mypool1
ACTIVE_SERVERS=host01

 

- Cleanup –
[root@host02 host02]# srvctl remove srvpool -g mypool1
[root@host02 host02]# srvctl remove srvpool -g mypool

 

CHILD POOLS
We can have various child pools for the same parent pool. Each child pool has its own attributes, and servers are allocated to the child pools based on the IMPORTANCE attribute assigned to them. For example, we can have a database which runs in a parent pool and assign various services of that database to different child pools. We can set up different attributes for the child pools depending on the importance that we want to give to each service of the database. The hosts assigned to various parent pools are mutually exclusive, i.e. the same server can't be assigned to different parent pools. For child pools, whether the same server can be assigned to different child pools depends on the value of their EXCLUSIVE_POOLS attribute. If a string value is assigned to this attribute and it is shared by two child pools, a host assigned to one of those child pools can't be shared by its sibling.

 

– Create a parent pool pmpool3 with min = 3, max=3 and importance = 1
[root@host02 host02]# srvctl add srvpool -g pmpool3 -l 3 -u 3 -i 1

 

– Check that all the three servers have been assigned to pmpool3 –
[root@host02 host02]# crsctl stat serverpool
NAME=Free
ACTIVE_SERVERS=
NAME=Generic
ACTIVE_SERVERS=
NAME=ora.pmpool3
ACTIVE_SERVERS=host01 host02 host03

 

- Add a child pool ch1 with parent pool = pmpool3, min = 2, max =2 and importance = 1
[grid@host02 ~]$ crsctl add serverpool ch1 -attr "PARENT_POOLS=ora.pmpool3,MIN_SIZE=2,MAX_SIZE=2,IMPORTANCE=1"

 

- Check that two servers have been allocated to child pool ch1
[root@host02 ~]# crsctl status serverpool
NAME=Free
ACTIVE_SERVERS=
NAME=Generic
ACTIVE_SERVERS=
NAME=ch1
ACTIVE_SERVERS=host01 host02
NAME=ora.pmpool3
ACTIVE_SERVERS=host01 host02 host03

 

- Add another child pool ch2 with min = 2, max = 2, importance = 2(higher than ch1(1))
 [grid@host02 ~]$ crsctl add serverpool ch2 -attr "PARENT_POOLS=ora.pmpool3,MIN_SIZE=2,MAX_SIZE=2,IMPORTANCE=2"

 

- Check that two servers have been allocated to child pool ch2. Since we haven’t set the parameter EXCLUSIVE_POOLS, the same hosts host01 and  host02  are  shared by both the children.
[root@host02 ~]# crsctl status serverpool
NAME=Free
ACTIVE_SERVERS=
NAME=Generic
ACTIVE_SERVERS=
NAME=ch1
ACTIVE_SERVERS=host01 host02
NAME=ch2
ACTIVE_SERVERS=host01 host02
NAME=ora.pmpool3
ACTIVE_SERVERS=host01 host02 host03
————————————————————————————

 

EXCLUSIVE_POOLS

 

– Delete both child pools ch1 and ch2
[grid@host02 ~]$ crsctl delete serverpool ch1 
[grid@host02 ~]$ crsctl delete serverpool ch2

 

– Add child pool ch1 with min = 2,max = 2, importance = 1, exclusive_pools=’x’
[grid@host02 ~]$ crsctl add serverpool ch1 -attr "PARENT_POOLS=ora.pmpool3,MIN_SIZE=2,MAX_SIZE=2,IMPORTANCE=1,EXCLUSIVE_POOLS='x'"

 

– check that 2 servers have been assigned to child pool ch1
[grid@host02 ~]$ crsctl stat serverpool
NAME=Free
ACTIVE_SERVERS=
NAME=Generic
ACTIVE_SERVERS=
NAME=ch1
ACTIVE_SERVERS=host01 host02
NAME=ora.pmpool3
ACTIVE_SERVERS=host01 host02 host03

 

- Add another child pool ch2 with min = 2, max = 2, importance = 2(higher than ch1(1)), exclusive_pools = x so that it won’t share any servers with ch1
[grid@host02 ~]$ crsctl add serverpool ch2 -attr "PARENT_POOLS=ora.pmpool3,MIN_SIZE=2,MAX_SIZE=2,IMPORTANCE=2,EXCLUSIVE_POOLS='x'"

 

- Check that one server has been moved from ch1 to ch2, as the importance of ch2 is higher and servers are not shared between ch1 and ch2.
[grid@host02 ~]$ crsctl stat serverpool
NAME=Free
ACTIVE_SERVERS=
NAME=Generic
ACTIVE_SERVERS=
NAME=ch1
ACTIVE_SERVERS=host01
NAME=ch2
ACTIVE_SERVERS=host02 host03
NAME=ora.pmpool3
ACTIVE_SERVERS=host01 host02 host03
=========

 

With the introduction of server pools, we can have two management styles for a RAC database.

 

Administrator-managed databases (similar to what we had prior to 11g R2)
  . The DBA defines which servers the database should run on.
  . The DBA decides the servers (preferred/available nodes) on which the services of the database should run.
  . The DBA decides which instance of the database will run on which server.
  . This database type assigns its instances to a child pool of the built-in Generic pool.

 

Although this type seems much simpler, it can't be used once the number of nodes grows beyond what a DBA can comfortably manage, for example 100. In that case, the other type of 11.2 database is helpful: the policy-managed database.

 

Policy-Managed Databases
- The database is associated with a server pool rather than a specific set of nodes. The pool decides the minimum and maximum number of servers needed by that resource (database, service, third-party application).
- The database runs on the servers which have been allocated to the server pool it has been assigned to.
- Since the servers allocated to the server pool can change dynamically, this allows Oracle to dynamically deliver services based on the total number of servers available to the cluster.
  . The database will be started on enough servers, subject to the availability of servers.
  . We need not hardcode the servers a database should run on.
  . Any instance of the database can run on any node; there is no fixed mapping between instance number and node.
  . As servers are freed/added/deleted, they are allocated to the existing pools as per the rules mentioned earlier (a minimal command sketch follows this list).
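
As a hedged sketch of how a database and a service can be tied to a server pool with srvctl (the names pmdb and pmsrv and the Oracle home path are illustrative assumptions; in practice DBCA usually registers a policy-managed database for you):

[oracle@host01 ~]$ srvctl add srvpool -g mypool -l 2 -u 3 -i 5
[oracle@host01 ~]$ srvctl add database -d pmdb -o /u01/app/oracle/product/11.2.0/dbhome_1 -g mypool
[oracle@host01 ~]$ srvctl add service -d pmdb -s pmsrv -g mypool -c UNIFORM
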

 

For example, consider a cluster consisting of eight nodes in total and supporting three RAC databases. Each database would be defined with a minimum and maximum number of servers. Let's assume that

 

DB1 is defined with a minimum of 4 and a maximum of 6 with an importance of 10,
DB2 is defined with a minimum of 2 and maximum of 3 and an importance of 7, and
DB3 is set to a minimum of 2 and maximum of 3 with an importance of 5.

 

Initially the 8 nodes could be configured as nodes 1-4 allocated to DB1, nodes 5-6 allocated to DB2 and nodes 7-8 allocated to DB3. If node 3 failed for some reason, the system would allocate node 7 or 8 to DB1 because it has a higher importance than DB3 and a minimum requirement of 4 servers, even though it would cause DB3 to fall below the minimum number of servers. If node 3 is re-enabled it would be allocated immediately to DB3 to bring that database back to its minimum required servers.

 

If a 9th node were added to the cluster, it would be assigned to DB1 because of the importance factor and the fact that DB1 has not yet hit its maximum number of servers.
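
A hedged sketch of the pool definitions behind this example, using the same srvctl syntax as the demonstration above (the pool names db1pool, db2pool and db3pool are illustrative):

[oracle@host01 ~]$ srvctl add srvpool -g db1pool -l 4 -u 6 -i 10
[oracle@host01 ~]$ srvctl add srvpool -g db2pool -l 2 -u 3 -i 7
[oracle@host01 ~]$ srvctl add srvpool -g db3pool -l 2 -u 3 -i 5
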

Conclusion
As the number of nodes in a cluster increases, DBAs will benefit greatly from server pools, as the management of clusterware resources becomes easy and dynamic.
In my next post, I will write about instance failover in server pools.
References:
http://docs.oracle.com/cd/E11882_01/rac.112/e16794/admin.htm
 http://allthingsoracle.com/an-introduction-to-11-2-rac-server-pools/
http://www.mahir-quluzade.com/2013/05/converting-administrator-managed.html
Policy based cluster management in Oracle 12c

———————————————————————————

11g R2 RAC: SCAN DEMYSTIFIED

Single Client Access Name (SCAN) is a new feature introduced in 11g R2 RAC. It enables clients to use a single hostname to access Oracle databases running in the cluster. Clients using SCAN do not need to change their TNS configuration if you add or remove nodes in the cluster.
To understand it better, let us go back a little to 10g/11gR1 RAC.
Let us say we have a database orcl which is running on 3 nodes in a cluster;
Name of the cluster: cluster01.example.com
Instance       Host name                       IP address             VIP       Pvt network
orcl1       host01.example.com       192.9.201.181   192.9.201.182    10.0.0.1
orcl2       host02.example.com       192.9.201.183   192.9.201.184    10.0.0.2
orcl3       host03.example.com       192.9.201.185   192.9.201.186    10.0.0.3
DNS server
               server1.example.com       192.9.201.59
Storage
               openfiler1.example.com    192.9.201.9
Prior to 11g R2, when the user requested the connection to the database by issuing
$sqlplus hr/hr@orcl
Entry in tnsnames.ora on client side was read which looked like:
ORCL =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = host01-vip.example.com)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = host02-vip.example.com)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = host03-vip.example.com)(PORT = 1521))
      (LOAD_BALANCE = yes)
    )
    (CONNECT_DATA =
      (SID = orcl)
    )
  )
The TNS entry contains references (addresses) to all the nodes in the cluster. The request is sent to any one of the hosts listed in ADDRESS_LIST.
To resolve the host name two methods were available :
- hosts file
- DNS Server
Let us discuss hosts file first. The /etc/hosts file contained entries using which hostnames could be resolved to corresponding VIPs.
#Public Network
192.9.201.181    host01.example.com host01
192.9.201.183    host02.example.com host02
192.9.201.185    host03.example.com host03
#Private Network
10.0.0.1        host01-priv.example.com host01-priv
10.0.0.2        host02-priv.example.com host02-priv
10.0.0.3        host03-priv.example.com host03-priv
#Virtual Network
192.9.201.182   host01-vip.example.com host01-vip
192.9.201.184  host02-vip.example.com host02-vip
192.9.201.186   host03-vip.example.com host03-vip
#Storage
192.9.201.9     openfiler1.example.com openfiler
Once the VIP of the host was known, the request was sent to that IP address, where it was received by the local listener running there. The listener's information was stored in the listener.ora file on the server side.
In this method of hostname resolution, whenever we added or deleted a node, we needed to modify both tnsnames.ora and the hosts file on each client machine, which was a cumbersome task. This problem with the hosts file was resolved by storing the hostname resolution entries in a centralised DNS server. Now, on adding/deleting a node, hostname resolution entries needed to be modified only in one place, i.e. the DNS server, but tnsnames.ora still needed to be modified on each client machine.
To avoid modifying tnsnames.ora on each client machine, SCAN was introduced in 11gR2.
ENTER SCAN in 11g R2 RAC ….
Now we can specify only the cluster name in the tnsnames.ora  and  client would be connected to any of the nodes in the cluster. Let us see how it can be implemented:
 Now tnsnames.ora entry looks like:
ORCL =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = cluster01-scan.cluster01.example.com)(PORT = 1521))
      (LOAD_BALANCE = yes)
    )
    (CONNECT_DATA =
      (SID = orcl)
    )
  )
When a user tries to connect to the orcl database, the name cluster01-scan.cluster01.example.com needs to be resolved. To resolve it, we can again use either the hosts file or a DNS server.
SCAN USING HOSTS FILE
If we use the /etc/hosts file, the entry would look like this:
192.9.201.181   cluster01-scan.cluster01.example.com cluster01-scan
Here, we have the limitation that the SCAN name can resolve to only one SCAN VIP, i.e. all the client connection requests will be directed to the same SCAN listener, so client-side load balancing won't take place. If we want client-side load balancing, the SCAN name should resolve to multiple SCAN VIPs, and for that we must use DNS.
SCAN USING DNS
Hence we include the following in /etc/resolv.conf to forward any name resolution request for example.com to the DNS server running at IP address 192.9.201.59:
/etc/resolv.conf
search example.com
nameserver 192.9.201.59
Various files associated with DNS on DNS server are :
- /var/named/chroot/etc/named.conf
- /var/named/chroot/var/named/forward.zone
- /var/named/chroot/var/named/reverse.zone
- /var/named/chroot/var/named/reverse1.zone
——————————————
- /var/named/chroot/etc/named.conf
——————————————
When a request comes to DNS, this file /var/named/chroot/etc/named.conf
is looked into first
Here it informs that domain example.com’s info is in forward.zone
zone "example.com" IN {
        type master;
        file "forward.zone";
};
In forward.zone , following are specified :
– IP addresses of all the nodes in the cluster
– VIP addresses of all the nodes in the cluster
– Private network  IP addresses of all the nodes in the cluster
– IP address of DNS Server
– IP address of Storage
- SCAN-VIP’s
——————————————
- /var/named/chroot/var/named/forward.zone
—————————————–
This file is used to resolve host names to ip addresses.
                     IN NS           server1
                     IN A            192.9.201.59
server1         IN A            192.9.201.59
host01          IN A            192.9.201.183
host02          IN A            192.9.201.185
host03          IN A            192.9.201.187
openfiler1      IN A            192.9.201.181
host01-priv     IN A            10.0.0.11
host02-priv     IN A            10.0.0.22
host03-priv     IN A            10.0.0.33
host01-vip      IN A            192.9.201.184
host02-vip      IN A             192.9.201.186
host03-vip      IN A             192.9.201.188
cluster01-scan   IN A             192.9.201.190
                 IN A             192.9.201.191
                 IN A             192.9.201.192
After the SCAN name is resolved, requests are sent in round-robin fashion to the SCAN listeners listening on the SCAN VIPs.
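
As a hedged sketch of how this resolution and the SCAN configuration can be verified (standard OS and 11.2 srvctl commands; output will vary with your setup and is not shown here):

$ nslookup cluster01-scan.example.com        # should return the three SCAN VIPs, rotating on each call
$ srvctl config scan                         # shows the SCAN name and its VIPs known to the clusterware
$ srvctl status scan_listener                # shows which nodes the SCAN listeners are currently running on
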
——————————————–
- /var/named/chroot/var/named/reverse.zone
——————————————–
This file is used to resolve ip addresses to host names.
$TTL    86400
@       IN      SOA     server1.example.com. root.server1.example.com.  (
                                       997022700 ; Serial
                                      28800      ; Refresh
                                      14400      ; Retry
                                      3600000    ; Expire
                                      86400 )    ; Minimum
        IN      NS      server1.example.com.
59      IN      PTR     server1.example.com.
182     IN      PTR     openfiler.example.com.
183     IN      PTR     host01.example.com.
185     IN      PTR     host02.example.com.
187     IN      PTR     host03.example.com.
184     IN      PTR     host01-vip.example.com.
186     IN      PTR     host02-vip.example.com.
188     IN      PTR     host03-vip.example.com.
190     IN      PTR     cluster01-scan.example.com
191     IN      PTR     cluster01-scan.example.com
192     IN      PTR     cluster01-scan.example.com
——————————————–
- /var/named/chroot/var/named/reverse1.zone
——————————————–
 This file is used to resolve private interconnect ip addresses
IN      NS      server1.example.com.
11      IN      PTR     host01-priv.example.com.
22      IN      PTR     host02-priv.example.com.
33      IN      PTR     host03-priv.example.com.
Thus, we no longer need to modify the tnsnames.ora file on client machines whenever we add or delete node(s), as the file contains just a reference to the SCAN. Also, we need not modify the entries in DNS on adding/deleting a node, as the SCAN name in DNS resolves to SCAN VIPs and not node VIPs. But we still need to assign VIPs manually to the various nodes. If we let DHCP assign the VIPs to the nodes automatically, and let these VIPs be known to the DNS somehow, we would just need to run a few commands to let the clusterware know about the node addition/deletion and we are done! There is no need to assign VIPs manually.
This feature of 11gR2 RAC is called GPnP – Grid Plug and Play. It is implemented by GNS (Grid Naming Service), which is similar to DNS but is used to resolve the cluster names only within the corporate domain (i.e. example.com).
SCAN WITH GNS
Let us see how DNS entries are modified now :
Various files associated with DNS are :
- /var/named/chroot/etc/named.conf
- /var/named/chroot/var/named/forward.zone
- /var/named/chroot/var/named/reverse.zone
- /var/named/chroot/var/named/reverse1.zone
——————————————
- /var/named/chroot/etc/named.conf
——————————————
When a request comes to DNS, this file is looked into first
Here domain example.com’s info is in forward.zone
zone "example.com" IN {
        type master;
        file "forward.zone";
        allow-transfer { 192.9.201.180; };
};
As per forward.zone, requests for the subdomain cluster01.example.com are forwarded to the address of cluster01-gns, i.e. 192.9.201.180.
——————————————
- /var/named/chroot/var/named/forward.zone
—————————————–
This file is used to resolve host names to ip addresses.
                   IN NS           server1
                   IN A            192.9.201.59
server1         IN A            192.9.201.59
host01          IN A            192.9.201.183
host02          IN A            192.9.201.185
host03          IN A            192.9.201.187
openfiler1      IN A            192.9.201.181
host01-priv     IN A            10.0.0.11
host02-priv     IN A            10.0.0.22
host03-priv     IN A            10.0.0.33
$ORIGIN cluster01.example.com.
@       IN      NS      cluster01-gns.cluster01.example.com.
cluster01-gns   IN      A       192.9.201.180
After getting the GNS address, the resolver comes back to named.conf and checks whether a transfer is allowed to that address. If it is allowed, the request is passed on to GNS at that address.
zone "example.com" IN {
        type master;
        file "forward.zone";
        allow-transfer { 192.9.201.180; };
};
——————————————–
- /var/named/chroot/var/named/reverse.zone
——————————————–
This file is used to resolve ip addresses to host names.
$TTL    86400
@       IN      SOA     server1.example.com. root.server1.example.com.  (
                                      997022700 ; Serial
                                      28800      ; Refresh
                                      14400      ; Retry
                                      3600000    ; Expire
                                      86400 )    ; Minimum
           IN      NS      server1.example.com.
 59      IN      PTR     server1.example.com.
182     IN      PTR     openfiler.example.com.
183     IN      PTR     host01.example.com.
185     IN      PTR     host02.example.com.
187     IN      PTR     host03.example.com.
——————————————–
- /var/named/chroot/var/named/reverse1.zone
——————————————–
 This file is used to resolve private interconnect
ip addresses
IN      NS      server1.example.com.
11      IN      PTR     host01-priv.example.com.
22      IN      PTR     host02-priv.example.com.
33      IN      PTR     host03-priv.example.com.
——————————————————————-
Once the request goes to GNS, it is multicast to all the nodes in the cluster. In all, three SCAN Listeners run on any of the nodes in the cluster. mDNS process running on the nodes hosting the SCAN listeners return the VIP’s of SCAN listeners to GNS. GNS in turn returns the addresses of the three SCAN listeners to DNS. The client request is forwarded to any one of the three SCAN listeners (say SCAN-Listener1). All the SCAN listeners have the load information of all the nodes. The SCAN listener which receives the client request (SCAN-Listener1) forwards the request to the local listener running on the least loaded node.
Thus, with SCAN, all the VIPs (host VIPs and SCAN VIPs) are assigned dynamically by DHCP. The only VIP which is static is the VIP for GNS. Note that we have to specify the range of IPs to be assigned by DHCP in the file /etc/dhcpd.conf. For example, the following entry specifies the IP range from 192.9.201.190 to 192.9.201.254:
range dynamic-bootp 192.9.201.190 192.9.201.254;
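
A minimal /etc/dhcpd.conf sketch around that range might look like the following (the subnet, router address and lease times are assumptions for this example network, not values taken from the cluster above):

subnet 192.9.201.0 netmask 255.255.255.0 {
        option routers              192.9.201.1;
        option domain-name          "example.com";
        option domain-name-servers  192.9.201.59;
        range dynamic-bootp 192.9.201.190 192.9.201.254;
        default-lease-time 21600;
        max-lease-time 43200;
}
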
Now, to add a node, we just need to add an entry with the physical IP of the node in the DNS server's zone files. The VIPs will be assigned automatically by DHCP.
I hope you found this post useful.
Your comments and suggestions are always welcome.
——————————————————————————————————-


                                                              ——————

11g R2 RAC: CONVERT NON RAC DATABASE TO RAC DATABASE USING RCONFIG

Oracle supports the following methods to convert a single-instance database to a RAC database, as long as the RAC and the standalone environments are running on the same operating system and using the same Oracle release:
• DBCA
• Oracle Enterprise Manager (grid control)
• RCONFIG
• Manual method
In this post, I will focus on and demonstrate this using the rconfig command-line tool.
During the conversion, rconfig performs the following steps automatically:
• Migrating the database to ASM, if specified
• Creating RAC database instances on all specified nodes in the cluster
• Configuring the Listener and NetService entries
• Registering services with CRS
• Starting up the instances and listener on all nodes
In Oracle 11g R2, a single-instance database can be converted either to an administrator-managed cluster database or to a policy-managed cluster database.
When you navigate to $ORACLE_HOME/assistants/rconfig/sampleXMLs, you will find two sample XML input files:
- ConvertToRAC_AdminManaged.xml
- ConvertToRAC_PolicyManaged.xml
While converting a single-instance database with filesystem storage to a RAC database with Automatic Storage Management (ASM), rconfig invokes RMAN internally to back up the database as part of converting the storage from non-ASM to ASM. Therefore, configuring RMAN on the local node to use multiple parallel channels may make the backup run faster, which in turn reduces the conversion duration. For example, you may configure the following in the RMAN settings of the orcl database on the local node:
RMAN> CONFIGURE DEVICE TYPE DISK PARALLELISM 6;
CURRENT SCENARIO:
- 3 node RAC setup
- Names of nodes : Host01, Host02, Host03
- Name of single instance database with filesystem storage : orcl
- Source Oracle home : /u01/app/oracle/product/11.2.0/dbhome_1
- Target Oracle home : /u01/app/oracle/product/11.2.0/dbhome_1
OBJECTIVE
- convert orcl to an Admin-managed RAC database running on two nodes, host01 and host02.
- change storage to ASM with
   . Datafiles on +DATA diskgroup
   . Flash recovery area on +FRA diskgroup
IMPLEMENTATION:
– copy ConvertToRAC_AdminManaged.xml to another file my.xml
host01$cd $ORACLE_HOME/assistants/rconfig/sampleXMLs
host01$cp ConvertToRAC_AdminManaged.xml my.xml
– Edit my.xml and make following changes :
   . Specify Convert verify as “YES”
   . Specify current OracleHome of non-rac database for SourceDBHome
   . Specify OracleHome where the rac database should be configured. It can be same as SourceDBHome
   . Specify SID of non-rac database and credential. User with sysdba role is required to perform conversion
   . Specify the list of nodes that should have rac instances running for the Admin Managed Cluster Database. LocalNode should be the first node in this nodelist.
   . Instance Prefix tag is optional starting with 11.2. If left empty, it is derived from db_unique_name
   . Specify the type of storage to be used by rac database. Allowable values are CFS|ASM
   . Specify Database Area Location to be configured for rac database.
   . Specify Flash Recovery Area to be configured for rac database.
Note: The Convert verify option in the XML file has three settings (a quick way to run a verification-only pass is sketched below):
  • YES: rconfig checks that the prerequisites for single-instance to RAC conversion have been met before it starts the conversion
  • NO: rconfig does not perform prerequisite checks and starts the conversion directly
  • ONLY: rconfig performs only the prerequisite checks; it does not perform the conversion after completing the checks
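
A hedged sketch of such a verification-only run (the copy name my_verify.xml is an assumption; the verify attribute is edited in the copy so that the original my.xml stays set to YES):

host01$ cd $ORACLE_HOME/assistants/rconfig/sampleXMLs
host01$ cp my.xml my_verify.xml
host01$ vi my_verify.xml            # change verify="YES" to verify="ONLY"
host01$ rconfig my_verify.xml       # runs only the prerequisite checks, no conversion
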
– Run rconfig to convert orcl from single instance database to 2 instance RAC database
host01$rconfig my.xml
– Check the log file for rconfig while the conversion is going on
oracle@host01$ ls -lrt $ORACLE_BASE/cfgtoollogs/rconfig/*.log
oracle@host01$ tailf <name of the latest log file listed above>
– check that the database has been converted successfully
host01$srvctl status database -d orcl
Instance orcl1 is running on node host01
Instance orcl2 is running on node host02
– Note that rconfig adds the password file to all the nodes, but the tnsnames.ora entry needs to be modified (to use the SCAN name instead of the host VIP) on the local node and added to the rest of the nodes.
– For all the other nodes, copy the entry for the database (orcl) from tnsnames.ora on the local node to tnsnames.ora on the remote node(s).
– Following is the entry I modified on the local node and copied to the rest of the nodes:
ORCL =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = cluster01-scan.cluster01.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = orcl)
    )
  )
– Check that the database can be connected to from a remote node.
host02$sqlplus system/oracle@orcl
– Check that the datafiles have been converted to ASM.
SQL>select name from v$datafile;
NAME
——————————————————————————–
+DATA/orcl/datafile/system.326.794838279
+DATA/orcl/datafile/sysaux.325.794838349
+DATA/orcl/datafile/undotbs1.305.794838405
+DATA/orcl/datafile/users.342.794838413
+DATA/orcl/datafile/undotbs2.348.794838
———————————————————————————————————–
Related links:

11g R2 RAC: Clone Database Home

 

                                                              ———————

 

WORKLOAD MANAGEMENT IN 11g R2 RAC : LOAD BALANCING

In this post, I will discuss the various types of load balancing techniques that can be configured in 11g R2 RAC for workload management.
In RAC we have multiple instances of a database running on multiple servers.
Workload management involves :
   – Failover : If connection to an instance fails, client should automatically connect to another instance
   – Load balancing : Workload should spread across various instances to obtain maximum performance/throughput
Failover can be:
  – Connect Time Connection Failover (CTCF): while making the initial connection
  – Run Time Connection Failover (RTCF), also known as Transparent Application Failover (TAF): after the connection has been established
     FAILOVER_METHOD = NONE, BASIC, PRECONNECT
     FAILOVER_TYPE   = SESSION, SELECT
Load balancing can be :
  – Connect time load balancing (CTLB)
    . On client side
    . On server side
  –  Run time load balancing (RTLB)
To configure failover/load balancing, tnsnames.ora should contain multiple listener addresses so that the client can connect to multiple instances, e.g.
Multiple listener addresses within a description, i.e. the user is trying to connect to a service which is supported by multiple instances:
RAC=
(DESCRIPTION=
   (ADDRESS_LIST=
     (ADDRESS= (PROTOCOL=TCP) (HOST=node1-vip) (PORT=1521))
     (ADDRESS= (PROTOCOL=TCP) (HOST=node2-vip) (PORT=1521))
     (ADDRESS= (PROTOCOL=TCP) (HOST=node3-vip) (PORT=1521))
     (ADDRESS= (PROTOCOL=TCP) (HOST=node4-vip) (PORT=1521))
                      )
    (CONNECT_DATA= (SERVICE_NAME= RAC))
       )
In case SCAN is used, SCAN name is used in the address which resolves to 3 SCAN listeners.
e.g.
RAC=
(DESCRIPTION=
   (ADDRESS_LIST=
     (ADDRESS= (PROTOCOL=TCP) (HOST=cluster01-scan) (PORT=1521))
                      )
    (CONNECT_DATA= (SERVICE_NAME= RAC))
       )
In this post, I will discuss load balancing in detail. To know more about failover, please see the failover post later on this page.
LOAD BALANCING 
————–
Load balancing in RAC implies distributing the workload over multiple instances accessing the same physical database. Two kinds of load balancing can be configured:
  – Connect time load balancing (CTLB)
    . On client side
    . On server side
  – On server side – Run time load balancing (RTLB)
Connect time load balancing (CTLB): This enables the user to connect to one of the instances supporting the service. The connection stays with the same instance until the user disconnects or the session is killed. It can be configured on the client side and/or the server side.
- Connect time load balancing on the client side: When a user session attempts to connect to the database, Oracle Net chooses an address from tnsnames.ora in random rather than sequential order. The effect is that clients connect through addresses picked at random, so no single address is overloaded. Its configuration is quite simple: you just need to set the parameter LOAD_BALANCE=ON in the client connection definition in tnsnames.ora. For example:
RAC=
(DESCRIPTION=
    (ADDRESS_LIST=
        (LOAD_BALANCE=ON)
        (ADDRESS= (PROTOCOL=TCP) (HOST=node1-vip) (PORT=1521))
        (ADDRESS= (PROTOCOL=TCP) (HOST=node2-vip) (PORT=1521))
        (ADDRESS= (PROTOCOL=TCP) (HOST=node3-vip) (PORT=1521))
        (ADDRESS= (PROTOCOL=TCP) (HOST=node4-vip) (PORT=1521))
    )
    (CONNECT_DATA= (SERVICE_NAME= RAC))
)
LOAD_BALANCE parameter is set to ON by default. When this parameter is set to ON, Oracle Net Services progresses through the list of listener addresses in a random sequence, balancing the load on several listeners. When it is set to OFF, the addresses are tried out sequentially until one succeeds.
When SCAN is used in the connection definition, the SCAN name resolves to the three SCAN VIPs in a round-robin fashion, so connections are balanced across the three SCAN listeners. Here is a sample tnsnames.ora entry for a RAC database using SCAN:
RAC=
(DESCRIPTION=
    (LOAD_BALANCE=ON)
    (ADDRESS= (PROTOCOL=TCP) (HOST=SCAN-HOSTNAME) (PORT=1521))
    (CONNECT_DATA= (SERVICE_NAME= RAC))
)
Limitation of connect time load balancing: The connection stays with the same instance for the life of the session. If the connection lasts a long time, the load on the current instance may increase and some other, less loaded instance might become preferable. In that case we would like the connection to switch to the more appropriate instance. This can be achieved by using Run Time Load Balancing (RTLB).
- Connect time load balancing on the server side: After a listener receives the connection request, it can forward the request to another instance based on the connect time load balancing goal (CLB_GOAL) specified for the service. CLB_GOAL can be:
LONG (default) - used for application connections that stay connected for a long period, such as third-party connection pools and SQL*Forms applications. In this case,
   . the listener will load balance on the number of sessions
   . the run time load balancing goal will not be used
SHORT - used for application connections that are short in duration. This should be used with connection pools integrated with the load balancing advisory. In this case, listener uses Load Balancing Advisory (LBA) to make the connection based on CPU utilization on the node.
Limitation of connect time load balancing on the client side: The listener has no idea whether the session has been assigned to an endpoint whose database server is already overloaded. Hence, timeouts can occur if the node is heavily loaded and unable to respond quickly. To overcome this problem, server-side connect time load balancing needs to be configured. Client-side load balancing is still useful to spread the initial connection load among all the listeners in the cluster; the client may then be redirected based on server-side load balancing.
Run Time (server-side) Load Balancing (RTLB): In this case, the listener routes incoming client connections according to policies and to the current service level provided by the database instances. The listener determines the connection distribution from load profile statistics that are dynamically updated by PMON. The higher the load on a node, the more frequently PMON updates its load profile. Thus connections may be switched depending upon changes in cluster configuration, application workload, overworked nodes or hangs.
   The core of server-side load balancing is dynamic service registration, so that services are registered with all the listeners. Since PMON on each node sends the load profile to all the listeners with which the service is registered, all the listeners know the load profile of all the instances, and the connection is forwarded to the most appropriate instance depending upon the goal of run time load balancing.
  Run time load balancing is achieved using connection pools. Work requests are automatically balanced across the pool of connections. The connection allocation is based on the current performance level provided by the database instances, as indicated by the Load Balancing Advisory (LBA) FAN events. This provides load balancing at the transaction level instead of only at the time of the initial connection.
With server-side load balancing, the listener directs a connection request to the best instance currently providing the service by using the load balancing advisory.
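
A hedged way to see this dynamic registration in action (standard listener utility; the listener names LISTENER and LISTENER_SCAN1 are the usual 11.2 defaults, not values taken from this post):

$ lsnrctl services LISTENER          # services registered with the node listener, with per-instance statistics
$ lsnrctl services LISTENER_SCAN1    # run on the node currently hosting a SCAN listener
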
Load Balancing Advisory
- is an advisory for balancing work across RAC instances
- Monitors workload activity for a service across all instances in the cluster
- Analyzes the service level for each instance based on defined metric goal
    Metric: service time (GOAL_SERVICE_TIME)
    Metric: throughput (GOAL_THROUGHPUT)
- Publishes FAN events recommending amount of work to be sent to each instance and data quality  flag
- Default is Off.
-  Directs work to where services are executing well and resources are available
- Adjusts distribution for different power nodes, different priority and shape workloads, changing demand
- Stops sending work to slow, hung, failed nodes early
How to configure server side load balancing:
1. set parameters LOCAL_LISTENER and REMOTE_LISTENER
2. set CLB_GOAL = SHORT for the service
3. set RTLB_GOAL for the service
1. set parameters LOCAL_LISTENER and REMOTE_LISTENER
LOCAL_LISTENER parameter should be set to name of the listener defined in the same node
REMOTE_LISTENER should be set to names of the listeners running on other nodes
For example in a 3 node setup
Host01 running instance orcl1
Host02 running instance orcl2
Host03 running instance orcl3
For host01,
local_listener  = (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=host01-vip)(PORT=1521))))
remote_listener = (DESCRIPTION=(ADDRESS_LIST=
                                 (ADDRESS=(PROTOCOL=TCP)(HOST=host02-vip)(PORT=1521))
                                 (ADDRESS=(PROTOCOL=TCP)(HOST=host03-vip)(PORT=1521))))
For host02,
local_listener  = (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=host02-vip)(PORT=1521))))
remote_listener = (DESCRIPTION=(ADDRESS_LIST=
                                 (ADDRESS=(PROTOCOL=TCP)(HOST=host01-vip)(PORT=1521))
                                 (ADDRESS=(PROTOCOL=TCP)(HOST=host03-vip)(PORT=1521))))
For host03,
local_listener  = (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=host03-vip)(PORT=1521))))
remote_listener = (DESCRIPTION=(ADDRESS_LIST=
                                 (ADDRESS=(PROTOCOL=TCP)(HOST=host01-vip)(PORT=1521))
                                 (ADDRESS=(PROTOCOL=TCP)(HOST=host02-vip)(PORT=1521))))
When we start the three instances, the corresponding PMON processes dynamically register with all the listeners and start feeding them load profile information. Now all the listeners know the load profile of all the instances, and the connection is forwarded to the listener of the least loaded node.
When SCAN is used, the remote_listener parameter should be set to the SCAN name on all the nodes, i.e.
remote_listener = myrac-cluster-scan:1521
so that the PMON process of each instance registers the database services with the default listener on the local node and with each SCAN listener, which is specified by the REMOTE_LISTENER database parameter.
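
A hedged sketch of setting this with SQL (assuming an spfile and the SCAN name from the example above; the SID='*' clause applies it to all instances):

SQL> ALTER SYSTEM SET remote_listener='myrac-cluster-scan:1521' SCOPE=BOTH SID='*';
SQL> SHOW PARAMETER local_listener
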
2. set CLB_GOAL = SHORT for the service
   EXECUTE DBMS_SERVICE.MODIFY_SERVICE (service_name => 'sjob', clb_goal => DBMS_SERVICE.CLB_GOAL_SHORT);
OR
   srvctl modify service -d orcl -s orcl_serv -j SHORT
      ( -j : connection load balancing goal {LONG|SHORT} )
3. set RTLB_GOAL for the service
A request for a connection is serviced by selecting a connection based on the service goal as determined by the Load Balancing Advisory. The service goal determines whether the connection provides best service quality, that is, how efficiently a single transaction completes, or best throughput, that is, how efficiently an entire job or long-running query completes.
For RTLB, we can define a service-level goal, which will be used only if CLB_GOAL = SHORT.
There are 3 options available
NONE – Default setting, you are not taking advantage of this feature
THROUGHPUT – Work requests are directed based on throughput. THROUGHPUT should be used when the work in a service completes at homogeneous rates. The load balancing advisory analyzes the rate at which work is completed in the service plus the available bandwidth to the service. A typical use of THROUGHPUT is for workloads such as batch processing or a trading system, where work requests are of similar length and the next job starts when the last job completes:
EXECUTE DBMS_SERVICE.MODIFY_SERVICE (service_name => 'sjob', goal => DBMS_SERVICE.GOAL_THROUGHPUT);
OR
srvctl modify service -d orcl -s orcl_serv -B THROUGHPUT
      ( -B : runtime connection load balancing goal {SERVICE_TIME|THROUGHPUT|NONE} )
SERVICE_TIME – Work requests are directed based on response time. SERVICE_TIME should be used when the work in a service completes at varying rates. In this case, the load balancing advisory data is based on the elapsed time for work done in the service plus the available bandwidth to the service. An example of SERVICE_TIME is a workload such as internet shopping, where the rate of demand changes and work requests are of varying lengths:
EXECUTE DBMS_SERVICE.MODIFY_SERVICE (service_name => 'OE', goal => DBMS_SERVICE.GOAL_SERVICE_TIME);
OR
srvctl modify service -d orcl -s orcl_serv -B SERVICE_TIME
      ( -B : runtime connection load balancing goal {SERVICE_TIME|THROUGHPUT|NONE} )
 You can see the goal settings for a service in the data dictionary and in the DBA_SERVICES, V$SERVICES, and V$ACTIVE_SERVICES views.
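
For instance, a quick hedged check of the configured goals (these columns are documented for the views named above):

SQL> SELECT name, goal, clb_goal FROM dba_services;
SQL> SELECT name, goal, clb_goal FROM v$active_services;
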
SUMMARY:
Workload management involves :
   – Failover : If connection to an instance fails, client should automatically connect to another instance
   – Load balancing : Workload should spread across various instances to obtain maximum performance/throughput
Load balancing can be :
  – Connect time load balancing (CTLB)
    . On client side
    . On server side
  –  Run time load balancing (RTLB)
To configure failover/load balancing, tnsnames.ora should contain multiple listener addresses so that the client can connect to multiple instances.
Connect time load balancing (CTLB): This enables the user to connect to one of the instances supporting the service. The connection stays with the same instance until the user disconnects or the session is killed. It can be configured on the client side and/or the server side.
- Connect time load balancing on client side: When a user session attempts to connect to the database, Oracle Net chooses an address specified in tnsnames.ora to connect to in a random order rather than sequential order. This has the effect of clients connecting through addresses which are picked up at random and no one address is overloaded. Its configuration is quite simple. You just need to set the parameter LOAD_BALANCE=ON in the client connection definition in tnsnames.ora.
Limitation of Connect time load balancing (CTLB) on client side : The listener has no idea if the session has been assigned to an endpoint whose corresponding database server is already overloaded.
- Connect time load balancing (CTLB) on the server side: After a listener receives the connection request, it can forward the request to another instance based on the connect time load balancing goal (CLB_GOAL) specified for the service. CLB_GOAL can be LONG or SHORT.
LONG(Default) – In this case,
   . the listener will load balance on number of sessions
   . Run time load balancing goal will not be used in this case
SHORT - used for application connections that are short in duration. This should be used with connection pools integrated with the load balancing advisory. In this case, listener uses Load Balancing Advisory (LBA) to make the connection based on CPU utilization on the node.
Limitation of connect time load balancing: The connection stays with the same instance for the life of the session. If the connection lasts a long time, the load on the current instance may increase and some other, less loaded instance might become preferable.
Run Time (Server side) load Balancing (RTLB): In this case, the listener routes incoming client connections according to policies and based on the current service level provided by the database instances. With server-side load balancing, the listener directs a connection request to the best instance currently providing the service by using the load balancing advisory.
To configure server side load balancing:
1. set parameters LOCAL_LISTENER and REMOTE_LISTENER
2. set CLB_GOAL = SHORT for the service
3. set RTLB_GOAL for the service
———————————————————————————

WORKLOAD MANAGEMENT IN 11g R2 RAC : FAILOVER

In this post, I will discuss the various types of failover that can be configured in 11g R2 RAC for workload management.
In RAC we have multiple instances of a database running on multiple servers.
Workload management involves :
   – Failover : If connection to an instance fails, client should automatically connect to another instance
   – Load balancing : Workload should spread across various instances to obtain maximum performance/throughput
Failover can be:
   – Connect Time Connection Failover (CTCF): while making the initial connection
   – Run Time Connection Failover (RTCF), also known as Transparent Application Failover (TAF): after the connection has been established
     FAILOVER_METHOD = NONE, BASIC, PRECONNECT
     FAILOVER_TYPE   = SESSION, SELECT
Load balancing can be :
  - Connect time load balancing (CTLB)
    . On client side
    . On server side
  –  Run time load balancing (RTLB)
To configure failover/load balancing, tnsnames.ora should contain multiple listener addresses so that the client can connect to multiple instances, e.g.
Multiple listener addresses within a description, i.e. the user is trying to connect to a service which is supported by multiple instances:
RAC=
(DESCRIPTION=
  (ADDRESS_LIST=
    (ADDRESS= (PROTOCOL=TCP) (HOST=node1-vip) (PORT=1521))
    (ADDRESS= (PROTOCOL=TCP) (HOST=node2-vip) (PORT=1521))
    (ADDRESS= (PROTOCOL=TCP) (HOST=node3-vip) (PORT=1521))
    (ADDRESS= (PROTOCOL=TCP) (HOST=node4-vip) (PORT=1521))
                      )
   (CONNECT_DATA= (SERVICE_NAME= RAC))
       )
In case SCAN is used, SCAN name is used in the address which resolves to 3 SCAN listeners.
e.g.
RAC=
(DESCRIPTION=
  (ADDRESS_LIST=
    (ADDRESS= (PROTOCOL=TCP) (HOST=cluster01-scan) (PORT=1521))
                      )
   (CONNECT_DATA= (SERVICE_NAME= RAC))
       )
In this post, I will discuss failover in detail. To know more about load balancing, please refer to the load balancing post above.
FAILOVER
——–
Failover refers to the mechanism of switching over to an alternate resource when the connection to the primary resource gets terminated for any reason. In the context of RAC, it implies that when a user connected to one instance of the database loses that connection for some reason, he should be able to switch to another instance. Connection failures may be broadly categorized as:
- Failure while making the initial connection to an instance (Connect Time Connection Failover, CTCF)
- Failure after a connection has been successfully established (Transparent Application Failover (TAF), or Run Time Connection Failover, RTCF)
Failure while making the initial connection to an instance (Connect Time Connection Failover, CTCF): It could occur, for example, when the listener on a node is not working or a node is not up. If your attempt to connect to an instance fails, you can make the connection attempt again, but this time to a backup instance. As long as you have backup instances configured, you can continue trying to connect until a successful connection is established. This process is technically termed Connect Time Connection Failover (CTCF).
Failure after a connection has been successfully established (Transparent Application Failover (TAF) or Run Time Connection Failover (RTCF)): It happens as a result of a node crash or an instance crash. In such cases the application has to reconnect to a backup instance, reestablish the session environment, and resubmit any work lost because of the break in the connection. The technical name for this type of failover mechanism is server-side Transparent Application Failover (TAF) or Run Time Connection Failover (RTCF).
How to configure Connect Time Connection Failover (CTCF)
Connect Time Failover can be achieved by the setting FAILOVER=ON/TRUE/YES (default) in DESCRIPTION_LIST, DESCRIPTION or ADDRESS_LIST in tnsnames.ora.
RAC=
(DESCRIPTION=
  (ADDRESS_LIST=
                       (FAILOVER=ON)
    (ADDRESS= (PROTOCOL=TCP) (HOST=node1-vip) (PORT=1521))
    (ADDRESS= (PROTOCOL=TCP) (HOST=node2-vip) (PORT=1521))
    (ADDRESS= (PROTOCOL=TCP) (HOST=node3-vip) (PORT=1521))
    (ADDRESS= (PROTOCOL=TCP) (HOST=node4-vip) (PORT=1521))
                      )
   (CONNECT_DATA= (SERVICE_NAME= RAC))
       )
 The default value of parameter FAILOVER is set to ON so that Oracle Net Services tries the first address in the list and if that fails, it fails over to the next one in the address list. Without this, Oracle would try only one address from the list and report an error on connection failure.
Note: Connect Time Failover only works if you are using dynamic registration. This means that this feature won’t work if you have something like this configured in your listener.ora:
SID_LIST_MYLISTENER=
  (SID_LIST=
    (SID_DESC=
      (GLOBAL_DBNAME=sales.us.acme.com)
      (ORACLE_HOME=/s01/app/oracle/ora/prod)
      (SID_NAME=sghosh)
    )
  )
Any reference to Global_dbname and you can forget about Connect Time Failover.
How to configure Run Time Connection Failover (RTCF) or server-side TAF:
It can be achieved by creating a service for the database and defining a TAF policy for the service, with METHOD = BASIC/PRECONNECT and TYPE = SELECT/SESSION. In this case, the connect descriptor in tnsnames.ora contains a FAILOVER_MODE section, e.g.
RAC=
(DESCRIPTION=
    (ADDRESS_LIST=
        (FAILOVER=ON)
        (ADDRESS= (PROTOCOL=TCP) (HOST=node1-vip) (PORT=1521))
        (ADDRESS= (PROTOCOL=TCP) (HOST=node2-vip) (PORT=1521))
        (ADDRESS= (PROTOCOL=TCP) (HOST=node3-vip) (PORT=1521))
        (ADDRESS= (PROTOCOL=TCP) (HOST=node4-vip) (PORT=1521))
    )
    (CONNECT_DATA= (SERVICE_NAME= RAC)
        (FAILOVER_MODE=(TYPE=select)(METHOD=basic))
    )
)
Unlike connect time failover, which is invoked before the connection is made, TAF comes into play after the connection is made (and then, broken). If the connection is lost while the application is running, Oracle Net will transparently reconnect the application to another instance accessing the same database.
TAF supports two types of failover (TYPE)
- SESSION:
   . Any uncommitted transactions will be rolled back.
   . The session will be connected to another instance
- SELECT:
  . The session will be connected to another instance
  . A SELECT statement that was executing at the time of failure will be re-executed by the new session using the same SCN and fetched rows will be discarded up to the point that the original query failed.
  . Any uncommitted transactions will be rolled back
TAF supports two methods of failover (METHOD). The method determines how failover from the primary node to the secondary takes place on instance failure (a sample srvctl command defining a TAF policy follows the two options below).
- BASIC: You connect to the backup instance when the primary connection fails. This option requires almost no work on the backup server until failover time.
- PRECONNECT: you connect to the backup instance at the same time you connect to the primary instance. This has the obvious benefit of having a backup connection available all of the time, thus reducing the time of ‘failover’. But the downside is that you have to pay the extra ‘cost’ in terms of resources spent, of having a backup connection open all the while.
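
As a hedged sketch (the service name orcl_taf is illustrative and a preferred-instance layout of orcl1,orcl2 is assumed), the corresponding service could be created on the server side with srvctl, which accepts the TAF policy, failover type and method directly:

[oracle@host01 ~]$ srvctl add service -d orcl -s orcl_taf -r orcl1,orcl2 -P BASIC -e SELECT -m BASIC -w 5 -z 180
[oracle@host01 ~]$ srvctl start service -d orcl -s orcl_taf
[oracle@host01 ~]$ srvctl config service -d orcl -s orcl_taf
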
SUMMARY:
Workload management involves :
   – Failover : If connection to an instance fails, client should automatically connect to another instance
   – Load balancing : Workload should spread across various instances to obtain maximum performance/throughput
Failover can be 
   – while making initial connection (Connect Time Connection Failover (CTCF))
   – after connection has been established (transparent application failover (TAF) or  Run time connection failover (RTCF))
     FAILOVER_METHOD = NONE, BASIC, PRECONNECT
     FAILOVER TYPE   = SESSION, SELECT
To configure failover/load balancing, tnsnames.ora should contain multiple listener addresses so that the client can connect to multiple instances.
Connect Time Connection Failover (CTCF) : If your attempt to connect to an instance fails, you can make the connection attempt again, but this time to a backup instance. Connect Time Failover can be achieved by the setting FAILOVER=ON/TRUE/YES (default) in DESCRIPTION_LIST, DESCRIPTION or ADDRESS_LIST in tnsnames.ora.
Run Time Connection Failover (RTCF) or server-side TAF: If the connection is lost while the application is running, Oracle Net will transparently reconnect the application to another instance accessing the same database. It can be achieved by creating a service for the database and defining a TAF policy for the service, with METHOD = BASIC/PRECONNECT and TYPE = SELECT/SESSION. In this case, the connect descriptor in tnsnames.ora contains a FAILOVER_MODE section.
—————————————————————————————————-

Related links:

Home

11G R2 RAC Index
11g R2 RAC : Services
Service Created Using EM Does Not Update DD or OCR

                                 ———-

11g R2 RAC : VOTING DISK DEMYSTIFIED

                                      Voting disk in 11g

In this post, I will write about voting disk – what does it contain, who updates it, how is it used, where is it stored and so on…
The voting disk is a key component of the clusterware, and its failure can render the cluster inoperable.
In RAC, at any point in time the clusterware must know which nodes are members of the cluster, so that
- it can perform load balancing
- in case a node fails, it can perform failover of resources as defined in the resource profiles
- if a node joins, it can start resources on it as defined in OCR/OLR
- if a node joins, it can assign a VIP to it in case GNS is in use
- if a node fails, it can execute callouts if defined
and so on.
Hence, there must be a way by which clusterware can find out about the node membership.
That is where the voting disk comes into the picture. It is the place where nodes mark their attendance. Consider an analogy where a manager wants to find out which of his subordinates are present: he can just check the attendance register and assign them their tasks accordingly. Similarly, the CSSD process on every node makes entries in the voting disk to ascertain the membership of that node. The voting disk thus records node membership information, and if the voting disk(s) are ever lost, the entire Oracle 11g RAC clustered environment is adversely affected and an outage may result.
Also, in a cluster, communication between the various nodes is of paramount importance; nodes which can't communicate with the other nodes should be evicted from the cluster. While marking their own presence, all the nodes also register information about their ability to communicate with the other nodes in the voting disk. This is called the network heartbeat. The CSSD process on each RAC node also maintains its disk heartbeat in a block (of size one OS block) of the voting disk at a specific offset. The written block has a header area with the node name, and the heartbeat counter increments every second on every write call. Thus the heartbeat of each node is recorded at a different offset in the voting disk. In addition to maintaining its own disk block, each CSSD process also monitors the disk blocks maintained by the CSSD processes running on the other cluster nodes. Healthy nodes have continuous network and disk heartbeats exchanged between them; a break in the heartbeat indicates a possible error scenario. If a node's disk block is not updated within a short timeout period, that node is considered unhealthy and may be rebooted to protect the database information. In this case, a message to this effect is written in the kill block of that node. Each node reads its kill block once per second; if the kill block has been overwritten, the node commits suicide (reboots itself).
During reconfiguration (a node joining or leaving), CSSD monitors all nodes and determines whether a node has a disk heartbeat, including those with no network heartbeat. If no disk heartbeat is detected, the node is declared dead.
 What is stored in voting disk?
——————————
Voting disks contain static and dynamic data.
Static data : Info about nodes in the cluster
Dynamic data : Disk heartbeat logging
It maintains important details about cluster node membership, such as
- which node is part of the cluster,
- who (node) is joining the cluster, and
- who (node) is leaving the cluster.
Why is voting disk needed ?
—————————
The voting disk files are used by Oracle Clusterware as part of its health checks:
- by CSS to determine which nodes are currently members of the cluster
- in concert with other cluster components such as CRS to shut down, fence, or reboot one or more nodes whenever network communication is lost between nodes of the cluster, in order to prevent the dreaded split-brain condition in which two or more instances attempt to control the RAC database; it thus protects the database information.
- by the CSS daemon to arbitrate with peers that it cannot see over the private interconnect in the event of an outage, allowing it to salvage the largest fully connected subcluster for further operation. CSS checks the voting disk to determine whether there is a failure on any other node in the cluster. During this operation, NM (the node monitor) makes an entry in the voting disk to register its vote on availability; similar operations are performed by the other instances in the cluster. The voting disks (typically three) also provide a method of determining who in the cluster should survive: for example, if eviction of one of the nodes is necessitated because it has become unresponsive, then the node that can access two of the three voting disks will start evicting the other node. NM alternates between the heartbeat and the voting disk to determine the availability of the other nodes in the cluster.
The voting disk is the key communication mechanism within Oracle Clusterware where all nodes in the cluster read and write heartbeat information. The CSSD processes (Cluster Synchronization Services Daemon) monitor the health of RAC nodes using two distinct heartbeats: the network heartbeat and the disk heartbeat. Healthy nodes have continuous network and disk heartbeats exchanged between the nodes; a break in a heartbeat indicates a possible error scenario. There are a few different scenarios possible with missing heartbeats:
1. The network heartbeat is successful, but the disk heartbeat is missed.
2. The disk heartbeat is successful, but the network heartbeat is missed.
3. Both heartbeats failed.
In addition, with a larger number of nodes, other scenarios are possible too. A few examples:
1. The nodes have split into N sets, communicating within each set but not with members of the other sets.
2. Just one node is unhealthy.
Nodes with quorum will maintain active membership of the cluster and other node(s) will be fenced/rebooted.
A node must be able to access more than half of the voting disks
————————————————————————————–
A node must be able to access more than half of the voting disks at any time. For example, consider a two-node cluster with an even number of voting disks, say 2. Suppose Node1 can access only voting disk 1 and Node2 can access only voting disk 2. Then there is no common file where the clusterware can check the heartbeat of both nodes; so with 2 voting disks, every node in the cluster must be able to access both of them. If we have 3 voting disks and each node can access more than half of them, i.e. 2 voting disks, there will be at least one disk accessible by both nodes, and the clusterware can use that disk to check the heartbeat of both nodes. Hence each node must be able to access more than half the number of voting disks; a node unable to do so has to be evicted from the cluster to maintain the cluster's integrity. After the cause of the failure has been corrected and access to the voting disks has been restored, you can instruct Oracle Clusterware to recover the failed node and restore it to the cluster.
Loss of more than half of your voting disks will cause the entire cluster to fail!
Why should we have an odd number of voting disks?
————————————————————————–
Here is a table which represents the number of voting disks whose failure can be tolerated for different numbers of voting disks:
Total voting disks     Must be accessible     Failures tolerated
        1                      1                      0
        2                      2                      0
        3                      2                      1
        4                      3                      1
        5                      3                      2
        6                      4                      2
It can be seen that the number of voting disk failures that can be tolerated is the same for 2n-1 and 2n voting disks (for n = 1, 2 or 3). Since the 2n-th disk buys no additional fault tolerance, an odd number (2n-1) of voting disks is desirable.
Where is voting disk stored?
—————————-
The voting disk is a shared disk that is accessed by all member nodes of the cluster during cluster operation. Hence, the voting disks must be on shared, accessible storage.
- You should plan on allocating 280MB for each voting disk file.
- Prior to 11g R2 RAC, it could be placed on
     . a raw device
     . a clustered filesystem supported by Oracle RAC, such as OCFS, Sun Cluster, or Veritas Cluster filesystem
- As of 11g R2 RAC, it can be placed on ASM disks. This simplifies management and improves performance, but it also raised a puzzle: for a node to join the cluster, it must be able to access the voting disk, yet the voting disk is on ASM, and ASM can't be up until the node is up. To resolve this, Oracle ASM reserves several blocks at a fixed location on every Oracle ASM disk used for storing a voting disk. As a result, Oracle Clusterware can access the voting disks in ASM even if the ASM instance is down, and CSS can continue to maintain the cluster even if the ASM instance has failed. Because the physical location of the voting files on the ASM disks is fixed, the cluster stack does not rely on a running ASM instance to access the files. The location of the file is visible in the ASM disk header (and dumping the file out of ASM with dd is quite easy):
oracle@rac1:~/ [+ASM1] kfed read /dev/sdf | grep -E 'vfstart|vfend'
kfdhdb.vfstart:                   96 ; 0x0ec: 0x00000060
kfdhdb.vfend:                    128 ; 0x0f0: 0x00000080
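Given those offsets, the voting data can be pulled off the disk directly with dd, purely for inspection. This is only a sketch and assumes the disk group uses the default 1 MB allocation unit (so vfstart=96 and vfend=128 translate into 1 MB units); as noted in the backup section further down, dd must never be used to back up or restore voting disks.
oracle@rac1:~/ [+ASM1] dd if=/dev/sdf of=/tmp/votedisk.dmp bs=1M skip=96 count=32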

 – The voting disk is not striped but put as a whole on an ASM disk.
 – In the event that the disk containing the voting disk fails, Oracle ASM chooses another disk on which to store this data.
 – It eliminates the need for a third-party cluster volume manager.
 – It reduces the complexity of managing disk partitions for voting disks during Oracle Clusterware installations.
 – The voting disk needs to be mirrored: should it become unavailable, the cluster will come down. Hence, you should maintain multiple copies of the voting disks on separate disks/LUNs so that you eliminate a single point of failure (SPOF) in your Oracle 11g RAC configuration.
 – If the voting disk is stored in ASM, the multiplexing level of the voting disk is decided by the redundancy of the diskgroup, as shown below:
Redundancy of the diskgroup     # of copies of voting disk     Minimum # of disks in the diskgroup
External                                     1                                   1
Normal                                       3                                   3
High                                         5                                   5
 – If the voting disk is on a diskgroup with external redundancy, one copy of the voting file is stored on one disk in the diskgroup.
 – If we store the voting disk on a diskgroup with normal redundancy, we should be able to tolerate the loss of one disk, i.e. even if we lose one disk, enough copies of the voting disk should remain for the clusterware to continue. If the diskgroup has 2 disks (the minimum required for normal redundancy), we can store 2 copies of the voting disk on it. If we then lose one disk, only one copy of the voting disk is left and the clusterware cannot continue, because to continue it must be able to access more than half the number of voting disks, i.e. more than 1 out of 2, i.e. both of them. Hence, to be able to tolerate the loss of one disk we need 3 copies of the voting disk, so a normal redundancy diskgroup holding the voting disk must have a minimum of 3 disks.
 – Similarly, if we store the voting disk on a diskgroup with high redundancy, 5 voting files are placed, each on its own ASM disk, i.e. a high redundancy diskgroup should have at least 5 disks so that even if we lose 2 disks the clusterware can continue.
 – Ensure that all the nodes participating in the cluster have read/write permissions on disks.
 – You can have up to a maximum of 15 voting disks. However, Oracle recommends that you do not go beyond five voting disks.
Backing up voting disk
———————–
In previous versions of Oracle Clusterware you needed to back up the voting disks with the dd command. Starting with Oracle Clusterware 11g Release 2, you no longer need to back up the voting disks: they are automatically backed up as part of the OCR. In fact, Oracle explicitly states that you should not use a backup tool like dd to back up or restore voting disks; doing so can lead to the loss of the voting disk.
Although the voting disk contents do not change frequently, the voting disk configuration should be backed up
- whenever you add or remove a node from the cluster, and
- immediately after you configure or upgrade a cluster.
In 11.2 this happens through the OCR backups, as sketched below.
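Since 11.2 backs the voting data up together with the OCR, the automatic OCR backups can simply be listed with ocrconfig. The output below is only representative: the cluster name cluster01 is an assumption, and timestamps and paths will differ on your system.
[root@host01 ~]# ocrconfig -showbackup
host01     2013/01/18 04:20:15     /u01/app/11.2.0/grid/cdata/cluster01/backup00.ocr
host01     2013/01/18 00:20:14     /u01/app/11.2.0/grid/cdata/cluster01/backup01.ocr
host01     2013/01/17 20:20:13     /u01/app/11.2.0/grid/cdata/cluster01/day.ocr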
To be able to tolerate the failure of n voting disks, a node in the cluster must still be able to access more than half of the total at any time, which requires 2n+1 voting disks. Therefore, it is strongly recommended that you configure an odd number of voting disks, such as 3 or 5.
Check the location of voting disk
grid@host01$ crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   243ec3b2a3cf4fbbbfed6f20a1ef4319 (ORCL:ASMDISK01) [DATA]
Located 1 voting disk(s).
– We can see that only one copy of the voting disk is present, on the DATA diskgroup, which has external redundancy.
As I mentioned earlier, Oracle writes the voting devices to the underlying disks at pre-designated locations so that it can get the contents of these files when the cluster starts up.
Let's see that with an actual example by looking at the CSS logs, located at $ORACLE_HOME/log/<hostname>/cssd. Here is an excerpt from one of them; the line says that it found a "potential" voting file on one of the disks, with UID 243ec3b2-a3cf4fbb-bfed6f20-a1ef4319.
grid@host01$ vi /u01/app/11.2.0/grid/log/host01/cssd/ocssd.log
Search for the string "potential" or for the File Universal Id 243ec3…
2012-10-09 03:54:28.423: [    CSSD][986175376]clssnmvDiskVerify: Successful discovery for disk ORCL:ASMDISK01, UID 243ec3b2-a3cf4fbb-bfed6f20-a1ef4319,
Create another diskgroup, test, with normal redundancy and 2 disks; a possible sequence is sketched below.
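A possible way to create it from the ASM instance; the disk names ORCL:ASMDISK02 and ORCL:ASMDISK03 are assumptions, and compatible.asm must be 11.2 or higher for the disk group to be able to hold voting files.
[grid@host01 ~]$ sqlplus / as sysasm
SQL> CREATE DISKGROUP test NORMAL REDUNDANCY
  2    DISK 'ORCL:ASMDISK02', 'ORCL:ASMDISK03'
  3    ATTRIBUTE 'compatible.asm' = '11.2';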
Try to move the voting disk from the data diskgroup to the test diskgroup:
– It fails, as we should have at least 3 disks in the test diskgroup.
[grid@host01 cssd]$ crsctl replace votedisk +test
Failed to create voting files on disk group test.
Change to configuration failed, but was successfully rolled back.
CRS-4000: Command Replace failed, or completed with errors.
Add another disk to the test diskgroup and mark it as a quorum disk; a possible statement is sketched below. The quorum disk is one small disk (300 MB should be on the safe side here, since the voting file is only about 280 MB in size) used to keep one mirror of the voting file. The other two disks will each contain one voting file plus all the other stripes of the database area, whereas the quorum disk will only hold that one voting file.
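One way to do this from the ASM instance (ORCL:ASMDISK04 and the failure group name quorum_fg are assumptions):
SQL> ALTER DISKGROUP test
  2    ADD QUORUM FAILGROUP quorum_fg DISK 'ORCL:ASMDISK04';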
Now try to move the voting disk from the data diskgroup to the test diskgroup again:
– This time the operation is successful.
[grid@host01 cssd]$ crsctl replace votedisk +test
Successful addition of voting disk 00ce3c95c6534f44bfffa645a3430bc3.
Successful addition of voting disk a3751063aec14f8ebfe8fb89fccf45ff.
Successful addition of voting disk 0fce89ac35834f99bff7b04ccaaa8006.
Successful deletion of voting disk 243ec3b2a3cf4fbbbfed6f20a1ef4319.
Successfully replaced voting disk group with +test.
CRS-4266: Voting file(s) successfully replaced
– Check ocssd.log again and search for the new File Universal Ids (e.g. 00ce3c9…):
grid@host01$ vi $ORACLE_HOME/log/host01/cssd/ocssd.log
2012-10-09 05:08:19.484: [    CSSD][997631888]  Listing unique IDs for 3 voting files:
2012-10-09 05:08:19.484: [    CSSD][997631888]    voting file 1: 00ce3c95-c6534f44-bfffa645-a3430bc3
2012-10-09 05:08:19.484: [    CSSD][997631888]    voting file 2: a3751063-aec14f8e-bfe8fb89-fccf45ff
2012-10-09 05:08:19.484: [    CSSD][997631888]    voting file 3: 0fce89ac35834f99bff7b04ccaaa8006
I hope this information was useful.
Keep visiting the blog. Thanks for your time!

References:

Oracle 10g RAC Grid, Services & Clustering by Murali Vallath

http://orainternals.wordpress.com/2010/10/29/whats-in-a-voting-disk/

—————————————————————————————————–

Related links:

Home

11G R2 RAC Index
11g R2 RAC: GPNP Profile Demystified
11g R2 RAC: How To Identify The Master Node In RAC
11g R2 RAC:Node Eviction Due To CSSDagent Stopping
11g R2 RAC : Node Eviction Due To Member Kill Escalation
11g R2 RAC: Node Eviction Due To Missing Disk Heartbeat
11g R2 RAC: Node Eviction Due To Missing Network Heartbeat 
11g R2 RAC : OCR Demystified
11g R2 RAC : OLR  Demystified
How Does  11G R2 Clusterware Start ASM When ASM SPfile Is On ASM Itself?
Cache Fusion Demonstrated
Instance Recovery In RAC
Need For VIP In RAC
Recover Voting Disk – Scenario-I
Recover Voting Disk – Scenario-II

11g R2 RAC : ORACLE LOCAL REGISTRY (OLR) DEMYSTIFIED

In Oracle Clusterware 11g Release 2, an additional component related to the OCR, called the Oracle Local Registry (OLR), is installed on each node in the cluster. The OLR is a local registry for node-specific resources; it is not shared by the other nodes in the cluster. It is installed and configured when Oracle Clusterware is installed.
 
Purpose of OLR
————–
It is the very first file accessed to start up the clusterware when the OCR is stored on ASM. The OCR must be accessible to find out which resources need to be started on a node, but if the OCR is on ASM, it can't be read until ASM (which is itself a node resource whose definition is stored in the OCR) is up. To resolve this problem, the information about the resources that need to be started on a node is stored in an operating system file called the Oracle Local Registry, or OLR. Since the OLR is an operating system file, it can be read and written by the various processes on the node irrespective of the status of the clusterware (up or down). Hence, when a node joins the cluster, the OLR on that node is read and the various resources, including ASM, are started on the node. Once ASM is up, the OCR becomes accessible and is used from then on to manage all the clusterware resources. If the OLR is missing or corrupted, the clusterware can't be started on that node!
 
Where is OLR located?
———————
The OLR file is located at grid_home/cdata/<hostname>.olr. The location of the OLR is stored in /etc/oracle/olr.loc, which is read by OHASD.
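On the test node used in this post, the pointer file would look roughly like the following (exact paths depend on the installation):
[root@host01 ~]# cat /etc/oracle/olr.loc
olrconfig_loc=/u01/app/11.2.0/grid/cdata/host01.olr
crs_home=/u01/app/11.2.0/grid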
 
What does OLR contain?
———————-
The OLR stores data about:
  • ORA_CRS_HOME
  • localhost version
  • active version
  • GPnP details
  • OCR latest backup time and location
  • information about OCR daily and weekly backup locations
  • node name, etc.
This information stored in the OLR is needed by OHASD to start or join a cluster.
A quick peek at a backup of the OLR shows the resources that are being maintained:
[root@host01 ~]# ocrconfig -local -manualbackup

host01     2013/01/18 01:20:27     /u01/app/11.2.0/grid/cdata/host01/backup_20130118_012027.olr

[root@host01 ~]# strings /u01/app/11.2.0/grid/cdata/host01/backup_20130118_012027.olr |grep -v type |grep ora!

ora!drivers!acfs
ora!crsd
ora!asm
ora!evmd
ora!ctssd
ora!cssd
ora!cssdmonitor
ora!diskmon
ora!gpnpd
ora!gipcd
ora!mdnsd
 
 OLR administration
——————-
You can view the status of the OLR file on each node by using the ocrcheck command with the -local parameter, as seen here:
#ocrcheck -local
 
ocrdump can be used to dump the contents of the OLR to the terminal:
#ocrdump -local -stdout
 
You can use the ocrconfig command to export and import the OLR as seen in these examples:
#ocrconfig -local -export <export file name>
 
#ocrconfig -local -import <file name>
 
And you can repair the OLR file should it become corrupted with the ocrconfig command as seen in this example:
#ocrconfig -local -repair olr <file name>
 
The OLR is backed up at the end of an installation or an upgrade. After that time, you can only manually back up the OLR. Automatic backups are not supported for the OLR.

To manually back up OLR:
# ocrconfig -local -manualbackup

To view the contents of the OLR backup file:
#ocrdump -local -backupfile olr_backup_file_name
To change the OLR backup location
#ocrconfig -local -backuploc new_olr_backup_path

To restore OLR: 
 # crsctl stop crs
# ocrconfig -local -restore file_name 
# ocrcheck -local 
# crsctl start crs 
$ cluvfy comp olr
 
                                     ———————-