Monthly Archives: January 2013

LET’S EDIT GPNP PROFILE

In 11g R2 RAC, the voting disk is stored on an ASM diskgroup. But CSSD needs the voting files before ASM is online. At startup, CSSD scans the headers of all devices matching the tag "DiscoveryString" in the gpnp profile XML file, which contains the same value as the asm_diskstring parameter of the ASM instance.

 

If it can find more than half of the total number of voting files, startup proceeds; otherwise, CSSD keeps cycling, writing appropriate error messages to $GRID_HOME/log/hostname/cssd/ocssd.log on each loop.
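As a quick sanity check (a minimal sketch, assuming the grid user's environment points at the Grid home), the discovery string currently cached on a node can be pulled straight out of the local GPnP profile with gpnptool:

[grid@host01 ~]$ gpnptool get 2>/dev/null | grep -o 'DiscoveryString="[^"]*"'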

 

To verify this, I set ASM_DISKSTRING to a value that excluded the ASM disk containing the voting disk and tried to restart CRS on a node. Looking at the CSSD logfile, I saw that CSSD had trouble identifying its voting files.

 

Now I had to change the ASM discovery string back without ASM running, and without CSSD, which is needed to start ASM in the first place. How do we tell CSSD, which starts well before ASM, to scan the right devices? I used gpnptool to edit the gpnp profile and restored the discovery string to the appropriate value. After that I could get CRS running on my machine.

 

Overview:
– Move voting disk to a new diskgroup VOTE.
– Set ASM_DISKSTRING to only the disks which are members of the other diskgroups (DATA/FRA) using SQL.
– Stop and restart CRS on the node.
– Check that HAS has started but the rest of the services are not up.
– Check ocssd.log – CSSD scans all the disks that match the discovery string but does not find the voting disk.
– Edit the gpnp profile to correct the ASM discovery string.
– Restart CRS.
– Check that all the daemons have started and cluster services are up.

 

Implementation:
– Create a diskgroup VOTE with external redundancy to be used to store the voting disk.

– Move voting disk to the diskgroup VOTE – fails because:
    – the diskgroup is not mounted on all the nodes – mount it
    – diskgroup compatibility is < 11.2.0.0 – modify it
[root@host01 ~]# crsctl replace votedisk +vote
Failed to create voting files on disk group vote.
Change to configuration failed, but was successfully rolled back.
CRS-4000: Command Replace failed, or completed with errors.
  – Mount the VOTE diskgroup on all the nodes
  – Modify diskgroup compatibility to 11.2.0.0 (see the SQL sketch below)
  – Move voting disk to the diskgroup VOTE – succeeds
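The diskgroup creation and the two fixes above can be sketched in SQL from the ASM instance as follows (a sketch only; the disk name ORCL:ASMDISK014 is taken from the crsctl output further down, and the compatible.asm value simply needs to be at least 11.2.0.0):

SQL> CREATE DISKGROUP vote EXTERNAL REDUNDANCY DISK 'ORCL:ASMDISK014';
SQL> ALTER DISKGROUP vote SET ATTRIBUTE 'compatible.asm' = '11.2.0.0';
-- and on the ASM instance of every other node:
SQL> ALTER DISKGROUP vote MOUNT;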
 
[root@host01 ~]#  crsctl replace votedisk +vote
Successful addition of voting disk 443f1c60e16f4fa5bfbfeaae6b2f919d.
Successful deletion of voting disk 3d9da0d16baa4f10bf4e4b9b4aa688c6.
Successful deletion of voting disk 369621097c034f6dbf29a8cc97dc4bbc.
Successful deletion of voting disk 407b5d6588ea4f6fbf1503f7f6cc2951.
Successfully replaced voting disk group with +vote.
CRS-4266: Voting file(s) successfully replaced
– check that voting disk has been moved to vote diskgroup
[root@host01 ~]#  crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   443f1c60e16f4fa5bfbfeaae6b2f919d (ORCL:ASMDISK014) [VOTE]
Located 1 voting disk(s).
– Find out the disks in the other diskgroups – DATA and FRA
[grid@host01 ~]$ asmcmd lsdsk -G data
Path
ORCL:ASMDISK01
ORCL:ASMDISK010
ORCL:ASMDISK011
ORCL:ASMDISK012
[grid@host01 ~]$ asmcmd lsdsk -G fra
Path
ORCL:ASMDISK02
ORCL:ASMDISK03
– Set ASM_DISKSTRING to only the disks which are members of the DATA/FRA diskgroups

 SQL> sho parameter asm_diskstring
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------
asm_diskstring                       string
SQL> alter system set asm_diskstring='ORCL:ASMDISK01',
   'ORCL:ASMDISK010','ORCL:ASMDISK011','ORCL:ASMDISK012',
   'ORCL:ASMDISK02','ORCL:ASMDISK03';
– check that discovery string in gpnp profile  points to the disks as specified 
 
[root@host01 ~]# cd /u01/app/11.2.0/grid/gpnp/host01/profiles/peer/
[root@host01 peer]# vi profile.xml
DiscoveryString="ORCL:ASMDISK01,ORCL:ASMDISK010,ORCL:ASMDISK011,ORCL:ASMDISK012,ORCL:ASMDISK02,ORCL:ASMDISK03"

 

– stop and restart crs on the node –
[root@host01 ~]# crsctl stop crs
[root@host01 ~]# crsctl start crs
– check that HAS has started but the rest of the services are not up
[root@host01 ~]# crsctl check has
CRS-4638: Oracle High Availability Services is online
[root@host01 ~]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
[root@host01 ~]# crsctl check css
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
[root@host01 ~]# crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
– Check the ocssd.log
– CSSD scans all the disks which are part of the discovery string but does not find a voting disk

 [root@host01 ~]# vi $ORACLE_HOME/log/host01/cssd/ocssd.log
2013-01-21 12:55:14.759: [    CLSF][2985733008]Read ASM header off dev:ORCL:ASMDISK01:0:0
2013-01-21 12:55:14.759: [   SKGFD][2985733008]Lib :ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so: closing handle 0x91d0410 for disk :ORCL:ASMDISK01:
2013-01-21 12:55:14.759: [    CLSF][2985733008]Read ASM header off dev:ORCL:ASMDISK010:0:0
2013-01-21 12:55:14.759: [   SKGFD][2985733008]Lib :ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so: closing handle 0x91d0c10 for disk :ORCL:ASMDISK010:
2013-01-21 12:55:14.759: [    CLSF][2985733008]Read ASM header off dev:ORCL:ASMDISK011:0:0
2013-01-21 12:55:14.759: [   SKGFD][2985733008]Lib :ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so: closing handle 0x91d1508 for disk :ORCL:ASMDISK011:
2013-01-21 12:55:14.759: [    CLSF][2985733008]Read ASM header off dev:ORCL:ASMDISK012:0:0
2013-01-21 12:55:14.759: [   SKGFD][2985733008]Lib :ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so: closing handle 0x91d1e00 for disk :ORCL:ASMDISK012:
2013-01-21 12:55:14.760: [    CLSF][2985733008]Read ASM header off dev:ORCL:ASMDISK02:0:0
2013-01-21 12:55:14.760: [   SKGFD][2985733008]Lib :ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so: closing handle 0x91d26f8 for disk :ORCL:ASMDISK02:
2013-01-21 12:55:14.760: [    CLSF][2985733008]Read ASM header off dev:ORCL:ASMDISK03:0:0
2013-01-21 12:55:14.760: [   SKGFD][2985733008]Lib :ASM:/opt/oracle/extapi/32/asm/orcl/1/libasm.so: closing handle 0x91d2ff0 for disk :ORCL:ASMDISK03:
2013-01-21 12:55:14.760: [    CSSD][2985733008]clssnmvDiskVerify: file is not a voting file, cannot recognize on-disk signature for a voting
.
.
.
2013-01-21 12:55:14.760: [    CSSD][2985733008]clssnmvDiskVerify: Successful discovery of 0 disks
2013-01-21 12:55:14.760: [    CSSD][2985733008]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2013-01-21 12:55:14.760: [    CSSD][2985733008]clssnmvFindInitialConfigs: No voting files found
– Copy the existing gpnp profile to profile.bak  and edit the copy to modify discovery string  for asm
[root@host01 ~]# cd /u01/app/11.2.0/grid/gpnp/host01/profiles/peer/
[root@host01 peer]# cp profile.xml profile.bak
– remove the oracle signature from the file –
[root@host01 peer]# gpnptool unsign -p=profile.bak
Warning: some command line parameters were defaulted. Resulting command line:
         /u01/app/11.2.0/grid/bin/gpnptool.bin unsign -p=profile.bak -o-

 

– change the DiscoveryString itself –
[root@host01 peer]# gpnptool edit -asm_dis='ORCL:*' -p=profile.bak -o=profile.bak -ovr
Resulting profile written to "profile.bak".
Success.
–  sign the profile xml file with the wallet (Notice: the path is only the directory to the wallet, NOT the wallet file itself) –
[root@host01 peer]# gpnptool sign -p=profile.bak -w=file:/u01/app/11.2.0/grid/gpnp/host01/wallets/peer/ -o=profile.new
Resulting profile written to "profile.new".
Success.
– replace the original profile.xml with the newly signed profile
[root@host01 peer]# mv profile.new profile.xml
– check that discovery string has been modified
[root@host01 peer]# vi profile.xml
DiscoveryString="ORCL:*"
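Before restarting anything, the signature on the edited profile can optionally be checked. A sketch of the verification call; the -wu=peer flag is how I recall gpnptool expecting the wallet user, so treat the exact options as an assumption and confirm with gpnptool verify -help:

[root@host01 peer]# gpnptool verify -p=profile.xml -w=file:/u01/app/11.2.0/grid/gpnp/host01/wallets/peer/ -wu=peer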
– Try to restart crs
[root@host01 peer]# ps -ef | grep d.bin
– kill any remaining clusterware (d.bin) processes listed above, then start crs –
[root@host01 peer]# crsctl start crs
– check that all the daemons have started and cluster services are up
[root@host01 peer]# ps -ef | grep d.bin
[root@host01 peer]# crsctl stat res -t
Conclusion:
To locate the voting disk, the gpnp profile is read first: it shows that the voting disk is on ASM, and the ASM disks matching the discovery string in the profile are then scanned to find it.
– Note: the GPnP profile on the other nodes still contains the erroneous discovery string. Hence copy the corrected profile from the current node to the other nodes as well.
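A minimal sketch of that copy (assuming the same peer directory layout on host02 and SSH access between the nodes):

[root@host01 peer]# scp profile.xml host02:/u01/app/11.2.0/grid/gpnp/host02/profiles/peer/profile.xml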

11g R2 RAC : REPAIR ORACLE CLUSTER REGISTRY (OCR)

In this post, I will demonstrate how we can use the ocrconfig -repair command to repair the OCR configuration on a node that was down when the configuration was modified on the other nodes.

Current scenario:
3 node cluster
Nodes: host01, host02, host03
Nodes host02 and host03 are up
Node host01 is down
OCR is stored on ASM diskgroup DATA

Overview:
- Additionally store OCR on the FRA diskgroup
- This information is modified in /etc/oracle/ocr.loc on nodes host02 and host03, which are up
- This information is not modified in /etc/oracle/ocr.loc on node host01, which is down
- Start up node host01
- Clusterware does not come up on host01
- Check the alert log and crsd log on host01
- Repair the OCR configuration on host01 so that /etc/oracle/ocr.loc on host01 gets updated
- Start clusterware on host01 – succeeds

Implementation:
 
- Additionally store OCR on the FRA diskgroup
[root@host02 ~]# ocrconfig -add +FRA
- Check that new OCR location is added  in /etc/oracle/ocr.loc on nodes host02 and host03 which are up
[root@host02 ~]# cat /etc/oracle/ocr.loc
#Device/file  getting replaced by device +FRA
ocrconfig_loc=+DATA
ocrmirrorconfig_loc=+FRA
[root@host03 ~]# cat /etc/oracle/ocr.loc
#Device/file  getting replaced by device +FRA
ocrconfig_loc=+DATA
ocrmirrorconfig_loc=+FRA

- Check that new OCR location is not added  in /etc/oracle/ocr.loc on node host01 which was down
[root@host01 ~]# cat /etc/oracle/ocr.loc
ocrconfig_loc=+DATA
local_only=FALSE
– Bring up host01.
– Check that clusterware has not come up there
[root@host01 host01]# crsctl stat res -t
- Check the alert log of host01
[root@host01 host01]# tailf /u01/app/11.2.0/grid/log/host01/alerthost01.log
[ohasd(4914)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘host01‘.
2013-01-18 23:35:01.950

- Check the crsd log of host01 – it indicates that the local and master OCR configuration information does not match

 [root@host01 crsd]# vi /u01/app/11.2.0/grid/log/host01/crsd/crsd.log
[  OCRMAS][2876611472]th_calc_av:5′: Rturn persisted AV [186646784] [11.2.0.1.0]
2013-01-18 23:35:13.931: [  OCRSRV][2876611472]th_not_master_change: Master change callback not registered
2013-01-18 23:35:13.931: [  OCRMAS][2876611472]th_master:91: Comparing device hash ids between local and master failed
2013-01-18 23:35:13.931: [  OCRMAS][2876611472]th_master:91 Local dev (1862408427, 1028247821, 0, 0, 0)
2013-01-18 23:35:13.931: [  OCRMAS][2876611472]th_master:91 Master dev (1862408427, 1897369836, 0, 0, 0)
2013-01-18 23:35:13.931: [  OCRMAS][2876611472]th_master:9: Shutdown CacheLocal. my hash ids don’t match

– Repair OCR configuration on host01 
[root@host01 crsd]# ocrconfig -repair -add +FRA
- Check that new OCR location is added  in /etc/oracle/ocr.loc on node host01 
[root@host01 crsd]# cat /etc/oracle/ocr.loc
#Device/file  getting replaced by device +FRA
ocrconfig_loc=+DATA
ocrmirrorconfig_loc=+FRA
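The registry configuration can also be cross-checked at this point with ocrcheck (output omitted here); ocrcheck -config lists just the configured OCR locations, while plain ocrcheck additionally verifies the integrity of each location:

[root@host01 crsd]# ocrcheck -config
[root@host01 crsd]# ocrcheck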

– Shutdown and restart cluster on host01
[root@host01 crsd]# crsctl stop crs -f
[root@host01 crsd]# crsctl start crs
[root@host01 crsd]# crsctl start cluster
– Check that crsd has started on host01
[root@host01 host01]# tailf /u01/app/11.2.0/grid/log/host01/alerthost01.log
[crsd(7297)]CRS-1012:The OCR service started on node host01.
2013-01-18 23:46:07.609
[crsd(7297)]CRS-1201:CRSD started on node host01.
[root@host02 ~]# crsctl stat res -t

11g R2 RAC: LET’S LOSE GPNP PROFILE

In an earlier post, GPNP PROFILE DEMYSTIFIED, I explained what the GPnP profile is, what it contains and how it is used by the clusterware.
In this post, I will demonstrate what happens if we lose GPnP profile.

Current scenario:
2 Node cluster
Cluster name : cluster01.example.com
Host names : host01, host02

Overview:
- Rename the GPnP profile on a node (host02)
- Stop and restart crs
- The file pending.xml is created in GRID_HOME/gpnp/<hostname>/profiles/peer
- Change the ASM_DISKSTRING parameter, which causes the GPnP profile to be updated
- As soon as the parameter is changed, pending.xml is renamed to profile.xml and the DiscoveryString for ASM in it is updated

 Let’s start …
– Rename GPnP profile on host02

   [root@host02 cdata]# mv /u01/app/11.2.0/grid/gpnp/host02/profiles/peer/profile.xml /u01/app/11.2.0/grid/gpnp/host02/profiles/peer/profile.bak
- Stop and restart crs on host02
     [root@host02 cdata]# crsctl stop crs
     [root@host02 cdata]# crsctl start crs
-- Check the alert log simultaneously and note that just prior to the start of the CSSD daemon, the GPnP daemon on host02 copies the GPnP profile from host01, and the file pending.xml is created in GRID_HOME/gpnp/<hostname>/profiles/peer

[ohasd(24187)]CRS-2772:Server ‘host02′ has been assigned to pool ‘Free’.
2013-01-17 01:35:14.635
[cssd(24416)]CRS-1713:CSSD daemon is started in clustered mode
 [root@host02 cdata]# ls /u01/app/11.2.0/grid/gpnp/host02/profiles/peer/pending*
/u01/app/11.2.0/grid/gpnp/host02/profiles/peer/pending.xml

– Open the file pending.xml and note that the DiscoveryString for ASM is empty
DiscoveryString="" SPFile="+DATA/cluster01/asmparameterfile/registry.253.804837235"/>

– Change ASM_DISKSTRING parameter which causes GPnP profile to be updated
   SQL>alter system set ASM_DISKSTRING='ORCL:*';
- As soon as the parameter is changed, pending.xml is renamed to profile.xml and the DiscoveryString for ASM is updated in it.
[root@host02 cdata]# ls /u01/app/11.2.0/grid/gpnp/host02/profiles/peer/profile.xml
/u01/app/11.2.0/grid/gpnp/host02/profiles/peer/profile.xml

 —- Open the file profile.xml and note that DiscoveryString for ASM has been updated
DiscoveryString="ORCL:*" SPFile="+DATA/cluster01/asmparameterfile/registry.253.804837235"/>

Conclusion:
Whenever the GPnP profile is missing on a node, GPnPd automatically copies it from an existing node as pending.xml. When the profile is next updated, pending.xml is renamed to profile.xml.

11g R2 RAC : INSTANCE FAILOVER IN SERVER POOL

   In my earlier post, I explained the concept of server pools introduced in 11g R2 RAC. In this post, I will demonstrate how instance failover takes place for a policy-managed database in 11g R2 RAC.

 

Prior to 11g R2 RAC, the various instances of a database were mapped to specific nodes in the cluster. Imagine a cluster having 4 nodes: host01, host02, host03 and host04. I have a database DB configured to run on 3 nodes in the cluster, i.e.

 

Instance DB1 runs on host01
Instance DB2 runs on host02
Instance DB3 runs on host03

 

Let us say I have a service DB_SERV defined for the database with
Preferred instances : DB1, DB2
Available instances  : DB3
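In 11.2 syntax such a service would be created roughly like this (a sketch only, using the hypothetical database, instance and service names of this example):

[oracle@host01 ~]$ srvctl add service -d DB -s DB_SERV -r DB1,DB2 -a DB3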

 

As the clusterware starts up , the status is as follows:
Instance DB1 runs on host01
Instance DB2 runs on host02
Instance DB3 runs on host03

 

Service DB_SERV is supported by host01 and host02, i.e. if a user connects to the database using the service db_serv, he will be connected to either host01 or host02.

 

      DB1          DB2          DB3
    DB_SERV      DB_SERV         |
       |            |            |
       v            v            v
     host01       host02       host03       host04

 

Now, if host02 crashes, i.e.
      DB1          DB2          DB3
    DB_SERV      DB_SERV         |
       |            |            |
       v            v            v
     host01       host02       host03       host04
                     X
- Instance DB2 crashes.
- Service DB_SERV fails over from the crashed instance (DB2) to the available instance DB3, i.e. now if a user connects to the database using the service db_serv, he will be connected to either host01 or host03.
      DB1                        DB3
    DB_SERV                    DB_SERV
       |                          |
       v                          v
     host01                     host03       host04
This implies that although an idle server (host04) was available, instance DB2 simply crashed instead of failing over to host04, because DB2 had been configured to run on host02 only.
ENTER SERVER POOLS:
The above problem has been resolved in 11g R2 RAC with the introduction of server pools. Now the database is configured to run in a server pool. The cardinality of the database is decided by the MIN_SIZE attribute of the server pool. The database runs on the servers which are assigned to the server pool. If a node hosting the database crashes, an idle node (if available) is moved from the Free pool to the server pool, and the instance running on the crashed node fails over to that node. All the services supported by the instance fail over along with it.
Let us see how.
Imagine a cluster having 3 nodes host01, host02 and host03
                         HOST01         HOST02            HOST03
– create a server pool pmpool with min = 2, max = 2, importance = 1
[grid@host01 ~]$ srvctl add srvpool -g pmpool -l 2 -u 2 -i 1
– check that 2 servers (host02, host03) have been allocated to this server pool
[grid@host01 ~]$crsctl  status  serverpool
                          PMPOOL
               +----------------------------+
               |                            |
    HOST01     |    HOST02        HOST03    |
               +----------------------------+
– create a policy-managed database pmdb which will run in the server pool pmpool
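If the database already exists but is not yet registered with the clusterware, one way to register it as policy managed in this pool is via srvctl (a sketch; the Oracle home path is an assumption about the environment, and DBCA normally performs this registration for you when you choose the policy-managed option):

[oracle@host01 ~]$ srvctl add database -d pmdb -o /u01/app/oracle/product/11.2.0/dbhome_1 -g pmpool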
– check that two servers (host02 and host03) have been allocated to serverpool pmpool
[root@host01 ~]#crsctl status serverpool
– check that two instances of pmdb are running on the 2 servers (host02, host03) in pmpool
[oracle@host01 ~]$ srvctl  status database -d pmdb
Instance pmdb_1 is running on node host02
Instance pmdb_2 is running on node host03
                          PMPOOL
               +----------------------------+
               |    PMDB_1        PMDB_2    |
    HOST01     |    HOST02        HOST03    |
               +----------------------------+
REBOOT HOST03 WHERE INSTANCE PMDB_2 IS RUNNING
                          PMPOOL
               +----------------------------+
               |    PMDB_1        PMDB_2    |
    HOST01     |    HOST02        HOST03    |
               |                     X      |
               +----------------------------+
– Check the status of the database pmdb repeatedly –
–Note that instance pmdb_2 fails over to host01 –
[oracle@host01 ~]$ srvctl  status database -d pmdb
Instance pmdb_1 is running on node host02
Instance pmdb_2 is running on node host03
[oracle@host01 ~]$
[oracle@host01 ~]$
[oracle@host01 ~]$ srvctl  status database -d pmdb
Instance pmdb_1 is running on node host02
Instance pmdb_2 is not running on node host03
[oracle@host01 ~]$
[oracle@host01 ~]$
[oracle@host01 ~]$
[oracle@host01 ~]$
[oracle@host01 ~]$ srvctl  status database -d pmdb
Instance pmdb_1 is running on node host02
Instance pmdb_2 is running on node host01
                          PMPOOL
               +----------------------------+
               |    PMDB_1        PMDB_2    |
    HOST03     |    HOST02        HOST01    |
               +----------------------------+
– check that host01 has been moved to pmpool
                          PMPOOL
               +----------------------------+
               |                            |
    HOST03     |    HOST02        HOST01    |
       X       +----------------------------+
[root@host02 ~]# crsctl status serverpool
NAME=Free
ACTIVE_SERVERS=
NAME=Generic
ACTIVE_SERVERS=
NAME=ora.pmpool
ACTIVE_SERVERS=host01 host02

11g R2 RAC: SERVER POOLS

                                        

In this post, I will explain server pools. 11g R2 RAC introduced server pools, which revolutionize cluster management and ease the life of DBAs. To understand server pools, let us first see how things were done earlier:

 

ADMIN MANAGED DATABASES
 In prior releases of Oracle RAC, the DBA would explicitly manage which nodes of a cluster would be used to run the various instances associated with a RAC database. Whenever a node was added/removed from the cluster, DBAs were responsible for adding/removing instances from a RAC database, including the creation of redo log threads and undo tablespaces. Additionally, database services would be manually configured with preferred and available nodes, which facilitated the balancing of connections and failover of the services in a RAC environment. This scheme worked well for a cluster with a small number of nodes.

 

NEED FOR POLICY MANAGED DATABASES
- As the number of nodes became large, the DBA's tasks became very complicated.
- It was not possible to add more instances dynamically to newly added servers.
- It was not possible for additional instances of a database to start automatically as some servers in the cluster became idle.
- It was not possible for a service to become available on newly added nodes dynamically.

 

From Oracle 11g R2 RAC onward, all these tasks are automated using server pools, a new feature of Oracle 11g R2 RAC. Server pools logically divide the cluster into smaller segments that can be used for varying workloads. Databases and services can be defined to run in server pools, and the servers in a server pool can change dynamically.

 

ENTER SERVER POOLS
Oracle 11g R2 introduced server pools, which let us divide the cluster into smaller pools of servers. Each pool can be assigned the following attributes:

 

IMPORTANCE: 0 (default) to 1000 – used to determine how to reconfigure servers across pools when a node joins or leaves the cluster. Using the importance attribute you can ensure that low-precedence workload doesn't eat up resources needed by more important workload.

 

MAX_SIZE: The maximum number of servers a server pool can contain. The default value is -1, which means that all the servers in the cluster can be part of this pool.

 

MIN_SIZE: The minimum size of a server pool. If the number of servers contained in a server pool is below the number you specify in this attribute, then Oracle Clusterware automatically moves servers from other pools into this one until that number is met. The default value is 0.

 

ACL: A string in the format owner:user:rwx,pgrp:group:rwx,other::r-- that defines the owner of the server pool and the privileges granted to various operating system users and groups:
  • r: Read only
  • w: Modify attributes of the pool or delete it
  • x: Assign resources to this pool
By default, the identity of the client that creates the server pool is the owner.
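For instance, the ACL of a user-defined pool could be adjusted like this (a hedged sketch; the pool name sp1 and the grid/oinstall accounts are assumptions about the environment):

[grid@host01 ~]$ crsctl modify serverpool sp1 -attr "ACL='owner:grid:rwx,pgrp:oinstall:rwx,other::r--'"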
Clusterware can move servers between server pools automatically to maintain the configured server pool sizes and importance. By default, two pools are created after installation of the cluster: the Generic pool and the Free pool.

 

Generic Pool: The servers in the pool are used to host Admin Managed databases.
  When you upgrade an existing version of the clusterware to 11.2, all the nodes get mapped to the built-in Generic pool. This is an internally managed server pool, and modification of its attributes is not allowed. When you create an Admin-managed database, it also becomes part of the Generic pool, as a child pool of it.

 

 Free Pool: All the servers which have not been assigned to any pool are part of this pool. When the clusterware starts, all the servers are initially assigned to this pool. As applications start, servers move from this pool to the Generic or user-defined pools dynamically. Like the Generic pool, this is an internally managed pool, but some of its attributes, such as IMPORTANCE, can still be modified by the DBA.
In addition to above pools, users can also define server pools.

 

Hosts are allocated to the server pools in the following order:
     1. Generic Pool
     2. Explicitly created pool
     3. Free Pool

 

Servers are assigned to explicitly created pools as per the following rules:
1.    Fill all server pools in order of importance until they meet their minimum.
2.    Fill all server pools in order of importance until they meet their maximum.
3.    By default, any leftover servers go to the Free pool.

 

If servers leave the cluster for any reason, a server reallocation may take place. If there are servers in the Free Pool and another server pool falls below its maximum value, a free server is allocated to the affected server pool. If there are no free servers, then server reallocation takes place only if a server pool falls below its minimum level. If that occurs, a server will be sourced from one of the following locations in the following order:

1. The server pool with the lowest importance that has more than its minimum number of servers
2. The server pool with the lowest importance that has at least one server and has lower importance than the affected server pool

Using these mechanisms, server pools can maintain an optimal level of resources based on the number of servers currently available.
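The full attribute set (MIN_SIZE, MAX_SIZE, IMPORTANCE, ACL and so on) and the current placement of servers can be inspected at any time; a quick sketch:

[grid@host01 ~]$ crsctl status serverpool -p
[grid@host01 ~]$ crsctl status server -f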

DEMONSTRATION:
 
Current scenario:
Nodes in the cluster : host01, host02, host03

 

- Let’s see the current status of server pools
#crsctl status serverpool
NAME=Free
ACTIVE_SERVERS=host01 host02 host03
NAME=Generic
ACTIVE_SERVERS=

 

- create a server pool mypool with min = 2, max = 2, importance = 1
[root@host02 host02]# srvctl add srvpool -g mypool -l 2 -u 2 -i 1

 

- check that two servers have been allocated to mypool from free pool
[root@host02 host02]# crsctl stat serverpool
NAME=Free
ACTIVE_SERVERS=host01
NAME=Generic
ACTIVE_SERVERS=
NAME=ora.mypool
ACTIVE_SERVERS=host02 host03

 

- Add another server pool mypool1 with min = 2, max = 2, imp = 1
[root@host02 host02]# srvctl add srvpool -g mypool1 -l 2 -u 2 -i 1

 

- check that only one server (the only one left in the Free pool) has been allocated to mypool1
 [root@host02 host02]# crsctl stat serverpool
NAME=Free
ACTIVE_SERVERS=
NAME=Generic
ACTIVE_SERVERS=
NAME=ora.mypool
ACTIVE_SERVERS=host02 host03
NAME=ora.mypool1
ACTIVE_SERVERS=host01

 

- Increase the importance of mypool1 to 2
[root@host02 host02]# srvctl modify srvpool -g mypool1 -i 2

 

- Check that one server has been moved from mypool (lower importance, 1) to mypool1 (higher importance, 2)
[root@host02 host02]# crsctl stat serverpool
NAME=Free
ACTIVE_SERVERS=
NAME=Generic
ACTIVE_SERVERS=
NAME=ora.mypool
ACTIVE_SERVERS=host03
NAME=ora.mypool1
ACTIVE_SERVERS=host01 host02

 

- Increase the importance of mypool to 3
[root@host02 host02]# srvctl modify srvpool -g mypool -i 3

 

- Check that one server has been moved from mypool1 to mypool
[root@host02 host02]# crsctl stat serverpool
NAME=Free
ACTIVE_SERVERS=
NAME=Generic
ACTIVE_SERVERS=
NAME=ora.mypool
ACTIVE_SERVERS=host02 host03
NAME=ora.mypool1
ACTIVE_SERVERS=host01

 

- Cleanup –
[root@host02 host02]# srvctl remove srvpool -g mypool1
[root@host02 host02]# srvctl remove srvpool -g mypool

 

CHILD POOLS
 We can have multiple child pools under the same parent pool. The child pools have their own attributes, and servers are allocated to them based on the IMPORTANCE attribute assigned to each. For example, we can have a database which runs in a parent pool and assign the various services of that database to different child pools, with different attributes on the child pools depending on the importance we want to give each service. The hosts assigned to different parent pools are mutually exclusive, i.e. the same server can't be assigned to two parent pools. For child pools, whether the same server can be assigned to different child pools depends on the value of the EXCLUSIVE_POOLS attribute: if the same string value is assigned to this attribute in two child pools, a host assigned to one of them cannot be shared by the other sibling.

 

– Create a parent pool pmpool3 with min = 3, max=3 and importance = 1
[root@host02 host02]# srvctl add srvpool -g pmpool3 -l 3 -u 3 -i 1

 

– Check that all the three servers have been assigned to pmpool3 –
[root@host02 host02]# crsctl stat serverpool
NAME=Free
ACTIVE_SERVERS=
NAME=Generic
ACTIVE_SERVERS=
NAME=ora.pmpool3
ACTIVE_SERVERS=host01 host02 host03

 

- Add a child pool ch1 with parent pool = pmpool3, min = 2, max =2 and importance = 1
[grid@host02 ~]$ crsctl add serverpool ch1 -attr "PARENT_POOLS=ora.pmpool3,MIN_SIZE=2,MAX_SIZE=2,IMPORTANCE=1"

 

- Check that two servers have been allocated to child pool ch1
[root@host02 ~]# crsctl status serverpool
NAME=Free
ACTIVE_SERVERS=
NAME=Generic
ACTIVE_SERVERS=
NAME=ch1
ACTIVE_SERVERS=host01 host02
NAME=ora.pmpool3
ACTIVE_SERVERS=host01 host02 host03

 

- Add another child pool ch2 with min = 2, max = 2, importance = 2(higher than ch1(1))
 [grid@host02 ~]$ crsctl add serverpool ch2 -attr "PARENT_POOLS=ora.pmpool3,MIN_SIZE=2,MAX_SIZE=2,IMPORTANCE=2"

 

- Check that two servers have been allocated to child pool ch2. Since we haven’t set the parameter EXCLUSIVE_POOLS, the same hosts host01 and  host02  are  shared by both the children.
[root@host02 ~]# crsctl status serverpool
NAME=Free
ACTIVE_SERVERS=
NAME=Generic
ACTIVE_SERVERS=
NAME=ch1
ACTIVE_SERVERS=host01 host02
NAME=ch2
ACTIVE_SERVERS=host01 host02
NAME=ora.pmpool3
ACTIVE_SERVERS=host01 host02 host03
————————————————————————————

 

EXCLUSIVE_POOLS

 

– Delete both child pools ch1 and ch2
[grid@host02 ~]$ crsctl delete serverpool ch1 
[grid@host02 ~]$ crsctl delete serverpool ch2

 

– Add child pool ch1 with min = 2,max = 2, importance = 1, exclusive_pools=’x’
[grid@host02 ~]$ crsctl add serverpool ch1 -attr "PARENT_POOLS=ora.pmpool3,MIN_SIZE=2,MAX_SIZE=2,IMPORTANCE=1,EXCLUSIVE_POOLS='x'"

 

– check that 2 servers have been assigned to child pool ch1
[grid@host02 ~]$ crsctl stat serverpool
NAME=Free
ACTIVE_SERVERS=
NAME=Generic
ACTIVE_SERVERS=
NAME=ch1
ACTIVE_SERVERS=host01 host02
NAME=ora.pmpool3
ACTIVE_SERVERS=host01 host02 host03

 

- Add another child pool ch2 with min = 2, max = 2, importance = 2(higher than ch1(1)), exclusive_pools = x so that it won’t share any servers with ch1
[grid@host02 ~]$ crsctl add serverpool ch2 -attr "PARENT_POOLS=ora.pmpool3,MIN_SIZE=2,MAX_SIZE=2,IMPORTANCE=2,EXCLUSIVE_POOLS='x'"

 

- Check that one server has been moved from ch1 to ch2, as the importance of ch2 is higher and servers are not shared between ch1 and ch2.
[grid@host02 ~]$ crsctl stat serverpool
NAME=Free
ACTIVE_SERVERS=
NAME=Generic
ACTIVE_SERVERS=
NAME=ch1
ACTIVE_SERVERS=host01
NAME=ch2
ACTIVE_SERVERS=host02 host03
NAME=ora.pmpool3
ACTIVE_SERVERS=host01 host02 host03
=========

 

With the introduction of server pools, we can have two management styles for a RAC database.

 

Administrator-managed databases (similar to what we had prior to 11g R2)
  . The DBA defines which servers the database should run on.
  . The DBA decides the servers (preferred/available nodes) on which the services of the database should run.
  . The DBA decides which instance of the database will run on which server.
  . This database type assigns its instances to a child pool of the built-in Generic pool.

 

Although this type seems simpler, it becomes unmanageable once the number of nodes grows beyond what a DBA can comfortably administer, for example 100. In that case the other 11.2 database type, the policy-managed database, is helpful.

 

Policy Managed Databases
- The database is associated with a server pool rather than a specific set of nodes. The server pool decides the minimum and maximum number of servers needed by that resource (database, service, third-party application).
- The database runs on the servers which have been allocated to the server pool it has been assigned to.
- Since the servers allocated to the server pool can change dynamically, this allows Oracle to dynamically deliver services based on the total number of servers available to the cluster.
  . The database will be started on enough servers, subject to the availability of servers.
  . We need not hardcode the servers a database should run on.
  . Any instance of the database can run on any node; there is no fixed mapping between instance number and node.
  . As servers are freed/added/deleted, they are allocated to existing pools as per the rules mentioned earlier.

 

For example, consider a cluster consisting of eight nodes in total that supports three RAC databases, each defined with a minimum and maximum number of servers. Let's assume that

 

DB1 is defined with a minimum of 4 and a maximum of 6 with an importance of 10,
DB2 is defined with a minimum of 2 and maximum of 3 and an importance of 7, and
DB3 is set to a minimum of 2 and maximum of 3 with an importance of 5.
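Expressed as server pools, that configuration could be created roughly as follows (a sketch; the pool names db1pool, db2pool and db3pool are hypothetical):

[grid@host01 ~]$ srvctl add srvpool -g db1pool -l 4 -u 6 -i 10
[grid@host01 ~]$ srvctl add srvpool -g db2pool -l 2 -u 3 -i 7
[grid@host01 ~]$ srvctl add srvpool -g db3pool -l 2 -u 3 -i 5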

 

Initially the 8 nodes could be configured as nodes 1-4 allocated to DB1, nodes 5-6 allocated to DB2 and nodes 7-8 allocated to DB3. If node 3 failed for some reason, the system would allocate node 7 or 8 to DB1 because it has a higher importance than DB3 and a minimum requirement of 4 servers, even though it would cause DB3 to fall below the minimum number of servers. If node 3 is re-enabled it would be allocated immediately to DB3 to bring that database back to its minimum required servers.

 

If a 9th node were added to the cluster, it would be assigned to DB1 because of the importance factor and the fact that DB1 has not yet hit its maximum number of servers.

Conclusion
As the number of nodes in a cluster increases, DBAs will benefit hugely from server pools, as the management of clusterware resources becomes easy and dynamic.
In my next post, I will write about instance failover in server pools.
References:
http://docs.oracle.com/cd/E11882_01/rac.112/e16794/admin.htm
 http://allthingsoracle.com/an-introduction-to-11-2-rac-server-pools/
http://www.mahir-quluzade.com/2013/05/converting-administrator-managed.html
Policy based cluster management in Oracle 12c


11g R2 RAC : SCAN DEMYSTIFIED

Single Client Access Name, or SCAN, is a new feature introduced in 11g R2 RAC. It enables clients to use a single hostname to access Oracle databases running in the cluster. Clients using SCAN do not need to change their TNS configuration if you add or remove nodes in the cluster.
 To understand it better, let us go back a little to 10g/11gR1 RAC.
Let us say we have a database orcl which is running on 3 nodes in a cluster;
Name of the cluster: cluster01.example.com
Instance       Host name                       IP address             VIP       Pvt network
orcl1       host01.example.com       192.9.201.181   192.9.201.182    10.0.0.1
orcl2       host02.example.com       192.9.201.183   192.9.201.184    10.0.0.2
orcl3       host03.example.com       192.9.201.185   192.9.201.186    10.0.0.3
DNS server
               server1.example.com       192.9.201.59
Storage
               openfiler1.example.com    192.9.201.9
Prior to 11g R2, when the user requested the connection to the database by issuing
$sqlplus hr/hr@orcl
Entry in tnsnames.ora on client side was read which looked like:
ORCL =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = host01-vip.example.com)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = host02-vip.example.com)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = host03-vip.example.com)(PORT = 1521))
      (LOAD_BALANCE = yes)
    )
    (CONNECT_DATA =
      (SID = orcl)
    )
  )
The TNS entry contains a reference (address) for every node in the cluster. The request could be sent to any one of the host names listed in the ADDRESS_LIST.
To resolve the host name two methods were available :
- hosts file
- DNS Server
Let us discuss hosts file first. The /etc/hosts file contained entries using which hostnames could be resolved to corresponding VIPs.
#Public Network
192.9.201.181    host01.example.com host01
192.9.201.183    host02.example.com host02
192.9.201.185    host03.example.com host03
#Private Network
10.0.0.1        host01-priv.example.com host01-priv
10.0.0.2        host02-priv.example.com host02-priv
10.0.0.3        host03-priv.example.com host03-priv
#Virtual Network
192.9.201.182   host01-vip.example.com host01-vip
192.9.201.184  host02-vip.example.com host02-vip
192.9.201.186   host03-vip.example.com host03-vip
#Storage
192.9.201.9     openfiler1.example.com openfiler
Once the VIP of the host was known, the request was sent to that IP address, where it was received by the local listener running there. The listener's information was stored in the listener.ora file on the server side.
In this method of hostname resolution, whenever we added or deleted a node, we needed to modify both tnsnames.ora and the hosts file on each client machine, which was a cumbersome task. This problem was resolved by storing the hostname resolution entries in a centralised DNS server. Now, on adding or deleting a node, the hostname resolution entries needed to be modified in only one place, the DNS server, but tnsnames.ora still had to be modified on each client machine.
To avoid modifying tnsnames.ora on each client machine, SCAN was introduced in 11gR2.
ENTER SCAN in 11g R2 RAC ….
Now we can specify just the single SCAN name in tnsnames.ora and the client will be connected to one of the nodes in the cluster. Let us see how it is implemented:
 Now tnsnames.ora entry looks like:
ORCL =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = cluster01-scan.cluster01.example.com)(PORT = 1521))
      (LOAD_BALANCE = yes)
    )
    (CONNECT_DATA =
      (SID = orcl)
    )
  )
When a user tries to connect to the orcl database, the name cluster01-scan.cluster01.example.com needs to be resolved. To resolve it, we can again use either the hosts file or a DNS server.
SCAN USING HOSTS FILE
If we use the /etc/hosts file, the entry would look like this (IP address first, then the SCAN name):
192.9.201.181   cluster01-scan
Here we have the limitation that the SCAN name can resolve to only one SCAN VIP, i.e. all client connection requests will be directed to the same SCAN listener, so client-side load balancing won't take place. If we want client-side load balancing, the SCAN name should resolve to multiple SCAN VIPs, and for that we must use DNS.
SCAN USING DNS
Hence we include the following in /etc/resolv.conf to forward any name-resolution request for example.com to the DNS server running at IP address 192.9.201.59:
/etc/resolv.conf
search example.com
nameserver 192.9.201.59
Various files associated with DNS on DNS server are :
- /var/named/chroot/etc/named.conf
- /var/named/chroot/var/named/forward.zone
- /var/named/chroot/var/named/reverse.zone
- /var/named/chroot/var/named/reverse1.zone
——————————————
- /var/named/chroot/etc/named.conf
——————————————
When a request comes to DNS, this file /var/named/chroot/etc/named.conf
is looked into first
Here it informs that domain example.com’s info is in forward.zone
zone "example.com" IN {
        type master;
        file "forward.zone";
        };
In forward.zone , following are specified :
– IP addresses of all the nodes in the cluster
– VIP addresses of all the nodes in the cluster
– Private network  IP addresses of all the nodes in the cluster
– IP address of DNS Server
– IP address of Storage
- SCAN-VIP’s
************
——————————————
- /var/named/chroot/var/named/forward.zone
—————————————–
This file is used to resolve host names to ip addresses.
                     IN NS           server1
                     IN A            192.9.201.59
server1         IN A            192.9.201.59
host01          IN A            192.9.201.183
host02          IN A            192.9.201.185
host03          IN A            192.9.201.187
openfiler1      IN A            192.9.201.181
host01-priv     IN A            10.0.0.11
host02-priv     IN A            10.0.0.22
host03-priv     IN A            10.0.0.33
host01-vip      IN A            192.9.201.184
host02-vip      IN A             192.9.201.186
host03-vip      IN A             192.9.201.188
cluster01-scan   IN A             192.9.201.190
                 IN A             192.9.201.191
                 IN A             192.9.201.192
After getting  address, the  request is sent in round-robin fashion  to the SCAN listeners listening at SCAN-VIP’s.
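A quick way to see this from a client and from the cluster itself (commands only; the three addresses returned by nslookup should be the SCAN VIPs 192.9.201.190-192 defined above, rotating between queries):

$ nslookup cluster01-scan.example.com
[grid@host01 ~]$ srvctl config scan
[grid@host01 ~]$ srvctl config scan_listener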
——————————————–
- /var/named/chroot/var/named/reverse.zone
——————————————–
This file is used to resolve ip addresses to host names.
********
$TTL    86400
@       IN      SOA     server1.example.com. root.server1.example.com.  (
                                       997022700 ; Serial
                                      28800      ; Refresh
                                      14400      ; Retry
                                      3600000    ; Expire
                                      86400 )    ; Minimum
        IN      NS      server1.example.com.
59      IN      PTR     server1.example.com.
182     IN      PTR     openfiler.example.com.
183     IN      PTR     host01.example.com.
185     IN      PTR     host02.example.com.
187     IN      PTR     host03.example.com.
184     IN      PTR     host01-vip.example.com.
186     IN      PTR     host02-vip.example.com.
188     IN      PTR     host03-vip.example.com.
190     IN      PTR     cluster01-scan.example.com
191     IN      PTR     cluster01-scan.example.com
192     IN      PTR     cluster01-scan.example.com
——————————————–
- /var/named/chroot/var/named/reverse1.zone
——————————————–
 This file is used to resolve private interconnect ip addresses
IN      NS      server1.example.com.
11      IN      PTR     host01-priv.example.com.
22      IN      PTR     host02-priv.example.com.
33      IN      PTR     host03-priv.example.com.
Thus we no longer need to modify the tnsnames.ora file on the client machines whenever we add or delete node(s), as the file contains just a reference to the SCAN name. Also, we need not modify the entries in DNS on adding or deleting a node, as the SCAN name in DNS resolves to SCAN VIPs and not node VIPs. But we still need to assign VIPs manually to the various nodes. If we let DHCP assign the VIPs to the nodes automatically, and let these VIPs be known to DNS somehow, we would just need to run a few commands to let the clusterware know about the node addition/deletion and we are done: no need to assign VIPs manually.
This feature of 11gR2 RAC is called GPnP (Grid Plug and Play). It is implemented by GNS (Grid Naming Service), which is similar to DNS but is used to resolve the cluster names only within the corporate domain (i.e. example.com).
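On a GNS-enabled cluster, the GNS configuration and its static VIP can be checked with srvctl (a sketch, output omitted):

[grid@host01 ~]$ srvctl config gns
[grid@host01 ~]$ srvctl status gns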
SCAN WITH GNS
Let us see how DNS entries are modified now :
Various files associated with DNS are :
- /var/named/chroot/etc/named.conf
- /var/named/chroot/var/named/forward.zone
- /var/named/chroot/var/named/reverse.zone
- /var/named/chroot/var/named/reverse1.zone
——————————————
- /var/named/chroot/etc/named.conf
——————————————
When a request comes to DNS, this file is looked into first
Here domain example.com’s info is in forward.zone
zone "example.com" IN {
        type master;
        file "forward.zone";
        allow-transfer { 192.9.201.180; };
};
As per forward.zone, any request for cluster01.example.com is to be forwarded to the address of cluster01-gns, i.e. 192.9.201.180.
——————————————
- /var/named/chroot/var/named/forward.zone
—————————————–
This file is used to resolve host names to ip addresses.
                   IN NS           server1
                   IN A            192.9.201.59
server1         IN A            192.9.201.59
host01          IN A            192.9.201.183
host02          IN A            192.9.201.185
host03          IN A            192.9.201.187
openfiler1      IN A            192.9.201.181
host01-priv     IN A            10.0.0.11
host02-priv     IN A            10.0.0.22
host03-priv     IN A            10.0.0.33
$ORIGIN cluster01.example.com.
@       IN      NS      cluster01-gns.cluster01.example.com.
cluster01-gns   IN      A       192.9.201.180
After getting the GNS address, the resolver comes back to named.conf and checks whether zone transfer is allowed to that address. If it is allowed, the request is forwarded to GNS at that address.
zone "example.com" IN {
        type master;
        file "forward.zone";
        allow-transfer { 192.9.201.180; };
};
——————————————–
- /var/named/chroot/var/named/reverse.zone
——————————————–
This file is used to resolve ip addresses to host names.
$TTL    86400
@       IN      SOA     server1.example.com. root.server1.example.com.  (
                                      997022700 ; Serial
                                      28800      ; Refresh
                                      14400      ; Retry
                                      3600000    ; Expire
                                      86400 )    ; Minimum
           IN      NS      server1.example.com.
 59      IN      PTR     server1.example.com.
182     IN      PTR     openfiler.example.com.
183     IN      PTR     host01.example.com.
185     IN      PTR     host02.example.com.
187     IN      PTR     host03.example.com.
——————————————–
- /var/named/chroot/var/named/reverse1.zone
——————————————–
 This file is used to resolve private interconnect
ip addresses
IN      NS      server1.example.com.
11      IN      PTR     host01-priv.example.com.
22      IN      PTR     host02-priv.example.com.
33      IN      PTR     host03-priv.example.com.
——————————————————————-
Once the request goes to GNS, it is multicast to all the nodes in the cluster. In all, three SCAN Listeners run on any of the nodes in the cluster. mDNS process running on the nodes hosting the SCAN listeners return the VIP’s of SCAN listeners to GNS. GNS in turn returns the addresses of the three SCAN listeners to DNS. The client request is forwarded to any one of the three SCAN listeners (say SCAN-Listener1). All the SCAN listeners have the load information of all the nodes. The SCAN listener which receives the client request (SCAN-Listener1) forwards the request to the local listener running on the least loaded node.
     Thus, with SCAN, all the VIPs (host VIPs and SCAN VIPs) are assigned dynamically by DHCP. The only static VIP is the VIP for GNS. Note that we have to specify the range of IPs to be assigned by DHCP in the file /etc/dhcpd.conf. For example, the following entry specifies the IP range from 192.9.201.190 to 192.9.201.254:
range dynamic-bootp 192.9.201.190 192.9.201.254;
Now, to add a node, we just need to add an entry giving the physical IP of the node in the DNS server's files. The VIPs will be assigned automatically by DHCP.
I hope you found this post useful.
Your comments and suggestions are always welcome.

LATCHES, LOCKS, PINS AND MUTEXES

In my earlier article in this series, I discussed the architecture of the shared pool. In this post, I will discuss the various memory-protection structures used in the shared pool, i.e. latches, locks, pins and mutexes.
                                                        LATCHES
  If there is not enough memory (no single chunk of free memory large enough) to create a new object in the library cache, Oracle uses an LRU algorithm to age out existing objects and delink them from their hash chains. Linking a new object into a hash chain means picking two existing objects that currently point to each other, setting the forward and backward pointers in the new object to point to them, and then modifying the backward pointer on one object and the forward pointer on the other object to point to the new object.
    Hash buckets        Hash chains (doubly linked, circular)
         H1         S1 <-> S2 <-> S5 <-> S9 <-> (back to S1)
         H2         S3 <-> S8 <-> (back to S3)
         H3         S4 <-> S6 <-> (back to S4)
         H4         S7
For example, if a user issues a statement S10 which hashes to the same value as bucket H2, the following steps are needed:
- set forward pointer of S10 to S8
- set backward pointer of S10 to S3
- set forward pointer of S3 to S10
- set backward pointer of S8 to S10
so that finally hash bucket H2 looks like:
        H2          S3 <-> S10 <-> S8 <-> (back to S3)
    Consider a scenario. User1 issues statement S1 and finds that it belongs to hash bucket H1. At the same time, another user, User2, needs to load another object, for which the object S2 in bucket H1 needs to be aged out. User2 will need to delink S2 from bucket H1, for which he will:
  – set forward pointer of S1 to point to S5
  – set backward pointer of S5 to point to S1
  – set forward/backward pointers of S2 so that it is attached to memory free list.
so that finally hash bucket H1 looks like:
        H1          S1 <-> S5 <-> S9 <-> (back to S1)
and S2 is linked to the doubly linked list of free memory.
If both users are allowed to access the contents of hash bucket H1 at the same time, User1, scanning the doubly linked hash chain, could end up following a pointer to the object User2 has just discarded (S2), because User2 hasn't yet had time to correct both of S2's pointers. Somehow Oracle has to ensure that a user can't modify the linked list while another user is walking it, and vice versa.
  That is where Oracle latches come into the picture. Latches are simple, low-level serialization mechanisms that protect shared data structures in the System Global Area (SGA). Latches eliminate the problems that would appear if multiple processes were allowed to modify shared memory simultaneously. A server or background process acquires a latch for a very short time while manipulating or looking at one of these structures.
  The shared pool latch is used to protect and achieve concurrency in the shared pool. Multiple hash buckets are protected by one latch, so whenever a process wants to search for an object in a hash bucket, it has to first obtain the latch protecting the bucket, walk the linked list, do whatever it needs to do with the object if it finds it, and then drop the latch. If another user wants to walk a hash chain protected by the same latch, he has to wait until the first user has dropped the latch. This gives rise to latch contention.
                             LATCH CONTENTION – LOCKS AND PINS
  Latch contention occurs because a latch is held for too long, or because demand for the latch is too high, or both.
There are three important aspects, then, of latch contention:
   • The number of different latches covering the library cache – the more latches there are, the fewer buckets each latch protects, and the less likely you are to collide with someone who needs to walk a linked list covered by the latch you want; on the other hand, the more latches you have, the more work you may have to do in some form of maintenance, reporting, or garbage collection.
   In the shared pool architecture there are a fixed number of hash buckets (that grow to the next level as needed) protected by a fixed number of library cache latches.
 Until 10g the number of latches covering the library cache was remarkably small. The number was dependent on the number of CPUs (it is roughly the same as the cpu_count parameter) up to a maximum of 67 latches (a quick way to check this is sketched after these three aspects). That's a surprisingly small number, really, given the potential for collisions occurring on even a small number of frequently executed statements; and two processes don't have to be executing the same SQL, or even accessing the same hash bucket, to be colliding on the same latch; they need only be accessing two hash buckets covered by the same latch.
  • The number of times you need to go through the process of acquiring a given latch – the more times you have to get the latch and walk the list, the more likely you are to collide with someone who wants the same latch.
  We can minimize the number of times we have to search the library cache for an object by attaching a KGL lock to the object once we've found it, so that we have a shortcut to it (stored in the PGA as an open/closed cursor). The next time the same statement is issued, we will find it in the PGA itself and need not search the hash chains for the object again (a softer parse).
• The time for which you have to hold the latch – the longer everyone holds latches, the more likely they are to have problems with other people holding latches that they want to acquire.
 We ought to avoid holding latches for any length of time. This means that if we need to do something time-consuming when we’ve found the bit of memory we were searching for, we can pin the object  to protect that memory while we’re using it so that we can release the latch.
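Coming back to the first aspect, the latch count versus CPU count can be eyeballed directly on a 10g system; a small sketch (in 11g the library cache latch is largely replaced by mutexes, so the second query may return 0 there):

SQL> show parameter cpu_count
SQL> SELECT COUNT(*) FROM v$latch_children WHERE name = 'library cache';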
   So, after the server process has found out the corresponding hash bucket, the steps followed are :
1. Obtain latch on the corresponding hash bucket.
2. Obtain lock on the object so that pointer to the object is placed in the PGA (as Open cursor).
3. Pin the object and release the latch.
4. Do whatever you want to do with the object (e.g. execute a statement/procedure).
5. Obtain the latch, unpin the object and release the latch.
Locks and Pins are usually in share mode unless modifications are being made.
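The cumulative effect of this locking and pinning activity is summarised in v$librarycache, where gets roughly correspond to lock requests and pins to pin requests; a quick sketch:

SQL> SELECT namespace, gets, pins, reloads, invalidations FROM v$librarycache;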
          HOW ARE LIBRARY CACHE LOCKS (KGL LOCKS) OBTAINED?
There are three major ways in which you can obtain KGL locks :
• You can write your front-end code to “hold” cursors that you know you are going to use frequently.
• You can set the session_cached_cursors parameter so that Oracle library code will automatically start holding a cursor if it sees you using a statement more than two or three times.
• You can benefit from the semiautomatic way in which PL/SQL will hold cursors that have been opened (explicitly or implicitly) from within a PL/SQL call—from Oracle 9.2.0.5 onward, setting the session_cached_cursors parameter will also control this feature.The session_cached_cursors parameter dictates how many cursors can be held when your code does not explicitly hold cursors. It also controls the number of cursors that can be held open by PL/SQL code being run by the session.
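Whether cursor holding is actually happening can be gauged from the system statistics; a minimal sketch:

SQL> show parameter session_cached_cursors
SQL> SELECT name, value FROM v$sysstat WHERE name LIKE 'session cursor cache%';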
                                        BENEFITS OF KGL LOCKS
  Besides minimizing the number of times we have to search the library cache for an object, locks provide additional benefits:
-  By means of locks, one client can prevent other clients from accessing the same object. Library cache locks use object handles as resource structures and locks are taken on that resource. If the resources are not available in a compatible mode, then sessions must wait for library cache objects to be available.
-  Library cache locks, aka parse locks, are needed to maintain the dependency mechanism between objects and their dependent objects. For example, if an object (e.g. a table) definition needs to be modified, then dependent objects (e.g. the execution plan of a SQL statement referencing the table) must be invalidated. This dependency is maintained using library cache locks. For example, if a column is dropped from a table, then all SQLs dependent upon the table must be invalidated and reparsed during the next access to that object. Library cache locks are designed to implement this tracking mechanism. If we have a lock on a SQL statement, we have a pointer to its execution plan stored in the PGA. If the structure of an underlying table is modified, the parse lock on the statement will be broken and the cursor in the PGA will be invalidated, i.e. the next time we re-execute the same statement, it will be searched for in the library cache and will subsequently be hard parsed.
                       WHY ARE LIBRARY CACHE PINS (KGL PINS) NEEDED?
  The KGL pin comes into play when you actually use an object. Although a KGL lock will hold an object in memory (the cursor in the PGA points to the object in the shared pool), there are parts of the object that are dynamically re-creatable (the execution plan for a SQL statement, for example), and these can still be discarded if there is heavy demand for memory in the shared pool, even if you have a KGL lock in place (the memory location the PGA cursor points to may then hold some other object).
However, when you are actually using an object (running an SQL statement, say), you need to ensure that the re-creatable bits can’t be pushed out of memory, so you pin the object to protect them.
Pinning an object causes the heaps to be loaded into memory. If a client wants to modify or examine  the object, he must acquire a pin after the lock.
 For example, an underlying object should not be modified while a session is executing or accessing a dependent object (a SQL statement, say). If a user tries to modify the underlying object (the table structure, for example), he will need to acquire the library cache pin in exclusive mode on that object. If a dependent SQL statement is still executing, the pin will not be available and the modifying session will wait on library cache pins. So, before the parse locks on a library cache object can be broken, library cache pins must be acquired in exclusive mode; only then can the library cache object be altered. Typically this happens with long-running SQL statements (a two-session sketch of the wait follows).
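The classic two-session demonstration uses a PL/SQL procedure, since an executing procedure keeps its library cache pin for the duration of the call. The procedure name P_SLEEP and the use of DBMS_LOCK.SLEEP below are illustrative assumptions (DBMS_LOCK needs an explicit grant for non-SYS users); any long-running call against the object would do.

-- Session 1: create and run a procedure that holds its pin for a while
CREATE OR REPLACE PROCEDURE p_sleep AS
BEGIN
  DBMS_LOCK.SLEEP(120);  -- keep the pin on P_SLEEP for ~2 minutes
END;
/
EXEC p_sleep

-- Session 2: while session 1 is still running, try to recompile it
ALTER PROCEDURE p_sleep COMPILE;  -- blocks, needs the pin in exclusive mode

-- Session 3: observe the wait
SELECT sid, event, seconds_in_wait
FROM   v$session
WHERE  event = 'library cache pin';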
Library cache locks and pins are externalized in three x$ tables: x$kgllk, x$kglpn and x$kglob (a sample query follows this list).
 - x$kgllk externalizes all the locking structures on an object.
 - Entries in x$kglob act as the resource structures.
 - x$kglpn externalizes all the library cache pins.
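Being x$ views, these can only be queried as SYS and their column layout is version dependent, so treat the join below (kgllkuse against v$session.saddr) as a commonly used sketch to verify on your own release rather than a documented interface.

-- who is holding library cache locks, and on what (run as SYS)
SELECT s.sid, s.username, lk.kglnaobj AS object_name, lk.kgllkmod AS held_mode
FROM   x$kgllk lk, v$session s
WHERE  lk.kgllkuse = s.saddr
AND    lk.kgllkmod > 0;

A similar query against x$kglpn (joining kglpnuse to v$session.saddr) shows the pin holders.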
                             ISSUES WITH KGL LOCKS AND PINS
 The mechanisms of locking and pinning in the library cache allowed two problems to creep in.
 The KGL locks and KGL pins themselves are simply little packets of memory that, at one time, were individually created and discarded on demand by making calls to allocate memory from the shared pool. Since a KGL lock is about 200 bytes and a KGL pin is about 40 bytes, you can imagine that between them the constant allocation and freeing of memory could cause the free memory in the shared pool to end up in a “honey-combed” state—i.e., lots of free memory in total, but not many large pieces of contiguous memory. The KGL pins were particularly nasty because they would come and go very quickly; the KGL locks weren’t quite so bad because they could stay attached to an object for some time.
 Another issue with the KGL locks and KGL pins was that, to use them, you had to constantly create a packet of memory, label it properly, and insert it into a linked list (or, conversely, remove a link from a linked list and put the memory back in the shared pool), and you had to do this while holding a latch exclusively. So for very busy systems, the whole lock/pin mechanism could become a significant scalability threat.
  In 10g, Oracle Corp. introduced the library cache lock latch and the library cache pin latch, which allowed some concurrent activity to be done on different hash buckets covered by the same library cache latch (you could pin a cursor in one bucket while I locked a cursor in another bucket because we wouldn’t both need to hold the same library cache latch at the same time).
However, as we moved through 10g to 11g, the whole KGL lock/KGL pin mechanism was gradually replaced by the mutex mechanism.
                                                    MUTEXES
  To improve cursor execution and also hard parsing, a new fine-grained memory serialization mechanism was introduced in 10gR2. For certain shared-cursor-related operations, mutexes (mutual exclusion objects) were used as a replacement for library cache latches and library cache pins. A mutex basically works the same way a latch does, but the code path for operating a mutex is shorter, so it is lightweight and often directly hardware-supported. Using mutexes is therefore faster, uses less CPU and also allows significantly improved concurrency over the latch mechanism. On a 32-bit Linux installation, a regular latch structure is about 110 bytes, whereas a mutex is about 28 bytes in size. Mutexes also take fewer instructions: a latch get takes roughly 150-200 instructions, whereas a mutex get is about 30-35 instructions.
 The mutex structure is smaller partly because mutexes give you less information about who is waiting, for what, and for how long: you have information about sleep times, but not about the number of requests and misses (a quick look at the mutex sleep statistics is sketched below).
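Those sleep statistics are exposed in v$mutex_sleep (cumulative) and v$mutex_sleep_history (recent history); both are standard views, though on an idle test system they may well be empty.

-- where mutex sleeps are occurring, by mutex type and code location
SELECT mutex_type, location, sleeps, wait_time
FROM   v$mutex_sleep
ORDER  BY sleeps DESC;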
BENEFITS OF MUTEX:
Mutexes can be taken in shared or exclusive mode, and a mutex get can also be done in wait or no-wait mode.
1. Mutexes have less potential for false contention. Previously multiple hash buckets were protected by the same latch. Hence, two users searching two different buckets protected by the same latch were prone to unnecessary false contention i.e. contention used to be for the protection mechanism (latch) rather than the target object you are attempting to access. Unlike latches, with mutexes it is possible to create a mutex for each structure protected. This means false contention is much less likely because each structure can be protected by its own mutex. Now each parent and child cursor has its own mutex, and we do not have to contend for a latch that’s co-used by many cursors.
2. Replace latches and pins: Mutexes have a dual nature; they can act as a serialization mechanism (like a latch) and also as a pin (for example, preventing an object from aging out). Whereas a latch can't be obtained by multiple sessions, a mutex can be referenced by many sessions, provided that all of them reference the mutex in shared mode. The total number of sessions referencing a mutex in shared (S) mode is called the reference count (ref count). The ref count for a mutex is stored in the mutex itself. A mutex can also be held in exclusive (X) mode by one session only.
  The ref count of a mutex is a replacement for a library cache pin. With latches, whenever a session executed a statement it first created and then deleted a library cache pin, but with mutexes each session simply increments and decrements the ref count (so the ref count replaces n distinct pins). This means that an object can't be aged out until its ref count drops to zero, i.e. until no user is currently accessing the object.
3. The mutex structure is located in each child cursor handle and the mutex itself acts as cursor pin structure. To change the cursor pin status, earlier we needed to get the library cache latch but now we can modify the cursor’s mutex refcount directly.
As per Oracle guru Tanel Poder: But the main scalability benefit comes from that there’s a mutex structure in each child cursor handle and the mutex itself acts as cursor pin structure. So if you have a cursor open (or cached in session cursor cache) you don’t need to get the library cache latch (which was previously needed for changing cursor pin status), but you can modify the cursor’s mutex refcount directly (with help of pointers in open cursor state area in sessions UGA).

Therefore you have much higher scalability when pinning/unpinning cursors (no library cache latching needed, virtually no false contention) and no separate pin structures need to be allocated/maintained.

Few notes:
1) library cache latching is still needed for parsing etc, the mutexes address only the pinning issue in library cache
2) mutexes are currently used for library cache cursors (not other objects like PL/SQL stored procs, table defs etc)
3) As mutexes are a generic mechanism (not library cache specific) they’re used in V$SQLSTATS underlying structures too
4) When mutexes are enabled, you won’t see cursor pins from X$KGLPN anymore (as X$KGLPN is a fixed table based on the KGL pin array – which wouldn’t be used for cursors anymore)
 Note that latches and mutexes are independent mechanisms, i.e. a process can hold a latch and a mutex at the same time. In Oracle 10.2.0.2+, the library cache pin latch usage was replaced with mutexes whenever _kks_use_mutex_pin was true; a few other things, such as the V$SQLSTATS underlying arrays and parent cursor examination, were also protected by mutexes. However, the right child cursor lookup (using kksfbc()) was still protected by library cache latches, which could become a problem with frequent soft parsing combined with too little cursor caching and long library cache hash chains (remember, the library cache latches were always taken exclusively, even for plain hash chain scanning). A hedged way to check the _kks_use_mutex_pin setting is sketched below.
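Because _kks_use_mutex_pin is a hidden parameter, it is not visible in v$parameter by default; the usual SYS-only way to inspect it is through x$ksppi and x$ksppcv. The column names below are the commonly used ones, but as with any x$ query they should be treated as assumptions to verify on your version.

-- value of the hidden parameter _kks_use_mutex_pin (run as SYS)
SELECT i.ksppinm  AS parameter_name,
       v.ksppstvl AS current_value
FROM   x$ksppi i, x$ksppcv v
WHERE  i.indx = v.indx
AND    i.ksppinm = '_kks_use_mutex_pin';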
In 10g you can see that three types of mutexes are used (a quick way to see them in action is sketched after this list):
Cursor Stat
Cursor Parent
Cursor Pin
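These types appear directly in the MUTEX_TYPE column of v$mutex_sleep_history, which also records who was blocking whom on a given mutex. The query below is only a sketch; for cursor mutexes the MUTEX_IDENTIFIER column generally corresponds to the hash value of the cursor involved, and on a quiet system the view may be empty.

-- recent mutex sleeps, showing the mutex type and the sessions involved
SELECT mutex_type, location, requesting_session, blocking_session, sleeps
FROM   v$mutex_sleep_history
ORDER  BY sleep_timestamp DESC;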
  So 11.2 got rid of all the little packets of memory and the linked lists; in fact, it even got rid of the latches used to protect the hash chains, and replaced them all with mutexes. Each hash bucket is protected by its own mutex, so you have one micro-latch per bucket instead of a maximum of 67 latches covering 131,072 buckets. If you have a mutex on each library cache object to represent the KGL locks and another to represent the KGL pins, then you don't have to do all that memory allocation and deallocation, and you don't have to run code to connect things to linked lists. So, starting from 11g, each library cache bucket is protected by a separate mutex (yes, all 131,072 of them!).
 In 11g there are a couple of additional mutexes; one of them (and the most important) is the Library Cache mutex.
In 11g, all library cache related latches except “library cache load lock” are gone and the corresponding operations are protected by mutexes instead; the “library cache” latches have been replaced by “Library Cache” mutexes.
The following latches which were present in 10g are no longer seen in 11g:
library cache pin allocation
library cache lock allocation
library cache hash chains
library cache lock
library cache
library cache pin
The only library cache latch present in 11g (this can be verified against v$latchname, as sketched below) is:
library cache load lock
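A hedged way to confirm which library cache latches exist on your own release is to list them from v$latchname; the exact set returned depends on version and patch level.

-- library cache related latches known to this instance
SELECT name
FROM   v$latchname
WHERE  LOWER(name) LIKE 'library cache%'
ORDER  BY name;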
The following wait events related to the library cache were present in 10g but are no longer seen in 11g:
latch: library cache
latch: library cache lock
latch: library cache pin
The wait events related to the library cache that are present in 11g are listed below (they can be pulled from v$event_name, as sketched after the list):
library cache pin
library cache lock
library cache load lock
library cache: mutex X
library cache: mutex S
OSD IPC library
library cache revalidation
library cache shutdown
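Again as a hedged verification, the library cache related wait events registered in a given release can be listed from v$event_name; the output varies by version (and a pattern match like the one below will not catch every event in the list above, e.g. 'OSD IPC library').

-- library cache related wait events registered in this release
SELECT name, wait_class
FROM   v$event_name
WHERE  name LIKE 'library cache%'
   OR  name LIKE 'latch: library cache%'
ORDER  BY name;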
 Of course the library cache mutexes still don't solve all the problems in the world (especially the ones related to excessive hard parsing!); for example, there is still a chance of a hash collision between two entirely different cursors. Also, if there are many child cursors under a parent and the application's cursor management is poor (e.g. cursors are closed after every execution and no session cursor caching is done), then you can still have contention on the mutex due to continuous library cache hash chain traversal.
A demonstration of library cache locks/pins can be seen here.
In my next article in this series on shared pool, I will give a quick overview of shared pool tuning.
I hope this information was useful. Your comments and suggestions are always welcome.

 

References:

Oracle Core: Essential Internals for DBAs and Developers by Jonathan Lewis
