Category Archives: 11g R2 RAC

MIGRATION OF A RAC DATABASE

In this post, I will demonstrate how we can transport a RAC database from one cluster to another
(e.g. from a production to a test environment).
Current scenario:
- Database to be converted : orcl
- Target cluster name : cluster01.example.com
- Target cluster SCAN name : cluster01-scan.cluster01.example.com
- Nodes in the target cluster : host01, host02, host03
- Name of the database after conversion : newdb
Overview:
----------------------
On the source server
----------------------
- Check prerequisites
- Start the database in READ ONLY mode
- Verify the database is ready for migration
- Make a folder to hold converted datafiles
- Use the RMAN CONVERT DATABASE command to create
  . Transport script
  . Initialization parameter file
  . Converted datafiles
---------------------
On the target server
---------------------

- Create appropriate folders
- Transfer the above files to the proper location on the target system
- Edit the initialization parameter file
- Edit the transport script transport.sql
- Make an entry for the database in /etc/oratab
- Run the transport script to create a single instance database on the target server
- Register the database with the clusterware
- Put the database in archivelog mode, if not already
- Use rconfig to convert the single instance database to a RAC database
- Create entries for the database in tnsnames.ora

Implementation:

———————-
On the source Server
———————

– Check prerequisites

– check the platform name of current database —

SQL>SELECT PLATFORM_NAME
FROM  V$DATABASE;

PLATFORM_NAME
——————————
Linux IA (32-bit)

— Start the database in READ ONLY mode

[oracle@host01]$srvctl stop database -d orcl

SQL> startup mount;
alter database open read only;
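
-- Optionally confirm the database really is open read-only before running the check (a quick verification, not in the original listing)

SQL> select name, open_mode from v$database;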

— Verify the database is ready for migration

SQL>   SET  SERVEROUTPUT ON;
DECLARE
db_ready BOOLEAN;
BEGIN
db_ready :=
DBMS_TDB.CHECK_DB('Linux IA (32-bit)',DBMS_TDB.SKIP_READONLY);
if db_ready
then
dbms_output.put_line('YES your database can be transported. ');
else
dbms_output.put_line('NO your database can not be transported. ');
end if;
end;
/

– Make a folder to hold converted datafiles

[oracle@host01]$mkdir /home/oracle/convertdb

- Use the RMAN CONVERT DATABASE command to create
  . Transport script
  . Initialization parameter file
  . Converted datafiles

The CONVERT DATABASE command specified in the example creates
- a transport script named /home/oracle/convertdb/transport.sql,
- a PFILE initnewdb.ora in the /home/oracle/convertdb/newdb directory
- a copy of all data files in the /home/oracle/convertdb/newdb directory in the format
of the target platform 'Linux IA (32-bit)'

RMAN>CONVERT DATABASE
NEW DATABASE 'newdb'
TRANSPORT SCRIPT '/home/oracle/convertdb/transport.sql'
TO PLATFORM 'Linux IA (32-bit)'
FORMAT '/home/oracle/convertdb/newdb/%U';

Starting conversion at source at 23-FEB-13
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=42 instance=orcl1 device type=DISK

External table SH.SALES_TRANSACTIONS_EXT found in the database

Directory SYS.SS_OE_XMLDIR found in the database
Directory SYS.DATA_PUMP_DIR found in the database
Directory SYS.SUBDIR found in the database
Directory SYS.MEDIA_DIR found in the database
Directory SYS.LOG_FILE_DIR found in the database
Directory SYS.DATA_FILE_DIR found in the database
Directory SYS.XMLDIR found in the database
Directory SYS.ORACLE_OCM_CONFIG_DIR found in the database

BFILE PM.PRINT_MEDIA found in the database

User SYS with SYSDBA and SYSOPER privilege found in password file
channel ORA_DISK_1: starting datafile conversion
input datafile file number=00001 name=+DATA/orcl/datafile/system.256.808138599
converted datafile=/home/oracle/convertdb/newdb/data_D-ORCL_I-1335919524_TS-SYSTEM_FNO-1_16o2n0vb
.
.
.
.
converted datafile=/home/oracle/convertdb/newdb/data_D-ORCL_I-1335919524_TS-USERS_FNO-4_1co2n13i
channel ORA_DISK_1: datafile conversion complete, elapsed time: 00:00:03

Edit init.ora file /home/oracle/convertdb/newdb/init_00o2n0vb_1_0.ora. This PFILE will be used to
create the database on the target platform

Run SQL script /home/oracle/convertdb/transport.sql on the target platform to create database

To recompile all PL/SQL modules, run utlirp.sql and utlrp.sql on the target platform
To change the internal database identifier, use DBNEWID Utility
Finished conversion at source at 23-FEB-13

———————
On the target server
———————

- Create appropriate folders

[oracle@host01]$mkdir -p /home/oracle/convertdb/newdb
mkdir -p /u01/app/oracle/admin/newdb/adump

- Transfer above files to the proper location on the target system
. transport.sql to /home/oracle/convertdb folder
. init….ora  and converted datafiles to /home/oracle/convertdb/newdb folder

- Edit the initialization parameter file

[oracle@host01]$vi  /home/oracle/convertdb/newdb/init_00o2n0vb_1_0.ora

# Please change the values of the following parameters:

# db_create_file_dest      = "/home/oracle/convertdb/newdb/+DATA"
db_create_file_dest      = "+DATA"

# db_recovery_file_dest    = "/home/oracle/convertdb/newdb/+FRA"
db_recovery_file_dest    = "+FRA"

db_recovery_file_dest_size= 4039114752

# remote_listener          = "NEWDBcluster01-scan.cluster01.example.com:1521"
remote_listener          = "cluster01-scan.cluster01.example.com:1521"

# cluster_database         = TRUE
cluster_database         = FALSE

# audit_file_dest          = "/home/oracle/convertdb/newdb/adump"
audit_file_dest          = "/u01/app/oracle/admin/newdb/adump"

- Edit the transport script transport.sql

[oracle@host01]$vi  /home/oracle/convertdb/transport.sql

- Delete the following lines, which create redo log threads 2 and 3 (rconfig will add
  redo threads for the additional instances later):

-- Create log files for threads other than thread one.

ALTER DATABASE ADD LOGFILE THREAD 2
GROUP 3 SIZE 50M BLOCKSIZE 512 REUSE,
GROUP 4 SIZE 50M BLOCKSIZE 512 REUSE;
ALTER DATABASE ADD LOGFILE THREAD 3
GROUP 5 SIZE 50M BLOCKSIZE 512 REUSE,
GROUP 6 SIZE 50M BLOCKSIZE 512 REUSE;

- Make following entry for the database in /etc/oratab

newdb:/u01/app/oracle/product/11.2.0/db_1:N

- Run transport script to create a single instance database on the target server

[oracle@host01]$. oraenv
ORACLE_SID = [oracle] ? newdb

sqlplus  / as sysdba

SQL>@/home/oracle/convertdb/transport.sql

– check that database has been transported —

[oracle@host01]$. oraenv
ORACLE_SID = [oracle] ? newdb

sqlplus  / as sysdba

SQL>startup

select * from hr.employees;

- Register the database with clusterware

– Check the database is currently not registered with clusterware

[oracle@host01 ~]$ srvctl status database -d newdb

PRCD-1120 : The resource for database newdb could not be found.
PRCR-1001 : Resource ora.newdb.db does not exist

[root@host01 trace]# crsctl stat res -t |grep newdb

— Register the database

[oracle@host01 ~]$ srvctl add database -d newdb -o $ORACLE_HOME -p /u01/app/oracle/product/11.2.0/dbhome_1/dbs/spfilenewdb.ora -a DATA,FRA

– Check the database is now registered with clusterware

[root@host01 trace]# crsctl stat res -t |grep newdb

ora.newdb.db

-- Start the database (it is still a single-instance database, so only one instance will start)

[oracle@host01 ~]$ srvctl start database -d newdb
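
-- As an optional sanity check (not part of the original steps), confirm how the database is now registered before converting it to RAC

[oracle@host01 ~]$ srvctl config database -d newdb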

– Put the database in archivelog mode if not already
( needed to convert it to a RAC database)

SQL> shutdown immediate;
startup mount;
alter database archivelog;
alter database open;
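
-- A quick way to confirm the log mode afterwards (optional check)

SQL> select log_mode from v$database;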

- Use rconfig to convert the single instance newdb database to a RAC database running
on three nodes host01, host02 and host03 with

. Datafiles on +DATA diskgroup
. Flash recovery area on +FRA diskgroup

— copy ConvertToRAC_AdminManaged.xml to another file my.xml

[oracle@host01]$cd $ORACLE_HOME/assistants/rconfig/sampleXMLs

cp ConvertToRAC_AdminManaged.xml my.xml

-- Edit my.xml and make the following changes (a trimmed sketch of the resulting file appears after this list):

  . Specify the current OracleHome of the non-RAC database for SourceDBHome
  . Specify the OracleHome where the RAC database should be configured. It can be the same as SourceDBHome
  . Specify the SID of the non-RAC database and credentials. A user with the sysdba role is required to perform the conversion
  . Specify the list of nodes that should have RAC instances running for the Admin Managed Cluster Database. LocalNode should be the first node in this nodelist.
  . The InstancePrefix tag is optional starting with 11.2. If left empty, it is derived from db_unique_name
  . Specify the type of storage to be used by the RAC database. Allowable values are CFS|ASM
  . Specify the Database Area Location to be configured for the RAC database
  . Specify the Flash Recovery Area to be configured for the RAC database
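
For illustration only, here is a trimmed sketch of what the edited portion of my.xml might look like for this scenario. The element names follow the ConvertToRAC_AdminManaged.xml sample shipped with 11.2 as I recall it, so verify them against your own copy of the sample; the Oracle Home path and the sys password are placeholders:

<n:Convert verify="YES">
  <n:SourceDBHome>/u01/app/oracle/product/11.2.0/dbhome_1</n:SourceDBHome>
  <n:TargetDBHome>/u01/app/oracle/product/11.2.0/dbhome_1</n:TargetDBHome>
  <n:SourceDBInfo SID="newdb">
    <n:Credentials>
      <n:User>sys</n:User>
      <n:Password>oracle</n:Password>
      <n:Role>sysdba</n:Role>
    </n:Credentials>
  </n:SourceDBInfo>
  <n:NodeList>
    <n:Node name="host01"/>
    <n:Node name="host02"/>
    <n:Node name="host03"/>
  </n:NodeList>
  <n:InstancePrefix>newdb</n:InstancePrefix>
  <n:SharedStorage type="ASM">
    <n:TargetDatabaseArea>+DATA</n:TargetDatabaseArea>
    <n:TargetFlashRecoveryArea>+FRA</n:TargetFlashRecoveryArea>
  </n:SharedStorage>
</n:Convert>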

-- Run rconfig to convert newdb from a single instance database to a 3-instance RAC database

host01$rconfig my.xml

Converting Database “NEWDB” to Cluster Database. Target Oracle Home:

/u01/app/oracle/product/11.2.0/dbhome_1. Database Role: PRIMARY.
Setting Data Files and Control Files
Adding Database Instances
Adding Redo Logs
Enabling threads for all Database Instances
Setting TEMP tablespace
Adding UNDO tablespaces
Adding Trace files
Setting Flash Recovery Area
Updating Oratab
Creating Password file(s)
Configuring Listeners
Configuring related CRS resources
Starting Cluster Database

Operation Succeeded

<Oracle_Home>
/u01/app/oracle/product/11.2.0/dbhome_1

— Check the log file for rconfig while conversion is going on

oracle@host01$ls -lrt  $ORACLE_BASE/cfgtoollogs/rconfig/*.log

              tailf <logfilename>

— check that the database has been converted successfully

host01$srvctl status database -d newdb

Instance newdb1 is running on node host01
Instance newdb2 is running on node host02
Instance newdb3 is running on node host03

[root@host01 trace]# srvctl config database -d newdb

Database unique name: NEWDB
Database name: NEWDB
Oracle home: /u01/app/oracle/product/11.2.0/dbhome_1
Oracle user: oracle
Spfile: +DATA/spfilenewdb.ora
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: NEWDB
Database instances: newdb1,newdb2,newdb3
Disk Groups: DATA,FRA
Services:
Database is administrator managed

[root@host01 trace]# crsctl stat res -t

ora.newdb.db
1        ONLINE  ONLINE       host01                   Open
2        ONLINE  ONLINE       host02                   Open
3        ONLINE  ONLINE       host03                   Open

- Create entries for the database in tnsnames.ora

[oracle@host01]$vi $ORACLE_HOME/network/admin/tnsnames.ora

— Following is the entry I copied and added :

NEWDB =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = cluster01-scan.cluster01.example.com)(PORT = 1521))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = newdb)
)
)

-- Check that the database can be connected to from a remote node

host02$sqlplus system/oracle@newdb

-- Check that the datafiles have been migrated to ASM

SQL>select name from v$datafile;

NAME
——————————————————————————–
+DATA/newdb/datafile/system.276.808311515
+DATA/newdb/datafile/sysaux.275.808311589
+DATA/newdb/datafile/undotbs1.272.808311655
+DATA/newdb/datafile/users.282.808311747
+DATA/newdb/datafile/example.273.808311721
+DATA/newdb/datafile/undotbs2.280.808311735
+DATA/newdb/datafile/undotbs3.281.808311743

I hope that you found this post useful. Your comments and suggestions are always welcome.

——————————————————————————————————-


Related links:

Home

11G R2 RAC Index
11g R2 RAC: Clone Database Home

 

TRACING SERVICES IN RAC

In this post, I will demonstrate how we can trace all the sessions that connect to a RAC database using a service. In this case, trace files for the different sessions are generated on the servers hosting the instances they are connected to. Hence, we will need to enable tracing for the service, collect the trace files generated on the various instances, combine them with trcsess, and then run tkprof on the resulting file.

 

CURRENT SCENARIO:
Name of the admin managed RAC database : amdb
No. of nodes : 3
Host names   : host01, host02, host03
- Create a service amdbs with (a sample srvctl command follows)
  . Preferred nodes : host01, host02
  . Available node : host03
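For reference, the service could be created and started with srvctl along these lines. I am assuming here that the instances amdb1, amdb2 and amdb3 run on host01, host02 and host03 respectively; adjust the instance names to your own setup:
[oracle@host01 ~]$ srvctl add service -d amdb -s amdbs -r "amdb1,amdb2" -a "amdb3"
[oracle@host01 ~]$ srvctl start service -d amdb -s amdbs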
– enable tracing for sessions connected using amdbs service
SYS> exec dbms_monitor.SERV_MOD_ACT_TRACE_ENABLE('amdbs');
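
To confirm that service-level tracing is actually in place (an optional check, not part of the original demo), DBA_ENABLED_TRACES can be queried:

SYS> select trace_type, primary_id, waits, binds from dba_enabled_traces;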

 

– create 2 sessions using amdbs
[oracle@host01 ~]$ sqlplus system/oracle@amdbs

 

– Issue two different queries in two sessions –
SQL1>select * from hr.employees;
SQL2>select * from scott.emp;

 

– After queries have completed, disable tracing
SQL>  exec dbms_monitor.SERV_MOD_ACT_TRACE_disable('amdbs');

 

– Find out the instance names of the two sessions which issued
– the queries. Trace files will be created on corresponding servers
SQL1> sho parameter instance_name
NAME                                 TYPE        VALUE
———————————— ———– ——————————
instance_name                        string      amdb3
SQL2> sho parameter instance_name
NAME                                 TYPE        VALUE
———————————— ———– ——————————
instance_name                        string      amdb1

 

-- Run trcsess on both the nodes to get, on each node, a file containing the statements executed by sessions that connected using the service amdbs
[oracle@host01 trace]$ cd /u01/app/oracle/diag/rdbms/amdb/amdb1/trace
trcsess output=amdbhost01.trc service=amdbs *.trc
[oracle@host03 trace]$ cd /u01/app/oracle/diag/rdbms/amdb/amdb3/trace
trcsess output=amdbhost03.trc service=amdbs *.trc

 

-- Combine the trcsess output files from both nodes --

[oracle@host03 trace]$scp amdbhost03.trc host01:/u01/app/oracle/diag/rdbms/amdb/amdb1/trace/

[oracle@host01 trace]$ cat amdbhost01.trc amdbhost03.trc>amdbhost.trc

[oracle@host01 trace]$ tkprof amdbhost.trc amdbhost.out
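
If the combined trace file is large, tkprof's sort option helps bring the most expensive statements to the top; for example (optional):

[oracle@host01 trace]$ tkprof amdbhost.trc amdbhost_sorted.out sort=exeela,fchela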

 

-- Check that the tkprof output file amdbhost.out contains the statements issued by both of the sessions that connected using amdbs (search for hr.employees and scott.emp)
 [oracle@host01 trace]$vi amdbhost.out
—————————————————————————————————————————-

Related links:

Home

11G R2 RAC Index
11g R2 RAC : Services
Service Created Using EM Does Not Update DD or OCR
Workload Management in 11g R2 RAC : Failover

 

 

CACHE FUSION DEMONSTRATED

In this post I will demonstrate cache fusion.

Current scenario:
No. of nodes in the cluster : 3
Names of the nodes : host01, host02, host03
Name of the database : racdb
Names of the instances : racdb1, racdb2, racdb3

In this demonstration, I will simulate read/read, read/write and write/write contention and will track the contents of the buffer cache of various instances as well as the Global Resource Directory (GRD).

OVERVIEW

— Get data_object_id for scott.emp
— Get File_id and block_id of emp table
— shutdown the database and restart
— Manually master the scott.emp table to node1

— Retrieve the data from disk on node3 (Physical read)

Read/Read contention:
– Read the block from cache on node2 (Cache read)

Read/Write contention
– Update a row in the block from node1

Write/write contention
– Update another row in the same block from node3

– Trigger checkpoint

IMPLEMENTATION

– Get data_object_id for scott.emp

SYS@NODE1> col owner for a10
col data_object_id for 9999999
col object_name for a15
select owner, data_object_id, object_name
from dba_objects
where owner = 'SCOTT'
and object_name = 'EMP';

OWNER DATA_OBJECT_ID OBJECT_NAME
———- ————– —————
SCOTT 73181 EMP

– Get File_id and block_id of emp table

SYS@NODE1> select empno, dbms_rowid.rowid_relative_fno(rowid) fileno,
dbms_rowid.rowid_block_number(rowid) block_no
from scott.emp
where empno in (7788, 7369);

EMPNO FILENO BLOCK_NO
———- ———- ———-
7369 4 151
7788 4 151
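
As an aside, the BL resource name that appears later is just the block# and file# in hex, i.e. [0x<block>][0x<file>],[BL]. A small helper query of my own (not part of the original scripts) confirms that block 151 in file 4 maps to the resource [0x97][0x4],[BL] used throughout this post:

SQL> select '[0x'||lower(trim(to_char(151,'XXXX')))||'][0x'||
            lower(trim(to_char(4,'XXXX')))||'],[BL]' as resource_name
     from dual;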

– Following queries will be used repeatedly during this demonstration –

– get_buffer_stat.sql
— The view x$bh contains the status of the block in the buffer cache.
— We can get the status of any cached buffer in the buffer cache from this view.
— The object-id for EMP table is used to restrict output only for blocks of EMP table.
— To get status of the block that contains the rows of empno = 7788 and 7369, dbablk = block value retrieved above is used. Class=1 in predicate is used to get the details about the data blocks of the emp table.

col object_name for a6
select (select object_name from dba_objects where object_id = b.obj) as
object_name,
decode (state,0, 'Free', 1, 'XCUR', 2, 'SCUR', 3, 'CR', 4,
'BEING READ',
5, 'MREC', 6, 'IREC', 7, 'WRITE_CLONE', 8, 'PI')
state,
mode_held, le_addr, cr_scn_bas, cr_scn_wrp
from x$bh b
where obj = 73181
and dbablk = 151
and class = 1;

– get_resource_name.sql
— Find the GCS resource name to be used in the query
— x$kjbl.kjblname = resource name in hexadecimal format([id1],[id2],[type]
— x$kjbl.kjblname2 = resource name in decimal format
— Hexname will be used to query resource in V$gc_element and v$dlm_ress views

col hexname for a25
col resource_name for a15
select b.kjblname hexname, b.kjblname2 resource_name,
b.kjblgrant, b.kjblrole, b.kjblrequest

from x$le a, x$kjbl b
where a.le_kjbl=b.kjbllockp
and a.le_addr = (     select le_addr
from x$bh
where dbablk = 151
and obj    = 73181
and class  = 1
and state   <> 3);

— get_resource_stat.sql
— Monitor the resources allocated to the data block from EMP table
— In the last line of SQL statement, substitute the literal by the value returned by above query

col resource_name for a22
col state for a8
col mast for 9999
col grnt for 9999
col cnvt for 9999

select a.resource_name, b.state, a.master_node mast, a.on_convert_q cnvt,
a.on_grant_q grnt, b.request_level, b.grant_level, b.owner_node
from   v$dlm_ress a, v$ges_enqueue b
where upper(a.resource_name) = upper(b.resource_name1)
and a.resource_name like '%%';

— IMPLEMENTATION ——

– Find the GCS resource name to be used in the query

SYS@NODE1>@get_resource_name

HEXNAME                   RESOURCE_NAME   KJBLGRANT   KJBLROLE KJBLREQUE
------------------------- --------------- --------- ---------- ---------
[0x97][0x4],[BL]          151,4,BL        KJUSERPR           0 KJUSERNL

– In the script get_resource_stat.sql, in the last line of SQL statement, substitute the literal by the hexname returned by above query i.e.

col resource_name for a22
col state for a8
col mast for 9999
col grnt for 9999
col cnvt for 9999

select a.resource_name, b.state, a.master_node mast, a.on_convert_q cnvt,
       a.on_grant_q grnt, b.request_level, b.grant_level, b.owner_node
from   v$dlm_ress a, v$ges_enqueue b
where upper(a.resource_name) = upper(b.resource_name1)
and a.resource_name like '%[0x97][0x4],[BL]%';

– NODE1 – shutdown the database and restart

[oracle@host01 ~]$ srvctl stop database -d racdb
srvctl start database -d racdb
srvctl status database -d racdb

– ON NODE1, NODE2, NODE3

-- Run queries on all the nodes to look at the status of locks. The buffer cache does not contain
any block belonging to the EMP table yet. Hence, no resources are needed to keep track of them. The table EMP is mastered by NODE1 and every request will be routed through NODE1.

SYS>@get_buffer_stat

no rows selected

SYS>@get_resource_name

no rows selected

SYS>@get_resource_stat

no rows selected

——————BEGIN READ FROM DISK ———————

– Retrieve the data from disk on node3 (Physical read)

– Query the data from emp on node3 to load a buffer from disk to buffer cache of node3.
Instance 1 is the master of the resource.

SCOTT@NODE3> select empno, ename, sal
from emp
where  empno = 7788;

EMPNO ENAME             SAL
---------- ---------- ----------
7788 SCOTT            3000

-- Manually master the EMP table to node1 --
-- Syntax: oradebug lkdebug -m pkey <data_object_id>

SYS@NODE1>oradebug lkdebug -m pkey 73181

– Query x$kjbr.kjbrmaster to find master node for a resource.
Note that node1 is master of the block (KJBRMASTER = 0, node numbering starts with 0)

SYS@NODE1>select kjbrname, kjbrmaster
          from x$kjbr
          where kjbrname like '%[0x97][0x4],[BL]%';

KJBRNAME                       KJBRMASTER
------------------------------ ----------
[0x97][0x4],[BL]                        0

Monitor the resource movements between master and holder instances.

— Check that the block is in current shared mode (state = SCUR) on NODE3 from where
read was issued

SYS@NODE3>@get_buffer_stat

OBJECT_NAM STATE     MODE_HELD LE_ADDR  CR_SCN_BAS CR_SCN_WRP
---------- -------- ---------- -------- ---------- ----------
EMP       SCUR              0 3A3EAA60          0          0

— Check that a protected read lock (shared read) as indicated by KJUSERPR has been granted on this resource

SYS@NODE3>@get_resource_name

HEXNAME                   RESOURCE_NAME   KJBLGRANT   KJBLROLE KJBLREQUE
------------------------- --------------- --------- ---------- ---------
[0x97][0x4],[BL]          151,4,BL        KJUSERPR           0 KJUSERNL

— Issue following queries on master node NODE1 (Master) —

— check that master node is node1 (node numbering starts with 0) and
resource is granted on a protected read level (grant_level = KJUSERPR)

SYS@NODE1>@get_resource_stat

RESOURCE_NAME          STATE     MAST  CNVT  GRNT REQUEST_L GRANT_LEV OWNER_NODE
---------------------- -------- ----- ----- ----- --------- -------------------
[0x97][0x4],[BL]       GRANTED      0     0     1 KJUSERNL  KJUSERPR     2

—————— END READ FROM DISK ———————

—————— BEGIN READ THE BLOCK FROM CACHE —————-
READ/READ CONTENTION

-- Run the same query from node2. The data is already cached on node3 and we should get the block from the instance on node3. The cached version of the block is the same as the on-disk version, as the block has not been modified yet.

SCOTT@NODE2> select empno, ename, sal
from emp
where empno = 7788;

EMPNO ENAME             SAL
---------- ---------- ----------
7788 SCOTT            3000

– Check that the block is held in current shared mode (state = SCUR)

SYS@NODE2>@get_buffer_stat

OBJECT STATE     MODE_HELD LE_ADDR  CR_SCN_BAS CR_SCN_WRP
------ -------- ---------- -------- ---------- ----------
EMP   SCUR              0 3A7E4C34          0          0

– Check that the resource has been granted in protected read mode to node2 and node3

SYS@NODE2>@get_resource_name

HEXNAME                   RESOURCE_NAME   KJBLGRANT   KJBLROLE KJBLREQUE
------------------------- --------------- --------- ---------- ---------
[0x97][0x4],[BL]          151,4,BL       KJUSERPR           0 KJUSERNL

SYS@NODE3>@get_resource_name

HEXNAME                   RESOURCE_NAME   KJBLGRANT   KJBLROLE KJBLREQUE
------------------------- --------------- --------- ---------- ---------
[0x97][0x4],[BL]          151,4,BL        KJUSERPR           0 KJUSERNL

-- Issue following queries on master node NODE1 (Master) and check that
on nodes node2 and node3 (node numbering starts with 0) the
resource is granted a protected read level (grant_level = KJUSERPR)

SYS@NODE1>@get_resource_stat

RESOURCE_NAME          STATE     MAST  CNVT  GRNT REQUEST_L GRANT_LEV OWNER_NODE
---------------------- -------- ----- ----- ----- --------- --------- ----------
[0x97][0x4],[BL]       GRANTED      0     0     1 KJUSERNL KJUSERPR     2
[0x97][0x4],[BL]       GRANTED      0     0     1 KJUSERNL  KJUSERPR     1

——- END READ BLOCK FROM CACHE ———————-

– —- START UPDATE THE BLOCK IN INSTANCE 1 ——————
READ/WRITE CONTENTION

SCOTT@NODE1>update emp set sal = sal where empno = 7788;
commit;

-- Check that the block is held in exclusive  mode (state = XCUR) on
the node where update command was issued
SYS@NODE1>@get_buffer_stat

OBJECT STATE     MODE_HELD LE_ADDR  CR_SCN_BAS CR_SCN_WRP
------ -------- ---------- -------- ---------- ----------
EMP    XCUR              0 3A3F1EC0          0          0

– Check that on node2 and node3, now we have CR copies (STATE = CR)
of the block with same SCN# = 955092

SYS@NODE2>@get_buffer_stat

OBJECT STATE     MODE_HELD LE_ADDR  CR_SCN_BAS CR_SCN_WRP
------ -------- ---------- -------- ---------- ----------
EMP    CR                0 00           955092          0

SYS@NODE3>@get_buffer_stat

OBJECT STATE     MODE_HELD LE_ADDR  CR_SCN_BAS CR_SCN_WRP
------ -------- ---------- -------- ---------- ----------
EMP    CR                0 00           955092          0

-- Check that the resource has been granted exclusive access (KJBLGRANT = KJUSEREX) on node1, from where the update was issued

SYS@NODE1>@get_resource_name

HEXNAME                   RESOURCE_NAME   KJBLGRANT   KJBLROLE KJBLREQUE
------------------------- --------------- --------- ---------- ---------
[0x97][0x4],[BL]          151,4,BL        KJUSEREX           0 KJUSERNL

-- Note that node 2 and node 3 no longer have any lock on the resource. The block is present in their buffer caches in CR mode and can be used locally by the instance for queries. However, the block cannot be used for update and cannot be served to other instances, so it does not require any locks.

SYS@NODE2>@get_resource_name
no rows selected

SYS@NODE3>@get_resource_name
no rows selected

– Issue following queries on master node NODE1 (Master) to get GRD information—
— Note that now the resource has been granted to node1 (OWNER_NODE=0)
in exclusive mode (GRANT_LEVEL=KJUSEREX)
— Only node1 has the most recent block in the buffer cache and will serve the
block in case any instance requests it.

SYS@NODE1>@get_resource_stat

RESOURCE_NAME          STATE     MAST  CNVT  GRNT REQUEST_L GRANT_LEV  OWNER_NODE
---------------------- -------- ----- ----- ----- --------- ---------   ----------
[0x97][0x4],[BL]       GRANTED      0     0     1 KJUSERNL  KJUSEREX        0


------- END UPDATE THE BLOCK IN INSTANCE 1 -------

– START UPDATE ANOTHER ROW IN THE SAME BLOCK ON NODE3 –
WRITE/WRITE CONTENTION

SCOTT@NODE3>update emp set sal=sal where empno = 7369;
commit;

-- Check that there are two copies of the same block on node3, where the update command was issued
   . a CR copy at SCN# = 955092, which was cached earlier
   . the current block, held in exclusive mode (state = XCUR)

SYS@NODE3>@get_buffer_stat

OBJECT STATE     MODE_HELD LE_ADDR  CR_SCN_BAS CR_SCN_WRP
------ -------- ---------- -------- ---------- ----------
EMP    XCUR              0 3A3EAA60          0          0
EMP    CR                0 00           955092                 0

-- Check that on node1 we have a past image (state = PI) of the block, since the earlier update was issued on node1
-- On node2 we still have a CR copy (state = CR) of the block at SCN# = 955092

SYS@NODE1>@get_buffer_stat

OBJECT STATE     MODE_HELD LE_ADDR  CR_SCN_BAS CR_SCN_WRP
------ -------- ---------- -------- ---------- ----------
EMP    PI                0 3A3EFA00          0          0

SYS@NODE2>@get_buffer_stat

OBJECT STATE        MODE_HELD LE_ADDR  CR_SCN_BAS CR_SCN_WRP
------ ----------- ---------- -------- ---------- ----------
EMP    CR                   0 00           955092             0

-- Check that the resource has been granted exclusive access (KJBLGRANT = KJUSEREX) on node3, from where this update was issued

SYS@NODE3>@get_resource_name

HEXNAME                   RESOURCE_NAME   KJBLGRANT   KJBLROLE KJBLREQUE
------------------------- --------------- --------- ---------- ---------
[0x97][0x4],[BL]          151,4,BL        KJUSEREX          64 KJUSERNL

-- Note that node 2 no longer has any lock on the resource. The block is present in its buffer cache in CR mode and can be used locally by the instance for queries. However, the block cannot be used for update and cannot be served to other instances, so it does not require any locks.

SYS@NODE2>@get_resource_name
no rows selected

— Note that on node1 where earlier update was done and where PI is present , lock has been downgraded to null (KJBLGRANT = KJUSERNL)

— Master keeps track of the PI images across the buffer cache. All these
instances will be asked to discard the PI copies once the current block is
written to disk.

SYS@NODE1>@get_resource_name

HEXNAME                   RESOURCE_NAME   KJBLGRANT   KJBLROLE KJBLREQUE
------------------------- --------------- --------- ---------- ---------
[0x97][0x4],[BL]          151,4,BL        KJUSERNL         192 KJUSERNL

-- Issue following queries on master node NODE1 (Master) to get GRD information
-- Node3 (OWNER_NODE = 2) holds the block in exclusive mode (GRANT_LEVEL=KJUSEREX)
-- Node1, which has the PI, has had its lock level downgraded to null
-- The master keeps track of the PI images across the buffer caches. All these
instances will be asked to discard the PI copies once the current block is
written to disk.

SYS@NODE1>@get_resource_stat
RESOURCE_NAME          STATE     MAST  CNVT  GRNT REQUEST_L GRANT_LEV OWNER_NODE
---------------------- -------- ----- ----- ----- --------- --------- ----------
[0x97][0x4],[BL]       GRANTED      0     0     1 KJUSERNL  KJUSERNL     0
[0x97][0x4],[BL]       GRANTED      0     0     1 KJUSERNL  KJUSEREX     2

---- END UPDATE ANOTHER ROW IN THE SAME BLOCK ON NODE3 ----

—— DISK WRITES DUE TO CHECKPOINT ——————-
– Trigger a global checkpoint –

SYS@NODE1> alter system checkpoint global;

— The PI on node1 is discarded and the buffer is changed to CR mode.

SYS@NODE1>@get_buffer_stat

OBJECT STATE     MODE_HELD LE_ADDR  CR_SCN_BAS CR_SCN_WRP
------ -------- ---------- -------- ---------- ----------
EMP    CR                0 00           997470          0

— CR block on node2 stays as it is —

SYS@NODE2>@get_buffer_stat

OBJECT STATE     MODE_HELD LE_ADDR  CR_SCN_BAS CR_SCN_WRP
------ -------- ---------- -------- ---------- ----------
EMP    CR                0 00           997343          0

— The checkpoint request from an instance does not change any lock status in the node holding the current block .

SYS@NODE3>@get_buffer_stat

OBJECT STATE     MODE_HELD LE_ADDR  CR_SCN_BAS CR_SCN_WRP
------ -------- ---------- -------- ---------- ----------
EMP    XCUR              0 387E8354          0          0
EMP    CR                0 00           997343          0

-- The PI from node1 has been discarded after the checkpoint and its status has been changed to CR.
-- Note that node2 and node1 no longer have any lock on the resource. The block is present in their buffer caches in CR mode and can be used locally by the instance for queries. However, the block cannot be used for update and cannot be served to other instances, so it does not require any locks.

SYS@NODE1>@get_resource_name
no rows selected

SYS@NODE2>@get_resource_name
no rows selected

— The checkpoint request does not change the lock status in the node which is holding the current block.

SYS@NODE3>@get_resource_name
HEXNAME                   RESOURCE_NAME   KJBLGRANT   KJBLROLE KJBLREQUE
------------------------- --------------- --------- ---------- ---------
[0x97][0x4],[BL]          151,4,BL        KJUSEREX           0 KJUSERNL

Conclusion:

- When a block is read from the disk as a result of a select statement (for read only), the block is held in current shared mode (state = SCUR) on the node from where the read was issued, and a protected read lock (shared read) is granted on this resource. The role of the block is local (SL0).

- When the same block is requested for read purposes from another node (read/read contention), the cached copy from the earlier node is sent to the requesting node. The cached version of the block is the same as the on-disk version, as the block has not been modified yet. The requesting node holds the block in current shared mode (state = SCUR). The block is granted in protected read mode to both the nodes. The role of the block is local in both the nodes (SL0).

- When the block is requested for an update operation from another node (read/write contention), its mode on both of the earlier nodes is downgraded to null, so that those two nodes can use their copy for local reads on that instance only and cannot send that copy to another node (Null, L, 0). The node where the update operation is performed holds the block in exclusive mode and local role, since another copy of the block is not available in any other instance (XL0).

- When the block is requested from another node for update, its PI is kept on the previous node with the state degraded to null and the role to global (Null, G, 1). The requesting node holds the block in exclusive, global mode (XG0).

– On checkpoint, PI of the block is discarded on the earlier node and its state changes to CR.

References:
Oracle Database 10g Real Application Clusters Handbook by K. Gopalakrishnan
http://www.toadworld.com/platforms/oracle/w/wiki/663.cache-fusion-otnn48.aspx

—————————————————————————————————

Related links:

Home

11G R2 RAC Index

11g R2 RAC : Dynamic Remastering
11g R2 RAC: How To Identify The Master Node In RAC
11g R2 RAC: How To Find The Resource Master?

 

———————

INSTANCE RECOVERY IN RAC


   In this post, I will discuss how instance recovery takes place in 11g R2 RAC. Instance recovery aims at
- writing all committed changes to the datafiles
- undoing all the uncommitted changes from the datafiles
- Incrementing the checkpoint no. to the SCN till which changes have been written to datafiles.
In a single instance database, before the instance crashes,
- some committed changes are in the redo log files but have not been written to the datafiles
- some uncommitted changes have made their way to datafiles
- some uncommitted changes are in the redo log buffer
After  the instance crashes in a single instance database
- all uncommitted changes in the redo log buffer are wiped out
- Online redo log files are read to identify the blocks that need to be recovered
- Identified blocks are read from the datafiles
- During roll forward phase, all the changes (committed/uncommitted) in redo log files are applied to them
- During rollback phase, all uncommitted changes are rolled back after reading undo from undo tablespace.
- Checkpoint # is incremented in the control file/datafile headers
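
As a side note, the amount of redo an instance would have to apply after a crash (and its current checkpoint lag) can be gauged from V$INSTANCE_RECOVERY, or GV$INSTANCE_RECOVERY across RAC instances; this is just a handy check, not part of the recovery mechanics described here:

SQL> select inst_id, recovery_estimated_ios, actual_redo_blks,
            target_mttr, estimated_mttr
     from gv$instance_recovery;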
In a RAC database there can be two scenarios :
- Only one instance crashes
- Multiple instances crash
We will discuss these cases one by one.
Single instance crash in RAC database
In this case, the scenario is quite similar to an instance crash in a single instance database, but there are slight differences too.
Let us consider a 3 node setup and a data block B1 with one column and 4 records in it. The column contains the values 100, 200, 300 and 400 in the 4 records. Initially the block is only on disk. The chart below shows the update operations on the block on the various nodes and the corresponding state (XCUR / PI / CR) and contents of the block on each node:

SCN# 1 : Node1 updates 100 -> 101
         Node1 : XCUR (101, 200, 300, 400)
         Disk  :      (100, 200, 300, 400)

SCN# 2 : Node2 updates 200 -> 201
         Node1 : PI   (101, 200, 300, 400)
         Node2 : XCUR (101, 201, 300, 400)
         Disk  :      (100, 200, 300, 400)

SCN# 3 : Node3 updates 300 -> 301
         Node1 : PI   (101, 200, 300, 400)
         Node2 : PI   (101, 201, 300, 400)
         Node3 : XCUR (101, 201, 301, 400)
         Disk  :      (100, 200, 300, 400)

SCN# 4 : Node2 crashes
         Node1 : PI   (101, 200, 300, 400)
         Node3 : XCUR (101, 201, 301, 400)
         Disk  :      (100, 200, 300, 400)

It is assumed that no incremental checkpointing has taken place on any of the nodes in the meanwhile.

Before crash status of block on various nodes is as follows:

- PI at SCN# 2 on Node1
- PI at SCN# 3 on Node2
- XCUR on Node3

 

Redo logs at various nodes are
Node1 : B1: 100 -> 101, SCN# 1
Node2 : B1:200 -> 201, SCN# 2
Node3 : B1:300 -> 301, SCN# 3

After the crash,

- Redo logs of crashed node (Node2) is analyzed and it is identified that block B1 needs to be recovered.
- It is also identified that role of the block is global as its different versions are available in Node1 and Node3
- It is identified that there is a PI on node1 whose SCN# (2) is earlier than the SCN# of crash (4)
- Changes from redo logs of Node2 are applied to the PI on Node1 and the block is written to disk
- Checkpoint # of node1 is incremented.
- A BWR is placed in the redo log of Node1 to indicate that the block has been written to disk and need not be recovered in case Node1 crashes.
Here it can be readily seen that there are certain differences from instance recovery in a single instance database.
The role of the block is checked.
  If the role is local, the block will be read from the disk and changes from the redo logs of Node2 will be applied, i.e. just like in a single instance database.
  If the role is global,
     it is checked whether a PI of the block at an SCN# earlier than the SCN# of the crash is available.
         If a PI is available, the changes in the redo logs of Node2 are applied to the PI, instead of reading the block from the disk.
         If a PI is not available (it has been flushed to disk due to incremental checkpointing on the node owning the PI, or on any of the nodes at an SCN# greater than that of the PI holder), the block will be read from the disk and changes from the redo logs of Node2 will be applied, just like it used to happen in OPS.
Hence, it can be inferred that a PI, if available, speeds up instance recovery, as the need to read the block from disk is eliminated. If a PI is not available, the block is read from the disk, just like in OPS.

Multiple instance crash in RAC database

Let us consider a 4 node setup and a data block B1 with one column and 4 records in it. The column contains the values 100, 200, 300 and 400 in the 4 records. Initially the block is only on disk. The sequence of events can be represented as:

SCN# 1 : Node1 updates 100 -> 101
         Node1 : XCUR (101, 200, 300, 400)
         Disk  :      (100, 200, 300, 400)

SCN# 2 : Node2 updates 200 -> 201
         Node1 : PI   (101, 200, 300, 400)
         Node2 : XCUR (101, 201, 300, 400)
         Disk  :      (100, 200, 300, 400)

SCN# 3 : Node3 updates 300 -> 301
         Node1 : PI   (101, 200, 300, 400)
         Node2 : PI   (101, 201, 300, 400)
         Node3 : XCUR (101, 201, 301, 400)
         Disk  :      (100, 200, 300, 400)

SCN# 4 : Checkpoint on Node2
         Node1 : CR   (101, 200, 300, 400)
         Node2 : CR   (101, 201, 300, 400)
         Node3 : XCUR (101, 201, 301, 400)
         Disk  :      (101, 201, 300, 400)

SCN# 5 : Node4 updates 400 -> 401
         Node1 : CR   (101, 200, 300, 400)
         Node2 : CR   (101, 201, 300, 400)
         Node3 : PI   (101, 201, 301, 400)
         Node4 : XCUR (101, 201, 301, 401)
         Disk  :      (101, 201, 300, 400)

SCN# 6 : Node1 updates 401 -> 402
         Node1 : CR   (101, 200, 300, 400) and XCUR (101, 201, 301, 402)
         Node2 : CR   (101, 201, 300, 400)
         Node3 : PI   (101, 201, 301, 400)
         Node4 : PI   (101, 201, 301, 401)
         Disk  :      (101, 201, 300, 400)

SCN# 7 : Node2 and Node3 crash
         Node1 : CR   (101, 200, 300, 400) and XCUR (101, 201, 301, 402)
         Node4 : PI   (101, 201, 301, 401)
         Disk  :      (101, 201, 300, 400)

 Explanation:

SCN# 1 - Node1 reads the block from disk and updates 100 to 101 in a record. It holds the block in XCUR mode.
SCN# 2 - Node2 requests the same block for update. Node1 keeps a PI and Node2 holds the block in XCUR mode.
SCN# 3 - Node3 requests the same block for update. Node2 keeps a PI and Node3 holds the block in XCUR mode. Now we have two PIs
             - On Node1 with SCN# 2
             - On Node2 with SCN# 3
SCN# 4 - Local checkpointing takes place on Node2. The PI on this node has SCN# 3.
              It is checked whether any of the other nodes has a PI at an earlier SCN# than this. Node1 has a PI at SCN# 2.
              Changes in the redo log of Node2 are applied to its PI and it is flushed to disk.
              A BWR is placed in the redo log of Node2 to indicate that the block has been written to disk and need not be recovered in case Node2 crashes.
              The PI on Node2 is discarded, i.e. its state changes to CR, which can't be used to serve remote nodes.
              The PI on Node1 is discarded, i.e. its state changes to CR, which can't be used to serve remote nodes.
              A BWR is placed in the redo log of Node1 to indicate that the block has been written to disk and need not be recovered in case Node1 crashes.
              Now the on-disk version of the block contains the changes of both Node1 and Node2.
SCN# 5 - Node4 requests the same block for update. Node3 keeps a PI and Node4 holds the block in XCUR mode. Node1 and Node2 have the CRs.
SCN# 6 - Node1 again requests the same block for update. Node4 keeps a PI and Node1 holds the block in XCUR mode. Now Node1 has the same block in both CR and XCUR mode. Node3 has a PI at SCN# 5.
SCN# 7 - Node2 and Node3 crash.
It is assumed that no incremental checkpointing has taken place on any of the nodes in the meanwhile.
 Before crash status of block on various nodes is as follows:
- CR at SCN# 2 on Node1, XCUR on Node1
- CR at SCN# 3 on Node2
- PI  at SCN# 5 on Node3
- PI at SCN# 6 on Node4
Redo logs at various nodes are
Node1 : B1: 100 -> 101, SCN# 1, BWR for B1 , B1:401->402 at SCN#6
Node2 : B1:200 -> 201, SCN# 2, BWR for B1
Node3 : B1:300 -> 301, SCN# 3
Node4 : B1:400->401 at SCN# 5
After the crash,
- Redo logs of crashed node (Node2) are analyzed and it is identified that block B1 has been flushed to disk as of SCN# 4 and need not be recovered as no changes have been made to it from Node2.
- No Redo log entry from Node2  needs to be applied
- Redo logs of crashed node (Node3) are analyzed and it is identified that block B1 needs to be recovered
- It is also identified that the role of the block is global, as its different versions were/are available on Node1 (XCUR), Node2 (crashed) and Node4 (PI)
- Changes from Node3 have to be applied . It is checked if any PI is available which is earlier than the SCN# of the change on node3 which needs to be applied i.e. SCN# 3.
- It is identified that no PI is available  whose SCN is earlier  than the  SCN# (3). Hence, block is read from the disk.
- Redo log entry which needs to be applied is : B1:300 -> 301, SCN# 3
-  Redo is applied to the block read from the disk and the block is written to disk so that on disk version contains changes made by Node3 also.
- Checkpoint # of node2 and Node3 are incremented.
After instance recovery :
Node1 : holds CR and XCUR
Node2 : - (crashed, holds nothing)
Node3 : - (crashed, holds nothing)
Node4 : holds PI
On disk version  of the block is:
101
201
301
400
——————————————————————————————————


Related links:

Home

11G R2 RAC Index

——————-

 

11G R2 RAC: DYNAMIC REMASTERING

In this post, I will demonstrate dynamic remastering of the resources in RAC .
In RAC, every data block is mastered by an instance. Mastering a block simply means that the master instance keeps track of the state of the block until the next reconfiguration event. When one instance departs the cluster, the GRD portion of that instance needs to be redistributed to the surviving nodes. Similarly, when a new instance enters the cluster, the GRD portions of the existing instances must be redistributed to create the GRD portion of the new instance. This is called dynamic resource reconfiguration.
In addition to dynamic resource reconfiguration, the master of an individual object's resources can also be moved while all instances stay up; this is called dynamic remastering. The basic idea is to master a buffer cache resource on the instance where it is mostly accessed. In order to determine whether dynamic remastering is necessary, the GCS essentially keeps track of the number of GCS requests on a per-instance and per-object basis. This means that if an instance, compared to another, is heavily accessing blocks from the same object, the GCS can take the decision to dynamically migrate all of that object's resources to the instance that is accessing the object most. The LMON, LMD and LMS processes are responsible for dynamic remastering.
– Remastering can be triggered as result of
    – Manual remastering
    – Resource affinity
    – Instance crash
– CURRENT SCENARIO -
- Database version : 11.2.0.1
- 3 node setup
- name of the database – racdb
— SETUP –
-- Get data_object_id for scott.emp

SYS> col owner for a10
     col data_object_id for 9999999
     col object_name for a15
     select owner, data_object_id, object_name
     from dba_objects
     where owner = 'SCOTT'
     and object_name = 'EMP';

OWNER      DATA_OBJECT_ID OBJECT_NAME
---------- -------------- ---------------
SCOTT               73181 EMP
 – Get File_id and block_id of emp table
SQL>select empno, dbms_rowid.rowid_relative_fno(rowid), 
                  dbms_rowid.rowid_block_number(rowid) 
          from scott.emp 
           where empno in (7788, 7369);
     EMPNO DBMS_ROWID.ROWID_RELATIVE_FNO(ROWID) DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)
———- ———————————— ————————————
      7369                                    4                                  151
      7788                                    4                                  151
– MANUAL REMASTERING –
You can manually remaster an object with oradebug command :
oradebug lkdebug -m pkey <data_object_id>
– NODE1 – shutdown the database and restart
[oracle@host01 ~]$ srvctl stop database -d racdb 
                  srvctl start database -d racdb
                  srvctl status database -d racdb
– Issue a select on the object from NODE2
SCOTT@NODE2> select * from  emp;

 

-- Find the GCS resource name to be used in the query
   x$kjbl.kjblname = resource name in hexadecimal format ([id1],[id2],[type])
   x$kjbl.kjblname2 = resource name in decimal format
   The hexname will be used to query the resource in the v$gc_element and v$dlm_ress views
get_resource_name
SYS@NODE2>col hexname for a25 
             col resource_name for a15 
             select b.kjblname hexname, b.kjblname2 resource_name, 
                     b.kjblgrant, b.kjblrole, b.kjblrequest  
           from x$le a, x$kjbl b 
             where a.le_kjbl=b.kjbllockp 
              and a.le_addr = (select le_addr 
                                from x$bh 
                               where dbablk = 151 
                                and obj    = 73181 
                               and class  = 1 
                                and state   <> 3);
HEXNAME                   RESOURCE_NAME   KJBLGRANT   KJBLROLE KJBLREQUE
————————- ————— ——— ———- ———
[0x97][0x4],[BL]          151,4,BL        KJUSERPR           0 KJUSERNL
-- Check the current master of the block --
-- Note that the current master of scott.emp is node1 (numbering starts from 0)
-- Previous master = 32767 is a placeholder indicating that the prior master
   was not known, i.e. this is the first remastering of that object.
   Now the master is 0, which is instance 1.
-- REMASTER_CNT = 1, indicating the object has been remastered only once
SYS>select o.object_name, m.CURRENT_MASTER, 
                   m.PREVIOUS_MASTER, m.REMASTER_CNT 
          from   dba_objects o, v$gcspfmaster_info m
           where o.data_object_id=73181
           and m.data_object_id = 73181 ;
OBJECT CURRENT_MASTER PREVIOUS_MASTER REMASTER_CNT
—— ————– ————— ————
EMP                 0           32767            1
-- Use the following SQL to show the master and owner of the block.
   This SQL joins x$kjbl with x$le to retrieve the resource name.
-- Note that the current master is node1 (KJBLMASTER=0) and the current owner of the block is
   node2 (KJBLOWNER=1)
SYS@NODE2> select kj.kjblname, kj.kjblname2, kj.kjblowner, 
                       kj.kjblmaster
            from (select kjblname, kjblname2, kjblowner, 
                         kjblmaster, kjbllockp         
                  from x$kjbl
                   where kjblname = '[0x97][0x4],[BL]'
                  ) kj, x$le le
            where le.le_kjbl = kj.kjbllockp
            order by le.le_addr;
KJBLNAME                       KJBLNAME2                       KJBLOWNER  KJBLMASTER
—————————— —————————— ———-  ———-
[0x97][0x4],[BL]               151,4,BL                                1     0
– Manually master the EMP table to node2 –
SYS@NODE2>oradebug lkdebug -m pkey 74625
– Check that the current master of the block has changed to node2 (numbering starts from 0)
– Previous master = 0 (Node1)
– REMASTER_CNT = 2 indicating the object has been remastered twice
SYS>select o.object_name, m.CURRENT_MASTER, 
                   m.PREVIOUS_MASTER, m.REMASTER_CNT 
          from   dba_objects o, v$gcspfmaster_info m 
           where o.data_object_id=74625
            and m.data_object_id = 74625 ;
OBJECT CURRENT_MASTER PREVIOUS_MASTER REMASTER_CNT
—— ————– ————— ————
EMP                 1               0            2
-- Find the master and owner of the block.
-- Note that the current owner of the block is Node2 (KJBLOWNER=1, the node from where the query was issued)
-- The current master of the block has been changed to node2 (KJBLMASTER=1)
SYS> select kj.kjblname, kj.kjblname2, kj.kjblowner, 
             kj.kjblmaster 
           from (select kjblname, kjblname2, kjblowner, 
                         kjblmaster, kjbllockp 
                 from x$kjbl
                  where kjblname = '[0x97][0x4],[BL]'
                  ) kj, x$le le
           where le.le_kjbl = kj.kjbllockp   
           order by le.le_addr;
KJBLNAME                       KJBLNAME2                       KJBLOWNER KJBLMASTER
—————————— —————————— ———-  ———-
[0x97][0x4],[BL]               151,4,BL                                1  1
—————————————————————————————
– REMASTERING DUE TO RESOURCE AFFINITY –

GCS masters a buffer cache resource on the instance where it is mostly accessed. In order to determine whether dynamic remastering is necessary, the GCS essentially keeps track of the number of GCS requests on a per-instance and per-object basis. This means that if an instance, compared to another, is heavily accessing blocks from the same object, the GCS can take the decision to dynamically migrate all of that object’s resources to the instance that is accessing the object most.
X$object_policy_statistics maintains statistics about objects and the OPENs
on those objects. The LCK0 process maintains these object affinity statistics.
Following parameters affect dynamic remastering due to resource affinity :
_gc_policy_limit : If an instance performs 50 more OPENs on an object than the other instance (controlled by the _gc_policy_limit parameter), then that object is a candidate for remastering. That object is queued and LMD0 reads the queue and initiates a GRD freeze. LMON performs the reconfiguration of buffer cache locks working with the LMS processes. All this is visible in the LMD0/LMON trace files.
_gc_policy_time : It controls how often the queue is checked to see if the remastering must be triggered or not with a default value of 10 minutes.
_gc_policy_minimum: This parameter is defined as “minimum amount of dynamic affinity activity per minute” to be a candidate for remastering. Defaults to 2500 and I think, it is lower in a busy environment.
To disable DRM completely, set _gc_policy_limit and _gc_policy_minimum to a much higher value, say 10 million. Setting the parameter _gc_policy_time to 0 will also completely disable DRM, but that also means that you cannot manually remaster objects. Further, X$object_policy_statistics is not maintained if DRM is disabled.
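
For what it's worth, the per-object access counts that feed these decisions can be peeked at with a query along the lines below. I am going from memory on the column names (SOPENS/XOPENS for shared and exclusive opens), so describe X$OBJECT_POLICY_STATISTICS on your own system before relying on it:

SYS> select node, object, sopens, xopens
     from x$object_policy_statistics
     order by object;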
– SETUP –
SYS>drop table scott.test purge; 
     create table scott.test as select * from sh.sales; 
     insert into scott.test select * from scott.test; 
    commit; 
     insert into scott.test select * from scott.test; 
     commit; 
    insert into scott.test select * from scott.test; 
     commit; 
     insert into scott.test select * from scott.test; 
     commit;
– Get data_object_id for scott.test
SYS> col data_object_id for 9999999 
         col object_name for a15 
         select owner, data_object_id, object_name, object_id  
         from dba_objects 
         where owner = 'SCOTT' 
           and object_name = 'TEST';
OWNER                          DATA_OBJECT_ID OBJECT_NAME      OBJECT_ID
—————————— ————– ————— ———-
SCOTT                                   74626 TEST                 74626
– Check the initial values of the parameters _gc_policy_minimum and _gc_policy_time
– Enter name of the parameter when prompted
SYS> 
 SET linesize 235 
 col Parameter FOR a20 
 col Instance FOR a10 
 col Description FOR a40 word_wrapped 

 SELECT a.ksppinm  "Parameter", 
       c.ksppstvl "Instance", 
        a.ksppdesc "Description" 
 FROM x$ksppi a, x$ksppcv b, x$ksppsv c, v$parameter p 
 WHERE a.indx = b.indx AND a.indx = c.indx 
   AND p.name(+) = a.ksppinm 
   AND UPPER(a.ksppinm) LIKE UPPER('%&parameter%') 
 ORDER BY a.ksppinm; 

 Enter value for parameter: gc_policy 
 old  11:   AND UPPER(a.ksppinm) LIKE UPPER('%&parameter%') 
 new  11:   AND UPPER(a.ksppinm) LIKE UPPER('%gc_policy%')
Parameter            Instance   Description
——————– ———- —————————————-
_gc_policy_minimum   1500       dynamic object policy minimum activity
                                per minute
_gc_policy_time      10         how often to make object policy
                                decisions in minutes
– Set _gc_policy_minimum and _gc_policy_time to very small values
   so that we can demonstrate remastering
SYS>alter system set "_gc_policy_minimum" = 10 scope=spfile; 
          alter system set "_gc_policy_time" = 1 scope=spfile;
– NODE1 – shutdown the database and restart
[oracle@host01 ~]$ srvctl stop database -d racdb 
                   srvctl start database -d racdb 
                   srvctl status database -d racdb
– Check that the parameter values have been changed to the minimum values
   allowed by Oracle, even though these are not the values we specified
– Enter name of the parameter when prompted
SYS>
SET linesize 235

col Parameter FOR a20

col Instance FOR a10

col Description FOR a40 word_wrapped

SELECT a.ksppinm  "Parameter", c.ksppstvl "Instance",       a.ksppdesc "Description" 
FROM x$ksppi a, x$ksppcv b, x$ksppsv c, v$parameter p 
WHERE a.indx = b.indx 
AND a.indx = c.indx   
AND p.name(+) = a.ksppinm   
AND UPPER(a.ksppinm) LIKE UPPER('%&parameter%') 
ORDER BY a.ksppinm; 

Enter value for parameter: gc_policy
old  11:   AND UPPER(a.ksppinm) LIKE UPPER('%&parameter%')
new  11:   AND UPPER(a.ksppinm) LIKE UPPER('%gc_policy%')
Parameter            Instance   Description
——————– ———- —————————————-
_gc_policy_minimum   20         dynamic object policy minimum activity
                                per minute
_gc_policy_time      4          how often to make object policy
                                decisions in minutes
- Assign TEST to node1 manually
– Issue a select on  scott.test from node1 –
SYS@NODE1>oradebug lkdebug -m pkey 74626 
     SCOTT@NODE1>select * from scott.test;
– check the current master of scott.test –
– Note that current master of scott.test is node1 (numbering starts from 0)
– Previous master = 2 (node3)
– REMASTER_CNT = 3 because, while I was preparing this demonstration, remastering had already been initiated twice earlier.
SYS@NODE1>select o.object_name, m.CURRENT_MASTER, 
                         m.PREVIOUS_MASTER, m.REMASTER_CNT 
                  from   dba_objects o, v$gcspfmaster_info m 
                  where o.data_object_id=74626 
                   and m.data_object_id = 74626 ;
OBJECT_NAME     CURRENT_MASTER PREVIOUS_MASTER REMASTER_CNT
————— ————– ————— ————
TEST                         0               2            3
– Issue an insert statement on scott.test from node3 so that scott.test will be remastered to node3
SCOTT@NODE3>insert into scott.test select * from test;
– check repeatedly that opens are increasing on scott.test with time
SYS@NODE1>select inst_id, sopens, xopens 
           from x$object_policy_statistics 
           where object=74626;
 INST_ID     SOPENS     XOPENS
———- ———- ———-
         1       3664          0
SYS@NODE1>/
   INST_ID     SOPENS     XOPENS
———- ———- ———-
         1       7585       1305
            .
            .
            .
SYS@NODE1>/
   INST_ID     SOPENS     XOPENS
———- ———- ———-
         1      12788      17000
SYS@NODE1>/
   INST_ID     SOPENS     XOPENS
———- ———- ———-
         1      35052      39297
– check repeatedly if remastering has been initiated —
– Note that after some time:
    . the current master changes from node1 (CURRENT_MASTER=0) to node3 (CURRENT_MASTER=2)
    . the previous master changes from node3 (PREVIOUS_MASTER=2) to node1 (PREVIOUS_MASTER=0)
    . REMASTER_CNT increases from 3 to 4
SYS@NODE2>select o.object_name, m.CURRENT_MASTER, 
                         m.PREVIOUS_MASTER, m.REMASTER_CNT 
          from   dba_objects o, v$gcspfmaster_info m
           where o.data_object_id=74626 
             and m.data_object_id = 74626 ;
16:09:16 SYS@NODE2>/
OBJECT_NAME     CURRENT_MASTER PREVIOUS_MASTER REMASTER_CNT
————— ————– ————— ————
TEST                         0               2            3
                .
                .
                .
16:12:24 SYS@NODE2>/
OBJECT_NAME     CURRENT_MASTER PREVIOUS_MASTER REMASTER_CNT
————— ————– ————— ————
TEST                         2               0            4
– REMASTERING DUE TO INSTANCE CRASH –
Presently node3 is the master of SCOTT.TEST
Let us crash node3 and monitor the remastering process
root@node3#init 6
– check repeatedly if remastering has been initiated –
– Note that scott.test has been remastered to node2 (CURRENT_MASTER=1)
– PREVIOUS_MASTER =2 and REMASTER_CNT has increased from 4 to 5
SYS@NODE2>select o.object_name, m.CURRENT_MASTER, 
                          m.PREVIOUS_MASTER, m.REMASTER_CNT
                   from   dba_objects o, v$gcspfmaster_info m 
                   where o.data_object_id=74626 
                   and m.data_object_id = 74626 ;
OBJECT_NAME     CURRENT_MASTER PREVIOUS_MASTER REMASTER_CNT
————— ————– ————— ————
TEST                         1               2            5
— CLEANUP —
SYS@NODE1>drop table scott.test purge;
SYS@NODE1>alter system reset "_gc_policy_minimum" scope=spfile sid='*';
          alter system reset "_gc_policy_time" scope=spfile sid='*';

 [oracle@host01 ~]$ srvctl stop database -d racdb 
                    srvctl start database -d racdb 
                    srvctl status database -d racdb
——————————————————————————————————
11g R2 RAC: NIC BONDING

 

Oracle uses the interconnect for both cache fusion and Oracle clusterware messaging. Depending on the number of nodes in the configuration, the interconnect can be a crossover cable when only two nodes participate in the cluster, or it can be connected via a switch. Networks, both public and private, can be single points of failure; such failures can disrupt the operation of the cluster and reduce availability. To avoid such failures, redundant networks should be configured. This means that dual network adapters should be configured for both the public and private networks. However, to enable dual network connections and to load-balance network traffic across the dual network adapters, features such as network interface card (NIC) bonding or NIC pairing should be used whenever possible.

 

NIC bonding is a method of pairing multiple physical network connections into a single logical interface. This logical interface is used to establish a connection with the database server. Because all network connections that are part of the logical interface can be used during communication, bonding provides load-balancing capabilities that would not otherwise be available. In addition, when one of the network connections fails, the other connection continues to receive and transmit data, making the configuration fault tolerant.

 

In a RAC configuration, there is a requirement to have a minimum of two network connections. One connection is for the private interface between the nodes in the cluster, and the other connection, called the public interface, is for users or application servers to connect and transmit data to the database server.
  Private IP addresses are required by Oracle RAC to provide communication between the cluster nodes. Depending on your private network configuration, you may need one or more IP addresses.

 

  Let’s implement and test NIC bonding in an 11gR2 RAC setup.

 

Current configuration is as follows:

 

– 2 node VM setup – Node1, Node2

 

Public network  – eth0
                       Node1 : 192.9.201.183
                       Node2 : 192.9.201.187

 

Private interconnect – eth1
                       Node1 : 10.0.0.1
                       Node2 : 10.0.0.2

 

– On both the nodes, before powering them on, create two network interfaces eth2 and eth3, which will be bonded and will replace the current private interconnect eth1.
– Power on both machines
– check that cluster services are running –
Node1#crsctl stat res -t
Node1#oifcfg getif
eth0 192.9.201.0 global public
eth1 10.0.0.0 global cluster_interconnect
———————
– On both the nodes,
———————

 

Step:
 – Using neat
   . deactivate both eth2 and eth3 (they have been assigned IP addresses by DHCP)
   . edit eth1 and remove the entries for IP address and netmask, because we will use the same IP addresses (10.0.0.1/2) for bond0 on both the nodes so that we don’t have to make a new entry in DNS.

 

Step:
- Create a logical interface (bond0):
  Create the bond0 configuration file in the /etc/sysconfig/network-scripts directory on both machines. The IP address used is the same as the one used for eth1 on each node.
Node1# vi /etc/sysconfig/network-scripts/ifcfg-bond0
Add the following lines:
DEVICE=bond0
IPADDR=10.0.0.1
NETWORK=10.0.0.0
NETMASK=255.255.255.0
USERCTL=no
BOOTPROTO=none
ONBOOT=yes
Node2# vi /etc/sysconfig/network-scripts/ifcfg-bond0
Add the following lines:
DEVICE=bond0
IPADDR=10.0.0.2
NETWORK=10.0.0.0
NETMASK=255.255.255.0
USERCTL=no
BOOTPROTO=none
ONBOOT=yes

 

Step:
- Modify the individual network interface configuration files to reflect the bonding details. The MASTER clause indicates which logical interface (bond0) this specific NIC belongs to, and the SLAVE clause indicates that it is one among other NICs bonded to that master and a slave to it.
  Edit eth2 and eth3 configuration files on both machines:
# vi /etc/sysconfig/network-scripts/ifcfg-eth2
Modify/append as follows:
DEVICE=eth2
BOOTPROTO=none
USERCTL=no
MASTER=bond0
SLAVE=yes
# vi /etc/sysconfig/network-scripts/ifcfg-eth3

Make sure the file reads as follows for the eth3 interface:
DEVICE=eth3
BOOTPROTO=none
USERCTL=no
MASTER=bond0
SLAVE=yes

 

Step:
- Configure the bonding driver/module
   The configuration consists of two lines per logical interface. miimon (the media independent interface monitor) is specified in milliseconds and represents the link monitoring frequency. mode indicates the type of configuration that will be deployed between the interfaces that are bonded or paired together: mode balance-alb uses adaptive load balancing, so all bonded interfaces share the traffic; mode active-backup keeps only one slave in the bond active at a time, and if the active slave goes down, the other slave becomes active and all traffic then flows via the newly active slave.
 Modify kernel modules configuration file on all the nodes:
# vi /etc/modprobe.conf
Append following two lines:
alias bond0 bonding
options bond0 mode=balance-alb miimon=100
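– If the active-backup behaviour described above were preferred instead of balance-alb, the options line could, for example, look like the following; this is only an illustrative alternative and not part of this setup:
alias bond0 bonding
options bond0 mode=active-backup miimon=100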

 

Step:
- Test configuration
1. On all the nodes,  load the bonding module, enter:
# modprobe bonding
2. On all the nodes, restart the networking service in order to bring up bond0 interface, enter:
# service network restart
3. Disable restart of crs on both nodes
node1#crsctl disable crs
node2#crsctl disable crs
4. The current status of the bond device bond0 is present in /proc/net/bonding/bond0. Type the following cat command to query the current status of the Linux kernel bonding driver:
# cat /proc/net/bonding/bond0
Sample outputs:
Ethernet Channel Bonding Driver: v3.4.0 (October 7, 2008)
Bonding Mode: load balancing (round-robin)
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:0c:29:2f:ee:13
Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:0c:29:2f:ee:1d

 

5. Display the bonded interface using the ifconfig command, which shows bond0 running as the master and both eth2 and eth3 running as slaves. Also, note that the hardware address of bond0 and of its underlying devices eth2 and eth3 is the same.
# ifconfig
Sample output
bond0     Link encap:Ethernet  HWaddr 00:0C:29:2F:EE:13
          inet addr:10.0.0.1  Bcast:10.0.0.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe2f:ee13/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:4211 errors:0 dropped:0 overruns:0 frame:0
          TX packets:332 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1431791 (1.3 MiB)  TX bytes:68068 (66.4 KiB)
eth0      Link encap:Ethernet  HWaddr 00:0C:29:2F:EE:FF
          inet addr:192.9.201.183  Bcast:192.9.201.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe2f:eeff/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:40757 errors:0 dropped:0 overruns:0 frame:0
          TX packets:44923 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:25161328 (23.9 MiB)  TX bytes:14121868 (13.4 MiB)
          Interrupt:67 Base address:0x2024
eth1      Link encap:Ethernet  HWaddr 00:0C:29:2F:EE:09
          inet6 addr: fe80::20c:29ff:fe2f:ee09/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5839 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5396 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3141765 (2.9 MiB)  TX bytes:2291293 (2.1 MiB)
          Interrupt:75 Base address:0x20a4
eth2      Link encap:Ethernet  HWaddr 00:0C:29:2F:EE:13
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:3236 errors:0 dropped:0 overruns:0 frame:0
          TX packets:174 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1239253 (1.1 MiB)  TX bytes:34959 (34.1 KiB)
          Interrupt:75 Base address:0x2424
eth3      Link encap:Ethernet  HWaddr 00:0C:29:2F:EE:13
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:975 errors:0 dropped:0 overruns:0 frame:0
          TX packets:158 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:192538 (188.0 KiB)  TX bytes:33109 (32.3 KiB)
          Interrupt:59 Base address:0x24a4
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:10252 errors:0 dropped:0 overruns:0 frame:0
          TX packets:10252 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:7863150 (7.4 MiB)  TX bytes:7863150 (7.4 MiB)

 

6. Get current pvt. network interface configuration being used by cluster
grid Node1$oifcfg getif
eth0  192.9.201.0  global  public
eth1  10.0.0.0  global  cluster_interconnect

 

7. Set new pvt interconnect to bond0 – Updates OCR
grid Node1$oifcfg setif -global bond0/10.0.0.0:cluster_interconnect
8. Restart CRS
Node1#crsctl stop crs
                crsctl start crs
9. check that crs started on both nodes
Node1#crsctl stat res -t
10. Get current pvt. interconnect info
   – will display eth0 – public interconnect
                  eth1 – earlier pvt interconnect
                  bond0 – pvt interconnect
Node1#oifcfg getif
11. Delete earlier pvt interconnect (eth1)
Node1#oifcfg delif -global eth1/10.0.0.0

12. Get current pvt. interconnect info
   – will display eth0 – public interconnect
                  bond0 – pvt interconnect
Node1#oifcfg getif
13. On node2, remove network adapter eth3 –
- Click VM > settings
- Click on last network adapter (eth3) > Remove > OK
- check that eth3 has been removed – will not be listed
Node2#ifconfig
- check that cluster services are still running on Node2 as eth2 is providing private interconnect service
Node1#crsctl stat res -t
14. On node1, remove network adapter eth3 –
- Click VM > settings
- Click on last network adapter (eth3) > Remove > OK
- check that eth3 has been removed – will not be listed
Node1#ifconfig
- Check that cluster services are still running on Node2 as eth2 on node1 is providing private interconnect service
Node1#crsctl stat res -t

15. On node2, remove network adapter eth2 (the only adapter left for the pvt interconnect)
- Click VM > settings
- Click on last network adapter (eth2) > Remove > OK
- Node2 immediately gets rebooted as it can’t communicate with node1
- Check that node2 is not a part of the cluster any more
Node1#crsctl stat res -t
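– While performing the adapter-removal tests above, the bond failover can also be watched from the OS side. A minimal sketch (the watch interval and grep patterns are only illustrative):
Node1# watch -n 2 "grep -A 2 'Slave Interface' /proc/net/bonding/bond0"
Node1# grep -i 'link failure' /proc/net/bonding/bond0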
                              HAPPY BONDING !!!
———————————————————————–

RECOVER VOTING DISK – SCENARIO-II

In this post, I will demonstrate how to recover voting disk in case we lose 2 out of 3 copies of voting disk.
In this case, voting disk will be recovered using surviving copy of voting disk.

Current scenario:
3 copies of the voting disk are present in the test diskgroup on disks ASMDISK010, ASMDISK011, ASMDISK012.
We will corrupt two disks, ASMDISK010 and ASMDISK011, so that ASMDISK012 still has a copy of the voting disk. We will then restore the voting disk to another diskgroup using the only valid copy we have.

Let’s start ...
– Currently, we have 3 voting disks. At least 2 should be accessible for the clusterware to work. Let us corrupt one of the voting disks and check whether the clusterware keeps running.

– FIND OUT LOCATION OF VOTEDISK
[grid@host01 cssd]$ crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
–  —–    —————–                ——— ———
 1. ONLINE   00ce3c95c6534f44bfffa645a3430bc3 (ORCL:ASMDISK012) [TEST]
 2. ONLINE   a3751063aec14f8ebfe8fb89fccf45ff (ORCL:ASMDISK010) [TEST]
 3. ONLINE   0fce89ac35834f99bff7b04ccaaa8006 (ORCL:ASMDISK011) [TEST]
Located 3 voting disk(s).

– FIND OUT THE NO. OF DISKS IN test DG (CONTAINING VOTEDISK)
ASMCMD> lsdsk -G test
Path
ORCL:ASMDISK010
ORCL:ASMDISK011
ORCL:ASMDISK012

-- Let us corrupt ASMDISK010
– bs = blocksize = 4096
– count = # of blocks overwritten = 1000000 (~1M)
– total no. of bytes corrupted = 4096 * 1000000
                                 (~4096M = size of one partition)
#dd if=/dev/zero of=/dev/oracleasm/disks/ASMDISK010 bs=4096 count=1000000
– CHECK THAT C/W KEEPS RUNNING, AS 2 VOTING DISKS (MORE THAN HALF OF THE
  VOTING DISKS) ARE STILL AVAILABLE
#crsctl stat res -t
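– Optionally, the surviving voting disk copies can be re-checked at this point with the same command used earlier (the exact output depends on how CSS marks the corrupted copy):
[grid@host01]$ crsctl query css votedisk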
-- Now let us corrupt ASMDISK011
– bs = blocksize = 4096
– count = # of blocks overwritten = 1000000 (~1M)
– total no. of bytes corrupted = 4096 * 1000000
                                 (~4096M = size of one partition)
#dd if=/dev/zero of=/dev/oracleasm/disks/ASMDISK011 bs=4096 count=1000000
Here, I was expecting clusterware to stop, as only 1 voting disk ( < half of the total of 3) was available, but surprisingly clusterware kept running. I even waited for quite some time, but to no avail. I would be glad if someone can give more input on this.
Finally, I stopped clusterware and tried to restart it. It was not able to restart.
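– For anyone who wants to dig further into why CSS behaved this way, the CSS daemon log is the place to look. A hedged sketch, assuming GRID_HOME points to the 11gR2 Grid Infrastructure home and the default log location is in use:
[grid@host01]$ tail -f $GRID_HOME/log/host01/cssd/ocssd.log | grep -i voting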

- CHECK THAT C/W IS NOT RUNNING
#crsctl stat res -t
– Now that we still have one copy of the voting disk on one of the disks in the test diskgroup, we can use that copy to get the voting disk back. Since the voting disk can’t be restored to the test diskgroup (its disks have been corrupted), we will restore the voting disk to the data diskgroup.

-- RECOVER VOTING DISK –
– To move the voting disk to the data diskgroup, the ASM instance should be up, and for the ASM instance to be up, CRS should be up. Hence we will
     – stop crs on all the nodes
     – start crs in exclusive mode on one of the nodes (host01)
     – start asm instance on host01 using pfile (since spfile of ASM instance is on ASM)
     – move voting disk to data diskgroup.
     – drop test diskgroup (it will allow as it does not have voting disk any more)
     – stop crs on host01(was running in exclusive mode)
     – restart crs on host01
     – start crs on rest of the nodes
     – start cluster on all the nodes

-- IMPLEMENTATION –
    - stop crs on all the nodes (if it does not stop, kill the ohasd process and retry)
root@hostn# crsctl stop crs -f
     – start crs in exclusive mode on one of the nodes (host01)
root@host01# crsctl start crs -excl
     - start asm instance on host01 using pfile (since spfile is on ASM)
grid@host01$ echo INSTANCE_TYPE=ASM >> /u01/app/oracle/init+ASM1.ora 
             chown grid:oinstall /u01/app/oracle/init+ASM1.ora 
SQL>startup pfile='/u01/app/oracle/init+ASM1.ora';
- Check that the data diskgroup is mounted on host01. If not, mount it.
ASMCMD>lsdg 
        mount data
 - Move the voting disk to the data diskgroup. The voting disk will be automatically recovered using the surviving copy.
root@host01#crsctl replace votedisk +data
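– Verify that the voting disk now resides in the data diskgroup, using the same query command as earlier (the file universal IDs will of course differ in your environment):
root@host01# crsctl query css votedisk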
     - drop test diskgroup (it will allow as it does not have voting disk)
SQL>drop diskgroup test force including contents;
     – stop crs on host01(was running in exclusive mode)
root@host01#crsctl stop crs
     – restart crs on host01
root@host01#crsctl start crs
     - start crs on rest of the nodes (if it does not start, kill ohasd process and retry)
root@host02#crsctl start crs 
root@host03#crsctl start crs 
     – start cluster on all the nodes and check that it is running
root@host01#crsctl start cluster -all
            crsctl stat res -t
————————————————————————————————