INSTANCE RECOVERY IN RAC


   In this post, I will discuss how instance recovery takes place in 11g R2 RAC. Instance recovery aims at
- writing all committed changes to the datafiles
- undoing all the uncommitted changes from the datafiles
- advancing the checkpoint number to the SCN up to which changes have been written to the datafiles.
In a single instance database, before the instance crashes,
- some committed changes are in the redo log files but have not been written to the datafiles
- some uncommitted changes have made their way to datafiles
- some uncommitted changes are in the redo log buffer
After the instance crashes in a single instance database,
- all uncommitted changes in the redo log buffer are wiped out
- Online redo log files are read to identify the blocks that need to be recovered
- Identified blocks are read from the datafiles
- During roll forward phase, all the changes (committed/uncommitted) in redo log files are applied to them
- During rollback phase, all uncommitted changes are rolled back after reading undo from undo tablespace.
- CKPT# is incremented in the control file and datafile headers
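The roll forward and rollback phases listed above can be sketched as a toy model in Python. This is only an illustration under simplifying assumptions, not how Oracle implements recovery: RedoRecord, instance_recovery, the transaction labels T1 to T3 and the sample values are all hypothetical, and the before-image kept in each redo record stands in for the undo that Oracle reads from the undo tablespace.

```python
from dataclasses import dataclass


@dataclass
class RedoRecord:
    scn: int    # SCN of the change
    txn: str    # hypothetical transaction label
    block: str  # block name
    slot: int   # index of the record inside the block
    old: int    # before-image (stands in for undo)
    new: int    # after-image


def instance_recovery(datafile, redo_log, committed, checkpoint_scn):
    """Return (recovered blocks, new checkpoint SCN) for a toy database."""
    blocks = {name: list(rows) for name, rows in datafile.items()}
    by_scn = sorted(redo_log, key=lambda r: r.scn)

    # Roll forward: reapply every change after the checkpoint,
    # whether its transaction committed or not.
    for rec in by_scn:
        if rec.scn > checkpoint_scn:
            blocks[rec.block][rec.slot] = rec.new

    # Rollback: undo the changes of uncommitted transactions, newest first.
    for rec in reversed(by_scn):
        if rec.txn not in committed:
            blocks[rec.block][rec.slot] = rec.old

    new_checkpoint = max([checkpoint_scn] + [r.scn for r in by_scn])
    return blocks, new_checkpoint


# At the crash, T3's uncommitted change (300 -> 301) is already on disk.
datafile = {"B1": [100, 200, 301, 400]}
redo_log = [
    RedoRecord(1, "T3", "B1", 2, 300, 301),  # uncommitted, already written
    RedoRecord(2, "T1", "B1", 0, 100, 101),  # committed, not yet written
    RedoRecord(3, "T2", "B1", 1, 200, 201),  # uncommitted, not yet written
]
recovered, ckpt = instance_recovery(
    datafile, redo_log, committed={"T1"}, checkpoint_scn=1
)
print(recovered, ckpt)
```

In this sample only T1's change survives: the block ends up as 101, 200, 300, 400 and the checkpoint moves to SCN# 3.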
In a RAC database there can be two scenarios :
- Only one instance crashes
- Multiple instances crash
We will discuss these cases one by one.
Single instance crash in RAC database
In this case, the scenario is quite similar to an instance crash in a single instance database, but there are slight differences.
Let us consider a 3-node setup and a data block B1 that holds one column and 4 records, with values 100, 200, 300 and 400. Initially the block is only on disk. The following chart shows the update operations performed on the block on the various nodes and the corresponding states of the block. Each copy of the block is in one of three states: CR, PI or XCUR.
SCN#  --- Update operation on ---   ------------------- State of the block on -------------------
      Node1     Node2     Node3     Node1            Node2            Node3            Disk
1     100->101  -         -         101/200/300/400  -                -                100/200/300/400
2     -         200->201  -         101/200/300/400  101/201/300/400  -                100/200/300/400
3     -         -         300->301  101/200/300/400  101/201/300/400  101/201/301/400  100/200/300/400
4     -         CRASH     -         101/200/300/400  101/201/300/400  101/201/301/400  100/200/300/400

Each state lists the values of the 4 records of B1, separated by "/".

It is assumed that no incremental checkpointing has taken place on any of the nodes in the meanwhile.

Before the crash, the status of the block on the various nodes is as follows:

- PI at SCN# 2 on Node1
- PI at SCN# 3 on Node2
- XCUR on Node3

 

Redo logs on the various nodes contain:
Node1 : B1: 100 -> 101, SCN# 1
Node2 : B1: 200 -> 201, SCN# 2
Node3 : B1: 300 -> 301, SCN# 3

After the crash,

- Redo logs of the crashed node (Node2) are analyzed and it is identified that block B1 needs to be recovered.
- It is also identified that the role of the block is global, as different versions of it are available on Node1 and Node3.
- It is identified that there is a PI on Node1 whose SCN# (2) is earlier than the SCN# of the crash (4).
- Changes from the redo logs of Node2 are applied to the PI on Node1 and the block is written to disk.
- Checkpoint # of Node1 is incremented.
- A BWR is placed in the redo log of Node1 to indicate that the block has been written to disk and need not be recovered in case Node1 crashes.
Here it can be readily seen that there are certain differences from instance recovery in a single instance database.
The role of the block is checked.
  If the role is local, the block is read from disk and changes from the redo logs of Node2 are applied, i.e. just like in a single instance database.
  If the role is global,
     it is checked whether a PI of the block at an SCN# earlier than the SCN# of the crash is available.
         If a PI is available, changes in the redo logs of Node2 are applied to the PI instead of reading the block from disk.
         If a PI is not available (it has been flushed to disk due to incremental checkpointing
                                       on the node holding the PI, or
                                       on any node holding a PI at a later SCN#),
             the block is read from disk and changes from the redo logs of Node2 are applied, just as it used to happen in OPS.
Hence, it can be inferred that a PI, if available, speeds up instance recovery because the need to read the block from disk is eliminated. If a PI is
not available, the block is read from disk just like in OPS.
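
The role check described above can be sketched in Python as follows. This is a minimal illustration, not Oracle's actual algorithm: choose_recovery_base and apply_redo are hypothetical names, and picking the most recent qualifying PI when several exist is my own assumption. The sample data is the 3-node example, with a PI at SCN# 2 on Node1 and the redo of the crashed Node2.

```python
def choose_recovery_base(role, disk_image, past_images, crash_scn):
    """Pick the image to which the crashed node's redo is applied.

    past_images maps a surviving node to (scn, rows) for the PI it holds.
    Returns (source description, copy of the rows).
    """
    if role == "local":
        return "disk", list(disk_image)
    usable = [
        (scn, node)
        for node, (scn, _rows) in past_images.items()
        if scn < crash_scn
    ]
    if not usable:
        # No PI left (e.g. flushed by a checkpoint): fall back to disk.
        return "disk", list(disk_image)
    # Assumption: when several PIs qualify, take the most recent one.
    _scn, node = max(usable)
    return f"PI on {node}", list(past_images[node][1])


def apply_redo(rows, redo):
    """Apply (record index, new value) changes in order."""
    for slot, new in redo:
        rows[slot] = new
    return rows


disk = [100, 200, 300, 400]
pis = {"Node1": (2, [101, 200, 300, 400])}
node2_redo = [(1, 201)]  # 200 -> 201 at SCN# 2

source, base = choose_recovery_base("global", disk, pis, crash_scn=4)
recovered = apply_redo(base, node2_redo)
print(source, recovered)
```

With the PI available, the block written to disk is 101, 201, 300, 400 and no read of the block from disk is needed. Passing an empty dictionary of PIs makes the function fall back to the disk image.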

Multiple instance crash in RAC database

Let us consider a 4-node setup and a data block B1 that holds one column and 4 records, with values 100, 200, 300 and 400. Initially the block is only on disk. It can be represented as:

SCN#  ------- Update operation on -------     ----------------------------- State of the block on -----------------------------
      Node1     Node2     Node3     Node4     Node1            Node2            Node3            Node4            Disk
1     100->101  -         -         -         101/200/300/400  -                -                -                100/200/300/400
2     -         200->201  -         -         101/200/300/400  101/201/300/400  -                -                100/200/300/400
3     -         -         300->301  -         101/200/300/400  101/201/300/400  101/201/301/400  -                100/200/300/400
4     -         CKPT      -         -         101/200/300/400  101/201/300/400  101/201/301/400  -                101/201/300/400
5     -         -         -         400->401  101/200/300/400  101/201/300/400  101/201/301/400  101/201/301/401  101/201/300/400
6     401->402  -         -         -         101/200/300/400  101/201/300/400  101/201/301/400  101/201/301/401  101/201/300/400
                                              101/201/301/402
7     -         CRASH     CRASH     -         101/200/300/400  -                -                101/201/301/401  101/201/301/400
                                              101/201/301/402

Each state lists the values of the 4 records of B1, separated by "/". At SCN# 6 and 7, Node1 holds two copies of the block: the CR copy on the first line and the XCUR copy on the second.

 Explanation:

SCN# 1 – Node1 reads the block from disk and updates 100 to 101 in the first record. It holds the block in XCUR mode.
SCN# 2 – Node2 requests the same block for update. Node1 keeps a PI and Node2 holds the block in XCUR mode.
SCN# 3 – Node3 requests the same block for update. Node2 keeps a PI and Node3 holds the block in XCUR mode. Now we have two PIs:
             – on Node1 with SCN# 2
             – on Node2 with SCN# 3
SCN# 4 – Local checkpointing takes place on Node2. The PI on this node has SCN# 3.
             It is checked whether any other node has a PI at an earlier SCN# than this. Node1 has a PI at SCN# 2.
             Changes in the redo log of Node2 are applied to its PI and it is flushed to disk.
             A BWR is placed in the redo log of Node2 to indicate that the block has been written to disk and need not be recovered in case Node2 crashes.
             The PI on Node2 is discarded, i.e. its state changes to CR, which can't be used to serve remote nodes.
             The PI on Node1 is discarded, i.e. its state changes to CR, which can't be used to serve remote nodes.
             A BWR is placed in the redo log of Node1 to indicate that the block has been written to disk and need not be recovered in case Node1 crashes.
             Now the on-disk version of the block contains the changes of both Node1 and Node2.
SCN# 5 – Node4 requests the same block for update. Node3 keeps a PI and Node4 holds the block in XCUR mode. Node1 and Node2 hold CR copies.
SCN# 6 – Node1 again requests the same block for update. Node4 keeps a PI and Node1 holds the block in XCUR mode. Node1 now holds the same block in both CR and XCUR mode. Node3 has a PI at SCN# 5.
SCN# 7 – Node2 and Node3 crash.
It is assumed that no checkpointing other than the one at SCN# 4 has taken place on any of the nodes in the meanwhile.
Before the crash, the status of the block on the various nodes is as follows:
- CR at SCN# 2 on Node1, XCUR on Node1
- CR at SCN# 3 on Node2
- PI  at SCN# 5 on Node3
- PI at SCN# 6 on Node4
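The state changes described in the explanation can be replayed with a small Python sketch. It is a simplified model of my own and not Oracle's cache fusion code: replay and the event tuples are hypothetical, a PI is stamped with the SCN at which the block moved to the next node, and a checkpoint on a node turns that node's PI and every PI with an earlier SCN# into CR.

```python
def replay(events):
    """Replay (scn, kind, node) events and return node -> [(state, scn), ...]."""
    buffers = {}    # node -> list of [state, scn] copies of the block
    holder = None   # node that currently holds the XCUR copy
    for scn, kind, node in events:
        if kind == "update" and node != holder:
            if holder is not None:
                # The previous holder keeps its copy as a PI stamped with this SCN.
                for buf in buffers[holder]:
                    if buf[0] == "XCUR":
                        buf[0], buf[1] = "PI", scn
            buffers.setdefault(node, []).append(["XCUR", scn])
            holder = node
        elif kind == "checkpoint":
            pi_scns = [b[1] for b in buffers.get(node, []) if b[0] == "PI"]
            if pi_scns:
                # The checkpointing node's PI and every earlier PI become CR.
                limit = max(pi_scns)
                for bufs in buffers.values():
                    for buf in bufs:
                        if buf[0] == "PI" and buf[1] <= limit:
                            buf[0] = "CR"
    return {n: [tuple(b) for b in bufs] for n, bufs in buffers.items()}


events = [
    (1, "update", "Node1"),
    (2, "update", "Node2"),
    (3, "update", "Node3"),
    (4, "checkpoint", "Node2"),
    (5, "update", "Node4"),
    (6, "update", "Node1"),
]
status = replay(events)
for node, copies in status.items():
    print(node, copies)
```

Replaying SCN# 1 to 6 gives CR at SCN# 2 plus XCUR on Node1, CR at SCN# 3 on Node2, PI at SCN# 5 on Node3 and PI at SCN# 6 on Node4, which matches the status listed above.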
Redo logs on the various nodes contain:
Node1 : B1: 100 -> 101 at SCN# 1, BWR for B1, B1: 401 -> 402 at SCN# 6
Node2 : B1: 200 -> 201 at SCN# 2, BWR for B1
Node3 : B1: 300 -> 301 at SCN# 3
Node4 : B1: 400 -> 401 at SCN# 5
After the crash,
- Redo logs of the crashed node Node2 are analyzed and it is identified that block B1 was flushed to disk as of SCN# 4 and need not be recovered, as no changes have been made to it from Node2 since then.
- No redo log entry from Node2 needs to be applied.
- Redo logs of the crashed node Node3 are analyzed and it is identified that block B1 needs to be recovered.
- It is also identified that the role of the block is global, as different versions of it were/are available on Node1 (XCUR), Node2 (crashed) and Node4 (PI).
- Changes from Node3 have to be applied. It is checked whether any PI is available that is earlier than the SCN# of the change on Node3 that needs to be applied, i.e. SCN# 3.
- It is identified that no PI is available whose SCN# is earlier than SCN# 3. Hence, the block is read from disk.
- The redo log entry that needs to be applied is: B1: 300 -> 301, SCN# 3
- Redo is applied to the block read from disk and the block is written to disk, so that the on-disk version also contains the changes made by Node3.
- Checkpoint # of Node2 and Node3 is incremented.
After instance recovery :
Node1 : holds CR and XCUR
Node2 :
Node3 :
Node4 : holds PI
The on-disk version of the block is:
101
201
301
400
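The recovery of the two crashed nodes can be reproduced with a short Python sketch. It is a simplified illustration, not Oracle's recovery code: redo_needing_recovery, recover and the tuple layout of the log entries are hypothetical, every entry refers to block B1, and the block is read from disk because no usable PI exists in this example.

```python
def redo_needing_recovery(log):
    """Keep only the change records that follow the last BWR in a node's log."""
    pending = []
    for entry in log:
        if entry[0] == "BWR":
            pending = []  # everything before the BWR is already on disk
        else:
            pending.append(entry)
    return pending


def recover(disk, crashed_logs):
    """Merge the pending redo of all crashed nodes by SCN and apply it."""
    merged = sorted(
        (e for log in crashed_logs.values() for e in redo_needing_recovery(log)),
        key=lambda e: e[1],
    )
    rows = list(disk)
    for _kind, _scn, slot, _old, new in merged:
        rows[slot] = new
    return rows


# Entries are ("CHANGE", scn, record index, old, new) or ("BWR",).
crashed_logs = {
    "Node2": [("CHANGE", 2, 1, 200, 201), ("BWR",)],
    "Node3": [("CHANGE", 3, 2, 300, 301)],
}
disk_before = [101, 201, 300, 400]  # on-disk block after the checkpoint at SCN# 4
disk_after = recover(disk_before, crashed_logs)
print(disk_after)
```

Node2's change is skipped because of its BWR and only Node3's change is applied, so the result is 101, 201, 301, 400 as listed above. The 401 and 402 changes stay in the buffer caches of the surviving Node4 and Node1.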

9 thoughts on “INSTANCE RECOVERY IN RAC”

  1. Hi Maam,

    Thanks for such a great post.
    Can you please explain what BWR is here?
    And I guess a commit is also executed after every DML.

    Just want to Add:–

    When the same dirty block is requested by some other instance for write or read purposes, an image of the block is created in the owning instance and then the block is shipped to the requesting instance. This image copy of the block is called a Past Image (PI).

    XCUR–Exclusive current lock which is required to update the block.

    CR–Is consistent version of block.
    +++++++++++++++++++++++++++++++++++++++++++++++++++++
    Small request: can you please write a post on single instance recovery, or help in understanding the questions below?

    1) Can you please explain how the commit marker, which Oracle writes in the redo stream when a transaction is committed, helps in roll forward recovery?

    2) After the db is opened, Oracle wants to roll back the uncommitted transactions that happened before the db was aborted. How does Oracle or SMON determine
    which blocks it needs to roll back after the db is open?

    3) Suppose USER A is running a transaction for 7 mins, and after 5 mins a checkpoint happens in the database, so all the dirty buffers are flushed to the datafiles
    by DBWR, and before DBWR writes, LGWR writes the redo for these blocks to the redo log file. After 7 mins USER A commits the transaction; Oracle
    writes the commit marker in the redo stream and updates the undo header slot to show that the transaction is committed. This undo header slot is now free for use by other transactions,
    as the previous transaction is committed. Now USER B has overwritten some of the undo entries of USER A's transaction. In between, a USER C process wants
    to access all the blocks modified by USER A, so it reads the blocks from the datafiles, and when it checks a block it sees an active ITL entry in it,
    as the transaction was active when the block was written to the datafile. From the ITL entry it tries to access the undo header slot to determine whether the transaction
    committed or not, but as the entry has been overwritten by USER B, what will happen to this block?

    1. Hi Jamsher,

      From whatever little I know, I will try to answer your questions.
      - A PI is kept in the owning instance only if the block is requested for a write operation by another instance.
      - BWR means Block Written Record. After DBWR of an instance has written some dirty blocks to disk, a BWR is placed in the redo stream of that instance to reflect it. At the time of recovery of that instance, only the redo beyond the BWR needs to be applied to the datafiles.
      - During the roll forward phase of instance recovery, redo is applied after reading the redo logs. In this process, undo is generated. This undo is used to roll back uncommitted changes.

      I hope your questions are answered.

      Regards
      Anju Garg

    2. Hi Maam,

      Thanks for your reply. Please let me know if I have understood correctly.

      Suppose a transaction modifies blocks 1 to 10 during time T1 to T10. Suppose blocks 1 to 5 hold committed changes and blocks 6 to 10 hold uncommitted changes,
      and at T11 the instance crashes. As the instance crashes, all blocks that were present in the buffer cache are lost.

      Roll forward (mount state): SMON will apply the changes from redo to all 10 blocks, which will cause the undo to be generated.
      Rollback (open state): Now transaction recovery starts. As the first 5 blocks are committed, they remain untouched,
      but as blocks 6 to 10 are not committed, the undo generated in the mount state is applied to them to restore the old values.

      Thanks
      Jamsher

  2. The thing is, there are many blogs by Oracle experts, but it is the simplicity of this blog that makes it the best one. It helps us understand the concepts easily, especially a complex subject such as RAC, from the basics through to the end. Great work.

Your comments and suggestions are welcome!