11g R2 RAC : NODE EVICTION DUE TO MEMBER KILL ESCALATION

    If Oracle Clusterware itself is working correctly but one of the RAC instances hangs, the database LMON process requests a member kill escalation and asks the CSS process to remove the hanging database instance from the cluster.
The following example demonstrates this in a two-node cluster:
SQL> col host_name for a20
SQL> select instance_name, host_name from gv$instance;
INSTANCE_NAME    HOST_NAME
---------------- --------------------
orcl1            host01.example.com
orcl2            host02.example.com
- On host02, stop the execution of all RDBMS processes by sending them the STOP signal.
- First, find the current database processes:
[root@host02 ~]#  ps -ef | grep ora_ | grep orcl2
oracle    6215     1  0 11:20 ?        00:00:00 ora_pmon_orcl2
oracle    6217     1  0 11:20 ?        00:00:00 ora_vktm_orcl2
oracle    6221     1  0 11:20 ?        00:00:00 ora_gen0_orcl2
oracle    6223     1  0 11:20 ?        00:00:00 ora_diag_orcl2
oracle    6225     1  0 11:20 ?        00:00:00 ora_dbrm_orcl2
oracle    6227     1  0 11:20 ?        00:00:00 ora_ping_orcl2
oracle    6229     1  0 11:20 ?        00:00:00 ora_psp0_orcl2
oracle    6231     1  0 11:20 ?        00:00:00 ora_acms_orcl2
oracle    6233     1  0 11:20 ?        00:00:00 ora_dia0_orcl2
oracle    6235     1  0 11:20 ?        00:00:00 ora_lmon_orcl2
oracle    6237     1  0 11:20 ?        00:00:02 ora_lmd0_orcl2
……
- Stop the execution of all RDBMS processes by sending the STOP signal:
[root@host02 ~]#  ps -ef | grep ora_ | grep orcl2 | awk '{print $2}' | while read PID
                               do
                               kill -STOP $PID
                               done
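The selection step in the loop above can also be written as a small filter, which makes it easier to check which PIDs will be signalled before anything is stopped. The following is a minimal sketch, assuming the `ps -ef` column layout shown in the listing (PID in column 2, command in column 8). The function names `instance_pids`, `freeze_instance` and `thaw_instance` are hypothetical helpers, not part of any Oracle tooling.

```shell
# instance_pids SID: read `ps -ef` style output on stdin and print the PIDs
# of background processes named ora_<name>_<SID>.
# Assumes PID is field 2 and the command name is field 8, as in the listing.
instance_pids() {
    local sid="$1"
    awk -v sid="$sid" '$8 ~ ("^ora_[a-z0-9]+_" sid "$") { print $2 }'
}

# freeze_instance SID: suspend every background process of the instance.
freeze_instance() {
    ps -ef | instance_pids "$1" | while read -r pid; do
        kill -STOP "$pid"
    done
}

# thaw_instance SID: resume the suspended processes with SIGCONT.
thaw_instance() {
    ps -ef | instance_pids "$1" | while read -r pid; do
        kill -CONT "$pid"
    done
}
```

Running `ps -ef | instance_pids orcl2` on host02 lists the PIDs without touching them; `freeze_instance orcl2` then performs the same STOP step as the loop. `thaw_instance orcl2` is only useful if the test is aborted before the eviction takes place.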
- From the client's point of view, the RAC database now hangs on both nodes: no queries or DML are possible. Try to execute a query; it will hang.
SQL> select instance_name, host_name from gv$instance;
– no output, query hangs …
- Because heartbeats are missing, the healthy RAC instance on host01 removes the hanging RAC instance by requesting a member kill escalation.
- Check the database alert log on host01: the LMS process issues a request to CSSD to reboot the node.
   The node is evicted, and the instance is restarted after the node rejoins the cluster.
[root@host01 trace]# tailf /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/alert_orcl1.log
LMS0 (ospid: 31771) has detected no messaging activity from instance 2
LMS0 (ospid: 31771) issues an IMR to resolve the situation
Please check LMS0 trace file for more detail.
Fri Nov 09 11:15:04 2012
Remote instance kill is issued with system inc 30
Remote instance kill map (size 1) : 2
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2
Fri Nov 09 11:15:13 2012
IPC Send timeout detected. Sender: ospid 6308 [oracle@host01.example.com (PZ97)]
Receiver: inst 2 binc 429420846 ospid 6251
Waiting for instances to leave:
2
Reconfiguration started (old inc 4, new inc 8)
List of instances:
 1 (myinst: 1)
 …..  Recovery of instance 2 starts
Global Resource Directory frozen
….
All grantable enqueues granted
 Post SMON to start 1st pass IR
—-
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
Started redo scan
IPC Send timeout to 2.0 inc 4 for msg type 12 from opid 42
Completed redo scan
 read 93 KB redo, 55 data blocks need recovery
Started redo application at
 Thread 2: logseq 9, block 42
Recovery of Online Redo Log: Thread 2 Group 3 Seq 9 Reading mem 0
  Mem# 0: +DATA/orcl/onlinelog/group_3.266.798828557
  Mem# 1: +FRA/orcl/onlinelog/group_3.259.798828561
Completed redo application of 0.05MB
Completed instance recovery at
 Thread 2: logseq 9, block 228, scn 1069404
 52 data blocks read, 90 data blocks written, 93 redo k-bytes read
Thread 2 advanced to log sequence 10 (thread recovery)
Fri Nov 09 12:18:55 2012
….
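The main stages in the alert log above (detection, remote kill, reconfiguration, instance recovery) can be picked out of a long log with a simple filter. This is a rough sketch: the function name `eviction_events` is hypothetical, and the pattern list is taken only from the excerpt shown here, so other releases may word these messages differently.

```shell
# eviction_events FILE: print the alert-log lines that mark the stages of
# the eviction. The patterns are copied from the excerpt above and are an
# assumption, not an exhaustive or official list.
eviction_events() {
    grep -E 'has detected no messaging activity|Remote instance kill is issued|instance eviction|IPC Send timeout|Reconfiguration started|Beginning instance recovery|Completed instance recovery' "$1"
}
```

For example, `eviction_events /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/alert_orcl1.log` on host01 prints only the matching lines.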
- Check the clusterware alert log on host01.
- The node is evicted and rebooted, then rejoins the cluster.
[grid@host01 host01]$ tailf /u01/app/11.2.0/grid/log/host01/alerthost01.log
[cssd(14493)]CRS-1607:Node host02 is being evicted in cluster incarnation 247848838; details at (:CSSNM00007:) in /u01/app/11.2.0/grid/log/host01/cssd/ocssd.log.
2012-11-09 11:15:56.140
[ohasd(12412)]CRS-8011:reboot advisory message from host: host02, component: mo103324, with time stamp: L-2012-11-09-11:15:56.580
[ohasd(12412)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-11-09 11:16:17.365
[cssd(14493)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 .
2012-11-09 11:16:17.400
[crsd(14820)]CRS-5504:Node down event reported for node 'host02'.
2
…… Node 2 joins the cluster
[cssd(14493)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02 .
2012-11-09 12:18:52.713
[crsd(14820)]CRS-2772:Server 'host02' has been assigned to pool 'Generic'.
2012-11-09 12:18:52.713
[crsd(14820)]CRS-2772:Server 'host02' has been assigned to pool 'ora.orcl'.
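To follow the eviction and rejoin in the clusterware alert log without reading every line, the CRS message codes can be extracted in order of appearance. The sketch below assumes GNU `grep` (for `-o`); the function names `crs_codes` and `node_events` are hypothetical.

```shell
# crs_codes FILE: print each CRS message code found in FILE, in order.
crs_codes() {
    grep -oE 'CRS-[0-9]+' "$1"
}

# node_events FILE NODE: print only the CRS messages that mention NODE.
node_events() {
    grep -E 'CRS-[0-9]+' "$1" | grep -F -- "$2"
}
```

Run against `/u01/app/11.2.0/grid/log/host01/alerthost01.log`, `node_events` with `host02` as the second argument narrows the log to the messages about the evicted node, such as the CRS-1607, CRS-5504 and CRS-2772 lines shown above.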
- After the node rejoins the cluster and the instance is restarted, re-execute the query; it succeeds:
SQL> conn sys/oracle@orcl as sysdba
SQL> col host_name for a20
SQL> select instance_name, host_name from gv$instance;
INSTANCE_NAME    HOST_NAME
---------------- --------------------
orcl1            host01.example.com
orcl2            host02.example.com
8 thoughts on “11g R2 RAC : NODE EVICTION DUE TO MEMBER KILL ESCALATION”

    1. Hi Shankar,

      The following processes are responsible for node eviction:
      1. ocssd
      . Missing network heartbeat
      . Missing disk heartbeat
      . After escalation of a member kill from a client (e.g. LMON)
      2. oclskd (Oracle Clusterware kill daemon)
      . Reboots a node based on requests from other nodes in the cluster
      3. cssdagent and cssdmonitor
      . Node hang
      . ocssd hang

      Regards
      Anju Garg

      1. Thank you, Anju! The article should help anyone understand node evictions in RAC; it's so simple. Please keep up the great work!

  1. Great article!

    In my environment (11.2.0.4 RAC):

    Sending kill -SIGSTOP to all Oracle-related processes on node2 does not result in a member kill escalation.

    clssgmmkLocalKillResults (lmon/dbwr/lgwr/chkp/mmon/lock/rbal were killed successfully) just returns success within 30 s, so there is no node eviction.

    node2 is still there, with all of its Oracle-related processes failed.

    Is something wrong on my side? I need your help!

    (Forgive my poor English.)

    Regards
    long
