If Oracle Clusterware itself is working perfectly but one of the RAC instances is hanging, the database LMON process will request a member kill escalation and ask the CSS process to remove the hanging database instance from the cluster.
The following example demonstrates this in a cluster consisting of two nodes:
SQL> col host_name for a20
SQL> select instance_name, host_name from gv$instance;

INSTANCE_NAME    HOST_NAME
---------------- --------------------
orcl1            host01.example.com
orcl2            host02.example.com
- On the host02 server, stop the execution of all RDBMS processes (by sending the STOP signal).
– Find the current database processes:
[root@host02 ~]# ps -ef | grep ora_ | grep orcl2
oracle 6215 1 0 11:20 ? 00:00:00 ora_pmon_orcl2
oracle 6217 1 0 11:20 ? 00:00:00 ora_vktm_orcl2
oracle 6221 1 0 11:20 ? 00:00:00 ora_gen0_orcl2
oracle 6223 1 0 11:20 ? 00:00:00 ora_diag_orcl2
oracle 6225 1 0 11:20 ? 00:00:00 ora_dbrm_orcl2
oracle 6227 1 0 11:20 ? 00:00:00 ora_ping_orcl2
oracle 6229 1 0 11:20 ? 00:00:00 ora_psp0_orcl2
oracle 6231 1 0 11:20 ? 00:00:00 ora_acms_orcl2
oracle 6233 1 0 11:20 ? 00:00:00 ora_dia0_orcl2
oracle 6235 1 0 11:20 ? 00:00:00 ora_lmon_orcl2
oracle 6237 1 0 11:20 ? 00:00:02 ora_lmd0_orcl2
……
– Stop the execution of all RDBMS processes (by sending the STOP signal):
[root@host02 ~]# ps -ef | grep ora_ | grep orcl2 | awk '{print $2}' | while read PID
do
kill -STOP $PID
done
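The loop above can be generalized into a small helper that both freezes and resumes an instance's processes. The function name `sig_by_pattern` and the use of `pgrep -f` are my additions, not part of the original demo; sending CONT with the same helper lets you undo the hang by hand instead of waiting for the eviction and reboot:

```shell
#!/bin/sh
# Hypothetical helper (not from the original demo): send a signal to every
# process whose full command line matches an extended regex, via pgrep -f.
#   sig_by_pattern STOP 'ora_.*_orcl2'   # freeze all instance processes
#   sig_by_pattern CONT 'ora_.*_orcl2'   # resume them (undo the hang)
sig_by_pattern() {
  sig="$1"; pattern="$2"
  for pid in $(pgrep -f "$pattern"); do
    kill -"$sig" "$pid" 2>/dev/null
  done
}
```

Run it as root (or as the oracle software owner) on the node whose instance you want to freeze. Note that once LMON on the surviving node escalates the member kill, resuming the processes may come too late to prevent the eviction.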
– From the client's point of view, the Real Application Clusters database is now hanging on both nodes; no queries or DML are possible. Try to execute a query: it will hang.
SQL> select instance_name, host_name from gv$instance;
– no output, query hangs …
– Due to missing heartbeats, the healthy RAC instance on node host01 will remove the hanging RAC instance by requesting a member kill escalation.
– Check the database alert log file on host01: the LMS process issues a request to CSSD to reboot the node.
The node is evicted, and the instance is restarted after the node rejoins the cluster.
[root@host01 trace]# tailf /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/alert_orcl1.log
LMS0 (ospid: 31771) has detected no messaging activity from instance 2
LMS0 (ospid: 31771) issues an IMR to resolve the situation
Please check LMS0 trace file for more detail.
Fri Nov 09 11:15:04 2012
Remote instance kill is issued with system inc 30
Remote instance kill map (size 1) : 2
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2
Fri Nov 09 11:15:13 2012
IPC Send timeout detected. Sender: ospid 6308 [oracle@host01.example.com (PZ97)]
Receiver: inst 2 binc 429420846 ospid 6251
Waiting for instances to leave:
2
Reconfiguration started (old inc 4, new inc 8)
List of instances:
1 (myinst: 1)
….. Recovery of instance 2 starts
Global Resource Directory frozen
….
All grantable enqueues granted
Post SMON to start 1st pass IR
—-
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
Started redo scan
IPC Send timeout to 2.0 inc 4 for msg type 12 from opid 42
Completed redo scan
read 93 KB redo, 55 data blocks need recovery
Started redo application at
Thread 2: logseq 9, block 42
Recovery of Online Redo Log: Thread 2 Group 3 Seq 9 Reading mem 0
Mem# 0: +DATA/orcl/onlinelog/group_3.266.798828557
Mem# 1: +FRA/orcl/onlinelog/group_3.259.798828561
Completed redo application of 0.05MB
Completed instance recovery at
Thread 2: logseq 9, block 228, scn 1069404
52 data blocks read, 90 data blocks written, 93 redo k-bytes read
Thread 2 advanced to log sequence 10 (thread recovery)
Fri Nov 09 12:18:55 2012
….
– Check the Clusterware alert log on host01:
– The node is evicted and rebooted so that it can rejoin the cluster.
[grid@host01 host01]$ tailf /u01/app/11.2.0/grid/log/host01/alerthost01.log
[cssd(14493)]CRS-1607:Node host02 is being evicted in cluster incarnation 247848838; details at (:CSSNM00007:) in
/u01/app/11.2.0/grid/log/host01/cssd/ocssd.log.
2012-11-09 11:15:56.140
[ohasd(12412)]CRS-8011:reboot advisory message from host: host02, component: mo103324, with time stamp: L-2012-11-09-
11:15:56.580
[ohasd(12412)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from
CSS
2012-11-09 11:16:17.365
[cssd(14493)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 .
2012-11-09 11:16:17.400
[crsd(14820)]CRS-5504:Node down event reported for node 'host02'.
…… Node 2 joins the cluster
[cssd(14493)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02 .
2012-11-09 12:18:52.713
[crsd(14820)]CRS-2772:Server 'host02' has been assigned to pool 'Generic'.
2012-11-09 12:18:52.713
[crsd(14820)]CRS-2772:Server 'host02' has been assigned to pool 'ora.orcl'.
– After the node rejoins the cluster and the instance is restarted, re-execute the query; it succeeds:
SQL> conn sys/oracle@orcl as sysdba
col host_name for a20
select instance_name, host_name from gv$instance;
INSTANCE_NAME    HOST_NAME
---------------- --------------------
orcl1            host01.example.com
orcl2            host02.example.com
Good one.
–Jamsher
Nice article! I have also listed the top 4 reasons for node eviction:
http://www.dbas-oracle.com/2013/06/Top-4-Reasons-Node-Reboot-Node-Eviction-in-Real-Application-Cluster-RAC-Environment.html
Are there any other background processes involved during node evictions?
Thanks ,
Shankar
Hi Shankar,
The following processes are responsible for node eviction:
1. ocssd
. Missing network heartbeat
. Missing disk heartbeat
. After escalation of a member kill from a client (e.g. LMON)
2. oclskd (Oracle clusterware kill daemon)
. Reboots a node based on requests from other nodes in the cluster
3. cssdagent and cssdmonitor
. node hang
. ocssd hang
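A quick way to see which of these daemons are alive on a node is a simple ps filter. This is a sketch: the process names are as they typically appear on Linux with 11.2 Grid Infrastructure, and on a machine without Clusterware the filter simply matches nothing:

```shell
#!/bin/sh
# Sketch: list the clusterware daemons that take part in eviction decisions.
# ocssd watches network/disk heartbeats and member kill requests,
# oclskd carries out kill requests from other nodes,
# cssdagent/cssdmonitor watch for node hangs and an ocssd hang.
ps -eo pid,comm | grep -E 'ocssd|oclskd|cssdagent|cssdmonitor' || true
```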
Regards
Anju Garg
Thank you Anju!! The article could help anyone understand node evictions in RAC; it's such a simple one. Please keep up the great work!
Thanks Shankar for your time and feedback.
Your comments and suggestions are always welcome.
Regards
Anju Garg
Great article!
In my env:
11.2.0.4 RAC
Sending kill -SIGSTOP to all Oracle-related processes on node2 did not result in a member kill escalation.
clssgmmkLocalKillResults (lmon/dbwr/lgwr/ckpt/mmon/lock/rbal were killed successfully) just returned success within 30s, so there was no node eviction.
node2 is still there, with all its frozen Oracle-related processes.
Is something wrong on my side? I need your help!
(Forgive my poor English.)
Regards
long
Thanks Long for your time and feedback.
I have demonstrated the scenario on 11.2.0.3 RAC. Since your environment is 11.2.0.4 RAC, it might be some new functionality introduced by Oracle that prevented the node from being evicted. It could be a case of reboot-less node fencing (http://oracleinaction.com/11g-r2-rac-reboot-less-node-fencing/).
regards
Anju Garg