If Oracle Clusterware itself is working perfectly but one of the RAC instances is hanging, the database LMON process will request a member kill escalation and ask the CSS process to remove the hanging database instance from the cluster.
The following example demonstrates this in a cluster consisting of two nodes:
SQL> col host_name for a20
SQL> select instance_name, host_name from gv$instance;

INSTANCE_NAME    HOST_NAME
---------------- --------------------
orcl1            host01.example.com
orcl2            host02.example.com
- On the host02 server, stop the execution of all RDBMS processes (by sending the STOP signal).
– Find the current database processes:
[root@host02 ~]# ps -ef | grep ora_ | grep orcl2
oracle 6215 1 0 11:20 ? 00:00:00 ora_pmon_orcl2
oracle 6217 1 0 11:20 ? 00:00:00 ora_vktm_orcl2
oracle 6221 1 0 11:20 ? 00:00:00 ora_gen0_orcl2
oracle 6223 1 0 11:20 ? 00:00:00 ora_diag_orcl2
oracle 6225 1 0 11:20 ? 00:00:00 ora_dbrm_orcl2
oracle 6227 1 0 11:20 ? 00:00:00 ora_ping_orcl2
oracle 6229 1 0 11:20 ? 00:00:00 ora_psp0_orcl2
oracle 6231 1 0 11:20 ? 00:00:00 ora_acms_orcl2
oracle 6233 1 0 11:20 ? 00:00:00 ora_dia0_orcl2
oracle 6235 1 0 11:20 ? 00:00:00 ora_lmon_orcl2
oracle 6237 1 0 11:20 ? 00:00:02 ora_lmd0_orcl2
……
– Stop the execution of all RDBMS processes (by sending the STOP signal):
[root@host02 ~]# ps -ef | grep ora_ | grep orcl2 | awk '{print $2}' | while read PID
do
kill -STOP $PID
done
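The loop above can be generalized into a small helper that both freezes and resumes an instance's processes. The function name `sig_by_pattern` and the use of `pgrep -f` are my additions, not part of the original demo; sending CONT with the same helper lets you undo the hang by hand instead of waiting for the eviction and reboot:

```shell
#!/bin/sh
# Hypothetical helper (not from the original demo): send a signal to every
# process whose full command line matches an extended regex, via pgrep -f.
#   sig_by_pattern STOP 'ora_.*_orcl2'   # freeze all instance processes
#   sig_by_pattern CONT 'ora_.*_orcl2'   # resume them (undo the hang)
sig_by_pattern() {
  sig="$1"; pattern="$2"
  for pid in $(pgrep -f "$pattern"); do
    kill -"$sig" "$pid" 2>/dev/null
  done
}
```

Run it as root (or as the oracle software owner) on the node whose instance you want to freeze. Note that once LMON on the surviving node escalates the member kill, resuming the processes may come too late to prevent the eviction.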
– From the client's point of view, the Real Application Clusters database is now hanging on both nodes; no queries or DML are possible. Try to execute a query: it will hang.
SQL> select instance_name, host_name from gv$instance;
– no output, query hangs …
– Due to missing heartbeats, the healthy RAC instance on node host01 will remove the hanging RAC instance by requesting a member kill escalation.
– Check the database alert log file on host01: the LMS process issues a request to CSSD to reboot the node.
The node is evicted, and the instance is restarted after the node rejoins the cluster.
[root@host01 trace]# tailf /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/alert_orcl1.log
LMS0 (ospid: 31771) has detected no messaging activity from instance 2
LMS0 (ospid: 31771) issues an IMR to resolve the situation
Please check LMS0 trace file for more detail.
Fri Nov 09 11:15:04 2012
Remote instance kill is issued with system inc 30
Remote instance kill map (size 1) : 2
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2
Fri Nov 09 11:15:13 2012
IPC Send timeout detected. Sender: ospid 6308 [oracle@host01.example.com (PZ97)]
Receiver: inst 2 binc 429420846 ospid 6251
Waiting for instances to leave:
2
Reconfiguration started (old inc 4, new inc 8)
List of instances:
1 (myinst: 1)
….. Recovery of instance 2 starts
Global Resource Directory frozen
….
All grantable enqueues granted
Post SMON to start 1st pass IR
—-
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
Started redo scan
IPC Send timeout to 2.0 inc 4 for msg type 12 from opid 42
Completed redo scan
read 93 KB redo, 55 data blocks need recovery
Started redo application at
Thread 2: logseq 9, block 42
Recovery of Online Redo Log: Thread 2 Group 3 Seq 9 Reading mem 0
Mem# 0: +DATA/orcl/onlinelog/group_3.266.798828557
Mem# 1: +FRA/orcl/onlinelog/group_3.259.798828561
Completed redo application of 0.05MB
Completed instance recovery at
Thread 2: logseq 9, block 228, scn 1069404
52 data blocks read, 90 data blocks written, 93 redo k-bytes read
Thread 2 advanced to log sequence 10 (thread recovery)
Fri Nov 09 12:18:55 2012
….
– Check the Clusterware alert log on host01:
– The node is evicted and rebooted so that it can rejoin the cluster.
[grid@host01 host01]$ tailf /u01/app/11.2.0/grid/log/host01/alerthost01.log
[cssd(14493)]CRS-1607:Node host02 is being evicted in cluster incarnation 247848838; details at (:CSSNM00007:) in
/u01/app/11.2.0/grid/log/host01/cssd/ocssd.log.
2012-11-09 11:15:56.140
[ohasd(12412)]CRS-8011:reboot advisory message from host: host02, component: mo103324, with time stamp: L-2012-11-09-
11:15:56.580
[ohasd(12412)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from
CSS
2012-11-09 11:16:17.365
[cssd(14493)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 .
2012-11-09 11:16:17.400
[crsd(14820)]CRS-5504:Node down event reported for node 'host02'.
…… Node 2 joins the cluster
[cssd(14493)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02 .
2012-11-09 12:18:52.713
[crsd(14820)]CRS-2772:Server 'host02' has been assigned to pool 'Generic'.
2012-11-09 12:18:52.713
[crsd(14820)]CRS-2772:Server 'host02' has been assigned to pool 'ora.orcl'.
– After the node rejoins the cluster and the instance is restarted, re-execute the query; it succeeds:
SQL> conn sys/oracle@orcl as sysdba
col host_name for a20
select instance_name, host_name from gv$instance;
INSTANCE_NAME    HOST_NAME
---------------- --------------------
orcl1            host01.example.com
orcl2            host02.example.com
Good one.
–Jamsher
Nice article! I have also listed the top 4 reasons for node eviction:
http://www.dbas-oracle.com/2013/06/Top-4-Reasons-Node-Reboot-Node-Eviction-in-Real-Application-Cluster-RAC-Environment.html
Are there any other background processes involved during node evictions?
Thanks ,
Shankar
Hi Shankar,
The following processes are responsible for node eviction:
1. ocssd
. Missing network heartbeat
. Missing disk heartbeat
. After escalation of a member kill from a client (e.g. LMON)
2. oclskd (Oracle clusterware kill daemon)
. Reboots a node based on requests from other nodes in the cluster
3. cssdagent and cssdmonitor
. node hang
. ocssd hang
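A quick way to see which of these daemons are alive on a node is a simple ps filter. This is a sketch: the process names are as they typically appear on Linux with 11.2 Grid Infrastructure, and on a machine without Clusterware the filter simply matches nothing:

```shell
#!/bin/sh
# Sketch: list the clusterware daemons that take part in eviction decisions.
# ocssd watches network/disk heartbeats and member kill requests,
# oclskd carries out kill requests from other nodes,
# cssdagent/cssdmonitor watch for node hangs and an ocssd hang.
ps -eo pid,comm | grep -E 'ocssd|oclskd|cssdagent|cssdmonitor' || true
```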
Regards
Anju Garg
Thank you Anju!! The article could help anyone understand node evictions in RAC; it's such a simple one. Please keep up the great work!
Thanks Shankar for your time and feedback.
Your comments and suggestions are always welcome.
Regards
Anju Garg
Great article!
In my env:
11.2.0.4 RAC
Sending kill -SIGSTOP to all Oracle-related processes on node2 did not result in a member kill escalation.
clssgmmkLocalKillResults (lmon/dbwr/lgwr/ckpt/mmon/lock/rbal were killed successfully) just returned success within 30s, so there was no node eviction.
node2 is still there, with all its frozen Oracle-related processes.
Is something wrong on my side? I need your help!
(Forgive my poor English.)
Regards
long
Thanks Long for your time and feedback.
I have demonstrated the scenario on 11.2.0.3 RAC. Since your environment is 11.2.0.4 RAC, it might be some new functionality introduced by Oracle that prevented the node from being evicted. It could be a case of reboot-less node fencing (http://oracleinaction.com/11g-r2-rac-reboot-less-node-fencing/).
regards
Anju Garg