Erman Arslan's Oracle Blog: ODA X4-2 , Oracle RAC 11gR2-- Node shutdown -- split brain and rebootless node fencing, ACFS readonly filesystem errors

Thursday, June 18, 2015

ODA X4-2 , Oracle RAC 11gR2-- Node shutdown -- split brain and rebootless node fencing, ACFS readonly filesystem errors

Here is an investigation for an unexpected ASM shutdown encountered in a 11gR2 RAC (11.2.0.4) ODA X4 environment.

The problem appeared, when I logged in to the server(ermannode2).. ASM was shutdown and all the filesystems residing on ACFS volumes were readonly.. This was clearly because the ACFS volumes were already taken offline.

To analyze this, we first set the GRID env and get the background dump dest location..

[grid@ermannode2 ~]$ . oraenvORACLE_SID = [grid] ? +ASM2
The Oracle base has been set to /u01/app/grid
[grid@ermannode2 ~]$ sqlplus "/as sysdba"

SQL> show parameter backg
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
background_core_dump string partial
background_dump_dest string /u01/app/grid/diag/asm/+asm/+ASM2/trace

Next, we checked the alert log located in background_dump_dest

[grid@ermannode2 ~]$ cd /u01/app/grid/diag/asm/+asm/+ASM2/trace/

Inside the alert_+ASM2.log:

NOTE: ASMB process exiting, either shutdown is in progress
NOTE: or foreground connected to ASMB was killed.
opidcl aborting process unknown ospid (47456) as a result of ORA-29709
Thu May 14 05:10:19 2015
PMON (ospid: 46782): terminating the instance due to error 481
Instance terminated by PMON, pid = 46782
Thu May 14 05:14:13 2015
NOTE: No asm libraries found in the system
* instance_number obtained from CSS = 2, checking for the existence of node 0...
* node 0 does not exist. instance_number = 2
NOTE: parameter asm_diskstring not allowed in ODA appliance; overriding asm_diskstring to "/dev/mapper/*D_*p*"
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
opidcl aborting process unknown ospid (47456) as a result of ORA-29709

Okay.. We had a process termination at 05:10, to analyze this further; checked the OS syslog for the same node.

May 14 05:06:34 ermannode2 kernel: ADVMK-00011: Mirror recovery for volume asm/vmrepo1-209 in diskgroup DATA completed.

May 14 05:10:15 ermannode2 kernel: ADVMK-00017: The ASM instance terminated unexpectedly. All ADVM volumes will be taken offline.

May 14 05:10:15 ermannode2 kernel: ACFSK-0029: Volume /dev/asm/vmtemp1-209 is now offline

May 14 05:10:15 ermannode2 kernel: ACFSK-0029: Volume /dev/asm/vmtemp2-209 is now offline

May 14 05:10:15 ermannode2 kernel: ACFSK-0029: Volume /dev/asm/vmrepo1-209 is now offline

May 14 05:10:15 ermannode2 kernel: ACFSK-0030: ASM Instance has Failed.

Hmm, we saw the pretty much the same things in OS logs, as ADVM and ACFS processes was writing their logs from their point of view, so they said "we see that, the ASM is terminated and we take offline the volumes on top of it.."

Okay. So there was an ASM termination, but why?
We proceeded with the traces and founnd the traces produced in the related interval..

[grid@ermannode2 trace]$ ls -lrt|grep "May 14" |grep 5:10
-rw-r----- 1 grid oinstall 10730 May 14 05:10 +ASM2_lmon_46798.trm
-rw-r----- 1 grid oinstall 52478 May 14 05:10 +ASM2_lmon_46798.trc
-rw-r----- 1 grid oinstall 602 May 14 05:10 +ASM2_diag_46792.trm
-rw-r----- 1 grid oinstall 5904 May 14 05:10 +ASM2_diag_46792.trc
-rw-r----- 1 grid oinstall 5863 May 14 05:10 +ASM2_dia0_46796_base_1.trm
-rw-r----- 1 grid oinstall 31510 May 14 05:10 +ASM2_dia0_46796_base_1.trc
-rw-r----- 1 grid oinstall 180 May 14 05:10 +ASM2_vmb0_49368.trm
-rw-r----- 1 grid oinstall 4235 May 14 05:10 +ASM2_vmb0_49368.trc
-rw-r----- 1 grid oinstall 289 May 14 05:10 +ASM2_ora_47456.trm
-rw-r----- 1 grid oinstall 6382 May 14 05:10 +ASM2_ora_47456.trc
-rw-r----- 1 grid oinstall 187 May 14 05:10 +ASM2_pmon_46782.trm
-rw-r----- 1 grid oinstall 9000 May 14 05:10 +ASM2_pmon_46782.trc

[grid@ermannode2 trace]$ ls -lrt|grep "May 14" |grep 5:09

-rw-r----- 1 grid oinstall 253 May 14 05:09 +ASM2_ora_66911.trm
-rw-r----- 1 grid oinstall 4128 May 14 05:09 +ASM2_ora_66911.trc
-rw-r----- 1 grid oinstall 70 May 14 05:09 +ASM2_ora_66966.trm
-rw-r----- 1 grid oinstall 1032 May 14 05:09 +ASM2_ora_66966.trc
-rw-r----- 1 grid oinstall 314 May 14 05:09 +ASM2_ora_67198.trm
-rw-r----- 1 grid oinstall 5160 May 14 05:09 +ASM2_ora_67198.trc
-rw-r----- 1 grid oinstall 253 May 14 05:09 +ASM2_ora_67264.trm
-rw-r----- 1 grid oinstall 4128 May 14 05:09 +ASM2_ora_67264.trc
-rw-r----- 1 grid oinstall 192 May 14 05:09 +ASM2_ora_69406.trm
-rw-r----- 1 grid oinstall 3096 May 14 05:09 +ASM2_ora_69406.trc

[grid@ermannode2 trace]$ ls -lrt|grep "May 14" |grep 5:08

-rw-r----- 1 grid oinstall 131 May 14 05:08 +ASM2_ora_57444.trm
-rw-r----- 1 grid oinstall 2064 May 14 05:08 +ASM2_ora_57444.trc
rw-r----- 1 grid oinstall 315 May 14 05:08 +ASM2_ora_57387.trm
-rw-r----- 1 grid oinstall 5160 May 14 05:08 +ASM2_ora_57387.trc
-rw-r----- 1 grid oinstall 442 May 14 05:08 +ASM2_ora_58899.trm
-rw-r----- 1 grid oinstall 7224 May 14 05:08 +ASM2_ora_58899.trc
-rw-r----- 1 grid oinstall 317 May 14 05:08 +ASM2_ora_58847.trm
-rw-r----- 1 grid oinstall 5160 May 14 05:08 +ASM2_ora_58847.trc
-rw-r----- 1 grid oinstall 253 May 14 05:08 +ASM2_ora_59044.trm
-rw-r----- 1 grid oinstall 4128 May 14 05:08 +ASM2_ora_59044.trc
-rw-r----- 1 grid oinstall 377 May 14 05:08 +ASM2_ora_58994.trm
-rw-r----- 1 grid oinstall 6192 May 14 05:08 +ASM2_ora_58994.trc
-rw-r----- 1 grid oinstall 378 May 14 05:08 +ASM2_ora_59778.trm
-rw-r----- 1 grid oinstall 6192 May 14 05:08 +ASM2_ora_59778.trc
-rw-r----- 1 grid oinstall 192 May 14 05:08 +ASM2_ora_61144.trm
-rw-r----- 1 grid oinstall 3096 May 14 05:08 +ASM2_ora_61144.trc
-rw-r----- 1 grid oinstall 192 May 14 05:08 +ASM2_ora_61079.trm
-rw-r----- 1 grid oinstall 3096 May 14 05:08 +ASM2_ora_61079.trc
-rw-r----- 1 grid oinstall 253 May 14 05:08 +ASM2_ora_61300.trm
-rw-r----- 1 grid oinstall 4128 May 14 05:08 +ASM2_ora_61300.trc
-rw-r----- 1 grid oinstall 316 May 14 05:08 +ASM2_ora_61244.trm
-rw-r----- 1 grid oinstall 5160 May 14 05:08 +ASM2_ora_61244.trc
-rw-r----- 1 grid oinstall 192 May 14 05:08 +ASM2_ora_63528.trm
-rw-r----- 1 grid oinstall 3096 May 14 05:08 +ASM2_ora_63528.trc
-rw-r----- 1 grid oinstall 379 May 14 05:08 +ASM2_ora_63476.trm
-rw-r----- 1 grid oinstall 6221 May 14 05:08 +ASM2_ora_63476.trc
-rw-r----- 1 grid oinstall 70 May 14 05:08 +ASM2_ora_63590.trm
-rw-r----- 1 grid oinstall 1032 May 14 05:08 +ASM2_ora_63590.trc

We were interested in background process traces, as the background processes recorded their activity like "PMON terminating the instance" in the alert log.

So, related traces should have been produced at 05:10 the exact time, when the problem happened..

It was 5:10, because it seems that there are only the _ora_ process traces produced at 5:08 and 5:09.

Okay , here are the traces on which we needed to focus..

[grid@ermannode2 trace]$ ls -lrt|grep "May 14" |grep 5:10
-rw-r----- 1 grid oinstall 10730 May 14 05:10 +ASM2_lmon_46798.trm
-rw-r----- 1 grid oinstall 52478 May 14 05:10 +ASM2_lmon_46798.trc
-rw-r----- 1 grid oinstall 602 May 14 05:10 +ASM2_diag_46792.trm
-rw-r----- 1 grid oinstall 5904 May 14 05:10 +ASM2_diag_46792.trc
-rw-r----- 1 grid oinstall 5863 May 14 05:10 +ASM2_dia0_46796_base_1.trm
-rw-r----- 1 grid oinstall 31510 May 14 05:10 +ASM2_dia0_46796_base_1.trc
-rw-r----- 1 grid oinstall 180 May 14 05:10 +ASM2_vmb0_49368.trm
-rw-r----- 1 grid oinstall 4235 May 14 05:10 +ASM2_vmb0_49368.trc
-rw-r----- 1 grid oinstall 289 May 14 05:10 +ASM2_ora_47456.trm
-rw-r----- 1 grid oinstall 6382 May 14 05:10 +ASM2_ora_47456.trc
-rw-r----- 1 grid oinstall 187 May 14 05:10 +ASM2_pmon_46782.trm
-rw-r----- 1 grid oinstall 9000 May 14 05:10 +ASM2_pmon_46782.trc

We copied the trace files with the user friendly names, as "vi" did not like them as the way they are.. (their names were starting with "+")

cp +ASM2_lmon_46798.trc erman_lmon
cp +ASM2_diag_46792.trc erman_diag
cp +ASM2_dia0_46796_base_1.trc erman_dia0
cp +ASM2_vmb0_49368.trc erman_vmb0
cp +ASM2_ora_47456.trc erman_ora
cp +ASM2_pmon_46782.trc erman_pmon

LMON trace:

LMON: Global Enqueue Service Monitor Process, monitors an Oracle RAC cluster to manage global resources

LMON maintains instance membership within Oracle RAC. The process detects instance transitions and performs reconfiguration of GES and GCS resources.

2015-05-14 05:06:26.881714 :
Reconfiguration complete

* domain 0 valid?: 1
* domain 1 valid?: 1
* domain 2 valid?: 1
* domain 3 valid?: 1
*** 2015-05-14 05:10:13.207
kjxggpoll: change db group poll time to 50 ms

DIAG trace:

I'm the voting node
Group reconfiguration cleanup

*** 2015-05-14 05:06:25.135
kjzdm_rac: disabling heartbeat during reconfig
confirm->incar_num 1, rcfgctx->prop_incar 8
Send my bitmap to master 0
kjzgmappropose : incar 8, newmap -
3000000000000000000000000000000000000000000000000000000000000000
kjzgmappropose : rc from psnd : 30
Rcfg confirmation is received from master 0
I agree with the rcfg confirmation

*** 2015-05-14 05:06:25.347
Reconfiguration completes [incarn=8]
*** 2015-05-14 05:06:25.347
kjzdm_rac: enabling heartbeat after reconfig

*** 2015-05-14 05:10:13.152

Group state in transition

DIA trace:

There was a hang analyze there..
Chains most likely to have caused the hang:
[a] Chain 1 Signature: <not in a wait>

Chain 1 Signature Hash: 0x673a0128

[b] Chain 2 Signature: <not in a wait>

Chain 2 Signature Hash: 0x673a0128

[c] Chain 3 Signature: <not in a wait>

Chain 3 Signature Hash: 0x673a0128

ORA trace file:

2015-05-14 05:10:16.852: [GIPCXCPT] gipcInternalSendSync: failed sync request, ret gipcretConnectionLost (12)

*** 2015-05-14 05:10:16.893
*** SESSION ID:(469.1) 2015-05-14 05:10:16.893
*** SERVICE NAME:() 2015-05-14 05:10:16.893
*** MODULE NAME:() 2015-05-14 05:10:16.893
*** ACTION NAME:() 2015-05-14 05:10:16.893

2015-05-14 05:10:16.893: [GIPCXCPT] gipcSendSyncF [clsssServerRPC : clsss.c : 6289]: EXCEPTION[ ret gipcretConnectionLost (12) ] failed to send on endp 0xb7d3b50 [0000000000000012] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=a7b5736a-41a7dc91-47456))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_ermannode2_)(GIPCID=41a7dc91-a7b5736a-18544))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18544, readyRef (nil), ready 0, wobj 0xb720f60, sendp (nil)flags 0x3861e, usrFlags 0x30010 }, addr 0000000000000000, buf 0x7fa175ffbc68, len 96, flags 0x8000000

2015-05-14 05:10:16.893: [ CSSCLNT]clsssServerRPC: send failed with err 12, msg type 5

2015-05-14 05:10:16.893: [ CSSCLNT]clssgsGroupGetList: global group OSM_ALL RPC failed status 3 respmsgsz 0

kgxgnqgrp: error: return status 3 (1979471616 )

Group services Error [Communication error] @ kfms.c:4582015-05-14 05:10:16.916: [GIPCXCPT] gipcInternalSend: connection not valid for send operation endp 0xb7d52f0 [0000000000000018] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=a7b5736a-41a7dc91-47456))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_ermannode2_)(GIPCID=41a7dc91-a7b5736a-18544))', numPend 0, numReady 0, numDone 10, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18544, readyRef (nil), ready 0, wobj 0xb71e860, sendp 0xb7d6010flags 0x8024a71e, usrFlags 0x34000 }, ret gipcretConnectionLost (12)2015-05-14 05:10:16.916: [GIPCXCPT] gipcSendF [gipcmodMuxSendCLSC : gipcmodMux.c : 1437]: EXCEPTION[ ret gipcretConnectionLost (12) ] failed to send on endp 0xb7d52f0 [0000000000000018] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=a7b5736a-41a7dc91-47456))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_ermannode2_)(GIPCID=41a7dc91-a7b5736a-18544))', numPend 0, numReady 0, numDone 10, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18544, readyRef (nil), ready 0, wobj 0xb71e860, sendp 0xb7d6010flags 0x8024a71e, usrFlags 0x34000 }, addr 0000000000000000, buf 0x7fa175ff0c38, len 88, cookie 0xb7ebb20, flags 0x40002015-05-14 05:10:16.939: [GIPCXCPT] gipcSendSyncF [clsssServerRPC : clsss.c : 6289]: EXCEPTION[ ret gipcretConnectionLost (12) ] failed to send on endp 0xb7ed3a0 [000000000000005d] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=f248d19c-41a7dc91-47456))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_ermannode2_)(GIPCID=41a7dc91-f248d19c-18544))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 1, objFlags 0x0, pidPeer 18544, readyRef (nil), ready 0, wobj 0xb7e6000, sendp (nil)flags 0x3861e, usrFlags 0x20010 }, addr 0000000000000000, buf 0x7fa175ff0c38, len 88, flags 0x8000000

2015-05-14 05:10:16.939: [ CSSCLNT]clsssServerRPC: send failed with err 12, msg type 9
2015-05-14 05:10:16.939: [ CSSCLNT]clssgsShareExit: RPC failure, rc 3
kgxgnsdr: warning: return status 3 (448 )

Hmm, there were connection failures reported here.. (ipcretConnectionLost)

PMON trace:

kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----

ksedsts()+465<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+63<-ksuitm()+5570<-ksuxfd_check_dead()+461<-ksuxfd()+135<-ksuxfl()+425<-ksucln_dpc_main()+197<-ksucln_dpc()+29<-ksucln()+1250<-ksbrdp()+1045<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250

<-ssthrdmain()+265<-main()+201<-__libc_start_main()+244

----- End of Abridged Call Stack Trace -----

PMON (ospid: 46782): terminating the instance due to error 481

Pretty much the same error message as the alert log, but there was also a call stack in here.

VMB0 trace:

2015-05-14 05:10:16.713: [GIPCXCPT] gipcInternalSendSync: failed sync request, ret gipcretConnectionLost (12)

2015-05-14 05:10:16.834: [GIPCXCPT] gipcSendSyncF [clsssServerRPC : clsss.c : 6289]: EXCEPTION[ ret gipcretConnectionLost (12) ] failed to send on endp 0xc454330 [000000000000004d] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=2424e517-e6f3b3b1-49368))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_ermannode2_)(GIPCID=e6f3b3b1-2424e517-18544))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18544, readyRef (nil), ready 0, wobj 0xc45c110, sendp (nil)flags 0x3861e, usrFlags 0x20010 }, addr 0000000000000000, buf 0x7f4a37db3a28, len 88, flags 0x8000000

2015-05-14 05:10:16.834: [ CSSCLNT]clsssServerRPC: send failed with err 12, msg type 9

2015-05-14 05:10:16.834: [ CSSCLNT]clssgsShareExit: RPC failure, rc 3

kgxgnsdr: warning: return status 3 (0 )

kfvsmMain: RAC mode - Unjoining OFS_CSS group

2015-05-14 05:10:16.842: [GIPCXCPT] gipcInternalSend: connection not valid for send operation endp 0xc444490 [0000000000000018] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=017e8e81-e6f3b3b1-49368))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_ermannode2_)(GIPCID=e6f3b3b1-017e8e81-18544))', numPend 0, numReady 0, numDone 10, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18544, readyRef (nil), ready 0, wobj 0xc409ae0, sendp 0xc4451b0flags 0x8024a71e, usrFlags 0x34000 }, ret gipcretConnectionLost (12)

2015-05-14 05:10:16.842: [GIPCXCPT] gipcSendF [gipcmodMuxSendCLSC : gipcmodMux.c : 1437]: EXCEPTION[ ret gipcretConnectionLost (12) ] failed to send on endp 0xc444490 [0000000000000018] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=017e8e81-e6f3b3b1-49368))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_ermannode2_)(GIPCID=e6f3b3b1-017e8e81-18544))', numPend 0, numReady 0, numDone 10, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18544, readyRef (nil), ready 0, wobj 0xc409ae0, sendp 0xc4451b0flags 0x8024a71e, usrFlags 0x34000 }, addr 0000000000000000, buf 0x7f4a37db38f8, len 87, cookie 0xc462340, flags 0x4000

2015-05-14 05:10:16.842: [GIPCXCPT] gipcSendSyncF [clsssServerRPC : clsss.c : 6289]: EXCEPTION[ ret gipcretConnectionLost (12) ] failed to send on endp 0xc4552f0 [0000000000000041] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=b5b2f198-e6f3b3b1-49368))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_ermannode2_)(GIPCID=e6f3b3b1-b5b2f198-18544))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 1, objFlags 0x0, pidPeer 18544, readyRef (nil), ready 0, wobj 0xc45bd50, sendp (nil)flags 0x3861e, usrFlags 0x20010 }, addr 0000000000000000, buf 0x7f4a37db38f8, len 87, flags 0x8000000

2015-05-14 05:10:16.842: [ CSSCLNT]clsssServerRPC: send failed with err 12, msg type 9

2015-05-14 05:10:16.842: [ CSSCLNT]clssgsShareExit: RPC failure, rc 3

kgxgnsdr: warning: return status 3 (49056762 )

Reporting the same gipcretConnectionLost...

ASMB trace:

[grid@ermannode2 trace]$ ls -lrt|grep "May 14"|grep asmb

-rw-r----- 1 grid oinstall 70 May 14 05:14 +ASM2_asmb_15803.trm
-rw-r----- 1 grid oinstall 1040 May 14 05:14 +ASM2_asmb_15803.trc

Trace file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_asmb_15803.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production

With the Real Application Clusters and Automatic Storage Management options

ORACLE_HOME = /u01/app/11.2.0.4/grid

System name: Linux
Node name: ermannode2
Release: 2.6.39-400.126.1.el5uek
Version: #1 SMP Fri Sep 20 10:54:38 PDT 2013
Machine: x86_64
VM name: Xen Version: 4.1 (HVM)
Instance name: +ASM2
Redo thread mounted by this instance: 0 <none>
Oracle process number: 0
Unix process pid: 15803, image: oracle@ermannode2 (ASMB)

*** 2015-05-14 05:14:30.298

kgfmReadOak: max_disk_count is 500
kgfmReadOak: appliance_name is ODA
kgfmReadOak: diskstring is /dev/mapper/*D_*p*
kgfmReadOak: file_version is 2
kgfmReadOak: oda_version is 2
kgfmReadOak: jbod_count is 1
kgfmReadOak: jbod_slot_count is 24
kgfmReadOak: max_missing is 0
kgfmReadOak: min_partners is 2
kgfmReadOak: asm_compatibility is 11.2.0.4
kgfmReadOak: rdbms_compatibility is 11.2.0.2
kgfmReadOak: _asm_hbeatiowait is 100

--> It seems at this point the the ermannode1 was shutdown

cssd(18544)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ermannode2 .
2015-05-14 05:00:33.347:
[crsd(47421)]CRS-5504:Node down event reported for node 'ermannode1'.
2015-05-14 05:00:33.348:
[crsd(47421)]CRS-2773:Server 'ermannode1' has been removed from pool 'Free'.
2015-05-14 05:06:02.630:

--> It seems at this point the the ermannode1 was started

[cssd(18544)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ermannode1 ermannode2 .
2015-05-14 05:06:45.131:
[crsd(47421)]CRS-2772:Server 'ermannode1' has been assigned to pool 'Free'.
2015-05-14 05:09:58.562:

--> It seems at this point the the ermannode2 was shutdown again. But it seems it could not shutdown.. This shutdown operation must have been in hang state, as we didnt see any node down events.. On the other hand, we saw communication errors.

[cssd(18544)]CRS-1612:Network communication with node ermannode1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.510 seconds

2015-05-14 05:10:06.564:

[cssd(18544)]CRS-1611:Network communication with node ermannode1 (1) missing for 75% of timeout interval. Removal of this node from cluster in 6.510 seconds

2015-05-14 05:10:10.565:

[cssd(18544)]CRS-1610:Network communication with node ermannode1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.510 seconds

2015-05-14 05:10:13.076:

[cssd(18544)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/app/11.2.0.4/grid/log/ermannode2/cssd/ocssd.log.

2015-05-14 05:10:13.076:

[cssd(18544)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0.4/grid/log/ermannode2/cssd/ocssd.log

2015-05-14 05:10:13.129:

[cssd(18544)]CRS-1652:Starting clean up of CRSD resources.

[cssd(18544)]CRS-1652:Starting clean up of CRSD resources.

2015-05-14 05:10:14.653:

[/u01/app/11.2.0.4/grid/bin/oraagent.bin(48050)]CRS-5016:Process "/u01/app/11.2.0.4/grid/opmn/bin/onsctli" spawned by agent "/u01/app/11.2.0.4/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0.4/grid/log/ermannode2/agent/crsd/oraagent_grid/oraagent_grid.log"

2015-05-14 05:10:15.428:

[/u01/app/11.2.0.4/grid/bin/oraagent.bin(48050)]CRS-5016:Process "/u01/app/11.2.0.4/grid/bin/lsnrctl" spawned by agent "/u01/app/11.2.0.4/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0.4/grid/log/ermannode2/agent/crsd/oraagent_grid/oraagent_grid.log"

2015-05-14 05:10:15.495:

[/u01/app/11.2.0.4/grid/bin/oraagent.bin(48050)]CRS-5016:Process "/u01/app/11.2.0.4/grid/bin/lsnrctl" spawned by agent "/u01/app/11.2.0.4/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0.4/grid/log/ermannode2/agent/crsd/oraagent_grid/oraagent_grid.log"

2015-05-14 05:10:15.626:

[cssd(18544)]CRS-1654:Clean up of CRSD resources finished successfully.

2015-05-14 05:10:15.627:

[cssd(18544)]CRS-1655:CSSD on node ermannode2 detected a problem and started to shutdown.

2015-05-14 05:10:15.853:

[cssd(18544)]CRS-1660:The CSS daemon shutdown has completed

2015-05-14 05:10:15.824:[/u01/app/11.2.0.4/grid/bin/oraagent.bin(48050)]CRS-5822:Agent '/u01/app/11.2.0.4/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:1:12} in /u01/app/11.2.0.4/grid/log/ermannode2/agent/crsd/oraagent_grid/oraagent_grid.log.

2015-05-14 05:10:15.965:[/u01/app/11.2.0.4/grid/bin/orarootagent.bin(48054)]CRS-5822:Agent '/u01/app/11.2.0.4/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:3:15903} in /u01/app/11.2.0.4/grid/log/ermannode2/agent/crsd/orarootagent_root/orarootagent_root.log.

2015-05-14 05:10:16.181:
[ohasd(13719)]CRS-2765:Resource 'ora.crsd' has failed on server 'ermannode2'.

2015-05-14 05:10:17.000:

[ohasd(13719)]CRS-2765:Resource 'ora.ctssd' has failed on server 'ermannode2'.

2015-05-14 05:10:17.011:

[ohasd(13719)]CRS-2765:Resource 'ora.evmd' has failed on server 'ermannode2'.2015-05-14 05:10:17.039:

[ohasd(13719)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'ermannode2'.

2015-05-14 05:10:17.604

[ohasd(13719)]CRS-2765:Resource 'ora.cssd' has failed on server 'ermannode2'.

2015-05-14 05:10:19.453:

[ctssd(71272)]CRS-2402:The Cluster Time Synchronization Service aborted on host ermannode2. Details at (:ctss_css_init1:) in /u01/app/11.2.0.4/grid/log/ermannode2/ctssd/octssd.log.

[ohasd(13719)]CRS-2765:Resource 'ora.asm' has failed on server 'ermannode2'.
2015-05-14 05:10:20.478:
[crsd(71274)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in /u01/app/11.2.0.4/grid/log/d
aroravmsrv2/crsd/crsd.log.
2015-05-14 05:10:20.569:
[ohasd(13719)]CRS-2878:Failed to restart resource 'ora.ctssd'

--> Restart attempts were failing..

2015-05-14 05:10:20.591:
[ohasd(13719)]CRS-2769:Unable to failover resource 'ora.ctssd'.
2015-05-14 05:10:21.576:
[ohasd(13719)]CRS-2765:Resource 'ora.crsd' has failed on server 'ermannode2'.
2015-05-14 05:10:21.651:

[cssd(71367)]CRS-1713:CSSD daemon is started in clustered mode 2015-05-14 05:10:36.703:

[ohasd(13719)]CRS-2765:Resource 'ora.cluster_interconnect.haip' has failed on server 'ermannode2'.

2015-05-14 05:10:37.867: [cssd(71367)]CRS-1707:Lease acquisition for node ermannode2 number 2 completed
2015-05-14 05:10:39.119:
[cssd(71367)]CRS-1621:The IPMI configuration data for this node stored in the Oracle registry is incomplete; details at (:CSSNK00002:) in /u01/app/11.2.0.4/grid/log/ermannode2/cssd/ocssd.log

2015-05-14 05:10:39.119:
[cssd(71367)]CRS-1617:The information required to do node kill for node ermannode2 is incomplete; details at (:CSSNM00004:) in /u01/app/11.2.0.4/grid/log/ermannode2/cssd/ocssd.log

2015-05-14 05:10:39.137:

[cssd(71367)]CRS-1605:CSSD voting file is online: /dev/mapper/HDD_E0_S00_1461220864p1; details in /u01/app/11.2.0.4/grid/log/ermannode2/cssd/ocssd.log.

2015-05-14 05:10:39.155:

[cssd(71367)]CRS-1605:CSSD voting file is online: /dev/mapper/HDD_E0_S01_1461215112p1; details in /u01/app/11.2.0.4/grid/log/ermannode2/cssd/ocssd.log.

2015-05-14 05:10:39.179:

[cssd(71367)]CRS-1605:CSSD voting file is online: /dev/mapper/HDD_E0_S02_1461225600p1; details in /u01/app/11.2.0.4/grid/log/ermannode2/cssd/ocssd.log.

2015-05-14 05:14:01.325:

[cssd(71367)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ermannode2 .

2015-05-14 05:14:03.963:
[ctssd(14047)]CRS-2407:The new Cluster Time Synchronization Service reference node is host ermannode2.
2015-05-14 05:14:03.963:
[ctssd(14047)]CRS-2401:The Cluster Time Synchronization Service started on host ermannode2.

2015-05-14 05:14:05.743:
[ohasd(13719)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE

Okay, there was a connection problem for sure.

Here it says;

"This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/app/11.2.0.4/grid/log/ermannode2/cssd/ocssd.log."

Next checked the ocssd.log..

OCCSD.log:

2015-05-14 05:10:13.076: [ CSSD][4096358720](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, ermannode2, is smaller than cohort of 1 nodes led by node 1, ermannode1, based on map type 2

2015-05-14 05:10:13.076: [ CSSD][4143662816]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(2) event(2), incarn 11, mbrc 3, to member 0, events 0x20, state 0x0

2015-05-14 05:10:13.076: [ CSSD][4096358720]###################################

2015-05-14 05:10:13.076: [ CSSD][4096358720]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread

2015-05-14 05:10:13.076: [ CSSD][4096358720]###################################

2015-05-14 05:10:13.076: [ CSSD][4096358720](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally

2015-05-14 05:10:13.076: [ CSSD][4143662816]clssgmQueueGrockEvent: groupName(CRF-) count(2) master(0) event(2), incarn 235276, mbrc 2, to member 0, events 0x38, state 0x0

2015-05-14 05:10:13.076: [ CSSD][4096358720]

location type point (? means dubious value)

2015-05-14 05:10:13.084: [ CSSD][4096358720]-------------------- -------- -------------------- ----------------------------

2015-05-14 05:10:13.107: [ CSSD][4096358720]clssscExit()+745 call kgdsdst() 000000000 ? 000000000 ?

2015-05-14 05:10:13.107: [ CSSD][4096358720] 7F65F4291DD0 ? 7F65F4291EA8 ?

2015-05-14 05:10:13.107: [ CSSD][4096358720] 7F65F4296950 ? 000000003 ?

2015-05-14 05:10:13.107: [ CSSD][4096358720]clssnmCheckDskInfo( call clssscExit() 7F65F05413F0 ? 000000002 ?

2015-05-14 05:10:13.107: [ CSSD][4096358720])+2001 7F65F4291DD0 ? 7F65F4291EA8 ?

2015-05-14 05:10:13.107: [ CSSD][4096358720] 7F65F4296950 ? 000000003 ?

2015-05-14 05:10:13.107: [ CSSD][4096358720]clssnmDoSyncUpdate( call clssnmCheckDskInfo( 7F65F05413F0 ? 7F65F00A5D50 ?

2015-05-14 05:10:13.107: [ CSSD][4096358720])+4692 ) 000000000 ? 7F65F4291EA8 ?

2015-05-14 05:10:13.107: [ CSSD][4096358720] 7F65F4296950 ? 000000003 ?

2015-05-14 05:10:13.107: [ CSSD][4096358720]clssnmRcfgMgrThread call clssnmDoSyncUpdate( 7F65F05413F0 ? 7F65F00A5D50 ?

2015-05-14 05:10:13.108: [ CSSD][4096358720]()+2735 ) 000000000 ? 000000000 ?

2015-05-14 05:10:13.108: [ CSSD][4096358720] 7F65F4297050 ? 000000003 ?

2015-05-14 05:10:13.108: [ CSSD][4096358720]clssscthrdmain()+25 call clssnmRcfgMgrThread 7F65F05413F0 ? 7F65F00A5D50 ?

2015-05-14 05:10:13.108: [ CSSD][4096358720]3 () 000000000 ? 000000000 ?

2015-05-14 05:10:13.108: [ CSSD][4096358720] 7F65F4297050 ? 000000003 ?

Hmm, a split brain condition.. ( Aborting local node to avoid splitbrain)

It was the cause definetly..

Next we took a look at cssd logs as a final resort.

LETS LOOKED WHAT WE HAVE IN CSSD LOG DIRECTORY:

/u01/app/11.2.0.4/grid/log/ermannode2/cssd

[root@ermannode2 cssd]# ls -al

drwxr-x--- 2 grid oinstall 4096 May 14 18:06 .
drwxr-xr-t 25 root oinstall 4096 May 14 18:02 ..
-rw------- 1 grid oinstall 74452992 May 14 05:10 core.ocssd.bin.18544
-rw-rw-r-- 1 grid oinstall 43718 May 14 05:10 cssdOUT.log
-rw-r--r-- 1 grid oinstall 52767689 May 12 17:54 ocssd.l01
-rw-r--r-- 1 grid oinstall 52770386 May 10 21:57 ocssd.l02
-rw-r--r-- 1 grid oinstall 52770918 May 8 21:42 ocssd.l03
-rw-r--r-- 1 grid oinstall 52770966 May 6 21:23 ocssd.l04
-rw-r--r-- 1 grid oinstall 52771320 May 4 21:05 ocssd.l05
-rw-r--r-- 1 grid oinstall 52771232 May 2 20:48 ocssd.l06
-rw-r--r-- 1 grid oinstall 52773753 Apr 30 20:28 ocssd.l07
-rw-r--r-- 1 grid oinstall 52784867 Apr 28 19:57 ocssd.l08
-rw-r--r-- 1 grid oinstall 52724293 Apr 26 14:30 ocssd.l09
-rw-r--r-- 1 grid oinstall 52850102 Apr 23 18:34 ocssd.l10
-rw-r--r-- 1 grid oinstall 46007706 May 14 18:07 ocssd.log

SO WE HAD A CORE DUMP,

We have analyzed the core dump using gdb..

[root@ermannode2 cssd]# gdb /u01/app/11.2.0.4/grid/bin/ocssd.bin /u01/app/11.2.0.4/grid/log/ermannode2/cssd/core.ocssd.bin.18544

GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-45.el5)
Copyright (C) 2009 Free Software Foundation, Inc.

License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.

There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /u01/app/11.2.0.4/grid/bin/ocssd.bin...(no debugging symbols found)...done.

[New Thread 21383]
[New Thread 44951]
[New Thread 44947]
[New Thread 44943]
[New Thread 44949]
[New Thread 44940]
[New Thread 44944]
[New Thread 40607]
[New Thread 21372]
[New Thread 40605]
[New Thread 39128]
[New Thread 18544]
[New Thread 44946]
[New Thread 44942]
[New Thread 44941]
[New Thread 44939]
[New Thread 21369]
[New Thread 44938]
[New Thread 44937]
[New Thread 44945]
[New Thread 21238]

Reading symbols from /u01/app/11.2.0.4/grid/lib/libhasgen11.so...(no debugging symbols found)...done.
Loaded symbols for /u01/app/11.2.0.4/grid/lib/libhasgen11.so
Reading symbols from /u01/app/11.2.0.4/grid/lib/libocr11.so...(no debugging symbols found)...done.
Loaded symbols for /u01/app/11.2.0.4/grid/lib/libocr11.so
Reading symbols from /u01/app/11.2.0.4/grid/lib/libocrb11.so...(no debugging symbols found)...done.
Loaded symbols for /u01/app/11.2.0.4/grid/lib/libocrb11.so
Reading symbols from /u01/app/11.2.0.4/grid/lib/libocrutl11.so...(no debugging symbols found)...done.
Loaded symbols for /u01/app/11.2.0.4/grid/lib/libocrutl11.so
Reading symbols from /u01/app/11.2.0.4/grid/lib/libclntsh.so.11.1...(no debugging symbols found)...done.

Loaded symbols for /u01/app/11.2.0.4/grid/lib/libclntsh.so.11.1
Reading symbols from /u01/app/11.2.0.4/grid/lib/libskgxn2.so...(no debugging symbols found)...done.
Loaded symbols for /u01/app/11.2.0.4/grid/lib/libskgxn2.so
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.

[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libnsl.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /u01/app/11.2.0.4/grid/lib/libasmclntsh11.so...(no debugging symbols found)...done.
Loaded symbols for /u01/app/11.2.0.4/grid/lib/libasmclntsh11.so
Reading symbols from /u01/app/11.2.0.4/grid/lib/libcell11.so...(no debugging symbols found)...done.
Loaded symbols for /u01/app/11.2.0.4/grid/lib/libcell11.so
Reading symbols from /u01/app/11.2.0.4/grid/lib/libskgxp11.so...(no debugging symbols found)...done.
Loaded symbols for /u01/app/11.2.0.4/grid/lib/libskgxp11.so
Reading symbols from /u01/app/11.2.0.4/grid/lib/libnnz11.so...(no debugging symbols found)...done.
Loaded symbols for /u01/app/11.2.0.4/grid/lib/libnnz11.so

Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done

Loaded symbols for /lib64/libc.so.6
Reading symbols from /usr/lib64/libaio.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libaio.so.1
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /u01/app/11.2.0.4/grid/lib/libnque11.so...(no debugging symbols found)...done.

Loaded symbols for /u01/app/11.2.0.4/grid/lib/libnque11.so
Core was generated by `/u01/app/11.2.0.4/grid/bin/ocssd.bin '.

Program terminated with signal 6, Aborted.
#0 0x000003479430265 in raise () from /lib64/libc.so.6

NOTHING MUCH THERE..

lets see the "cssdOUT.log";

05/14/15 05:10:15: CSSD handling signal 6
05/14/15 05:10:20: CSSD starting
The same thing.. signal 6.. Did not help, as it was a little internal.

ANYWAYS,

THE CAUSE WERE THE IN OCSSD.LOG, there was no need to go further:

"Aborting local node to avoid splitbrain." This was our thing..

The following documents may give you an idea what a split brain condition means in RAC.

What is Split Brain in Oracle Clusterware and Real Application Cluster (Doc ID 1425586.1)
Oracle Grid Infrastructure: Understanding Split-Brain Node Eviction (Doc ID 1546004.1)
11gR2 CSS Terminates/Node Eviction After Unplugging one Network Cable in Redundant Interconnect Environment (Doc ID 1481481.1)

Split brain meant that there were 2 or more distinct sets of nodes, or "cohorts",but there was no communication between the two cohorts.. That’s why the cluster was divided into two sub-clusters – one with node1 and the other with node2

With the cohort message below:

2015-05-14 05:10:13.076: [ CSSD][4096358720]clssnmCheckSplit: Node 1, ermannode1, is alive, DHB (1431569412, 4294384750) more than disk timeout of 27000 after the last NHB (1431569383, 4294354860)

2015-05-14 05:10:13.076: [ CSSD][4143662816]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(2) event(2), incarn 11, mbrc 3, to member 2, events 0x0, state 0x0
2015-05-14 05:10:13.076: [ CSSD][4096358720]clssnmCheckDskInfo: My cohort: 2
2015-05-14 05:10:13.076: [ CSSD][4096358720]clssnmCheckDskInfo: Surviving cohort: 1

The message above was saying;

"I can communicate with node 2.

From the Voting disk , I know that node 1 communicates with no one."

Normally, the cohorts with low number of member are terminated to prevent split brain condition.. But in here cohorts are equal in number 1=1.

On the other hand, oracle says that node 2 (local node) is lower based on map type 2;

Cohort of 1 nodes with leader 2, "ermannode2", is smaller than cohort of 1 nodes led by node 1, ermannode1, based on map type 2

This leads a CRS restart on local node (ermannode2) intermittently (no node reboot)

In node 1, we saw 2 shutdowns and startups in somewhere between 5:04 and 5:23;
This was a manual shutdown, which triggers the problem actually.. It got triggered 10 mins before the CSS restart in node 2..

May 14 04:59:58 ermannode1 shutdown[53393]: shutting down for system halt

May 14 04:59:59 ermannode1 init: Switching to runlevel: 0

After the shutdown of node 1 (ermannode1) , a startup was recognized in node 1..

May 14 05:04:39 ermannode1 syslogd 1.4.1: restart.
So node1 runs from 5:04 to 5:23..

There were also some warnings in that period but they can be ignored..

Such as :

WARNING: failed to online diskgroup resource ora.DATA.dg (unable to communicate with CRSD/OHASD)

In contrast to previous version of CRS, 11gR2 ASM does not depend on CRSD anymore; rather if OCR is located on ASM, ASM will start before CRSD while clusterware is coming up, and stops after CRSD while clusterware is going down. Due to the new startup sequence, ASM can not update diskgroup resource status in CRSD either during clusterware startup or shutdown, thus above message is logged.

SOLUTION
The message can be ignored, the status of diskgroup and diskgroup resource will be synchronize by oraagent once clusterware is fully up.
Then we saw another reboot at 5:23.. This was a manual reboot.
May 14 05:23:37 ermannode1 syslogd 1.4.1: restart

Checking the OCSS log file of node 1, we see ;

In OCSSD.LOG , Node 1 also reports,

2015-05-14 05:10:13.819: [ CSSD][2838202688]clssnmPollingThread: local diskTimeout set to 200000 ms, remote disk timeout set to 200000, impending reconfig status(0)

2015-05-14 05:10:13.824: [ CSSD][2849257792]clssnmvDHBValidateNcopy: node 2, ermannode2, has a disk HB, but no network HB, DHB has rcfg 325353703, wrtcnt, 84488343, LATS 4294385610, lastSeqNo 84488340, uniqueness 1429880553, timestamp 1431569413/1688028674

2015-05-14 05:10:15.825: [ CSSD][2849257792]clssnmvDHBValidateNcopy: node 2, ermannode2, has a disk HB, but no network HB, DHB has rcfg 325353703, wrtcnt, 84488344, LATS 4294387610, lastSeqNo 84488343, uniqueness 1429880553, timestamp 1431569415/1688031364

2015-05-14 05:10:15.825: [ CSSD][2849257792]clssnmReadDskHeartbeat:manual shutdown of nodename ermannode2, nodenum 2 epoch 1431569415 msec 1688031364

In node 2, we see the following;

2015-05-14 05:10:13.076: [ CSSD][4096358720]clssnmCheckSplit: Node 1, ermannode1, is alive, DHB (1431569412, 4294384750) more than disk timeout of 27000 after the last NHB (1431569383, 4294354860)

2015-05-14 05:10:13.076: [ CSSD][4096358720](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, ermannode2, is smaller than cohort
of 1 nodes led by node 1, ermannode1, based on map type 2

2015-05-14 05:10:13.076: [ CSSD][4143662816]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(2) event(2), incarn 11, mbrc 3, to member 0, events 0x20, state 0x0

2015-05-14 05:10:13.076: [ CSSD][4096358720]###################################
2015-05-14 05:10:13.076: [ CSSD][4096358720]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
2015-05-14 05:10:13.076: [ CSSD][4096358720]###################################
2015-05-14 05:10:13.076: [ CSSD][4096358720](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally

So , was it CSSD reboot, I mean was this split brain condition is a result rather than cause?

What was sending the Network HearthBeat?

Okay The heartbeat was maintained by background processes like LMON, LMD, LMS and LCK
Any of these processes experience IPC Send time out will incur communication reconfiguration and instance eviction to avoid split brain.

So we came back to this:

NOTE: client exited [47421]
Thu May 14 05:10:16 2015
NOTE: ASMB process exiting, either shutdown is in progress
NOTE: or foreground connected to ASMB was killed.
opidcl aborting process unknown ospid (47456) as a result of ORA-29709
Thu May 14 05:10:19 2015
PMON (ospid: 46782): terminating the instance due to error 481
Instance terminated by PMON, pid = 46782

From the OS Watcher logs; the process which gets aborted by opidcl is :

4 S grid 47456 1 0 80 0 - 321302 read_e Apr24 ? 00:00:07 oracle+ASM2_ocr (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

So, it was the OCR process..

Lets see the trace file of that process in the diag directory..
It has following log inside:

2015-05-14 05:10:16.852: [GIPCXCPT] gipcInternalSendSync: failed sync request, ret gipcretConnectionLost (12)

*** 2015-05-14 05:10:16.893
*** SESSION ID:(469.1) 2015-05-14 05:10:16.893
*** SERVICE NAME:() 2015-05-14 05:10:16.893
*** MODULE NAME:() 2015-05-14 05:10:16.893
*** ACTION NAME:() 2015-05-14 05:10:16.893
2015-05-14 05:10:16.893: [GIPCXCPT] gipcSendSyncF [clsssServerRPC : clsss.c : 6289]: EXCEPTION[ ret gipcretConnectionLost (12) ] failed to send on endp 0xb7d3b50 [0000000000000012] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=a7b5736a-41a7dc91-47456))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_ermannode2_)(GIPCID=41a7dc91-a7b5736a-18544))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18544, readyRef (nil), ready 0, wobj 0xb720f60, sendp (nil)flags 0x3861e, usrFlags 0x30010 }, addr 0000

000000000000, buf 0x7fa175ffbc68, len 96, flags 0x8000000
2015-05-14 05:10:16.893: [ CSSCLNT]clsssServerRPC: send failed with err 12, msg type 5
2015-05-14 05:10:16.893: [ CSSCLNT]clssgsGroupGetList: global group OSM_ALL RPC failed status 3 respmsgsz 0
kgxgnqgrp: error: return status 3 (1979471616 )
Group services Error [Communication error] @ kfms.c:4582015-05-14 05:10:16.916: [GIPCXCPT] gipcInternalSend: connection not valid for send operation endp 0xb7d52f0 [0000000000000018] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=a7b5736a-41a7dc91-47456))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_ermannode2_)(GIPCID=41a7dc91-a7b5736a-18544))', numPend 0, numReady 0, numDone 10, numDead 0, nu

mTransfer 0, objFlags 0x0, pidPeer 18544, readyRef (nil), ready 0, wobj 0xb71e860, sendp 0xb7d6010flags 0x8024a71e, usrFlags 0x34000 }, ret gipcretConnectionLost (12)2015-05-14 05:10:16.916: [GIPCXCPT] gipcSendF [gipcmodMuxSendCLSC : gipcmodMux.c : 1437]: EXCEPTION[ ret gipcretConnectionLost (12) ] failed to send on endp 0xb7d52f0 [0000000000000018] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=a7b5736a-41a7dc91-47456))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_ermannode2_)(GIPCID=41a7dc91-a7b5736a-18544))', numPend 0, numReady 0, numDone 10, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18544, readyRef (nil), ready 0, wobj 0xb71e860, sendp 0xb7d6010flags 0x8024a71e, usrFlags 0x34000

}, addr 0000000000000000, buf 0x7fa175ff0c38, len 88, cookie 0xb7ebb20, flags 0x4000

2015-05-14 05:10:16.939: [GIPCXCPT] gipcSendSyncF [clsssServerRPC : clsss.c : 6289]: EXCEPTION[ ret gipcretConnectionLost (12) ] failed to send on endp 0xb7ed3a0 [000000000000005d] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=f248d19c-41a7dc91-47456))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_ermannode2_)(GIPCID=41a7dc91-f248d19c-18544))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 1, objFlags 0x0, pidPeer 18544, readyRef (nil), ready 0, wobj 0xb7e6000, sendp (nil)flags 0x3861e, usrFlags 0x20010 }, addr 0000000000000000, buf 0x7fa175ff0c38, len 88, flags 0x8000000
2015-05-14 05:10:16.939: [ CSSCLNT]clsssServerRPC: send failed with err 12, msg type 9
2015-05-14 05:10:16.939: [ CSSCLNT]clssgsShareExit: RPC failure, rc 3
kgxgnsdr: warning: return status 3 (448 )

When the trace analyzed, we can say that the keywords were;
kgxgncin: CLSS init failed with status 3
gipcInternalSend: connection not valid for send operation endp
gipcretConnectionLost (12)

The same gepcreConnectionLost again..
Above messages indicate the communication between the nodes was not working..

In conclusion:

It seemed , during the poweroff, somethings got hang and that's why there was a disk heartbeat but no network heartbeat..Thus, node 2 got evicted.

Note that: In a 2 node cluster always the node with the higher node number will get evicted, regardless from which node the cable was unplugged.

So, Eviction made the ASM instance to terminate and ACFS volumes to be offline. Thus we have seen the read only filesystems we we logged in.

It was a rebootless fencing , which was introduced in 11.2.0.2.

Note that:

In rebootless fencing, the clusterware will attempt to stop GRID infrastructure gracefuly on the node.. So it is not directly reboots the node. It only reboots the node if the rebootless fencing fails.

At the bottom line,

Shutdown of RAC nodes are important.. Shutdown of RAC nodes should be done gracefully. An improper shutdown , or a hang in the shutdown operation may trigger a cluster failure if the cluster is a 2 node cluster, especially if it is on a Virtualized ODA platform.

Shutdown of RAC nodes should be done gracefully. An improper shutdown , or a hang in the shutdown operation in 1 node, may trigger a cluster failure if the cluster is a 2 node cluster.

Think about a scenario where you have ACFS filesystems on top of ASM and you place your Virtual Operating system ‘s filesystems on top of these ACFS filesystems. So once , the node is evicted ASM will be terminated and ACFS filesystems will be offline. Thus, Virtual OS on these ACFS file systems will see the disks read only. Even after the problem will be fixed, there should be a manual intervention for the OS to mount the disks read write.

Also , if the node that is shutting down is in hang situation, then the problem wont be fixed for a long time. It is just in a hang situation.. So this will make the second node to be evicted and create a downtime in the Oracle Services in cluster level .. A manual intervention will be needed at this point for shutting down the hang node.. Maybe a hard reboot..

Thursday, June 18, 2015

ODA X4-2 , Oracle RAC 11gR2-- Node shutdown -- split brain and rebootless node fencing, ACFS readonly filesystem errors

No comments :

Post a Comment