Just a week ago, I encountered this strange problem at one of my customer sites.
The system was an Exadata X3-2, hosting a production EBS database.
A flash card had to be replaced, so we shut down the machine properly for the replacement.
However, after the flash card replacement, the databases could not be started again.
Problem: RAC services cannot be started after rebooting the Exadata compute and cell nodes.
Analysis: When the issue was escalated to me, I went directly to the OHASD log file, as OHASD is the process responsible for starting the Oracle Clusterware processes.
Picture Ref: http://docs.oracle.com/cd/E14795_01/doc/rac.112/e10717/intro.htm
The flashing error was right there:
2016-05-12 13:29:15.815: [ CRSPE][1145149760] {0:0:178} ICE has queued an operation. Details: Operation [START of [ora.crsd 1 1] on [osrvdb01] : local=0, unplanned=00x2aaab427de30] cannot run cause it needs W lock for: WO for Placement Path RI:[ora.ctssd 1 1] server [osrvdb01] target states [ONLINE ], locked by op [START of [ora.asm 1 1] on [osrvdb01] : local=0, unplanned=00x2aaab00bee30].
Owner: CRS-2683: It is locked by 'SYSTEM' for command 'Resource Autostart : osrvdb01'
It was obvious that there was a cluster-level locking problem...
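At this point it helps to look at the overall stack health and the lower-stack (-init) resources directly; a minimal check, assuming the commands are run from the Grid home:
$GRID_HOME/bin/crsctl check crs              # overall health of the local stack (CRS, CSS, EVM)
$GRID_HOME/bin/crsctl stat res -t -init      # state of ora.cssd, ora.ctssd, ora.asm, ora.crsd, ...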
I kept checking the other logs and saw that the CSS daemon (OCSSD) was terminating abnormally.
Note that the OCR of this RAC was stored in the DBFS_DG diskgroup in ASM...
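The CSSD messages below come from ocssd.log; a quick way to pull the relevant voting-file lines out of it (a sketch, assuming the default 11.2 log location under the Grid home):
# 11.2 CSSD log: $GRID_HOME/log/<node_name>/cssd/ocssd.log
grep -iE "voting|clssnmv" $GRID_HOME/log/osrvdb01/cssd/ocssd.log | tail -50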
2016-05-12 14:54:02.453: [ CSSD][1141094720]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2016-05-12 14:54:02.453: [ CSSD][1141094720]clssnmCompleteVFDiscovery: Completing voting file discovery
2016-05-12 14:54:02.453: [ CSSD][1141094720]clssnmvDiskStateChange: state from discovered to pending disk o/192.168.10.3/DBFS_DG_CD_03_osrvcel01
2016-05-12 14:54:02.453: [ CSSD][1141094720]clssnmvDiskStateChange: state from pending to configured disk o/192.168.10.3/DBFS_DG_CD_03_osrvcel01
2016-05-12 14:54:02.453: [ CSSD][1141094720]clssnmvDiskStateChange: state from discovered to deconfigured disk o/192.168.10.4/DBFS_DG_CD_02_osrvcel02
2016-05-12 14:54:02.927: [ CLSF][478835504]Allocated CLSF context
2016-05-12 14:54:02.929: [ SKGFD][478835504]Handle 0x2aaaac668b60 from lib :OSS:: for disk :o/192.168.10.3/DBFS_DG_CD_03_osrvcel01:
2016-05-12 14:54:02.929: [ CLSF][478835504]Opened hdl:0x2aaaac5af9d0 for dev:o/192.168.10.3/DBFS_DG_CD_03_osrvcel01:
2016-05-12 14:54:02.929: [ CSSD][478835504]ASSERT clssnml.c 453
2016-05-12 14:54:02.929: [ CSSD][478835504]clssnmlgetleasehdls: Do not have sufficient voting files, found 1 of 2 configured files, needed at least 2
2016-05-12 14:54:02.929: [ CSSD][478835504]###################################
2016-05-12 14:54:02.929: [ CSSD][478835504]clssscExit: CSSD aborting from thread Main
2016-05-12 14:54:02.929: [ CSSD][478835504]###################################
2016-05-12 14:54:02.929: [ CSSD][478835504](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
From this perspective, there was a Voting Disk problem.
Note that: Voting disks manage information about node membership. Each voting disk must be accessible by all nodes in the cluster for nodes to be members of the cluster.
As seen in the logs above, OCSSD found 2 voting disks but then marked one of them as deconfigured.
Then it reported: "clssnmlgetleasehdls: Do not have sufficient voting files, found 1 of 2 configured files, needed at least 2".
There could be a problem with OCSSD itself, as it did not like the voting disk DBFS_DG_CD_02_osrvcel02 (which was physically healthy; we checked it with cellcli, and the celldisk, physicaldisk and griddisk were all fine for that disk).
So OCSSD discovered that disk, then deconfigured it, and since it could not continue with only 1 voting disk, it aborted.
Note that, in 11gR2, ocssd.bin discovers the voting disks using the settings in the GPnP profile; if not enough voting disks can be identified, ocssd.bin aborts itself.
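Two checks are useful at this point: the disk status as seen from the cell (which is what we verified with cellcli) and the ASM discovery string that ocssd.bin reads from the GPnP profile. A hedged sketch; the cellcli commands are run on the cell node (osrvcel02 here), and the grep pattern is just one way to extract the attribute:
# On the cell node: status of the suspect grid disk, plus celldisk/physicaldisk health
cellcli -e list griddisk DBFS_DG_CD_02_osrvcel02 attributes name,status,asmmodestatus,asmdeactivationoutcome
cellcli -e list celldisk
cellcli -e list physicaldisk

# On a compute node: the discovery string ocssd.bin uses for voting file discovery
$GRID_HOME/bin/gpnptool get 2>/dev/null | grep -o 'DiscoveryString="[^"]*"'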
ASM, on the other hand, started and then terminated due to ORA-600 errors, reported in its alert log as follows:
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_24013.trc (incident=433951):
ORA-00600: internal error code, arguments: [kfdvfGetCurrent_baddsk], [], [], [], [], [], [], [], [], [], [], []
Explanation of the causes of this ORA-600 bug:
1) CSS is not able to recognize that the disks are online and re-establish that the majority is met for the voting file count. Solution: reboot. (However, I rebooted the nodes one by one, as suggested, and it did not solve the problem.)
2) Bug 12330287 - upgrade to 12.1.
CSS really could not recognize the disks; this was reported in its logs as I mentioned earlier, so we were on the right track.
At this point, if we summarize what we have:
-> crsd cannot be started
-> cssd cannot see all the voting disks
-> ASM was crashing because of cssd, as it could not see the voting disks properly.
Decision about the possible causes:
- A race condition, which might be caused by an improper shutdown of the nodes.
- Bug 14619103 "Operation [START of xxx] ... has been replaced with [START of yyy]" chooses incorrect resource (a quick way to check whether a fix for this bug is installed is sketched below).
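To check whether a fix for Bug 14619103 is already present in the Grid home, the installed patch inventory can be queried; a minimal sketch, assuming a recent OPatch that supports the -bugs_fixed option:
$GRID_HOME/OPatch/opatch lsinventory -bugs_fixed | grep -i 14619103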
Decision:
There was a repository-level locking issue between ASM and CRS, which prevented CRS from starting and which in turn caused an abnormal chain of events (CSSD could not start because it could not see the voting disks properly, and ASM crashed because of that).
Solution:
The problem was resolved by starting CRS in exclusive mode. I thought that, in exclusive mode, CRS might start without a problem and that this would resolve the locking problem (actually a kind of deadlock) between CRS and ASM. What I did was the following:
I executed the command below to start CRS in exclusive mode on node 1.
$GRID_HOME/bin/crsctl start crs -excl -nocrs
Note that:
The -nocrs option ensures that the crsd process and OCR do not start with the rest of the Oracle Clusterware stack.
The -excl option ensures that the Oracle Clusterware stack starts in exclusive mode on one node.
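For completeness, the overall sequence looks roughly like this; a sketch run as root, and note that in my case I simply rebooted node 1 instead of restarting the stack manually:
# 1) Start the stack in exclusive mode on node 1, without crsd/OCR
$GRID_HOME/bin/crsctl start crs -excl -nocrs

# 2) Watch the lower-stack resources (ora.cssd, ora.asm, ...) come up
$GRID_HOME/bin/crsctl stat res -t -init

# 3) Bring node 1 back to normal mode (or reboot the node, as I did)
$GRID_HOME/bin/crsctl stop crs -f
$GRID_HOME/bin/crsctl start crs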
After executing the command above on node 1, I immediately saw that the ASM2 instance on node 2 started. It seemed the lock had been released.
So, after that, I rebooted node 1, and after the reboot everything started to work.
Both nodes could see 3 voting disks, and the ASM instances and databases started without any problems.
Outputs:
[oracle@osrvdb02 bin]$ ./crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE ad944505cc0c4f4ebf29eb746962ab06 (o/192.168.10.3/DBFS_DG_CD_03_osrvcel01) [DBFS_DG]
2. ONLINE 1a8ea7860b0a4f86bf7cbd7423710dfd (o/192.168.10.5/DBFS_DG_CD_02_osrvcel03) [DBFS_DG]
3. ONLINE 2a57155433e64f78bfb3424215e8dc16 (o/192.168.10.4/DBFS_DG_CD_02_osrvcel02) [DBFS_DG]
Located 3 voting disk(s).
[oracle@osrvdb01 bin]$ ./crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE ad944505cc0c4f4ebf29eb746962ab06 (o/192.168.10.3/DBFS_DG_CD_03_osrvcel01) [DBFS_DG]
2. ONLINE 1a8ea7860b0a4f86bf7cbd7423710dfd (o/192.168.10.5/DBFS_DG_CD_02_osrvcel03) [DBFS_DG]
3. ONLINE 2a57155433e64f78bfb3424215e8dc16 (o/192.168.10.4/DBFS_DG_CD_02_osrvcel02) [DBFS_DG]
Located 3 voting disk(s).
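Beyond the voting disk query, a couple of quick checks confirm that the cluster and the databases are healthy again; a sketch, where the database name is illustrative:
# Cluster-wide stack status on all nodes
$GRID_HOME/bin/crsctl check cluster -all

# All clusterware-managed resources, including ASM and the databases
$GRID_HOME/bin/crsctl stat res -t

# Status of the EBS database (replace PROD with the actual database name)
srvctl status database -d PROD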
Before finishing this blog post, I would like to underline 2 things:
1) This situation can be thought of as a kind of one-time issue; it can be considered a generic CRS issue, which was actually caused by the design of Oracle RAC.
2) The suggested solution was upgrading GRID + DB to 11.2.0.4, as 11.2.0.3 and earlier versions are already desupported and no patches will be provided for them.
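Before planning such an upgrade, it is worth confirming the currently active and installed Grid Infrastructure versions; a minimal check:
# Version the cluster is actively running at
$GRID_HOME/bin/crsctl query crs activeversion

# Version of the software installed in this Grid home
$GRID_HOME/bin/crsctl query crs softwareversion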