Tuesday, October 27, 2020

ODA -- All MMONs die at random times and DBs down / investigation

I was working on a case last week. In fact, this case was escalated to me (I am usually the end point of escalation).

The environment was an ODA X6-2 HA and it was installed as bare metal.
The problem was about the RAC databases.. There were critical production databases running on this machine and they were encountering errors. All the instances on one of the nodes were getting terminated..

The issue was on node 1, the first node of the ODA.. Since the databases were configured as RAC, there was no major business impact. But it was still annoying, of course.

Well, in order to see the real cause behind this, I decided to do a full stack analysis. (Note that, in this environment, it was not allowed to run scripts and collect diag data, so manual diagnostics were required.)

I started from the hardware and took a look at the whole stack. (I counted myself lucky as it was not a virtualized environment :) From top to bottom, we had the databases, the GRID infrastructure (+ASM +ACFS), the Oracle Linux operating system and the Oracle servers (the hardware itself).. By the way, this is not a surprise, of course. I am writing this because there may be some who do not know ODA.

I concentrated on the last time the databases failed.

My findings were:

*There were 18C and 12CR2 databases on this ODA. The GRID was 18C. OS was Oracle Linux 6.10.

*The load was getting higher during that period.. I saw a peak in the sar outputs. (Note that the instances failed at 10:16 PM.) A sketch of how this can be pulled with sar is given right after this findings list.

                 runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
10:10:01 PM         9      3877      6.22      7.30      7.76
10:20:01 PM        63      3583     66.82     44.44     24.37
10:30:01 PM        12      3614      6.10     12.03     16.17
*I saw the following errors in the syslog (/var/log/messages), but they are ignorable:
/sys/bus/pci/devices/0000:af:00.1/virtfn0/uevent failed! 2BROADCOM[87864]: ERROR 

*Node 1 was evicted.. (so something was going on cluster-wide..)

*The first error that was seen in one of the database instances was: "ORA-00445: background process "m000" did not start after 120 seconds"

*All the databases were getting terminated. (18C, 12CR2, all of them) -- I thought that there might be an ASM or ACFS issue.. On new ODA environments, we have ACFS, right?

*The MMON of the ASM instance was also terminated. The error stack was:

ksedst1()+110        call     kgdsdst()            7FFE764BE710 000000002
                                                   7FFE764B8A50 ? 7FFE764B8B68 ?
                                                   7FFE764BE230 ? 000000082 ?
ksedst()+64          call     ksedst1()            000000000 000000001
                                                   7FFE764B8A50 ? 7FFE764B8B68 ?
                                                   7FFE764BE230 ? 000000082 ?
dbkedDefDump()+2385  call     ksedst()             000000000 000000001 ?
9                                                  7FFE764B8A50 ? 7FFE764B8B68 ?
                                                   7FFE764BE230 ? 000000082 ?
ksedmp()+593         call     dbkedDefDump()       000000003 000000002
                                                   7FFE764B8A50 ? 7FFE764B8B68 ?
                                                   7FFE764BE230 ? 000000082 ?
dbgexPhaseII()+2130  call     ksedmp()             0000003EB 000000002 ?
                                                   7FFE764B8A50 ? 7FFE764B8B68 ?
                                                   7FFE764BE230 ? 000000082 ?
dbgexExplicitEndInc  call     dbgexPhaseII()       7F9FB6F726D0 7F9FB1D59528
()+609                                             7FFE764C28B0 7FFE764B8B68 ?
                                                   7FFE764BE230 ? 000000082 ?
dbgeEndDDEInvocatio  call     dbgexExplicitEndInc  7F9FB6F726D0 7F9FB1D59528
nImpl()+695                   ()                   7FFE764C28B0 ? 7FFE764B8B68 ?
                                                   7FFE764BE230 ? 000000082 ?
ksbsrvn_opt()+4562   call     dbgeEndDDEInvocatio  7F9FB6F726D0 7F9FB1D59528
                              nImpl()              7F9FB6FB2980 7FFE764B8B68 ?
                                                   7F9FB6FB2980 000000082 ?
ksbsrv_opt()+45      call     ksbsrvn_opt()        7F9FB6F726D0 ? 7FFE764C63C8
                                                   7F9FB6FB2980 ? 7FFE764B8B68 ?
                                                   7F9FB6FB2980 ? 000000082 ?
ksvspawn()+889       call     ksbsrv_opt()         7F9FB6F726D0 ? 7FFE764C63C8 ?

Of course, there were a lot of other findings in the CRS logs, ora agent logs, ASM alert logs, ACFS logs and so on.. However, they didn't affect the result very much, so I am not including them here.

Anyway, I concentrated on the MMON traces.. MMON could give me a clue for solving this problem.

In fact, the MMON processes were already giving clues.. MMON was the first failing process.. More interestingly, just before the issue, the MMON processes of "all" the instances (including ASM) were getting similar errors.
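
In case it helps anyone doing a similar manual analysis, the sketch below shows one way to locate the recent MMON trace files of all the instances (databases and ASM) under the diag destinations. The base paths are assumptions based on a typical ODA bare metal layout; adjust them to your own ORACLE_BASE / diagnostic_dest settings:

# find MMON trace files modified in the last 3 hours under the RDBMS and ASM diag bases
# (paths are examples; check your own diagnostic_dest)
find /u01/app/oracle/diag/rdbms /u01/app/grid/diag/asm -name '*_mmon_*.trc' -mmin -180 2>/dev/null

# quickly see which alert logs reported ORA-00445 around the incident
grep -l "ORA-00445" /u01/app/oracle/diag/rdbms/*/*/trace/alert_*.log 2>/dev/null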

Well, ORA-00445 was generic, but the error stack wasn't. The call to dbgeEndDDEInvocationImpl and all the other calls under it directed me to the MOS note -> ORA-00445: background process "m000" did not start after 120 seconds (Doc ID 2679704.1).

The cause was: "Bug 30276911 - AIM: BACKGROUND PROCESS "M000" DID NOT START AFTER 120 SECONDS - KEBM_MMON_SCHEDULE_SLAVE (duplicate of Bug 29902299 - ORA-445 - KEBMSS_SPAWN_SLAVE)"

The solution was applying Patch 29902299 or upgrading the databases. "But wait a sec.. This is a database patch, but ASM was also getting this MMON error. There could be another reason for this common problem.." After saying that, I connected to the ILOM of node 1 (I should have done this before), and saw the following:


A physical memory problem.. Actually, multiple memory problems.. Yeah, I know those faults may be cleared from the ILOM, but look; although they seem correctable (by ECC - Error Correcting Code), why are they produced in the first place? (An answer to this question: electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access memory (DRAM) to spontaneously flip to the opposite state.) And why on this particular RAM?

Additional info about the error:

Description : Multiple correctable ECC errors on a memory DIMM have been detected.

Response : The affected page(s) of memory associated with the faulty
memory module may be immediately retired by the operating
system to avoid subsequent errors.

Impact : The system will continue to operate in the presence of this
fault. The memory DIMM is still in use and is not disabled.
If the DIMM_CE_MAP_OUT policy is enabled, the memory DIMM is
disabled on next system reboot and will remain unavailable
until repaired. System performance may be impacted slightly
due to retired memory pages.

Action : Please refer to the associated reference document at
http://support.oracle.com/msg/SPX86A-8002-XM for the latest
service procedures and policies regarding this diagnosis.

Linux was still using that memory.. (it would probably retire that memory area on the first reboot) Somehow, when ASM and the DBs started to use the memory space corresponding to this DIMM, the problem arose and all of them got errors. Note that these findings also showed us that ASM and all the database instances of node 1 started to use this area at around the same time. I didn't need to dig any more. After all, there was a memory problem and the customer was dealing with seemingly meaningless instance terminations.
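
As a side note, this kind of correctable ECC activity can also be spotted from the OS side, without logging in to the ILOM. A minimal sketch, assuming the EDAC modules and/or the mcelog service are enabled on the Oracle Linux 6 image (verify this on your own system):

# corrected-error counters kept by the EDAC driver, per memory controller / csrow
grep . /sys/devices/system/edac/mc/mc*/csrow*/ce_count 2>/dev/null

# machine check / hardware error traces in the system logs
grep -i "hardware error" /var/log/messages | tail
[ -f /var/log/mcelog ] && tail /var/log/mcelog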

Yes.. this was the reason behind it.. As a matter of fact, the call stack in the MOS note 2679704.1 did not exactly fit the one I got from the environment. Especially this requirement -> "Call Stack must contain: kebm_mmon_schedule_slave". We didn't have it! The error stack was similar, but we didn't have a call to kebm_mmon_schedule_slave.

After further research, I found the MOS note named "Top ASM Instance Crash Issues (Doc ID 2247412.1)". The issue named "Issue #2: ASM Instance Terminated with ORA-00445 and ORA-29743" was similar to the issue I was dealing with.. The high load was documented, the IPC send timeout was also documented there, and the call stack was similar.. (not similar to the call stack of MMON, but similar to CJQ's call stack.. CJQ was the first process that failed after MMON). The note stated the cause as "BUG 20134113 - ORA-445: BACKGROUND PROCESS "M000" DID NOT START AFTER 120 SECONDS". The solution offered was "Applying the Interim Patch 20134113 by downloading it from MOS or logging an SR to get the patch from Oracle Support".. This little patch was designed to change kjxgna.o. But it was weird, because the patch was only for the database tier.. However, ASM instances run on GRID, so definitely, an SR was required there..
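
If you end up going down the patch route, a quick way to see whether an interim patch like 20134113 or 29902299 is already in place is opatch. A minimal sketch, assuming a standard ODA home layout (the ORACLE_HOME path below is just an example; use your own database or grid home):

# list the interim patches installed in a given Oracle home and look for the bug numbers
export ORACLE_HOME=/u01/app/oracle/product/18.0.0.0/dbhome_1   # example path
$ORACLE_HOME/OPatch/opatch lspatches | grep -E "20134113|29902299"

# the patch itself is applied from the unzipped patch directory with "opatch apply",
# after stopping the instances running from that home (always follow the patch README)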

I checked all the logs, all the notes, all of Oracle Support and didn't find anything more that was directly related (except Note 2679704.1 and 2247412.1), and then I checked the ILOM and found the failing RAM. We replaced the RAM and the issue hasn't come up again.. yet!.. :) The 2 patches I mentioned above are still in question and the Oracle SR will guide us after this point. End of the story :)

Note that a RAM failure may cause an instance crash or a process crash.. The database node does not have to crash completely. (Reference: Exadata DB Crash Due To Failing Memory DIMM (Doc ID 2512283.1))
