Tuesday, October 27, 2020

ODA -- All MMONs die at random times and DBs down / investigation

I worked on a case last week. In fact, this case was escalated to me (I am usually the end point of escalation).

The environment was an ODA X6-2 HA and it was installed as bare metal.
The problem was about the RAC databases. There were critical production databases running on this machine and they were encountering errors. All the instances on one of the nodes were getting terminated.

The issue was on node 1, the first node of the ODA. Since the databases were configured as RAC, there was no major business impact. But it was still annoying, of course.

Well, in order to see the real cause behind this, I decided to make a full stack analysis. (Note that in this environment it is not allowed to run scripts and collect diag data, so manual diagnostics were required.)

I started from the hardware and took a look at the whole stack. (I counted myself lucky, as it was not a virtualized environment :) From top to bottom, we had databases, GRID infrastructure (+ASM +ACFS), Oracle Linux operating systems and the Oracle servers (the hardware itself). By the way, this is not a surprise, of course. I am writing this because there may be some readers who do not know ODA.

I concentrated on the last time the database failed.

My findings were:

*There were 18C and 12CR2 databases on this ODA. The GRID was 18C. OS was Oracle Linux 6.10.

*The load was getting higher during that period. I saw a peak in the sar outputs. (Note that the instances failed at 10:16 PM.)

10:10:01 PM         9      3877      6.22      7.30      7.76
10:20:01 PM        63      3583     66.82     44.44     24.37
10:30:01 PM        12      3614      6.10     12.03     16.17

*I saw the following errors in the syslog (/var/log/messages), but they can be ignored:
/sys/bus/pci/devices/0000:af:00.1/virtfn0/uevent failed! 2BROADCOM[87864]: ERROR 

*Node 1 was evicted.. (so some things were going on cluster wide..)

*The first error that was seen in one of the database instances was: "ORA-00445: background process "m000" did not start after 120 seconds"

*All the databases were getting terminated (18C, 12CR2, all of them) -- I thought that there might be an ASM or ACFS issue. On new ODA environments we have ACFS, right?

*MMON of the ASM instance was also terminated. The error stack was:

ksedst1()+110        call     kgdsdst()            7FFE764BE710 000000002
                                                   7FFE764B8A50 ? 7FFE764B8B68 ?
                                                   7FFE764BE230 ? 000000082 ?
ksedst()+64          call     ksedst1()            000000000 000000001
                                                   7FFE764B8A50 ? 7FFE764B8B68 ?
                                                   7FFE764BE230 ? 000000082 ?
dbkedDefDump()+2385  call     ksedst()             000000000 000000001 ?
9                                                  7FFE764B8A50 ? 7FFE764B8B68 ?
                                                   7FFE764BE230 ? 000000082 ?
ksedmp()+593         call     dbkedDefDump()       000000003 000000002
                                                   7FFE764B8A50 ? 7FFE764B8B68 ?
                                                   7FFE764BE230 ? 000000082 ?
dbgexPhaseII()+2130  call     ksedmp()             0000003EB 000000002 ?
                                                   7FFE764B8A50 ? 7FFE764B8B68 ?
                                                   7FFE764BE230 ? 000000082 ?
dbgexExplicitEndInc  call     dbgexPhaseII()       7F9FB6F726D0 7F9FB1D59528
()+609                                             7FFE764C28B0 7FFE764B8B68 ?
                                                   7FFE764BE230 ? 000000082 ?
dbgeEndDDEInvocatio  call     dbgexExplicitEndInc  7F9FB6F726D0 7F9FB1D59528
nImpl()+695                   ()                   7FFE764C28B0 ? 7FFE764B8B68 ?
                                                   7FFE764BE230 ? 000000082 ?
ksbsrvn_opt()+4562   call     dbgeEndDDEInvocatio  7F9FB6F726D0 7F9FB1D59528
                              nImpl()              7F9FB6FB2980 7FFE764B8B68 ?
                                                   7F9FB6FB2980 000000082 ?
ksbsrv_opt()+45      call     ksbsrvn_opt()        7F9FB6F726D0 ? 7FFE764C63C8
                                                   7F9FB6FB2980 ? 7FFE764B8B68 ?
                                                   7F9FB6FB2980 ? 000000082 ?
ksvspawn()+889       call     ksbsrv_opt()         7F9FB6F726D0 ? 7FFE764C63C8 ?

Of course, there were a lot of other findings in CRS, ora agent logs, ASM alert logs, ACFS logs and so on. However, they didn't affect the result very much, so I am not including them here.
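
The load peak in the sar rows quoted in the findings can also be picked out programmatically. The following is only a minimal sketch: it assumes the rows come from `sar -q` (columns runq-sz, plist-sz, ldavg-1, ldavg-5, ldavg-15), which the post does not state explicitly, and the names `LoadSample`, `parse_sar_q` and `peak` are made up for illustration.

```python
from typing import List, NamedTuple


class LoadSample(NamedTuple):
    time: str       # e.g. "10:20:01 PM"
    runq_sz: int    # run queue length (assumed column meaning)
    plist_sz: int   # number of tasks in the task list (assumed column meaning)
    ldavg_1: float
    ldavg_5: float
    ldavg_15: float


def parse_sar_q(text: str) -> List[LoadSample]:
    """Parse rows like '10:20:01 PM 63 3583 66.82 44.44 24.37'."""
    samples = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 7 or parts[1] not in ("AM", "PM"):
            continue  # blank lines, "Average:" rows, anything unexpected
        try:
            samples.append(LoadSample(
                f"{parts[0]} {parts[1]}",
                int(parts[2]), int(parts[3]),
                float(parts[4]), float(parts[5]), float(parts[6]),
            ))
        except ValueError:
            continue  # header row with column names instead of numbers
    return samples


def peak(samples: List[LoadSample]) -> LoadSample:
    """Return the sample with the highest 1-minute load average."""
    return max(samples, key=lambda s: s.ldavg_1)


SAR_ROWS = """\
10:10:01 PM         9      3877      6.22      7.30      7.76
10:20:01 PM        63      3583     66.82     44.44     24.37
10:30:01 PM        12      3614      6.10     12.03     16.17
"""

worst = peak(parse_sar_q(SAR_ROWS))
print(worst.time, worst.ldavg_1)
```

With the three rows from this case, the 10:20:01 PM interval stands out, which covers the 10:16 PM failure time.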

Anyways, I concentrated on the MMON traces. MMON could give me a clue to solve this problem.

In fact, MMON processes were already giving clues .. MMON was the first failing process.. More interestingly, just before the issue, MMON processes of "all" instances (including ASM) were getting similar errors.

Well, ORA-00445 was generic, but the error stack wasn't. The call to dbgeEndDDEInvocationImpl and all the other calls under it were directing me to the MOS note -> ORA-00445: background process "m000" did not start after 120 seconds (Doc ID 2679704.1).

The cause was: "Bug 30276911 - AIM: BACKGROUND PROCESS "M000" DID NOT START AFTER 120 SECONDS - KEBM_MMON_SCHEDULE_SLAVE (duplicate of Bug 29902299 - ORA-445 - KEBMSS_SPAWN_SLAVE)"

The solution was applying Patch 29902299 or upgrading the databases. "But wait a sec. This is a database patch, but ASM was also getting this MMON error. There could be another reason for this common problem." After saying that, I connected to the ILOM of node 1 (I should have done this before) and saw the following:


A physical memory problem. Actually, multiple memory problems. Yeah, I know those faults may be cleared from ILOM, but look: although they seem correctable (by ECC, Error Correcting Code), why are they produced in the first place? (One answer to this question: electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access memory (DRAM) to spontaneously flip to the opposite state.) And why for this RAM?

Additional info about the error:

Description : Multiple correctable ECC errors on a memory DIMM have been detected.

Response : The affected page(s) of memory associated with the faulty
memory module maybe immediately retired by the operating
system to avoid subsequent errors.

Impact : The system will continue to operate in the presence of this
fault. The memory DIMM is still in use and is not disabled.
If the DIMM_CE_MAP_OUT policy is enabled, the memory DIMM is
disabled on next system reboot and will remain unavailable
until repaired. System performance may be impacted slightly
due to retired memory pages.

Action : Please refer to the associated reference document at
http://support.oracle.com/msg/SPX86A-8002-XM for the latest
service procedures and policies regarding this diagnosis.

Linux was still using that memory (probably it would discard that memory at the next reboot). Somehow, when ASM and the DBs started to use the memory space corresponding to this DIMM, the problem arose and all of them got errors. Note that these findings also suggested that ASM and all the database instances of node 1 were starting to use this area at around the same time. I didn't need to dig any further. After all, there was a memory problem and the customer was dealing with seemingly meaningless instance terminations.
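
On a Linux host where running a script is permitted (it was not in this case), correctable ECC errors can sometimes be spotted from the OS side as well, through the kernel's EDAC counters. The sketch below assumes the EDAC driver is loaded and exposes its usual sysfs files (`ce_count` and `ue_count` under `/sys/devices/system/edac/mc/mc*`); whether that is true on a given ODA image is an assumption, and the function name `read_ecc_counts` is made up for illustration. ILOM remains the authoritative source for the fault itself.

```python
from pathlib import Path
from typing import Dict

# Usual Linux EDAC sysfs location; only present when an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_ecc_counts(root: Path = EDAC_ROOT) -> Dict[str, Dict[str, int]]:
    """Return {controller: {"ce": n, "ue": n}} for every mc* directory under root."""
    counts: Dict[str, Dict[str, int]] = {}
    if not root.is_dir():
        return counts  # no EDAC information available on this host
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for key, name in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[key] = 0  # counter missing or unreadable: treat as zero
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for name, c in read_ecc_counts().items():
        flag = "  <-- worth checking in ILOM" if c["ce"] or c["ue"] else ""
        print(f"{name}: correctable={c['ce']} uncorrectable={c['ue']}{flag}")
```

A non-zero and growing correctable count on one memory controller is the kind of hint that would have pointed to the ILOM earlier in this investigation.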

Yes, this was the reason behind it. As a matter of fact, the call stack in MOS note 2679704.1 did not exactly fit the one I got from the environment, especially on this condition -> "Call Stack must contain: kebm_mmon_schedule_slave". We didn't have it! The error stack was similar, but we didn't have a call to kebm_mmon_schedule_slave.
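
Checking a "Call Stack must contain" condition by eye is error-prone on long traces, so here is a small sketch of how it could be done mechanically. It assumes the trace layout shown earlier in this post (frame lines start at column 0 with `name()+offset`), and the helper names `frame_names` and `stack_contains` are made up for illustration. Function names that the trace wraps across two lines (such as dbgexExplicitEndInc / ()+609) are not reassembled by this simple version.

```python
import re
from typing import List

# A frame line starts at column 0 with "name()+offset"; continuation lines are indented.
FRAME_RE = re.compile(r"^([A-Za-z_][A-Za-z0-9_$]*)\(\)\+\d+")


def frame_names(stack_text: str) -> List[str]:
    """Return the function names of all unwrapped frame lines, top of stack first."""
    names = []
    for line in stack_text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            names.append(m.group(1))
    return names


def stack_contains(stack_text: str, required: str) -> bool:
    """Case-insensitive check for a required function name among the frames."""
    return required.lower() in (n.lower() for n in frame_names(stack_text))


# A few lines copied from the ASM MMON stack shown earlier.
EXCERPT = """\
ksedst()+64          call     ksedst1()            000000000 000000001
                                                   7FFE764B8A50 ? 7FFE764B8B68 ?
ksedmp()+593         call     dbkedDefDump()       000000003 000000002
ksbsrv_opt()+45      call     ksbsrvn_opt()        7F9FB6F726D0 ? 7FFE764C63C8
ksvspawn()+889       call     ksbsrv_opt()         7F9FB6F726D0 ? 7FFE764C63C8 ?
"""

print(frame_names(EXCERPT))
print(stack_contains(EXCERPT, "kebm_mmon_schedule_slave"))
```

For the excerpt above the required function is absent, which matches what I saw when comparing the trace with the note.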

After further research, I found the MOS note named "Top ASM Instance Crash Issues (Doc ID 2247412.1)". The issue named "Issue #2: ASM Instance Terminated with ORA-00445 and ORA-29743" was similar to the one I was dealing with. The high load was documented there, the IPC Send timeout was also documented there, and the call stack was similar (not to the call stack of MMON, but to the call stack of CJQ; CJQ was the first process that failed after MMON). The note stated the cause as "BUG 20134113 - ORA-445: BACKGROUND PROCESS "M000" DID NOT START AFTER 120 SECONDS". The solution offered was "Applying the Interim Patch 20134113 by downloading from MOS or logging a SR to get the patch from Oracle Support". This little patch was designed to change kjxgna.o. But it was weird, because the patch was only for the database tier, while ASM instances run on GRID. So an SR was definitely required there.

I checked all the logs, all the notes and everything on Oracle Support, and didn't find anything more that was directly related (except Notes 2679704.1 and 2247412.1). Then I checked the ILOM and found the failing RAM. We replaced the RAM and the issue didn't come up again.. yet! :) The 2 patches I mentioned above are still in question, and the Oracle SR will guide us after this point. End of the story :)

Note that a RAM failure may cause an instance crash or a process crash. The database node does not have to crash completely. (Reference: Exadata DB Crash Due To Failing Memory DIMM (Doc ID 2512283.1))

Sunday, October 25, 2020

ODA as a Weblogic Appliance ? Or is it better to consider PCA?

I was recently designing an architecture for enterprise applications and databases. 

Yes, we don't have any confusion when it comes to the database tier :) Especially for the enterprise databases, which require high-end performance, resiliency and fault tolerance. Independent of the type of workload, the platform that we recommend is Exadata.

However, recommending a platform for application tier is another subject. Just like for the database tier; for large scale mission critical application environments, which mostly rely on Oracle Weblogic and FMW products, enterprise customers prefer engineered systems. 

Supposing the needs of the database tier are already satisfied with Exadata, the first platform that comes to mind for this kind of application tier is Exalogic, which provides extreme performance, reliability and scalability for Oracle, Java and other business applications. However, currently I don't see a price for on-prem Exalogic. I mean, according to the Oracle Engineered Systems Price List (September 25, 2020), Exalogic is not even on sale. Currently this Exalogic link (https://www.oracle.com/engineered-systems/exalogic/index.html) redirects us to https://www.oracle.com/engineered-systems/private-cloud-appliance

Yes! The link redirects us to Private Cloud Appliance (PCA). But before we get to that point, we must ask ourselves -> why not Oracle Database Appliance (ODA)? There are strong E-Business Suite (EBS) references for ODA. I personally migrated several EBS customers to ODA, and Oracle still has several customers using EBS on ODA. Well, mostly for the mid-size EBS customers, we use ODA as a consolidation platform for hosting several EBS applications and databases.

Besides, we know that; although the name implies that it is a Database Appliance, ODA can also function as an Oracle WebLogic Appliance.

However, this context may change based on the real life stories and needs. I mean regardless of having EBS or other applications that rely on Weblogic and FMW, we may want to position an Engineered System for our applications and middleware, and only for them! (considering our database layer may be on Exadata..) 

So is ODA the solution for hosting or consolidating application environments and application environments only ?!?

Well, in fact, even if ODA may be suitable for this scenario, it does not make much sense to use it this way, especially when we consider the storage mechanism provided for virtual environments on ODA. I mean the GRID, ASM and ACFS that are used for providing it. These things are pretty simplified on ODA, correct, but still we find ourselves in the DBA world for hosting application environments, and for hosting only the applications!

Think about it... You just put applications on Guest VMs of ODA, you don't have any database residing on ODA, but still you maintain GRID, ASM and ACFS.

In addition to that, ODA cannot scale out well. That is, when you reach the limits of its capacity, there is the possibility of vertical growth, but unfortunately there is no possibility of horizontal growth.

So this may be a problem for enterprise application environments, especially for large companies.

Okay, now we are here, the new engineered system ( at least it is new for me, as I didn't make any implementation on it, yet..).

Private Cloud Appliance comes into play at this point. It is available for on-premises and allows customers to efficiently consolidate business critical middleware and application workloads.

It has an integrated ZFS Storage and Oracle X8-2 servers for management and compute nodes.

It is scalable (up to 1200 cores, 3.3 PB disk capacity).

It is highly available and cloud ready.

Moreover, PCA can be directly attached to Exadata to provide the lowest latency between the middleware and DB tiers.

The virtualization technology used in PCA is OVM. However, we are expecting KVM to be supported on PCA soon. (Remember what I wrote earlier about this OVM and KVM thing -> https://ermanarslan.blogspot.com/2020/09/end-of-premier-support-of-ovm-it-is.html)

Virtualization on PCA supports Oracle Solaris, Oracle Linux, Red Hat Enterprise Linux and Microsoft Windows Server for guest VMs.

There is even a tool for automating the migration of virtual machines from VMware vSphere to Oracle VM.

This machine also has strong references.

I can keep writing about PCA, but it's more helpful to have a look yourself.

Check -> https://www.oracle.com/servers/technologies/private-cloud-appliance.html

Read the Private Cloud Appliance FAQs. 

Check the YouTube video -> https://www.youtube.com/watch?v=Gtv6Mssbnp0 (Oracle Private Cloud Appliance a.k.a. PCA X8)

Very likely, you will think what I think and we will come to the same point.

This is the platform we are looking for when it comes to large and critical Oracle application environments.

An Engineered System that can be used to host and consolidate the layers of our applications and can bring us many new features and capabilities.

Have a nice weekend :)

Monday, October 5, 2020

Erman Arslan's Oracle Forum -- Questions and Answers Series - September 2020

Let's start with the following question and answer. 
PS: this question and answer reveal what my motivation is. ->

Question: How much time do you spend/lose? 
Answer: Well, how much time do I gain? :) 

In September, again I tried to answer all the questions. I gave advice when necessary, and provided guidance for the solutions when I had enough info about those problems and the environments where those problems arise.

Take a look at the issues and related topics in Erman Arslan's Oracle Forum. Collect the harvest you can gather from the support and technical directions provided!


Erman Arslan's Oracle Forum September 2020 -> 

Thursday, October 1, 2020

Custom SSO / Login to OBIEE from a 3rd party app. By sending a POST request.. This works even when the LightWeightSSO is enabled!

In one of my previous blog posts (https://ermanarslan.blogspot.com/2020/09/obiee-sso-integrating-with-third-party.html), I shared a third party SSO integration method for OBIEE.

We were just passing the user and password info as url arguments and it was working.

In that blog post, there was the following passage:

That is -> We make our OBIEE get the user and password through the OBIEE url (on-the-fly login using url arguments). Note that this is the simplest way of doing this work. Of course, the customer's ability to post the usernames and passwords using any method other than this one will make us change/improve the design of this login flow.

Anyways, this was one of the ways, but today we realized something else. Something that works against that way.

That is, if we login to OBIEE and then try to reach ODV from there, we find ourselves in a login dialog, where we should enter our user and password information once again. Yes.. This is not cool..

Fortunately, we have a solution for this too!

The solution is to enable LightweightSSO. Sounds simple, right? But wait a sec, LightweightSSO is not compatible with our 3rd party integration method, I mean -> logging into OBIEE from a third party app by passing user and password as arguments in the OBIEE URL...

Remember, in that blog post, I already mentioned that when 12.2.1.3 LightWeightSSO is ON, NQPwd/User(I mean the URL method) won't work for OBIEE login.. So, as I mentioned in that earlier blog post, we disabled LightWeightSSO to be able to pass user and password info through url.

However; when the LightWeightSSO is disabled, we can't directly reach ODV from OBIEE.. I mean, ODV requires us to re-enter our user and password info as I just mentioned. 
So it is not acceptable. 
This means we need to enable LightWeightSSO to make the automatic SSO integration between OBIEE and ODV work. Of course, this time (when the LightWeightSSO is enabled), our OBIEE login (through url arguments user and password) will not work.

Well, this is what makes me write this blog post.

The question: How can we login to OBIEE from a 3rd party application automatically in a custom SSO-like way, even when the LightWeightSSO is enabled?

In order to answer this, we take a look at the OBIEE login flow, I mean we do a technical login mechanism analysis. 

I don't mean a code analysis, but we use our browser (for instance Chrome -> F12 -> Network tab) to analyze the http requests, http headers and the form data. We need to check the required arguments.

Once we do that analysis, we can see that, when the LightWeightSSO is enabled, the login page changes. 
Our login page becomes login.jsp. login.jsp gets the user and password info from the user and authenticates it using "login" (without the .jsp suffix).

So when we check that "login", we see that it is designed to receive some POST request arguments. j_username, j_password and so on. 
So if we can make an HTTP POST request to "login" directly from our 3rd party app, it should work.

This way, we will be able to pass the username and password info to OBIEE and OBIEE will let us in automatically. (even when the LightWeightSSO is enabled!)

So, we create a simple html to test this..
Note that the values that you see below are just examples.  -> 

<html>
<body>
<form id='redirectForm' method='POST' action='https://obiee_host:obiee_port/bi-security-login/login'>
<input type='hidden' name='j_username' value='weblogic'/>
<input type='hidden' name='j_password' value='erman'/>
<input type='hidden' name='j_msi' value='none'/>
<input type='hidden' name='j_language' value='en'/>
<input type='hidden' name='j_redirect' value='L2FuYWx5dGljcy9zYXcuZGxsP2JpZWVob21lJnN0YXJ0UGFnZT0xJmhhc2g9RlEyeDZFaGp3cnJHQXNzbmVWOWtSeVVuYmxVQjYyczZMR0JESFEtR3F5ZEoxcXh2bjMyMmxKaUlwU1R4VFIxMA'/>
</form>
<h1><a href="#" onclick="document.getElementById('redirectForm').submit()">GO!!</a></h1>
</body>
</html>

Please note the hidden input names -> j_username, j_password, j_msi, j_language and j_redirect..
j_redirect is the url that OBIEE will redirect us to after the login process. It is in base64 form. (In this case it is basically set to -> /analytics/saw.dll?bieehome&startPage=1)
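
To build or inspect a j_redirect value yourself, plain base64 encoding and decoding is enough. This is a minimal sketch, assuming the standard base64 alphabet with the trailing '=' padding dropped, which is what the sample value above looks like; whether OBIEE also accepts padded or URL-safe values is something I have not verified, and the function names are made up for illustration.

```python
import base64


def encode_redirect(path: str) -> str:
    """Base64-encode a redirect path and drop '=' padding, as in the sample value."""
    return base64.b64encode(path.encode("utf-8")).decode("ascii").rstrip("=")


def decode_redirect(value: str) -> str:
    """Decode a j_redirect value, restoring any '=' padding that was stripped."""
    padded = value + "=" * (-len(value) % 4)
    return base64.b64decode(padded).decode("utf-8")


path = "/analytics/saw.dll?bieehome&startPage=1"
token = encode_redirect(path)
print(token)
print(decode_redirect(token))
```

The encoded form of this path is the beginning of the j_redirect value in the test html; the rest of that sample value decodes to an additional hash parameter appended after startPage=1.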

So, we open this html with our browser and click GO! Guess what? We found ourselves in the OBIEE home page! (logged in automatically in the backend by posting user and password info) So it works! 

At the end, we pass this html to the developers of the third party application as a reference, they modify their OBIEE login code, and that's it :) We login to OBIEE automatically from a 3rd party app even when the LightWeightSSO is enabled.

I'm not finished! :)
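
For developers who prefer to see the request itself rather than an html form, here is a minimal sketch that builds the same POST with the Python standard library. The field names and the /bi-security-login/login path come from the test html above; the host, port and the function name `build_login_request` are hypothetical. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_login_request(base_url, username, password, redirect_b64,
                        language="en", msi="none"):
    """Build (but do not send) the form POST that the test html submits."""
    fields = {
        "j_username": username,
        "j_password": password,
        "j_msi": msi,
        "j_language": language,
        "j_redirect": redirect_b64,
    }
    body = urlencode(fields).encode("ascii")
    return Request(
        url=base_url.rstrip("/") + "/bi-security-login/login",
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


req = build_login_request(
    "https://obiee.example.com:9503",  # hypothetical host and port
    "weblogic", "erman",               # example values, as in the test html
    "L2FuYWx5dGljcy9zYXcuZGxsP2JpZWVob21lJnN0YXJ0UGFnZT0x",
)
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

Keep in mind that a request sent from a script like this would log in the script's own HTTP session. For the end user to land in OBIEE already logged in, the POST still has to be submitted by the user's browser, which is what the form approach does.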

If the third party app requires a form and doesn't like the form of login.jsp (probably because it is doing its work with javascript), I mean if the 3rd party app requires a submit button, then we create a wrapper html like the one below and deploy it to our Weblogic (or any web server that we have).
Want to deploy it to a Weblogic? -> here is the way ->  "How To Publish a Static HTML Page To WebLogic Server and Request Through Oracle HTTP Server 11g (Doc ID 1192439.1)" -- Part 1 is enough..

With this action, we actually put a middle man between our 3rd party app and the OBIEE login, and make the 3rd party app post to the OBIEE login using that middle man :) This works too!

So the flow becomes: "3rd party app -> Wrapper html -> OBIEE Login"

<html>
<body>
    <form  name="loginform" method='POST' 
        action='/bi-security-login/login' 
        style="visibility:hidden">
    <input type='hidden' name='j_username' value=''/>
    <input type='hidden' name='j_password' value=''/>
    <input type='hidden' name='j_msi' value=''/>
    <input type='hidden' name='j_language' value=''/>
    <input type='hidden' name='j_redirect' value=''/>
    <input type='submit' value='Login'/>
</form>
</body>
</html>

That is it for today:) I hope this will help you.