Tuesday, August 14, 2018

EBS R12 -- REQAPPRV ORA-24033 error after 12C DB upgrade /rulesets & queues

Encountered ORA-24033 in an EBS 12.1.3 environment.
Actually, this error started to be produced in workflow , just after upgrading the database of this environment from 11gR2 to 12cR1.

The database upgrade (running dbua and other stuff) was done by a different company, so that we were not able to check if it is done properly..
However; we were the ones who needed to solve this issue when it appeared :)

Anyways, functional team encountered this error while checking the workflows in Workflow Administrator Web Applications -> Status Monitor, and reported it as follows;


ORA-24033 was basically saying us, there is a queue-subscriber problem in the environment, so we started working with the queues, subscribers and the rulesets.

The analysis showed that, we had 1 ruleset and 1 rule missing in this environment..

select * from
dba_objects
where object_name like 'WF_DEFERRED_QUEUE%'

The following output was produced in a reference environment, on which workflow REQAPPRV was running without any problems.


The following output, on the other hand; was produced in this problematic environment.


As seen above, we had 1 ruleset named WF_DEFERRED_QUEUE_M$1 and 1 rule named WF_DEFERRED_QUEUE_M$1 missing in this problematic environment..

In addition to that, WF_DEFERRED related rulesets were invalid in this problematic environment.

In order to create (validate) these ruleset , we followed 2 MOS documents and executed our action plan accordingly.

Fixing Invalid Workflow Rule Sets such as WF_DEFERRED_R and Related Errors on Workflow Queues:ORA-24033 (Doc ID 337294.1)
Contracts Clause Pending Approval with Error in Workflow ORA-25455 ORA-25447 ORA-00911 invalid character (Doc ID 1538730.1)

So what we executed in this context was as follows;

declare
l_wf_schema varchar2(200);
lagent sys.aq$_agent;
l_new_queue varchar2(30);

begin
l_wf_schema := wf_core.translate('WF_SCHEMA');
l_new_queue := l_wf_schema||'.WF_DEFERRED';
lagent := sys.aq$_agent('WF_DEFERRED',null,0);
dbms_aqadm.remove_subscriber(queue_name=>l_new_queue, subscriber=>lagent);
end;
/
commit;

declare
l_wf_schema varchar2(200);
lagent sys.aq$_agent;
l_new_queue varchar2(30);

begin
l_wf_schema := wf_core.translate('WF_SCHEMA');
l_new_queue := l_wf_schema||'.WF_DEFERRED';
lagent := sys.aq$_agent('WF_DEFERRED',null,0);
dbms_aqadm.add_subscriber(queue_name=>l_new_queue, subscriber=>lagent,rule=>'1=1');
end;
commit;

declare

lagent sys.aq$_agent;
begin
lagent := sys.aq$_agent('APPS','',0);
dbms_aqadm.add_subscriber(queue_name=>'APPLSYS.WF_DEFERRED_QUEUE_M',
subscriber=>lagent,
rule=>'CORRID like '''||'APPS'||'%''');
end;
/

So what we did was to;

Remove and add back the subscriber/rules to the WF_DEFERRED queue 
+
Add the subscriber and rule back into the WF_DEFERRED_QUEUE_M queue.  (if needed we could remove the subscribe before adding it)

By taking these actions; the ruleset named WF_DEFERRED_QUEUE_M$1 and the rule named WF_DEFERRED_QUEUE_M$ were automatically created and actually, this fixed the ORA-24033 error in REQAPPRV :)

Monday, August 13, 2018

EBS -- MIGRATION // 2 interesting problems & 2 facts -- autoconfig rule (2n-1) & APPL_SERVER_ID in the plan.xml of ebsauth

Recently migrated a production EBS from an Exadata to another Exadata. That was an advanced operation, as it involved Oracle Access Manager(OAM), Oracle Internet Directory(OID) and 2 EBS disaster environments.,

This was a very critical operation, because it was not tested.. Moreover; we needed to do this work without any tests and we needed start working immediately..

The environment was as follows;

PROD : 1 Load Balancer, 2 Apps Nodes, 1 OAM/OID node and 2 Database nodes (Exadata)
-- Parallel Concurrent Processing involved as well..
Local Standby : 1 Apps Node, 2 Database nodes (Exadata)
Remote Standby: 1 Apps Nodes, 2 Database nodes (Exadata)

What we needed to do was; migrating the DB nodes of PROD, to Local Standby.
In order to do this; we followed the action plan below;

Note: actually we did much more than this, but this action plan should give you the idea :) 
  • stopped OAM+OID+EBSAccessGate + Webgate etc..
  • stopped EBS apps services which were running on both of the Prod Apps nodes.
  • Switched over the EBS Prod database to be primary in Local Standby.
  • Reconfigured local standby to be the new primary and configured it as the primary for the remote standby as well.
  • After switching the database over the standby site; we cleaned up the apps specific conf which was stored in the database (fnd_conc_clone.setup_clean)
  • We built context files (adbldxml.pl) and executed autoconfig on the new db nodes. 
  • Once db nodes were configured properly; we manually edited the apps tier context files and executed autoconfig on each of the apps tier nodes. (note that ; apps services were not migrated to any other servers)
  • We started the apps tier services.
  • We reconfigured the workflow mailer (its configuration was overwritten by autoconfig)
  • We logged in locally (without OAM) , checked the OAF , Forms and concurrent managers.
  • Everything was running except the concurrent manager which were configured to run in the second apps node. No matter what we did from the command line and from the concurrent manager administration screens, we couldn't fix it.. There was nothing written in the internal manager log, but the concurrent managers of node 2 could not be started.. 
    • The first fact : If you have a multi node EBS apps tier, AutoConfig has to be run '2n - 1' times. In other words;   for an application which has 'n' number of application nodes, AutoConfig has to be run '2n - 1' times so that the tnsnames.ora file on each node has FNDSM entries for all the other nodes. So, as for the solution, we executed autoconfig once more in the second node, and the problem dissapeared.
Reference: AutoConfig Does Not Populate tnsnames.ora With FNDSM Entries For All The Nodes In A Multi-Node Environment (Doc ID 1358073.1)
  • After fixing the conc managers, we continued with the OAM and OID.. We changed the datasource of the SSO (in weblogic) with the new db url and also changed dbc file there.. Then, we started Access Gate, Webgate, OAM and OID and checked the EBS login by using SSO-enabled url. But, the login was throwing http 404..
  • All the components of SSO (OAM,OID and everyting) was running.. But only the deployment named ebsauth_prod was stopped and it could not be started ( it was getting errors)
    • The second fact : if you changed the host of the EBS database and if your APPL_SERVER_ID was changed, then you need to redeploy the ebsauth by modifying it Plan.xml with the new APPL_SERVER_ID. Actually you have 2 choices; 1) Set the app_APPL_SERVER_ID to a valid value in the Plan.xml file for the AccessGate deployment and then restart the EAG Servers. The Plan.xml file location is specified on the Overview Tab for the AccessGate Deployment within the Weblogic Console where AccessGate is deployed. 2) Undeploy and redeploy AccessGate. 
Reference: EBS Users Unable To Sign In using SSO After Upgrading To EBSAccessGate 1234 With OAM 11GR2PS2 (Doc ID 2013855.1)
  • Well. after this move, SSO-enabled EBS login started to work as well. The operation was completed , and we deserved a good night sleep :)

Saturday, August 4, 2018

Oracle VM Server -- Guest VM in blocked state, VM console connection(VNC), Linux boot (init=/bin/bash)

This is an interesting one.. It involves an interesting way of booting Linux, dealing with Oracle Vm Server and its Hypervisor.

Last week, after a power failure in a critical datacenter, one of the production EBS application nodes couldn't be started.. That EBS application node was a VM running on a Oracle VM Server, and although Oracle VM Server could be started without any problem, that application node couldn't.

As, I like to administrate Oracle VM Server using xm commands, I directly jumped into the Oracle VM Server by connecting it using ssh (as root).

The repositories were there.. They were all accessable and xm list command were showing that EBS node, but its state was "b".. (blocked)

I did restart the EBS Guest VM couple of times, but it didn't help.. The EBS Guest VM was going into the blocked state just after starting.

Customer was afraid of it, as the status "blocked" didn't sound good...

However; the fact was that, it was normal for a Guest VM to be in blocked status if it doesn't do anything or let's say if it has nothing actively running on CPU.

This fact made me think that there should be problem during the boot process of this EBS Guest VM.

The OS installed on this VM was Oracle Linux, and I thought that, probably, Oracle Linux wasn't doing anything during its boot process.. Maybe it was asking something during its boot, or maybe it was waiting for an input..

In order to understand that, we needed to have a console connection to this EBS Guest VM..

To have a console connection, I modified the vm.cfg of this EBS Guest VM -- actually added VNC specific parameters to it.

Note that, in Oracle VM Server we can use VNC to connect to the Guest machines even during their boot process.

After modifying the vm.cfg file of the EBS Guest VM, I restarted this guest machine using xm commands and directly connnected to its console using VNC.

I started to watch the Linux boot process of this EBS Guest VM and then I saw it stopped..
It stopped, because it was reporting a filesystem corruption and asking us to run fsck manually..

So far so good.. It was as I expected..

The Oracle Linux normally was asking for the root password to be able to give us a terminal for running fsck manually. However; we just couldn't get the password.

So we were stuck..

We tried to ignore the fsck message of Oracle Linux, but then it couldn't boot..

We needed find a way.

At that time, I put my Linux admin hat on , and did the following;

During the boot, I opened the GRUB(GRand Unified Bootloader) menu. (bootloader)
Selected the appropriate boot entry (uek kernel in our case) in the GRUB menu and pressed e to edit the line.
Selected the kernel line and pressed e again to edit it.
Appended init=/bin/bash at the end of line.
Booted it.

By using the init=/bin/bash, I basically told the Linux kernel to run /bin/bash as init, rather than the system init.

As you may guess, by using init=/bin/bash, I booted the Linux and obtained a terminal without supplying the root password.

After this point, running fsck was a piece of cake :)

So I executed fsck for the root filesystem and actually for the other ones also.. Repaired all of them and rebooted the Linux once again..

This time, Linux OS of that virtualized EBS application node booted perfectly and the EBS application services on it could be started without any problems..

It was a stressful work but it made me have this interesting story :)