Monday, September 29, 2014

EBS,SSO -- enabling SSL on SSO due to security vulnerability

Some day or another, your security team may indicate a vulnerability in EBS login..
They may record this vulnerability according to security findings, especially when they listen the client machine and get the password used while the client user logging with his/her credentials..
That is ; an attacker who traces the client machine, could obtain the user and password information supplied by the users during EBS logins through the web browser.
If EBS is not a SSO integrated EBS, this should be considered as an EBS security vulnerability, and can be fixed using EBS SSL implementation, as I documented in the following link;
http://ermanarslan.blogspot.com.tr/2014/07/enabling-ssl-in-oracle-e-business-suite.html

On the other hand; If  a SSO server (Oracle SSO) is integrated to EBS, this should be considered as a SSO security vulnerability, and can be fixed using SSO SSL implementation and this will be the subject of this post..

Okay, when you integrate SSO to your EBS system, your login page comes from the SSO server. SSO servers the login..
So , actually the client supplies the user and/pass to the SSO system ..
That's why , the traffic between the client and the SSO server should be encrypted  to encyrpt the user/pass information transferred from client to server.. Ofcourse, to be in the safe side -> SSL should be implemented in EBS and OID , too.. However; these kind of operations are not in the scope of this post.

Following diagram represent an EBS R12 environment , which utilizes integrated SSO 10g and OID 11g products.

So , as you see above, SSO serves the login page and user/pass info flows unencyrpted through the line.
In order to fix this; it is obvious that, SSO server should be configured to work with SSL.

To implement SSL in SSO 10g, we can follow the action plan below;

1)
Create a wallet , export certificate request and send them to the security admins who are responsible for producing the corresponding certificate files. Once you got the certificate files, import them to your wallet.
Note that: you need to use the owm binary located in your SSO 10G Oracle Home.. So set your path accordingly.. If you use owm binary located in another Oracle Home ,for example: OID 11g Home(suppose you have OID 11g and SSO 10G running on the same server), you ll encounter problems..

2)
In SSO Server;
In opmn.xml, change the value for the start-mode parameter to ssl-enabled.

<ias-component id="HTTP_Server">
<process-type id="HTTP_Server" module-id="OHS">
<module-data>
<category id="start-parameters">
<data id="start-mode" value="ssl-enabled"/>
</category>
</module-data>
<process-set id="HTTP_Server" numprocs="1"/>
</process-type>
</ias-component>

3)
In SSO Server;
Note that: If you are using Distributed Configuration Management; update your distributed cluster management database using -> $ORACLE_HOME/dcm/bin/dcmctl updateconfig -ct opmn

Reload the modified opmn configuration file:
$ORACLE_HOME/opmn/bin/opmnctl reload

4) 
In SSO Server;
Restart Oracle HTTP Server.

$ORACLE_HOME/opmn/bin/opmnctl stopproc process-type=HTTP_Server
$ORACLE_HOME/opmn/bin/opmnctl startproc process-type=HTTP_Server

5)
In SSO Server;
Change single server urls using ssocfg.sh
$ORACLE_HOME/sso/bin/ssocfg.sh protocol host ssl_port
For ex: ssocfg.sh https ermanhost.domain 4443

Check, OID odsm;open cn=OracleContext>cn=Products>cn=DA and select cn=OperationURLs, The right hand pane will display a number of fields or attributes. The last field will be orcldasurlbase and will
should be set to sso server and its https port.
Clear the cache of OID, just in case;

Clear oid cache by connecting to OID database using sqlplus.. (with orasso user) and executing the following;
sqlplus orasso/(orasso_password) --> find it first
wwsec_oid.refresh_local_cache(true);
commit;

-To find Orasso Password:

echo "enter username :\c "
read username
echo "enter orcladmin password :\c "
read orclpass
echo "enter oid port :\c "
read oid_port
host=`uname -n`
ldapsearch -D cn=orcladmin -w $orclpass -p $oid_port -h $host \
-b "cn=IAS,cn=Products,cn=OracleContext" \
-s sub -v OrclresourceName=$username | grep orclpasswordattribute |cut -d= -f


6)
In SSO Server;
update $ORACLE_HOME/sysman/emd/targets.xml file.

Update the three attributes this way:
<Property NAME="HTTPMachine" VALUE="ermanhost.domain"/>
<Property NAME="HTTPPort" VALUE="4443"/>
<Property NAME="HTTPProtocol" VALUE="HTTPS"/>

and reload the conf.. -> $ORACLE_HOME/bin/emctl reload

7)
In SSO Server;
Restart your HTTP Server and OC4J Security processes..
$ORACLE_HOME/opmn/bin/opmnctl restartproc process-type=HTTP_Server
$ORACLE_HOME/opmn/bin/opmnctl restartproc process-type=OC4J_SECURITY

8) 
From the client's browser; 
Check and see that you can reach the SSO server using https.. https://host:ssl_port.
Once you are able to reach the SSO Server  using https without any certificate errors or warnings, continue with the next step

9) 
In EBS Server; 
Deregister and register your EBS;
“txkrun.pl -script=SetSSOReg -deregister=Yes”
“txkrun.pl -script=SetSSOReg -removereferences=Yes”
Scripts above, will  deregister SSO & OID from Apps & 10g AS Infrastructure Tier. 

$FND_TOP/bin/txkrun.pl -script=SetSSOReg , and this script will register SSO again.
That 's it.. Now check you EBS.. Your login page should be served by SSO via https..

Sunday, September 28, 2014

EBS 12.2 -- Create Accounting --FAH-- performance problem // real life example

In 12.2 environments, watch out the EBS Create Accounting Program!
Accounting Program/Creating accounting may not end at all.. Due to missing statistics, or because of a Database Layer bug, you may find yourself waiting for couple of days just to see the completion of a daily account process.
Especially, if you have FAH(Financial Account Hub) implementation in your project, this may start a chaos in the customer site, and make you closer to the deadline..

For these type of accounting issues, I strongly recommend you to read the following Oracle Support Doc, it has several recommendations about increasing the thoughput of the process ..

R12 SLA/FAH: How to Improve Performance in Subledger Accounting & Financials Accounting Hub (Doc ID 791049.1)


However, the most dramatic performance increase will be after applying the 11.2.0.3 database patch 14013094 and gathering the full stats using fnd_stats(or via Gather Schema Stats concurrent program)..

Lets have a quick look to the issue;

When the customer will report the performance problem, you will find at least 2 concurrent running;
One of them is parent and the other one is child..

Parent Concurrent runs the following code, waits in a loop for the Child to finish its execution;

DECLARE
l_flag BOOLEAN;
BEGIN
l_flag := XLA_CREATE_ACCT_RPT_PVT.BeforeReport;
IF (l_flag)
THEN
:XDO_OUT_PARAMETER := 1;
END IF;
END;


Child Concurrent runs the following insert and will never be able to complete it.. Thus this insert becomes the actual problem..
INSERT INTO XLA_GLT_69009 (status,
                           ledger_id,
                           user_je_source_name,
                           user_je_category_name,
                           accounting_date,
                           currency_code,
                           date_created,
                           created_by,
                           actual_flag,
                           budget_version_id,
                           encumbrance_type_id,
                           code_combination_id,
                           stat_amount,
                           entered_dr,
                           entered_cr,
                           accounted_dr,
                           accounted_cr,
                           reference1,
                           reference4,
                           reference5,
                           reference10,
                           reference11,
                           subledger_doc_sequence_id,
                           subledger_doc_sequence_value,
                           gl_sl_link_table,
                           gl_sl_link_id,
                           request_id,
                           ussgl_transaction_code,
                           je_header_id,
                           GROUP_ID,
                           period_name,
                           jgzz_recon_ref,
                           reference_date,
                           funds_reserved_flag,
                           reference25,
                           reference26,
                           reference27,
                           reference28,
                           reference29,
                           reference30)
   SELECT /*+ ordered index(aeh,xla_ae_headers_n1) use_nl(jc,led,ael,gps) */
         'NEW',
          aeh.ledger_id,
          :1,
          DECODE (:2, 'Y', jc.je_category_key, jc.user_je_category_name),
          DECODE (:3,
                  'P', gps.end_date,
                  'F', gps.end_date,
                  aeh.accounting_date)      --16967056 groupd by gl period fix
                                      ,
          DECODE (aeh.balance_type_code,
                  'E', led.currency_code,
                  ael.currency_code),
          SYSDATE,
          :4,
          aeh.balance_type_code,
          aeh.budget_version_id,
          ael.encumbrance_type_id                                   -- 4458381
                                 ,
          ael.code_combination_id,
          ael.statistical_amount,
          DECODE (aeh.balance_type_code,
                  'E', ael.accounted_dr,
                  ael.entered_dr)                                   -- 4458381
                                 ,
          DECODE (aeh.balance_type_code,
                  'E', ael.accounted_cr,
                  ael.entered_cr)                                   -- 4458381
                                 ,
          accounted_dr,
          accounted_cr,
          :5                                                     -- Reference1
            ,
             DECODE (reference_date,
                     NULL, NULL,
                     TO_CHAR (reference_date, 'DD-MON-YYYY'))
          || DECODE (:6,
                     'A', TO_CHAR (aeh.accounting_date, 'DD-MON-YYYY'),
                     'P', aeh.period_name,
                     'D', aeh.ae_header_id,
                     'E', TO_CHAR (aeh.accounting_date, 'DD-MON-YYYY') -- added E/F lookup code for bug8681466
                                                                      ,
                     'F', aeh.period_name)                        --Reference4
                                          ,
          DECODE (:7, 'D', SUBSTRB (aeh.description, 1, 240), NULL),
          DECODE (
                DECODE (:8,  'D', 'D',  'E', 'D',  'F', 'D',  'S')
             || ael.gl_transfer_mode_code --added bug 8846459 to show line description
                                         ,
             'SS', NULL,
             SUBSTRB (ael.description, 1, 240)),
          DECODE (:9 || ael.gl_transfer_mode_code,
                  'AS', jgzz_recon_ref,
                  'PS', jgzz_recon_ref,
                  aeh.ae_header_id || '-' || ael.ae_line_num)   -- Reference11
                                                             ,
          aeh.doc_sequence_id,
          aeh.doc_sequence_value,
          ael.gl_sl_link_table,
          ael.gl_sl_link_id,
          :10,
          ael.ussgl_transaction_code,
          aeh.ae_header_id,
          :11,
          aeh.period_name,
          ael.jgzz_recon_ref,
          aeh.reference_date,
          DECODE (
             led.enable_budgetary_control_flag,
             'Y', DECODE (aeh.funds_status_code,
                          'A', 'Y',
                          'S', 'Y',
                          'P', 'Y',
                          NULL),
             'Y'),
          aeh.entity_id,
          aeh.event_id,
          ael.ae_header_id,
          ael.ae_line_num,
          ael.accounted_dr,
          ael.accounted_cr
     FROM xla_ae_headers aeh,
          xla_ae_lines ael,
          gl_je_categories jc,
          gl_period_statuses gps,
          gl_ledgers led
    WHERE     ael.application_id = aeh.application_id
          AND ael.ae_header_id = aeh.ae_header_id
          AND aeh.GROUP_ID = :12
          AND aeh.application_id = :13                               --4769315
          AND aeh.je_category_name = jc.je_category_name
          AND gps.application_id = 101
          AND gps.ledger_id = aeh.ledger_id
          AND led.ledger_id = gps.ledger_id
          AND aeh.period_name = gps.period_name
          AND aeh.gl_transfer_status_code = 'S'

After making the analysis, it can be seen that Oracle have put index hints for the query, but when you look to the execution plan, you will see a Full Table Scan towards to the XLA_AE_LINES table.. 
That is , the execution plan is wrong..

For the solution, you need to apply the database patch 14013094 with all its post installation instructions , and you need to gather stats for All Schemas , not only for XLA schema..
Note that : XLA schema(Subledger Accounting) is where EBS writes its data before GL..

After taking the actions above; when you check the execution plan of that particular insert and you 'll see that now it is not making any Full Table Scans.. 
The cost of the query may increase, but you dont need to mind it.. 
It is because the old execution plan is a wrong execution plan..

After taking the action ; your exection plan will be something like this and it is acceptable..


Okay... while we are reaching to the end of this post,  I 'm happy to say that, these actions made a problematic accounting process (including 24894 events and 68414 lines) to complete in 7 mins.. 
Note that , this was implemented on an EBS 12.2 customer environment with the FAH implementations..

Hope you will find this helpful.

Thursday, September 25, 2014

Linux -- rescan scsi device or rescan scsi host --sysfs

/sys directory is like a virtual filesystem which contains information and stats about the devices in the system.. Furthermore; It gives us the opportunity in a way to control the devices connected to our system.
It is also known as sysfs, and provides a set of virtual files by exporting information about various kernel subsystems, hardware devices and associated device drivers from the kernel's device model to user space. In addition to providing information about various devices and kernel subsystems, exported virtual files are also used for their configuring. 

You may ask, what is the /dev filesystem then? 
/dev contains the real device files, on the other hand; /sysfs contains info and control mechanisms which are exported by the kernel during the runtime.

Okay, I will not go in details because these are deeper issue..
After giving the general information about the /sys directory , lets jump to the scsi_host and scsi_device directories located in this sysfs..  

There are directories to hold the files related with the devices of all kinds in the /sys/class directory.
scsi_host and scsi_device directories are one of these directories, as well.. They both reside the /sys/class directory. 
The files in the scsi_host directory are like the interfaces to the scsi bus and the files in the scsi_device directory can be used to control the scsi devices connected to the system.

In this article , I will focus on the files named rescan  , located in these directories.. 
Both scsi_device and scsi_host directories contains rescan files in their subdirectories.. 
As you expected, rescan file in the scsi_device is there for rescanning the associated scsi device, and the rescan file located in scsi_host directory is used for rescanning the scsi bus .. 

I 'll keep in short and explain scsi scan activities by giving 2 example scenarios.. When to use?  and what to use? according to the scenarios..
  • Suppose you have added a disk to your VM .. Then you have to rescan the scsi bus.. The driver should scan for the newly added device and notify the SCSI midlayer.. 
So, you must issue the following command;
echo "- - -" > /sys/class/scsi_host/host0/scan   (host may be changed according to the environment)
- - - signs are wildcard for the channel, SCSI target ID, and LUN..  It is like described in the /proc/scsi/scsi file..
With "- - -" ; we say -> scan everyting..
After this operation use fdisk -l .. You will see your new device listed in the output..
  • Suppose you have extended the size of a disk which was already connected to your VM .. 
Then you have to rescan the associated scsi device, as follows;

echo 1 > /sys/class/scsi_device/2\:0\:2\:0/device/rescan
After this operation use fdisk -l .. You will see your device listed with its new size..

As you see above, you can control the devices using /sys filesystem. You can reach the exported options and modify them while the system is running.. Thus, you can manage the devices and make them actually do something you want..
In addition, as explained in the above scenarios ; by rescanning the scsi devices or scsi hosts, you ll not have to reboot your server everytime when a new disk is added or a when a disk is modified.. 
That is ; your system will recognize the change without a need for reboot and these scenarios are valid for both Physical and Virtual servers..

Tuesday, September 23, 2014

EBS 12.2-- Output Post Processor -- java.lang.OutOfMemoryError

When we install and configure EBS 12.2, Output Post Processor(OPP) comes enabled by default, .
In generally ; we need at least one OPP process up&running in our EBS projects/environments..
Without OPP; certain concurrent programs, which need post processing, would complete with errors or their outputs would be not as what want them to be..
OPP handles the post processing actions of the concurrent programs.
As an example: when a request is submitted with an XML Publisher template specified as a layout for its output, the corresponding concurrent manager will utilize OPP to apply XML publisher template for creating the desired output.
What actually happens is -> at runtime the concurrent manager calls output post processor , and XML publisher is called . XML publisher puts the data that your program provides , into the template files. Thus you ll have rich documents/outputs in PDF, HTML, RTF, EXCEL (HTML) format, or even text format for use with EFT and EDI transmissions. 
So, while running a concurrent program that produces XML outputs to utilize the XML publisher, we ll need an Output Post Processor(OPP) to be up&running. 
So far so good.. Okay, without going into more detail, lets come back to our topic. 

As in my previous posts, the problem & fix which you will read in the paragraph below, is based on a true story :)

Last week , a concurrent program which produces XML outputs have started to get errors. The program was running without any problems till last week, but it suddenly started to complete with warnings.
when I analyze the logfile of the problematic program, I saw some error lines.. The most important clue was the line -> "One or more post-processing actions failed. Consult the OPP service log for details."
Just after seeing this line, it is needles to say that I analyzed the Output Post Processor log file.
OPP services was running as a concurrent manager with 4 processes and I could reach its log file using Concurrent Manager > Manage -forms screen.

The problem was there.. " java.lang.OutOfMemoryError"
So It was obvious that the OPP service was encountering a memory leak.

Then I used the following query to see the current memory settings of the OPP service;

SELECT service_id, service_handle, developer_parameters
FROM fnd_cp_services
WHERE service_id = (SELECT manager_type
FROM fnd_concurrent_queues
WHERE concurrent_queue_name = 'FNDCPOPP');

It was 512 MB.. The concurrent program was heavy and must be producing a big data or making heavy processing.. With this in mind, I updated the memory limits of OPP Services as follows, and then restart the concurrent managers.

UPDATE fnd_cp_services
SET developer_parameters =
'J:oracle.apps.fnd.cp.gsf.GSMServiceController:-mx1024m'
WHERE service_id = (SELECT manager_type
FROM fnd_concurrent_queues
WHERE concurrent_queue_name = 'FNDCPOPP');

After modifiying the memory setting and restarting the concurrent managers, the problem dissapeared as expected.

OVS-- Oracle VM Server 3.3.1 installation process --things to consider

I implemented the new release OVS 3.3.1 in one of my clients..

The need of the OVS arose by the the license aggrements..
The underlying machine was a powerful one, with several CPUs on it, on the other hand; the license of the customer was just for few CPUs and this made us use the OVM and  to meet the license constraints.. We have used a virtual machine created from OVM to restrict the cpu usage of this physical machine..

This Oracle Virtual Server implementation was my third Oracle Virtual Server implementation, and again I needed the overcome some boring challenges..
Actually, after your second implementation , you become familiar with the product and difficulties which may arise during the way... However; these implementations are so rare, so the actions for overcoming these difficulties are usually forgotten, and they cause distress during the installations..
That's why, I m here today, and I m writing down these difficulties ...

Firstly, I must say that the installation of OVS 3.3.1 and OVM 3.3.1 are similar to the OVS 3.2.8 and OVM 3.2.8 installations, which I have document in one of previous posts.
http://ermanarslan.blogspot.com.tr/2014/05/ovm-oracle-vm-server-328-installation.html

The difficulties actually depends on your installation type.. Will you import an ova template or will you use Oracle VM Manager which is a license-free management console? Will you use a NFS share for your repository, or do you have enough physical disks to place your repository?

All these implementations types brings their different actions plans to follow, and all have their own little difficulties during the way.

Okay. I will keep it short and make a list of the difficulties you may face as follows;


  • After the installation, vm machines can not be created manually.. In OVS 3.3.1 , we need VM Manager to do this kind activities. On the other hand, you can use the "xm commands" to import templates.
  • VM Manager must be installed to manage the VM environment effectively.. To install VM manager, we need to use a dedicated server.. Actually, it can be installed in to the dom0, but I dont recommend it. It is also tricky. The problem is, the OVS is there for the consolidation, and customers dont want to dedicate a machine for its Management console (VM Manager) To overcome this problem, you may import a Linux template using "xm commands" and use that template to place your vm manager..
  • VM Manager wants Redhat Release 5.5 to be written in /etc/redhat-release.. So before the installation you may want to use the following;
  • cp -rp /etc/redhat-release ~/redhat-release.orig echo "Red Hat Enterprise Linux Server release 5.5 (Tikanga)" > /etc/redhat-release
  •  VM Manager wants you to present your template files from an ftp or http server. So keep that in mind , you ll need an ftp or http server to serve your template files for the use of VM Manager..
  • VM Manager wants you to present your .iso files from an ftp or http server.. So , in order to make a new OS installation, you need to put your .iso files into a http or ftp server, which is accessible from the VM Manager host.
  • After you will install OVS and install&configure VM Manager , you may realize you may not connect to VM Console using VM Manager.. (show console, show serial console).. In order to fix this, download ovm-console-1.0.0-2.i386.rpm , or ovm-console-1.0.0-2.x86_64.rpm according to your Arch, and install them.
  • To enable the console button in Oracle VM Manager, the following rpms need to
    be installed depending on which browser you use to access Oracle VM Manager:
    The rpms are located at http://oss.oracle.com/oraclevm/manager/RPMS
    If you use a Mozilla/Firefox browser on Linux, download the ovm-console rpm and
    install it on the machine where your browser is running:
    rpm -i ovm-console-1.0.0-2.i386.rpm (if on an x86 browser)
    or rpm -i ovm-console-1.0.0-2.x86_64.rpm (if on an x86-64 browser)
    This installs an ovm-console plugin into your browser.
    If you are not using a Mozilla browser or are not using a browser from Linux
    (e.g. you are using a browser on Windows), download and install the
    tightvnc-java rpm on the host running Oracle VM Manager:
    rpm -i tightvnc-java-1.3.9-3.noarch.rpm
    Please install the tightvnc-java rpm *after* installing Oracle VM Manager.
    src rpms for these packages are located at:
    http://oss.oracle.com/oraclevm/manager/SRPMS
  • If you install a new machine/import a template manually.. I mean copy , unzip & and use xm create, then you will end up an orphan server discovered in your VM manager console, and it seems you will need to live with that.
  • If the server dedicated for OVS install does not have a CD/DVD Rom, then you need to boot the server with usb. and it as little challenge.. I mean To format the usb and import the iso into it, you have to use the program called unetbootin and alsoyou need to create a directory in the usb and put the iso file in to that directory.. This was the most important trick of the installation, as Oracle VM server's installer cant find the needed drivers without that directory.. During the installation, when the installer will ask the drivers, you may choose the hard drive and then choose the usb drive and point that directory for the requested drivers..
  • Creating the repository is another challenge, I mean -- if you have only one disk.. (For example: one raid 5 consists of 5 disks)  "Oracle VM doesnt accept a partition.."It wants to have a completely free disk to create its repository.. In short  it want /dev/sdb , not /dev/sdb1 :) --as an example
  • Importing the assembly, creating template from it, creating a VM from that template ... These are all time consuming activities.. So keep that in mind, too.
  • Even if you dont need clustering, and even if you dont choose clustering, you need to specify an Virtual IP address for the server pool.
After all, once you deploy and configure Oracle VM Server properly , it is stable.. It satisfies the need as a virtualized platform, and VM Manager is a featured tool for managing it. 

Wednesday, September 17, 2014

Exadata-- TNS-12170 error diagnostics -- a detailed approach

It has been a while since my article about Exadata.
This will be about an ORA-12170 error diagnostics in Exadata.. I find it quite interesting, as the cause of the problem mentioned above, was an unexpected one..
This post will be like short story , but I will try to give some detailed information during the transitions as well.
Anyways.. The story is based on a problem reported by one of my customers..
Customer has an quarter-rack Exadata X3 machine, configured to host an Oracle EBS 11i Production database..
We have configured this Exadata Machine and have implemented EBS on it.  We migrated the EBS database from Solaris to this Exadata machine, and migrated the EBS Apps Tier from Solaris to a new Oracle Linux server, as well.
It was successfull project , as we did hard and complicated works during the migration.(the source systems were Solaris, and database was 10gR2 and so on.)
As you may guess , the customer was also using Discoverer for reporting purposes.
Discoverer Desktop edition have been running all over the place , and no issues were reported about this product till yesterday.

The problematic situation has arisen when a bunch of the employees tried to connect to the mentioned EBS database located in Exadata..
They were using Discoverer tool to connect to the database, and encountering ORA-12170/TNS-12170 errors during their connection attempts..

I was already familiar with TNS-12170 error , and knew it was related with the client-server network of the corresponding database environment..
Even so, I made a quick search about this error, and found the following;

Cause: The server shut down because connection establishment with a client failed to complete within the allotted time interval. This may be a result of network/system delays; or this may indicate that a malicious client is trying to cause a Denial of Service attack on the server. 


Action: If the error occurred because of a slow network/system, reconfigure the SQLNET.INBOUND_CONNECT_TIMEOUT parameter in sqlnet.ora to a larger value. If a malicious client is suspected, use the address in sqlnet.log to identify the source and restrict access. Note that logged addresses may not be reliable as they can be forged (e.g. in TCP/IP). 

So, as you see above, the search have provided the familiar information about the error, and a generic action plan that can be checked and implemented accordingly..

So far so good...
On the other hand, this was a quite new Exadata... It was configured by the Professionals and EBS was running without a problem.. In fact, all the other Discoverer Clients were connecting to the database without any problems.. 
These facts made me think that, there should have been a specific network problem between these problematic clients and the Exadata Database Machine ..

Here is what I did for the analysis;

First, checked the firewall in the db nodes and did not see any deny-rule there.
Then thought that, using wireshark for capturing the client-server network communication to see the break might be useful. But I quickly give up that thought when I saw the infiband network device configuration..
The infiniband network was as follows;

DO NOT REMOVE THESE LINES ####
%GENERATED BY CELL% ####
DEVICE=bondib0
USERCTL=no
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.11.2
NETMASK=255.255.252.0
NETWORK=192.168.7.0
BROADCAST=192.168.11.255
BONDING_OPTS="mode=active-backup miimon=100 downdelay=5000 updelay=5000 num_grat_arp=100"
IPV6INIT=no
MTU=65520

So infiband/private interconnect was on network 192.168.7.0...
Then I realized that these problematic clients were on the same network .. I mean, they were using the ip addresses from 192.168.7.200 to 19.2.168.7.220..

Afterwards; I took a look to the route configuration of the db nodes in this Exadata Machine.

route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
10.10.10.0 * 255.255.255.0 U 0 0 0 eth0
192.168.0.0 * 255.255.252.0 U 0 0 0 bondeth0
192.168.7.0 * 255.255.252.0 U 0 0 0 bondib0
169.254.0.0 * 255.255.0.0 U 0 0 0 bondib0
default 192.168.0.3 0.0.0.0 UG 0 0 0 bondeth0

And yes! The cause was there.. The route declaration for all the ip address in the 192.168.7.0  network was directed from the bondib0 device..
Bondib is the bond device of internal network in Exadata..
So, it was normal for these clients to not to be able to connect to the databases on this Exadata, as Infiniband network is an internal network...


It is needless to say that there was a bad luck or misconfiguration in the network layer of this environment ..
I mean that the infiniband network should have been never used for clients, so bondib which is a private network should have been non routable and unique..

On the other hand, as I have faced this situation, I needed to solve:)

I knew that the network is derived from the subnet mask.

For example: If you want to have an private interconnect ip addres as 192.168.10.1 and if you want to have a subnet mask 255.255.252.0 , you get 192.168.8.0 as your network ,and this effects your route configurations..

Exactly as follows;

010 = 00001010
252 = 11111100
--------------------- AND
8 =   00001000  --> so the network is 192.168.8.0


Related MOS notes:
The Basics of IPv4 Subnet and Oracle Clusterware ( Doc ID 1386709.1 ) 
How to calculate the correct subnet for an interface ( Doc ID 1059759.1 ) 
How to Validate Network and Name Resolution Setup for the Clusterware and RAC ( Doc ID 1054902.1 ) 

So, in order to fix this; there were 3 ways/methods to prefer;

Methods:

1) Changing Exadata's infiniband network
2) Changing the network of these problematic clients.
3) Adding the routes to the Exadata Db nodes for overwriting the general route configuration for accessing those clients.

Considerations:

1) Changing Exadata infiniband network was a risky thing, and it wasnt worth changing it to make 15-20 discoverer clients happy. 
Additional info: 
The steps are mentioned in the Oracle Exadata Database MachineOwner's Guide 11g Release 2 Page 7-60
Also in  Changing IP addresses on Exadata Database Machine ( Doc ID 1317159.1 ) 

2) In my opinion , it was the best way, but the customer didnt prefer this one.

***3) "Seemed like the the second best way in this situation. So , I applied this method as follows.."*****

For adding specific routes , I used route command.
I executed the following commands on db nodes one by one.

route add -host 192.168.7.200 dev bondeth0
route add -host 192.168.7.201 dev bondeth0
route add -host 192.168.7.202 dev bondeth0
route add -host 192.168.7.203 dev bondeth0
route add -host 192.168.7.204 dev bondeth0
route add -host 192.168.7.205 dev bondeth0
route add -host 192.168.7.206 dev bondeth0
route add -host 192.168.7.207 dev bondeth0
route add -host 192.168.7.208 dev bondeth0
route add -host 192.168.7.209 dev bondeth0
route add -host 192.168.7.210 dev bondeth0
route add -host 192.168.7.211 dev bondeth0
....
Above ip addresses were the ip addresses of the problematic Discoverer clients.

Also, to make these routes permenant; I have added these scripts in to the /etc/rc.local files of the Db nodes as follows;

#BEGIN DO NOT REMOVE Added by Oracle Exadata 
if [ -x /etc/rc.d/rc.Oracle.Exadata ]; then
. /etc/rc.d/rc.Oracle.Exadata
fi
#END DO NOT REMOVE Added by Oracle Exadata ###########

#DO NOT REMOVE Added by Erman  :)##
#ADD ROUTE FOR DISCOVERER PROBLEM##
route add -host 192.168.7.100 dev bondeth0
route add -host 192.168.7.101 dev bondeth0
route add -host 192.168.7.102 dev bondeth0
route add -host 192.168.7.103 dev bondeth0
route add -host 192.168.7.104 dev bondeth0
route add -host 192.168.7.105 dev bondeth0
route add -host 192.168.7.106 dev bondeth0
route add -host 192.168.7.107 dev bondeth0
route add -host 192.168.7.108 dev bondeth0
route add -host 192.168.7.109 dev bondeth0
route add -host 192.168.7.110 dev bondeth0
route add -host 192.168.7.111 dev bondeth0
###########################################

I had to add these routes in rc.local because the route -p option is not there in Linux.. -p option is only available in windows. :)

So that was it . TNS-12170 errors have dissapeared immediately after adding those routes..

Wednesday, September 10, 2014

Linux -- tune2fs, adjusting reserved blocks percentage

Normally, In Linux systems , %5 percentage of each filesystem is reserved for the root user , by default.
That is; if you have a mount with 100gb in size, you can fill it up to 95gb.. When the filesystem contains 95gb , you will see it as "%100 used" in df command.
Furthermore, if you have a database running on this filesystem, it will hang when it reaches the 95gb , because it will not able to extend its datafiles, or create its logfiles..
Other non-root programs also will face with troubles, as when they will not able to save their data, and typically quit or crash. 
In such scenarios, we can say that ; this is just because of the reserved block percentage, which is a reserve that can only be used by the root user..
Okay good, it s an issurance for the general system health.. But how much do we need for root files, right?
The reservation is useful for the system services and other programs, which run under the rights of root and need to write to their log files, or save their data in /var for example.
So the reservation is useful for the general system health.

On the other hand; when your filesystem is big, then %5 of it is also big and unnecessary most of the times..

The good news is that, this reserved space can be changed using the -m option of the tune2fs command..
This can be changed on the fly..

Following command can be executed with root..
It sets the reserved space percentage to be %2..

tune2fs -m 2 /dev/sdb1

Following is another command which sets the reserved space percentage to be %0.
tune2fs -r 0 /dev/sda1

When you issue these commands, you will directly see an increase in the free percentage of the relevant filesystem. 
Just execute df and see the difference.

A reliable place to use these commands would be a filesystem that contains only the files of Oracle RDBMS. 
As this kind of filesystems are only used by the Oracle Software Owner Os account, there is no need to reserve space for root user on them.. (the root user will not craete any files on it..)

As you can imagine, tune2fs can be a life saver, as it did saved one of the suddenly growing standby databases of my customers..

Monday, September 8, 2014

RDBMS-- LGWR Waits & Advantage of EXADATA in Log Writer IO

This post will be about Log Writer and the waits our sessions encounter while waiting the Log Writer to do its work..



LGWR writes commit records and writes enough information to recover transactions.
LGWR performance may impact negatively on OLTP response times.
For instance; in case of a commit operation, that is when  a session wants to commit, it posts the LGWR to write the redo information of its transactions and commit record, as well.
In this state, the session waits the LGWR and its wait event will be Log file Sync.. The session waits in this wait while LGWR is writing the redo information in to the redologs..
In details we can say that;

When a user session commits or rolls back , Oracle/LGWR needs to flush the contents of log buffer to the Redolog files. So, Log file sync occcurs in these operations.
When we wait for log file sync, we actually wait for; LGWR to wake up if idle, gather the redo to be written, issue IO (this IO is a log file parallel write, can be monitored using log file parallel wait event), process after the IO and acknowledge user session/foreground..
In addition to that, we wait for user session/foreground to wake up lastly.



So , in a matter of fact, the possible bottleneck of the log file sync seems the IO, which is log file parallel write. The possible bottlenecek is IO because , we rarely have insufficient cpu resources, and we suppose that the design of the code that makes LGWR  to wake up,  is optimized ..
Anyways; As a solution to do IO originating LGWR waits; we can always use async commits, but it is not recommended, because in async commit - > we tell the application we have committed, but there is no guarantee, because we dont commit synchronously actually..

Okay, lets try to characterize the causes of Log Writer Waits with and if statements :)

*If avg time per log file sync operation << avg wait time per log file parallel write operation -> IO is bottleneck.
*If the avg wait time per log file sync operation >> avg wait time per log -> CPU bottlenecek , Run queue problem or bug. In such a situation, check CPU consumption.. 

Okay, we said that Log file sync wait occurs if a process waits for LGWR to flush the data from the log buffer to redolog files. That is, the user process commits and waits on this event till LGWR will finish the data to the redolog files, in other words; till LGWR will send a signal indicating that the flush request is finished..

So far so good. Lets examine LGWR behaviour during its lifecycle;
This time bash script type if /end if statements :)

If LGWR is idle, I mean only waiting, it will wait on Rdbms ipc message..
If LGWR is updating headers of the redolog files, you will see log file single write waits..
End if LGWR is writing the redo data from log buffer to redolog group, you will see log file parallel write waits, as this operation can be done in parallel.

So all we see is write,write and write right :)?
Then write performance is very important for LGWR .. Write performance is very important for us too.. It is just because our sessions will wait for LGWR in one way or another..
CPU is very important too, because if our system cant schedule cpu for LGWR , we will wait LGWR to get scheduled actually. ,
On the other hand, cpu problems of LGWR are generally rare and when they appear, they appear to us obviously.
I mean; we can identify a Cpu problem by just using 2 commands in Linux.. 
That's why, I m living Cpu related tuning aside and will write about IO related problems in LGWR.

The next paragraphs will be based on my own practices and Andrew Holdsworth speech in 
Real-World Performance.

As Andrew Holdsworth, Vice President of Oracle Database Real-World Performance at Oracle Corporation,  said "When The thing in the bottom is delayed , everbody gets delayed" , the processes in the backbone of our database is very important for us to work smoothly..

In this context; In order to supply a proper environment for our LGWR to work, we need to be sure our IO Subsystem is well tuned..

Acutally, we need to tune all the IO paths under our Database Layer , but the rest of this post will be about the Storage Layer, as LGWR makes standart IO Not db IO, and we suppose and have seen our Linux/Unix systems are well tuned even when they are freshly installed.. (Filesystem - Storage alignment is another topic. I am living it aside, too)

Like Andrew says, Luns & Volumes are allocated on the basis of space, they are not allocated on the basis of performance. Or we think that they are allocated on basis of performance, but they are acutally not..
I mean , performance does not equal to the disk speed alone.. It is not like that.. We cant say we have a performanceful disk layout, only because we have a lun that is provided from a raid group and from a volume consisting several 15k rpm or SSD disks.

Once in the past, I replied back to one of my customer, who was thinking that its storage is very fast, and their performance problems are caused by the Oracle RDBMS , as follows;

"Your storage can not deal with this heavy IO requests coming from these costly Sql statements"

So , their storage was quite fast on paper , but eveything had a limit ..
So when there were a lot of unplanned IO requests coming to the Storage, the application (Oracle EBS ) was slowing down..
It could be clearly seen  in the AWR report though...
Tablespace IO Stats and File IO stats were showing the wait times in the datafiles..
Also Top 5 timed events section was displaying the IO related waits such as Db File scattered reads and DB File sequential reads..

Okay the real life example above was showing the effect of a slow IO device to Database Performance in general..  This example actually have shown us that although the storage is very fast, it can be slowed down by the large number of IO requests coming from the queries or transactions.. The reason can be write requests or read requests.. It does not matter acutally... At the end of the day, we ll have performance problems..

Lets come back to our topic and see this effect eespecially to LGWR in a real life example..
In one of the Real-World Performance videos of Oracle, Andrew Holdsworth clearly shows this. Okay I will keep in short.
When a lot of systems share the same Storage , we may see a negative performance effect in the LGWR writes, too..
Consider a scenario when a big warehouse and a big EBS database share the same Storage disks.. Or consider an ETL load process started in the warehouse during the peak hours of EBS.. What will happen?  A lot of things may happen actually :)
But in terms of LGWR performance, that 's quite predictable  :) if your storage can not deal with, it ; you may see log file sync waits in both EBS and Warehouse..
The question is ; is your storage capable of responding this kind of multiplexed IO request?
The answer to this question is "It depends actually ".. If you do your sizing in view of this fact, then you are fine.. On the other hand, this is not the case usually. We have seen this may times , that 's why I m writing this post.

One solution to this problem is putting the redologs to their own dedicated disks.. This way, we will be safe.. I mean our dedicated storage will not be affected from IO Storms caused by other databases/or applications which shares the same storage with our LGWR..

In EXADATA, we have a solution which is one step ahead from the solution above..
Withing Exadata, Oracle writes the redo information both to the smart logs which are in Flash devices and storage disks , and accepts the one that comes first..
So, If the disks in Exadata Storage are fast at the time our LGWR makes the write requests, we will have no problems.. If the disks are under pressure, our redo will be written to the flash , and we have no problems again..
In the following video, Andrew Holdsworth shows this in action , you may want to take a look..


Wednesday, September 3, 2014

EBS 12.2 -- TCP Socket (KGAS) wait

TCP Socket (KGAS) waits are very popular in these days.. At least I have seen them around.. Last week, these waits appeared in a Glassfish environment... This week, they appeared in a E-Business Suite 12.2 environment. So now it is worth to write about them :)

First of all, TCP Socket(KGAS) is such a wait  that we cant tune in the Oracle Site.
In spite of the name(socket), the session waiting in this event may not only wait for the data arriving from the socket, but it may wait for the data that will come from the place where the session goes using TCP.(For example: Application Server, Mail Server etc)..  I mean the Application Server performance comes into play.
Note that: The event is not related with client-server architecture of the Oracle database. 
So this means -> when we see this wait event ,we need to check the target Application Server, too.
The most important thing about this event is it arises when a session goes to another server using UTL_TCP... UTL_TCP is used by  the packages like UTL_HTTP or UTL_SMTP..
So when we see this wait event, we need to check the use of this packages basically..

Anyways, as our focus is EBS, lets talk about these TCP Socket (KGAS) in EBS 12.2 by walking through a real life example..

It all started with a performance issue :)
Customer reported a performance problem in their custom Create Accounting process.
This custom create accounting was a heavy process which is undertaken by several concurrent programs..
The performance problem was so obvious.. Concurrent Programs were running for 1 day, and they just could not be completed..

To investigate, I found the request ids of these running programs and found their sids using their request ids, as always..
Once I found the sids, I checked their wait events..
There were 3 concurrent programs related with Create Accounting, and 2 of them were waiting in the PL/SQL lock timer wait. So as PL/SQL lock time is a purposeful wait , these 2 concurrent programs were eliminated.. In other words; It was obvious that; these 2 were waiting for the other one to finish.
Okay, so far so good..
When I analyzed the third one, I saw that it was waiting in the TCP Socket (KGAS) wait and its sql statement was a simple insert..
Initially, I stopped and restarted the whole process , but after a while , it stucked  in the TCP Socket (KGAS) again. This time it was doing an update..
It was very strange.. I mean , what could be the relation between an insert/update and TCP Socket (KGAS)?

I started to think that there were another processes which could not be seen at the first sight, but ofcourse I needed prove it.
To analyze further, I dumped the Error Stack of the running process..
To do this, I found the OS process id of the relevant db session and used oradebug tool as follows;

SQL> oradebug setospid 21641
Oracle pid: 163, Unix process pid: 21641, image: oracle@erptest
SQL> oradebug unlimit
Statement processed.
SQL> oradebug dump errorstack 3
Statement processed.


The purpose was to see the exact code ( C , PLSQL etc..) that was running around the TCP Socket (KGAS) wait..

When I analyzed the error stack, I saw the code in the stack.. The problem was there clearly..

0x72947638       135  package body SYS.UTL_TCP
0x72947638       338  package body SYS.UTL_TCP
0x72947638       408  package body SYS.UTL_TCP
0x72948498       107  package body SYS.UTL_SMTP
0x72948498       126  package body SYS.UTL_SMTP
0x72948498       636  package body SYS.UTL_SMTP
0x72948498       647  package body SYS.UTL_SMTP
0x72949a28        93  package body APPS.BLALA_SEND_MAIL --> The problem 
0x6480a630        31  APPS.BLABLA_GL_JE_HEADERS_AIUT


So as seen above, there was a call to a custom send_mail package and this send_mail package was using utl_smtp and as a matter of fact -> utl_tcp to send emails..

Furthermore; this custom send_mail was triggered by another custom code named BLABLA_GL_JE_HEADERS_AIUT and when I analyzed it, I have seen that it was a trigger :) 
 A trigger which was working line by line and trying to send emails ..

This was a sufficient for being the cause for these high TCP Socket (KGAS) waits.
At the end of the day, I requested this trigger or design to be fixed and until it is fixed, I requested the developers to make this trigger disabled..

In conclusion; 
I can say that whenever you see TCP KGAS, you can be sure that somewhere in your codepath there is a call to the UTL_* package listed above.. 
At the same time, you may not able to see these package calls at the first glance.. 
If this is the case, use high level tracing/debugging technics of Oracle to see them.. 
For example Use error stack like I did in this example ..

Monday, September 1, 2014

EBS R12 -- Notification Mailer problem/fnd_user_preferences/Relay access denied/MalformedURLException

If you are working as an Apps Dba Consultant, your life may go by diagnosing Notification Mailer's problems..  Every time a different EBS environment, a different mail server , that' s why a different workflow notification mailer problem..
Altough I feel myself  oversaturated with it, I cant get enough of solving Notification Mailer problems, as they are in a great demand nowadays :)

Okay here we are diagnosing the Workflow Notification Mailer again..

This time, the customer complains about the notification emails, which are not sent to their receipients..
WOW! What a great surprise! :)

First thing I checked was the notifications table;

select notification_id, status, mail_status, begin_date
from WF_NOTIFICATIONS
where status = 'OPEN' and mail_status = 'MAIL';

By using this query, I saw that mails are sitting in the wf_notifications to be sent to their recipients..
This was indicating a problem, clearly.. The mailer did not send or was unable to send the emails.. Okay, but why?
Then I checked the mailer, and saw its queue and concurrent manager were active..

SELECT concurrent_queue_name manager, fcp.last_update_date, fcp.concurrent_process_id pid, meaning, fcp.logfile_name
FROM fnd_concurrent_queues fcq, fnd_concurrent_processes fcp, fnd_lookups flkup
WHERE concurrent_queue_name in ('WFMLRSVC')
AND fcq.concurrent_queue_id = fcp.concurrent_queue_id
AND fcq.application_id = fcp.queue_application_id
AND flkup.lookup_code=fcp.process_status_code
AND lookup_type ='CP_PROCESS_STATUS_CODE'
AND meaning='Active'

At this point, I got suspected it might be related with mail preferences of the EBS users..
Actually, mail preferences of the users might be DISABLED..

So I updated all the mailtypes from DISABLED to MAILHTML..

update wf_local_roles
set notification_preference='MAILHTML'
where orig_system in ('FND_USR','PER')
and name in
(select user_name
from fnd_user_preferences
where preference_name='MAILTYPE'
and module_name='WF'
and preference_value='DISABLED');

update fnd_user_preferences
set preference_value='MAILHTML'
where preference_name='MAILTYPE'
and module_name='WF'
and preference_value='DISABLED';

commit;

By using above query, I updated a lot of records, so yes! Mail preferences of the users were set to DISABLED..
Okay, this should be the fix , right? 
To answer this question, I made the customer test the workflow that would send a notification email in its last step..
The result was negative :(.. The receipients could not get their notification emails..

Then I checked fnd_user_preferences to be sure that Mail Preferences were still set to MAILHTML..

select * from 
fnd_user_preferences
where preference_name='MAILTYPE'
and module_name='WF'
and preference_value='DISABLED';


By using the above query, I saw that preference_value column of the user that was suppose to get an email , was set to DISABLED again...

So, It must be notificaiton mailer making the user's mail preference to be set to DISABLED..
Notification mailer does that if it can send an email to that user. Especially, if Mailer encounters a problem queuing its emails to the Mail Server, It marks the status of the notification as  FAILED, and makes the mail preference of the related user to be set to DISABLED..

I also saw that It was coming in the workflow administrator:)
The notification with the ID of 727095 experienced problems when attempting to dispatch an email notification to the role S_ERMAN. Subsequently the notification preference for the following users has been set to DISABLED. Please correct the issue and re-enable their notification preference.

By knowing this, I made an SMTP test from the command line;
(Note that : The Mail server was requiring a user/pass)

telnet [outbound server] 25
EHLO [mailer node]
auth login
334 BSDLSADX
b3JhY2xlQGNleXBvBCcnY29tdsRy
334 UGFzc3dvMDS6
Y2V5MTSAA==
235 2.0.0 Authentication successful
MAIL FROM: [reply_to address]
RCPT TO: [my_test_email_address]
DATA
Subject: Test message

Test message body
.
quit

So , it got queued.. The test message was delivered without a problem..
Then I decided the make a line by line analysis on the Notification Mailer's log file..
To do that, I first set the log level to statement, reproduced the error and then analyzed the log file.. While analyzing the log file , I saw the following;

oracle.apps.fnd.wf.mailer.SMTPOutboundProcessor.send(Message)]:
From address -> erman@blabladomain.com.tr
Email to [] was not accepted for delivery.[Aug 29, 2014 5:05:38 PM oracle.apps.fnd.wf.mailer.SMTPOutboundProcessor.send(Message)]:Problem encountered when sending to {[[<ahmet@blabladomain.com.tr>]]} -> javax.mail.SendFailedException: Invalid Addresses; nested exception is:
class javax.mail.SendFailedException: 554 5.7.1 <ahmet@blabladomain.com.tr>: Relay access denied

It was clear.. There was a Relay problem.. But what a minute! The mailer was authenticating with user/pass, so we should not get any Relay errors.. 
Okay... Maybe the notification mailer did not authenticate itself using user/pass, although the user and pass information was supplied to it.

At this point, I checked the Oracle Support, and reached the bug record in 2 minutes:)

Bug 13609378 - 1OFF:12.1.3:MAILER FAILS TO DETECT SMTP AUTHENTICATION FOR SOME SERVERS
Bug 14676206 - 1OFF:12.1.3:JAVAX.MAIL.SENDFAILEDEXCEPTION: 554 5.7.1 SENDER ADDRESS REJECTED

When outbound server configuration is applied from OAM UI SMTPUtils.isValidOutbound() API is called to validate the server configuration.
When SMTP server is supporting authentication and outbound user name and password are given, session object will be created using the authenticator with the given details.
When SMTP server is not supporting authentication or outbound user name and password are not given, session object will be created without using the authenticator.
Hence no error will be thrown even if the SMTP server is supporting authentication but Mailer fails to detect it.


So I applied the fix for that :) The fix was Patch 14676206.
This patch brought a new version SMTPUtils.class.
$JAVA_TOP/oracle/apps/fnd/wf/mailer/SMTPUtils.class - 120.0.12010000.6

After applying Patch 14676206, I updated mail preferences to MAILTYPE (just in case) and  asked my customer to test the problematic workflow once again.. 
The result was negative .. Again... But this time, user prefences was not set to DISABLED.. So we clearly made a progress..

Okay, at this point, I jumped into the Mailer's log ,as I was curious to see the new problem:)
By the way, I use the following query to locate the mailers log file.. (it is in $APPLCSF/log actually)

SELECT concurrent_queue_name manager, fcp.logfile_name  
FROM fnd_concurrent_queues fcq, fnd_concurrent_processes fcp, fnd_lookups flkup  
WHERE concurrent_queue_name in ('WFMLRSVC', 'WFALSNRSVC')  
AND fcq.concurrent_queue_id = fcp.concurrent_queue_id  
AND fcq.application_id = fcp.queue_application_id  
AND flkup.lookup_code=fcp.process_status_code  
AND lookup_type ='CP_PROCESS_STATUS_CODE' 

So, It was as I expected, the errors in the log file were changed..

Now there were XML parse errors as follows;

oracle.apps.fnd.wf.mailer.NotificationFormatter.getFormattedMessages()]:Problem parsing XML -> org.xml.sax.SAXException: Problem obtaining the RESOURCE content -> java.net.MalformedURLException[Sep 1, 2014 11:10:31 AM EEST]:1409559031792:-1:-1:erpapps:192.168.0.5:-1:-1:1:20420:SYSADMIN(0):-1:Thread[outboundThreadGroup1,5,outboundThreadGroup]:1943056439:69743:1409558843248:4:ERROR:[SVC-GSM-WFMLRSVC-115970-10006 : oracle.apps.fnd.wf.mailer.SMTPMessageHandler.prepareMessages(String)]:FormatterException > oracle.apps.fnd.wf.mailer.FormatterException: Problem parsing XML> org.xml.sax.SAXException: Problem obtaining the RESOURCE content -> java.net.MalformedURLException at oracle.apps.fnd.wf.mailer.NotificationFormatter.handleResEndTag(NotificationFormatter.java:3470)
at oracle.apps.fnd.wf.mailer.NotificationFormatter.endElement(NotificationFormatter.java:578) at oracle.xml.parser.v2.XMLContentHandler.endElement(XMLContentHandler.java:210)
at oracle.xml.parser.v2.NonValidatingParser.parseElement(NonValidatingParser.java:1345) at oracle.xml.parser.v2.NonValidatingParser.parseRootElement(NonValidatingParser.java:362)
at oracle.xml.parser.v2.NonValidatingParser.parseDocument(NonValidatingParser.java:308) at oracle.xml.parser.v2.XMLParser.parse(XMLParser.java:337)
at oracle.apps.fnd.wf.mailer.NotificationFormatter.getFormattedMessages(NotificationFormatter.java:354) at oracle.apps.fnd.wf.mailer.SMTPMessageHandler.prepareMessages(SMTPMessageHandler.java:77)
at oracle.apps.fnd.wf.mailer.SMTPOutboundProcessor.read(SMTPOutboundProcessor.java:796) at oracle.apps.fnd.cp.gsc.SvcComponentProcessor.process(SvcComponentProcessor.java:604) at oracle.apps.fnd.cp.gsc.Processor.run(Processor.java:283)
at java.lang.Thread.run(Thread.java:619) Caused by: org.xml.sax.SAXException: Problem obtaining the RESOURCE content -> java.net.MalformedURLException


So , by looking to the error stack, I could say that these errors must be related with a another bug , because this time the errors were coming purely from the code obviously.
That's why, I did not made any controls and directly jumped in to the Oracle Support and found the related bug as follows;

BUG 9757926 - NOT RECEIVING PURCHASING REQUISITION EMAIL NOTIFICATION:JAVA.NET.MALFORMEDURLEXC
When "Attach images to outbound emails" is enabled the Workflow Mailer will embed any referenced image URL into the email message. PO Requisition (PO_REQ_APPROVE_JRAD) message includes an image that will be included in the email message. The complete URL pointing to this images is not is not being populated by the Workflow Mailer code. The code should retrieve http://<host.domain:port>/OA_MEDIA/ag_transparentpixel.gif. It is actually retrieving http://. This is not a valid URL.
When a Notification contains an Oracle Application Framework regions which is referring to an absolute URL then 'address' value is coming as null in URLMIMETagListener.editImageRef() method and 'java.net.MalformedURLException' is thrown.
For Example, the following image URL in the framework region would cause such problem:
<img src="http://server:port/OA_MEDIA/ag_transparentpixel.gif ...>
So the fix was Patch 9868639. This patch was bringing a new version URLMIMETagListener.java, and the last step in the treatment of this Notification Mailer. :)

$JAVA_TOP/oracle/apps/fnd/wf/mailer/URLMIMETagListener.class  120.4.12010000.6

I want to remind some important things before I finish;

To able to solve a problem in EBS, we must at least have an idea about the things we dealing with.
If the problem is in notification mailer, we as Apps Dbas have to know the technology and how it works in the background.. We cant always use heuristic approach to solve EBS problems..
Based on my experiences, I can say that there are a lot combinations for being false, but most of the time there is only a one good combination for being true in E-Business Suite:)