Actually, we don't have to go too far to find the answer, as I already wrote a blog post about it back in 2016 ->
https://ermanarslan.blogspot.com/2016/01/ebs-r12-apps-tier-32-bit-or-64-bit.html
In short, when we talk about a 64 bit EBS 12.1 or 12.0 system, we actually talk about a 64 bit Oracle Database + an Application tier which has both 32 bit and 64 bit components.
That means, even if our EBS environments are 64 bit, we still have 32 bit components deployed and 32 bit code running in our EBS environments.
Except for some 64 bit executables, such as the executables in the Advanced Planning product line, the EBS Apps Tier is 32 bit. That's why we apply the 32 bit versions of patches to the 10.1.2 and 10.1.3 Oracle Homes.
Well.. After this quick but crucial recap, let's get started with our actual topic, the problem that made me write this blog post.
Two days ago, I dealt with a problem in an EBS 12.1 environment.
An environment in which we had:
2 Apps nodes, 1 OAM/OID node and 2 database nodes.
It was an advanced configuration, but the problem was in a specific area.
Basically, our problem was about the concurrent managers..
That is, the concurrent managers could not be started..
Actually, they were started by the Internal Concurrent Manager (ICM), but they were then going into the "defunct" state. So they were becoming zombies just after they were started, and when we checked the syslog, we saw that the processes were getting segmentation faults.
This cycle repeated every minute.. I mean, the managers were started by ICM and then they were going into the defunct state.. ICM recognized that they were dead, so it killed the remaining processes and then restarted them, again and again..
We took one of the Standard Manager processes as a sample and checked its log file..
The problem was very clear..
The process was complaining about being unable to do I/O on its log file.. (the manager's log file)
The error recorded in the Standard Manager's log file was "Concurrent Manager cannot open your concurrent request's log file."
All the standard managers had the same error in their log files, and all the associated FNDLIBR processes were going into the defunct state just after they were started by ICM..
When we analyzed the OS architecture on the Apps nodes, we saw that there were NFS shares present..
The NFS shares were mounted, and there were also symbolic links through these NFS shares to the directories that were hosting the Concurrent Managers' out and log files.
When we "cd"ed into these directories, we could list those log files, and they were actually there.. We could even read and edit (write) the problematic log files without any problems.. The permissions were okay and everything looked good.
However, it seemed that the code, I mean the FNDLIBR processes, couldn't do I/O on these files.
With this acquired knowledge, we analyzed the storage architecture, the storage configuration itself..
It was a NetApp, a newly acquired one, and those NFS shares had been migrated to this new storage 2 days earlier.. So it was winking at us.. Something in the storage layer had to be the real cause.
We knew that these FNDLIBR processes were 32 bit, and they may fail while dealing with a 64 bit object.. The code was probably getting EOVERFLOW. (EOVERFLOW: Value too large to be stored in data type.)
So we told the storage admin to check if there was any configuration which might cause this to happen.. Especially the inode configuration had to be checked on this new storage.. Using 64 bit inodes might cause this...
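Just to illustrate the failure mode with a minimal demo of my own (this is not EBS code, just an example): a program compiled as 32 bit without large-file support gets EOVERFLOW from stat() as soon as the file's inode number doesn't fit into its 32 bit ino_t..

/* ino_overflow_demo.c - my own illustration, not EBS code.
 * Compile it 32 bit, without large-file support:
 *   gcc -m32 ino_overflow_demo.c -o ino_overflow_demo
 * Then run it against a file on the NFS share (a manager log file, for example).
 */
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
    struct stat st;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    if (stat(argv[1], &st) == -1) {
        /* With a 64 bit inode number and a 32 bit ino_t, this prints something like:
         * stat failed: Value too large for defined data type (that is EOVERFLOW) */
        fprintf(stderr, "stat failed: %s\n", strerror(errno));
        return 1;
    }

    printf("inode of %s is %llu\n", argv[1], (unsigned long long) st.st_ino);
    return 0;
}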
Actually, we had solutions for this kind of an inode problem in the EBS application and Linux OS layers as well.
In the EBS layer, we could apply the patch -> Patch 19287293: CP CONSOLIDATED BUG FOR 12.1.3.2.
At this point, I asked myself a question... How can Oracle fix this by applying a patch to its own code?
Without analyzing the patch, my answer was: probably by delivering a wrapper which intercepts these I/O functions like readdir(), stat() etc. and returns 32 bit inode numbers that the calling application/process can handle. Maybe they used LD_PRELOAD, which can be used to intercept these library calls and do this trick.. They may even have used uint64_t for storing the 64 bit inodes in the first place..
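To make that wrapper idea a bit more concrete, here is a rough sketch of such an LD_PRELOAD interposer (my own illustration under my own assumptions; it is not what the Oracle patch actually delivers). It intercepts stat(), does the real work with the large-file stat64() variant, and folds the inode number into 32 bits before handing the result back, so the 32 bit caller never sees EOVERFLOW:

/* ino32_shim.c - a rough sketch of the wrapper idea, not Oracle's actual fix.
 * Build (32 bit): gcc -m32 -shared -fPIC ino32_shim.c -o ino32_shim.so -ldl
 * Use:            LD_PRELOAD=./ino32_shim.so <32-bit program>
 *
 * Note: depending on the glibc version, stat()/lstat() calls may be routed
 * through __xstat()/__lxstat() instead, and readdir() needs the same
 * treatment, so a real wrapper would have to cover those entry points as well.
 */
#define _GNU_SOURCE
#define _LARGEFILE64_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

int stat(const char *path, struct stat *buf)
{
    static int (*real_stat64)(const char *, struct stat64 *) = NULL;
    struct stat64 st64;

    if (real_stat64 == NULL)
        real_stat64 = (int (*)(const char *, struct stat64 *)) dlsym(RTLD_NEXT, "stat64");

    if (real_stat64 == NULL) {
        errno = ENOSYS;   /* stat64 is not visible as a dynamic symbol on this glibc */
        return -1;
    }

    if (real_stat64(path, &st64) != 0)
        return -1;

    /* Copy the 64 bit result into the caller's 32 bit struct stat,
     * folding the inode number into 32 bits instead of overflowing. */
    memset(buf, 0, sizeof(*buf));
    buf->st_ino   = (ino_t) (st64.st_ino & 0xFFFFFFFFu);
    buf->st_dev   = st64.st_dev;
    buf->st_mode  = st64.st_mode;
    buf->st_nlink = st64.st_nlink;
    buf->st_uid   = st64.st_uid;
    buf->st_gid   = st64.st_gid;
    buf->st_rdev  = st64.st_rdev;
    buf->st_size  = (off_t) st64.st_size;   /* may truncate for files bigger than 2 GB */
    buf->st_atime = st64.st_atime;
    buf->st_mtime = st64.st_mtime;
    buf->st_ctime = st64.st_ctime;
    return 0;
}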
Anyway, my answer satisfied my curiosity :), let's continue..
In the OS layer, we could use the boot parameter nfs.enable_ino64=0. (Note that this makes the NFS client fake up a 32 bit inode number for the readdir and stat syscalls instead of returning the full 64 bit number.)
However, we didn't want to take any actions or risks while there was an unannounced change in question..
At the end of the day, it was just as we thought..
The new NetApp was configured to use 64 bit filehandles :) As for the solution: the storage admin disabled this feature and everything went back to normal :)
Some extra info:
*
To check if a problematic file has a 64 bit inode, you can use the ls -li command.
For ex:
$ ls -li erman
12929830884 -rw-r--r-- 1 oracle dba 0 Aug 19 11:43 erman
The inode number is the first one in the output above.
So if you see an inode number bigger than 4294967295 (the maximum 32 bit value), then it means it is a 64 bit number.
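The same check can also be done programmatically; here is a tiny helper of my own (compiled as a normal 64 bit binary) that prints whether a file's inode number fits into 32 bits:

/* ino_check.c - my own little helper, compile it 64 bit: gcc ino_check.c -o ino_check */
#include <stdio.h>
#include <stdint.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
    struct stat st;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    if (stat(argv[1], &st) != 0) {
        perror("stat");
        return 1;
    }

    /* Anything above 4294967295 (UINT32_MAX) cannot be represented in a 32 bit ino_t. */
    printf("%s: inode %llu -> %s\n", argv[1], (unsigned long long) st.st_ino,
           st.st_ino > UINT32_MAX ? "64 bit inode" : "fits in 32 bits");
    return 0;
}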
*
To check whether a binary is 32 bit or 64 bit, you can use the file command.
For ex:
$ file /opt/erman
/opt/erman: ELF 32-bit LSB executable
That is it for today :) See you in the next article ...
If you want to ask a question, please don't comment here..
For your questions, please create an issue in my forum.
Forum Link: http://ermanarslan.blogspot.com.tr/p/forum.html
Register and create an issue in the related category.
I will support you from there.