Sunday, November 9, 2014

EBS -- Java Cache & Block Corruption in Workflow Tables

EBS has a caching mechanism for java and since 11.5.10 CU2 several EBS products started to use Java Caching. 
 The objects and data which our OC4J or JVM processes like oacore uses this java cache to provide us faster access on pages and related data structures. Like the Sga in the database tier that provide us the buffer caching, Java cache provide us caching in the JVM's memory.
When our JVM instances like oacore accesses the EBS database and get some data from it, the data is cached in the related JVM memory. In the case of oacore it is the oacore's process memory.
Oacore core process resides in the middle  tier, so the memory for jvm caching is in the middle tier.. The cache starts to be filled up when data is first accessed by the JVM/oacore..

Following is taken from Oracle.. It describes the Java Caching Mechanism used in EBS.


Also the java cache in EBS is distributed.. That means, if you have multiple JVMs in the middle tier; lets say multiple oacore; then they will all have their own java caches.. These caches are syncronized, though.
Following diagram show the distributed java caching mechanism in EBS.




As you see in the diagrams above , the cache is located in the middle tier, between Apache and Oracle Database. Even updates from outside the jvm that invalidates the data cached in jvm is recognized using the Business events(invalidation message).

The cache accesses the database in both ways, for read and for write..In the second diagram, you see the reads from the db and writes into the, db better. So the cache is updated first and the database is updated secondly.. It works something like write-back.

It is a good thing to have such a cache in the middle tier, because it saves us from the time to access the database for everytime we need an information. One of the important things is that If an object is invalidated in a JVM, it will be reread from the database when next accessed. The new version will then be stored in the JVM's cache.   That 's why we can conclude that ; if te data in cache is not invalid, then the cache provides the data.. Also if the data is invalid in the cache, it will be reread from the database.

So far so good, now we know the necessary information about the Java Cache used in EBS.

Lets continue the incident that has made me write this post;

Recently an issue was escalated to me .. It was encountered in a EBS 11i with a 11gR2 Oracle Database running on an Exadata X3

The error was "Unable to generate forwarding URL. Exception: oracle.apps.fnd.cache.CacheException".
Some oaf pages were not able to be displayed because of this error.
The issue was critic because related oaf pages were used for approvals.
After the initial diagnostics , I understood that the SYSADMIN could open the pages without any problems.. The problem have arised when the oaf pages were trying be opened by the Business Users..
So, it should be something related with the data.. Maybe a profile..  But the issue was interesting because the failing code was coming in the Cache package. The customer was encountering caching exceptions..

The log file of java cache is located under $COMMON_TOP/rgf/<Instance >_<Hostname> and it provides error messages specific to Java Caching.

So , I decided to have a look at that.

I have found the following in error stack;

oracle.apps.jtf.base.resources.FrameworkException
at oracle.apps.fnd.cache.GenericCacheLoader.load(GenericCacheLoader.java:232)
at oracle.apps.fnd.cache.GenericCacheLoader.load(GenericCacheLoader.java:199)
at oracle.apps.fnd.cache.GenericCacheLoader.load(GenericCacheLoader.java:174)
at oracle.apps.fnd.cache.GenericCacheLoader.load(GenericCacheLoader.java:149)
at oracle.apps.jtf.cache.GenericCacheLoader.load(GenericCacheLoader.java:87)
at oracle.ias.cache.CacheHandle.findObject(Unknown Source)
at oracle.ias.cache.CacheHandle.locateObject(Unknown Source)
at oracle.ias.cache.CacheAccess.get(Unknown Source)
at oracle.apps.jtf.cache.IASCacheProvider.get(IASCacheProvider.java:656)
at oracle.apps.jtf.cache.CacheManager.getInternal(CacheManager.java:4794)
at oracle.apps.jtf.cache.CacheManager.get(CacheManager.java:4617)
at oracle.apps.fnd.cache.AppsCache.get(AppsCache.java:216)
at oracle.apps.fnd.functionSecurity.User.getUser(User.java:336)
at oracle.apps.fnd.functionSecurity.FunctionSecurity.getUser(FunctionSecurity.java:527)
at oracle.apps.fnd.functionSecurity.RunFunction.createURL(RunFunction.java:1190)
at oracle.apps.fnd.functionSecurity.RunFunction.init(RunFunction.java:389)
at oa_html._RF._jspService(_RF.java:81)
at oracle.jsp.runtime.HttpJsp.service(HttpJsp.java:119)
at oracle.jsp.app.JspApplication.dispatchRequest(JspApplication.java:417)
at oracle.jsp.JspServlet.doDispatch(JspServlet.java:267)
at oracle.jsp.JspServlet.internalService(JspServlet.java:186)
at oracle.jsp.JspServlet.service(JspServlet.java:156)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:588)
at org.apache.jserv.JServConnection.processRequest(JServConnection.java:456)
at org.apache.jserv.JServConnection.run(JServConnection.java:294)
at java.lang.Thread.run(Thread.java:662)
Caused by: oracle.apps.jtf.base.resources.FrameworkException: ORA-01578: ORACLE data block corrupted (file # 36, block # 495018)
ORA-01110: data file 36: '+DATA_OSRV/dev/datafile/apps_ts_tx_data.470.839415343'
ORA-26040: Data block was loaded using the NOLOGGING option
at oracle.apps.jtf.base.resources.FrameworkException.convertException(FrameworkException.java:607)
at oracle.apps.jtf.base.resources.FrameworkException.addException(FrameworkException.java:585)
at oracle.apps.jtf.base.resources.FrameworkException.<init>(FrameworkException.java:66)
at oracle.apps.jtf.base.resources.FrameworkException.<init>(FrameworkException.java:88)
at oracle.apps.jtf.base.resources.FrameworkException.<init>(FrameworkException.java:202)
at oracle.apps.jtf.base.resources.FrameworkException.<init>(FrameworkException.java:218)
at oracle.apps.jtf.base.resources.FrameworkException.<init>(FrameworkException.java:249)


Yes... I found an ORA - error coming from the database..

Caused by: oracle.apps.jtf.base.resources.FrameworkException:
ORA-01578: ORACLE data block corrupted (file # 36, block # 495018)
ORA-01110: data file 36: '+DATA_OSRV/dev/datafile/apps_ts_tx_data.470.839415343'
ORA-26040: Data block was loaded using the NOLOGGING option.

So , the error ORA-01578 was indicating that : The corruption was a software block corruption which was a former LOGICAL corrupt block marked as formally corrupt.

No matter what; It was a corruption error.. So the cache was trying to read the data in the corrupted block, but it could not read it because the block was corrupted.. If we had no cache in EBS; we would see an Apache error maybe.. Maybe the error would be ""You have encountered an unexpected error. Please contact the System Administrator for assistance."

Okay I was happy to find the cause; but it didnt last long.
Unfortuneatly, we had no rman backup in this environment (as it is DEV)..  No cold backups or anything.. Besides the partition was nologging in Dev.

Anyways.. I searched the database to find the object that this corrupted block was belonged;

select segment_name, segment_type, owner
from dba_extents
where file_id = 36
and 495018 between block_id
and block_id + blocks -1 ;

It was a table: APPLSYS.WF_LOCAL_ROLES. It was a workflow related table , and it could be syncronized using the Synchronize Workflow LOCAL table..
It was obvious that the problem was arising in the OAF pages which were trying to access the WF_LOCAL_ROLES table..  This also showed that, oacore first takes the data into its cache , and then builds the web page and sends it client.

Firstly, I tried to repair the block using DBMS_REPAIR package.

BEGIN
DBMS_REPAIR.ADMIN_TABLES (
TABLE_NAME => 'REPAIR_TABLE',
TABLE_TYPE => dbms_repair.repair_table,
ACTION => dbms_repair.create_action,
TABLESPACE => 'APPS_TS_TX_DATA');
END;

BEGIN
DBMS_REPAIR.ADMIN_TABLES (
TABLE_NAME => 'ORPHAN_KEY_TABLE',
TABLE_TYPE => dbms_repair.orphan_table,
ACTION => dbms_repair.create_action,
TABLESPACE => 'APPS_TS_TX_DATA');
END;
/

SET SERVEROUTPUT ON

DECLARE num_corrupt INT;
BEGIN
num_corrupt := 0;
BEGIN
DBMS_REPAIR.CHECK_OBJECT (
SCHEMA_NAME => 'APPLSYS',
OBJECT_NAME => 'WF_LOCAL_ROLES',
REPAIR_TABLE_NAME => 'REPAIR_TABLE'
);
END;SELECT OBJECT_NAME, BLOCK_ID, MARKED_CORRUPT
FROM SYS.REPAIR_TABLE;

SET SERVEROUTPUT ON
DECLARE num_fix INT;
BEGIN
num_fix := 0;
DBMS_REPAIR.FIX_CORRUPT_BLOCKS (
SCHEMA_NAME => 'APPLSYS',
OBJECT_NAME=> 'WF_LOCAL_ROLES',
OBJECT_TYPE => dbms_repair.table_object,
REPAIR_TABLE_NAME => 'REPAIR_TABLE',
FIX_COUNT=> num_fix);
DBMS_OUTPUT.PUT_LINE('num fix: ' || TO_CHAR(num_fix));
END;
/

As I expect te DBMS_REPAIR has found the corrupted blocks, recorded them but could not fix them.. It was normal actually, because DBMS_REPAIR actually doesnt fix anything. It just let us mark and skip the corrupt blocks if we want to.

So I have used DBMS_REPAIR.SKIP_CORRUPT_BLOCKS as follows;

BEGIN
DBMS_REPAIR.SKIP_CORRUPT_BLOCKS (
SCHEMA_NAME => 'APPLSYS',
OBJECT_NAME => 'WF_LOCAL_ROLES',
OBJECT_TYPE => dbms_repair.table_object,
FLAGS => dbms_repair.skip_flag);
END;
/
SELECT OWNER, TABLE_NAME, SKIP_CORRUPT FROM DBA_TABLES
WHERE OWNER = 'APPLSYS';

By skipping the corrupt blocks, the problematic OAF pages have started to be working again. The cache problem went away , as expected.
Lastly I executed the concurrent program "Synchronize WF LOCAL tables" to syncronize WF_LOCAL_ROLES table again.

Strange isnt it?? Sometimes simple things like block corruption in the workflow tables can cause critical user functions to not work..

No comments :

Post a Comment