Wednesday, April 20, 2022

EBS / RDBMS -- High Latch Free Waits --> "Result Cache: RC Latch"

I dealt with an interesting performance issue recently. The environment was EBS R12 (12.2.6), and the EBS database was Oracle 12.1.0.2 running on Solaris SPARC.

The issue was appearing during office hours.

Every business day around 9:30 AM, Latch Free waits appeared, and this situation made the end users face a performance degradation and complain..

We had the Enterprise Manager, AWR and ASH reports. The cause on the surface was Latch Free waits, but the reason behind them was the Result Cache. (Actually, the improper use of the Result Cache...)

If you don't have the necessary background info about the Result Cache, please read the following blog post for a quick primer ->

https://ermanarslan.blogspot.com/2019/10/rdbms-things-to-know-about-result-cache.html

(RDBMS -- things to know about the Result Cache / Result Cache : RC latch)

In this case, the following PL/SQL calls were waiting on the Result Cache: RC Latch;

*FND_AOLJ_UTIL.is_Valid_ICX(:2,NULL,'Y',TRUE,TRUE,NULL,NULL,NUL)

*BEGIN   fnd_global.bless_next_init('FND_PERMIT_0006');   

*FND_GLOBAL.APPS_INITIALIZE( user_id=>:1 ,resp_id=>:2 ,resp_appl_id=>:3 ,security_group_id=>:4 ,server_id=>:5);END;

Looks like a login issue, right? I mean, it looked like there was a login storm, which was expected in the morning, and that it was the thing triggering the issue. (Actually, it was the one becoming the victim of it..) Why do I say that? Because I saw those PL/SQLs were the ones used in session validation, apps initialization and so on.. And! We were seeing the issue in the office hours around 9 AM..

In these types of issues, customers chase the quick wins at first..

So what might be the solution?

*Increasing the Result Cache Size?

*Disabling the custom code? (there was a Custom SSO on the login) 

*Gathering fixed object stats and dictionary stats? -- ref: Some Queries Are Causing Performance Issues in ISG (Doc ID 2396223.1)

*Setting some hidden parameters -> alter system set "_optimizer_dsdir_usage_control" = 0;  -- ref: High Latch Free Result Cache RC LATCH On Object Linked Views (Doc ID 2245996.1) 

*Some more hidden parameters -> alter system set "_optimizer_ads_use_result_cache" = FALSE; -- ref: High "Latch Free" Waits with Contention on 'Result Cache: RC Latch' when RESULT_CACHE_MODE = MANUAL on Oracle 12c (Doc ID 2002089.1)

*Doing some other improvements to decrease the latching -- moves like changing the cursor_sharing parameter -- not actually related..

*Enabling session trace for EBS users? -- it doesn't tell us much in this case, but for reference, here is how it is done;

Log onto System Administrator Responsibility

Navigate: Profile > System

Search for the profile option Initialization SQL Statement - Custom (at USER level).

Enter the following SQL statement into the User Level profile option.

--modify USERID to the actual username of the user.

BEGIN FND_CTL.FND_SESS_CTL('','', '', 'TRUE','','ALTER SESSION SET TRACEFILE_IDENTIFIER='||''''||'USERID' ||''''||' EVENTS ='||''''||' 10046 TRACE NAME CONTEXT FOREVER, LEVEL 12 '||''''); END;
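Once tracing is enabled, the resulting trace files land in the database's diagnostic trace directory. A quick way to locate it (v$diag_info is available from 11g onwards; the TRACEFILE_IDENTIFIER set above makes the files easy to spot):

SELECT value FROM v$diag_info WHERE name = 'Diag Trace';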

*Patches? -- It was a good try..
Patch 26436717 (Database/RDBMS Home Patch) and Patch 17305006 - SEVERE PERFORMANCE SLOWDOWN WHEN RESULTS CACHE IS THRASHING are needed.
Patch 27508480: CONNECTION LEAKS FROM WSH.SERVLET.REQUESTRATES -- can be considered.
PAY122:CONNECTION LEAKS WHEN ACCESSING ONLINE W-2 IN 12.2 (Patch 22684327:R12.PAY.C)
Patches 28831943 and 26599059 for FND
Patches 25969099 and 20141119
BNE Is Leaking Connections In 12.2.4 --- AD/TXK 7, ATG 5 And HRMS PC.C.7 (Doc ID 2141923.1)
21771287: CONNECTION LEAKS FROM GWY.JSP - SIMILAR TO BUG 20446583 -- can be considered
Check that R12.ATG_PF.C.delta.6 Patch 21900895 or greater has been applied, which includes the latest changes for BNE memory allocation.

I could keep going, but actually there is no need. Well... That's because we considered all of the things above and actually did some of them, but we couldn't get any benefit..

-->This time, we were not looking for a needle in a haystack. (That is actually pretty easy, because a needle and a haystack are very different..) We were looking for a particular needle in a stack of needles :) -- I like this phrase, by the way.

We were almost sure that the solution should be directly related to the result cache, or to the EBS code using it.
Disabling the Result Cache at the database level would resolve the issue, but we were not so into that.
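(For the record, disabling it at the instance level would be a one-liner -- a sketch, in case you want to go that way; note that setting RESULT_CACHE_MAX_SIZE to 0 disables the cache, and once it is disabled at startup, it cannot be re-enabled without a restart:

ALTER SYSTEM SET result_cache_max_size = 0 SCOPE = SPFILE;
)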

What we did was the following; we first turned the result cache off by using the bypass mode.
Then we flushed the Result Cache and turned it back on.
Once the Result Cache was flushed, we started monitoring it..

For checking the current result cache status
-----------------------------------------
SELECT dbms_result_cache.status() FROM dual;

To enable the bypass mode and flush:
-----------------------------------------
BEGIN
   DBMS_RESULT_CACHE.BYPASS(TRUE);
   DBMS_RESULT_CACHE.FLUSH;
END;
/

To disable the bypass mode
-----------------------------------------
exec DBMS_RESULT_CACHE.BYPASS(FALSE);
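
While monitoring, dbms_result_cache.memory_report is also handy; it prints a summary of the cache's memory/block usage to the server output (run set serveroutput on first):

To print the memory report
-----------------------------------------
SET SERVEROUTPUT ON
exec DBMS_RESULT_CACHE.MEMORY_REPORT(TRUE);  -- TRUE for the detailed version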

We saw that the Result Cache was mostly occupied by FND_PROFILE..
There was no space pressure on the cache, but things like FND_PROFILE.GET_PROFILE_APPL_ID and FND_PROFILE.GET_PROFILE_ID were there at the top..

Query used for Result cache monitoring:

select substr(name, 1, 100) name,
       cache_id,
       count(*) result_count,
       round(avg(scan_count)) avg_scan_count,
       type,
       status
from gv$result_cache_objects
where type = 'Result'
group by name, cache_id, type, status;

A culprit from the monitoring output:

NAME                                                                            
--------------------------------------------------------------------------------
CACHE_ID                                                                        
--------------------------------------------------------------------------------
RESULT_COUNT AVG_SCAN_COUNT TYPE       STATUS                                   
------------ -------------- ---------- ---------                                
"APPS"."FND_PROFILE"::11."GET_PROFILE_APPL_ID"#32fb3b6bdac49c05 #245            
85njc349fpj4gc54hq8ca2u41q                                                      
         691         272415 Result     Published                                
                                                                                
"APPS"."FND_PROFILE"::11."GET_PROFILE_ID"#32fb3b6bdac49c05 #216                 
6m1zcmdsv8v40c201cvbaa3y8u                                                      
         721         261123 Result     Published                                
                                                                                
"APPS"."FND_PROFILE"::11."GET_LEVEL_ENABLED_FLAG"#ba365d4c80ddb2d5 #275         
4b86jg87f0gkxafawmtmr29dh4                                                      
        4746         139188 Result     Published                                
                                                                                
"APPS"."FND_PROFILE"::11."GET_HIERARCHY_TYPE"#a0e3bfe28f4de20f #326             
1q2vx7pv7jvc18uksjffyf6vb9                                                      
         678         139164 Result     Published                        

Then, we jumped into the code.. The standard code. We found that FND_PROFILE was written in a way that implicitly uses the result cache.. We saw the usage of the cross-session PL/SQL Function Result Cache in multiple places in the code (the RESULT_CACHE clause). See the sketch below.
Those FND_PROFILE calls were mostly made during login, and we concluded that when there is a login storm and, at the same time, DML that affects the FND_PROFILE-related data already stored in the result cache, those waits on the Result Cache latch start appearing..
Note that we also had some cached things that belong to WF_CORE, but this wasn't the issue..
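
To illustrate the mechanism -- a minimal sketch, not the actual FND_PROFILE source; the function and table names here are made up -- a cross-session result-cached PL/SQL function is simply declared with the RESULT_CACHE clause:

-- Hypothetical example: any committed DML on profile_values invalidates the
-- cached results, and every repopulation needs the exclusive RC latch.
CREATE OR REPLACE FUNCTION get_profile_value (p_profile_name IN VARCHAR2)
  RETURN VARCHAR2
  RESULT_CACHE
IS
  l_value VARCHAR2(240);
BEGIN
  SELECT profile_option_value
    INTO l_value
    FROM profile_values
   WHERE profile_option_name = p_profile_name;

  RETURN l_value;
END get_profile_value;
/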

Anyways, as a workaround we wrote a script (a job, a cron job etc.. -- think of it however you like) to flush the result cache every morning, just before the issue happens.. (The issue was mostly seen after 9 AM..)
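A minimal sketch of such a job with DBMS_SCHEDULER (the job name and the schedule below are ours to pick; adjust BYHOUR/BYMINUTE to land just before your own peak):

BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'XX_FLUSH_RESULT_CACHE',   -- hypothetical job name
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN DBMS_RESULT_CACHE.FLUSH; END;',
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=DAILY; BYHOUR=8; BYMINUTE=45', -- just before 9 AM
    enabled         => TRUE);
END;
/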

We could also blacklist the relevant parts of the result cache, as a more elegant workaround.

Blacklisting those caches:

Using the following query, identify the Cache_IDs that have a high RESULT_COUNT.

select substr(name, 1, 100) name,
       cache_id,
       count(*) result_count,
       round(avg(scan_count)) avg_scan_count,
       type,
       status
from gv$result_cache_objects
where type = 'Result'
group by name, cache_id, type, status;

Then use Black_List_Add to ban them.. (note that the cache_id argument is a string, so it should be quoted)

exec dbms_result_cache.Black_List_Add('xxxxxxxxxx', TRUE);
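
The blacklist can be reviewed and reverted, too (hedging a bit here: these DBMS_RESULT_CACHE calls exist in recent releases -- verify them against your own version before relying on them):

-- list the currently blacklisted cache ids
SELECT * FROM TABLE(DBMS_RESULT_CACHE.BLACK_LIST);

-- remove a single entry from the blacklist
exec dbms_result_cache.Black_List_Remove('xxxxxxxxxx', TRUE);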

But what about the fix? Well... The fix is in the code.. The code is written in such a way that it uses the Result Cache, but the ERP processes change the data that is cached by that code, and that is too bad!

Either the code or the processes (the ERP processes updating the data cached by FND_PROFILE) should be revisited..

Nevertheless, we have 2 workarounds! :)  Again -> Be careful when using the result cache!

Well.. I can't stop myself from giving some background info here.. Here it is;

In these kinds of scenarios, the result cache brings the data faster than the buffer cache, as it has the data already in result format and just copies it into our session's buffer.

However, if we try to use the result cache in a dynamic environment, things change.
Things change because an invalidation (when the data and objects that build the contents cached in the result cache change) means "writing to the Result Cache".

Is writing such a bad thing? For 1 write or 10 writes maybe not, but consider 1000 concurrent writes..
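
You can actually watch this churn happening; every invalidation flips rows in gv$result_cache_objects from Published to Invalid, so a simple status breakdown shows how dynamic the cached data is:

select status, count(*)
from gv$result_cache_objects
group by status;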

Why is writing to the result cache concurrently a bad thing, then?
It is because the result cache is protected by a single latch.

In order to write to the cache, a session is required to get an exclusive latch.
This exclusive latch needs to be acquired when there is a cache miss or a cache invalidation..

So a cache hit doesn't require the session to get that exclusive latch.
This exclusive latch, on the other hand, is a single one, and it has no children.
Basically, when you want to write to the cache, you need to get it -> concurrent write access -> Problem.
However, if you read the cache (cache hit), you only need a shared latch -> no contention -> No problemo :)
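
The contention is visible in v$latch, by the way; the gets/misses/sleeps of this single parent latch tell the story (and since it has no children, v$latch_children shows nothing for it):

select name, gets, misses, sleeps, wait_time
from v$latch
where name = 'Result Cache: RC Latch';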

Here is what Oracle Support says about it ->

The result cache was designed to cache small results sets for a short period of time and once it gets full there is an overhead of handling the data in it. When no empty blocks are found in the RC we attempt to make room first by removing any invalids, then requesting a new chunk to be allocated and finally by removing using LRU. Currently, there is a cost associated with making room in a full result cache, which increases with size of the cache, especially for LRU processing and this is unlikely to change.

If you directly or indirectly put an object, or let's say a result, into the result cache, and the data/object that produces that result is dynamically changing, then guess what?
You will need to write to the cache..

-- By the word directly, I mean using hints -> SELECT /*+ result_cache */ ...

You will have Result Cache related activity.
You will see a peak for the latch named Result Cache: RC latch.
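
For completeness, here is what a directly cached query looks like, together with a quick hit/miss check (the table and column names are illustrative; v$result_cache_statistics is not):

-- the first execution populates the cache (exclusive latch);
-- identical re-executions read it (shared latch).
select /*+ result_cache */ profile_option_value
from profile_values
where profile_option_name = 'SOME_OPTION';

-- cache population vs. hit counters
select name, value
from v$result_cache_statistics
where name in ('Create Count Success', 'Find Count');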

Okay.. It is a perfect place to end.. See you in the next post :)

