Friday, October 7, 2016

RAC -- interconnect and gc waits, are select statements affected?

Recently dealed with a cluster wait issue on a RAC database, which was serving as a production database of a Bank.
During the work, there was a question asked by the folks and I felt that there was a misconception about the defition called "writers do not block readers"
They thought that the readers , who are actually using only the select statements should not be affected by the cluster waits...(which is not true bytheway)
Writers do not block readers is a true concept, but it has some additional and undeniable cost when the system is an Active-Active cluster (i.e RAC)

Look at this AWR:
Top User Events
Event
Event Class
% Event
Avg Active Sessions
gc buffer busy acquire
Cluster
75.00
33.73
enq: TX - row lock contention
Application
7.55
3.40
enq: TX - index contention
Concurrency
3.59
1.61
gc cr block lost
Cluster
2.94
1.32
gc buffer busy release
Cluster
2.62
1.18

See the top waits?

Look at the query recorded in AWR ->
SQL ID
PlanHash
Sampled # of Executions
% Activity
Row Source
% RwSrc
Top Event
% Event
SQL Text
1219140043
143
32.26
INDEX - UNIQUE SCAN
32.22
gc buffer busy acquire
30.75
SELECT count(*) FROM BLA_BLA...

You see the wait? (note I will not go into the details of this gc buffer busy acquire wait, it is beyond of the scope of this blog post, but what I will say is, it is there because of the contention and the select recorded in the AWR is affected from it.)

It definetly means a contention and,  we may be talking about a Read-Write contention here, which can happen in RAC system for sure. (note I didnt copy paste the DMLs s here, but trust me .. they are there)

While instance B is making modification on a block, if you try to read(even only read) the same block from instance A, then you may see these waits.

Here is what happens in a scenario like this;

Instance A sends a request to GCS process, which manages this locking.
GCS checks the condition of the block and sees that it is locked by the Instance B.
GCS says to Instance B: remove your lock
If the session in Instance B does not want to remove its lock (if the session is still working with the block), it creates a Consistent Image in its buffer cache and sends this consistent image to Instance A. (consistent read)
This consistent image actually reflects the state of the block, which represents how the block looked like just before the modification.
This consistent image, as you may guess is sent from the private interconnect and Instance B gives this delivery info to GCS.

Well...We can say that this situation can be called a remote cache hit, and the turn of events explained above is caused by the read-write contention.

Note1:Remote cache hit costs more that a Local cache hit.
Note2:In order to  have more local cache hits, it is recommend to do a configuration that will make the session that needs the block and the session that holds the lock for that block to be in the same RAC instance.
Note3:When we talk about a write-write contention, the LGWR, redo flush and PI(Past Image) creation may also come in to play. So it costs even more.

What can be done to prevent these inter-instance contentions?

Partitioning, reverse key or hash partitioned indexing and etc..
However, the best approach would be building a RAC configuration, that will make the session, which needs the block and the session which holds the lock for that block to be in the same RAC instance. (for example, if we have 3 instance, and if we see the data blocks for the same objects are frequently  transferred from instance 3 to instance 2, we need to think that -> why dont we dedicate the related service to instance 2 and prevent this to happen, right?)
In order to make this happen, the approach to be followed is to divide the applications into modules and to configure these modules to connect to the db services, which are dedicated to instances.

So, this blog post is a specific one. The main idea here is;  reader must not see the changes that are done but not committed by the writer. Altough, RAC is based on a shared disk/interconnect architecture, it has a non-shared memory, which acts like a shared one --cache fusion.
Cache fusion is a very sophisticated product and there are serialization mechanisms that come in to play to provide it. In brief, the cache fusion depends on in memory block locking, I/O, interconnect and interprocess communication. So it is costly.

Note: in the next few days, I will be writing a more detailed blog post about RAC. So, a more detailed article about RAC is on its way to you.

Update:

Keep in mind that, the Oracle database versions < 11.2.0.4, there are lots of bug reported with these issues.
DOC ID 1907369.1 is one of the documents reporting them.
The recommended solution for getting rid of those bugs are upgrading the database to 11.2.0.4So if you see a sessing get stuck on gc current request, you may be hitting those bugs.
Recommend solution for those bugs is; killing the stuck session or upgrading the database to 11.2.0.4, as most of those bugs are fixed in that release.

No comments :

Post a Comment