Sunday, May 25, 2014

AWR Analysis -- Part 1

In this post, I will write about the Automatic Workload Repository (AWR) report, which is a sophisticated tool for investigating the performance of an Oracle Database. If we also analyze an ASH report in addition to the findings in AWR, we get a fully detailed investigation of an Oracle Database.
My references for this article are Oracle Support, an AWR interpretation presentation from Oracle ACS, and my own field experience.

*AWR was first introduced in Oracle Database 10g.
*Statistics gathering for AWR is enabled by default. (STATISTICS_LEVEL=TYPICAL or ALL enables automatic statistics gathering for AWR. By default snapshots are taken hourly, and retention is 7 days in 10g, 8 days in 11g.)
*A manual snapshot can be taken with -> EXEC DBMS_WORKLOAD_REPOSITORY.create_snapshot;
*DBMS_WORKLOAD_REPOSITORY.modify_snapshot_settings can be used to modify the default settings, for example:

BEGIN
  DBMS_WORKLOAD_REPOSITORY.modify_snapshot_settings(
    retention => 21600,        -- minutes (= 15 days)
    interval  => 30,           -- minutes
    topnsql   => 100,          -- the number of Top SQL to flush for each criteria
    dbid      => 3310949047);  -- the local database identifier is used by default
END;
/

For any parameter passed as NULL, the current value is retained.

*AWR data is stored in the SYSAUX tablespace.
*$ORACLE_HOME/rdbms/admin/awrinfo.sql can be used to display general information about AWR (size, data distribution in SYSAUX).
*$ORACLE_HOME/rdbms/admin/awrsqrpt.sql can be used to display the statistics of a single SQL statement for a range of AWR snapshots.
*$ORACLE_HOME/rdbms/admin/awrextr.sql can be used to extract AWR data from a database into a Data Pump export file.
*$ORACLE_HOME/rdbms/admin/awrload.sql can be used to import AWR data from an export dump file. (The target database version must be >= the database version of the exported data.)


DB Time: total time spent by user processes either actively working or actively waiting in a database call.
Response Time: Service (CPU) Time + Wait Time. | Service Time = time the request is actively being processed on the CPU.
Elapsed Time: wall clock time. Compare DB Time with Elapsed Time to estimate the workload of the instance.
Average Active Sessions = DB Time / Elapsed Time.
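As a sketch, the DB Time and Average Active Sessions between two AWR snapshots can be computed directly from the AWR history views; &begin_snap and &end_snap are placeholders for your own snapshot range, and a single database is assumed:

```sql
-- Sketch: Average Active Sessions between two AWR snapshots.
-- The 'DB time' statistic in DBA_HIST_SYS_TIME_MODEL is cumulative
-- and reported in microseconds.
SELECT ROUND((e.value - b.value) / 1e6, 1) AS db_time_secs,
       ROUND((CAST(es.end_interval_time AS DATE)
            - CAST(bs.end_interval_time AS DATE)) * 86400) AS elapsed_secs,
       ROUND(((e.value - b.value) / 1e6)
           / ((CAST(es.end_interval_time AS DATE)
             - CAST(bs.end_interval_time AS DATE)) * 86400), 2) AS avg_active_sessions
  FROM dba_hist_sys_time_model b,
       dba_hist_sys_time_model e,
       dba_hist_snapshot bs,
       dba_hist_snapshot es
 WHERE b.stat_name = 'DB time'
   AND e.stat_name = 'DB time'
   AND b.snap_id  = &begin_snap
   AND e.snap_id  = &end_snap
   AND bs.snap_id = b.snap_id
   AND es.snap_id = e.snap_id
   AND e.dbid = b.dbid AND bs.dbid = b.dbid AND es.dbid = b.dbid
   AND e.instance_number  = b.instance_number
   AND bs.instance_number = b.instance_number
   AND es.instance_number = b.instance_number;
```

An Average Active Sessions value well above the CPU count is a quick sign of a heavily loaded instance.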

DB Time increases when:
Users (more db calls) increase -> DB Time increases
Transactions are large -> DB Time increases
IO performance decreases -> IO wait time increases -> DB Time increases
CPU wait time increases -> DB Time increases


Instance Efficiency
Execute to Parse: 100*(Executes-Parses)/Executes, how often SQLs were executed without having to be parsed.
Parse CPU to Parse Elapsed: 100*Parse CPU Time/Parse Elapsed Time, how much of the total parse time was spent on CPU.
Soft Parse %: 100*(Parses - Hard Parses)/Parses, the ratio of soft parses to total parses.
Latch Hit %: shows how often latches were acquired without having to wait. This percentage is based on the ratio of total latch misses to total latch gets for all latches.
% Non-Parse CPU: shows the percentage of CPU time (DB CPU) that is spent on execution rather than parsing.
Buffer Hit %: measures the percentage of block requests satisfied from the buffer cache -- without disk IO.
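The parse-related ratios above can also be computed live from V$SYSSTAT (instance-wide and cumulative since startup, whereas AWR computes them per snapshot interval); a sketch:

```sql
-- Sketch: Execute to Parse and Soft Parse % from V$SYSSTAT.
-- Values are cumulative since instance startup.
SELECT ROUND(100 * (ex.value - p.value) / ex.value, 2) AS execute_to_parse_pct,
       ROUND(100 * (p.value - hp.value) / p.value, 2)  AS soft_parse_pct
  FROM (SELECT value FROM v$sysstat WHERE name = 'execute count') ex,
       (SELECT value FROM v$sysstat WHERE name = 'parse count (total)') p,
       (SELECT value FROM v$sysstat WHERE name = 'parse count (hard)') hp;
```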

Shared Pool Statistics:

Memory Usage %: the amount of shared pool memory occupied.
% SQL with executions>1: the percentage of SQLs executed more than once.
% Memory for SQL w/exec>1: the percentage of memory used by SQLs executed more than once.

Top 5 Timed Events:

These events are ordered by their contribution to %DB Time.
Average wait time = wait time / # of waits.
For IO-related waits, the average wait time can help in recognizing IO-related performance problems.
Taking an ASH report for the same time period can help in identifying the problematic SQLs, processes or operations.
The Top SQL with Top Events section of ASH shows the statements waiting or working on the CPU.
The Top SQL with Top Row Sources section has similar information, ordered by the row source of the execution plan.


Every wait has a wait type/wait class.
A session is either consuming CPU or posting a wait event.
Analyzing wait events is the key to troubleshooting a performance problem.
Wait events are the results of underlying problems.
Sometimes the root cause of a performance problem can be a wait event which has no big direct impact on the system resources (for example, on DB Time).
For example, latch contention can cause high CPU usage at the OS level => CPU starvation at the OS level => other waits are inflated.

Wait Classes:

User I/O - waits for user IO
System I/O - waits for background process IO
Concurrency - waits for internal database resources.
Cluster - waits related to RAC resources
Commit - contains one wait event - log file sync - wait for redo log write confirmation after a commit
Application - waits resulting from user application code
Administrative - waits resulting from DBA commands that cause users to wait
Configuration - waits caused by inadequate configuration of database or instance resources
Idle - waits that signify the session is inactive, waiting for work
Other - Waits which should not typically occur on a system
Queue - wait events that signify delays in obtaining additional data in a pipelined environment
Scheduler- waits related to Resource Manager

Wait Events:

*Db file sequential read - single block accessed via index, rowid, rollback.

*Db file scattered read - multi-block read via full table scan or index fast full scan.

These waits have 3 attributes: P1 = file no, P2 = block no (starting block number), P3 = blocks (the number of blocks read).

*So the following query can be used to identify the object;

SELECT relative_fno, owner, segment_name, segment_type
  FROM dba_extents
 WHERE file_id = &file
   AND &block BETWEEN block_id AND block_id + blocks - 1;

*Direct path read - generally used by Oracle when reading blocks directly into PGA memory (as opposed to the buffer cache). Direct path reads are faster than scattered reads and have less impact on other processes because they avoid latches.

*Read by other session/buffer busy waits - waits related to contention on hot blocks.
This wait happens when a session wants to access a database block in the buffer cache but cannot, because the buffer is "busy".
This wait event means either:
Another session is reading the block into the buffer (starting with 10g, this wait time is counted in the read by other session wait), or
Another session holds the buffer in a mode incompatible with our request.

This wait event has 3 attributes: P1 = file no, P2 = block no, P3 = id (reason code), or block class# in 10g.

So the following query can be used to identify the object;

SELECT relative_fno, owner, segment_name, segment_type
  FROM dba_extents
 WHERE file_id = &file
   AND &block BETWEEN block_id AND block_id + blocks - 1;

*Log file sync - when a user session commits or rolls back, Oracle/LGWR needs to flush the contents of the log buffer to the redo log files. Log file sync waits occur during these operations.
When we wait on log file sync, we actually wait for: LGWR to wake up if idle, gather the redo to be written, issue the IO (this IO is a log file parallel write and can be monitored using the log file parallel write wait event), post-process after the IO and acknowledge the user session/foreground. Finally, we wait for the user session/foreground to wake up.

So the likely bottleneck behind log file sync is the IO, which is the log file parallel write.
If avg wait time per log file sync operation ≈ avg wait time per log file parallel write operation -> IO is the bottleneck.
If avg wait time per log file sync operation >> avg wait time per log file parallel write operation -> CPU bottleneck, run queue problem or a bug. In such a situation, check CPU consumption.
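As a quick check, the two averages can be compared live from V$SYSTEM_EVENT (cumulative since instance startup); a sketch:

```sql
-- Sketch: compare avg "log file sync" and "log file parallel write" times.
-- TIME_WAITED_MICRO is cumulative microseconds since instance startup.
SELECT event,
       total_waits,
       ROUND(time_waited_micro / NULLIF(total_waits, 0) / 1000, 2) AS avg_wait_ms
  FROM v$system_event
 WHERE event IN ('log file sync', 'log file parallel write');
```

If the two averages are close, look at the redo log IO; if log file sync is much larger, look at the CPU and run queue.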

*Enqueues - enqueues are locks. They are held by database sessions in a mode that blocks the lock requests of others.
This type of wait has a P1 parameter which contains the lock type & the mode requested.
enq: TX - row lock contention is a type of lock we often see. It represents transaction locks.
If its mode is 6, there is an application issue (true row-level contention).
If its mode is 4, there is unique key/foreign key/bitmap index contention.

Note that: insert /*+ append */ - during direct-path INSERT, the database obtains exclusive locks on the table (or on all partitions of a partitioned table).
enq: TM - table modification with lock mode=6 (exclusive) occurs with the append hint, which means no concurrent DML can be run on the table during the append operation. Only queries can be run, but they will show the data as it was before the append.
The MOS document 1317447.1 explains the append hint and TM enqueue lock behaviour.
Unindexed foreign key constraints and truncate operations can also be the cause of enq: TM waits.
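The lock type and requested mode can be decoded from the P1 value of the waiting sessions, using the bit layout Oracle documents for enqueue waits; a sketch:

```sql
-- Sketch: decode enqueue type and requested mode from P1 of sessions
-- currently waiting on an enqueue.
SELECT sid,
       event,
       CHR(BITAND(p1, -16777216) / 16777215) ||
       CHR(BITAND(p1, 16711680) / 65535)      AS lock_type,
       BITAND(p1, 65535)                      AS mode_requested
  FROM v$session
 WHERE event LIKE 'enq:%';
```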

Example Scenario:
You have indexes on tables which are accessed by the application very often.
These indexes are on columns whose values are populated with monotonically increasing values by a sequence or something similar.
When you look at your AWR, you see db file sequential read, enq: TX - index contention and buffer busy waits in the Top 5 Timed Events.
If your application is inserting into such tables, there will be inserts into the indexes too, and enq: TX - index contention caused by leaf/branch block splits is the wait event you may see in this scenario. This may slow you down.
You will see this kind of index in the Segments by Buffer Busy Waits section.
At this point, if that is the scenario, you can use reverse key indexes to eliminate the contention caused by the inserts. On the other hand, this can affect your range scan performance. You can also use hash partitioned indexes as an alternative; this index type spreads the contention across the partitions.
Another option to alleviate the problem is to increase the cache size of the sequences; the NOORDER keyword should also be used to avoid an additional cause of SQ enqueue contention, which is the forced ordering of queued sequence values.
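The remedies above can be sketched as DDL; the object names here are hypothetical:

```sql
-- Sketch: possible remedies for right-hand index contention
-- (orders, orders_seq and the index names are hypothetical).

-- Option 1: a reverse key index spreads inserts across leaf blocks,
-- at the cost of index range scan performance.
CREATE INDEX orders_id_rix ON orders (order_id) REVERSE;

-- Option 2: a hash partitioned index spreads the contention across partitions.
CREATE INDEX orders_id_hix ON orders (order_id)
  GLOBAL PARTITION BY HASH (order_id) PARTITIONS 8;

-- Option 3: a larger sequence cache plus NOORDER reduces SQ enqueue contention.
ALTER SEQUENCE orders_seq CACHE 1000 NOORDER;
```

Pick one remedy per index; reverse key and hash partitioning address the block contention, while the sequence change addresses SQ enqueue contention.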

*Latch Waits -
Latches are like locks, but they are defined for protecting memory structures.
They are low level serialization mechanisms used to protect shared data structures in the SGA.
By using latches, Oracle prevents more than one process from executing the same critical code section at a given time.
Latches differ from locks in the ordering of waiters. There is no ordered queue of waiters for a latch. Waiters spin on the CPU to get the latch.
If there are multiple waiters for the same latch, they all spin on the CPU and retry getting the latch, and there is no order between them.
That is, anyone might get the latch, and conceivably the first one attempting to obtain the latch might be the last one to actually get it.
There are two modes of latch requests: willing-to-wait mode and no-wait mode.
In willing-to-wait mode the process will loop, wait, and request again until the latch is obtained. Examples: the shared pool and library cache latches.
In no-wait mode the process requests the latch and, if it is not available, instead of waiting, another latch is requested. Only when all fail does the server process have to wait. Example: the redo copy latch.

If a process cannot get a latch immediately, it spins, tries again, and if the latch is still unavailable, spins again. When the process fails to get the latch on the first try, the miss count statistic (also available in AWR - Latch Sleep Breakdown) is incremented.
The loop is repeated up to a maximum number of times (parameter _SPIN_COUNT). If after this entire loop the latch is still not available, the process must yield the CPU and go to sleep. Initially it sleeps for one centisecond. This time is doubled in every subsequent sleep.

Some Common Latches:
latch: library cache - protects operations within the library cache itself. This latch must be acquired in order to add a new statement to the library cache.
latch: shared pool - protects critical operations when allocating and freeing memory in the shared pool. To reduce contention on either of these two latches: avoid hard parses when possible, parse once and execute many, eliminate literal SQL, and check that the shared pool size is adequate.
latch: library cache pin - acquired when a statement in the library cache is re-executed. Using private rather than public synonyms, or direct object references such as OWNER.TABLE, might help reduce this latch.
latch: row cache objects - protects access to the data dictionary cache in the SGA. This latch is acquired when loading, referencing and freeing objects in the data dictionary cache. To reduce contention for this latch, the size of the shared pool should be increased.
latch: cache buffers chains - acquired whenever a block in the buffer cache is accessed (pinned). Reducing contention for the cache buffers chains latch usually requires reducing logical I/O rates by tuning.
latch: cache buffers LRU chain - acquired in order to introduce a new block into the buffer cache and when writing a buffer back to disk, specifically when scanning the LRU (least recently used) chain containing all the dirty blocks in the buffer cache. To reduce it, check whether the buffer cache size is adequate.
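To see which latches an instance is actually sleeping on, V$LATCH can be ordered by sleeps; a sketch:

```sql
-- Sketch: top 10 latches by number of sleeps since instance startup.
SELECT *
  FROM (SELECT name, gets, misses, sleeps
          FROM v$latch
         ORDER BY sleeps DESC)
 WHERE ROWNUM <= 10;
```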

So, if you see latch contention in AWR, you can understand where Oracle spends time in its code, because latches are used to protect code paths from being run by multiple processes concurrently.
Thus, you can understand what Oracle is doing and take the necessary actions to decrease the load. What I mean is, when you see something in AWR, it is not necessarily the problem itself; it can be the result of something else.
For example, if you see a high number of latch: library cache waits, then you need to focus on the question why.

*Buffer Cache -

I will not explain what Oracle Buffer Cache is, as I suppose you already know that.
On the other hand, I will give information about how Oracle buffers are accessed in memory, as this is related to AWR; contention on the buffers can slow you down.
First of all, access to the Oracle buffer cache is based on hash buckets and doubly linked lists.
The following figure illustrates the access mechanism used in the buffer cache.

As the picture above represents, to access a block in the buffer cache, Oracle first hashes the block address (DBA - data block address) and obtains the bucket number. Then it gets the bucket latch and looks at the buffer headers. After that, it finds the block in the cache and reads it. If the block is not in the buffer cache, it is read from disk. So in this process we need a bucket latch; this is under the control of latch: cache buffers chains.

So we can say that accessing hot blocks, which are accessed concurrently, increases the requests for the same buffer headers, which increases contention on these cache buffers chains latches.
As cache buffers chains is a latch, processes trying to get it use our CPU cycles and our work is serialized, so we need to tune this. In order to decrease this wait event, we have to consider reducing the number of buffers read from the buffer cache by tuning the SQLs, partitioning the index containing the hot block by hash, and spreading out the data to reduce contention.
Note that: the copies of a block needed to satisfy consistent read operations are maintained in the same bucket as the block itself, so the same latch is required to reach the buffer containing the consistent read image of that block. So when there is DML activity going on against the hot blocks, the same latches are required; thus even the DML increases the chance of this type of latch contention.
On the other hand, in Oracle 11g there is an optimization in reaching the buffers. It is a feature about consistent gets for reducing latch contention. It is controlled with the _fastpin_enable hidden parameter. I will explain this hidden parameter in my next post.
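One way to find the blocks behind cache buffers chains contention is to start from the latch children with the most sleeps and then map a latch address to its buffers via X$BH; a sketch (X$ tables require SYS and their layout is version dependent):

```sql
-- Sketch: cache buffers chains latch children with the most sleeps.
SELECT *
  FROM (SELECT addr, gets, misses, sleeps
          FROM v$latch_children
         WHERE name = 'cache buffers chains'
         ORDER BY sleeps DESC)
 WHERE ROWNUM <= 10;

-- Then, as SYS, the buffers protected by one such latch address can be
-- listed from X$BH (high TCH = touch count often marks the hot block):
--   SELECT obj, dbarfil, dbablk, tch FROM x$bh WHERE hladdr = '&addr';
```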

*Library Cache-

We have a similar pinning mechanism in the library cache. SQL statements/text are hashed and put in a structure like a hash table. Oracle finds the related bucket using these hashes; the bucket is searched for the SQL, and if the SQL is not present, Oracle parses it.
A locking mechanism is used in the library cache too. Oracle uses locks and pins for library cache objects. For example, to execute a cursor, Oracle pins the cursor object in share mode, but does not put any locks on it. On the other hand, during cursor compilation, Oracle pins and locks the object in exclusive mode.

To implement such protection, we have latches of course;
The library cache latches are: library cache pin allocation, library cache lock allocation, library cache lock, library cache, library cache pin and library cache load lock.

The following picture represents the protection mechanism in the library cache;

So, as I explained above, we have share mode and exclusive mode pins & locks in the library cache. Share mode locks do not naturally create contention, but exclusive mode locks and pins do, as it is in their nature. So in order to have an environment without contention, we need to decrease the hard parsing activity in the first place. Hard parsing is caused by a low shared pool size and/or SQLs which are not shared -- due to the use of literals. Invalidations can also trigger contention in the library cache, as cursor compilation will be needed. High soft parsing activity can also be a reason for contention in the library cache. If you have high soft parsing, consider increasing session_cached_cursors.
So we need to take the necessary actions if we see high values for the library cache latches.

Note that in 11g there are mutexes, which replace the library cache latches. Mutexes were first introduced in 10g. Mutexes work faster than latches, as they use the CPU's compare-and-swap instructions.
There is a famous mutex wait event, cursor: pin S wait on X, which is worth explaining.
This wait event occurs when a session requests a shared pin on an object while another session has an exclusive pin on it. It occurs mostly when a session wants to execute a cursor while another session is parsing it. If you see this wait, use its first argument (P1 = the hash value of the cursor) to find the related statement:

SELECT sql_id, sql_text, version_count
  FROM v$sqlarea
 WHERE hash_value = &p1_value;

There are also other wait events related to mutexes.
cursor: mutex X is one of them. This wait event occurs when trying to get the cursor's mutex in exclusive mode, to parse it.
cursor: mutex S is another wait, which occurs when trying to get the cursor's mutex in share mode, to parse it.
cursor: pin S occurs when trying to get the cursor's pin in share mode, to execute it.
library cache: mutex S occurs when trying to get a library cache mutex in share mode, to perform a library cache operation.
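The holder behind a cursor: pin S wait on X can often be identified from the waiter's P2 value; a sketch (this bit layout holds on many 10.2/11g 64-bit versions, so treat it as an assumption and verify on your release):

```sql
-- Sketch: sessions waiting on "cursor: pin S wait on X" and, on many
-- 10.2/11g 64-bit versions, the SID of the mutex holder encoded in
-- the high-order bits of P2.
SELECT sid,
       p1                       AS cursor_hash_value,
       TRUNC(p2 / POWER(2, 32)) AS blocking_sid
  FROM v$session
 WHERE event = 'cursor: pin S wait on X';
```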

*IO Stats Section

IO stats per tablespace and datafile are another area we need to focus on in AWR reports.
When we see User I/O type waits (like db file sequential read or db file scattered read) in the Top 5 Timed Events section, we need to analyze the IO stats sections carefully.

For example, following is taken from an IO bound system's AWR report.

Event                    Waits      Time(s)  Avg Wait(ms)  % Total Call Time  Wait Class
db file sequential read  1,599,822  64,774   40            44.5               User I/O
read by other session    2,268,702  46,324   20            31.8               User I/O
log file sync            38,909     13,074   336           9.0                Commit
db file scattered read   130,722    9,146    70            6.3                User I/O
CPU time                            6,771                  4.6

As you see above, these are all I/O waits and they are consuming nearly all of the database time.
So in this scenario, we need to look at the IO stats section of the same AWR to check whether the IO subsystem is sufficient to meet the IO requests of the running applications.

Here is the IO stats section of the same AWR report;

AWR File IO Stats Section
The important thing here is to understand whether the IO subsystem is responding slowly or the load is simply very high. So we should not draw conclusions directly after seeing these values. The reason may be SQLs which are making Oracle issue more IO requests than the underlying storage system can handle.
To understand this, we need to consider the response times of the highly loaded tablespaces/datafiles as well as the response times of the quiet tablespaces. The IO waits in the Top 5 Timed Events section should also correspond with these values. In other words, if your IO subsystem is a bottleneck, you should see at least one of the following wait events in the Top 5:

db file sequential read – random I/O, single block, mostly index IO
db file scattered read – sequential I/O, full table scan, index fast full scan
direct path read – sequential I/O, full table scan, read directly into the PGA
direct path read temp – read I/O for sorts
direct path write temp – write I/O for sorts
read by other session – sessions waiting while another session is reading the same block from disk.

In the File IO Stats section you will see statistics like the average number of read requests per second, the average read time in milliseconds, write waits, the number of buffer waits and the average wait time in milliseconds for buffer waits:

Av Reads/s - average number of read requests per second.
Av Rd(ms) - average read time in milliseconds.
Av Blks/Rd - average number of blocks per read.
Buffer Waits - number of buffer busy waits.
Av Buf Wt(ms) - average wait time in milliseconds for buffer busy waits.

There is also a Tablespace IO Stats section in AWR, in which the same statistics are simply grouped by tablespace.
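Similar figures can be pulled live from V$FILESTAT, which reports times in centiseconds (hence the *10 to get milliseconds); a sketch:

```sql
-- Sketch: per-datafile average read time and blocks per read, from V$FILESTAT.
-- READTIM is in centiseconds; multiply by 10 for milliseconds.
SELECT df.name,
       fs.phyrds,
       ROUND(fs.readtim * 10 / NULLIF(fs.phyrds, 0), 1) AS avg_rd_ms,
       ROUND(fs.phyblkrd / NULLIF(fs.phyrds, 0), 1)     AS avg_blks_per_rd
  FROM v$filestat fs, v$datafile df
 WHERE fs.file# = df.file#
 ORDER BY fs.phyrds DESC;
```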

These values indicate read and write operations of ARCH, Buffer Cache Reads, Data Pump, DBWR, Direct Reads, Direct Writes, LGWR, RMAN, Recovery, Smart Scan, Streams AQ, XDB, and Others.

Among these values, we need to focus on Av Rd(ms) in the first place. It gives the average time waited for a disk read; Av Rd(ms) indicates the average time to read data from disk into a buffer, and anything under 20 ms is usually acceptable. Note that, depending on your hardware, even 8 ms may be slow for you, so check these values against your hardware.
This is an important metric, but it should be combined with Av Blks/Rd. A file may receive only a small number of single block reads, which can inflate its average read time without indicating a real problem.
That is, idle files should be ignored, because they can produce high IO values due to disk spin-up and so on.
Also, a multiblock read is timed like a single block IO: we start the timer, issue the read of one or more blocks, and stop the timer. (AskTom)

Av Buf Wt(ms) is also very important: in a system with low DML activity and a well configured buffer cache and IO subsystem, there should be almost zero buffer waits. Buffer waits help us see how long sessions/processes have to wait for a buffer which is being read into the buffer cache.

If we consider the values in the above picture;

After all this analysis, if you decide you have an IO problem, don't upgrade your hardware directly; instead, do the following:

  • Reduce the I/O requirements of the database by tuning SQL.
  • Reduce the I/O requirements of the database by tuning instance parameters.
  • Optimize I/O at the operating system level.
  • Balance the database I/O by using Oracle ASM (Automatic Storage Management).
  • Balance the database I/O by using striping, RAID, SAN or NAS.
  • Redistribute database I/O by manual placement of database files across different filesystems, controllers and physical devices.
  • Reduce the data volume of the current database by moving older data out.
  • Invest in more and/or faster hardware.

That's all for now. I will write more on AWR investigation, as well as RAC-related statistics, in my next post (AWR Analysis -- Part 2).
