After migrating a 10gR2 EBS database from a 32-bit Linux machine to a new 64-bit Linux machine, performance started to degrade.
The load average was consistently between 20 and 30, and end users were experiencing slowness in the EBS OAF and Forms screens.
Although the users were facing this slowness, they also reported that their Discoverer reports were completing faster than they had on the old 32-bit Linux server.
The major differences between the old and new machines were as follows:
Old machine Linux version: Red Hat 5.5
Old machine arch: 32-bit
Old machine SGA: 8 GB (VLM mode implemented with an in-memory filesystem (shmfs/tmpfs, ramfs))
Old machine PGA: 2 GB
Old machine total memory: 31 GB
Old machine CPU core count: 16 physical cores (no hyper-threading)
New machine Linux version: Red Hat 5.9
New machine arch: 64-bit
New machine SGA: 18 GB
New machine PGA: 10 GB
New machine total memory: 48 GB (HugePages enabled for the SGA)
New machine CPU core count: 16 cores (16 virtual sockets, one core each; no hyper-threading)
So the new machine was 64-bit, had 48 GB of memory (with HugePages enabled) and the same core count as the old server, but it was a VM (VMware ESX).
The important difference was that the new machine's CPU cores were virtual.
When we examined the AWR reports, we saw CPU waits all the time.
Also, in top we saw 96% user-space CPU usage all the time.
The load average was constantly high.
Discoverer reports were running all the time. They were always in the "R" state and always consuming 99% CPU.
So they were obviously the main cause of the slowness, but we had no chance to tune the Discoverer reports, because the end users had been running these same reports on the old system as well.
AWR gathered for a 30-minute interval representing the problem:
Event | Waits | Time(s) | Avg Wait(ms) | % Total Call Time | Wait Class |
---|---|---|---|---|---|
CPU time | | 19,769 | | 74.4 | |
db file sequential read | 209,680 | 850 | 4 | 3.2 | User I/O |
SQL*Net more data to client | 644,830 | 546 | 1 | 2.1 | Network |
latch: cache buffers chains | 4,576 | 389 | 85 | 1.5 | Concurrency |
TCP Socket (KGAS) | 6,313 | 317 | 50 | 1.2 | Network |
Okay, we had 19,769 seconds of CPU time.
On the other hand, we had 16 VM cores in this server, so for a 30-minute interval we had 16 * 30 * 60 = 28,800 seconds of CPU time available. That left roughly 9,031 seconds of CPU time unused.
So we had a high CPU load, but we were not in a hang situation. This high DB CPU figure meant that the database was spending most of its time on CPU (processing) rather than on I/O and other wait events.
db file sequential read was listed in second place in the Top 5 wait events. This wait indicates single-block index reads, and its average was 4 ms, which was acceptable for this environment. In addition, we were using the same disk devices, and even the same filesystems, as the old server, so I/O was not the problem.
This high CPU usage was clearly being caused by the Discoverer reports:
CPU Time (s) | Elapsed Time (s) | Executions | CPU per Exec (s) | % Total DB Time | SQL Id | SQL Module | SQL Text |
---|---|---|---|---|---|---|---|
2,190 | 2,685 | 63,231 | 0.03 | 10.10 | 6j1rnhpxu1bry | Disco10, | SELECT COUNT(T.NO) FROM XXIC_T... |
1,994 | 2,493 | 2 | 996.84 | 9.38 | 79h2r4xdn8g44 | Disco10, | SELECT o481137.CUSTOMER_NAME ... |
1,827 | 2,202 | 2 | 913.39 | 8.29 | 93bkbks2j66q5 | Disco10, | SELECT DECODE((decode(o310037... |
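For reference, a similar picture can be pulled live from the cursor cache while the reports are running. Below is a minimal sketch; the 'Disco10%' module filter and the top-10 cut-off are just illustrative, not values taken from this system.

```sql
-- Rough live equivalent of the AWR "SQL ordered by CPU Time" section.
-- CPU_TIME and ELAPSED_TIME are reported in microseconds in V$SQLAREA.
SELECT *
FROM  (SELECT sql_id,
              module,
              executions,
              ROUND(cpu_time / 1e6, 1)     AS cpu_secs,
              ROUND(elapsed_time / 1e6, 1) AS elapsed_secs,
              ROUND(cpu_time / NULLIF(executions, 0) / 1e6, 3) AS cpu_per_exec_secs
       FROM   v$sqlarea
       WHERE  module LIKE 'Disco10%'   -- illustrative module filter
       ORDER  BY cpu_time DESC)
WHERE  ROWNUM <= 10;
```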
The design of these reports was not optimized. For example, one subquery ran 63,231 times in 30 minutes, and it was a CPU-bound query. On the other hand, this was not the time to tune these SQLs or modify the reports, because they were working exactly the same way as they had on the old system.
I mean, even the SQLs and their execution plans were the same; they had not changed.
In addition, this new system was 64-bit and had a bigger SGA and more physical memory than the old one, so we were not expecting such a load on the new database server.
That is, the cause of the high CPU time had to be something else.
It was the CPU... It would have been better if I could share the CPU model information here, but the idea in this post is to demonstrate the slowness using AWR.
In the Latch Activity section, cache buffers chains had the highest number of get requests:
Latch Name | Get Requests | Pct Get Miss | Avg Slps /Miss | Wait Time (s) | NoWait Requests | Pct NoWait Miss |
---|---|---|---|---|---|---|
cache buffers chains | 810,825,589 | 0.02 | 0.03 | 389 | 1,140,617 | 0.03 |
The high cache buffers chains activity was pointing us toward the classic approach: find the hot buffers, find the sessions waiting on the latch, intersect the objects behind those hot buffers with the objects touched by the waiting queries, identify the problematic SQLs, and tune them.
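For reference, that drill-down typically looks something like the sketch below; it needs SYS access (X$BH is involved), and the sleeps threshold is arbitrary.

```sql
-- Objects whose buffers hang off the hottest 'cache buffers chains' child latches.
-- Run as SYS; the SLEEPS > 100 threshold is an arbitrary example.
SELECT o.owner,
       o.object_name,
       o.object_type,
       SUM(b.tch) AS touch_count          -- TCH = buffer touch count
FROM   x$bh b,
       dba_objects o,
       (SELECT addr
        FROM   v$latch_children
        WHERE  name = 'cache buffers chains'
        AND    sleeps > 100) hot
WHERE  b.hladdr = hot.addr
AND    b.obj    = o.data_object_id
GROUP  BY o.owner, o.object_name, o.object_type
ORDER  BY touch_count DESC;
```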
On the other hand, I did not go down that path, as this database was running exactly the same workload as it had on the old database server.
Knowing that latches are acquired when a process needs them and released when the process finishes its work in the relevant memory structure, we decided that the time spent between acquiring and releasing a latch is also very important.
That is, if our processes could do their work more quickly, latches would become available sooner.
In other words, the time between acquiring and releasing the latches would be shorter, and that would decrease CPU time instance-wide, as processes would not be burning CPU while spinning on the latches.
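A quick way to see that effect in the latch statistics is to compare spin gets (misses resolved by burning CPU while spinning) with sleeps; a minimal sketch:

```sql
-- Latch misses resolved by spinning consume CPU; misses that go to sleep show up as waits.
SELECT name, gets, misses, spin_gets, sleeps, wait_time
FROM   v$latch
WHERE  name IN ('cache buffers chains', 'library cache');
```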
At this point, we changed the CPUs. That is, we changed the hardware underneath the virtual machine that was running the database.
Note that the disks and the memory stayed the same.
After the CPU change, the cache buffers chains get count was still very high:
Latch Name | Get Requests | Pct Get Miss | Avg Slps /Miss | Wait Time (s) | NoWait Requests | Pct NoWait Miss |
---|---|---|---|---|---|---|
cache buffers chains | 3,439,451,402 | 2.09 | 0.00 | 0 | 823,885 | 0.04 |
With the new CPUs, there were almost no waits on latch: cache buffers chains:
Event | Waits | %Time -outs | Total Wait Time (s) | Avg wait (ms) | Waits /txn |
---|---|---|---|---|---|
latch: cache buffers chains | 17,424 | 0.00 | 0 | 0 | 0.57 |
These values were as follows with the old CPUs:
Event | Waits | %Time -outs | Total Wait Time (s) | Avg wait (ms) | Waits /txn |
---|---|---|---|---|---|
latch: cache buffers chains | 4,576 | 0.00 | 389 | 85 | 0.34 |
So, even though there were far more latch get requests with the new CPUs, there was almost no wait time.
Look: we had more db file sequential reads and more cache buffers chains gets, but our CPU time was lower :)
Note that this report was taken for a 1-hour interval. Even though it covers a longer period, the CPU time was lower than the value with the old CPUs:
Event | Waits | Time(s) | Avg Wait(ms) | % Total Call Time | Wait Class |
---|---|---|---|---|---|
CPU time | | 10,730 | | 96.0 | |
db file sequential read | 213,556 | 165 | 1 | 1.5 | User I/O |
TCP Socket (KGAS) | 11,950 | 162 | 14 | 1.4 | Network |
SQL*Net more data to client | 1,359,692 | 93 | 0 | .8 | Network |
log file parallel write | 28,298 | 47 | 2 | .4 | System I/O |
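If the raw workload repository data is available, the before/after behavior of a single wait event can also be compared directly from the AWR tables. A sketch is below; the snapshot ID range and instance number are placeholders, and the DBA_HIST_* views require the Diagnostics Pack license.

```sql
-- Delta of one wait event between consecutive AWR snapshots (IDs are placeholders).
SELECT snap_id,
       total_waits - LAG(total_waits) OVER (ORDER BY snap_id) AS waits_in_interval,
       ROUND((time_waited_micro
              - LAG(time_waited_micro) OVER (ORDER BY snap_id)) / 1e6, 1) AS seconds_in_interval
FROM   dba_hist_system_event
WHERE  event_name      = 'latch: cache buffers chains'
AND    dbid            = (SELECT dbid FROM v$database)
AND    instance_number = 1            -- placeholder
AND    snap_id BETWEEN 1000 AND 1010  -- placeholder snapshot range
ORDER  BY snap_id;
```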
We had almost the same load on the Oracle database. Note that these values are per second and per transaction.
Load Profile (with new CPUs) -- 1-hour AWR
Statistic | Per Second | Per Transaction |
---|---|---|
Redo size: | 58,524.70 | 6,768.44 |
Logical reads: | 509,961.79 | 58,977.58 |
Block changes: | 338.90 | 39.19 |
Physical reads: | 171.86 | 19.88 |
Physical writes: | 24.92 | 2.88 |
User calls: | 2,476.00 | 286.35 |
Parses: | 170.12 | 19.67 |
Hard parses: | 4.22 | 0.49 |
Sorts: | 296.35 | 34.27 |
Logons: | 0.96 | 0.11 |
Executes: | 7,877.99 | 911.10 |
Transactions: | 8.65 |
Load Profile (with OLD CPUs) -- 30-minute AWR
Statistic | Per Second | Per Transaction |
---|---|---|
Redo size: | 38,658.42 | 6,072.38 |
Logical reads: | 212,494.62 | 33,378.16 |
Block changes: | 195.19 | 30.66 |
Physical reads: | 294.80 | 46.31 |
Physical writes: | 22.43 | 3.52 |
User calls: | 1,471.26 | 231.10 |
Parses: | 94.94 | 14.91 |
Hard parses: | 4.28 | 0.67 |
Sorts: | 157.24 | 24.70 |
Logons: | 0.84 | 0.13 |
Executes: | 7,478.48 | 1,174.70 |
Transactions: | 6.37 |
With the new CPUs we had more redo, more logical reads, more block changes, more physical writes, almost the same hard parse rate, more sorts, more executes and more transactions (physical reads per second were actually a bit lower), yet CPU time waits were low :)
Almost the same SQLs were running and the same Discoverer reports were at the top, but this time CPU time was not so high instance-wide.
So this showed that the approach was correct: the CPUs were the bottleneck, and the new CPUs fixed the problem. The new CPUs were again virtual, but the CPU family of the new hardware was different; there were major differences in the CPU specs, such as hyper-threading.
Okay, so far so good. But we had one more question: how had this system been dealing with this load earlier? This database used to run on a 32-bit server with an 8 GB SGA, and we were not seeing these CPU time waits in its AWRs. So why?
Let's start with CPU utilization...
High CPU utilization may not actually be a problem; it could just mean that the system is being well utilized. If your sessions are not waiting on other resources, they end up either running on the CPU or waiting for it, and if your queries are CPU-bound, CPU time will naturally increase.
Time shown as "DB CPU" or "CPU time" in AWR and similar reports refers to the amount of CPU time the sessions actually used, and it is usually based on information from the OS itself (e.g., getrusage() or similar OS calls on Unix). This measure typically does not include time spent waiting on CPU run queues.
If we had ASH available it would be even better, because the time shown as "CPU + Wait for CPU" in ASH reports gives an idea of the time sessions were running on, or waiting to run on, the CPU.
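A minimal ASH sketch along those lines, assuming the Diagnostics Pack is licensed; the 30-minute window is arbitrary:

```sql
-- 'ON CPU' samples include sessions that are runnable but queued for CPU,
-- so this is closer to "CPU + Wait for CPU" than DB CPU is.
SELECT sql_id,
       SUM(CASE WHEN session_state = 'ON CPU'  THEN 1 ELSE 0 END) AS on_cpu_samples,
       SUM(CASE WHEN session_state = 'WAITING' THEN 1 ELSE 0 END) AS waiting_samples
FROM   v$active_session_history
WHERE  sample_time > SYSDATE - 30 / 1440
GROUP  BY sql_id
ORDER  BY on_cpu_samples DESC;
```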
Anyway, the bottom line is that if you are not waiting on other things, you will be on the CPU and your CPU time will increase. Many sessions will use the CPUs in parallel, the load on the server will rise, and you will start to feel performance degradation system-wide.
Let's have a look at the AWR data gathered from the old 32-bit server, which was running the same database, same queries and same reports, with almost the same number of sessions connected. (The load profile and Top 5 wait events from the 64-bit server with the old CPUs are repeated below for side-by-side comparison.)
Do not worry about the LNS wait; this system had a standby database when this AWR snapshot was taken.
Load Profile (with the OLD CPUs -- 64-bit server)
Statistic | Per Second | Per Transaction |
---|---|---|
Redo size: | 38,658.42 | 6,072.38 |
Logical reads: | 212,494.62 | 33,378.16 |
Block changes: | 195.19 | 30.66 |
Physical reads: | 294.80 | 46.31 |
Physical writes: | 22.43 | 3.52 |
User calls: | 1,471.26 | 231.10 |
Parses: | 94.94 | 14.91 |
Hard parses: | 4.28 | 0.67 |
Sorts: | 157.24 | 24.70 |
Logons: | 0.84 | 0.13 |
Executes: | 7,478.48 | 1,174.70 |
Transactions: | 6.37 |
Event | Waits | Time(s) | Avg Wait(ms) | % Total Call Time | Wait Class |
---|---|---|---|---|---|
CPU time | | 19,769 | | 74.4 | |
db file sequential read | 209,680 | 850 | 4 | 3.2 | User I/O |
SQL*Net more data to client | 644,830 | 546 | 1 | 2.1 | Network |
latch: cache buffers chains | 4,576 | 389 | 85 | 1.5 | Concurrency |
TCP Socket (KGAS) | 6,313 | 317 | 50 | 1.2 | Network |
Load Profile (with the OLD physical server -- 32-bit)
Statistic | Per Second | Per Transaction |
---|---|---|
Redo size: | 38,685.46 | 5,267.77 |
Logical reads: | 348,930.51 | 47,513.61 |
Block changes: | 218.78 | 29.79 |
Physical reads: | 1,026.41 | 139.77 |
Physical writes: | 282.69 | 38.49 |
User calls: | 2,255.70 | 307.16 |
Parses: | 138.65 | 18.88 |
Hard parses: | 3.75 | 0.51 |
Sorts: | 205.22 | 27.95 |
Logons: | 0.86 | 0.12 |
Executes: | 6,003.52 | 817.50 |
Transactions: | 7.34 |
Event | Waits | Time(s) | Avg Wait(ms) | % Total Call Time | Wait Class |
---|---|---|---|---|---|
CPU time | | 15,466 | | 93.5 | |
db file sequential read | 755,438 | 389 | 1 | 2.4 | User I/O |
db file scattered read | 302,776 | 290 | 1 | 1.8 | User I/O |
LNS wait on SENDREQ | 389 | 98 | 253 | .6 | Network |
log file sync | 17,380 | 93 | 5 | .6 | Commit |
So the old 32-bit system had more db file sequential reads, more db file scattered reads, and a significant amount of CPU time.
Let's take a look at the Operating System Statistics:
Operating System Statistics (OLD physical server -- 32-bit)
Note that this AWR was for a 60-minute interval.
Statistic | Total |
---|---|
BUSY_TIME | 1,868,106 |
IDLE_TIME | 3,894,329 |
IOWAIT_TIME | 65,502 |
NICE_TIME | 48 |
SYS_TIME | 222,197 |
USER_TIME | 1,625,270 |
LOAD | 5 |
RSRC_MGR_CPU_WAIT_TIME | 0 |
PHYSICAL_MEMORY_BYTES | 34,076,270,592 |
NUM_CPUS | 16 |
NUM_CPU_SOCKETS | 4 |
- Total Buffer Gets: 1,262,389,050
Latch Name | Get Requests | Pct Get Miss | Avg Slps /Miss | Wait Time (s) | NoWait Requests | Pct NoWait Miss |
---|---|---|---|---|---|---|
Cache buffers chains | 2,406,314,930 | 0.28 | 0.00 | 0 | 5,506,188 | 0.00 |
library cache | 13,180,259 | 0.04 | 0.01 | 0 | 112,143 | 0.09 |
library cache load lock | 10,348 | 0.00 | 0 | 0 | ||
library cache lock | 4,375,448 | 0.00 | 0.00 | 0 | 0 | |
library cache lock allocation | 181,145 | 0.00 | 0 | 0 | ||
library cache pin | 5,453,561 | 0.01 | 0.00 | 0 | 0 | |
library cache pin allocation | 69,931 | 0.00 | 0 | 0 |
Operating System Statistics (with the OLD CPUs -- 64-bit server)
Note that this AWR was for a 30-minute interval.
Statistic | Total |
---|---|
BUSY_TIME | 1,674,495 |
IDLE_TIME | 571,810 |
IOWAIT_TIME | 39,970 |
NICE_TIME | 0 |
SYS_TIME | 296,542 |
USER_TIME | 1,247,579 |
LOAD | 18 |
RSRC_MGR_CPU_WAIT_TIME | 0 |
PHYSICAL_MEMORY_BYTES | 50,750,922,752 |
NUM_CPUS | 16 |
NUM_CPU_SOCKETS | 4 |
- Total Buffer Gets: 444,663,900
Latch Name | Get Requests | Pct Get Miss | Avg Slps /Miss | Wait Time (s) | NoWait Requests | Pct NoWait Miss |
---|---|---|---|---|---|---|
cache buffers chains | 810,825,589 | 0.02 | 0.03 | 389 | 1,140,617 | 0.03 |
library cache | 7,053,564 | 0.12 | 0.24 | 99 | 409 | 0.24 |
library cache load lock | 5,011 | 0.00 | 0 | 7 | 0.00 | |
library cache lock | 1,622,604 | 0.01 | 0.31 | 0 | 0 | |
library cache lock allocation | 62,053 | 0.00 | 1.00 | 0 | 0 | |
library cache pin | 2,719,841 | 0.02 | 0.30 | 4 | 3 | 0.00 |
library cache pin allocation | 31,816 | 0.00 | 0 | 0 |
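For reference, the same host-level counters that AWR shows in these sections can be read live from V$OSSTAT; a minimal sketch (times are in hundredths of a second and cumulative, so utilization should be computed from the delta between two samples):

```sql
-- Host CPU counters as seen by the instance.
SELECT stat_name, value
FROM   v$osstat
WHERE  stat_name IN ('NUM_CPUS', 'LOAD', 'BUSY_TIME', 'IDLE_TIME',
                     'USER_TIME', 'SYS_TIME', 'IOWAIT_TIME');
```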
Also, when we look at cache buffers chains, we see that there were almost no waits for it on the old 32-bit system. Note that total buffer gets were much higher (roughly three times higher: 1,262,389,050 vs 444,663,900) on the old 32-bit server, and even so, it had almost no waits for cache buffers chains.
The situation was much the same for the library cache latches.
The bottom line: although the old 32-bit server was doing more work, its CPU utilization was far lower than that of the new 64-bit server with the old CPUs. Over its 60-minute interval the old physical server was roughly 32% busy (BUSY_TIME / (BUSY_TIME + IDLE_TIME) = 1,868,106 / 5,762,435) with a load average of 5, while over a 30-minute interval the new 64-bit server with the old CPUs was roughly 75% busy (1,674,495 / 2,246,305) with a load average of 18.
So this was further evidence of the CPU problem on the 64-bit server. Of course, there could be a more detailed explanation, but since the SQLs were the same, the load was the same and the storage was the same, we requested a CPU change on the newly migrated 64-bit server, and after the CPU change the performance problem went away.