We recently ran into a problem on Oracle Linux 6.6: it was not using all the CPU cores available on the server. The server was running on VMware, but the problem was not actually in the VM.
The configuration was as follows:
socket1 = cpu0,cpu1,cpu2,cpu3
socket2 = cpu4,cpu5,cpu6,cpu7
The problem was in the utilization.
That is, when running the 3.8.13-98.1.1.el6uek.x86_64 kernel, Oracle Linux 6.6 was using only 4 CPU cores. We analyzed the CPU utilization carefully, and the scheduler simply did not allocate any work to the last 4 CPU cores.
On the other hand, Oracle Linux 6.6 could see all 8 CPUs.
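One way to double-check which CPUs the OS sees, and which socket each belongs to, is to read the CPU topology from sysfs. The sketch below is only an illustration and not something we ran on the affected server; it assumes the standard Linux path /sys/devices/system/cpu/cpuN/topology/physical_package_id is present, and the function name cpus_by_socket is made up for this example.

```python
import glob
import os
import re
from collections import defaultdict


def cpus_by_socket(sysfs_root="/sys/devices/system/cpu"):
    """Group logical CPU numbers by physical package (socket) id."""
    sockets = defaultdict(list)
    # cpu[0-9]* matches cpu0, cpu10, ... but not cpufreq or cpuidle.
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "topology",
                           "physical_package_id")
    for path in glob.glob(pattern):
        # path looks like <root>/cpu5/topology/physical_package_id
        cpu_dir = path.split(os.sep)[-3]
        cpu = int(re.sub(r"\D", "", cpu_dir))
        with open(path) as f:
            sockets[int(f.read().strip())].append(cpu)
    return {sock: sorted(cpus) for sock, cpus in sockets.items()}


if __name__ == "__main__":
    # Prints one line per socket, in the same style as the layout above.
    for sock, cpus in sorted(cpus_by_socket().items()):
        print("socket %d = %s" % (sock, ",".join("cpu%d" % c for c in cpus)))
```

If all 8 CPUs show up here, split across the two sockets, the kernel has detected the hardware correctly and the problem is in how work gets scheduled.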
We used the taskset utility to force a process onto one of the CPU cores that Oracle Linux normally did not utilize. The process started running on that core without any problems, and the utilization of that core rose to almost 100%, as expected.
[root@somehost opt]# taskset -c -p 6 2313
pid 2313's current affinity list: 0-7
pid 2313's new affinity list: 6
[root@somehost~]# top
top - 19:10:14 up 4 days, 6:02, 6 users, load average: 1.06, 0.63, 0.32
Tasks: 432 total, 2 running, 430 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.3%us, 0.7%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.7%us, 0.7%sy, 0.0%ni, 98.0%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.3%us, 0.7%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.7%us, 0.3%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
..
..
Cpu6 : 99.7%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
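The same pinning can also be done from a script. The following is a minimal sketch, assuming Linux and Python 3.3 or later, where os.sched_getaffinity and os.sched_setaffinity are available; the helper name pin_to_cpu is hypothetical. It does roughly what "taskset -c -p <cpu> <pid>" does and returns the old and new affinity sets.

```python
import os


def pin_to_cpu(pid, cpu):
    """Restrict `pid` to a single CPU; return (old, new) affinity sets.

    A pid of 0 means the calling process.
    """
    old = os.sched_getaffinity(pid)
    os.sched_setaffinity(pid, {cpu})
    return old, os.sched_getaffinity(pid)


if __name__ == "__main__":
    # Pin this script itself to the highest-numbered CPU it may run on.
    target = max(os.sched_getaffinity(0))
    old, new = pin_to_cpu(0, target)
    print("current affinity list:", sorted(old))
    print("new affinity list:", sorted(new))
```

Pinning another user's process needs the appropriate privileges, which is why the taskset command above was run as root.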
So, when forced, Oracle Linux 6.6 with the 3.8.13-98.1.1.el6uek.x86_64 kernel could use all the cores. Left to itself, however, the scheduler did not utilize the 4 cores of the second CPU socket, even under heavy load, as seen below (cpu4, cpu5, cpu6 and cpu7 are not utilized).
top - 12:51:32 up 3 days, 23:43, 3 users, load average: 16.74, 9.82, 5.30
Tasks: 454 total, 18 running, 436 sleeping, 0 stopped, 0 zombie
Cpu0 : 92.2%us, 6.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 1.2%si, 0.0%st
Cpu1 : 94.2%us, 4.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 1.1%si, 0.0%st
Cpu2 : 93.4%us, 4.8%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 1.8%si, 0.0%st
Cpu3 : 92.7%us, 5.4%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 1.9%si, 0.0%st
Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32687204k total, 32521340k used, 165864k free, 104284k buffers
Swap: 33554428k total, 32992k used, 33521436k free, 21020212k cached
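This pattern (some cores saturated, the others completely idle) can be spotted automatically from saved top output. The sketch below is only a rough heuristic: the function names and the thresholds (at most 10% idle counts as busy, at least 99% idle counts as parked) are assumptions chosen for illustration. It flags cores only when every core is at one of the two extremes.

```python
import re

# Matches per-CPU lines such as "Cpu4 : 0.0%us, ... ,100.0%id, ..."
CPU_LINE = re.compile(r"^Cpu(\d+)\s*:.*?([\d.]+)%id", re.MULTILINE)

SAMPLE = """\
Cpu0 : 92.2%us, 6.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 1.2%si, 0.0%st
Cpu1 : 94.2%us, 4.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 1.1%si, 0.0%st
Cpu2 : 93.4%us, 4.8%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 1.8%si, 0.0%st
Cpu3 : 92.7%us, 5.4%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 1.9%si, 0.0%st
Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
"""


def idle_by_cpu(top_output):
    """Map CPU number -> idle percentage from per-CPU lines of top."""
    return {int(cpu): float(idle) for cpu, idle in CPU_LINE.findall(top_output)}


def starved_cpus(idle, busy_below=10.0, idle_above=99.0):
    """Return the fully idle CPUs if all the other CPUs are saturated.

    The thresholds are arbitrary assumptions, not tuned values.
    """
    busy = [c for c, pct in idle.items() if pct <= busy_below]
    parked = [c for c, pct in idle.items() if pct >= idle_above]
    if busy and parked and len(busy) + len(parked) == len(idle):
        return sorted(parked)
    return []


if __name__ == "__main__":
    print(starved_cpus(idle_by_cpu(SAMPLE)))  # prints [4, 5, 6, 7]
```

On a lightly loaded system, where most cores sit somewhere between the two thresholds, the function returns an empty list, so it does not mistake ordinary idleness for this problem.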
The strange thing was that the issue could not be reproduced with the 3.8.13-44 el6uek kernel.
When booted with the 3.8.13-44 el6uek kernel, Oracle Linux 6.6 saw and utilized all the CPU cores without any problems, perfectly balanced.
So, the problem was basically "Oracle Linux 6.6 with the 3.8.13-98.1.1.el6uek.x86_64 kernel".
The problem looked the same as the one covered in the discussion that I created in Oracle Community. Avi Miller from Oracle replied to that similar problem and stated that this is a known issue in 3.8.13-98.2.1 (tracked by internal bug 21662). So, the workaround was to downgrade to the previous UEK3 release or to use the Red Hat Compatible Kernel for the time being.
In fact, a similar problem existed in 3.8.13-98.1.1 as well.
So, for now we are continuing with the older 3.8.13-44 el6uek kernel, and we will probably upgrade once internal bug 21662 is resolved.
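To avoid booting an affected kernel by accident, the running kernel release can be checked from a script. This is a small sketch: the list of affected releases contains only the two versions mentioned in this post, matching by string prefix is an assumption, and the name is_affected is hypothetical.

```python
import platform

# Only the releases named in this post; not an official list.
AFFECTED_PREFIXES = ("3.8.13-98.1.1", "3.8.13-98.2.1")


def is_affected(release):
    """Return True if the kernel release string starts with a listed version."""
    return release.startswith(AFFECTED_PREFIXES)


if __name__ == "__main__":
    release = platform.release()  # same string as `uname -r`
    state = "affected" if is_affected(release) else "not in the affected list"
    print("%s: %s" % (release, state))
```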