Thursday, December 26, 2013

Oracle Linux -- Memory Optimization -> NUMA, HugeTLB/HugePages, kswapd, TLB and UEK

I decided to write this post because the amounts of memory used in production systems are increasing day by day. Today we can see a 2 TB Oracle Applications/EBS database using an 80 GB SGA. Mixed-workload environments require both a large amount of available system resources and fast response times.

Most of the time, tuning the code meets these kinds of requirements.
On the other hand, there are times when the system resources need to be tuned in order to meet the performance requirements and prevent hang situations.
So, I will write this post about Linux memory optimization and performance.
Actually, I was inspired by the presentations of Greg Marsden, a Kernel and Sustaining Engineer on the Oracle Linux Team, and decided to write these facts down.

In large systems (128 GB and above) with multiple CPUs, the memory layout is different. NUMA (Non-Uniform Memory Access) is used to allocate the memory efficiently. NUMA is automatically configured by Oracle Linux and used by the Oracle Databases running on the Linux server. Oracle ships two kernels with its distribution: one is UEK (Unbreakable Enterprise Kernel) and the other is the base kernel. Both of these kernels support NUMA.

Let's take a look at the NUMA concept and architecture:

Non-Uniform Memory Access (NUMA):
  • Often made by physically linking two or more SMPs
  • One SMP can directly access memory of another SMP
  • Not all processors have equal access time to all memories
  • Memory access across link is slower
  • If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent NUMA

NUMA is a memory design used in multiprocessing systems. With NUMA, the memory access time depends on the memory location relative to a processor. In NUMA the memory is split into multiple regions, and these regions are assigned to the processors. That is, every processor has its own local memory. Of course, a processor can also access the non-local memory regions, which are the local memory regions of the other processors in the system. Naturally, a processor can access its own local memory faster than non-local memory, and this is the main concept in NUMA.
NUMA brought a solution to the problem that only one processor can access the computer's memory at a time: with the memory separated into multiple regions assigned to different processors, memory requests from different processors can be served in parallel.
Here is a graphical representation of the architecture;



In Linux with NUMA, the memory is divided into zones, and one or more processors are attached to each zone. An attached CPU and its memory region can be considered a cell. Of course, the entire memory is still visible and accessible from all the CPUs in the system, and coherency between these memory regions is maintained. It is handled in hardware by the processor caches and/or the system interconnect.




In detail, Linux divides the hardware into multiple nodes. These nodes are the software representations of the hardware portions: the hardware supplies the physical cells, and Linux maps its nodes onto those cells.
As a result, an access to a memory location in a closer node, which maps to a closer cell, will be faster than an access to a remote cell.

Okay, we described NUMA in general. Let's get back to our subject.

So, Oracle Linux uses NUMA with big memory, but there are things you need to consider when using NUMA with big memory.
NUMA systems can swap even if there is free RAM. These systems make their kernel allocations only from the NUMA zone that is closest to the particular core. This is an issue in Linux today: even if there is free memory in the other zones at the time, you can get memory allocation failure messages (dmesg -> "order N allocation failed"), and as a result the system can start swapping to satisfy these allocations.
The min_free_kbytes parameter can handle this issue (if the allocation failure messages are rare, say fewer than 5 or so).
By default, min_free_kbytes reserves about 15 MB. If you have several NUMA zones, then this 15 MB will be divided by the number of NUMA zones; for example, with 5 NUMA zones, each zone gets 15/5 = 3 MB.
The solution is increasing min_free_kbytes, and it is addressed in MOS document 1546861.1:
Linux needs to be tuned to preserve more memory for the kernel in order to avoid this memory depletion event.
Increase vm.min_free_kbytes to 524288 (from default of 51200) by editing /etc/sysctl.conf and editing the vm.min_free_kbytes line to read:
vm.min_free_kbytes = 524288
Note, on NUMA systems vm.min_free_kbytes should be set to 524288 * <# of NUMA nodes>. Use the following command to see the number of NUMA nodes:
# numactl --hardware
WARNING: Changing vm.min_free_kbytes when the system is already under memory pressure can cause the system to panic or hang. Before trying to change this setting dynamically, be sure the system has more free memory than what vm.min_free_kbytes will be set to. This can be checked using the free -k command, which displays the amount of memory used and free in the same units as vm.min_free_kbytes.


min_free_kbytes definition: This is used to force the Linux VM to keep a minimum number of kilobytes free. The VM uses this number to compute a watermark [WMARK_MIN] value for each lowmem zone in the system. Each lowmem zone gets a number of reserved free pages, proportional to its size. Some minimal amount of memory is needed to satisfy PF_MEMALLOC allocations; if you set this to lower than 1024 KB, your system will become subtly broken and prone to deadlock under high loads.
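The steps from the MOS note can be sketched as follows; the 1048576 value is just the 524288 quoted above multiplied by the node count of a 2-node box, so check your own node count first:

```shell
# Show the current reserve and the NUMA node count.
cat /proc/sys/vm/min_free_kbytes
numactl --hardware 2>/dev/null | head -1    # e.g. "available: 2 nodes (0-1)"
# To apply the MOS-recommended value on a 2-node system (root required):
#   sysctl -w vm.min_free_kbytes=1048576    # 524288 * 2 NUMA nodes
# then add "vm.min_free_kbytes = 1048576" to /etc/sysctl.conf to persist it.
```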

You can see the NUMA layout with the numactl tool.
"numactl --hardware" will give you the layout of NUMA in the system.

Example output (144 GB RAM, two CPU sockets):

available: 2 nodes (0-1)
node 0 size: 72697 MB
node 0 free: 318 MB
node 1 size: 72720 MB
node 1 free: 29 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10



While working with big-memory systems, crash analysis is important, too.
If the memory is very big, then dumping its contents, i.e. taking a crash dump, becomes harder. That's why we need to capture only the critical contents that are enough for OS analysis; otherwise, taking a crash dump can take several minutes. To achieve this, the system and kdump should be configured accordingly.
Also, keep in mind that kexec can be used for fast reboots. It bypasses the BIOS initialization and reduces the downtime (e.g. 15 minutes -> 2 minutes). On the other hand, some PCI devices do not respond well to being rebooted this way.
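For example, kdump can be told to filter and compress the dump through makedumpfile in /etc/kdump.conf; the dump level 31 below (which excludes zero, cache, user and free pages) is an illustrative choice, not a value from this post:

```shell
# /etc/kdump.conf (excerpt) -- filter the vmcore so only kernel-relevant pages are dumped
# -c : compress each page, -d 31 : exclude zero, cache, user and free pages
core_collector makedumpfile -c -d 31
```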

Swap is another important thing for memory optimization. Oracle's memory recommendation for servers running Oracle Databases is as follows:

RAM                        Swap Space
Between 1 GB and 2 GB      1.5 times the size of RAM
Between 2 GB and 16 GB     Equal to the size of RAM
More than 16 GB            16 GB

As you see above, beyond 16 GB of RAM, regardless of memory size, the recommended swap space is 16 GB. But we do not actually want to use this space, because if database code or data pages are swapped back and forth, we will face a significant decrease in performance. We just set it; we don't want to use it.
If we don't want to use it, we need to monitor it, to be sure it is not used. We can use the free tool for that.
Remember that in Linux, "free" pages are actually used: they serve as page cache, but they can be overwritten at any time, so technically they are free.

That is why, when you calculate your free memory, you add the cached pages to the free pages:
the Cached column plus the Free column in the free command output gives the actual total free memory.
Example Free command output:
             total       used       free     shared    buffers     cached
Mem:        144967     144625        341          0        413      71252
-/+ buffers/cache:      72959      72007
Swap:        32765        474      32290

The calculation is actually as follows: actual used = Used - (Buffers + Cached), and actual free = Free + Buffers + Cached; this is what the "-/+ buffers/cache" line shows.
You can see the swap usage here, too. Also, you can use the following command to see the percentage of used swap space: free -m | grep -i Swap | awk '{print ($3 / $2)*100}'
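The same arithmetic can be done directly with awk; note that the column positions below assume the older procps free layout shown above (free in column 4, buffers in column 6, cached in column 7):

```shell
# Actual free memory = free + buffers + cached (legacy 'free' output layout).
# Using the sample Mem: row from the output above:
echo "Mem:        144967     144625        341          0        413      71252" |
  awk '/^Mem:/ {print $4 + $6 + $7 " kB actually free"}'
# On a live system: free -k | awk '/^Mem:/ {print $4 + $6 + $7}'
```

This prints 72006 kB, which matches the 72007 on the "-/+ buffers/cache" line up to rounding.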
It is okay to see some of the swap space used. The important thing is that a system should not do swap operations too often. You can monitor the swap-in and swap-out operations using the sar tool.
sar -W will report swap statistics in two columns:
pswpin/s : total number of swap pages the system brought in per second.
pswpout/s : total number of swap pages the system brought out per second.

Example output:

                    pswpin/s pswpout/s
12:30:01 PM 8.02 11.87
12:40:01 PM 7.13 52.70
12:50:01 PM 6.61 15.07
01:00:01 PM 13.05 65.30
01:10:01 PM 29.10 4.72
01:20:01 PM 82.65 20.68
01:30:01 PM 44.91 17.46
01:40:01 PM 30.79 11.03
01:50:01 PM 10.10 19.78
02:00:01 PM 8.07 4.17
02:10:01 PM 18.40 9.05
02:20:01 PM 7.10 4.46
02:30:01 PM 11.90 27.23
02:40:01 PM 3.95 17.48
Average: 33.11 30.06

As I mentioned above, seeing some swap space used by the OS is not a bad thing, because Linux, by design, swaps out the unnecessary/unused pages of an application into swap space. So a system using swap space can have a lot of free memory at the same time. So the pages swapped in and out (especially "in", in my opinion) are what matters; these activities can decrease performance. If you see high counts in these metrics, you need to tune your system. This behaviour can be controlled using /proc/sys/vm/swappiness, but care must be taken with this parameter. The default value is 60; the value 0 means "never use swap if free RAM is available" and the value 100 means "swap out pages aggressively".
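Checking and changing swappiness can be sketched like this; the value 10 below is only an illustration, not a recommendation from this post:

```shell
# Check the current swappiness (0 = avoid swapping while free RAM exists, 100 = swap aggressively).
cat /proc/sys/vm/swappiness
# To lower it at runtime (root required):
#   sysctl -w vm.swappiness=10
# and add "vm.swappiness = 10" to /etc/sysctl.conf to persist it across reboots.
```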


There are things to consider for utilizing the memory. For example, batch processes like updatedb or slocate can force the pages that belong to your applications into swap. So after such a process has run, we can experience a performance decrease in our applications, as their pages will need to be swapped back in.
The general solution is to disable these kinds of batch processes. An alternative solution is using cgroups. Using cgroups, you can isolate the memory usage of applications: you can configure a group of applications like updatedb to be in one group and set the maximum memory available to them.
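As a sketch, with the libcgroup tools this could look like the following; the group name "maintenance" and the 512 MB limit are made up for illustration:

```shell
# /etc/cgconfig.conf (excerpt) -- cap memory for maintenance jobs (cgroups v1)
group maintenance {
    memory {
        memory.limit_in_bytes = 536870912;   # 512 MB
    }
}
```

The batch job is then started inside that group, e.g. cgexec -g memory:maintenance updatedb, so its page cache and anonymous pages are charged against the 512 MB cap instead of pushing application pages out.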

So, in general, swap space should only be used by operating system services, and that's why SWAP=RAM is not recommended. On the other hand, swap space is important for small systems, as it supplies room to grow; but when it is used too often, it decreases the overall performance. In short, if a system has a lot of swap activity, then that system is either misconfigured or missized.

You can also use vmstat to monitor swap in and out operations..
Example output of vmstat:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  9 10758504 22830608  80848 45653220   13   12   759   137    1    3  7  1 84  8  0

Look at the si and so columns. They show the amounts of memory swapped in and out per second (in KB by default).
One other thing that I want to mention is kswapd. It is the kernel daemon that reclaims pages, swapping them out when memory runs low. It is a beast that consumes CPU in some cases; it can hang a Linux system while trying to swap pages out and in. I personally saw it make a 64-bit Linux 5 system hang for 30 minutes, although there were 40000 free pages in the system. I didn't check renicing the kswapd process priority; it may help with this (renice 0 kswapd_pid).

The following picture describes kswapd's default behaviour;


As you see above, kswapd wakes up when the page count is low and starts to reclaim memory. So what are the Pages High and Pages Low values, then? Where do we declare them? Where do we see them? I will try to answer these questions in my next post.
Again, min_free_kbytes is important here.
Also, the following should be considered if you see aggressive swap I/O in Oracle Linux.


In Linux, /proc/buddyinfo can also be used in such situations to extend the analysis and diagnose memory fragmentation.
Example buddyinfo output:
Node 0, zone      DMA      3      5      5      4      2      2      2      0      1      0      2
Node 0, zone    DMA32    135     12      8      0      1      1      1      0      1      1     67
Node 0, zone   Normal      1    176     80      3      0      0      0      2      1      1      6
Node 1, zone   Normal      1      1      8      6      1      7      4      1      1      1      6 

The columns in the output can be read as follows:

First column -> free blocks of size 2^0*PAGE_SIZE
Second column -> free blocks of size 2^1*PAGE_SIZE
Third column -> free blocks of size 2^2*PAGE_SIZE
and so on.
For example, looking at the output above, in the Normal zone of Node 0 there is 1 single page that could be allocated immediately, 176 contiguous pairs exist, and 80 groups of 4 pages can be found. If some process tries to allocate higher-order pages beyond these limits, the allocations will start to fail. This can trigger kswapd to work aggressively.
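A quick way to label those columns is a one-line awk over /proc/buddyinfo; only the first three orders are printed here for brevity:

```shell
# Label the low-order free-block counts per zone from /proc/buddyinfo.
# Fields: $1="Node" $2="<n>," $3="zone" $4=zone-name $5..=free blocks per order.
awk '{sub(",", "", $2); printf "node %s zone %s order0=%s order1=%s order2=%s\n", $2, $4, $5, $6, $7}' /proc/buddyinfo
```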

Okay, let's mention some important things about Linux pages, too.
There are two kinds of pages in Linux: 4 KB pages and 2 MB pages. Note that there are significant performance differences between them.
Linux uses paging to translate virtual addresses to physical addresses.
Processes work with virtual addresses that are mapped to physical addresses, and this mapping is defined by page tables.



So consider you have 250 GB of memory installed in your Oracle Database server and you are using 4 KB pages. Suppose you have 2000 connections in your database environment. Every Oracle process on Linux will have its own page tables to map the SGA, and you will end up with a giant total page table size (maybe 100 GB or more).

Note that when sizing the SGA, we need to consider the size of the page tables and the number of processes, because we need to leave some space for the OS, too.
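A back-of-the-envelope calculation shows where that "100 GB or more" comes from, taking the 80 GB SGA from the beginning of this post and the 2000 connections above, and assuming 8-byte page-table entries, no page-table sharing, and ignoring the upper page-table levels:

```shell
# Page-table cost of mapping an 80 GB SGA with 4 KB pages, per process and in total.
SGA_GB=80
PROCS=2000
PTES=$((SGA_GB * 1024 * 1024 / 4))        # number of 4 KB pages (= PTEs) to map the SGA
PER_PROC_MB=$((PTES * 8 / 1024 / 1024))   # PTE bytes per process, in MB
TOTAL_GB=$((PER_PROC_MB * PROCS / 1024))
echo "per-process: ${PER_PROC_MB} MB, total: ${TOTAL_GB} GB"
```

With 2 MB HugePages the PTE count drops by a factor of 512, and shared page tables reduce the total further.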

The solution for the problem above is using HugePages. HugePages produce much smaller page tables, because there are far fewer pages to handle, as HugePages are 2 MB in size (or more, depending on the system). In addition, HugePages are never swapped out; they are locked in memory. The kernel does less bookkeeping work for virtual memory because of the larger page size. Note that HugePages are not compatible with Oracle's Automatic Memory Management, if it is configured.
Also, sharing of page tables seems to be supported in Linux 5 and later, as declared by Red Hat below:


Shared Page Tables
Shared page tables are now supported for hugetlb memory. This enables page table entries to be shared among multiple processes.
Sharing page table entries among multiple processes consumes less cache space. This improves application cache hit ratio, resulting in better application performance.

So for Oracle, it is recommended to use HugePages with shared page tables.


The TLB hit ratio will also increase if HugePages are in use. Let's see what TLBs are.
Linux uses the TLB (Translation Lookaside Buffer) in the CPU. The TLB stores the mappings of virtual memory to actual physical memory addresses, for performance reasons. It works like a cache: through it, the CPU can resolve the mappings directly without walking the page tables. With HugePages the page size is larger and the page count is lower, so TLB pressure will decrease, which also decreases the processing overhead.

The following is a graphical representation describing the TLB in use. Ref: University of New Mexico.


As you see above, the CPU looks up the virtual address in the TLB first. If it cannot find the page there, then it walks the page table to find it.

In addition, I want to briefly mention the allocation of these HugePages.
I guess (I didn't trace it) that Oracle Database allocates its shared memory from HugePages using something like the following:

shmid = shmget(IPC_PRIVATE, 8 * 1024 * 1024,
               SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);

You can see the size of, and the count of available and allocated, HugePages by looking at /proc/meminfo.

Example output:

cat /proc/meminfo
MemTotal:     148446596 kB
MemFree:        363292 kB
Buffers:        857808 kB
Cached:       77299976 kB
SwapCached:          0 kB
Active:       15177984 kB
Inactive:     70000152 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:     148446596 kB
LowFree:        363292 kB
SwapTotal:    33551744 kB
SwapFree:     33551508 kB
Dirty:             256 kB
Writeback:           0 kB
AnonPages:     7122052 kB
Mapped:         111868 kB
Slab:          1135976 kB
PageTables:     228652 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:  77055040 kB
Committed_AS: 21068916 kB
VmallocTotal: 34359738367 kB
VmallocUsed:    280780 kB
VmallocChunk: 34359457067 kB
HugePages_Total: 30000
HugePages_Free:   4399
HugePages_Rsvd:   2560
Hugepagesize:     2048 kB

So 30000 - 4399 + 2560 = 28161 HugePages are in use or reserved.
In other words, in the example above nearly 28161 HugePages are allocated or committed, which makes approximately 55 GB.
Note that the system this output was gathered from has a 50 GB SGA with a 55 GB SGA target; that's why we see this reserved HugePages count (2560).
So it is a good configuration: almost all of the SGA is, and will be, in HugePages, so we can say that the HugePages configuration for this system was done properly.
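The same arithmetic can be pulled straight out of /proc/meminfo with awk, using the Hugepagesize value from the same file:

```shell
# HugePages in use or reserved = Total - Free + Rsvd; size computed from Hugepagesize (kB).
awk '/^HugePages_Total/ {t=$2} /^HugePages_Free/ {f=$2} /^HugePages_Rsvd/ {r=$2} /^Hugepagesize/ {sz=$2}
     END {u = t - f + r; printf "in use/reserved: %d pages (~%d GB)\n", u, u * sz / 1024 / 1024}' /proc/meminfo
```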

If you want to use HugePages with Oracle Database, please see the following Oracle Support documents.
  • MOS 361323.1 – Hugepages on Linux
  • MOS 361468.1 HugePages on 64-bit Linux
  • MOS 401749.1 script to calculate number of huge pages
  • MOS 361468.1 Troubleshooting Huge Pages

Moreover, there is also another type of huge page, called THP (Transparent HugePages).

This type is different from normal HugePages, and Oracle Database cannot use them.
Note that Java does not use HugePages at all unless the +UseLargePages option is specified, e.g.: java -XX:+UseLargePages.
THPs are swappable; that's why they are not good for an RDBMS, as RDBMS work can then go to disk. This also brings a paging/memory overhead.
It is nice to know that if the Linux kernel is compiled with THP support, then even if you don't use THP there can be a 10% performance gain, especially in applications that do a lot of I/O.
Note that THP support is disabled by default in the latest UEK kernel.
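You can check whether THP is active on your own kernel like this; note that on RHEL/OL 6 era kernels the sysfs path may be .../redhat_transparent_hugepage/enabled instead:

```shell
# The active mode is shown in brackets, e.g. [always] or [never].
cat /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null || echo "THP interface not present"
# To disable at runtime (root required):
#   echo never > /sys/kernel/mm/transparent_hugepage/enabled
```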

The last paragraph will be about the UEK kernel.
The UEK kernel is based on Linux 2.6 and later. With Oracle Linux, Oracle has made enhancements supporting OLTP, InfiniBand, SSD disk access, RDS, async I/O, OCFS2 and networking. UEK is compatible with RHEL, so you can install and run the applications running on RHEL servers on Oracle Linux with UEK. It is a kernel that is fast, modern and reliable.
UEK is modern because it supports and supplies PV HugePages, data integrity (to prevent data corruption), DTrace, OCFS2, Btrfs, OFED, Ksplice, etc.

The UEK kernel holds a TPC-C benchmark world record set on March 12, 2012. You can see the full disclosure report at this link -> http://c970058.r58.cf2.rackcdn.com/fdr/tpcc/Oracle_X4800-M2_TPCC_OL-UEK-FDR_Rev2_071012.pdf


In my opinion, the machine used in the benchmark is a high-end machine. Utilizing this machine and getting such a high transaction processing rate is a big success for Oracle Linux.

That's all for now. Feel free to write comments, and please help me make corrections if necessary.
