Tuesday, November 15, 2016

Linux -- Huge Pages in real life, memory consumption, Huge pages on swap operations, using "overcommit" / nr_overcommit_hugepages

This blog post will be about Huge pages on Linux.
I actually wrote a comprehensive article about Linux memory optimization (including hugepages) earlier, but this blog post will be a little different.
Today, I want to make a demo to show you hugepages in real life, and the memory locking behavior that we need to get used to when we enable hugepages.
The thing that made me write this article was a question that one of my colleagues asked last week.
My colleague realized that after rebooting his database server, the memory directly became "used". Even before starting the database, he could see that the memory was in use when he executed the "free" command.
He asked me this question on the phone and I directly answered: "it is because of huge pages".
However, I wanted to make a demo and see this statement in real life.

Well let's revisit my earlier blog post and recall the general information about the Huge pages:
(I strongly recommend you to read this blog post as well -> http://ermanarslan.blogspot.com.tr/2013/12/oracle-linux-memory-optimization.html)


When we use hugepages, the page tables are smaller, because there are fewer pages to handle; hugepages are 2MB in size (or more, depending on the system). In addition, hugepages are never swapped out; they are locked in memory. The kernel also does less bookkeeping work for virtual memory, because of the larger page size. Note that hugepages are not compatible with the Automatic Memory Management (AMM) that Oracle does, if configured to do so.


Let's start our demo. (Note that my demo env is Oracle Linux 6.5 x86_64 and the kernel is UEK 3.8.13-16.2.1.el6uek.x86_64.)

HUGEPAGES OCCUPY MEMORY ONCE THEY ARE CONFIGURED (although they are not used by any applications)

Firstly, I will show you the effect of hugepages. You will see that hugepages are never swapped out and that, once they are configured, they occupy memory even though they are not used at all.

Initially, our hugepages are not configured as seen below;

[root@jiratemp ~]# cat /proc/meminfo |grep Huge
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB

Next, we sync and drop the filesystem caches to have a clean environment in terms of memory. (We do this because we will use the free command to see the effect of our actions.)

[root@jiratemp ~]# sync;echo 3 > /proc/sys/vm/drop_caches ; free -m

[root@jiratemp ~]# free -m
             total       used       free     shared    buffers     cached
Mem:          7985        609       7375          0          4         47
-/+ buffers/cache:        557       7427
Swap:         8015          0       8015

Afterwards, we configure 2048 hugepages directly using the proc filesystem, and immediately check the memory usage using the free command;
[root@jiratemp ~]# echo 2048 > /proc/sys/vm/nr_hugepages    (Hugepages are 2MB)
[root@jiratemp ~]#  free -m
                    total       used       free     shared    buffers     cached
Mem:          7985       4709       3275          0          4         49
-/+ buffers/cache:       4655       3330
Swap:         8015          0       8015

A quick explanation for the free command output:

Mem: total = Total physical memory (MemTotal)
Mem: used = MemTotal - MemFree
Mem: free = Free memory (MemFree)
Mem: shared = meaningless nowadays, can be ignored
Mem: buffers = Buffers
Mem: cached = Cached memory
-/+ buffers/cache: used = MemTotal - (MemFree + Buffers + Cached)
-/+ buffers/cache: free = MemFree + Buffers + Cached
Swap: total = Total swap (SwapTotal)
Swap: used = SwapTotal - SwapFree
Swap: free = Free swap (SwapFree)



You see, 4709M is used. One regular page is 4K and one hugepage is 2M, so 2048 hugepages make 4096M.
The free command reports values in megabytes when used with the "-m" argument. See that the used value is 4709 (609M was already used before we configured the hugepages): 4709 - 4096 = 613, which is almost equal to 609. So these used megabytes are caused by the hugepages.
I remind you, we didn't use those hugepages; but once configured, they occupy memory, as you see.

Well, it is certain that huge pages are reserved inside the kernel.

HUGEPAGES ARE NOT SWAPPED OUT EVEN UNDER PRESSURE (even when they are not used by any applications)

Hugepages cannot be swapped out. It is real. In order to test it, I wrote a Python program. This program takes only one input: the memory size (in MB) that we want it to allocate.

So, we use this program to create memory pressure and to see whether the configured hugepages stay in memory under that pressure.
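The allocator program itself is not listed in this post, but a minimal sketch of it could look like the following (the details are my assumptions; any script that keeps appending buffers and touching their pages will create the same kind of pressure):

```python
import sys
import time

MB = 1024 * 1024

def allocate_mb(n, report=False):
    """Allocate n megabytes in 1 MB chunks, touching every 4K page of
    each chunk so the memory actually becomes resident (not just virtual)."""
    chunks = []
    for i in range(n):
        buf = bytearray(MB)
        buf[::4096] = b"\x01" * len(buf[::4096])  # touch each 4K page
        chunks.append(buf)
        if report:
            print("Currently allocating %d MB" % (i + 1))
    return chunks

if __name__ == "__main__":
    mem = allocate_mb(int(sys.argv[1]), report=True)
    time.sleep(100)  # hold the memory, so we can watch it with free/top
```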

First, we configure 2048 hugepages;

[root@jiratemp ~]# echo 2048 > /proc/sys/vm/nr_hugepages    (Hugepages are 2MB)

[root@jiratemp ~]#  free -m
                    total       used       free     shared    buffers     cached
Mem:          7985       4709       3275          0          4         49
-/+ buffers/cache:       4655       3330
Swap:         8015          0       8015

As you see above, there are only 3275 MB free; almost all of the used memory is occupied by hugepages.
Now, we execute our python program and try to allocate 4500 MB of memory.

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                           
 3740 root      20   0 4615m 3.4g  424 S  0.0 44.2   1:09.38 python ./erm 4500 

While our program is running, we take the free command output every second to see the system-wide memory usage;

[root@jiratemp ~]# free -m -s 1
                     total       used       free     shared    buffers     cached
Mem:          7985       4701       3284          0          0         43
-/+ buffers/cache:       4656       3328
Swap:         8015          0       8015

                     total       used       free     shared    buffers     cached
Mem:          7985       4701       3284          0          0         43
-/+ buffers/cache:       4656       3328
Swap:         8015          0       8015

                    total       used       free     shared    buffers     cached
Mem:          7985       4703       3281          0          0         45
-/+ buffers/cache:       4657       3327
Swap:         8015          0       8015

                    total       used       free     shared    buffers     cached
Mem:          7985       5750       2234          0          0         46
-/+ buffers/cache:       5704       2281
Swap:         8015          0       8015

                    total       used       free     shared    buffers     cached
Mem:          7985       7928         56          0          0         46
-/+ buffers/cache:       7881        103
Swap:         8015          0       8015

....
.............
........................

You see, as our program allocates more memory every second, the free memory gets closer to 0 (zero).

Moreover, because of this pressure, our server starts to hang, and when we check the situation using the top command (using our limited CPU cycles), we see that kswapd is running aggressively.

*COMMANDS HANG..
*THE SWAP DAEMON IS RUNNING AND SWAP USAGE INCREASES EVERY SECOND!!

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND  
 60 root      20   0     0    0    0 R 11.2  0.0   0:16.84 kswapd0  

[root@jiratemp 3740]# cat status|grep Swap
VmSwap:  1018512 kB

Moreover, when we check our process, we see its resident memory is 3.4G, as seen below;

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                           
 3740 root      20   0 4615m 3.4g  424 S  0.0 44.2   1:09.38 python ./erm 4500 

However,  the virtual memory of our process is 4615m, as you see above.

(VIRTUAL MEMORY= 4615 MB but RES=3.4G)

So, this is a little interesting, right? Because we requested 4500 MB of memory, but our resident memory is only 3.4G.

The situation is the same when we run a C program and try to allocate 5500 megabytes.

The program just slows down when it reaches about 3 GB of memory, and swap activity is triggered.

Our application gets stuck at this point, but if we wait for the swap daemon to swap out memory, we can see that our program can actually allocate the 5500 MB. Look, the program says it is allocating its 5549th MB;

Currently allocating 5535 MB
Currently allocating 5536 MB
Currently allocating 5537 MB
Currently allocating 5538 MB
Currently allocating 5539 MB
Currently allocating 5540 MB
Currently allocating 5541 MB
Currently allocating 5542 MB
Currently allocating 5543 MB
Currently allocating 5544 MB
Currently allocating 5545 MB
Currently allocating 5546 MB
Currently allocating 5547 MB
Currently allocating 5548 MB
Currently allocating 5549 MB

But when we look at the top output, we see that RES is only 3.6G, while VIRT has increased. So swap is in play.

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND   
 4191 root      20   0 5418m 3.6g  152 R 64.0 45.7   1:10.96 ./a.out  

So our pages are swapped out! (Remember: VIRT = the total amount of virtual memory used by the task. It includes all code, data and shared libraries, plus pages that have been swapped out.)

MAN TOP ->
VIRT  --  Virtual Image (kb)
          The  total  amount  of  virtual  memory  used by the task.  It includes all code, data and shared libraries plus pages that have been swapped out. (Note: you can
          define the STATSIZE=1 environment variable and the VIRT will be calculated from the /proc/#/state VmSize field.)

Well, if we disable the hugepages, the same program can allocate "resident" memory. Here is an example top output for the same program (hugepages disabled):

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 4333 root      20   0 6623m 6.5g  376 S 67.2 82.9   0:03.33 a.out        

You see RES=6.5g. So it allocates resident memory. (a.out is a C program which continuously and endlessly allocates and uses memory.)

So, this proves that hugepages are not swapped out, even in the case of a memory shortage. Also, when there is a memory shortage and almost all of the memory is allocated by hugepages, the pages that our program recently allocated are swapped out to make room for the program to allocate more memory :).

Another interesting thing: if there is not enough free memory, hugepages cannot be configured properly.
That is, we can allocate regular pages from a self-written program and test this.
When we do such a test, we see that the hugepages are not allocated, although we issue the commands;

Well, we allocate all the memory using a self-written application and then try to configure 2048 hugepages.
The interesting thing is that our command doesn't encounter any errors, but the hugepages are not allocated at all;

[root@jiratemp ~]# echo 2048 > /proc/sys/vm/nr_hugepages
[root@jiratemp ~]# echo $?
0
[root@jiratemp ~]# hugeadm --pool-list
      Size  Minimum  Current  Maximum  Default
   2097152        1        1        1        *
[root@jiratemp ~]# grep Huge /proc/meminfo
HugePages_Total:       1
HugePages_Free:        1
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
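By the way, instead of grepping /proc/meminfo by hand after every step, these counters can be collected from a small script. Here is a sketch of such a parser (the field names are exactly as they appear in /proc/meminfo):

```python
def parse_hugepages(meminfo_text):
    """Extract the HugePages_* counters (and Hugepagesize, in kB)
    from the text content of /proc/meminfo."""
    info = {}
    for line in meminfo_text.splitlines():
        if line.startswith(("HugePages_", "Hugepagesize")):
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # drop the "kB" unit if present
    return info

if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print(parse_hugepages(f.read()))
```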

As you see, setting hugepages is a task that should be done carefully. As they are not swapped out, the system may hang in case of a memory shortage, and the risk of a memory shortage actually increases when we use hugepages, or let's say when we configure hugepages (even unused ones).

Well, there is an alternative way of configuring hugepages. Using the overcommit configuration (nr_overcommit_hugepages), we can at least decrease the memory allocation of our hugepages when they are not used by any process.

OVERCOMMIT SETTING FOR HUGEPAGES:

Let's introduce the overcommit setting for hugepages first;

/proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
requested by applications. Writing any non-zero value into this file
indicates that the hugetlb subsystem is allowed to try to obtain that
number of "surplus" huge pages from the kernel's normal page pool, when the
persistent huge page pool is exhausted. As these surplus huge pages become
unused, they are freed back to the kernel's normal page pool.

So, if we set nr_hugepages to a lower value and set nr_overcommit_hugepages to a large value (large enough to meet our peak hugepage requests), then we can have dynamic hugepage allocation in our environments.

Let's make a demo and see how it is done and how it behaves;

We set 100 hugepages and we set 1000 overcommit hugepages

[root@jiratemp ~]# echo 100 > /proc/sys/vm/nr_hugepages
[root@jiratemp ~]# echo 1000 > /proc/sys/vm/nr_overcommit_hugepages

We check /proc/meminfo and the hugepage pool list, and see that only 100 hugepages are allocated (as no processes use any hugepages at the moment);

[root@jiratemp ~]# grep Huge /proc/meminfo 
HugePages_Total:     100
HugePages_Free:      100
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

[root@jiratemp ~]# hugeadm --pool-list
      Size  Minimum  Current  Maximum  Default
   2097152      100      100     1100        *

We sync and clear the caches to have a fresh start in terms of memory, and then allocate shared memory from the huge pages (just like an Oracle Database does :)

[root@jiratemp ~]# sync;echo 3 > /proc/sys/vm/drop_caches ; free -m
             total       used       free     shared    buffers     cached
Mem:          7985        736       7249          0          0         21
-/+ buffers/cache:        713       7271
Swap:         8015        125       7890

Note: for allocating shared memory from the hugepages, I use the following C program:

#include <stdio.h>
#include <sys/shm.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int segment_id_1;
    char *shared_memory_1;
    const int shared_segment_size = 0x40000000; /* 1 GB */

    /* Allocate a shared memory segment backed by hugepages */
    segment_id_1 = shmget(IPC_PRIVATE, shared_segment_size,
                          SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
    if (segment_id_1 < 0) {
        perror("shmget");
        return 1;
    }
    shared_memory_1 = (char *) shmat(segment_id_1, 0, 0);
    if (shared_memory_1 == (char *) -1) {
        perror("shmat");
        return 1;
    }
    sprintf(shared_memory_1, "ERMAN");
    sleep(100); /* keep the segment attached while we inspect /proc/meminfo */
    shmdt(shared_memory_1);
    shmctl(segment_id_1, IPC_RMID, NULL); /* mark the segment for removal */
    return 0;
}

0x40000000 means 1 GByte, which means 512 hugepages on Linux (with the 2MB hugepage size).
So, we tell our program to allocate 1GB (512 hugepages) of shared memory from the hugepages.
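We can double-check this size arithmetic quickly:

```python
HUGEPAGE_SIZE = 2 * 1024 * 1024  # 2 MB, as reported by Hugepagesize in /proc/meminfo

def hugepages_needed(nbytes):
    """How many 2 MB hugepages are needed to back a segment of nbytes."""
    return (nbytes + HUGEPAGE_SIZE - 1) // HUGEPAGE_SIZE  # round up

print(hugepages_needed(0x40000000))  # 1 GB -> 512
```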

Remember, our hugepage count was 100, so there were 100 hugepages in our hugepage pool, as shown earlier. On the other hand, we set 1000 overcommit hugepages.

Well, when we execute this program, we see that 512 pages are allocated. So our pool has been enlarged :)

[root@jiratemp ~]# grep Huge /proc/meminfo 
HugePages_Total:     512
HugePages_Free:      511
HugePages_Rsvd:      511
HugePages_Surp:      412
Hugepagesize:       2048 kB

[root@jiratemp ~]#  hugeadm --pool-list
      Size  Minimum  Current  Maximum  Default
   2097152      100      512     1100        *

[root@jiratemp ~]# free -m
             total       used       free     shared    buffers     cached
Mem:          7985       1567       6417          0          3         25
-/+ buffers/cache:       1538       6446
Swap:         8015        125       7890

Now, our free memory has decreased by about 1024 MBytes.
So overcommit works perfectly. We had 100 hugepages at first, so our hugepages were occupying only 200 MBytes initially. However, when we needed more, we could allocate more (thanks to overcommit).
We got ourselves an environment which can do dynamic hugepage allocation.
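The pool arithmetic we just observed can be modeled in a few lines. This is a simplified sketch (the kernel's real accounting has more states), but the Total/Surplus math matches our demo numbers:

```python
def hugepage_pool(nr_hugepages, nr_overcommit, requested):
    """Return (HugePages_Total, HugePages_Surp) after 'requested' hugepages
    are taken from a pool with the given settings, or None if the request
    exceeds the persistent + overcommit capacity."""
    if requested > nr_hugepages + nr_overcommit:
        return None  # request cannot be satisfied
    surplus = max(0, requested - nr_hugepages)  # borrowed from the normal pool
    total = max(nr_hugepages, requested)
    return total, surplus

# Our demo: 100 persistent pages, 1000 overcommit, the program asks for 512
print(hugepage_pool(100, 1000, 512))  # -> (512, 412)
# After the program exits, the surplus is freed back:
print(hugepage_pool(100, 1000, 0))    # -> (100, 0)
```

With 100 persistent pages and a request for 512, the surplus of 412 matches the HugePages_Surp value we saw in /proc/meminfo.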

WHAT ABOUT ORACLE DATABASE? CAN IT USE OVERCOMMIT HUGEPAGES?

Let's try with the Oracle database;

First of all, our limits.conf should be configured properly to use hugepages. In other words, the oracle OS user must be able to lock memory when the database is instructed to use hugepages (especially when it is instructed to use hugepages only!)

This can be done in 2 ways.

1) By setting a capability on the oracle binary, as the root user:
cd $ORACLE_HOME/bin
setcap cap_ipc_lock=+ep oracle

2) By adding the following to limits.conf (change the values according to your needs):
oracle soft    memlock        unlimited
oracle hard    memlock        unlimited
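As a side note, the effective memlock limit of the current OS user can also be checked from a script; here is a small sketch using Python's standard resource module:

```python
import resource

def memlock_limit_kb():
    """Return the (soft, hard) memlock limits in KB, or 'unlimited'."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)  # values in bytes
    to_kb = lambda v: "unlimited" if v == resource.RLIM_INFINITY else v // 1024
    return to_kb(soft), to_kb(hard)

if __name__ == "__main__":
    print("memlock (soft, hard):", memlock_limit_kb())
```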

If we don't do one of these configurations, we end up with the following ORA-27137;

[oracle@jiratemp ~]$ sqlplus "/as sysdba"

SQL*Plus: Release 12.1.0.1.0 Production on Thu Nov 10 11:08:01 2016
Copyright (c) 1982, 2013, Oracle.  All rights reserved.
Connected to an idle instance.
SQL> startup nomount;
ORA-27137: unable to allocate large pages to create a shared memory segment
Linux-x86_64 Error: 1: Operation not permitted
Additional information: 14680064
Additional information: 1

Well, suppose we have configured our memory lock parameters (or set the capability on the oracle binary),
and configured our memory-related database parameters as follows;
--these parameters configure the initial memory allocation of the Oracle Database when it is started (sga_target is actually the one that does this)

sga_max_size = 1000M
sga_target = 500M
use_large_pages = ONLY --> this instructs Oracle to use only hugepages.

We set the hugepage overcommit count to 1000 and the hugepage count to 100;

[root@jiratemp ~]# sync;echo 3 > /proc/sys/vm/drop_caches ; free -m

             total       used       free     shared    buffers     cached
Mem:          7985        205       7779          0          0         16
-/+ buffers/cache:        189       7796
Swap:         8015         30       7985

[root@jiratemp ~]# grep Huge /proc/meminfo

HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

[root@jiratemp ~]# hugeadm --pool-list
      Size  Minimum  Current  Maximum  Default
   2097152        0        0        0        *

[root@jiratemp ~]# echo 100 > /proc/sys/vm/nr_hugepages
[root@jiratemp ~]# echo 1000 > /proc/sys/vm/nr_overcommit_hugepages

[root@jiratemp ~]# hugeadm --pool-list
      Size  Minimum  Current  Maximum  Default
   2097152      100      100     1100        *

So, we start up our Oracle database (note that starting the database in nomount mode is enough for this test) as follows;

[oracle@jiratemp ~]$ sqlplus "/as sysdba"

SQL*Plus: Release 12.1.0.1.0 Production on Thu Nov 10 11:23:41 2016
Copyright (c) 1982, 2013, Oracle.  All rights reserved
Connected to an idle instance.
SQL> startup nomount;
ORACLE instance started.
Total System Global Area 1043886080 bytes
Fixed Size                  2296280 bytes
Variable Size             876611112 bytes
Database Buffers          159383552 bytes
Redo Buffers                5595136 bytes

We check that our parameters are set;

SQL> show parameter sga_target   

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
sga_target                           big integer     500M

SQL> show parameter sga_max_size  
sga_max_size                     big integer   1000M


Once our database is started, we directly check the hugepage usage;

Remember, our overcommit hugepage count was set to 1000 (2000 MB) and our hugepage count was set to 100 (200 MB).

[root@jiratemp ~]#  grep Huge /proc/meminfo 
HugePages_Total:     501
HugePages_Free:      252
HugePages_Rsvd:      252
HugePages_Surp:      401
Hugepagesize:       2048 kB

You see? The total hugepage count is now 501 (it was 100 earlier), and the hugepage surplus is 401. So Linux let Oracle allocate more hugepages than configured, by respecting the overcommit configuration. In other words, Linux let Oracle allocate (reserve + use) about 500 hugepages by enlarging the hugepage pool automatically and dynamically.

As we set the sga_target parameter to 500M, Oracle used almost 250 hugepages (see that HugePages_Free is 252); Oracle also reserved 252 more hugepages, as sga_max_size was set to 1000M.

Now, we set the sga_target parameter to 1000 MBytes and see if Linux will let Oracle use all of the ~500 hugepages.

SQL> alter system set sga_target=1000M scope=memory;

System altered.
[oracle@jiratemp ~]$ grep Huge /proc/meminfo 
HugePages_Total:     501
HugePages_Free:        2
HugePages_Rsvd:        2
HugePages_Surp:      401
Hugepagesize:       2048 kB

Yes. Oracle used almost 500 hugepages to build its ~1000M sized SGA on top of these hugepages.

When we shut down our Oracle database, we see the hugepage pool dynamically deallocated: the space occupied by Oracle's hugepages is freed, as expected.

SQL> shu immediate;
ORA-01507: database not mounted
ORACLE instance shut down.

[root@jiratemp ~]#  grep Huge /proc/meminfo 
HugePages_Total:     100
HugePages_Free:      100
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

HOW DOES ORACLE DO IT?

Well, we see that Oracle can go with the hugepage overcommit setting (nr_overcommit_hugepages).
What Oracle actually does is honour sga_max_size at instance startup and commit to allocating the hugepages necessary to satisfy sga_max_size. However, it actually only uses the hugepages necessary to satisfy sga_target.
Although the configured number of hugepages (nr_hugepages) is set to a lower value, Oracle can allocate the necessary number of hugepages to satisfy its sga_max_size, because a hugepage overcommit configuration is in place. Furthermore, when the Oracle database is shut down, the surplus hugepages (allocated count - nr_hugepages) are given back to the OS.
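Under these assumptions (2 MB hugepages, and ignoring the small fixed-SGA overhead that makes the real numbers a page or two higher, e.g. 501 instead of 500), the expected counters can be estimated like this:

```python
HUGEPAGE_MB = 2

def expected_counters(sga_max_mb, sga_target_mb):
    """Rough expected HugePages_Total / Free / Rsvd for an instance
    started with hugepages only."""
    total = sga_max_mb // HUGEPAGE_MB    # allocated (shmget) for sga_max_size
    used = sga_target_mb // HUGEPAGE_MB  # actually touched for sga_target
    free = total - used                  # allocated but not yet faulted in
    rsvd = free                          # reserved for the segment, not used yet
    return {"Total": total, "Free": free, "Rsvd": rsvd}

print(expected_counters(1000, 500))  # close to the 501/252/252 we observed
```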

CONCLUSION:

So, following are the conclusions;

  • Hugepages are not swapped out under any circumstances.
  • Hugepages are not given to any process that wants to allocate regular pages.
  • Hugepages occupy memory once they are configured, even though they are not used.
  • nr_overcommit_hugepages is a good thing in the relevant cases.
  • nr_overcommit_hugepages lets a process allocate more hugepages than the configured nr_hugepages.
  • When a process shuts down gracefully or releases its memory, the hugepages used by that process are given back to the system; the surplus hugepages are freed back to the kernel's normal page pool, while the nr_overcommit_hugepages setting itself stays as we configured it.
  • Oracle will go with the nr_overcommit_hugepages parameter.
  • Oracle will allocate (not use) shared memory backed by hugepages based on sga_max_size. (We can think of it as Oracle allocating shared memory using shmget at instance startup; the segment size argument given to shmget equals the value defined in sga_max_size.) So, when Oracle allocates shared memory at startup, it actually reserves the hugepages and the reserved hugepage count (HugePages_Rsvd) increases. On the other hand, Oracle will use a number of hugepages based on sga_target, and that's why HugePages_Free (seen in /proc/meminfo) decreases accordingly.
  • If we use hugepages (only), we should set sga_max_size equal to sga_target (in order not to waste our memory).

Well, after knowing all this, I want to give an example where overcommit hugepages can be used to address a memory wastage problem.

QUESTION:

Suppose; 
  • We have 3 databases in a single-server environment. Let's say these are TEST, DEV and UAT.
  • We sometimes work only in TEST, sometimes in DEV, sometimes in 2 of these databases, and sometimes in all 3.
  • We also have processes/sessions which do lots of PGA work.
  • We have 100 Gigabytes of RAM and we want to reserve 10GB of it for the OS.
  • We want to have a 20GB sized SGA for each of these databases.
  • We want to use sga_max_size/sga_target (ASMM -- not AMM) and use hugepages as well.

Now, consider: which one of the following settings is the better idea?

1)

 echo 10240 > /proc/sys/vm/nr_hugepages
 echo 30720 > /proc/sys/vm/nr_overcommit_hugepages

2)

 echo 30720 > /proc/sys/vm/nr_hugepages
 echo 30720 > /proc/sys/vm/nr_overcommit_hugepages

MY ANSWER:

*My answer is 1) . 
The reason is explained earlier. :)
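To put numbers on that reason, here is a quick sanity check of the values in option 1 (10240 pages cover the one 20GB SGA we typically run; the overcommit ceiling covers the peak when all 3 databases are up, and unused surplus pages go back to the normal pool for the PGA-heavy work):

```python
HUGEPAGE_MB = 2
GB = 1024  # MB per GB

sga_per_db_mb = 20 * GB  # 20 GB SGA per database
databases = 3

pages_per_db = sga_per_db_mb // HUGEPAGE_MB
print(pages_per_db)              # -> 10240, nr_hugepages for a single database
print(databases * pages_per_db)  # -> 30720, nr_overcommit_hugepages for the peak
```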
What is yours? please feel free to comment...

1 comment :

  1. Excellent Erman! I was looking for overcommit behaviour and your write-up cleared it all. Thanks.

    One query: you mentioned that when a process shuts down gracefully then "the nr_overcommit_hugapages value will be set to its default." Will that be set to default or the memory occupied in the form of hugepages will be returned to the default pool. I believe the value set for nr_overcommit_hugapages (in /proc/sys/vm/nr_overcommit_hugepages) will stay the same as we set it, isnt it?

    Thanks again.


If you want to ask a question, please don't comment here.

For your questions, please create an issue in my forum.

Forum Link: http://ermanarslan.blogspot.com.tr/p/forum.html

Register and create an issue in the related category.
I will support you from there.