Tuesday, November 15, 2016

Linux -- Huge Pages in real life, memory consumption, Huge pages on swap operations, using "overcommit" / nr_overcommit_hugepages

This blog post will be about Huge pages on Linux.
I actually wrote a comprehensive article about Linux memory optimization (including hugepages) earlier, but this blog post will be a little different.
Today, I want to make a demo to show you hugepages in real life, and the memory locking behavior that we need to get used to when we enable hugepages.
The thing that made me write this article was a question that one of my colleagues asked last week.
My colleague realized that after rebooting his database server, the memory directly became "used". Even before starting the database, he could see that the memory was in use when he executed the "free" command.
He asked me this question on the phone and I directly answered: "it is because of huge pages".
However, I wanted to make a demo and see this statement in real life.

Well let's revisit my earlier blog post and recall the general information about the Huge pages:
(I strongly recommend you to read this blog post as well -> http://ermanarslan.blogspot.com.tr/2013/12/oracle-linux-memory-optimization.html)


When we use hugepages, the page tables are smaller, because there are fewer pages to handle; hugepages are 2MB in size (or more, depending on the system). In addition, hugepages are never swapped out; they are locked in memory. The kernel also does less bookkeeping work for virtual memory, because of the larger page size. Note that hugepages are not compatible with the Automatic Memory Management (AMM) that Oracle does, if configured to do so.


Let's start our demo. (Note that my demo env is Oracle Linux 6.5 x86_64 and the kernel is UEK 3.8.13-16.2.1.el6uek.x86_64.)

HUGEPAGES OCCUPY MEMORY ONCE THEY ARE CONFIGURED (although they are not used by any applications)

Firstly, I will show you the effect of hugepages. You will see that hugepages are never swapped out and that, once they are configured, they occupy memory even though they are not used at all.

Initially, our hugepages are not configured as seen below;

[root@jiratemp ~]# cat /proc/meminfo |grep Huge
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB

Next, we sync and drop the filesystem caches to have a clean environment in terms of memory. (We do this because we will use the free command to see the effect of our actions.)

[root@jiratemp ~]# sync;echo 3 > /proc/sys/vm/drop_caches ; free -m

[root@jiratemp ~]# free -m
             total       used       free     shared    buffers     cached
Mem:          7985        609       7375          0          4         47
-/+ buffers/cache:        557       7427
Swap:         8015          0       8015

Afterwards, we configure 2048 hugepages directly using the proc filesystem, and immediately check the memory usage using the free command;
[root@jiratemp ~]# echo 2048 > /proc/sys/vm/nr_hugepages    (Hugepages are 2MB)
[root@jiratemp ~]#  free -m
                    total       used       free     shared    buffers     cached
Mem:          7985       4709       3275          0          4         49
-/+ buffers/cache:       4655       3330
Swap:         8015          0       8015

A quick explanation for the free command output:

Mem: total = Total physical memory (MemTotal)
Mem: used = MemTotal - MemFree
Mem: free = Free memory (MemFree)
Mem: shared = meaningless nowadays, can be ignored
Mem: buffers = Buffers
Mem: cached = Cached memory
-/+ buffers/cache: used = MemTotal - (MemFree + Buffers + Cached)
-/+ buffers/cache: free = MemFree + Buffers + Cached
Swap: total = Total swap (SwapTotal)
Swap: used = SwapTotal - SwapFree
Swap: free = Free swap (SwapFree)



You see, 4709M is used. One regular page is 4K and one hugepage is 2M, so 2048 hugepages make 4096M.
The free command reports values in megabytes when used with the "-m" argument. See that the used value is 4709 (609M was already used before we configured the hugepages): 4709 - 4096 = 613, which is almost equal to 609. So these used megabytes are caused by the hugepages.
I remind you, we didn't use those hugepages; but once configured, they occupy memory, as you see.

Well, it is certain that huge pages are reserved inside the kernel.

HUGEPAGES ARE NOT SWAPPED OUT EVEN UNDER PRESSURE (even when they are not used by any applications)

Hugepages cannot be swapped out. It is real. In order to test it, I wrote a Python program. This program takes only one input: the memory size (in MB) that we want it to allocate.

So, we use this program to create memory pressure and to see whether the configured hugepages stay in memory under that pressure.
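The allocator program itself is not listed in this post, but a minimal sketch of it could look like the following (the details are my assumptions; any script that keeps appending buffers and touching their pages will create the same kind of pressure):

```python
import sys
import time

MB = 1024 * 1024

def allocate_mb(n, report=False):
    """Allocate n megabytes in 1 MB chunks, touching every 4K page of
    each chunk so the memory actually becomes resident (not just virtual)."""
    chunks = []
    for i in range(n):
        buf = bytearray(MB)
        buf[::4096] = b"\x01" * len(buf[::4096])  # touch each 4K page
        chunks.append(buf)
        if report:
            print("Currently allocating %d MB" % (i + 1))
    return chunks

if __name__ == "__main__":
    mem = allocate_mb(int(sys.argv[1]), report=True)
    time.sleep(100)  # hold the memory, so we can watch it with free/top
```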

First, we configure 2048 hugepages;

[root@jiratemp ~]# echo 2048 > /proc/sys/vm/nr_hugepages    (Hugepages are 2MB)

[root@jiratemp ~]#  free -m
                    total       used       free     shared    buffers     cached
Mem:          7985       4709       3275          0          4         49
-/+ buffers/cache:       4655       3330
Swap:         8015          0       8015

As you see above, there are only 3275 MB free; almost all of the used memory is occupied by hugepages.
Now, we execute our python program and try to allocate 4500 MB of memory.

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                           
 3740 root      20   0 4615m 3.4g  424 S  0.0 44.2   1:09.38 python ./erm 4500 

While our program is running, we take the free command output every second to see the system-wide memory usage;

[root@jiratemp ~]# free -m -s 1
                     total       used       free     shared    buffers     cached
Mem:          7985       4701       3284          0          0         43
-/+ buffers/cache:       4656       3328
Swap:         8015          0       8015

                     total       used       free     shared    buffers     cached
Mem:          7985       4701       3284          0          0         43
-/+ buffers/cache:       4656       3328
Swap:         8015          0       8015

                    total       used       free     shared    buffers     cached
Mem:          7985       4703       3281          0          0         45
-/+ buffers/cache:       4657       3327
Swap:         8015          0       8015

                    total       used       free     shared    buffers     cached
Mem:          7985       5750       2234          0          0         46
-/+ buffers/cache:       5704       2281
Swap:         8015          0       8015

                    total       used       free     shared    buffers     cached
Mem:          7985       7928         56          0          0         46
-/+ buffers/cache:       7881        103
Swap:         8015          0       8015

....
.............
........................

You see, as our program allocates more memory every second, the free memory gets closer to 0 (zero).

Moreover, because of this pressure, our server starts to hang, and when we check the situation using the top command (using our limited CPU cycles), we see that kswapd is running aggressively.

*COMMANDS HANG..
*THE SWAP DAEMON IS RUNNING AND SWAP USAGE INCREASES EVERY SECOND!!

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND  
 60 root      20   0     0    0    0 R 11.2  0.0   0:16.84 kswapd0  

[root@jiratemp 3740]# cat status|grep Swap
VmSwap:  1018512 kB

Moreover, when we check our process, we see its resident memory is 3.4G, as seen below;

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                           
 3740 root      20   0 4615m 3.4g  424 S  0.0 44.2   1:09.38 python ./erm 4500 

However,  the virtual memory of our process is 4615m, as you see above.

(VIRTUAL MEMORY= 4615 MB but RES=3.4G)

So, this is a little interesting, right? Because we requested 4500 MB of memory, but our resident memory is only 3.4G.

The situation is the same when we run a C program and try to allocate 5500 megabytes.

The program just slows down when it reaches about 3 GB of memory, and swap activity is triggered.

Our application gets stuck at this point, but if we wait for the swap daemon to swap out memory, we can see that our program can actually allocate the 5500 MB. Look, the program says it is allocating its 5549th MB;

Currently allocating 5535 MB
Currently allocating 5536 MB
Currently allocating 5537 MB
Currently allocating 5538 MB
Currently allocating 5539 MB
Currently allocating 5540 MB
Currently allocating 5541 MB
Currently allocating 5542 MB
Currently allocating 5543 MB
Currently allocating 5544 MB
Currently allocating 5545 MB
Currently allocating 5546 MB
Currently allocating 5547 MB
Currently allocating 5548 MB
Currently allocating 5549 MB

But when we look at the top output, we see that RES is only 3.6G, while VIRT has increased. So swap is in play.

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND   
 4191 root      20   0 5418m 3.6g  152 R 64.0 45.7   1:10.96 ./a.out  

So our pages are swapped out! (Remember: VIRT = the total amount of virtual memory used by the task. It includes all code, data and shared libraries, plus pages that have been swapped out.)

MAN TOP ->
VIRT  --  Virtual Image (kb)
          The  total  amount  of  virtual  memory  used by the task.  It includes all code, data and shared libraries plus pages that have been swapped out. (Note: you can
          define the STATSIZE=1 environment variable and the VIRT will be calculated from the /proc/#/state VmSize field.)

Well, if we disable the hugepages, the same program can allocate "resident" memory. Here is an example top output for the same program (hugepages disabled):

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 4333 root      20   0 6623m 6.5g  376 S 67.2 82.9   0:03.33 a.out        

You see RES=6.5g. So it allocates resident memory. (a.out is a C program which continuously and endlessly allocates and uses memory.)

So, this proves that hugepages are not swapped out, even in the case of a memory shortage. Also, when there is a memory shortage and almost all of the memory is allocated by hugepages, the pages that our program recently allocated are swapped out to make room for the program to allocate more memory :).

Another interesting thing: if there is not enough free memory, hugepages cannot be configured properly.
That is, we can allocate regular pages from a self-written program and test this.
When we do such a test, we see that the hugepages are not allocated, although we issue the commands;

Well, we allocate all the memory using a self-written application and then try to configure 2048 hugepages.
The interesting thing is that our command doesn't encounter any errors, but the hugepages are not allocated at all;

[root@jiratemp ~]# echo 2048 > /proc/sys/vm/nr_hugepages
[root@jiratemp ~]# echo $?
0
[root@jiratemp ~]# hugeadm --pool-list
      Size  Minimum  Current  Maximum  Default
   2097152        1        1        1        *
[root@jiratemp ~]# grep Huge /proc/meminfo
HugePages_Total:       1
HugePages_Free:        1
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
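By the way, instead of grepping /proc/meminfo by hand after every step, these counters can be collected from a small script. Here is a sketch of such a parser (the field names are exactly as they appear in /proc/meminfo):

```python
def parse_hugepages(meminfo_text):
    """Extract the HugePages_* counters (and Hugepagesize, in kB)
    from the text content of /proc/meminfo."""
    info = {}
    for line in meminfo_text.splitlines():
        if line.startswith(("HugePages_", "Hugepagesize")):
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # drop the "kB" unit if present
    return info

if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print(parse_hugepages(f.read()))
```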

As you see, setting hugepages is a task that should be done carefully. As they are not swapped out, the system may hang in case of a memory shortage, and the risk of a memory shortage actually increases when we use hugepages, or let's say when we configure hugepages (even unused ones).

Well, there is an alternative way of configuring hugepages. Using the overcommit configuration (nr_overcommit_hugepages), we can at least decrease the memory allocation of our hugepages when they are not used by any process.

OVERCOMMIT SETTING FOR HUGEPAGES:

Let's introduce the overcommit setting for hugepages first;

/proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
requested by applications. Writing any non-zero value into this file
indicates that the hugetlb subsystem is allowed to try to obtain that
number of "surplus" huge pages from the kernel's normal page pool, when the
persistent huge page pool is exhausted. As these surplus huge pages become
unused, they are freed back to the kernel's normal page pool.

So, if we set nr_hugepages to a lower value and set nr_overcommit_hugepages to a large value (large enough to meet our peak hugepage requests), then we can have dynamic hugepage allocation in our environments.

Let's make a demo and see how it is done and how it behaves;

We set 100 hugepages and we set 1000 overcommit hugepages

[root@jiratemp ~]# echo 100 > /proc/sys/vm/nr_hugepages
[root@jiratemp ~]# echo 1000 > /proc/sys/vm/nr_overcommit_hugepages

We check /proc/meminfo and the hugepage pool list, and see that only 100 hugepages are allocated (as no processes use any hugepages at the moment);

[root@jiratemp ~]# grep Huge /proc/meminfo 
HugePages_Total:     100
HugePages_Free:      100
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

[root@jiratemp ~]# hugeadm --pool-list
      Size  Minimum  Current  Maximum  Default
   2097152      100      100     1100        *

We sync and clear the caches to have a fresh start in terms of memory, and then allocate shared memory from the huge pages (just like an Oracle Database does :)

[root@jiratemp ~]# sync;echo 3 > /proc/sys/vm/drop_caches ; free -m
             total       used       free     shared    buffers     cached
Mem:          7985        736       7249          0          0         21
-/+ buffers/cache:        713       7271
Swap:         8015        125       7890

Note: for allocating shared memory from the hugepages, I use the following C program:

#include <stdio.h>
#include <sys/shm.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int segment_id_1;
    char *shared_memory_1;
    const int shared_segment_size = 0x40000000; /* 1 GB */

    /* Allocate a shared memory segment backed by hugepages */
    segment_id_1 = shmget(IPC_PRIVATE, shared_segment_size,
                          SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
    if (segment_id_1 < 0) {
        perror("shmget");
        return 1;
    }
    shared_memory_1 = (char *) shmat(segment_id_1, 0, 0);
    if (shared_memory_1 == (char *) -1) {
        perror("shmat");
        return 1;
    }
    sprintf(shared_memory_1, "ERMAN");
    sleep(100); /* keep the segment attached while we inspect /proc/meminfo */
    shmdt(shared_memory_1);
    shmctl(segment_id_1, IPC_RMID, NULL); /* mark the segment for removal */
    return 0;
}

0x40000000 means 1 GByte, which means 512 hugepages on Linux (with the 2MB hugepage size).
So, we tell our program to allocate 1GB (512 hugepages) of shared memory from the hugepages.
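We can double-check this size arithmetic quickly:

```python
HUGEPAGE_SIZE = 2 * 1024 * 1024  # 2 MB, as reported by Hugepagesize in /proc/meminfo

def hugepages_needed(nbytes):
    """How many 2 MB hugepages are needed to back a segment of nbytes."""
    return (nbytes + HUGEPAGE_SIZE - 1) // HUGEPAGE_SIZE  # round up

print(hugepages_needed(0x40000000))  # 1 GB -> 512
```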

Remember, our hugepage count was 100, so there were 100 hugepages in our hugepage pool, as shown earlier. On the other hand, we set 1000 overcommit hugepages.

Well, when we execute this program, we see that 512 pages are allocated. So our pool has been enlarged :)

[root@jiratemp ~]# grep Huge /proc/meminfo 
HugePages_Total:     512
HugePages_Free:      511
HugePages_Rsvd:      511
HugePages_Surp:      412
Hugepagesize:       2048 kB

[root@jiratemp ~]#  hugeadm --pool-list
      Size  Minimum  Current  Maximum  Default
   2097152      100      512     1100        *

[root@jiratemp ~]# free -m
             total       used       free     shared    buffers     cached
Mem:          7985       1567       6417          0          3         25
-/+ buffers/cache:       1538       6446
Swap:         8015        125       7890

Now, our free memory has decreased by about 1024 MBytes.
So overcommit works perfectly. We had 100 hugepages at first, so our hugepages were occupying only 200 MBytes initially. However, when we needed more, we could allocate more (thanks to overcommit).
We got ourselves an environment which can do dynamic hugepage allocation.
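The pool arithmetic we just observed can be modeled in a few lines. This is a simplified sketch (the kernel's real accounting has more states), but the Total/Surplus math matches our demo numbers:

```python
def hugepage_pool(nr_hugepages, nr_overcommit, requested):
    """Return (HugePages_Total, HugePages_Surp) after 'requested' hugepages
    are taken from a pool with the given settings, or None if the request
    exceeds the persistent + overcommit capacity."""
    if requested > nr_hugepages + nr_overcommit:
        return None  # request cannot be satisfied
    surplus = max(0, requested - nr_hugepages)  # borrowed from the normal pool
    total = max(nr_hugepages, requested)
    return total, surplus

# Our demo: 100 persistent pages, 1000 overcommit, the program asks for 512
print(hugepage_pool(100, 1000, 512))  # -> (512, 412)
# After the program exits, the surplus is freed back:
print(hugepage_pool(100, 1000, 0))    # -> (100, 0)
```

With 100 persistent pages and a request for 512, the surplus of 412 matches the HugePages_Surp value we saw in /proc/meminfo.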

WHAT ABOUT ORACLE DATABASE? CAN IT USE OVERCOMMIT HUGEPAGES?

Let's try with the Oracle database;

First of all, our limits.conf should be configured properly to use hugepages. In other words, the oracle OS user must be able to lock memory when the database is instructed to use hugepages (especially when it is instructed to use hugepages only!)

This can be done in 2 ways.

1) By setting a capability on the oracle binary, as the root user:
cd $ORACLE_HOME/bin
setcap cap_ipc_lock=+ep oracle

2) By adding the following to limits.conf (change the values according to your needs):
oracle soft    memlock        unlimited
oracle hard    memlock        unlimited
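As a side note, the effective memlock limit of the current OS user can also be checked from a script; here is a small sketch using Python's standard resource module:

```python
import resource

def memlock_limit_kb():
    """Return the (soft, hard) memlock limits in KB, or 'unlimited'."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)  # values in bytes
    to_kb = lambda v: "unlimited" if v == resource.RLIM_INFINITY else v // 1024
    return to_kb(soft), to_kb(hard)

if __name__ == "__main__":
    print("memlock (soft, hard):", memlock_limit_kb())
```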

If we don't do one of these configurations, we end up with the following ORA-27137;

[oracle@jiratemp ~]$ sqlplus "/as sysdba"

SQL*Plus: Release 12.1.0.1.0 Production on Thu Nov 10 11:08:01 2016
Copyright (c) 1982, 2013, Oracle.  All rights reserved.
Connected to an idle instance.
SQL> startup nomount;
ORA-27137: unable to allocate large pages to create a shared memory segment
Linux-x86_64 Error: 1: Operation not permitted
Additional information: 14680064
Additional information: 1

Well, suppose we have configured our memory lock parameters (or set the capability on the oracle binary),
and configured our memory-related database parameters as follows;
--these parameters configure the initial memory allocation of the Oracle Database when it is started (sga_target is actually the one that does this)

sga_max_size = 1000M
sga_target = 500M
use_large_pages = ONLY --> this instructs Oracle to use only hugepages.

We set the hugepage overcommit count to 1000 and the hugepage count to 100;

[root@jiratemp ~]# sync;echo 3 > /proc/sys/vm/drop_caches ; free -m

             total       used       free     shared    buffers     cached
Mem:          7985        205       7779          0          0         16
-/+ buffers/cache:        189       7796
Swap:         8015         30       7985

[root@jiratemp ~]# grep Huge /proc/meminfo

HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

[root@jiratemp ~]# hugeadm --pool-list
      Size  Minimum  Current  Maximum  Default
   2097152        0        0        0        *

[root@jiratemp ~]# echo 100 > /proc/sys/vm/nr_hugepages
[root@jiratemp ~]# echo 1000 > /proc/sys/vm/nr_overcommit_hugepages

[root@jiratemp ~]# hugeadm --pool-list
      Size  Minimum  Current  Maximum  Default
   2097152      100      100     1100        *

So, we start up our Oracle database (note that starting the database in nomount mode is enough for this test) as follows;

[oracle@jiratemp ~]$ sqlplus "/as sysdba"

SQL*Plus: Release 12.1.0.1.0 Production on Thu Nov 10 11:23:41 2016
Copyright (c) 1982, 2013, Oracle.  All rights reserved
Connected to an idle instance.
SQL> startup nomount;
ORACLE instance started.
Total System Global Area 1043886080 bytes
Fixed Size                  2296280 bytes
Variable Size             876611112 bytes
Database Buffers          159383552 bytes
Redo Buffers                5595136 bytes

We check that our parameters are set;

SQL> show parameter sga_target   

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
sga_target                           big integer     500M

SQL> show parameter sga_max_size  
sga_max_size                     big integer   1000M


Once our database is started, we directly check the hugepage usage;

Remember, our overcommit hugepage count was set to 1000 (2000 MB) and our hugepage count was set to 100 (200 MB).

[root@jiratemp ~]#  grep Huge /proc/meminfo 
HugePages_Total:     501
HugePages_Free:      252
HugePages_Rsvd:      252
HugePages_Surp:      401
Hugepagesize:       2048 kB

You see? The total hugepage count is now 501 (it was 100 earlier), and the hugepage surplus is 401. So Linux let Oracle allocate more hugepages than configured, by respecting the overcommit configuration. In other words, Linux let Oracle allocate (reserve + use) about 500 hugepages by enlarging the hugepage pool automatically and dynamically.

As we set the sga_target parameter to 500M, Oracle used almost 250 hugepages (see that HugePages_Free is 252); Oracle also reserved 252 more hugepages, as sga_max_size was set to 1000M.

Now, we set the sga_target parameter to 1000 MBytes and see if Linux will let Oracle use all of the ~500 hugepages.

SQL> alter system set sga_target=1000M scope=memory;

System altered.
[oracle@jiratemp ~]$ grep Huge /proc/meminfo 
HugePages_Total:     501
HugePages_Free:        2
HugePages_Rsvd:        2
HugePages_Surp:      401
Hugepagesize:       2048 kB

Yes. Oracle used almost 500 hugepages to build its ~1000M sized SGA on top of these hugepages.

When we shut down our Oracle database, we see the hugepage pool dynamically deallocated: the space occupied by Oracle's hugepages is freed, as expected.

SQL> shu immediate;
ORA-01507: database not mounted
ORACLE instance shut down.

[root@jiratemp ~]#  grep Huge /proc/meminfo 
HugePages_Total:     100
HugePages_Free:      100
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

HOW DOES ORACLE DO IT?

Well, we see that Oracle can go with the hugepage overcommit setting (nr_overcommit_hugepages).
What Oracle actually does is honour sga_max_size at instance startup and commit to allocating the hugepages necessary to satisfy sga_max_size. However, it actually only uses the hugepages necessary to satisfy sga_target.
Although the configured number of hugepages (nr_hugepages) is set to a lower value, Oracle can allocate the necessary number of hugepages to satisfy its sga_max_size, because a hugepage overcommit configuration is in place. Furthermore, when the Oracle database is shut down, the surplus hugepages (allocated count - nr_hugepages) are given back to the OS.
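Under these assumptions (2 MB hugepages, and ignoring the small fixed-SGA overhead that makes the real numbers a page or two higher, e.g. 501 instead of 500), the expected counters can be estimated like this:

```python
HUGEPAGE_MB = 2

def expected_counters(sga_max_mb, sga_target_mb):
    """Rough expected HugePages_Total / Free / Rsvd for an instance
    started with hugepages only."""
    total = sga_max_mb // HUGEPAGE_MB    # allocated (shmget) for sga_max_size
    used = sga_target_mb // HUGEPAGE_MB  # actually touched for sga_target
    free = total - used                  # allocated but not yet faulted in
    rsvd = free                          # reserved for the segment, not used yet
    return {"Total": total, "Free": free, "Rsvd": rsvd}

print(expected_counters(1000, 500))  # close to the 501/252/252 we observed
```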

CONCLUSION:

So, following are the conclusions;

  • Hugepages are not swapped out under any circumstances.
  • Hugepages are not given to any process that wants to allocate regular pages.
  • Hugepages occupy memory once they are configured, even though they are not used.
  • nr_overcommit_hugepages is a good thing in the relevant cases.
  • nr_overcommit_hugepages lets a process allocate more hugepages than the configured nr_hugepages.
  • When a process shuts down gracefully or releases its memory, the hugepages used by that process are given back to the system; the surplus hugepages are freed back to the kernel's normal page pool, while the nr_overcommit_hugepages setting itself stays as we configured it.
  • Oracle will go with the nr_overcommit_hugepages parameter.
  • Oracle will allocate (not use) shared memory backed by hugepages based on sga_max_size. (We can think of it as Oracle allocating shared memory using shmget at instance startup; the segment size argument given to shmget equals the value defined in sga_max_size.) So, when Oracle allocates shared memory at startup, it actually reserves the hugepages and the reserved hugepage count (HugePages_Rsvd) increases. On the other hand, Oracle will use a number of hugepages based on sga_target, and that's why HugePages_Free (seen in /proc/meminfo) decreases accordingly.
  • If we use hugepages (only), we should set sga_max_size equal to sga_target (in order not to waste our memory).

Well, after knowing all this, I want to give an example where overcommit hugepages can be used to address a memory wastage problem.

QUESTION:

Suppose; 
  • We have 3 databases in a single-server environment. Let's say these are TEST, DEV and UAT.
  • We sometimes work only in TEST, sometimes in DEV, sometimes in 2 of these databases, and sometimes in all 3.
  • We also have processes/sessions which do lots of PGA work.
  • We have 100 Gigabytes of RAM and we want to reserve 10GB of it for the OS.
  • We want to have a 20GB sized SGA for each of these databases.
  • We want to use sga_max_size/sga_target (ASMM -- not AMM) and use hugepages as well.

Now, consider: which one of the following settings is the better idea?

1)

 echo 10240 > /proc/sys/vm/nr_hugepages
 echo 30720 > /proc/sys/vm/nr_overcommit_hugepages

2)

 echo 30720 > /proc/sys/vm/nr_hugepages
 echo 30720 > /proc/sys/vm/nr_overcommit_hugepages

MY ANSWER:

*My answer is 1) . 
The reason is explained earlier. :)
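To put numbers on that reason, here is a quick sanity check of the values in option 1 (10240 pages cover the one 20GB SGA we typically run; the overcommit ceiling covers the peak when all 3 databases are up, and unused surplus pages go back to the normal pool for the PGA-heavy work):

```python
HUGEPAGE_MB = 2
GB = 1024  # MB per GB

sga_per_db_mb = 20 * GB  # 20 GB SGA per database
databases = 3

pages_per_db = sga_per_db_mb // HUGEPAGE_MB
print(pages_per_db)              # -> 10240, nr_hugepages for a single database
print(databases * pages_per_db)  # -> 30720, nr_overcommit_hugepages for the peak
```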
What is yours? please feel free to comment...

1 comment :

  1. Excellent Erman! I was looking for overcommit behaviour and your write-up cleared it all. Thanks.

    One query: you mentioned that when a process shuts down gracefully then "the nr_overcommit_hugapages value will be set to its default." Will that be set to default or the memory occupied in the form of hugepages will be returned to the default pool. I believe the value set for nr_overcommit_hugapages (in /proc/sys/vm/nr_overcommit_hugepages) will stay the same as we set it, isnt it?

    Thanks again.


If you want to ask a question, please don't comment here.

For your questions, please create an issue in my forum.

Forum Link: http://ermanarslan.blogspot.com.tr/p/forum.html

Register and create an issue in the related category.
I will support you from there.