Monday, April 6, 2015

Linux -- A problem diagnostic - rsync, out_of_memory, oom_kill_process, vmmemctl, overcommit_memory, /proc/sysrq-trigger -- All in one Post

Recently, we delivered an application filesystem sync utility.. It was using rsync to do the synchronization of EBS filesystems.. In such synchronizations, we usually proceed with Dataguard in the DB tier, and rsync in the Apps Tier..

In this article, I will focus on the synchronization in the Apps Tier..
As you may know, a production EBS instance produces several log and output files day by day.. These logs and out files are the most important part of the daily sync activities.. I mean, you can sync the Application Filesystem (APPL_TOP, INST_TOP, COMMON_TOP and the Apps Tier Oracle Homes) once a week, but these log and out files need to be synchronized at least daily, because they are changing/being produced continuously..

The rsync approach is actually documented in Oracle's Business Continuity documentation as follows;

If you wish to synchronize your concurrent manager log and out files from the primary to the standby, create directories that correspond to the APPLCSF environment variables on the standby application tier server(s).
For example:
$ mkdir -p <APPLCSF from PRODUCTION>/log
$ mkdir -p <APPLCSF from PRODUCTION>/out
Repeat this on the primary server, creating directories matching the standby context name, to be ready for a switchover operation.
For UNIX systems, on the primary application tier(s), set up an rsync job in cron, to run every few minutes. The following example synchronizes the log directory:
$ rsync -av <APPLCSF>/log <standby_host>:<APPLCSF from PRODUCTION>/log --rsync-path=/usr/local/bin/rsync

Even the Business Continuity document for 11i suggests that..

Business Continuity for Oracle E-Business Release 11i Using Oracle 11g Release 2 and later Physical Standby Database - Single Instance and Oracle RAC (Doc ID 1068913.1)

Okay.. But there is one thing.. EBS 11i application servers are 32 bit.. Actually, they must be 32 bit; the code does not work on 64 bit.. This brings a challenge for memory, as a 32-bit system cannot stably handle much more than 16 GB of memory. So we normally deal with a limited amount of memory when working with EBS 11i apps servers.. So we must be careful..

Let's take a look at the following real-life story and be aware of the possible consequences..
The system in question was a 32-bit Oracle Linux, running the 2.6.32-200.13.1.el5uek kernel..
It was a standby Application server, which was the target of the rsync..
One day it locked up and could not answer any connection attempts.. It required a hard reboot..
Note that: the standby apps server had 8 GB of memory.

After rebooting, we first noticed the following in /var/log/messages..

There was an out-of-memory at 6:19 PM. This was the last trace..
Unfortunately, we did not have OSWatcher running..

Just a little reminder;

OSWatcher collects the following system information;

meminfo (Linux Only)
slabinfo (Linux Only)

But we had SAR, so we consulted the SAR reports.. However, they were all empty, for all the metrics, after 6:10 PM..
Note that: probably, OSWatcher would not have been able to collect its information in this incident either, as it seemed that the system as a whole suddenly stopped working because of this incident..

So we investigated the SAR reports, and tried to interpret them accordingly..

Context Switches per second.
05:30:01 AM    147.21
05:40:01 AM    153.14
05:50:01 AM    150.48
06:00:01 AM    150.97
06:10:01 AM    215.15

                pgpgin/s pgpgout/s   fault/s  majflt/s
05:00:01 AM      0.00      0.19      4.21      0.00
05:10:01 AM      0.01      0.18      5.89      0.00
05:20:01 AM      0.00      0.21      4.20      0.00
05:30:01 AM      0.00      0.16      4.20      0.00
05:40:01 AM      0.00      0.32      5.25      0.00
05:50:01 AM      0.00      0.13      4.20      0.00
06:00:01 AM      0.72      0.39     19.77      0.01
06:10:01 AM   3737.28   3677.23     25.48      0.02
                  tps      rtps      wtps   bread/s   bwrtn/s
05:00:01 AM      0.11      0.00      0.11      0.00      1.12
05:10:01 AM      0.12      0.01      0.11      0.04      1.08
05:20:01 AM      0.12      0.00      0.12      0.00      1.28
05:30:01 AM      0.09      0.00      0.09      0.00      0.96
05:40:01 AM      0.17      0.00      0.17      0.00      1.92
05:50:01 AM      0.07      0.00      0.07      0.00      0.76
06:00:01 AM      0.36      0.15      0.21      4.31      2.34
06:10:01 AM   1095.44    113.58    981.85  22423.70  22063.37

                       IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
05:10:01 AM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
05:10:01 AM      eth0      4.59      0.24    636.79     38.50      0.00      0.00      1.75
05:20:01 AM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
05:20:01 AM      eth0      4.70      0.48    635.94     50.97      0.00      0.00      1.71
05:30:01 AM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
05:30:01 AM      eth0      4.39      0.22    610.48     35.23      0.00      0.00      1.69
05:40:01 AM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
05:40:01 AM      eth0      4.47      0.25    620.27     38.46      0.00      0.00      1.74
05:50:01 AM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
05:50:01 AM      eth0      4.37      0.24    605.46     38.13      0.00      0.00      1.72
06:00:01 AM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:00:01 AM      eth0      4.52      0.31    625.29     48.23      0.00      0.00      1.72
06:10:01 AM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:10:01 AM      eth0     13.95      7.89  12658.31   1526.62      0.00      0.00      1.81

From the SAR reports, it could be seen that exactly at 6:10 there was a spike in system activity.. I mean network, paging, context switches and IO..

This was a standby application server, so there should not have been big activity on it.. With this in mind, we consulted the crontab, and saw the following;

0 8,10,12,14,18,20,22,0,2,4,6 * * * rsync -av --delete /oracle/PROD/R12/inst/apps/PROD_ebslvapp/logs/appl/conc/log/ --rsync-path=/usr/bin/rsync 

Yes.. There was a cron job on the PROD application tier for synchronizing the log and out files of the Application Tiers.. Rsync was used with the "a", "v" and "--delete" arguments..

The cron job was saying: "run rsync every 2 hours, every day.."
The arguments mean the following;

-v : increase verbosity
-a : archive mode; equals -rlptgoD (no -H,-A,-X)

So, -a does;

-r, --recursive recurse into directories
-l, --links copy symlinks as symlinks
-p, --perms preserve permissions
-t, --times preserve modification times
-g, --group preserve group
-o, --owner preserve owner (super-user only)
-D same as --devices --specials
--devices preserve device files (super-user only)
--specials preserve special files
-a does not include the following;

-H, --hard-links preserve hard links
-A, --acls preserve ACLs (implies -p)
-X, --xattrs preserve extended attributes

--delete: delete extraneous files from dest dirs.

So, the cron job was logical..
We checked the crond log and saw that the cron job was triggered by the crond of the PROD application server at 06:00 pm.
Note that: in this configuration, rsync was contacting the remote system (the standby application server) using a remote-shell (ssh) program as the transport.. This is rsync's internal mechanism.. It is not the rsync daemon; it uses a remote shell to reach the target node.

Mar 29 06:00:01 crond[10266]: () CMD (rsync -av --delete /oracle/PROD/R12/inst/apps/PROD_ebslvapp/logs/appl/conc/log/ --rsync-path=/usr/bin/rsync 

Okay, cron triggered the rsync job at 06:00 pm, and the system went down somewhere between 6:10 and 6:20 pm.
Could this hang be related to rsync??

Let's explore how rsync works;

By default rsync determines which files differ between the sending and receiving systems by checking the modification time and size of each file. As this only requires reading file directory information, it is quick, but it will miss unusual modifications which change neither.

To determine which parts of a file have changed, the recipient splits its copy of the file into chunks and computes two checksums for each chunk: the MD5 hash, and a weaker but easier to compute rolling checksum. It sends these checksums to the sender. The sender quickly computes the rolling checksum for each chunk in its version of the file; if they differ, it must be sent. If they're the same, the sender uses the more computationally expensive MD5 hash to verify the chunks are the same. The sender then sends the recipient those parts of its file that did not match, along with information on where to merge these blocks into the recipient's version. This makes the copies identical.
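The block-matching idea above can be sketched in a few lines of Python.. This is only a simplified illustration: the weak checksum below is an Adler-32-flavoured stand-in, not rsync's exact rolling checksum, and the fixed block size is an assumption (rsync chooses it dynamically);

```python
import hashlib

BLOCK_SIZE = 700  # assumption for illustration; rsync picks this dynamically

def weak_checksum(block: bytes) -> int:
    # Simplified rolling-style checksum: two running sums, Adler-32 flavour.
    a = sum(block) % 65536
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % 65536
    return (b << 16) | a

def block_signatures(data: bytes):
    # The recipient splits its copy into chunks and computes both
    # checksums (weak rolling + strong MD5) for each chunk.
    sigs = []
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        sigs.append((weak_checksum(block), hashlib.md5(block).hexdigest()))
    return sigs

# The sender scans its own version; a weak-checksum hit is confirmed with MD5,
# and any block with no match must be sent over the wire.
receiver_copy = b"hello world " * 200
sender_copy = receiver_copy + b"appended tail"
recv_sigs = block_signatures(receiver_copy)
send_sigs = block_signatures(sender_copy)
matching = sum(1 for s in send_sigs if s in recv_sigs)
print(f"{matching} of {len(send_sigs)} sender blocks already on the receiver")
```

Only the last block differs here, so only it would need to cross the network..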

When Rsync communicates with a remote non-daemon server via a remote shell the startup method is to fork the remote shell which will start an Rsync server on the remote system. Both the Rsync client and server are communicating via pipes through the remote shell. As far as the rsync processes are concerned there is no network. In this mode the rsync options for the server process are passed on the command-line that is used to start the remote shell.

The first thing that happens once the startup has completed is that the sender will create the file list.

**"So it started at 6:00 pm in the Application Prod server, and probably took some time.."

While it is being built, each entry is transmitted to the receiving side in a network-optimised way.

When this is done, each side sorts the file list lexicographically by path relative to the base directory of the transfer. (The exact sorting algorithm varies depending on what protocol version is in effect for the transfer.) Once that has happened all references to files will be done by their index in the file list.

If necessary, the sender follows the file list with id→name tables for users and groups, which the receiver will use to do an id→name→id translation for every file in the file list.

After the file list has been received by the receiver, it will fork to become the generator and receiver pair completing the pipeline.
**"Well, after the file list, the Application Standby server becomes the generator and receiver, and does the serious computing.."

The generator process compares the file list with its local directory tree. Prior to beginning its primary function, if --delete has been specified, it will first identify local files not on the sender and delete them on the receiver.
The generator will then start walking the file list. Each file will be checked to see if it can be skipped. In the most common mode of operation, files are not skipped if the modification time or size differs.

The receiver will read from the sender data for each file identified by the file index number. It will open the local file (called the basis) and will create a temporary file.
The receiver will expect to read non-matched data and/or to match records all in sequence for the final file contents. When non-matched data is read it will be written to the temp-file. When a block match record is received the receiver will seek to the block offset in the basis file and copy the block to the temp-file. In this way the temp-file is built from beginning to end.
The file's checksum is generated as the temp-file is built. At the end of the file, this checksum is compared with the file checksum from the sender. If the file checksums do not match the temp-file is deleted. If the file fails once it will be reprocessed in a second phase, and if it fails twice an error is reported.
After the temp-file has been completed, its ownership and permissions and modification time are set. It is then renamed to replace the basis file.
Copying data from the basis file to the temp-file makes the receiver the most disk-intensive of all the rsync processes. Small files may still be in disk cache, mitigating this, but for large files the cache may thrash as the generator has moved on to other files, and there is further latency caused by the sender. As data is read, possibly at random, from one file and written to another, if the working set is larger than the disk cache, then what is called a seek storm can occur, further hurting performance.
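The temp-file reconstruction described above can be illustrated with a toy Python sketch.. Note that the "match"/"data" record format here is hypothetical, just to show how the temp-file is built from basis blocks and literal data; it is not rsync's actual wire protocol;

```python
import hashlib
import io

BLOCK = 4  # toy block size, for illustration only

def rebuild(basis: bytes, instructions):
    # instructions: ("match", block_index) copies a block from the basis file,
    # ("data", bytes) writes literal (non-matched) data received from the sender.
    temp = io.BytesIO()
    whole = hashlib.md5()
    for kind, arg in instructions:
        if kind == "match":
            # seek to the block offset in the basis file and copy the block
            chunk = basis[arg * BLOCK:(arg + 1) * BLOCK]
        else:
            chunk = arg
        temp.write(chunk)
        whole.update(chunk)  # the file's checksum is generated as the temp-file is built
    return temp.getvalue(), whole.hexdigest()

basis = b"AAAABBBBCCCC"
new_file, digest = rebuild(basis, [("match", 0), ("data", b"XXXX"), ("match", 2)])
# at the end, this whole-file checksum is compared with the one from the sender
assert digest == hashlib.md5(b"AAAAXXXXCCCC").hexdigest()
print(new_file)
```

After this, the real receiver sets ownership/permissions/mtime on the temp-file and renames it over the basis file..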

**"So, according to the info above, the standby application server does a lot of work in this rsync process."

Let's come back to our topic..
We had seen out-of-memory errors, right? Just in the same time interval that our rsync job was working.

Okay, in this example we were syncing the log and out files of a production EBS instance, so there were several log and out files to be synchronized; besides, there were a bunch of big log files (2 GB) in our synchronization directories..

Although a heavy rsync operation produces a lot of activity, it would be a weak conclusion to directly blame rsync for this problem..

Let's explore the other way around... I mean, the system side..
Looking at the error syslogd recorded, we can say that;

vmmemctl, which is a kernel driver for VMware environments that collaborates with the host server to reclaim the pages considered least valuable by the guest operating system, invoked the oom_killer, which is a self-protection mechanism, enabled by default, employed by the Linux kernel when under severe memory pressure..
So vmmemctl may be using /proc/sys/kernel/sysrq or /proc/sysrq-trigger to trigger the oom_killer.
For example: echo "f" > /proc/sysrq-trigger   -- to call oom_kill.

Using the echo command to write to this file, a remote root user can execute most System Request Key commands as if at the local terminal. To echo values to this file, /proc/sys/kernel/sysrq must be set to a value other than 0.

‘k’ – Kills all processes running on the current virtual console.
‘s’ – Attempts to sync all mounted filesystems.
‘b’ – Immediately reboots the system, without unmounting partitions or syncing.
‘e’ – Sends SIGTERM to all processes except init.
‘m’ – Outputs current memory information to the console.
‘i’ – Sends SIGKILL to all processes except init.
‘r’ – Switches the keyboard from raw mode (the mode used by programs such as X11) to XLATE mode.
‘t’ – Outputs a list of current tasks and their information to the console.
‘u’ – Remounts all mounted filesystems read-only.
‘o’ – Shuts down the system immediately.
‘p’ – Prints the current registers and flags to the console.
’0-9′ – Sets the console log level, controlling which kernel messages will be printed to your console.
‘f’ – Calls oom_kill to kill the process which takes the most memory.
‘h’ – Displays help. Any key other than those listed will also print help.


When running a kernel with SysRq compiled in, /proc/sys/kernel/sysrq controls the functions allowed to be invoked via the SysRq key. Here is the list of possible values in /proc/sys/kernel/sysrq:
  • 0 - disable sysrq completely
  • 1 - enable all functions of sysrq
  • >1 - bitmask of allowed sysrq functions (see below for detailed function description):
  • 2 - enable control of console logging level
  • 4 - enable control of keyboard (SAK, unraw)
  • 8 - enable debugging dumps of processes etc.
  • 16 - enable sync command
  • 32 - enable remount read-only
  • 64 - enable signalling of processes (term, kill, oom-kill)
  • 128 - allow reboot/poweroff
  • 256 - allow nicing of all RT tasks
Why does vmmemctl invoke the oom_killer?

vmmemctl works in conditions where some other VM guest requires more pages.

It allocates pages to build the required balloon..
It seems that, in this situation, vmmemctl tried to allocate pages in our application standby server, and at its last try it got a NULL from its malloc call.. That's why it invoked the oom-killer, and the oom-killer killed some important processes to reclaim memory, which resulted in a system crash..

Also, there is a known bug related to it:
The balloon driver, vmmemctl, is unaware of pinned pages (1003586)

Could the problem be rsync's memory usage, then? Considering again...

We monitored the subsequent runs of rsync, and I can say that it did not allocate any more than 20 MB of resident memory. It fills the file cache, though.. I mean, when we looked at the memory consumption, we saw almost 2 GB of file cache used by rsync, but this was not the issue.
The only effect of it may be negative performance, as large files may occupy the cache space, and this may force other data blocks out of the cache. This can affect overall system performance, because the other processes running on the system will need to reread their data from disk.

Just in case, we created a cron job for freeing the file-based cache and executed it regularly during the rsync executions.
5,10,35 8,10,12,14,18,20,22,0,2,4,6 * * * sync; echo 3 > /proc/sys/vm/drop_caches

Alternatively, a patched version of rsync, which has a --no-cache / --drop-cache option, may be used in this situation..
Website of the patch :
Patch file :
--drop-cache              do not cache rsync files (POSIX_FADV_DONTNEED)
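For reference, the same POSIX_FADV_DONTNEED hint that the patch issues is reachable from Python's standard library on Linux (Python 3.3+).. A sketch, assuming we simply want to drop the cached pages of a file right after reading it;

```python
import os

def read_without_caching(path):
    # Read a file, then tell the kernel we will not need its pages again,
    # so they can be evicted from the page cache -- the same hint the
    # rsync --drop-cache patch gives after transferring a file.
    fd = os.open(path, os.O_RDONLY)
    try:
        data = b""
        while chunk := os.read(fd, 1 << 20):
            data += chunk
        # offset=0, length=0 means "the whole file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return data
    finally:
        os.close(fd)
```

The hint is advisory: the kernel may drop the pages, but correctness of the read is unaffected either way..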

What are the solutions for all of these possibilities?

I say "possibilities" because we didn't have the diagnostic data.. It could not be generated..

1) Dropping the caches during rsync run (just in case)
5,10,35  8,10,12,14,18,20,22,0,2,4,6 * * * sync; echo 3 > /proc/sys/vm/drop_caches

2) Set the overcommit kernel parameters; (just in case)
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
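With vm.overcommit_memory = 2, the kernel refuses allocations beyond a hard commit limit, which is computed roughly as swap plus overcommit_ratio percent of RAM.. A quick sketch of what that would mean on our 8 GB standby server (the 8 GB swap figure is an assumption, just for illustration);

```python
def commit_limit_mb(ram_mb, swap_mb, overcommit_ratio):
    # With overcommit_memory=2, allocations beyond this limit fail with
    # ENOMEM up front, instead of succeeding and risking the OOM killer later.
    return swap_mb + ram_mb * overcommit_ratio // 100

# The standby apps server had 8 GB RAM; assume 8 GB swap for illustration.
print(commit_limit_mb(8192, 8192, 80))  # 14745
```

So a runaway allocator would hit a clean malloc failure at ~14.4 GB of committed memory rather than dragging the whole system down..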

3) Set the VM parameter sched.mem.maxmemctl to an optimized value..

Ref: Vmware
The administrator of a system using pinned pages should specify the maximum balloon size (in megabytes) using the sched.mem.maxmemctl parameter in the virtual machine's .vmx configuration file. This overrides the system-wide default.

For example, consider a virtual machine with:
4GB guest memory
2GB pinned for a high-performance user application

In this example, you can specify sched.mem.maxmemctl = "1330" to limit the memory that the balloon driver will attempt to swap to the guest swap space. In this case, a value of 1330MB is used because it is about 65% of the 2GB that can safely be swapped to disk (this leaves some headroom for pinned pages belonging to the guest operating system).
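Just to show the arithmetic behind VMware's example;

```python
# 65% of the 2 GB that can safely be swapped, per VMware's example:
swappable_mb = 2 * 1024
max_balloon_mb = int(swappable_mb * 0.65)
print(max_balloon_mb)  # 1331 -- VMware rounds this down to 1330 in its example
```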

This recommendation assumes you have followed the instructions in the Resource Management Guide for ESX Server. Specifically, "make sure your guest operating systems have sufficient swap space." This swap space must be greater than or equal to:

(the virtual machine's configured memory size - its Reservation) + (the space required to accommodate the workload of the virtual machine)

Furthermore, the maximum balloon size set with sched.mem.maxmemctl must be less than or equal to the configured swap space in the guest operating system. See the Swap Space and Guest Operating System section in the Resource Management Guide for ESX Server for additional information.

4) Increasing the lowmem protection (just in case)
vm.lowmem_reserve_ratio = 256 256 250
Check out this link :

Optional / Disabling the ballooning driver

Ref: Vmware 
Warning: Disabling the balloon driver in a virtual machine results in performance issues with the ESXi/ESX host. For more information, see the vSphere Resource Management Guide.
Disabling ballooning via the vSphere Client

To set the maximum balloon size to zero:

Using the vSphere Client, connect to the vCenter Server or the ESXi/ESX host where the virtual machine resides.
Log into the ESXi/ESX host as a user with administrative rights.
Shut down the virtual machine.
Right-click the virtual machine listed on the Inventory panel and click Edit Settings.
Click the Options tab, then under Advanced, click General.
Click Configuration Parameters.
Click Add row and add the parameter sched.mem.maxmemctl in the text box.
Click on the row next to it and add 0 in the text box.
Click OK to save changes.

In conclusion;

This has become a tricky blog post, as we started with rsync, and ended up with vmmemctl :)

As I already mentioned, this was a real-life story, so I have written the analysis part as it happened in real life :) That is, we first suspected rsync.. Then we thought it was because of applications allocating big amounts of memory although free memory was not available..
On the other hand, the problem was there in the syslogd messages.. So it all started with the vmmemctl kernel driver..
Anyway, I can say that it was a good experience... Coming down from rsync to vmmemctl made me explore different applications, which indirectly brought me a new perspective for dealing with this kind of tainted problem..
