Tuesday, November 25, 2014

EBS, Linux -- fork: retry: Resource temporarily unavailable, limits.conf has no effect in Redhat

The "Resource temporarily unavailable" error is actually self-explanatory: the OS has resources that the process needs, but these resources cannot be given to the process at the moment.
Our interest as Linux admins should be the cause of this error and, of course, the solution to it.
This error may have critical effects, such as:

Not being able to start the EBS services:
/u01/fs2/inst/apps/ERMAN_ERBANT/admin/scripts/adstrtal.sh: fork: retry: Resource temporarily unavailable

Not being able to run basic commands:
 ps aux
-bash: fork: retry: Resource temporarily unavailable

ps -ef |grep pmon
bash: fork: retry: Resource temporarily unavailable
bash: fork: retry: Resource temporarily unavailable

As written in the error lines, the process cannot fork. As is well known, fork() creates a new process by duplicating the calling process.
The cause of the error actually comes from the OS limits, and here is the checklist for finding the cause.

First, we check /etc/sysctl.conf and /etc/security/limits.conf. In these two files we try to find a clue: we check for undersized parameter values such as the max process count and the max file descriptor count.
Note that fork makes me think directly of the process counts, but a file descriptor check is a nice-to-have in these kinds of OS limit errors.

Note that activities such as checking the memory using the free command, or using the df command to check the disk space, are irrelevant here.
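Before digging into the files, a quick triage sketch can show the limits that matter for fork errors. Run it as the affected OS user; ulimit -u is the per-user max process count (nproc) and ulimit -n is the per-process open file count (nofile):

```shell
# Quick triage for "fork: retry: Resource temporarily unavailable".
# -S = soft limit (enforced), -H = hard limit (ceiling the user may raise to).
printf 'soft nproc  : %s\n' "$(ulimit -Su)"
printf 'hard nproc  : %s\n' "$(ulimit -Hu)"
printf 'soft nofile : %s\n' "$(ulimit -Sn)"
printf 'hard nofile : %s\n' "$(ulimit -Hn)"
```

If the soft nproc value here is low (e.g. 1024) on a busy application server, it is the prime suspect.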

Okay, let's explain the problem determination and the solution by walking through an example scenario.

Problem :
/u01/fs2/inst/apps/ERMAN_ERBANT/admin/scripts/adstrtal.sh: fork: retry: Resource temporarily unavailable

Here is an example sysctl.conf: 

net.ipv4.ip_forward = 0
# Controls source route verification
net.ipv4.conf.default.rp_filter = 1
# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0
# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0
# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1
# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 1
# Disable netfilter on bridges.
# net.bridge.bridge-nf-call-ip6tables = 0
# net.bridge.bridge-nf-call-iptables = 0
# net.bridge.bridge-nf-call-arptables = 0

# Controls the default maximum size of a message queue
# kernel.msgmnb = 65536
# Controls the maximum size of a message, in bytes
# kernel.msgmax = 65536
# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736
# Controls the maximum number of shared memory segments, in pages
# kernel.shmall = 4294967296
kernel.sem = 256 32000 100 142
kernel.shmall = 2097152
kernel.shmmni = 4096
kernel.msgmax = 8192
kernel.msgmnb = 65535
kernel.msgmni = 2878
fs.file-max = 6815744
fs.aio-max-nr = 1048576
net.ipv4.ip_local_port_range=9000 65500
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 1048576

We are interested in fs.file-max above. It is 6815744, which is actually pretty good.

Here is an example of limits.conf:

* hard nofile 65536
* soft nofile 4096
* hard nproc 16384
* soft nproc 1024
* hard stack 16384
* soft stack 10240

Okay. In limits.conf, we have soft nproc 1024 and hard nproc 16384.

These values seem okay when you take the Oracle Database installation document as a reference, but there is an important piece of information: these are the suggested minimum values, so they should be increased according to the situation, or let's say according to the load and concurrency.

Okay. After the introduction, here are the file descriptor counts, process counts, and the ulimits of the problematic OS user at the problematic moment:

cat /proc/sys/fs/file-nr
26016 0 6815744

26016 (allocated file handles)
0 (allocated but currently unused file handles)
6815744 (maximum number of file handles, i.e. fs.file-max)
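The three fields above can be read and labeled programmatically; a minimal sketch (Linux-specific, since it reads /proc):

```shell
# /proc/sys/fs/file-nr has three whitespace-separated fields:
# allocated handles, allocated-but-unused handles, and the fs.file-max ceiling.
read -r allocated unused max < /proc/sys/fs/file-nr
echo "allocated file handles          : $allocated"
echo "allocated but unused handles    : $unused"
echo "maximum file handles (file-max) : $max"
```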

lsof | wc -l
33572
(note that lsof displays the allocated file descriptor count a little higher than normal, so it is consistent with the file-nr output)

However, this check is not the right one, because this 33572 is the total number of open files across the whole system. Remember, our nofile limit is per process.
In order to monitor the open files per process, we may use:
for p in /proc/[0-9]* ; do echo $(ls $p/fd | wc -l) $(cat $p/cmdline) ; done | sort -n | tail
In a crowded EBS application server, I see at most 650-660 files opened by a single process.

ps -eLF | grep oracle| wc -l ( !!!! the user oracle has 1029 processes)

Note that the nproc limit is per user (not per process).
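Because nproc is enforced per user and counts kernel tasks (threads included), ps -eLF, which lists one line per thread, is the right view for comparing against the limit. A minimal sketch; "oracle" is the user from this scenario, substitute your own:

```shell
# Count one user's tasks (threads included) -- the quantity that the
# per-user nproc limit is enforced against.
user=oracle   # assumption: the EBS/database OS user in this scenario
count=$(ps -eLF --no-headers 2>/dev/null | awk -v u="$user" '$1 == u' | wc -l)
echo "$user currently owns $count tasks (compare against ulimit -u for $user)"
```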

Let's look at the limits of the oracle user:

ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 95145
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 8192
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

As you may guess, this oracle OS user has reached its process limit (max user processes = 1024) at the problematic moment, and it seems this is what creates our fork problem.

The solution here is increasing the nproc values and restarting our shell sessions. A reboot also works for this purpose, but it is not necessary. What we need to do is edit limits.conf as root, then log in to our oracle account again and start our application processes. In addition, we need to be sure that "session required pam_limits.so" is present in the relevant PAM configuration file (e.g. /etc/pam.d/login), as otherwise limits.conf is not applied at login.
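The limits.conf edit can be sketched as follows; the values are illustrative only, and should be sized to the expected per-user process count rather than copied blindly:

```
# /etc/security/limits.conf -- illustrative values, edited as root.
# After saving, log in again as oracle and confirm with: ulimit -u
oracle soft nproc 16384
oracle hard nproc 65536
```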

Note that:
Watch out for the extra PAM configuration.
In some releases, the PAM package provides a file under /etc/security/limits.d/ which overrides the nproc settings in limits.conf. If this is the case, then the modification should be done in that file.

This behaviour is explained in a bug record at Redhat.

One last thing:
An alternative is putting the following script in the .bash_profile of the problematic OS user (this is the snippet suggested in the Oracle installation documents):

if [ $USER = "oracle" ]; then
        if [ $SHELL = "/bin/ksh" ]; then
              ulimit -p 16384
              ulimit -n 65536
        else
              ulimit -u 16384 -n 65536
        fi
fi


