Tuesday, January 21, 2014

Linux-- cron.daily hangs, mcelog hangs, Xen kernel, XenStoreD, /sys/hypervisor/uuid

In one of our client's production environments, we faced an incident.
Approximately one day after startup, the Linux server, which was running Red Hat Linux 5.4 64-bit (with the 2.6.18-164.el5xen kernel), started to experience a high load.

Okay, I will keep it short.
The reason behind the problem was that the machine had 6 GB of memory and all of it was in use. This was due to a high process count on the server: when I checked the processes with ps, I saw a lot of crond processes, because cron was trying to execute what was scheduled to run hourly (cron.hourly). I didn't go into the details and analyze it further to prove that this was the cause, because it was obviously the cause :)
Going down from parent to child, I saw that cron was trying to execute the mcelog script, and the script was hanging.
Going down from parent to child again, I saw that the script was trying to do the following:
`cat /sys/hypervisor/uuid` != "00000000-0000-0000-0000-000000000000"
The cat command never finished. The relevant process was in D state, in other words in uninterruptible sleep, which usually means waiting for I/O.
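
For anyone retracing this kind of diagnosis, here is a minimal sketch of how the stuck processes could be picked out of the process list. The helper name filter_d_state is my own invention, and it assumes ps output in the column order shown in the comment, so treat it as an illustration rather than the exact commands I ran.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original incident notes).
# Keeps the header line plus every process whose state column starts
# with "D" (uninterruptible sleep).
# Expects input in the column order produced by:
#   ps -eo pid,ppid,stat,wchan:32,args
filter_d_state() {
    awk 'NR == 1 || $3 ~ /^D/'
}

# Usage sketch: list D-state processes together with their parent PIDs,
# so the chain crond -> mcelog.cron -> cat can be followed upwards.
#   ps -eo pid,ppid,stat,wchan:32,args | filter_d_state
```

The PPID column is what makes it possible to walk from the hanging cat back up to the crond that spawned it.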

Then I checked the mcelog daily script. It looked like this:

if [ -e /proc/xen ] && [ `cat /sys/hypervisor/uuid` != "00000000-0000-0000-0000-000000000000" ]; 
then
# this is a PV Xen guest. Do not run mcelog.
 exit 1;
else
/usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog
fi

So, crond was running mcelog.cron, which is supplied by the mcelog package.
The kernel was a Xen kernel, and Xen-related software packages were running on it.
Note that there was no need for a Xen kernel on this server, but that's another story :)

A brief explanation of Xen: it is a native (bare-metal) hypervisor providing services that allow multiple computer operating systems to execute on the same computer hardware concurrently.


Anyway, I found a similar bug in SUSE rather than Red Hat Linux, but it seems we were hitting something like the following:

patches.xen/499-sysfs-uuid-hang.patch: Avoid kernel hang reading
+ /sys/hypervisor/uuid if xenstore is not available.
Also, this one is from Red Hat 5:

Bug 225203 Reading /sys/hypervisor/uuid in Dom0 hangs if XenStoreD isn't running
Description of problem: Attempting to read from /sys/hypervisor/uuid in Dom0 will hang indefinitely if XenStored hasn't been launched.
Version-Release number of selected component (if applicable): uname -r 2.6.18-1.2747.el5xen
kernel-xen-2.6.18-1.2747.el

In conclusion;

we can say that the mcelog cron script was trying to read VM-related (domain-level) information through XenStore, and the XenStore daemon was not running. The /sys/hypervisor/uuid path relies on XenStore to query the domain's UUID (handle), so we were hitting the Red Hat 5 bug described above.

Possible solutions are:
  • Editing the mcelog.cron script: commenting out all of the lines and putting an exit 0 at the end of the file.
  • Checking with kill -0 `cat /var/run/xenstore.pid` before reading /sys/hypervisor/uuid.
  • Changing the code in the script to something like the following (I didn't test it):

           if [ -e /proc/xen/capabilities ]; then
               # xen
               grep control_d /proc/xen/capabilities > /dev/null 2>&1
               if [ $? -ne 0 ]; then
                   # domU -- do not run on xen PV guest
                   exit 1
               fi
           fi
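
The kill -0 idea from the second option could be wrapped in a small function. This is only a sketch that I have not run on the affected server: the function name xenstored_running is hypothetical, and the default pid file path is simply the one mentioned in the list above.

```shell
#!/bin/bash
# Sketch only: returns 0 if the PID stored in the pid file belongs to a
# live process, 1 otherwise. The function name is hypothetical.
# Note: kill -0 also fails when the caller lacks permission to signal
# the process, so this assumes the script runs as root (as cron jobs
# under /etc/cron.* normally do).
xenstored_running() {
    local pidfile="${1:-/var/run/xenstore.pid}"
    local pid=""
    [ -r "$pidfile" ] || return 1
    # read fails on a file without a trailing newline but still sets pid
    read -r pid < "$pidfile" || [ -n "$pid" ] || return 1
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;   # empty or not a plain number
    esac
    kill -0 "$pid" 2>/dev/null
}

# Usage sketch inside mcelog.cron, before /sys/hypervisor/uuid is touched:
#   if [ -e /proc/xen ] && ! xenstored_running; then
#       exit 0   # reading the uuid could hang, so skip this run
#   fi
```

The point of checking first is that once cat is stuck in D state it cannot be killed, so the read has to be avoided rather than interrupted.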
