In one of our client's production environments, we faced an incident.
Approximately one day after startup, the Linux server, which was running Red Hat Enterprise Linux 5.4 64-bit (kernel 2.6.18-164.el5xen), started to experience high load.
Okay, I will keep it short:
The reason behind the problem was that the machine had 6 GB of memory and all of it was in use. This was due to a high process count on the server. When I checked the processes with ps, I saw a lot of crond processes; cron was trying to execute what was scheduled to run hourly (cron.hourly). I didn't analyze it further to prove that this was the cause, because it was obviously the cause :)
Following the process tree from parent to child, I saw that cron was trying to execute the mcelog script, and the script was hanging.
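For anyone who wants to repeat this kind of check, here is a minimal sketch of how such a pile-up can be spotted, assuming the procps version of ps. The function names (count_by_command, count_running_commands, show_process_tree) are made up for illustration and are not part of any package.

```shell
#!/bin/bash
# Hypothetical helpers (names are illustrative only).

# Read one command name per line on stdin and print the most frequent ones first.
count_by_command() {
    sort | uniq -c | sort -rn | head -n 10
}

# Count running processes by command name; many crond entries hint at a cron pile-up.
count_running_commands() {
    ps -e -o comm= | count_by_command
}

# Print the process tree so each crond can be followed down to its children.
show_process_tree() {
    ps -e -o pid,ppid,stat,etime,args --forest
}
```

Calling count_running_commands first and show_process_tree second roughly mirrors the order of the investigation here: first the unusual process count, then the walk from parent to child.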
Going one level further down, from the script to its child processes, I saw that it was stuck evaluating the following test:
[ `cat /sys/hypervisor/uuid` != "00000000-0000-0000-0000-000000000000" ]
The cat command never finished. The process was in D state, in other words in uninterruptible sleep, which usually means waiting for I/O.
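A quick way to list processes stuck in D state can be sketched as follows, again assuming procps ps. The helper names filter_d_state and list_d_state are hypothetical; the WCHAN column shows the kernel function the process is sleeping in, when the kernel exposes it.

```shell
#!/bin/bash
# Hypothetical helpers (names are illustrative only).

# Keep only the lines whose second column (STAT) starts with D.
filter_d_state() {
    awk '$2 ~ /^D/'
}

# List processes in D state, with the kernel function they sleep in (WCHAN).
list_d_state() {
    ps -e -o pid,stat,wchan:32,args | filter_d_state
}
```

Keep in mind that a process in D state does not react to signals, so kill -9 will not clear it until the blocking kernel call returns.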
Then I checked the mcelog cron script; it looked like this:
if [ -e /proc/xen ] && [ `cat /sys/hypervisor/uuid` != "00000000-0000-0000-0000-000000000000" ]; then
    # this is a PV Xen guest. Do not run mcelog.
    exit 1;
else
    /usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog
fi
So, crond was running mcelog.cron, which is supplied by the mcelog package.
The kernel was a Xen kernel, and Xen-related software packages were running on it.
Note that there was no need for a Xen kernel on this server, but that's another story :)
A brief explanation of Xen: it is a native (bare-metal) hypervisor that allows multiple operating systems to run concurrently on the same computer hardware.
Anyway;
I found a similar bug reported for SUSE rather than Red Hat Linux, but it seems we were hitting something like the following:
patches.xen/499-sysfs-uuid-hang.patch: Avoid kernel hang reading
+ /sys/hypervisor/uuid if xenstore is not available.
The same problem is also reported for Red Hat 5:
Bug 225203 Reading /sys/hypervisor/uuid in Dom0 hangs if XenStoreD isn't running
Description of problem:
Attempting to read from /sys/hypervisor/uuid in Dom0 will hang indefinitely if
XenStored hasn't been launched.
Version-Release number of selected component (if applicable):
uname -r 2.6.18-1.2747.el5xen kernel-xen-2.6.18-1.2747.el
In conclusion:
The mcelog cron script reads /sys/hypervisor/uuid to decide whether it is running in a PV Xen guest. That path relies on XenStore to query the domain's UUID (handle), and the XenStore daemon was not running, so we were hitting the Red Hat 5 bug described above: the read never returned. Each hourly cron run then left more hung processes behind, which explains the growing process count and memory usage.
Possible solutions are:
- Editing the mcelog.cron script: commenting out all of the lines and putting an exit 0 at the end of the file.
- Checking with kill -0 `cat /var/run/xenstore.pid` that the XenStore daemon is running before reading /sys/hypervisor/uuid.
- Changing the check in the script to something like the following (I didn't test it):
if [ -e /proc/xen/capabilities ]; then
    # xen
    grep control_d /proc/xen/capabilities > /dev/null 2>&1
    if [ $? -ne 0 ]; then
        # domU -- do not run on xen PV guest
        exit 1
    fi
fi
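The second option above, checking the XenStore daemon before touching sysfs, could be sketched roughly as below. This is an untested sketch, not a vendor fix. It assumes the daemon writes its PID to /var/run/xenstore.pid, as mentioned above, and that the script runs as root, as cron jobs in cron.hourly do. The function names xenstored_running and read_xen_uuid are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers (names are illustrative only); untested on a real Xen host.

# Succeed only if the pid file is readable and names a live process.
xenstored_running() {
    local pidfile="${1:-/var/run/xenstore.pid}"
    local pid
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# Print the Xen UUID, but refuse to touch sysfs when xenstored looks down.
read_xen_uuid() {
    local pidfile="${1:-/var/run/xenstore.pid}"
    local uuidfile="${2:-/sys/hypervisor/uuid}"
    xenstored_running "$pidfile" || return 1
    cat "$uuidfile"
}
```

The point of the design is that the script never opens /sys/hypervisor/uuid unless the daemon looks alive, because a read that blocks in D state cannot be interrupted afterwards. A stale PID file that happens to match another live process would still fool this check, so treat it as a mitigation rather than a guarantee.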