Why is my EC2 Linux instance becoming unresponsive due to over-utilization of resources?

My Amazon Elastic Compute Cloud (Amazon EC2) Linux instance becomes unresponsive due to over-utilization of resources. How can I prevent this?

Short description

An instance commonly becomes unresponsive for the following reasons:

  • Memory: EC2 instances don't have allocated swap space by default. When an instance runs out of memory, the Linux Out Of Memory (OOM) killer terminates processes, such as a database, web server, or the SSH service.
  • Networking: Without networking, your system can't answer ARP requests from status checks, and your instance fails to communicate with other hosts.
  • I/O operations: Without disk I/O, read and write operations become stuck. Examples include the creation of temporary files and reads from system libraries or databases.
  • CPU: All the preceding tasks require CPU time to work. 100% CPU usage for a prolonged time prevents the kernel from performing normal operating system operations.

These issues can also compound in a snowball effect. For example, the instance runs out of memory and the OOM killer terminates an important process. A second process that relies on the terminated process then starts to consume far more CPU cycles. If that work is disk related, then it might also exhaust the I/O capacity of the Amazon Elastic Block Store (Amazon EBS) volume. The issue might also spread to another instance that expects communication from the unresponsive instance.

Resolution

If your system has high CPU utilization, or often becomes unresponsive due to over-utilization of resources, then do the following:

Gather information

Monitor CPU usage using Amazon CloudWatch

Use a monitoring tool such as Amazon CloudWatch to observe trends and patterns of high resource utilization.
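
As a minimal sketch of how to pull that history from the command line, the following helper wraps the AWS CLI get-metric-statistics command. It assumes that the AWS CLI is installed and configured with permission to read CloudWatch metrics. The function name cpu_history, the 5-minute period, and the instance ID in the example are illustrative choices, not values from this article.

```shell
# cpu_history INSTANCE_ID START END   (hypothetical helper name)
# Prints timestamp, average, and maximum CPUUtilization in 5-minute buckets.
# START and END are ISO 8601 timestamps, for example 2024-01-01T00:00:00Z.
cpu_history() (
  instance_id="$1"
  start_time="$2"
  end_time="$3"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=${instance_id}" \
    --start-time "$start_time" \
    --end-time "$end_time" \
    --period 300 \
    --statistics Average Maximum \
    --query 'sort_by(Datapoints, &Timestamp)[].[Timestamp, Average, Maximum]' \
    --output text
)

# Example (placeholder instance ID):
# cpu_history i-0123456789abcdef0 2024-01-01T00:00:00Z 2024-01-01T06:00:00Z
```

Sorting the data points by timestamp makes it easier to line up a spike with the log entries and scheduled jobs that you review later.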

Use system monitoring tools

If you run multiple services and aren't sure which one is over-utilizing resources, then install a utility such as atop. You can also use tools such as htop, top, and sar. All of these tools help identify the processes that consume the most CPU.
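
If none of those tools is installed yet, a one-off snapshot from ps gives a similar first view. The sketch below is a small filter that keeps the rows with the highest CPU percentage. The function name top_cpu and the column layout are assumptions made for this example.

```shell
# top_cpu [N]   (hypothetical helper name)
# Reads the output of `ps -eo pid,user,%cpu,%mem,comm` on stdin and prints the
# header followed by the N rows with the highest %CPU (column 3). N defaults to 5.
top_cpu() (
  count="${1:-5}"
  IFS= read -r header
  printf '%s\n' "$header"
  # Numeric, descending sort on the third column only.
  LC_ALL=C sort -k3,3nr | head -n "$count"
)

# Example:
# ps -eo pid,user,%cpu,%mem,comm | top_cpu 5
```

Note that ps reports CPU time averaged over the lifetime of each process, so a short burst shows up more clearly in top or atop than in this snapshot.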

Get more information on the process that's using high CPU

Use the pidstat or ps command to get more detailed information about the process. The command output helps you determine whether the process is a system process or a user process.
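
One way to read "system or user" is where the process spends its CPU time: pidstat reports this split in its %usr and %system columns. As a rough sketch that needs no extra packages, the same split can be read from the cumulative counters in /proc. The function name cpu_split and the process ID in the example are hypothetical.

```shell
# cpu_split   (hypothetical helper name)
# Reads one /proc/<pid>/stat line on stdin and prints the cumulative clock
# ticks that the process has spent in user mode (utime) and kernel mode (stime).
cpu_split() {
  # Drop "pid (comm) " first: comm can contain spaces, which would shift fields.
  sed 's/^.*) //' |
    awk '{ printf "user_ticks=%s system_ticks=%s\n", $12, $13 }'
}

# Example (replace 1234 with the process ID that top or atop reported):
# cpu_split < /proc/1234/stat
```

A process with mostly system ticks is spending its time in kernel work such as I/O or networking, while mostly user ticks point at the application code itself.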

Check system logs

Check for errors or warnings that are related to high CPU usage. For example, use the dmesg command to view kernel messages, and view the /var/log/syslog or /var/log/messages files for system messages. The command output and log file contents help identify system or application issues that are causing problems.
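
Because running out of memory is one of the common causes listed earlier, it can help to search these logs for OOM events first. The following filter is a sketch that matches the two kernel messages that usually accompany such an event. The function name oom_events is made up for this example, and the exact message wording can vary between kernel versions.

```shell
# oom_events   (hypothetical helper name)
# Filters kernel log lines on stdin down to those written when the kernel
# terminates a process for lack of memory, such as "invoked oom-killer"
# and "Out of memory: Killed process".
oom_events() {
  grep -iE 'invoked oom-killer|out of memory'
}

# Examples:
# sudo dmesg | oom_events
# sudo cat /var/log/messages | oom_events
```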

Review command history

Review the history of commands to see if there was human error. The command history is usually located in the ~/.bash_history file.

Check for scheduled jobs

Check if there are any scheduled jobs or cron jobs running on the EC2 instance that might cause high CPU usage. First, confirm the timestamp of the high CPU usage. Then, run the following commands to list the cron jobs:

sudo crontab -l
sudo cat /etc/crontab
sudo crontab -l -u <username>

The first command lists the crontab for the root user, and the second command shows the system-wide /etc/crontab file. The third command uses the -u option to list the crontab for a specific user. In the output, look for jobs that are scheduled around the time that you noted the issue. Then, check your logs, including the following:

/var/log/messages
/var/log/syslog 
/var/log/dmesg 
/var/log/cron.log

Use the grep command to filter relevant entries for the specific cron jobs that you want to investigate. Confirm whether errors occurred that are related to one of the identified cron jobs.
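
The filtering step can be sketched as a small helper that keeps only the commands that cron started around a given time. It assumes the traditional syslog line format, where cron entries look like "CROND[1234]: (root) CMD (...)" or "CRON[1234]: (root) CMD (...)". The function name cron_runs_at and the timestamp in the example are hypothetical.

```shell
# cron_runs_at TIMESTAMP_PREFIX   (hypothetical helper name)
# Reads a cron or syslog file on stdin and prints the commands that cron
# started on lines beginning with TIMESTAMP_PREFIX, for example "Mar 12 03:1".
cron_runs_at() {
  awk -v prefix="$1" 'index($0, prefix) == 1' |
    grep -E 'CROND?\[[0-9]+\]:.* CMD '
}

# Example: cron jobs started between 03:10 and 03:19 on March 12
# sudo cat /var/log/syslog | cron_runs_at "Mar 12 03:1"
```

Matching on a timestamp prefix rather than an exact time keeps jobs that started a few minutes before the CPU spike in view.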

Check for memory usage

High memory usage might lead to high CPU usage when the system starts to use swap space. Use the free command to check memory usage. For more information on how to configure and use the necessary tools, see Dissecting the free command on the redhat.com website.
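
The free command reads its figures from /proc/meminfo, so the same check can be scripted. The sketch below reduces that file to two numbers: the share of memory that is still available to applications, and the amount of configured swap. The function name mem_summary and the output labels are assumptions for this example.

```shell
# mem_summary   (hypothetical helper name)
# Reads /proc/meminfo-style text on stdin and prints the percentage of memory
# still available to applications and the amount of configured swap in kB.
mem_summary() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    $1 == "SwapTotal:"    { swap = $2 }
    END {
      if (total > 0) printf "mem_available_pct=%d\n", avail * 100 / total
      printf "swap_total_kb=%d\n", swap
    }'
}

# Example:
# mem_summary < /proc/meminfo
```

A swap total of 0 confirms that the instance has no swap space, in which case memory exhaustion leads to processes being terminated rather than to swapping.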

Check network traffic

High network traffic might cause high CPU utilization, especially if the instance is handling a lot of network requests. Use the iftop command to monitor network traffic, and consider optimizing your network configuration or upgrading your instance type if necessary. For more information on how to configure and use the necessary tools, see Linux interface analytics on-demand with iftop on the redhat.com website.
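
The iftop tool is interactive. For a scriptable first look, the kernel's per-interface counters in /proc/net/dev can be summarized, as in the following sketch. The function name iface_bytes and the output labels are made up for this example.

```shell
# iface_bytes   (hypothetical helper name)
# Reads /proc/net/dev-style text on stdin and prints, for each interface,
# the cumulative bytes received and transmitted since boot.
iface_bytes() {
  # Skip the two header lines, then turn "eth0:" into a separate field.
  awk 'NR > 2 { sub(/:/, " "); printf "%s rx_bytes=%s tx_bytes=%s\n", $1, $2, $10 }'
}

# Example:
# iface_bytes < /proc/net/dev
```

The counters are cumulative, so run the command twice a few seconds apart and compare the values to estimate the current traffic rate.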

Check disk I/O

High disk I/O might cause high CPU usage. Use the iostat command to monitor the disk I/O and identify any processes that might cause high I/O. For more information, see I/O reporting from the Linux command line on the redhat.com website.
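
The iostat command reports per-device figures. To connect those figures to processes, one rough approach is to list the processes in uninterruptible sleep (state D), which usually means that they are waiting on I/O. The sketch below filters ps output for that state. The function name blocked_on_io is hypothetical.

```shell
# blocked_on_io   (hypothetical helper name)
# Reads the output of `ps -eo pid,stat,comm` on stdin and prints the processes
# whose state starts with D (uninterruptible sleep, usually waiting on I/O).
blocked_on_io() {
  awk 'NR > 1 && $2 ~ /^D/'
}

# Example:
# ps -eo pid,stat,comm | blocked_on_io
```

A single snapshot can be empty even on a busy system, so repeat the command a few times while the issue is occurring.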

Act based on the acquired data

Optimize code

If your application is causing high CPU usage, then optimize your code. To do this, identify and eliminate performance bottlenecks. Tools such as perf and strace help identify problematic code.

Upgrade your instance

If your processes use a lot of resources for valid reasons, such as a high number of users, then consider upgrading your instance.

AWS Compute Optimizer can help you decide the appropriate instance type and size to use. You can also consider scaling horizontally using Amazon EC2 Auto Scaling.

Configure Linux audit rules

If you want more visibility over user commands and configuration changes, then you can configure the Linux Audit system to track changes.
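
As an illustration only, the following sketch loads two audit rules with the auditctl command: one that watches the system crontab for changes, and one that records the programs that logged-in users run. It assumes that the audit package is installed and that the function runs as root. The function name add_audit_rules, the key names, and the choice of rules are assumptions, not recommendations from this article.

```shell
# add_audit_rules   (hypothetical helper name; run as root)
# Loads two example rules into the running audit system.
add_audit_rules() {
  # Record writes and attribute changes to the system crontab.
  auditctl -w /etc/crontab -p wa -k cron_changes
  # Record every program that a logged-in user runs (login UID 1000 and above,
  # excluding processes with no login UID).
  auditctl -a always,exit -F arch=b64 -S execve \
    -F 'auid>=1000' -F 'auid!=4294967295' -k user_commands
}

# Afterwards, search the recorded events by key:
# ausearch -k user_commands
```

Rules that are loaded with auditctl last until the next reboot. To keep them, add the same rule lines to a file under /etc/audit/rules.d/.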

Prevent future over-utilization

  1. Before you deploy a new application in production, create a test environment and benchmark the application to determine the compute, memory, Amazon EBS, and network resources that it needs. Deploy according to your benchmarks, while building for fault tolerance. For more information, see the following:
    Design interactions in a distributed system to prevent failures
    Tutorial: Set up a scaled and load-balanced application
  2. Make sure that applications running on the instance are optimized for performance. Optimization involves tweaking configuration files, optimizing database queries, or optimizing code.
  3. If your application is database-heavy, then consider implementing caching to reduce the number of queries to the database.
  4. Make sure that your software is current with the latest security patches and bug fixes. Outdated software might cause performance issues and vulnerabilities leading to high CPU utilization.
  5. If your application receives high traffic volume, then consider using a load balancer to distribute the traffic across multiple EC2 instances. A load balancer reduces the CPU utilization on any one instance.
  6. Continue monitoring your instances, and create alarms for certain resource usage thresholds.
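
The alarm in step 6 can be sketched with the AWS CLI put-metric-alarm command. The example below assumes that the AWS CLI is configured and that an Amazon SNS topic already exists to receive notifications. The function name create_cpu_alarm, the 80% threshold, and the 15-minute evaluation window are illustrative values to adjust to your own benchmarks.

```shell
# create_cpu_alarm INSTANCE_ID SNS_TOPIC_ARN   (hypothetical helper name)
# Creates an alarm that fires when average CPUUtilization stays above 80%
# for three consecutive 5-minute periods.
create_cpu_alarm() (
  instance_id="$1"
  topic_arn="$2"
  aws cloudwatch put-metric-alarm \
    --alarm-name "high-cpu-${instance_id}" \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=${instance_id}" \
    --statistic Average \
    --period 300 \
    --evaluation-periods 3 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "$topic_arn"
)

# Example (placeholder instance ID and topic ARN):
# create_cpu_alarm i-0123456789abcdef0 arn:aws:sns:us-east-1:111122223333:ops-alerts
```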
AWS OFFICIAL Updated 10 months ago