Why is my EC2 Linux instance becoming unresponsive due to over-utilization of resources?

Last updated: 2021-10-25

My Amazon Elastic Compute Cloud (Amazon EC2) Linux instance becomes unresponsive due to over-utilization of resources. How can I prevent this?

Short description

An instance commonly becomes unresponsive for one or more of the following reasons:

Memory: EC2 instances don't have swap space allocated by default. Running out of memory can invoke the Linux Out Of Memory (OOM) manager, commonly known as the OOM killer. The OOM manager terminates processes to free memory, and the terminated process might be a database, web server, or the SSH service.
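To confirm whether the kernel's out-of-memory handling terminated a process, you can search the kernel log for its kill messages. The following is a minimal sketch, not a definitive tool. It assumes the message wording "Out of memory: Killed process" (or "Kill process" on some older kernels), which can vary by kernel version. The function name find_oom_kills and the sample log line are illustrative.

```python
import re

# Assumed wording of the kernel's OOM kill message. Older kernels may log
# "Kill process" and newer ones "Killed process", so both are accepted.
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text):
    """Return a list of (pid, process_name) for each OOM kill in log_text."""
    return [(int(pid), name) for pid, name in OOM_PATTERN.findall(log_text)]


# Illustrative log line, not taken from a real instance.
sample = (
    "[12345.678] Out of memory: Killed process 4321 (mysqld) "
    "total-vm:2048000kB, anon-rss:1024000kB"
)
for pid, name in find_oom_kills(sample):
    print(f"OOM kill: pid={pid} name={name}")
```

In practice, you might pass in the output of the dmesg command or the contents of a kernel log file under /var/log/, depending on your distribution.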

Networking: Without networking, your system can't answer ARP requests from status checks. When this occurs, your instance fails to communicate with other hosts.

Amazon Elastic Block Store (Amazon EBS): Without disk I/O, read and write operations become stuck. Examples include creating temporary files, reading system libraries, and querying databases.

CPU: All the preceding tasks require CPU time to work. 100% CPU usage for a prolonged time prevents the kernel from performing normal operating system operations.
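To check whether CPU usage is close to 100%, you can compare two samples of the aggregate "cpu" line in /proc/stat. The following is a rough sketch that assumes a modern Linux kernel, where the first eight fields on that line are user, nice, system, idle, iowait, irq, softirq, and steal. The function names are illustrative.

```python
import time


def parse_cpu_line(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(value) for value in line.split()[1:]]
    raise ValueError("no aggregate cpu line found")


def cpu_busy_percent(before, after):
    """Percent of CPU time spent non-idle between two counter samples."""
    # Use only the first 8 fields; guest time is already counted in user time.
    b, a = before[:8], after[:8]
    total = sum(a) - sum(b)
    idle = (a[3] + a[4]) - (b[3] + b[4])  # idle + iowait
    if total <= 0:
        return 0.0
    return 100.0 * (total - idle) / total


def measure_cpu_busy(interval=1.0, path="/proc/stat"):
    """Sample the counters twice, `interval` seconds apart (Linux only)."""
    with open(path) as fh:
        before = parse_cpu_line(fh.read())
    time.sleep(interval)
    with open(path) as fh:
        after = parse_cpu_line(fh.read())
    return cpu_busy_percent(before, after)
```

On the instance, calling measure_cpu_busy() repeatedly over several minutes gives a rough view of whether high usage is sustained or only a short spike.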

These issues can also compound into a snowball effect. For example, the instance runs out of memory and the OOM manager terminates an important process. A second process that depends on the terminated process then starts consuming many more CPU cycles. If that second process is disk intensive, then it can also exhaust the I/O capacity of the EBS volume. The issue can also spread to a different instance that expects communication from the unresponsive instance.

Resolution

If your system often becomes unresponsive due to over-utilization of resources, do the following:

Gather information

  1. Use a monitoring tool such as Amazon CloudWatch to observe trends and patterns of high resource utilization.
  2. If you have multiple services and aren't sure which one is over-utilizing resources, then install a utility such as atop.
  3. Review your application and operating system logs. These logs are usually located in /var/log/.
  4. Review the history of commands to see if there was human error. The command history is usually located in the ~/.bash_history file.
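  5. Review cron jobs by running the crontab -l command. This command lists only the current user's jobs. System-wide jobs are usually defined in /etc/crontab and /etc/cron.d/.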
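If a tool such as atop isn't installed yet, the information-gathering steps above can be approximated with a short script that ranks processes by resident memory. This is a minimal sketch that assumes a Linux /proc file system and reads the Name and VmRSS fields from /proc/<pid>/status. The function names are illustrative, and kernel threads (which have no VmRSS field) are reported as 0 kB.

```python
import os


def parse_status(status_text):
    """Return (process_name, resident_memory_kb) from /proc/<pid>/status text."""
    name, rss_kb = None, 0
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == "Name":
            name = value.strip()
        elif key == "VmRSS":
            rss_kb = int(value.split()[0])  # value looks like "  2048 kB"
    return name, rss_kb


def top_memory_processes(proc_root="/proc", limit=5):
    """Return up to `limit` (rss_kb, pid, name) tuples, largest first."""
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                name, rss_kb = parse_status(fh.read())
        except OSError:
            continue  # the process exited or isn't readable
        results.append((rss_kb, int(entry), name))
    return sorted(results, reverse=True)[:limit]
```

Running top_memory_processes() on the instance during a period of high memory usage can help narrow down which service to investigate in the logs.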

Act based on the acquired data

Prevent future over-utilization

  1. Before you deploy a new application in production, create a test environment and benchmark the application to determine the compute, memory, EBS, and network resources that it requires.
  2. Deploy according to your benchmarks, while building for fault tolerance. For more information, see the following:
    Design interactions in a distributed system to prevent failures
    Tutorial: Set up a scaled and load-balanced application
  3. Continue monitoring your instances, and create alarms for certain resource usage thresholds.
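As an example of a resource usage alarm, the following sketch builds the parameters for a CloudWatch alarm on the CPUUtilization metric in the AWS/EC2 namespace. The helper name cpu_alarm_params, the alarm name format, the default threshold of 80%, and the instance ID are assumptions for illustration; choose values based on your own benchmarks. The dictionary keys match parameters of the CloudWatch PutMetricAlarm API.

```python
def cpu_alarm_params(instance_id, threshold_percent=80.0,
                     period_seconds=300, evaluation_periods=3):
    """Build keyword arguments for a CloudWatch CPU utilization alarm."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }


# Placeholder instance ID, not a real resource.
params = cpu_alarm_params("i-0123456789abcdef0")
print(params["AlarmName"])

# With the AWS SDK for Python (boto3) installed and credentials configured,
# the alarm could then be created like this:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

With the defaults shown, the alarm enters the ALARM state when average CPU utilization stays above 80% for three consecutive 5-minute periods. No notification is sent unless you also add an action, such as an Amazon SNS topic in the AlarmActions parameter.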
