Why is my EC2 Linux instance becoming unresponsive due to over-utilization of resources?
Last updated: 2021-10-25
My Amazon Elastic Compute Cloud (Amazon EC2) Linux instance becomes unresponsive due to over-utilization of resources. How can I prevent this?
Short description
There are several common causes for why an instance becomes unresponsive:
Memory: EC2 instances don't have swap space allocated by default. Running out of memory can invoke the Linux Out Of Memory (OOM) killer, which terminates processes such as a database, a web server, or the SSH service.
Networking: Without networking, your system can't answer ARP requests from status checks. When this occurs, your instance fails to communicate with other hosts.
Amazon Elastic Block Store (Amazon EBS): With no disk I/O, read and write operations become stuck. Examples include the creation of temporary files and reads from system libraries or databases.
CPU: All the preceding tasks require CPU time to work. 100% CPU usage for a prolonged time prevents the kernel from performing normal operating system operations.
These issues can also compound into a snowball effect. For example, you run out of memory and the OOM killer terminates an important process. A second process that relies on the terminated process then starts consuming far more CPU cycles. If that process is disk-intensive, then the cycle can also exhaust the I/O capacity of the EBS volume. The issue might also spread to a different instance that expects communication from the unresponsive instance.
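As a quick check on the memory point above, the following is a minimal Python sketch that reads `/proc/meminfo` and reports whether swap is configured and how much memory is still available. The function names (`parse_meminfo`, `memory_summary`, `read_local_summary`) are hypothetical helpers written for this illustration, and the sketch assumes a Linux kernel recent enough to expose the `MemAvailable` field.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_summary(info):
    """Summarize available memory and swap from a parsed meminfo dict."""
    total = info.get("MemTotal", 0)
    available = info.get("MemAvailable", 0)
    swap_total = info.get("SwapTotal", 0)
    available_pct = 100.0 * available / total if total else 0.0
    return {"available_pct": round(available_pct, 1), "has_swap": swap_total > 0}


def read_local_summary(path="/proc/meminfo"):
    """Hypothetical convenience wrapper: summarize the instance this runs on."""
    with open(path) as handle:
        return memory_summary(parse_meminfo(handle.read()))
```

Calling `read_local_summary()` on the instance returns a small dictionary. A result with `has_swap` set to `False` and a low `available_pct` suggests that the OOM killer could be triggered under load.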
Resolution
If your system often becomes unresponsive due to over-utilization of resources, do the following:
Gather information
- Use a monitoring tool such as Amazon CloudWatch to observe trends and patterns of high resource utilization.
- If you have multiple services and aren't sure which one is over-utilizing resources, then install a utility such as atop.
- Review your application and operating system logs. These logs are usually located in /var/log/.
- Review the history of commands to see if there was human error. The command history is usually located in the ~/.bash_history file.
- Review cron jobs by running the crontab -l command, which lists the jobs for the current user.
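When reviewing logs, one thing to look for is evidence that the OOM killer terminated a process. The following Python sketch scans log lines for such messages. The pattern is an assumption based on common kernel message wording ("Out of memory: Kill process ..." or "Out of memory: Killed process ..."), which can vary between kernel versions. The log file that holds kernel messages also depends on the distribution, so no path is hard-coded here, and `find_oom_kills` is a hypothetical helper name.

```python
import re

# Assumed message format; adjust the pattern if your kernel words it differently.
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(lines):
    """Return a list of (pid, process_name) tuples for OOM kills found in log lines."""
    kills = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

To use it, open the kernel log file for your distribution under /var/log/ and pass the file object to `find_oom_kills`. Repeated kills of the same process name are a strong hint about which service is exhausting memory.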
Act based on the acquired data
- You might find that your application requires configuration or code changes to optimize resource utilization.
- If your processes are using a lot of resources for valid reasons, such as a high volume of users, consider upgrading your instance. AWS Compute Optimizer is a useful source of recommended instance sizes. You can also consider scaling horizontally using Amazon EC2 Auto Scaling.
- If you want more visibility into user commands and configuration changes, you can install `audit` to track changes.
Prevent future over-utilization
- Before deploying a new application in production, create a test environment and benchmark the application to determine the compute, memory, EBS, and network resources that it needs.
- Deploy according to your benchmarks, while building for fault tolerance. For more information, see the following:
Design interactions in a distributed system to prevent failures
Tutorial: Set up a scaled and load-balanced application
- Continue monitoring your instances, and create alarms for certain resource usage thresholds.
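As one possible way to create such an alarm, the following Python sketch builds the parameters for a CloudWatch CPU alarm and passes them to the boto3 `put_metric_alarm` call. It assumes that boto3 is installed and that AWS credentials and a Region are configured. The alarm name, the 80% threshold, the 5-minute period, and the helper names `cpu_alarm_params` and `create_cpu_alarm` are illustrative choices, not recommendations from this article.

```python
def cpu_alarm_params(instance_id, threshold_pct=80.0, period_s=300, periods=3):
    """Build keyword arguments for CloudWatch put_metric_alarm (illustrative values)."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period_s,            # seconds per datapoint
        "EvaluationPeriods": periods,  # number of periods evaluated against the threshold
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def create_cpu_alarm(instance_id, **overrides):
    """Create the alarm; requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the parameter builder works without boto3

    client = boto3.client("cloudwatch")
    client.put_metric_alarm(**cpu_alarm_params(instance_id, **overrides))
```

For example, `create_cpu_alarm("i-0123456789abcdef0", threshold_pct=90.0)` would create an alarm for a placeholder instance ID. The sketch covers CPU only because `CPUUtilization` is published in the `AWS/EC2` namespace by default. Memory and disk-space metrics generally require the CloudWatch agent, and to be notified you would also pass alarm actions, which are omitted here.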