My EC2 Linux instance failed the instance status check due to over-utilization of its resources. How do I troubleshoot this?
Last updated: 2021-09-22
My Amazon Elastic Compute Cloud (Amazon EC2) Linux instance failed its instance status check due to over-utilization of its resources. How do I troubleshoot this?
There are several reasons your instance might fail its instance status check due to over-utilization of resources. The following are three of the most common:
- The CPU utilization of your instance reached close to 100% and the instance didn’t have enough compute capacity left for the kernel to run.
- The root device is 100% full and is preventing other processes from completing or beginning.
- The processes running on the instance used all its memory, preventing the kernel from running.
Check the Amazon CloudWatch CPU utilization metrics
View the instance's CloudWatch metrics. If the CPU utilization is at or near 100%, use the following options to troubleshoot:
- Reboot your instance to return it to a healthy status.
Note: If your instance CPU requirements are higher than what your current instance type can offer, then the problem will occur again after a reboot.
- Consider changing your instance to an instance type that meets your CPU requirements.
- If your instance is a burstable performance instance (T2, T3, or T3a), then check its CPUCreditBalance by listing the instance's metrics. You can list metrics using the CloudWatch console or using the AWS Command Line Interface (AWS CLI).
If the credit balance is close to zero, then the instance CPU might be throttled. If the credit specification is set to standard on the instance, you can change the specification to unlimited. For information on changing the credit specification, see Modify the credit specification of a burstable performance instance.
Note: If you receive errors when running AWS CLI commands, make sure that you’re using the most recent version of the AWS CLI.
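As an illustration, the CPUCreditBalance metric can be retrieved with the AWS CLI, and the credit specification switched to unlimited. This is a sketch; the instance ID and Region below are placeholders, and the `date -d` syntax assumes GNU date (as on Amazon Linux).

```shell
# Sketch: query the CPUCreditBalance metric for a burstable instance over the
# last hour. The instance ID and Region are placeholders.
aws cloudwatch get-metric-statistics \
    --region us-east-1 \
    --namespace AWS/EC2 \
    --metric-name CPUCreditBalance \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 \
    --statistics Average

# If the balance is near zero and the credit specification is standard,
# switch it to unlimited:
aws ec2 modify-instance-credit-specification \
    --region us-east-1 \
    --instance-credit-specifications \
        "InstanceId=i-0123456789abcdef0,CpuCredits=unlimited"
```

An Average near zero over the sampled periods indicates the instance has been throttled to its baseline CPU performance.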
Check the instance's system log for errors
Check the system log for "No space left on device" or "Out of memory" errors.
No space left on device error
If an error similar to "OSError: [Errno 28] No space left on device: '/var/lib/'" appears in the instance's system log, then the file system containing the listed folder is full. (In this example, /var/lib is the folder.)
You can free space on the file system using one of the following methods:
Method 1: Use the EC2 Serial Console
If you activated EC2 Serial Console for Linux, then you can use it to troubleshoot supported Nitro-based instance types. The serial console helps you troubleshoot boot, network configuration, and SSH configuration issues. The serial console connects to your instance without the need for a working network connection. You can access the serial console using the Amazon EC2 console or the AWS CLI.
Before using the serial console, grant access to it at the account level. Then, create AWS Identity and Access Management (IAM) policies granting access to your IAM users. Also, every instance using the serial console must include at least one password-based user. If your instance is unreachable and you haven’t configured access to the serial console, then follow the instructions in Method 2: Use a rescue instance. For information on configuring the EC2 Serial Console for Linux, see Configure access to the EC2 Serial Console.
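Once access is configured, a serial console session can be opened from the AWS CLI by pushing a temporary SSH public key and then connecting over SSH. The following is a sketch; the instance ID, Region, and key path are placeholders.

```shell
# Sketch: connect to the EC2 Serial Console via the AWS CLI.
# The instance ID, Region, and key file below are placeholders.

# Push your SSH public key to the instance's serial console (valid for 60 seconds).
aws ec2-instance-connect send-serial-console-ssh-public-key \
    --region us-east-1 \
    --instance-id i-0123456789abcdef0 \
    --serial-port 0 \
    --ssh-public-key file://~/.ssh/id_rsa.pub

# Connect to serial port 0 through the Region's serial console endpoint.
ssh -i ~/.ssh/id_rsa \
    i-0123456789abcdef0.port0@serial-console.ec2-instance-connect.us-east-1.aws
```

The pushed key expires after 60 seconds, so run the `ssh` command promptly after the `send-serial-console-ssh-public-key` call.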
Method 2: Use a rescue instance
Warning: Before stopping and starting your instance, be sure you understand the following:
- If your instance is instance store-backed or has instance store volumes containing data, then the data is lost when you stop the instance. For more information, see Determine the root device type of your instance.
- If your instance is part of an Amazon EC2 Auto Scaling group, then stopping the instance might terminate it. Instances launched with Amazon EMR, AWS CloudFormation, or AWS Elastic Beanstalk might be part of an Auto Scaling group. Instance termination in this scenario depends on the instance scale-in protection settings for your Auto Scaling group. If your instance is part of an Auto Scaling group, then temporarily remove the instance from the Auto Scaling group before starting the resolution steps.
- Stopping and starting the instance changes the public IP address of your instance. It's a best practice to use an Elastic IP address instead of a public IP address when routing external traffic to your instance. If you're using Amazon Route 53, you might have to update the Route 53 DNS records when the public IP changes.
1. Launch a new EC2 instance in your virtual private cloud (VPC) using the same Amazon Machine Image (AMI) and in the same Availability Zone as the impaired instance. The new instance becomes your rescue instance.
Or, you can use an existing instance that you can access, if it uses the same AMI and is in the same Availability Zone as your impaired instance.
2. Stop the impaired instance.
3. Detach the Amazon Elastic Block Store (Amazon EBS) root volume (/dev/xvda or /dev/sda1) from your impaired instance. Note the device name (/dev/xvda or /dev/sda1) of your root volume.
4. Attach the EBS volume as a secondary device (/dev/sdf) to the rescue instance.
5. Connect to the rescue instance using SSH.
6. Create a mount point directory (/rescue) for the new volume attached to the rescue instance.
$ sudo mkdir /rescue
7. Mount the volume at the directory that you created in step 6.
$ sudo mount /dev/xvdf1 /rescue
Note: The device (/dev/xvdf1) might be attached to the rescue instance with a different device name. Use the lsblk command to view your available disk devices, along with their mount points, to determine the correct device names.
8. Run the du -h command to check which files are taking up the most space.
$ sudo du -h /rescue/var/lib
9. Delete files as needed to free additional space.
10. Run the umount command to unmount the secondary device from your rescue instance.
$ sudo umount /rescue
If the unmount operation isn't successful, you might have to stop or reboot the rescue instance to get a clean unmount.
11. Detach the secondary volume from the rescue instance, and then attach it to the original instance using the root device name that you noted in step 3.
12. Start the original instance, and then verify that the instance is responsive.
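The volume-moving steps above can also be sketched with the AWS CLI. All instance IDs, the volume ID, the Region, and the device names below are placeholders; substitute your own values, and confirm the actual device name on the rescue instance with lsblk before mounting.

```shell
# Sketch: move the impaired instance's root volume to a rescue instance,
# free space, and move it back. All IDs and the Region are placeholders.
IMPAIRED=i-0123456789abcdef0
RESCUE=i-0fedcba9876543210
VOL=vol-0123456789abcdef0
REGION=us-east-1

# Stop the impaired instance, then move its root volume to the rescue instance.
aws ec2 stop-instances        --region "$REGION" --instance-ids "$IMPAIRED"
aws ec2 wait instance-stopped --region "$REGION" --instance-ids "$IMPAIRED"
aws ec2 detach-volume         --region "$REGION" --volume-id "$VOL"
aws ec2 attach-volume         --region "$REGION" --volume-id "$VOL" \
    --instance-id "$RESCUE" --device /dev/sdf

# On the rescue instance: mount the volume and find what's using the space.
# lsblk shows the device name the volume actually received (e.g., /dev/xvdf1).
lsblk
sudo mkdir /rescue
sudo mount /dev/xvdf1 /rescue
sudo du -ah /rescue/var/lib | sort -rh | head -n 20   # largest entries first
# ...delete files as needed, then unmount.
sudo umount /rescue

# Return the volume to the impaired instance and start it again.
aws ec2 detach-volume   --region "$REGION" --volume-id "$VOL"
aws ec2 attach-volume   --region "$REGION" --volume-id "$VOL" \
    --instance-id "$IMPAIRED" --device /dev/xvda
aws ec2 start-instances --region "$REGION" --instance-ids "$IMPAIRED"
```

The `du -ah ... | sort -rh | head` pipeline sorts entries by human-readable size, largest first, which quickly surfaces the files worth deleting.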
Out of memory error
If the error "Out of memory: kill process" appears in the instance's system log, then the instance's memory is exhausted. When this happens, the kernel doesn't have enough memory to run, and it terminates other processes to free memory.
For more information on how to resolve out of memory (OOM) issues, see Out of memory: kill process.
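Once the instance is reachable again, OOM-killer activity can be confirmed from the kernel log, and current memory headroom checked, with standard Linux tools:

```shell
# Sketch: look for OOM-killer entries in the kernel log, then check how much
# memory is currently free.
sudo dmesg | grep -iE "out of memory|oom-killer"
free -m
```

If the kernel log shows repeated OOM kills, consider moving to an instance type with more memory or reducing the memory footprint of the processes involved.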