My EC2 Linux instance failed the instance status check due to over-utilization of its resources. How do I troubleshoot this?

Last updated: 2020-05-29

My Amazon Elastic Compute Cloud (Amazon EC2) Linux instance failed its instance status check due to over-utilization of its resources. How do I troubleshoot this?

Short Description

There are several reasons your instance might fail the instance status check due to over-utilization of its resources. The following are three of the most common:

  • The CPU utilization of your instance reached close to 100% and the instance didn’t have enough compute capacity left for the kernel to run.
  • The root device is 100% full and the instance became stuck while booting.
  • The processes running on the instance used all its memory, preventing the kernel from running.

Resolution

Check the Amazon CloudWatch CPU utilization metrics

View the instance's CloudWatch metrics. If the CPU utilization is at or near 100%, use the following options to troubleshoot:

  • Reboot your instance to return it to a healthy status.
    Note: If your instance CPU requirements are higher than what your current instance type can offer, the problem will occur again after a reboot.
  • Consider changing your instance to an instance type that meets your CPU requirements.
  • If your instance is a burstable performance instance (T2, T3, or T3a), check its CPUCreditBalance by listing the instance's metrics. You can list metrics using the CloudWatch console or the AWS Command Line Interface (AWS CLI), as shown in the example command after this list.
    If the credit balance is close to zero, the instance CPU might be throttled. If the credit specification is set to standard on the instance, you can change the specification to unlimited. For information on changing the credit specification, see Modifying the credit specification of a burstable performance instance.
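
For example, you can retrieve the CPUCreditBalance metric with the AWS CLI using a command similar to the following. The instance ID and time range shown here are placeholders; substitute your own values.

$ aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUCreditBalance \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --statistics Average \
    --period 300 \
    --start-time 2020-05-29T00:00:00Z \
    --end-time 2020-05-29T12:00:00Z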

Check the instance's system log for errors

Check the system log for "No space left on device" or "Out of memory" errors.
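
If you can't connect to the instance, one way to check the system log is to retrieve the console output with the AWS CLI and search it for these strings. The instance ID below is a placeholder.

$ aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text | grep -iE "no space left on device|out of memory"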

No space left on device error

If an error similar to, "OSError: [Errno 28] No space left on device '/var/lib/'" appears in the instance's system log, the file system containing the listed folder (/var/lib, in this example) is full.

You can free space on the file system using a rescue instance.

Warning: Before stopping and starting your instance, be sure you understand the following:

  • Instance store data is lost when you stop and start an instance. If your instance is instance store-backed or has instance store volumes containing data, the data is lost when you stop the instance. For more information, see Determining the root device type of your instance.
  • If your instance is part of an Amazon EC2 Auto Scaling group, stopping the instance might terminate it. If you launched the instance with Amazon EMR, AWS CloudFormation, or AWS Elastic Beanstalk, your instance might be part of an Auto Scaling group. Whether the instance is terminated in this scenario depends on the scale-in protection settings for your Auto Scaling group. If your instance is part of an Auto Scaling group, then temporarily remove the instance from the Auto Scaling group before starting the resolution steps.
  • Stopping and starting the instance changes the public IP address of your instance. It's a best practice to use an Elastic IP address instead of a public IP address when routing external traffic to your instance. If you are using Route 53, you might have to update the Route 53 DNS records when the public IP changes.
  • If the shutdown behavior of the instance is set to Terminate, the instance is terminated when stopped. You can change the instance shutdown behavior to avoid this.

1.    Launch a new EC2 instance in your virtual private cloud (VPC) using the same Amazon Machine Image (AMI) and in the same Availability Zone as the impaired instance. The new instance becomes your rescue instance.

Or, you can use an existing instance that you can access, if it uses the same AMI and is in the same Availability Zone as your impaired instance.
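
If you choose to launch a new rescue instance with the AWS CLI, the command might look similar to the following. The AMI ID, instance type, subnet ID, and key pair name are placeholders; use the same AMI as the impaired instance and a subnet in the same Availability Zone.

$ aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type t3.micro \
    --subnet-id subnet-0123456789abcdef0 \
    --key-name my-key-pair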

2.    Stop the impaired instance.
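
For example, with the AWS CLI (the instance ID is a placeholder):

$ aws ec2 stop-instances --instance-ids i-0123456789abcdef0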

3.    Detach the Amazon Elastic Block Store (Amazon EBS) root volume (/dev/xvda or /dev/sda1) from your impaired instance. Note the device name (/dev/xvda or /dev/sda1) of your root volume.
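
You can detach the volume in the Amazon EC2 console or with the AWS CLI using a command similar to the following (the volume ID is a placeholder):

$ aws ec2 detach-volume --volume-id vol-0123456789abcdef0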

4.    Attach the EBS volume as a secondary device (/dev/sdf) to the rescue instance.
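
For example, with the AWS CLI (the volume ID and the rescue instance ID are placeholders):

$ aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0fedcba9876543210 --device /dev/sdf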

5.    Connect to your rescue instance using SSH.

6.    Create a mount point directory (/rescue) for the new volume attached to the rescue instance.

$ sudo mkdir /rescue

7.    Mount the volume at the directory that you created in step 6.

$ sudo mount /dev/xvdf1 /rescue

Note: The device (/dev/xvdf1) might be attached to the rescue instance with a different device name. Use the lsblk command to view your available disk devices, along with their mount points, to determine the correct device names.
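
For example, lsblk output on the rescue instance might look similar to the following. The device names and sizes shown here are illustrative and will differ on your instance.

$ lsblk
NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvda    202:0    0   8G  0 disk
└─xvda1 202:1    0   8G  0 part /
xvdf    202:80   0   8G  0 disk
└─xvdf1 202:81   0   8G  0 part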

8.    Run the du -h command to check which files are taking up the most space.

$ sudo du -h /rescue/var/lib
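
If it isn't clear which directories are using the space, you can sort the per-directory totals. This example assumes GNU coreutils, which is included in Amazon Linux.

$ sudo du -h --max-depth=1 /rescue/var/lib | sort -h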

9.    Delete files as needed to free additional space.

10.    Run the umount command to unmount the secondary device from your rescue instance.

$ sudo umount /rescue

If the unmount operation isn't successful, you might have to stop or reboot the rescue instance to enable a clean unmount.

11.    Detach the secondary volume (/dev/sdf) from the rescue instance. Then, attach it to the original instance as the root volume, using the device name that you noted in step 3 (/dev/xvda or /dev/sda1).
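
For example, with the AWS CLI (the volume ID and the impaired instance ID are placeholders, and the device name should match the one you noted in step 3):

$ aws ec2 detach-volume --volume-id vol-0123456789abcdef0
$ aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0123456789abcdef0 --device /dev/xvda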

12.    Start the instance, and then verify that the instance is responsive.
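
For example, you can start the instance and then check its status checks with the AWS CLI (the instance ID is a placeholder):

$ aws ec2 start-instances --instance-ids i-0123456789abcdef0
$ aws ec2 describe-instance-status --instance-ids i-0123456789abcdef0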

Out of memory error

If the error "Out of memory: kill process" appears in the instance's system log, the instance's memory is exhausted. When the memory is exhausted, the kernel doesn’t have enough memory to run and other processes are killed to free memory.

For more information on how to resolve out of memory (OOM) issues, see Out of memory: kill process.