My EC2 Linux instance failed the instance status check due to over-utilization of its resources. How do I troubleshoot this?

Last updated: 2020-05-29

My Amazon Elastic Compute Cloud (Amazon EC2) Linux instance failed its instance status check due to over-utilization of its resources. How do I troubleshoot this?

Short Description

There are several reasons your instance might fail the instance status check due to over-utilization of its resources. The following are three of the most common:

  • The CPU utilization of your instance reached close to 100% and the instance didn’t have enough compute capacity left for the kernel to run.
  • The root device is 100% full and the instance became stuck while booting.
  • The processes running on the instance used all its memory, preventing the kernel from running.

Resolution

Check the Amazon CloudWatch CPU utilization metrics

View the instance's CloudWatch metrics. If the CPU utilization is at or near 100%, use the following options to troubleshoot:

  • Reboot your instance to return it to a healthy status.
    Note: If your instance CPU requirements are higher than what your current instance type can offer, the problem will occur again after a reboot.
  • Consider changing your instance to an instance type that meets your CPU requirements.
  • If your instance is a burstable performance instance (T2, T3, or T3a), check its CPUCreditBalance by listing the instance's metrics. You can list metrics using the CloudWatch console or the AWS Command Line Interface (AWS CLI), as shown in the example command after this list.
    If the credit balance is close to zero, the instance CPU might be throttled. If the credit specification is set to standard on the instance, you can change the specification to unlimited. For information on changing the credit specification, see Modifying the credit specification of a burstable performance instance.
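
For example, you can retrieve the CPUCreditBalance metric with the AWS CLI using a command similar to the following. The instance ID and time range shown here are placeholders; substitute your own values.

$ aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUCreditBalance \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --statistics Average \
    --period 300 \
    --start-time 2020-05-29T00:00:00Z \
    --end-time 2020-05-29T12:00:00Z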

Check the instance's system log for errors

Check the system log for "No space left on device" or "Out of memory" errors.
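
If you can't connect to the instance, one way to check the system log is to retrieve the console output with the AWS CLI and search it for these strings. The instance ID below is a placeholder.

$ aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text | grep -iE "no space left on device|out of memory"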

No space left on device error

If an error similar to, "OSError: [Errno 28] No space left on device '/var/lib/'" appears in the instance's system log, the file system containing the listed folder (/var/lib, in this example) is full.

You can free space on the file system using a rescue instance.

Warning: Before stopping and starting your instance, be sure you understand the following:

  • Instance store data is lost when you stop and start an instance. If your instance is instance store-backed or has instance store volumes containing data, the data is lost when you stop the instance. For more information, see Determining the root device type of your instance.
  • If your instance is part of an Amazon EC2 Auto Scaling group, stopping the instance might terminate it. If you launched the instance with Amazon EMR, AWS CloudFormation, or AWS Elastic Beanstalk, your instance might be part of an Auto Scaling group. Whether the instance is terminated in this scenario depends on the scale-in protection settings for your Auto Scaling group. If your instance is part of an Auto Scaling group, then temporarily remove the instance from the Auto Scaling group before starting the resolution steps.
  • Stopping and starting the instance changes the public IP address of your instance. It's a best practice to use an Elastic IP address instead of a public IP address when routing external traffic to your instance. If you are using Route 53, you might have to update the Route 53 DNS records when the public IP changes.
  • If the shutdown behavior of the instance is set to Terminate, the instance is terminated when stopped. You can change the instance shutdown behavior to avoid this.

1.    Launch a new EC2 instance in your virtual private cloud (VPC) using the same Amazon Machine Image (AMI) and in the same Availability Zone as the impaired instance. The new instance becomes your rescue instance.

Or, you can use an existing instance that you can access, if it uses the same AMI and is in the same Availability Zone as your impaired instance.
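
If you choose to launch a new rescue instance with the AWS CLI, the command might look similar to the following. The AMI ID, instance type, subnet ID, and key pair name are placeholders; use the same AMI as the impaired instance and a subnet in the same Availability Zone.

$ aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type t3.micro \
    --subnet-id subnet-0123456789abcdef0 \
    --key-name my-key-pair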

2.    Stop the impaired instance.
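
For example, with the AWS CLI (the instance ID is a placeholder):

$ aws ec2 stop-instances --instance-ids i-0123456789abcdef0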

3.    Detach the Amazon Elastic Block Store (Amazon EBS) root volume (/dev/xvda or /dev/sda1) from your impaired instance. Note the device name (/dev/xvda or /dev/sda1) of your root volume.
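
You can detach the volume in the Amazon EC2 console or with the AWS CLI using a command similar to the following (the volume ID is a placeholder):

$ aws ec2 detach-volume --volume-id vol-0123456789abcdef0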

4.    Attach the EBS volume as a secondary device (/dev/sdf) to the rescue instance.
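
For example, with the AWS CLI (the volume ID and the rescue instance ID are placeholders):

$ aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0fedcba9876543210 --device /dev/sdf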

5.    Connect to your rescue instance using SSH.

6.    Create a mount point directory (/rescue) for the new volume attached to the rescue instance.

$ sudo mkdir /rescue

7.    Mount the volume at the directory that you created in step 6.

$ sudo mount /dev/xvdf1 /rescue

Note: The device (/dev/xvdf1) might be attached to the rescue instance with a different device name. Use the lsblk command to view your available disk devices, along with their mount points, to determine the correct device names.
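
For example, lsblk output on the rescue instance might look similar to the following. The device names and sizes shown here are illustrative and will differ on your instance.

$ lsblk
NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvda    202:0    0   8G  0 disk
└─xvda1 202:1    0   8G  0 part /
xvdf    202:80   0   8G  0 disk
└─xvdf1 202:81   0   8G  0 part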

8.    Run the du -h command to check which files are taking up the most space.

$ sudo du -h /rescue/var/lib
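
If it isn't clear which directories are using the space, you can sort the per-directory totals. This example assumes GNU coreutils, which is included in Amazon Linux.

$ sudo du -h --max-depth=1 /rescue/var/lib | sort -h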

9.    Delete files as needed to free additional space.

10.    Run the umount command to unmount the secondary device from your rescue instance.

$ sudo umount /rescue

If the unmount operation isn't successful, you might have to stop or reboot the rescue instance to enable a clean unmount.

11.    Detach the secondary volume (/dev/sdf) from the rescue instance. Then, attach it to the original instance as the root volume, using the device name that you noted in step 3 (/dev/xvda or /dev/sda1).
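
For example, with the AWS CLI (the volume ID and the impaired instance ID are placeholders, and the device name should match the one you noted in step 3):

$ aws ec2 detach-volume --volume-id vol-0123456789abcdef0
$ aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0123456789abcdef0 --device /dev/xvda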

12.    Start the instance, and then verify that the instance is responsive.
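
For example, you can start the instance and then check its status checks with the AWS CLI (the instance ID is a placeholder):

$ aws ec2 start-instances --instance-ids i-0123456789abcdef0
$ aws ec2 describe-instance-status --instance-ids i-0123456789abcdef0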

Out of memory error

If the error "Out of memory: kill process" appears in the instance's system log, the instance's memory is exhausted. When the memory is exhausted, the kernel doesn’t have enough memory to run and other processes are killed to free memory.

For more information on how to resolve out of memory (OOM) issues, see Out of memory: kill process.