Why is my EC2 Linux instance unreachable and failing one or both of its status checks?
Last updated: 2022-09-07
My Amazon Elastic Compute Cloud (Amazon EC2) Linux instance is unreachable and is failing one or both of its status checks. How do I troubleshoot status check failure?
Amazon EC2 monitors the health of each EC2 instance with two status checks:
System status check
The system status check detects issues with the underlying host that your instance runs on. If the underlying host is unresponsive or unreachable due to network, hardware or software issues, then this status check fails.
Instance status check
An instance status check failure indicates an issue with the reachability of the instance. This issue occurs because of operating system-level errors such as the following:
- Failure to boot the operating system
- Failure to mount the volumes correctly
- Exhausted CPU and memory
- Kernel panic
- Network failed to come up
Warning: Some of these resolutions require an instance stop and start. Before stopping and starting your instance, note these conditions:
- Instance store data is lost when you stop an instance. If your instance is instance store-backed or has instance store volumes containing data, then the data is lost when you stop the instance. For more information, see Determine the root device type of your instance.
- If your instance is part of an Amazon EC2 Auto Scaling group, then stopping the instance might terminate the instance. If you launched the instance with Amazon EMR, AWS CloudFormation, or AWS Elastic Beanstalk, then your instance might be part of an AWS Auto Scaling group. Instance termination in this scenario depends on the instance scale-in protection settings for your Auto Scaling group. If your instance is part of an Auto Scaling group, then temporarily remove the instance from the Auto Scaling group before starting the resolution steps.
- Stopping and starting an instance releases the public IP address back into the AWS dynamic IP pool. It's a best practice to use an Elastic IP address instead of a public IP address when routing external traffic to your instance. If you're using Amazon Route 53, you might be required to update the Route 53 DNS records when the public IP changes.
For more information, see Stop and start your instance.
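Before you stop an instance, you can check two of the conditions above from the command line. The following is a minimal sketch, assuming that the AWS CLI is installed and configured with permission to call the EC2 and Auto Scaling APIs. The function name stop_start_preflight and the instance ID in the usage comment are hypothetical. The sketch reports only the root device type, so it doesn't detect additional instance store volumes attached to an EBS-backed instance.

```shell
# stop_start_preflight INSTANCE_ID
# Prints the root device type ("ebs" or "instance-store") and the name of the
# Auto Scaling group that the instance belongs to, if any.
# Hypothetical helper; the instance ID below is a placeholder.
#   stop_start_preflight i-0123456789abcdef0
stop_start_preflight() {
  instance_id=$1

  # Root device type of the instance.
  root_type=$(aws ec2 describe-instances \
    --instance-ids "$instance_id" \
    --query 'Reservations[0].Instances[0].RootDeviceType' \
    --output text) || return 1

  # Auto Scaling group name; the CLI prints "None" when the query result is null.
  asg_name=$(aws autoscaling describe-auto-scaling-instances \
    --instance-ids "$instance_id" \
    --query 'AutoScalingInstances[0].AutoScalingGroupName' \
    --output text) || return 1

  echo "root_device_type=$root_type"
  case $asg_name in
    ''|None) echo "auto_scaling_group=none" ;;
    *)       echo "auto_scaling_group=$asg_name" ;;
  esac
}
```

If the output shows instance-store, or names an Auto Scaling group, then review the warnings above before you stop the instance.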
1. Determine if the instance status check or system status check failed by viewing the instance status check metrics.
2. If the system status check failed, then see My EC2 Linux instance failed its system status check. How do I troubleshoot this?
3. If the instance status check failed, then check the instance's system logs to determine the cause of the failure. Depending on the data found in the system logs, use one of these resolutions:
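You can also perform these steps from the command line. The following is a rough sketch, assuming that the AWS CLI is installed and configured. The function name classify_console_log, the labels that it prints, and the instance ID in the comments are hypothetical. The search patterns come from the example log lines in this article, so real logs from other distributions or kernel versions might not match.

```shell
# Show which status check failed (system status first, then instance status):
#   aws ec2 describe-instance-status --instance-ids i-0123456789abcdef0 \
#     --query 'InstanceStatuses[0].[SystemStatus.Status,InstanceStatus.Status]' \
#     --output text
#
# Fetch the system log and look for known failure signatures:
#   aws ec2 get-console-output --instance-id i-0123456789abcdef0 \
#     --query Output --output text | classify_console_log

# classify_console_log: reads a system log on stdin and prints one label for
# each failure signature that it finds, in a fixed order.
classify_console_log() {
  awk '
    /Failed to mount/         { mount = 1 }
    /Out of memory/           { oom = 1 }
    /No space left on device/ { disk = 1 }
    /Kernel panic/            { panic = 1 }
    END {
      if (mount) print "mount-failure"
      if (oom)   print "out-of-memory"
      if (disk)  print "disk-full"
      if (panic) print "kernel-panic"
      if (!(mount || oom || disk || panic)) print "no-known-signature"
    }'
}
```

Each label corresponds to one of the sections that follow. A result of no-known-signature means only that none of these patterns matched, not that the log is free of errors.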
Failure to boot the operating system
If the system logs contain boot errors, then see My EC2 Linux instance failed the instance status check due to operating system issues. How do I troubleshoot this?
Failure to mount the volumes correctly
An instance status check might fail due to a mount point that's unable to mount correctly, as shown in this example:
[FAILED] Failed to mount /
See 'systemctl status mnt-nvme0n1p1.mount' for details.
[DEPEND] Dependency failed for Local File Systems.
Exhausted CPU and Memory
High CPU Utilization
If the CPUUtilization metric is at or near 100%, the instance might not have enough compute capacity for the kernel to run.
For T2 or T3 instances, check the CPU credit metrics in the Amazon CloudWatch metrics table to determine if the CPU credits are at or near zero. If the CPU credits are at zero, then the CPUUtilization metric shows a saturation plateau at the baseline performance for the instance. The baseline performance might be 20%, 40%, and so on, depending on the instance type.
CloudWatch metrics indicating CPU utilization at or near 100%, or at a saturation plateau for T2 or T3 instances, indicate that the status check failed due to over-utilization of the instance's resources. For instructions on how to troubleshoot this issue, see My EC2 Linux instance failed the instance status check due to over-utilization of its resources. How do I troubleshoot this?
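The two patterns described above can be expressed as a small check. The following is an illustrative sketch only. The function name cpu_saturation_state and its thresholds (95% for "near 100%", less than 1 remaining credit, and within 2 percentage points of the baseline) are assumptions, not values defined by Amazon EC2. The inputs are values that you read from the CPUUtilization and CPUCreditBalance metrics, plus the baseline percentage for your instance type.

```shell
# cpu_saturation_state CPU_PERCENT BASELINE_PERCENT CREDIT_BALANCE
# Prints one of: near-100, baseline-plateau, ok.
# All thresholds are illustrative assumptions.
cpu_saturation_state() {
  awk -v u="$1" -v b="$2" -v c="$3" 'BEGIN {
    d = u - b
    if (d < 0) d = -d                # distance from the baseline
    if (u >= 95)                print "near-100"
    else if (c < 1 && d <= 2)   print "baseline-plateau"
    else                        print "ok"
  }'
}
```

For example, cpu_saturation_state 20.3 20 0 reports baseline-plateau, which matches a T2 or T3 instance that has run out of CPU credits and is held at a 20% baseline.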
Block device errors, software bugs, or kernel panic might cause an unusual CPU usage spike. If the CPUUtilization metric is at 100%, and the system logs contain errors related to block devices, memory issues, or other unusual system errors, then reboot or stop and start the instance.
Out of memory
High memory pressure might result in the instance status check failing. In this example, the operating system is out of memory and stopping the process consuming the most memory.
[115879.769795] Out of memory: kill process 20273 (httpd) score 1285879 or a child
[115879.769795] Killed process 1917 (php-cgi) vsz:467184kB, anon-rss:101196kB, file-rss:204kB
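To see which processes the kernel stopped, you can filter the log for the "Killed process" lines. The following is a minimal sketch. The function name oom_killed_processes is hypothetical, and the pattern assumes the "Killed process PID (name)" format shown in the example above.

```shell
# oom_killed_processes: reads kernel log lines on stdin and prints the unique
# names of processes that the out-of-memory killer stopped.
# Example usage on the instance:  dmesg | oom_killed_processes
oom_killed_processes() {
  sed -n 's/.*Killed process [0-9][0-9]* (\([^)]*\)).*/\1/p' | sort -u
}
```

A process name that appears repeatedly is a reasonable first candidate to investigate for high memory use.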
EC2 memory and disk metrics aren't sent to CloudWatch by default. However, you can send additional metrics to CloudWatch for monitoring using the CloudWatch agent.
To troubleshoot and resolve the out of memory issue, consider upgrading the instance to a larger instance type. Or, add swap storage to the instance to alleviate the memory pressure. For more information, see these topics:
- How do I allocate memory to work as swap space in an Amazon EC2 instance by using a swap file?
- How do I allocate memory to work as swap space on an Amazon EC2 instance using a partition on my hard drive?
Disk full errors
If the system logs contain disk full errors, then the instance might have entered emergency mode because the root device is full.
$: service apache2 restart
Error: No space left on device
$: /etc/init.d/mysql restart
[....] Restarting mysql (via systemctl): mysql.serviceError: No space left on device
root@example:~# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/root       7.7G  7.7G     0 100% /
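If you can still reach the instance, or after you attach its volume to a rescue instance, you can list the file systems that are at or above a usage threshold. The following is a minimal sketch that reads the POSIX output format of df. The function name full_filesystems and the 90% threshold in the usage comment are assumptions. The sketch doesn't handle mount points that contain spaces.

```shell
# full_filesystems THRESHOLD_PERCENT
# Reads `df -P` output on stdin and prints "mount_point use%" for each file
# system whose usage is at or above the threshold.
# Example usage:  df -P | full_filesystems 90
full_filesystems() {
  awk -v t="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)               # strip the percent sign from the Capacity column
    if (use + 0 >= t) print $6, use "%"
  }'
}
```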
For detailed instructions on how to troubleshoot and resolve disk full errors, see the following:
Kernel panic
A kernel panic error occurs when the kernel detects an internal, fatal error during operation. If the error occurs during the operating system boot, then the kernel might not be able to load properly. This might cause an operating system boot failure. This is an example of a kernel panic error message:
Linux version 2.6.16-xenU (email@example.com) (gcc version 4.0.1 20050727 (Red Hat 4.0.1-5)) #1 SMP Mon May 28 03:41:49 SAST 2007
Kernel command line: root=/dev/sda1 ro 4
Registering block device major 8
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(8,1)
For information on troubleshooting and resolving a kernel panic error, see these topics:
Network failed to come up
The network can fail to come up due to these reasons:
Missing "cloud-init" package
If the instance is missing the cloud-init package, then the network might fail to come up. The cloud-init package updates the network configuration at launch.
To fix this error, install the cloud-init package on your instance.
$ sudo yum install cloud-init
MAC address is hardcoded in a configuration file
If your MAC address is hardcoded in a configuration file, you can encounter issues that cause your network to fail to come up.
Hardcoded MAC addresses can be found in the Linux configuration files and "udev" configuration files, such as in these locations:
To resolve network issues when your MAC address is hardcoded in your instance, you need to remove the entries or configuration files. For example:
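To locate hardcoded MAC addresses before you remove them, you can search configuration directories for strings in MAC address format. The following is a rough sketch. The function name find_hardcoded_macs is hypothetical, and the directories in the usage comment are assumptions that exist only on some distributions. The pattern can also match other colon-separated hexadecimal strings, so review each match before you change a file.

```shell
# find_hardcoded_macs PATH...
# Prints the names of files under the given paths that contain a string in
# MAC address format (six colon-separated hexadecimal pairs).
# Example usage (paths are assumptions and vary by distribution):
#   find_hardcoded_macs /etc/udev/rules.d /etc/sysconfig/network-scripts
find_hardcoded_macs() {
  grep -rlE '([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}' "$@" 2>/dev/null
}
```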
IP address is hardcoded in a configuration file
If your IP address is hardcoded in a configuration file, you can encounter issues that cause your network to fail to come up. This occurs when an Amazon Machine Image (AMI) is taken from an instance with a statically configured IP address.
To fix this error, set your network interface to use DHCP.
Note: You can't update existing AMIs. Setting the network interface to use DHCP has to be done on the instance before creating a new AMI.
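Before you create a new AMI, you can confirm that an interface configuration file requests DHCP. The following is a minimal sketch that assumes the ifcfg file format used by network-scripts on Red Hat-family distributions, where BOOTPROTO=dhcp selects DHCP. The function name uses_dhcp and the file path in the usage comment are assumptions. Distributions that use other network configuration systems need a different check.

```shell
# uses_dhcp FILE
# Returns 0 if the ifcfg-style file sets BOOTPROTO to dhcp, and 1 otherwise.
# Example usage (path is an assumption):
#   uses_dhcp /etc/sysconfig/network-scripts/ifcfg-eth0 && echo "DHCP is set"
uses_dhcp() {
  grep -Eq '^[[:space:]]*BOOTPROTO="?dhcp"?[[:space:]]*$' "$1"
}
```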
Missing ENA or Intel Enhanced network drivers
If the instance is missing the Elastic Network Adapter (ENA) or Intel Enhanced network drivers, it can cause the network to fail to come up.
For more information, see Enhanced networking on Linux.
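To check which driver a network interface uses, you can read the driver field that ethtool -i reports. The following is a minimal sketch. The function name nic_driver and the interface name eth0 in the usage comment are assumptions. A result of ena indicates the ENA driver, ixgbevf indicates the Intel 82599 Virtual Function driver, and vif indicates that the interface isn't using enhanced networking.

```shell
# nic_driver: reads `ethtool -i` output on stdin and prints the driver name.
# Example usage (interface name is an assumption):
#   ethtool -i eth0 | nic_driver
nic_driver() {
  sed -n 's/^driver:[[:space:]]*//p'
}
```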
Network interface is renamed upon startup
If the network interface is renamed when the instance starts, it can cause network issues. To fix this error, deactivate predictable network interface names by adding the net.ifnames=0 option to the kernel command line. To do this, you must activate enhanced networking with the ENA.
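After you change the kernel command line and reboot, you can verify that the option is active. The following is a minimal sketch. The function name ifnames_disabled is hypothetical. With no argument, it reads the running kernel's command line from /proc/cmdline. On systems that boot with GRUB 2, the option is commonly added to GRUB_CMDLINE_LINUX in /etc/default/grub, but the exact location depends on your distribution.

```shell
# ifnames_disabled [CMDLINE]
# Returns 0 if net.ifnames=0 appears as a separate option on the kernel
# command line, and 1 otherwise. Defaults to the contents of /proc/cmdline.
ifnames_disabled() {
  cmdline=${1-$(cat /proc/cmdline)}
  case " $cmdline " in
    *" net.ifnames=0 "*) return 0 ;;
    *)                   return 1 ;;
  esac
}
```

For example, ifnames_disabled && echo "predictable names are off" prints the message only when the option is present.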
For more information on network issues, see Best practices for configuring network interfaces.