How do I troubleshoot an EC2 Linux instance that failed the instance status check due to operating system issues?

My Amazon Elastic Compute Cloud (Amazon EC2) Linux instance failed the instance status check due to operating system issues. Now it doesn't boot successfully.

Short description

Your EC2 Linux instance might fail the instance status check for the following reasons:

  • You updated the kernel and the new kernel didn't boot.
  • The file system entries in /etc/fstab are incorrect or the file system is corrupted.
  • There are incorrect network configurations on the instance.

Resolution

There are three methods for troubleshooting OS issues.

Important: Some of the following procedures require stopping the instance. Data that's stored in instance store volumes is lost when the instance is stopped. Make sure that you save a backup of the data before stopping the instance. Unlike Amazon Elastic Block Store (Amazon EBS)-backed volumes, instance store volumes are ephemeral and don't support data persistence.

The public IPv4 address that Amazon EC2 automatically assigned to the instance on launch or start changes after the stop and start. To retain a public IPv4 address that doesn't change when the instance is stopped, use an Elastic IP address.

For more information, see What happens when you stop an instance.

Use the EC2 serial console for Linux instances

If you turned on the EC2 serial console for Linux instances, then you can use it to troubleshoot supported Nitro-based instance types and bare metal instances. The serial console helps you troubleshoot boot issues and network and SSH configuration issues. The serial console connects to your instance without needing a working network connection. To access the serial console, use the Amazon EC2 console or the AWS Command Line Interface (AWS CLI).

If you're using the EC2 serial console for the first time, then make sure that you review the prerequisites, and configure access before trying to connect.

If your instance is unreachable and you haven't configured access to the serial console, then follow the instructions in Run the EC2Rescue for Linux tool or Use a rescue instance. For information on configuring the EC2 serial console for Linux instances, see Configure access to the EC2 serial console.

Note: If you receive errors when running AWS CLI commands, make sure that you're using the most recent version of the AWS CLI.
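
As a rough illustration of the AWS CLI path described above, the following sketch checks whether serial console access is turned on for the account, pushes an SSH public key to the instance's serial port, and then opens the SSH session. The function names run and connect_serial_console, the DRY_RUN variable, and the key file path are hypothetical helpers for this example. The endpoint format is an assumption that matches the common pattern for commercial Regions, so confirm it for your Region before relying on it.

```shell
# Print a command instead of running it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# connect_serial_console INSTANCE_ID REGION PUBLIC_KEY_FILE
# Assumes the private key is the same path without the .pub suffix.
connect_serial_console() {
    instance_id=$1
    region=$2
    public_key=$3

    # Check whether serial console access is turned on for the account.
    run aws ec2 get-serial-console-access-status --region "$region" || return 1

    # Push the SSH public key to serial port 0 of the instance.
    run aws ec2-instance-connect send-serial-console-ssh-public-key \
        --instance-id "$instance_id" --serial-port 0 \
        --ssh-public-key "file://$public_key" --region "$region" || return 1

    # Connect to the serial console endpoint (format assumed, see lead-in).
    run ssh -i "${public_key%.pub}" \
        "$instance_id.port0@serial-console.ec2-instance-connect.$region.aws"
}
```

To preview the commands without calling AWS, run DRY_RUN=1 connect_serial_console with your instance ID, Region, and public key file.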

Run the EC2Rescue for Linux tool

EC2Rescue for Linux automatically diagnoses and troubleshoots operating systems on unreachable instances. For more information, see How do I use EC2Rescue for Linux to troubleshoot operating system-level issues?

Manually correct errors using a rescue instance

1.    Launch a new EC2 instance in your virtual private cloud (VPC). Use the same Amazon Machine Image (AMI) and the same Availability Zone as the impaired instance. The new instance becomes your rescue instance.

Or, use an existing instance. The existing instance must use the same AMI and be in the same Availability Zone as your impaired instance.

2.    Stop the impaired instance.

3.    Detach the Amazon Elastic Block Store (Amazon EBS) root volume (/dev/xvda or /dev/sda1) from your impaired instance. Note the device name (/dev/xvda or /dev/sda1) of your root volume.

4.    Attach the volume as a secondary device (/dev/sdf) to the rescue instance.

5.    Connect to your rescue instance using SSH.

6.    Create a mount point directory (/rescue) for the new volume attached to the rescue instance:

$ sudo mkdir /rescue

7.    Mount the volume at the new directory:

$ sudo mount /dev/xvdf1 /rescue

If you receive an error, such as Wrong Fs type or UUID duplicate, Superblock is missing or badblock found, see Why can't I mount my Amazon EBS volume?

Note: The device (/dev/xvdf1) might be attached to the rescue instance with a different device name. To determine the correct device name, run the lsblk command to view your available disk devices along with their mount points.

8.    If you haven't already done so, retrieve the system log of the instance to verify the error. The next steps depend on the error message listed in the system log. The following is a list of common errors that cause instance status check failure. For additional errors, see Troubleshooting system log errors for Linux-based instances.
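
If you prefer the AWS CLI to the console for the stop, detach, attach, and system log steps above, they might look like the following sketch. The function names (run, stop_instance, move_volume, save_system_log) and the DRY_RUN variable are hypothetical helpers, and the instance and volume IDs in the usage comments are placeholders. Treat it as an outline under those assumptions rather than a complete procedure.

```shell
# Print a command instead of running it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# stop_instance INSTANCE_ID
# Stops the instance and waits until it reaches the stopped state.
stop_instance() {
    run aws ec2 stop-instances --instance-ids "$1" || return 1
    run aws ec2 wait instance-stopped --instance-ids "$1"
}

# move_volume VOLUME_ID TARGET_INSTANCE_ID DEVICE
# Detaches the volume, waits until it is available, and attaches it
# to the target instance under the given device name.
move_volume() {
    run aws ec2 detach-volume --volume-id "$1" || return 1
    run aws ec2 wait volume-available --volume-ids "$1" || return 1
    run aws ec2 attach-volume --volume-id "$1" --instance-id "$2" --device "$3"
}

# save_system_log INSTANCE_ID
# Prints the system log (console output) of the instance as text.
save_system_log() {
    run aws ec2 get-console-output --instance-id "$1" --output text
}

# Example usage with placeholder IDs:
#   save_system_log i-IMPAIRED > system.log
#   stop_instance i-IMPAIRED
#   move_volume vol-ROOT i-RESCUE /dev/sdf
#   ... repair the volume on the rescue instance ...
#   move_volume vol-ROOT i-IMPAIRED /dev/xvda
```

Setting DRY_RUN=1 prints each command so that you can review the sequence before anything is stopped or detached.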

Kernel panic

If a Kernel Panic error message is in the system log, then the kernel might not have the vmlinuz or initramfs files. The vmlinuz and initramfs files are necessary to boot successfully.

1.    Run the following commands:

cd /rescue/boot
ls -l

2.    Check the output to verify that there are vmlinuz and initramfs files corresponding to the kernel version that you want to boot.

The following example output is from an Amazon Linux 2 instance that runs kernel version 4.14.165-131.185.amzn2.x86_64. The /boot directory contains both initramfs-4.14.165-131.185.amzn2.x86_64.img and vmlinuz-4.14.165-131.185.amzn2.x86_64, so the files that this kernel needs to boot are present.

uname -r
4.14.165-131.185.amzn2.x86_64

cd /boot; ls -l
total 39960
-rw-r--r-- 1 root root      119960 Jan 15 14:34 config-4.14.165-131.185.amzn2.x86_64
drwxr-xr-x 3 root root     17 Feb 12 04:06 efi
drwx------ 5 root root       79 Feb 12 04:08 grub2
-rw------- 1 root root 31336757 Feb 12 04:08 initramfs-4.14.165-131.185.amzn2.x86_64.img
-rw-r--r-- 1 root root    669087 Feb 12 04:08 initrd-plymouth.img
-rw-r--r-- 1 root root    235041 Jan 15 14:34 symvers-4.14.165-131.185.amzn2.x86_64.gz
-rw------- 1 root root   2823838 Jan 15 14:34 System.map-4.14.165-131.185.amzn2.x86_64
-rwxr-xr-x 1 root root   5718992 Jan 15 14:34 vmlinuz-4.14.165-131.185.amzn2.x86_64

3.    If the initramfs file, the vmlinuz file, or both aren't present, then try to boot the instance using a previous kernel that has both of the files. For instructions on how to boot your instance using a previous kernel, see How do I revert to a known stable kernel after an update prevents my Amazon EC2 instance from rebooting successfully?

4.    Run the umount command to unmount the secondary device from your rescue instance:

$ sudo umount /rescue

If the unmount operation doesn't succeed, then you might have to stop or reboot the rescue instance for a clean unmount.

5.    Detach the secondary volume (/dev/sdf) from the rescue instance, and then attach it to the original instance as /dev/xvda (root volume).

6.    Start the instance, and then verify if the instance is responsive.

For additional information on resolving kernel panic errors, see Why do I see a "Kernel panic" error after I upgrade the kernel or reboot my EC2 Linux instance?
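
The check in step 2 can also be scripted. The following sketch lists every vmlinuz file in a boot directory and reports whether a matching initramfs file exists. The function name check_kernel_files is hypothetical, and the initramfs-<version>.img naming is an assumption based on the Amazon Linux 2 listing shown earlier; other distributions might name the file differently.

```shell
# check_kernel_files BOOT_DIR
# Reports, for each vmlinuz-<version> file, whether initramfs-<version>.img
# exists in the same directory. Returns 1 if any initramfs file is missing
# or if no vmlinuz files are found.
check_kernel_files() {
    boot_dir=$1
    result=0
    for kernel in "$boot_dir"/vmlinuz-*; do
        # An unmatched glob stays literal, so test that the file exists.
        if [ ! -e "$kernel" ]; then
            echo "No vmlinuz files found in $boot_dir"
            return 1
        fi
        version=${kernel##*/vmlinuz-}
        if [ -e "$boot_dir/initramfs-$version.img" ]; then
            echo "OK       $version"
        else
            echo "MISSING  initramfs-$version.img"
            result=1
        fi
    done
    return "$result"
}

# Example usage on the rescue instance:
#   check_kernel_files /rescue/boot
```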

Failed to mount or Dependency failed

Errors such as "Failed to mount" or "Dependency failed" in your system log indicate that the /etc/fstab file has incorrect mount point entries.

1.    Verify that the mount point entries in the /etc/fstab file are correct. While the impaired root volume is mounted on the rescue instance, the file is at /rescue/etc/fstab. For information on correcting the /etc/fstab file entries, see the Auto mount failures because of incorrect entries in the /etc/fstab file section of Why is my EC2 Linux instance going into emergency mode when I try to boot it?

2.    It's a best practice to run the fsck or xfs_repair tool to correct any file system errors. If there are inconsistencies in the file system, the fsck or xfs_repair tool corrects them.

Note: Create a backup of your file system before running the fsck or xfs_repair tool.

Run the umount command to unmount your mount point before running the fsck or xfs_repair tool:

$ sudo umount /rescue

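Run the fsck or xfs_repair tool, depending on your file system. In the following examples, replace the device name with the device that holds the file system on your rescue instance, as reported by the lsblk command (for example, /dev/xvdf1).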

For ext4 file systems:

$ sudo fsck /dev/sdf
fsck from util-linux 2.30.2
e2fsck 1.42.9 (28-Dec-2013)
/dev/sdf: clean, 11/6553600 files, 459544/26214400 blocks

For XFS file systems:

$ sudo xfs_repair /dev/sdf
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done

3.    Detach the secondary volume (/dev/sdf) from the rescue instance, and then attach it to the original instance as /dev/xvda (root volume).

4.    Start the instance, and then verify if the instance is responsive.
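
Before you move the volume back, a quick scripted look at the fstab file can catch obvious problems. The following sketch reports entries with fewer than four fields and non-root, non-swap entries that lack the nofail option. The function name check_fstab is hypothetical, and flagging a missing nofail option is an assumption of this example about what is worth reviewing, not a requirement stated in this article.

```shell
# check_fstab FSTAB_FILE
# Prints a message for each suspicious entry and returns 1 if any are found.
check_fstab() {
    awk '
        # Skip comments and blank lines.
        /^[ \t]*(#|$)/ { next }

        # An entry needs at least: device, mount point, type, options.
        NF < 4 {
            printf "line %d: expected at least 4 fields, found %d\n", NR, NF
            bad = 1
            next
        }

        # Without nofail, a missing secondary device can block the boot.
        $2 != "/" && $3 != "swap" && $4 !~ /(^|,)nofail(,|$)/ {
            printf "line %d: %s has no nofail option\n", NR, $2
            bad = 1
        }

        END { exit bad }
    ' "$1"
}

# Example usage on the rescue instance:
#   check_fstab /rescue/etc/fstab
```

The check reads the file only. Make any corrections in an editor after you review the reported lines.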

Bringing up interface eth0: failed

Verify that the ifcfg-eth0 file has the correct network entries. On the instance, the network configuration file for the primary interface, eth0, is located at /etc/sysconfig/network-scripts/ifcfg-eth0. If the device name of your primary interface isn't eth0, then the file name begins with ifcfg- and is followed by the name of your device, in the same directory. While the impaired root volume is mounted on the rescue instance, these paths are under the /rescue mount point, for example /rescue/etc/sysconfig/network-scripts/ifcfg-eth0.

1.    Run the cat command to view the network configuration file for the primary interface, eth0.

The following are the correct entries for the network configuration file located in /etc/sysconfig/network-scripts/ifcfg-eth0.

Note: Replace eth0 in the following command with the name of your primary interface, if different.

$ sudo cat /rescue/etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes
TYPE=Ethernet
USERCTL=yes
PEERDNS=yes
DHCPV6C=yes
DHCPV6C_OPTIONS=-nw
PERSISTENT_DHCLIENT=yes
RES_OPTIONS="timeout:2 attempts:5"
DHCP_ARP_CHECK=no

2.    Verify that ONBOOT is set to yes, as shown in the previous example. If ONBOOT isn't set to yes, then eth0 (or your primary network interface) isn't configured to come up at boot.

To change the ONBOOT value:

Open the file in an editor. In this example, the vi editor is used.

$ sudo vi /rescue/etc/sysconfig/network-scripts/ifcfg-eth0

Press i to enter insert mode.

Move the cursor to the ONBOOT entry, and then change the value to yes.

To save and exit the file, press Esc, enter :wq! and then press Enter.

3.    Run the umount command to unmount the secondary device from your rescue instance:

$ sudo umount /rescue

If the unmount operation isn't successful, then you might have to stop or reboot the rescue instance to enable a clean unmount.

4.    Detach the secondary volume (/dev/sdf) from the rescue instance, and then attach it to the original instance as /dev/xvda (root volume).

5.    Start the instance, and then verify if the instance is responsive.
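
As an alternative to editing the file by hand in step 2, the ONBOOT change can be scripted. The following sketch sets ONBOOT=yes in a network configuration file, or appends the entry if it's missing. The function name ensure_onboot is hypothetical, and the example path assumes that the impaired root volume is mounted at /rescue.

```shell
# ensure_onboot CONFIG_FILE
# Sets ONBOOT=yes in the file, or appends the entry if none exists.
# When an existing entry is rewritten, sed keeps a copy of the original
# file next to it with a .bak suffix.
ensure_onboot() {
    cfg=$1
    if grep -q '^ONBOOT=' "$cfg"; then
        sed -i.bak 's/^ONBOOT=.*/ONBOOT=yes/' "$cfg"
    else
        echo 'ONBOOT=yes' >> "$cfg"
    fi
}

# Example usage on the rescue instance, run as root:
#   ensure_onboot /rescue/etc/sysconfig/network-scripts/ifcfg-eth0
```

Review the file with the cat command afterward to confirm that the remaining entries match the example shown earlier in this section.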

Related information

Why is my EC2 Linux instance unreachable and failing one or both of its status checks?

Troubleshoot instances with failed status checks

Why is my Linux instance not booting after I changed its type to a Nitro-based instance type?

AWS OFFICIAL
Updated 8 months ago
3 Comments

Thanks for the very detailed and well structured article.

replied a year ago

In "Method 3: Manually correct errors using a rescue instance", when you try to mount the disk in step 7 for a problematic boot volume, you can get the error "Wrong Fs type or UUID duplicate, Superblock is missing or badblock found". This happens because the UUID of the boot volume conflicts with the UUID of the rescue server's boot volume. Validate the UUIDs of the disks with the "blkid" command. If the file system is XFS, mount the volume with the command "mount -t xfs -o nouuid /dev/vg/lv /mnt". For reference, see https://access.redhat.com/solutions/5494781

AWS
replied 9 months ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS
MODERATOR
replied 9 months ago