How do I use EC2Rescue for Linux to troubleshoot operating system-level issues?
Last updated: 2021-04-30
I can't connect to my Amazon Elastic Compute Cloud (Amazon EC2) Linux instance or I'm experiencing boot issues. To correct these problems, I need to fix common issues such as OpenSSH file permissions or gather operating system (OS) logs for analysis and troubleshooting. How can I use EC2Rescue for Linux to do this?
EC2Rescue for Linux is a tool that helps diagnose and troubleshoot problems on Amazon EC2 Linux instances. EC2Rescue for Linux is run on your Amazon EC2 Linux instance to correct operating system-level issues. EC2Rescue for Linux also collects advanced logs, system utilization reports, and configuration files for further analysis.
Common scenarios addressed by EC2Rescue for Linux:
- Collect system utilization reports such as vmstat, iostat, mpstat, and so on.
- Collect logs and details such as syslog, dmesg, application error logs, and SSM logs.
- Detect system problems such as asymmetric routing or duplicate root device labels.
- Automatically remediate system problems such as correcting OpenSSH file permissions or disabling known problematic kernel parameters.
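To make the OpenSSH file permission scenario concrete: with the StrictModes option enabled, sshd refuses key-based logins when the user's home directory, the .ssh directory, or the authorized_keys file is writable by group or others. The following is a minimal sketch of that kind of check. It is not EC2Rescue's own implementation, and the function names mode_is_strict_safe and check_ssh_modes are illustrative assumptions.

```shell
#!/bin/sh
# Illustrative helper (hypothetical name): succeed only if an octal mode
# such as 600 or 755 grants no write permission to group or others.
mode_is_strict_safe() {
    # 0$1 makes the shell read the digits as octal; 022 masks the
    # group-write and other-write bits.
    [ $(( 0$1 & 022 )) -eq 0 ]
}

# Report which of the paths checked under StrictModes look unsafe.
# Argument: a user's home directory, for example /home/ec2-user.
check_ssh_modes() {
    home_dir=$1
    for path in "$home_dir" "$home_dir/.ssh" "$home_dir/.ssh/authorized_keys"; do
        [ -e "$path" ] || continue
        mode=$(stat -c '%a' "$path")   # GNU stat syntax (assumption: Linux host)
        if mode_is_strict_safe "$mode"; then
            echo "ok   $mode $path"
        else
            echo "BAD  $mode $path"
        fi
    done
}
```

On a volume mounted at /rescue, the same check could be pointed at /rescue/home/ec2-user. Ownership is also checked by sshd but is left out of this sketch.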
EC2Rescue for Linux requires an Amazon EC2 Linux instance that meets the following prerequisites:
Supported operating systems
- Amazon Linux 2
- Amazon Linux 2016.09+
- SLES 12+
- RHEL 7+
- Ubuntu 16.04+
Software requirements
- Python 2.7.9+ or 3.2+
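The Python prerequisite can be checked before downloading the tool. The following is a minimal sketch, assuming a POSIX shell; the function names python_version_ok and detect_python_version are hypothetical helpers, not part of EC2Rescue for Linux.

```shell
#!/bin/sh
# Succeed if a version string such as "2.7.18" or "3.8" satisfies the
# stated prerequisite: 2.7.9+ on the 2.x line, or 3.2+ on the 3.x line.
python_version_ok() {
    # Split the argument on dots into positional parameters.
    old_ifs=$IFS; IFS=.
    set -- $1
    IFS=$old_ifs
    major=${1:-0}; minor=${2:-0}; patch=${3:-0}
    # Drop suffixes such as "rc1" or "+" from the patch field.
    patch=${patch%%[!0-9]*}
    patch=${patch:-0}
    case $major in
        2) [ "$minor" -eq 7 ] && [ "$patch" -ge 9 ] ;;
        3) [ "$minor" -ge 2 ] ;;
        *) [ "$major" -gt 3 ] 2>/dev/null ;;
    esac
}

# Print the version of the first interpreter found on PATH, e.g. "3.8.10".
detect_python_version() {
    for py in python3 python python2; do
        if command -v "$py" >/dev/null 2>&1; then
            # Python 2 prints its version on stderr, so merge the streams.
            "$py" --version 2>&1 | awk '{print $2}'
            return 0
        fi
    done
    return 1
}
```

For example, python_version_ok "$(detect_python_version)" returns success when the first interpreter found on the instance meets the requirement.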
Note: If you’ve enabled EC2 Serial Console for Linux, then you can use it to troubleshoot supported Nitro-based instance types. The serial console helps you troubleshoot boot issues, network configuration, and SSH configuration issues. The serial console connects to your instance without the need for a working network connection. You can access the serial console using the Amazon EC2 console or the AWS Command Line Interface (AWS CLI).
Before using the serial console, grant access to the console at the account level. Then create AWS Identity and Access Management (IAM) policies granting access to your IAM users. Also, every instance using the serial console must include at least one password-based user. If your instance is unreachable and you haven’t configured access to the serial console, follow the instructions in the Resolution section. For information on configuring the EC2 Serial Console for Linux, see Configure access to the EC2 Serial Console.
Note: If you receive errors when running AWS CLI commands, make sure that you’re using the most recent version of the AWS CLI.
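Account-level access to the serial console can be checked and granted from the AWS CLI. The following is a minimal sketch, assuming the AWS CLI is installed and configured with credentials that are allowed to call these EC2 APIs in the current Region. The function names are hypothetical; the IAM policies and the password-based user described above still have to be set up separately.

```shell
#!/bin/sh
# Print "true" or "false" for the current account and Region.
serial_console_status() {
    aws ec2 get-serial-console-access-status \
        --query 'SerialConsoleAccessEnabled' --output json
}

# Enable account-level access only if it is not already enabled.
ensure_serial_console_access() {
    status=$(serial_console_status) || return 1
    if [ "$status" = "true" ]; then
        echo "serial console access already enabled"
    else
        aws ec2 enable-serial-console-access >/dev/null || return 1
        echo "serial console access enabled"
    fi
}
```

Calling ensure_serial_console_access is safe to repeat, because the enable call is made only when the status query does not report true.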
To troubleshoot an unreachable Amazon EC2 Linux instance using EC2Rescue for Linux, do the following:
1. Launch a new Amazon EC2 instance in your virtual private cloud (VPC) using the same Amazon Machine Image (AMI) and in the same Availability Zone as the impaired instance. The new instance becomes your "rescue" instance.
Or, you can use an existing instance that you can access, if it uses the same AMI and is in the same Availability Zone as your impaired instance.
2. Detach the Amazon Elastic Block Store (Amazon EBS) root volume (/dev/xvda or /dev/sda1) from your impaired instance.
3. Attach the EBS volume as a secondary device (/dev/sdf) to the rescue instance.
4. Connect to the rescue instance using SSH.
5. Create a mount point directory (/rescue) for the new volume attached to the rescue instance in step 3.
$ sudo mkdir /rescue
6. Mount the volume at the directory you created in step 5.
$ sudo mount /dev/xvdf1 /rescue
Note: The device (/dev/xvdf1) might be attached to the rescue instance with a different device name. Use the lsblk command to view your available disk devices along with their mount points to determine the correct device names.
Note: If the mount fails, check the output of dmesg | tail. If the log reports a conflicting UUID, mount the volume with the -o nouuid option.
7. Change the root directory (chroot) to the newly mounted volume:
$ sudo -i
# for i in proc sys dev run; do mount --bind /$i /rescue/$i ; done
# chroot /rescue
8. Download and install the EC2Rescue Tool for Linux on an offline Linux root volume:
$ curl -O https://s3.amazonaws.com/ec2rescuelinux/ec2rl.tgz
$ tar -xvf ec2rl.tgz
9. Verify the installation by listing the help file:
$ cd ec2rl-<version_number>
$ ./ec2rl help
10. Run EC2Rescue for Linux as sudo with no options to run all modules:
$ sudo ./ec2rl run
11. View the results in /var/tmp/ec2rl.
12. Enable remediation for the supported modules based on the results:
$ ./ec2rl run --remediate
13. After remediation is complete, exit from chroot and unmount the secondary device:
$ exit
$ sudo umount /rescue
Note: If the unmount operation isn't successful, you might have to stop or reboot the rescue instance to enable a clean unmount.
14. Detach the secondary volume (/dev/sdf) from the rescue EC2 instance, and then attach it to the original instance as the root volume, using the original root device name (/dev/xvda or /dev/sda1).
15. Start the EC2 instance, and then verify that the instance is responsive.
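The volume moves in steps 2, 3, 14, and 15 can also be done from the AWS CLI instead of the console. The following is a minimal sketch under stated assumptions: the AWS CLI is configured, all instance and volume IDs are placeholders that you supply, and the function names are hypothetical. It also stops the impaired instance first, because a root volume can be detached only while its instance is stopped.

```shell
#!/bin/sh
# Detach a volume from its current instance and attach it elsewhere.
# Arguments: volume ID, target instance ID, device name on the target.
move_volume() {
    volume_id=$1; instance_id=$2; device=$3
    aws ec2 detach-volume --volume-id "$volume_id" >/dev/null || return 1
    aws ec2 wait volume-available --volume-ids "$volume_id" || return 1
    aws ec2 attach-volume --volume-id "$volume_id" \
        --instance-id "$instance_id" --device "$device" >/dev/null || return 1
    aws ec2 wait volume-in-use --volume-ids "$volume_id"
}

# Steps 2-3: stop the impaired instance and hand its root volume to the
# rescue instance as /dev/sdf.
# Arguments: impaired instance ID, root volume ID, rescue instance ID.
to_rescue() {
    aws ec2 stop-instances --instance-ids "$1" >/dev/null || return 1
    aws ec2 wait instance-stopped --instance-ids "$1" || return 1
    move_volume "$2" "$3" /dev/sdf
}

# Steps 14-15: return the volume under its original root device name and
# start the instance. Unmount the volume on the rescue instance first.
# Arguments: root volume ID, original instance ID, root device name.
back_to_original() {
    move_volume "$1" "$2" "$3" || return 1
    aws ec2 start-instances --instance-ids "$2" >/dev/null
}
```

The wait subcommands poll until the volume or instance reaches the named state, which keeps each step from running before the previous one has finished.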
Note: You can also use an AWS Systems Manager Automation document to troubleshoot connection issues. For more information, see Walkthrough: Run the EC2Rescue tool on unreachable instances. The AWSSupport-ExecuteEC2Rescue document is designed to automate steps normally required to use EC2Rescue for Linux. These steps are a combination of Systems Manager actions, AWS CloudFormation actions, and AWS Lambda functions.
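Starting the Automation document from the AWS CLI might look like the following. This is a sketch, not a complete walkthrough: it assumes the AWS CLI is configured with permission to run Systems Manager Automation, that the document takes the impaired instance as a parameter named UnreachableInstanceId (confirm this and any optional parameters in the document's description before running it), and the function names are hypothetical.

```shell
#!/bin/sh
# Start the automation for one unreachable instance and print the
# execution ID returned by Systems Manager.
# Argument: ID of the unreachable instance (placeholder supplied by you).
start_ec2rescue_automation() {
    aws ssm start-automation-execution \
        --document-name "AWSSupport-ExecuteEC2Rescue" \
        --parameters "UnreachableInstanceId=$1" \
        --query 'AutomationExecutionId' --output text
}

# Print the current status of an execution, such as InProgress or Success.
# Argument: execution ID printed by start_ec2rescue_automation.
automation_status() {
    aws ssm get-automation-execution \
        --automation-execution-id "$1" \
        --query 'AutomationExecution.AutomationExecutionStatus' --output text
}
```

A typical use is to capture the ID with exec_id=$(start_ec2rescue_automation <instance_id>) and then call automation_status "$exec_id" until the execution finishes.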
- For general instructions on recovering a Linux instance, see Instance recovery when a host computer fails. For Windows instances, see Troubleshoot an unreachable instance.
- If your instance's root device is an Amazon EBS-backed volume, try stopping and then starting the instance. For more information, see Stop and start your instance.
- For instance-store backed instances, if you created a custom AMI of the instance, you might be able to restore your instance using the AMI as a backup. For instructions on creating a new instance from an AMI you own, see Launching your instance from an AMI.
- In some cases, your EBS volume might have I/O access disabled, which can render your instance inaccessible. For instructions on how to identify and troubleshoot this, see Working with the Auto-Enabled IO volume attribute.
- If you've lost the SSH key pair, you can reset it using Systems Manager Automation and the AWSSupport-ResetAccess document.