Recover your impaired instances using EC2Rescue and Amazon EC2 Systems Manager Automation

Have you ever had an issue connecting to your Amazon EC2 Windows instance? There can be any number of causes, but the problem is almost always related to how the instance is configured. Unfortunately, if you can’t connect to it, you can’t fix it!

Earlier this year, AWS announced EC2Rescue for Windows, a convenient, straightforward, GUI-based troubleshooting tool that can be run on your Windows instances to troubleshoot operating system-level issues and collect advanced logs and configuration files for further analysis.

AWS listened to your feedback, and now EC2Rescue is available as a one-click, self-service, scalable automated solution for you to use via Systems Manager Automation. Starting today, there’s a new public Systems Manager Automation document, called AWSSupport-ExecuteEC2Rescue. Documentation for EC2Rescue has more details about this Automation document.

In Automation, you can define a sequence of actions to perform: stopping or starting EC2 instances, creating backup AMIs, invoking AWS Lambda functions, creating AWS CloudFormation stacks, and more. In this post, we show you how to use the AWSSupport-ExecuteEC2Rescue Automation document to orchestrate the EC2Rescue workflow.

Introducing EC2Rescue powered by Systems Manager Automation

While the symptom is always the same (you can’t remotely access your Windows instance), there could be multiple causes for it. AWS Support regularly publishes the most frequent questions and requests received from AWS customers in the Knowledge Center.
The most common causes that AWS Support has dealt with over the years are:

Network adapter misconfiguration: There is an incorrect static IP address assigned to the network interface, or the DHCP client can’t renew the DHCP lease.
RDP service issues: The service is disabled, or you are using a non-default configuration, such as a TCP port other than 3389.
Firewall: The Windows firewall is blocking RDP traffic.

What are your troubleshooting options?

To start, you can take a console screenshot. It may show, for example, that Windows is installing updates. In that case, you just need to wait.
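As a rough sketch of this first step from the CLI, the `aws ec2 get-console-screenshot` command returns the screenshot as base64-encoded image data. The function name `save_console_screenshot` and the default file name are my own placeholders, not part of any AWS tooling:

```shell
# Save a console screenshot of an instance as a JPEG file.
# Usage: save_console_screenshot INSTANCE_ID [OUTPUT_FILE]
# (function name and default file name are illustrative placeholders)
save_console_screenshot() {
  local instance_id="$1"
  local output_file="${2:-console-screenshot.jpg}"

  # ImageData comes back base64-encoded, so decode it before writing to disk.
  aws ec2 get-console-screenshot \
    --instance-id "$instance_id" \
    --query 'ImageData' \
    --output text | base64 --decode > "$output_file"
}
```

For example, `save_console_screenshot YOURINSTANCEID` writes console-screenshot.jpg to the current directory, which you can then open in any image viewer.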

For issues like RDP and firewall misconfiguration, you can use Systems Manager Run Command to start your investigation or even fix the problem.
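Here is a minimal sketch of such an investigation, assuming the SSM agent on the instance can still reach Systems Manager (if the network itself is broken, Run Command cannot reach the instance either). It uses the public AWS-RunPowerShellScript document to query the Remote Desktop service configuration and the Windows Firewall profile state. The function names and the choice of diagnostic commands are mine, for illustration only:

```shell
# Ask Run Command to report the RDP service configuration and firewall state.
# Usage: check_rdp_settings INSTANCE_ID   (prints the Run Command ID)
# (function names are illustrative placeholders)
check_rdp_settings() {
  local instance_id="$1"

  aws ssm send-command \
    --document-name "AWS-RunPowerShellScript" \
    --instance-ids "$instance_id" \
    --comment "Inspect RDP service and Windows Firewall settings" \
    --parameters '{"commands":["sc.exe qc TermService","netsh advfirewall show allprofiles state"]}' \
    --query 'Command.CommandId' \
    --output text
}

# Fetch the standard output of a finished command invocation.
# Usage: get_command_output COMMAND_ID INSTANCE_ID
get_command_output() {
  local command_id="$1" instance_id="$2"

  aws ssm get-command-invocation \
    --command-id "$command_id" \
    --instance-id "$instance_id" \
    --query 'StandardOutputContent' \
    --output text
}
```

Run `check_rdp_settings YOURINSTANCEID` first, give the command a few seconds to complete, and then pass the returned command ID to `get_command_output`.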

After you have exhausted these options, or you don’t know exactly what the issue might be, the next available step is to investigate the Amazon EBS root volume of your instance, by attempting an offline Windows registry analysis. This requires deep understanding of the Windows operating system, and a wrong action could worsen the problem.

EC2Rescue for Windows is able to detect and attempt to resolve all the issues listed above directly from an offline EBS volume, reducing troubleshooting and remediation of common Windows issues to a matter of clicks. Download the tool on a helper EC2 instance that has access to the EBS root volume to inspect, and EC2Rescue guides you through the analysis and remediation, with no advanced Windows knowledge required.

There are multiple preparation steps to execute when using EC2Rescue in offline mode. You need to create a new “helper” EC2 instance in your VPC (or use an existing instance that you can access), detach the EBS root volume from your impaired instance, and attach it to the helper instance. Finally, you need to attach the volume back to its original instance after EC2Rescue completes its work.
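To give a sense of what these manual steps involve, here is a rough sketch of the first half (stopping the impaired instance and moving its root volume to the helper). It assumes the helper instance already exists in the same Availability Zone and that `xvdf` is a free device name on it; both are assumptions, and the function name is a placeholder of mine:

```shell
# Move the root volume of an impaired instance to a helper instance.
# Usage: move_root_volume_to_helper IMPAIRED_ID HELPER_ID  (prints the volume ID)
# Assumes both instances are in the same Availability Zone.
move_root_volume_to_helper() {
  local impaired_id="$1" helper_id="$2" root_device volume_id

  # The root volume can only be detached while the instance is stopped.
  aws ec2 stop-instances --instance-ids "$impaired_id" > /dev/null
  aws ec2 wait instance-stopped --instance-ids "$impaired_id"

  # Look up the root device name, then the EBS volume mapped to it.
  root_device=$(aws ec2 describe-instances --instance-ids "$impaired_id" \
    --query 'Reservations[0].Instances[0].RootDeviceName' --output text)
  volume_id=$(aws ec2 describe-instances --instance-ids "$impaired_id" \
    --query "Reservations[0].Instances[0].BlockDeviceMappings[?DeviceName=='${root_device}'].Ebs.VolumeId | [0]" \
    --output text)

  aws ec2 detach-volume --volume-id "$volume_id" > /dev/null
  aws ec2 wait volume-available --volume-ids "$volume_id"

  # xvdf is an assumed free device name on the helper instance.
  aws ec2 attach-volume --volume-id "$volume_id" \
    --instance-id "$helper_id" --device xvdf > /dev/null
  aws ec2 wait volume-in-use --volume-ids "$volume_id"

  echo "$volume_id"
}
```

The way back follows the same pattern in reverse: detach the volume from the helper, attach it to the original instance under its original root device name, and start the instance. Every one of these calls is a chance to pick the wrong volume or device name.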

While the remediation is guided and happens in a matter of minutes, the preparation is prone to human error, and is usually done under the pressure of fixing the problem as soon as possible.

All the steps are now automated, from the helper instance setup to the EC2Rescue remediation, thanks to Systems Manager Automation. You can now use EC2Rescue on your Windows instance consistently. Here’s how to use the new public document, AWSSupport-ExecuteEC2Rescue.

How to use AWSSupport-ExecuteEC2Rescue

Suppose that a Windows instance is not passing the instance health check.

You can use EC2Rescue with your Windows instances from the AWS Management Console or the AWS CLI. The documentation has a walkthrough of the console experience, so I’m going through the CLI experience here.

You can now use EC2Rescue on this instance with one CLI command. In the following code example, I am passing the instance ID to use with EC2Rescue, and an IAM role with the required permissions to run this Automation document:

aws ssm start-automation-execution --document-name "AWSSupport-ExecuteEC2Rescue" --parameters "ImpairedInstanceId=YOURINSTANCEID,AssumeRole=arn:aws:iam::YOURACCOUNTID:role/YOURSSMAUTOMATIONROLE"

{
    "AutomationExecutionId": "ae6b3617-843e-11e7-8f65-57a040263d53"
}

You can start the automation from the EC2 console as well, in which case an IAM role is not necessary as Automation can impersonate the current user (make sure to have the required permissions though!).

You can monitor the execution with the returned ID. The execution is still in progress:

aws ssm get-automation-execution --automation-execution-id "ae6b3617-843e-11e7-8f65-57a040263d53"

{
    "AutomationExecution": {
        "AutomationExecutionStatus": "InProgress",
        "Parameters": {
            (..)
        },
        "Outputs": {
            (..)
        },
        "DocumentName": "AWSSupport-ExecuteEC2Rescue",
        "AutomationExecutionId": "ae6b3617-843e-11e7-8f65-57a040263d53",
        "DocumentVersion": "1",
        "ExecutionStartTime": 1503079041.084,
        "StepExecutions": [
            {
                (..)
            }
        ]
    }
}
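Rather than rerunning that command by hand, you could poll the status in a loop. The following is a minimal sketch, not an official helper: the function name `wait_for_automation` and the 60-second default interval are my own choices, and it treats Pending, InProgress, Waiting, and Cancelling as "not finished yet".

```shell
# Poll an Automation execution until it reaches a final status.
# Usage: wait_for_automation EXECUTION_ID [POLL_SECONDS]
# Prints the final status; returns 0 only if that status is Success.
wait_for_automation() {
  local execution_id="$1" poll_seconds="${2:-60}" status

  while true; do
    status=$(aws ssm get-automation-execution \
      --automation-execution-id "$execution_id" \
      --query 'AutomationExecution.AutomationExecutionStatus' \
      --output text) || return 1

    case "$status" in
      # Still running (or winding down): wait and ask again.
      Pending|InProgress|Waiting|Cancelling) sleep "$poll_seconds" ;;
      # Any other status (Success, Failed, TimedOut, Cancelled) is final.
      *) echo "$status"; break ;;
    esac
  done

  [ "$status" = "Success" ]
}
```

For example, `wait_for_automation "ae6b3617-843e-11e7-8f65-57a040263d53"` blocks until the execution finishes and prints its final status.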

After about 25 minutes, the Automation document completed successfully. The instance is passing both health checks now. Check to see what the problem was!

You can run this CLI command to review the analysis and changes that EC2Rescue made, or review the execution output from the EC2 console:

aws ssm get-automation-execution --automation-execution-id "ae6b3617-843e-11e7-8f65-57a040263d53" --query 'AutomationExecution.Outputs."runEC2Rescue.Output"' --output text

===== System Information =====
Operating System: Windows Server 2008 R2 Datacenter
Service Pack: Service Pack 1
Version: 6.1.7601
Computer Name: WIN-0KEEGO57HHS
Time Zone: UTC
.NET Framework:
v4.7 (4.7.02053)
EC2Config Version: 4.9.1981

===== Analysis =====
System Time
OK – RealTimeIsUniversal (Enabled): This registry value should be enabled when timezone is not UTC.

Windows Firewall
Warning – Domain networks (Enabled): Windows Firewall will be disabled.
Warning – Private networks (Enabled): Windows Firewall will be disabled.
Warning – Guest or public networks (Enabled): Windows Firewall will be disabled.

Remote Desktop
OK – Service Start (Manual): Sets Remote Desktop service start to automatic.
OK – Remote Desktop Connections (Enabled): The RDP listening port will be changed to TCP/3389.
OK – TCP Port (3389): The RDP listening port will be changed to TCP/3389.

EC2Config
OK – Installation (Installed): EC2Config 4.9.1981 is installed.
OK – Service Start (Automatic): The service will be set to start automatically.
Information – Ec2SetPassword (Disabled): Re-generates Administrator’s password on next boot.
Information – Ec2HandleUserData (Disabled): Executes User Data script on next boot.

Network Interface
OK – DHCP Service Startup (Automatic): The service will be set to start automatically.
Information – Local Area Connection detail (N/A): AWS PV Network Device (7.4.6.0)
Warning – DHCP on Local Area Connection (Disabled (Static: 169.254.0.1)): DHCP will be enabled.

===== Changes =====
Windows Firewall
OK – Domain networks (Disabled)
OK – Private networks (Disabled)
OK – Guest or public networks (Disabled)

Network Interface
OK – DHCP on Local Area Connection (Enabled)

EC2Rescue found that the Windows Firewall and a static IP address configured on the network adapter may have caused the connectivity issue, and made some changes in an attempt to resolve them. Update your custom AMI to make sure that you can launch new EC2 instances consistently in your VPC.

Under the hood

AWSSupport-ExecuteEC2Rescue automates the use of EC2Rescue for Windows in offline mode. This document leverages an AWS CloudFormation template and AWS Lambda functions, orchestrated by Systems Manager Automation, to automate the steps normally required to use EC2Rescue, including:

Creating an instance to assist with recovery in the appropriate Availability Zone
Attaching and detaching EBS volumes
Running the EC2Rescue tool

This provides a one-click solution to remediate common Windows issues that prevent remote access to the instance.

AWSSupport-ExecuteEC2Rescue creates a VPC where EC2Rescue can run, completely isolated from your environment, and creates a backup AMI of the instance on which to run EC2Rescue before any further action is taken.

After you have detected that your Windows instance is unreachable (1), you can pass its instance ID to AWSSupport-ExecuteEC2Rescue, which stages a VPC and a number of Lambda functions (2-3) to rescue it. AWSSupport-ExecuteEC2Rescue stops your original instance (5), and creates a backup before taking any action on it (6).

The Automation document identifies which subnet to use in the EC2Rescue VPC that was created (it uses one in the same Availability Zone as your instance), and gets the latest Windows Server 2016 AMI to launch an EC2Rescue instance (4). The Automation document uses Run Command and the EC2Rescue CLI on this instance, and attempts to fix the issues identified on your instance (7) before the Automation document starts it back up (9). The EC2Rescue instance is terminated as part of the flow (8).

How can this document fix my instance automatically?

AWSSupport-ExecuteEC2Rescue creates the EC2Rescue instance in the same Availability Zone as your instance (but in an isolated VPC).

AWSSupport-ExecuteEC2Rescue then attaches the root volume of your instance to the EC2Rescue instance.

At this stage, AWSSupport-ExecuteEC2Rescue runs a new Run Command document, called AWSSupport-RunEC2RescueForWindowsTool, against the EC2Rescue instance. The document:

Downloads EC2Rescue.
Runs the EC2Rescue for Windows tool CLI to diagnose and attempt to fix all the issues that it can identify in the offline root volume that was just attached.

The root volume is then automatically reattached to your instance. The Automation document terminates the EC2Rescue instance, and deletes the EC2Rescue VPC.
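If you want to see how far a particular execution got through this flow, the `StepExecutions` list returned by `get-automation-execution` carries a name and status for each step. A small sketch (the function name is a placeholder of mine; the step names you see depend on the document version):

```shell
# Print one line per Automation step: step name, then step status.
# Usage: list_automation_steps EXECUTION_ID
list_automation_steps() {
  local execution_id="$1"

  aws ssm get-automation-execution \
    --automation-execution-id "$execution_id" \
    --query 'AutomationExecution.StepExecutions[].[StepName,StepStatus]' \
    --output text
}
```

This is handy when an execution fails partway, because it shows which step stopped the flow.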

Summary

AWSSupport-ExecuteEC2Rescue is a new Automation document that automates all the steps required to fix common Windows issues on your unreachable Windows instance using the EC2Rescue for Windows tool.

The Automation document uses the new EC2Rescue for Windows CLI for a fully automated end-to-end experience.

With the recent integration between CloudWatch Events and Systems Manager Automation, you can run AWSSupport-ExecuteEC2Rescue automatically in response to an event in your infrastructure.

You can start using this document today. We are planning to add support for EC2Rescue for Linux soon.

If you have any questions or suggestions, please leave a comment for us. Happy EC2Rescue!

About the Author

Alessandro Martini is a Senior Cloud Support Engineer in the AWS Support organization. He likes working with customers, understanding and solving problems, and writing blog posts that outline his solutions across multiple AWS products. He also loves pizza, especially when there is no pineapple on it.

AWS Cloud Operations Blog