Smart RDP and SSH remediation with AWS Systems Manager Automation API actions

Here in AWS Support, I often help customers regain RDP or SSH access to their instances. It’s a common problem, but identifying the correct solution can take time, even hours or days, if the right information isn’t available.

Even with the most up-to-date playbook, it is easy to miss simple checks that might help resolve the problem faster. That is why automation is key in ensuring a fast resolution for recurrent problems.

AWS Support and AWS Systems Manager have partnered to package the power of EC2Rescue into a 1-click Automation document. (See the announcement from last year if you missed it.) You can now make offline remediation fast and easy by using AWSSupport-ExecuteEC2Rescue (which supports both Windows and Linux instances).

In this blog post, I’ll discuss two new Automation documents for online RDP and SSH troubleshooting. I’ll show how these documents can intelligently branch to the offline equivalent based on your available information.

Overview of the online remediation

The AWSSupport-TroubleshootRDP and AWSSupport-TroubleshootSSH documents target a managed instance. They use AWS Systems Manager Run Command to execute local scripts to check the current status of RDP and SSH configuration. By default, these documents run in read-only mode, meaning that no change is made to your instance.

Based on the information that is returned, you can decide to re-run the same documents to apply the necessary changes.

For example, for this blog post I executed AWSSupport-TroubleshootRDP against my managed instance. To do this, I used a deep link to the us-east-1 Region that opens the AWS Systems Manager console and lands directly on the Automation execution page, with the automation document I want to run already selected.

I then provided the instance ID from the convenient interactive instance picker and executed the automation with the default values:

The Executed steps page shows the output (note that we are using the aws:executeAutomation action to call a few other AWS Support documents to retrieve all of the RDP settings and the RDP service status):

As you can see, my RDP service is stopped, and no remote connections are allowed.

I decided to run the automation again, this time with the Action parameter set to FixAll (here is another deep link for us-east-1 with the parameter pre-selected for you):

I can now successfully use RDP!

Similarly, AWSSupport-TroubleshootSSH will return an analysis of the SSH configuration health and provide an option to fix all of the detected problems at once. It’s easy to use.
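
If you prefer scripting to deep links, the same two runs (read-only first, then FixAll) can be started through the API. Here is a minimal sketch in Python, assuming boto3 is installed and credentials are configured. The helper names and the instance ID are hypothetical; the document name and the InstanceId and Action parameters are the ones used in this post.

```python
from typing import Dict, List, Optional


def build_automation_request(
    document_name: str,
    instance_id: str,
    action: Optional[str] = None,
) -> Dict[str, object]:
    """Build keyword arguments for the SSM StartAutomationExecution call.

    Automation parameters are passed as a mapping of name -> list of strings.
    Leaving Action out keeps the document's default, read-only behavior.
    """
    parameters: Dict[str, List[str]] = {"InstanceId": [instance_id]}
    if action is not None:
        parameters["Action"] = [action]
    return {"DocumentName": document_name, "Parameters": parameters}


def start_automation(request: Dict[str, object]) -> str:
    """Start the execution. Needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the builder above works without boto3

    ssm = boto3.client("ssm", region_name="us-east-1")
    response = ssm.start_automation_execution(**request)
    return response["AutomationExecutionId"]


# Hypothetical instance ID, for illustration only.
INSTANCE_ID = "i-0123456789abcdef0"

check_only = build_automation_request("AWSSupport-TroubleshootRDP", INSTANCE_ID)
fix_all = build_automation_request("AWSSupport-TroubleshootRDP", INSTANCE_ID, action="FixAll")

print(check_only)
print(fix_all)
```

Calling start_automation(check_only) would start the read-only run. After reviewing its output, start_automation(fix_all) would apply the fixes.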

Branching to the offline remediation

What if your instance is not passing the health checks, and can’t receive commands from Run Command? Or what if you are not targeting a managed instance?

Both AWSSupport-TroubleshootRDP and AWSSupport-TroubleshootSSH use the new action aws:assertAwsResourceProperty to check if the instance provided as an input, {{ InstanceId }}, is a managed instance.

I find it very convenient to use the example response from the AWS API reference (or a real response from the AWS CLI) and test the property selector with an available online JSON path evaluator like jsonpath.com.

{
      "name": "assertInstanceIsManaged",
      "action": "aws:assertAwsResourceProperty",
      "onFailure": "step:assertAllowOffline",
      "inputs": {
        "Service": "ssm",
        "Api": "DescribeInstanceInformation",
        "InstanceInformationFilterList": [
          {
            "key": "InstanceIds",
            "valueSet": [
              "{{ InstanceId }}"
            ]
          }
        ],
        "PropertySelector": "InstanceInformationList..PingStatus",
        "DesiredValues": [
          "Online"
        ]
      },
      "nextStep": "assertActionIsUseSettingSpecificAction"
}

Thanks to the new branching feature, the automation can jump to a specific step if this assertion fails:

      (..)      
      "onFailure": "step:assertAllowOffline",
      (..)
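
If you would rather test a property selector locally than in an online evaluator, a few lines of Python are enough. The sketch below is not the Automation service's evaluator. It is a small stand-in that handles only the selector forms used in this post (plain keys, "..", and [n]), and it assumes an assertion passes when every selected value is one of the desired values. The sample responses are hand-written and hypothetical, including the assumption that an unmanaged instance yields an empty list.

```python
import re
from typing import Any, List

# One path segment: optional "." or ".." separator, a key, optional [n] indexes.
_SEGMENT = re.compile(r"(\.\.|\.)?([A-Za-z_]\w*)((?:\[\d+\])*)")


def _descend(node: Any, key: str) -> List[Any]:
    """Collect every value stored under `key` at any depth below `node`."""
    found: List[Any] = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            found.extend(_descend(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(_descend(item, key))
    return found


def select(document: Any, selector: str) -> List[Any]:
    """Evaluate a small JSONPath subset: key, .key, ..key, and [n]."""
    path = selector[1:] if selector.startswith("$") else selector
    nodes: List[Any] = [document]
    position = 0
    while position < len(path):
        match = _SEGMENT.match(path, position)
        if match is None:
            raise ValueError(f"unsupported selector near {path[position:]!r}")
        separator, key, indexes = match.groups()
        if separator == "..":
            nodes = [value for node in nodes for value in _descend(node, key)]
        else:
            nodes = [node[key] for node in nodes if isinstance(node, dict) and key in node]
        for index in re.findall(r"\[(\d+)\]", indexes):
            i = int(index)
            nodes = [node[i] for node in nodes if isinstance(node, list) and i < len(node)]
        position = match.end()
    return nodes


def assertion_passes(response: Any, selector: str, desired_values: List[str]) -> bool:
    """Assumed rule: something was selected and every value is a desired one."""
    selected = select(response, selector)
    return bool(selected) and all(value in desired_values for value in selected)


SELECTOR = "InstanceInformationList..PingStatus"

# Hypothetical, trimmed-down DescribeInstanceInformation responses.
managed = {"InstanceInformationList": [{"InstanceId": "i-0123456789abcdef0", "PingStatus": "Online"}]}
lost = {"InstanceInformationList": [{"InstanceId": "i-0123456789abcdef0", "PingStatus": "ConnectionLost"}]}
unmanaged = {"InstanceInformationList": []}

for name, response in (("managed", managed), ("lost", lost), ("unmanaged", unmanaged)):
    result = "continue online" if assertion_passes(response, SELECTOR, ["Online"]) else "step:assertAllowOffline"
    print(name, "->", result)
```

Only the first sample satisfies the assertion. The other two take the onFailure route to assertAllowOffline.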

When the instance isn’t a managed instance, I can’t use any online remediation option. Instead, I’ll use AWSSupport-ExecuteEC2Rescue to attempt to fix the instance offline.

Before we start the offline remediation, make sure that:

AllowOffline input flag is set to True
Action input is set to FixAll

Only proceed if these conditions are met. By default, offline remediation will not happen unless you explicitly allow it. You have control.

{
  "name": "assertAllowOffline",
  "action": "aws:assertAwsResourceProperty",
  "onFailure": "Abort",
  "inputs": {
    "Service": "ssm",
    "Api": "GetAutomationExecution",
    "AutomationExecutionId": "{{ automation:EXECUTION_ID }}",
    "PropertySelector": "AutomationExecution.Parameters.AllowOffline[0]",
    "DesiredValues": [
      "True"
    ]
  },
  "nextStep": "assertActionIsFixAllForOfflineBranch"
}

Here is a flowchart to summarize what I just described:
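
In code form, the same decision flow could be sketched roughly as follows. This is my simplified reading of the branching described above, not the documents' actual implementation. The function name and the three return labels are hypothetical. The checks mirror the assertInstanceIsManaged, assertAllowOffline, and assertActionIsFixAllForOfflineBranch steps.

```python
from typing import Optional


def choose_remediation_path(
    ping_status: Optional[str],
    allow_offline: str,
    action: str,
) -> str:
    """Return "online", "offline", or "stop" for a given set of inputs.

    ping_status is None when the instance is not a managed instance.
    """
    # assertInstanceIsManaged: a managed, reachable instance stays online.
    if ping_status == "Online":
        return "online"
    # assertAllowOffline: offline remediation must be explicitly allowed.
    if allow_offline != "True":
        return "stop"
    # assertActionIsFixAllForOfflineBranch: offline only runs with FixAll.
    if action != "FixAll":
        return "stop"
    return "offline"


print(choose_remediation_path("Online", "False", "CheckAll"))
print(choose_remediation_path(None, "True", "FixAll"))
print(choose_remediation_path("ConnectionLost", "False", "FixAll"))
```

The third call shows the default safety net: even with FixAll selected, nothing happens offline unless AllowOffline is set to True.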

The offline remediation uses AWSSupport-ExecuteEC2Rescue, which attempts to fix common issues with SSH and RDP. To execute this child workflow, AWSSupport-TroubleshootRDP and AWSSupport-TroubleshootSSH need some additional information about your instance, such as its subnet ID. The new action aws:executeAwsApi is used to get this information:

{
      "name": "describeSourceInstance",
      "action": "aws:executeAwsApi",
      "onFailure": "Abort",
      "inputs": {
          "Service": "ec2",
          "Api": "DescribeInstances",
          "InstanceIds": [
              "{{ InstanceId }}"
          ]
      },
      "outputs": [
          {
              "Name": "SubnetId",
              "Selector": "Reservations..Instances..NetworkInterfaces[0].SubnetId",
              "Type": "String"
          }
      ],
      "nextStep": "troubleshootRDPOfflineWithSubnetId"
}
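
To see what the SubnetId output would hold, you can walk a DescribeInstances response by hand. The sketch below approximates the selector with plain dictionary access and returns the first match. It is an illustration under that assumption, not how the service evaluates the selector. The function name and the sample response (trimmed, with made-up IDs) are hypothetical.

```python
from typing import Any, Dict, Optional


def first_subnet_id(describe_instances_response: Dict[str, Any]) -> Optional[str]:
    """Return the subnet ID of the first network interface that is found.

    Follows Reservations -> Instances -> NetworkInterfaces[0] -> SubnetId.
    """
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            interfaces = instance.get("NetworkInterfaces", [])
            if interfaces:
                return interfaces[0].get("SubnetId")
    return None


# Hypothetical, trimmed-down DescribeInstances response.
sample_response = {
    "Reservations": [
        {
            "Instances": [
                {
                    "InstanceId": "i-0123456789abcdef0",
                    "NetworkInterfaces": [
                        {"NetworkInterfaceId": "eni-0aaa1111bbbb2222c", "SubnetId": "subnet-0abc1234def567890"}
                    ],
                }
            ]
        }
    ]
}

print(first_subnet_id(sample_response))
print(first_subnet_id({"Reservations": []}))
```

With a real response from the AWS CLI or boto3 in place of sample_response, this is a quick way to confirm which value the next step, troubleshootRDPOfflineWithSubnetId, would receive.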

Conclusion

In this post, I showed you how to troubleshoot RDP and SSH connectivity issues in a new automated way. With the AWSSupport-TroubleshootRDP and AWSSupport-TroubleshootSSH automation documents, you can easily investigate and remediate common issues with your instances. Thanks to the new Automation actions and the new branching feature, you can use the same documents to automatically pick the right solution for your specific instance. Then you can remediate the issues with or without being able to contact the SSM Agent on the target instance.

About the Author
Alessandro Martini is a Senior Cloud Support Engineer in the AWS Support organization. He likes working with customers, understanding and solving problems, and writing blog posts that outline solutions on multiple AWS products. He also loves pizza, especially when there is no pineapple on it.

AWS Cloud Operations & Migrations Blog
