How do I delay Auto Scaling termination of unhealthy Amazon EC2 instances so I can troubleshoot them?

Last updated: 2020-11-18

My Amazon Elastic Compute Cloud (Amazon EC2) instance was marked as unhealthy and moved to the "Auto Scaling Terminating" state. Then, my Amazon EC2 instance terminated before I could determine the cause of the problem. How can I troubleshoot this?

Short description

Add a lifecycle hook to your AWS Auto Scaling group to move instances in the Terminating state to the Terminating:Wait state. In this state, you can access instances before they're terminated, and then troubleshoot why they were marked as unhealthy.

By default, an instance remains in the Terminating:Wait state for 3600 seconds (1 hour). To increase this time, set the --heartbeat-timeout parameter of the put-lifecycle-hook command. The maximum time that you can keep an instance in the Terminating:Wait state is 48 hours or 100 times the heartbeat timeout, whichever is smaller.
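The limit described above can be worked out with a little arithmetic. The following is a minimal sketch; the function name max_wait_seconds is hypothetical and is not part of the AWS CLI.

```shell
#!/bin/sh
# Sketch: upper bound on the time an instance can stay in Terminating:Wait.
# The bound is the smaller of 48 hours and 100 times the heartbeat timeout.
# max_wait_seconds is a hypothetical helper, not an AWS CLI command.
max_wait_seconds() {
  heartbeat=$1                        # heartbeat timeout in seconds
  cap=$((48 * 3600))                  # 48 hours = 172800 seconds
  by_heartbeat=$((heartbeat * 100))   # 100 times the heartbeat timeout
  if [ "$by_heartbeat" -lt "$cap" ]; then
    echo "$by_heartbeat"
  else
    echo "$cap"
  fi
}

max_wait_seconds 300     # prints 30000 (about 8.3 hours)
max_wait_seconds 3600    # prints 172800 (the 48-hour cap applies)
```

With the default heartbeat timeout of 3600 seconds, the 48-hour cap is the limiting factor. With a short timeout such as 300 seconds, the 100-times rule is.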

Resolution

Note: If you receive errors when running AWS Command Line Interface (AWS CLI) commands, make sure that you’re using the most recent version of the AWS CLI.

Use the following steps to configure a lifecycle hook using the AWS CLI. Then, create the necessary Amazon Simple Notification Service (Amazon SNS) topic and AWS Identity and Access Management (IAM) permissions.

Or, you can configure a lifecycle hook using the AWS Management Console. Then, refer to the Amazon SNS and IAM documentation to manage SNS topics and IAM permissions in the console.

Create an Amazon SNS topic

1.    Create a topic where AWS Auto Scaling can send lifecycle notifications. The following example calls the create-topic command to create the ASNotifications topic:

$ aws sns create-topic --name ASNotifications

An Amazon Resource Name (ARN) similar to the following is returned:

"TopicArn": "arn:aws:sns:us-west-2:123456789012:ASNotifications"

2.    Create a subscription to the topic. You must have a subscription to receive the LifecycleActionToken. The token is required to extend the heartbeat timeout of the wait state or to complete the lifecycle action. The following example uses the subscribe command to create a subscription that uses the email protocol (SMTP) with the endpoint email address user@amazon.com.

$ aws sns subscribe --topic-arn arn:aws:sns:us-west-2:123456789012:ASNotifications --protocol email --notification-endpoint user@amazon.com
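The notification delivered to the subscription is a JSON message that includes the LifecycleActionToken. The following sketch shows one way to pull fields out of such a message with sed. The sample message is illustrative and all of its values are made up. The helper name extract_field is hypothetical, and the sed pattern handles only simple string fields; a JSON tool such as jq would be more robust if it's available.

```shell
#!/bin/sh
# Illustrative lifecycle notification body. All values are made up.
MESSAGE='{"LifecycleHookName":"AStroubleshoot","AccountId":"123456789012","RequestId":"example-request-id","LifecycleTransition":"autoscaling:EC2_INSTANCE_TERMINATING","AutoScalingGroupName":"MyASGroup","Service":"AWS Auto Scaling","Time":"2020-11-18T00:00:00.000Z","EC2InstanceId":"i-0e7380909ffaab747","LifecycleActionToken":"example-token-0000"}'

# extract_field NAME JSON
# Hypothetical helper: prints the string value of a top-level field.
# This is a pattern match, not a JSON parser.
extract_field() {
  printf '%s\n' "$2" | sed -n 's/.*"'"$1"'" *: *"\([^"]*\)".*/\1/p'
}

TOKEN=$(extract_field LifecycleActionToken "$MESSAGE")
INSTANCE_ID=$(extract_field EC2InstanceId "$MESSAGE")

echo "token:    $TOKEN"
echo "instance: $INSTANCE_ID"
```

Note that an email subscription doesn't deliver messages until the recipient confirms it using the link in the confirmation email.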

Configure IAM permissions

To configure IAM permissions, create an IAM role that grants the AWS Auto Scaling service permission to publish to the SNS topic. To complete this task, create a text file that contains the role's trust policy. Then, reference the file in the create-role command.

1.    Use a text editor (such as vi) to create the text file:

$ vi assume-role.txt

2.    Paste the following in the text file, and then save the file.

{
  "Version": "2012-10-17",
  "Statement": [{
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "autoscaling.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

3.    Use the aws iam create-role command to create the IAM role AS-Lifecycle-Hook-Role from the policy saved to assume-role.txt:

$ aws iam create-role --role-name AS-Lifecycle-Hook-Role --assume-role-policy-document file://assume-role.txt

The output contains the ARN for the role. Be sure to save the ARNs of both the IAM role and the SNS topic.

4.    Add permissions to the role to allow AWS Auto Scaling to send SNS notifications when a lifecycle hook event occurs. The following example uses the attach-role-policy command to attach the managed policy AutoScalingNotificationAccessRole to the IAM role AS-Lifecycle-Hook-Role:

$ aws iam attach-role-policy --role-name AS-Lifecycle-Hook-Role --policy-arn arn:aws:iam::aws:policy/service-role/AutoScalingNotificationAccessRole

This managed policy grants the following permissions:

{
  "Version": "2012-10-17",
  "Statement": [{
      "Effect": "Allow",
      "Resource": "*",
      "Action": [
        "sqs:SendMessage",
        "sqs:GetQueueUrl",
        "sns:Publish"
      ]
    }
  ]
}

Important: The AWS managed policy AutoScalingNotificationAccessRole allows the AWS Auto Scaling service to publish to all SNS topics and to send messages to all Amazon Simple Queue Service (Amazon SQS) queues. To restrict AWS Auto Scaling's access to a specific SNS topic, use a policy similar to the following sample instead. Because the resource is an SNS topic, the policy needs only the sns:Publish action.

{
  "Version": "2012-10-17",
  "Statement": [{
      "Effect": "Allow",
      "Resource": "arn:aws:sns:us-west-2:123456789012:ASNotifications",
      "Action": [
        "sns:Publish"
      ]
    }
  ]
}
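One way to apply a topic-scoped policy is to add it to the role as an inline policy with the put-role-policy command. The following is a minimal sketch, not the only approach. The file name sns-publish-policy.json, the policy name AS-Lifecycle-SNS-Publish, and the run helper are assumptions made for illustration. By default the script only prints the AWS CLI command; set DRY_RUN=0 to run it.

```shell
#!/bin/sh
# Sketch: attach a topic-scoped inline policy to the lifecycle hook role.
# ASSUMPTIONS: the file name and policy name below are hypothetical.
ROLE_NAME="AS-Lifecycle-Hook-Role"
TOPIC_ARN="arn:aws:sns:us-west-2:123456789012:ASNotifications"
POLICY_FILE="sns-publish-policy.json"        # hypothetical file name
POLICY_NAME="AS-Lifecycle-SNS-Publish"       # hypothetical policy name

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Write a policy that allows publishing to the one topic only.
cat > "$POLICY_FILE" <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
      "Effect": "Allow",
      "Resource": "$TOPIC_ARN",
      "Action": "sns:Publish"
    }
  ]
}
EOF

# Add the policy to the role as an inline policy.
run aws iam put-role-policy --role-name "$ROLE_NAME" \
    --policy-name "$POLICY_NAME" \
    --policy-document "file://$POLICY_FILE"
```

If the managed policy AutoScalingNotificationAccessRole is already attached to the role, the role keeps the broader permissions until that policy is removed, for example with the detach-role-policy command.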

Configure the lifecycle hook

Next, use the put-lifecycle-hook command to configure the lifecycle hook:

$ aws autoscaling put-lifecycle-hook --lifecycle-hook-name AStroubleshoot --auto-scaling-group-name MyASGroup \
        --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
        --notification-target-arn arn:aws:sns:us-west-2:123456789012:ASNotifications \
        --role-arn arn:aws:iam::123456789012:role/AS-Lifecycle-Hook-Role

Be sure to substitute your own AWS Auto Scaling group name, SNS target ARN, and IAM role ARN before running this command.

This command:

  • Names the lifecycle hook (AStroubleshoot)
  • Identifies the AWS Auto Scaling group that is associated with the lifecycle hook (MyASGroup)
  • Configures the hook for the instance termination lifecycle stage (EC2_INSTANCE_TERMINATING)
  • Specifies the SNS topic's ARN (arn:aws:sns:us-west-2:123456789012:ASNotifications)
  • Specifies the IAM role's ARN (arn:aws:iam::123456789012:role/AS-Lifecycle-Hook-Role)
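To allow more troubleshooting time per heartbeat, the same command can also set --heartbeat-timeout, and describe-lifecycle-hooks can confirm that the hook is registered. The following is a sketch under assumptions: the 7200-second timeout is an arbitrary example value, and the run helper is hypothetical. By default the script only prints the AWS CLI commands; set DRY_RUN=0 to run them.

```shell
#!/bin/sh
# Sketch: create or update the hook with a longer heartbeat timeout, then verify it.
ASG="MyASGroup"
HOOK="AStroubleshoot"
TOPIC_ARN="arn:aws:sns:us-west-2:123456789012:ASNotifications"
ROLE_ARN="arn:aws:iam::123456789012:role/AS-Lifecycle-Hook-Role"
HEARTBEAT_TIMEOUT=7200    # example value in seconds (2 hours); choose your own

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# put-lifecycle-hook creates the hook, or updates it if the name already exists.
run aws autoscaling put-lifecycle-hook --lifecycle-hook-name "$HOOK" \
    --auto-scaling-group-name "$ASG" \
    --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
    --notification-target-arn "$TOPIC_ARN" \
    --role-arn "$ROLE_ARN" \
    --heartbeat-timeout "$HEARTBEAT_TIMEOUT"

# List the hook to confirm the transition, target, and timeout.
run aws autoscaling describe-lifecycle-hooks --auto-scaling-group-name "$ASG" \
    --lifecycle-hook-names "$HOOK"
```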

Test the lifecycle hook

To test the lifecycle hook, choose an instance and then use terminate-instance-in-auto-scaling-group to terminate the instance. This forces AWS Auto Scaling to terminate the instance, similar to when the instance becomes unhealthy. After the instance moves to the Terminating:Wait state, you can keep your instance in this state using record-lifecycle-action-heartbeat. Or, allow the termination to complete using complete-lifecycle-action.

$ aws autoscaling complete-lifecycle-action --lifecycle-hook-name AStroubleshoot \
        --auto-scaling-group-name MyASGroup --lifecycle-action-result CONTINUE \
        --instance-id i-0e7380909ffaab747
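The steps that come before completing the action can be sketched as follows. This is an illustration under assumptions: it uses the hook name AStroubleshoot created earlier, the function name test_hook_flow and the run helper are hypothetical, and the choice not to decrement desired capacity is only an example. By default the script only prints the AWS CLI commands; set DRY_RUN=0 to run them.

```shell
#!/bin/sh
# Sketch: trigger a termination, check the lifecycle state, and extend the wait.
ASG="MyASGroup"
HOOK="AStroubleshoot"
INSTANCE_ID="i-0e7380909ffaab747"

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

test_hook_flow() {
  # 1. Ask Auto Scaling to terminate the instance. The capacity flag is required;
  #    not decrementing means a replacement instance is launched.
  run aws autoscaling terminate-instance-in-auto-scaling-group \
      --instance-id "$INSTANCE_ID" --no-should-decrement-desired-capacity

  # 2. Check the lifecycle state; Terminating:Wait is expected once the hook applies.
  run aws autoscaling describe-auto-scaling-instances --instance-ids "$INSTANCE_ID" \
      --query 'AutoScalingInstances[0].LifecycleState' --output text

  # 3. Restart the heartbeat timer to keep the instance in the wait state longer.
  run aws autoscaling record-lifecycle-action-heartbeat --lifecycle-hook-name "$HOOK" \
      --auto-scaling-group-name "$ASG" --instance-id "$INSTANCE_ID"
}

test_hook_flow
```

Each heartbeat restarts the timeout, so the call in step 3 can be repeated until the overall maximum wait time is reached.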
