AWS Compute Blog

How to Automate Container Instance Draining in Amazon ECS

Update 24 Aug 2023: The approach described in this post relies on a recursive AWS Lambda function. Lambda announced a recursion control to detect and stop Lambda functions in July 2023. Accounts having recursive Lambda functions were automatically opted-out during the launch, we recommend contacting AWS Support to disable the Lambda feature to use this solution.


My colleague Madhuri Peri sent a nice guest post that describes how to use container instance draining to remove tasks from an instance before scaling down a cluster with Auto Scaling Groups.
—–

There are times when you might need to remove an instance from an Amazon ECS cluster; for example, to perform system updates, update the Docker daemon, or scale down the cluster size. Container instance draining enables you to remove a container instance from a cluster without impacting tasks in your cluster. It works by preventing new tasks from being scheduled for placement on the container instance while it is in the DRAINING state, replacing service tasks on other container instances in the cluster if the resources are available, and enabling you to wait until tasks have successfully moved before terminating the instance.

You can change a container instance’s state to DRAINING manually, but in this post, I demonstrate how to use container instance draining with Auto Scaling groups and AWS Lambda to automate the process.

Amazon ECS overview

Amazon ECS is a container management service that makes it easy to run, stop, and manage Docker containers on a cluster, or logical grouping of EC2 instances. When you run tasks using ECS, you place them on a cluster. Amazon ECS downloads your container images from a registry that you specify, and runs those images on the container instances within your cluster.

Using the container instance draining state

Auto Scaling groups support lifecycle hooks that can be invoked to allow custom processes to finish before instances launch or terminate. For this example, the lifecycle hook invokes a Lambda function that performs two tasks:

  1. Sets the ECS container instance state to DRAINING.
  2. Checks if there are any tasks left on the container instance. If there are running tasks still in process of draining, it posts a message to SNS so that the Lambda function is called again.

Lambda repeats step 2 until there are no tasks running on the container instance OR the heartbeat timeout on the lifecycle hook is reached (set to TTL 15 minutes in the sample CloudFormation template), whichever occurs first. Afterward, control is returned to the Auto Scaling lifecycle hook, and the instance terminates. This process is shown in the following diagram:

Try it out!

Use the CloudFormation template to set up the resources described in this post. This template creates the following resources:

  • The VPC and associated network elements (subnets, security groups, route table, etc.)
  • An ECS cluster, ECS service, and sample ECS task definition
  • An Auto Scaling group with two EC2 instances and a termination lifecycle hook
  • A Lambda function
  • An SNS topic
  • IAM roles for Lambda to execute

Create the CloudFormation stack and then see how this works by triggering an instance termination event.

In the Amazon EC2 console, choose Auto Scaling Groups and select the name of the Auto Scaling group created by CloudFormation (from the resources section of the CloudFormation template).

Select Actions, Edit and update the service to reduce the desired number of instances by “1”. This initiates one of the instances’ termination process.

Select the Auto Scaling group Instances tab; one instance state value should show the lifecycle state “Terminating:Wait”.

This is when the lifecycle hook gets activated and posts a message to SNS. The Lambda function is then executed in response to the SNS message trigger.

The Lambda function changes the ECS container instance state to DRAINING. The ECS service scheduler then stop the tasks on the instance and starts tasks on an available instance.

You can go to the ECS console to confirm that the container instance state is DRAINING.

After the tasks have drained, the Auto Scaling group activity history confirms that the EC2 instance is terminated.

How it works

Take a moment to see the inner workings of the Lambda function. The function first checks to see if the event received has a LifecycleTransition value matching autoscaling:EC2_INSTANCE_TERMINATING.

# If the event received is instance terminating...
if 'LifecycleTransition' in message.keys():
print("message autoscaling {}".format(message['LifecycleTransition']))
if message['LifecycleTransition'].find('autoscaling:EC2_INSTANCE_TERMINATING') > -1:

If there is a match, it proceeds to call the function “checkContainerInstanceTaskStatus”. This function gets the container instance ID of the EC2 instance ID received, and sets the container instance state to ‘DRAINING’.

# Get lifecycle hook name
lifecycleHookName = message['LifecycleHookName']
print("Setting lifecycle hook name {} ".format(lifecycleHookName))

# Check if there are any tasks running on the instance
tasksRunning = checkContainerInstanceTaskStatus(Ec2InstanceId)

It then checks to see if there are tasks running on the instance. If there are tasks, it publishes a message to the SNS topic to trigger the Lambda function again and then exits.

# Use Task ARNs to get describe tasks
descTaskResp = ecsClient.describe_tasks(cluster=clusterName, tasks=listTaskResp['taskArns'])
for key in descTaskResp['tasks']:
print("Task status {}".format(key['lastStatus']))
print("Container instance ARN {}".format(key['containerInstanceArn']))
print("Task ARN {}".format(key['taskArn']))

# Check if any tasks are running
if len(descTaskResp['tasks']) > 0:
print("Tasks are still running..")
return 1
else:
print("NO tasks are on this instance {}..".format(Ec2InstanceId))
return 0

When the Lambda function sees that no more tasks are running on the container instance, it proceeds to complete the lifecycle hook and terminate the EC2 instance.

#Complete lifecycle hook.
try:
response = asgClient.complete_lifecycle_action(
LifecycleHookName=lifecycleHookName,
AutoScalingGroupName=asgGroupName,
LifecycleActionResult='CONTINUE',
InstanceId=Ec2InstanceId)
print("Response = {}".format(response))
print("Completedlifecycle hook action")
except Exception, e:
print(str(e))

Conclusion

Container instance draining simplifies cluster scale-down and operational activities such as new AMI rollouts. For example, with the integration described in this post, you could use CloudFormation and CodePipeline to create a rolling deployment that launches new instances and terminates instances in batches.

To learn more about container instance draining, see the Amazon ECS Developer Guide.

If you have questions or suggestions, please comment below.