Automate Patching by Replacing Amazon ECS Container Instances
Containerization becoming popular for application lifecycle
In the retail industry, more and more developers are using containerization as the primary software lifecycle for applications and services. There are countless benefits to this approach, and it shifts the focus of configuration and runtime management to the container itself. This means that there’s much less configuration and setup at the host level. However, you still need to consider patching and security at the host.
This blog post describes how Rue Gilt Groupe (RGG), a premier off-price ecommerce company comprising Rue La La, Gilt, Gilt City, and Shop Premium Outlets, implemented an automated solution to patch and replace the hosts that run the company’s containerized applications, with zero downtime and minimal impact to the containers running within its clusters.
Infrastructure and setup
Before we dive into how we implemented the automated patching, let’s discuss the underlying architecture, which runs on Amazon Web Services (AWS). With Amazon Elastic Container Service (Amazon ECS), a fully managed container orchestration service, customers can determine the optimal compute level for their containers to run in a cluster. An Amazon ECS cluster can run containers on instances from Amazon Elastic Compute Cloud (Amazon EC2), which offers secure and resizable compute capacity for virtually any workload, on AWS Fargate, a serverless, pay-as-you-go compute engine, or on both.
For users who choose Amazon EC2 as the underlying compute for their clusters, we recommend managing the instances within an Amazon EC2 Auto Scaling group, where you can add or remove compute capacity to meet changes in demand. Managing cluster autoscaling on your own is difficult: you need to carefully monitor your compute capacity so that you scale up and down at precise times to meet demand fluctuations. With Amazon ECS capacity providers and cluster autoscaling, Amazon ECS instead scales the underlying Amazon EC2 instances to meet the capacity of the desired task counts. With capacity providers, you can scale to a reservation percentage target and queue containers for launch when additional capacity becomes available. We will use both technologies to automate the instance replacement process. Below is a sample architecture using the best practices and guidelines that we have described.
- An ECS Cluster
- AWS CLI installed with appropriate permissions if you want to execute each step from command line
- Docker environment
In this example, we have three Amazon EC2 Auto Scaling groups that provide hosts to the cluster. Amazon ECS can use these groups to run containers. Each one uses a different instance type and operating system, so each can handle different types of workloads.
You can configure Amazon ECS services to use specific capacity providers based on the different capacity provider strategy options. We highly recommend configuring container health checks by attaching to a target group that has its own health checks or by configuring the health checks in the task definition itself.
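As an illustration of a capacity provider strategy, the following minimal sketch builds the strategy list and passes it to the Boto3 ECS client's create_service call. The helper names, the provider names, and the weights are hypothetical and are not taken from RGG's setup; the `ecs` argument is assumed to be a client created with boto3.client("ecs").

```python
def build_capacity_strategy(weights):
    """Turn {"provider-name": weight} into the list shape that ECS expects.

    Hypothetical helper: provider names and weights are illustrative only.
    """
    return [
        {"capacityProvider": name, "weight": weight, "base": 0}
        for name, weight in sorted(weights.items())
    ]


def create_service_on_providers(ecs, cluster, service, task_definition, count, weights):
    """Create a service whose tasks are spread across the given capacity providers.

    `ecs` is assumed to be a Boto3 ECS client, e.g. boto3.client("ecs").
    """
    return ecs.create_service(
        cluster=cluster,
        serviceName=service,
        taskDefinition=task_definition,
        desiredCount=count,
        capacityProviderStrategy=build_capacity_strategy(weights),
    )
```

With weights of 3 and 1, for example, roughly three of every four tasks would be placed on the first provider. Health checks are configured separately, either on the target group or in the task definition.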
Managing hosts through Amazon EC2 Auto Scaling groups offers many benefits. To automate instance replacements as a strategy for patching, you can update the Amazon Machine Image (AMI) ID in your launch configuration (or launch template), and the Amazon EC2 Auto Scaling group will launch new instances with the new AMI ID. Because AWS regularly updates AMIs with security patches and software updates, you can update the Amazon EC2 Auto Scaling group’s launch template with a new version of the same AMI and then replace all of the instances running in that group.
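A minimal sketch of that launch template update with Boto3 might look like the following. The function names are hypothetical, the clients are assumed to come from boto3.client("autoscaling") and boto3.client("ec2"), and the sketch assumes the group references a launch template directly and tracks its default version.

```python
def current_launch_template_id(autoscaling, asg_name):
    """Return the ID of the launch template attached to an Auto Scaling group.

    Assumption: the group uses a plain launch template, not a launch
    configuration or a mixed instances policy.
    """
    groups = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"]
    if not groups:
        raise ValueError(f"Auto Scaling group not found: {asg_name}")
    template = groups[0].get("LaunchTemplate")
    if template is None:
        raise ValueError(f"{asg_name} has no launch template attached")
    return template["LaunchTemplateId"]


def roll_launch_template(ec2, template_id, new_ami_id):
    """Create a template version that differs only in its AMI, then make it the default."""
    created = ec2.create_launch_template_version(
        LaunchTemplateId=template_id,
        SourceVersion="$Latest",  # copy every other setting from the latest version
        LaunchTemplateData={"ImageId": new_ami_id},
    )
    version = created["LaunchTemplateVersion"]["VersionNumber"]
    # DefaultVersion is passed as a string in the EC2 API.
    ec2.modify_launch_template(
        LaunchTemplateId=template_id, DefaultVersion=str(version)
    )
    return version
```

Instances that are already running keep their old AMI; only instances launched after the change pick up the new one, which is why the hosts still have to be replaced.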
Automating Amazon EC2 Instance replacement
Now that we’ve explained the general infrastructure and configurations that make up an Amazon ECS cluster, we can use the infrastructure to implement an automated host replacement process. We explain the step-by-step process below. It relies heavily on AWS APIs built into both the Amazon ECS and Amazon EC2 Auto Scaling group services.
- For a given Amazon ECS cluster, run the “aws ecs list-container-instances” command and use the returned list of container instance ARNs as input to the “aws ecs describe-container-instances” command. Store the Amazon EC2 instance ID, container instance ARN, and running task count for each instance.
- Then, run the “aws ecs describe-clusters” command and use the returned list of capacity providers as input to the “aws ecs describe-capacity-providers” command. Store the Amazon EC2 Auto Scaling group name associated with each capacity provider.
- Then, for each stored Amazon EC2 Auto Scaling group, run the “aws autoscaling describe-auto-scaling-groups” command to get information about the launch template currently attached to the group. Once you have that, you can run the “aws ec2 describe-launch-template-versions” command to get the AMI currently in use. Finally, you can use the “aws ec2 describe-images” command to retrieve the platform and architecture of the AMI.
- Using a preconfigured map of AMI name patterns (or SSM parameters with AMI IDs), you can look up the latest AMI that matches the current AMI and create a new version of the existing launch template with this new AMI using the “aws ec2 create-launch-template-version” command. We use the tags that the Auto Scaling group sets on the instances to match an AMI from the map. From there, you can set the new version as the launch template’s default version, which the Amazon EC2 Auto Scaling group will then use.
- Once that is complete, you can replace the hosts one by one using the instance information gathered in step 1.
a. First, detach the instance from its Amazon EC2 Auto Scaling group using the “aws autoscaling detach-instances” command, without decrementing the desired capacity. This allows the Amazon EC2 Auto Scaling group to launch a replacement, but it does not yet remove the instance from the Amazon ECS cluster. You can do the rest of the work while the Amazon EC2 Auto Scaling group replaces this host.
b. Then, run the “aws ecs update-container-instances-state” command and set the status to DRAINING. This begins the process of replacing any containers that are currently running on the draining host.
c. Then wait for the host to drain completely by querying how many tasks are still running on it. When an Amazon ECS container instance is put into a draining state:
- The scheduler will no longer schedule tasks on that host.
- The scheduler will gracefully stop the tasks on the host.
- For services, the scheduler will meet the desired state of the service and count of tasks required by that service. This means that it will reschedule tasks from the draining host to another host based on the deployment configuration parameters.
- For standalone ad-hoc type tasks, the scheduler will wait until the tasks are complete and they exit on their own.
d. This will take a while for a few reasons:
- For containers attached to target groups, there is a deregistration delay, which is configurable on the target group.
- It takes time for new containers to get up and running in a healthy state. This is a safety mechanism so that you don’t kill containers until others take their place.
- If nonservice containers are running, the host will simply wait until those containers stop and complete before moving on.
e. Finally, once the instance has drained and is no longer running containers, you can safely shut it down using the “aws ec2 terminate-instances” command.
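Sub-steps a through e for a single host could be sketched with Boto3 as follows. This is an illustrative sketch, not the reference implementation: the function names, the polling interval, and the timeout are assumptions, and the three clients are assumed to be created with boto3.client("ecs"), boto3.client("autoscaling"), and boto3.client("ec2").

```python
import time


def running_tasks(ecs, cluster, container_instance_arn):
    """Return how many tasks are still running on one container instance."""
    resp = ecs.describe_container_instances(
        cluster=cluster, containerInstances=[container_instance_arn]
    )
    return resp["containerInstances"][0]["runningTasksCount"]


def replace_instance(ecs, autoscaling, ec2, cluster, asg_name, instance_id,
                     container_instance_arn, poll_seconds=30,
                     timeout_seconds=3600, sleep=time.sleep):
    """Detach, drain, wait, and terminate one host (hypothetical helper)."""
    # a. Detach without decrementing capacity so the group launches a replacement.
    autoscaling.detach_instances(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ShouldDecrementDesiredCapacity=False,
    )
    # b. Ask ECS to move tasks off the host.
    ecs.update_container_instances_state(
        cluster=cluster,
        containerInstances=[container_instance_arn],
        status="DRAINING",
    )
    # c/d. Poll until no tasks remain; the timeout value is an assumption.
    waited = 0
    while running_tasks(ecs, cluster, container_instance_arn) > 0:
        if waited >= timeout_seconds:
            raise TimeoutError(f"{instance_id} did not drain in {timeout_seconds}s")
        sleep(poll_seconds)
        waited += poll_seconds
    # e. The host is empty, so it is safe to terminate it.
    ec2.terminate_instances(InstanceIds=[instance_id])
```

The `sleep` parameter exists only so that the wait can be replaced in tests. If the timeout is reached, the instance is left detached and draining so that an operator can investigate.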
You should repeat steps 3–5 for each Amazon EC2 Auto Scaling group configured as a capacity provider in the given Amazon ECS cluster. You will replace every host inside of each Amazon EC2 Auto Scaling group. But be sure to wait patiently until all containers are replaced safely. Overall, this can take a long time depending on your number of hosts and containers and how long it takes to deregister old containers and run new, healthy ones.
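The discovery work in steps 1 and 2 could be sketched with Boto3 as follows. The function names and the returned dictionary layout are hypothetical, and `ecs` is assumed to be a client created with boto3.client("ecs"). The sketch also assumes that each Auto Scaling group is reported either by its ARN or by its short name.

```python
def list_cluster_instances(ecs, cluster):
    """Step 1: collect instance ID, ARN, and running task count for every host."""
    arns = []
    kwargs = {"cluster": cluster}
    while True:  # follow nextToken until every page has been read
        page = ecs.list_container_instances(**kwargs)
        arns.extend(page["containerInstanceArns"])
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token

    instances = []
    for start in range(0, len(arns), 100):  # describe accepts up to 100 ARNs per call
        resp = ecs.describe_container_instances(
            cluster=cluster, containerInstances=arns[start:start + 100]
        )
        for item in resp["containerInstances"]:
            instances.append({
                "instance_id": item["ec2InstanceId"],
                "arn": item["containerInstanceArn"],
                "running_tasks": item["runningTasksCount"],
            })
    return instances


def capacity_provider_groups(ecs, cluster):
    """Step 2: map each capacity provider name to its Auto Scaling group name."""
    clusters = ecs.describe_clusters(clusters=[cluster])["clusters"]
    names = clusters[0].get("capacityProviders", []) if clusters else []
    if not names:
        return {}
    providers = ecs.describe_capacity_providers(
        capacityProviders=names
    )["capacityProviders"]
    groups = {}
    for provider in providers:
        asg = provider.get("autoScalingGroupProvider")
        if asg is None:
            continue  # Fargate providers have no Auto Scaling group
        # The group name follows "autoScalingGroupName/" in the ARN; a short
        # name without that marker is kept unchanged.
        groups[provider["name"]] = asg["autoScalingGroupArn"].split(
            "autoScalingGroupName/", 1
        )[-1]
    return groups
```

An outer loop would then iterate over the group names returned by `capacity_provider_groups` and, within each group, over the hosts returned by `list_cluster_instances`, replacing one host at a time.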
Here is the reference code for the process to automatically replace Amazon EC2 instances. This implementation uses Python and Boto3 and runs inside a Docker container. It can run anywhere, but you should not run it on a cluster instance that will be replaced as part of this automation, because the process might never move past the host on which it’s running. This kind of container is a great use case for AWS Fargate, because a Fargate task can run within the same cluster that the script is patching without depending on any of the instances being replaced.
Put this automated process to work for your business
The robust features built into Amazon ECS, Amazon EC2, and Amazon EC2 Auto Scaling groups help you to orchestrate patching to automate the process of replacing hosts. If you need help implementing this process or you want to discuss your specific automation needs, contact your AWS account team today.