Rolling EC2 AMI updates with capacity providers in Amazon ECS

When deploying containers to Amazon Elastic Container Service (Amazon ECS), customers have choices as to what level of management they want or need to have over the cluster compute. First there is AWS Fargate, which is a serverless compute engine that removes the need for customers to provision and manage servers. This approach simplifies the user experience as AWS manages the compute that your tasks run on, enabling teams to focus on their applications, and not the underlying infrastructure needed to run them. If you’re looking for a low management approach, Fargate is the recommended option. For customers looking for more control over the compute that ECS tasks get scheduled on, they can use EC2 instances for the compute inside their clusters. Usually these customers have specific needs or reasons when choosing EC2, such as tasks requiring GPU support, Windows or ARM based workloads, or simply the desire to have control over their nodes in the cluster. As a general rule of thumb, I always advise customers to start with the path that will enable them to spend more time on their core business applications, and less time on managing resources.

But of course every use case is unique with its own set of requirements, so let’s look at how we can offload some of the complexities that come with managing EC2 instance updates for your ECS clusters. If you aren’t familiar with capacity providers, check out this blog where we dive deeper into the feature. Capacity providers offer a more customizable interface for scheduling tasks onto the compute layer of a cluster. Whether it’s Fargate, Fargate Spot, or EC2, capacity provider strategies provide more advanced functionality. One of the most notable features of capacity providers is the built-in cluster autoscaling for EC2. Cluster autoscaling will scale your EC2 infrastructure in and out based on the capacity required by your services and tasks, as long as they are using a capacity provider strategy. This solves a challenge for cluster operators, who previously had to manage the scaling of the EC2 instances themselves to support the desired task counts. For more information on cluster autoscaling with capacity providers, check out the documentation.

Autoscaling based on demand is most certainly a big reason why customers use capacity providers with EC2, but another lesser known benefit is to use capacity providers as the mechanism for rotating newly patched instances (AMIs) into your clusters, and rolling out the old ones. This is because you can take advantage of spreading across capacity provider strategies when deploying your services and tasks. You can slowly deploy your tasks to a new capacity provider with a newer instance type in a canary style fashion by using a mixed strategy, or deploy your tasks all at once by flipping over to the latest capacity provider running the latest EC2 AMI. In this blog, I am going to demo how to achieve this. We’re going to define our environment and services as code using the AWS Cloud Development Kit (CDK), and we will show how to migrate our EC2 backed services from an x86 AMI to an Arm based Graviton AMI.

Build and push our application image

Let’s start by creating our Dockerfile and application. For this demo, I wrote a simple Python-based API that responds with the Linux system architecture, as reported by uname. Below is a look at the application and Dockerfile.

Application code:

#!/usr/bin/env python3

from flask import Flask
import os

app = Flask(__name__)

@app.route('/')
def index():
    return f"{{ OS Architecture: {os.uname().machine} }}"

if __name__ == '__main__':
    app.run(host='0.0.0.0')

Dockerfile:

FROM public.ecr.aws/bitnami/python:3.7

EXPOSE 5000

WORKDIR /

COPY ./python_app.py /app.py

RUN pip install flask

CMD ["flask", "run", "--host", "0.0.0.0"]

To verify everything works, I’m going to build the Docker image and run it locally to see the output.

# Build image
docker build -t osarch:latest .

# Run the docker image as a container!
docker run --rm -d -p 8080:5000 --name osarch osarch:latest

Now I’ll run a curl against localhost:8080 and we will see a JSON-like string returned with the architecture info. I’m running this on a Cloud9 instance hosted on an x86 architecture, so I expect to see that in the response.
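When the same check is repeated across many tasks later on, it can help to reduce the response body to just the machine name. The sketch below assumes the body has exactly the shape produced by the application above (`{ OS Architecture: <machine> }`); the helper name `parse_architecture` is hypothetical and not part of the demo repository.

```python
import re

# Matches the body returned by the demo app, e.g. "{ OS Architecture: x86_64 }".
_ARCH_PATTERN = re.compile(r"\{\s*OS Architecture:\s*(\S+)\s*\}")


def parse_architecture(body: str) -> str:
    """Return the machine name embedded in the demo app's response body.

    Hypothetical helper: it only understands the response format of the
    demo application shown above.
    """
    match = _ARCH_PATTERN.search(body)
    if match is None:
        raise ValueError(f"unexpected response body: {body!r}")
    return match.group(1)
```

On an x86 host the extracted value would be `x86_64`; on 64-bit Arm Linux hosts uname typically reports `aarch64`.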

The application is running locally in Docker and looks good, so now I’m going to push this to an Amazon Elastic Container Registry (Amazon ECR) public repository to share with the world! So let’s create our repo, log in, and push the image.

In the AWS Management Console, under “Elastic Container Registry”, I will select Public and then Create repository.


In the create screen, I’m going to leave the defaults and name my repository, then select Create repository.

The last step after we create the repository is to push our image. There is a notification pop up that will guide me on how to get my image built and pushed to ECR public. Let’s run those commands now.

# Docker login to my registry in Amazon ECR
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/f0j5z9b5

# Since we built the image earlier, let's tag it following the name and tag of our public repository
docker tag osarch:latest public.ecr.aws/f0j5z9b5/osarch:latest

# Push our image
docker push public.ecr.aws/f0j5z9b5/osarch:latest

Now we’re ready to build an environment and deploy our container as an ECS service.

Build the environment

As mentioned earlier, we’re going to use the AWS CDK to define our environment as well as our service definition for our container to run on ECS using Amazon EC2 as the compute for our tasks to be scheduled on. We’ll get started by initializing our CDK application.

cdk init --language python

Now that we have our initialized CDK app, let’s define our environment using Python. We’ll need a VPC, an EC2 autoscaling group, a capacity provider, and our ECS cluster and service. I’m going to take advantage of the higher-level constructs provided by the CDK, which create a lot of the boilerplate components using recommended, well-architected practices. If you aren’t familiar with the AWS CDK and how the construct levels work, check out the documentation for more information.

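# Imports assume CDK v1, which matches the cdk.Construct type used below.
from aws_cdk import (
    core as cdk,
    aws_autoscaling as autoscaling,
    aws_ec2 as ec2,
    aws_ecs as ecs,
)


class CPDemo(cdk.Stack):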

    def __init__(self, scope: cdk.Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Creating a VPC and ECS Cluster
        vpc = ec2.Vpc(self, "VPC")

        cluster = ecs.Cluster(
            self, "ECSCluster",
            vpc=vpc
        )

        # Autoscaling group with x86_64 architecture and associated Capacity Provider
        autoscaling_group = autoscaling.AutoScalingGroup(
            self, "ASG",
            vpc=vpc,
            instance_type=ec2.InstanceType('t3.medium'),
            machine_image=ecs.EcsOptimizedImage.amazon_linux2(
                hardware_type=ecs.AmiHardwareType.STANDARD
            ),
            min_capacity=0,
            max_capacity=100
        )
        
        capacity_provider = ecs.AsgCapacityProvider(
            self, "CapacityProvider",
            auto_scaling_group=autoscaling_group,
        )
        
        cluster.add_asg_capacity_provider(capacity_provider)
               
        # Building out our ECS task definition and service
        task_definition = ecs.Ec2TaskDefinition(self, "TaskDefinition")
        
        task_definition.add_container(
            "DemoApp",
            image=ecs.ContainerImage.from_registry('public.ecr.aws/f0j5z9b5/osarch:latest'),
            cpu=256,
            memory_limit_mib=512
        )
        
        ecs_service = ecs.Ec2Service(
            self, "DemoEC2Service",
            cluster=cluster,
            task_definition=task_definition,
            desired_count=10,
            capacity_provider_strategies=[
                ecs.CapacityProviderStrategy(
                    capacity_provider=capacity_provider.capacity_provider_name,
                    weight=1,
                    base=0
                )
            ]
        )

Aside from the resources to deploy our environment, I want to point out how we’re going from a container image on my laptop to a long-running service in Amazon ECS. First, we’re defining our task definition, which tells ECS what my container needs in order to run. I am sticking with the defaults for this example, but the task definition is where I can further customize and tune how my containers run. Next, I am adding a container to the task definition, which points to the Docker image that we just pushed to ECR and sets resource specifications such as memory and CPU. This is the desired state for launching our container, but it’s merely the definition; I have yet to define how I want my container to be scheduled onto the cluster. This is where the ECS service comes in. In the service definition, I am defining the cluster and task definition, as well as the desired count of tasks I want to run. I don’t have to set this value, and it’s a better practice to apply autoscaling to your service so your application can respond to demand automatically, as opposed to having a human do it. Lastly, we are setting our capacity provider strategy, which defines how we want the scheduler to place our tasks onto the compute layer. For more information on strategies, see the documentation.

There is more application code that ensures we can exec into our ECS tasks; check out the GitHub repo to see it. Now, let’s get our environment and application deployed.

# Create our python virtual environment and install packages
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Deploy our environment and service!
cdk deploy --require-approval never

Once the deployment is complete, we should see some CloudFormation outputs that we will use to confirm that our application is working as we expect. The commands will grab our active tasks and then we will loop over those tasks and run a curl against each of the containers. Here’s the output after the deployment:

Here is the output after I run the commands:
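For readers who want to see the shape of that loop, here is a minimal Python sketch of the two steps described above: list the service’s tasks, then exec into each one and curl the application. It shells out to the AWS CLI (`aws ecs list-tasks` and `aws ecs execute-command`). The cluster and service names are hypothetical placeholders (the real values come from the CloudFormation outputs), the port assumes Flask’s default of 5000 inside the container, and curl is assumed to be available in the image.

```python
import json
import subprocess
from typing import List

# Hypothetical placeholders; substitute the values from your stack outputs.
CLUSTER = "my-demo-cluster"
SERVICE = "my-demo-service"
CONTAINER = "DemoApp"  # container name used in the task definition above


def list_tasks_command(cluster: str, service: str) -> List[str]:
    """Build the CLI call that returns the service's task ARNs as JSON."""
    return [
        "aws", "ecs", "list-tasks",
        "--cluster", cluster,
        "--service-name", service,
        "--query", "taskArns",
        "--output", "json",
    ]


def exec_curl_command(cluster: str, task_arn: str, container: str) -> List[str]:
    """Build the CLI call that curls the app from inside one task."""
    return [
        "aws", "ecs", "execute-command",
        "--cluster", cluster,
        "--task", task_arn,
        "--container", container,
        "--interactive",
        "--command", "curl -s localhost:5000",
    ]


def check_all_tasks(cluster: str, service: str, container: str) -> None:
    """List the running tasks, then exec into each one in turn."""
    listing = subprocess.run(
        list_tasks_command(cluster, service),
        check=True, capture_output=True, text=True,
    )
    for task_arn in json.loads(listing.stdout):
        subprocess.run(exec_curl_command(cluster, task_arn, container), check=True)
```

Calling `check_all_tasks(CLUSTER, SERVICE, CONTAINER)` requires AWS credentials, the Session Manager plugin for the AWS CLI, and ECS Exec enabled on the service.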

As you can see, we’ve deployed our application to Amazon ECS and we were able to exec into each container and confirm the architecture that it’s running on! As happens in any fast-moving environment, my team has discovered that we could take advantage of running on Arm architectures using Graviton instances and see considerable price and performance improvements. Rather than updating my current autoscaling group, I’d much prefer to take advantage of the flexibility that capacity providers offer. Let’s see how we can migrate our service from x86 to Arm.

First and foremost, we need to rebuild our image to support Arm architectures. To do this, I’m going to use Docker buildx, which enables me to build Arm images on my x86 machine. I’m going to skip over how to set up buildx and jump right into the multi-arch image build.

docker buildx build --platform linux/amd64,linux/arm64 -t public.ecr.aws/f0j5z9b5/osarch:latest --push .

The output:
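One way to confirm that a pushed tag covers both architectures is to inspect its manifest list, for example with `docker manifest inspect <image>`, which prints JSON. The sketch below parses that JSON and collects the `os/architecture` pairs. The helper name and the sample document are illustrative assumptions: the sample is hand-written with placeholder digests and is not real output from the demo repository.

```python
import json
from typing import Set


def platforms_in_manifest_list(manifest_json: str) -> Set[str]:
    """Collect "os/architecture" strings from a manifest list document."""
    document = json.loads(manifest_json)
    platforms = set()
    for entry in document.get("manifests", []):
        platform = entry.get("platform", {})
        os_name = platform.get("os")
        architecture = platform.get("architecture")
        if os_name and architecture:
            platforms.add(f"{os_name}/{architecture}")
    return platforms


# Hand-written sample in the manifest list format (digests are placeholders).
SAMPLE_MANIFEST_LIST = """
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
  "manifests": [
    {"digest": "sha256:<amd64-digest>",
     "platform": {"architecture": "amd64", "os": "linux"}},
    {"digest": "sha256:<arm64-digest>",
     "platform": {"architecture": "arm64", "os": "linux"}}
  ]
}
"""

REQUIRED_PLATFORMS = {"linux/amd64", "linux/arm64"}
missing = REQUIRED_PLATFORMS - platforms_in_manifest_list(SAMPLE_MANIFEST_LIST)
```

An empty `missing` set means both platforms requested in the buildx command are present in the manifest list.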

Before I move on and deploy, I have a Graviton EC2 instance running where I will test and confirm that the same code works when using an Arm architecture.

As we can see, our container is running on my Graviton test instance, so let’s create a new capacity provider that uses Graviton. We’re going to add the following code to create a new autoscaling group that uses a Graviton instance type, create a capacity provider that points to the new autoscaling group, and finally associate the capacity provider with the cluster.

# Replacement ASG with new instance type. 
autoscaling_group_arm = autoscaling.AutoScalingGroup(
    self, "ASGArm",
    vpc=vpc,
    instance_type=ec2.InstanceType('t4g.medium'),
    machine_image=ecs.EcsOptimizedImage.amazon_linux2(
        hardware_type=ecs.AmiHardwareType.ARM
    ),
    min_capacity=0,
    max_capacity=100
)

capacity_provider_arm = ecs.AsgCapacityProvider(
    self, "CapacityProviderArm",
    auto_scaling_group=autoscaling_group_arm,
)

cluster.add_asg_capacity_provider(capacity_provider_arm)

Notice that in our autoscaling group configuration we are setting the AMI to use Arm and we are using the t4g.medium instance type. Next we will update the capacity provider strategy in our service definition. This is where you can get really creative depending on the use case and how sensitive the application may be to change. I’m going to take a more conservative approach and split the deployment across the old autoscaling group and the new. I am also going to set a base of five for the old group to ensure that if something doesn’t work as I expect with the new tasks, I will have that base set of tasks running on the previous hardware.

ecs_service = ecs.Ec2Service(
    self, "DemoEC2Service",
    cluster=cluster,
    task_definition=task_definition,
    desired_count=10,
    capacity_provider_strategies=[
        ecs.CapacityProviderStrategy(
            capacity_provider=capacity_provider.capacity_provider_name,
            weight=0,
            base=5
        ),
        ecs.CapacityProviderStrategy(
            capacity_provider=capacity_provider_arm.capacity_provider_name,
            weight=1
        )
    ]
)

This service configuration will ensure the following on my next deployment:

Prior to deploying tasks to the Arm instances, the scheduler will schedule the first five tasks to the x86 based instances from the original capacity provider.
Once the base of five is met, the scheduler will schedule the remaining tasks onto the hosts behind the Arm capacity provider.
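To make the base and weight arithmetic concrete, here is a small sketch that models the two steps above in plain Python. It is a simplified model for illustration, not the ECS scheduler itself: the type and function names are hypothetical, the rounding rule (largest remainder) is my own assumption, and the real scheduler also accounts for placement constraints and available capacity.

```python
from typing import Dict, List, NamedTuple


class StrategyItem(NamedTuple):
    capacity_provider: str
    weight: int = 0
    base: int = 0


def distribute_tasks(desired_count: int, strategy: List[StrategyItem]) -> Dict[str, int]:
    """Model how a task count splits across a capacity provider strategy."""
    placement = {item.capacity_provider: 0 for item in strategy}
    remaining = desired_count

    # Step 1: satisfy the base first. (ECS allows a base on only one
    # provider per strategy; the loop simply tolerates any input.)
    for item in strategy:
        from_base = min(item.base, remaining)
        placement[item.capacity_provider] += from_base
        remaining -= from_base

    # Step 2: split what is left in proportion to the weights.
    total_weight = sum(item.weight for item in strategy)
    if remaining == 0 or total_weight == 0:
        return placement

    fractions = []
    for item in strategy:
        whole, fraction = divmod(remaining * item.weight, total_weight)
        placement[item.capacity_provider] += whole
        fractions.append((fraction, item.capacity_provider))

    # Assumed rounding rule: leftover tasks go to the largest remainders.
    leftover = desired_count - sum(placement.values())
    for _, name in sorted(fractions, key=lambda pair: pair[0], reverse=True)[:leftover]:
        placement[name] += 1
    return placement


mixed = distribute_tasks(10, [
    StrategyItem("x86", weight=0, base=5),
    StrategyItem("arm", weight=1),
])
```

With the strategy from the service definition above, `mixed` comes out as five tasks on the x86 provider and five on the Arm provider. Raising the weight on the x86 entry would shift part of the remaining five back to the old hardware.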

Now we can redeploy the changes, and run the code to exec into our newly deployed tasks to see the result.

cdk deploy --require-approval never

On completion of the deployment, let’s run the commands to capture the tasks and check the architecture of each task. The result will look like this:

As we can see from the above image, of the ten tasks that we deployed, five are running on x86 hosts and five running on Arm hosts. At this point I feel confident that I can migrate the remaining tasks over to the new capacity provider. The first step we’ll take is to remove the capacity provider running x86 nodes from the service definition. We’ll do that and then rerun the deployment by running cdk deploy.

ecs_service = ecs.Ec2Service(
    self, "DemoEC2Service",
    cluster=cluster,
    task_definition=task_definition,
    desired_count=10,
    capacity_provider_strategies=[
        ecs.CapacityProviderStrategy(
            capacity_provider=capacity_provider_arm.capacity_provider_name,
            weight=1
        )
    ]
)

What we are doing here is instructing our ECS service to use only the Arm-based capacity provider, which will trigger a redeployment of the tasks. At this point, we should only see Arm nodes running our ECS tasks for this service. The other thing that I haven’t pointed out yet is that the cluster is managing the autoscaling of the EC2 instances behind the scenes. As more tasks get scheduled onto the new capacity provider, ECS automatically scales up the instance count to meet the demand of the service. Conversely, for the capacity provider running the x86 nodes, which no longer has any tasks running, ECS will scale those EC2 instances in to zero. Note that scaling EC2 instances in is slower than scaling out; in general, this is a good practice because it ensures we don’t prematurely terminate EC2 instances.
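The scale-out and scale-in behavior described here is driven by a metric that ECS publishes for each capacity provider, CapacityProviderReservation. As I understand the documented behavior, it is the ratio of instances needed (M) to instances running (N), expressed as a percentage and compared against a target capacity of 100 by default. The sketch below is a simplified model under that assumption; the function names are hypothetical, and the real scaling policy adds cooldowns and step limits.

```python
def capacity_provider_reservation(instances_needed: int, instances_running: int) -> float:
    """Model the CapacityProviderReservation metric as M / N * 100.

    Special cases when no instances are running, as described in the ECS
    cluster auto scaling documentation: 100 if nothing is needed, 200 if
    at least one instance is needed.
    """
    if instances_running == 0:
        return 100.0 if instances_needed == 0 else 200.0
    return instances_needed / instances_running * 100


def scaling_direction(reservation: float, target_capacity: float = 100.0) -> str:
    """Compare the metric to the target to see which way the ASG moves."""
    if reservation > target_capacity:
        return "scale out"
    if reservation < target_capacity:
        return "scale in"
    return "steady"


# Old x86 provider: three instances running, none needed any more.
x86_direction = scaling_direction(capacity_provider_reservation(0, 3))

# New Arm provider: tasks are waiting but no instances exist yet.
arm_direction = scaling_direction(capacity_provider_reservation(3, 0))
```

In this model the x86 provider reports a reservation of 0 and scales in, while the Arm provider reports 200 and scales out, which matches the behavior described above. The instance counts are made-up numbers for illustration.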

OK, so for the last time we will grab our task IDs and exec into them to see what architecture our container is running on.

That’s it! For more information as well as the code used to deploy this environment, check out this repo.

Wrapping up

In this blog, we highlighted some of the benefits of using ECS capacity providers and walked through how to use them to migrate ECS services and tasks to newly updated EC2 instances. In the demo, we took a more cautious approach by splitting the tasks across the new and old capacity providers; however, there is a lot of flexibility based on what makes the most sense for your use case. For example, we could deploy all at once by scheduling all of the tasks onto the new capacity provider and relying on the ECS deployment circuit breaker to roll back in the case of failure.
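As a rough illustration of that all-at-once variant, the sketch below builds the service properties in the shape that CloudFormation’s AWS::ECS::Service resource uses: a single-entry capacity provider strategy plus a deployment circuit breaker with rollback enabled. The function name and the capacity provider name are hypothetical. In the CDK code from this post, the equivalent change would, to my knowledge, be passing `circuit_breaker=ecs.DeploymentCircuitBreaker(rollback=True)` to `ecs.Ec2Service`.

```python
from typing import Any, Dict


def all_at_once_service_properties(capacity_provider_name: str, desired_count: int) -> Dict[str, Any]:
    """Build AWS::ECS::Service-style properties for an all-at-once cutover."""
    return {
        "DesiredCount": desired_count,
        # Every task goes to the new capacity provider.
        "CapacityProviderStrategy": [
            {"CapacityProvider": capacity_provider_name, "Weight": 1, "Base": 0}
        ],
        # If the new tasks fail to stabilize, ECS stops the deployment and
        # rolls back to the last completed one.
        "DeploymentConfiguration": {
            "DeploymentCircuitBreaker": {"Enable": True, "Rollback": True}
        },
    }


# "arm-capacity-provider" is a placeholder name for illustration only.
properties = all_at_once_service_properties("arm-capacity-provider", 10)
```

The trade-off compared with the mixed strategy used in the demo is speed versus blast radius: the cutover finishes in one deployment, but no tasks stay on the old hardware while the new ones are being validated.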

As always, we want to hear feedback from our customers, so reach out via our roadmap or message me on Twitter @realadamjkeller.
