Containers

Managing compute for Amazon ECS clusters with capacity providers.

Customers running containers often face the challenge of managing and scaling the compute for their clusters. For customers taking advantage of Amazon Elastic Container Service (Amazon ECS) on AWS Fargate, that burden is lifted: the underlying compute layer is fully managed by AWS, enabling the customer to focus on their applications. The challenge is more pronounced for customers that choose to run their tasks on self-managed EC2 instances as the compute layer. Determining how to scale the compute layer in tandem with the tasks adds complexity for teams to manage and maintain, and that operational focus takes time away from solving the problems for your business and your customers. This is where capacity providers help to shift the focus from managing the underlying capacity to the application.

It’s been a little over a year since we released capacity providers along with cluster auto scaling for Amazon ECS, and we’ve made a lot of improvements along the way. If you haven’t heard of capacity providers, let’s start with a quick primer on what they are and how they may benefit you. In short, capacity providers manage the infrastructure that the tasks in your clusters run on. The main idea is to shift the operational burden of compute away from developers and operators, enabling them to focus on the application. We call this approach “application first”.

With the application first approach, Amazon ECS takes on the heavy lifting of managing capacity. If you’re familiar with ECS terminology, you’re probably aware that when we talk about launching a task, we associate that task with a launch type. The launch type determines where your tasks get deployed: Fargate or EC2. With launch types, the choice is binary; with capacity providers, you can spread your tasks across multiple capacity providers, enabling greater flexibility in how you run your containers.

When we talk about capacity providers, we will often reference strategies. A capacity provider strategy determines how tasks are spread across the cluster’s capacity providers, which means you aren’t tied to one way of launching your tasks. To see the shift, consider launching tasks with a launch type: if we want to ensure our tasks run on specific nodes based on requirements like Spot or GPU-enabled instances, we need to use task placement constraints. With capacity providers, placement is driven by the strategy instead; for example, we could schedule a group of tasks to be spread across Fargate and Fargate Spot, or across EC2 On-Demand and EC2 Spot. We will dive into this momentarily, but prior to doing so it’s helpful to understand the compute options for capacity providers and what they mean. For more details around capacity provider concepts, check out the documentation.
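
To make this concrete, here is a minimal sketch of a strategy that splits tasks between Fargate and Fargate Spot (this is the same CapacityProviderStrategy property used in the CloudFormation examples later in this post; the Base and Weight values here are arbitrary):

  CapacityProviderStrategy:
    - CapacityProvider: FARGATE
      Base: 1          # the first task always runs on On-Demand Fargate
      Weight: 1        # after the base, 1 of every 4 tasks runs on Fargate
    - CapacityProvider: FARGATE_SPOT
      Weight: 3        # ...and 3 of every 4 run on Fargate Spot

With this strategy and nine desired tasks, one task satisfies the base on Fargate, and the remaining eight split one to three: two more on Fargate and six on Fargate Spot.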

If Fargate is your choice for compute, you don’t have to think about scaling or patching the underlying EC2 instances, as each task is placed on its own compute and managed by AWS. The focus shifts to your application and what it takes to scale that application to meet the demand of your user base. Around the time we released capacity providers, we also announced the availability of Fargate Spot, which offers savings of up to 70% off the price of On-Demand Fargate tasks. As of today, there are two default capacity providers that are available without any configuration required: FARGATE and FARGATE_SPOT. Check out this tech talk from the AWS Cloud Containers Conference where we dive deeper into these concepts.

If EC2 is the preferred method of compute for your clusters, you may be familiar with the complexity of determining the best way to autoscale your EC2 instances in and out to meet the demand of the tasks as they scale. This is where cluster auto scaling via capacity providers has helped our customers shift the focus from infrastructure first to application first. There are several benefits to this approach; let’s list a few:

  • Scale to zero for compute that is unused
  • Split your tasks across multiple capacity provider strategies with different EC2 instance profiles and cost strategies
  • Deploy to a new capacity provider with an updated AMI gradually, or all at once (see the sketch below)

While there are still components to consider (such as AMI management), you gain greater control of your compute layer without having to think about the mechanisms required to autoscale the infrastructure.
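
To illustrate the third benefit above, a gradual AMI rollout can be expressed as a weighted strategy across two capacity providers. In this sketch, OldAmiCapacityProvider and NewAmiCapacityProvider are hypothetical capacity providers backed by Auto Scaling groups whose launch templates use the current and updated AMI, respectively:

  CapacityProviderStrategy:
    - CapacityProvider: !Ref OldAmiCapacityProvider   # hypothetical: ASG using the current AMI
      Weight: 3
    - CapacityProvider: !Ref NewAmiCapacityProvider   # hypothetical: ASG using the updated AMI
      Weight: 1

On each subsequent service update, you shift weight from the old provider to the new one until all tasks run on the new AMI, at which point the old capacity provider’s Auto Scaling group can scale back to zero.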

Now that we’ve gone through the basics of capacity providers, let’s cover the improvements we’ve made over the past year and some of the ways our customers are taking advantage of the feature.

AWS CloudFormation support

One of the most requested features from customers was full CloudFormation support for capacity providers. With our latest enhancements, we now have full support for managing and deploying capacity providers via CloudFormation. Let’s walk through deploying a cluster and capacity providers in CloudFormation, and discuss a couple of examples of ways to use capacity providers. Below is the full template, which we will break down and walk through.

AWSTemplateFormatVersion: 2010-09-09
Parameters:
  LatestAmiId:
    Type: 'AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>'
    Default: '/aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id'
  ECSInstanceRole:
    Type: String
    Default: "ecsInstanceRole"
Resources:
  LaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        ImageId: !Ref LatestAmiId
        InstanceType: "t3.medium"
        IamInstanceProfile: 
          Name: !Ref ECSInstanceRole
        UserData:
          Fn::Base64: !Sub |
            #!/bin/bash -xe
            echo ECS_CLUSTER=${ECSCluster} >> /etc/ecs/ecs.config
  AutoScalingGroup1:
    Type: "AWS::AutoScaling::AutoScalingGroup"
    Properties:
      AvailabilityZones:
        Fn::GetAZs: !Ref "AWS::Region"
      HealthCheckGracePeriod: 60
      LaunchTemplate:
        LaunchTemplateId: !Ref LaunchTemplate
        Version: !GetAtt LaunchTemplate.LatestVersionNumber
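      # Instance scale-in protection is required for the capacity provider's ManagedTerminationProtection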
      NewInstancesProtectedFromScaleIn: true
      MaxSize: "10"
      MinSize: "0"
      DesiredCapacity: "0"
  AutoScalingGroup2:
    Type: "AWS::AutoScaling::AutoScalingGroup"
    Properties:
      AvailabilityZones:
        Fn::GetAZs: !Ref "AWS::Region"
      HealthCheckGracePeriod: 60
      LaunchTemplate:
        LaunchTemplateId: !Ref LaunchTemplate
        Version: !GetAtt LaunchTemplate.LatestVersionNumber
      NewInstancesProtectedFromScaleIn: true
      MaxSize: "10"
      MinSize: "0"
      DesiredCapacity: "0"
  CapacityProvider1:
    Type: "AWS::ECS::CapacityProvider"
    Properties:
      AutoScalingGroupProvider:
        AutoScalingGroupArn: !Ref AutoScalingGroup1
        ManagedScaling:
          Status: ENABLED
        ManagedTerminationProtection: ENABLED
  CapacityProvider2:
    Type: "AWS::ECS::CapacityProvider"
    Properties:
      AutoScalingGroupProvider:
        AutoScalingGroupArn: !Ref AutoScalingGroup2
        ManagedScaling:
          Status: ENABLED
        ManagedTerminationProtection: ENABLED
  ECSCluster:
    Type: 'AWS::ECS::Cluster'
  ClusterCPAssociation:
    Type: "AWS::ECS::ClusterCapacityProviderAssociations"
    Properties:
      Cluster: !Ref ECSCluster
      CapacityProviders:
        - FARGATE
        - FARGATE_SPOT
        - !Ref CapacityProvider1
        - !Ref CapacityProvider2
      DefaultCapacityProviderStrategy:
          - CapacityProvider: FARGATE
            Base: 1
            Weight: 0
          - CapacityProvider: FARGATE_SPOT
            Weight: 1
  ECSTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      RequiresCompatibilities:
        - "EC2"
      Cpu: '512'
      Memory: '1024'
      ContainerDefinitions:
        - Name: "CapacityProvidersDemo"
          Image: public.ecr.aws/nginx/nginx:latest
          PortMappings:
            - ContainerPort: 80
  ECSDemoService: 
    Type: AWS::ECS::Service
    Properties: 
      Cluster: !Ref ECSCluster
      DesiredCount: 5
      TaskDefinition: !Ref ECSTaskDefinition
      CapacityProviderStrategy:
        - CapacityProvider: !Ref CapacityProvider1
          Base: 1
          Weight: 1
        - CapacityProvider: !Ref CapacityProvider2
          Weight: 1

Let’s break this down to understand what we are deploying. In the snippet below, we create two capacity providers that take advantage of managed cluster auto scaling (as seen by ManagedScaling being ENABLED), referencing the Auto Scaling groups that we created earlier in the template. Note that ManagedTerminationProtection can only be ENABLED because the Auto Scaling groups set NewInstancesProtectedFromScaleIn to true; managed termination protection relies on instance scale-in protection so that ECS can keep instances running tasks from being terminated during scale-in.

  CapacityProvider1:
    Type: "AWS::ECS::CapacityProvider"
    Properties:
      AutoScalingGroupProvider:
        AutoScalingGroupArn: !Ref AutoScalingGroup1
        ManagedScaling:
          Status: ENABLED
        ManagedTerminationProtection: ENABLED
  CapacityProvider2:
    Type: "AWS::ECS::CapacityProvider"
    Properties:
      AutoScalingGroupProvider:
        AutoScalingGroupArn: !Ref AutoScalingGroup2
        ManagedScaling:
          Status: ENABLED
        ManagedTerminationProtection: ENABLED
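
The ManagedScaling block also accepts optional tuning parameters beyond Status. As a sketch (the values below are illustrative, not recommendations), TargetCapacity sets how much spare headroom managed scaling keeps in the Auto Scaling group, and MaximumScalingStepSize caps how many instances launch in a single scaling action:

  CapacityProvider1:
    Type: "AWS::ECS::CapacityProvider"
    Properties:
      AutoScalingGroupProvider:
        AutoScalingGroupArn: !Ref AutoScalingGroup1
        ManagedScaling:
          Status: ENABLED
          TargetCapacity: 90          # illustrative: keep ~10% spare capacity; 100 means no headroom
          MaximumScalingStepSize: 5   # illustrative: launch at most 5 instances per scaling action
        ManagedTerminationProtection: ENABLED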

Next, we create an ECS cluster and associate the capacity providers with it. Two of the capacity providers we attach, FARGATE and FARGATE_SPOT, are available “out of the box”; the other two are the Auto Scaling group-backed capacity providers we discussed above. We then set the default capacity provider strategy for the cluster, which determines how tasks are placed when they are launched without a launch type or capacity provider strategy specified. This goes back to the “application first” mindset: those who are deploying their tasks and services don’t have to think about compute capacity, and can simply define how they want their containers to run in the task definition and service configuration. Finally, you may notice the base and weight for each capacity provider in the default strategy. In the example template, we ensure that we always have one On-Demand Fargate task running for baseline stability, with every task launched after the base using Fargate Spot.

  ECSCluster:
    Type: 'AWS::ECS::Cluster'
  ClusterCPAssociation:
    Type: "AWS::ECS::ClusterCapacityProviderAssociations"
    Properties:
      Cluster: !Ref ECSCluster
      CapacityProviders:
        - FARGATE
        - FARGATE_SPOT
        - !Ref CapacityProvider1
        - !Ref CapacityProvider2
      DefaultCapacityProviderStrategy:
          - CapacityProvider: FARGATE
            Base: 1
            Weight: 0
          - CapacityProvider: FARGATE_SPOT
            Weight: 1

The outcome of this deployment is an ECS cluster with all of the capacity providers associated as expected. In the Auto Scaling groups that we created in the template, we set both the minimum size and the desired capacity to zero, ensuring that EC2 instances are only launched when needed and don’t sit idle. This highlights just one of the many powerful features that come with capacity providers.

Now that we have our capacity providers created and associated with our cluster, let’s look at an example service deployment where we set the capacity provider strategy explicitly instead of relying on the cluster default.

  ECSTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      RequiresCompatibilities:
        - "EC2"
      Cpu: '512'
      Memory: '1024'
      ContainerDefinitions:
        - Name: "CapacityProvidersDemo"
          Image: public.ecr.aws/nginx/nginx:latest
          PortMappings:
            - ContainerPort: 80
  ECSDemoService: 
    Type: AWS::ECS::Service
    Properties: 
      Cluster: !Ref ECSCluster
      DesiredCount: 5
      TaskDefinition: !Ref ECSTaskDefinition
      CapacityProviderStrategy:
        - CapacityProvider: !Ref CapacityProvider1
          Base: 1
          Weight: 1
        - CapacityProvider: !Ref CapacityProvider2
          Weight: 1

In the example above, we explicitly set our capacity provider strategy. As mentioned earlier, you don’t have to do this if you have a default strategy set for your cluster. It’s important to understand that when you rely on the default, you have to ensure that your task definition and service definition are compatible with the compute being used. Had we not set a launch type or capacity provider strategy in our service definition, the cluster would fall back to the default capacity provider strategy. As we saw earlier, the cluster’s default strategy splits tasks across Fargate and Fargate Spot, so after the base of 1 is met on Fargate, the rest of the tasks would be launched using Fargate Spot.
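
For comparison, here is a sketch of a service that relies on the cluster default strategy instead. Because our cluster’s default launches tasks on Fargate and Fargate Spot, the task definition must be Fargate-compatible (awsvpc network mode) and the service needs a network configuration; the subnet ID below is a placeholder:

  FargateTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      RequiresCompatibilities:
        - "FARGATE"
      NetworkMode: awsvpc
      Cpu: '512'
      Memory: '1024'
      ContainerDefinitions:
        - Name: "DefaultStrategyDemo"
          Image: public.ecr.aws/nginx/nginx:latest
  ECSDefaultStrategyService:
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref ECSCluster
      DesiredCount: 10
      TaskDefinition: !Ref FargateTaskDefinition
      NetworkConfiguration:
        AwsvpcConfiguration:
          Subnets:
            - subnet-0123456789abcdef0   # placeholder: replace with a subnet in your VPC

With no launch type or capacity provider strategy set, these ten tasks follow the cluster default: one on Fargate (the base) and nine on Fargate Spot.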

Capacity providers with blue/green deployments

In addition to improving our CloudFormation coverage for capacity providers, we recently announced capacity provider support when using the AWS CodeDeploy deployment controller with ECS. This means that regardless of the compute options you choose in your capacity provider strategy, you can use the CodeDeploy deployment controller. For customers using EC2 as the underlying compute with capacity providers (with cluster auto scaling enabled), you don’t have to worry about scaling the underlying infrastructure during a deployment. In other words, if there is not enough EC2 capacity to support the deployment, capacity providers will scale out the instances to meet the need and then scale them back in.
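
As a sketch of how this looks in CloudFormation (omitting the load balancer, listener, and CodeDeploy resources a complete blue/green setup requires), the service declares the CODE_DEPLOY deployment controller alongside its capacity provider strategy; BlueTargetGroup is a hypothetical target group resource:

  ECSBlueGreenService:
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref ECSCluster
      DesiredCount: 5
      TaskDefinition: !Ref ECSTaskDefinition
      DeploymentController:
        Type: CODE_DEPLOY              # blue/green deployments orchestrated by AWS CodeDeploy
      CapacityProviderStrategy:
        - CapacityProvider: !Ref CapacityProvider1
          Weight: 1
      LoadBalancers:
        - ContainerName: "CapacityProvidersDemo"
          ContainerPort: 80
          TargetGroupArn: !Ref BlueTargetGroup   # hypothetical: target group for the blue task set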

Cluster auto scaling improvements

In addition to CloudFormation support and blue/green deployments via AWS CodeDeploy with capacity providers, we also added some enhancements to cluster autoscaling. For a deep dive into how cluster autoscaling works with Amazon ECS, check out the Deep Dive on Amazon ECS Cluster Auto Scaling.

One of the more notable enhancements we announced was faster, more responsive scaling with cluster auto scaling. This enhancement addressed more advanced use cases, such as Auto Scaling groups that span multiple Availability Zones and/or have multiple instance types defined. Prior to this announcement, the scaling logic would fall back to step scaling in these scenarios.

Lastly, we added support for specifying a custom instance warm-up time to enable more responsive scaling. The instance warm-up period is the time, in seconds, before a newly launched EC2 instance can contribute to the CloudWatch metrics for the Auto Scaling group. The default value is 300 seconds, which means scaling remains on hold until those metrics come in and ECS can make the next scaling determination. With this parameter, you have more control over how fast and responsive the auto scaling is.
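
Here is how that might look on one of the capacity providers from our template (a sketch; the 60-second value is illustrative and should reflect how long your instances actually take to boot and register with the cluster):

  CapacityProvider1:
    Type: "AWS::ECS::CapacityProvider"
    Properties:
      AutoScalingGroupProvider:
        AutoScalingGroupArn: !Ref AutoScalingGroup1
        ManagedScaling:
          Status: ENABLED
          InstanceWarmupPeriod: 60   # count new instances toward metrics after 60s instead of the 300s default
        ManagedTerminationProtection: ENABLED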

Wrapping up

At the beginning of this blog, we talked about some of the challenges customers face when running containers at scale, specifically around managing the compute for their clusters. With capacity providers, customers can offload the heavy lifting of self-managing the auto scaling of the cluster and leverage multiple strategies when deploying their tasks. Whether you are utilizing Fargate or EC2 for compute, capacity providers are the standard way to define compute for your tasks and services running on ECS. We are continuing to make improvements on behalf of our customers, and as always, if there are features or issues that you’d like to share, please check out our public roadmap to find existing feature requests or to submit a new one.