Containers

Optimize cost for container workloads with ECS capacity providers and EC2 Spot Instances

Amazon EC2 Spot Instances let you use spare Amazon Elastic Compute Cloud (Amazon EC2) capacity at discounts of up to 90% compared to On-Demand prices. Amazon EC2 can interrupt Spot Instances with a two-minute notification when it needs the capacity back. Spot Instances are an ideal option for applications that are stateless, fault-tolerant, scalable, and flexible, such as big data, containerized workloads, continuous integration and delivery (CI/CD), web services, high performance computing (HPC), and development and test workloads.

Spot Instances are a great fit for containers because both are designed to be interruptible and replaceable. Containerized applications are often modern cloud-native applications that are fault-tolerant and can run on Spot Instances with minimal operational effort, making the application more resilient and flexible when Spot Instance interruptions occur.

In this blog post, we will demonstrate how to use Amazon ECS capacity providers with EC2 Spot Instances to optimize both cost and scale with minimal engineering effort. A capacity provider is associated with a cluster and is used in a capacity provider strategy to determine the infrastructure that a task runs on. The capacity provider strategy can be configured to use one or more capacity providers. You can set a default capacity provider strategy for the ECS cluster or specify a custom one for each service. For EC2-based container instances, a capacity provider consists of a name, an Auto Scaling group (ASG), and settings for managed scaling and managed termination protection.

When you enable managed scaling, a capacity provider scales the underlying infrastructure using the Amazon ECS Cluster Auto Scaling (CAS) feature. Once managed scaling is enabled, Amazon ECS creates an AWS Auto Scaling plan with a target tracking scaling policy and associates it with the Auto Scaling group. See this blog post if you want to learn more about how ECS CAS works.
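To make the target tracking concrete, the metric behind CAS, CapacityProviderReservation, compares the number of instances the cluster needs (M) with the number currently running (N). The sketch below uses made-up values and local shell arithmetic only; it is an illustration of the formula, not ECS code:

```shell
# CAS target-tracks CapacityProviderReservation = M / N * 100, where M is the
# number of container instances needed for all tasks (running + provisioning)
# and N is the number currently running. Values here are illustrative.
needed=6     # M: instances required by the workload
running=5    # N: instances currently in the ASG
echo "CapacityProviderReservation=$(( needed * 100 / running ))%"
# 120%: the metric is above the target (100), so the ASG scales out
```

When the metric sits above the target, the Auto Scaling plan adds instances; below the target, it removes idle, unprotected ones.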

One best practice for using Spot Instances is to be flexible with instance types. Consider using different instance families, generations, and Availability Zones to take advantage of multiple spare capacity pools. We recently announced a new feature called attribute-based instance type selection (ABS). With this feature, you select instance types based on a set of attributes instead of picking them manually, which allows the Auto Scaling group to use newer generation instance types as they’re released. To protect you from extreme price differences across instance types, price protection is enabled by default, and you can customize it to your preferred thresholds for Spot and On-Demand Instances.
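As an illustration, these price protection thresholds map to two parameters in the InstanceRequirements block used later in this post; the values shown below are the service defaults as we understand them (100% above the price of the lowest-priced matching instance type for Spot, 20% for On-Demand), so treat this fragment as a sketch to adapt rather than a required setting:

```
"InstanceRequirements": {
    "SpotMaxPricePercentageOverLowestPrice": 100,
    "OnDemandMaxPricePercentageOverLowestPrice": 20
}
```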

As a second best practice, if your workload requires faster scale-out and you have enabled managed scaling, you can over-provision capacity by setting the target capacity in a capacity provider to less than 100%. For example, if you set it to 80%, you will have 20% extra idle capacity available, saving you the time of allocating and setting up a new container instance. With Spot Instances, the extra idle capacity allows faster replacement of the tasks running on an instance that has received an interruption notification.
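The headroom this buys can be estimated with local arithmetic (the numbers below are made up; the ceiling formula reflects how target tracking rounds up to whole instances):

```shell
# With managed scaling, CAS steers CapacityProviderReservation toward
# targetCapacity, so a target below 100% keeps spare instances warm.
target_capacity=80    # capacity provider targetCapacity (percent)
instances_needed=8    # M: instances required by the current tasks
# N = ceil(M * 100 / targetCapacity), computed with integer arithmetic
instances_running=$(( (instances_needed * 100 + target_capacity - 1) / target_capacity ))
echo "running=$instances_running idle=$(( instances_running - instances_needed ))"
# running=10 idle=2
```

In this example, a target capacity of 80% keeps two idle instances ready to receive replacement tasks immediately.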

Spot Instance Interruption Handling

Amazon ECS supports Automated Spot Instance Draining, a capability that reduces service disruptions due to Spot Instance interruptions. If the ECS agent detects an interruption notification at the Instance Metadata Service, it sets the instance state to DRAINING to prevent new tasks from being scheduled for placement on this container instance. Service tasks on the container instance that are in the RUNNING state are stopped and replaced according to the service’s deployment configuration parameters, minimum healthy percent and maximum percent. You can turn on this capability by setting the ECS_ENABLE_SPOT_INSTANCE_DRAINING variable to true in the ECS agent configurations in the user data section of the launch template for container instances. To learn more about graceful shutdowns with ECS, please see this blog post.
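For reference, the notice the agent detects is a small JSON document served by the Instance Metadata Service at /latest/meta-data/spot/instance-action. The sample below uses a hypothetical timestamp and parses a local stand-in with jq, since the real endpoint is only reachable from the instance itself:

```shell
# Sample interruption notice (hypothetical values); on a real instance this
# JSON is returned by IMDS at /latest/meta-data/spot/instance-action, and the
# ECS agent polls it on your behalf when draining is enabled.
notice='{"action":"terminate","time":"2022-01-05T18:02:00Z"}'
echo "$notice" | jq -r '.action, .time'
# terminate
# 2022-01-05T18:02:00Z
```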

To test application resiliency and validate the interruption-handling mechanisms, we recently announced the availability of Spot Instance interruptions simulation in the AWS Fault Injection Simulator. The Spot Instance interruptions that are injected by your AWS Fault Injection Simulator experiments behave in the same way as they do when reclaimed by Amazon EC2; see this blog post to learn more about it.

Our customers commonly ask: “What happens if there is no spare capacity and Spot Instances can’t be launched? Do On-Demand Instances get deployed instead?” The short answer is no; it doesn’t happen automatically. However, it is less likely to happen if you apply the Spot Instance best practices described earlier, such as instance type flexibility. You can also mix Spot and On-Demand capacity within the same ECS service, which we will demonstrate next.

Walkthrough: Running an application using ECS capacity providers with On-Demand and Spot Instances

In this walkthrough, we will run an ECS service for a stateless web application on both On-Demand and Spot Instances using two capacity providers connected to two Auto Scaling groups. The first Auto Scaling group runs with 100% On-Demand Instances, and the second one runs with 100% Spot Instances. By using a custom capacity provider strategy, the service will be configured with two tasks placed on On-Demand Instances using the base parameter and the remaining tasks placed equally between the Spot and On-Demand capacity providers.

Figure 1. Capacity provider strategy controls tasks placement

Prerequisites

You can use AWS CloudFormation to create capacity providers and associate them with an ECS cluster in your infrastructure code, but for this demo, we will be using the AWS Command Line Interface (AWS CLI). Use your favorite terminal to run the commands, or use AWS CloudShell for easier access and to avoid installation of tools on your local machine. If you will not be using AWS CloudShell, install the jq tool following this guide.

For this walkthrough, you should have the following prerequisites:

  1. An AWS account
  2. An Amazon Virtual Private Cloud (Amazon VPC) and two private subnets—Follow this tutorial to create them.
  3. An IAM instance role to allow the EC2 instances to communicate with the ECS cluster—Follow this guide to create it. Make sure to use the same name as in the linked guide (ecsInstanceRole).
  4. The AWS CLI version 2.3.3 or later—Follow this guide to install or upgrade it.

Let’s get started

To prepare for the walkthrough, create an empty directory, name it web-app, and change your current directory to it:

mkdir web-app
cd web-app

1. Create an EC2 Launch Template for the ECS container instances

Start with a simple launch template file that includes the Amazon Machine Image (AMI) ID and the UserData used to initialize each container instance. Use the AWS Systems Manager Parameter Store to get the latest Amazon ECS-optimized AMI ID for these container instances:

# get latest ECS image id
img_id=$(aws ssm get-parameters \
            --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended \
            --query "Parameters[].Value" \
            --output text | jq '.image_id')
echo "img_id=$img_id"
            
# Launch Template file        
cat <<EoF > lt-ecsInstance.json
{
    "LaunchTemplateName": "lt-ecsInstance",
    "VersionDescription": "Launch Template for ECS container instances",
    "LaunchTemplateData": {
        "ImageId": $img_id,
        "UserData": "IyEvYmluL2Jhc2gKZWNobyAiRUNTX0NMVVNURVI9ZWNzLXdlYmFwcCIgPj4gL2V0Yy9lY3MvZWNzLmNvbmZpZwplY2hvICJFQ1NfQkFDS0VORF9IT1NUPSIgPj4gL2V0Yy9lY3MvZWNzLmNvbmZpZwplY2hvICJFQ1NfRU5BQkxFX1NQT1RfSU5TVEFOQ0VfRFJBSU5JTkc9dHJ1ZSIgPj4gL2V0Yy9lY3MvZWNzLmNvbmZpZwplY2hvICJFQ1NfQ09OVEFJTkVSX1NUT1BfVElNRU9VVD05MHMiID4+IC9ldGMvZWNzL2Vjcy5jb25maWcK",
        "IamInstanceProfile": {
            "Name": "ecsInstanceRole"
        },
        "TagSpecifications": [
            {
                "ResourceType": "instance",
                "Tags": [
                    {
                        "Key": "Name",
                        "Value": "ECS Instance"
                    }
                ]
            }
        ]
    }
}
EoF

Then run this command to create the launch template with the updated image ID:

aws ec2 create-launch-template \
    --cli-input-json file://lt-ecsInstance.json

Tip: To give the application more time to shut down gracefully when a Spot Instance is interrupted, we set the Stop Timeout to 90 seconds (the default is 30 seconds) in the ECS agent configuration in the UserData.
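That window only helps if the application reacts to the stop signal. The following is a hypothetical container entrypoint sketch (not from this post's task definition) showing the pattern of trapping SIGTERM so work can drain before the timeout expires and ECS force-kills the container:

```shell
#!/bin/bash
# Hypothetical entrypoint sketch: trap SIGTERM so the application can drain
# in-flight work within the ECS_CONTAINER_STOP_TIMEOUT window (90s here),
# after which ECS sends SIGKILL.
graceful_shutdown() {
  echo "SIGTERM received, draining in-flight work..."
  # ...close connections, flush state...
  exit 0
}
trap graceful_shutdown TERM

# Simulate the stop signal by signaling ourselves; in a real task, the ECS
# agent delivers SIGTERM when it drains the instance after a Spot notice.
kill -TERM $$
```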

Use the following command to verify the UserData attribute in the launch template you just created:

aws ec2 describe-launch-template-versions \
    --launch-template-name lt-ecsInstance \
    --output json | jq -r '.LaunchTemplateVersions[].LaunchTemplateData.UserData' \
    | base64 --decode

It’s a Base64-encoded string that should include the following configurations:

#!/bin/bash
echo "ECS_CLUSTER=ecs-webapp" >> /etc/ecs/ecs.config
echo "ECS_BACKEND_HOST=" >> /etc/ecs/ecs.config
echo "ECS_ENABLE_SPOT_INSTANCE_DRAINING=true" >> /etc/ecs/ecs.config
echo "ECS_CONTAINER_STOP_TIMEOUT=90s" >> /etc/ecs/ecs.config

2. Create Auto Scaling groups and capacity providers for EC2 Spot and On-Demand Instances

Before creating the capacity providers, you first need to create the EC2 Auto Scaling groups. Use attribute-based instance type selection to select all the instance types that match the requirements (2–4 vCPUs, 4–8 GiB of memory) from the current instance type generations. Scaling with Amazon ECS Cluster Auto Scaling (CAS) is fastest when using the same instance sizes; however, as mentioned earlier, instance type flexibility is a key best practice for using Spot Instances. Therefore, to allow faster scaling, you could create multiple capacity providers with multiple Auto Scaling groups, each with a different instance size. For this demo, however, select large and xlarge instance sizes with a 1:2 vCPU-to-memory ratio.

To preview matching instance types, use this command and update excluded instance types accordingly:

aws ec2 get-instance-types-from-instance-requirements \
        --architecture-types x86_64 \
        --virtualization-types hvm \
        --instance-requirements "VCpuCount={Min=2,Max=4},\
        MemoryMiB={Min=4096,Max=8192},\
        CpuManufacturers=intel,amd,\
        BurstablePerformance=included,\
        InstanceGenerations=current,\
        ExcludedInstanceTypes=t2*,r*,d*,g*,i*,z*,x*"

Once you’re happy with the matched instance types, make sure to update the instance requirements in the following configuration file.

Now, create a configuration file for the Spot Auto Scaling group:

# Auto Scaling group configurations file
export OD_PERCENTAGE=0 
export ASG_NAME=asg-spot

# <- copy from here when repeating this command
cat <<EoF > $ASG_NAME.json
{
    "MixedInstancesPolicy": {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "lt-ecsInstance",
                "Version": "1"
            },
            "Overrides": [
                {
                    "InstanceRequirements": {
                        "VCpuCount": {
                            "Min": 2,
                            "Max": 4
                        },
                        "MemoryMiB": {
                            "Min": 4096,
                            "Max": 8192
                        },
                        "BurstablePerformance": "included",
                        "InstanceGenerations": [
                            "current"
                        ],
                        "CpuManufacturers": [
                            "intel",
                            "amd"
                        ],
                        "ExcludedInstanceTypes": [
                            "t2*","r*","d*","g*","i*","z*","x*"
                        ]
                    }
                }
            ]
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 0,
            "OnDemandPercentageAboveBaseCapacity": $OD_PERCENTAGE,
            "SpotAllocationStrategy": "capacity-optimized"
        }
    },
    "DesiredCapacity": 0
}
EoF

And another configuration file for the On-Demand Auto Scaling group:

export OD_PERCENTAGE=100
export ASG_NAME=asg-od

# Repeat the previous cat command to create a configuration file for the On-Demand ASG

Replace the subnet_id_1 and subnet_id_2 parameters in the create-auto-scaling-group commands with ones from your AWS account.

Tip: Use this command to list subnets in your AWS account in the current region:

aws ec2 describe-subnets \
    --query 'Subnets[*].[VpcId,AvailabilityZone,SubnetId]'

aws autoscaling create-auto-scaling-group \
    --cli-input-json file://asg-spot.json \
    --auto-scaling-group-name asg-spot --min-size 0 --max-size 5 \
    --new-instances-protected-from-scale-in \
    --vpc-zone-identifier "subnet_id_1,subnet_id_2"

arn_spot=$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names asg-spot --output text \
    --query 'AutoScalingGroups[0].AutoScalingGroupARN')

echo "$arn_spot"

aws autoscaling create-auto-scaling-group \
    --cli-input-json file://asg-od.json \
    --auto-scaling-group-name asg-od --min-size 0 --max-size 5 \
    --new-instances-protected-from-scale-in \
    --vpc-zone-identifier "subnet_id_1,subnet_id_2"

arn_od=$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names asg-od --output text \
    --query 'AutoScalingGroups[0].AutoScalingGroupARN')

echo "$arn_od"

Now that the Auto Scaling groups have been created, create two capacity providers and associate them with the two Auto Scaling groups:

aws ecs create-capacity-provider \
    --name "cp-spot" \
    --auto-scaling-group-provider "autoScalingGroupArn=$arn_spot,\
    managedScaling={status=ENABLED,targetCapacity=100,minimumScalingStepSize=1,\
    maximumScalingStepSize=100},managedTerminationProtection=ENABLED"
    
aws ecs create-capacity-provider \
    --name "cp-od" \
    --auto-scaling-group-provider "autoScalingGroupArn=$arn_od,\
    managedScaling={status=ENABLED,targetCapacity=100,minimumScalingStepSize=1,\
    maximumScalingStepSize=100},managedTerminationProtection=ENABLED"

Before moving to the next step, verify the two capacity providers have been created successfully:

aws ecs describe-capacity-providers --capacity-providers cp-spot cp-od

3. Create an ECS cluster and attach both capacity providers

Run the following command to create a new ECS cluster, associate the capacity providers with it, and add a default capacity provider strategy to it:

aws ecs create-cluster \
    --cluster-name ecs-webapp \
    --capacity-providers cp-spot cp-od \
    --settings name=containerInsights,value=enabled \
    --default-capacity-provider-strategy capacityProvider=cp-od,weight=1

If you create new ECS services and tasks that don’t specify a custom strategy, they will be running on On-Demand capacity by default.

4. Register a task definition and deploy an ECS service

Create a task definition to configure the container image, resources, and network (see Amazon ECS Task Definitions for more details):

# ECS Task definition file
cat <<EoF > web-task.json
{
  "containerDefinitions": [
    {
      "name": "nginx-task",
      "image": "nginx:latest",
      "memory": 512,
      "cpu": 256,
      "essential": true,
      "portMappings": [
        {
          "containerPort": 80,
          "protocol": "tcp"
        }
      ]
    }
  ],
  "volumes": [],
  "networkMode": "awsvpc",
  "family": "nginx"
}
EoF

aws ecs register-task-definition \
    --cli-input-json file://web-task.json

Now deploy an ECS service using the previous task definition in the same subnets you used with the Auto Scaling groups:

subnets=$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names asg-spot --output text \
    --query 'AutoScalingGroups[].VPCZoneIdentifier|[0]')
    
echo "$subnets"

aws ecs create-service \
    --capacity-provider-strategy capacityProvider=cp-od,base=2,weight=1 \
    capacityProvider=cp-spot,weight=1 \
    --cluster ecs-webapp \
    --service-name srv-web-app \
    --task-definition nginx \
    --desired-count 10 \
    --network-configuration "awsvpcConfiguration={subnets=[$subnets]}"

With the custom capacity provider strategy, the tasks should be distributed as illustrated in Figure 1. The desired number of tasks for this service is 10, the strategy base has been set to 2, and the weight has been set to 1:1 for Spot and On-Demand Instances. So 6 tasks should be running on On-Demand Instances and 4 tasks should be running on Spot Instances.
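This distribution can be sanity-checked with local arithmetic; the snippet below is a sketch of the base-then-weights placement logic, not the scheduler's actual implementation:

```shell
# The base tasks go to cp-od first; the remainder splits by weight (1:1 here).
desired=10; base=2
remaining=$(( desired - base ))
od_tasks=$(( base + remaining / 2 ))
spot_tasks=$(( remaining - remaining / 2 ))
echo "cp-od=$od_tasks cp-spot=$spot_tasks"
# cp-od=6 cp-spot=4
```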

To confirm the expected outcome, use this script to list each task along with the capacity provider it’s using:

for task in $(aws ecs list-tasks --cluster ecs-webapp \
    --service-name srv-web-app | jq -r '.taskArns[]'); do
  aws ecs describe-tasks --cluster ecs-webapp --tasks "$task" \
    | jq -r '.tasks[] | .taskArn, .capacityProviderName'
done
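If you prefer a summary instead of one line per task, you could aggregate the output by capacity provider. The snippet below runs against a stand-in JSON sample (a hypothetical, trimmed-down describe-tasks response) so it can be tried without an AWS account:

```shell
# Count tasks per capacity provider; the sample stands in for the real
# describe-tasks API response.
sample='{"tasks":[{"capacityProviderName":"cp-od"},
                  {"capacityProviderName":"cp-spot"},
                  {"capacityProviderName":"cp-od"}]}'
echo "$sample" | jq -r '[.tasks[].capacityProviderName]
    | group_by(.)[] | "\(.[0]): \(length)"'
# cp-od: 2
# cp-spot: 1
```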

Cleanup

This concludes the walkthrough; make sure to clean up the resources created in this exercise to avoid any unnecessary charges.

First, delete the ECS service and both capacity providers:

aws ecs delete-service \
    --cluster ecs-webapp \
    --service srv-web-app --force

aws ecs delete-capacity-provider --capacity-provider cp-spot

aws ecs delete-capacity-provider --capacity-provider cp-od

Then delete both Auto Scaling groups and the launch template:

aws autoscaling delete-auto-scaling-group \
    --auto-scaling-group-name asg-spot \
    --force-delete

aws autoscaling delete-auto-scaling-group \
    --auto-scaling-group-name asg-od \
    --force-delete

aws ec2 delete-launch-template --launch-template-name lt-ecsInstance

And lastly, delete the task definition and the ECS cluster:

aws ecs deregister-task-definition --task-definition nginx:1

aws ecs delete-cluster --cluster ecs-webapp

If you created the IAM instance role (ecsInstanceRole) and want to delete it, please follow this guide.

Conclusion

In this blog post, we discussed how to use ECS capacity providers along with EC2 Spot Instances to run your containerized workload at a significant cost savings and with minimal operational overhead. We described Spot Instances best practices, explained how to reduce workload disruptions and increase service availability, and demonstrated running an application with ECS capacity providers using a mix of On-Demand and Spot Instances to ensure application resiliency and optimize compute costs.

If you want to learn more about what we’re working on for containers or have any requests, please visit the AWS Containers roadmap on GitHub.

Finally, if you want to learn more about EC2 Spot Instances and how to use them in different types of workloads, check out the Amazon EC2 Spot Instances Workshop site.

Ahmed Nada

Ahmed is a Sr. Specialist Solutions Architect for EC2 Spot at AWS. He helps customers optimize their compute costs by using EC2 Spot Instances, AWS Graviton and EC2 launch services. Ahmed tweets at @AhmedNadaz

Jayaprakash Alawala

Jayaprakash Alawala is a Sr. Container Specialist Solutions Architect at AWS. He helps customers modernize their applications and build large-scale applications using various AWS services. He has expertise in containers, microservices, DevOps, security, and cost optimization, including EC2 Spot, as well as technical training. Outside of work, he loves spending time reading and traveling. You can reach him on Twitter at @JP_Alawala.