Announcing Amazon ECS deployment circuit breaker

Today, we announced the Amazon ECS deployment circuit breaker for EC2 and Fargate compute types. With this feature, Amazon ECS customers can now automatically roll back unhealthy service deployments without the need for manual intervention. This empowers customers to quickly discover failed deployments, while not having to worry about resources being consumed for failing tasks, or indefinite deployment delays.

Previously, when using the rolling update deployment type in Amazon ECS, if the service was unable to reach a healthy state, the scheduler would retry deployments in perpetuity using the service throttling logic. An extra step was required when monitoring deployments to ensure that a deployment failure could be caught in a timely manner. This also was a pain point for customers that deploy Amazon ECS services using AWS CloudFormation.

A deployment can fail for several reasons, such as a breaking change to the code or service configuration, a lack of resources available to reach the desired count, or failing container or load balancer health checks. Deployment failures aren’t limited to these scenarios; they are simply examples of where the deployment circuit breaker can help. For the remainder of this post, we demonstrate the circuit breaker using one example scenario: introducing a failure in the container health check.

What are we deploying?

The demo application we are deploying is a Python Flask web server that shows the current version of the task definition deployed via the ECS service. To create, deploy, and update this service, we will use the AWS CLI. Let’s start by looking at the code, Dockerfile, and task definition to gain a better understanding of what is being deployed.

In our Flask service, we gather the task version from the task metadata endpoint. The website shows which task definition version the ECS service is running. The goal of this app is to demonstrate the circuit breaker and highlight the rollback functionality: after a failed deployment is rolled back, the version shown in the frontend should not change. The Flask code below goes in the flask_app.py file:

#!/usr/bin/env python3

from flask import Flask
from os import getenv
from requests import get

import json

app = Flask(__name__)

def get_service_version():
    #https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-metadata-endpoint-v4.html
    metadata_endpoint = getenv('ECS_CONTAINER_METADATA_URI_V4', None)
    if metadata_endpoint is None:
        return "Metadata endpoint not available"
    else:
        response = get("{}/task".format(metadata_endpoint)).text
        json_response = json.loads(response)
        return "{}:{}".format(json_response['Family'], json_response['Revision'])

@app.route('/')
def hello():
    return """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Circuit Breaker Demo!</title>
</head>
<body>
   <h1> My Amazon ECS Application Demo </h1>
   <p> Current version of the service {} </p>
</body>
</html>
    """.format(get_service_version())

@app.route('/health')
def health():
  return "OK"

if __name__ == '__main__':
    app.run(host='0.0.0.0')
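
To try the parsing step of get_service_version() outside of ECS, the following sketch runs the same logic against a trimmed, made-up sample of a task metadata response. Only the Family and Revision keys come from the code above; the sample values and the helper name format_service_version are assumptions for illustration.

```python
import json

# Hypothetical, heavily trimmed sample of a task metadata (v4) "/task" response.
SAMPLE_TASK_METADATA = '{"Cluster": "CB-Demo", "Family": "circuit-breaker", "Revision": "1"}'


def format_service_version(raw_response: str) -> str:
    """Return "<family>:<revision>" from a task metadata JSON document."""
    metadata = json.loads(raw_response)
    return "{}:{}".format(metadata["Family"], metadata["Revision"])


if __name__ == "__main__":
    print(format_service_version(SAMPLE_TASK_METADATA))
```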

Using the Python base image, we are installing Flask, copying the code into the image, and defining how to run the application on startup.

FROM python:3

EXPOSE 5000

COPY ./flask_app.py /flask_app/app.py

RUN pip install requests flask

WORKDIR /flask_app

CMD ["flask", "run", "--host", "0.0.0.0"]

Prerequisites:

AWS CLI
AWS CDK
Docker for building/pushing container images

Demo:

To start, we will create an ECS cluster with the required VPC and networking, an ECR repository, and the task execution IAM role that allows our Fargate service to pull our ECR image. We use the AWS CDK to define and deploy our environment using Python. This code will reside in a file named app.py. For more information on AWS CDK, please take a look at the documentation.

#!/usr/bin/env python3

from aws_cdk import (
    Stack,
    CfnOutput,
    RemovalPolicy,
    aws_ecs as ecs,
    aws_ecr as ecr,
    aws_iam as iam,
    aws_ec2 as ec2,
    App
)
from constructs import Construct


class CircuitBreakerDemo(Stack):

    def __init__(self, scope: Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        # The code that defines your stack goes here
        ecs_cluster = ecs.Cluster(
            self, "DemoCluster",
            cluster_name="CB-Demo"
        )

        # ECR Image Repo
        ecr_repo = ecr.Repository(self, "ECRRepo",
            repository_name="flask-cb-demo",
            empty_on_delete=True,
            removal_policy=RemovalPolicy.DESTROY
        )

        # IAM task execution role with the ECS managed policy
        task_execution_role = iam.Role(
            self, "TaskExecutionRole",
            role_name="CircuitBreakerDemoRole",
            assumed_by=iam.ServicePrincipal(service="ecs-tasks.amazonaws.com"),
            managed_policies=[
                iam.ManagedPolicy.from_aws_managed_policy_name("service-role/AmazonECSTaskExecutionRolePolicy")
            ]
        )

        security_group = ec2.SecurityGroup(
            self, "WebSecGrp",
            vpc=ecs_cluster.vpc
        )

        security_group.connections.allow_from_any_ipv4(
            port_range=ec2.Port(
                protocol=ec2.Protocol.TCP,
                string_representation="Web Inbound",
                from_port=5000,
                to_port=5000
            ),
            description="Web ingress"
        )

        CfnOutput(
            self, "IAMRoleArn",
            value=task_execution_role.role_arn,
            export_name="IAMRoleArn"
        )

        CfnOutput(
            self, "PublicSubnets",
            value=",".join([x.subnet_id for x in ecs_cluster.vpc.public_subnets]),
            export_name="PublicSubnets"
        )

        CfnOutput(
            self, "SecurityGroupId",
            value=security_group.security_group_id,
            export_name="SecurityGroupId"
        )

        CfnOutput(
            self, "EcrRepoUri",
            value=ecr_repo.repository_uri,
            export_name="EcrRepoUri"
        )


app = App()
CircuitBreakerDemo(app, "circuit-breaker-demo")
app.synth()

To deploy the environment, we will run the following commands:

# Install dependencies
python3 -m venv .env
source .env/bin/activate
pip install aws-cdk-lib constructs
# Deploy environment
cdk deploy --require-approval never --app "python3 app.py"

Next, we will build our Docker image and push it to ECR, create a task definition, and then deploy our ECS service.

export region=$(aws configure get region)
export account_id=$(aws sts get-caller-identity --output text --query Account)
export ECR_REPO=$(aws cloudformation describe-stacks --stack-name circuit-breaker-demo --query 'Stacks[].Outputs[?ExportName == `EcrRepoUri`].OutputValue' --output text)
export ECR_IMAGE="${ECR_REPO}:working"
export EXECUTIONROLEARN=$(aws cloudformation describe-stacks --stack-name circuit-breaker-demo --query 'Stacks[].Outputs[?ExportName == `IAMRoleArn`].OutputValue' --output text)
export SUBNETS=$(aws cloudformation describe-stacks --stack-name circuit-breaker-demo --query 'Stacks[].Outputs[?ExportName == `PublicSubnets`].OutputValue' --output text)
export SECGRP=$(aws cloudformation describe-stacks --stack-name circuit-breaker-demo --query 'Stacks[].Outputs[?ExportName == `SecurityGroupId`].OutputValue' --output text)

# Login to ECR and build/push docker image
aws ecr get-login-password \
  --region $region \
  | docker login \
    --username AWS \
    --password-stdin $account_id.dkr.ecr.$region.amazonaws.com

docker build -t ${ECR_IMAGE} . && docker push ${ECR_IMAGE}

The final step to deploy our container image is to create a task definition and then create the service.

# Create task definition
echo '{
  "containerDefinitions": [
    {
      "name": "cb-demo",
      "image": "$ECR_IMAGE",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 5000,
          "hostPort": 5000,
          "protocol": "tcp"
        }
      ],
      "healthCheck": {
        "retries": 3,
        "command": [
          "CMD-SHELL",
          "curl -f localhost:5000/health || exit 2"
        ],
        "timeout": 5,
        "interval": 5
      }
    }
  ],
  "executionRoleArn": "$EXECUTIONROLEARN",
  "family": "circuit-breaker",
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "networkMode": "awsvpc",
  "cpu": "256",
  "memory": "1024"
}' | envsubst > task_definition.json 

# Register task definition
aws ecs register-task-definition --cli-input-json file://task_definition.json

# Create the service
aws ecs create-service \
  --service-name circuit-breaker-demo \
  --cluster CB-Demo \
  --task-definition circuit-breaker \
  --desired-count 5 \
  --deployment-controller type=ECS \
  --deployment-configuration "maximumPercent=200,minimumHealthyPercent=100,deploymentCircuitBreaker={enable=true,rollback=true}" \
  --network-configuration "awsvpcConfiguration={subnets=[$SUBNETS],securityGroups=[$SECGRP],assignPublicIp=ENABLED}" \
  --launch-type FARGATE \
  --platform-version 1.4.0

Let’s take a moment to examine our create-service command. Looking at the values that we pass into the --deployment-configuration parameter, this is where we are enabling the circuit breaker functionality.

deploymentCircuitBreaker={enable=true,rollback=true}

This setting is instructing ECS to enable the circuit breaker, and upon failure, automatically roll back to the previous healthy version of the service. The deployment circuit breaker is disabled by default, because there may be scenarios where a user wants to handle these failures on their own (via automation or manual intervention). The same goes with automated rollback, as customers may have their own ways of handling failed deployments. When enabling the circuit breaker, both parameters (enable and rollback) are required to be present in the configuration.
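
For readers who script deployments with the AWS SDK rather than the CLI, the same settings can be expressed as a dictionary. This is a minimal sketch, assuming the boto3 ECS client and its deploymentConfiguration argument to update_service; the two helper names are hypothetical and not part of the original walkthrough.

```python
def circuit_breaker_configuration(enable: bool = True, rollback: bool = True) -> dict:
    """Build the deployment configuration used in this demo."""
    return {
        "maximumPercent": 200,
        "minimumHealthyPercent": 100,
        # Both keys are included, mirroring the CLI shorthand above.
        "deploymentCircuitBreaker": {"enable": enable, "rollback": rollback},
    }


def enable_circuit_breaker(cluster: str, service: str) -> None:
    """Apply the configuration to an existing service.

    Requires boto3 and AWS credentials; nothing in this block calls it.
    """
    import boto3  # imported lazily so the builder above works without the SDK

    boto3.client("ecs").update_service(
        cluster=cluster,
        service=service,
        deploymentConfiguration=circuit_breaker_configuration(),
    )


if __name__ == "__main__":
    print(circuit_breaker_configuration())
```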

We are also setting our rolling deployment configuration as follows:

maximumPercent=200,minimumHealthyPercent=100

This directs the scheduler to ensure that the service maintains the desired healthy task count while introducing the new deployment of tasks. To put it simply, we are doubling the task count when deploying to ensure that the desired service count is always met. The percentages may vary depending on the requirements for your application.
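
The arithmetic behind these two percentages can be sketched as follows. The function name is hypothetical, and the rounding (lower limit rounded up, upper limit rounded down) is an assumption based on how the ECS documentation describes these parameters.

```python
def rolling_update_bounds(desired_count: int,
                          minimum_healthy_percent: int,
                          maximum_percent: int) -> tuple:
    """Return (lower, upper) task counts allowed during a rolling update."""
    # Assumed rounding: the lower limit rounds up, the upper limit rounds down.
    lower = -(-desired_count * minimum_healthy_percent // 100)  # ceiling division
    upper = desired_count * maximum_percent // 100              # floor division
    return lower, upper


if __name__ == "__main__":
    # The demo service: 5 tasks, minimumHealthyPercent=100, maximumPercent=200.
    print(rolling_update_bounds(5, 100, 200))
```

With the demo's settings, the service never drops below 5 healthy tasks and may run up to 10 tasks while the new deployment rolls out.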

Once the deployment is complete, let’s grab the public IP address for one of our Fargate tasks.

SERVICE_IP=$(aws ecs list-tasks --cluster CB-Demo --query 'taskArns[0]' --output text | xargs -I {} aws ecs describe-tasks --cluster CB-Demo --tasks {} --query 'tasks[].attachments[].details[?name == `networkInterfaceId`].value[]' --output text | xargs -I {} aws ec2 describe-network-interfaces --network-interface-ids {} --query 'NetworkInterfaces[].Association.PublicIp' --output text)
echo "http://$SERVICE_IP:5000"

After opening up the IP address in the browser, we can see the demo application is up and running as expected. The UI is showing us that the service is running the first revision of the circuit-breaker task definition: circuit-breaker:1. Now let’s have some fun and break deployments!

The use case we are covering is deploying a breaking change to an ECS service. This could be caused by a misconfiguration in the service definition or a change made to the application code. We will introduce a breaking change to our application code. In flask_app.py, let’s modify the health check endpoint to return a 500 response, which will cause the container health check to fail and the tasks to be stopped.

@app.route('/health')
def health():
  return "UNHEALTHY", 500
  #return "OK", 200

Now, we will rebuild the docker image, push it to ECR, update the task definition to point to the broken image, and finally deploy the latest changes to our service running in ECS.

export ECR_IMAGE="${ECR_REPO}:broken"
docker build -t ${ECR_IMAGE} . && docker push ${ECR_IMAGE}

# Create task definition
echo '{
  "containerDefinitions": [
    {
      "name": "cb-demo",
      "image": "$ECR_IMAGE",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 5000,
          "hostPort": 5000,
          "protocol": "tcp"
        }
      ],
      "healthCheck": {
        "retries": 3,
        "command": [
          "CMD-SHELL",
          "curl -f localhost:5000/health || exit 2"
        ],
        "timeout": 5,
        "interval": 5
      }
    }
  ],
  "executionRoleArn": "$EXECUTIONROLEARN",
  "family": "circuit-breaker",
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "networkMode": "awsvpc",
  "cpu": "256",
  "memory": "512"
}' | envsubst > task_definition.json  

# Register task definition
aws ecs register-task-definition --cli-input-json file://task_definition.json

# Update the service and trigger a deployment
aws ecs update-service \
  --service circuit-breaker-demo \
  --cluster CB-Demo \
  --task-definition circuit-breaker \
  --deployment-configuration "maximumPercent=200,minimumHealthyPercent=100,deploymentCircuitBreaker={enable=true,rollback=true}" \
  --desired-count 5

With our latest deployment, our service is pointing to the second revision of the circuit-breaker task definition. We are now going to walk through the output available to us as we watch the circuit breaker take action. First, let’s see what the deployment looks like by running the following command:

aws ecs describe-services --services circuit-breaker-demo --cluster CB-Demo --query 'services[]'

As a part of this launch, we have introduced new service events marking the state change in deployments, as well as a new parameter: rolloutState. This parameter has three service deployment states: IN_PROGRESS, COMPLETED, and FAILED. In the output above, looking at the second deployment under the deployments array, we see the rolloutState of the previous deployment was COMPLETED with the rolloutStateReason noting the deployment completed successfully. Above that, we see that there is a pending deployment as the rolloutState parameter shows IN_PROGRESS. This is the deployment that we just triggered. Also note the difference between the task definitions of each deployment. This is important to track as we progress through the deployment lifecycle.
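
When tracking these fields across repeated describe-services calls, a compact one-line summary per deployment can be easier to read than the full JSON. The sketch below is one way to do that. The sample data is made up (including the account ID and Region in the ARNs) and trimmed to a handful of deployment fields, and the helper name is hypothetical.

```python
# Hypothetical, trimmed example of the "deployments" array from describe-services.
SAMPLE_DEPLOYMENTS = [
    {
        "status": "PRIMARY",
        "taskDefinition": "arn:aws:ecs:us-west-2:111122223333:task-definition/circuit-breaker:2",
        "desiredCount": 5,
        "runningCount": 0,
        "failedTasks": 2,
        "rolloutState": "IN_PROGRESS",
    },
    {
        "status": "ACTIVE",
        "taskDefinition": "arn:aws:ecs:us-west-2:111122223333:task-definition/circuit-breaker:1",
        "desiredCount": 5,
        "runningCount": 5,
        "failedTasks": 0,
        "rolloutState": "COMPLETED",
    },
]


def summarize_deployments(deployments: list) -> list:
    """Return one summary line per deployment."""
    lines = []
    for deployment in deployments:
        # Keep only "family:revision" from the task definition ARN.
        revision = deployment["taskDefinition"].rsplit("/", 1)[-1]
        lines.append("{} {} running={}/{} failed={}".format(
            revision,
            deployment["rolloutState"],
            deployment["runningCount"],
            deployment["desiredCount"],
            deployment["failedTasks"],
        ))
    return lines


if __name__ == "__main__":
    for line in summarize_deployments(SAMPLE_DEPLOYMENTS):
        print(line)
```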

The tasks will fail as the scheduler attempts to launch them, and we will start to see the failedTasks parameter count grow (as seen below). This is the expected state because we deployed a broken container image.

As more tasks fail, the circuit breaker logic will kick in and mark the deployment as FAILED. Let’s dive into what’s happening under the hood to better understand the functionality and how the circuit breaker will reach that FAILED rolloutState.

When a service deployment is triggered (via the UpdateService or CreateService API), the scheduler begins to track and maintain a running count of task launch failures. The circuit breaker consists of two stages, each with its own success and failure criteria. Let’s first break down how we define those criteria:

Success: the deployment shows the potential to transition to a successful, COMPLETED rolloutState.
Failure: the deployment is showing signs of issues, and there is a possibility that a FAILED rolloutState could be reached.

Now that we understand what success and failure look like, let’s look at the stages:

Stage 1: this stage monitors the underlying tasks in the deployment while they transition to a RUNNING state.
- Success: the scheduler will check for any tasks (greater than zero) that have transitioned into a RUNNING state. If any of the tasks for the current deployment are in a RUNNING state, the failure criteria will be skipped, and the circuit breaker will progress to the next stage.
- Failure: checks the count of consecutive failed task launches. This includes any tasks that fail to transition to a RUNNING state. Once the threshold is met, the deployment is marked as FAILED. More to come on how we determine the threshold.
Stage 2: this stage will be triggered only when the stage 1 checks show that one or more tasks in the current deployment are in a RUNNING state. The circuit breaker will check the corresponding health checks for the tasks in the current deployment being evaluated. The health checks included in the validation are: Elastic Load Balancing health checks, AWS Cloud Map service health checks, and container health checks.
- Success: at least one task in a RUNNING state has all of its dependent health checks passing.
- Failure: checks the count of tasks that are replaced due to failed health checks. This count will be checked against the threshold that the circuit breaker has defined.
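
The two stages above can be read as a small decision function. The following is a toy model of that description only, not the scheduler's actual implementation. The function name, its parameters, and the "ON_TRACK" label (used here for "success criteria met") are all hypothetical, and treating "threshold is met" as greater-than-or-equal is an assumption.

```python
def evaluate_circuit_breaker(running_tasks: int,
                             healthy_running_tasks: int,
                             consecutive_launch_failures: int,
                             health_check_replacements: int,
                             threshold: int) -> str:
    """Classify a deployment using the two stages described above."""
    # Stage 1: wait for at least one task to reach RUNNING.
    if running_tasks == 0:
        if consecutive_launch_failures >= threshold:
            return "FAILED"
        return "IN_PROGRESS"

    # Stage 2: at least one task is RUNNING, so look at health checks.
    if healthy_running_tasks > 0:
        return "ON_TRACK"  # hypothetical label, not a real rolloutState
    if health_check_replacements >= threshold:
        return "FAILED"
    return "IN_PROGRESS"


if __name__ == "__main__":
    # Tasks start but keep getting replaced after failing health checks.
    print(evaluate_circuit_breaker(2, 0, 0, 3, threshold=3))
```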

We can see that there is a clear path for the circuit breaker to determine success or failure. The last thing to discuss is how the circuit breaker determines the failure threshold for a service deployment. The formula is straightforward: min <= Desired Count * 0.5 <= max, with min being 3 and max being 200. To put it simply, if the formula calculates a number lower than the minimum, the failure threshold will be 3; conversely, if the formula calculates a number above the maximum, the failure threshold will be set to 200. Fractional results are rounded up, as the 25-task row in the table below shows. Please note that the min and max thresholds are static and cannot be changed at this time. For information on the circuit breaker process, see the ECS documentation. Below is a table with some examples of what the failure threshold would be based on the desired count:

Service Desired Count	Formula	Failure Threshold
1	3 <= 1 * 0.5 <= 200	3 (lower than the minimum)
25	3 <= 25 * 0.5 <= 200	13 (12.5 rounded up)
70	3 <= 70 * 0.5 <= 200	35
100	3 <= 100 * 0.5 <= 200	50
400	3 <= 400 * 0.5 <= 200	200
800	3 <= 800 * 0.5 <= 200	200 (higher than the maximum)
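
The threshold calculation can be sketched in a few lines. The function name is hypothetical, and rounding half-counts up is inferred from the 25-task row in the table rather than stated as a rule in this post.

```python
MIN_THRESHOLD = 3    # static minimum, per the formula above
MAX_THRESHOLD = 200  # static maximum, per the formula above


def failure_threshold(desired_count: int) -> int:
    """Clamp half of the desired count (rounded up) to the range [3, 200]."""
    half_rounded_up = (desired_count + 1) // 2
    return max(MIN_THRESHOLD, min(MAX_THRESHOLD, half_rounded_up))


if __name__ == "__main__":
    for count in (1, 25, 70, 100, 400, 800):
        print(count, failure_threshold(count))
```

For the demo service, which has a desired count of 5, this works out to a failure threshold of 3.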

Now that we have a better understanding of what’s happening under the hood, let’s get back to our demo service deployment.

When we created our service, we enabled automatic rollbacks. In the output shown below, we see that the scheduler caught the failed deployment and triggered a rollback deployment of the previous, successfully deployed version of the service. The rollback deployment’s rolloutState is IN_PROGRESS, while the previous deployment shows a FAILED rolloutState.

Once the rollback is complete, we see the rolloutState transitioned to COMPLETED with 5 running tasks and zero failed tasks.

That’s it! With this demo we deployed a healthy service, introduced a failure, deployed it, and sat back as the scheduler automatically rolled back to the previous version. Just to confirm we are still up and running, let’s grab the IP of a task and see what version we have deployed. We will see that our service rolled back to the circuit-breaker:1 task definition.

This feature is available today, and can be enabled via the AWS CLI, AWS SDK, or AWS CloudFormation. As always, we value our customers’ feedback, so please let us know how the feature is working for you. Feel free to submit any issues or questions to the containers public roadmap on GitHub.

Happy Deploying!
