
A deep dive into Amazon ECS task health and task replacement

Introduction

Amazon Elastic Container Service (Amazon ECS) is a container orchestration service that manages the lifecycle of billions of application containers on AWS every week. One of the core goals of Amazon ECS is to remove operational burden from human operators. Amazon ECS watches over your application containers 24/7, and can respond to unexpected changes faster and better than any human can. Amazon ECS reacts to undesired changes, such as application crashes and hardware failures, by continuously attempting to self-heal your application container deployments back to your desired state. External factors, such as traffic spikes that cause an application to brown out, can be even more challenging to handle. This post dives deep into recent changes to how Amazon ECS handles task health issues and task replacement, and how these changes increase the availability of your Amazon ECS orchestrated applications.

Task health evaluation

Amazon ECS evaluates the health of a task based on a few criteria:

  1. First, for a task to be healthy, all containers that are marked as essential must be running. Every Amazon ECS task must have at least one essential container. As a best practice, a container runs a single application process, and if that process ends because of a critical runtime exception, then the container stops. If that stopped container was marked as essential, then the entire task is considered to be unhealthy and the task must be replaced.
  2. You can use the Amazon ECS task definition to configure an optional internal health check command that the Amazon ECS agent runs inside the container periodically. This command is expected to return a zero exit code to indicate success; a non-zero exit code indicates failure and the container is considered unhealthy. An unhealthy essential container makes the whole task unhealthy, which causes Amazon ECS to replace the task (see the task definition sketch after this list).
  3. You can use the Amazon ECS service to configure attachments between your application container and other AWS services. For example, you can connect your container deployment to Elastic Load Balancing (ELB) or AWS Cloud Map. These services perform their own external health checks. For example, ELB periodically attempts to open a connection to your container and send a test request. If it isn’t possible to open that connection, your container returns an unexpected response, or your container takes too long to respond, then the ELB considers the target container to be unhealthy. Amazon ECS also considers this external health status when deciding whether an Amazon ECS task is healthy or unhealthy. An unhealthy ELB health check causes the task to be replaced (see the target group sketch further below).
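To make the first two criteria concrete, here is a minimal sketch of registering a task definition with an essential container and an agent-run health check command, using the AWS SDK for Python (boto3). The family name, image URI, and /healthz endpoint are hypothetical placeholders, not values from this post.

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical task definition: one essential container with an
# agent-run health check command (criteria 1 and 2 above).
ecs.register_task_definition(
    family="web-app",                      # placeholder family name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    containerDefinitions=[
        {
            "name": "web",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/web-app:latest",
            "essential": True,             # if this container stops, the task is unhealthy
            "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
            "healthCheck": {
                # Run inside the container by the Amazon ECS agent;
                # a non-zero exit code marks the container unhealthy.
                "command": ["CMD-SHELL", "curl -f http://localhost/healthz || exit 1"],
                "interval": 30,
                "timeout": 5,
                "retries": 3,
                "startPeriod": 10,
            },
        }
    ],
)
```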

For a task to be healthy, all sources of health status must evaluate as healthy. If any of the sources return an unhealthy status, then the Amazon ECS task is considered unhealthy and it will be replaced.
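The load balancer health check described in the third criterion lives on the ELB target group rather than in the task definition. Here is a rough boto3 sketch of that wiring, again with placeholder names, IDs, and thresholds:

```python
import boto3

elbv2 = boto3.client("elbv2")
ecs = boto3.client("ecs")

# Target group whose health check the load balancer runs against each task
# (placeholder VPC ID and names).
target_group = elbv2.create_target_group(
    Name="web-app-tg",
    Protocol="HTTP",
    Port=80,
    VpcId="vpc-0123456789abcdef0",
    TargetType="ip",                       # awsvpc tasks register by IP address
    HealthCheckProtocol="HTTP",
    HealthCheckPath="/healthz",
    HealthCheckIntervalSeconds=15,
    HealthCheckTimeoutSeconds=5,
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=3,
)["TargetGroups"][0]

# Attach the target group to the service so Amazon ECS also factors the
# ELB health status into task health.
ecs.create_service(
    cluster="production",                  # placeholder cluster name
    serviceName="web-app",
    taskDefinition="web-app",              # family registered earlier
    desiredCount=8,
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "assignPublicIp": "ENABLED",
        }
    },
    loadBalancers=[
        {
            "targetGroupArn": target_group["TargetGroupArn"],
            "containerName": "web",
            "containerPort": 80,
        }
    ],
    healthCheckGracePeriodSeconds=60,      # ignore ELB health status right after launch
)
```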

Task replacement behavior

Replacing an Amazon ECS task is something that happens in two main circumstances:

  1. During a fresh deployment triggered by the UpdateService API call. Any existing tasks that are part of the previous deployment must be replaced by new tasks that are part of the new deployment.
  2. When an existing task inside an active deployment becomes unhealthy. Unhealthy tasks must be replaced in order to maintain the desired count of healthy tasks.

From early on in the history of Amazon ECS, the behavior of task replacement during rolling deployments has been configurable using two properties of the Amazon ECS service:

  • maximumPercent – This sets the upper limit on the number of tasks Amazon ECS can run, expressed as a percentage of the service’s desired count. For example, if the maximumPercent is 200% and the desired count for the service is eight tasks, then Amazon ECS can launch additional tasks up to a total of 16 tasks.
  • minimumHealthyPercent – This sets the lower limit on the number of running tasks during a deployment, expressed as a percentage of the desired count. For example, if minimumHealthyPercent is 75% and the desired count for the service is eight tasks, then Amazon ECS can stop two tasks, reducing the service deployment down to six running tasks. Both values are set on the service, as shown in the sketch after this list.
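As a sketch, both properties are set through the service’s deploymentConfiguration, for example with boto3 (the cluster and service names are placeholders):

```python
import boto3

ecs = boto3.client("ecs")

# Allow the service to grow to 200% of its desired count during deployments
# and replacements, and never drop below 75% of the desired count.
ecs.update_service(
    cluster="production",          # placeholder cluster name
    service="web-app",             # placeholder service name
    deploymentConfiguration={
        "maximumPercent": 200,
        "minimumHealthyPercent": 75,
    },
)
```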

The maximumPercent and minimumHealthyPercent properties have functioned for many years as effective controls for fine tuning the behavior of rolling deployments when running Amazon ECS tasks on Amazon Elastic Compute Cloud (Amazon EC2) capacity. However, these deployment controls make less sense in a world where more and more Amazon ECS users are choosing serverless AWS Fargate capacity. In most cases, modern applications don’t require Amazon ECS to go below the desired count of running tasks, or to limit the number of additional tasks launched, during a rolling deployment, because AWS Fargate capacity isn’t constrained by how many underlying Amazon EC2 instances you have registered into your cluster.

Additionally, the maximumPercent and minimumHealthyPercent controls were originally ignored when it came to replacing unhealthy tasks. If tasks became unhealthy, then your service’s running task count could dip well below the threshold defined by minimumHealthyPercent. For example, if you were running eight tasks and four of them became unhealthy, then Amazon ECS would terminate the four unhealthy tasks and launch four replacement tasks. The number of running tasks would temporarily dip to 50% of the desired count.

Updates to how Amazon ECS replaces unhealthy tasks

As of October 20, 2023, Amazon ECS now uses your maximumPercent whenever possible when replacing unhealthy tasks. Let’s look at a few scenarios to understand how this works:

Crashing tasks

You’re running a service with a desired count of eight tasks and a maximum percent of 200%. Four of your eight tasks encounter critical runtime exceptions. Their processes crash and exit, which causes an essential container to exit. Amazon ECS observes that four of the eight tasks have become unhealthy because their essential container exited. In this case Amazon ECS can’t avoid the running task count dipping below the desired count, because the containers have already crashed. The running task count dips to 50% of the desired count briefly, but Amazon ECS launches four replacement tasks as quickly as possible to bring the number of running tasks back up to the desired count of eight tasks.

Frozen tasks

You’re running a service with a desired count of eight tasks and a maximum percent of 200%. Because of an endless loop in your code, four of your eight tasks freeze up, but their processes stay running. The attached load balancer that is sending health check requests to the service observes that the target containers are no longer responsive to health check requests, so it marks those targets as unhealthy. Amazon ECS considers those four frozen tasks to be unhealthy. The maximum percent for the service allows it to go up to 16 tasks, so Amazon ECS launches four replacement tasks in parallel with the four unhealthy tasks, making a total of 12 running tasks. Once the four additional tasks have become healthy, Amazon ECS stops the four unhealthy tasks, which brings the running task count back down to the desired count of eight tasks.
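The arithmetic behind these two scenarios can be illustrated with a small sketch. This isn’t the actual scheduler implementation, just the headroom calculation implied by maximumPercent:

```python
import math

def replacement_headroom(desired_count: int, running_count: int, maximum_percent: int) -> int:
    """Rough illustration: how many extra tasks fit under the maximumPercent
    ceiling before any running task has to be stopped."""
    ceiling = math.floor(desired_count * maximum_percent / 100)
    return max(ceiling - running_count, 0)

# Crashing tasks: four tasks already exited, so only four are still running.
# Headroom is 12, but Amazon ECS only needs four replacements to get back
# to the desired count of eight.
print(replacement_headroom(desired_count=8, running_count=4, maximum_percent=200))  # 12

# Frozen tasks: all eight tasks are still running. The ceiling of 16 leaves
# room for the four replacements to start before the frozen tasks are stopped.
print(replacement_headroom(desired_count=8, running_count=8, maximum_percent=200))  # 8
```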

Overburdened tasks

You’re running a service with a desired count of eight tasks and a maximum percent of 150%. The service has autoscaling rules attached to it. It also has a load balancer attached to it, and a large spike of traffic arrives via the load balancer. The spike of traffic is so large that response time from the tasks rises dramatically. As a result of the high response time, the load balancer health check fails and the ELB marks all eight targets as unhealthy. The ELB fails open and continues distributing traffic to all the targets, because there are no healthy targets left in the target group.

Amazon ECS observes that all eight tasks are unhealthy. As a result, Amazon ECS wants to replace these unhealthy tasks. The maximum percent of 150% allows the service to go up to 12 running tasks. Therefore, Amazon ECS avoids stopping the unhealthy running tasks immediately. Instead, it launches four replacement tasks in parallel with the existing eight unhealthy tasks. Fortunately these four additional tasks give the ELB more targets to distribute traffic across, and all 12 of the running tasks stabilize in health as they are now able to handle the incoming traffic without timing out. Amazon ECS observes that there are now 12 healthy running tasks.

At the same time, an Application Auto Scaling rule has kicked in after observing high CPU utilization from the original eight running tasks. The rule has updated the desired count for the Amazon ECS service from eight running tasks to 10 running tasks. Therefore, Amazon ECS only stops two of the 12 healthy running tasks, which reduces the task count back down to its current desired count of 10 running tasks.
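The CPU-based scaling rule in this scenario is the kind you can define with Application Auto Scaling target tracking. A minimal boto3 sketch follows; the resource names, capacity limits, and the 60% target are illustrative placeholders, not recommendations:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the service's desired count as a scalable target
# (placeholder cluster and service names).
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/production/web-app",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=8,
    MaxCapacity=32,
)

# Target-tracking policy: scale out when average CPU utilization across
# the service's tasks stays above 60%.
autoscaling.put_scaling_policy(
    PolicyName="web-app-cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/production/web-app",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 120,
    },
)
```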

Limited maximum percent

You’re running a service with a desired count of eight tasks, and because of downstream limits or infrastructure constraints you have set a maximum percent of 100%. This doesn’t allow Amazon ECS to launch any additional tasks in parallel with your eight running tasks. If a task from this deployment freezes, or becomes overburdened and starts failing health checks, then Amazon ECS needs to replace it. Amazon ECS stops the unhealthy task first, then launches a replacement task after the unhealthy task has been stopped. This means the running task count still temporarily dips below the desired count.
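In terms of the headroom sketch shown earlier, floor(8 × 100 / 100) − 8 = 0, so there is no room to launch a replacement task until an unhealthy task has been stopped.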

Task fails health checks during a rolling deployment

You’re running a service with a desired count of eight tasks and a maximum percent of 150%. You’re doing a rolling deployment to update your running tasks to a new task definition. Because the maximum percent is 150%, Amazon ECS can launch additional tasks in parallel with your currently running tasks. The rolling deployment has already triggered four additional task launches, so the service currently has 12 running tasks: eight old tasks and four new tasks.

During this rolling deployment, some of the old tasks begin failing a health check because of an unexpected bug. Because there’s an active rolling deployment occurring, Amazon ECS terminates the unhealthy tasks immediately and replaces them with tasks from the new deployment as quickly as possible. During a rolling deployment, Amazon ECS always tries to replace failing tasks with tasks from the new active deployment.
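For reference, the rolling deployment in this scenario is the kind started by an UpdateService call that points the service at a new task definition revision, and you can watch the old and new deployments side by side with DescribeServices. A boto3 sketch with placeholder names and revision number:

```python
import boto3

ecs = boto3.client("ecs")

# Start a rolling deployment onto a new task definition revision
# (placeholder cluster, service, and revision number).
ecs.update_service(
    cluster="production",
    service="web-app",
    taskDefinition="web-app:42",
)

# Inspect the active deployments: during the rollout the service carries
# both the old and the new deployment, each with its own running count.
service = ecs.describe_services(
    cluster="production",
    services=["web-app"],
)["services"][0]

for deployment in service["deployments"]:
    print(
        deployment["status"],          # PRIMARY (new) or ACTIVE (old)
        deployment["taskDefinition"],
        deployment["runningCount"],
    )
```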

Ongoing task failures because of external factors

You’re running a service with a desired count of eight tasks and a maximum percent of 150%. One of the downstream services that your code depends on starts returning an unexpected response, and this causes your code to start failing health checks. Amazon ECS sees that the eight tasks are unhealthy and need to be replaced, so it launches four additional replacement tasks in parallel with the eight initial tasks. At this point there are 12 tasks running: eight original tasks and four replacement tasks. Unfortunately, all 12 tasks are unhealthy, because the replacement tasks rely on the same unreliable downstream service as the original tasks.

Because the replacement tasks did not stabilize, and Amazon ECS sees that the number of unhealthy tasks is greater than the desired count for the service, Amazon ECS stops four of the unhealthy tasks at random to bring the number of unhealthy tasks back down to the desired count. Amazon ECS does not keep track of which unhealthy tasks were “original” and which were “replacements”. Once enough of the excess unhealthy tasks have been stopped, and there is room for additional tasks, Amazon ECS attempts to launch replacement tasks again. This cycle continues until the downstream service becomes reliable again, or until you make an UpdateService API call to roll out a code update that handles the failure condition more gracefully.
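If you suspect this replace-and-stop cycle is happening, the service event stream records each round of launches and stops. A small boto3 sketch (placeholder names) that prints the most recent events:

```python
import boto3

ecs = boto3.client("ecs")

# The service event log shows repeated "has started N tasks" and
# "has stopped N running tasks" messages while the cycle continues
# (placeholder cluster and service names).
service = ecs.describe_services(
    cluster="production",
    services=["web-app"],
)["services"][0]

for event in service["events"][:10]:     # newest events first
    print(event["createdAt"], event["message"])
```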

Health checks and responsive absorption of workload spikes

Previously, Amazon ECS always stopped unhealthy tasks first, then launched a replacement task. This behavior made sense in a world where tasks were binpacked densely onto a statically sized cluster of Amazon EC2 instances that had no room to launch a replacement task without stopping an existing task. But many modern container workloads now run on serverless AWS Fargate capacity. There’s no need to stop an unhealthy running task to make room for its replacement, because AWS Fargate can supply as much on-demand container capacity as needed. Additionally, many customers of Amazon ECS on Amazon EC2 now use Amazon ECS capacity providers to launch additional Amazon EC2 instances on demand, rather than deploying to statically sized clusters of Amazon EC2 instances. Therefore, Amazon ECS now prioritizes using the maximumPercent for a service, and whenever possible it keeps unhealthy tasks running until after their replacements have become healthy.
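For the Amazon EC2 case, the on-demand capacity mentioned here typically comes from a capacity provider with managed scaling enabled, so the cluster can grow when replacement tasks need room. A boto3 sketch with placeholder names and a placeholder Auto Scaling group ARN:

```python
import boto3

ecs = boto3.client("ecs")

# Capacity provider backed by an Auto Scaling group, with managed scaling
# so Amazon ECS can launch instances when replacement tasks need room
# (placeholder ARN and names).
ecs.create_capacity_provider(
    name="web-app-capacity",
    autoScalingGroupProvider={
        "autoScalingGroupArn": "arn:aws:autoscaling:us-east-1:123456789012:autoScalingGroup:uuid:autoScalingGroupName/web-app-asg",
        "managedScaling": {
            "status": "ENABLED",
            "targetCapacity": 100,
        },
    },
)

# Make the capacity provider the default for the cluster.
ecs.put_cluster_capacity_providers(
    cluster="production",
    capacityProviders=["web-app-capacity"],
    defaultCapacityProviderStrategy=[
        {"capacityProvider": "web-app-capacity", "weight": 1}
    ],
)
```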

Additionally, the new Amazon ECS task replacement behavior helps prevent runaway task termination. In some cases, a large workload spike caused a few tasks from the deployment to become unhealthy, which triggered their replacement. However, when Amazon ECS stopped unhealthy tasks in order to launch replacements, the load balancer shifted more workload onto the remaining healthy tasks, which caused them to go unhealthy as well. In quick succession, all healthy tasks would be overwhelmed with workload, causing a cascade of runaway health check failures until every task had gone unhealthy.

Eventually, Application Auto Scaling rules would kick in and scale up the deployment to a large enough size to handle the workload. But in most cases, a traffic spike causes the load balancer health checks to fail before it triggers aggregate resource consumption-based autoscaling. Auto scaling rules need to observe at least one minute of high average resource utilization before they react by scaling out the container deployment. However, an overburdened task may begin failing load balancer health checks immediately.

In the scenario where your tasks are unhealthy because they are dealing with a large spike of incoming workload, the new task replacement behavior of Amazon ECS dramatically improves availability and reliability of your service. Amazon ECS catches health check failures and proactively launches a parallel replacement task that can help absorb the incoming workload spike before autoscaling rules even trigger. Once autoscaling rules trigger, the replacement task and the original task are both retained, if they are both healthy and if they fulfill the current desired task count of the service.

Conclusion

In this post, we explained new Amazon ECS behavior when handling unhealthy tasks. As more customers adopt Amazon ECS for their mission critical applications, we are always happy to tackle challenging new orchestration problems at scale. This updated task replacement behavior is designed to help serve the needs of customers both small and large. It helps keep your container deployments online and available—even in adverse circumstances such as application failure or traffic spikes.

Please visit the Amazon ECS public roadmap for more info on additional upcoming features for Amazon ECS or to create your own issue to request a change or new feature.

For more info on Amazon ECS scheduler behavior, see the official documentation, under Service Scheduler Concepts.

Nathan Peck

Developer advocate for serverless containers at Amazon Web Services, working on Amazon Elastic Container Service and AWS Fargate. Likes building startup architecture and microservices.