
A deep dive into resilience and availability on Amazon Elastic Container Service

Introduction

In this post, we'll take a deep dive into the architecture principles we use in Amazon Elastic Container Service (Amazon ECS). We'll outline some of the features that Amazon ECS delivers to make it easy for your application to achieve high availability and resilience, explore how Amazon ECS is designed to use AWS availability and resilience patterns, and show how these are surfaced in the Amazon ECS API for easy consumption. With this deeper insight, we believe you will be well positioned to choose the best Amazon ECS configuration and features for your solution's needs.

Our goal at AWS is to remove undifferentiated heavy lifting from our customers so that you can focus on innovating for your business. Many modern applications need to be resilient to various failure modes and highly available. Building resilient and highly available solutions can be challenging and time consuming, and often that investment – although critical to business success – isn't in itself something that differentiates your solution for your customers.

Amazon ECS is AWS's fully managed container orchestration service that helps you efficiently deploy, manage, and scale containerized applications on Amazon Elastic Compute Cloud (Amazon EC2), on AWS Fargate, or on your on-premises hardware through Amazon ECS Anywhere. We built Amazon ECS so that you can use the constructs we use ourselves to achieve high availability easily, and focus on those aspects of your application that truly differentiate you and your business.

Solution overview

Availability and resilience

When you consider failure mitigation there are typically two areas to think through:

  • How available does my solution need to be?
  • How resilient does my solution need to be?

Availability is a measure of the probability that a solution is available to do useful work. For example, if your solution is a website selling cupcakes, then your solution's availability could be measured by whether a customer is able to successfully buy a cupcake at any time. If at any point a customer is unable to complete a cupcake purchase, that would negatively impact your availability metric.

Resilience factors into availability, but is subtly different. Resilience is a service's ability to continue to operate under adverse conditions and the speed with which it is able to return to normal operation. In our cupcake example, your resilience would be measured by how quickly you get back to customers being able to buy cupcakes from the point of first failure. Put another way, availability can be measured as the proportion of time a solution is able to do useful work over some period, while resilience can be measured as the average time it takes to recover from failure.

When designing the Amazon ECS control plane, we deeply considered these two concepts and have designed the service for both. We’ll first explore how Amazon ECS is designed and can be used for availability and then explore how Amazon ECS is architected and managed to improve its resilience.

Availability

It is safe to assume that over the lifetime of an application there will almost certainly be some failure. Something we factor into our designs at AWS is to expect the unexpected: something is bound to go wrong at some point for reasons we didn't anticipate. Even at very low failure rates, with sufficiently large scale there is always some amount of background failure. For example, if we have one million servers at 99.99% availability, then we can expect approximately one hundred of those servers to be failing at any point in time. To accommodate this, we design our services to treat failure as part of normal operations. With Amazon ECS, we incorporated this thinking into the service design for our management layers and built features that give you mechanisms to do the same for your applications on Amazon ECS.

Static stability using Availability Zones

AWS provides our customers opportunities to use our highly redundant infrastructure through a hierarchy of AWS Regions and Availability Zones. Each AWS Region uses a shared-nothing pattern: each Region is isolated and independent of every other Region to avoid correlated failure across Regions. Every AWS Region is divided into multiple Availability Zones. Availability Zones are geographically close enough to avoid network latency issues, but spread far enough apart within a Region, with redundant and independent operations, to avoid correlated failure. An Availability Zone comprises one or more discrete data centers, each with its own redundant power, networking, and inter-Availability Zone connectivity. These two constructs provide the foundation for building highly available services.

In their article Static Stability using Availability Zones, Becky Weiss and Mike Furr describe what we mean by static stability. They outline how AWS services leverage Active-Active stateless services pre-scaled across Availability Zones. The Amazon ECS control plane is a micro-service architecture with many discrete services. These micro-services are designed to use AWS infrastructure to maximize availability, building on the Active-Active pattern outlined in Becky and Mike's article. Amazon ECS is deployed in every Region as a full, discrete, and autonomous copy of the entire service. Each micro-service is deployed into every Availability Zone in a Region and over-provisioned to more than 150% of peak load, spread across at least three Availability Zones. This ensures that in the event of an outage in a single Availability Zone, Amazon ECS has ample capacity in the remaining Availability Zones across all of its micro-services to continue serving customer requests.

Diagram of the Amazon ECS control plane showing capacity pre-scaled to 150% of peak, distributed evenly over at least three Availability Zones

Amazon ECS services and placement strategies

Workloads are provisioned into Amazon ECS clusters via the console, the AWS Command Line Interface (AWS CLI), or the Amazon ECS API. An instance of a workload running in a cluster (regardless of how it was launched) is referred to as a task. The control plane leverages Availability Zones to mitigate the impact of underlying infrastructure failure by ensuring that copies of each of its micro-services are running in at least three Availability Zones. You can achieve the same availability for your workload by doing the same. When using Amazon ECS on Amazon EC2, configure your service to use the spread placement strategy for its tasks, across as many Availability Zones in the AWS Region as your availability needs require.

Diagram shows an Amazon ECS Service with Spread placement across at least three Availability Zones

In order to ensure that your workload is spread across Availability Zones when using Amazon ECS on Amazon EC2, you'll need to ensure you have Amazon EC2 capacity registered with your cluster in those Availability Zones. Typically you would do this by registering a capacity provider, backed by an Amazon EC2 Auto Scaling group (ASG), into your cluster and ensuring that the Auto Scaling group is configured to launch Amazon EC2 instances in multiple Availability Zones. With Amazon EC2 capacity spread across multiple Availability Zones, you'll then want to choose the spread placement strategy to spread task placement as evenly as possible across Availability Zones, as in the sketch that follows.
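As a minimal sketch using the AWS SDK for Python (Boto3), a service on EC2 capacity could be created with an Availability Zone spread placement strategy like this. The cluster, service, task definition, and capacity provider names are hypothetical placeholders, and the desired count is only the example value used later in this post.

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Hypothetical names: the cluster, task definition, and capacity provider
# (backed by a multi-AZ Auto Scaling group) are assumed to already exist.
response = ecs.create_service(
    cluster="my-cluster",
    serviceName="web",
    taskDefinition="web-task:1",
    # 9 tasks = 6 needed at peak, pre-scaled to 150% across three AZs.
    desiredCount=9,
    capacityProviderStrategy=[
        {"capacityProvider": "my-ec2-capacity-provider", "weight": 1}
    ],
    placementStrategy=[
        # Spread tasks as evenly as possible across Availability Zones.
        {"type": "spread", "field": "attribute:ecs.availability-zone"},
    ],
)
print(response["service"]["serviceArn"])
```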

When using Amazon ECS with AWS Fargate, it is even easier to ensure that your tasks are spread across Availability Zones. You'll need to specify multiple Amazon VPC subnets in the network configuration for your task when you create the service or in your RunTask request. Each Amazon VPC subnet resides in a single Availability Zone. Amazon ECS uses the subnets provided during task provisioning to determine in which Availability Zones to launch tasks, and ensures that workloads are spread across these subnets. AWS Fargate always spreads your tasks across the Availability Zones you configure through subnets. The spread is based on the service name when deploying a service, and on the task definition family name in the case of a single task launched through RunTask.

Diagram shows Amazon ECS Service using AWS Fargate spread across at least three Availability Zones
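A minimal Boto3 sketch of a Fargate service spread across three Availability Zones might look like the following. The subnet and security group IDs are hypothetical; what matters is that each subnet resides in a different Availability Zone.

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Hypothetical IDs: each subnet below is assumed to sit in a different
# Availability Zone, so Fargate spreads tasks across those zones.
response = ecs.create_service(
    cluster="my-cluster",
    serviceName="web-fargate",
    taskDefinition="web-task:1",
    desiredCount=9,
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-aaa111", "subnet-bbb222", "subnet-ccc333"],
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
)
print(response["service"]["serviceArn"])
```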

Avoiding fallback in distributed systems

Although we do everything we can to mitigate the occurrence of a failure, we assume that at some point an unforeseen event will result in infrastructure or software failing. In his article Avoiding Fallback in Distributed Systems, Jacob Gabrielson outlines how Amazon has found that falling back in the face of failure can actually exacerbate that failure. To mitigate this, we favor active-active solutions where we weigh work away from failure, and we avoid bimodal behavior in our service architecture (bimodal behavior is when a service behaves differently under normal and failure modes). As Jacob outlines, our experience has been that such bimodal behavior introduces complexity and risk.

To improve the availability of our service through weighing away, we ensure that we are pre-scaled sufficiently that a service can survive the loss of a single Availability Zone (typically equating to 33% of available capacity). For example, if one of our services requires six workers to support peak traffic load with some headroom, and that service is hosted out of three Availability Zones, we'll provision three workers per Availability Zone (50% over-provisioned) for a total of nine workers. This ensures that if a single Availability Zone fails, the remaining Availability Zones are sufficiently pre-scaled that the service has enough capacity to continue to serve requests even at peak.

To achieve the same availability for your Amazon ECS service, you can configure the desired task count for your service to be 50% greater than your peak traffic load requires, with spread placement across three Availability Zones. During a deployment, or as a result of scaling activity, Amazon ECS ensures that your service honors the placement strategy you have configured. For spread placement, Amazon ECS ensures that your tasks are distributed across the Availability Zones you have configured.

To determine the number of tasks you need, you can use the following approach:

1) First, determine the maximum number of tasks your service needs to meet its operational commitment. We’ll call this your Base Desired Count.

2) Determine the number of Availability Zones you plan to use. We’ll call this AZ Spread Count.

3) Calculate the task count needed to continue to meet your peak traffic commitment if a single Availability Zone becomes unavailable. We'll call this the Target Desired Count.

Calculate this as follows:

Target Desired Count = Base Desired Count × (AZ Spread Count / (AZ Spread Count − 1))

For example, if we need 6 tasks and we have 3 AZs, then:

Target Desired Count = 6 × (3 / (3 − 1)) = 9
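As a minimal sketch, this calculation can be wrapped in a small helper (the function name is ours, and we round up so that an inexact division never leaves you under-provisioned):

```python
import math

def target_desired_count(base_desired_count: int, az_spread_count: int) -> int:
    """Task count that keeps peak capacity available even if one
    Availability Zone is lost, rounded up to a whole task."""
    if az_spread_count < 2:
        raise ValueError("Spread tasks across at least two Availability Zones")
    return math.ceil(base_desired_count * az_spread_count / (az_spread_count - 1))

print(target_desired_count(6, 3))    # 9
print(target_desired_count(600, 3))  # 900
print(target_desired_count(600, 6))  # 720
```

The last two calls reproduce the 600-task comparison discussed later in this section.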

Diagram shows the benefit, in case of failure, of an Amazon ECS on AWS Fargate service pre-scaled to 150% of needed capacity across three Availability Zones

For larger services in Regions with more than three Availability Zones, it is possible to reduce the pre-scaling needed in each Availability Zone by spreading the service footprint over more Availability Zones. This can help reduce the cost of ownership of your service while still meeting your availability requirements. For example, us-east-1 (at the time of writing) has six Availability Zones. We can see the benefit of amortizing the extra capacity across a larger number of Availability Zones for a service needing 600 Amazon ECS tasks to handle peak load in the following example:

  • 600 tasks across 3 AZs: Target Desired Count = 600 × (3 / (3 − 1)) = 900
  • 600 tasks across 6 AZs: Target Desired Count = 600 × (6 / (6 − 1)) = 720

In this case, we see a saving on our compute cost as a result of using more Availability Zones while still meeting our availability goals: rather than needing 50% additional compute, we only need 20% in the above scenario for the same outcome. Be aware that, depending on the nature of your workload, this could increase cross-AZ data transfer charges, which may increase the total cost of ownership.

There are a number of benefits to this approach. First, you don't need to take action in the event of a single Availability Zone failure; your service continues to operate, provided you have correctly configured your load balancer(s) and load balancer health checks. Secondly, there is no fallback to mitigate failure, which avoids the risks outlined by Jacob in his article. Thirdly, this approach is readily testable in non-production environments by provisioning a test stack with the same configuration and then force-terminating all of the worker nodes in a single Availability Zone, as in the sketch below.
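As a minimal sketch of such a test against an EC2-backed test cluster (the tag name and Availability Zone are hypothetical, and this should only ever be run against a disposable test stack), you could terminate every worker node in one zone and verify that the service keeps serving traffic:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Find the running worker nodes of the hypothetical test stack that sit in
# the Availability Zone whose loss we want to simulate.
response = ec2.describe_instances(
    Filters=[
        {"Name": "tag:ecs-test-stack", "Values": ["true"]},
        {"Name": "availability-zone", "Values": ["us-east-1a"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)
instance_ids = [
    instance["InstanceId"]
    for reservation in response["Reservations"]
    for instance in reservation["Instances"]
]

# Force-terminate them to simulate a single-AZ failure in the test environment.
if instance_ids:
    ec2.terminate_instances(InstanceIds=instance_ids)
```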

For Amazon ECS, weighing traffic out of an Availability Zone is part of our standard operating procedure and something we regularly exercise in pre-production and production environments. This is a critical tool for meeting our high availability commitment to our customers.

Workload isolation using sharding

Amazon ECS uses workload isolation through sharding, a concept introduced in Colm MacCárthaigh's article Workload Isolation using Shuffle Sharding. This enables scaling of the service and provides an isolation boundary for sets of customer workloads. The core set of services that comprise the Amazon ECS control plane in a Region are partitioned into what we call cells. Each cell is a full, discrete, and autonomous instance of the control plane, responsible for managing a subset of customer workloads for that Region. Each cell is provisioned across at least three Availability Zones at more than 150% of peak load capacity, as outlined earlier. We use this cell-based partitioning to provide additional fault isolation for our customers. Customer workloads are associated with a subset of cells within a Region, and the processes managing those workloads are discrete from processes in other cells.

Diagram showing the Amazon ECS control plane sharded into cells, with each isolated cell a complete copy of Amazon ECS pre-scaled to 150% of peak and spread over at least three AZs

We have found that this architectural pattern provides real benefits for both scale and availability. Cells act as fault isolation boundaries for the control plane and Availability Zones act as fault isolation boundaries for the infrastructure. If a failure occurs due to a hardware or software fault, that failure is typically isolated to a single fault zone and a subset of customer workloads. The pattern is also very useful for scaling. Since a cell is a relatively small unit of the whole production environment, while also being a complete solution, we can run scale tests in non-production environments at the production scale of a single cell. We only need to scale test a single cell to understand the scaling constraints of our production service.

As a customer of Amazon ECS, you are able to use these fault isolation boundaries through a combination of clusters and services. When you create a cluster, it is associated with one or more Amazon ECS cells. Any workloads provisioned in your cluster will be managed by the control plane partition(s) that reside within the cell(s) where your cluster is placed.

Diagram showing how Amazon ECS Clusters are hosted in Amazon ECS Control Plane cells and can use the Cell fault isolation boundary.

One consideration when designing your solution is whether to provision one or more clusters. Amazon ECS clusters are logical namespaces used to group workloads, and they can be a useful mechanism for aligning workloads with fault zones within Amazon ECS. Doing so provides protection from correlated failure for workloads in different Amazon ECS clusters. You may have observed that the service quotas for clusters per Region and for services and Amazon EC2 instances per cluster are relatively large, allowing thousands of clusters per Region and thousands of services per cluster. There is no charge for clusters, and any number of clusters can reside in the same Amazon Virtual Private Cloud (Amazon VPC) network. Workloads communicate freely across clusters in the same network, which allows you to use as many clusters as you need to meet your availability needs without incurring additional cost.
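As a minimal sketch (the cluster names are hypothetical), splitting workloads across separate clusters in the same VPC could look like the following; Amazon ECS takes care of associating each cluster with its control plane cells.

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Hypothetical clusters used as separate fault isolation groupings; there is
# no charge for the clusters themselves, and Amazon ECS associates each
# cluster with one or more control plane cells on your behalf.
for name in ["orders-cluster", "payments-cluster"]:
    cluster = ecs.create_cluster(clusterName=name)
    print(cluster["cluster"]["clusterArn"])
```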

Resilience

As we described earlier, resilience is a service's ability to continue to operate under adverse conditions and the speed with which it is able to return to normal operation. For workloads that scale horizontally, we want to ensure that when something fails, the service remains available. We do this through pre-scaling and Availability Zone static stability, as outlined in the previous sections. From a resilience standpoint, we want to ensure that the service returns to steady state as fast as possible, while continuing to remain available to do useful work.

Automating safe hands-off deployments

At AWS, we favor automation, and continuous deployment is a critical part of this automation. Clare Liguori has a great article, Automating safe, hands-off deployments, that describes the patterns we use to ensure automated, safe continuous deployment for our solutions.

Amazon ECS is deployed multiple times a day using continuous deployment. A service change is applied to a single Availability Zone at a time in any one AWS Region, to limit the blast radius in case of failure. We also ensure that deployments proceed at a measured cadence, where a new revision of a service is introduced to only a small subset of the servers in that Availability Zone. In her article, Clare refers to this as rolling deployments – the same deployment strategy that Amazon ECS supports by default for your services. To monitor for success, we use automated metrics that track the aggregate health of the service and can quickly identify undesirable behavior. If any undesirable outcomes are identified in these metrics, the deployment is automatically rolled back. At the same time, failure detected in a single Availability Zone is mitigated through automated processes that immediately act to weigh traffic out of the impacted Availability Zone. We can do this with no impact to our customers because of the pre-scaling already in place to accommodate single-AZ failure.
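One way to get similar automated-rollback behavior for your own services is the Amazon ECS deployment circuit breaker on the default rolling update deployment type. A minimal Boto3 sketch follows; the cluster and service names are hypothetical, and the percentages are only example values.

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Rolling deployment with automated rollback: keep at least 100% of the
# desired count healthy during the deployment, allow up to a 200% surge,
# and let the deployment circuit breaker roll back automatically on failure.
ecs.update_service(
    cluster="my-cluster",
    service="web",
    deploymentConfiguration={
        "deploymentCircuitBreaker": {"enable": True, "rollback": True},
        "maximumPercent": 200,
        "minimumHealthyPercent": 100,
    },
)
```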

Automated safe hands-off deployments meets workload isolation using sharding

As previously mentioned, the core set of services that comprise the Amazon ECS control plane are partitioned into what Amazon ECS refers to as cells. To further improve service resilience, our deployments are done within a single cell and within a single Availability Zone. This approach lets us use pre-scaling in Availability Zones to weigh traffic away from an AZ in case of failure, and workload isolation to further limit the impact of unforeseen issues arising from the deployment. As a result, changes are rolled out in small increments to a subset of service workers.

Changes are released in change sets, and we take a very conservative approach when introducing any new change set. Once thoroughly tested, the new change set is introduced in small increments to a production environment using a rolling deployment strategy. A new change set is introduced into a single Region, in a single Availability Zone and a single cell, and only to a subset of the workers. The deployment is monitored with automated rollback alarming; if at any point negative impact is detected, the deployment is immediately rolled back. We leverage bake time (described in Clare's article) and small initial increments to build confidence in a change set. As we build confidence, we increase the velocity of the deployment, but we don't compromise on our commitment that a single change set is only ever deployed to a single Availability Zone in any one Region at a time.

Using AWS CodePipeline and Amazon ECS, you can model the same deployment pattern and achieve the same resilience posture for your service.

Conclusion

In this post, we showed you some of the resilience and availability tenets we use for Amazon ECS. Using these patterns allows us to deliver highly available services even in the face of a single AZ failure, while also allowing engineering teams to safely and continuously deliver software. We cover these patterns in more detail in the Amazon Builders' Library publications referenced throughout this article, and we hope they can be a useful guide when you design your service to meet your availability needs.

We have found many benefits in using the patterns outlined above to ensure Amazon ECS is highly available and resilient in the face of unforeseen failure. We normalize the mechanisms that mitigate failure by incorporating them into our continuous deployment pipeline. We favor pre-scaling our solutions across at least three Availability Zones and use workload isolation techniques, partitioning our control plane into complete copies delivered as cells. We combine these with a rolling deployment strategy and automated rollback alarming to enable developer agility while preserving service availability. Not all services need to achieve this level of availability or agility, so these patterns may not be right for you, but for large and growing services that need to meet high availability and resilience goals we have found them to work well.

You can also learn more about this topic in the 2023 re:Invent session Deep dive into Amazon ECS resilience and availability (CON401).

Maish Saidel-Keesing

Maish Saidel-Keesing is a Senior Enterprise Developer Advocate in the Container Services team. Maish lives in Israel with his wife and 3 daughters and focuses on improving customer experience with everything related to containers in the cloud. You can always reach out to him on Twitter (@maishsk).

Malcolm Featonby

Malcolm Featonby is a Senior Principal Engineer at Amazon Web Services. Malcolm lives in Seattle with his two sons and wife and works with the AWS containers service teams, focusing on service scaling, availability, and system decoupling. He holds a master's degree in Information Technology. You can find him on Twitter at @mfeatonby.