Scale to 15,000+ tasks in a single Amazon Elastic Container Service (ECS) cluster
Amazon Elastic Container Service (Amazon ECS) is a fully managed container orchestration service that simplifies your deployment, management, and scaling of containerized applications. Amazon ECS has deep AWS integrations and best practices built-in, which enable you to run and scale your applications in the cloud or on-premises, without the complexity of managing a control plane. AWS Fargate is a serverless compute engine built into Amazon ECS, which enables you to focus on building and scaling your applications without needing to worry about managing servers, scaling capacity, and security and compliance of your infrastructure.
As your business grows and customer demand increases, the scalability of your applications becomes crucial to success. In this post, we’ll demonstrate the powerful simplicity of Amazon ECS with AWS Fargate to seamlessly scale an application to 15,000+ tasks on a brand-new AWS account. We’ll describe the limits encountered, resolutions, and recommendations for operating at this massive scale. For a quick summary, you can jump to the Key takeaways.
In this exercise, we deployed a web application consisting of four microservices to showcase various scaling dimensions. Below are the configuration details of each service:
- ecsdemo-frontend – This is the front-end component of our application. It is configured with Amazon ECS service discovery and exposed to users through a public Application Load Balancer (ALB). Each task within the service is allocated 1 vCPU and 2 GB of memory. To ensure the health and availability of the service, both ALB health checks and container health checks are enabled.
- ecsdemo-auth-no-sd – This service provides publicly exposed authentication APIs. It has an ALB with health checks enabled for routing all user communications. Each task is allocated 0.5 vCPU and 1 GB of memory.
- ecsdemo-backend – This is a Java application that’s responsible for the backend of our web application. Each task is allocated 1 vCPU and 2 GB of memory.
- ecsdemo-payment-api – This service is responsible for processing payments for our web application. Each task is allocated 0.25 vCPU and 0.5 GB of memory.
After outlining the characteristics of the services for our application, we’d like to begin by sharing the results and scale we achieved seamlessly with Amazon ECS on AWS Fargate using a brand-new AWS account. This sets the stage for the subsequent technical details of our scaling experience. The following chart shows how our Amazon ECS cluster scaled out to the desired number of tasks (16,000) during this scaling exercise.

Amazon ECS successfully scheduled and ran over 15,000 tasks within the same cluster. Moreover, it took approximately 15 minutes for an Amazon ECS service to achieve a scale of 1,000 running tasks, including the ALB and service discovery registration, and approximately 40 minutes to achieve a scale of 5,000 running tasks. It is worth noting that all four services were scaled up simultaneously – the front-end service had 1,000 running tasks, while the remaining three services reached the maximum limit of 5,000 running tasks per service.
Note: The actual performance in practice can vary based on several factors such as container image size, image caching, load balancer health checks, and the actual performance of your application. For performance recommendations, see the Amazon ECS best practices guide.
Now that we’ve demonstrated the results, let’s get into details on how you can achieve this scale with Amazon ECS.
To deploy our services faster with minimal operational overhead, we used Amazon ECS Blueprints, which inherently codify best practices and well-designed architecture patterns.
It’s important to note that Amazon ECS tasks on AWS Fargate run in the awsvpc network mode. In this mode, each task is allocated its own elastic network interface (ENI) and a primary private IPv4 address. To accommodate this requirement, we configured six /20 subnets, with each subnet providing 4,091 private IPv4 addresses, as follows:
It is important to highlight that not having enough private IPv4 addresses within your subnets prevents Amazon ECS from scaling out and running more tasks. Ensuring that your networking configuration supports the desired scale is crucial.
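The subnet sizing above can be sanity-checked with a short script. This is a minimal sketch: the /20 CIDRs below are hypothetical placeholders (the actual subnet ranges are not shown in this post); only the /20 size and the five AWS-reserved addresses per subnet come from the configuration described above.

```python
import ipaddress

# AWS reserves the first four and the last IPv4 address in every subnet,
# so a /20 yields 4,096 - 5 = 4,091 usable addresses.
AWS_RESERVED_PER_SUBNET = 5

def usable_ips(cidr: str) -> int:
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET

# Hypothetical /20 CIDRs standing in for the six subnets in this exercise.
subnets = [f"10.0.{16 * i}.0/20" for i in range(6)]
total = sum(usable_ips(s) for s in subnets)
print(usable_ips(subnets[0]), total)  # 4091 24546 -- ample headroom for 16,000 tasks
```

Because each Fargate task in awsvpc mode consumes exactly one private IPv4 address, this total is the hard ceiling on concurrently running tasks in these subnets.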
To demonstrate the powerful simplicity of Amazon ECS, we used a new AWS account with default limits. This provided a realistic perspective and showcases the capabilities of Amazon ECS within the constraints of default account limits.
1,000 tasks with ecs-demo-frontend
Our first task was to scale our front-end service, which has service discovery enabled and an ALB associated, from one running task to 1,000 running tasks by setting the desired count to 1,000. Initially, we observed the following as the service scaled out:
Task launch rate: Amazon ECS attempts to launch about 80 tasks every 10 seconds, which is in line with the rate of tasks launched service quota of 500 tasks per minute:
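This observed batch behavior can be sanity-checked with some quick arithmetic. The sketch below uses only the figures reported above (batches of roughly 80 tasks every 10 seconds); the result is a launch-only lower bound, since real scale-out also waits on image pulls, health checks, and load balancer or service discovery registration.

```python
import math

# Observed scheduler behavior: batches of ~80 tasks roughly every 10 seconds.
BATCH_SIZE = 80
BATCH_INTERVAL_S = 10

# Effective launch rate, which stays under the default quota of 500/minute.
effective_rate_per_min = BATCH_SIZE * 60 // BATCH_INTERVAL_S

def launch_floor_minutes(n_tasks: int) -> float:
    """Lower bound on wall-clock time just to launch n_tasks."""
    return math.ceil(n_tasks / BATCH_SIZE) * BATCH_INTERVAL_S / 60

print(effective_rate_per_min)                # 480 launches/min
print(round(launch_floor_minutes(1000), 1))  # 2.2 -- launch-only floor for 1,000 tasks
```

The gap between this ~2-minute launch floor and the ~15 minutes observed end to end is the time spent on health checks and ALB/service discovery registration.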
As the number of running tasks reached approximately 280 (about 28% of the desired count), the front-end service started to report the following error about AWS Fargate vCPU-based quotas:
This service event message indicates that we have reached the Fargate On-Demand vCPU resource count quota. This count represents the number of AWS Fargate vCPUs running concurrently on Fargate On-Demand in our account within the current Region. To address this service event message, we requested a limit increase through the AWS Service Quotas console.
After submitting the limit increase request for Fargate On-Demand vCPU resource count, you can verify your new limit through the AWS console or AWS Command Line Interface (AWS CLI) after it is approved. In our experiment, it took a day for the limit increase to take effect. Therefore, we recommend raising this limit ahead of time.
In regards to the Fargate On-Demand vCPU resource count, it is important to note that new AWS accounts may initially have lower quotas that can increase over time. AWS Fargate continuously monitors the account usage within each Region and automatically adjusts the quotas based on your usage. For details on various service quotas and limits, refer to Amazon ECS service quotas.
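To size a vCPU quota request ahead of time, you can total the per-task vCPU allocations at your target scale. The sketch below uses the task sizes and desired counts from the four services described earlier; any other workloads running Fargate On-Demand in the same account and Region would need to be added on top.

```python
# Peak Fargate On-Demand vCPU consumption at the target scale,
# using the task sizes and desired counts from this exercise.
services = {
    "ecsdemo-frontend":    {"tasks": 1000, "vcpu": 1.0},
    "ecsdemo-auth-no-sd":  {"tasks": 5000, "vcpu": 0.5},
    "ecsdemo-backend":     {"tasks": 5000, "vcpu": 1.0},
    "ecsdemo-payment-api": {"tasks": 5000, "vcpu": 0.25},
}

total_vcpus = sum(s["tasks"] * s["vcpu"] for s in services.values())
print(total_vcpus)  # 9750.0 -- fits within the raised quota of 10,000
```

This is why a quota of 10,000 vCPUs was sufficient for the full 16,000-task target.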
After increasing the AWS Fargate On-Demand vCPU resource count to 10,000, our front-end service continued to schedule further tasks as expected and the service reached 1,000 running tasks, as seen in the following screenshots:
The desired count has been updated to 1,000 tasks at 2023-06-01T17:40 UTC, and the front-end service reached steady state with 1,000 running tasks at 2023-06-01T17:54 UTC. It took approximately 15 minutes for the service to achieve a scale of 1,000 running tasks, including the ALB and service discovery registration as follows:
Scale up to 5,000 tasks
After achieving our first milestone of 1,000 running tasks and given that an Amazon ECS service can have up to 5,000 running tasks per single service, we set the desired count of the service to 5,000 and started to observe the service behavior.
The front-end service started to report the following error at approximately the maximum number of targets per target group per Region, which is 1,000 by default:
After checking the service quotas for Elastic Load Balancing (ELB) through the following dashboard, we confirmed that the quota for Targets per Target Group per Region was set to the default of 1,000.
To address this service message and continue our scaling journey, we requested our second limit increase through service quotas. But this time, the limit increase request was set to 5,000, as seen in the following:
While expecting the front-end service to continue scheduling new tasks to meet the desired count of 5,000 after increasing the targets per target group, we started to see the following service message, which indicates that the tasks for the service repeatedly fail to enter the RUNNING state (progressing directly from a PENDING to a STOPPED status). The Amazon ECS service throttle logic came into play:
After checking scheduled tasks through the AWS console, indeed, the tasks were not able to enter into RUNNING state, and they were transitioning into STOPPED state immediately as follows:
To analyze stopped tasks, on the AWS console, we selected STOPPED from the dropdown on the Tasks tab. We realized that the tasks could not enter into RUNNING state because they weren’t able to register to AWS Cloud Map as a part of Amazon ECS service discovery, and they reported the following error message:
The front-end service was not able to register new tasks because the maximum Instances per service limit in AWS Cloud Map (on which service discovery is based) had been reached. However, unlike previously increased limits such as Targets per Target Group per Region and Fargate On-Demand vCPU resource count, the instances per service limit isn’t adjustable, and you can’t request a quota increase.
Here, this behavior is expected for Amazon ECS services with AWS Cloud Map–based Amazon ECS service discovery enabled and clearly referenced within Amazon ECS service quotas public documentation as follows:
The configuration of the front-end service showing the service discovery is shown in the following details:
With the service holding steady at 1,000 running tasks, we had achieved the first goal of our scaling journey.
Before moving on to the next section of our scaling journey, it is important to highlight that Amazon ECS Service Connect, which provides management of service-to-service communication using both service discovery and a service mesh in Amazon ECS, also has a limit of 1,000 tasks per service, as documented in the following details.
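A simple pre-flight check captures this hard limit. The sketch below is illustrative only: the helper function is hypothetical, and the 1,000-instance cap comes from the AWS Cloud Map quota discussed above.

```python
# AWS Cloud Map "instances per service" is a hard (non-adjustable) limit,
# and it caps any ECS service that uses Cloud Map-based service discovery
# (or ECS Service Connect).
CLOUD_MAP_INSTANCES_PER_SERVICE = 1000

def max_reachable_tasks(desired_count: int, uses_service_discovery: bool) -> int:
    """Hypothetical helper: the running-task count a service can actually reach."""
    if uses_service_discovery:
        return min(desired_count, CLOUD_MAP_INSTANCES_PER_SERVICE)
    return desired_count

print(max_reachable_tasks(5000, uses_service_discovery=True))   # 1000
print(max_reachable_tasks(5000, uses_service_discovery=False))  # 5000
```

Running such a check before setting a desired count avoids the repeated PENDING-to-STOPPED transitions we hit with the front-end service.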
5,000 tasks with ecsdemo-auth-no-sd
After the lessons learned from our front-end service and reaching a scale of 1,000 running tasks, we tried to achieve a scale of 5,000 running tasks, which is the maximum number of tasks per Amazon ECS service.
For this, we used the auth-no-sd service, which has an ALB but with AWS Cloud Map–based Amazon ECS service discovery disabled, as seen in the following configurations:
To start our scaling test with this service, we set the desired count to 5,000 tasks and started to observe the scaling behavior of the service. Initially, the service scheduled new tasks in batches of about 80 every 10 seconds, as expected and seen in the following screenshot:
However, when the service reached the scale of 1,000 running tasks, it started to emit the following service message, and some of the tasks transitioned to a STOPPED state.
Looking into the details of the stopped tasks individually, we realized that Amazon ECS scheduler wasn’t able to register the scheduled tasks with the ALB. The tasks failed with the error message Scheduler failed to register target with ELB.
After checking the service quotas for ELB through the dashboard shown in the following screenshot, we found that the auth-no-sd service correctly emitted the service event message about registration failure as the Targets per Application Load Balancer limit was set to 1,000 by default:
To continue our scaling journey, in addition to the Targets per Target Group per Region limit that was previously raised to 5,000, we requested two new limit increases through Service Quotas: both Targets per Application Load Balancer and Number of times a target can be registered per Application Load Balancer were raised to 5,000, bringing all three limits to 5,000.
After all three limits related to ALB and the ALB targets were updated to 5,000, our auth-no-sd service continued to schedule further tasks as expected. The service reached 5,000 running tasks successfully, as shown in the following screenshot:
Considering the desired count has been set to 5,000 tasks at 2023-06-09T13:18 UTC and the auth-no-sd service reached steady state with 5,000 running tasks at 2023-06-09T13:55 UTC, it took approximately 37 minutes for Amazon ECS to achieve a scale of 5,000 running tasks. This scaling up included the time to register into an ALB, as shown in the following screenshot.
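The timestamps above also give the effective sustained throughput for this scale-out, including ALB target registration. The short calculation below uses only the figures reported in this section.

```python
from datetime import datetime

# Desired count set at 13:18 UTC; steady state at 13:55 UTC (2023-06-09).
start = datetime.fromisoformat("2023-06-09T13:18")
end = datetime.fromisoformat("2023-06-09T13:55")

minutes = (end - start).total_seconds() / 60
print(round(minutes), round(5000 / minutes))  # 37 135 -- ~135 tasks/min sustained
```

That ~135 tasks per minute is well below the 480/min launch-only rate observed earlier, reflecting the added time for health checks and ALB registration at this scale.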
It is important to highlight that with this scaling journey of the auth-no-sd service, we achieved our second scaling milestone of 5,000 running tasks for a single Amazon ECS service.
We now have 6,000 tasks running in total in our Amazon ECS cluster across two services: auth-no-sd and front-end.
Reach 15,000+ in a single cluster
Our final milestone was to scale our services to reach a total of 16,000 tasks in the Amazon ECS cluster. For this, we used our backend service and payment-api service, which have neither Amazon ECS service discovery enabled nor an associated ALB.
After setting the desired count of the backend service to 5,000 tasks at about 2023-06-01T13:16 UTC, it took approximately 39 minutes for the service to reach 5,000 tasks at about 2023-06-01T13:55 UTC, as shown in the following screenshot.
We’ve reached 11,000 tasks running in total in our Amazon ECS cluster across three Amazon ECS services: auth-no-sd (5,000 tasks), front-end (1,000 tasks), and backend (5,000 tasks).
Now, we go to our fourth and last Amazon ECS service, payment-api, and update its desired count to 5,000 tasks. Then we observe its scheduling and scaling.
Similar to previous experience with the backend service, after setting the desired count of the payment-api service to 5,000 tasks at about 2023-06-01T14:10 UTC, it took approximately 42 minutes for the service to reach 5,000 tasks at about 2023-06-01T14:52 UTC, as shown in the following screenshot.
Now, we reached our final goal of 16,000 tasks running on a single Amazon ECS cluster across four services: auth-no-sd, front-end, backend, and payment-api, as seen in the following screenshot.
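The final cluster state can be tallied directly from the per-service counts reported above:

```python
# Running-task counts per service at the end of the scaling exercise.
running = {
    "ecsdemo-frontend": 1000,    # capped by the Cloud Map instances-per-service limit
    "ecsdemo-auth-no-sd": 5000,  # per-service maximum
    "ecsdemo-backend": 5000,     # per-service maximum
    "ecsdemo-payment-api": 5000, # per-service maximum
}

print(sum(running.values()))  # 16000
```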
Key takeaways
- While using AWS Fargate, it is important to have enough private IPv4 addresses within your subnets, since awsvpc network mode requires and assigns one private IPv4 address for each task.
- Amazon ECS provides powerful scheduling performance and task launch rate out of the box. For more details, see: Under the hood: Amazon Elastic Container Service and AWS Fargate increase task launch rates
- AWS Fargate uses vCPU-based quotas, and it is important to ensure that you have the right value and limit set based on your scaling requirements.
- Amazon ECS services configured to use AWS Cloud Map–based Amazon ECS service discovery have a limit of 1,000 tasks per service. This is due to the AWS Cloud Map service quota for the number of instances per service. This is a hard limit.
- Amazon ECS enables you to run 5,000 services in a single cluster, and each Amazon ECS service can run 5,000 tasks. This scale allows you to run thousands of tasks in a single cluster.
- It is also important to consider the quotas for your ALBs. During our scaling journey, Targets per Target Group per Region, Targets per Application Load Balancer, and Number of times a target can be registered per Application Load Balancer were all raised to 5,000.
- If your workload requirements exceed a hard limit, then you should consider sharding your workloads using a cell-based architecture. For more details, see Guidance for Cell-based Architecture on AWS.
Amazon ECS with Amazon EC2 launch type
During the scaling experience, we primarily used AWS Fargate as our launch type, and many of the lessons learned can also be applied to the Amazon Elastic Compute Cloud (Amazon EC2) launch type. However, the following considerations are specific to the Amazon ECS with Amazon EC2 launch type:
- While scaling your tasks, it is important to ensure that you have sufficient Amazon EC2 instance capacity required to place your tasks. You should use Amazon ECS capacity providers to automatically scale underlying instances based on application demand.
- It is important to note that each Amazon ECS task that uses the awsvpc network mode receives its own ENI, which is attached to the Amazon EC2 instance. There’s a default quota for the number of network interfaces that can be attached to an Amazon EC2 Linux instance.
- Whenever awsvpc network mode is used either with Amazon EC2 or AWS Fargate launch type while scaling your Amazon ECS workloads, consider these two important quotas:
- The Network interfaces per Region quota is the maximum number of network interfaces in an AWS Region for your account.
- The Elastic IP addresses per Region quota is the maximum number of elastic IP addresses in an AWS Region.
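Because each awsvpc task on the EC2 launch type consumes one ENI on its host, per-instance ENI limits bound task density and therefore cluster size. The sketch below is a rough sizing illustration only: the two ENI densities are placeholder values, so look up the actual per-instance-type ENI limits (and the higher densities available with ENI trunking) before sizing a real cluster.

```python
import math

def instances_needed(tasks: int, task_enis_per_instance: int) -> int:
    """Instances required when each instance can host a fixed number of task ENIs."""
    return math.ceil(tasks / task_enis_per_instance)

# Placeholder densities: a low per-instance ENI limit versus a much
# higher one, such as what ENI trunking can provide on supported types.
print(instances_needed(16000, 3))   # 5334 instances at a low ENI density
print(instances_needed(16000, 30))  # 534 instances at a higher density
```

The order-of-magnitude gap between the two results shows why ENI trunking (or larger instance types) matters when running awsvpc tasks at this scale on EC2.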
In this post, we showed you the powerful simplicity of Amazon ECS with AWS Fargate to seamlessly deploy and scale your applications. We demonstrated how you can quickly scale your applications to thousands of tasks on a brand-new AWS account with Amazon ECS on AWS Fargate, without having to plan and manage underlying infrastructure. You can refer to the following resources for additional guidance on operating at scale with Amazon ECS: