Rearchitecting AWS Batch managed services to leverage AWS Fargate
AWS service teams continuously improve the underlying infrastructure and operations of managed services, and AWS Batch is no exception. The AWS Batch team recently moved most of their job scheduler fleet to a serverless infrastructure model leveraging AWS Fargate. I had a chance to sit with Devendra Chavan, Senior Software Development Engineer on the AWS Batch team, to discuss the move to AWS Fargate and its impact on the Batch managed scheduler service component.
First off, what is AWS Batch and what benefits does it provide our customers?
AWS Batch enables customers to run batch computing jobs on AWS. It removes the undifferentiated heavy lifting of configuring and managing the required infrastructure, much like traditional batch computing software. The Batch service can efficiently provision resources in response to jobs submitted in order to eliminate capacity constraints, reduce compute costs, and deliver results quickly. It plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as AWS Fargate and Amazon EC2 On-Demand and Spot Instances.
What is a Batch scheduler and what does it do?
AWS Batch provides a cloud-native scheduler that is responsible for evaluating jobs that you have submitted to a queue, and managing the lifecycle of those jobs. The scheduler is a managed service that handles job dependencies, timeouts, and retries. It also helps dynamically provision the optimal quantity and type of compute resources based on the aggregate resource requirements of the submitted jobs. The scheduler performs these operations using a mix of poll-based and event-driven mechanisms. It also periodically checks in with the Batch control plane to determine if any configurations have changed.
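To make dependency handling and retries concrete, here is a minimal sketch in Python. It is not the Batch scheduler's implementation. The function names `run_order` and `run_with_retries` are hypothetical, and the sketch assumes dependencies are given as a plain mapping from a job name to the jobs it waits on.

```python
from graphlib import TopologicalSorter


def run_order(dependencies):
    """Return job names in an order that respects their dependencies.

    `dependencies` maps a job name to the set of jobs it depends on.
    (Hypothetical input shape, chosen for illustration.)
    """
    return list(TopologicalSorter(dependencies).static_order())


def run_with_retries(job, attempts):
    """Call `job(attempt)` until it succeeds or `attempts` runs out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return job(attempt)
        except RuntimeError as error:
            # Treat RuntimeError as a retryable failure in this sketch.
            last_error = error
    raise last_error
```

A job that depends on another is only started after its dependency appears earlier in the returned order, and a job that keeps failing is abandoned once its retry budget is spent.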
Recently, the engineering team decided to move some services to a serverless model based on AWS Fargate. What was there before and what motivated that move?
Each AWS Batch customer gets their own scheduler process running as an ECS task. In the previous architecture, Batch schedulers ran on EC2 instances in Auto Scaling groups managed by AWS CloudFormation. In AWS Regions where Batch has many customers, we were reaching some scaling performance limits for our CloudFormation-managed capacity. Specifically, we have a goal for updates to the underlying EC2 instances to complete within a 1-hour window. In large Regions, these updates would at times fail to meet this service level objective. This limited how well we could scale out the scheduler fleet as Batch grew more popular. We needed to find another solution, which was to build on AWS Fargate.
AWS Fargate allows you to use Amazon ECS to run containers without managing clusters of EC2 instances. Instead, you package your ECS workload up as a Fargate task, specifying the operating system, security and networking configuration, and resource requirements, then launch the application. Scaling Fargate tasks becomes a matter of setting a desired task count, with the Fargate service transparently taking care of monitoring and scaling on your behalf.
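As a rough illustration of what "packaging a workload as a Fargate task" involves, the sketch below builds the two request payloads as plain Python dictionaries. The keys follow the parameter names accepted by the boto3 ECS client's `register_task_definition` and `create_service` calls. The helper names, the default sizes, and every value passed in are assumptions made for this example, and a real task that pulls from a private registry would typically also need an execution role, which is omitted here.

```python
def fargate_task_definition(family, image, cpu="256", memory="512"):
    """Build keyword arguments shaped for ECS register_task_definition."""
    return {
        "family": family,
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",  # Fargate tasks use awsvpc networking
        "cpu": cpu,               # task-level CPU units, as a string
        "memory": memory,         # task-level memory in MiB, as a string
        "runtimePlatform": {
            "operatingSystemFamily": "LINUX",
            "cpuArchitecture": "X86_64",
        },
        "containerDefinitions": [
            {"name": family, "image": image, "essential": True},
        ],
    }


def fargate_service(cluster, name, task_definition, desired_count,
                    subnets, security_groups):
    """Build keyword arguments shaped for ECS create_service."""
    return {
        "cluster": cluster,
        "serviceName": name,
        "taskDefinition": task_definition,
        "desiredCount": desired_count,  # the scaling knob: number of tasks
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
                "assignPublicIp": "DISABLED",
            }
        },
    }
```

With boto3 installed and credentials configured, these dictionaries could be passed as `**kwargs` to the corresponding client calls. Changing capacity then comes down to updating `desiredCount`.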
Moving to Fargate tasks had a couple of advantages. First, Fargate tasks eliminated the overhead of periodically patching the host EC2 instances that powered the Batch scheduler fleet, work that took valuable engineering time away from development. One reason it took so much effort was that transient failures sometimes occurred during the update process while connecting to dependent services. This could cause EC2 instance replacements to fail, which prevented the Auto Scaling group from stabilizing. This in turn caused a CloudFormation rollback, which could time out (or at least take a long time). Moving to Fargate completely eliminated this issue since the Fargate service team handles this work for us.
Second, Fargate tasks gave us granular control of our fleet capacity. AWS Batch would periodically adjust the desired capacity in our Auto Scaling groups based on how many schedulers needed to run. These groups were configured to use large EC2 instances to efficiently utilize available capacity, and to launch in multiple Availability Zones to provide high availability. This meant that AWS Batch would scale up its fleet with multiple large instances at a time across the Availability Zones. For small Regions in particular, this resulted in significant idle capacity, with up to 80% of the fleet sitting unused by customer schedulers. Moving to Fargate has allowed us to scale up one task at a time as new customer schedulers are provisioned, rather than in large chunks. We maintain high availability as the ECS service is now responsible for balancing tasks across Availability Zones.
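The effect of chunked scaling on idle capacity can be worked through with a little arithmetic. The sketch below is a simplified model, not the team's capacity planner: it assumes capacity grows by one large instance per Availability Zone at a time, and the numbers in the usage note (12 schedulers, 20 scheduler slots per instance, 3 zones) are hypothetical values picked for illustration.

```python
import math


def chunked_capacity(schedulers, slots_per_instance, zones):
    """Slots provisioned when capacity grows one instance per zone at a time."""
    if schedulers == 0:
        return 0
    chunk = slots_per_instance * zones
    # Round demand up to the next whole chunk of instances.
    return math.ceil(schedulers / chunk) * chunk


def idle_fraction(schedulers, capacity):
    """Share of provisioned slots not running a customer scheduler."""
    if capacity == 0:
        return 0.0
    return (capacity - schedulers) / capacity
```

Under these assumptions, a small Region with 12 schedulers would provision `chunked_capacity(12, 20, 3)` slots, which is 60, leaving 48 of them idle. Scaling one task at a time provisions exactly 12, so the idle share drops to zero.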
Moving to such a different architecture does not seem straightforward. What changes to your team’s overall development methods and operational tooling were implemented as part of this move?
The Batch scheduler was already a containerized application. Rather than run those containers on hosts we manage, our new approach uses ECS services running on Fargate. We use pipelines built with the AWS Cloud Development Kit (CDK) that deploy the schedulers over a series of cells in all supported AWS Regions. To reduce our operational overhead, we invested early in automating the management of this cellular infrastructure. To achieve this, we built CDK-based stacks that include infrastructure to manage the scheduler containers, provide monitoring (dashboards and alarms), and support compliance.
To ensure a safe migration, we deployed the new scheduler fleet to all supported Regions using a new set of AWS accounts, while the existing schedulers were running production customer workloads. We then validated the new fleet’s behavior with synthetic tests before gradually migrating production customers to it over a period of a few weeks. The migration was driven by a dynamic configuration that determined the percentage of schedulers that needed to run on the new scheduler fleet. Once the new Fargate-based schedulers ran for a few weeks with no operational issues, we scaled down the EC2 instances hosting the original fleet, returning that capacity to our customers.
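One common way to implement a percentage-driven migration like this is to hash each customer into a stable bucket and compare it against the configured percentage. The sketch below shows that general technique. It is an assumption about how such a dial could work, not a description of the Batch team's configuration system, and the names `rollout_bucket` and `runs_on_new_fleet` are hypothetical.

```python
import hashlib


def rollout_bucket(customer_id):
    """Map a customer ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def runs_on_new_fleet(customer_id, percent):
    """Decide placement for a customer given the configured percentage.

    `percent` is the dynamic configuration value (0 to 100). Because the
    bucket never changes, raising the percentage only moves more customers
    to the new fleet and never moves anyone back.
    """
    return rollout_bucket(customer_id) < percent
```

Setting the percentage to 0 keeps every scheduler on the old fleet, 100 moves all of them, and lowering the value acts as a rollback for the customers whose buckets fall above it.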
Nice. So, what was the result? Was this a good move overall?
Definitely. AWS Batch has migrated more than 98% of its scheduler fleet to Fargate. In Regions with few Batch customers, we were able to reclaim up to 80% of our original fleet capacity due to the granular controls provided by Fargate. Meanwhile, our team’s operational challenges stemming from transient EC2 bootstrapping failures have been largely eliminated. Also, adopting a cellular architecture for our scheduler fleet has helped reduce the blast radius from a bad deployment, while simplifying the work of scaling AWS Batch in a Region as the service grows. If you are interested in cell-based architectures, this re:Invent session from Peter Voshall is a good introduction to the topic.
So what’s next?
We still have a few managed EC2 instances that run schedulers for customers whose job scheduling needs were too large for Fargate-based scheduler tasks to handle when we implemented the new architecture. Now that Fargate supports larger task sizes, we can decommission the last remnants of our managed EC2 instances. We are also looking at ways to use just-in-time customer usage data to better scale and automate Batch scheduler services.