How Amazon’s Search M5 team optimizes compute resources and cost with fair-share scheduling on AWS Batch
This post was contributed by: Kamalakannan Hari Krishna Moorthy (SDE), Ameeta Muralidharan (SDE), Vijay Rajakumar (Sr SDE), R J (Principal Engineer), James Park (Sr Solutions Architect)
The M5 program within Amazon Search owns the discovery learning strategy for Amazon and builds large-scale models across modalities: multilingual, multi-entity, and multitask. To build and train models with billions of parameters at scale, M5 uses accelerated compute such as Amazon Elastic Compute Cloud (Amazon EC2) instances with GPUs and AWS Trainium. One of our central tenets is to keep the infrastructure and operational costs under control.
In this post, we focus on how we evolved our systems to manage accelerated compute resources efficiently, and schedule the distributed deep learning workloads by leveraging AWS Batch fair-share scheduling. By continuously improving the approach to managing compute resources and scheduling, we have: 1) reduced idle resources by 14%; 2) increased GPU utilization of our fleet by 19%; and 3) eliminated downtime during reallocation of compute.
The evolution of our queue system over time
Initially, M5 was a one-pizza team, and each developer was assigned some compute resources to run their experiments (Figure 1). However, this approach left resources idle whenever a developer wasn't actively running an experiment. As the team grew, the number of experiments grew with it, and eventually the compute required by the experiments exceeded the compute available. This constraint meant we needed a job queue to manage the experiment lifecycle on the available compute in a scalable and reproducible manner.
We chose AWS Batch as the solution to this requirement and succeeded in efficiently sharing our compute resources across different AWS Regions, boosting our experiment velocity (Figure 2). Batch is a fully managed service that plans, schedules, and executes containerized ML workloads on different AWS compute offerings such as Amazon EC2 (both Spot and On-Demand Instances), Amazon ECS, Amazon EKS, and AWS Fargate. Distributed training jobs in M5 are executed on a cluster of accelerated compute resources using multi-node parallel jobs in Batch.
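To make the multi-node parallel (MNP) mechanism concrete, here is a minimal sketch of an MNP job definition as it would be passed to the Batch `RegisterJobDefinition` API. The definition name, container image, node count, and resource values are illustrative placeholders, not M5's actual configuration.

```python
# Hypothetical multi-node parallel (MNP) job definition for distributed
# training. All names and values are placeholders for illustration.
mnp_job_definition = {
    "jobDefinitionName": "mnp-training",
    "type": "multinode",  # marks this as a multi-node parallel job
    "nodeProperties": {
        "numNodes": 4,
        "mainNode": 0,  # index of the node that coordinates the job
        "nodeRangeProperties": [
            {
                "targetNodes": "0:",  # this range covers all nodes
                "container": {
                    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/trainer:latest",
                    "resourceRequirements": [
                        {"type": "VCPU", "value": "32"},
                        {"type": "MEMORY", "value": "244000"},
                        {"type": "GPU", "value": "4"},
                    ],
                },
            }
        ],
    },
}

# The payload maps directly onto the Batch API, e.g. with boto3:
#   boto3.client("batch").register_job_definition(**mnp_job_definition)
```

Batch provisions all `numNodes` instances before the job starts, which is why the scheduling behavior discussed below matters so much for large jobs.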
As time progressed and our team size (and experiments and workloads) increased, we encountered another scheduling challenge. The Batch queue employed a first-in-first-out (FIFO) strategy for job placement, which sometimes resulted in a series of long-running jobs occupying compute resources for extended periods and dominating the queue. As a consequence, other jobs had to wait longer than we wanted before they could access the required compute resources and certain internal teams faced difficulties in obtaining compute resources promptly for their experiments. The single FIFO queue strategy was impacting their ability to meet business-driven timelines (Figure 3).
To enable experiments from all teams to proceed at an equal pace, we divided the total compute across multiple new FIFO queues, so that each team received a dedicated portion of the compute for its experiments and could decide prioritization on its own.
Team-specific FIFO queues worked well for teams to schedule their jobs in a timely manner. However, we observed two main disadvantages over time:
- Reallocation of resources across queues involved downtime. Changes to resource allocation were carried out by scaling down compute in a source queue and scaling it up in a target queue. Behind the scenes, this meant terminating EC2 instances in the Batch compute environment linked to the source queue and re-provisioning them in the compute environment linked to the target queue. Because accelerated compute is in limited supply, the re-provisioning step could not be completed in a predictable timeframe: when you terminate an EC2 instance, its capacity returns to the shared pool and becomes available to numerous other AWS accounts. In our experience, capacity reallocations took anywhere from a few days to a few weeks, incurring cost and time during which no queue could use those instances.
- Inefficient use of resources became a problem again. We started observing unused or idle resources in the team-specific FIFO queues, driven mainly by three factors: 1) uneven job traffic across teams left some queues under-saturated (fewer jobs than available resources); 2) scheduling overhead, the time the scheduler takes to gather N instances to start an N-node job when all N nodes are not readily available; and 3) fragmentation, idle instances scattered across the 20 queues. Even though fragmented instances could be in the same AWS Availability Zone (AZ), they cannot be combined to schedule a job if they belong to different Batch compute environments (CEs), the compute pools backing job queues. Unused compute resources amounted to about 23.6% of our total fleet, more than enough cause to explore solutions for reducing idle resources across our queues.
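The fragmentation effect is easy to see with a toy calculation: the same number of idle instances can or cannot host a multi-node job depending on whether they sit in one pool or many. The queue names and counts below are made up for illustration.

```python
# Toy illustration of fragmentation across compute environments (CEs).
# 20 CEs each holding 1 idle instance cannot place any multi-node job
# larger than 1 node, even though 20 instances are idle in total.
idle_per_queue = {f"queue-{i}": 1 for i in range(20)}

def max_job_size(pools):
    """Largest multi-node job that can be placed, given per-pool idle counts.

    A multi-node job must fit entirely within one pool, so the answer is
    the largest single pool, not the total.
    """
    return max(pools.values(), default=0)

fragmented = max_job_size(idle_per_queue)                       # 20 pools of 1
unified = max_job_size({"unified": sum(idle_per_queue.values())})  # 1 pool of 20

print(fragmented, unified)  # → 1 20
```

An 8-node job fits in the unified pool but in none of the fragmented ones, which is exactly the waste the consolidation described below targets.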
To overcome these disadvantages, we evaluated the then newly announced fair-share scheduling (FSS) policies for Batch job queues. With FSS, Batch offers configurable parameters such as share identifiers, weights for share identifiers, job priority, and fairness duration to ensure fair allocation of compute resources to several users within a single queue. Here is how we used these parameters:
- shareIdentifier: represents the internal teams that share the compute
- weightFactor: the share of compute allocated to each shareIdentifier
- shareDecaySeconds: the period over which Batch calculates fairness of usage by a shareIdentifier
- computeReservation: the share of compute set aside for inactive shareIdentifiers, i.e., share identifiers that have no active jobs
- jobPriority: determines the priority of jobs within a shareIdentifier; a job with higher priority is scheduled ahead of a job with lower priority, irrespective of their creation times
The first four parameters are configured through a scheduling policy attached to the job queue. Once configured, the Batch scheduler tracks compute usage and schedules jobs so that resources are allocated fairly. The jobPriority parameter is set either in the job definition or at job submission time. Using these parameters, we created a plan to move from team-specific FIFO queues to a unified FSS queue in which teams share capacity while retaining the advantages of team-specific queues.
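The first four parameters map directly onto a Batch scheduling policy payload. The sketch below shows one possible shape, assuming three hypothetical teams as share identifiers; the policy name, team names, and values are illustrative, not M5's actual configuration.

```python
# Sketch of a fair-share scheduling policy payload. Team names, the policy
# name, and all numeric values are hypothetical.
policy = {
    "name": "unified-fss-policy",
    "fairsharePolicy": {
        # Window over which Batch averages usage when computing fairness.
        "shareDecaySeconds": 600,
        # Percentage of capacity held back for share identifiers that
        # currently have no active jobs.
        "computeReservation": 5,
        # Note: in the Batch API, a LOWER weightFactor means a LARGER
        # share of the queue's compute.
        "shareDistribution": [
            {"shareIdentifier": "teamA", "weightFactor": 0.5},
            {"shareIdentifier": "teamB", "weightFactor": 1.0},
            {"shareIdentifier": "teamC", "weightFactor": 2.0},
        ],
    },
}

# The payload maps directly onto the Batch API, e.g. with boto3:
#   boto3.client("batch").create_scheduling_policy(**policy)
```

The resulting policy is then attached to the job queue via its `schedulingPolicyArn`.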
As a first step, we converted the team-specific FIFO queues into FSS queues to reduce the idle resources caused by scheduling overhead (~3% of the total fleet) within individual queues. Allowing jobs to be submitted with different priorities meant developers could use higher-priority jobs to take advantage of instances left idle by scheduling overhead. For example, if a job that needs 24 instances is waiting to be scheduled while the queue has only 16 available, a new incoming job with higher priority can be scheduled immediately if it requires 16 nodes or fewer. Because CEs can be attached to more than one queue, we switched ~20 job queues from FIFO to FSS without any downtime.
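A job submission in an FSS queue carries both the team's share identifier and an optional priority override. The request below is a hypothetical example of the pattern described above; the queue, job definition, and share names are placeholders.

```python
# Hypothetical job submission into an FSS queue. A higher
# schedulingPriorityOverride lets this 16-node job jump ahead of a
# larger job still waiting for capacity. All names are placeholders.
job_request = {
    "jobName": "pretrain-smoke-test",
    "jobQueue": "unified-fss-queue",
    "jobDefinition": "mnp-training",
    # Attributes usage to a team for fair-share accounting.
    "shareIdentifier": "teamA",
    # Within a shareIdentifier, higher values are scheduled first.
    "schedulingPriorityOverride": 100,
    # Override the node count from the job definition.
    "nodeOverrides": {"numNodes": 16},
}

#   boto3.client("batch").submit_job(**job_request)
```

This is the mechanism that lets a 16-node job slot into idle capacity while a 24-node job is still waiting for its full instance count.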
Next, we consolidated our capacity pools, which had been partitioned logically as several On-Demand Capacity Reservations (ODCRs), into fewer ODCRs grouped by AZ. We then carried out the consolidation of the 20 FSS queues in two phases. In the first phase, we reduced the 20 FSS queues to 3 along project boundaries. The CEs from the 20 queues were retained and attached as-is to one of the 3 FSS queues to avoid downtime. This phase gave us an opportunity to dry-run the new scheduling approach while limiting our blast radius in case of an unforeseen outage. At the end of phase 1, we had reduced idle compute resources from 23.6% to 20%, with fleet GPU utilization at 69%.
In the second phase, we created the unified FSS queue with fewer CEs to reduce fragmentation and migrated the compute resources from the staging FSS queues. Each team using the unified queue received a share of our fleet's compute, configured through weightFactors on the scheduling policy. Values for weightFactor were based on the level of compute a team needed in the near future. Changes to resource allocation are now handled by simply updating the weights, which eliminated downtime while moving capacity between teams. After completing phase 2, idle resources dropped from 20% to 9.5%, and our fleet's aggregate GPU utilization increased from 69% to 88.7% due to the reduction in idle compute. (If you are interested in how we measure GPU utilization, check out the AWS Deep Learning AMI documentation.)
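The weight-only reallocation works because no instances move between compute environments; only the scheduling policy changes. A sketch of such an update, with hypothetical names, values, and account ID:

```python
# Reallocating capacity between two teams by updating weights only:
# no EC2 instances are terminated or re-provisioned. All names, values,
# and the ARN below are illustrative placeholders.
update_request = {
    "arn": "arn:aws:batch:us-east-1:111122223333:scheduling-policy/unified-fss-policy",
    "fairsharePolicy": {
        "shareDecaySeconds": 600,
        "computeReservation": 5,
        "shareDistribution": [
            # Lower weightFactor => larger share, so this shifts
            # capacity from teamA to teamB.
            {"shareIdentifier": "teamA", "weightFactor": 1.0},
            {"shareIdentifier": "teamB", "weightFactor": 0.5},
        ],
    },
}

#   boto3.client("batch").update_scheduling_policy(**update_request)
```

The scheduler converges to the new shares as running jobs finish, so the reallocation has no downtime and no re-provisioning delay.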
Figure 5. Compute managed and shared by teams with a single FSS queue, improving utilization while retaining fairness in allocation
In this post, we described how Amazon's Search M5 team continuously optimized its compute resources while owning and operating one of the largest accelerated compute clusters at Amazon. By moving to a unified fair-share scheduling queue on AWS Batch, M5 met its requirements with 14% fewer resources and improved GPU utilization by 19%, while also eliminating downtime during capacity allocation changes. We also published a self-paced workshop on how to leverage AWS Batch multi-node parallel jobs to train deep learning models, which you can try in your own account here. If you are interested in more detail about fair share, read the previous deep-dive blog post on fair-share scheduling policy parameters.