AWS HPC Blog

Job queue snapshots: see what’s at the head of your queues in AWS Batch

In 2021, AWS Batch introduced fair share job queues, which let you create scheduling policies for a job queue so you can control how to split resources and prioritize jobs that belong to different workloads (differentiated by their jobs having different share identifiers). Before this, all Batch job queues acted as independent first-in-first-out (FIFO) queues. That meant if you had multiple groups or workloads in the same AWS account, you needed separate job queues (JQs) and compute environments (CEs) for each business need.

This in turn meant you needed to manage the distribution of the underlying compute resources across all these CEs. With the introduction of fair share scheduling (FSS), customers like Amazon Search could consolidate their environments, which reduced their operational overhead, drove up their fleet utilization, and greatly improved their throughput.

But moving away from FIFO queues introduced a different challenge: how to reliably tell which job will run next when the jobs at the head of the queue belong to different share identifiers.

Today we’ll explain a recent addition to AWS Batch that we think will address this: job queue snapshots. This is a new API we launched a few weeks ago to query the jobs that are at the head of the job queue. We’ll work through the details and give you a practical example of how to use this new information.

Inspecting the head of the queue

Using the AWS Batch management console, the AWS SDK, or the AWS CLI, you can now list the first 100 RUNNABLE jobs for a single job queue by calling the GetJobQueueSnapshot API. For FIFO job queues, jobs are ordered based on their submission time. For FSS job queues, jobs are ordered based on their share’s usage and, within a share, on job priority. You can read more about how job priority and share usage affect job scheduling in our fair share deep dive blog post.
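As a minimal sketch of the SDK route, the function below calls GetJobQueueSnapshot through a boto3 Batch client and returns the job ARNs it reports. The function name is ours, and the response shape (`frontOfQueue` containing a `jobs` list with `jobArn` entries) reflects our reading of the API; check it against the current SDK documentation before relying on it.

```python
def head_of_queue(batch_client, job_queue):
    """Return the job ARNs at the head of a job queue.

    batch_client is expected to be a boto3 Batch client, for example
    boto3.client("batch"). The ARNs are assumed to come back in the
    order the snapshot lists them.
    """
    response = batch_client.get_job_queue_snapshot(jobQueue=job_queue)
    # Assumed response shape: {"frontOfQueue": {"jobs": [{"jobArn": ...}, ...]}}
    front = response.get("frontOfQueue", {})
    return [job["jobArn"] for job in front.get("jobs", [])]
```

With credentials configured, you could call it as `head_of_queue(boto3.client("batch"), "my-fair-share-queue")`, where the queue name is a placeholder.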

Job queue snapshots are a great visibility tool for customers that need to make on-the-fly modifications to jobs in the queue. Let’s take a look at a practical example.

We’ve created a fair share job queue that uses AWS Fargate for the compute environment. For the purposes of this experiment, we’ve temporarily disabled the compute environment so that jobs stay in the job queue and we can see the results of the queue manipulations. Finally, the fair share policy gives equal weight to the two active shares, “pink” and “blue”, meaning that you should see an interleaving of jobs from each share (Figure 1).

Figure 1:  The AWS Batch management console showing the Job queue snapshot tab. The tab lists an interleaving of pink and blue jobs that are at the head of the job queue.
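For reference, an equal-weight policy like the one in this experiment might be expressed as follows. This is a sketch: the helper name and the weight factor of 1.0 are our choices for illustration, not the exact values behind the figure.

```python
def equal_share_policy(share_identifiers, weight_factor=1.0):
    """Build a fair share policy that gives every share the same weight.

    The result is shaped like the fairsharePolicy argument of the
    Batch CreateSchedulingPolicy API.
    """
    return {
        "shareDistribution": [
            {"shareIdentifier": share, "weightFactor": weight_factor}
            for share in share_identifiers
        ]
    }


# Equal weights for the two shares used in this post.
policy = equal_share_policy(["pink", "blue"])
```

You could then pass the dictionary to a boto3 Batch client, for example `batch_client.create_scheduling_policy(name="pink-blue-equal", fairsharePolicy=policy)`, where the policy name is a placeholder.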


Bob from team pink comes in with an urgent request to run high-priority jobs as soon as possible to meet an important deadline. You submit these jobs with a priority of 10, which moves them ahead of all other pink jobs, but some blue jobs are still ahead of one or more high-priority pink jobs (Figure 2). That’s because job priority only applies within a share and doesn’t affect the overall placement of jobs across shares. At this point you can determine whether the high-priority pink jobs can finish by the deadline without affecting team blue’s workload.

Figure 2: The AWS Batch management console showing the job queue snapshot tab. The tab shows that high-priority pink jobs have moved ahead of lower priority pink jobs, but are still interleaved with blue jobs.
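A submission like the one described above could look roughly like this with boto3. The helper and its arguments are hypothetical; `shareIdentifier` and `schedulingPriorityOverride` are the SubmitJob parameters we believe control the share and the within-share priority.

```python
def submit_urgent_jobs(batch_client, job_queue, job_definition, job_names,
                       share_identifier="pink", priority=10):
    """Submit jobs to one share with a raised scheduling priority.

    The priority only reorders jobs within share_identifier; it does not
    move them ahead of jobs that belong to other shares.
    """
    job_ids = []
    for name in job_names:
        response = batch_client.submit_job(
            jobName=name,
            jobQueue=job_queue,
            jobDefinition=job_definition,
            shareIdentifier=share_identifier,
            schedulingPriorityOverride=priority,
        )
        job_ids.append(response["jobId"])
    return job_ids
```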


If you think that the high priority jobs will still not finish by the deadline, then you have two options:

  1. You can temporarily adjust the share policy to prefer pink jobs over blue.
  2. You can cancel team blue’s jobs and then resubmit them once the high-priority jobs are RUNNING. Note that, in this case, blue jobs will keep their place in the queue, even though Batch has marked them for cancellation. When they reach the head of the queue, their state will immediately switch to FAILED and they won’t take up any compute resources.
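Option 2 can be sketched with the snapshot as a guide: look up the jobs at the head of the queue, find the ones in the blue share, and cancel only those. The helper below is an illustration, not a definitive implementation. It assumes the job ID is the last path segment of each job ARN and that DescribeJobs returns a `shareIdentifier` for each job.

```python
def cancel_share_at_head(batch_client, job_queue, share_identifier, reason):
    """Cancel the jobs of one share that appear in the job queue snapshot.

    Returns (jobId, jobName) pairs for the cancelled jobs so they can be
    resubmitted later.
    """
    snapshot = batch_client.get_job_queue_snapshot(jobQueue=job_queue)
    jobs = snapshot.get("frontOfQueue", {}).get("jobs", [])
    # Assumption: ARNs look like arn:aws:batch:<region>:<account>:job/<job-id>.
    job_ids = [job["jobArn"].rsplit("/", 1)[-1] for job in jobs]
    if not job_ids:
        return []

    cancelled = []
    # The snapshot holds at most 100 jobs, which fits in one DescribeJobs call.
    for detail in batch_client.describe_jobs(jobs=job_ids)["jobs"]:
        if detail.get("shareIdentifier") == share_identifier:
            batch_client.cancel_job(jobId=detail["jobId"], reason=reason)
            cancelled.append((detail["jobId"], detail["jobName"]))
    return cancelled
```

Returning the job IDs and names is a design choice to make the later resubmission easier; how you resubmit depends on your own job definitions.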

Since option 1 is less destructive, you change the scheduling policy to favor team pink’s jobs by lowering the weight factor for the pink share (a lower weight factor means a share gets more compute resources over time). This places most of the high-priority pink jobs ahead of any blue ones (Figure 3).

Figure 3: The AWS Batch management console showing the job queue snapshot tab. After lowering the weight factor for the pink share, the tab shows most pink jobs placed ahead of blue jobs.


The underlying reason blue still gets some allocation before pink is that the fair share algorithm will try to give blue jobs some allocation of resources (it’s called “fair share” for a reason). Figure 3 shows that most of the upcoming resources are for pink jobs, even the regular priority ones. At the low weight factor we set, the next blue job comes after the last pink job in our queue (not shown in the figure).
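The weight change itself is a small edit to the policy document. The sketch below copies a fair share policy and lowers one share’s weight factor. The helper name and the value 0.1 are illustrative assumptions, since the exact value we used is not stated here.

```python
import copy


def with_weight(fairshare_policy, share_identifier, weight_factor):
    """Return a copy of a fair share policy with one share's weight changed."""
    updated = copy.deepcopy(fairshare_policy)
    for entry in updated.get("shareDistribution", []):
        if entry["shareIdentifier"] == share_identifier:
            entry["weightFactor"] = weight_factor
            return updated
    raise KeyError(f"share {share_identifier!r} is not in the policy")


current = {
    "shareDistribution": [
        {"shareIdentifier": "pink", "weightFactor": 1.0},
        {"shareIdentifier": "blue", "weightFactor": 1.0},
    ]
}
# A lower weight factor gives the pink share more resources over time.
preferred_pink = with_weight(current, "pink", 0.1)
```

Applying it would then be a call such as `batch_client.update_scheduling_policy(arn=policy_arn, fairsharePolicy=preferred_pink)`, with `policy_arn` standing in for your own scheduling policy ARN.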

If at this point you’re still not sure that the pink jobs will complete by your deadline, you may need to take the more drastic option number 2. In either case, once the high-priority workloads are RUNNING, be sure to reinstate the previous share policy allocations. Otherwise Batch will keep prioritizing team pink’s jobs over team blue’s, forever.
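One way to make sure the previous policy comes back is to save it before the change and restore it in a `finally` block. The context manager below is a hedged sketch of that idea: the name is ours, and it assumes DescribeSchedulingPolicies returns the current `fairsharePolicy` in the same shape that UpdateSchedulingPolicy accepts.

```python
import copy
from contextlib import contextmanager


@contextmanager
def temporary_fairshare_policy(batch_client, policy_arn, temporary_policy):
    """Apply a fair share policy for the duration of a with-block.

    The original policy is restored on exit, even if the block raises.
    """
    described = batch_client.describe_scheduling_policies(arns=[policy_arn])
    original = copy.deepcopy(described["schedulingPolicies"][0]["fairsharePolicy"])
    batch_client.update_scheduling_policy(
        arn=policy_arn, fairsharePolicy=temporary_policy
    )
    try:
        yield original
    finally:
        batch_client.update_scheduling_policy(
            arn=policy_arn, fairsharePolicy=original
        )
```

Inside the `with` block you would wait until the high-priority jobs are RUNNING, for example by polling their status, before letting the block exit.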

Before this new feature to see what’s at the head of the queue, you may have needed to be indiscriminate about “clearing the queue” – cancelling all scheduled jobs to make room for the high priority request.

Now with job queue snapshots you can be more prescriptive about how to adjust the job queue to allow high-priority workloads to run in the time you need.

Conclusion

In this post we described a new feature for AWS Batch: job queue snapshots. This feature improves the customer and user experience with job queues by giving insight into what is at the head of the queue for both FIFO and fair share job queues. We also described a scenario that uses job queue snapshots to help make decisions about reordering workloads in a queue to handle an urgent request.

Job queue snapshots are available now in the AWS Batch console, or through the CLI or API, whichever you prefer. We think you’ll love this new feature and welcome feedback on how you use it. Also let us know if there’s more we can do to make the job of managing jobs easier.

Angel Pizarro


Angel is a Principal Developer Advocate for HPC and scientific computing. His background is in bioinformatics application development and building system architectures for scalable computing in genomics and other high throughput life science domains.

Naina Thangaraj


Naina Thangaraj is a Senior Product Manager for AWS Batch, and works in the Advanced Computing and Simulation org at AWS. Her background is in bioinformatics and prior to joining AWS, she worked in the healthcare and life sciences industry.