AWS HPC Blog

Diving Deeper into Fair-Share Scheduling in AWS Batch

In our previous blog post, we introduced fair share scheduling policies for AWS Batch job queues. We thought it would be worth digging deeper into what the scheduling policy parameters mean in detail and in practice, showing how policies influence the placement of jobs on compute resources with a few examples. Specifically, we will cover the effects of the weightFactor, shareDecaySeconds, and computeReservation parameters. We'll finish off the post with some use cases where fair share queues may be applicable in your AWS Batch environments. Let's get started!

Understanding the allocation of available resources for jobs in the queue

As we mentioned in that initial post, a scheduling policy's purpose is to allow the Batch job scheduler to meter out equitable access to a set of shared compute resources for different workloads. The different workloads are identified by defining share identifiers in the scheduling policy attached to the job queue, and then submitting jobs with a specific shareIdentifier.

To allocate resources across a set of active share identifiers, the AWS Batch scheduler calculates each share's aggregate usage by summing the vCPU-seconds of all its actively running and recently completed jobs. Shares with a higher aggregate usage are less likely than shares with a lower value to be assigned available resources. Over time, this results in your compute resources being fairly distributed across shares according to your job queue's fair share policy.

A share's aggregate usage is calculated within a limited time window, known as the share decay. Specifically, any job in the STARTING or RUNNING state is considered active and counts toward the share's aggregate usage metric. Jobs that completed (reaching the SUCCEEDED or FAILED state) within the share decay time window are also counted in the aggregate usage metric.

The net effect is that shares with recently active jobs will have a higher aggregate usage than shares whose jobs ended at an earlier time. This is worth repeating:

  • A higher aggregate usage for a share means that a lower percentage of resources will be assigned to that share, as resources become available.
  • Since fewer resources are allocated, the aggregate usage of that share will decrease.
  • As the aggregate usage becomes lower than other shares, it will preferentially be assigned resources again.

Over time, this job scheduling and placement behavior converges to the distribution defined in your scheduling policy for the active share identifiers.
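
To build intuition for this feedback loop, here is a deliberately simplified toy model in Python. It is not AWS Batch's actual scheduler (real aggregate usage is measured in vCPU-seconds, decays over the shareDecaySeconds window, and interacts with weight factors, compute reservations, and instance sizes); it only illustrates how preferring the share with the lowest recent usage balances allocation over time.

SLOTS_PER_ROUND = 8     # resources that free up each scheduling round
ROUNDS = 12
DECAY = 0.8             # fraction of past usage retained each round

# Pretend "yellow" has been running jobs for a while and "blue" just arrived.
usage = {"yellow": 40.0, "blue": 0.0}

for rnd in range(ROUNDS):
    assigned = {"yellow": 0, "blue": 0}
    for _ in range(SLOTS_PER_ROUND):
        # Prefer the share with the lowest aggregate usage.
        share = min(usage, key=usage.get)
        usage[share] += 1.0
        assigned[share] += 1
    # Older usage ages out of the decay window over time.
    usage = {s: u * DECAY for s, u in usage.items()}
    print(f"round {rnd:2d}: assigned yellow={assigned['yellow']} blue={assigned['blue']}; "
          f"usage yellow={usage['yellow']:.1f} blue={usage['blue']:.1f}")

Running this, blue receives nearly all of the slots at first, and the allocation settles toward an even split after a few rounds, mirroring the convergence described above.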

Verifying Share Decay Behavior

To better understand how the scheduler behaves, let's take the following example fair share policy, which designates two share IDs that should get an equal amount of resources over time:

{
  "fairsharePolicy": {
    "shareDistribution": [
      {
        "shareIdentifier": "yellow"
      },
      {
        "shareIdentifier": "blue"
      }
    ]
  }
}
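
As a concrete (and hedged) sketch, the policy above could be created and attached to a fair share job queue with boto3 along these lines. The policy, queue, and compute environment names here are our own placeholders, and the compute environment is assumed to already exist:

import boto3

batch = boto3.client("batch")

# Create the fair share scheduling policy shown above.
policy = batch.create_scheduling_policy(
    name="equal-shares",
    fairsharePolicy={
        "shareDistribution": [
            {"shareIdentifier": "yellow"},
            {"shareIdentifier": "blue"},
        ]
    },
)

# Attach the policy to a new job queue; this is what makes the queue fair share.
batch.create_job_queue(
    jobQueueName="fss-queue",
    state="ENABLED",
    priority=1,
    schedulingPolicyArn=policy["arn"],
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": "fss-ce"},
    ],
)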

We also defined the following Batch resources:

  1. A compute environment with a maximum of 64 vCPUs, consisting of m5.2xlarge instances.
  2. A fair share job queue with this policy we just defined.
  3. A job definition that sleeps for a random duration between 2 and 5 minutes.

With these pieces in place, we submitted an array job with 500 child jobs using the yellow share identifier, waited three minutes, then submitted another array job using the blue share identifier. Since we submitted the yellow jobs first, we expected all available job slots to be occupied by yellow jobs at the start. As resources become available, blue jobs should preferentially be assigned resources until the blue share's aggregate usage becomes greater than yellow's, at which point the scheduler starts to prefer yellow jobs again. This behavior is shown in Figure 1.
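
For reference, the submission side of this experiment looks roughly like the following boto3 sketch. The queue and job definition names are the hypothetical ones from the earlier sketch, and the job definition is assumed to run a container that sleeps for a random 2-5 minutes:

import time
import boto3

batch = boto3.client("batch")

def submit_array(share_id: str, size: int = 500) -> str:
    """Submit an array job tagged with the given share identifier."""
    response = batch.submit_job(
        jobName=f"sleep-{share_id}",
        jobQueue="fss-queue",
        jobDefinition="random-sleep",
        shareIdentifier=share_id,       # ties the job to a share in the policy
        arrayProperties={"size": size},
    )
    return response["jobId"]

submit_array("yellow")
time.sleep(180)   # wait three minutes before submitting the second share
submit_array("blue")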

Figure 1: A line chart showing the number of yellow and blue pending and running jobs over time. The data shows that AWS Batch uses historical usage data to preferentially place jobs from a share that does not have recent usage, balancing toward an equal amount of compute resources over time.

The key to interpreting these graphs is that the dashed lines indicate pending jobs, while the solid lines (toward the bottom) show jobs actually running and jostling for a share of the compute resources. As each chart proceeds to the right along the time axis, you can see how the various policies affect how quickly the pending jobs from each shareIdentifier are depleted.

Extending the share decay time window

The default minimum time window for share decay is ten minutes (600 seconds), but you can define a larger time window in the fair share policy using the shareDecaySeconds parameter. This lets you tweak how the scheduler calculates a share's aggregate usage by also considering jobs that finished more than ten minutes in the past. This is worth exploring if your jobs differ significantly in job counts or run times between share identifiers and you find that the allocation of resources is not meeting your needs. Otherwise, stick with the default.
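
As a sketch of what that looks like, the fair share policy below extends the decay window to one hour, so jobs that finished up to 3,600 seconds ago still count toward a share's aggregate usage (the policy name is a placeholder):

import boto3

batch = boto3.client("batch")

batch.create_scheduling_policy(
    name="equal-shares-long-decay",
    fairsharePolicy={
        "shareDecaySeconds": 3600,   # consider jobs that finished up to an hour ago
        "shareDistribution": [
            {"shareIdentifier": "yellow"},
            {"shareIdentifier": "blue"},
        ],
    },
)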

Adjusting a share’s aggregate usage of compute with weightFactor

So far, we've seen how AWS Batch uses the recent history of jobs to calculate a share's aggregate vCPU usage, and then uses that value to allocate resources to the different shares over time. In our example, we assumed that you wanted each share to be treated equally when resources are assigned.

Sometimes, however, you want to give preference to a particular share ID over others. For example, say your share identifiers are aligned to organizational departments, in this case human resources ("HR") and production machine learning services ("MLOps"). You may want to designate that jobs from "MLOps" get a higher percentage of resources than "HR". The way to do this is to increase or decrease the calculated aggregate usage of a share using the weightFactor parameter.

weightFactor is a modifier applied to the computed aggregate usage metric of a given share. The default is 1.0, which means the calculated aggregate usage of the share is not modified. A weightFactor less than 1.0 results in a lower value for aggregate usage, so that share is allocated a higher percentage of available resources than it would normally get. A weightFactor greater than 1.0 has the opposite effect: a higher aggregate usage for the share, and a lower percentage of resources allocated than Batch would normally assign.

To see how this works, let's adapt our previous example fair share policy by setting the weight factor of the blue share identifier to 0.5. The policy also explicitly sets the yellow share's weight factor to the default value of 1.0.

{
  "fairsharePolicy": {
    "shareDistribution": [
      {
        "shareIdentifier": "yellow",
        "weightFactor": 1.0
      },
      {
        "shareIdentifier": "blue",
        "weightFactor": 0.5
      }
    ]
  }
}
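
If you already created the earlier policy, one way to apply these weight factors is to update it in place with UpdateSchedulingPolicy. This is a minimal sketch; the ARN shown is a placeholder for your own policy's ARN:

import boto3

batch = boto3.client("batch")

# Apply the weight factors above to the existing "equal-shares" policy.
batch.update_scheduling_policy(
    arn="arn:aws:batch:us-east-1:111122223333:scheduling-policy/equal-shares",
    fairsharePolicy={
        "shareDistribution": [
            {"shareIdentifier": "yellow", "weightFactor": 1.0},
            {"shareIdentifier": "blue", "weightFactor": 0.5},
        ]
    },
)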

What we expect from this policy is that jobs with the blue share identifier should get roughly twice as many compute resources as jobs with the yellow share identifier. Figure 2 shows our expected result: Batch gives more resources to the blue jobs, and as a result these jobs complete faster despite having been submitted later than the yellow jobs.

Figure 2: A graph of pending and running jobs over time. The chart shows that the blue share, which is given a weightFactor of 0.5, is allocated more resources than the yellow share, resulting in the blue pending queue being completed before yellow, despite these jobs being submitted at a later time than yellow.

An analogy for the weight factor is a car's drag coefficient: a car with lower drag requires less energy to move forward than cars with higher drag. But drag is not the only factor affecting how far or fast a car can go; it also depends on the size of the road and how many other cars are traveling alongside it. So, while a lower weightFactor value can increase the likelihood that a job will get resources, it's not guaranteed to get them right away.

Saving some capacity for other shares – compute reservations

Sometimes you want to set aside a small amount of your total capacity of compute resource just in case some high-priority work arrives and needs to run right away. Fair share policies allow you to do this by defining a compute reservation.

A compute reservation allows AWS Batch to reserve a certain amount of the total capacity for share identifiers that are not yet active in the job queue. Let’s take a look at the formula used to calculate the compute reservation of a job queue, and then use some examples to understand its implications.

Understanding the compute reservation formula

Batch uses the computeReservation parameter to determine the percentage of total capacity to hold in reserve for inactive shares. It works like this:

Reserved Capacity = (computeReservation/100)^activeShareIds

The computeReservation value is an integer expressing the percentage of your total maximum compute capacity that you want to reserve for inactive share identifiers. It is modified by activeShareIds, the number of share identifiers that are currently active in the job queue. As the number of activeShareIds increases, the fraction of capacity reserved for inactive shares decreases.
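
A quick worked example makes the relationship clear. For a computeReservation of 50, the fraction of maximum capacity held back for inactive shares drops as more shares become active:

compute_reservation = 50

for active_share_ids in range(1, 6):
    reserved = (compute_reservation / 100) ** active_share_ids
    print(f"{active_share_ids} active share(s): {reserved:.1%} reserved")

# 1 active share(s): 50.0% reserved
# 2 active share(s): 25.0% reserved
# 3 active share(s): 12.5% reserved
# 4 active share(s): 6.2% reserved
# 5 active share(s): 3.1% reserved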

To see how this works in practice, let’s consider a contrived example where we reserved 50% of our total capacity in a compute environment for share identifiers not in the job queue. To do that, we defined the following fair share policy:

{
  "fairsharePolicy": {
    "computeReservation": 50,
    "shareDistribution": [
      {
        "shareIdentifier": "yellow",
        "weightFactor": 1.0
      },
      {
        "shareIdentifier": "blue",
        "weightFactor": 1.0
      }
    ]
  }
}

In the case where the job queue only has jobs with the yellow share identifier running, the job queue will reserve 50% of the capacity for jobs with other share identifiers because: (50/100)^1 = 0.50. We’ve shown this in Figure 3.

Figure 3: An illustration of the AWS Batch job queue and compute environment showing that when capacity reservation is set to 50 and there is only one share identifier active, 50% of the capacity is reserved for jobs with a different share identifier.

As more share identifiers become active in the job queue, the amount of compute reserved for new share identifiers decreases. This ensures that fairness is maintained, and newer share identifiers are not disproportionately allocated compute resources.

Now let's consider a situation where two share identifiers, blue and yellow, are active in the job queue, and the compute reservation is set at 50. The capacity held in reserve in this case will be (50/100)^2 = 0.25, or 25%. We've shown this in Figure 4.

Figure 4: An illustration of the AWS Batch job queue and compute environment showing that when capacity reservation is set to 50 and there are two share identifiers active, 25% of the capacity is reserved for jobs with a different share identifier.

We ran the same experiment as before with this new policy, and you can see the results in Figure 5. When only a single share is active, only 50% of the compute resources are used, and this increases to 75% once two shares are active. You'll also note that blue jobs were placed immediately, since resources had been kept in reserve for them.

Figure 5: A line chart showing the number of yellow and blue pending and running jobs over time in an AWS Batch compute environment that has a compute reservation value of 50. The data shows that Batch reserves a portion of compute resources for inactive shares, corresponding to the CR 50% and CR 25% lines.

Something to note is that at the end of the run shown in Figure 5, utilization dropped below 50% even though there are two active shares! This is due to Batch's smart scale-down behavior: as the number of jobs in the queue decreases, Batch starts to pack new jobs onto a smaller number of instances and scales down the other compute resources sooner, saving you money.

The last example used equal weight factors for the shares, but what happens if the policy gave the blue share a weight factor of 0.5, as before? The answer is that the available resources are allocated according to your weights, which we've shown in Figure 6.

Figure 6: A line chart showing the number of yellow and blue pending and running jobs over time in an AWS Batch compute environment that has a compute reservation value of 50. The data shows that Batch allocates resources based on a share’s weight factor, and reserves a portion of compute resources for inactive shares.

You'll notice at the end of the run that the compute capacity remains at 75% utilization, even though only one type of job is still running. That's because blue jobs completed within the share decay time window, so the blue share is still considered "active" even though it has no currently running jobs.

Determining the size of the reserve

While our example used a large value for computeReservation to illustrate how the parameter works, in practice you should define a much smaller value for this. The intent of the computeReservation parameter is to hold a small amount of reserve capacity for urgent requests, or allow room for you to meet a minimum SLA across shares.

The exact value of computeReservation will depend on your job sizes and the expected maximum number of active shares. It should be as small as possible while still allowing jobs with an inactive share identifier to start quickly. For our example of two active shares, a value of 50 is far too high, but if you expect at least 5 active share IDs, a value of 50 translates to roughly 3% of your total maximum capacity reserved for inactive shares ((50/100)^5 ≈ 0.031).

One more thing – setting job priority within a share

One final resource allocation modifier is the scheduling priority, set with the schedulingPriority parameter on a job definition or the schedulingPriorityOverride parameter when you submit a job. This parameter only affects the relative ordering of jobs that belong to a single share ID; it does not affect the ordering of jobs belonging to other shares.

Expanding our previous example of the organizational groups "HR" and "MLOps", we used weightFactor to give preferential treatment to MLOps jobs. Within MLOps, though, you may want production jobs to get higher priority than dev/test jobs. To do that, you set a higher scheduling priority for production jobs (for example, schedulingPriority=100) than for dev/test jobs (schedulingPriority=1). This results in production jobs being placed on available resources before dev/test jobs, even if the dev/test jobs were submitted to the job queue earlier.
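
As a hedged sketch of this, both jobs below are submitted under the same MLOps share identifier, with the production job given a higher scheduling priority at submit time. The queue and job definition names are placeholders, and schedulingPriority can alternatively be set on the job definition itself:

import boto3

batch = boto3.client("batch")

# Production job: same share identifier, higher scheduling priority.
batch.submit_job(
    jobName="train-prod",
    jobQueue="fss-queue",
    jobDefinition="mlops-training",
    shareIdentifier="MLOps",
    schedulingPriorityOverride=100,
)

# Dev/test job: placed after higher-priority jobs in the same share.
batch.submit_job(
    jobName="train-devtest",
    jobQueue="fss-queue",
    jobDefinition="mlops-training",
    shareIdentifier="MLOps",
    schedulingPriorityOverride=1,
)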

Practical use cases for fair share policies

Fair share policies are useful in a lot of scenarios. Here’s just a few examples to seed your thinking:

  • The share decay time window of fair share queues helps smooth out resource usage over time by considering the history of the jobs assigned to a share. It ensures that variations in job count or run times between different share identifiers do not give a share more compute time than your policy specifies. This is particularly beneficial when a share has a spike in job submissions: in a first-in, first-out (FIFO) queue, such spikes would dominate the compute resources to the detriment of jobs from other workloads.
  • Fair share queues allow new work to start running as quickly as possible. Jobs from a share that has not been in the job queue recently will have an extremely low aggregate usage (possibly even zero), so Batch's scheduler will preferentially assign resources to them as they become available. This allows for dynamic resource allocation, ensuring that jobs submitted later receive their fair share of compute resources as soon as possible.
  • Fair share queues allow you to run heavy and light jobs simultaneously. Important lighter jobs can be executed alongside heavy jobs, instead of waiting for the heavy jobs to complete first.
  • Fair share allows efficient use of resources within a single queue. With FSS, a single job queue can be used equitably and efficiently for different types of workloads, maximizing your overall resource efficiency. This might save you splitting your compute capacity across different job queues and compute environments.
  • Fair sharing enables fair resource allocation among multiple users or workloads, ensuring equitable distribution of your total compute capacity.
  • Priority scheduling lets important jobs within each share run sooner, since you can give them a higher scheduling priority than other jobs in the same share.

Fair share policies are not magic

While fair share policies can help to overcome some of the drawbacks of a single FIFO job queue, they are not magic. When you are expecting jobs with special requirements, such as GPU accelerators or very large memory footprints, you should consider designing your Batch environment with multiple job queues. In the case of GPUs, you don't want CPU-only jobs to run on these instances, since that both wastes resources and can block a job that actually needs the GPU from starting. Similarly, very large memory jobs may block the queue for a time while an instance waits for current jobs to complete to make room for the big job.

Conclusion

In this blog post, we delved into the parameters of AWS Batch fair share scheduling policies, including compute reservations, share decay time windows, weight factor, and share priority. We explored how these parameters affect resource allocation, fairness over time, and job prioritization within share identifiers. And we discussed practical use cases where fair share proves beneficial.

By leveraging fair share queues in AWS Batch, organizations can optimize job execution, enhance resource management, and meet critical deadlines effectively. Read more about scheduling policies in the documentation or try creating your own fair share policies using the AWS Batch management console.

Angel Pizarro

Angel is a Principal Developer Advocate for HPC and scientific computing. His background is in bioinformatics application development and building system architectures for scalable computing in genomics and other high throughput life science domains.

Ashish Tak

Ashish is an AWS Batch and Amazon ECS SME and a Cloud Support Engineer in the Containers domain at AWS. He works directly with customers to troubleshoot issues on AWS Batch and Amazon ECS. Outside of work, he likes to spend time with family and play outdoor sports.

Ayush Rathore

Ayush is an AWS Batch SME and a Cloud Support Engineer in the Containers domain at AWS. His day-to-day work involves troubleshooting issues related to container technologies such as EKS, ECS, Fargate, Batch, and ECR, and conducting learning sessions on containers to educate customers.