AWS Compute Blog

Optimizing for cost, availability and throughput by selecting your AWS Batch allocation strategy

This post is contributed by Steve Kendrex, Senior Technical Product Manager, AWS Batch

 

Introduction

 

AWS offers a broad range of instances that are advantageous for batch workloads. The scale and provisioning speed of AWS compute instances let you get up and running at peak capacity in minutes without paying for idle time. Today, I’m pleased to introduce allocation strategies: a significant new capability in AWS Batch that makes provisioning compute resources flexible and simple. In this blog post, I explain how the AWS Batch allocation strategies work, when you should use them for your workload, and provide an example CloudFormation script. This blog helps you get started on building the Compute Environment (CE) most appropriate to your workloads.

Overview

AWS Batch is a fully managed, cloud-native batch scheduler. It manages the queuing and scheduling of your batch jobs, and the resources required to run your jobs. One of AWS Batch’s great strengths is the ability to manage instance provisioning as your workload requirements and budget needs change. AWS Batch takes advantage of AWS’s broad base of compute types. For example, you can launch compute-optimized and memory-optimized instances that handle different workload types, without having to worry about building a cluster to meet peak demand.

Previously, AWS Batch had a cost-controlling approach to managing compute instances for your workloads. The service chose an instance that was the best fit for your jobs based on vCPU, memory, and GPU requirements, at the lowest cost. Now, the newly added allocation strategies provide flexibility. They allow AWS Batch to consider capacity and throughput in addition to cost when provisioning your instances. This lets you prioritize differently when launching instances depending on your workloads’ needs: controlling cost, maximizing throughput, or minimizing Amazon EC2 Spot Instance interruption rates.

There are now three instance allocation strategies from which to choose when creating an AWS Batch Compute Environment (CE). They are:

1. Spot Capacity Optimized

2. Best Fit Progressive

3. Best Fit

 

Spot Capacity Optimized

As the name implies, the Spot capacity optimized strategy is only available when launching Spot CEs in AWS Batch. In fact, I recommend the Spot capacity optimized strategy for most of your interruptible workloads running on Spot today. This strategy takes advantage of the recently released EC2 Auto Scaling and EC2 Fleet capacity optimized strategy. Next, I examine how this strategy behaves in AWS Batch.

Let’s say you’re running a simulation workload in AWS Batch. Your workload is Spot-appropriate (see this whitepaper to determine whether yours is), so you want to take advantage of the savings you can glean from using Spot. However, you also want to minimize your Spot interruption rate, so you’ve followed the Spot best practices: your jobs can run across multiple instance types and multiple Availability Zones. When creating your Spot CE in AWS Batch, enter all the instance types your workload is compatible with in the instance type field, or select ‘optimal’, which allows AWS Batch to choose from among the M, C, and R instance families. The image below shows how this appears in the console:

AWS Batch console with SPOT_CAPACITY_OPTIMIZED selected


When evaluating your workload, AWS Batch selects from the instance types allowed in your Spot CE’s compute resources parameter that are capable of running your jobs. From those, it calculates the assortment of instance types that draws on the deepest Spot capacity pools. AWS Batch then launches those instances on your behalf, and runs your jobs as the instances become available. This strategy gives you access to AWS compute resources at a fraction of the On-Demand cost, whether you’re launching hundreds of thousands (or a million!) of vCPUs on Spot or simply trying to lower your chance of interruption. Additionally, AWS Batch manages your instance pool over time to maintain the capacity needed to run your workload.

For example, as your workloads run, demand in an Availability Zone may shift, which might lead to several of your instances being reclaimed. In that event, AWS Batch automatically attempts to scale up a different instance type based on the deepest capacity pools. Assuming you set a retry attempt count, your jobs then retry automatically. AWS Batch continues scaling new instances until it either meets the desired capacity or runs out of allowed instance types to launch. That is why I recommend that you give AWS Batch as many instance types and families as possible to choose from when running Spot capacity optimized. Additional detail on this behavior can be found in the capacity optimized documentation.

To launch a Spot capacity optimized CE, follow these steps:

1. Navigate to the AWS Batch console.

2. Create a new Compute Environment.

3. Select “Spot Capacity Optimized” in the Allocation Strategy field.

Alternatively, you can use the CreateComputeEnvironment API and pass “SPOT_CAPACITY_OPTIMIZED” in the allocation strategy field. In CloudFormation, this looks like the following:

…
"TestAllocationStrategyCE": {
  "Type": "AWS::Batch::ComputeEnvironment",
  "Properties": {
    "State": "ENABLED",
    "Type": "MANAGED",
    "ComputeResources": {
      "Subnets": [
        {"Ref": "TestSubnet"}
      ],
      "InstanceRole": {
        "Ref": "TestIamInstanceProfile"
      },
      "MinvCpus": 0,
      "InstanceTypes": [
        "optimal"
      ],
      "SecurityGroupIds": [
        {"Ref": "TestSecurityGroup"}
      ],
      "DesiredvCpus": 0,
      "MaxvCpus": 12,
      "AllocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
      "Type": "SPOT"
    },
    "ServiceRole": {
      "Ref": "TestAWSBatchServiceRole"
    }
  }
},
…

Once you follow these steps, your Spot capacity optimized CE should be up and running.
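For those who prefer the API over CloudFormation, the same CE can be sketched as a CreateComputeEnvironment request via boto3. This is a minimal sketch: the subnet, security group, and IAM ARNs are placeholders, not real resources.

```python
# Minimal sketch of the equivalent CreateComputeEnvironment request for
# boto3. All resource identifiers below are placeholders you would
# replace with your own.
spot_ce_request = {
    "computeEnvironmentName": "TestAllocationStrategyCE",
    "type": "MANAGED",
    "state": "ENABLED",
    "computeResources": {
        "type": "SPOT",
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
        "minvCpus": 0,
        "desiredvCpus": 0,
        "maxvCpus": 12,
        "instanceTypes": ["optimal"],
        "subnets": ["subnet-EXAMPLE"],
        "securityGroupIds": ["sg-EXAMPLE"],
        "instanceRole": "arn:aws:iam::111122223333:instance-profile/TestIamInstanceProfile",
    },
    "serviceRole": "arn:aws:iam::111122223333:role/TestAWSBatchServiceRole",
}

# With AWS credentials configured, you would submit it like this:
# import boto3
# boto3.client("batch").create_compute_environment(**spot_ce_request)
```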

 

Best Fit Progressive

Imagine you have a time-sensitive machine learning workload that is very compute intensive. You want to run this workload on C5 instances because you know those have a preferable vCPU-to-memory ratio for your jobs. In a pinch, however, you know that M5 instances can run your workload perfectly well. You’re happy to take advantage of Spot prices, but you also need a base level of throughput, so you have to run part of the workload on On-Demand instances. In this case, I recommend the best fit progressive strategy. This strategy is available in both On-Demand and Spot CEs, and I recommend it for most On-Demand workloads. The best fit progressive strategy lets AWS Batch choose the best fit instance for your workload (based on your jobs’ vCPU and memory requirements). In this context, “best fit” means AWS Batch provisions the fewest instances capable of running your jobs at the lowest cost.

Sometimes, AWS Batch cannot provision enough of the best fit instances to meet your capacity. When this is the case, it progressively looks for the next best fit instance type from those you specified in the compute resources parameter. Generally, AWS Batch attempts to spin up different instance sizes within the same family first, because it has already determined that the family’s vCPU-to-memory ratio fits your workload. If it still cannot find enough instances to meet your capacity, AWS Batch launches instances from a different family. These attempts continue until capacity is met, or until it runs out of available instance types from which to select.
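Conceptually, the fallback described above can be sketched as a greedy walk over instance pools. To be clear, this is an illustration, not AWS Batch’s actual implementation, and the instance types, sizes, and available capacities below are invented for the example.

```python
# Conceptual sketch of best-fit-progressive fallback -- NOT AWS Batch's
# actual implementation. Hypothetical pools: type -> (vCPUs per
# instance, instances available).
pools = {
    "c5.xlarge":  (4, 2),   # best fit, but only 2 instances available
    "c5.2xlarge": (8, 1),   # same family, next size up
    "m5.xlarge":  (4, 10),  # different family, tried last
}

def best_fit_progressive(order, pools, vcpus_needed):
    """Walk instance types in best-fit order, taking capacity from each
    pool until the vCPU target is met or the pools are exhausted."""
    plan, remaining = [], vcpus_needed
    for itype in order:
        vcpus, available = pools[itype]
        while available > 0 and remaining > 0:
            plan.append(itype)
            available -= 1
            remaining -= vcpus
    return plan, max(remaining, 0)

plan, unmet = best_fit_progressive(
    ["c5.xlarge", "c5.2xlarge", "m5.xlarge"], pools, 24)
# 2x c5.xlarge (8 vCPUs) + 1x c5.2xlarge (8) + 2x m5.xlarge (8) = 24
```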

To create a best fit progressive CE, follow the steps detailed in the Spot capacity optimized strategy section, but specify BEST_FIT_PROGRESSIVE as the allocation strategy when creating the CE, for example:


…
{
  "Ref": "TestIamInstanceProfile"
},
"MinvCpus": 0,
"InstanceTypes": [
  "optimal"
],
"SecurityGroupIds": [
  {"Ref": "TestSecurityGroup"}
],
"DesiredvCpus": 0,
"MaxvCpus": 12,
"AllocationStrategy": "BEST_FIT_PROGRESSIVE",
"Type": "EC2"
},
"ServiceRole": {
  "Ref": "TestAWSBatchServiceRole"
}
…

Important note: you can always restrict AWS Batch’s ability to launch instances by using the Max vCPUs setting in your CE. With the best fit progressive and Spot capacity optimized strategies, AWS Batch may exceed Max vCPUs to meet your capacity requirements. In this event, it never exceeds Max vCPUs by more than a single instance (from among the instance types specified in your CE’s compute resources parameter).

 

How to Combine Strategies

You can combine strategies using separate AWS Batch Compute Environments. Let’s take the case I mentioned earlier: you’re happy to take advantage of Spot prices, but you want a base level of throughput for your time-sensitive workloads.

This diagram shows an On-Demand CE with a secondary Spot CE, both attached to the same queue


 

In this case, you can create two AWS Batch CEs:

1. Create an On-Demand CE that uses the best fit progressive strategy.

2. Set the Max vCPUs at the level of throughput necessary for your workload.

3. Create a Spot CE using the Spot capacity optimized strategy.

4. Attach both CEs to your job queue, with the On-Demand CE higher in order. Once you start submitting jobs to your queue, AWS Batch spins up your On-Demand CE first and starts placing jobs.

If the On-Demand CE reaches its Max vCPUs limit, AWS Batch spins up instances in the next CE in order. In this case, that is your Spot CE, and AWS Batch places any additional jobs on it. AWS Batch continues to place jobs on both CEs until the queue is empty.
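The queue wiring in step 4 can be sketched as a CreateJobQueue request via boto3. The queue and CE names here are placeholders for the two environments created above.

```python
# Sketch of attaching both CEs to one job queue, On-Demand first
# (order 1). The names are placeholders, not real resources.
job_queue_request = {
    "jobQueueName": "MixedStrategyQueue",
    "state": "ENABLED",
    "priority": 1,
    "computeEnvironmentOrder": [
        {"order": 1, "computeEnvironment": "OnDemandBestFitProgressiveCE"},
        {"order": 2, "computeEnvironment": "SpotCapacityOptimizedCE"},
    ],
}

# With AWS credentials configured:
# import boto3
# boto3.client("batch").create_job_queue(**job_queue_request)
```

AWS Batch fills the lower-order CE first, so the On-Demand environment absorbs the base throughput and Spot takes the overflow.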

Please see this repository for sample CloudFormation code to replicate this environment, and for more examples of leveraging Spot with AWS Batch.

 

Best Fit

Imagine you have a well-defined genomics sequencing workload. You know that this workload performs best on M5 instances, and you run it On-Demand because it is not interruptible. You’ve run this workload on AWS Batch before and you’re happy with its current behavior. You’re willing to trade occasional capacity constraints for strict cost control. In this case, the best fit strategy may be a good option. This strategy was previously AWS Batch’s only behavior. It examines the queue and picks the best fit instance type and size for the workload. As described earlier, best fit to AWS Batch means the fewest instances capable of running the workload, at the lowest cost. In general, we recommend the best fit strategy only when you want the lowest cost for your instances, and you’re willing to trade throughput and availability to get it.

Note: AWS Batch will not launch instances above Max vCPUs while using the best fit strategy. To create a best fit CE, specify BEST_FIT as the allocation strategy, similar to the following:

…
{
  "Ref": "TestIamInstanceProfile"
},
"MinvCpus": 0,
"InstanceTypes": [
  "optimal"
],
"SecurityGroupIds": [
  {"Ref": "TestSecurityGroup"}
],
"DesiredvCpus": 0,
"MaxvCpus": 12,
"AllocationStrategy": "BEST_FIT",
"Type": "EC2"
},
"ServiceRole": {
  "Ref": "TestAWSBatchServiceRole"
}
…

Important Note for AWS Batch Allocation Strategies with Spot Instances:

You always have the option to set a percentage of the On-Demand price when creating a Spot CE. When you set a percentage, AWS Batch only launches instances whose Spot price is at or below that percentage of the On-Demand price. In general, setting a percentage of the On-Demand price lowers your availability, and should only be used if you need strict cost controls. If you want to enjoy the cost savings of Spot with better availability, I recommend that you do not set a percentage of the On-Demand price.
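In the API, this percentage is the bidPercentage field of the compute resources (BidPercentage in CloudFormation). A minimal sketch, with 60 as an arbitrary example value:

```python
# Sketch of a Spot computeResources fragment with a price ceiling. The
# 60 here is an arbitrary example value: only launch instances whose
# Spot price is at or below 60% of the On-Demand price.
spot_compute_resources = {
    "type": "SPOT",
    "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
    "bidPercentage": 60,
    "minvCpus": 0,
    "maxvCpus": 12,
    "instanceTypes": ["optimal"],
}
```

Omitting bidPercentage leaves no price ceiling, which is the higher-availability default recommended above.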

Conclusion

With these new allocation strategies, you now have much greater flexibility to control how AWS Batch provisions your instances. This allows you to make better throughput and cost trade-offs depending on the sensitivity of your workload. To learn more about how these strategies behave, please visit the AWS Batch documentation. Feel free to experiment with AWS Batch on your own to get an idea of how they help you run your specific workload.

 

Thanks to Chad Schmutzer for his support on the CloudFormation template.