AWS Batch updates: higher compute utilization, AWS PrivateLink support, and updatable compute environments

The AWS Batch team has been busy this year, releasing features that give Batch customers better performance, more advanced security and compliance controls, and simpler operations. This blog post describes a few of them.

Faster and more efficient job placements

AWS Batch is a container-centric, fully managed service for submitting work for background processing. You define a set of resources that tells the Batch managed services what kind of resources, and how many, to provision for your jobs. Batch can handle workload requests of any size, and scales automatically to match your job queue. Our customers use Batch for a truly diverse set of workloads, from background image processing to massive-scale genomics analysis.

Last year we talked about Batch’s faster scaling features, which improved resource scaling by up to 5x and job placement by up to 2x. These improvements to the managed services are valuable when you need to scale up quickly and crunch through the job queue as fast as possible.

Figure 1 – The AWS Batch request flow from job submission through to job execution. The diagram shows which parts of the process were improved: job submission rate by up to 1.6 times, internal job scheduling and execution start by up to 2 times, and resource scaling by up to 5 times compared with before the scaling improvements.

The other side of this challenge is keeping costs as low as possible by using resources efficiently. When there is a lot of work to be done, Batch launches compute resources as fast as it can to meet the need, and then starts placing jobs as quickly as possible. As the work queue drains, fastest-possible job placement can have the side effect of spreading the remaining jobs across many instances. That keeps launched capacity running longer than necessary and lowers the utilization of those instances at the tail end of a batch of work, which adds to the overall cost of the analysis.

We recently changed our job placement logic to take into account the number of jobs remaining in the queue. As the queue drains, Batch switches to a more conservative approach that packs jobs onto a smaller set of the launched instances, at the expense of slightly longer job placement times. This dynamic job placement strategy resulted in 29% better utilization of available capacity across the wide range of test scenarios we tried. Better instance utilization lets the fleet scale down sooner, which in turn lowers the overall cost of running jobs.

Support for AWS PrivateLink

While AWS Batch could already launch and manage resources within private Amazon Virtual Private Cloud (Amazon VPC) subnets, customers still needed to route requests to the Batch API through publicly accessible endpoints. Some customers, for security or compliance reasons, do not want to expose any internet-accessible endpoint to internal services running on premises or within private subnets.

AWS Batch has now enabled the use of AWS PrivateLink (PrivateLink) to access the Batch APIs. AWS PrivateLink provides private connectivity between VPCs, AWS services, and your on-premises networks.

To use Batch with PrivateLink, you will need to create an interface VPC endpoint for AWS Batch in your VPC using the VPC management console, SDK, or CLI. You can also access the VPC endpoint from on-premises environments or from other VPCs using AWS VPN, AWS Direct Connect, or VPC Peering.
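As a rough illustration of the SDK route, the following Python sketch builds the parameters for an EC2 `create_vpc_endpoint` call and passes them to boto3. The helper names, the example IDs, and the choice to enable private DNS are assumptions made for illustration; the service name follows the usual `com.amazonaws.<region>.batch` pattern, which you should confirm in the Batch documentation for your Region.

```python
def batch_endpoint_request(region, vpc_id, subnet_ids, security_group_ids):
    """Build keyword arguments for EC2 create_vpc_endpoint (hypothetical helper)."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        # Assumed service name pattern for the AWS Batch interface endpoint.
        "ServiceName": f"com.amazonaws.{region}.batch",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Assumption: resolve the default Batch API hostname to the endpoint.
        "PrivateDnsEnabled": True,
    }


def create_batch_endpoint(region, vpc_id, subnet_ids, security_group_ids):
    """Create the interface endpoint and return its ID (requires AWS credentials)."""
    import boto3  # imported lazily so the builder above has no dependencies

    ec2 = boto3.client("ec2", region_name=region)
    request = batch_endpoint_request(region, vpc_id, subnet_ids, security_group_ids)
    response = ec2.create_vpc_endpoint(**request)
    return response["VpcEndpoint"]["VpcEndpointId"]
```

Calling `create_batch_endpoint("us-east-1", "vpc-0123", ["subnet-a"], ["sg-b"])` with real IDs would create the endpoint; the security group needs to allow inbound HTTPS from the clients that will call the Batch API.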

New compute environment update capabilities

Batch compute environments (CEs) define the set of compute and storage resources your jobs will run on. You can define the minimum and maximum total vCPU capacity of the fleet, as well as storage, security groups, and a number of other parameters. You can also define whether the underlying compute resource provider is AWS Fargate or Amazon Elastic Compute Cloud (Amazon EC2).

Before today, once a compute environment was created, you could only update certain settings, such as the minimum and maximum number of allocatable vCPUs, or the service role used by your job requests. Any other change, for example updating an AMI to apply a security patch, required you to create a new compute environment and swap it in for the existing one in your Batch job queues.

Today we are pleased to announce new capabilities in the UpdateComputeEnvironment API that allow you to dynamically update many of the settings of compute environments that use the service-linked IAM role. The UpdateComputeEnvironment API updates are mainly targeted at compute environments that use EC2 compute resources. Compute environments that use Fargate resources only support updating security groups (securityGroupIds) and VPC subnets (subnets).

Changing some of the compute environment settings requires that AWS Batch replace the instances in the compute environment. Batch has two update mechanisms for the compute environment. The first is a scaling update, where instances are added to or removed from the compute environment. The second is an infrastructure update, where all of the instances in the compute environment are replaced, which may take much longer than a scaling update. Scaling updates are triggered by changes to the desired, minimum, or maximum number of vCPUs (desiredvCpus, minvCpus, maxvCpus), the service role (serviceRole), or the state of the CE (state). Any other update triggers an infrastructure update.
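The rule above is simple enough to express as a small check you could run before submitting an update. This is a minimal sketch based only on the description in this post; the function name and return values are hypothetical, and the documentation remains the authoritative list of which settings trigger which mechanism.

```python
# Settings that, per the description above, only trigger a scaling update.
SCALING_SETTINGS = {"desiredvCpus", "minvCpus", "maxvCpus", "serviceRole", "state"}


def classify_update(changed_settings):
    """Return "scaling", "infrastructure", or "none" for a set of changed settings.

    `changed_settings` is an iterable of setting names, for example
    ["maxvCpus", "securityGroupIds"]. Hypothetical helper for illustration.
    """
    changed = set(changed_settings)
    if not changed:
        return "none"
    # Any setting outside the scaling list forces instance replacement.
    if changed <= SCALING_SETTINGS:
        return "scaling"
    return "infrastructure"
```

For example, `classify_update(["minvCpus", "maxvCpus"])` reports a scaling update, while adding `"instanceTypes"` to the same list reports an infrastructure update, so you can expect the change to take longer.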

You can read more about how compute environment updates are implemented, and their impact on scheduled jobs, in the Update compute environment documentation page. The general idea is that new job submissions land on newly launched resources with the updated settings. Running jobs are handled according to the update policy (updatePolicy) set in the UpdateComputeEnvironment API action: they either have a job timeout enforced (30 minutes to complete by default) or are terminated immediately when you issue the request to update the compute environment.
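To make the update policy concrete, here is a hedged Python sketch that assembles an UpdateComputeEnvironment request and sends it with boto3. The helper names and the example compute environment name are hypothetical; the `terminateJobsOnUpdate` and `jobExecutionTimeoutMinutes` fields are the updatePolicy members of the API, and leaving the timeout out when jobs are terminated immediately is an assumption of this sketch.

```python
def build_update_request(compute_environment, compute_resources,
                         terminate_jobs=False, timeout_minutes=30):
    """Build keyword arguments for Batch update_compute_environment.

    Hypothetical helper: `compute_resources` holds only the settings to change,
    for example {"securityGroupIds": ["sg-new"]}.
    """
    policy = {"terminateJobsOnUpdate": terminate_jobs}
    if not terminate_jobs:
        # Running jobs get this long to finish before they are stopped.
        policy["jobExecutionTimeoutMinutes"] = timeout_minutes
    return {
        "computeEnvironment": compute_environment,
        "computeResources": dict(compute_resources),
        "updatePolicy": policy,
    }


def apply_update(request, region=None):
    """Send the request to AWS Batch (requires AWS credentials)."""
    import boto3  # imported lazily so the builder above has no dependencies

    batch = boto3.client("batch", region_name=region)
    return batch.update_compute_environment(**request)
```

A call such as `apply_update(build_update_request("my-ce", {"securityGroupIds": ["sg-new"]}, timeout_minutes=60))` would give running jobs an hour to complete while new submissions land on replacement instances.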

Updating the AMI ID

We also want to call out that the documentation has a specific section on updating the AMI ID that is worth a careful read if you use EC2 for your compute environments. The underlying reason for the complexity is that you can define an AMI either in the compute environment’s settings or in an EC2 launch template. The ability that launch templates offer to change the base AMI on launch, so that you do not need to create and manage custom AMIs, more than makes up for the additional complexity in the CE management API, and better fits the way customers manage other EC2-based resources.

Considerations for compute environment updates

We think the new update functionality is a great operational win for customers, but there are three limitations we want to point out.

First, we do not allow changing the type of CE between Fargate and EC2. These two environments are so different that we decided allowing this change would cause confusion and errors for jobs that were created with a specific compute environment in mind. For example, job definitions with resource needs larger than a Fargate task size can support would fail if you changed the compute environment from EC2 to Fargate.

Second, the compute environment must be configured to use the Batch service-linked role. The reason is that, behind the scenes, the update process re-triggers some parts of the create compute environment workflow. It is possible to use your own IAM role for Batch, but this also lets you change which actions the Batch managed service components can take on your behalf. A change to that role might prevent the Batch managed services from successfully updating your CE. Service-linked roles were introduced both to simplify Batch deployment and to protect you from accidentally putting your resources into an INVALID state through changed permissions.

Last, we currently only support updating compute environments with BEST_FIT_PROGRESSIVE and SPOT_CAPACITY_OPTIMIZED allocation strategies. Compute environments that leverage BEST_FIT as an allocation strategy would need to be replaced with a new compute environment with the desired updates.
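The three limitations can be turned into a quick pre-flight check. The sketch below inspects one entry shaped like an item from the `computeEnvironments` list returned by DescribeComputeEnvironments. The function name is hypothetical, and detecting the service-linked role by looking for the `aws-service-role/batch.amazonaws.com/` path in the role ARN is an assumption of this example.

```python
FARGATE_TYPES = {"FARGATE", "FARGATE_SPOT"}
# Assumed ARN path fragment that identifies the Batch service-linked role.
SERVICE_LINKED_ROLE_PATH = "role/aws-service-role/batch.amazonaws.com/"


def update_blockers(compute_environment, target_type=None):
    """Return a list of reasons why this CE cannot be updated in place.

    `compute_environment` is a dict shaped like a DescribeComputeEnvironments
    entry; `target_type` is the provider type you want after the update, if
    you intend to change it. Hypothetical helper for illustration.
    """
    blockers = []
    resources = compute_environment.get("computeResources", {})
    current_type = resources.get("type")

    # 1. No switching between Fargate and EC2 providers.
    if target_type is not None:
        if (current_type in FARGATE_TYPES) != (target_type in FARGATE_TYPES):
            blockers.append("cannot change between Fargate and EC2")

    # 2. The CE must use the Batch service-linked role.
    if SERVICE_LINKED_ROLE_PATH not in compute_environment.get("serviceRole", ""):
        blockers.append("service role is not the Batch service-linked role")

    # 3. BEST_FIT environments must be replaced rather than updated.
    if resources.get("allocationStrategy") == "BEST_FIT":
        blockers.append("BEST_FIT allocation strategy is not updatable")

    return blockers
```

An empty list means none of the limitations described here apply; anything else tells you to plan for a replacement compute environment instead.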

Summary

In this post, we covered some of the recent updates to AWS Batch, including improvements to job placement, the addition of AWS PrivateLink support, and the new capabilities for updating your AWS Batch compute environments. We also described the limitations on CE updates and the reasoning behind them.

To learn more about PrivateLink support, visit the documentation on how to Access AWS Batch using an interface endpoint. To learn more about updating compute environments, visit the documentation page on Updating Compute Environments in the AWS Batch User Guide.

AWS HPC Blog