AWS Batch Dos and Don’ts: Best Practices in a Nutshell

This post contributed by Pierre-Yves Aquilanti, Principal HPC Solutions Architect, Steve Kendrex, Principal Product Manager for AWS Batch, and Matt Koop, Principal HPC Application Engineer.

Update: Oct 5, 2021 – We updated some guidance on subnet CIDR blocks in section 2 on common errors.

AWS Batch is a service that enables scientists and engineers to run computational workloads at virtually any scale without requiring them to manage a complex architecture. Since launch in 2017, AWS Batch has been adopted by customers from diverse industries and institutions to run workloads in domains as diverse as epidemiology, gaming simulations, and large-scale machine learning.

In this blog post, we share a set of best practices and practical guidance devised from our experience working with customers in running and optimizing their computational workloads. The readers will learn how to optimize their costs with Amazon EC2 Spot on AWS Batch, how to troubleshoot their architecture should an issue arise and how to tune their architecture and containers layout to run at scale.

1. Using EC2 Spot smartly to optimize your costs

Amazon EC2 Spot is commonly used by customers running batch workloads due to the cost savings it provides and which can attain up to 90% compared to On-Demand Instances prices. The Amazon EC2 Spot best practices provides general guidance on how to take advantage of this purchasing model. You will find below additional information on how this can be applied in the context of AWS Batch:

Pick the right allocation strategy. With BEST_FIT, AWS Batch picks the most cost-efficient EC2 instance, with BEST_FIT_PROGRESSIVE AWS Batch will launch additional instances in the event the BEST_FIT instance cannot fulfill all needed capacity. In the case of SPOT_CAPACITY_OPTIMIZED, AWS Batch selects instances from the deepest EC2 Spot capacity pools. SPOT_CAPACITY_OPTIMIZED is recommended if interruptions are a concern. Furthermore, you’ll still benefit from the savings provided by EC2 Spot. If cost is what you prefer to optimize, then BEST_FIT_PROGRESSIVE is an option to consider.
Diversify across instance types. If you think c5.12xlarge is a good option, you will want to include all compatible alternatives in terms of size (c5.24xlarge…) and families (c5a, c5n, c5d, m5, m5d…) and let Batch pick the ones that you need based on price or availability.
Use multiple Availability Zones. Use subnets located in different Availability Zones when configuring your Batch compute environments. Use all Availability Zones in a Region enables Batch to tap into multiple instance pools and spread the risk of seeing interruptions.
Reduce your job runtime or checkpoint. Jobs lasting 1 or multiple hours are not ideal for EC2 Spot as the impact of interruptions will be significantly more important than if your jobs last a few minutes to a few dozen of minutes. If you can chunk your jobs into smaller parts or checkpoint during the run, then the impact (and cost) of interruptions on your overall is lessened.
Use automated retries. AWS Batch jobs can be automatically retried up to 10 times in case of a non-zero exit code, a service error or an instance reclamation if using Spot. Set your retry parameter to at least 1 to 3 retries, if interrupted your jobs will be placed at front of the Job Queue and will be scheduled to run in priority. You can set the retry strategy when creating the Job Definition or when submitting a job like in the following example using the AWS CLI:
```
aws batch submit-job --job-name MyJob \
                     --job-queue MyJQ \
                     --job-definition MyJD \
                     --retry-strategy attempts=2
```
Use custom retries. Jobs retry strategy can be tailored to specific application exit codes or on Amazon EC2 Spot reclamations. The example below allows up to 5 retries in case of a job failure caused by the host, if the failure is caused by another reason, then the job exit and is declared as failed:
```
"retryStrategy": {
    "attempts": 5,
    "evaluateOnExit":
    [{
        "onStatusReason" :"Host EC2*",
        "action": "RETRY"
    },{
      "onReason" : "*"
        "action": "EXIT"
    }]
}
```

If you’d like to track Spot interruptions, you can install the Spot Interruption Dashboard serverless application on your account. It provides statistics on which EC2 Spot Instances are reclaimed and in which Availability Zone they are located.

2. Common errors and Troubleshooting steps

When a customer encounters an error with AWS Batch it is often due to an instance configuration that does not fit the jobs requirements, or an error at the application level. Other common issues can be a job stuck in RUNNABLE status or a Batch compute environment stuck in INVALID state.

Before diving into at your AWS Batch configuration, run through this checklist:

Check your EC2 Spot vCPUs quotas as your current limits may be below the capacity you’d like to achieve (for example 256 vCPUs is your current quota while attempting to run your workload on 10,000 vCPUs). Request a limit increase via the Service Quotas Dashboard. This is commonly overlooked, especially for new accounts.
Job are failing before running the application due to a DockerTimeoutError and CannotPullContainerError can be solved by following this guide.
Not enough IP addresses in your VPC and subnets can limit the number of instances you can create. Use CIDRs that can provide more IP addresses than necessary to run your workloads and if needed build a dedicated VPC with a large address space. For example, create a VPC with multiple CIDRs in 10.x.0.0/16 and a subnet in every Availability Zone with a CIDR in 10.x.y.0/17 where x can be between 1-4 and y will be either 0 or 128. This will provide you with 36k IP addresses in every subnet.

Example of VPC & subnets decomposition providing large address spaces in the Northern Virginia region (us-east-1).

If after going through the checklist, you did not find a solution to your challenge, follow this series of steps:

Review the AWS Batch Dashboard to check if the jobs states are the ones that we expect and verify if the CE scales when jobs are submitted. You can also look at your jobs logs in Amazon CloudWatch.
Check if instances are created by going in the Amazon EC2 Console – Instances. That’s a sign that scaling occurs and instances are requested by Batch and Amazon EC2. If you don’t see any instance, you will need to review the Auto Scaling Group history to identify if the subnets in the CE configuration have to be changed. Another reason may be that your jobs requirements cannot be accommodated by an instance (for example a job requesting 1 TiB of memory when the CE is set to use C5 family only which offers 192 GiB maximum).
See your Auto Scaling group history to check if instances are being requested by AWS Batch. This will give you an idea on how EC2 tries to acquire instances. EC2 Spot sometimes throws errors saying it can’t acquire an instance in a particular Availability Zone. That’s fine, instance pools are different across Availability Zones and it can happen that one does not offer a specific instance family.
Verify if instances register with ECS as you may see instances on the EC2 panel but no ECS Container Instances registered with you in your ECS cluster that’s a sign that the ECS agent is not installed in the case of a custom AMI, a misconfiguration of the agent, the EC2 User Data in your AMI or the Launch Template. If this happens, create a separate EC2 instance to find the root cause or connect to an existing instance using SSH or AWS Systems Manager Session Manager.
If the problem occurs at runtime and cannot be easily reproduced install and configure the CloudWatch Agent to push your system logs and ECS logs to CloudWatch Logs so they can be viewed offline. To do so you will need to build a custom AMI or build a custom Launch Template.

If you can’t find the error after these steps, please raise a support ticket if you have a support plan. The AWS Forums for AWS Batch and HPC are also an option when seeking help. Don’t forget to provide sufficient information on the issue, workload specifics, your configuration as well as the tests you ran and their results, this will help to accelerate the resolution of your issue. To further assist you in troubleshooting, you can use this open-source monitoring solution. It provides a holistic view of your AWS Batch environment through a series of dashboards.

3. Optimize your Containers and AMIs

Container size and structure matters for the initial set of jobs you will be running, especially if your container is larger than 4 GB. Container images are built in layers that are retrieved in parallel by Docker, by default 3 threads are used to pull these layers. While you can add increase concurrency for layer retrieval using the parameter max-concurrent-downloads for Docker, optimizing your container structure and size is also a good option. Below is what we recommend to customers using AWS Batch:

Smaller containers are fetched faster and lead to faster start time. Offload libraries or files with infrequent updates to the AMI and use bind mounts to provide access to your containers. It’ll help to slim down the containers and decrease jobs start time.
Create layers even in size or break-up the large ones. Each layer is retrieved by one thread which significantly impacts the startup time of your job if this layer is large. 2 GB maximum per layer is a good tradeoff. To check your container image structure and layers sizes run the following command: docker history your_image_id
Use Amazon ECR as your container repository for your AWS Batch jobs. A self-managed repository can easily crumble even when running a few thousands of jobs in parallel, a public repository can throttle you (we’ve seen both happen). ECR will work at small and large scale (1M+ vCPUs).

A table of relative sizes of files utilized when working with containers, showing that the closer you get to the operating system, then less these data and files change, while at the level of runtime, changes are measured in minutes. Thus using layers for containers and minimizing the download needed at runtime is a key performance factor.

4. Checklist to Run at Scale

Before running a workload on more than 50,000 vCPUs using AWS Batch, go through the checklist below. Contact your AWS team If you plan to run on 1M vCPUs or if you need guidance for large scale.

Check your EC2 limits in the Service Quotas panel of your AWS Management Console and if necessary, request a limit increase for a peak number of EC2 instances you think you’ll use. Keep in mind that Amazon EC2 Spot and Amazon On-Demand Instances have separate limits.
Verify your EBS quota in the Region. A GP2 or GP3 EBS volume is used by each instance for the operating system. The quota per Region is set by default at 300 TiB and every instance will consume part of this quota. If you reach it, you won’t be able to create more instances.
Use S3 as storage as it provides high throughput and you don’t need to guess how much to provision depending on the number of jobs or instances running in each Availability Zones. Follow the recommendations on this page to optimize for performance.
Scale by steps to identify bottlenecks early. For 1M+ vCPUs we start by running on 50,000 vCPUs, then increase the scale gradually with 200,000 vCPUs, 500,000 vCPUs and 1M+ vCPUs.
Something may break in your architecture or application when running at scale. This happens even when scaling from 1,000 to 5,000 vCPUs and can manifest in parts that you may not suspect. Monitoring helps to identify potential issues quickly for example with Amazon CloudWatch Logs for log data or with the CloudWatch Embedded Metrics Format for metrics using a client library.

5. Where AWS Batch is a fit, when other options can be considered

AWS Batch is intended to run jobs at scale with low cost, while giving you powerful queueing and cost-optimized scaling capabilities. However, not every workload is great for Batch, particularly:

Very short jobs. If your jobs run for only a handful of seconds, the overhead to schedule your jobs may be more than the runtime of the jobs themselves, resulting in less-than-desired utilization. A workaround is to binpack your tasks together before submitting them in Batch then having your Batch jobs iterate over your tasks. A general guidance is to binpack your jobs is to: 1) stage the individual tasks arguments into an Amazon DynamoDB table or as a file in an Amazon S3 bucket, ideally group the tasks in order to get your AWS Batch jobs to last of 3-5 minutes each 2) loop through your tasks groups within your AWS Batch job.
Jobs that require immediate running. While Batch can process jobs rapidly, it is a batch scheduler and will tend to optimize for cost, priority, and throughput rather than immediately running jobs. This implies that some queue time is desirable to allow Batch to make the right choice for compute. If you need a response time ranging from milliseconds to seconds between job submissions, you may want to use a service-based approach using Amazon ECS or Amazon EKS instead of using a batch processing architecture.

6. Fargate or EC2, On-demand or Spot: which one should you pick?

Choosing between AWS Fargate or Amazon EC2

Use AWS Fargate if:

You need to start your jobs in less than 30 seconds.
Your jobs require no GPU, 4 vCPUs or less and 30 GiB of memory or less.

Fargate requires slightly less configuration than Amazon EC2 and is simpler to use for first-time users. You can find a more complete list of criteria in the AWS Batch documentation.

Use Amazon EC2 if:

You need a higher level of control on the instance selection
Your jobs require resources that AWS Fargate can’t provide such as higher memory, vCPUs count, GPUs, Amazon Elastic Fabric Adapter
You require a high level of throughput or concurrency for your workload needs.
You need to customize your AMI, Amazon EC2 Launch Template or access to special Linux parameters.

EC2 provides additional knobs to tune your workload and enable you to run at scale should you need it. Fargate is preferred by customers seeking the ease of use it provides and don’t need the capabilities offered by EC2.

Choosing between Amazon EC2 On-Demand or Amazon EC2 Spot

Most of our AWS Batch customers will use Amazon EC2 Spot given the savings over On-Demand. However, On-Demand can be a better option if your workload runs for multiple hours and cannot be interrupted. Below is a list of criteria to estimate the fit of your workload for EC2 Spot or On-Demand. Don’t hesitate to try Spot yourself and move to On-Demand if necessary.

Use Amazon EC2 Spot if:

Your jobs durations range from a few minutes to a few dozens of minutes.
The overall workload can tolerate potential interruptions and job rescheduling.
Long running jobs can be restarted from a checkpoint if interrupted.

Go for Amazon EC2 On-Demand:

If your job runtime lasts one or several hours and cannot tolerate interruptions.
You have a strict Service Level Objective (SLO) for your overall workload and cannot tolerate an increase of computational time.
The instances you need are more likely to see interruptions.

In some cases, customers can mix both purchasing models by submitting on Spot first and then use On-Demand as a fallback option. They start by submitting their jobs on a queue connected to compute environments running on EC2 Spot. If a job gets interrupted, they catch the event from Amazon EventBridge, correlate it to a Spot Instance Reclamation and resubmit the job to an On-Demand queue using an AWS Lambda function or an AWS Step Functions. Just keep in mind that you should use different instance kinds, sizes or Availability Zones for your On-Demand compute environment otherwise you will starve your EC2 Spot Instance pools and increase the interruption rate (thus rescheduling to on-demand instances).

Summary

In this blog post, you have learned about best practices regarding AWS Batch that were identified through our interactions with customers in domains ranging from Genomics, Financial Services to Autonomous Vehicle development. Through this list of best practices, you have seen how to leverage Amazon EC2 Spot, how to troubleshoot a workload running on AWS Batch and how to optimize your AWS Batch architectures.

If you have identified other best practices or tips, don’t hesitate to leave us a comment or contact us directly through the HPC webpage. We’d love to learn about new tricks and share them more broadly. In addition of service page, you can visit HPC workshops and the AWS Workshops websites to learn more about AWS Batch.

AWS HPC Blog