Why you should use Fargate with AWS Batch for your serverless batch architectures

The AWS Batch team recently launched support for Graviton and Windows containers running on AWS Fargate resources. Combine that announcement with other recent additions, such as larger task sizes and configurable local storage, and Fargate becomes an even better serverless solution for your batch and asynchronous workloads.

Let’s take a look at the features that make Fargate a great choice for your small to medium scale AWS Batch environments, starting with the most recent announcements.

Graviton resources

Previously, Batch on Fargate only supported running containers that leverage an x86 CPU architecture, even though Fargate itself was able to run tasks using Linux containers built for the Graviton2 Arm architecture. We closed that gap with this release: you can now leverage Graviton2, which delivers 2-3.5 times better CPU performance per watt than comparable processors of the same generation, making it a sustainable choice for your Batch jobs. Graviton2 also has very capable price/performance characteristics, so it is ideal for analyzing data, logs, or other batch processing. Note that Fargate does not yet support Graviton3.

The best way to find out whether x86 or aarch64 resources give you better price/performance is to try them out on your own application. You can leverage AWS CodePipeline to build multi-architecture container images and store them in Amazon Elastic Container Registry (Amazon ECR). If you need some guidance for building multi-architecture container images, this AWS samples GitHub repository is a good place to start.

To leverage Fargate Graviton resources, set the job definition’s runtimePlatform.cpuArchitecture parameter to ARM64, then submit a job to a job queue that is configured to use Fargate resources. Graviton with Fargate is available in most AWS Regions, but there are some limitations you should consider, such as being limited to Linux containers. The full set of considerations is outlined in the documentation under Working with 64-bit ARM workloads on Amazon ECS.
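As a sketch, the container properties for an Arm-based Fargate job definition might look like the following, ready to pass to Batch’s RegisterJobDefinition API (for example, via boto3’s register_job_definition). The job definition name, image URI, and role ARN are placeholders:

```python
# Sketch of containerProperties for a Fargate job definition targeting
# Graviton (ARM64). Image URI and role ARN below are placeholders.
container_properties = {
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:arm64",  # placeholder
    "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    "resourceRequirements": [
        {"type": "VCPU", "value": "1"},
        {"type": "MEMORY", "value": "2048"},  # MiB; must be a valid pairing with VCPU
    ],
    # The key part: request the ARM64 (Graviton) CPU architecture.
    "runtimePlatform": {
        "cpuArchitecture": "ARM64",
        "operatingSystemFamily": "LINUX",  # Graviton on Fargate is Linux-only
    },
}

# Registering it would look roughly like this (requires AWS credentials):
# import boto3
# boto3.client("batch").register_job_definition(
#     jobDefinitionName="my-arm64-job",  # placeholder name
#     type="container",
#     platformCapabilities=["FARGATE"],
#     containerProperties=container_properties,
# )
```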

Windows containers

Since its release, AWS Batch has only supported running containers on top of the Linux operating system. Customers could build against .NET Core for Linux, but not run native Windows containers, which are required for Windows-only applications. Windows Batch jobs can also automate tasks that benefit from a Windows environment, for example when you want to integrate with Microsoft Active Directory. Now customers can leverage the runtimePlatform.operatingSystemFamily parameter to designate that the container should run on one of the Windows Server releases supported by Fargate. With Fargate taking care of the licensing, running batch processing jobs on Windows is simpler than managing your own licenses for ephemeral resources.
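As a sketch, the runtimePlatform section of a Windows-based Fargate job definition might look like this (the specific Windows Server release is an example; use whichever family Fargate supports and your image targets):

```python
# Sketch of the runtimePlatform section for a Windows-based Fargate job
# definition. The operating system family must match one of the Windows
# Server releases that Fargate supports and that your container targets.
runtime_platform = {
    "operatingSystemFamily": "WINDOWS_SERVER_2019_CORE",  # example release
    "cpuArchitecture": "X86_64",  # Windows containers on Fargate are x86_64-only
}
```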

One thing to keep in mind when working with Windows containers is the larger size of the Windows base image compared to most Linux versions. Even with the Server Core image size reduction introduced with Windows Server 2022, at the time of writing it’s still 3.91 GB. Additionally, you need to ensure compatibility between the Windows version of the job and the Windows version used on the Batch compute environment. For a full set of considerations, refer to the documentation on using Windows containers on AWS Fargate.

Lastly, we recommend that you map Windows and Linux compute environments to platform-specific job queues to avoid trying to run jobs using the wrong operating system.

Larger task sizes

Last October we announced the ability to launch larger Fargate type jobs that use up to 16 vCPUs and up to 120 GiB of memory. This is approximately a 4x increase from previous limits.

These larger task sizes enable you to run more compute-heavy and/or memory-intensive applications like machine learning inference, scientific modeling, and distributed analytics on Fargate. Larger vCPU and memory options may also make migration to serverless container compute simpler for jobs that need more compute resources and cannot be easily re-architected into smaller sized containers.

To take advantage of these larger task sizes, set the appropriate values for the resourceRequirements parameter of your containerProperties. Pay attention to the API documentation about the valid combinations of VCPU and MEMORY values for Fargate. For example, if your value for VCPU is 1, then the valid values for MEMORY (in MiB) are 2048, 3072, 4096, 5120, 6144, 7168, or 8192.
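A maximum-size Fargate job’s resourceRequirements might be sketched as follows (values are strings per the Batch API, and memory is expressed in MiB):

```python
# Sketch: resourceRequirements for a maximum-size Fargate job
# (16 vCPUs, 120 GiB of memory). Values are strings per the Batch API.
resource_requirements = [
    {"type": "VCPU", "value": "16"},
    {"type": "MEMORY", "value": str(120 * 1024)},  # 120 GiB -> 122880 MiB
]
```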

Configurable local storage volumes

More recently, the AWS Batch team released a feature that allows you to configure ephemeral storage up to 200 GiB in size on Fargate job definitions. The previous default was 20 GiB, which was not nearly enough for data-heavy processes like genomics or video processing. If you need even more storage space, you still have the option to configure a shared Amazon Elastic File System (Amazon EFS) mount point as part of the job definition.

To define the size of an ephemeral volume for a job, use the ephemeralStorage parameter in the job definition’s containerProperties. You can set a value from 21 up to 200 GiB. Not specifying this parameter results in the default value of 20 GiB for local storage. This parameter is available only when you’re using Fargate.
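In containerProperties, the ephemeralStorage parameter is a small fragment; a sketch requesting the maximum might look like this:

```python
# Sketch: request the maximum 200 GiB of ephemeral storage for a Fargate
# job. Valid sizes are 21-200 GiB; omitting the parameter gives the
# 20 GiB default.
container_properties_fragment = {
    "ephemeralStorage": {"sizeInGiB": 200},
}
```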

Other great Fargate features for batch workloads

Even before adding support for larger tasks, larger local storage, and Graviton and Windows containers, Fargate had some nice advantages for running batch workloads.

Job level cost allocation

When running Batch jobs on EC2 resources, it can take some effort to accurately determine the compute cost of any given job that ran on a shared instance. Because Fargate jobs are metered individually, determining the cost of each job is much easier: there is no need to join metered usage against split cost allocation data for tasks.

Fargate with Amazon EC2 Spot capacity provider

Fargate resources can leverage EC2 Spot capacity to run jobs at a discounted rate compared to the on-demand price. The usual Spot criteria apply — meaning that your jobs could be interrupted, but you’ll get a two-minute warning. For Fargate, the warning is sent as a task state change event to Amazon EventBridge and as a SIGTERM signal to the running task.

You can handle the SIGTERM signal in your application code to gracefully exit the process, possibly checkpointing data to shared storage so that the job can restart at a later time when capacity becomes available. The signal must be handled within the container for any cleanup actions to run. Failure to handle it results in the task receiving a SIGKILL signal, which in turn may result in data loss or corruption.

Please note that Fargate Spot is not supported for ARM64 or Windows-based containers, but we still think it is a great option for your other Batch on Fargate workloads.

Shared storage across jobs using Amazon Elastic File System

Expanded local storage is great, but sometimes you need to send the output of one task as the input to another, and copying all that data around can take time (and hence money). Sometimes it is easier and faster to share data across tasks directly using a shared file system. A shared file system is also advantageous for checkpointing processes for later restart, as in the case of Spot interruptions.

You can define Amazon Elastic File System (Amazon EFS) mount points at the job definition level. If you define two job definitions with the same mount point, they effectively share a common place to read and write data. You can read more about how to do that in our “how to use EFS with AWS Batch” blog post.
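A sketch of the relevant containerProperties fragment: a named EFS volume plus a mount point that references it (the file system ID and container path below are placeholders):

```python
# Sketch: an EFS volume and mount point inside a job definition's
# containerProperties. File system ID and paths are placeholders.
efs_fragment = {
    "volumes": [
        {
            "name": "shared-data",
            "efsVolumeConfiguration": {
                "fileSystemId": "fs-0123456789abcdef0",  # placeholder
                "transitEncryption": "ENABLED",
            },
        }
    ],
    "mountPoints": [
        {
            "sourceVolume": "shared-data",   # must match the volume name
            "containerPath": "/mnt/shared",  # placeholder; jobs read/write here
            "readOnly": False,
        }
    ],
}
```

Two job definitions that share the same volume and container path effectively read and write the same data.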

Finally, whenever you leverage shared storage across jobs and applications, you should always pay attention to how they may interact with each other. Specifically you should identify and address any security boundaries or data overwriting scenarios that may occur when separate job definitions mount the same volume.

“Is Fargate right for me?”

If you’re thinking about leveraging Fargate with AWS Batch, it’s worth taking a moment to consider your overall scale and throughput needs.

Batch maps a single job request to a single Fargate resource. This means that your maximum job dispatch rate is limited to 500 task launches per minute. In contrast, if you use Batch with EC2, multiple jobs can be placed on already-running instances, resulting in faster job placement.

Fargate also has a service quota defining the total number of concurrent Fargate vCPUs you can launch in a Region. Depending on the size of your jobs, this quota determines the number of concurrent jobs you can run. You can find your Fargate service quotas in the AWS Service Quotas management console, which also provides a mechanism for requesting an increase, should you need it.
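For a rough sense of how the vCPU quota bounds concurrency, a back-of-the-envelope calculation (the quota value below is an assumption; check Service Quotas for your account’s actual limit):

```python
# Hypothetical example: how a Regional Fargate vCPU quota bounds the
# number of jobs that can run at once.
fargate_vcpu_quota = 4000  # assumed quota; check Service Quotas for yours
vcpus_per_job = 16         # a maximum-size Fargate job

max_concurrent_jobs = fargate_vcpu_quota // vcpus_per_job
print(max_concurrent_jobs)  # -> 250 concurrent 16-vCPU jobs
```

Smaller jobs would allow proportionally more concurrency under the same quota.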

Depending on your workload (e.g. the size of the tasks, duration of each job, and frequency of the jobs), we recommend reaching out to AWS Support in advance if you’re planning on running very large workloads using Batch and Fargate. For more information on how to choose the underlying compute model, see our best practices documentation on how to choose the right compute environment resource.


With the release of Graviton and Windows container support, AWS Batch with Fargate has become a very capable serverless batch computing solution. It’s a great fit if you’re using AWS Batch for background, asynchronous tasks or for data processing.

If you want to try out the features mentioned in this post, log in to the AWS Batch management console and give them a spin!

Angel Pizarro

Angel is a Principal Developer Advocate for HPC and scientific computing. His background is in bioinformatics application development and building system architectures for scalable computing in genomics and other high throughput life science domains.