Choosing between AWS Batch and AWS ParallelCluster for your HPC Workloads
It’s an understatement to say that AWS has a lot of services (more than 200 at the time of this post!). We’re usually the first to point out that there’s more than one way to solve a problem. HPC is no different in this regard, because we offer a choice: customers can run their HPC workloads using AWS ParallelCluster or AWS Batch.
Which brings us to today’s question: how should you choose between them?
We think your choice will come down to three big factors:
- Your environment and workspace preferences.
- What your application(s) assume about the runtime environment.
- What you use to define a complex workflow.
The first thing to consider is how to take advantage of your prior experience to implement a workflow on AWS.
If you’re currently using (or managing) shared HPC resources to submit jobs to a scheduler, you’ll feel right at home with AWS ParallelCluster. ParallelCluster is an AWS-supported, open-source tool that makes it easy for you to deploy and manage HPC clusters on AWS.
To use ParallelCluster, you define what the cluster should look like, including what sort of compute instances to use for jobs, limits on how many to spin up, and which other capabilities you need, such as shared storage or remote visualization. ParallelCluster takes that configuration and handles the undifferentiated AWS orchestration for you. It’ll set up the networking and firewall rules, and build and configure a head node with packages, applications, and a scheduler. It’ll also stand up shared storage if it doesn’t already exist, and make it available across the cluster. And it includes the automation you need to scale your compute nodes to the size of the work queue, expanding and shrinking the number of compute nodes based on the workloads in your queue.
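To make "define what the cluster should look like" a little more concrete, here is a minimal sketch that builds a cluster definition as a Python dictionary. The key names follow the ParallelCluster 3 configuration layout as we understand it; the Region, subnet ID, key-pair name, instance types, and node limit are all placeholder assumptions, and `build_cluster_config` is a hypothetical helper, not part of ParallelCluster. In practice you would write the equivalent settings in a YAML file and check them against the ParallelCluster documentation.

```python
import json


def build_cluster_config(subnet_id, key_name, max_nodes):
    """Return a dict that mirrors the layout of a ParallelCluster 3 config file.

    All values are illustrative placeholders, not recommendations.
    """
    return {
        "Region": "us-east-1",
        "Image": {"Os": "alinux2"},
        # The head node hosts the scheduler and is where users log in.
        "HeadNode": {
            "InstanceType": "c5.large",
            "Networking": {"SubnetId": subnet_id},
            "Ssh": {"KeyName": key_name},
        },
        # Compute nodes scale between MinCount and MaxCount with the queue.
        "Scheduling": {
            "Scheduler": "slurm",
            "SlurmQueues": [
                {
                    "Name": "compute",
                    "ComputeResources": [
                        {
                            "Name": "c5n-nodes",
                            "InstanceType": "c5n.18xlarge",
                            "MinCount": 0,
                            "MaxCount": max_nodes,
                        }
                    ],
                    "Networking": {"SubnetIds": [subnet_id]},
                }
            ],
        },
        # Shared storage mounted on the head node and all compute nodes.
        "SharedStorage": [
            {"MountDir": "/shared", "Name": "shared-efs", "StorageType": "Efs"}
        ],
    }


config = build_cluster_config("subnet-EXAMPLE", "my-key-pair", max_nodes=16)
print(json.dumps(config, indent=2))
```

Setting `MinCount` to 0 in this sketch reflects the scale-to-queue behavior described above: no compute nodes run while the queue is empty.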
Alternatively, if your background is more developer or DevOps oriented, you should consider implementing your analysis pipelines with AWS Batch. Batch is a container-centric, always-waiting, fully-managed task execution service. Batch provides job queues with sophisticated scheduling capabilities, and compute environments to define the size and shape of worker nodes. You define what the job will look like, and when you submit some work, Batch takes care of orchestrating the underlying compute fleet and placing jobs on that fleet.
Batch is a native AWS service, and has direct support for its resources in our SDKs and in AWS CloudFormation. If you already have processes in place for developing infrastructure-as-code on AWS, then creating Batch environments and integrating them into your workflows should follow the same process as integrating any other AWS service or feature.
In contrast, while ParallelCluster also uses CloudFormation behind the scenes, there aren’t any CloudFormation resources for “HeadNode”, “SlurmScheduler”, etc. This may or may not be an important distinction for you, but it is worth mentioning here.
Batch also integrates well with other AWS services, such as AWS Identity and Access Management (IAM) for authentication and authorization, Amazon EC2 Spot Instances for cost savings, Amazon CloudWatch for triggering and monitoring of analysis events, and AWS Step Functions to enable new workflows. Having native integration with other AWS services can simplify the development process when you integrate Batch with other parts of your stack.
Most HPC applications were written before cloud computing existed (some were written before the internet was even a thing). As a result, most applications were written with some assumptions about their runtime environment. Specifically, they often assume that the underlying servers are homogeneous, static in number, and that the application can read and write data to shared POSIX storage that is available across the cluster. Tightly-coupled codes that communicate using a Message Passing Interface (MPI) across several nodes also assume that they have priority access to a high-bandwidth, low-latency network. Finally, some applications require shared access to acceleration hardware, like GPUs, across running processes.
If your application falls into this category, ParallelCluster will allow you to port it to AWS with few (or sometimes no) changes to your existing workflow. It’s a great option for quickly getting started running your existing HPC workloads on AWS and reaping the scalability and flexibility benefits of the cloud while maintaining an environment that is very close to what these applications expect.
These assumptions don’t prevent you from using AWS Batch for HPC applications, but it’s a different environment from either running an application on a local workstation or submitting to a traditional HPC scheduler.
To use an HPC application with Batch you’ll first need a containerized version of it. That’s a straightforward procedure when the application is pleasingly parallel and doesn’t require cross-node communication using MPI. Many codes fall into this category, especially in the bioinformatics space. The BioContainers community has already packaged many bioinformatics applications into containers and made them available to the general community. If you don’t find what you need in their registry, they also provide great documentation on best practices for creating containers.
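To illustrate why pleasingly parallel codes are the easy case, here is a minimal sketch of how each container could pick its own slice of the input without talking to any other node. It assumes the work is submitted as a Batch array job, in which case each child job can read its position from the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable. The sample file names, the array size, and the `select_inputs` helper are hypothetical.

```python
import os


def select_inputs(all_inputs, array_index, array_size):
    """Return the inputs one worker should handle, using a round-robin split."""
    return [
        item
        for position, item in enumerate(all_inputs)
        if position % array_size == array_index
    ]


# Hypothetical input list; a real job might list objects in a bucket instead.
samples = [f"sample_{n:03d}.fastq" for n in range(10)]

# Defaults to 0 so the script also runs outside of Batch for local testing.
index = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
array_size = 4  # assumed to match the array size used at submission time

for path in select_inputs(samples, index, array_size):
    print(f"worker {index} would process {path}")
```

Because every worker derives its share from its own index, no coordination between containers is needed, which is exactly what makes this kind of workload simple to containerize.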
Once the application is containerized, you then need to define the Batch resources to be able to run the application. As with ParallelCluster, you will need to define a set of resources that apply to all jobs. This includes a job queue to define job ordering and placement priority, and a compute environment (CE) that defines the type of instances that should be used (Intel, AMD, Arm, GPUs, CPU/memory ratios, etc.), and the minimum and maximum number of concurrent nodes that can run jobs. At the level of a job submission, you’ll need a Batch job definition that specifies the job’s “shape” (the runtime CPU and memory requirements) for each type of job submitted. A Batch job definition is analogous to what you would submit to an HPC job scheduler to run, except that in Batch you need to predefine the job shape before you can request that any instance of that job is run. Job definitions can also define storage mount points for the container to access, both for local disk volumes and for a shared Amazon Elastic File System (Amazon EFS) file system.
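The three resources described above can be sketched as plain Python dictionaries. The field names follow the request shapes of the Batch API as exposed by the AWS SDK for Python (boto3), to the best of our understanding; every name and value here (queue name, instance families, vCPU limits, command, mount path) is a placeholder assumption, and `build_batch_requests` is a hypothetical helper. The sketch only builds the requests and does not call AWS. Note that a compute environment expresses its limits in vCPUs rather than in a number of nodes.

```python
def build_batch_requests(image_uri, efs_file_system_id, subnet_ids, security_group_ids):
    """Return request payloads for a compute environment, job queue, and job definition."""
    compute_environment = {
        "computeEnvironmentName": "hpc-spot-ce",
        "type": "MANAGED",
        "computeResources": {
            "type": "SPOT",  # use Spot Instances for cost savings
            "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
            "minvCpus": 0,  # scale to zero when the queue is empty
            "maxvCpus": 256,  # upper bound on concurrent capacity
            "instanceTypes": ["c5", "m5"],
            "subnets": subnet_ids,
            "securityGroupIds": security_group_ids,
            "instanceRole": "ecsInstanceRole",  # placeholder instance profile
        },
    }
    job_queue = {
        "jobQueueName": "hpc-queue",
        "state": "ENABLED",
        "priority": 10,
        "computeEnvironmentOrder": [
            {"order": 1, "computeEnvironment": "hpc-spot-ce"}
        ],
    }
    job_definition = {
        "jobDefinitionName": "align-reads",
        "type": "container",
        "containerProperties": {
            "image": image_uri,
            "command": ["align", "--input", "/data/in", "--output", "/data/out"],
            # The job "shape": CPU and memory (MiB), given as strings.
            "resourceRequirements": [
                {"type": "VCPU", "value": "4"},
                {"type": "MEMORY", "value": "8192"},
            ],
            # Shared EFS file system exposed to the container at /data.
            "volumes": [
                {
                    "name": "shared-data",
                    "efsVolumeConfiguration": {"fileSystemId": efs_file_system_id},
                }
            ],
            "mountPoints": [
                {"sourceVolume": "shared-data", "containerPath": "/data", "readOnly": False}
            ],
        },
    }
    return {
        "compute_environment": compute_environment,
        "job_queue": job_queue,
        "job_definition": job_definition,
    }


requests = build_batch_requests(
    image_uri="example.registry/aligner:latest",
    efs_file_system_id="fs-EXAMPLE",
    subnet_ids=["subnet-EXAMPLE"],
    security_group_ids=["sg-EXAMPLE"],
)
for name, payload in requests.items():
    print(name, "->", sorted(payload))
```

With boto3 installed and credentials configured, each dictionary could be passed as keyword arguments to the Batch client's `create_compute_environment`, `create_job_queue`, and `register_job_definition` calls respectively. The same resources can also be declared in CloudFormation.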
What you gain from this effort is the ability to scale your workload across multiple AWS Regions. For example, we recently worked with the Max Planck Institute for Biophysical Chemistry in Germany to port GROMACS (a molecular-dynamics simulation package) and pmx (a free-energy calculation package) to analyze over 20 thousand compounds in three days across multiple AWS Regions. This gave us a lot of scope for scaling it up and out, and would have been hard to do any other way. You can read more about how we did that in a blog post.
It’s rare that an application is run by itself. Usually, a set of applications is run in a series of steps to form a complete workflow. This can be done via a basic shell or Python script, a Makefile, or a feature-rich workflow framework such as Apache Airflow, Metaflow, or Nextflow.
Basic scripting has the advantage of being easy to implement and run. The downside is that you tend to outgrow it very quickly as your workflow increases in complexity. For example, if you want to restart a workflow from a certain point, you’d need to encode the logic that determines where you left off in a basic script. Workflow frameworks have this capability built in already, and take care of restarting workflows from where they last left off. This feature alone makes it a lot easier to take advantage of EC2 Spot Instances to save money running the workflow.
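To show what "encoding the logic that determines where you left off" can look like in a basic script, here is a minimal sketch that records each completed step in a small JSON state file and skips those steps on the next run. The step names, the state-file name, and the simulated failure are all hypothetical; workflow frameworks provide a far more robust version of this idea.

```python
import json
import tempfile
from pathlib import Path


def run_workflow(steps, state_file):
    """Run (name, action) steps in order, skipping names already recorded as done."""
    done = json.loads(state_file.read_text()) if state_file.exists() else []
    executed = []
    for name, action in steps:
        if name in done:
            continue  # finished in an earlier run, so do not repeat it
        action()  # may raise, leaving the state file at the last good step
        done.append(name)
        state_file.write_text(json.dumps(done))
        executed.append(name)
    return executed


attempts = {"count": 0}


def flaky_align():
    """Fail on the first call to simulate an interrupted instance."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("simulated interruption")


steps = [("fetch", lambda: None), ("align", flaky_align), ("report", lambda: None)]

with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "workflow_state.json"
    try:
        run_workflow(steps, state)  # stops partway through, at "align"
    except RuntimeError:
        pass
    resumed = run_workflow(steps, state)  # picks up at the failed step
    print(resumed)  # prints ['align', 'report']
```

Even this small version has to decide where the state lives and what counts as "done", which hints at why hand-rolled restart logic becomes a burden as workflows grow.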
The other drawback of basic scripting is that it’s not very portable. Scripts are often strongly tied to the environment and scheduler that they were first written for, and have hard-coded paths inside that may not exist on other systems. If you have a Slurm script, ParallelCluster is an option, but you will need to take steps to mimic your local setup for paths, environment variables, applications, shared libraries, etc. I’d recommend writing some integration tests to make sure your expected result is produced in both environments.
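One simple form such an integration test could take is comparing the output files from a reference run on your existing system against those from a run on AWS. The sketch below assumes both sets of outputs have been copied into local directories; the directory layout, file names, and the `compare_outputs` helper are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def compare_outputs(reference_dir, candidate_dir):
    """Return relative paths of reference files that are missing or differ."""
    reference_dir, candidate_dir = Path(reference_dir), Path(candidate_dir)
    mismatches = []
    for ref_file in sorted(p for p in reference_dir.rglob("*") if p.is_file()):
        relative = ref_file.relative_to(reference_dir)
        candidate = candidate_dir / relative
        if not candidate.is_file() or file_digest(candidate) != file_digest(ref_file):
            mismatches.append(str(relative))
    return mismatches


# Tiny demonstration with two throwaway directories.
with tempfile.TemporaryDirectory() as on_prem, tempfile.TemporaryDirectory() as on_aws:
    (Path(on_prem) / "a.txt").write_text("energy = 1.25\n")
    (Path(on_prem) / "b.txt").write_text("energy = 2.50\n")
    (Path(on_aws) / "a.txt").write_text("energy = 1.25\n")
    (Path(on_aws) / "b.txt").write_text("energy = 2.51\n")
    mismatches = compare_outputs(on_prem, on_aws)
    print(mismatches)  # prints ['b.txt']
```

A byte-for-byte comparison is deliberately strict. For numerical codes whose results can vary slightly between hardware or library versions, you would likely want to parse the outputs and compare values within a tolerance instead.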
If you’re thinking of migrating something to AWS, I strongly recommend that you take this opportunity to migrate it to a workflow framework. Calling back to our bioinformatics example, there’s a strong community around Nextflow called nf-core, which can provide you with help, best practices, and even reference workflows for a given domain, such as RNA-Seq. Nextflow and many other workflow engines support both traditional HPC schedulers like Slurm and cloud-native ones like Batch.
By leveraging a workflow framework, you gain the ability to run the same workflow across systems. That’s great for portability, and for sharing your work and ideas with others. You can even share your learnings as you scale your pipelines to larger workloads and provide the code with examples, as the CZ ID project did with their pathogen identification pipeline, which will help COVID researchers around the world to analyze ever-larger datasets.
To summarize – your choice of which AWS service to use for your HPC workflows largely depends on your personal preferences and application requirements. Here are some guidelines:
- If you’re familiar with using traditional HPC shared clusters, and are looking for this type of environment on AWS, then choose AWS ParallelCluster.
- If you’re coming from a developer or DevOps background, choose AWS Batch and integrate it into your infrastructure-as-code development lifecycle.
- If you want to run your application across traditional on-premises HPC environments and AWS with little-to-no changes to your workflows today, choose ParallelCluster.
- If you encode your workflow using a workflow framework that supports containerized applications, your choice will depend on what the underlying framework supports for a back end to submit the individual tasks. Most likely this will be AWS Batch.
- If you’re looking to integrate your batch-processing workflows with other AWS services like AWS Step Functions, then Batch is a better fit for your needs.
To learn more about AWS Batch, visit the Getting Started with AWS Batch guide. To learn more about AWS ParallelCluster, visit the AWS ParallelCluster 3 documentation. We also have a set of self-paced HPC workshops for both ParallelCluster and Batch.
Finally, if you need some help implementing your HPC workloads on AWS, you can find a list of partners on our AWS High Performance Computing Competency Partners page. We look forward to hearing which option you chose, and why. We always appreciate feedback from customers to help us make both AWS Batch and AWS ParallelCluster better at what they do.