AWS HPC Blog

Choosing the right compute orchestration tool for your research workload

Research organizations around the world run large-scale simulations, analyses, models, and other distributed, compute-intensive workloads on AWS every day. These jobs depend on an orchestration layer to coordinate tasks across the compute fleet.

As a researcher, or a systems administrator providing services for researchers, it can be difficult to choose among the available AWS services and solutions, because different options suit different kinds of workloads.

In this post, we’ll describe some typical research use cases and explain which AWS tool we think best fits that workload.

Understanding your workload

Before diving into the specifics of each tool, it’s important to understand the nature of your workload.

Factors like the requirement for tightly coupled processes, the use of containers, the need for machine learning capabilities, or the necessity for a cloud desktop are pivotal in your decision-making process.

Research is not a monolith: AWS supports a diverse range of HPC-based research, from engineering simulations and drug discovery to genomics, machine learning (ML), financial risk analysis, and the social sciences.

The tool you choose isn't exclusive, either: customers can run a mix of these solutions in the same account to meet different needs.

A deep dive into AWS compute orchestration tools

AWS ParallelCluster for classic HPC clusters

AWS ParallelCluster is a flexible tool for building and managing HPC clusters on AWS. It’s ideal for tightly coupled workloads, like running simulations or analytics that require a traditional HPC cluster. It supports Elastic Fabric Adapter (EFA) networking out-of-the-box for low latency and high throughput inter-instance communication, and a high-performance file system (Lustre – available through the Amazon FSx for Lustre managed service).

ParallelCluster provides a familiar interface with the Slurm job scheduler, making it easy to migrate or burst workloads from an on-premises cluster environment you may already be using.
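As a sketch, a minimal ParallelCluster 3 configuration with a Slurm queue, EFA-enabled compute nodes, and an FSx for Lustre file system might look like this (the subnet IDs, key name, and instance types are placeholders you'd replace for your own environment):

```yaml
Region: us-east-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: subnet-0123456789abcdef0   # placeholder
  Ssh:
    KeyName: my-keypair                  # placeholder
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      ComputeResources:
        - Name: hpc-nodes
          InstanceType: c5n.18xlarge
          MinCount: 0                    # scale to zero when the queue is empty
          MaxCount: 16
          Efa:
            Enabled: true
      Networking:
        SubnetId: subnet-0123456789abcdef0
SharedStorage:
  - MountDir: /fsx
    Name: scratch
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200              # GiB
```

You'd create the cluster with `pcluster create-cluster --cluster-name my-cluster --cluster-configuration config.yaml`, then submit jobs to the `compute` queue with standard Slurm commands like `sbatch`.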

Figure 1 – Overview of AWS ParallelCluster and its components for HPC workloads. Integration with Slurm and Amazon EC2 right-sizes the number of compute nodes based on the job queue. Amazon FSx for Lustre provides access to a high-performance file system while also taking advantage of Amazon S3 object storage. All of this is connected using Elastic Fabric Adapter (EFA), which provides extremely high-performance connectivity and scaling for tightly-coupled workloads.

AWS Batch for container-based jobs

AWS Batch is suited for highly parallel, container-based jobs, including tightly-coupled workloads. It provides a fully-managed scheduler with seamless integration into container orchestrators, like Amazon Elastic Kubernetes Service (EKS) and Amazon Elastic Container Service (ECS), allowing researchers to leverage existing containerized applications. A typical workload might involve independently running jobs on generic/non-specific compute, leveraging native AWS integrations, or requiring horizontal scalability through MPI or NCCL.
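To make this concrete, here's a minimal sketch of the parameters you'd pass to AWS Batch's SubmitJob API for an array job; the queue name, job definition, and command are hypothetical values for illustration:

```python
def make_array_job_request(job_name: str, queue: str, job_definition: str,
                           command: list[str], array_size: int) -> dict:
    """Build SubmitJob parameters for a Batch array job: one containerized
    task template fanned out into `array_size` independent child jobs."""
    return {
        "jobName": job_name,
        "jobQueue": queue,                   # hypothetical queue name
        "jobDefinition": job_definition,     # a registered job definition
        "arrayProperties": {"size": array_size},
        "containerOverrides": {"command": command},
    }

params = make_array_job_request(
    "param-sweep", "research-queue", "sim-container:3",
    ["python", "run_sim.py"], array_size=100,
)
# With AWS credentials configured, you would submit it with:
#   import boto3
#   boto3.client("batch").submit_job(**params)
```

Each child job can find its position in the sweep via the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable that Batch sets.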

Figure 2 – AWS Batch workflow illustrating container-based job processing and integration with AWS services. Compatibility with Amazon EKS and ECS allows for flexibility at the Compute Environment layer.

Amazon SageMaker for machine learning projects

Amazon SageMaker is ideal for machine learning workloads, especially those developed in Jupyter notebooks. Rather than offering foundational building blocks for research computing, it provides a managed ecosystem of ML and data science tools covering the entire spectrum, from data discovery and exploration to model training and deployment.

SageMaker notebooks provide an interactive development environment, allowing researchers to develop and test models easily. SageMaker also offers pre-trained models, so researchers can jump-start their ML projects, and managed inference endpoints that simplify deploying models and serving predictions.
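As a hedged sketch, these are the core parameters behind a SageMaker CreateTrainingJob API call; the container image URI, IAM role ARN, and S3 paths are placeholders you'd substitute for your own:

```python
def make_training_job_request(name: str, image_uri: str, role_arn: str,
                              train_s3: str, output_s3: str) -> dict:
    """Build the core CreateTrainingJob parameters for the SageMaker API."""
    return {
        "TrainingJobName": name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,      # e.g. an AWS Deep Learning Container
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,                 # IAM role SageMaker assumes
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3,
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3},
        "ResourceConfig": {
            "InstanceType": "ml.p3.2xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 86400},
    }

req = make_training_job_request(
    "my-experiment-001",
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-train:latest",  # placeholder
    "arn:aws:iam::123456789012:role/SageMakerRole",                  # placeholder
    "s3://my-bucket/train/", "s3://my-bucket/output/",
)
# boto3.client("sagemaker").create_training_job(**req) would start the job.
```

In practice the SageMaker Python SDK wraps this call in a higher-level `Estimator` interface, but the underlying request shape is the same.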

Figure 3 – Amazon SageMaker ecosystem showcasing high-level end-to-end process from data preparation to model deployment. Integrates with services like Amazon EFS for a local file system in notebooks and also with highly-optimized AWS Deep Learning Containers for training models.

Let’s talk about the underlying compute resources

The three services we just mentioned can take advantage of Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances and AWS Fargate. Spot Instances are spare EC2 capacity offered at a deep discount, but they can be reclaimed with a two-minute warning.

SageMaker, Batch, and ParallelCluster can all use Spot Instances to take advantage of their favorable economics. In the case of Spot Instances, you’ll need to verify that your workload can tolerate interruptions from reclaimed capacity, or that the service can shift load on your behalf to avoid interrupting your processes. There are AWS technology partners, like MemVerge, that can handle this for you using OS-level memory checkpointing.

Fargate is a serverless compute engine that can be used for workloads running on Batch with ECS, and in native ECS and EKS clusters. It abstracts away the need for additional servers or infrastructure-related parameters (like instance type) to run containers. Fargate also has a few caveats regarding hardware specifications which you can find in our documentation. But – generally speaking – it’s worthwhile to see if you can use Spot or Fargate with AWS orchestration tools for your research.
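To illustrate the difference, here's a sketch of two Batch compute environments, one on Spot Instances and one on Fargate; the subnet, security group, and instance role values are placeholders:

```python
def spot_compute_environment(name: str, subnets: list[str],
                             security_groups: list[str], instance_role: str) -> dict:
    """Managed Batch compute environment backed by EC2 Spot capacity."""
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",
        "computeResources": {
            "type": "SPOT",
            # Favors Spot pools with the lowest risk of interruption.
            "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
            "minvCpus": 0,
            "maxvCpus": 256,
            "instanceTypes": ["optimal"],
            "subnets": subnets,
            "securityGroupIds": security_groups,
            "instanceRole": instance_role,
        },
    }

def fargate_compute_environment(name: str, subnets: list[str],
                                security_groups: list[str]) -> dict:
    """Serverless Batch compute environment: no instance types to manage."""
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",
        "computeResources": {
            "type": "FARGATE",
            "maxvCpus": 64,
            "subnets": subnets,
            "securityGroupIds": security_groups,
        },
    }

spot_env = spot_compute_environment(
    "spot-env", ["subnet-0abc"], ["sg-0abc"],
    "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole")  # placeholders
fargate_env = fargate_compute_environment("fargate-env", ["subnet-0abc"], ["sg-0abc"])
```

Note how the Fargate environment drops the instance-level parameters entirely; that's the abstraction Fargate buys you.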

Amazon Lightsail for Research for individual cloud desktops

Amazon Lightsail for Research is a simple (but powerful) solution for researchers looking for a predictably-priced, all-in-one cloud desktop. It’s tailored specifically for researchers, providing hardware specifications that are optimized for efficient research and a seamless user experience. Lightsail offers a range of pre-configured virtual private servers that can be customized to meet researchers’ needs and comes with research applications, like Scilab and RStudio. With its easy-to-use interface and affordable pricing, Lightsail for Research provides a reliable and efficient way for researchers to get started with AWS.

Figure 4 – Researchers can use Amazon Lightsail for Research’s simplified management interface and options to deploy their favorite applications like Jupyter, RStudio, and Scilab.

Research and Engineering Studio on AWS for managing cloud desktops at scale

Research and Engineering Studio on AWS (RES) is an open-source web-based portal for administrators to create and manage secure, cloud-based research and engineering environments. It is ideal for research organizations that want a central IT team to easily manage the underlying infrastructure for multiple research environments. It provides one-click deployment for getting started quickly but can be customized to meet an organization’s specific needs.

Administrators can create virtual collaboration spaces for specific sets of users to access shared resources and collaborate. Users get a single pane of glass for launching and accessing virtual desktops to conduct scientific research, product design, engineering simulations, or data analysis workloads.

Figure 5 – Researchers and admins alike can leverage RES to create Engineering Virtual Desktops (eVDI) backed by Amazon EC2. The RES Virtual Desktop screen shown here lists all the eVDI sessions a user created with controls to spin up, shut down, or schedule uptime.

AWS HealthOmics for bioinformatics

AWS HealthOmics is a comprehensive solution for bioinformatics work. It provides purpose-built storage and processing for raw genomic data, and supports popular bioinformatics workflow definition languages like WDL and Nextflow, so researchers can store, process, and analyze genomic data efficiently.

HealthOmics even offers researchers the choice to bring their own workflow or use pre-built Ready2Run workflows. Ready2Run workflows are designed by industry-leading third-party software companies like Sentieon, Inc. and NVIDIA, and include common open-source pipelines like AlphaFold for protein structure prediction. Ready2Run workflows don't require you to manage software tools or workflow scripts, which can save researchers significant amounts of time.
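As an illustrative sketch, launching a Ready2Run workflow comes down to a StartRun API call; the workflow ID, role ARN, S3 URIs, and the `fasta_path` parameter below are all hypothetical placeholders:

```python
def make_start_run_request(workflow_id: str, role_arn: str,
                           parameters: dict, output_uri: str) -> dict:
    """Build the parameters for the HealthOmics StartRun API."""
    return {
        "workflowType": "READY2RUN",   # or "PRIVATE" for a bring-your-own workflow
        "workflowId": workflow_id,     # placeholder Ready2Run workflow ID
        "roleArn": role_arn,           # IAM role HealthOmics assumes
        "name": "alphafold-demo-run",
        "parameters": parameters,      # workflow-specific inputs
        "outputUri": output_uri,       # where results land in S3
    }

run = make_start_run_request(
    "1234567",
    "arn:aws:iam::123456789012:role/OmicsRunRole",
    {"fasta_path": "s3://my-bucket/inputs/protein.fasta"},  # hypothetical input
    "s3://my-bucket/omics-output/",
)
# boto3.client("omics").start_run(**run) would launch the workflow.
```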

Figure 6 – AWS HealthOmics platform structure highlighting genomic data processing and analysis capabilities. Raw sequence and reference data can be processed through Nextflow or WDL workflows and then analyzed via AWS analytics services such as Amazon Athena.

Leveraging next-generation serverless technologies

In the past decade, AWS has pioneered the field of serverless computing. Serverless computing is a model where you can build and deploy applications without managing server infrastructure. Instead of spinning up a full virtual machine that comes with overhead like patching and monitoring, you can abstract it away and focus just on the code or process you intend to run.

This is great for use cases like event handling or asynchronous tasks, but researchers have been using serverless computing to speed up embarrassingly parallel workloads, too – including ML hyperparameter optimization, genome search, and even MapReduce.

AWS Lambda is a serverless compute service for running code without managing servers. It's designed for loosely coupled workloads, allowing researchers to run code in response to events like changes in (or the arrival of) data. Lambda scales automatically, enabling you to run thousands of concurrent executions.
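For example, a minimal (hypothetical) handler that reacts to new objects landing in an S3 bucket might look like this:

```python
import json

def handler(event, context):
    """Hypothetical Lambda handler: list the S3 objects named in an
    S3-notification event, standing in for real per-object analysis."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Real work (downloading and analyzing the object) would go here.
        processed.append(f"s3://{bucket}/{key}")
    return {"statusCode": 200, "body": json.dumps({"processed": processed})}
```

Lambda invokes this function once per event; because every invocation is independent, thousands can run concurrently across an embarrassingly parallel dataset.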

Even better: Lambda integrates with more than 200 other AWS services, making it easier for you to build quite complex workflows. It also integrates with AWS Step Functions, which lets you create visual workflows and construct multi-step, distributed applications. Lambda, combined with Step Functions, is useful for workloads that involve multiple compute steps and data needs, and may even require decision gates or human input.
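A Step Functions workflow is defined in Amazon States Language (JSON). As a sketch, this hypothetical two-state machine preprocesses a dataset and then fans out independent trials with a Map state (the Lambda function ARNs are placeholders):

```json
{
  "Comment": "Hypothetical research pipeline: preprocess, then fan out trials",
  "StartAt": "PreprocessData",
  "States": {
    "PreprocessData": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:preprocess",
      "Next": "RunTrials"
    },
    "RunTrials": {
      "Type": "Map",
      "ItemsPath": "$.trials",
      "Iterator": {
        "StartAt": "RunOneTrial",
        "States": {
          "RunOneTrial": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:run-trial",
            "End": true
          }
        }
      },
      "End": true
    }
  }
}
```

Each element of the input's `trials` array is handed to its own `RunOneTrial` task, so the fan-out width follows the data rather than a fixed fleet size.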

Figure 7 – Illustration of a sample serverless computing architecture: a data stream of jobs to be processed is picked up by AWS Lambda and put into a downstream Amazon SQS queue. AWS Step Functions then reads from this queue and handles the heavy lifting of distributed compute orchestration via Lambda worker functions. Step Functions also leverages Amazon DynamoDB for workload state management, Amazon EventBridge for event handling, and Amazon S3 for storing processed results.

Conclusion: making the right choice

Choosing the right AWS compute orchestration tool is not about finding a one-size-fits-all solution but about aligning the tool’s capabilities with the specific requirements of your workload. The nuances of your project, the nature of your data, and your computational needs should guide your decision.

Start with a small workload to gauge the tool’s compatibility and scalability with your project. AWS has a comprehensive suite of services and is ready to support you at every step of your journey, ensuring that you have the right resources and environment to push the boundaries of your research.

AWS Partners are also here to help you implement these tools. Partners like Ronin and Ansys bring valuable expertise that can accelerate your time to research on AWS.

We recommend consulting with your research team and AWS account team to help you make the best decision for your project. As your research evolves, AWS’s scalable and diverse computing environment will continue to provide the necessary tools and support to meet your computational needs. To dive deeper into the nuances of a few of the tools we touched on in this post, we encourage you to check out some of our previous posts about Choosing between AWS Batch or AWS ParallelCluster for HPC, why you should use Fargate with AWS Batch, and how you can save up to 90% using EC2 Spot.

Patrick Guha

Patrick Guha is a Solutions Architect at AWS based in Austin, TX. He supports non-profit, research customers focused on genomics, healthcare, and high-performance compute workloads in the cloud. Patrick has a BS in Electrical and Computer Engineering, and is currently working towards an MS in Engineering Management.

Matt Bollinger

Matt Bollinger is a Senior Solutions Architect at AWS working with nonprofit, research organizations focused on Earth Observation and geosciences. He guides customers on leveraging cloud-native practices to solve their problems efficiently and sustainably.