Deploying and running HPC applications on AWS Batch
In this document, we introduce the use of the AWS Batch and Amazon Elastic Container Service (Amazon ECS) managed services for running HPC applications such as GROMACS and RELION. Containers are combined with the traditional approach of building and provisioning HPC application binaries, which keeps the container image small and avoids a proliferation of per-application containers.
The purpose of this document is to share our discoveries and best practices for migrating a simple HPC workflow, which includes batch and legacy X11 applications, to AWS Batch. The goal is to show how to architect the infrastructure using managed AWS services such as AWS Batch, Amazon FSx for Lustre, Amazon Elastic File System, and AWS Step Functions, with the help of the AWS Cloud Development Kit (AWS CDK), and to make the solution available as a product in AWS Service Catalog for a self-service user experience.
The solution described here can be used for a broad range of HPC applications, including those with the most demanding CPU/GPU and networking requirements. In this case, we have selected GROMACS and RELION. GROMACS (GROningen MAchine for Chemical Simulations) is a molecular dynamics package used for simulating proteins, lipids, and nucleic acids. RELION (REgularised LIkelihood OptimisatioN) is used for processing electron cryo-microscopy (cryo-EM) images to identify 3D macromolecular structures. Both applications can run on CPUs or GPUs to accelerate compute-intensive steps, and both can scale out across multiple nodes through MPI (Message Passing Interface). The design of this solution is suitable for a proof-of-concept activity; you should expect to do more integration work to enable these workloads in your production deployments.
AWS offers a broad range of products and services that can be combined in various ways to achieve optimal price/performance and, more importantly, to attain business agility. The business outcomes underlying this project are as follows:
- Performance optimization: This involves reducing costs by optimizing the application binaries and the underlying software stack of libraries. Cost-performance optimization can be achieved in two main ways: by selecting different Amazon EC2 instance types or by choosing the Spot pricing model.
- Reduction of dependency on legacy software applications: This goal aims to minimize the need for ongoing maintenance for software updates and fixes.
- Highly scalable HPC infrastructure: The objective is to automatically scale out/in the compute environment to meet the demands of the workload.
- Improved availability of the service to end users.
- Enhanced flexibility in balancing between minimizing completion time and maximizing cost savings.
- Software-defined infrastructure: This enables building and deploying through any CI/CD pipeline and facilitates self-service delivery.
The goal of this post is to assess the feasibility of running highly demanding HPC applications in a highly scalable environment built on fully managed services such as AWS Batch and Amazon ECS.
HPC applications like GROMACS and RELION, originally designed for traditional HPC systems, can be run on AWS using the same software stack utilized in on-premises static data centers. This can be achieved by leveraging the EC2 virtualization layer or employing EC2 metal instances, while benefiting from the high-performance networking offered by Elastic Fabric Adapter (EFA). To handle scale-out/in, a job scheduler such as SLURM can be deployed within an AWS ParallelCluster instance.
However, in this project, we have designed a cloud-native architecture based on managed services like AWS Batch, eliminating the reliance on third-party software for job scheduling (e.g., SLURM) and infrastructure elasticity management (e.g., extensions in ParallelCluster for scale-out/in based on the job queue).
The use of Amazon ECS containers for HPC applications and workflows provides an additional level of flexibility, but it raises several key questions:
- How does application performance on AWS Batch/ECS compare to running on-premises/EC2?
- What is the optimal cost-performance tradeoff considering CPU and GPU instances, Spot, and On-Demand pricing?
- How efficiently does AWS Batch execute tightly-coupled and multi-node parallel applications in containers while leveraging EFA and high-performance file systems?
- What containerization strategy should be employed: one container for each distinct application or for a group of applications?
- How can Spot or On-Demand pricing be selectively used to further optimize costs?
Addressing these questions will help determine the best approach for running HPC workloads on AWS and optimizing performance and cost-efficiency.
The following architecture demonstrates how AWS services are combined to centrally manage applications and associated resources, achieving consistent governance and transforming workflows into IT services for end users.
To identify, manage, and audit the solution, the administrator can deploy the architecture using AWS CDK. This approach provides flexibility and keeps all content relevant to both the architecture itself and its governance processes in a single, reproducible definition.
The architecture includes the following components:
- An AWS Batch compute environment used to run GROMACS and other containerized HPC applications
- A multi-layer storage solution consisting of Amazon S3, Amazon FSx for Lustre, and Amazon EFS
- An orchestration layer based on Step Functions
- A remote visualization layer using the NICE DCV remote display protocol for running RELION
The AWS Batch compute environment offers multiple queues with various EC2 instance types, including GPU instances, as some workflows benefit from NVIDIA Tensor Cores. Spot and On-Demand compute environments are used to prioritize either the cost or the completion time of simulations, avoiding unnecessary expenses and controlling job completion time as suggested in the Cost Optimization Pillar.
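As a sketch of how these paired environments might be declared (names, instance types, and ARNs below are illustrative placeholders, not values from this project), here are parameter dictionaries in the shape accepted by Batch's CreateComputeEnvironment and CreateJobQueue APIs:

```python
# Illustrative parameter dicts in the shape of the AWS Batch
# CreateComputeEnvironment / CreateJobQueue APIs; subnet, role,
# and vCPU limits are placeholder assumptions.
def compute_environment(name, pricing):
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",
        "computeResources": {
            "type": pricing,                  # "SPOT" or "EC2" (On-Demand)
            "minvCpus": 0,                    # scale in to zero when idle
            "maxvCpus": 1024,
            "instanceTypes": ["c5n.18xlarge", "p3.8xlarge"],  # CPU and GPU options
            "subnets": ["subnet-PLACEHOLDER"],
            "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
        },
    }

spot_ce = compute_environment("hpc-spot", "SPOT")
ondemand_ce = compute_environment("hpc-ondemand", "EC2")

# Each pricing model backs its own queue, so the orchestration layer
# can choose which one to submit a job to.
spot_queue = {
    "jobQueueName": "hpc-spot-queue",
    "priority": 1,
    "computeEnvironmentOrder": [
        {"order": 1, "computeEnvironment": spot_ce["computeEnvironmentName"]},
    ],
}
```

Setting `minvCpus` to 0 is what lets the environment scale fully in when no jobs are queued.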
For job monitoring and orchestration, AWS Step Functions is employed. The Step Functions workflow is designed to optimize job costs by first attempting execution on Spot Instances and then falling back to On-Demand Instances if the job fails or remains pending. This configuration ensures quick recovery from Spot Instance interruptions and keeps up with demand, as described in the Reliability Pillar.
Considering the requirement for handling multiple categories of data, the solution incorporates different layers of storage. This configuration is based on the Performance Efficiency Pillar, aiming to optimize workload performance.
Amazon S3 is used as a high-resiliency, low-cost storage for input and output files. In a typical use case, the input file is downloaded from the S3 bucket during job bootstrap, and the generated output files are uploaded back to S3 upon completion. The bucket can be configured with a lifecycle policy to migrate files between storage classes based on file access or age, further optimizing costs.
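As an example of such a policy (the prefix and day thresholds are assumptions for illustration), a lifecycle configuration in the shape used by S3's PutBucketLifecycleConfiguration API could move job results to colder storage classes as they age:

```python
# Illustrative lifecycle rules in the shape of S3's
# PutBucketLifecycleConfiguration API; the prefix and the
# 30/180-day thresholds are assumptions, not project values.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-job-results",
            "Status": "Enabled",
            "Filter": {"Prefix": "results/"},  # apply only to job output objects
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access after a month
                {"Days": 180, "StorageClass": "GLACIER"},     # archive after six months
            ],
        }
    ]
}
```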
A high-performance scratch file system based on Amazon FSx for Lustre is utilized for staging GROMACS job input and output files. This file system is mounted on each EC2 instance running AWS Batch jobs, and the Docker environment is configured to bind mount the file system into the container.
To simplify the compilation and installation of application binaries, the environment is integrated with the Spack package manager and a set of build recipes. With Spack, it is possible to build packages with multiple versions, configurations, platforms, and compilers, all coexisting on the same machine. The Lmod environment module system is used to launch the installed applications.
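Assuming Spack and Lmod are already installed on the shared file system (the variant and commands below are a generic sketch, not this project's exact recipes), building and launching an application looks like:

```shell
# Build GROMACS (with MPI support) into the shared install tree;
# Spack resolves the compiler and dependency versions.
spack install gromacs +mpi

# Regenerate the Lmod module files for newly installed packages.
spack module lmod refresh -y

# Inside a job, load the module and invoke the MPI-enabled binary.
module load gromacs
gmx_mpi --version
```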
The application binaries are located on Amazon Elastic File System (Amazon EFS), which is also used to store job launch configurations and scripts. Like the FSx for Lustre file system, EFS is mounted on each EC2 instance running AWS Batch jobs and bind mounted into the Docker container.
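Exposing both file systems inside the container happens at the job-definition level. As a sketch (the image URI, mount paths, and resource sizes are illustrative assumptions), here is a job definition in the shape of Batch's RegisterJobDefinition API, mapping the host mount points to Docker volumes:

```python
# Illustrative Batch job definition exposing the host-mounted
# FSx for Lustre scratch space and the EFS tree of Spack-built
# binaries to the container; all names/paths are placeholders.
job_definition = {
    "jobDefinitionName": "gromacs-job-definition",
    "type": "container",
    "containerProperties": {
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/hpc-base:latest",
        "vcpus": 72,
        "memory": 140000,
        "volumes": [
            {"name": "fsx-scratch", "host": {"sourcePath": "/fsx"}},  # FSx mounted on the instance
            {"name": "efs-apps", "host": {"sourcePath": "/efs"}},     # EFS with application binaries
        ],
        "mountPoints": [
            {"sourceVolume": "fsx-scratch", "containerPath": "/fsx"},
            {"sourceVolume": "efs-apps", "containerPath": "/efs", "readOnly": True},
        ],
    },
}
```

Mounting the EFS binaries read-only is a defensive choice: jobs consume the Spack tree but should never modify it.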
The ECS container image has been created with an OS and software stack suitable for running both targeted applications. The use of a single container simplifies infrastructure deployment by using a general-purpose container compatible with the required applications. This configuration is possible because the container does not contain the application binaries; instead, they are installed with Spack in the shared file system.
Post-execution analysis plays a crucial role in the scientists’ workflow. Therefore, the environment is configured to use NICE DCV, allowing applications to run on CPU for software rendering or GPU for 3D/OpenGL rendering support. The ability to choose between software rendering or 3D/OpenGL rendering optimizes costs by selecting the appropriate instance type for the specific application.
To collect system issues and application errors, system logs are monitored and retrieved using Amazon CloudWatch.
The IT personnel provision and update the infrastructure using the AWS Cloud Development Kit (CDK), ensuring reproducibility and simplifying maintenance. The infrastructure CDK can be designed to be general enough for reuse through AWS Service Catalog, supporting different business units and/or users located in different AWS Regions.
The availability of AWS Service Catalog for end users transforms application workflows into IT services that are more reproducible and easier to maintain.
The end user’s experience begins with AWS Service Catalog, where they will find the entry for “RELION”. Clicking this entry launches an AWS CloudFormation template that was created through the CDK. Behind the scenes, the CloudFormation template creates a DCV virtual session running the RELION application in an X11 Linux desktop and provides a link for connecting with a common web browser or the DCV client.
RELION is pre-configured with a custom script that submits a distinct Step Functions execution for each GROMACS job, implementing the desired data and job orchestration functions.
The Step Functions task retrieves the input data and sets up a work environment on FSx for Lustre for the job. It then submits the job to a Spot queue in AWS Batch and monitors its execution. If the job remains pending, the task terminates it and resubmits it to an On-Demand queue. Once the job is completed, the Step Functions workflow transfers the job results back to S3.
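The Spot-first, On-Demand-fallback pattern above can be sketched in Amazon States Language. The queue and job-definition names below are placeholders, the result-copy step is reduced to a Pass state, and the fallback is modeled with a simple catch-all; the real workflow also handles the pending-timeout case:

```python
import json

# Placeholder names; substitute your Batch queue and job-definition ARNs.
SPOT_QUEUE = "hpc-spot-queue"
ONDEMAND_QUEUE = "hpc-ondemand-queue"
JOB_DEF = "gromacs-job-definition"

def batch_submit_state(queue, next_state):
    """One synchronous Batch submission (the .sync suffix makes
    Step Functions wait for job completion before moving on)."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::batch:submitJob.sync",
        "Parameters": {
            "JobName": "gromacs-md",
            "JobQueue": queue,
            "JobDefinition": JOB_DEF,
        },
        "Next": next_state,
    }

# Try Spot first; on failure (e.g., a Spot interruption), fall back.
spot_state = batch_submit_state(SPOT_QUEUE, "CopyResultsToS3")
spot_state["Catch"] = [{"ErrorEquals": ["States.ALL"], "Next": "SubmitOnDemand"}]

definition = {
    "Comment": "Spot-first GROMACS job with On-Demand fallback",
    "StartAt": "SubmitSpot",
    "States": {
        "SubmitSpot": spot_state,
        "SubmitOnDemand": batch_submit_state(ONDEMAND_QUEUE, "CopyResultsToS3"),
        "CopyResultsToS3": {"Type": "Pass", "End": True},  # stands in for the S3 upload step
    },
}

print(json.dumps(definition, indent=2))
```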
The end user’s experience with RELION is identical to running the application on a static infrastructure with a rigid software stack, such as SLURM with VDI clusters. However, in this case, the user is presented with a simple entry in AWS Service Catalog. The use of ECS containers in conjunction with Batch’s scale-out/in capabilities and Step Functions workflows allows IT administrators to conceal the implementation details of the HPC infrastructure and data/job orchestration functions while minimizing costs.
The AWS Batch compute environment depicted in Figure 1 is suitable for benchmarking purposes, as it allows for identifying the instance types that offer the best cost/performance ratio. However, for production environments, the Batch compute environment should be designed to span multiple Availability Zones, and the selection of instance types should be tailored to the specific workloads of the company.
In addition to basic logging and event monitoring, it is recommended to collect job statistics, including execution times and costs, in production environments. These statistics can be aggregated into a dashboard that provides valuable insights on key business metrics for management and operational teams.
Since the infrastructure is created through AWS CDK, it can be integrated into a higher-level system that automates the execution of an entire workflow. This workflow may include cluster deployment, job submission, retrieval and archiving of results, and can be triggered by an event such as dropping data into an S3 bucket.
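One way to wire such a trigger (the bucket name and key prefix here are hypothetical) is an Amazon EventBridge rule that matches S3 object-created events and starts the Step Functions workflow; its event pattern might look like:

```python
import json

# Illustrative EventBridge event pattern: fire when new input data
# lands under the input/ prefix of a (placeholder) bucket, so the
# rule can start the workflow execution.
event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {"name": ["my-hpc-input-bucket"]},  # placeholder bucket
        "object": {"key": [{"prefix": "input/"}]},    # only new input files
    },
}

print(json.dumps(event_pattern))
```

The rule's target would be the state machine itself, so no extra glue code is needed between data arrival and job submission.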
In this post, we have outlined the best practices for deploying an HPC application stack on AWS using a range of AWS services. Through this project, we have successfully demonstrated the feasibility of using ECS containers and AWS Batch to run demanding HPC applications such as GROMACS and RELION.
Compared to a software stack based on job schedulers like SLURM, our solution relies on fully managed services for job scheduling, resource allocation, and resource provisioning. This simplifies maintenance and reduces the operational costs of the system.
We have designed the infrastructure using AWS CDK, enabling seamless integration with CI/CD pipelines and providing the potential for an HPC-as-a-Service solution. Additionally, we have optimized the storage architecture by implementing three layers of storage to achieve the best possible cost-to-performance ratio.
The ability to run applications in containers combines the widely adopted Spack package manager with container images optimized to leverage GPU acceleration. This is further facilitated by AWS Batch, which allows the execution of tightly coupled MPI jobs on multiple nodes, leveraging the high-performance networking provided by EFA.