AWS HPC Blog

Large-scale, cost-effective GROMACS simulations using the Cyclone Solution from AWS

This post was contributed by Carsten Kutzner and Nicolai Kozlowski, Max Planck Institute for Multidisciplinary Sciences; Ludvig Nordstrom, Principal Solutions Architect at Amazon Web Services; and Ramin Torabi, Senior Specialist HPC Solutions Architect at Amazon Web Services.

Biomolecules like proteins are the nanomachines of life, doing all the work in our cells, and indeed in all living organisms. Proteins and their function are determined by their amino acid sequence (stored in DNA), but the specific function of most proteins is unknown. At the same time, the number of known proteins is so large (more than 200 million!) that it’s impossible to study each one in detail.

At the Department of Theoretical and Computational Biophysics at the Max Planck Institute in Göttingen, we study these building blocks of life to understand them from a physical point of view. To do this, we make extensive use of molecular dynamics (MD) simulations as a tool. When we do this, we very frequently use GROMACS (an open-source MD package) to simulate various biomolecular systems – from small molecules as used in computational drug design, to proteins, membranes, membrane channels, ribosomes, and whole virus shells.

In this post we’ll walk you through a large-scale HPC workload we ran on AWS, deploying GROMACS across three regions concurrently using the AWS-Cyclone-Solution (or just “Cyclone”). Our focus areas were cost-efficiency and capacity, so we used Spot pricing to maximize our scientific output within a tight budget.

Prep work

This post is a continuation of an earlier one from 2021, where we ran 20k simulations in 3 days to accelerate early-stage drug discovery using AWS Batch. You can get more background on GROMACS in our blog series “Running GROMACS on GPU instances”.

To prepare for this run in a new AWS account, we needed to raise our EC2 service quotas for “All G and VT Spot Instance Requests” in the regions we used (Frankfurt, Ireland, and Northern Virginia).
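
If you prefer the command line over the console for this, the Service Quotas API can be used to look up and request the increase. Here’s a minimal sketch using the AWS CLI; the quota code has to be looked up first, and the desired value below is just an example, not our actual quota:

# Look up the quota code for the Spot quota in one region
aws service-quotas list-service-quotas --service-code ec2 --region eu-central-1 \
    --query "Quotas[?QuotaName=='All G and VT Spot Instance Requests'].[QuotaCode,Value]" \
    --output text

# Request an increase (repeat for each region; the desired value is an example)
aws service-quotas request-service-quota-increase --service-code ec2 \
    --quota-code <code-from-previous-command> --desired-value 512 --region eu-central-1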

The Dynasome Project: motivation and technical requirements

Because there are so many proteins, the Dynasome project takes a novel approach: we compare lots of proteins in an automated, data-driven way to classify their dynamics and predict their function. To do this, we perform MD simulations on a representative set of 200 proteins. We analyze these simulations to obtain a dynamics fingerprint of the proteins under study. Then we measure the similarity – and dissimilarity – of these fingerprints, which we use to predict their function.

In this part of the project, we wanted to generate as much trajectory data as possible for the 200 proteins. Since the individual simulations are not very demanding, it’s not critical to choose the fastest, most powerful instances. What is demanding is the sheer volume of trajectory data the project requires, so we wanted to get as much throughput as possible out of our budget. Balancing this was the length of our funding period – we needed to spend the budget within the calendar year 2023.

All of this meant that we needed a significant scale to finish in time.

AWS offers a variety of compute instances, including those ideal for high throughput computing and others specifically designed for HPC. For MD simulations of proteins with GROMACS, single-GPU instances are generally the most cost-efficient in our experience (we provide a detailed benchmark later in this post).

To use the entire budget by the end of 2023, we figured that we needed to use these Spot Instances at scale across multiple regions. So, we turned to AWS-Cyclone-Solution to orchestrate it all.

Results: cost-effective instances for GROMACS

First, we used benchmarks to determine which instances give us the greatest cost-efficiency for our GROMACS simulations. This allowed us to maximize the total length of MD trajectory (i.e., the amount of simulated time at the molecular scale) produced from our budget. It’s common to measure computational performance for this process in ns/day – that’s nanoseconds of simulated time per day of compute time on the cluster.
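
As a back-of-the-envelope check, cost efficiency in ns/USD is just the measured performance divided by what an instance costs per day. A minimal sketch in the shell; the numbers are placeholders, not our benchmark results:

# Performance comes from the "Performance:" line at the end of GROMACS' md.log
perf_ns_per_day=144      # measured ns/day for an instance type (placeholder)
price_usd_per_hour=0.50  # hourly instance price (placeholder)

awk -v p="$perf_ns_per_day" -v c="$price_usd_per_hour" \
    'BEGIN { printf "cost efficiency: %.1f ns/USD\n", p / (c * 24) }'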

Figure 1 summarizes the results for the representative 82,000-atom MEM benchmark system. The most cost-effective instances appear in the upper part of Figure 1, and these are mostly smaller GPU instances, i.e., those with fewer CPU cores, such as the smaller sizes of the g5g, g5, and g4dn instance families. For simulations with GROMACS, single-GPU instances are generally much more cost-effective than CPU instances.
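
For reference, a typical single-GPU GROMACS launch on such an instance offloads the short-range non-bonded and PME work to the GPU and keeps the remaining work on the CPU cores. This is a generic sketch rather than the exact command we used in production (that appears in the job script below); the input file name is a placeholder and the available offload flags depend on the GROMACS version:

# One thread-MPI rank, all CPU cores as OpenMP threads, non-bonded and PME work on the GPU
gmx mdrun -s benchMEM.tpr -ntmpi 1 -ntomp $(nproc) -nb gpu -pme gpu -pin on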

Figure 1 – Cost efficiency (= simulated time span divided by on-demand instance cost) versus GROMACS performance for different instance types, color-coded by instance family: all g5g instances orange, g4dn green, etc. The number in the icon indicates the exact instance type: “1” for .xlarge, “2” for .2xlarge, “4” for .4xlarge, and so on. Filled symbols indicate GPU instances, open circles CPU instances. Spot prices give even higher cost efficiency, but we didn’t use Spot pricing here for comparison because individual Spot prices can fluctuate. However, the order of instances is typically preserved (g5g, g5, and g4dn are still best).

Figure 2 – Performance, on-demand cost, and on-demand cost-efficiency for GROMACS simulations on various AWS instance types. Performance as a function of the on-demand instance cost ($/h) for the MEM (circles) and RIB (hexagons) benchmarks on CPU (open symbols) and GPU instances (filled symbols). Gray sloped lines are isolines of equal cost efficiency. The most cost-efficient instances appear at the top left. The efficiency increases by a factor of two as you move from one gray line to the next. We have shown that in our case the same instances are optimal for small and for large benchmark systems.

Using the Cyclone solution from AWS

The Cyclone solution from AWS is an open-source stack that lets customers quickly start running HPC workloads on AWS that require (or can benefit from) large amounts of compute or high scheduling throughput. Customers can quickly scale to millions of vCPUs or tens of thousands of GPUs. Cyclone offers the HYPER CLI, which lets you quickly configure and deploy cloud-native compute clusters, queues, and job definitions to submit and manage potentially millions of simulation jobs. Cyclone supports running containers on AWS Batch or directly on Amazon EC2 instances.

The solution has proven especially valuable for large-scale scientific workloads like those run by Max Planck and Iktos. Max Planck has deployed clusters spanning three regions that reached more than 3,500 g4/g5 GPU instances in a single cluster. While the HYPER CLI provides access to image build pipelines in the cloud, Max Planck builds its images on premises and then uses the HYPER CLI to import them into the AWS-Cyclone-Solution environment.

To reduce cost, we configured the Cyclone solution to consume EC2 Spot Instances by default. Amazon EC2 Spot Instances let you take advantage of otherwise unused EC2 capacity in the AWS cloud at discounts of up to 90% compared to On-Demand prices. To make the best use of Spot capacity, AWS-Cyclone-Solution creates multi-region clusters and leverages both Spot placement scores and EC2 allocation strategies to intelligently spread compute across regions and Availability Zones based on the total available Spot capacity and its reliability.
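
You can query Spot placement scores yourself through the EC2 API to see how likely you are to obtain a given amount of Spot capacity per region. A minimal sketch with the AWS CLI; the instance types, target capacity, and regions are examples, not our exact configuration:

aws ec2 get-spot-placement-scores \
    --instance-types g4dn.xlarge g4dn.2xlarge g5.xlarge \
    --target-capacity 500 \
    --region-names eu-central-1 eu-west-1 us-east-1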

How we used the Cyclone solution for molecular dynamics

As in our previous project running GROMACS on AWS, we chose an AWS-Cyclone-Solution setup (Figure 3). We configured Cyclone to use AWS Batch in each of the regions. The advantage of this is that, once set up, it can be scaled very quickly to other regions of the world by simply adding new regions in your host configuration. We could install and test our setup in a single region with a single instance type, and once everything worked as expected, we added more regions and instance types. Besides the Cyclone installation itself, which works interactively following the HYPER CLI instructions, we needed a few more pieces:

  1. A GPU-enabled Docker container containing our pre-compiled GROMACS software.
  2. A job script with instructions on what each job needs to do to run.
  3. An Amazon S3 bucket to provide our simulation input data and to store the intermediate checkpoint files as well as the simulation output data (see the staging sketch right after this list).
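
The job script below expects one S3 prefix per protein and trajectory. Here’s a minimal sketch of how the input for one trajectory might be staged; the bucket name and PDB ID follow the paths used in the script, and the exact layout is a choice, not a requirement:

# Stage the run input (.tpr) and checkpoint (.cpt) for trajectory 1 of protein 1bsg
aws s3 cp traj.tpr s3://bucketname/mddata/1bsg/traj_1/traj.tpr
aws s3 cp traj.cpt s3://bucketname/mddata/1bsg/traj_1/traj.cpt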

We uploaded all input data to our S3 bucket for the jobs to consume. When submitting jobs, we used the array job option of the qsub command provided by the AWS-Cyclone-Solution CLI. The shell script passed to qsub contains our executable commands and the AWS-Cyclone-Solution configuration for each job, as shown in this code snippet.

#!/bin/bash

## HYPER (Cyclone) SETTINGS:
#HYPER -n jobname
#HYPER -q cyclone-queue-md
#HYPER -r 2
#HYPER -d cyclone-job-def-md

## ACTUAL CODE
# set up working directory
mkdir wdir
cd wdir

# Loop: run mdrun in 30-minute chunks (96 x 0.5 h of wall time), syncing files with S3 between chunks
for t in $(seq 1 96)
do
    # Get input data from s3
    aws s3 cp s3://bucketname/mddata/1bsg/traj_TrajNum/traj.tpr .
    aws s3 cp s3://bucketname/mddata/1bsg/traj_TrajNum/traj.cpt .
  
    # Perform MD
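    # -maxh 0.5 makes mdrun stop gracefully after roughly 30 minutes of wall time,
    # -cpi resumes from the most recent checkpoint, and -noappend writes each chunk
    # to its own numbered output files, so partial results survive Spot interruptions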
    gmx mdrun -s traj.tpr -cpi traj.cpt -noappend -maxh 0.5 -deffnm traj
  
    # Put output files to s3
    rm traj.tpr
    aws s3 cp --recursive . s3://bucketname/mddata/1bsg/traj_TrajNum/
    aws s3 cp s3://bucketname/mddata/1bsg/traj_TrajNum/traj.cpt s3://bucketname/mddata/1bsg/traj_TrajNum/traj_$( date +%F_%H:%M ).cpt

    # clear working directory
    rm *
done

# clean up
cd ..
rm -r wdir
exit 0

In the qsub command we also specify an array job submission file, where each line simply lists the parameter replacements to make for one job within the array. When an array file is included, the qsub command applies these replacements to the shell script above as simple text string substitutions.

Here’s an example array file for string replacement in the baseline executable:

{"TrajNum":"1"}
{"TrajNum":"2"}
{"TrajNum":"3"}
...
{"TrajNum":"100"}
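
Such an array file is easy to generate with a small shell loop; the output file name is an arbitrary choice:

# One line per job; each value replaces every occurrence of the string TrajNum in the job script
for i in $(seq 1 100); do
    echo "{\"TrajNum\":\"$i\"}"
done > array_jobs.txt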

Once we submitted the jobs, we could monitor their status via the CLI qstat command and use the qlog command to query the stdout and stderr logs of each job while it was running, to make sure jobs weren’t stuck. The logs include timestamped vCPU and RAM utilization figures, which we used to check that jobs were efficiently using the resources specified in the job definition.

Figure 3 – Cyclone-based setup distributing workers across AWS Batch clusters in three AWS regions. Each job requires one GROMACS input data file (~3 MB .tpr file) and one checkpoint file (~1.5 MB .cpt file), totaling ~7 GB of input data stored on S3 for the entire project (1,500 independent simulations). However, due to continuous checkpointing every 30 minutes, more data is transferred to/from S3 over the project’s lifetime. The total size of the resulting data (mostly GROMACS trajectory files) is 116 GB.

We built the Docker container in two stages. First, we installed a GPU-enabled GROMACS 2018 build using Spack (see the Spack howto) with a Dockerfile like this:

# Build stage with Spack pre-installed and ready to be used
FROM spack/ubuntu-jammy:latest as builder

# What we want to install and how we want to install it
# is specified in a manifest file (spack.yaml)
RUN mkdir /opt/spack-environment \
&&  (echo spack: \
&&   echo '  # add package specs to the `specs` list' \
&&   echo '  specs:' \
&&   echo '  - gromacs@2018.8+cuda~mpi' \
&&   echo '  view: /opt/view' \
&&   echo '  concretizer:' \
&&   echo '    unify: true' \
&&   echo '  config:' \
&&   echo '    install_tree: /opt/software') > /opt/spack-environment/spack.yaml

# Install the software
RUN cd /opt/spack-environment && spack env activate . && spack install --fail-fast && spack gc -y

# The result of this stage is the Spack installation of GROMACS 2018; we tag the image as gmx2018 and use it as the builder below
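
The builder image needs to be built and tagged so the runtime Dockerfile can reference it as gmx2018. A hedged sketch of the build command; the Dockerfile name is a placeholder:

# Build the Spack-based GROMACS stage and tag it for the runtime Dockerfile to pick up
docker build -t gmx2018 -f Dockerfile.gromacs-build .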

Next, we build the final container:

FROM gmx2018 as builder

# Bare OS image to run the installed executables
FROM nvidia/cuda:11.8.0-base-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive
RUN apt update && \
   apt install -y python3-pip nvidia-cuda-toolkit nvidia-driver-535 && \
   rm -rf /var/lib/apt/lists/*
# the last line above cleans up the apt package lists, which we don't need in the container

RUN pip install --upgrade pip && \
   pip install boto && \
   pip install boto3 && \
   pip install awscli && \
   pip install psutil && \
   pip install jsonpickle && \
   pip install py-cpuinfo==8.0.0

RUN ln -s python3.10 /usr/bin/python

COPY --from=builder /opt/spack-environment /opt/spack-environment
COPY --from=builder /opt/software /opt/software
COPY --from=builder /opt/._view /opt/._view
COPY --from=builder /opt/view /opt/view
# copy libs that GROMACS gmx executable needs:
COPY --from=builder /lib/x86_64-linux-gnu/libgomp.so.1      /lib/x86_64-linux-gnu/libgomp.so.1
COPY --from=builder /lib/x86_64-linux-gnu/libgfortran.so.5  /lib/x86_64-linux-gnu/libgfortran.so.5
COPY --from=builder /lib/x86_64-linux-gnu/libquadmath.so.0  /lib/x86_64-linux-gnu/libquadmath.so.0

ENV NVIDIA_DRIVER_CAPABILITIES=compute

COPY start.sh /
RUN chmod +x start.sh

RUN { \
     echo '#!/bin/sh' \
     && echo '.' /opt/software/linux-ubuntu22.04-broadwell/gcc-11.4.0/gromacs-2018.8-mj6fmxyw23jkyiru2fl6uvihgrxaxsbx/bin/GMXRC \
     && echo 'exec "$@"'; \
   } > /entrypoint.sh \
&&  chmod a+x /entrypoint.sh

ENTRYPOINT [ "/entrypoint.sh" ]
CMD [ "/bin/bash" ]
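
The final image ends up in Amazon ECR (as described next). If you push it there manually instead of importing it through the HYPER CLI, the steps might look like this; the repository name, region, and account ID are placeholders:

# Build the runtime image
docker build -t gromacs-md -f Dockerfile.runtime .

# Authenticate Docker against ECR, then tag and push the image
# (the repository must already exist: aws ecr create-repository --repository-name gromacs-md)
aws ecr get-login-password --region eu-central-1 \
    | docker login --username AWS --password-stdin 123456789012.dkr.ecr.eu-central-1.amazonaws.com
docker tag gromacs-md:latest 123456789012.dkr.ecr.eu-central-1.amazonaws.com/gromacs-md:latest
docker push 123456789012.dkr.ecr.eu-central-1.amazonaws.com/gromacs-md:latest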

The final container image that we hosted in Amazon Elastic Container Registry (Amazon ECR) was about 5 GB in total. The Cyclone solution automatically replicates images across the enabled regions so that worker containers pull a local copy of the image when they start. This avoids cross-region data transfer costs and reduces worker start-up time. For the same reasons, it’s a good idea to include static files in the container image.

Results: scale and workload-specific results

As we mentioned, we had to spend our budget within the calendar year 2023, so when we started simulating in late November we had a little more than a month left. After setup and testing, we entered production using a soft limit in Cyclone of 1000 vCPUs, which corresponds to roughly 200 to 300 instances running simultaneously. However, this pace wasn’t sufficient to finish the project by the end of the year, so in mid-December we scaled up to over a thousand instances, using five of the most cost-efficient instance types across three regions (Figure 4). After rapidly scaling up, we had to gradually reduce spending towards the end of December to stay within budget.

Figure 4 – Number of compute instances used simultaneously over the course of the project. Traces on the left indicate the distribution over different regions; traces on the right indicate the type of utilized compute instances. We rapidly scaled up the project in mid-December to finish at the end of the year.

In total, we spent the $100,000 on 13,000 compute instance-days, producing 1.88 ms of MD trajectories and achieving an average cost efficiency of 18.8 ns/USD. For even greater cost efficiency we could have added g5g instances, but since they are Arm-based they would have required a separate container. This was technically possible, but we decided against it this time to keep the overall setup simple. We are very pleased that we managed to run our simulations near the optimum of cost-efficiency while also reaching the required throughput by scaling out to more than 1,000 EC2 instances.

We spent our budget almost exclusively on compute instances using cost-optimized EC2 Spot Instance pricing, with negligible overhead costs for other services like data storage or data transfer – this is clear from Figure 5.

Figure 5 – Costs incurred on a daily basis during the approximately one month it took to complete the project. Costs are broken down by category, with EC2 compute (blue) accounting for the largest share, indicating that there is negligible overhead from non-compute services.

With our specific workload, running one simulation per compute instance, the average MD performance was 144 ns/day; it varied between instance types as expected from our earlier benchmarks (Figure 6).

Figure 6 – Simulation performance across instance types for this project’s specific workload. Y-axis shows the number of jobs. The spread of performance within each instance type is caused by differently sized proteins: larger proteins take more time on the same instance type.

Conclusions & Outlook

Our Cyclone-based workflow distributes large simulation projects across multiple AWS regions, dramatically reducing time to solution. With the Cyclone solution using AWS Batch and Spot Instance pricing by default, we can drastically reduce the cost of our runs without sacrificing the capacity available to us. Cyclone’s single queue entry point made it easy to run at scale in the cloud.

Cost efficiency critically depends on choosing appropriate instance types. For small to mid-sized biomolecular systems that can run on a single GPU, g5, g5g, and g4dn Spot Instances currently provide the best price/performance for GROMACS. These instances offer a good mix of CPU and GPU compute resources for this workload.

Using three regions, we were able to obtain about 1,000 g4dn and g5 GPU EC2 instances at Spot pricing for our simulations. For higher compute requirements, we could use the HYPER CLI to add more regions to the mix, or allow more instance types in the cluster configuration.

The combination of the GROMACS checkpointing mechanism and the automatic job retries that the Cyclone solution provides meant that occasional Spot Instance interruptions weren’t a problem.

We expect to see even greater cost efficiencies for GROMACS simulations with new instances like the g6 and g6e, and we’re eager to test these soon.

In the ever-evolving landscape of scientific computing, leveraging cloud computing has become an essential step for researchers and scientists seeking to harness the power of cutting-edge technologies. AWS offers a comprehensive suite of High-Performance Computing (HPC) services tailored to meet the demands of computationally intensive workloads, enabling researchers to accelerate their discoveries and drive innovation.

Wherever you are in your journey to adopting cloud computing for your scientific workloads, AWS HPC has a solution for you. We want you to see what’s possible when running workloads using AWS-Cyclone-Solution, AWS Batch, Amazon EC2 with Spot Instance pricing, or by containerizing your application. Embrace the power of the cloud and accelerate your scientific discoveries with AWS HPC.

The content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this blog.

References

Kutzner, C.; Kniep, C.; Cherian, A.; Nordstrom, L.; Grubmüller, H.; de Groot, B. L.; Gapsys, V.: GROMACS in the cloud: A global supercomputer to speed up alchemical drug design. Journal of Chemical Information and Modeling 62 (7), pp. 1691 – 1711 (2022) https://pubs.acs.org/doi/10.1021/acs.jcim.2c00044

Carsten Kutzner

Carsten Kutzner is a staff scientist at the Max Planck Institute for Multidisciplinary Sciences in Göttingen, Germany, in the Department of Theoretical and Computational Biophysics. As a developer of scientific software, he works on new methods for biomolecular simulations with GROMACS and on cost-effective high-performance and high-throughput computing.

Ludvig Nordstrom

Ludvig Nordstrom is an AWS Principal Solutions Architect who built and published the aws-cyclone-solution. He has supported customers in high throughput computing (HTC) and high-performance computing (HPC) over the last 7 years, helping customers across the financial services, life sciences, and semiconductor industries, among others, build out cloud-native capabilities for large-scale, high-throughput compute in the cloud.

Nicolai Kozlowski

Nicolai Kozlowski is a PhD student at the Max Planck Institute for Multidisciplinary Sciences in Göttingen, Germany, in the Department of Theoretical and Computational Biophysics. In the 'Dynasome' project he develops comparison metrics for protein dynamics using high-throughput simulations and kinetic modeling.

Ramin Torabi

Ramin Torabi has been a Senior Specialist HPC Solutions Architect at AWS since 2022. He helps customers in central Europe architect their solutions around high-performance computing (HPC). He has more than 15 years of experience with HPC and CAE, especially in the automotive, aerospace, and other manufacturing industries. Ramin received a "Dr. rer. nat." (PhD) in theoretical nuclear structure physics from the Technical University of Darmstadt in 2009.