GROMACS price-performance optimizations on AWS
Molecular dynamics (MD) is a simulation method for analyzing the movement of atoms and molecules and tracing their trajectories as a system evolves over time. MD simulations are used across domains such as materials science, biochemistry, and biophysics, and are typically used in two broad ways to study a system. The importance of MD was evident in recent efforts to develop a SARS-CoV-2 vaccine, where MD applications such as GROMACS helped researchers identify molecules that bind to the spike protein of the virus and block it from infecting human cells.
The typical time scales of the simulated system are on the order of microseconds (µs) or nanoseconds (ns). We measure the performance of an MD simulation in nanoseconds per day (ns/day). Simulations run for hours (sometimes days) to reach meaningful, longer timescales and to gain insight into the final conformation of a molecule. MD applications are typically tightly coupled workloads in which the system of atoms is distributed across multiple domains to attain parallelism, and a significant amount of information is exchanged across domains.
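To make the ns/day metric concrete, here is a minimal Python sketch of the arithmetic. The function names and the 120-second wall time are our own illustrative assumptions, not measured results; the 2 fs time step and 10,000 steps are typical benchmark settings.

```python
def simulated_ns(timestep_fs: float, n_steps: int) -> float:
    """Simulated time in nanoseconds for n_steps steps of timestep_fs femtoseconds."""
    return timestep_fs * n_steps * 1e-6  # 1 fs = 1e-6 ns


def ns_per_day(timestep_fs: float, n_steps: int, wall_seconds: float) -> float:
    """Throughput in ns/day, given the wall-clock time the run took."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return simulated_ns(timestep_fs, n_steps) * 86_400 / wall_seconds


# Hypothetical run: 10,000 steps of 2 fs that finished in 120 s of wall time.
print(f"{simulated_ns(2, 10_000):.3f} ns simulated")
print(f"{ns_per_day(2, 10_000, 120.0):.1f} ns/day")
```

A run of 10,000 steps at 2 fs covers only 0.02 ns of simulated time, which is why production simulations that target microsecond timescales run for hours or days.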
You can reduce time to results by parallelizing the simulation across multiple compute instances. This method requires special high-performance interconnects that keep inter-process communication overhead low so that the simulation scales out linearly. Alternatively, to get results faster when trying to arrive at an average understanding of the system across multiple parameters, we run hundreds of copies of the simulation in parallel. This ensemble method relies mainly on throughput: how many simulations a high performance computing (HPC) system is able to complete in a given timeframe. This blog post provides details on how different types of compute instance configurations perform on a given problem. We will also make architectural recommendations for different types of simulations, based on the performance and price characteristics of the different instances tested.
This blog post presents a study of the performance and price of running GROMACS, a popular open-source MD application, across a variety of Amazon Elastic Compute Cloud (Amazon EC2) instance types. We also assess how other service components, such as high-performance networking, aid the overall performance of the system. We detail performance comparisons between various EC2 instance types to arrive at optimal configurations for single-instance and multi-instance HPC systems. This post focuses only on CPU-based EC2 instance types. It does not cover the effects of leveraging GPU-based instances, which we will cover in a future blog post.
After reading this post you will have a more informed opinion about the performance of GROMACS on AWS, and will be able to narrow down configuration choices to match your requirements.
GROMACS is an MD package designed for simulations of proteins, lipids, and nucleic acids. It was originally developed in the Biophysical Chemistry department of the University of Groningen, and is now maintained by contributors in universities and research centers worldwide. GROMACS can be run on CPUs and GPUs in single-node and multi-node (cluster) configurations. It is free, open-source software released under the GNU General Public License (GPL), and starting with version 4.6, the GNU Lesser General Public License (LGPL). To learn more about GROMACS and download it, visit the project website at https://www.gromacs.org/.
We compiled GROMACS from source using the Intel 2020 compiler to take advantage of single instruction, multiple data (SIMD) AVX2 and AVX512 optimizations, and the Intel MKL FFT library. The MKL FFT library automatically uses code paths that are optimized for CPU architectures. All tests were run on top of the Amazon Linux 2 operating system.
|Compilers||Intel 2020, gcc 7.4, gcc 10.2 (for ARM)|
|FFT libraries||Intel MKL|
You can see the specific compiler flags used with the Intel compiler in the online workshop we reference at the end of this blog post.
For the purposes of the analyses, we selected three benchmarks of varying atom counts from the Max Planck Institute for Biophysical Chemistry, which represent three different classes of input sizes: small, medium, and large systems.
|Benchmark and download link||System size||Description||Number of atoms||Time step resolution in femtoseconds||Total time steps|
|benchMEM||Small||Protein in membrane surrounded by water||82K||2 fs||10K|
|benchRIB||Medium||Ribosome in water||2M||4 fs||10K|
|benchPEP||Large||Peptides in water||12M||2 fs||10K|
The first benchmark will be used only to exhibit single-node intra-instance scalability across cores. The two other medium and large cases will be used for single-instance and multi-instance benchmarks and will also be used to study scalability of the application to get to a faster time-to-result for large systems.
The performance analyses included single-instance, multi-core, and multi-instance benchmarking to compare and contrast Amazon EC2 instance types that were compute-optimized, memory-optimized, and core-frequency optimized.
We also included instances with Intel and AMD processors and the latest AWS Graviton2 64-bit Arm processors, equipped with 100 Gbps networking with Elastic Fabric Adapter (EFA).
Application performance scalability tests were carried out across clusters of the previously mentioned instance types equipped with 100 Gbps EFA performance networking support.
Figure 1 shows the compute configurations of the instances benchmarked. For ease of comparison across the six instance types (c5n.18xlarge, m5n.24xlarge, m5zn.12xlarge, c5.24xlarge, c5a.24xlarge, c6gn.16xlarge), the figure groups them by frequency, core count, total memory capacity, memory per core, number of memory channels, and network bandwidth.
Storage I/O did not appear to have a bearing on performance, so all our tests were carried out with shared (NFS) Amazon Elastic Block Store (Amazon EBS) volumes for cluster runs, and the root EBS volume for single-instance runs.
Interconnect performance, on the other hand, does affect application performance beyond two to four nodes for the medium and large workloads, indicating that the application is communication bound when run across multiple instances. Interconnect performance is covered later in this post.
Single node performance
When running GROMACS, there are two scenarios where it makes sense to leverage a single instance with compute-intensive capabilities. The first is when you are running GROMACS on smaller systems that fit within the memory and CPU core count bounds of a single instance. The second is when you are running ensemble job simulations that use hundreds of independent instances all running the same simulation with slightly different workload configuration parameters.
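For the ensemble scenario, the quantity that matters is how many independent simulations finish in a given window. The following Python sketch shows that calculation under simple assumptions of our own: each instance runs one simulation at a time, every simulation takes the same wall time, and the numbers are hypothetical.

```python
import math


def ensemble_throughput(n_instances: int,
                        hours_per_simulation: float,
                        window_hours: float) -> int:
    """Number of independent simulations completed within the window,
    assuming each instance runs one simulation at a time, back to back."""
    if hours_per_simulation <= 0:
        raise ValueError("hours_per_simulation must be positive")
    per_instance = math.floor(window_hours / hours_per_simulation)
    return n_instances * per_instance


# Hypothetical ensemble: 100 instances, 6 hours per simulation, 24-hour window.
print(ensemble_throughput(100, 6.0, 24.0))
```

Because the simulations do not communicate with each other, throughput in this model grows linearly with the number of instances, and the interconnect plays no role.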
We start by examining multi-core scaling and CPU core frequency effects on GROMACS performance. The following figures show the multi-core scaling for the small benchmark (benchMEM) and CPU frequency sensitivity to performance comparing c5n and m5zn instances where the m5zn instance has 25% higher core frequency.
We observe that multi-core scalability is linear with increasing core count, and that the higher CPU core frequency provides an 11% improvement in performance. We obtained the frequency sensitivity results by running multiple copies of benchMEM on both instances so that all cores were occupied, which ensured that the system was at full load and avoided the effects of frequency throttling. We then noted the maximum value attained for ns/day. GROMACS is frequency sensitive, but not to a great extent. Greater gains may be seen when running GROMACS across multiple instances with higher-frequency cores, and we cover multiple-instance performance later in the post. We recommend that you compare the price-to-performance ratio before choosing instances with higher CPU core frequency, as instances with a higher number of lower-frequency cores may provide better total performance.
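One way to read the two percentages above is as a sensitivity ratio: the fraction of a clock-speed gain that shows up as a performance gain. The short Python sketch below computes it from the 25% frequency difference and 11% performance difference reported in this post. The function name is our own and the ratio is an illustrative simplification, not a formal model.

```python
def frequency_sensitivity(freq_gain: float, perf_gain: float) -> float:
    """Fraction of a relative clock-speed gain that appears as performance gain.

    Both arguments are relative gains, for example 0.25 for 25%.
    """
    if freq_gain <= 0:
        raise ValueError("freq_gain must be positive")
    return perf_gain / freq_gain


# Values reported above: m5zn has 25% higher core frequency than c5n
# and delivers 11% higher performance on benchMEM.
sensitivity = frequency_sensitivity(0.25, 0.11)
print(f"{sensitivity:.2f}")
```

A ratio well below 1.0 is consistent with the observation that GROMACS is frequency sensitive, but not to a great extent.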
To arrive at the single-instance configuration that provides the fastest time to result and best price-performance ratio, we compared the small and medium benchmark simulation results across the six instance types. The following figure shows the results.
GROMACS takes advantage of simultaneous multithreading (SMT), as the comparison across instances shows. The application also scales as the per-instance core count increases: the higher the core count, the more ns simulated per day (faster time-to-result). In general, we found that using SMT increases performance by 10%. The c5.24xlarge instance, based on Intel Cascade Lake, provides the fastest time-to-result for single-instance CPU runs among all instances benchmarked because it has the highest number of cores and memory channels (48 cores, 12 channels).
We will review the price-performance aspects of single-node runs in the price-performance analyses section later in this post.
Multi-node scaling with larger benchmarks
To find the instance types that provide the best performance and the best price-performance, we scaled the medium-size benchmark (benchRIB) from 2 to 16 instances, and the larger benchmark (benchPEP) from 2 to 64 instances, across clusters of the six instance types. We also scaled both benchmarks to a maximum instance count of 128 instances (4608 cores) for the c5n instance type to check the effect of communication overhead as the simulation scaled. For this test, all benchmark runs were done with SMT disabled. We tested various combinations of the number of MPI ranks and OpenMP threads for each node count to get the best performance across instance types. For both workloads, changing the number of OpenMP threads per MPI rank provides better performance at high node counts (beyond 64 nodes) but not at lower node counts.
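The rank and thread combinations for such a sweep can be enumerated programmatically. The Python sketch below lists every split of a node's physical cores into MPI ranks and OpenMP threads, and builds an illustrative launch line for each. The `mpirun` launcher, the `gmx_mpi` binary name, the `benchPEP.tpr` input file name, and the 36 cores per node are assumptions for illustration; adapt them to your own build and scheduler.

```python
from typing import List, Tuple


def rank_thread_layouts(cores_per_node: int) -> List[Tuple[int, int]]:
    """Return every (MPI ranks per node, OpenMP threads per rank) pair
    whose product equals the number of physical cores on the node."""
    if cores_per_node < 1:
        raise ValueError("cores_per_node must be at least 1")
    return [
        (cores_per_node // threads, threads)
        for threads in range(1, cores_per_node + 1)
        if cores_per_node % threads == 0
    ]


def mdrun_command(nodes: int, ranks_per_node: int, threads: int, tpr: str) -> str:
    """Build an illustrative launch line (launcher and binary names are assumed)."""
    total_ranks = nodes * ranks_per_node
    return f"mpirun -np {total_ranks} gmx_mpi mdrun -ntomp {threads} -s {tpr}"


# Assumed: 36 physical cores per node with SMT disabled, 16 nodes.
for ranks, threads in rank_thread_layouts(36):
    print(mdrun_command(16, ranks, threads, "benchPEP.tpr"))
```

Each printed line corresponds to one candidate configuration. Timing all of them at every node count is what reveals whether a different thread count per rank pays off at scale.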
We used AWS ParallelCluster to set up the HPC cluster with multiple queues of each of the six instance types with SLURM as the job scheduler. The cluster also integrates with Amazon S3 object storage where all benchmark data and results were stored.
The following figure shows the architecture of the HPC cluster when deployed. If you would like to get started with AWS ParallelCluster, we have an online workshop that helps you build the cluster with easy to follow step-by-step illustrations.
Let us first observe the effect of inter-process communication on GROMACS by examining the profiles for the medium and large workloads considered in this post. The following figure illustrates the benchPEP benchmark scaled over 144 instances (5184 cores) and the benchRIB benchmark scaled to 16 instances on the same instance type (c5n), with EFA turned off and on. EFA is a network interface for Amazon EC2 instances that supports applications requiring high levels of inter-node communication at scale. You can find out more about EFA at https://aws.amazon.com/hpc/efa/.
The results show that EFA increases the scalability of the application and allows for a faster time to result compared to the same simulation with EFA turned off. Instances that support EFA reduce the runtime of scaled-out simulations for large systems with millions of atoms, or for simulations that target longer timescales. Leveraging EFA resulted in a ~5.4X runtime improvement at peak performance (144 instances in the case of benchPEP). Performance without EFA starts to plateau beyond eight nodes for benchPEP, and after only two nodes for benchRIB, so we recommend choosing an instance with EFA to save both time and cost.
Now that we are aware of the performance EFA provides, let’s look at the scalability of the application across various instance types.
We have run the medium and large workloads across instances that are compute and memory-optimized, in addition to specific instances with high core frequency – and the AWS Graviton2 Arm-based family. All these instances support EFA.
As expected, using EFA-based networking resulted in strong scaling for both the benchmarks. The performance increases are close to linear across all the instance types as core count increases.
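To quantify how close to linear a scaling curve is, you can compute speedup and parallel efficiency relative to the smallest run. The following Python sketch does this. The ns/day values in it are hypothetical placeholders, not results from our benchmarks, and the function name is our own.

```python
from typing import Dict, List, Tuple


def scaling_table(results: Dict[int, float]) -> List[Tuple[int, float, float]]:
    """Return (nodes, speedup, efficiency) rows relative to the smallest node count.

    results maps node count to measured performance in ns/day.
    An efficiency of 1.0 means perfectly linear scaling.
    """
    base_nodes = min(results)
    base_perf = results[base_nodes]
    rows = []
    for nodes in sorted(results):
        speedup = results[nodes] / base_perf
        efficiency = speedup / (nodes / base_nodes)
        rows.append((nodes, speedup, efficiency))
    return rows


# Hypothetical ns/day measurements at 2, 4, 8, and 16 nodes.
measured = {2: 1.0, 4: 1.9, 8: 3.6, 16: 6.4}
for nodes, speedup, efficiency in scaling_table(measured):
    print(f"{nodes:>3} nodes  speedup {speedup:.2f}  efficiency {efficiency:.0%}")
```

Tracking efficiency alongside ns/day makes it easier to spot the node count at which adding instances stops paying for itself.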
The results show that the m5n and c5n instances provide the highest scalable performance across both medium and large workloads. Note that while m5n has 33% more cores per instance than c5n (48 versus 36), the c5n has a higher per-core frequency than m5n. The data shows that the performance of the m5n instance is closely followed by c5n. As we saw in the single-node analysis, GROMACS is frequency sensitive and the m5n shows better performance at lower instance node counts. The gap between c5n and m5n becomes smaller as instance counts grow. This is due to the effect of higher communications overhead for a larger number of m5n nodes needed to reach the same overall core count as the c5n instances, which have more cores per instance. However, the performance curves of the benchPEP data show that scalability starts diverging again at around 48 nodes (2304 cores for m5n). Instances with higher core counts do not always result in better price and performance. We recommend running a benchmarking study on your specific workloads to determine the optimal price-performance and scalability as a function of node count.
Also note the scalability of the workloads on c6gn instances. C6gn delivers up to 40% better price-performance over C5n, and is built on AWS Graviton2 processors, which are custom built by AWS using 64-bit Arm Neoverse cores. GROMACS exhibits strong scaling on c6gn-based instances for the larger benchPEP workload due to 100 Gbps EFA networking, which keeps the MPI communication overhead low. We show later in this post that c6gn is the best instance as far as price-performance is concerned.
When working in the cloud, customers can spin up customized clusters (or customized queues on existing clusters). These can optimize for different workloads, different sizes of workload, or even different levels of urgency. This leaves an opportunity for optimizing costs for each common use-case. It is always worth measuring price when performance testing on AWS.
Single node cost analysis
As we saw earlier, the c5 Intel Cascade Lake based instances provide the fastest time-to-result for single-instance CPU runs compared to all other instances benchmarked. This holds for price-performance too: the c5 shows the lowest cost per ns simulated for the medium-size benchmark (benchRIB) across all instance types. This is because the latest-generation Intel Cascade Lake processors featured in the c5.24xlarge instance have a high core count (48 physical cores) and a high number of memory channels per instance (12 channels).
Multi node cost analysis
Examining the price-performance of scaling the benchRIB benchmark, we see that the AWS Graviton2 c6gn instance stands out as the clear choice. The following figures show the cost/ns for benchRIB scaled across four instance types, in addition to the cost/ns and ns/day at a node count of eight for comparison.
The c6gn instance is nearly 18% lower in terms of cost/ns (at eight nodes) compared to the c5n instance type, which is the next best in price-performance.
You can also see that the m5n narrowly beats c5n in absolute performance (it has 33% more cores per instance, 48 versus 36). However, the c5n is a better choice from a price-performance point of view because it is nearly 26% lower in terms of cost/ns.
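The cost/ns figures in this section follow from simple arithmetic on the hourly instance price, the node count, and the measured ns/day. The Python sketch below shows one way to compute them and to compare two instance types. The prices and ns/day values are hypothetical placeholders (check current Amazon EC2 pricing for your Region), and the function names are our own.

```python
def cost_per_ns(hourly_price_usd: float, nodes: int, ns_per_day: float) -> float:
    """Cost in USD to simulate one nanosecond on a cluster of identical nodes."""
    if ns_per_day <= 0:
        raise ValueError("ns_per_day must be positive")
    cluster_cost_per_day = hourly_price_usd * nodes * 24.0
    return cluster_cost_per_day / ns_per_day


def relative_saving(baseline: float, candidate: float) -> float:
    """Fraction by which the candidate cost is lower than the baseline cost."""
    return (baseline - candidate) / baseline


# Hypothetical inputs: two instance types at eight nodes each.
baseline = cost_per_ns(hourly_price_usd=4.00, nodes=8, ns_per_day=16.0)
candidate = cost_per_ns(hourly_price_usd=3.00, nodes=8, ns_per_day=14.4)
print(f"baseline  {baseline:.2f} USD/ns")
print(f"candidate {candidate:.2f} USD/ns")
print(f"saving    {relative_saving(baseline, candidate):.1%}")
```

In this made-up example the candidate is slower in absolute terms but cheaper per nanosecond, which is the same kind of trade-off described above between absolute performance and price-performance.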
From the various benchmark runs we evaluated, GROMACS exhibits strong scaling with a high-performance interconnect fabric. We therefore recommend always choosing an instance with EFA for simulation jobs that are scaled out to get a faster time-to-result.
For the fastest time-to-result for single-instance simulations, in addition to ensemble jobs, we recommend a compute-intensive instance, such as one with high core counts, like the c5.24xlarge instance. C5.24xlarge also has the best price-performance for single-instance simulation jobs. For large simulations that would benefit from more memory per core, consider using m5 and r5 high core count instance variants.
For simulations that can be scaled across multiple nodes, with the fastest time-to-result and medium price-performance, the c5n.18xlarge instance with EFA is best. If you need to reduce your instance count and increase memory per core, the m5 and r5 variants are a good choice.
The C6gn AWS Graviton2 ARM instances provide the best price-performance for large scale-out simulations.
GROMACS does not have a very high storage I/O throughput requirement, and there was no requirement to incorporate a high-performance parallel file system like Amazon FSx for Lustre in our tests. If you encounter performance degradation from the NFS shared EBS volume in your environment, consider adding FSx for Lustre into your production architecture.
To get started running GROMACS on AWS, we recommend you check out our GROMACS on AWS ParallelCluster workshop that will show you how to quickly build a traditional Beowulf cluster with GROMACS installed. You can also try our workshop on AWS Batch to run GROMACS as a containerized workload. To learn more about HPC on AWS, visit https://aws.amazon.com/hpc/.