Running GROMACS on GPU instances: single-node price-performance

This three-part series of posts covers the price-performance characteristics of running GROMACS on Amazon Elastic Compute Cloud (Amazon EC2) GPU instances. Part 1 covered some background on GROMACS and how it utilizes GPUs for acceleration. This post (Part 2) covers the price-performance of GROMACS on a particular GPU instance family running on a single instance. In our next post (Part 3), we will cover the price-performance of GROMACS running across a set of GPU instances, with and without the high-speed networking capabilities of Elastic Fabric Adapter (EFA).

Price performance analysis on a single GPU-enabled instance

For our analysis, we chose three benchmarks of varying atom counts from the Max Planck Institute for Biophysical Chemistry. They represent three classes of input size: small, medium, and large systems. Table 1 provides details and download links for the three workloads. The variety of workload sizes helps us understand the relationship between parallel efficiency and price-performance for these workloads, and provides some guidance on instance selection when you compare against other workloads of similar size.

The results presented in this post and the next post (Part 3) were obtained on AWS compute instances only, and may vary with other system configurations you may have on premises or with other workloads. Computational complexity varies across workloads, even for workloads of the same or similar size, and results in different performance profiles depending on the system configuration. Further, to establish a baseline for optimization, the experiments reported here use the out-of-the-box settings for compilers, libraries, and the overall software stack. You may be able to attain better performance through finer-grained tuning of the software stack, such as library parameters, compiler build options, and other performance-oriented options in the application build process.

**Table 1:** GROMACS workload details.
| Benchmark and download link | System size | Description | Number of atoms | Time step resolution (fs) | Total time steps |
|---|---|---|---|---|---|
| benchMEM | Small | Aquaporin tetramer in lipid membrane surrounded by water | 82K | 2 | 10K |
| benchRIB | Medium | Ribosome in water | 2M | 4 | 10K |
| benchPEP | Large | Peptides in water | 12M | 2 | 10K |
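To make the table concrete, the short Python sketch below works out how much simulated time each benchmark covers (time step multiplied by step count) and how a throughput figure in ns/day translates into wall-clock time. The step counts and time steps come from Table 1; the dictionary layout, the function names, and the 100 ns/day rate in the usage note are illustrative assumptions, not measurements from this study.

```python
# Simulated time covered by each benchmark: steps x time step.
# Step counts and time steps are from Table 1; the layout is illustrative.
BENCHMARKS = {
    "benchMEM": {"steps": 10_000, "dt_fs": 2},
    "benchRIB": {"steps": 10_000, "dt_fs": 4},
    "benchPEP": {"steps": 10_000, "dt_fs": 2},
}


def simulated_ps(steps: int, dt_fs: float) -> float:
    """Simulated time in picoseconds (1 ps = 1000 fs)."""
    return steps * dt_fs / 1000.0


def wall_seconds(sim_ps: float, ns_per_day: float) -> float:
    """Wall-clock seconds needed to simulate sim_ps at a given ns/day rate."""
    sim_ns = sim_ps / 1000.0          # 1 ns = 1000 ps
    return sim_ns / ns_per_day * 86_400


for name, bench in BENCHMARKS.items():
    ps = simulated_ps(bench["steps"], bench["dt_fs"])
    print(f"{name}: {ps:.0f} ps simulated")
```

For example, at a hypothetical rate of 100 ns/day, `wall_seconds(20.0, 100.0)` gives roughly 17 seconds for the 20 ps covered by benchMEM, which is why short benchmark runs like these are practical to repeat across many instance types.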

Our analysis in this post includes single-instance single-GPU and single-instance multi-GPU benchmarks to compare Amazon EC2 GPU-accelerated instance types. Note that all the GPU models considered are NVIDIA GPUs. The graph and the table in Figure 1 show details on the instances and configurations considered.

Figure 1: GPU instances under consideration for this study, with their GPU and CPU architectures and generational specifications.

Storage I/O usually isn’t a gating factor for GROMACS performance, so all our tests were carried out with shared (NFS) Amazon Elastic Block Store (Amazon EBS) volumes for cluster runs, and the root EBS volume for single-instance runs. GROMACS simulations can take advantage of high-performance interconnects when scaling out; we cover interconnect performance in our next post (Part 3). All binaries were built using Intel OneAPI and NVIDIA CUDA compilation tools. For the FFTs used in calculating the long-range component of the non-bonded interactions, we used the FFT library from Intel MKL. For the single-node runs we used a binary built with the GROMACS thread-MPI library. For scaling runs we used a binary compiled against Intel MPI, which we examine in the next post on multi-node scaling. Table 2 lists the details of the software stack used.

**Table 2:** Software stack used for benchmarking.
| Software | Version details |
|---|---|
| GROMACS | 2020.2 |
| Operating system | Amazon Linux 2 |
| Compilers | NVIDIA CUDA compilation tools V11.0.194, Intel OneAPI (icc) 2021.3.0, gcc 7.3.1 |
| FFT libraries | Intel OneAPI MKL 2021.0.3 |
| MPI | Intel OneAPI MPI 2021.3 |
| CUDA version | 11.0.207 |

Single-node performance and price performance analyses

Let’s start by examining single- and multi-GPU scaling where all the GPUs are in a single node. Figure 2 shows both the performance and the performance-to-price ratios measured across various instance families and sizes and several generations of GPUs. We tested the small, medium, and large workloads to uncover single-node performance and multi-GPU scalability across workload types. Please note that these numbers and performance ratios may vary for problems that are different but of similar size.

We will first compare the performance and price-performance for the three benchmarks across all instances under consideration.

Figure 2: Performance (left axis, blue) and performance-to-price ratio (right axis, orange) across the small, medium, and large benchmarks.

The graphs in Figure 2 show the best performance for a GROMACS run we could obtain for the optimal combination of MPI ranks x OMP (OpenMP) threads with dynamic load balancing (DLB) turned on by default. In multiple cases we saw that a unique combination of MPI x OMP provides slightly better performance compared to the default behavior of thread-MPI that starts as many MPI ranks as the number of cores.
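One way to search for such a combination is to enumerate every ranks x threads split that uses all hardware threads on the instance and run a short benchmark for each. The Python sketch below generates the candidate splits and the matching `gmx mdrun` command lines using the `-ntmpi`, `-ntomp`, and `-dlb` options. The helper names, the input file name `benchMEM.tpr`, and the choice of `-dlb auto` are assumptions for illustration; this is not the exact procedure used to produce Figure 2.

```python
def rank_thread_combos(total_threads: int) -> list[tuple[int, int]]:
    """Return every (ranks, threads_per_rank) pair whose product is total_threads."""
    return [
        (ranks, total_threads // ranks)
        for ranks in range(1, total_threads + 1)
        if total_threads % ranks == 0
    ]


def mdrun_command(tpr: str, ranks: int, threads: int, dlb: str = "auto") -> list[str]:
    """Build a thread-MPI mdrun command line for one ranks x threads split.

    tpr is a hypothetical input file name; dlb accepts auto, yes, or no.
    """
    return [
        "gmx", "mdrun",
        "-s", tpr,
        "-ntmpi", str(ranks),     # number of thread-MPI ranks
        "-ntomp", str(threads),   # OpenMP threads per rank
        "-dlb", dlb,              # dynamic load balancing mode
    ]


# Example: candidate splits for an instance with 8 hardware threads.
for ranks, threads in rank_thread_combos(8):
    print(" ".join(mdrun_command("benchMEM.tpr", ranks, threads)))
```

Each printed command can then be timed, and the split with the highest ns/day kept. On multi-GPU instances, not every split may map cleanly onto the available GPUs, so some candidates may need to be filtered out.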

The results show that the best performance comes from the P4, P3, and G4 metal instances that have 8 GPUs, all of which achieve similar performance. If extreme performance is the only consideration (i.e., price is secondary), the g4dn.metal instance would be our pick: its performance is on par with the P3 and P4 instances, and its performance-to-price ratio is about 4x better. For the best performance-to-price ratio, the g4dn.xlarge instance is an obvious choice: it has the highest performance-to-price ratio across all the GPU instances we studied, and its ratio is 3.2x to 4x higher than that of CPU instances (compared against c5.24xlarge).
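The performance-to-price ratio used in this comparison can be read as throughput per unit of hourly cost. A minimal Python sketch of that arithmetic follows, assuming performance is measured in ns/day and price in USD per instance-hour. The function names and the numbers in the example are placeholders, not measured results or actual AWS prices.

```python
def perf_per_price(ns_per_day: float, usd_per_hour: float) -> float:
    """Performance-to-price ratio: ns/day obtained per USD/hour spent."""
    return ns_per_day / usd_per_hour


def cost_per_ns(ns_per_day: float, usd_per_hour: float) -> float:
    """USD needed to simulate one nanosecond at the given rate and price."""
    return usd_per_hour * 24.0 / ns_per_day


# Placeholder numbers for two hypothetical instances (not measured data).
small = {"ns_per_day": 100.0, "usd_per_hour": 0.5}
large = {"ns_per_day": 400.0, "usd_per_hour": 8.0}

for label, inst in (("small", small), ("large", large)):
    ratio = perf_per_price(inst["ns_per_day"], inst["usd_per_hour"])
    cost = cost_per_ns(inst["ns_per_day"], inst["usd_per_hour"])
    print(f"{label}: {ratio:.1f} ns/day per USD/hour, {cost:.2f} USD per ns")
```

In this made-up example, the larger instance is four times faster but costs four times more per simulated nanosecond. That is the kind of trade-off between raw performance and performance-to-price described above.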

We benchmarked the whole array of GPUs available on AWS because GROMACS is quite sensitive to the CPU-to-GPU ratio on a given instance configuration. This is the kind of analysis that is easily realized on the cloud, since the logistics of creating a test bed this extensive would probably be much harder than running the tests themselves.

While GROMACS is still capable of scaling up performance on multi-GPU instances across all three problem sizes, we saw higher efficiencies for larger workloads when scaling up with higher GPU-count instances.
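Scaling efficiency in this sense can be quantified as the measured speedup divided by the ideal speedup for the number of GPUs used. The Python sketch below shows that calculation. The function name and the throughput values are hypothetical, chosen only to illustrate how a larger workload might retain more efficiency than a smaller one.

```python
def parallel_efficiency(perf_one_gpu: float, perf_n_gpus: float, n_gpus: int) -> float:
    """Fraction of ideal linear speedup achieved when going from 1 to n_gpus GPUs."""
    speedup = perf_n_gpus / perf_one_gpu
    return speedup / n_gpus


# Hypothetical ns/day values for a small and a large workload on 1 and 8 GPUs.
small_eff = parallel_efficiency(perf_one_gpu=100.0, perf_n_gpus=200.0, n_gpus=8)
large_eff = parallel_efficiency(perf_one_gpu=1.0, perf_n_gpus=6.0, n_gpus=8)

print(f"small workload efficiency: {small_eff:.2f}")
print(f"large workload efficiency: {large_eff:.2f}")
```

An efficiency close to 1.0 means the extra GPUs are being used well, while a low value suggests that a smaller instance may offer better price-performance for that workload.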

Figure 3: Normalized performance (p2.xlarge as baseline) comparison and performance to price (taking P4dn.24xlarge as baseline) comparison across all three benchmarks.

Figure 3 demonstrates that the CPU-to-GPU ratio has a large impact on performance. This is mainly because the particle-mesh Ewald (PME) part of the calculation, which runs mostly on CPUs and uses 3D FFTs, performs better when scaled over a higher number of CPU cores. You can see how performance depends on the number of CPUs in Figure 4, which compares the single-GPU G4dn instances: the GPU count stays constant at one while the number of CPU cores doubles with each larger instance size.

Figure 4: Performance scaling as a function of CPU core count while the number of GPUs remains constant.

Take a closer look at Figure 4 and, in particular, the ns/day scaling curves for the three benchmarks from g4dn.xlarge to g4dn.16xlarge. The number of GPUs is constant (1) while the core count doubles at each step. Performance increases progressively and scales better for the larger benchmarks than for the smaller ones. This indicates that for workloads of around a million atoms or more, we are better off using GPU instances with higher CPU-core-to-GPU ratios to get better performance and price-performance for simulation runs.
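To see how this ratio varies within one family, the Python sketch below tabulates the vCPU-to-GPU ratio for the G4dn instance sizes. The vCPU and GPU counts are taken from the published G4dn specifications rather than from this post. Note that they are vCPU counts, which may differ from the physical core counts plotted in Figure 4. The dictionary and function names are illustrative.

```python
# (vCPUs, GPUs) per G4dn size, from published instance specifications.
G4DN = {
    "g4dn.xlarge": (4, 1),
    "g4dn.2xlarge": (8, 1),
    "g4dn.4xlarge": (16, 1),
    "g4dn.8xlarge": (32, 1),
    "g4dn.16xlarge": (64, 1),
    "g4dn.12xlarge": (48, 4),
    "g4dn.metal": (96, 8),
}


def vcpus_per_gpu(instance: str) -> float:
    """vCPU-to-GPU ratio for one instance size."""
    vcpus, gpus = G4DN[instance]
    return vcpus / gpus


# List sizes from the highest to the lowest vCPU-to-GPU ratio.
for name in sorted(G4DN, key=vcpus_per_gpu, reverse=True):
    print(f"{name:14s} {vcpus_per_gpu(name):5.1f} vCPUs per GPU")
```

The single-GPU sizes span ratios from 4 to 64 vCPUs per GPU, while the multi-GPU sizes sit at 12, which is one reason the family is a convenient test bed for separating CPU effects from GPU effects.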

Conclusion

GROMACS is able to make effective use of all the hardware (CPUs and GPUs) available on a compute node to maximize simulation performance. It is equipped with a dynamic load balancing (DLB) algorithm that helps balance the load for a given domain decomposition and combination of MPI ranks and threads. Some takeaways from this study of single-node GPU performance:

For single-node GPU instances, our tests have shown that the GROMACS thread-MPI library coupled with the DLB algorithm is good enough to reach the best performance, and is better than using an external MPI library.
Certain combinations of MPI ranks and OpenMP threads, specified explicitly, can produce better results than the default behavior of MPI ranks = number of cores.
For single-node simulation runs, if extreme performance is the only consideration (i.e., price is secondary), the g4dn.metal instance would be our recommendation.
For the best performance-to-price ratio, the g4dn.xlarge instance is better suited, as it exhibits the highest performance-to-price ratio across all the GPU instances and CPU instances we compared.
The CPU-to-GPU ratio of an instance has a significant impact on performance. Within a given GPU instance family, if performance is the main consideration, we recommend selecting the instance with the highest CPU-to-GPU ratio.

In Part 3, the last in this series, we will review the price-performance when GROMACS is scaled over multiple GPU-enabled instances.

AWS HPC Blog
