AWS HPC Blog

GROMACS performance on Amazon EC2 with Intel Ice Lake processors

We recently launched two new Amazon EC2 instance families based on Intel’s Ice Lake – the C6i and M6i. These instances provide higher core counts and take advantage of generational performance improvements in Intel’s Xeon Scalable processor family.

In this post we show how GROMACS performs on these new instance families. We use the same methodology as in our previous posts, where we characterized price-performance for CPU-only and GPU instances (Part 1, Part 2, Part 3) and provided instance recommendations for different workload sizes.

Today we’ll compare the performance of C6i with its predecessors, the c5.24xlarge and c5n.18xlarge, which were our top picks in previous posts for single-node and scale-out simulation runs. We also include a quick comparison against GPU instances. Because we’ll only be using the largest sizes in each of these families, we’ll just refer to the instance family name throughout this post, for brevity.

Our setup

The c6i.32xlarge (C6i from now on) is the instance we used for all the analyses described in this post. It comes with 64 physical cores, 256 GiB of RAM and carries a 50 Gbps network interface with the Elastic Fabric Adapter (EFA).

We’ll use the same benchmark cases from the Max Planck Institute for Biophysical Chemistry that were used in earlier posts. These cases represent three classes of input size: small (benchMEM, 82k atoms), medium (benchRIB, 2M atoms), and large (benchPEP, 12M atoms) systems.

To enable the single-node and multi-node scaling runs, we used AWS ParallelCluster to set up the HPC cluster with Slurm as our job scheduler. We compiled GROMACS from source using the Intel 2020 compiler to take advantage of single-instruction-multiple-data (SIMD) AVX2 and AVX-512 optimizations, and the Intel MKL FFT library. You can see the specific compiler flags used with the Intel compiler in the online workshop we reference at the end of this post.
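As a rough sketch of the build step (the GROMACS version, compiler setup, and install location here are assumptions for illustration – the workshop linked at the end of this post has the exact flags we used), a source build along these lines enables the AVX-512 SIMD kernels and MKL FFTs:

```shell
# Assumption: Intel compilers (icc/icpc) are already on PATH, e.g. via the cluster's module system.
# The GROMACS version below is illustrative, not necessarily the one used in our benchmarks.
wget https://ftp.gromacs.org/gromacs/gromacs-2021.4.tar.gz
tar xf gromacs-2021.4.tar.gz && cd gromacs-2021.4
mkdir build && cd build

# GMX_SIMD=AVX_512 targets Ice Lake's AVX-512 units;
# GMX_FFT_LIBRARY=mkl selects Intel MKL for the FFTs;
# GMX_MPI=ON builds the MPI-enabled gmx_mpi binary for multi-node runs.
cmake .. -DCMAKE_C_COMPILER=icc -DCMAKE_CXX_COMPILER=icpc \
         -DGMX_MPI=ON -DGMX_SIMD=AVX_512 -DGMX_FFT_LIBRARY=mkl

make -j "$(nproc)" && sudo make install
```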

Single-node performance and price-performance

Our earlier post showed C5 as the best choice for single-node simulations on a CPU-only instance – it led in both performance and price-performance across the three workloads. That changes here, with C6i now the better instance. This is mainly due to its higher core count, which provides more parallelism per instance.

Absolute performance

Figure 1 shows the comparison of C5 and C6i in terms of performance (ns/day) across the three workloads. Notice that for the smaller workload (benchMEM), the performance on C5 is still comparable to C6i, but as the workload size increases, C6i takes the lead. This is because the smaller workloads just can’t use all the parallelism the instance provides. Still, for a comparable core-count variant of C6i, the performance is usually similar to, or slightly better than, C5 for these smaller workloads.

Figure 1 Single-instance simulation comparison for small, medium and large benchmarks across c5.24xlarge and c6i.32xlarge

Price-performance

Figure 2 shows similar results for price-performance. Again, because the smaller workloads can’t use all the parallelism C6i provides, price-performance is slightly better on C5 – but picking a smaller C6i variant with a comparable number of cores may be more cost-effective for these smaller cases.

Figure 2 Cost per nanosecond simulated for small, medium and large benchmarks across c5.24xlarge and c6i.32xlarge instances.

Multi-node performance and price-performance

C6i is equipped with the Elastic Fabric Adapter (EFA) which is a network interface for Amazon EC2 instances for applications requiring high levels of inter-node communications at scale. Figure 3 shows the results of scale-out simulations with, and without, EFA enabled as we increase the number of instances.

Figure 3 Comparative scalability for the large workload (benchPEP) with EFA enabled and disabled on the c6i instance types.

Consistent with our earlier posts, you can see the significant performance gains with EFA enabled, with the simulation completing up to 1.5x faster at 64 nodes.
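To reproduce a scale-out run like this, a Slurm batch script along the following lines works on a ParallelCluster setup (the file names, module name, task count, and MPI/EFA environment below are assumptions for illustration, not the exact settings from our runs):

```shell
#!/bin/bash
#SBATCH --job-name=benchPEP
#SBATCH --nodes=64
#SBATCH --ntasks-per-node=128
#SBATCH --exclusive

# Ask libfabric for the EFA provider explicitly
# (assumption: the MPI library was built against libfabric with EFA support).
export FI_PROVIDER=efa

# Assumption: GROMACS is exposed as an environment module on this cluster.
module load gromacs

# -maxh, -resethway and -noconfout are common GROMACS benchmarking flags:
# cap the wall time, reset timers halfway through, and skip writing the final configuration.
srun gmx_mpi mdrun -s benchPEP.tpr -maxh 0.5 -resethway -noconfout
```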

Absolute performance and price-performance

Figure 4 combines the absolute performance (ns/day) of C6i and C5n with two relative comparisons (using C5n as the baseline) of performance (ns/day) and price-performance ($/ns), so we can see them together.

For the scale-out simulation at 64 nodes, you’ll see that the $/ns of C6i is still slightly lower (0.99x of C5n) while its performance is 1.64x that of C5n. This result is due to a combination of generational improvements in the CPU and support for high-performance networking. The C6i features the Intel Ice Lake-SP Xeon processors, which in addition to having more cores per socket also have eight memory channels per socket, compared to six on the previous generation.

Elastic Fabric Adapter (EFA) becomes essential for performance at scale, where the MPI inter-process communication overhead is significantly higher on C6i vs C5n because there are simply more cores (and hence more MPI processes) per instance. For example, in the simulation run at 64 nodes, the MPI rank count is nearly 75% higher for C6i than for C5n (8192 MPI ranks for C6i vs 4680 for C5n). Yet this doesn’t lead to performance degradation on C6i at this high MPI process count, which is quite remarkable!
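The rank arithmetic above can be checked directly from the numbers quoted in this post (8192 ranks on C6i vs 4680 on C5n at 64 nodes):

```shell
# Relative increase in MPI ranks, C6i vs C5n, from the figures quoted above
awk 'BEGIN { printf "C6i runs %.0f%% more MPI ranks than C5n\n", (8192 / 4680 - 1) * 100 }'
# prints: C6i runs 75% more MPI ranks than C5n
```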

Figure 4 Relative Performance (ns/day) and price-performance ($/ns) comparison between c6i and c5n at scale (left axis) and absolute performance (ns/day) at scale of c6i and c5n instances.


Comparison with GPU instances

Comparing with the GPU instances we covered in our earlier posts (Part 1, Part 2, and Part 3), Figure 5 shows C6i alongside two GPU instances, g4dn.xlarge and g4dn.metal. These were our top picks for best price-performance (g4dn.xlarge) and highest absolute performance (g4dn.metal) in single-node and multi-node simulation runs.

For single-node simulations, g4dn.metal and g4dn.xlarge are still better choices than C6i, but there is a change in price-performance, where C6i is now similar to g4dn.metal.

Figure 5 Comparison with GPU instances for single node and multi node GPU comparisons.

Things also change for multi-node simulation runs, where C6i is now the better instance compared to g4dn.metal. Figure 6 shows that for a scale-out simulation run at 16 nodes, C6i is comparable to g4dn.metal on performance but better on price-performance, making it the clear choice between the two when scaling out simulations, especially for large system sizes.

Figure 6 Performance and price-performance comparisons C6i vs G4dn.metal

Conclusion

The C6i instance is the fastest CPU-only instance for time-to-results for both single-node and multi-node simulations. This is because of a combination of the generational improvements in the CPU architecture itself, and our use of Elastic Fabric Adapter (EFA).

The collective MPI overhead at large node counts becomes significant for high core-count instances, and needs to be managed by a high-performance interconnect. EFA demonstrates its capability for these scale-out simulations on C6i.

For large simulations that would benefit from more memory per core, you can use the M6i instance family, which has the same compute characteristics as C6i, but with twice the memory capacity. R6i is yet another step on this ladder, with 2x the memory of M6i.

The g4dn.metal and g4dn.xlarge GPU instances still outperform C6i for single-node simulations. However, for multi-node simulations, C6i is the better instance overall, as it’s comparable to g4dn.metal on absolute performance and better on price-performance.

To get started running GROMACS on AWS powered by Intel Xeon Scalable Processors, we recommend you check out the GROMACS on AWS ParallelCluster workshop where you’ll quickly build a canonical Beowulf cluster with GROMACS installed. You can also use AWS Batch to run GROMACS as a containerized workload.

Finally, we have a two-part HPC Tech Short video series on C6i performance for other applications, like computational fluid dynamics (CFD). Take a look at Part 1 and Part 2 and let us know what you think in the comments!

Austin Cherian

Austin is a Senior Product Manager-Technical for High Performance Computing at AWS. Previously, he was a Senior Developer Advocate for HPC & Batch, based in Singapore. He's responsible for growing AWS ParallelCluster so that customers have a smooth journey deploying their HPC workloads on AWS. Prior to AWS, Austin was the Head of Intel’s HPC & AI business for India, where he led the team that helped customers with a path to High Performance Computing on Intel architectures.