AWS HPC Blog

Best practices for running molecular dynamics simulations on AWS Graviton3E

This post was contributed by Nathaniel Ng, Shun Utsui, and James Chen, Solutions Architects at AWS.

Last year, we announced the general availability of Hpc7g instances, an instance type powered by AWS Graviton3E and purpose-built for HPC workloads. Graviton3E processors deliver up to 35% higher vector-instruction performance compared to Graviton3, which translates into higher performance for HPC applications.

Molecular dynamics (MD) is a domain that frequently leverages HPC resources. Previously, customers ran their MD workloads predominantly on x86 architectures, but we’ve heard that many are interested in understanding the performance they can get on Graviton3E.

So, in this post, we’ll show how you can run MD workloads on Hpc7g instances using AWS ParallelCluster, a supported open-source cluster management tool that allows you to deploy a scalable HPC environment on AWS in a matter of minutes. We’ll use GROMACS and LAMMPS as examples – two very popular MD applications. And we’ll highlight the best practices for the tools – and the compiler flags – we used to achieve optimal performance.

Architecture

Key to the architecture is AWS ParallelCluster. Customers can install the ParallelCluster CLI on their laptops using Python’s pip package manager, and use it to deploy a cluster in the cloud by supplying a short configuration file.
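As a minimal sketch of that flow (the cluster name, key pair, and configuration file name here are placeholders), installing the CLI and deploying a cluster looks like this:

```bash
# Install the AWS ParallelCluster CLI into a Python virtual environment
python3 -m venv ~/pcluster-env
source ~/pcluster-env/bin/activate
pip install --upgrade aws-parallelcluster

# Deploy a cluster from a short YAML configuration file (names are placeholders)
pcluster create-cluster --cluster-name md-hpc7g --cluster-configuration cluster-config.yaml

# Check progress, then connect to the head node once the cluster is up
pcluster describe-cluster --cluster-name md-hpc7g
pcluster ssh --cluster-name md-hpc7g -i ~/.ssh/my-keypair.pem
```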

Through the ParallelCluster UI, you can use a wizard to configure a cluster without editing a configuration file or installing anything on your laptop. Details on how to set up a ParallelCluster environment can be found in this workshop. You can also find a one-click launchable stack in the HPC Recipe Library (on GitHub), which will build an Hpc7g cluster for you after asking a minimum of questions.

Figure 1: Architecture using AWS ParallelCluster to run MD workloads with AWS Graviton 3E instances.

ParallelCluster uses Slurm as its job scheduler to dynamically scale the number of Hpc7g.16xlarge compute instances in response to the job queue.

These instances are powered by custom-built AWS Graviton3E processors. They feature the latest DDR5 memory, offering 50% more bandwidth compared to DDR4, and they carry 200 Gbps network interfaces with the Elastic Fabric Adapter (EFA). Graviton3E processors implement the Scalable Vector Extension (SVE) of the Arm Neoverse V1 architecture, and can hence deliver up to 2x better performance for floating-point codes than Graviton2.

To achieve sufficient storage I/O performance, we deployed a 4.8 TiB Amazon FSx for Lustre file system – a fully-managed, high-performance Lustre file system. We selected the PERSISTENT_2 deployment type with a throughput of 1000 MB/s per TiB of storage provisioned, and backed it with an Amazon Simple Storage Service (Amazon S3) bucket.
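To illustrate how these pieces map onto a ParallelCluster configuration, here’s a sketch of the compute queue and shared storage sections (the subnet ID, names, and instance counts are placeholders; required sections like HeadNode are omitted, and our full configuration is in the GitHub repository):

```bash
# Append the compute queue and shared storage sections to the cluster config.
# HeadNode and other required sections are omitted here for brevity; the subnet
# ID, names, and MaxCount are placeholders.
cat >> cluster-config.yaml << 'EOF'
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      ComputeResources:
        - Name: hpc7g
          InstanceType: hpc7g.16xlarge
          MinCount: 0            # scale to zero when the queue is empty
          MaxCount: 16           # placeholder upper bound
          Efa:
            Enabled: true        # 200 Gbps Elastic Fabric Adapter
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0   # placeholder
        PlacementGroup:
          Enabled: true
SharedStorage:
  - Name: fsx
    StorageType: FsxLustre
    MountDir: /fsx
    FsxLustreSettings:
      StorageCapacity: 4800            # GiB (4.8 TiB)
      DeploymentType: PERSISTENT_2
      PerUnitStorageThroughput: 1000   # MB/s per TiB
      # the S3 bucket is linked via a data repository association after creation
EOF
```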

We’ve documented all the details of our ParallelCluster configuration in our GitHub repository.

Development tooling

We tested several configurations, and we recommend the following compilers and libraries for most use cases. You can check our GitHub repository for the best-practice compiler flags to use when building MD applications.

  • Compiler: Arm compiler for Linux (ACfL) version 23.04 or later
  • Library: Arm performance libraries (ArmPL) version 23.04 or later, included in ACfL
  • MPI: Open MPI version 4.1.5 or later

It’s worth noting that the Arm Compiler for Linux and Arm Performance Libraries are now available to HPC developers at no cost.

In the rest of this post, we’ll explain why we preferred these tools, by diving deeper into the performance of GROMACS and LAMMPS with different compilers, compiler options, and input files.

GROMACS

GROMACS is an open-source software suite for high-performance molecular dynamics and output analysis. It’s widely adopted in computer simulation, not only for biochemical molecules but also for non-biological systems like polymers and fluid dynamics. The GROMACS community has optimized it to make great use of the SIMD capabilities of many modern HPC architectures.

For Arm architectures, the code supports both NEON (ASIMD) and SVE instructions. In this post, we used GNU 12.2 and the Arm compiler for Linux (ACfL) 23.04 as the compiler suites under test, with Arm Performance Libraries (ArmPL) 23.04 as the math library, and Open MPI 4.1.5 linked with libfabric, to build different binary executables of GROMACS 2022.5.

Build scripts

Our main objective was to find the best compiler suite – and SIMD setting – for Graviton3E-based Hpc7g instances. We’ll discuss the build scripts, job submission scripts, and performance data here – and you can find all the scripts in our GitHub repository.

For GROMACS’s build script, the CMake options are the same for the GNU compilers and the Arm compiler for Linux.

For GNU compilers, the Open MPI 4.1.5 environment is already installed in ParallelCluster’s machine images. We recommend that anyone new to the cloud use the system-default Open MPI library. However, the system-default Open MPI does not support ACfL, so we’ve supplied a script which demonstrates how to compile and install Open MPI 4.1.5 with ACfL. Once this is installed, we can use environment modules to switch between the GNU and Arm compilers.
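Condensed into commands, that flow looks roughly like the sketch below. The module names match the ones we use later in this post, but the download path and install prefix are illustrative:

```bash
# Load the Arm toolchain modules (names as used elsewhere in this post)
module load acfl/23.04.1 armpl/23.04.1 libfabric-aws/1.17.1

# Build Open MPI 4.1.5 with the Arm compilers, linked against the EFA-enabled
# libfabric. The install prefix is illustrative.
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.gz
tar xzf openmpi-4.1.5.tar.gz && cd openmpi-4.1.5
./configure CC=armclang CXX=armclang++ FC=armflang \
    --prefix=/shared/openmpi-4.1.5-acfl \
    --with-libfabric=/opt/amazon/efa
make -j "$(nproc)" && make install

# Afterwards, environment modules let us switch toolchains, for example:
#   module load gnu/12.2.0     # GNU 12.2 with the system-default Open MPI
#   module load acfl/23.04.1   # ACfL with the Open MPI we just built
```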

With this done, it’s now possible to build GROMACS 2022.5 executables for both the SVE and NEON/ASIMD instruction sets.

We’ve stored the procedure for this in another reference script. We only need to change the GMX_SIMD parameter in the configuration setup. For the SVE-enabled binary, the parameter is -DGMX_SIMD=ARM_SVE. For the NEON/ASIMD-enabled binary, switch the parameter to -DGMX_SIMD=ARM_NEON_ASIMD.
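For reference, here’s a minimal sketch of the CMake invocation for the SVE build (the install prefix and FFT choice are illustrative; swap GMX_SIMD to ARM_NEON_ASIMD for the NEON/ASIMD build, and see our repository for the exact scripts):

```bash
# Configure and build GROMACS 2022.5 with SVE SIMD (paths are illustrative)
tar xzf gromacs-2022.5.tar.gz && cd gromacs-2022.5
mkdir build && cd build

cmake .. \
  -DCMAKE_C_COMPILER=mpicc \
  -DCMAKE_CXX_COMPILER=mpicxx \
  -DGMX_MPI=ON \
  -DGMX_OPENMP=ON \
  -DGMX_SIMD=ARM_SVE \
  -DGMX_BUILD_OWN_FFTW=ON \
  -DCMAKE_INSTALL_PREFIX=/shared/gromacs-2022.5-sve

make -j "$(nproc)" && make install
```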

Test cases

We applied three standard test cases for GROMACS from the Unified European Application Benchmark Suite (UEABS).

  1. Test Case A is the ion channel system of the membrane protein GluCl embedded in a DOPC membrane and solvated in TIP3P water. This system contains 142k atoms.
  2. Test Case B is a model of cellulose and lignocellulosic biomass in an aqueous solution. The system has 3.3 million atoms.
  3. Test Case C is the standard test case used for NAMD benchmarks. The system is a 3 x 3 x 3 replica of the STMV (Satellite Tobacco Mosaic Virus), and contains about 28 million atoms.

For all these cases, we set the total simulation steps to 10k.
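As an illustration, a single-node run of one of these benchmarks might use a Slurm script like the sketch below (the input file name, module versions, and mdrun options are illustrative; our exact job scripts are in the repository):

```bash
#!/bin/bash
#SBATCH --job-name=gromacs-caseA
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=64      # one MPI rank per core on hpc7g.16xlarge
#SBATCH --exclusive

module load acfl/23.04.1 armpl/23.04.1 libfabric-aws/1.17.1
source /shared/gromacs-2022.5-sve/bin/GMXRC   # illustrative install prefix

# Run 10,000 MD steps of Test Case A (input file name is illustrative)
mpirun -np 64 gmx_mpi mdrun -s ion_channel.tpr -nsteps 10000 -ntomp 1
```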

To find the best combination of compiler and SIMD setup, we started by testing application performance on a single Hpc7g.16xlarge instance. Figure 2 charts the performance we saw for test case A (142K atoms). ACfL with SVE enabled generated the best performance: the binary it produced was about 9-10% faster than the one using NEON/ASIMD, and 6% faster than the SVE binary built with the GNU compiler.

Figure 2: Performance of GROMACS 2022.5 for GluCl Ion Channel system (142K atoms) with different settings of compilers and SIMD using one Hpc7g.16xlarge instance. The SVE-enabled binary generated by ACfL produces the best performance. All the data points are based on the average of three individual runs.

Figure 3 charts the performance we saw for test case B (3.3M atoms). The best setup we found – again – was ACfL with SVE enabled. With this compiler, we measured a 28% improvement for the SVE-enabled binary compared with the NEON/ASIMD-enabled binary.

Figure 3: Performance of GROMACS 2022.5 for cellulose and lignocellulosic biomass (3.3M atoms) with different settings of compilers and SIMD using one Hpc7g.16xlarge instance. The SVE-enabled binary generated by ACfL produces the best performance. All the data points are based on the average of three individual runs.

The performance results for test case C (28M atoms) are shown in Figure 4. Again, we saw a similar pattern: ACfL with SVE enabled was the best option for GROMACS running on Hpc7g instances. In this case, the performance delta was 19% compared to NEON/ASIMD. And again, the binary generated by ACfL was 6% faster than the one built by the GNU compiler.

Figure 4: Performance of GROMACS 2022.5 for Satellite Tobacco Mosaic Virus, STMV (28M atoms) with different settings of compilers and SIMD using one Hpc7g.16xlarge instance. The SVE-enabled binary generated by ACfL produces the best performance. All the data points are based on the average of three individual runs.

Based on the results on a single Hpc7g instance, we concluded that ACfL with SVE-enabled SIMD generated the best performance.

By default, the configuration tool in GROMACS detected the CPU automatically and selected the correct SIMD support. GROMACS also detected the SIMD configuration correctly for AWS Graviton3.

We chose test case C to highlight performance scalability on Hpc7g when running across multiple nodes. The results charted in Figure 5 confirm that scaling is near-linear with EFA enabled.
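One way to force that comparison – not necessarily the exact mechanism we used – is to select the libfabric provider explicitly when launching the multi-node run:

```bash
# Inside a multi-node job script: select the libfabric provider explicitly
# (illustrative; rank count and input file name as for the STMV case).

# With EFA (the default on EFA-enabled instances)
mpirun -np 512 -x FI_PROVIDER=efa gmx_mpi mdrun -s stmv.tpr -nsteps 10000 -ntomp 1

# Without EFA, falling back to the TCP provider
mpirun -np 512 -x FI_PROVIDER=tcp gmx_mpi mdrun -s stmv.tpr -nsteps 10000 -ntomp 1
```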

Figure 5: Scalability of Test Case C running on the Hpc7g-based cluster with, and without, EFA. 200 Gbps EFA contributes to the scalability in the cases running beyond 2 compute instances. The binary was compiled with ACfL with SVE enabled. All the data points are based on the average of three individual runs.

Conclusion for GROMACS

Based on these performance results for three quite different, but well-known, test cases, we found that the SVE-enabled binary was faster than one using ASIMD/NEON instructions, and that ACfL produced faster code than the GNU compiler. Specifically for Hpc7g, ACfL with the SVE-enabled SIMD setting was the best configuration, when paired with the latest ArmPL and Open MPI with EFA enabled.

LAMMPS

LAMMPS (Large-scale Atomic/Molecular Massively-Parallel Simulator) is a classical molecular dynamics simulator, used for particle-based modelling of materials. It was developed at Sandia National Laboratories, and is available as an open-source tool, distributed under GPLv2. You can download the source code from the LAMMPS GitHub repository.

In terms of algorithms, LAMMPS uses parallel spatial decomposition, as well as parallel FFTs for long-range Coulombic interactions. The computational cost scales with the number of atoms as O(N) for short-range interactions, and as O(N log N) when computations involve Coulombic interactions with FFT-based methods.

To prepare to run LAMMPS on AWS, we followed the same AWS ParallelCluster, GCC, ACfL, and Open MPI setup as for GROMACS. The LAMMPS install guide offers several installation options, but to use the latest version of LAMMPS with more recent versions of GCC, ACfL, and Open MPI, LAMMPS must be compiled from source.

LAMMPS provides Makefiles optimized for the Arm architecture: Makefile.aarch64_arm_openmpi_armpl and Makefile.aarch64_g++_openmpi_armpl, for use with the ACfL and GCC compilers respectively. In the case of ACfL, we used -march=armv8-a+sve to add the SVE instructions, and checked that switching between -march=armv8-a+sve and -march=armv8-a+simd did not impact performance. For both the ACfL and GCC Makefiles, we added -fopenmp to CCFLAGS and LINKFLAGS to enable OpenMP. We’ve summarized the final settings in Table 1, and sketched the corresponding Makefile edits after it.

Table 1: Compiler settings for GCC and ACfL
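The sketch below shows one way to apply the flag changes described above to the two Makefiles (illustrative sed edits; in practice you can simply edit the Makefiles by hand, and the stock Arm-optimized Makefiles supply the base optimization flags):

```bash
# Illustrative edits matching Table 1 (location within src/MAKE may vary by release)
cd ~/software/lammps/src/MAKE/MACHINES

# ACfL: add SVE code generation and OpenMP to compile and link flags
sed -i -e 's/^\(CCFLAGS[[:space:]]*=.*\)/\1 -march=armv8-a+sve -fopenmp/' \
       -e 's/^\(LINKFLAGS[[:space:]]*=.*\)/\1 -fopenmp/' \
       Makefile.aarch64_arm_openmpi_armpl

# GCC: enable OpenMP for compile and link
sed -i -e 's/^\(CCFLAGS[[:space:]]*=.*\)/\1 -fopenmp/' \
       -e 's/^\(LINKFLAGS[[:space:]]*=.*\)/\1 -fopenmp/' \
       Makefile.aarch64_g++_openmpi_armpl
```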

To compile LAMMPS, we used the final build scripts available in our GitHub repo: 2a-compile-lammps-acfl-sve.sh for ACfL and 2b-compile-lammps-gcc.sh for GCC.

The key points to note in our LAMMPS compile scripts involve loading the correct environment modules – libfabric-aws/1.17.1 and armpl/23.04.1 for both compilers, acfl/23.04.1 for ACfL, and gnu/12.2.0 for GCC – and replacing the CCFLAGS and LINKFLAGS variables with those from Table 1. For ACfL, we also set PATH and LD_LIBRARY_PATH to point to the Open MPI installation folders.

We pulled the LAMMPS source code from the git repository into the install folder (~/software in our case), and checked out the 23 June 2022 stable release for compilation (git checkout stable_23Jun2022_update4). You can check out other releases, too; if you want the default stable branch, use git checkout stable. Use make clean-all to remove all the intermediate object files and executables, and make no-all to turn off all optional packages. Next, we ran make yes-most to install most of the packages, which was enough for us to test all five LAMMPS benchmarks.
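Condensed into commands, those steps look roughly like this (the install folder and parallelism level are illustrative):

```bash
# Fetch the LAMMPS source and pin the release we benchmarked
cd ~/software
git clone https://github.com/lammps/lammps.git
cd lammps/src
git checkout stable_23Jun2022_update4   # or 'git checkout stable' for the default stable branch

make clean-all        # remove intermediate object files and executables
make no-all           # turn off all optional packages
make yes-most         # enable the 'most' package set (covers the 5 benchmarks)

# Build with the Arm-optimized Makefile (ACfL shown; use the g++ variant for GCC)
make -j "$(nproc)" aarch64_arm_openmpi_armpl
```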

Once we compiled LAMMPS, we were able to submit jobs using Slurm. We’ve included sample job scripts 3a-lammps-acfl-sve.sh (for ACfL-compiled LAMMPS) and 3b-lammps-gcc.sh (for GCC-compiled LAMMPS) in our repo. The final command executes LAMMPS using mpirun. The -var x, -var y, and -var z flags specify the NX, NY, and NZ parameters passed to the LAMMPS input file, and the -in parameter indicates the LAMMPS input file (in.lj in our example).
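A trimmed-down version of such a job script might look like the following sketch (the executable path, Open MPI prefix, and scaling factors are illustrative):

```bash
#!/bin/bash
#SBATCH --job-name=lammps-lj
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=64
#SBATCH --exclusive

module load acfl/23.04.1 armpl/23.04.1 libfabric-aws/1.17.1
# Point at the Open MPI we built with ACfL (placeholder path)
export PATH=/shared/openmpi-4.1.5-acfl/bin:$PATH
export LD_LIBRARY_PATH=/shared/openmpi-4.1.5-acfl/lib:$LD_LIBRARY_PATH

# Scale the 32,000-atom Lennard-Jones benchmark by 8 in each dimension (NX=NY=NZ=8)
mpirun -np 64 ~/software/lammps/src/lmp_aarch64_arm_openmpi_armpl \
    -var x 8 -var y 8 -var z 8 -in in.lj
```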

For the LAMMPS test runs, we chose the five benchmark cases listed at the LAMMPS Benchmarks site as described in Table 2:

Table 2: Description of the 5 LAMMPS benchmarks

We ran a comparison between LAMMPS compiled with GCC and LAMMPS compiled with ACfL for all five input files (in.lj, in.chain, in.eam, in.chute, and in.rhodo), running on a single Hpc7g.16xlarge instance.

For each of the input files, we submitted three jobs with GCC and three jobs with ACfL, and we used the average of the runs to report the results. We took the speed from the log.lammps output file, reported in tau/day for the chain, chute, and lj test cases, and ns/day for the eam and rhodo test cases.
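To pull that headline number out of each run, you can read the Performance line LAMMPS writes at the end of log.lammps – a small helper sketch, assuming one run per directory:

```bash
# Print the performance summary from each completed run (directory layout is
# illustrative). LAMMPS reports tau/day or ns/day depending on the unit system.
for log in run-*/log.lammps; do
    printf '%s: ' "$log"
    grep '^Performance:' "$log" | tail -n 1
done
```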

The LAMMPS benchmarks each contain 32,000 atoms (for NX=NY=NZ=1). We chose NX=NY=NZ=8 as a balance between NX=NY=NZ=1 (where the short run time leads to high variability in performance) and NX=NY=NZ=32 (which resulted in out-of-memory errors on a single instance).

Figure 6: Performance of LAMMPS on a single Hpc7g.16xlarge instance with NX=NY=NZ=8, when compiled using the GCC and ACfL compilers. Speed is in tau/day for the chain, chute, and lj test cases, and ns/day for the eam and rhodo test cases.

ACfL consistently outperformed GCC for all the single-node runs, with improvements ranging from 2.3% to 46%.

Table 3: Speedups with ACfL over GCC for the 5 benchmark cases on a single Hpc7g.16xlarge instance with NX=NY=NZ=8.

For the multi-node runs, we further increased NX=NY=NZ to 32, to increase the runtime – and therefore reduce the variability – in the results. This brought the total number of atoms to 32,000 x 32 x 32 x 32 = ~ 1 billion atoms.

We chose to focus only on the Lennard-Jones benchmark, and tested both GCC- and ACfL-compiled LAMMPS with OMP_NUM_THREADS=1, 2, and 4. For 8 nodes (512 cores) and above, we observed better results with OMP_NUM_THREADS=2 for both compilers. Unlike the single-node runs, we observed that GCC-compiled LAMMPS outperformed its ACfL-compiled counterpart. We tested the GCC-compiled version up to 128 nodes.
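For these multi-node runs, the hybrid MPI + OpenMP launch looked roughly like the sketch below (node count, rank layout, and paths are illustrative; the exact job scripts we used are in the repository):

```bash
#!/bin/bash
#SBATCH --job-name=lammps-lj-1B
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=32      # 32 MPI ranks x 2 OpenMP threads = 64 cores per node
#SBATCH --cpus-per-task=2
#SBATCH --exclusive

module load gnu/12.2.0 armpl/23.04.1 libfabric-aws/1.17.1
export OMP_NUM_THREADS=2

# 1-billion-atom Lennard-Jones benchmark (NX=NY=NZ=32), GCC-compiled binary
mpirun -np 256 ~/software/lammps/src/lmp_aarch64_g++_openmpi_armpl \
    -var x 32 -var y 32 -var z 32 -in in.lj
```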

We’ve plotted the results in Figure 7.

Figure 7: Scalability of GCC-compiled LAMMPS for Lennard Jones benchmark case (1 billion atoms) using multiple Hpc7g.16xlarge instances. The line for linear scaling is plotted in grey.

Conclusion

In this post, we showed how you can run GROMACS and LAMMPS on Hpc7g instances. We discussed the ideal toolchain and SIMD setup based on the results we saw.

We didn’t modify the source code of GROMACS and LAMMPS, choosing instead to leverage “auto-vectorization” implemented in the compilers.

The Arm compiler for Linux (ACfL) with SVE-enabled SIMD boosted GROMACS performance by up to 22% compared with the GNU compiler and NEON/ASIMD on Hpc7g instances. ACfL sped up LAMMPS simulations by between 2.3% and 46%, depending on the case, compared to the GNU compiler on a single-node basis. For LAMMPS, we also saw performance improvements at larger core counts with hybrid MPI + OpenMP parallelization.

We’ve made all our build scripts – and job submission scripts – available in our GitHub samples repo, and we’d encourage you to use this if you want to build on this work for your own research workloads. If you need to discuss any of this, feel free to reach out to us at ask-hpc@amazon.com.

Nathaniel Ng

Nathaniel is a Solutions Architect at AWS. With a PhD and over a decade of experience in HPC, he is passionate about enabling researchers to leverage the power of the cloud to advance their fields and solve real-world problems. He also partners with customers to help them adopt and optimize cloud solutions for their high performance computing (HPC) and AI/ML needs.

James Chen

James has been working as a high-performance computing (HPC) specialist for more than 16 years, with experience in both academia and the IT industry. He obtained his PhD in computational materials science in 2007 and worked at the National Supercomputing Centre (NSCC) Singapore as a senior research fellow until 2020. He specializes in system architecture and performance benchmarking. He now works with HPC customers in ASEAN.

Shun Utsui

Shun is a Senior HPC Solutions Architect at AWS. In this role, he leads the HPC business in Asia Pacific & Japan, working with a wide range of customers. He is an expert in helping customers migrate HPC workloads to AWS, such as weather forecasting, CAE/CFD, life sciences, and AI/ML. Prior to AWS, he held multiple HPC tech lead positions at Fujitsu, where he helped customers design and operate their large-scale HPC systems.