AWS HPC Blog

Deep-dive into Hpc7a, the newest AMD-powered member of the HPC instance family

Last week, we announced the Hpc7a instance type, the latest generation AMD-based HPC instance, purpose-built for tightly-coupled high performance computing workloads. This joins our family of HPC instance types in the Amazon Elastic Compute Cloud (Amazon EC2), which began with Hpc7a's predecessor, the Hpc6a, in January 2022.

Amazon EC2 Hpc7a instances, powered by 4th generation AMD EPYC processors, deliver up to 2.5x better performance compared to Amazon EC2 Hpc6a instances.

In this post, we’ll discuss details of the new instance and show you some of the performance metrics we’ve gathered by running HPC workloads like computational fluid dynamics (CFD), molecular dynamics (MD) and numerical weather prediction (NWP).

Introducing the Hpc7a instance

We launched Hpc6a last year for customers to efficiently run their compute-bound HPC workloads on AWS. As their jobs grow in complexity, customers have asked for more cores with more compute performance, as well as more memory and network performance, to reduce their time to results. The Hpc7a instances deliver on these asks, providing twice the number of cores, 1.5x the number of memory channels, and three times the network bandwidth compared to the previous generation.

Hpc7a is based on the 4th Generation AMD EPYC (code name Genoa) processor with up to 192 physical cores, an all-core turbo frequency of 3.7 GHz, 768 GiB of memory, and 300 Gbps of Elastic Fabric Adapter (EFA) network performance. This is all possible because of the AWS Nitro System, a combination of dedicated hardware and a lightweight hypervisor that offloads many of the traditional virtualization functions to dedicated hardware, resulting in performance that's virtually indistinguishable from bare metal.

HPC instance sizes

Many off-the-shelf HPC applications use per-core commercial licensing, and those licenses are often far more expensive per core than the cores themselves. Those customers have asked for more memory bandwidth and more network throughput available per core.

Starting today, Hpc7a (along with other 7th generation HPC instances) will be available in different sizes. Usually in Amazon EC2, a smaller instance size reflects a smaller slice of the underlying hardware. However, for the HPC instances — starting with Hpc7g and Hpc7a — each size option will have the same engineering specs and price, and will differ only by the number of cores enabled.

You have always been able to manually disable cores, or use process pinning (affinity) to carefully place threads across the CPUs. But doing this optimally needs in-depth knowledge of the chip architecture – like the number of NUMA domains and the memory layout. It also means you have to know MPI well, and have clear sight of what your job submission scripts do when the scheduler receives them.
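
To give a feel for the bookkeeping involved, here is a minimal sketch of the kind of core selection this requires: picking an evenly-spaced subset of cores across NUMA domains so the node's memory bandwidth is shared among fewer active ranks. The topology numbers (192 cores, 8 NUMA domains) are illustrative assumptions, not a description of any particular instance's layout.

```python
# Minimal sketch: choose an evenly-spaced subset of cores, balanced across NUMA
# domains, so each active rank keeps more memory bandwidth to itself.
# All topology values here are illustrative assumptions.

def build_pin_list(total_cores: int, numa_domains: int, cores_to_use: int) -> list[int]:
    """Return core IDs spread evenly across NUMA domains."""
    per_domain = total_cores // numa_domains
    use_per_domain = cores_to_use // numa_domains
    stride = per_domain // use_per_domain
    pin_list = []
    for domain in range(numa_domains):
        base = domain * per_domain
        pin_list.extend(base + i * stride for i in range(use_per_domain))
    return pin_list

# Example: run 96 ranks on a hypothetical 192-core, 8-NUMA-domain node,
# using every other core (12 active cores per domain).
print(build_pin_list(total_cores=192, numa_domains=8, cores_to_use=96))
```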

By offering instance sizes that already have the right pattern of cores turned off, you’ll be able to maximize the performance of your code with less work. This will be a boost for customers who need to achieve the absolute best performance per core for their workloads, which often includes commercially-licensed ISV applications. In this case, customers are driven by pricing concerns to get the best possible results from each per-core license they buy. You can find a detailed explanation in our post on this topic, along with performance comparisons that will help you understand our methodology.

With the HPC instance sizes, the cost stays the same for all sizes because you still get access to the entire node’s memory and network, with selected cores turned off to leave extra memory bandwidth available for the remaining cores. We encourage you to benchmark your own codes and find the right balance for your specific needs.

Hpc7a will be available in four different instance sizes. Table 1 describes these in detail.

Table 1 – Available instance sizes for Hpc7a. Customers choosing a smaller instance type will still have access to the full memory and network performance of the largest instance size but a different number of CPU cores.

Performance

Hpc7a shows significant performance gains over previous generations. While the number of cores per instance has doubled, per-core compute performance, network bandwidth, and memory bandwidth have all increased too. In many cases this leads to more than a doubling in overall simulation speed.

To illustrate this, we’ll look at the relative performance improvement of five common HPC codes across two generations of instance. We’ll look at:

  • Siemens Simcenter STAR-CCM+ for CFD
  • Ansys Fluent for CFD
  • OpenFOAM for CFD
  • GROMACS for MD
  • Weather Research and Forecasting Model (WRF) for NWP

Let's take a detailed look at each code with performance comparisons to the previous generation Hpc6a instance. For the reader's convenience, the previous generation's specs are: Hpc6a.48xlarge has 96 physical cores (3rd Generation AMD EPYC, code name Milan), 384 GiB of memory (4 GiB per core), a 3.6 GHz CPU frequency, and 100 Gbps of EFA network performance.

Siemens Simcenter STAR-CCM+

First, we took a look at Siemens Simcenter STAR-CCM+. We chose the AeroSUV 320M cell automotive test case – a useful public case with similar characteristics to production automotive external aerodynamics models. We ran this with Siemens Simcenter STAR-CCM+ 2306, using Intel MPI 2021.6. As this is an ISV application, no further tuning to match the architecture is necessary. The graphs below show the iterations per minute as a metric for performance.

We saw up to a 2.7x speed-up from Hpc6a to Hpc7a at 16 instances, and similar scaling all the way to 12k cores, which is around 26k cells per core.
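
For reference, the cells-per-core figure is just the mesh size divided by the core count at the largest scale we tested. A quick sketch of that arithmetic, assuming (for illustration) that the 12k-core end of the curve corresponds to 64 fully-populated Hpc7a instances:

```python
# Cells-per-core arithmetic behind the scaling numbers quoted above.
# The 64-instance count is an illustrative assumption; the mesh size is from the text.
mesh_cells = 320_000_000        # AeroSUV 320M cell case
cores = 64 * 192                # 64 fully-populated Hpc7a instances = 12,288 cores

cells_per_core = mesh_cells / cores
print(f"{cells_per_core:,.0f} cells per core")   # ~26,000
```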

Figure 1 – A graph of performance of Siemens Simcenter STAR-CCM+ on the AeroSUV 320M cell dataset. The figure shows that Hpc7a outperforms Hpc6a up to 2.7x on a per instance basis.

Figure 2 – A graph of performance of Siemens Simcenter STAR-CCM+ on the AeroSUV 320M cell dataset. Hpc7a outperforms Hpc6a up to 1.29x on a per core basis.

Ansys Fluent

Next, we took a look at Ansys Fluent 2023 R1, where we ran the common public dataset "External flow over a Formula-1 race car". The case has around 140 million cells and uses the realizable k-epsilon turbulence model with the pressure-based coupled solver (least squares cell-based, pseudo-transient). We ran it to over 9,000 cores, which is still well within the parallel scaling range of this particular test case.

The graphs show solver rating, which Ansys defines as the number of benchmark runs that can be completed on a given machine (in sequence) in a 24-hour period. We compute this by dividing the number of seconds in a day (86,400 seconds) by the number of seconds required to run the benchmark.
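
As a worked example of that definition (the 120-second elapsed time below is hypothetical, not a measured result):

```python
# Solver rating as described above: how many times the benchmark could run
# back-to-back in a 24-hour day.
SECONDS_PER_DAY = 86_400

def solver_rating(benchmark_seconds: float) -> float:
    return SECONDS_PER_DAY / benchmark_seconds

print(solver_rating(120.0))   # a hypothetical 120-second run scores a rating of 720
```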

On a per-instance basis, Hpc7a exhibits up to 2.48x better performance than Hpc6a at 6 instances. The iso-core benefit in Hpc7a's favor peaks at 1.29x at 6,144 cores.

Figure 3 – A graph of performance of Ansys Fluent on the F1 race car 140M dataset. Hpc7a outperforms Hpc6a up to 2.48x on a per instance basis.

Figure 4 – A graph of performance of Ansys Fluent on the F1 race car 140M dataset. Hpc7a outperforms Hpc6a up to 1.29x on a per core basis.

OpenFOAM

Next, we tested OpenFOAM with the DrivAer fastback vehicle (from the AutoCFD workshop) with 128M cells (generated using the ANSA preprocessing software by BETA-CAE Systems), and ran the case in hybrid RANS-LES mode using the pimpleFoam solver. We used OpenFOAM v2206 compiled with GNU C++ Compiler v12.3.0 and Open MPI v4.1.5, with architecture-specific flags to tune the compilation for each instance type ("-march=x86-64-v4 -mtune=x86-64-v4" for Hpc7a and "-march=znver3 -mtune=znver3" for Hpc6a). We ran the case fully populated, using 192 and 96 MPI ranks per instance for Hpc7a and Hpc6a respectively, and scaled to 3,072 cores. The graphs below show the iterations per minute as a metric for performance.

We saw up to a 2.7x speed-up at 4 instances, as this is the core count where Hpc7a shows a super-linear scaling curve. Super-linear scaling stops at 1,536 cores for both instance types and efficiency decreases further towards 3,072 cores. That is around 42k cells per MPI rank, which is the expected scaling limit for OpenFOAM.

Figure 5 – A graph of performance of OpenFOAM on the DrivAer 128M dataset. Hpc7a outperforms Hpc6a by up to 2.7x on a per instance basis.

Figure 6 – A graph of performance of OpenFOAM on the DrivAer 128M dataset. Hpc7a outperforms Hpc6a by up to 1.3x on a per core basis.

GROMACS

Next, we looked at the Max Planck Institute-provided benchRIB test case, a 2M-atom ribosome in water, using GROMACS version 2021.5. We used the Intel compiler version 2022.1.2 and Intel MPI 2021.5.1 to compile and run GROMACS. In this case we used the best-matching Intel compiler flags for Hpc7a ("-march=skylake-avx512 -mtune=skylake-avx512") and Hpc6a ("-march=core-avx2"), and Intel MKL for the Fast Fourier Transform.

For each scaling data point we used the optimal MPI rank versus OpenMP thread distribution: we start with 1 OpenMP thread and an MPI rank on each core at 1 instance, and steadily increase the number of OpenMP threads when scaling further to better balance the workload. The maximum number of threads used is 2 at 8 instances for Hpc7a and 4 at 16 instances for Hpc6a. The graphs show simulated time per day (ns/day); higher is better.
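
The hybrid decomposition described above comes down to splitting each instance's cores into MPI ranks times OpenMP threads. A minimal sketch of that bookkeeping, using only the thread counts mentioned in the text; which split is fastest at a given scale still has to come from benchmarking:

```python
# Enumerate hybrid MPI/OpenMP layouts that keep every core on an instance busy.
# Candidate thread counts (1, 2, 4) are the ones mentioned in the text.

def candidate_layouts(cores_per_instance: int, thread_options=(1, 2, 4)):
    """Return (MPI ranks per instance, OpenMP threads per rank) pairs that fill the node."""
    layouts = []
    for threads in thread_options:
        if cores_per_instance % threads == 0:
            layouts.append((cores_per_instance // threads, threads))
    return layouts

print(candidate_layouts(192))  # Hpc7a: [(192, 1), (96, 2), (48, 4)]
print(candidate_layouts(96))   # Hpc6a: [(96, 1), (48, 2), (24, 4)]
```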

We saw up to a 2.05x speed-up at 2 instances, and similar scaling, when comparing Hpc7a to Hpc6a. We also saw up to a 1.2x speed-up at 768 cores when comparing on a per-core level.

Figure 7 – A graph of performance of GROMACS on the benchRIB dataset. Hpc7a outperforms Hpc6a up to 2.05x on a per instance basis.

Figure 8 – A graph of performance of GROMACS on the benchRIB dataset. Hpc7a outperforms Hpc6a up to 1.2x on a per core basis.

WRF

We looked at CONUS 2.5km benchmark performance using WRF v4.2.2. We used the Intel compiler version 2022.1.2 and Intel MPI 2021.9.0 to compile and run WRF, with the same compiler flags as for GROMACS. We used 48 MPI ranks per instance, each with 4 OpenMP threads, to use all 192 cores per instance.

We ran the scaling test up to 128 instances (24,576 cores) and used the total elapsed time to calculate the simulation speed as runs per day (higher is better). Comparing the instance types, we saw up to a 2.6x speed-up at 8 instances, and better scalability for Hpc7a compared to the previous generation Hpc6a due to the increase in on-node traffic: with twice as many cores per instance, more of the communication stays inside the node instead of crossing the network.
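
The runs-per-day metric follows the same seconds-in-a-day arithmetic as the Fluent solver rating, and the rank/thread layout has to multiply out to the full core count. A quick sketch of both; the one-hour elapsed time below is a placeholder, not a measured result:

```python
# WRF bookkeeping from the paragraphs above. Layout and instance counts come
# from the text; the elapsed time is a placeholder.
mpi_ranks_per_instance = 48
omp_threads_per_rank = 4
instances = 128

cores_per_instance = mpi_ranks_per_instance * omp_threads_per_rank   # 192
total_cores = cores_per_instance * instances                          # 24,576

elapsed_seconds = 3_600.0                # hypothetical 1-hour run
runs_per_day = 86_400 / elapsed_seconds  # -> 24 runs per day
print(cores_per_instance, total_cores, runs_per_day)
```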

Figure 9 – A graph of performance of WRF on the Conus 2.5km dataset. Hpc7a outperforms Hpc6a by up to 2.6x on a per instance basis.

Figure 10 – A graph of performance of WRF on the Conus 2.5km dataset. Hpc7a retains better scalability at >10k cores due to increased on-node traffic.

Performance is more than just higher core counts

Hpc7a shows a great performance increase over the previous generation, and not only because of the doubling in the number of cores. Figure 11 shows the performance gain for each workload on a per-core basis, so this is additional to that doubling. We ran all instance comparisons using either real production workloads or a close substitute.

Hpc7a instances show, on average, a performance improvement of 29% compared to the previous generation on a per-core basis.

Figure 11 – Relative performance of Hpc6a and Hpc7a for various applications. The performance results shown here use the same number of cores on Hpc6a.48xlarge and Hpc7a.96xlarge to highlight the expected iso-core improvement. This translates to using only half the number of Hpc7a instances per test case.

Conclusion

In this blog post we introduced the Amazon EC2 Hpc7a instance, which offers up to 2.5x better compute performance compared to the previous generation AMD-based HPC instance, Hpc6a. It has twice the cores per instance and yet still delivers increased per-core compute performance, better memory bandwidth, and greater network performance per core.

We hope you’ll try them for your workloads, too. Reach out to us at ask-hpc@amazon.com and let us know how you fare.

Stephen Sachs

Stephen Sachs is a Principal HPC Applications Engineer. His background is in scientific computing, computational fluid dynamics, and mathematics. He spends his time installing, tinkering with, and fixing software in different roles around the HPC space.

Karthik Raman

Karthik Raman is a Principal Application Engineer in the high-performance computing team at AWS. He has over 10 years of experience analyzing HPC workloads over multiple generations of CPU and GPU architectures. Karthik holds a Master’s degree in Electrical Engineering from University of Southern California.

Neil Ashton

Neil is an ex-F1 & NASA engineer who specializes in developing cutting edge Computational Fluid Dynamics methods with a particular focus on turbulence modelling, deep learning and high-performance computing. He is a Principal Computational Engineering Specialist within the Advanced Computing and Simulation product team.

Nicola Venuti

Nicola Venuti is an HPC Specialist Solutions Architect at Amazon Web Services. He started out working for Nice Software in 2006 and joined AWS in 2016 through the acquisition of Nice Software.