Performance gains with AWS Graviton4 – a DevitoPRO case study

This post was contributed by Gerard Gorman from Devito, and Cyril Lagrange, Gilles Tourpe, and Theo Wu from AWS

The AWS Graviton4 processor has 96 Neoverse V2 cores and an enhanced memory subsystem. Its 12 DDR5-5600 channels provide up to 75% more memory bandwidth than Graviton3, which benefits memory-bound HPC workloads like seismic imaging.
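The 75% figure can be cross-checked with a back-of-the-envelope calculation of theoretical peak DRAM bandwidth. The sketch below takes the 12 DDR5-5600 channels from this post; the Graviton3 configuration (8 channels of DDR5-4800) and the 8 bytes per transfer of a 64-bit channel are assumptions not stated here, and the function name is illustrative.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int,
                       bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes).

    Assumes each channel moves `bytes_per_transfer` bytes per transfer
    (8 bytes for a 64-bit wide channel).
    """
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0


# Graviton4: 12 channels of DDR5-5600 (from this post).
graviton4 = peak_bandwidth_gbs(12, 5600)

# Graviton3: assumed to be 8 channels of DDR5-4800 (not stated in this post).
graviton3 = peak_bandwidth_gbs(8, 4800)

ratio = graviton4 / graviton3
print(f"Graviton4 peak: {graviton4:.1f} GB/s")
print(f"Graviton3 peak: {graviton3:.1f} GB/s")
print(f"Ratio: {ratio:.2f}x")
```

These are theoretical peaks; sustained bandwidth measured by a benchmark will be lower.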

Now that Graviton4 processors are generally available (in the R8g instance family for now), we put some 3D acoustic wave propagation kernels from Devito Codes' DevitoPRO to the test on this new processor and compared the results with previous Graviton generations. These benchmarks, including Full Waveform Inversion (FWI) and Reverse Time Migration (RTM) propagation kernels, are critical for developing cost models and assessing computational efficiency in seismic imaging applications.

In this post, we’ll walk you through the tests we performed, the instances we compared, and show you how Graviton4 fared against previous generations.

Background

Seismic imaging is the digital process of creating an image of the subsurface from the reflections of seismic waves recorded at the surface. Seismic images are critical in the identification of natural resources as well as in the assessment of geohazard risks in construction work.

Seismic imaging is analogous to an ultrasound scan but at a much larger scale. Reverse Time Migration (RTM) and Full Waveform Inversion (FWI) are the two most demanding imaging algorithms, and at their core is the finite difference modeling of seismic wave propagation.

Devito Codes is a company specializing in finite difference methods for partial differential equations (PDEs) in seismic imaging and other scientific fields. Their flagship product, DevitoPRO, is a domain-specific language (DSL) designed to generate highly optimized code for solving PDEs on various hardware architectures, including CPUs and GPUs. DevitoPRO is widely used in both academia and industry for its flexibility and performance portability in seismic imaging.

To assess the performance capabilities of the Graviton4 processor, we ran a series of benchmarks using three standard 3D acoustic wave propagation kernels. These benchmarks, run in single precision with OpenMP thread parallelism, were meticulously autotuned to optimize parameters such as cache block size. The best results from multiple runs were recorded to ensure accuracy and reliability.

We categorize these benchmarks according to their operational intensity:

Isotropic Acoustic Benchmark: This benchmark is the simplest of the three, commonly used in seismic imaging. Its performance is predominantly influenced by memory bandwidth, making it a critical test for assessing memory-bound applications.
Fletcher-Du-Fowler TTI Kernel: Known for its moderate complexity, this kernel is frequently utilized by hardware vendors for benchmarking purposes. It provides a balanced assessment of both computational and memory performance.
Self-adjoint TTI Kernel: Designed for robustness and accuracy, this benchmark reflects the most demanding computational requirements found in production workloads, particularly for Full Waveform Inversion (FWI) and Reverse Time Migration (RTM) applications. It serves as a comprehensive evaluation of the processor’s ability to handle high-intensity computational tasks.

All three benchmarks ran with dimensions 512x512x512, for a total of 400 time-steps, space-order 8 and time-order 2.
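To give a sense of scale for these settings, the sketch below works out the number of grid-point updates per run and the memory footprint of one 3D snapshot of a single-precision field. It ignores halo regions and the number of fields and time buffers each kernel keeps, which this post does not specify. The runtime passed to the throughput helper is a made-up value for illustration only.

```python
# Benchmark settings taken from this post.
nx = ny = nz = 512
time_steps = 400
bytes_per_value = 4  # single precision

# Grid points in the 3D domain (halo regions ignored).
points = nx * ny * nz

# Total grid-point updates over the whole run.
updates = points * time_steps

# Size of one 3D snapshot of one field, in MiB.
field_mib = points * bytes_per_value / 2**20


def throughput_gpts(runtime_s: float) -> float:
    """Giga-points per second for a given wall-clock runtime in seconds."""
    return updates / runtime_s / 1e9


print(f"Grid points:        {points:,}")
print(f"Point updates:      {updates:,}")
print(f"One field snapshot: {field_mib:.0f} MiB")

# Hypothetical runtime, for illustration only (not a measured result).
print(f"Throughput at 10 s: {throughput_gpts(10.0):.2f} GPts/s")
```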

Operating system and compiler optimization

We ran these benchmarks using Amazon Linux 2023. To maximize performance on Graviton, we compiled the benchmarks using GCC 14.1.0, rather than the system-default GCC version. The GCC 14.1 release introduced a range of improvements that enhance the compiler’s ability to leverage the advanced features of Neoverse cores, particularly the Neoverse V2 architecture used in Graviton4.

Key enhancements in GCC 14 include improved support for vectorization and new optimizations that are tailored for Neoverse V2 cores. These improvements allow for better exploitation of the high memory bandwidth and increased core count in Graviton4, resulting in more efficient execution of high-performance computing workloads. Additionally, the release includes enhancements in auto-vectorization, which are particularly beneficial for memory-bound applications like seismic imaging and simulation tasks.

Devito uses the following GCC flags, depending on the Graviton generation, to maximize performance on a given instance. As there is just one NUMA domain in all cases, we parallelize with pure OpenMP.

Graviton2 (Neoverse-N1):

gcc-14 -mcpu=neoverse-n1 -O3 -g -fPIC -Wall -std=c99 -Wno-unused-result -Wno-unused-variable -Wno-unused-but-set-variable -ffast-math -shared -fopenmp

Graviton3 (Neoverse-V1):

gcc-14 -mcpu=neoverse-v1 -O3 -g -fPIC -Wall -std=c99 -Wno-unused-result -Wno-unused-variable -Wno-unused-but-set-variable -ffast-math -shared -fopenmp

Graviton4 (Neoverse-V2):

gcc-14 -mcpu=neoverse-v2 -O3 -g -fPIC -Wall -std=c99 -Wno-unused-result -Wno-unused-variable -Wno-unused-but-set-variable -ffast-math -shared -fopenmp
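The three command lines above differ only in the -mcpu value, so they can be generated from a small lookup table. The sketch below is one way to do that; the helper name, the source file kernel.c, and the output file kernel.so are hypothetical placeholders, and this is not how DevitoPRO itself assembles its compiler invocation.

```python
import shlex

# Graviton generation -> -mcpu target, as listed in this post.
MCPU = {
    "graviton2": "neoverse-n1",
    "graviton3": "neoverse-v1",
    "graviton4": "neoverse-v2",
}

# Flags shared by all three generations, in the order shown above.
COMMON_FLAGS = [
    "-O3", "-g", "-fPIC", "-Wall", "-std=c99",
    "-Wno-unused-result", "-Wno-unused-variable",
    "-Wno-unused-but-set-variable",
    "-ffast-math", "-shared", "-fopenmp",
]


def compile_command(generation: str,
                    source: str = "kernel.c",    # hypothetical file name
                    output: str = "kernel.so",   # hypothetical file name
                    compiler: str = "gcc-14") -> list[str]:
    """Build the GCC argument list for a given Graviton generation."""
    mcpu = MCPU[generation.lower()]
    return [compiler, f"-mcpu={mcpu}", *COMMON_FLAGS, source, "-o", output]


for gen in MCPU:
    print(shlex.join(compile_command(gen)))
```

The returned list can be passed directly to subprocess.run on a machine that has gcc-14 installed.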

Performance improvements

The Graviton4 processor shows significant performance gains over its predecessors, particularly for memory-bound HPC applications. For instance, in the DevitoPRO 3D Isotropic Acoustic benchmark, the Graviton4 ran approximately 2.7 times faster than Graviton2 and 1.81 times faster than Graviton3.

Similarly, the 3D Fletcher-Du-Fowler TTI benchmark runs 3.4 times faster than on Graviton2 and 1.51 times faster than on Graviton3. The 3D Self-adjoint TTI benchmark also benefits from the new architecture, with Graviton4 running nearly 3.6 times faster than Graviton2 and 1.80 times faster than Graviton3.
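The quoted speedups also imply how much faster Graviton3 is than Graviton2 on each kernel, since dividing the two ratios cancels the Graviton4 term. The sketch below does that arithmetic. It assumes both ratios for a given benchmark were measured on the same Graviton4 instance size, and because the inputs are rounded the results are only approximate.

```python
# (Graviton4 vs Graviton2, Graviton4 vs Graviton3) speedups quoted in this post.
speedups = {
    "Isotropic acoustic": (2.7, 1.81),
    "Fletcher-Du-Fowler TTI": (3.4, 1.51),
    "Self-adjoint TTI": (3.6, 1.80),
}

# Implied Graviton3 vs Graviton2 speedup: (G4/G2) / (G4/G3) = G3/G2.
implied_g3_vs_g2 = {
    name: g4_vs_g2 / g4_vs_g3
    for name, (g4_vs_g2, g4_vs_g3) in speedups.items()
}

for name, value in implied_g3_vs_g2.items():
    print(f"{name}: Graviton3 is about {value:.2f}x Graviton2")
```

On these numbers the Fletcher-Du-Fowler TTI kernel shows the largest implied gain from Graviton2 to Graviton3, which fits its smaller step from Graviton3 to Graviton4.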

The bar chart in Figure 1 shows Graviton3 and Graviton4 performance relative to Graviton2. The r6g.16xlarge, r7g.16xlarge, and r8g.16xlarge all have 64 cores, so comparing them highlights the per-core performance improvement. In contrast, the r8g.24xlarge (a single-socket Graviton4) has 96 cores. The isotropic acoustic benchmark shows strong-scaling limits as we go from the r8g.16xlarge to the r8g.24xlarge (~10% performance improvement). However, this does not appear to be an issue with more complex propagator kernels such as TTI.

Figure 1 – bar chart comparing the performance of r6g.16xlarge, r7g.16xlarge, r8g.16xlarge and r8g.24xlarge instances for 3 seismic imaging benchmarks. In all cases the r8g.24xlarge performs the best.
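As a rough check on the strong-scaling remark, parallel efficiency can be estimated as the measured speedup divided by the increase in core count. The sketch below applies this to the approximate 10% gain quoted for the isotropic acoustic benchmark when going from 64 to 96 cores; the function name is illustrative.

```python
def scaling_efficiency(speedup: float, cores_before: int, cores_after: int) -> float:
    """Strong-scaling efficiency: achieved speedup over ideal (linear) speedup."""
    ideal_speedup = cores_after / cores_before
    return speedup / ideal_speedup


# Isotropic acoustic: ~10% faster on 96 cores than on 64 cores (from this post).
efficiency = scaling_efficiency(speedup=1.10, cores_before=64, cores_after=96)
print(f"Strong-scaling efficiency from 64 to 96 cores: {efficiency:.0%}")
```

An efficiency well below 100% is what one would expect from a kernel that is limited by memory bandwidth rather than by core count.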

These results highlight the generational improvements of the Graviton4 processor, particularly for workloads that are sensitive to memory bandwidth, making it an excellent choice for seismic imaging and similar applications.

Price-performance analysis

The overall picture for price-performance also looks good, though there are some nuances. For the bar chart in Figure 2, we combine the On-Demand price of each instance with the benchmark throughput in giga-points per second to derive a tera-points-per-dollar (TP/$) metric. This measures how much work gets done per dollar and gives users an estimate of the cost to solution.
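A minimal sketch of how such a metric can be computed follows, assuming the price is quoted per instance-hour: convert throughput to points per hour, divide by the hourly price, and scale to tera-points. The function name and the example inputs are hypothetical and are not the measured results or actual prices behind Figure 2.

```python
def tera_points_per_dollar(gpts_per_s: float, price_per_hour: float) -> float:
    """Throughput per dollar in tera-points per dollar (TP/$).

    gpts_per_s:     benchmark throughput in giga-points per second
    price_per_hour: instance price in dollars per hour
    """
    points_per_hour = gpts_per_s * 1e9 * 3600
    return points_per_hour / price_per_hour / 1e12


# Hypothetical inputs for illustration only.
example = tera_points_per_dollar(gpts_per_s=10.0, price_per_hour=4.0)
print(f"{example:.1f} TP/$")
```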

For the isotropic acoustic and the self-adjoint acoustic TTI, we can see that the Graviton4 delivers the highest throughput per dollar, followed by Graviton3 and Graviton2.

However, the Fletcher-Du-Fowler TTI benchmark presents an exception. In this case, Graviton3 provides the highest throughput per dollar, followed by Graviton4 and then Graviton2. Although the isotropic acoustic and self-adjoint TTI benchmarks are roughly 1.8 times faster on Graviton4 than on Graviton3 (1.81 and 1.80 times, respectively), the Fletcher-Du-Fowler TTI benchmark is only 1.51 times faster. Given that the theoretical increase in memory bandwidth is 1.75 times, this discrepancy warrants a deeper performance profiling analysis to understand why this particular benchmark underperformed relative to the others. We will cover this in a future post.

Figure 2 – bar chart comparing the price performance of r6g.16xlarge, r7g.16xlarge, r8g.16xlarge and r8g.24xlarge instances for 3 seismic imaging benchmarks. For the isotropic and the self-adjoint TTI benchmarks, the best price performance is achieved using the r8g.16xlarge instance. For the Fletcher-Du-Fowler TTI benchmark, the best price performance is achieved with the r7g.16xlarge.
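One way to read the Fletcher-Du-Fowler exception is that a newer instance only improves throughput per dollar when its speedup exceeds its price increase. The sketch below illustrates that mechanism with the Graviton4-over-Graviton3 speedups quoted in this post and a purely hypothetical price ratio of 1.6. That ratio is chosen only to show how the ranking can flip; it is not actual On-Demand pricing.

```python
def relative_value(speedup: float, price_ratio: float) -> float:
    """Throughput per dollar of the newer instance relative to the older one.

    speedup:     newer throughput / older throughput
    price_ratio: newer hourly price / older hourly price
    A result above 1.0 means the newer instance does more work per dollar.
    """
    return speedup / price_ratio


# Hypothetical price ratio for illustration only (not real pricing).
hypothetical_price_ratio = 1.6

# Graviton4 vs Graviton3 speedups quoted in this post.
for name, speedup in [("Isotropic acoustic", 1.81),
                      ("Fletcher-Du-Fowler TTI", 1.51),
                      ("Self-adjoint TTI", 1.80)]:
    value = relative_value(speedup, hypothetical_price_ratio)
    verdict = "newer wins" if value > 1.0 else "older wins"
    print(f"{name}: {value:.2f} ({verdict})")
```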

Instance evaluation

This benchmark focuses on the performance and price performance of different versions of the Graviton processor; therefore, we only compare instances within the R family of Amazon EC2 instances, which is currently the only family that includes a Graviton4 instance type.

When choosing between the r8g.16xlarge and r8g.24xlarge instances, it’s important to consider the specific characteristics of your workload. For workloads where the problem domain is too small to benefit from strong scaling across all available cores, allocating the entire node and running multiple shots per node can provide better value. This approach not only maximizes resource utilization but also helps avoid the potential impact of noisy neighbors in multi-tenant environments. By fully utilizing the r8g.24xlarge instance, which contains all 96 cores of the Graviton4 processor, you can achieve more consistent performance, as the risk of resource contention from other tenants is minimized. This strategy ensures you get the best possible value from the Graviton4 architecture for your HPC tasks.
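Running multiple shots per node usually means giving each shot its own disjoint set of cores. The sketch below shows one possible way to split the 96 cores of an r8g.24xlarge into equal groups and build per-shot OpenMP settings. It assumes cores are numbered 0 to 95 and relies on libgomp's GOMP_CPU_AFFINITY variable for pinning, which is an assumption about the runtime in use. The helper names are hypothetical and the shot launcher itself is left out.

```python
def partition_cores(total_cores: int, shots: int) -> list[tuple[int, int]]:
    """Split cores 0..total_cores-1 into equal, contiguous (first, last) ranges."""
    if shots <= 0 or total_cores % shots != 0:
        raise ValueError("total_cores must be divisible by a positive shot count")
    per_shot = total_cores // shots
    return [(i * per_shot, (i + 1) * per_shot - 1) for i in range(shots)]


def shot_environment(first_core: int, last_core: int) -> dict[str, str]:
    """OpenMP environment for one shot pinned to cores first_core..last_core."""
    return {
        "OMP_NUM_THREADS": str(last_core - first_core + 1),
        # libgomp-specific pinning; other OpenMP runtimes use other variables.
        "GOMP_CPU_AFFINITY": f"{first_core}-{last_core}",
    }


# Example: four concurrent shots on a 96-core r8g.24xlarge.
for first, last in partition_cores(96, 4):
    print(shot_environment(first, last))
```

Each dictionary can be merged into os.environ and passed as the env argument of subprocess.Popen when launching a shot. The best number of concurrent shots depends on the problem size and on memory capacity, so it is worth measuring rather than assuming.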

Conclusion

The AWS Graviton4 processor delivers substantial performance improvements for DevitoPRO’s seismic imaging workloads, particularly for memory-bound kernels. The speedups over previous Graviton generations reflect advances in both core architecture and memory bandwidth. While certain benchmarks highlight areas for potential optimization, the overall price-performance benefits and ease of integration make Graviton4 a compelling choice for demanding HPC workloads.

The collaboration between AWS, Devito Codes, and other industry stakeholders continues to drive innovation in HPC, providing users with powerful tools and resources to tackle the most challenging computational tasks in seismic imaging and beyond. This ongoing collaboration underscores a commitment to delivering high-performance solutions that meet the evolving needs of the energy sector and other industries reliant on HPC.

AWS HPC Blog
