AWS HPC Blog

Recent improvement to Open MPI AllReduce and the impact to application performance

At AWS, our engineers work constantly to improve the performance and scalability of HPC applications to enhance the customer experience in the cloud. Recently, we found a way to improve several HPC applications (Code Saturne, OpenFOAM, Laghos, and Fire Dynamics Simulator) at once by improving one of the collectives in the Open MPI library, MPI_AllReduce. Our testing has shown that this effort allows HPC customers to run applications on AWS using Open MPI without any performance disadvantage. This was true when we benchmarked against commercial MPIs and other open-source libraries like MPICH. And it was especially true for our Arm-based Hpc7g instances.

Earlier this year, we added a node-aware AllGather-based AllReduce method to Open MPI. This improvement was specially designed to improve MPI_AllReduce latency for small messages on EFA. It significantly reduced MPI_AllReduce latency and as a result, improved the scalability of several HPC applications, especially when running on thousands of cores. The new method works well on Arm64 and x86 instances. And we’ve seen significant performance improvements for both microbenchmarks and real customer workloads.

In this post, we’ll first go over the reasons why we need this customized MPI_AllReduce method for EFA and some of the details for this new method. Then we’ll cover the microbenchmark and customer workload results. Finally, we’ll provide details for getting access to the latest improvements to benefit your HPC applications.

Motivation for this work

We’ve been working with a large utility customer in Europe who wanted to migrate some of their workloads (using Code Saturne) to AWS for some of their users, so they could rapidly expand compute capacity without a large capital commitment.

The Arm-based Hpc7g instance (with an AWS Graviton3E processor) is one of their best options due to its excellent price-performance and lower power consumption compared to other HPC instances for the same workloads. In December 2023, we tested the Code Saturne benchmark BENCH_F128_02 (51M cells, similar to a typical workload for them) on Hpc7g with Open MPI 4.1.6.

To our surprise, the workload didn’t scale as well on AWS with EFA as it did on their on-premises cluster (see Figure 1 for the benchmark elapsed time). There was also a performance inversion between 768 cores and 1536 cores. After some application profiling, we found that the scaling bottleneck was MPI_AllReduce latency for small messages in Open MPI.

Figure 1 – The initial results that set us on the journey to find the performance bottleneck. Code Saturne benchmark BENCH_F128_02 elapsed time (lower is better) on Hpc7g with Open MPI 4.1.6 and the on-prem cluster; we were not able to achieve a comparable scaling efficiency, and there is a performance inversion (longer run time with more cores) from 768 cores to 1536 cores.

Diving into collectives

Collective communication is a communication pattern that involves all the processes in a communicator.

In Table 1, we’ve listed common MPI collectives and their functionality. Since MPI collectives involve all processes in the communicator, it’s not hard to see that larger communicators require more messages to be exchanged. The number of messages can grow quadratically with the total number of ranks for naïve implementations of several collectives (AllReduce, AllToAll).
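
To put a rough number on that quadratic growth (our own back-of-the-envelope figure): a naïve AllReduce in which every rank exchanges data directly with every other rank needs on the order of N * (N - 1) messages, which at N = 4096 ranks is 4096 × 4095 ≈ 16.8 million messages for a single call.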

In Open MPI, there are already several algorithms for each collective. For example, Open MPI has six AllReduce methods: basic linear, reduce + bcast, recursive doubling, ring, segmented ring, and Rabenseifner. Unfortunately, the best existing methods have 2-3 times the latency of the best-performing commercial MPI library.

Table 1. Short descriptions of related MPI collectives

First, we took a hierarchical approach. Due to the latency difference between inter-node (12μs) and intra-node (0.5μs) communication, we can often get performance benefits by dividing the communicator into two layers: one for the communication between the nodes (inter-node), and the other one for the communication inside each node (intra-node). This guarantees that inside each sub-communicator, or layer, the distance or communication cost is the same for any two processes.
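
To make this layered split concrete, here’s a minimal sketch using only standard MPI calls (our own illustration of the idea, not the actual code inside Open MPI’s collective component): MPI_Comm_split_type builds the intra-node layer, and a second split over the node leaders builds the inter-node layer.

#include <mpi.h>

/* A minimal sketch of the two-layer (node-aware) communicator split,
 * using only standard MPI calls. The actual Open MPI component manages
 * its own internal communicators. */
void build_hierarchy(MPI_Comm comm,
                     MPI_Comm *intra_node,   /* ranks sharing a node */
                     MPI_Comm *inter_node)   /* one leader per node  */
{
    int world_rank, node_rank;
    MPI_Comm_rank(comm, &world_rank);

    /* Intra-node layer: all ranks that can share memory land in one group. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, intra_node);
    MPI_Comm_rank(*intra_node, &node_rank);

    /* Inter-node layer: only the node leaders (local rank 0) join;
     * every other rank gets MPI_COMM_NULL. */
    int color = (node_rank == 0) ? 0 : MPI_UNDEFINED;
    MPI_Comm_split(comm, color, world_rank, inter_node);
}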

For collectives inside each layer, the latency can usually be modeled using a linear model:

T = C1 · β + C2 · γ

where C1 is the number of phases or steps required to complete the collective communication, β is the cost associated with each send or receive operation, C2 is the total size of the messages, and γ is the rate of transfer (the inverse of the network bandwidth); for a given network, β and γ are constants. For small messages, our focus needed to be on reducing the number of required steps (C1) to complete the collective operation, since C2 is small.
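
As a rough illustration using the latencies quoted above (and ignoring the bandwidth term, which is negligible for 8-byte messages): each inter-node step costs about 12μs while each intra-node step costs about 0.5μs, so trimming just two inter-node steps saves roughly 24μs, whereas two extra intra-node steps add only about 1μs – exactly the trade the hierarchical approach makes.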

Figure 2 shows the schematics of a node-aware AllGather-based AllReduce and how it reduces C1. It consists of 3 communication stages: intra-node Reduce, inter-node AllGather followed by local Reduce, and intra-node Bcast.

Only one stage involves inter-node communication. Note that this design uses an inter-node AllGather followed by a local Reduce, instead of an inter-node Reduce followed by a Bcast. This is possible because there’s no dependency between the send and receive messages for each rank, so they can happen simultaneously to favor latency – a common technique used in MPI to trade network bandwidth for latency.

Figure 2 – The schematics of a node-aware AllGather-based AllReduce.
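
The three stages map naturally onto familiar MPI building blocks. The sketch below is our own simplified illustration of the pattern for a buffer of doubles summed with MPI_SUM, reusing the two communicators from the split above; the actual Open MPI component implements each stage with its own optimized algorithms (the Bruck-k AllGather and k-nomial trees described in the next section) rather than these generic calls.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Simplified illustration of the node-aware AllGather-based AllReduce:
 * (1) intra-node Reduce to the node leader, (2) inter-node AllGather of
 * the per-node partial sums followed by a local reduction, and
 * (3) intra-node Bcast of the final result. */
void node_aware_allreduce(double *buf, int count,
                          MPI_Comm intra_node, MPI_Comm inter_node)
{
    int node_rank, nnodes;
    MPI_Comm_rank(intra_node, &node_rank);

    /* Stage 1: intra-node Reduce to the node leader (local rank 0). */
    double *node_sum = malloc(count * sizeof(double));
    MPI_Reduce(buf, node_sum, count, MPI_DOUBLE, MPI_SUM, 0, intra_node);

    if (node_rank == 0) {
        /* Stage 2: inter-node AllGather of the per-node partial sums,
         * then reduce the gathered contributions locally. */
        MPI_Comm_size(inter_node, &nnodes);
        double *gathered = malloc((size_t)nnodes * count * sizeof(double));
        MPI_Allgather(node_sum, count, MPI_DOUBLE,
                      gathered, count, MPI_DOUBLE, inter_node);

        memset(buf, 0, count * sizeof(double));
        for (int n = 0; n < nnodes; n++)
            for (int i = 0; i < count; i++)
                buf[i] += gathered[(size_t)n * count + i];
        free(gathered);
    }

    /* Stage 3: intra-node Bcast of the final result from the leader. */
    MPI_Bcast(buf, count, MPI_DOUBLE, 0, intra_node);
    free(node_sum);
}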

The Eureka moment

The inter-node AllGather accounts for most of the AllReduce latency, so we added an extended version of Bruck’s algorithm [2] that can handle any fanout k (Bruck-k). In a communicator of size N, this method requires only C1 = log_k(N) steps, with N * 2 in-flight messages at each step (each process sends a message to, and receives a message from, k other processes), to complete the AllGather communication.

This method combines the data scattered on different processes (in earlier stages) so that fewer messages with larger sizes can complete the collective communication pattern. The method strikes a balance between the number of steps required and total number of messages in flight.

For comparison: the recursive doubling method [3] requires C1 = log_2(N) steps to complete the collective communication – hence higher latency; the direct messaging method requires only a single step, but there will be 2 * N * (N-1) messages in flight simultaneously, which can lead to congestion when N is large.
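
Putting rough numbers on this comparison (our own illustration): for N = 128 nodes and a fanout of k = 4, Bruck-k needs log_4(128) ≈ 3.5, i.e. 4 steps; recursive doubling needs log_2(128) = 7 steps; and direct messaging needs a single step but roughly 2 * 128 * 127 ≈ 32,500 simultaneous messages.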

The Bruck-k method improves the latency and can scale to a larger communicator – it also allows the user to reconfigure the fan-out according to the number of communication ports on each node.

For intra-node Reduce/Bcast, we added a k-nomial tree method. This reduces the number of steps required to complete the collective communication and provides an additional tuning opportunity (the fanout) for end users.
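
As an illustration (our own numbers, assuming 64 ranks per node): a binomial (k = 2) tree needs log_2(64) = 6 steps for the intra-node Reduce or Bcast, while a k-nomial tree with k = 8 needs only log_8(64) = 2 steps, at the cost of each sender handling more messages per step.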

Microbenchmark result

We tested the improved Open MPI method on all HPC instance types on AWS and found that it works well on both x86 and Arm64 instances. Figure 3 shows the OSU microbenchmark MPI_AllReduce latency (in μs, lower is better) for 8-byte messages on a typical x86 instance. We also saw similar improvements for small messages of other sizes, not shown here.

Figure 3 – OSU microbenchmark MPI_AllReduce latency (us) for 8-byte messages on one of the x86 instance types.

Figure 4 shows MPI_AllReduce latency with Open MPI on Hpc7g. We’ve added the latency from a commercial MPI on x86 for instance type 1, for reference. For 8192 cores, the AllReduce latency on Hpc7g with the improved Open MPI is nearly half the latency with the commercial MPI on the reference x86 instance.

Figure 4 – OSU microbenchmark MPI_AllReduce latency (us) for 8-byte messages on Hpc7g with Open MPI.

These results all pointed in the right direction, but a real customer application is still the most (and really, only) important test of any work.

Application results

Code Saturne

Figure 5 shows the Code Saturne benchmark BENCH_F128_02 strong scaling result using the improved Open MPI.

With the improvement, we fixed the performance inversion at 1536 cores and significantly reduced the time to solution. We achieved scaling on EFA similar to the on-premises system with InfiniBand – despite the latency difference between the networks.

We’ve put this into a runbook for Code Saturne on Hpc7g, which you can find on the AWS Graviton HPC getting started page.

Figure 5 – Code Saturne benchmark BENCH_F128_02 strong scaling elapsed time (sec) on Hpc7g and reference cluster

OpenFOAM

MPI_AllReduce is well documented to be a scaling bottleneck in OpenFOAM [4]. Figure 6 shows the OpenFOAM DrivAer car benchmark strong scaling result. We measured solver ratings in iterations per minute.

There was an 18% improvement at 2048 cores and a 35% improvement at 4096 cores using the improved Open MPI.

If you want to use this for your own needs, we’ve also put these details on the Graviton HPC getting started page.

Figure 6 – OpenFOAM DrivAer car benchmark strong scaling solver ratings (iterations per minute) on Hpc7g.

We’ve also seen significant performance improvements for Laghos and Fire Dynamics Simulator with this improvement (we didn’t include these results in this blog post, but contact us if you’re interested).

MPI_AllReduce is one of the most widely used MPI collectives [5]. This improvement has the potential to improve the performance of other HPC applications beyond those we discussed here.

How to use this improvement

We’ve already merged the improvement discussed in this post into the Open MPI main branch, which will be included in the upcoming Open MPI v6.0.x release by the end of this year. We’ve created a snapshot with some instructions in case you want to use this code before the official release.

Conclusion

We recently added a node-aware hierarchical AllGather-based AllReduce method to Open MPI that can significantly reduce the AllReduce latency for small messages.

The method also adds tuning options that allow end users to reconfigure the AllReduce method according to the characteristics of their network connection. We tested this, and it works well on the Amazon EC2 instance types designed for HPC. And we saw significant performance improvements for both microbenchmarks and real customer applications (including Code Saturne, OpenFOAM, Laghos, and Fire Dynamics Simulator).

We’re very thankful to the advisors from the Code Saturne community, Lawrence Livermore National Laboratory (LLNL), and the National Institute of Standards and Technology (NIST) for sharing their use cases with us and for their collaboration in this investigation.

References

[1] Ziegler, T., Bindiganavile Mohan, D., Leis, V., & Binnig, C. (2022). EFA: A viable alternative to RDMA over InfiniBand for DBMSs? Data Management on New Hardware. https://doi.org/10.1145/3533737.3538506

[2] Bruck, J., Ho, C.-T., Kipnis, S., Upfal, E., & Weathersby, D. (1997). Efficient algorithms for all-to-all communications in multiport message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 8(11), 1143–1156. https://doi.org/10.1109/71.642949

[3] Thakur, R., & Gropp, W. D. (2003). Improving the performance of collective operations in MPICH. Recent Advances in Parallel Virtual Machine and Message Passing Interface, 257–267. https://doi.org/10.1007/978-3-540-39924-7_38

[4] Culpo, M. (n.d.). Current Bottlenecks in the Scalability of OpenFOAM on Massively Parallel Clusters. https://prace-ri.eu/wp-content/uploads/Current_Bottlenecks_in_the_Scalability_of_OpenFOAM_on_Massively_Parallel_Clusters.pdf

[5] Chunduri, S., Parker, S., Balaji, P., Harms, K., & Kumaran, K. (2018). Characterization of MPI usage on a production supercomputer. SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. https://doi.org/10.1109/sc.2018.00033

Jun Tang

Jun Tang works as a Software Development Engineer on the Annapurna Labs software team. Jun’s responsibilities at AWS include optimizing and benchmarking open source HPC software on Graviton-based instances, and diagnosing and addressing customer issues. Jun has over 10 years of experience in software development for seismic imaging and medical imaging. He has a Master’s degree in Electrical and Computer Engineering from Rice University.

Luke Robison

Luke Robison is a Software Development Engineer in Annapurna Labs. He strives to make Graviton silicon the most performant and cost-effective choice for HPC workloads both in and out of the cloud. He joined AWS in 2022 after 14 years of work in Array Signal Processing.

Matt Koop

Matt is a Principal Engineer for the high-performance computing team at AWS. He draws on a broad set of experience in large-scale computing from both the commercial and public sectors to develop solutions for AWS customers. Matt holds a Ph.D. in computer science and engineering from Ohio State University.

Wenduo Wang

Wenduo Wang is a Software Development Engineer in Annapurna Labs. His work focuses on enabling high-performance cloud applications using MPI and Elastic Fabric Adapter. He has over 5 years of experience in cloud service development, including AWS Lambda.