EFA: how fixing one thing, led to an improvement for … everyone

We routinely test a lot of HPC applications to make sure they work well on AWS, because we think that’s the only benchmark that matters. Hardware improvements from generation to generation are always important, but software is more frequently the place we find gains for customers.

Last summer, our performance teams alerted us to a discrepancy in runtimes when a single application ran with several different MPIs, while on the same hardware. At a high level, codes running with Open MPI were coming in a little slower than when using other commercial or open-source MPIs. This concerned us, because Open MPI is the most popular MPI option for customers using AWS Graviton (our home-grown Arm64 CPU) and we don’t want customers working on that processor to be disadvantaged.

In today’s post, we’ll explain how deep our engineers had to go to solve this puzzle (spoiler: it didn’t end with Open MPI). We’ll show you the outcomes from that work with some benchmark results including a real workload.

Along the way, we’ll give you a peek at how our engineers work in the open-source community to improve performance for everyone, and you’ll get a sense of the valuable contributions from the people in those communities who support these MPIs.

And – of course – we’ll show you how to get your hands on this so your code just runs faster – on every CPU (not just Graviton).

First, some architecture

If you’ve been paying close attention to the posts on this channel, you’ll know that the Elastic Fabric Adapter (EFA) is key to much of the performance you see in our posts. EFA works underneath your MPI and is often (we hope) just plain invisible to you and your application. EFA’s responsibility is to move messages securely – and rapidly – across our fabric between the instances that form parts of your cluster job. Doing this in a cloud environment adds an element or two to the task, because the cloud is both always under construction, and truly massive in scale. Scale isn’t the enemy however, and we’ve exploited that fact to pull off great performance for HPC and ML codes with an interconnect built on the existing AWS network from existing technologies like Ethernet.

EFA remains largely invisible, because we chose to expose it through libfabric, which is also known as Open Fabrics Interfaces (OFI). This is a communications API for lots of parallel and distributed computing applications. By design, it’s lower-level than, say, MPI, and so allows lots of hardware makers (like AWS) to abstract their networking fabrics, so application programmers don’t need to pay attention to too many details of how they’re built.

Figure 1 – Open Fabrics Interfaces (OFI), or libfabric is the abstraction we’ve chosen to expose EFA to MPIs and NCCL. This allows application programmers to be productive without needing to know too much about how the fabric underneath does its work.

As Figure 1 shows, there are several components and APIs in libfabric that MPI interacts with. While it’s impossible to explain them all in this post, let’s take a minute to work through a simple API flow for MPI that sends and receives a message between two ranks – using libfabric.

An aside: a simple data flow example

There are 3 groups of libfabric APIs involved: a data sending API, a data receiving API, and a completion polling API. We’ve illustrated their connections in Figure 2.

Figure 2 – API flow for MPI to send and receive a message between two ranks using libfabric.

On the sender side, MPI initiates a message sending via the fi_tsend function (or its variants). This allows MPI to post a sending request that includes the address, length, and tag of the sending buffer. On the receiver side, MPI posts a receiving request via the fi_trecv call. This includes the address, length, and tag of the receiving buffer. The tags are used for the receiver to filter what messages should be received from one or more senders.

Both sending and receiving APIs return asynchronously, which means the sending and receiving operations don’t necessarily complete when the calls return. Except for some special cases, MPI needs to poll a completion queue (CQ) to get the completion events for sending and receiving operations, through fi_cq_read.

Hopefully this gives you a picture of what goes on under the hood when MPI ranks want to communicate with each other. Anyhow, back to our main story now …

Intra-node comms

Digging deeper, the other MPIs seemed to have better intra-node performance than Open MPI. This might sound a little surprising – you could be forgiven for presuming that intra-node performance would be … more of a solved problem. It’s literally cores sending messages to other cores inside the same compute node.

Open MPI is designed to use a layer called the matching transport layer (MTL) whenever libfabric is used to manage the two-sided tagged message sending and receiving operations (there is another layer called the byte transfer layer that uses libfabric for one-sided operations, like write and read, but we’re not going to discuss it in this post). This layer offloads all communications, whether they’re intra-node or inter-node, to the libfabric layer. This is different from how the others do it, because they use their own shared-memory implementations for intra-node communications, by default.

Other MPIs allow you to define which underlying libfabric provider they should use for shared memory operations (usually by tweaking some environment variables). Open MPI also has its own shared-memory implementation in its BTL (Byte Transfer Layer) component (BTL/vader in Open MPI 4, and BTL/sm in Open MPI 5). This allowed us to explore several pathways to narrow down the specific portions of code where trouble was occurring.

We noticed that when running on a single instance, Open MPI with BTL/SM had similar performance to the others when they were using their native routines. As we noted above, due to low-level implementation details in Open MPI, when libfabric is being used, BTL/SM cannot be used.

This is where some subtle effects can have quite an impact on outcomes. When Open MPI sends an intra-node message through the libfabric EFA provider, the EFA provider actually hands over the message to the libfabric SHM provider, which is yet another provider in libfabric – that supports pure intra-node communications.

When we put this to the test with OSU’s benchmarking suite and some instrumented code, the high level performance gap we noticed between Open MPI with libfabric EFA and the other MPI’s native implementations could be decomposed into – mainly – two parts:

The gap between libfabric’s SHM provider itself and the others’ shared memory implementation.
The gap between libfabric with (EFA + SHM) and libfabric SHM on its own.

For a sense of scale, for small message sizes under 64 bytes (using a c5n.18xlarge instance) these gaps were adding around 0.4-0.5 µsec extra ping-pong latency. Given these are messages travelling inside a machine and never making it to a PCI bus (let alone a network cable), this was a big penalty.

Community efforts

To address these two gaps, we needed to pursue directions of effort, in parallel.

Addressing the first gap involving the libfabric SHM provider performance became an effort with the help of the libfabric maintainers to optimize the locking mechanisms in the queues, and some other details – which we’ll dive into in a future post.

For the second gap involving the handoff between the EFA provider and the SHM provider, the AWS team worked to onboard the EFA provider using a newly-developed libfabric interface called the Peer API so we could use the SHM provider as a peer provider. Peer providers are a way for independently developed providers to be used together in a tight fashion, which helps everyone (a) avoid layering overhead; and (b) resist the urge to duplicate each other’s’ provider functionality.

Figure 3 – The base case before our improvement work began, characterized by a long pathway through all the layers of the stack, even though intra-node traffic should be able to short circuit much of it.

We’ve illustrated the contrasting pathways that small messages traverse when using the Peer API. Figures 3 and 4 depict these two paradigms – before and after we introduced peering through this new API. We’ll summarize it from two angles: the data movement and the completion-events population.

Figure 4 – After we introduced the Peer API, which effectively short-circuits the pathways an intra-node message needs to travel.

Data movement

Before using the Peer API, when MPI sends a message using the libfabric EFA provider, it cside, MPI posts a receiving requestalls fi_tsend()(or its variants) through the EFA endpoint. The EFA provider performs protocol selection logic and adds EFA headers to the message body before posting the message again though the SHM endpoint. After getting the send request from EFA, SHM does its own protocol selection and message packaging again, before copying the message into a ‘bounce’ buffer, which is shared in both the sender’s and receiver’s memory spaces. Over on the receiver’s side, SHM will copy the received message to a bounce buffer in the EFA provider, before delivering the message to the application’s buffer via another memory copy. That’s a lot of memory copies.

After using the Peer API, on the sender’s side, EFA will check to see if SHM is enabled, and if the destination is on the same instance. If that’s true, EFA will immediately forward the fi_tsend() request from MPI to SHM. On the receiver side, SHM copies the received message directly to the application buffer without the extra forwarding through the bounce buffer in the EFA layer.

Completion events population

The improvement on the completion events population is almost as dramatic. MPI gets a notification for the send/receive completion events by polling the completion queues.

Before using the Peer API, the completion events generated by SHM when the actual send/receive operation completed needed another forwarding inside the EFA provider before populating to the completion queues that the MPI polls from.

After using the Peer API, EFA and SHM just share the same completion queue allocated by MPI, and SHM can write the completion events directly to the queue.

Performance improvements

Let’s show you the performance improvements we saw when we ran both micro-benchmarks and real applications – the ultimate test.

For micro benchmarks, we use the OSU Latency benchmark to measure the latency of point-to-point comms for two ranks, and OSU’s MPI_Alltoallw test to measure the latency for collective comms.

For a real-world application, we ran OpenFOAM with a 4M cell motorbike model.

In our testing, we used Open MPI 4.1.5 with two different versions of libfabric: 1.18.1 (before the Peer API), and 1.19.0 (after implementing Peer API).

OSU latency

This is where the improvements from our combined efforts are most visible – by measuring the point-to-point (two-rank) MPI send and receive ping-pong latency.

Figure 5 shows the OSU latency benchmark when running Open MPI with libfabric 1.18.1 (before using Peer API), and again with the newer libfabric 1.19.0 (after implementing the Peer API). We ran all these tests on two ranks on two different instance types: An hpc6a.48xlarge built around an x86 architecture, and the hpc7g.32xlarge, which is built using the AWS Graviton 3E – an Arm64 architecture.

Figure 5 – Peer to peer intra-node MPI ping-pong latency measurements, before and after we implemented the Peering API in our libfabric providers. Latency approximately halved, which is a dramatic improvement.

The latency we measured approximately halved – a dramatic result, but in line with what we should expect when we removed all the unnecessary memory copies and method calls by using the Peer API, and the improvements in SHM itself.

The work continues in this area: the gap between libfabric 1.19.0 and the other private implementations will likely be reduced further – our teams are working together with the libfabric community to optimize the SHM provider further, with a goal of making it comparable to MPICH. This is a space to watch.

OSU all-to-all communication

To see the impact on MPI collective operations, we benchmarked MPI_Alltoallw – well known for stressing MPI communications.

Figure 6 – The impact on collective operations. Latency dropped by between a third (in the case of the Hpc6a nodes) and a half (for Hpc7g).

Figure 6 shows the results again for running Open MPI with libfabric 1.18.1 (before Peer API), libfabric 1.19.0 (after Peer API). This time we used 64 ranks and conducted the tests as before: using Hpc6a and Hpc7g instances.

The impact was significant. On Hpc6a, the latency fell generally by about a third. For Hpc7g, it fell by half.

OpenFOAM motorbike 4M case

The real test of our work is always an actual application that a customer runs: micro-benchmarks can only tell you so much. We ran the same OpenFOAM motorBike 4M cell case on 4 x hpc6a.48xlarge (96 ranks per node) and 6 x hpc7g.16xlarge (64 ranks per node) – 386 total ranks in both cases. And in this case, we only tested with Open MPI, but with both our newer and older libfabric versions.

We measured runs per day to focus on what a customer will care about – the pace at which they can results from different CFD models. The runs showed around a 10% improvement from libfabric 1.18.1 to libfabric 1.19.0 across both instance types. We’ve graphed this in Figure 7.

Figure 7 – Runs per day for our OpenFOAM workload across two instance types, both before and after Peer API was used. Both show around 10% improvement from the new methods.

How to get this new code for your workloads

The improvements we’ve shown here are already in the main branch of the OFIWG/libfabric GitHub repo and they’re part of libfabric 1.19, which was released at the end of August. We ingested that into EFA installer 1.27.0, which also shipped recently. You can always check our documentation to find out how to get the latest version.

Conclusion

The enhancements we’ve described in this post should make everyone’s lives easier. In most cases, the intra-node latency between ranks using Open MPI will drop by around a half – without any code changes in the end user application. In our real-world test using a CFD benchmark, we saw a 10% performance gain.

We think there are two significant takeaways from this for the HPC community.

The first is that our engineering teams are always looking for ways to boost performance. Sometimes that happens in hardware, very often firmware, and – like today’s post – software, too. When we find those ways, we’ll roll them out as quick as we can, and they’ll form part of the pattern that cloud HPC users have come to expect: things just get faster (and easier) over time, without compromising security.

The second is that those same engineers work collaboratively with their counterparts at partners (and competitors, too) to make things better, and faster – for everyone. This collaborative angle, underscores a truth that innovation thrives in shared effort rather than isolated silos. We’re thankful to have been able to work with so many dedicated engineers from around the world to bring these results to you today.

Reach out to us at ask-hpc@amazon.com if you have any questions – or if you want to find a way to contribute, too.

AWS HPC Blog