In the search for performance, there’s more than one way to build a network
This post was contributed by Brendan Bouffler, Head of HPC Developer Relations in HPC Engineering at AWS.
When we started designing our Elastic Fabric Adapter (EFA) several years ago, I was sceptical about its ability to support customers who run all the difficult-to-scale, “latency-sensitive” codes – like weather simulations and molecular dynamics.
Partly, this was due to these codes emerging from traditional on-premises HPC environments with purpose-built machine architectures. These machine designs are driven by every curve ball and every twist in some complicated algorithms. This is “form meets function”, though “hand meets glove” is maybe more evocative. It’s common in HPC to design machines to fit a code, and also common to find that over time the code comes to fit the machine. My other hesitation came from the HPC hardware community (of which I’m a card-carrying member) getting very focused in recent years on the impact of interconnect latency on application performance. Over time, we came to assume it was a gating factor.
But boy was I wrong. It turns out: EFA rocks at these codes. Ever since we launched EFA we’ve been learning new lessons about whole classes of codes that are more throughput-constrained than latency-limited. The main lesson: single-packet latency, as measured by micro-benchmarks, is a distraction when you’re trying to predict code performance on an HPC cluster.
Today, I want to walk you through some of the decisions we made and give you an insight into how we often solve problems in a different way than others do.
Latency isn’t as important as we thought
Latency isn’t irrelevant, though. It’s just that so many of us in the community overlooked the real goal, which is for MPI ranks on different machines to exchange chunks of data quickly. Single-packet latency measures are only a good proxy for that when your network is perfect, lossless, and uncongested. Real applications send large chunks of data to each other. Real networks are busy. What governs “quickly” is whether the hundreds or thousands of packets that make up a data exchange arrive intact. They also must arrive soon enough (for the next timestep in a simulation, say) that they’re not holding up the other ranks.
On any real network fabric (ours included) packets get lost, or blocked by transient hot spots and congestion. This isn’t something that happens once a day. It’s happening all the time. That leads to transmit/receive pairs spending numerous cycles (hundreds, sometimes thousands, of microseconds) figuring out that a packet must be sent again. And in that time, a CPU could be executing a quarter of a million instructions.
Most fabrics (like InfiniBand) and protocols (like TCP) send packets in order, like a conga line (cue the Cha Cha music your parents played in the 70s). That’s a design choice those transports made (back in the day), making it the network’s problem to re-assemble messages into contiguous blocks of data. It means a single lost packet messes up the on-time arrival of every packet behind it in the queue (an effect called “head-of-line blocking”). On the upside, it saved the need for a fancy PCI card (or worse, an expensive CPU) being involved to reassemble the data and chaperone all the stragglers.
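The head-of-line blocking effect is easy to see in a toy model. The sketch below (hypothetical numbers, not measurements from any real fabric) compares the completion time of one message under strictly in-order delivery, where a lost packet stalls everything queued behind it, against out-of-order delivery, where only the lost packet itself pays the retransmit penalty.

```python
# Toy model of head-of-line blocking. All numbers are hypothetical,
# chosen only to make the effect visible -- not real measurements.
PACKETS = 100   # packets making up one message
HOP_US = 5      # per-packet delivery latency (microseconds)
RETX_US = 500   # time to detect a loss and retransmit
LOST = 0        # packet 0 is lost on its first transmission

def in_order_completion():
    # In-order transport: every packet behind the lost one
    # waits for the retransmit before it can be delivered.
    t, done = 0, []
    for seq in range(PACKETS):
        t += HOP_US
        if seq == LOST:
            t += RETX_US   # everyone behind seq 0 stalls here
        done.append(t)
    return max(done)

def out_of_order_completion():
    # Relaxed ordering: only the lost packet pays the retransmit
    # penalty; the rest arrive on schedule and a higher layer reassembles.
    done = [(seq + 1) * HOP_US for seq in range(PACKETS)]
    done[LOST] += RETX_US
    return max(done)

print(in_order_completion(), "vs", out_of_order_completion())
```

In this toy model the in-order message completes in 1000 µs while the relaxed-ordering message completes in 505 µs: one early loss roughly doubles the in-order completion time, which is exactly the behavior that makes single-packet recovery latency so critical for in-order fabrics.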
You can see why single-packet latency matters for these fabrics – it’s literally going to make all the difference to how fast they can recover from a lost packet and maintain throughput.
The measure we all should pay more attention to is the p99 tail latency: the latency under which 99% of all packets arrive. It captures far more of the “real network” effects, being the net result of all those lost packets, retransmits, and congestion, and it’s the best predictor of overall MPI application performance. That’s mainly because of collective operations (like MPI_Barrier or MPI_Allreduce) that hold processing up until all ranks are synchronized before moving to a crucial next step. It’s why you often hear experienced HPC engineers say that MPI codes are only as fast as the slowest rank.
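To make the “slowest rank” point concrete, here’s a small sketch (synthetic latency numbers, purely illustrative) that computes a p50 and p99 from a sample of packet latencies with a congested tail, then shows that a barrier-style collective completes only when the last participant arrives:

```python
import random

random.seed(42)

# Synthetic per-packet latencies (microseconds): mostly fast,
# with a 1% congested tail. Illustrative only, not measured data.
latencies = [random.gauss(10, 1) for _ in range(990)] + \
            [random.uniform(100, 1000) for _ in range(10)]

def percentile(samples, p):
    s = sorted(samples)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)

# A barrier/allreduce across many ranks completes when the LAST
# rank arrives, so the collective's cost tracks the tail, not the median.
rank_arrival = [random.choice(latencies) for _ in range(512)]
barrier_cost = max(rank_arrival)

print(f"p50={p50:.1f}us p99={p99:.1f}us barrier={barrier_cost:.1f}us")
```

With 512 ranks each drawing from that distribution, it’s almost certain that at least one rank lands in the slow tail, so the barrier’s cost looks like the p99, not the p50. That’s the statistical mechanism behind “only as fast as the slowest rank”.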
This wasn’t news to protocol designers. It’s just that, in the early 2000s, we didn’t have the low-cost, high-performance silicon options we had later, in 2018 or 2021. We also didn’t have AWS.
This last point is significant, because operating infrastructure at the scale and pace of growth we do forces some different design choices. Firstly, AWS is always on. Even a national supercomputer facility goes offline for maintenance every now and again: usually it’s a window for vital repairs, file system upgrades, or to refactor the network fabric to make room for expansion. We don’t have that luxury, because customers rely on us 24×7. And because we’re so reliable, agile, and scalable, we’re experiencing a pace of growth that means literal truckloads of new servers arrive at our data centers every day. HPC customers kept telling us that they love the reliability, agility, and scalability of AWS, too, so just building a special Region full of HPC equipment wasn’t going to get us off the hook.
The lessons we drew from this were that we can’t do islands of specially connected CPUs, because they’d quickly be surrounded by oceans of everything else. And that everything else stuff is part of what makes the cloud magical. Why restrict HPC people to a small subset of the cloud? Couldn’t we make that everything else part of the solution?
This forced us to look at things differently and had us dwelling on the central problem statement: HPC applications just want to move large chunks of data from one machine to another quickly and reliably, and don’t (or shouldn’t) care how it’s done. As a specific design goal, that meant an MPI programmer shouldn’t have to change a single line of code to run on AWS and they should get at least as good a result. MPI should just work.
Introducing the Scalable Reliable Datagram (SRD)
It’s well documented now that we built our own reliable datagram protocol, which we call the Scalable Reliable Datagram (SRD). Much of it was inspired by what we saw working elsewhere: InfiniBand, Ethernet, and others. But, because of all those other pressures, we took a deliberately different path.
Firstly, SRD is an Ethernet-based transport. We have a massive investment in Ethernet, which gives us such depth and breadth of control over outcomes that we don’t want to give it up (not easily, anyhow).
Second, and maybe most significantly, we relaxed the requirement for in-order packet delivery in the belief that if it’s necessary we can re-assert it in the higher layers of the stack. The p99 tail latency plummeted (by around a factor of 10). Why? Without the conga line model, SRD can push all the packets making up a block of data all at once, over all the possible pathways in our fabric (in practice, for memory reasons, we choose 64 paths at a time from the hundreds or even thousands available). This means that we don’t suffer from head of line blocking, where a single-packet loss at the front of the transmit queue causes everyone else to stall while the packet is recovered. When we measure the throughput of a transmit/receive pair, it means we recover from packet loss scenarios a whole lot faster. That means we can keep our throughput high, with the aim of saturating the bandwidth of the receiver.
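A rough sketch of the multipath idea follows (a simplified model: the round-robin path assignment and per-path latencies are illustrative stand-ins, not SRD’s actual algorithm). Packets of one message are sprayed across many paths, arrive in whatever order the paths deliver them, and are reassembled by sequence number at the receiving side:

```python
import random

random.seed(7)

NUM_PATHS = 64                 # SRD-style: a subset of the available paths
MESSAGE = list(range(1000))    # packet sequence numbers for one message

# Assign each packet to a path round-robin. (Real multipath spraying
# would be smarter than this; round-robin is just for illustration.)
paths = {p: [] for p in range(NUM_PATHS)}
for seq in MESSAGE:
    paths[seq % NUM_PATHS].append(seq)

# Each path has its own latency, so packets arrive out of order.
# One slow path only delays its 1/64th share of the message.
arrivals = []
for p, seqs in paths.items():
    path_latency = random.uniform(5, 50)
    for i, seq in enumerate(seqs):
        arrivals.append((path_latency + i, seq))

arrivals.sort()                          # order seen at the receiver
received = [seq for _, seq in arrivals]  # scrambled sequence numbers

# A higher layer reassembles the message by sequence number,
# so the application never sees the disorder.
reassembled = sorted(received)
assert reassembled == MESSAGE
```

The key property the sketch shows: the receiver gets every packet (reassembly succeeds) even though arrival order is scrambled, and a congestion hot spot on one path affects only the small fraction of packets routed through it, rather than stalling the whole message.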
This radically changed the performance of HPC applications for customers. In our first version of EFA at launch, we were seeing application performance rivaling custom-built clusters. And we were getting calls from customers using EFA for things we hadn’t thought about.
Relaxed packet ordering wasn’t the only departure from tradition we took, but it’s emblematic of our approach. It opened up a number of new avenues for us to solve further problems differently, like congestion management (to get more insight into that, you can check out this discussion with Brian Barrett in our HPC Tech Shorts channel).
We have good reason to believe this will keep scaling, because as a customer’s code scales to consume more nodes on the network, the perimeter of that network gets bigger, and SRD consequently has more paths to choose from. We think this is a durable solution because as we improve our algorithms (for things like congestion management) and deploy new firmware to AWS Nitro boards, customers should see better performance automatically.
Further up the stack, customers get all this for free (in terms of code complexity), because we pack it all into our libfabric provider. This means software at higher layers of the stack, such as Open MPI, Intel MPI, and MVAPICH, “just works”. And HPC workloads like weather simulation and molecular dynamics – which live on top of MPI – “just work”, too.
These last two are especially pleasing, because they typified the workloads customers challenged us to excel at, in order to believe that AWS could crack this problem.
Last Thoughts (for now)
AWS has a long record of doing things backwards. We’re not fans of starting with a technology and building up to a solution. We worked backwards from an essential problem (MPI ranks need to exchange lots of data quickly) and found a different solution for our unique circumstances, without trading off the things customers love the most about cloud: that you can run virtually any application, at scale, and right away.
We were reminded that in the quest for performance, there’s more than one way to solve a problem. It also taught us to be a little more latency-tolerant, too.
You can get started on EFA by trying out one of our self-paced workshops using AWS ParallelCluster, or by watching our HPC Tech Short about our “No Tears” experience. We’d love to know what you think. If you just want to know more about SRD, have a read of our paper in IEEE Micro.
L. Shalev, H. Ayoub, N. Bshara and E. Sabbag, “A Cloud-Optimized Transport Protocol for Elastic and Scalable HPC,” in IEEE Micro, vol. 40, no. 6, pp. 67-73, 1 Nov.-Dec. 2020, doi: 10.1109/MM.2020.3016891.