EFA is now mainstream, and that’s a Good Thing

If you’ve been following the story of our Elastic Fabric Adapter (EFA), you’ll recall that we’ve written and spoken frequently about networks and their role in performance. We noticed that network-constrained applications in a cluster were trying hard to swap non-trivial volumes of data quickly, but concluded that packet latency was only one of the factors that mattered. Recognizing that liberated us to think very differently about how we approached a solution.

When we solved it, we’d created a new approach to moving vast volumes of traffic around, one that made use of the fact that our network is large and complex, rather than treating that as a barrier. We invented a transport that takes a “swarm delivery” approach, sending packets over many pathways simultaneously, which limits the impact of any single packet going astray or any single path getting congested. In other transports and fabrics, a single packet going astray can stall the whole stream of packets behind it. EFA simply designed around that problem, chiefly by allowing out-of-order (but still reliable) delivery.
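To make the idea concrete, here’s a toy sketch of multi-path spraying with out-of-order reassembly. This is purely illustrative, not EFA’s actual protocol: the function names are made up, and real paths would reorder packets for physical reasons rather than by a seeded shuffle. The point is that the receiver never blocks waiting for an earlier sequence number.

```python
import random

def send_over_paths(packets, num_paths=4, seed=42):
    """Toy model: spray packets across several paths. Each path delivers
    in order, but interleaving across paths scrambles the global order."""
    rng = random.Random(seed)
    paths = [[] for _ in range(num_paths)]
    for seq, payload in enumerate(packets):
        paths[rng.randrange(num_paths)].append((seq, payload))
    arrivals = []
    # Interleave deliveries: pop from a random non-empty path each step.
    while any(paths):
        path = rng.choice([p for p in paths if p])
        arrivals.append(path.pop(0))
    return arrivals

def reassemble(arrivals):
    """Receiver accepts packets in any order and restores the sequence,
    so one slow packet never stalls the ones behind it."""
    buffer = dict(arrivals)  # seq -> payload; no head-of-line blocking
    return [buffer[seq] for seq in sorted(buffer)]

msgs = [f"chunk-{i}" for i in range(8)]
print(reassemble(send_over_paths(msgs)) == msgs)  # True
```

However scrambled the arrival order is, reassembly recovers the original stream, which is the property that lets a multi-path transport shrug off a congested path.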

Another aspect of EFA is that it uses the same infrastructure our overall fabric is built on. We’re not forced to build an HPC precinct in the corner of the cloud, using different networking from the rest, forever an island. At the rate at which our data centers expand and fill up with new capacity, islands can become remote from one another, quickly. And we wanted to make sure complex pipelines of work, where each step might have different needs, didn’t get homogenized into a single architecture or instance type.

EFA liberated us, and in doing so set us up for a week very much like this week.

EFA has been busy

Last week, we announced the general availability of Amazon EC2 C6i instances. These join the 6th generation of x86-based compute options in the EC2 instance portfolio and are powered by 3rd generation Intel Xeon Scalable processors (code-named ‘Ice Lake’) with an all-core turbo frequency of 3.5 GHz. They come with network interfaces running at 50 Gb/s. And they’re EFA-enabled.

We also launched the Amazon EC2 DL1 instances. These are powered by Gaudi accelerators from Habana Labs (an Intel company). They also have EFA.

And last month we launched Amazon EC2 M6i – almost identical in architecture to the C6i family, but with twice as much memory per core. And with, you guessed it: EFA again.

EFA is now offered by sixteen different instance families:

  • General Purpose: m5dn, m5n, m5zn, m5zn.metal, and m6i
  • Compute Optimized: c5n, c5n.metal, c6gn and c6i
  • Memory Optimized: r5dn, r5dn.metal, r5n, and r5n.metal
  • Storage Optimized: i3en, i3en.metal
  • Accelerated Computing: dl1, g4dn.metal, inf1.24xlarge, p3dn, p4d

In case you haven’t seen some of these suffixes before: a ‘z’ denotes a high-frequency CPU, ‘n’ means very high-bandwidth networking, and ‘d’ means there’s some super-fast direct-attached storage in the instance itself. The ‘g’ in c6gn means the processor is an AWS Graviton2, our Arm-based CPU, which is turning out to be extremely performant for HPC workloads like CFD and weather simulation.

Why is this important?

This matters for two big reasons.

First, different applications have different needs. We’ve always believed that one of the cloud’s benefits to HPC customers is our ability to give each code an optimized environment to run in. If your code can run 80% faster with twice the memory, you can step up from C5 (4 GB/core) to M5 (8 GB/core) or even R5 (16 GB/core).
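That stepping logic is simple enough to sketch. The figures below are just the memory-per-core numbers quoted above, and the helper function is hypothetical, not an AWS API; it only shows the shape of the decision.

```python
# Memory per core (GB) for the families mentioned above.
# Illustrative only; check current instance specs before choosing.
MEM_PER_CORE_GB = {"C5": 4, "M5": 8, "R5": 16}

def pick_family(required_gb_per_core):
    """Return the smallest family whose memory-per-core meets the need."""
    for family, gb in sorted(MEM_PER_CORE_GB.items(), key=lambda kv: kv[1]):
        if gb >= required_gb_per_core:
            return family
    return None  # requirement exceeds every listed family

print(pick_family(6))   # M5
print(pick_family(12))  # R5
```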

Second, you shouldn’t have to guess this stuff. If you’re using a traditional HPC environment on-premises, you’re no doubt used to committing a lot of money up front to buy specific hardware that must serve you for the next two, three, or even five years. But software changes much faster than that. Methods, too. You should be able to experiment and settle on what works for now, knowing that you can move later if things change.

HPC goes mainstream

We’re just getting started. The AWS Nitro System gives us the flexibility to keep building new instance types, with really different characteristics, be it networking, specialist storage skills or compute acceleration. We’re confident that with this diversity, you’ll find a home for your code that gives you the scale and immediacy you need for your research. And by leveraging many of the instances built for more mainstream computing needs, the economies of scale will stack up to help us keep your costs low.

Our colleagues in the media and entertainment group, AWS Elemental, used EFA for something we never imagined, and are now moving masses of uncompressed video through the production pipeline in TV studios. This is a strong pattern in HPC: amazing breakthroughs lead to novel technologies which become mainstream and solve problems for everyone.

If you want to get started with EFA for your HPC workload, you don’t need to go any further than one of our HPC workshops, using AWS ParallelCluster, which assembles all the things you need in one place. It’ll provide you with a complete environment ready for your application to run in. You can experiment with all of these instance types right out of the box. Create a Slurm partition for each instance family and see which one runs your code best.
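As a rough sketch of what that looks like, here’s a ParallelCluster 3-style configuration fragment with one Slurm queue per instance family and EFA enabled on each. The subnet ID and counts are placeholders, and the exact schema may vary with your ParallelCluster version, so treat this as a starting point rather than a drop-in file.

```yaml
# Sketch of a ParallelCluster 3 config: one Slurm queue per family.
# Placeholder values throughout -- adapt to your account and region.
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: c6i
      ComputeResources:
        - Name: c6i-nodes
          InstanceType: c6i.32xlarge
          MinCount: 0
          MaxCount: 16
          Efa:
            Enabled: true
      Networking:
        SubnetId: subnet-0123456789abcdef0   # placeholder
        PlacementGroup:
          Enabled: true
    - Name: c6gn
      ComputeResources:
        - Name: c6gn-nodes
          InstanceType: c6gn.16xlarge
          MinCount: 0
          MaxCount: 16
          Efa:
            Enabled: true
      Networking:
        SubnetId: subnet-0123456789abcdef0   # placeholder
        PlacementGroup:
          Enabled: true
```

With queues set up like this, trying your code on a different family is just a matter of submitting to a different partition, for example `sbatch --partition=c6gn job.sh`.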

Brendan Bouffler

Brendan Bouffler is the head of Developer Relations for HPC Engineering at AWS. He’s been responsible for designing and building hundreds of HPC systems in all kinds of environments, and he joined AWS when it became clear to him that cloud would become the exceptional tool the global research and engineering community needed to bring about the discoveries that would change the world for us all. He holds a degree in physics and an interest in testing several of its laws as they apply to bicycles. This has frequently resulted in hospitalization.