Elastic Fabric Adapter (EFA) is a network interface for Amazon EC2 instances that enables customers to run applications requiring high levels of inter-node communications at scale on AWS. Its custom-built operating system (OS) bypass hardware interface enhances the performance of inter-instance communications, which is critical to scaling these applications. With EFA, High Performance Computing (HPC) applications using the Message Passing Interface (MPI) and Machine Learning (ML) applications using NVIDIA Collective Communications Library (NCCL) can scale to thousands of CPUs or GPUs. As a result, you get the application performance of on-premises HPC clusters with the on-demand elasticity and flexibility of the AWS cloud.
EFA is available as an optional EC2 networking feature that you can enable on any supported EC2 instance at no additional cost. Plus, it works with the most commonly used interfaces, APIs, and libraries for inter-node communications, so you can migrate your HPC applications to AWS with little or no modification.
EFA’s unique OS bypass networking mechanism provides a low-latency, low-jitter channel for inter-instance communications. This enables your tightly coupled HPC or distributed machine learning applications to scale to thousands of cores, making your applications run faster.
You can enable EFA support on a growing list of EC2 instances and get the flexibility to choose the right compute configuration for your workload. Simply change your cluster configuration as your needs change and enable EFA support on your new compute instances. No prior reservation or upfront planning is needed.
EFA exposes the libfabric interface and its APIs for communication. Because almost all HPC programming models support libfabric, you can migrate your existing HPC applications to the cloud with little to no modification.
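As a quick sanity check on an EFA-enabled instance, you can ask libfabric's fi_info utility (installed with the EFA software stack) whether the EFA provider is registered. This is an illustrative sketch, not an official procedure; on machines without libfabric or EFA the commands simply report that fact.

```shell
# Probe for the libfabric EFA provider; degrade gracefully where it is absent.
if command -v fi_info >/dev/null 2>&1; then
    # List the endpoints exposed by the EFA provider, if any.
    fi_info -p efa || echo "libfabric present, but no EFA provider found"
else
    echo "libfabric utilities not installed"
fi
```

If the EFA provider is available, fi_info prints its fabric and endpoint attributes, which MPI implementations built against libfabric can then use automatically.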
EFA provides a 4X improvement in scaling over ENA (Elastic Network Adapter) for a standard CFD simulation, as shown in the chart above. The solver used for this benchmark was provided by Metacomp Technologies.
How it works
Computational Fluid Dynamics
Advances in Computational Fluid Dynamics (CFD) algorithms enable engineers to simulate increasingly complex flow phenomena, and HPC helps reduce turnaround times. With EFA, design engineers can now scale out their simulation jobs to experiment with more tunable parameters, leading to faster, more accurate results.
Weather Modeling
Complex weather models require high memory bandwidth, fast interconnects, and robust parallel file systems to deliver accurate results. The finer the grid spacing in the model, the more accurate the results, and the more computational resources the model requires. EFA offers a fast interconnect that allows weather modeling applications to take advantage of the virtually unlimited scaling capabilities of the AWS cloud and get more accurate predictions in less time.
Machine Learning
The training of deep learning models can be significantly accelerated with distributed computing on GPUs. Leading deep learning frameworks such as Caffe, Caffe2, Chainer, MXNet, TensorFlow, and PyTorch have already integrated NCCL to take advantage of its multi-GPU collectives for communication across nodes. EFA is optimized for NCCL on AWS, improving the throughput and scalability of distributed training jobs, which leads to faster results.
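As a minimal sketch of steering NCCL traffic over EFA, a training script or job launcher typically sets a couple of environment variables before the framework initializes its communicators. FI_PROVIDER is a real libfabric variable and NCCL_DEBUG a real NCCL one, but whether EFA is actually used still depends on the instance type and the installed software stack (e.g. the aws-ofi-nccl plugin), so treat this as illustrative rather than a complete recipe.

```python
import os

# Ask libfabric to use the EFA provider for NCCL's network transport.
# (Takes effect only where the EFA driver and NCCL's libfabric plugin are installed.)
os.environ["FI_PROVIDER"] = "efa"

# Have NCCL log which transport it actually selected at startup,
# so you can confirm from the job logs that EFA is in use.
os.environ["NCCL_DEBUG"] = "INFO"

print(os.environ["FI_PROVIDER"], os.environ["NCCL_DEBUG"])
```

These settings must be in place before the deep learning framework creates its NCCL communicators; exporting them in the launch script has the same effect.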