Posted On: Nov 2, 2020

We are excited to announce that Elastic Fabric Adapter (EFA) now supports NVIDIA GPUDirect Remote Direct Memory Access (RDMA). GPUDirect RDMA support on EFA will be available on Amazon Elastic Compute Cloud (Amazon EC2) P4d instances, the next generation of GPU-based instances on AWS. P4d provides the highest performance for machine learning (ML) training and high-performance computing (HPC) in the cloud for applications such as natural language processing, object detection and classification, seismic analysis, and computational drug discovery. GPUDirect RDMA support on EFA enables network interface cards (NICs) to access GPU memory directly. This avoids extra memory copies, which makes remote GPU-to-GPU communication across NVIDIA GPU-based Amazon EC2 instances faster and reduces orchestration overhead on CPUs and user applications. As a result, customers running applications that use the NVIDIA Collective Communications Library (NCCL) on P4d will be able to further accelerate their multi-node, tightly coupled workloads.
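As a rough illustration of what this looks like from the application side, the sketch below prepares the environment for an NCCL job that should run over EFA with GPUDirect RDMA. It is not taken from this announcement: the helper name `efa_nccl_env` is hypothetical, and it assumes the EFA software and an NCCL-to-libfabric plugin are already installed on the instance. `FI_PROVIDER` and `FI_EFA_USE_DEVICE_RDMA` are libfabric settings and `NCCL_DEBUG` is an NCCL setting; whether each one is required depends on the software versions in use, so treat the values as a starting point and check the EFA documentation.

```python
import os


def efa_nccl_env(base=None, debug=True):
    """Return a copy of `base` (default: os.environ) with EFA/NCCL settings added.

    Hypothetical helper for illustration; it only builds a dict and
    does not touch the current process environment.
    """
    env = dict(os.environ if base is None else base)
    # Ask libfabric to use the EFA provider.
    env["FI_PROVIDER"] = "efa"
    # Ask the EFA provider to use device (GPU) RDMA where supported.
    # Assumption: newer software stacks may enable this by default.
    env["FI_EFA_USE_DEVICE_RDMA"] = "1"
    if debug:
        # Have NCCL log its transport selection, which helps confirm
        # that the job is actually running over EFA.
        env["NCCL_DEBUG"] = "INFO"
    return env


if __name__ == "__main__":
    env = efa_nccl_env()
    for key in ("FI_PROVIDER", "FI_EFA_USE_DEVICE_RDMA", "NCCL_DEBUG"):
        print(f"{key}={env[key]}")
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting a training launcher, which keeps these settings scoped to that job instead of the whole shell session.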

P4d instances deliver up to 60% lower cost to train and over 2.5x better deep learning performance, with 2.5x the memory, twice the double-precision floating-point performance, and 4x the local NVMe-based SSD storage compared to previous-generation P3 and P3dn instances. They are available in the p4d.24xlarge size, which provides 96 vCPUs, 8 NVIDIA A100 GPUs, 1.1 TB of instance memory, 8 TB of local NVMe-based SSD storage, 19 Gbps of EBS burst bandwidth, and 400 Gbps of networking bandwidth with EFA and GPUDirect RDMA.

EFA is a custom-built network interface for Amazon EC2 instances that enables customers to run applications requiring high levels of inter-instance communication at scale on AWS. To learn more about how to use EFA, please visit the EFA documentation. To learn more about scaling HPC and ML workloads with EFA, please visit the AWS HPC Workshops.