Scale HPC Workloads with Elastic Fabric Adapter and AWS ParallelCluster
In April 2019, AWS announced the general availability of Elastic Fabric Adapter (EFA), an EC2 network device that improves the throughput and scalability of distributed High Performance Computing (HPC) and Machine Learning (ML) workloads. Today, we’re excited to announce support for EFA through AWS ParallelCluster.
EFA is a network interface for Amazon EC2 instances that enables you to run HPC applications requiring high levels of inter-instance communications (such as computational fluid dynamics, weather modeling, and reservoir simulation) at scale on AWS. It uses an industry-standard operating system bypass technique, with a new custom Scalable Reliable Datagram (SRD) Protocol to enhance the performance of inter-instance communications, which is critical to scaling HPC applications. For more on EFA and supported instance types, see Elastic Fabric Adapter (EFA) for Tightly-Coupled HPC Workloads.
AWS ParallelCluster takes care of the undifferentiated heavy lifting involved in setting up an HPC cluster with EFA enabled. When you set the enable_efa = compute flag in your cluster section, AWS ParallelCluster adds EFA to all network-enhanced instances. Under the covers, AWS ParallelCluster performs the following steps:
- Sets InterfaceType = efa in the Launch Template.
- Ensures that the security group has rules to allow all inbound and outbound traffic to itself. Unlike traditional TCP traffic, EFA requires an inbound rule and an outbound rule that explicitly allow all traffic to its own security group ID sg-xxxxx. See Prepare an EFA-enabled Security Group for more information.
- Installs the EFA kernel module, an AWS-specific version of the Libfabric network stack, and OpenMPI 3.1.4.
- Validates the instance type, base OS, and placement group.
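To make the first two steps above more concrete, here is a minimal Python sketch of what the corresponding EC2 request payloads might look like. It is only an illustration: the post does not show how AWS ParallelCluster applies these settings internally, the function names are hypothetical, and the subnet and security group IDs are placeholders. The code builds plain dictionaries and never calls AWS.

```python
"""Illustrative payloads for the launch template and security group steps.

Assumption: function names are hypothetical and the IDs are placeholders.
Nothing here contacts AWS; the functions only build dictionaries.
"""


def efa_network_interface(subnet_id, security_group_id):
    """Build a launch template network interface entry that requests EFA."""
    return {
        "DeviceIndex": 0,
        "InterfaceType": "efa",  # the setting named in the first step
        "SubnetId": subnet_id,
        "Groups": [security_group_id],
    }


def self_referencing_rule(security_group_id):
    """Build a permission covering all traffic to or from the group itself."""
    return {
        "IpProtocol": "-1",  # -1 means all protocols and ports
        "UserIdGroupPairs": [{"GroupId": security_group_id}],
    }


def efa_security_group_rules(security_group_id):
    """Build the inbound and outbound requests, both referencing the same group."""
    return {
        "ingress": {
            "GroupId": security_group_id,
            "IpPermissions": [self_referencing_rule(security_group_id)],
        },
        "egress": {
            "GroupId": security_group_id,
            "IpPermissions": [self_referencing_rule(security_group_id)],
        },
    }


if __name__ == "__main__":
    # Placeholder IDs, not real resources.
    print(efa_network_interface("subnet-xxxxx", "sg-xxxxx"))
    print(efa_security_group_rules("sg-xxxxx"))
```

If you were doing this by hand with boto3, the first dictionary has the shape of an entry under NetworkInterfaces in the LaunchTemplateData passed to create_launch_template, and the two requests have the shape of the keyword arguments to authorize_security_group_ingress and authorize_security_group_egress. With AWS ParallelCluster you do not need to write any of this yourself.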
To get started, you’ll need to have AWS ParallelCluster set up; see Getting Started with AWS ParallelCluster. For this tutorial, we’ll assume that you have AWS ParallelCluster installed and are familiar with the ~/.parallelcluster/config file. Edit that file to include a cluster section that minimally includes the following: