Posted On: Mar 26, 2021

Amazon Elastic Kubernetes Service (EKS) now supports Elastic Fabric Adapter (EFA), enabling applications to achieve the performance of an on-premises machine learning training cluster, with the scalability, flexibility, and elasticity provided by Kubernetes clusters managed by EKS.

Kubernetes has become a leading platform for distributed machine learning applications, as it makes it easy to scale clusters to a large number of nodes with powerful GPU-based instances. At scale, network bandwidth can become a bottleneck for distributed workloads. Elastic Fabric Adapter (EFA) is a network interface for Amazon EC2 instances that enables you to run applications requiring high levels of inter-node communication at scale on AWS. You can now integrate EFA into distributed training applications on Kubernetes by using the newly released EFA device plugin, which automatically discovers EFA devices and mounts them into pods that request them. This allows you to add bandwidth as ML training jobs scale horizontally to accommodate ever-increasing model sizes. You can also take full advantage of the latest GPU-powered EC2 instance types, such as P4d, which include multiple EFA devices for even greater improvements in model training time.
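As a rough illustration of what "pods that request them" can look like, the sketch below builds a pod manifest that asks for EFA devices through the `vpc.amazonaws.com/efa` extended resource advertised by the EFA device plugin. The helper name `efa_pod_manifest`, the pod name, the container image, and the device counts are all hypothetical placeholders, and the GPU resource assumes the NVIDIA device plugin is also installed on the cluster. Treat it as a minimal sketch under those assumptions rather than a recommended configuration.

```python
import json

# Extended resource name advertised by the EFA device plugin.
EFA_RESOURCE = "vpc.amazonaws.com/efa"
# Assumption: GPUs are exposed by the NVIDIA device plugin under this name.
GPU_RESOURCE = "nvidia.com/gpu"


def efa_pod_manifest(name, image, efa_devices=1, gpus=0):
    """Return a Pod manifest (as a dict) that requests EFA devices.

    Hypothetical helper for illustration; it is not part of any AWS SDK.
    """
    if efa_devices < 1:
        raise ValueError("efa_devices must be at least 1")

    # Extended resources are requested under limits; Kubernetes uses the
    # same value for requests when only limits are given.
    limits = {EFA_RESOURCE: efa_devices}
    if gpus:
        limits[GPU_RESOURCE] = gpus

    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "resources": {"limits": limits},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder name, image, and counts; adjust to the instance type in use.
    manifest = efa_pod_manifest(
        "training-worker", "example.com/training:latest", efa_devices=4, gpus=8
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and passed to `kubectl apply -f`. Requesting the device as a resource limit, rather than mounting it by hand, lets the scheduler place the pod only on nodes where the plugin reports free EFA devices.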

Elastic Fabric Adapter is supported on all EKS clusters, and EFA-enabled instances can be launched using managed node groups, eksctl, or CloudFormation. See the Amazon EKS documentation to get started. To learn more about Amazon EKS, visit the product page. To learn more about Elastic Fabric Adapter, see the EC2 documentation.