Containers
Deploying managed P4d Instances in Amazon Elastic Kubernetes Service with NVIDIA GPUDirectRDMA
In March 2021, Amazon EKS announced support for Amazon EC2 P4d instances, enabling you to launch a fully managed EKS cluster based on the latest NVIDIA A100 GPUs. Amazon EC2 P4d instances are the next generation of GPU-based instances that provide the best performance for machine learning (ML) training and high performance computing (HPC) in the cloud for applications such as natural language processing, object detection and classification, seismic analysis, and genomics research. This post takes you through how you can quickly get started with deploying these instances in a managed EKS cluster.
Product overview:
Each p4d.24xlarge instance comes equipped with:
- 8x NVIDIA A100 GPUs
- 96 vCPUs
- 8x 1 TB of local NVMe storage
- 4×100 Gbps accelerated networking with support for GPUDirectRDMA utilizing Elastic Fabric Adapter (EFA).
A more thorough deep dive on the Amazon EC2 P4d instances is available here. Setting up P4d instances by hand, with all the performance optimizations related to GPUDirectRDMA (GDRDMA) and the 400 Gbps networking, requires several manual steps. When you run them in a managed service layer such as Amazon EKS with managed node groups, this infrastructure setup is handled automatically, so you can focus on running highly scalable distributed accelerated workloads.
Requirements
Install and configure the following components in your local environment:
- eksctl – You need version 0.43.0 or later.
- kubectl – This post uses Kubernetes version 1.19.
You must also set up your environment to authenticate and authorize AWS Command Line Interface (AWS CLI) commands that run on your behalf. Install AWS CLI version 2 and configure it with your access key ID and secret access key.
Deployment
Setting up the cluster is covered in the following steps. In this example, we walk through running the NVIDIA Collective Communication Library (NCCL) tests to validate utilization of GPUDirectRDMA over Elastic Fabric Adapter (EFA). The AWS samples GitHub repo for EFA on EKS has additional examples tailored to ML workloads.
Step 1: In your AWS Region, ensure at least one of the Availability Zones contains P4d instances. You can check availability with the following command:
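As a rough illustration of this availability check, the Python sketch below filters the JSON that `aws ec2 describe-instance-type-offerings --location-type availability-zone --output json` returns for your Region. The helper name `zones_offering` and the sample payload are hypothetical and only stand in for real output, which depends on your Region and account.

```python
import json
from typing import List


def zones_offering(offerings_json: str, instance_type: str = "p4d.24xlarge") -> List[str]:
    """Return the sorted Availability Zones that offer the given instance type."""
    offerings = json.loads(offerings_json)["InstanceTypeOfferings"]
    return sorted(
        offering["Location"]
        for offering in offerings
        if offering["InstanceType"] == instance_type
        and offering["LocationType"] == "availability-zone"
    )


# Hypothetical sample payload, shaped like the AWS CLI JSON output.
# The zone names are made up for illustration, not real availability data.
SAMPLE = json.dumps({
    "InstanceTypeOfferings": [
        {"InstanceType": "p4d.24xlarge", "LocationType": "availability-zone", "Location": "us-west-2c"},
        {"InstanceType": "p4d.24xlarge", "LocationType": "availability-zone", "Location": "us-west-2b"},
        {"InstanceType": "p3.16xlarge", "LocationType": "availability-zone", "Location": "us-west-2a"},
    ]
})

print(zones_offering(SAMPLE))  # ['us-west-2b', 'us-west-2c']
```

An empty list would mean that no Availability Zone in the queried Region offers P4d instances, in which case you would pick a different Region before continuing.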
Step 2: Copy and paste the following code in your editor and replace any values specific to your Region.
This eksctl config file creates a VPC, an EKS cluster, and a P4d managed node group. Also notice the use of EKS add-ons to ensure your cluster is launched with at least VPC CNI version 1.7.10, which is a requirement for EFA traffic. The VPC is created with a private and a public subnet in each Availability Zone specified. By specifying private networking and a single Availability Zone in your managed node group, you ensure that your nodes are launched in a single subnet, which is a requirement for worker nodes to communicate over EFA. Note that you may need to request an increase to your EC2 On-Demand Instance limits: the default is 128 vCPUs for P series instances, and this managed node group can require up to 384 vCPUs (4 p4d.24xlarge instances at 96 vCPUs each).
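To make the vCPU limit arithmetic concrete, here is a minimal Python sketch. The constants come from the figures quoted in this post (96 vCPUs per p4d.24xlarge, a default limit of 128 vCPUs); the function names are hypothetical, and the limit that actually applies to your account may differ.

```python
VCPUS_PER_P4D_24XLARGE = 96        # from the product overview above
DEFAULT_P_SERIES_VCPU_LIMIT = 128  # default quoted in this post; check your own account


def vcpus_required(node_count: int, vcpus_per_node: int = VCPUS_PER_P4D_24XLARGE) -> int:
    """Total vCPUs a node group of node_count instances can consume."""
    return node_count * vcpus_per_node


def max_nodes_within(limit: int, vcpus_per_node: int = VCPUS_PER_P4D_24XLARGE) -> int:
    """Largest number of whole instances that fits under a vCPU limit."""
    return limit // vcpus_per_node


required = vcpus_required(4)
shortfall = max(0, required - DEFAULT_P_SERIES_VCPU_LIMIT)

print(required)                                       # 384
print(max_nodes_within(DEFAULT_P_SERIES_VCPU_LIMIT))  # 1
print(shortfall)                                      # 256
```

Under these assumptions the default limit fits only one p4d.24xlarge, so even the two-node NCCL test in this post needs a limit increase first.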
If you have an existing VPC, see this example for how to create a node group with eksctl in a single subnet for an existing VPC. For an existing VPC, ensure that you have the correct networking topology for starting the P4d instances. As a best practice, launch your P4d instances in a private subnet, with a NAT Gateway routing to a public subnet with an Internet Gateway.
Now use the config file to create your cluster and node group:
This command takes some time, as eksctl will be creating a cluster and P4d node group in sequential steps. In the logs of the eksctl bootstrap command, you should see a log entry confirming that the EFA device plugin was successfully applied.
Step 3: Next, apply the latest version of the NVIDIA K8s device plugin.
Describe one of the nodes by calling kubectl describe node ip-10-0-57-3.us-west-2.compute.internal, and you can see the allocatable resources:
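The exact values are specific to your cluster. As a sketch of what to look for, the Python snippet below pulls the GPU and EFA extended resources out of a node object as returned by `kubectl get node <node-name> -o json`. The helper name and the sample node are hypothetical: the counts assume the 8 GPUs listed in the product overview and one EFA device per 100 Gbps network interface, and the `cpu` value is invented.

```python
import json
from typing import Dict

# vpc.amazonaws.com/efa is the EFA resource name used in this post;
# nvidia.com/gpu is the resource name advertised by the NVIDIA device plugin.
RESOURCES = ("nvidia.com/gpu", "vpc.amazonaws.com/efa")


def accelerator_resources(node_json: str) -> Dict[str, int]:
    """Extract allocatable GPU and EFA counts from a Kubernetes node object."""
    allocatable = json.loads(node_json)["status"]["allocatable"]
    # Kubernetes serializes quantities as strings; a missing key counts as 0.
    return {name: int(allocatable.get(name, "0")) for name in RESOURCES}


# Hypothetical, trimmed-down node object for illustration only.
SAMPLE_NODE = json.dumps({
    "status": {
        "allocatable": {
            "cpu": "95",
            "nvidia.com/gpu": "8",
            "vpc.amazonaws.com/efa": "4",
        }
    }
})

print(accelerator_resources(SAMPLE_NODE))
# {'nvidia.com/gpu': 8, 'vpc.amazonaws.com/efa': 4}
```

A count of 0 for either resource would suggest that the corresponding device plugin has not registered on that node yet.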
By using eksctl and managed node groups, all the heavy lifting of configuring the infrastructure and networking for EFA with GDRDMA is automatically handled. This includes installing the EFA plugin, which presents the EFA network devices as allocatable resources to pods via the vpc.amazonaws.com/efa Kubernetes extended resource. Additionally, with the efaEnabled flag, eksctl automatically handles other EFA prerequisites, including creating an EFA enabled security group, an EC2 placement group, and installing the EFA driver as part of EC2 user data. You can find more details on these steps in the EKS documentation. Next, let’s run the NCCL test to validate our training job throughput.
Step 4: Example Benchmarking
With the base EKS cluster in place, you can then add the Kubeflow MPI Operator for your subsequent tests.
Next, clone the aws-samples/aws-efa-eks repo and apply the test configuration:
Once the pods start up and are in the Running state, you can see the NCCL networking layer call libfabric and use the underlying EFA devices and GPUDirectRDMA. Here is the expected output:
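When reading NCCL test results, it helps to know how the reported bandwidth columns relate to each other. The nccl-tests documentation defines algorithm bandwidth as message size divided by elapsed time, and for all-reduce it derives bus bandwidth by multiplying that value by 2(n-1)/n, where n is the number of ranks. The Python sketch below applies that formula; the function name and the input numbers are made up for illustration and are not measurements from this cluster.

```python
def allreduce_bus_bandwidth(size_bytes: float, seconds: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s for an all-reduce, per the nccl-tests definition."""
    if n_ranks < 1 or seconds <= 0:
        raise ValueError("n_ranks must be >= 1 and seconds must be > 0")
    algorithm_bandwidth = size_bytes / seconds / 1e9  # GB/s
    # All-reduce correction factor: 2 * (n - 1) / n
    return algorithm_bandwidth * 2 * (n_ranks - 1) / n_ranks


# Hypothetical inputs: a 1 GB message reduced in 62.5 ms across 16 GPUs.
print(allreduce_bus_bandwidth(1e9, 0.0625, 16))  # 30.0
```

With 16 ranks the correction factor is 1.875, so the bus bandwidth figure is almost twice the algorithm bandwidth figure for the same run.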
Step 5: Cleanup
To clean up the environment, you can delete the entire cluster and node group with the following command:
Conclusion
In this post, you learned how to get started with deploying machine learning applications that take full advantage of P4d instances on EKS. By using eksctl with managed node groups, all of the infrastructure setup required for managed elastic scaling of P4d instances with GPUDirectRDMA over EFA is completely automated. You looked at how the NCCL tests ran an all-reduce job that used all 16 GPUs and the network bandwidth of the two-node cluster. At AWS, we have already seen several EKS customers move to P4d and reduce their time to complete distributed ML training by nearly 50%, and we are excited to see what kind of improvements you will experience, in addition to the new types of machine learning this capability unlocks. As always, feel free to leave feedback and comments on either the AWS sample repository or the AWS Containers roadmap.