AWS Machine Learning Blog

Accelerate your generative AI distributed training workloads with the NVIDIA NeMo Framework on Amazon EKS

In today’s rapidly evolving landscape of artificial intelligence (AI), training large language models (LLMs) poses significant challenges. These models often require enormous computational resources and sophisticated infrastructure to handle the vast amounts of data and complex algorithms involved. Without a structured framework, the process can become prohibitively time-consuming, costly, and complex. Enterprises struggle with managing distributed training workloads, efficient resource utilization, and model accuracy and performance. This is where the NVIDIA NeMo Framework comes into play. In this post, we present a step-by-step guide to run distributed training workloads on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster.

NVIDIA NeMo Framework

NVIDIA NeMo is an end-to-end cloud-centered framework for training and deploying generative AI models with billions and trillions of parameters at scale. The NVIDIA NeMo Framework provides a comprehensive set of tools, scripts, and recipes to support each stage of the LLM journey, from data preparation to training and deployment. It offers a variety of customization techniques and is optimized for at-scale inference of models for both language and image applications, using multi-GPU and multi-node configurations. NVIDIA NeMo simplifies generative AI model development, making it more cost-effective and efficient for enterprises. By providing end-to-end pipelines, advanced parallelism techniques, memory-saving strategies, and distributed checkpointing, NVIDIA NeMo makes sure AI model training is streamlined, scalable, and high-performing.

The following are benefits of using NVIDIA NeMo for distributed training:

  • End-to-end pipelines for different stages such as data preparation, training, and more, which allows for a plug-and-play approach for your custom data
  • Parallelism techniques, including the following:
    • Data parallelism
    • Tensor parallelism
    • Pipeline parallelism
    • Sequence parallelism
    • Expert parallelism
    • Context parallelism
  • Memory saving techniques, including the following:
    • Selective activation recompute
    • CPU offloading (activation, weights)
    • Attention, including Flash Attention (FA 1/2, FA-cuDNN), Grouped Query Attention, Multi-Query Attention, and Sliding Window Attention
    • Distributed optimizers, including Torch FSDP, Distributed Optimizer (zero-1)
  • Data loaders for different architectures
  • Distributed checkpointing

Solution overview

You can deploy and manage NVIDIA NeMo using either Slurm or Kubernetes orchestration platforms. Amazon EKS is a managed Kubernetes service that makes it straightforward to run Kubernetes clusters on AWS. It manages the availability and scalability of the Kubernetes control plane, and it provides compute node auto scaling and lifecycle management support to help you run highly available container applications.

Amazon EKS is an ideal platform for running distributed training workloads due to its robust integrations with AWS services and performance features. It seamlessly integrates with Amazon FSx for Lustre, a high-throughput file system, enabling fast data access and management using persistent volume claims with the FSx CSI driver. Amazon EKS also integrates with Amazon CloudWatch for comprehensive logging and monitoring, providing insights into cluster performance and resource utilization. It supports Amazon Simple Storage Service (Amazon S3) for scalable and durable data storage and management, providing accessibility for large datasets. Enhanced network performance is achieved with Elastic Fabric Adapter (EFA), which offers low-latency, high-throughput connectivity between nodes. These features collectively make Amazon EKS a powerful and efficient choice for optimizing AI and machine learning (ML) training workflows.

The following diagram shows the solution architecture.

In this post, we present the steps to run distributed training workloads on an EKS cluster. The high-level steps are as follows:

  1. Set up an EFA enabled 2-node 24xlarge cluster.
  2. Set up an FSx for Lustre file system so you can have a shared data repository for storing training dataset and model checkpoints.
  3. Set up an environment for NVIDIA NeMo.
  4. Modify the NVIDIA NeMo Kubernetes manifests to prepare a dataset and train a model.

Prerequisites

You need to be able to launch a CPU-based Amazon Elastic Compute Cloud (Amazon EC2) instance that you’ll use to create the EKS cluster. When your instance is up and running, SSH into your EC2 instance and install the following CLIs:

These steps may change if you are on a non-Linux platform. Consult the preceding documentation for installing the CLIs on other platforms accordingly. We also require that you have a capacity reservation with p4de.24xlarge instances and have the capacityReservationID.

Launch an EKS cluster

ECR p4de.24xlarge instances have the NVIDIA A100 80GB instances, which are highly popular for distributed training generative AI workloads. For more information, refer to Amazon EC2 Instance Types. In this section, we show how to create an EKS cluster with an On-Demand Capacity Reservation for p4de.24xlarge instances.

  1. We provide the cluster creation config in p4de-cluster-config.yaml. See the following code:
git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training/3.test_cases/2.nemo-launcher/EKS

eksctl create cluster -f p4de-cluster-config.yaml

The following are key points to note when creating this cluster:

  • Make sure the kubectl version and the specified Region are correct.
  • Update the capacityReservationID field and make sure to specify the availabilityZones within the managedNodeGroups section, which should be the same Availability Zone ID in which your capacity lives.
  • This configuration will create two managed node groups: one for the system nodes using c5.2xlarge instances and another for running distributed training on p4de.24xlarge instances. Managed node groups will use Amazon EKS optimized AMIs. If you want to provide a custom AMI, you can create a self-managed node group and specify a custom AMI. To find the AMI ID, refer to Retrieving Amazon EKS optimized Amazon Linux AMI IDs. For more details about the Amazon EKS optimized AMI, see the GitHub repo.
  • Make sure efaEnabled is set to true. You can use the same config for creating a cluster with other node groups. For a list of EFA supported instance types, see Supported instance types.
  • Another popular instance for generative AI distributed training workloads is the p5.48xlarge instance with the NVIDIA H100 80 GB GPU. To add a P5 node group to an existing EKS cluster, refer to AWS CLI scripts for EKS management.
  1. After the cluster is created, you can enable kubectl to communicate with your cluster by adding a new context to the kubectl config file:
    aws eks update-kubeconfig --region region-code --name my-cluster
  2. You can confirm communication with your cluster by running the following command:
    kubectl get svc
    NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
    kubernetes   ClusterIP   10.100.0.1   <none>        443/TCP   28h

Next, you can install the AWS EFA Kubernetes Device Plugin. EFA is a network interface for EC2 instances that enhances the performance of inter-node communications, which is critical for distributed training workloads that involve GPUs. This plugin allows Kubernetes to recognize and utilize the EFA device, facilitating high-throughput, low-latency networking necessary for efficient distributed training and deep learning applications.

  1. Install the plugin with the following code:
helm repo add eks https://aws.github.io/eks-charts

helm install efa eks/aws-efa-k8s-device-plugin -n kube-system

The NVIDIA device plugin for Kubernetes enables GPU support within your EKS cluster by exposing the GPUs to the Kubernetes API server through the kubelet. It advertises the available GPU resources, allowing Kubernetes to schedule and manage GPU-accelerated workloads.

  1. Install the plugin with the following code:
    wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.3/nvidia-device-plugin.yml
    
    kubectl apply -f nvidia-device-plugin.yml
  2. Run the following command to verify all the pods:
    kubectl get pods --all-namespaces

  3. You can run kubectl get nodes to verify the nodes.

Alternatively, you can use the EKS node viewer tool to view nodes, their costs, and their status in your cluster. After it’s installed, enter eks-node-viewer to get the following view.

The node viewer displays the IP addresses of our two p4de.24xlarge compute nodes.

  1. We can choose one of these private IP DNS names to further examine and describe the node as follows:
kubectl describe node ip-192-168-165-37.us-west-2.compute.internal

The preceding command describes a lot of detail of the node. To make sure EFA is installed correctly, make sure you see details as shown in the following screenshot.

For p4 nodes, you will see vpc.amazonaws.com/efa:4 and for p5.48xlarge nodes, you should see vpc.amazonaws.com/efa:32.

If EFA is enabled in the node group, make sure that a security group is attached to the nodes that allows a rule to allow all outgoing traffic originating from the same security group. This is required for EFA to work. For instructions, see Get started with EFA and MPI. This security group is intended for testing purposes only. For your production environments, we recommend that you create an inbound SSH rule that allows traffic only from the IP address from which you are connecting, such as the IP address of your computer, or a range of IP addresses in your local network.

Create an FSx for Lustre file system

For distributed training applications, typically hundreds of GPU instances are used, with each node containing multiple GPUs. It is crucial that all nodes can access a shared file system to train on the same dataset efficiently. For this purpose, a high-performance file system with high throughput and low latency is essential. We recommend using the FSx for Lustre file system for large-scale distributed training, because it meets these requirements and provides seamless data access for all nodes involved in the training process.

To have a FSx for Lustre file system mounted on your EKS cluster, complete the following steps:

  1. Use the following scripts to create an AWS Identity and Access Management (IAM) role and attach the FSx policy:
    export FSX_POLICY_NAME=fsx-csi
    
    wget https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/csi/fsx/fsx-policy.json
    export FSX_POLICY_DOC=file://fsx-policy.json
    
    # From EC2 Auto Scaling Group
    export EKS_INSTANCE_PROFILE_NAME=(eks-1ec6fc6b-1a19-d65d-66ac-293ff0a20eb9 )
    
    POLICY_ARN=$(aws iam create-policy --policy-name ${FSX_POLICY_NAME} --policy-document $FSX_POLICY_DOC --query "Policy.Arn" --output text)
    
    INSTANCE_PROFILE=$(aws iam list-instance-profiles --query InstanceProfiles[?InstanceProfileName=="'${EKS_INSTANCE_PROFILE_NAME}'"].{InstanceProfileName:InstanceProfileName} --output text)
    
    ROLE_NAME=$(aws iam get-instance-profile --instance-profile-name ${INSTANCE_PROFILE} --query InstanceProfile.Roles[0].RoleName --output text)
    
    # Attach FSx Policy to role ${ROLE_NAME} ..."
    aws iam attach-role-policy --policy-arn ${POLICY_ARN} --role-name ${ROLE_NAME}
    
  2. Use the following script to create a security group that allows EKS nodes to access the file system:
    # From EC2 console
    export MY_REGION=us-west-2
    # FSX_SUBNET_ID should be same ID the compute nodes are present in. You can get this from the EKS console 
    export FSX_SUBNET_ID=subnet-0edecd850cff2cfad
    # From EC2 Auto Scaling Group
    export FSX_SECURITY_GROUP_NAME=eks-fsx-sg
    
    # Get VPC_ID from EKS console
    export VPC_ID=vpc-04411d49af198a6ea
    
    # Create security group
    export SECURITY_GROUP_ID=$(aws ec2 create-security-group --vpc-id ${VPC_ID} --region ${MY_REGION} --group-name ${FSX_SECURITY_GROUP_NAME} --description "FSx for Lustre Security Group" --query "GroupId" --output text)
    
    export SUBNET_CIDR=$(aws ec2 describe-subnets --region ${MY_REGION} --query Subnets[?SubnetId=="'${FSX_SUBNET_ID}'"].{CIDR:CidrBlock} --output text)
    
    # Ingress rule
    aws ec2 authorize-security-group-ingress --region ${MY_REGION} --group-id ${SECURITY_GROUP_ID} --protocol tcp --port 988 --cidr ${SUBNET_CIDR}
    
  3. Create a 1.2 TB Persistent_2 FSx for Lustre file system from the FSx for Lustre console in the same Availability Zone as your compute instances (FSX_SUBNET_ID), VPC of Amazon EKS (VPC_ID), and the security group you created (SECURITY_GROUP_ID).
  4. After the file system is created, note the file system ID, DNS name, and mount name from the file system details page.

Before mounting the file system, you need to install the FSx CSI driver that allows EKS clusters to manage the lifecycle of FSx for Lustre file systems.

  1. Install the FSx CSI driver as follows:
    echo "Installing FSx CSI driver ..."
    kubectl apply -k "github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"
    
    echo "FSx pods in kube-system namespace ..."
    kubectl -n kube-system get pods | grep fsx
    
  2. Next, to mount the file system, provide scripts in the fsx-storage-class.yaml, fsx-pv.yaml and fsx-pvc.yaml files:
    # Storage Class
    kubectl apply -f fsx-storage-class.yaml
    kubectl get sc
    
    # Persistent Volume
    kubectl apply -f fsx-pv.yaml
    
    # Persistent Volume Claim
    kubectl apply -f fsx-pvc.yaml
    

You can check to make sure that the volumes are in Bound state.

Set up the environment for NVIDIA NeMo

For this post, we use the NVIDIA device plugin for Kubernetes, but if you need to install the GPU Operator, you can do so as follows:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator

To enable distributed training, we use the KubeFlow Training Operator, which is essential for managing and scheduling ML training jobs in a Kubernetes environment. This operator simplifies the process of running distributed training jobs by automating the deployment and scaling of the necessary components. See the following code:

# Deploy Kubeflow training operator

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"

# From https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/kubeflow/training-operator/deploy.sh

# Configure RBAC resources

kubectl apply -f ./clusterrole-hpa-access.yaml

kubectl apply -f ./clusterrolebinding-training-operator-hpa-access.yaml

Additionally, we use the KubeFlow MPI Operator for preprocessing training data in parallel. The MPI Operator facilitates running Message Passing Interface (MPI) jobs, which are crucial for parallelizing the preprocessing tasks across multiple nodes, thereby speeding up the training process. See the following code:

kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.4.0/deploy/v2beta1/mpi-operator.yaml

# From https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/kubeflow/mpi-operator/clusterrole-mpi-operator.yaml
# Add lease permissions fot mpi-operator cluster role
kubectl apply -f ./clusterrole-mpi-operator.yaml

The NVIDIA NeMo Framework is available publicly in the image nvcr.io/nvidia/nemo:24.01.framework. We provide an AWS optimized Dockerfile for use with P4 and P5 instances. We recommend the following library versions for optimal performance:

ENV EFA_INSTALLER_VERSION=1.30.0
ENV AWS_OFI_NCCL_VERSION=1.8.1-aws
ENV NCCL_VERSION=2.19.4-1

You can build and push the image to Amazon Elastic Container Registry (Amazon ECR) as follows:

## AWS
export AWS_REGION=us-west-2
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

## Docker Image
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export IMAGE=nemo-aws
export TAG=":24.01.framework"

docker build -t ${REGISTRY}${IMAGE}${TAG} -f 0.Dockerfile .

echo "Logging in to $REGISTRY ..."
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY

# Create registry if it does not exist
REGISTRY_COUNT=$(aws ecr describe-repositories | grep ${IMAGE} | wc -l)
if [ "$REGISTRY_COUNT" == "0" ]; then
        echo ""
        echo "Creating repository ${IMAGE} ..."
        aws ecr create-repository --repository-name ${IMAGE}
fi

# Push image
docker image push ${REGISTRY}${IMAGE}${TAG}

The NVIDIA NeMo Framework requires users to provide config files with job and model information. You can copy the launcher scripts from the container as follows:

# Run container
docker run -it ${REPOSITORY}${IMAGE}${TAG} bash

# Copy files
docker cp -a <container-id>: /opt/NeMo-Megatron-Launcher/ <Path-to-save-launcher-scripts>

In a Slurm cluster implementation, the launcher scripts, data, and results folder could reside in the file system that both the head node (node from where jobs are submitted) and compute nodes access. But in this Amazon EKS implementation, the node that you used to create the EKS cluster doesn’t have access to EKS file system. To get around this, you can put the launcher scripts in the head node and the results and data folder in the file system that the compute nodes have access to.

Run NVIDIA NeMo on an EKS cluster

We’re now ready to set up NVIDIA NeMo Kubernetes manifests for data preparation and model training. For more information about running it on premises, see Running NeMo Framework on Kubernetes. There are some modifications to be done for it to run on Amazon EKS, as shown in the following steps. We provide the launcher scripts in the GitHub repo.

  1. Modify the launcher_scripts/conf/cluster/k8s.yaml file as follows. The subPath field is the path where FSx for Lustre is mounted, which is /fsx-shared in this case.
    shm_size: 512Gi  # Amount of system memory to allocate in Pods. Should end in "Gi" for gigabytes.
    volumes:
      persistentVolumeClaim:
        # This claim should be created before running
        claimName: fsx-pvc
        subPath: fsx-shared  # path is mirrored into pod (no leading slash b/c relative to root)
    
    # NOTE: These args will soon be deprecated
    nfs_server: null  # Hostname or IP address for the NFS server where data is stored.
    nfs_path: null  # Path to store data in the NFS server.
    ib_resource_name: null  # Specify the resource name for IB devices according to kubernetes, such as "nvidia.com/hostdev" for Mellanox IB adapters. Can also be a list, but must be same length as ib_count
    ib_count: null  # Specify the number of IB devices to include per node in each pod. Can also be a list, but must be same length as ib_resource_name
    ib_network_annotation: ""  # Specify the networks as comma separated values
    dns_policy: null  # Specify a dnsPolicy to use in all pods, if necessary
    
  2. Install the required Python packages; this is required so that NeMo Launcher can submit jobs to the Kubernetes cluster:
sudo apt install python3-pip

pip install -r <Path-to- NeMo-Megatron-Launcher>/requirements.txt

Next, we copy the following folders from the container to the /fsx-shared/data folder:

  • NeMo-Megatron-Launcher/launcher_scripts/data/bpe
  • NeMo-Megatron-Launcher/launcher_scripts/data/nsfw
  1. To copy files from EKS pods, you can start a pod just for this purpose. Create a file fsx-share-test.yaml as follows:
    apiVersion: v1
    kind: Pod
    metadata:
      name: fsx-share-test
    spec:
      containers:
      - name: fsx-share-test
        image: ubuntu
        command: ["/bin/bash"]
        args: ["-c", "while true; do echo  \"hello from FSx\" - $(date -u) >> /fsx-shared/test.txt; sleep 120; done"]
        volumeMounts:
        - name: fsx-pv
          mountPath: /fsx-shared
      volumes:
      - name: fsx-pv
        persistentVolumeClaim:
          claimName: fsx-pvc
    
  2. Run this pod and copy the files:
    kubectl apply -f fsx-share-test.yaml
    
    kubectl cp <Path-to- NeMo-Megatron-Launcher>/launcher_scripts/data/bpe fsx-share-test: /fsx-shared/data/
    
    kubectl cp <Path-to- NeMo-Megatron-Launcher>/launcher_scripts/data/nsfw fsx-share-test: /fsx-shared/data/
    

A few files need to be updated for data preparation for it to work with the EKS cluster.

  1. Modify the launcher_scripts/conf/config.yaml file:
    • For cluster, use k8s.
    • For training, use gpt3/126m.
    • For stages, this should be just data_preparation and no other stages.
    • For launcher_scripts_path, use the path to the NeMo Megatron launch scripts, which should end with /launcher_scripts.
    • For data_dir, use /fsx-shared/data (the location to store and read the data).
    • For base_results_dir, use /fsx-shared/results (the location to store the results, checkpoints, and logs).
    • For container, use ${REPOSITORY}${IMAGE}${TAG}
  2. Modify the conf/data_preparation/gpt3/download_gpt3_pile.yaml file:
    • Set node_array_size to 2.
    • Set file_numbers to “0-5”. With five files, it should be around 350 GB of data
  3. Modify the nemo_launcher/core/k8s_templates/data_preparation/data-prep.yaml file:
    • If you get the error that mpirun is not found, add the full path to the executable /opt/amazon/openmpi/bin/mpirun.
    • Add /fsx-shared in the container volume mount path.
    • Add the volume:
volumes:
          - name: fsx-pv
            persistentVolumeClaim:
              claimName: fsx-pvc
  1. Launch the data preparation job:
    python3 main.py

This script creates a Helm chart for the selected stage (in this case, data_preparation) and runs the Helm chart automatically. Refer to Run NeMo Framework on Kubernetes for an explanation of the data preparation process. Make sure python3 is installed.

  1. You can monitor your job status and logs using three commands: helm list, kubectl get pods, and kubectl logs --follow).
  2. When the job is finished, you can remove the Helm chart:
    helm uninstall download-gpt3-pile

You can see the downloaded the data in the /fsx-shared folder by running in one of the pods as kubectl exec -it nlp-worker-0 bash.

Training

Now that our data preparation is complete, we’re ready to train our model with the created dataset. Complete the following steps:

  1. Modify a parameter in the conf/config.yaml file:
    • Set stages to training and no other stages.
  2. Modify parameters in conf/training/gpt3/126m.yaml:
    • Set num_nodes to 2.
    • Set devices to 1.
    • On line 18, change use_distributed_sampler: False to replace_sampler_ddp: False.

Optionally, if you want to use a mock dataset instead of real dataset for testing purposes, you can modify the data section as follows. You are essentially changing data_impl: mmap to data_impl: mock and assigning an empty list to data_prefix.

data:
  data_impl: mock
  splits_string: "99990,8,2"
  seq_length: 2048
  skip_warmup: True
  num_workers: 2
  dataloader_type: single # cyclic
  reset_position_ids: False # Reset position ids after end-of-document token
  reset_attention_mask: False # Reset attention mask after end-of-document token
  eod_mask_loss: False # Mask loss for the end of document tokens
  index_mapping_dir: null
  data_prefix: [] # Should be weight path weight path... for a blended dataset

# You can just comment the default “data_prefix” values like below.
# - ${data_dir}/my-gpt3_00_text_document
# - .0333
  1. Modify the parameters in the nemo_launcher/core/k8s_templates/training/training.yaml file:
  2. Run python3 main.py to start training and you should see the training pods by running kubectl get pods as follows:
    NAME                    READY   STATUS    RESTARTS   AGE
    nlp-training-worker-0   1/1     Running   0          168m
    nlp-training-worker-1   1/1     Running   0          168m
    

In addition to monitoring your job using helm list, kubectl get pods, and kubectl logs –follow, you can also SSH into your pod with kubectl exec and use nvidia-smi to check GPU status.

  1. When the job is finished, you can delete the helm chart:
    helm uninstall gpt3-126m

Model checkpoints are saved at /fsx-shared/results/checkpoints along with other training logs and TensorBoard events. By default, checkpoints are saved at every 2,000 steps. You can modify the conf/training/gpt3/126m.yaml file to make changes in the training setup.

Troubleshooting deployment failures

If deployment fails due to incorrect setup or configuration, complete the following debug steps:

  1. Find the error message by running kubectl logs --follow PODNAME and kubectl describe pod PODNAME.
  2. Stop any running jobs by removing the Helm chart. This can be done by running helm uninstall CHARTNAME.

Pods should be spun down after removing the Helm chart.

  1. You can double-check by running kubectl get pods.
  2. If pods are not spun down, you can manually stop them by running kubectl delete PODNAME.

Based on the error message, you may find errors from:

  • Unready nodes.
  • Missing Operators or CRDs. In this case, make sure your kubectl get pods -A output looks like that shown earlier. If errors exist, try reinstalling Operators and CRDs.
  • NeMo Framework scripts or Kubernetes manifests. This is more likely a bug or wrong setup on the NeMo side. Errors can vary.

Clean up

It’s important to spin down resources after model training in order to avoid costs associated with running idle instances. To clean up our setup, we must delete the FSx for Lustre file system before deleting the cluster because it’s associated with a subnet in the cluster’s VPC.

  1. To delete the file system integration with the EKS cluster, run the following command:
    kubectl delete -f ./fsx-storage-class.yaml

Not only will this delete the persistent volume, it will also delete the EFS file system and all the data on the file system will be lost.

  1. When Step 1 is complete, delete the cluster by using the following script:
    eksctl delete cluster -f p4de-cluster-config.yaml

This will delete all the existing pods, remove the cluster, and delete the VPC you created in the beginning.

Conclusion

In this post, we demonstrated how to train generative AI models at scale using the NeMo Framework within an EKS cluster. We covered the challenges of training LLMs and how NeMo’s comprehensive tools and optimizations address these challenges, making the process more efficient and cost-effective. With NeMo, you can manage and scale distributed training workloads effectively. This post works with P4de instances. Another popular instance for generative AI distributed training workloads is the p5.48xlarge instance with the NVIDIA H100 80 GB GPU. To add a P5 node group to an existing EKS cluster, refer to AWS CLI scripts for EKS management.

To help you get started, we have published a GitHub repository that provides step-by-step instructions for creating an EKS cluster with P4de instances, mounting an FSx for Lustre file system, and running distributed training workloads with NeMo. This guide empowers you to harness the full potential of NeMo and Amazon EKS for your AI model training needs.


About the authors

Ankur Srivastava is a Sr. Solutions Architect in the ML Frameworks Team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, probabilistic design optimization and has completed his doctoral studies from Mechanical Engineering at Rice University and post-doctoral research from Massachusetts Institute of Technology.

Akshit Arora is a senior data scientist at NVIDIA, where he works on deploying conversational AI models on GPUs at scale. He’s a graduate of University of Colorado at Boulder, where he applied deep learning to improve knowledge tracking on a K-12 online tutoring platform. His work spans multilingual text-to-speech, time series classification, ed-tech, and practical applications of deep learning.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA empowering Amazon’s AI MLOps, DevOps, Scientists and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing Generative AI Foundation models spanning from data curation, GPU training, model inference and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, tennis and poker player.

Wenhan Tan is a Solutions Architect at Nvidia assisting customers to adopt Nvidia AI solutions at large-scale. His work focuses on accelerating deep learning applications and addressing inference and training challenges.