Announcing AWS Neuron Helm Chart

Introduction

We are pleased to announce the launch of the Neuron Helm Chart, which streamlines the deployment of AWS Neuron components on Amazon Elastic Kubernetes Service (Amazon EKS). With this new Helm Chart, you can install the Kubernetes artifacts needed to run training and inference workloads on AWS Trainium and AWS Inferentia instances.

Until now, users needed to download and apply each component of the Neuron Kubernetes plugins separately. The Neuron Helm Chart consolidates these components into a single deployable package. You can deploy the Neuron Device Plugin, the Neuron Scheduler Extension, and the Neuron Node Problem Detector Plugin together, and you can choose which components to deploy through Helm chart configuration values, which simplifies both installation and ongoing management.

Key benefits of using the Neuron Helm Chart

  • Standardization: Enforces a consistent method for packaging AWS Neuron Kubernetes artifacts. The chart can be downloaded from the Amazon Elastic Container Registry (Amazon ECR) public registry.
  • Reusability: Can be reused across different environments, with configuration overrides supplied through a values.yaml file.
  • Versioning: Can be versioned to track changes and keep deployments consistent.
  • Templating: Provides flexibility in configuring applications based on specific needs.
  • Modularity: Includes the Neuron Device Plugin, Neuron Node Problem Detector Plugin, and Neuron Scheduler Extension components. Each artifact can be installed individually by enabling or disabling flags at installation time.
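As a rough sketch of the download and override workflow described in the list above, the following commands print the chart's default values and pull the packaged chart from the Amazon ECR public registry. The chart reference and version are the ones used in the installation steps of this post. The variable names are only illustrative, and the commands assume Helm 3.8.0 or later with network access to the registry.

```shell
# Chart reference and version as used later in this post
CHART_REF="oci://public.ecr.aws/neuron/neuron-helm-chart"
CHART_VERSION="1.0.0"

if command -v helm >/dev/null 2>&1; then
  # Print the chart's default values without installing anything
  helm show values "$CHART_REF" --version "$CHART_VERSION" \
    || echo "could not read chart values (registry unreachable?)"

  # Download the packaged chart (.tgz) into the current directory
  helm pull "$CHART_REF" --version "$CHART_VERSION" \
    || echo "could not pull the chart (registry unreachable?)"
else
  echo "helm not found on PATH; skipping"
fi
```

Reading the default values first is a convenient way to see which flags the chart actually exposes before writing any overrides.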

Key components of the Neuron Helm Chart

  • Neuron Device Plugin: The Neuron Device Plugin exposes Neuron cores and devices to Kubernetes as resources. It runs as a daemonset in the kube-system namespace so that it is available across the worker nodes in the cluster. This component is required for workloads to access Neuron hardware in your Kubernetes clusters.
  • Neuron Scheduler Extension: The Neuron Scheduler Extension is needed for scheduling pods that request more than one Neuron core or device but fewer than the total number available on the instance. It schedules pods based on the hardware topology of the underlying instance: the scheduler identifies sets of devices that are directly connected, which minimizes communication latency and is important for performance in distributed workloads.
  • Neuron Node Problem Detector Plugin: The Neuron Node Problem Detector Plugin runs as a daemonset on AWS Neuron-enabled (Trn1/Inf1/Inf2) EKS worker nodes and monitors the health of the Neuron devices on each Kubernetes node. If it detects an unrecoverable Neuron error, it initiates a node replacement process.
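To make "exposes Neuron cores and devices to Kubernetes as resources" more concrete, the sketch below lists the allocatable Neuron resources that each node reports once the device plugin is running. The resource names aws.amazon.com/neuron and aws.amazon.com/neuroncore are assumptions based on the AWS Neuron Kubernetes documentation and are not stated in this post, so check them against your cluster. The variable name is illustrative.

```shell
# Column spec for kubectl; the resource names are assumptions (see lead-in).
# Dots inside the resource name are escaped so JSONPath treats them literally.
NEURON_COLUMNS='NAME:.metadata.name,NEURON_DEVICES:.status.allocatable.aws\.amazon\.com/neuron,NEURON_CORES:.status.allocatable.aws\.amazon\.com/neuroncore'

if command -v kubectl >/dev/null 2>&1; then
  # Nodes that do not advertise a resource show <none> in that column
  kubectl get nodes --request-timeout=10s -o "custom-columns=${NEURON_COLUMNS}" \
    || echo "could not query nodes (is the cluster reachable?)"
else
  echo "kubectl not found on PATH; skipping"
fi
```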

For more detailed information about the individual Neuron components, refer to this documentation.

How to install the Neuron Helm Chart

The Neuron Helm Chart is hosted in an Amazon ECR public repository. Follow these steps to install the chart in your AWS environment. By default, the Neuron Helm Chart deploys the Neuron Device Plugin and the Neuron Node Problem Detector Plugin. Both plugins are designed to run as daemonsets on an EKS cluster.

Prerequisites for installing Neuron Helm Chart

Step 1: Create an EKS cluster

Make sure that you have a functioning EKS cluster where Neuron components will be deployed. You can use eksctl for deploying the EKS cluster.
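If you do not have a cluster yet, a minimal eksctl invocation might look like the following sketch. The cluster name, Region, node group name, instance type, and node count are all hypothetical placeholders, so confirm that your chosen Trainium or Inferentia instance type is available in your Region. Because creating a cluster incurs cost, the command only runs when CREATE_CLUSTER=yes is set.

```shell
# All values below are hypothetical placeholders; adjust for your account
CLUSTER_NAME="neuron-demo"
AWS_REGION="us-east-2"
NODE_TYPE="inf2.xlarge"   # an Inf2 instance type; Trn1 and Inf1 are also Neuron-enabled

create_neuron_cluster() {
  eksctl create cluster \
    --name "$CLUSTER_NAME" \
    --region "$AWS_REGION" \
    --nodegroup-name neuron-nodes \
    --node-type "$NODE_TYPE" \
    --nodes 2
}

# Guard: creating a cluster takes several minutes and is billable
if [ "${CREATE_CLUSTER:-no}" = "yes" ]; then
  create_neuron_cluster
else
  echo "Set CREATE_CLUSTER=yes to create cluster $CLUSTER_NAME in $AWS_REGION"
fi
```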

Step 2: Set up IAM roles for service accounts for Neuron Node Problem Detector Plugin

To enable Neuron Node Problem Detection and Recovery, you must authorize the plugin through IAM roles for service accounts (IRSA). Follow the steps in this guide to create the necessary IAM policy and configure IRSA.

Step 3: Deploy Helm Chart

Make sure that Helm is installed locally on your machine before proceeding. You can install the Neuron Helm Chart using the following command:

helm install neuron-helm-chart \
oci://public.ecr.aws/neuron/neuron-helm-chart \
--version 1.0.0

Output:

Pulled: public.ecr.aws/neuron/neuron-helm-chart:1.0.0
Digest: sha256:6d12b33ef46a5effaf4deb2851fec43d87636124ea49e81af125e7212889b5b3
NAME: neuron-helm-chart
LAST DEPLOYED: Thu Sep  5 15:05:27 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
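If the install command is rejected instead of producing output like the above, one possible cause is an older Helm client: support for oci:// chart references became generally available in Helm 3.8.0. The following sketch checks the local Helm version and then looks up the release. The helper name helm_supports_oci is hypothetical and not part of Helm.

```shell
# Succeeds when a Helm version string such as "v3.12.1+gabc1234" is 3.8.0 or newer.
# (Hypothetical helper, not part of Helm.)
helm_supports_oci() {
  ver="${1#v}"          # drop the leading "v"
  major="${ver%%.*}"    # text before the first dot
  rest="${ver#*.}"
  minor="${rest%%.*}"   # text between the first and second dot
  [ "$major" -gt 3 ] 2>/dev/null || { [ "$major" -eq 3 ] && [ "$minor" -ge 8 ]; } 2>/dev/null
}

if command -v helm >/dev/null 2>&1; then
  installed="$(helm version --short 2>/dev/null)"
  if helm_supports_oci "$installed"; then
    echo "Helm $installed supports OCI chart references"
  else
    echo "Helm $installed predates 3.8.0; upgrade before using oci:// references"
  fi
  # Show the release created by the install command, if it exists
  helm status neuron-helm-chart 2>/dev/null \
    || echo "release neuron-helm-chart not found in the current namespace"
else
  echo "helm not found on PATH"
fi
```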

Verify that the Neuron Device Plugin and the Neuron Node Problem Detector plugins are running.

# Verify the Neuron Device Plugin
$ kubectl get ds -n kube-system | grep neuron
neuron-device-plugin     2         2         2       2            2           <none>          102s

# Verify the Neuron Node Problem Detector Plugin
$ kubectl get all -n neuron-healthcheck-system
NAME                              READY   STATUS    RESTARTS   AGE
pod/node-problem-detector-sm4rz   2/2     Running   0          4m7s
pod/node-problem-detector-vgvdz   2/2     Running   0          4m7s

NAME                                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/node-problem-detector   2         2         2       2            2           <none>          4m7s
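For scripted checks, the same verification can be reduced to comparing the DESIRED and READY columns of each daemonset. The sketch below is one way to do that. The helper name ds_ready is hypothetical, and the daemonset names and namespaces are taken from the output shown above.

```shell
# Reads one line of "kubectl get ds --no-headers" output on stdin and succeeds
# only when DESIRED (column 2) equals READY (column 4) and is greater than zero.
# (Hypothetical helper.)
ds_ready() {
  awk 'NR == 1 { ok = ($2 == $4 && $2 > 0) } END { exit !ok }'
}

# Offline check against the sample line from the output above
echo "neuron-device-plugin 2 2 2 2 2 <none> 102s" | ds_ready \
  && echo "sample line: ready"

if command -v kubectl >/dev/null 2>&1; then
  kubectl get ds neuron-device-plugin -n kube-system \
      --no-headers --request-timeout=10s 2>/dev/null | ds_ready \
    && echo "neuron-device-plugin: ready" \
    || echo "neuron-device-plugin: not ready, or cluster unreachable"

  kubectl get ds node-problem-detector -n neuron-healthcheck-system \
      --no-headers --request-timeout=10s 2>/dev/null | ds_ready \
    && echo "node-problem-detector: ready" \
    || echo "node-problem-detector: not ready, or cluster unreachable"
else
  echo "kubectl not found on PATH; skipping live checks"
fi
```

An empty input (for example, when the daemonset does not exist) is treated as not ready.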

To install the Neuron Scheduler Extension:

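By default, the Neuron Helm chart installs the Neuron Device Plugin and the Neuron Node Problem Detector Plugin. To also install the Neuron Scheduler Extension, set scheduler.enabled=true when you deploy the Helm chart. If a release named neuron-helm-chart already exists from the previous step, run helm upgrade with the same arguments instead of helm install, because Helm does not allow a release name to be reused within a namespace.

Note that Neuron creates a scheduler extension called my-scheduler. Workloads that need this scheduler must reference it by name in the schedulerName field of their pod spec.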

helm install neuron-helm-chart \
oci://public.ecr.aws/neuron/neuron-helm-chart \
--version 1.0.0 \
--set "scheduler.enabled=true"
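The same setting can also be kept in a values file, which is easier to version and reuse across environments. The sketch below writes a file equivalent to the --set flag above and applies it with helm upgrade --install, which installs the release if it is missing and upgrades it otherwise. The file name neuron-values.yaml and the APPLY guard variable are illustrative choices. Only the scheduler.enabled key comes from this post.

```shell
# Values file equivalent to: --set "scheduler.enabled=true"
cat > neuron-values.yaml <<'EOF'
scheduler:
  enabled: true
EOF

# Guard so that nothing is changed on a cluster unless APPLY=yes is set
if [ "${APPLY:-no}" = "yes" ]; then
  helm upgrade --install neuron-helm-chart \
    oci://public.ecr.aws/neuron/neuron-helm-chart \
    --version 1.0.0 \
    -f neuron-values.yaml
else
  echo "Wrote neuron-values.yaml; set APPLY=yes to deploy it"
fi
```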

Check the status of the components installed:

# Verify the Neuron Device Plugin
$ kubectl get ds -n kube-system | grep neuron
neuron-device-plugin     2         2         2       2            2           <none>          102s

# Verify the Neuron Scheduler
$ kubectl get deploy -n kube-system | grep my-scheduler
my-scheduler           1/1     1            1           85s

# Verify the Neuron Node Problem Detector
$ kubectl get all -n neuron-healthcheck-system
NAME                              READY   STATUS    RESTARTS   AGE
pod/node-problem-detector-sm4rz   2/2     Running   0          4m7s
pod/node-problem-detector-vgvdz   2/2     Running   0          4m7s

NAME                                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/node-problem-detector   2         2         2       2            2           <none>          4m7s
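With the scheduler running, a workload opts in by naming my-scheduler in its pod spec. The following sketch writes a minimal pod manifest and validates it client-side. The pod name, container image, and command are placeholders, and the aws.amazon.com/neuron resource name is an assumption based on the AWS Neuron Kubernetes documentation rather than this post. A request for two devices can only be satisfied on a node that has at least that many Neuron devices.

```shell
# Minimal pod manifest; names, image, and resource name are placeholders/assumptions
cat > neuron-workload.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: neuron-demo
spec:
  schedulerName: my-scheduler        # scheduler created by the chart, per this post
  restartPolicy: Never
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:stable   # placeholder image
      command: ["sleep", "3600"]
      resources:
        limits:
          aws.amazon.com/neuron: 2   # assumed resource name for Neuron devices
EOF

if command -v kubectl >/dev/null 2>&1; then
  # Client-side dry run: checks the manifest without creating the pod
  kubectl apply --dry-run=client -f neuron-workload.yaml \
    || echo "dry run failed (is the cluster reachable?)"
else
  echo "kubectl not found on PATH; manifest written to neuron-workload.yaml"
fi
```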

Refer to the Helm chart values.yaml file for the full list of configurable values.

Conclusion

The Neuron Helm Chart streamlines the deployment of AWS Neuron components on Amazon EKS, offering a simplified, flexible solution for running training and inference workloads on Trainium and Inferentia instances. By consolidating key components such as the Neuron Device Plugin, Neuron Scheduler Extension, and Neuron Node Problem Detector into a manageable Helm chart, you can now achieve a more efficient setup and operation of Neuron-enabled environments.

If you are planning to deploy an LLM for training and inference on Trainium and Inferentia, make sure to refer to these deployment patterns from the Data on EKS project for further guidance.

We encourage you to get started with the Neuron Helm Chart today and use its capabilities to optimize your machine learning (ML) workloads on Amazon EKS. For detailed guidance, visit the official AWS Neuron Kubernetes Getting Started documentation.

Geeta Gharpure

Geeta is a senior software developer on the Annapurna ML engineering team. She focuses on running large-scale AI/ML workloads on Kubernetes. She lives in Sunnyvale, CA and enjoys listening to Audible in her free time.

Ratnopam Chakrabarti

Ratnopam Chakrabarti is a Senior Solutions Architect specializing in Containers, AI-ML on Kubernetes and Open-Source technologies at Amazon Web Services (AWS). In his current role, Ratnopam helps AWS customers accelerate their cloud adoption and run scalable, secure and optimized container workloads at scale. You can connect with him on LinkedIn at https://www.linkedin.com/in/ratnopamc/.

Arjun Raman

Arjun Raman is a software development engineer at AWS. He currently focuses on enhancing the AI/ML experience through the integration of AWS Neuron with containerized environments and Kubernetes.

Mounik Chinthapanti

Mounik Chinthapanti is a Software Development Engineer at AWS, specializing in improving the AI/ML experience by integrating AWS Neuron with containerized environments and Kubernetes.

Vara Bonthu

Vara Bonthu is a dedicated technology professional and Worldwide Tech Leader for Data on EKS, specializing in assisting AWS customers ranging from strategic accounts to diverse organizations. He is passionate about open-source technologies, Data Analytics, AI/ML, and Kubernetes, and boasts an extensive background in development, DevOps, and architecture. Vara's primary focus is on building highly scalable Data and AI/ML solutions on Kubernetes platforms, helping customers harness the full potential of cutting-edge technology for their data-driven pursuits.