AWS Cloud Operations & Migrations Blog

Monitoring GPU workloads on Amazon EKS using AWS managed open-source services

As machine learning (ML) workloads continue to grow in popularity, many customers are looking to run them on Kubernetes with graphics processing unit (GPU) support. Amazon Elastic Compute Cloud (Amazon EC2) instances powered by NVIDIA GPUs deliver the scalable performance needed for fast ML training and cost-effective ML inference. Monitoring GPU utilization provides valuable information to researchers working on GPU-accelerated ML workflows and to infrastructure teams. Patterns in GPU metrics correlate with workload behavior and make it possible to optimize resource allocation, identify anomalies, and improve overall system performance and stability. However, given the significant differences between central processing units (CPUs) and GPUs from both design and functional perspectives, setting up robust observability for GPU workloads can be a daunting task.

We recently announced the release of the AWS CDK Observability Accelerator, a set of opinionated modules to help customers set up observability for Amazon Elastic Kubernetes Service (Amazon EKS) clusters. It simplifies observability provisioning and maintenance for Amazon EKS clusters using the AWS Cloud Development Kit (AWS CDK) and CDK EKS Blueprints. The AWS CDK Observability Accelerator is organized around patterns, which are reusable units for deploying multiple resources. The open-source observability set of patterns instruments your cluster with the AWS Distro for OpenTelemetry collector to collect metrics, Amazon Managed Service for Prometheus to store them, and Amazon Managed Grafana dashboards to visualize them.

The NVIDIA Data Center GPU Manager (DCGM) is a set of low-overhead tools that can perform a variety of functions, including active health monitoring, diagnostics, system validation, policies, power and clock management, group configuration, and accounting. DCGM includes APIs for gathering GPU telemetry. DCGM Exporter, installed by the NVIDIA GPU Operator, uses Go bindings to collect GPU telemetry data, such as memory utilization and interconnect traffic, from DCGM and exposes it on an HTTP endpoint (/metrics), from which the metrics are remote-written to and stored in Amazon Managed Service for Prometheus. The NVIDIA GPU Operator add-on uses the operator framework within Amazon EKS to automate the management of all NVIDIA software components needed to provision GPUs, including DCGM and DCGM Exporter.
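As a concrete example of what this looks like in practice, once the pattern described in this post is deployed you can spot-check the Prometheus-formatted metrics that DCGM Exporter exposes. The following sketch assumes the exporter's service is named nvidia-dcgm-exporter, listens on port 9400, and runs in the gpu-operator namespace; adjust the names to match your cluster:

# Forward the DCGM Exporter service locally and look for a GPU utilization metric
kubectl -n gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400 &
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL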

In this post, we’ll explore how to leverage the AWS CDK Observability Accelerator to quickly build observability for GPU workloads on Amazon EKS, using AWS managed open-source services and NVIDIA tools.

Solution Overview

The pattern we are going to deploy will provision the following components:

- A new Amazon EKS cluster with a GPU-powered managed node group
- The NVIDIA GPU Operator add-on, which deploys the NVIDIA software components, including DCGM and DCGM Exporter
- AWS Distro for OpenTelemetry collector to collect GPU and cluster metrics
- Amazon Managed Service for Prometheus to store the collected metrics
- Grafana Operator and Flux to provision dashboards in your Amazon Managed Grafana workspace
- External Secrets add-on to make the Grafana API key available as a Kubernetes secret

The following diagram shows the flow of metrics from source to destination through the main components of the pattern:

Figure 1: Architecture demonstrating Monitoring of GPU workloads on Amazon EKS using AWS managed open-source Services

Prerequisites

You will need the following to complete the steps in this post:

- An AWS account and the AWS Command Line Interface (AWS CLI) configured with credentials for it
- Node.js, npm, and the AWS CDK Toolkit
- kubectl
- git and make
- An existing Amazon Managed Grafana workspace

Environment setup:

export AWS_REGION=<YOUR AWS REGION>
export ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)

Next, let’s get the workspace ID of your existing Amazon Managed Grafana workspace:

aws grafana list-workspaces

{
    "workspaces": [
        {
            "authentication": {
                "providers": [
                    "AWS_SSO"
                ]
            },
            "created": "2023-06-07T12:23:56.625000-04:00",
            "description": "accelerator-workspace",
            "endpoint": "g-XXXXXXX.grafana-workspace.us-east-2.amazonaws.com",
            "grafanaVersion": "9.4",
            "id": "g-XXXXXXX",
            "modified": "2023-06-07T12:30:09.892000-04:00",
            "name": "accelerator-workspace",
            "notificationDestinations": [
                "SNS"
            ],
            "status": "ACTIVE",
            "tags": {}
        }
    ]
}

# Populate the values of `id` and `endpoint` from the output above into the environment variables below
export COA_AMG_WORKSPACE_ID="<<Your-WORKSPACE-ID>>"
export COA_AMG_ENDPOINT_URL="https://<<YOUR-WORKSPACE-URL>>"
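If the workspace shown above is the only one (or the first one) in your account, you can optionally populate both variables directly with the AWS CLI --query option. This is just a convenience and assumes workspaces[0] is the workspace you intend to use:

export COA_AMG_WORKSPACE_ID=$(aws grafana list-workspaces --query 'workspaces[0].id' --output text)
export COA_AMG_ENDPOINT_URL="https://$(aws grafana list-workspaces --query 'workspaces[0].endpoint' --output text)"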

Next, let’s create a Grafana API key from the Amazon Managed Grafana workspace and store it as a secret in AWS Systems Manager Parameter Store. The secret will be accessed by the External Secrets add-on and made available as a native Kubernetes secret in the Amazon EKS cluster:

export AMG_API_KEY=$(aws grafana create-workspace-api-key \
  --key-name "grafana-operator-key" \
  --key-role "ADMIN" \
  --seconds-to-live 432000 \
  --workspace-id $COA_AMG_WORKSPACE_ID \
  --query key \
  --output text)

aws ssm put-parameter --name "/cdk-accelerator/grafana-api-key" \
    --type "SecureString" \
    --value $AMG_API_KEY \
    --region $AWS_REGION
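To confirm the parameter was stored as expected, without printing the key itself, you can query its type, which should return SecureString:

aws ssm get-parameter --name "/cdk-accelerator/grafana-api-key" \
    --region $AWS_REGION \
    --query 'Parameter.Type' \
    --output text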

Bootstrap CDK

The first step of any CDK deployment is bootstrapping the environment. cdk bootstrap is a command of the AWS CDK Toolkit (the CDK command-line interface) that prepares an environment (that is, a combination of AWS account and AWS Region) with the resources CDK requires to perform deployments into that environment. CDK bootstrapping is needed once for each account/Region combination, so if you have already bootstrapped CDK in a Region, you don’t need to repeat the process.

Execute the command below to bootstrap the AWS environment in your Region:

cdk bootstrap aws://$ACCOUNT_ID/$AWS_REGION
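If you are unsure whether the environment is already bootstrapped, you can check the status of the bootstrap stack, which is named CDKToolkit by default; a CREATE_COMPLETE or UPDATE_COMPLETE status means you can skip this step:

aws cloudformation describe-stacks --stack-name CDKToolkit \
    --region $AWS_REGION \
    --query 'Stacks[0].StackStatus' \
    --output text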

Deploy the open-source observability pattern

Clone the cdk-aws-observability-accelerator repository and install the dependency packages. This repository contains CDK v2 code written in TypeScript.

git clone https://github.com/aws-observability/cdk-aws-observability-accelerator.git
cd cdk-aws-observability-accelerator
make deps
make build
make list

The settings for the Grafana dashboard JSON files are expected to be specified in the CDK context. Generally, such settings are specified in the cdk.context.json file in the current directory or in ~/.cdk.json in your home directory. For this pattern, update the context in the cdk.json file located in the cdk-aws-observability-accelerator directory. Pay particular attention to gpuNodeGroup below: it defines the instanceType used for the GPU workloads, along with the size of the GPU node group and its EBS volumes.

Settings for this deployment are:

"context": {
    "fluxRepository": {
        "name": "grafana-dashboards",
        "namespace": "grafana-operator",
        "repository": {
            "repoUrl": "https://github.com/aws-observability/aws-observability-accelerator",
            "name": "grafana-dashboards",
            "targetRevision": "main",
            "path": "./artifacts/grafana-operator-manifests/eks/infrastructure"
         },
         "values": {
             "GRAFANA_CLUSTER_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/cluster.json",
             "GRAFANA_KUBELET_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/kubelet.json",
             "GRAFANA_NSWRKLDS_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/namespace-workloads.json",
             "GRAFANA_NODEEXP_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodeexporter-nodes.json",
             "GRAFANA_NODES_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodes.json",
             "GRAFANA_WORKLOADS_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/workloads.json"
         },
         "kustomizations": [
             {
                  "kustomizationPath": "./artifacts/grafana-operator-manifests/eks/infrastructure"
             },
             {
                  "kustomizationPath": "./artifacts/grafana-operator-manifests/eks/gpu"
             }
          ]
    },
    "gpuNodeGroup": {
        "instanceType": "g4dn.xlarge",
        "desiredSize": 2,
        "minSize": 2,
        "maxSize": 3,
        "ebsSize": 50
     }
}

Now, let’s run the below command to deploy the open-source observability pattern:

make pattern single-new-eks-gpu-opensource-observability deploy

Once the deployment completes, run the update-kubeconfig command. You can copy the command from the CDK output message.

aws eks update-kubeconfig --name single-new-eks-opensource-observability-accelerator --region <your region> --role-arn arn:aws:iam::xxxxxxxxx:role/single-new-eks-opensource-singleneweksopensourceob-XXXXXXXX

Validation

Validate the resources created by our deployment:

kubectl get pods -A

Figure 2: Resources deployed by the GPU Monitoring CDK Observability Accelerator Pattern.

Next, let’s verify that each node has allocatable GPUs:

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

Figure 3: Nodes having allocatable GPUs

nvidia-smi (also NVSMI) provides monitoring and management capabilities for each of NVIDIA’s devices from the Fermi and higher architecture families. We can now deploy a Pod running the nvidia-smi binary, which shows diagnostic information about all GPUs visible to the container:

cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: "nvidia/cuda:11.0.3-base-ubuntu20.04"
    args:
    - "nvidia-smi"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

Then request the logs from the Pod:

kubectl logs nvidia-smi

Figure 4: nvidia-smi logs from Amazon CloudWatch

Visualization – Grafana NVIDIA DCGM Exporter Dashboard

Log in to your Amazon Managed Grafana workspace and navigate to the Dashboards panel. You should see a dashboard named NVIDIA DCGM Exporter Dashboard under the Observability Accelerator Dashboards folder. The Grafana Operator and Flux work together to keep your dashboards synchronized with Git, so if you delete a dashboard by accident, it will be re-provisioned automatically.
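You can also observe this synchronization from the cluster side. The following sketch assumes the Grafana Operator and Flux custom resource definitions installed by the pattern are in place; it simply lists the dashboard and kustomization objects that are reconciled from Git:

# GrafanaDashboard resources managed by the Grafana Operator
kubectl get grafanadashboards -A

# Flux Kustomization resources that sync the dashboard manifests from Git
kubectl get kustomizations -A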

To see some metrics in the dashboard, we need to generate load. To accomplish that, we deploy a Pod with an image running dcgmproftester, a tool capable of generating deterministic CUDA workloads for reading and validating GPU metrics. The tool’s main parameters are -t, for generating load for a particular metric, and -d, for specifying the test duration. In our test, we set the metric to 1004, to generate a half-precision matrix-multiply-accumulate on the Tensor Cores, and we set the duration to 2 minutes.

Run the following command from the terminal:

cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: dcgmproftester
spec:
  restartPolicy: OnFailure
  containers:
  - name: dcgmproftester11
    image: nvidia/samples:dcgmproftester-2.0.10-cuda11.0-ubuntu18.04
    args: ["--no-dcgm-validation", "-t 1004", "-d 120"]
    resources:
      limits:
        nvidia.com/gpu: 1
    securityContext:
      capabilities:
        add: ["SYS_ADMIN"]
EOF

To verify that the Pod was successfully deployed, run:

kubectl get pods

Figure 5: Screenshot showing load generation

After a few minutes, looking at the NVIDIA DCGM Exporter Dashboard, you should see the gathered metrics, similar to the following:

Figure 6: Amazon Managed Grafana workspace showing NVIDIA DCGM Exporter Dashboard

Figure 7: NVIDIA DCGM Exporter Dashboard with GPU SM Clocks and GPU Utilization metrics

Figure 8: NVIDIA DCGM Exporter Dashboard with GPU SM Clocks and GPU Utilization metrics

Looking at the graphs above, you can see that the curves span a 2-minute time frame, matching the -d 120 parameter we set earlier for dcgmproftester.
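If you prefer to query the metrics directly from Amazon Managed Service for Prometheus rather than through Grafana, you can call the workspace query API with a SigV4-signing client such as the awscurl utility. This is a sketch: it assumes awscurl is installed (for example, via pip install awscurl) and that the workspace returned by aws amp list-workspaces is the one used by this pattern:

export AMP_WORKSPACE_ID=$(aws amp list-workspaces --query 'workspaces[0].workspaceId' --output text)

awscurl --region $AWS_REGION --service aps \
  "https://aps-workspaces.$AWS_REGION.amazonaws.com/workspaces/$AMP_WORKSPACE_ID/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL"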

Clean up

You can teardown the whole CDK stack with the following command:

make pattern single-new-eks-gpu-opensource-observability destroy
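The Grafana API key and the Systems Manager parameter created earlier are not part of the CDK stack, so you may also want to remove them. The commands below assume the key name and parameter name used in this post:

aws grafana delete-workspace-api-key \
    --key-name "grafana-operator-key" \
    --workspace-id $COA_AMG_WORKSPACE_ID

aws ssm delete-parameter --name "/cdk-accelerator/grafana-api-key" --region $AWS_REGION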

Conclusion

In this blog post, we showed you how to set up robust observability for GPU workloads deployed in an Amazon EKS cluster running on Amazon EC2 instances featuring NVIDIA GPUs. We described how the cluster is instrumented with the NVIDIA GPU Operator add-on to deploy the software components necessary to gather insightful GPU metrics, and explained that the operator’s main components are DCGM and DCGM Exporter. We walked you through the deployment of the GPU observability pattern and introduced the NVIDIA DCGM Exporter Dashboard. We are currently working on further patterns to capture telemetry from ML workloads running on other types of accelerated ML chips. We recommend that you try these patterns as they become available, and continue to support and contribute to our AWS CDK Observability Accelerator open-source project.

For more information, see the following references:

 

About the authors:

 

Riccardo Freschi

Riccardo Freschi is a Sr. Solutions Architect at AWS focusing on application modernization. He works closely with partners and customers to help them transform their IT landscapes on their journey to the AWS Cloud, by refactoring existing applications and building new, cloud-native ones.

Elamaran Shanmugam

Elamaran (Ela) Shanmugam is a Sr. Container Specialist Solutions Architect with Amazon Web Services. Ela is a Container, Observability and Multi-Account Architecture SME and helps AWS partners and customers design and build scalable, secure, and optimized container workloads on AWS. His passion is building and automating infrastructure so that customers can focus more on their business. He is based out of Tampa, Florida and can be reached on Twitter @IamElaShan and on GitHub.

Imaya Kumar Jagannathan

Imaya is a Principal Solutions Architect focused on AWS observability tools, including Amazon CloudWatch, AWS X-Ray, Amazon Managed Service for Prometheus, Amazon Managed Grafana, and AWS Distro for OpenTelemetry. He is passionate about monitoring and observability and has a strong application development and architecture background. He likes working on distributed systems and is excited to talk about microservice architecture design. He loves programming in C#, working with containers and serverless technologies. LinkedIn: /imaya.