AWS Cloud Operations Blog
Gain operational insights for NVIDIA GPU workloads using Amazon CloudWatch Container Insights
As machine learning models grow more advanced, they require extensive computing power to train efficiently. Many organizations are turning to GPU-accelerated Kubernetes clusters for both model training and online inference. However, properly monitoring GPU usage is critical for machine learning engineers and cluster administrators to understand model performance and optimize infrastructure utilization. Without visibility into how models use GPU resources over time, it is difficult to optimize cluster utilization, troubleshoot anomalies, and ensure models are training as quickly as possible. Machine learning practitioners need an easy-to-use observability solution for monitoring GPUs so they can correlate metric patterns with model behavior and infrastructure utilization.
For workloads that require distributed training, Elastic Fabric Adapter (EFA) metrics become as important as individual node performance, because understanding inter-node communication during distributed model training is another aspect of validating model performance and infrastructure health.
Historically, customers needed to manually install multiple agents, such as the NVIDIA DCGM exporter for GPU metrics, and depend on custom-built Prometheus Node Exporters for EFA metrics. Additionally, they had to build custom dashboards and alarms to visualize and monitor these metrics. To address the challenges of monitoring GPUs on Amazon Elastic Kubernetes Service (Amazon EKS), along with the performance of inter-node communication over EFAs, Amazon CloudWatch has extended Container Insights for Amazon EKS with accelerated compute observability, including support for NVIDIA GPUs and EFAs.
Container Insights for Amazon EKS deploys and manages the lifecycle of the NVIDIA DCGM exporter, which collects GPU metrics from NVIDIA's drivers and exposes them to CloudWatch. Once you onboard to Container Insights, CloudWatch automatically detects NVIDIA GPUs in your environment, collects critical health and performance metrics for them as CloudWatch metrics, and makes them available on curated out-of-the-box dashboards. You can also set up CloudWatch alarms and create additional CloudWatch dashboards for the metrics available under the "ContainerInsights" namespace. It gathers performance metrics such as GPU temperature, GPU utilization, GPU memory utilization, and more. A complete list of metrics can be found in the user guide.
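For example, once these metrics are flowing you could alarm on sustained high GPU utilization. The following is a minimal sketch using the AWS CLI, assuming the node_gpu_utilization metric name and a ClusterName dimension as published under the ContainerInsights namespace; check the user guide for the exact metric names and dimensions before relying on it.

```bash
# Alarm when average GPU utilization across the cluster stays above 90%
# for three consecutive 5-minute periods.
aws cloudwatch put-metric-alarm \
  --alarm-name gpu-utilization-high \
  --namespace ContainerInsights \
  --metric-name node_gpu_utilization \
  --dimensions Name=ClusterName,Value=YOUR_CLUSTER_NAME \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 90 \
  --comparison-operator GreaterThanThreshold
```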
Container Insights for Amazon EKS leverages file system counter metrics to gather and publish Elastic Fabric Adapter (EFA) metrics to CloudWatch. Using EFA metrics, you can understand the traffic impact on tasks running on your EKS clusters and monitor your latency-sensitive training jobs. It gathers performance metrics such as received bytes, transmitted bytes, Remote Direct Memory Access (RDMA) throughput, number of dropped packets, and more. A complete list of metrics can be found in the user guide.
In this post, we'll explore how to use Container Insights with enhanced observability for Amazon EKS to quickly gain insights into GPUs and EFAs running on EKS, using CloudWatch and CloudWatch Container Insights.
Solution Overview
You can enable Container Insights for Amazon EKS either through manual installation using the quick start setup for an Amazon EKS cluster, or by installing the Amazon CloudWatch Observability EKS add-on, which is the recommended method.
In this example, you will see how to set up an Amazon EKS demo cluster with the CloudWatch Observability EKS add-on on a supported NVIDIA GPU-backed and EFA-capable instance type. Furthermore, we will see how Container Insights with enhanced observability for EKS provides a unified view of cluster health, infrastructure metrics, and the GPU/EFA metrics required to optimize machine learning workloads.
Following are the components we are going to deploy in this solution:
- An EKS cluster with a managed node group backed by GPU-based Amazon EC2 instances from the supported EFA instance types
- The Amazon CloudWatch Observability EKS add-on for collecting metrics and logs.
- Utilities to generate load for GPU and EFA.
Figure 1: Container Insights with enhanced observability for EKS gathering GPU & EFA metrics for NVIDIA instances.
Prerequisites
You will need the following to complete the steps in this post:
Environment setup
1. Provide the AWS Region (aa-example-1) along with two AWS Availability Zones (aa-example-1a, aa-example-1b) available in the Region where you would like to deploy your EKS cluster, and run the following commands.
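A minimal sketch of those commands is below; the Region, Availability Zones, and cluster name are illustrative placeholders that you should replace with your own values.

```bash
# Set the AWS Region, Availability Zones, and a cluster name used throughout
# this walkthrough. Replace the placeholder values with ones valid for your account.
export AWS_REGION=aa-example-1
export AZ1=aa-example-1a
export AZ2=aa-example-1b
export CLUSTER_NAME=gpu-demo-cluster   # illustrative name
```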
2. To set up a GPU-based Amazon EKS cluster, select EC2 node types that support GPUs. You can find the list of instances that support GPUs at GPU-based Amazon EC2 instances and supported EFA instance types.
3. For the demonstration, we have selected g4dn.8xlarge, which is an NVIDIA GPU-backed instance type with EFA availability.
4. Let's create a configuration file for the Amazon EKS cluster by executing the command below.
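A sketch of such a configuration, written as an eksctl ClusterConfig and using the variables set earlier; the Kubernetes version, node group name, and node count are illustrative assumptions.

```bash
# Write an eksctl ClusterConfig for a managed node group of two g4dn.8xlarge nodes.
# efaEnabled asks eksctl to attach EFA network interfaces and the required
# security group rules to the node group.
cat <<EOF > cluster-config.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${CLUSTER_NAME}
  region: ${AWS_REGION}
  version: "1.29"   # illustrative Kubernetes version
availabilityZones: ["${AZ1}", "${AZ2}"]
managedNodeGroups:
  - name: gpu-nodes
    instanceType: g4dn.8xlarge
    desiredCapacity: 2
    # EFA-enabled node groups must sit in a single Availability Zone.
    availabilityZones: ["${AZ1}"]
    efaEnabled: true
EOF
```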
5. Now create the Amazon EKS cluster using the configuration file you just created.
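Assuming eksctl is installed and has credentials for your account, the cluster can be created directly from that file; provisioning typically takes 15 to 20 minutes.

```bash
# Create the EKS cluster and GPU node group from the configuration file.
eksctl create cluster -f cluster-config.yaml
```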
6. Verify that you are connected to the cluster. You should see two nodes of type g4dn.8xlarge listed when you execute the following command:
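For example, using kubectl:

```bash
# List the worker nodes; two g4dn.8xlarge nodes should show a Ready status.
kubectl get nodes -o wide
```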
If you are not connected, run the following command to connect to the cluster:
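A sketch, using the variables set earlier:

```bash
# Point kubectl at the newly created cluster.
aws eks update-kubeconfig --name ${CLUSTER_NAME} --region ${AWS_REGION}
```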
7. Install the EFA device plugin so that pods can access the EFA devices.
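One way to install it is from the public eks-charts Helm repository; this assumes Helm is available on your machine and that the aws-efa-k8s-device-plugin chart is the version you want.

```bash
# Install the AWS EFA Kubernetes device plugin so that pods can request
# vpc.amazonaws.com/efa devices on the EFA-enabled nodes.
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install efa eks/aws-efa-k8s-device-plugin --namespace kube-system
```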
8. Store the name of the CloudFormation stack that was created for the EKS cluster's node group in a variable.
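A sketch of how to capture it, assuming jq is installed and that the node group was created by eksctl as shown above:

```bash
# eksctl provisions the managed node group through a CloudFormation stack;
# capture that stack's name so the node IAM role can be looked up next.
STACK_NAME=$(eksctl get nodegroup --cluster ${CLUSTER_NAME} --region ${AWS_REGION} -o json | jq -r '.[].StackName')
echo ${STACK_NAME}
```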
9. Retrieve the IAM role that CloudFormation created automatically into the ROLE_NAME variable; you will attach permissions to this role so that the nodes can store metrics and logs data in CloudWatch.
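For example, you can query the stack's resources for the IAM role it created:

```bash
# Look up the IAM role resource in the node group stack and store its name.
ROLE_NAME=$(aws cloudformation describe-stack-resources \
  --stack-name ${STACK_NAME} \
  --query "StackResources[?ResourceType=='AWS::IAM::Role'].PhysicalResourceId" \
  --output text)
echo ${ROLE_NAME}
```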
10. Attach the "CloudWatchAgentServerPolicy" to the Amazon EKS node role.
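A sketch using the AWS CLI and the ROLE_NAME variable from the previous step:

```bash
# Allow the worker nodes to publish metrics and logs to CloudWatch.
aws iam attach-role-policy \
  --role-name ${ROLE_NAME} \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
```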
11. Install the CloudWatch Observability EKS add-on on the Amazon EKS cluster.
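For example, using the AWS CLI:

```bash
# Install the Amazon CloudWatch Observability add-on on the cluster.
aws eks create-addon \
  --cluster-name ${CLUSTER_NAME} \
  --addon-name amazon-cloudwatch-observability \
  --region ${AWS_REGION}
```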
12. Verify that the CloudWatch Observability EKS add-on has been created and is active. The status should show as "ACTIVE".
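One way to check:

```bash
# The add-on should report ACTIVE once the rollout completes.
aws eks describe-addon \
  --cluster-name ${CLUSTER_NAME} \
  --addon-name amazon-cloudwatch-observability \
  --region ${AWS_REGION} \
  --query "addon.status" --output text
```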
GPU Observability test case
Now that you have deployed the Amazon EKS cluster with GPU nodes, let's generate GPU load using the gpu-burn utility.
1. Generate GPU load using the gpu-burn utility by applying the following deployment manifest:
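A sketch of such a manifest, applied inline with kubectl, is shown below. The container image reference, binary path, and run duration are illustrative placeholders; substitute a gpu-burn image that you have built or that you trust.

```bash
# Deploy a gpu-burn workload: each replica requests one NVIDIA GPU and runs a
# sustained compute load so that utilization metrics have something to show.
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-burn
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-burn
  template:
    metadata:
      labels:
        app: gpu-burn
    spec:
      containers:
        - name: gpu-burn
          # Placeholder image; point this at your own gpu-burn build.
          image: docker.io/REPLACE_ME/gpu-burn:latest
          # The binary path depends on how your image was built.
          command: ["/app/gpu_burn"]
          args: ["3600"]   # run the load for 3600 seconds
          resources:
            limits:
              nvidia.com/gpu: 1
EOF
```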
Container Insights dashboards
Container Insights additionally provides out-of-the-box dashboards where you can analyze aggregated metrics at the cluster, namespace, and service levels. More importantly, it delivers drill-down capabilities that provide insights at the node, pod, container, and GPU device levels. This enables machine learning practitioners to identify bottlenecks throughout the stack. With highly granular visualizations of metrics like memory usage and utilization, you can quickly pinpoint issues, whether they lie in a particular node, pod, or even a specific GPU.
You can navigate to the CloudWatch console, expand "Insights", and select "Container Insights". This opens a landing page where you can see a performance and status summary of GPUs across your EKS clusters. Furthermore, you can slice and dice GPU performance to see the top 10 clusters, nodes, workloads, pods, and containers running in your AWS account, as shown below.
Figure 2: CloudWatch Container Insights Dashboard – Top 10 Utilization
You can select the View performance dashboards link in the top right-hand corner to access the detailed dashboard view. In the detailed performance dashboard view, you can access your accelerated compute telemetry out of the box, as shown below.
Figure 3: CloudWatch Container Insights Dashboard – Cluster Level Performance View
You can either use the hierarchy map to drill down or click on graph labels to view container-level dashboards and get aggregated metrics by container or GPU device. This is especially useful for instance types with multiple GPU devices, allowing you to see the utilization of each GPU and understand how fully you are utilizing your hardware. With this visibility, you can carefully tune workload placement across GPUs to balance resource usage and remove the guesswork around how container scheduling will impact per-GPU performance.
The dashboard below shows an aggregated performance view at the container level, along with GPU utilization broken down by container and by pod.
Figure 4: CloudWatch Container Insights Dashboard GPU Metrics – Pod Level, Container Level
You can also aggregate the GPU metrics at the GPU device level, which provides an overview of how each GPU device is performing, as shown below.
Figure 5: CloudWatch Container Insights Dashboard GPU Metrics aggregated by GPU Device.
You can visualize the EFA metrics at the node, pod, and container levels as part of the CloudWatch Container Insights dashboard by selecting the respective radio buttons, as shown in the diagram below.
Figure 6: CloudWatch Container Insights Dashboard EFA Metrics – Node Level
Container Insights makes it easy to monitor the efficiency of resource consumption by your distributed deep learning and inference algorithms, so that you can optimize resource allocation and minimize disruptions in your applications. Using Container Insights, you now have detailed observability into your accelerated compute environment with automatic, out-of-the-box visualizations.
Clean up
You can tear down the whole stack using the commands below.
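A sketch of the teardown, assuming the deployment and cluster names used earlier in this post:

```bash
# Remove the load generator, then delete the cluster and everything eksctl created for it.
kubectl delete deployment gpu-burn
eksctl delete cluster -f cluster-config.yaml
```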
Conclusion
In this blog post, we showed how to set up robust observability for GPU workloads running in an accelerated compute environment deployed on an Amazon EKS cluster, using Amazon EC2 instances featuring NVIDIA GPUs and Elastic Fabric Adapters. We also explored the dashboards and drilled down through the different layers to understand the performance of GPUs and EFAs at the cluster, pod, container, and GPU device levels.
For more information, see the following references:
- AWS Observability Best Practices Guide for Amazon CloudWatch Container Insights
- One Observability Workshop