Posted On: Mar 11, 2024

Amazon CloudWatch Container Insights with Enhanced Observability for EKS now auto-discovers critical health and performance metrics from your NVIDIA GPUs and presents them in automatic dashboards, enabling faster problem isolation and troubleshooting for your AI/ML workloads. Container Insights with Enhanced Observability delivers out-of-the-box trends and patterns on your infrastructure health and removes the overhead of setting up dashboards and alarms manually, saving you time and effort.

Using enhanced observability in Container Insights, you can now see whether the GPUs and memory on your accelerated instances are healthy and ensure that your training jobs remain performant. You can pinpoint errors and quickly drill down to identify the root cause, minimizing long disruptions to your training jobs. Enhanced Container Insights delivers accelerated compute observability in curated visualizations, so you can monitor how efficiently your distributed training models consume resources and optimize your allocations accordingly.

Getting started with accelerated compute observability is easy. You can onboard to Enhanced Container Insights either by installing the CloudWatch Observability add-on in your clusters or by manually installing the CloudWatch agent to enable enhanced observability. Once configured, you can navigate to the Container Insights console and view your NVIDIA GPU telemetry out of the box.
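The add-on route could be scripted roughly as follows. This is a minimal Python sketch, not an official procedure: it assumes boto3 is installed, AWS credentials are configured, the IAM permissions the CloudWatch agent needs are already in place, and the add-on is registered under the name `amazon-cloudwatch-observability`. The helper names `build_addon_request` and `install_addon` are hypothetical and exist only for illustration.

```python
from typing import Any, Dict, Optional

# Assumption: the name under which the CloudWatch Observability add-on
# is registered with EKS.
ADDON_NAME = "amazon-cloudwatch-observability"


def build_addon_request(cluster_name: str,
                        addon_version: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for the EKS CreateAddon API call (hypothetical helper)."""
    if not cluster_name:
        raise ValueError("cluster_name must be non-empty")
    request: Dict[str, Any] = {
        "clusterName": cluster_name,
        "addonName": ADDON_NAME,
    }
    # Pin a version only when the caller asks for one; otherwise EKS
    # picks its default version for the cluster.
    if addon_version:
        request["addonVersion"] = addon_version
    return request


def install_addon(cluster_name: str, region: str) -> Dict[str, Any]:
    """Send the CreateAddon request (hypothetical helper; needs AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    eks = boto3.client("eks", region_name=region)
    return eks.create_addon(**build_addon_request(cluster_name))
```

Calling `install_addon("my-cluster", "us-west-2")` would submit the request for a placeholder cluster and Region; substitute your own values. Keeping the request builder separate from the API call lets you inspect the parameters before anything is sent.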

NVIDIA GPU metrics are now available in Container Insights with Enhanced Observability for EKS in all public AWS Regions, including the AWS GovCloud (US) and China Regions. NVIDIA GPU metrics follow observation-based pricing; see the Container Insights pricing page for details. For further information, see the Container Insights user guide.

04/22 - Post has been updated to provide instructions on the manual getting-started experience.