Posted On: Apr 23, 2024

Amazon CloudWatch Container Insights with Enhanced Observability for EKS now auto-discovers critical health metrics from AWS Trainium and AWS Inferentia accelerators, AWS high-performance network adapters (Elastic Fabric Adapters), and NVIDIA GPUs. You can visualize these out-of-the-box metrics in curated Container Insights dashboards to monitor your accelerated infrastructure and optimize your AI workloads for operational excellence.

With Enhanced Container Insights, you can now correlate compute and memory metrics with internode network metrics to understand how traffic affects tasks running on your EKS clusters, for example when monitoring latency-sensitive training jobs. You can also monitor how efficiently your distributed deep learning and inference workloads consume resources, so that you can optimize resource allocation and minimize long disruptions in your applications. Enhanced Container Insights delivers accelerated compute observability with automatic visualizations and removes the need for manual dashboard creation and alarm setup.

Getting started with accelerated compute observability is easy. You can onboard to Enhanced Container Insights either by installing the CloudWatch Observability add-on in your clusters or by manually installing the CloudWatch agent with enhanced observability enabled. Once configured, you can navigate to the Container Insights console and view your accelerated compute telemetry out of the box.
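
As an illustration of the add-on route, the following is a minimal Python sketch that uses the boto3 EKS client's create_addon call. It is not an official procedure. The cluster name, Region, and role ARN are placeholders, the helper functions are hypothetical, and the add-on identifier amazon-cloudwatch-observability is assumed here and should be checked against the Container Insights user guide. The add-on also needs IAM permissions to publish telemetry, which this sketch does not set up.

```python
"""Sketch: request the CloudWatch Observability add-on for an EKS cluster."""

# Assumed add-on identifier; verify it in the Container Insights user guide.
ADDON_NAME = "amazon-cloudwatch-observability"


def build_addon_request(cluster_name, role_arn=None):
    """Return keyword arguments for the EKS CreateAddon API call.

    Hypothetical helper: it only assembles parameters, so it can be
    inspected or tested without AWS credentials.
    """
    if not cluster_name:
        raise ValueError("cluster_name must be a non-empty string")
    request = {
        "clusterName": cluster_name,
        "addonName": ADDON_NAME,
        # Overwrite conflicting settings from a previous self-managed install.
        "resolveConflicts": "OVERWRITE",
    }
    if role_arn:
        # Optional IAM role for the add-on's service account.
        request["serviceAccountRoleArn"] = role_arn
    return request


def install_addon(cluster_name, region, role_arn=None):
    """Send the CreateAddon request and return the reported add-on status."""
    import boto3  # imported lazily so the helper above has no dependencies

    eks = boto3.client("eks", region_name=region)
    response = eks.create_addon(**build_addon_request(cluster_name, role_arn))
    return response["addon"]["status"]


if __name__ == "__main__":
    # Placeholder values; running this requires AWS credentials and a cluster.
    print(install_addon("my-training-cluster", "us-east-1"))
```

Keeping the parameter assembly separate from the API call makes it possible to review exactly what will be requested before anything is sent to AWS.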

Accelerated compute observability is now available in Enhanced Container Insights for EKS in all commercial AWS Regions, including the AWS GovCloud (US) and China Regions. Accelerated compute metrics follow observation-based pricing; see the Container Insights pricing page for details. For further information, see the Container Insights user guide.