New EFA metrics for improved observability of AWS networking

Posted on: Sep 12, 2025

Today, AWS has introduced five new Elastic Fabric Adapter (EFA) metrics to enhance network observability for AI/ML and High Performance Computing (HPC) workloads. These new metrics help diagnose performance issues by tracking retransmitted packets and bytes, retransmit timeout events, impaired remote connection events, and unresponsive remote receiver events.

With these new metrics, you can monitor for network congestion or instance configuration issues and take timely action to maintain application performance. The metrics are implemented as counters at the per-EFA device level and accumulate from instance launch or the most recent driver reset. The counters are stored in the sys filesystem and can be read from the instance command line. For enhanced monitoring and alerting, you can export the counters to Prometheus with custom scripts and then use third-party tools such as Grafana to create dashboards and set alarms.

The new metrics are available on Nitro v4 (and later) instances and require EFA installer version 1.43.0 or higher. For a full list of metrics and to learn how to use them, please visit the Monitor an EFA user guide. For a comprehensive list of instances built on different Nitro system versions, please refer to the AWS Nitro Systems documentation.
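As an illustration of reading the counters and preparing them for Prometheus, here is a minimal Python sketch. It is not taken from the EFA user guide. The sysfs layout (`/sys/class/infiniband/<device>/ports/<port>/hw_counters/`), the `efa_` metric prefix, and the function names are all assumptions made for this example, so confirm the actual paths and counter names against the Monitor an EFA user guide before relying on it.

```python
from pathlib import Path

# Assumed sysfs location of EFA device counters; verify on your instance.
SYSFS_ROOT = Path("/sys/class/infiniband")


def read_efa_counters(root=SYSFS_ROOT):
    """Return {(device, port, counter_name): value} for every readable counter."""
    counters = {}
    for path in sorted(Path(root).glob("*/ports/*/hw_counters/*")):
        try:
            value = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # skip unreadable or non-numeric entries
        # Path shape: <root>/<device>/ports/<port>/hw_counters/<name>
        device, port = path.parts[-5], path.parts[-3]
        counters[(device, port, path.name)] = value
    return counters


def counter_delta(previous, current):
    """Increase between two samples of a cumulative counter.

    The counters restart after a driver reset, so a drop in value is
    treated as a reset and the current reading is returned as the delta.
    """
    return current - previous if current >= previous else current


def to_prometheus_text(counters, prefix="efa_"):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        f'{prefix}{name}{{device="{device}",port="{port}"}} {value}'
        for (device, port, name), value in sorted(counters.items())
    ]
    return "\n".join(lines) + "\n" if lines else ""
```

One possible way to use this is to run `to_prometheus_text(read_efa_counters())` on a schedule and write the result to a file that a Prometheus exporter picks up, for example the node_exporter textfile collector. Because the values are cumulative, alerting on the change between samples (as in `counter_delta`) is usually more meaningful than alerting on the raw totals.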

These new metrics are supported in all commercial AWS Regions, the AWS GovCloud (US) Regions, and the China Regions. To learn more about EFA, please visit the EFA documentation.