Networking & Content Delivery

Monitoring EC2 Connection Tracking utilization using a new network performance metric

In 2020, Amazon Elastic Compute Cloud (Amazon EC2) announced new network performance metrics for EC2 instances, made available through the ENA driver and the Amazon CloudWatch agent. We covered the launch in this post. These network performance metrics give customers visibility into the number of packets queued or dropped when an instance’s networking allowances, such as Network Bandwidth, Packets-Per-Second (PPS), Connections Tracked, and Link-local service access (Amazon DNS, Instance Metadata Service, Amazon Time Sync), are exceeded. By monitoring these metrics, you can quickly troubleshoot network performance issues and manage fleet capacity by scaling your EC2 instances up or out to accommodate surges in network traffic demand. These metrics can also help you right-size your EC2 instances: you can run network performance benchmark tests against your workload traffic to understand how your workload performs on a given instance size.

Many of our customers running workloads that establish a large number of network connections have asked for visibility into their instance’s Connections Tracked allowance utilization, so that they can take proactive scale-up or scale-out actions before exhausting the allowance. We are announcing the availability of an additional network performance metric, conntrack_allowance_available, which reports the number of tracked connections that can still be established before the instance’s Connections Tracked allowance is exceeded. This new metric is available in all AWS Commercial and GovCloud (US) Regions, as well as China (Beijing) and China (Ningxia). Just like the other network performance metrics, you can publish this metric to CloudWatch using our Unified CloudWatch Agent or your favorite third-party observability solution. If you need help determining when connections are tracked, see the details here.
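
As an illustration, here is a minimal sketch of how that could look using the Unified CloudWatch Agent’s ethtool section, written from a shell on the instance. The file path, the interface name eth0, and the choice to collect only the two Connection Tracking counters are assumptions for this example; adapt them to your environment:

$ cat <<'EOF' | sudo tee /opt/aws/amazon-cloudwatch-agent/etc/conntrack-metrics.json
{
  "metrics": {
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "ethtool": {
        "interface_include": ["eth0"],
        "metrics_include": [
          "conntrack_allowance_available",
          "conntrack_allowance_exceeded"
        ]
      }
    }
  }
}
EOF
$ sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a append-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/etc/conntrack-metrics.json

With a configuration along these lines, the agent publishes the counters to CloudWatch (by default under the CWAgent namespace), where they can be graphed and alarmed on alongside your other instance metrics.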

Why is this important?

Connectivity failures caused by exceeding Connections Tracked allowances can have a larger impact than those resulting from exceeding other allowances. When relying on TCP to transfer data, packets that are queued or dropped due to exceeding EC2 instance network allowances, such as Bandwidth, PPS, etc., are typically handled gracefully thanks to TCP’s congestion control capabilities. Impacted flows will be slowed down, and lost packets will be retransmitted. However, when an instance exceeds its Connections Tracked allowance, no new connections can be established until some of the existing ones are closed to make room for new connections.

Before this launch, customers could only scale out their instances after the Connections Tracked allowance had already been exceeded. This new metric lets customers further improve service reliability by proactively tracking Connections Tracked allowance utilization, so they can scale EC2 instance capacity, plan for emergent connection demand, and understand connection usage trends.

What does this metric look like?

Like other network performance metrics, this metric is available through Operating System (OS) tools. Here’s a sample output showing the new conntrack_allowance_available metric on a c6i.2xlarge EC2 instance running Amazon Linux 2:

$ ethtool -S eth0 | grep conntrack
     conntrack_allowance_available: 272548
     conntrack_allowance_exceeded: 0

This new metric reports the number of tracked connections that can still be established by the instance before hitting the Connections Tracked allowance of that instance type. In the example above, this instance can establish up to 272,548 more tracked connections. Once the limit is reached, attempts to open new connections fail, and you’ll see the conntrack_allowance_exceeded metric reporting a positive number of dropped packets.
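
Before wiring the counters into CloudWatch, you can also watch the headroom directly from the OS. The following is a small sketch for a Linux instance; the interface name eth0 and the 5-second interval are assumptions:

while true; do
  date +%T                                    # timestamp each sample
  ethtool -S eth0 | grep conntrack_allowance  # print both conntrack counters
  sleep 5
done

Each iteration prints a timestamp along with the current conntrack_allowance_available and conntrack_allowance_exceeded values, which makes it easy to spot the headroom shrinking under load.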

Applying this new metric to real-world scenarios

Here are a couple of scenarios inspired by real situations faced by AWS customers who contacted AWS Support, and that we, the authors of this post, worked through with them to resolution. We then demonstrate how you can leverage this new metric in these situations. We’re certain that you’ll find it useful in other situations, too!

Scenario #1: Keep up with demand

John works as a Site Reliability Engineer (SRE) for a start-up that has standardized on a fleet of inline Intrusion Prevention System (IPS) appliances for inspecting all North-South traffic using Gateway Load Balancer (GWLB). He initially deployed four IPS appliances as EC2 instances in a shared services VPC, sized based on estimated traffic volume forecasts. John sets up VPC route tables so that traffic that must be inspected is sent through GWLB and then forwarded to the fleet of four c6i.2xlarge instances running as IPS appliances. Traffic between EC2 instances and GWLB is always tracked, regardless of Security Group settings. As the start-up grows, John onboards new external services into VPCs while securing their traffic by sending it through the GWLB-backed IPS appliances. John gets paged into a service outage call: customers are complaining about intermittent timeouts at different times of the day while accessing cloud services managed by John’s SRE team. He quickly checks the health of the IPS appliances’ EC2 instance CPU and memory consumption, but he doesn’t find anything unusual.

John dives deep into the EC2 instance network health by checking the network performance metrics, which include the conntrack_allowance_available metric emitted by the ENA driver. He makes a striking observation: the customer request timeout timestamps align with the timestamps when the conntrack_allowance_available metric shows few available entries. Furthermore, he observes a gradual decrease in available Connections Tracked entries as the network traffic volume on the IPS appliances increases. This is shown in the following graph.

Figure 1: The 'conntrack_allowance_available' metric shows available EC2 connections tracked entries decreasing during the peak traffic window on the IPS appliances.

Having found the issue, John quickly scales out the IPS appliance fleet to six c6i.2xlarge instances. In addition to this scale-out activity, he also sets up alarms based on this metric so that alerts are triggered when the available tracked connection entries drop below 30,000, allowing sufficient time for the scale-out activity to complete. John has now added the Connections Tracked available metric as one of several inputs for assessing EC2 instance network health. This helps him manage IPS service capacity to keep up with the growing traffic demands of the start-up.
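
As a sketch of what such an alarm could look like with the AWS CLI, assuming the metric is published by the CloudWatch agent into its default CWAgent namespace (the instance ID, alarm name, and SNS topic below are placeholders, and the exact dimension set depends on how the agent is configured, so confirm it in the CloudWatch console first):

$ aws cloudwatch put-metric-alarm \
    --alarm-name ips-conntrack-headroom-low \
    --namespace CWAgent \
    --metric-name conntrack_allowance_available \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --statistic Minimum \
    --period 60 \
    --evaluation-periods 3 \
    --threshold 30000 \
    --comparison-operator LessThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:111122223333:ips-capacity-alerts

Using the Minimum statistic over short periods helps the alarm react to brief dips in available entries rather than averaging them away.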

John has also taken this a step further in managing IPS capacity by leveraging Amazon EC2 Auto Scaling and creating Dynamic Scaling policies that increase or decrease the number of running appliances based on conntrack_allowance_available thresholds. John created a Step Scaling policy that increases the number of running IPS appliances behind GWLB by one whenever the conntrack_allowance_available count falls below 30,000, and decreases the fleet size by one whenever the count goes above 200,000 connections. By leveraging network performance metrics, including the newly available conntrack_allowance_available, John can now keep the IPS service up with demand, proactively scaling resources in and out for improved availability and cost efficiency.
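
A minimal sketch of the scale-out half of such a setup with the AWS CLI could look like the following (the Auto Scaling group and policy names are hypothetical; a low-headroom alarm like the one above would reference the returned policy ARN as its alarm action, and a mirror-image policy with ScalingAdjustment=-1 would handle scale-in above 200,000):

$ aws autoscaling put-scaling-policy \
    --auto-scaling-group-name ips-appliance-asg \
    --policy-name conntrack-headroom-scale-out \
    --policy-type StepScaling \
    --adjustment-type ChangeInCapacity \
    --step-adjustments MetricIntervalUpperBound=0,ScalingAdjustment=1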

Scenario #2: Concerning trends

Márcia works as a Site Reliability Engineer (SRE) for a medium-sized company whose main product is an Enterprise Resource Planning (ERP) solution tailored for colleges and universities. One of their customers, a college in Brazil, recently rolled out services used by all of its more than 7,000 students to view their grades, access reference material published by their teachers, explore items from the library catalog, view balances and make payments, request documents, and more. The solution is deployed to two m6a.2xlarge instances (one for the application, one for the database), behind a third-party firewall from AWS Marketplace, all in the same VPC.

Some users complain about timeouts, which are usually resolved by restarting the desktop application or reloading the web page. Eventually, all of the users become unable to access the ERP modules in any way. In that case, Márcia can usually mitigate the issue, as a last resort, by restarting the application instance. She looks for signs of high CPU, memory, disk, and network utilization, but doesn’t find anything out of the ordinary. Márcia decides to implement monitoring of the network performance metrics emitted by the ENA driver through the Unified CloudWatch agent, including the new conntrack_allowance_available metric, on both of the ERP instances. Once the problem reappears, she looks at the metric graphs and cross-checks them with the times when the application was unavailable. She observes two things:

  1. The conntrack_allowance_available metric on the ERP application instance has an unusual pattern of consistently going down over time.
  2. Times when ERP modules were unavailable were matched by conntrack_allowance_available reaching zero, and by conntrack_allowance_exceeded reporting an increased count.

This leads Márcia to compare the number of established connections in the ERP application and in the firewall appliance. She notices that the ERP application has a much higher count. Márcia enables VPC Flow Logs at the VPC level, with the optional ‘tcp-flags’ field enabled, and checks the logs using Amazon Athena. She observes that a number of client connections remain idle for some time, and when they attempt to send data or close the connection, the packets are silently dropped by the firewall before the ERP application instance ever sees them. She learns that the firewall has an idle timeout, and that it’s configured not to send a session reset once the idle timeout expires. This causes the ERP application instance to keep these connections open indefinitely. Márcia implements a couple of changes: enabling TCP keep-alive on the ERP application, and configuring the firewall to send TCP resets when idle connections expire. As a result, users no longer observe timeouts, and the conntrack_allowance_available metric starts trending up. By using this new metric, Márcia can now identify similar concerning connection trends, helping her proactively troubleshoot TCP timeout-related issues and plan capacity for onboarding new colleges as needed.
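
As an illustration of the keep-alive side of that fix, the following Linux sysctl settings are one possible sketch. The timer values are assumptions chosen to be shorter than a typical firewall idle timeout, not values taken from Márcia’s environment, and they only take effect on sockets for which the application has enabled SO_KEEPALIVE:

# Probe idle TCP connections after 5 minutes, then every 60 seconds,
# and drop the connection after 5 unanswered probes.
sudo sysctl -w net.ipv4.tcp_keepalive_time=300
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=60
sudo sysctl -w net.ipv4.tcp_keepalive_probes=5

To persist settings like these across reboots, the same keys can be added to a file under /etc/sysctl.d/.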

Getting started

This new metric is available on Nitro-based EC2 instances. To use this new metric, the ENA driver version on your instance must be one of the following:

  • Linux ENA driver 2.8.1 or later

Here are instructions on how to update your Linux ENA driver. When running virtual appliances with limited administrative access (e.g., where an unrestricted shell isn’t available), such as some of those offered through AWS Marketplace, you should reach out to your appliance vendor to request that ENA metrics be made available for appliance health observability. This is supported on Nitro-based instances; to see the complete list of those instances, review the EC2 documentation.
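
Before updating, you can check which ENA driver version an instance is currently running directly from the OS (a sketch for Linux; the interface name eth0 is an assumption):

$ ethtool -i eth0 | grep '^version'
$ modinfo ena | grep '^version'

If the reported version is older than 2.8.1, the conntrack_allowance_available counter may not appear in the ethtool statistics until the driver is updated.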

Conclusion

The new conntrack_allowance_available metric is a much sought-after feature that benefits Amazon EC2 customers running network connection-intensive workloads. It lets them further improve service reliability through automatic, proactive scaling of the instance fleet, capacity planning to meet emergent traffic demands, and quicker troubleshooting of network connection issues.

About the authors

Daniel Carmo Olops

Daniel Carmo Olops is a Senior Cloud Support Engineer at AWS Support, focused on Linux and EC2. He is passionate about resolving customers’ complex issues and helping others do the same, partnering with the EC2 and VPC teams to improve the support experience for both customers and engineers. Daniel moved from Brazil to join AWS in Ireland, and eventually relocated to Oregon, where he lives with his wife and children.

Jasmeet Sawhney

Jasmeet Sawhney is a Senior Product Manager at AWS on the VPC product team, based in California. Jasmeet focuses on enhancing the AWS customer experience for instance networking and Nitro encryption. Before joining AWS, she developed products and solutions for hybrid cloud, network virtualization, and cloud infrastructure to meet customers’ changing networking requirements. When not working, she loves golfing, biking, and traveling with her family.