Why is my Amazon Elastic Compute Cloud (Amazon EC2) instance exceeding its network limits when average utilization is low?

Last updated: 2022-05-26

Short description

You can query network performance metrics in real time on instances that support enhanced networking through the Elastic Network Adapter (ENA). These metrics provide a cumulative number of packets queued or dropped on each network interface since the last driver reset. The following are some of the ENA metrics:

  • bw_in_allowance_exceeded: The number of packets queued or dropped because the inbound aggregate bandwidth exceeded the maximum for the instance.
  • bw_out_allowance_exceeded: The number of packets queued or dropped because the outbound aggregate bandwidth exceeded the maximum for the instance.
  • pps_allowance_exceeded: The number of packets queued or dropped because the bidirectional Packets Per Second (PPS) exceeded the maximum for the instance.

In some cases, you might see queuing or drops even though your average bandwidth or PPS as seen in Amazon CloudWatch is low. For example, the NetworkIn, NetworkOut, NetworkPacketsIn, or NetworkPacketsOut metrics in CloudWatch might show amounts that don't suggest a limit being reached.

Resolution

Microbursts are the most common cause of the preceding symptoms. Microbursts are short spikes in demand followed by periods of low or no activity. These bursts only last for seconds, milliseconds, or even microseconds. In the case of microbursts, the CloudWatch metrics listed in the previous section aren't granular enough to reflect them.

How averages are calculated

The EC2 metrics in CloudWatch listed in the previous section are sampled every 1 minute. These metrics capture the total bytes or packets transferred in that period. These samples are then aggregated and published to CloudWatch in 5-minute periods. Each statistic in the period returns a different sample value:

  • Minimum is the sample value with the lowest byte/packet count.
  • Maximum is the sample value with the highest byte/packet count.
  • Sum is the sum of all sample values.
  • SampleCount is the number of aggregated samples (in this case, 5).
  • Average is the average sample value, calculated by dividing Sum by SampleCount.

Average throughput or PPS can be calculated in two ways (a calculation sketch follows this list):

  • Divide Sum by Period (for example, 300 seconds) for a simple 5-minute average.
  • Divide Maximum by 60 seconds for the average during the busiest 1-minute sample.
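
The following is a minimal sketch of both calculations using the CloudWatch API through boto3. It assumes AWS credentials with CloudWatch read access; the Region and instance ID are placeholders.

# A minimal sketch, assuming boto3 and CloudWatch read access; the Region and
# instance ID are placeholders.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

end = datetime.now(timezone.utc)
start = end - timedelta(minutes=5)

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="NetworkOut",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=start,
    EndTime=end,
    Period=300,                     # one 5-minute period
    Statistics=["Sum", "Maximum"],  # total bytes, and bytes in the busiest 1-minute sample
)

for point in response["Datapoints"]:
    avg_gbps = point["Sum"] / 300 * 8 / 1e9                 # simple 5-minute average
    busiest_minute_gbps = point["Maximum"] / 60 * 8 / 1e9   # average during the busiest minute
    print(f"5-minute average: {avg_gbps:.3f} Gbps, busiest minute: {busiest_minute_gbps:.3f} Gbps")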

How microbursts are reflected in CloudWatch metrics

The following is an example of a microburst and how it's reflected in CloudWatch:

  • The instance has a network bandwidth performance of 10 Gbps (1.25 GB/s).
  • In a given sample (60 seconds), an outbound data transfer of 20 GB uses up all of the available bandwidth, causing bw_out_allowance_exceeded to increment. The transfer completes in about 20 seconds, and no further data is sent afterward.
  • The instance remains idle for the remaining 4 samples (240 seconds).

In this example, the average throughput in the 5-minute period is much lower than the throughput during the microburst:

SUM(NetworkOut) / PERIOD = ((20 GB * 1 sample) + (0 GB * 4 samples)) / 300 seconds = ~0.066 GB/s * 8 bits = ~0.533 Gbps

Even if you calculate the throughput based on the highest sample, the average still doesn't reflect the throughput amount:

MAXIMUM(NetworkOut) / 60 seconds = 20 GB / 60 = ~0.333 GB/s * 8 bits = ~2.667 Gbps

Monitoring microbursts

To measure throughput and PPS at a more granular level, use operating system (OS) tools to monitor network statistics. Monitor network statistics more frequently during periods of shaping or high activity. A per-second sampling sketch for Linux follows the lists of tools below.

The following are examples of operating system tools:

Linux

  • sar
  • nload
  • iftop
  • iptraf-ng

Windows

  • Performance Monitor
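
As an illustration of this kind of granular monitoring, the following is a minimal Linux-only Python sketch that samples the kernel's /proc/net/dev counters once per second and prints per-second throughput and PPS. The interface name eth0 is an assumption; adjust it for your instance.

# A minimal Linux-only sketch: sample /proc/net/dev once per second and print
# per-second throughput and packet rates. The interface name "eth0" is an assumption.
import time

INTERFACE = "eth0"

def read_counters(interface):
    # Returns (rx_bytes, rx_packets, tx_bytes, tx_packets) for the interface.
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(interface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[0]), int(fields[1]), int(fields[8]), int(fields[9])
    raise ValueError(f"interface {interface} not found")

prev = read_counters(INTERFACE)
while True:
    time.sleep(1)
    curr = read_counters(INTERFACE)
    in_gbps = (curr[0] - prev[0]) * 8 / 1e9           # inbound throughput
    out_gbps = (curr[2] - prev[2]) * 8 / 1e9          # outbound throughput
    pps = (curr[1] - prev[1]) + (curr[3] - prev[3])   # bidirectional packets per second
    print(f"in: {in_gbps:.3f} Gbps  out: {out_gbps:.3f} Gbps  pps: {pps}")
    prev = curr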

The CloudWatch agent can be used on both Linux and Windows to publish these OS-level network metrics to CloudWatch as custom metrics. These metrics can be published at intervals as low as 1 second. Be aware that high-resolution metrics (those with a period lower than 60 seconds) lead to higher charges. For more information about CloudWatch pricing, see Amazon CloudWatch pricing.
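
The following is a minimal sketch of the underlying mechanism: publishing a 1-second (high-resolution) custom metric with boto3. The namespace, metric name, dimension value, and sample value are placeholders; in practice the CloudWatch agent handles the collection and publishing for you.

# A minimal sketch of publishing a high-resolution custom metric with boto3.
# The namespace, metric name, dimension, and value are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_data(
    Namespace="Custom/Network",
    MetricData=[
        {
            "MetricName": "TxBytesPerSecond",
            "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
            "Value": 1_250_000_000.0,   # example: bytes sent during the last second
            "Unit": "Bytes/Second",
            "StorageResolution": 1,     # 1-second resolution (billed as high resolution)
        }
    ],
)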

It's a best practice to monitor the network performance metrics provided by ENA. To view these metrics, the driver version must be 2.2.10 or later on Linux, or 2.2.2 or later on Windows. For more information, see the relevant documentation for Linux and Windows. The CloudWatch agent can also publish ENA metrics.

For instructions on publishing ENA metrics for Linux, see Collect network performance metrics. For Windows, ENA metrics are available in Performance Monitor, and the CloudWatch agent can publish any metrics available there.
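
On Linux, the ENA counters are exposed through ethtool. The following is a minimal sketch that reads them with a small Python wrapper and prints only the allowance-exceeded counters. It assumes ethtool is installed, the ENA driver version is recent enough to expose these statistics, and the interface name is eth0.

# A minimal sketch: read ENA statistics with "ethtool -S" and print the
# allowance-exceeded counters. The interface name "eth0" is an assumption.
import subprocess

INTERFACE = "eth0"

output = subprocess.run(
    ["ethtool", "-S", INTERFACE],
    capture_output=True, text=True, check=True,
).stdout

for line in output.splitlines():
    if "allowance_exceeded" in line:
        name, _, value = line.strip().partition(":")
        print(f"{name.strip()}: {value.strip()}")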

Preventing microbursts

To prevent microbursts, traffic must be paced at the sender or senders so that it doesn't exceed a maximum throughput or packet rate, which makes microbursts difficult to avoid. Pacing traffic at the sender usually requires application-level changes. Depending on how the change is implemented, OS support for traffic pacing might also need to be available and turned on at the sender, which isn't always possible or practical. Microbursting can also result from too many connections sending packets in a short period, in which case pacing individual senders doesn't help.

For these reasons, it's a best practice to monitor ENA metrics, as stated previously. If the limits are reached often, or if there's evidence that this is impacting your applications, do the following:

  • Scale up: Move to a larger instance size. Larger instances generally have higher allowances. Network-optimized instance types (those with an "n", such as C5n) have higher allowances than their non-network-optimized counterparts.
  • Scale out: Spread traffic across multiple instances to reduce the traffic and contention on each individual instance.

On Linux, in addition to the preceding options, there are potential mitigation options for advanced users:

  • SO_MAX_PACING_RATE: An application can pass this socket option to setsockopt to specify a maximum pacing rate (in bytes per second). The Linux kernel then paces traffic from that socket so that it doesn't exceed the limit (see the sketch after this list). This option requires the following:
    Application-level code changes.
    Support from the kernel.
    The use of Fair Queue (FQ) queuing discipline or the kernel's support for pacing at the TCP layer (applicable to TCP only).
  • Queuing disciplines (qdiscs): Some qdiscs provide support for traffic shaping at various levels. For more information, see the Traffic Control (TC) manual page.
  • Shallow Transmission (Tx) queues: In some scenarios, shallow Tx queues help reduce PPS throttling. This can be achieved in two ways:
    Byte Queue Limits (BQL): BQL dynamically limits the amount of in-flight bytes on Tx queues. BQL is turned on by default on ENA driver versions shipped with the Linux kernel (those ending with a "K"). For ENA driver versions from GitHub (those ending with a "g"), BQL is available as of v2.6.1g and is turned off by default. You can turn on BQL using the enable_bql ENA module parameter.
    Reduced Tx queue length: Reduce the Tx queue length from its default of 1,024 packets to a lower value (minimum 256). You can change this length using ethtool.
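
The following is a minimal Linux-only sketch of SO_MAX_PACING_RATE. The numeric option value (47) comes from the Linux headers because Python's socket module might not expose a named constant for it; the destination address, port, and 100 MB/s cap are placeholders. As noted previously, pacing also requires the fq qdisc on the egress interface or the kernel's TCP-level pacing support.

# A minimal Linux-only sketch of SO_MAX_PACING_RATE. The option value 47 is from
# the Linux headers; the destination and the pacing rate are placeholders.
import socket

SO_MAX_PACING_RATE = getattr(socket, "SO_MAX_PACING_RATE", 47)  # 47 on Linux

MAX_RATE_BYTES_PER_SEC = 100 * 1024 * 1024  # pace this socket at roughly 100 MB/s

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, MAX_RATE_BYTES_PER_SEC)
sock.connect(("10.0.0.10", 8080))

# Data written to this socket is now paced by the kernel so that it doesn't exceed
# the configured rate, which smooths out bursts from this sender.
sock.sendall(b"x" * (10 * 1024 * 1024))
sock.close()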