Improving Performance on AWS and Hybrid Networks
In this post, we provide recommendations to improve network performance on AWS and hybrid networks. In today’s enterprise networking environment, it is becoming common for customers to have multi-gigabit connectivity to AWS either through AWS Direct Connect or over the Internet. Although network bandwidth is fundamental, several other factors come into play for network performance, ranging from physical distance, hardware, and operating system, to the design of the application itself. We will start with some basic concepts and definitions, followed by a review of these factors along with best practices to maximize the utilization of AWS network resources.
The latency is caused by the speed of light and can’t be decreased (unless Einstein was wrong).
Gigabit networks will cause several networking issues to be looked at differently.
W. Richard Stevens, TCP/IP Illustrated, 1994
Basic concepts and definitions
Latency and Round-Trip Time
Latency (or network delay) is the amount of time it takes for data to travel from one point to another. Round-Trip Time (RTT) is the time it takes for data to travel from one point to another and back to the source. Latency and RTT are usually measured in milliseconds (ms).
Bandwidth and Throughput
In computing, bandwidth refers to the maximum rate at which data can be transferred over a path or network. Throughput, on the other hand, refers to the actual rate at which data is transferred by an application over the network. Both are measured in bits per second (bps) or its multiples (Kbps, Mbps, Gbps).
Another important concept is that of a network flow. Flows are typically defined by the 5-tuple: source-ip, destination-ip, protocol, source-port, destination-port. For example, a flow with the 5-tuple (10.1.1.10, 172.16.1.20, tcp, 14512, 80) would indicate an HTTP connection from the IP address 10.1.1.10 to the IP address 172.16.1.20. Sometimes this definition only includes the 3-tuple: source-ip, destination-ip, protocol. Most networking systems track flows for performance reasons (e.g., to cache route lookup results) and to keep all the packets from a particular flow on the same path. This avoids out-of-order packets at the destination.
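As a rough illustration of how flow tracking keeps packets on one path, the following sketch hashes a 5-tuple to select one of several equal paths. The `FiveTuple` type and `pick_path` function are hypothetical names, and SHA-256 is used only for convenience; real network devices use their own hash functions and inputs.

```python
import hashlib
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    protocol: str
    src_port: int
    dst_port: int


def pick_path(flow: FiveTuple, num_paths: int) -> int:
    """Map a flow to a path index using a stable hash of its 5-tuple."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    # The same 5-tuple always produces the same digest, hence the same path.
    return int.from_bytes(digest[:8], "big") % num_paths


flow = FiveTuple("10.1.1.10", "172.16.1.20", "tcp", 14512, 80)
print(pick_path(flow, 4))
```

Because the path depends only on the 5-tuple, all packets of one flow share a path, which is also why a single flow cannot use more than one path's worth of bandwidth.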
Other systems go a step further and track connection states for connection-oriented protocols like Transmission Control Protocol (TCP) (e.g., whether a TCP connection is open). These are said to be stateful. Examples of stateful systems include AWS Network Firewall and firewalls in general.
Jitter, also called “Packet Delay Variation”, is the difference in delay between packets in the same flow. Jitter is usually measured in milliseconds.
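One simple way to quantify jitter, sketched below, is to take the absolute difference in delay between consecutive packets of a flow. The `jitter_ms` function and the sample delays are illustrative assumptions; monitoring tools may use other definitions, such as a smoothed average.

```python
def jitter_ms(delays_ms):
    """Return the absolute delay difference between consecutive packets."""
    return [abs(later - earlier) for earlier, later in zip(delays_ms, delays_ms[1:])]


# Hypothetical one-way delays (in ms) for four packets of the same flow.
delays = [20.0, 22.5, 21.0, 30.0]
variations = jitter_ms(delays)
print(variations)  # [2.5, 1.5, 9.0]
print(max(variations))  # worst-case variation in this sample
```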
TCP basic mechanisms
TCP uses the mechanism known as 3-way handshake to establish connections. A “syn” packet from the client is followed by a “syn-ack” from the server, and an “ack” from the client. Then, the connection is established and data can flow. This has the implication that the time required to establish a connection is at least three times the one-way latency (or 1.5 times the RTT).
TCP uses a strategy called flow control to make sure there is a specific bandwidth for sending data. This prevents any individual flow from consuming the entire bandwidth of a single physical path. To achieve this flow control strategy, TCP uses a method called a sliding window. In a sliding window flow control strategy, the window size determines how much data can be sent before receiving an acknowledgement. The window size is reduced as each packet is sent until an “ACK” packet is received confirming the receipt by the receiver. Once an ACK packet is received, the window size is increased allowing for more packets to be sent. If the usable window reaches a size of 0, then the sender stops until an ACK is received.
In IP networks, the Maximum Transmission Unit (MTU) is the size of the largest IP packet that can be transmitted in a single network transaction. A larger MTU improves efficiency by increasing the ratio of application data (such as backups or database replication) versus protocol overhead (packet headers). A typical value is 1500 bytes (for Ethernet networks). Ethernet frames that can carry larger payloads are known as Jumbo Frames. We discuss the different MTUs supported by AWS in the next section.
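To make the efficiency gain concrete, the sketch below estimates the share of bytes on the wire that carry application data for a given MTU. It assumes TCP over IPv4 without options (40 bytes of headers inside the MTU) and 18 bytes of Ethernet header and frame check sequence outside it; other encapsulations would change the numbers. The function name is hypothetical.

```python
IP_TCP_HEADERS = 40      # assumed: 20-byte IPv4 header + 20-byte TCP header, no options
ETHERNET_OVERHEAD = 18   # assumed: 14-byte Ethernet header + 4-byte frame check sequence


def payload_efficiency(mtu_bytes: int) -> float:
    """Fraction of each full-size frame that is application data."""
    payload = mtu_bytes - IP_TCP_HEADERS
    frame = mtu_bytes + ETHERNET_OVERHEAD
    return payload / frame


for mtu in (1500, 9001):
    print(f"MTU {mtu}: {payload_efficiency(mtu):.1%} application data")
```

Under these assumptions, a 1500-byte MTU yields roughly 96% application data per frame and a 9001-byte MTU roughly 99%. Larger packets also mean fewer packets per second for the same throughput.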
Factors that can affect performance
In this section, we review the most common factors that affect network performance and provide some recommendations to mitigate them.
Start at the ends
The first factor that can affect performance is the set of resources (on-premises servers or Amazon Elastic Compute Cloud (Amazon EC2) instances) communicating over the network.
On the AWS side, available network bandwidth on a current generation instance depends on the instance family and size (number of vCPUs). For example, an m5.8xlarge instance has 32 vCPUs and 10 Gbps network bandwidth within the region, and an m5.16xlarge instance has 64 vCPUs and 20 Gbps network bandwidth. You will find that some instances are documented as having “up to” a specified bandwidth, for example “up to 10 Gbps”. These instances have a baseline bandwidth and can burst to meet additional demand using a credit mechanism. Instance families ending in “n” are Network Optimized, and support a higher bandwidth. For example, m6in.16xlarge supports 100 Gbps, four times the 25 Gbps on m6i.16xlarge.
Traffic to other AWS Regions, an Internet Gateway (IGW), Virtual Private Gateway (VPG), or Local Gateway (used for AWS Outposts) can utilize up to 50% of the network bandwidth available to a current generation instance with a minimum of 32 vCPUs, or 5 Gbps for a current generation instance with fewer than 32 vCPUs.
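The rule above can be sketched as a small helper. The function name is hypothetical, and capping the result at the instance bandwidth for smaller instances is our assumption; always check the current Amazon EC2 documentation for your instance type.

```python
def out_of_region_limit_gbps(vcpus: int, instance_bandwidth_gbps: float) -> float:
    """Estimate the bandwidth available for traffic leaving the Region or VPC."""
    if vcpus >= 32:
        return 0.5 * instance_bandwidth_gbps
    # Assumption: a smaller instance cannot exceed its own instance bandwidth.
    return min(5.0, instance_bandwidth_gbps)


print(out_of_region_limit_gbps(32, 10))  # m5.8xlarge: 5.0
print(out_of_region_limit_gbps(64, 20))  # m5.16xlarge: 10.0
```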
There are other network allowances at the instance level, such as packets per second or number of tracked connections. Amazon EC2 Elastic Network Adapter provides metrics to verify if these allowances have been exceeded. It is common to forget that these allowances apply to marketplace Amazon EC2 appliances as well, and they can be the source of some performance related issues.
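On Linux instances, the ENA driver exposes these metrics as counters in the output of `ethtool -S`, with names ending in `_allowance_exceeded`. The following sketch extracts them from that output. The helper names, the sample text, and the default interface name `eth0` are assumptions (your interface may have a different name, such as `ens5`).

```python
import subprocess

ALLOWANCE_SUFFIX = "_allowance_exceeded"


def parse_allowance_counters(ethtool_output: str) -> dict:
    """Extract the *_allowance_exceeded counters from `ethtool -S` output."""
    counters = {}
    for line in ethtool_output.splitlines():
        name, separator, value = line.strip().partition(":")
        if separator and name.endswith(ALLOWANCE_SUFFIX):
            counters[name] = int(value)
    return counters


def read_allowance_counters(interface: str = "eth0") -> dict:
    """Run ethtool on the instance (requires Linux with ethtool installed)."""
    result = subprocess.run(
        ["ethtool", "-S", interface], capture_output=True, text=True, check=True
    )
    return parse_allowance_counters(result.stdout)


# Hypothetical excerpt of ethtool output, for illustration only.
sample = """NIC statistics:
     tx_timeout: 0
     bw_in_allowance_exceeded: 0
     bw_out_allowance_exceeded: 12
     pps_allowance_exceeded: 3
"""
print(parse_allowance_counters(sample))
```

A counter that keeps increasing between two readings suggests that the corresponding allowance is being exceeded and that packets are being queued or dropped.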
Operating system network implementations can also introduce performance limits of their own. For example, the Linux kernel uses receive and transmit queues for packets. These queues are usually tied to a single CPU core and packets are load balanced between queues by flow attributes. This means there are performance limits for single flow throughput and packet rate. Amazon EC2 Elastic Fabric Adapter is designed to bypass the operating system kernel for High Performance Computing (HPC) clusters that require higher throughput between nodes.
Generally speaking, bandwidth for single-flow traffic is limited to 5 Gbps when instances are not in the same cluster placement group. At AWS re:Invent 2022, we announced ENA Express. ENA Express uses a technology known as Scalable Reliable Datagram (SRD) to send packets in the same flow over multiple paths, and allows up to 25 Gbps for a single flow between hosts in the same subnet, while reducing latency variability. Refer to the ENA Express documentation for a list of the supported instance types.
Last but not least, don’t forget to look at the other side of the connection. If you are connecting from an on-premises virtualized environment, for example, check any performance limits that may be imposed by the hypervisor, by the server network cards, and by the network hardware (switches, routers, firewalls, and other devices in the path).
The next factor we examine is the connectivity between your end users or on-premises systems and the AWS Cloud.
The Internet consists of many interconnected networks. Given the distributed span of control of these networks, AWS does not and cannot provide any type of guarantee on the end-to-end performance of connections going through them. Furthermore, because of the dynamic nature of the networks, there could be considerable variations in link quality over time, commonly referred to as “Internet weather”. Amazon CloudWatch Internet Monitor leverages AWS’ telemetry on Internet weather, tailored to your specific resource footprint, to provide visibility into issues that may affect end user performance. For TCP and UDP applications, AWS Global Accelerator provides a pair of Anycast IP addresses that are advertised in Amazon Points of Presence (PoPs) around the world, to which end users can connect. User traffic entering at a Global Accelerator PoP traverses the AWS backbone to the destination in an AWS Region, utilizing the significant investments we have made to increase performance and availability on our network. This helps users avoid Internet weather and gain a better and more consistent experience than they may get otherwise.
AWS Site-to-Site VPN over the Internet
AWS Site-to-Site VPNs over the Internet are a quick and easy way to provide layer 3 connectivity to AWS for businesses starting their journey to the cloud. However, this type of connectivity is also subject to Internet weather conditions. Accelerated Site-to-Site VPN connections leverage Global Accelerator technology to improve performance and consistency.
Regarding bandwidth, you can see up to 1.25 Gbps per tunnel. This bandwidth is highly dependent on numerous factors. If you use AWS Transit Gateway or AWS Cloud WAN for your VPNs, you can load balance the traffic between tunnels using BGP Equal Cost Multi-Path (ECMP) for a higher aggregated bandwidth. See this blog post for a deep dive on Site-to-Site VPN performance tuning.
AWS Direct Connect
AWS Direct Connect provides the shortest path to link your AWS and on-premises networks to build hybrid applications without compromising performance.
The available bandwidth ranges from 50 Mbps to 10 Gbps on hosted connections, and from 1 Gbps to 100 Gbps on dedicated connections. Link Aggregation Groups (LAGs) can be used to achieve higher bandwidth by performing layer 2 load balancing. Up to four total connections can be combined for 1 and 10 Gbps connections and two for 100 Gbps connections. If you wish to load balance across more connections and connections on different AWS devices, then layer 3 load balancing can be achieved using ECMP.
Although Direct Connect is isolated from Internet weather, on rare occasions issues on the AWS Backbone or Service Provider networks connecting from Direct Connect to customer on-premises locations can cause increased latency or packet loss to occur. If your application is sensitive to these issues, then consider deploying a solution for performance monitoring and automatic failover like the one outlined in this post.
If you need end-to-end encryption, then it is possible to deploy a private IP VPN with AWS Direct Connect. Note that private IP VPNs over Direct Connect are subject to the same limitation of up to 1.25 Gbps per tunnel as Site-to-Site VPNs over the Internet.
Packet size matters
Although increasing the packet size (MTU) can help improve throughput, this must be done end-to-end to avoid packet fragmentation, which has the opposite effect.
All current generation EC2 instances support an MTU size of 9001 bytes (Jumbo Frames). However, note that the MTU can change depending on the path taken by the packets:
- Traffic over an IGW or inter-Region VPC Peering connection is limited to a 1500 byte MTU.
- Traffic over a Site-to-Site VPN is limited to a 1500 byte MTU minus the encryption header size. The maximum MTU that can be achieved is 1446 bytes. However, encryption algorithms have varying header sizes and can prevent you from achieving this maximum value.
- Transit Gateway supports an MTU of 8500 bytes for traffic between VPCs, Direct Connect, Transit Gateway Connect, and peering attachments.
- Direct Connect supports an MTU of 1500 or 9001 bytes for private virtual interfaces and 1500 or 8500 bytes for transit virtual interfaces.
If you are connecting to on-premises networks, then you must verify the MTU supported by your equipment as well, including routers, switches, and firewalls that may be in the traffic path.
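Because the usable MTU is the smallest value along the path, a quick way to reason about it is to list each hop and take the minimum. The sketch below uses the AWS values listed above together with a hypothetical on-premises firewall limited to 1500 bytes.

```python
def path_mtu(hop_mtus: dict) -> tuple:
    """Return the name and MTU of the most restrictive hop on the path."""
    bottleneck = min(hop_mtus, key=hop_mtus.get)
    return bottleneck, hop_mtus[bottleneck]


# MTU per hop in bytes. The firewall entry is a made-up example.
hops = {
    "EC2 instance": 9001,
    "Transit Gateway": 8500,
    "Direct Connect transit virtual interface": 8500,
    "on-premises firewall (hypothetical)": 1500,
}
print(path_mtu(hops))
```

In this example, jumbo frames sent by the instance would not fit end-to-end, so the sender needs to use packets of 1500 bytes or less to avoid fragmentation.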
Path MTU discovery
Modern operating systems use Path MTU Discovery (PMTUD) to discover the maximum MTU along the traffic path. There are some caveats to the use of PMTUD:
- PMTUD relies on the fragmentation needed (Type 3, Code 4) ICMP (Internet Control Message Protocol) message. Some AWS Services (e.g., AWS Site-to-Site VPN and Transit Gateway) don’t send ICMP packets back, and some security devices, such as firewalls, block ICMP traffic, causing PMTUD to fail.
- Packetization Layer Path MTU Discovery (also called TCP MTU Probing) is an alternate method that doesn’t rely on ICMP for path MTU discovery. PLPMTUD can be enabled on Linux by modifying configuration files. Refer to your distribution documentation for specific instructions.
- TCP MSS Clamping is a method for intermediate systems to reduce the maximum segment size (MSS) for TCP connections, indirectly reducing the size of IP packets to avoid fragmentation. Some intermediate systems (e.g., Transit Gateway) enforce MSS clamping on all TCP connections.
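On Linux, Packetization Layer Path MTU Discovery is controlled by the `net.ipv4.tcp_mtu_probing` kernel parameter. The following sketch reads the current setting from `/proc`. The function name and the mode descriptions are our own wording, and the procedure for changing the value persistently depends on your distribution.

```python
from pathlib import Path

SYSCTL_PATH = "/proc/sys/net/ipv4/tcp_mtu_probing"

PROBING_MODES = {
    0: "disabled",
    1: "enabled after an ICMP black hole is detected",
    2: "always enabled",
}


def mtu_probing_mode(sysctl_path: str = SYSCTL_PATH) -> str:
    """Describe the current TCP MTU probing setting on a Linux host."""
    value = int(Path(sysctl_path).read_text().strip())
    return PROBING_MODES.get(value, f"unknown ({value})")


# Only meaningful on Linux, so skip quietly on other systems.
if Path(SYSCTL_PATH).exists():
    print(mtu_probing_mode())
```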
Latency can have different causes: there are packet processing delays caused by network equipment, queuing delays in busy links, and propagation delays over the transmission medium. There are multiple ways to reduce network latency on AWS:
- One way is to move content or systems closer to end users.
- For HTTP(s) applications running over the Internet, you can use Amazon CloudFront. CloudFront is a Content Delivery Network (CDN) with 450+ PoPs around the world. These PoPs cache frequently accessed content, and provide TCP and Transport Layer Security (TLS) termination closer to the end users to reduce latency and improve application response time.
- AWS Local Zones can also be used to deploy compute, storage, database, and other services closer to your end users.
- Even lower latency can be achieved for your on-premises systems by deploying Outposts, which is AWS Infrastructure running in your own data centers.
- Communications Service Providers (CSP) can leverage AWS Wavelength to embed compute and storage within their 5G networks. This avoids the latency that would result from application traffic traversing multiple hops over the Internet.
- Application layer protocols can also help mitigate the effect of latency. For example, a TLS 1.3 session can be established in one round trip (TLS 1.2 requires two). HTTP/3 runs over QUIC, a UDP (User Datagram Protocol) based, stream-multiplexed, secure transport protocol that combines and improves upon the capabilities of existing TCP, TLS, and HTTP/2 for faster response times. Both TLS 1.3 and HTTP/3 are supported by CloudFront.
- Inside the AWS Network, Infrastructure Performance is a capability of AWS Network Manager that can be used to monitor Inter-Region, Inter-Availability Zone (AZ), and Intra-AZ latency metrics.
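To see why protocol choice matters on high-latency paths, the sketch below estimates the delay before a client can send its first request. It counts round trips from the client's point of view and assumes full handshakes with no session resumption: one round trip for TCP, two for TLS 1.2, one for TLS 1.3, and a single combined round trip for QUIC. The names and the 80 ms RTT are illustrative.

```python
# Round trips needed before the first request can be sent (assumed full handshakes).
SETUP_ROUND_TRIPS = {
    "TCP + TLS 1.2": 1 + 2,
    "TCP + TLS 1.3": 1 + 1,
    "QUIC (HTTP/3)": 1,
}


def setup_delay_ms(rtt_ms: float, round_trips: int) -> float:
    """Connection setup delay for a given RTT and number of round trips."""
    return rtt_ms * round_trips


for name, trips in SETUP_ROUND_TRIPS.items():
    print(f"{name}: {setup_delay_ms(80, trips)} ms")
```

Terminating connections closer to the user reduces the RTT in this calculation, and newer protocols reduce the number of round trips.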
The LFN in the room
One frequent but often overlooked reason for low application throughput is the effect of latency on TCP. This issue arises as a combination of factors: high delay, applications using a single TCP connection for data replication, and legacy TCP congestion control algorithms that don’t adequately increase the window sizes.
Long Fat Networks
The bandwidth-delay product is defined as the link bandwidth (in bits per second) multiplied by the round-trip time in seconds. This number is important because it is equivalent to the data that can be sent over the network before receiving an acknowledgement.
Long Fat Networks (LFN) are networks that have high latency (long) and bandwidth (fat). Or, to put it more precisely, networks in which the bandwidth-delay product significantly exceeds 10^5 bits (RFC 1072).
In the past, examples of LFN were limited to geostationary satellites (with round-trip times over 500ms). Typical bandwidths for long-haul point-to-point links were 1.544Mbps (T1) and 2Mbps (E1), which kept the bandwidth-delay product low.
However, today it is common to see multi-Gbps links across states, countries and even continents. For example, a 10 Gbps link across the United States, with 80ms round-trip time, has a bandwidth-delay product of 10^10 bits/s x 0.08s = 8 x 10^8 bits (100 MB). Link bandwidth has increased exponentially over the years, but latency has remained almost the same, due to the constant nature of the speed of light.
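The bandwidth-delay product is straightforward to compute for your own links. The following minimal sketch uses hypothetical function names and takes the RTT in milliseconds.

```python
LFN_THRESHOLD_BITS = 10**5  # threshold from RFC 1072


def bdp_bits(bandwidth_bps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product in bits."""
    return bandwidth_bps * rtt_ms / 1000


def is_long_fat_network(bandwidth_bps: float, rtt_ms: float) -> bool:
    """True when the bandwidth-delay product exceeds the LFN threshold."""
    return bdp_bits(bandwidth_bps, rtt_ms) > LFN_THRESHOLD_BITS


# A 10 Gbps link with an 80 ms round-trip time.
bdp = bdp_bits(10_000_000_000, 80)
print(f"{bdp:.0e} bits, or {bdp / 8 / 1e6:.0f} MB in flight")
print(is_long_fat_network(10_000_000_000, 80))
```

By contrast, a T1 link (1.544 Mbps) with the same 80 ms round-trip time has a bandwidth-delay product of only about 1.2 x 10^5 bits, which is barely above the threshold.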
Gigabit bandwidths that were traditionally reserved for Local Area Networks are now available globally. But what happens if we increase the latency without modifying the TCP Window size? For example, let’s assume the TCP Window can accommodate four packets full of data (see the following figure). In a low latency scenario, acknowledgments are received before the window is full and data keeps flowing (left). But as latency increases, the TCP Window fills before receiving an acknowledgement, transmission stalls, and (if no other flows are using it) link bandwidth is wasted.
TCP Window Scaling
The original TCP specification allowed for a maximum window size of 65,535 bytes. This meant that, for a link with a 100ms round-trip time, no matter the spare bandwidth, the maximum throughput that could be achieved over a single TCP connection was 65,535 bytes x 8 x 1/0.1s = 5,242,800 bits/s (about 5 Mbps).
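The same calculation can be repeated for any window size and round-trip time. This minimal sketch assumes the window is the only limiting factor (no packet loss and no congestion control effects), and the function name is hypothetical.

```python
def max_throughput_bps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on single-connection throughput: one window per round trip."""
    return window_bytes * 8 * 1000 / rtt_ms


print(max_throughput_bps(65_535, 100))  # 5242800.0 bits/s, about 5 Mbps
print(max_throughput_bps(65_535, 10))   # the same window over a 10 ms path
```

The bound scales inversely with the round-trip time, which is why a transfer that performs well inside a data center can slow down dramatically over a long-haul link.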
The TCP Window Scale option, specified in RFC 7323, enables efficient data transfer in networks with bandwidth-delay products higher than 65 KB. This option allows for window sizes of over 1 GB. Using Window Scaling, a single TCP connection could efficiently transfer data over a 10 Gbps link with 800ms of latency, or on a 100 Gbps link with 80ms of latency.
Note that some operating systems still in use were developed over a decade ago, when the LFN problem was not as prevalent as today. In some cases, default window scaling factors are conservative, leading to a Window Size far below the bandwidth-delay product of modern links, and limiting the throughput of TCP connections. This problem can be identified by capturing traffic with a tool such as Wireshark, Linux tcpdump, or Microsoft Network Monitor, and verifying the window size and scaling factor. The TCP scaling factor can be tuned in most operating systems. The recommendation is that the effective window size should be at least the bandwidth-delay product of the link.
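As a rough guide for tuning, the sketch below finds the smallest window scale factor whose maximum window covers the bandwidth-delay product of a link. The function name is hypothetical. The shift is capped at 14, the maximum allowed by RFC 7323.

```python
MAX_UNSCALED_WINDOW = 65_535  # bytes, the 16-bit TCP window field
MAX_SHIFT = 14                # largest scale factor allowed by RFC 7323


def required_window_scale(bandwidth_bps: float, rtt_ms: float) -> int:
    """Smallest scale shift whose window covers the bandwidth-delay product."""
    bdp_bytes = bandwidth_bps * rtt_ms / 1000 / 8
    shift = 0
    while (MAX_UNSCALED_WINDOW << shift) < bdp_bytes and shift < MAX_SHIFT:
        shift += 1
    return shift


print(required_window_scale(10_000_000_000, 80))   # 10 Gbps, 80 ms
print(required_window_scale(100_000_000_000, 80))  # 100 Gbps, 80 ms
```

On Linux, the scale factor is not set directly. The kernel derives it from the socket buffer limits (for example, `net.ipv4.tcp_rmem` and `net.core.rmem_max`), so in practice tuning means raising those limits to at least the bandwidth-delay product.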
Asymmetric routing refers to a network topology where packets follow one network path from source to destination, but the return traffic does not use the same path in the reverse direction, instead taking a different route. Asymmetric routing introduces complexity in the architecture and makes finding issues harder, and is not generally recommended. Furthermore, since intermediate hops on the network may implement connection tracking, asymmetric routing can have unexpected consequences including dropped connections and reduced network performance.
Besides link bandwidth, many other factors contribute to network performance. We summarize the following recommendations:
- Choose your EC2 instances according to the required performance. Network optimized instances provide the highest bandwidth and packet rates.
- Global Accelerator can be used to improve the performance consistency of applications running over the Internet.
- CloudWatch Internet Monitor can provide visibility into Internet issues that may affect user experience.
- For better reliability, prefer Direct Connect or Accelerated Site-to-Site VPNs for layer 3 connectivity.
- CloudFront, Local Zones, Outposts, and AWS Wavelength can be used to bring data closer to the end user, thus reducing latency.
- Latency inside AWS can be monitored using Infrastructure Performance, a capability of AWS Network Manager.
- Modern protocols like TLS 1.3 and HTTP/3 can help mitigate the effect of latency.
- To improve TCP throughput over LFNs, update your operating systems and tune the TCP Window Scaling.
- Avoid architectures introducing asymmetric routing.
In this post, we reviewed the main factors contributing to network performance and provided best practices for AWS and hybrid networks. For additional guidance, you can leverage AWS re:Post to get expert help from the community.