Networking & Content Delivery
Growing AWS internet peering with 400 GbE
Performance is a key driver of the design of the AWS global infrastructure. AWS has the largest global network infrastructure footprint of any cloud provider, and this footprint is expanding continuously to help our customers deliver better end-user experiences, rapidly expand operations to virtually any region or country, and meet their data locality and sovereignty requirements. This network serves customers in two different ways, either over the public internet, or through AWS Direct Connect. Although Direct Connect is the preferred choice for AWS customers that need direct, private connectivity with predictable performance, and an option for link encryption, it doesn’t fit all needs.
Products such as Amazon CloudFront, Amazon Route 53, and AWS Global Accelerator are good examples of AWS services that primarily use and rely on the public internet to deliver their service to the end-user. Looking at CloudFront specifically, it's a service where customers such as Hulu, Prime Video, and Discovery use CDN resources in the 410+ Points of Presence deployed globally to stream their content to users like you and me.
Over the last three years, network capacity at the AWS internet edge – where the AWS network interfaces with the internet – has grown by 3x, and our backbone capacity has increased by 2.5x over the same period.
At the 440+ locations on six continents marked in Figure 1, AWS peers with Internet Service Providers (ISPs) either over an Internet Exchange (IX) or through private network interconnects (PNIs) directly on our border routers. The AWS network has been built out over the course of more than a decade. Our goal is to peer with ISPs that operate infrastructure connecting end-users in every country in the world, as locally and as close to their market as possible, to reduce latency and increase capacity to end-users. One of the primary internal users of our global network backbone is Amazon CloudFront (Figure 2).
In contrast to the peering points described here, Direct Connect also has a global footprint, operating in over 115 locations around the world and offering connection speeds from 50 Mbps up to 100 Gbps. When AWS customers require the reliability and predictability of a private connection, Direct Connect is the answer. AWS customers that also operate peering networks are discouraged from transferring significant volumes of their own data over peering connections, because individual peering connections aren't maintained with the same robustness as the Direct Connect service; instead, they achieve resiliency through a meshed collection of direct and indirect peering paths.
Direct Connect also offers features that peering simply doesn't. Importantly, Direct Connect offers three different types of virtual interfaces. Private and transit virtual interfaces allow direct access to customer VPCs, which means customers don't need to expose their AWS resources to the public internet if they don't want to. A public virtual interface provides access to public endpoints such as Amazon Simple Storage Service (Amazon S3), but customers control the AWS Regions to which they allow access. This last point is a key and distinct difference between internet peering and Direct Connect.
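For illustration, here is a minimal sketch of what ordering a private virtual interface could look like with boto3. The connection ID, VLAN, BGP ASN, and Direct Connect gateway ID are placeholders, not values from this post; check the Direct Connect API reference for the full set of parameters.

```python
# A minimal sketch (not AWS's internal tooling): provisioning a private
# virtual interface on an existing Direct Connect connection with boto3.
# All identifiers and values below are placeholders.
import boto3

dx = boto3.client("directconnect", region_name="us-east-1")

response = dx.create_private_virtual_interface(
    connectionId="dxcon-EXAMPLE",                   # existing Direct Connect connection
    newPrivateVirtualInterface={
        "virtualInterfaceName": "example-private-vif",
        "vlan": 101,                                # 802.1Q tag carried on the connection
        "asn": 65000,                               # customer-side BGP ASN
        "mtu": 9001,                                # jumbo frames, if supported end to end
        "directConnectGatewayId": "dxgw-EXAMPLE",   # attaches the VIF to VPCs via a DX gateway
    },
)
print(response["virtualInterfaceState"])            # e.g. "pending" until BGP comes up
```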
When it comes to peering, it’s helpful to have up-to-date, contact, and technical information for each peer. In line with the Mutually Agreed Norms for Routing Security (MANRS) principle #3, it’s important to have updated contacts and facility information. This is why every single network in the world that is interested in Internet Peering operates a PeeringDB-record where anyone can view possible interconnect points where either private or shared (over an IX) peerings can be established. Operators can display their various policies and the IX’s of which they are members, or which data centers they are available at to establish a PNI.
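As a concrete illustration, any network's record can be pulled from PeeringDB's public REST API. The sketch below looks up a network by ASN (Amazon's AS16509 is used as an example); the exact field names should be verified against the PeeringDB API documentation.

```python
# A minimal sketch of querying PeeringDB over its public REST API.
# ASN 16509 (Amazon) is used as an example; field names reflect the PeeringDB
# schema but should be checked against https://www.peeringdb.com/apidocs/.
import requests

PDB = "https://www.peeringdb.com/api"

# Fetch the network ("net") object for a given ASN.
net = requests.get(f"{PDB}/net", params={"asn": 16509}, timeout=10).json()["data"][0]
print(net["name"], "-", net.get("policy_general", "n/a"))

# List the exchange ports (netixlan objects) where this network offers public peering.
ixlans = requests.get(f"{PDB}/netixlan", params={"net_id": net["id"]}, timeout=10).json()["data"]
for ix in ixlans[:5]:
    print(ix["name"], ix["speed"], "Mbps")
```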
Although internet peering as we know it has been around in some form since the dawn of TCP/IP, with ISPs interconnecting over everything from X.25 to ATM, 1 Gigabit Ethernet (GbE) was most likely the first "universal" standard, adopted around the year 2000. 10 GbE technologies followed shortly thereafter and were launched by most companies for interconnect around 2005. We (as in the internet industry) had to wait until around 2012, when the first publicly known 100 Gigabit Ethernet interconnects were commonly deployed and supported in networks. Today, this is the standard way of interconnecting both small and large networks. Note that AWS was one of the first cloud providers to widely adopt 100G technology on its global backbone. Over 90% of all capacity interconnected with the AWS border network uses 100 GbE technologies today. Ten years later, this isn't enough. 2023 looks to be the year when we, for the fourth time in the lifespan of global internet peering, switch to the next major generation of technology: 400 Gigabit Ethernet (400 GbE).
As part of these continued efforts, we've rolled out a new platform built upon 400 GbE technology. Not only does this allow non-blocking forwarding within the multi-tier Clos, but it also allows us to natively interconnect with our internet peers and exchanges at 400 GbE. To interconnect, we typically use short-reach optical 400 GbE (typically 400G-DR4+) for internal, in-site usage (Figure 3), 400 GbE direct-attach cabling within racks (Figure 4), and, for external connectivity to other ISPs, 400 GbE long-range optics (400G-LR4) with a maximum reach of 10 km. Internet peering has typically centered around 10 km technologies through both the 10 GbE and 100 GbE generations, since they give good price/performance while leaving enough headroom for a bad fiber splice and a lot of intermediate connectors.
One of the largest cost drivers for interconnect in the AWS border network today, outside of the hardware itself, is the cost of cross-connects or fiber across a metropolitan area. Most of the interconnects in the network today are multiple 100G connections made toward the same ISP in the same city, forming a Link Aggregation Group (LAG) of Nx100G, where N is the number of aggregated physical links. If these LAGs of, say, 4x100G can be swapped out for 1x400G instead, then we save three fiber pairs per interconnect. These cost savings help us deliver more capacity to customers and allow more innovation on the internet border. The reduced complexity of managing potentially a quarter as many cross-connects is also a major cost and reliability driver for moving to higher port speeds faster.
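The savings compound with the size of the LAG. A quick back-of-the-envelope sketch (illustrative only, not AWS data):

```python
# An illustrative calculation: fiber pairs and cross-connects saved when an
# Nx100G LAG toward a peer is consolidated onto 400 GbE ports.
def consolidate(lag_members_100g: int) -> dict:
    ports_400g = -(-lag_members_100g // 4)       # ceil(N / 4): one 400G port replaces up to four 100G links
    return {
        "fiber_pairs_before": lag_members_100g,  # one pair per 100G member
        "fiber_pairs_after": ports_400g,         # one pair per 400G port
        "pairs_saved": lag_members_100g - ports_400g,
    }

print(consolidate(4))   # a 4x100G LAG collapses to 1x400G, saving three fiber pairs
print(consolidate(16))  # a 16x100G LAG becomes 4x400G, saving twelve cross-connects
```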
However, we know that not all peers and IXs are ready for the switch to 400 GbE. Although some for-profit IXs already support 400 GbE, and others have it on their roadmap, the story can be a bit different for non-profit exchanges. IXs like the Seattle Internet Exchange (SIX) sustain operations from a combination of mandatory participation fees and ad-hoc donations. Ad-hoc donations take some of the uncertainty out of technology refreshes. This is why AWS donated over $300,000 USD so that the SIX could upgrade their core switching platform to native 400 GbE. As a result, AWS is happy to be the first SIX participant connecting at 400 GbE.
In addition to the SIX upgrade, the new routing platform employed by AWS has allowed us to eschew the traditional constraints of a legacy network with strictly defined functional layers. Instead, we use a single multi-function network layer that is actually a multi-tier Clos network (Figure 5). Although Clos networks have traditionally been used in data centers, especially at large scale, they're fairly uncommon in a service-provider-style network. For us, Clos means horizontal scaling: we can quickly scale our border out to where it needs to be, much more efficiently. This means that we can terminate 400 GbE peering interconnects wherever our border network exists. With this new platform, that's 160 sites and counting. This flexibility unlocks the AWS vision of decentralizing much of our peering interconnect, especially away from sites that weren't originally built as data centers, and moving it to more suitable locations that are closer to AWS edge services. In turn, this increases network robustness and availability with a direct benefit to our customers.
When we dig deeper into what "any connection type, anywhere" means on our 400 GbE platform, things get really interesting in terms of flexibility for peering interconnect and the robust connectivity that follows.
As shared in Dave Brown’s (VP of Amazon EC2 Networking and Compute Services) leadership session at re:Invent 2022, over the last three years we’ve been on a journey to reinvent our global network to be based on the same underlying AWS-built routers that we use in our data centers. Building on our own hardware and software has enabled us to scale capacity at an unprecedented rate, while increasing network reliability and resiliency.
As shown in these images (Figure 6), this is our own 32×400 GbE universal building block with the hood popped off.
We’ve discussed how we prefer to build networking using single system-on-chip fixed 1U switch building blocks, and build fabrics out of them in topologies like Clos networks. For a few years now, our standard building block has been a 32-port 1U 12.8 Tbps device that provides either 32 x 400 GbE ports or 128 x 100G ports (through breakout cables).
Our preference for fixed-port-count (non-modular) network devices (Figure 7) stems from the failure cases and "gray failures" seen with modular devices, where internal impairment or one of many more failure modes can leave a device appearing to work while it silently blackholes some traffic. Fixed devices are by their nature much simpler, and by comparison, what might have been a design with 4 or 8 modular devices (in a 4-wide design, failure of one device is a 25% hit to capacity) becomes a design with 16 or 32 fixed devices, where the failure of one device is only a 1/16 or 1/32 reduction in capacity.
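The arithmetic behind that trade-off is simple enough to sketch (illustrative only):

```python
# An illustrative sketch: the capacity "blast radius" of losing a single device
# shrinks as a fabric gets wider, which is the argument for many small fixed
# devices over a few large modular chassis.
def capacity_after_failure(devices: int, failed: int = 1) -> float:
    """Fraction of fabric capacity remaining after `failed` devices are lost."""
    return (devices - failed) / devices

for width in (4, 8, 16, 32):
    remaining = capacity_after_failure(width)
    print(f"{width:2d}-wide fabric: one failure leaves {remaining:.1%} of capacity "
          f"(a {1 / width:.1%} hit)")
# A 4-wide design takes a 25.0% hit per failure; a 32-wide design only 3.1%.
```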
We deploy network capacity in 'fabrics' (Figure 8), and with our 12.8 Tbps devices, a fabric starts with 100 Tbps of client-facing capacity. However, there is actually 400 Tbps of capacity deployed (32 switches) to provide non-blocking, non-oversubscribed any-to-any connectivity from any client-facing port to any other client-facing port. As we need to add capacity, we can horizontally scale a fabric up to 32 racks wide, for 3200 Tbps of any-to-any, non-oversubscribed, client-facing capacity.
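Putting those numbers together (an illustrative sketch using only the figures quoted above; the real topology details aren't public):

```python
# Rough arithmetic for the fabric sizing described above (illustrative only).
DEVICE_TBPS = 12.8            # one 32x400 GbE switch
SWITCHES_PER_FABRIC = 32      # starting deployment per fabric
CLIENT_FACING_TBPS = 100      # non-blocking client capacity of a starting fabric
MAX_RACKS = 32                # horizontal scale-out limit

deployed = DEVICE_TBPS * SWITCHES_PER_FABRIC
print(f"Deployed switching per starting fabric: ~{deployed:.0f} Tbps")        # ~410 Tbps
print(f"Client-facing share: {CLIENT_FACING_TBPS / deployed:.0%}")            # the rest is fabric links
print(f"Scaled {MAX_RACKS} racks wide: {CLIENT_FACING_TBPS * MAX_RACKS} Tbps client-facing")
```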
Historically, networks would be built in 'layers', with different kinds of layers for connecting different kinds of clients, such as an "internet edge" layer for internet-peering-facing routers, "aggregation" layers, "backbone router" layers, and so on. To facilitate rapid scaling of capacity, we developed an innovative way of connecting clients that doesn't dedicate layers to clients. This means we can connect any client to any port and scale out horizontally without any hotspots due to an imbalance of client ports. This is a building block that has enabled us to land Local Zones fast and close to customers, with under two microseconds of device latency between an AWS Local Zone, Route 53, Global Accelerator, or CloudFront and an internet peer.
All modern ASIC-based network devices separate the data plane (hardware) from the control plane (software), but often the control-plane functions run on embedded CPUs with limited compute and RAM. In contrast to popular commercial offerings, we have paired our own device with a Graviton2-based onboard controller for the base features that a peering router needs. We build our own devices in a hybrid model, with some functions that make sense on the device (such as link aggregation and route programming) but with BGP signaling separate from the physical device (Figure 9). The internet operates using the BGP routing protocol, and because we have millions of prefixes and paths to choose from, we centralize routing decisions on much higher performance compute outside of the network devices. Although we run this infrastructure 'disaggregated', this is transparent to our internet peers: they don't need to do anything different to peer with us, even though their seemingly 'direct' BGP sessions actually terminate logically in a high-performance compute cluster locally at the site. Operating disaggregated in this manner also means that we can scale out internet connectivity very wide, and we have many places where we're doing multi-Tbps on a single internet peering session.
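To make the idea concrete, here is a heavily simplified sketch of why centralizing route selection works: BGP best-path selection is ultimately a comparison over candidate paths, so it can run on server-class compute and the winning routes can then be programmed into the hardware data plane. This is not the AWS implementation, only the first few steps of the standard decision process with made-up example peers and ASNs.

```python
# A heavily simplified illustration of BGP best-path selection as plain compute.
from dataclasses import dataclass, field

@dataclass
class Path:
    peer: str
    local_pref: int = 100
    as_path: tuple = field(default_factory=tuple)
    origin: int = 0            # 0 = IGP, 1 = EGP, 2 = incomplete
    med: int = 0

def best_path(paths: list[Path]) -> Path:
    # First steps of the standard decision process, in order:
    # higher local-pref, shorter AS path, lower origin, lower MED.
    return min(paths, key=lambda p: (-p.local_pref, len(p.as_path), p.origin, p.med))

candidates = [
    Path(peer="peer-a", local_pref=100, as_path=(64500, 64510, 64520)),
    Path(peer="peer-b", local_pref=100, as_path=(64501, 64520)),   # shorter AS path wins
    Path(peer="peer-c", local_pref=90,  as_path=(64502,)),         # lower local-pref loses first
]
print(best_path(candidates).peer)  # "peer-b"
```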
Operating in this disaggregated manner not only provides performance benefits (faster network convergence), but also enables us to do far more innovative things than blindly following what the internet routing table tells us. We can make better internet path-selection choices, and we can route around 'bad internet weather'. Some of the metrics that we use to measure and monitor are also now available through Amazon CloudWatch Internet Monitor.
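For readers who want to look at similar signals themselves, Internet Monitor publishes its measurements as CloudWatch metrics. The sketch below assumes a monitor named my-app-monitor and the AWS/InternetMonitor namespace with a PerformanceScore metric; verify the namespace and metric names against the current Internet Monitor documentation.

```python
# A minimal sketch of pulling Internet Monitor health metrics from CloudWatch
# with boto3. The monitor name is a placeholder; namespace and metric names
# are assumptions to be checked against the service documentation.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
now = datetime.now(timezone.utc)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/InternetMonitor",
    MetricName="PerformanceScore",
    Dimensions=[{"Name": "MonitorName", "Value": "my-app-monitor"}],  # placeholder monitor
    StartTime=now - timedelta(hours=3),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda d: d["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2))
```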
We are quite excited to bring all this innovation to the AWS border network, and most likely the bits flowing from AWS to you (the reader) already go through the technology described above.