
Enabling global expansion and reduced operational overhead at Comcast with AWS Transit Gateway

This blog post is co-written by David Hocky from Comcast Corporation.

This post explains how Comcast achieved faster time-to-market for new product launches, increased resiliency, and reduced operational overhead by using Amazon Web Services (AWS) Transit Gateway and AWS Direct Connect.

Comcast is a global media and technology company. From connectivity and platforms to content and experiences, Comcast reaches hundreds of millions of customers, viewers, and guests worldwide. Through the Xfinity, Comcast Business, and Sky brands, Comcast delivers world-class broadband, mobile, and entertainment products that delight customers and technology that powers the future. Comcast’s global media and entertainment businesses create, distribute, and stream leading entertainment, sports, and news and bring incredible theme parks and attractions to life through Universal Destinations & Experiences.

The teams at Comcast operate a cloud environment spanning hundreds of AWS accounts. These accounts are actively used by thousands of developers running a wide variety of workloads across multiple lines of business, including Xfinity X1, Xfinity xFi, Xfinity Home, Xfinity Mobile, and Comcast Business. If you want to learn how teams at Comcast are using AWS, you can read previously published AWS posts and videos on building home security solutions at scale, systems that perform analytics on large-scale telemetry data, and monitoring home security devices using Amazon CloudWatch.

DX Model 1.0

Early in Comcast’s AWS adoption journey, we established a network connectivity model known as DX Model 1.0, which used Direct Connect to enable connectivity between Amazon Virtual Private Clouds (Amazon VPCs) and Comcast corporate datacenters.

Each Amazon VPC was connected to Comcast datacenters over a private virtual interface terminating directly on an AWS Virtual Private Gateway (and later, a Direct Connect Gateway). A side effect of this design was that traffic between VPCs in AWS had to hairpin through on-premises routers, which added latency (Figure 1).

The same network path was used for traffic between Amazon VPCs and on-premises Comcast resources.

Figure 1. Single-Region DX Model 1.0

The connectivity model was later expanded to multiple AWS Regions and on-premises datacenters based on evolving workload requirements. For brevity, Figure 2 shows an example of only two AWS Regions and two on-premises datacenters.

Although some VPCs had direct VPC peering connections, the majority of traffic between VPCs was routed through on-premises routers, which caused unnecessary hairpinning. This routing configuration added latency, making VPC-to-VPC latency comparable to that of on-premises dependencies.

Figure 2. Multi-Region DX Model 1.0

This design pattern worked well in the early stages of adoption and played to the strengths of the Comcast team, who could quickly deploy new virtual interfaces (VLANs that transport Direct Connect traffic) and establish BGP sessions to exchange routes.

However, as the number of VPCs and accounts scaled, it became increasingly complex to manage large numbers of Direct Connect connections, virtual interfaces, and the associated AWS service limits on private VIFs and routes.

The hairpinning of cross-VPC traffic through on-premises routers became more noticeable as new workloads were brought up on AWS. This legacy VPC-to-VPC traffic flow also put more pressure on Direct Connect utilization and long-term capacity planning.

DX Model 2.0

In 2021, Comcast began an effort to redesign the DX connectivity model with the goals of increasing scalability, reducing latency, and minimizing time to market. One option would have been to increase the usage of VPC peering. However, the complexity of managing mesh VPC peering and VPC route table size limits made this impractical. As a result, the team decided to use Transit Gateway and Direct Connect Gateway.

Transit Gateway is an AWS service that streamlines network architectures by connecting Amazon VPCs and on-premises networks through a central hub. It acts as a highly available and scalable router. Transit Gateway enables Regional and cross-Region connectivity, allowing you to connect resources across multiple AWS Regions. It supports network segmentation through multiple route tables and integrates with AWS services such as Direct Connect and VPN for secure on-premises connectivity. Transit Gateway also provides centralized monitoring and logging capabilities for network traffic. Overall, it simplifies the management of complex network topologies, enabling efficient and secure connectivity between VPCs and on-premises resources in a hub-and-spoke topology.
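
To make the hub-and-spoke model concrete, the following is a minimal sketch in Python (boto3) of creating a Transit Gateway and attaching a spoke VPC to it. All IDs, the ASN, and the option values are hypothetical placeholders, not Comcast’s actual configuration.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a Transit Gateway to act as the Regional hub. Default route table
# association and propagation are disabled so that routing stays explicit.
tgw = ec2.create_transit_gateway(
    Description="Regional hub (example)",
    Options={
        "AmazonSideAsn": 64512,  # hypothetical private ASN
        "DefaultRouteTableAssociation": "disable",
        "DefaultRouteTablePropagation": "disable",
    },
)["TransitGateway"]

# Attach a spoke VPC, using one subnet per Availability Zone.
attachment = ec2.create_transit_gateway_vpc_attachment(
    TransitGatewayId=tgw["TransitGatewayId"],
    VpcId="vpc-0123456789abcdef0",  # hypothetical spoke VPC
    SubnetIds=["subnet-0aaaa1111bbbb2222", "subnet-0cccc3333dddd4444"],
)["TransitGatewayVpcAttachment"]
print(attachment["TransitGatewayAttachmentId"])
```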

In the new DX Model 2.0, a Transit Gateway was provisioned in each AWS Region. The Transit Gateways in different AWS Regions were peered to one another, creating a full mesh. This allowed Comcast to keep VPC-to-VPC traffic on the AWS network regardless of whether the flow was intra- or inter-Region, which effectively offloaded traffic from the Direct Connect connections and reduced latency.
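
A minimal sketch of one leg of such a full mesh, again with hypothetical IDs and CIDRs: a peering attachment is requested from one Region, accepted in the other, and a static route toward the peer is added (peering attachments do not exchange routes dynamically).

```python
import boto3

use1 = boto3.client("ec2", region_name="us-east-1")
usw2 = boto3.client("ec2", region_name="us-west-2")

# Request a peering attachment from the us-east-1 hub to the us-west-2 hub.
peering = use1.create_transit_gateway_peering_attachment(
    TransitGatewayId="tgw-0aaaa1111bbbb2222",      # hypothetical us-east-1 hub
    PeerTransitGatewayId="tgw-0cccc3333dddd4444",  # hypothetical us-west-2 hub
    PeerAccountId="111122223333",                  # hypothetical account
    PeerRegion="us-west-2",
)["TransitGatewayPeeringAttachment"]
att_id = peering["TransitGatewayAttachmentId"]

# Accept the request on the us-west-2 side once it reaches pendingAcceptance.
usw2.accept_transit_gateway_peering_attachment(TransitGatewayAttachmentId=att_id)

# Peering attachments exchange no routes dynamically, so point the remote
# Region's aggregate at the peering attachment with a static route.
use1.create_transit_gateway_route(
    DestinationCidrBlock="10.32.0.0/12",  # hypothetical us-west-2 aggregate
    TransitGatewayRouteTableId="tgw-rtb-0aaaa1111bbbb2222",
    TransitGatewayAttachmentId=att_id,
)
```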

Transit Gateways connect back to on-premises through Direct Connect Gateway using a few transit virtual interfaces to establish BGP sessions and exchange routes.

Even though Direct Connect Gateway is a global construct, it was used in a Regional fashion in this design. This design choice allowed Comcast to have more control over influencing traffic destined from respective AWS Regions back to on-premises, which resulted in streamlined routing. For example, attaching an AWS BGP community tag to influence a route advertised from on-premises would affect only one AWS Region at a time.
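
The general shape of this pattern might look like the following sketch: one Direct Connect gateway dedicated to a Region, associated with that Region’s Transit Gateway and allowing (advertising) only that Region’s aggregate prefix back to on-premises. The names, IDs, and prefixes are hypothetical.

```python
import boto3

dx = boto3.client("directconnect", region_name="us-east-1")

# One Direct Connect gateway dedicated to the us-east-1 hub, even though
# the construct itself is global.
dxgw = dx.create_direct_connect_gateway(
    directConnectGatewayName="dxgw-us-east-1-example",  # hypothetical name
    amazonSideAsn=64512,                                # hypothetical ASN
)["directConnectGateway"]

# Associate the Regional Transit Gateway, allowing (advertising) only the
# prefixes that live in this Region back toward on-premises.
dx.create_direct_connect_gateway_association(
    directConnectGatewayId=dxgw["directConnectGatewayId"],
    gatewayId="tgw-0aaaa1111bbbb2222",  # hypothetical us-east-1 hub
    addAllowedPrefixesToDirectConnectGateway=[
        {"cidr": "10.16.0.0/12"},  # hypothetical us-east-1 aggregate
    ],
)
```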

Comcast achieved a single-Region SLA of 99.99% by provisioning multiple Direct Connect connections across multiple Direct Connect locations. Each connection provided connectivity to multiple AWS Regions.

Figure 3 shows a high-level architecture of DX Model 2.0 between two Regions.

Figure 3. Multi-Region DX Model 2.0

Migration approach

Comcast grew organically over time to hundreds of VPCs across multiple AWS accounts as part of a multi-account strategy. Early on, these new VPCs were onboarded infrequently and through a highly manual process. As Comcast’s usage of AWS grew, Comcast developed automation to create new VPCs with Direct Connect connectivity, but retained the VPCs created pre-automation to minimize the impact on running applications. In the end state, we wanted pre- and post-automation VPCs to look as similar as possible to streamline the deployment of future features and to allow all users to benefit from our automation tooling for VPC modifications and firewall rule management. Our goal was to migrate teams with as little disruption to their existing workloads and processes as possible.

As Comcast finalized our Transit Gateway design, we developed a template VPC configuration that would be deployed for all new VPCs. When we were satisfied with this configuration, we developed tooling to evaluate existing VPCs and identify deltas in configuration that would have to be normalized as part of the migration process. Then, we developed more automation to execute, or roll back, these changes on a per-VPC basis.

Based on AWS best practices, Comcast opted to add dedicated subnets in each Availability Zone (AZ) for our new Transit Gateway attachments. We used different IPv4 CIDR blocks than the ones allocated for workloads, which streamlined changes and rollback. Using different CIDR blocks helped us to do the following (a sketch of this setup follows the list):

    • Avoid consuming IP address space allocated for applications, and provide cleaner separation of workloads and network infrastructure.
    • Pre-provision Transit Gateway attachments in more AZs to accommodate future workload expansion.
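
A minimal sketch of this setup, with hypothetical IDs and CIDRs: associate a secondary IPv4 CIDR with the VPC, then carve a small dedicated subnet out of it in each AZ for the Transit Gateway attachment.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
vpc_id = "vpc-0123456789abcdef0"  # hypothetical VPC

# Add a secondary IPv4 CIDR reserved for network infrastructure only.
# (In practice, wait for the association state to become "associated".)
ec2.associate_vpc_cidr_block(VpcId=vpc_id, CidrBlock="192.168.255.0/24")

# Carve a small dedicated /28 subnet per AZ for the TGW attachment ENIs.
tgw_subnet_ids = []
for az, cidr in [
    ("us-east-1a", "192.168.255.0/28"),
    ("us-east-1b", "192.168.255.16/28"),
    ("us-east-1c", "192.168.255.32/28"),
]:
    subnet = ec2.create_subnet(VpcId=vpc_id, AvailabilityZone=az, CidrBlock=cidr)
    tgw_subnet_ids.append(subnet["Subnet"]["SubnetId"])
```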

Figure 4 (VPC to Transit Gateway attachment) depicts an example architecture demonstrating the use of dedicated subnets, in multiple AZs of a single VPC, for the attachment between that VPC and a Transit Gateway.

Figure 4. VPC to Transit Gateway attachment

At the time of migration, AWS did not support multiple IPv6 CIDR allocations to a VPC. Therefore, we opted to reserve IPv6 CIDR allocations for our Transit Gateway subnets from the top of the VPC’s /56.
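
As a small worked example, Python’s ipaddress module can compute the /64s at the top of a VPC’s /56; the allocation shown is hypothetical.

```python
import ipaddress

# Hypothetical Amazon-provided IPv6 allocation for the VPC.
vpc_ipv6 = ipaddress.ip_network("2600:1f28:3d:c000::/56")

# A /56 holds 256 possible /64 subnets; reserve the last few (the "top"
# of the block) for the Transit Gateway attachment subnets.
all_64s = list(vpc_ipv6.subnets(new_prefix=64))
tgw_ipv6_blocks = all_64s[-3:]  # for example, one per AZ across three AZs
print([str(net) for net in tgw_ipv6_blocks])
# ['2600:1f28:3d:c0fd::/64', '2600:1f28:3d:c0fe::/64', '2600:1f28:3d:c0ff::/64']
```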

Note the VPC soft limits when allocating more IP space, such as the number of CIDR blocks per VPC. Furthermore, when choosing a secondary IPv4 CIDR, be aware of the AWS limitations on which IP blocks can be used as secondary CIDRs.

As we got closer to beginning migrations, we capped usage of the old architecture to prevent the scope of the migrations from continually growing. We began onboarding new VPCs directly to the target architecture, which also allowed us to validate the functionality of the new environments with workloads that were not yet in production.

To build confidence in the change procedures, we identified lower risk environments (for example, smaller, non-production environments) to migrate first before proceeding with our largest workloads. We split the migrations into multiple waves per AWS Region, with 1-2 weeks in between to let workloads soak and surface any issues. This also allowed us to make sure that we could provide a high level of support to individual teams with questions, or in case issues arose during migration.

For each wave, the high-level migration steps looked like the following (a code sketch follows the list):

    • Validate VPC configuration / limits, modify as needed
    • Update each VPC to add dedicated Transit Gateway attachment subnets, create the VPC <-> Transit Gateway attachment request
    • Accept Transit Gateway attachment and populate static routes in Transit Gateway route table
      • We opted not to enable auto-acceptance of attachments to reduce operational risk because of changes in linked accounts
      • We opted not to enable route-propagation from VPC attachments to further reduce operational risk because of changes in linked accounts
    • Update Direct Connect Gateway allowed (advertised) prefix list to advertise migrated VPC CIDRs
    • Update VPC route table(s) to replace the VGW next hop with the Transit Gateway
    • Disassociate VPC from original Direct Connect Gateway/VGW
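
A minimal sketch of these steps with boto3 follows, using hypothetical IDs and CIDRs; waiting on resource states, error handling, and the rollback path are omitted.

```python
import boto3

ec2_workload = boto3.client("ec2", region_name="us-east-1")  # workload account
ec2_network = boto3.client("ec2", region_name="us-east-1")   # network account
dx = boto3.client("directconnect", region_name="us-east-1")  # network account

vpc_id, vpc_cidr = "vpc-0123456789abcdef0", "10.1.0.0/16"    # hypothetical
tgw_id = "tgw-0aaaa1111bbbb2222"                             # hypothetical
tgw_rtb_id = "tgw-rtb-0aaaa1111bbbb2222"                     # hypothetical

# Request the VPC <-> Transit Gateway attachment from the workload account,
# using the dedicated TGW attachment subnets created earlier.
att = ec2_workload.create_transit_gateway_vpc_attachment(
    TransitGatewayId=tgw_id,
    VpcId=vpc_id,
    SubnetIds=["subnet-0aaaa1111bbbb2222", "subnet-0cccc3333dddd4444"],
)["TransitGatewayVpcAttachment"]
att_id = att["TransitGatewayAttachmentId"]

# Accept the attachment in the network account (auto-acceptance disabled),
# associate it with a route table, and add a static route (no propagation).
ec2_network.accept_transit_gateway_vpc_attachment(TransitGatewayAttachmentId=att_id)
ec2_network.associate_transit_gateway_route_table(
    TransitGatewayRouteTableId=tgw_rtb_id, TransitGatewayAttachmentId=att_id
)
ec2_network.create_transit_gateway_route(
    DestinationCidrBlock=vpc_cidr,
    TransitGatewayRouteTableId=tgw_rtb_id,
    TransitGatewayAttachmentId=att_id,
)

# Advertise the migrated VPC CIDR through the Direct Connect gateway.
dx.update_direct_connect_gateway_association(
    associationId="11111111-2222-3333-4444-555555555555",  # hypothetical
    addAllowedPrefixesToDirectConnectGateway=[{"cidr": vpc_cidr}],
)

# Repoint the VPC's on-premises route from the VGW to the Transit Gateway.
ec2_workload.replace_route(
    RouteTableId="rtb-0123456789abcdef0",  # hypothetical VPC route table
    DestinationCidrBlock="10.0.0.0/8",     # hypothetical on-premises aggregate
    TransitGatewayId=tgw_id,
)
```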

During migration, there was a period of asymmetric routing where traffic leaving the VPC continued to go down the original connection, while traffic toward the VPC came through the Transit Gateway. Figure 5 (Asymmetric routing challenge) depicts this asymmetric flow during the migration, and the direct VPC-to-VPC flow post-migration.

Figure 5. Asymmetric routing challenge

Due to the presence of stateful appliances in the data path (not shown), it was determined early on that some downtime would be needed. Our original migration process resulted in approximately 30 minutes of downtime per VPC. Through continued refinement, in partnership with AWS, Comcast reduced downtime to less than one minute per VPC in most cases.

Results

Comcast teams realized the following benefits by using Transit Gateway:

    • Better scale: Migration of several hundred VPCs in the us-east-1 (N. Virginia), us-east-2 (Ohio), and us-west-2 (Oregon) Regions completed with minimal downtime. Subsequently, teams can scale to thousands of Transit Gateway-connected VPCs globally.
    • Better performance and resilience: With in-Region and cross-Region connectivity staying in AWS, teams can maximize their Amazon Elastic Compute Cloud (Amazon EC2) instance bandwidth usage between VPCs without worrying about physical network capacity planning. Comcast teams saw latency improvements of up to 10x for VPC-to-VPC communication in the same Region after migrating to DX Model 2.0.
    • Faster time-to-market: Using self-service automation, Comcast teams can create or expand their network configurations in minutes instead of days. New and migrated VPCs also benefit from firewall automation, which replaces manual requests that took days with repeatable automation that takes hours (or less).

Important considerations/tips

As with any major networking change, a successful migration to Transit Gateway needs careful planning. In addition to the items discussed throughout, consider the following:

    • Evaluate your bandwidth and packets-per-second requirements for each VPC. Work with the AWS support team if you expect to need a higher bandwidth or PPS limit.
    • MTU limits for peered VPCs and Transit Gateway-connected VPCs are different. Asymmetric routing during migrations can lead to MTU mismatches and packet loss.
    • Security group referencing across VPCs is not supported across Transit Gateways, but it is supported across VPC peering. If removing VPC peering as part of a migration, then security group rules must be updated.
    • Transit Gateway uses AWS Resource Access Manager (RAM) for managing cross-account access. When using AWS Organizations, enabling the RAM integration can minimize the number of handshakes needed to share Transit Gateway resources and build Transit Gateway VPC attachments.
    • As of this writing, Transit Gateway doesn’t support dynamic routing across peers. Having sufficient automation when adding or removing VPC CIDR blocks (or using large, Regional aggregates) is critical to keeping cross-Region Transit Gateway route tables up-to-date (a sketch of such automation follows this list). This keeps traffic flowing within AWS as expected.
    • Transit Gateway produces aggregate metrics that can be used for high-level network health monitoring. Attachment level metrics and Transit Gateway flow logs can be enabled to drill down into application-specific issues.
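
As an illustration of the routing automation mentioned above, the following sketch ensures that every VPC CIDR attached to a us-east-1 Transit Gateway has a static route in a us-west-2 Transit Gateway route table pointing at the peering attachment. IDs are hypothetical; a production version would also remove stale routes and handle pagination and failures.

```python
import boto3

use1 = boto3.client("ec2", region_name="us-east-1")
usw2 = boto3.client("ec2", region_name="us-west-2")

EAST_TGW_ID = "tgw-0aaaa1111bbbb2222"            # hypothetical us-east-1 hub
WEST_RTB_ID = "tgw-rtb-0cccc3333dddd4444"        # hypothetical us-west-2 route table
PEERING_ATT_ID = "tgw-attach-0eeee5555ffff6666"  # hypothetical peering attachment

# Collect the CIDRs of every available VPC attached to the us-east-1 hub.
atts = use1.describe_transit_gateway_vpc_attachments(
    Filters=[{"Name": "transit-gateway-id", "Values": [EAST_TGW_ID]}]
)["TransitGatewayVpcAttachments"]
vpc_ids = [a["VpcId"] for a in atts if a["State"] == "available"]

east_cidrs = set()
for vpc in use1.describe_vpcs(VpcIds=vpc_ids)["Vpcs"]:
    for assoc in vpc["CidrBlockAssociationSet"]:
        east_cidrs.add(assoc["CidrBlock"])

# Find the static routes that already exist in the us-west-2 route table.
existing = usw2.search_transit_gateway_routes(
    TransitGatewayRouteTableId=WEST_RTB_ID,
    Filters=[{"Name": "type", "Values": ["static"]}],
)["Routes"]
known = {r["DestinationCidrBlock"] for r in existing}

# Add any missing CIDRs as static routes toward the peering attachment.
for cidr in sorted(east_cidrs - known):
    usw2.create_transit_gateway_route(
        DestinationCidrBlock=cidr,
        TransitGatewayRouteTableId=WEST_RTB_ID,
        TransitGatewayAttachmentId=PEERING_ATT_ID,
    )
```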

Conclusion

At Comcast, AWS Transit Gateway and AWS Direct Connect have allowed us to scale our hybrid-cloud deployments to thousands of VPCs globally. This makes sure that our developers have a reliable, self-service method for accessing the on-premises and cloud-based systems that they need.

About the authors

David Hocky (Guest)

David Hocky is the Technical Product Owner and a Solutions Architect for AWS at Comcast. In this role, he works with internal application development teams to make sure that they have the tools and capabilities to operate workloads securely and reliably at scale.

Amit Kalawat

Amit Kalawat is a Principal Solutions Architect at AWS based out of New York. He works with enterprise users as they transform their business and journey to the cloud.

Tom Adamski

Tom is a Principal Solutions Architect specializing in Networking. He has over 15 years of experience building networks and security solutions across various industries – from telcos and ISPs to small startups. He has spent the last 4 years helping AWS users build their network environments in the AWS Cloud. In his spare time Tom can be found hunting for waves to surf around the California coast.

Acknowledgements

The authors want to thank the many Comcast teams involved with this project, and the AWS Transit Gateway and AWS Direct Connect service teams, for their deep engagement and support from design to production.