Networking & Content Delivery

How to seamlessly migrate traffic between Direct Connect gateways

In this blog post, we explore a scenario in which Goldman Sachs wanted to transfer ownership of several of its key network components between teams in a controlled and seamless manner. Specifically, we take a deep dive into migrating traffic between Direct Connect gateways while maintaining end-to-end connectivity.

As a multinational investment bank and financial services company, Goldman Sachs operates a complex global network infrastructure that is critical to the functioning of its business. This network provides secure connectivity between Goldman Sachs-operated locations, such as offices and data centers, and to partners and third-party providers. In the case of Amazon Web Services (AWS), some elements of this connectivity are physical devices and circuits, and other elements are logical networking constructs.

These network components are owned by multiple teams within Goldman Sachs, and the nature of this ownership is an evolving landscape.

We look at how Goldman Sachs migrated from multiple Direct Connect gateways owned by different lines of business (LOBs) and acquisitions to a set of strategic and central Direct Connect gateways. You may find this type of migration useful in a variety of business or technical scenarios, such as to consolidate accounts during mergers and acquisitions or to minimize technical debt.

The process discussed in this post can also be used to achieve a seamless migration from a virtual private gateway to a transit gateway. This post assumes you have a baseline understanding of the AWS hybrid networking components and their interoperability. To learn more about hybrid networking, see Connect your network to AWS with hybrid connectivity solutions and the Hybrid Connectivity documentation.

Context

Networks that span AWS and on-premises data centers are referred to as hybrid networks, and these architectures help organizations integrate operations to support a broad spectrum of use cases. To establish a hybrid networking environment, there are specific AWS services to explore. For example, AWS Direct Connect makes it easy to establish a dedicated network connection from your on-premises environment to AWS. AWS Transit Gateway connects Amazon Virtual Private Cloud (Amazon VPC) and on-premises networks through a central hub and acts as a cloud router that enables rich routing scenarios. A virtual private gateway is part of a virtual private cloud (VPC) that provides edge routing for Direct Connect connections or site-to-site VPNs, and an AWS Direct Connect gateway is a globally available resource that you can use to connect your on-premises networks to multiple VPCs through multiple Direct Connect connections. 

Customers use these AWS services and the inherent flexibility and agility of the AWS Cloud to serve the ever-changing needs of their business. This post explores one such scenario where Goldman Sachs wanted to shift its cloud networking technology ownership from its internal LOB application teams to a central networking team without impacting its business. It’s possible to migrate ownership of a Direct Connect connection from one account to another by engaging AWS Support, but the equivalent isn’t possible for Direct Connect gateways.

Solution overview

Let’s start with a baseline scenario shown in the following figure, where VPC A is connected to the on-premises network through a virtual private gateway, a Direct Connect gateway, and Direct Connect. For simplicity, this post shows only relevant AWS network components and on-premises connectivity up to the customer router at the Direct Connect location. Account A in this scenario is the LOB AWS account, which initially owns and manages all the AWS networking components.

Figure 1: Simple hybrid networking setup connecting on-premises to VPC A using a Direct Connect gateway

The following figure shows the target state where the hybrid network components (Direct Connect and Direct Connect gateway) are owned by a central networking team (Account B), allowing the LOB teams to focus on the application deployments that deliver value to them. After the migration, the traffic between VPC A and on-premises will flow through Account B’s hybrid networking components, and the original Account A networking components can be decommissioned.

Figure 2: Expected target state depicting traffic between VPC A and on-premises flowing through Account B–owned networking components

Generally, to achieve this end state you can simply remove the virtual private gateway association from the old Direct Connect gateway (Direct Connect gateway A) and create an association with the new Direct Connect gateway of Account B (Direct Connect gateway B). However, this process causes a loss of connectivity while the virtual private gateway is not associated with any Direct Connect gateway, and typically takes up to 20 minutes. In some cases this may be acceptable (and if so, it is certainly the simplest approach!), but when supporting critical business workloads such as the ones Goldman Sachs operates, a maintenance window of this duration may exceed business error budget thresholds and impact SLOs. This post outlines an alternative approach that avoids this loss of connectivity.

At a high level, the process is:

  1. Provision temporary hybrid network infrastructure (Direct Connect gateway and transit gateway) to support the migration. This temporary infrastructure can technically be owned by either Account A or Account B, but for the purposes of this post, we assume it is owned by Account B, the central networking team leading the migration. Transit gateways can be attached to VPCs in parallel to virtual private gateways and used to route traffic to and from the on-premises network, which is what allows for the seamless migration described in the following steps.
  2. Attach the temporary hybrid network connectivity to the line of business (Account A) VPC, and reroute traffic through this path.
  3. Disassociate the current hybrid network connectivity from the line of business VPC.
  4. Provision strategic hybrid network infrastructure (Direct Connect and Direct Connect gateway) owned by the central networking team (Account B).
  5. Associate the strategic hybrid network connectivity with the line of business VPC, and reroute traffic through this path.
  6. Decommission the temporary infrastructure.
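To make the reasoning behind this ordering concrete, the following is a minimal sketch that models the six steps as state changes and checks that a usable path exists in both directions after every change. It is a simplified illustration, not a tool used in the migration: the `PathState` fields, the step labels, and the `connected` rule are all assumptions made for this example.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PathState:
    """Hypothetical, simplified view of VPC A's hybrid connectivity."""
    vgw_dxgw: Optional[str]  # Direct Connect gateway the virtual private gateway is associated with
    tgw_ready: bool          # temporary transit gateway path is attached and associated
    ingress: str             # gateway carrying on-premises -> VPC traffic: "vgw" or "tgw"
    egress: str              # VPC route table target for VPC -> on-premises traffic


def connected(state):
    """Traffic flows only if the gateway used in each direction is usable."""
    usable = {"vgw": state.vgw_dxgw is not None, "tgw": state.tgw_ready}
    return usable[state.ingress] and usable[state.egress]


START = PathState(vgw_dxgw="legacy", tgw_ready=False, ingress="vgw", egress="vgw")

# The six-step process, one state change at a time.
SEAMLESS = [
    ("1: provision temporary path", {"tgw_ready": True}),
    ("2a: ingress via temporary path", {"ingress": "tgw"}),
    ("2b: egress via temporary path", {"egress": "tgw"}),
    ("3: disassociate legacy gateway", {"vgw_dxgw": None}),
    ("4: associate strategic gateway", {"vgw_dxgw": "strategic"}),
    ("5a: ingress via strategic path", {"ingress": "vgw"}),
    ("5b: egress via strategic path", {"egress": "vgw"}),
    ("6: decommission temporary path", {"tgw_ready": False}),
]

# The simple approach: swap the association in place.
DIRECT = [
    ("disassociate legacy gateway", {"vgw_dxgw": None}),
    ("associate strategic gateway", {"vgw_dxgw": "strategic"}),
]


def walk(start, steps):
    """Return the (label, state) pairs reached by applying each step in order."""
    states = [("start", start)]
    for label, change in steps:
        states.append((label, replace(states[-1][1], **change)))
    return states


def outages(start, steps):
    """Labels of the steps after which connectivity is lost."""
    return [label for label, state in walk(start, steps) if not connected(state)]
```

In this model, `outages(START, SEAMLESS)` is empty, whereas `outages(START, DIRECT)` reports the disassociation step, which corresponds to the window of lost connectivity described earlier.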

Things you should note:

  • Step 4 can make use of existing strategic infrastructure or could be completed before starting the whole migration. We’ve structured the flow this way to introduce infrastructure at the point at which it’s required to avoid confusion.
  • The Direct Connect connection(s) used throughout the entire process can be the same; new Direct Connect connections are not required. We draw them as new or independent components for ease of understanding, but the process works the same regardless because the virtual interfaces in use are what actually matter. If existing Direct Connect connections need to change ownership, the support process linked previously can be followed either before or after the migration detailed here.
  • As a best practice, we recommend customers use multiple Direct Connect connections connected to a Direct Connect gateway for resilience purposes. This does not impact the logical migration approach.
  • This solution assumes there are no stateful devices such as firewalls in either traffic path, which would present challenges during the asymmetric routing steps detailed below. Although many customers do operate firewalls at the border between their on-premises and cloud-based networks, they typically have a dynamic routing environment ‘outside’ these firewalls where a migration of this style could be performed. If you need to perform a migration across firewall clusters, a similar approach to the AWS side of the activity should work to minimize impact, but the asymmetric steps will need to be accelerated and will incur some level of impact.

Solution walkthrough: How Goldman Sachs completed an impactless migration between Direct Connect gateways

We recommend that you review this content end to end to ensure you are comfortable with the high-level process and the detailed steps involved, paying particular attention to the live routing changes. We also recommend that you exercise this migration in non-production environments or using a test VPC to validate the experience with your on-premises routing environment.

Step 1

As shown in the following figure, we first provision our temporary hybrid network infrastructure: a transit gateway, Direct Connect gateway, and Direct Connect connection (if desired). Note that you don’t need to have a new Direct Connect connection if you prefer to directly migrate ownership of existing connections through the previously described support process, and all steps in this process will work the same regardless of whether a new or existing Direct Connect connection is used. However, this exercise could provide an opportunity to upgrade your connections to a higher bandwidth or consolidate connectivity.

The transit gateway and Direct Connect gateway introduced here are temporary because these components will be used only for the duration of this migration to allow for an impactless handoff to the final strategic Direct Connect gateway that we’ll create later (or that may already exist to support other flows). The temporary Direct Connect gateway is required because it’s not possible to associate a single Direct Connect gateway with both a transit gateway and a virtual private gateway.

During this step, we attach the transit gateway to VPC A but do not modify the VPC route table. We also associate the transit gateway with the Direct Connect gateway, but we do not configure any allowed prefixes, nor do we configure the on-premises routers to accept route advertisements from this new path. This avoids triggering any path changes for traffic leaving the on-premises network. Note, however, that it is fine at this stage to allow appropriate route advertisements from the on-premises routers to the Direct Connect gateway, because the path for traffic arriving on-premises from the VPC is controlled statically by the VPC route table.
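As an illustration of what this step could look like when scripted, the sketch below builds the request parameters for the two calls involved, using the parameter names of the boto3 `ec2` and `directconnect` clients. The helper name and all IDs are hypothetical. It also assumes that the transit gateway has already been shared with the account that owns the VPC, and that your tooling accepts an association with no allowed prefixes; both are worth confirming in a test environment.

```python
def step1_requests(tgw_id, vpc_id, subnet_ids, temp_dxgw_id):
    """Build boto3-style keyword arguments for Step 1 (hypothetical helper).

    Returns (attach, associate):
      attach    -> ec2.create_transit_gateway_vpc_attachment(**attach)
      associate -> directconnect.create_direct_connect_gateway_association(**associate)
    """
    attach = {
        "TransitGatewayId": tgw_id,
        "VpcId": vpc_id,
        "SubnetIds": list(subnet_ids),
    }
    # Deliberately no "addAllowedPrefixesToDirectConnectGateway" key:
    # nothing should be advertised to on-premises at this stage.
    associate = {
        "directConnectGatewayId": temp_dxgw_id,
        "gatewayId": tgw_id,
    }
    return attach, associate
```

Note that neither request touches the VPC route table, which is what keeps existing flows on the legacy path.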

At this time, you should validate receipt of the on-premises routes by the Direct Connect gateway to ensure they are available for inbound flows during later steps. Subject to this approach, this step has no impact on any existing flows. It’s a bit like connecting a new router to a physical network in parallel to an existing topology, but without advertising any routes.

For simplicity, we are assuming that the on-premises networking components are shared across network flows for both Account A and B. Note, however, that the customer router does not necessarily need to be the same device, as long as the routing domain is common and there is an effective way of directing traffic through the preferred path at each stage of the migration. If there are any stateful firewalls deployed between the old and new paths, ensure that these will not cause issues during the asymmetric routing stages of the process identified later.

Figure 3: Temporary transit gateway, Direct Connect gateway, and Direct Connect connection (as required) provisioned in Account B

Step 2

Now that we have our temporary migration infrastructure ready, we reroute flows via this path. First, we configure appropriate allowed prefixes on the temporary Direct Connect gateway to initiate route advertisements for our VPC Classless Inter-Domain Routing (CIDR) to our on-premises routers. Subject to the configuration of our on-premises routers, we should be able to validate receipt of these routes into the Border Gateway Protocol (BGP) tables and then configure our routers to accept these route advertisements into the routing tables and prioritize traffic from our on-premises network through this temporary path. The exact nature of how this is achieved will vary based on the on-premises routing environment but should follow standard mechanisms for path selection. For important information on the specific behavior of allowed prefix configurations in a transit gateway and virtual private gateway context that controls exactly which prefixes are advertised from AWS to your on-premises routers, refer to the AWS Direct Connect ‘allowed prefixes interactions’ user guide.
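The allowed prefix change could be scripted along the following lines. This is a sketch only: the helper name and the association ID are hypothetical, and the returned dictionary is intended to be passed to the boto3 `directconnect` client's `update_direct_connect_gateway_association` call. The CIDR validation is our own addition to catch typos before any advertisement changes.

```python
import ipaddress


def allowed_prefix_update(association_id, vpc_cidrs):
    """Build keyword arguments for
    directconnect.update_direct_connect_gateway_association(**params).

    Each CIDR is validated first; ipaddress.ip_network raises ValueError
    if host bits are set (for example "10.0.0.1/16").
    """
    prefixes = [{"cidr": str(ipaddress.ip_network(cidr))} for cidr in vpc_cidrs]
    return {
        "associationId": association_id,
        "addAllowedPrefixesToDirectConnectGateway": prefixes,
    }
```

For example, `allowed_prefix_update("assoc-temp", ["10.0.0.0/16"])` yields a request that adds a single prefix to the association.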

Now we have ingress traffic (from on-premises towards AWS) using our temporary bridge. Although the traffic flow is asymmetric at this stage (ingress through the temporary path and egress through the legacy path), you can operate in this mode while completing all necessary validation. This state is depicted in the following figure.

Figure 4: AWS ingress traffic flow through Account B’s temporary path and egress through Account A’s legacy path

Second, we configure the VPC’s route table (Route table A) to send traffic bound for the on-premises network through the transit gateway attachment (Figure 5). Now we have bidirectional traffic flowing through our temporary bridge.
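A route change of this kind might be expressed as follows. The helper is hypothetical and simply builds the keyword arguments for the boto3 `ec2` client's `replace_route` call; it assumes a static route for the on-premises CIDR already exists in the route table. If your on-premises routes are propagated from the virtual private gateway instead, a different call (such as `create_route`) would be needed.

```python
def onprem_route_via(route_table_id, onprem_cidr, *,
                     transit_gateway_id=None, virtual_gateway_id=None):
    """Build keyword arguments for ec2.replace_route(**params).

    Exactly one target must be given: a transit gateway or a
    virtual private gateway.
    """
    if (transit_gateway_id is None) == (virtual_gateway_id is None):
        raise ValueError("specify exactly one of transit_gateway_id or virtual_gateway_id")
    params = {"RouteTableId": route_table_id, "DestinationCidrBlock": onprem_cidr}
    if transit_gateway_id is not None:
        params["TransitGatewayId"] = transit_gateway_id
    else:
        # EC2 uses the GatewayId parameter for virtual private gateways.
        params["GatewayId"] = virtual_gateway_id
    return params
```

The same helper, called with `virtual_gateway_id`, covers the switch back to the virtual private gateway in Step 5.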

Figure 5: All traffic flowing through Account B’s transit gateway, Direct Connect gateway, and Direct Connect

Step 3

Since traffic is no longer flowing through the original path, we can remove the original virtual private gateway association with Direct Connect gateway A (Figure 6). Removing this association is required to prepare the virtual private gateway for the next stage and may take several minutes. You may choose to retain the Direct Connect gateway and Direct Connect components in Account A if they are shared by other workloads or in case rollback is required. Otherwise they can be fully decommissioned at this stage.
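Because the disassociation takes time, an automated workflow would typically poll until it completes before moving on. The following is a minimal polling sketch: `describe` is expected to be the boto3 `directconnect` client's `describe_direct_connect_gateway_associations` method (or any callable with the same response shape). The helper name, delay, and attempt limit are illustrative choices, and treating an empty result as "disassociated" is an assumption.

```python
import time


def wait_for_association_state(describe, association_id, target,
                               delay_s=15.0, max_attempts=120):
    """Poll until the association reaches the target state.

    Returns the number of polls it took; raises TimeoutError otherwise.
    """
    for attempt in range(1, max_attempts + 1):
        response = describe(associationId=association_id)
        associations = response.get("directConnectGatewayAssociations", [])
        # Assumption: an association that is no longer listed is gone.
        state = associations[0]["associationState"] if associations else "disassociated"
        if state == target:
            return attempt
        time.sleep(delay_s)
    raise TimeoutError(f"{association_id} did not reach state {target!r}")
```

A typical use would be to call `delete_direct_connect_gateway_association(associationId=...)` on the client and then `wait_for_association_state(client.describe_direct_connect_gateway_associations, association_id, "disassociated")`.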

Figure 6: Original Account A networking infrastructure disassociated and optionally decommissioned

Step 4

In this step, we create a new Direct Connect gateway within Account B (unless we are using an existing one) and associate it with the virtual private gateway of Account A (Figure 7). This Direct Connect gateway is our strategic Direct Connect gateway, but to avoid impacting any traffic flows at this stage, we won’t modify VPC route tables, configure allowed prefixes on the association, or accept route advertisements on-premises via this path. Note that the process of creating the association between the virtual private gateway and Direct Connect gateway may take several minutes.
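Because the virtual private gateway and the strategic Direct Connect gateway live in different accounts, one way to create this association is through an association proposal: the virtual private gateway owner proposes, and the Direct Connect gateway owner accepts. The post does not prescribe a specific mechanism, so treat the following as an assumption-laden sketch. The helper name and IDs are hypothetical; the parameter names follow the boto3 `directconnect` client's `create_direct_connect_gateway_association_proposal` and `accept_direct_connect_gateway_association_proposal` calls.

```python
def association_proposal_requests(dxgw_id, dxgw_owner_account,
                                  vgw_id, vgw_owner_account):
    """Build keyword arguments for the two-sided proposal flow.

    Returns (propose, accept_for):
      propose                 -> run in the account that owns the virtual private gateway
      accept_for(proposal_id) -> run in the account that owns the Direct Connect gateway
    """
    # No allowed prefixes are included here, mirroring the step above.
    propose = {
        "directConnectGatewayId": dxgw_id,
        "directConnectGatewayOwnerAccount": dxgw_owner_account,
        "gatewayId": vgw_id,
    }

    def accept_for(proposal_id):
        return {
            "directConnectGatewayId": dxgw_id,
            "proposalId": proposal_id,
            "associatedGatewayOwnerAccount": vgw_owner_account,
        }

    return propose, accept_for
```

The proposal ID passed to `accept_for` would come from the response to the proposal request.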

Figure 7: Strategic Direct Connect gateway provisioned in Account B and associated with Account A virtual private gateway

Step 5

Now that we have our strategic infrastructure ready, we proceed to reroute flows via this path. First, we provision appropriate allowed prefixes on Direct Connect gateway B to initiate route advertisements for our VPC CIDR to our on-premises routers in the same way we did for the temporary setup. Again, following similar methods, we should validate receipt of these routes into our BGP tables and then accept them into our routing tables with the strategic path preferred (Figure 8). This will switch our ingress traffic, again resulting in a temporary asymmetric flow (ingress via the strategic path and egress via the temporary path), and you should take this opportunity to perform all necessary ingress routing validation.
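How the on-premises routers come to prefer one path over another depends on your routing environment. As one illustration, the sketch below models a heavily simplified BGP decision: only accepted routes are candidates, the highest local preference wins, and a shorter AS path breaks ties. The use of local preference, the numeric values, and the path names are assumptions for this example rather than the configuration used at Goldman Sachs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str
    local_pref: int   # higher is preferred
    as_path_len: int  # shorter is preferred when local_pref ties
    accepted: bool    # whether the router accepts this advertisement


def best_path(paths):
    """Pick the preferred path among accepted routes (simplified BGP logic)."""
    candidates = [p for p in paths if p.accepted]
    if not candidates:
        return None
    return max(candidates, key=lambda p: (p.local_pref, -p.as_path_len))


temporary = Path("temporary", local_pref=200, as_path_len=1, accepted=True)

# Before Step 5: the strategic advertisement is received but not yet accepted.
before = [temporary, Path("strategic", local_pref=300, as_path_len=1, accepted=False)]

# After accepting it with a higher preference, ingress traffic moves over.
after = [temporary, Path("strategic", local_pref=300, as_path_len=1, accepted=True)]
```

Here `best_path(before)` selects the temporary path and `best_path(after)` selects the strategic one, which mirrors the ingress switch described above.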

Figure 8: AWS ingress traffic flowing through the strategic Direct Connect gateway, with egress traffic still using the temporary bridge

Second, we complete our migration by modifying our VPC A route table to use the virtual private gateway to send traffic to the on-premises network, resolving the temporary asymmetric routing and enabling Account B’s Direct Connect gateway for all traffic (Figure 9).

Figure 9: All traffic flowing through the strategic path

Step 6

At this point we can choose to decommission our temporary transit gateway and Direct Connect gateway networking components (unless we intend to reuse them for similar subsequent migrations).

Figure 10: Final expected end state with temporary infrastructure decommissioned

Conclusion

In this post, we detailed how to migrate on-premises connectivity with zero business impact from one Direct Connect gateway to another using a temporary transit gateway. Although not the core focus of this post, the first part of the process (up to Figure 6) can also be used to achieve a seamless migration from a virtual private gateway to a transit gateway.

We strongly encourage you to validate this workflow in non-production environments, such as with development VPCs, before migrating your critical business flows. This will allow you to validate any specific nuances related to your on-premises routing environment and to become familiar and comfortable with the actions and resulting traffic flows at each step of the process. We also recommend you perform robust technical and business validation throughout production migration events to validate their success. This end-to-end migration approach has been proven in an AWS lab environment and successfully performed across dozens of active production environments at Goldman Sachs without issues.

For more insight into hybrid networking on AWS, check out this video from the AWS Summit San Francisco 2022: Connect your network to AWS with hybrid solutions.

A correction was made on May 28, 2024: An earlier version of this post used the incorrect icons within the diagrams. The diagrams have been updated to reflect the correct icons. A further update was made on June 17, 2024 to correct the “About the authors” boxes.

About the authors

Sujoy Saha (Guest)

Sujoy Saha is a Vice President on the Cloud Enablement team at Goldman Sachs. Sujoy leads efforts in cloud network software automation and architecture, and builds solutions for securing connectivity between the firm and the public cloud. Outside of work, Sujoy enjoys hiking, photography, and traveling.

Venugopal MP (Guest)

Venugopal MP is a Vice President in Network Engineering at Goldman Sachs. He specializes in end-to-end network connectivity across hybrid topologies including cloud providers, on-premises, colocation facilities, vendor products, and third parties. In his free time, Venu enjoys traveling and coffee and spice farming.

Aditya Kanojia (Guest)

Aditya Kanojia is a Vice President on the Cloud Enablement team at Goldman Sachs. Aditya drives infrastructure engineering, ensuring reliability and proactive analysis to optimize performance and minimize downtime. Outside of work, Aditya enjoys creating music and playing cricket.

Harsha W Sharma

Harsha W Sharma is a Principal Solutions Architect with AWS in New York. Harsha joined AWS in 2016 and works with Global Financial Services customers to design and develop architectures on AWS and support their journey on the cloud.

Gerrard Cowburn

Gerrard Cowburn is a Principal Solutions Architect with AWS based in the UK. Gerrard supports Global Financial Services customers in greenfield and migration-based architectural deep dives and prototyping activities. In his free time, Gerrard enjoys exploring the world through food and drink, road trips, and learning to fly helicopters.