Networking & Content Delivery

Reducing Autodesk cloud routing latency with AWS Transit Gateway

This blog is co-authored by Anish John of Autodesk and Bhavin Desai of AWS.

Autodesk is a leader in 3D design, engineering, and entertainment software. They make software for people who make things. If you’ve ever driven a high-performance car, admired a towering skyscraper, used a smartphone, or watched a great film, chances are you’ve experienced what millions of Autodesk customers are doing with their software.

This post is about how we reduced Autodesk cloud routing latency from 60 ms to 2 ms without compromising security. It shows how we simplified our cloud networking topology by combining AWS Transit Gateway with a transit virtual private cloud (VPC). It closes by briefly describing our plans to integrate more network services around AWS Transit Gateway.

Our cloud network architecture
The main workloads of our organization within Autodesk are hosted in multiple AWS Regions, where we have redundant AWS Direct Connect (DX) connections as the transit medium between our data centers and each AWS Region. We have VPCs in each of these Regions, all of which connected to the on-premises data center through private virtual interfaces (VIFs) configured on our DX connections.

At Autodesk, the security pillar is a high priority. Our security posture requires that every cloud workload be tied to a unique account ID, with one or more VPCs, according to the application requirements. All network traffic between these VPCs, or between a VPC and our physical data centers, must traverse a firewall for inspection. This provides visibility into potential breaches and prevents a breach in one VPC from propagating to the rest of the network. For this reason, we use a third-party firewall appliance as the customer gateway device on our side of the DX connection.

By mapping all private VIFs on the firewall to a single security zone and disabling intra-zone traffic, we enforced firewall policing for inter-VPC communication within each AWS Region. This design served us well when it was introduced a few years ago.

As we added more workloads in AWS, however, the design proved not to scale. It added latency to inter-VPC traffic within a Region because the traffic tromboned through the DX connections: when two VPCs in the same AWS Region communicated, their traffic had to travel back and forth between AWS and the on-premises data center firewalls, adding ~60 ms of latency.

In a real-world workflow scenario, the impact of latency adds up quickly. For example, consider a user attempting to transfer 10 GB of data from one VPC to another. With the legacy model injecting 60 ms of latency, the file transfer would take about 7 minutes. However, if the latency was reduced from 60 ms to 2 ms, the same file transfer would take less than 90 seconds!
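
Where do these numbers come from? A single TCP flow's throughput is roughly bounded by its window size divided by the round-trip time (RTT), so transfer time grows with latency until link bandwidth becomes the limit. The short sketch below reproduces the figures above under an assumed effective window of about 1.5 MB per flow; the window size is an illustrative assumption, not a measurement from our environment.

```python
# Back-of-the-envelope model: single-flow TCP throughput ~= window / RTT.
# The 1.5 MB window is an assumed value chosen for illustration; real
# transfers are also capped by link bandwidth and protocol overhead.

WINDOW_BYTES = 1.5e6   # assumed effective TCP window per flow (~1.5 MB)
FILE_BYTES = 10e9      # the 10 GB transfer from the example


def transfer_seconds(rtt_seconds: float) -> float:
    """Approximate transfer time for a single TCP flow."""
    throughput = WINDOW_BYTES / rtt_seconds  # bytes per second
    return FILE_BYTES / throughput


for rtt_ms in (60, 2):
    seconds = transfer_seconds(rtt_ms / 1000)
    print(f"RTT {rtt_ms:>2} ms -> {seconds:.0f} s (~{seconds / 60:.1f} min)")
```

With these assumptions, 60 ms of RTT gives roughly 400 seconds (close to 7 minutes), while 2 ms gives under 15 seconds, consistent with the figures quoted above.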

Latency directly impacts the throughput of file transfers, increasing transaction times and reducing productivity. Beyond the reduced throughput and the hit to productivity, the tromboning traffic also added load to our DX connections to AWS, which reduced the bandwidth available for workflows that truly depended on resources in AWS. This legacy model also created further challenges, such as the following.

Challenges

  • The firewall appliance that we were using had a hard limit on the number of BGP neighbors and sessions it could support. After we reached that limit, we couldn’t provision any more VIFs to connect to VPCs in that AWS Region.
  • Inter-VPC traffic within the same Region tromboning through the DX led to increased traffic and heavy dependency on our DX connections.
  • As a workaround for latency-sensitive applications, we enabled VPC peering. The number of peerings grew quickly, making our cloud architecture increasingly complex and pushing us toward the VPC peering limits.
  • We experienced a latency of ~60 ms for all inter-VPC traffic within the same Region.
  • Our core and shared services VPC had to peer with all VPCs in its Region. After we reached the peering limit of 125, we had to create additional core and shared services VPCs to meet scaling needs.
  • We conducted a successful proof of concept with a transit VPC implementation, but managing IPsec tunnels at scale, along with the automation built around them, looked like an ongoing operational challenge. We needed a simple, scalable, and managed solution.

Solution
AWS launched AWS Transit Gateway at re:Invent 2018, and it simplified our requirements dramatically. AWS Transit Gateway is a great addition to the AWS networking stack and provides a scalable, secure solution for routing in the cloud. It removed the cumbersome process of managing IPsec VPNs between multiple customer gateway virtual appliances and a large number of VPCs in a Region. Because our security model still mandates firewall policing for inter-VPC traffic, we chose the in-line VPC deployment model of AWS Transit Gateway.

We refer to this in-line VPC as the Secure Transit VPC, where we enforce security policies locally on a horizontally scalable fleet of virtual firewalls. We run a pair of virtual firewalls in an active-active setup with asymmetric routing enabled on the firewalls. We also perform in-line inspection with no source NAT or IP change. These firewalls inspect only inter-VPC traffic within the AWS Region.

This approach is designed for providing security, governance, and auditing, or for imposing other firewall functionality on VPC-to-VPC traffic, VPC-to-on-premises traffic, or both. The in-line VPC acts as a middle-box, or bump-in-the-wire, construct: all traffic passes through it before reaching its destination.

Some of the use cases where you can implement this design are intrusion detection and prevention (IDS/IPS), firewalls, L7 deep packet inspection, unified threat management, and auditing every packet.

IPsec VPN between EC2 instances in the in-line VPC and TGW
This is a horizontally scalable service pattern that additionally requires BGP and IPsec functionality in the software running on the Amazon EC2 instances. A provisioning sketch follows the key considerations below.

Advantages

  • Traffic can be load balanced across multiple tunnels terminating on multiple EC2 instances using BGP equal cost multipath (ECMP) routing. This enables horizontal scaling and is compatible with automatic scaling architectures.
  • Failure detection and re-routing of traffic are handled by BGP and dead peer detection (DPD), with no additional automation required.

Key considerations

  • EC2 instances must be able to support BGP-based VPN to the Transit Gateway with ECMP.
  • You are limited to 1.25 Gbps of throughput per VPN tunnel to the transit gateway.
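
To make the pattern concrete, here is a minimal boto3 sketch of registering a firewall EC2 instance as a customer gateway and creating a dynamic, BGP-based VPN connection that terminates on the transit gateway. With ECMP enabled on the transit gateway, additional connections created the same way can be aggregated for more bandwidth. The Region, ASN, Elastic IP, and transit gateway ID are placeholders, and the IPsec tunnels and BGP peering still have to be configured on the firewall software itself.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # example Region

# Register the firewall instance's Elastic IP as a customer gateway.
# The ASN and IP address below are placeholders.
cgw = ec2.create_customer_gateway(
    BgpAsn=65001,
    PublicIp="203.0.113.10",
    Type="ipsec.1",
)["CustomerGateway"]

# Create a BGP-based (dynamic) VPN connection that terminates on the
# transit gateway rather than on a virtual private gateway, so BGP/ECMP
# can load-balance traffic across multiple such connections.
vpn = ec2.create_vpn_connection(
    CustomerGatewayId=cgw["CustomerGatewayId"],
    Type="ipsec.1",
    TransitGatewayId="tgw-0123456789abcdef0",  # placeholder
    Options={"StaticRoutesOnly": False},       # dynamic routing over BGP
)["VpnConnection"]

print("VPN connection created:", vpn["VpnConnectionId"])
```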

Figure: Regional design

This architecture uses AWS managed services such as AWS Transit Gateway, AWS Direct Connect, AWS Resource Access Manager (RAM), and AWS Organizations, along with a third-party virtual firewall available in AWS Marketplace. It creates a virtual data center with a traditional hub-and-spoke design that we named the Secure Transit Hub. In this design, all customer VPCs are spokes attached to the transit gateway, and their traffic traverses the Secure Transit VPC.

We created separate transit gateways for different business units (BUs) or environments and created VPN attachments from each transit gateway to the customer gateways (the virtual firewalls). With the ECMP feature for VPN in AWS Transit Gateway, we aggregated multiple VPN attachments for the BUs that needed more bandwidth. A single transit gateway allows aggregation of up to 50 such VPN attachments by default and can scale horizontally for higher bandwidth.

We used AWS RAM and AWS Organizations to share the transit gateways with all accounts. Within each transit gateway, we created separate routing domains (route tables) for the VPC and VPN attachments. We then propagated the appropriate routes so that all inter-VPC traffic goes through the customer gateway, via the VPN attachment, before reaching the destination VPC.
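
A minimal sketch of how these routing domains and the RAM share might be provisioned with boto3 is shown below, assuming the transit gateway and its attachments already exist. All IDs, ARNs, and the Region are placeholders, and the propagation choices simply illustrate the idea of forcing spoke-to-spoke traffic through the firewall's VPN attachment.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
ram = boto3.client("ram", region_name="us-west-2")

TGW_ID = "tgw-0123456789abcdef0"                 # placeholder
VPC_ATTACHMENT = "tgw-attach-0aaaaaaaaaaaaaaaa"  # a spoke VPC attachment (placeholder)
VPN_ATTACHMENT = "tgw-attach-0bbbbbbbbbbbbbbbb"  # the firewall VPN attachment (placeholder)

# Separate routing domains: one route table for VPC attachments and one
# for the VPN (firewall) attachments.
vpc_rt = ec2.create_transit_gateway_route_table(TransitGatewayId=TGW_ID)[
    "TransitGatewayRouteTable"]["TransitGatewayRouteTableId"]
vpn_rt = ec2.create_transit_gateway_route_table(TransitGatewayId=TGW_ID)[
    "TransitGatewayRouteTable"]["TransitGatewayRouteTableId"]

# Associate each attachment with its routing domain.
ec2.associate_transit_gateway_route_table(
    TransitGatewayRouteTableId=vpc_rt, TransitGatewayAttachmentId=VPC_ATTACHMENT)
ec2.associate_transit_gateway_route_table(
    TransitGatewayRouteTableId=vpn_rt, TransitGatewayAttachmentId=VPN_ATTACHMENT)

# Propagate VPC routes into the VPN domain so the firewall can reach every
# spoke, and propagate only the VPN attachment into the VPC domain so
# spoke-to-spoke traffic is forced through the firewall.
ec2.enable_transit_gateway_route_table_propagation(
    TransitGatewayRouteTableId=vpn_rt, TransitGatewayAttachmentId=VPC_ATTACHMENT)
ec2.enable_transit_gateway_route_table_propagation(
    TransitGatewayRouteTableId=vpc_rt, TransitGatewayAttachmentId=VPN_ATTACHMENT)

# Share the transit gateway with the rest of the organization through AWS RAM.
ram.create_resource_share(
    name="secure-transit-hub-tgw",
    resourceArns=[f"arn:aws:ec2:us-west-2:111122223333:transit-gateway/{TGW_ID}"],
    principals=["arn:aws:organizations::111122223333:organization/o-exampleorgid"],
    allowExternalPrincipals=False,
)
```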

Inter-VPC, or east-west, traffic between VPCs is inspected and routed by the virtual firewalls in AWS. The hardware firewalls in the data center where the DX connections terminate now inspect only north-south traffic between VPCs and on-premises networks. In this way, we segregated the functions of these firewalls and reduced the blast radius of any outage. Based on demand, we can now increase or decrease the number of virtual firewalls and maintain a fleet of firewalls in the AWS Cloud to meet scalability needs.

The route changes and VPC attachment steps needed when creating new VPCs were made part of our in-house account vending machine. We also successfully moved to the native AWS Direct Connect gateway (DX gateway) attachment for the transit gateways in all AWS Regions in which we operate. We created hosted transit VIFs in the account where our DX connections are hosted and natively attached the DX gateway to the transit gateway.

Like the VPN attachments in the VPN routing domain of the transit gateway, the DX gateway attachment is also an association in the VPN routing domain, and the on-premises prefixes are propagated to the VPC routing domain. We also manipulated routing so that only the non-VPC CIDRs of that Region are advertised through the respective DX gateway. We are now in the process of migrating all our legacy VPCs (connected to on-premises by private VIFs) to the transit gateway of their respective AWS Regions.
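
The native DX gateway attachment itself takes only a couple of API calls. The boto3 sketch below creates a Direct Connect gateway and associates it with a transit gateway, using the allowed-prefixes list to control which CIDRs the DX gateway advertises toward on-premises. The Region, ASN, IDs, and prefix are placeholders, and the hosted transit VIF would be provisioned separately in the account that owns the DX connections.

```python
import boto3

dx = boto3.client("directconnect", region_name="us-west-2")  # example Region

# Create a Direct Connect gateway (placeholder name and private ASN).
dxgw = dx.create_direct_connect_gateway(
    directConnectGatewayName="secure-transit-hub-dxgw",
    amazonSideAsn=64512,
)["directConnectGateway"]

# Associate the DX gateway with the transit gateway. The allowed prefixes
# control what is advertised toward on-premises over this association;
# a summarized CIDR is used here as a placeholder.
assoc = dx.create_direct_connect_gateway_association(
    directConnectGatewayId=dxgw["directConnectGatewayId"],
    gatewayId="tgw-0123456789abcdef0",    # placeholder transit gateway ID
    addAllowedPrefixesToDirectConnectGateway=[
        {"cidr": "10.64.0.0/14"},         # placeholder regional summary
    ],
)["directConnectGatewayAssociation"]

print("Association state:", assoc["associationState"])
```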

Benefits

  • Reduced latency – When VPCs in the same Region communicated through an on-premises firewall over DX, there was a latency of ~60 ms. With the new design, we reduced that to ~2 ms.
  • Easier and simplified management – Using AWS Organizations and AWS RAM, we simplified the provisioning process, and the capability to run BGP makes routing dynamic.
  • Security – The next-generation firewall that we use as the customer gateway in our architecture has native integrations with AWS Security Hub and is one of the launch partners of AWS Transit Gateway.
  • Better visibility and control – With centralized monitoring and controls, we can manage all VPCs and edge connections in a single console. We created custom dashboards in our monitoring tool that pull data from Amazon CloudWatch metrics.
  • On-demand bandwidth – By using features like ECMP, we can always scale up or scale down the bandwidth from 1.25 Gbps to 50 Gbps, based on customer requirements.
  • Availability – The core component of our design, AWS Transit Gateway, is highly available and built on AWS Hyperplane, a horizontally scalable state-management system.
  • Automation – Network configurations required during the creation of new VPCs are made part of our in-house account vending machine, and everything in our architecture is built using AWS CloudFormation.

Recommendations and best practices

  • Summarize routes to keep routing table sizes small. AWS Transit Gateway can hold 10,000 routes in its route table, but if other components in your architecture can’t hold that many (for example, a VPC subnet route table), summarization is important.
  • Always use managed AWS services in the design wherever possible. Look at the AWS Marketplace for alternatives if a particular service is unavailable. Our firewall and SD-WAN vendor software were readily available in the AWS Marketplace, and AWS makes it easy to buy and deploy them in minutes.
  • Test all components in an isolated dev account to ensure that the production traffic is not impacted while manipulating routing in various scenarios.
  • Plan your architecture carefully before creating the transit gateway because you can’t modify the resource properties after they are created. The settings we use are reflected in the sketch after this list.
    • Enable ECMP support for future scaling needs.
    • Enable auto-accept shared attachments to keep the VPC provisioning process simple.
    • Disable default route table propagation to have more control with routing.
    • Disable default route table association to have more control over attachments to the route table in TGW.
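
These transit gateway settings map directly to creation-time options. The following boto3 sketch shows one way to apply them; the Region, ASN, and tag values are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # example Region

# Create a transit gateway with the settings recommended above.
response = ec2.create_transit_gateway(
    Description="Secure Transit Hub transit gateway (example)",
    Options={
        "AmazonSideAsn": 64512,                     # placeholder private ASN
        "VpnEcmpSupport": "enable",                 # allow ECMP across VPN attachments
        "AutoAcceptSharedAttachments": "enable",    # keep VPC provisioning simple
        "DefaultRouteTableAssociation": "disable",  # associate attachments explicitly
        "DefaultRouteTablePropagation": "disable",  # propagate routes explicitly
        "DnsSupport": "enable",
    },
    TagSpecifications=[{
        "ResourceType": "transit-gateway",
        "Tags": [{"Key": "Name", "Value": "secure-transit-hub"}],
    }],
)

print(response["TransitGateway"]["TransitGatewayId"])
```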

What’s next?
AWS Transit Gateway opens up the possibility of building more centralized services around it in the cloud. After successfully adding the next-generation firewall to the architecture, we are now working on integrating our SD-WAN solution as well. With SD-WAN in the cloud, our remote offices can communicate directly with AWS Regions without going through a regional data center and DX connections. This will reduce latency and improve the customer experience.

With every component in our architecture built with AWS CloudFormation, we can extend our private IP connectivity to additional AWS Regions in minutes. This enables our developers to build applications in AWS Regions closer to our end customers and gives them more Regions to choose from when designing for higher availability and going global.

Conclusion
Autodesk builds software that helps people imagine, design, and make a better world. AWS Transit Gateway helps us do that by simplifying the management of Amazon VPCs in our cloud infrastructure. This enables our engineering teams to easily scale our network to deliver simple yet robust standardized solutions to support our global workflows. Hopefully you have found this post informative. We look forward to any comments, and happy architecting on AWS!

 

About the Authors


Bhavin Desai

Bhavin Desai is a Sr. Solutions Architect at AWS. He enjoys providing technical guidance to customers, and helping them architect and build solutions to adopt the art of the possible on AWS.


Anish John

Anish John is a Senior Network Engineer at Autodesk Inc. He is passionate about networking and cloud computing, lives in San Rafael, and leads some of the cloud transformation projects at Autodesk. He enjoys understanding customer problems and working backwards to provide practical solutions.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.