
How ZS used Network Orchestration for AWS Transit Gateway to optimize costs and scale up

This is a guest post co-written with Roshan Raj, Cloud Network Specialist at ZS Associates.

In this blog post, we highlight the challenges ZS faced in keeping costs under control while managing a large, complex global network infrastructure that spans multiple AWS Regions. We show how ZS used Network Orchestration for AWS Transit Gateway to transform our network infrastructure. This serverless architecture has increased reliability and scalability and reduced operational overhead for our network team. It has also reduced costs by eliminating the third-party EC2 instances and their licensing fees.

ZS is a management consulting and technology firm focused on transforming global healthcare and beyond. We leverage leading-edge analytics, data, and science to help clients make intelligent decisions. We serve clients in a wide range of industries, including pharmaceuticals, healthcare, technology, financial services, and consumer goods. We develop and host several applications for our customers on Amazon Web Services (AWS).

Over time, our network has grown into a large and complex infrastructure that spans multiple AWS Regions and AWS accounts. We relied on a self-managed third-party tool running on Amazon Elastic Compute Cloud (Amazon EC2), with appropriate AWS Identity and Access Management (IAM) roles, to manage and orchestrate the intricate routing within our internal network infrastructure. This third-party tool played five major roles in our internal AWS network:

  1. Route Orchestration: Automatically added or removed routes from multiple route tables in multiple AWS Regions with minimal user intervention.
  2. Load Balancing Traffic Between Two Active-Active Firewalls: Handled the load balancing of traffic between two firewall instances in different Availability Zones (AZs) within the same AWS Region.
  3. NAT Gateway/FQDN (Fully Qualified Domain Name) Filtering: Routed specific VPC traffic through proprietary NAT gateways, which also provided the ability to filter HTTP/HTTPS traffic based on domains/URLs.
  4. IPsec VPN Tunnels: Created IPsec tunnels with Source NAT and Destination NAT functionality for any traffic passing through the tunnels.
  5. Visualization/Monitoring/Troubleshooting: Provided a complete topology of the AWS network, including attached VPCs and AWS Transit Gateways.
Network architecture diagram showing VPC and TGW route tables and a third-party solution running on EC2 to manage routes and provide VPN services.

Figure 1: Old architecture integrated with a third-party solution

Network components

Spoke Account A

  • Spoke Account A has its own Amazon Virtual Private Cloud (Amazon VPC) that is connected to the Transit Gateway in the network account through Transit Gateway VPC attachments.
  • Spoke Account A’s VPC route table is configured with a default route pointing toward the Transit Gateway, forcing all non-local traffic to flow through the Transit Gateway.
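For illustration, that default route amounts to a single EC2 API call. The following boto3 sketch shows the idea; the route table and Transit Gateway IDs are hypothetical placeholders, and in practice such routes are typically managed through infrastructure as code rather than ad hoc calls.

```python
# Minimal sketch: point a spoke VPC route table's default route at the Transit Gateway.
# The resource IDs below are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_route(
    RouteTableId="rtb-0spokeaexample",           # Spoke Account A VPC route table
    DestinationCidrBlock="0.0.0.0/0",            # all non-local traffic
    TransitGatewayId="tgw-0networkacctexample",  # Transit Gateway shared from the network account
)
```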

Spoke Account B

  • For spoke VPCs that need FQDN-based filtering, the third-party solution has separate EC2 instances that act as NAT gateways. Any internet-bound traffic from those VPCs exits through these EC2 instances, and only authorized FQDNs are allowed.
  • Similar to Spoke Account A, Spoke Account B also has its VPC connected to the Transit Gateway in the network account through Transit Gateway VPC attachments.
  • Spoke Account B’s private subnets route table is configured with a default route pointing toward the third-party EC2 NAT instances. These NAT instances filter traffic based on allowed FQDNs and forward traffic to the Internet Gateway (IGW).
  • For other internal network connectivity, the private subnets route their traffic to the Transit Gateway.

Security account

  • The security account hosts the third-party EC2 instances for orchestration and firewall.
  • The third-party orchestration instances are responsible for load balancing traffic between the firewall instances. They maintain a state table for all connections and distribute the load using a 5-tuple hash (source IP, source port, destination IP, destination port, protocol type), so a given session is consistently directed to the same firewall; if the user initiates a new session, the source port changes and the traffic may be routed to the other firewall (see the sketch after this list). The third-party solution also monitors for firewall failure: if a firewall goes down, the orchestration instance forwards traffic only to the healthy firewall.
  • The Amazon EC2 firewall instances play a crucial role in our network security. They perform deep packet inspection, analyzing the contents of network packets to identify potential threats or unauthorized activities. By enforcing security rules, the firewall instances help maintain a secure network environment.
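To illustrate the flow hashing described above, here is a simplified Python sketch of 5-tuple-based firewall selection. It is purely illustrative and is not the vendor's actual algorithm; real appliances also track connection state and firewall health.

```python
# Illustrative 5-tuple hash: packets of one session always map to the same firewall.
import hashlib

FIREWALLS = ["firewall-az1", "firewall-az2"]  # two active-active firewall instances

def select_firewall(src_ip: str, src_port: int, dst_ip: str, dst_port: int, protocol: str) -> str:
    flow = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{protocol}"
    digest = hashlib.sha256(flow.encode()).digest()
    return FIREWALLS[int.from_bytes(digest[:4], "big") % len(FIREWALLS)]

# Same session -> same firewall; a new session (new source port) may hash to the other one.
print(select_firewall("10.1.1.10", 50432, "10.2.2.20", 443, "tcp"))
print(select_firewall("10.1.1.10", 50433, "10.2.2.20", 443, "tcp"))
```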

Network account and Transit Gateway

  • The third-party EC2 instances are capable of creating IPsec VPN tunnels and can perform Source NAT and Destination NAT. The network account hosts the third-party EC2 instances used for IPsec connectivity with remote sites. Any traffic from spoke VPCs that is destined for IPsec tunnel destinations is routed to these EC2 instances, which NAT the traffic and send it to the remote peer.
  • The network account also contains the Transit Gateway, which has multiple Transit Gateway route tables, including the Security and Spoke Transit Gateway route tables. This Transit Gateway is shared with accounts across our organization in AWS Organizations through AWS Resource Access Manager.
  • The Spoke Transit Gateway route table is associated with the spoke VPC attachments. It has a default route pointing to the security VPC's Transit Gateway attachment, forcing traffic through the security account for further processing.
  • The Security Transit Gateway route table is associated with the security VPC attachment and directs traffic to the appropriate spoke VPC: traffic for the Spoke A VPC CIDR is routed toward the Spoke A VPC Transit Gateway attachment, while traffic for the Spoke B VPC CIDR is routed toward the Spoke B VPC Transit Gateway attachment.

Traffic flow

Figure 2 shows the network traffic flow for the old architecture integrated with a third-party solution.

Network diagram showing the packet flow of a packet from Spoke A VPC to various destinations including Spoke B VPC, Internet, and VPN. It shows how the traffic is routed to the firewall appliances with the third-party EC2 solution.

Figure 2: Traffic flow for old Architecture integrated with a third-party solution

  1. Any egress traffic initiated from the spoke accounts, whether it is internet-bound (except for VPCs which need FQDN filtering) or VPC-to-VPC traffic, is directed to Transit Gateway in the network account.
  2. The Transit Gateway forwards the traffic to the Transit Gateway attachment subnets in the security account.
  3. These Transit Gateway attachment subnets forward the traffic to the third-party EC2 instance’s ENIs.
  4. The third-party instances compute a 5-tuple hash and forward the traffic to one of the firewalls based on the computed hash. The EC2 firewall appliance inspects the traffic and allows or denies based on the firewall security policies.
  5. Internet Bound Traffic: If the traffic is intended for the internet, then the firewall subnet route table has its next hop pointing toward the IGW. The return traffic is sent back toward the Transit Gateway. The Transit Gateway security route table makes sure that the traffic is sent to its intended destination.
  6. Spoke VPC Bound Traffic: If the traffic is intended for any of the spoke VPCs, then the firewall subnet route table has its next hop pointing back toward the Transit Gateway. The Transit Gateway security route table makes sure that the traffic is sent to its intended destination.
  7. IPSec Traffic: If the traffic is intended for IPsec remote CIDRs, then the firewall subnet route table has its next hop pointing back toward the Transit Gateway. The Transit Gateway security route table forwards the traffic to the IPsec VPC. The traffic lands on the third-party EC2 instance, which NATs the addresses and sends them to the remote peer over the IPsec VPN. The return traffic is de-NAT'ed and sent back toward the Transit Gateway. The Transit Gateway security route table makes sure that the traffic is sent to its intended destination.

The combination of the Transit Gateway, the third-party load-balancing solution, and the firewall instances delivered traffic distribution, fault tolerance in case of firewall failures, and security enforcement. Although this third-party solution provided some level of automation, it fell short in terms of cost-effectiveness and integration with other AWS services.

How we replaced the third-party solution

As mentioned earlier, the third-party tool played a crucial role in our internal AWS network, serving five major functions. Let’s go into the details of how we replaced each of these functions.

For route orchestration, the third-party tool used an EC2 instance acting as a controller to manage routing within the AWS environment. It handled tasks such as onboarding new VPCs, attaching them to the Transit Gateway, implementing necessary routing changes, and facilitating inter-Region communication.

Seeking a robust and comprehensive replacement, we turned to Network Orchestration for AWS Transit Gateway, which automates the process of setting up and managing transit networks in distributed AWS environments. It allows you to visualize and monitor your global network from a single dashboard rather than toggling between AWS Regions from the AWS Management Console. It also creates a web interface to help control, audit, and approve transit network changes.

To deploy the solution, you must use three AWS CloudFormation templates:

  1. Management Stack: Deployed in the AWS Organization Management Account.
  2. Hub Stack: Deployed in your network account where the Transit Gateway resides.
  3. Spoke Stack: Deployed in all spoke accounts.

Figure 3 is an overview of the architecture for the automated approval workflow:

Architecture diagram showing components of automation approval workflow including Lambda, Step Functions, Event Bridge and other services that are part of Network Orchestration for Transit Gateway.

Figure 3: Network Orchestration for Transit Gateway Automated Approval Workflow

  1. The solution monitors changes in VPC tags within the spoke accounts. These tag changes trigger an Amazon CloudWatch Events rule, which sends events to Amazon EventBridge in the network account.
  2. EventBridge is configured to accept events only from trusted accounts. When deploying the hub template, you have the option to provide a list of trusted accounts or use Organizations to trust all accounts within the organization.
  3. Once EventBridge receives an event from a trusted account, it invokes another CloudWatch Events rule, which is set to trigger an AWS Lambda function.
  4. The Lambda function analyzes the CloudWatch event details and initiates the appropriate state machine in AWS Step Functions.
  5. Based on the CloudWatch event, the state machine creates, updates, or deletes Transit Gateway attachments, creates or updates Transit Gateway route table associations, and enables or disables Transit Gateway route table propagations.
  6. After completing all operations, the solution updates the changes in Amazon DynamoDB, which is viewable in the web interface. The web interface files are served from Amazon Simple Storage Service (Amazon S3) through Amazon CloudFront to deliver content through edge locations, making sure of the lowest latency for users. Network administrators can log in to the graphical user interface to review modifications. AWS AppSync manages the GraphQL API layer for the web interface, while AWS WAF provides protection against potential attacks on these APIs. Additionally, Amazon Cognito handles authentication and authorization for users accessing the web interface.

Network Orchestration for AWS Transit Gateway is a comprehensive AWS solution that automates network changes for users. By simply tagging VPCs with the Transit Gateway route tables they need to associate with and propagate to, the solution takes care of all routing changes in the background.

In our case, we only needed to add tags to the spoke VPCs to associate them with the spoke Transit Gateway route tables and propagate them to the security Transit Gateway route tables. The solution also includes an optional approval workflow that can be used to approve or deny requested attachments.
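As a concrete illustration, the tagging itself is a single API call per VPC. The boto3 sketch below assumes the solution's default tag keys (Associate-with and Propagate-to) and uses a hypothetical VPC ID and route table names; check the solution's implementation guide for the exact keys configured in your deployment.

```python
# Minimal sketch: tag a spoke VPC so Network Orchestration for AWS Transit Gateway
# associates it with the Spoke TGW route table and propagates it to the Security TGW route table.
# Tag keys are the solution's defaults as we understand them; IDs and names are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_tags(
    Resources=["vpc-0spokeexample"],
    Tags=[
        {"Key": "Associate-with", "Value": "Spoke"},   # TGW route table to associate with
        {"Key": "Propagate-to", "Value": "Security"},  # TGW route table(s) to propagate routes to
    ],
)
```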

This solution primarily operates on the “association” and “propagation” of routes in Transit Gateway route tables. It doesn’t directly handle the addition of static routes. To address this challenge, we made enhancements to our VPC CloudFormation templates by incorporating a Lambda-backed custom resource.

In doing so, we introduced mappings for the required Transit Gateway route tables and attachments within the CloudFormation templates. Based on specific conditions, these templates trigger the Lambda function to add the necessary static routes to the Transit Gateway route tables.
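The following is a minimal sketch of what such a Lambda-backed custom resource handler could look like; it is not our production code. It assumes the cfnresponse helper module is available (it is bundled when the handler is defined inline in a CloudFormation template) and uses illustrative property names.

```python
# Sketch of a Lambda-backed custom resource that manages a static Transit Gateway route.
# Property names are illustrative and do not reflect our exact templates.
import boto3
import cfnresponse  # bundled when the handler code is defined inline in CloudFormation

ec2 = boto3.client("ec2")

def handler(event, context):
    props = event["ResourceProperties"]
    route_table_id = props["TransitGatewayRouteTableId"]
    attachment_id = props["TransitGatewayAttachmentId"]
    cidr = props["DestinationCidrBlock"]

    try:
        if event["RequestType"] in ("Create", "Update"):
            # A production handler would also remove the route described by
            # event["OldResourceProperties"] on updates; omitted here for brevity.
            ec2.create_transit_gateway_route(
                DestinationCidrBlock=cidr,
                TransitGatewayRouteTableId=route_table_id,
                TransitGatewayAttachmentId=attachment_id,
            )
        elif event["RequestType"] == "Delete":
            ec2.delete_transit_gateway_route(
                DestinationCidrBlock=cidr,
                TransitGatewayRouteTableId=route_table_id,
            )
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {}, f"{route_table_id}:{cidr}")
    except Exception as err:
        print(f"Failed to update Transit Gateway route: {err}")
        cfnresponse.send(event, context, cfnresponse.FAILED, {})
```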

This customization allowed us to seamlessly integrate static routes into our network orchestration while harnessing the core capabilities of the Network Orchestration for AWS Transit Gateway solution. The following diagram in Figure 4 is the new architecture after the replacement of the third-party tool.

Architecture diagram showing VPC and TGW route tables used by Network Orchestration for AWS Transit Gateway for the spoke, network, and security accounts.

Figure 4: New architecture after replacing the third-party tool

To achieve load balancing of traffic between two firewalls, the third-party tool used EC2 instances with proprietary software. The software is capable of load balancing traffic between multiple firewall instances, even across different AZs within the same AWS Region.

To achieve the same capabilities through a managed service, we used Gateway Load Balancer (GWLB), which makes it easy to deploy, scale, and manage virtual security and firewall appliances. Because GWLB is a fully managed service built on top of AWS Hyperplane, it reduces potential network failure points and enhances overall availability and performance. GWLB introduces a new type of VPC endpoint called a Gateway Load Balancer endpoint. These endpoints can be created in any VPC, including VPCs other than the one hosting the firewall instances.

The connectivity does not require IP reachability and leverages AWS PrivateLink, a highly available and scalable technology that supports private connections between VPCs and services as if they were in the same VPC.

By adopting GWLB and leveraging Gateway Load Balancer endpoints, you can achieve efficient load balancing of traffic between firewalls while maintaining security and network integrity.
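As an illustration of the wiring involved, the boto3 sketch below creates a Gateway Load Balancer endpoint backed by the firewall GWLB's endpoint service and points the Transit Gateway attachment subnet's default route at it. The endpoint service name and resource IDs are hypothetical placeholders.

```python
# Minimal sketch: create a GWLB endpoint and steer traffic arriving from the
# Transit Gateway through it for inspection. IDs and service name are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# 1. Create a Gateway Load Balancer endpoint backed by the firewall GWLB's
#    endpoint service, which is exposed through AWS PrivateLink.
response = ec2.create_vpc_endpoint(
    VpcEndpointType="GatewayLoadBalancer",
    VpcId="vpc-0securityexample",
    ServiceName="com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0",
    SubnetIds=["subnet-0gwlbe-az1-example"],
)
gwlbe_id = response["VpcEndpoint"]["VpcEndpointId"]

# 2. Route traffic arriving from the Transit Gateway to the GWLB endpoint.
ec2.create_route(
    RouteTableId="rtb-0tgw-attach-az1-example",  # TGW attachment subnet route table
    DestinationCidrBlock="0.0.0.0/0",
    VpcEndpointId=gwlbe_id,
)
```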

To replace the FQDN filtering and NAT gateway functionality provided by the third-party tool, we used our existing firewall virtual appliances. These appliances are equipped to perform Source NAT and filter traffic based on FQDN. Additionally, for IPsec tunnel functionality, we employed a separate firewall virtual appliance.

This setup allowed us to replicate the same configurations and utilize Source NAT and Destination NAT to manage traffic flowing through the tunnel. By leveraging our firewall appliances, we achieved a seamless transition and maintained the necessary functionality for FQDN filtering, NAT Gateway, and IPSec tunnels.

Traffic flow

Figure 5 provides a detailed explanation of the traffic flow for the new architecture:

Network architecture diagram showing the packet flow using the GWLB which sends the traffic to the firewall appliances. It shows the packet flow for three different destinations: internet, spoke VPC, and VPN.

Figure 5: Traffic flow for new architecture after replacing third-party tool

  1. Any egress traffic initiated from the spoke accounts, whether it is internet-bound or VPC-to-VPC traffic, is directed to the Transit Gateway in the network account.
  2. The Transit Gateway forwards the traffic to the Transit Gateway attachment subnets in the security account.
  3. When traffic is received by the Transit Gateway attachment subnets from the Transit Gateway, it is forwarded to the GWLB VPC Endpoints.
  4. Then the VPC endpoints route the traffic to the GWLB, which encapsulates the packets with the GENEVE protocol.
  5. These packets are forwarded to the firewall instances. Note that the firewall appliance needs to support GENEVE encapsulation and decapsulation.
  6. Internet Bound Traffic: If the traffic is intended for the internet, then the firewall decapsulates the GENEVE packets and forwards them to the IGW.
  7. For any other traffic, the packets follow the reverse path: back through the GWLB to the VPC endpoint subnet, whose route table contains routes pointing back to the Transit Gateway.
  8. Spoke VPC Bound Traffic: If the traffic is intended for any of the spoke VPCs, then the VPC Endpoint subnet’s route table contains routes pointing back to the Transit Gateway, making sure that the traffic reaches its intended spoke VPC, as managed by the Transit Gateway security route table.
  9. IPSec Traffic: If the traffic is intended for IPsec remote CIDRs, then the VPC endpoint subnet's route table contains routes pointing back to the Transit Gateway. The Transit Gateway security route table forwards the traffic to the Transit Gateway subnets in the IPsec VPC, which in turn direct it to the newly deployed IPsec firewall virtual appliance.

By replacing the third-party route orchestration functionality with Network Orchestration for AWS Transit Gateway and the load-balancing component with GWLBs, we have embraced a serverless approach to managing the network. This shift eliminates the need for manual upgrades and patches on the third-party solution EC2 instances.

With AWS managing the backend infrastructure maintenance, we can focus on our core tasks without worrying about the underlying infrastructure. This serverless architecture provides increased reliability, scalability, and reduced operational overhead for our network team. Furthermore, it has also helped reduce costs by eliminating the third-party EC2 instances as well as their licensing costs.

Requirement | Replacement
Route Orchestration | Network Orchestration for AWS Transit Gateway; CloudFormation with a Lambda-backed custom resource for static route orchestration
Load Balancing Traffic Between Two Active-Active Firewalls | Gateway Load Balancer (GWLB)
NAT Gateway/FQDN Filtering | Existing firewall virtual appliances
IPSec VPN Tunnels | Newly deployed firewall virtual appliance
Visualize/Monitor | Network Orchestration for AWS Transit Gateway; Global Network

Conclusion

At ZS, we’re committed to staying at the forefront of technology to better serve our clients and make intelligent decisions in a rapidly evolving global healthcare landscape. By updating and optimizing the AWS architecture, we have not only strengthened our network infrastructure but also yielded significant cost savings.

This has not only empowered us to manage our complex network effectively but also allowed us to fully leverage the capabilities of AWS. We look forward to continuing our journey of innovation and transformation on AWS, delivering even greater value to our clients and partners.

Roshan Raj

Roshan is a Cloud Network Specialist at ZS, where he leads the team that oversees internal network and security infrastructure. He is passionate about networking technologies and embraces the opportunity to automate and optimize network operations. Outside of work, Roshan loves spending time with his family and friends.

Gourang Harhare

Gourang is a Senior Solutions Architect at AWS based in Pune, India. He has over 21 years of experience in information technology. With a robust background in large-scale design and implementation of enterprise systems, application modernization, and cloud-native architectures, he specializes in serverless and container technologies. He enjoys solving complex problems and helping customers be successful on AWS. In his free time, he likes to play table tennis, go trekking, or read books.

Scott Hewitt

Scott Hewitt is a Senior Solutions Architect at AWS in Chicago. He helps customers effectively architect their applications to run on AWS. Scott’s passion for networking started over 20 years ago with datacenter networking.