Rivian’s proactive approach to identify unrouteable traffic with AWS Transit Gateway Flow Logs

In complex cloud environments, maintaining visibility across network infrastructure is crucial for operational efficiency. This blog post explores how Rivian, an American electric vehicle manufacturer and technology company, implemented a solution to proactively identify unrouteable traffic in their AWS environment using AWS Transit Gateway Flow Logs. Unrouteable traffic refers to network packets that are dropped because there is no valid route in the AWS Transit Gateway route table to reach their intended destination.

In this post, we demonstrate how a Transit Gateway Flow Logs-based solution helped create an efficient network monitoring architecture with proactive detection capabilities. We walk through Rivian’s implementation process, sharing key insights and best practices that enabled them to achieve enhanced network visibility. The solution described here captures network traffic that would otherwise be silently dropped due to missing routes. A serverless architecture is used to analyze this traffic data and automatically notify relevant teams for investigation. This transformation from reactive to proactive network monitoring, with automatic detection and notification of unrouteable traffic, ultimately helps customers reduce troubleshooting time for network connectivity issues.

Customer environment

At Rivian, we manage a multi-account, multi-Region AWS environment that supports our vehicle manufacturing operations, connected vehicle services, and customer-facing applications. Our network architecture uses Transit Gateway as the central hub, connecting multiple spoke VPCs across different AWS Regions. This architecture serves various workloads, from manufacturing systems and vehicle software development to customer service platforms and data analytics environments.

Challenge

A fundamental challenge in Rivian’s architecture was ensuring comprehensive visibility into unrouteable traffic across their multi-VPC environment. This was particularly relevant because they follow the principle of least privilege in network design: they only add routes in Transit Gateway route tables that are specifically needed for communication between VPCs. By default, they do not allow traffic between applications in different VPCs unless it is explicitly requested and the corresponding routes are added to the Transit Gateway route table.

Although Transit Gateway effectively serves as a central hub for routing traffic, a key challenge emerges from the disconnect between application teams and network teams. Application teams lack visibility into Transit Gateway route tables and often aren’t aware when necessary routes are missing. Simultaneously, network teams have no visibility into the connections that application teams are attempting to establish. This results in traffic being silently dropped when there is a missing route in a Transit Gateway route table, creating a blackhole of lost packets that makes it incredibly difficult for both teams to diagnose connectivity issues.

This lack of visibility leads to extended troubleshooting times and fragmented connectivity between applications, resulting in ineffective integration. To address these problems, Rivian needed a solution that would capture and analyze dropped packets that would otherwise disappear without a trace, map traffic flows to specific Transit Gateway attachments, and provide automated notifications.

Solution

The solution needed to bridge the visibility gap between application and network teams while using AWS native services for scalability and reliability. Transit Gateway Flow Logs proved to be the optimal solution, offering detailed network traffic logging capabilities that could be strategically implemented to capture unrouteable traffic. Rivian enhanced this solution by developing a serverless architecture that processes these logs and automatically maps traffic flows to specific Transit Gateway attachments. This allowed them to gain visibility into dropped packets while maintaining existing Transit Gateway infrastructure and routing policies.

This implementation strategy prioritized non-intrusive monitoring and automated notification capabilities. The solution directs unrouteable traffic to dedicated VPCs deployed across their AWS footprint, where Transit Gateway Flow Logs capture detailed information about packet sources and destinations at the blackhole VPC attachment. These VPCs are referred to as “blackhole VPCs” because each one receives every packet that has no more specific route in its AWS Region. The solution uses the following AWS services working in concert, and this post assumes familiarity with them:

AWS Lambda for event-driven log processing
Amazon Simple Storage Service (Amazon S3) for scalable log storage
AWS Config for collecting VPC CIDR information
Amazon Simple Queue Service (Amazon SQS) for message queuing
Amazon Simple Notification Service (Amazon SNS) for automated notifications
Amazon DynamoDB for efficient deduplication

Implementation

The following figure shows the infrastructure setup for proactively identifying unrouteable traffic.

Figure 1: Infrastructure setup for proactively identifying unrouteable traffic

Download AWS Cloud Development Kit (AWS CDK) code for Lambda functions and CloudFormation stacks here.

In this section we walk through the key components of the architecture.

Following the least privilege security best practice and segmentation, separate accounts are created for audit, networking, logging, and monitoring.

In the audit account, a cross-account role is created. This role can only be assumed by the Lambda function in the central networking account, allowing it to query AWS Config Aggregator to collect VPC CIDR information for all VPCs in the organization. AWS users who are using AWS Control Tower automatically have a default Config Aggregator created in their Audit account, while other users can independently configure their Config Aggregator by following the steps outlined in the post, Set up an organization-wide aggregator in AWS Config using a delegated administrator account.
In the central networking account, within each AWS Region, Transit Gateway is deployed to provide intra-Region, inter-Region, and on-premises connectivity. A blackhole VPC is created in each AWS Region and attached to the Regional Transit Gateway. Transit Gateway Flow Logs are enabled at the blackhole VPC attachment level rather than the Transit Gateway level across all AWS Regions. This targeted approach is more cost-effective while still capturing all necessary blackhole traffic data.
In the central logging account, there is an S3 bucket for consolidated log storage.
In the central monitoring account, there are a series of event-driven Lambda functions and other AWS services deployed. In the following packet walk-through section, we dive into how these components work together.

Packet walk-through for unrouted traffic

Transit Gateway routing configuration

- Each Transit Gateway route table is configured with a summary static route (10.0.0.0/8 or other appropriate internal CIDRs) that directs otherwise unrouted traffic to the blackhole VPC attachment.
- When a packet is received at the Transit Gateway, a route lookup takes place to identify next-hop attachment. When no specific route exists, the summary static route forwards traffic to the blackhole VPC.
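
The route lookup described above follows longest-prefix matching: a more specific route wins over the summary route, and only traffic with no specific route falls through to the blackhole attachment. The following is a minimal sketch of that behavior in plain Python. The route table contents and attachment names are hypothetical and are not taken from Rivian’s environment.

```python
import ipaddress

# Hypothetical route table: CIDR -> attachment name (illustrative values only).
ROUTES = {
    "10.0.0.0/8": "blackhole-vpc-attachment",  # summary static route
    "10.1.0.0/16": "app-a-vpc-attachment",
    "10.2.0.0/16": "app-b-vpc-attachment",
}


def lookup_next_hop(dst_ip, routes):
    """Return the attachment of the most specific route containing dst_ip, or None."""
    addr = ipaddress.ip_address(dst_ip)
    best = None  # (network, attachment) of the longest matching prefix so far
    for cidr, attachment in routes.items():
        network = ipaddress.ip_network(cidr)
        if addr in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, attachment)
    return best[1] if best else None
```

With these example routes, a packet to 10.1.4.7 matches the /16 route for the first application VPC, while a packet to 10.9.0.1 matches only the /8 summary route and is forwarded to the blackhole attachment.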

Traffic capture mechanism

- Traffic sent to the blackhole VPC attachment is captured by Transit Gateway Flow Logs with a custom format, which reduces storage costs by making sure that we only log relevant information.
- These Flow Logs are delivered to a single S3 bucket in the central logging account for unified processing.
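
As a rough illustration of attachment-level logging with a custom format, the sketch below builds the request parameters for the Amazon EC2 `create_flow_logs` API. The post does not state which fields Rivian logs, so the field subset here is an assumption, and the attachment ID and bucket ARN in the usage comment are placeholders.

```python
# Assumed subset of Transit Gateway Flow Logs fields; the fields used in
# Rivian's custom format are not listed in the post.
FIELDS = [
    "tgw-id", "tgw-attachment-id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "start", "end", "log-status",
]


def build_flow_log_request(attachment_id, bucket_arn, fields=FIELDS):
    """Build keyword arguments for the EC2 create_flow_logs call."""
    # Custom formats are written as space-separated ${field} tokens.
    log_format = " ".join("${%s}" % name for name in fields)
    return {
        "ResourceIds": [attachment_id],
        "ResourceType": "TransitGatewayAttachment",  # attachment level, not the whole gateway
        "LogDestinationType": "s3",
        "LogDestination": bucket_arn,
        "LogFormat": log_format,
        "MaxAggregationInterval": 60,  # transit gateway resources require 60 seconds
    }


# Usage (placeholder identifiers, requires AWS credentials):
#   import boto3
#   request = build_flow_log_request("tgw-attach-EXAMPLE", "arn:aws:s3:::example-bucket")
#   boto3.client("ec2").create_flow_logs(**request)
```

Keeping the field list short is what drives the storage savings: each omitted field is one less column written for every record.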

Figure 2: Transit Gateway flow logs with custom format captured at the blackhole VPC attachment level

Automated log processing

- When logs are written to the S3 bucket in the central logging account, an Amazon S3 event triggers a Lambda function (Flow Log Parser) in the central monitoring account.
- This Lambda function parses the log files, extracts source and destination IP pairs, and sends these IP pairs to an SQS queue.
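
The parsing step could look roughly like the sketch below. It is not the Flow Log Parser from the post. It assumes the log file has already been downloaded and decompressed into text, and that the first line is a header of space-separated field names that includes `srcaddr` and `dstaddr`. The sample log content is invented for illustration.

```python
# Invented sample of a decompressed flow log file (header plus three records).
SAMPLE_LOG = """\
tgw-attachment-id srcaddr dstaddr srcport dstport protocol
tgw-attach-0aaa 10.1.4.7 10.9.0.1 49152 443 6
tgw-attach-0aaa 10.1.4.7 10.9.0.1 49153 443 6
tgw-attach-0aaa 10.2.0.5 10.9.0.2 49154 5432 6
"""


def extract_ip_pairs(log_text):
    """Return the unique (srcaddr, dstaddr) pairs in one log file, in first-seen order."""
    lines = [line for line in log_text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    src_idx, dst_idx = header.index("srcaddr"), header.index("dstaddr")
    pairs, seen = [], set()
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip malformed records
        pair = (fields[src_idx], fields[dst_idx])
        if "-" in pair:
            continue  # "-" marks a record with no address data
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs
```

Collapsing repeated pairs within a file before sending them to the queue keeps the message volume proportional to the number of distinct flows rather than the number of records.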

IP-to-attachment mapping

- A Lambda function (Attachment CIDR Collector) in the central networking account is triggered every 4 hours to gather all VPCs and related CIDR information from AWS Config Aggregator.
- For each VPC identified, the function calls `describe_transit_gateway_attachments` using the AWS SDK for Python (Boto3) with the Amazon EC2 client, filtering by VPC ID to collect the relevant Transit Gateway attachment IDs. All the collected data is stored in the same S3 bucket in the central logging account.
- Another Lambda function (IP Mapper) in the central monitoring account continuously polls the SQS queue for source and destination IP pairs. For each IP pair, it reads the pre-collected data from S3, matches the attachment, and publishes all the collected information to Amazon SNS.
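
The matching step in the IP Mapper can be sketched as a CIDR containment check against the pre-collected data. The record layout, VPC IDs, and attachment IDs below are hypothetical, since the post does not describe how the collected data is structured in S3.

```python
import ipaddress

# Hypothetical shape of the data written by the Attachment CIDR Collector.
ATTACHMENT_CIDRS = [
    {"cidr": "10.1.0.0/16", "vpc_id": "vpc-aaa", "attachment_id": "tgw-attach-aaa"},
    {"cidr": "10.2.0.0/16", "vpc_id": "vpc-bbb", "attachment_id": "tgw-attach-bbb"},
]


def find_attachment(ip, records):
    """Return the record whose CIDR contains ip (most specific wins), or None."""
    addr = ipaddress.ip_address(ip)
    matches = [r for r in records if addr in ipaddress.ip_network(r["cidr"])]
    if not matches:
        return None
    return max(matches, key=lambda r: ipaddress.ip_network(r["cidr"]).prefixlen)


def map_pair(src_ip, dst_ip, records):
    """Resolve both ends of a flow to their attachment records."""
    return {
        "source": find_attachment(src_ip, records),
        "destination": find_attachment(dst_ip, records),
    }
```

A destination that resolves to `None` in this sketch would mean the address belongs to no known VPC CIDR, for example an on-premises range or a mistyped address, which is useful context to include in the notification.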

Notification and remediation

- After identifying the source and destination Transit Gateway attachments, the Lambda function (Slack Notifier) deployed in the central monitoring account sends notifications to relevant teams via Slack. These notifications include attachment names, Regions, accounts, IP addresses, ports, and timestamps.
- DynamoDB is used to prevent duplicate Slack messages from causing excessive notifications. When a source and destination pair has been reported, an entry is added to DynamoDB with a TTL of 30 minutes, making sure that the same message won’t be sent again during that period.
- Optionally, network teams can use Route Analyzer for AWS Network Manager to verify the missing routes before updating the appropriate Transit Gateway route tables.
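
One way to implement the deduplication described above is a conditional write: the item is only written if the pair has not been reported, or if its previous entry has already expired. The sketch below builds the parameters for a DynamoDB `put_item` call. The table name, the `pair_id` key, and the `expires_at` attribute are assumptions, not names from the post. The extra expiry check is included because DynamoDB TTL removes expired items in the background rather than at the exact expiry time.

```python
import time

TTL_SECONDS = 30 * 60  # 30-minute suppression window from the post


def build_dedup_put(table_name, src_ip, dst_ip, now=None):
    """Build put_item keyword arguments that succeed only for a new or expired pair."""
    now = int(time.time()) if now is None else int(now)
    return {
        "TableName": table_name,
        "Item": {
            "pair_id": {"S": f"{src_ip}->{dst_ip}"},  # assumed partition key
            "expires_at": {"N": str(now + TTL_SECONDS)},  # assumed TTL attribute
        },
        # Write only if the pair is unseen or its earlier entry is past expiry.
        "ConditionExpression": "attribute_not_exists(pair_id) OR expires_at < :now",
        "ExpressionAttributeValues": {":now": {"N": str(now)}},
    }


# Usage (hypothetical table name, requires AWS credentials):
#   import boto3
#   client = boto3.client("dynamodb")
#   try:
#       client.put_item(**build_dedup_put("unrouted-pairs", "10.1.4.7", "10.9.0.1"))
#       # first report in this window: send the Slack message
#   except client.exceptions.ConditionalCheckFailedException:
#       pass  # already reported within the last 30 minutes
```

Doing the check and the write in a single conditional call avoids a race where two concurrent invocations both read "not reported" and both send a message.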

To demonstrate the effectiveness of this architecture, here’s how the solution handles a typical scenario. When an application team attempts to establish connectivity between VPCs or to on-premises networks without the necessary routes, the following happens:

- Unrouted traffic is forwarded to blackhole VPCs, and flow logs are captured at the Transit Gateway attachment.
- Flow logs are then automatically processed and analyzed using a serverless architecture.
- Relevant teams receive immediate notifications with detailed context through Slack.
- Network teams can validate missing routes and update the necessary route tables.

Conclusion

The implementation of this AWS Transit Gateway Flow Logs-based solution has significantly enhanced Rivian’s network visibility and operational efficiency. The solution has successfully transformed their network monitoring from a reactive to a proactive approach, enabling automatic detection and notification of unrouteable traffic.

Key achievements include:

- Automated detection of missing routes in AWS Transit Gateway route tables.
- Reduced mean time to resolution for network connectivity issues.
- Enhanced collaboration between application and network teams.
- Comprehensive historical traffic analysis capabilities.

This automated process has eliminated the previous blind spots in Rivian’s network infrastructure and significantly improved their ability to maintain reliable connectivity across their growing cloud environment.

To learn more about AWS Transit Gateway Flow Logs, visit the AWS Transit Gateway documentation.

About the authors

Drēm Darios (Guest)

Drēm is a Solutions Architect passionate about building secure, scalable, and cost-efficient AWS solutions. With experience in infrastructure automation, governance, and networking, he helps organizations simplify cloud operations while driving innovation. When not in the cloud, he enjoys being outdoors, experimenting with new technologies, and spending time with his family.

Hardik Shah

Hardik is a Sr. Technical Account Manager at AWS. He brings extensive experience from finance, travel, and retail industries to support customers on their cloud journey. With a deep passion for technology and networking, he enjoys solving complex technical challenges and helping customers optimize their AWS infrastructure. Outside of work, Hardik likes to spend time with his family, traveling, and exploring cultures and cuisines.

Peter Dachnowicz

Peter is a Principal Technical Account Manager at AWS, specializing in the automotive and manufacturing sectors. He serves as a trusted advisor, helping customers navigate their digital transformation using cloud technology. His expertise includes designing highly scalable, flexible, and resilient cloud architectures, with a focus on Generative AI and connected mobility solutions.
