Monitoring AWS Transit Gateway route limits using a serverless architecture

AWS Transit Gateway simplifies your network and puts an end to complex peering relationships. It acts as a cloud router and scales elastically based on the volume of network traffic. It can centralize connections (known as attachments) from your on-premises networks, and attach to Amazon Virtual Private Clouds (VPCs), Virtual Private Networks (VPNs), AWS Direct Connect Gateways, Transit Gateways from other Regions, and Transit Gateway Connect peers.

Among these various attachments, VPN, AWS Direct Connect Gateway, and Transit Gateway Connect peers have quotas on the number of prefixes that are advertised, both to and from Transit Gateway. Along with attachment-specific quotas, each Transit Gateway has a quota on the total number of routes. Routes from these attachments, along with routes from VPC and Transit Gateway peering attachments, count towards the overall quota. You can learn more about the quotas by referring to the Transit Gateway quotas section of our documentation.

As the number of attachments increases over time, monitoring these quotas from within the AWS Management Console or the Command Line Interface (CLI) becomes complex. In this blog, we walk you through a serverless solution to monitor Transit Gateway attachments and send alerts on the corresponding route limits. This solution uses Amazon CloudWatch, Transit Gateway Network Manager, AWS Lambda and Amazon DynamoDB.

Solution architecture:

Solution overview:

When deployed, this solution captures the current state of the Transit Gateways in your account within a given Region. This initial state is captured by triggering an AWS Lambda function, and the state information is written to a DynamoDB table. Next, the solution listens for routing update events sent by Transit Gateway Network Manager to CloudWatch Logs. Any such event invokes another Lambda function to update the DynamoDB table.
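The event-handling step can be sketched as follows. This is a minimal illustration and not the code shipped with the solution: it keeps state in an in-memory dictionary instead of DynamoDB, and the event field names (`changeType`, `routes`, `destinationCidrBlock`, `attachments`, `attachmentId`) and the two change-type values are assumptions modeled on Network Manager routing update events. Verify them against an event from your own account before relying on them.

```python
from collections import defaultdict

# Assumed values of the event's changeType field; verify against a real event.
INSTALLED = "TGW-ROUTE-INSTALLED"
UNINSTALLED = "TGW-ROUTE-UNINSTALLED"


def apply_route_event(state, event):
    """Update a defaultdict {attachment_id: set(prefixes)} from one event."""
    detail = event.get("detail", {})
    change_type = detail.get("changeType")
    if change_type not in (INSTALLED, UNINSTALLED):
        return state  # ignore events that are not route installs or uninstalls
    for route in detail.get("routes", []):
        prefix = route["destinationCidrBlock"]
        for attachment in route.get("attachments", []):
            prefixes = state[attachment["attachmentId"]]
            if change_type == INSTALLED:
                prefixes.add(prefix)
            else:
                prefixes.discard(prefix)  # no error if the prefix is unknown
    return state


def make_event(change_type, prefix, attachment_id):
    """Build a hypothetical event holding only the fields this sketch reads."""
    return {
        "detail": {
            "changeType": change_type,
            "routes": [
                {
                    "destinationCidrBlock": prefix,
                    "attachments": [{"attachmentId": attachment_id}],
                }
            ],
        }
    }


state = defaultdict(set)
apply_route_event(state, make_event(INSTALLED, "10.0.0.0/16", "tgw-attach-a"))
apply_route_event(state, make_event(INSTALLED, "10.1.0.0/16", "tgw-attach-a"))
apply_route_event(state, make_event(UNINSTALLED, "10.0.0.0/16", "tgw-attach-a"))
print({attachment: sorted(prefixes) for attachment, prefixes in state.items()})
```

Storing prefixes in a set makes repeated install events for the same prefix harmless, which matters if an event is delivered more than once.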

The solution also deploys a Lambda function, which runs every minute, to scan the DynamoDB table and calculate the number of prefixes advertised to and from each attachment. It does this for every Transit Gateway in your account in a given Region. As this information is processed, the Lambda function pushes the metrics to CloudWatch using the custom metric push API.
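To make the counting and publishing step concrete, here is a minimal sketch. It assumes a simplified row layout (`attachment_id`, `direction`, `prefix`) that is hypothetical and may not match the tables created by the solution, and it assumes a dimension named `Direction`; neither name comes from the solution's code. The `put_metric_data` call is the standard boto3 CloudWatch client method and is only reached if you call `push_metrics` with valid AWS credentials.

```python
from collections import Counter


def count_prefixes(rows):
    """Count prefixes per (attachment_id, direction) from table-like rows."""
    counts = Counter()
    for row in rows:
        counts[(row["attachment_id"], row["direction"])] += 1
    return counts


def build_metric_data(counts):
    """Shape the counts as a MetricData list for CloudWatch PutMetricData."""
    return [
        {
            "MetricName": attachment_id,
            # "Direction" is a hypothetical dimension name used for illustration.
            "Dimensions": [{"Name": "Direction", "Value": direction}],
            "Value": float(count),
            "Unit": "Count",
        }
        for (attachment_id, direction), count in sorted(counts.items())
    ]


def push_metrics(namespace, metric_data, region="us-west-2"):
    """Send the metrics to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above run without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_data(Namespace=namespace, MetricData=metric_data)


# Hypothetical rows standing in for the result of a DynamoDB scan.
rows = [
    {"attachment_id": "tgw-attach-vpn1", "direction": "IN", "prefix": "10.0.0.0/16"},
    {"attachment_id": "tgw-attach-vpn1", "direction": "IN", "prefix": "10.1.0.0/16"},
    {"attachment_id": "tgw-attach-vpn1", "direction": "OUT", "prefix": "172.16.0.0/12"},
]
metric_data = build_metric_data(count_prefixes(rows))
for metric in metric_data:
    print(metric["MetricName"], metric["Dimensions"][0]["Value"], metric["Value"])
```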

You can use these metrics to create CloudWatch dashboards and alerts based on the limits for each attachment type by following the instructions for Creating a CloudWatch alarm based on a static threshold in our documentation.

Prerequisites

Readers of this blog post should be familiar with the following AWS services:

For this walkthrough, you should have the following:

An AWS account. You can also create a free tier account here.
AWS Command Line Interface (AWS CLI): You need the AWS CLI installed and configured on your workstation.
Credentials configured in the AWS CLI should have the below permissions:
- Amazon S3 full access to allow create, delete buckets and upload objects. Here is an example policy that you can tweak to your needs.
- CloudFormation full access to allow create, delete, and describe stacks. Here is an example policy that you can tweak to your needs.
- Lambda full access to create, delete, update, and run Lambda functions. Here is an example policy that you can tweak to your needs.
Transit Gateway Network Manager is a global service and uses CloudWatch in the us-west-2 region for event processing. Therefore, this solution must be deployed in us-west-2. However, the solution can monitor Transit Gateways in any Region.
- Make sure that you deploy the solution in the us-west-2 region and that your AWS CLI default Region is us-west-2. If us-west-2 is not the default Region, reference the Region explicitly while running AWS CLI commands by using the --region us-west-2 switch.
Amazon S3 bucket in the us-west-2 region for staging Lambda deployment packages.
Amazon S3 buckets in every region where you must monitor the route limits of Transit Gateways.
One or more Transit Gateways with attachments and route tables configured.
Transit Gateway Network Manager should be configured to monitor all Transit Gateways in your account.

Walk through:

Create an Amazon S3 bucket in us-west-2 for staging the deployment packages. Note: You must specify LocationConstraint for every region other than us-east-1.
```
aws s3api create-bucket --bucket <bucket-name> --region us-west-2 --create-bucket-configuration LocationConstraint=us-west-2
```
Create an Amazon S3 bucket in the Region where the Transit Gateway you are planning to monitor is present. For example, this snippet creates the bucket in us-east-1:
```
aws s3api create-bucket --bucket <bucket-name> --region us-east-1
```
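The LocationConstraint rule from the note in step 1 is easy to get wrong when scripting bucket creation, so here is a small sketch of it in Python. The helper names are our own; `create_bucket` and its `CreateBucketConfiguration` argument are the standard boto3 S3 client call, which is only executed if you call `create_bucket` yourself with valid credentials.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build create_bucket arguments, adding LocationConstraint when required."""
    kwargs = {"Bucket": bucket_name}
    if region != "us-east-1":
        # Every Region other than us-east-1 needs an explicit LocationConstraint.
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(bucket_name, region):
    """Create the bucket (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above runs without boto3

    s3 = boto3.client("s3", region_name=region)
    return s3.create_bucket(**create_bucket_kwargs(bucket_name, region))


# Hypothetical bucket names, used only to show the two cases.
print(create_bucket_kwargs("my-staging-bucket", "us-west-2"))
print(create_bucket_kwargs("my-export-bucket", "us-east-1"))
```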
Download and unzip the file containing the CloudFormation template and Lambda function code from here. You can also do this by running the commands that follow from a directory on your local workstation. You must run all of the subsequent commands from the extracted directory.
```
$ wget https://github.com/aws-samples/how-to-monitor-tgw-route-limits-using-serverless-architecture/archive/refs/heads/main.zip
$ unzip main.zip
$ cd how-to-monitor-tgw-route-limits-using-serverless-architecture-main
```

Zip the Lambda functions init_lambda_function.py, update_lambda_function.py, and put_metric_lambda_function.py, and upload them to the Amazon S3 bucket you created in step 1. NOTE: The zip commands shown were run on macOS. Depending on your environment, the ‘zip’ command syntax might vary.

```
$ zip init_lambda_function.py.zip init_lambda_function.py
$ zip update_lambda_function.py.zip update_lambda_function.py
$ zip put_metric_lambda_function.py.zip put_metric_lambda_function.py
$ aws s3 cp init_lambda_function.py.zip s3://<bucket-name-from-step-1>/
$ aws s3 cp update_lambda_function.py.zip s3://<bucket-name-from-step-1>/
$ aws s3 cp put_metric_lambda_function.py.zip s3://<bucket-name-from-step-1>/
```
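If the zip command is not available or behaves differently in your environment, the packaging can also be done with the Python standard library. The sketch below produces the same archive names as the commands above. The demo writes stand-in source files into a temporary directory so that it runs anywhere; in practice you would point `zip_lambda_source` at the real files in the extracted directory. The stand-in handler body is hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path

LAMBDA_SOURCES = [
    "init_lambda_function.py",
    "update_lambda_function.py",
    "put_metric_lambda_function.py",
]


def zip_lambda_source(source_path):
    """Create <source>.zip next to the source, like `zip a.py.zip a.py`."""
    source_path = Path(source_path)
    archive_path = source_path.with_name(source_path.name + ".zip")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # arcname stores the file at the root of the archive, without its path.
        archive.write(source_path, arcname=source_path.name)
    return archive_path


archives = []
members = []
with tempfile.TemporaryDirectory() as workdir:
    for name in LAMBDA_SOURCES:
        source = Path(workdir) / name
        # Stand-in content; the real files come from the downloaded repository.
        source.write_text("def lambda_handler(event, context):\n    return 'ok'\n")
        archive_path = zip_lambda_source(source)
        archives.append(archive_path.name)
        with zipfile.ZipFile(archive_path) as archive:
            members.append(archive.namelist())
print(archives)
```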

Create the required resources by deploying the AWS CloudFormation template with the command that follows. You must provide the following parameters, and you can change their values based on your specific needs.
- CloudWatchMetricNameSpace: the CloudWatch metric namespace under which all the route metrics are pushed.
- S3BucketWithDeploymentPackage: name of the Amazon S3 bucket created in step 1. It holds the deployment packages for all the Lambda functions used in this blog.
- S3BucketForTGWRoutesExport: name of the Amazon S3 bucket created in step 2. It stores the exported route tables that capture the initial state of the Transit Gateways, their attachments, and the number of routes in the route tables.
- TGWRegion: Region where the Transit Gateways you want to monitor are present.

```
aws cloudformation create-stack \
--stack-name TgwRouteMonitoring \
--template-body file://TGWRouteMonitoring.yaml \
--parameters ParameterKey=CloudWatchMetricNameSpace,ParameterValue=TGWRoutes ParameterKey=S3BucketWithDeploymentPackage,ParameterValue=<bucket-name-from-step-1> ParameterKey=S3BucketForTGWRoutesExport,ParameterValue=<bucket-name-from-step-2> ParameterKey=TGWRegion,ParameterValue=<Region-of-tgw-you-want-to-monitor> \
--capabilities CAPABILITY_IAM \
--region us-west-2
```
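The long --parameters argument is easy to mistype. As a hedged sketch, the parameters can be kept in a dictionary and rendered either for the CLI or for the boto3 `create_stack` call. The bucket names and Region below are hypothetical placeholders, and the helper names are our own. `create_stack` is only executed if you call it with valid credentials.

```python
def to_cfn_parameters(values):
    """Convert {key: value} into the list form used by CloudFormation APIs."""
    return [
        {"ParameterKey": key, "ParameterValue": value}
        for key, value in values.items()
    ]


def to_cli_parameters(values):
    """Render the same values as the argument of the CLI --parameters option."""
    return " ".join(
        f"ParameterKey={key},ParameterValue={value}" for key, value in values.items()
    )


def create_stack(stack_name, template_path, values, region="us-west-2"):
    """Create the stack with boto3 (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above run without boto3

    with open(template_path) as template:
        template_body = template.read()
    cloudformation = boto3.client("cloudformation", region_name=region)
    return cloudformation.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        Parameters=to_cfn_parameters(values),
        Capabilities=["CAPABILITY_IAM"],
    )


# Hypothetical values; replace them with your own bucket names and Region.
stack_parameters = {
    "CloudWatchMetricNameSpace": "TGWRoutes",
    "S3BucketWithDeploymentPackage": "my-staging-bucket",
    "S3BucketForTGWRoutesExport": "my-export-bucket",
    "TGWRegion": "us-east-1",
}
print(to_cli_parameters(stack_parameters))
```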

This stack includes resources that affect permissions in your AWS account by creating the necessary IAM roles. You must explicitly acknowledge this by specifying the CAPABILITY_IAM or CAPABILITY_NAMED_IAM value for the --capabilities parameter.

Stack creation takes approximately 5–7 minutes. Check the status of the stack by running the command that follows every few minutes. You should see a StackStatus value of CREATE_COMPLETE when it is done.

Example:

```
aws cloudformation describe-stacks --stack-name TgwRouteMonitoring --region us-west-2 | grep StackStatus
```
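If you would rather script this check than read grep output, the JSON printed by describe-stacks can be parsed directly. The sketch below is one possible approach. The three-way classification (`done`, `waiting`, `failed`) is our own simplification for a stack that is being created for the first time, and the sample output is trimmed to the two fields the code reads.

```python
import json


def stack_status(describe_stacks_output, stack_name):
    """Return StackStatus for stack_name from `describe-stacks` JSON output."""
    response = json.loads(describe_stacks_output)
    for stack in response.get("Stacks", []):
        if stack["StackName"] == stack_name:
            return stack["StackStatus"]
    raise LookupError(f"stack {stack_name} not found in output")


def creation_state(status):
    """Classify a status while waiting for a first-time stack creation."""
    if status == "CREATE_COMPLETE":
        return "done"
    if status == "CREATE_IN_PROGRESS":
        return "waiting"
    return "failed"  # for example CREATE_FAILED or ROLLBACK_COMPLETE


# Trimmed sample of what the CLI prints; real output has many more fields.
sample_output = json.dumps(
    {"Stacks": [{"StackName": "TgwRouteMonitoring", "StackStatus": "CREATE_COMPLETE"}]}
)
status = stack_status(sample_output, "TgwRouteMonitoring")
print(status, creation_state(status))
```

When using boto3 instead of the CLI, the CloudFormation client also offers a `stack_create_complete` waiter that blocks until creation finishes or fails.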

The CloudFormation template will create the following resources:

Two Amazon DynamoDB tables (Logical ID: RoutesDDBTableIn and RoutesDDBTableOut) with 5 Read and Write Capacity Units (RCU and WCU), used to store all the required parameters to monitor the number of routes. It also creates a write scaling policy for Amazon DynamoDB to scale the WCUs to a maximum of 15.
InitLambdaFunction (init_lambda_function.py) with required IAM permissions to export Transit Gateway routes and populate the DynamoDB Tables.
UpdateLambdaFunction (update_lambda_function.py) with required IAM permissions to track Transit Gateway route install and uninstall events and update the DynamoDB tables accordingly.
PutMetricLambdaFunction (put_metric_lambda_function.py) with required IAM permissions to scan the DynamoDB table, calculate per attachment incoming and outgoing route advertisements and then push the metrics to CloudWatch.
CloudWatch event rule to run UpdateLambdaFunction whenever there is a Transit Gateway route install or uninstall event.
CloudWatch schedule rule with required IAM permissions to invoke PutMetricLambdaFunction every minute. This will scan the DynamoDB table, calculate per attachment incoming and outgoing route advertisements and then push the metrics to CloudWatch.
CloudWatch schedule rule with required IAM permissions to invoke InitLambdaFunction every 60 minutes. This function will export the routes in Transit Gateway route tables to an Amazon S3 bucket and then parse the data and update the DynamoDB table.

Once the stack is deployed, we must populate the DynamoDB table with the current state of the Transit Gateways and route tables. We do that by invoking InitLambdaFunction manually from the AWS CLI. To do this, we need the physical ID of the function, which we can find by describing the resources of the AWS CloudFormation stack, as shown in the following snippet:

```
$ aws cloudformation describe-stack-resources --stack-name TgwRouteMonitoring --region us-west-2 | grep InitLambdaFunction
```

```
"PhysicalResourceId": "TGWRTMON-InitLambdaFunction-1E4ONARQ02SM3",
"LogicalResourceId": "InitLambdaFunction"
```

Use the value of PhysicalResourceId from the above output and invoke the function using the following command:

```
$ aws lambda invoke --function-name TGWRTMON-InitLambdaFunction-1E4ONARQ02SM3 response.json --region us-west-2
```

Once the function has been invoked, you should see the following output:

```
{
    "ExecutedVersion": "$LATEST",
    "StatusCode": 200
}
```
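Both lookups in this step, finding the physical ID and confirming that the invocation worked, can be scripted against the JSON that the CLI prints. The sketch below is one way to do it, with helper names of our own choosing and sample data trimmed to the fields shown above. Note that a StatusCode of 200 alone does not prove that the function body succeeded: when the handler fails, the response also carries a FunctionError field, so the check looks for both.

```python
import json


def physical_id(describe_output, logical_id):
    """Find a physical ID in `describe-stack-resources` JSON output."""
    response = json.loads(describe_output)
    for resource in response.get("StackResources", []):
        if resource["LogicalResourceId"] == logical_id:
            return resource["PhysicalResourceId"]
    raise LookupError(f"{logical_id} not found in stack resources")


def invocation_succeeded(invoke_output):
    """Check `aws lambda invoke` JSON output for a successful synchronous run."""
    response = json.loads(invoke_output)
    # FunctionError is only present when the handler itself failed.
    return response.get("StatusCode") == 200 and "FunctionError" not in response


# Trimmed samples that reuse the values shown in this walkthrough.
resources_output = json.dumps(
    {
        "StackResources": [
            {
                "LogicalResourceId": "InitLambdaFunction",
                "PhysicalResourceId": "TGWRTMON-InitLambdaFunction-1E4ONARQ02SM3",
            }
        ]
    }
)
invoke_output = json.dumps({"ExecutedVersion": "$LATEST", "StatusCode": 200})

function_name = physical_id(resources_output, "InitLambdaFunction")
print(function_name, invocation_succeeded(invoke_output))
```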

At this point, all the required components are in place to monitor the number of routes per attachment per Transit Gateway. InitLambdaFunction has populated the DynamoDB table, UpdateLambdaFunction is triggered whenever there is a Transit Gateway route install or uninstall event, and PutMetricLambdaFunction calculates the routes per attachment every minute and pushes them to CloudWatch.

To view the metrics in the AWS Management Console, navigate to CloudWatch, then go to Metrics and click on the Custom Namespace created by the PutMetricLambdaFunction.

Under the namespace, you find the metrics for each attachment, identified by its attachment ID. Click on the desired metric.

Under each Metric are its corresponding dimensions (IN or OUT). Click on the desired dimension.

NOTE: For VPC and Transit Gateway peering attachments, ‘IN’ and ‘OUT’ indicate the number of prefixes reachable in each direction. For example, for VPC attachments, ‘IN’ indicates the number of prefixes in the VPC that are reachable from the Transit Gateway's perspective, and ‘OUT’ indicates how many prefixes are reachable from the VPC via the Transit Gateway.

A useful statistic for these metrics is ‘custom percentile, p100’ for a period of 1 minute.

You can use these metrics to create dashboards and alerts based on the quotas for each individual attachment, by following the instructions in CloudWatch documentation.
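As an illustration of a static-threshold alarm on one of these metrics, the sketch below builds the arguments for the boto3 `put_metric_alarm` call, using the p100 statistic over a 1-minute period. Several details are assumptions: the dimension name `Direction` must be replaced with whatever dimension name appears on your metrics, the quota value of 100 is only an example number (look up the actual quota for your attachment type), and the alarm name format and 80 percent alert level are arbitrary choices.

```python
def build_alarm_kwargs(namespace, attachment_id, direction, quota,
                       alert_percent=80, sns_topic_arn=None):
    """Build put_metric_alarm arguments for one attachment and direction."""
    kwargs = {
        "AlarmName": f"{attachment_id}-{direction}-route-quota",
        "Namespace": namespace,
        "MetricName": attachment_id,
        # "Direction" is a hypothetical dimension name; match your metrics.
        "Dimensions": [{"Name": "Direction", "Value": direction}],
        "ExtendedStatistic": "p100",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": quota * alert_percent / 100,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }
    if sns_topic_arn is not None:
        kwargs["AlarmActions"] = [sns_topic_arn]  # notify this SNS topic
    return kwargs


def create_alarm(alarm_kwargs, region="us-west-2"):
    """Create the alarm (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above runs without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(**alarm_kwargs)


# Example values only: the attachment ID is made up and 100 is not a real quota.
alarm_kwargs = build_alarm_kwargs(
    "TGWRoutes", "tgw-attach-0123456789abcdef0", "IN", quota=100
)
print(alarm_kwargs["AlarmName"], alarm_kwargs["Threshold"])
```

Alerting at a fraction of the quota rather than at the quota itself leaves time to summarize or filter routes before advertisements start being affected.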

Cost of the solution:

Cost depends on how many route update events are generated in your network and then processed and stored by the solution. More information on pricing can be found on the public pricing pages for each of the services.

Clean up:

To ensure that no charges are incurred, be sure to empty and delete the Amazon S3 buckets created in steps 1 and 2, and delete the CloudFormation stack created in step 5.

Conclusion:

In this blog post, we demonstrated how to monitor the number of routes per attachment on each Transit Gateway by deploying a serverless solution and using CloudWatch. You can use these metrics to create CloudWatch alarms that notify you when the number of routes approaches the attachment limit, so that you can take action to keep the routes within limits.

Ananth Balasubramanyam

Ananth Balasubramanyam is a Sr. Solutions Architect based out of London, UK and has worked extensively with start-ups. He has experience spanning the Education, Financial, CRM, and Insurance domains. He gets excited about building awesome, cool stuff and has been experimenting with AI/ML and Robotics. His interest areas are SaaS, Big Data, and AI/ML. When not working on anything AWS related, he loves the outdoors and taking his bike out.

Vijay Menon

Vijay Menon is a Principal Solutions Architect based out of Bangalore, India with a background in large scale networks and communications infrastructure. He enjoys learning new technologies and helping customers solve complex technical problems by providing solutions using AWS products and services. When he is not helping customers, he likes to go on long runs and spend time with family and friends.
