Networking & Content Delivery

Monitor BGP status on AWS Direct Connect VIFs and track prefix count advertised over Transit VIF

As businesses transition to cloud-based infrastructure, establishing reliable connectivity between on-premises and cloud environments becomes a critical requirement. AWS Direct Connect provides a dedicated network link that extends a corporate data center network into the Amazon Web Services (AWS) Cloud. At the core of this connection is the Border Gateway Protocol (BGP), a dynamic routing protocol that initiates and maintains connections between these networks. This post examines mechanisms for monitoring these connections and sending alerts when their state changes.

Alerting and monitoring solution overview

In this post, we explore two common needs.

  1. Establishing a notification mechanism to receive alerts when there are changes in the state of virtual interfaces (VIFs) within your AWS account, whether they are transit, private, or public VIFs.
  2. Monitoring the number of prefixes propagated from the AWS Direct Connect Gateway (DXGW) attachment to an AWS Transit Gateway route table, and receiving notifications when the number of prefixes exceeds a defined threshold.

The post discusses practical solutions to enhance monitoring and visibility of your Direct Connect services, enabling quick response to unexpected changes or thresholds.

Prerequisites

To follow along with this post, you need:

Walk-through:

Monitoring VIF state

Currently, AWS Direct Connect monitoring includes connection level and virtual interface level monitoring. The Direct Connect connection has the ConnectionState metric. Though not specific to each VIF, it remains important and should be monitored.
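
As one way to monitor the connection-level metric, the following is a minimal sketch of a CloudWatch alarm on ConnectionState. It assumes the metric is published in the AWS/DX namespace with a ConnectionId dimension and reports 1 for up and 0 for down. The function names, the alarm naming scheme, and the one-minute period are illustrative assumptions, not part of the solution described in this post.

```python
def connection_state_alarm(connection_id, topic_arn):
    """Build put_metric_alarm arguments that fire when ConnectionState drops below 1 (up)."""
    return {
        "AlarmName": f"dx-connection-down-{connection_id}",  # hypothetical naming scheme
        "Namespace": "AWS/DX",
        "MetricName": "ConnectionState",
        "Dimensions": [{"Name": "ConnectionId", "Value": connection_id}],
        "Statistic": "Minimum",
        "Period": 60,               # seconds; an assumed evaluation window
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [topic_arn],  # SNS topic to notify
    }


def create_alarm(cloudwatch_client, connection_id, topic_arn):
    """Create the alarm using a client such as boto3.client('cloudwatch')."""
    cloudwatch_client.put_metric_alarm(**connection_state_alarm(connection_id, topic_arn))
```

The client is passed in as an argument so the argument-building logic can be checked without AWS credentials.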

Our solution uses AWS Lambda, Amazon CloudWatch log groups, and Amazon Simple Notification Service (Amazon SNS) to monitor VIF state and issue alerts. We tested the solution within a single AWS account. The Lambda code follows this sequence:

  1. Retrieve VIF information, including BGP peer status, using the DescribeVirtualInterfaces API.
  2. Log the VIF details to the CloudWatch log groups.
  3. Set up a conditional check to monitor for VIF state changes and trigger Amazon SNS alerts to a specified topic.
  4. Automate the process using an EventBridge-triggered Lambda function running every minute (adjustable as needed).
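
The retrieve-and-check part of this sequence might look like the following minimal sketch. It is not the code from the GitHub repository. It assumes the DescribeVirtualInterfaces response carries virtualInterfaceId and a bgpPeers list with bgpStatus per peer, and it reads the topic ARN from a hypothetical SNS_TOPIC_ARN environment variable. The sketch flags every peer that is currently not up on each run; alerting only on transitions would require storing the previous state somewhere, which is omitted here, as is the CloudWatch logging step.

```python
import os


def find_bgp_peers_not_up(response):
    """Return (vif_id, peer_id, status) for every BGP peer whose status is not 'up'."""
    problems = []
    for vif in response.get("virtualInterfaces", []):
        for peer in vif.get("bgpPeers", []):
            status = peer.get("bgpStatus", "unknown")
            if status != "up":
                problems.append(
                    (vif["virtualInterfaceId"], peer.get("bgpPeerId", "n/a"), status)
                )
    return problems


def format_alert(problems):
    """Render one line per affected BGP peer for the notification body."""
    return "\n".join(
        f"VIF {vif_id}: BGP peer {peer_id} is {status}"
        for vif_id, peer_id, status in problems
    )


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above can be exercised without the AWS SDK

    dx = boto3.client("directconnect")
    sns = boto3.client("sns")
    problems = find_bgp_peers_not_up(dx.describe_virtual_interfaces())
    if problems:
        sns.publish(
            TopicArn=os.environ["SNS_TOPIC_ARN"],  # hypothetical environment variable
            Subject="Direct Connect BGP alert",
            Message=format_alert(problems),
        )
    return {"statusCode": 200, "peersNotUp": len(problems)}
```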

Configuration steps

  1. Create an IAM role for the Lambda functions with the following permissions:
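
A minimal sketch of an identity-based policy for such a role, written as a Python dictionary and printed as JSON. The three actions are an assumption based on the API calls this solution makes (DescribeVirtualInterfaces, writing log events, and publishing to SNS); this is not the authors' published policy, and the ARNs, Region, and account ID are hypothetical placeholders.

```python
import json

# Hypothetical ARNs; replace the Region, account ID, and resource names with your own.
LOG_GROUP_ARN = "arn:aws:logs:us-east-1:111122223333:log-group:VIF-Alert:*"
TOPIC_ARN = "arn:aws:sns:us-east-1:111122223333:VIF-Alert"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        # Read VIF and BGP peer details (assumed to need a wildcard resource).
        {"Effect": "Allow", "Action": "directconnect:DescribeVirtualInterfaces", "Resource": "*"},
        # Write VIF details to the log streams of the designated log group.
        {"Effect": "Allow", "Action": "logs:PutLogEvents", "Resource": LOG_GROUP_ARN},
        # Publish alerts to the notification topic.
        {"Effect": "Allow", "Action": "sns:Publish", "Resource": TOPIC_ARN},
    ],
}

print(json.dumps(policy, indent=2))
```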

2. Create an SNS topic and subscription:

  • Create an Amazon SNS topic named “VIF-Alert” for alert notifications in the Amazon SNS console.
  • Create a new subscription in the Subscriptions section.
  • For the Topic ARN, select the “VIF-Alert” topic ARN.
  • Choose Email as the protocol and enter the email address to receive the notifications.
  • Create the subscription.
Figure 1: SNS Topic with an Email address Subscription.

3. Set up CloudWatch log group and log stream:

  • In the CloudWatch console, create a new log group named “VIF-Alert”.
  • Within the log group, create a new log stream named “VIF-Alert” to store detailed VIF information, such as VIF ID and BGP status updates. This is shown in the following screenshot (Figure 2).
Figure 2: CloudWatch Log Group with a log stream.

4. Create a Lambda function for VIF monitoring:

  • In the AWS Lambda console, create a new function.
  • Select “Author from scratch” and provide a name.
  • Choose Python 3.12 as the runtime.
  • Keep the architecture set to x86_64 and create the function.
  • Use the code provided in this GitHub repository.

Integrate this logic in the Lambda function to push VIF details from the DescribeVirtualInterfaces API to CloudWatch Logs.
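
The push to CloudWatch Logs could be sketched as follows. This is an illustration rather than the repository code: the function names and the JSON message layout are assumptions, and the log group and log stream default to the “VIF-Alert” names used in this post. The CloudWatch Logs client (for example, boto3.client("logs")) is passed in as an argument.

```python
import json
import time


def build_log_events(vifs, now_ms=None):
    """Turn VIF records into the logEvents list that PutLogEvents expects."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # CloudWatch Logs timestamps are epoch milliseconds
    events = []
    for vif in vifs:
        message = {
            "vifId": vif["virtualInterfaceId"],
            "vifState": vif.get("virtualInterfaceState"),
            "bgpStatus": [peer.get("bgpStatus") for peer in vif.get("bgpPeers", [])],
        }
        events.append({"timestamp": now_ms, "message": json.dumps(message)})
    return events


def push_vif_details(logs_client, dx_response, group="VIF-Alert", stream="VIF-Alert"):
    """Write one log event per VIF and return the number of events written."""
    events = build_log_events(dx_response.get("virtualInterfaces", []))
    if events:  # PutLogEvents requires at least one event per call
        logs_client.put_log_events(
            logGroupName=group, logStreamName=stream, logEvents=events
        )
    return len(events)
```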

5. Configure a trigger for invoking the Lambda function. In this example, we create an EventBridge trigger:

  • Open the Functions page of the Lambda console and select the function as shown in Figure 3, then choose Add trigger.
Figure 3: Configure triggers for Lambda function.

  • Select EventBridge (CloudWatch Events) as the source.
  • Create a new rule with the name and description of your choice.
  • For Rule type, select Schedule expression and use the rate expression “rate(1 minute)” as shown in Figure 4 so that the Lambda function runs every minute, providing near-real-time updates on VIF status.
  • Select Add trigger.
Figure 4: Adding an EventBridge trigger with a rate of 1 minute.

By following these steps, you create a comprehensive solution for effective and timely monitoring of VIF states in your AWS environment.
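
The schedule from step 5 can also be created programmatically. The following is a minimal sketch, assuming EventBridge and Lambda clients (for example, boto3.client("events") and boto3.client("lambda")) are passed in; the function names, parameter names, and the statement ID format are hypothetical. The helper reflects the fact that EventBridge rate expressions use the singular unit for a value of 1 and the plural unit for larger values.

```python
def rate_expression(value, unit="minute"):
    """Build an EventBridge rate expression such as 'rate(1 minute)' or 'rate(5 minutes)'."""
    if unit not in ("minute", "hour", "day"):
        raise ValueError("unit must be 'minute', 'hour', or 'day'")
    if value < 1:
        raise ValueError("value must be a positive integer")
    # Singular unit for 1, plural for anything larger.
    return f"rate({value} {unit if value == 1 else unit + 's'})"


def schedule_function(events_client, lambda_client, rule_name,
                      function_name, function_arn, minutes=1):
    """Create a scheduled rule, target the function, and allow the rule to invoke it."""
    rule_arn = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=rate_expression(minutes),
        State="ENABLED",
    )["RuleArn"]
    events_client.put_targets(Rule=rule_name, Targets=[{"Id": "1", "Arn": function_arn}])
    # Grant EventBridge permission to invoke the function, scoped to this rule.
    lambda_client.add_permission(
        FunctionName=function_name,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )
    return rule_arn
```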

Testing VIF status monitoring: When the VIF state changes from up to down

  1. Simulating a VIF outage:
  • Shut down BGP neighborships on the on-premises gateway device or use the failover option in the AWS console with the remaining VIF designated as the production VIF.

2. Verifying the Lambda function:

  • Run the Lambda function and check the Execution results tab for a 200 response code, indicating successful execution as illustrated in Figure 5.
Figure 5: Successful execution of the Lambda function code.

  • The Lambda function is triggered by EventBridge every minute and pushes VIF ID and BGP state to the CloudWatch log group.

3. CloudWatch log validation:

  • Review CloudWatch logs to verify VIF status and IDs are successfully transmitted to the designated log group, as shown in Figure 6.
Figure 6: BGP Down status logs in CloudWatch log group.

  • Log entries with accurate timestamps confirm that the logging mechanism is working.

4. Email alert verification:

  • Check email for alert notifications, as shown in Figure 7.
Figure 7: BGP status Down event notification in email.

  • The email alerts specify the BGP outage on the VIF with its ID, and the alert timestamp aligns with the event timing.

Testing VIF status monitoring: When the VIF state changes from down to up

Once BGP neighborship over the VIF is restored, these steps will help you test the system.

  1. Test the Lambda function:
  • Select Test to run the Lambda function, as shown in Figure 8. The Lambda function output indicates that it ran without errors.
Figure 8: Successful execution of the Lambda function.

2. Verify the CloudWatch log stream:

  • Reviewing the log stream confirms consistent logging, including successful BGP restoration on the VIF, as shown in Figure 9. Timestamps within the logs provide precise documentation of each entry. Regular log entries per the EventBridge trigger enable reliable, timely updates, which you can adjust as needed.
Figure 9: BGP UP status logs in CloudWatch log group.

3. Email alert confirmation:

  • Verify email alerts to confirm stakeholders receive timely notifications when BGP is restored on the VIF.
Figure 10: Email notification for BGP Neighborship restored.

  • The received email alert delivers clear and informative content, signaling the successful reestablishment of the BGP neighbor, as shown in Figure 10.

In summary, testing confirms that the setup responds as expected to BGP changes on VIFs. The Lambda code is a reference that you can customize by substituting values such as the log stream, log group, and SNS topic ARN.

Monitor prefixes propagated from DXGW attachment to the transit gateway route table

AWS Direct Connect connects to the transit gateway through transit VIFs, providing connectivity to multiple VPCs. However, there is currently a limit of 100 prefixes each for IPv4 and IPv6 advertised over a transit VIF. If the prefix count exceeds 100, the BGP session goes into an idle state, the routes are removed, and traffic stops using the VIF.

Notably, AWS does not automatically notify you when BGP prefixes surpass the limit. Where production applications demand minimal downtime, it’s essential to send alerts when the number of prefixes advertised over the transit VIF approaches or exceeds a threshold set close to the maximum permitted limit of 100.

To achieve this proactive monitoring, we used AWS Lambda, Amazon SNS, and CloudWatch, empowering customers to stay informed about potential disruptions and take timely, preventive actions.

Implementation steps for prefix monitoring setup

Create an IAM role for the Lambda functions with the following permissions:

To establish an effective system for monitoring prefixes over the AWS Direct Connect transit VIF, follow these steps:

  1. Create a CloudWatch log group and log stream:
  • Open the CloudWatch console and navigate to Log groups.
  • Choose Actions, and then Create log group.
  • Choose the log group you created, then under Log streams choose Create log stream and enter a name for your log stream.
  • The log stream organizes and stores detailed information, including the DXGW and TGW IDs and the count of prefixes learned.

2. Establish an SNS topic and subscription:

  • Sign in to the Amazon SNS console. Create an SNS topic for alert notifications.
  • Choose Standard as the topic type and enter a Name.
  • Create the subscription, selecting Email as the Protocol and entering the email address.

3. Develop a Lambda function for prefix monitoring:

  • Open the Functions page of the Lambda console.
  • Choose Create function.
  • Select Author from scratch.
  • Enter the name of your Lambda function and choose Python 3.12 as the Runtime.
  • Leave architecture set to x86_64 and choose Create function.
  • Use the code provided in the GitHub repository.

Here, the Lambda function uses the SearchTransitGatewayRoutes API call. We use this API to check the total number of routes learned from the Direct Connect gateway attachment and push the count to the designated CloudWatch log stream.

4. Threshold-based alerting with Amazon SNS:

  • We implement logic within the Lambda function to compare the route count against a configured threshold value.
  • If the route count surpasses the threshold, send an SNS notification to alert about the potential issue.
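
The route count and threshold check described in steps 3 and 4 might be sketched as follows. This is not the repository code. It assumes the EC2 client (for example, boto3.client("ec2")) is passed in, and it filters SearchTransitGatewayRoutes by attachment ID and by propagated routes; the function names and message text are hypothetical.

```python
def count_propagated_routes(ec2_client, route_table_id, attachment_id):
    """Count routes that one attachment has propagated into a transit gateway route table."""
    response = ec2_client.search_transit_gateway_routes(
        TransitGatewayRouteTableId=route_table_id,
        Filters=[
            {"Name": "attachment.transit-gateway-attachment-id", "Values": [attachment_id]},
            {"Name": "type", "Values": ["propagated"]},
        ],
        MaxResults=1000,
    )
    # AdditionalRoutesAvailable is True when more routes matched than were returned.
    return len(response["Routes"]), response.get("AdditionalRoutesAvailable", False)


def check_threshold(route_count, threshold):
    """Return an alert message when the count exceeds the threshold, else None."""
    if route_count > threshold:
        return f"Route count {route_count} exceeds threshold {threshold}"
    return None
```

A caller would publish the returned message to the SNS topic when it is not None. With four routes and a threshold of 2, check_threshold returns "Route count 4 exceeds threshold 2".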

5. Optional: CloudWatch alarms for advanced alerting:

  • This step is out of the scope of this post. Customers can set up CloudWatch alarms based on specific conditions related to the route count for more advanced alerting mechanisms. For more details on how to create CloudWatch alarms, please refer to Create a CloudWatch alarm based on a static threshold.

6. Configure a trigger for invoking the Lambda function. In this example, we create an EventBridge trigger:

  • Open the Functions page of the Lambda console and select the function you want to create a trigger for.
  • In the Function overview pane, choose Add trigger.
  • In the Trigger Configuration pane, select EventBridge (CloudWatch Events) as the source from the dropdown.
  • Choose Create a new rule and for Rule name enter the name of your rule.
  • For Rule description you can enter a description for your rule. This choice is optional.
  • For Rule type, select Schedule expression and use the rate expression “rate(1 minute)” as shown in Figure 4 so that the Lambda function runs every minute, providing near-real-time updates on the prefix count. You can use a custom cron expression instead to suit your use case.
  • Select Add trigger.

Validation and conclusion of prefix monitoring setup:

  1. Route count verification:
  • The screenshot in Figure 11 shows that the Direct Connect gateway attachment has propagated four routes to the TGW route table.
Figure 11: TGW route table with the DXGW propagated routes.

2. Threshold configuration and Lambda execution:

  • In the lab setup, we configured the Lambda code with a threshold value of 2, far below the maximum prefix limit of 100, so that the four propagated routes would trigger an alert. An alert through Amazon SNS is triggered if the route count exceeds 2. Customers can adjust the threshold based on their current route count and a safety margin.
  • The Lambda code ran without errors, indicating a successful monitoring process.

3. Verification in CloudWatch log groups:

  • When we check the CloudWatch log groups, we can confirm that we successfully pushed the generated logs to the designated log group, as shown in Figure 12.
Figure 12: CloudWatch log events for advertised routes on Transit gateway.

4. Email alert confirmation:

  • Simultaneously, we received an email alert confirming that the monitoring system is functioning as expected, as shown in Figure 13.
Figure 13: Email Notification for the number of prefixes learned exceeding the threshold.

5. Conclusion of setup:

  • The successful execution of the Lambda code, proper log generation, and the receipt of the expected email alert confirm the operational status of the prefix monitoring setup.
  • The provided Lambda code serves as a reference that you must customize. Update key parameters such as the Transit Gateway route table ID, the Direct Connect gateway attachment ID, the SNS topic ARN, the CloudWatch log group name, and the log stream name. Additionally, set the threshold according to your specific use case.
  • For instances with two transit VIFs attached to the DXGW, you can adjust the threshold value. The current maximum route limit is 200 (100 per VIF per address family).

Note: The method described does not provide visibility into route distribution across VIFs. Customers should configure the threshold based on their VIF usage scenario.
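
One way to derive a threshold from the limit is sketched below. The 80 percent safety margin and the function name are assumptions for illustration only. Whether the combined figure for several transit VIFs applies depends on how prefixes are distributed across the VIFs, which this method cannot see.

```python
PREFIX_LIMIT_PER_VIF = 100  # per transit VIF and address family, as stated in this post


def alert_threshold(transit_vif_count=1, safety_percent=80):
    """Return the route count at which to alert: a percentage of the combined limit."""
    if transit_vif_count < 1:
        raise ValueError("transit_vif_count must be at least 1")
    if not 0 < safety_percent <= 100:
        raise ValueError("safety_percent must be between 1 and 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return PREFIX_LIMIT_PER_VIF * transit_vif_count * safety_percent // 100


print(alert_threshold())                     # one transit VIF: prints 80
print(alert_threshold(transit_vif_count=2))  # two transit VIFs: prints 160
```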

This setup empowers customers to proactively manage and monitor prefix counts, ensuring timely alerts and responses, contributing to the robustness of the AWS Direct Connect environment.

Cleanup

After testing the features, delete the following resources to avoid additional charges, following the steps in our documentation.

  1. Lambda function
  2. CloudWatch log groups
  3. EventBridge trigger
  4. SNS subscription and topic
  5. IAM role

Conclusion

The setups described in this post help you monitor the BGP status of Direct Connect VIFs and track the prefix count advertised over the transit VIF. By using AWS Lambda, Amazon CloudWatch, and Amazon SNS, organizations can quickly identify and address issues, minimizing downtime and optimizing performance. This approach enhances the reliability and resilience of the network architecture, helping you make informed decisions and respond to connectivity needs.

A correction was made on May 30, 2024: An earlier version of this post was missing Figure 8 and had formatting issues that were corrected. 

About the authors

Anant Vaibhav

Anant is a Senior Cloud Engineer with Amazon Web Services, with over 10 years of experience in configuring, troubleshooting, architecting, and helping customers with architectures to meet their needs. He is an accredited Subject Matter Expert in AWS Direct Connect, Transit Gateway, and AWS VPN services. Anant is passionate about cloud architecting and helping customers leverage the power of the AWS cloud.

Arun Kumar

Arun is a Technical Account Manager at AWS and a Subject Matter Expert (SME) in cloud networking, particularly in AWS Direct Connect and Transit Gateway services. He excels at addressing complex technical challenges, guiding customers to build scalable, highly available, secure, resilient, and cost-effective cloud networks.

Pankaj Bhatt

Pankaj is a Cloud Engineer-II at AWS, specializing in networking services, particularly as a Subject Matter Expert (SME) in AWS Direct Connect. With a background in data center technologies, he brings extensive experience and exceptional problem-solving skills, excelling in precise network trace analysis for seamless operations in complex cloud environments.