Monitor network throughput of interface VPC endpoints using Amazon CloudWatch

Security, cost and performance are always a top priority for AWS customers when they design their network. AWS PrivateLink is becoming increasingly popular because it provides secured private connectivity between Amazon Virtual Private Cloud (Amazon VPC), AWS services and your on-premises networks, without exposing your traffic to the public internet.

In this blog post, we show you how to monitor near-real-time network throughput of interface VPC endpoints (interface endpoint) using Amazon CloudWatch custom metrics and alarms. This solution is highly recommended for a shared VPC that is configured with multiple interface VPC endpoints and handles traffic across many AWS accounts.

VPC endpoint types

VPC endpoints are virtual devices. They are horizontally scaled, redundant and highly available VPC components. They allow communication between instances in your VPC and services without imposing availability risks. You create the type of VPC endpoint that is required by the supported service.

Interface endpoints

An interface endpoint is an elastic network interface with a private IP address from the IP address range of your subnet. It serves as an entry point for traffic destined to a supported AWS service or a VPC endpoint service. Interface endpoints are powered by AWS PrivateLink.

Gateway Load Balancer endpoints

A Gateway Load Balancer endpoint is an elastic network interface with a private IP address from the IP address range of your subnet. This type of endpoint serves as an entry point to intercept traffic and route it to a service that you configured using Gateway Load Balancers for security inspection. You specify a Gateway Load Balancer endpoint as a target for a route in a route table. These endpoints are supported for endpoint services that are configured for Gateway Load Balancers only.

Gateway endpoints

A gateway endpoint is a gateway that you specify as a target for a route in your route table, for traffic destined to Amazon S3 or Amazon DynamoDB.

Design recommendations

To make your environment more secure, AWS recommends that you use AWS PrivateLink because it keeps all communication private between the on-premises environment, AWS services and Amazon VPCs. We also recommend that you use a hub and spoke design where all the spoke VPCs use an interface VPC endpoint provisioned inside the hub (shared services) VPC, typically connected through AWS Transit Gateway. This architecture simplifies your network design and helps reduce the cost and maintenance of multiple interface VPC endpoints across different VPCs. To set up this architecture, check the Centralize access using VPC interface endpoints to access AWS services across multiple VPCs blog post.

If you choose to centralize all VPC endpoints, you have multiple applications across many VPCs sharing the same VPC endpoint. Over time, as traffic increases, this design may lead to network bandwidth issues. By default, each interface endpoint supports a bandwidth of up to 10 Gbps per Availability Zone and bursts of up to 40 Gbps. For this reason, it is critical to monitor endpoints to make sure they do not exceed their throughput limit. Monitoring helps you distribute traffic across endpoints and add new endpoints to meet the demand.

Solution overview

In the solution described in this post, an AWS Lambda function is used to generate CloudWatch metrics and alarms for network throughput. This is done by analyzing interface VPC endpoint log streams, which are published to a CloudWatch log group from VPC Flow Logs.

The solution supports:

Dynamic discovery and integration of new VPC interface endpoints.
Flexibility to update the timeframe for CloudWatch metrics collection.
Flexibility to set the threshold monitoring level for alarm conditions.
Ability to export metrics for visibility into CloudWatch dashboards.

Architecture

The solution uses VPC Flow Logs, configured to publish to Amazon CloudWatch Logs, as its input data source. Every interface VPC endpoint has a unique log stream in the CloudWatch log group. An Amazon EventBridge rule triggers the Lambda function at an interval specified by the user. The function processes the log stream data to generate data points for a custom CloudWatch metric and to create alarm definitions for each interface endpoint. These metrics can be visualized in a CloudWatch dashboard. The CloudWatch alarms are configured with an Amazon Simple Notification Service (Amazon SNS) topic, which can deliver notifications to targets such as email or ServiceNow.

Figure 1 shows the solution workflow.

Figure 1: Solution workflow

Here is the workflow:

A VPC flow log captures traffic flowing through the interface endpoints.
Flow log records are published to a CloudWatch log group. Data for each interface endpoint is captured to a corresponding log stream.
An EventBridge (formerly CloudWatch Events) rule triggers the Lambda function on a schedule. The function processes the active CloudWatch log streams to compute network utilization.
The Lambda function generates and updates custom metrics with the latest data and alarm definitions with user-defined thresholds.
A CloudWatch alarm sends an SNS notification when throughput breaches the threshold set in its alarm definition.
You can visualize CloudWatch metrics in a CloudWatch dashboard. The metrics can also be visualized in third-party observability tools like Grafana, Splunk, etc.
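
The metric and alarm steps of this workflow can be sketched in Python, the language of the solution's Lambda function. This is a minimal illustration and not the code from the repository: the helper names build_metric_request and build_alarm_request are hypothetical, and the Sum statistic over a single 60-second period is an assumption. The dictionaries use the parameter names of the CloudWatch PutMetricData and PutMetricAlarm APIs, so they could be passed to a boto3 CloudWatch client as put_metric_data(**request) and put_metric_alarm(**request).

```python
from datetime import datetime, timezone


def build_metric_request(namespace, metric_name, total_bytes, timestamp):
    """Return keyword arguments for CloudWatch PutMetricData (one data point)."""
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": metric_name,
                "Timestamp": timestamp,
                "Value": float(total_bytes),
                "Unit": "Bytes",
            }
        ],
    }


def build_alarm_request(namespace, metric_name, alarm_name, threshold_bytes, sns_topic_arn):
    """Return keyword arguments for CloudWatch PutMetricAlarm.

    Assumption: the alarm fires when the bytes summed over one minute
    exceed the threshold. The deployed solution may evaluate differently.
    """
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_bytes),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    name = "vpce-0001234231212312d-eni-04123123123321312"
    topic = "arn:aws:sns:sa-east-1:123456789012:example-topic"
    now = datetime.now(timezone.utc)
    print(build_metric_request("vpcendpoint", name, 1_000_000, now))
    print(build_alarm_request("vpcendpoint", name, name + "-Critical", 76504104960, topic))
```

Building the requests as plain dictionaries keeps the logic testable without AWS credentials. Only the final client call needs network access.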

Naming convention

Because an interface endpoint can have many underlying elastic network interfaces (ENIs) spread across Availability Zones, we use this naming convention to make each network interface easy to identify:

We combine the VPC endpoint ID and the ENI ID to create the CloudWatch metric name, in the form VPCEID-ENIID.
For example, if the VPC endpoint ID is vpce-0001234231212312d and the ENI ID is eni-04123123123321312, then the metric name is vpce-0001234231212312d-eni-04123123123321312.
The solution creates two CloudWatch alarms for notification. We combine the VPC endpoint ID and the ENI ID to create the CloudWatch alarm name, and we add the suffix -Critical for critical alarms.
For example, if the VPC endpoint ID is vpce-0001234231212312d and the ENI ID is eni-04123123123321312, then the initial baseline alarm name is vpce-0001234231212312d-eni-04123123123321312 and the critical alarm name is vpce-0001234231212312d-eni-04123123123321312-Critical.
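
The naming convention above amounts to simple string concatenation. The following sketch reproduces the example names. The function names metric_name and alarm_names are hypothetical and are not taken from the solution's source code.

```python
def metric_name(endpoint_id: str, eni_id: str) -> str:
    """Join the first identifier (vpce-... in the example) and the ENI ID."""
    return f"{endpoint_id}-{eni_id}"


def alarm_names(endpoint_id: str, eni_id: str) -> tuple:
    """Return (baseline alarm name, critical alarm name) for one network interface."""
    base = metric_name(endpoint_id, eni_id)
    return base, f"{base}-Critical"


if __name__ == "__main__":
    baseline, critical = alarm_names("vpce-0001234231212312d", "eni-04123123123321312")
    print(baseline)
    print(critical)
```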

Setting the monitoring threshold

By default, each interface endpoint supports a bandwidth of up to 10 Gbps per Availability Zone and bursts of up to 40 Gbps. In this solution, we calculate CloudWatch metrics in bytes/min, so we convert the default unit, Gbps, to bytes/min.

Soft limit threshold

10 Gbps (gigabits/sec) = 1342177280 bytes/sec = 80530636800 bytes/min [75 GiB/min], counting 1 gigabit as 2^30 bits

The first alarm is triggered when throughput breaches 70% of the soft limit threshold. It is calculated as follows:
- Alert baseline: 70% of 10 Gbps
- 56371445760 bytes/min [52.5 GiB/min]
The critical alarm is triggered when throughput breaches 95% of the soft limit threshold. It is calculated as follows:
- Critical alert: 95% of 10 Gbps
- 76504104960 bytes/min [71.25 GiB/min]

You can modify the thresholds as appropriate for your use case, either by passing them as input parameters in the deployment template or by updating the Lambda environment variables alarm_threshholdbytes and alarm_critical_threshholdbytes.
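
If you want to derive thresholds for a different bandwidth or percentage, the arithmetic can be reproduced with a few lines of Python. This sketch assumes the same convention as the figures in this section, in which 1 gigabit is counted as 2^30 bits. The function names are hypothetical.

```python
BITS_PER_GIGABIT = 2 ** 30  # binary convention that matches the figures in this post


def gbps_to_bytes_per_min(gbps: int) -> int:
    """Convert a bandwidth in Gbps to bytes per minute."""
    bytes_per_sec = gbps * BITS_PER_GIGABIT // 8
    return bytes_per_sec * 60


def threshold_bytes_per_min(gbps: int, percent: int) -> int:
    """Return the given percentage of the bandwidth, in bytes per minute."""
    return gbps_to_bytes_per_min(gbps) * percent // 100


if __name__ == "__main__":
    print(gbps_to_bytes_per_min(10))        # soft limit
    print(threshold_bytes_per_min(10, 70))  # baseline alarm threshold
    print(threshold_bytes_per_min(10, 95))  # critical alarm threshold
```

For the default 10 Gbps limit, these calls return 80530636800, 56371445760 and 76504104960, which match the defaults used by the deployment template.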

Prerequisites

To deploy the solution, you need the following:

An AWS account
Amazon VPC with interface endpoints configured
AWS Identity and Access Management (IAM) role with the correct permissions
Terraform installed and configured, because you deploy the solution into your AWS account by applying a Terraform template
AWS Command Line Interface v2.1.x
Git client v2.x
An SNS topic to receive alarm notifications. For instructions, see Creating an SNS topic.

Deploy the solution using Terraform

The Terraform template has the following input parameters, which you can modify as appropriate for your use case.

Parameter	Variable	Default	Description
AWS Region	aws_region	sa-east-1	The AWS Region to be used for deployment.
Amazon VPC Id	vpc_id		The ID of the VPC to be monitored.
Alarm Critical Threshold (Bytes)	alarm_critical_threshholdbytes	76504104960	The monitoring threshold, in bytes per minute, for critical alarms.
Alarm Threshold (Bytes)	alarm_threshholdbytes	56371445760	The monitoring threshold, in bytes per minute, for initial alarms.
CloudWatch Log Group	cloudwatch_loggroup	vpcendpointloggroup	The name of the CloudWatch log group that will capture flow log data.
CloudWatch Metric NameSpace	name_space	vpcendpoint	The CloudWatch metric namespace that will collect metrics for all endpoint interfaces.
SNS Topic ARN for Alarm notification	sns_topic_arn		The ARN of the SNS topic configured for the CloudWatch alarm.
Log Processing Interval (Min)	timerange_min	1	The duration, in minutes, the Lambda function will use to capture log data from the CloudWatch log group.
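
To illustrate how a value such as timerange_min might be used, the following sketch converts the interval into a query window and a schedule expression. It rests on assumptions rather than on the solution's source code: that the Lambda function reads the most recent timerange_min minutes of log events, and that the EventBridge schedule uses the same interval. The function names are hypothetical. The window is expressed in milliseconds since the epoch because that is the unit of the startTime and endTime parameters of the CloudWatch Logs FilterLogEvents API, and EventBridge rate expressions use the singular unit when the value is 1.

```python
from datetime import datetime, timedelta, timezone


def log_window_ms(now: datetime, timerange_min: int) -> tuple:
    """Return (start, end) of the last `timerange_min` minutes in epoch milliseconds."""
    end_ms = int(now.timestamp() * 1000)
    start_ms = int((now - timedelta(minutes=timerange_min)).timestamp() * 1000)
    return start_ms, end_ms


def rate_expression(minutes: int) -> str:
    """Build an EventBridge rate expression, for example rate(1 minute)."""
    unit = "minute" if minutes == 1 else "minutes"
    return f"rate({minutes} {unit})"


if __name__ == "__main__":
    start, end = log_window_ms(datetime.now(timezone.utc), 1)
    print(start, end, rate_expression(1))
```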

Deploying the Solution

Using AWS CLI to deploy the template

Run the following commands in a terminal where your AWS CLI credentials are configured:

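$ git clone https://github.com/aws-samples/aws-privatelink-interface-endpoint-monitoring
# Change into the cloned repository (the directory that contains variables.tf)
$ cd aws-privatelink-interface-endpoint-monitoring
$ terraform init
# Modify the input parameters in variables.tf for your environment, using the table above as a reference
$ terraform plan
$ terraform apply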

It takes approximately 10 minutes to deploy the solution. After the template is deployed successfully, the following resources are created:

An IAM role, vpcflowlogcwrole, that allows VPC Flow Logs to be written to CloudWatch Logs.
A VPC flow log where records are stored in this format:

${interface-id} ${bytes} ${subnet-id} ${vpc-id} ${account-id}

interface-id	The ID of the network interface for which the traffic is recorded.
bytes	The number of bytes transferred during the flow.
subnet-id	The ID of the subnet that contains the network interface for which the traffic is recorded.
vpc-id	The ID of the VPC that contains the network interface for which the traffic is recorded.
account-id	The AWS account ID of the owner of the source network interface for which traffic is recorded.

A CloudWatch log group, which captures flow log data in an individual log stream for each interface VPC endpoint.
An EventBridge rule that triggers the Lambda function at scheduled intervals.
A Lambda function written in Python with environment variables as per user-defined parameters.
CloudWatch metrics, which are created when the Lambda function runs.
Two CloudWatch alarm definitions for each endpoint network interface, named after that interface. The name of the critical alarm carries the suffix -Critical.
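
To show how flow log records in the custom format listed above can be turned into a throughput figure, here is a minimal Python sketch that sums the bytes field per network interface. It is an illustration under stated assumptions and not the solution's Lambda code: the function name bytes_per_interface is hypothetical, and the sample records are made up. Records whose bytes field is not a number are skipped, because flow logs write a hyphen (-) for fields that have no data.

```python
from collections import defaultdict


def bytes_per_interface(records):
    """Sum the bytes field per interface ID.

    Each record is expected in the format
    "<interface-id> <bytes> <subnet-id> <vpc-id> <account-id>".
    """
    totals = defaultdict(int)
    for line in records:
        fields = line.split()
        if len(fields) != 5:
            continue  # not a record in the expected five-field format
        interface_id, byte_count = fields[0], fields[1]
        if not byte_count.isdigit():
            continue  # for example "-" when no data was recorded
        totals[interface_id] += int(byte_count)
    return dict(totals)


if __name__ == "__main__":
    # Made-up sample records for illustration only.
    sample = [
        "eni-04123123123321312 1500 subnet-0aaa vpc-0bbb 123456789012",
        "eni-04123123123321312 2500 subnet-0aaa vpc-0bbb 123456789012",
        "eni-0999 - subnet-0aaa vpc-0bbb 123456789012",
    ]
    print(bytes_per_interface(sample))
```

If the records cover a one-minute window, each total is directly comparable with the bytes/min thresholds described earlier.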

Viewing and visualizing metrics

After the solution starts to process event data, you can view the metrics and alarm definitions for the interface endpoints.

Now that your data is available in CloudWatch Metrics, you can create an interactive Amazon CloudWatch dashboard by following the instructions in Creating a CloudWatch dashboard.

The CloudWatch dashboard displays the VPCEndpointMonitor metric.

Figure 2: CloudWatch dashboard

For every interface endpoint, two alarm definitions are created and configured per user-defined threshold limits.

The Alarms page in the CloudWatch console displays alarms for the VpcEndpointMonitor metric.

Figure 3: Alarms page in the CloudWatch console

Cleanup

To avoid any additional charges after you test the solution, run the following command to delete the resources:

$ terraform destroy

Conclusion

In this post, we shared a solution for monitoring the near-real-time network throughput of interface VPC endpoints using Amazon CloudWatch custom metrics and alarms. For more information, see the Amazon CloudWatch Logs User Guide and Interface VPC endpoints (AWS PrivateLink).

AWS Cloud Operations Blog