Debugging tool for network connectivity from Amazon VPC

Resources in AWS rely heavily on their underlying network to deliver a service at optimal performance. For example, your databases could be fine-tuned and your front-end application servers could be running on the most expensive, high-end Amazon EC2 instances available. However, if the underlying network is experiencing an issue, all of these advantages can quickly be negated.

For this reason, it’s crucial to monitor the health and stability of your network connectivity on a continual basis. Factors that affect performance include latency and the percentage of packet loss across a network path, and different applications can respond to each of them differently.

Fortunately, AWS has a built-in tool called AWSSupport-SetupIPMonitoringFromVPC that you can use to monitor some of these metrics. You can even use the tool in conjunction with Amazon CloudWatch alarms to take a specific action if monitoring detects unusual patterns. The following screenshot shows output on a CloudWatch dashboard.

Understanding the outputs of the tool

As you can see from the CloudWatch dashboard, metrics are derived from ping, which is a common tool used in measuring network statistics. This tool is great for gathering information about the health of the general path from point A to Z, but it can’t always assess the health of every possible path in a network.

For this reason, you don’t want to rely on this tool alone as absolute evidence that an issue exists. Instead, it’s best to use it in conjunction with other network tools, data, and relevant factors to determine whether the tool’s readings are false positives or point to a real network issue.

This tool collects metrics not only from ping but also from MTR, TCP traceroute, and tracepath. See this AWS Knowledge Center article for general information on how to interpret the readings of these tools.

The AWSSupport-SetupIPMonitoringFromVPC tool is an AWS Systems Manager document (SSM document) that creates an Amazon EC2 instance, referred to as the Monitor Instance, in the specified subnet. The Monitor Instance monitors the selected target IP addresses by continuously running ping, MTR, TCP traceroute, and tracepath network diagnostic tests. The results are stored in Amazon CloudWatch Logs. A custom metric filter is applied so that the packet loss (%) and latency (ms) statistics can be quickly visualized on the CloudWatch dashboard. Optionally, you can configure CloudWatch threshold alarms that trigger an Amazon Simple Notification Service (SNS) notification to alert you to any packet loss or latency issues.

Prerequisites

Before we start the setup, you need to make sure that the following prerequisites are ready:

- Specify the subnet ID for launching the Monitor Instance in a public or private subnet. If it is a private subnet, ensure that it has internet access so that the Monitor Instance can bootstrap itself. During bootstrap, the instance installs the Amazon CloudWatch Logs agent and communicates with AWS Systems Manager and Amazon CloudWatch.
- A private subnet must have IPv4 access to the internet. An egress-only internet gateway, which handles IPv6 traffic only, does not satisfy this requirement.
- Specify valid and unique IP addresses, comma-separated and without any spaces. The maximum size of the “TargetIPs” string is 255 characters, including commas. In the following example, total characters = 15.

Note: Do not provide an invalid or duplicate IP address. If you enter either, the Systems Manager Automation execution fails and rolls back.
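Because an invalid or duplicate address causes the whole automation to roll back, it can be worth checking the string locally first. The following is a minimal sketch, not part of the tool itself: the function name validate_target_ips is hypothetical, and it assumes that two textual forms of the same address (for example, a compressed and an expanded IPv6 address) should count as duplicates, which may be stricter than the check the automation performs.

```python
import ipaddress


def validate_target_ips(target_ips, max_length=255):
    """Return the list of addresses, or raise ValueError on the first problem found."""
    # The 255-character limit includes the commas.
    if len(target_ips) > max_length:
        raise ValueError(
            f"TargetIPs is {len(target_ips)} characters; the limit is {max_length}"
        )
    if any(ch.isspace() for ch in target_ips):
        raise ValueError("TargetIPs must not contain spaces")

    items = target_ips.split(",")
    seen = set()
    for item in items:
        try:
            # Accepts both IPv4 and IPv6 literals; rejects empty strings.
            address = ipaddress.ip_address(item)
        except ValueError:
            raise ValueError(f"invalid IP address: {item!r}") from None
        # Assumption: compare parsed addresses, not raw strings.
        if address in seen:
            raise ValueError(f"duplicate IP address: {item}")
        seen.add(address)
    return items
```

Calling validate_target_ips before you start the automation surfaces the same classes of mistakes described in the note without launching any resources.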

If the Monitor Instance fails its status checks, stop and then start the instance, and follow this documentation for more information.

How to use AWSSupport-SetupIPMonitoringFromVPC

Open the AWS Systems Manager console at https://console.aws.amazon.com/systems-manager.
Choose Automation.
Choose Execute Automation.
Select AWSSupport-SetupIPMonitoringFromVPC (Owner: Amazon).
Select your targets.
Under Input Parameters, fill in the following parameters:
- Locate SubnetId (Required), and enter the subnet ID for the Monitor Instance.
- Locate TargetIPs (Required), and enter a comma-separated list of the IPv4 and/or IPv6 addresses to monitor. Spaces are not allowed, and the maximum size is 255 characters. Note: If you provide an invalid or duplicate IP address, the automation will fail and roll back the test setup.
- (Optional) Locate CloudWatchLogGroupNamePrefix, and enter a log group prefix. The default is /AWSSupport-SetupIPMonitoringFromVPC.
- (Optional) Locate CloudWatchLogGroupRetentionInDays, and specify the CloudWatch log group retention in days. The default is 7 days.
- (Optional) Locate InstanceType, and specify the instance type to use for the instance that will run the test. The default is t2.micro.
- (Optional) Locate AutomationAssumeRole, and enter the Amazon Resource Name (ARN) of the automation role you want to use. If no role is provided, the document will use the current user permissions.
Choose Execute automation.
Monitor the progress of the execution.
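The console steps above can also be scripted. The following is a rough sketch that assumes you use the AWS SDK for Python (boto3) and its SSM start_automation_execution call; the helper names build_automation_parameters and start_ip_monitoring are hypothetical, and the subnet ID in the usage comment is a placeholder. The parameter names are the ones listed in the steps above.

```python
def build_automation_parameters(subnet_id, target_ips, log_group_prefix=None,
                                retention_days=None, instance_type=None,
                                assume_role_arn=None):
    """Build the Parameters mapping; Systems Manager expects each value as a list of strings."""
    params = {"SubnetId": [subnet_id], "TargetIPs": [target_ips]}
    optional = {
        "CloudWatchLogGroupNamePrefix": log_group_prefix,
        "CloudWatchLogGroupRetentionInDays": retention_days,
        "InstanceType": instance_type,
        "AutomationAssumeRole": assume_role_arn,
    }
    # Omit anything not supplied so that the document defaults apply.
    for name, value in optional.items():
        if value is not None:
            params[name] = [str(value)]
    return params


def start_ip_monitoring(ssm_client, subnet_id, target_ips, **optional):
    """Start the automation and return its execution ID."""
    response = ssm_client.start_automation_execution(
        DocumentName="AWSSupport-SetupIPMonitoringFromVPC",
        Parameters=build_automation_parameters(subnet_id, target_ips, **optional),
    )
    return response["AutomationExecutionId"]


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   execution_id = start_ip_monitoring(
#       boto3.client("ssm"), "subnet-0123456789abcdef0", "8.8.8.8,1.1.1.1")
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before you run it against a real account.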

Outputs: After the successful execution of automation, you can view the following in the Outputs section of the Automation.

createCloudWatchDashboard.Output = The CloudWatch Dashboard URL.
createManagedInstance.InstanceId = The ID of the instance that will run the test.
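If you start the automation programmatically, the same two outputs can be read from the execution record. The sketch below assumes the record is the "AutomationExecution" dictionary returned by the boto3 SSM get_automation_execution call, in which each output value is a list of strings; the function name summarize_outputs is hypothetical.

```python
def summarize_outputs(execution):
    """Pick the dashboard URL and instance ID out of an automation execution record."""
    outputs = execution.get("Outputs", {})

    def first(key):
        # Each output is expected to be a list of strings; tolerate a missing key.
        values = outputs.get(key) or [None]
        return values[0]

    return {
        "status": execution.get("AutomationExecutionStatus"),
        "dashboard_url": first("createCloudWatchDashboard.Output"),
        "instance_id": first("createManagedInstance.InstanceId"),
    }


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   ssm = boto3.client("ssm")
#   record = ssm.get_automation_execution(
#       AutomationExecutionId=execution_id)["AutomationExecution"]
#   print(summarize_outputs(record))
```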

Under the hood

The automation creates the Monitor Instance in the specified subnet, which pushes the test results to CloudWatch Logs every minute. A custom metric filter is applied to the stored results so that the packet loss (%) and average latency (ms) statistics can be quickly visualized on the CloudWatch dashboard.
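To illustrate how the two dashboard statistics can be derived from raw ping output, here is a small sketch. It is not the tool's actual metric filter: it assumes the summary lines printed by the common Linux ping utility ("N% packet loss" and "rtt min/avg/max/mdev = ..."), and the function name parse_ping_summary is hypothetical.

```python
import re

# Assumed format: the summary printed by the Linux ping utility.
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")
RTT_RE = re.compile(r"rtt min/avg/max/\S+ = ([\d.]+)/([\d.]+)/([\d.]+)/")


def parse_ping_summary(text):
    """Extract packet loss (%) and average latency (ms) from ping summary text."""
    loss = LOSS_RE.search(text)
    rtt = RTT_RE.search(text)
    return {
        "packet_loss_percent": float(loss.group(1)) if loss else None,
        # The second field of min/avg/max is the average round-trip time.
        "avg_latency_ms": float(rtt.group(2)) if rtt else None,
    }


sample = (
    "5 packets transmitted, 5 received, 0% packet loss, time 4005ms\n"
    "rtt min/avg/max/mdev = 0.512/0.634/0.790/0.101 ms\n"
)
print(parse_ping_summary(sample))
```

When every packet is lost, ping prints no rtt line, so the sketch reports the latency as None instead of inventing a value.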

The CloudWatch dashboard output is as follows:

The CloudWatch dashboard has buttons to Pause, Resume and Terminate the existing automation tests.

If you choose the Pause button, the Monitor Instance is stopped and no longer pushes test results to CloudWatch Logs.

If you choose the Resume button, as shown in the following diagram, the Monitor Instance returns to the Running state and resumes pushing test results to CloudWatch Logs.

By stopping or starting the Monitor Instance, you can pause and resume the test.
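Since pausing and resuming come down to stopping and starting the instance, the same effect can be sketched outside the dashboard. The example below assumes a boto3 EC2 client and uses its stop_instances and start_instances calls; the function names pause_monitoring and resume_monitoring are hypothetical, and the instance ID is the createManagedInstance.InstanceId output of the automation.

```python
def pause_monitoring(ec2_client, instance_id):
    """Stop the Monitor Instance so that it no longer pushes test results."""
    ec2_client.stop_instances(InstanceIds=[instance_id])


def resume_monitoring(ec2_client, instance_id):
    """Start the Monitor Instance again so that it resumes pushing test results."""
    ec2_client.start_instances(InstanceIds=[instance_id])


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   ec2 = boto3.client("ec2")
#   pause_monitoring(ec2, "i-0123456789abcdef0")   # placeholder instance ID
#   resume_monitoring(ec2, "i-0123456789abcdef0")
```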

If you choose the Terminate button, the AWSSupport-TerminateIPMonitoringFromVPC SSM document is invoked. This terminates the Monitor Instance, deletes the IAM role and instance profile, and deletes the CloudWatch dashboard.

Note: The CloudWatch Logs data is retained in your account for the number of days specified during the SSM Automation execution, and the CloudWatch custom metrics remain available for up to 15 months.

Alarming and alerting with CloudWatch and Amazon SNS

Open the CloudWatch console, choose Alarms, choose Create Alarm, and select the appropriate custom metric for packet loss or latency to define a threshold. For more information on how to create or edit a CloudWatch alarm and an email notification list using Amazon SNS, see the documentation.

In this example, the alarm enters the ALARM state if packet loss is >= 30% for 5 out of 5 data points at a one-minute interval. An SNS email notification is then sent to the address that you provided.
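The same alarm can be expressed as arguments to the boto3 CloudWatch put_metric_alarm call. This is a sketch under assumptions: the function name packet_loss_alarm_settings is hypothetical, the Average statistic is an assumed choice, and the namespace, metric name, and SNS topic ARN must be taken from your own account because this article does not list them.

```python
def packet_loss_alarm_settings(alarm_name, namespace, metric_name, sns_topic_arn,
                               threshold_percent=30.0, datapoints=5,
                               period_seconds=60):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,        # look this up in the CloudWatch console
        "MetricName": metric_name,     # the custom packet loss metric
        "Statistic": "Average",        # assumption: alarm on the per-minute average
        "Period": period_seconds,      # one-minute data points
        "EvaluationPeriods": datapoints,
        "DatapointsToAlarm": datapoints,   # 5 out of 5 must breach
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],   # SNS topic that sends the email
    }


# Usage (assumes boto3 is installed and AWS credentials are configured;
# every value below is a placeholder):
#   import boto3
#   settings = packet_loss_alarm_settings(
#       "packet-loss-30-percent", "<namespace>", "<metric name>", "<SNS topic ARN>")
#   boto3.client("cloudwatch").put_metric_alarm(**settings)
```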

Bringing it all together

Here is an overview of the automated monitoring solution with alarming and alerting configured by the customer:

Conclusion

AWSSupport-SetupIPMonitoringFromVPC is a new SSM Automation document that launches a Monitor Instance in the specified subnet. The Monitor Instance pushes network telemetry data to CloudWatch Logs, and custom metric filters turn that data into packet loss and latency statistics that you can view on the CloudWatch dashboard. You can also configure CloudWatch alarms and SNS notifications for alerting. You can start using this today. If you have any questions or suggestions, please leave a comment for us. Happy network monitoring!
