Monitor hybrid connectivity with Amazon CloudWatch Network Monitor

Today we announce the availability of Amazon CloudWatch Network Monitor, a feature of CloudWatch that makes it easy to gain visibility into your hybrid network connectivity with AWS. CloudWatch Network Monitor currently supports hybrid monitors for networking built with AWS Direct Connect and AWS Site-to-Site VPN. You can find Amazon CloudWatch Network Monitor in the Amazon CloudWatch console.

In this post we describe how CloudWatch Network Monitor (CWNM) can be used to measure the performance of a hybrid network and reduce the MTTR (mean time to recovery).

Hybrid connectivity and grey failures

We define grey failures in our Advanced Multi-AZ Resilience Patterns Whitepaper as scenarios where different entities observe failures differently. Grey failures can be challenging to diagnose and resolve. In the networking domain, examples of grey failures are intermittent packet loss on a particular network link or fluctuating latency. When these occur, one entity (the routing control plane) might not observe any impact due to the degradation, whilst another entity (the customer) does, in the form of reduced throughput and/or increased latencies.

Border Gateway Protocol (BGP), the dynamic routing protocol used in AWS Direct Connect and AWS Site-to-Site VPN, relies on TCP as a transport protocol. TCP can tolerate some level of network degradation through mechanisms such as retransmissions and sliding windows. Additionally, the BGP session and its underlying TCP session are established only between the customer router and the AWS Direct Connect router, so they exercise just that segment of the path. For these reasons, when a network impairment occurs anywhere between a VPC and the on-premises network, BGP may not necessarily exhibit any sign of degradation.

AWS customers that require hybrid connectivity, either via Direct Connect (DX) or Site-to-Site VPN, need to detect grey failures as soon as they happen so that they have the option to steer traffic away from the degraded path. We have learnt from experience that whenever a network issue occurs, network engineers usually spend a considerable amount of time locating it. Rapidly understanding whether a degradation occurred in a portion of the network that you control or in one that is the responsibility of AWS helps reduce the MTTR.

Nomura Research Institute is a Direct Connect partner. For us, it has long been a challenge to detect and locate network gray failures and reduce the time to mitigate their impact. CloudWatch Network Monitor allows us to detect end-to-end performance degradations or outages as soon as they occur. It was easy to set thresholds in CloudWatch alarms and integrate CloudWatch Network Monitor with other AWS services such as Lambda and SNS. By doing this we expect to automatically mitigate the impact of many more network impairments than we’ve ever done before by executing pre-determined actions when a network event is detected.
Kazuki Fujiwara – Senior Associate at Nomura Research Institute

Introducing Amazon CloudWatch Network Monitor

Before Network Monitor, customers with hybrid architectures addressed the problem of network grey failures by implementing active monitoring techniques, either with off-the-shelf solutions or by developing their own tooling.

CloudWatch Network Monitor makes this easier by providing a fully managed and agent-less solution.

A monitor can be used to test ICMP or TCP traffic to IPv4 or IPv6 destinations reachable through Direct Connect or Site-to-Site VPN. Each combination of source subnet, destination IP, packet size and protocol/port is called a probe.

When you create a monitor in your Amazon VPC, AWS creates all the necessary infrastructure in the background to perform RTT (Round-Trip Time) and Packet Loss measurements. You only need to configure your destinations to respond to monitoring traffic on the ports and protocols configured in the probes.

When used with Direct Connect, a monitor provides a measure of the health of the AWS-controlled network path from the VPC where the monitor is deployed to the Direct Connect Location where you terminate your DX connections. This is called the Network Health Indicator (NHI), and it helps speed up the process of identifying whether an issue is located in the AWS network or in the network you manage.

Metrics produced by monitors are published to Amazon CloudWatch, where you can configure dashboards and alarms that send notifications and/or trigger mitigation actions.

Scenario

A typical customer scenario implementing the Direct Connect High Resiliency model is described below. We focus on this design to demonstrate how Network Monitor can be configured and used. Network Monitor can also be used in other scenarios.

Figure 1. Customer with a Direct Connect High Resiliency Setup

The customer has a router in each Direct Connect Location. Each of these routers is connected to an AWS Router through a DX Dedicated Connection.
Two Transit VIFs (T-VIFs) are configured to transport the traffic to and from an AWS Transit Gateway associated with the Direct Connect Gateway (DXGW). The customer router exchanges routing information with the DXGW on both of these T-VIFs via eBGP. This is a common architecture covered in our documentation, blog posts and resources such as the Hybrid Connectivity whitepaper.

The customer has multiple workloads in private subnets, spread across different Availability Zones (AZs) to provide maximum application availability. We choose to display just two for brevity and so that our diagrams are easier to read. These workloads are accessed by clients on the on-premises LAN.
The customer is leveraging Equal Cost Multi Path (ECMP) to load-balance the network flows across both DX connections. Their LAN (172.16.0.0/24 and 2001:DB8::/48) is reachable through both T-VIFs from AWS and, similarly, their VPC is reachable from their LAN via both T-VIFs, as shown in the diagram below.

Figure 2. LAN is reachable from across both Transit VIFs

The workloads in the private subnets are latency sensitive. The customer wants to gain visibility of the network performance of both routes to take action if either becomes impaired. Because the Availability Zones in an AWS Region are designed as independent failure domains, visibility needs to be provided from different subnets in different AZs.

Monitoring architectures

Different user personas are likely to have different monitoring objectives. For example, an application owner might want to ensure that end-to-end latency and packet loss do not impact end user experience, regardless of how the network routes the application traffic. Application owners will configure a monitor in the same subnets as their workload and probe the on-premises LAN. The monitoring traffic from either workload subnet uses every available path to reach the monitor destination on-premises.

Figure 3. Monitoring architecture (application owner)

On the other hand, network engineers might care about each individual network path and will configure the monitors so that the traffic of every probe follows a specific route to its destination.

In this second architecture, shown in Figure 4, network engineers have created probes in each monitoring subnet. Each probe is configured with destinations that are reachable only via the green or the red DX Connection. Similarly, the monitor destinations can reach their corresponding probe on the left side of the diagram only via the green or the red DX connection. This can be achieved with some configuration on the customer routers using – for example – VRF-lite.

Figure 4. Monitoring architecture (network engineer)

At the time of writing, you cannot create probes for both IPv4 and IPv6 destinations in the same monitor. If you want to configure probes to IPv6 destinations, you can create a separate IPv6 monitor.
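
Because a monitor cannot mix address families, it can help to check a destination list before creating monitors. The following is a minimal sketch using Python's standard ipaddress module. The function name split_by_ip_version is our own and is not part of any AWS API, and the IPv6 address is a made-up host inside the 2001:DB8::/48 prefix used in this post.

```python
import ipaddress
from collections import defaultdict


def split_by_ip_version(destinations):
    """Group destination IPs by address family (4 or 6).

    Each group can then be used for its own monitor, since one monitor
    cannot hold both IPv4 and IPv6 destinations.
    """
    groups = defaultdict(list)
    for dest in destinations:
        # ip_address() raises ValueError for anything that is not a valid IP.
        version = ipaddress.ip_address(dest).version
        groups[version].append(dest)
    return dict(groups)


# The IPv4 destinations come from this post; the IPv6 one is hypothetical.
groups = split_by_ip_version(["172.16.100.1", "172.16.200.1", "2001:db8::1"])
print(groups)
```

Each resulting group maps to one monitor: one for the IPv4 destinations and a separate one for the IPv6 destinations.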

Next, we discuss the second architecture and show how this can be configured within the AWS Console. CloudWatch Network Monitor can also be configured using the service API, the AWS SDK, AWS CloudFormation and CDK. We demonstrate how you can create an IPv4-only monitor. Follow the same steps if you want to create an IPv6-only monitor.

Setting up Amazon CloudWatch Network Monitor

To implement the second reference architecture, we create a monitor named monitor-us-east-1-ipv4. We do this by opening the Network Monitor console and choosing Create Monitor.

Figure 5. Create a monitor in the Network Monitor Console

The aggregation period represents how often metrics produced by Network Monitor probes are published to CloudWatch. We keep this at 30 seconds (default) as we rely on the metrics produced by Network Monitor to quickly route around an impaired network path using automated actions.

Figure 6. Create a new monitor

We then select the subnets monitor-subnet-us-east-1a and monitor-subnet-us-east-1b as sources where our probes will be deployed. Finally, we configure 172.16.100.1 and 172.16.200.1 as destinations. The probes created using the console send ICMP echo packets that are 56 bytes in size by default. This is shown in Figure 7.
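
For readers who prefer the API or SDK over the console, the same monitor can be described as a request payload. The sketch below only builds the dictionary and sends nothing to AWS. The field names (monitorName, aggregationPeriod, probes, sourceArn, destination, protocol, packetSize) reflect our understanding of the CreateMonitor request shape and should be verified against the API reference. The account ID and subnet IDs in the ARNs are placeholders, and build_create_monitor_request is a helper name of our own.

```python
from itertools import product

# Hypothetical subnet ARNs: the account ID and subnet IDs are placeholders.
SOURCE_SUBNET_ARNS = [
    "arn:aws:ec2:us-east-1:111122223333:subnet/subnet-aaaa1111",  # monitor-subnet-us-east-1a
    "arn:aws:ec2:us-east-1:111122223333:subnet/subnet-bbbb2222",  # monitor-subnet-us-east-1b
]
DESTINATIONS = ["172.16.100.1", "172.16.200.1"]


def build_create_monitor_request(name, source_arns, destinations,
                                 aggregation_period=30, packet_size=56):
    """Build one ICMP probe per (source subnet, destination) pair."""
    probes = [
        {
            "sourceArn": arn,
            "destination": dest,
            "protocol": "ICMP",
            "packetSize": packet_size,
        }
        for arn, dest in product(source_arns, destinations)
    ]
    return {
        "monitorName": name,
        "aggregationPeriod": aggregation_period,
        "probes": probes,
    }


request = build_create_monitor_request(
    "monitor-us-east-1-ipv4", SOURCE_SUBNET_ARNS, DESTINATIONS
)
print(len(request["probes"]))  # 2 subnets x 2 destinations
```

Two source subnets and two destinations yield four probes, which matches the definition of a probe as one combination of source subnet, destination IP, packet size and protocol.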

Figure 7. Specify monitor source subnets and destination IPs

After we complete the guided process, the monitor and its probes go into a Pending state until all the resources are deployed. The probes then start emitting metrics to CloudWatch at the aggregation interval that we configured earlier.

Network Monitor Dashboard

CloudWatch Network Monitor will create a dashboard for each of our monitors, which we can access by clicking on the monitor name in the Network Monitor dashboard.

On the top-left we can see the status of the AWS Network in the Region where our monitor is deployed. This can be useful to triage observed packet loss or increased latency when the connection traverses multiple networks (e.g. AWS Network, Direct Connect Partner backbone and your on-premises datacenter network).

On the top-right we can see a summary of the probe status, including the number of probes in alarm, the average packet loss and RTT in the time range selected.

In Figure 8 you can see a historical view of the AWS Network Health Indicator metric, along with line charts displaying packet loss and RTT for each of the probes (Figure 9). We can use this information to further troubleshoot network issues by correlating the AWS Network Health Indicator with the observed packet loss and latency.

Figure 8. Monitor summary view

Figure 9. Network Monitor metrics summary view

CloudWatch Metrics

CloudWatch Network Monitor publishes metrics to CloudWatch Metrics under the AWS/NetworkMonitor namespace. We will find three metrics for each of our probes:

HealthIndicator: 0 if AWS Network is healthy at the time of the measurement, and 100 if it is degraded.
PacketLoss: packet loss as a percent value.
RTT: round-trip time in microseconds.
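
The three metrics above can be combined into a simple first-pass triage. The sketch below is only an illustration of that reasoning: the function names, the 1% loss threshold and the returned messages are assumptions of ours, not part of the service.

```python
def rtt_ms(rtt_microseconds):
    """Convert the RTT metric (microseconds) to milliseconds."""
    return rtt_microseconds / 1000.0


def triage(health_indicator, packet_loss_pct, loss_threshold_pct=1.0):
    """First-pass guess at where to look, based on one datapoint per metric.

    health_indicator: 0 when the AWS network is healthy, 100 when degraded.
    packet_loss_pct: packet loss as a percentage.
    """
    if packet_loss_pct <= loss_threshold_pct:
        return "no significant loss"
    if health_indicator > 0:
        return "AWS network degraded: start with the AWS side"
    return "AWS network healthy: start with the network you manage"


print(rtt_ms(12500))
print(triage(health_indicator=0, packet_loss_pct=5.0))
print(triage(health_indicator=100, packet_loss_pct=5.0))
```

This only suggests where to start looking. Loss observed while the HealthIndicator is 0 points towards the Direct Connect partner or on-premises segments, but it does not prove where the fault is.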

Figure 10. Network Monitor CloudWatch metrics (RTT)

Metrics math

Sometimes it makes sense to combine multiple metrics using a mathematical formula. For example, if several probes share a physical link, we could take an average of the packet loss for all of them during an aggregation period (in our case 30 seconds). To do that, select the relevant metrics and use the option “Add math”. Then select the appropriate function (e.g. “AVG” for average) as shown in Figures 11 and 12.
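
To make the effect of averaging across probes concrete, here is a small stand-alone sketch that averages two packet loss series point by point. It illustrates the arithmetic only and is not how CloudWatch implements metric math. The probe names and datapoints are made up.

```python
from statistics import mean

# Made-up PacketLoss datapoints (percent), one value per 30-second period,
# for two hypothetical probes that share the same physical link.
packet_loss = {
    "probe-1a-to-172.16.100.1": [0.0, 2.0, 10.0, 0.0],
    "probe-1b-to-172.16.100.1": [0.0, 4.0, 30.0, 0.0],
}


def average_across_probes(series_by_probe):
    """Average the probes point by point, yielding one combined series."""
    return [mean(points) for points in zip(*series_by_probe.values())]


combined = average_across_probes(packet_loss)
print(combined)
```

The combined series has one value per aggregation period, so a single alarm on it covers every probe that shares the link.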

Figure 11. Adding a CloudWatch metric math

Figure 12. CloudWatch Metrics result of adding math (RTT)

Creating alarms

Once you’ve identified the metrics that make the most sense for your scenario, you can click the bell icon in the Actions section to create alarms that can be acted upon.

The simplest type of alarm uses a static threshold. For example, you can create an alarm to notify you if the average packet loss is higher than 1%. For most applications, packet loss of up to 1% is considered acceptable.
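
The same static-threshold alarm can be expressed as parameters for the CloudWatch PutMetricAlarm API. The sketch below only builds the parameter dictionary and makes no AWS call. The helper name, alarm name, SNS topic ARN, period and dimension values are placeholders of ours. In particular, the dimension names are an assumption: copy the real dimensions from the metric as it appears in the CloudWatch console.

```python
def build_packet_loss_alarm(alarm_name, dimensions, sns_topic_arn,
                            threshold_pct=1.0, period_seconds=60,
                            evaluation_periods=1):
    """Return keyword arguments shaped for CloudWatch PutMetricAlarm."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/NetworkMonitor",
        "MetricName": "PacketLoss",
        "Dimensions": dimensions,
        "Statistic": "Average",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold_pct,
        # Alarm when the average packet loss is strictly above the threshold.
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


alarm = build_packet_loss_alarm(
    alarm_name="dx-green-packet-loss",  # hypothetical alarm name
    # Placeholder dimensions: replace with the ones shown for your probe.
    dimensions=[{"Name": "Monitor", "Value": "monitor-us-east-1-ipv4"}],
    # Placeholder SNS topic ARN.
    sns_topic_arn="arn:aws:sns:us-east-1:111122223333:network-alerts",
)
print(alarm["ComparisonOperator"], alarm["Threshold"])
```

Raising evaluation_periods makes the alarm less sensitive to a single noisy datapoint, at the cost of slower detection.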

Figure 13. Configuring a CloudWatch Alarm for PacketLoss

If you don’t want to set up static thresholds, because you don’t know the plausible metric values beforehand or because you expect some variation during the day (e.g. for round-trip time), you can set up an alarm using CloudWatch anomaly detection. Anomaly detection calculates a band of expected values and alerts when the actual value is above the band, below the band, or outside it in either direction. Please note that in the case of failed probes (packet loss = 100%) RTT will be reported as 0, so it makes sense to configure anomaly detection to detect values lower than expected as well as higher than expected.
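
The sketch below shows why alerting on both sides of the band matters. It uses a deliberately simplified band (mean plus or minus two standard deviations of recent values), which is a stand-in for illustration and not the model that CloudWatch anomaly detection uses. The RTT history is made up.

```python
from statistics import mean, stdev


def expected_band(history, width=2.0):
    """Return (lower, upper) as mean +/- width * standard deviation."""
    centre = mean(history)
    spread = stdev(history)
    return centre - width * spread, centre + width * spread


def classify(value, band):
    """Say whether a value is below, above or within the expected band."""
    lower, upper = band
    if value < lower:
        return "below band"
    if value > upper:
        return "above band"
    return "within band"


# Made-up RTT history in microseconds (about 12 ms with small jitter).
history = [12000, 12100, 11900, 12050, 11950, 12000]
band = expected_band(history)

print(classify(12020, band))  # a normal reading
print(classify(25000, band))  # a latency spike
print(classify(0, band))      # a failed probe, RTT reported as 0
```

An alarm that only watches for values above the band would miss the last case, where the probe has failed entirely.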

Figure 14. Configuring CloudWatch anomaly detection

Automating failover

In addition to sending alerts using Amazon Simple Notification Service (SNS), it is also possible to automate the fail-over of Direct Connect links when an impairment is detected. For this, use a Lambda function invoked by Amazon EventBridge. An example Lambda function can be found at the AWS Samples GitHub repository. Please refer to the blog post AWS Direct Connect Monitoring and Failover with Anomaly Detection for details and best practices.
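
As a rough idea of the shape such a function might take, the sketch below parses a CloudWatch alarm state change event delivered by EventBridge and decides whether to act. It is not the function from the AWS Samples repository. The alarm name and the returned action labels are hypothetical, and the actual routing change is left as a placeholder because it depends on your network design.

```python
def choose_action(event):
    """Map a CloudWatch alarm state change event to a failover decision."""
    detail = event.get("detail", {})
    alarm_name = detail.get("alarmName", "")
    state = detail.get("state", {}).get("value")

    if state == "ALARM":
        return {"alarm": alarm_name, "action": "fail_over"}
    # OK, INSUFFICIENT_DATA or anything unexpected: leave routing untouched.
    return {"alarm": alarm_name, "action": "none"}


def lambda_handler(event, context):
    """Lambda entry point: log and return the decision."""
    decision = choose_action(event)
    # Placeholder: a real function would apply the routing change here.
    print(decision)
    return decision


# A trimmed-down example event; real events carry more fields.
sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {
        "alarmName": "dx-green-packet-loss",  # hypothetical alarm name
        "state": {"value": "ALARM"},
    },
}
result = lambda_handler(sample_event, None)
```

Keeping the decision logic in a separate function such as choose_action makes it straightforward to test with sample events before wiring it to a real alarm.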

It is also best practice to routinely test your fail-over automation so that you know it works as expected before a real failure occurs.

Resources clean-up

After testing Network Monitor, make sure you clean up your test environment by removing all monitors and probes, along with any other resources you created to test this feature, to avoid incurring unexpected charges.

Conclusion

In this post we introduced Amazon CloudWatch Network Monitor, a fully-managed and agentless feature that makes it easier to monitor hybrid networking. Network Monitor allows you to troubleshoot and remediate networking issues faster so that you minimize any impact on your customers’ experience.

To learn more about the pricing of Network Monitor, visit the CloudWatch pricing page. For technical guidance on how to configure and use this feature, please head to the AWS Documentation.
