How can I troubleshoot Direct Connect network performance issues?

Last updated: 2021-11-30

I am experiencing low throughput, traffic latency, and performance issues with my AWS Direct Connect connection.

Resolution

Follow these instructions to isolate and diagnose network and application performance issues.

Note: It's a best practice to set up a dedicated test machine on premises and a dedicated test instance in an Amazon Virtual Private Cloud (Amazon VPC). Use an Amazon Elastic Compute Cloud (Amazon EC2) instance type of size C5 or larger.

Network or application issue

To determine whether the issue is with the network or with the application, install and use the iPerf3 tool to benchmark network bandwidth, and then cross-check the results with other applications or tools.

1.    Linux/RHEL installation:

$ sudo yum install iperf3 -y

Ubuntu installation:

$ sudo apt install iperf3 -y

2.    Run iPerf3 on the client and server to measure the throughput in both directions, similar to the following:

Amazon EC2 instance (server):

$ iperf3 -s -V

On-premises localhost (client):

$ iperf3 -c <private IP of EC2> -P 15 -t 15
$ iperf3 -c <private IP of EC2> -P 15 -t 15 -R

$ iperf3 -c <private IP of EC2> -w 256K
$ iperf3 -c <private IP of EC2> -w 256K -R

$ iperf3 -c <private IP of EC2> -u -b 1G -t 15
$ iperf3 -c <private IP of EC2> -u -b 1G -t 15 -R 

----------------
-P, --parallel n
    number of parallel client streams to run. Running multiple parallel streams is critical to achieve the maximum throughput.
-R, --reverse
    reverse the direction of a test, so that the EC2 server sends data to the on-premises client. This measures AWS -> on-premises throughput.
-u, --udp
    use UDP rather than TCP. Because TCP iPerf3 tests don't report loss, UDP tests are helpful to see the packet loss along a path.

Example TCP test results:

[ ID] Interval          Transfer      Bitrate        Retry
[SUM] 0.00-15.00  sec  7.54 GBytes  4.32 Gbits/sec   18112   sender
[SUM] 0.00-15.00  sec  7.52 GBytes  4.31 Gbits/sec           receiver

Bitrate—the measured throughput or transmission speed.

Transfer—the total amount of data exchanged between client and server.

Retry—the number of re-transmitted packets. Re-transmission is observed on the sender side.
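As a sanity check on the TCP example, Bitrate is Transfer divided by Interval. The following Python sketch reproduces that arithmetic. It assumes that iPerf3 reports GBytes in binary units (2^30 bytes) and Gbits/sec in decimal units (10^9 bits); the function name is illustrative only.

```python
def bitrate_gbits(transfer_gbytes: float, seconds: float) -> float:
    """Convert an iPerf3 Transfer value (GBytes, 2**30 bytes) into Gbits/sec (10**9 bits)."""
    bits = transfer_gbytes * 2**30 * 8
    return bits / seconds / 1e9


# Values taken from the example TCP test results (15-second interval).
sender = bitrate_gbits(7.54, 15.0)
receiver = bitrate_gbits(7.52, 15.0)
print(f"sender {sender:.2f} Gbits/sec, receiver {receiver:.2f} Gbits/sec")
```

If the computed value and the reported Bitrate disagree noticeably, check that the Interval column covers the whole test and not a single reporting period.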

Example UDP test results:

[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5] 0.00-15.00  sec  8.22 GBytes   4.71 Gbits/sec  0.000 ms   0/986756 (0%)  sender
[  5] 0.00-15.00  sec  1.73 GBytes   989 Mbits/sec   0.106 ms   779454/986689 (79%)  receiver

Loss is 0% on the sender side because the sender counts only the datagrams that it sends, and it sends as many UDP datagrams as it can.

Lost/Total Datagrams on the receiver side indicates how many packets are lost and the loss rate. In this example, 79% of network traffic is lost.
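The relationship between the receiver's Lost/Total Datagrams field and the two bitrates can be checked with a short calculation. This is a minimal Python sketch that uses only the numbers from the example UDP results; the helper name parse_lost_total is hypothetical.

```python
import re
from typing import Tuple


def parse_lost_total(field: str) -> Tuple[int, int]:
    """Split an iPerf3 'Lost/Total Datagrams' field such as '779454/986689 (79%)'."""
    match = re.match(r"\s*(\d+)/(\d+)", field)
    if match is None:
        raise ValueError(f"unexpected field: {field!r}")
    return int(match.group(1)), int(match.group(2))


# Receiver line of the example UDP test results.
lost, total = parse_lost_total("779454/986689 (79%)")
loss_pct = 100.0 * lost / total

# Sender bitrate from the example; the surviving share should be close to
# the receiver bitrate of 989 Mbits/sec.
sender_bps = 4.71e9
delivered_bps = sender_bps * (1 - lost / total)

print(f"loss {loss_pct:.1f}%, delivered about {delivered_bps / 1e6:.0f} Mbits/sec")
```

The delivered rate lands near 989 Mbits/sec, which matches the receiver line of the example.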

Note: If the Direct Connect connection uses an Amazon Virtual Private Network (Amazon VPN) over a public virtual interface (VIF), then run performance tests without the VPN.

Check the metrics and interface counters

Check Amazon CloudWatch for the following Direct Connect metrics:

  • ConnectionErrorCount: Apply the Sum statistic. Non-zero values indicate MAC-level errors on the AWS device.
  • ConnectionLightLevelTx and ConnectionLightLevelRx: Make sure that the optical signal readings are between -14.4 dBm and 2.50 dBm.
  • ConnectionBpsEgress, ConnectionBpsIngress, VirtualInterfaceBpsEgress, and VirtualInterfaceBpsIngress: Be sure that the bitrate hasn't reached the maximum bandwidth.

For more information, see Direct Connect metrics and dimensions.
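The metric checks above can be expressed as a few simple rules. The following Python sketch applies them to a set of readings that you have already retrieved from CloudWatch. The sample values, the 1 Gbps capacity, the 90% utilization threshold, and the function names are all assumptions for illustration; only the metric names and the light level range come from this article.

```python
LIGHT_MIN_DBM = -14.4
LIGHT_MAX_DBM = 2.50


def light_level_ok(dbm: float) -> bool:
    """True when an optical reading is inside the range given in this article."""
    return LIGHT_MIN_DBM <= dbm <= LIGHT_MAX_DBM


def utilization_pct(bps: float, capacity_bps: float) -> float:
    """Share of the connection bandwidth in use, as a percentage."""
    return 100.0 * bps / capacity_bps


def review(sample: dict, capacity_bps: float, busy_pct: float = 90.0) -> list:
    """Return human-readable findings for one set of metric readings."""
    findings = []
    if sample["ConnectionErrorCount"] > 0:
        findings.append("MAC-level errors reported")
    for name in ("ConnectionLightLevelTx", "ConnectionLightLevelRx"):
        if not light_level_ok(sample[name]):
            findings.append(f"{name} out of range: {sample[name]} dBm")
    for name in ("ConnectionBpsEgress", "ConnectionBpsIngress"):
        pct = utilization_pct(sample[name], capacity_bps)
        if pct >= busy_pct:  # busy_pct is an assumed threshold, not an AWS value
            findings.append(f"{name} at {pct:.0f}% of capacity")
    return findings


# Hypothetical readings for a 1 Gbps connection.
sample = {
    "ConnectionErrorCount": 3,
    "ConnectionLightLevelTx": -2.1,
    "ConnectionLightLevelRx": -15.2,
    "ConnectionBpsEgress": 9.6e8,
    "ConnectionBpsIngress": 2.0e8,
}
findings = review(sample, capacity_bps=1e9)
for finding in findings:
    print(finding)
```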

If you're using a Hosted Virtual Interface (Hosted VIF) that shares the total bandwidth with other users, then check with the Direct Connect owner about the connection utilization.

Check the router and firewall at the Direct Connect location for the following:

  • Check CPU, memory, and port utilization, and check for drops and discards.
  • Use "show interfaces statistics" or a similar command to check for interface input and output errors such as CRC, frame, collisions, and carrier.
  • If the error counters increase, clean or replace the fiber patch lead and SFP module.

Check the AWS Personal Health Dashboard to make sure that the Direct Connect connection isn't under maintenance.

Run MTR bidirectionally to check the network path

You can use the Linux MTR command to analyze network performance. For Windows OS, it's a best practice to enable WSL 2 so that you can install MTR on a Linux subsystem. Alternatively, you can download WinMTR from the SourceForge website.

1.    Amazon Linux/RHEL installation:

$ sudo yum install mtr -y

Ubuntu installation:

$ sudo apt install mtr -y

2.    For the on-premises --> AWS direction, run MTR on the localhost (ICMP and TCP based):

$ mtr -n -c 100 <private IP of EC2> --report
$ mtr -n -T -P <EC2 instance open TCP port> -c 100 <private IP of EC2> --report

3.    For the AWS --> on-premises direction, run MTR on the EC2 instance (ICMP and TCP based):

$ mtr -n -c 100 <private IP of the local host> --report
$ mtr -n -T -P <local host open TCP port> -c 100 <private IP of the local host> --report

Example MTR test results:

#ICMP based MTR results
$ mtr -n -c 100 192.168.52.10 --report
Start: Sat Oct 30 20:54:39 2021
HOST:                             Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.0.101.222               0.0%   100    0.7   0.7   0.6   0.9   0.0
  2.|-- ???                       100.0   100    0.0   0.0   0.0   0.0   0.0
  3.|-- 10.110.120.2               0.0%   100  266.5 267.4 266.4 321.0   4.8
  4.|-- 10.110.120.1              54.5%   100  357.6 383.0 353.4 423.7  19.6
  5.|-- 192.168.52.10             47.5%   100  359.4 381.3 352.4 427.9  20.6

#TCP based MTR results
$ mtr -n -T -P 80 -c 100 192.168.52.10 --report
Start: Sat Oct 30 21:03:48 2021
HOST:                             Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.0.101.222               0.0%   100    0.9   0.7   0.7   1.1   0.0
  2.|-- ???                       100.0   100    0.0   0.0   0.0   0.0   0.0
  3.|-- 10.110.120.2               0.0%   100  264.1 265.8 263.9 295.3   3.4
  4.|-- 10.110.120.1               8.0%   100  374.3 905.3 354.4 7428. 1210.6
  5.|-- 192.168.52.10             12.0%   100  400.9 1139. 400.4 7624. 1384.3

Each line represents a hop: a network device that the data packet passes through on its way from the source to the destination. For more information on how to read MTR test results, see the ExaVault website.

The preceding example shows a Direct Connect connection with BGP peers 10.110.120.1 and 10.110.120.2. Packet loss is observed on the fourth hop and on the fifth hop, which is the destination. This can indicate an issue with the Direct Connect connection or the remote router 10.110.120.1. The TCP-based MTR result shows a lower loss percentage because TCP is prioritized over ICMP on the Direct Connect connection.

The following example shows 5% packet loss at the local firewall or NAT device (hop 1). The packet loss impacts all of the subsequent hops, including the destination.

$ mtr -n -c 100 192.168.52.10 --report
Start: Sat Oct 30 21:11:22 2021
HOST:                             Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.0.101.222               5.0%   100    0.8   0.7   0.7   1.1   0.0
  2.|-- ???                       100.0   100    0.0   0.0   0.0   0.0   0.0
  3.|-- 10.110.120.2               6.0%   100  265.7 267.1 265.6 307.8   5.1
  4.|-- 10.110.120.1               6.0%   100  265.1 265.2 265.0 265.4   0.0
  5.|-- 192.168.52.10              6.0%   100  266.7 266.6 266.5 267.2   0.0
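When you read MTR reports like the ones above, the hop to focus on is the first one from which loss continues all the way to the destination. The following Python sketch shows one way to find that hop in the text of a `--report`. The regular expression, the function names, and the 1% loss threshold are assumptions made for this illustration, and hops that don't respond (`???`) are ignored.

```python
import re

# Matches report rows such as "  4.|-- 10.110.120.1   54.5%   100 ..."
HOP_RE = re.compile(r"^\s*(\d+)\.\|--\s+(\S+)\s+([\d.]+)%?\s+(\d+)")


def parse_report(text):
    """Return (hop_number, host, loss_percent) for each hop row in an MTR report."""
    hops = []
    for line in text.splitlines():
        match = HOP_RE.match(line)
        if match:
            hops.append((int(match.group(1)), match.group(2), float(match.group(3))))
    return hops


def first_persistent_loss(hops, threshold=1.0):
    """Return the earliest responding hop from which loss persists to the destination."""
    responding = [hop for hop in hops if hop[1] != "???"]
    start = None
    # Walk back from the destination until a hop without loss is found.
    for hop in reversed(responding):
        if hop[2] < threshold:
            break
        start = hop
    return start


REPORT = """\
HOST:                             Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.0.101.222               5.0%   100    0.8   0.7   0.7   1.1   0.0
  2.|-- ???                       100.0   100    0.0   0.0   0.0   0.0   0.0
  3.|-- 10.110.120.2               6.0%   100  265.7 267.1 265.6 307.8   5.1
  4.|-- 10.110.120.1               6.0%   100  265.1 265.2 265.0 265.4   0.0
  5.|-- 192.168.52.10              6.0%   100  266.7 266.6 266.5 267.2   0.0
"""

hops = parse_report(REPORT)
suspect = first_persistent_loss(hops)
print(suspect)
```

Loss that appears on an intermediate hop but not on later hops is not flagged by this logic, because only loss that carries through to the destination is counted.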

Take a packet capture and analyze the results

Take a packet capture on the local host and the EC2 instance. Use the tcpdump or Wireshark utility to capture network traffic for analysis. The following example tcpdump command writes the capture to a file whose name includes the timestamp and the host name:

$ sudo tcpdump -i <network interface> -s0 -w $(date +"%Y%m%d_%H%M%S").$(hostname -s).pcap port <port>

Use the TCP Throughput Calculator on the Switch website to calculate the network limit, the bandwidth-delay product (BDP), and the TCP buffer size.
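The same quantities can be estimated by hand. The bandwidth-delay product is the link bandwidth multiplied by the round-trip time, and a single TCP stream can't exceed its window size divided by the round-trip time. The following Python sketch works through both formulas. The 1 Gbps link speed is a hypothetical value, the 266 ms round-trip time is taken from the example MTR results, and the 256 KB window assumes that the K suffix means 1024 bytes. The actual buffer that the operating system applies can differ.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return bandwidth_bps * rtt_s / 8


def window_limited_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on single-stream TCP throughput for a given window size."""
    return window_bytes * 8 / rtt_s


rtt_s = 0.266        # about 266 ms, as in the example MTR results
link_bps = 1e9       # hypothetical 1 Gbps connection
window = 256 * 1024  # 256 KB window (assumes K = 1024 bytes)

bdp = bdp_bytes(link_bps, rtt_s)
limit = window_limited_bps(window, rtt_s)

print(f"BDP: {bdp / 1e6:.2f} MB needed in flight to fill the link")
print(f"One stream with a 256 KB window: at most {limit / 1e6:.2f} Mbits/sec")
```

With a long round-trip time, the window limit sits far below the link speed, which is why the parallel streams and larger windows in the iPerf3 tests matter.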

For more information, see Troubleshooting AWS Direct Connect.

