How can I troubleshoot Direct Connect network performance issues?

I am experiencing low throughput, traffic latency, and performance issues with my AWS Direct Connect connection.

Resolution

To isolate and diagnose network and application performance issues, complete the following steps:

Note: It's a best practice to set up a dedicated test machine on premises and a dedicated test instance in an Amazon Virtual Private Cloud (Amazon VPC). For the instance, use an Amazon Elastic Compute Cloud (Amazon EC2) instance type of size C5 or larger.

Review for network or application issues

Install and use the iPerf3 tool to benchmark network bandwidth, and cross check the results with other applications or tools. For more information, see What is iPerf / iPerf3? on the iPerf website.

  1. Run the following command to install iPerf3:

    Linux/RHEL

    $ sudo yum install iperf3 -y

    Ubuntu

    $ sudo apt install iperf3 -y
  2. To measure throughput in both directions, start iPerf3 in server mode on the EC2 instance, and then run the client tests from the on-premises host:

    Amazon EC2 instance (server)

    $ iperf3 -s -V

    On-premises localhost (client)

    $ iperf3 -c <private IP of EC2> -P 15 -t 15
    $ iperf3 -c <private IP of EC2> -P 15 -t 15 -R
    
    $ iperf3 -c <private IP of EC2> -w 256K
    $ iperf3 -c <private IP of EC2> -w 256K -R
    
    $ iperf3 -c <private IP of EC2> -u -b 1G -t 15
    $ iperf3 -c <private IP of EC2> -u -b 1G -t 15 -R
     
    ----------------
    -P, --parallel n
        number of parallel client streams to run. Running multiple streams is critical to reach the maximum throughput.
    -R, --reverse
        reverse the direction of the test, so that the EC2 server sends data to the on-premises client. This measures AWS -> on-premises throughput.
    -u, --udp
        use UDP rather than TCP. Because the TCP test doesn't report packet loss, UDP tests help to reveal packet loss along the path.

Example TCP test results:

[ ID] Interval          Transfer      Bitrate        Retry
[SUM] 0.00-15.00  sec  7.54 GBytes  4.32 Gbits/sec   18112   sender
[SUM] 0.00-15.00  sec  7.52 GBytes  4.31 Gbits/sec           receiver

The preceding example uses the following terms:

  • Bitrate: the measured throughput or transmission speed.
  • Transfer: the total amount of data exchanged between client and server.
  • Retry: the number of re-transmitted packets. Re-transmission is observed on the sender side.
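
As a rough cross-check of the summary lines, you can recompute Bitrate from Transfer and the interval length. The following sketch assumes that iPerf3 reports GBytes in units of 2^30 bytes and Gbits in units of 10^9 bits. The function name iperf_bitrate_gbps is made up for this illustration and isn't part of iPerf3.

```shell
# Recompute an iPerf3 bitrate (Gbits/sec) from the Transfer column (GBytes)
# and the interval length (seconds).
# Assumption: 1 GByte = 2^30 bytes and 1 Gbit = 10^9 bits.
iperf_bitrate_gbps() {
  awk -v gbytes="$1" -v seconds="$2" \
    'BEGIN { printf "%.2f\n", gbytes * 1073741824 * 8 / seconds / 1e9 }'
}

iperf_bitrate_gbps 7.54 15   # sender line from the example
iperf_bitrate_gbps 7.52 15   # receiver line from the example
```

With the values from the example, the function prints 4.32 and 4.31, which agrees with the reported sender and receiver bitrates.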

Example UDP test results:

[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5] 0.00-15.00  sec  8.22 GBytes   4.71 Gbits/sec  0.000 ms   0/986756 (0%)  sender
[  5] 0.00-15.00  sec  1.73 GBytes   989 Mbits/sec   0.106 ms   779454/986689 (79%)  receiver

Lost is 0% on the sender side because the sender transmits UDP datagrams without tracking whether they arrive. On the receiver side, Lost/Total Datagrams shows the number of lost datagrams and the loss rate. In this example, 79% of the network traffic is lost.
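
The loss rate in the receiver line is the lost count divided by the total count. The following is a minimal sketch of that arithmetic. The function name udp_loss_pct is hypothetical.

```shell
# Compute the UDP loss percentage from the Lost/Total Datagrams column.
udp_loss_pct() {
  awk -v lost="$1" -v total="$2" \
    'BEGIN { printf "%.1f\n", 100 * lost / total }'
}

udp_loss_pct 779454 986689   # receiver line from the example
```

For the receiver line in the example, the function prints 79.0.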

Note: If the Direct Connect connection uses an Amazon Virtual Private Network (Amazon VPN) over a public virtual interface (VIF), then run performance tests without the VPN.

Check the metrics and interface counters

Check Amazon CloudWatch for the following Direct Connect metrics:

  • ConnectionErrorCount: Apply the Sum statistic. Non-zero values indicate MAC-level errors on the AWS device.
  • ConnectionLightLevelTx and ConnectionLightLevelRx: The optical signal readings must be between -14.4 dBm and 2.50 dBm.
  • ConnectionBpsEgress, ConnectionBpsIngress, VirtualInterfaceBpsEgress, and VirtualInterfaceBpsIngress: Make sure that the bitrate hasn't reached the maximum bandwidth.

For more information, see AWS Direct Connect metrics and dimensions.
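
The following sketch shows one way to script these checks. It assumes that the AWS CLI is installed and configured, and that the metrics are in the AWS/DX namespace with the ConnectionId dimension. The function names and the 300-second period are illustrative choices, not values from this article.

```shell
# Sketch: get the Sum of ConnectionErrorCount for one connection.
# Arguments: connection ID, start time, end time (ISO 8601 timestamps).
# Assumption: the AWS CLI is installed and has CloudWatch read access.
dx_error_count() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/DX \
    --metric-name ConnectionErrorCount \
    --dimensions Name=ConnectionId,Value="$1" \
    --start-time "$2" --end-time "$3" \
    --period 300 --statistics Sum
}

# Return success when an optical reading (dBm) is between -14.4 and 2.50.
light_level_ok() {
  awk -v dbm="$1" 'BEGIN { exit !(dbm >= -14.4 && dbm <= 2.50) }'
}

# Print utilization as a percentage of the connection bandwidth.
# Arguments: measured bitrate (bps), connection bandwidth (bps).
utilization_pct() {
  awk -v bps="$1" -v capacity="$2" \
    'BEGIN { printf "%.1f\n", 100 * bps / capacity }'
}
```

For example, light_level_ok -7.2 succeeds and light_level_ok -15.0 fails, and utilization_pct 950000000 1000000000 prints 95.0 for a 1 Gbps connection that carries 950 Mbps.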

If you use a hosted VIF that shares the total bandwidth with other users, then check with the Direct Connect owner about the connection utilization.

Check the router and firewall at the Direct Connect location for the following:

  • CPU, memory, and port utilization, and drops and discards.
  • Interface input and output errors such as CRC, frame, collisions, and carrier. Use show interfaces statistics or a similar command.
  • Worn or dirty optics. If the error counters keep increasing, then clean or replace the fiber patch lead and SFP module.

Check the AWS Health Dashboard to make sure that the Direct Connect connection isn't under maintenance.

Run MTR bidirectionally to check the network path

Use the Linux MTR command to analyze network performance. On Windows, it's a best practice to turn on WSL 2 so that you can install MTR on a Linux subsystem. Alternatively, download WinMTR from the SourceForge website.

  1. Run the following command to install MTR:

    Amazon Linux/RHEL installation

    $ sudo yum install mtr -y

    Ubuntu installation

    $ sudo apt install mtr -y
  2. For the on-premises to AWS direction, run MTR on the localhost (ICMP and TCP based):

    $ mtr -n -c 100 <private IP of EC2> --report
    $ mtr -n -T -P <EC2 instance open TCP port> -c 100 <private IP of EC2> --report
  3. For the AWS to on-premises direction, run MTR on the EC2 instance (ICMP and TCP based):

    $ mtr -n -c 100 <private IP of the local host> --report
    $ mtr -n -T -P <local host open TCP port> -c 100 <private IP of the local host> --report

Example MTR test results:

#ICMP based MTR results
$ mtr -n -c 100 192.168.52.10 --report
Start: Sat Oct 30 20:54:39 2021
HOST:                             Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.0.101.222               0.0%   100    0.7   0.7   0.6   0.9   0.0
  2.|-- ???                       100.0   100    0.0   0.0   0.0   0.0   0.0
  3.|-- 10.110.120.2               0.0%   100  266.5 267.4 266.4 321.0   4.8
  4.|-- 10.110.120.1              54.5%   100  357.6 383.0 353.4 423.7  19.6
  5.|-- 192.168.52.10             47.5%   100  359.4 381.3 352.4 427.9  20.6

#TCP based MTR results
$ mtr -n -T -P 80 -c 100 192.168.52.10 --report
Start: Sat Oct 30 21:03:48 2021
HOST:                             Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.0.101.222               0.0%   100    0.9   0.7   0.7   1.1   0.0
  2.|-- ???                       100.0   100    0.0   0.0   0.0   0.0   0.0
  3.|-- 10.110.120.2               0.0%   100  264.1 265.8 263.9 295.3   3.4
  4.|-- 10.110.120.1               8.0%   100  374.3 905.3 354.4 7428. 1210.6
  5.|-- 192.168.52.10             12.0%   100  400.9 1139. 400.4 7624. 1384.3

Each line in the report represents a hop, which is a network device that the data packet passes through on the way from the source to the destination. For more information on how to read MTR test results, see Reading MTR output network diagnostic tool on the ExaVault website.

The preceding example shows a Direct Connect connection with BGP peers 10.110.120.1 and 10.110.120.2. Packet loss appears at the 4th hop and at the destination (5th hop). This can indicate an issue with the Direct Connect connection or the remote router 10.110.120.1. Because TCP is prioritized over ICMP on the Direct Connect connection, the TCP-based MTR result shows a lower loss percentage.

The following example shows 5% packet loss at the local firewall or NAT device (hop 1). The packet loss affects all of the subsequent hops, including the destination.

$ mtr -n -c 100 192.168.52.10 --report
Start: Sat Oct 30 21:11:22 2021
HOST:                              Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.0.101.222               5.0%   100    0.8   0.7   0.7   1.1   0.0
  2.|-- ???                       100.0   100    0.0   0.0   0.0   0.0   0.0
  3.|-- 10.110.120.2               6.0%   100  265.7 267.1 265.6 307.8   5.1
  4.|-- 10.110.120.1               6.0%   100  265.1 265.2 265.0 265.4   0.0
  5.|-- 192.168.52.10              6.0%   100  266.7 266.6 266.5 267.2   0.0
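
Loss that starts at one hop and continues through to the destination is the pattern to look for in these reports. The following sketch reads an MTR report on standard input and prints the first hop of the run of lossy hops that ends at the destination. It assumes the --report layout shown in the preceding examples and skips hops that don't respond (???). The function name mtr_loss_origin is hypothetical.

```shell
# Print the host where loss that persists to the destination begins,
# or "none" when the destination hop shows no loss.
# Usage: mtr -n -c 100 <private IP of EC2> --report | mtr_loss_origin
mtr_loss_origin() {
  awk '
    # Hop lines look like "  1.|-- 10.0.101.222   5.0%   100 ..."
    $1 ~ /^[0-9]+[.][|]--$/ && $2 != "???" {
      n++
      host[n] = $2
      loss[n] = $3 + 0      # "5.0%" becomes the number 5
    }
    END {
      origin = 0
      # Walk back from the destination while each hop still shows loss.
      for (i = n; i >= 1 && loss[i] > 0; i--) origin = i
      if (origin) print host[origin]; else print "none"
    }
  '
}
```

For the preceding report, the function prints 10.0.101.222, which is the local firewall or NAT device. Loss that appears at an intermediate hop but not at the destination is ignored, because it doesn't affect end-to-end traffic.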

Take a packet capture and analyze the results

Take a packet capture on the localhost and the EC2 instance. Use the tcpdump or Wireshark utility to get network traffic for analysis. The following tcpdump example command writes the capture to a file that's named with the timestamp and hostname:

tcpdump -i <network interface> -s0 -w $(date +"%Y%m%d_%H%M%S").$(hostname -s).pcap port <port>

Use the TCP Throughput Calculator on the Switch website to calculate the network limit, bandwidth-delay product, and TCP buffer size. For more information, see Troubleshooting AWS Direct Connect.
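
To estimate these values locally, you can apply the standard formulas directly. The following sketch computes the bandwidth-delay product and the upper bound on single-stream TCP throughput for a given window size. The function names are hypothetical, the 266 ms round-trip time is taken from the MTR examples for illustration only, and 256K is assumed to mean 262144 bytes.

```shell
# Bandwidth-delay product in bytes: bandwidth (Mbit/s) x RTT (ms) / 8.
bdp_bytes() {
  awk -v mbps="$1" -v rtt_ms="$2" \
    'BEGIN { printf "%.0f\n", mbps * 1e6 * (rtt_ms / 1000) / 8 }'
}

# Upper bound on single-stream TCP throughput (Mbit/s) for a given
# window size (bytes) and RTT (ms): window x 8 / RTT.
max_tcp_mbps() {
  awk -v window="$1" -v rtt_ms="$2" \
    'BEGIN { printf "%.2f\n", window * 8 / (rtt_ms / 1000) / 1e6 }'
}

bdp_bytes 1000 266        # 1 Gbit/s link, 266 ms RTT
max_tcp_mbps 262144 266   # 256 KiB window (as in -w 256K), 266 ms RTT
```

The two calls print 33250000 and 7.88. A 1 Gbit/s path with a 266 ms round-trip time needs about 33 MB in flight to stay full, and a single stream with a 256 KiB window is limited to about 7.88 Mbit/s. This is why parallel streams (-P) or larger TCP buffers matter on high-latency paths.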

Related information

What's the difference between a hosted VIF and a hosted connection?

What is iPerf / iPerf3? (https://iperf.fr/)

AWS OFFICIAL
Updated 4 months ago