AWS Partner Network (APN) Blog

Best Practices from Fortinet for Troubleshooting Common Issues with Gateway Load Balancer

Sri Ramachandran | Principal Cloud Solutions Architect | Fortinet.
James Wenzel | Principal Solutions Architect | AWS
Anthony Smith | Sr. Partner Solutions Architect | AWS


Many customers have now deployed Fortinet’s FortiGate NGFW integrated with AWS Gateway Load Balancer (GWLB) for advanced security inspection and business policy enforcement on North-South and East-West traffic flows, for both public and private workloads. GWLB is a service that AWS introduced in November 2020 to enable seamless insertion of third-party appliances, such as next-generation firewalls, into the traffic flow. The blog post “Integrate your custom logic or appliance with AWS Gateway Load Balancer” gives a good explanation of this service and its architecture. In this post, we cover common issues encountered in field deployments and ways to troubleshoot them and identify the root cause.

Although the GWLB architecture with its endpoints is straightforward, several challenges can arise in the field that lead to flows not working as expected.

Fortinet is an AWS Specialization Partner and AWS Marketplace Seller with Competencies in Networking, Security, and Small and Medium Business Software. Fortinet provides enterprise-grade security for workloads on AWS, including next-gen and web firewall and intrusion prevention.

Troubleshooting common issues with Gateway Load Balancer

Here are some of the common issues and potential ways to identify the root cause:

#1 – Unable to get health checks working between GWLB and third-party appliances

Gateway Load Balancer performs health checks of its registered targets and only sends traffic to targets it deems healthy. If the health check to a target fails, GWLB will not send traffic to that target.

GWLB performs health checks on a target using a given protocol on a given port. This can be seen under Health check settings in the Target group associated with GWLB. Other settings include the periodicity of the health check, thresholds, and timeouts.

Figure 1 – Health check status for GWLB.
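The same health information can be read programmatically. The sketch below is a minimal example that assumes a response dictionary shaped like the output of `aws elbv2 describe-target-health`; the instance IDs and the sample response are hypothetical, and the function only summarizes data it is given rather than calling AWS.

```python
from typing import Dict, List


def unhealthy_targets(response: Dict) -> List[str]:
    """Return one summary line per target that is not in the 'healthy' state."""
    lines = []
    for desc in response.get("TargetHealthDescriptions", []):
        target = desc.get("Target", {})
        health = desc.get("TargetHealth", {})
        state = health.get("State", "unknown")
        if state != "healthy":
            reason = health.get("Reason", "no reason reported")
            lines.append(
                f"{target.get('Id', '?')}:{target.get('Port', '?')} {state} ({reason})"
            )
    return lines


# Hypothetical sample response; the IDs are placeholders, not real instances.
SAMPLE = {
    "TargetHealthDescriptions": [
        {"Target": {"Id": "i-0aaaaaaaaaaaaaaaa", "Port": 6081},
         "TargetHealth": {"State": "healthy"}},
        {"Target": {"Id": "i-0bbbbbbbbbbbbbbbb", "Port": 6081},
         "TargetHealth": {"State": "unhealthy",
                          "Reason": "Target.FailedHealthChecks"}},
    ]
}

if __name__ == "__main__":
    for line in unhealthy_targets(SAMPLE):
        print(line)
```

Any target listed by this helper receives no traffic from GWLB until its health checks pass again.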

Use packet capture commands on the appliance to ensure that the appliance is receiving these probes and responding to them. The probes will arrive using a source IP address of the GWLB ENI as listed with the prefix of ‘ELB gwy/<GWLB name>’ in the description under Network Interfaces.

In the sample output below, 10.0.30.92 is the IP address of the GWLB ENI sending TCP probes to the primary ENI IP address of the third-party appliance on port 541. The third-party appliance in this case is Fortinet’s FortiGate, with 10.0.10.10 as the interface IP address, and the target group health check was set to TCP on port 541. The command is FortiGate’s built-in packet sniffer. In this example, it captures packets on any interface that match the filter of TCP on port 541, at verbosity level 4, which prints packet headers along with the interface name.

diag sniffer packet any 'tcp and port 541' 4

1.337859 port1 in 10.0.30.92.5562 -> 10.0.10.10.541: syn 1579433989
1.337903 port1 out 10.0.10.10.541 -> 10.0.30.92.5562: syn 1657832195 ack 1579433990
1.338238 port1 in 10.0.30.92.5562 -> 10.0.10.10.541: ack 1657832196
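When the capture is long, it can help to check mechanically that a probe completed the three-way handshake. The following is a minimal sketch that assumes sniffer lines in exactly the format shown above (timestamp, interface, direction, source, destination, flags); the function name `probe_answered` is hypothetical, and other verbosity levels or firmware versions may print a different layout.

```python
import re
from typing import Iterable

# Matches lines such as:
# 1.337859 port1 in 10.0.30.92.5562 -> 10.0.10.10.541: syn 1579433989
LINE_RE = re.compile(
    r"^(?P<ts>\d+\.\d+)\s+(?P<iface>\S+)\s+(?P<dir>in|out)\s+"
    r"(?P<src>\d+(?:\.\d+){3})\.(?P<sport>\d+)\s+->\s+"
    r"(?P<dst>\d+(?:\.\d+){3})\.(?P<dport>\d+):\s+(?P<flags>.+)$"
)


def probe_answered(lines: Iterable[str], appliance_ip: str, port: int) -> bool:
    """Return True if some peer completed SYN, SYN-ACK, ACK with the appliance."""
    stage = {}  # (peer_ip, peer_port) -> number of handshake steps seen
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        words = m["flags"].split()
        is_syn, is_ack = "syn" in words, "ack" in words
        src = (m["src"], int(m["sport"]))
        dst = (m["dst"], int(m["dport"]))
        if m["dir"] == "in" and dst == (appliance_ip, port):
            if is_syn and not is_ack:
                stage[src] = 1                      # probe SYN received
            elif is_ack and not is_syn and stage.get(src) == 2:
                return True                         # final ACK received
        elif m["dir"] == "out" and src == (appliance_ip, port):
            if is_syn and is_ack and stage.get(dst) == 1:
                stage[dst] = 2                      # appliance sent SYN-ACK
    return False


CAPTURE = [
    "1.337859 port1 in 10.0.30.92.5562 -> 10.0.10.10.541: syn 1579433989",
    "1.337903 port1 out 10.0.10.10.541 -> 10.0.30.92.5562: syn 1657832195 ack 1579433990",
    "1.338238 port1 in 10.0.30.92.5562 -> 10.0.10.10.541: ack 1657832196",
]

if __name__ == "__main__":
    print(probe_answered(CAPTURE, "10.0.10.10", 541))
```

If only the inbound SYN appears in the capture, the appliance is receiving probes but not answering them, which points at local policy or RPF rather than at the network path.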

A common misconception is that GENEVE is used for GWLB health checks. GENEVE is a data plane protocol and is not used in the control plane for health checks.

If the health check fails, possible causes include reverse path forwarding (RPF) check failures on the firewall, firewall policies, NACLs or security groups denying the probe traffic, or the health check port not being open on the firewall.

#2 – Route tables seem to be configured correctly, but packets are not seen on third-party appliances

Route CIDRs and next hop are stitched hop by hop. Troubleshooting this requires a methodical hop-by-hop validation to ensure that at each hop the routing table for the subnet or the Transit Gateway attachment has the correct next hop for the given CIDR.
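The per-hop check can be reasoned about as a longest-prefix-match lookup. The sketch below is an illustration only: the route tables are written out by hand as dictionaries, and the endpoint ID `vpce-0example` is a hypothetical placeholder. It does not query AWS.

```python
import ipaddress
from typing import Dict, Optional


def next_hop(route_table: Dict[str, str], destination: str) -> Optional[str]:
    """Return the target of the most specific route that covers destination."""
    dest = ipaddress.ip_address(destination)
    best = None
    for cidr, target in route_table.items():
        net = ipaddress.ip_network(cidr)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, target)
    return best[1] if best else None


# Hypothetical route tables for two hops of an ingress/egress path.
IGW_EDGE_TABLE = {"10.1.0.0/16": "local", "10.1.10.0/24": "vpce-0example"}
WORKLOAD_SUBNET_TABLE = {"10.1.0.0/16": "local", "0.0.0.0/0": "vpce-0example"}

if __name__ == "__main__":
    # Inbound: the internet gateway edge table should hand the workload CIDR to the GWLBe.
    print(next_hop(IGW_EDGE_TABLE, "10.1.10.100"))
    # Outbound: the workload subnet default route should also point at the GWLBe.
    print(next_hop(WORKLOAD_SUBNET_TABLE, "203.0.113.10"))
```

Running the lookup for the same flow in both directions at every hop makes a missing or less specific route easy to spot.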

You can use the AWS Reachability Analyzer for troubleshooting from the internet gateway (IGW) to GWLB.

GWLBe to GWLB:

  • Ensure the next hop at the point of ingress is the VPC endpoint ID of the GWLBe in that AWS Availability Zone (AZ).
  • As shown in Figure 2, ensure the GWLB is listed under VPC endpoint services with the status of Available and the AZs it serves. Also ensure every GWLBe is listed under VPC endpoints with an endpoint type of GWLB and a status of Available.

Figure 2 – Checking status of GWLB and GWLBe.
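The endpoint status check can also be scripted. This is a minimal sketch that assumes a dictionary shaped like the output of `aws ec2 describe-vpc-endpoints`; the endpoint IDs in the sample are hypothetical, and the state is compared case-insensitively as a precaution.

```python
from typing import Dict, List, Tuple


def endpoints_not_ready(response: Dict) -> List[Tuple[str, str]]:
    """List (endpoint id, state) for GWLB endpoints that are not available."""
    problems = []
    for ep in response.get("VpcEndpoints", []):
        if ep.get("VpcEndpointType") != "GatewayLoadBalancer":
            continue  # ignore interface and gateway endpoints
        state = str(ep.get("State", "")).lower()
        if state != "available":
            problems.append((ep.get("VpcEndpointId", "?"), state))
    return problems


# Hypothetical sample; IDs are placeholders.
SAMPLE_ENDPOINTS = {
    "VpcEndpoints": [
        {"VpcEndpointId": "vpce-0aaa", "VpcEndpointType": "GatewayLoadBalancer",
         "State": "available"},
        {"VpcEndpointId": "vpce-0bbb", "VpcEndpointType": "GatewayLoadBalancer",
         "State": "pendingAcceptance"},
        {"VpcEndpointId": "vpce-0ccc", "VpcEndpointType": "Interface",
         "State": "pending"},
    ]
}

if __name__ == "__main__":
    print(endpoints_not_ready(SAMPLE_ENDPOINTS))
```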

GWLB to third-party appliance and back to GWLB:

  • As shown in Figure 3, ensure the GWLB in consideration is associated with the correct target group.

Figure 3 – Checking configuration of the GWLB’s association with the target group.

  • Ensure all appliances are registered with GWLB and the targets are reported as healthy through health checks in the target group.

Figure 4 – Checking status of the GWLB targets.

  • Ensure that GENEVE tunnels on the third-party appliance are set up to the IP addresses of the GWLB ENI. That is, the IP addresses for the ENI whose description in Network Interfaces is ‘ELB gwy/<GWLB name>’, as shown in Figure 5.

Figure 5 – GENEVE tunnel configuration on third-party appliance referencing GWLB ENI.

  • Ensure that third-party appliance policies and routes are set up correctly to receive the traffic, process it, and return it to the GWLB. Policy routing must be set up to accept all packets from GWLB, inspect the traffic to allow or deny it, and, if allowed, send it back out on the same interface.
  • Use the native troubleshooting tool on the appliance to check that packets arrive from GWLB over GENEVE on UDP port 6081 and are sent back to GWLB after inspection.

Figure 6 shows sample output from FortiGate when Secure Shell (SSH) traffic comes through. The SSH packet is received inside the GENEVE header from the IP address of the GWLB ENI and sent back with the same header after inspection. The SSH traffic is carried as the payload of this packet and is not visible in the outer header.

Figure 6 – Confirming successful ingress/egress of GENEVE packets on third-party appliance.

Below, Figure 7 shows a packet capture from FortiGate showing the encapsulations for an SSH session sent over GWLB. Packet captures can be enabled directly from the FortiGate user interface (UI), and once captured they can be downloaded to a PCAP file that can be viewed via Wireshark.

As you can see, the outer (wrapper) header is GENEVE carried over UDP port 6081, sent from 10.0.30.92, the IP address of the GWLB ENI, to the FortiGate’s interface IP of 10.0.10.10. Within that wrapper is the original packet for the SSH session, with a public IP address as the source and 10.1.10.100 as the destination.

Figure 7 – Packet-level detail of the outer and inner header of GENEVE packets.

As shown in the preceding figure:

  1. Outer header of the GENEVE packet with the IP address of the GWLB ENI as the source.
  2. Outer header of the GENEVE packet with the IP address of the FortiGate interface as the destination.
  3. UDP as the transport for GENEVE, with the destination port of 6081.
  4. IPv4 inner header with a public IP address of the internet host as the source and the private IP address of the workload in the VPC as the destination. The internet gateway translates the workload’s public IP to its private IP address before ingress routing sends the packet to the GWLBe.
  5. TCP in the inner header with SSH port 22 as the destination port.
  6. Encrypted payload of SSH.
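The layering in the list above can be made concrete by decoding the fixed 8-byte GENEVE header that sits between the outer UDP header and the inner packet. The sketch below follows the header layout in RFC 8926; the sample bytes are constructed by hand for illustration and are not taken from the capture in Figure 7, and the names `GeneveHeader` and `parse_geneve` are hypothetical.

```python
import struct
from typing import NamedTuple, Tuple


class GeneveHeader(NamedTuple):
    version: int
    options_len: int    # length of the variable options, in bytes
    control: bool       # O bit: control packet
    critical: bool      # C bit: critical options present
    protocol_type: int  # EtherType of the inner packet (0x0800 = IPv4)
    vni: int            # 24-bit virtual network identifier


def parse_geneve(udp_payload: bytes) -> Tuple[GeneveHeader, bytes]:
    """Split a UDP/6081 payload into its GENEVE header and the inner packet."""
    if len(udp_payload) < 8:
        raise ValueError("payload shorter than the fixed GENEVE header")
    first, flags, proto, vni_rsvd = struct.unpack("!BBHI", udp_payload[:8])
    opt_len = (first & 0x3F) * 4  # Opt Len field counts 4-byte words
    if len(udp_payload) < 8 + opt_len:
        raise ValueError("payload shorter than the declared options length")
    header = GeneveHeader(
        version=first >> 6,
        options_len=opt_len,
        control=bool(flags & 0x80),
        critical=bool(flags & 0x40),
        protocol_type=proto,
        vni=vni_rsvd >> 8,  # low 8 bits are reserved
    )
    return header, udp_payload[8 + opt_len:]


# Hand-built sample: version 0, 8 bytes of options, IPv4 inner packet, VNI 0.
SAMPLE_PAYLOAD = struct.pack("!BBHI", 0x02, 0x00, 0x0800, 0) + b"\x00" * 8 + b"\x45\x00"

if __name__ == "__main__":
    hdr, inner = parse_geneve(SAMPLE_PAYLOAD)
    print(hdr, inner.hex())
```

The bytes returned as the inner packet begin with the inner IPv4 header (item 4 in the list), which is where the original source and destination addresses appear.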

#3 – Packets received from GWLB on third-party appliance but don’t get back to the destination via GWLB

As stated earlier, routing for CIDRs and next hop are stitched hop by hop. Troubleshooting this requires a methodical hop-by-hop validation to ensure that at each hop the routing table for the subnet or the Transit Gateway attachment has the correct next hop for the given CIDR.

  • GWLB expects to receive return traffic using the same 5-tuple information it created in the state table when it first forwarded the traffic it received from GWLBe.
  • Ensure the firewall policies on the appliance do not apply NAT to the payload traffic. If NAT is applied, the traffic the appliance returns no longer matches the 5-tuple in the GWLB state table, and GWLB drops it. Checking the firewall logs and filtering by IP addresses, protocols, and ports shows which firewall policy is being applied. Checking that policy shows whether NAT has been applied.
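The 5-tuple requirement can be illustrated with a small comparison helper. This is a sketch for reasoning about log or capture data, not a description of how GWLB implements its state table; the type `FiveTuple`, the function `nat_changes`, and the sample addresses are hypothetical.

```python
from typing import List, NamedTuple


class FiveTuple(NamedTuple):
    protocol: str
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def nat_changes(forwarded: FiveTuple, returned: FiveTuple) -> List[str]:
    """Return the names of the 5-tuple fields the appliance altered."""
    return [
        field
        for field in FiveTuple._fields
        if getattr(forwarded, field) != getattr(returned, field)
    ]


# Inner packet as GWLB forwarded it to the appliance.
FORWARDED = FiveTuple("tcp", "203.0.113.10", 40000, "10.1.10.100", 22)
# Same packet as the appliance returned it, with source NAT wrongly applied.
RETURNED_WITH_SNAT = FiveTuple("tcp", "10.0.10.10", 40000, "10.1.10.100", 22)

if __name__ == "__main__":
    print(nat_changes(FORWARDED, FORWARDED))           # no differences
    print(nat_changes(FORWARDED, RETURNED_WITH_SNAT))  # source address rewritten
```

Any non-empty result indicates that the inspected packet no longer matches the flow GWLB is tracking.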

#4 – Application owners report session timeouts after migration to GWLB

  • GWLB maintains 5-tuple session information and times it out after a certain period of inactivity. Refer to the GWLB best practices guide for defaults.
  • Next-gen firewalls such as FortiGate are stateful devices and maintain information on TCP and UDP flows to ensure there are no anomalies in the flow. The same firewall needs to receive all traffic for a given session. When GWLB times out the session, the next packet may be sent to a different firewall, which will drop it because it did not see the session being established.
  • Applications can send periodic keepalive packets at an interval shorter than the GWLB timeout, so that sessions stay active in GWLB and traffic continues to be forwarded to the same device.
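As one way to apply the keepalive recommendation, an application can enable TCP keepalives on its sockets with an idle time below the load balancer timeout. The sketch below is a minimal example with two assumptions that should be verified: it uses 350 seconds as the GWLB TCP idle timeout (confirm the value for your configuration in the GWLB best practices guide), and it uses the Linux socket options `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT`, which are skipped where the platform does not provide them. The function names and the 50-second margin are hypothetical.

```python
import socket


def keepalive_idle_seconds(idle_timeout_s: int, margin_s: int = 50) -> int:
    """Pick a keepalive idle time safely below the load balancer timeout."""
    return max(1, idle_timeout_s - margin_s)


def enable_keepalive(sock: socket.socket, idle_timeout_s: int) -> int:
    """Turn on TCP keepalives so probes start before the session would expire."""
    idle = keepalive_idle_seconds(idle_timeout_s)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These options exist on Linux; other platforms may name them differently.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)
    return idle


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # 350 is an assumed timeout value; replace it with the value for your setup.
    print(enable_keepalive(s, 350))
    s.close()
```

Operating system defaults for keepalive idle time are often much longer than load balancer timeouts, so setting the value per socket or system-wide is usually necessary.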

#5 – Source AZ is not preserved when GWLB receives traffic from AWS Transit Gateway

  • Source AZ is preserved only when the AWS Transit Gateway attachment to the VPC with third-party appliances is not in appliance mode. When an attachment is set to appliance mode, Transit Gateway selects one of the AZs the attachment spans for each flow, using a flow hash rather than the source AZ, and keeps the flow on that AZ for its lifetime.
  • Appliance mode setting is individual to each Transit Gateway attachment and not a global setting.
  • Check in the graphical user interface (GUI) for the specific Transit Gateway attachment, or from AWS CloudShell as shown in Figure 8, using
    aws ec2 describe-transit-gateway-vpc-attachments --transit-gateway-attachment-ids <attachment id>

Figure 8 – Checking status of appliance mode on Transit Gateway.

  • If appliance mode is disabled, it can be enabled using the command
    aws ec2 modify-transit-gateway-vpc-attachment --transit-gateway-attachment-id <tgw-attach-xyx> --options ApplianceModeSupport=enable
  • Note that in firewall deployments, appliance mode needs to be enabled so that the Transit Gateway keeps the same next hop for a given flow. In this case, there is no control over which AZ the Transit Gateway delivers a flow to.
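When many attachments are involved, the appliance mode setting can be read from the JSON that the describe command prints. The sketch below assumes output shaped like that of `aws ec2 describe-transit-gateway-vpc-attachments`; the attachment IDs in the sample are hypothetical placeholders.

```python
import json
from typing import Dict


def appliance_mode_by_attachment(cli_json: str) -> Dict[str, str]:
    """Map each attachment ID to its ApplianceModeSupport value."""
    data = json.loads(cli_json)
    return {
        att["TransitGatewayAttachmentId"]: att.get("Options", {}).get(
            "ApplianceModeSupport", "unknown"
        )
        for att in data.get("TransitGatewayVpcAttachments", [])
    }


# Hypothetical sample output; IDs are placeholders.
SAMPLE_OUTPUT = """
{
  "TransitGatewayVpcAttachments": [
    {"TransitGatewayAttachmentId": "tgw-attach-0aaa",
     "Options": {"DnsSupport": "enable", "ApplianceModeSupport": "disable"}},
    {"TransitGatewayAttachmentId": "tgw-attach-0bbb",
     "Options": {"DnsSupport": "enable", "ApplianceModeSupport": "enable"}}
  ]
}
"""

if __name__ == "__main__":
    for attachment_id, mode in appliance_mode_by_attachment(SAMPLE_OUTPUT).items():
        print(attachment_id, mode)
```

Because the setting is per attachment, the attachment to the VPC that hosts the firewalls is the one to check.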

Conclusion

Troubleshooting complex Gateway Load Balancer (GWLB) environments with many dependencies and security controls can be challenging and time-consuming. This post covers some of the most common challenges faced in the field, particularly from the perspective of GWLB, and ways to approach them methodically.

Key points to remember:

  • GENEVE is a UDP-based data plane protocol used between the GWLB and the third-party appliance; it’s not used for health checks.
    • Use packet capture commands on your appliance to confirm that health check probes are received on the configured protocol and port, and that the appliance responds to them. Refer to the section “Unable to get health checks working between GWLB and third-party appliances” for more detail and an example from Fortinet’s FortiGate NGFW.
  • GWLB can receive the onward flow only from a registered GWLB endpoint (GWLBe). All ingress traffic to GWLB must be sent through GWLBe.
    • Ensure route tables that direct traffic to GWLB point to the VPC endpoint (GWLBe) and not to the ENI of the GWLB. The ENI of the GWLB is shown with the prefix of ‘ELB gwy’ in the description under Network Interfaces.
  • GWLB always returns traffic to the GWLBe that originated it; the traffic cannot be sent to any other GWLBe. This is the behavior of the service, and network administrators have no control to direct it elsewhere.
  • The route table of the subnet with GWLBe is consulted only when GWLB sends the traffic back to GWLBe. Traffic sent into GWLBe flows directly to GWLB without any route table consultations. This is the behavior of the service and there is no control by the network administrators to alter this behavior.
  • GWLB maintains session state and clears a session from its cache when there is no activity within the timeout period.
    • See the section on “Tune TCP keep-alive or timeout values to support long-lived TCP flows” in GWLB best practices, if applications are experiencing session timeouts.
  • Appliance mode in AWS Transit Gateway is applied per attachment and not global to the Transit Gateway. Best practice is to enable appliance mode on the attachment that connects to stateful devices such as firewalls.

For more details, you can reach out to consulting@fortinet.com.