Networking & Content Delivery

Building Resilient Multi-cluster Applications with Amazon EKS, Part 1: Implementing Cross-cluster Load Balancing with NLB

This three-part series explores design patterns and strategies to enhance application resiliency through multi-cluster deployment on Amazon Elastic Kubernetes Service (EKS). In this first part, we address a common challenge when using a Network Load Balancer (NLB) in multi-cluster environments.

Organizations increasingly rely on Kubernetes, whether through Amazon Elastic Kubernetes Service (EKS) or self-managed clusters on Amazon Web Services (AWS), to power and scale their mission-critical applications. While operating workloads on a single EKS cluster is convenient, it makes maintaining high availability during critical operations harder. Activities such as cluster upgrades, add-on updates, and workload changes can compromise workload resilience and application availability, making it essential to address these concerns proactively.

To mitigate these challenges, users often deploy applications across multiple EKS clusters. This multi-cluster approach offers several key benefits:

  • Blue-green upgrades: Zero-downtime upgrades through blue-green deployments, allowing gradual traffic migration between clusters.
  • Cluster upgrades and add-on updates: Staggered cluster and add-on updates across clusters, minimizing system-wide disruptions.
  • Workload resilience: Enhanced workload resilience against accidental cluster deletions.
  • Failover and disaster recovery: Improved disaster recovery with cross-cluster failover capabilities.

While these benefits are compelling, implementing effective load balancing across multiple clusters has historically been challenging. However, AWS Load Balancer Controller (LBC) v2.10+ now addresses this gap by introducing support for cross-cluster traffic distribution through multi-cluster TargetGroupBinding, a powerful feature that we’ll explore in detail throughout this post. This solution is particularly valuable for client-server communication patterns and scenarios where organizations manage their own Layer 7 proxy configurations, as Network Load Balancers (NLBs) provide the necessary flexibility and performance for these use cases.

Enhanced NLB Target Group Binding: Supporting Multi-cluster EKS Deployments

This new feature enables the LBC in different EKS clusters to register targets into the same NLB target group through their TargetGroupBinding resources, so that traffic is distributed seamlessly across clusters.

How this feature works

With LBC version 2.10+, the controller manages targets efficiently across multiple EKS clusters. It maintains a ConfigMap for each TargetGroupBinding to track the targets that it owns, which allows it to support multi-cluster deployments seamlessly.

A new parameter, multiClusterTargetGroup, allows the NLB to handle targets across multiple clusters. When enabled, this flag makes sure that each cluster manages its targets independently, enhancing reliability and streamlining load balancing across clusters.
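For reference, the following is a minimal sketch of a TargetGroupBinding manifest with this flag set. The target group ARN is a placeholder, and the Service name and namespace are borrowed from the nginx example used later in this post:

apiVersion: elbv2.k8s.aws/v1beta1
kind: TargetGroupBinding
metadata:
  name: nginx
  namespace: default
spec:
  serviceRef:
    name: nginx                  # Kubernetes Service in this cluster
    port: 80
  targetType: ip
  targetGroupARN: arn:aws:elasticloadbalancing:<region>:<account-id>:targetgroup/<name>/<id>   # shared NLB target group (placeholder)
  multiClusterTargetGroup: true  # opt this binding into multi-cluster target management

For Services whose NLB is provisioned by the LBC, the equivalent switch is the service.beta.kubernetes.io/aws-load-balancer-multi-cluster-target-group: "true" annotation, which Scenario 1 applies with kubectl patch.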

The following figure shows the reference architecture:

Figure 1 – Architecture

 

Here’s how the process works:

  1. Managing targets: For each EKS cluster, the LBC maintains a separate ConfigMap that tracks the targets for that cluster’s service endpoints. This makes sure that only the targets specific to a particular cluster are registered or deregistered from the NLB, avoiding any accidental changes to other clusters.
  2. Pod registrations: When a new pod is initiated, the LBC updates the ConfigMap during its reconciliation loop. The new target is registered in the NLB, making sure that traffic can route correctly to the new pod.
  3. Pod deletions: Similarly, when a pod is deleted, the LBC updates the ConfigMap to reflect the change and deregisters the deleted target from the NLB, keeping the system consistent and avoiding errors.
  4. Reconciliation process: The LBC regularly reconciles service endpoints with NLB targets, adding new endpoints and removing outdated ones while using the ConfigMap to maintain cluster separation. When changes occur, the LBC updates the entire ConfigMap object as a single operation; it does not support partial updates or patching. An illustrative ConfigMap sketch follows this list.
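As an illustration of the per-cluster tracking described above, a target ConfigMap is conceptually similar to the following sketch. The exact data layout is owned by the controller, so treat the key and values here as assumptions rather than the controller's actual format; only the naming pattern, aws-lbc-targets-<TargetGroupBinding-name>, is used later in this post:

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-lbc-targets-nginx    # one ConfigMap per TargetGroupBinding
  namespace: default
data:
  targets: |                     # illustrative key: this cluster's pod IP:port targets
    10.0.1.23:80
    10.0.2.45:80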

Now, let’s implement this multi-cluster configuration.

Prerequisites

For this walkthrough, you should have an AWS account with the appropriate AWS Identity and Access Management (IAM) permissions to create EKS clusters and IAM roles, and be able to launch the provided AWS CloudFormation template. For detailed pricing information, refer to the official AWS Pricing page.
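To confirm that your session is running under the intended account and IAM role, you can run the following command (this is optional and only verifies your identity, not specific permissions):

aws sts get-caller-identity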

Deploy the solution with CloudFormation

We use a CloudFormation stack to deploy this solution. The stack creates all of the necessary resources, including the Primary (pri-eks-clu1) and Secondary (sec-eks-clu1) EKS clusters along with their supporting networking and IAM roles.

To get started, complete the following steps:

  1. Sign in to the AWS Management Console.
  2. Choose Launch Stack in any AWS Region and open it in a new tab.
  3. On the Create stack page, proceed with all the default values.
  4. Choose the check box to acknowledge the creation of IAM resources.
  5. Choose Create stack.
  6. Wait for the stack creation to complete. This may take up to 15 minutes. You can also track the stack status from the CLI, as shown below.
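If you prefer the command line, the following command reports the stack status; it assumes the stack name eks-nlb, which is the name used by the cleanup step at the end of this post:

aws cloudformation describe-stacks --stack-name eks-nlb \
  --query 'Stacks[0].StackStatus' --output text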

At this point, the Primary and Secondary EKS clusters are ready. Next, configure the LBC add-on on both clusters and deploy a sample application to demonstrate multi-cluster load balancing.

From the Console, launch AWS CloudShell:

Figure 2 – Launching CloudShell from the AWS Console

Using CloudShell, run the following command:

git clone https://github.com/aws-samples/eks-design-patterns
cd eks-design-patterns
bash ./install_prereqs.sh

The script deploys three nginx pods with an NLB LoadBalancer service on the Primary EKS cluster (pri-eks-clu1), while the Secondary EKS cluster (sec-eks-clu1) is initialized and ready for further configuration.
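Optionally, confirm that the nginx Deployment and its LoadBalancer Service are running on the Primary cluster before continuing:

kubectl --context pri-eks-clu1 get deployment nginx
kubectl --context pri-eks-clu1 get svc nginx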

We cover two scenarios in the following sections:

  1. Scenario 1 (existing workload): Using an existing NLB to route traffic to an nginx service running across both the primary and secondary EKS clusters.
  2. Scenario 2 (new workload): Creating a new NLB to route traffic to an nginx service running on both primary and secondary EKS clusters.

Each scenario demonstrates how to distribute traffic seamlessly across multiple clusters using a single NLB setup.

We begin with Scenario 1 for existing workloads.

Scenario 1: Extending existing NLB configuration for multi-cluster support

In this scenario, you work with an existing NLB and its associated target group and TargetGroupBinding configured on the Primary EKS cluster. You update the TargetGroupBinding on the Primary cluster to enable multi-cluster support, and then create a new TargetGroupBinding for the nginx service on the Secondary cluster. This allows the existing NLB to distribute traffic between the nginx services running on both clusters using the same target group, providing seamless traffic distribution while keeping the existing load balancer infrastructure in place.

Step 1: Verify multi-cluster support in LBC

Using CloudShell, verify that both Primary and Secondary EKS clusters have the aws-load-balancer-controller installed with multi-cluster feature enabled by running these commands:

kubectl --context pri-eks-clu1 explain targetGroupBinding.spec.multiClusterTargetGroup
kubectl --context sec-eks-clu1 explain targetGroupBinding.spec.multiClusterTargetGroup

The successful output displays the multiClusterTargetGroup field definition in the TargetGroupBinding spec for both clusters, confirming that the LBC's multi-cluster capability is available. If this field is missing, make sure that LBC v2.10 or later, including its updated CRDs, is installed on the cluster.
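One way to confirm the controller version is to check the controller image tag; this assumes the default deployment name and namespace used by the LBC Helm chart, and the same command works with the sec-eks-clu1 context:

kubectl --context pri-eks-clu1 -n kube-system get deployment aws-load-balancer-controller \
  -o jsonpath='{.spec.template.spec.containers[0].image}'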

Step 2: Enable multi-cluster support and delete protection on NLB

Using CloudShell, run the following command to add delete protection and enable the multi-cluster target group by annotating the nginx service in the Primary EKS cluster:

kubectl --context pri-eks-clu1 patch service nginx \
  --patch '{"metadata": {"annotations": {"service.beta.kubernetes.io/aws-load-balancer-multi-cluster-target-group": "true", "service.beta.kubernetes.io/aws-load-balancer-attributes": "deletion_protection.enabled=true"}}}' \
  --type=merge

The successful output shows service/nginx patched.

Step 3: Verify multi-cluster configuration in TargetGroupBinding

Using CloudShell, run the following command to verify that the targetGroupBinding was updated with the multiClusterTargetGroup flag:

kubectl --context pri-eks-clu1 get targetgroupbinding -l service.k8s.aws/stack-name=nginx \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.multiClusterTargetGroup}{"\n"}{end}'

The successful output shows the TargetGroupBinding name followed by true.

Step 4: Verify target registration in ConfigMap

When the TargetGroupBinding’s multiClusterTargetGroup flag is updated, the LBC creates a ConfigMap object containing the list of targets and reconciles the target pods with the NLB target group.

Using CloudShell, run these commands to verify the ConfigMap creation and target registration:

kubectl --context pri-eks-clu1 get cm aws-lbc-targets-$(kubectl --context pri-eks-clu1 get targetgroupbinding -l service.k8s.aws/stack-name=nginx -o jsonpath='{range .items[*]}{.metadata.name}{end}')
kubectl --context pri-eks-clu1 describe cm aws-lbc-targets-$(kubectl --context pri-eks-clu1 get targetgroupbinding -l service.k8s.aws/stack-name=nginx -o jsonpath='{range .items[*]}{.metadata.name}{end}')

The successful output shows a ConfigMap containing registered targets, confirming LBC’s target reconciliation between the nginx pods and NLB.

Step 5: Deploy nginx application on Secondary EKS cluster

Using CloudShell, run the following commands to create a deployment and a service object of type ClusterIP on the Secondary EKS cluster:

kubectl --context sec-eks-clu1 create -f nginxapp.yaml
kubectl --context sec-eks-clu1 create -f nginx_sec-eks-clu1.yaml
kubectl --context sec-eks-clu1 get all

The successful output shows an nginx deployment with two running pods and a ClusterIP service created on the Secondary cluster.

Step 6: Create multi-cluster TargetGroupBinding on Secondary EKS cluster

Using CloudShell, run the following to create TargetGroupBinding with the multiClusterTargetGroup flag on the Secondary EKS cluster:

# Get targetGroup ARN from Primary EKS Cluster 
targetgrouparn=$(kubectl --context pri-eks-clu1 get targetgroupbinding -l service.k8s.aws/stack-name=nginx   -o jsonpath='{range .items[*]}{.spec.targetGroupARN}{end}')

# Extract the NLB name from the Service hostname
lb=$(kubectl --context pri-eks-clu1 get svc nginx -o jsonpath='{.status.loadBalancer.ingress[0].hostname}' | sed -n 's/.*\(k8s-default-nginx-[a-z0-9]*\).*/\1/p')

# Get Security GroupId
secgrpid=$(aws ec2 describe-security-groups --group-ids $(aws elbv2 describe-load-balancers --names ${lb} --query 'LoadBalancers[*].SecurityGroups[]' --output text) --query "SecurityGroups[?Description=='[k8s] Shared Backend SecurityGroup for LoadBalancer'].GroupId"   --output text)

# update tgb.yaml file
sed "s|<securitygroupid>|${secgrpid}|g; s|<targetgrouparn>|${targetgrouparn}|g" tgb.yaml > tgb_sec.yaml

kubectl --context sec-eks-clu1 create -f tgb_sec.yaml

Step 7: Verify ConfigMap creation on the Secondary EKS cluster

Using CloudShell, run the following commands to verify the ConfigMap object and its contents created by the controller alongside the targetGroupBinding:

kubectl --context sec-eks-clu1 get cm aws-lbc-targets-nginx
kubectl --context sec-eks-clu1 describe cm aws-lbc-targets-nginx

The successful output shows a ConfigMap aws-lbc-targets-nginx containing the IP addresses and ports of the nginx targets.

Step 8: Verify target registration across both EKS clusters

Using CloudShell, run the following command to verify that the NLB has registered targets from both the Primary and Secondary EKS clusters:

aws elbv2 describe-target-health --target-group-arn $(kubectl --context pri-eks-clu1 get targetgroupbinding -l service.k8s.aws/stack-name=nginx -o jsonpath='{range .items[*]}{.spec.targetGroupARN}{end}') --query 'TargetHealthDescriptions[*].[Target.Id,Target.AvailabilityZone,TargetHealth.State]' --output table

The successful output shows healthy targets distributed across Primary and Secondary EKS clusters, confirming successful multi-cluster configuration.

Step 9: Verify traffic distribution across EKS clusters

Using CloudShell, run the following commands to perform a synthetic workload test and verify traffic distribution to Pods on both the Primary and Secondary EKS clusters:

# Get NLB hostname
lb=$(kubectl --context pri-eks-clu1 get svc nginx -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
# Run ApacheBench test
ab -n 1000 -c 40 http://$lb/

# Check logs on Primary cluster
echo "Primary EKS Cluster:"
kubectl --context pri-eks-clu1 logs deployment/nginx --tail=2

# Check logs on Secondary cluster
echo "Secondary EKS Cluster:"
kubectl --context sec-eks-clu1 logs deployment/nginx --tail=2

The successful output shows HTTP requests logged in both clusters, confirming that the NLB is properly distributing traffic across all nginx pods on Primary and Secondary EKS clusters.

Cleaning up: Using CloudShell, run the following commands: 

kubectl --context sec-eks-clu1 delete targetgroupbinding nginx
kubectl --context sec-eks-clu1 delete svc nginx
kubectl --context pri-eks-clu1 patch service nginx   --patch '{"metadata": {"annotations": {"service.beta.kubernetes.io/aws-load-balancer-attributes": "deletion_protection.enabled=false"}}}'   --type=merge
kubectl --context pri-eks-clu1 delete svc nginx
kubectl --context pri-eks-clu1 delete deployment nginx

You have successfully tested the migration of existing workloads in Scenario 1. In Scenario 2, you can explore how to implement multi-cluster load balancing for new deployments.

Scenario 2: Implementing multi-cluster load balancing for new deployments

In this scenario, you create a new NLB designed to support multi-cluster traffic distribution from the beginning. This implementation needs LBC 2.10 or higher installed on both EKS clusters.

Step 1: Deploy nginx with LoadBalancer Service on the Primary EKS cluster

Using CloudShell, run the following commands:

kubectl --context pri-eks-clu1 create -f nginxapp.yaml
kubectl --context pri-eks-clu1 create -f nginxsvc.yaml

Step 2: Verify ConfigMap creation by LBC

Using CloudShell, run the following commands to verify the ConfigMap object and its contents created alongside the targetGroupBinding:

kubectl --context pri-eks-clu1 get cm aws-lbc-targets-$(kubectl --context pri-eks-clu1 get targetgroupbinding -l service.k8s.aws/stack-name=nginx -o jsonpath='{range .items[*]}{.metadata.name}{end}')
kubectl --context pri-eks-clu1 describe cm aws-lbc-targets-$(kubectl --context pri-eks-clu1 get targetgroupbinding -l service.k8s.aws/stack-name=nginx -o jsonpath='{range .items[*]}{.metadata.name}{end}')

The successful output shows a ConfigMap containing IP addresses and ports of nginx targets, confirming LBC’s proper configuration.

Step 3: Deploy nginx service on the Secondary EKS cluster

Using CloudShell, run the following commands to create an nginx deployment and a ClusterIP service on the Secondary EKS cluster:

kubectl --context sec-eks-clu1 create -f nginxapp.yaml
kubectl --context sec-eks-clu1 create -f nginx_sec-eks-clu1.yaml

Step 4: Configure TargetGroupBinding on the Secondary EKS cluster

Using CloudShell, run the following commands to create a TargetGroupBinding with multi-cluster support:

# Get targetGroup ARN from Primary EKS Cluster 
targetgrouparn=$(kubectl --context pri-eks-clu1 get targetgroupbinding -l service.k8s.aws/stack-name=nginx   -o jsonpath='{range .items[*]}{.spec.targetGroupARN}{end}')

# Extract the NLB name from the Service hostname
lb=$(kubectl --context pri-eks-clu1 get svc nginx -o jsonpath='{.status.loadBalancer.ingress[0].hostname}' | sed -n 's/.*\(k8s-default-nginx-[a-z0-9]*\).*/\1/p')
# Get Security GroupId
secgrpid=$(aws ec2 describe-security-groups --group-ids $(aws elbv2 describe-load-balancers --names ${lb} --query 'LoadBalancers[*].SecurityGroups[]' --output text) --query "SecurityGroups[?Description=='[k8s] Shared Backend SecurityGroup for LoadBalancer'].GroupId"   --output text)
# update tgb.yaml file
sed "s|<securitygroupid>|${secgrpid}|g; s|<targetgrouparn>|${targetgrouparn}|g" tgb.yaml > tgb_p2.yaml
kubectl --context sec-eks-clu1 create -f tgb_p2.yaml

Step 5: Verify ConfigMap on the Secondary EKS cluster

Using CloudShell, run the following commands to verify the ConfigMap object and its contents created by the controller:

kubectl --context sec-eks-clu1 get cm aws-lbc-targets-$(kubectl --context sec-eks-clu1 get targetgroupbinding -l service.k8s.aws/stack-name=nginx   -o jsonpath='{range .items[*]}{.metadata.name}{end}')
kubectl --context sec-eks-clu1 describe cm aws-lbc-targets-$(kubectl --context sec-eks-clu1 get targetgroupbinding -l service.k8s.aws/stack-name=nginx   -o jsonpath='{range .items[*]}{.metadata.name}{end}')

The successful output shows a ConfigMap containing the Secondary EKS cluster’s nginx target IPs and ports.

Step 6: Verify target registration across both EKS clusters

Using CloudShell, run the following command to verify that the NLB has successfully registered targets from both the Primary and Secondary EKS clusters:

aws elbv2 describe-target-health --target-group-arn $(kubectl --context pri-eks-clu1 get targetgroupbinding -l service.k8s.aws/stack-name=nginx -o jsonpath='{range .items[*]}{.spec.targetGroupARN}{end}') --query 'TargetHealthDescriptions[*].[Target.Id,Target.AvailabilityZone,TargetHealth.State]' --output table

Expected output:

-------------------------------------------
|          DescribeTargetHealth           |
+--------------+--------------+-----------+
|  xx.xx.xx.xxx|  us-east-1b  |  healthy  |
|  xx.xx.xx.xxx|  us-east-1c  |  healthy  |
|  xx.xx.xx.xxx|  us-east-1c  |  healthy  |
|  xx.xx.xx.xxx|  us-east-1b  |  healthy  |
+--------------+--------------+-----------+

The output shows healthy targets distributed across the Primary and Secondary EKS clusters, confirming successful multi-cluster configuration.

Cleaning up: Using AWS CloudShell, run the following commands to clean up all resources:

kubectl --context sec-eks-clu1 delete targetgroupbinding nginx
kubectl --context sec-eks-clu1 delete svc nginx
kubectl --context pri-eks-clu1 patch service nginx   --patch '{"metadata": {"annotations": {"service.beta.kubernetes.io/aws-load-balancer-attributes": "deletion_protection.enabled=false"}}}'   --type=merge
kubectl --context pri-eks-clu1 delete svc nginx
kubectl --context pri-eks-clu1 delete deployment nginx
aws cloudformation delete-stack --stack-name eks-nlb

This demonstration showed how the NLB multi-cluster feature enhances service resilience by enabling traffic distribution across multiple EKS clusters for both existing and new load balancer deployments.

Considerations

  • This feature currently supports only active-active, equal distribution across targets in both EKS clusters. Weighted target load balancing is not yet supported.
  • Existing VPC account limits, API limits, and NLB limits remain in effect with this feature.
  • As a best practice, enable delete protection on the NLB to prevent accidental deletions.
  • Each service has a one-to-one mapping with a ConfigMap object, which allows precise management of TargetGroupBindings per service.
  • The LBC writes all targets to the ConfigMap (which is subject to the 1 MB object size limit) in a single update rather than incrementally. Monitor EKS control plane health accordingly; a quick size check is shown after this list.
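As a rough check, you can measure the rendered size of a target ConfigMap in bytes and compare it against that limit; this example uses the Secondary cluster's ConfigMap from Scenario 1:

kubectl --context sec-eks-clu1 get cm aws-lbc-targets-nginx -o yaml | wc -c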

Conclusion

In this first part of our series, we’ve demonstrated how to achieve resiliency across multiple Amazon EKS clusters using the new NLB feature with a declarative approach. As organizations increasingly migrate and modernize their applications to Kubernetes environments, implementing robust and scalable solutions becomes critical for maintaining high availability. Distributing workloads across clusters not only enhances fault tolerance but also streamlines upgrades and strengthens disaster recovery capabilities. Visit our EKS Best Practices Hub for architecture patterns, security guidelines, and cost optimization strategies for production workloads.

Stay tuned for the next posts in this series, where we explore more design patterns to further improve the resiliency of workloads operating on Amazon EKS.

About the authors

Krishna Sarabu is a Senior Database Engineer with AWS. He focuses on containers, application modernization, infrastructure, and the open-source database engines behind Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL. He enjoys working with users to help design, deploy, and optimize relational database workloads on AWS.
Anuj Butail is a Principal Solutions Architect at AWS. He is based out of San Francisco and helps users in San Francisco and Silicon Valley design and build large-scale applications on AWS. His areas of expertise include AWS edge services and containers. He enjoys playing tennis, reading, and spending time with his family.
Pushkar Patil is a Product Owner in the AWS networking team based out of California. He has over a decade of experience driving product innovation and strategic planning in cloud computing and infrastructure. Pushkar has successfully launched many new products by understanding users’ needs and delivering innovative solutions. When not working, you can find this cricket enthusiast traveling with his family.