Secure Cross-Cluster Communication in EKS with VPC Lattice and Pod Identity IAM Session Tags

Solution overview

When you build applications that expose internal API endpoints, you can implement your microservices with different compute options such as AWS Lambda, Amazon Elastic Container Service (Amazon ECS), and Amazon Elastic Kubernetes Service (Amazon EKS). You may then deploy these applications across multiple AWS accounts and multiple Amazon Virtual Private Clouds (VPCs), at which point you need a secure way to connect them. Amazon VPC Lattice enables east-west traffic across accounts and offers service discovery, traffic management, and access controls. On Amazon EKS, Amazon VPC Lattice comes with the AWS Gateway API Controller, which implements the Kubernetes Gateway API.

Traffic throughout this architecture can be protected with encryption in transit, and you can enable fine-grained AWS Identity and Access Management (IAM) authorization on every Amazon VPC Lattice service. For encryption, you can rely on AWS Private Certificate Authority (AWS Private CA) to manage your private domain, and on AWS Certificate Manager (ACM) to create certificates for each of your services. For authorization, you can rely on the IAM auth policies feature of Amazon VPC Lattice together with EKS Pod Identity, which simplifies how cluster administrators configure Kubernetes applications to obtain IAM permissions. These permissions can be configured with fewer steps directly through the Amazon EKS console, APIs, and CLI. EKS Pod Identity also lets you reuse an IAM role across multiple clusters and simplifies policy management.

You can associate an IAM role with a Kubernetes service account using the following API call:

aws eks create-pod-identity-association \
  --cluster-name $CLUSTER_NAME \
  --namespace $NAMESPACE \
  --service-account $SERVICE_ACCOUNT \
  --role-arn arn:aws:iam::$AWS_ACCOUNT:role/$POD_ROLE_NAME

The IAM role $POD_ROLE_NAME needs the following trust policy, which allows the Amazon EKS Pod Identity service principal (pods.eks.amazonaws.com) to assume it and set session tags:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "pods.eks.amazonaws.com"
    },
    "Action": ["sts:AssumeRole","sts:TagSession"]
  }]
}
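
For reference, here is a minimal sketch of creating such a role with the AWS CLI, assuming the trust policy above is saved in a file named pod-trust-policy.json (the file name is illustrative):

aws iam create-role \
  --role-name $POD_ROLE_NAME \
  --assume-role-policy-document file://pod-trust-policy.json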

When EKS Pod Identity assumes an IAM role, it sets session tags on the IAM session. These tags contain information such as the eks-cluster-name, kubernetes-namespace, and kubernetes-pod-name, which can be used for Attribute-Based Access Control (ABAC). With ABAC, you can grant access to your AWS resources only to specific Kubernetes pods, from specific namespaces, in specific EKS clusters. This feature can also be used when reaching an Amazon VPC Lattice service on which IAM authorization is enabled, providing a direct and rich access control system for your microservices within an EKS cluster or between different EKS clusters.
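
As an illustration, an IAM policy condition like the following (a sketch; the full auth policy used in this pattern appears later in the walkthrough) matches only sessions created for pods in the apps namespace of the eks-cluster1 cluster:

"Condition": {
  "StringEquals": {
    "aws:PrincipalTag/eks-cluster-name": "eks-cluster1",
    "aws:PrincipalTag/kubernetes-namespace": "apps"
  }
}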

To access an Amazon VPC Lattice service on which we enable IAM auth policies, your application requests must be signed with the AWS Sigv4 algorithm (or Sigv4A for cross-Region), the same signature used with any AWS service. You can implement the request signature by using the AWS SDKs in your application code (see these examples). Although this is the recommended method, it requires changes to your application code, so we propose an alternative. The associated Amazon EKS Blueprints pattern demonstrates an approach that uses a sidecar proxy without modifying your application code. This proxy automatically handles Sigv4 signing for requests targeting Amazon VPC Lattice services. We use Envoy, a widely adopted proxy that supports EKS Pod Identity and the AWS Sigv4 signature feature, enabling seamless integration with AWS services from within the Kubernetes cluster.
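
For example, the following is a minimal sketch of signing a request in application code using Python with botocore (one possible SDK choice; the hostname and Region are illustrative). Amazon VPC Lattice requests are signed against the vpc-lattice-svcs service name:

import urllib.request

import botocore.session
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

# Credentials are resolved from the environment; in a pod, EKS Pod Identity provides them
credentials = botocore.session.Session().get_credentials()

# Build the request and add a Sigv4 signature for the vpc-lattice-svcs service
request = AWSRequest(method="GET", url="https://demo-cluster2.example.com/")
SigV4Auth(credentials, "vpc-lattice-svcs", "eu-west-1").add_auth(request)

# Send the request with the generated signature headers
response = urllib.request.urlopen(
    urllib.request.Request(request.url, headers=dict(request.headers.items()))
)
print(response.read().decode())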

Furthermore, for pod-to-pod encryption, you can set up your Amazon VPC Lattice services with HTTPS listeners. If your application container cannot support TLS encryption for HTTPS requests, then you can offload this to the Envoy sidecar proxy along with the Sigv4 signing. In this setup, your application makes a plain HTTP request to an Amazon VPC Lattice service that is configured with an HTTPS listener. A local iptables rule automatically routes the request to the Envoy sidecar, which creates a Sigv4 signature for it and forwards it to Amazon VPC Lattice over HTTPS, relying on the AWS Private CA to validate our private domain certificates issued by ACM.
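
The heart of such a proxy configuration is Envoy's aws_request_signing HTTP filter. The following is a minimal sketch (the Region is illustrative, and the pattern's full configuration also defines listeners, clusters, and TLS settings):

http_filters:
- name: envoy.filters.http.aws_request_signing
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.aws_request_signing.v3.AwsRequestSigning
    service_name: vpc-lattice-svcs
    region: eu-west-1
    use_unsigned_payload: true
    match_excluded_headers:
    - prefix: x-envoy
    - prefix: x-forwarded
    - exact: x-amzn-trace-id
- name: envoy.filters.http.router
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router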

To ease usage of the proxy, the pattern relies on an annotation in our deployment. This annotation triggers a Kyverno cluster policy that automatically injects the Envoy sidecar proxy into your application pod.
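
The following is a minimal sketch of such a cluster policy; the annotation name and sidecar image are hypothetical, and the policy shipped with the pattern additionally mounts the Envoy configuration and credentials:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: inject-sigv4-proxy
spec:
  rules:
  - name: add-envoy-sidecar
    match:
      any:
      - resources:
          kinds:
          - Pod
    preconditions:
      all:
      # Hypothetical annotation name; the pattern defines its own
      - key: "{{ request.object.metadata.annotations.\"sigv4proxy.example.com/inject\" || '' }}"
        operator: Equals
        value: "true"
    mutate:
      patchStrategicMerge:
        spec:
          containers:
          - name: envoy-sigv4
            image: envoyproxy/envoy:v1.29.1  # hypothetical image and tag
            args: ["-c", "/etc/envoy/envoy.yaml"]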

Walkthrough

In this solution, we rely on EKS Blueprints for Terraform patterns. EKS Blueprints is a collection of patterns that demonstrate specific usages of Amazon EKS with other AWS services. Here, we use the vpc-lattice/cross-cluster-pod-communication pattern.

We have divided the pattern into three Terraform stacks, also shown in the following diagram:

  1. The first stack named environment creates AWS resources that are needed for both EKS clusters:
  • An Amazon Route 53 private hosted zone named example.com, attached to a dummy private VPC (created at this stage only so that the hosted zone can be private).
  • An AWS Private CA that manages the private domain. From this AWS Private CA, we create a wildcard ACM certificate that is later attached to our Amazon VPC Lattice services.
  • An IAM role that our applications use through EKS Pod Identity. The role has permissions to invoke Amazon VPC Lattice services and to download the AWS Private CA root certificate, allowing the application to trust the private domain.

2. The next two stacks are deployed using the same Terraform code in the cluster directory. We use the Terraform workspace feature to instantiate the cluster stack twice and create two EKS clusters: cluster1 and cluster2, as sketched in the following snippet.
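
Conceptually, the pattern's deploy.sh script wraps something like the following (a sketch; the actual script also passes per-cluster variables):

terraform workspace new cluster1 || terraform workspace select cluster1
terraform apply -auto-approve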

First, the stack creates a dedicated VPC (both clusters use overlapping CIDRs, as they share the exact same configuration).

Then, the stack creates the EKS cluster with a dedicated managed node group.

Next, it installs some Amazon EKS and Kubernetes add-ons:

  • The AWS Gateway API Controller, which manages the creation of Amazon VPC Lattice objects from HTTPRoute and IAMAuthPolicy definitions.
  • External-DNS, which creates records in the Route 53 private hosted zone based on the same HTTPRoute objects containing custom domain names.
  • Kyverno, which is in charge of injecting the Envoy proxy into our application pods.

Next, we install two Helm charts. The first, named platform, creates the GatewayClass and Gateway objects, which create the Amazon VPC Lattice service network, as well as the Kyverno cluster policy used to inject the Envoy proxy into the application. The second, named demo, deploys the demo application, named demo-cluster1 in the first cluster and demo-cluster2 in the second, as shown in the following diagram.

When the Gateway API Controller finds the Gateway object created by the platform Helm chart, it creates an association between your VPC and the Amazon VPC Lattice service network. The VPC association can also include a security group that defines who within the VPC is allowed to make inbound requests to the service network. With additional conditions, you can further restrict who is allowed to consume the service network, for example by VPC identifier.

Although Amazon VPC Lattice can be used cross-account through AWS Resource Access Manager (AWS RAM), for simplicity this pattern relies on a single AWS account.

The Amazon VPC Lattice target group health checks originate from the Amazon VPC Lattice service. Amazon VPC Lattice has a dedicated managed prefix list that must be allowed into the EKS cluster security group, which the Terraform definition takes care of. If your Amazon EKS application needs to initiate connections into the service network, then you also need to update the VPC association security group to allow inbound traffic from the node group security group on the appropriate ports, which is also configured in Terraform.
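
If you need to reproduce this outside of the pattern, the following AWS CLI sketch shows the equivalent configuration, assuming the cluster security group ID is in $CLUSTER_SG_ID (the Region in the prefix list name is illustrative):

# Look up the Amazon VPC Lattice managed prefix list for the Region
PREFIX_LIST_ID=$(aws ec2 describe-managed-prefix-lists \
  --filters Name=prefix-list-name,Values=com.amazonaws.eu-west-1.vpc-lattice \
  --query 'PrefixLists[0].PrefixListId' --output text)

# Allow inbound traffic from Amazon VPC Lattice into the cluster security group
aws ec2 authorize-security-group-ingress \
  --group-id $CLUSTER_SG_ID \
  --ip-permissions IpProtocol=-1,PrefixListIds="[{PrefixListId=$PREFIX_LIST_ID}]"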

The demo Helm chart creates an HTTPRoute object containing the demo service definition and uses a custom domain name:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: demo-cluster2
  namespace: apps
spec:
  hostnames:
  - demo-cluster2.example.com
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: lattice-gateway
    namespace: lattice-gateway
    sectionName: http-listener
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: lattice-gateway
    namespace: lattice-gateway
    sectionName: https-listener-with-custom-domain
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: demo-cluster2-v1
      port: 80
      weight: 1
    matches:
    - path:
        type: PathPrefix
        value: /

The Gateway API Controller creates associated Amazon VPC Lattice resources (service, listeners, and target groups). With our deployment, it enables TLS termination for each service on the listeners, which forward the traffic to the associated target groups, respecting the route defined for each service. The target groups are set up as type IP and forward directly to the associated pods in the target Kubernetes service.

An Amazon VPC Lattice service can also be configured to spread requests across targets from different clusters.

Because we have defined a custom domain name, the Gateway API Controller also creates a Kubernetes DNSEndpoint object:

apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: demo-cluster1-dns
  namespace: apps
  ownerReferences:
  - apiVersion: gateway.networking.k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: HTTPRoute
    name: demo-cluster1
spec:
  endpoints:
  - dnsName: demo-cluster1.example.com
    recordTTL: 300
    recordType: CNAME
    targets:
    - demo-cluster1-apps-082dc3111b7018633.7d67968.vpc-lattice-svcs.eu-west-1.on.aws
status:
  observedGeneration: 1

External-DNS watches this object through its CRD source. We configure External-DNS to read from the DNSEndpoint custom resource definition by providing this configuration:

--source=crd --crd-source-apiversion=externaldns.k8s.io/v1alpha1 --crd-source-kind=DNSEndpoint

With this configuration, when the Gateway API Controller creates a DNSEndpoint object for an HTTPRoute that specifies a custom domain name, External-DNS creates a dedicated CNAME record in the Route 53 private hosted zone that routes traffic from the custom domain name to the Amazon VPC Lattice service endpoint. Thus, your internal services can be discovered and accessed through their internal domain names.

The Kubernetes Gateway object also contains the Amazon Resource Name (ARN) of the ACM certificate, which Amazon VPC Lattice uses to terminate the TLS session of your requests.
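
For illustration, the HTTPS listener of such a Gateway can look like the following sketch with the AWS Gateway API Controller (the certificate ARN is illustrative):

apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: lattice-gateway
  namespace: lattice-gateway
spec:
  gatewayClassName: amazon-vpc-lattice
  listeners:
  - name: https-listener-with-custom-domain
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - name: unused  # placeholder; the controller reads the ARN from the options below
      options:
        application-networking.k8s.aws/certificate-arn: arn:aws:acm:eu-west-1:111122223333:certificate/example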

The demo Helm chart also deploys an IAMAuthPolicy object associated with the HTTPRoute. It specifies that each request must be signed with the Sigv4 algorithm and defines ABAC rules: the demo-cluster2 application only allows requests from the apps namespace of the cluster1 EKS cluster, and the demo-cluster1 application only accepts requests from the apps namespace of the cluster2 EKS cluster.
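
For the demo-cluster1 application, the object can look like the following sketch (the metadata and target names follow the pattern's conventions, and the policy body matches the one shown at the end of this walkthrough):

apiVersion: application-networking.k8s.aws/v1alpha1
kind: IAMAuthPolicy
metadata:
  name: demo-cluster1-iam-auth-policy
  namespace: apps
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: demo-cluster1
  policy: |
    {
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::12345678910:root"},
        "Action": "vpc-lattice-svcs:Invoke",
        "Resource": "*",
        "Condition": {
          "StringEquals": {
            "aws:PrincipalTag/eks-cluster-name": "eks-cluster2",
            "aws:PrincipalTag/kubernetes-namespace": "apps"
          }
        }
      }]
    }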

We can also configure an access log subscription so that all requests to the Amazon VPC Lattice services are logged in an Amazon CloudWatch log group.
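
For reference, such a subscription can be created with the AWS CLI, as in the following sketch (the service network identifier and log group ARN are illustrative):

aws vpc-lattice create-access-log-subscription \
  --resource-identifier $SERVICE_NETWORK_ID \
  --destination-arn arn:aws:logs:eu-west-1:111122223333:log-group:/lattice/access-logs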

Deployment

You can follow the deployment instructions in the associated EKS Blueprints pattern.

As a quick setup, you can also execute the following:

# Clone EKS Blueprint repository
git clone https://github.com/aws-ia/terraform-aws-eks-blueprints.git
cd terraform-aws-eks-blueprints/patterns/vpc-lattice/cross-cluster-pod-communication

# Deploy the environment
cd environment
terraform init
terraform apply -auto-approve

# Deploy cluster 1
cd ../cluster
./deploy.sh cluster1
eval `terraform output -raw configure_kubectl`

# Deploy cluster 2
cd ../cluster
./deploy.sh cluster2
eval `terraform output -raw configure_kubectl`

Once all three stacks are deployed successfully, you can execute the following commands to validate that cross-cluster communication works as expected.

To do so, we exec into the pods and run a cURL command targeting the services over HTTP. As already explained, the request is routed to the Envoy sidecar, which signs it and forwards it over HTTPS using our AWS Private CA certificates:

1. From demo-cluster1 in cluster1, call demo-cluster2 -> success

$ kubectl --context eks-cluster1 exec -ti -n apps deployments/demo-cluster1-v1 \
  -c demo-cluster1-v1 -- curl demo-cluster2.example.com

Requesting to Pod(demo-cluster2-v1-c99c7bb69-2gm5f): Hello from demo-cluster2-v1

2. From demo-cluster2 in cluster2, call demo-cluster1 -> success

$ kubectl --context eks-cluster2 exec -ti -n apps deployments/demo-cluster2-v1 \
  -c demo-cluster2-v1 -- curl demo-cluster1.example.com

Requesting to Pod(demo-cluster1-v1-6d7558f5b4-zk5cg): Hello from demo-cluster1-v1

If we don't follow the authorized flow of the preceding commands, Amazon VPC Lattice rejects the request as unauthorized:

3. From demo-cluster1 in cluster1, call demo-cluster1 -> forbidden

$ kubectl --context eks-cluster1 exec -ti -n apps deployments/demo-cluster1-v1 \
  -c demo-cluster1-v1 -- curl demo-cluster1.example.com 

AccessDeniedException: User: arn:aws:sts::12345678910:assumed-role/vpc-lattice-sigv4-client/eks-eks-cluste-demo-clust-1b575f8d-fb77-486a-8a13-af5a2a0f78ae is not authorized to perform: vpc-lattice-svcs:Invoke on resource: arn:aws:vpc-lattice:eu-west-1:12345678910:service/svc-002349360ddc5a463/ because no service-based policy allows the vpc-lattice-svcs:Invoke action

4. From demo-cluster2 in cluster2, call demo-cluster2 -> forbidden

$ kubectl --context eks-cluster2 exec -ti -n apps deployments/demo-cluster2-v1 \
  -c demo-cluster2-v1 -- curl demo-cluster2.example.com

AccessDeniedException: User: arn:aws:sts::12345678910:assumed-role/vpc-lattice-sigv4-client/eks-eks-cluste-demo-clust-a5c2432b-b84a-492f-8cbc-16f1fa5053eb is not authorized to perform: vpc-lattice-svcs:Invoke on resource: arn:aws:vpc-lattice:eu-west-1:12345678910:service/svc-00b57f32ed0a7b7c3/ because no service-based policy allows the vpc-lattice-svcs:Invoke action

You can check how the IAMAuthPolicy is defined for the demo-cluster1 application:

kubectl --context eks-cluster1 get IAMAuthPolicy -n apps demo-cluster1-iam-auth-policy  -o json | jq ".spec.policy | fromjson"

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::12345678910:root"
      },
      "Action": "vpc-lattice-svcs:Invoke",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:PrincipalTag/eks-cluster-name": "eks-cluster2",
          "aws:PrincipalTag/kubernetes-namespace": "apps"
        }
      }
    }
  ]
}

We can confirm that only requests originating from the eks-cluster2 cluster in the apps namespace are allowed.

Cleaning up

To avoid incurring future charges, delete the resources by following the cleanup section of the EKS Blueprints pattern or by executing the following snippet:

  1. We start by deleting the cluster2 Terraform stack.

Note that we need to delete the stacks in this order so that the Kubernetes controllers can clean up external resources before the controllers and Kubernetes nodes are deleted.

./destroy.sh cluster2

  2. Then, we can delete the cluster1 Terraform stack.

./destroy.sh cluster1

  3. Finally, delete the environment Terraform stack.

cd ../environment
terraform destroy -auto-approve

Conclusion

With this solution, we have demonstrated how you can secure cross-EKS-cluster application communication using Amazon VPC Lattice, with an automated example that you can use as a reference to adapt to your own microservices applications.

The key benefits of this approach include the following:

  • Secure Communication: By using Amazon VPC Lattice, you can make sure that communication between your EKS clusters is encrypted in transit and protected by fine-grained IAM authorization policies. This helps maintain the security and integrity of your application data, even when it is being transmitted across different clusters or accounts.
  • Simplified Service Discovery: With the integration of the Gateway API Controller and External-DNS, you can expose your services using custom domain names, making it easier for other services to discover and communicate with them.
  • Scalability and Flexibility: Amazon VPC Lattice allows you to distribute your application across multiple VPCs and accounts, enabling you to scale your infrastructure as needed while maintaining secure connectivity between your components.
  • Automated Deployment: By using the EKS Blueprints for Terraform, you can automate the deployment and configuration of your EKS clusters, Amazon VPC Lattice resources, and other supporting services, reducing the risk of manual errors and keeping deployments consistent.
  • Reusability: The solution demonstrates how to use EKS Pod Identity and the Envoy proxy to enable secure communication without modifying your application code. This approach can be adapted to other applications, allowing you to reuse the same patterns and best practices across your organization.
  • Observability and Monitoring: By configuring access logs for Amazon VPC Lattice services, you can gain valuable insights into the traffic flowing between your clusters, enabling you to monitor and troubleshoot issues more effectively.

Overall, this solution provides a comprehensive and secure approach to enabling cross-cluster communication for your Amazon EKS-based applications, using the power of Amazon VPC Lattice and other AWS services. By following the patterns and best practices demonstrated in this example, you can build scalable, secure, and highly available microservices architectures that meet the demands of modern cloud-native applications.

Sebastien Allamand

Sébastien is a Senior Container Specialist Solution Architect with more than 15 years of experience in building production architectures that prioritize reliability, scalability, and operational efficiency. He is currently focusing on Infrastructure as Code, Distributed Systems, and optimizing developer workflows at Amazon Web Services.