Using mTLS with SPIFFE/SPIRE in AWS App Mesh on Amazon EKS

NOTICE: October 04, 2024 – This post no longer reflects the best guidance for configuring a service mesh with Amazon EKS and its examples no longer work as shown. Please refer to newer content on Amazon VPC Lattice.


By Efe Selcuk, Apurup Chevuru, and Michael Hausenblas

You know that here at AWS we consider security “job zero”, and in the context of the shared responsibility model we provide you with controls to take care of your part. One popular use case for service meshes is strengthening the security posture of your communication paths, something we’re focusing on in AWS App Mesh. The challenges of using mTLS safely and correctly have also been the subject of discussion amongst practitioners. In response to your request on the App Mesh roadmap for mutual TLS (mTLS), we’ve now launched support for this feature. In this blog post we explain the background of mTLS and walk you through an end-to-end example using an Amazon Elastic Kubernetes Service (EKS) cluster.

Background

If you’re not that familiar with mTLS, this section is for you; otherwise, you can skip ahead to the walkthrough.

The Secure Production Identity Framework for Everyone (SPIFFE) project, a Cloud Native Computing Foundation (CNCF) open source project with wide community support, provides fine-grained, dynamic workload identity management. Based on the SPIFFE reference implementation called SPIRE, you can assign and query a cryptographically strong and provable identity in any kind of distributed system. Note that SPIRE is not the only option in this space. You can, for example, use Kubernetes secrets as described in Using EKS encryption provider support for defense-in-depth for encryption; however, in the context of this post we focus on SPIRE.

A little bit of SPIFFE/SPIRE terminology to get everyone on the same page:

A workload is a piece of software deployed with a particular configuration, for example a microservice packaged and delivered as a container.
The workload is defined in the context of a trust domain, such as a cluster or an entire company network.
The SPIFFE ID represents the identity of a workload in the form spiffe://trust-domain/workload-identifier.
A SPIFFE Verifiable Identity Document (SVID) is the document with which a workload proves its identity. It is considered valid if it has been signed by an authority within the trust domain. A common example of an SVID is an X.509 certificate.
The SPIFFE workload API provides a platform-agnostic way to identify services, akin to what the AWS EC2 Instance Metadata API provides in an AWS-specific way.
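
To make the SPIFFE ID format concrete, here is a minimal shell sketch that splits an ID into its trust domain and workload path. The helper name parse_spiffe_id is hypothetical and not part of SPIFFE or SPIRE, and the sample ID reuses the trust domain from the walkthrough in this post.

```shell
# parse_spiffe_id is a hypothetical helper for illustration; it is not part of SPIRE.
# It splits spiffe://trust-domain/workload-identifier into its two parts.
parse_spiffe_id() (
  id="$1"
  case "$id" in
    spiffe://?*) ;;
    *) echo "not a SPIFFE ID: $id" >&2; exit 1 ;;
  esac
  rest="${id#spiffe://}"        # strip the scheme
  trust_domain="${rest%%/*}"    # everything before the first slash
  case "$rest" in
    */*) path="/${rest#*/}" ;;  # everything after the first slash
    *)   path="" ;;
  esac
  [ -n "$trust_domain" ] || { echo "empty trust domain: $id" >&2; exit 1; }
  printf 'trust_domain=%s\npath=%s\n' "$trust_domain" "$path"
)

parse_spiffe_id "spiffe://howto-k8s-mtls-sds-based.aws/front"
```

The function body runs in a subshell, so its variables do not leak into the calling shell and a malformed ID only produces a non-zero return status.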

To learn more check out the video Introduction to SPIFFE and SPIRE Projects by Evan Gilman which, in less than 10 minutes, explains how all these things play together.

mTLS in App Mesh

The general setup in the context of App Mesh looks as follows:

Figure: mTLS in App Mesh

In the data plane, App Mesh uses Envoy as a proxy that intercepts any kind of traffic. With mTLS enabled, the communication between the Envoy proxies is authenticated using TLS [1], whereas the communication between a service and its Envoy proxy is plain-text [2].

You can use mTLS authentication for all protocols supported by AWS App Mesh, including L4/TCP, HTTP (1.1/2), and gRPC. We support mTLS in two modes:

PERMISSIVE mode for the TLS configuration on the server endpoint, allowing plain-text traffic to connect to the endpoint. This is mainly relevant for migration scenarios and we come back to this at the end of this post.
STRICT mode forces encrypted traffic and should be considered the default going forward.

App Mesh supports two certificate sources for mutual TLS authentication: the certificates and the server validation in a listener TLS configuration can be sourced either from the local file system of the Envoy proxy or, via SPIRE, from Envoy’s Secret Discovery Service (SDS) API. Note that App Mesh stores any sensitive data used for mTLS authentication in memory only.

Let’s consider a concrete usage scenario: take the case of an application that handles consumer payments and may have as one of its requirements to be Payment Card Industry Data Security Standard (PCI DSS) compliant. With mTLS, you can now tick that box and leave the heavy lifting to us.

Now that we understand why mTLS is beneficial and how it works at a high level in the context of App Mesh, let’s move on to a concrete example.

An mTLS walkthrough

As preparation, clone the aws-app-mesh-examples.git repo; the following setup is based on the howto-k8s-mtls-sds-based walkthrough. Make sure you have the environment variables AWS_ACCOUNT_ID and AWS_DEFAULT_REGION set, since they are needed later on to build and push the container images for the example app to ECR. Further, make sure you have Docker running.
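
A small preflight check can catch a missing variable before the image build fails halfway through. The following is only a sketch of how you might do this; require_env is a hypothetical helper and not part of the examples repo.

```shell
# require_env is a hypothetical helper: it reports every named variable
# that is unset or empty and returns non-zero if any is missing.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

if require_env AWS_ACCOUNT_ID AWS_DEFAULT_REGION; then
  echo "environment looks complete"
else
  echo "set the variables listed above before building the images" >&2
fi
```

In the same spirit, running docker info before the deployment step tells you whether the Docker daemon is reachable.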

First, create an EKS cluster that is App Mesh-enabled, using the eks-cluster-config.yaml config file as follows:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: mtls-demo
  region: eu-west-1
  version: '1.18'
iam:
  withOIDC: true
  serviceAccounts:
  - metadata:
      name: appmesh-controller
      namespace: appmesh-system
      labels: {aws-usage: "application"}
    attachPolicyARNs:
    - "arn:aws:iam::aws:policy/AWSAppMeshFullAccess"
managedNodeGroups:
- name: default-ng
  minSize: 1
  maxSize: 3
  desiredCapacity: 2
  labels: {role: mngworker}
  iam:
    withAddonPolicies:
      certManager: true
      cloudWatch: true
      appMesh: true
cloudWatch:
  clusterLogging:
    enableTypes: ["*"]

Execute the following command to create the EKS cluster:

$ eksctl create cluster -f eks-cluster-config.yaml
[ℹ]  eksctl version 0.34.0
[ℹ]  using region eu-west-1
...
[✔]  EKS cluster "mtls-demo" in "eu-west-1" region is ready

Next, install the App Mesh controller using the commands shown below.

First we get the CRDs in place:

helm repo add eks https://aws.github.io/eks-charts

kubectl apply -k "https://github.com/aws/eks-charts/stable/appmesh-controller/crds?ref=master"

Note that if you already have the Helm repo configured, you should run helm repo update before you apply the CRDs.

Verify the installation:

$ kubectl api-resources --api-group=appmesh.k8s.aws -o wide 
NAME              SHORTNAMES   APIGROUP          NAMESPACED   KIND             VERBS
gatewayroutes                  appmesh.k8s.aws   true         GatewayRoute     [delete deletecollection get list patch create update watch]
meshes                         appmesh.k8s.aws   false        Mesh             [delete deletecollection get list patch create update watch]
virtualgateways                appmesh.k8s.aws   true         VirtualGateway   [delete deletecollection get list patch create update watch]
virtualnodes                   appmesh.k8s.aws   true         VirtualNode      [delete deletecollection get list patch create update watch]
virtualrouters                 appmesh.k8s.aws   true         VirtualRouter    [delete deletecollection get list patch create update watch]
virtualservices                appmesh.k8s.aws   true         VirtualService   [delete deletecollection get list patch create update watch]

Now, install the Kubernetes controller for App Mesh itself:

$ helm upgrade -i appmesh-controller eks/appmesh-controller \
               --namespace appmesh-system \
               --set region=eu-west-1 \
               --set serviceAccount.create=false \
               --set serviceAccount.name=appmesh-controller \
               --set sds.enabled=true
Release "appmesh-controller" does not exist. Installing it now.
NAME: appmesh-controller
LAST DEPLOYED: Wed Feb 10 11:44:03 2021
NAMESPACE: appmesh-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
AWS App Mesh controller installed!

You can, optionally, verify the installed controller version (should be v1.3.x or above) with:

kubectl -n appmesh-system get deployment appmesh-controller -o json |
        jq -r ".spec.template.spec.containers[].image" |
        cut -f2 -d ':'
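
If you want to turn that check into a pass/fail result, for example in a setup script, a version comparison along the following lines may help. This is a sketch under assumptions: version_at_least is a hypothetical helper, the tag value is a sample, and sort -V (version sort) is assumed to be available, as it is in GNU coreutils.

```shell
# version_at_least MIN ACTUAL succeeds when ACTUAL >= MIN.
# A leading "v" is stripped from both arguments before comparing.
version_at_least() {
  min="${1#v}"
  actual="${2#v}"
  # Version-sort both values; if MIN sorts first (or they are equal), ACTUAL is new enough.
  lowest="$(printf '%s\n%s\n' "$min" "$actual" | sort -V | head -n 1)"
  [ "$lowest" = "$min" ]
}

tag="v1.3.0"   # sample value; in practice, the image tag printed by the command above
if version_at_least v1.3.0 "$tag"; then
  echo "controller $tag meets the v1.3.0 minimum"
else
  echo "controller $tag is older than v1.3.0" >&2
fi
```

A version sort is used rather than a plain string comparison so that, for instance, v1.10.0 is correctly treated as newer than v1.3.0.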

Next, install the SPIRE server (as a stateful set) and the SPIRE agents (as a daemon set, one per worker node), with the pre-configured trust domain howto-k8s-mtls-sds-based.aws:

kubectl apply -f https://raw.githubusercontent.com/aws/aws-app-mesh-examples/master/walkthroughs/howto-k8s-mtls-sds-based/spire/spire_setup.yaml

Note that we also maintain Helm charts tailored for single cluster scenarios that you can use to set up your SPIRE installation.

Next, verify the SPIRE setup, that is, make sure that all pods are up and running:

$ kubectl -n spire get all
NAME                    READY   STATUS    RESTARTS   AGE
pod/spire-agent-gs2wp   1/1     Running   0          43s
pod/spire-agent-hwcbz   1/1     Running   0          43s
pod/spire-server-0      1/1     Running   0          44s

NAME                   TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
service/spire-server   NodePort   10.100.14.174   <none>        8081:31939/TCP   43s

NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/spire-agent   2         2         2       2            2           <none>          43s

NAME                            READY   AGE
statefulset.apps/spire-server   1/1     44s

Now we can register agents and workloads, using the helper script register_server_entries.sh:

$ ./register_server_entries.sh register
Registering an entry for spire agent...
Entry ID      : 9dfa0073-2c11-427b-ad6b-dacae74b5b9d
SPIFFE ID     : spiffe://howto-k8s-mtls-sds-based.aws/ns/spire/sa/spire-agent
Parent ID     : spiffe://howto-k8s-mtls-sds-based.aws/spire/server
TTL           : 3600
Selector      : k8s_sat:cluster:k8s-cluster
Selector      : k8s_sat:agent_ns:spire
Selector      : k8s_sat:agent_sa:spire-agent

Registering an entry for the front app...
Entry ID      : 4a2310cb-a16a-4105-afae-e39d8872e5ba
SPIFFE ID     : spiffe://howto-k8s-mtls-sds-based.aws/front
Parent ID     : spiffe://howto-k8s-mtls-sds-based.aws/ns/spire/sa/spire-agent
TTL           : 3600
Selector      : k8s:ns:howto-k8s-mtls-sds-based
Selector      : k8s:sa:default
Selector      : k8s:pod-label:app:front
Selector      : k8s:container-name:envoy

...

Note that you can list the registered entries at any time using the following command:

kubectl exec -n spire spire-server-0 \
             -c spire-server -- \
             /opt/spire/bin/spire-server entry show
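
When the list grows, you may only care about the SPIFFE IDs. The following sketch filters them out of an entry listing; list_spiffe_ids is a hypothetical helper, and it assumes the listing uses the same "Field : value" layout as the registration output shown earlier.

```shell
# list_spiffe_ids is a hypothetical helper: it reads entry listings on stdin
# and prints the value of every "SPIFFE ID" field.
list_spiffe_ids() {
  # Split each line on the " : " separator (with any amount of padding before it).
  awk -F ' *: ' '$1 == "SPIFFE ID" { print $2 }'
}

# Sample input copied from the registration output above. Against a live
# cluster you would instead pipe the kubectl exec command into list_spiffe_ids.
list_spiffe_ids <<'EOF'
Entry ID      : 4a2310cb-a16a-4105-afae-e39d8872e5ba
SPIFFE ID     : spiffe://howto-k8s-mtls-sds-based.aws/front
Parent ID     : spiffe://howto-k8s-mtls-sds-based.aws/ns/spire/sa/spire-agent
TTL           : 3600
Selector      : k8s:ns:howto-k8s-mtls-sds-based
EOF
```

Matching on the exact field name keeps the Parent ID lines, which also contain SPIFFE IDs, out of the result.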

Finally, we deploy an example app to test the connectivity, using the helper script deploy_app.sh:

$ ./deploy.sh
CRD check passed!
aws-app-mesh-controller check passed! v1.3.0 >= v1.3.0
deploy images...
Login Succeeded
Sending build context to Docker daemon  3.584kB
...
7f03bfe4d6dc: Pushed
latest: digest: sha256:c2ea478c3ca7d1b6ade35f9639d257d8e3be831d1e41de20e1e959a945cd74ca size: 2631
...
namespace/howto-k8s-mtls-sds-based created
mesh.appmesh.k8s.aws/howto-k8s-mtls-sds-based created
...
service/color-red created
deployment.apps/red created
service/color created

Verify that all the pods are up and running as well as the custom resources our App Mesh controller looks after; it should look something like this:

$ kubectl -n howto-k8s-mtls-sds-based get all
NAME                         READY   STATUS    RESTARTS   AGE
pod/blue-6f7c4d4757-qz6cq    2/2     Running   0          28m
pod/front-74c86557b6-wpg2l   2/2     Running   0          28m
pod/green-65677456f5-66c6l   2/2     Running   0          28m
pod/red-849ffcbd75-84q5w     2/2     Running   0          28m

NAME                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/color         ClusterIP   10.100.119.181   <none>        8080/TCP   28m
service/color-blue    ClusterIP   10.100.248.144   <none>        8080/TCP   28m
service/color-green   ClusterIP   10.100.97.188    <none>        8080/TCP   28m
service/color-red     ClusterIP   10.100.168.32    <none>        8080/TCP   28m
service/front         ClusterIP   10.100.33.255    <none>        8080/TCP   28m

NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/blue    1/1     1            1           28m
deployment.apps/front   1/1     1            1           28m
deployment.apps/green   1/1     1            1           28m
deployment.apps/red     1/1     1            1           28m

NAME                               DESIRED   CURRENT   READY   AGE
replicaset.apps/blue-6f7c4d4757    1         1         1       28m
replicaset.apps/front-74c86557b6   1         1         1       28m
replicaset.apps/green-65677456f5   1         1         1       28m
replicaset.apps/red-849ffcbd75     1         1         1       28m

NAME                                  ARN                                                                                                                 AGE
virtualrouter.appmesh.k8s.aws/color   arn:aws:appmesh:eu-west-1:123456789012:mesh/howto-k8s-mtls-sds-based/virtualRouter/color_howto-k8s-mtls-sds-based   28m

NAME                                   ARN                                                                                                                                    AGE
virtualservice.appmesh.k8s.aws/color   arn:aws:appmesh:eu-west-1:123456789012:mesh/howto-k8s-mtls-sds-based/virtualService/color.howto-k8s-mtls-sds-based.svc.cluster.local   28m

NAME                                ARN                                                                                                               AGE
virtualnode.appmesh.k8s.aws/blue    arn:aws:appmesh:eu-west-1:123456789012:mesh/howto-k8s-mtls-sds-based/virtualNode/blue_howto-k8s-mtls-sds-based    28m
virtualnode.appmesh.k8s.aws/front   arn:aws:appmesh:eu-west-1:123456789012:mesh/howto-k8s-mtls-sds-based/virtualNode/front_howto-k8s-mtls-sds-based   28m
virtualnode.appmesh.k8s.aws/green   arn:aws:appmesh:eu-west-1:123456789012:mesh/howto-k8s-mtls-sds-based/virtualNode/green_howto-k8s-mtls-sds-based   28m
virtualnode.appmesh.k8s.aws/red     arn:aws:appmesh:eu-west-1:123456789012:mesh/howto-k8s-mtls-sds-based/virtualNode/red_howto-k8s-mtls-sds-based     28m

And now we can check mTLS:

$ kubectl -n default run -it --rm curler --image=tutum/curl /bin/bash
# first we try the path via the front-end that is secured (TLS enabled):
root@curler:/# curl -H "color_header: blue" front.howto-k8s-mtls-sds-based.svc.cluster.local:8080; echo;
blue
# now we directly try to access the blue service (should fail because of strict mode):
root@curler:/# curl -k https://color-blue.howto-k8s-mtls-sds-based.svc.cluster.local:8080 -v
* Rebuilt URL to: https://color-blue.howto-k8s-mtls-sds-based.svc.cluster.local:8080/
* Hostname was NOT found in DNS cache
*   Trying 10.100.50.219...
* Connected to color-blue.howto-k8s-mtls-sds-based.svc.cluster.local (10.100.50.219) port 8080 (#0)
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Server key exchange (12):
* SSLv3, TLS handshake, Request CERT (13):
* SSLv3, TLS handshake, Server finished (14):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Client key exchange (16):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSLv3, TLS alert, Server hello (2):
* error:14094410:SSL routines:SSL3_READ_BYTES:sslv3 alert handshake failure
* Closing connection 0
curl: (35) error:14094410:SSL routines:SSL3_READ_BYTES:sslv3 alert handshake failure

That’s it! You can also view the status in the App Mesh console where you should see something like this (annotated):

Figure: mTLS in the App Mesh console

To clean up you can use one or more of the following commands:

# get rid of the example:
kubectl delete ns howto-k8s-mtls-sds-based

# ... and/or get rid of SPIRE:
kubectl delete ns spire

# ... and/or get rid of the entire setup (App Mesh and EKS cluster):
eksctl delete cluster --region=eu-west-1 --name=mtls-demo

Usage considerations

As you’ve seen from the walkthrough above, using the new mTLS feature of App Mesh in the context of EKS is straightforward. This is partly due to the controller we developed and partly due to SPIRE taking care of the heavy lifting of workload identity management.

SPIRE issues short-lived certificates, with a default of one hour, and automatically renews them in advance of expiry, also called auto-rotation. The certificates are pushed to the Envoy proxies by the SPIRE agents.

Some further usage considerations for mTLS in the context of App Mesh on EKS:

Plan ahead and consider how to migrate existing (not encrypted) workloads.
In the walkthrough above we’ve shown a simple scenario with self-signed certificates; however, you can and likely want to use a Certificate Authority (CA), for example AWS Certificate Manager (ACM).
When using SPIRE in the context of EKS on Fargate, note that you cannot use the solution shown above, as Kubernetes daemonsets are not yet supported on this compute engine.
For more (related) hands-on walkthroughs check out the App Mesh examples repo.

Let us know your experience with this new App Mesh security feature and share feedback and suggestions via our roadmap.

Efe Selcuk

Efe is a Software Development Engineer (SDE) in the container service team, working on Amazon EKS.

Apurup Chevuru

Apurup is a Software Development Engineer (SDE) in the container service team, working on Amazon EKS.
