
Deliver Namespace as a Service multi-tenancy for Amazon EKS using Karpenter

Introduction

Karpenter is an open-source, high-performance Kubernetes cluster autoscaler that automatically provisions new nodes in response to unschedulable pods. Customers choose Karpenter for many reasons, such as improving the efficiency and cost of running workloads in their clusters. Karpenter is configured through a custom resource called a Provisioner, which sets constraints on the nodes Karpenter can create and on the pods that can run on those nodes.
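
To make this concrete, here is a minimal Provisioner sketch showing the kinds of constraints it can express (the name and values are illustrative, and it assumes a default AWSNodeTemplate already exists):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: example
spec:
  providerRef:
    name: default                       # AWSNodeTemplate defining subnets/security groups
  requirements:
    - key: karpenter.sh/capacity-type   # well-known Karpenter label
      operator: In
      values: ["on-demand"]
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
  limits:
    resources:
      cpu: "100"                        # cap the total CPU this Provisioner may launch
  ttlSecondsAfterEmpty: 30              # remove empty nodes after 30 seconds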

Customers who are considering a multi-tenant Amazon Elastic Kubernetes Service (Amazon EKS) cluster want to share cluster resources across different teams (i.e., tenants). However, they still require node isolation for critical and highly regulated workloads. When deploying Karpenter on a multi-tenant Amazon EKS cluster, one challenge is that Karpenter doesn’t support namespaces as metadata on its own custom resources (both AWSNodeTemplate and Provisioner are cluster-scoped). This post shows how we used Karpenter to provision nodes and scale the cluster up and down per tenant without impacting other tenants.

Walkthrough

This solution uses the Open Policy Agent (OPA) Gatekeeper admission controller to enforce tolerations and node selectors on pods in specific namespaces, so that those pods land on the nodes Karpenter creates for their tenant.

In this example, we use the following scenario:

  • We have two tenants, TenantA and TenantB, that need to run deployments in different namespaces
  • TenantA will use a namespace named tenant-a and TenantB will use a namespace named tenant-b
  • TenantA workloads must run on nodes that are part of PoolA and TenantB workloads must run on nodes that are part of PoolB
  • As consumers of the Amazon EKS cluster, tenants are completely isolated and don’t need to change their pod specs; they only need to deploy into their assigned namespace

Prerequisites

Create two namespaces

Create two namespaces called tenant-a and tenant-b with the commands:

kubectl create ns tenant-a

kubectl create ns tenant-b

Confirm you have two newly created namespaces with the following command:

kubectl get ns | grep -i tenant

Create a default deny-all network policy

By default, pods aren’t isolated for ingress and egress traffic: all inbound and outbound connections are allowed. In a multi-tenant environment where users share the same Amazon EKS cluster, they require isolation between their namespaces, pods, or external services. Kubernetes NetworkPolicy lets you control traffic flow at the IP address and port level. See the multi-tenancy section of the Amazon EKS Best Practices Guide for guidance on building a multi-tenant EKS cluster.

The Amazon VPC CNI supports network policies natively starting from version 1.14 on Amazon EKS 1.25 or later. It integrates with the upstream Kubernetes NetworkPolicy Application Programming Interface (API), ensuring compatibility and adherence to Kubernetes standards, and you can define policies using the identifiers supported by the upstream API. As a best practice, network policies should follow the principle of least privilege: first create a deny-all policy that restricts all inbound and outbound traffic across namespaces, then explicitly allow the traffic you need, such as Domain Name System (DNS) queries. For more details, see the network security section of the Amazon EKS Best Practices Guide.
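
Note that the VPC CNI’s network policy enforcement isn’t turned on by default. If you use the managed VPC CNI add-on, one way to enable it is by passing the enableNetworkPolicy configuration value (a sketch, assuming the managed add-on and the AWS CLI; adjust to however you manage the add-on):

aws eks update-addon \
  --cluster-name <YOUR_EKS_CLUSTER_NAME> \
  --addon-name vpc-cni \
  --configuration-values '{"enableNetworkPolicy": "true"}'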

In this example, we use one network policy that denies all cross-namespace traffic and a second policy that allows DNS queries for service name resolution:

cat << EOF > deny-all.yaml 
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: deny-from-other-namespaces
spec:
  podSelector: {}
  ingress:
  - from:
    - podSelector: {}

EOF
kubectl create -f deny-all.yaml -n tenant-a

kubectl create -f deny-all.yaml -n tenant-b
cat << EOF > allow-dns-access.yaml 
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-access
spec:
  podSelector:
    matchLabels: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
EOF
kubectl create -f allow-dns-access.yaml -n tenant-a

kubectl create -f allow-dns-access.yaml -n tenant-b

Verify that the network policies are in place:

kubectl get networkpolicies -A

Install Karpenter and create the Provisioners for the node pool configuration

After installing Karpenter on the Amazon EKS cluster, create an AWSNodeTemplate and two Provisioner manifests as shown below. We create the node pools ourselves using a combination of taints/tolerations and node labels. Use the following schema when creating the Provisioners:

  • Nodes in PoolA will have:
    • A NoSchedule taint with key node-pool and value pool-a
    • A label with key node-pool and value pool-a
  • Nodes in PoolB will have:
    • A NoSchedule taint with key node-pool and value pool-b
    • A label with key node-pool and value pool-b

Create manifest default-awsnodetemplate.yaml:

export CLUSTER_NAME=<YOUR_EKS_CLUSTER_NAME>

cat << EOF >  default-awsnodetemplate.yaml
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: $CLUSTER_NAME
  securityGroupSelector:
    karpenter.sh/discovery: $CLUSTER_NAME
  instanceProfile: KarpenterNodeInstanceProfile-$CLUSTER_NAME
  tags:
    karpenter.sh/discovery: $CLUSTER_NAME
EOF

Create manifest called pool-a.yaml:

cat << 'EOF' > pool-a.yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: pool-a
spec:
  providerRef:
    name: default
  taints:
    - key: node-pool
      value: pool-a
      effect: NoSchedule
  labels:
    node-pool: pool-a
  ttlSecondsAfterEmpty: 30
EOF

Create manifest called pool-b.yaml:

cat << 'EOF' > pool-b.yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: pool-b
spec:
  providerRef:
    name: default
  taints:
    - key: node-pool
      value: pool-b
      effect: NoSchedule
  labels:
    node-pool: pool-b
  ttlSecondsAfterEmpty: 30
EOF

Save and apply the AWSNodeTemplate and the Provisioners to your cluster by running the following commands:

kubectl create -f default-awsnodetemplate.yaml

kubectl create -f pool-a.yaml


kubectl create -f pool-b.yaml

Deploy OPA Gatekeeper policies

Confirm OPA Gatekeeper is deployed and running in your cluster with this command:

kubectl get deployment -n gatekeeper-system
NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
gatekeeper-audit                1/1     1            1           24h
gatekeeper-controller-manager   3/3     3            3           24h
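
If those deployments aren’t present, Gatekeeper isn’t installed yet. One common way to install it is with its Helm chart (a sketch, assuming the upstream chart defaults):

helm repo add gatekeeper https://open-policy-agent.github.io/gatekeeper/charts
helm repo update
helm install gatekeeper gatekeeper/gatekeeper \
  --namespace gatekeeper-system --create-namespace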

Forcing deployments in a specific namespace onto the proper node pool

Using OPA Gatekeeper, we can force deployments onto the proper node pool based on their namespace. With an admission controller, we can mutate each pod to add a nodeSelector and tolerations to its spec. Using a nodeSelector still allows teams to define their own nodeAffinity to give Karpenter additional guidance on how to provision nodes. Rather than writing our own admission controller, we use OPA Gatekeeper and its mutation capability.
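
For example, a team deploying into tenant-a could still request on-demand capacity through node affinity in its own pod template, while the mutation takes care of the node-pool placement (a trimmed, illustrative snippet; karpenter.sh/capacity-type is a well-known Karpenter label):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                      # hypothetical tenant workload
  namespace: tenant-a
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: karpenter.sh/capacity-type
                operator: In
                values: ["on-demand"]
      containers:
      - name: api
        image: nginx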

Here are the Assign policies we use for PoolA and PoolB; apply the same pattern for each additional namespace (node pool).

Create a policy called nodepool-selector-pool-a:

cat << 'EOF' > nodepool-selector-pool-a.yaml
apiVersion: mutations.gatekeeper.sh/v1
kind: Assign
metadata:
  name: nodepool-selector-pool-a
spec:
  applyTo:
  - groups:
    - ""
    kinds:
    - Pod
    versions:
    - v1
  location: spec.nodeSelector
  match:
    kinds:
    - apiGroups:
      - '*'
      kinds:
      - Pod
    namespaces:
    - tenant-a
    scope: Namespaced
  parameters:
    assign:
      value:
        node-pool: pool-a
EOF

kubectl create -f nodepool-selector-pool-a.yaml 

Create a policy called nodepool-selector-pool-b:

cat << 'EOF' > nodepool-selector-pool-b.yaml
apiVersion: mutations.gatekeeper.sh/v1
kind: Assign
metadata:
  name: nodepool-selector-pool-b
spec:
  applyTo:
  - groups:
    - ""
    kinds:
    - Pod
    versions:
    - v1
  location: spec.nodeSelector
  match:
    kinds:
    - apiGroups:
      - '*'
      kinds:
      - Pod
    namespaces:
    - tenant-b
    scope: Namespaced
  parameters:
    assign:
      value:
        node-pool: pool-b
EOF

kubectl create -f nodepool-selector-pool-b.yaml 

The node selector alone isn’t enough: pods also need a toleration for the node-pool taint so that they can be scheduled onto the tainted worker nodes Karpenter creates. Create the toleration mutations using the manifests shown below:

cat << 'EOF' > nodepool-toleration-pool-a.yaml
apiVersion: mutations.gatekeeper.sh/v1
kind: Assign
metadata:
  name: nodepool-toleration-pool-a
spec:
  applyTo:
  - groups:
    - ""
    kinds:
    - Pod
    versions:
    - v1
  location: spec.tolerations
  match:
    kinds:
    - apiGroups:
      - '*'
      kinds:
      - Pod
    namespaces:
    - tenant-a
    scope: Namespaced
  parameters:
    assign:
      value:
      - key: node-pool
        operator: Equal
        value: pool-a
EOF

cat << 'EOF' > nodepool-toleration-pool-b.yaml
apiVersion: mutations.gatekeeper.sh/v1
kind: Assign
metadata:
  name: nodepool-toleration-pool-b
spec:
  applyTo:
  - groups:
    - ""
    kinds:
    - Pod
    versions:
    - v1
  location: spec.tolerations
  match:
    kinds:
    - apiGroups:
      - '*'
      kinds:
      - Pod
    namespaces:
    - tenant-b
    scope: Namespaced
  parameters:
    assign:
      value:
      - key: node-pool
        operator: Equal
        value: pool-b
        
EOF

kubectl create -f nodepool-toleration-pool-a.yaml

kubectl create -f nodepool-toleration-pool-b.yaml

Testing it out

Now that we have our node pools defined and the mutations in place, let’s create a deployment for each of our tenants and make sure everything is functioning.

Run the following commands to create the deployments and services:

kubectl create deployment nginx --image=nginx --replicas 3 -n tenant-a
kubectl expose deployment nginx --port=8080 --target-port=80 -n tenant-a
kubectl create deployment nginx --image=nginx --replicas 3 -n tenant-b
kubectl expose deployment nginx --port=8080 --target-port=80 -n tenant-b

As you can see, when we create a deployment in the tenant-a namespace, the nodeSelector and tolerations are added to the pod spec through OPA, and Karpenter provisions nodes in the matching pools:

kubectl get nodes -L node-pool
NAME                                          STATUS   ROLES    AGE   VERSION               NODE-POOL
...
ip-10-100-10-109.us-west-2.compute.internal   Ready    <none>   36m   v1.27.4-eks-8ccc7ba   pool-b
ip-10-100-22-46.us-west-2.compute.internal    Ready    <none>   65m   v1.27.4-eks-8ccc7ba   pool-a
...

In the following pod specification, note the node-selectors and tolerations:

kubectl describe pods nginx-55f598f8d-6q5fs -n tenant-a
Name:             nginx-55f598f8d-6q5fs
Namespace:        tenant-a
Priority:         0
Service Account:  default
Node:             ip-10-100-19-121.us-west-2.compute.internal/10.100.19.121
Start Time:       Sat, 23 Sep 2023 19:08:47 +0200
Labels:           app=nginx
                  pod-template-hash=55f598f8d
Annotations:      <none>
Status:           Running
IP:               10.100.16.70
IPs:
  IP:           10.100.16.70
Controlled By:  ReplicaSet/nginx-55f598f8d
Containers:
  nginx:
    Container ID:   containerd://f410c5f0979dffcac1e5299a18960f523aaabbb251969113ef1ff182a0711b24
    Image:          nginx
    Image ID:       docker.io/library/nginx@sha256:32da30332506740a2f7c34d5dc70467b7f14ec67d912703568daff790ab3f755
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Sat, 23 Sep 2023 19:09:12 +0200
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-k92mc (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-k92mc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              node-pool=pool-a
Tolerations:                 node-pool=pool-a
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s

Let’s confirm the new nodes are running with the following command:

kubectl get nodes -L node-pool | grep -i pool
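
You can also check pod placement per tenant; the NODE column should list only nodes from the matching pool:

kubectl get pods -n tenant-a -o wide

kubectl get pods -n tenant-b -o wide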

Finally, let’s make sure that cross-namespace ingress traffic is blocked:

kubectl create deployment curl-tenant-a --image=curlimages/curl:latest -n tenant-a -- sleep 3600
kubectl get pods -n tenant-a

NAME                    READY   STATUS    RESTARTS   AGE
curl-bdff4577d-rhcwv    1/1     Running   0          11m
nginx-55f598f8d-6q5fs   1/1     Running   0          14m
nginx-55f598f8d-cqjnj   1/1     Running   0          14m
nginx-55f598f8d-sx4d7   1/1     Running   0          14m

We use the curl command from tenant-a to reach the nginx service running in tenant-b:

kubectl exec -i -t -n tenant-a curl-bdff4577d-rhcwv -- curl -v --max-time 3 nginx.tenant-b.svc.cluster.local:8080
  Trying 172.20.81.215:8080...
Connection timed out after 3000 milliseconds
Closing connection
curl: (28) Connection timed out after 3000 milliseconds
command terminated with exit code 28

You should now see the proper worker nodes starting up for each tenant’s node pool, with tenants isolated from each other thanks to Karpenter and the network policies enforced by the VPC CNI.

Note: it might take a moment or two for the nodes to start up and join the cluster.
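
If you want to watch them join, you can follow the node list:

kubectl get nodes -L node-pool --watch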

Cleaning up

After you complete this experiment, delete the Kubernetes deployments and the respective resources.

kubectl delete -f default-awsnodetemplate.yaml

kubectl delete -f pool-a.yaml

kubectl delete -f pool-b.yaml
kubectl delete ns tenant-a

kubectl delete ns tenant-b
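
The Gatekeeper Assign mutations are cluster-scoped, so they aren’t removed with the namespaces; delete them explicitly as well:

kubectl delete -f nodepool-selector-pool-a.yaml

kubectl delete -f nodepool-selector-pool-b.yaml

kubectl delete -f nodepool-toleration-pool-a.yaml

kubectl delete -f nodepool-toleration-pool-b.yaml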

kubectl delete ns gatekeeper-system

Finally, delete your Amazon EKS cluster (the steps depend on how you created it).

Conclusion

In this post, we showed you how to use Karpenter together with an admission controller such as OPA Gatekeeper. We assigned node selectors and tolerations to our tenants’ pods that match the labels and taints Karpenter assigns to the nodes it creates through per-tenant Provisioners (i.e., one Provisioner per namespace). Combined with the network policies provided by the VPC CNI, this gives us a scalable multi-tenant environment on Amazon EKS in which each tenant’s workloads are isolated from the others.

Rachel Leekin

Rachel Leekin is a Container Specialist Solutions Architect at Amazon Web Services. She focuses on the financial services industry, where she helps customers adopt container solutions and modernize their workloads.

Michael (Mike) Masaaud

Michael (Mike) Masaaud is a Senior Solutions Architect at Amazon Web Services specializing in application modernization. He has extensive experience working directly with customers, on site and off, to accelerate their cloud adoption, modernize applications, optimize costs, and architect highly available and secure solutions using various AWS services.

Jihed Mselmi

Jihed Mselmi is a Senior Container Specialist Solutions Architect at Amazon Web Services. He helps customers modernize, scale, and adopt best practices for their containerized workloads, with a focus on the financial industry. Prior to AWS, Jihed worked at Red Hat, helping customers move workloads to Red Hat OpenShift.