Scaling Kubernetes with Karpenter: Advanced Scheduling with Pod Affinity and Volume Topology Awareness

This post was co-written by Lukonde Mwila, Principal Technical Evangelist at SUSE, an AWS Container Hero, and a HashiCorp Ambassador.

Introduction

Cloud-native technologies are becoming increasingly ubiquitous, and Kubernetes is at the forefront of this movement. Today, Kubernetes is seeing widespread adoption across organizations in a variety of different industries. When implemented properly, Kubernetes can help these organizations achieve higher availability, scalability, and resiliency for their workloads. Combining Kubernetes with the attributes of cloud computing—such as unparalleled scalability and elasticity—can help organizations enhance their containerized applications’ resiliency and availability.

As detailed in this introductory post, Karpenter‘s objective is to make sure that your cluster’s workloads have the compute they need, no more and no less, right when they need it.

In its most recent updates, Karpenter added support for more advanced scheduling constraints, such as pod affinity and anti-affinity, topology spread, node affinity, node selection, and resource requests. This post will specifically delve into podAffinity, podAntiAffinity, and volume topology awareness and elaborate on the use cases that they’re best suited for.

Prerequisites

To carry out the examples in this post, you need to have Karpenter installed in a Kubernetes cluster in AWS. We’ll be making use of Amazon EKS for demonstrative purposes. You can automate the process of provisioning an EKS cluster, with Karpenter as an add-on, by making use of the Terraform EKS blueprints.

Pod affinity and pod anti-affinity scheduling

Applying scheduling constraints to pods is implemented by establishing relationships between pods and specific nodes or between pods themselves. The latter is known as inter-pod affinity. Using inter-pod affinity, you assign rules that inform the scheduler’s approach in deciding which pod goes to which node based on their relation to other pods. Inter-pod affinity includes both pod affinity and pod anti-affinity.

Like node affinity, this can be done using the rules requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution depending on your requirements. As the names imply, required and preferred are terms that represent how hard or soft the scheduling constraints should be. If the scheduling criteria for a pod are set to the required rule, then Kubernetes ensures the pod is placed on a node that satisfies this. Similarly, pods that contain the preferred rule are scheduled to nodes that match the highest preference.

Pod affinity: The podAffinity rule informs the scheduler to match pods that relate to each other based on their labels. If a new pod is created, then the scheduler takes care of searching the nodes for pods that match the label specification of the new pod’s label selector.

Pod anti-affinity: In contrast, the podAntiAffinity rule allows you to prevent certain pods from running on the same node if the matching label criteria are met.

These rules can be particularly helpful in various scenarios. For example, podAffinity can be beneficial for pods to co-locate each other in the same AZ or node to support any inter-dependencies and reduce network latency between services. On the other hand, podAntiAffinity is typically useful for preventing a single point of failure by spreading pods across AZs or nodes for high availability (HA). For such use cases, the recommended topology spread constraint for anti-affinity can be zonal or hostname. This can be implemented using the topologyKey property which determines the searching scope of the cluster nodes. The topologyKey is a key of a label attached to a node.

An example of a podAntiAffinity implementation would be the CoreDNS Deployment. Its Deployment resource has the podAntiAffinity policy to ensure that the scheduler runs the CoreDNS pods on different nodes for HA and to avoid VPC DNS throttling. You’ll notice that the Deployment’s anti-affinity topologyKey is set to the hostname. In addition to this, podAntiAffinity can be used to give a pod or set of pods resource isolation on exclusive nodes, as well as mitigating the risk of some pods interfering with the performance of others.

Using Karpenter allows you to make sure that new compute provisioned for your cluster will satisfy these pod affinity rules as workloads scale, without configuring additional infrastructure. Karpenter tracks unscheduled pods and will provision compute resources in accordance with the required or preferred affinity rules defined in your resource manifests.

Pod affinity example with Karpenter

In this example, you’ll create a deployment resource with a podAffinity rule that requires scheduling the pods on nodes in the same AZ (availability zone). In the process, Karpenter will interpret the requirements of the pods that need to be scheduled and provision nodes that allow for these affinity rules to be met in an optimal way.

As a starting point, you’ll need to install the Karpenter Provisioner on your cluster. The Provisioner is a CRD that details configuration specifications and parameters such as node types, labels, taints, tags, customer kubelet configurations, resource limits and cluster connections via subnet and security group associations. The Provisioner manifest used in this example can be seen below.

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Requirements that constrain the parameters of provisioned nodes.
  # These requirements are combined with pod.spec.affinity.nodeAffinity rules.
  # Operators { In, NotIn } are supported to enable including or excluding values
  requirements:
    - key: "karpenter.sh/capacity-type" # If not included, the webhook for the AWS cloud provider will default to on-demand
      operator: In
      values: ["spot", "on-demand"]
  # Resource limits constrain the total size of the cluster.
  # Limits prevent Karpenter from creating new instances once the limit is exceeded.
  limits:
    resources:
      cpu: 1000 
      memory: 1000Gi
  provider:
    subnetSelector:
      karpenter.sh/discovery: alpha
    securityGroupSelector:
      karpenter.sh/discovery: alpha
    tags:
      karpenter.sh/discovery: alpha
  ttlSecondsAfterEmpty: 30

You can start by fetching all the nodes in your cluster using the kubectl get nodes command in your terminal. This will give you an idea of the existing nodes before Karpenter launches new ones in response to the application you’ll deploy to the cluster shortly.

NAME                                       STATUS   ROLES    AGE    VERSION
ip-10-0-0-126.eu-west-1.compute.internal   Ready    <none>   2d2h   v1.21.12-eks-5308cf7
ip-10-0-0-193.eu-west-1.compute.internal   Ready    <none>   2d2h   v1.21.12-eks-5308cf7
ip-10-0-0-35.eu-west-1.compute.internal    Ready    <none>   47h    v1.21.12-eks-5308cf7
ip-10-0-1-12.eu-west-1.compute.internal    Ready    <none>   2d2h   v1.21.12-eks-5308cf7
ip-10-0-1-73.eu-west-1.compute.internal    Ready    <none>   2d2h   v1.21.12-eks-5308cf7
ip-10-0-3-30.eu-west-1.compute.internal    Ready    <none>   2d2h   v1.21.12-eks-5308cf7

After that, you can proceed to create a deployment resource with the following manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 8
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - inflate
            topologyKey: "topology.kubernetes.io/zone"
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              cpu: 1

Karpenter will detect the unscheduled pods and provision a node that will help fulfill the inter-pod affinity requirements of this deployment:

NAME                                       STATUS   ROLES    AGE    VERSION
ip-10-0-0-126.eu-west-1.compute.internal   Ready    <none>   2d3h   v1.21.12-eks-5308cf7
ip-10-0-0-193.eu-west-1.compute.internal   Ready    <none>   2d3h   v1.21.12-eks-5308cf7
ip-10-0-0-35.eu-west-1.compute.internal    Ready    <none>   2d     v1.21.12-eks-5308cf7
ip-10-0-1-12.eu-west-1.compute.internal    Ready    <none>   2d3h   v1.21.12-eks-5308cf7
ip-10-0-1-233.eu-west-1.compute.internal   Ready    <none>   32m    v1.21.12-eks-5308cf7 // New node
ip-10-0-1-73.eu-west-1.compute.internal    Ready    <none>   2d3h   v1.21.12-eks-5308cf7
ip-10-0-3-30.eu-west-1.compute.internal    Ready    <none>   2d3h   v1.21.12-eks-5308cf7

The newly created node has the hostname ip-10-0-1-233.eu-west-1.compute.internal.

Here is a partial description of the new node:

Name:               ip-10-0-1-233.eu-west-1.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=c6i.2xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=eu-west-1
                    failure-domain.beta.kubernetes.io/zone=eu-west-1b
                    karpenter.sh/capacity-type=spot
                    karpenter.sh/provisioner-name=default
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-1-233.eu-west-1.compute.internal
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=c6i.2xlarge
                    topology.ebs.csi.aws.com/zone=eu-west-1b
                    topology.kubernetes.io/region=eu-west-1
                    topology.kubernetes.io/zone=eu-west-1b
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-00725be7dfa8ef814"}
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
...

We can then fetch the relevant pods using the appropriate label, app=inflate in this case, to review how the pods have been scheduled.

kubectl get pods -l app=inflate -o wide
NAME                       READY   STATUS    RESTARTS   AGE    IP           NODE                                       NOMINATED NODE   READINESS GATES
inflate-588d96b7f7-2lmzb   1/1     Running   0          104s   10.0.1.197   ip-10-0-1-233.eu-west-1.compute.internal   <none>           <none>
inflate-588d96b7f7-gv68n   1/1     Running   0          104s   10.0.1.11    ip-10-0-1-233.eu-west-1.compute.internal   <none>           <none>
inflate-588d96b7f7-jffqh   1/1     Running   0          104s   10.0.1.248   ip-10-0-1-233.eu-west-1.compute.internal   <none>           <none>
inflate-588d96b7f7-ktjrg   1/1     Running   0          104s   10.0.1.81    ip-10-0-1-12.eu-west-1.compute.internal    <none>           <none>
inflate-588d96b7f7-mhjrl   1/1     Running   0          104s   10.0.1.133   ip-10-0-1-233.eu-west-1.compute.internal   <none>           <none>
inflate-588d96b7f7-vhjl7   1/1     Running   0          104s   10.0.1.21    ip-10-0-1-233.eu-west-1.compute.internal   <none>           <none>
inflate-588d96b7f7-zb7l8   1/1     Running   0          104s   10.0.1.18    ip-10-0-1-73.eu-west-1.compute.internal    <none>           <none>
inflate-588d96b7f7-zz2g4   1/1     Running   0          104s   10.0.1.207   ip-10-0-1-233.eu-west-1.compute.internal   <none>           <none>

As you can see, six of the pods have been scheduled on the new node ip-10-0-1-233.eu-west-1.compute.internal, whereas the other two have been scheduled to nodes in the same AZ (eu-west-1b) as per the topologyKey of the podAffinity rule. Since these nodes are all in the same AZ, they are part of the same topology, thus meeting the scheduling requirements.

Pod anti-affinity example with Karpenter

In the second example, you’ll apply a podAntiAffinity rule to preferably schedule the pods across different nodes in the cluster, based on their AZs. As before, Karpenter reads the pod requirements and launch nodes that support the podAntiAffinity configurations.

Similar to the previous example, start by fetching all the nodes in the cluster:

NAME                                       STATUS   ROLES    AGE    VERSION
ip-10-0-0-126.eu-west-1.compute.internal   Ready    <none>   2d2h   v1.21.12-eks-5308cf7
ip-10-0-0-193.eu-west-1.compute.internal   Ready    <none>   2d2h   v1.21.12-eks-5308cf7
ip-10-0-0-35.eu-west-1.compute.internal    Ready    <none>   47h    v1.21.12-eks-5308cf7
ip-10-0-1-12.eu-west-1.compute.internal    Ready    <none>   2d2h   v1.21.12-eks-5308cf7
ip-10-0-1-73.eu-west-1.compute.internal    Ready    <none>   2d2h   v1.21.12-eks-5308cf7
ip-10-0-3-30.eu-west-1.compute.internal    Ready    <none>   2d2h   v1.21.12-eks-5308cf7

After that, you can proceed to create a deployment resource.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 8
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 50
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - inflate
              topologyKey: "topology.kubernetes.io/zone"
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              cpu: 1

In response to the pod requirements, Karpenter will provision a new node:

NAME                                       STATUS   ROLES    AGE    VERSION
ip-10-0-0-126.eu-west-1.compute.internal   Ready    <none>   2d3h   v1.21.12-eks-5308cf7
ip-10-0-0-193.eu-west-1.compute.internal   Ready    <none>   2d3h   v1.21.12-eks-5308cf7
ip-10-0-0-35.eu-west-1.compute.internal    Ready    <none>   2d     v1.21.12-eks-5308cf7
ip-10-0-1-12.eu-west-1.compute.internal    Ready    <none>   2d3h   v1.21.12-eks-5308cf7
ip-10-0-1-69.eu-west-1.compute.internal    Ready    <none>   73s    v1.21.12-eks-5308cf7
ip-10-0-1-73.eu-west-1.compute.internal    Ready    <none>   2d3h   v1.21.12-eks-5308cf7
ip-10-0-3-30.eu-west-1.compute.internal    Ready    <none>   2d3h   v1.21.12-eks-5308cf7

The newly created node has the hostname ip-10-0-1-69.eu-west-1.compute.internal.

Here is a partial description of the new node:

Name:               ip-10-0-1-69.eu-west-1.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=c5.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=eu-west-1
                    failure-domain.beta.kubernetes.io/zone=eu-west-1b
                    karpenter.sh/capacity-type=spot
                    karpenter.sh/provisioner-name=default
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-1-69.eu-west-1.compute.internal
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=c5.xlarge
                    topology.ebs.csi.aws.com/zone=eu-west-1b
                    topology.kubernetes.io/region=eu-west-1
                    topology.kubernetes.io/zone=eu-west-1b
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0617fbc688949e367"}
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
...

As before, fetch the pods with the relevant label:

kubectl get pods -l app=inflate -o wide
NAME                       READY   STATUS    RESTARTS   AGE    IP           NODE                                       NOMINATED NODE   READINESS GATES
inflate-54cd576f79-88rt4   1/1     Running   0          101s   10.0.0.128   ip-10-0-0-35.eu-west-1.compute.internal    <none>           <none>
inflate-54cd576f79-bpcc7   1/1     Running   0          101s   10.0.0.49    ip-10-0-0-126.eu-west-1.compute.internal   <none>           <none>
inflate-54cd576f79-jdcfc   1/1     Running   0          101s   10.0.1.247   ip-10-0-1-12.eu-west-1.compute.internal    <none>           <none>
inflate-54cd576f79-nh8zg   1/1     Running   0          101s   10.0.1.120   ip-10-0-1-69.eu-west-1.compute.internal    <none>           <none>
inflate-54cd576f79-qm8dc   1/1     Running   0          101s   10.0.1.236   ip-10-0-1-69.eu-west-1.compute.internal    <none>           <none>
inflate-54cd576f79-vvgkg   1/1     Running   0          101s   10.0.1.18    ip-10-0-1-73.eu-west-1.compute.internal    <none>           <none>
inflate-54cd576f79-vzzjx   1/1     Running   0          101s   10.0.0.123   ip-10-0-0-193.eu-west-1.compute.internal   <none>           <none>
inflate-54cd576f79-xtwkr   1/1     Running   0          101s   10.0.1.147   ip-10-0-1-69.eu-west-1.compute.internal    <none>           <none>

Except for three pods, all the others are dispersed to other nodes across the different AZs of the select region (eu-west-1) as per the affinity rule specification.

Next, we’ll further explore Karpenter‘s use with another advanced scheduling technique, namely volume topology awareness.

Volume topology aware scheduling

Before volume topology awareness, the processes of scheduling pods to nodes and dynamically provisioning volumes were independent. As you can imagine, this introduced the challenge of unpredictable outcomes for your workloads. For example, you might create a persistent volume claim which will trigger the dynamic creation of a volume in a certain AZ (i.e., eu-west-1a), whereas the pod that needs to make use of the volume gets placed on a node in a separate AZ (eu-west-1b). As a result, the pod will fail to start.

This is especially problematic for your stateful workloads that rely on storage volumes to provide persistence of data. It would be inefficient, and counter-intuitive to dynamic provisioning, to manually provision the storage volumes in the appropriate AZs. That’s where topology awareness comes in.

Topology awareness complements dynamic provisioning by ensuring that pods are placed on the nodes that meet their topology requirements, in this case, storage volumes. The goal of topology awareness scheduling is to provide alignment between topology resources and your workloads. Thus, it gives you a more reliable and predictable outcome. This is handled by the topology manager, a component of the kubelet. This means the topology manager will make sure that your stateful workloads and the dynamically created persistent volumes are placed in the correct AZs.

To use volume topology awareness, ensure that you set the volumeBindingMode of your storage class to WaitForFirstConsumer. This property delays the provisioning of persistent volumes until a persistent volume claim is created by a pod that will use it.

In scaling events, Karpenter, the scheduler, and the topology manager work well together. In combination, they optimize the process of provisioning the right compute resources and align scheduled workloads with their dynamically created persistent volumes.

These technologies enable you to run and reliably scale stateful workloads in multiple AZs. You can thus spread your applications or databases in your cluster across zones to prevent a single point of failure in the case that an AZ is impacted. Considering that elastic block store (EBS) volumes are AZ-specific, your workloads should be configured with nodeAffinity to ensure that they are provisioned in the same AZ where they were first scheduled for a successful reattachment.

Volume topology aware example with Karpenter

In this example, you’ll create a stateful set for an application with 20 replicas, each with a persistent volume claim using a storage class with volumeBindingMode already set to WaitForFirstConsumer. In addition, you’ll specify that the nodeAffinity of the workloads should be scheduled to topology resources in the eu-west-1a AZ.

To review the default storage class, run the kubectl get storageclass command:

NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  3d11h

Next, you’ll fetch the nodes in the respective AZs:

Running kubectl get nodes -l topology.kubernetes.io/zone=eu-west-1a returns the following:

NAME                                       STATUS   ROLES    AGE     VERSION
ip-10-0-0-126.eu-west-1.compute.internal   Ready    <none>   4d23h   v1.21.12-eks-5308cf7
ip-10-0-0-193.eu-west-1.compute.internal   Ready    <none>   4d23h   v1.21.12-eks-5308cf7
ip-10-0-0-35.eu-west-1.compute.internal    Ready    <none>   4d21h   v1.21.12-eks-5308cf7

Whereas, running kubectl get nodes -l topology.kubernetes.io/zone=eu-west-1b returns the following:

NAME                                      STATUS   ROLES    AGE     VERSION
ip-10-0-1-12.eu-west-1.compute.internal   Ready    <none>   4d23h   v1.21.12-eks-5308cf7
ip-10-0-1-73.eu-west-1.compute.internal   Ready    <none>   4d23h   v1.21.12-eks-5308cf7
ip-10-0-3-30.eu-west-1.compute.internal   Ready    <none>   4d23h   v1.21.12-eks-5308cf7

Once you have the layout for your nodes, you can proceed to create the stateful set and an accompanying load balancer service for the application.

apiVersion: v1
kind: Service
metadata:
  name: express-nodejs-svc
spec:
  selector:
    app: express-nodejs
  type: LoadBalancer
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: express-nodejs
spec:   
  serviceName: express-nodejs-svc
  replicas: 20
  selector:
    matchLabels:
      app: express-nodejs
  template:
    metadata:
      labels:
        app: express-nodejs
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - eu-west-1a
      containers:
      - name: express-nodejs
        image: lukondefmwila/express-test:1.1.4
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        ports:
        - containerPort: 8080
          name: express-nodejs
        volumeMounts:
        - name: express-nodejs
          mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: express-nodejs
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: gp2
        resources:
          requests:
            storage: 10Gi

After deploying the application, persistent volumes are dynamically created in response to each claim from the stateful set replicas. Each one is created in the appropriate AZ, resulting in the successful creation of each pod replica. In conjunction, Karpenter provisions new nodes in eu-west-1a to meet the compute requirements of the stateful set.

Now when we run kubectl get nodes -l topology.kubernetes.io/zone=eu-west-1a, we get the following response:

NAME                                       STATUS   ROLES    AGE     VERSION
ip-10-0-0-126.eu-west-1.compute.internal   Ready    <none>   5d      v1.21.12-eks-5308cf7
ip-10-0-0-176.eu-west-1.compute.internal   Ready    <none>   4m18s   v1.21.12-eks-5308cf7
ip-10-0-0-193.eu-west-1.compute.internal   Ready    <none>   5d      v1.21.12-eks-5308cf7
ip-10-0-0-35.eu-west-1.compute.internal    Ready    <none>   4d21h   v1.21.12-eks-5308cf7
ip-10-0-0-4.eu-west-1.compute.internal     Ready    <none>   6m12s   v1.21.12-eks-5308cf7
ip-10-0-0-53.eu-west-1.compute.internal    Ready    <none>   8m8s    v1.21.12-eks-5308cf7
ip-10-0-0-96.eu-west-1.compute.internal    Ready    <none>   2m28s   v1.21.12-eks-5308cf7

As you can see, Karpenter has launched the following additional nodes:

ip-10-0-0-176.eu-west-1.compute.internal
ip-10-0-0-4.eu-west-1.compute.internal
ip-10-0-0-53.eu-west-1.compute.internal
ip-10-0-0-96.eu-west-1.compute.internal

Furthermore, we reviewed the created persistent volume claims, persistent volumes, and pods by running the appropriate commands as shown in the following code:

kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                      STORAGECLASS   REASON   AGE
pvc-07ad8f9f-e14f-4760-a908-58d2c48c49ac   10Gi       RWO            Delete           Bound    default/express-nodejs-express-nodejs-11   gp2                     9m53s
pvc-0c3a949a-aa2f-4244-9988-d650b409698a   10Gi       RWO            Delete           Bound    default/express-nodejs-express-nodejs-3    gp2                     13m
pvc-32dfc65e-9d10-42ea-a1c1-946f25500766   10Gi       RWO            Delete           Bound    default/express-nodejs-express-nodejs-7    gp2                     11m

pvc-3cf7afb9-8bf4-4a0c-8064-cc97f0cdcbd5   10Gi       RWO            Delete           Bound    default/express-nodejs-express-nodejs-9    gp2                     11m
pvc-3d8d0cb8-c5f4-43ad-a3a4-a95922a13c53   10Gi       RWO            Delete           Bound    default/express-nodejs-express-nodejs-0    gp2
...
kubectl get pvc

NAME                               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
express-nodejs-express-nodejs-0    Bound    pvc-3d8d0cb8-c5f4-43ad-a3a4-a95922a13c53   10Gi       RWO            gp2            14m
express-nodejs-express-nodejs-1    Bound    pvc-af841cf3-634a-4314-bfd1-a1d6da9d4101   10Gi       RWO            gp2            13m
express-nodejs-express-nodejs-10   Bound    pvc-741c0d72-7311-4063-9d9b-15942f71a9a9   10Gi       RWO            gp2            10m
express-nodejs-express-nodejs-11   Bound    pvc-07ad8f9f-e14f-4760-a908-58d2c48c49ac   10Gi       RWO            gp2            10m
express-nodejs-express-nodejs-12   Bound    pvc-4a91fb0a-778b-42d2-b4b7-33d5d3dc8a87   10Gi       RWO            gp2            9m45s
...
kubectl get pods -l app=express-nodejs -o wide
NAME                READY   STATUS    RESTARTS   AGE     IP           NODE                                       NOMINATED NODE   READINESS GATES
express-nodejs-0    1/1     Running   0          16m     10.0.0.174   ip-10-0-0-193.eu-west-1.compute.internal   <none>           <none>

express-nodejs-1    1/1     Running   0          16m     10.0.0.30    ip-10-0-0-35.eu-west-1.compute.internal    <none>           <none>
express-nodejs-10   1/1     Running   0          12m     10.0.0.235   ip-10-0-0-53.eu-west-1.compute.internal    <none>           <none>
express-nodejs-11   1/1     Running   0          12m     10.0.0.137   ip-10-0-0-53.eu-west-1.compute.internal    <none>           <none>
express-nodejs-12   1/1     Running   0          12m     10.0.0.161   ip-10-0-0-4.eu-west-1.compute.internal     <none>           <none>
...

Cleanup

To avoid incurring any additional costs, make sure you destroy all the infrastructure that you provisioned in relation to the examples detailed in this post.

Conclusion

In this post, we covered a hands-on approach to scaling Kubernetes with Karpenter specifically for supporting advanced scheduling techniques with inter-pod affinity and volume topology awareness.

To learn more about Karpenter, you can read the documentation and join the community channel, #karpenter, in the Kubernetes Slack workspace. Also, if you like the project, you can star it on GitHub here.

Lukonde Mwila, Principal Technical Evangelist, SUSE

Lukonde is a Principal Technical Evangelist at SUSE, an AWS Container Hero, and a HashiCorp Ambassador. He has years of experience in application development, solution architecture, cloud engineering, and DevOps workflows. He is a life-long learner and is passionate about sharing knowledge through various mediums. Nowadays, Lukonde spends the majority of his time providing content, training, and support in the Kubernetes ecosystem and SUSE’s open-source container management stack.

Containers

Scaling Kubernetes with Karpenter: Advanced Scheduling with Pod Affinity and Volume Topology Awareness

Introduction

Prerequisites

Pod affinity and pod anti-affinity scheduling

Volume topology aware scheduling

Cleanup

Conclusion

Lukonde Mwila, Principal Technical Evangelist, SUSE

Resources

Learn

Resources

Developers

Help