Scaling Kubernetes with Karpenter: Advanced Scheduling with Pod Affinity and Volume Topology Awareness
This post was co-written by Lukonde Mwila, Principal Technical Evangelist at SUSE, an AWS Container Hero, and a HashiCorp Ambassador.
Introduction
Cloud-native technologies are becoming increasingly ubiquitous, and Kubernetes is at the forefront of this movement. Today, Kubernetes is seeing widespread adoption across organizations in a variety of industries. When implemented properly, Kubernetes can help these organizations achieve higher availability, scalability, and resiliency for their workloads. Combining Kubernetes with the attributes of cloud computing, such as unparalleled scalability and elasticity, can further enhance the resiliency and availability of containerized applications.
As detailed in this introductory post, Karpenter's objective is to make sure that your cluster's workloads have the compute they need, no more and no less, right when they need it.
In its most recent updates, Karpenter added support for more advanced scheduling constraints, such as pod affinity and anti-affinity, topology spread, node affinity, node selection, and resource requests. This post delves specifically into podAffinity, podAntiAffinity, and volume topology awareness, and elaborates on the use cases they're best suited for.
Prerequisites
To carry out the examples in this post, you need to have Karpenter installed in a Kubernetes cluster in AWS. We'll use Amazon EKS for demonstration purposes. You can automate the process of provisioning an EKS cluster, with Karpenter as an add-on, by using the Terraform EKS Blueprints.
Pod affinity and pod anti-affinity scheduling
Scheduling constraints for pods are implemented by establishing relationships between pods and specific nodes, or between pods themselves. The latter is known as inter-pod affinity. Using inter-pod affinity, you define rules that guide the scheduler in deciding which pod goes on which node based on its relation to other pods. Inter-pod affinity includes both pod affinity and pod anti-affinity.
Like node affinity, this can be done using the requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution rules, depending on your requirements. As the names imply, required and preferred represent how hard or soft the scheduling constraints should be. If the scheduling criteria for a pod use the required rule, then Kubernetes only places the pod on a node that satisfies them. Pods that use the preferred rule are scheduled to the node that best matches the preference, but they can still be placed elsewhere if no node satisfies it.
Pod affinity: The podAffinity rule informs the scheduler to match pods that relate to each other based on their labels. When a new pod is created, the scheduler searches the nodes for existing pods that match the new pod's label selector and co-locates the new pod with them.
Pod anti-affinity: In contrast, the podAntiAffinity rule allows you to prevent certain pods from running on the same node if the matching label criteria are met.
These rules can be particularly helpful in various scenarios. For example, podAffinity is useful for co-locating pods in the same AZ or on the same node to support inter-dependencies and reduce network latency between services. On the other hand, podAntiAffinity is typically useful for preventing a single point of failure by spreading pods across AZs or nodes for high availability (HA). For such use cases, the recommended topology for anti-affinity is the zone or the hostname. This is implemented using the topologyKey property, which determines the search scope across the cluster's nodes. The topologyKey is the key of a label attached to each node.
An example of a podAntiAffinity implementation is the CoreDNS Deployment. Its Deployment resource includes a podAntiAffinity policy to ensure that the scheduler runs the CoreDNS pods on different nodes for HA and to avoid VPC DNS throttling. You'll notice that the Deployment's anti-affinity topologyKey is set to the hostname. In addition to this, podAntiAffinity can be used to give a pod or set of pods resource isolation on exclusive nodes, as well as to mitigate the risk of some pods interfering with the performance of others.
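For reference, the CoreDNS Deployment on Amazon EKS carries an anti-affinity block along these lines (abridged sketch; the exact labels and weight may differ between versions):

```yaml
# Abridged sketch of the anti-affinity section of the CoreDNS Deployment.
# The k8s-app: kube-dns label and the weight are typical values and may
# differ in your cluster.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: k8s-app
                operator: In
                values:
                  - kube-dns
          topologyKey: kubernetes.io/hostname
```

Because the topologyKey is kubernetes.io/hostname, the scheduler prefers to place each CoreDNS replica on a different node.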
Karpenter makes sure that new compute provisioned for your cluster satisfies these pod affinity rules as workloads scale, without any additional infrastructure configuration on your part. Karpenter tracks unschedulable pods and provisions compute resources in accordance with the required or preferred affinity rules defined in your resource manifests.
Pod affinity example with Karpenter
In this example, you'll create a deployment resource with a podAffinity rule that requires scheduling the pods on nodes in the same AZ (Availability Zone). In the process, Karpenter will interpret the requirements of the pods that need to be scheduled and provision nodes that allow these affinity rules to be met in an optimal way.
As a starting point, you'll need to install the Karpenter Provisioner on your cluster. The Provisioner is a CRD that details configuration specifications and parameters such as node types, labels, taints, tags, custom kubelet configurations, resource limits, and cluster connections via subnet and security group associations. A sketch of the Provisioner manifest used in this example can be seen below.
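The karpenter.sh/discovery: my-cluster selector values, the capacity type, and the CPU limit in this sketch are placeholders assumed for illustration; adjust them to match your environment:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Constrain the instances Karpenter may launch.
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  # Cap the total compute this Provisioner can create.
  limits:
    resources:
      cpu: 1000
  # Discover subnets and security groups by tag (placeholder value).
  provider:
    subnetSelector:
      karpenter.sh/discovery: my-cluster
    securityGroupSelector:
      karpenter.sh/discovery: my-cluster
  # Remove empty nodes after 30 seconds.
  ttlSecondsAfterEmpty: 30
```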
You can start by fetching all the nodes in your cluster using the kubectl get nodes command in your terminal. This will give you an idea of the existing nodes before Karpenter launches new ones in response to the application you'll deploy to the cluster shortly.
After that, you can proceed to create a deployment resource with the following manifest:
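A minimal sketch of such a deployment is shown below. The replica count, CPU request, and pause container image are assumptions made for illustration, while the app=inflate label matches the label used later in this example:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 8            # assumed replica count for illustration
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      affinity:
        podAffinity:
          # Hard requirement: co-locate all inflate pods in the same AZ.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: inflate
              topologyKey: topology.kubernetes.io/zone
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              cpu: "1"   # sized so the pods don't all fit on existing nodes
```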
Karpenter will detect the unscheduled pods and provision a node that will help fulfill the inter-pod affinity requirements of this deployment:
The newly created node has the hostname ip-10-0-1-233.eu-west-1.compute.internal.
Here is a partial description of the new node:
We can then fetch the relevant pods using the appropriate label, app=inflate in this case, to review how the pods have been scheduled.
As you can see, six of the pods have been scheduled on the new node ip-10-0-1-233.eu-west-1.compute.internal, whereas the other two have been scheduled to nodes in the same AZ (eu-west-1b) as per the topologyKey of the podAffinity rule. Since these nodes are all in the same AZ, they are part of the same topology, thus meeting the scheduling requirements.
Pod anti-affinity example with Karpenter
In the second example, you'll apply a podAntiAffinity rule to preferably schedule the pods across different nodes in the cluster, based on their AZs. As before, Karpenter reads the pod requirements and launches nodes that support the podAntiAffinity configuration.
Similar to the previous example, start by fetching all the nodes in the cluster:
After that, you can proceed to create a deployment resource with the podAntiAffinity rule applied:
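A sketch with a preferred (soft) podAntiAffinity rule keyed on the zone could look like this; as in the previous example, the replica count, image, and resource request are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 8            # assumed replica count for illustration
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      affinity:
        podAntiAffinity:
          # Soft preference: spread inflate pods across AZs where possible.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: inflate
                topologyKey: topology.kubernetes.io/zone
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              cpu: "1"
```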
In response to the pod requirements, Karpenter will provision a new node:
The newly created node has the hostname ip-10-0-1-69.eu-west-1.compute.internal.
Here is a partial description of the new node:
As before, fetch the pods with the relevant label:
Except for three pods, all the others are dispersed across nodes in the different AZs of the selected region (eu-west-1), as per the affinity rule specification.
Next, we'll further explore Karpenter's use with another advanced scheduling technique, namely volume topology awareness.
Volume topology aware scheduling
Before volume topology awareness, the processes of scheduling pods to nodes and dynamically provisioning volumes were independent. As you can imagine, this introduced the challenge of unpredictable outcomes for your workloads. For example, you might create a persistent volume claim that triggers the dynamic creation of a volume in one AZ (for example, eu-west-1a), while the pod that needs to use the volume gets placed on a node in a different AZ (eu-west-1b). As a result, the pod will fail to start.
This is especially problematic for stateful workloads that rely on storage volumes for data persistence. Manually provisioning the storage volumes in the appropriate AZs would be inefficient and would defeat the purpose of dynamic provisioning. That's where topology awareness comes in.
Topology awareness complements dynamic provisioning by ensuring that pods are placed on nodes that satisfy their topology requirements, in this case the topology of their storage volumes. The goal of topology aware scheduling is to align topology resources with your workloads, which gives you a more reliable and predictable outcome. This is handled by the Kubernetes scheduler in concert with the volume provisioner: the scheduler picks a node first, and the volume is then provisioned in that node's AZ, so your stateful workloads and their dynamically created persistent volumes end up in the correct AZs.
To use volume topology awareness, ensure that the volumeBindingMode of your storage class is set to WaitForFirstConsumer. This setting delays the binding and provisioning of a persistent volume until a pod that uses the matching persistent volume claim is scheduled.
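For example, the default gp2 storage class on Amazon EKS already uses this binding mode; a comparable storage class looks roughly like the following (the name and parameters here are illustrative, and clusters using the EBS CSI driver reference ebs.csi.aws.com as the provisioner instead):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
provisioner: kubernetes.io/aws-ebs   # in-tree EBS provisioner
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
```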
In scaling events, Karpenter and the Kubernetes scheduler work well together: in combination, they optimize the process of provisioning the right compute resources and align scheduled workloads with their dynamically created persistent volumes.
These technologies enable you to run and reliably scale stateful workloads in multiple AZs. You can thus spread the applications or databases in your cluster across zones to prevent a single point of failure in case an AZ is impacted. Because Amazon Elastic Block Store (Amazon EBS) volumes are AZ-specific, your workloads should be configured with nodeAffinity so that they are rescheduled in the same AZ where they were first scheduled, allowing their volumes to be reattached successfully.
Volume topology aware example with Karpenter
In this example, you'll create a stateful set for an application with 20 replicas, each with a persistent volume claim that uses a storage class whose volumeBindingMode is already set to WaitForFirstConsumer. In addition, you'll add a nodeAffinity rule so that the workloads are scheduled onto topology resources in the eu-west-1a AZ.
To review the default storage class, run the kubectl get storageclass command:
Next, you’ll fetch the nodes in the respective AZs:
Running kubectl get nodes -l topology.kubernetes.io/zone=eu-west-1a returns the following:
Whereas running kubectl get nodes -l topology.kubernetes.io/zone=eu-west-1b returns the following:
NAME STATUS ROLES AGE VERSION
ip-10-0-1-12.eu-west-1.compute.internal Ready <none> 4d23h v1.21.12-eks-5308cf7
ip-10-0-1-73.eu-west-1.compute.internal Ready <none> 4d23h v1.21.12-eks-5308cf7
ip-10-0-3-30.eu-west-1.compute.internal Ready <none> 4d23h v1.21.12-eks-5308cf7
Once you have the layout for your nodes, you can proceed to create the stateful set and an accompanying load balancer service for the application.
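A condensed sketch of such a stateful set and service is shown below. The application image, storage size, and resource names are placeholders, and a separate headless service (omitted for brevity) would normally back the stateful set's stable network identities:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: app
spec:
  type: LoadBalancer
  selector:
    app: app
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: app
spec:
  serviceName: app
  replicas: 20
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      affinity:
        nodeAffinity:
          # Pin the workload to the eu-west-1a AZ so pods and their
          # dynamically provisioned EBS volumes stay co-located.
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - eu-west-1a
      containers:
        - name: app
          image: public.ecr.aws/nginx/nginx:latest   # placeholder application image
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: "1"
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp2      # the WaitForFirstConsumer class shown earlier
        resources:
          requests:
            storage: 1Gi           # placeholder volume size
```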
After deploying the application, persistent volumes are dynamically created in response to each claim from the stateful set replicas. Each one is created in the appropriate AZ, resulting in the successful creation of each pod replica. In conjunction, Karpenter provisions new nodes in eu-west-1a to meet the compute requirements of the stateful set.
Now when we run kubectl get nodes -l topology.kubernetes.io/zone=eu-west-1a, we can see that Karpenter has launched the following additional nodes:
- ip-10-0-0-176.eu-west-1.compute.internal
- ip-10-0-0-4.eu-west-1.compute.internal
- ip-10-0-0-53.eu-west-1.compute.internal
- ip-10-0-0-96.eu-west-1.compute.internal
Finally, you can review the created persistent volume claims, persistent volumes, and pods by running kubectl get pvc, kubectl get pv, and kubectl get pods to confirm that each replica's volume was provisioned in eu-west-1a and successfully attached.
Cleanup
To avoid incurring any additional costs, make sure you destroy all the infrastructure that you provisioned in relation to the examples detailed in this post.
Conclusion
In this post, we covered a hands-on approach to scaling Kubernetes with Karpenter specifically for supporting advanced scheduling techniques with inter-pod affinity and volume topology awareness.
To learn more about Karpenter, you can read the documentation and join the community channel, #karpenter, in the Kubernetes Slack workspace. Also, if you like the project, you can star it on GitHub here.