Managing Pod Scheduling Constraints and Groupless Node Upgrades with Karpenter in Amazon EKS
Feb 2024: This blog has been updated for Karpenter version v0.33.1 and v1beta1 specification.
About Karpenter
Karpenter is an open-source node lifecycle management project built for Kubernetes. It observes the aggregate resource requests of unschedulable pods and decides when to launch new nodes and when to terminate them, sending commands to the underlying cloud provider, in order to reduce scheduling latency and infrastructure costs. Karpenter launches nodes with the minimal compute resources needed to fit the unschedulable pods, which makes bin-packing efficient, and it works in tandem with the Kubernetes scheduler to bind those pods to the newly provisioned nodes.
Why Karpenter
Before the launch of Karpenter, Kubernetes users relied on Amazon EC2 Auto Scaling groups and the Kubernetes Cluster Autoscaler to dynamically adjust the compute capacity of their clusters. One challenge with Cluster Autoscaler is significant deployment latency, because many pods must wait for a node to scale up before they can be scheduled. Cluster Autoscaler does not bind pods to nodes; scheduling decisions are made by the kube-scheduler. As a result, nodes can take several minutes to become available, which increases pod scheduling latency for critical workloads.
One of the main objectives of Karpenter is to simplify the management of capacity. If you are familiar with other autoscalers, you will notice that Karpenter takes a different approach, referred to as groupless autoscaling. Traditionally, the node group has been the element of control that defines the characteristics of the capacity provided (for example, On-Demand, EC2 Spot, or GPU nodes) and that controls the desired scale of the group in the cluster. In AWS, a node group is implemented as an Auto Scaling group. Over time, clusters that follow this paradigm and run different types of applications requiring different capacity types end up with a complex configuration and operational model, in which node groups must be defined and provisioned in advance.
Configuring Nodepools
Karpenter’s job is to add nodes to handle unschedulable pods (pods with the status condition Unschedulable=True set by the kube-scheduler), schedule pods on those nodes, and remove the nodes when they are not needed. To configure Karpenter, you create nodepools that define how Karpenter manages unschedulable pods and expires nodes.
A NodePool sets constraints on the nodes that Karpenter can create and on the pods that can run on those nodes.
Pods can also request nodes by instance type, architecture, OS, or other attributes by adding scheduling constraints to the Kubernetes pod spec. These pod scheduling constraints (resource requests, node selection, node affinity, and topology spread) must fall within the NodePool's constraints for the pods to be deployed on Karpenter-provisioned nodes; if they do not, the pods will not deploy.
In many scenarios a single NodePool can satisfy all requirements. Combining the scheduling constraints of the NodePool with those of the pods lets it serve teams that have different constraints for running their workloads (for example, one team that can use only nodes in a specific AZ and another that uses Arm64 nodes), as well as cases such as billing or different deprovisioning requirements.
Use cases for Nodepool Constraints
With Karpenter's layered constraints, you can be sure that the precise type and amount of resources needed are available to your pods.
For specific requirements, such as choosing an instance type or Availability Zone, you can tighten the constraints defined in a NodePool by defining additional scheduling constraints in the pod spec.
Below are some use cases for combining NodePool requirements with pod scheduling constraints so that Karpenter launches matching nodes for the unschedulable pods.
- Needing to run on a specific instance type in zones where dependent applications or storage are available
- Requiring certain kinds of processors or other hardware
Upgrading nodes
A straightforward way to upgrade nodes is to set spec.disruption.expireAfter. Nodes are terminated after the configured period and replaced with newer nodes. The recommended method for patching your Kubernetes worker nodes is Drift; see the blog post How to upgrade Amazon EKS worker nodes with Karpenter Drift. You can also read Karpenter Disruption for more details.
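As a minimal sketch of node expiry, the snippet below writes a merge patch that sets spec.disruption.expireAfter on a NodePool. The file name expire-patch.yaml, the NodePool name default used in the note that follows, and the 720h value are illustrative assumptions, not values from this walkthrough.

```shell
# Write a merge patch that sets node expiry on a v1beta1 NodePool.
# 720h (30 days) is an illustrative value; choose one that fits your patching cadence.
cat > expire-patch.yaml <<'EOF'
spec:
  disruption:
    expireAfter: 720h
EOF
```

On a cluster where a NodePool named default exists, a patch like this could be applied with `kubectl patch nodepool default --type merge --patch-file expire-patch.yaml`.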
Walkthrough
In this section, you will provision an EKS cluster, deploy Karpenter, and deploy a sample application. You will then demonstrate node scaling with Karpenter and show how to define pod scheduling constraints that fall within the NodePool's requirements, for different application workloads or for teams that need different instance capacity.
Prerequisites
- A user or role with permission to create a cluster.
- AWS CLI
- eksctl (Use the latest version)
- Installing kubectl
- Using Helm with Amazon EKS
Karpenter Deployment Tasks
1) Set the following environment variables:
2) Create an Amazon EKS Cluster and IAM Role for KarpenterController
- Create a cluster with eksctl. This example configuration file specifies a basic cluster with one initial node and sets up an IAM OIDC provider for the cluster to enable IAM roles for pods.
- Install Karpenter Helm Chart
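The first two tasks above might look like the following sketch. Only the Karpenter version comes from this post; the cluster name, Region, node group name, and instance type are hypothetical placeholders. The IAM roles that Karpenter needs are not covered here.

```shell
# Environment variables for the walkthrough.
export KARPENTER_VERSION="v0.33.1"                            # version named in this post
export CLUSTER_NAME="${CLUSTER_NAME:-karpenter-demo}"         # assumed cluster name
export AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-west-2}"  # assumed Region

# eksctl config: a basic cluster with one initial managed node and an IAM OIDC provider.
cat > cluster.yaml <<EOF
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${CLUSTER_NAME}
  region: ${AWS_DEFAULT_REGION}
  tags:
    karpenter.sh/discovery: ${CLUSTER_NAME}
iam:
  withOIDC: true
managedNodeGroups:
  - name: ${CLUSTER_NAME}-ng
    instanceType: m5.large
    desiredCapacity: 1
    minSize: 1
    maxSize: 2
EOF
```

With the prerequisites in place, the cluster could then be created with `eksctl create cluster -f cluster.yaml`.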
Deploy the nodepool and application pods with layered constraints
Deploy the below Karpenter nodepool spec that has the following requirements:
- Architecture type (arm64 & amd64)
- Capacity type (Spot & On-demand)
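A NodePool with these two requirements could be sketched as follows, using the v1beta1 API. The NodePool name default, the CPU limit, and the reference to an EC2NodeClass named default (assumed to already exist) are illustrative assumptions.

```shell
# Write a v1beta1 NodePool that allows both architectures and both capacity types.
cat > nodepool.yaml <<'EOF'
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default            # assumed EC2NodeClass, created separately
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64", "amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
  limits:
    cpu: 1000                    # illustrative cap on total provisioned CPU
EOF
```

The manifest can be applied with `kubectl apply -f nodepool.yaml`. Pods then narrow these requirements further through their own scheduling constraints.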
Run the application deployment on a specific capacity, instance type, hardware and availability zone using Pod scheduling constraints.
- The sample deployment defines a nodeSelector with topology.kubernetes.io/zone to choose a specific Availability Zone, karpenter.sh/capacity-type and kubernetes.io/arch: arm64 to request an On-Demand arm64 instance, and node.kubernetes.io/instance-type to request a specific instance type, so that Karpenter launches new nodes that satisfy these pod scheduling constraints.
- Scale the deployment to see Karpenter scale nodes. Karpenter selects capacity that matches the above configuration from EC2 through the CreateFleet API for the application pods.
- Review the Karpenter pod logs for events and more details.
- Example snippet of the logs.
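A deployment with the nodeSelector described in the steps above could be sketched as follows. The deployment name inflate, the zone us-west-2a, the instance type c6g.large, and the pause image are illustrative assumptions; substitute values that exist in your Region.

```shell
# Write a Deployment whose nodeSelector pins zone, capacity type,
# architecture, and instance type. It starts at 0 replicas.
cat > inflate.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: us-west-2a        # assumed zone
        karpenter.sh/capacity-type: on-demand
        kubernetes.io/arch: arm64
        node.kubernetes.io/instance-type: c6g.large    # assumed arm64 instance type
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: "1"
EOF
```

After `kubectl apply -f inflate.yaml`, scaling with `kubectl scale deployment inflate --replicas 5` creates unschedulable pods, and Karpenter should respond by launching nodes that match all four selectors.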
Validate that the application pods are in the Running state with the following command.
- Example snippet of the Node output and Pods output.
The demonstration above shows Karpenter applying layered constraints to launch nodes that satisfy multiple scheduling constraints of a workload, such as instance type, specific AZ, and hardware architecture.
Groupless Node upgrades
As mentioned in an earlier section, when you use node groups (self-managed or managed) with an EKS cluster, upgrading the worker nodes to a newer version of Kubernetes means either migrating to a new node group (self-managed) or launching a new Auto Scaling group of worker nodes (managed), as described in Managed nodegroup upgrade behaviour. With Karpenter's groupless autoscaling, by contrast, node upgrades work through Drift.
Drift handles changes to the NodePool/EC2NodeClass. For Drift, values in the NodePool/EC2NodeClass are reflected in the NodeClaimTemplateSpec/EC2NodeClassSpec in the same way that they're set. Karpenter uses Drift to upgrade Kubernetes nodes and replaces them in a rolling fashion. In Karpenter v0.33.x the Drift feature gate is enabled by default, so node upgrades go through Drift.
Note: Karpenter supports custom AMIs. You can specify amiSelectorTerms in the EC2NodeClass; this fully overrides the default AMIs that are selected by your EC2NodeClass amiFamily.
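As a sketch of the custom AMI option, the EC2NodeClass below selects an AMI by ID through amiSelectorTerms. The AMI ID, the resource name custom-ami, the node role name, and the discovery tag values are hypothetical placeholders.

```shell
CLUSTER_NAME="${CLUSTER_NAME:-karpenter-demo}"   # assumed cluster name

# Write a v1beta1 EC2NodeClass that pins a custom AMI by ID.
cat > ec2nodeclass-custom-ami.yaml <<EOF
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: custom-ami
spec:
  amiFamily: AL2
  role: KarpenterNodeRole-${CLUSTER_NAME}        # assumed node IAM role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${CLUSTER_NAME}
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${CLUSTER_NAME}
  amiSelectorTerms:
    - id: ami-0123456789abcdef0                  # placeholder AMI ID
EOF
```

When amiSelectorTerms pins a specific AMI like this, a control plane upgrade alone no longer changes the AMI that Karpenter resolves, so the selector has to be updated for nodes to drift to a newer image.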
- Validate the current Kubernetes version of the EKS cluster with the following command.
- Example snippet of the above command.
- Deploy PodDisruptionBudget for your Application deployment. PodDisruptionBudget (PDB) limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions.
Note: With a PDB you can set minAvailable or maxUnavailable as an integer or as a percentage. Refer to the Kubernetes documentation on pod disruptions and how to configure them for more details.
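A PodDisruptionBudget along these lines could be sketched as follows, assuming the application pods carry the label app: inflate. The label, the PDB name, and the minAvailable value are illustrative.

```shell
# Write a PodDisruptionBudget that keeps at least one matching pod
# available during voluntary disruptions such as node drains.
cat > inflate-pdb.yaml <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inflate-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: inflate
EOF
```

After `kubectl apply -f inflate-pdb.yaml`, `kubectl get pdb` shows how many disruptions are currently allowed. A percentage such as "50%" can be used for minAvailable instead of an integer.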
Example snippet of above PDB and sample application deployment that was configured in earlier section
- Upgrade the EKS cluster to a newer Kubernetes version using the console or eksctl, as described in the EKS documentation.
- We can see that the cluster was upgraded successfully to 1.28.
- Validate the application pods with the following commands. We can see that the nodes launched by Karpenter are upgraded to 1.28, the same as the Kubernetes version of the EKS cluster.
Checking the workload and the nodes drifted by Karpenter, we can see that the new nodes are on version 1.28, because Karpenter used the latest EKS optimized AMI for the new EKS cluster version, 1.28. We can observe the Drift events in the Karpenter controller logs.
- Review the Karpenter controller pod logs for events and more details.
- Example snippet of the logs.
Note: In the above logs, we can see that Karpenter drifted the node to the latest version of the EKS optimized AMI for 1.28 and launched a new node for the workload. The old node was then cordoned, drained, and deleted by Karpenter.
The demonstration above shows that Karpenter respected the PDB and applied the Drift disruption workflow to upgrade the nodes it had launched, which gives groupless management of worker node upgrades.
In general, you can configure Karpenter to disrupt nodes through your NodePool in multiple ways, using spec.disruption.consolidationPolicy, spec.disruption.consolidateAfter, or spec.disruption.expireAfter. You can use node expiry to periodically recycle nodes for security reasons and Drift to upgrade the nodes. Please refer to Karpenter Disruption for more details.
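As a minimal sketch of the consolidation settings, the merge patch below sets consolidationPolicy together with consolidateAfter. In the v1beta1 API, consolidateAfter can only be combined with the WhenEmpty policy. The file name, the 30s value, and the NodePool name default in the note that follows are illustrative assumptions.

```shell
# Write a merge patch that removes nodes 30 seconds after they become empty.
# consolidateAfter is only valid with consolidationPolicy: WhenEmpty in v1beta1.
cat > consolidation-patch.yaml <<'EOF'
spec:
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
EOF
```

A patch like this could be applied with `kubectl patch nodepool default --type merge --patch-file consolidation-patch.yaml`. To let Karpenter also replace underutilized nodes, use consolidationPolicy: WhenUnderutilized and leave consolidateAfter unset.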
Cleanup
Delete all the NodePools (CRDs) that were created.
Remove Karpenter and delete the infrastructure from your AWS account.