Containers

Announcing Karpenter 1.0

Introduction

In November 2021, AWS announced the launch of v0.5 of Karpenter, “a new open source Kubernetes cluster auto scaling project.” Originally conceived as a flexible, dynamic, and high-performance alternative to the Kubernetes Cluster Autoscaler, Karpenter has, in the nearly three years since then, evolved substantially into a fully featured, Kubernetes-native node lifecycle manager.

The project has been adopted for mission-critical use cases by industry leaders. It has added key features such as workload consolidation, which is designed to automatically improve utilization, and disruption controls, which allow users to specify how and when Karpenter performs node lifecycle management operations in their clusters. In October 2023, the project graduated to beta, and AWS contributed the vendor-neutral core of the project to the Cloud Native Computing Foundation (CNCF) through the Kubernetes Autoscaling Special Interest Group (SIG Autoscaling). Engagement from the Karpenter community has made it one of the ten most popular AWS open source projects by GitHub stars, and contributions from non-AWS community members have increased in both number and scope. Underpinning this evolution – the user success stories, new features, and community contributions – the Karpenter team at AWS has been working diligently to raise the bar on the project’s maturity and operational stability.

Today, with the release of Karpenter v1.0.0, we are proud to announce that Karpenter has graduated out of beta. With this release, the stable Karpenter APIs – NodePool and EC2NodeClass – remain available across future v1 minor releases and will not be modified in ways that result in breaking changes from one minor release to another. In this post we describe the changes between the current v0.37 Karpenter release and v1.0.0.

What is changing?

As part of the v1 release, the custom resource definition (CRD) application programming interface (API) groups and kind names remain unchanged. We have also created conversion webhooks to make the migration from beta to stable seamless. In the minor version of Karpenter that follows v1 (v1.1.0), we plan to drop support for the v1beta1 APIs. The following is a summary of the new features and changes.
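For example, only the version segment of a manifest’s apiVersion changes when moving to v1; the group and kind stay the same (field-level changes are covered in the following sections):

apiVersion: karpenter.sh/v1 # was karpenter.sh/v1beta1; group and kind are unchanged
kind: NodePool
metadata:
  name: default
...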

Enhanced disruption controls by reason

In Karpenter release v0.34.0, Karpenter introduced disruption controls to give users more control over how and when Karpenter terminates nodes, to improve the balance between cost-efficiency, security, and application availability. These disruption budgets follow the expressive cron syntax and can be scheduled to apply at certain times of the day, days of the week, hours, minutes, or all the time to further protect application availability. By default, if no disruption budget is set, then Karpenter limits disruptions to 10% of nodes at any point in time.

Karpenter v1 adds support for disruption budgets by reason. The supported reasons are Underutilized, Empty, and Drifted. This enables users to have finer-grained control of the disruption budgets that apply to specific disruption reasons. For example, the following disruption budgets define how a user can implement a control where:

  • 0% of nodes can be disrupted Monday to Friday from 9:00 UTC for eight hours if drifted or underutilized.
  • 100% of nodes can be disrupted at any time if empty.
  • At all other times, 10% of nodes can be disrupted when drifted or underutilized and a stricter budget is not active.

Users might use this budget to make sure that empty nodes can be terminated during periods of peak application traffic to optimize compute. If a reason is not set, then the budget applies to all reasons.

...
disruption:
  budgets:
  - nodes: "0"
    schedule: "0 9 * * mon-fri"
    duration: 8h
    reasons:
    - Drifted
    - Underutilized
  - nodes: "100%"
    reasons:
    - Empty
  - nodes: "10%"
    reasons:
    - Drifted
    - Underutilized
...

Renamed consolidation policy WhenUnderutilized to WhenEmptyOrUnderutilized

The WhenUnderutilized consolidation policy has been renamed to WhenEmptyOrUnderutilized. The functionality remains the same as in v1beta1, where Karpenter would consolidate nodes that are partially utilized or empty when consolidationPolicy=WhenUnderutilized was set. The new name, WhenEmptyOrUnderutilized, explicitly reflects both conditions.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized

Introducing consolidateAfter consolidation control for underutilized nodes

Karpenter prioritizes nodes to consolidate based on the least number of pods scheduled. Users with workloads that experience rapid surges in demand or interruptible jobs might have high pod churn, and have asked to be able to tune how quickly Karpenter attempts to consolidate nodes in order to retain capacity and minimize node churn. Previously, consolidateAfter could only be used when consolidationPolicy=WhenEmpty, that is, after the last pod is removed. consolidateAfter can now also be used when consolidationPolicy=WhenEmptyOrUnderutilized, allowing users to specify, in hours, minutes, or seconds, how long Karpenter waits after a pod is added or removed before consolidating. If you would like the same behavior as v1beta1, then set consolidateAfter to 0 when consolidationPolicy=WhenEmptyOrUnderutilized.
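For example, a minimal NodePool sketch (the one-minute value is purely illustrative) that waits a minute after pod churn before consolidating might look like the following:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m # illustrative: wait one minute after a pod is added or removed
...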

New disruption control terminationGracePeriod

Cluster administrators want a way to enforce a maximum node lifetime natively within Karpenter so that nodes stay compliant with security requirements. Karpenter gracefully disrupts nodes by respecting Pod Disruption Budgets (PDBs), a pod’s terminationGracePeriodSeconds, and the karpenter.sh/do-not-disrupt annotation. If these settings are misconfigured, Karpenter can block indefinitely waiting for nodes to be disrupted, which prevents cluster admins from rolling out new Amazon Machine Images (AMIs).

Therefore, a terminationGracePeriod has been introduced. terminationGracePeriod is the maximum time Karpenter spends draining a node before forcefully deleting it, and Karpenter does not wait for a replacement node once the node’s expiration has been met. The maximum lifetime of a node is therefore its expireAfter + terminationGracePeriod. As part of this change, the expireAfter configuration has also moved from the disruption block to the template spec.

In the following example, a cluster administrator might configure a NodePool so that nodes start draining after 30 days, with a 24-hour grace period before being forcefully terminated, so that existing workloads (such as long-running batch jobs) have enough time to complete.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      terminationGracePeriod: 24h
      expireAfter: 720h
...

Drift feature gate removed

Karpenter drift replaces nodes that have drifted from their desired state (for example, nodes using an outdated AMI). In v1, drift has been promoted to stable and the feature gate has been removed, which means drift is enabled by default. Users who disabled the drift feature gate in v1beta1 can now control drift by using disruption budgets by reason.
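For example, a disruption budget along these lines (a sketch; the 0% value is chosen for illustration) blocks disruptions for the Drifted reason entirely, approximating the behavior of disabling the old feature gate:

...
disruption:
  budgets:
  - nodes: "0"
    reasons:
    - Drifted
...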

Require amiSelectorTerms

In Karpenter v1beta1 APIs, when specifying amiFamily with no amiSelectorTerms, Karpenter would automatically update nodes through drift when a new version of the Amazon EKS optimized AMI in that family is released. This works well in pre-production environments where it’s nice to be auto-upgraded to the latest version for testing, but might not be desired in production environments. Karpenter now recommends that users pin AMIs in their production environments. More information on how to manage AMIs can be found in the Karpenter documentation.

amiSelectorTerms has now been made a required field and a new term, alias, has been introduced, which consists of an AMI family and a version (family@version). If an alias exists in the EC2NodeClass, then Karpenter selects the Amazon EKS optimized AMI for that family. With this new feature, users can pin to a specific version of the Amazon EKS optimized AMI. The following Amazon EKS optimized AMI families can be configured: al2, al2023, bottlerocket, windows2019, and windows2022. The following section provides an example.

Using Amazon EKS optimized AMIs

In this example, Karpenter provisions nodes with the Bottlerocket v1.20.3 Amazon EKS optimized AMI. Even after AWS releases newer versions of the Bottlerocket Amazon EKS optimized AMI, the worker nodes do not drift.

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
...
  amiSelectorTerms:
    - alias: bottlerocket@v1.20.3
...

Using custom AMIs

If the EC2NodeClass does not specify an alias term, then amiFamily must be set to determine which user data is used. The amiFamily can be set to one of AL2, AL2023, Bottlerocket, Windows2019, or Windows2022 to select pre-generated user data, or to Custom if the user provides their own user data. You can use the existing tags, name, or ID field in amiSelectorTerms to select an AMI. Examples of injected user data can be found in the Karpenter documentation for the Amazon EKS optimized AMI families.

In the following example, the EC2NodeClass selects a user-specified AMI with ID “ami-123” and uses the Bottlerocket generated user data.

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
...
  amiFamily: Bottlerocket
  amiSelectorTerms:
    - id: ami-123   
...

Removed Ubuntu AMI family selection

Beginning with v1, the Ubuntu AMI family has been removed. To continue using Ubuntu AMIs, you can configure an AMI in amiSelectorTerms pinned to the latest Ubuntu AMI ID. Furthermore, you can reference amiFamily: AL2 in your EC2NodeClass to get the same user data configuration that you received before. The following is an example:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
...
  amiFamily: AL2
  amiSelectorTerms:
    - id: ami-123   
...

Restrict Instance Metadata Service access from containers by default

It is an Amazon EKS best practice to restrict pods from accessing the AWS Identity and Access Management (IAM) instance profile attached to nodes, to help make sure that your applications only have the permissions they need, and not those of their nodes. Therefore, by default for new EC2NodeClasses, pod access to the Instance Metadata Service (IMDS) is blocked by setting the hop count to one (httpPutResponseHopLimit: 1) and requiring IMDSv2 (httpTokens: required). Pods using host networking mode continue to have access to IMDS. Users should use Amazon EKS Pod Identity or IAM roles for service accounts to grant pods the AWS permissions they need to access AWS services.
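For reference, the following sketch shows how these defaults map onto the metadataOptions block of an EC2NodeClass (the values shown are the defaults described above):

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
...
  metadataOptions:
    httpPutResponseHopLimit: 1 # hop limit of one keeps pods on non-host networking from reaching IMDS
    httpTokens: required       # require IMDSv2 session tokens
...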

Moved kubelet configuration to EC2NodeClass

Karpenter provides the ability to specify a subset of kubelet arguments for additional customization. In Karpenter v1, the kubelet configuration has moved to the EC2NodeClass API. If you provided a custom kubelet configuration and have multiple NodePools with different kubelet configurations referencing a single EC2NodeClass, then you now need to use multiple EC2NodeClasses. In Karpenter v1 the conversion webhooks maintain compatibility. However, before migrating to v1.1.x, users must update their NodePools to reference the correct EC2NodeClass, which results in nodes drifting.
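As a sketch of the new layout (the maxPods value here is purely illustrative), kubelet configuration now lives on the EC2NodeClass rather than the NodePool:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
...
  kubelet:
    maxPods: 110 # illustrative value; in v1beta1 this was set on the NodePool
...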

NodeClaims made immutable

Karpenter v1beta1 did not enforce immutability on NodeClaims, but it assumed that users would not act against these objects after creation. NodeClaims are now immutable, because the NodeClaim lifecycle controller does not react to changes after the initial instance launch.

Require all NodePool nodeClassRef fields and rename apiVersion field to group

Karpenter v1beta1 did not require users to set the apiVersion and kind of the NodeClass that they were referencing. In Karpenter v1 users are now required to set all nodeClassRef fields. In addition, the apiVersion field in the nodeClassRef has been renamed to group.

...
  nodeClassRef:
    group: karpenter.k8s.aws
    kind: EC2NodeClass
    name: default
...

Karpenter Prometheus metric changes

Karpenter makes several metrics available in the Prometheus format to allow monitoring of the Karpenter controller and cluster provisioning status. As part of the Karpenter v1 release, a number of the v1beta1 metrics have changed; therefore, users that have dashboards with queries that use these metrics will need to update them. For a detailed list of metric changes, review the Karpenter v1 upgrade documentation.

Planned deprecations

As part of this change the following beta deprecations have been removed in v1:

  • The karpenter.sh/do-not-evict annotation was introduced as a pod-level control in alpha. This control was superseded by the karpenter.sh/do-not-disrupt annotation, which disables disruption operations against the node on which the pod is running (an example of this annotation follows this list). The karpenter.sh/do-not-evict annotation was declared deprecated throughout beta and is dropped in v1.
  • The karpenter.sh/do-not-consolidate annotation was introduced as a node-level control in alpha. This control was superseded by the karpenter.sh/do-not-disrupt annotation, which disables all disruption operations rather than just consolidation. The karpenter.sh/do-not-consolidate annotation was declared deprecated throughout beta and is dropped in v1.
  • ConfigMap-based configuration was deprecated in v1beta1 and has been fully removed in v1. It was deprecated in favor of a simpler configuration based on CLI arguments and environment variables.
  • Support for the karpenter.sh/managed-by tag, which stores the cluster name in its value, has been replaced by the eks:eks-cluster-name tag.
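For reference, the replacement karpenter.sh/do-not-disrupt control is set as a regular annotation, for example on a pod that should not be voluntarily disrupted:

apiVersion: v1
kind: Pod
metadata:
  name: example # illustrative pod name
  annotations:
    karpenter.sh/do-not-disrupt: "true"
...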

For a full list of new features, changes, and deprecations, read the detailed changelog.

Migration path

Because the v1 APIs for Karpenter do not change the API group or resource names, the Kubernetes webhook conversion process can be used to upgrade the APIs in place without having to roll nodes. Prior to upgrading, you must be on a version of Karpenter (0.33.0 or later) that supports the v1beta1 APIs, such as NodePool, NodeClaim, and EC2NodeClass.

A summary of the upgrade process from beta to v1 is as follows:

  1. Apply the updated v1 NodePool, NodeClaim, and EC2NodeClass CRDs
  2. Upgrade Karpenter controller to its v1.0.0 version. This version of Karpenter starts reasoning in terms of the v1 API schema in its API requests. Resources are converted from the v1beta1 to the v1 version automatically, using conversion webhooks shipped by the upstream Karpenter project and the Providers (for EC2NodeClass changes).
  3. Next, before upgrading to Karpenter v1.1.0, users must update their v1beta1 manifests to the new v1 version, taking into consideration the API changes in the release. See “Before upgrading to Karpenter v1.1.0” in the v1 migration documentation for more details.

For detailed upgrade steps, see the Karpenter v1 migration documentation.

Conclusion

In this post, you learned about the Karpenter 1.0.0 release and a summary of the new features and changes. Before you upgrade Karpenter to v1.0.0, we recommend that you read the full Karpenter v1 migration documentation and test your upgrade process in a non-production environment. If you have questions or feedback, then reach out in the #karpenter channel of the Kubernetes Slack or share feedback on GitHub.