Containers

Using Amazon EC2 Spot Instances with Karpenter

October 2025: This post was updated for accuracy.

Overview

Karpenter is a high-performance, open-source cluster autoscaling solution for Kubernetes, introduced at re:Invent 2021. Customers choose an autoscaling solution for a number of reasons, including improving the availability and reliability of their workloads while reducing costs. EC2 Spot Instances let customers reduce costs by up to 90% compared to On-Demand prices. Spot Instances are created from spare EC2 capacity and can be interrupted, which means workloads running on them must be fault tolerant, flexible, and ideally stateless. Because containers are meant to be immutable and ephemeral, they are a natural fit for EC2 Spot. By combining a high-performance cluster autoscaler like Karpenter with EC2 Spot Instances, Amazon Elastic Kubernetes Service (Amazon EKS) clusters can acquire compute capacity within minutes while keeping costs low.

In this blog post, you will learn how to use Karpenter with EC2 Spot Instances and handle Spot Instance interruptions.

Getting started

To get started with Karpenter in AWS, you need a Kubernetes cluster. You will use an Amazon EKS cluster throughout this blog post. To provision an Amazon EKS cluster and install Karpenter, follow the Getting Started guide in the Karpenter documentation.

Karpenter’s single responsibility is to provision compute capacity for your Kubernetes clusters, configured through a custom resource called NodePool. When a pod is newly created, for example by the Horizontal Pod Autoscaler (HPA), kube-scheduler is responsible for finding the best feasible node so that kubelet can run it. If none of the scheduling criteria are met, the pod stays in a Pending state and remains unscheduled. Karpenter relies on kube-scheduler: it watches for pods marked unschedulable and then provisions new node(s) to accommodate them.
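For example, scaling out a simple deployment whose CPU requests exceed the free capacity on existing nodes leaves pods pending, which Karpenter then acts on. A minimal sketch (the deployment name, replica count, and requests are illustrative; the pause image is commonly used for such tests):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate               # hypothetical name
spec:
  replicas: 5
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.9
          resources:
            requests:
              cpu: "1"        # 1 vCPU per pod forces a scale-up once existing nodes fill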

Diversification and flexibility are important when using Spot Instances: be flexible across instance types, instance sizes, Availability Zones, and even Regions. Being as flexible as possible gives Karpenter a wider choice of spare capacity pools to choose from and therefore reduces the risk of interruption. The following code snippet shows an example of a Spot NodePool configuration specifying some constraints:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["nano", "micro", "small", "medium"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]          # ["spot","on-demand"]

The configuration is restricted to smaller “c”, “m”, and “r” instances but is still diversified as much as possible. For example, this might be needed in scenarios where you deploy observability DaemonSets.

Node selection

Karpenter makes provisioning decisions based on the scheduling constraints defined on each pending pod. Karpenter retrieves the pending pods in batches and bin-packs them based on CPU and memory to find the most efficient instance type (that is, the smallest instance type that fits). Karpenter selects a list of instance types, within the diversified range defined in the NodePool, that can fit the pod batch and passes them to the Amazon EC2 Fleet API. EC2 Fleet then uses an allocation strategy to select the EC2 instance to launch.

Karpenter uses the price-capacity-optimized (PCO) allocation strategy when calling EC2 Fleet to launch the optimal EC2 instance. For Spot Instances, the PCO strategy first considers the Spot Capacity Pools with the highest capacity availability, to reduce the frequency of Spot interruptions, and then the lowest Spot price, to optimize costs. A Spot Capacity Pool is a set of unused EC2 instances with the same instance type (for example, m5.large) in the same Availability Zone. For On-Demand Instances, Karpenter uses the lowest-price allocation strategy, which launches the cheapest instance type.

Capacity Type

When creating a NodePool, you can allow Spot, On-Demand, or both capacity types. If you allow both and a pod does not explicitly specify which one it needs, Karpenter opts for Spot when provisioning a node.
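If a pod does need a specific capacity type, it can select one through the well-known karpenter.sh/capacity-type node label. A sketch (pod name and image are hypothetical) that pins a workload to On-Demand capacity:

apiVersion: v1
kind: Pod
metadata:
  name: critical-app          # hypothetical name
spec:
  nodeSelector:
    karpenter.sh/capacity-type: on-demand   # keep this pod off Spot nodes
  containers:
    - name: app
      image: public.ecr.aws/eks-distro/kubernetes/pause:3.9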

If the EC2 Fleet API returns an insufficient capacity error for Spot instances, Karpenter temporarily removes those specific Spot Capacity Pools from consideration across the entire NodePool. This approach allows Karpenter to continue making progress by automatically exploring alternative Pools to ensure capacity and lowest prices, hence the importance of diversifying the instance types and Availability Zones.

To configure Spot as the capacity type, add this constraint in the NodePool’s requirements block:

  requirements:
    - key: "karpenter.sh/capacity-type" 
      operator: In
      values: ["spot"]     # ["spot","on-demand"] if you want both capacity types

You can check which instance type and capacity have been launched by executing:

kubectl -n karpenter logs -l app.kubernetes.io/name=karpenter

You should see a “created nodeclaim” message that lists instance types that can fit your Pods, and a “launched nodeclaim” message indicating the instance type selected. A NodeClaim is a Karpenter resource that represents a request for capacity and manages the lifecycle of a Kubernetes node with the underlying cloud provider.

In the following example, Karpenter provides a diverse list of potential instance types that meet the unschedulable pod’s requirements. The instances were sent to the EC2 Fleet API, and the optimal one was selected to be launched:

... "message":"created nodeclaim",..."instance-types":"m5.2xlarge, m6g.4xlarge, m7g.2xlarge, c5.xlarge, c6i.2xlarge and 55 other(s)"}
... "message":"launched nodeclaim",..."instance-type":"m6g.2xlarge","zone":"us-west-2c","capacity-type":"spot",...}}

Resiliency

Karpenter handles Spot Instance interruptions natively: it automatically cordons and drains the node ahead of the interruption. As soon as Karpenter receives the EC2 Spot interruption warning, which signals that Amazon EC2 will reclaim the instance in two minutes, it launches a replacement node from the NodePool.

To improve resiliency in your cluster, you should follow Spot best practices and diversify the Spot Capacity Pools. Remember that a Spot Capacity Pool is a set of unused EC2 instances with the same instance type and Availability Zone. Therefore, to diversify your NodePool, you should be flexible about the instance types and Availability Zones you use.

Providing a diverse and flexible list in the NodePool that meets your compute requirements will help the EC2 Fleet choose the optimal EC2 Spot instance. The decision is based on the Price Capacity Optimized strategy explained above, which considers first the lowest chance of interruption and then the lowest price.

To enable the Spot interruption-handling function, you need to create an Amazon SQS queue and Amazon EventBridge rules and targets that forward interruption events from AWS services to the SQS queue. Karpenter provides a CloudFormation template for provisioning this infrastructure in the Getting Started Guide. Then, configure the --interruption-queue-name CLI argument with the name of the provisioned queue so that Karpenter can handle interruption events.
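Depending on how you deploy Karpenter, the interruption queue can also be set through the Helm chart values. A hedged values.yaml sketch, assuming a queue named after the cluster as in the Getting Started Guide (the exact settings key may vary by Karpenter version):

settings:
  interruptionQueue: my-cluster   # hypothetical queue name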

Another useful Karpenter feature for EC2 Spot Instances is consolidation. By default, Karpenter sets the consolidationPolicy to WhenEmptyOrUnderutilized, which automatically marks empty or underutilized nodes as consolidatable. Karpenter has two consolidation mechanisms: delete, if all of a node's pods can run on other nodes in the cluster, and replace, if the node can be replaced by a smaller, cheaper one. You can modify the consolidation behavior for your NodePools in the disruption block as below:

spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # WhenEmpty | WhenEmptyOrUnderutilized
  template:
    spec:
      expireAfter: 720h

Karpenter versions prior to v0.34.0 only supported replacement consolidation for On-Demand Instances; for Spot Instances, only deletion consolidation was enabled by default. Since v0.34.0, you can enable a feature gate to use Spot-to-Spot consolidation. You can read more about this in the Applying Spot-to-Spot consolidation best practices with Karpenter blog post.
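If you are on v0.34.0 or later, the feature gate can be enabled through the Helm chart values. A hedged sketch (the exact key and its availability depend on your Karpenter version):

settings:
  featureGates:
    spotToSpotConsolidation: true   # allows replace consolidation between Spot nodes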

Handling SIGTERM signals is also a best practice when dealing with any kind of container interruption. When an interruption is about to happen, Kubernetes sends a SIGTERM signal to the main process (PID 1) of each container in the Pod being evicted. It then waits for a grace period (30 seconds by default) to allow a graceful shutdown before sending the final SIGKILL signal that terminates the containers. Therefore, to ensure your processes terminate gracefully, you should handle the SIGTERM signal properly.
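The grace period can be tuned per pod, and a preStop hook can delay SIGTERM delivery to drain in-flight work. A sketch with a hypothetical pod name and image:

apiVersion: v1
kind: Pod
metadata:
  name: graceful-app                  # hypothetical name
spec:
  terminationGracePeriodSeconds: 45   # default is 30 seconds
  containers:
    - name: app
      image: my-app:latest            # hypothetical image; its main process must handle SIGTERM
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]   # runs before SIGTERM, buying time to drain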

Monitoring

Spot interruptions can occur at any time. Monitoring Kubernetes cluster metrics and logs helps you create notifications when Karpenter fails to acquire capacity. Set up adequate monitoring at the cluster level for all Kubernetes objects, and monitor the Karpenter NodePool. In this post, you will use Prometheus and Grafana to collect metrics for the Kubernetes cluster and Karpenter, and CloudWatch Logs to collect the logs.

To get started with Prometheus and Grafana on Amazon EKS, follow the Prometheus and Grafana installation instructions from the Karpenter getting started guide. Grafana comes preinstalled with dashboards for controller metrics, node metrics, and pod metrics.

Using the Pod Phase panel included in the pre-built Grafana dashboard named Karpenter Capacity, you can check for pods that have been in Pending status for longer than a predefined period (for example, 3 minutes). This helps you understand whether any workloads cannot be scheduled.
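You can also alert on this condition directly from Prometheus. A hypothetical alerting-rule sketch, assuming kube-state-metrics exposes the kube_pod_status_phase metric in your cluster:

groups:
  - name: karpenter-capacity
    rules:
      - alert: PodsPendingTooLong
        expr: sum(kube_pod_status_phase{phase="Pending"}) > 0
        for: 3m                        # matches the 3-minute threshold above
        labels:
          severity: warning
        annotations:
          summary: Pods Pending for over 3 minutes; Karpenter may be unable to provision capacity.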

Karpenter controller logs can be sent to CloudWatch Logs using either Fluent Bit or Fluentd. (Here’s information on how to get started with CloudWatch Logs for Amazon EKS.) To view the Karpenter controller logs, go to the log group /aws/containerinsights/cluster-name/application and search for Karpenter.

In the log stream, search for Provisioning failed messages in the Karpenter controller logs to find any provisioning failures. The example below shows a provisioning failure due to reaching the account limit for Spot Instances.

2021-12-03T23:45:29.257Z        ERROR   controller.provisioning Provisioning failed, launching capacity, launching instances, with fleet error(s), UnfulfillableCapacity: Unable to fulfill capacity due to your request configuration. Please adjust your request and try again.; MaxSpotInstanceCountExceeded: Max spot instance count exceeded; launching instances, with fleet error(s), MaxSpotInstanceCountExceeded: Max spot instance count exceeded   {"commit": "6984094", "provisioner": "default"

Clean up

To avoid incurring any additional charges, don’t forget to clean up the resources you created. If you followed the getting started docs from the Karpenter documentation, check the “Delete the cluster” section.

Conclusion

In this blog post, you learned about Karpenter and how you can use EC2 Spot Instances with Karpenter to scale the compute needs in an Amazon EKS cluster. You can check out the Further Reading section below to discover more about Karpenter.

Further Reading