
How Grover saves costs with 80% Spot in production using Karpenter with Amazon EKS

This post is co-written with Suraj Nair, Sr. DevOps Engineer at Grover.

Introduction

Grover is a Berlin-based global leader in technology rentals, enabling people and empowering businesses to subscribe to tech products monthly instead of buying them. As a pioneer in the circular economy, Grover’s business model of renting out and refurbishing tech products results in devices being used more frequently and for a longer period.

To date, Grover has circulated more than 1.2 million devices and is one of the fastest growing scaleups in Europe, serving millions of customers in Germany, the US, Austria, the Netherlands, and Spain. Grover runs and manages its microservices using Amazon Elastic Kubernetes Service (Amazon EKS), and this post discusses how we adopted Karpenter to optimize our AWS costs and reduce the overhead of Kubernetes node management.

Challenges

Initially, we used Kubernetes Cluster Autoscaler with Amazon EKS managed node groups to manage the nodes in our EKS cluster. As Grover grew to multiple engineering teams, we faced some challenges managing this infrastructure. There were only two large managed node groups hosting all the applications in our cluster, running mostly Spot instances and a small number of On-Demand nodes. This became a bottleneck for the data and risk engineering teams within Grover, who needed a diverse set of instance types and sizes to run services with different compute requirements without interruptions. To support the growing demand from the engineering teams for diverse instance types, sizes, and purchase options, we had to increase the number of node groups in the cluster. However, this added configuration and management overhead for controlling the type and number of instances in each node group.

We often observed underused nodes during normal business hours and had to over-provision during peak business hours and campaigns such as Black Friday and Cyber Monday. Because of how Cluster Autoscaler works, including the asynchronous nature of the Auto Scaling groups behind Amazon EKS managed node groups, the scale-up time did not meet our application performance demands. Additionally, we had to update the node groups regularly to use the latest Amazon EKS optimized Amazon Machine Images (AMIs) and keep track of the different versions.

How Karpenter works at Grover

Due to the challenges described previously, we decided to use Karpenter. We started using Karpenter NodePools to meet the requirements of individual departments and teams, adding taints and labels so that each service can target the right nodes in its Kubernetes manifests. We ended up with fewer NodePools than in our previous setup. However, as Grover's engineering teams and their requirements continued to grow, we added dedicated NodePools for the monitoring stack and a NodePool for GPU instances.
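As an illustration, a team-dedicated NodePool and a service opting into it could look like the following minimal sketch. The names (data-team, data-pipeline) and values are hypothetical, not our exact configuration; the taint keeps other workloads off these nodes, and the node label lets the service select them.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: data-team              # hypothetical team-dedicated NodePool
spec:
  template:
    metadata:
      labels:
        team: data             # node label that services select on
    spec:
      nodeClassRef:
        name: default
      taints:
      - key: team
        value: data
        effect: NoSchedule     # keep pods that don't tolerate the taint off these nodes
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: data-pipeline          # hypothetical service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: data-pipeline
  template:
    metadata:
      labels:
        app: data-pipeline
    spec:
      nodeSelector:
        team: data             # schedule only on the team's nodes
      tolerations:
      - key: team
        operator: Equal
        value: data
        effect: NoSchedule
      containers:
      - name: app
        image: public.ecr.aws/docker/library/busybox:latest
        command: ["sleep", "infinity"]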

In our Kubernetes cluster, KEDA triggers the Horizontal Pod Autoscaler (HPA) to scale out pods, and Karpenter monitors for unschedulable pods and adds the required compute capacity to our clusters. The majority of our workloads are stateless, fault-tolerant, and suitable for Spot instances. We also have stateful workloads, such as Jupyter Notebooks, that use Amazon Elastic Block Store (Amazon EBS) to store data and the Amazon EBS CSI driver to mount the volumes. For some of the critical workloads, we use Pod Disruption Budgets (PDBs) along with topology spread constraints, both of which are respected by Karpenter.
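As a small example, a PodDisruptionBudget for one of these critical services could look like the following sketch (the service name is hypothetical). Karpenter respects the budget and does not voluntarily disrupt more pods than it allows:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb                # hypothetical service
spec:
  minAvailable: 2                   # keep at least two replicas running during consolidation or node recycling
  selector:
    matchLabels:
      app.kubernetes.io/name: checkout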

Our EKS cluster now has only a single managed node group, which runs Karpenter and always-on services such as Argo CD and Kyverno. The rest of the nodes are managed by Karpenter through NodePools. We adopted Karpenter quite early (v0.16.1) and have been following the project eagerly. The following Karpenter NodePool configuration shows some of the important settings that we use.

To save costs, we wanted to use Amazon Elastic Compute Cloud (Amazon EC2) Spot instances for the majority of our workloads. Therefore, we specify both On-Demand and Spot capacity in our Karpenter NodePool (see capacity-type), a mix that was not possible to express in a single place in our previous setup and that further reduced the number of NodePools. Karpenter prioritizes Spot instances: it only attempts to provision On-Demand capacity if no Spot capacity is available. Karpenter also right-sizes instances through its built-in consolidation feature to save costs (see consolidationPolicy: WhenUnderutilized). We prefer not to use very large instances, which reduces pod density on a single machine and makes it easier to evict pods quickly and gracefully before a Spot node is reclaimed. We initially ran Karpenter without any instance size or type restrictions for a while, thoroughly tested our workloads, and ended up allowing only the instance types and sizes that suit them.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h # refresh nodes every 30 days
  template:
    metadata: {}
    spec:
      nodeClassRef:
        name: default
      requirements:
      # Allow both Spot and On-Demand capacity; Karpenter prioritizes Spot
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
        - on-demand
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      # Exclude burstable and local instance store families
      - key: karpenter.k8s.aws/instance-family
        operator: NotIn
        values:
        - t3
        - t3a
        - t4
        - t2
        - i3
        - i4i
        - i3en
        - c6id
        - m6id
        - r6id
        - d3
        - d3en
      - key: karpenter.k8s.aws/instance-hypervisor
        operator: In
        values:
        - nitro
      # Avoid very large instances to limit pod density per node
      - key: karpenter.k8s.aws/instance-size
        operator: In
        values:
        - large
        - xlarge
        - 2xlarge
        - 4xlarge
      - key: kubernetes.io/os
        operator: In
        values:
        - linux

To ensure reliability for our services, we additionally use Karpenter's built-in interruption handling. In our Helm chart values, we specify the name of the Amazon Simple Queue Service (Amazon SQS) queue on which Karpenter receives notification messages, such as Amazon EC2 Spot Instance interruption warnings. With that, Karpenter proactively drains the affected node so that its pods are rescheduled onto other nodes. The queue has to be provisioned in advance; we use AWS CloudFormation for that. There is a sample Helm chart in Karpenter's GitHub repository showing the interruptionQueue parameter.

settings:
  clusterName: *clusterName   # YAML alias to the cluster name defined elsewhere in our Helm values
  clusterEndpoint: https://xxxxxxxxxxxxxxxxxxx.eu-central-1.eks.amazonaws.com
  interruptionQueue: node-termination-handler
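
For reference, a minimal CloudFormation sketch of such an interruption queue could look like the following. The resource names are illustrative, and Karpenter's getting started documentation provides a more complete template that also covers additional events such as rebalance recommendations and instance state changes:

AWSTemplateFormatVersion: "2010-09-09"
Resources:
  KarpenterInterruptionQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: node-termination-handler   # must match the interruptionQueue Helm value
      MessageRetentionPeriod: 300           # interruption notices are only useful briefly
  KarpenterInterruptionQueuePolicy:
    Type: AWS::SQS::QueuePolicy
    Properties:
      Queues:
        - !Ref KarpenterInterruptionQueue
      PolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - events.amazonaws.com
                - sqs.amazonaws.com
            Action: sqs:SendMessage
            Resource: !GetAtt KarpenterInterruptionQueue.Arn
  SpotInterruptionRule:
    Type: AWS::Events::Rule
    Properties:
      EventPattern:
        source:
          - aws.ec2
        detail-type:
          - EC2 Spot Instance Interruption Warning
      Targets:
        - Id: KarpenterInterruptionQueueTarget
          Arn: !GetAtt KarpenterInterruptionQueue.Arn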

To periodically recycle nodes so that the cluster always runs the most recent AMIs, we force termination of all nodes after a specified time to live (TTL) in the NodePool:

disruption:
  expireAfter: 720h # refresh nodes every 30 days

Some of our workloads cannot be scaled down because of long-running processing, so we needed a way to tell Karpenter to avoid disrupting the nodes they run on. For that, Karpenter provides the karpenter.sh/do-not-disrupt annotation on the pod, which excludes the node the pod is running on from any consolidation process:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"

To reduce the blast radius of any outage while nodes are scaled up and down, we also use topology spread constraints in the Pod spec to spread pods across different nodes of a specific topology, such as Availability Zones and hosts. An example of this configuration can be seen in the following:

spec:
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: "topology.kubernetes.io/zone"
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: {{ include "servicename" . | lower | quote }}
      - maxSkew: 1
        topologyKey: "kubernetes.io/hostname"
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: {{ include "servicename" . | lower | quote }}

Migration from Cluster Autoscaler to Karpenter managed nodes

Migrating our Kubernetes services from nodes provisioned through Cluster Autoscaler to Karpenter-managed nodes was relatively stress free and involved no downtime. We started by deploying Karpenter alongside our existing Cluster Autoscaler, initially without any Karpenter NodePools, and ran both in parallel without issues. We then created the same number of Karpenter NodePools with the same taints and node labels as our managed node groups used with Cluster Autoscaler. This made sure that pods could schedule on either a Karpenter-managed node or a Cluster Autoscaler one, whichever was in Ready status. Next, we started to reduce the number of nodes in the Amazon EKS managed node groups, which kick-started pod eviction. Whenever a node became unavailable, Karpenter-managed nodes were ready to schedule pods faster than a new node could be spun up by Cluster Autoscaler. Because each service runs multiple replicas combined with topology spread constraints, there was no interruption to the overall performance of the services.

By repeating this in small steps, we reduced the desired size of all Amazon EKS managed node groups to 0, until eventually all the nodes running in the cluster were Karpenter managed. Finally, we removed Cluster Autoscaler and the Amazon EKS managed node group definitions from the cluster and adjusted the vCPU and memory limits for each NodePool as needed. We now have a single managed node group running Karpenter and the other operational tools mentioned previously. We tested and adopted Karpenter in our dev cluster in approximately 10 workdays. With the knowledge gained there, migrating our production cluster from Cluster Autoscaler to Karpenter took a single day.
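The vCPU and memory limits mentioned above are plain resource caps on each NodePool; once a limit is reached, Karpenter stops launching additional nodes from that NodePool. The values in the following fragment are placeholders, not our production numbers:

spec:
  limits:
    cpu: "200"      # placeholder: maximum total vCPUs this NodePool may provision
    memory: 800Gi   # placeholder: maximum total memory this NodePool may provision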

Results

Using Karpenter, we were able to reduce our Amazon EKS cluster management overhead significantly and gained the following benefits.

We have experienced 25% higher Spot usage in all our Amazon EKS clusters since adopting Karpenter, without affecting platform performance. Spot usage now hovers around 70-80%. The following image shows a Grafana dashboard on a normal business day.

Spot node percentage hovers around 70-80% in our EKS cluster.

Supporting our cost optimization goals, since we adopted Karpenter in all of our Kubernetes clusters we have seen an increased ratio of Spot to On-Demand instances in our overall Amazon EC2 costs. Karpenter's flexibility to specify both purchase options in NodePools, its consolidation feature, and the Amazon EC2 Spot allocation strategy all help reduce costs: Karpenter uses the lowest-price allocation strategy for On-Demand instances and the price-capacity-optimized allocation strategy for Spot instances.

The ratio of Spot vs. On-Demand instances in our total EC2 costs.

We specify both Amazon EC2 On-Demand and Spot instances, and we benefit from Karpenter's Spot prioritization. The following image shows the distribution of pod disruptions on On-Demand and Spot instances. As can be seen in the diagram, pods on On-Demand nodes are more stable and experience fewer disruptions than pods on Spot instances.

Distribution of pod disruptions on Spot vs. On-Demand instances.

Conclusion

Using Karpenter at Grover, we significantly reduced our Amazon EKS cluster management overhead while achieving Amazon EC2 cost savings in the long run. With Karpenter, we increased our Spot usage by 25% and now run around 80% Spot instances in production, and our customers have a better experience during seasonal high demand such as Black Friday. Karpenter also removed the operational overhead we previously had managing node groups and Cluster Autoscaler, and it proved faster at adding new nodes to the cluster, making our scale-up operations smoother than ever. Following a recent evaluation, we are now using Amazon Managed Service for Prometheus for monitoring. Next, we want to strengthen our AppSec process to improve our security posture.

Grover Team

"Headshot

Suraj Nair, Grover

Suraj Nair is a Sr. DevOps Engineer at Grover Group GmbH and is passionate about working on solutions that enable teams to work efficiently and at scale. When not working, he is interested in travelling and exploring places.

Mehdi Yosofie

Mehdi Yosofie is a Solutions Architect at Amazon Web Services and supports startups in scaling and growing on AWS. He provides guidance to AWS customers on their workloads across a variety of AWS technologies. He is part of the AWS Containers Technical Field Community.