Amazon EKS and Spot Instances in action at Delivery Hero

This post was coauthored by Christos Skevis, Senior Engineering Manager, Delivery Hero; Giovanny Salazar, Senior Systems Engineer, Delivery Hero; Miguel Mingorance, Senior Systems Engineer at Delivery Hero at the time the blog post was written; Cristian Măgherușan-Stanciu, Senior Specialist Solutions Architect, Flexible Compute, AWS; and Sascha Möllering, Principal Specialist Solutions Architect, Containers, AWS.

This post shows how to set up a production-grade Amazon Elastic Kubernetes Service (Amazon EKS) cluster with Amazon Elastic Compute Cloud (Amazon EC2) Spot managed node groups. We will also share how Delivery Hero, one of the largest food delivery companies, is using Amazon EKS and Amazon EC2 Spot Instances in production to keep costs under control in spite of massive business growth observed during the COVID-19 pandemic.

Kubernetes is a popular container orchestration platform that many companies have adopted over the last few years. In the early days, many AWS users were managing the entire Kubernetes infrastructure themselves, which required significant operational effort. Based on your feedback, we released multiple services and features that made it much easier to run Kubernetes on AWS, taking over the bulk of this operational work so you can focus on building your applications and serving your own customers.

In 2018, we released Amazon EKS, which offers fully managed, secure, and highly available upstream Kubernetes cluster control planes provisioned across multiple Availability Zones, including logging and least-privilege access support on the pod level and many other operational best practices. In 2019, we released managed node groups so that Amazon EKS would also provision and manage the underlying Amazon EC2 instances that provide compute capacity to the Amazon EKS clusters, further simplifying operational activities such as Kubernetes version deployments.

Before managed node groups, once your Amazon EKS control plane was created, you would need to use eksctl, AWS CloudFormation, or other tools to create and manage the Amazon EC2 instances for your cluster. For managed node groups, we’ve extended the Amazon EKS API to also natively manage the Kubernetes data plane. Node groups are primary components in the Amazon EKS console, which makes it easier to manage and visualize the infrastructure used to run your cluster in a single place.

In December 2020, we continued by releasing support for running Amazon EC2 Spot Instances in Amazon EKS managed node groups, which reduce your Kubernetes worker node costs by up to 90% by tapping into spare Amazon EC2 capacity pools. Spot managed node groups take care of most of the work previously required to adhere to Spot best practices for Amazon EKS and automatically handle the Spot interruptions without any impact on well-designed containerized applications. All that was left for you to do was to diversify the instance types configured on the managed node group by providing a list of multiple instance types so that the group could tap into multiple capacity pools, which is critical for all Spot Instances workloads.

Considering our wide range of over 475 instance types, this was easier said than done. So in early 2021, we also simplified this remaining step by automatically integrating the Amazon EC2 Instance Selector open-source library into the eksctl command line tool so that you only needed to size the instances in the managed node group. The Spot-friendly diversified instance selection is now fully automated when creating managed node groups from the command line using recent versions of eksctl. The widest possible set of instance types across families and generations is automatically selected for the desired size.

To handle Spot Instance interruptions on self-managed node groups, users also needed to run additional automation tools on the cluster, such as the AWS Node Termination Handler. The Amazon EKS managed node group handles Spot Instance interruptions automatically without any additional tools. The underlying Amazon EC2 Auto Scaling group is opted in to Capacity Rebalancing. This means that when one of the Spot Instances in your node group is at an elevated risk of interruption and gets an Amazon EC2 instance rebalance recommendation, the Amazon EC2 Auto Scaling group will attempt to launch a replacement instance. The more instance types are configured in the managed node group, the more chances Amazon EC2 Auto Scaling has of launching a replacement Spot Instance immediately to gracefully handle the interruption. Just as with On-Demand managed node groups, Amazon EKS will automatically drain the pods from the instances that are being scaled in or terminated, as illustrated in this diagram.

Figure 1: Amazon EKS managed node group Capacity Rebalancing

Prerequisites

For this blog post, we create our Amazon EKS cluster with eksctl, which is a command line interface (CLI) tool for creating and managing clusters on Amazon EKS. It is written in Go, uses AWS CloudFormation, was created by Weaveworks, and welcomes contributions from the community.

Creating the Amazon EKS cluster

In this section, we show how to create an Amazon EKS cluster using eksctl. We will also include an On-Demand managed node group that could host any non-interruptible workloads you might have, such as the Kubernetes Cluster Autoscaler:

eksctl create cluster --name simple-cluster \
    --region eu-west-1 --nodegroup-name=ng-on-demand \
    --managed --node-type=t3.small

After successful creation of the cluster, we can confirm our On-Demand nodes have been created:

kubectl get nodes

NAME STATUS ROLES AGE VERSION
ip-192-168-13-24.eu-west-1.compute.internal Ready <none> 13m v1.20.4-eks-6b7464
ip-192-168-53-176.eu-west-1.compute.internal Ready <none> 13m v1.20.4-eks-6b7464
ip-192-168-79-142.eu-west-1.compute.internal Ready <none> 13m v1.20.4-eks-6b7464
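
The same cluster could alternatively be described declaratively. The following is a minimal sketch, assuming eksctl's ClusterConfig file format; the file name cluster.yaml is our own choice for illustration, and the eksctl invocation is left as a comment so that running the snippet only writes the file:

```shell
# Write a minimal eksctl ClusterConfig that mirrors the flags used above.
# The file name "cluster.yaml" is an arbitrary choice for this sketch.
cat > cluster.yaml <<'EOF'
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: simple-cluster
  region: eu-west-1

managedNodeGroups:
  # On-Demand managed node group for non-interruptible workloads
  - name: ng-on-demand
    instanceType: t3.small
EOF

# To create the cluster from this file, you would then run:
#   eksctl create cluster --config-file=cluster.yaml
```

Keeping the definition in a file makes it easier to review and version the cluster configuration alongside the rest of your infrastructure code.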

Adding the Spot Instances

We now create a Spot managed node group. We will use eksctl to launch new nodes running on Spot Instances that will connect to the Amazon EKS cluster, using the new instance-selector integration to automatically diversify across a wide range of instance types without having to explicitly specify any instance types:

eksctl create nodegroup --name=ng-spot \
    --region=eu-west-1 \
    --cluster=simple-cluster \
    --managed --spot \
    --instance-selector-vcpus=2 \
    --instance-selector-memory=4 \
    --instance-selector-cpu-architecture=x86_64

The Spot managed node group we created follows the Spot best practices, including using the capacity-optimized allocation strategy as the spotAllocationStrategy, which will launch instances from the Spot Instance pools with the most available capacity, aiming to decrease the number of Spot Instance interruptions in our cluster. It’s recommended to isolate On-Demand and Spot capacity into separate Amazon EC2 Auto Scaling groups. This is preferred over using a base capacity strategy because the scheduling properties are fundamentally different. Since Spot Instances can be interrupted at any time (when Amazon EC2 needs the capacity back), users will often taint their preemptible nodes, requiring an explicit pod toleration to the preemption behavior. These taints result in different scheduling properties for the nodes, so they should be separated into multiple Amazon EC2 Auto Scaling groups.
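
The Spot node group can be expressed in a config file as well. This is a sketch under the assumption that your eksctl version supports the instanceSelector field for managed node groups; the file name spot-nodegroup.yaml is hypothetical, and the values simply mirror the command line flags shown earlier:

```shell
# Write an eksctl config describing a Spot managed node group that relies on
# the instance selector instead of an explicit list of instance types.
# The file name "spot-nodegroup.yaml" is an arbitrary choice for this sketch.
cat > spot-nodegroup.yaml <<'EOF'
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: simple-cluster
  region: eu-west-1

managedNodeGroups:
  - name: ng-spot
    spot: true                  # request Spot capacity for this node group
    instanceSelector:
      vCPUs: 2                  # same sizing as --instance-selector-vcpus=2
      memory: 4GiB              # same sizing as --instance-selector-memory=4
      cpuArchitecture: x86_64
EOF

# To add the node group to the existing cluster, you would then run:
#   eksctl create nodegroup --config-file=spot-nodegroup.yaml
# Recent eksctl versions also accept --dry-run, which should print the
# expanded list of matching instance types without creating anything.
```

Separating the Spot node group into its own definition also reflects the recommendation above to keep On-Demand and Spot capacity in separate Amazon EC2 Auto Scaling groups.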

Spot managed node groups also create the eks.amazonaws.com/capacityType label and set it to SPOT for their nodes, which we can use to query the Spot nodes we just created:

kubectl get nodes \
    --label-columns=eks.amazonaws.com/capacityType \
    --selector=eks.amazonaws.com/capacityType=SPOT

NAME STATUS ROLES AGE VERSION CAPACITYTYPE
ip-192-168-55-103.eu-west-1.compute.internal Ready <none> 17m v1.20.4-eks-6b7464 SPOT
ip-192-168-94-138.eu-west-1.compute.internal Ready <none> 17m v1.20.4-eks-6b7464 SPOT

This label can then be used to identify which nodes are Spot Instances and, similarly, which are On-Demand Instances, so we can schedule the appropriate workloads to run on Spot Instances. In 2021, we also released support for using Kubernetes node taints on Amazon EKS managed node groups, so you can also use this additional mechanism to control the scheduling of your application on Spot capacity.
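
To illustrate how the label and taints can work together, here is a minimal sketch of a Deployment that is pinned to Spot nodes. The application name, container image, and the taint key spotInstance are hypothetical and only serve as placeholders; the toleration is only needed if you chose to taint your Spot node group with a matching key:

```shell
# Write a Deployment manifest that schedules pods onto Spot nodes only.
# "spot-friendly-app" and the "spotInstance" taint key are hypothetical names.
cat > spot-deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spot-friendly-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spot-friendly-app
  template:
    metadata:
      labels:
        app: spot-friendly-app
    spec:
      # Label set automatically by Spot managed node groups
      nodeSelector:
        eks.amazonaws.com/capacityType: SPOT
      # Only required if the Spot nodes carry a matching taint
      tolerations:
        - key: spotInstance
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: app
          image: public.ecr.aws/nginx/nginx:latest   # placeholder image
EOF

# To deploy it, you would then run:
#   kubectl apply -f spot-deployment.yaml
```

Workloads that must not be interrupted could use the same mechanism in reverse, selecting nodes where the label is set to ON_DEMAND.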

Introducing Karpenter

The standard Kubernetes Cluster Autoscaler expects the instances within a given managed node group to have the same size. This limits how far a single node group can diversify over multiple instance types, a practice we recommend for Spot users. To compensate for this reduced diversification, multiple node groups might need to be attached to the same cluster, which increases complexity, especially at scale. Our new Karpenter open-source project offers an alternative node provisioning mechanism that avoids this complexity. Whenever Kubernetes needs additional capacity for scheduling pods, Karpenter directly launches Amazon EC2 instances instead of adjusting the capacity of node groups. When used with Spot Instances, Karpenter uses the capacity-optimized allocation strategy over a wide range of instance types for increased Spot Instance diversification. This way, Karpenter can help you heavily diversify while avoiding the complexity of using multiple managed node groups. For a hands-on deep dive into Karpenter, please refer to our Karpenter workshop.

Cleanup

In order to save costs, we can easily delete all existing resources using a single command:

eksctl delete cluster --name simple-cluster --region eu-west-1

Spot at Delivery Hero

Delivery Hero is the world’s leading local delivery platform, operating in more than 40 countries across 4 continents. In addition to food delivery, Delivery Hero is currently pioneering “quick commerce,” the next generation of e-commerce, by aiming to bring groceries and household goods to customers in as little as 10 to 15 minutes. They have been using AWS for many years and have experienced the benefits of running their workloads on the cloud.

In the last two years, Delivery Hero has been growing at a very fast pace, which initially caused a similar increase in cloud costs that, in certain areas, was also inflated by tech debt. The collaboration between Delivery Hero’s tech department and their AWS account manager led to solutions that helped keep these costs under control. One of the most impactful measures was the introduction of Spot Instances in their Amazon EKS platform.

Delivery Hero manages its cloud infrastructure with Terraform and has been historically using self-managed node groups. Over time, they gradually adopted Spot Instances in their existing setup while following Amazon EC2 Spot best practices to avoid any visible customer impact. The first step was to implement the capacity-optimized allocation strategy, followed by a few additional Delivery Hero internal standards and requirements.

The team had to fulfill the following main requirements:

Cluster-overprovisioner must be in place to ensure enough compute capacity is available while the traffic and the cluster rapidly grow.
In the rare event that no Spot Instances are available, the cluster must automatically fall back to On-Demand nodes.
Particular applications must have the option to keep running on On-Demand nodes (for example, CoreDNS).
The percentage of Spot Instances that run in each cluster must be controllable.
Spot Instance types must be similarly sized and load tested before selection.

Delivery Hero combined different Kubernetes-native mechanisms alongside Amazon EKS self-managed node group features, such as multiple instance types per launch template. Cluster-overprovisioner, cluster-autoscaler priority expander, customized Amazon EC2 user data, and multiple Amazon EC2 Auto Scaling groups with multiple instance types are used in conjunction with Kubernetes node taints to satisfy their business requirements and provide the best possible user experience.
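
As an illustration of the fallback requirement, the Cluster Autoscaler priority expander is configured through a ConfigMap that maps priorities to regular expressions matched against node group names. The following is a generic sketch and not Delivery Hero's actual configuration; the name patterns are hypothetical, and it assumes the Cluster Autoscaler runs in the kube-system namespace with the --expander=priority option:

```shell
# Write a ConfigMap for the Cluster Autoscaler priority expander.
# Higher numbers win, so groups matching ".*spot.*" are tried first and
# groups matching ".*on-demand.*" act as the fallback.
# The name patterns are hypothetical and must match your own group names.
cat > priority-expander.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - .*on-demand.*
    50:
      - .*spot.*
EOF

# To apply it, you would then run:
#   kubectl apply -f priority-expander.yaml
```

With a setup along these lines, the autoscaler would prefer scaling out the Spot groups and only turn to the On-Demand groups when the Spot groups cannot provide the requested capacity.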

Further details of Delivery Hero’s current implementation of Spot Instances in their Amazon EKS clusters are available on the Delivery Hero blog. They are also currently looking into adopting Karpenter to simplify their current setup.

Conclusion

In this blog post, we presented the evolution of Spot Instances support for Amazon EKS towards increasingly simple, easy-to-use, and robust constructs that allow you to use Spot Instances with minimal effort. We showed you how easy it is to use Amazon EKS with Spot managed node groups with the recent eksctl integration, but due to Kubernetes Cluster Autoscaler constraints, it can still lead to relatively complex configurations at scale if you want to widely diversify. We then explained how the new Karpenter project can be used to further simplify such a setup going forward.

Lastly, we covered how Delivery Hero has been very successful at keeping costs under control using Spot Instances in their Amazon EKS platform and is currently undergoing a similar simplification journey. We’ve seen many of our users on similar Spot Instances adoption and simplification journeys, and we are closely following the developments in this space.

If you’re also somewhere on this path or are considering adopting Spot Instances in your Amazon EKS clusters, please refer to our Amazon EC2 Spot Instances workshops, where AWS has detailed, hands-on instructions that should help you get started.

Christos Skevis, Senior Engineering Manager, Delivery Hero

Christos Skevis comes from a family of olive oil producers in Greece, so you can trust that he knows about good cloud infrastructure. Starting his career as a humble web developer in 2007, Christos moved up to Principal Systems Engineer in DH in 2017 and, in 2018, moved into Team Management. He currently leads the Infrastructure Domain and its three teams in DH and is devoted to tooling, automation, and operations of user-facing platforms in the EU and APAC regions.

Giovanny Salazar, Senior Systems Engineer, Delivery Hero

Giovanny is a Colombian who loves to dance salsa and play online games in his free time. He is an Electrical and Electronic Engineer who worked in the field of app automation in South America until 2019, when he moved into infrastructure engineering for central DH in Berlin. Now he is working on optimization and management of the customer-facing infrastructure that serves more than 3M orders per day.

Miguel Mingorance, Senior Systems Engineer, Delivery Hero at the time the blog post was written

Miguel Mingorance is an excellent football player who enjoys building cloud infrastructure as much as running after the ball. Like any other Spaniard, he enjoys having a good time with some tapas and friends. He has been working on the platform’s infrastructure for more than 6 years, starting his career as a System Administrator at Delivery Hero. Following his passion for technology, he dived into the cloud and Kubernetes ecosystems, where within a few years, he became a Senior Systems Engineer who creates solutions and contributes to the whole Kubernetes community.
