
Amazon EKS now supports provisioning and managing EC2 Spot Instances in managed node groups

This post was contributed by Ran Sheinberg, Principal Solutions Architect, and Deepthi Chelupati, Sr. Product Manager.

Amazon Elastic Kubernetes Service (Amazon EKS) makes it easy to run upstream, secure, and highly available Kubernetes clusters on AWS. In 2019, support for managed node groups was added, with EKS provisioning and managing the underlying EC2 instances (worker nodes) that provide compute capacity to EKS clusters. This greatly simplified operational activities such as rolling updates for new AMIs or Kubernetes version deployments. If you’re not familiar with EKS managed node groups, we recommend reading the announcement blog post and the documentation. In response to customer requests on the AWS public containers roadmap, EKS further enhanced the managed node groups experience with features such as specifying custom AMIs and using launch templates. Similarly, customers have expressed strong interest in using EKS managed node groups for launching and managing Spot Instances.

Amazon EC2 Spot Instances allow AWS customers to run EC2 instances at steep discounts by tapping into EC2 spare capacity pools. Spot Instances can be interrupted with a two-minute notification when EC2 needs the capacity back. Using Spot Instances as Kubernetes worker nodes is an extremely popular usage pattern for workloads such as stateless API endpoints, batch processing, ML training workloads, big data ETLs using Apache Spark, queue processing applications, and CI/CD pipelines. For example, running a stateless API service on Kubernetes is a great fit for using Spot Instances as worker nodes, because pods can be gracefully terminated and replacement pods will be scheduled on other worker nodes when Spot Instances are interrupted.

Starting today, customers can use Spot Instances in EKS managed node groups. This enables you to take advantage of the steep savings that Spot Instances provide for your interruption-tolerant containerized applications. Using EKS managed node groups with Spot Instances requires significantly less operational effort compared to using self-managed nodes. In addition to launching Spot Instances in managed node groups, it is now also possible to specify multiple instance types in EKS managed node groups.

This post provides an overview of using EKS managed node groups with Spot Instances, a look under the hood at how managed node groups automatically apply Spot Instances best practices, and a tutorial that shows how to set up an EKS cluster with managed node groups.

Moving from self-managed nodes to EKS managed node groups with Spot Instances

Previously, customers had to run Spot Instances as self-managed worker nodes in their EKS clusters. This meant doing some heavy lifting such as building and maintaining configuration for Spot Instances in EC2 Auto Scaling groups, deploying a tool for handling Spot interruptions gracefully, deploying AMI updates, and updating the kubelet version running on their worker nodes. Now, all you need to do is supply a single parameter to indicate that a managed node group should launch Spot Instances, and provide multiple instance types for the underlying EC2 Auto Scaling group to use.

Before you start using managed node groups with Spot Instances, there are a few best practices to keep in mind. First, remember to use Spot Instances only for fault-tolerant applications. Second, to enhance the availability of your applications when using Spot Instances, we recommend using more than one instance type when creating a managed node group. Use multiple EC2 instance type generations and variants for your workload. Lastly, to further enhance availability, set up your workload to use all Availability Zones. Note that if your pods are using EBS volumes via PersistentVolumeClaim today, consider persisting the data to external storage such as Amazon Elastic File System, which works across Availability Zones and allows you to run multi-AZ managed node groups.
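As a rough sketch of what the EFS approach could look like, the following manifest assumes the Amazon EFS CSI driver is already installed in your cluster, and fs-12345678 is a placeholder for your EFS file system ID:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi        # required by the API; EFS itself is elastic
  accessModes:
    - ReadWriteMany     # EFS can be mounted by pods across AZs
  storageClassName: efs-sc
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-12345678   # placeholder: your EFS file system ID

Pods in any Availability Zone can then claim this volume through a PersistentVolumeClaim that references the efs-sc storage class.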

When selecting which instance types to use, you can consider all instance types that supply a certain amount of vCPUs and memory to the cluster, and group those in each node group. Although you might see some performance variability as your workload runs on different instance types, this flexibility is key to provisioning and maintaining your desired capacity while benefiting from steep cost savings. For example, a single node group can be configured with: m5.xlarge, m4.xlarge, m5a.xlarge, m5d.xlarge, m5ad.xlarge, m5n.xlarge, and m5dn.xlarge. These instance types supply almost identical vCPU and memory capacity, which is important for the Kubernetes cluster autoscaler to efficiently scale the node group. The EC2 Instance Selector is an open source tool that can help you find suitable instance types with a single CLI command (as seen in the tutorial section of this blog).

The following is an example configuration of EKS managed node groups:

[Diagram: an EKS cluster with one on-demand managed node group for operational tools, and multiple Spot managed node groups, each diversified with similarly sized instance types, spanning Availability Zones]

Note that the cluster has one on-demand EKS managed node group for cluster management and operational tools. This is where customers run workloads like datastores, monitoring tools, and any applications that are not interruption tolerant and need to remain available. The cluster also has multiple EKS managed node groups with Spot Instances, each diversified with similarly sized instance types, in order to tap into multiple Spot capacity pools.

EKS managed node groups with Spot Instances: a look under the hood

In this section, we touch on how EKS configures the underlying EC2 Auto Scaling groups with Spot Instances best practices, as well as manages the Spot worker node lifecycle. The following are default and non-modifiable configurations:

  • The allocation strategy will be configured as capacity-optimized. This means that every time the node group scales out (due to the Kubernetes cluster autoscaler increasing the node group size, manual size increases, or other automation you may use), instances are launched from the most-available capacity pools. This works to decrease the number of Spot interruptions in the node group, and to increase the resilience of the application.
  • To handle Spot interruptions, you do not need to install any extra automation tools on the cluster, such as the AWS Node Termination Handler. The managed node group handles Spot interruptions for you in the following way: the underlying EC2 Auto Scaling group is opted in to Capacity Rebalancing, which means that when one of the Spot Instances in your node group is at elevated risk of interruption and receives an EC2 instance rebalance recommendation, the Auto Scaling group attempts to launch a replacement instance. The more instance types you configure in the managed node group, the better the chances EC2 Auto Scaling has of launching a replacement Spot Instance. See the EKS user guide for more details.
  • As with on-demand managed node groups, if the group needs to rebalance capacity between Availability Zones, it automatically drains the pods from the instances that are being scaled in.
  • The nodes will be labeled with eks.amazonaws.com/capacityType=SPOT so you can easily point your fault-tolerant and stateless workloads to run on Spot Instances using node selectors. You can also use affinity and anti-affinity rules to achieve this, as shown in the sketch after this list.
  • EC2 Auto Scaling groups launched as managed node groups with Spot Instances or on-demand instances are automatically tagged to be used with Kubernetes cluster autoscaler autodiscovery functionality. Specifically:
    k8s.io/cluster-autoscaler/enabled=true
    k8s.io/cluster-autoscaler/<cluster-name>
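For example, the following pod spec fragment is a sketch of a nodeAffinity rule equivalent to a capacityType node selector (the rest of the pod spec is omitted for brevity):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: eks.amazonaws.com/capacityType
          operator: In
          values:
          - SPOT

The tutorial below uses the simpler nodeSelector form to achieve the same result.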

Tutorial: deploying and auto scaling Spot Instances in EKS managed node groups using eksctl

In this tutorial, you create a new EKS cluster and two EKS managed node groups, one running on-demand instances and the other running Spot Instances. You also deploy the Kubernetes cluster autoscaler and run a demo application to trigger a scale-out of Spot Instances. To complete the tutorial, make sure you have eksctl and kubectl installed on your computer or in an AWS Cloud9 environment. You can run the tutorial using an AWS IAM user or role that has the AdministratorAccess policy attached, or check the minimum required permissions for using eksctl.

Cluster and node groups deployment

First, launch an EKS cluster with one managed node group running on-demand instances, as seen in the diagram earlier in the post. You can specify multiple instance types for the on-demand node group. The underlying Auto Scaling group launches the next instance type in the list if some instance types are unavailable for any reason.

eksctl create cluster --name=eks-spot-managed-node-groups --instance-types=m5.xlarge,m5a.xlarge,m5d.xlarge --managed --nodes=2 --asg-access --nodegroup-name on-demand-4vcpu-16gb

Launching the cluster and managed node groups will take approximately 15 minutes. Once this step is complete and eksctl returns a message that the EKS cluster is ready, you can test connectivity to the cluster by checking that the following command shows the nodes are in Ready status.

kubectl get nodes

Next, deploy another managed node group, this time running Spot Instances. For this example, assume that nodes with 4 vCPUs and 16GB of RAM will be suitable for the fault-tolerant workloads that should run on this node group with Spot Instances.

To select suitable instance types, you can use the ec2-instance-selector tool. Below is an example of requesting instance types with the aforementioned number of vCPUs and memory capacity, along with x86_64 CPU architecture, number of GPUs set to 0 because GPUs are not required for this node group, and no burstable (t2, t3) instance types.

ec2-instance-selector --vcpus=4 --memory=16 --cpu-architecture=x86_64 --gpus=0 --burst-support=false

Response:

m4.xlarge
m5.xlarge
m5a.xlarge
m5ad.xlarge
m5d.xlarge
m5dn.xlarge
m5n.xlarge

Feed the result into the eksctl create nodegroup command below and run it. Note the new eksctl flag that indicates a node group will run Spot Instances: --spot. The example also specifies --nodes-max 20 so you can scale out this node group with the test workload using cluster autoscaler.

eksctl create nodegroup --cluster eks-spot-managed-node-groups --instance-types m5.xlarge,m4.xlarge,m5a.xlarge,m5d.xlarge,m5n.xlarge,m5ad.xlarge,m5dn.xlarge --managed --spot --name spot-4vcpu-16gb --asg-access --nodes-max 20

The node group creation will take approximately four minutes. You can run the following command to confirm that the two new nodes running on Spot Instances were added to the cluster.

kubectl get nodes --selector=eks.amazonaws.com/capacityType=SPOT

You can also check the Compute tab of your cluster in the EKS console and see the new node group, along with the diversified instance type configuration.
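You can also verify the node group configuration from the command line with the describe-nodegroup API. The --query filter in the sketch below just trims the output to the relevant fields:

aws eks describe-nodegroup --cluster-name eks-spot-managed-node-groups --nodegroup-name spot-4vcpu-16gb --query "nodegroup.{capacityType: capacityType, instanceTypes: instanceTypes}"

The response should show capacityType set to SPOT, along with the list of configured instance types.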

Deploying Kubernetes cluster autoscaler

Apply the cluster autoscaler manifest file from the official GitHub repository to your cluster.

kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

Open the cluster autoscaler deployment for editing:

kubectl edit deployment cluster-autoscaler -n kube-system

Find the line with --node-group-auto-discovery and replace <YOUR CLUSTER NAME> with your cluster name, in this example eks-spot-managed-node-groups.
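After the edit, the argument should look similar to the following (based on the autodiscovery tags in the example manifest, with the cluster name substituted):

- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/eks-spot-managed-node-groups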

Also confirm that the version in the image line is the right major release for your cluster version. For example, if your EKS cluster is running Kubernetes version 1.17, make sure that the image is v1.17.*. Check the cluster autoscaler releases page on GitHub for available versions.
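If you need to change it, one way is kubectl set image. The image repository and tag below are illustrative: match the repository used in the manifest you applied, and pick the latest patch release for your Kubernetes version from the releases page.

kubectl -n kube-system set image deployment/cluster-autoscaler cluster-autoscaler=k8s.gcr.io/autoscaling/cluster-autoscaler:v1.17.4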

Lastly, check that cluster autoscaler started successfully by looking at the logs:

kubectl -n kube-system logs -f deployment.apps/cluster-autoscaler

Deploy a sample NGINX deployment

Create a file called nginx-spot-demo.yaml, copy the following snippet into it, then save the file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-spot-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        service: nginx
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx-spot-demo
        resources:
          limits:
            cpu: 1000m
            memory: 1024Mi
          requests:
            cpu: 1000m
            memory: 1024Mi
      nodeSelector:
        eks.amazonaws.com/capacityType: SPOT

Apply the manifest file to the cluster:

kubectl apply -f nginx-spot-demo.yaml

Confirm that the deployment was successful and that the two nginx-spot-demo pods were scheduled on the Spot worker nodes. You can do that by running kubectl get with the label selector to list only the Spot Instance nodes, and then running kubectl describe to check which pods are running on those nodes.

kubectl get no -l eks.amazonaws.com/capacityType=SPOT
kubectl describe node <one of the nodes from the output of the previous command>
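Alternatively, kubectl get pods with the -o wide flag prints the node each pod was scheduled on, which you can cross-reference with the Spot node names from the previous command:

kubectl get pods -o wide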

You can also set up the kube-ops-view tool to easily see a visual representation of which pods were scheduled on which nodes.

Scale the NGINX deployment and confirm that cluster autoscaler increases the size of the managed node group running Spot Instances:

kubectl scale deployment nginx-spot-demo --replicas=20

Confirm that some pods could not be scheduled due to a lack of capacity in the cluster. The following command will show some pods with status Pending:

kubectl get pods
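If you prefer to list only the pods that are stuck, you can filter on the pod phase with a field selector:

kubectl get pods --field-selector=status.phase=Pending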

You can also check the cluster autoscaler logs to confirm that it identified the pending pods and chose an Auto Scaling group to scale out:

kubectl logs deployment/cluster-autoscaler -n kube-system --tail 500 | grep scale_up

You should see lines similar to:

scale_up.go:271] Pod default/nginx-spot-demo-7fbfcb596b-dbwfp is unschedulable
scale_up.go:271] Pod default/nginx-spot-demo-7fbfcb596b-7xs6d is unschedulable
scale_up.go:271] Pod default/nginx-spot-demo-7fbfcb596b-t8vww is unschedulable
scale_up.go:539] Final scale-up plan: [{eks-cebaf87d-d2b3-c88f-4004-e467c43935d6 2->7 (max: 20)}]
scale_up.go:700] Scale-up: setting group eks-cebaf87d-d2b3-c88f-4004-e467c43935d6 size to 7

Within 2-3 minutes, all the pending pods will be scheduled on the Spot Instances in the EKS managed node group.
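You can watch the new nodes join the cluster and transition to Ready status by adding the --watch flag to the earlier node listing command:

kubectl get nodes --selector=eks.amazonaws.com/capacityType=SPOT --watch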

Finally, visit the EC2 instances management console and filter according to the node group name: spot-4vcpu-16gb. Confirm that Spot Instances were launched across the different Availability Zones. You might see different instance types launched in the node group depending on which instance types were selected by the capacity-optimized allocation strategy from the most available Spot capacity pools.
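You can also list the launched Spot Instances from the AWS CLI. The sketch below assumes the eks:nodegroup-name tag that EKS applies to instances in managed node groups; Spot Instances are reported with an InstanceLifecycle of spot:

aws ec2 describe-instances --filters "Name=tag:eks:nodegroup-name,Values=spot-4vcpu-16gb" "Name=instance-state-name,Values=running" --query "Reservations[].Instances[].{Type: InstanceType, AZ: Placement.AvailabilityZone, Lifecycle: InstanceLifecycle}" --output table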

Cleanup

Delete the resources that you created in the tutorial:

kubectl delete deployment nginx-spot-demo

eksctl delete nodegroup on-demand-4vcpu-16gb --cluster eks-spot-managed-node-groups

eksctl delete nodegroup spot-4vcpu-16gb --cluster eks-spot-managed-node-groups

eksctl delete cluster eks-spot-managed-node-groups

Conclusion

You can now use EKS managed node groups to launch Spot Instances via the AWS CLI, AWS Management Console, API/SDKs, CloudFormation, Terraform, and eksctl (from version 0.33). With this launch, we further enhance the Amazon EKS experience and enable you to leverage Spot Instances without the added operational overhead of running self-managed EKS worker nodes, and to optimize your EKS clusters for cost, scale, and resilience. There are no additional costs to use EKS managed node groups; you only pay for the AWS resources that are provisioned.

Ran Sheinberg

Ran Sheinberg is a principal solutions architect in the EC2 Spot team at Amazon Web Services. He works with AWS customers on optimizing their compute spend by using Spot Instances across different types of workloads: stateless web applications, queue workers, containerized workloads, analytics, HPC, and others.

Deepthi Chelupati

Deepthi is a Senior Product Manager in the EC2 service team, working on EC2 Spot Instances.