Amazon EKS improves control plane scaling and update speed by up to 4x

Years before Amazon Elastic Kubernetes Service (EKS) was released, our customers told us they wanted a service that would simplify Kubernetes management. Many of them were running self-managed clusters on Amazon Elastic Compute Cloud (EC2) and were having challenges upgrading, scaling, and maintaining the Kubernetes control plane. When EKS launched in 2018, it aimed to reduce our customers’ operational burden by offering a managed control plane for Kubernetes. This initially included automated upgrades, patching, and backups, which we often refer to as “undifferentiated heavy lifting.” We analyzed volumes of data to create a control plane that would work for the vast majority of our customers.

However, as usage of EKS grew, we discovered there were customers who occasionally exceeded the provisioned capacity of the cluster. When this happened, they had to file a ticket with AWS support to have their cluster control plane resized. This was not an ideal user experience.

Today, the control plane is scaled automatically when certain metrics are exceeded. At first, we used basic metrics such as CPU/memory for scaling. As we learned how the control plane behaved under different conditions, we adjusted our metrics to make scaling more responsive. Now we use a variety of metrics to scale the control plane, including the number of worker nodes and the size of the etcd database. These enhancements are a great example of the flywheel effect where AWS releases a feature in response to customer feedback, solicits feedback from end users about its impact, and uses that feedback to continue improving the customer experience.
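To make the idea of scaling on several signals concrete, here is a minimal sketch in Python. It is not the EKS implementation: the metric names, the threshold values, and the "any one metric triggers a scale-up" rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlPlaneMetrics:
    cpu_utilization: float     # fraction of CPU in use, 0.0 to 1.0
    memory_utilization: float  # fraction of memory in use, 0.0 to 1.0
    worker_nodes: int          # worker nodes registered with the cluster
    etcd_db_bytes: int         # size of the etcd database


# Hypothetical thresholds, for illustration only.
THRESHOLDS = ControlPlaneMetrics(
    cpu_utilization=0.75,
    memory_utilization=0.80,
    worker_nodes=300,
    etcd_db_bytes=2 * 1024**3,
)

METRIC_NAMES = ("cpu_utilization", "memory_utilization", "worker_nodes", "etcd_db_bytes")


def exceeded_metrics(current, limits=THRESHOLDS):
    """Return the name of every metric that is above its threshold."""
    return [
        name
        for name in METRIC_NAMES
        if getattr(current, name) > getattr(limits, name)
    ]


def should_scale_up(current, limits=THRESHOLDS):
    """Assumed policy: scale up as soon as any single metric is exceeded."""
    return bool(exceeded_metrics(current, limits))
```

In this sketch, a cluster with low CPU and memory usage but 450 worker nodes would still be flagged for a scale-up, which is the kind of case that CPU and memory metrics alone would miss.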

Recent enhancements

Since introducing control plane auto scaling, we’ve been looking at ways to further improve the scaling experience for our customers. The latest enhancement involves reducing the amount of time it takes to scale the control plane.

Previously, control plane scaling could take as long as 50 minutes. This time was felt most acutely by customers whose requests to the kube-apiserver steadily increased (linear growth). Long scaling delays could cause API and etcd latencies to increase or even cause the API server to become temporarily unresponsive.

With our latest updates, the control plane can now scale in 10 minutes or less, which represents a 4x improvement. Multiple engineering teams implemented several changes that together increased the speed. These changes include the following:

Concurrent API server and etcd scale ups: In the past, when the control plane needed to be scaled we would wait for the control plane nodes to be scaled before scaling etcd. Now when we receive a scale up signal, both these components are scaled in parallel.

Change cooldown to 15 mins after creation: We’ve shortened the waiting period after cluster creation to 15 minutes. Once that window has passed, a new cluster is eligible for scale up.

Blue/green style updates for the api-server: To achieve high availability and meet our Service Level Agreement obligations, Amazon EKS requires that a minimum number of control plane nodes be running at all times. In the past, when the control plane had to be scaled, we would increase the size of the auto scaling group (ASG) and perform a rolling deployment of the instances in the ASG, one at a time. With the latest updates, custom logic performs a blue/green style deployment where all of the new larger instances are launched in parallel. Only when the new nodes are healthy and ready to start serving traffic are the old nodes terminated.

Reduce etcd instance warm-up time: When a new etcd node is joined to the cluster, it needs time to receive and process updates from the other members of the cluster. Until it’s fully synchronized, it can’t serve client requests. Previously, we waited four minutes to allow etcd changes to propagate to a newly joined node. With the latest updates, we replaced the waiting period with a health check that determines whether the node is ready.

Speed up api-server instance bootstrap: We reduced the time needed to bootstrap new control plane nodes by fetching large files ahead of time rather than lazy loading them from Amazon Simple Storage Service (S3). This allowed us to reduce our bootstrap time by half.

Adjustments to the QPS and burst rates: As the cluster scales, it increases the Kubernetes API request and burst rates for the Kubernetes Controller Manager and Kubernetes Scheduler. This means you’ll be less likely to experience slowness while creating, updating, or deleting pods.
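Three of the changes above fit together naturally: launching replacement nodes in parallel, polling a health check instead of waiting a fixed period, and terminating old nodes only after the new ones are healthy. The following Python sketch shows that pattern under stated assumptions. The function names (`launch_node`, `is_healthy`, `terminate_node`) are hypothetical callbacks, not EKS or AWS APIs, and the rollback branch is our own assumption about what a failed rollout might do.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def wait_until_healthy(is_healthy, timeout_s=240.0, interval_s=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll a health check rather than sleeping for a fixed period.

    Returns True as soon as is_healthy() succeeds, or False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if is_healthy():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def blue_green_replace(old_nodes, launch_node, is_healthy, terminate_node,
                       count, **wait_kwargs):
    """Launch `count` new nodes in parallel and swap fleets only if all are healthy."""
    with ThreadPoolExecutor(max_workers=count) as pool:
        # Launch every replacement (green) node at the same time.
        new_nodes = list(pool.map(lambda _: launch_node(), range(count)))
        # Check readiness of each new node concurrently.
        healthy = list(pool.map(
            lambda node: wait_until_healthy(lambda: is_healthy(node), **wait_kwargs),
            new_nodes,
        ))
    if all(healthy):
        # Only now is it safe to retire the old (blue) nodes.
        for node in old_nodes:
            terminate_node(node)
        return new_nodes
    # Assumed rollback: keep the old fleet serving and discard the new one.
    for node in new_nodes:
        terminate_node(node)
    return old_nodes
```

Because the old nodes keep serving until every new node passes its check, the number of running nodes never drops below the original count during the swap, and the total time is bounded by the slowest single node rather than the sum of all nodes.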

A cluster’s control plane can be scaled up and down multiple times throughout its lifetime. When it’s scaled up, the scaling is gradual, using progressively larger instances each time until it reaches its maximum size. Once a cluster has been scaled up, it won’t be scaled down unless utilization has remained below the scaling threshold for several days. This cooling-off period helps ensure that the control plane is appropriately sized for the workloads running on it.
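The scale-up and scale-down behavior described here can be sketched as a ladder of tiers with an asymmetric rule: step up one tier whenever a threshold is exceeded, and step down only after a sustained quiet period. Everything specific in this Python sketch is an assumption for illustration, including the tier names, the QPS and burst values attached to each tier, and the three-day value standing in for "several days."

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Tier:
    instance_size: str   # hypothetical label for the control plane instance size
    kube_api_qps: int    # hypothetical API request rate for controller manager and scheduler
    kube_api_burst: int  # hypothetical burst allowance


# Hypothetical ladder: each step uses a larger instance and higher rates.
TIERS = [
    Tier("small", 20, 30),
    Tier("medium", 50, 100),
    Tier("large", 100, 200),
]

# Assumed stand-in for "several days".
SCALE_DOWN_AFTER = timedelta(days=3)


def next_tier_index(current, over_threshold, below_since, now):
    """Pick the tier to run next.

    current:        index of the tier in use
    over_threshold: True if a scaling metric is currently exceeded
    below_since:    when utilization last dropped below the threshold, or None
    now:            current time
    """
    if over_threshold:
        # Gradual scale-up: one tier at a time, capped at the largest tier.
        return min(current + 1, len(TIERS) - 1)
    if below_since is not None and now - below_since >= SCALE_DOWN_AFTER:
        # Scale down only after a sustained period below the threshold.
        return max(current - 1, 0)
    return current
```

The asymmetry is the design point: reacting quickly to load while reacting slowly to quiet periods avoids repeatedly resizing a control plane whose workload fluctuates.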

Additional benefits

The work we’ve done to increase the speed at which the cluster can scale also affects how quickly updates are applied to the EKS control plane. In the past, when certain updates were applied to the cluster, EKS performed a rolling replacement of the EC2 instances backing the control plane. With blue/green style updates for the API server and other recent enhancements, updates to all currently supported versions of EKS can now complete in 10 minutes or less. These faster update times mean that EKS customers can get back to performing operations on their clusters and take advantage of the latest security updates and features faster than before. Other types of EKS updates will soon benefit from these changes, including enabling AWS Key Management Service (KMS) encryption of secrets on existing clusters and associating a third-party OIDC provider for user authentication.

The future

Over the coming months, we’ll be making more incremental improvements to control plane auto scaling with the aim of making the EKS user experience even better. As with previous updates, these will be unobtrusive and released with little fanfare because we consider scalability to be a core tenet of EKS. These updates will be applied to all new and existing clusters automatically, requiring no customer intervention. Eventually, these improvements will allow EKS to support clusters larger than 5,000 nodes.

Recent improvements

Since publishing this blog in June 2022, we have been busily looking for opportunities to improve how quickly the EKS control plane scales. Two updates resulted from this work. The first update reduces the amount of time it takes to add KMS envelope encryption to an existing cluster from 50 minutes to 17 minutes, a 66% improvement. The second update reduces the time it takes to associate and disassociate an OIDC provider from 30 minutes to 7 minutes, a 77% improvement. And we’re not done yet! We will continue improving the scaling characteristics of the EKS control plane until they become as invisible and unobtrusive as possible.
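For readers who want to check the percentages, they are the reduction in elapsed time relative to the original duration. A short Python sketch of that arithmetic, with a helper name of our own choosing:

```python
def percent_reduction(before_minutes, after_minutes):
    """Percentage by which the elapsed time dropped, rounded to a whole number."""
    return round((before_minutes - after_minutes) / before_minutes * 100)


kms_encryption = percent_reduction(50, 17)    # (50 - 17) / 50 -> 66
oidc_association = percent_reduction(30, 7)   # (30 - 7) / 30  -> 77
```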

Feedback

As always, if you have comments or suggestions about the scaling experience on EKS, please visit the containers roadmap on GitHub and open or upvote an issue. We’re always interested in hearing how we can make AWS the best place to run Kubernetes.
