
Deep dive into Amazon EKS scalability testing

Introduction

The “Elastic” in Amazon Elastic Kubernetes Service (Amazon EKS) refers to the ability to “acquire resources as you need them and release resources when you no longer need them”. Amazon EKS should scale to handle almost all workloads, but we often hear questions from Amazon EKS customers like: “What is the maximum number of Pods or Nodes supported in a single Amazon EKS cluster?”

The answers to these questions can vary because Kubernetes is a complex system, and the performance characteristics of a Kubernetes cluster depend on the characteristics of your workload. The Kubernetes community has defined Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for the Kubernetes components, which can be used as a starting point for scalability discussions. This post walks through those SLIs/SLOs and how the Amazon EKS team runs scalability tests.

SLIs are how we measure a system. They are metrics that can be used to determine how well the system is running (e.g., request latency or count). SLOs define the values that are expected when the system is running well (e.g., request latency remains less than 3 seconds). The Kubernetes SLOs and SLIs focus on the performance of the Kubernetes components and are independent from the Amazon EKS service SLAs, which focus on the availability of an Amazon EKS cluster’s Kubernetes Application Programming Interface (API) endpoint.

Kubernetes upstream SLOs

Amazon EKS is conformant with upstream Kubernetes releases and ensures that Amazon EKS clusters operate within the SLOs defined by the Kubernetes community. The Scalability Special Interest Group (SIG) defines the scalability goals for Kubernetes and investigates bottlenecks in performance through SLIs and SLOs.

Kubernetes has a number of features that allow users to extend the system with custom add-ons or drivers, like Container Storage Interface (CSI) drivers, admission webhooks, and auto-scalers. These extensions can impact the performance of a Kubernetes cluster in different ways (e.g., an admission webhook with failurePolicy=Ignore could add latency to Kubernetes API requests if the webhook target is unavailable). The Kubernetes Scalability SIG defines scalability using a “you promise, we promise” framework:

If you promise to:

  • correctly configure your cluster
  • use extensibility features “reasonably”
  • keep the load in the cluster within recommended limits

then we promise that your cluster scales, i.e.:

  • all the SLOs are satisfied.

The Kubernetes SLOs don’t account for all of the plugins and external factors that could impact a cluster, such as worker node scaling or admission webhooks. These SLOs focus on Kubernetes components and ensure that Kubernetes actions and resources are operating within expectations. The SLOs help Kubernetes developers ensure that changes to Kubernetes code do not degrade performance for the entire system.

The Kubernetes Scalability SIG defines the following official SLOs/SLIs, and the Amazon EKS team regularly runs scalability tests on Amazon EKS clusters against these SLOs/SLIs to monitor for performance degradation as changes are made and new versions are released.

| Objective | Definition | SLO |
| --- | --- | --- |
| API request latency (mutating) | Latency of processing mutating API calls for single objects for every (resource, verb) pair, measured as 99th percentile over the last 5 minutes | In default Kubernetes installation, for every (resource, verb) pair, excluding virtual and aggregated resources and Custom Resource Definitions, 99th percentile per cluster-day <= 1 second |
| API request latency (read-only) | Latency of processing non-streaming read-only API calls for every (resource, scope) pair, measured as 99th percentile over the last 5 minutes | In default Kubernetes installation, for every (resource, scope) pair, excluding virtual and aggregated resources and Custom Resource Definitions, 99th percentile per cluster-day: (a) <= 1 second if scope=resource (b) <= 30 seconds otherwise (if scope=namespace or scope=cluster) |
| Pod startup latency | Startup latency of schedulable stateless pods, excluding time to pull images and run init containers, measured from pod creation timestamp to when all its containers are reported as started and observed via watch, measured as 99th percentile over the last 5 minutes | In default Kubernetes installation, 99th percentile per cluster-day <= 5 seconds |

API request latency

The kube-apiserver has --request-timeout defined as 1m0s by default, which means a request can run for up to one minute (60 seconds) before being timed out and cancelled. The SLOs defined for latency are broken out by the type of request being made, which can be mutating or read-only:

Mutating

Mutating requests in Kubernetes make changes to a resource, such as creations, deletions, or updates. These requests are written to the etcd backend before the updated object is returned. Etcd is a distributed key-value store that is used for all Kubernetes cluster data.

This latency is measured as the 99th percentile over 5 minutes for (resource, verb) pairs of Kubernetes resources. For example, this would measure the latency for Create Pod requests and Update Node requests. The request latency must be <= 1 second to satisfy the SLO.

Read-only

Read-only requests retrieve a single resource (such as Get Pod X) or a collection (such as “Get all Pods from Namespace X”). The kube-apiserver maintains a cache of objects, so the requested resources may be returned from cache or they may need to be retrieved from etcd first.

These latencies are also measured by the 99th percentile over 5 minutes; however, read-only requests can have separate scopes. The SLO defines two different objectives:

  • For requests made for a single resource (e.g., kubectl get pod -n mynamespace my-controller-xxx), the request latency should remain <= 1 second.
  • For requests made for multiple resources in a namespace or a cluster (e.g., kubectl get pods -A), the latency should remain <= 30 seconds.

The SLO has different target values for different request scopes because requests for a list of Kubernetes resources expect the details of all objects in the request to be returned within the SLO. On clusters with large collections of resources, this can result in large response sizes that take some time to return. For example, in a cluster running tens of thousands of Pods, with each Pod being roughly 1 KiB when encoded in JSON, returning all Pods in the cluster would produce a response of 10 MB or more. Kubernetes clients can reduce this response size by using API list chunking (APIListChunking) to retrieve large collections of resources in pages.
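For example, kubectl exposes list chunking through its --chunk-size flag, so a large collection is retrieved in pages rather than in one large response:

```bash
# Retrieve all Pods across namespaces in pages of 500 objects per request
# (500 is kubectl's default; it is set explicitly here for illustration).
kubectl get pods --all-namespaces --chunk-size=500
```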

Pod startup latency

This SLO is primarily concerned with the time it takes from Pod creation to when the containers in that Pod actually begin execution. To measure this, the latency is calculated as the difference between the creation timestamp recorded on the Pod and the time when a WATCH on that Pod reports that all of its containers have started (excluding time for container image pulls and init container execution). To satisfy the SLO, the 99th percentile per cluster-day of this Pod startup latency must remain <= 5 seconds.

Kubernetes SLI metrics

Kubernetes is also improving the observability around these SLIs by adding Prometheus metrics to Kubernetes components that track the SLIs over time. Using Prometheus Query Language (PromQL), we can build queries that display the SLI performance over time in tools like Prometheus or Grafana dashboards. Below are some examples for the previous SLOs.

API server request latency

| Metric | Definition |
| --- | --- |
| apiserver_request_sli_duration_seconds | Response latency distribution (not counting webhook duration and priority and fairness queue wait times) in seconds for each verb, group, version, resource, subresource, scope, and component. |
| apiserver_request_duration_seconds | Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope, and component. |

Note: The apiserver_request_sli_duration_seconds metric is available starting in Kubernetes 1.27.

You can use these metrics to investigate API server response times and to identify whether bottlenecks originate in the Kubernetes components or in other plugins and extensions. Comparing these metrics can provide insight into where delays in request processing are being introduced.

API request latency SLI – This is the time it takes Kubernetes components to process the request and respond. The SLI metrics provide insight into how Kubernetes components are performing by excluding the time that requests spend waiting in API Priority and Fairness queues, working through admission webhooks, or other Kubernetes extensions.

API request total latency – The total duration metric provides a more holistic view, as it reflects the time your applications wait for a response from the API server. This is calculated from when the request is received to when the response is sent, including all of the webhook execution time and the time spent in priority and fairness queues.
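As a minimal sketch, assuming you have a Prometheus server scraping the kube-apiserver metrics (the endpoint prometheus.example.com below is a placeholder), you could chart the read-only latency SLI with an instant query like the following:

```bash
# p99 read-only API request latency SLI over the last 5 minutes, broken out
# by resource and scope to mirror the read-only SLO above.
# Replace prometheus.example.com with your Prometheus endpoint.
curl -s -G 'http://prometheus.example.com/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(apiserver_request_sli_duration_seconds_bucket{verb=~"LIST|GET"}[5m])) by (resource, scope, le))'
```

Swapping apiserver_request_sli_duration_seconds_bucket for apiserver_request_duration_seconds_bucket in the same query shows the total latency, and comparing the two highlights time spent in webhooks and priority and fairness queues.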

Pod startup latency

| Metric | Definition |
| --- | --- |
| kubelet_pod_start_sli_duration_seconds | Duration in seconds to start a pod, excluding time to pull images and run init containers, measured from pod creation timestamp to when all its containers are reported as started and observed via watch. |
| kubelet_pod_start_duration_seconds | Duration in seconds from kubelet seeing a pod for the first time to the pod starting to run. This does not include the time to schedule the pod or scale out worker node capacity. |

Note: The kubelet_pod_start_sli_duration_seconds metric is available starting in Kubernetes 1.27.

Similar to the previous queries, you can use these metrics to gain insight into how long node scaling, image pulls, and init containers are delaying the pod launch compared to kubelet actions.

Pod startup latency SLI – This is the time from the pod being created to when the application containers are reported as running. This includes the time it takes for worker node capacity to be available and for the pod to be scheduled, but it does not include the time it takes to pull images or for the init containers to run.

Pod startup latency total – This is the time it takes the kubelet to start the pod for the first time. It is measured from when the kubelet receives the pod via WATCH, so it does not include the time for worker node scaling or scheduling. It does include the time to pull images and run init containers.
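Following the same pattern (and the same placeholder Prometheus endpoint), a sketch of a query for the pod startup latency SLI could look like this:

```bash
# p99 pod startup latency SLI over the last 5 minutes, aggregated across all
# kubelets; by definition the metric excludes image pulls and init containers.
curl -s -G 'http://prometheus.example.com/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(kubelet_pod_start_sli_duration_seconds_bucket[5m])) by (le))'
```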

How Amazon EKS approaches Scalability

Amazon EKS manages the Kubernetes control plane components and ensures their security, availability, and scalability, but you are responsible for the availability and scalability of your applications, extensions, and data plane infrastructure (if you are not using AWS Fargate). The Amazon EKS team regularly runs a series of internal load tests to verify that changes and new releases improve performance, or at least maintain the same performance level. The Scalability section of the EKS Best Practices guide has recommendations and patterns you can implement to improve the scalability of your clusters.

To ensure consistency with upstream Kubernetes SLO and SLI definitions, the Amazon EKS team measures the scalability of an Amazon EKS cluster by applying the same criteria used for upstream scalability tests, as defined by SIG Scalability. As we can’t test every use case or configuration, these tests provide a baseline of scalability that we can use when evaluating or comparing more advanced workloads.

How Amazon EKS runs scalability tests

The Amazon EKS team uses the official Kubernetes scalability and performance testing framework, ClusterLoader2. ClusterLoader2 uses declarative-style tests to create Kubernetes objects at a specified scale and rate (e.g., “I want to run 30 pods per node across 5,000 nodes, creating resources at 50 pods/sec”). More information is available at the ClusterLoader2 GitHub repository.

The Amazon EKS scalability tests are based on the general purpose load test configuration defined in the kubernetes/perf-tests repo. To ensure Amazon EKS control planes are able to maintain the SLOs even at large scale, we configure the test to run with 5,000 nodes; the Kubernetes community has defined 5,000 nodes per cluster as the threshold beyond which Kubernetes may encounter performance degradation. The number of nodes is used to calculate some of the additional parameters when running a test with ClusterLoader2, such as the total number of namespaces. We scale out the nodes in our cluster to 5,000 before we begin the load test.
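As a rough sketch of what running the upstream load test against an existing, pre-scaled cluster looks like (the provider value, paths, and report directory below are illustrative and may differ for your environment):

```bash
# Clone the upstream performance test framework and run the general purpose
# load test configuration against a cluster that is already scaled out.
git clone https://github.com/kubernetes/perf-tests.git
cd perf-tests/clusterloader2
go run cmd/clusterloader.go \
  --testconfig=testing/load/config.yaml \
  --provider=aws \
  --kubeconfig="${HOME}/.kube/config" \
  --nodes=5000 \
  --report-dir=/tmp/clusterloader2-report
```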

The load test creates a variety of Kubernetes resources including Pods, Deployments (which create ReplicaSets and Pods), Services, and Secrets at a churn of 50 Pods per second to put sustained pressure on the Kubernetes control plane components. Prometheus metrics are collected during the test, along with additional details, to verify that the SLOs are still met as the resources are created.

AWS service quotas and considerations

We needed to increase some of the AWS Service Quotas for our AWS account in order to scale out our cluster to 5,000 nodes. The table below lists the quotas we needed to raise for the scale and churn of our test cluster; additional AWS Service Quotas that may impact your workloads are covered in the EKS Best Practices guide.

You can request an increase to these quotas through the AWS Service Quotas console or the AWS Command Line Interface (AWS CLI), using the quota name or the quota code, as shown in the example after the table.

| Service | Quota | Quota code | Default | Increased value |
| --- | --- | --- | --- | --- |
| Amazon Elastic Compute Cloud (Amazon EC2) | Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances (as a maximum vCPU count) | L-1216C47A | 5 | 32,000 |
| Amazon Elastic Kubernetes Service (Amazon EKS) | Nodes per managed node group | L-BD136A63 | 450 | 1,000 |
| Amazon Virtual Private Cloud (Amazon VPC) | Security groups per network interface | L-2AFB9258 | 5 | 16 |
| Amazon VPC | IPv4 CIDR blocks per VPC | L-83CA0A9D | 5 | 20 |
| Amazon Elastic Block Store (Amazon EBS) | Storage for General Purpose SSD (gp3) volumes, in TiB | L-7A658B76 | 50 | 1,100 |
| Amazon EBS | Storage for General Purpose SSD (gp2) volumes, in TiB | L-D18FCD1D | 50 | 1,100 |
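For example, the following AWS CLI call requests the Amazon EC2 vCPU quota increase from the first row of the table, using the quota code from the table and the value we used in our tests:

```bash
# Request an increase of the EC2 "Running On-Demand Standard instances" quota
# (measured in vCPUs) to 32,000 in the current account and Region.
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-1216C47A \
  --desired-value 32000
```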

We also increased the rate limits for our AWS account to accommodate the rate of requests against Amazon Elastic Compute Cloud (Amazon EC2) for the actions listed in the following table. Details on how rate throttling is calculated for Amazon EC2, how to monitor for rate throttling in your account, and how to request an increase are available in the Amazon EC2 documentation.

| Mutating actions | Read-only actions |
| --- | --- |
| AssignPrivateIpAddresses | DescribeDhcpOptions |
| AttachNetworkInterface | DescribeInstances |
| CreateNetworkInterface | DescribeNetworkInterfaces |
| DeleteNetworkInterface | DescribeSecurityGroups |
| DeleteTags | DescribeTags |
| DetachNetworkInterface | DescribeVpcs |
| ModifyNetworkInterfaceAttribute | DescribeVolumes |

Amazon EKS cluster

We use an Amazon EKS cluster with Managed Node Groups that are pre-scaled up to a total of 5,000 worker nodes to execute the ClusterLoader2 tests. Amazon EKS automatically scales the Kubernetes control plane in response to a number of signals from the cluster. As part of that scaling, Amazon EKS also scales some of the parameters for Kubernetes Control Plane components such as queries per second (QPS) or inflight request limits. Amazon EKS clusters are created with the Kubernetes upstream default values for these parameters and the Amazon EKS service automatically increases them as the control plane is scaled up.

The Kubernetes components print their configured values to their logs at startup. If you have the Amazon EKS control plane logs enabled for the Kubernetes components, you can search for log messages starting with FLAG: to review these values. The exact values Amazon EKS configures for any given cluster scale may change as Kubernetes changes or as we find better values.
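As a sketch, assuming your cluster is named my-cluster (a placeholder) and the relevant control plane log types are enabled, you could search for these messages in CloudWatch Logs with something like:

```bash
# Search the kube-apiserver log streams of the EKS control plane log group
# for the flag values printed at startup.
aws logs filter-log-events \
  --log-group-name "/aws/eks/my-cluster/cluster" \
  --log-stream-name-prefix "kube-apiserver" \
  --filter-pattern '"FLAG:"'
```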

The Amazon VPC Container Network Interface (CNI) plugin for the tests is configured to use prefix delegation for IP address assignment, to improve Pod density and the performance of IP address assignment. The cluster uses managed node groups with a broad range of instance families; diversifying across instance types helps procure capacity from multiple capacity pools. The configuration we use allows instances from c5.large, m5.large, r5.large, t3.large, t3a.large, c5a.large, m5a.large, and r5a.large.
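A minimal sketch of enabling prefix delegation on the Amazon VPC CNI plugin (it applies to Nitro-based instance types, which the instance families above are):

```bash
# Set ENABLE_PREFIX_DELEGATION on the aws-node DaemonSet so the VPC CNI assigns
# /28 IPv4 prefixes to network interfaces instead of individual secondary IPs.
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
```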

We collect the Prometheus metrics from the cluster and use Amazon Managed Service for Prometheus and Amazon Managed Grafana to review them.

Results of our testing

During the load tests, ClusterLoader2 monitors the performance of the cluster. If the SLOs above are broken (e.g., the 99th percentile of latency [p99] for an API request to get a single Pod takes > 1 second), then the test is considered a failure. The Amazon EKS team reviews these results and investigates failed tests to understand the failure and ensure any regressions are addressed.

The total number of resources created during the load test is dictated by the ClusterLoader2 configuration. Our load tests expect 5,000 nodes, with 30 Pods per node and 100 nodes per namespace. The test configuration then calculates the total number of Pods (30 Pods per node multiplied by 5,000 nodes, or 150,000 application Pods), the number of namespaces (5,000 nodes divided by 100 nodes per namespace, or 50 namespaces), and the number of Pods per namespace.

At the peak of our tests, we see the following resource counts in the cluster while maintaining the SLOs and the expected churn rate.

| Resource type | Max reached during the test |
| --- | --- |
| Nodes | 5,000 |
| Namespaces | 50 |
| Pods | 170,000* |
| Pods per node | 30* |
| Deployments | 16,000 |
| Services | 8,000 |
| Endpoints | 8,000 |
| Endpoint slices | 8,000 |
| Secrets | 16,000 |
| ConfigMaps | 16,000 |
| CRDs | 4 |
| Jobs | 150 |

* The load test runs 30 application Pods per node; the total number of Pods includes the Pods for plugins and DaemonSets.

Keep in mind that the total number of Kubernetes resources is not really the determining factor of success for these tests, because the SLOs define time thresholds for completing actions or requests. For example, the time it takes for Pods to start provides more insight into how the cluster is performing than the total number of Pods.

SLOs on your cluster

We have looked at how Kubernetes defines SLOs and how Amazon EKS measures cluster performance. You might be interested in learning how your Amazon EKS cluster is performing with your configuration, plugin extensions, and workload. You don’t have to run a full 5,000-node load test to get an idea of the same performance benchmarks in your existing Amazon EKS clusters. If you are collecting the Prometheus metrics from the Kubernetes resources in your Amazon EKS cluster, then you can gain deeper insights into the performance of the Kubernetes control plane components. There are more details on the metrics and Prometheus queries you can use in the Scalability section of the EKS Best Practices Guide.

Consider that the SLOs are focused on the performance of the Kubernetes components in your clusters, but there are additional metrics you can review that provide different perspectives on your cluster. Kubernetes community projects like kube-state-metrics can help you quickly analyze trends in your cluster. Community plugins and drivers also often emit Prometheus metrics, allowing you to investigate components like autoscalers or custom schedulers. The Observability Best Practices guide contains examples of other Kubernetes metrics you can use to gain further insight.

Working with the Kubernetes community

Amazon EKS contributes to the Kubernetes community. The Amazon EKS team has worked with the Scalability SIG to implement scalability tests for the Network Programming Latency SLO. The Amazon EKS team has also worked with the Kubernetes community to implement a 5,000-node test on AWS using kOps, a community tool to provision Kubernetes clusters. This test runs periodically to ensure that code changes in Kubernetes don’t introduce negative performance impacts, and the results are available in the Kubernetes community performance dashboard. When one of these scalability tests fails, the Amazon EKS team is notified to help investigate.

The Amazon EKS team runs the same load test on 5,000 nodes internally to monitor the same performance metrics as the upstream Kubernetes community. Using the same tests, at the same scale, helps us ensure that Amazon EKS-specific components maintain the same level of performance as the upstream Kubernetes tests.

This work is just the starting point. We’re always improving the scalability of our Amazon EKS clusters based on bottlenecks and problems our customers run into with real world uses, like increasing the QPS and inflight requests options as we scale the Kubernetes control plane. With Amazon EKS, those improvements are automatically deployed to your clusters, helping you avoid scalability problems before they even start.

Conclusion

In this post, we have discussed the SLOs defined by the Kubernetes community and how Amazon EKS tests for scalability. If you are scaling a single cluster beyond 1,000 nodes or 50,000 pods, then we would love to talk to you. Amazon EKS has customers running large clusters, and we’re constantly working to improve the scalability of our clusters to provide the best performance possible. Reach out to your AWS account team (Solutions Architect or Technical Account Manager), the AWS Support team, or the AWS Containers Roadmap for help with scaling. To learn more about running Kubernetes workloads at scale, check out the Scalability section of the EKS Best Practices guide.

Alan Halcyon

Alan Halcyon is a Senior Specialist Technical Account Manager at AWS in the Containers domain. He helps AWS customers optimize and troubleshoot large scale Kubernetes workloads, and loves the weird or difficult problems.

George John

George John is a Senior Product Manager with the AWS Kubernetes team. When he is not building products, he loves to explore the Pacific Northwest with his family.

Harish Kuna

Harish Kuna is a Software Engineer with Amazon EKS. As part of his job, he works on performance and scalability aspects of the Kubernetes/EKS ecosystem.