
Explore etcd Defragmentation in Amazon EKS

Introduction

Amazon Elastic Kubernetes Service (Amazon EKS) has gained significant popularity as a managed Kubernetes service, providing a scalable and reliable platform for running containerized applications. Behind the scenes, Amazon EKS uses etcd, a distributed key-value store, to store cluster configuration, state, and metadata. In this post, we delve into the defragmentation functionality in etcd and discuss the intricacies of the process and the strategies to minimize its impact on Amazon EKS components.

Understanding etcd in Amazon EKS

Etcd serves as the primary data store in Amazon EKS, storing cluster configuration, state, and metadata. It maintains a consistent and distributed storage system to ensure the high availability and reliability of the Amazon EKS control plane.

One primary function of etcd is to store and organize Kubernetes API data as key-value pairs. It keeps track of the current state of objects and configurations in the cluster, which allows the actual state to be reconciled with the desired state specified by cluster administrators, application developers, or controllers.
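
Amazon EKS does not expose etcd directly, but on a self-managed cluster where etcdctl access is available, you can see how the API server lays these objects out under the /registry key prefix (the pod names below are illustrative):

$ etcdctl get /registry/pods --prefix --keys-only
/registry/pods/kube-system/aws-node-8zvxq
/registry/pods/kube-system/coredns-79df7fff65-l7hbk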

Multi-Version Concurrency Control

Etcd is a persistent key-value store. It employs a Multi-Version Concurrency Control (MVCC) mechanism to ensure data consistency while allowing concurrent read and write operations. Each key-value pair is associated with an increasing version number. When a new value is written for a key, it receives a higher version number than the previous one. Versions are unique and strictly ordered, and the previous version is retained, which allows for historical tracking of changes. This append-only nature of the store causes the database size to grow indefinitely unless old revisions are removed.
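
As a short illustration, consider an etcdctl session against a self-managed etcd (Amazon EKS does not expose etcdctl); the key and revision numbers here are hypothetical:

$ etcdctl put /demo/color blue     # suppose this write lands at revision 5
OK
$ etcdctl put /demo/color green    # revision 6
OK
$ etcdctl get /demo/color --rev=5 --print-value-only
blue
$ etcdctl get /demo/color --print-value-only
green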

Data fragmentation

As data is updated and deleted over time, the database can become fragmented. Fragmentation means that there may be non-contiguous gaps or unused space in the database file on the host filesystem. As fragmentation builds up over time, it results in increased storage space consumption and increased I/O load due to non-contiguous blocks, which ultimately affects the responsiveness of the Kubernetes API server. Left unattended, fragmentation can also exhaust the available storage capacity. To address these challenges, etcd provides a built-in defragmentation mechanism to identify and reclaim space from fragmented storage regions.
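
On a self-managed etcd member, you can estimate how much space defragmentation would reclaim by comparing the physical database size with the logically in-use size; the endpoint and figures below are illustrative:

$ curl -s http://127.0.0.1:2379/metrics | grep '^etcd_mvcc_db_total_size'
etcd_mvcc_db_total_size_in_bytes 1.210830848e+09
etcd_mvcc_db_total_size_in_use_in_bytes 7.340032e+08

In this example, roughly 40% of the database file is fragmented free space that defragmentation could release back to the filesystem.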

Solution overview

How defragmentation works in etcd

Compaction

Compaction in etcd focuses on identifying and removing obsolete data to avoid eventual storage space exhaustion. Each key-value pair in etcd is assigned a unique index number, known as a revision. Compaction works on these revision numbers to identify data that's no longer needed or has expired. A retention policy determines the revision range of data to be retained; data outside this range is eligible for removal during compaction. After deletion, the storage space previously occupied by that data is released and can be reused for new data. The API server triggers compaction every five minutes.
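
Amazon EKS triggers compaction for you; on a self-managed etcd, an equivalent manual compaction would look like the following sketch, where the printed revision is illustrative:

$ rev=$(etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision')
$ etcdctl compact "$rev"
compacted revision 842712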

Defragmentation

Defragmentation rewrites the data into contiguous files, effectively eliminating fragmentation and enhancing data locality. Etcd analyzes the data storage to identify fragmentation levels and areas where reorganization would be beneficial. During analysis, etcd identifies contiguous storage space to which related data can be moved. Etcd then relocates scattered data into these contiguous locations, consolidating related key-value pairs, and releases the unused space back to the filesystem.
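
As with compaction, Amazon EKS performs defragmentation for you. On a self-managed cluster, an operator would defragment one member at a time to limit the blocking impact; the endpoint below is illustrative:

$ etcdctl --endpoints=https://10.0.32.16:2379 defrag
Finished defragmenting etcd member[https://10.0.32.16:2379]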

Impact on API availability

It is important to note that defragmentation is a blocking operation: while it is in progress, it prevents any read or write operations from taking place. This can impact the API server's communication with etcd when serving read and write requests from clients. The time taken for defragmentation depends on the amount of compacted data that needs to be copied into the new database file. On average, it can take up to 10 seconds for every gigabyte of data to be reorganized. We recommend monitoring the etcd database size and deleting unwanted objects to minimize the performance impact on the API server. This topic is covered in great detail in the Managing etcd database size on Amazon EKS clusters article.

On larger etcd databases, it is common to see the following error message returned from the API server during defragmentation:

[leaderelection.go:367] Failed to update lock: etcdserver: request timed out

The Amazon EKS development team is actively working on improvements to minimize the impact of defragmentation on API availability, such as the introduction of gRPC health checks in the etcd client.

Handling API timeouts

Intermittent timeouts from the API server are to be expected during defragmentation, so it is best practice to design your client applications to handle these situations gracefully. By building robust error-handling mechanisms and incorporating retry strategies, client applications can mitigate the impact of intermittent timeouts and maintain reliability. When a timeout occurs, your application should retry, ideally with exponential backoff and jitter to avoid overloading the API server. Additionally, consider incorporating a circuit breaker pattern to temporarily halt retries if consecutive failures occur, as sketched below.
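
As a minimal sketch, the following shell snippet retries a kubectl call with exponential backoff and jitter; production clients would more likely use the equivalent retry support built into their Kubernetes client library:

attempt=0
max_attempts=5
until kubectl get pods --request-timeout=10s > /dev/null 2>&1; do
  attempt=$((attempt + 1))
  if [ "$attempt" -ge "$max_attempts" ]; then
    echo "giving up after $max_attempts attempts" >&2
    exit 1
  fi
  # Exponential backoff (1, 2, 4, 8 seconds) plus up to ~1s of jitter
  sleep $(( 2 ** (attempt - 1) ))
  sleep "0.$(( RANDOM % 10 ))"
done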

Walkthrough

Managing defragmentation in Amazon EKS

The underlying etcd cluster and its defragmentation process are handled transparently by the Amazon EKS control plane.

Amazon EKS employs automated maintenance processes to ensure the health and stability of etcd. AWS takes care of provisioning, scaling, and managing the etcd instances for you. This proactive approach helps ensure that the etcd cluster nodes remain healthy and that potential issues related to fragmentation are mitigated before they impact the overall Amazon EKS cluster's performance.

Minimizing impact of defragmentation

The size of the database is a crucial factor that influences the defragmentation time in etcd. As the etcd database grows larger, the defragmentation process becomes more time-consuming due to the increased volume of data that needs to be reorganized and compacted. To minimize the impact of defragmentation, consider the following practices:

1. Remove unused or orphaned objects

Regularly audit your cluster to identify and remove unused or orphaned objects. These objects may include old deployments, replica sets, or services that are no longer in use. Deleting unnecessary objects reduces the storage footprint in etcd and minimizes the impact of fragmentation. Tools such as Popeye can help identify unused resources, as shown below.
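
For example, a basic Popeye scan of the current kubeconfig context, followed by a plain kubectl pipeline that surfaces ReplicaSets scaled down to zero by old Deployment rollouts, might look like this:

$ popeye
$ kubectl get replicasets -A | awk '$3 == 0 && $4 == 0'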

2. Use ConfigMaps and Secrets sparingly

Avoid storing large amounts of data in ConfigMaps and Secrets. Use these resources sparingly, keeping them concise and organized to reduce the number of large objects stored in etcd. Alternatively, consider using AWS Secrets Manager or AWS Systems Manager Parameter Store for storing large data sets.
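
One way to spot oversized objects is to sort them by their serialized size, which approximates their footprint in etcd. The following pipeline (requires jq) lists the five largest ConfigMaps; the same pattern works for Secrets:

$ kubectl get configmaps -A -o json \
    | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(tostring | length)"' \
    | sort -k2 -rn | head -5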

3. Avoid large Pod specs

Pod specifications with sizable amounts of embedded metadata (512 KB or more) can quickly inflate an etcd database. This is especially problematic when a deployment enters a crash loop and floods etcd with an ever-growing number of Pod revisions, eventually consuming all available storage.
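
To gauge how large a given object is when serialized, count the bytes of its JSON representation; the Deployment name and output below are hypothetical:

$ kubectl get deployment my-app -o json | wc -c
528384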

4. Implement object lifecycle management

Define and enforce object lifecycle management policies. Set expiration dates or implement retention policies for objects that have a limited lifespan. Automate the removal of expired objects to prevent unnecessary data accumulation in etcd.

Clean up finished Jobs automatically by specifying the .spec.ttlSecondsAfterFinished field, as described in the Kubernetes documentation on automatic cleanup for finished Jobs:

apiVersion: batch/v1
kind: Job
...
spec:
  ttlSecondsAfterFinished: 100
...

Limit the number of Deployment revisions retained by setting .spec.revisionHistoryLimit, which defaults to 10. A lower value decreases the number of previous ReplicaSets retained in etcd. Note that setting the value to 0, as in the following example, disables rollback functionality.

apiVersion: apps/v1
kind: Deployment
...
spec:
  revisionHistoryLimit: 0
...

5. Regularly monitor etcd storage usage

Monitor etcd storage usage to gain insight into resource utilization and identify any abnormal growth patterns. This helps you proactively address storage-related issues and take corrective actions, such as optimizing object usage, if required. See the best practices guide on control plane monitoring for additional details.

  • Amazon CloudWatch: Amazon EKS integrates with Amazon CloudWatch, which allows you to monitor various cluster metrics, including etcd disk usage. You can also use CloudWatch Logs Insights to write custom queries against the control plane logs, such as the following query that surfaces mutating requests in the audit log:
fields @timestamp, @message, @logStream
| filter @logStream like /kube-apiserver-audit/
| filter verb in ['create','update','patch','delete']
| limit 10
  • kubectl: You can use the kubectl command-line tool to fetch etcd metrics directly. For example, to get the etcd database size in Amazon EKS v1.26+, you can run:
$ kubectl get --raw /metrics | grep apiserver_storage_db_total_size_in_bytes
apiserver_storage_db_total_size_in_bytes{endpoint="http://10.0.160.16:2379"} 1.210830848e+09
apiserver_storage_db_total_size_in_bytes{endpoint="http://10.0.32.16:2379"} 1.207840768e+09
apiserver_storage_db_total_size_in_bytes{endpoint="http://10.0.96.16:2379"} 1.20885248e+09

Conclusion

In this post, we showed you how defragmentation, a key function within etcd, plays a crucial role in optimizing Amazon EKS performance and ensuring cluster stability. By proactively reorganizing data and optimizing storage, defragmentation improves overall cluster efficiency and enhances resource utilization. As Amazon EKS continues to empower organizations with scalable and resilient Kubernetes deployments, understanding the nuances of etcd becomes essential for administrators who want to unlock the full potential of their Amazon EKS clusters.

For additional reading, review the article on managing etcd database size on Amazon EKS clusters.

Nick Zverev

Nick is a Containers Specialist Technical Account Manager at Amazon Web Services. He is passionate about containerization, cloud-native technologies, and helping organizations build scalable solutions. In his spare time, Nick enjoys hiking and exploring nature in the Pacific Northwest.

Jeremy Cowan

Jeremy Cowan is a Specialist Solutions Architect for containers at AWS, although his family thinks he sells "cloud space". Prior to joining AWS, Jeremy worked for several large software vendors, including VMware, Microsoft, and IBM. When he's not working, you can usually find him on a trail in the wilderness, far away from technology.