AWS Open Source Blog

etcd is Now a CNCF Incubating Project

etcd is a distributed key-value store that provides a reliable way to manage the coordination state of distributed systems. etcd was first announced in June 2013 by CoreOS (part of Red Hat as of 2018). Since its adoption as part of Kubernetes in 2014, etcd has become a fundamental part of the Kubernetes cluster management software design, and the etcd community has grown exponentially. etcd is now used in production environments at many companies, by large cloud providers such as AWS, Google Cloud Platform, and Azure, and in on-premises Kubernetes implementations. CNCF currently has 32 conformant Kubernetes platforms and distributions, all of which use etcd as the datastore.

Now that etcd is officially joining CNCF as an incubating project, we would like to reflect on the major milestones achieved in the latest etcd releases, and share the future roadmap for etcd. We’d love to have your thoughts and feedback on the features you consider important via the mailing list, etcd-dev@googlegroups.com.

etcd, 2014

In June 2014, Kubernetes was released with etcd as the backing store for all master state. Kubernetes v0.4 used the etcd v0.2 API, which was at the time in an alpha stage. As Kubernetes reached the v1.0 milestone in 2015, etcd stabilized its v2.0 API. The widespread adoption of Kubernetes led to a dramatic increase in the scalability requirements for etcd. To handle the large number of workloads and the growing requirements for scale, etcd released v3.0 of its API in June 2016. Kubernetes v1.13, released in December 2018, finally dropped support for the etcd v2.0 API and adopted the etcd v3.0 API. The table below gives a visual snapshot of the release cycles of etcd and Kubernetes.

etcd v3.1, Early 2017

etcd v3.1 features provide better read performance and better availability during version upgrades – given the heavy use of etcd in production today, these are critical for users. It also implements the Raft read index, which bypasses Raft WAL disk writes for linearizable reads: the follower requests the read index from the leader, and the leader's response indicates whether the follower has advanced as far as the leader. When the follower's log is up to date, the quorum read is served locally without going through the full Raft protocol. Thus, no disk write is required for read requests. etcd v3.1 also introduces automatic leadership transfer: when an etcd leader receives an interrupt signal, it automatically transfers its leadership to a follower. This provides higher availability when the cluster adds or loses a member.
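
To make the read path concrete, here is a minimal sketch using the Go clientv3 package (the endpoint address is a placeholder, and the client import path varies by etcd release): a plain Get is a linearizable quorum read served via the read index, while WithSerializable opts into a possibly stale but purely local read.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3" // import path varies by etcd release
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Linearizable (quorum) read: the default. With the read index, the
	// serving node confirms its progress against the leader but returns
	// the value from its local store, without a Raft WAL write.
	resp, err := cli.Get(ctx, "foo")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("linearizable read at revision:", resp.Header.Revision)

	// Serializable read: served entirely from the local node without
	// consulting the leader; faster, but possibly stale.
	if _, err := cli.Get(ctx, "foo", clientv3.WithSerializable()); err != nil {
		log.Fatal(err)
	}
}
```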

etcd v3.2, Summer 2017

etcd v3.2 focuses on stability. Its client was shipped in Kubernetes v1.10, v1.11, and v1.12. The etcd team still actively maintains the branch by backporting all bug fixes. This release introduces the gRPC proxy, which supports the watch API and coalesces all watch event broadcasts into one gRPC stream. These broadcasts can scale to one million events per second.
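
From the client's point of view, a gRPC proxy endpoint behaves like any other etcd endpoint. The sketch below is a hedged example: it assumes a proxy started with `etcd grpc-proxy start` is listening on 127.0.0.1:23790 (a placeholder address) in front of the real cluster, and watches a key prefix through it; identical watches from many such clients are coalesced by the proxy into a single upstream stream.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3" // import path varies by etcd release
)

func main() {
	// 127.0.0.1:23790 is assumed to be a grpc-proxy listen address
	// fronting the etcd cluster; the client treats it as a normal endpoint.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:23790"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Many clients watching the same prefix through the proxy share
	// a single watch stream to the etcd cluster.
	for wresp := range cli.Watch(context.Background(), "/registry/", clientv3.WithPrefix()) {
		for _, ev := range wresp.Events {
			fmt.Printf("%s %q -> %q\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}
```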

etcd v3.2 also introduces configuration changes, such as increasing the --snapshot-count default value from 10,000 to 100,000. With this higher snapshot count, the etcd server holds Raft entries in memory for longer before compacting the old ones. The etcd v3.2 default configuration shows higher memory usage, while giving slow followers more time to catch up. This is a tradeoff between less frequent snapshot sends and higher memory usage. Users can set a lower --snapshot-count value to reduce memory usage, or a higher value to increase the availability of slow followers.

Another new feature, backported to etcd v3.2.19, was the --initial-election-tick-advance flag. By default, a rejoining follower fast-forwards election ticks to speed up its initial cluster bootstrap. For example, the starting follower node waits only 200ms instead of the full election timeout of one second before starting an election. Ideally, within that 200ms it receives a leader heartbeat and immediately joins the cluster as a follower. However, if a network partition occurs, the heartbeat may be dropped, triggering a leadership election. A vote request from a partitioned node is quite disruptive: if it contains a higher Raft term, the current leader is forced to step down. With initial-election-tick-advance set to false, a rejoining node has a better chance of receiving leader heartbeats before disrupting the cluster.
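
For operators who embed etcd in a Go program, both of the knobs above are also exposed programmatically. The following is a minimal sketch, assuming the embed package's SnapshotCount and InitialElectionTickAdvance fields mirror the --snapshot-count and --initial-election-tick-advance flags; the data directory and timeout are placeholders.

```go
package main

import (
	"log"
	"time"

	"go.etcd.io/etcd/embed" // import path varies by etcd release
)

func main() {
	cfg := embed.NewConfig()
	cfg.Dir = "default.etcd" // placeholder data directory

	// Keep more Raft entries in memory before compacting older ones:
	// more memory, but more headroom for slow followers to catch up.
	cfg.SnapshotCount = 100000

	// Do not fast-forward election ticks on start, so a rejoining member
	// waits for leader heartbeats instead of campaigning immediately.
	cfg.InitialElectionTickAdvance = false

	e, err := embed.StartEtcd(cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer e.Close()

	select {
	case <-e.Server.ReadyNotify():
		log.Println("embedded etcd is ready")
	case <-time.After(time.Minute):
		log.Fatal("embedded etcd took too long to start")
	}
}
```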

etcd v3.3, Early 2018

etcd v3.3 continues the theme of stability. Its client is included in Kubernetes v1.13. Previously, the etcd client retried carelessly on network disconnects, without any backoff or failover logic. The client would often get stuck with a partitioned node, affecting multiple production users. The v3.3 client balancer now maintains a list of unhealthy endpoints using the gRPC health checking protocol, making retries and failover more efficient in the face of transient disconnects and network partitions. This was backported to etcd v3.2 and is also included in the Kubernetes v1.10 API server.
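
In practice, the balancer can only fail over if it knows about more than one endpoint. Below is a minimal sketch (with placeholder addresses) of configuring the client with every cluster member, so that transient failures against one endpoint can be retried against the others.

```go
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3" // import path varies by etcd release
)

func main() {
	// Listing every member lets the client balancer route around a
	// partitioned or unhealthy endpoint. The addresses are placeholders.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints: []string{
			"10.0.0.1:2379",
			"10.0.0.2:2379",
			"10.0.0.3:2379",
		},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// With the v3.3+ balancer, transient failures against one endpoint
	// are retried against healthy endpoints before an error surfaces here.
	if _, err := cli.Put(ctx, "foo", "bar"); err != nil {
		log.Fatal(err)
	}
}
```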

etcd v3.3 also provides a more predictable database size. etcd used to maintain a separate freelist DB to track pages that were no longer in use and had been freed after transactions, so that following transactions could reuse them. However, it turned out that persisting the freelist requires a lot of disk space and introduces high latency for Kubernetes workloads. Especially when there were frequent snapshots with lots of read transactions, the etcd database size could quickly grow from 16 MB to 4 GB. etcd v3.3 disables freelist sync and instead rebuilds the freelist on restart. The overhead is so small that it is unnoticeable to most users. See the "database space exceeded" issue for more information.

etcd v3.4 and Beyond

etcd v3.4 focuses on improving the operational experience. It adds a Raft pre-vote feature to improve the robustness of leadership election. When a node becomes isolated (e.g., as a result of a network partition), this member starts elections, requesting votes with ever-increasing Raft terms. When a leader receives a vote request with a higher term, it steps down to a follower. With pre-vote, Raft runs an additional election phase to check whether the candidate could get enough votes to win an election. The isolated follower's vote request is rejected because it does not contain the latest log entries.
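
Pre-vote is a configuration option in the Raft library that etcd maintains. The sketch below, assuming the etcd raft package's Config fields, shows a single-node raft instance started with PreVote enabled; a real application would go on to drive the node's Ready loop.

```go
package main

import (
	"go.etcd.io/etcd/raft" // etcd's Raft library; import path varies by release
)

func main() {
	storage := raft.NewMemoryStorage()
	cfg := &raft.Config{
		ID:              0x01,
		ElectionTick:    10, // election timeout, in ticks
		HeartbeatTick:   1,  // heartbeat interval, in ticks
		Storage:         storage,
		MaxSizePerMsg:   1024 * 1024,
		MaxInflightMsgs: 256,

		// With PreVote enabled, a candidate first asks its peers whether it
		// could win an election; an isolated member with a stale log is
		// rejected and does not force the current leader to step down.
		PreVote: true,
	}

	n := raft.StartNode(cfg, []raft.Peer{{ID: 0x01}})
	defer n.Stop()
	// A real application would now process the node's Ready() channel.
}
```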

etcd v3.4 adds Raft learners: a learner joins the cluster as a non-voting member that still receives all updates from the leader. Adding a learner does not increase the size of the quorum, and hence improves cluster availability during membership reconfiguration. It serves only as a standby node until it is promoted to a voting member. Moreover, to handle unexpected upgrade failures, v3.4 introduces the etcd downgrade feature.
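
With the v3.4 client (and etcdctl), the learner workflow is roughly: add the new member as a learner, wait for it to catch up, then promote it. A hedged sketch using clientv3's MemberAddAsLearner and MemberPromote calls, with placeholder endpoint and peer URLs:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3" // v3.4 client; import path varies by release
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Add the new member as a non-voting learner; quorum size is unchanged.
	resp, err := cli.MemberAddAsLearner(ctx, []string{"http://10.0.0.4:2380"}) // placeholder peer URL
	if err != nil {
		log.Fatal(err)
	}

	// ... wait for the learner to catch up with the leader's log ...

	// Promote it to a full voting member once it is in sync.
	if _, err := cli.MemberPromote(ctx, resp.Member.ID); err != nil {
		log.Fatal(err)
	}
}
```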

etcd v3 storage uses a multi-version concurrency control model to preserve key updates as event history. Kubernetes runs compaction to discard event history that is no longer needed and to reclaim storage space. etcd v3.4 will improve this storage compaction operation, boost backend concurrency for large read transactions, and optimize the storage commit interval for Kubernetes use cases.
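
At the client level, the compaction Kubernetes performs looks roughly like the following sketch (placeholder endpoint): read the current revision from any response header, then compact everything older than it.

```go
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3" // import path varies by etcd release
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Any read returns the store's current revision in the response header.
	resp, err := cli.Get(ctx, "foo")
	if err != nil {
		log.Fatal(err)
	}
	rev := resp.Header.Revision

	// Discard key revisions older than rev; with WithCompactPhysical the
	// call returns only after the compaction is applied to the backend.
	if _, err := cli.Compact(ctx, rev, clientv3.WithCompactPhysical()); err != nil {
		log.Fatal(err)
	}
	log.Printf("compacted up to revision %d", rev)
}
```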

To further improve the etcd client load balancer, the v3.4 balancer was rewritten to leverage the newly introduced gRPC load balancing API. This enabled us to substantially simplify the etcd client load balancer codebase while retaining feature parity with the v3.3 implementation and improving overall load balancing by round-robining requests across healthy endpoints.

Additionally, etcd maintainers will continue to improve the Kubernetes test frameworks: kubemark integration for scalability tests, Kubernetes API server conformance tests with etcd to provide release recommendations and a version skew policy, specifying conformance testing requirements for each cloud provider, and more.

Synergistic work with Kubernetes has driven the evolution of etcd. Without community feedback and contributions, etcd could not have achieved its current level of maturity and reliability. We look forward to continuing the growth of etcd as an open source project, and are excited to work with Kubernetes and the wider CNCF community.

Finally, we’d like to thank all our contributors, with special thanks to Xiang Li for his leadership in etcd and Kubernetes.

Joe Betz

Joe Betz is the lead Software Engineer for etcd at Google Cloud and an etcd project maintainer. Joe is directly responsible for the health and stability of the GKE etcd fleet and leads improvements to etcd via open source contributions. He actively contributes to Kubernetes, with a focus on the etcd interface layer as well as occasional improvements to other areas of api-machinery. He has also served the community as the patch manager for the Kubernetes 1.8 branch.

Gyuho Lee

Gyuho works on AWS EKS and is a lead etcd maintainer. He loves to talk about distributed systems, and is passionate about making complex systems easier to understand.