AWS Open Source Blog
etcd gets ready to graduate
Update: On November 24, 2020, the Cloud Native Computing Foundation announced etcd’s graduation.
Etcd, a distributed key-value store that helps power projects such as Kubernetes, is set to join the ranks of the most critical and recognizable projects in open source computing. The Cloud Native Computing Foundation (CNCF), the non-profit foundation that serves as the home for many fast-growing open source projects, is voting to move etcd from incubating to graduated.
To graduate, a project must demonstrate thriving adoption, a documented neutral governance process, maintainers from multiple organizations, and a strong commitment to community sustainability and inclusion. Since becoming an incubating project in December 2018, etcd has demonstrated significant growth with 180+ contributors from multiple organizations, including AWS; more than 2,000 commits for improvements and bug fixes; 42 releases with continued support for older etcd versions; and wide adoption as a default storage backend for Kubernetes. Being a key part of Kubernetes means that etcd enables application delivery, data processing, and machine learning for thousands of companies and organizations in every industry, around the world.
Etcd is a strongly consistent, distributed key-value store that provides a reliable way to store data for distributed systems. It provides a single logical view across a cluster of computing nodes and is specialized for small chunks of data, with an emphasis on consistency and fault tolerance. A typical etcd cluster runs on 3 or 5 computing nodes (virtual machines) for high availability. Etcd uses the Raft consensus algorithm to manage data replication, which ensures strong consistency and fault tolerance even when individual nodes fail completely, as long as a majority of the cluster remains available.
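To illustrate why 3 or 5 nodes is the usual choice: Raft commits a write once a majority (a quorum) of members has accepted it, so the cluster keeps working as long as a majority is reachable. The short sketch below works through that arithmetic. The function names are illustrative only and are not part of etcd.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """Number of members that can fail while a majority still remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(
            f"{n} members: quorum={quorum(n)}, "
            f"tolerates {failure_tolerance(n)} failure(s)"
        )
```

By this arithmetic, a 4-node cluster tolerates no more failures than a 3-node cluster, which is one reason odd cluster sizes are typical.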
At AWS, we operate etcd as part of Amazon Elastic Kubernetes Service (Amazon EKS), a fully managed Kubernetes service. Amazon EKS runs a dedicated etcd cluster as part of every Kubernetes cluster. This cluster is fully managed by the Amazon EKS service, meaning that all operations, scaling, patching, bug fixes, and upgrades for etcd (in addition to other cluster components) are handled by EKS. At AWS scale, fault tolerance and scalability are critical to providing a highly available, production-ready Kubernetes service. Although Raft provides fault tolerance and strong consistency in theory, in practice it takes a lot more to operationalize etcd to meet the needs of Amazon EKS, which runs at a scale where theoretical edge cases are regular occurrences and solving complex distributed systems problems is an operational imperative.
To meet AWS standards for scale and operations, we built etcd nanny, a supervisor for etcd that constantly checks the etcd cluster’s control plane nodes to ensure the cluster is healthy. Etcd nanny monitors the clusters, manages periodic backups and failure recovery, ensures high availability across multiple AWS Availability Zones (AZs), handles scaling, and performs active health management (among other things). If you’re interested in learning more about how we operate etcd at AWS, take a look at the KubeCon talk Living with the pathology of the cloud: How AWS runs lots of clusters.
While a lot goes into running etcd at Amazon EKS scale, it would not be possible without the sustainable, diverse, collaborative, and open etcd community. The etcd community has allowed the project to flourish and grow since its inception and has made it possible to use etcd as the default key-value store supporting Amazon EKS. As of November 2020, etcd has 800+ contributors from 500+ organizations, 11 maintainers from 7 organizations (including AWS), and 33k+ GitHub stars.
On the Amazon EKS team, participating in open source communities is part of our culture. Bob Wise, General Manager of Kubernetes at AWS, had this to say about etcd’s graduation:
“Open source software powers our lives in so many ways. From Linux to Kubernetes, open communities of builders from all sizes of organizations and walks of life spend considerable time creating and maintaining projects that underpin much of the internet, telecommunications, finance, transportation, gaming, retail, and healthcare systems we use every day.
etcd is one of these critical projects and we’re proud to have etcd as a core part of Amazon EKS and to be involved in helping the project grow and thrive. We are fervent supporters of etcd’s graduation and look forward to collaborating with etcd and other CNCF projects to build secure, reliable, powerful, and scalable open source software.”
The impact of this open collaboration is evident in all the work that goes into making etcd successful. Etcd maintainers hold monthly meetings that anyone is welcome to join. We always see new faces signing up for new work and contributing to the project. We ship monthly releases and meet twice a year for working sessions at CloudNativeCon/KubeCon. Whether or not you’re ready to dive into contributing, you can follow the project on Twitter @etcdio to stay up to date on the latest developments.
Before AWS, I was primarily focused on developing etcd without actually using it in production. At AWS, engineers own the product, design, and operations end to end. We push etcd to its limits. We have found issues that otherwise would be considered theoretical edge cases and we are contributing those back to the community. I believe this helps to drive the quality and adoption of etcd even further.
What’s next?
Thanks to the vibrant community, etcd continues to improve. The project recently completed a Jepsen analysis that validates etcd’s basic consistency premises as well as its rigorous testing practices. Etcd adheres to high standards for reliability and security. We regularly run functional testing to ensure correctness in the presence of failures, and we recently completed an independent third-party security audit with no critical security issues found. Etcd has now fully adopted Go modules to better support a growing number of client library users. Additionally, etcd has improved metrics collection and monitoring and has added critical performance improvements to the compaction API for large-scale clusters.
For the upcoming 3.5 release, etcd will support downgrades for safe rollbacks, a stable gRPC gateway for v3 API HTTP endpoints, a simplified Go client balancer implementation that adapts to the latest gRPC interface, structured logging, and a stable standby node feature, among other features.
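For readers who have not used the gRPC gateway: it exposes the v3 API as JSON over HTTP, with keys and values base64-encoded in the request body. The following is a minimal sketch of building a put request, assuming a local etcd listening on 127.0.0.1:2379 and the `/v3/kv/put` path. Older releases used a different path prefix, so check the documentation for your version. The helper names `b64` and `put_request` are hypothetical.

```python
import base64
import json
import urllib.request


def b64(text: str) -> str:
    """Base64-encode a string, as the JSON gateway expects for keys and values."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def put_request(endpoint: str, key: str, value: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP request that stores key=value."""
    body = json.dumps({"key": b64(key), "value": b64(value)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{endpoint}/v3/kv/put",  # assumed path prefix; varies by etcd version
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Assumed local endpoint; replace with the address of your own cluster.
    req = put_request("http://127.0.0.1:2379", "foo", "bar")
    print(req.full_url)
    print(req.data.decode("utf-8"))
    # To send it to a running etcd: urllib.request.urlopen(req)
```

The request is only constructed here, not sent, so the sketch runs without a cluster.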
With the continued support of the CNCF, the community, and AWS, etcd will continue the push to be a highly reliable distributed system, with bar-raising testing and security practices. Etcd maintainers like me will keep doing our best to build a strong and welcoming open source software development community.