AWS Open Source Blog
Chaos Engineering Meetups
We all want to build reliable systems, and some applications are more critical than others. However, how do you know you have actually built a reliable system? You have to test it and exercise its ability to be resilient to various types of failure. Unfortunately, the error handling code and failure handling procedures are usually the least well-tested aspects of an application, because people are scared of causing an outage. If we ask the question: “Do you have a backup datacenter?” the answer is often: “Yes, of course we do.” The response to the follow-up question: “How often do you fail over the entire datacenter all at once?” is often an embarrassed silence. I like to call the practice of going through the motions of having a failover process, without actually testing it, “Availability Theater” – it makes people feel better without actually solving the problem of keeping things running when something goes wrong.
There is an emerging practice around chaos engineering, driven by teams at several companies, including Amazon and Netflix. The Netflix team has published an excellent book on the subject.
How does this relate to open source? Several years ago, Netflix released open source projects including the Chaos Monkey, which killed individual AWS instances to ensure that the application code was stateless, and to demonstrate that the auto-scaler replaced them automatically. Netflix also described Chaos Gorilla, which shuts down one randomly-picked availability zone in an AWS region, to ensure that everything still works when only two out of three zones are available. Chaos Kong is a really big gorilla that evacuates a region to test multi-region resilience. Recently, Netflix described how they sped up their Chaos Kong exercises to finish in less than ten minutes, and they run them every few weeks. Paradoxically, the more frequently tests are run, the less likely they are to cause a problem, as developers can’t get away with taking shortcuts, and bad designs and misconfigurations are found early, before they propagate.
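To make the difference between the two smaller experiments concrete, here is a minimal sketch of how the targets might be chosen. It is a simulation only, not Netflix's implementation: the `Instance` type and the helper names `pick_monkey_victim`, `pick_gorilla_victims`, and `survivors` are hypothetical, and no AWS API is called. A real tool would hand the selected instances to the cloud provider's termination API and then check that the application still serves traffic.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical stand-in for a running instance."""
    instance_id: str
    zone: str


def pick_monkey_victim(instances, rng):
    """Chaos Monkey style: choose a single instance at random."""
    return rng.choice(instances)


def pick_gorilla_victims(instances, rng):
    """Chaos Gorilla style: choose one zone at random and
    return that zone plus every instance running in it."""
    zones = sorted({inst.zone for inst in instances})
    zone = rng.choice(zones)
    return zone, [inst for inst in instances if inst.zone == zone]


def survivors(instances, victims):
    """Instances that remain after the victims are removed."""
    victim_ids = {v.instance_id for v in victims}
    return [inst for inst in instances if inst.instance_id not in victim_ids]


def example_fleet():
    """Six made-up instances spread evenly over three zones."""
    return [
        Instance(f"i-{n}", zone)
        for n, zone in enumerate(["zone-a", "zone-b", "zone-c"] * 2)
    ]


if __name__ == "__main__":
    rng = random.Random()
    fleet = example_fleet()

    victim = pick_monkey_victim(fleet, rng)
    print("Chaos Monkey would terminate:", victim.instance_id)

    zone, victims = pick_gorilla_victims(fleet, rng)
    remaining = survivors(fleet, victims)
    print("Chaos Gorilla would evacuate:", zone)
    print("Zones still serving:", sorted({i.zone for i in remaining}))
```

The selection logic is deliberately trivial. The value of the exercise comes from running it often and verifying that the remaining two zones can carry the load.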
Netflix has described how they use a system called ChAP, the Chaos Automation Platform, to run large numbers of automated experiments, and Gremlin Inc. has released a sophisticated commercial platform for managing chaos testing. More recently, London-based ChaosIQ has created the open source Chaostoolkit, which they are using as the basis of a commercial offering, but which also provides a good focal point for open collaboration.
Chaostoolkit is a well-organized and clearly documented project hosted on GitHub. It provides drivers for automating operations against various cloud APIs, including AWS EC2 and Kubernetes. The ability to control infrastructure directly via an API provides new opportunities for chaos testing that would be impractical in a ticket-driven datacenter environment. Both Gremlin and Chaostoolkit can run experiments to exercise the failure modes of applications running on Kubernetes, and we are looking at ways to test the Kubernetes control plane itself.
It’s still early days for chaos engineering, and the chaos community is doing a lot of outreach and socializing of ideas. Russ Miles of ChaosIQ is presenting at GOTO Chicago on April 25-26, along with talks by Kolton Andrus of Gremlin Inc. and me. Russ is taking his talk on the road in the weeks leading up to GOTO Chicago; he will drive there as a “Geek on a Harley,” presenting at meetups along the way. The tour starts in San Francisco on March 27th. The following evening, March 28th, at Intuit’s offices in Mountain View, I will be joining Russ as his support act for one night only, and hope to provide some musical enhancements to the evening’s entertainment.
During the tour, Russ will be writing a book based on the conversations he has along the way about production failures and chaos engineering. The book will be free, but donations will be happily accepted, with all proceeds going to Girls Who Code. The first contribution will be published early next week.