AWS Open Source Blog

Chaos Engineering Meetups

[Image: Geek on a Harley]

We all want to build reliable systems, and some applications are more critical than others. However, how do you know you have actually built a reliable system? You have to test it and exercise its ability to be resilient to various types of failure. Unfortunately, the error handling code and failure handling procedures are usually the least well-tested aspects of an application, because people are scared of causing an outage. If we ask the question: “Do you have a backup datacenter?” the answer is often: “Yes, of course we do.” The response to the follow-up question: “How often do you fail over the entire datacenter all at once?” is often an embarrassed silence. I like to call the practice of going through the motions of having a failover process, without actually testing it, “Availability Theater” – it makes people feel better without actually solving the problem of keeping things running when something goes wrong.

There is an emerging practice around chaos engineering, driven by teams at several companies, including Amazon and Netflix. The Netflix team have published an excellent book on the subject.

[Image: cover of the Chaos Engineering book, O'Reilly]

How does this relate to open source? Several years ago, Netflix released open source projects including the Chaos Monkey, which killed individual AWS instances to ensure that the application code was stateless, and to demonstrate that the auto-scaler replaced them automatically. Netflix also described Chaos Gorilla, which shuts down one randomly-picked availability zone in an AWS region, to ensure that everything still works when only two out of three zones are available. Chaos Kong is a really big gorilla that evacuates a region to test multi-region resilience. Recently, Netflix described how they sped up their Chaos Kong exercises to finish in less than ten minutes, and they run them every few weeks. Paradoxically, the more frequently tests are run, the less likely they are to cause a problem, as developers can’t get away with taking shortcuts, and bad designs and misconfigurations are found early, before they propagate.
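To make the idea concrete, here is a minimal sketch of the kind of checks Chaos Monkey and Chaos Gorilla perform, written as a toy in-memory simulation. It is not Netflix's code and it makes no calls to any AWS API. The `Fleet` class, the zone names, the instance id format, and the `required` capacity figure are all hypothetical, chosen only for illustration.

```python
import random
from collections import Counter

ZONES = ["zone-a", "zone-b", "zone-c"]  # hypothetical zone names


class Fleet:
    """Toy model of a stateless service behind an auto-scaler."""

    def __init__(self, desired_per_zone, zones=ZONES):
        self.zones = list(zones)
        self.desired_per_zone = desired_per_zone
        self.next_id = 0
        self.instances = {}  # instance id -> zone
        self.offline_zones = set()
        self.autoscale()

    def autoscale(self):
        """Top up every online zone to the desired instance count."""
        counts = Counter(self.instances.values())
        for zone in self.zones:
            if zone in self.offline_zones:
                continue
            for _ in range(self.desired_per_zone - counts[zone]):
                self.instances[f"i-{self.next_id:04d}"] = zone
                self.next_id += 1

    def capacity(self):
        return len(self.instances)


def chaos_monkey(fleet, rng):
    """Kill one randomly chosen instance and return its id."""
    victim = rng.choice(sorted(fleet.instances))
    del fleet.instances[victim]
    return victim


def chaos_gorilla(fleet, rng):
    """Take one randomly chosen zone offline, losing all its instances."""
    zone = rng.choice(fleet.zones)
    fleet.offline_zones.add(zone)
    fleet.instances = {i: z for i, z in fleet.instances.items() if z != zone}
    return zone


if __name__ == "__main__":
    rng = random.Random(42)
    fleet = Fleet(desired_per_zone=3)
    required = 6  # hypothetical: instances needed to serve peak load

    victim = chaos_monkey(fleet, rng)
    fleet.autoscale()
    print(f"killed {victim}; capacity after auto-scaling: {fleet.capacity()}")

    zone = chaos_gorilla(fleet, rng)
    fleet.autoscale()
    status = "ok" if fleet.capacity() >= required else "DEGRADED"
    print(f"lost {zone}; capacity {fleet.capacity()} vs required {required}: {status}")
```

The point of the sketch is the shape of each experiment: inject a failure, let the recovery mechanism run, then check that capacity is still sufficient. In this toy model, losing a zone only passes because two zones' worth of instances can carry the assumed load. A real experiment would measure that rather than assume it.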

Netflix has described how they use a system called ChAP, the Chaos Automation Platform, to run large numbers of automated experiments, and Gremlin Inc. has released a sophisticated commercial platform for managing chaos testing. More recently, London-based ChaosIQ have created the open source Chaostoolkit, which they are using as the basis of a commercial offering, and which also provides a good focal point for open collaboration.

Chaostoolkit is a well-organized and clearly documented project hosted on GitHub. It provides drivers for automating operations against various cloud APIs, including AWS EC2 and Kubernetes. The ability to control infrastructure directly via an API provides new opportunities for chaos testing that would be impractical in a ticket-driven datacenter environment. Both Gremlin and Chaostoolkit can run experiments to exercise the failure modes of applications running on Kubernetes, and we are looking at ways to test the Kubernetes control plane itself.

It’s still early days for chaos engineering, and the chaos community is doing a lot of outreach to socialize its ideas. Russ Miles of ChaosIQ is presenting at GOTO Chicago on April 25-26th, where Kolton Andrus of Gremlin Inc. and I are also giving talks. Russ is taking his talk on the road in the weeks leading up to GOTO Chicago; he will drive there as a “Geek on a Harley,” presenting at meetups along the way. The tour starts in San Francisco on March 27th. The following evening, March 28th, at Intuit’s offices in Mountain View, I will be joining Russ as his support act for one night only, and I hope to provide some musical enhancements to the evening’s entertainment.

During the tour, Russ will be writing a book based on the conversations he has along the way about production failures and chaos engineering. The book will be free, but donations will be happily accepted, with all proceeds going to Girls Who Code. The first contribution will be published early next week.

Adrian Cockcroft

Vice President Cloud Architecture Strategy, Amazon Web Services

Adrian Cockcroft has had a long career working at the leading edge of technology, and is fascinated by what happens next. In his role at AWS, Cockcroft is focused on the needs of cloud native and “all-in” customers, and leads the AWS open source community development team. Prior to AWS, Cockcroft started out as a developer in the UK, joined Sun Microsystems and then moved to the United States in 1993, ending up as a Distinguished Engineer. Cockcroft left Sun in 2004, was a founding member of eBay research labs, and started at Netflix in 2007. He initially directed a team working on personalization algorithms and then became cloud architect, helping teams scale and migrate to AWS. As Netflix shared its architecture publicly, Cockcroft became a regular speaker at conferences and executive summits, and he created and led the Netflix open source program. In 2014, he joined VC firm Battery Ventures, promoting new ideas around DevOps, microservices, cloud and containers, and moved into his current role at AWS in October 2016. During 2017 he recruited a team of experienced open source technologists and gave keynote presentations at AWS Summits and many other events around the world. Cockcroft holds a degree in Applied Physics from The City University, London and is a published author of four books, notably Sun Performance and Tuning (Prentice Hall, 1998).