AWS Startups Blog
Gremlin: Chaos Engineering and Providing Failure as a Service
Gremlin CEO & Co-Founder Kolton Andrus wants you to think of his company as a flu shot. “We basically inject a little bit of harm [into your system] in order to find weak spots and build an immunity,” he says. “We proactively break things… to help make them stronger.” Andrus notes that the idea of preparing for disaster isn’t new: “we were doing hardware failure testing in the ’60s and ’70s, and people were writing papers about this in the ’80s, ’90s, and early 2000s.” What his team has noticed, though, is that migrating to the cloud has introduced new challenges. “As people adopt these microservice architectures and distributed systems, now we’re ever more reliant on other people’s software. One aspect is [to] prepare for the things that can go wrong to us,” he says. “A host could disappear. A network device could fail. A disk could fill up… When things break, that tends to be a common impetus to get better, but we’d like it to be a bit more proactive, something you prepare for.”
“Now a lot of our customers, they’re doing monitoring. They’re doing alerting. They’re seeing how their systems behave,” says Andrus. “But this is a way to verify that you’ve set it up correctly. As silly as it sounds, I’ve been part of many outages where somebody wasn’t monitoring something correctly. Somebody didn’t get paged. And something took three or four times longer than it needed to to fix.”
Watch our interview with Andrus to learn more about chaos engineering and how to prepare for failure, so that your systems behave well and your customers don’t feel the pain.