Chaos merchant’s failure-as-a-service tests system resilience
Chaos-engineering company Gremlin has launched Scenarios – “templates of real-world outages” that make it easier to wreck your applications.
Gremlin announced the product at Chaos Conf 2019, taking place in San Francisco. Scenarios include traffic spikes for testing what happens under severe load; unreliable networks for when your microservice API calls start taking ages to respond; and region evacuation, for when a cloud region becomes unavailable.
The idea of chaos engineering is to cause deliberate failure in order to investigate whether your application or system is resilient. Chaos engineering tools can consume 100 per cent of CPU, shut down a percentage of your hosts, make DNS calls unresponsive, or introduce severe latency into networks, so you can discover whether planned resiliency measures, like failover systems, actually work as designed, in the same way as you validate a backup by doing a test restore.
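As a rough illustration of the "test restore" idea, the Python sketch below injects latency into a call and checks that a timeout-based failover actually kicks in. Every name in it (with_latency, call_with_failover, primary_service, standby_service) is hypothetical and invented for this example; it is not how Gremlin or any other chaos tool is implemented, and the delay and timeout values are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def with_latency(func, delay_seconds):
    """Wrap func so every call is delayed, imitating a slow network dependency."""
    def slowed(*args, **kwargs):
        time.sleep(delay_seconds)
        return func(*args, **kwargs)
    return slowed


def call_with_failover(primary, fallback, timeout_seconds):
    """Call primary; if it does not answer within the timeout, use fallback."""
    executor = ThreadPoolExecutor(max_workers=1)
    try:
        return executor.submit(primary).result(timeout=timeout_seconds)
    except FutureTimeout:
        return fallback()
    finally:
        # Do not block on the slow call; let it finish in the background.
        executor.shutdown(wait=False)


# Hypothetical stand-ins for a real service and its standby.
def primary_service():
    return "primary"


def standby_service():
    return "standby"


if __name__ == "__main__":
    # Healthy path: the primary answers in time.
    print(call_with_failover(primary_service, standby_service, 0.5))
    # Injected failure: the primary is slowed past the timeout.
    slow_primary = with_latency(primary_service, 0.3)
    print(call_with_failover(slow_primary, standby_service, 0.05))
```

The point of the second call is that the failover path is exercised deliberately rather than assumed to work.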
We spoke to Gremlin’s Senior Site Reliability Engineer (SRE), Tammy Butow, at the QCon conference in London. “The history starts with Netflix when they were moving to AWS,” she told us. “They thought, how do we make sure that this does work? They started by creating Chaos Monkey, which they later open-sourced. That was about, if we shut down a server, is everything OK? That helped them provide feedback to AWS.”
Chaos Monkey is free but can be complex to deploy.
“We’re trying to prevent downtime and we’re trying to prevent data loss,” Butow added. “Back when I worked at National Australia Bank we did disaster recovery tests. You have to do those to get your banking licence. But if you’re in a tech startup, there’s nobody that holds you accountable, to prove that your system is resilient and that you’re looking after your customer’s data.”
The failures injected by Gremlin are not simulated, except in the sense that they can be paused or removed. “If you do it the wrong way it can be dangerous,” said Butow.
The key is to start small. The “blast radius” of a test determines how wide its impact is. “I like to do a CPU attack first. It’s the Hello World of chaos engineering,” Butow said.
You can begin by taking down just one or two servers, then expand to taking down whole services or an entire region. A service like Gremlin provides an API and a control plane, so you can automate and schedule tests.
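The "start small, then widen" approach can be sketched in a few lines of Python. This is an illustrative sketch only: pick_targets, escalate and the run_attack callback are hypothetical names made up here, the host names and fractions are arbitrary, and nothing below reflects Gremlin's actual API.

```python
import math
import random


def pick_targets(hosts, fraction, seed=None):
    """Return a random subset covering `fraction` of the fleet, at least one host."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not hosts:
        return []
    # Round first so floating-point noise (e.g. 0.3 * 10) does not add a host.
    count = max(1, math.ceil(round(len(hosts) * fraction, 9)))
    rng = random.Random(seed)
    return sorted(rng.sample(hosts, count))


def escalate(hosts, fractions, run_attack, seed=None):
    """Widen the blast radius stage by stage; stop at the first failed stage.

    run_attack is a hypothetical callback that attacks the given hosts and
    returns True if the system stayed healthy.
    """
    completed = []
    for fraction in fractions:
        targets = pick_targets(hosts, fraction, seed)
        if not run_attack(targets):
            break
        completed.append((fraction, targets))
    return completed


if __name__ == "__main__":
    fleet = [f"web-{i}" for i in range(10)]
    # Pretend the system copes with losing up to three hosts at once.
    survived = escalate(fleet, [0.1, 0.25, 0.5, 1.0], lambda t: len(t) <= 3)
    for fraction, targets in survived:
        print(fraction, targets)
```

Stopping at the first stage that fails keeps the damage contained to the smallest blast radius that exposes the weakness.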
Just like in the security world, many failures come about due to people using services in unexpected ways. A common example is APIs. “When people build APIs they don’t think anyone’s going to abuse the API,” said Butow. “As an SRE I’m always looking for how can things break.”
It is common sense that you cannot call a system resilient until you have seen it survive massive failures but, as with backups, many organisations still end up learning the hard way. ®