When It Comes to Chaos, Gorillas Before Monkeys

Here’s a glitch in my thinking that I realized on a recent job: I am too terrified of monkeys, and not sufficiently afraid of gorillas. As a result, I’ve been missing opportunities for early, smart investments to make my systems more resilient in the Amazon cloud.

By “monkey” and “gorilla” I mean “Chaos Monkey” and “Chaos Gorilla,” veterans of Netflix’s Simian Army. You can browse the entire list ¹, but for easy reference:

Chaos Monkey is the personification (simianification?) of EC2 instance failure.
Chaos Gorilla represents major degradation of an EC2 availability zone, henceforth “AZ” for short (or, as we sometimes referred to them at my last job, “failability zones”).

I believe that startups should (mostly) worry less about EC2 instances failing, and more about entire AZs degrading. This leads to a different kind of initial tech/devops investment—one that I believe represents a better return for most early-stage companies and products.

How I (Finally) Learned to Dread Chaos Gorilla Appropriately

At the job in question, the team and I were working on an application that had, at some unremarked moment, crossed the fuzzy line between advanced prototype and early production. First customers were using it—some were even starting to depend on it in their daily lives—and suddenly downtime had gone from something we thought about as “wouldn’t it be nice if someone noticed or cared” to “wow that might really tick some people off.”

Unfortunately, unlike every other development shop ever, we might have cut a corner or two getting our prototype out. To wit, we had a single point of failure in our system. Or maybe two. All right, I admit it: we had four SPOF time bombs ticking away, which—quite coincidentally—was also the total number of EC2 instances in our deployment. I felt bad about that, and so did the rest of the team. We all knew that SPOFs were pure evil, right up there with axe murderers and grams of trans fat on the list of “Things There’s No Good Number Of.” So we planned to spend a good chunk of a couple sprints terminating those SPOFs with extreme prejudice ².

And then, the night before the first such sprint began, we dodged a bullet: one of the East Coast AZs freaked the hell out, bringing half the sites on the Internet down with it. Luckily, it wasn’t the AZ we were in ³. Phew, right?

But on the bus the next morning I got to thinking: why did I feel ashamed of our SPOF instances, but lucky for dodging the latest AZ meltdown? Why did I now suspect that if an instance failure had caused us even minutes of downtime, I would have blamed myself, whereas if an AZ meltdown had knocked our site out of commission for hours (along with Reddit and maybe Netflix) I would have—if I was being honest with myself—kind of blamed Amazon? This felt like just the kind of misaligned thinking that might be hiding an economic opportunity.

Arriving at the office, I cornered Dan in the kitchen and we chatted it through. From a cold, hard, economic point of view, would we get a better return by first protecting against instance failure, or by improving our resilience to AZ meltdown? Instances failed, sure, and if we were at Netflix’s scale, they’d be failing all the time. But at our scale of four machines, they didn’t seem to fail very often, and when they did, it would be maybe an hour of scrambling to fix. On the other hand, I could name five occasions in the previous two years, just by my personal count, when an AZ had melted down, and in each case we had spent more like half a day (at least) dealing with the crisis and fallout. If you go looking for AZ meltdowns, they’re not very hard to find.

In other words, at our scale of four instances:

instance failure = cheap and seldom
AZ meltdowns = expensive and frequent ⁴

(Protecting against AZ meltdown also has the nice benefit of optimizing for mean time to recovery instead of mean time between failures, which is usually the right way to go.)
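The comparison above can be put into rough numbers. What follows is only a back-of-the-envelope sketch: the hour of scrambling per instance failure and the five AZ meltdowns in two years come from the story, while the per-instance failure rate, the reading of “half a day” as four working hours, and the assumption that every meltdown actually hits your deployment are made up for illustration.

```python
# Back-of-the-envelope estimate of hours lost per year to each kind of failure.
# Values marked ASSUMED are illustrative guesses, not measurements.

FLEET_SIZE = 4                      # four EC2 instances (from the story)
HOURS_PER_INSTANCE_FAILURE = 1.0    # "maybe an hour of scrambling" (from the story)
AZ_MELTDOWNS_PER_YEAR = 5 / 2       # five occasions in two years (from the story)
HOURS_PER_AZ_MELTDOWN = 4.0         # ASSUMED: "half a day" read as half a working day
FAILURES_PER_INSTANCE_YEAR = 0.25   # ASSUMED: each instance dies once every four years


def expected_hours_per_year(events_per_year, hours_per_event):
    """Expected hours lost per year for one failure type."""
    return events_per_year * hours_per_event


instance_hours = expected_hours_per_year(
    FLEET_SIZE * FAILURES_PER_INSTANCE_YEAR, HOURS_PER_INSTANCE_FAILURE
)
az_hours = expected_hours_per_year(AZ_MELTDOWNS_PER_YEAR, HOURS_PER_AZ_MELTDOWN)

print(f"instance failures: {instance_hours:.1f} hours/year")
print(f"AZ meltdowns:      {az_hours:.1f} hours/year")
```

With these particular guesses the gorilla costs ten times what the monkey does. Plug in your own rates; the point is the shape of the comparison, not the exact figures.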

So we changed course: over the next few sprints we ran AZ meltdown simulations, made improvements to our backup and deploy scripts that would allow us to recover from disaster more quickly ⁵, and generally made ourselves more resilient to the economically disastrous wrath of Chaos Gorilla before we spent real dev calories preventing the relative pranks of Chaos Monkey.
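An AZ meltdown simulation can start on paper before anyone terminates anything. The sketch below is a hypothetical tabletop check, not the tooling we actually used: the inventory, the instance IDs, the service names other than the IMAP server, and the AZ names are all placeholders. It simply asks which services would have no surviving instance if a given AZ vanished.

```python
from collections import defaultdict

# Hypothetical inventory of (instance_id, service, availability_zone).
# Everything here is a placeholder; the only detail taken from the story is
# that all four instances sat in a single AZ.
INVENTORY = [
    ("i-web", "web", "us-east-1a"),
    ("i-imap", "imap", "us-east-1a"),
    ("i-db", "database", "us-east-1a"),
    ("i-worker", "worker", "us-east-1a"),
]


def services_lost(inventory, failed_az):
    """Return the services left with no running instance if failed_az disappears."""
    survivors = defaultdict(int)
    for _instance_id, service, az in inventory:
        # Count only the instances that live outside the failed AZ.
        survivors[service] += 0 if az == failed_az else 1
    return sorted(service for service, count in survivors.items() if count == 0)


print(services_lost(INVENTORY, "us-east-1a"))
print(services_lost(INVENTORY, "us-east-1b"))
```

For a single-AZ deployment the first call lists every service, which is exactly the uncomfortable answer a drill is meant to surface. The more useful number to track afterward is how long it takes your scripts to bring each of those services back in another AZ.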

Why Had I Been Thinking About This So Wrong?

I suspect for a few reasons:

The Different Tradeoffs of Physical Hardware

I get a lot of my habits and patterns of thought from having worked in the olden days on sites deployed on physical, colocated servers, where the cost/benefit profiles are different from those of the cloud. Unlike tooling some scripts to bring up new instances on demand in a separate AZ, maintaining a failover-ready second server farm in a separate colo facility represents a substantial investment for a young company. Furthermore, compared to the bleeding-edge insanity of an AZ operated at Amazon’s scale, most physical installations are built on boring, tried, and relatively simple technology, and they don’t catastrophically fail as often.

Don’t get me wrong—there’s no excuse for not preparing against colo failure in the physical hardware universe, any more than there’s an excuse for ignoring Chaos Gorilla in AWS, but the combination of lower incidence of failure and higher cost to protect against it means that the economic “break even” line can get drawn later in an application’s life-cycle.

AZs Still “Feel Like Hardware”

It’s natural to think of your piddling EC2 instance as something ephemeral, but (at least for me) it’s not natural to think of a whole AZ as something volatile or dangerous. It’s basically just a big, solid data center, more stable than the boxes it houses, right? Well…no.

An AZ is a virtualized data center—an exercise in massively distributed and heterogeneous systems engineering, housing thousands of tenants who are each competing for resources and aren’t even supposed to know the others exist. No one else in the world (as far as I know) is operating a comparable system at the scale of AWS. And like all such complex, distributed systems, AZs are subject to weird hidden dependencies and nasty cascading failure modes.⁶

In other words: I think we intuitively feel that AZs fail in the modes of colocated hardware—relatively simply and independently—when in fact they tend to fail in the mind-bogglingly interconnected modes of distributed software.

Chaos Monkey Came, Saw, and Conquered Our Imagination

I remember the first time I read about Chaos Monkey. The name was hilarious, and the thinking behind it just felt so instantly, recognizably right. “You want to make darn sure you only deploy production systems resilient to instance failure? OK, then randomly terminate instances in production.” I internalized that point of view pretty quickly, and it stuck.

Chaos Gorilla, on the other hand? I barely heard about him when he came on the scene. And I didn’t know Chaos Kong (who simulates an entire AWS coast failing) even existed until recently.

It’s Easy to Feel Good in Good Company

When one of your instances dies, you’re the only startup on your floor whose site is down. That feeling sucks. When an AZ goes down, Giants of the Internet stumble—even the mighty Netflix can have a problem or two. Meanwhile down the hall you hear the sobs and screams of your counterparts at other startups, which you find oddly comforting, and by that night you’re drinking beer and swapping the day’s war stories with them.

All this company can make you feel that, somehow, downtime caused by an AZ outage isn’t really your fault. But our customers don’t think that way, and neither should we.

In Conclusion

First, let’s note what I’m not concluding.

I’m not concluding that it’s OK to have single points of failure in your system, or that instances never die on Amazon, or that you shouldn’t engineer for high availability.

I am concluding that, because of the high historical frequency of AZ degradation and the relatively small number of instances most early startups deploy, those startups are less likely to be affected on any given day by instance failure than by AZ degradation. Furthermore, the costs incurred in downtime and recovery efforts are usually significantly higher for AZ degradation than for instance failure. Therefore, for many startups, it will make sense to invest in war-gaming AZ degradation and tooling for quick recovery before engineering around instance failure.

Or, to put it more succinctly: gorillas before monkeys.
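The conclusion depends on scale, so it is worth asking roughly where it flips. The sketch below is a minimal estimate under loudly stated assumptions: the AZ rate of five events in two years and the one hour per instance failure come from the story, while the four hours per AZ event and the per-instance failure rate are invented for illustration, and the cost of an AZ event is treated as independent of fleet size, which is a simplification.

```python
import math


def break_even_fleet_size(az_events_per_year, hours_per_az_event,
                          failures_per_instance_year, hours_per_instance_failure):
    """Smallest fleet size at which the expected hours lost to instance failures
    reach the expected hours lost to AZ degradation.

    Assumes AZ cost does not grow with fleet size (a simplification).
    """
    az_hours = az_events_per_year * hours_per_az_event
    hours_per_instance_year = failures_per_instance_year * hours_per_instance_failure
    return math.ceil(az_hours / hours_per_instance_year)


n = break_even_fleet_size(
    az_events_per_year=2.5,           # five AZ events in two years (from the story)
    hours_per_az_event=4.0,           # ASSUMED: half a working day per event
    failures_per_instance_year=0.25,  # ASSUMED: one failure per instance every four years
    hours_per_instance_failure=1.0,   # about an hour of scrambling (from the story)
)
print(f"monkeys catch up with the gorilla at about {n} instances")
```

Under these guesses the monkeys only catch up somewhere in the dozens of instances, well past a four-machine deployment. Different rates move the line, but probably not below the size of a typical early-stage fleet.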

¹ Including my personal favorite, the weirdly named “Conformity Monkey,” whom I kind of imagine standing around awkwardly with a crew cut and a button-down while “Hippie Monkey” and “Beat Monkey” hurl epithets and folk songs at him.
² For those wondering why we couldn’t just throw up an ELB or two and call it a day: a major element of the application was a custom IMAP server (yes, that’s right, IMAP, as in the charmingly chatty, delightfully stateful mail protocol). You haven’t really lived until you’ve tried to load balance a stateful protocol, I tell you. HTTP is for wussies.
³ For those interested: we were in a single AZ because we were required, for business reasons, to deploy into a VPC.
⁴ We went into some more depth when estimating the relative costs and benefits of protecting against AZ meltdown or instance failure, generating some upper / lower bound estimates which I plan to publish in a later companion piece.
⁵ Yes, we had robust backups and automated deploys, even for our prototype. Maybe we were a little SPOF-y, but we weren’t insane.
⁶ Take EBS for example, which has been responsible in one form or another for many of the AZ meltdowns to date. Even if you don’t use EBS directly, an EBS meltdown can affect your deployment because (surprise!) ELBs use EBS behind the scenes, as do RDS and a number of other AWS services. Even if you avoid all EBS-dependent services, a sudden rush of EBS failover and replication can choke an entire AZ, rendering your instances inaccessible and useless.