<h1>How to Square the Circle, Achieve Perpetual Motion, and Tune Your Alert Emails Just Right</h1>
<p><em>Edmund Jorgensen, 2015-03-02</em></p>
<p><span class="dquo">“</span>You can tune a piano,” goes the old joke, “but you can’t tuna fish.” I’ve
come to believe that you can’t really tune the automated alerts you get from
your monitoring systems either—at least in the sense that we usually mean when
we complain that “our email alerts are out of control and need to be tuned
properly.” Instead we should be spending our time and attention tuning a
larger, more complex system—of which alerts are just one part.</p>
<p>When you tune a piano string, you make small adjustments to return it toward
the ideal pitch from which it has drifted (say, 440 Hz for an A). It’s
overwhelmingly <em>impractical</em> to tune a real piano string to A perfectly,
because of the messy nature of the physical world—but the ideal you’re trying
for is straightforward and unambiguous.<sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup> Unfortunately, there’s no such
ideal for alerts.</p>
<p><span class="dquo">“</span>Of course there is,” says a guy in back, “I’ve read about it in tons of blog
posts. The ideal for alerts is this: <em>every time</em> I get an alert, it should
indicate an <em>actual problem</em> that requires my attention.”</p>
<p><span class="caps">OK</span>, guy at the back, that’s pretty easy to do—let’s just turn off the alerting
system altogether, and I promise that every alert you receive will be
indicative of a problem requiring your attention.</p>
<p><span class="dquo">“</span>Don’t play dumb,” he retorts. “We still want to get alerts whenever something
is wrong—as soon as we know it’s wrong, in fact—we don’t want to be sitting
around thinking things are all hunky dory when production is on fire. So
that’s the ideal towards which we tune alerts—we want alerts <em>immediately</em>
when there’s an actual problem, and only then.”</p>
<p>Fair enough, but that’s a very different kind of ideal than a string vibrating
at precisely 440 Hz. In fact it’s not even clear that this can be called an
“ideal” at all, because under this definition an alert can drift from ideal in
at least two, <em>often contradictory</em> directions. That is, it can fire when
there isn’t an actual problem (a false positive) or not fire when there is (a
false negative)—and when you make one of these problems better for a given
alert, you tend to make the other worse.</p>
<p>For example, if you’ve ever monitored <span class="caps">CPU</span> usage of a production system, you’ll
be familiar with the false positive alerts you get when the system becomes
briefly and legitimately busy doing a burst of real work. So you tune the
alert to back off a bit—perhaps you will tolerate up to 80% utilization
instead of 70% before alerting—only to find that some truly nasty condition
occasionally pegs the <span class="caps">CPU</span> right around 72% forever, causing all sorts of other
problems in the meantime. All right, you think, I’ll set the <span class="caps">CPU</span> threshold
back to 70% but not alert until we’ve exceeded that for 30 minutes, which is
longer than any legitimate work spike—but now you’ve guaranteed that you won’t
find out about actual problems for at least 30 minutes. So you set a new rule,
which etc. etc. etc.</p>
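<p>For the curious, here is roughly what that ever-twiddled rule looks like in
code. This is just a sketch in Python with made-up numbers: the threshold, the
window, and the one-sample-per-minute cadence are all assumptions, not anyone’s
real configuration.</p>
<pre><code># Sketch of a threshold-plus-duration alert rule (hypothetical numbers).
# Samples arrive once a minute; we alert only if every sample in the
# last WINDOW_MINUTES exceeded the threshold.

from collections import deque

THRESHOLD = 0.70        # 70% CPU utilization
WINDOW_MINUTES = 30     # the "30 minutes" tradeoff from the text

recent = deque(maxlen=WINDOW_MINUTES)

def should_alert(cpu_sample):
    """Record one per-minute CPU sample (0.0 to 1.0) and decide whether to fire."""
    recent.append(cpu_sample)
    window_full = len(recent) == WINDOW_MINUTES
    return window_full and all(s > THRESHOLD for s in recent)

# A brief legitimate spike never fills the window, so it stays quiet --
# but a real problem now goes unreported for at least 30 minutes.
</code></pre>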
<p>As tricky as this balancing act is (and, if you’ve ever struggled with this in
the real world, you know that it is plenty tricky), there’s a subtle (and even
more dastardly) problem buried in this discussion so far: the tossing around
we’ve been doing of that term “an actual problem.”</p>
<p>What makes one condition of your system “an actual problem” and another “not an
actual problem?” There’s no unambiguous, measurable criterion you can
reference to answer that question. If you polled different people in your
business, you’d almost certainly get vastly different answers—one of the folks
in marketing might not care, for example, if page load times were above 1
second for their latest micro-site, but might care very much if it’s down,
while another might think anything over 500 millis should be synonymous with
downtime—<em>for the very same site</em>. So what we mean by saying an alert caught
“an actual problem” ends up being something like: “when the alert fired, it
alerted me to a situation that, with my own infinitely flexible and
idiosyncratic human knowledge and judgment, I was glad to know about at just
that time and no later.”<sup id="fnref:2"><a class="footnote-ref" href="#fn:2" rel="footnote">2</a></sup></p>
<p>This is a pretty bad situation we find ourselves in—not because there’s some
unattainable perfection our alerts can’t ever realistically achieve, but
because we don’t even have a good enough definition of perfection to let us say
whether a change to one of our alerts is actually making it <em>better or worse</em>.
That’s a horrific state of affairs, because it means that as we attempt to
steadily improve our operations in nice small steps, we’re just going to end up
endlessly jerking away from whichever flavor of catastrophe last burned us: our
alert volume will build up until it’s just background noise, which will lead to
an alert on an “actual problem” being ignored, which will lead to a fantastic
blamefest that results in our cutting a bunch of alerts, which will lead to an
“actual problem” that generates no alert, which will lead to another blamefest
resulting in a buildup of alerts … and round and round we go, with no end in sight.</p>
<h3><span class="caps">OK</span>, then … is all hope lost?</h3>
<p>So how do we get out of this spiral of blame and abject existential horror?
Here’s a hint: how would we design our alerts if we had access to an infinite
supply of brilliant, unsleeping, <em>free</em> interns (who were also intimately
familiar with our systems and business) to respond to them? Well, assuming
we’re completely heartless<sup id="fnref:3"><a class="footnote-ref" href="#fn:3" rel="footnote">3</a></sup>, we’d make those alerts sensitive as all hell,
because if it’s free (and we’re heartless), why not throw human intelligence
and attention at every little blip and bump we monitor to see if there’s an
“actual problem?”</p>
<p>What if, on the other hand, the condition on which we were alerting was
extremely benign—for example, the website on which we host our high-school
poetry going down? We’d make that alert extremely insensitive, because
honestly: who cares if it goes down for a day or two, or even a week?<sup id="fnref:4"><a class="footnote-ref" href="#fn:4" rel="footnote">4</a></sup></p>
<h3>Economics to the rescue</h3>
<p>In other words, we can recast the idea of an alert as something that doesn’t
even <em>have</em> a Platonic ideal in and of itself—but which is one piece of an
economic equation, with an associated <em>cost</em> and <em>benefit</em> profile.</p>
<p>I mean those words literally, by the way: each alert has some cost and benefit
in some actual number of probabilistic dollars and cents, where the cost is
dominated by the investment of human attention and intelligence that it
occasions, and the benefit is equal to the cost of an “actual problem” (times
the probability that the alert has identified such a problem).</p>
<p>Now we can start comparing the relative badness of a given false positive and
its corresponding false negative, and see the outlines of a system that can
actually be tuned towards an unambiguous ideal—the absolute <em>minimum overall
cost</em>.<sup id="fnref:5"><a class="footnote-ref" href="#fn:5" rel="footnote">5</a></sup></p>
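<p>To make “minimum overall cost” concrete, here’s a back-of-the-envelope sketch
in Python. Every number in it is invented for illustration; the point is only
that rough, order-of-magnitude figures are enough to tell you which way to tune:</p>
<pre><code># Back-of-the-envelope expected cost of one alert rule (all numbers invented).

def expected_monthly_cost(false_positives_per_month,
                          cost_per_interruption,      # human attention, in dollars
                          missed_problems_per_month,  # expected false negatives
                          cost_per_missed_problem):   # cost of the "actual problem"
    attention_cost = false_positives_per_month * cost_per_interruption
    miss_cost = missed_problems_per_month * cost_per_missed_problem
    return attention_cost + miss_cost

# A twitchy rule: 40 interruptions at $50 each, but almost nothing slips by.
twitchy = expected_monthly_cost(40, 50, 0.05, 20_000)   # $2,000 + $1,000 = $3,000

# A relaxed rule: only 5 interruptions, but it misses more real problems.
relaxed = expected_monthly_cost(5, 50, 0.25, 20_000)    # $250 + $5,000 = $5,250

# Order-of-magnitude estimates are enough to say which direction to tune.
print(min((twitchy, "twitchy rule"), (relaxed, "relaxed rule")))
</code></pre>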
<h3>A gentle objection from the guy at the back</h3>
<p><span class="dquo">“</span>All right,” says the guy at the back, “I’ve sat quietly for a bit, but now I
think I’ve got you. Maybe you’re right that the Platonic ideal was a little
too simplistic, but what you’re proposing is way too <em>complex</em>—there’s just no
way in hell you can actually figure out the cost of human attention and
intelligence and the probability of a false negative etc. etc. etc. and get all
that down to dollars and cents. In practice it’s just impossible.”</p>
<p>Well, guy at the back, we agree about one thing: we’re never going to calculate
those costs down to the cent, or even the dollar. But—and here’s the great
part—in practice <em>we don’t have to</em>—we just need back-of-the-envelope,
order-of-magnitude estimates that allow us to compare a couple of choices (e.g. making
an alert more or less sensitive) and say, <em>relatively</em>, which course is
probably better.</p>
<p><span class="dquo">“</span>But,” says the guy, “you’re still asking me to put a cost on things like
minutes of downtime. The leaders of my business are never going to do
that—downtime is one of those things that just <em>can’t ever happen</em>.”</p>
<p>Oh, guy at the back, let me buy you a beer or ten.</p>
<p>The spirit behind that “can’t ever happen” is a gigantic problem—it’s
equivalent to saying “our perpetual motion machine just can’t ever run down,
because we love our customers and failure is not an option.” This is, at its
heart, a <em>moral</em> argument, where blame and punishment are what’s under
consideration—and in that mindset, people often get genuinely angry if you
even <em>suggest</em> that there’s an economic tradeoff to consider.<sup id="fnref:6"><a class="footnote-ref" href="#fn:6" rel="footnote">6</a></sup></p>
<p>If the above sounds uncomfortably close to your own situation, you have much
more profound problems than a simple pass at your Nagios configuration could
ever solve. If you operate the software of your business, then you <em>must</em> be
able to reason about the economics of what you do—be it preventing downtime,
investing in backups, or even speeding up deploys—in at least comparative
orders of magnitude. If the leaders in your business refuse to partner with
you on that, then you don’t have a ton of great options.<sup id="fnref:7"><a class="footnote-ref" href="#fn:7" rel="footnote">7</a></sup> In this situation,
as my friend and colleague Dan Milstein says—and I don’t repeat this
lightly—“maybe the world is telling you to brush up your LinkedIn profile.”</p>
<p>But in my experience—and I hope that experience is far from unique—the
leaders of a business usually welcome the chance to have an economic discussion
around such operational risks and investments, as it represents a chance to
better understand (and inform) some aspects of the business’s economic equation
that are often opaque to them, and remove some of the fear and anxiety that
this opacity creates.</p>
<h3>In Conclusion …</h3>
<p>In practice I think you’ll find that, far from our original false Platonic
ideal of “every alert indicates an ‘actual problem,’” you’ll end up happily and
profitably tolerating some number of false positives, since for most businesses
they tend to be considerably cheaper than a single false negative.<sup id="fnref:8"><a class="footnote-ref" href="#fn:8" rel="footnote">8</a></sup></p>
<p>But thinking about alerts in their larger economic context also lets us improve
our overall economics by means of other investments too. Besides just changing
the sensitivity of our alerts, we can also improve the overall economic
equation by:</p>
<ul>
<li>
<p>driving down the cost of receiving an alert—for example by making alerts
easier and quicker to digest and disregard if appropriate, so they consume
less human attention and intelligence</p>
</li>
<li>
<p>driving down the cost of the failures we want to alert on, by providing
backup or alternative systems that make failures less expensive (for example,
providing materials that allow cashiers to take those old paper impressions
of credit cards if the electronic system goes down)</p>
</li>
</ul>
<p>By recognizing our alerts as part of a broader economic system, in other words,
we’re setting ourselves up for a world with an actual forward direction and a
lot more options for how to travel in that direction—and, correspondingly, a
lot less shaking our fist at all those email alerts in impotent rage.</p>
<div class="footnote">
<hr />
<ol>
<li id="fn:1">
<p>I was originally too ambitious here, stating that there is a perfect
tuning for a piano. It turns out that, as several early readers have
<a href="http://blogs.scientificamerican.com/roots-of-unity/2014/11/30/the-saddest-thing-i-know-about-the-integers/">pointed out to me</a>,
there’s no way to tune an entire <em>piano</em> perfectly. So, we’ll stick with a
string being tuned to a single note, unrelated to all others. <a class="footnote-backref" href="#fnref:1" rev="footnote" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>Yes, in some respect, creating a “perfect” alerting system would mean
designing an <span class="caps">AI</span> that contained, besides its electronic sensors, an exact,
evolving, realtime copy of all your wisdom, experience, judgment, domain
knowledge, preferences, etc.—and alerted you when it calculated, with 100%
certainty, that you would want to handle a situation—because, paradoxically,
while possessing all your wisdom, experience, judgment, domain knowledge,
preferences, etc.—as well as on-board networking and a faster <span class="caps">CPU</span> than
yours—the <span class="caps">AI</span> was somehow unable to address the situation itself. <a class="footnote-backref" href="#fnref:2" rev="footnote" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>Yes, there’s a real moral question here about making the lives of these
poor interns unbearably miserable, but … <em>interns</em>. <a class="footnote-backref" href="#fnref:3" rev="footnote" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:4">
<p>In my case, I’d probably want to be alerted if it somehow ever
accidentally came <em>up</em>, so that I could immediately shut it down again. <a class="footnote-backref" href="#fnref:4" rev="footnote" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:5">
<p>Or, if you’re an optimist, the maximum overall benefit—but if you’re an
optimist, what are you doing working in operations anyway? <a class="footnote-backref" href="#fnref:5" rev="footnote" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:6">
<p>For more on the moral vs. economic mindset, see Hut 8’s own Dan Milstein
talk about post-mortems, axe murderers and the stupidity of our future selves,
available as <a href="https://www.youtube.com/watch?v=78qzrXIPn5Q">video</a> or
<a href="http://www.slideshare.net/danmil30/how-to-run-a-5-whys-with-humans-not-robots">slides</a>. <a class="footnote-backref" href="#fnref:6" rev="footnote" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:7">
<p>In particular, it’s tempting but inadvisable to take it upon yourself to
translate this moral viewpoint into an economic one. You’d essentially just be
assigning an infinite cost to downtime, which leaves you just as lost as the
false Platonic ideal of “all alerts must indicate an actual problem” did when
it implicitly assigned an infinite cost to wasted human attention. <a class="footnote-backref" href="#fnref:7" rev="footnote" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:8">
<p>There’s a corollary to this, too: often when you join an organization you
will initially experience its alert volume as horrifically out of control, and
grow to understand it as you become better acquainted with both the workings
and the economics of the systems you’re monitoring. <a class="footnote-backref" href="#fnref:8" rev="footnote" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
</ol>
</div>
<h1>Speeding Up Your Engineering Org, Part I: Beyond the Cost Center Mentality</h1>
<p><em>Edmund Jorgensen, 2014-04-17</em></p>
<p>It is a truth universally acknowledged, that engineering orgs—like
greyhounds, sports cars, and wide receivers—slow down as they age.</p>
<p>Odds are good that you have experienced this phenomenon personally at
some point in your engineering career. The slowdown was gradual,
frustrating, and oddly stubborn. It survived: numerous rounds of
hiring; a spate of offsites where inspiring speakers harangued
everyone to “cut through the crap” and just “get shit done”; a
blood-spattered re-org or two; and even a few ground-up rewrites that
utterly failed to deliver on their promised boost in velocity.</p>
<p>If you’re now involved with engineering leadership in some capacity,
you may well have accepted the slowdown as a sad universal truth.
Accordingly, you may have shifted your efforts from the impossible
task of making the org go faster to the thankless but crucial job of
jealously guarding how engineers spend their time—because as it takes
longer and longer to get even simple features out the door, those
engineering hours become increasingly precious.</p>
<p>If all this sounds familiar, I have good news and bad news for you.</p>
<p>The good news: it isn’t actually a law of nature that engineering orgs
have to slow down as they mature and grow. With active, contravening
investment, it’s possible to maintain and even gain speed.</p>
<p><span class="dquo">“</span>But,” you protest, “I’ve <em>made</em> investments, remember? I’ve hired!
I’ve brought in speakers! I’ve re-orged and re-factored and tried out
every flavor of agile there is, and still we go slower and slower!”</p>
<p>Yes, which brings us to the bad news: that slowdown is a far bigger
deal than you might have realized, and way more harmful to the bottom
line of your business than you might imagine. Oh, and that jealous
guarding of engineer hours for features? It’s only making things worse.</p>
<p>In this article I’m going to consider the speed of an engineering org
as an economic question—not a moral question, or a question of
technology choices, or a question of people “hustling” and “powering
through” the obstacles they find in their path. I believe that a good
percentage of engineering and business leaders economically model
their engineering org—consciously or unconsciously—as a “cost
center,” where every engineer hour not spent on features must
translate to (at least) one engineer hour saved, and I believe that
this economic model makes it extremely difficult to identify and
justify the investments that could actually speed that org up. I’ll
propose an alternate economic model of an engineering org—one in
which speed to delivery, rather than number of engineer hours paid, is
the dominant economic factor—and in which considerable, sustained
investment in that speed can reap massive economic returns.</p>
<p>But let’s get a little more concrete with this—let’s look at an
example of the kinds of decisions that face engineering orgs and their
leaders every day, and just how easy it is to slip into the “cost
center” mentality when attempting to juggle them.</p>
<h2>A Tale of Two Engineers</h2>
<p>Say you’re an engineering manager at Company X, and one morning you
arrive at work to find two of your best engineers waiting outside your
office. You haven’t even opened your door before they start in on you.</p>
<p><span class="dquo">“</span>Look,” says Cindy, the first engineer, “I know that the <span class="caps">CEO</span> is
breathing down our neck to finish the new Facebook for Cats
integration, but we’ve got to clear some time to work on automating
database migrations. I’m the only one who knows enough to apply them
to the prod <span class="caps">DB</span>, and I’m getting tired of spending half an hour every
morning rolling out everyone else’s changes. So can we push a feature
or two back and squeeze that in?”</p>
<p><span class="dquo">“</span>Forget the migrations,” says Scott, the second engineer, “we need to
talk about the Frobulator Service. Two years ago we agreed to hack it
up quickly in <span class="caps">PHP</span>, but product promised us—<span class="caps">PROMISED</span>—that we would
have time to go back and clean it up. Yesterday I happened to be back
in that code while I was updating the copyright years in our headers,
and it’s even worse than I remembered. We need to rewrite it in Scala
so it’s more modern, performant, and easier to maintain. Can you tell
product we’re calling in that promise, please, and I’ll get started?”</p>
<p>First off: everything your engineers have said is true. Cindy really
is spending a half hour every morning dealing with database
migrations; the source for the Frobulator Service really does look
like a plate of partially digested capellini; product really did
promise time to clean that mess up; and of course there really is a
long and growing backlog of features for the upcoming Facebook for
Cats integration, each of them (according to the <span class="caps">CEO</span> and product)
absolutely essential and destined to become a customer favorite.</p>
<p>Furthermore, you’ve been around long enough to know that there won’t
be any “calm periods” when there’s time for your engineers to scratch
these other itches—after the Facebook for Cats integration goes out,
you’ll be right on to integrating with Twitter for Dogs, or LinkedIn
for Ferrets. So on this fine morning someone has to make a real and
uncomfortable decision: either tell Cindy and Scott to stop
complaining and get back to feature work, or let product and the <span class="caps">CEO</span>
know that you’re going to spend some engineering hours on something
other than features. And today that someone is you.</p>
<p>Pop quiz, hot shot: what do you do?</p>
<p><span class="caps">WHAT</span> <span class="caps">DO</span> <span class="caps">YOU</span> <span class="caps">DO</span>?</p>
<h2>A Simple, Responsible, and Totally Wrong Approach</h2>
<p>If you’re a mature, business-focused engineering leader, you might
grab some coffee, sit Cindy and Scott down, and tell them something
like this:</p>
<p><span class="dquo">“</span>Cindy, I’m sorry to hear that you’re getting bored doing so much
production <span class="caps">DB</span> work, but realistically it would take you at least 40
hours of work to write, test, and deploy a migration utility, right?
So if you’re spending a half hour a day on migrations, it would be 80
working days before we saw a return on our investment—that’s like 4
months, and that’s just too long for me to sanction—precisely because
you’re such a valuable member of the team, and I can’t spare so much
of your time away from our feature backlog right now. We can touch
base if the migration workload increases too much, <span class="caps">OK</span>? Until then, I
have to ask you to put your head down and be a team player.</p>
<p><span class="dquo">“</span>Scott, you’re absolutely right, product did promise that we could
spend time cleaning up the Frobulator Service, and I’m sure they were
acting in good faith, but none of us could have possibly known at the
time how our product was going to take off—we’ve got customers
practically beating down our door for new features, and they’re not
going to see any difference whether the Frobulator Service is written
in crappy <span class="caps">PHP</span> or transcendent Scala.</p>
<p><span class="dquo">“</span>Both of you are great engineers with bright futures, and if those
futures include engineering management, then part of your job will be
to understand that engineering’s job is to produce effects that are
visible to customers. So if we burn hours on projects that aren’t
customer visible—projects that are by engineers, for engineers—we
need to be able to show directly how those hours will pay for
themselves in <em>saved</em> engineering hours in pretty short order.”</p>
<p>This approach feels rational, responsible, and easy to apply, right?
There’s only one small problem: by slipping into the “cost center”
mentality, where engineering hours must only be spent on features or a
greater savings in engineering hours, you’ve actually just slowed your
engineering org down further, and cost your company real (though
largely invisible) money in the process. How did this happen without
our even noticing, while we thought we were being so responsible?</p>
<h2><span class="dquo">“</span>Engineer Hours” vs. Latency—Where the “Cost Center” Gets it Wrong</h2>
<p>The cost center model of engineering, to which our hypothetical
engineering leader has just retreated, is basically this: an
engineering org is a furnace which burns money, in the form of
compensated engineer hours, and produces features. Therefore if org A
can produce the same feature at half the cost of org B, then org A is
twice as good as org B! And if spending 1 engineer hour on some task
today will save you 100 engineer hours in the next few weeks, then you
have just improved your org’s economics by 99 of those expensive
engineer hours!</p>
<p>The fundamental and deadly flaw in this model is that it does not
account economically for the speed of work through the engineering
org—or what I’ll refer to from here on out as “latency”—the
wall-clock hours, not paid engineer hours, that it takes the
engineering org to turn some concept into reality. In other words, we
can’t simply think of an engineering org as “an engine that produces
thing X at cost Y.” We have to model it as “an engine that produces
thing X at cost Y <em>with latency Z</em>,” and recognize that “latency Z”
itself can and should be translated into some cost / value structure.</p>
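<p>Here is a toy version of the two models in Python. The hourly rate and the
cost-of-delay figure are invented; the only point is that once latency carries
a price, two projects with identical engineer-hour costs can have wildly
different economics.</p>
<pre><code># Two ways to score the same project (all figures invented for illustration).

ENGINEER_HOURLY_COST = 100        # dollars per paid engineer hour
COST_OF_DELAY_PER_WEEK = 25_000   # value lost per week the work isn't shipped

def cost_center_score(engineer_hours):
    # "An engine that produces thing X at cost Y."
    return engineer_hours * ENGINEER_HOURLY_COST

def latency_aware_score(engineer_hours, latency_weeks):
    # "An engine that produces thing X at cost Y *with latency Z*."
    return (engineer_hours * ENGINEER_HOURLY_COST
            + latency_weeks * COST_OF_DELAY_PER_WEEK)

# Same 200 paid hours either way; only the wall-clock time differs.
quick  = latency_aware_score(200, latency_weeks=1)   # $20,000 + $25,000  = $45,000
queued = latency_aware_score(200, latency_weeks=8)   # $20,000 + $200,000 = $220,000
</code></pre>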
<p>This is not to say that engineering leaders who employ this cost
center model don’t care or think about latency. To the contrary, they
often talk about it quite a bit, exhorting their teams to feel a
“sense of urgency” and to exhibit a “just git ‘er done” attitude—but
they treat latency as a moral or personal question—a matter of
character or work ethic—rather than something that is, at its heart,
organizational and economic.</p>
<p>It’s human nature to experience paid engineer hours as <em>expensive</em> and
latency as <em>annoying</em>, because the costs of latency tend to be
invisible—they usually take the form of lost opportunities or
earnings, many of which, once you miss them, you never even know
existed—rather than real, painful checks that you have to cut each
month for payroll.</p>
<p>Consider an analogue: the rent your business pays on an office
building. If you found a building that was only half the rent, you
might well be tempted to move and count that as a huge savings—but
that’s rarely the whole economic story. Is the new building farther
away from where the bulk of your employees live? Does it lack the
public transit options of the more expensive building? How’s the
light? What’s the layout like? All of these factors can affect the
amount of time your employees spend in the office, the amount and
quality of work they get done there, and even the kind of people who
want to work at your company in the first place—and if the cheaper
building leads to a drop in productivity, or to worse hires, then that
“savings” on rent might turn out to be very expensive indeed to your
business’s bottom line, even though—and here’s the horrific
part—that connection will probably never show up on your company’s
balance sheet. It’s not hard to imagine the employee who found the
cheaper building being rewarded with a fat bonus in the same cycle
that a bunch of other employees are dinged for a stagnant product,
increasing bug count, and flagging sales—even if all those problems
were caused, to some extent, by the change in location.</p>
<p>One method to expose some of these invisible economic effects is to
take them to an absurd extreme. For example, if your business is
currently paying a half million in rent a year for a Boston office,
with a workforce who lives in nearby suburbs, it’s clearly not a smart
economic decision to move to a snow-cave in Juneau, Alaska—even if
it’s wired for Ethernet and your annual rent would drop to $1. We’ve
managed to magnify the invisible costs to a size where they can’t be
easily ignored.</p>
<p>So let’s employ the same technique—reduction to some absurd
extreme—in a thought experiment designed to demonstrate how the
latency of your engineering org is almost certainly its dominant
economic factor—much, much larger than the piddling six-figure
salaries you’re paying the engineers it comprises.</p>
<h2>The Thought Experiment</h2>
<p>Role change: you’re no longer an engineering leader overseeing
Facebook for Cats integration. Now you’re the <span class="caps">CEO</span> of a company that
makes its money through big, enterprise contracts. A potential
customer you’ve been after for a while is entertaining bids on a
project, and will consider proposals—which are expected to include a
working proof of concept—in one month.</p>
<p>You aren’t the only company trying to land this contract—there are
lots of smart competitors. And, by the way, you’re not allowed to
deliver early, even if you finish the proof of concept early—all
proposals will be considered on the same day, one month from now.</p>
<p>As <span class="caps">CEO</span> you have two engineering teams available to you.</p>
<p>The first team is a group of good, steady developers, who correctly
estimate that the proof of concept will take exactly one month for
them to build (of course they can’t possibly know this, but that’s a
story for <a href="http://blog.hut8labs.com/coding-fast-and-slow.html">another
article</a> and here
we’ll just pretend they can, because we’re in a thought experiment and
we can do whatever we want). Over this month of development, this
team will cost the business $100,000 in salary and other compensation.</p>
<p>The second team, on the other hand, is a group of freelancers who are
amazingly, inhumanly fast: they can produce the same proof of concept,
at the same level of quality, in just <em>one second</em>. Before you get
too excited thinking about all the money you’re going to save with
this team, however, you should know this: for that one second of work,
these freelancers will be invoicing you dearly—to the tune of $100,000.</p>
<p>Recapping your options, you have:</p>
<ul>
<li>
<p>the normal team, which will take a month to produce the proof of
concept for a total cost of $100,000</p>
</li>
<li>
<p>the insanely fast team, which will take a second to produce the
proof of concept for a total cost of $100,000</p>
</li>
</ul>
<p>The costs of the proof of concept are equivalent with either team, as
is the quality of the product—only the latency differs. Obviously if
you could deliver the proposal as soon as the proof of concept was
done, you’d choose the insanely fast team every time. But that would
be too easy, so in our thought experiment—where you’re not allowed to
deliver the proposal early—does the latency even matter?</p>
<p>There’s only one scenario to consider with the normal team—they have
to start working today, and they’ll finish just in time for the
presentation. Start them even a day late, and they won’t finish.</p>
<p>With the insanely fast team, on the other hand, you have on the order
of 2,592,000 scenarios to consider, as they could start and finish at
any second in the entire month. But are any of these scenarios valuable?</p>
<p>Let’s take a look at a couple of these possibilities.</p>
<h3>The Need for Speed</h3>
<p>One obvious approach with the insanely fast team would be to produce
the proof of concept immediately, in the very first second. Does that
buy you anything? You can’t deliver the proof of concept early, but
now that it exists, there are a couple things you could do with it.</p>
<p>For example, you could show it around and get a reaction—internally,
if your business has some good proxies for your customer’s needs, or
to some of the customer’s people “on the ground” (not the Big Important People
you’ll be pitching at the end of the month, just regular workers).
Then you can take their feedback and do any of the following:</p>
<ul>
<li>
<p>Iterate: Have the insanely fast team produce a second, improved
version of the proof of concept—you’ll have to pay them another
$100,000, but you’ll have good information about whether that’s
worth it or not. You can repeat this process as many times as you
like or can afford, and go into the demo having iterated through N
versions to your competition’s one.</p>
</li>
<li>
<p>Abandon: If the feedback you get is “this is crap, and the only ways
to make it good enough are too difficult or expensive to consider,”
then you can abandon the contract and move on to try to sell
something different to a different customer—or something different
to the same customer! Meanwhile, your competition is sweating away
trying to produce their own proofs of concept—squandering precious
time and attention on a contest you already know isn’t worth winning.</p>
</li>
<li>
<p>Sell to Someone Else: By the rules you can’t deliver your proof of
concept early to the one potential customer, but nothing says you
can’t go out and try to sell it to a different one, or a different
six. By the time proposal day arrives, you’re already a month ahead
of your competition in other markets, and you might even have a nice
story to tell about how your customer’s competition has already
bought your version—and they’d better too, if they don’t want to
fall behind.</p>
</li>
</ul>
<p>So yeah, you could definitely say there’s some value to being able to
finish the proof of concept in a second. That insanely fast team is
starting to look pretty good right about now.</p>
<p>But wait…there’s more!</p>
<h3>The Genius of Procrastination</h3>
<p>What if you went to the other extreme, and waited as long as you could
to produce the proof of concept, until the last possible
second—literally while you’re walking down the hallway to make your
presentation? Does that give you any interesting advantages?</p>
<p>One possibility that leaps to mind: given that your development is so
expensive, you could do some cheaper exploration before you committed
to a proof of concept. For example, you could send some PMs to shadow
the customers, research companies that had tried similar approaches, etc.</p>
<p>By the time you commit to spending $100,000 on the proof of concept,
you can have much better information about what it should do and what
it shouldn’t. Maybe it turns out to be so difficult that you decide
not to build it at all. Or maybe, with the insanely fast team at your
back, an offhand remark as the customer is walking you to the
presentation room prompts a quick phone call and a development cycle,
allowing you to produce a last-second revision that totally changes
the game.</p>
<p>In essence, by waiting until the last second to produce your proof of
concept, you have the chance to be roughly 29 days, 23 hours, 59
minutes and 59 seconds better informed than your competition (the
actual amount of time will depend on the particular month, whether
it’s a leap year, etc., which is left as an exercise for the reader).</p>
<h3>Mix and Match</h3>
<p>But the real power of the insanely fast team comes when you mix and
match all the techniques above.</p>
<p>Step 1: Do cheap research until you have an idea of what to build.</p>
<p>Step 2: Build it instantly and loop back to Step 1, until you decide
another iteration isn’t worth $100,000 (either because the proof of
concept is now good enough, or because you’ve decided to scrap the project).</p>
<p>Step 3: Profit!</p>
<h3>Finish Early, Start Late</h3>
<p>What the insanely fast team gives you, in other words, is the ability
to finish early or start late. In an environment where uncertainty
rules and information is value—like software development—that allows
for tremendously valuable information gain, because what you finish
early tends to generate information, and what you start late tends to
benefit from newly available information<sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup>. The poor old regular
engineering team, on the other hand, has to start <em>early</em> and finish
<em>late</em> just in order to get the work done by the deadline. Their
labor can neither generate extra information nor benefit from it as it
becomes available.</p>
<h3>So Which Team Do You Want, Mr. <span class="caps">CEO</span>?</h3>
<p>By now it should be clear: although the two teams <em>cost</em> the same, and
produce the same quality output, you would be crazy not to choose the
insanely fast team and their drastically reduced latency<sup id="fnref:2"><a class="footnote-ref" href="#fn:2" rel="footnote">2</a></sup>. In
fact, you’d be crazy not to pay a <em>steep premium</em>, well beyond the
normal team’s salaries, to use the insanely fast team, or even to keep
them inactive but on retainer.</p>
<p>This is so important it’s worth calling out: if you’re any kind of
rational, you would pay a tremendous amount of extra money to use the
insanely fast team, which means that <em>a reduction in latency equals
money</em>. Real, actual money—and usually a lot of it. In our thought
experiment, for example, a smart <span class="caps">CEO</span> would gladly pay $1,000,000 to
use the insanely fast team instead of the regular team if it meant a
massively increased chance at a $15,000,000 project. A smart <span class="caps">CEO</span>
would see that not as “spending” money—but as <em>investing</em> it—putting
money out into the world in the reasonable expectation of having that
money return, now increased by some multiple.</p>
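<p>To put rough numbers on that (the probabilities here are invented, as
always): say the regular team gives you a 20% shot at the $15,000,000 contract,
and the insanely fast team, with all its iterating and information gathering,
raises that to 50%.</p>
<pre><code># Expected value of paying extra for the fast team (probabilities invented).

CONTRACT_VALUE = 15_000_000
FAST_TEAM_PREMIUM = 1_000_000

ev_regular = 0.20 * CONTRACT_VALUE                        # $3,000,000 expected
ev_fast    = 0.50 * CONTRACT_VALUE - FAST_TEAM_PREMIUM    # $7,500,000 - $1,000,000

print(ev_fast - ev_regular)   # the premium buys roughly $3.5M of expected value
</code></pre>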
<p>Once you start thinking of engineering dollars as investment rather
than cost, the fallacies of the “cost center” model become glaringly
obvious. The equation behind your org isn’t “engineer hours paid for
features or saved engineering hours”—it’s “money invested in the
expectation of more money.” Often the money invested is in the form
of paid engineer hours, but sometimes it’s new machines, or better
chairs, or office space for a remote contingent, and so on. And
sometimes the “more money” you expect in return comes from features
for which customers will pay, but often (as in our thought experiment)
it comes in the form of valuable information, or—if you’re doing it
right—a reduction in (or prevention of) latency for future work,
which, as we’ve just shown with our thought experiment, is <em>worth
actual money</em>.</p>
<h2>Sitting Down Again with Cindy and Scott</h2>
<p>Let’s rewind back to that coffee with Cindy and Scott, where you as
engineering leader were explaining to them all about how engineer
hours could only be spent on features or efforts that would cut future
engineer hours. With the clearer economic picture in mind, this
argument no longer seems so simple and rational.</p>
<p>Cindy wanted time to work on <span class="caps">DB</span> deploy scripts, since she was the only
one who could reliably get changes out to the production <span class="caps">DB</span> and was
spending a chunk of her mornings doing so. At the time, what we heard
behind her lament was “I’m getting bored doing the job you’re paying
me to do and I need to be gently cat-herded to keep doing it”—but
what we should have heard was “<span class="caps">DANGER</span>, <span class="caps">WILL</span> <span class="caps">ROBINSON</span>—a queue is
forming in your engineering org.”</p>
<p>Cindy has become a bottleneck for changes making their way to
production, and a queue of people trying to make those changes is
forming behind her. Queues are one of the clearest signals of
developing latency. What happens if Cindy is out for a few days on
(gasp) vacation? No changes will go out. What happens if she becomes
overloaded with other matters, and—without telling you—starts
applying <span class="caps">DB</span> migrations only once a week, to “batch things up” and “be
more efficient” with her time? Your latency has just skyrocketed
invisibly—and the fact that this is possible should terrify you as an
engineering leader. Cindy’s complaint is a warning of latency to
come, and you need to nip that in the bud with extreme prejudice. You
should probably allow Cindy to do her migration project—and you
should <em>definitely</em> explain to her <em>why</em> you’re allowing it.</p>
<p>As for Scott, who wanted to rewrite the Frobulator Service from
horrific <span class="caps">PHP</span> to stunning Scala because product had promised the time
to clean it up: the “promise” from product is clearly economically
irrelevant, and big rewrites tend to be a <a href="http://onstartups.com/tabid/3339/bid/97052/How-To-Survive-a-Ground-Up-Rewrite-Without-Losing-Your-Sanity.aspx">terrible
investment</a>,
so you probably shouldn’t say yes to Scott’s exact request—but you
still have some digging to do here to figure out whether this (almost
certainly misguided) desire to rewrite is just a blue-sky engineering
itch, or a signal that the Frobulator Service is creating latency.</p>
<p>First of all, Scott was only in that code to “update copyright
years”—he wasn’t making functional changes, and apparently hadn’t
made any in at least a year. Is this a clue that the Frobulator
Service doesn’t see that much coding activity? Worth digging into,
because if engineers aren’t touching the Frobulator Service because
it’s frobulating<sup id="fnref:3"><a class="footnote-ref" href="#fn:3" rel="footnote">3</a></sup> just fine and there aren’t really any changes to
make, that’s great—the code might read like Cthulhu’s diary, but it’s
not affecting your latency and can be left as is for the moment. If,
on the other hand, there are tons of changes that <em>should</em> go into the
Frobulator Service, but which are finding their way into compensatory
hacks throughout the rest of the codebase instead—because engineers
are terrified to touch the Frobulator Service code—then you’ve got a
brewing latency problem that you need to expose and deal with, because
those hacks are probably already slowing you down, and the situation
is only going to get worse. Almost certainly you still don’t want to
commission a full-on rewrite, but a steady, incremental investment in
testing, monitoring, and refactoring the Frobulator Service might be indicated.</p>
<h3>Takeaways from Cindy and Scott</h3>
<p>One of the deadliest things about latency is that often the slowdown
of even a single piece of your org can introduce it, while making
things faster generally requires steady work on a lot of fronts.
That’s an imbalance that’s not in your favor. Add to this the
certainty that latency is developing in your organization at every
moment—that is the nature of organizations—and that it is often
invisible to you (or any single individual)—and that, as we saw in
our thought experiment, latency is tremendously expensive—and the
response that’s indicated from you, the engineering leader, is a calm
but constant terror.</p>
<p>Your job is to translate that terror into a form of shared vigilance:
listen carefully to your engineers, dig into the problems they bring
you, and ensure that every one of them understands the cost of latency
and is on the lookout for it, making micro speed-ups everywhere they
see the opportunity and surfacing brewing slowdowns.</p>
<p>In other words, make latency something your whole team seeks, hates,
and destroys.</p>
<h2>How to Invest in Latency Reduction</h2>
<p><span class="dquo">“</span>All right,” you say, “I’m convinced—latency is a bigger deal than I
thought before, <em>and</em> something I can improve—in theory. But how do
I do it in practice? I’ve made all those investments that didn’t help
at all—how do I know that if I invest in something, it will actually
improve my latency?”</p>
<p>Some of this also comes down to <em>how much</em> you invest, but we’ll leave
that until Part <span class="caps">II</span>, and here just discuss <em>what</em> you can look to
invest in.</p>
<p>Here are a few places you can start.</p>
<h3>Activities Engineers Bitch About</h3>
<p>Engineers tend to experience latency centers as painful or “busywork.”
For example, do your engineers play “Rock Paper Scissors” to determine
who has to spin up a new server? Does the loser go off cursing his
luck and the world? Do your engineers go to absurd lengths to pack
new services onto old machines, even when a new server would be the
natural solution to the problem? Then take a look at what it requires
to spin up a new server, and whether you can make an investment to
make it less painful—you’ll likely effect a drop in latency.</p>
<h3>Things Only Cindy Can Do</h3>
<p>We saw an example of this with Cindy, who was the only engineer who
knew enough about the prod <span class="caps">DB</span> to get migrations out. If only person X
can do thing Y in your organization, you’ve created a bottleneck, and
bottlenecks lead to latency. Cross-train or create tools to terminate
these bottlenecks with extreme prejudice<sup id="fnref:4"><a class="footnote-ref" href="#fn:4" rel="footnote">4</a></sup>.</p>
<h3>Look for Queues</h3>
<p>Queues are a manifestation of latency, and once you can see them, you
can attack them. Find them where they’re visible—ticketing systems
and so on—and try to make them visible where they’re not, using
techniques like a Kanban board.</p>
<h3>Automated Tests</h3>
<p>Good automated tests reduce latency, because they help you make
changes more quickly and confidently<sup id="fnref:5"><a class="footnote-ref" href="#fn:5" rel="footnote">5</a></sup>.</p>
<h3>Monitoring</h3>
<p>Good monitors reduce latency, because they allow you to release more
frequently, confident in the knowledge that, if something goes wrong,
you’ll find out immediately<sup id="fnref:6"><a class="footnote-ref" href="#fn:6" rel="footnote">6</a></sup>.</p>
<h3>Post Mortems</h3>
<p>A <a href="http://www.slideshare.net/danmil30/how-to-run-a-postmortem-with-humans-not-robots-velocity-2013">good post
mortem</a>
is a great opportunity to let reality point you towards improvements
that not only make your systems safer, but reduce your latency as
well. Do them!</p>
<h3>Decentralization with Safety Nets / Impact Reduction Schemes</h3>
<p>Organizations often insist that high-impact changes to products or
systems pass through multiple steps of centralized review for
correctness, which can become a source of dramatic latency—sometimes
on the order of weeks or months. Usually these controls exist for a
reason, because the mistakes they attempt to prevent are expensive.</p>
<p>You can attack such a situation in two ways: either by making it
harder to break things in the first place (often more difficult and
expensive), or by changing the game so that breaking things isn’t as
big a deal (often cheaper and easier). For example, if engineers can
deploy potentially high-impact changes at will to a small percentage
of traffic, or to a known beta-tolerant population, or to internal
users, then the downside of breaking changes is capped, and that capped
risk is often eminently worth the decreased latency you enjoy.</p>
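<p>As one hedged illustration of the “cheaper and easier” route, here is
roughly what a percentage-based rollout gate looks like in Python. The hashing
scheme and every name in it are hypothetical (real feature-flag systems differ
in the details), but the shape is the same: cap the blast radius, then widen it
as monitoring stays quiet.</p>
<pre><code># Sketch of a percentage-based rollout gate (hypothetical; real systems differ).

import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically place a user into the first `percent` buckets of traffic."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100   # stable bucket, 0 through 99
    return percent > bucket              # e.g. percent=5 admits ~5% of users

# A risky Frobulator change goes to 5% of traffic first; if monitoring stays
# quiet, widen the percentage -- no centralized review board needed per step.
print(in_rollout(user_id="user-42", feature="frobulator-v2", percent=5))
</code></pre>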
<h3>And Many, Many More</h3>
<p>We’ve only scratched the surface here: tools for operators,
intelligent development tools, even crazy things like DSLs for demo or
test data creation can all reduce your latency. Once you start
looking specifically for projects that reduce latency, you will see
opportunities everywhere.</p>
<h2>How Not to Invest in Latency Reduction: <span class="caps">REWRITE</span> <span class="caps">ALL</span> <span class="caps">THE</span> <span class="caps">THINGS</span></h2>
<p>The “rewrite reflex” exhibited by Scott is, unfortunately, a real and
dangerous tendency that almost all engineers have to some extent (I
myself struggle with it daily): the fanatical belief that, if a system
were rewritten to framework X or language Y, development would proceed
much more quickly. Generally this doesn’t pan out, both because of
the astounding (and routinely underestimated) cost of the rewrite and
because the causes of latency introduced in real-world
engineering are rarely addressed more directly by languages and
frameworks than by operational and organizational changes<sup id="fnref:7"><a class="footnote-ref" href="#fn:7" rel="footnote">7</a></sup>. The
latency caused by having to write three ugly lines in one language
rather than one pretty line in another tends to pale in comparison
with delays in deploys, finding and fixing bugs that tests could have
caught, etc. (note: I’m not arguing that there is no difference in
language productivity, and no point to choosing a language for a new
venture carefully, just that for a working system the gain is usually
dwarfed by the rewrite cost and other, lower hanging fruit).</p>
<h2>Incrementalism <span class="caps">FTW</span></h2>
<p>Maybe it’s a “one ring to rule them all” deployment system, or a
templating system to speed up writing your views, or a monitoring
framework to end all monitoring frameworks—whatever it is, if you
think it will reduce latency, and it’s a big project, you should
probably try breaking it into smaller increments, each of which
reduces <em>some</em> latency, and release those independently, as each is ready.</p>
<p>Most engineers will hate to hear this. They’ve already “seen” the
full system in their head, and now want to bang it out in a couple
caffeine-fueled weeks. Typically if you object and request smaller
increments, they will point out that, broken up into discrete
releases, the job will require more hours overall, and therefore
represent an inefficiency. They’re generally right, of course, that
you will spend more engineer hours by delivering in
increments—they’re just wrong about the economic consequences.</p>
<p>You should insist on smaller, incremental latency improvements, not
just because of all the normal, eminently true reasons that big
increments are bad (everything that makes waterfall a bad idea applies
here too), but because <em>latency reduction improves the same channels
by which you deliver future latency reduction</em>. That is, since
latency reduction efforts generally come in the form of new software
or processes, and what they’re reducing is the latency of delivering
new software or processes, finished latency reduction efforts tend to
speed up future latency reduction efforts.</p>
<p>Latency reduction is therefore a form of <em>compound interest</em>, which
Einstein himself called “the most powerful force in the universe<sup id="fnref:8"><a class="footnote-ref" href="#fn:8" rel="footnote">8</a></sup>.”
Latency reduction works just like your retirement account—steady,
incremental investments generate more value than infrequent, bigger
investments, because you earn interest on your interest—so you want
the money in the account as soon as it becomes available. When you
break a big, massively valuable latency reducing project into numerous
smaller (but still latency reducing) projects, some of which can be
delivered earlier, the one-time premium you pay in extra engineering
hours is nearly always a rounding error compared to the benefit of
compounded latency reduction you enjoy <em>forever</em>.</p>
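<p>A toy model of that compounding, in Python with invented numbers: assume
each shipped improvement shaves 10% off every future release cycle, and compare
shipping five improvements one at a time against saving them all up for one big
release.</p>
<pre><code># Toy model of compounding latency reduction (numbers invented for illustration).
# Each shipped improvement cuts the length of every *future* release cycle by 10%.

BASE_CYCLE_WEEKS = 4.0
IMPROVEMENTS = 5

def incremental_schedule():
    """Ship one improvement per cycle; each one speeds up the cycles after it."""
    elapsed, cycle = 0.0, BASE_CYCLE_WEEKS
    for _ in range(IMPROVEMENTS):
        elapsed += cycle   # build and release one improvement
        cycle *= 0.9       # every future cycle now runs 10% faster
    return elapsed

def big_bang_schedule():
    """Build all five improvements first, then ship them in one release."""
    return IMPROVEMENTS * BASE_CYCLE_WEEKS   # nothing compounds until the very end

print(incremental_schedule())   # about 16.4 weeks to land all five improvements
print(big_bang_schedule())      # 20.0 weeks, with no speedups enjoyed along the way
</code></pre>
<p>The incremental path gets all five improvements live weeks sooner, and it was
already enjoying the speedups along the way instead of waiting for one payoff at
the very end.</p>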
<h2>So Much for the Easy Part</h2>
<p>All right, we’ve skirted the hard part long enough. At this point we
understand some of the costs of latency. We’ve sounded out whether
projects like those that Cindy and Scott want to undertake will
actually reduce latency, talked about some other projects that are
good candidates for reducing latency, and understand how to generate
the maximum overall value by attacking them in valuable increments.
But there’s still the small matter of that endless stream of
features—how do we compare the relative value of a feature and a
project to reduce latency for the delivery of future features, and
prioritize appropriately? How do we know how much time to spend on
latency reduction vs. features? And—more difficult still—how do we
convince the <span class="caps">CEO</span> and other Important People in the business, who are
the ones asking for those features and signing our checks, that they
should allow us to carve out that time to work on latency reduction?</p>
<p>Tune in as we tackle that in the upcoming Part <span class="caps">II</span>: Selling the Big Boss.</p>
<div class="footnote">
<hr />
<ol>
<li id="fn:1">
<p>For more on this, see Reinertsen’s <a href="http://www.amazon.com/gp/product/B007TKU0O0/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=B007TKU0O0&linkCode=am2&tag=hu8labl08-20"><em>Principles of Product
Development
Flow</em></a>—yup,
it wouldn’t be a Hut 8 Labs Blog without a mention of that
classic—but seriously, all joking aside, just go read it now. <a class="footnote-backref" href="#fnref:1" rev="footnote" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>Oh, and just in case you’re thinking “sure, if I could reduce my
latency to a second, that would be one thing, but that’s crazy and
extreme and impossible”: an engineering organization that managed to
go from shipping simple improvements and bugfixes only with quarterly
releases to being able to ship them in an hour (a realistic
improvement that many organizations have already accomplished) would
be seeing about a <em>2000X</em> reduction in latency for those improvements
and bugfixes, and that’s not even the upper bound—better testing,
monitoring, and other investment can also drastically speed up what an
engineer can reliably get done in that hour. <a class="footnote-backref" href="#fnref:2" rev="footnote" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>What? It’s a perfectly cromulent word. <a class="footnote-backref" href="#fnref:3" rev="footnote" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:4">
<p>Note: sometimes Cindy defines her value as “being the only
person who can do X.” Helping her redefine her value more broadly is
a key part of the leadership function, but a topic for a different
article. <a class="footnote-backref" href="#fnref:4" rev="footnote" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:5">
<p>Bad tests can actually increase latency, because they
over-specify implementation without adding any safety—but that’s a
topic for a different article. <a class="footnote-backref" href="#fnref:5" rev="footnote" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:6">
<p>Bad monitors can actually increase latency, because they
overwhelm and desensitize the people looking at them with irrelevant
information or over-zealous alerting. But that’s—you guessed it—a
topic for a different article. <a class="footnote-backref" href="#fnref:6" rev="footnote" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:7">
<p>And because frameworks are a form of moderate evil that is on
occasion the lesser of two evils—but we’ll leave that for another
time. <a class="footnote-backref" href="#fnref:7" rev="footnote" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:8">
<p>Or fine, <a href="http://www.snopes.com/quotes/einstein/interest.asp">maybe he
didn’t</a>. But
look, whether Einstein said it or not, compound interest is pretty
damn powerful, <span class="caps">OK</span>? <a class="footnote-backref" href="#fnref:8" rev="footnote" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
</ol>
</div>
<h1>No Deadlines For You! Software Dev Without Estimates, Specs or Other Lies</h1>
<p><em>Dan Milstein, 2013-09-23</em></p>
<p>In <a href="http://blog.hut8labs.com/coding-fast-and-slow.html">Coding, Fast and Slow</a>, I talked about one of the deepest challenges involved in writing software: the near-total inability of developers to predict how long a project will take.</p>
<p>Fortunately, as that post mentioned, I believe there is a way to work, where the software you write ends up being valuable, and the business people you work with end up being happy. And, critically, this way of working does <em>not</em> involve committing to estimates of how long work will take (which is good, because, personally, I suck beyond all belief at such estimates… even for work which I initially believe will take no longer than a single day).</p>
<p>In a lot of ways, this is The Most Important Thing I’ve learned in my (let’s just say many) years of being paid to write software for people.</p>
<p>The core idea is: put uncertainty and risk at the center of a conversation between the developers and the rest of the business (instead of everyone pretending such nasty things don’t exist). Doing so allows the entire business to tackle those genuine challenges <em>together</em>.</p>
<p>To show what such a conversation might look like, I’m going to develop this approach in detail, in the context of a story.</p>
<h3>Welcome To &lt;Company X&gt;, Here’s Your Spec</h3>
<p>Let’s say you’ve started at a new job, leading a small team of engineers. On your first day, an Important Person comes by your desk. After some welcome-to-the-business chit chat, he/she hands you a spec. You look it over—it describes a new report to add to the company’s product. Of course, like all specs, it’s pretty vague, and, worse, it uses some jargon you’ve heard around the office, but haven’t quite figured out yet.</p>
<p>You look up from the spec to discover that the Important Person is staring at you expectantly: “So, &lt;Your Name&gt;, do you think you and your team can get that done in 3 months?”</p>
<p>What do you do?</p>
<p>Here are some possible approaches (all of which I’ve tried… and none of which has ever worked out well):</p>
<ul>
<li>Immediately try to flesh out the spec in more detail</li>
</ul>
<p><span class="dquo">“</span>How are we summing up this number? Is this piece of data required? What does <jargon word> mean, here, exactly?”</p>
<ul>
<li>Stall, and take the spec to your new team</li>
</ul>
<p><span class="dquo">“</span>Hmm. Hmm. Hmmmmmmmm. Do you think, um, Bob (that’s his name, right?) has the best handle on these kinds of things?”</p>
<ul>
<li>Give the spec a quick skim, and then listen to the seductive voice of <a href="http://blog.hut8labs.com/coding-fast-and-slow.html">System I</a></li>
</ul>
<p><span class="dquo">“</span>Sure, yeah, 3 months sounds reasonable” (<span class="caps">OMG</span>, I wish this wasn’t something I’ve done <span class="caps">SO</span> <span class="caps">MANY</span> <span class="caps">TIMES</span>).</p>
<ul>
<li>Push back aggressively</li>
</ul>
<p><span class="dquo">“</span>I read this incredibly convincing blog post <sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup> about how it’s impossible to commit to deadlines for software projects, sorry, I just can’t do that.”</p>
<p>Here’s the thing about all of the above: they’re basically guaranteed to fail. By which I mean, specifically: no one is going to be any kind of happy about the software that gets written from the above starting points.</p>
<h3><span class="caps">OKAY</span>, <span class="caps">ENOUGH</span> <span class="caps">STALLING</span>, <span class="caps">SO</span> <span class="caps">WHAT</span> <span class="caps">DO</span> I <span class="caps">DO</span>, <span class="caps">DAN</span>?</h3>
<p>I’m going to suggest something that may sound a bit odd: while this Important Person is standing at your desk, use this opportunity to, politely but ruthlessly, interrogate them about the <em>business you have just joined</em>.</p>
<p>What is the business model? What are the biggest challenges facing the business as a whole? What risks does leadership worry most about? What are they hoping happens, if everything goes just right? Who is the current customer for the product? What motivates that customer to buy? Are they happy after they buy? If not, why not? What other customers would the Important Person like to go after, if he/she could?</p>
<p>One way to understand this is: there is some central problem or challenge which the business is facing. Your first job is to figure out what that problem is, and, just as importantly, what words the Important Person uses when they think about that problem.</p>
<p>A very important thing: it usually takes a considerable bit of effort to get beyond the proposed <em>solution</em> (e.g. the report), to the actual underlying <em>problem</em>. <a href="http://usersknow.blogspot.com/2009/11/6-reasons-users-hate-your-new-feature.html">Laura Klein</a> summarizes this marvelously as “[People] will tell you that they want a toaster in their car, when what they really mean is that they don’t have time to make breakfast in the morning.” She’s talking about user research, but I find the same perspective is incredibly useful when talking to, e.g. <span class="caps">CEO</span>’s.</p>
<p>Returning to our example, let’s say that, as you talk to the Important Person, you come to understand that your new business, which sells software via a monthly subscription plan, has a serious problem — too many customers are canceling every month. What’s more, you’ve joined a startup, and, although it has a solid chunk of cash in the bank, the leaders very much want to ramp up how much they spend on sales and marketing. Of course, doing that will burn through their cash, and thus require raising more capital sooner than later. And getting <span class="caps">VC</span>’s to invest more money with that high cancel rate is going to be very difficult, if not impossible.</p>
<p>You’ve been hired, at some level, to help solve that problem. <em>Even if the people who have hired you don’t think about it that way.</em></p>
<p>Now that you understand that central problem, take one more step: figure out <em>exactly</em> how this proposed development effort is supposed to solve that problem.</p>
<p>How and why does the business believe that this report is going to lower the cancel rate? What makes the Important Person think it’s going to work? Are there any ways they’re worried that it might <em>not</em> work? Are there any key questions they’d like answered sooner than later?</p>
<h3>Oh, How People Love To Hear Their Own Words</h3>
<p>A key tip for these conversations: at each step, it’s really helpful to echo back what the person just said to you. E.g. “Okay, let me make sure I understand — you’re saying this new feature you want is critical because it’s going to help us upsell existing customers, but we’re not so much expecting it to help us get new customers? Do I have that right?”</p>
<p>At each of those little checkpoints, if you’re right, the Important Person will feel this rare, pleasant sense that someone in development actually seems to understand how the goddamn business works. If you’re wrong, you’ve just narrowly avoided basing your dev efforts on an imperfect understanding of the business (which is a path straight to misery).</p>
<p>Note that template: a) “I’m going to echo that back, make sure I understand”, b) echo it back, c) “Do I have that right?”. I say <em>exactly those words</em>, basically every time I talk to someone about a new project — so much so that my partner Edmund calls it “pulling a Milstein”. You don’t have to be clever with that template, is what I’m saying — put all your cleverness to work really listening and trying to understand the problems facing the business.</p>
<p>This whole process takes practice, but is <span class="caps">INSANELY</span> <span class="caps">VALUABLE</span>. You can (and should!) start by asking everyone you work with about how they understand the overall business you’re currently in, and what challenges it’s facing. Do the same with random people you meet. Be curious, don’t stop being curious, and don’t be in any way afraid to say “I don’t understand that, can you explain it to me?”</p>
<h3>Now, The Knockout Punch</h3>
<p>Once you both understand some central problem facing the overall business, <em>and</em> how your proposed bit of development effort fits into a possible solution, you wrap all that up and deliver it back, repeating as many of the words they used as possible, e.g.:</p>
<p><span class="dquo">“</span>Okay, if I understand it properly, we’re adding this report, because we think we can use it as a key feature in a new, higher pricing tier. This more expensive tier is not really for acquiring new customers, it’s more for upselling existing ones, so we can extract more revenue from our most engaged customers. If we can do that, it’ll have a potentially big impact on our revenue churn <sup id="fnref:2"><a class="footnote-ref" href="#fn:2" rel="footnote">2</a></sup>, which is the most important number in our business right now. And, we really need to see that move in the right direction, in the next 6-9 months, so we’ve got a good story to tell investors when we go out to raise our next round of financing.</p>
<p>Do I have that mostly right?”</p>
<p>With even a modest bit of luck, at this point, the person who handed you the spec will have a cautiously hopeful expression on their face, and they’ll nod as they say, “Yeah, that’s… um… that’s pretty much exactly right.” <sup id="fnref:3"><a class="footnote-ref" href="#fn:3" rel="footnote">3</a></sup></p>
<p>You then say, “Great, let me look into the tech we need for that report, and I’ll get back to you with more info.”</p>
<p>Note: you haven’t promised any date by which the report will be finished. Instead, you’ve demonstrated that you are going to work with this Important Person to solve the actual problems the business is facing. And those problems involve very real, very hard, <em>external</em> deadlines (e.g. running out of money by a certain date).</p>
<p>One way to see it: you’ve taken a key first step in earning their trust.</p>
<p>Now, notice, too: instead of you having made some promises to deliver on a spec, which promises are now hanging over you and making you nervous, you’ve directly engaged in a real problem for the business. And you have plenty of room to be creative about how you solve that problem. Yes, it’s a hard problem, but that’s why you got into this business in the first place — for the joy of solving hard problems that actually matter to someone.</p>
<h3>Man, If Only We Knew What To Do</h3>
<p>The next day, you meet with the team, and discover that the new report is mostly straightforward, except for one thing: it requires a periodic import of data from a new social network with a complex <span class="caps">API</span>. The team has just started working with that <span class="caps">API</span>, and they tell you that they just don’t have enough information to make the call on the 3 month deadline, <em>either way</em> — it’s certainly possible they could hit it, but there’s every chance things could blow up.</p>
<p>What do you do?</p>
<p>You could tell the Important Person that you don’t know. That is, at least, honest. But it doesn’t really help them (aka help the business solve its problem — move revenue churn in the right direction, before the next round of funding).</p>
<p>What <em>would</em> help you solve the business’s problem?</p>
<p>One key is that the business as a whole is trying to make a <em>decision</em> — about how to spend your time.</p>
<p>If you knew for certain that you could get the report built in 3 months (and that existing customers would happily pay more for it), the right decision for the business would be: build it.</p>
<p>Conversely, if you knew for certain that you <em>couldn’t</em> hit the deadline (or that existing customers <em>wouldn’t</em> pay more), the right decision would be: stop immediately, start some other plan to reduce revenue churn.</p>
<p>Given that you don’t know which of those two alternatives you’re living in, what you (and the business) need is: <em>more information</em>.</p>
<p>If you could obtain that information, you could make the right decision, which would make your business a great deal more money than the wrong one.</p>
<p>In the presence of uncertainty, acquiring information is often the best way to generate value. And, yes, this is the point in this blog post where I tell you to go read Donald Reinertsen’s <a href="http://www.amazon.com/gp/product/1935401009/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1935401009&linkCode=am2&tag=hu8labl08-20">Principles of Product Development Flow</a>.</p>
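<p>To put a crude number on the idea that information generates value, here’s a minimal back-of-the-envelope sketch in Python. Every figure in it (the probabilities, the dollar amounts, the cost of the spike) is hypothetical, and it assumes, generously, that a two-week spike would settle which world you’re in. It’s an illustration of why paying a little for information before committing tends to win, not a recipe:</p>
<pre><code>
# Back-of-the-envelope expected value of information.
# All numbers here are hypothetical, purely for illustration.

p_feasible = 0.6              # our gut feel that the report can ship in 3 months
value_if_it_works = 500_000   # payoff if we build it and customers upgrade
loss_if_it_fails = 200_000    # wasted dev time plus delay on a better plan
value_of_plan_b = 150_000     # value of starting the backup churn plan now

# Option 1: commit blindly and hope.
ev_build_blind = (p_feasible * value_if_it_works
                  - (1 - p_feasible) * loss_if_it_fails)

# Option 2: spend two devs for two weeks (say $20k) learning which world
# we're in, then build only if it looks feasible, else switch to plan B.
# (Assumes, optimistically, that the spike settles the question.)
cost_of_spike = 20_000
ev_spike_first = (p_feasible * value_if_it_works
                  + (1 - p_feasible) * value_of_plan_b
                  - cost_of_spike)

print(f"commit blindly:     {ev_build_blind:,.0f}")
print(f"spike, then decide: {ev_spike_first:,.0f}")
</code></pre>
<p>With these made-up numbers, buying the information first comes out comfortably ahead, and that holds across a fairly wide range of assumptions, which is the general point both Reinertsen and Douglas Hubbard (see the footnotes below) keep hammering on.</p>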
<p>So, what you do is, you pick what you work on next, to gather as much information as possible, about the things you are most uncertain about. If you’re clever (and you are! That’s why you got into development in the first place), you can find a way to gather information as part of the process of <em>actually building the thing</em>. Meaning, you usually don’t need to conduct some separate, research-y phase—instead, you can gather the information you need by doing your work in a careful sequence.<sup id="fnref:4"><a class="footnote-ref" href="#fn:4" rel="footnote">4</a></sup></p>
<p>And, crucially, you have to be completely up front about all this with your counterpart on the business side.</p>
<h3>The Meeting Where You Earn Your Salary</h3>
<p>In our story: you schedule a meeting with the Important Person, and, in advance of that meeting, you bury yourself in the technical details of what the team has found about that new <span class="caps">API</span> so far. You also do some chatting with the sales and marketing folks — you want to understand the target customer, and which of their problems the business is hoping to solve with this new report.</p>
<p>Then, at that meeting with the Important Person, you say something like:</p>
<p><span class="dquo">“</span>Right now, we’re feeling optimistic that we’ll have that report ready in some form within 3 months — but our biggest risk is working with that new social network’s <span class="caps">API</span>. From the initial investigation we’ve done, it looks like, at the very least, we’d definitely be able to show them <minimal data foo>, which, from what I understand of our engaged customers, might be enough to trigger upsells, but sales and marketing aren’t certain.</p>
<p>We’d like to propose the following: we take two of our best devs, and they spend 2 weeks trying to build a full integration with the social network, purely on its own, so we’ll have a better understanding of just how much data we can pull in. While they’re doing that, we’d also like to have our front-end devs building mockups of a report with <em>just the minimal data</em>, so that you’ll have something to do some user research with, and possibly even use for sales demos if things go well.</p>
<p>Does that plan sound like a good way to go?”</p>
<p>This little speech is, basically, the most important thing you’re going to do at your job all month. So I want to unpack it in some detail.</p>
<p>First off, note that, because you’re thinking in terms of risks and information, you propose sequencing the work to get as much information, as quickly as possible (e.g. information <em>both</em> about how much data you can get from the social network, and <em>also</em> about whether or not customers will be satisfied by the minimal data set). When you’re facing a chain of risks, you’re going to generate the most valuable information by attacking the biggest risks first.</p>
<p>Second, it should be clear that you can only pull this off if you deeply understand the overall business problem — that’s what lets you propose the minimal data thing. Generally, those opportunities emerge bottom up, as, e.g. a dev figures out what data is / is not easy to obtain — but the value is not always clear to those devs (the very best way to run this game plan is to make it so that all the devs really deeply understand the overall business problem).</p>
<p>Third, it’s important that you’re offering the Important Person an actual, meaningful choice. You’ve clearly stated your current knowledge of what is <em>possible</em> (e.g. the technical risks and opportunities), plus your current understanding of what is <em>valuable</em> (to the business). You’ve framed that in a way which lets the Important Person now make a choice about what to do next (which will often result in you learning that your understanding of what is valuable to the rest of the business is no longer accurate — that’s a very, very good thing).</p>
<p>Fourth, note that, when you work this way, there are <em>good</em> risks (we call them “opportunities”) as well as bad ones. You discovered something unexpected — you could quickly and cheaply build a simpler report that might work. One of the most fun things about this approach is finding those wins — it’s tremendously exciting.</p>
<p>Finally, notice how you’re explicitly operating with a full knowledge of the hard, external deadline facing the business. You’re <em>not</em> talking about deadlines for implementing a spec, but you <em>are</em> talking about deadlines for the overall business… which are the only ones that actually matter.</p>
<h3>What Happens Next</h3>
<p>Now, the Important Person might say any of the following:</p>
<p>a) “Great, go for it” (you say “Thanks, sir/madam, we’ll see you in two weeks with more information + options”)</p>
<p>b) “That minimal data would be a fantastic report — I’m certain we can get upsells with that” (you say “Awesome, we won’t put the devs on an exploratory full backend integration, we’ll sprint ahead on getting the minimal data ready asap, we should have an early, crude prototype to look at within a week or two.”)</p>
<p>c) “That minimal data is absolutely not enough” (in which case you say, “Okay, would you like to see other options for restricted data?”, or, “Hmm, I’d love to better understand what questions we’re trying to answer with this report, since I don’t feel like I quite get it yet”, or even, “Well, in that case, maybe we should explore some other options for reducing revenue churn in parallel, because there’s a real chance we won’t be able to make this report work in time.”)</p>
<p>Note that that last situation is not, in any way, a failure. You’ve learned something very important — the business folks believe that the current plan centrally depends on something which has a great deal of risk associated with it. Armed with this information, you can <em>both</em> try to drive down that risk, as aggressively as possible, and <em>also</em> start working with them to prepare other plans, so you’re ready if things blow up.</p>
<p>Overall, what this approach means is that you will be constantly adjusting your understanding of what is the most valuable way to spend your time, and constantly keeping the business folks in the loop + offering them meaningful choices. This is not, in any way, “we don’t need no stinking estimates, we’re code cowboys, just trust in the full force of our awesomeness.” It’s turning the entire process of software dev into an ongoing conversation with the rest of the business, where information is quickly getting into the hands of people who can make decisions about it. And, where “information” means both things that you know/have learned, and also an understanding of what you <em>don’t</em> yet know — i.e. important risks.</p>
<p>As I said in my previous post, writing software means learning something in such precise detail that you can tell a computer how to do it. More broadly, if creating new software is important to a business, then the business as a whole must engage in a learning process — not just the developers.</p>
<h3>Hmmm, This Doesn’t Really Feel Like a “Process”</h3>
<p>Inevitably, my solution to this feels somewhat personal — but that is not an accident. Fundamentally, we’re talking about two groups of people having to build up trust in each other. Trust about things that they will not, in general, be able to verify.</p>
<p>Specifically, developers have to trust that what they are being told about the rest of the business is true — that customers want what they’re building, that the long hours are actually needed (and aren’t just some middle manager showing that he knows how to crack the whip — something I’ve seen happen far, far too often).</p>
<p>And the rest of the business has to trust that the developers, when they go off into their weird, opaque world, are honestly reporting back on what is possible, how much effort is involved, what they’ve achieved, etc.</p>
<p>Any means of building up that trust will always have a personal flavor — it exists between human beings who have learned something of each other. It’s not a thing you can mandate or fix with an imposed process.</p>
<p>Absolutely anyone who has done any real work on either side of that divide can immediately call up instances of that trust being betrayed — of discovering that all your work for the last half year was meaningless (and that someone knew that and didn’t tell you); or that the repeated promises that some system was ready to launch collapsed in a fiery wreck as soon as the first user tried to log in.</p>
<h3>Sometimes, The World Is Telling You To Polish Up Your LinkedIn Profile</h3>
<p>A severe warning: this whole plan can fail, badly, if the Important Person is, well, not very important. Specifically, say you have a strongly hierarchical structure, where some middle manager is the only person you’re allowed to talk to. It can be the case that such a person <em>perceives their job</em> as taking proposed solutions from upper management and getting a bunch of developers to implement them. Such a person can be very threatened by the idea that you want to get beyond the proposed solution, to the underlying problem. They can hear that as “I’m going to have to go back to my boss and tell them that a bunch of developers think their idea isn’t very good.”</p>
<p>When a boss says “Jump!”, this kind of person prides themselves on saying “How high?!” Since I’m instead proposing “Why are we even jumping, here?”, you can see how there can be a problem.</p>
<p>Furthermore, such a person will often put a really strong value on <em>preventing the flow of information up</em> (from developers to people who can actually make decisions). They may think of that as “Not troubling the boss with the details”. But, as I’ve described above, such a block on the flow of information is absolutely deadly to software development.</p>
<p>So, what’s your best option if you find yourself in such an unfortunate situation?</p>
<p>As I see it, there are two paths ahead.</p>
<p>Option 1: Try to get the middle manager to see this new way of working as something that will make them look good.</p>
<p>I rarely see this work, but it can be worth a shot. My partner Edmund reports some success trying this by way of: a) find an ‘internal’ thing, where the middle manager is, like, a user of the thing, b) propose to them that you work on that internal thing this new way, and then, c) if that produces a thing they find really useful, help them see that their boss can feel the way they feel now.</p>
<p>But, as above, that’s something of a long shot. Which leads us to…</p>
<p>Option 2: Quit.</p>
<p>I don’t say this casually. If you’re stuck in the situation I’m describing, it’s overwhelmingly likely that your project is going to end in some form of unpleasant failure. And, what’s more, it’s extremely rare that you can get higher-level leadership to see any problems with such a middle manager — in general, such a person has that job precisely because they fit into higher-level leadership’s mental model of a manager. In which case, the entire org is going to be set up in a way which makes it hard or even impossible to write useful/valuable software.</p>
<p>If you’ve been stuck in such a situation for a while, I’ll just say — you may have forgotten how great it feels to solve meaningful problems for people. Go find a place where you can do that.</p>
<h3>Your Mission, Should You Choose To Accept It</h3>
<p>In summary, I’m saying: 1) become a student of the overall business you are in, 2) sequence your work to extract as much information from reality as early as possible, and 3) make risks and opportunities the centerpiece of an ongoing conversation with the rest of the business.</p>
<p>There are no certainties in this world, but that approach will let you tackle the uncertainties together.</p>
<p>And that, I can tell you from fortunate experience, is a profoundly satisfying way to work.</p>
<hr />
<h4>But, Wait I Want To Learn More</h4>
<p>I’ve <del>stolen</del><em>synthesized</em> just about all of the above ideas from a bunch of very smart people. You should totally go read their books and blog posts. <sup id="fnref:5"><a class="footnote-ref" href="#fn:5" rel="footnote">5</a></sup> <sup id="fnref:6"><a class="footnote-ref" href="#fn:6" rel="footnote">6</a></sup> <sup id="fnref:7"><a class="footnote-ref" href="#fn:7" rel="footnote">7</a></sup> <sup id="fnref:8"><a class="footnote-ref" href="#fn:8" rel="footnote">8</a></sup> <sup id="fnref:9"><a class="footnote-ref" href="#fn:9" rel="footnote">9</a></sup> <sup id="fnref:10"><a class="footnote-ref" href="#fn:10" rel="footnote">10</a></sup></p>
<p>And, this December, I’ll be speaking at the <a href="http://leanstartup.co/">Lean Startup Conference</a>, in San Francisco, on “Risk, Information, Time <span class="amp">&</span> Money”. Major bonus: almost everyone I list in the above footnotes will be speaking there, too. Last year’s conf was great — I gave a talk on <a href="http://www.slideshare.net/danmil30/how-to-run-a-5-whys-with-humans-not-robots">How to Run a 5 Whys (With Humans, Not Robots)</a>, and learned a ton from other speakers.</p>
<div class="footnote">
<hr />
<ol>
<li id="fn:1">
<p>And I hear the author is very handsome, too. <a class="footnote-backref" href="#fnref:1" rev="footnote" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p><span class="dquo">“</span>Revenue churn”: it turns out that, sometimes, the best way to reduce the cancel rate (aka “churn”), in a subscription business is <em>not</em> to stop every last unhappy customer from canceling, but rather to increase the amount of money you’re getting from the people who use your service the most — in other words, solve for the churn rate in terms of dollars/month, instead of customers/month <a class="footnote-backref" href="#fnref:2" rev="footnote" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>If you can hone this to the point that your summary of the business is so good that it actually helps the Important Person clarify their own thinking… you will win, at whatever game it is you wish to play in life. <a class="footnote-backref" href="#fnref:3" rev="footnote" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:4">
<p>At Hut 8 Labs, we are, well, utterly obsessed with the sequence in which we do work. It’s a rare couple of hours that doesn’t see a discussion about what’s most valuable to do next, based on what we just learned. <a class="footnote-backref" href="#fnref:4" rev="footnote" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:5">
<p>Donald Reinertsen, <a href="http://www.amazon.com/gp/product/1935401009/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1935401009&linkCode=am2&tag=hu8labl08-20">Principles of Product Development Flow</a>. If you a) love math and b) have spent ten years trying to figure out why your software projects keep getting cancelled, drop absolutely everything you’re doing and read Reinertsen <em>right now</em>. Otherwise, first read <a href="http://www.amazon.com/gp/product/0884271951/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=0884271951&linkCode=am2&tag=hu8labl08-20">The Goal</a>, by Eliyahu M. Goldratt, and <em>then</em> read Reinertsen. <a class="footnote-backref" href="#fnref:5" rev="footnote" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:6">
<p>Eric Ries, <a href="http://www.amazon.com/gp/product/0307887898/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=0307887898&linkCode=am2&tag=hu8labl08-20">The Lean Startup</a>. He works out a very powerful set of ideas for generating value in conditions of extreme uncertainty. As you can tell from the name of his book, his focus is on startups, but I find his ideas broadly useful for software development in general. <a class="footnote-backref" href="#fnref:6" rev="footnote" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:7">
<p>Kent Beck, <a href="https://www.facebook.com/notes/facebook-engineering/software-design-glossary/10150309412413920">Software Design Glossary</a>, and, <a href="http://www.amazon.com/gp/product/0321278658/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=0321278658&linkCode=am2&tag=hu8labl08-20">Extreme Programming Explained</a>. Few people have written as thoughtfully and intelligently about software development as Mr. Kent Beck. His work at the intersection of complexity, human nature, and economic value has had a huge influence on me. <a class="footnote-backref" href="#fnref:7" rev="footnote" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:8">
<p>Douglas W. Hubbard, <a href="http://www.amazon.com/gp/product/1452654204/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1452654204&linkCode=am2&tag=hu8labl08-20">How to Measure Anything</a>. Some really fascinating ideas on how to turn a vague statement like “We could make a better decision if we had more information” into something with concrete dollars attached to it. If you love math… you’ll wish he had written a shorter book with a lot more math in it, but such is life. <a class="footnote-backref" href="#fnref:8" rev="footnote" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:9">
<p>Laura Klein, <a href="http://usersknow.blogspot.com/">Users Know</a>, and <a href="http://www.amazon.com/gp/product/1449334911/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1449334911&linkCode=am2&tag=hu8labl08-20"><span class="caps">UX</span> for Lean Startups</a>. Truly great stuff on how to talk to human beings. <a class="footnote-backref" href="#fnref:9" rev="footnote" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:10">
<p><span class="dquo">“</span>So, wait, your blog post got so long that you included an <em>appendix</em>, disguised as a series of footnotes?” In my defense, I can only quote a beloved one-time coworker: “No, so is your face”. <a class="footnote-backref" href="#fnref:10" rev="footnote" title="Jump back to footnote 10 in the text">↩</a></p>
</li>
</ol>
</div>Introducing Diffscuss: Plain Text Code Reviews, Right in Your Editor2013-09-05T00:00:00-04:00Edmund Jorgensentag:blog.hut8labs.com,2013-09-05:introducing-diffscuss.html<p>Earlier this year we at Hut 8 Labs were working onsite with a client
who didn’t have their own code review system. Since a life without
code reviews just isn’t worth living for us, we found ourselves
emailing diffs back and forth to each other, with messages like “about
halfway through the diff you do X, maybe you should do Y?” Eventually
we even started inserting comments right in the attached diffs
themselves—comments like “<span class="caps">EWJ</span> <span class="caps">RENAME</span> <span class="caps">THIS</span> <span class="caps">VARIABLE</span> <span class="caps">OR</span> <span class="caps">DIE</span> <span class="caps">IN</span> A
<span class="caps">FIRE</span>!!!”—which worked surprisingly well, except that:</p>
<ul>
<li>
<p>it was easy to miss comments and replies in large diffs, even when
the comments were all caps and followed by multiple exclamation points</p>
</li>
<li>
<p>it was a pain to co-ordinate reviews and replies from even two other people</p>
</li>
<li>
<p>it was a pain to track down the actual source lines a comment
referred to, which meant an unpleasantly high activation energy for
applying small fixes and suggestions</p>
</li>
</ul>
<p>So we created diffscuss—a code review format based on unified diffs,
with editor support for threaded inline comments, basic review
management and git integration, and (best of all) support for jumping
right from a comment to the local source it addresses, without ever
leaving the comfort of Emacs (or, because Hut 8’s own Matt Papi is a
Vimmortal, Vim).</p>
<p><a href="/images/diffscuss-jump-to-source.png">
<img src="/images/diffscuss-jump-to-source.png" width="100%"
alt="Jump to Source" />
</a></p>
<p>We’ve been using diffscuss for about 6 months now, and we’ve been
happy enough with it that we figure it’s time to share it with the world.</p>
<p>Check it out at <a href="https://github.com/hut8labs/diffscuss">Github</a> or
read on for an example of diffscuss in action.</p>
<a name="continued" id="continued"></a>
<p>For example, here you are using diffscuss in Emacs, reading a comment
that Some Guy left in your code. (Click if you want a larger image.)</p>
<p><a href="/images/diffscuss-reading-review.png">
<img src="/images/diffscuss-reading-review.png" width="100%"
alt="Reading Review" />
</a></p>
<p>You decide you agree with him and want to make the change, so you hit
“<code>C-c s</code>” and Emacs pops up the local source file for you, with
the cursor already positioned on the relevant line. (Again, click for
a larger image.)</p>
<p><a href="/images/diffscuss-jump-to-source.png">
<img src="/images/diffscuss-jump-to-source.png" width="100%"
alt="Jump to Source" />
</a></p>
<p>You make the change, save the buffer, and switch right back to the
review buffer. (Click for…you know the drill.)</p>
<p><a href="/images/diffscuss-after-change.png">
<img src="/images/diffscuss-after-change.png" width="100%"
alt="After Change in Source" />
</a></p>
<p>Now “<code>C-c C-c</code>” opens up a new comment, and you reply.</p>
<p><a href="/images/diffscuss-reply.png">
<img src="/images/diffscuss-reply.png" width="100%"
alt="After Change in Source" />
</a></p>
<p>Easy peasy lemon squeezy.</p>
<p><a href="https://github.com/hut8labs/diffscuss">Try it out!</a></p>Coding, Fast and Slow: Developers and the Psychology of Overconfidence2013-04-22T02:24:00-04:00Dan Milsteintag:blog.hut8labs.com,2013-04-22:coding-fast-and-slow.html<p>I’m going to talk today about what goes on in inside developers’ heads when
they make estimates, why that’s so hard to fix, and how I personally figured
out how to live and write software (for very happy business owners) even though
my estimates are just as brutally unreliable as ever.</p>
<p>But first, a story.</p>
<p>It was the &lt;insert time period that will not make me seem absurdly old&gt;,
and I was a young developer <sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup>. In college, I had aced coding exercises; as
a junior dev, I had cranked out code to solve whatever problems someone
specified for me, quicker than anyone expected. I could learn a new language
and get productive in it over a weekend (or, so I believed).</p>
<p>And thus, in the natural course of things, I got to run my own project. The
account manager explained, in rough form, what the client was looking for, we
talked it out, and I said, “That should be about 3 weeks of work.” “Sounds
good,” he said. And so I got to coding.</p>
<a name="continued" id="continued"></a>
<p>How long do you imagine this project took? Four weeks? Maybe five?</p>
<p>Um, actually: three <em>months</em>.</p>
<p>I have vivid memories of that time — my self-image had been wrapped up in
being “a good programmer”, and here I was just hideously failing. I lost
sleep. I had these little panic attack episodes. And it just Would Not End.
I remember talking to that account manager, a pit in my stomach, explaining
over and over that I still didn’t have something to show.</p>
<p>During one of those black periods, I resolved to Never Be That Wrong Again.</p>
<p>Unfortunately, over the course of my career, I’ve learned something pretty hard:
I’m <em>always</em> that wrong.</p>
<p>Actually, I’ve learned something even better: we’re all that wrong.</p>
<p>Recently, I read Daniel Kahneman’s <a href="http://www.amazon.com/gp/product/0374275637/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=0374275637&linkCode=am2&tag=hu8labl08-20">Thinking, Fast and Slow</a>, a sprawling survey
of what psychology has learned about human cognition, about its marvelous
strengths and its (surprisingly predictable) failings.</p>
<p>My favorite section was on Overconfidence. There were, let us say, some
connections to the ways developers make estimates.</p>
<h3>Why You Suck at Making Estimates, Part I: Writing Software = Learning Something You Don’t Know When You Start</h3>
<p>First off, there are, I believe, really two reasons why we’re so bad at making
estimates. The first is the sort of irreducible one: writing software involves
figuring out something in such incredibly precise detail that you can <em>tell a
computer how to do it</em>. And the problem is that, hidden in the parts you don’t
fully understand when you start, there are often these problems that will
explode and just utterly screw you.</p>
<p>And this is genuinely irreducible. If you do “fully understand” something,
you’ve got a library or existing piece of software that <em>does that thing</em>, and
you’re not writing anything. Otherwise, there is uncertainty, and it will often
blow up. And those blow ups can take anywhere from one day to one year to beyond
the heat death of the universe to resolve.</p>
<p>E.g. connections to some key 3rd party service turn out to not be reliable… so
you have to write an entire retry/failure tracking layer; or the db doesn’t
understand some critical character set encoding… so you have to rebuild all
your schemas from scratch; or, the real classic: when you show it to some
customers, they don’t want exactly what they asked for, they want something just
a tiny bit different… that is much harder to do.</p>
<p>When you first hit this pain, you think “We should just be more careful at the
specification stage”. But this turns out to fail, badly. Why? The core
reason is that, as you can see from the examples above, if you were to write a
specification in such detail that it would capture those issues, you’d be
<em>writing the software</em>. And there is really just no way around this. (If, as
you read this, you’re trying to bargain this one away, I have to tell you —
there is really, really, really no way around this. Full specifications are a
terrible economic idea. Further below, I’m going to lay out better economic choices.)</p>
<p>But here’s where it gets interesting. Every programmer who’s been working in
the real world for more than a few months has run into the problems I’m
describing above.</p>
<p>And yet… we keep on making these just spectacularly bad estimates.</p>
<p>And, worse yet, we <em>believe</em> our own estimates. I still believe my own, in the
moment I make them.</p>
<p>So, wait, am I suggesting that all developers somehow fall prey to the same,
predictable errors in thinking?</p>
<p>Yep, that’s exactly what I’m suggesting.</p>
<h3>Why You Suck at Making Estimates, Part <span class="caps">II</span>: Overconfidence</h3>
<p>Kahneman talks at some length about the problem of “experts” making predictions.
In a shockingly wide variety of situations, those predictions turn out to be
utterly useless. Specifically, in many, many situations, the following three
things hold true:</p>
<p>1- “Expert” predictions about some future event are so completely unreliable as
to be basically meaningless</p>
<p>2- Nonetheless, the experts in question are extremely confident about the
accuracy of their predictions</p>
<p>3- And, best of all: absolutely <em>nothing</em> seems to be able to diminish the
confidence that experts feel</p>
<p>The last one is truly remarkable: even if experts try to honestly face evidence
of their own past failures, even if they deeply understand this flaw in human
cognition… they will still feel a deep sense of confidence in the accuracy of
their predictions.</p>
<p>As Kahneman explains it, after telling <a href="http://www.nytimes.com/2011/10/23/magazine/dont-blink-the-hazards-of-confidence.html?pagewanted=all">an amazing story</a>
about his own failing on this front:</p>
<p><span class="dquo">“</span>The confidence you will experience in your future judgments will not be
diminished by what you just read, even if you believe every word.”</p>
<p>Interestingly, there <em>are</em> situations where expert prediction is quite good —
I’m going to explore that below, and how to use it to hack your own dev process.
But before I do that, I want to walk through some details of how the flawed
overconfidence works, on the ground, so you can maybe recognize it in yourself.</p>
<h3>What It Feels Like To Be Wrong: Systems I <span class="amp">&</span> <span class="caps">II</span>, and The 3 Weeks and 3 Months Problem</h3>
<p>In Thinking Fast and Slow, Kahneman explains a great deal of psychology as the
interplay between two “systems” which govern our thoughts: System I and System
<span class="caps">II</span>. My far-too-brief summary would be “System <span class="caps">II</span> does careful, rational,
analytical thinking, and System I does quick, heuristic, pattern matching thinking”.</p>
<p>And, crucially, it’s as if evolution designed the whole thing with a key goal of
<em>keeping System <span class="caps">II</span> from having to do too much</em>. Which makes plenty of sense
from an evolutionary perspective — System <span class="caps">II</span> is slow as molasses, and
incredibly costly; it should only be deployed in very, very rare situations.
But you see the problem, no doubt: without <em>thinking</em>, how does your mind know
<em>when to invoke System <span class="caps">II</span></em>? From this perspective, many of the various
“cognitive biases” of psychology make sense as elegant engineering solutions to
a brutal real-world problem: how to apportion attention in real time.</p>
<p>To see how the interplay between Systems I <span class="amp">&</span> <span class="caps">II</span> can lead to truly awful, and
yet honestly believed estimates, I’m going to turn the mic briefly over to my
friend (and Hut 8 Labs co-conspirator) <a href="http://blog.hut8labs.com/author/edmund-jorgensen.html">Edmund
Jorgensen</a>. He
described it to me in an email as follows:</p>
<p><span class="dquo">“</span>When I ask myself “how long will this project take” System I has no idea, but
wants to have an idea, and translates the question. Into what? I suspect it’s
into something like “how confident am I that I <em>can do</em> this thing,” and that
gets translated into a time estimate, with some multiplier that’s fairly
individual (e.g. when Bob has level of confidence X, he always says 3 weeks;
when Suzy has level of confidence X, she always says 5 weeks).”</p>
<p>Raise your hand if you’ve gradually realized you have two “big” time estimates?
E.g. for me it’s “3 weeks” and “3 months”. The former means “that seems
complex, but I basically think I see how to do it”, and the latter means “Wow,
that’s hard, I’m not sure what’s involved, but I bet I can figure it out.”</p>
<p>Aka, I think Edmund is totally right.</p>
<p>(For those playing along at home: my “3 week” projects seem to take 5-15 weeks,
my “3 month” projects usually take 1-3 years, in the rare event that someone is
willing to keep paying me).</p>
<h3>Alright, So Let’s Stop Being So Overconfident!</h3>
<p>You might be thinking at this point: “Okay, I see where Dan is going: we have to
approach these estimation challenges in some manner that engages System <span class="caps">II</span>
instead of System I. That way, our careful, analytical minds will produce much
better estimates.”</p>
<p>Congratulations, you’ve just invented Waterfall.</p>
<p>That’s basically the promise of the “full specification before we start coding”
approach: don’t allow the team to make intuitive estimates, force everyone to
carefully engage their analytical minds and come up with a detailed spec with
estimates broken down into smaller pieces.</p>
<p>But that totally fails. Like, always.</p>
<p>The real trouble here is the interplay between the two sources of estimation
error: the human bias towards overconfidence, <em>and</em> the inherent uncertainty
involved in any real software project. That uncertainty is severe enough that
even the careful, rational System <span class="caps">II</span> is unable to come up with accurate predictions.</p>
<p>Fortunately, there is a way to both play to the strengths of your own cognition
and also handle the intense variability of the real world.</p>
<p>First, how to play to your mind’s strengths.</p>
<h3>When Experts Are Right, and How To Use That To Your Advantage</h3>
<p>Kahneman and other researchers <em>have</em> been able to identify situations where
expert judgment doesn’t completely suck. As he says:</p>
<p><span class="dquo">“</span>To know whether you can trust a particular intuitive judgment, there are two
questions you should ask: Is the environment in which the judgment is made
sufficiently regular to enable predictions from the available evidence? The
answer is yes for diagnosticians, no for stock pickers. Do the professionals
have an adequate opportunity to learn the cues and the regularities?”</p>
<p>An “adequate opportunity” means a <em>lot</em> of practice making predictions, and a
tight feedback loop to learn their accuracy.</p>
<p>Now, 6-18 month software projects just miserably fail on all these criteria. As
I’ve discussed above, the environment is just savagely not “regular”. Plus,
experts don’t get the combo of making lots of predictions <em>and getting rapid
feedback</em>. If something is going to take a year or more, the feedback loop is
too long to train your intuition (plus you need a <em>lot</em> of instances).</p>
<p>However, there is a form of estimation in software dev that <em>does</em> fit that bill
— 0-12 hour tasks, if they are then immediately executed. At that scale,
things work differently:</p>
<ul>
<li>
<p>Although there is still a lot of variability (more on that below), there is
some real hope of “regularity in your environment”. Two four-hour tasks tend
to have a lot more in common than two six-month projects.</p>
</li>
<li>
<p>You can expect to make hundreds of such estimates, in the course of a couple
of years.</p>
</li>
<li>
<p>You get very quick feedback about your accuracy</p>
</li>
</ul>
<p>The highest-velocity team I’ve ever been on did week sprints, and broke
everything down to, basically, 0, 2, 4, or 8 hours (and there was always some
suspicion about the 8 hour ones — like, we’d try pretty hard to break those
down to smaller chunks). We estimated those very quickly and somewhat casually
— we didn’t even use a <a href="http://en.wikipedia.org/wiki/Planning_poker">Planning
Poker</a> style formalism.</p>
<p>At that point, you’re using the strengths of System I — it has a chance to get
trained, it sees plenty of examples, and there are meaningful patterns to be
gleaned. And, thanks to the short sprint length, you get very rapid feedback on
the quality of your estimates.</p>
<h3>Wait, Wait, Wait, Let’s Just Make a Thousand 4 Hour Estimates!</h3>
<p>How can I both claim that you <em>can</em> make these micro-scale estimates, but somehow
can’t roll them up into 6-18 month estimates? Won’t the errors average out?</p>
<p>Basically, although I think the estimates at that scale are often right, when
they’re wrong, there’s simply no limit to how wrong they can be. In math-y
terms, I suspect the actual times follow a power law distribution. And sufficiently
heavy-tailed power law distributions are notable for having no stable mean and infinite variance.
Which, frankly, is exactly how those big waterfall project estimates feel to me.</p>
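<p>If you want to feel this in your bones, here’s a tiny simulation. It’s a sketch only, with a guessed-at Pareto shape parameter rather than measured data: it draws “actual” task times for a pile of 4-hour estimates from a heavy-tailed distribution and watches what happens to the running average:</p>
<pre><code>
import random

# Toy model: actual task duration is the 4-hour estimate multiplied by a
# Pareto-distributed blowup factor. The shape parameter (just above 1) is a
# guess for illustration: the theoretical mean barely exists, the variance
# doesn't, and the running average is dominated by rare, enormous blowups.
random.seed(42)
ALPHA = 1.1
ESTIMATE_HOURS = 4.0

def actual_hours():
    # random.paretovariate(ALPHA) returns values of at least 1.0, with a long tail.
    return ESTIMATE_HOURS * random.paretovariate(ALPHA)

total = 0.0
for n in range(1, 5001):
    total += actual_hours()
    if n in (10, 100, 1000, 5000):
        print(f"after {n} tasks, average actual time = {total / n:.1f} hours")
</code></pre>
<p>One monster draw can swamp hundreds of well-behaved ones, which is why a thousand pretty-good 4-hour estimates don’t roll up into one trustworthy 6-month estimate.</p>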
<p>You might be thinking: how on earth could something you expected to take 4 hours
take a month or two?</p>
<p>This happens <em>all the time</em>: you go to take some final step in something and
discover some hideous blocker which completely changes the scope. E.g. at a
recent startup, in trying to eliminate single points of failure from our
system, we went to put a load balancer in front of an <span class="caps">IMAP</span> server we had
written. So that, when one server machine died, the load balancer would just
smoothly fail over to another box, and customers would see no impact.</p>
<p>And that seemed like a 4-hour-ish task.</p>
<p>But when we went to actually do it, we realized/remembered that the <span class="caps">IMAP</span> server,
unlike all the <span class="caps">HTTP</span> servers we were so used to, <em>maintained connection state</em>.
So if we wanted to be able to transparently fail over to another server, we’d
have to somehow maintain that state on two servers, or write some kind of
state-aware proxying load balancer in front of the <span class="caps">IMAP</span> server.</p>
<p>Which felt like about a 3-month project to us.<sup id="fnref:2"><a class="footnote-ref" href="#fn:2" rel="footnote">2</a></sup></p>
<p>And there is the other reason that short sprints are an absolutely key piece of
all this: they place a hard limit on the cost of a horrifically bad estimate.</p>
<h3>Are We All Just Screwed?</h3>
<p>So what do we do? Just accept that all our projects are doomed to failure?
That we’ll have poisoned relationships with the rest of the business, because
we’ll always be failing to meet our promises?</p>
<p>The key is that you first accept that making accurate long-term estimates is
fundamentally impossible. Once you’ve done that, you can tackle a challenge
which, though extremely difficult, can be met: how can your dev team
generate a ton of value, <em>even though</em> you cannot make meaningful long-term estimates?</p>
<p>What we’ve arrived at is basically a first-principles explanation of why the
various Agile approaches have taken over the world. I work that out in more detail
in my next post: <a href="http://blog.hut8labs.com/no-deadlines-for-you.html">“No Deadlines For You! Software Dev Without Estimates, Specs or Other Lies”.</a></p>
<p>(Join in the conversation on <a href="https://news.ycombinator.com/item?id=5596578">Hacker
News</a> and
<a href="http://developers.slashdot.org/story/13/04/23/2021201/overconfidence-why-you-suck-at-making-development-time-estimates">Slashdot</a>.)</p>
<div class="footnote">
<hr />
<ol>
<li id="fn:1">
<p>(the band &lt;insert dated music reference&gt; was on the radio, and everyone was
talking about &lt;some long-gone tv show&gt;). <a class="footnote-backref" href="#fnref:1" rev="footnote" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>If you’re thinking “Wait, 3 months, like one of your 3 month estimates?”, I
have no idea what you’re talking about. <a class="footnote-backref" href="#fnref:2" rev="footnote" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
</ol>
</div>When it Comes to Chaos, Gorillas Before Monkeys2013-04-02T00:19:00-04:00Edmund Jorgensentag:blog.hut8labs.com,2013-04-02:gorillas-before-monkeys.html<p>Here’s a glitch in my thinking that I realized on a recent job: I am
too terrified of monkeys, and not sufficiently afraid of gorillas. As
a result, I’ve been missing opportunities for early, smart investments
to make my systems more resilient in the Amazon cloud.</p>
<p>By “monkey” and “gorilla” I mean “Chaos Monkey” and “Chaos Gorilla,”
veterans of Netflix’s Simian Army. You can browse the <a href="http://techblog.netflix.com/2011/07/netflix-simian-army.html">entire
list</a>
<sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup>, but for easy reference:</p>
<ul>
<li>Chaos Monkey is the personification (simianification?) of <span class="caps">EC2</span>
instance failure.</li>
<li>Chaos Gorilla represents major degradation of an <span class="caps">EC2</span> availability
zone, henceforth “<span class="caps">AZ</span>” for short (or, as we sometimes referred to
them at my last job, “failability zones”).</li>
</ul>
<p>I believe that startups should (mostly) worry less about <span class="caps">EC2</span> instances
failing, and more about entire AZs degrading. This leads to a
different kind of initial tech/devops investment—one that I believe
represents a better return for most early-stage companies and products.</p>
<a name="continued" id="continued"></a>
<h3>How I (Finally) Learned to Dread Chaos Gorilla Appropriately</h3>
<p>At the job in question, the team and I were working on an application
that had, at some unremarked moment, crossed the fuzzy line between
advanced prototype and early production. First customers were using
it—some were even starting to depend on it in their daily lives—and
suddenly downtime had gone from something we thought about as
“wouldn’t it be nice if someone noticed or cared” to “wow that might
really tick some people off.”</p>
<p>Unfortunately, unlike <em>every other development shop ever</em>, we might
have cut a corner or two getting our prototype out. To wit, we had a
single point of failure in our system. Or maybe two. All right, I
admit it: we had four <span class="caps">SPOF</span> time bombs ticking away, which—quite
coincidentally—was also the total number of <span class="caps">EC2</span> instances in our
deployment. I felt bad about that, and so did the rest of the team.
We all knew that SPOFs were pure evil, right up there with axe
murderers and grams of trans fat on the list of “Things There’s No
Good Number Of.” So we planned to spend a good chunk of a couple
sprints terminating those SPOFs with extreme prejudice <sup id="fnref:2"><a class="footnote-ref" href="#fn:2" rel="footnote">2</a></sup>.</p>
<p>And then, the night before the first such sprint began, we dodged a
bullet: one of the East Coast AZs freaked the hell out, bringing half
the sites on the Internet down with it. Luckily, it wasn’t the <span class="caps">AZ</span> we
were in <sup id="fnref:3"><a class="footnote-ref" href="#fn:3" rel="footnote">3</a></sup>. Phew, right?</p>
<p>But on the bus the next morning I got to thinking: why did I feel
<em>ashamed</em> of our <span class="caps">SPOF</span> instances, but <em>lucky</em> for dodging the latest <span class="caps">AZ</span>
meltdown? Why did I now suspect that if an instance failure had
caused us even minutes of downtime, I would have blamed myself,
whereas if an <span class="caps">AZ</span> meltdown had knocked our site out of commission for
hours (along with Reddit and maybe Netflix) I would have—if I was
being honest with myself—kind of blamed Amazon? This felt like just
the kind of misaligned thinking that might be hiding an economic opportunity.</p>
<p>Arriving at the office, I cornered Dan in the kitchen and we chatted
it through. From a cold, hard, economic point of view, would we get a
better return first protecting against instance failure, or improving
our resilience to <span class="caps">AZ</span> meltdown? Instances failed, sure—and if we were
at Netflix’s scale, they’d be failing all the time. But at our
scale—four machines—they didn’t seem to fail very often, and when
they did, it would be maybe an hour of scrambling to fix. On the
other hand I could name five occasions in the previous two years, just
by my personal count, when an <span class="caps">AZ</span> had melted down—and in each case we
had spent more like half a day (at least) dealing with the crisis and
fallout. If you go <a href="http://aws.amazon.com/message/680587/">looking</a>
for <a href="http://aws.amazon.com/message/680342/"><span class="caps">AZ</span></a>
<a href="http://aws.amazon.com/message/65648/">meltdowns</a>, they’re
<a href="http://aws.amazon.com/message/67457/">not</a> very
<a href="http://aws.amazon.com/message/2329B7/">hard</a> to
<a href="http://www.webpronews.com/amazon-web-services-outage-brings-down-websites-2012-06">find</a>.</p>
<p>In other words, at our scale of four instances:</p>
<ul>
<li>instance failure = cheap and seldom</li>
<li><span class="caps">AZ</span> meltdowns = expensive and frequent <sup id="fnref:4"><a class="footnote-ref" href="#fn:4" rel="footnote">4</a></sup></li>
</ul>
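<p>Here’s a back-of-the-envelope version of that comparison in Python. The event frequencies echo the rough history above, but the dollar figures and the instance failure rate are invented for illustration, so plug in your own numbers:</p>
<pre><code>
# Hypothetical yearly cost comparison for a 4-instance deployment.
# Frequencies loosely follow the history described above; all dollar
# figures (and the instance failure rate) are made up for illustration.

ENGINEER_COST_PER_HOUR = 100.0   # loaded cost of people scrambling
DOWNTIME_COST_PER_HOUR = 500.0   # lost signups, angry customers, etc.

def annual_cost(events_per_year, hours_per_event):
    per_event = hours_per_event * (ENGINEER_COST_PER_HOUR + DOWNTIME_COST_PER_HOUR)
    return events_per_year * per_event

# Instance failure: rare at four machines, about an hour of scrambling each.
instance_cost = annual_cost(events_per_year=1.0, hours_per_event=1.0)

# AZ meltdown: roughly five in two years, at least half a day each.
az_cost = annual_cost(events_per_year=2.5, hours_per_event=6.0)

print(f"instance failures per year: ${instance_cost:,.0f}")
print(f"AZ meltdowns per year:      ${az_cost:,.0f}")
</code></pre>
<p>Even if you think the instance failure rate is badly wrong, the gap is wide enough that the conclusion survives a lot of fiddling with the inputs.</p>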
<p>(Protecting against <span class="caps">AZ</span> meltdown also has the nice benefit of
optimizing for mean time to recovery instead of mean time between
failures, which is <a href="http://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/">usually the right way to
go</a>).</p>
<p>So we changed course: over the next few sprints we ran <span class="caps">AZ</span> meltdown
simulations, made improvements to our backup and deploy scripts that
would allow us to recover from disaster more quickly <sup id="fnref:5"><a class="footnote-ref" href="#fn:5" rel="footnote">5</a></sup>, and
generally made ourselves more resilient to the economically disastrous
wrath of Chaos Gorilla before we spent real dev calories preventing
the relative pranks of Chaos Monkey.</p>
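<p>To give a flavor of what “recover quickly in another <span class="caps">AZ</span>” can look like in a script, here’s a minimal sketch using boto3 (newer tooling than anything we were running at the time). The <span class="caps">AMI</span>, instance type, and <span class="caps">AZ</span> names below are placeholders, and a real recovery script would also handle data restore and traffic cutover:</p>
<pre><code>
import boto3

# Minimal sketch: launch a replacement app server in a different AZ.
# The AMI, instance type, and AZ below are placeholders; a real recovery
# script would also restore data and repoint DNS / load balancers.
ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # pre-baked image of the app server
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
    Placement={"AvailabilityZone": "us-east-1b"},  # deliberately not the sick AZ
)
instance_id = resp["Instances"][0]["InstanceId"]

# Don't point traffic at it until it's actually running.
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
print("replacement instance ready:", instance_id)
</code></pre>
<p>The script itself is the easy part; the meltdown simulations are what tell you whether it actually works when you need it.</p>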
<h3>Why Had I Been Thinking About this So Wrong?</h3>
<p>I suspect for a few reasons:</p>
<h4>The Different Tradeoffs of Physical Hardware</h4>
<p>I get a lot of my habits and patterns of thought from having worked in
the olden days on sites deployed on physical, colocated servers, where
the cost/benefit profiles are different than those of the cloud.
Unlike tooling some scripts to bring up new instances on-demand in a
separate <span class="caps">AZ</span>, maintaining a failover-ready second server farm in a
separate colo facility represents a substantial investment for a young
company. Furthermore, compared to the bleeding-edge insanity of an <span class="caps">AZ</span>
operated at Amazon’s scale, most physical installations are built on
boring, tried, and relatively simple technology, and they don’t
catastrophically fail as often.</p>
<p>Don’t get me wrong—there’s no excuse for not preparing against colo
failure in the physical hardware universe, any more than there’s an
excuse for ignoring Chaos Gorilla in <span class="caps">AWS</span>, but the combination of lower
incidence of failure and higher cost to protect against it means that
the economic “break-even” line can get drawn later in an application’s life-cycle.</p>
<h4>AZs Still “Feel Like Hardware”</h4>
<p>It’s natural to think of your piddling <span class="caps">EC2</span> instance as something
ephemeral, but (at least for me) it’s not natural to think of a whole
<span class="caps">AZ</span> as something volatile or dangerous. It’s basically just a big,
solid data center, more stable than the boxes it houses, right? Well…no.</p>
<p>An <span class="caps">AZ</span> is a virtualized data center—an exercise in massively
distributed and heterogeneous systems engineering, housing thousands
of tenants who are each competing for resources and aren’t even
supposed to know the others exist. No one else in the world (as far
as I know) is operating a comparable system at the scale of <span class="caps">AWS</span>. And
like all such complex, distributed systems, AZs are subject to weird
hidden dependencies and nasty cascading failure modes.<sup id="fnref:6"><a class="footnote-ref" href="#fn:6" rel="footnote">6</a></sup></p>
<p>In other words: I think we intuitively feel that AZs fail in the modes
of colocated hardware—relatively simply and independently—when in
fact they tend to fail in the mind-bogglingly interconnected modes of
distributed software.</p>
<h4>Chaos Monkey Came, Saw, and Conquered Our Imagination</h4>
<p>I remember the first time I read about Chaos Monkey. The name was
hilarious, and the thinking behind it just felt so instantly,
recognizably <em>right</em>. “You want to make darn sure you only deploy
production systems resilient to instance failure? <span class="caps">OK</span>, then randomly
terminate instances in production.” I internalized that point of view
pretty quickly, and it stuck.</p>
<p>Chaos Gorilla, on the other hand? I barely heard about him when he
came on the scene. And I didn’t know Chaos Kong (who simulates an
entire <span class="caps">AWS</span> region failing) even existed until recently.</p>
<h4>It’s Easy to Feel Good in Good Company</h4>
<p>When one of your instances dies, you’re the only startup on your floor
whose site is down. That feeling sucks. When an <span class="caps">AZ</span> goes down, Giants
of the Internet stumble—even the mighty Netflix can have a problem or
two. Meanwhile down the hall you hear the sobs and screams of your
counterparts at other startups, which you find oddly comforting, and
by that night you’re drinking beer and swapping the day’s war stories
with them.</p>
<p>All this company can make you feel that, somehow, downtime caused by
an <span class="caps">AZ</span> outage <em>isn’t really your fault</em>. But our customers don’t think
that way, and neither should we.</p>
<h3>In Conclusion</h3>
<p>First, let’s note what I’m not concluding.</p>
<p>I’m not concluding that it’s <span class="caps">OK</span> to have single points of failure in
your system, or that instances never die on Amazon, or that you
shouldn’t engineer for high availability.</p>
<p>I am concluding that, because of the high historical frequency of <span class="caps">AZ</span>
degradation and the relatively small number of instances most early
startups deploy, those startups are less likely to be affected on any
given day by instance failure than by <span class="caps">AZ</span> degradation. Furthermore, the
costs incurred in downtime and recovery efforts are usually
significantly higher for <span class="caps">AZ</span> degradation than instance failure.
Therefore, for many startups, it will make sense to invest in
war-gaming <span class="caps">AZ</span> degradation and tooling for quick recovery before
engineering around instance failure.</p>
<p>Or, to put it more succinctly: gorillas before monkeys.</p>
<div class="footnote">
<hr />
<ol>
<li id="fn:1">
<p>Including my personal favorite, the weirdly named “Conformity
Monkey,” whom I kind of imagine standing around awkwardly with a crew
cut and a button-down while “Hippie Monkey” and “Beat Monkey” hurl
epithets and folk songs at him. <a class="footnote-backref" href="#fnref:1" rev="footnote" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>For those wondering why we couldn’t just throw up an <span class="caps">ELB</span> or two
and call it a day: a major element of the application was a custom
<span class="caps">IMAP</span> server—yes, that’s right, <span class="caps">IMAP</span>, as in the charmingly chatty,
delightfully stateful mail protocol. You haven’t really lived until
you’ve tried to load balance a stateful protocol, I tell you. <span class="caps">HTTP</span> is
for wussies. <a class="footnote-backref" href="#fnref:2" rev="footnote" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>For those interested: we were in a single <span class="caps">AZ</span> because we were
required, for business reasons, to deploy into a <span class="caps">VPC</span>. <a class="footnote-backref" href="#fnref:3" rev="footnote" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:4">
<p>We went into some more depth when estimating the relative costs
and benefits of protecting against <span class="caps">AZ</span> meltdown or instance failure,
generating some upper / lower bound estimates which I plan to publish
in a later companion piece. <a class="footnote-backref" href="#fnref:4" rev="footnote" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:5">
<p>Yes, we had robust backups and automated deploys, even for our
prototype. Maybe we were a little <span class="caps">SPOF</span>-y, but we weren’t insane. <a class="footnote-backref" href="#fnref:5" rev="footnote" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:6">
<p>Take <span class="caps">EBS</span> for example, which has been responsible in one form or
another for many of the <span class="caps">AZ</span> meltdowns to date. Even if you don’t
use <span class="caps">EBS</span> directly, an <span class="caps">EBS</span> meltdown can affect your deployment
because—surprise!—ELBs use <span class="caps">EBS</span> behind the scenes—as does <span class="caps">RDS</span>,
and a number of other <span class="caps">AWS</span> services. Even if you avoid all
<span class="caps">EBS</span>-dependent services, a sudden rush of <span class="caps">EBS</span> failover and
replication can choke an entire <span class="caps">AZ</span>, rendering your instances
inaccessible and useless. <a class="footnote-backref" href="#fnref:6" rev="footnote" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
</ol>
</div>Dan Talks About Post-Mortems2013-03-29T07:20:00-04:00Dan Milsteintag:blog.hut8labs.com,2013-03-29:dan-talks-about-post-mortems.html<p>Hello, Dan here. So: the Hut 8 Labs team is very excited to be firing up our
blog. But, before we get down to new business, I wanted to post some links to
other places I’ve written or talked. Of late, a bunch of that writing and
talking has been about <strong>how to run effective post-mortems</strong>.</p>
<p>I’ve actually come to believe that, for many startups, spending a chunk of time
improving how they approach post-mortems (and learning from failure more
generally) has a just-incredible economic return. I suspect it’s one of the
most profitable things they can do with their (incredibly scarce) time.</p>
<a name="continued" id="continued"></a>
<p>Why? Because the sort of default way groups of human beings respond to failures
is with <em>shame</em>… and an attendant desire to quickly move on and pretend it
never happened. Thus, that’s how most startups respond to multi-hour outages,
or embarrassing bugs showing up in front of important early customers, or the like.</p>
<p>And if that’s what your team does after experiencing some nasty failure, you’re
basically guaranteed to be missing simple, cheap, incredibly valuable
improvements. It can be helpful to flip this around, and imagine those
improvements not as “avoiding bad things”, but rather “making you piles of
money” (aka, having strongly positive economic returns). Imagine there’s a big
class of customers waiting to buy your product, but you’ve got a team-wide
mental block which prevents everyone from seeing them. Improving how you run
post-mortems is like discovering those customers are lined up outside your door,
waiting to get in.</p>
<p>(I am not, of course, suggesting that making failures or outages go away is
somehow simple and cheap — what I’m suggesting is that there are incremental
improvements with outsized value, and post-mortems can help you find them. If
you’re thinking “But early customers don’t care that much about outages”,
you’re totally right — the big economic win comes not from avoiding showing
bugs to customers, but from decreasing the frequency of firedrills for your
team, which have an outsized opportunity cost.)</p>
<p>Well-run post-mortems can also serve as a very important release valve —
again, because of the default response of shame. Unless there’s a structure to
deal with failures, people tend to slip into very damaging patterns —
searching for someone to blame, inserting slow-moving layers of review, etc.</p>
<p>Most recently, I gave a talk touching on a bunch of this at the Lean Startup
Conference; the slides are up here:</p>
<p><a href="http://www.slideshare.net/danmil30/how-to-run-a-5-whys-with-humans-not-robots">How To Run a 5 Whys (With Humans, Not Robots)</a></p>
<p>You can also watch a <a href="http://www.ustream.tv/recorded/27482093/highlight/310486">12-minute
video</a> of the talk
(which has the added benefit of documenting for future-me that, in late 2012, I
briefly experimented with a mustache).</p>
<p>Also, a ways earlier, I wrote up a blog post on my experiences running
post-mortems at HubSpot:</p>
<p><a href="http://dev.hubspot.com/blog/bid/64771/Post-Mortems-at-HubSpot-What-I-Learned-From-250-Whys">What I Learned From 250 Whys</a></p>
<p>Hope you enjoy; do check back for more. As a teaser: I’ve been engaged in a
very interesting post-mortem-themed email exchange with one <a href="http://www.kitchensoap.com/">John
Allspaw</a> (who will tell pretty much anyone who
asks that he has some very serious concerns about the 5 Whys approach). I’ve
promised to write up my take on that discussion, tentatively titled <strong>5 Whys
Baaaad, 5 Whys Gooood</strong>, aka “All The Things That Are Wrong With 5 Whys And Why
I Think They’re Awesome Anyways”.</p>
<p>Assuming I actually get that written, it should be fun.</p>