<h1>How to Square the Circle, Achieve Perpetual Motion, and Tune Your Alert Emails Just Right</h1>
<p><em>Edmund Jorgensen, 2015-03-02</em></p>
<p><span class="dquo">“</span>You can tune a piano,” goes the old joke, “but you can’t tuna fish.” I’ve
come to believe that you can’t really tune the automated alerts you get from
your monitoring systems either—at least in the sense that we usually mean when
we complain that “our email alerts are out of control and need to be tuned
properly.” Instead we should be spending our time and attention tuning a
larger, more complex system—of which alerts are just one part.</p>
<p>When you tune a piano string, you make small adjustments to return it toward
the ideal pitch from which it has drifted (say, 440 Hz for an A). It’s
overwhelmingly <em>impractical</em> to tune a real piano string to A perfectly,
because of the messy nature of the physical world—but the ideal you’re trying
for is straightforward and unambiguous.<sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup> Unfortunately, there’s no such
ideal for alerts.</p>
<p><span class="dquo">“</span>Of course there is,” says a guy in back, “I’ve read about it in tons of blog
posts. The ideal for alerts is this: <em>every time</em> I get an alert, it should
indicate an <em>actual problem</em> that requires my attention.”</p>
<p><span class="caps">OK</span>, guy at the back, that’s pretty easy to do—let’s just turn off the alerting
system altogether, and I promise that every alert you receive will be
indicative of a problem requiring your attention.</p>
<p><span class="dquo">“</span>Don’t play dumb,” he retorts. “We still want to get alerts whenever something
is wrong—as soon as we know it’s wrong, in fact—we don’t want to be sitting
around thinking things are all hunky dory when production is on fire. So
that’s the ideal towards which we tune alerts—we want alerts <em>immediately</em>
when there’s an actual problem, and only then.”</p>
<p>Fair enough, but that’s a very different kind of ideal than a string vibrating
at precisely 440 Hz. In fact it’s not even clear that this can be called an
“ideal” at all, because under this definition an alert can drift from ideal in
at least two, <em>often contradictory</em> directions. That is, it can fire when
there isn’t an actual problem (a false positive) or not fire when there is (a
false negative)—and when you make one of these problems better for a given
alert, you tend to make the other worse.</p>
<p>For example, if you’ve ever monitored <span class="caps">CPU</span> usage of a production system, you’ll
be familiar with the false positive alerts you get when the system becomes
briefly and legitimately busy doing a burst of real work. So you tune the
alert to back off a bit—perhaps you will tolerate up to 80% utilization
instead of 70% before alerting—only to find that some truly nasty condition
occasionally pegs the <span class="caps">CPU</span> right around 72% forever, causing all sorts of other
problems in the meantime. All right, you think, I’ll set the <span class="caps">CPU</span> threshold
back to 70% but not alert until we’ve exceeded that for 30 minutes, which is
longer than any legitimate work spike—but now you’ve guaranteed that you won’t
find out about actual problems for at least 30 minutes. So you set a new rule,
which etc. etc. etc.</p>
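<p>For the curious, here is roughly what that ever-twiddled rule looks like in
code. This is just a sketch in Python with made-up numbers: the threshold, the
window, and the one-sample-per-minute cadence are all assumptions, not anyone’s
real configuration.</p>
<pre><code># Sketch of a threshold-plus-duration alert rule (hypothetical numbers).
# Samples arrive once a minute; we alert only if every sample in the
# last WINDOW_MINUTES exceeded the threshold.

from collections import deque

THRESHOLD = 0.70        # 70% CPU utilization
WINDOW_MINUTES = 30     # the "30 minutes" tradeoff from the text

recent = deque(maxlen=WINDOW_MINUTES)

def should_alert(cpu_sample):
    """Record one per-minute CPU sample (0.0 to 1.0) and decide whether to fire."""
    recent.append(cpu_sample)
    window_full = len(recent) == WINDOW_MINUTES
    return window_full and all(s > THRESHOLD for s in recent)

# A brief legitimate spike never fills the window, so it stays quiet --
# but a real problem now goes unreported for at least 30 minutes.
</code></pre>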
<p>As tricky as this balancing act is (and, if you’ve ever struggled with this in
the real world, you know that it is plenty tricky), there’s a subtle (and even
more dastardly) problem buried in this discussion so far: the tossing around
we’ve been doing of that term “an actual problem.”</p>
<p>What makes one condition of your system “an actual problem” and another “not an
actual problem?” There’s no unambiguous, measurable criterion you can
reference to answer that question. If you polled different people in your
business, you’d almost certainly get vastly different answers—one of the folks
in marketing might not care, for example, if page load times were above 1
second for their latest micro-site, but might care very much if it’s down,
while another might think anything over 500 millis should be synonymous with
downtime—<em>for the very same site</em>. So what we mean by saying an alert caught
“an actual problem” ends up being something like: “when the alert fired, it
alerted me to a situation that, with my own infinitely flexible and
idiosyncratic human knowledge and judgment, I was glad to know about at just
that time and no later.”<sup id="fnref:2"><a class="footnote-ref" href="#fn:2" rel="footnote">2</a></sup></p>
<p>This is a pretty bad situation we find ourselves in—not because there’s some
unattainable perfection our alerts can’t ever realistically achieve, but
because we don’t even have a good enough definition of perfection to let us say
whether a change to one of our alerts is actually making it <em>better or worse</em>.
That’s a horrific state of affairs, because it means that as we attempt to
steadily improve our operations in nice small steps, we’re just going to end up
endlessly jerking away from whichever flavor of catastrophe last burned us: our
alert volume will build up until it’s just background noise, which will lead to
an alert on an “actual problem” being ignored, which will lead to a fantastic
blamefest that results in our cutting a bunch of alerts, which will lead to an
“actual problem” that generates no alert, which will lead to another blamefest
resulting in a buildup of alerts … and round and round we go, with no end in sight.</p>
<h3><span class="caps">OK</span>, then … is all hope lost?</h3>
<p>So how do we get out of this spiral of blame and abject existential horror?
Here’s a hint: how would we design our alerts if we had access to an infinite
supply of brilliant, unsleeping, <em>free</em> interns (who were also intimately
familiar with our systems and business) to respond to them? Well, assuming
we’re completely heartless<sup id="fnref:3"><a class="footnote-ref" href="#fn:3" rel="footnote">3</a></sup>, we’d make those alerts sensitive as all hell,
because if it’s free (and we’re heartless), why not throw human intelligence
and attention at every little blip and bump we monitor to see if there’s an
“actual problem?”</p>
<p>What if, on the other hand, the condition on which we were alerting was
extremely benign—for example, the website on which we host our high-school
poetry going down? We’d make that alert extremely insensitive, because
honestly: who cares if it goes down for a day or two, or even a week?<sup id="fnref:4"><a class="footnote-ref" href="#fn:4" rel="footnote">4</a></sup></p>
<h3>Economics to the rescue</h3>
<p>In other words, we can recast the idea of an alert as something that doesn’t
even <em>have</em> a Platonic ideal in and of itself—but which is one piece of an
economic equation, with an associated <em>cost</em> and <em>benefit</em> profile.</p>
<p>I mean those words literally, by the way: each alert has some cost and benefit
in some actual number of probabilistic dollars and cents, where the cost is
dominated by the investment of human attention and intelligence that it
occasions, and the benefit is equal to the cost of an “actual problem” (times
the probability that the alert has identified such a problem).</p>
<p>Now we can start comparing the relative badness of a given false positive and
its corresponding false negative, and see the outlines of a system that can
actually be tuned towards an unambiguous ideal—the absolute <em>minimum overall
cost</em>.<sup id="fnref:5"><a class="footnote-ref" href="#fn:5" rel="footnote">5</a></sup></p>
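<p>To make “minimum overall cost” concrete, here’s a back-of-the-envelope sketch
in Python. Every number in it is invented for illustration; the point is only
that rough, order-of-magnitude figures are enough to tell you which way to tune:</p>
<pre><code># Back-of-the-envelope expected cost of one alert rule (all numbers invented).

def expected_monthly_cost(false_positives_per_month,
                          cost_per_interruption,      # human attention, in dollars
                          missed_problems_per_month,  # expected false negatives
                          cost_per_missed_problem):   # cost of the "actual problem"
    attention_cost = false_positives_per_month * cost_per_interruption
    miss_cost = missed_problems_per_month * cost_per_missed_problem
    return attention_cost + miss_cost

# A twitchy rule: 40 interruptions at $50 each, but almost nothing slips by.
twitchy = expected_monthly_cost(40, 50, 0.05, 20_000)   # $2,000 + $1,000 = $3,000

# A relaxed rule: only 5 interruptions, but it misses more real problems.
relaxed = expected_monthly_cost(5, 50, 0.25, 20_000)    # $250 + $5,000 = $5,250

# Order-of-magnitude estimates are enough to say which direction to tune.
print(min((twitchy, "twitchy rule"), (relaxed, "relaxed rule")))
</code></pre>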
<h3>A gentle objection from the guy at the back</h3>
<p><span class="dquo">“</span>All right,” says the guy at the back, “I’ve sat quietly for a bit, but now I
think I’ve got you. Maybe you’re right that the Platonic ideal was a little
too simplistic, but what you’re proposing is way too <em>complex</em>—there’s just no
way in hell you can actually figure out the cost of human attention and
intelligence and the probability of a false negative etc. etc. etc. and get all
that down to dollars and cents. In practice it’s just impossible.”</p>
<p>Well, guy at the back, we agree about one thing: we’re never going to calculate
those costs down to the cent, or even the dollar. But—and here’s the great
part—in practice <em>we don’t have to</em>—we just need back-of-the-envelope,
order-of-magnitude estimates that allow us to compare a couple of choices (e.g. making
an alert more or less sensitive) and say, <em>relatively</em>, which course is
probably better.</p>
<p><span class="dquo">“</span>But,” says the guy, “you’re still asking me to put a cost on things like
minutes of downtime. The leaders of my business are never going to do
that—downtime is one of those things that just <em>can’t ever happen</em>.”</p>
<p>Oh, guy at the back, let me buy you a beer or ten.</p>
<p>The spirit behind that “can’t ever happen” is a gigantic problem—it’s
equivalent to saying “our perpetual motion machine just can’t ever run down,
because we love our customers and failure is not an option.” This is, at its
heart, a <em>moral</em> argument, where blame and punishment are what’s under
consideration—and in that mindset, people often get genuinely angry if you
even <em>suggest</em> that there’s an economic tradeoff to consider.<sup id="fnref:6"><a class="footnote-ref" href="#fn:6" rel="footnote">6</a></sup></p>
<p>If the above sounds uncomfortably close to your own situation, you have much
more profound problems than a simple pass at your Nagios configuration could
ever solve. If you operate the software of your business, then you <em>must</em> be
able to reason about the economics of what you do—be it preventing downtime,
investing in backups, or even speeding up deploys—in at least comparative
orders of magnitude. If the leaders in your business refuse to partner with
you on that, then you don’t have a ton of great options.<sup id="fnref:7"><a class="footnote-ref" href="#fn:7" rel="footnote">7</a></sup> In this situation,
as my friend and colleague Dan Milstein says—and I don’t repeat this
lightly—“maybe the world is telling you to brush up your LinkedIn profile.”</p>
<p>But in my experience—and I hope that experience is far from unique—the
leaders of a business usually welcome the chance to have an economic discussion
around such operational risks and investments, as it represents a chance to
better understand (and inform) some aspects of the business’s economic equation
that are often opaque to them, and remove some of the fear and anxiety that
this opacity creates.</p>
<h3>In Conclusion …</h3>
<p>In practice I think you’ll find that, far from our original false Platonic
ideal of “every alert indicates an ‘actual problem,’” you’ll end up happily and
profitably tolerating some number of false positives, since for most businesses
they tend to be considerably cheaper than a single false negative.<sup id="fnref:8"><a class="footnote-ref" href="#fn:8" rel="footnote">8</a></sup></p>
<p>But thinking about alerts in their larger economic context also lets us improve
our overall economics by means of other investments too. Besides just changing
the sensitivity of our alerts, we can also improve the overall economic
equation by:</p>
<ul>
<li>
<p>driving down the cost of receiving an alert—for example by making alerts
easier and quicker to digest and disregard if appropriate, so they consume
less human attention and intelligence</p>
</li>
<li>
<p>driving down the cost of the failures we want to alert on, by providing
backup or alternative systems that make failures less expensive (for example,
providing materials that allow cashiers to take those old paper impressions
of credit cards if the electronic system goes down)</p>
</li>
</ul>
<p>By recognizing our alerts as part of a broader economic system, in other words,
we’re setting ourselves up for a world with an actual forward direction and a
lot more options for how to travel in that direction—and, correspondingly, a
lot less shaking our fist at all those email alerts in impotent rage.</p>
<div class="footnote">
<hr />
<ol>
<li id="fn:1">
<p>I was originally too ambitious here, stating that there is a perfect
tuning for a piano. It turns out that, as several early readers have
<a href="http://blogs.scientificamerican.com/roots-of-unity/2014/11/30/the-saddest-thing-i-know-about-the-integers/">pointed out to me</a>,
there’s no way to tune an entire <em>piano</em> perfectly. So, we’ll stick with a
string being tuned to a single note, unrelated to all others. <a class="footnote-backref" href="#fnref:1" rev="footnote" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>Yes, in some respect, creating a “perfect” alerting system would mean
designing an <span class="caps">AI</span> that contained, besides its electronic sensors, an exact,
evolving, realtime copy of all your wisdom, experience, judgment, domain
knowledge, preferences, etc.—and alerted you when it calculated, with 100%
certainty, that you would want to handle a situation—because, paradoxically,
while possessing all your wisdom, experience, judgment, domain knowledge,
preferences, etc.—as well as on-board networking and a faster <span class="caps">CPU</span> than
yours—the <span class="caps">AI</span> was somehow unable to address the situation itself. <a class="footnote-backref" href="#fnref:2" rev="footnote" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>Yes, there’s a real moral question here about making the lives of these
poor interns unbearably miserable, but … <em>interns</em>. <a class="footnote-backref" href="#fnref:3" rev="footnote" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:4">
<p>In my case, I’d probably want to be alerted if it somehow ever
accidentally came <em>up</em>, so that I could immediately shut it down again. <a class="footnote-backref" href="#fnref:4" rev="footnote" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:5">
<p>Or, if you’re an optimist, the maximum overall benefit—but if you’re an
optimist, what are you doing working in operations anyway? <a class="footnote-backref" href="#fnref:5" rev="footnote" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:6">
<p>For more on the moral vs. economic mindset, see Hut 8’s own Dan Milstein
talk about post-mortems, axe murderers and the stupidity of our future selves,
available as <a href="https://www.youtube.com/watch?v=78qzrXIPn5Q">video</a> or
<a href="http://www.slideshare.net/danmil30/how-to-run-a-5-whys-with-humans-not-robots">slides</a>. <a class="footnote-backref" href="#fnref:6" rev="footnote" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:7">
<p>In particular, it’s tempting but inadvisable to take it upon yourself to
translate this moral viewpoint into an economic one. You’d essentially just be
assigning an infinite cost to downtime, which leaves you just as lost as the
false Platonic ideal of “all alerts must indicate an actual problem” did when
it implicitly assigned an infinite cost to wasted human attention. <a class="footnote-backref" href="#fnref:7" rev="footnote" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:8">
<p>There’s a corollary to this, too: often when you join an organization you
will initially experience its alert volume as horrifically out of control, and
grow to understand it as you become better acquainted with both the workings
and the economics of the systems you’re monitoring. <a class="footnote-backref" href="#fnref:8" rev="footnote" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
</ol>
</div>
<h1>Speeding Up Your Engineering Org, Part I: Beyond the Cost Center Mentality</h1>
<p><em>Edmund Jorgensen, 2014-04-17</em></p>
<p>It is a truth universally acknowledged, that engineering orgs—like
greyhounds, sports cars, and wide receivers—slow down as they age.</p>
<p>Odds are good that you have experienced this phenomenon personally at
some point in your engineering career. The slowdown was gradual,
frustrating, and oddly stubborn. It survived: numerous rounds of
hiring; a spate of offsites where inspiring speakers harangued
everyone to “cut through the crap” and just “get shit done”; a
blood-spattered re-org or two; and even a few ground-up rewrites that
utterly failed to deliver on their promised boost in velocity.</p>
<p>If you’re now involved with engineering leadership in some capacity,
you may well have accepted the slowdown as a sad universal truth.
Accordingly, you may have shifted your efforts from the impossible
task of making the org go faster to the thankless but crucial job of
jealously guarding how engineers spend their time—because as it takes
longer and longer to get even simple features out the door, those
engineering hours become increasingly precious.</p>
<p>If all this sounds familiar, I have good news and bad news for you.</p>
<p>The good news: it isn’t actually a law of nature that engineering orgs
have to slow down as they mature and grow. With active, contravening
investment, it’s possible to maintain and even gain speed.</p>
<p><span class="dquo">“</span>But,” you protest, “I’ve <em>made</em> investments, remember? I’ve hired!
I’ve brought in speakers! I’ve re-orged and re-factored and tried out
every flavor of agile there is, and still we go slower and slower!”</p>
<p>Yes, which brings us to the bad news: that slowdown is a far bigger
deal than you might have realized, and way more harmful to the bottom
line of your business than you might imagine. Oh, and that jealous
guarding of engineer hours for features? It’s only making things worse.</p>
<p>In this article I’m going to consider the speed of an engineering org
as an economic question—not a moral question, or a question of
technology choices, or a question of people “hustling” and “powering
through” the obstacles they find in their path. I believe that a good
percentage of engineering and business leaders economically model
their engineering org—consciously or unconsciously—as a “cost
center,” where every engineer hour not spent on features must
translate to (at least) one engineer hour saved, and I believe that
this economic model makes it extremely difficult to identify and
justify the investments that could actually speed that org up. I’ll
propose an alternate economic model of an engineering org—one in
which speed to delivery, rather than number of engineer hours paid, is
the dominant economic factor—and in which considerable, sustained
investment in that speed can reap massive economic returns.</p>
<p>But let’s get a little more concrete with this—let’s look at an
example of the kinds of decisions that face engineering orgs and their
leaders every day, and just how easy it is to slip into the “cost
center” mentality when attempting to juggle them.</p>
<h2>A Tale of Two Engineers</h2>
<p>Say you’re an engineering manager at Company X, and one morning you
arrive at work to find two of your best engineers waiting outside your
office. You haven’t even opened your door before they start in on you.</p>
<p><span class="dquo">“</span>Look,” says Cindy, the first engineer, “I know that the <span class="caps">CEO</span> is
breathing down our neck to finish the new Facebook for Cats
integration, but we’ve got to clear some time to work on automating
database migrations. I’m the only one who knows enough to apply them
to the prod <span class="caps">DB</span>, and I’m getting tired of spending half an hour every
morning rolling out everyone else’s changes. So can we push a feature
or two back and squeeze that in?”</p>
<p><span class="dquo">“</span>Forget the migrations,” says Scott, the second engineer, “we need to
talk about the Frobulator Service. Two years ago we agreed to hack it
up quickly in <span class="caps">PHP</span>, but product promised us—<span class="caps">PROMISED</span>—that we would
have time to go back and clean it up. Yesterday I happened to be back
in that code while I was updating the copyright years in our headers,
and it’s even worse than I remembered. We need to rewrite it in Scala
so it’s more modern, performant, and easier to maintain. Can you tell
product we’re calling in that promise, please, and I’ll get started?”</p>
<p>First off: everything your engineers have said is true. Cindy really
is spending a half hour every morning dealing with database
migrations; the source for the Frobulator Service really does look
like a plate of partially digested capellini; product really did
promise time to clean that mess up; and of course there really is a
long and growing backlog of features for the upcoming Facebook for
Cats integration, each of them (according to the <span class="caps">CEO</span> and product)
absolutely essential and destined to become a customer favorite.</p>
<p>Furthermore, you’ve been around long enough to know that there won’t
be any “calm periods” when there’s time for your engineers to scratch
these other itches—after the Facebook for Cats integration goes out,
you’ll be right on to integrating with Twitter for Dogs, or LinkedIn
for Ferrets. So on this fine morning someone has to make a real and
uncomfortable decision: either tell Cindy and Scott to stop
complaining and get back to feature work, or let product and the <span class="caps">CEO</span>
know that you’re going to spend some engineering hours on something
other than features. And today that someone is you.</p>
<p>Pop quiz, hot shot: what do you do?</p>
<p><span class="caps">WHAT</span> <span class="caps">DO</span> <span class="caps">YOU</span> <span class="caps">DO</span>?</p>
<h2>A Simple, Responsible, and Totally Wrong Approach</h2>
<p>If you’re a mature, business-focused engineering leader, you might
grab some coffee, sit Cindy and Scott down, and tell them something
like this:</p>
<p><span class="dquo">“</span>Cindy, I’m sorry to hear that you’re getting bored doing so much
production <span class="caps">DB</span> work, but realistically it would take you at least 40
hours of work to write, test, and deploy a migration utility, right?
So if you’re spending a half hour a day on migrations, it would be 80
working days before we saw a return on our investment—that’s like 4
months, and that’s just too long for me to sanction—precisely because
you’re such a valuable member of the team, and I can’t spare so much
of your time away from our feature backlog right now. We can touch
base if the migration workload increases too much, <span class="caps">OK</span>? Until then, I
have to ask you to put your head down and be a team player.</p>
<p><span class="dquo">“</span>Scott, you’re absolutely right, product did promise that we could
spend time cleaning up the Frobulator Service, and I’m sure they were
acting in good faith, but none of us could have possibly known at the
time how our product was going to take off—we’ve got customers
practically beating down our door for new features, and they’re not
going to see any difference whether the Frobulator Service is written
in crappy <span class="caps">PHP</span> or transcendent Scala.</p>
<p><span class="dquo">“</span>Both of you are great engineers with bright futures, and if those
futures include engineering management, then part of your job will be
to understand that engineering’s job is to produce effects that are
visible to customers. So if we burn hours on projects that aren’t
customer visible—projects that are by engineers, for engineers—we
need to be able to show directly how those hours will pay for
themselves in <em>saved</em> engineering hours in pretty short order.”</p>
<p>This approach feels rational, responsible, and easy to apply, right?
There’s only one small problem: by slipping into the “cost center”
mentality, where engineering hours must only be spent on features or a
greater savings in engineering hours, you’ve actually just slowed your
engineering org down further, and cost your company real (though
largely invisible) money in the process. How did this happen without
our even noticing, while we thought we were being so responsible?</p>
<h2><span class="dquo">“</span>Engineer Hours” vs. Latency—Where the “Cost Center” Gets it Wrong</h2>
<p>The cost center model of engineering, to which our hypothetical
engineering leader has just retreated, is basically this: an
engineering org is a furnace which burns money, in the form of
compensated engineer hours, and produces features. Therefore if org A
can produce the same feature at half the cost of org B, then org A is
twice as good as org B! And if spending 1 engineer hour on some task
today will save you 100 engineer hours in the next few weeks, then you
have just improved your org’s economics by 99 of those expensive
engineer hours!</p>
<p>The fundamental and deadly flaw in this model is that it does not
account economically for the speed of work through the engineering
org—or what I’ll refer to from here on out as “latency”—the
wall-clock hours, not paid engineer hours, that it takes the
engineering org to turn some concept into reality. In other words, we
can’t simply think of an engineering org as “an engine that produces
thing X at cost Y.” We have to model it as “an engine that produces
thing X at cost Y <em>with latency Z</em>,” and recognize that “latency Z”
itself can and should be translated into some cost / value structure.</p>
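<p>Here is a toy version of the two models in Python. The hourly rate and the
cost-of-delay figure are invented; the only point is that once latency carries
a price, two projects with identical engineer-hour costs can have wildly
different economics.</p>
<pre><code># Two ways to score the same project (all figures invented for illustration).

ENGINEER_HOURLY_COST = 100        # dollars per paid engineer hour
COST_OF_DELAY_PER_WEEK = 25_000   # value lost per week the work isn't shipped

def cost_center_score(engineer_hours):
    # "An engine that produces thing X at cost Y."
    return engineer_hours * ENGINEER_HOURLY_COST

def latency_aware_score(engineer_hours, latency_weeks):
    # "An engine that produces thing X at cost Y *with latency Z*."
    return (engineer_hours * ENGINEER_HOURLY_COST
            + latency_weeks * COST_OF_DELAY_PER_WEEK)

# Same 200 paid hours either way; only the wall-clock time differs.
quick  = latency_aware_score(200, latency_weeks=1)   # $20,000 + $25,000  = $45,000
queued = latency_aware_score(200, latency_weeks=8)   # $20,000 + $200,000 = $220,000
</code></pre>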
<p>This is not to say that engineering leaders who employ this cost
center model don’t care or think about latency. To the contrary, they
often talk about it quite a bit, exhorting their teams to feel a
“sense of urgency” and to exhibit a “just git ‘er done” attitude—but
they treat latency as a moral or personal question—a matter of
character or work ethic—rather than something that is, at its heart,
organizational and economic.</p>
<p>It’s human nature to experience paid engineer hours as <em>expensive</em> and
latency as <em>annoying</em>, because the costs of latency tend to be
invisible—they usually take the form of lost opportunities or
earnings, many of which, once you miss them, you never even know
existed—rather than real, painful checks that you have to cut each
month for payroll.</p>
<p>Consider an analogue: the rent your business pays on an office
building. If you found a building that was only half the rent, you
might well be tempted to move and count that as a huge savings—but
that’s rarely the whole economic story. Is the new building farther
away from where the bulk of your employees live? Does it lack the
public transit options of the more expensive building? How’s the
light? What’s the layout like? All of these factors can affect the
amount of time your employees spend in the office, the amount and
quality of work they get done there, and even the kind of people who
want to work at your company in the first place—and if the cheaper
building leads to a drop in productivity, or to worse hires, then that
“savings” on rent might turn out to be very expensive indeed to your
business’s bottom line, even though—and here’s the horrific
part—that connection will probably never show up on your company’s
balance sheet. It’s not hard to imagine the employee who found the
cheaper building being rewarded with a fat bonus in the same cycle
that a bunch of other employees are dinged for a stagnant product,
increasing bug count, and flagging sales—even if all those problems
were caused, to some extent, by the change in location.</p>
<p>One method to expose some of these invisible economic effects is to
take them to an absurd extreme. For example, if your business is
currently paying a half million in rent a year for a Boston office,
with a workforce who lives in nearby suburbs, it’s clearly not a smart
economic decision to move to a snow-cave in Juneau, Alaska—even if
it’s wired for Ethernet and your annual rent would drop to $1. We’ve
managed to magnify the invisible costs to a size where they can’t be
easily ignored.</p>
<p>So let’s employ the same technique—reduction to some absurd
extreme—in a thought experiment designed to demonstrate how the
latency of your engineering org is almost certainly its dominant
economic factor—much, much larger than the piddling six-figure
salaries you’re paying the engineers it comprises.</p>
<h2>The Thought Experiment</h2>
<p>Role change: you’re no longer an engineering leader overseeing
Facebook for Cats integration. Now you’re the <span class="caps">CEO</span> of a company that
makes its money through big, enterprise contracts. A potential
customer you’ve been after for a while is entertaining bids on a
project, and will consider proposals—which are expected to include a
working proof of concept—in one month.</p>
<p>You aren’t the only company trying to land this contract—there are
lots of smart competitors. And, by the way, you’re not allowed to
deliver early, even if you finish the proof of concept early—all
proposals will be considered on the same day, one month from now.</p>
<p>As <span class="caps">CEO</span> you have two engineering teams available to you.</p>
<p>The first team is a group of good, steady developers, who correctly
estimate that the proof of concept will take exactly one month for
them to build (of course they can’t possibly know this, but that’s a
story for <a href="http://blog.hut8labs.com/coding-fast-and-slow.html">another
article</a> and here
we’ll just pretend they can, because we’re in a thought experiment and
we can do whatever we want). Over this month of development, this
team will cost the business $100,000 in salary and other compensation.</p>
<p>The second team, on the other hand, is a group of freelancers who are
amazingly, inhumanly fast: they can produce the same proof of concept,
at the same level of quality, in just <em>one second</em>. Before you get
too excited thinking about all the money you’re going to save with
this team, however, you should know this: for that one second of work,
these freelancers will be invoicing you dearly—to the tune of $100,000.</p>
<p>Recapping your options, you have:</p>
<ul>
<li>
<p>the normal team, which will take a month to produce the proof of
concept for a total cost of $100,000</p>
</li>
<li>
<p>the insanely fast team, which will take a second to produce the
proof of concept for a total cost of $100,000</p>
</li>
</ul>
<p>The costs of the proof of concept are equivalent with either team, as
is the quality of the product—only the latency differs. Obviously if
you could deliver the proposal as soon as the proof of concept was
done, you’d choose the insanely fast team every time. But that would
be too easy, so in our thought experiment—where you’re not allowed to
deliver the proposal early—does the latency even matter?</p>
<p>There’s only one scenario to consider with the normal team—they have
to start working today, and they’ll finish just in time for the
presentation. Start them even a day late, and they won’t finish.</p>
<p>With the insanely fast team, on the other hand, you have on the order
of 2,592,000 scenarios to consider, as they could start and finish at
any second in the entire month. But are any of these scenarios valuable?</p>
<p>Let’s take a look at a couple of these possibilities.</p>
<h3>The Need for Speed</h3>
<p>One obvious approach with the insanely fast team would be to produce
the proof of concept immediately, in the very first second. Does that
buy you anything? You can’t deliver the proof of concept early, but
now that it exists, there are a couple things you could do with it.</p>
<p>For example, you could show it around and get a reaction—internally,
if your business has some good proxies for your customer’s needs, or
to some of the customer’s people “on the ground” (not the Big Important People
you’ll be pitching at the end of the month, just regular workers).
Then you can take their feedback and do any of the following:</p>
<ul>
<li>
<p>Iterate: Have the insanely fast team produce a second, improved
version of the proof of concept—you’ll have to pay them another
$100,000, but you’ll have good information about whether that’s
worth it or not. You can repeat this process as many times as you
like or can afford, and go into the demo having iterated through N
versions to your competition’s one.</p>
</li>
<li>
<p>Abandon: If the feedback you get is “this is crap, and the only ways
to make it good enough are too difficult or expensive to consider,”
then you can abandon the contract and move on to try to sell
something different to a different customer—or something different
to the same customer! Meanwhile, your competition is sweating away
trying to produce their own proofs of concept—squandering precious
time and attention on a contest you already know isn’t worth winning.</p>
</li>
<li>
<p>Sell to Someone Else: By the rules you can’t deliver your proof of
concept early to the one potential customer, but nothing says you
can’t go out and try to sell it to a different one, or a different
six. By the time proposal day arrives, you’re already a month ahead
of your competition in other markets, and you might even have a nice
story to tell about how your customer’s competition has already
bought your version—and they’d better too, if they don’t want to
fall behind.</p>
</li>
</ul>
<p>So yeah, you could definitely say there’s some value to being able to
finish the proof of concept in a second. That insanely fast team is
starting to look pretty good right about now.</p>
<p>But wait…there’s more!</p>
<h3>The Genius of Procrastination</h3>
<p>What if you went to the other extreme, and waited as long as you could
to produce the proof of concept, until the last possible
second—literally while you’re walking down the hallway to make your
presentation? Does that give you any interesting advantages?</p>
<p>One possibility that leaps to mind: given that your development is so
expensive, you could do some cheaper exploration before you committed
to a proof of concept. For example, you could send some PMs to shadow
the customers, research companies that had tried similar approaches, etc.</p>
<p>By the time you commit to spending $100,000 on the proof of concept,
you can have much better information about what it should do and what
it shouldn’t. Maybe it turns out to be so difficult that you decide
not to build it at all. Or maybe, with the insanely fast team at your
back, an offhand remark as the customer is walking you to the
presentation room prompts a quick phone call and a development cycle,
allowing you to produce a last-second revision that totally changes
the game.</p>
<p>In essence, by waiting until the last second to produce your proof of
concept, you have the chance to be roughly 29 days, 23 hours, 59
minutes and 59 seconds better informed than your competition (the
actual amount of time will depend on the particular month, whether
it’s a leap year, etc., which is left as an exercise for the reader).</p>
<h3>Mix and Match</h3>
<p>But the real power of the insanely fast team comes when you mix and
match all the techniques above.</p>
<p>Step 1: Do cheap research until you have an idea of what to build.</p>
<p>Step 2: Build it instantly and loop back to Step 1, until you decide
another iteration isn’t worth $100,000 (either because the proof of
concept is now good enough, or because you’ve decided to scrap the project).</p>
<p>Step 3: Profit!</p>
<h3>Finish Early, Start Late</h3>
<p>What the insanely fast team gives you, in other words, is the ability
to finish early or start late. In an environment where uncertainty
rules and information is value—like software development—that allows
for tremendously valuable information gain, because what you finish
early tends to generate information, and what you start late tends to
benefit from newly available information<sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup>. The poor old regular
engineering team, on the other hand, has to start <em>early</em> and finish
<em>late</em> just in order to get the work done by the deadline. Their
labor can neither generate extra information nor benefit from it as it
becomes available.</p>
<h3>So Which Team Do You Want, Mr. <span class="caps">CEO</span>?</h3>
<p>By now it should be clear: although the two teams <em>cost</em> the same, and
produce the same quality output, you would be crazy not to choose the
insanely fast team and their drastically reduced latency<sup id="fnref:2"><a class="footnote-ref" href="#fn:2" rel="footnote">2</a></sup>. In
fact, you’d be crazy not to pay a <em>steep premium</em>, well beyond the
normal team’s salaries, to use the insanely fast team, or even to keep
them inactive but on retainer.</p>
<p>This is so important it’s worth calling out: if you’re any kind of
rational, you would pay a tremendous amount of extra money to use the
insanely fast team, which means that <em>a reduction in latency equals
money</em>. Real, actual money—and usually a lot of it. In our thought
experiment, for example, a smart <span class="caps">CEO</span> would gladly pay $1,000,000 to
use the insanely fast team instead of the regular team if it meant a
massively increased chance at a $15,000,000 project. A smart <span class="caps">CEO</span>
would see that not as “spending” money—but as <em>investing</em> it—putting
money out into the world in the reasonable expectation of having that
money return, now increased by some multiple.</p>
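<p>To put rough numbers on that (the probabilities here are invented, as
always): say the regular team gives you a 20% shot at the $15,000,000 contract,
and the insanely fast team, with all its iterating and information gathering,
raises that to 50%.</p>
<pre><code># Expected value of paying extra for the fast team (probabilities invented).

CONTRACT_VALUE = 15_000_000
FAST_TEAM_PREMIUM = 1_000_000

ev_regular = 0.20 * CONTRACT_VALUE                        # $3,000,000 expected
ev_fast    = 0.50 * CONTRACT_VALUE - FAST_TEAM_PREMIUM    # $7,500,000 - $1,000,000

print(ev_fast - ev_regular)   # the premium buys roughly $3.5M of expected value
</code></pre>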
<p>Once you start thinking of engineering dollars as investment rather
than cost, the fallacies of the “cost center” model become glaringly
obvious. The equation behind your org isn’t “engineer hours paid for
features or saved engineering hours”—it’s “money invested in the
expectation of more money.” Often the money invested is in the form
of paid engineer hours, but sometimes it’s new machines, or better
chairs, or office space for a remote contingent, and so on. And
sometimes the “more money” you expect in return comes from features
for which customers will pay, but often (as in our thought experiment)
it comes in the form of valuable information, or—if you’re doing it
right—a reduction in (or prevention of) latency for future work,
which, as we’ve just shown with our thought experiment, is <em>worth
actual money</em>.</p>
<h2>Sitting Down Again with Cindy and Scott</h2>
<p>Let’s rewind back to that coffee with Cindy and Scott, where you as
engineering leader were explaining to them all about how engineer
hours could only be spent on features or efforts that would cut future
engineer hours. With the clearer economic picture in mind, this
argument no longer seems so simple and rational.</p>
<p>Cindy wanted time to work on <span class="caps">DB</span> deploy scripts, since she was the only
one who could reliably get changes out to the production <span class="caps">DB</span> and was
spending a chunk of her mornings doing so. At the time, what we heard
behind her lament was “I’m getting bored doing the job you’re paying
me to do and I need to be gently cat-herded to keep doing it”—but
what we should have heard was “<span class="caps">DANGER</span>, <span class="caps">WILL</span> <span class="caps">ROBINSON</span>—a queue is
forming in your engineering org.”</p>
<p>Cindy has become a bottleneck for changes making their way to
production, and a queue of people trying to make those changes is
forming behind her. Queues are one of the clearest signals of
developing latency. What happens if Cindy is out for a few days on
(gasp) vacation? No changes will go out. What happens if she becomes
overloaded with other matters, and—without telling you—starts
applying <span class="caps">DB</span> migrations only once a week, to “batch things up” and “be
more efficient” with her time? Your latency has just skyrocketed
invisibly—and the fact that this is possible should terrify you as an
engineering leader. Cindy’s complaint is a warning of latency to
come, and you need to nip that in the bud with extreme prejudice. You
should probably allow Cindy to do her migration project—and you
should <em>definitely</em> explain to her <em>why</em> you’re allowing it.</p>
<p>As for Scott, who wanted to rewrite the Frobulator Service from
horrific <span class="caps">PHP</span> to stunning Scala because product had promised the time
to clean it up: the “promise” from product is clearly economically
irrelevant, and big rewrites tend to be a <a href="http://onstartups.com/tabid/3339/bid/97052/How-To-Survive-a-Ground-Up-Rewrite-Without-Losing-Your-Sanity.aspx">terrible
investment</a>,
so you probably shouldn’t say yes to Scott’s exact request—but you
still have some digging to do here to figure out whether this (almost
certainly misguided) desire to rewrite is just a blue-sky engineering
itch, or a signal that the Frobulator Service is creating latency.</p>
<p>First of all, Scott was only in that code to “update copyright
years”—he wasn’t making functional changes, and apparently hadn’t
made any in at least a year. Is this a clue that the Frobulator
Service doesn’t see that much coding activity? Worth digging into,
because if engineers aren’t touching the Frobulator Service because
it’s frobulating<sup id="fnref:3"><a class="footnote-ref" href="#fn:3" rel="footnote">3</a></sup> just fine and there aren’t really any changes to
make, that’s great—the code might read like Cthulhu’s diary, but it’s
not affecting your latency and can be left as is for the moment. If,
on the other hand, there are tons of changes that <em>should</em> go into the
Frobulator Service, but which are finding their way into compensatory
hacks throughout the rest of the codebase instead—because engineers
are terrified to touch the Frobulator Service code—then you’ve got a
brewing latency problem that you need to expose and deal with, because
those hacks are probably already slowing you down, and the situation
is only going to get worse. Almost certainly you still don’t want to
commission a full-on rewrite, but a steady, incremental investment in
testing, monitoring, and refactoring the Frobulator Service might be indicated.</p>
<h3>Takeaways from Cindy and Scott</h3>
<p>One of the deadliest things about latency is that often the slowdown
of even a single piece of your org can introduce it, while making
things faster generally requires steady work on a lot of fronts.
That’s an imbalance that’s not in your favor. Add to this the
certainty that latency is developing in your organization at every
moment—that is the nature of organizations—and that it is often
invisible to you (or any single individual)—and that, as we saw in
our thought experiment, latency is tremendously expensive—and the
response that’s indicated from you, the engineering leader, is a calm
but constant terror.</p>
<p>Your job is to translate that terror into a form of shared vigilance:
listen carefully to your engineers, dig into the problems they bring
you, and ensure that every one of them understands the cost of latency
and is on the lookout for it, making micro speed-ups everywhere they
see the opportunity and surfacing brewing slowdowns.</p>
<p>In other words, make latency something your whole team seeks, hates,
and destroys.</p>
<h2>How to Invest in Latency Reduction</h2>
<p><span class="dquo">“</span>All right,” you say, “I’m convinced—latency is a bigger deal than I
thought before, <em>and</em> something I can improve—in theory. But how do
I do it in practice? I’ve made all those investments that didn’t help
at all—how do I know that if I invest in something, it will actually
improve my latency?”</p>
<p>Some of this also comes down to <em>how much</em> you invest, but we’ll leave
that until Part <span class="caps">II</span>, and here just discuss <em>what</em> you can look to
invest in.</p>
<p>Here are a few places you can start.</p>
<h3>Activities Engineers Bitch About</h3>
<p>Engineers tend to experience latency centers as painful or “busywork.”
For example, do your engineers play “Rock Paper Scissors” to determine
who has to spin up a new server? Does the loser go off cursing his
luck and the world? Do your engineers go to absurd lengths to pack
new services onto old machines, even when a new server would be the
natural solution to the problem? Then take a look at what it requires
to spin up a new server, and whether you can make an investment to
make it less painful—you’ll likely effect a drop in latency.</p>
<h3>Things Only Cindy Can Do</h3>
<p>We saw an example of this with Cindy, who was the only engineer who
knew enough about the prod <span class="caps">DB</span> to get migrations out. If only person X
can do thing Y in your organization, you’ve created a bottleneck, and
bottlenecks lead to latency. Cross-train or create tools to terminate
these bottlenecks with extreme prejudice<sup id="fnref:4"><a class="footnote-ref" href="#fn:4" rel="footnote">4</a></sup>.</p>
<h3>Look for Queues</h3>
<p>Queues are a manifestation of latency, and once you can see them, you
can attack them. Find them where they’re visible—ticketing systems
and so on—and try to make them visible where they’re not, using
techniques like a Kanban board.</p>
<h3>Automated Tests</h3>
<p>Good automated tests reduce latency, because they help you make
changes more quickly and confidently<sup id="fnref:5"><a class="footnote-ref" href="#fn:5" rel="footnote">5</a></sup>.</p>
<h3>Monitoring</h3>
<p>Good monitors reduce latency, because they allow you to release more
frequently, confident in the knowledge that, if something goes wrong,
you’ll find out immediately<sup id="fnref:6"><a class="footnote-ref" href="#fn:6" rel="footnote">6</a></sup>.</p>
<h3>Post Mortems</h3>
<p>A <a href="http://www.slideshare.net/danmil30/how-to-run-a-postmortem-with-humans-not-robots-velocity-2013">good post
mortem</a>
is a great opportunity to let reality point you towards improvements
that not only make your systems safer, but reduce your latency as
well. Do them!</p>
<h3>Decentralization with Safety Nets / Impact Reduction Schemes</h3>
<p>Organizations often insist that high-impact changes to products or
systems pass through multiple steps of centralized review for
correctness, which can become a source of dramatic latency—sometimes
on the order of weeks or months. Usually these controls exist for a
reason, because the mistakes they attempt to prevent are expensive.</p>
<p>You can attack such a situation in two ways: either by making it
harder to break things in the first place (often more difficult and
expensive), or by changing the game so that breaking things isn’t as
big a deal (often cheaper and easier). For example, if engineers can
deploy potentially high-impact changes at will to a small percentage
of traffic, or to a known beta-tolerant population, or to internal
users, then the downside of breaking changes is capped, and that capped
risk is often eminently worth the decreased latency you enjoy.</p>
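<p>As one hedged illustration of the “cheaper and easier” route, here is
roughly what a percentage-based rollout gate looks like in Python. The hashing
scheme and every name in it are hypothetical (real feature-flag systems differ
in the details), but the shape is the same: cap the blast radius, then widen it
as monitoring stays quiet.</p>
<pre><code># Sketch of a percentage-based rollout gate (hypothetical; real systems differ).

import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically place a user into the first `percent` buckets of traffic."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100   # stable bucket, 0 through 99
    return percent > bucket              # e.g. percent=5 admits ~5% of users

# A risky Frobulator change goes to 5% of traffic first; if monitoring stays
# quiet, widen the percentage -- no centralized review board needed per step.
print(in_rollout(user_id="user-42", feature="frobulator-v2", percent=5))
</code></pre>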
<h3>And Many, Many More</h3>
<p>We’ve only scratched the surface here: tools for operators,
intelligent development tools, even crazy things like DSLs for demo or
test data creation can all reduce your latency. Once you start
looking specifically for projects that reduce latency, you will see
opportunities everywhere.</p>
<h2>How Not to Invest in Latency Reduction: <span class="caps">REWRITE</span> <span class="caps">ALL</span> <span class="caps">THE</span> <span class="caps">THINGS</span></h2>
<p>The “rewrite reflex” exhibited by Scott is, unfortunately, a real and
dangerous tendency that almost all engineers have to some extent (I
myself struggle with it daily): the fanatical belief that, if a system
were rewritten to framework X or language Y, development would proceed
much more quickly. Generally this doesn’t pan out, both because of
the astounding (and routinely underestimated) cost of the rewrite and
because the causes of latency introduced in real-world
engineering are rarely addressed more directly by languages and
frameworks than by operational and organizational changes<sup id="fnref:7"><a class="footnote-ref" href="#fn:7" rel="footnote">7</a></sup>. The
latency caused by having to write three ugly lines in one language
rather than one pretty line in another tends to pale in comparison
with delays in deploys, finding and fixing bugs that tests could have
caught, etc. (note: I’m not arguing that there is no difference in
language productivity, and no point to choosing a language for a new
venture carefully, just that for a working system the gain is usually
dwarfed by the rewrite cost and other, lower hanging fruit).</p>
<h2>Incrementalism <span class="caps">FTW</span></h2>
<p>Maybe it’s a “one ring to rule them all” deployment system, or a
templating system to speed up writing your views, or a monitoring
framework to end all monitoring frameworks—whatever it is, if you
think it will reduce latency, and it’s a big project, you should
probably try breaking it into smaller increments, each of which
reduces <em>some</em> latency, and release those independently, as each is ready.</p>
<p>Most engineers will hate to hear this. They’ve already “seen” the
full system in their head, and now want to bang it out in a couple
caffeine-fueled weeks. Typically if you object and request smaller
increments, they will point out that, broken up into discrete
releases, the job will require more hours overall, and therefore
represent an inefficiency. They’re generally right, of course, that
you will spend more engineer hours by delivering in
increments—they’re just wrong about the economic consequences.</p>
<p>You should insist on smaller, incremental latency improvements, not
just because of all the normal, eminently true reasons that big
increments are bad (everything that makes waterfall a bad idea applies
here too), but because <em>latency reduction improves the same channels
by which you deliver future latency reduction</em>. That is, since
latency reduction efforts generally come in the form of new software
or processes, and what they’re reducing is the latency of delivering
new software or processes, finished latency reduction efforts tend to
speed up future latency reduction efforts.</p>
<p>Latency reduction is therefore a form of <em>compound interest</em>, which
Einstein himself called “the most powerful force in the universe<sup id="fnref:8"><a class="footnote-ref" href="#fn:8" rel="footnote">8</a></sup>.”
Latency reduction works just like your retirement account—steady,
incremental investments generate more value than infrequent, bigger
investments, because you earn interest on your interest—so you want
the money in the account as soon as it becomes available. When you
break a big, massively valuable latency reducing project into numerous
smaller (but still latency reducing) projects, some of which can be
delivered earlier, the one-time premium you pay in extra engineering
hours is nearly always a rounding error compared to the benefit of
compounded latency reduction you enjoy <em>forever</em>.</p>
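<p>A toy model of that compounding, in Python with invented numbers: assume
each shipped improvement shaves 10% off every future release cycle, and compare
shipping five improvements one at a time against saving them all up for one big
release.</p>
<pre><code># Toy model of compounding latency reduction (numbers invented for illustration).
# Each shipped improvement cuts the length of every *future* release cycle by 10%.

BASE_CYCLE_WEEKS = 4.0
IMPROVEMENTS = 5

def incremental_schedule():
    """Ship one improvement per cycle; each one speeds up the cycles after it."""
    elapsed, cycle = 0.0, BASE_CYCLE_WEEKS
    for _ in range(IMPROVEMENTS):
        elapsed += cycle   # build and release one improvement
        cycle *= 0.9       # every future cycle now runs 10% faster
    return elapsed

def big_bang_schedule():
    """Build all five improvements first, then ship them in one release."""
    return IMPROVEMENTS * BASE_CYCLE_WEEKS   # nothing compounds until the very end

print(incremental_schedule())   # about 16.4 weeks to land all five improvements
print(big_bang_schedule())      # 20.0 weeks, with no speedups enjoyed along the way
</code></pre>
<p>The incremental path gets all five improvements live weeks sooner, and it was
already enjoying the speedups along the way instead of waiting for one payoff at
the very end.</p>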
<h2>So Much for the Easy Part</h2>
<p>All right, we’ve skirted the hard part long enough. At this point we
understand some of the costs of latency. We’ve sounded out whether
projects like those that Cindy and Scott want to undertake will
actually reduce latency, talked about some other projects that are
good candidates for reducing latency, and understand how to generate
the maximum overall value by attacking them in valuable increments.
But there’s still the small matter of that endless stream of
features—how do we compare the relative value of a feature and a
project to reduce latency for the delivery of future features, and
prioritize appropriately? How do we know how much time to spend on
latency reduction vs. features? And—more difficult still—how do we
convince the <span class="caps">CEO</span> and other Important People in the business, who are
the ones asking for those features and signing our checks, that they
should allow us to carve out that time to work on latency reduction?</p>
<p>Tune in as we tackle that in the upcoming Part <span class="caps">II</span>: Selling the Big Boss.</p>
<div class="footnote">
<hr />
<ol>
<li id="fn:1">
<p>For more on this, see Reinertsen’s <a href="http://www.amazon.com/gp/product/B007TKU0O0/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=B007TKU0O0&linkCode=am2&tag=hu8labl08-20"><em>Principles of Product
Development
Flow</em></a>—yup,
it wouldn’t be a Hut 8 Labs Blog without a mention of that
classic—but seriously, all joking aside, just go read it now. <a class="footnote-backref" href="#fnref:1" rev="footnote" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>Oh, and just in case you’re thinking “sure, if I could reduce my
latency to a second, that would be one thing, but that’s crazy and
extreme and impossible”: an engineering organization that managed to
go from shipping simple improvements and bugfixes only with quarterly
releases to being able to ship them in an hour (a realistic
improvement that many organizations have already accomplished) would
be seeing about a <em>2000X</em> reduction in latency for those improvements
and bugfixes, and that’s not even the upper bound—better testing,
monitoring, and other investment can also drastically speed up what an
engineer can reliably get done in that hour. <a class="footnote-backref" href="#fnref:2" rev="footnote" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>What? It’s a perfectly cromulent word. <a class="footnote-backref" href="#fnref:3" rev="footnote" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:4">
<p>Note: sometimes Cindy defines her value as “being the only
person who can do X.” Helping her redefine her value more broadly is
a key part of the leadership function, but a topic for a different
article. <a class="footnote-backref" href="#fnref:4" rev="footnote" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:5">
<p>Bad tests can actually increase latency, because they
over-specify implementation without adding any safety—but that’s a
topic for a different article. <a class="footnote-backref" href="#fnref:5" rev="footnote" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:6">
<p>Bad monitors can actually increase latency, because they
overwhelm and desensitize the people looking at them with irrelevant
information or over-zealous alerting. But that’s—you guessed it—a
topic for a different article. <a class="footnote-backref" href="#fnref:6" rev="footnote" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:7">
<p>And because frameworks are a form of moderate evil that is on
occasion the lesser of two evils—but we’ll leave that for another
time. <a class="footnote-backref" href="#fnref:7" rev="footnote" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:8">
<p>Or fine, <a href="http://www.snopes.com/quotes/einstein/interest.asp">maybe he
didn’t</a>. But
look, whether Einstein said it or not, compound interest is pretty
damn powerful, <span class="caps">OK</span>? <a class="footnote-backref" href="#fnref:8" rev="footnote" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
</ol>
</div>
<h1>No Deadlines For You! Software Dev Without Estimates, Specs or Other Lies</h1>
<p><em>Dan Milstein, 2013-09-23</em></p>
<p>In <a href="http://blog.hut8labs.com/coding-fast-and-slow.html">Coding, Fast and Slow</a>, I talked about one of the deepest challenges involved in writing software: the near-total inability of developers to predict how long a project will take.</p>
<p>Fortunately, as that post mentioned, I believe there is a way to work, where the software you write ends up being valuable, and the business people you work with end up being happy. And, critically, this way of working does <em>not</em> involve committing to estimates of how long work will take (which is good, because, personally, I suck beyond all belief at such estimates… even for work which I initially believe will take no longer than a single day).</p>
<p>In a lot of ways, this is The Most Important Thing I’ve learned in my (let’s just say many) years of being paid to write software for people.</p>
<p>The core idea is: put uncertainty and risk at the center of a conversation between the developers and the rest of the business (instead of everyone pretending such nasty things don’t exist). Doing so allows the entire business to tackle those genuine challenges <em>together</em>.</p>
<p>To show what such a conversation might look like, I’m going to develop this approach in detail, in the context of a story.</p>
<h3>Welcome To &lt;Company X&gt;, Here’s Your Spec</h3>
<p>Let’s say you’ve started at a new job, leading a small team of engineers. On your first day, an Important Person comes by your desk. After some welcome-to-the-business chit chat, he/she hands you a spec. You look it over—it describes a new report to add to the company’s product. Of course, like all specs, it’s pretty vague, and, worse, it uses some jargon you’ve heard around the office, but haven’t quite figured out yet.</p>
<p>You look up from the spec to discover that the Important Person is staring at you expectantly: “So, &lt;Your Name&gt;, do you think you and your team can get that done in 3 months?”</p>
<p>What do you do?</p>
<p>Here are some possible approaches (all of which I’ve tried… and none of which has ever worked out well):</p>
<ul>
<li>Immediately try to flesh out the spec in more detail</li>
</ul>
<p><span class="dquo">“</span>How are we summing up this number? Is this piece of data required? What does <jargon word> mean, here, exactly?”</p>
<ul>
<li>Stall, and take the spec to your new team</li>
</ul>
<p><span class="dquo">“</span>Hmm. Hmm. Hmmmmmmmm. Do you think, um, Bob (that’s his name, right?) has the best handle on these kinds of things?”</p>
<ul>
<li>Give the spec a quick skim, and then listen to the seductive voice of <a href="http://blog.hut8labs.com/coding-fast-and-slow.html">System I</a></li>
</ul>
<p><span class="dquo">“</span>Sure, yeah, 3 months sounds reasonable” (<span class="caps">OMG</span>, I wish this wasn’t something I’ve done <span class="caps">SO</span> <span class="caps">MANY</span> <span class="caps">TIMES</span>).</p>
<ul>
<li>Push back aggressively</li>
</ul>
<p><span class="dquo">“</span>I read this incredibly convincing blog post <sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup> about how it’s impossible to commit to deadlines for software projects, sorry, I just can’t do that.”</p>
<p>Here’s the thing about all of the above: they’re basically guaranteed to fail. By which I mean, specifically: no one is going to be any kind of happy about the software that gets written from the above starting points.</p>
<h3><span class="caps">OKAY</span>, <span class="caps">ENOUGH</span> <span class="caps">STALLING</span>, <span class="caps">SO</span> <span class="caps">WHAT</span> <span class="caps">DO</span> I <span class="caps">DO</span>, <span class="caps">DAN</span>?</h3>
<p>I’m going to suggest something that may sound a bit odd: while this Important Person is standing at your desk, use this opportunity to, politely but ruthlessly, interrogate them about the <em>business you have just joined</em>.</p>
<p>What is the business model? What are the biggest challenges facing the business as a whole? What risks does leadership worry most about? What are they hoping happens, if everything goes just right? Who is the current customer for the product? What motivates that customer to buy? Are they happy after they buy? If not, why not? What other customers would the Important Person like to go after, if he/she could?</p>
<p>One way to understand this is: there is some central problem or challenge which the business is facing. Your first job is to figure out what that problem is, and, just as importantly, what words the Important Person uses when they think about that problem.</p>
<p>A very important thing: it usually takes a considerable bit of effort to get beyond the proposed <em>solution</em> (e.g. the report), to the actual underlying <em>problem</em>. <a href="http://usersknow.blogspot.com/2009/11/6-reasons-users-hate-your-new-feature.html">Laura Klein</a> summarizes this marvelously as “[People] will tell you that they want a toaster in their car, when what they really mean is that they don’t have time to make breakfast in the morning.” She’s talking about user research, but I find the same perspective is incredibly useful when talking to, e.g. <span class="caps">CEO</span>’s.</p>
<p>Returning to our example, let’s say that, as you talk to the Important Person, you come to understand that your new business, which sells software via a monthly subscription plan, has a serious problem — too many customers are canceling every month. What’s more, you’ve joined a startup, and, although it has a solid chunk of cash in the bank, the leaders very much want to ramp up how much they spend on sales and marketing. Of course, doing that will burn through their cash, and thus require raising more capital sooner than later. And getting <span class="caps">VC</span>’s to invest more money with that high cancel rate is going to be very difficult, if not impossible.</p>
<p>You’ve been hired, at some level, to help solve that problem. <em>Even if the people who have hired you don’t think about it that way.</em></p>
<p>Now that you understand that central problem, take one more step: figure out <em>exactly</em> how this proposed development effort is supposed to solve that problem.</p>
<p>How and why does the business believe that this report is going to lower the cancel rate? What makes the Important Person think it’s going to work? Are there any ways they’re worried that it might <em>not</em> work? Are there any key questions they’d like answered sooner than later?</p>
<h3>Oh, How People Love To Hear Their Own Words</h3>
<p>A key tip for these conversations: at each step, it’s really helpful to echo back what the person just said to you. E.g. “Okay, let me make sure I understand — you’re saying this new feature you want is critical because it’s going to help us upsell existing customers, but we’re not so much expecting it to help us get new customers? Do I have that right?”</p>
<p>At each of those little checkpoints, if you’re right, the Important Person will feel this rare, pleasant sense that someone in development actually seems to understand how the goddamn business works. If you’re wrong, you’ve just narrowly avoided basing your dev efforts on an imperfect understanding of the business (which is a path straight to misery).</p>
<p>Note that template: a) “I’m going to echo that back, make sure I understand”, b) echo it back, c) “Do I have that right?”. I say <em>exactly those words</em>, basically every time I talk to someone about a new project — so much so that my partner Edmund calls it “pulling a Milstein”. You don’t have to be clever with that template, is what I’m saying — put all your cleverness to work really listening and trying to understand the problems facing the business.</p>
<p>This whole process takes practice, but is <span class="caps">INSANELY</span> <span class="caps">VALUABLE</span>. You can (and should!) start by asking everyone you work with about how they understand the overall business you’re currently in, and what challenges it’s facing. Do the same with random people you meet. Be curious, don’t stop being curious, and don’t be in any way afraid to say “I don’t understand that, can you explain it to me?”</p>
<h3>Now, The Knockout Punch</h3>
<p>Once you both understand some central problem facing the overall business, <em>and</em> how your proposed bit of development effort fits into a possible solution, you wrap all that up and deliver it back, repeating as many of the words they used as possible, e.g.:</p>
<p><span class="dquo">“</span>Okay, if I understand it properly, we’re adding this report, because we think we can use it as a key feature in a new, higher pricing tier. This more expensive tier is not really for acquiring new customers, it’s more for upselling existing ones, so we can extract more revenue from our most engaged customers. If we can do that, it’ll have a potentially big impact on our revenue churn <sup id="fnref:2"><a class="footnote-ref" href="#fn:2" rel="footnote">2</a></sup>, which is the most important number in our business right now. And, we really need to see that move in the right direction, in the next 6-9 months, so we’ve got a good story to tell investors when we go out to raise our next round of financing.</p>
<p>Do I have that mostly right?”</p>
<p>With even a modest bit of luck, at this point, the person who handed you the spec will have a cautiously hopeful expression on their face, and they’ll nod as they say, “Yeah, that’s… um… that’s pretty much exactly right.” <sup id="fnref:3"><a class="footnote-ref" href="#fn:3" rel="footnote">3</a></sup></p>
<p>You then say, “Great, let me look into the tech we need for that report, and I’ll get back to you with more info.”</p>
<p>Note: you haven’t promised any date by which the report will be finished. Instead, you’ve demonstrated that you are going to work with this Important Person to solve the actual problems the business is facing. And those problems involve very real, very hard, <em>external</em> deadlines (e.g. running out of money by a certain date).</p>
<p>One way to see it: you’ve taken a key first step in earning their trust.</p>
<p>Now, notice, too: instead of you having made some promises to deliver on a spec, which promises are now hanging over you and making you nervous, you’ve directly engaged in a real problem for the business. And you have plenty of room to be creative about how you solve that problem. Yes, it’s a hard problem, but that’s why you got into this business in the first place — for the joy of solving hard problems that actually matter to someone.</p>
<h3>Man, If Only We Knew What To Do</h3>
<p>The next day, you meet with the team, and discover that the new report is mostly straightforward, except for one thing: it requires a periodic import of data from a new social network with a complex <span class="caps">API</span>. The team has just started working with that <span class="caps">API</span>, and they tell you that they just don’t have enough information to make the call on the 3 month deadline, <em>either way</em> — it’s certainly possible they could hit it, but there’s every chance things could blow up.</p>
<p>What do you do?</p>
<p>You could tell the Important Person that you don’t know. That is, at least, honest. But it doesn’t really help them (aka help the business solve its problem — move revenue churn in the right direction, before the next round of funding).</p>
<p>What <em>would</em> help you solve the business’s problem?</p>
<p>One key is that the business as a whole is trying to make a <em>decision</em> — about how to spend your time.</p>
<p>If you knew for certain that you could get the report built in 3 months (and that existing customers would happily pay more for it), the right decision for the business would be: build it.</p>
<p>Conversely, if you knew for certain that you <em>couldn’t</em> hit the deadline (or that existing customers <em>wouldn’t</em> pay more), the right decision would be: stop immediately, start some other plan to reduce revenue churn.</p>
<p>Given that you don’t know which of those two alternatives you’re living in, what you (and the business) need is: <em>more information</em>.</p>
<p>If you could obtain that information, you could make the right decision, which would make your business a great deal more money than the wrong one.</p>
<p>In the presence of uncertainty, acquiring information is often the best way to generate value. And, yes, this is the point in this blog post where I tell you to go read Donald Reinertsen’s <a href="http://www.amazon.com/gp/product/1935401009/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1935401009&linkCode=am2&tag=hu8labl08-20">Principles of Product Development Flow</a>.</p>
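<p>To put a crude number on the idea that information generates value, here’s a minimal back-of-the-envelope sketch in Python. Every figure in it (the probabilities, the dollar amounts, the cost of the spike) is hypothetical, and it assumes, generously, that a two-week spike would settle which world you’re in. It’s an illustration of why paying a little for information before committing tends to win, not a recipe:</p>
<pre><code>
# Back-of-the-envelope expected value of information.
# All numbers here are hypothetical, purely for illustration.

p_feasible = 0.6              # our gut feel that the report can ship in 3 months
value_if_it_works = 500_000   # payoff if we build it and customers upgrade
loss_if_it_fails = 200_000    # wasted dev time plus delay on a better plan
value_of_plan_b = 150_000     # value of starting the backup churn plan now

# Option 1: commit blindly and hope.
ev_build_blind = (p_feasible * value_if_it_works
                  - (1 - p_feasible) * loss_if_it_fails)

# Option 2: spend two devs for two weeks (say $20k) learning which world
# we're in, then build only if it looks feasible, else switch to plan B.
# (Assumes, optimistically, that the spike settles the question.)
cost_of_spike = 20_000
ev_spike_first = (p_feasible * value_if_it_works
                  + (1 - p_feasible) * value_of_plan_b
                  - cost_of_spike)

print(f"commit blindly:     {ev_build_blind:,.0f}")
print(f"spike, then decide: {ev_spike_first:,.0f}")
</code></pre>
<p>With these made-up numbers, buying the information first comes out comfortably ahead, and that holds across a fairly wide range of assumptions, which is the general point both Reinertsen and Douglas Hubbard (see the footnotes below) keep hammering on.</p>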
<p>So, what you do is, you pick what you work on next, to gather as much information as possible, about the things you are most uncertain about. If you’re clever (and you are! That’s why you got into development in the first place), you can find a way to gather information as part of the process of <em>actually building the thing</em>. Meaning, you usually don’t need to conduct some separate, research-y phase—instead, you can gather the information you need by doing your work in a careful sequence.<sup id="fnref:4"><a class="footnote-ref" href="#fn:4" rel="footnote">4</a></sup></p>
<p>And, crucially, you have to be completely up front about all this with your counterpart on the business side.</p>
<h3>The Meeting Where You Earn Your Salary</h3>
<p>In our story: you schedule a meeting with the Important Person, and, in advance of that meeting, you bury yourself in the technical details of what the team has found about that new <span class="caps">API</span> so far. You also do some chatting with the sales and marketing folks — you want to understand the target customer, and which of their problems the business is hoping to solve with this new report.</p>
<p>Then, at that meeting with the Important Person, you say something like:</p>
<p><span class="dquo">“</span>Right now, we’re feeling optimistic that we’ll have that report ready in some form within 3 months — but our biggest risk is working with that new social network’s <span class="caps">API</span>. From the initial investigation we’ve done, it looks like, at the very least, we’d definitely be able to show them <minimal data foo>, which, from what I understand of our engaged customers, might be enough to trigger upsells, but sales and marketing aren’t certain.</p>
<p>We’d like to propose the following: we take two of our best devs, and they spend 2 weeks trying to build a full integration with the social network, purely on its own, so we’ll have a better understanding of just how much data we can pull in. While they’re doing that, we’d also like to have our front-end devs building mockups of a report with <em>just the minimal data</em>, so that you’ll have something to do some user research with, and possibly even use for sales demos if things go well.</p>
<p>Does that plan sound like a good way to go?”</p>
<p>This little speech is, basically, the most important thing you’re going to do at your job all month. So I want to unpack it in some detail.</p>
<p>First off, note that, because you’re thinking in terms of risks and information, you propose sequencing the work to get as much information, as quickly as possible (e.g. information <em>both</em> about how much data you can get from the social network, and <em>also</em> about whether or not customers will be satisfied by the minimal data set). When you’re facing a chain of risks, you’re going to generate the most valuable information by attacking the biggest risks first.</p>
<p>Second, it should be clear that you can only pull this off if you deeply understand the overall business problem — that’s what lets you propose the minimal data thing. Generally, those opportunities emerge bottom up, as, e.g. a dev figures out what data is / is not easy to obtain — but the value is not always clear to those devs (the very best way to run this game plan is to make it so that all the devs really deeply understand the overall business problem).</p>
<p>Third, it’s important that you’re offering the Important Person an actual, meaningful choice. You’ve clearly stated your current knowledge of what is <em>possible</em> (e.g. the technical risks and opportunities), plus your current understanding of what is <em>valuable</em> (to the business). You’ve framed that in a way which lets the Important Person now make a choice about what to do next (which will often result in you learning that your understanding of what is valuable to the rest of the business is no longer accurate — that’s a very, very good thing).</p>
<p>Fourth, note that, when you work this way, there are <em>good</em> risks (we call them “opportunities”) as well as bad ones. You discovered something unexpected — you could quickly and cheaply build a simpler report that might work. One of the most fun things about this approach is finding those wins — it’s tremendously exciting.</p>
<p>Finally, notice how you’re explicitly operating with a full knowledge of the hard, external deadline facing the business. You’re <em>not</em> talking about deadlines for implementing a spec, but you <em>are</em> talking about deadlines for the overall business… which are the only ones that actually matter.</p>
<h3>What Happens Next</h3>
<p>Now, the Important Person might say any of the following:</p>
<p>a) “Great, go for it” (you say “Thanks, sir/madam, we’ll see you in two weeks with more information + options”)</p>
<p>b) “That minimal data would be a fantastic report — I’m certain we can get upsells with that” (you say “Awesome, we won’t put the devs on an exploratory full backend integration, we’ll sprint ahead on getting the minimal data ready asap, we should have an early, crude prototype to look at within a week or two.”)</p>
<p>c) “That minimal data is absolutely not enough” (in which case you say, “Okay, would you like to see other options for restricted data?”, or, “Hmm, I’d love to better understand what questions we’re trying to answer with this report, since I don’t feel like I quite get it yet”, or even, “Well, in that case, maybe we should explore some other options for reducing revenue churn in parallel, because there’s a real chance we won’t be able to make this report work in time.”)</p>
<p>Note that that last situation is not, in any way, a failure. You’ve learned something very important — the business folks believe that the current plan centrally depends on something which has a great deal of risk associated with it. Armed with this information, you can <em>both</em> try to drive down that risk, as aggressively as possible, and <em>also</em> start working with them to prepare other plans, so you’re ready if things blow up.</p>
<p>Overall, what this approach means is that you will be constantly adjusting your understanding of what is the most valuable way to spend your time, and constantly keeping the business folks in the loop + offering them meaningful choices. This is not, in any way, “we don’t need no stinking estimates, we’re code cowboys, just trust in the full force of our awesomeness.” It’s turning the entire process of software dev into an ongoing conversation with the rest of the business, where information is quickly getting into the hands of people who can make decisions about it. And, where “information” means both things that you know/have learned, and also an understanding of what you <em>don’t</em> yet know — i.e. important risks.</p>
<p>As I said in my previous post, writing software means learning something in such precise detail that you can tell a computer how to do it. More broadly, if creating new software is important to a business, then the business as a whole must engage in a learning process — not just the developers.</p>
<h3>Hmmm, This Doesn’t Really Feel Like a “Process”</h3>
<p>Inevitably, my solution to this feels somewhat personal — but that is not an accident. Fundamentally, we’re talking about two groups of people having to build up trust in each other. Trust about things that they will not, in general, be able to verify.</p>
<p>Specifically, developers have to trust that what they are being told about the rest of the business is true — that customers want what they’re building, that the long hours are actually needed (and aren’t just some middle manager showing that he knows how to crack the whip — something I’ve seen happen far, far too often).</p>
<p>And the rest of the business has to trust that the developers, when they go off into their weird, opaque world, are honestly reporting back on what is possible, how much effort is involved, what they’ve achieved, etc.</p>
<p>Any means of building up that trust will always have a personal flavor — it exists between human beings who have learned something of each other. It’s not a thing you can mandate or fix with an imposed process.</p>
<p>Absolutely anyone who has done any real work on either side of that divide can immediately call up instances of that trust being betrayed — of discovering that all your work for the last half year was meaningless (and that someone knew that and didn’t tell you); or that the repeated promises that some system was ready to launch collapsed in a fiery wreck as soon as the first user tried to log in.</p>
<h3>Sometimes, The World Is Telling You To Polish Up Your LinkedIn Profile</h3>
<p>A severe warning: this whole plan can fail, badly, if the Important Person is, well, not very important. Specifically, say you have a strongly hierarchical structure, where some middle manager is the only person you’re allowed to talk to. It can be the case that such a person <em>perceives their job</em> as taking proposed solutions from upper management and getting a bunch of developers to implement them. Such a person can be very threatened by the idea that you want to get beyond the proposed solution, to the underlying problem. They can hear that as “I’m going to have to go back to my boss and tell them that a bunch of developers think their idea isn’t very good.”</p>
<p>When a boss says “Jump!”, this kind of person prides themselves on saying “How high?!” Since I’m instead proposing “Why are we even jumping, here?”, you can see how there can be a problem.</p>
<p>Furthermore, such a person will often put a really strong value on <em>preventing the flow of information up</em> (from developers to people who can actually make decisions). They may think of that as “Not troubling the boss with the details”. But, as I’ve described above, such a block on the flow of information is absolutely deadly to software development.</p>
<p>So, what’s your best option if you find yourself in such an unfortunate situation?</p>
<p>As I see it, there are two paths ahead.</p>
<p>Option 1: Try to get the middle manager to see this new way of working as something that will make them look good.</p>
<p>I rarely see this work, but it can be worth a shot. My partner Edmund reports some success trying this by way of: a) find an ‘internal’ thing, where the middle manager is, like, a user of the thing, b) propose to them that you work on that internal thing this new way, and then, c) if that produces a thing they find really useful, help them see that their boss can feel the way they feel now.</p>
<p>But, as above, that’s something of a long shot. Which leads us to…</p>
<p>Option 2: Quit.</p>
<p>I don’t say this casually. If you’re stuck in the situation I’m describing, it’s overwhelmingly likely that your project is going to end in some form of unpleasant failure. And, what’s more, it’s extremely rare that you can get higher-level leadership to see any problems with such a middle manager — in general, such a person has that job precisely because they fit into higher-level leadership’s mental model of a manager. In which case, the entire org is going to be set up in a way which makes it hard or even impossible to write useful/valuable software.</p>
<p>If you’ve been stuck in such a situation for a while, I’ll just say — you may have forgotten how great it feels to solve meaningful problems for people. Go find a place where you can do that.</p>
<h3>Your Mission, Should You Choose To Accept It</h3>
<p>In summary, I’m saying: 1) become a student of the overall business you are in, 2) sequence your work to extract as much information from reality as early as possible, and 3) make risks and opportunities the centerpiece of an ongoing conversation with the rest of the business.</p>
<p>There are no certainties in this world, but that approach will let you tackle the uncertainties together.</p>
<p>And that, I can tell you from fortunate experience, is a profoundly satisfying way to work.</p>
<hr />
<h4>But, Wait I Want To Learn More</h4>
<p>I’ve <del>stolen</del><em>synthesized</em> just about all of the above ideas from a bunch of very smart people. You should totally go read their books and blog posts. <sup id="fnref:5"><a class="footnote-ref" href="#fn:5" rel="footnote">5</a></sup> <sup id="fnref:6"><a class="footnote-ref" href="#fn:6" rel="footnote">6</a></sup> <sup id="fnref:7"><a class="footnote-ref" href="#fn:7" rel="footnote">7</a></sup> <sup id="fnref:8"><a class="footnote-ref" href="#fn:8" rel="footnote">8</a></sup> <sup id="fnref:9"><a class="footnote-ref" href="#fn:9" rel="footnote">9</a></sup> <sup id="fnref:10"><a class="footnote-ref" href="#fn:10" rel="footnote">10</a></sup></p>
<p>And, this December, I’ll be speaking at the <a href="http://leanstartup.co/">Lean Startup Conference</a>, in San Francisco, on “Risk, Information, Time <span class="amp">&</span> Money”. Major bonus: almost everyone I list in the above footnotes will be speaking there, too. Last year’s conf was great — I gave a talk on <a href="http://www.slideshare.net/danmil30/how-to-run-a-5-whys-with-humans-not-robots">How to Run a 5 Whys (With Humans, Not Robots)</a>, and learned a ton from other speakers.</p>
<div class="footnote">
<hr />
<ol>
<li id="fn:1">
<p>And I hear the author is very handsome, too. <a class="footnote-backref" href="#fnref:1" rev="footnote" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p><span class="dquo">“</span>Revenue churn”: it turns out that, sometimes, the best way to reduce the cancel rate (aka “churn”), in a subscription business is <em>not</em> to stop every last unhappy customer from canceling, but rather to increase the amount of money you’re getting from the people who use your service the most — in other words, solve for the churn rate in terms of dollars/month, instead of customers/month <a class="footnote-backref" href="#fnref:2" rev="footnote" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>If you can hone this to the point that your summary of the business is so good that it actually helps the Important Person clarify their own thinking… you will win, at whatever game it is you wish to play in life. <a class="footnote-backref" href="#fnref:3" rev="footnote" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:4">
<p>At Hut 8 Labs, we are, well, utterly obsessed with the sequence in which we do work. It’s a rare couple of hours that doesn’t see a discussion about what’s most valuable to do next, based on what we just learned. <a class="footnote-backref" href="#fnref:4" rev="footnote" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:5">
<p>Donald Reinertsen, <a href="http://www.amazon.com/gp/product/1935401009/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1935401009&linkCode=am2&tag=hu8labl08-20">Principles of Product Development Flow</a>. If you a) love math and b) have spent ten years trying to figure out why your software projects keep getting cancelled, drop absolutely everything you’re doing and read Reinertsen <em>right now</em>. Otherwise, first read <a href="http://www.amazon.com/gp/product/0884271951/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=0884271951&linkCode=am2&tag=hu8labl08-20">The Goal</a>, by Eliyahu M. Goldratt, and <em>then</em> read Reinertsen. <a class="footnote-backref" href="#fnref:5" rev="footnote" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:6">
<p>Eric Ries, <a href="http://www.amazon.com/gp/product/0307887898/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=0307887898&linkCode=am2&tag=hu8labl08-20">The Lean Startup</a>. He works out a very powerful set of ideas for generating value in conditions of extreme uncertainty. As you can tell from the name of his book, his focus is on startups, but I find his ideas broadly useful for software development in general. <a class="footnote-backref" href="#fnref:6" rev="footnote" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:7">
<p>Kent Beck, <a href="https://www.facebook.com/notes/facebook-engineering/software-design-glossary/10150309412413920">Software Design Glossary</a>, and, <a href="http://www.amazon.com/gp/product/0321278658/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=0321278658&linkCode=am2&tag=hu8labl08-20">Extreme Programming Explained</a>. Few people have written as thoughtfully and intelligently about software development as Mr. Kent Beck. His work at the intersection of complexity, human nature, and economic value has had a huge influence on me. <a class="footnote-backref" href="#fnref:7" rev="footnote" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:8">
<p>Douglas W. Hubbard, <a href="http://www.amazon.com/gp/product/1452654204/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1452654204&linkCode=am2&tag=hu8labl08-20">How to Measure Anything</a>. Some really fascinating ideas on how to turn a vague statement like “We could make a better decision if we had more information” into something with concrete dollars attached to it. If you love math… you’ll wish he had written a shorter book with a lot more math in it, but such is life. <a class="footnote-backref" href="#fnref:8" rev="footnote" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
<li id="fn:9">
<p>Laura Klein, <a href="http://usersknow.blogspot.com/">Users Know</a>, and <a href="http://www.amazon.com/gp/product/1449334911/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1449334911&linkCode=am2&tag=hu8labl08-20"><span class="caps">UX</span> for Lean Startups</a>. Truly great stuff on how to talk to human beings. <a class="footnote-backref" href="#fnref:9" rev="footnote" title="Jump back to footnote 9 in the text">↩</a></p>
</li>
<li id="fn:10">
<p><span class="dquo">“</span>So, wait, your blog post got so long that you included an <em>appendix</em>, disguised as a series of footnotes?” In my defense, I can only quote a beloved one-time coworker: “No, so is your face”. <a class="footnote-backref" href="#fnref:10" rev="footnote" title="Jump back to footnote 10 in the text">↩</a></p>
</li>
</ol>
</div>Introducing Diffscuss: Plain Text Code Reviews, Right in Your Editor2013-09-05T00:00:00-04:00Edmund Jorgensentag:blog.hut8labs.com,2013-09-05:introducing-diffscuss.html<p>Earlier this year we at Hut 8 Labs were working onsite with a client
who didn’t have their own code review system. Since a life without
code reviews just isn’t worth living for us, we found ourselves
emailing diffs back and forth to each other, with messages like “about
halfway through the diff you do X, maybe you should do Y?” Eventually
we even started inserting comments right in the attached diffs
themselves—comments like “<span class="caps">EWJ</span> <span class="caps">RENAME</span> <span class="caps">THIS</span> <span class="caps">VARIABLE</span> <span class="caps">OR</span> <span class="caps">DIE</span> <span class="caps">IN</span> A
<span class="caps">FIRE</span>!!!”—which worked surprisingly well, except that:</p>
<ul>
<li>
<p>it was easy to miss comments and replies in large diffs, even when
the comments were all caps and followed by multiple exclamation points</p>
</li>
<li>
<p>it was a pain to co-ordinate reviews and replies from even two other people</p>
</li>
<li>
<p>it was a pain to track down the actual source lines a comment
referred to, which meant an unpleasantly high activation energy for
applying small fixes and suggestions</p>
</li>
</ul>
<p>So we created diffscuss—a code review format based on unified diffs,
with editor support for threaded inline comments, basic review
management and git integration, and (best of all) support for jumping
right from a comment to the local source it addresses, without ever
leaving the comfort of Emacs (or, because Hut 8’s own Matt Papi is a
Vimmortal, Vim).</p>
<p><a href="/images/diffscuss-jump-to-source.png">
<img src="/images/diffscuss-jump-to-source.png" width="100%"
alt="Jump to Source" />
</a></p>
<p>We’ve been using diffscuss for about 6 months now, and we’ve been
happy enough with it that we figure it’s time to share it with the world.</p>
<p>Check it out at <a href="https://github.com/hut8labs/diffscuss">Github</a> or
read on for an example of diffscuss in action.</p>
<a name="continued" id="continued"></a>
<p>For example, here you are using diffscuss in Emacs, reading a comment
that Some Guy left in your code. (Click if you want a larger image.)</p>
<p><a href="/images/diffscuss-reading-review.png">
<img src="/images/diffscuss-reading-review.png" width="100%"
alt="Reading Review" />
</a></p>
<p>You decide you agree with him and want to make the change, so you hit
“<code>C-c s</code>” and Emacs pops up the local source file for you, with
the cursor already positioned on the relevant line. (Again, click for
a larger image.)</p>
<p><a href="/images/diffscuss-jump-to-source.png">
<img src="/images/diffscuss-jump-to-source.png" width="100%"
alt="Jump to Source" />
</a></p>
<p>You make the change, save the buffer, and switch right back to the
review buffer. (Click for…you know the drill.)</p>
<p><a href="/images/diffscuss-after-change.png">
<img src="/images/diffscuss-after-change.png" width="100%"
alt="After Change in Source" />
</a></p>
<p>Now “<code>C-c C-c</code>” opens up a new comment, and you reply.</p>
<p><a href="/images/diffscuss-reply.png">
<img src="/images/diffscuss-reply.png" width="100%"
alt="After Change in Source" />
</a></p>
<p>Easy peasy lemon squeezy.</p>
<p><a href="https://github.com/hut8labs/diffscuss">Try it out!</a></p>Coding, Fast and Slow: Developers and the Psychology of Overconfidence2013-04-22T02:24:00-04:00Dan Milsteintag:blog.hut8labs.com,2013-04-22:coding-fast-and-slow.html<p>I’m going to talk today about what goes on in inside developers’ heads when
they make estimates, why that’s so hard to fix, and how I personally figured
out how to live and write software (for very happy business owners) even though
my estimates are just as brutally unreliable as ever.</p>
<p>But first, a story.</p>
<p>It was the &lt;insert time period that will not make me seem absurdly old&gt;,
and I was a young developer <sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup>. In college, I had aced coding exercises; as
a junior dev, I had cranked out code to solve whatever problems someone
specified for me, quicker than anyone expected. I could learn a new language
and get productive in it over a weekend (or, so I believed).</p>
<p>And thus, in the natural course of things, I got to run my own project. The
account manager explained, in rough form, what the client was looking for, we
talked it out, and I said, “That should be about 3 weeks of work.” “Sounds
good,” he said. And so I got to coding.</p>
<a name="continued" id="continued"></a>
<p>How long do you imagine this project took? Four weeks? Maybe five?</p>
<p>Um, actually: three <em>months</em>.</p>
<p>I have vivid memories of that time — my self-image had been wrapped up in
being “a good programmer”, and here I was just hideously failing. I lost
sleep. I had these little panic attack episodes. And it just Would Not End.
I remember talking to that account manager, a pit in my stomach, explaining
over and over that I still didn’t have something to show.</p>
<p>During one of those black periods, I resolved to Never Be That Wrong Again.</p>
<p>Unfortunately, over the course of my career, I’ve learned something pretty hard:
I’m <em>always</em> that wrong.</p>
<p>Actually, I’ve learned something even better: we’re all that wrong.</p>
<p>Recently, I read Daniel Kahneman’s <a href="http://www.amazon.com/gp/product/0374275637/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=0374275637&linkCode=am2&tag=hu8labl08-20">Thinking, Fast and Slow</a>, a sprawling survey
of what psychology has learned about human cognition, about its marvelous
strengths and its (surprisingly predictable) failings.</p>
<p>My favorite section was on Overconfidence. There were, let us say, some
connections to the ways developers make estimates.</p>
<h3>Why You Suck at Making Estimates, Part I: Writing Software = Learning Something You Don’t Know When You Start</h3>
<p>First off, there are, I believe, really two reasons why we’re so bad at making
estimates. The first is the sort of irreducible one: writing software involves
figuring out something in such incredibly precise detail that you can <em>tell a
computer how to do it</em>. And the problem is that, hidden in the parts you don’t
fully understand when you start, there are often these problems that will
explode and just utterly screw you.</p>
<p>And this is genuinely irreducible. If you do “fully understand” something,
you’ve got a library or existing piece of software that <em>does that thing</em>, and
you’re not writing anything. Otherwise, there is uncertainty, and it will often
blow up. And those blow ups can take anywhere from one day to one year to beyond
the heat death of the universe to resolve.</p>
<p>E.g. connections to some key 3rd party service turn out to not be reliable… so
you have to write an entire retry/failure tracking layer; or the db doesn’t
understand some critical character set encoding… so you have to rebuild all
your schemas from scratch; or, the real classic: when you show it to some
customers, they don’t want exactly what they asked for, they want something just
a tiny bit different… that is much harder to do.</p>
<p>When you first hit this pain, you think “We should just be more careful at the
specification stage”. But this turns out to fail, badly. Why? The core
reason is that, as you can see from the examples above, if you were to write a
specification in such detail that it would capture those issues, you’d be
<em>writing the software</em>. And there is really just no way around this. (If, as
you read this, you’re trying to bargain this one away, I have to tell you —
there is really, really, really no way around this. Full specifications are a
terrible economic idea. Further below, I’m going to lay out better economic choices.)</p>
<p>But here’s where it gets interesting. Every programmer who’s been working in
the real world for more than a few months has run into the problems I’m
describing above.</p>
<p>And yet… we keep on making these just spectacularly bad estimates.</p>
<p>And, worse yet, we <em>believe</em> our own estimates. I still believe my own, in the
moment I make them.</p>
<p>So, wait, am I suggesting that all developers somehow fall prey to the same,
predictable errors in thinking?</p>
<p>Yep, that’s exactly what I’m suggesting.</p>
<h3>Why You Suck at Making Estimates, Part <span class="caps">II</span>: Overconfidence</h3>
<p>Kahneman talks at some length about the problem of “experts” making predictions.
In a shockingly wide variety of situations, those predictions turn out to be
utterly useless. Specifically, in many, many situations, the following three
things hold true:</p>
<p>1- “Expert” predictions about some future event are so completely unreliable as
to be basically meaningless</p>
<p>2- Nonetheless, the experts in question are extremely confident about the
accuracy of their predictions</p>
<p>3- And, best of all: absolutely <em>nothing</em> seems to be able to diminish the
confidence that experts feel</p>
<p>The last one is truly remarkable: even if experts try to honestly face evidence
of their own past failures, even if they deeply understand this flaw in human
cognition… they will still feel a deep sense of confidence in the accuracy of
their predictions.</p>
<p>As Kahneman explains it, after telling <a href="http://www.nytimes.com/2011/10/23/magazine/dont-blink-the-hazards-of-confidence.html?pagewanted=all">an amazing story</a>
about his own failing on this front:</p>
<p><span class="dquo">“</span>The confidence you will experience in your future judgments will not be
diminished by what you just read, even if you believe every word.”</p>
<p>Interestingly, there <em>are</em> situations where expert prediction is quite good —
I’m going to explore that below, and how to use it to hack your own dev process.
But before I do that, I want to walk through some details of how the flawed
overconfidence works, on the ground, so you can maybe recognize it in yourself.</p>
<h3>What It Feels Like To Be Wrong: Systems I <span class="amp">&</span> <span class="caps">II</span>, and The 3 Weeks and 3 Months Problem</h3>
<p>In Thinking Fast and Slow, Kahneman explains a great deal of psychology as the
interplay between two “systems” which govern our thoughts: System I and System
<span class="caps">II</span>. My far-too-brief summary would be “System <span class="caps">II</span> does careful, rational,
analytical thinking, and System I does quick, heuristic, pattern matching thinking”.</p>
<p>And, crucially, it’s as if evolution designed the whole thing with a key goal of
<em>keeping System <span class="caps">II</span> from having to do too much</em>. Which makes plenty of sense
from an evolutionary perspective — System <span class="caps">II</span> is slow as molasses, and
incredibly costly; it should only be deployed in very, very rare situations.
But you see the problem, no doubt: without <em>thinking</em>, how does your mind know
<em>when to invoke System <span class="caps">II</span></em>? From this perspective, many of the various
“cognitive biases” of psychology make sense as elegant engineering solutions to
a brutal real-world problem: how to apportion attention in real time.</p>
<p>To see how the interplay between Systems I <span class="amp">&</span> <span class="caps">II</span> can lead to truly awful, and
yet honestly believed estimates, I’m going to turn the mic briefly over to my
friend (and Hut 8 Labs co-conspirator) <a href="http://blog.hut8labs.com/author/edmund-jorgensen.html">Edmund
Jorgensen</a>. He
described it to me in an email as follows:</p>
<p><span class="dquo">“</span>When I ask myself “how long will this project take” System I has no idea, but
wants to have an idea, and translates the question. Into what? I suspect it’s
into something like “how confident am I that I <em>can do</em> this thing,” and that
gets translated into a time estimate, with some multiplier that’s fairly
individual (e.g. when Bob has level of confidence X, he always says 3 weeks;
when Suzy has level of confidence X, she always says 5 weeks).”</p>
<p>Raise your hand if you’ve gradually realized you have two “big” time estimates?
E.g. for me it’s “3 weeks” and “3 months”. The former means “that seems
complex, but I basically think I see how to do it”, and the latter means “Wow,
that’s hard, I’m not sure what’s involved, but I bet I can figure it out.”</p>
<p>Aka, I think Edmund is totally right.</p>
<p>(For those playing along at home: my “3 week” projects seem to take 5-15 weeks,
my “3 month” projects usually take 1-3 years, in the rare event that someone is
willing to keep paying me).</p>
<h3>Alright, So Let’s Stop Being So Overconfident!</h3>
<p>You might be thinking at this point: “Okay, I see where Dan is going: we have to
approach these estimation challenges in some manner that engages System <span class="caps">II</span>
instead of System I. That way, our careful, analytical minds will produce much
better estimates.”</p>
<p>Congratulations, you’ve just invented Waterfall.</p>
<p>That’s basically the promise of the “full specification before we start coding”
approach: don’t allow the team to make intuitive estimates, force everyone to
carefully engage their analytical minds and come up with a detailed spec with
estimates broken down into smaller pieces.</p>
<p>But that totally fails. Like, always.</p>
<p>The real trouble here is the interplay between the two sources of estimation
error: the human bias towards overconfidence, <em>and</em> the inherent uncertainty
involved in any real software project. That uncertainty is severe enough that
even the careful, rational System <span class="caps">II</span> is unable to come up with accurate predictions.</p>
<p>Fortunately, there is a way to both play to the strengths of your own cognition
and also handle the intense variability of the real world.</p>
<p>First, how to play to your mind’s strengths.</p>
<h3>When Experts Are Right, and How To Use That To Your Advantage</h3>
<p>Kahneman and other researchers <em>have</em> been able to identify situations where
expert judgment doesn’t completely suck. As he says:</p>
<p><span class="dquo">“</span>To know whether you can trust a particular intuitive judgment, there are two
questions you should ask: Is the environment in which the judgment is made
sufficiently regular to enable predictions from the available evidence? The
answer is yes for diagnosticians, no for stock pickers. Do the professionals
have an adequate opportunity to learn the cues and the regularities?”</p>
<p>An “adequate opportunity” means a <em>lot</em> of practice making predictions, and a
tight feedback loop to learn their accuracy.</p>
<p>Now, 6-18 month software projects just miserably fail on all these criteria. As
I’ve discussed above, the environment is just savagely not “regular”. Plus,
experts don’t get the combo of making lots of predictions <em>and getting rapid
feedback</em>. If something is going to take a year or more, the feedback loop is
too long to train your intuition (plus you need a <em>lot</em> of instances).</p>
<p>However, there is a form of estimation in software dev that <em>does</em> fit that bill
— 0-12 hour tasks, if they are then immediately executed. At that scale,
things work differently:</p>
<ul>
<li>
<p>Although there is still a lot of variability (more on that below), there is
some real hope of “regularity in your environment”. Two four-hour tasks tend
to have a lot more in common than two six-month projects.</p>
</li>
<li>
<p>You can expect to make hundreds of such estimates, in the course of a couple
of years.</p>
</li>
<li>
<p>You get very quick feedback about your accuracy</p>
</li>
</ul>
<p>The highest-velocity team I’ve ever been on did week sprints, and broke
everything down to, basically, 0, 2, 4, or 8 hours (and there was always some
suspicion about the 8 hour ones — like, we’d try pretty hard to break those
down to smaller chunks). We estimated those very quickly and somewhat casually
— we didn’t even use a <a href="http://en.wikipedia.org/wiki/Planning_poker">Planning
Poker</a> style formalism.</p>
<p>At that point, you’re using the strengths of System I — it has a chance to get
trained, it sees plenty of examples, and there are meaningful patterns to be
gleaned. And, thanks to the short sprint length, you get very rapid feedback on
the quality of your estimates.</p>
<h3>Wait, Wait, Wait, Let’s Just Make a Thousand 4 Hour Estimates!</h3>
<p>How can I both claim that you <em>can</em> make these micro-scale estimates, but somehow
can’t roll them up into 6-18 month estimates? Won’t the errors average out?</p>
<p>Basically, although I think the estimates at that scale are often right, when
they’re wrong, there’s simply no limit to how wrong they can be. In math-y
terms, I suspect the actual times follow a power law distribution. And sufficiently
heavy-tailed power law distributions are notable for having no stable mean and infinite variance.
Which, frankly, is exactly how those big waterfall project estimates feel to me.</p>
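<p>If you want to feel this in your bones, here’s a tiny simulation. It’s a sketch only, with a guessed-at Pareto shape parameter rather than measured data: it draws “actual” task times for a pile of 4-hour estimates from a heavy-tailed distribution and watches what happens to the running average:</p>
<pre><code>
import random

# Toy model: actual task duration is the 4-hour estimate multiplied by a
# Pareto-distributed blowup factor. The shape parameter (just above 1) is a
# guess for illustration: the theoretical mean barely exists, the variance
# doesn't, and the running average is dominated by rare, enormous blowups.
random.seed(42)
ALPHA = 1.1
ESTIMATE_HOURS = 4.0

def actual_hours():
    # random.paretovariate(ALPHA) returns values of at least 1.0, with a long tail.
    return ESTIMATE_HOURS * random.paretovariate(ALPHA)

total = 0.0
for n in range(1, 5001):
    total += actual_hours()
    if n in (10, 100, 1000, 5000):
        print(f"after {n} tasks, average actual time = {total / n:.1f} hours")
</code></pre>
<p>One monster draw can swamp hundreds of well-behaved ones, which is why a thousand pretty-good 4-hour estimates don’t roll up into one trustworthy 6-month estimate.</p>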
<p>You might be thinking: how on earth could something you expected to take 4 hours
take a month or two?</p>
<p>This happens <em>all the time</em>: you go to take some final step in something and
discover some hideous blocker which completely changes the scope. E.g. at a
recent startup, in trying to eliminate single points of failure from our
system, we went to put a load balancer in front of an <span class="caps">IMAP</span> server we had
written. So that, when one server machine died, the load balancer would just
smoothly fail over to another box, and customers would see no impact.</p>
<p>And that seemed like a 4-hour-ish task.</p>
<p>But when we went to actually do it, we realized/remembered that the <span class="caps">IMAP</span> server,
unlike all the <span class="caps">HTTP</span> servers we were so used to, <em>maintained connection state</em>.
So if we wanted to be able to transparently fail over to another server, we’d
have to somehow maintain that state on two servers, or write some kind of
state-aware proxying load balancer in front of the <span class="caps">IMAP</span> server.</p>
<p>Which felt like about a 3-month project to us.<sup id="fnref:2"><a class="footnote-ref" href="#fn:2" rel="footnote">2</a></sup></p>
<p>And there is the other reason that short sprints are an absolutely key piece of
all this: they place a hard limit on the cost of a horrifically bad estimate.</p>
<h3>Are We All Just Screwed?</h3>
<p>So what do we do? Just accept that all our projects are doomed to failure?
That we’ll have poisoned relationships with the rest of the business, because
we’ll always be failing to meet our promises?</p>
<p>The key is that you first accept that making accurate long-term estimates is
fundamentally impossible. Once you’ve done that, you can tackle a challenge
which, though extremely difficult, can be met: how can your dev team
generate a ton of value, <em>even though</em> you cannot make meaningful long-term estimates?</p>
<p>What we’ve arrived at is basically a first-principles explanation of why the
various Agile approaches have taken over the world. I work that out in more detail
in my next post: <a href="http://blog.hut8labs.com/no-deadlines-for-you.html">“No Deadlines For You! Software Dev Without Estimates, Specs or Other Lies”.</a></p>
<p>(Join in the conversation on <a href="https://news.ycombinator.com/item?id=5596578">Hacker
News</a> and
<a href="http://developers.slashdot.org/story/13/04/23/2021201/overconfidence-why-you-suck-at-making-development-time-estimates">Slashdot</a>.)</p>
<div class="footnote">
<hr />
<ol>
<li id="fn:1">
<p>(the band &lt;insert dated music reference&gt; was on the radio, and everyone was
talking about &lt;some long-gone tv show&gt;). <a class="footnote-backref" href="#fnref:1" rev="footnote" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>If you’re thinking “Wait, 3 months, like one of your 3 month estimates?”, I
have no idea what you’re talking about. <a class="footnote-backref" href="#fnref:2" rev="footnote" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
</ol>
</div>When it Comes to Chaos, Gorillas Before Monkeys2013-04-02T00:19:00-04:00Edmund Jorgensentag:blog.hut8labs.com,2013-04-02:gorillas-before-monkeys.html<p>Here’s a glitch in my thinking that I realized on a recent job: I am
too terrified of monkeys, and not sufficiently afraid of gorillas. As
a result, I’ve been missing opportunities for early, smart investments
to make my systems more resilient in the Amazon cloud.</p>
<p>By “monkey” and “gorilla” I mean “Chaos Monkey” and “Chaos Gorilla,”
veterans of Netflix’s Simian Army. You can browse the <a href="http://techblog.netflix.com/2011/07/netflix-simian-army.html">entire
list</a>
<sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup>, but for easy reference:</p>
<ul>
<li>Chaos Monkey is the personification (simianification?) of <span class="caps">EC2</span>
instance failure.</li>
<li>Chaos Gorilla represents major degradation of an <span class="caps">EC2</span> availability
zone, henceforth “<span class="caps">AZ</span>” for short (or, as we sometimes referred to
them at my last job, “failability zones”).</li>
</ul>
<p>I believe that startups should (mostly) worry less about <span class="caps">EC2</span> instances
failing, and more about entire AZs degrading. This leads to a
different kind of initial tech/devops investment—one that I believe
represents a better return for most early-stage companies and products.</p>
<a name="continued" id="continued"></a>
<h3>How I (Finally) Learned to Dread Chaos Gorilla Appropriately</h3>
<p>At the job in question, the team and I were working on an application
that had, at some unremarked moment, crossed the fuzzy line between
advanced prototype and early production. First customers were using
it—some were even starting to depend on it in their daily lives—and
suddenly downtime had gone from something we thought about as
“wouldn’t it be nice if someone noticed or cared” to “wow that might
really tick some people off.”</p>
<p>Unfortunately, unlike <em>every other development shop ever</em>, we might
have cut a corner or two getting our prototype out. To wit, we had a
single point of failure in our system. Or maybe two. All right, I
admit it: we had four <span class="caps">SPOF</span> time bombs ticking away, which—quite
coincidentally—was also the total number of <span class="caps">EC2</span> instances in our
deployment. I felt bad about that, and so did the rest of the team.
We all knew that SPOFs were pure evil, right up there with axe
murderers and grams of trans fat on the list of “Things There’s No
Good Number Of.” So we planned to spend a good chunk of a couple
sprints terminating those SPOFs with extreme prejudice <sup id="fnref:2"><a class="footnote-ref" href="#fn:2" rel="footnote">2</a></sup>.</p>
<p>And then, the night before the first such sprint began, we dodged a
bullet: one of the East Coast AZs freaked the hell out, bringing half
the sites on the Internet down with it. Luckily, it wasn’t the <span class="caps">AZ</span> we
were in <sup id="fnref:3"><a class="footnote-ref" href="#fn:3" rel="footnote">3</a></sup>. Phew, right?</p>
<p>But on the bus the next morning I got to thinking: why did I feel
<em>ashamed</em> of our <span class="caps">SPOF</span> instances, but <em>lucky</em> for dodging the latest <span class="caps">AZ</span>
meltdown? Why did I now suspect that if an instance failure had
caused us even minutes of downtime, I would have blamed myself,
whereas if an <span class="caps">AZ</span> meltdown had knocked our site out of commission for
hours (along with Reddit and maybe Netflix) I would have—if I was
being honest with myself—kind of blamed Amazon? This felt like just
the kind of misaligned thinking that might be hiding an economic opportunity.</p>
<p>Arriving at the office, I cornered Dan in the kitchen and we chatted
it through. From a cold, hard, economic point of view, would we get a
better return first protecting against instance failure, or improving
our resilience to <span class="caps">AZ</span> meltdown? Instances failed, sure—and if we were
at Netflix’s scale, they’d be failing all the time. But at our
scale—four machines—they didn’t seem to fail very often, and when
they did, it would be maybe an hour of scrambling to fix. On the
other hand I could name five occasions in the previous two years, just
by my personal count, when an <span class="caps">AZ</span> had melted down—and in each case we
had spent more like half a day (at least) dealing with the crisis and
fallout. If you go <a href="http://aws.amazon.com/message/680587/">looking</a>
for <a href="http://aws.amazon.com/message/680342/"><span class="caps">AZ</span></a>
<a href="http://aws.amazon.com/message/65648/">meltdowns</a>, they’re
<a href="http://aws.amazon.com/message/67457/">not</a> very
<a href="http://aws.amazon.com/message/2329B7/">hard</a> to
<a href="http://www.webpronews.com/amazon-web-services-outage-brings-down-websites-2012-06">find</a>.</p>
<p>In other words, at our scale of four instances:</p>
<ul>
<li>instance failure = cheap and seldom</li>
<li><span class="caps">AZ</span> meltdowns = expensive and frequent <sup id="fnref:4"><a class="footnote-ref" href="#fn:4" rel="footnote">4</a></sup></li>
</ul>
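<p>Here’s a back-of-the-envelope version of that comparison in Python. The event frequencies echo the rough history above, but the dollar figures and the instance failure rate are invented for illustration, so plug in your own numbers:</p>
<pre><code>
# Hypothetical yearly cost comparison for a 4-instance deployment.
# Frequencies loosely follow the history described above; all dollar
# figures (and the instance failure rate) are made up for illustration.

ENGINEER_COST_PER_HOUR = 100.0   # loaded cost of people scrambling
DOWNTIME_COST_PER_HOUR = 500.0   # lost signups, angry customers, etc.

def annual_cost(events_per_year, hours_per_event):
    per_event = hours_per_event * (ENGINEER_COST_PER_HOUR + DOWNTIME_COST_PER_HOUR)
    return events_per_year * per_event

# Instance failure: rare at four machines, about an hour of scrambling each.
instance_cost = annual_cost(events_per_year=1.0, hours_per_event=1.0)

# AZ meltdown: roughly five in two years, at least half a day each.
az_cost = annual_cost(events_per_year=2.5, hours_per_event=6.0)

print(f"instance failures per year: ${instance_cost:,.0f}")
print(f"AZ meltdowns per year:      ${az_cost:,.0f}")
</code></pre>
<p>Even if you think the instance failure rate is badly wrong, the gap is wide enough that the conclusion survives a lot of fiddling with the inputs.</p>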
<p>(Protecting against <span class="caps">AZ</span> meltdown also has the nice benefit of
optimizing for mean time to recovery instead of mean time between
failures, which is <a href="http://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/">usually the right way to
go</a>).</p>
<p>So we changed course: over the next few sprints we ran <span class="caps">AZ</span> meltdown
simulations, made improvements to our backup and deploy scripts that
would allow us to recover from disaster more quickly <sup id="fnref:5"><a class="footnote-ref" href="#fn:5" rel="footnote">5</a></sup>, and
generally made ourselves more resilient to the economically disastrous
wrath of Chaos Gorilla before we spent real dev calories preventing
the relative pranks of Chaos Monkey.</p>
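<p>To give a flavor of what “recover quickly in another <span class="caps">AZ</span>” can look like in a script, here’s a minimal sketch using boto3 (newer tooling than anything we were running at the time). The <span class="caps">AMI</span>, instance type, and <span class="caps">AZ</span> names below are placeholders, and a real recovery script would also handle data restore and traffic cutover:</p>
<pre><code>
import boto3

# Minimal sketch: launch a replacement app server in a different AZ.
# The AMI, instance type, and AZ below are placeholders; a real recovery
# script would also restore data and repoint DNS / load balancers.
ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # pre-baked image of the app server
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
    Placement={"AvailabilityZone": "us-east-1b"},  # deliberately not the sick AZ
)
instance_id = resp["Instances"][0]["InstanceId"]

# Don't point traffic at it until it's actually running.
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
print("replacement instance ready:", instance_id)
</code></pre>
<p>The script itself is the easy part; the meltdown simulations are what tell you whether it actually works when you need it.</p>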
<h3>Why Had I Been Thinking About this So Wrong?</h3>
<p>I suspect for a few reasons:</p>
<h4>The Different Tradeoffs of Physical Hardware</h4>
<p>I get a lot of my habits and patterns of thought from having worked in
the olden days on sites deployed on physical, colocated servers, where
the cost/benefit profiles are different than those of the cloud.
Unlike tooling some scripts to bring up new instances on-demand in a
separate <span class="caps">AZ</span>, maintaining a failover-ready second server farm in a
separate colo facility represents a substantial investment for a young
company. Furthermore, compared to the bleeding-edge insanity of an <span class="caps">AZ</span>
operated at Amazon’s scale, most physical installations are built on
boring, tried, and relatively simple technology, and they don’t
catastrophically fail as often.</p>
<p>Don’t get me wrong—there’s no excuse for not preparing against colo
failure in the physical hardware universe, any more than there’s an
excuse for ignoring Chaos Gorilla in <span class="caps">AWS</span>, but the combination of lower
incidence of failure and higher cost to protect against it means that
the economic “break-even” line can get drawn later in an application’s life-cycle.</p>
<h4>AZs Still “Feel Like Hardware”</h4>
<p>It’s natural to think of your piddling <span class="caps">EC2</span> instance as something
ephemeral, but (at least for me) it’s not natural to think of a whole
<span class="caps">AZ</span> as something volatile or dangerous. It’s basically just a big,
solid data center, more stable than the boxes it houses, right? Well…no.</p>
<p>An <span class="caps">AZ</span> is a virtualized data center—an exercise in massively
distributed and heterogeneous systems engineering, housing thousands
of tenants who are each competing for resources and aren’t even
supposed to know the others exist. No one else in the world (as far
as I know) is operating a comparable system at the scale of <span class="caps">AWS</span>. And
like all such complex, distributed systems, AZs are subject to weird
hidden dependencies and nasty cascading failure modes.<sup id="fnref:6"><a class="footnote-ref" href="#fn:6" rel="footnote">6</a></sup></p>
<p>In other words: I think we intuitively feel that AZs fail in the modes
of colocated hardware—relatively simply and independently—when in
fact they tend to fail in the mind-bogglingly interconnected modes of
distributed software.</p>
<h4>Chaos Monkey Came, Saw, and Conquered Our Imagination</h4>
<p>I remember the first time I read about Chaos Monkey. The name was
hilarious, and the thinking behind it just felt so instantly,
recognizably <em>right</em>. “You want to make darn sure you only deploy
production systems resilient to instance failure? <span class="caps">OK</span>, then randomly
terminate instances in production.” I internalized that point of view
pretty quickly, and it stuck.</p>
<p>Chaos Gorilla, on the other hand? I barely heard about him when he
came on the scene. And I didn’t know Chaos Kong (who simulates an
entire <span class="caps">AWS</span> region failing) even existed until recently.</p>
<h4>It’s Easy to Feel Good in Good Company</h4>
<p>When one of your instances dies, you’re the only startup on your floor
whose site is down. That feeling sucks. When an <span class="caps">AZ</span> goes down, Giants
of the Internet stumble—even the mighty Netflix can have a problem or
two. Meanwhile down the hall you hear the sobs and screams of your
counterparts at other startups, which you find oddly comforting, and
by that night you’re drinking beer and swapping the day’s war stories
with them.</p>
<p>All this company can make you feel that, somehow, downtime caused by
an <span class="caps">AZ</span> outage <em>isn’t really your fault</em>. But our customers don’t think
that way, and neither should we.</p>
<h3>In Conclusion</h3>
<p>First, let’s note what I’m not concluding.</p>
<p>I’m not concluding that it’s <span class="caps">OK</span> to have single points of failure in
your system, or that instances never die on Amazon, or that you
shouldn’t engineer for high availability.</p>
<p>I am concluding that, because of the high historical frequency of <span class="caps">AZ</span>
degradation and the relatively small number of instances most early
startups deploy, those startups are less likely to be affected on any
given day by instance failure than by <span class="caps">AZ</span> degradation. Furthermore, the
costs incurred in downtime and recovery efforts are usually
significantly higher for <span class="caps">AZ</span> degradation than instance failure.
Therefore, for many startups, it will make sense to invest in
war-gaming <span class="caps">AZ</span> degradation and tooling for quick recovery before
engineering around instance failure.</p>
<p>Or, to put it more succinctly: gorillas before monkeys.</p>
<div class="footnote">
<hr />
<ol>
<li id="fn:1">
<p>Including my personal favorite, the weirdly named “Conformity
Monkey,” whom I kind of imagine standing around awkwardly with a crew
cut and a button-down while “Hippie Monkey” and “Beat Monkey” hurl
epithets and folk songs at him. <a class="footnote-backref" href="#fnref:1" rev="footnote" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>For those wondering why we couldn’t just throw up an <span class="caps">ELB</span> or two
and call it a day: a major element of the application was a custom
<span class="caps">IMAP</span> server—yes, that’s right, <span class="caps">IMAP</span>, as in the charmingly chatty,
delightfully stateful mail protocol. You haven’t really lived until
you’ve tried to load balance a stateful protocol, I tell you. <span class="caps">HTTP</span> is
for wussies. <a class="footnote-backref" href="#fnref:2" rev="footnote" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>For those interested: we were in a single <span class="caps">AZ</span> because we were
required, for business reasons, to deploy into a <span class="caps">VPC</span>. <a class="footnote-backref" href="#fnref:3" rev="footnote" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:4">
<p>We went into some more depth when estimating the relative costs
and benefits of protecting against <span class="caps">AZ</span> meltdown or instance failure,
generating some upper / lower bound estimates which I plan to publish
in a later companion piece. <a class="footnote-backref" href="#fnref:4" rev="footnote" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:5">
<p>Yes, we had robust backups and automated deploys, even for our
prototype. Maybe we were a little <span class="caps">SPOF</span>-y, but we weren’t insane. <a class="footnote-backref" href="#fnref:5" rev="footnote" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:6">
<p>Take <span class="caps">EBS</span> for example, which has been responsible in one form or
another for many of the <span class="caps">AZ</span> meltdowns to date. Even if you don’t
use <span class="caps">EBS</span> directly, an <span class="caps">EBS</span> meltdown can affect your deployment
because—surprise!—ELBs use <span class="caps">EBS</span> behind the scenes—as does <span class="caps">RDS</span>,
and a number of other <span class="caps">AWS</span> services. Even if you avoid all
<span class="caps">EBS</span>-dependent services, a sudden rush of <span class="caps">EBS</span> failover and
replication can choke an entire <span class="caps">AZ</span>, rendering your instances
inaccessible and useless. <a class="footnote-backref" href="#fnref:6" rev="footnote" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
</ol>
</div>Dan Talks About Post-Mortems2013-03-29T07:20:00-04:00Dan Milsteintag:blog.hut8labs.com,2013-03-29:dan-talks-about-post-mortems.html<p>Hello, Dan here. So: the Hut 8 Labs team is very excited to be firing up our
blog. But, before we get down to new business, I wanted to post some links to
other places I’ve written or talked. Of late, a bunch of that writing and
talking has been about <strong>how to run effective post-mortems</strong>.</p>
<p>I’ve actually come to believe that, for many startups, spending a chunk of time
improving how they approach post-mortems (and learning from failure more
generally) has a just-incredible economic return. I suspect it’s one of the
most profitable things they can do with their (incredibly scarce) time.</p>
<a name="continued" id="continued"></a>
<p>Why? Because the sort of default way groups of human beings respond to failures
is with <em>shame</em>… and an attendant desire to quickly move on and pretend it
never happened. Thus, that’s how most startups respond to multi-hour outages,
or embarrassing bugs showing up in front of important early customers, or the like.</p>
<p>And if that’s what your team does after experiencing some nasty failure, you’re
basically guaranteed to be missing simple, cheap, incredibly valuable
improvements. It can be helpful to flip this around, and imagine those
improvements not as “avoiding bad things”, but rather “making you piles of
money” (aka, having strongly positive economic returns). Imagine there’s a big
class of customers waiting to buy your product, but you’ve got a team-wide
mental block which prevents everyone from seeing them. Improving how you run
post-mortems is like discovering those customers are lined up outside your door,
waiting to get in.</p>
<p>(I am not, of course, suggesting that making failures or outages go away is
somehow simple and cheap — what I’m suggesting is that there are incremental
improvements with outsized value, and post-mortems can help you find them. If
you’re thinking “But early customers don’t care that much about outages”,
you’re totally right — the big economic win comes not from avoiding showing
bugs to customers, but from decreasing the frequency of firedrills for your
team, which have an outsized opportunity cost.)</p>
<p>Well-run post-mortems can also serve as a very important release valve —
again, because of the default response of shame. Unless there’s a structure to
deal with failures, people tend to slip into very damaging patterns —
searching for someone to blame, inserting slow-moving layers of review, etc.</p>
<p>Most recently, I gave a talk touching on a bunch of this at the Lean Startup
Conference; the slides are up here:</p>
<p><a href="http://www.slideshare.net/danmil30/how-to-run-a-5-whys-with-humans-not-robots">How To Run a 5 Whys (With Humans, Not Robots)</a></p>
<p>You can also watch a <a href="http://www.ustream.tv/recorded/27482093/highlight/310486">12-minute
video</a> of the talk
(which has the added benefit of documenting for future-me that, in late 2012, I
briefly experimented with a mustache).</p>
<p>Also, a ways earlier, I wrote up a blog post on my experiences running
post-mortems at HubSpot:</p>
<p><a href="http://dev.hubspot.com/blog/bid/64771/Post-Mortems-at-HubSpot-What-I-Learned-From-250-Whys">What I Learned From 250 Whys</a></p>
<p>Hope you enjoy; do check back for more. As a teaser: I’ve been engaged in a
very interesting post-mortem-themed email exchange with one <a href="http://www.kitchensoap.com/">John
Allspaw</a> (who will tell pretty much anyone who
asks that he has some very serious concerns about the 5 Whys approach). I’ve
promised to write up my take on that discussion, tentatively titled <strong>5 Whys
Baaaad, 5 Whys Gooood</strong>, aka “All The Things That Are Wrong With 5 Whys And Why
I Think They’re Awesome Anyways”.</p>
<p>Assuming I actually get that written, it should be fun.</p>