How to Square the Circle, Achieve Perpetual Motion, and Tune Your Alert Emails Just Right
“You can tune a piano,” goes the old joke, “but you can’t tuna fish.” I’ve come to believe that you can’t really tune the automated alerts you get from your monitoring systems either—at least in the sense that we usually mean when we complain that “our email alerts are out of control and need to be tuned properly.” Instead we should be spending our time and attention tuning a larger, more complex system—of which alerts are just one part.
When you tune a piano string, you make small adjustments to return it toward the ideal pitch from which it has drifted (say, 440 Hz for an A). It’s overwhelmingly impractical to tune a real piano string to A perfectly, because of the messy nature of the physical world—but the ideal you’re trying for is straightforward and unambiguous.1 Unfortunately, there’s no such ideal for alerts.
“Of course there is,” says a guy at the back, “I’ve read about it in tons of blog posts. The ideal for alerts is this: every time I get an alert, it should indicate an actual problem that requires my attention.”
OK, guy at the back, that’s pretty easy to do—let’s just turn off the alerting system altogether, and I promise that every alert you receive will be indicative of a problem requiring your attention.
“Don’t play dumb,” he retorts. “We still want to get alerts whenever something is wrong—as soon as we know it’s wrong, in fact—we don’t want to be sitting around thinking things are all hunky dory when production is on fire. So that’s the ideal towards which we tune alerts—we want alerts immediately when there’s an actual problem, and only then.”
Fair enough, but that’s a very different kind of ideal than a string vibrating at precisely 440 Hz. In fact it’s not even clear that this can be called an “ideal” at all, because under this definition an alert can drift from ideal in at least two, often contradictory directions. That is, it can fire when there isn’t an actual problem (a false positive) or not fire when there is (a false negative)—and when you make one of these problems better for a given alert, you tend to make the other worse.
For example, if you’ve ever monitored CPU usage of a production system, you’ll be familiar with the false positive alerts you get when the system becomes briefly and legitimately busy doing a burst of real work. So you tune the alert to back off a bit—perhaps you will tolerate up to 80% utilization instead of 70% before alerting—only to find that some truly nasty condition occasionally pegs the CPU right around 72% forever, causing all sorts of other problems in the meantime. All right, you think, I’ll set the CPU threshold back to 70% but not alert until we’ve exceeded that for 30 minutes, which is longer than any legitimate work spike—but now you’ve guaranteed that you won’t find out about actual problems for at least 30 minutes. So you set a new rule, which etc. etc. etc.
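The whole dance above can be sketched in a few lines. Here’s a minimal, hypothetical threshold-plus-duration rule (the 70% threshold and 30-minute window are just the figures from the example, not recommendations):

```python
def should_alert(samples, threshold=0.70, window=30):
    """Fire only if CPU utilization exceeded `threshold` for the last
    `window` consecutive one-minute samples."""
    if len(samples) < window:
        return False
    return all(s > threshold for s in samples[-window:])

# A brief, legitimate burst of real work no longer fires...
busy_burst = [0.50] * 25 + [0.90] * 5   # should_alert(busy_burst) -> False

# ...but the nasty condition pegged at 72% forever does fire --
# only after a full 30 minutes of trouble have already elapsed.
pegged = [0.72] * 30                     # should_alert(pegged) -> True
```

Notice that the 30-minute window didn’t eliminate the tradeoff; it just moved the cost from false positives to detection delay.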
As tricky as this balancing act is (and, if you’ve ever struggled with this in the real world, you know that it is plenty tricky), there’s a subtle (and even more dastardly) problem buried in this discussion so far: the tossing around we’ve been doing of that term “an actual problem.”
What makes one condition of your system “an actual problem” and another “not an actual problem?” There’s no unambiguous, measurable criterion you can reference to answer that question. If you polled different people in your business, you’d almost certainly get vastly different answers—one of the folks in marketing might not care, for example, if page load times were above 1 second for their latest micro-site, but might care very much if it were down, while another might think anything over 500 milliseconds should be synonymous with downtime—for the very same site. So what we mean by saying an alert caught “an actual problem” ends up being something like: “when the alert fired, it alerted me to a situation that, with my own infinitely flexible and idiosyncratic human knowledge and judgment, I was glad to know about at just that time and no later.”2
This is a pretty bad situation we find ourselves in—not because there’s some unattainable perfection our alerts can’t ever realistically achieve, but because we don’t even have a good enough definition of perfection to let us say whether a change to one of our alerts is actually making it better or worse. That’s a horrific state of affairs, because it means that as we attempt to steadily improve our operations in nice small steps, we’re just going to end up endlessly jerking away from whichever flavor of catastrophe last burned us: our alert volume will build up until it’s just background noise, which will lead to an alert on an “actual problem” being ignored, which will lead to a fantastic blamefest that results in our cutting a bunch of alerts, which will lead to an “actual problem” that generates no alert, which will lead to another blamefest resulting in a buildup of alerts … and round and round we go, with no end in sight.
OK, then … is all hope lost?
So how do we get out of this spiral of blame and abject existential horror? Here’s a hint: how would we design our alerts if we had access to an infinite supply of brilliant, unsleeping, free interns (who were also intimately familiar with our systems and business) to respond to them? Well, assuming we’re completely heartless3, we’d make those alerts sensitive as all hell, because if it’s free (and we’re heartless), why not throw human intelligence and attention at every little blip and bump we monitor to see if there’s an “actual problem?”
What if, on the other hand, the condition on which we were alerting was extremely benign—for example, the website on which we host our high-school poetry going down? We’d make that alert extremely insensitive, because honestly: who cares if it goes down for a day or two, or even a week?4
Economics to the rescue
In other words, we can recast the idea of an alert as something that doesn’t even have a Platonic ideal in and of itself—but which is one piece of an economic equation, with an associated cost and benefit profile.
I mean those words literally, by the way: each alert has some cost and benefit in some actual number of probabilistic dollars and cents, where the cost is dominated by the investment of human attention and intelligence that it occasions, and the benefit is equal to the cost of an “actual problem” (times the probability that the alert has identified such a problem).
Now we can start comparing the relative badness of a given false positive and its corresponding false negative, and we can see the outlines of a system that can actually be tuned towards an unambiguous ideal—the absolute minimum overall cost.5
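To make those probabilistic dollars concrete, here’s a back-of-the-envelope sketch. Every number below is made up purely for illustration—the point is the shape of the calculation, not the inputs:

```python
def expected_cost(p_problem, p_fire_given_problem, p_fire_given_ok,
                  cost_attention, cost_missed_problem):
    """Expected dollars per monitoring interval: attention spent on every
    alert that fires, plus the cost of problems the alert fails to catch."""
    p_fire = (p_problem * p_fire_given_problem
              + (1 - p_problem) * p_fire_given_ok)
    p_miss = p_problem * (1 - p_fire_given_problem)
    return p_fire * cost_attention + p_miss * cost_missed_problem

# A twitchy alert vs. a quiet one, for a problem that costs $10,000
# and $50 of human attention per alert (hypothetical figures):
twitchy = expected_cost(0.01, 0.99, 0.100, 50, 10_000)  # lots of false alarms
quiet   = expected_cost(0.01, 0.50, 0.001, 50, 10_000)  # rarely cries wolf
```

With these particular inputs the twitchy alert comes out cheaper—its stream of false alarms costs less than the problems the quiet one misses—but change the inputs and the ranking can flip, which is exactly why this is a tuning problem and not a slogan.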
A gentle objection from the guy at the back
“All right,” says the guy at the back, “I’ve sat quietly for a bit, but now I think I’ve got you. Maybe you’re right that the Platonic ideal was a little too simplistic, but what you’re proposing is way too complex—there’s just no way in hell you can actually figure out the cost of human attention and intelligence and the probability of a false negative etc. etc. etc. and get all that down to dollars and cents. In practice it’s just impossible.”
Well, guy at the back, we agree about one thing: we’re never going to calculate those costs down to the cent, or even the dollar. But—and here’s the great part—in practice we don’t have to—we just need back of the envelope, order of magnitude estimates that allow us to compare a couple choices (e.g. making an alert more or less sensitive) and say, relatively, which course is probably better.
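One quick way to convince yourself (and the guy at the back) that crude estimates suffice: scale every guess up and down by a factor of three, and check whether the cheaper option stays cheaper. All the figures below are hypothetical, and the policies are deliberately cartoonish:

```python
from itertools import product

# Two rough policies, costed from order-of-magnitude guesses:
#   noisy: ~10 false alarms/week at ~$50 of attention each
#   quiet: ~1 missed outage per quarter (~13 weeks) at ~$200,000 each
def weekly_cost_noisy(attention_per_alarm):
    return 10 * attention_per_alarm

def weekly_cost_quiet(cost_per_outage):
    return cost_per_outage / 13

# Vary every guess by 3x in either direction; if the ranking never
# flips, our imprecision doesn't matter for this decision.
stable = all(
    weekly_cost_noisy(50 * a) < weekly_cost_quiet(200_000 * o)
    for a, o in product((1 / 3, 1, 3), repeat=2)
)
```

Here `stable` comes out true: even at the worst corner (attention three times pricier, outages three times cheaper than guessed), the noisy policy is still the better buy, so there’s no need to sweat the estimates any further. When the ranking does flip somewhere in the grid, that’s your cue that one particular estimate is actually worth refining.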
“But,” says the guy, “you’re still asking me to put a cost on things like minutes of downtime. The leaders of my business are never going to do that—downtime is one of those things that just can’t ever happen.”
Oh, guy at the back, let me buy you a beer or ten.
The spirit behind that “can’t ever happen” is a gigantic problem—it’s equivalent to saying “our perpetual motion machine just can’t ever run down, because we love our customers and failure is not an option.” This is, at its heart, a moral argument, where blame and punishment are what’s under consideration—and in that mindset, people often get genuinely angry if you even suggest that there’s an economic tradeoff to consider.6
If the above sounds uncomfortably close to your own situation, you have much more profound problems than a simple pass at your Nagios configuration could ever solve. If you operate the software of your business, then you must be able to reason about the economics of what you do—be it preventing downtime, investing in backups, or even speeding up deploys—in at least comparative orders of magnitude. If the leaders in your business refuse to partner with you on that, then you don’t have a ton of great options.7 In this situation, as my friend and colleague Dan Milstein says—and I don’t repeat this lightly—”maybe the world is telling you to brush up your LinkedIn profile.”
But in my experience—and I hope that experience is far from unique—the leaders of a business usually welcome the chance to have an economic discussion around such operational risks and investments, as it represents a chance to better understand (and inform) some aspects of the business’s economic equation that are often opaque to them, and remove some of the fear and anxiety that this opacity creates.
In Conclusion …
In practice I think you’ll find that, far from our original false Platonic ideal of “every alert indicates an ‘actual problem,’” you’ll end up happily and profitably tolerating some number of false positives, since for most businesses they tend to be considerably cheaper than a single false negative.8
But thinking about alerts in their larger economic context also opens up other investments beyond tuning sensitivity. Besides just making our alerts more or less trigger-happy, we can improve the overall economic equation by:
driving down the cost of receiving an alert—for example by making alerts easier and quicker to digest and disregard if appropriate, so they consume less human attention and intelligence
driving down the cost of the failures we want to alert on, by providing backup or alternative systems that make failures less expensive (for example, providing materials that allow cashiers to take those old paper impressions of credit cards if the electronic system goes down)
By recognizing our alerts as part of a broader economic system, in other words, we’re setting ourselves up for a world with an actual forward direction and a lot more options for how to travel in that direction—and, correspondingly, a lot less shaking our fist at all those email alerts in impotent rage.
I was originally too ambitious here, stating that there is a perfect tuning for a piano. It turns out that, as several early readers have pointed out to me, there’s no way to tune an entire piano perfectly. So, we’ll stick with a string being tuned to a single note, unrelated to all others. ↩
Yes, in some respect, creating a “perfect” alerting system would mean designing an AI that contained, besides its electronic sensors, an exact, evolving, realtime copy of all your wisdom, experience, judgment, domain knowledge, preferences, etc.—and alerted you when it calculated, with 100% certainty, that you would want to handle a situation—because, paradoxically, while possessing all your wisdom, experience, judgment, domain knowledge, preferences, etc.—as well as on-board networking and a faster CPU than yours—the AI was somehow unable to address the situation itself. ↩
Yes, there’s a real moral question here about making the lives of these poor interns unbearably miserable, but … interns. ↩
In my case, I’d probably want to be alerted if it somehow ever accidentally came up, so that I could immediately shut it down again. ↩
Or, if you’re an optimist, the maximum overall benefit—but if you’re an optimist, what are you doing working in operations anyway? ↩
In particular, it’s tempting but inadvisable to take it upon yourself to translate this moral viewpoint into an economic one. You’d essentially just be assigning an infinite cost to downtime, which leaves you just as lost as the false Platonic ideal of “all alerts must indicate an actual problem” did when it implicitly assigned an infinite cost to wasted human attention. ↩
There’s a corollary to this, too: often when you join an organization you will initially experience their alert volume as horrifically out of control, and grow to understand it as you become better acquainted with both the workings and the economics of the systems you’re monitoring. ↩