Your Nines Are A Lie

Is your service three nines, four nines, or even five? No matter what your answer is, it’s almost surely inaccurate.

I recently went through an exercise at work to calculate the expected availability of one of our foundational systems. It reminded me how little these calculations have to do with the actual availability experienced by consumers.

Expected availability numbers are generally based on hardware failure rates. You combine how often a component fails (its mean time between failures) with how long it takes to repair (its mean time to repair), and that gives you the component availability. An individual server may have an expected availability of 99%, which means that in an average year you’d expect it to be down for repairs for about three and a half days. An easy way to raise the availability of a system is to add redundant components: if you have two of those servers, your system availability goes up to 99.99%. Why? Because the chances of both servers failing at the same time are really small. With three servers you get up to 99.9999%. As you make the system more complex, with more layers and more dependencies, the math gets a little more complicated, but the idea stays the same, so you can calculate an expected availability for your entire system from the availability of each of its components.

If you’re running a production system at scale, a typical design (redundant data centers, redundant circuits, redundant systems) could easily reach 99.999% (five nines) on paper. That’s about 5 minutes of downtime per year. For calibration, it would take 12 years of uninterrupted service to be able to absorb a 1 hour outage and still be at five nines. But every big outfit, including Google, AWS, and Facebook, has experienced outages longer than that, even though they have big budgets and super smart people designing their systems. Why?
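As a rough sketch of that arithmetic (the MTBF and MTTR figures below are made-up examples, not numbers from any real hardware spec):

```python
# Back-of-the-envelope availability math, mirroring the reasoning above.
# The MTBF/MTTR figures are illustrative assumptions, not real vendor data.

MINUTES_PER_YEAR = 365 * 24 * 60


def component_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = time up / (time up + time under repair)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(a: float, n: int) -> float:
    """n redundant copies: the system is only down when all n are down at once."""
    return 1 - (1 - a) ** n


# A single server that fails about once a year and takes ~3.5 days to repair.
server = component_availability(mtbf_hours=8316, mttr_hours=84)   # 0.99
print(f"1 server : {server:.6f}")                                 # 0.990000
print(f"2 servers: {redundant_availability(server, 2):.6f}")      # 0.999900
print(f"3 servers: {redundant_availability(server, 3):.6f}")      # 0.999999

# Five nines allows roughly 5 minutes of downtime per year...
budget = (1 - 0.99999) * MINUTES_PER_YEAR
print(f"Five-nines budget: {budget:.1f} minutes/year")            # ~5.3

# ...so a single 1-hour outage takes ~12 years of perfect service to amortize.
print(f"Years to absorb a 1-hour outage: {60 / budget:.1f}")      # ~11.4
```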

It turns out that most big outages are not caused by component failures. The most common cause of a major outage is someone making a change. All three of the outages I linked to above were caused by a human making a change. Reliability calculations based on component failures tell you absolutely zero about how likely your system is to fail when you make changes – that depends on the quality of your tooling, the frequency of your changes, the design of your system, and the capabilities and training of your team. The second most common cause of outages is overloads – where your system (or some critical subsystem) can’t keep up with what’s being sent at it. Two of the three examples involved overload conditions.

I’ve seen a lot of outages in my career and a vanishingly small percentage were caused by hardware failures – pretty much any decent system these days has been designed to handle individual component failures. The trick is figuring out how to make your system resilient in the face of change and making sure you have the tooling you need to be able to react to and quickly fix any problems that do come up (including being able to quickly add new capacity if needed). If you’re trying to build a reliable service you should pay just as much attention to those as you do to the reliability of your system components!

About Paul Guth

Old Timey Web Ops guy. I think about cars and clouds, and how they could be faster, cheaper, and more resilient.

2 responses to “Your Nines Are A Lie”

  • Soumen Sarkar

    Well said, Paul. What formalization could be brought to reliability and availability calculations with change injection in mind? Component failure was easier to formalize, but it is less effective given the change factor you mentioned.

    • Paul Guth

      Great question. Obviously a big part of why we do expected availability calculations this way is because we have data on hardware failure, provided by the manufacturers. To do what you’re saying, we’d need similar data about the availability characteristics of the other factors: tooling, system design, change policies and processes, etc. That data doesn’t really exist, not publicly and certainly not comprehensively. The closest thing is probably the work Gene Kim et al. have been doing with DevOps Research and Assessment. (I should probably ask them this same question!)

      The simplest place to start would probably be with two numbers: a change success rate, and an average MTTR for incidents caused by change. So if we could say “this kind of change policy and system context typically leads to change success rates of X% and MTTR of Y minutes” then we’d be getting a lot closer.
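      For instance, a toy model might combine those two numbers like this (every figure below is invented purely to show the shape of the calculation):

      ```python
      # Toy model: expected downtime per year caused by failed changes.
      # Change volume, success rate, and MTTR are hypothetical placeholders.

      MINUTES_PER_YEAR = 365 * 24 * 60


      def change_downtime_minutes(changes_per_year, success_rate, mttr_minutes):
          """Expected yearly downtime from changes that go wrong."""
          failed_changes = changes_per_year * (1 - success_rate)
          return failed_changes * mttr_minutes


      # Hypothetical team: 500 changes a year, 99% succeed, 45 minutes to recover.
      downtime = change_downtime_minutes(500, 0.99, 45)
      availability = 1 - downtime / MINUTES_PER_YEAR

      print(f"Change-induced downtime: {downtime:.0f} minutes/year")  # 225
      print(f"Availability from changes alone: {availability:.5f}")   # ~0.99957
      ```

      Even at a 99% change success rate, those five failed changes dwarf the roughly five-minute annual budget that a five-nines hardware calculation allows on paper.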

      Thanks for the question!
