
Your Nines Are A Lie

Is your service three nines, four nines, or even five? No matter what your answer is, it’s almost surely inaccurate.

I recently went through an exercise at work to calculate the expected availability of one of our foundational systems. It reminded me how little these calculations have to do with actual availability as experienced by consumers. Expected availability numbers are generally based on hardware failure rates: you combine how often a component fails with how long it takes to repair, and that gives you the component availability. An individual server may have an expected availability of 99%, which means in an average year you’d expect it to be down for repairs for about three and a half days. An easy way to raise the availability of a system is to add redundant components: if you have two of those servers, your system availability goes up to 99.99%. Why? Because the chances of both servers failing at the same time are really small. With three servers you get up to 99.9999%. As you make the system more complex with more layers and more dependencies, the math gets a little more complicated, but the idea stays the same, so you can calculate an expected availability for your entire system based on the availability of each of its components.

If you’re running a production system at scale, a typical design (redundant data centers, redundant circuits, redundant systems) could easily reach 99.999% (five nines) on paper. That’s about 5 minutes of downtime per year. For calibration, it would take 12 years of uninterrupted service to be able to take a 1 hour outage and still be at five nines. But every big outfit, including Google, AWS, and Facebook, has experienced outages longer than that, even though they have big budgets and super smart people designing their systems. Why?
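(For the curious, here’s that redundancy math as a quick back-of-the-envelope sketch. It assumes component failures are completely independent, which real systems rarely achieve.)

```python
# Back-of-the-envelope availability math for redundant components.
# Assumes failures are independent, which real systems rarely achieve.

HOURS_PER_YEAR = 24 * 365

def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of a system that works as long as at least one copy is up."""
    return 1 - (1 - component_availability) ** copies

def downtime_per_year(availability: float) -> float:
    """Expected downtime in hours per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR

for copies in (1, 2, 3):
    a = redundant_availability(0.99, copies)
    print(f"{copies} server(s): {a:.6%} available, "
          f"~{downtime_per_year(a):.2f} hours of downtime per year")

# 1 server(s): 99.000000% available, ~87.60 hours of downtime per year
# 2 server(s): 99.990000% available, ~0.88 hours of downtime per year
# 3 server(s): 99.999900% available, ~0.01 hours of downtime per year
```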

It turns out that most big outages are not caused by component failures. The most common cause of a major outage is someone making a change. All three of the outages I linked to above were caused by a human making a change. Reliability calculations based on component failures tell you absolutely zero about how likely your system is to fail when you make changes – that depends on the quality of your tooling, the frequency of your changes, the design of your system, and the capabilities and training of your team. The second most common cause of outages is overloads – where your system (or some critical subsystem) can’t keep up with what’s being sent at it. Two of the three examples involved overload conditions.

I’ve seen a lot of outages in my career and a vanishingly small percentage were caused by hardware failures – pretty much any decent system these days has been designed to handle individual component failures. The trick is figuring out how to make your system resilient in the face of change and making sure you have the tooling you need to be able to react to and quickly fix any problems that do come up (including being able to quickly add new capacity if needed). If you’re trying to build a reliable service you should pay just as much attention to those as you do to the reliability of your system components!


Reliability vs Availability and the Magic of MTTR

When talking about online systems, you’ll often encounter the terms “availability” and “reliability.” As in “The CloudNetanitor 10000 provides industry-leading reliability” or “We built the Securovator LX system using exclusively high-availability components.” These two concepts are closely linked but different: each of them has its own specific definition.

Availability

Availability is how often something works and works correctly when you try to use it. It’s generally expressed as a percentage, like “SuperLogBuzzer achieved 99.94% availability in the month of June.” The simplest way to calculate the availability is to take the number of successful uses of the service, and divide it by the total attempts. Mathematically:

(successes) / (successes + failures)
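For instance, plugging in some made-up request counts that match the SuperLogBuzzer example above:

```python
# Availability as successes divided by total attempts.
# These request counts are invented to match the 99.94% figure above.
successes = 999_400
failures = 600

availability = successes / (successes + failures)
print(f"{availability:.2%}")  # 99.94%
```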

Reliability

Reliability is how long something works without breaking. Unlike availability it is generally expressed in time. So if you manufacture engines and your engines on average last for two years of continuous operation, the reliability of your engines is two years. Generally reliability is expressed as an average since every engine is going to be different – some may last six months, some may last six years. There’s even a specific term (and acronym) for this average: MTTF, standing for “Mean Time To Failure.” You see this a lot for things like disk drives, which may say “measured MTTF of 6,000 hours.” Some systems will use MTBF or Mean Time Between Failures – that’s almost but not exactly the same thing – close enough for now.
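To make that MTTF/MTBF distinction concrete (this is the conventional relationship for repairable systems, with made-up incident times, not numbers from any real device): MTBF measures from the start of one failure to the start of the next, so it includes the repair time, while MTTF only counts the working time in between.

```python
# Made-up incident log for one machine: (failure_start_hour, repair_done_hour).
incidents = [(100, 102), (700, 703), (1500, 1501)]

# MTTF: average working time from the end of one failure to the start of the next.
uptimes = [incidents[i + 1][0] - incidents[i][1] for i in range(len(incidents) - 1)]
mttf = sum(uptimes) / len(uptimes)

# MTBF: average time from one failure start to the next failure start,
# which also includes the time spent under repair.
gaps = [incidents[i + 1][0] - incidents[i][0] for i in range(len(incidents) - 1)]
mtbf = sum(gaps) / len(gaps)

print(f"MTTF ~ {mttf:.1f} hours, MTBF ~ {mtbf:.1f} hours")  # 697.5 vs 700.0
```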

So which is better?

Availability is what really matters when talking about services you use or services you provide to others. If you can’t reach http://www.google.com, do you really care how long it’s been since the last time it broke? Of course not – you just care that it’s not working right now when you want to use it.

The underlying story here is that availability captures both reliability (MTTF) and another critical concept: mean time to repair (MTTR). Whether or not you can use Google when you want to depends on two things: how long Google goes between failures, and how long it stays broken when it does fail. Both pieces affect your experience with the service, and availability captures both.
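A compact way to see that relationship (this is the standard steady-state formula; the numbers are invented):

```python
# Steady-state availability from MTTF and MTTR; the numbers are made up.
mttf_hours = 1000  # the service averages 1,000 hours between failures
mttr_hours = 2     # and takes about 2 hours to recover from each one

availability = mttf_hours / (mttf_hours + mttr_hours)
print(f"{availability:.4%}")  # 99.8004%
```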

What does this mean for me?

Focus on MTTR. If your system is reasonably well-designed and you didn’t cut a bunch of corners, your MTTF is almost exclusively determined by the architecture and the intrinsic characteristics of the components (hardware and software). This means it’s difficult to make significant changes in MTTF without either rearchitecting, spending lots of money on more redundancy (which can cause its own reliability problems by adding complexity), or changing to an underlying platform with different reliability characteristics. By contrast, there are often large gains to be had in MTTR through relatively simple changes in tools, techniques, and organization, without nearly as much investment.
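To put some made-up numbers on that trade-off, using the same formula as above: halving your repair time improves availability about as much as doubling your time between failures, and it is usually far cheaper to achieve.

```python
# Comparing two ways to improve the same (invented) service.
def availability(mttf_hours: float, mttr_hours: float) -> float:
    return mttf_hours / (mttf_hours + mttr_hours)

print(f"baseline:    {availability(1000, 2):.4%}")  # 99.8004%
print(f"double MTTF: {availability(2000, 2):.4%}")  # 99.9001% (rearchitect, add redundancy)
print(f"halve MTTR:  {availability(1000, 1):.4%}")  # 99.9001% (better tools and practice)
```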

Tracking MTTR

How do you focus on MTTR? First, track it. If you don’t already have somewhere you record every incident, get one. It doesn’t really matter what you use as long as it’s easy to update and captures structured data you can analyze later. Once you have that, make sure you’re recording the data you need to determine MTTF and MTTR for your services.

MTTF is a straightforward number: it’s the average time between when one service interruption ends and the next one starts. MTTR, on the other hand, can be further subdivided, and your recording system should allow you to track the following components of it:

  1. Time to detect
  2. Time to triage
  3. Time to diagnose
  4. Time to fix

Make sure your tracking enables you to measure each of these components so you can see where your biggest opportunities for improvement are, and so you can measure the effects of what you’re doing to make things better.
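As a minimal sketch of what such a record might capture (the field names and timestamps here are invented, not prescribed):

```python
from datetime import datetime

# Hypothetical incident record; the field names and times are invented.
incident = {
    "started":   datetime(2015, 6, 1, 14, 0),   # impact begins
    "detected":  datetime(2015, 6, 1, 14, 9),   # an alert fires or someone notices
    "triaged":   datetime(2015, 6, 1, 14, 20),  # right people engaged, scope understood
    "diagnosed": datetime(2015, 6, 1, 14, 55),  # cause (or at least location) identified
    "fixed":     datetime(2015, 6, 1, 15, 10),  # service restored
}

milestones = ["started", "detected", "triaged", "diagnosed", "fixed"]
labels = ["time to detect", "time to triage", "time to diagnose", "time to fix"]

for label, begin, end in zip(labels, milestones, milestones[1:]):
    print(f"{label}: {incident[end] - incident[begin]}")

print(f"total time to repair: {incident['fixed'] - incident['started']}")
```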

Monitoring at the service level

To reduce detection times, make sure you have monitoring at the service level. There’s a whole different post here, but the essence is that you want to monitor what your customers are paying for: they don’t care if your database is up or down, and neither should you unless it’s affecting the service. I’ve seen many cases where every piece of infrastructure was extensively monitored and yet the entire service could go down without any alerts, because it was the interactions between those components that broke, and those interactions weren’t monitored.
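A bare-bones version of a service-level probe might look something like this (the URL and latency threshold are placeholders; a real setup would run this continuously from outside your own infrastructure and feed an alerting system):

```python
import time
import urllib.request

# Probe the service the way a customer would, not the individual components.
SERVICE_URL = "https://example.com/health"  # placeholder endpoint
MAX_LATENCY_SECONDS = 2.0                   # placeholder threshold

def probe() -> bool:
    """Return True only if the service answered correctly and quickly enough."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(SERVICE_URL, timeout=MAX_LATENCY_SECONDS) as resp:
            ok = resp.status == 200
    except Exception:
        ok = False
    latency = time.monotonic() - start
    return ok and latency <= MAX_LATENCY_SECONDS

if __name__ == "__main__":
    print("service OK" if probe() else "service DEGRADED")
```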

Have Data. Use Data.

Triage and diagnosis times can be greatly reduced by having the right information. Hopefully your applications and systems are already recording lots of useful data in a place where you can easily see and analyze it. The most important information you need for diagnosing tricky problems is a clear understanding of the dependencies in your system. You should have a data model of your infrastructure somewhere – is it easily visualized? Better still, is there an API to it so your tools can not only show it to you but use it themselves when making decisions?
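A dependency model doesn’t need to be fancy to be useful. Even something as simple as the sketch below (the service names are invented) gives both humans and tools a way to answer “what could this failure affect?”

```python
from collections import deque

# A toy dependency model: service -> the services it depends on.
# The service names are invented for illustration.
DEPENDS_ON = {
    "web-frontend": ["api", "cdn"],
    "api": ["user-db", "billing"],
    "billing": ["user-db", "payment-gateway"],
}

# Invert the graph so we can ask: "who depends on X?"
DEPENDED_ON_BY: dict[str, list[str]] = {}
for service, deps in DEPENDS_ON.items():
    for dep in deps:
        DEPENDED_ON_BY.setdefault(dep, []).append(service)

def affected_by(failed_service: str) -> set[str]:
    """Everything that directly or transitively depends on the failed service."""
    affected: set[str] = set()
    queue = deque([failed_service])
    while queue:
        current = queue.popleft()
        for dependent in DEPENDED_ON_BY.get(current, []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected

print(affected_by("user-db"))  # {'api', 'billing', 'web-frontend'}
```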

Build your tools to help you do the right thing

How long it takes to fix a problem once you know the cause (or before you know the cause, as long as you know where the problem is) is largely dependent on the design of the system and the tools to control it. When you’re building your tools make sure you think through the typical use cases – make it easy to do the right thing and try to prevent people from making the wrong choices – even if that does limit the flexibility of the tool.
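As one sketch of what a guard rail can look like in practice (the tool, pool sizes, and flag here are entirely hypothetical): make the safe path the short one, and make the dangerous path require an explicit, conspicuous override.

```python
import argparse

# Hypothetical ops tool: drains hosts from a pool, but refuses to drain so many
# that the remaining capacity can't carry the load, unless explicitly overridden.
POOL_SIZE = 20
MIN_HEALTHY = 15  # made-up capacity floor

def main() -> None:
    parser = argparse.ArgumentParser(description="Drain hosts from the pool")
    parser.add_argument("hosts", nargs="+", help="hosts to drain")
    parser.add_argument("--i-know-this-reduces-capacity", action="store_true",
                        help="override the capacity guard rail")
    args = parser.parse_args()

    remaining = POOL_SIZE - len(args.hosts)
    if remaining < MIN_HEALTHY and not args.i_know_this_reduces_capacity:
        parser.error(
            f"draining {len(args.hosts)} hosts leaves only {remaining} in the pool "
            f"(minimum {MIN_HEALTHY}); pass --i-know-this-reduces-capacity to override"
        )

    for host in args.hosts:
        print(f"draining {host}...")  # a real tool would call the load balancer here

if __name__ == "__main__":
    main()
```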

Leverage automation

Automation can help you at each stage: detection, triage, diagnosis, repair. You want people to spend their time making decisions and once those decisions are made, the computers should do most of the work.

Make things better, one outage at a time

If you have a good system, focusing on MTTR will give you the biggest bang for your buck when it comes to increasing your availability. If you really want to improve MTTR, make sure you learn everything you can from the outages you do encounter. There is no better way to understand the weaknesses of your system than to examine what caused and contributed to real failures in production. Don’t waste a single opportunity!