Reliability vs Availability and the Magic of MTTR

When talking about online systems, you’ll often encounter the terms “availability” and “reliability.” As in “The CloudNetanitor 10000 provides industry-leading reliability” or “We built the Securovator LX system using exclusively high-availability components.” The two concepts are closely linked but distinct: each has its own specific definition.


Availability is how often something works, and works correctly, when you try to use it. It’s generally expressed as a percentage, like “SuperLogBuzzer achieved 99.94% availability in the month of June.” The simplest way to calculate availability is to take the number of successful uses of the service and divide it by the total number of attempts. Mathematically:

(successes) / (successes + failures)
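
For example (with made-up numbers chosen to match the figure above), if SuperLogBuzzer handled 999,400 requests successfully in June and failed 600 of them, its availability for the month would be 999,400 / 1,000,000 = 99.94%.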


Reliability is how long something works without breaking. Unlike availability, it is generally expressed as a length of time. So if you manufacture engines and your engines on average last for two years of continuous operation, the reliability of your engines is two years. Reliability is generally expressed as an average, since every engine is going to be different – some may last six months, some may last six years. There’s even a specific term (and acronym) for this average: MTTF, standing for “Mean Time To Failure.” You see this a lot for things like disk drives, which may claim a “measured MTTF of 6,000 hours.” Some systems will use MTBF, or Mean Time Between Failures – that’s almost but not exactly the same thing – close enough for now.
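
For the curious: under the usual convention for repairable systems, MTBF is measured from the start of one failure to the start of the next, so it includes the time spent broken as well as the time spent running:

MTBF = MTTF + MTTR

When repairs are quick compared to how long the system runs between failures, the two numbers are nearly identical, which is why they get used interchangeably so often.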

So which is better?

Availability is what really matters when talking about services you use or services you provide to others. If you can’t reach a service, do you really care how long it’s been since the last time it broke? Of course not – you just care that it’s not working right now when you want to use it.

The underlying story here is that availability captures both reliability (MTTF) and another critical concept: mean time to repair (MTTR). Whether or not you can use Google when you want to depends on two things: how long Google runs between failures, and how long it stays broken when it does fail. Both pieces affect your experience with the service, and availability captures both.
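
Put in terms of the averages we already have, the relationship is usually written as:

availability = MTTF / (MTTF + MTTR)

So a service that runs 720 hours (about a month) between failures but takes 2 hours to recover each time has an availability of 720 / 722, or roughly 99.7%. Cut the MTTR to 1 hour and availability climbs to about 99.86% – without the service breaking any less often. That’s the magic of MTTR.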

What does this mean for me?

Focus on MTTR. If your system is reasonably well-designed and you didn’t cut a bunch of corners, your MTTF is almost exclusively determined by the architecture and the intrinsic characteristics of the components (hardware and software). This means it’s difficult to make significant changes in MTTF without rearchitecting, spending lots of money on more redundancy (which can cause its own reliability problems by adding complexity), or changing to an underlying platform with different reliability characteristics. By contrast, there are often large gains to be had in MTTR, without nearly as much investment, through relatively simple changes in tools, techniques, and organization.

Tracking MTTR

How do you focus on MTTR? First, track it. If you don’t already have somewhere you record every incident – get one. It doesn’t really matter what you use, as long as it’s easy to update and stores structured data you can analyze once you’ve accumulated some history. Once you have that, make sure you’re recording data that allows you to determine MTTF and MTTR for your services.

MTTF is a straightforward number – it’s the time between when your last service interruption ended and when your next service interruption starts. MTTR, on the other hand, can be further subdivided, and your recording system should allow you to track the following components of it:

  1. Time to detect
  2. Time to triage
  3. Time to diagnose
  4. Time to fix

Make sure your tracking enables you to measure each of these components so you can see where your biggest opportunities for improvement are, and so you can measure the effects of what you’re doing to make things better.
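
As a concrete (and entirely hypothetical) sketch of what that recording might look like, here’s a minimal incident record that captures a timestamp at each stage, so the four components – and the total repair time – fall out of simple subtraction. The field names and structure are purely illustrative, not a reference to any particular tracking tool.

    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class Incident:
        started: datetime    # when the service actually broke
        detected: datetime   # when an alert fired or a human noticed
        triaged: datetime    # when we knew which component/team owned it
        diagnosed: datetime  # when we understood the cause well enough to act
        fixed: datetime      # when the service was healthy again

        def time_to_detect(self) -> timedelta:
            return self.detected - self.started

        def time_to_triage(self) -> timedelta:
            return self.triaged - self.detected

        def time_to_diagnose(self) -> timedelta:
            return self.diagnosed - self.triaged

        def time_to_fix(self) -> timedelta:
            return self.fixed - self.diagnosed

        def time_to_repair(self) -> timedelta:
            # total outage duration: the number MTTR is averaged over
            return self.fixed - self.started

Averaging time_to_repair() across incidents gives you your MTTR; averaging each component tells you where that time is actually going.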

Monitoring at the service level

To reduce detection times, make sure you have monitoring at the service level. There’s a whole other post to be written here, but the essence is that you want to monitor what your customers are paying for – they don’t care if your database is up or down, and neither should you unless it’s affecting the service. I’ve seen many cases where every piece of infrastructure was extensively monitored and yet the entire service could go down without a single alert – because it was the interactions between those components that failed, and those interactions weren’t monitored.
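
As a minimal sketch of what that can look like in practice (the URL and the success check are placeholders, not a real endpoint), a black-box probe exercises a user-visible flow end to end instead of asking whether individual boxes are up:

    import time
    import urllib.request

    # Hypothetical customer-facing flow; point this at something a real user does.
    PROBE_URL = "https://example.com/api/search?q=health-check"

    def probe_once(timeout: float = 5.0) -> bool:
        """Return True only if the customer-visible flow worked end to end."""
        try:
            with urllib.request.urlopen(PROBE_URL, timeout=timeout) as resp:
                body = resp.read()
                # Success means the customer got a real answer, not just that
                # some process somewhere is listening on a port.
                return resp.status == 200 and b"results" in body
        except Exception:
            return False

    if __name__ == "__main__":
        while True:
            if not probe_once():
                print("ALERT: customer-facing flow is failing")  # page someone here
            time.sleep(60)

If this probe is green, your customers are being served, whatever the individual components are doing; if it’s red, you have an incident no matter how healthy every database looks.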

Have Data. Use Data.

Triage and diagnosis times can be greatly reduced by having the right information. Hopefully your applications and systems are already recording lots of useful data in a place where you can easily see and analyze it. The most important information you need for diagnosing tricky problems is a clear understanding of the dependencies in your system. You should have a data model of your infrastructure somewhere – is it easily visualized? Better still, is there an API to it so your tools can not only show it to you but use it themselves when making decisions?
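
As an illustrative sketch (the service names and structure are invented), the dependency model doesn’t have to be fancy to be useful – even a small graph your tools can query programmatically beats a diagram on a wiki page:

    # Invented example topology: which services depend on which.
    DEPENDS_ON = {
        "checkout": ["payments", "inventory"],
        "payments": ["payments-db", "fraud-check"],
        "inventory": ["inventory-db"],
    }

    def blast_radius(service: str) -> set:
        """Everything that directly or transitively depends on `service`."""
        affected = set()
        frontier = [service]
        while frontier:
            current = frontier.pop()
            for parent, deps in DEPENDS_ON.items():
                if current in deps and parent not in affected:
                    affected.add(parent)
                    frontier.append(parent)
        return affected

    # If payments-db is misbehaving, what should we be looking at?
    print(blast_radius("payments-db"))  # {'payments', 'checkout'}

During a real outage, a query like this turns “what else could this be breaking?” from a whiteboard exercise into a one-liner.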

Build your tools to help you do the right thing

How long it takes to fix a problem once you know the cause (or before you know the cause, as long as you know where the problem is) depends largely on the design of the system and the tools you have to control it. When you’re building those tools, make sure you think through the typical use cases – make it easy to do the right thing and try to prevent people from making the wrong choices, even if that limits the flexibility of the tool.
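
A hedged, hypothetical sketch of what that can mean in practice (the function, the threshold, and the override are all made up): refuse the obviously dangerous action by default, and make the operator say explicitly that they really mean it.

    def drain_hosts(hosts_to_drain, total_hosts, force=False):
        """Take hosts out of rotation, but refuse to drain so much capacity
        that the service itself would go down (unless explicitly forced)."""
        fraction = len(hosts_to_drain) / total_hosts
        if fraction > 0.25 and not force:
            raise SystemExit(
                f"Refusing to drain {fraction:.0%} of capacity in one step; "
                "re-run with force=True if you really mean it."
            )
        for host in hosts_to_drain:
            print(f"draining {host}")  # call your real drain mechanism here

The flexibility is still there for the rare case that needs it, but the default path is the safe one – which is exactly what you want at 3 a.m. in the middle of an outage.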

Leverage automation

Automation can help you at each stage: detection, triage, diagnosis, repair. You want people to spend their time making decisions; once those decisions are made, the computers should do most of the work.
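
One small, hypothetical illustration of that division of labor (the action names are invented): the human makes exactly one decision, and the computer carries it out quickly and consistently.

    # Invented remediation actions; in real life these would call your tooling.
    REMEDIATIONS = {
        "restart": lambda svc: print(f"restarting {svc}"),
        "failover": lambda svc: print(f"failing {svc} over to the standby"),
        "rollback": lambda svc: print(f"rolling back the latest deploy of {svc}"),
    }

    def remediate(service: str) -> None:
        # The human decides...
        choice = input(f"{service} is unhealthy. Action {sorted(REMEDIATIONS)}: ")
        action = REMEDIATIONS.get(choice)
        if action is None:
            print("No action taken.")
            return
        # ...and the computer does the repetitive, error-prone work.
        action(service)

    remediate("payments")

The same idea applies earlier in the timeline: automatically gathering logs, recent changes, and dependency information the moment an alert fires saves the responder from doing it by hand while the clock is running.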

Make things better, one outage at a time

If you have a good system, focusing on MTTR will give you the biggest bang for your buck when it comes to increasing your availability. If you really want to improve MTTR, make sure you learn everything you can from the outages you do encounter. There is no better way to understand the weaknesses of your system than to examine what caused and contributed to real failures in production. Don’t waste a single opportunity!