Tag Archives: web operations

Your Nines Are A Lie

Is your service three nines, four nines, or even five? No matter what your answer is, it’s almost surely inaccurate.

I recently went through an exercise at work to calculate the expected availability of one of our foundational systems. It reminded me how little these calculations have to do with actual availability as experienced by consumers. Expected availability numbers are generally based on hardware failure rates: you combine how often a component fails with how long it takes to repair, and that gives you the component availability. An individual server may have an expected availability number of 99%, which means in an average year you’d expect it to be down for repairs for about three and a half days. An easy way to raise the availability of a system is to have redundant components – if you have two of those servers, your system availability goes up to 99.99%. Why? Because the chances of both servers failing at the same time are really small. With three servers you get up to 99.9999%. As you make this system more complex with more layers and more dependencies, the math gets a little more complicated but the idea stays the same, so you can calculate an expected availability of your entire system based on the availability of each of its components. If you’re running a production system at scale, a typical design (redundant data centers, redundant circuits, redundant systems) could easily reach 99.999% (five nines) on paper. That’s about 5 minutes of downtime per year. For calibration, it would take 12 years of uninterrupted service to be able to take a one-hour outage and still be at five nines. But every big outfit, including Google, AWS, and Facebook, has experienced outages longer than that, even though they have big budgets and super smart people designing their systems. Why?
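
To make the redundancy math concrete, here’s a quick sketch (Python, using the same made-up numbers as above) of how adding copies drives up paper availability:

```python
def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of N redundant copies, assuming failures are independent.

    The system is only down when every copy is down at the same time.
    """
    p_down = 1.0 - component_availability
    return 1.0 - p_down ** copies

for copies in (1, 2, 3):
    a = redundant_availability(0.99, copies)
    downtime_minutes = (1.0 - a) * 365 * 24 * 60
    print(f"{copies} server(s): {a:.6%} available, "
          f"~{downtime_minutes:,.0f} expected minutes of downtime per year")
```

Note the “assuming failures are independent” in the docstring – that assumption is doing all the work in these calculations, and it’s exactly what real-world outages tend to violate.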

It turns out that most big outages are not caused by component failures. The most common cause of a major outage is someone making a change. All three of the outages I linked to above were caused by a human making a change. Reliability calculations based on component failures tell you absolutely zero about how likely your system is to fail when you make changes – that depends on the quality of your tooling, the frequency of your changes, the design of your system, and the capabilities and training of your team. The second most common cause of outages is overload – where your system (or some critical subsystem) can’t keep up with the load being thrown at it. Two of the three examples involved overload conditions.

I’ve seen a lot of outages in my career and a vanishingly small percentage were caused by hardware failures – pretty much any decent system these days has been designed to handle individual component failures. The trick is figuring out how to make your system resilient in the face of change and making sure you have the tooling you need to be able to react to and quickly fix any problems that do come up (including being able to quickly add new capacity if needed). If you’re trying to build a reliable service you should pay just as much attention to those as you do to the reliability of your system components!

Cars, Technology, and Web Operations

I like cars. And I love finding analogies for my day job in the car world. Cars have been around for a long time and they’ve gone through a bunch of technological advances. I happen to have two cars from two different eras that demonstrate many of the changes in automotive technology over the last 40 years. I believe that these changes can teach us something about the evolution of Web Ops technologies as well.

One of my cars is a recent Mustang, a 2006.  It’s a modern car, full of technology and creature comforts. I also have a 1966 Mustang, which is devoid of both. I just recently started driving the 1966 again (it was non-functional for many years), and I have been really struck by how different the two driving experiences are.

What first got me thinking was noticing the vast difference in the manual transmissions, and specifically the clutches. These two cars are very similar at first glance: both rear-wheel drive, both powerful V8 engines, both manual transmissions. But they don’t drive the same at all. In the 66, all of the user controls (steering wheel, clutch pedal, gas pedal, etc.) are mechanically linked to the things they’re controlling – if you press the gas pedal to a certain point, that corresponds exactly to how far the throttle plates in the carburetor will be open. In the 06, everything is computer-controlled. The position of the gas pedal is sent to the computer, which uses it as one of the inputs to decide how much fuel to inject into the mixture going into the cylinders. That means the computer in the 06 can adjust things if it wants to – and it often wants to. You can see this in how forgiving the 06 is when starting from a stop. As everyone who has driven a manual car knows, from a stop you give the car some throttle to bring up engine speed while simultaneously releasing the clutch pedal. That brings the clutch plate into contact with the flywheel at the back of the engine, starting the clutch plate spinning, which spins the transmission gears, then the driveshaft, then the differential gears, and finally the rear wheels, making the car go. In the 66 it’s very easy to get a result that’s outside expected norms – namely stalling the car if you don’t give enough throttle, or spinning the rear wheels if you give too much. It is much, much harder to stall the 06, because the computer knows that’s probably not what you’re trying to do – so it’ll put more fuel into the mixture than the position of the throttle would normally indicate, to keep the engine from stalling (up to a limit). This makes the general driving experience in the 06 more pleasant, more predictable, and safer.

Great, right? That’s progress! Cars should be pleasant, predictable, and safe, shouldn’t they? Yes they should – usually. But not always. What if you’re on a racetrack, trying to get the maximum possible performance out of your car? In those cases, the same technology that makes your car more pleasant in average driving conditions will hold you back. You can see this much more concretely in traction control systems. Briefly, many modern cars (especially those with lots of power) have computerized traction control systems that can sense when a wheel starts to spin (meaning you’re losing traction) and either use the braking system to slow the rotation of that wheel or lower the power output from the engine. In average driving conditions, this is great – it makes the car safer and more predictable. However, if you’ve ever watched Top Gear or any show where professional race drivers try to get the most performance out of a road car, the very first thing they will do is disable the traction control systems – because at the edges of the performance envelope those systems will hold you back and prevent you from getting the full potential out of your car. To go FAST, you need to be able to drive the car so that one or more of the tires are on the ragged edge of traction. When you’re on the edge of out of control, you’re racing – and (so far) traction control systems are not nearly as good at balancing on that edge as a human driver.

What does this mean for Web Ops?

  • When you make things easier, more predictable, and safer, you by necessity impose constraints on what can be done with the system. For most people most of the time, these are very good constraints. However, if you are trying to push the edges of the envelope when it comes to performance or efficiency (like a race driver does), then those constraints will get in your way and you’ll want to avoid or remove them. For example, using standardized machine configurations that are automatically applied can greatly increase the robustness of the system, both by removing manual mistakes and also by simplifying diagnostic and repair procedures. However, those standardized configurations are probably not going to be as performant as a custom-built, custom-tuned configuration would be for any particular application. If you’re in the 99% that don’t need maximum performance for your particular business, standardized configurations are going to be great for you.  But if you’re that 1% you may need the flexibility to customize.
  • Another effect of using technology to make things easier, more predictable, and safer is that it usually imposes new costs on the system in terms of resources. Take virtualization as an example. Virtualization provides powerful new capabilities for IT – you can easily reconfigure your infrastructure, snapshot your systems, etc. However, there are costs to virtualization – the virtualization layer consumes system resources that can’t be used directly for your workloads. For most people most of the time, virtualization is overall a large net benefit.  However, if you’re running a huge infrastructure where losing 0.5% of your resources represents a significant financial burden, then virtualization probably isn’t for you.
  • The costs imposed by new technology to make things easier, more predictable, and safer aren’t always measured in resources. They can come in the form of risks as well. Computer-controlled throttle and braking systems could in theory suffer from software bugs that could create dangerous conditions in ways that mechanical systems are not vulnerable to, like having the throttle opened even when the driver is intending to slow the vehicle down. New capabilities and new software introduce new failure modes.  IT is littered with similar examples – in my own career I was struck by how often the introduction of high-availability systems (like database clustering for instance) actually led to failures of the highly-available system that wouldn’t have occurred in non-redundant systems – such as when a heartbeat between a master/slave system fails and the slave puts itself in-service, leading to two masters being active, leading to IP address conflicts and general badness.

Don’t get me wrong – I love technology, in cars and in IT. I hope that everyone leverages technological advances to the fullest extent possible. However, I recognize that as technology gives us new capabilities, it also imposes new constraints and costs, and as Web Ops professionals we need to understand those costs and constraints and make sure that we make informed and intelligent decisions about how we apply new technologies to our infrastructures.

Happy motoring.

Anatomy of a Change Request – The Basics

In most IT environments you’ll find some kind of Change Request (CR) form.  Some of them are simple forms for simple workflows and some of them…well, aren’t.  What does a typical CR look like?  If you’re creating a Change Management (CM) process for your organization (and you should have one!), what should your CR look like?

In this post I’ll talk about the very basic information that should be in every CR. In a subsequent post I’ll go through some of the optional information that more heavyweight CM processes may use.

A minimal CR

Any CR should have at least the following information:

  • Title
  • Requestor
  • Executor
  • Execution Time
  • Purpose
  • Procedure (including execution, validation, and rollback)
  • Results

Let’s go through these one by one:

Title
This is a short (less than one line) summary of the CR, used mainly for displaying CRs in lists.
Requestor
Who asked for this change? This is important to have in case there are any questions about what should be done or decisions that need to be made about different options that can be chosen. If you don’t know who requested it, you can’t get answers to those questions.
Executor
Who is actually doing the change? This is important to know for later troubleshooting purposes – if something goes wrong you’ll want to consult the person who made the change as they will have the best knowledge of what happened and if anything strange occurred.
Execution Time
For troubleshooting it is critical to know exactly when changes took place, so you can correlate them with service impacts or other important events. (Your CM process may record execution time as part of the change workflow itself, in which case it’s not critical to have it actually in the CR – but it needs to be somewhere).
Purpose
Why is this change being made? What is the business value of doing this? This is the field I see missing most often. Everyone involved in the CM process should understand the reason why changes are being made – and those reasons should be tied to the needs of the business. This understanding allows everyone to make informed decisions at every stage about priorities, strategies, tactics, etc. Without this understanding, the people making the changes are disconnected from the business and become disengaged and jaded, eventually leading to poor decisions.
Procedure (execution, validation, rollback)
What are you going to do? What order are you going to do it in? How are you going to make sure it worked, and didn’t break anything else? What are you going to do if something goes wrong? There are many different viewpoints on what level of detail and rigor this procedure needs to have – there is no one right answer, but I always think of every CR as a candidate for future automation, and the more detailed, specific, and complete the procedural section of the CR is, the easier it will be to automate in the future.
Results
What happened when the change was executed? Typically this part of the CR will contain pasted output from execution or validation commands, or screenshots showing the effective change, etc. If there are any problems later this prevents wasted time while people ask “did you do _____” or “what does ______ command show?” A tiny amount of work to cut’n’paste some info here can save a huge amount of heartache later.

This may seem like a lot of information for a simple CR, but in practice it doesn’t take very long to fill these out for simple changes. And for complicated changes, you shouldn’t be worried about the extra overhead of typing – if you’re not thinking through and planning your complicated changes, you’re taking big risks with your business.
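
If it helps to picture the end result, here’s a minimal sketch of a CR as a structured record (Python; the field names and example values are mine, not a standard):

```python
from dataclasses import dataclass

@dataclass
class ChangeRequest:
    """A minimal change request record (illustrative, not a standard)."""
    title: str            # short one-line summary, for displaying CRs in lists
    requestor: str        # who asked for the change
    executor: str         # who is actually making the change
    execution_time: str   # when the change happened (or is scheduled)
    purpose: str          # the business reason for making the change
    procedure: str        # execution, validation, and rollback steps
    results: str = ""     # pasted command output, screenshots, etc.

cr = ChangeRequest(
    title="Raise web tier connection limits",
    requestor="jane@example.com",
    executor="ops-oncall",
    execution_time="2013-02-01T02:00:00Z",
    purpose="Handle traffic growth from the partner launch",
    procedure="1. Update limits in config. 2. Validate with synthetic checkout. "
              "3. Rollback: revert config and restart.",
)
```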

Where does a CR form live?

When your CM process gets started, CR forms will often be simple documents – they could be in GDocs (this is how we do it at my company today), they could be in a wiki, or they could live directly in the ticketing system that manages your CM workflow (if you have one). What’s important is that the CRs be easy to fill out and easy to find later.

How do I start using a CR form?

Once you’ve created your CR form, the next step is simple. Just start using it for your changes! Ideally the person in charge of your infrastructure already understands the value of CM, and will be eager to have everyone start using the CR. If that’s not the case, use the CR form yourself, and ask others to use it. Even if no one else does, at some point there will be an incident that will make the value of using CRs obvious to everyone – and when that happens you’ll be ready.

Why Change Management?

Recently I had the opportunity to create a template for infrastructure change requests at work. Based on the reaction from some of my co-workers, I thought it might be valuable to explain what change requests are for. In a subsequent post I’ll go through what a basic change request looks like.

Change Requests are part of the Change Management (CM) process. Now don’t get freaked out, that doesn’t mean we need forms filled out in triplicate sent through multiple people for review and approval. Processes can have as much or as little heft as required to meet the needs of your organization. But if your infrastructure’s availability is important to you, you should have a CM process. We are a small startup, so our CM process is very lightweight. Here are the main tenets:

  1. Think about a change before you start executing it
  2. If something is high-risk, test it before you do it for real
  3. Know how you’re going to handle it if something goes horribly wrong
  4. Record that you made the change so people can find it later if they need to (for example, when troubleshooting a problem)

Point 1 (think before you execute) is really philosophical. After many years of doing production web operations, I’m convinced based on the empirical evidence that you’re far more likely to screw something up if you just start cowboying your way through a change rather than planning it ahead of time. You see this point of view in other contexts as well (“plan your flight, fly your plan”). Many times when planning a change, I have thought of something new as I’m doing the planning that I would otherwise have encountered during execution – something that in the heat of the moment would have caused me great panic. Better to hit that and work through it when you’re not stressed out in the middle of a big production change. For me one of the most important parts of having a written Change Request is that it enforces thinking through a change before you execute it.

Point 2 (test high-risk changes) may sound obvious but there are certainly nuances. How do you determine what’s high-risk and where do you draw the line? How much time do you spend doing testing vs simply rolling back a change if it does cause problems? I’ve found that it’s best to leave these decisions in the hands of the people executing the changes – but your CM process needs to remind them to ask these questions, think about the answers, and use their best judgment.

Point 3 (how to handle problems) is not theoretical. If your job is web operations, you will be involved with a change that goes horribly wrong. It just happens. When it happens, if you have not thought about it ahead of time you will be up a smelly brown creek without a paddle. This is when panic sets in, and in the heat of those moments some spectacularly bad decisions can be made which could make the situation even worse. Spending some time prior to execution thinking through potential failure scenarios allows you to execute your rollback plan calmly and effectively. Which way do you prefer?

Point 4 (change recording) is absolutely critical unless you a) never forget anything and b) are the only person involved in the support of your infrastructure. In my experience, the majority of thorny production problems are caused by changes, usually when they introduce latent faults that don’t manifest as incidents for a while. When diagnosing such a problem, it is critical that you know what changed when, and that is precisely the purpose of change recording. There are a million ways to do this, from sending emails to a “changelog” alias or putting change summaries in IRC to having a CMDB with change records in it. Less important than the specific mechanism(*) is that you have a mechanism, that people use it religiously, that it’s easy to search for changes at particular times and to particular systems, and that everyone knows where to find it and how to use it. What seems like busywork when you’re performing a change (“Why do I have to write this down? It’s already done!”) will pay giant dividends when it prevents someone from spending tons of time reverse engineering what happened while the service is down.

(*) – Note: one thing you really should leverage is version control for your CM and recording processes – it’s invaluable for being able to track a sequence of changes and to easily pull back a previously working configuration.
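
To show just how little mechanism you need to get started, here’s a sketch of an append-only, structured change log (Python; the path and fields are placeholders, not a recommendation):

```python
import json
import time

CHANGELOG = "/var/log/changelog.jsonl"  # placeholder path

def record_change(who: str, what: str, systems: list) -> None:
    """Append one structured change record; easy to search by time or system later."""
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "who": who,
        "what": what,
        "systems": systems,
    }
    with open(CHANGELOG, "a") as f:
        f.write(json.dumps(entry) + "\n")

record_change("ops-oncall", "Raised web tier connection limits", ["web01", "web02"])
```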

Monitoring – Getting the Most Bang For Your Buck With WCCAM

When you run a service that others depend on (i.e. you have customers) you have a responsibility. Your service should work when people want to use it. If it doesn’t, you’re letting them down – and likely costing yourself money. But you have limited resources to invest in keeping the service up – how do you spend them most wisely? What’s the best bang for your buck when it comes to monitoring? (Hint: it’s probably not what your monitoring system is designed for!) Let’s look at your options:

Infrastructure monitoring

Your typical monitoring solution will tell you lots about your infrastructure, meaning the servers and network devices that your services run on. You’ll have ping tests to make sure servers are alive, and disk space checks to make sure they can keep writing the new information they need to. The monitoring system will also record lots of system-level metrics for you to look at: how busy your servers are, how heavily utilized your network links are, etc. For a service with lots of customers, you’ll have a lot of infrastructure. It may be dozens, hundreds, or thousands of devices. A decent monitoring system will tell you right away when any of those devices fail or are having serious problems.

But that’s not good enough! In fact, it’s often useless. Unfortunately this is the kind of thing most monitoring systems are really good at. But how valuable is it to check three times every minute that your disks are still 77% full? So you can ping a server – do your customers care?

Infrastructure monitoring tells you when a server or a router goes down.  Do your angry customers typically complain that “your router is down!” or “your database server is down!” when they call you about problems? If the answer is no – read on.

Application Monitoring

On top of your infrastructure you have applications – the software that provides the services your customers consume. Good application monitoring will involve looking at individual processes on your servers, and looking at the operational interfaces those processes provide to you: primarily logfiles and statistics. If your application monitoring is decent you’ll know right away when any of your software gets into a bad state.

That’s not good enough either!  The service you provide is not the software.

Do your angry customers typically complain that “the indexing queue is really backed up” or that “the shopping cart middleware has stopped accepting requests” when they call you about problems? If the answer is no – read on.

What Customers Care About

Your customers use what you’ve created for a reason. It provides a benefit (or benefits) to them. That is what you want to be monitoring. What would your customers say if you asked them “Why do you use our service? What does it do for you?” Take that answer and figure out how to monitor it. Maybe the answer is “I use your service to make payments to people I buy things from.” OK, then your monitoring system needs to be able to measure making payments. (NOTE: not the servers involved in making payments, not the software involved in making payments – making payments itself is what you need to measure and monitor.) If the answer is “I use your service to read about what my friends are doing” then your monitoring system needs to be able to measure people reading about what their friends are doing. After lots of searching in vain for a decent name, I call this “What Customers Care About Monitoring” or WCCAM (rhymes with Wiggum, like the police chief in The Simpsons). This is what you really care about – that the value you provide to your customers is working.

These are probably also exactly the things your customers do complain to you about. “I can’t make a payment!” “I can’t read the status updates from Soandso!” If you listen to customer support calls, these are the kinds of things you’ll hear. In fact, in lieu of asking customers directly what they use your service for, the next best thing is to ask your customer support folks what people complain about – that’s an excellent pointer to what your critical services are from a customer perspective.

Measure your services – measure your value!

Once you’ve identified the services your customers care about (like making payments, or reading updates from their friends), figure out what characteristics of those services are critical. Possibilities include:

  • performance – response time, load time – how quickly can they get to it?
  • functional correctness – is it doing what it’s supposed to?
  • availability – can they reach it when they want to?

Then figure out how to measure and monitor those characteristics.  I know, I know – that’s not easy.  That’s why you get paid to do it!
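
As a starting point, here’s one minimal sketch of a WCCAM-style probe (Python; the endpoint, response check, and thresholds are all invented). The point is that it exercises the whole “make a payment” path the way a customer would, rather than pinging the servers underneath it:

```python
import time
import urllib.request

def check_make_payment() -> dict:
    """Exercise the end-to-end 'make a payment' flow like a customer would."""
    start = time.monotonic()
    ok = False
    try:
        # Hypothetical endpoint that runs a test payment against a sandbox account.
        with urllib.request.urlopen(
            "https://example.com/api/test-payment", timeout=10
        ) as resp:
            ok = resp.status == 200 and b'"status": "paid"' in resp.read()
    except Exception:
        ok = False
    return {"ok": ok, "seconds": time.monotonic() - start}

result = check_make_payment()
if not result["ok"] or result["seconds"] > 2.0:  # invented thresholds
    print("ALERT: customers may not be able to make payments!")
```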

What does this do for me?

Effective WCCAM pays off in much higher availability for your services, which means happier and more satisfied customers. It provides much faster detection of customer-impacting problems, which directly reduces your MTTR. If you have ever had an outage or disruption that was reported to you by your customers rather than your monitoring system, then you already know the value of WCCAM. There are many situations where all of your individual devices and applications are up and running, but the overall service is not working because something is wrong with the connections between those components, or with some external dependency they have. WCCAM tells you about these problems – infrastructure and application monitoring do not.

Effective WCCAM also can lead to faster triage and diagnosis, again reducing MTTR. Since what you are measuring is what customers care about, it’s much easier to distinguish an important problem from a trivial one – letting you prioritize what you’re going to do intelligently.

WCCAM points the way

Let’s review:

  • What you should monitor is what your customers care about.
  • So monitor the services you provide to your customers – not (just) your infrastructure.
  • WCCAM will let you find and address problems more quickly, leading to happier customers, a happier business, and a happier universe.

WCCAM will also provide great data to inform the decisions you make down the road about which services need investment in stability fixes, what your customers’ experiences with your service are, and how you stack up against your competition.  WCCAM is not the only monitoring you need – you still need to have infrastructure and application monitoring.  But when you have limited resources and have to prioritize and make choices – make sure you’re putting WCCAM at the top of the list, because it gives you the best bang for your buck.

Reliability vs Availability and the Magic of MTTR

When talking about online systems, you’ll often encounter the terms “availability” and “reliability.” As in “The CloudNetanitor 10000 provides industry-leading reliability” or “We built the Securovator LX system using exclusively high-availability components.” These two concepts are closely linked but different: each has its own specific definition.

Availability

Availability is how often something works, and works correctly, when you try to use it. It’s generally expressed as a percentage, like “SuperLogBuzzer achieved 99.94% availability in the month of June.” The simplest way to calculate availability is to take the number of successful uses of the service and divide it by the total number of attempts. Mathematically:

(successes) / (successes + failures)

Reliability

Reliability is how long something works without breaking. Unlike availability it is generally expressed in time. So if you manufacture engines and your engines on average last for two years of continuous operation, the reliability of your engines is two years. Generally reliability is expressed as an average since every engine is going to be different – some may last six months, some may last six years. There’s even a specific term (and acronym) for this average: MTTF, standing for “Mean Time To Failure.” You see this a lot for things like disk drives, which may say “measured MTTF of 6,000 hours.” Some systems will use MTBF or Mean Time Between Failures – that’s almost but not exactly the same thing – close enough for now.

So which is better?

Availability is what really matters when talking about services you use or services you provide to others. If you can’t reach http://www.google.com, do you really care how long it’s been since the last time it broke? Of course not – you just care that it’s not working right now when you want to use it.

The underlying story here is that availability captures both reliability (MTTF) and another critical concept: mean time to repair (MTTR). Whether or not you can use Google when you want to depends on two things: how long Google goes in between breaking, and also how long it stays broken for when it does break. Both pieces affect your experience with the service, and availability captures both.
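
You can write that relationship down directly: for a system that alternates between working and being repaired, availability ≈ MTTF / (MTTF + MTTR). A quick sketch with invented numbers shows how much leverage MTTR has:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a repairable system."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Same MTTF (30 days between failures), three different repair times:
for mttr_hours in (4.0, 1.0, 0.25):
    print(f"MTTR of {mttr_hours:>4} hours -> "
          f"{availability(30 * 24, mttr_hours):.4%} available")
```

Cutting the repair time from four hours to fifteen minutes buys you most of an extra nine without touching MTTF at all – which leads directly to the next point.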

What does this mean for me?

Focus on MTTR. If your system is reasonably well-designed and you didn’t cut a bunch of corners, your MTTF is almost exclusively determined by the architecture and the intrinsic characteristics of the components (hardware and software). This means it’s difficult to make significant changes in MTTF without rearchitecting, spending lots of money on more redundancy (which can cause its own reliability problems by adding complexity), or changing to an underlying platform with different reliability characteristics. By contrast, there are often large gains to be realized in MTTR, without nearly as much investment, through relatively simple changes in tools, techniques, and organization.

Tracking MTTR

How do you focus on MTTR? First, track it. If you don’t already have somewhere you record every incident – get one. It doesn’t really matter what you use, as long as it’s easy to update and supports structured data so you can analyze it once you’ve accumulated some history. Once you have that, make sure you’re recording data that allows you to determine MTTF and MTTR for your services.

MTTF is a straightforward number – it’s the time between when your last service interruption ended and when your next service interruption starts. MTTR on the other hand can be further subdivided, and your recording system should allow you to track the following components of it:

  1. Time to detect
  2. Time to triage
  3. Time to diagnose
  4. Time to fix

Make sure your tracking enables you to measure each of these components so you can see where your biggest opportunities for improvement are, and so you can measure the effects of what you’re doing to make things better.
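
One minimal way to structure that tracking (Python; the field names simply mirror the four components above):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Incident:
    """One service interruption, with a timestamp for each MTTR stage."""
    started: datetime    # when the service actually broke
    detected: datetime   # when you found out about it
    triaged: datetime    # when you knew where the problem was
    diagnosed: datetime  # when you knew what the problem was
    fixed: datetime      # when service was restored

    def breakdown(self) -> dict:
        return {
            "time_to_detect": self.detected - self.started,
            "time_to_triage": self.triaged - self.detected,
            "time_to_diagnose": self.diagnosed - self.triaged,
            "time_to_fix": self.fixed - self.diagnosed,
            "total": self.fixed - self.started,
        }
```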

Monitoring at the service level

To reduce detection times, make sure you have monitoring at the service level. There’s a whole separate post to write here, but the essence is that you want to monitor what your customers are paying for – they don’t care if your database is up or down, and neither should you unless it’s affecting the service. I’ve seen many cases where every piece of infrastructure was extensively monitored and yet the entire service could go down without any alerts – because it was the interactions between those components that were affected, and those weren’t monitored.

Have Data. Use Data.

Triage and diagnosis times can be greatly reduced by having the right information. Hopefully your applications and systems are already recording lots of useful data in a place where you can easily see and analyze it. The most important information you need for diagnosing tricky problems is a clear understanding of the dependencies in your system. You should have a data model of your infrastructure somewhere – is it easily visualized? Better still, is there an API to it so your tools can not only show it to you but use it themselves when making decisions?
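
As a sketch of why “machine-readable” beats a diagram on a wiki page: with even a crude dependency model, your tools can answer “what’s affected if X breaks?” for you (Python; the service names are invented):

```python
# Which services each service relies on (names invented for illustration).
DEPENDS_ON = {
    "checkout": ["payments-api", "sessions-db"],
    "payments-api": ["payments-db", "fraud-service"],
    "search": ["search-index"],
}

def affected_by(broken: str) -> set:
    """Every service that directly or transitively depends on the broken component."""
    impacted = {broken}
    changed = True
    while changed:
        changed = False
        for service, deps in DEPENDS_ON.items():
            if service not in impacted and any(d in impacted for d in deps):
                impacted.add(service)
                changed = True
    return impacted - {broken}

print(affected_by("payments-db"))  # {'payments-api', 'checkout'} (order may vary)
```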

Build your tools to help you do the right thing

How long it takes to fix a problem once you know the cause (or before you know the cause, as long as you know where the problem is) is largely dependent on the design of the system and the tools to control it. When you’re building your tools make sure you think through the typical use cases – make it easy to do the right thing and try to prevent people from making the wrong choices – even if that does limit the flexibility of the tool.

Leverage automation

Automation can help you at each stage: detection, triage, diagnosis, repair. You want people to spend their time making decisions and once those decisions are made, the computers should do most of the work.

Make things better, one outage at a time

If you have a good system, focusing on MTTR will give you the biggest bang for your buck when it comes to increasing your availability. If you really want to improve MTTR, make sure you learn everything you can from the outages you do encounter. There is no better way to understand the weaknesses of your system than to examine what caused and contributed to real failures in production. Don’t waste a single opportunity!