Category Archives: Web Operations

Your Nines Are A Lie

Is your service three nines, four nines, or even five? No matter what your answer is, it’s almost surely inaccurate.

I recently went through an exercise at work to calculate the expected availability of one of our foundational systems. It reminded me how little these calculations have to do with actual availability as experienced by consumers. Expected availability numbers are generally based on hardware failure rates. You combine how often a component fails (mean time between failures) with how long it takes to repair (mean time to repair), and that gives you the component availability. An individual server may have an expected availability number of 99%, which means in an average year you’d expect it to be down for repairs for about three and a half days. An easy way to raise the availability of a system is to have redundant components – if you have two of those servers, your system availability goes up to 99.99%. Why? Because the chances of both servers failing at the same time are really small. With three servers you get up to 99.9999%. As you make this system more complex with more layers and more dependencies, the math gets a little more complicated but the idea stays the same, and so you can calculate an expected availability of your entire system based on the availability of each of its components. If you’re running a production system at scale, a typical design (redundant data centers, redundant circuits, redundant systems) could easily reach 99.999% (five nines) on paper. That’s about 5 minutes of downtime per year. For calibration, it would take 12 years of uninterrupted service to be able to take a 1-hour outage and still be at five nines. But every big outfit – including Google, AWS, and Facebook – has experienced outages longer than that, even though they have big budgets and super smart people designing their systems. Why?
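The redundancy arithmetic above is easy to sketch in code – a toy model that assumes failures are independent, which real outages rarely are:

```python
def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of a system that stays up as long as at least one of
    `copies` identical, independently failing components is up."""
    unavailability = 1.0 - component_availability
    return 1.0 - unavailability ** copies

def downtime_minutes_per_year(availability: float) -> float:
    """Expected yearly downtime implied by an availability figure."""
    return (1.0 - availability) * 365 * 24 * 60

# One 99% server is ~2 nines; two give you 4 nines; three give you 6.
for copies in (1, 2, 3):
    print(round(redundant_availability(0.99, copies), 6))

# Five nines allows roughly 5.3 minutes of downtime per year.
print(round(downtime_minutes_per_year(0.99999), 1))
```

The independence assumption is exactly what breaks down in practice: a bad config push takes out all the redundant copies at once, which is the point of the next paragraph.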

It turns out that most big outages are not caused by component failures. The most common cause of a major outage is someone making a change. All three of the outages I linked to above were caused by a human making a change. Reliability calculations based on component failures tell you absolutely zero about how likely your system is to fail when you make changes – that depends on the quality of your tooling, the frequency of your changes, the design of your system, and the capabilities and training of your team. The second most common cause of outages is overloads – where your system (or some critical subsystem) can’t keep up with what’s being sent at it. Two of the three examples involved overload conditions.

I’ve seen a lot of outages in my career and a vanishingly small percentage were caused by hardware failures – pretty much any decent system these days has been designed to handle individual component failures. The trick is figuring out how to make your system resilient in the face of change and making sure you have the tooling you need to be able to react to and quickly fix any problems that do come up (including being able to quickly add new capacity if needed). If you’re trying to build a reliable service you should pay just as much attention to those as you do to the reliability of your system components!

Are you SRE or are you DevOps?

People have asked me, “Are we doing DevOps, or are we doing SRE?” I’ve also heard (and this is worse): “We’re an SRE team – we don’t do DevOps.” These distinctions don’t make sense, because SRE and DevOps aren’t actually different things. SRE is DevOps. To be more precise, SRE is a specific implementation of DevOps. DevOps is a broad (and vague) term. It’s more of a philosophy than a methodology – it’s a perspective on the world and a set of patterns to apply. SRE shares the DevOps philosophy and many of the same patterns.

The term “SRE” generally refers to Google SRE, which is a particular implementation of DevOps inside of a ton of Google-specific context. (SRE – both the term and the practice – originated at Google and has only recently been used by other organizations). There are several things DevOps and SRE have in common:

  • Focus on solving problems with software
  • Ownership and empowerment of the team responsible for a service
  • Learning relentlessly from successes and (especially) failures
  • Driven by data and metrics

Google SRE adds a lot of specifics – some of the most interesting are aspects of economics and incentives, such as:

  • Common (human) resource pool between software developers and SREs – scarcity of whom leads to explicit decisions to optimize between features and reliability
  • Use of an “error budget” to throttle the rate of change for a product – including the unintuitive guidance that if you are exceeding your SLO for availability, you should launch more features
  • A cap of 50% of SRE time on operational tasks (known as “toil”) – to ensure the system can scale faster than the team required to support it
  • At least 5% of operational work done by software developers – to maintain visibility of the operational load the software creates
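The error-budget idea is easy to make concrete with a little arithmetic – this is my own back-of-the-envelope sketch, not Google's actual tooling:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

# A 99.9% SLO gives you about 43 minutes of downtime per 30-day month.
print(round(error_budget_minutes(0.999), 1))  # 43.2

# 10 minutes of outage so far leaves ~77% of the budget:
# plenty of headroom, so ship more features.
print(round(budget_remaining(0.999, 10), 2))  # 0.77
```

The unintuitive guidance falls straight out of the numbers: if the budget is mostly unspent, you are being *too* conservative about change.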

Google SRE operates on top of Google’s internal infrastructure and products. This is an extremely important part of the Google SRE context – they have had brilliant people working for fifteen years on the foundational systems, processes, and tools used to manage Google’s services. And within Google, every SRE team can benefit from not just common tooling and infrastructure, but also repeatable, translatable process. No other existing SRE team (outside of Google) works with the same level of foundational support.

As SRE expands outside the walls of Google, I like to think it will come to mean “applying the principles of DevOps at scale.” “Service Reliability Engineering” (an evolution of Google’s “Site Reliability Engineering”) is a much better term than “DevOps” to apply to teams focused on the reliability and performance of large-scale distributed systems, because it reflects the work and the expertise involved. “DevOps,” unfortunately, tends to just create confusion when applied to an organization or a strategy.

What, then, does it mean to do SRE? What does applying DevOps at scale look like? To start with:

  • Automate your infrastructure – build and management
  • Monitor what matters – set explicit SLOs for your services and gather the data both to see if you’re hitting the objective, as well as to evaluate the effects of changes to your infrastructure or code
  • Make your code builds and deploys both automated and repeatable, leveraging CI/CD
  • Learn from your failures with an effective retrospective process for incidents and launches
  • Empower your people – software developers and SREs – and hold them accountable for the overall success of the product – which includes its reliability

DevOps and SRE aren’t at odds. You can learn a lot from both!

The Freakonomics of Oncall Pay

Wearing the pager. It’s a fact of life for many of us ops folks. I’ve taken part in many a discussion from a management perspective about how oncall duty should be compensated. When the people doing the talking are pointy-haired-manager types who haven’t done oncall themselves, their starting position is often similar to the “per incident support” policies you get from a vendor – if you’re oncall and you get paged, you get paid either a set amount for that incident or an amount proportional to how long you’re working on it. So you might get say $50 per incident, or $25 per hour you’re engaged, for example. And if your oncall duty period goes quietly (no pages), you won’t get anything above your normal salary. Let’s call this the “per incident” model. The other option is a flat wage bump for the period you’re oncall – so for example every week you’re oncall, you get an extra $150 in your check. Let’s call this the “per week” model.


image provided by flickr user hades2k under a Creative Commons license.

I’ve always been a strong advocate for the “per week” model instead of the “per incident” model, because being oncall is an intrusion by your work into your personal life – it reduces your freedom during your time off work, and it does that whether or not you get paged. You can’t fly to Vegas for the weekend, you can’t get sloshed after work, and in some cases you can’t even take a shower or drop your kids at school without arranging for someone to cover you in case there’s an incident while you’re away from the keyboard. Simply being oncall affects your life, and I’ve always felt that people should be compensated for that – not just for the time where they are actively working an issue.

Then I saw _Freakonomics_ and realized there’s an even more powerful argument for the “per week” model: incentives. In the “per incident” model, your compensation goes up when the system has more problems that require oncall support. In theory, people might try to cause incidents so that they will get paged and therefore get more money. Personally I doubt that happens very often, if at all. However, there’s a more subtle influence on root cause analysis and problem management that I think does have real effects. When you’re paid in the “per week” model, you’re strongly incentivized to address the root causes of problems and improve your systems so that during your oncall week it’s more likely that you’ll be able to sleep through every night and live your life normally for that week. So when you do encounter an issue in the “per week” model, you’re going to really want to figure out what caused it, and to make meaningful changes to prevent it from happening the next time around. Fixing the problem completely does you nothing but good – you’re still going to get paid the same oncall pay for your next stint, and you’re going to have a much more pleasant experience when you’re on call. But in the “per incident” model, putting all that effort into root cause and remediation is actually going to cost you money. The next time you’re oncall, each incident you prevented from happening will mean you don’t get the money for working that incident. So consciously or not, it’s likely you’re not going to work quite as hard to make your systems better as you would under the “per week” model. I believe that this can have real effects and result in your systems being less reliable and less stable than they could be.
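The incentive math is simple enough to sketch, using the hypothetical $50-per-incident and $150-per-week figures from above:

```python
def per_incident_pay(incidents: int, rate: float = 50.0) -> float:
    """Extra pay for an oncall week under the 'per incident' model."""
    return incidents * rate

def per_week_pay(incidents: int, weekly_rate: float = 150.0) -> float:
    """Extra pay under the 'per week' model - the incident count is
    irrelevant to your paycheck, so fixing root causes costs you nothing."""
    return weekly_rate

# Under 'per incident', every problem you prevent is money out of your pocket:
print(per_incident_pay(6), per_incident_pay(0))  # 300.0 0.0

# Under 'per week', a quiet week pays the same as a brutal one:
print(per_week_pay(6), per_week_pay(0))  # 150.0 150.0
```

In the first model, driving incidents to zero zeroes out your oncall pay; in the second, it maximizes your sleep at no cost. That asymmetry is the whole argument.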

How does your company do oncall? Is it per week, per incident, or something else? And how do you think that is affecting incentives and outcomes? I’d love to hear your experiences!

PS: I did a little research into military hazard pay to see if there were instructive parallels there. I found Hostile Fire Pay which seems to follow the “per week” model. Anyone have information/thoughts on how incentives have affected this program?

What Public Cloud SLAs ARE good for

When HP announced their cloud was going GA, lots of the coverage said that the cloud came with a 99.95% SLA. While that is technically accurate, it is also very misleading. People may assume (and even some coverage indicates this) that this means if you launch a virtual server inside HP’s cloud, they are guaranteeing that instance will have 99.95% uptime. This is emphatically not true. What they are really guaranteeing is that that instance will be up or you will be able to launch a replacement instance. That’s a really important or. Lydia Leong over at Gartner has a much more detailed analysis of this that you probably want to read. For my part, I’d like to illustrate some theoretical situations that would not be considered violations of your SLA to try to make this a little more real and concrete for those of you with infrastructure running in the public cloud. (AWS’ SLA is similar to HP’s for all these cases).

The Examples

Here we go. Each of these situations is not considered to be a breach of HP’s SLA:

  • The only VM you are running in the cloud blows up and is unrecoverable. This does not count against the SLA.
  • The only VM you are running in the cloud loses all network connectivity. This does not count against the SLA.
  • Every VM you have in every availability zone in the region blows up at exactly the same time. They are all unrecoverable. This does not count against the SLA (because you can launch replacement VMs for all of them).

Amazon is no better. The examples above all would be the same for AWS – not a breach of their SLA. Here’s one that would be covered by HP’s SLA, but not AWS’:

  • Let’s say you have 200 VMs in each of 3 different AZes in a region. In each of those AZes, 199 of those instances die and you cannot launch any replacement instances. So you are left with 3 of your 600 instances running (99.5% of your capacity is lost). This does not count against AWS’ SLA – though it would count against HP’s.
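A toy model makes that last scenario concrete. This encodes my simplified reading of an AWS-style SLA – the region only counts as “unavailable” when you have nothing running *and* can’t launch replacements anywhere – not the actual contract language:

```python
from dataclasses import dataclass

@dataclass
class AZ:
    running: int      # your instances still up in this AZ
    can_launch: bool  # can you launch replacements here?

def region_unavailable(azs: list) -> bool:
    """Simplified AWS-style test: the region counts as unavailable only
    when every AZ has nothing running AND nothing launchable."""
    return all(az.running == 0 and not az.can_launch for az in azs)

# The scenario from the text: 199 of 200 instances dead in each of 3 AZes,
# no replacements launchable - 99.5% of capacity gone.
azs = [AZ(running=1, can_launch=False) for _ in range(3)]
print(region_unavailable(azs))  # False - not a breach
```

One instance limping along per AZ keeps the whole region "available" in SLA terms, no matter how much capacity you have lost.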

What to do?

I’m really not ragging on public cloud here. I don’t think SLAs have much value as contracts. What they are good for is helping you do risk management. A provider is going to sum up lots of information about how they’ve designed and architected their services in their SLA. Think about new cars: if one comes with a 12 month limited warranty and another comes with a 10 year bumper to bumper warranty, doesn’t that make you think they are designed and built differently? So both HP’s and AWS’ SLAs tell you a lot about how their systems are architected. Mostly they tell you that individual instances are ephemeral and you better account for that when running your applications on top of them.

This is also why any SLA that says “100% uptime” is completely useless, because that number is obviously divorced from the reality of the design and architecture of the system underneath the SLA. No system can achieve 100% uptime, and by pretending theirs does, service providers actually make it more difficult for their customers to do intelligent risk management.

My advice: make sure you understand your providers’ SLAs. Not because you’re going to use those SLAs to wrangle compensation out of them for outages (maybe I’ll post later about how such compensation is also a joke). You want to understand your provider’s SLAs so that you can do intelligent risk management for yourself. Knowledge is power. Get it and use it!

Incident Management and Problem Management – Web Ops style

I’ve noticed that a lot of people find this blog by searching for “incident management versus problem management.” This puzzled me, because I figured there must be way better places on the internets to find that answer. So I looked at the google results for that search and read the top five or so. And now I get it. Because those all suck. They have too much ITIL. And they’re too focused on service desk setups versus web ops setups. So let me try to put this in the simplest terms I can:

  • Incident Management is about fixing something that’s broken
  • Problem Management is about fixing something that’s broken

Confusing, right? Well, the two really are different. Let’s start with some analogies.

The Analogies

Let’s say your house is on fire. “Incident Management” is the firefighters putting out the fire. “Problem Management” is the investigation that happens afterwards to figure out why the fire started and what to do about it.

Let’s say you’re on a plane, and all the engines fail. “Incident Management” is the flight crew gliding the plane to an emergency landing. “Problem Management” is the investigation that happens afterwards to determine why those engines all failed and what should be changed in the future so that doesn’t happen again.

In a Web Ops world, “Incident Management” is what you do when all the servers in one of your datacenters have gone tango uniform – perhaps you take all the incoming traffic and redirect it over to a different datacenter. “Problem Management” would be figuring out what happened to all those servers – why did they fail together and what do you have to do to make sure that doesn’t happen again.

The Definitions

“Incident Management,” then, is when you are immediately addressing an urgent situation. Timeframes are generally short and pressure is high, and you’ll do whatever it takes to make things better – even things that in normal situations would be too costly, too risky, or against policy. For example, if your engines are out you might land your plane on a freeway – which is something you’d never do under normal circumstances. And importantly, you don’t do investigations into what caused things while you’re doing incident management. You really don’t care at that point why all your engines failed – you just want to get the plane down safely.

“Problem Management,” in contrast, is a more measured activity where you figure out what was really behind the incident and then what to do about it. The situation isn’t as urgent and the techniques are different. Root cause analysis is a key tool for Problem Management. And I don’t mean the mythical quest for a single root cause, I mean real root cause analysis. Which is a topic for another post.

The two are connected, of course. The inputs to Problem Management are The Facts: what happened when, who did what, what results did those actions have, etc. Those Facts came out of the Incident Management process, and ideally are recorded automatically without the people doing Incident Management having to do anything special – think of the “black box” that commercial aircraft have. In a Web Ops world, automatically recording crisis phone bridges and chatrooms can serve a similar purpose.
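In code, the “black box” idea is nothing fancier than an append-only, timestamped record of The Facts. Here’s a minimal sketch (in practice this is usually your chat transcript or a recorded bridge, not custom software):

```python
import time

class IncidentLog:
    """Append-only record of The Facts: who did what, when, with what result."""
    def __init__(self):
        self.facts = []

    def record(self, actor: str, action: str, result: str) -> None:
        # Capture the fact at the moment it happens; no editing later.
        self.facts.append((time.time(), actor, action, result))

    def timeline(self):
        # The facts in time order, ready for the Problem Management review.
        return sorted(self.facts)

log = IncidentLog()
log.record("alice", "failed traffic over to DC2", "error rate dropped")
log.record("bob", "rebooted db-03", "no effect")
print(len(log.timeline()))  # 2
```

The key property is that recording costs the responders nothing during the incident – the analysis happens afterwards, from the timeline.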

Who Cares?

It’s important to recognize that these are two different activities, and treat them as such. Most everyone does Incident Management, out of necessity (I mean, it’s not like you’re going to decide NOT to emergency land your plane when all your engines fail). What’s sometimes lacking, however, is the followup through Problem Management. It’s tempting after an incident to say “We know what happened, and what to do about it.” It’s tempting because it sounds right. And because it’s a lot easier than doing real root cause analysis, and a helluva lot easier than actually prioritizing and executing all the actions that are really required to address the underlying problems that led to the incident. But in the long run you’re way better off investing that time and effort.

Here’s how to tell if you need to step up your Problem Management efforts. Do you ever say or hear things like this?

  • “Oh yeah, I remember when this happened last time.”
  • “Call Bobby Ray, he fixed this before.”
  • “Why does this keep happening?”
  • “We’ve gotten really good at dealing with this problem.”

If so – try some Problem Management.

What to do?

  • Make sure you’re recording useful information from your incidents – get The Facts
  • Have an official post-mortem investigation process for your incidents (and use it).
  • Periodically review the actions you’ve identified to fix things and see how you’re doing at getting them accomplished.

You don’t need a separate Problem Management team. (In fact, I’d recommend against it). You don’t need some expensive Problem Management software. All you need is an incident, some data, and a desire to make things better. I know you’ve got that!

Is Ubuntu a good server OS? – firewall edition

Earlier, I posted about upstart and how it illustrated how Ubuntu is not a good operating system for running servers. Today I wanted to provide another example: UFW, the “Uncomplicated FireWall” that is installed by default on Ubuntu servers.

Linux firewalling and UFW

Firewalls in Linux are implemented with iptables. In a nutshell, iptables evaluates packets against lists of rules to determine whether they should be accepted or rejected. iptables is typically used on servers through scripts that set up all these rules the way you want them – these days this process is generally managed through whatever configuration management system you’re using (e.g. chef or puppet).  UFW is a tool distributed with Ubuntu that provides a different way of setting up these iptables rules, using command line tools.
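The evaluation model is easy to sketch in miniature. This toy ignores tables, chains, and connection tracking, but captures the first-match-wins behavior of an iptables chain:

```python
# Each rule is (predicate, verdict). The first matching rule wins;
# the chain's default policy applies if nothing matches - the same
# first-match model iptables uses.
def evaluate(packet: dict, rules: list, default: str = "DROP") -> str:
    for predicate, verdict in rules:
        if predicate(packet):
            return verdict
    return default

rules = [
    (lambda p: p["proto"] == "tcp" and p["dport"] == 22, "ACCEPT"),   # ssh
    (lambda p: p["proto"] == "tcp" and p["dport"] == 443, "ACCEPT"),  # https
]

print(evaluate({"proto": "tcp", "dport": 22}, rules))    # ACCEPT
print(evaluate({"proto": "tcp", "dport": 3306}, rules))  # DROP
```

The important consequence for what follows: whoever owns the rule list owns the firewall, which is exactly where UFW gets into trouble.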

How UFW works

UFW keeps its own set of iptables rules in /etc/ufw. When ufw is turned on, it flushes out all the existing iptables rules and replaces them with the rules from /etc/ufw. You can then use the command line tools to add ports to allow or deny, which will update both the running iptables rules as well as the copies stored in /etc/ufw.

Note that all UFW works with is what’s in /etc/ufw – it doesn’t know or care what the running iptables rules are.

A Totally Theoretical Example

Let’s pretend you have a big distributed system where the nodes are all running Ubuntu. It’s a big system, and it’s part of an even larger IT environment, so there’s a large IT support organization that’s staffed with people who were hired because they have a fair bit of experience with Linux systems – but some of those folks have more at-home, desktop Linux experience than datacenter, server Linux experience. Also, these folks don’t know the ins and outs of the design of this particular distributed system because they are responsible for all of the varied IT environments at this large organization. The hero in our story is one of these selfless and courageous sysadmins. A report comes in from the users of this distributed system that they’re having trouble reaching some of the resources it provides. Initial troubleshooting leads our hero to posit that the problem is a misconfigured firewall. His experience is with Ubuntu and he’s always used ufw, so his first step is to disable the firewall and see if that helps:

ufw disable

There’s no change, so he figures that this is unrelated, so he turns the firewall back on:

ufw enable

Now he moves on to further troubleshooting.

Here’s the problem: The ‘ufw disable’ command above didn’t actually do anything, as ufw wasn’t enabled to begin with. No harm, no foul. However, running ‘ufw enable’ turns on ufw, and configures it in default mode – which denies all incoming connections. As the server in question provides many key network services for this distributed system, suddenly all of the other nodes in this system cannot reach these central services, and the entire distributed system starts to fall over!

So what went wrong here?

UFW is not a front-end for iptables

It may seem like UFW is a front-end for iptables – because you run ufw commands and it makes changes to iptables rules. But what it’s really doing is throwing away whatever is currently in iptables, and replacing it with what’s been configured in ufw. A real front-end would allow you to inspect and modify iptables stuff that already exists.

What this means for our situation is that lots of carefully-crafted iptables rules for access, masquerading, etc. got silently nuked when the ‘ufw enable’ command was run. UFW does not take into account the starting state of the system before it does stuff. That’s bad on a server.

UFW’s defaults are appropriate for desktops, not servers

I love ‘deny all’ as a default for firewalls. I could even concede that ‘deny all’ is the best default for servers, although I think that could be debated. However, let’s look at some of the specifics that ufw’s defaults do allow:

  • ICMP (this is OK)
  • multicast DNS (hmmm….)
  • UPnP (!!)

OK, this should go without saying, but does it seem to you like your server operating system should enable UPnP discovery by default? Makes sense for a desktop (maybe) – but not for a server. No way, no how.

UFW is an Ubuntu-specific thing (and makes iptables complicated)

If you have a ‘firewall administrator’ on staff, and you ask them to look at a server, how likely is it that they’re going to know anything about UFW? IMO, very unlikely. They’re probably familiar with iptables, and the first thing they’ll do is look at iptables directly, and they’re going to see dozens of rules that point at other rules and they’re going to be very confused. At best, they’ll refuse to touch it and give you a big lecture about how if packets get to your server’s TCP/IP stack before hitting your firewall, you’re doing it wrong. At worst, they’ll start making changes which due to the complexity of these rules will likely result in more problems, not less.

Make your Ubuntu servers better

If you are going to run servers on Ubuntu, here’s one simple suggestion for how to make them better:

apt-get remove ufw

After this, at least your beleaguered sysadmins will have one less possible way to screw up your systems while doing their jobs.

Divergent, Convergent, and Congruent Infrastructures

There’s a very powerful concept that informs most of my thinking about system configuration management: the concept of divergent vs convergent vs congruent infrastructures. I first encountered this concept in Steve Traugott’s seminal paper Why Order Matters: Turing Equivalence in Systems Administration. I highly recommend reading the entire paper if you haven’t before – or even reviewing it if it’s been a while. Steve and Mark Burgess (of cfengine fame) are the two people who IMO brought the first real rigor to analysis of systems configuration management – their stuff is required reading for anyone in this space. Here’s the idea:

Divergent Infrastructure
When multiple systems are managed by hand by individual (well-meaning) sysadmins in a “classical” IT shop, they invariably become out of sync over time. This is known as “configuration drift” and is impossible to prevent if any changes to the systems are allowed at all. Many IT installations go through a divergent phase after their installation until it is recognized as a problem, which then leads to an attempt to create:
Convergent Infrastructure
After configuration drift is recognized as a problem, people attempt to bring the systems back to a common standard. passwd files may be unified, syslog configs may be made identical, etc. During this period, the configurations “converge” towards a common state – however it is very difficult (maybe even impossible) to reach a point where once-divergent systems are identical – what you hope for is an asymptotic approach to identity. Contrast this with
Congruent Infrastructure
A different way of managing configurations is to enforce that all systems are configured identically, so that configuration state remains the same between systems over time. This can be done in multiple ways and in my mind is one of the core goals of any system configuration management system.

This is illustrated in the following graph[1].

[Graph showing system divergence, convergence, and congruence]

In the beginning, the state of the systems is getting more and more different – they are diverging. Then someone starts applying standards, templates, etc and the configurations begin to converge. Finally a congruent mechanism is implemented and the systems are enforced to identical state during the congruent phase.
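The distinction between hoping for convergence and enforcing congruence can be sketched with a toy model, where each “system” is just a dict of configuration keys:

```python
# The single declared configuration every system should have.
DESIRED = {"syslog": "remote", "ntp": "pool.example.org", "sshd": "keys-only"}

def drift(system: dict, key: str, value: str) -> None:
    """A well-meaning sysadmin makes a one-off change by hand."""
    system[key] = value

def enforce(system: dict) -> None:
    """Congruent management: overwrite state with the declared state,
    removing anything that isn't declared."""
    system.clear()
    system.update(DESIRED)

a = dict(DESIRED)
b = dict(DESIRED)
drift(a, "ntp", "time.local")       # the systems diverge...
drift(b, "extra_pkg", "installed")
print(a == b)                       # False
for s in (a, b):                    # ...until congruence is enforced
    enforce(s)
print(a == b)                       # True
```

The `clear()` before `update()` is the crucial bit: merely re-applying the desired keys (convergence) would have left `extra_pkg` behind, and the systems would only asymptotically approach identity.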

Much more on this to come – this is just setting the stage….

1. That graph was generated with the neat-o stuff over at Google Charts. Mostly, anyway – I had to load it into Skitch to put the text annotations and arrows on. Google Charts allows annotations of datapoints, but I couldn’t find a good way to get it to identify a particular range of points and annotate those as a whole.