What Public Cloud SLAs ARE good for

When HP announced their cloud was going GA, lots of the coverage said that the cloud came with a 99.95% SLA. While that is technically accurate, it is also very misleading. People may assume (and even some coverage implies) that if you launch a virtual server inside HP’s cloud, HP is guaranteeing that instance will have 99.95% uptime. That is emphatically not true. What they are really guaranteeing is that the instance will be up or you will be able to launch a replacement instance. That’s a really important “or.” Lydia Leong over at Gartner has a much more detailed analysis of this that you probably want to read. For my part, I’d like to illustrate some theoretical situations that would not be considered violations of your SLA, to make this a little more real and concrete for those of you with infrastructure running in the public cloud. (AWS’ SLA is similar to HP’s for all of these cases.)

The Examples

Here we go. None of these situations is considered a breach of HP’s SLA:

  • The only VM you are running in the cloud blows up and is unrecoverable. This does not count against the SLA.
  • The only VM you are running in the cloud loses all network connectivity. This does not count against the SLA.
  • Every VM you have in every availability zone in the region blows up at exactly the same time. They are all unrecoverable. This does not count against the SLA (because you can launch replacement VMs for all of them).

Amazon is no better. The examples above would all play out the same way for AWS – not a breach of their SLA. Here’s one that would be covered by HP’s SLA, but not AWS’:

  • Let’s say you have 200 VMs in each of 3 different AZes in a region. In each of those AZes, 199 of your instances die and you cannot launch any replacement instances. So you are left with 3 of your 600 instances running (99.5% of your capacity is lost). This does not count against AWS’ SLA – though it would count against HP’s.

What to do?

I’m really not ragging on public cloud here. I don’t think SLAs have much value as contracts. What they are good for is helping you do risk management. A provider sums up a lot of information about how they’ve designed and architected their services in their SLA. Think about new cars: if one comes with a 12-month limited warranty and another comes with a 10-year bumper-to-bumper warranty, doesn’t that make you think they are designed and built differently? So both HP’s and AWS’ SLAs tell you a lot about how their systems are architected. Mostly they tell you that individual instances are ephemeral, and you’d better account for that when running your applications on top of them.

This is also why any SLA that says “100% uptime” is completely useless: that number is obviously divorced from the reality of the design and architecture of the system underneath the SLA. No system can achieve 100% uptime, and by pretending theirs does, service providers actually make it more difficult for their customers to do intelligent risk management.

My advice: make sure you understand your providers’ SLAs. Not because you’re going to use those SLAs to wrangle compensation out of them for outages (maybe I’ll post later about how such compensation is also a joke), but so that you can do intelligent risk management for yourself. Knowledge is power. Get it and use it!


Incident Management and Problem Management – Web Ops style

I’ve noticed that a lot of people find this blog by searching for “incident management versus problem management.” This puzzled me, because I figured there must be way better places on the internets to find that answer. So I looked at the google results for that search and read the top five or so. And now I get it. Because those all suck. They have too much ITIL. And they’re too focused on service desk setups versus web ops setups. So let me try to put this in the simplest terms I can:

  • Incident Management is about fixing something that’s broken
  • Problem Management is about fixing something that’s broken

Confusing, right? Well, the two really are different. Let’s start with some analogies.

The Analogies

Let’s say your house is on fire. “Incident Management” is the firefighters putting out the fire. “Problem Management” is the investigation that happens afterwards to figure out why the fire started and what to do about it.

Let’s say you’re on a plane, and all the engines fail. “Incident Management” is the flight crew gliding the plane to an emergency landing. “Problem Management” is the investigation that happens afterwards to determine why those engines all failed and what should be changed in the future so that doesn’t happen again.

In a Web Ops world, “Incident Management” is what you do when all the servers in one of your datacenters have gone tango uniform – perhaps you take all the incoming traffic and redirect it over to a different datacenter. “Problem Management” would be figuring out what happened to all those servers – why did they fail together and what do you have to do to make sure that doesn’t happen again.

The Definitions

“Incident Management,” then, is when you are immediately addressing an urgent situation. Timeframes are generally short and pressure is high, and you’ll do whatever it takes to make things better – even things that in normal situations would be too costly, too risky, or against policy. For example, if your engines are out you might land your plane on a freeway – which is something you’d never do under normal circumstances. And importantly, you don’t do investigations into what caused things while you’re doing incident management. You really don’t care at that point why all your engines failed – you just want to get the plane down safely.

“Problem Management,” in contrast, is a more measured activity where you figure out what was really behind the incident and then what to do about it. The situation isn’t as urgent and the techniques are different. Root cause analysis is a key tool for Problem Management. And I don’t mean the mythical quest for a single root cause, I mean real root cause analysis. Which is a topic for another post.

The two are connected, of course. The inputs to Problem Management are The Facts: what happened when, who did what, what results those actions had, etc. Those Facts come out of the Incident Management process, and ideally are recorded automatically, without the people doing Incident Management having to do anything special – think of the “black box” that commercial aircraft have. In a Web Ops world, automatically recording crisis phone bridges and chatrooms can serve a similar purpose.

Who Cares?

It’s important to recognize that these are two different activities, and treat them as such. Most everyone does Incident Management, out of necessity (I mean, it’s not like you’re going to decide NOT to emergency land your plane when all your engines fail). What’s sometimes lacking, however, is the followup through Problem Management. It’s tempting after an incident to say “We know what happened, and what to do about it.” It’s tempting because it sounds right. And because it’s a lot easier than doing real root cause analysis, and a helluva lot easier than actually prioritizing and executing all the actions that are really required to address the underlying problems that led to the incident. But in the long run you’re way better off investing that time and effort.

Here’s how to tell whether you need to step up your Problem Management efforts. Do you ever say or hear things like this?

  • “Oh yeah, I remember when this happened last time.”
  • “Call Bobby Ray, he fixed this before.”
  • “Why does this keep happening?”
  • “We’ve gotten really good at dealing with this problem.”

If so – try some Problem Management.

What to do?

  • Make sure you’re recording useful information from your incidents – get The Facts.
  • Have an official post-mortem investigation process for your incidents (and use it).
  • Periodically review the actions you’ve identified to fix things and see how you’re doing at getting them accomplished.

You don’t need a separate Problem Management team. (In fact, I’d recommend against it). You don’t need some expensive Problem Management software. All you need is an incident, some data, and a desire to make things better. I know you’ve got that!


Is Ubuntu a good server OS? – firewall edition

Earlier, I posted about upstart and how it illustrates that Ubuntu is not a good operating system for running servers. Today I wanted to provide another example: UFW, the “Uncomplicated FireWall” that is installed by default on Ubuntu servers.

Linux firewalling and UFW

Firewalls in Linux are implemented with iptables. In a nutshell, iptables evaluates packets against lists of rules to determine whether they should be accepted or rejected. On servers, iptables is typically driven by scripts that set up all these rules the way you want them – these days that process is generally managed through whatever configuration management system you’re using (e.g. chef or puppet). UFW is a tool distributed with Ubuntu that provides a different way of setting up these iptables rules, using command line tools.
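
To make that concrete, here’s a bare-bones sketch of the kind of rules script a configuration management tool might drop onto a server. It’s illustrative only – the ports and policy here are assumptions on my part, not a recommendation:

#!/bin/sh
# Illustrative iptables setup - real versions are usually templated by chef/puppet
iptables -F                                                # flush existing rules
iptables -P INPUT DROP                                     # default deny inbound
iptables -A INPUT -i lo -j ACCEPT                          # allow loopback
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j ACCEPT              # allow ssh
iptables -A INPUT -p tcp --dport 80 -j ACCEPT              # allow http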

How UFW works

UFW keeps its own set of iptables rules in /etc/ufw. When ufw is turned on, it flushes all the existing iptables rules and replaces them with the rules from /etc/ufw. You can then use the command line tools to add ports to allow or deny, which updates both the running iptables rules and the copies stored in /etc/ufw.

Note that all UFW works with is what’s in /etc/ufw – it doesn’t know or care what the running iptables rules are.
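
For reference, day-to-day usage looks something like this (the ports are just examples):

ufw allow 22/tcp       # permit ssh
ufw deny 3306/tcp      # refuse mysql from the outside
ufw status verbose     # show what ufw thinks is configured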

A Totally Theoretical Example

Let’s pretend you have a big distributed system where the nodes are all running Ubuntu. It’s a big system, and it’s part of an even larger IT environment, so there’s a large IT support organization staffed with people who were hired because they have a fair bit of experience with Linux systems – but some of those folks have more at-home, desktop Linux experience than datacenter, server Linux experience. Also, these folks don’t know the ins and outs of the design of this particular distributed system, because they are responsible for all of the varied IT environments at this large organization. The hero in our story is one of these selfless and courageous sysadmins. A report comes in from the users of this distributed system that they’re having trouble reaching some of the resources it provides. Initial troubleshooting leads our hero to posit that the problem is a misconfigured firewall. His experience is with Ubuntu and he’s always used ufw, so his first step is to disable the firewall and see if that helps:

ufw disable

There’s no change, so he figures the firewall is unrelated and turns it back on:

ufw enable

Now he moves on to further troubleshooting.

Here’s the problem: The ‘ufw disable’ command above didn’t actually do anything, as ufw wasn’t enabled to begin with. No harm, no foul. However, running ‘ufw enable’ turns on ufw, and configures it in default mode – which denies all incoming connections. As the server in question provides many key network services for this distributed system, suddenly all of the other nodes in this system cannot reach these central services, and the entire distributed system starts to fall over!
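
For what it’s worth, a quick pre-flight check would have shown both that ufw was inactive and that there were already iptables rules loaded that were worth preserving – something like:

ufw status verbose     # would have reported "Status: inactive"
iptables -S            # dumps the rules actually loaded in the kernel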

So what went wrong here?

UFW is not a front-end for iptables

It may seem like UFW is a front-end for iptables – because you run ufw commands and it makes changes to iptables rules. But what it’s really doing is throwing away whatever is currently in iptables and replacing it with what’s been configured in ufw. A real front-end would let you inspect and modify the iptables rules that already exist.

How this plays into this situation is that lots of carefully-crafted iptables rules for access, masquerading, etc. were silently nuked when the ‘ufw enable’ command was run. UFW does not take the starting state of the system into account before it acts. That’s bad on a server.
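
If you do have to experiment on a box like this, at minimum snapshot the live rules first so you can put them back. A rough sketch (the backup path is arbitrary):

iptables-save > /root/iptables.before-ufw      # snapshot the live rules
# ...poke at ufw, or whatever else you need to try...
iptables-restore < /root/iptables.before-ufw   # roll back to the snapshot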

UFW’s defaults are appropriate for desktops, not servers

I love ‘deny all’ as a default for firewalls. I could even concede that ‘deny all’ is the best default for servers, although I think that could be debated. However, let’s look at some of the specifics that ufw’s defaults do allow:

  • ICMP (this is OK)
  • multicast DNS (hmmm….)
  • UPnP (!!)

OK, this should go without saying, but does it seem to you like your server operating system should enable UPnP discovery by default? Makes sense for a desktop (maybe) – but not for a server. No way, no how.
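
Don’t take my word for it – you can see these allowances for yourself in ufw’s shipped rule files (paths and exact contents vary a bit by Ubuntu release):

grep -i -A2 -E 'mdns|upnp' /etc/ufw/before.rules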

UFW is an Ubuntu-specific thing (and makes iptables complicated)

If you have a ‘firewall administrator’ on staff and you ask them to look at a server, how likely is it that they’re going to know anything about UFW? IMO, very unlikely. They’re probably familiar with iptables, so the first thing they’ll do is look at iptables directly – and they’re going to see dozens of rules that point at other rules, and they’re going to be very confused. At best, they’ll refuse to touch it and give you a big lecture about how if packets get to your server’s TCP/IP stack before hitting your firewall, you’re doing it wrong. At worst, they’ll start making changes, which – given the complexity of these rules – will likely result in more problems, not fewer.

Make your Ubuntu servers better

If you are going to run servers on Ubuntu, here’s one simple suggestion for how to make them better:

apt-get remove ufw

After this, at least your beleaguered sysadmins will have one less possible way to screw up your systems while doing their jobs.


Is Ubuntu a good server OS?

My “Openstack IRL” presentation informs the audience that we at Cloudscaling use Ubuntu for the systems we deploy. When I present this live and we get to that slide, I usually say something like this:

We use Ubuntu for our systems. This is somewhat ironic because at least once a month in our office we have a big discussion about how terrible Ubuntu is as a server operating system…

Funny. But is it true? Is Ubuntu a terrible operating system for servers? Let’s look at one data point: Upstart.

Upstart’s raisons d’être

My distillation of the reasons upstart was created to replace the traditional rcX.d init structure from SysV and BSD is:

  1. the traditional system is serial rather than parallel, meaning reboots take longer than they have to – and people reboot their systems a lot more these days
  2. the traditional system doesn’t deal well with new hardware being plugged in and out on a regular basis – and people are constantly plugging stuff in and out of their systems these days

Do those sound like conditions that affect your servers? Me neither. They are desktop-centric concerns. And there’s nothing wrong with that – unless you’re trying to run a server.

Why does it matter?

From the perspective of a crotchety old-time unix sysadmin (a hypothetical one of course!), upstart is a PITA. Let me try to illustrate why:

Checking what order stuff starts in

In the traditional world, here’s what you have to do to find out what order things start in:

ls /etc/rc2.d

That’s it. The output of that command provides you with everything you need to see at a glance that (for instance) apache is going to start after syslog.
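
For instance, the listing might look something like this (a made-up example – exact names and numbers depend on what’s installed):

README       S10rsyslog   S20apache2   S20ssh   S99rc.local

One glance at the numbers tells you rsyslog (S10) comes up before apache (S20).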

Here’s how you do it in the upstart world:

Well, I wish I could give you a simple way to do that, but you can’t. You have to open up the conf file in /etc/init for the service you’re interested in and look at what events it depends on for starting. If one of those events is the startup of another service, then you know your service will start after it. However, if there is no dependency listed on another service, then you don’t know what order they will start in. Yours might start first, the other one might start first, or they may both be starting up at the same time. You don’t know, and it isn’t guaranteed that they will start in the same order every time the system boots. This makes crotchety old unix sysadmins nervous, and leads to the second point….
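
The closest thing to a one-liner I know of is grepping the start conditions out of all the job files at once – it puts everything on one screen, even though it still doesn’t tell you the actual order:

grep '^start on' /etc/init/*.conf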

Defining what order stuff starts in

In the traditional world, this is done with two-digit numbers. Each script has a two-digit number as part of its filename in /etc/rcX.d, and the scripts are run in the order of those numbers. So if you want one script to start later than another, just change its number to be larger than the other one’s. Easy to understand, and all you have to know to do it is how to use mv. And there are no hard dependencies here – if you build a server that doesn’t contain a particular service, that init file won’t be installed, none of the other init files will be affected, and startup will go as you expect.
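
For example, assuming apache’s link happens to be named S20apache2 on your box (names vary), pushing it later in the boot order is just:

mv /etc/rc2.d/S20apache2 /etc/rc2.d/S95apache2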

In the upstart world, you do this by specifying startup dependencies between the jobs that start services. Each job emits an event when it completes, which you can reference in the conf files for other services. So say you have two services, s1 and s2, and you want to be sure s2 starts after s1. You do this by putting a line like this into /etc/init/s2.conf:

start on started s1

So aside from the crotchety old sysadmin spending 45 minutes perusing ‘man upstart’ to figure this out in the first place, the problem you run into here is with distributed systems that can be deployed in varied configurations. For example, sometimes s1 and s2 are on the same node, and sometimes they are on different nodes. If you put the above line into /etc/init/s2.conf by default, guess what happens if you deploy in a config where s1 isn’t on the same node? s2 will never ever ever start.
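
One workaround I’ve seen (and I’m not necessarily endorsing it) is to add an alternate trigger, so s2 still comes up on nodes where s1 doesn’t exist – at the cost of giving up any ordering guarantee when both conditions can fire:

start on (started s1 or runlevel [2345])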

Summary

My take on this is that upstart is a great thing for desktop systems. For server systems, it’s adding a bunch of complexity and brittleness without providing any actual benefits. And it’s one check mark in the “ubuntu isn’t a good OS for servers” category.


Openstack LA meetup

Last week I took a little trip down to LA to attend the #OSLAX meetup. I’m a little ashamed to admit that I was pleasantly surprised to find a big group of knowledgeable, friendly, and fun Openstack enthusiasts. (Us NorCal types are kinda snobby that way). I had the opportunity to present my talk on “Openstack IRL,” which seemed to be very well received. The video of my talk is up on vimeo if you’re interested in hearing the actual presentation, or you can see the slides on Slideshare. The message was the same as when I presented to #lspe earlier this year – you can build a production cloud using Openstack (we’ve done it!). This time around I was able to provide a little more detail about some of the specifics of what Cloudscaling is doing in our Open Cloud System, especially around resiliency for internal Openstack components, networking, and AWS compatibility.

At about 9:44 in the presentation, I’m talking about how excited network engineers get when they can build a CLOS network that provides loads of bisectional bandwidth. I have been assured multiple times that what everyone is actually hearing is “bisexual bandwidth” – I’m not sure exactly what that is but it’s probably more interesting than a network diagram. Check it out for yourself and see what you think I’m saying….



Divergent, Convergent, and Congruent Infrastructures

There’s a very powerful concept that informs most of my thinking about system configuration management: the concept of divergent vs convergent vs congruent infrastructures. I first encountered this concept in Steve Traugott’s seminal paper Why Order Matters: Turing Equivalence in Systems Administration. I highly recommend reading the entire paper if you haven’t before – or even reviewing it if it’s been a while. Steve and Mark Burgess (of cfengine fame) are the two people who IMO brought the first real rigor to analysis of systems configuration management – their stuff is required reading for anyone in this space. Here’s the idea:

Divergent Infrastructure
When multiple systems are managed by hand by individual (well-meaning) sysadmins in a “classical” IT shop, they invariably get out of sync over time. This is known as “configuration drift” and is impossible to prevent if any changes to the systems are allowed at all. Many IT shops go through a divergent phase after initial installation until the drift is recognized as a problem, which then leads to an attempt to create:
Convergent Infrastructure
After configuration drift is recognized as a problem, people attempt to bring the systems back to a common standard. passwd files may be unified, syslog configs may be made identical, etc. During this period the configurations “converge” towards a common state – however, it is very difficult (maybe even impossible) to reach a point where once-divergent systems are truly identical. What you hope for is an asymptotic approach to identical configuration. Contrast this with:
Congruent Infrastructure
A different way of managing configurations is to enforce that all systems are configured identically, so that configuration state remains the same between systems over time. This can be done in multiple ways and in my mind is one of the core goals of any system configuration management system.

This is illustrated in the following graph [1]:
[Graph showing system divergence, convergence, and congruence]

In the beginning, the state of the systems is getting more and more different – they are diverging. Then someone starts applying standards, templates, etc., and the configurations begin to converge. Finally, a congruence mechanism is implemented and the systems are held in an identical state throughout the congruent phase.
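
To make “held in an identical state” concrete, here’s a trivial sketch of congruent-style enforcement in shell – the golden-copy path and the ntp example are made up, and in practice you’d let cfengine/chef/puppet do this for you:

#!/bin/sh
# Enforce that the local config exactly matches the authoritative copy,
# and only restart the service when something actually changed.
# /srv/golden/ntp.conf is a hypothetical "golden" master copy.
if ! cmp -s /srv/golden/ntp.conf /etc/ntp.conf; then
    cp /srv/golden/ntp.conf /etc/ntp.conf
    service ntp restart
fi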

Much more on this to come – this is just setting the stage….

[1] That graph was generated with the neat-o stuff over at Google Charts. Mostly, anyway – I had to load it into Skitch to put the text annotations and arrows on. Google Charts allows annotations of datapoints, but I couldn’t find a good way to get it to identify a particular range of points and annotate those as a whole.


Incident Management vs Problem Management

Matt Groeninger over at disruptive-it.com wrote a very thought-provoking blog post about the Skype outage from a while back and how it demonstrates that sometimes you need to understand root cause in order to restore service during incidents in complex systems. I suggest you read his post, and I completely agree with his point (as I said in a comment there). However, it also got me thinking about ITIL, Incident Management, Problem Management, etc. Matt says:

the Skype outage demonstrates that some attempts to restore services will require communication/analysis from your problem management group.

There is a tendency among ITIL practitioners to take the functional descriptions in ITIL (Problem Management, Change Management, Incident Management, etc.) and formalize them into organizational structure, and I see a little of that here. Root cause analysis is not some forbidden ritual that is the purview of only one particular team or role. So while I agree with his larger point that to resolve incidents you need to understand causes, I don’t agree that this means you need to involve a problem management group in the incident. In fact….

Incident Management often requires analysis of cause

You may have an Incident Management team, but even if you don’t, certainly someone is providing that function, and they are responsible for restoring service when there is an incident. But in order to restore service, you often need to at least theorize about what is wrong before you can fix it. This is a standard part of any troubleshooting methodology. Take a simplified example of a single overloaded server. To fix it, you need to speculate on what could be causing the load and do some investigation based on that speculation. For instance, maybe there’s a hung process chewing up all the resources – to test that, you do a ps or top and look for something using all the CPU. If that’s not it, maybe the disks are overloaded, so you do an iostat and look at wait times. Maybe that’s not it either, so you wonder whether an external resource is slow and look at connection pools. The point is that in any complex system, the “restoration” role is going to involve the rapid creation, testing, and discarding of theories about cause. These theories may not necessarily be about root cause, but they are certainly going to be about proximate cause and potentially a handful of contributing causes. In fact, I would say that nearly all interesting problems cannot be solved without gaining at least some idea of their causes – and any that can should have their solution automated anyway, so no one has to be bothered with them in the future.
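
For the overloaded-server example, that rapid-fire testing of theories often boils down to a handful of standard commands, something like:

ps aux --sort=-%cpu | head     # is one process eating all the CPU?
iostat -x 5 2                  # are the disks saturated? (check await / %util)
ss -s                          # socket summary - a rough read on connection pools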

Only SOME causes should be considered during restoration

What you should not do in any service restoration effort is do any investigation that won’t directly help you solve the problem. As an example, during an incident you often find that some resource is not in the state it should be in – it is very tempting in those moments to ask “Why isn’t this in the state it should be in?” and start looking into what changes happened when, who did them, etc. This temptation must be resisted to keep the focus on restoring the service, as the fix is likely to be simply putting the resource back in the state it should be in, without needing to know who did what when to get it in the bad state. The proper time to look into those particular causes is during Problem Management after the incident is past.

Successful Incident Management requires either smart tools or smart people (ideally you have both)

If you have a complex system that requires high availability, you cannot get by with ticket monkeys and runbooks. (Unfortunately, many people have tried…) You need a combination of smart, experienced, knowledgeable people and useful tools that can analyze and describe the systems involved. If you have great tools you can get by with people who have less knowledge and less experience. If you have great people, you can get by with fewer and/or crappier tools. Ideally however, you have both.

If you want great tools, the best way I know of to get them is to have your smartest, most experienced, most knowledgeable people be intimately involved in their creation. The tools will empower your Incident Management team, and your Problem Management team should have as one of its primary charters the continual evaluation and improvement of those tools.