Monthly Archives: November 2012

Incident Management and Problem Management – Web Ops style

I’ve noticed that a lot of people find this blog by searching for “incident management versus problem management.” This puzzled me, because I figured there must be way better places on the internets to find that answer. So I looked at the google results for that search and read the top five or so. And now I get it. Because those all suck. They have too much ITIL. And they’re too focused on service desk setups versus web ops setups. So let me try to put this in the simplest terms I can:

  • Incident Management is about fixing something that’s broken
  • Problem Management is about fixing something that’s broken

Confusing, right? Well, the two really are different. Let’s start with some analogies.

The Analogies

Let’s say your house is on fire. “Incident Management” is the firefighters putting out the fire. “Problem Management” is the investigation that happens afterwards to figure out why the fire started and what to do about it.

Let’s say you’re on a plane, and all the engines fail. “Incident Management” is the flight crew gliding the plane to an emergency landing. “Problem Management” is the investigation that happens afterwards to determine why those engines all failed and what should be changed in the future so that doesn’t happen again.

In a Web Ops world, “Incident Management” is what you do when all the servers in one of your datacenters have gone tango uniform – perhaps you take all the incoming traffic and redirect it over to a different datacenter. “Problem Management” would be figuring out what happened to all those servers – why did they fail together and what do you have to do to make sure that doesn’t happen again.

The Definitions

“Incident Management,” then, is when you are immediately addressing an urgent situation. Timeframes are generally short and pressure is high, and you’ll do whatever it takes to make things better – even things that in normal situations would be too costly, too risky, or against policy. For example, if your engines are out you might land your plane on a freeway – which is something you’d never do under normal circumstances. And importantly, you don’t do investigations into what caused things while you’re doing incident management. You really don’t care at that point why all your engines failed – you just want to get the plane down safely.

“Problem Management,” in contrast, is a more measured activity where you figure out what was really behind the incident and then what to do about it. The situation isn’t as urgent and the techniques are different. Root cause analysis is a key tool for Problem Management. And I don’t mean the mythical quest for a single root cause, I mean real root cause analysis. Which is a topic for another post.

The two are connected, of course. The inputs to Problem Management are The Facts: what happened when, who did what, what results did those actions have, etc. Those Facts came out of the Incident Management process, and ideally are recorded automatically without the people doing Incident Management having to do anything special – think of the “black box” that commercial aircraft have. In a Web Ops world, automatically recording crisis phone bridges and chatrooms can serve a similar purpose.

Who Cares?

It’s important to recognize that these are two different activities, and treat them as such. Most everyone does Incident Management, out of necessity (I mean, it’s not like you’re going to decide NOT to emergency land your plane when all your engines fail). What’s sometimes lacking, however, is the followup through Problem Management. It’s tempting after an incident to say “We know what happened, and what to do about it.” It’s tempting because it sounds right. And because it’s a lot easier than doing real root cause analysis, and a helluva lot easier than actually prioritizing and executing all the actions that are really required to address the underlying problems that led to the incident. But in the long run you’re way better off investing that time and effort.

Here’s how to tell you need to step up your Problem Management efforts. Do you ever say or hear things like this?

  • “Oh yeah, I remember when this happened last time.”
  • “Call Bobby Ray, he fixed this before.”
  • “Why does this keep happening?”
  • “We’ve gotten really good at dealing with this problem.”

If so – try some Problem Management.

What to do?

  • Make sure you’re recording useful information from your incidents – get The Facts
  • Have an official post-mortem investigation process for your incidents (and use it).
  • Periodically review the actions you’ve identified to fix things and see how you’re doing at getting them accomplished.

You don’t need a separate Problem Management team. (In fact, I’d recommend against it). You don’t need some expensive Problem Management software. All you need is an incident, some data, and a desire to make things better. I know you’ve got that!

Is Ubuntu a good server OS? – firewall edition

Earlier, I posted about upstart and how it illustrated how Ubuntu is not a good operating system for running servers. Today I wanted to provide another example: UFW, the “Uncomplicated FireWall” that is installed by default on Ubuntu servers.

Linux firewalling and UFW

Firewalls in Linux are implemented with iptables. In a nutshell, iptables evaluates packets against lists of rules to determine whether they should be accepted or rejected. iptables is typically used on servers through scripts that set up all these rules the way you want them – these days this process is generally managed through whatever configuration management system you’re using (e.g. chef or puppet).  UFW is a tool distributed with Ubuntu that provides a different way of setting up these iptables rules, using command line tools.

How UFW works

UFW keeps its own set of iptables rules in /etc/ufw. When ufw is turned on, it flushes out all the existing iptables rules and replaces them with the rules from /etc/ufw. You can then use the command line tools to add ports to allow or deny, which will update both the running iptables rules as well as the copies stored in /etc/ufw.

Note that all UFW works with is what’s in /etc/ufw – it doesn’t know or care what the running iptables rules are.

A Totally Theoretical Example

Let’s pretend you have a big distributed system where the nodes are all running Ubuntu. It’s a big system, and it’s part of an even larger IT environment, so there’s a large IT support organization that’s staffed with people who were hired because they have fair bit of experience with Linux systems – but some of those folks have more at-home, desktop Linux experience vs datacenter, server Linux experience. Also these folks don’t know the ins and outs of the design of this particular distributed system because they are responsible for all of the varied IT environments at this large organization. The hero in our story is one of these selfless and courageous sysadmins. A report comes in from the users of this distributed system that they’re having trouble reaching some of the resources it provides. Initial troubleshooting leads our hero to posit that the problem is a misconfigured firewall. His experience is with Ubuntu and he’s always used ufw, so his first step is to disable the firewall and see if that helps:

ufw disable

There’s no change, so he figures that this is unrelated, so he turns the firewall back on:

ufw enable

Now he moves on to further troubleshooting.

Here’s the problem: The ‘ufw disable’ command above didn’t actually do anything, as ufw wasn’t enabled to begin with. No harm, no foul. However, running ‘ufw enable’ turns on ufw, and configures it in default mode – which denies all incoming connections. As the server in question provides many key network services for this distributed system, suddenly all of the other nodes in this system cannot reach these central services, and the entire distributed system starts to fall over!

So what went wrong here?

UFW is not a front-end for iptables

It may seem like UFW is a front-end for iptables – because you run ufw commands and it makes changes to iptables rules. But what it’s really doing is throwing away whatever is currently in iptables, and replacing it with what’s been configured in ufw. A real front-end would allow you to inspect and modify iptables stuff that already exists.

How this plays into this situation is that lots of carefully-crafted iptables rules for access, masquerading, etc got silently nuked when the ‘ufw enable’ command was run. It does not take into account the starting state of the system before it does stuff. That’s bad on a server.

UFW’s defaults are appropriate for desktops, not servers

I love ‘deny all’ as a default for firewalls. I could even concede that ‘deny all’ is the best default for servers, although I think that could be debated. However, let’s look at some of the specifics that ufw’s defaults do allow:

  • ICMP (this is OK)
  • multicast DNS (hmmm….)
  • UPnP (!!)

OK, this should go without saying, but does it seem to you like your server operating system should enable UPnP discovery by default? Makes sense for a desktop (maybe) – but not for a server. No way, no how.

UFW is an Ubuntu-specific thing (and makes iptables complicated)

If you have a ‘firewall administrator’ on staff, and you ask them to look at a server, how likely is it that they’re going to know anything about UFW? IMO, very unlikely. They’re probably familiar with iptables, and the first thing they’ll do is look at iptables directly, and they’re going to see dozens of rules that point at other rules and they’re going to be very confused. At best, they’ll refuse to touch it and give you a big lecture about how if packets get to your server’s TCP/IP stack before hitting your firewall, you’re doing it wrong. At worst, they’ll start making changes which due to the complexity of these rules will likely result in more problems, not less.

Make your Ubuntu servers better

If you are going to run servers on Ubuntu, here’s one simple suggestion for how to make them better:

apt-get remove ufw

After this, at least your beleaguered sysadmins will have one less possible way to screw up your systems while doing their jobs.