Incident Management and Problem Management – Web Ops style

I’ve noticed that a lot of people find this blog by searching for “incident management versus problem management.” This puzzled me, because I figured there must be way better places on the internets to find that answer. So I looked at the google results for that search and read the top five or so. And now I get it. Because those all suck. They have too much ITIL. And they’re too focused on service desk setups versus web ops setups. So let me try to put this in the simplest terms I can:

  • Incident Management is about fixing something that’s broken
  • Problem Management is about fixing something that’s broken

Confusing, right? Well, the two really are different. Let’s start with some analogies.

The Analogies

Let’s say your house is on fire. “Incident Management” is the firefighters putting out the fire. “Problem Management” is the investigation that happens afterwards to figure out why the fire started and what to do about it.

Let’s say you’re on a plane, and all the engines fail. “Incident Management” is the flight crew gliding the plane to an emergency landing. “Problem Management” is the investigation that happens afterwards to determine why those engines all failed and what should be changed in the future so that doesn’t happen again.

In a Web Ops world, “Incident Management” is what you do when all the servers in one of your datacenters have gone tango uniform – perhaps you take all the incoming traffic and redirect it over to a different datacenter. “Problem Management” would be figuring out what happened to all those servers – why did they fail together and what do you have to do to make sure that doesn’t happen again.

The Definitions

“Incident Management,” then, is when you are immediately addressing an urgent situation. Timeframes are generally short and pressure is high, and you’ll do whatever it takes to make things better – even things that in normal situations would be too costly, too risky, or against policy. For example, if your engines are out you might land your plane on a freeway – which is something you’d never do under normal circumstances. And importantly, you don’t do investigations into what caused things while you’re doing incident management. You really don’t care at that point why all your engines failed – you just want to get the plane down safely.

“Problem Management,” in contrast, is a more measured activity where you figure out what was really behind the incident and then what to do about it. The situation isn’t as urgent and the techniques are different. Root cause analysis is a key tool for Problem Management. And I don’t mean the mythical quest for a single root cause, I mean real root cause analysis. Which is a topic for another post.

The two are connected, of course. The inputs to Problem Management are The Facts: what happened when, who did what, what results did those actions have, etc. Those Facts came out of the Incident Management process, and ideally are recorded automatically without the people doing Incident Management having to do anything special – think of the “black box” that commercial aircraft have. In a Web Ops world, automatically recording crisis phone bridges and chatrooms can serve a similar purpose.

Who Cares?

It’s important to recognize that these are two different activities, and treat them as such. Most everyone does Incident Management, out of necessity (I mean, it’s not like you’re going to decide NOT to emergency land your plane when all your engines fail). What’s sometimes lacking, however, is the followup through Problem Management. It’s tempting after an incident to say “We know what happened, and what to do about it.” It’s tempting because it sounds right. And because it’s a lot easier than doing real root cause analysis, and a helluva lot easier than actually prioritizing and executing all the actions that are really required to address the underlying problems that led to the incident. But in the long run you’re way better off investing that time and effort.

Here’s how to tell you need to step up your Problem Management efforts. Do you ever say or hear things like this?

  • “Oh yeah, I remember when this happened last time.”
  • “Call Bobby Ray, he fixed this before.”
  • “Why does this keep happening?”
  • “We’ve gotten really good at dealing with this problem.”

If so – try some Problem Management.

What to do?

  • Make sure you’re recording useful information from your incidents – get The Facts
  • Have an official post-mortem investigation process for your incidents (and use it).
  • Periodically review the actions you’ve identified to fix things and see how you’re doing at getting them accomplished.

You don’t need a separate Problem Management team. (In fact, I’d recommend against it). You don’t need some expensive Problem Management software. All you need is an incident, some data, and a desire to make things better. I know you’ve got that!


About Paul Guth

Old Timey Web Ops guy. I think about cars and clouds, and how they could be faster, cheaper, and more resilient. View all posts by Paul Guth

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: