
Incident Management and Problem Management – Web Ops style

I’ve noticed that a lot of people find this blog by searching for “incident management versus problem management.” This puzzled me, because I figured there must be way better places on the internets to find that answer. So I looked at the google results for that search and read the top five or so. And now I get it. Because those all suck. They have too much ITIL. And they’re too focused on service desk setups versus web ops setups. So let me try to put this in the simplest terms I can:

  • Incident Management is about fixing something that’s broken
  • Problem Management is about fixing something that’s broken

Confusing, right? Well, the two really are different. Let’s start with some analogies.

The Analogies

Let’s say your house is on fire. “Incident Management” is the firefighters putting out the fire. “Problem Management” is the investigation that happens afterwards to figure out why the fire started and what to do about it.

Let’s say you’re on a plane, and all the engines fail. “Incident Management” is the flight crew gliding the plane to an emergency landing. “Problem Management” is the investigation that happens afterwards to determine why those engines all failed and what should be changed in the future so that doesn’t happen again.

In a Web Ops world, “Incident Management” is what you do when all the servers in one of your datacenters have gone tango uniform – perhaps you take all the incoming traffic and redirect it over to a different datacenter. “Problem Management” would be figuring out what happened to all those servers – why did they fail together and what do you have to do to make sure that doesn’t happen again.

The Definitions

“Incident Management,” then, is when you are immediately addressing an urgent situation. Timeframes are generally short and pressure is high, and you’ll do whatever it takes to make things better – even things that in normal situations would be too costly, too risky, or against policy. For example, if your engines are out you might land your plane on a freeway – which is something you’d never do under normal circumstances. And importantly, you don’t do investigations into what caused things while you’re doing incident management. You really don’t care at that point why all your engines failed – you just want to get the plane down safely.

“Problem Management,” in contrast, is a more measured activity where you figure out what was really behind the incident and then what to do about it. The situation isn’t as urgent and the techniques are different. Root cause analysis is a key tool for Problem Management. And I don’t mean the mythical quest for a single root cause, I mean real root cause analysis. Which is a topic for another post.

The two are connected, of course. The inputs to Problem Management are The Facts: what happened when, who did what, what results did those actions have, etc. Those Facts came out of the Incident Management process, and ideally are recorded automatically without the people doing Incident Management having to do anything special – think of the “black box” that commercial aircraft have. In a Web Ops world, automatically recording crisis phone bridges and chatrooms can serve a similar purpose.
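That automatic recording can be as simple as timestamping everything said in the incident channel. Here's a minimal sketch in shell – the `stamp_log` name and the example message are my own invention, not any particular tool:

```shell
# stamp_log: prefix every line of stdin with a UTC timestamp, so the
# incident-channel transcript becomes a usable timeline for Problem Management.
stamp_log() {
  while IFS= read -r line; do
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$line"
  done
}

# Example: pipe your chat bridge or console session through it into a log.
echo "failover to DC2 initiated" | stamp_log
```

Point your chat bot or console session through something like this and The Facts collect themselves, with nobody on the incident having to do anything special.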

Who Cares?

It’s important to recognize that these are two different activities, and treat them as such. Most everyone does Incident Management, out of necessity (I mean, it’s not like you’re going to decide NOT to emergency land your plane when all your engines fail). What’s sometimes lacking, however, is the follow-up through Problem Management. It’s tempting after an incident to say “We know what happened, and what to do about it.” It’s tempting because it sounds right. And because it’s a lot easier than doing real root cause analysis, and a helluva lot easier than actually prioritizing and executing all the actions that are really required to address the underlying problems that led to the incident. But in the long run you’re way better off investing that time and effort.

Here’s how to tell if you need to step up your Problem Management efforts. Do you ever say or hear things like this?

  • “Oh yeah, I remember when this happened last time.”
  • “Call Bobby Ray, he fixed this before.”
  • “Why does this keep happening?”
  • “We’ve gotten really good at dealing with this problem.”

If so – try some Problem Management.

What to do?

  • Make sure you’re recording useful information from your incidents – get The Facts.
  • Have an official post-mortem investigation process for your incidents (and use it).
  • Periodically review the actions you’ve identified to fix things and see how you’re doing at getting them accomplished.
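You don't need anything fancy for that periodic review – even a flat file works. A toy sketch, assuming a tab-separated actions file with a status column (the file name, format, and entries here are all made up for illustration):

```shell
# Hypothetical actions file: status<TAB>description, one remediation per line.
printf 'open\tadd disk-saturation alerting\ndone\tfix failover script\nopen\tautomate transcript capture\n' > /tmp/actions.tsv

# How are we doing? Count remediation actions by status.
awk -F'\t' '{count[$1]++} END {for (s in count) print s, count[s]}' /tmp/actions.tsv
```

If the "open" count only ever goes up, that's your signal that Problem Management is producing recommendations nobody is executing.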

You don’t need a separate Problem Management team. (In fact, I’d recommend against it). You don’t need some expensive Problem Management software. All you need is an incident, some data, and a desire to make things better. I know you’ve got that!


Incident Management vs Problem Management

Matt Groeninger over at disruptive-it.com wrote a very thought-provoking blog post about the Skype outage from a while back and how it demonstrates that sometimes you need to understand root cause in order to restore service in incidents in complex systems. I suggest you read his post, and I completely agree with his point (as I said in a comment there). However, it also got me thinking about ITIL, Incident Management, Problem Management, etc. Matt says:

the Skype outage demonstrates that some attempts to restore services will require communication/analysis from your problem management group.

There is a tendency among ITIL practitioners to take the functional descriptions in ITIL (Problem Management, Change Management, Incident Management, etc) and formalize them into organizational structure, and I see a little of that here. Root cause analysis is not some forbidden ritual that is the purview only of one particular team or role. So while I agree with his larger point that to resolve incidents you need to understand causes, I don’t agree that this means you need to involve a problem management group in the incident. In fact….

Incident Management often requires analysis of cause

You may have an Incident Management team, but even if you don’t, certainly someone is providing that function and they are responsible for restoring service when there is an incident. But in order to restore service, you often need to at least theorize about what is wrong before you can fix it. This is a standard part of any troubleshooting methodology. Take a simplified example of a single overloaded server. To fix this, you need to speculate on what could be causing it and do some investigation based on that speculation. For instance, maybe there’s a hung process chewing up all the resources – to test that you do a ps or top and look for something using all the CPU. If that’s not it, maybe the disks are overloaded so you do an iostat and look at wait times. Maybe that’s not it so you think it could be an external resource that’s slow, so you look at connection pools.

The point is that in any complex system the “restoration” role is going to involve the rapid creation, testing, and discarding of theories about cause. These theories may not necessarily be about root cause, but are certainly going to be about proximate cause and potentially a handful of contributing causes. In fact I would say that nearly all interesting problems cannot be solved without gaining at least some idea of their causes – and any that can should have the solution automated anyway so they no longer require bothering someone in the future.
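That create-test-discard loop for the overloaded-server example might look like this at the shell. A sketch only, assuming a Linux box – iostat ships with the sysstat package, ss with iproute2, and your thresholds and next steps will differ:

```shell
# Hypothesis 1: a hung process is chewing up all the CPU.
echo "== top CPU consumers =="
ps -Ao pcpu,pid,comm | sort -rn | head -n 5

# Hypothesis 2: the disks are overloaded -- look at I/O wait times.
echo "== disk utilization =="
if command -v iostat >/dev/null 2>&1; then
  iostat -x 1 2 | tail -n +4
else
  echo "(iostat not installed; skipping disk check)"
fi

# Hypothesis 3: a slow external resource -- count TCP connections per state
# as a rough proxy for a backed-up connection pool.
echo "== TCP connection states =="
if command -v ss >/dev/null 2>&1; then
  ss -tan | awk 'NR>1 {count[$1]++} END {for (s in count) print s, count[s]}'
else
  echo "(ss not installed; skipping connection check)"
fi
```

Each section here is one theory about cause; you run it, look at the result, and either dig deeper or discard the theory and move on to the next.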

Only SOME causes should be considered during restoration

What you should not do during service restoration is any investigation that won’t directly help you solve the problem. As an example, during an incident you often find that some resource is not in the state it should be in – it is very tempting in those moments to ask “Why isn’t this in the state it should be in?” and start looking into what changes happened when, who did them, etc. This temptation must be resisted to keep the focus on restoring the service, as the fix is likely to be simply putting the resource back in the state it should be in, without needing to know who did what when to get it in the bad state. The proper time to look into those particular causes is during Problem Management after the incident is past.

Successful Incident Management requires either smart tools or smart people (ideally you have both)

If you have a complex system that requires high availability, you cannot get by with ticket monkeys and runbooks. (Unfortunately, many people have tried…) You need a combination of smart, experienced, knowledgeable people and useful tools that can analyze and describe the systems involved. If you have great tools you can get by with people who have less knowledge and less experience. If you have great people, you can get by with fewer and/or crappier tools. Ideally however, you have both.

If you want great tools, the best way I know of to get them is to have your smartest, most experienced, most knowledgeable people be intimately involved in their creation. The tools will empower your Incident Management team, and your Problem Management team should have as one of its primary charters the continual evaluation and improvement of those tools.

Thoughts on “Blameless Postmortems and a Just Culture”

I was quoting Conan The Ops Guy in my previous post because I wanted to start writing some stuff about root cause analysis, problem management, after-action reports, etc. Then John Allspaw wrote this incredibly fantastic blog post about blameless postmortems that so eloquently and thoughtfully conveys a bunch of the things I was thinking about that now I am just going to sit here feeling inadequate and tell you to go read it.

Here are some of the things I really liked about what he had to say:

  • A culture of blame leads to people not providing information, and the information is what you need to improve things
  • A lack of information leads to larger disconnects in understanding between line and management
  • Blame implies a strategy of deterrence, versus a strategy of prevention
  • Just saying “Person X should have done Y instead of Z” does not help the next person, unless you also understand why Person X did Z in the first place and change those circumstances
  • Post-mortem analyses should be about learning, not about blame

And one note from me that I put into a comment on the original post, but the comment is still awaiting moderation so I better say it here too:

One thing I would add: it’s also critical for a successful process that the output of post-mortems be acted upon in a *timely* and *visible* manner. If someone spends their time doing a bunch of analysis and comes up with recommendations on how to avoid problems but then feels like those recommendations are ignored or not appreciated, that also is highly disincentivizing future analysis. I’m sure Etsy doesn’t have this problem, but I’ve seen it happen in other organizations, especially as they get larger.

Basically John points out that if there are negative consequences to providing information that is useful in preventing future incidents, then people won’t provide that information. The flip side however is that if there aren’t positive consequences for providing that information, people also won’t provide it. And the best positive consequence I can think of is seeing your information acted on to make your service more reliable, prevent future problems, and improve the experience of your customers, all of which ultimately makes your business stronger.

Conan the Ops Guy

Lately I have been thinking of incident analysis and Problem Management (in the ITIL sense), and when I do that I always hark back to the immortal words of Conan The Ops Guy, who spoke thusly on what was best in life:

To crush your incidents, to have their root causes identified, and to see the prevention of their recurrence.

That is good, Conan! That is good.