Incident Management vs Problem Management

Matt Groeninger over at disruptive-it.com wrote a very thought-provoking blog post about the Skype outage from a while back and how it demonstrates that, in complex systems, you sometimes need to understand root cause in order to restore service during an incident. I suggest you read his post, and I completely agree with his point (as I said in a comment there). However, it also got me thinking about ITIL, Incident Management, Problem Management, etc. Matt says:

the Skype outage demonstrates that some attempts to restore services will require communication/analysis from your problem management group.

There is a tendency among ITIL practitioners to take the functional descriptions in ITIL (Problem Management, Change Management, Incident Management, etc.) and formalize them into organizational structure, and I see a little of that here. Root cause analysis is not some forbidden ritual that is the purview of only one particular team or role. So while I agree with his larger point that to resolve incidents you need to understand causes, I don’t agree that this means you need to involve a problem management group in the incident. In fact…

Incident Management often requires analysis of cause

You may have an Incident Management team, but even if you don’t, someone is certainly providing that function and is responsible for restoring service when there is an incident. To restore service, though, you often need to at least theorize about what is wrong before you can fix it. This is a standard part of any troubleshooting methodology.

Take a simplified example of a single overloaded server. To fix it, you need to speculate about what could be causing the overload and investigate based on that speculation. For instance, maybe there’s a hung process chewing up all the resources; to test that, you run ps or top and look for something using all the CPU. If that’s not it, maybe the disks are overloaded, so you run iostat and look at wait times. If that’s not it either, maybe an external resource is slow, so you look at connection pools.

The point is that in any complex system the “restoration” role is going to involve the rapid creation, testing, and discarding of theories about cause. These theories may not be about root cause, but they are certainly going to be about proximate cause and potentially a handful of contributing causes. In fact, I would say that nearly all interesting problems cannot be solved without gaining at least some idea of their causes, and any that can should have the solution automated anyway so they no longer require bothering someone in the future.
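To make the create-test-discard loop concrete, here is a rough Python sketch of the overloaded-server example. Everything in it is an assumption made for illustration: the Hypothesis class, the triage function, the metric names in the snapshot, and the thresholds are all hypothetical, and the snapshot values merely stand in for what you would read off ps, top, iostat, or a connection pool dashboard.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Hypothesis:
    name: str                       # the theory about proximate cause
    check: Callable[[dict], bool]   # True if the evidence supports the theory
    remedy: str                     # what to try if the theory holds


def triage(snapshot: dict, hypotheses: list) -> Optional[Hypothesis]:
    """Test each theory in order and return the first one the evidence supports."""
    for hypothesis in hypotheses:
        if hypothesis.check(snapshot):
            return hypothesis
        # Theory discarded; move on to the next one.
    return None


# Hypothetical thresholds, ordered from cheapest check to most involved.
HYPOTHESES = [
    Hypothesis("hung process",
               lambda s: s["top_process_cpu_pct"] > 90,
               "restart the offending process"),
    Hypothesis("overloaded disks",
               lambda s: s["disk_await_ms"] > 50,
               "shed or move the I/O-heavy workload"),
    Hypothesis("slow external resource",
               lambda s: s["pool_in_use"] >= s["pool_size"],
               "fail over or throttle calls to the dependency"),
]

# Made-up readings standing in for ps/top, iostat, and connection pool stats.
snapshot = {
    "top_process_cpu_pct": 12,
    "disk_await_ms": 180,
    "pool_in_use": 3,
    "pool_size": 20,
}

match = triage(snapshot, HYPOTHESES)
print(match.name if match else "no theory fits; form new ones")
```

Note that nothing here searches for root cause. Each check only asks whether a proximate cause is plausible enough to act on, which is all the restoration role needs.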

Only SOME causes should be considered during restoration

What you should not do in any service restoration effort is pursue an investigation that won’t directly help you solve the problem. As an example, during an incident you often find that some resource is not in the state it should be in. It is very tempting in those moments to ask “Why isn’t this in the state it should be in?” and start looking into what changes happened when, who made them, and so on. This temptation must be resisted to keep the focus on restoring the service: the fix is likely to be simply putting the resource back in the state it should be in, without needing to know who did what when to get it into the bad state. The proper time to look into those particular causes is during Problem Management, after the incident is past.
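As a rough illustration of that discipline, the Python sketch below compares a resource’s actual state to its desired state and produces two things: the fixes to apply now, and notes to hand to Problem Management later. The function name plan_restoration and the setting names are hypothetical, invented for this example; the sketch deliberately never asks who changed what or when.

```python
def plan_restoration(desired: dict, actual: dict):
    """Return (fixes, notes).

    fixes: settings to put back right now, keyed by name.
    notes: observations deferred to the post-incident review.
    """
    fixes = {}
    notes = []
    for key, want in desired.items():
        have = actual.get(key)
        if have != want:
            fixes[key] = want
            notes.append(
                f"{key}: found {have!r}, expected {want!r}; "
                "cause not investigated during the incident"
            )
    return fixes, notes


# Hypothetical settings for some resource that has drifted.
desired = {"max_connections": 200, "cache_enabled": True}
actual = {"max_connections": 20, "cache_enabled": True}

fixes, notes = plan_restoration(desired, actual)
print(fixes)   # what to restore during the incident
print(notes)   # what to ask "why?" about afterwards
```

The split is the point: the drift still gets recorded, so the “why” question is not lost, only postponed until service is back.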

Successful Incident Management requires either smart tools or smart people (ideally you have both)

If you have a complex system that requires high availability, you cannot get by with ticket monkeys and runbooks. (Unfortunately, many people have tried…) You need a combination of smart, experienced, knowledgeable people and useful tools that can analyze and describe the systems involved. If you have great tools you can get by with people who have less knowledge and less experience. If you have great people, you can get by with fewer and/or crappier tools. Ideally however, you have both.

If you want great tools, the best way I know of to get them is to have your smartest, most experienced, most knowledgeable people be intimately involved in their creation. The tools will empower your Incident Management team, and your Problem Management team should have as one of its primary charters the continual evaluation and improvement of those tools.

About Paul Guth

Old Timey Web Ops guy. I think about cars and clouds, and how they could be faster, cheaper, and more resilient.

6 responses to “Incident Management vs Problem Management”

  • Matt Groeninger (@mgroeninger)

    Hi Paul!
    I replied to your comment on my post. I love the thoughts and I think we agree! The one thing that I would point out here is that ITILv3 creates the perception of organizational structure between incident management and problem management by recommending that the Problem Manager and the Incident Manager be different people. I believe it is the only place in ITIL where they recommend splitting roles into different people.

    (They might make similar recommendations around the Change Manager role, but I don’t recall.)

    My ITIL-based incident response was a straw man, ultimately (hurts to admit that, for some reason *grin*), but it was based on the most common implementation of ITIL I have seen. Any thoughts on how to get ITIL practitioners to remember your last sentence?

    • Paul Guth

      I don’t think this should take much convincing. Any effective root cause analysis process should quickly identify areas where better tools/monitoring could have reduced the TTR on an incident, which logically leads to improving tools/monitoring being a primary activity of the problem management function. I think I may be misunderstanding your question though….

  • Matt Groeninger (@mgroeninger)

    Heh, I actually forgot another thought I had… *sigh*
    I’ve also seen poor change management implementations push incident management away from the model you describe/advocate for. Any thoughts on how to prevent that?

    (These are organizational issues, obviously, but they are what I find fascinating about ITIL/service management.)

    • Paul Guth

      I think I need you to connect the dots for me a little more here. 🙂 Can you give me an example of such a poor change management implementation and how it had this effect?

  • James Richardson

    I agree Problem Management and Incident Management have a huge stake in the monitoring and diagnostic tools deployed to managed services. They should, in my opinion, be strong advocates of improving both sets of tools; however, I have never worked in a Problem Management role with any budget to make ‘real changes’. I have often wondered if it would be beneficial to have a budget, and my conclusion is no. As a Problem Manager I am disjointed from the business’s priorities, which would make my decisions flawed. I instead feed data into the process to allow the business to make good decisions based on accurate data.

    The Problem Manager

    • Paul Guth

      Thanks for the comment, James. What I would expect (and I think this is what you are saying) is that Problem Management can have a huge hand in generating the requirements for updates to existing toolsets or new tools, and then in ensuring that those get successfully built. I don’t think the charter of Problem Management should include actually building the tools, although I can see that may not have been clear from the post itself.

      I don’t agree with your statement about Problem Management being disjointed from the business priorities – I think in any role it’s important to understand the business priorities. In fact I believe Problem Management is in an excellent position to get the big-picture view of both the technical and business sides of systems and is a great place to leverage that big-picture view.

      Paul
