Monthly Archives: June 2012

Incident Management vs Problem Management

Matt Groeninger over at  disruptive– wrote a very thought-provoking blog post about the Skype outage from a while back and how it demonstrates that sometimes you need to understand root cause in order to restore service in incidents in complex systems. I suggest you read his post, and I completely agree with his point (as I said in a comment there). However, it also got me thinking about ITIL, Incident Management, Problem Management, etc. Matt says:

the Skype outage demonstrates that some attempts to restore services will require communication/analysis from your problem management group.

There is a tendency among ITIL practitioners to take the functional descriptions in ITIL (Problem Management, Change Management, Incident Management, etc) and formalize them into organizational structure, and I see a little of that here. Root cause analysis is not some forbidden ritual that is the purview only of one particular team or role. So while I agree with his larger point that to resolve incidents you need to understand causes, I don’t agree that this means you need to involve a problem management group in the incident. In fact….

Incident Management often requires analysis of cause

You may have an Incident Management team, but even if you don’t, certainly someone is providing that function and they are responsible for restoring service when there is an incident. But in order to restore service, you often need to at least theorize about what is wrong before you can fix it. This is a standard part of any troubleshooting methodology. Take a simplified example of a single overloaded server. To fix this, you need to speculate on what could be causing it and do some investigation based on that speculation. For instance, maybe there’s a hung process chewing up all the resources – to test that you do a ps or top and look for something using all the CPU. If that’s not it, maybe the disks are overloaded so you do an iostat and look at wait times. Maybe that’s not it so you think it could be an external resource that’s slow, so you look at connection pools. The point is that in any complex system the “restoration” role is going to involve the rapid creation, testing, and discarding of theories about cause. These theories may not necessarily be about root cause, but are certainly going to be about proximate cause and potentially a handful of contributing causes. In fact I would say that nearly all interesting problems cannot be solved without gaining at least some idea of their causes – and any that can should have the solution automated anyway so they no longer require bothering someone in the future.

Only SOME causes should be considered during restoration

What you should not do in any service restoration effort is do any investigation that won’t directly help you solve the problem. As an example, during an incident you often find that some resource is not in the state it should be in – it is very tempting in those moments to ask “Why isn’t this in the state it should be in?” and start looking into what changes happened when, who did them, etc. This temptation must be resisted to keep the focus on restoring the service, as the fix is likely to be simply putting the resource back in the state it should be in, without needing to know who did what when to get it in the bad state. The proper time to look into those particular causes is during Problem Management after the incident is past.

Successful Incident Management requires either smart tools or smart people (ideally you have both)

If you have a complex system that requires high availability, you cannot get by with ticket monkeys and runbooks. (Unfortunately, many people have tried…) You need a combination of smart, experienced, knowledgeable people and useful tools that can analyze and describe the systems involved. If you have great tools you can get by with people who have less knowledge and less experience. If you have great people, you can get by with fewer and/or crappier tools. Ideally however, you have both.

If you want great tools, the best way I know of to get them is to have your smartest, most experienced, most knowledgeable people be intimately involved in their creation. The tools will empower your Incident Management team, and your Problem Management team should have as one of its primary charters the continual evaluation and improvement of those tools.

Thoughts on “Blameless Postmortems and a Just Culture”

I was quoting Conan The Ops Guy in my previous post because I wanted to start writing some stuff about root cause analysis, problem management, after-action reports, etc. Then John Allspaw wrote this incredibly fantastic blog post about blameless postmortems that so eloquently and thoughtfully conveys a bunch of the things I was thinking about that now I am just going to sit here feeling inadequate and tell you to go read it.

Here are some of the things I really liked about what he had to say:

  • A culture of blame leads to people not providing information, and the information is what you need to improve things
  • A lack of information leads to larger disconnects in understanding between line and management
  • Blame implies a strategy of deterrence, versus a strategy of prevention
  • Just saying “Person X should have done Y instead of Z” does not help the next person, unless you also understand why Person X did Z in the first place and change those circumstances
  • Post-mortem analyses should be about learning, not about blame

And one note from me that I put into a comment on the original post, but the comment is still awaiting moderation so I better say it here too:

One thing I would add: it’s also critical for a successful process that the output of post-mortems be acted upon in a *timely* and *visible* manner. If someone spends their time doing a bunch of analysis and comes up with recommendations on how to avoid problems but then feels like those recommendations are ignored or not appreciated, that also is highly disincentivizing future analysis. I’m sure Etsy doesn’t have this problem, but I’ve seen it happen in other organizations, especially as they get larger.

Basically John points out that if there are negative consequences to providing information that is useful in preventing future incidents, then people won’t provide that information. The flip side however is that if there aren’t positive consequences for providing that information, people also won’t provide it. And the best positive consequence I can think of is seeing your information acted on to make your service more reliable, prevent future problems, and improve the experience of your customers, all of which ultimately makes your business stronger.