When you run a service that others depend on (i.e., you have customers), you have a responsibility. Your service should work when people want to use it. If it doesn’t, you’re letting them down – and likely costing yourself money. But you have limited resources to invest in keeping the service up – how do you spend them most wisely? What’s the best bang for your buck when it comes to monitoring? (Hint: it’s probably not what your monitoring system is designed for!) Let’s look at your options:
Your typical monitoring system will tell you lots about your infrastructure, meaning the servers and network devices that your services run on. You’ll have ping tests to make sure servers are alive and disk space checks to make sure they have room to write the data they need to. The monitoring system will also record lots of system-level metrics for you to look at: how busy your servers are, how heavily utilized your network links are, etc. For a service with lots of customers, you’ll have a lot of infrastructure. It may be dozens, hundreds, or thousands of devices. A decent monitoring system will tell you right away when any of those devices fail or are having serious problems.
But that’s not good enough! In fact, it’s often useless. Unfortunately, this is the kind of thing most monitoring systems are really good at. But how valuable is it to check three times every minute that your disks are still 77% full? So you can ping a server – do your customers care?
Infrastructure monitoring tells you if a server or a router goes down. Do your angry customers typically complain that “your router is down!” or “your database server is down!” when they call you about problems? If the answer is no – read on.
On top of your infrastructure you have applications – the software that provides the services your customers consume. Good application monitoring will involve looking at individual processes on your servers, and looking at the operational interfaces those processes provide to you: primarily logfiles and statistics. If your application monitoring is decent you’ll know right away when any of your software gets into a bad state.
That’s not good enough either! The service you provide is not the software.
Do your angry customers typically complain that “the indexing queue is really backed up” or that “the shopping cart middleware has stopped accepting requests” when they call you about problems? If the answer is no – read on.
What Customers Care About
Your customers use what you’ve created for a reason. It provides a benefit (or benefits) to them. That is what you want to be monitoring. What would your customers say if you asked them “Why do you use our service? What does it do for you?” Take that answer and figure out how to monitor it. Maybe the answer is “I use your service to make payments to people I buy things from.” OK, then your monitoring system needs to be able to measure making payments. (NOTE: not the servers that are involved in making payments, and not the software that is involved in making payments; making payments itself is what you need to measure and monitor.) If the answer is “I use your service to read about what my friends are doing” then your monitoring system needs to be able to measure people reading about what their friends are doing. After lots of searching in vain for a decent name, I call this “What Customers Care About Monitoring” or WCCAM (rhymes with Wiggum, like the police chief in The Simpsons). This is what you really care about – that the value you provide to your customers is working.
These are probably also exactly the things your customers do complain to you about. “I can’t make a payment!” “I can’t read the status updates from Soandso!” If you listen to customer support calls, these are the kinds of things you’ll hear. In fact, in lieu of asking customers directly what they use your service for, the next best thing is to ask your customer support folks what people complain about – that’s an excellent pointer to what your critical services are from a customer perspective.
Measure your services – measure your value!
Once you’ve identified the services your customers care about (like making payments, or reading updates from their friends), figure out what characteristics of those services are critical. Possibilities include:
- performance – response time, load time – how quickly can they get to it?
- functional correctness – is it doing what it’s supposed to?
- availability – can they reach it when they want to?
Then figure out how to measure and monitor those characteristics. I know, I know – that’s not easy. That’s why you get paid to do it!
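One common way to get at all three characteristics is a synthetic probe: a script that performs the customer's action on a schedule and records what happened. The Python below is a minimal sketch under stated assumptions, not a recommended implementation. `make_test_payment` is a hypothetical stand-in for a real test payment against a test account, and the names `run_probe`, `summarize`, and `slow_after` are invented for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProbeResult:
    available: bool  # did the action complete at all?
    correct: bool    # did it return what the customer expects?
    seconds: float   # how long the customer would have waited


def run_probe(action: Callable[[], object], expected: object) -> ProbeResult:
    """Perform one customer-style action and record the three characteristics."""
    start = time.perf_counter()
    try:
        outcome = action()
    except Exception:
        # From the customer's side, any failure to complete means "unavailable".
        return ProbeResult(False, False, time.perf_counter() - start)
    return ProbeResult(True, outcome == expected, time.perf_counter() - start)


def summarize(results: List[ProbeResult], slow_after: float) -> dict:
    """Roll a batch of probe results up into customer-facing numbers."""
    if not results:
        raise ValueError("no probe results to summarize")
    total = len(results)
    good = [r for r in results if r.available and r.correct]
    return {
        "availability": sum(r.available for r in results) / total,
        "correctness": sum(r.correct for r in results) / total,
        # Count of successful probes that took longer than the threshold.
        "slow": sum(r.seconds > slow_after for r in good),
    }


# Hypothetical stand-in: a real probe would submit a test payment here.
def make_test_payment() -> str:
    return "PAID"


if __name__ == "__main__":
    results = [run_probe(make_test_payment, "PAID") for _ in range(5)]
    print(summarize(results, slow_after=2.0))
```

The probe knows nothing about servers or processes. It only knows whether a payment went through, whether the result was right, and how long it took, which is the point.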
What does this do for me?
Effective WCCAM pays off in much higher availability for your services, which means happier and more satisfied customers. It can reduce both MTTD (mean time to detect) and MTTR (mean time to repair) by providing much faster detection of customer-impacting problems. If you have ever had an outage or disruption that was reported to you by your customers rather than your monitoring system then you already know the value of WCCAM. There are many situations where all of your individual devices and applications are up and running, but the overall service is not working because something is wrong with the connections between those components or with some external dependency they have. WCCAM tells you about these problems – infrastructure and application monitoring do not. This earlier detection can dramatically reduce your MTTR.
Effective WCCAM can also lead to faster triage and diagnosis, again reducing MTTR. Since what you are measuring is what customers care about, it’s much easier to distinguish an important problem from a trivial one – letting you prioritize your response intelligently.
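To see why detection time matters so much, it helps to split an incident into the time before anyone notices and the time spent fixing it. The sketch below does that arithmetic in Python. The timestamps are made up for illustration, the function names are hypothetical, and it assumes time to restore is measured from the moment the problem began.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# (problem began, problem detected, service restored)
Incident = Tuple[datetime, datetime, datetime]


def mean_minutes(spans: List[timedelta]) -> float:
    """Average a list of durations and express the result in minutes."""
    return sum(s.total_seconds() for s in spans) / len(spans) / 60


def detection_and_restore(incidents: List[Incident]) -> Tuple[float, float]:
    """Return (mean minutes to detect, mean minutes to restore)."""
    detect = [found - began for began, found, _ in incidents]
    restore = [fixed - began for began, _, fixed in incidents]
    return mean_minutes(detect), mean_minutes(restore)


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0)
    # Same 20 minutes of repair work; only the detection delay differs.
    customer_reported = [(t0, t0 + timedelta(minutes=30), t0 + timedelta(minutes=50))]
    probe_detected = [(t0, t0 + timedelta(minutes=2), t0 + timedelta(minutes=22))]
    print(detection_and_restore(customer_reported))  # (30.0, 50.0)
    print(detection_and_restore(probe_detected))     # (2.0, 22.0)
```

In this invented example the repair work is identical in both cases. The whole difference in time to restore comes from how long the problem went unnoticed.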
WCCAM points the way
- What you should monitor is what your customers care about.
- So monitor the services you provide to your customers – not (just) your infrastructure.
- WCCAM will let you find and address problems more quickly, leading to happier customers, a happier business, and a happier universe.