When HP announced their cloud was going GA, lots of the coverage said that the cloud came with a 99.95% SLA. While that is technically accurate, it is also very misleading. People may assume (and even some coverage indicates this) that this means if you launch a virtual server inside HP’s cloud, they are guaranteeing that instance will have 99.95% uptime. This is emphatically not true. What they are really guaranteeing is that that instance will be up or you will be able to launch a replacement instance. That’s a really important or. Lydia Leong over at Gartner has a much more detailed analysis of this that you probably want to read. For my part, I’d like to illustrate some theoretical situations that would not be considered violations of your SLA to try to make this a little more real and concrete for those of you with infrastructure running in the public cloud. (AWS’ SLA is similar to HP’s for all these cases).
Here we go. Each of these situations is not considered to be a breach of HP’s SLA:
- The only VM you are running in the cloud blows up and is unrecoverable. This does not count against the SLA.
- The only VM you are running in the cloud loses all network connectivity. This does not count against the SLA.
- Every VM you have in every availability zone in the region blows up at exactly the same time. They are all unrecoverable. This does not count against the SLA (because you can launch replacement VMs for all of them).
Amazon is no better. The examples above all would be the same for AWS – not a breach of their SLA. Here’s one that would be covered by HP’s SLA, but not AWS’:
- Let’s say you have 200 VMs in each of 3 different AZes in a region. In each of those AZes, 199 of those instances die and you cannot launch any replacement instances. So you are left with 3 of your 600 instances running (99.5% of your capacity is lost). This does not count against the SLA.
What to do?
I’m really not ragging on public cloud here. I don’t think SLAs have much value as contracts. What they are good for is helping you do risk management. A provider is going to sum up lots of information about how they’ve designed and architected their services in their SLA. Think about new cars: if one comes with a 12 month limited warranty and another comes with a 10 year bumper to bumper warranty, doesn’t that make you think they are designed and built differently? So both HP’s and AWS’ SLAs tell you a lot about how their systems are architected. Mostly they tell you that individual instances are ephemeral and you better account for that when running your applications on top of them.
This is also why any SLA that says “100% uptime” is completely useless, because that number is obviously divorced from any match to the reality of the design and architecture of the system underneath the SLA. No system can achieve 100% uptime, and by pretending theirs does, service providers actually make it more difficult for their customers to do intelligent risk management.
My advice: make sure you understand your providers’ SLAs. Not because you’re going to use those SLAs to wrangle compensation out of them for outages (maybe I’ll post later about how such compensation is also a joke). You want to understand your provider’s SLAs so that you can do intelligent risk management for yourself. Knowledge is power. Get it and use it!