DevOps and Formula 1 – Automation

Formula 1 racing and DevOps. Two things that I love. At first glance, you might not think they have anything in common. But they do! Both of them are about maximizing the throughput of your systems through relentless focus on improving performance and reliability.

Observation Tower at Circuit of the Americas during the 2016 F1 race

How did I start thinking about this? At the Bahrain race this year, Ferrari had a horrific accident in the pit lane that seriously injured one of their mechanics. The car started driving away while the mechanic was still standing in front of the rear tire. These cars accelerate fast and in a split second the moving tire hit the mechanic’s leg and broke it in multiple places. Of particular interest to me, there’s an automation backstory to this horrible event. In the olden days, releasing the car from a pit stop was the responsibility of a human (affectionately known as the “lollipop man”). But in the last few years there’s been a switch to an automated system that checks various things before it turns the light green, signaling the driver to go. In this case, it checked for all the things it’s supposed to check for and turned the light green – but the mechanic was still standing in front of the tire. (Obviously that’s not one of the things it checks for.) While the details aren’t 100% clear, it appears that whatever sequence of events happened was not accounted for in the design of the automation (i.e. an edge case). In this case, by taking the human out of the decision-making loop, this particular implementation of automation led to a tragic outcome.

Obviously the link to DevOps is automation. Automation is essential to success in today’s world of computing infrastructure. But it’s not a panacea and can lead to bad outcomes, typically in unaccounted-for edge cases (just like the pit stop). A classic example is automation for simple self-healing. Say you run a bunch of containers and sometimes they get in a bad state and need to be destroyed and new ones spun up. In the olden days some human would keep an eye on the fleet and kill/restart things manually when needed. Automation lets you remove the toil from that human by setting up a simple liveness monitor for your container and when that monitor fails, you have software automatically kill/restart it. That works great – most of the time. But you might experience pathological behavior when some downstream component has a hiccough that causes your liveness probes to all return failures for a brief period, leading them all to be killed and restarted, which leads to a service outage.

Automation is a powerful tool, in both the F1 and the DevOps context. The key is to figure out what should be automated, and what shouldn’t. You want to remove as much of the toil from humans as possible, but in places where key decisions need to be made – well that’s what humans are there for. Is releasing a car from a pit stop that kind of key decision? That’s arguable, but I’d vote yes. Is restarting a container a key decision? Absolutely not. Is restarting all your containers simultaneously a key decision? Yes! Make sure your automation is clear on what is and isn’t a key decision, and treats them appropriately!
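To make the "key decision" idea concrete, here is a minimal sketch of a restart planner. It is not taken from any particular orchestrator: the names `plan_restarts` and `RestartPlan` and the 25% threshold are assumptions made up for illustration. A handful of failing containers get restarted automatically, but a fleet-wide failure is handed to a human instead of triggering a mass restart.

```python
from dataclasses import dataclass


@dataclass
class RestartPlan:
    restart: list[str]   # containers the automation may restart on its own
    needs_human: bool    # True when the decision is escalated instead


def plan_restarts(probe_results: dict[str, bool], max_fraction: float = 0.25) -> RestartPlan:
    """Decide what to restart from liveness results (True means healthy).

    max_fraction is a hypothetical threshold: the largest share of the
    fleet the automation is allowed to restart without asking anyone.
    """
    failing = sorted(name for name, healthy in probe_results.items() if not healthy)
    if not failing:
        return RestartPlan(restart=[], needs_human=False)

    # A few failures are routine toil: restart them automatically.
    if len(failing) / len(probe_results) <= max_fraction:
        return RestartPlan(restart=failing, needs_human=False)

    # Many simultaneous failures look like a shared downstream problem,
    # so restarting everything is a key decision: escalate to a person.
    return RestartPlan(restart=[], needs_human=True)


fleet = {"web-1": True, "web-2": False, "web-3": True, "web-4": True}
print(plan_restarts(fleet))
print(plan_restarts({name: False for name in fleet}))
```

The right threshold depends on your fleet and on how much spare capacity you carry. What matters is that the automation knows where its authority ends.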

Your Nines Are A Lie

Is your service three nines, four nines, or even five? No matter what your answer is, it’s almost surely inaccurate.

I recently went through an exercise at work to calculate the expected availability of one of our foundational systems. It reminded me how little these calculations have to do with actual availability as experienced by consumers. Expected availability numbers are generally based on hardware failure rates. You combine how often a component fails with how long it takes to repair, and that gives you the component availability. An individual server may have an expected availability number of 99%, which means in an average year you’d expect it to be down for repairs for about three and a half days. An easy way to raise the availability of a system is to have redundant components – if you have two of those servers, your system availability goes up to 99.99%. Why? Because the chances of both servers failing at the same time (assuming they fail independently) are really small. With three servers you get up to 99.9999%. As you make this system more complex with more layers and more dependencies, the math gets a little more complicated but the idea stays the same, and so you can calculate an expected availability of your entire system based on the availability of each of its components. If you’re running a production system at scale, a typical design (redundant data centers, redundant circuits, redundant systems) could easily reach 99.999% (five nines) on paper. That’s about 5 minutes of downtime per year. For calibration, it would take roughly 12 years of uninterrupted service to be able to take a 1 hour outage and still be at five nines. But every big outfit, including Google, AWS, and Facebook, has experienced outages longer than that, even though they have big budgets and super smart people designing their systems. Why?
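The arithmetic behind those paper numbers is short enough to sketch. This assumes failures are independent, which is exactly the assumption that real outages tend to violate. The function names are mine, chosen for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def redundant_availability(single: float, copies: int) -> float:
    """Availability of N redundant components, assuming independent failures.

    The system is down only when every copy is down at the same time.
    """
    return 1 - (1 - single) ** copies


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime in a year at the given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


for copies in (1, 2, 3):
    a = redundant_availability(0.99, copies)
    print(f"{copies} server(s): {a:.4%} available, "
          f"{downtime_minutes_per_year(a):.1f} minutes of downtime per year")

# Five nines leaves a budget of a little over five minutes per year.
print(f"five nines: {downtime_minutes_per_year(0.99999):.2f} minutes per year")
```

Note what the model has no term for: nothing here knows about deploys, config pushes, or traffic spikes.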

It turns out that most big outages are not caused by component failures. The most common cause of a major outage is someone making a change. All three of the outages I linked to above were caused by a human making a change. Reliability calculations based on component failures tell you absolutely zero about how likely your system is to fail when you make changes – that depends on the quality of your tooling, the frequency of your changes, the design of your system, and the capabilities and training of your team. The second most common cause of outages is overloads – where your system (or some critical subsystem) can’t keep up with what’s being sent at it. Two of the three examples involved overload conditions.

I’ve seen a lot of outages in my career and a vanishingly small percentage were caused by hardware failures – pretty much any decent system these days has been designed to handle individual component failures. The trick is figuring out how to make your system resilient in the face of change and making sure you have the tooling you need to be able to react to and quickly fix any problems that do come up (including being able to quickly add new capacity if needed). If you’re trying to build a reliable service you should pay just as much attention to those as you do to the reliability of your system components!

Hosting simple webapps for free with Github Pages

I wanted to put up some simple webapps. In the past, I’ve always had an Internet-connected server handy for such a purpose – either in a spare room in my house or a cheap VPS (I miss you unixshell!) But I don’t anymore. So where to put these webapps? Even a t2.nano is $40 a month! Turns out there is a great solution: GitHub Pages. What makes this great?

  • The code is already in GitHub anyway
  • It’s super simple to turn your repository into a hosted webapp – just go into the repository settings, scroll to the “GitHub Pages” section and select a branch to serve from. Now the index.html in your repository will be loaded when someone goes to a URL like

And BOOM now you have free hosting of your webapp (if all the logic is client-side).

For an example, check out my simulator for evaluating different toilet-seat strategies. (code)

Or my estate tax calculator. (code)

Both are simple webapps where all the logic is implemented in client-side javascript. And hosted for free! Thank you GitHub!

Are you SRE or are you DevOps?

People have asked me, “Are we doing DevOps, or are we doing SRE?” I’ve also heard (and this is worse): “We’re an SRE team – we don’t do DevOps.” These distinctions don’t make sense, because SRE and DevOps aren’t actually different things. SRE is DevOps. To be more precise, SRE is a specific implementation of DevOps. DevOps is a broad (and vague) term. It’s more of a philosophy than a methodology – it’s a perspective on the world and a set of patterns to apply. SRE shares the DevOps philosophy and many of the same patterns.

The term “SRE” generally refers to Google SRE, which is a particular implementation of DevOps inside of a ton of Google-specific context. (SRE – both the term and the practice – originated at Google and has only recently been used by other organizations). There are several things DevOps and SRE have in common:

  • Focus on solving problems with software
  • Ownership and empowerment of the team responsible for a service
  • Learning relentlessly from successes and (especially) failures
  • Driven by data and metrics

Google SRE adds a lot of specifics – some of the most interesting are aspects of economics and incentives, such as:

  • Common (human) resource pool between software developers and SREs – scarcity of whom leads to explicit decisions to optimize between features and reliability
  • Use of an “error budget” to throttle the rate of change for a product – including the unintuitive guidance that if you are exceeding your SLO for availability, you should launch more features
  • A cap of 50% of SRE time on operational tasks (known as “toil”) – to ensure the system can scale faster than the team required to support it
  • At least 5% of operational work done by software developers – to maintain visibility of the operational load the software creates
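The error budget idea reduces to a small calculation. The sketch below is an illustration under simple assumptions (a request-based SLO measured over a single window), not Google's actual tooling. `error_budget_remaining` and `can_launch` are hypothetical names.

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent in the current window.

    1.0 means nothing spent, 0.0 means exactly spent, negative means overspent.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures == 0:
        raise ValueError("no traffic in this window, so no budget to measure")
    return 1 - failed_requests / allowed_failures


def can_launch(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Hypothetical gate: allow risky changes only while budget remains."""
    return error_budget_remaining(slo, total_requests, failed_requests) > 0


# A 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
print(error_budget_remaining(0.999, 1_000_000, 250))  # plenty of budget left
print(can_launch(0.999, 1_000_000, 1_500))            # budget overspent
```

This is where the unintuitive guidance comes from: a team with lots of unspent budget is being more careful than the SLO requires, so it can afford to ship faster.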

Google SRE operates on top of Google’s internal infrastructure and products. This is an extremely important part of the Google SRE context – they have had brilliant people working for fifteen years on the foundational systems, processes, and tools used to manage Google’s services. And within Google, every SRE team can benefit from not just common tooling and infrastructure, but also repeatable, translatable process. No other existing SRE team (outside of Google) works with the same level of foundational support.

As SRE expands outside the walls of Google, I like to think it will come to mean “applying the principles of DevOps at scale.” “Service Reliability Engineering” (an evolution of Google’s “Site Reliability Engineering”) is a much better term than “DevOps” to apply to teams focused on the reliability and performance of large-scale distributed systems, because it reflects the work and the expertise involved. “DevOps,” unfortunately, tends to just create confusion when applied to an organization or a strategy.

What, then, does it mean to do SRE? What does applying DevOps at scale look like? To start with:

  • Automate your infrastructure – build and management
  • Monitor what matters – set explicit SLOs for your services and gather the data both to see if you’re hitting the objective, as well as to evaluate the effects of changes to your infrastructure or code
  • Make your code builds and deploys both automated and repeatable, leveraging CI/CD
  • Learn from your failures with an effective retrospective process for incidents and launches
  • Empower your people – software developers and SREs – and hold them accountable for the overall success of the product – which includes its reliability

DevOps and SRE aren’t at odds. You can learn a lot from both!

Scalable Internet Architectures: A Review

I just finished reading Theo Schlossnagle’s Scalable Internet Architectures. This book is seven years old, but the concepts in it are still as current and useful as they were when the book was published. If your job is to design, build, run, or manage systems at scale, this book is worth reading. Now, scale ain’t what it used to be – this book won’t provide you step-by-step instructions for building the next Google or Facebook (mostly because it focuses on technology and tools, not on process). In fact, when this book was written Facebook was probably only 10,000 servers or so. But what the book will teach you will get you a good chunk of the way to being able to build a giant.

Here are some of the things I really liked about this book:

  • It uses an actual, real-life example throughout most of the topics, including real empirical results on various implementations. This is indescribably awesome, and related to the next thing I love about the book:
  • Theo has been there, done that, and learned from it. He exudes competence on stage or in person, and it comes through in the book. There is a bit of a ‘tude, but it’s easy to look past it.
  • The distinction drawn between performance and scalability is one that many fail to grasp. Theo explains it in some detail, including why it matters.
  • Theo is an extraordinary troubleshooter, and he presents troubleshooting concepts in this book with such clarity of exposition that it’s easy to overlook how insightful they are.
  • For a seven-year-old book, I was very surprised to see the explanation of TCP-level HA and loadbalancing without the use of hardware appliances. Back then I certainly wasn’t hip to these mechanisms.
  • Theo (and Circonus) are well-known for their focus on business-level metrics (a focus I think is spot on). His description of this and metrics in general is outstanding.
  • The discussion of RDBMS vs NoSQL (Chapter 10 – The Right Tool for the Job) is the best I’ve read on the topic. (Even though “NoSQL” wasn’t a thing when the book was written). He analyzes his workload in terms of requirements against ACID and then shows you why those semantics aren’t relevant in this case. He then walks you through the NoSQL implementation and shows you the resultant speedup. Awesome. I’ve seen fairly significant platform decisions made with far less thought and data behind them.

What’s in this book isn’t glamorous. But it works. If you want to know how to build scalable and reliable online systems, there’s nothing better than to spend a day with Theo. If you can’t do that, then read this book. 

The Freakonomics of Oncall Pay

Wearing the pager. It’s a fact of life for many of us ops folks. I’ve taken part in many a discussion from a management perspective about how oncall duty should be compensated. When the people doing the talking are pointy-haired-manager types who haven’t done oncall themselves, their starting position is often similar to the “per incident support” policies you get from a vendor – if you’re oncall and you get paged, you get paid either a set amount for that incident or an amount proportional to how long you’re working on it. So you might get say $50 per incident, or $25 per hour you’re engaged, for example. And if your oncall duty period goes quietly (no pages), you won’t get anything above your normal salary. Let’s call this the “per incident” model. The other option is a flat wage bump for the period you’re oncall – so for example every week you’re oncall, you get an extra $150 in your check. Let’s call this the “per week” model.


image provided by flickr user hades2k under a Creative Commons license.

I’ve always been a strong advocate for the “per week” model instead of the “per incident” model, because being oncall is an intrusion by your work into your personal life – it reduces your freedom during your time off work, and it does that whether or not you get paged. You can’t fly to Vegas for the weekend, you can’t get sloshed after work, and in some cases you can’t even take a shower or drop your kids at school without arranging for someone to cover you in case there’s an incident while you’re away from the keyboard. Simply being oncall affects your life, and I’ve always felt that people should be compensated for that – not just for the time where they are actively working an issue.

Then I saw _Freakonomics_ and realized there’s an even more powerful argument for the “per week” model: incentives. In the “per incident” model, your compensation goes up when the system has more problems that require oncall support. In theory, people might try to cause incidents so that they will get paged and therefore get more money. Personally I doubt that happens very often, if at all. However, there’s a more subtle influence on root cause analysis and problem management that I think does have real effects. When you’re paid in the “per week” model, you’re strongly incentivized to address the root causes of problems and improve your systems so that during your oncall week it’s more likely that you’ll be able to sleep through every night and live your life normally for that week. So when you do encounter an issue in the “per week” model, you’re going to really want to figure out what caused it, and to make meaningful changes to prevent it from happening the next time around. Fixing the problem completely does you nothing but good – you’re still going to get paid the same oncall pay for your next stint, and you’re going to have a much more pleasant experience when you’re on call. But in the “per incident” model, putting all that effort into root cause and remediation is actually going to cost you money. The next time you’re oncall, each incident you prevented from happening will mean you don’t get the money for working that incident. So consciously or not, it’s likely you’re not going to work quite as hard to make your systems better as you would under the “per week” model. I believe that this can have real effects and result in your systems being less reliable and less stable than they could be.
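Here is the incentive argument as a toy calculation, using the example figures from earlier in this post ($50 per incident, $150 per week). The function names are made up for illustration.

```python
def per_incident_pay(incidents: int, rate: float = 50.0) -> float:
    """Oncall pay for one week under the "per incident" model."""
    return incidents * rate


def per_week_pay(incidents: int, weekly_bump: float = 150.0) -> float:
    """Oncall pay for one week under the "per week" model.

    The incident count is accepted but deliberately ignored.
    """
    return weekly_bump


def pay_change_from_fixing(pay_model, incidents_before: int, incidents_after: int) -> float:
    """How much your oncall pay changes after you fix root causes."""
    return pay_model(incidents_after) - pay_model(incidents_before)


# Suppose good root cause work takes a rotation from 6 pages a week to 1.
print(pay_change_from_fixing(per_incident_pay, 6, 1))  # negative: the fix cost you money
print(pay_change_from_fixing(per_week_pay, 6, 1))      # zero: the fix only bought you sleep
```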

How does your company do oncall? Is it per week, per incident, or something else? And how do you think that is affecting incentives and outcomes? I’d love to hear your experiences!

PS: I did a little research into military hazard pay to see if there were instructive parallels there. I found Hostile Fire Pay which seems to follow the “per week” model. Anyone have information/thoughts on how incentives have affected this program?

A Fond Farewell to Cloudscaling

Earlier this month I walked into Cloudscaling’s offices for the last time as an employee, almost two and a half years after I started. I loved my job. I think Cloudscaling’s future is super bright. While there, I learned a ton, I got to work with fantastic people, and I got to work on cool stuff that I believe will really make a difference in the future of how we do computing. Those of you who have talked to me about Cloudscaling probably already know how much I believe in the company and in its mission. Cloudscaling is democratizing agile infrastructure – taking the patterns and concepts that have fueled the hypersuccess of companies like Amazon, Google, and Facebook and building open systems that will allow everyone in the industry to benefit from them. So why did I leave?


When I got there, Cloudscaling was a professional services company that was building large-scale clouds for their clients. During my time there, we transitioned to a product company, secured a series A investment round that we used to invest in building, selling, and supporting that product, and secured a series B investment round (announced last week). As VP of Engineering, my goal for the Series B timeframe was to build a sustainable and scalable technology team that could develop, maintain, and support the product and have that team be stable enough to continue doing so without my help. We got there – so now it’s time to pass the reins and let someone else take it from here.

As for myself, I started at Walmart Global eCommerce last week where I’m looking forward to taking the new ideas, concepts, and technologies that I’ve been working on and proving them out in the real world at one of the world’s largest ecommerce players.

To all my friends at Cloudscaling – I miss you all and wish you the best of success. Cloudscaling is the gold standard for OpenStack-based products and thanks to all your hard work the future of computing will be here sooner than anyone thought. I will be forever grateful for the experience you gave me and I will be forever proud to be a Cloudscaling alumnus. Thank you for everything.

Technology Culture: The Famous (Infamous?) Netflix Culture Deck

Sheryl Sandberg called the Netflix Culture Deck possibly “the most important document ever to come out of the Valley.” For my part, I dig it, and I think it’s a powerful draw to potential employees. So when Adrian Cockcroft was tweeting about #netflixculture with gems from the deck, I sent him a tongue-in-cheek DM saying “Hey! If you keep doing that you’ll make it impossible for the rest of us to hire good people!” His sensible reply was, “If it’s good, why aren’t more people copying?” This is an excellent question, and I’ve been trying to formulate a good concise answer. That’s been really hard, so I’m going to start with what I like about it.

There’s a lot to digest in the deck, so here’s my summary:

  • Field the best team that you can
  • Give the people on that team the freedom (and responsibility) to get shit done

The second bullet really resonates with me. My philosophy on management can be summed up in one word: “empowerment.” The Netflix formula is essentially the same as mine: give people the context they need to understand the problems that need to be solved, give them the resources to solve those problems, and then get out of their way. This is not a new or revolutionary idea. I developed a lot of my ideas from reading the literature on self-directed or high-performing work teams that was popular in the 80s. Netflix is applying a lot of these same essential ideas.

So let’s get back to Adrian’s question. Why don’t more people do this?

Freedom and Responsibility

Here’s my summary of these ideas from their deck:

  • Focus on results
  • Fix errors quickly, rather than preventing them
  • Avoid rules and policies

Focus on Results

I think a lot of people actually do do this. This is the basis of the self-directed work teams I mentioned earlier. In the olden days this was referred to as “management by objectives.” More recently people are referring to this as a Results Only Work Environment (ROWE). I try to set my teams up this way, by using metrics like “deliver projects on time” and “have less than X minutes of downtime per quarter” rather than “generate Y lines of code per month” or “process Z tickets per week.” Of course, not everyone does this. Why not? The first and most obvious reason is that it’s harder to manage. Measuring the quantity of work done (or time spent) is easy. How many hours does someone work? How many tickets do they handle? How many times do they break the rules? It’s very easy to generate reports showing these numbers. Judging by real business-relevant results is much harder, both in definition and measurement. Another reason organizations don’t do this is that it requires competent people who already know how to get results. In this way, it aligns nicely with Netflix’ culture of hiring and keeping only the best performers, but not every organization can hire and keep the best performers (more on this later). Focusing on results also requires effective communication from the top down: about the vision, goals, and strategy – as well as transparent communication about the actual results obtained. This communication introduces overhead and has a cost in efficiency – there’s a reason the Marines don’t operate as a democracy. And finally, Focus on Results requires trust in both directions (up and down) as well.

Back to the question at hand: why aren’t more people copying the ‘Focus on Results’ aspect of Netflix culture? In this specific area, I think the answer is that many people are. I think Netflix has advantages here though, where other parts of their culture (compensation, performance management, investment in communication) reinforce and enable a Focus on Results strategy that’s more holistic and more effective than at many other companies.

Fix Errors Quickly Rather than Preventing Them

Everyone in software development knows that it’s far better to find and fix bugs earlier in the cycle than later. Finding a bug after release to production has very high costs, both in impact to the business and in resources to fix (the person fixing the bug will have to context switch back to the code they wrote in most cases a long time in the past, and then the fix will have to go through the full lifecycle of integration/test/deployment). Finding the same bug during the QA phase reduces these costs, and finding it during development largely avoids them. Ergo, the more bugs you can prevent from getting out of the development phase, the better.

Like other things that everyone knows, you periodically need to re-evaluate this assumption and make sure it’s still true and applicable. The Netflix philosophy says that this is not true, at least not universally and definitely not in their specific case. I agree. I believe that techniques like continuous deployment and automated testing have reduced both of the costs mentioned above (the context switching, and the cost of the lifecycle) to the point where this assumption doesn’t apply in many cases.

So why aren’t more people doing this? I think this realization (that the landscape around the cost of fixing bugs is changing) is starting to percolate through the industry, but like anything else that everybody knows, it will take a long time for this to happen – and probably a generational turnover that we’re only now beginning. And it will never be universally applicable because bugs in different industries have very different impacts – the cost of a bug that interrupts people’s ability to watch “She-Ra” on their Xbox is not at all comparable to the cost of a bug that causes a laser scalpel to cut the wrong artery during brain surgery.

Avoid Rules and Policies

This one seems to get the most attention. For example, there is no vacation policy at Netflix. The expense policy is 5 words, vs the 25 pages I’ve seen at other companies. The idea here seems to be twofold. The first is that rules and policies can be confining and in some cases get in the way of getting things done, so when you get rid of them you give people more freedom. The second is that it specifically frees up managers’ time. At other companies, they spend part of their days dealing with PTO requests, approving expense reports, and enforcing compliance with policies. Getting rid of those policies frees up managers to spend their time on leadership, which is a far more leveraged way to spend their time. While this has caught the most eyes and generated the most discussion, I think in terms of real impact this is the least important. Symbolically, of course, it is very important.

Why aren’t more people doing this? People are starting to adopt the no-vacation policy, which I think is the easiest to adopt and probably a good marker for companies that will be implementing more of Netflix culture over time. I imagine there’s more reluctance on the expense policy – there’s a lot of emphasis these days on financial reporting and there’s been a lot of high-profile expense abuse in the news. I also think there’s a generational thing here though, and it’s related to the “fix errors rather than preventing them” idea above. My favorite quote from _Rework_ is that policies are “organizational scar tissue,” kind of like the old adage that behind every FAA regulation is a plane crash. When something goes wrong and people say “we can’t let this happen again!” the answer is generally some new rule or policy. That’s how it’s been done for a long long time, and getting away from that will be very difficult for a lot of folks. And just like the difference between showing videos and brain surgery, there are industries where a no-vacation policy doesn’t make sense – for example in a bank or financial institution where corruption/graft are significant risks, forced vacations are used as a risk mitigation, because such schemes often can’t be sustained when the primary agent is away from their position for a week or two. Finally, for this to really work you have to have what the deck calls “responsible people.” Not every company can hire exclusively responsible people – in this case, as above, other elements of Netflix culture enable them to have the right workforce to support this element of their culture.

Field the Best Team You Can

Complementing the freedom and responsibility side of their culture, the other key element is having a workplace filled with nothing but “stunning colleagues.” They make a very useful analogy to a professional sports team, where the objective is to always have a superstar at every position. This leads to specific strategies and techniques:

  • Pay top of market
  • Don’t settle for adequate performance

Pay Top of Market

Netflix’ philosophy is to pay more than anyone else would. The yardsticks used are how much that person would be paid somewhere else, how much Netflix would pay someone else to come in and do the same job, and how much Netflix would be willing to pay to prevent someone from leaving. It’s the manager’s job at Netflix to make sure that each employee is compensated at the level that meets the above criteria, proactively. This simply short circuits a lot of problems that managers need to deal with at other companies around recruiting, retention, and incentivization. This is largely based on a belief that the best employees (in creative/inventive positions) contribute significantly more value (like 2x-10x) than the average employee – so that paying them more is in reality a bargain.

Why don’t more people do this? Well, obviously not everyone can pay top of market, because then it wouldn’t be top of market anymore. But plenty of companies have explicit policies to pay at mid-market, or at the 75th percentile, etc. I think a big part of the answer here is that there’d be a lot of disagreement that the best employees can be 10x more valuable than the average employee. This is actually something I would love to see more data on – the culture deck simply states this as a fact without supporting citations. In my experience, I think this is a reasonable statement – I’m really curious what other people think (please tell me in the comments!)

Don’t Settle For Adequate Performance

I believe this generates the most controversy from the culture deck. Basically at Netflix, if you do an OK job, you are given a nice severance package and shown the door so that a superstar can be brought in to do a better job. At most companies, you are expected to do an OK job, and you’re only shown the door if you’re not doing an OK job. Think of the sports team analogy – if your first baseman is batting .270 with 50 RBIs, he’s doing OK, but if you have the opportunity to bring in a first baseman who bats .320 with 95 RBIs – wouldn’t you do it?

Why don’t more people do this? You can only do this if you are convinced you can bring in a superstar to replace the person leaving, and that assumption requires some of the other Netflix culture characteristics: namely paying at top of market and having a workplace that’s already filled with “stunning colleagues.” So this is something you can really only do if you’ve already adopted most of the rest of #netflixculture.


What really jumped out at me from writing this post is that a lot of the components of #netflixculture leverage and reinforce each other. For example, paying top of market lets you fill your company with responsible, awesome people, which lets you grow without adding tons of rules and policies. The real question is how much of what Netflix has done is unique to their position in their industry, and how much is generalizable to other companies? I’m looking forward to seeing other companies adopt this culture and seeing how it turns out for them. If your company is doing so, let me know!

Netflix teaches everyone how to host a tech meetup

Netflix has once again set the bar. Not with their technology this time – but with their organizing. I just got back from the first meetup of the NetflixOSS group – and it was spectacular. Let me walk you through it.

The intro was given by Ruslan Meshenberg, and contained a wonderful story about the email exchange that started the ball rolling on Netflix’s open-source efforts: one of their developers had something he wanted to open-source, so he asked about the policy for doing so. He was told “Our policy is we have no policies. Go for it!” To many of the developers in the audience this was a religious experience, because they had experienced very different responses at their companies to the same question. (When Ruslan earlier asked the audience how long it typically took to open-source something at other companies, the loudest response from the audience was “FOREVER!”) At this point the audience was warmed up and already on Netflix’s side, because they’d shown a) that they’re supporting open-source b) that they don’t put bullshit in your developers’ way and c) that they have smart, cool, funny people working for them. Win x3.

Next up was Cloud Architect Extraordinaire Adrian Cockcroft, who gave an overview of all the (really cool) platform pieces that Netflix has open-sourced, along with an explanation of why they are doing so in the first place. Adrian’s presentations at conferences are always packed and for good reason – he’s entertaining, charming, and (like Ruslan) knows how to hit developers in their hearts. His talk had enough anecdotes about quirky naming methods, reinforcement of the “stay out of developers’ way” culture at Netflix, and hard-won insights from experiences in production (and amusing digs at other platform solutions) to keep the audience’s interest piqued, and at the end everyone was hungry for more details about the components he had introduced. Adrian’s reinforcement of the “the whole is greater than the sum of the parts” message also reminded everyone that the software they’re talking about all works together to accomplish far more than any individual component could ever hope to – which gave the audience a fantastic analogy for the team and process at Netflix as well.

Then came the brilliant part. There was a series of lightning talks where one of the developers of each of those platform services talked briefly about what it did and why they needed it. This really stoked the audience’s interest and I know everyone (like I did) had one or two things they heard about that they were super eager to get more information on. This also made it even more evident that there are a lot of smart, experienced, and accomplished people at Netflix who are working on interesting problems and who have been given the freedom (and responsibility) to come up with the right solutions. Again, a tempting picture to paint for the developers in the audience.

The final stage was the demo room, where food and booze were provided and then those same engineers who had given the lightning talks were standing in front of stations where they would answer your questions and demo the software they had described earlier. So you could find the engineer who talked about what you had found so interesting during the lightning talks, and you could drill in with them. I spent a few minutes talking with Ben Christensen about Hystrix, a library that provides mechanisms for isolating failures in distributed systems so problems in one area don’t cascade throughout the system, as well as providing a dashboard view into the metrics generated by said library. This was fascinating to me for many reasons as I’ve long been interested in how we monitor and manage the connections between systems (sooooo much harder than dealing with the systems themselves) and I thought Hystrix did a very good job of presenting the information in a dashboard that was information-dense and yet at the same time very helpful at highlighting the things that were important. The point being that I found someone I could have a great conversation with about something I found very interesting – and I’m willing to bet that so did the majority of other attendees.
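To make the failure-isolation idea a little more concrete, here is a minimal circuit breaker sketch in Python. It is not Hystrix’s API (Hystrix is a Java library), and the class name, parameters, and thresholds are all made up for illustration. The idea it shows: after a few consecutive failures, stop calling the broken dependency for a while and serve a fallback instead, so one sick service doesn’t drag its callers down with it.

```python
import time


class CircuitBreaker:
    """Hypothetical, minimal circuit breaker (illustration only, not Hystrix)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, fallback):
        # While the circuit is open, skip the dependency entirely.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cool-down has elapsed: let one trial call through.
            self.opened_at = None
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        self.failures = 0  # any success closes the circuit fully
        return result
```

You would wrap each remote call in something like breaker.call(fetch_recommendations, lambda: []), where fetch_recommendations is a stand-in for whatever downstream call you are protecting. The failure and open/closed counts a wrapper like this collects are the sort of thing a dashboard can then surface.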

Overall, I had a great time and think it was a fantastic experience. It’s definitely one of the best tech meetups I’ve ever been to (and I’ve been to a lot). Kudos to the Netflix team – I think everyone had a good time, and I suspect a majority of the audience went home tonight thinking to themselves “Wow that’d be a cool place to work!” 

Which, I suspect, was the goal all along…

What Public Cloud SLAs ARE good for

When HP announced their cloud was going GA, lots of the coverage said that the cloud came with a 99.95% SLA. While that is technically accurate, it is also very misleading. People may assume (and even some coverage indicates this) that this means if you launch a virtual server inside HP’s cloud, they are guaranteeing that instance will have 99.95% uptime. This is emphatically not true. What they are really guaranteeing is that that instance will be up or you will be able to launch a replacement instance. That’s a really important “or.” Lydia Leong over at Gartner has a much more detailed analysis of this that you probably want to read. For my part, I’d like to illustrate some theoretical situations that would not be considered violations of your SLA, to make this a little more real and concrete for those of you with infrastructure running in the public cloud. (AWS’ SLA is similar to HP’s for all these cases.)

The Examples

Here we go. None of these situations is considered a breach of HP’s SLA:

  • The only VM you are running in the cloud blows up and is unrecoverable. This does not count against the SLA.
  • The only VM you are running in the cloud loses all network connectivity. This does not count against the SLA.
  • Every VM you have in every availability zone in the region blows up at exactly the same time. They are all unrecoverable. This does not count against the SLA (because you can launch replacement VMs for all of them).

Amazon is no better. The examples above all would be the same for AWS – not a breach of their SLA. Here’s one that would be covered by HP’s SLA, but not AWS’:

  • Let’s say you have 200 VMs in each of 3 different AZs in a region. In each of those AZs, 199 of those instances die and you cannot launch any replacement instances. So you are left with 3 of your 600 instances running (99.5% of your capacity is lost). This does not count against AWS’ SLA.
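The examples above can be sketched as a small Python model. Treat it as a rough illustration, not the contract language: the per-instance rule follows the description of HP’s guarantee given earlier in this post, while the zone-level rule is my assumed reading of AWS’ terms (a zone only counts as unavailable when none of your instances there are running and you cannot launch replacements, and a breach needs more than one such zone). All names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ZoneState:
    """What one customer sees in a single availability zone."""
    running: int                   # instances still up and reachable
    failed: int                    # instances that died or lost connectivity
    can_launch_replacements: bool  # can new instances be started here?


def per_instance_breach(zones):
    """HP-style rule as described in the post: a breach needs an instance
    that is down AND no ability to launch a replacement."""
    return any(z.failed > 0 and not z.can_launch_replacements for z in zones)


def zone_level_breach(zones):
    """ASSUMED AWS-style rule: a zone is 'unavailable' only if nothing of
    yours is running there and you cannot launch replacements; a breach
    needs more than one unavailable zone."""
    unavailable = [
        z for z in zones
        if z.running == 0 and z.failed > 0 and not z.can_launch_replacements
    ]
    return len(unavailable) > 1


def capacity_lost(zones):
    """Fraction of the customer's instances that are no longer running."""
    total = sum(z.running + z.failed for z in zones)
    return sum(z.failed for z in zones) / total


# The scenarios from the bullets above.
single_vm_dies = [ZoneState(running=0, failed=1, can_launch_replacements=True)]
every_vm_dies = [ZoneState(0, 200, True) for _ in range(3)]
almost_all_die = [ZoneState(1, 199, False) for _ in range(3)]
```

Under these assumptions, the first two scenarios trip neither rule, because replacements can still be launched. The last one trips the per-instance rule but not the zone-level one, even though capacity_lost(almost_all_die) comes out to 597/600.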

What to do?

I’m really not ragging on public cloud here. I don’t think SLAs have much value as contracts. What they are good for is helping you do risk management. A provider is going to sum up lots of information about how they’ve designed and architected their services in their SLA. Think about new cars: if one comes with a 12-month limited warranty and another comes with a 10-year bumper-to-bumper warranty, doesn’t that make you think they are designed and built differently? So both HP’s and AWS’ SLAs tell you a lot about how their systems are architected. Mostly they tell you that individual instances are ephemeral and you’d better account for that when running your applications on top of them.

This is also why any SLA that says “100% uptime” is completely useless, because that number is obviously divorced from any match to the reality of the design and architecture of the system underneath the SLA. No system can achieve 100% uptime, and by pretending theirs does, service providers actually make it more difficult for their customers to do intelligent risk management.
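For a sense of scale, here is the arithmetic behind an SLA percentage as a tiny Python helper. The function name and the 30-day measurement period are my assumptions; real SLAs define their own period and their own meaning of “unavailable,” which (as discussed above) is usually not per-instance uptime.

```python
def allowed_downtime_minutes(sla_percent, period_days=30):
    """Minutes of 'unavailability' an SLA percentage permits per period.

    period_days=30 is an assumed measurement window; check the actual
    SLA for the period it uses.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.95, 100.0):
        minutes = allowed_downtime_minutes(pct)
        print(f"{pct}% over 30 days allows about {minutes:.1f} minutes")
```

A 99.95% figure works out to roughly 21.6 minutes over 30 days, which is a budget you can plan around. A 100% figure works out to zero, which tells you nothing about how the system underneath is actually built.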

My advice: make sure you understand your providers’ SLAs. Not because you’re going to use those SLAs to wrangle compensation out of them for outages (maybe I’ll post later about how such compensation is also a joke). You want to understand your providers’ SLAs so that you can do intelligent risk management for yourself. Knowledge is power. Get it and use it!