Formula 1 racing and DevOps. Two things that I love. At first glance, you might not think they have anything in common. But they do! Both of them are about maximizing the throughput of your systems through relentless focus on improving performance and reliability.
*Observation Tower at Circuit of the Americas during the 2016 F1 race*
How did I start thinking about this? At the Bahrain race this year, Ferrari had a horrific accident in the pit lane that seriously injured one of their mechanics. The car started driving away while the mechanic was still standing in front of the rear tire. These cars accelerate fast, and in a split second the moving tire hit the mechanic’s leg and broke it in multiple places. Of particular interest to me, there’s an automation backstory to this horrible event. In the olden days, releasing the car from a pit stop was the responsibility of a human (affectionately known as the “lollipop man”). But in the last few years there’s been a switch to an automated system that checks various things before it turns the light green, signaling the driver to go. In this case, it checked for all the things it’s supposed to check for and turned the light green – but the mechanic was still standing in front of the tire. (Obviously that’s not one of the things it checks for.) While the details aren’t 100% clear, it appears that whatever sequence of events occurred was not accounted for in the design of the automation (i.e., an edge case). By taking the human out of the decision-making loop, this particular implementation of automation led to a tragic outcome.
Obviously the link to DevOps is automation. Automation is essential to success in today’s world of computing infrastructure. But it’s not a panacea, and it can lead to bad outcomes, typically in unaccounted-for edge cases (just like the pit stop). A classic example is automation for simple self-healing. Say you run a bunch of containers, and sometimes they get into a bad state and need to be destroyed and new ones spun up. In the olden days some human would keep an eye on the fleet and kill/restart things manually when needed. Automation lets you remove that toil by setting up a simple liveness monitor for each container; when the monitor fails, software automatically kills and restarts it. That works great – most of the time. But you might experience pathological behavior when some downstream component has a hiccough that causes all of your liveness probes to fail for a brief period, leading every container to be killed and restarted at once – which itself causes a service outage.
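The failure mode above can be made concrete with a small sketch. This is illustrative only – `restart_unhealthy` and the probe function are made-up stand-ins for whatever your orchestrator actually provides, not a real API:

```python
def restart_unhealthy(containers, is_alive):
    """Naive self-healer: kill/restart every container whose liveness
    probe fails, no questions asked. `is_alive` stands in for the probe."""
    return [cid for cid in containers if not is_alive(cid)]

# A transient hiccough in a shared downstream dependency makes *every*
# probe fail at once, so the automation restarts the whole fleet.
fleet = ["web-1", "web-2", "web-3", "web-4"]
downstream_up = False
probe = lambda cid: downstream_up  # every probe tracks the same dependency

print(restart_unhealthy(fleet, probe))  # the entire fleet gets restarted
```

The automation behaves exactly as designed for the single-bad-container case it was built for; the outage comes from the correlated-failure edge case nobody modeled.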
Automation is a powerful tool, in both the F1 and the DevOps context. The key is to figure out what should be automated, and what shouldn’t. You want to remove as much of the toil from humans as possible, but in places where key decisions need to be made – well that’s what humans are there for. Is releasing a car from a pit stop that kind of key decision? That’s arguable, but I’d vote yes. Is restarting a container a key decision? Absolutely not. Is restarting all your containers simultaneously a key decision? Yes! Make sure your automation is clear on what is and isn’t a key decision, and treats them appropriately!
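One way to encode that distinction is a guard in the automation itself: restarting one container is routine, but restarting a large fraction of the fleet is a key decision that gets escalated to a human. A minimal sketch, with an assumed (and tunable) 25% threshold:

```python
def reconcile(containers, is_alive, max_restart_fraction=0.25):
    """Self-healing with a 'key decision' guard: one failed probe is
    handled automatically; correlated failures are escalated instead."""
    unhealthy = [cid for cid in containers if not is_alive(cid)]
    if len(unhealthy) > max_restart_fraction * len(containers):
        # Many simultaneous failures smell like a downstream problem,
        # not individually bad containers -- page a human instead.
        return ("page-human", unhealthy)
    return ("restart", unhealthy)

print(reconcile(["a", "b", "c", "d"], lambda c: c != "b"))  # restart just "b"
print(reconcile(["a", "b", "c", "d"], lambda c: False))     # escalate
```

The exact threshold matters less than the principle: the automation knows which of its actions are routine and which are decisions it shouldn’t make alone.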
Is your service three nines, four nines, or even five? No matter what your answer is, it’s almost surely inaccurate.
I recently went through an exercise at work to calculate the expected availability of one of our foundational systems. It reminded me how little these calculations have to do with actual availability as experienced by consumers. Expected availability numbers are generally based on hardware failure rates. You combine how often a component fails (mean time between failures) with how long it takes to repair (mean time to repair), and that gives you the component availability. An individual server may have an expected availability of 99%, which means in an average year you’d expect it to be down for repairs for about three and a half days. An easy way to raise the availability of a system is to add redundant components – if you have two of those servers, your system availability goes up to 99.99%. Why? Because the chance of both servers failing at the same time is really small. With three servers you get up to 99.9999%. As the system grows more complex, with more layers and more dependencies, the math gets a little more complicated but the idea stays the same, so you can calculate the expected availability of your entire system from the availability of each of its components. If you’re running a production system at scale, a typical design (redundant data centers, redundant circuits, redundant systems) could easily reach 99.999% (five nines) on paper. That’s about 5 minutes of downtime per year. For calibration, it would take 12 years of uninterrupted service to be able to take a 1 hour outage and still be at five nines. But every big outfit, including Google, AWS, and Facebook, has experienced outages longer than that, even though they have big budgets and super smart people designing their systems. Why?
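The redundancy math above is simple enough to sketch directly. Assuming independent failures (the textbook assumption, and exactly the assumption real outages violate), the availability of n redundant components is one minus the probability that all n are down at once:

```python
def parallel_availability(a, n):
    """Availability of n redundant components, each with availability a,
    assuming failures are independent."""
    return 1 - (1 - a) ** n

def downtime_per_year(avail):
    """Expected downtime in minutes per year."""
    return (1 - avail) * 365 * 24 * 60

for n in (1, 2, 3):
    a = parallel_availability(0.99, n)
    print(f"{n} server(s): {a:.6f} availability, "
          f"{downtime_per_year(a):,.2f} min/yr expected downtime")
```

This reproduces the figures in the text: one 99% server is down about 3.65 days a year, two in parallel give 99.99%, three give 99.9999%. The independence assumption is doing all the work here – a change pushed to every replica at once fails all of them together, and the formula says nothing about that.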
It turns out that most big outages are not caused by component failures. The most common cause of a major outage is someone making a change. All three of the outages I linked to above were caused by a human making a change. Reliability calculations based on component failures tell you absolutely zero about how likely your system is to fail when you make changes – that depends on the quality of your tooling, the frequency of your changes, the design of your system, and the capabilities and training of your team. The second most common cause of outages is overload – when your system (or some critical subsystem) can’t keep up with the load being sent to it. Two of the three examples involved overload conditions.
I’ve seen a lot of outages in my career and a vanishingly small percentage were caused by hardware failures – pretty much any decent system these days has been designed to handle individual component failures. The trick is figuring out how to make your system resilient in the face of change and making sure you have the tooling you need to be able to react to and quickly fix any problems that do come up (including being able to quickly add new capacity if needed). If you’re trying to build a reliable service you should pay just as much attention to those as you do to the reliability of your system components!
People have asked me, “Are we doing DevOps, or are we doing SRE?” I’ve also heard (and this is worse): “We’re an SRE team – we don’t do DevOps.” These distinctions don’t make sense, because SRE and DevOps aren’t actually different things. SRE is DevOps. To be more precise, SRE is a specific implementation of DevOps. DevOps is a broad (and vague) term. It’s more of a philosophy than a methodology – it’s a perspective on the world and a set of patterns to apply. SRE shares the DevOps philosophy and many of the same patterns.
The term “SRE” generally refers to Google SRE, which is a particular implementation of DevOps inside of a ton of Google-specific context. (SRE – both the term and the practice – originated at Google and have only recently been adopted by other organizations.) There are several things DevOps and SRE have in common:
- Focus on solving problems with software
- Ownership and empowerment of the team responsible for a service
- Learning relentlessly from successes and (especially) failures
- Driven by data and metrics
Google SRE adds a lot of specifics – some of the most interesting are aspects of economics and incentives, such as:
- Common (human) resource pool between software developers and SREs – scarcity of whom leads to explicit decisions to optimize between features and reliability
- Use of an “error budget” to throttle the rate of change for a product – including the unintuitive guidance that if you are exceeding your SLO for availability, you should launch more features
- A cap of 50% of SRE time on operational tasks (known as “toil”) – to ensure the system can scale faster than the team required to support it
- At least 5% of operational work done by software developers – to maintain visibility of the operational load the software creates
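The error budget mechanic above is just arithmetic: the SLO implies an allowance of permitted downtime over a window, and how much of that allowance remains drives the launch/hold decision. A minimal sketch, with an assumed 99.9% SLO and 30-day window for illustration:

```python
def error_budget_minutes(slo, window_days=30):
    """Downtime allowed by an SLO over a rolling window, in minutes."""
    return (1 - slo) * window_days * 24 * 60

budget = error_budget_minutes(0.999)   # 99.9% SLO -> 43.2 min / 30 days
spent = 20                             # minutes of downtime so far (example)
remaining = budget - spent
print(f"budget: {budget:.1f} min, remaining: {remaining:.1f} min")
# Budget comfortably unspent -> you're beating your SLO, which per the
# guidance above is a signal to take more risk and ship more features.
# Budget exhausted -> freeze launches until reliability recovers.
```

The unintuitive part isn’t the math, it’s the incentive: the budget converts “how reliable should we be?” into an explicit trade against release velocity.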
Google SRE operates on top of Google’s internal infrastructure and products. This is an extremely important part of the Google SRE context – they have had brilliant people working for fifteen years on the foundational systems, processes, and tools used to manage Google’s services. And within Google, every SRE team can benefit from not just common tooling and infrastructure, but also repeatable, translatable process. No other existing SRE team (outside of Google) works with the same level of foundational support.
As SRE expands outside the walls of Google, I like to think it will come to mean “applying the principles of DevOps at scale.” “Service Reliability Engineering” (an evolution of Google’s “Site Reliability Engineering”) is a much better term than “DevOps” to apply to teams focused on the reliability and performance of large-scale distributed systems, because it reflects the work and the expertise involved. “DevOps,” unfortunately, tends to just create confusion when applied to an organization or a strategy.
What, then, does it mean to do SRE? What does applying DevOps at scale look like? To start with:
- Automate your infrastructure – build and management
- Monitor what matters – set explicit SLOs for your services and gather the data both to see if you’re hitting the objective, as well as to evaluate the effects of changes to your infrastructure or code
- Make your code builds and deploys both automated and repeatable, leveraging CI/CD
- Learn from your failures with an effective retrospective process for incidents and launches
- Empower your people – software developers and SREs – and hold them accountable for the overall success of the product – which includes its reliability
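The “monitor what matters” item boils down to computing a service level indicator (SLI) from the data you gather and comparing it against the objective. A minimal request-based sketch – the counts and the 99.9% target are made up for illustration:

```python
def availability_sli(success_count, total_count):
    """Request-based SLI: the fraction of events that were good."""
    return success_count / total_count

SLO = 0.999  # assumed availability objective for this example

# Counts would come from your monitoring system; these are invented.
sli = availability_sli(success_count=999_542, total_count=1_000_000)
print(f"SLI = {sli:.4%}, meeting SLO: {sli >= SLO}")
```

The same SLI time series does double duty: it tells you whether you’re hitting the objective, and, compared before and after a deploy, it shows the effect of each change on reliability.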
DevOps and SRE aren’t at odds. You can learn a lot from both!