Are you SRE or are you DevOps?

People have asked me, “Are we doing DevOps, or are we doing SRE?” I’ve also heard (and this is worse): “We’re an SRE team – we don’t do DevOps.” These distinctions don’t make sense, because SRE and DevOps aren’t actually different things. SRE is DevOps. To be more precise, SRE is a specific implementation of DevOps. DevOps is a broad (and vague) term. It’s more of a philosophy than a methodology – it’s a perspective on the world and a set of patterns to apply. SRE shares the DevOps philosophy and many of the same patterns.

The term “SRE” generally refers to Google SRE, which is a particular implementation of DevOps inside of a ton of Google-specific context. (SRE – both the term and the practice – originated at Google and has only recently been used by other organizations.) There are several things DevOps and SRE have in common:

  • Focus on solving problems with software
  • Ownership and empowerment of the team responsible for a service
  • Learning relentlessly from successes and (especially) failures
  • Driven by data and metrics

Google SRE adds a lot of specifics – some of the most interesting are aspects of economics and incentives, such as:

  • Common (human) resource pool between software developers and SREs – scarcity of whom leads to explicit decisions to optimize between features and reliability
  • Use of an “error budget” to throttle the rate of change for a product – including the unintuitive guidance that if you are exceeding your SLO for availability, you should launch more features
  • A cap of 50% of SRE time on operational tasks (known as “toil”) – to ensure the system can scale faster than the team required to support it
  • At least 5% of operational work done by software developers – to maintain visibility of the operational load the software creates
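
To make the error budget idea concrete, here is a minimal sketch in Python. It assumes a request-based availability SLO, which is only one of several ways to define a budget. The class name, field names, and the simple "launch while budget remains" rule are my own illustration, not Google's actual tooling.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical request-based error budget for one measurement window."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests served during the window
    failed_requests: int  # requests that failed during the same window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        # Positive: budget left to spend. Negative: the SLO has been missed.
        return self.allowed_failures - self.failed_requests

    def can_launch(self) -> bool:
        # Assumed policy: keep shipping changes while any budget remains.
        return self.remaining > 0


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"allowed failures: {budget.allowed_failures:.0f}")
    print(f"remaining budget: {budget.remaining:.0f}")
    print(f"ok to launch: {budget.can_launch()}")
```

With a 99.9% target over a million requests, roughly a thousand failures are allowed. A service that has only failed 400 times is beating its SLO, so under this policy the team has room to take more risk and ship more features.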

Google SRE operates on top of Google’s internal infrastructure and products. This is an extremely important part of the Google SRE context – they have had brilliant people working for fifteen years on the foundational systems, processes, and tools used to manage Google’s services. And within Google, every SRE team can benefit from not just common tooling and infrastructure, but also repeatable, translatable process. No other existing SRE team (outside of Google) works with the same level of foundational support.

As SRE expands outside the walls of Google, I like to think it will come to mean “applying the principles of DevOps at scale.” “Service Reliability Engineering” (an evolution of Google’s “Site Reliability Engineering”) is a much better term than “DevOps” to apply to teams focused on the reliability and performance of large-scale distributed systems, because it reflects the work and the expertise involved. “DevOps,” unfortunately, tends to just create confusion when applied to an organization or a strategy.

What, then, does it mean to do SRE? What does applying DevOps at scale look like? To start with:

  • Automate your infrastructure – build and management
  • Monitor what matters – set explicit SLOs for your services and gather the data both to see whether you’re hitting the objective and to evaluate the effects of changes to your infrastructure or code
  • Make your code builds and deploys both automated and repeatable, leveraging CI/CD
  • Learn from your failures with an effective retrospective process for incidents and launches
  • Empower your people – software developers and SREs – and hold them accountable for the overall success of the product – which includes its reliability
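
As a rough illustration of the "monitor what matters" point, the sketch below measures availability from request counts and compares it before and after a change. The function names, the tuple layout, and the sample numbers are all hypothetical. In practice the counts would come from whatever monitoring system you already run.

```python
from typing import Dict, Tuple, Union


def availability(successes: int, total: int) -> float:
    """Fraction of requests that succeeded in a window."""
    if total == 0:
        # Assumption: a window with no traffic is treated as fully available.
        return 1.0
    return successes / total


def evaluate_change(
    before: Tuple[int, int],
    after: Tuple[int, int],
    slo_target: float,
) -> Dict[str, Union[float, bool]]:
    """Compare availability before and after a deploy or infrastructure change.

    `before` and `after` are (successes, total) request counts for two windows.
    """
    before_sli = availability(*before)
    after_sli = availability(*after)
    return {
        "before": before_sli,
        "after": after_sli,
        "delta": after_sli - before_sli,
        "meets_slo": after_sli >= slo_target,
    }


if __name__ == "__main__":
    # Made-up counts for illustration only.
    report = evaluate_change(before=(9_995, 10_000), after=(9_950, 10_000), slo_target=0.999)
    for name, value in report.items():
        print(f"{name}: {value}")
```

The useful part is less the arithmetic than the habit: the same objective that tells you whether the service is healthy also tells you whether the last change helped or hurt.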

DevOps and SRE aren’t at odds. You can learn a lot from both!

About Paul Guth

Old Timey Web Ops guy. I think about cars and clouds, and how they could be faster, cheaper, and more resilient.

2 responses to “Are you SRE or are you DevOps?”

  • Josh Grose

    Paul – Thank you for delineating between the two. I feel the SRE term/title is following the same nebulous path as DevOps. I noticed that your focus is on large-scale environments (acknowledging relativity). Do you see SREs adding value in more traditional environments running a blend of proprietary and homegrown applications?

    Thanks,
    Josh

    • Paul Guth

      Thanks for the question Josh. My first gut response was “Yes, of course!” SREs, after all, are just smart people who understand the value of automation, learning from failures, solving problems with software, and using metrics and data to drive decisions, and as such they can add value anywhere! But really, I think the question you’re asking is more complicated than that answer….

      I think there are two questions here:

      1. Can you hire ex-Google SREs to run your traditional IT environment?
      A: Maybe/probably. Google SREs are really really smart. They have great experience running infrastructure and services. OTOH, they come with a Google context. They’re used to having tools and processes that you don’t have. And your most important problems are not the same as Google’s. As an example, there’s a chapter in Google’s “Site Reliability Engineering” book (a must-read, btw) about testing, and it says “It takes little effort to create a series of smoke tests to run for every release.” That’s true at Google, but in a more traditional environment that’s likely to be untrue – no one who has the knowledge required to create those tests is available, and running those tests requires frameworks that no one has developed. Some ex-Googlers are going to have trouble switching contexts to different environments. Of course, many won’t, and would add tons of value to your traditional IT environment (if you could convince them to go there….)

      2. Can you use SRE principles and methods to improve your traditional IT environment?
      A: Not really, if you’re talking about Google SRE principles and methods. Again, those operate in a Google context and they work great there – but your organization has a different context. However, my more generic list above of what it means to do SRE (aka DevOps at Scale) would still apply. No matter what your environment looks like, making it more automated, making monitoring more value-focused and more visible, and learning effectively from failures will definitely add value.
