Formula 1 racing and DevOps. Two things that I love. At first glance, you might not think they have anything in common. But they do! Both of them are about maximizing the throughput of your systems through relentless focus on improving performance and reliability.
How did I start thinking about this? At the Bahrain race this year, Ferrari had a horrific accident in the pit lane that seriously injured one of their mechanics. The car started driving away while the mechanic was still standing in front of the rear tire. These cars accelerate fast, and in a split second the moving tire hit the mechanic’s leg and broke it in multiple places. Of particular interest to me, there’s an automation backstory to this horrible event. In the olden days, releasing the car from a pit stop was the responsibility of a human (affectionately known as the “lollipop man”). But in the last few years there’s been a switch to an automated system that checks various things before it turns the light green, signaling the driver to go. In this case, it checked for all the things it’s supposed to check for and turned the light green – but the mechanic was still standing in front of the tire. (Obviously that’s not one of the things it checks for.) While the details aren’t 100% clear, it appears that the sequence of events that occurred was not accounted for in the design of the automation (i.e., it was an edge case). By taking the human out of the decision-making loop, this particular implementation of automation led to a tragic outcome.
Obviously the link to DevOps is automation. Automation is essential to success in today’s world of computing infrastructure. But it’s not a panacea and can lead to bad outcomes, typically in unaccounted-for edge cases (just like the pit stop). A classic example is automation for simple self-healing. Say you run a bunch of containers and sometimes they get into a bad state and need to be destroyed and new ones spun up. In the olden days some human would keep an eye on the fleet and kill/restart things manually when needed. Automation lets you remove that toil from the human by setting up a simple liveness monitor for each container; when the monitor fails, software automatically kills and restarts that container. That works great – most of the time. But you might experience pathological behavior when some downstream component has a hiccup that causes all of your liveness probes to fail for a brief period. Every container then gets killed and restarted at the same time, and that causes a service outage.
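To make that failure mode concrete, here is a minimal sketch of naive self-healing. Everything in it is a made-up illustration rather than any real orchestrator's API: the `naive_self_heal` function, the `make_probe` helper, the `downstream_ok` flag, and the `web-1` style container names are all assumptions. The one thing it tries to show is that a probe which depends on a shared downstream component cannot tell "this container is stuck" apart from "the thing we all depend on blinked".

```python
from typing import Callable, Dict, List


def naive_self_heal(probes: Dict[str, Callable[[], bool]]) -> List[str]:
    """Restart every container whose liveness probe fails.

    Returns the names it would restart. A real system would call its
    orchestrator here; this sketch only records the decision.
    """
    restarted = []
    for name, probe in probes.items():
        if not probe():
            restarted.append(name)
    return restarted


# Hypothetical shared dependency that every probe happens to touch.
downstream_ok = {"value": True}


def make_probe(container_healthy: bool) -> Callable[[], bool]:
    # The probe passes only if the container AND the downstream are fine.
    return lambda: container_healthy and downstream_ok["value"]


fleet = {
    "web-1": make_probe(True),
    "web-2": make_probe(True),
    "web-3": make_probe(False),  # genuinely stuck container
}

normal = naive_self_heal(fleet)         # only the stuck container is restarted
downstream_ok["value"] = False          # brief downstream hiccup
during_hiccup = naive_self_heal(fleet)  # now the whole fleet is restarted
```

On a normal day only `web-3` gets restarted. During the hiccup the same loop picks every container in the fleet, healthy or not.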
Automation is a powerful tool, in both the F1 and the DevOps context. The key is to figure out what should be automated, and what shouldn’t. You want to remove as much of the toil from humans as possible, but in places where key decisions need to be made – well that’s what humans are there for. Is releasing a car from a pit stop that kind of key decision? That’s arguable, but I’d vote yes. Is restarting a container a key decision? Absolutely not. Is restarting all your containers simultaneously a key decision? Yes! Make sure your automation is clear on what is and isn’t a key decision, and treats them appropriately!
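One way to encode that distinction is a guard that restarts containers on its own only when a small share of the fleet is failing, and hands the decision to a human otherwise. The sketch below is an assumption-heavy illustration, not a recommendation for specific numbers: `plan_restarts`, the `Decision` type, and the 25% `max_fraction` default are all hypothetical names and values I picked for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Decision:
    restart: List[str]   # containers the automation may restart by itself
    needs_human: bool    # True when the call should be escalated to a person


def plan_restarts(
    failing: List[str],
    fleet_size: int,
    max_fraction: float = 0.25,  # hypothetical threshold, tune for your fleet
) -> Decision:
    """Decide whether failing containers can be restarted automatically."""
    if fleet_size <= 0:
        raise ValueError("fleet_size must be positive")
    # Too many failures at once looks like a shared cause, not a bad container,
    # so restarting everything becomes a key decision: do nothing and escalate.
    if len(failing) / fleet_size > max_fraction:
        return Decision(restart=[], needs_human=True)
    # A few failures are routine toil: restart them without asking anyone.
    return Decision(restart=list(failing), needs_human=False)


one_bad = plan_restarts(["c3"], fleet_size=8)
all_bad = plan_restarts([f"c{i}" for i in range(8)], fleet_size=8)
```

With one failing container out of eight, the automation just restarts it. With all eight failing, it restarts nothing and flags the situation for a human, which is the "lollipop man" stepping back into the loop.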