What Public Cloud SLAs ARE good for

When HP announced their cloud was going GA, lots of the coverage said that the cloud came with a 99.95% SLA. While that is technically accurate, it is also very misleading. People may assume (and even some coverage implies) that if you launch a virtual server inside HP’s cloud, HP is guaranteeing that instance will have 99.95% uptime. That is emphatically not true. What they are really guaranteeing is that the instance will be up or you will be able to launch a replacement instance. That’s a really important “or.” Lydia Leong over at Gartner has a much more detailed analysis of this that you probably want to read. For my part, I’d like to illustrate some theoretical situations that would not be considered violations of your SLA, to make this a little more real and concrete for those of you with infrastructure running in the public cloud. (AWS’ SLA is similar to HP’s for all of these cases.)

The Examples

Here we go. None of these situations is considered a breach of HP’s SLA:

  • The only VM you are running in the cloud blows up and is unrecoverable. This does not count against the SLA.
  • The only VM you are running in the cloud loses all network connectivity. This does not count against the SLA.
  • Every VM you have in every availability zone in the region blows up at exactly the same time. They are all unrecoverable. This does not count against the SLA (because you can launch replacement VMs for all of them).

Amazon is no better. The examples above would all play out the same way for AWS – not a breach of their SLA. Here’s one that would be covered by HP’s SLA, but not AWS’:

  • Let’s say you have 200 VMs in each of 3 different AZes in a region. In each of those AZes, 199 of your instances die and you cannot launch any replacement instances. So you are left with 3 of your 600 instances running (99.5% of your capacity is lost). This does not count against AWS’ SLA – though it would count against HP’s.

What to do?

I’m really not ragging on public cloud here. I don’t think SLAs have much value as contracts. What they are good for is helping you do risk management. A provider sums up a lot of information about how they’ve designed and architected their services in their SLA. Think about new cars: if one comes with a 12-month limited warranty and another comes with a 10-year bumper-to-bumper warranty, doesn’t that make you think they are designed and built differently? So both HP’s and AWS’ SLAs tell you a lot about how their systems are architected. Mostly they tell you that individual instances are ephemeral, and you’d better account for that when running your applications on top of them.

This is also why any SLA that says “100% uptime” is completely useless: that number is obviously divorced from the reality of the design and architecture of the system underneath the SLA. No system can achieve 100% uptime, and by pretending theirs does, service providers actually make it more difficult for their customers to do intelligent risk management.

My advice: make sure you understand your providers’ SLAs. Not because you’re going to use those SLAs to wrangle compensation out of them for outages (maybe I’ll post later about how such compensation is also a joke), but so that you can do intelligent risk management for yourself. Knowledge is power. Get it and use it!


Incident Management and Problem Management – Web Ops style

I’ve noticed that a lot of people find this blog by searching for “incident management versus problem management.” This puzzled me, because I figured there must be way better places on the internets to find that answer. So I looked at the google results for that search and read the top five or so. And now I get it. Because those all suck. They have too much ITIL. And they’re too focused on service desk setups versus web ops setups. So let me try to put this in the simplest terms I can:

  • Incident Management is about fixing something that’s broken
  • Problem Management is about fixing something that’s broken

Confusing, right? Well, the two really are different. Let’s start with some analogies.

The Analogies

Let’s say your house is on fire. “Incident Management” is the firefighters putting out the fire. “Problem Management” is the investigation that happens afterwards to figure out why the fire started and what to do about it.

Let’s say you’re on a plane, and all the engines fail. “Incident Management” is the flight crew gliding the plane to an emergency landing. “Problem Management” is the investigation that happens afterwards to determine why those engines all failed and what should be changed in the future so that doesn’t happen again.

In a Web Ops world, “Incident Management” is what you do when all the servers in one of your datacenters have gone tango uniform – perhaps you take all the incoming traffic and redirect it over to a different datacenter. “Problem Management” would be figuring out what happened to all those servers – why did they fail together and what do you have to do to make sure that doesn’t happen again.

The Definitions

“Incident Management,” then, is when you are immediately addressing an urgent situation. Timeframes are generally short and pressure is high, and you’ll do whatever it takes to make things better – even things that in normal situations would be too costly, too risky, or against policy. For example, if your engines are out you might land your plane on a freeway – which is something you’d never do under normal circumstances. And importantly, you don’t do investigations into what caused things while you’re doing incident management. You really don’t care at that point why all your engines failed – you just want to get the plane down safely.

“Problem Management,” in contrast, is a more measured activity where you figure out what was really behind the incident and then what to do about it. The situation isn’t as urgent and the techniques are different. Root cause analysis is a key tool for Problem Management. And I don’t mean the mythical quest for a single root cause, I mean real root cause analysis. Which is a topic for another post.

The two are connected, of course. The inputs to Problem Management are The Facts: what happened when, who did what, what results those actions had, etc. Those Facts come out of the Incident Management process, and ideally are recorded automatically, without the people doing Incident Management having to do anything special – think of the “black box” that commercial aircraft have. In a Web Ops world, automatically recording crisis phone bridges and chatrooms can serve a similar purpose.

Who Cares?

It’s important to recognize that these are two different activities, and treat them as such. Most everyone does Incident Management, out of necessity (I mean, it’s not like you’re going to decide NOT to emergency land your plane when all your engines fail). What’s sometimes lacking, however, is the followup through Problem Management. It’s tempting after an incident to say “We know what happened, and what to do about it.” It’s tempting because it sounds right. And because it’s a lot easier than doing real root cause analysis, and a helluva lot easier than actually prioritizing and executing all the actions that are really required to address the underlying problems that led to the incident. But in the long run you’re way better off investing that time and effort.

Here’s how to tell whether you need to step up your Problem Management efforts. Do you ever say or hear things like this?

  • “Oh yeah, I remember when this happened last time.”
  • “Call Bobby Ray, he fixed this before.”
  • “Why does this keep happening?”
  • “We’ve gotten really good at dealing with this problem.”

If so – try some Problem Management.

What to do?

  • Make sure you’re recording useful information from your incidents – get The Facts.
  • Have an official post-mortem investigation process for your incidents (and use it).
  • Periodically review the actions you’ve identified to fix things and see how you’re doing at getting them accomplished.

You don’t need a separate Problem Management team. (In fact, I’d recommend against it). You don’t need some expensive Problem Management software. All you need is an incident, some data, and a desire to make things better. I know you’ve got that!


Is Ubuntu a good server OS? – firewall edition

Earlier, I posted about upstart and how it illustrates that Ubuntu is not a good operating system for running servers. Today I wanted to provide another example: UFW, the “Uncomplicated FireWall” that is installed by default on Ubuntu servers.

Linux firewalling and UFW

Firewalls in Linux are implemented with iptables. In a nutshell, iptables evaluates packets against lists of rules to determine whether they should be accepted or rejected. On servers, iptables is typically driven by scripts that set up all these rules the way you want them – these days that process is generally managed through whatever configuration management system you’re using (e.g. chef or puppet). UFW is a tool distributed with Ubuntu that provides a different way of setting up these iptables rules, using command line tools.
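
To make that concrete, here’s a bare-bones sketch of the kind of rules script a configuration management tool might drop onto a server. It’s illustrative only – the ports and policy here are assumptions on my part, not a recommendation:

#!/bin/sh
# Illustrative iptables setup - real versions are usually templated by chef/puppet
iptables -F                                                # flush existing rules
iptables -P INPUT DROP                                     # default deny inbound
iptables -A INPUT -i lo -j ACCEPT                          # allow loopback
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j ACCEPT              # allow ssh
iptables -A INPUT -p tcp --dport 80 -j ACCEPT              # allow http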

How UFW works

UFW keeps its own set of iptables rules in /etc/ufw. When ufw is turned on, it flushes all the existing iptables rules and replaces them with the rules from /etc/ufw. You can then use the command line tools to add ports to allow or deny, which updates both the running iptables rules and the copies stored in /etc/ufw.

Note that all UFW works with is what’s in /etc/ufw – it doesn’t know or care what the running iptables rules are.
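
For reference, day-to-day usage looks something like this (the ports are just examples):

ufw allow 22/tcp       # permit ssh
ufw deny 3306/tcp      # refuse mysql from the outside
ufw status verbose     # show what ufw thinks is configured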

A Totally Theoretical Example

Let’s pretend you have a big distributed system where the nodes are all running Ubuntu. It’s a big system, and it’s part of an even larger IT environment, so there’s a large IT support organization staffed with people who were hired because they have a fair bit of experience with Linux systems – but some of those folks have more at-home, desktop Linux experience than datacenter, server Linux experience. Also, these folks don’t know the ins and outs of the design of this particular distributed system, because they are responsible for all of the varied IT environments at this large organization. The hero in our story is one of these selfless and courageous sysadmins. A report comes in from the users of this distributed system that they’re having trouble reaching some of the resources it provides. Initial troubleshooting leads our hero to posit that the problem is a misconfigured firewall. His experience is with Ubuntu and he’s always used ufw, so his first step is to disable the firewall and see if that helps:

ufw disable

There’s no change, so he figures the firewall is unrelated and turns it back on:

ufw enable

Now he moves on to further troubleshooting.

Here’s the problem: The ‘ufw disable’ command above didn’t actually do anything, as ufw wasn’t enabled to begin with. No harm, no foul. However, running ‘ufw enable’ turns on ufw, and configures it in default mode – which denies all incoming connections. As the server in question provides many key network services for this distributed system, suddenly all of the other nodes in this system cannot reach these central services, and the entire distributed system starts to fall over!
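
For what it’s worth, a quick pre-flight check would have shown both that ufw was inactive and that there were already iptables rules loaded that were worth preserving – something like:

ufw status verbose     # would have reported "Status: inactive"
iptables -S            # dumps the rules actually loaded in the kernel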

So what went wrong here?

UFW is not a front-end for iptables

It may seem like UFW is a front-end for iptables – because you run ufw commands and it makes changes to iptables rules. But what it’s really doing is throwing away whatever is currently in iptables and replacing it with what’s been configured in ufw. A real front-end would let you inspect and modify the iptables rules that already exist.

How this plays into this situation is that lots of carefully-crafted iptables rules for access, masquerading, etc. were silently nuked when the ‘ufw enable’ command was run. UFW does not take the starting state of the system into account before it acts. That’s bad on a server.
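
If you do have to experiment on a box like this, at minimum snapshot the live rules first so you can put them back. A rough sketch (the backup path is arbitrary):

iptables-save > /root/iptables.before-ufw      # snapshot the live rules
# ...poke at ufw, or whatever else you need to try...
iptables-restore < /root/iptables.before-ufw   # roll back to the snapshot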

UFW’s defaults are appropriate for desktops, not servers

I love ‘deny all’ as a default for firewalls. I could even concede that ‘deny all’ is the best default for servers, although I think that could be debated. However, let’s look at some of the specifics that ufw’s defaults do allow:

  • ICMP (this is OK)
  • multicast DNS (hmmm….)
  • UPnP (!!)

OK, this should go without saying, but does it seem to you like your server operating system should enable UPnP discovery by default? Makes sense for a desktop (maybe) – but not for a server. No way, no how.
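
Don’t take my word for it – you can see these allowances for yourself in ufw’s shipped rule files (paths and exact contents vary a bit by Ubuntu release):

grep -i -A2 -E 'mdns|upnp' /etc/ufw/before.rules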

UFW is an Ubuntu-specific thing (and makes iptables complicated)

If you have a ‘firewall administrator’ on staff and you ask them to look at a server, how likely is it that they’re going to know anything about UFW? IMO, very unlikely. They’re probably familiar with iptables, so the first thing they’ll do is look at iptables directly – and they’re going to see dozens of rules that point at other rules, and they’re going to be very confused. At best, they’ll refuse to touch it and give you a big lecture about how if packets get to your server’s TCP/IP stack before hitting your firewall, you’re doing it wrong. At worst, they’ll start making changes, which – given the complexity of these rules – will likely result in more problems, not fewer.

Make your Ubuntu servers better

If you are going to run servers on Ubuntu, here’s one simple suggestion for how to make them better:

apt-get remove ufw

After this, at least your beleaguered sysadmins will have one less possible way to screw up your systems while doing their jobs.


Is Ubuntu a good server OS?

My “Openstack IRL” presentation informs the audience that we at Cloudscaling use Ubuntu for the systems we deploy. When I present this live and we get to that slide, I usually say something like this:

We use Ubuntu for our systems. This is somewhat ironic because at least once a month in our office we have a big discussion about how terrible Ubuntu is as a server operating system…

Funny. But is it true? Is Ubuntu a terrible operating system for servers? Let’s look at one data point: Upstart.

Upstart’s raisons d’être

My distillation of the reasons upstart was created to replace the traditional rcX.d init structure from SysV and BSD is:

  1. the traditional system is serial rather than parallel, meaning reboots take longer than they have to – and people reboot their systems a lot more these days
  2. the traditional system doesn’t deal well with new hardware being plugged in and out on a regular basis – and people are constantly plugging stuff in and out of their systems these days

Do those sound like conditions that affect your servers? Me neither. They are desktop-centric concerns. And there’s nothing wrong with that – unless you’re trying to run a server.

Why does it matter?

From the perspective of a crotchety old-time unix sysadmin (a hypothetical one of course!), upstart is a PITA. Let me try to illustrate why:

Checking what order stuff starts in

In the traditional world, here’s what you have to do to find out what order things start in:

ls /etc/rc2.d

That’s it. The output of that command provides you with everything you need to see at a glance that (for instance) apache is going to start after syslog.
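
For instance, the listing might look something like this (a made-up example – exact names and numbers depend on what’s installed):

README       S10rsyslog   S20apache2   S20ssh   S99rc.local

One glance at the numbers tells you rsyslog (S10) comes up before apache (S20).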

Here’s how you do it in the upstart world:

Well, I wish I could give you a simple way to do that, but you can’t. You have to open up the conf file in /etc/init for the service you’re interested in and look at what events it depends on for starting. If one of those events is the startup of another service, then you know your service will start after it. However, if there is no dependency listed on another service, then you don’t know what order they will start in. Yours might start first, the other one might start first, or they may both be starting up at the same time. You don’t know, and it isn’t guaranteed that they will start in the same order every time the system boots. This makes crotchety old unix sysadmins nervous, and leads to the second point….
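
The closest thing to a one-liner I know of is grepping the start conditions out of all the job files at once – it puts everything on one screen, even though it still doesn’t tell you the actual order:

grep '^start on' /etc/init/*.conf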

Defining what order stuff starts in

In the traditional world, this is done with two-digit numbers. Each script has a two-digit number as part of its filename in /etc/rcX.d, and the scripts are run in the order of those numbers. So if you want one script to start later than another, just change its number to be larger than the other one’s. Easy to understand, and all you have to know to do it is how to use mv. And there are no hard dependencies here – if you build a server that doesn’t contain a particular service, that init file won’t be installed, none of the other init files will be affected, and startup will go as you expect.
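
For example, assuming apache’s link happens to be named S20apache2 on your box (names vary), pushing it later in the boot order is just:

mv /etc/rc2.d/S20apache2 /etc/rc2.d/S95apache2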

In the upstart world, you do this by specifying startup dependencies between the jobs that start services. Each job emits an event when it completes, which you can reference in the conf files for other services. So say you have two services, s1 and s2, and you want to be sure s2 starts after s1. You do this by putting a line like this into /etc/init/s2.conf:

start on started s1

So aside from the crotchety old sysadmin spending 45 minutes perusing ‘man upstart’ to figure this out in the first place, the problem you run into here is with distributed systems that can be deployed in varied configurations. For example, sometimes s1 and s2 are on the same node, and sometimes they are on different nodes. If you put the above line into /etc/init/s2.conf by default, guess what happens if you deploy in a config where s1 isn’t on the same node? s2 will never ever ever start.
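
One workaround I’ve seen (and I’m not necessarily endorsing it) is to add an alternate trigger, so s2 still comes up on nodes where s1 doesn’t exist – at the cost of giving up any ordering guarantee when both conditions can fire:

start on (started s1 or runlevel [2345])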

Summary

My take on this is that upstart is a great thing for desktop systems. For server systems, it’s adding a bunch of complexity and brittleness without providing any actual benefits. And it’s one check mark in the “ubuntu isn’t a good OS for servers” category.


Openstack LA meetup

Last week I took a little trip down to LA to attend the #OSLAX meetup. I’m a little ashamed to admit that I was pleasantly surprised to find a big group of knowledgeable, friendly, and fun Openstack enthusiasts. (Us NorCal types are kinda snobby that way). I had the opportunity to present my talk on “Openstack IRL,” which seemed to be very well received. The video of my talk is up on vimeo if you’re interested in hearing the actual presentation, or you can see the slides on Slideshare. The message was the same as when I presented to #lspe earlier this year – you can build a production cloud using Openstack (we’ve done it!). This time around I was able to provide a little more detail about some of the specifics of what Cloudscaling is doing in our Open Cloud System, especially around resiliency for internal Openstack components, networking, and AWS compatibility.

At about 9:44 in the presentation, I’m talking about how excited network engineers get when they can build a CLOS network that provides loads of bisectional bandwidth. I have been assured multiple times that what everyone is actually hearing is “bisexual bandwidth” – I’m not sure exactly what that is but it’s probably more interesting than a network diagram. Check it out for yourself and see what you think I’m saying….



Divergent, Convergent, and Congruent Infrastructures

There’s a very powerful concept that informs most of my thinking about system configuration management: the concept of divergent vs convergent vs congruent infrastructures. I first encountered this concept in Steve Traugott’s seminal paper Why Order Matters: Turing Equivalence in Systems Administration. I highly recommend reading the entire paper if you haven’t before – or even reviewing it if it’s been a while. Steve and Mark Burgess (of cfengine fame) are the two people who IMO brought the first real rigor to analysis of systems configuration management – their stuff is required reading for anyone in this space. Here’s the idea:

Divergent Infrastructure
When multiple systems are managed by hand by individual (well-meaning) sysadmins in a “classical” IT shop, they invariably get out of sync over time. This is known as “configuration drift” and is impossible to prevent if any changes to the systems are allowed at all. Many IT shops go through a divergent phase after initial installation until the drift is recognized as a problem, which then leads to an attempt to create:
Convergent Infrastructure
After configuration drift is recognized as a problem, people attempt to bring the systems back to a common standard. passwd files may be unified, syslog configs may be made identical, etc. During this period the configurations “converge” towards a common state – however, it is very difficult (maybe even impossible) to reach a point where once-divergent systems are truly identical. What you hope for is an asymptotic approach to identical configuration. Contrast this with:
Congruent Infrastructure
A different way of managing configurations is to enforce that all systems are configured identically, so that configuration state remains the same between systems over time. This can be done in multiple ways and in my mind is one of the core goals of any system configuration management system.

This is illustrated in the following graph [1]:
[Graph showing system divergence, convergence, and congruence]

In the beginning, the state of the systems is getting more and more different – they are diverging. Then someone starts applying standards, templates, etc., and the configurations begin to converge. Finally, a congruence mechanism is implemented and the systems are held in an identical state throughout the congruent phase.
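
To make “held in an identical state” concrete, here’s a trivial sketch of congruent-style enforcement in shell – the golden-copy path and the ntp example are made up, and in practice you’d let cfengine/chef/puppet do this for you:

#!/bin/sh
# Enforce that the local config exactly matches the authoritative copy,
# and only restart the service when something actually changed.
# /srv/golden/ntp.conf is a hypothetical "golden" master copy.
if ! cmp -s /srv/golden/ntp.conf /etc/ntp.conf; then
    cp /srv/golden/ntp.conf /etc/ntp.conf
    service ntp restart
fi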

Much more on this to come – this is just setting the stage….

[1] That graph was generated with the neat-o stuff over at Google Charts. Mostly, anyway – I had to load it into Skitch to put the text annotations and arrows on. Google Charts allows annotations of datapoints, but I couldn’t find a good way to get it to identify a particular range of points and annotate those as a whole.


Incident Management vs Problem Management

Matt Groeninger over at disruptive-it.com wrote a very thought-provoking blog post about the Skype outage from a while back and how it demonstrates that sometimes you need to understand root cause in order to restore service during incidents in complex systems. I suggest you read his post, and I completely agree with his point (as I said in a comment there). However, it also got me thinking about ITIL, Incident Management, Problem Management, etc. Matt says:

the Skype outage demonstrates that some attempts to restore services will require communication/analysis from your problem management group.

There is a tendency among ITIL practitioners to take the functional descriptions in ITIL (Problem Management, Change Management, Incident Management, etc.) and formalize them into organizational structure, and I see a little of that here. Root cause analysis is not some forbidden ritual that is the purview of only one particular team or role. So while I agree with his larger point that to resolve incidents you need to understand causes, I don’t agree that this means you need to involve a problem management group in the incident. In fact….

Incident Management often requires analysis of cause

You may have an Incident Management team, but even if you don’t, certainly someone is providing that function, and they are responsible for restoring service when there is an incident. But in order to restore service, you often need to at least theorize about what is wrong before you can fix it. This is a standard part of any troubleshooting methodology. Take a simplified example of a single overloaded server. To fix it, you need to speculate on what could be causing the load and do some investigation based on that speculation. For instance, maybe there’s a hung process chewing up all the resources – to test that, you do a ps or top and look for something using all the CPU. If that’s not it, maybe the disks are overloaded, so you do an iostat and look at wait times. Maybe that’s not it either, so you wonder whether an external resource is slow and look at connection pools. The point is that in any complex system, the “restoration” role is going to involve the rapid creation, testing, and discarding of theories about cause. These theories may not necessarily be about root cause, but they are certainly going to be about proximate cause and potentially a handful of contributing causes. In fact, I would say that nearly all interesting problems cannot be solved without gaining at least some idea of their causes – and any that can should have their solution automated anyway, so no one has to be bothered with them in the future.
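
For the overloaded-server example, that rapid-fire testing of theories often boils down to a handful of standard commands, something like:

ps aux --sort=-%cpu | head     # is one process eating all the CPU?
iostat -x 5 2                  # are the disks saturated? (check await / %util)
ss -s                          # socket summary - a rough read on connection pools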

Only SOME causes should be considered during restoration

What you should not do in any service restoration effort is do any investigation that won’t directly help you solve the problem. As an example, during an incident you often find that some resource is not in the state it should be in – it is very tempting in those moments to ask “Why isn’t this in the state it should be in?” and start looking into what changes happened when, who did them, etc. This temptation must be resisted to keep the focus on restoring the service, as the fix is likely to be simply putting the resource back in the state it should be in, without needing to know who did what when to get it in the bad state. The proper time to look into those particular causes is during Problem Management after the incident is past.

Successful Incident Management requires either smart tools or smart people (ideally you have both)

If you have a complex system that requires high availability, you cannot get by with ticket monkeys and runbooks. (Unfortunately, many people have tried…) You need a combination of smart, experienced, knowledgeable people and useful tools that can analyze and describe the systems involved. If you have great tools you can get by with people who have less knowledge and less experience. If you have great people, you can get by with fewer and/or crappier tools. Ideally however, you have both.

If you want great tools, the best way I know of to get them is to have your smartest, most experienced, most knowledgeable people be intimately involved in their creation. The tools will empower your Incident Management team, and your Problem Management team should have as one of its primary charters the continual evaluation and improvement of those tools.