Recently I had the opportunity to create a template for infrastructure change requests at work. Based on the reaction from some of my co-workers, I thought it might be valuable to explain what change requests are for. In a subsequent post I’ll go through what a basic change request looks like.
Change Requests are part of the Change Management (CM) process. Now don’t get freaked out, that doesn’t mean we need forms filled out in triplicate sent through multiple people for review and approval. Processes can have as much or as little heft as required to meet the needs of your organization. But if your infrastructure’s availability is important to you, you should have a CM process. We are a small startup, so our CM process is very lightweight. Here are the main tenets:
- Think about a change before you start executing it
- If something is high-risk, test it before you do it for real
- Know how you’re going to handle it if something goes horribly wrong
- Record that you made the change so people can find it later if they need to (for example, when troubleshooting a problem)
Point 1 (think before you execute) is really philosophical. After many years of doing production web operations, I’m convinced based on the empirical evidence that you’re far more likely to screw something up if you just start cowboying your way through a change rather than planning it ahead of time. You see this point of view in other contexts as well (“plan your flight, fly your plan”). Many times when planning a change, I have thought of something new as I’m doing the planning that I would otherwise have encountered during execution – something that in the heat of the moment would have caused me great panic. Better to hit that and work through it when you’re not stressed out in the middle of a big production change. For me one of the most important parts of having a written Change Request is that it enforces thinking through a change before you execute it.
Point 2 (test high-risk changes) may sound obvious but there are certainly nuances. How do you determine what’s high-risk and where do you draw the line? How much time do you spend doing testing vs simply rolling back a change if it does cause problems? I’ve found that it’s best to leave these decisions in the hands of the people executing the changes – but your CM process needs to remind them to ask these questions, think about the answers, and use their best judgment.
Point 3 (how to handle problems) is not theoretical. If your job is web operations, you will be involved with a change that goes horribly wrong. It just happens. When it happens, if you have not thought about it ahead of time you will be up a smelly brown creek without a paddle. This is when panic sets in, and in the heat of those moments some spectacularly bad decisions can be made which could make the situation even worse. Spending some time prior to execution thinking through potential failure scenarios allows you to execute your rollback plan calmly and effectively. Which way do you prefer?
Point 4 (change recording) is absolutely critical unless you a) never forget anything and b) are the only person involved in the support of your infrastructure. In my experience, the majority of thorny production problems are caused by changes, usually when they introduce latent faults that don’t manifest as incidents for a while. When diagnosing such a problem, it is critical that you know what changed when, and that is precisely the purpose of change recording. There are a million ways to do this, from sending emails to a “changelog” alias or putting change summaries in IRC to having a CMDB with change records in it. Less important than the specific mechanism(*) is that you have a mechanism, that people use it religiously, that it’s easy to search for changes at particular times and to particular systems, and that everyone knows where to find it and how to use it. What seems like busywork when you’re performing a change (“Why do I have to write this down? It’s already done!”) will pay giant dividends when it prevents someone from spending tons of time reverse engineering what happened while the service is down.
(*) – Note: one thing you really should leverage is version control for your CM and recording processes – it’s invaluable for being able to track a sequence of changes and to easily pull back a previously working configuration.