I’m Jason Firth.
On Sunday evening, this very website you’re reading this post from went down. In fact, it was a critical emergency: not only did the server’s main drive fail, but there were no backups, no spare parts on hand, and no plan for recovery even if there had been a backup.
Have no fear, Nerdyman is here! To put on the cape and save the day!
First thing in the morning, I went out and grabbed a replacement drive from the store. I was able to limp the failing drive along long enough to make a full backup, and I finally figured out how to put that backup onto the new drive and make it boot. Hoorah! The day is saved!
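For the curious: the trick to limping a dying drive along is to copy it block by block and skip the sectors that refuse to read, instead of letting one bad read abort the whole copy. Purpose-built tools like GNU ddrescue do this properly; below is just a minimal Python sketch of the idea, with hypothetical device paths.

```python
# Minimal sketch of rescuing a failing drive: copy it block by block,
# skipping unreadable regions instead of aborting. Device paths are
# examples only -- a real rescue should use a tool like GNU ddrescue.
import os

BLOCK = 64 * 1024  # copy in 64 KiB chunks

def rescue_copy(src_path: str, dst_path: str) -> None:
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT, 0o644)
    offset, bad = 0, 0
    try:
        size = os.lseek(src, 0, os.SEEK_END)
        while offset < size:
            os.lseek(src, offset, os.SEEK_SET)
            os.lseek(dst, offset, os.SEEK_SET)
            try:
                chunk = os.read(src, min(BLOCK, size - offset))
                os.write(dst, chunk)
            except OSError:
                bad += 1  # unreadable region: leave a hole and move on
            offset += BLOCK
    finally:
        os.close(src)
        os.close(dst)
    print(f"done; skipped {bad} unreadable block(s)")

# rescue_copy("/dev/sdb", "/path/to/disk-image.img")  # hypothetical paths
```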
There’s a big problem.
Superheroes aren’t real. They exist in fantasy. The fact that the right part happened to exist in a store somewhere, the fact that I happened to figure out how to get my data back, and the fact that I happened to get the system back up from that backup before it was too late was all pure luck.
Let’s say this wasn’t a personal webserver with pictures of nerdy superheroes on it, but a business-critical server. It was 24 hours from failure to recovery. For a personal webserver with pictures of nerdy superheroes on it, that’s pretty decent turnaround time, but for a business-critical application it’s totally unacceptable, especially given that successfully repairing the server was never a sure thing.
In a business environment, you need to manage your risks. That means a few things: first, taking an inventory of every device you’re responsible for; second, asking “What happens if this fails?”; third, asking “How do I repair this when it does fail?”; fourth, asking “Do I have the hardware, backups, and recovery procedures in place so we can get this back up and running reliably and quickly?”; and finally, coming up with a plan to make sure you have the hardware, backups, and recovery procedures in place for everything you discovered in your audit.
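If you like keeping that audit in something more structured than a spreadsheet, here’s a minimal sketch of what one entry might look like. The record type, field names, and example device are my own invention for illustration, not any standard:

```python
# Hypothetical sketch of one entry in a risk-management audit.
# The fields mirror the questions above: what is it, what happens if
# it fails, how do we fix it, and are we actually ready to fix it?
from dataclasses import dataclass

@dataclass
class AssetRiskRecord:
    name: str                       # inventory: what is this device?
    failure_impact: str             # "What happens if this fails?"
    repair_procedure: str           # "How do I repair this when it does fail?"
    spares_on_hand: bool = False    # hardware ready?
    backups_in_place: bool = False  # backups ready?
    procedure_tested: bool = False  # recovery procedure proven to work?

    def ready(self) -> bool:
        return self.spares_on_hand and self.backups_in_place and self.procedure_tested

# Example entry (invented for illustration):
web_server = AssetRiskRecord(
    name="personal web server",
    failure_impact="site down until the drive is replaced and restored",
    repair_procedure="swap spare drive, restore latest backup, verify boot",
    spares_on_hand=True,
    backups_in_place=True,
    procedure_tested=False,
)

if not web_server.ready():
    print(f"{web_server.name}: recovery plan incomplete")
```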
When you manage your risks like that, there is no such thing as having to put on a cape, because you don’t need to make a miracle happen — you just need to follow your plan.
So what should I do for the future? To start with, I should have a spare drive on hand (and perhaps even a full spare computer if the system is really critical). Next, I should have three copies of my data: the live data itself, an online backup, and an offline backup (off-site, if you’re really paranoid). Third, I should have procedures in place to restore the backups when I need to. (It has happened in the past that backups were made for a machine but could only be restored onto that exact hardware, so massive amounts of time and effort went into putting on the cape and making the backups work on hardware they were never meant for.) Finally, I should test my procedures occasionally to ensure they’re still relevant, and review my site-wide audit regularly.
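That last point about testing deserves emphasis, because a backup you’ve never restored is a backup you only think you have. Here’s a hedged sketch of a restore drill: unpack the latest backup into a scratch directory and verify every file against the live copy by checksum. The paths and the tar-based backup format are assumptions for illustration only:

```python
# Hypothetical restore drill: unpack the latest backup into a scratch
# directory and confirm every file matches the live copy by checksum.
# Paths and the tar format are illustrative assumptions.
import hashlib
import tarfile
import tempfile
from pathlib import Path

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def restore_drill(backup_tar: Path, live_root: Path) -> bool:
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(backup_tar) as tar:
            tar.extractall(scratch)  # "restore" into the scratch area
        ok = True
        for restored in Path(scratch).rglob("*"):
            if restored.is_file():
                original = live_root / restored.relative_to(scratch)
                if not original.exists() or sha256(original) != sha256(restored):
                    print(f"MISMATCH: {restored.relative_to(scratch)}")
                    ok = False
        return ok

# if restore_drill(Path("backups/site-latest.tar.gz"), Path("/var/www")):
#     print("restore drill passed")
```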
One of the interesting things about proper risk management is that the benefits don’t only come directly from the plan document itself; the process of asking these questions changes the way you do things. Once you’ve gone through everything you have, inconsistencies become clear and the consequences of cutting corners become more obvious.
Risk management is an extremely boring task. It means sitting down, perhaps for weeks or months, and asking the same questions over and over again, but it pays off the first time something fails and you’re ready for it.
So I’ve been talking about IT infrastructure, but consider this: everything I’ve said applies to any equipment you’re responsible for. If you’re responsible for an instrument, shouldn’t you have spare parts, a backup of its configuration, the tools you’ll need, and a procedure for repairing it if need be?
Doing this risk management totally changes the character of daily maintenance. Doing the work up front means that when something happens, you put the cape away and just get on with the work you need to do (and if you’re a supervisor or a manager, it means your workers put the cape away and get on with the work they need to do). It costs less in the long run because you aren’t paying to rush parts in. It lowers mean time to repair because all the legwork is already done. And it means less downtime, because you aren’t hunting for parts, improvising procedures that have never been tried before, or figuring out how to deal with your lost data in the middle of a crisis.
Thanks for reading!