Published: 07 Mar 2006
There probably isn't a backup administrator out there who hasn't heard--or maybe even said--the old line about how backups always work, but restores might be a problem. If it didn't have such a resounding ring of truth to it, it'd be funny.
Think about how often you actually test whether backed-up data can really be restored. Weekly? Once a month? Once in a while? Never? Today's backup apps are sophisticated and can provide reasonable reassurance that a backup was completed as planned, but they simply can't account for every variable. And they won't tell you for certain that your restores will actually work.
The list of things that could go wrong with a backup is long: orphaned servers, open files, media flaws, and on and on. Even after what appears to be a successful backup, heat, moisture or just clumsy handling can damage tapes. And a lot of these backup snafus can go unnoticed--that is, until you need to get to that backed-up data.
Disk-based backup methods will improve recoverability odds, but they're far from perfect and don't eliminate restoration uncertainty completely.
The answer is simple: Do more testing. Without testing, you'll never know if your data-protection system is doing any protecting at all. For all intents and purposes, if a restore fails then the backup was a failure, too.
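For the simplest case, a restore test can be as modest as restoring a directory to a scratch location and comparing it with the original, file by file. The following is a minimal sketch, not a feature of any particular backup product: the names `file_digest` and `verify_restore` are hypothetical, and it assumes the source files haven't changed since the backup was taken.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """List files that are missing from, or differ in, the restored copy.

    Hypothetical helper: assumes source_dir is unchanged since the backup.
    """
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file():
            problems.append(f"missing: {rel.as_posix()}")
        elif file_digest(src) != file_digest(restored):
            problems.append(f"differs: {rel.as_posix()}")
    return problems
```

An empty list means every source file came back intact; anything else is a restore failure worth chasing down before the data is really needed.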
Nobody likes testing; it's not much fun, it takes time and it can punch holes in what was thought to be a carefully planned system. But the alternatives are worse. It might be a disaster recovery (DR) plan that breaks down at a critical juncture or a compliance system that shuffled your data off to who knows where.
Testing reduces risk, so it's directly tied to the effectiveness of the application. If your firm can afford to lose a few days of data, then testing restores occasionally will probably be fine. But if that window of risk shrinks to hours, or even minutes, you need more than a "reasonable assurance" that your data is restorable.
The same goes for other applications. Consider all the moving parts in an information lifecycle management (ILM) implementation, like data classification, storage tiers and data migrations, to name a few. Skimp on testing any single ILM component and the whole thing could tumble like a house of cards. If you don't know for sure that something's working right, it probably isn't.
Testing usually isn't just a pass-or-fail thing like that required Greek literature course you had to take in college. It's about weighing risks and then determining if a system can meet expectations. Before you can test routine backups, for instance, you first have to determine how much data you can afford to lose. Ditto DR--getting the whole company back up and running after a major disaster is probably just a pipe dream, but a more realistic goal of getting key systems back on track is probably doable and much easier to test.
Besides setting a reasonable scope for the testing process, another important factor is learning how to fail or, more accurately, learning from failure. A failure during a test is actually an indication of a successful test. The process that flunked the test is likely to get the attention it needs so it won't break down when it's really needed.
I'm preaching to the choir here--IT pros know all about testing. Still, the dark little secret is that many cut corners for expediency or to save some bucks. Who wants to tell their CEO, "We saved a bundle on testing, but the system didn't work"?
Smart storage managers don't bite on low acquisition costs because they know total cost of ownership (TCO) is the name of the game. Testing must be considered part of TCO. You may have to scrap your way through some budgetary battles to justify the "testing" line item, but win those battles and you're more likely to win the war. And vendors need to step up to the plate, too, with tools that can automate some of the testing process. They've made some progress--XOsoft's Assured Recovery comes to mind--but there's plenty of room for improvement.