Will your disaster recovery plan work?


This article can also be found in the Premium Editorial Download "Storage magazine: Better disaster recovery testing techniques."

Download it now to read this article plus other related content.

Business disruption
Beyond cost is the business disruption caused by DR testing. Organizations are constantly playing with their DR budgets to decide how much money to allocate for DR and how much money to devote to testing their DR infrastructure. It's a catch-22 situation: "[Many] companies can't afford to take down their systems to test failover," says Stephanie Balaouras, senior analyst in the enterprise computing and networking decision service at Boston-based Yankee Group.

For companies looking for shortcuts that will reduce the pain of DR testing, XOsoft introduced Assured Recovery this past spring. The software promises to perform a full test of data and application recoverability on either a regular or ad-hoc basis without disrupting the production system. Assured Recovery basically spools data changes to a separate file while it validates if the backed up system will work in a recovery situation.

"The nice thing about XOsoft is that you don't have to take down your application, but you can still validate that the data you're relying on for DR is consistent, recoverable and restartable," says Balaouras. Although it will test the recoverability of the data, that alone, unfortunately, doesn't constitute full DR testing. "It doesn't replace testing people and procedures and all the interdependencies," she notes.

Denny's Inc., the Spartanburg, SC-based restaurant chain,

Requires Free Membership to View

is typical of companies that test their open-systems DR plans once a year, suggests TheInfoPro's Male. To keep costs reasonable, the company focuses on testing the recovery of the IT infrastructure only. "This is not the same thing as testing what the business would need for full business continuity," says Kurt Hazel, senior systems administrator at Denny's.

The company basically tests the backup process. "We make sure the facility has the correct systems to bring back the application and the data. We're not moving people around to test business recovery," Hazel says. Denny's sends a system administrator and a couple of business users to its Atlanta hot site for three days. The first two days are usually spent checking out system configurations and logs. The actual application testing--payroll, financials and ERP system--takes only one day.

Disaster recovery testing tips
The following tips were compiled by Mike Karp, senior storage analyst at Enterprise Management Associates Inc., Boulder, CO.
  • Assume that your best people won't be available; test the plan with people who have never seen the plan before.
  • Use role playing among senior management to test scenarios that might trigger the recovery plan--it's a major decision to initiate a recovery operation.
  • Test how to switch back into full production mode--at some point you'll need to come back and synch up.
  • Continuously document; one undocumented change in production can cause a disaster recovery (DR) effort to fail.
  • Measure DR tests in terms of continuous learning--if you aren't finding problems, you aren't testing hard enough.
Although Denny's testing has passed inspection by its auditors, the company is concerned that its current process isn't adequate for business recovery. "The three-day recovery process is not fast enough," says Hazel. The company is working with its hot site provider on a new process that will run 24 hours the first day, rather than a standard eight-hour business day. That will allow Denny's to chop a full day or more off the recovery testing process, although the cost for the first day will be higher.

Business function testing
Just getting a backup tape to load doesn't qualify as DR testing at a major telecommunications company. "For us, doing a test means we can do the complete business function," the firm's storage manager says. That typically entails testing dozens, if not hundreds, of applications that are sometimes interdependent with other applications and dependent on various types of systems.

Given the difficulty of actual business function testing, the telecommunications company has adopted a three-tier test strategy. The first tier is to prove that the disaster team can recover a local copy of the database within the agreed-upon recovery time. "We don't consider it DR, but it is something. You'd be surprised how many operational challenges we encounter just doing this much," the manager reports. The second tier is the same as the first, except the team is now trying to recover the database to a remote location. "When the DBA says 'Yes, the data is here,' then we know we can get the data over and loaded in one piece," the manager explains.

The third tier is a three-day recovery exercise of the vertical business function where a disaster team visits the hot site, loads tapes and runs the application. "We're talking about hundreds of gigabytes of data, sometimes even a terabyte, so we can spend a whole day just waiting for the tapes to load," he explains. It's not unusual that a glitch in the tape loading forces a complete reload. "By that point, we end up with maybe a few hours on the last day to run the application and the data," he explains.

But even all this third-tier testing doesn't amount to a true business function test. The team still hasn't put a user load on the application or tried to enter new transactions. "To do that, all the data sources would have to be remapped, and we haven't the time and resources to do that," the storage manager says. The third tier has proven so challenging that the telecommunications company is delaying the requirement for full tier-three testing until next year. The cost will run into the tens of millions of dollars, but that's still less than the value of the loss of just a few days' billing.

Without testing, managers can't assume with any confidence that their recovery strategies will work. Still, what managers mainly learn in the tests is how many IT infrastructure points of failure they have. Few have even begun to test the human elements of disaster recovery.

This was first published in October 2005

There are Comments. Add yours.

TIP: Want to include a code block in your comment? Use <pre> or <code> tags around the desired text. Ex: <code>insert code</code>

REGISTER or login:

Forgot Password?
By submitting you agree to receive email from TechTarget and its partners. If you reside outside of the United States, you consent to having your personal data transferred to and processed in the United States. Privacy
Sort by: OldestNewest

Forgot Password?

No problem! Submit your e-mail address below. We'll send you an email containing your password.

Your password has been sent to: