This article can also be found in the Premium Editorial Download "Storage magazine: Better disaster recovery testing techniques."
Are you confident your disaster recovery (DR) plan will work if a disaster strikes? When TheInfoPro Inc., a NY-based research network, posed that question to several hundred IT executives, the results weren't exactly reassuring. Only 55% of the managers surveyed were confident they could recover their open-systems data in an emergency. The rest were only somewhat confident or not confident their DR system would work.
[Sidebar: Losses incurred in a disaster]
No matter how many checklists a company makes and distributes, the number of disaster scenarios it considers or even how assiduously it backs up its data, managers can't be confident in a firm's ability to recover data if the systems haven't been tested thoroughly. "You have to test to see if your disaster recovery processes really work," says Michelle Zou, research analyst with the storage software team at IDC, Framingham, MA. "Not everybody does enough testing."
Testing is difficult. "It's a complicated process. You're talking about mission-critical applications that companies don't want to take down," Zou says. As a result, tests must be scheduled far in advance and, done right, will likely require the involvement of a large number of people. All of this drives up costs. "And what if the tests don't work?" Zou asks. The organization has to go back through the entire process to identify and fix the problems, then test again--which means more time, money and disruption.
Large mainframe IT shops can often offer a model for DR testing. MasterCard International Inc., for instance, has been honing its DR processes for 15 years and continues to refine them. The current testing plan calls for two major test exercises a year in April and October. Each exercise tests up to 40 of what MasterCard classifies as its Tier 1 systems to meet a corporate DR mandate of testing every Tier 1 system at least once a year.
[Sidebar: Disaster recovery testing costs]
Such mainframe-style DR testing is expensive, something only the largest companies, and those that require bulletproof DR, can afford. "Even small tests can cost $30,000 per test," reports Male. "Large tests can run $1 million a test." Direct costs include the time of the people involved, telecommunications costs, the cost of activating a hot site or another remote facility, and travel (see "Disaster recovery testing costs," this page).
Obviously, testing costs would seem trivial if you couldn't recover from a disaster in a timely way (see "Losses incurred in a disaster," this page). According to a recent Forrester Research study, companies with annual revenues of at least $1 million from an online business average losses of $8,000 per hour during a systems outage, which comes to $192,000 for each 24-hour period the site is down. An earlier study by Meta Group (now Gartner Inc.) found that unplanned downtime of critical systems could cost a large company as much as $1 million per hour due to lost revenue, reduced employee productivity and possible regulatory penalties. And the $1 million figure doesn't include the negative impact on the business' reputation. It's no wonder companies like MasterCard don't skimp on DR testing.
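The downtime figures cited above follow from simple per-hour arithmetic, sketched below. The rates are the ones the studies report; the function name is my own:

```python
def outage_loss(hourly_loss, hours_down):
    """Estimated loss during an outage: hourly rate times hours down."""
    return hourly_loss * hours_down

# Forrester figure: $8,000/hour for online businesses with >= $1M annual revenue
print(outage_loss(8_000, 24))      # $192,000 for a full day down
# Meta Group upper bound for a large company's critical systems
print(outage_loss(1_000_000, 1))   # up to $1 million in a single hour
```

Against numbers like these, even a $30,000 or $1 million test budget is easy to justify.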
Even organizations like Harvard University feel a compelling need for DR, although cost is a significant factor governing how well each of the university's various apps is protected. "We do primary infrastructure recovery for those applications willing to pay for it," says Ron Hawkins, senior technical architect at the Cambridge, MA-based institution. Not surprisingly, the only business units willing to sign up for Harvard's IT infrastructure recovery are those responsible for critical apps such as payroll, financial and data warehousing. "Everybody wants DR until they see the price," he notes.
Faced with widespread demand for less-costly DR, Hawkins has explored a variety of options, including VMware, snapshots and remote replication, in an effort to provide a less-expensive recovery service that would still be effective. He's also tried working with business units to juggle their recovery point objectives (RPOs) and recovery time objectives (RTOs) to come up with something less costly to implement, test and maintain (see "Disaster recovery testing tips").
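The RPO/RTO juggling Hawkins describes amounts to picking the cheapest DR option that still meets a business unit's recovery targets. A minimal sketch of that tradeoff, with entirely hypothetical options and costs (none of these figures come from Harvard):

```python
# Hypothetical DR options; objectives and annual costs are illustrative only.
options = [
    {"name": "hot site",           "rpo_hrs": 0.25, "rto_hrs": 1,  "annual_cost": 500_000},
    {"name": "remote replication", "rpo_hrs": 1,    "rto_hrs": 4,  "annual_cost": 150_000},
    {"name": "nightly snapshots",  "rpo_hrs": 24,   "rto_hrs": 48, "annual_cost": 30_000},
]

def cheapest_meeting(rpo_hrs, rto_hrs):
    """Least-cost option whose RPO and RTO both satisfy the targets."""
    ok = [o for o in options
          if o["rpo_hrs"] <= rpo_hrs and o["rto_hrs"] <= rto_hrs]
    return min(ok, key=lambda o: o["annual_cost"], default=None)

# Relaxing the RTO target from 4 hours to 48 hours drops a cost tier:
print(cheapest_meeting(24, 4)["name"])    # remote replication
print(cheapest_meeting(24, 48)["name"])   # nightly snapshots
```

The point of the exercise is visible in the last two lines: loosening a recovery objective can move a business unit to a far cheaper protection tier.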
The only alternative Harvard has come up with to sending a team to its hot site for several days is a colocation facility located off campus. "They'll sell us rack space and we can replicate some of our critical infrastructure systems there," says Hawkins. These systems include the e-mail hub and DNS service, and perhaps a half-dozen small, but critical, utility services that have to remain accessible to apps outside of Harvard even if all systems go down in a disaster. By replicating them to the colocation facility, they can be recovered nearly instantaneously.
This was first published in October 2005