This article can also be found in the Premium Editorial Download "Storage magazine: Lessons learned from creating and managing a scalable SAN."
The five most overlooked items in DR tests
The business-continuity plan
The business and functional aspects of executing and testing a DR plan are just as important as the underlying technologies. Perhaps even more important than testing the DR plan is testing the company's business-continuity plan (BCP). A BCP comprises the arrangements and procedures that enable a company to continue critical business functions despite a major interruption from a disaster event. For example, a BCP should document who can actually declare a disaster and define alternate modes of communication for staff, application owners, customers, partners and even stockholders. BCP testing on this level, which includes non-IT staff and other key executives, helps to identify potential process and policy flaws and, more importantly, provides a methodology to correct them.
I've seen BCPs that call for using corporate e-mail for critical communications during the recovery process--a system that's likely to be down during a disaster. A similar BCP blunder is relying on internal voice communications when corporate telephony and voicemail run on a voice over IP (VoIP) system located in the production data center. A good BCP should have a plan B--and even a plan C--for its most critical objectives (see "The five most overlooked items in DR tests," at right).
Obviously, staffing can't be overlooked. First and foremost, a good DR plan must be role-based, and should be detailed enough so that people with technical competencies and skills similar to those of the primary staff can fulfill the required activities. Don't assume that your top storage person, DBA or network administrator will be available during recovery from a disaster. So don't always assign your best staff to each DR test. Switch up your teams, replace a senior member with someone more junior, or delay an entire group during your next DR test and see what effect it has on the execution of the test.
Depending on the logistics and location of your DR site, it may make sense to permanently base some internal staff there so they'll be able to start the recovery processes immediately (obviously, this wouldn't be their only job). It may also be a cost-effective move to contract with a vendor to start the DR recovery process until regular staff can reach the remote site.
When to test
It's often difficult to determine how often to test. The answer comes down to benefit vs. cost, not just the cost of the test itself. Refer to your business-impact assessment and identify what the total cost would be if your most mission-critical applications went down. The downtime cost should include items such as potential regulatory penalties and reduced or lost employee productivity, as well as indirect costs such as the long-term effect of customer loss and damage to the corporate reputation.
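To make the benefit-vs.-cost comparison concrete, the downtime cost categories above can be sketched as a simple calculation. This is only an illustrative model; the category names and dollar figures are hypothetical placeholders, not data from any actual business-impact assessment.

```python
# Hypothetical sketch of a total-downtime-cost estimate. All figures are
# illustrative placeholders, not real business-impact assessment data.

def total_downtime_cost(hours_down, direct_costs_per_hour, one_time_costs):
    """Sum hourly direct costs over the outage, plus one-time indirect costs."""
    hourly = sum(direct_costs_per_hour.values()) * hours_down
    one_time = sum(one_time_costs.values())
    return hourly + one_time

direct_per_hour = {
    "lost_revenue": 20_000,          # transactions that can't be processed
    "lost_productivity": 5_000,      # idle or diverted employees
}
one_time = {
    "regulatory_penalties": 50_000,  # potential fines for missed obligations
    "customer_churn": 100_000,       # long-term effect of customer loss
}

# A 24-hour outage under these assumptions:
cost = total_downtime_cost(24, direct_per_hour, one_time)
print(cost)  # 24 * 25,000 + 150,000 = 750,000
```

A model like this makes it easy to see the break-even point: if a quarterly test costs far less than even a few hours of estimated downtime, more frequent testing pays for itself.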
Hopefully, the total cost of downtime ties directly to your recovery time objectives (RTOs) and recovery point objectives (RPOs). If some data center components can be down for a week with minimal overall business interruption, it probably doesn't make sense to test your DR plan every quarter; however, you may still want to validate subsets of your plan, such as recalling offsite DR tapes. On the other hand, if you measure downtime in hours rather than days, quarterly or semi-annual tests are probably best. Testing costs may seem significant, but weighed against your total DR investment and the cost of a poor or failed recovery, it's money well spent.
When I was an IT director, I found it ironic that DR tests were carefully scheduled and planned. While some disasters can be anticipated (a slow-moving hurricane headed your way), most DR events are unforeseen. So it's a good idea to execute small parts of your DR plan as unplanned activities for your staff. You may find a major kink or two in the plan's execution, or the plan itself, when the parties involved are caught off-guard. I've done this, asking my staff to immediately get me a detailed list of every offsite DR tape we'd need to recover if our data center became unavailable at that very moment. The good news was that they had the list, including the offsite locations of the tapes and associated tape shipment procedures. The bad news was that the list was stored, both electronically and on paper, in our data center!
This was first published in July 2006.