Published: 12 Jul 2006
It's not enough to have a DR plan--you need to know it will work. Repeated, detailed tests will tell you if your DR plan is up to snuff.
We've had more than our share of disasters--hurricanes, tsunamis, floods, tornadoes, earthquakes, acts of terrorism and blackouts--but the good news is that most companies are now focused on dealing with a disaster and have a disaster recovery (DR) plan in place. Companies are paying more attention to the next level, which includes keeping DR plans current and periodically testing them. While this represents a dramatic improvement in disaster response planning, testing is the part of the equation that's most often delayed, fails at some level or doesn't properly mirror the response activities during a real disaster.
|Key lessons learned in DR testing|
That said, it's time to take a detailed look at DR testing and the specific steps to take to ensure that your DR plan will perform as expected. A solid DR strategy isn't a small-ticket item. However, it should be treated as fundamental, must-have insurance to safeguard the company's information assets. There are two key aspects to a DR plan: nontechnical details, which focus on the people, policy, process and procedures of DR testing; and technical details (see "Key lessons learned in DR testing," at right).
|The five most overlooked items in DR tests|
The business-continuity plan
The business and functional aspects of executing and testing a DR plan are just as important as the underlying technologies. Perhaps even more important than testing the DR plan is testing the company's business-continuity plan (BCP). A BCP comprises the arrangements and procedures that enable a company to continue critical business functions despite a major interruption from a disaster event. For example, a BCP should document who can actually declare a disaster and define alternate modes of communication for staff, application owners, customers, partners and even stockholders. BCP testing on this level, which includes non-IT staff and other key executives, helps to identify potential process and policy flaws and, more importantly, provides a methodology to correct them.
I've seen BCPs that call for using corporate e-mail for critical communications during the recovery process--a system that's likely to be down during a disaster. A similar BCP blunder would be relying on internal voice communications when the corporate telephony and voicemail run on a voice over IP (VoIP) system located in the production data center. A good BCP should have a plan B--and even a plan C--for its most critical objectives (see "The five most overlooked items in DR tests," at right).
Obviously, staffing can't be overlooked. First and foremost, a good DR plan must be role-based, and should be detailed enough so that people with technical competencies and skills similar to those of the primary staff can fulfill the required activities. Don't assume that your top storage person, DBA or network administrator will be available during recovery from a disaster. So don't always assign your best staff to each DR test. Switch up your teams, replace a senior member with someone more junior, or delay an entire group during your next DR test and see what effect it has on the execution of the test.
Depending on the logistics and location of your DR site, it may make sense to permanently base some internal staff there so they'll be able to start the recovery processes immediately (obviously, this wouldn't be their only job). It may also be a cost-effective move to contract with a vendor to start the recovery process until regular staff can reach the remote site.
When to test
It's often difficult to determine how often to test. The answer depends on benefit vs. cost, not just the cost of the test itself. Refer to your business-impact assessment and identify what the total cost would be if your most mission-critical applications go down. The downtime cost should include items such as potential regulatory penalties and reduced/lost employee productivity, as well as indirect costs such as the long-term effect of customer loss and damage to the corporate reputation.
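To make the benefit vs. cost comparison concrete, here is a minimal sketch of the downtime arithmetic. The `DowntimeCost` class, its field names and every dollar figure are hypothetical placeholders, not numbers from any real business-impact assessment; substitute your own.

```python
from dataclasses import dataclass


@dataclass
class DowntimeCost:
    """Cost components for one application group (all figures hypothetical)."""
    productivity_loss_per_hour: float  # reduced/lost employee productivity
    regulatory_penalties: float        # one-time, if the outage triggers them
    customer_loss: float               # long-term effect, one-time estimate
    reputation_damage: float           # one-time estimate

    def total(self, hours_down: float) -> float:
        # Direct costs grow with outage length; indirect costs are flat estimates.
        direct = self.productivity_loss_per_hour * hours_down + self.regulatory_penalties
        indirect = self.customer_loss + self.reputation_damage
        return direct + indirect


order_processing = DowntimeCost(
    productivity_loss_per_hour=12_000.0,
    regulatory_penalties=50_000.0,
    customer_loss=200_000.0,
    reputation_damage=100_000.0,
)
outage_cost = order_processing.total(hours_down=8)
annual_test_cost = 4 * 25_000.0  # four tests a year at an assumed cost per test

print(f"Cost of an 8-hour outage: ${outage_cost:,.0f}")
print(f"Annual testing cost:      ${annual_test_cost:,.0f}")
```

Treating the indirect costs as flat estimates is a simplification; in practice they probably grow with the length of the outage as well.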
Hopefully, the total cost of downtime ties directly to your recovery time objectives (RTOs) and recovery point objectives (RPOs). If some data center components can be down for a week with minimal overall business interruption, it probably doesn't make sense to test your DR plan every quarter; however, you may still want to validate subsets of your plan, such as recalling offsite DR tapes. On the other hand, if you measure downtime in hours rather than days, quarterly or semi-annual tests are probably best. Testing costs may be significant, but weighed against your total DR investment and the cost of a poor or failed recovery, it's money well spent.
When I was an IT director, I found it ironic that DR tests were carefully scheduled and planned. While some disasters can be anticipated (a slow-moving hurricane headed your way), most DR events are unforeseen. So it's a good idea to execute small parts of your DR plan as unplanned activities for your staff. You may find a major kink or two in the plan's execution, or the plan itself, when the parties involved are caught off-guard. I've done this, asking my staff to immediately get me a detailed list of every offsite DR tape we'd need to recover if our data center became unavailable at that very moment. The good news was that they had the list, including the offsite locations of the tapes and associated tape shipment procedures. The bad news was that the list was stored, both electronically and on paper, in our data center!
We often hear that "We can't encroach on production" to do DR tests. While I wholeheartedly agree with this principle, with proper planning there's no reason that production needs to be affected.
In a best-case scenario, the DR site "becomes" the production site, with servers, networks and storage all taking similar or identical identities of the production platforms being replaced. However, there are some scenarios, most notably during testing, where production is still up and running while a recovery event or test is occurring. While much of this can be handled via networking and routing, sometimes the storage environment can't be completely isolated, especially if some type of real-time replication is being used. One way to alleviate this problem is to use snapshots of the replicated production data for DR testing, rather than the actual replicas. This will allow detailed testing to a specific RPO without the need to terminate or alter the background replication processes during the test. Some observers may suggest that this will require more storage at the DR site, but you'll probably have snapshots or mirrors at your DR site if you're replicating, especially to protect from rolling disasters. Use a subset of those copies for your DR testing.
There are times when you may need to recover from a DR event when production is still up. One scenario is a bomb threat or any other event that puts the entire data center at risk--even if it's still up and running. One such event I can recall was the "Chicago Flood" of 1992. This event was caused by a piling driven into the Chicago River bottom, which caused a leak in one of Chicago's underground freight tunnels. The rush of water spread through much of the system's miles of tunnels, flooding sub-basements and disrupting utility service throughout "the Loop." When my Chicago-based company was notified that our sub-basement might be affected by this flooding, we immediately declared a disaster. Throughout our entire recovery process, our primary data center was still up and running, as the pending flooding hadn't yet occurred. Due to the parallel "production environments," including live network links between them, our recovery team had to perform numerous workarounds for all networking and routing. If we had performed the recovery as planned, we could have caused downtime by broadcasting that the same hardware and apps were available in two distinct locations. Ultimately, our primary site wasn't affected, so this DR event was probably the best test we could have devised.
Other software approaches for addressing the production issue use specialized snapshots of production data in the data center while the testing occurs at the DR site. While this software is an additional expense, it may be worthwhile, depending on test frequency and the criticality of production data.
The nontechnical and technical aspects of recovering from a disaster are equally important and should be a critical part of every DR test. The details and technical aspects of DR tests will ultimately be the foundation for successful testing. If you take the best policy and procedure, and map them to poor technology and infrastructure, the results will be disappointing at best.
|Success criteria for initial DR tests|
Application consistency groups
Regardless of whether DR is based on real-time replication or tape, the concept of application consistency groups is critical. The BCP should have a documented RTO and RPO for each application group (e.g., SAP, order processing). These RTOs and RPOs cumulatively become the DR service-level agreement with the application owners and end users. It's rare for a critical application group to be hosted by a single server or within one large database. Therefore, a consistent recovery from an application group perspective is vital. Application grouping and categorization requires research and preparation from an architecture standpoint, and should be one of the primary drivers for all DR testing activities. If all components aren't recovered in a coherent manner, the application may not work at all--even if each individual component is "successfully" recovered (see "Success criteria for initial DR tests," at right).
To ensure your DR testing recovers the entire application group, the architecture and recovery methodology must consider these questions:
- If your recovery is based on tape, do all backups within a group complete at the same time?
- What about any updates to data within the application group that occur during the backup?
- How do the updates get incorporated?
- If your recovery is based on real-time storage replication, is all data replicated at the exact same time?
- If so, what if one of the applications within the groups falls behind?
- Is replication halted on the others until synchronization is re-established?
- What about middleware apps such as Tuxedo or MQSeries messaging queues, which don't readily lend themselves to replication?
Regardless of the recovery approach, you need to define your DBA's responsibilities within the consistency group. Note, for example, whether they rely on Oracle archive and redo logs, whether they perform hot or cold backups, whether the database is replicated via a combination of methods and how flat-file data is synchronized with database data.
Finally, you need to consider the truly heterogeneous application groups, which can include mainframe, open systems and even NAS platforms, all on different tiers of storage. It's not too difficult to replicate data on a single storage array, but it gets much more challenging when the data to be synchronized is spread across several different server and storage platforms.
So why ask these questions specific to DR testing? Because if your order-entry system has been recovered five hours prior to your warehouse and shipping systems, and you have thousands of dollars of inventory on the shipping dock with no customers attached to it, that's a big deal and potentially a huge expense. Data consistency and synchronization are bigger issues than just getting the data offsite. If the data isn't usable or if the recovery point becomes unreasonable, then you're putting your money in the wrong areas of DR.
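The five-hour gap between order entry and the warehouse and shipping systems can be checked mechanically once each component's recovery point is known. The sketch below is illustrative only: the component names, timestamps and the 15-minute tolerance are assumptions, and in a real test the recovery points would come from your backup catalog or replication software rather than a hand-built dictionary.

```python
from datetime import datetime, timedelta


def recovery_point_skew(recovery_points: dict[str, datetime]) -> timedelta:
    """Gap between the oldest and newest recovery point in one application group."""
    return max(recovery_points.values()) - min(recovery_points.values())


def is_consistent(recovery_points: dict[str, datetime], tolerance: timedelta) -> bool:
    """True if every component was recovered to within `tolerance` of the others."""
    return recovery_point_skew(recovery_points) <= tolerance


# Hypothetical recovery points recorded at the end of a DR test.
order_group = {
    "order_entry": datetime(2006, 7, 12, 1, 0),
    "warehouse": datetime(2006, 7, 12, 6, 0),
    "shipping": datetime(2006, 7, 12, 6, 0),
}

skew = recovery_point_skew(order_group)
consistent = is_consistent(order_group, tolerance=timedelta(minutes=15))
print(f"Recovery point skew: {skew}, consistent: {consistent}")
```

A check like this only measures timestamps; it can't tell you whether in-flight transactions were captured, which is why application owners still need to validate the recovered data.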
Think of your company's most mission-critical application. Chances are pretty good that there's a core system or application that has hundreds, or maybe even thousands, of data feeds going in and out during a 24-hour period. And if it's a 24/7/365 application, at what point do you stop to synchronize? Recurring DR testing will help to address and remediate these issues.
|When DR testing becomes routine|
Getting end users involved
Application group recovery is where the rubber really hits the road for DR testing, and it can be a costly endeavor for each DR test. I worked with a client whose primary success criterion was being able to place a new order, start to finish, on their proprietary, in-house application running fully in the DR site. That might have been that user's key success metric, but we also determined that equally important--to the tune of millions of dollars--was the ability to continue those in-process orders and ensure that all critical components within the application group, including all sales inputs, fulfillment, shipping, billing/invoicing and accounts receivable, were in synchronization in accordance with the documented RPO.
So a DR test must allow adequate time for the application owners and end users to participate (see "When DR testing becomes routine," at right). Time must also be allocated to work through remediation of the identified critical issues, which includes interfacing with the IT recovery team, as well as programmers, DBAs, application owners and end users. This may take several hours or days. Make the remediation action plan a critical part of the test--don't treat it as an afterthought.
DR testing tools
Storage managers often ask what tools are appropriate for DR testing. Because each storage environment is so specialized--even those within the same industries--rather than dwelling on specific tools to use, storage managers should ensure that their DR testing process and procedures are stringently followed and, more importantly, improved and enhanced after each test. There are numerous tools available to emulate a user load, simulate WAN traffic and inject transmit/receive errors, and to track changes in the production environment during a DR test.
The right tool for you depends on the maturity of your testing process. For your first or second DR test, you should focus more on recoverability, including application group synchronicity, than on simulating an extreme user load or benchmarking performance. But if you're into your third year of successful testing, simulating a full production load on your servers and storage at the DR site, or even testing the resiliency of your DR fabric via simulated switch or cable failures, might be the ideal next step.
A technical aspect of DR testing that's often neglected is backup/restore. When the DR site goes live, backups of the production data that was just restored also need to commence. While the backup configuration at the DR site initially doesn't need to be as comprehensive as production, the infrastructure, software policies/procedures and schedule need to be in place right away. At the conclusion of your next DR test, pick an application, run some "transactions," perform backup operations, delete a key database and perform a restore. Backup isn't easy--you may be surprised at the difficulty of getting this done successfully, especially on the restructured architecture at the DR site.
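The drill described above (run transactions, back up, delete, restore, verify) can be rehearsed in miniature. The following is a toy sketch only: it uses Python's built-in `sqlite3` module as a stand-in for a real database and backup product, and the `run_restore_drill` function, file names and table are all hypothetical.

```python
import os
import shutil
import sqlite3
import tempfile


def run_restore_drill(workdir: str) -> bool:
    """Run transactions, back up, delete the database, restore, and verify."""
    db_path = os.path.join(workdir, "orders.db")
    backup_path = os.path.join(workdir, "orders.bak")

    # 1. Run some "transactions" against the recovered application.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
    conn.executemany("INSERT INTO orders (item) VALUES (?)", [("widget",), ("gadget",)])
    conn.commit()
    expected = conn.execute("SELECT id, item FROM orders ORDER BY id").fetchall()

    # 2. Perform the backup.
    backup_conn = sqlite3.connect(backup_path)
    conn.backup(backup_conn)
    backup_conn.close()
    conn.close()

    # 3. Delete the key database.
    os.remove(db_path)

    # 4. Restore from the backup and confirm the transactions survived.
    shutil.copyfile(backup_path, db_path)
    restored = sqlite3.connect(db_path)
    actual = restored.execute("SELECT id, item FROM orders ORDER BY id").fetchall()
    restored.close()
    return actual == expected


with tempfile.TemporaryDirectory() as workdir:
    drill_passed = run_restore_drill(workdir)
    print("restore drill passed" if drill_passed else "restore drill FAILED")
```

The point of the verification step is that a restore isn't successful until the data has been compared with what was there before the deletion.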
|How to make your next DR test more real|
Another often overlooked element of comprehensive DR testing is documenting the steps and procedures to get back into the primary data center when it's viable to do so. If the primary data center is within the campus of corporate headquarters, there will likely be internal pressure to get back to that site if it's an option. While this may not need to be factored into the first few DR tests, be sure to consider what you would need to do to fail back to the primary site after running in the DR site for a period of time.
Eliminate the "gotchas"
If DR testing has become a mundane requisite for you and your staff, put a few more challenges into the equation, monitor the results and address what needs fixing (see "How to make your next DR test more real," at right). Plus, don't forget those other DR testing "gotchas" such as forgetting to add the same amount of storage to the DR environment when mission-critical apps acquire new production space, or forgetting that your automated DR scripts may no longer work after firmware, microcode or operating system upgrades. Make it part of your normal change control process to consider how changes might affect your DR plan.
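The storage "gotcha" lends itself to a simple check that could run as part of change control. This is a sketch under stated assumptions: the `storage_shortfalls` function, the application names and the capacity figures are hypothetical, and a real version would pull allocations from your storage management tools.

```python
def storage_shortfalls(production_gb: dict[str, int], dr_gb: dict[str, int]) -> dict[str, int]:
    """Return, per application, how many GB the DR site is short of production."""
    shortfalls = {}
    for app, prod_capacity in production_gb.items():
        # An application missing from the DR inventory counts as zero capacity.
        gap = prod_capacity - dr_gb.get(app, 0)
        if gap > 0:
            shortfalls[app] = gap
    return shortfalls


# Hypothetical allocations after a production storage expansion for SAP.
production = {"SAP": 2000, "order_processing": 800}
dr_site = {"SAP": 1500, "order_processing": 800}

gaps = storage_shortfalls(production, dr_site)
for app, gap in gaps.items():
    print(f"{app}: DR site is short by {gap} GB")
```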