Disaster recovery: Test, test and test some more

This article can also be found in the Premium Editorial Download "Storage magazine: Upgrade path bumpy for major backup app."

DR testing fundamentals

  • Disaster recovery (DR) testing isn't about pass and fail. It's about exercising and rehearsing the DR plan to reveal shortcomings and weaknesses.
  • DR plans aren't static; they need to be regularly reviewed and adjusted to reflect changes in the enterprise and business.
  • Every recovery step needs to be painstakingly documented and tested. Seemingly minuscule details can bring the recovery effort to a standstill.
  • People are key, and everyone needs to know their role.
  • Be sure to have a designated DR coordinator and an established chain of command. The ability to make fast decisions during a disaster is crucial for a rapid and successful recovery.
  • Always have a plan A, B and C.

What to test?
Only a limited number of services and applications can be realistically included in a DR test plan. Business criticality of services, exposure to loss, risk tolerance, and an assessment of threats and vulnerabilities determine what needs to be included, in what order of priority, and how quickly and to what point in time services need to be restored, as defined by recovery time objectives (RTOs) and recovery point objectives (RPOs). DR planning and testing is a continual balancing act among costs, budget and the potential business loss if a disaster occurs.
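As a rough illustration of how RTOs and RPOs turn a rehearsal into something measurable, the sketch below compares recovery times and data loss observed during a test against the stated objectives. The class names, field names and service names are hypothetical and chosen only for this example; it is a minimal sketch, not a description of any particular DR tool.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Objective:
    service: str
    rto: timedelta  # maximum tolerable time to restore the service
    rpo: timedelta  # maximum tolerable window of lost data


@dataclass
class RehearsalResult:
    service: str
    recovery_time: timedelta  # time actually needed during the rehearsal
    data_loss: timedelta      # age of the newest data that could be recovered


def find_gaps(objectives, results):
    """Return (service, problem) pairs where the rehearsal fell short."""
    measured = {r.service: r for r in results}
    gaps = []
    for obj in objectives:
        res = measured.get(obj.service)
        if res is None:
            # A service with objectives but no measurement was never tested.
            gaps.append((obj.service, "not exercised"))
            continue
        if res.recovery_time > obj.rto:
            gaps.append((obj.service, "RTO exceeded"))
        if res.data_loss > obj.rpo:
            gaps.append((obj.service, "RPO exceeded"))
    return gaps


# Hypothetical objectives and rehearsal measurements.
objectives = [
    Objective("email", rto=timedelta(hours=4), rpo=timedelta(hours=1)),
    Objective("payroll", rto=timedelta(hours=24), rpo=timedelta(hours=12)),
]
results = [
    RehearsalResult("email", recovery_time=timedelta(hours=6),
                    data_loss=timedelta(minutes=30)),
]
gaps = find_gaps(objectives, results)
```

A list of gaps like this fits the idea that a DR test is less about passing or failing than about revealing where the plan falls short.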

While prioritization identifies mission-critical applications and services, seemingly low-priority services that high-priority apps depend on are sometimes excluded from the DR test. With IT services and infrastructure tightly intertwined, the risk of neglecting low-priority dependent services is high. This is exactly what happened to Jim Burgard, assistant vice chancellor for university computing and communication at the University of New Orleans (UNO). The university's inability to get to its backup tapes after Hurricane Katrina prevented Burgard from recovering Active Directory, requiring a total Active Directory rebuild from scratch.
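One way to keep supporting services from slipping out of scope is to derive the test scope from a dependency map rather than from the priority list alone. The sketch below assumes such a map exists; the service names and the `depends_on` structure are hypothetical and used only for illustration.

```python
def dr_test_scope(critical_services, depends_on):
    """Return every service that must be recoverable for the critical ones.

    depends_on maps a service name to the services it needs in order to run.
    The walk follows dependencies of dependencies and tolerates cycles.
    """
    scope = set()
    pending = list(critical_services)
    while pending:
        service = pending.pop()
        if service in scope:
            continue  # already visited; also guards against circular maps
        scope.add(service)
        pending.extend(depends_on.get(service, ()))
    return scope


# Hypothetical dependency map for illustration.
depends_on = {
    "email": ["active_directory", "dns"],
    "student_records": ["database", "active_directory"],
    "database": ["storage_array"],
}
scope = dr_test_scope(["email"], depends_on)
```

In this example the directory service lands in scope because email depends on it, even though it was never listed as mission-critical on its own.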

Because IT services are interdependent, it's important to consider and exercise every aspect that affects the recovery of mission-critical services, including the following areas:

  • The network
  • Data/storage
  • Applications
  • The data center
  • Communication systems

There isn't a single "right" way to perform DR testing because it depends on the specific situation, defined priorities and the DR plan at hand. The level of redundancy in place has a huge impact on the DR exercise. For instance, the effort to rehearse failing over to a continuously updated redundant storage array in a secondary data center is relatively simple; in contrast, having no secondary array to fail over to requires restoring terabytes of data and rehearsing the loss of the data center itself. The DR testing efforts and costs associated with the two scenarios differ greatly, and companies need to do a thorough analysis before deciding whether to invest in redundancy or to pour money into a more elaborate rehearsal. The real payoff of redundancy comes into play when an actual disaster strikes. Loss of business productivity, caused by long recovery times, can easily exceed the cost of redundancy.
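The trade-off between paying for redundancy and absorbing a longer recovery can be framed as simple arithmetic. The sketch below is a back-of-the-envelope model under stated assumptions: a single yearly outage probability, a flat hourly cost of lost productivity, and illustrative figures that are not taken from any real company. A real analysis would weigh many more factors.

```python
def expected_annual_loss(outage_probability, recovery_hours, loss_per_hour):
    """Expected yearly productivity loss from one class of outage."""
    return outage_probability * recovery_hours * loss_per_hour


def redundancy_pays_off(annual_redundancy_cost, outage_probability,
                        hours_without, hours_with, loss_per_hour):
    """True if the loss avoided by faster recovery exceeds the yearly cost."""
    avoided = (
        expected_annual_loss(outage_probability, hours_without, loss_per_hour)
        - expected_annual_loss(outage_probability, hours_with, loss_per_hour)
    )
    return avoided > annual_redundancy_cost


# Illustrative assumptions: 5% yearly chance of losing the primary array,
# 72 hours to restore from tape versus 2 hours to fail over,
# and $50,000 of lost productivity per hour of downtime.
worth_it = redundancy_pays_off(
    annual_redundancy_cost=100_000,
    outage_probability=0.05,
    hours_without=72,
    hours_with=2,
    loss_per_hour=50_000,
)
```

Under these assumed numbers the avoided loss is roughly $175,000 a year, so a redundant array costing $100,000 a year would pay for itself; doubling that cost would reverse the answer.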

The ability to recover depends on accurate documentation, and a DR test needs to execute the documented recovery steps meticulously. DR documentation needs to be updated continuously and reviewed periodically. "Whenever we procure or develop a new application or system, we review the DR requirements and update our DR plan and DR test procedures accordingly," reports Paul Stonchus, first vice president and data center manager at MidAmerica Bank in Clarendon Hills, IL.

The single greatest risk in keeping up with changes is the lack of a solid change management and verification process to ensure that changes are performed according to procedure. Change management tools like Finisar Corp.'s NetWisdom, Onaro Inc.'s SANscreen and Tek-Tools Inc.'s StorageProfiler, as well as the tools built into storage, network and system management suites, track changes in your environment. In addition to third-party tools and free tools such as Syslog, monitoring tools and element managers like Cisco Systems Inc.'s Cisco Device Manager also track changes.
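The verification idea can be illustrated independently of any product: compare what the DR documentation says about the environment with what is actually observed, and flag every difference for review. The sketch below assumes both views are available as simple dictionaries keyed by device name. That format, the device names and the attributes are hypothetical, and this is not how the tools named above work internally.

```python
def undocumented_changes(documented, observed):
    """Compare a documented inventory with an observed one.

    Both arguments map a device or system name to a dict of attributes.
    Returns a sorted list of (name, kind) pairs, where kind is
    "added", "removed" or "modified".
    """
    changes = []
    for name in sorted(set(documented) | set(observed)):
        if name not in observed:
            changes.append((name, "removed"))
        elif name not in documented:
            changes.append((name, "added"))
        elif documented[name] != observed[name]:
            changes.append((name, "modified"))
    return changes


# Hypothetical snapshots for illustration.
documented = {
    "san-switch-1": {"firmware": "3.1"},
    "mail-01": {"ip": "10.0.0.5"},
}
observed = {
    "san-switch-1": {"firmware": "3.2"},
    "web-02": {"ip": "10.0.0.9"},
}
drift = undocumented_changes(documented, observed)
```

Each entry in the result marks a spot where the DR plan and test procedures may no longer match the environment they are meant to recover.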

This was first published in September 2006
