This article can also be found in the Premium Editorial Download "Storage magazine: Tips for real-world disaster recovery planning."
Any wily IT veteran develops a keen sense for the gap between IT fantasy and reality. Best practices are often talked about as lofty ideals, but in the real world they tend to be the best we can do given current constraints. In a well-run shop, the gap between the ideal and the practical isn't that great for most functions.
When it comes to disaster recovery (DR), however, the reality gap can be alarmingly huge. The DR vision is a scenario in which all disasters are withstood; using a well-crafted plan, operations are transferred to a remote facility to get the organization back online within recovery time objective (RTO) and recovery point objective (RPO) targets. But this is pure fantasy for most companies. The reality is that if a disaster should occur, nothing short of Herculean efforts by the IT staff would be required to have the slightest chance of getting back online in any reasonable period of time, much less the targeted RTO. So, it's time for a reality check. Here are some reasons why your DR plan may fail.
- Business and IT aren't linked. DR is one component of a larger business recovery undertaking and, to be successful, it's necessary to understand all the requirements, drivers, related activities, interdependencies, contingencies and pitfalls associated with those other activities. But a recent survey sponsored by Veritas found that 76% of the companies studied left DR policy setting solely in the hands of IT. While DR itself is an IT-specific function, it has a supporting role to core business activities. As such, its focus must be tied to the overall business continuance effort, which includes ensuring that people and facilities are available and able to function from a business perspective.
- You don't have a DR plan. If there's an IT activity that cries out for teamwork, it's DR. The DR plan should be the playbook for all functional areas within IT prior to, during and after a disaster, and encompass applications, databases, networks, servers, clients and storage. Among its elements are the key contacts and owners for each activity, step-by-step recovery plans, validation tasks and activation processes. But most organizations fall short of this goal. The activities required in a DR situation are unfamiliar and will likely need to be carried out in adverse or chaotic circumstances. The lack of a comprehensive plan is a recipe for disaster--or an even worse disaster.
- Your DR plan isn't current. Two words: change control. DR plans become outdated almost immediately. Management of your DR plan must be integrated as a rigorously enforced part of the change control process. As new applications are brought online, their priority and impact with respect to DR should be considered. If you invest the time to develop a DR plan that classifies servers and applications, identifies interdependencies and documents recovery in detail, adding new elements may simply mean updating the appropriate set of forms and notifying the necessary groups.
- You don't test DR (or you don't test the right things). Let's face it--DR testing is a major pain for most IT shops. It's not only a major operational disruption performed just once or twice a year, but all too often it's treated as a pro forma exercise.
Many DR test plans lack true end-to-end testing. Recovery and testing should be done on an application basis, not simply per server. Complex apps have interdependent elements that run on multiple servers. Recovering operating systems and data is just the first step; the apps should then be recovered and tested. While it may be impractical, the ultimate DR test would be to run production from the DR location for a period of time and then switch back at some later date.
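One way to make application-level recovery concrete is to write down which components depend on which, and derive the recovery sequence from that. The sketch below is only an illustration: the component names and the dependency map are hypothetical, and a real application would have a larger, messier graph. It uses Python's standard graphlib module (Python 3.9 or later) to order components so that each is recovered only after everything it relies on.

```python
from graphlib import TopologicalSorter

# Hypothetical application: each component maps to the set of components
# it depends on. These names are illustrative, not from any real plan.
DEPENDENCIES = {
    "storage": set(),
    "database": {"storage"},
    "app_server": {"database"},
    "web_frontend": {"app_server"},
}


def recovery_order(dependencies):
    """Return components ordered so each follows everything it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Each step covers recovery and testing, not just restoring the server.
    for step, component in enumerate(recovery_order(DEPENDENCIES), start=1):
        print(f"{step}. recover and test {component}")
```

If the dependency map contains a cycle, TopologicalSorter raises a CycleError, which is itself a useful finding: it points to components that can't be brought up one at a time.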
Another problem is that DR testing isn't viewed as a quality improvement exercise, but as an exam. This can lead to counterproductive activities such as limiting recovery to "safe" components that aren't likely to be problematic. It should be assumed that some weaknesses or failures will occur. Finding process bugs is a good thing, so they can be corrected and avoided in the future.
- Your recovery goals are unrealistic. Often, organizations will establish RTO and RPO objectives, and even prioritize and classify servers and apps in accordance with the policies; but when DR capabilities are objectively examined, the goals are unattainable. For example, if you have recovery goals of less than a day, they can't be met if your DR facility is a cold site and you're relying on tape-based recovery. Realistic goals and metrics need to be established that reasonably estimate the time it takes to recover a server or to configure a storage or backup environment.
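A quick back-of-the-envelope calculation can show whether a stated RTO is even plausible. The sketch below assumes a deliberately simple serial model (ready the site, retrieve the tapes, then restore the data) and every figure in it is made up for illustration; substitute numbers measured in your own tests.

```python
def estimated_recovery_hours(data_gb, restore_gb_per_hour,
                             site_prep_hours, tape_retrieval_hours):
    """Rough serial estimate: prepare the site, fetch tapes, then restore."""
    restore_hours = data_gb / restore_gb_per_hour
    return site_prep_hours + tape_retrieval_hours + restore_hours


def meets_rto(estimate_hours, rto_hours):
    """True if the estimated recovery time fits within the RTO."""
    return estimate_hours <= rto_hours


if __name__ == "__main__":
    # Illustrative assumptions only: 2 TB to restore at 100 GB/hour,
    # a cold site needing 24 hours of setup, 4 hours to retrieve tapes.
    estimate = estimated_recovery_hours(
        data_gb=2000, restore_gb_per_hour=100,
        site_prep_hours=24, tape_retrieval_hours=4,
    )
    print(f"Estimated recovery: {estimate:.0f} hours")
    print(f"Meets a 24-hour RTO: {meets_rto(estimate, rto_hours=24)}")
```

With these assumed figures the estimate comes to 48 hours, twice a 24-hour target, before counting application recovery and validation. The model is optimistic by design; if the goal fails even here, it's unattainable in practice.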
- You don't have clearly defined DR roles, responsibilities and ownership. DR demands organization and execution. Each participant must understand their job, who they will interact with and, most importantly, the proper chain of command. A good portion of DR planning should be spent defining this structure and developing a level of comfort in its execution. Factors to consider include how a disaster is declared, the time to notify and stage people at DR sites, equipment logistics and execution of the recovery process.
- Your DR plan doesn't address the right risks. DR is an insurance policy. You need to determine how much and what kinds of insurance you need, and what risks you're willing to take.
There are many potential causes of unplanned outages, ranging from internal physical events to external regional or environmental catastrophes. Internal events are more likely to cause problems than events outside the data center. Developing an understanding of disaster categories, weighing the risks and formulating a plan to address the targeted categories should be the goal. People often buy insurance based on what they can afford, rather than what they need, but DR decisions shouldn't be made on that basis.
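Weighing risks can start with something as simple as multiplying how often an event is expected by what it would cost. The sketch below uses that common simplification (expected annual loss equals events per year times cost per event); it is not a method taken from this article, and the categories and figures are hypothetical placeholders.

```python
# Hypothetical risk categories: (name, expected events per year, cost per event).
# All figures are invented for illustration.
RISKS = [
    ("hardware failure", 0.5, 50_000),
    ("regional flood", 0.01, 2_000_000),
    ("operator error", 2.0, 20_000),
]


def rank_by_expected_loss(risks):
    """Return (name, expected annual loss) pairs, largest loss first."""
    scored = [(name, rate * cost) for name, rate, cost in risks]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, loss in rank_by_expected_loss(RISKS):
        print(f"{name}: about ${loss:,.0f} per year")
```

A ranking like this is only a starting point. A rare regional event may still deserve coverage if a single occurrence would be unsurvivable, which an average can hide.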
- Your backups don't work. Although this point is technically related to testing, it's worth underscoring that tape backup is the primary medium for DR at most companies. Wide-area data replication is still too costly for many businesses; therefore, their DR is only as good as their tape restoration capabilities and offsite tape management. All the planning in the world counts for naught if the tapes are bad (or just don't exist).
Often, offsite tapes can't be created and shipped in a timely fashion due to a lack of resources. Virtual tape libraries and other disk-based approaches can enable backups to complete sooner and allow tape resources to be dedicated to offsite media production.
- Will anyone be there to recover data? An uncomfortable factor to consider is the risk of staff not being available to perform the recovery. Some might say that in such a scenario there are far greater issues than data recovery, but at some point this risk needs to be considered. Large companies with IT expertise in multiple data centers can develop DR capabilities that leverage resources in multiple locations. Third-party service companies may also be involved, provided comprehensive plans and guidelines are in place.
- DR is just too expensive. During a recent conversation, an IT manager exclaimed, "We simply can't afford to test DR!" As some of the previous points suggest, good DR is an onerous expense that most organizations are unwilling or unable to absorb.
But even with a small DR budget, prudent steps can be taken, such as ensuring good backups, establishing roles and responsibilities, and effective planning. New technologies may also be leveraged to make recovery more affordable. But don't create false expectations. Establish recovery objectives that are in line with capabilities and make them known and understood outside of IT. DR may be the item IT least wants to talk about, but it's past time to face up to the issue and close the reality gap.
This was first published in May 2005