Disaster recovery planning on a budget

Tips on how to plan a disaster recovery initiative that helps you get the most bang for your buck.

This article can also be found in the Premium Editorial Download: Storage magazine: Storage products of the year 2003.

RPO/RTO impact of potential DR solutions

In this example, introducing replication and increasing communications bandwidth to support higher volumes of data could further reduce recovery point objective (RPO) to 12 hours. By upgrading from a warm site to a hot site, it would be possible to decrease recovery time objective (RTO) from under two days to about six hours.

Source: GlassHouse Technologies

Recently, I received a call from an IT manager responsible for disaster recovery (DR) for a large organization. He was reasonably certain that the DR test that was scheduled in a few months would fail. He was hoping to identify and resolve some of the problems.

More than any other category of IT activity, DR strains the planning and operational capabilities of an IT organization. A DR infrastructure typically is expensive to implement, has low utilization and requires tasks to be accomplished in a time-critical manner. A careful analysis of the trade-offs that you can tolerate--with an understanding of what resources you can commit--should help you get closer to an optimal setup.

The perfect DR infrastructure
Based on today's technological capabilities, experts agree on a standard model that's regarded as the state of the art. The foundation of this model involves disk-based replication. Ideally, you would start with a storage area network (SAN) infrastructure with anywhere from two to six times the primary storage capacity to enable remote replication--preferably via synchronous means--to ensure a current copy of data is available in at least two geographically distinct locations.

To accomplish this, you'd have to standardize on a tier-one storage system such as EMC Symmetrix, HDS Lightning or IBM ESS, with appropriate split mirror and remote replication software. Another critical requirement is the appropriate bandwidth. Dense wavelength division multiplexing (DWDM) is the premier technology available (see "Linking SANs for disaster recovery, Parts 1 and 2" in the September and October 2003 issues of Storage).

To incorporate this technology, there's a complex integration effort that includes planning and implementing the right combinations of split mirrors and replicated volumes to meet the environment's requirements. You have to balance business requirements, latency considerations, application characteristics, bandwidth constraints and organizational capabilities. The good news is that once the effort has been expended and the management processes are refined, disk-based replication can be highly effective.

The downside--cost
By now, you're probably asking yourself if you can afford this and if you really need it. Disk replication is expensive. Not only is the initial investment in the equipment, software and integration services costly, but for many companies the monthly expense for the additional bandwidth is simply too much.

Lower-cost solutions such as host-based software replication or appliance-based solutions using an enhanced WAN infrastructure are available and can be effective in some situations. One significant benefit is the ability of these products to replicate independently of the storage platform, allowing you to replicate from a high-priced storage system to a lower-priced system.

Another expensive consideration is the recurring cost of maintaining a hot or warm site. Large organizations with multiple data centers can often leverage existing space within these sites for DR purposes or can justify the cost of a centralized DR facility shared by multiple divisions. For others, a remote hosting facility with systems made available on an as-needed basis is the only cost-justifiable solution. However, such an environment increases the time to recover and complicates recovery.

Calculating recovery point objective

In this scenario, a particular system is backed up each night at 8:00 p.m., and the backup tape is sent offsite the following morning at 8:00 a.m. If a disaster resulting in loss of the site were to occur at 7:59 a.m., the most recent backup tape would still be on site and would be lost with it. The only data available for recovery would be the backup taken the night before that, which is already 36 hours old at that point in time.
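The arithmetic in this sidebar can be checked with a short sketch. It assumes that backups complete instantly at 8:00 p.m., that each tape leaves the site at the first 8:00 a.m. pickup after it is written, and that a tape counts as safe only once the pickup time has passed. The function names, default hours and example dates are illustrative and don't come from any product.

```python
from datetime import datetime, timedelta

def newest_offsite_backup(disaster, backup_hour=20, pickup_hour=8):
    """Return the time of the newest backup already off site at `disaster`.

    Assumes one backup per day at `backup_hour` and one pickup per day at
    `pickup_hour`; both defaults mirror the sidebar scenario.
    """
    midnight = disaster.replace(hour=0, minute=0, second=0, microsecond=0)
    for days_back in range(8):
        backup = midnight - timedelta(days=days_back) + timedelta(hours=backup_hour)
        # The tape leaves on the first pickup after the backup is written.
        pickup = backup.replace(hour=pickup_hour)
        if pickup <= backup:
            pickup += timedelta(days=1)
        # Only tapes picked up strictly before the disaster survive it.
        if pickup < disaster:
            return backup
    raise ValueError("no offsite backup found in the last week")

def data_age(disaster):
    """Age of the newest recoverable data at the moment of the disaster."""
    return disaster - newest_offsite_backup(disaster)

def worst_case_age(day):
    """Largest data age for a disaster at any whole minute of `day`."""
    start = day.replace(hour=0, minute=0, second=0, microsecond=0)
    return max(data_age(start + timedelta(minutes=m)) for m in range(24 * 60))

# Example: a disaster at 7:59 a.m. on an arbitrary date.
age = data_age(datetime(2004, 1, 7, 7, 59))
print(f"Data age at 7:59 a.m.: {age.total_seconds() / 3600:.2f} hours")
```

A disaster one minute before the pickup leaves data just under 36 hours old, while one shortly after the pickup leaves data only about 12 hours old. The effective RPO is set by the worst moment in the cycle, not the average.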

Aligning need and cost
To ensure an effective DR capability, you must understand your requirements, which are based on recovery point objective (RPO) and recovery time objective (RTO).

RPO is the worst-case data loss that's acceptable for a specific class of data (see "Calculating recovery point objective"). RTO is the time from the disaster to the resumption of business.

Some of the key trade-offs to consider are the availability of dedicated vs. allocated DR assets, online vs. tape recovery and the extent of automation in the recovery or failover process. A major cost element is the availability of assets, specifically standby servers on which to recover. Maintaining a hot site requires dedicating assets that are likely to have low utilization rates and therefore wouldn't be affordable in many situations. On the other hand, the lack of a functioning recovery site could increase RTO to such an extent that some companies could be mortally wounded by the time they recovered.

Many companies that have investigated building advanced DR infrastructures based on remote replication end up abandoning these plans because of the high recurring communications expense. Is it possible to build an effective DR strategy without replication? Where should the investment and focus be placed?

A recent GlassHouse Technologies engagement resulted in the recovery options shown in "RPO/RTO impact of potential DR solutions" on this page. The intent was to show potential RPO/RTO improvements based on an increasing level of technology investment. A variety of alternatives were being considered. Due to distance requirements, synchronous replication wasn't an option. Current DR tests showed a recovery capability of greater than eight days for RTO and RPO because of problems in the backup infrastructure. Remediation steps would reduce the RTO/RPO to two days at a low cost. By weighing the incremental gains against the costs, the company developed a road map based on business requirements.
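One way to picture this kind of road map is as a lookup: given a target RPO and RTO, pick the least expensive option that meets both. The sketch below uses the RPO/RTO figures quoted in this article, rounding "greater than eight days" down to 192 hours and "under two days" up to 48 hours. The relative cost values are purely hypothetical placeholders, since the article gives no cost data. Treating the hot site as an addition on top of replication is also an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Option:
    name: str
    rpo_hours: float
    rto_hours: float
    relative_cost: float  # hypothetical units, not from the article

OPTIONS = [
    Option("current state", 192, 192, 0),           # "greater than eight days"
    Option("backup remediation", 48, 48, 1),        # two days for both
    Option("replication and bandwidth", 12, 48, 5), # RPO cut to 12 hours
    Option("replication plus hot site", 12, 6, 10), # RTO cut to about six hours
]

def cheapest_meeting(options, max_rpo_hours, max_rto_hours):
    """Return the lowest-cost option meeting both targets, or None."""
    candidates = [
        o for o in options
        if o.rpo_hours <= max_rpo_hours and o.rto_hours <= max_rto_hours
    ]
    return min(candidates, key=lambda o: o.relative_cost) if candidates else None

# Example: a business that can tolerate one day of data loss
# and two days of downtime.
choice = cheapest_meeting(OPTIONS, max_rpo_hours=24, max_rto_hours=48)
print(choice.name if choice else "no option meets the targets")
```

With real cost figures in place of the placeholders, the same comparison shows where an extra investment stops buying a meaningful reduction in RPO or RTO.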

Process, process, process
While reductions in RPO are largely a technology investment, a significant improvement in RTO can be realized by having a well-executed DR process. Because this is usually a more difficult area to tackle, it often gets overlooked as people search for technology solutions. That's unfortunate because improvement in RPO by implementing technologies is measured in hours; gains in RTO can be measured in days.

In a true DR scenario, people are performing non-routine tasks, often in unfamiliar surroundings. There can be confusion over responsibilities and the sequence of recovery tasks. The potential for error is high.

For these reasons, you should invest as much in DR process as in technology. This doesn't mean just creating a DR planning document that sits on the shelf. It means developing a process that ensures DR is considered in all IT planning, recovery plans are reviewed and updated regularly, and realistic testing is done both on a regular schedule and at unscheduled times.

This was first published in January 2004