Hot site, warm site or cold site? Here's how to figure out the best disaster recovery strategy for your company.
By Jacob Gsoedl
The ability to recover from a disaster in an acceptable period of time is a critical issue for companies with increasing dependence on information technology. Once thought to be a concern for only larger organizations, being able to recover mission-critical applications within a predictable timeframe is a mandate for any size company today. But some users see disaster recovery (DR) as a pricey insurance policy, and may take shortcuts to try to save a few dollars. To avoid becoming victims of budget cuts, DR provisions and sites must be built around a few basic principles that allow management to decide what's required while candidly showing the possible business impact and consequences of retrenchments.
Recovery time objective (RTO) and recovery point objective (RPO) are the key metrics for determining the DR level required to recover business processes and applications. RTO is how long a business process or application can be down before it must be restored; RPO is how much recent data, measured in time, the business can afford to lose. Both are inversely related to the cost of DR: The closer RTO and RPO need to be to zero, the more expensive DR provisioning will be. If recovery time can be days or even weeks, costs will likely be significantly less.
Determining the necessary RTOs and RPOs is the single most important exercise a business needs to perform to ensure the right level of DR without wasting money. RTOs and RPOs are derived through business impact analysis of business processes and applications to determine the value of business processes and the anticipated financial impact if they become unavailable. Obviously, this varies greatly by business process and application. "While for just-in-time manufacturing the critical threshold may be 15 minutes, it could be days for a marketing application," says George Ferguson, worldwide service segment manager for Hewlett-Packard (HP) Co.'s business continuity and recovery services.
Very likely, determining RTOs and RPOs will be an iterative process because of two competing forces: available budget and required recovery objectives. "The challenge of contingency services like disaster recovery is to find the right balance between available budget and what's required to sustain the business," says Greg Schulz, founder and senior analyst at StorageIO Group, Stillwater, MN.
Disaster recovery options
With a business impact analysis in hand and agreement on RTOs and RPOs, IT management can devise implementation options. Disaster recovery site terminology can be confusing -- terms like hot site, warm site and cold site are common in DR parlance, but they're used inconsistently. A hot site in the U.S. typically comprises shared equipment, while "in Europe the term hot site is predominantly used for dedicated equipment," says Ferguson. The following definitions match the prevailing U.S. interpretations of these terms:
Hosted site. A site with dedicated equipment; required whenever RTO and RPO need to be close to zero.
Hot site. Uses shared equipment with dedicated storage and real-time replication; a typical RTO of a few hours.
Warm site. Uses shared equipment without dedicated storage, but depends on data backup for recovery; RTOs can range from a few hours to days depending on the backup method in use.
Cold site. Typically, dedicated space in a data center fully loaded with cooling, power and connectivity ready to accept equipment; RTOs are usually a week or more.
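As a rough illustration of how these definitions turn a recovery requirement into a site type, the Python sketch below picks the least expensive site class whose typical recovery time still fits a required RTO. The hour values are assumptions chosen only to mirror the wording above ("close to zero," "a few hours," "days," "a week or more"); they aren't figures from any provider, and the names SITE_TYPES and choose_site are hypothetical.

```python
# Site types ordered from least to most expensive, each paired with an
# ASSUMED typical recovery time in hours. These thresholds are illustrative
# only; real values depend on the provider and the backup method in use.
SITE_TYPES = [
    ("cold site", 168.0),   # assumption: about a week
    ("warm site", 48.0),    # assumption: a couple of days (backup-based)
    ("hot site", 4.0),      # assumption: a few hours (replicated storage)
    ("hosted site", 0.0),   # dedicated equipment, close to zero
]


def choose_site(required_rto_hours: float) -> str:
    """Return the cheapest site type whose assumed RTO meets the requirement."""
    if required_rto_hours < 0:
        raise ValueError("RTO cannot be negative")
    for name, typical_rto_hours in SITE_TYPES:
        if typical_rto_hours <= required_rto_hours:
            return name
    # Not reached for non-negative input, because the hosted site is 0 hours.
    return "hosted site"


if __name__ == "__main__":
    for app, rto in [("just-in-time manufacturing", 0.25),
                     ("marketing application", 72.0)]:
        print(f"{app}: RTO {rto} h -> {choose_site(rto)}")
```

The sketch looks at RTO only. A real selection would also weigh RPO, which determines whether real-time replication or backup-based recovery is acceptable.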
It's quite common for a DR site to serve various roles for different applications. For instance, a DR site may serve as a hosted site with close to real-time failover for a mission-critical e-commerce application, and it may also serve as a low-end warm site with tape-based recovery for a less critical engineering application. Many DR sites are hybrids where the application determines the role of the site. As a result, disaster recovery companies that host DR sites typically offer their services in tiers that can be mapped to RTOs and RPOs required by applications (see "DR tiers," below).
DR tiers: A comparison of disaster recovery tiering options for a 5 TB Microsoft Exchange 2007 environment, using plans and pricing from Recovery Point Systems as an example, is available as a PDF.
A tier 1 DR offering provides the highest level of DR protection, and is typically used for applications that require close to zero RTOs and RPOs. A characteristic of tier 1 DR is the use of dedicated equipment in the DR site. As a result, it carries the highest price tag and is usually only for the most mission-critical applications. Because the equipment in the DR site is dedicated to a single client company, there are very few constraints on the equipment that can be used, even if the service is outsourced.
Among all DR options, tier 1 is the one best suited to being hosted in-house, where your company owns and maintains the site. Because of the need for specific DR equipment, it's typically less expensive to build tier 1 DR in-house than to outsource it. "Tier 1 DR can be done more cost-effectively in-house, especially if you have the facility and people," explains HP's Ferguson.
Because applications in the primary and DR site are closely coupled, production and DR equipment are commonly managed by one entity.
If the DR site is hosted by a third party, it's not unusual for both the primary and DR equipment to be managed by the DR services provider. As an example, Citrix Systems Inc. decided to outsource management of both its primary HP XP12000 SAN and its DR site. While the production SAN physically resides in Citrix's primary data center in Miami, the DR SAN is hosted by HP. "Our SAN storage in Miami is outsourced with and managed by HP," says Michael Emerson, director, IT security, governance and business continuity at Citrix. "They own the SAN and manage it, including the replication from the production to the DR SAN at HP, using HP Continuous Access replication [HP StorageWorks XP Continuous Access Software]."
Data classification and DR
Disaster recovery (DR) for files and folders is generally simpler than disaster recovery for applications because you don't have to consider issues like application consistency, transaction integrity and application dependencies. The challenge with DR for file-based content is mostly a problem of volume and size. Companies may have tens or hundreds of terabytes of file data, so determining what needs to be included in the DR plan can be a daunting task.
Some companies have turned to data classification tools to determine the value of data and its appropriate DR tier. Data may be classified using a variety of tools:
Storage resource management (SRM) tools typically classify files by metadata such as file type, size and modification date. An example is the Hewlett-Packard (HP) Co. Storage Essentials File System Viewer module, which allows files to be grouped by various file properties.
Archiving tools have built-in classification and tend to go beyond just metadata to include full content indexing. Symantec Corp.'s Enterprise Vault and archiving products from C2C Systems Limited are examples.
Data-loss prevention tools detect and prevent the unauthorized transmission of information and include data categorization capabilities. They're available from McAfee Inc., RSA (The Security Division of EMC Corp.) and Symantec, among others.
Standalone classification tools, available from companies like Abrevity Inc., Kazeon Systems Inc., Njini Inc. and Permabit Technology Corp., can be used to categorize data to determine the appropriate DR tier.
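To make metadata-based classification concrete, here is a minimal Python sketch that walks a directory tree and assigns each file to a DR tier by file type and modification date, then totals the bytes per tier. It isn't how any of the products named above work; the extension list, the age cutoffs, the tier labels and the function names are all assumptions for illustration.

```python
import os
import time
from collections import Counter
from pathlib import Path
from typing import Optional

# ASSUMED rules for illustration only: which file types count as business
# documents, and how recently a file must have changed to matter most.
BUSINESS_EXTENSIONS = {".docx", ".xlsx", ".pdf"}
RECENT_DAYS = 90
STALE_DAYS = 365
SECONDS_PER_DAY = 86400


def classify_file(path: Path, now: Optional[float] = None) -> str:
    """Assign a hypothetical DR tier from file type and modification date."""
    if now is None:
        now = time.time()
    age_days = (now - path.stat().st_mtime) / SECONDS_PER_DAY
    if path.suffix.lower() in BUSINESS_EXTENSIONS and age_days <= RECENT_DAYS:
        return "tier 1"   # recently changed business documents
    if age_days <= STALE_DAYS:
        return "tier 2"   # anything else touched within the last year
    return "tier 3"       # stale data


def summarize(root: Path, now: Optional[float] = None) -> dict:
    """Return total bytes per tier for all files under root."""
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = Path(dirpath) / name
            totals[classify_file(file_path, now)] += file_path.stat().st_size
    return dict(totals)
```

Calling summarize(Path("/some/share")) on a file share (the path is a placeholder) gives a first estimate of how many bytes would fall into each tier under these assumed rules, which speaks to the volume problem described above.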
If a recovery time of a few hours (instead of minutes) is acceptable, a hot site is likely appropriate. The biggest difference between a hosted site and a hot site is the use of shared equipment for infrastructure components like servers and peripherals. Storage is dedicated and real-time data replication is used to get data from the production site to the DR site. Because equipment in the DR site is shared by multiple customers, hot sites are significantly less expensive than hosted sites. "Hot sites and warm sites can be implemented less expensively through outsourcing than doing them in-house because of shared equipment," says Ferguson. "DR services providers rely on the fact that not all customers have a disaster at the same time."
On the downside, the use of shared equipment makes hot sites less flexible because customers are limited by the equipment the DR service provider offers. While some service providers may have a limited selection of equipment, others are more flexible. "About 90% of the time we're able to use shared equipment, and the rest of the time we work with the customer to make it work," says Marc Langer, president at Recovery Point Systems, a provider of backup, storage and disaster recovery services. Larger service providers may be less flexible, so the nature of the shared equipment is likely to be a determining factor when selecting a hot or warm site provider.
Another consequence of using a site with shared equipment is the time limit on how long customers can use the shared gear in the event of a disaster. The limit varies among service providers, but typically ranges between 30 days and 90 days. "Customers can use the shared equipment for 60 days before they need to get out or before they get migrated to a cold site," says Langer. Service providers with a larger number of data centers, like IBM Corp., can be more flexible. "We're pretty open-ended because we can shift workloads to other data centers," says John Sing, senior consultant, business continuity strategy and planning at IBM's Systems and Technology Group. To avoid unpleasant surprises, a clear understanding of the terms, conditions and limitations of managed DR services is required prior to committing to an agreement that may span several years.
In contrast to a hot site, a warm site relies on backups for recovery. As a result, it doesn't require dedicated storage but instead can take advantage of less-expensive shared storage. In other words, all components of a warm site, including storage, are shared among multiple customers. Therefore, most of the considerations of hot sites also apply for warm sites.
In the past, there was a huge difference between hot sites and warm sites because backups were limited to tapes. As a result, warm site recoveries were typically measured in days. Warm sites that rely on tape-based backups for recovery are clearly at the lower end of the DR services spectrum.
Disk-based backups have narrowed the gap between warm sites and hot sites, and almost all DR service providers now offer an electronic vaulting option, which is essentially disk-based backup of production data over the network. RTOs and RPOs of warm sites with electronic vaulting are typically less than a day, which is very close to the recovery times offered by hot sites but at a fraction of the cost. "There has been about a 10x price difference between a replicated DR infrastructure and a shared infrastructure with electronic vaulting," explains HP's Ferguson. "Electronic vaulting is closing the gap between tape-based recovery and a replicated DR infrastructure, and customers need to look at it because of its price and reliability benefits."
A cold site is rented space with power, cooling and connectivity that's ready to accept equipment. With recovery times of a week or more, a cold site is only an option for business processes that can be down for an extended period. Cold sites are also used to complement hot sites and warm sites in case of disasters that last a long time. "Some of our customers sign up for a cold site as contingency to migrate equipment from the shared infrastructure to the cold site in case a disaster lasts more than six weeks," says Recovery Point Systems' Langer.
It's the customer's responsibility to provide equipment for the cold site during a disaster. A DR plan that relies on a cold site must clearly define the process of procuring and delivering equipment to the cold site when a disaster strikes. It's a risky strategy to rely on purchasing the equipment on the open market when it's needed as it may not be possible to get the equipment in a timely fashion. A better option is to consider subscribing to a quick-ship service available from companies like Agility Recovery Solutions. "You can rent equipment for as little as $50/month with an option to buy it if needed," says Recovery Point Systems' Langer.
In-house DR vs. outsourced DR
Whether to create a DR site in-house or to outsource it is a fundamental decision that needs to be made when putting a DR strategy in place. The in-house approach may be tempting, with the assumption that the work related to DR can be performed by existing staff. Unfortunately, experience shows that in-house DR is more likely to fail than outsourced DR services.
According to an IDC study, enterprises that didn't outsource lost on average $4 million per disaster incident across a variety of business functions (e.g., sales/marketing, financing, e-commerce). In contrast, enterprises that outsourced to a third party lost an average of $1.1 million per incident. The study adds that companies that leverage an in-house model spend 32% more than those opting to outsource.
It further shows that outsourcers can provide a shorter recovery window than in-house operations, with RTO reduced by a factor of 0.62. The study concludes that primary and DR data centers are more likely to get out of sync if DR services are performed in-house.
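The loss and spending figures from the study can be put side by side with some simple arithmetic. The Python sketch below uses only the averages quoted above. The function names are hypothetical, and reading "spend 32% more" as in-house spend equal to 1.32 times outsourced spend is an interpretation, not something the study excerpt spells out.

```python
# Average loss per disaster incident, as quoted from the IDC study above.
IN_HOUSE_LOSS = 4_000_000     # dollars, enterprises that didn't outsource
OUTSOURCED_LOSS = 1_100_000   # dollars, enterprises that outsourced

# ASSUMED reading of "spend 32% more": in-house = outsourced * 1.32.
IN_HOUSE_SPEND_PREMIUM = 0.32


def loss_gap_per_incident() -> int:
    """Extra average loss per incident for in-house DR, in dollars."""
    return IN_HOUSE_LOSS - OUTSOURCED_LOSS


def loss_ratio() -> float:
    """How many times larger the in-house average loss is."""
    return IN_HOUSE_LOSS / OUTSOURCED_LOSS


def in_house_spend(outsourced_spend: float) -> float:
    """Estimated in-house spend for a given outsourced spend."""
    return outsourced_spend * (1 + IN_HOUSE_SPEND_PREMIUM)


if __name__ == "__main__":
    print(f"Loss gap per incident: ${loss_gap_per_incident():,}")
    print(f"In-house loss is about {loss_ratio():.1f}x the outsourced loss")
```

On these averages, the in-house loss per incident is $2.9 million higher, or roughly 3.6 times the outsourced figure.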
One of the primary reasons why in-house DR scores so poorly is the risk of taking shortcuts and burdening users already overloaded with other work. When a person's primary role is in conflict with their DR role, the primary role usually wins, to the detriment of the DR plan.
What to ask when selecting a DR facility
1. What type of facility should be used?
In-house using another office location
A collocation facility
Managed collocation space from the likes of Hewlett-Packard (HP) Co., IBM Corp. or SunGard Data Systems Inc.
2. For in-house disaster recovery (DR) facilities:
Is the facility equipped to deal with the increased load during a disaster (bandwidth, power, cooling, etc.)?
Is designated DR staff available?
Is equipment designated or at least ensured to be available in case of a disaster?
Are resources available to periodically test failover?
3. For collocation facilities:
Is the collocation facility a far enough distance from the production site?
Does the collocation facility have sufficient bandwidth options and power to scale and deal with the increased load during a major disaster?
Who will manage the equipment in the DR site? If it's managed in-house, many of the considerations of in-house DR apply.
4. For managed collocation space:
Based on recovery time objectives (RTOs) and recovery point objectives (RPOs), determine the type of site required (hosted, hot site, warm site or cold site).
Ensure that DR testing is included in the proposal.
As hot sites and warm sites typically limit how long they can be used during a disaster, clearly understand your options in case you need the DR site longer.
Calculating the cost of DR
Determining the cost of DR is company-specific, and the many variables make it difficult to devise a formula to calculate a DR cost for a given environment. In general, the cost of DR includes the cost for physical space, equipment, power, and network and professional services. But the cost of each of those components can vary greatly. "We have tried to put together a TCO tool, but data centers are too different and our DR options are so customized that it's very difficult to come up with a cost calculator," says David Palermo, vice president of marketing at SunGard Data Systems Inc.
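Even without a general formula, the cost components listed above can be tracked as line items so that in-house and outsourced proposals are compared on the same basis. The Python sketch below is only a tally under that assumption: the class name, the field names and the idea of a single cost period are hypothetical, and it makes no attempt to predict what any component will cost.

```python
from dataclasses import dataclass, fields


@dataclass
class DrCostEstimate:
    """Line items for one DR option, all for the same period (e.g. one year).

    The field names follow the components named in the text; the values
    are whatever quotes or internal estimates the company supplies.
    """
    physical_space: float
    equipment: float
    power: float
    network: float
    professional_services: float

    def total(self) -> float:
        """Sum of all cost components."""
        return sum(getattr(self, f.name) for f in fields(self))

    def largest_component(self) -> str:
        """Name of the component that contributes most to the total."""
        return max(fields(self), key=lambda f: getattr(self, f.name)).name
```

Filling in one DrCostEstimate per option with real quotes, and comparing total() across them, makes it harder to overlook a component when budgets are being negotiated.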
Fujitsu Computer Systems Corp.'s Affordable Business Continuity (ABC) product is one of the few packaged DR kits that includes storage, hosting and bandwidth for a fixed cost of $190,000. The ABC kit includes two Eternus 4000s with 3 TB of raw storage each, replication software and one year of hosting with bandwidth. Fujitsu's professional services works with customers on customized bundles and assists with determining the required server infrastructure (servers aren't included in the bundle).
DR site options
The prevailing options for DR sites are remote-office locations, collocation space and DR service providers' data centers.
Remote-office location and collocation space: Companies with multiple locations frequently use their remote data centers as DR sites. Leveraging existing facilities and infrastructure is a very cost-efficient DR option. For companies with multiple locations, but not multiple data centers, collocation space offered by providers like Equinix Inc., Savvis Inc. and telcos, may be a good alternative. Collocation facilities are relatively cost effective and usually provide first-class space with sufficient power, bandwidth and high facility standards.
Cost was the primary reason why Matt Blydenburgh, CIO at Tannenbaum Helpern Syracuse & Hirschtritt LLP, New York City, used collocation space in Connecticut for the firm's hot site. Blydenburgh uses Double-Take Software Inc.'s Double-Take to replicate data from the firm's New York City location to its hot site in Connecticut. "We looked at managed disaster recovery services from companies like SunGard, but it was very expensive," says Blydenburgh. "We now pay $1,800 for space and another $1,600 for bandwidth for both sites."
Managed DR service providers: Managed DR services providers like HP, IBM, Recovery Point Systems and SunGard are dedicated to disaster recovery and are hard to beat in the quality of service they provide. But they're not cheap. To get a fair price comparison between a managed service and using in-house DR facilities, it's essential to take into account all cost components, including the cost of dedicated DR staff.
With 155 DR data centers worldwide, IBM is the largest managed DR firm. Similar to HP, IBM can source all DR components from within IBM. With 30 U.S. and 30 European data centers, and approximately 12,000 customers worldwide, SunGard is also a major player in the managed DR space. Prior to its acquisition of EDS, HP was focused mostly on providing managed DR for companies using HP equipment, but HP is now playing at the same level as IBM. Smaller DR services firms have the advantage of flexibility and are more willing to wheel and deal to win a contract.
Even in financially challenging times, you should never walk away from DR because you can't afford a certain DR tier. Instead, go with a lower, less-expensive tier that gives reasonable protection for the available budget. Not having a DR plan should never be an option.
BIO: Jacob Gsoedl is a freelance writer and a corporate director for business systems. He can be reached at firstname.lastname@example.org.