DR for virtualized servers

A high level of mobility and the relative hardware independence of virtual servers greatly reduce the cost and complexity of putting disaster recovery (DR) in place, enabling companies to expand DR to a larger number of servers and applications.

It's easier and less expensive to protect virtual server data but, according to a recent survey, most companies don't back up all of their virtual servers.

Virtualized servers are being deployed in data centers at an increasing rate. Benefits such as cost savings through consolidation, simplified administration and lower energy consumption are some of the primary reasons for the proliferation of virtualized infrastructures. On the flip side, the ease of deploying virtualized servers carries the risk of spawning new servers that run critical apps without the necessary attention to data protection and disaster recovery (DR). Moreover, data protection and DR in virtual environments pose a variety of challenges, and processes deployed for the physical infrastructure may not work (or may work differently) for virtualized servers.

According to Symantec Corp.'s annual Symantec Disaster Recovery Research Report, 35% of virtual servers aren't covered in organizations' DR plans; in addition, only 37% of those surveyed back up all of their virtual systems. The primary reason cited for insufficient data protection and disaster recovery of virtualized servers is a lack of resources. IT departments that are already stretched to their limits don't have the time to put in place workable DR plans for many of their virtualized systems. And the tools for protecting physical and virtual servers differ in many cases, resulting in higher training, labor and software costs. Notwithstanding the challenges identified in the Symantec survey, protecting virtual server data is simpler and more cost effective than doing the same for physical servers.

DR for virtual vs. physical servers (PDF)

DR strategies for virtual machines
The reason virtual servers are easier to protect than physical servers lies in part in how the virtual machine's (VM's) hypervisor is architected. A hypervisor is a virtualization platform where multiple virtualized systems (so-called guest OSes) run on a single physical machine, known as the host. The VMs run their own OSes but share the underlying physical machine resources, from CPU and memory to I/O devices and storage. While physical servers are inseparably attached to a physical machine, virtual servers are stored as files or VM images on the host system. Because the VM image contains everything about the virtual server, the VM can be moved among physical machines by copying the relatively hardware-independent VM image file and booting it on different hosts. This innate mobility benefit of virtual servers is the primary reason why a virtual server is easier to deploy on a DR site.

In comparison, DR of physical servers where primary servers are configured to fail over to secondary servers is a much bigger and more costly challenge. The primary and secondary machines require the same or at least very similar hardware, and for any type of automated failover the physical machines need to be set up in some type of cluster. Traditional cluster software like Microsoft Cluster Server forces a relatively static relationship where clustered server nodes are pre-assigned and designated as primary or secondary nodes.

"In traditional cluster software that has been used for high availability [HA] and DR of physical servers, nodes are tightly coupled, which limits scalability and increases both capital and operational expenditure," explains Jason Nadeau, group product manager for Symantec's Veritas Cluster Server (VCS).

Conversely, in a virtualized server environment, the primary host with the production VMs and the secondary DR host can be very different. "DR for virtual servers no longer requires matching hardware," says Mark Bowker, analyst at Enterprise Strategy Group (ESG), Milford, MA. "You can have 10 physical servers running VMs in your primary data center and four very different physical servers in the secondary data center." To take this a step further, you can bring up the VMs on any of your DR hosts or run multiple instances of a single production VM on more than one host in your DR site.
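Because DR hosts need not match production hosts, planning the secondary site becomes largely a capacity question: will the protected VMs fit on the smaller set of DR hosts? The following is a minimal sketch of that check, assuming memory is the only constraint and using a first-fit-decreasing heuristic. The function name `place_vms`, the VM and host names, and the memory figures are hypothetical and do not come from any hypervisor vendor's tooling.

```python
from typing import Dict, List


def place_vms(vm_memory_gb: Dict[str, int],
              host_memory_gb: Dict[str, int]) -> Dict[str, List[str]]:
    """Assign VMs to DR hosts with a first-fit-decreasing heuristic on memory.

    Illustrative only: a real plan would also weigh CPU, storage and
    network capacity, none of which is modeled here.
    """
    free = dict(host_memory_gb)  # remaining memory per DR host
    placement: Dict[str, List[str]] = {host: [] for host in host_memory_gb}
    # Place the largest VMs first so small VMs do not crowd them out.
    ordered = sorted(vm_memory_gb.items(), key=lambda kv: kv[1], reverse=True)
    for vm, need in ordered:
        for host in free:
            if free[host] >= need:
                placement[host].append(vm)
                free[host] -= need
                break
        else:
            # No DR host had enough free memory for this VM.
            raise ValueError(f"no DR host has {need} GB free for {vm}")
    return placement


if __name__ == "__main__":
    vms = {"exchange": 16, "sql": 12, "web1": 4, "web2": 4, "file": 8}
    hosts = {"dr1": 24, "dr2": 24}
    for host, assigned in place_vms(vms, hosts).items():
        print(host, assigned)
```

If the function raises an error, the DR site is undersized for the set of VMs being protected, which is useful to know before a disaster rather than during one.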

The flexibility gained through the inherent mobility of virtual servers is stunning and opens new possibilities to use server resources. In a virtual server environment, physical servers (VM hosts) designated as DR servers can be used for noncritical apps during normal operations. Instead of DR servers being idle 99%-plus of the time, they can be used for apps not required during a disaster.

"With our VMware ESX hosts we have the option to leverage the DR servers in our secondary data center for DEV and TEST instances during normal operations," says Peter Allen, director of IT operations at Nixon Peabody LLP in Rochester, NY, an international law firm with 18 locations worldwide. Allen runs 17 ESX hosts with more than 100 virtual servers in the primary data center in Rochester and a secondary data center in Ohio.

The benefits of virtualized servers for data protection and DR are becoming a primary impetus for an increasing number of IT managers to convert to a virtualized server infrastructure. "DR is replacing server consolidation as the main driver to deploy virtual servers," says Bowker. "Some people are deploying a single VM on a host for critical and resource-intense apps like Exchange to reap the DR benefits but ensure sufficient resources; the data protection advantages by far outweigh the slight virtualization overhead."

Hypervisors and DR
With approximately a 70% share of the market, according to an ESG survey, VMware is the prevailing hypervisor, followed by Microsoft Virtual Server with 23% and about 4% for the various XenServer derivatives. With Microsoft offering Hyper-V (the successor to Microsoft Virtual Server) as an integral part of Windows Server 2008, this market share ratio will likely change. "With a significant price advantage and the fact that Hyper-V is part of the operating system, we predict that within 18 months there will be more VMs running on Hyper-V than on VMware ESX," says ESG's Bowker.

Even though the different hypervisors are based on the same fundamental architectural principles, they vary in implementation and management capabilities. This impacts data protection and DR, requiring different approaches for implementing DR solutions.

Different ways to back up virtual machines (PDF)

VMware ESX Server
At this point, ESX Server leads in features, performance and scalability, and has options and architectural characteristics that are beneficial for data protection and DR. To start, ESX Server doesn't run atop a third-party OS like Windows or Linux; it boots a very thin kernel optimized for the single-purpose hypervisor task without the overhead of a general-purpose OS.

It comes with Virtual Machine File System (VMFS), a clustered file system designed specifically for virtualization. Shared storage is a key requirement for hypervisors to share data across VMs and for live migration. "VMFS fully supports live migration, and enables multiple VMs to share a single LUN and still be able to migrate and fail over individual VMs," explains Noemi Greyzdorf, research manager, storage software at Framingham, MA-based IDC.

VMware's decision to opt for a proprietary file system and to perform storage management tasks makes it more challenging for existing data protection tools to back up and replicate virtual machine disk format (VMDK) files stored on a file system foreign to traditional tools. To overcome this challenge, VMware released a set of tools and apps that mediate between the proprietary VMware protocols and mechanisms and standard data protection applications.

With virtual disk files stored on a proprietary file system, ESX users need to run backup agents within each VM or deploy VMware Consolidated Backup (VCB), which provides a more efficient and scalable backup solution with low impact on server performance. VCB takes a VM snapshot and mounts the snapshot on a central proxy server, from which the data is backed up via regular backup apps.

As part of this process, VCB quiesces the file system in the VM to ensure that the entire state of the VM is captured at the point the snapshot is created. VCB currently falls short when it comes to backing up apps like Exchange running within VMs. Although the latest version of VCB has added a VSS requester to make application-consistent snapshots, it hasn't addressed the restore aspect, as the only way to restore an application to a consistent state is to restore the entire virtual machine.

VMware's Site Recovery Manager (SRM) enables companies to automate the failover process of their virtual server infrastructure. In the absence of a tool like SRM, the failover process needs to be done manually or via custom scripts. Integration with replication software is instrumental to automating the failover process; through Site Recovery Adapter (SRA) plug-ins, third-party replication software can integrate with SRM.

All major array and replication software vendors have developed or are developing SRAs, which are available directly from VMware. Released in late 2007, SRM is still a pretty new tool with shortcomings that are likely to be addressed in future revisions. "We aren't currently using SRM to bring up virtual machines in the DR site because it doesn't support NFS, but only works for FC LUNs," says Nixon Peabody's Allen. Not having a tool like SRM for Hyper-V or XenServer isn't as much of a fundamental problem as it is an inconvenience. "There's nothing in VMware's Site Recovery Manager that couldn't be done via scripts; in fact, SRM provides a GUI to define the recovery process and generates the script for you," explains ESG's Bowker.
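To make the idea of a scripted recovery process concrete, here is a minimal sketch of how a runbook might be assembled from a prioritized list of VMs. It is not SRM's format or API; the `RecoveryStep` class, the `build_runbook` function, the priority scheme and the step wording are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecoveryStep:
    vm: str
    priority: int  # hypothetical scheme: lower numbers start first


def build_runbook(steps: List[RecoveryStep]) -> List[str]:
    """Return an ordered list of human-readable failover steps.

    Storage is made available first, then VMs are powered on in
    priority order (ties broken by name so the plan is repeatable).
    """
    runbook = [
        "promote replicated storage at the DR site",
        "rescan storage on the DR hosts",
    ]
    for step in sorted(steps, key=lambda s: (s.priority, s.vm)):
        runbook.append(f"register and power on {step.vm}")
    return runbook


if __name__ == "__main__":
    plan = [
        RecoveryStep("web1", 3),
        RecoveryStep("domain-controller", 1),
        RecoveryStep("sql", 2),
    ]
    for number, line in enumerate(build_runbook(plan), start=1):
        print(f"{number}. {line}")
```

Keeping the plan as data rather than as a hand-edited script makes it easier to review and to rehearse, whichever hypervisor is in use.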

VMware was the first hypervisor vendor to perfect live migration of virtual machines through VMotion, which moves an entire running VM instantaneously from one server to another with zero downtime to apps. The entire state of a VM is encapsulated by a set of files stored on shared storage, and VMware's VMFS cluster file system allows the source and target VMware ESX server to access these VM files concurrently. The active memory and precise execution state of a virtual machine can then be rapidly transmitted over a high-speed network.

Although VMotion is used mostly for operational reasons, users are taking advantage of it as part of their DR strategies to distribute the workload to other ESX servers. Jim Yarber is senior manager of network operations at HeritageBank of the South in Albany, GA, a regional bank serving southwest Georgia and north central Florida. VMotion is an instrumental part of his DR plan. "In case of a failover, virtual machines are brought up on three ESX servers in the DR site," says Yarber. "To accommodate the increased workload, we then use VMotion to transparently migrate virtual servers to other ESX servers."

Yarber also takes advantage of VMware HA, which monitors virtual server availability and automatically moves and restarts failed virtual servers on other ESX servers. "VMware HA enables us to have high availability without the need for expensive standby hardware and additional software required by traditional HA cluster solutions," he says.

Microsoft Hyper-V
In Hyper-V and its predecessor Virtual Server, virtual disks are stored as VHD files on the Windows file system and are no different from other files. As a result, existing Windows data protection tools and methods can be used unchanged to back up and restore VHD files. "Double-Take can replicate Hyper-V file-level changes just like we replicate Exchange DB files," says Bob Roudebush, director of solutions engineering at Double-Take Software.

Unlike VMware ESX Server, Hyper-V leverages the NTFS file system, not a clustered file system, which makes live migration more challenging. In Microsoft's quick migration, which is based on traditional clustering, all VMs on a LUN are migrated at the same time. As Hyper-V-based HA is still dependent on cluster software, Hyper-V users can opt for a next-generation cluster server like the latest version of Symantec's VCS, which supports up to 256 heterogeneous nodes and is very lightweight. "Unlike VMware HA, which provides DR for VM containers without regard to the apps within the virtual server, VCS focuses on HA for apps," says Symantec's Nadeau.

Even though Microsoft Hyper-V is a huge step forward from Virtual Server, especially regarding management, reliability and scalability, it isn't quite on par with VMware ESX Server yet. It also relies to a larger degree on third-party tools and custom scripts for DR than does ESX Server.

XenServer and Virtual Iron
While open-source Xen-based hypervisors are offered by Citrix Systems Inc., Novell Inc., Oracle Corp., Red Hat Inc., Sun Microsystems Inc. and other Linux distributions, the greatest traction has been garnered by Citrix XenServer and Virtual Iron Software Inc.'s Virtual Iron.

Like Hyper-V, Xen relies on a general-purpose OS, in Xen's case a Unix-like one (mostly Linux); but contrary to Hyper-V, VM files are written to raw disks, which challenges data protection tools that monitor and back up file changes. "We needed to add a volume block-level filter driver to replicate Xen virtual image files," says Roudebush.

Live migration of running virtual machines is part of the Xen hypervisor, but it's not based on a clustered file system like VMFS. Instead, it depends on NFS mounts for shared storage while VMs are moved from one host to another. To move single virtual servers, XenServer assigns a LUN to each virtual disk image and leverages features in storage systems to back up, snapshot and clone these volumes.

Among the various Xen hypervisors, Virtual Iron has the most extensive management tools, attempting to match what VMware offers. "We come into play for customers for whom VMware is too expensive; for a third of the price, we provide similar management tools as VMware," claims Chris Barclay, director of product management for Virtual Iron.

Virtual server backup choices
Depending on the hypervisor in use, IT infrastructure in place, and expected recovery time objective (RTO) and recovery point objective (RPO), companies deploy different disaster recovery methods for virtual servers.

Traditional backup and recovery: For small- to midsized firms with a small number of virtual servers, backing up VMs and restoring them in the DR site in case of a disaster is a frequently used option. As the restore can be done on almost any hardware that runs the hypervisor software, the matching-hardware requirements that complicate restoring physical servers become a non-issue. Being able to restore multiple VMs to a single host further reduces hardware requirements for the secondary site and significantly lowers the overall cost of DR.

In the simplest approach, virtual servers are backed up by installing backup agents in the VMs, but that adds overhead and will likely impact server performance while backups are running. Companies that run VMware have the option to deploy VMware's Consolidated Backup, which removes the backup load from VMs. As an alternative to backing up each VM individually, backups can be taken at the hypervisor level, which eliminates the need to install agents on each VM but only allows restoring at the VM level.

Synchronous vs. asynchronous replication (PDF)

Backup software vendors like CommVault have extended their backup suites to accommodate the backup needs of virtual servers. The majority of backup software vendors' products are integrated with VCB and some have added extra features. "We use VMware VCB and CommVault Galaxy to back up about 70 VMware ESX guests," says Peter Kovaleski, network Unix administrator at Oral Roberts University in Tulsa, OK. "The CommVault restore agent allows us to directly restore files back to ESX hosts, eliminating the manual copy process of files from the proxy server to the VM."

Synchronous or asynchronous replication: Storage system-based snapshots and replication are the prevailing methods in the enterprise space to get VM images and data from a primary to a secondary site. From a storage and replication perspective, the requirements to protect physical and virtual servers are very similar.

To start, snapshots are scheduled to capture all changes since the last snapshot. The frequency of snapshots varies and depends on the acceptable RPO. A key requirement during snapshots is quiescing virtual servers to ensure that the entire state of VMs is captured at the point the snapshots are created. Snapshots are then replicated to the secondary site via synchronous or asynchronous replication.
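The link between snapshot frequency and RPO can be worked through with simple arithmetic. The sketch below assumes asynchronous replication of periodic snapshots and a simplified model in which a write made just after one snapshot is only safe at the DR site once the next snapshot has been taken and fully transferred. The function names and the figures in the usage example are illustrative assumptions, not vendor guidance.

```python
def worst_case_rpo_minutes(snapshot_interval_min: float,
                           transfer_min: float) -> float:
    """Worst-case data loss window for periodic, asynchronously
    replicated snapshots.

    A write made right after a snapshot waits one full interval for the
    next snapshot, plus the time needed to ship that snapshot.
    """
    return snapshot_interval_min + transfer_min


def max_snapshot_interval_minutes(target_rpo_min: float,
                                  transfer_min: float) -> float:
    """Longest snapshot interval that still meets the target RPO."""
    interval = target_rpo_min - transfer_min
    if interval <= 0:
        raise ValueError("transfer time alone exceeds the target RPO")
    return interval


if __name__ == "__main__":
    # Hourly snapshots that each take 15 minutes to replicate.
    print(worst_case_rpo_minutes(60, 15))
    # Interval needed to hold a 60-minute RPO with the same transfer time.
    print(max_snapshot_interval_minutes(60, 15))
```

With synchronous replication the transfer delay effectively drops out of the calculation, at the cost of added write latency and tighter distance limits between sites.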

Among all hypervisors, Microsoft faces the fewest integration issues with storage systems because it uses NTFS and VSS, technologies that are widely supported by storage vendors. Similarly, because of its 70% market share, VMware enjoys widespread integration support, especially for its Site Recovery Manager, which is supported by most major storage vendors.

Storage system-based snapshots and replication are favored by enterprise customers because they're likely to already have storage systems with snapshot and replication support and are hesitant to sign up for less-proven alternatives like continuous data protection (CDP). "We decided to use NetApp and NetApp's SnapManager for Virtual Infrastructure because it allowed us to automate what was previously scripted, from quiescing the virtual machines and taking snapshots to replicating them to the secondary site," explains Nixon Peabody's Allen.

CDP: CDP products such as Double-Take for Virtual Systems; FalconStor Software Inc.'s CDP Virtual Appliance for VMware Infrastructure and Network Storage Server (which enables automated, application-consistent failover in geographically dispersed clusters of both physical and Hyper-V virtual servers); and InMage Systems Inc.'s DR-Scout are viable DR alternatives to storage-based snapshots and replication for several reasons. They're less expensive, especially for customers who don't have matching storage systems in the primary and secondary data centers. Because changes are captured and replicated as they occur, they add very little overhead to VMs. Finally, CDP products not only provide for failing over to the latest replica, but also allow users to easily roll back to previous points in time.
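The point-in-time rollback that distinguishes CDP from periodic snapshots can be illustrated with a small change journal. This is a conceptual sketch only and does not describe how any of the products named above are implemented; the `ChangeJournal` class, its methods and the block-level model are assumptions made for the example.

```python
from typing import Dict, List, Tuple


class ChangeJournal:
    """Toy CDP journal: every block write is recorded with a timestamp."""

    def __init__(self) -> None:
        self._entries: List[Tuple[float, int, bytes]] = []

    def record(self, timestamp: float, block: int, data: bytes) -> None:
        """Append one write; timestamps must not go backwards."""
        if self._entries and timestamp < self._entries[-1][0]:
            raise ValueError("writes must be recorded in time order")
        self._entries.append((timestamp, block, data))

    def image_at(self, point_in_time: float) -> Dict[int, bytes]:
        """Rebuild block contents as they were at the given time by
        replaying every write up to and including that moment."""
        image: Dict[int, bytes] = {}
        for timestamp, block, data in self._entries:
            if timestamp > point_in_time:
                break
            image[block] = data
        return image


if __name__ == "__main__":
    journal = ChangeJournal()
    journal.record(1.0, 0, b"good")
    journal.record(2.0, 1, b"data")
    journal.record(3.0, 0, b"corrupt")
    print(journal.image_at(3.0))  # latest state, including the bad write
    print(journal.image_at(2.0))  # state just before the bad write
```

Because every write is kept, recovery can target the moment just before a corruption rather than the most recent scheduled snapshot.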

"We chose DR-Scout over array-based replication because of its minimal bandwidth use," says HeritageBank's Yarber. "I have a 100Mb Ethernet connection between our two data centers, and DR-Scout barely scratches it."
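For a sense of scale, the ceiling that a link of this size puts on replication traffic is easy to estimate. The sketch below is plain arithmetic, assuming the link speed is given in megabits per second and ignoring protocol overhead and compression; the function name and the `utilization` parameter are hypothetical.

```python
def replicable_gb_per_hour(link_mbps: float, utilization: float = 1.0) -> float:
    """Upper bound on changed data (in decimal GB) that can cross the
    link in one hour at the given fraction of link utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    bytes_per_second = link_mbps * 1_000_000 / 8
    return bytes_per_second * 3600 * utilization / 1_000_000_000


if __name__ == "__main__":
    # A fully used 100Mb link versus one held to a quarter of its capacity.
    print(replicable_gb_per_hour(100))
    print(replicable_gb_per_hour(100, utilization=0.25))
```

Comparing that ceiling with the measured change rate of the protected VMs shows how much headroom a given replication approach leaves on the link.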

Without question, virtualized servers are revolutionizing disaster recovery. DR has always been expensive and many plans only cover mission-critical apps. A high level of mobility and the relative hardware independence of virtual servers greatly reduce the cost and complexity of putting disaster recovery in place, enabling companies to expand DR to a larger number of servers and applications.
