The advantages of asynchronous replication

Storage system-based asynchronous replication is becoming the new de facto standard to recover data quickly for business continuity. Asynchronous replication offers advantages over more costly synchronous replication, but it still requires similar arrays at both ends of the replication setup, so it's likely that you'll rely on a single vendor's technology.

If some of your apps don't require a recovery time objective in seconds, asynchronous replication delivers many benefits over synchronous replication.

Storage system-based asynchronous replication is becoming the new de facto standard to recover data quickly for business continuity. But because this type of data replication requires similar arrays at both ends of the replication, be prepared to be tied to a particular vendor for a long time.

Storage system-based asynchronous replication overcomes the typical disadvantages of synchronous replication, such as the requirement of a high-bandwidth network connection and distance limitations. In addition, when purchased as part of the storage system, asynchronous replication is more economical than add-on replication technologies like continuous data protection.

Asynchronous replication supports five IT functions:

  1. Business continuity
  2. Content distribution
  3. Data migrations
  4. Load balancing
  5. Testing and development

High-end storage systems such as EMC Corp.'s Symmetrix DMX series, Hitachi Data Systems (HDS) Corp.'s TagmaStore Universal Storage Platform (USP) and IBM Corp.'s System Storage DS8000 provide multiple ways to implement asynchronous replication. For example, IBM's System Storage DS8000 offers three asynchronous replication options--Global Copy, Global Mirror and z/OS Global Mirror--to satisfy different recovery point objectives (RPOs) and recovery time objectives (RTOs). Though each supports replication over any distance, Global Copy is intended for apps that have RPOs of hours, while Global Mirror is designed for applications with RPO requirements of seconds or minutes. z/OS Global Mirror is like Global Mirror, but is used only in IBM System z environments.

Key selection criteria

Here are some key areas to consider when comparing asynchronous replication software.

Delta change or write-order fidelity. Storage systems employ two methods to capture changes to block data: delta changes and write-order fidelity. Deltas are changes since the last replication interval and they capture only the changes at the time for which the replication is scheduled. Write-order fidelity captures every change and sends all of them during each replication interval.

Different model support. If you need to replicate from many-to-one, one-to-many or many-to-many configurations, confirm that the asynchronous replication works the same way on multiple models from the vendor.

Firmware upgrades. Firmware updates may disrupt asynchronous replication until all firmware is applied on all systems.

Read-write snapshots. Snapshots taken of volumes previously replicated or scheduled for replication by some storage systems are sometimes read-only. This is fine for recoveries and backup, but if you're planning on performing testing and development with these snapshots you need to determine if they support read-write capabilities.

Replication bandwidth management. You can manage the network bandwidth used by the asynchronous replication software by setting policies in the network, the storage system or both. Most storage systems leave most of the management to the network, but some storage systems let you prioritize replication traffic and fine-tune quality of service queues.

Recovery point objective (RPO) is the point in time to which data must be restored to satisfy application owners. For applications that require more specific recovery points, down to seconds, minutes or specific writes, you should select a high-end storage system that supports write-order fidelity. Users who can recover to an approximate point in time (give or take 15 minutes) will find snapshots taken at specific intervals adequate for their needs.

Recovery time objective (RTO) defines how soon businesses must recover before unacceptable risks to the business may occur. For applications that must be up in seconds or minutes, you should consider only high-end storage systems that support both synchronous and asynchronous replication software. For applications that have 60 minutes or more to recover, asynchronous replication on midrange storage systems should suffice.

In a Global Mirror configuration, there's one master system and one or more subordinate systems. The master coordinates the creation of a consistent set of volumes on the subordinate systems every few seconds by sending a "pause" command to all of the subordinates. Once all of the subordinates have responded, the master indicates that the writes can proceed. All writes received prior to the "pause" are considered part of the consistent set, while data received after the pause isn't.
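The pause-and-resume coordination described above can be sketched in a few lines of Python. This is a toy model only: the `Subordinate` class, its methods and `form_consistency_group` are hypothetical names made up for illustration, and nothing here reflects how IBM actually implements Global Mirror.

```python
from dataclasses import dataclass, field


@dataclass
class Subordinate:
    """Hypothetical subordinate system that buffers incoming writes."""
    name: str
    paused: bool = False
    pending: list = field(default_factory=list)  # writes not yet in any group
    held: list = field(default_factory=list)     # writes that arrive during a pause

    def write(self, data):
        # Writes arriving while paused are held back, so they fall
        # outside the consistent set currently being formed.
        (self.held if self.paused else self.pending).append(data)

    def pause(self):
        self.paused = True
        return True  # acknowledge the pause to the master

    def resume(self):
        # Everything received before the pause becomes the consistent set;
        # held writes become the start of the next one.
        group, self.pending = self.pending, self.held
        self.held = []
        self.paused = False
        return group


def form_consistency_group(subordinates):
    """Master side: pause every subordinate, wait for all acks, then resume."""
    acks = [s.pause() for s in subordinates]
    if not all(acks):
        raise RuntimeError("a subordinate did not acknowledge the pause")
    return {s.name: s.resume() for s in subordinates}


a, b = Subordinate("A"), Subordinate("B")
a.write("a1")
b.write("b1")
group = form_consistency_group([a, b])

# A write that lands during a pause is excluded from that consistent set.
a.write("a2")
a.pause()
a.write("a3")
second = a.resume()
```

In this sketch `group` holds the writes each subordinate received before the pause, and the late write `"a3"` is carried over to the following set rather than the one being closed.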

The major difference between the Global Copy and Global Mirror features is that Global Copy doesn't guarantee data consistency on the target system because it doesn't maintain the order of dependent writes as Global Mirror does.

With asynchronous replication, if the network link becomes congested or broken, replication stops until the network link is reestablished. Any queued writes beyond the first couple of minutes of the disruption aren't transmitted. Once the network connection returns to normal, the asynchronous replication program needs to synch the volumes on the source and target storage systems, which increases recovery times and the RPO.

Asynchronous replication transmissions spike during periods of peak application write I/O. To balance these peaks and valleys, HDS' USP Universal Replicator journals all writes on the source storage system and stores them in its disk cache; it then transmits the writes after the peak periods of application I/O have passed.

Replication intervals
Businesses that don't require write-order consistency and that can withstand lengthier RPOs (10 minutes to 15 minutes or longer) will find that the asynchronous replication software found on many midrange storage systems satisfies their less-stringent requirements. These systems don't track every write nor do they constantly send changes to the target storage array; they only replicate changed blocks at intervals defined by the user.

EMC's Clariion CX3 series typifies the way many midrange storage systems perform asynchronous replication. A point-in-time copy of the production data on the source is made and copied from the source to the target storage system. Once the target system has the point-in-time copy of the production data, the source storage system creates a delta set of all of the changes since the point-in-time copy was created. This delta set doesn't include every write or change, just the last set of changes prior to the snapshot. For example, if one block has changed 10 times since the last replicated set of production data, only the last change to that block is transmitted as part of the delta set.
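To make the delta-set idea concrete, here is a minimal Python sketch of how repeated writes to one block collapse into a single entry. The function name and the `(block, data)` tuple format are assumptions chosen for illustration, not anything taken from EMC's software.

```python
def build_delta_set(writes):
    """Collapse a write log into the final contents of each changed block."""
    delta = {}
    for block, data in writes:
        delta[block] = data  # a later write to the same block replaces the earlier one
    return delta


# Block 7 changes 10 times and block 3 changes once during the interval.
writes = [(7, f"v{i}") for i in range(1, 11)] + [(3, "x")]
delta = build_delta_set(writes)
```

A system that preserves write-order fidelity would instead ship all 11 entries in `writes`, in order; the delta set here carries only two blocks, which is why it costs less bandwidth but can't recover to a point between intervals.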

Before the delta set is applied on the target volumes, the target system first takes a snapshot of the target volumes. This ensures the target storage system has a recoverable image of the data should the replication fail before the delta set has been completely transmitted and applied. Once the delta set is sent and applied to the target volumes, that becomes the new primary recovery copy, with previous snapshots kept or discarded based on retention policies set by users. This replication process then repeats based upon the user-defined replication interval.
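The snapshot-before-apply step can be illustrated with a short Python sketch. It assumes the target rolls back to the snapshot when a transfer fails partway through; the article only says the snapshot provides a recoverable image, so the rollback behavior, the function name and the `fail_after` test hook are all hypothetical.

```python
def apply_delta_with_snapshot(target, delta, snapshots, fail_after=None):
    """Snapshot the target, apply the delta, and restore the snapshot on failure."""
    snapshot = dict(target)  # point-in-time image of the target volume
    snapshots.append(snapshot)
    try:
        for count, (block, data) in enumerate(delta.items()):
            if fail_after is not None and count >= fail_after:
                # Simulated link failure partway through the transfer.
                raise ConnectionError("replication link lost mid-transfer")
            target[block] = data
    except ConnectionError:
        # Assumed behavior: fall back to the last recoverable image.
        target.clear()
        target.update(snapshot)
        return False
    return True


target = {1: "a", 2: "b"}
snapshots = []
failed = apply_delta_with_snapshot(target, {1: "A", 2: "B"}, snapshots, fail_after=1)
after_failure = dict(target)
succeeded = apply_delta_with_snapshot(target, {1: "A", 3: "C"}, snapshots)
```

After the simulated failure the target still holds its earlier contents; after the second call it holds the updated blocks, and `snapshots` contains one image per replication attempt for a retention policy to keep or discard.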

Bill Snow, IT director for the construction firm Moss & Associates in Fort Lauderdale, FL, uses Compellent's Storage Center storage systems to replicate data. Compellent Storage Center employs a method similar to EMC's Clariion, but differs in that it creates a snapshot on the primary storage system and then replicates the snapshot. Snow retains these snapshots for various lengths of time and uses them in lieu of backups. He keeps hourly snapshots for 48 hours, daily snapshots for two weeks and a weekly snapshot for a month. Moss & Associates still uses IBM Tivoli Storage Manager (TSM) to create monthly backups to tape for long-term offsite storage but, says Snow, "I rely on snapshots created during the replication process to act as my front line for backup and recovery."
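A retention schedule like Snow's can be expressed as a small lookup table. The sketch below is illustrative Python, not Compellent configuration syntax; the names are hypothetical, and "a month" is approximated as 30 days.

```python
from datetime import datetime, timedelta

# Retention windows taken from the schedule described above.
RETENTION = {
    "hourly": timedelta(hours=48),
    "daily": timedelta(weeks=2),
    "weekly": timedelta(days=30),  # assumption: "a month" treated as 30 days
}


def expired(kind, taken_at, now):
    """Return True when a snapshot of the given kind is past its retention window."""
    return now - taken_at > RETENTION[kind]


now = datetime(2024, 1, 31, 12, 0)
old_hourly = expired("hourly", now - timedelta(hours=49), now)
fresh_hourly = expired("hourly", now - timedelta(hours=47), now)
fresh_daily = expired("daily", now - timedelta(days=13), now)
old_weekly = expired("weekly", now - timedelta(days=31), now)
```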

Other uses
While business continuity is usually the initial driver for the adoption of asynchronous replication, other IT functions such as content distribution, data migrations, load balancing, and testing and development usually aren't far behind.

Eric Midkiff, network administrator at Thomas Memorial Hospital in South Charleston, WV, stores his Picture Archiving and Communication System (PACS) data on two EqualLogic Inc. PS Series storage systems in separate data centers--one in the hospital and the other approximately a mile away--and replicates data between them. Replication was initially set up for data protection and rapid recovery of PACS images, but production servers at both sites now access and store PACS data, with data replicated between them every 15 minutes. "This allows us to distribute content and load balance between the two sites providing higher performance for our users in both sites," says Midkiff.

EMC's SAN Copy is the only storage system-based replication tool that allows users to migrate data from volumes on another vendor's storage arrays directly to a Clariion without first virtualizing the other vendor's storage array. SAN Copy avoids this virtualization step by allowing administrators to present LUNs on other vendors' storage systems directly to the Clariion. The caveats with this technique are that the Clariion storage system must be able to access the other storage system's LUN through the SAN, the LUN on the Clariion must be the same size or larger than the source LUN, and the source and destination LUNs must be offline during the data migration.

Replication caveats
As EMC's SAN Copy illustrates, there are caveats to using any storage system's asynchronous replication technology. There's considerable variation among products related to the degree of flexibility users have in their choice of storage systems, what replication configurations are supported and how well the storage systems manage the asynchronous replication process.

Companies with Hewlett-Packard (HP) Co.'s StorageWorks Continuous Access Enterprise Virtual Array (EVA) software can replicate data between any of HP's EVA models, in either direction from the high-end EVA8000 to the lower end EVA3000, as well as any model in between. The catch with the EVA Continuous Access software is that it doesn't replicate to other HP storage systems such as HP's StorageWorks Modular Smart Array (MSA) or XP Series. The MSA offers no support for asynchronous replication software, while the asynchronous replication software for the XP Series isn't compatible with the EVAs. Cost-conscious users will also need to pay a premium for disk drives on the EVA because it supports only Fibre Attached Technology Adapted (FATA) disk drives as opposed to the more economical SATA drives.

The lack of interoperability among different vendors' models, and even among a single vendor's models, is a clear downside when using asynchronous replication, although some vendors give administrators some flexibility in reusing existing capacity on competitors' storage systems. For example, admins may present LUNs on other vendors' storage systems to Compellent, HDS and Network Appliance (NetApp) Inc. controllers so they can use the capacity on these storage systems as part of their management scheme.

As users deploy storage system-based replication more extensively, they'll find it will become more difficult to migrate off those storage systems. Another potential problem with using asynchronous replication is that in write I/O-intensive environments there may be a performance hit on the application running on the storage system. The severity of the problem depends on the nature of the write I/Os. If the write I/O represents new data to the storage system, the impact is negligible. But when an application makes frequent changes to existing blocks, the storage system must first copy the existing data to a new block location so it can replicate the data at the next replication interval and then write the new data to the old block location. To address this delay, storage systems like those from NetApp eliminate the extra copy step by writing the new data to a new block and then updating the set of pointers the asynchronous replication software uses.
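The difference between the two write paths can be shown with a toy Python model that counts block-level operations. Both classes are hypothetical simplifications: the first assumes old data is copied aside only the first time a block is overwritten in an interval, and the second is a generic redirect-on-write scheme, not a description of NetApp's actual on-disk layout.

```python
class CopyOnWriteVolume:
    """Overwrites copy the old block aside before writing in place."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)  # live data, by block address
        self.preserved = {}         # old contents kept for the next replication
        self.io_ops = 0

    def write(self, addr, data):
        if addr in self.blocks and addr not in self.preserved:
            self.preserved[addr] = self.blocks[addr]  # extra copy step
            self.io_ops += 1
        self.blocks[addr] = data
        self.io_ops += 1


class RedirectOnWriteVolume:
    """Overwrites go to a new block; only a pointer changes."""

    def __init__(self, blocks):
        self.store = list(blocks.values())                      # physical blocks
        self.live = {addr: i for i, addr in enumerate(blocks)}  # current pointers
        self.frozen = dict(self.live)                           # pointers the replica reads
        self.io_ops = 0

    def write(self, addr, data):
        self.store.append(data)                # new data lands in a new block
        self.live[addr] = len(self.store) - 1  # pointer update, no data copy
        self.io_ops += 1


initial = {0: "a", 1: "b"}
cow, row = CopyOnWriteVolume(initial), RedirectOnWriteVolume(initial)
for vol in (cow, row):
    vol.write(0, "A")  # overwrite existing block
    vol.write(1, "B")  # overwrite existing block
    vol.write(2, "C")  # brand-new block
```

In this run the copy-on-write volume performs five operations against three for the redirect-on-write volume, and the new block costs the same on both, which matches the point that only changes to existing blocks carry the penalty.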

Another feature to investigate is how, or if, the bandwidth between the two storage systems is managed. Most storage systems leave it to the network to manage the bandwidth and quality of service (QoS) on the network pipe connecting the storage systems, while the storage system focuses on filling the pipe with data.

EqualLogic's PS Series, for example, relies primarily on network switches and routers to manage QoS. To fill the network pipe, the PS Series storage system manipulates TCP/IP packets to create larger window sizes so it can put more data into the pipe. Compellent's Storage Center combines network-based and storage system-based management. Rather than leaving everything to the network, which can manage only bandwidth, Compellent's Storage Center permits administrators to prioritize which specific volumes are sent. If multiple volumes are being replicated and bandwidth availability is at a premium, the volumes with the highest priorities are replicated first to increase the likelihood they'll transmit successfully before lower priority volumes are sent.
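Volume prioritization of this kind can be sketched as a simple ordering problem. The Python below is a hypothetical illustration: the per-interval bandwidth budget, the tuple layout and the rule that nothing is sent once a higher-priority volume has been deferred are all assumptions, not documented Compellent behavior.

```python
def schedule_replication(volumes, budget_mb):
    """Send volumes in strict priority order until the budget runs out.

    volumes: iterable of (name, priority, size_mb); a higher priority goes first.
    """
    ordered = sorted(volumes, key=lambda v: v[1], reverse=True)
    sent, deferred = [], []
    for name, _priority, size_mb in ordered:
        # Once one volume is deferred, every lower-priority volume waits too.
        if not deferred and size_mb <= budget_mb:
            budget_mb -= size_mb
            sent.append(name)
        else:
            deferred.append(name)
    return sent, deferred


volumes = [("logs", 1, 400), ("erp-db", 9, 600), ("mail", 5, 500)]
sent, deferred = schedule_replication(volumes, budget_mb=1000)
```

With a 1,000 MB budget, only the highest-priority volume fits in this interval; the other two wait for the next one, in priority order.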

Asynchronous replication software on midrange systems will meet the requirements of most average business applications. You can expect relatively good success on storage systems with low to average write I/O loads and applications that have RPOs that can exceed 30 minutes. Companies running applications with a large number of write I/Os and RPOs approaching zero should consider only high-end storage systems from EMC, HDS and IBM; these systems support write-order fidelity, and their vendors can deploy engineers better equipped to correct issues in mission-critical environments.
