Data replication protects your data during a disaster and makes backups a snap, but with so many vendors in the space, and a variety of architectural approaches to the problem, how do you determine which data replication strategy is right for you?
Christopher Poelker, co-author of Storage Area Networks for Dummies, has assembled the following tips and best practices to help your team devise a data replication architecture that reduces downtime, saves money and minimizes the drain on your resources.
Checklist: Ten steps to data replication
1. Determine the amount of data that needs to be copied on a daily basis:
The amount of data that needs to be copied will determine the bandwidth of the network required to move it. This is the physics of remote replication and usually the most expensive part. It is also the part that people don't really think about until they find out how much bandwidth they really need. Finding the amount of data that needs to be copied requires understanding how much data is written on a daily basis. A simple way to find this out is to use your backups as a guide. During a peak workload cycle, do a full backup on Monday, and then do a differential on Friday (or daily incrementals). Then calculate the differences. Although this is not a foolproof method, it should give you a general idea of how much data changes during the week.
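The backup arithmetic above can be sketched in a few lines of Python. This is a rough illustration only: the function names and the sample sizes are made up for the example, and a differential counts each changed block once, so it will understate data that is rewritten several times during the week.

```python
def daily_change_from_differential(differential_gb, days_since_full):
    """Average GB changed per day, given one differential backup
    taken some number of days after the full backup."""
    if days_since_full <= 0:
        raise ValueError("days_since_full must be positive")
    return differential_gb / days_since_full


def daily_change_from_incrementals(incremental_gb):
    """Return (average, peak) GB changed per day from a list of
    daily incremental backup sizes."""
    if not incremental_gb:
        raise ValueError("need at least one incremental backup size")
    average = sum(incremental_gb) / len(incremental_gb)
    return average, max(incremental_gb)


if __name__ == "__main__":
    # Hypothetical numbers: full backup Monday, 120 GB differential on Friday.
    print(daily_change_from_differential(120, 4))
    # Hypothetical daily incrementals, Tuesday through Friday.
    print(daily_change_from_incrementals([25, 40, 30, 25]))
```

Size the link for the peak day rather than the average if the copy has to keep up during your busiest cycle.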
2. Calculate available network bandwidth between locations:
If you only have a dial-up connection between sites, you may as well back up the Chevy truck and start loading tapes to be shipped to your disaster site. As a good rule of thumb, you will need about 10 megabits per second (Mbit/s) of bandwidth for each megabyte (MB) of data you need to copy per second. As an example, a T3 link (about 45 Mbit/s) can handle almost 5 MB of data per second.
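Here is a minimal sketch of that rule of thumb in Python. The 10-to-1 ratio comes from the tip above; the function names, the decimal GB-to-MB conversion and the 30 GB sample figure are assumptions made for illustration.

```python
# Rule of thumb from the article: about 10 Mbit/s of link per MB/s of data.
MBIT_PER_MBYTE_RULE = 10

# Nominal T3 (DS3) line rate in Mbit/s.
T3_MBPS = 44.736


def required_link_mbps(mbytes_per_sec):
    """Link bandwidth (Mbit/s) needed to copy the given MB/s."""
    return mbytes_per_sec * MBIT_PER_MBYTE_RULE


def usable_mbytes_per_sec(link_mbps):
    """Data rate (MB/s) a link of the given size can carry."""
    return link_mbps / MBIT_PER_MBYTE_RULE


def hours_to_copy(data_gb, link_mbps):
    """Hours needed to push data_gb across the link (1 GB = 1000 MB)."""
    seconds = data_gb * 1000 / usable_mbytes_per_sec(link_mbps)
    return seconds / 3600


if __name__ == "__main__":
    print(usable_mbytes_per_sec(T3_MBPS))   # a bit under 5 MB/s
    print(required_link_mbps(5))            # link needed for 5 MB/s
    print(hours_to_copy(30, T3_MBPS))       # hypothetical 30 GB daily change
```

Feed in the daily change estimate from step 1 to see whether your current link can finish the copy inside the window you have.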
3. Measure the distance between locations:
The distance will determine what kind of remote copy solution you can use: synchronous or asynchronous. Under sync replication, an I/O is not complete until it is written to both sides. This is a good thing, because your transactions stay consistent. Every write to the primary side is written "in-order" to the remote side before the application sees an "I/O complete" message. The problem here is that the Fibre Channel protocol requires four round trips to complete every I/O under sync replication. Even using dark fiber between sites, the speed of light becomes your limiting factor: because of the four round trips, you lose about a millisecond for every 25 miles. Sync is limited in distance to about 100 kilometers (roughly 62 miles). After that, application performance goes in the toilet. Async can go around the planet. So the farther you go, the more you need async remote copy.
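The distance figures above translate into a quick back-of-the-envelope check. The sketch below simply applies the two rules of thumb from this step (about 1 ms lost per 25 miles, and a sync limit of about 100 km); the function names are hypothetical, and real latency also depends on the equipment in the path.

```python
MILES_PER_MS = 25.0        # article's rule: ~1 ms lost per 25 miles
KM_PER_MILE = 1.609344
SYNC_LIMIT_KM = 100.0      # article's rough distance limit for sync


def added_write_latency_ms(distance_km):
    """Approximate latency added to each synchronous write."""
    distance_miles = distance_km / KM_PER_MILE
    return distance_miles / MILES_PER_MS


def sync_is_feasible(distance_km):
    """True if the sites are within the rough sync distance limit."""
    return distance_km <= SYNC_LIMIT_KM


if __name__ == "__main__":
    for km in (10, 100, 500):
        mode = "sync possible" if sync_is_feasible(km) else "async needed"
        print(f"{km} km: ~{added_write_latency_ms(km):.2f} ms per write, {mode}")
```

At the 100 km limit this works out to roughly 2.5 ms added to every write, which is why applications with heavy write loads feel it first.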
4. Consider the type of operating systems and number of servers involved:
Software-based replication products work great. The problem arises when you have hundreds of servers to copy data from. Buying a software license for 200 servers at the primary location and another 200 licenses for the servers that need to be at the remote site can get very expensive. Also, I don't know of a software package yet that can be used with every operating system. If you have AIX, Solaris, NetWare, NT, Windows 2000 and VMS, you may need several separate software solutions. For a homogeneous NT or Unix environment, though, software works great and can save you money.
5. Take clustering into account:
Most cluster solutions require real-time connectivity for heartbeat and locking for quorum resources. If you use clustering software like MSCS and want to stretch the cluster between locations so that all your applications transparently fail over, you will need to be within sync replication distances. Also, for high-performance databases like Oracle, using the application itself as the replication engine may allow you more granularity of control. You can replicate just the control files and transaction logs, then roll forward or back non-committed writes (or writes in transit) at the remote site during a disaster. Log-only replication allows lower-bandwidth links to be used between sites, but requires at least one full database instance at the remote site. The initial copy can be done by shipping tapes and restoring them at the DR site.
6. Determine availability of storage, servers and floor space at the remote site:
If you have your own data center for your remote site, you're fine. If you need to lease space from a provider, you want to make sure your solution is as compact as possible. Server and storage consolidation must be considered prior to introducing hosted disaster recovery solutions. Hey, when you're paying by the foot you want to have very small feet!
7. Calculate your available budget:
This is a no-brainer. Many companies, when faced with the real world costs of disaster recovery, tend to get shell-shocked. Consider the costs:
- Floor space
- Servers for the recovery site
- Staff for the recovery site
- Storage hardware and licenses
- Software licenses
- Services to implement the solution
- Services to determine what needs to be copied, and why
- Network links (this is usually the most expensive part)
- Network-based SAN extension gear
The costs can add up quickly. This sometimes makes the CTAM method look like a wonderful idea. (CTAM = Chevy Truck Access Method. Dump your backup tapes in the back of a truck and drive your data to the remote site.)
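To see why CTAM keeps looking attractive, you can work out the effective data rate of a truckload of tapes. This is only a sketch: the function name, tape count, tape capacity and trip time are all hypothetical, and it ignores the hours spent writing the tapes and restoring them at the other end.

```python
def ctam_mbytes_per_sec(tape_count, tape_capacity_gb, trip_hours):
    """Effective MB/s of a truck carrying tapes (1 GB = 1000 MB)."""
    if trip_hours <= 0:
        raise ValueError("trip_hours must be positive")
    total_mb = tape_count * tape_capacity_gb * 1000
    return total_mb / (trip_hours * 3600)


if __name__ == "__main__":
    # Hypothetical load: 100 tapes of 200 GB each, four-hour drive.
    rate = ctam_mbytes_per_sec(100, 200, 4)
    print(f"Truck: ~{rate:.0f} MB/s")
    # Link needed to match it, using the 10 Mbit/s per MB/s rule from step 2.
    print(f"Equivalent link: ~{rate * 10:.0f} Mbit/s")
```

The catch is that the data at the remote site is only as fresh as the last truck, so the bandwidth is cheap but the recovery point is not.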
8. Consider fabric-based and controller-based replication:
Intelligent fabric and storage controller solutions allow the same replication technology to be used across heterogeneous pools of storage. They also allow the use of different tiers of storage for the replication process. The ability to move data from expensive production storage to cheap serial ATA-based modular storage will allow you to save money at the DR site.
Although not yet as common as storage-based or host-based replication, using intelligent switches or appliances at the fabric level is making inroads in many datacenters. There are solutions available from all the major switch vendors, and many appliance vendors. I would recommend bringing each solution into your lab for further testing to see which one may fit your requirements. You can use the RFP process to get a better understanding of how each solution works, and what the costs are.
9. Weigh outsourcing:
There are plenty of data storage companies out there that let you connect to their storage over a fast connection and leave it to them to keep the data safe. You can outsource the entire process, let your outsourcing supplier worry about which technology to use, and have them create the required DR process and procedure documentation.
10. Determine how much downtime you can afford:
Downtime is the actual amount of time your computers are not operational due to any type of planned or unplanned outage. Most companies' computers need to run 24 hours a day, seven days a week (24x7) to keep up with the new global economy. What would happen if your company had a catastrophic failure, but your competitor was still operational?
Consider what the cost of downtime would be for your organization. If your server went down for an hour, how much business would be lost? If your server went down for the whole day or, even worse, your building caught fire and you lost your whole database, how much business would be lost, and how long would it take you to bring everything back up again?
Take a look at the figure below for an indication of what one hour of downtime costs by industry. This dollar amount should determine how much your company should spend on a disaster recovery (DR) solution.
Figure: Cost of one hour of downtime per industry
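Once you have an hourly figure for your own business, the comparison with DR spending is simple arithmetic. The sketch below assumes you can estimate an hourly downtime cost and the outage hours you expect per year; the function names and dollar amounts are placeholders, not industry data.

```python
def downtime_cost(hourly_cost, outage_hours):
    """Business lost during an outage of the given length."""
    return hourly_cost * outage_hours


def dr_spend_is_covered(annual_dr_cost, hourly_cost, expected_outage_hours):
    """True if the expected yearly downtime loss is at least the DR spend."""
    return downtime_cost(hourly_cost, expected_outage_hours) >= annual_dr_cost


if __name__ == "__main__":
    # Placeholder figures: $50,000 per hour of downtime, 8-hour outage.
    print(downtime_cost(50_000, 8))
    # Placeholder DR budget of $300,000 per year against that same outage.
    print(dr_spend_is_covered(300_000, 50_000, 8))
```

If the expected loss is far below what a solution costs, a cheaper approach with a longer recovery time may be the better fit.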
About this author: Christopher Poelker, co-author of Storage Area Networks for Dummies, is a storage architect at Hitachi Data Systems. Prior to Hitachi, Chris was a lead storage architect/senior systems architect for Compaq Computer, Inc., in New York. While at Compaq, Chris built the sales/service engagement model for Compaq StorageWorks, and trained most of the company's VARs, channels and Compaq ES/PS contacts on StorageWorks. Chris's certifications include MCSE, MCT (Microsoft Trainer), MASE (Compaq Master ASE Storage Architect) and A+ (PC Technician).