Recovering from a data center disaster

Your worst nightmare -- of your data center suffering a catastrophic disaster -- has come true. Now what?

Your data center has just suffered a catastrophic disaster, and your worst nightmare has begun. As disaster recovery coordinator, you're responsible for orchestrating the recovery of applications that were labeled mission-critical before the disaster occurred. You had enough forethought to develop and store your business continuance plan away from the place of disaster, and you now have it firmly in hand. You make contact with the application owners, technical support staff and all other essential staff that are on their way to the recovery site. Now what?

[Figure: Remote tape ensures recoverability]

That depends on whether you are relying on tape or disk for recovery. The point at which the tape user picks up the phone to call their off-site tape vendor and the point at which the disk user is speeding swiftly to the recovery site is where the two solutions fork--and their recovery experiences become inherently different. If you read my personal account of a tape-based recovery I participated in during the Sept. 11 disaster (see "Recovering from the WTC: a personal account" in the June 2002 issue of Storage), you learned many of the downsides of using tape for recovering from a full-scale disaster.

However, placing your archiving hardware at a safe distance from the primary copies of your data addresses many of these downsides from Day 1. Consider the tape-based solution in "Remote tape ensures recoverability" and assume the common 7 p.m. to 7 a.m. backup window was in use. At the start of the Sept. 11 disaster (approximately 8:45 a.m.), the latest completed backup tapes would already have been off-site. Conversely, in a local tape-based solution, these same tapes would likely still have been in the same location as your primary data center: once you leave room for late-running backups and for collecting and organizing the tapes, your off-site tape vendor's pickup would probably have been scheduled for after 8:45 a.m. So a complete disaster would have destroyed all copies of the most recently updated data.
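The timing argument above can be checked with a few lines of Python. This is only a sketch: the function name is mine, and the 10 a.m. vendor pickup is an assumed value, since all we know is that pickup would probably have come after 8:45 a.m.

```python
from datetime import datetime, time


def backup_offsite_at(disaster, window_end, pickup, remote_library):
    """Return True if the backup cycle that ended on the morning of the
    disaster was already off-site when the disaster struck."""
    backup_done = datetime.combine(disaster.date(), window_end)
    if disaster < backup_done:
        # The disaster hit before the backup window closed,
        # so this cycle never completed.
        return False
    if remote_library:
        # Tapes were written in a library that is already off-site.
        return True
    # Local library: the tapes leave the building only at vendor pickup.
    return disaster >= datetime.combine(disaster.date(), pickup)


disaster = datetime(2001, 9, 11, 8, 45)
window_end = time(7, 0)        # 7 p.m. to 7 a.m. window, from the text
assumed_pickup = time(10, 0)   # hypothetical pickup time

print(backup_offsite_at(disaster, window_end, assumed_pickup, remote_library=True))
print(backup_offsite_at(disaster, window_end, assumed_pickup, remote_library=False))
```

With these inputs the remote library has the latest backup safely off-site, while the local library does not.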

Not only can this location problem affect the recovery point of your applications, it could also affect your overall recovery time objective if the tapes weren't destroyed in the disaster but were somehow inaccessible to your recovery staff, similar to what the brokerage firm in my personal account experienced. It's not likely that the terrorists saw this vulnerable gap of time as an additional side effect, but the timing of the strikes hindered our ability to recover data. And if we still think that it's safe to play the odds on scenarios like this, then we really haven't learned very much at all with regard to DR.

One huge milestone in recovering applications using a tape-based solution is the point at which you have every tape library designated for recovery populated with the tapes that the backup server will request while directing the recovery of your production applications. Admittedly, there's quite a bit of preparatory work that's necessary to identify the most critical applications and their associated backup tapes, and then to load and inventory these tapes in the library.

But once you complete the inventory of the tape libraries, you will have arrived at the point where a disk-based solution would have positioned you from the start of the disaster. At this point, both solutions would have the necessary data within their enclosures, but the amount of time needed to access and transfer this data to your recovered application servers varies greatly.

If you compare the access times of the tape and disk approaches, you'll see that there's really no comparison at all. Even the fastest, most expensive mid-cartridge load tape drive cannot match a low-end, Fibre Channel (FC)-attached ATA disk array. The brokerage firm in my Sept. 11 experience was using DLT drives for its backup and recovery solution. These drives were engineered for raw capacity, not for speed in their load and unload operations. So, during the recovery, each tape load and unload took a long time, and the operations proceeded in single file because they were essentially hardware interrupts of the same priority. At times, this caused the robotic arm in the library to be overrun with SCSI commands, and it eventually had to be reset. Unfortunately, this phenomenon was not recognized until recovery jobs started failing.

Making the best of tape
Two variables will affect reaching a "ready to recover" state with tape. First, how long are tapes kept at the recovery site before being moved to an off-site vendor? Second, were your backup servers and their indexes located local to your primary data center and thus destroyed by the disaster, or were they located at your recovery site?

[Figure: Disk-only approach yields fastest time to recovery]

Because technology allows us to access a library's robotic arm from across a wide distance, a storage area network (SAN) designer has the option of locating the backup server at either location. But for security reasons, an organization may not want a server that has access to every other client network to remain in a less-secure location. If you can ease the concerns of your company's security organization by enhancing security at the remote location and in your network, those paths are worth exploring.

One huge benefit you would realize is not having to rebuild your backup server from scratch because it was at a safe distance at the time of the disaster. This saves you time by not requiring you to restore your backup server's indexes, as well as keeping those indexes available to you for report queries. With the indexes available, you can execute prewritten scripts to query the backup server for the tape volumes that will be necessary to restore a backup client to a particular recovery time and then load those tapes as they arrive from the off-site vendor. However, this benefit can be fully exploited only when you have a priority list of application servers to restore.
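A prewritten query of this kind might look something like the sketch below. Real backup products each have their own index and reporting commands, so this stand-in uses a hypothetical in-memory catalog. The `BackupRecord` fields, client names and tape labels are all invented for illustration. For each client on the priority list, it picks the latest full backup before the recovery time plus every later incremental, and returns the tape labels to load.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class BackupRecord:
    client: str         # backup client (application server)
    level: str          # "full" or "incr"
    finished: datetime  # when the backup completed
    volume: str         # tape label holding the backup


def volumes_for_restore(catalog, client, recover_to):
    """Tape labels needed to restore `client` to `recover_to`."""
    usable = sorted(
        (r for r in catalog if r.client == client and r.finished <= recover_to),
        key=lambda r: r.finished,
    )
    fulls = [r for r in usable if r.level == "full"]
    if not fulls:
        return []  # nothing to restore from
    last_full = fulls[-1]
    chain = [last_full] + [
        r for r in usable
        if r.level == "incr" and r.finished > last_full.finished
    ]
    labels = []
    for record in chain:
        if record.volume not in labels:  # list each tape once, in restore order
            labels.append(record.volume)
    return labels


def load_plan(catalog, priority, recover_to):
    """Tape lists per client, in the order the clients should be restored."""
    return [(c, volumes_for_restore(catalog, c, recover_to)) for c in priority]


catalog = [
    BackupRecord("trading-db", "full", datetime(2001, 9, 8, 23, 0), "T001"),
    BackupRecord("trading-db", "incr", datetime(2001, 9, 9, 23, 0), "T014"),
    BackupRecord("trading-db", "incr", datetime(2001, 9, 10, 23, 0), "T014"),
    BackupRecord("mail", "full", datetime(2001, 9, 8, 23, 30), "T002"),
    BackupRecord("mail", "incr", datetime(2001, 9, 10, 23, 30), "T014"),
]
for client, tapes in load_plan(catalog, ["trading-db", "mail"],
                               datetime(2001, 9, 11, 7, 0)):
    print(client, tapes)
```

The priority list drives the output order, which is why the query is only as useful as the list behind it.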

The costs associated with a tape-based DR solution range from the procurement of tapes and tape drives to the management of the physical tape for the rest of its life. Tape costs will vary depending on the technology used for backups. The choice of DLT, LTO, AIT or a mid-cartridge load technology will depend on the number of backup clients that must be assigned to each tape library and the data characteristics of the clients.

For example, if you have a high number of clients with lots of mount points assigned to a tape library for backup, recovering that same high number of mount points implies a larger number of tape mounts, because mount points are assigned to specific tape drives during backups. A complete DR therefore slows restores on all but the tape drives with the fastest load and unload times. And because the capacity of the tapes associated with those fast-loading drives is often much smaller than that of tapes built for capacity, more of them are necessary to make up the capacity given up for mount speed. This adds cost to the tape-based solution by requiring more tapes, floor space and staff to administer a large-scale DR solution.
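To see the trade-off in rough numbers, here is a back-of-the-envelope sketch. Every figure in it (mount counts, load and unload times, cartridge capacities) is a made-up placeholder, not a measurement of any real drive. The model also assumes mounts are spread evenly across drives and ignores data transfer time.

```python
import math


def mount_overhead_minutes(tape_mounts, load_unload_seconds, drives):
    """Minutes spent purely on loading and unloading tapes,
    assuming mounts are spread evenly across the drives."""
    if drives < 1:
        raise ValueError("need at least one drive")
    mounts_per_drive = math.ceil(tape_mounts / drives)
    return mounts_per_drive * load_unload_seconds / 60


def tapes_needed(total_gb, tape_capacity_gb):
    """Cartridges required to hold the backup set."""
    return math.ceil(total_gb / tape_capacity_gb)


# Hypothetical comparison: a capacity-oriented drive vs. a fast-loading one.
print(mount_overhead_minutes(400, 90, 8), tapes_needed(4000, 40))
print(mount_overhead_minutes(400, 20, 8), tapes_needed(4000, 20))
```

With these placeholder inputs, the fast-loading drive cuts mount overhead sharply but needs twice as many cartridges, which is the cost pressure described above.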

The disk alternative
I have yet to meet an IT manager who wouldn't like to remove or reduce the financial burdens associated with tape media management. More staff, more off-site vendor contracts and more tapes are some of the larger itemized expenses coming out of your media management budget. Tape management requires staff to collect the most recent backup tapes from the library, possibly enter them into a separate tracking system, and coordinate the pickup and drop-off of tapes needed for recovery and re-initialization.

In addition to these tasks, this group will also verify the recoverability of tapes that were vaulted some time ago. Although most organizations outside of the government aren't diligent about verifying recoverability, it's still a best practice.

An increasingly interesting alternative to tape that alleviates many problems is one of the many disk-based DR solutions on the market. With this solution, disk can be used to stage data at a remote site.

That's exactly what one of my clients has done. They already owned space tens of kilometers away from the primary data center, and they decided to mirror their data to this location (see "Disk-only approach yields fastest time to recovery"). Every night before the backup, the mirrors were split and the secondary copy mounted on a backup server local to the remote tape library. Then, the backup data was transferred to the library, and the mirrors merged when the backup was complete. This solution put the backup data off-site immediately, and it removed the application server from the process of moving data to tape by splitting the mirrors and mounting them on a different server.
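The nightly cycle my client ran can be outlined in code. This is a sketch of the ordering only: the five callables are hypothetical stand-ins for whatever array and backup commands your environment provides, and the explicit unmount step before the merge is my assumption rather than something the client's procedure spelled out.

```python
def run_split_mirror_backup(split, mount, backup, unmount, merge):
    """Run one nightly cycle: split the mirror, mount the secondary copy
    on the remote backup server, back it up to the remote library, then
    re-merge. The mirror is merged again even if a middle step fails."""
    split()                 # detach the secondary copy from the primary
    try:
        mount()             # mount the copy on the remote backup server
        try:
            backup()        # move the data to the remote tape library
        finally:
            unmount()       # release the copy before resynchronizing
    finally:
        merge()             # resynchronize the mirror with the primary


# Usage with placeholder steps that only print what they would do.
def step(name):
    return lambda: print(name)


run_split_mirror_backup(step("split mirrors"), step("mount secondary copy"),
                        step("back up to remote library"),
                        step("unmount secondary copy"), step("merge mirrors"))
```

Wrapping the merge in `finally` reflects one design choice: a failed backup should not leave the remote copy split and falling further behind the primary.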

At first glance, this approach may not look like a disk-based solution because the data is still being driven to tape. However, mirroring data across distances makes this a disk-based DR solution with added benefits. Beyond the benefits already mentioned, there's one less copy operation necessary for archiving. The mirrored data is already located off-site in a second or third disk array for protection, and because the mirrors can be split and mounted on a separate server, backups run without involving the application server, which improves application availability.

Should the primary data center be destroyed, mirrored and recently synchronized copies of each file system or volume will be resident on Switch B and discovered by leased servers provisioned at the recovery site some time after the disaster. By locating the backup server at the remote site and provisioning application servers, as well as having a plan in place for changing your SAN's zoning and perhaps LUN configurations to include the new host bus adapters (HBAs) in the provisioned servers, you have done much to ensure the expedited recovery of your applications. Furthermore, with your boot disks being mirrored across the extended SAN, you'll be in a better position to recover: connect a server to Switch B, install the necessary drivers, change your zoning configuration and then boot the server so that it assumes the identity of the destroyed application server at the primary data center. I can't think of a better way to perform a bare-metal restore of an OS.

Because this DR solution still uses tape, the media management issues surrounding tape still exist. But by combining the mirroring functionality of your disk array with the ability to extend your SAN over distances, you take the off-site tape vendor out of the critical path of a DR exercise, because you no longer need to wait for the vendor to retrieve and deliver the latest full backup tapes. Instead, your recovery team can start connecting servers and discovering volumes much faster than it could recover complete volumes from full backup tapes, leaving tape recoveries for incremental restores only. And guess what? Those tapes are already loaded in the library. Thus, you will only need the full backup tapes being retrieved by your off-site vendor for file systems that were corrupted when their mirrored partners went off the air. A side benefit of having your tape library located at a remote site is that you may be able to save budget dollars by refining your off-site pickup schedule, because your backup tapes are already off-site.
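The triage this implies is simple enough to write down. In the sketch below, the function name and file system names are hypothetical, and how you determine that a mirror copy survived intact is left to your array's tools. File systems with an intact mirror are recovered from disk plus the incrementals already in the library, and only the rest wait on full tapes from the vendor.

```python
def plan_recovery(mirror_intact):
    """Split file systems into those recoverable from the remote mirror
    (plus incrementals already in the library) and those that must wait
    for full backup tapes from the off-site vendor.

    `mirror_intact` maps a file system name to True if its remote
    mirror copy is usable, False if it was corrupted."""
    from_mirror, from_full_tapes = [], []
    for fs, intact in mirror_intact.items():
        if intact:
            from_mirror.append(fs)
        else:
            from_full_tapes.append(fs)
    return from_mirror, from_full_tapes


mirror_ok, need_vendor = plan_recovery(
    {"/trading": True, "/mail": False, "/home": True}
)
print("recover from mirror:", mirror_ok)
print("wait for full tapes:", need_vendor)
```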

Another disk-based DR solution involves using a disk volume as the target of the backup and recovery application. In this approach, the scalability of capacity and performance is essential. On the front end, you can configure the backup server to recognize the storage nodes or media servers as data movers in the transfer of application server data to disk. As for resource management, you should approach storage provisioning the same way as any other application server. After all, backup and recovery in its simplest form is just another application.

Because the storage nodes and accompanying SAN basically replace your tape library and perhaps the duties of your tape management staff, you can assume that there are administrative burdens associated with this solution. These burdens mostly mirror those associated with managing your primary and/or secondary disk arrays. The problem is that if this solution is going to make any kind of economic sense, the disk array chosen for your DR solution will probably be less robust than the one chosen for your primary storage, which suggests different management tools and more administrative staff. To increase your ROI in a disk-based DR solution, choose storage management software that provides a management interface into your disk arrays before implementing your solution.

An enterprise-class tape library at the end of an extended SAN is a tape vendor's best chance at keeping real estate in enterprise data centers for DR purposes. With the right distance between the primary data center and the remote tape libraries and a less-than-aggressive recovery objective, tape libraries still make economic sense for business applications. However, the more servers, applications and file systems that must be recovered with an aggressive recovery objective, the less likely a tape-based solution will satisfy your SLAs.
