Feature

Recovering from a data center disaster


This article can also be found in the Premium Editorial Download "Storage magazine: How storage managers can survive e-mail archiving."


Your data center has just suffered a catastrophic disaster, and your worst nightmare has begun. As disaster recovery coordinator, you're responsible for orchestrating the recovery of applications that were labeled mission-critical before the disaster occurred. You had enough forethought to develop a business continuance plan and store it away from the disaster site, and you now have it firmly in hand. You make contact with the application owners, technical support staff and all other essential staff who are on their way to the recovery site. Now what?


Remote tape ensures recoverability

That depends on whether you are relying on tape or disk for recovery. The two solutions fork at the point where the tape user picks up the phone to call the off-site tape vendor while the disk user is already speeding to the recovery site, and from there their recovery experiences become inherently different. If you read my personal account of a tape-based recovery I participated in during the Sept. 11 disaster (see "Recovering from the WTC: a personal account" in the June 2002 issue of Storage), you learned many of the downsides of using tape for recovering from a full-scale disaster.

However, placing your archiving hardware at a safe distance from the primary copies of your data addresses many of these downsides from Day 1. By observing the tape-based solution in "Remote tape ensures recoverability" (this page) and assuming the common 7 p.m. to 7 a.m. backup window was in use, we can see that at the start of the Sept. 11 disaster (approximately 8:45 a.m.), the latest completed backup tapes would have been off-site. Conversely, in a local tape-based solution, these same tapes would likely still have been in the same location as your primary data center. Once you leave room for late-running backups, as well as the collection and organizing of the tapes, the off-site tape vendor's pickup would probably have been scheduled for some time after 8:45 a.m. So a complete disaster would have destroyed all copies of the most recently updated data.
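
The timing argument above can be checked with a few lines of arithmetic. The following Python sketch is only an illustration: the 7 a.m. close of the backup window and the 8:45 a.m. start of the disaster come from the scenario described here, while the three-hour lag before the off-site vendor's pickup, the 24-hour backup cycle, and all function and constant names are assumptions made for the example.

```python
from datetime import datetime, timedelta

# From the article's scenario: the backup window closes at 7 a.m. and the
# disaster begins at approximately 8:45 a.m.
BACKUP_WINDOW_END = datetime(2001, 9, 11, 7, 0)
DISASTER_START = datetime(2001, 9, 11, 8, 45)

# ASSUMPTION (not from the article): with local tape, the vendor picks the
# tapes up three hours after the window closes, leaving room for late-running
# backups and for collecting and organizing the tapes.
LOCAL_OFFSITE_DELAY = timedelta(hours=3)

# With remote tape, the backup is written at the remote location, so the
# tapes are off-site as soon as the backup completes.
REMOTE_OFFSITE_DELAY = timedelta(0)

# ASSUMPTION: one backup cycle per day.
BACKUP_INTERVAL = timedelta(hours=24)


def tapes_offsite(backup_end, offsite_delay, disaster_start):
    """True if this backup's tapes had left the primary site before the disaster."""
    return backup_end + offsite_delay <= disaster_start


def newest_surviving_backup(backup_end, offsite_delay, disaster_start):
    """Step back one backup cycle at a time until a set is found that was off-site."""
    candidate = backup_end
    while not tapes_offsite(candidate, offsite_delay, disaster_start):
        candidate -= BACKUP_INTERVAL
    return candidate


if __name__ == "__main__":
    for label, delay in (("remote tape", REMOTE_OFFSITE_DELAY),
                         ("local tape", LOCAL_OFFSITE_DELAY)):
        survivor = newest_surviving_backup(BACKUP_WINDOW_END, delay, DISASTER_START)
        exposure = DISASTER_START - survivor
        print(f"{label}: newest surviving backup closed {survivor}, "
              f"time since that backup: {exposure}")
```

Under these assumptions, the remote tape site recovers from the backup that closed at 7 a.m. the same morning, while the local tape site has to fall back to the previous day's backup, because the newest tapes were still waiting for pickup when the disaster began.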

Not only can this location problem affect the recovery point of your applications, it could also affect your overall recovery time objective if the tapes weren't destroyed in the disaster but were somehow inaccessible to your recovery staff, similar to what the brokerage firm in my personal account experienced. It's unlikely that the terrorists had this window of vulnerability in mind, but the timing of the strikes hindered our ability to recover data. And if we still think that it's safe to play the odds on scenarios like this, then we really haven't learned very much at all with regard to DR.

One huge milestone in recovering applications using a tape-based solution is the point at which you have every tape library designated for recovery populated with the tapes that the backup server will request while directing the recovery of your production applications. Admittedly, there's quite a bit of preparatory work needed to get there: You have to identify the most critical applications and their associated backup tapes, and then load and inventory these tapes in the library.

But once you complete the inventory of the tape libraries, you will have arrived at the point where a disk-based solution would have positioned you from the start of the disaster. At this point, both solutions would have the necessary data within their enclosures, but the amount of time needed to access and transfer this data to your recovered application servers varies greatly.

If you compare the access times of the tape and disk approaches, you'll see that there's really no comparison at all. Even the fastest, most expensive mid-cartridge load tape drive cannot match a low-end, Fibre Channel (FC)-attached ATA disk array. The brokerage firm in my Sept. 11 experience was using DLT drives for its backup and recovery solution. These drives were engineered for raw capacity--not speed--in their load and unload operations. So each tape load and unload during the recovery was slow, and these operations proceeded in single file because they were essentially hardware interrupts of the same priority. At times, this caused the robotic arm in the library to be overrun with SCSI commands, and it eventually had to be reset. Unfortunately, this phenomenon was not recognized until recovery jobs started failing.

This was first published in August 2003
