Recovering from the WTC: a personal account


This article can also be found in Storage magazine: "Is storage virtualization ready for the masses?"


These scripts allowed us to determine the likelihood of a successful backup before the client was submitted, and to preserve the Legato server's resources for backups that had a real chance of succeeding. The stability of the IP network and of name resolution is of the utmost importance to a backup and recovery application. The firm's applications were designed to depend entirely on the operating system and supporting environment for forward and reverse lookups of client machines. Lesson learned: It's important to have these supporting services worked out in advance as part of a business contingency plan.
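The scripts themselves are not reproduced in this article. As a rough illustration only, a pre-submission check of forward and reverse lookups might look like the following Python sketch. The function name `lookups_agree` and the client name in the usage note are hypothetical, and the sketch assumes names are compared exactly as the resolver returns them, ignoring only case and a trailing dot.

```python
import socket


def lookups_agree(hostname, forward=socket.gethostbyname,
                  reverse=socket.gethostbyaddr):
    """Return True if hostname resolves to an address that resolves back to it.

    The resolver functions are parameters (an assumption of this sketch)
    so the check can be exercised without a live DNS server.
    """
    try:
        address = forward(hostname)
        primary, aliases, _addresses = reverse(address)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return False

    wanted = hostname.lower().rstrip(".")
    names = [primary] + list(aliases)
    # A short name only matches if the reverse lookup lists it as an alias.
    return any(name.lower().rstrip(".") == wanted for name in names)
```

A scheduler could call `lookups_agree("client01.example.com")` before adding a client to a backup group and hold back any client for which it returns `False`.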

Chaos reigned during the weeks immediately following the attacks. Day-to-day business practices and common sense seemed to be absent at times. Initially, we were all in reactive mode. That is exactly why a business contingency plan is helpful after a disaster. Had a thorough business impact analysis been performed, the firm would not have faced confusion and unwanted results when it started to restore applications. Once the recovery servers were set up, all of the business units started yelling, "Me first." The firm's recovery team could have served its clients better had upper management signed off on an application recovery priority list. Usually, the applications that generate the most revenue or are used most heavily top the list.

While application data was being restored, updated data had to be backed up. The fundamental submittal mechanism in Legato NetWorker is a group. The group will contain one or more Legato clients that have similar characteristics (e.g., the same availability requirements). However, there is a limit on the number of clients that should be in any one group, and on the number of groups that should be active at any one time. This limit was directly related to the number and performance of the tape drives configured in each StorageTek L700 Library, and to the capabilities of the hardware supporting the Legato NetWorker server. Unaware of these limits, storage administrators scheduled more clients for concurrent backups than the hardware could support. As a result, many backups failed because they timed out.

The best way to calculate your backup and recovery infrastructure's thresholds is to determine the number of streams (file systems or volumes) necessary to keep one tape drive streaming at its maximum transfer rate, and then multiply that number by the number of tape drives supporting your environment. The result is the total number of streams your hardware can support at any one time. Dividing that total by the number of streams each backup client is configured to send to the backup server gives the number of clients that can be active at any one time.
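The arithmetic can be sketched in a few lines of Python. The function names and the figures in the usage note are illustrative assumptions, not measurements from the firm's environment.

```python
def max_concurrent_streams(streams_per_drive, drive_count):
    """Total streams the tape drives can absorb at full transfer rate."""
    return streams_per_drive * drive_count


def max_concurrent_clients(streams_per_drive, drive_count, streams_per_client):
    """Clients that can be active at once, rounded down to a whole client."""
    if streams_per_client < 1:
        raise ValueError("streams_per_client must be at least 1")
    total = max_concurrent_streams(streams_per_drive, drive_count)
    return total // streams_per_client
```

For example, if four streams keep one drive streaming and ten drives are available, `max_concurrent_streams(4, 10)` gives 40 streams. With each client configured for four streams, `max_concurrent_clients(4, 10, 4)` gives 10 clients active at any one time.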

With chaos prevailing, it was more important than ever to coordinate information regarding the status of recoveries and subsequent backups. A Web-based reporting tool would have spared the recovery team from fielding user queries directly. Those queries often interrupted critical work and slowed down our main task of recovering the brokerage firm's data.

Another area of concern was the work that fell through the cracks at shift turnover. Client backups would fail for one reason or another and then be restarted without any corrective measures having been taken in the interim. Predictably, the client backup would fail again for the same reason. Not only did this delay the successful backup of the client, it also consumed valuable resources on the Legato NetWorker server with no chance of success. In a few cases, it even caused active backups to fail by starving them of resources.
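One way to guard against blind restarts is a shared failure log that blocks a client from being resubmitted until someone records what was fixed. The following Python sketch is purely hypothetical: the `FailureLog` class and its methods are assumptions made for illustration, not part of Legato NetWorker or of the firm's tooling.

```python
from dataclasses import dataclass, field


@dataclass
class FailureLog:
    """Hypothetical hand-off log shared between shifts."""
    # Maps client name to the reason for its last unresolved failure.
    open_failures: dict = field(default_factory=dict)

    def record_failure(self, client, reason):
        """Note why a client's backup failed; this blocks its restart."""
        self.open_failures[client] = reason

    def record_fix(self, client, action):
        """Record the corrective action and clear the client for restart.

        Returns a hand-off note, or None if no failure was open.
        """
        reason = self.open_failures.pop(client, None)
        if reason is None:
            return None
        return f"{client}: {reason} -> {action}"

    def may_restart(self, client):
        """A client may be resubmitted only if it has no open failure."""
        return client not in self.open_failures
```

An incoming shift would check `may_restart(client)` before resubmitting a backup, so a failure noted by the previous shift cannot be retried until a fix has been recorded.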

To recover the backup and recovery server, the recovery team needed to know the volume numbers and IDs of the tapes containing the server's bootstrap information. Not only was the firm smart enough to send a copy of the report containing this information off-site with its tapes, it also e-mailed these reports to an ISP. That gave the firm access to this critical information from anywhere in the world, and before the paper report arrived with the tapes. This practice let the firm prep the recovery environment for the return of the bootstrap tapes and prepare a written script for what to do when the tapes did arrive.

Media management
The firm faced many obstacles in obtaining the necessary tapes for client restores because the area surrounding the data center was considered a crime scene. As a result, special permission was necessary before the firm's tape operators were allowed to enter the building to retrieve the most recent tapes.

However, because the firm used an enterprise-class library, once a tape was loaded into the library, the tape operator didn't have to retrieve and mount the tape every time it was requested by the Legato NetWorker server. To further reduce or eliminate the initial time it took the tape operator to load a tape, the firm would have benefited from a predetermined client priority list that also indicated each client's recovery objective date. The recovery team could then have identified the exact tapes necessary to restore a particular client to its recovery objective date before the disaster ever happened. On arriving at the site, operators would have known which tapes to preload into the tape library.

Although it may appear that the firm's efforts to manage the recovery of its application data fell short of any real success, the reality is that, given the nature of the disaster and the many facets of our society that were altered by its scope, the firm fared well in spite of the chaos that surrounded it.

As tragic as the WTC and Pentagon disasters were, America should consider the events of Sept. 11 a wake-up call. No longer can we rely on the low odds of a disaster happening. We need to fortify our information infrastructures with solutions that will make the effects of such a disaster far less far-reaching than those we experienced in September.

This was first published in June 2002
