Storage Magazine


Recovering from the WTC: a personal account
by Darryl Brooks
Issue: Jun 2002
These scripts allowed us to determine the likelihood of a successful backup before the client was submitted, and to preserve the Legato server's resources for work that was likely to succeed. The stability of the IP network and name resolution is of the utmost importance to a backup and recovery application. The firm's applications were designed to depend entirely on the operating system and supporting environment for forward and reverse lookups of client machines. Lesson learned: It's important to have these supporting services worked out in advance as part of a business contingency plan.
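The scripts themselves are not reproduced in this account. As a rough illustration only, a pre-submission check of forward and reverse lookups might look like the Python sketch below. The function name `lookup_is_consistent` and the rule of accepting a match on the short host name are assumptions made for this example, not the firm's actual logic.

```python
import socket


def lookup_is_consistent(hostname,
                         forward=socket.gethostbyname,
                         reverse=socket.gethostbyaddr):
    """Return True if hostname resolves forward to an address that
    resolves back to the same host name (hypothetical pre-flight check)."""
    try:
        address = forward(hostname)
        reverse_name, aliases, _addresses = reverse(address)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return False

    wanted_short = hostname.lower().split(".")[0]
    candidates = [reverse_name] + list(aliases)
    # Assumption: comparing short host names is good enough, so that
    # "client01" matches "client01.example.com".
    return any(name.lower().split(".")[0] == wanted_short
               for name in candidates)
```

A wrapper script could call this for every client in a group and hold back any client that fails, rather than letting the backup server discover the problem itself. The resolver functions are passed in as parameters so the check can be exercised without a live name service.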

Chaos reigned during the weeks immediately following the attacks. Day-to-day business practices and common sense seemed to be absent at times. Initially, we were all in reactive mode. That is exactly why a business contingency plan is so valuable after a disaster. Had a thorough business impact analysis been performed, the firm would not have faced confusion and unwanted results when it started to restore applications. Once the recovery servers were set up, all of the business units started yelling, "Me first." The firm's recovery team could have served its clients better had upper management signed off on an application recovery priority list. Usually, the applications that generate the most revenue or see the most use top the list.

While application data was being restored, updated data had to be backed up. The fundamental submittal mechanism in Legato NetWorker is a group. A group contains one or more Legato clients that have similar characteristics (e.g., the same availability requirements). However, there is a limit on the number of clients that should be in any one group, and on the number of groups that should be active at any one time. This limit was directly related to the number and performance of the tape drives configured in each StorageTek L700 Library, and to the capabilities of the hardware supporting the Legato NetWorker server. Unaware of these limits, storage administrators scheduled more clients for concurrent backups than the hardware could support. As a result, many backups failed because they timed out.

The best way to calculate your backup and recovery infrastructure's thresholds is to determine the number of streams (file systems or volumes) necessary to keep a tape drive streaming at its maximum transfer rate, and then multiply that number by the number of tape drives supporting your environment. The result is the total number of streams your hardware can support at any one time. Dividing that total by the number of streams each backup client is configured to send to the backup server gives the number of clients that can be active at any one time.
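The arithmetic can be sketched in a few lines of Python. The function names and all of the figures in the usage example are hypothetical, and estimating streams per drive by dividing the drive's rated speed by the average rate of a single stream is an assumption of this sketch. In practice that number is best measured on your own hardware.

```python
import math


def max_concurrent_streams(drive_rate_mb_s, stream_rate_mb_s, drive_count):
    """Total streams the tape drives can absorb at one time.

    Assumption: the streams needed to keep one drive at full speed are
    estimated as drive rate / average rate of one stream, rounded up.
    """
    streams_per_drive = math.ceil(drive_rate_mb_s / stream_rate_mb_s)
    return streams_per_drive * drive_count


def max_concurrent_clients(total_streams, streams_per_client):
    """Clients that can be active at once if each sends a fixed number of streams."""
    return total_streams // streams_per_client


# Hypothetical figures: 8 drives rated at 10 MB/s, streams averaging
# 2.5 MB/s, and every client configured to send 4 streams.
total = max_concurrent_streams(10, 2.5, 8)
print(total, max_concurrent_clients(total, 4))
```

With these made-up numbers, each drive needs four streams to stay at full speed, so eight drives support 32 streams, or eight clients at four streams apiece. Scheduling more than that invites the timeouts described above.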

With chaos prevailing, it was more important than ever to coordinate information regarding the status of recoveries and subsequent backups. A Web-based reporting tool would have spared the recovery team from fielding user queries directly. Those queries often interrupted critical work and slowed down our main task of recovering the brokerage firm's data.

Another area of concern was the work that fell through the cracks during shift turnover. Client backups would fail for one reason or another, and then be restarted without any corrective measures implemented in the interim. And of course, the client backup would fail again for the same reason. Not only did this delay the successful backup of the client, it also consumed valuable resources on the Legato NetWorker server with no chance of success. And in a few circumstances, it even caused active backups to fail by starving them of resources.

In order to recover the backup and recovery server, the recovery team needed to know the volume number and IDs of the tapes containing the server's bootstrap information. Not only was the firm smart enough to send a copy of this report off-site with its tapes, it also e-mailed these reports to an ISP, which gave it access to this critical information from anywhere in the world, and before the paper report arrived with the tapes. This practice gave the firm the ability to prep the recovery environment for the return of the bootstrap tapes, and to prepare a written script of what to do when the tapes did arrive.

Media management
The firm faced many obstacles in obtaining the necessary tapes for client restores because the area surrounding the data center was considered a crime scene. As a result, special permission was necessary before the firm's tape operators were allowed to enter the building to retrieve the most recent tapes.

However, because the firm used an enterprise-class library, once a tape was loaded into the library, the tape operator didn't have to retrieve and mount the tape every time it was requested by the Legato NetWorker server. To further reduce or eliminate the initial time it took the tape operator to load a tape, the firm would have benefited from a predetermined client priority list that also indicated each client's recovery objective date. The recovery team could then have identified the exact tapes necessary to restore a particular client to its recovery objective date, before the disaster ever happened. Operators arriving on site would then have known which tapes to preload into the tape library.
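To make the idea concrete, here is a minimal Python sketch of turning a priority list into a tape preload order. The data layout is entirely hypothetical: it assumes a catalog export in which each backup record already lists every tape needed to restore the client to that backup's date. It does not use or represent any Legato NetWorker interface.

```python
from datetime import date


def tapes_to_preload(priority_list, catalog):
    """Return tape labels in the order they should be loaded.

    priority_list: [(client, recovery_objective_date), ...] in recovery order.
    catalog: {client: [(backup_date, [tape_label, ...]), ...]}
             (hypothetical export; each record lists all tapes needed
             to restore the client to that backup date).
    """
    ordered = []
    for client, objective in priority_list:
        # Only backups taken on or before the objective date qualify.
        eligible = [rec for rec in catalog.get(client, [])
                    if rec[0] <= objective]
        if not eligible:
            continue  # nothing usable for this client; flag it separately
        _backup_date, tapes = max(eligible, key=lambda rec: rec[0])
        for tape in tapes:
            if tape not in ordered:  # a tape may serve several clients
                ordered.append(tape)
    return ordered


# Made-up example data.
priorities = [("trading-db", date(2001, 9, 10)), ("mail01", date(2001, 9, 7))]
catalog = {
    "trading-db": [(date(2001, 9, 9), ["A001", "A002"]),
                   (date(2001, 9, 10), ["A001", "A003"])],
    "mail01": [(date(2001, 9, 7), ["A003", "B010"])],
}
print(tapes_to_preload(priorities, catalog))
```

Keeping the output in priority order means the operator loads the tapes for the most important clients first, and a tape shared by several clients is loaded only once.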

Although it may appear that the firm's efforts to manage the recovery of its application data fell short of any real success, the reality is that, given the nature of the disaster and the many facets of our society that were altered by its scope, the firm fared well in spite of the chaos that surrounded it.

As tragic as the WTC and Pentagon disasters were, America should consider the events of Sept. 11 a wake-up call. No longer can we rely on the low odds of a disaster happening. We need to fortify our information infrastructures with solutions that will make the effects of such a disaster far less far-reaching than those we experienced in September.
All Rights Reserved, Copyright 2000 - 2008, TechTarget