This article can also be found in the Premium Editorial Download "Storage magazine: Is storage virtualization ready for the masses?."
It was shortly after the World Trade Center towers collapsed that I stood in the lobby of an undisclosed location, peering across the Hudson River at what used to be the World Trade Center complex. In its place were blazing lights and huge tractor trailers aiding in the search and recovery effort for victims. I was part of a different type of search and recovery effort, led by Legato Professional Services. Our mission was to establish a backup and recovery infrastructure that would allow one of the world's largest brokerage firms to find and recover its data while continuing to meet SEC regulations regarding data protection. During this engagement, I witnessed the trials and tribulations the firm experienced during what must have been one of the most strenuous times in U.S. history. Here's my report.
The firm's data center was located in the WTC complex, so its local computers and storage were destroyed by the collapse of the towers and the dust and debris that fell around them. Worse, because the off-site storage vendor hadn't yet arrived that morning, the previous night's backup tapes were still in the building, which was now considered a crime scene.
For those applications that were deemed mission-critical before the attack, data had been mirrored using a Hitachi SAN. Thus, those applications were up and running within hours of the collapse. The many other applications, however, would have to be rebuilt and restored from backups.
Initially, we needed to get duplicate hardware and software to reconstruct the destroyed production environment so recovery could begin. The supporting vendors (Legato, Hitachi, StorageTek, and Sun) were all great in providing the necessary pieces of equipment. After procuring the hardware to rebuild the backup and recovery infrastructure, the process began by recovering the firm's six Legato NetWorker servers: four for recoveries and two for continued backups. These servers were Sun Enterprise 6500s with 8GB of memory, four CPUs and direct-attached StorageTek L700 libraries using DLT tapes.
Because the firm used DLT drives, we experienced severe performance problems loading and unloading the hundreds of tapes needed during the recovery. DLT drives perform well once a tape has been loaded, but they handle an unusually high volume of tape mounts poorly, which is exactly the pattern an enterprise-wide disaster produces. The problem was exacerbated by code in the binary command responsible for loading and unloading the tapes, which had to run atomically. As a result, the many simultaneous load and unload requests generated by the concurrent recoveries queued up behind one another, causing repeated delays. These delays aren't exclusive to DLT tape drives; any tape drive designed for increased capacity rather than speed would yield the same dismal results.
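To see why an atomic load/unload path hurts, consider a toy model (the timings are hypothetical, not DLT specifications): when every mount must hold a single lock, concurrent requests complete one at a time, so total wall time grows linearly with the number of requests instead of overlapping.

```python
# Minimal sketch of serialized tape mounts. LOAD_TIME and the request
# count are illustrative assumptions, not measured DLT behavior.
import threading
import time

LOAD_TIME = 0.05           # per-mount cost (hypothetical)
mount_lock = threading.Lock()

def mount_tape(slot):
    # The atomic load/unload section: only one mount can run at a time.
    with mount_lock:
        time.sleep(LOAD_TIME)

start = time.monotonic()
threads = [threading.Thread(target=mount_tape, args=(s,)) for s in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start
# elapsed is roughly 10 * LOAD_TIME: the ten requests ran back to back,
# even though they were issued concurrently.
```

With ten recoveries each requesting mounts, the queue alone adds the full per-mount cost for every request, which matches the delays we saw.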
Additional problems rapidly surfaced. In our recovery operation, the Legato NetWorker servers weren't only responsible for mounting tapes and updating the indexes; they also moved the data between the tape drives and the recovering client systems. Granted, a disaster of this magnitude couldn't have been imagined, and getting the recovery systems up and running as soon as possible was at the forefront of everyone's mind. Even so, the deployed design negatively impacted the established service level agreements between the recovery team and the business units it supported.
A better solution would have been to station storage nodes between the recovering clients and the NetWorker server. In such a configuration, the storage nodes would have handled the data movement, freeing the NetWorker server from the hardware interrupts associated with driving the tape drives and network interfaces. The loading and unloading of tapes would then have proceeded more smoothly.
Nonetheless, recovery of the six Legato NetWorker servers was completed without incident. Each took approximately five to six hours once the hardware was set up properly.
Lack of consistent IP connectivity
Most of the requested recoveries completed without incident. Many of those that did fail, however, failed because of the lack of consistent IP connectivity and name resolution: for one reason or another, client systems were randomly dropping off the network. Luckily, the firm's day-to-day management practices included a set of nicely written scripts that tested each managed client's functionality before executing a backup.
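The article doesn't reproduce the firm's scripts, but a minimal pre-backup check for the two failure modes described above, name resolution and basic network reachability, might look like the following sketch. The function name is my invention, and the default port 7937 (commonly used by the NetWorker nsrexecd client service) is an assumption about the environment, not a detail from the firm's setup.

```python
import socket

def client_reachable(hostname, port=7937, timeout=5):
    """Return True if the client's name resolves and the given TCP
    port accepts a connection; False otherwise.

    Port 7937 is assumed here as the NetWorker client service port.
    """
    try:
        addr = socket.gethostbyname(hostname)   # name-resolution check
    except socket.gaierror:
        return False
    try:
        # Basic connectivity check: can we open a TCP connection?
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running such a check before each backup, as the firm did, keeps unreachable clients from tying up drives and sessions that the remaining recoveries need.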
This was first published in June 2002