This article can also be found in the Premium Editorial Download "Storage magazine: Expanding SANs: How to scale today's storage networks."
Provisioning IP addresses
As is usually the case, IP connectivity was a problem from the start of the recovery effort. However, after a meeting of minds, these issues were permanently resolved. During problem resolution, I discovered that the insurance company had asked Sungard to resurrect its IP infrastructure with the same addressing scheme (e.g., IP addresses and netmasks) as its production LAN. I understand the reasons behind this decision: the company didn't want to change IP settings at the host level, and it wanted to restore its DNS server. Even so, resurrecting the production LAN exactly as it existed before the disaster always seems to delay the recovery effort.
A better approach would be to request a block of IP addresses at the necessary link speed, assign those addresses to the recovering hosts and then create a master hosts file to be placed on each recovering host.
Although this may require more work, the file can be created as a deliverable from your DR provider before you arrive on site. And if you foresee needing a DNS server for future host additions in a real disaster, there's plenty of free nameserver software that takes a hosts file as input and creates the DNS database files from it.
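As a rough illustration of the hosts-file approach, the following Python sketch assigns addresses from a provider-supplied block to a list of recovering hosts, writes the master hosts file text, and derives DNS A records from it. The host names, the 10.20.30.0/29 block and the dr.example.com domain are hypothetical placeholders, not values from the exercise, and the record output is only a fragment, not a complete zone file for any particular nameserver.

```python
import ipaddress


def build_hosts_file(hostnames, network, domain="dr.example.com"):
    """Assign usable addresses from `network` to `hostnames`, in order.

    Returns (hosts_file_text, {hostname: address}). The domain default is a
    hypothetical placeholder.
    """
    addresses = iter(ipaddress.ip_network(network).hosts())
    lines = ["127.0.0.1\tlocalhost"]
    assignments = {}
    for name in hostnames:
        try:
            addr = next(addresses)
        except StopIteration:
            raise ValueError(f"address block {network} is too small") from None
        assignments[name] = addr
        # Standard hosts-file layout: address, fully qualified name, short alias.
        lines.append(f"{addr}\t{name}.{domain}\t{name}")
    return "\n".join(lines) + "\n", assignments


def hosts_to_a_records(hosts_text):
    """Turn hosts-file lines into DNS A records, skipping loopback entries."""
    records = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        addr, *names = line.split()
        if not names or ipaddress.ip_address(addr).is_loopback:
            continue
        records.append(f"{names[0]}.\tIN\tA\t{addr}")
    return records


if __name__ == "__main__":
    # Hypothetical recovering hosts and a hypothetical provider-supplied block.
    text, _ = build_hosts_file(["db01", "app01", "backup01"], "10.20.30.0/29")
    print(text)
    print("\n".join(hosts_to_a_records(text)))
```

Because the hosts file is the single source of truth here, the same text can be copied to each recovering host now and fed to a nameserver later if one turns out to be needed.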
In order for that to work, configuration files in your production LAN shouldn't be peppered with IP addresses. Use only the hostnames of servers running application software to make
Floppy disk fallback
The most resource-independent network around is what's generally referred to as "sneaker net." This is the process of moving data from one secured network to another by having an administrator physically remove storage media from a computer on one network and walk it over to a computer on a second network. Although the media could be a tape or disk device, the usual medium is a floppy disk. And in a DR exercise, the two networks are probably a private connection to your production environment back home and the DR provider's onsite LAN that you have provisioned for the disaster recovery.
Some files were unrecoverable due to tape errors, and the administrators resorted to transferring the files via FTP from their environment back home. However, because the PC that connected them to their network back home didn't have visibility on the LAN at Sungard, they had to use floppies to transport the files to the recovering servers over sneaker net. The problem was that blank floppies weren't readily available because no one foresaw the need to FTP files from their production LAN. So a search party was formed to locate a few floppy disks.
In preparing for a disaster and considering the real possibility of this scenario, removable storage devices and media should be included in your off-site canisters as a fallback. Ideally, this removable storage device should be able to store the largest of your client files, and be recognized by both the computer connecting you back to your production LAN and a computer on the DR provider's private LAN.
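One way to check the sizing advice ahead of time is to find the largest file under a client directory and compare it with the capacity of the media you plan to pack. The sketch below is a minimal example under stated assumptions: the directory path in the usage section is hypothetical, and the only capacity constant shown is the 1,474,560 bytes of a standard 1.44MB floppy, which ignores file system overhead.

```python
import os

FLOPPY_BYTES = 1_474_560  # raw capacity of a 1.44MB floppy (1,440 x 1,024 bytes)


def largest_file(root):
    """Return (path, size_in_bytes) of the largest file under `root`.

    Returns (None, 0) if no readable files are found.
    """
    best_path, best_size = None, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file; skip it
            if best_path is None or size > best_size:
                best_path, best_size = path, size
    return best_path, best_size


def fits_on_media(root, capacity_bytes):
    """Report whether the largest file under `root` fits on one piece of media."""
    path, size = largest_file(root)
    return size <= capacity_bytes, path, size


if __name__ == "__main__":
    # "/export/clients" is a hypothetical path used only for illustration.
    ok, path, size = fits_on_media("/export/clients", FLOPPY_BYTES)
    print(f"largest file: {path} ({size} bytes); fits on one floppy: {ok}")
```

If the check fails for floppies, rerun it with the capacity of whatever larger removable device you intend to include in the off-site canisters.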
Take nothing for granted
Assume the administrator coming in off the street knows nothing about your backup and recovery processes at home. Your DR plan should therefore document everything, from how to connect your application data to your recovery server--whether that data is on a tape library or a disk array--to how to perform command-line restores of file systems and databases. To test your plan, have someone from another group who is IT literate, yet unfamiliar with your group's storage practices, execute the plan. This should give you some idea of how an outside consultant would fare in resurrecting your applications.
One last note on this subject: all procedures should be standardized, documented and well known by the recovery team. The insurance company's resident Legato administrator was well versed in the Legato application and, as a result, appeared to have been given free rein over the environment. This scenario often leads to administrators creating procedures and walking around with them in their heads, and this administrator was no exception. Make your procedures well known with public documentation. Not sharing this information doesn't give you more job security. All it does is require the operations support staff to call you in the middle of the night.
A prioritized application recovery list was nonexistent during this particular exercise. In the absence of this list, application administrators simply fired up recoveries at their whim. Because there weren't a large number of servers to recover, tape drive resources weren't overrun. But when application administrators required my attention to help perform their recoveries, I wasn't able to immediately assist other application owners who, judging by the attention the DR coordinator was giving them, owned more important applications. Having a recovery priority list helps shift resources based on a business objective approved by upper management, which leads to a more successful recovery effort.
Testing application recovery on the anniversary dates of historic terror events also helps promote a more successful recovery effort. That's amplified when you consider that Sungard DR policies are based on a first-come, first-served model with regard to the allocation of hardware and floor space. With this policy in mind--which is largely the same for service providers in this space--how do you know your applications will have a home once a far-reaching disaster strikes?
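A recovery priority list doesn't need to be elaborate to be useful. The following sketch shows one possible shape for it: each application carries a management-approved priority, and recoveries are grouped into waves sized to the number of tape drives available. The application names, owners, priorities and drive count are all hypothetical, and a real plan would also need to account for restore duration and dependencies between applications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recovery:
    application: str
    priority: int  # 1 = most critical, as approved by management
    owner: str


def recovery_waves(recoveries, tape_drives):
    """Order recoveries by priority and split them into waves.

    Each wave holds at most `tape_drives` recoveries, so the most critical
    applications get the drives (and the administrator's attention) first.
    """
    if tape_drives < 1:
        raise ValueError("at least one tape drive is required")
    ordered = sorted(recoveries, key=lambda r: (r.priority, r.application))
    return [ordered[i:i + tape_drives]
            for i in range(0, len(ordered), tape_drives)]


if __name__ == "__main__":
    # Hypothetical priority list for illustration only.
    plan = [
        Recovery("intranet", 3, "web team"),
        Recovery("claims-db", 1, "claims team"),
        Recovery("billing", 2, "finance team"),
    ]
    for number, wave in enumerate(recovery_waves(plan, tape_drives=2), start=1):
        names = ", ".join(r.application for r in wave)
        print(f"wave {number}: {names}")
```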
This was first published in November 2003