The tenth and final level in the Availability Index (introduced in part one of this series) is disaster recovery. Disaster recovery (DR) is hardly a topic that a brief article such as this one can cover; there are full-day courses, week-long conferences, books, and magazines devoted to the subject.
So this month I want to concentrate on one aspect of it: the difference between simple data replication and full application recovery in a disaster-recovery scenario.
A lot of system administrators believe that if their data is successfully being replicated from their primary site to their DR site, then their DR needs have been met. While that may be true for some environments, it is most definitely not true for most of them.
Although the protection of data is critically important, if the applications cannot be brought up and running on the remote side after the disaster occurs, having the data in place really won't be of much value.
There are two scenarios, both of which are cause for some concern. The first is where only application data is being replicated to the remote side. In this case, although the data is on the remote side, separate efforts must be made to install and test the applications, and then to start them there. If only the application data is being replicated, then updates to the applications, system files, and other non-application data have not been replicated, and will not (unless special efforts have been undertaken) be present on the remote system. In a true disaster, it will not be possible to go back to the original systems and retrieve data that was never copied to the remote site.
In this case, a significant amount of human intervention will be required, on a regular basis, to make sure that all files and applications are available on the remote side in addition to the pure application data.
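One way to reduce that manual effort is to compare what exists on the primary side with what has actually arrived on the remote side. The following is only a minimal sketch, assuming that both directory trees can be read from one place (for example, through a mounted snapshot of the remote copy); the function names `build_manifest` and `compare_manifests` are hypothetical and do not come from any replication product.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to root) to a SHA-256 digest of its contents."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def compare_manifests(primary, remote):
    """Return (missing, stale) lists of relative paths.

    missing: files present on the primary side but absent on the remote side.
    stale:   files present on both sides whose contents differ.
    """
    missing = sorted(p for p in primary if p not in remote)
    stale = sorted(p for p in primary if p in remote and primary[p] != remote[p])
    return missing, stale
```

Run on a schedule against application binaries, configuration, and system files, a check like this would flag the updates that data-only replication leaves behind. It reads every file in full, so a real deployment would probably want something cheaper for large trees.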
The second scenario is where all relevant data, files, and applications have been replicated to the remote side. Commercially available data replication products do not generally permit read/write access to replicated data on the remote side. So in a disaster, once the remote side has become disconnected from the primary side, it must change its status and allow local writes before the applications can run there. In general, this requires human intervention.
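The status change described above can be pictured as a small state model. This is an illustrative sketch only: the class `RemoteVolume`, its `promote` method, and the rule that promotion is refused while the replication link is still up are all assumptions made for the example, not the behavior of any particular replication product.

```python
from enum import Enum


class Role(Enum):
    REPLICA = "replica"  # receives replicated data; local writes are refused
    PRIMARY = "primary"  # accepts local writes


class RemoteVolume:
    """Toy model of the replicated copy at the DR site."""

    def __init__(self):
        self.role = Role.REPLICA
        self.link_up = True  # is the replication link to the primary alive?
        self.data = {}

    def write(self, key, value):
        """Perform a local write, which is only allowed after promotion."""
        if self.role is not Role.PRIMARY:
            raise PermissionError("volume is a replica; local writes are not allowed")
        self.data[key] = value

    def promote(self, confirmed_by=None):
        """Switch the volume to read/write once a named operator confirms."""
        if self.link_up:
            # Assumed safeguard: never promote while the primary may still be writing.
            raise RuntimeError("replication link is still up; refusing to promote")
        if not confirmed_by:
            raise RuntimeError("promotion requires operator confirmation")
        self.role = Role.PRIMARY
```

The `confirmed_by` argument stands in for the human intervention mentioned above: someone has to decide that the primary site is really gone before the remote copy starts accepting writes.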
As we discussed in an earlier column, under local clustering, system administrators can manually fail over (migrate running services) from one system to another, as long as all of the relevant data and applications are available to both systems. With clustering software, the failover can be automated, and automated failovers generally complete much faster than manual ones.
Wide-area clustering is essentially the same thing: as long as the data and applications are on the remote system, operations can be migrated there, either manually or automatically. As in the local case, the automated route will generally be the faster one.
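To make the idea of an automated failover concrete, here is a minimal sketch of the decision logic only, assuming the remote site receives periodic heartbeats from the primary. The class name `FailoverMonitor` and the threshold of three consecutive missed heartbeats are hypothetical choices for illustration; real clustering software does considerably more before it commits to a failover.

```python
class FailoverMonitor:
    """Decide when to fail over, based on consecutive missed heartbeats."""

    def __init__(self, max_missed=3):
        self.max_missed = max_missed  # assumed threshold; tune per environment
        self.missed = 0
        self.failed_over = False

    def record_heartbeat(self, received):
        """Record one heartbeat interval.

        Returns True only on the call that triggers the failover.
        """
        if self.failed_over:
            return False  # already failed over; nothing more to decide
        if received:
            self.missed = 0  # any heartbeat resets the count
            return False
        self.missed += 1
        if self.missed >= self.max_missed:
            self.failed_over = True
            return True
        return False
```

Requiring several consecutive misses is a simple guard against failing over because of a momentary network glitch, which matters more over a wide-area link than inside a single data center.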
As in local clustering, to ensure that a remote failover will succeed when it is needed, it must be tested on a regular basis. Systems change and evolve, and without regular testing, there is no way to make sure that those changes have been replicated to the remote system. The wrong time to find out that something doesn't work is during an emergency.
Evan L. Marcus is the Data Availability Maven at VERITAS Software. He can be reached at firstname.lastname@example.org.