Editor's note: In part 1 of this series, the authors described the basic reasons to deploy a server failover-type architecture. They described failover as taking two or more conventional, local servers and connecting them with controlling software so that one server can take over when the other fails. This migration of services from one server to another is what they termed "failover." Here, the authors go into greater detail on how best to establish a failover environment.
At minimum, the migration of services during a failover should meet the following criteria:
- Rapid failover
- Minimal manual intervention
- Guaranteed data access
We will examine each of these criteria in the rest of this column.
Rapid failover
The failover should be no more intrusive to the clients who access the server's services than a simple reboot. This intrusiveness may not be reflected in the duration of the outage, but rather in what the clients must do to get back to work once services have been restored.
In some cases, primarily with databases, it may be necessary for users to log back in to their applications. Nonauthenticated web and file services should not require logging back in. Login sessions on the server that failed, however, still require a fresh login on the takeover server with today's technology.
Failover should take no more than five minutes, and ideally less than two minutes. The best way to achieve this goal is for the takeover server to already be booted up and running as many of the underlying system processes as possible. If a full reboot is required in order to failover, failover times will go way up and can, in some cases, take an hour or more.
The two- to five-minute goal for failovers is a noble one and can easily be met by most applications. The most glaring exception is databases such as Oracle or DB2. A database can only be restarted after all of the transactions that have been cached are rerun and the database updated. (Transactions are cached to speed up routine database performance; the trade-off is slower recovery.) There is no limit to how long it might take a database to run through all of the outstanding transactions, and while those transactions are being rerun, the database is down from the user's perspective.
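To make the unbounded-recovery point concrete, here is a minimal sketch, in Python, of replaying a cached transaction log during recovery. This is purely illustrative and assumed for this example; real database recovery in products like Oracle or DB2 is far more involved. The function name `replay_log` and the log format are hypothetical.

```python
# Hypothetical sketch of replaying cached (unapplied) transactions after
# a failover. The takeover server must finish this loop before the
# database is usable again, which is why recovery time grows with the
# number of outstanding transactions.

def replay_log(log_entries, table):
    """Re-apply every transaction that was cached but not yet written out."""
    applied = 0
    for key, value in log_entries:  # each entry is one pending update
        table[key] = value          # re-run the update against the data
        applied += 1
    return applied

# Three pending updates means three replay steps; a busy database could
# have millions, and the database is down until the loop completes.
pending = [("acct:1", 100), ("acct:2", 250), ("acct:1", 75)]
data = {}
count = replay_log(pending, data)
print(count, data["acct:1"])  # → 3 75
```

Note that the replayed updates must be applied in log order: the final value of `acct:1` is 75 because that write came last.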
Minimal manual intervention
Ideally, no human intervention at all should be required for a failover to complete; the entire process should be automated. Some sites or applications may require manual initiation for a failover, but that is not generally desirable. As already discussed, the host receiving a failover should never require a reboot.
Guaranteed data access
After a failover, the receiving host should see exactly the same data as the original host. Replicating data to another host when disks are not shared adds unnecessary risk and complexity, and is not advised for hosts located near each other.
The systems in a failover configuration should also communicate with each other continuously, so that each system knows the state of its partner. This communication is called a heartbeat.
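The heartbeat idea can be sketched in a few lines of Python. This is an assumed, simplified design, not the mechanism of any particular failover product: each host records when it last heard from its partner, and if no heartbeat arrives within a timeout, it concludes the partner has failed and a takeover should begin. The class name `HeartbeatMonitor` and the injectable `clock` parameter are illustrative choices.

```python
import time

class HeartbeatMonitor:
    """Minimal sketch of heartbeat-based partner monitoring.

    The clock is injectable so the logic can be tested deterministically;
    in production you would simply use time.monotonic.
    """

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock
        self.last_beat = clock()  # treat startup as the first heartbeat

    def beat(self):
        # Called each time a heartbeat message arrives from the partner.
        self.last_beat = self.clock()

    def partner_alive(self):
        # If no heartbeat has arrived within the timeout, assume the
        # partner has failed and a takeover should be initiated.
        return (self.clock() - self.last_beat) < self.timeout
```

In practice the heartbeat messages would travel over a dedicated network link (often more than one, to avoid a single cable failure being mistaken for a dead partner), and the timeout must be long enough to tolerate transient delays without triggering a false takeover.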
Content in this tip has been excerpted by permission from the book "Blueprints for High Availability, Second Edition," authored by Evan Marcus and Hal Stern, Wiley Publishing, Inc. All rights reserved.
About the authors: Evan Marcus is a frequent SearchStorage.com contributor and an expert at answering readers' questions related to availability, backup and disaster recovery. He is also a principal engineer for Veritas Software and the industry's data availability maven, with over 12 years of experience in this area. He is a frequent speaker at industry technical conferences.
Hal Stern is the vice president and chief technology officer for the Services business unit of Sun Microsystems. He has worked on reliability and availability issues for some of the largest online trading and sports information sites, as well as for several network service providers.
Do you have a question for Evan Marcus? You can find him in our High Availability category.