When a computer system fails, it can take hours, or in some cases days, to diagnose the failure. If the failure is an intermittent one, it can take even longer; some intermittent problems are never reliably diagnosed.
If the source of the problem turns out to be hardware, a replacement for the failed part must be obtained, and then someone capable of replacing it must be called in. If the problem is in software, a patch to the application or to the operating system must be obtained, if it even exists (it may have to be written first). Assuming the fix works, the host must be rebooted, and recovery from any damage that the failure caused must begin.
Sometimes you'll find yourself in the finger-pointing circle game, where the hardware vendor blames the OS vendor, who blames the storage vendor, who blames the application vendor, who blames the hardware vendor again. All the while, of course, your system is down. If the failed server is a critical one, this sort of hours- or days-long outage due to vendor bickering is simply unacceptable.
What can you do? You could take your applications off the Unix or Windows server you've installed them on and put them on a multimillion-dollar fault-tolerant (FT) server instead. FT servers are designed with redundant hardware so that if one component fails, another can instantly step in and take over for it. (FT servers are often designed with triple-redundant hardware, and there is at least one quad-redundant
Unfortunately, even an FT server may not offer adequate protection. Although the FT vendors may make enhancements to their drivers and operating systems, FT systems are no less vulnerable to software issues than more conventional systems. What's more, FT systems are by their nature closed systems that do not offer the flexibility or connectivity of conventional systems, because those benefits can introduce risk to the system. It is difficult to migrate existing applications to FT systems, because FT systems are not always compatible with conventional ones. FT systems are popular in certain high-end applications, such as gaming (casinos and lotteries) and air traffic control, where the benefits that they provide offset their cost.
A more practical and less expensive solution is to take two or more conventional servers and connect them together with some controlling software, so that if one server fails, the other server can take over automatically. The takeover occurs with some interruption in service, but that interruption is usually limited to just a few minutes.
The migration of services from one server to another is called failover.
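The failover idea can be illustrated with a short simulation. This is only a sketch under stated assumptions, not how any particular clustering product works: the `Server` and `FailoverController` names, the health flag, and the `max_missed` threshold are all hypothetical. It assumes the controlling software polls the active server periodically and migrates its services to the standby after several consecutive failed checks, which is one reason a takeover involves some interruption rather than being instantaneous.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """A conventional server in the pair (hypothetical model)."""
    name: str
    healthy: bool = True
    services: list = field(default_factory=list)


class FailoverController:
    """Hypothetical controlling software for a two-server pair."""

    def __init__(self, primary, standby, max_missed=3):
        self.active = primary
        self.standby = standby
        # Assumed threshold: failed checks tolerated before taking over.
        self.max_missed = max_missed
        self.missed = 0

    def check(self):
        """Run one monitoring cycle; return True if a failover occurred."""
        if self.active.healthy:
            self.missed = 0
            return False
        self.missed += 1
        if self.missed < self.max_missed:
            return False  # not yet confident the server is really down
        if not self.standby.healthy:
            return False  # no healthy server to take over
        # Failover: migrate the services to the other server.
        self.standby.services.extend(self.active.services)
        self.active.services.clear()
        self.active, self.standby = self.standby, self.active
        self.missed = 0
        return True


node_a = Server("node-a", services=["database", "web"])
node_b = Server("node-b")
controller = FailoverController(node_a, node_b)

node_a.healthy = False  # simulate a failure of the active server
results = [controller.check() for _ in range(3)]
print(results, controller.active.name, controller.active.services)
```

In this sketch the first two checks only count the missed responses, and the third triggers the takeover. A real deployment would also have to restart the applications on the surviving server and guard against both servers believing they are active, neither of which is modeled here.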
To ensure data consistency and rapid recovery in a failover situation, the servers should be connected to the same shared disks. This series of tips will assume that the servers are located within the same site, and generally in the same room. (Migrating critical applications to a remote site is a disaster recovery issue. While it seems similar to the local case, it actually introduces many new variables and complexities. That topic is not covered in this series, but it is discussed in Chapter 18, "Data Replication," of the book "Blueprints for High Availability, Second Edition.")
Content in this tip has been excerpted by permission from the book "Blueprints for High Availability, Second Edition," authored by Evan Marcus and Hal Stern, Wiley Publishing, Inc. All rights reserved.
About the authors: Evan Marcus is a frequent SearchStorage.com contributor and an expert at answering readers' questions about availability, backup and disaster recovery. He is a principal engineer for Veritas Software and the industry's data availability maven, with over 12 years of experience in this area. He also speaks frequently at industry technical conferences.
Hal Stern is the vice president and chief technology officer for the Services business unit of Sun Microsystems. He has worked on reliability and availability issues for some of the largest online trading and sports information sites, as well as for several network service providers.
This was first published in December 2003