Making a case for high-quality failover management software
One of our customers has two database servers (Oracle 9i) attached to the same storage. They have been advised to cluster these to obtain high availability. They have declined this advice, saying manual failover will be adequate.
For most shops, I believe that manual failover is inadequate because it requires manual intervention to bring the systems back online. If an application is important enough to justify purchasing a failover system, mirrored disks and all of the other accessories that enable failover, then it's important enough to justify automating the failover.
Manual failover is slow. Before the process can begin, an administrator has to learn about the failure, stop what she is doing and make her way to the physical systems so she can throw the right A-B switches and follow the procedures to enable the failover. In some shops that requires riding one or more elevators and having appropriate security cards (don't forget them!).
If the failure takes place in the middle of the night, when many shops are unmanned, the administrator first has to get out of bed, get dressed and get to the office. If she is on vacation or at lunch, that also slows down the failover process.
Note that I am assuming that physical contact with the servers is required to complete the failover. That is not always the case, but I have seen it quite often.
Manual failover also requires that the administrator know exactly what the right steps are and that she perform them in the right order. In a data center with lots of manually failed-over clustered pairs, she also has to make certain she picks the RIGHT systems; otherwise she has doubled the seriousness of the situation by bringing down the wrong ones. This happens all the time. Have you ever sat in front of machine 1, telnetted into a second machine and forgotten that you were connected? You think you are shutting down #1, but you are actually shutting down #2, a production server.
When a new administrator is on duty, she may not have been briefed on the right steps and will likely either wait for further instructions or dive right in and potentially do something wrong.
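One cheap guard against the wrong-machine mistake is to make any destructive command confirm which host it is actually running on before it does anything. The following is only a minimal sketch of that idea in Python; the function names `confirm_target` and `guarded_shutdown`, and the `shutdown` callback, are hypothetical names I made up for illustration and do not come from any particular failover product.

```python
import socket


def confirm_target(expected_host, actual_host=None):
    """Return True only if the host we are on is the host we meant to act on."""
    if actual_host is None:
        # Ask the operating system which machine this session is really on.
        actual_host = socket.gethostname()
    # Compare short host names, ignoring case and any domain suffix.
    return actual_host.split(".")[0].lower() == expected_host.split(".")[0].lower()


def guarded_shutdown(expected_host, shutdown, actual_host=None):
    """Call the (hypothetical) shutdown callback only on the intended host."""
    if not confirm_target(expected_host, actual_host):
        # Wrong machine: refuse, and leave the system untouched.
        return False
    shutdown()
    return True
```

If the administrator believes she is on db1 but the session is really on db2, `guarded_shutdown("db1", ...)` returns False and nothing is shut down. This does not replace automation; it only removes one common way for a manual procedure to go wrong.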
Surveys indicate that human error is responsible for at least 25-30% of downtime. When you require that critical steps be performed by humans, you are introducing a significant opportunity for error. And the likelihood of error increases dramatically when the steps must be performed under pressure or late at night.
Manual failover sometimes makes sense in shops with a 7x24 operations staff that continually monitors the status of all critical systems, as long as the staff is competent and can be trusted, and as long as monitoring YOUR systems is a priority for them.
When failover is performed automatically by high-quality failover management software, the steps are preconfigured and tested, so there is a high degree of confidence that they will work. The automatic failover will work just as well at 4:00 in the morning on Thanksgiving or Christmas as it will at 3:00 on a regular Tuesday afternoon.
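To make "preconfigured and tested" concrete, here is a rough sketch in Python of the two things such software does: decide that the primary has really failed, then run a fixed list of steps in a fixed order. This is an illustration under my own assumptions, not how any specific product works. The names `should_fail_over` and `run_failover`, the three-missed-heartbeats threshold and the step names mentioned afterward are all hypothetical.

```python
from typing import Callable, List, Tuple

# A step is a name plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def should_fail_over(heartbeats: List[bool], threshold: int = 3) -> bool:
    """Return True if the most recent `threshold` heartbeats were all missed.

    `heartbeats` is oldest-first; True means the primary answered.
    """
    if len(heartbeats) < threshold:
        # Not enough history yet to declare the primary dead.
        return False
    return not any(heartbeats[-threshold:])


def run_failover(steps: List[Step]) -> List[str]:
    """Run the preconfigured steps in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, action in steps:
        if not action():
            # Never run later steps on top of a failed earlier one.
            break
        completed.append(name)
    return completed
```

For a pair of database servers on shared storage, the step list might read something like "isolate the failed primary, mount the shared storage on the standby, start the database, move the service address", each backed by a script that has been rehearsed in advance. The order is written down once and tested, so nobody has to remember it at 4:00 in the morning.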
I believe that the savings that can be gained by using existing staff to perform manual failovers are phantom savings. When the costs of downtime and errors are factored in, the true cost of manual failover is, as likely as not, higher than that of automated failover.
Evan L. Marcus
This was first published in April 2004