In previous tips in this series on system availability, we introduced the idea that implementing availability requires a layered approach. Following that approach, we looked at good system administration practices, backups, disks and storage (and why larger disks aren't always better), networking, the system's local environment, and server-based applications. The eighth level in the Availability Index (introduced in part one) is clustering.
It is very interesting to me that clustering is always used as a synonym for high availability. If you take only one point away from this series of columns, it should be that you cannot achieve high availability simply by implementing clustering and walking away. High availability isn't clustering any more than it's mirroring or replication.
When you cluster two systems together (let's not worry about more than two right now), the second system (call it peppermints) automatically steps in for the first, incense, should incense stop working for some reason. All clustering really does is allow the critical services that were running on incense to recover more quickly than they would if peppermints weren't in the picture.
Since failover always adds some risk (what if the other machine is down, or what if something was changed since the last time, and nobody updated the failover configuration along the way…), it's always better NOT to need to fail over. It's therefore better to build your systems so that they can survive as many types of outages as possible without having to fail over.
I have had users call me and tell me that they had a critical homegrown application that was crashing their server every four hours. Could they use clustering software to make it more highly available? The answer to that question is yes. They *could* use clustering software.
The better question is, should they? And that answer is no.
Actually, the real question they are asking is what they can do to increase the availability of their critical application. And while clustering would increase the availability, there is a much better solution: fix the application.
Fixing the application is more work than implementing the cluster, and may take longer to complete. It's more expensive. It's harder. But it's the right way to increase the application's availability. Attack the root cause; anything else is a band-aid fix.
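To put rough numbers on that argument, here is a minimal sketch in Python. Only the four-hour crash interval comes from the story above; the recovery times and the post-fix crash interval are made-up assumptions chosen for illustration, and the formula is the usual steady-state approximation (uptime divided by uptime plus downtime), not a measurement of any real system.

```python
HOURS_PER_YEAR = 24 * 365


def availability(uptime_hours: float, recovery_hours: float) -> float:
    """Fraction of time the service is up: uptime / (uptime + downtime)."""
    return uptime_hours / (uptime_hours + recovery_hours)


def outages_per_year(uptime_hours: float, recovery_hours: float) -> float:
    """How many crash-and-recover cycles fit into one year."""
    return HOURS_PER_YEAR / (uptime_hours + recovery_hours)


# From the text: the application crashes the server every four hours.
CRASH_INTERVAL_H = 4.0
# Assumptions for illustration only (hypothetical figures):
REBOOT_RECOVERY_H = 15 / 60      # 15 minutes to reboot a lone server
FAILOVER_RECOVERY_H = 2 / 60     # 2 minutes for the second node to take over
FIXED_CRASH_INTERVAL_H = 2000.0  # crash interval after the bug is fixed

scenarios = {
    "single server, buggy app": (CRASH_INTERVAL_H, REBOOT_RECOVERY_H),
    "cluster, buggy app": (CRASH_INTERVAL_H, FAILOVER_RECOVERY_H),
    "single server, fixed app": (FIXED_CRASH_INTERVAL_H, REBOOT_RECOVERY_H),
}

results = {
    name: (availability(up, down), outages_per_year(up, down))
    for name, (up, down) in scenarios.items()
}

for name, (avail, outages) in results.items():
    print(f"{name:26s} availability={avail:.4%}  outages/year={outages:,.0f}")
```

Under these assumed figures, the cluster does raise availability, but users still see an interruption every four hours, and every one of those interruptions is a failover that carries its own risk. Fixing the application removes almost all of the interruptions, even without a second server.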
If clustering were always the right way, or the only way, to increase system availability, then it would not be at the eighth level of the Availability Index. The construction of a critical highly available system is like the construction of a tall building; if you start on the eighth floor, the building simply will not stand. If you try building highly available systems on top of shoddy applications, bad disks, and an unreliable operating system, clustering simply won't help. The system will continue to suffer interruptions.
As we'll discuss in a future column, although it costs money to implement the tools that give your systems their required level of availability, it often costs more money when the systems are down. Whether or not it makes financial sense to implement a particular protective measure depends on the cost of doing so, balanced against the value that the application will deliver by being up a greater percentage of the time. There is no universal right answer.
Life would be so much easier if there were.
Evan L. Marcus is the Data Availability Maven at VERITAS Software. You can contact him at firstname.lastname@example.org.