When users, and especially management, discuss availability and high availability, one of the first things to come up is "the nines." In availability circles, everyone talks about the nines of availability, as shown in the table below.
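|Nines||Percentage||Downtime in a year|
|2||99%||3.65 days|
|3||99.9%||8.76 hours|
|4||99.99%||52.56 minutes|
|5||99.999%||5.26 minutes|
|6||99.9999%||31.5 seconds|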
The nines specifically refer to the percentage of time that a system is up over a given period of time, usually a year. It's a nice, easy method of summing up availability, and evaluating system and system administrator quality, within a model that even the pointiest-haired boss can understand.
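The arithmetic behind the nines is simple enough to sketch in a few lines of Python. The function names here are illustrative, and the figures assume a 365-day year; a leap year or a different measurement period would shift them slightly.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def availability_for_nines(nines: int) -> float:
    """Return availability as a fraction, e.g. 3 nines -> 0.999."""
    return 1 - 10 ** -nines


def downtime_minutes_per_year(nines: int) -> float:
    """Return the minutes of downtime per year permitted at this many nines."""
    return MINUTES_PER_YEAR * 10 ** -nines


# Print one line per level, from two nines up to six nines.
for n in range(2, 7):
    pct = f"{availability_for_nines(n):.{n - 2}%}"
    print(f"{n} nines  {pct:>9}  {downtime_minutes_per_year(n):10.2f} min/year")
```

Each additional nine divides the permitted downtime by ten, which is why the jump from four nines to five moves the budget from just under an hour a year to about five minutes.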
It's also badly flawed.
The nines model makes a fundamental and invalid assumption. It assumes that all time is worth exactly the same amount to the organization that has deployed the critical system. That's simply not true.
Consider amazon.com, for example. Suppose their external Web site suffered two 26-minute outages in a year. If those were the only outages, the total of 52 minutes would still give them 99.99% availability. But would the two outages hurt them the same amount if one occurred on a Friday night in December and the other at 3am on a Sunday morning in late August?
If the systems that control the cameras and graphics in the network news studio at NBC failed, would it hurt more if the failure occurred just before the live news broadcast, or just after it was over?
The nines model does not take timing into account.
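One way to make the point concrete is to weight each minute of downtime by what it costs the business at that moment. The sketch below is purely illustrative: the `cost_per_minute` function, its time windows, and its cost figures are invented assumptions, not real data from any company, and the dates are simply a Friday in December and a Sunday in late August.

```python
from datetime import datetime, timedelta


def cost_per_minute(moment: datetime) -> float:
    """Hypothetical cost of one minute of downtime, in arbitrary units."""
    # Assumption: December evenings are peak shopping time.
    if moment.month == 12 and 17 <= moment.hour < 23:
        return 100.0
    # Assumption: the small hours of the morning are nearly idle.
    if moment.hour < 6:
        return 1.0
    return 10.0


def outage_cost(start: datetime, minutes: int) -> float:
    """Sum the per-minute cost across every minute of the outage."""
    return sum(cost_per_minute(start + timedelta(minutes=m)) for m in range(minutes))


december_friday = datetime(2023, 12, 15, 20, 0)  # Friday evening in December
august_sunday = datetime(2023, 8, 27, 3, 0)      # 3am on a Sunday in late August

print(outage_cost(december_friday, 26))
print(outage_cost(august_sunday, 26))
```

Under these made-up weights, the two outages cost 2600 and 26 units respectively, yet each subtracts exactly the same 26 minutes from the yearly availability figure.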
The more components a system has, the more complex it is, and the harder the nines are to achieve. Consider a system with ten components and an availability goal of five nines (99.999%, or about five minutes of downtime a year). What that really means is that each component is allocated an average of about 30 seconds a year of downtime that it can be responsible for. If any one component is responsible for more than its 30 seconds, another component must be responsible for less. If any one component is responsible for more than five minutes of downtime, then it doesn't matter what the other components do; the goal has already been missed.
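The budget arithmetic can be sketched as follows. This assumes a 365-day year, that a failure of any single component takes the whole system down, and that component outages do not overlap; the function names are illustrative. The exact figures (315.36 seconds in total, about 31.5 seconds per component) are slightly higher than the rounded five minutes and 30 seconds used in the text.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def downtime_budget_seconds(availability: float) -> float:
    """Total downtime per year permitted by an availability goal."""
    return SECONDS_PER_YEAR * (1 - availability)


def per_component_budget(availability: float, components: int) -> float:
    """Average downtime each component may cause if the budget is split evenly."""
    return downtime_budget_seconds(availability) / components


def remaining_budget(availability: float, downtimes_seconds: list[float]) -> float:
    """Budget left after the listed component outages; negative means the goal is missed."""
    return downtime_budget_seconds(availability) - sum(downtimes_seconds)


goal = 0.99999  # five nines
print(round(downtime_budget_seconds(goal), 2))    # whole-system budget in seconds
print(round(per_component_budget(goal, 10), 2))   # even share for each of ten components

# One component down for six minutes blows the budget on its own,
# even if the other nine components never fail.
print(remaining_budget(goal, [360.0] + [0.0] * 9) < 0)
```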
In the last few years, many system vendors have begun to offer contractual uptime guarantees: if a system's downtime exceeds a given threshold, the vendor pays the end user money as compensation. The problem with this model is that many causes of downtime are outside the system vendor's control. Electric power is one example. If HP guarantees that your system will be up 99.999% of the time, but you suffer a two-hour power outage (and your UPSs only held out for an hour), it would seem that the uptime guarantee should kick in. But it's not HP's fault that you had a power outage, and in fact, their contracts specifically exclude certain types of outages. By the time these reasonable exclusions are accounted for, these contractual agreements have lost most of their teeth.
My advice on the nines is to measure them. Keep availability statistics, and record them on a regular basis even if your management has not asked for them. Report them if they are good, or if you are called upon to do so. Then look at what causes the majority of your downtime, and fix it. If you concentrate on the causes behind the majority of the problems, you will see a significant improvement in availability.
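A minimal sketch of that record-keeping might look like the following. The `Outage` record, the cause labels, and the sample log are all hypothetical, made up purely to show the calculation; a real log would come from your own monitoring or incident records.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Outage:
    cause: str      # free-form label, e.g. "power" or "disk"
    minutes: float  # how long the system was down


def availability(outages: list[Outage], period_minutes: float) -> float:
    """Fraction of the period during which the system was up."""
    down = sum(o.minutes for o in outages)
    return 1 - down / period_minutes


def downtime_by_cause(outages: list[Outage]) -> list[tuple[str, float]]:
    """Total downtime per cause, largest first."""
    totals = Counter()
    for o in outages:
        totals[o.cause] += o.minutes
    return totals.most_common()


# Hypothetical one-year log, invented for illustration only.
log = [
    Outage("power", 120),
    Outage("disk", 15),
    Outage("power", 30),
    Outage("operator error", 45),
]

print(f"{availability(log, 365 * 24 * 60):.4%}")
for cause, minutes in downtime_by_cause(log):
    print(f"{cause}: {minutes} min")
```

Sorting by total downtime puts the biggest contributor at the top of the list, which is the one to fix first.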
Rather than getting buried in the details of the numbers, concern yourself with basic improvement. Trend upwards.
For more information on availability, view more tips by Evan L. Marcus.