The availability index
Evan L. Marcus
Ask computer users how much system uptime they want, and they'll all tell you the same thing: "I want 100 percent uptime." They want 100 percent uptime until they see the bill. That's because we live in a world where scheduled power blackouts regularly strike California, the heart of the modern-day computer world, and the most popular operating system is famous for something called The Blue Screen of Death. And there's something called Internet Time, where new software is expected to have more features, more lines of code, and more complexity than at any time in history, but it's expected to work without going through rigorous testing, because there isn't time. (Do you think there's a connection between products that are rushed into production and a decline in reliability?)
So what we have is a paradox. On the one hand, users expect higher and higher levels of availability. On the other, service providers deliver lower and lower levels of availability as their products grow more complex and delivery times get shorter and shorter.
How do we fix this problem? The best way is to start designing availability into our products and our systems in the earliest design phases. Designing availability into critical systems is an iterative process. It requires attention to detail and to process, and periodic reevaluations.
We need a working definition of High Availability. It is a phrase that has been totally co-opted by product marketers. To achieve high availability in computing systems, you don't necessarily need mirrored disks, failover, replication, or any of the other technologies that people have come to associate with high availability, at least not right away. Implemented properly, those technologies (and many others) will definitely help increase your systems' availability, but there are things you can do first that will get you a long way down that road. You will need these technologies as your availability needs increase, but before you can go there, you need to build a solid foundation.
My working definition of High Availability is that it is simply the level of availability that your users need to get their jobs done, when they want to. HA is the level of availability that allows your critical systems to meet the business requirements that those systems are expected and required to meet. How high is High? That depends on what you expect, and what you need. If the system is up 99.9 percent of the time, but not when you need it, then perhaps that 99.9 percent isn't that important, and the 0.1 percent is much more important. The system isn't available enough if it's not there when you need it. On the other hand, if your system is only up 30 percent of the time, but that 30 percent corresponds precisely with when it's needed, then that system is highly available.
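To make the distinction concrete, here is a minimal sketch that compares raw uptime with availability measured only over the hours when users need the system. The weekly schedule, the two example systems, and the function names are all hypothetical, chosen purely for illustration; the week is modeled as 168 one-hour slots.

```python
HOURS_PER_WEEK = 7 * 24  # the week modeled as 168 one-hour slots


def raw_uptime(up_hours):
    """Fraction of the whole week during which the system was up."""
    return len(up_hours) / HOURS_PER_WEEK


def needed_availability(up_hours, needed_hours):
    """Fraction of the needed hours during which the system was up."""
    if not needed_hours:
        return 1.0  # nothing was needed, so nothing was missed
    return len(up_hours & needed_hours) / len(needed_hours)


# Assumption: users need the system 08:00-18:00, Monday through Friday.
needed = {day * 24 + hour for day in range(5) for hour in range(8, 18)}

# Hypothetical system A: up only during the needed hours (about 30% of the week).
system_a_up = set(needed)

# Hypothetical system B: up all week except Monday's needed hours.
monday_needed = set(range(8, 18))
system_b_up = set(range(HOURS_PER_WEEK)) - monday_needed

for name, up in (("A", system_a_up), ("B", system_b_up)):
    print(
        f"System {name}: raw uptime {raw_uptime(up):.1%}, "
        f"availability when needed {needed_availability(up, needed):.1%}"
    )
```

In this sketch, system A has far lower raw uptime than system B, yet it is the one that is there every time its users need it.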
If you want to build high availability into your systems, and you try to start with failover (the function most commonly associated with high availability), that is like building a skyscraper and starting on the 25th floor. It just won't stand! In my day job, I often get calls from customers who say things like, "I have an application that crashes every two hours. Can we implement failover to make it more reliable?" The answer, once I stop laughing, is that you can, but it's a really bad idea to do so. The first job is to make the application work better; fix it so it only crashes once every six hours, then fix it so it only crashes once a day, then once a week. It may require a total application redesign, but if that's the right answer, then that's what you must do. The right solution will increase your availability and make your users much happier than implementing a failover solution that's no more than a band-aid. Before you go looking at expensive add-on products, get your own house in order.
In Figure 1, I offer the Availability Continuum. It's a way of classifying the level of availability that different systems require. The very bottom represents systems that do not need any protection at all. At the very pinnacle of the continuum are systems that simply cannot ever go down. These include life support and avionic systems, and other systems whose failure will lead to loss of life.
Every system has a place on the continuum. At #1 on the continuum would be e-commerce systems or equities trading floor systems. These are the systems that require the utmost in availability; when they are down, the losses to their respective businesses can be staggering. (One common metric is that an equities trader who cannot trade costs his firm $2 million every 20 minutes.)
At #2 might be systems that operate a robotic assembly line. The systems are unquestionably important to their businesses, but less so than the systems at #1. Systems at #3 are less critical still, such as a billing application that runs once a month. I leave it to you to determine what systems might be at #4 on the continuum.
Not all systems sit at the top of the continuum, despite their users' wishes, because it costs a lot of money to get them there. The higher the availability requirements for a system, the more the system will cost. And the relationship between availability and cost is far from a linear one. In Figure 2, we build a two-dimensional graph, showing availability on the Y-axis, and cost (or investment) on the X-axis.
The reason it costs so much more to attain higher levels of availability is that the outages you must protect against at higher levels are rarer, harder to defend against, and more complex. You may find yourself spending a lot of money to protect against a particular serious outage, and when you retire the system, you learn that the outage never occurred. But you still had to protect against it just in case.
You may ask yourself, as you look at Figure 2, "Where are the nines? Every availability graph I've ever seen has lots of nines in it." (By nines, I mean a discussion of levels of availability at 99 percent, 99.9 percent, 99.99 percent, etc.) I believe that the value of nines is way overstated. There are vendors out there who will "guarantee" levels of availability. But they do so with contracts that have more loopholes than the US Tax Code. A system vendor cannot unconditionally guarantee a level of uptime in California in 2001, when there are rolling blackouts going on. The electric power infrastructure is totally out of their control, as are many other aspects of system design and maintenance.
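For readers who do want to see what the nines mean in practice, the following short sketch converts an uptime percentage into the downtime it permits over a year. It assumes a 365-day year and counts every minute equally, which is exactly the simplification questioned above; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted at a given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% uptime allows about {minutes:.0f} minutes of downtime per year")
```

Each additional nine cuts the permitted downtime by a factor of ten, but the figure says nothing about when that downtime falls or whose fault it is.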
If you follow the trends and guidelines that we will outline, you will achieve higher and higher levels of availability. As for the actual percentage of uptime you achieve, your mileage will vary (and California mileage will almost certainly be lower).
Figure 3 is the index graph from Figure 2, with a set of product classes and disciplines laid over it. I call this the Availability Index, and it gives a visual depiction of the skyscraper discussion above. If you choose to start your protection model with (for instance) LAN clustering, but you don't have a proper system infrastructure, you will not achieve a level of availability that corresponds to your needs, and therefore you will not achieve High Availability.
In future columns, we will dig into the Availability Index at each level, and discuss the sorts of things that you need to have in order, so that you can actually achieve the levels of availability that your users and your business need. In the meantime, if you have suggestions for future columns, or want to discuss some of the topics I've raised in this one, please feel free to contact me; you can write me care of [email protected].
Copyright 2002, Evan Marcus