Coming soon to a system near you: instability. How can I make this prediction in the face of conventional wisdom that things are faster, cheaper and more reliable? Things are not always as they appear.
Reliability is the lack of failure. Availability is the ability to use a system. The latest systems are designed for higher availability, not necessarily higher reliability. Before you start the hate mail: I know most vendors are building the most reliable systems possible for the price customers are willing to pay.
To highlight the difference between reliability and availability, I will describe two examples: the IBM zSeries processors and a RAID-5 disk subsystem.
The zSeries has up to twenty general purpose processors. Each processor chip, like the latest Intel processors, contains two complete processors. Instructions are run in parallel on both and the results are compared to ensure accuracy. This is necessary because smaller traces on the chip lead to a greater chance of random errors caused by a number of internal and external conditions. IBM has also designed the zSeries so most models have spare chips that, working with the operating system software, can take over in the event of a failure.
This cross-check and spare-out design works extremely well. zSeries machines are rarely down. Most zSeries machines are relatively young and not running at maximum capacity. As machines age and customers utilize all processors, expect a very small percentage to have failures. The operating system hides these failures, but the system is degraded until a complete outage can be scheduled. You can expect a similar situation with the emerging Intel-based multiprocessor systems. Clustering is a method to maintain availability when repair is required.
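The cross-check and spare-out idea described above can be sketched in a few lines of Python. This is only an illustration, not how the zSeries hardware or its operating system actually implements it: the names `good_add`, `faulty_add` and `run_checked` are invented for this example, a "chip" is modeled as a pair of functions, and a fault is simulated by flipping one bit of the result.

```python
from typing import Callable, List, Tuple

Core = Callable[[int, int], int]
Chip = Tuple[Core, Core]  # hypothetical model: two cores that run the same instruction


def good_add(a: int, b: int) -> int:
    """A healthy core."""
    return a + b


def faulty_add(a: int, b: int) -> int:
    """A core with a simulated fault: the low bit of the result is flipped."""
    return (a + b) ^ 1


def run_checked(a: int, b: int, active: Chip, spares: List[Chip]) -> Tuple[int, Chip]:
    """Run on both cores of a chip and compare the results.

    On a mismatch the chip is retired and the work is retried on the next
    spare. Returns the agreed result and the chip that produced it.
    """
    chip = active
    while True:
        first, second = chip[0](a, b), chip[1](a, b)
        if first == second:
            return first, chip
        if not spares:
            raise RuntimeError("mismatch detected and no spare chips remain")
        chip = spares.pop(0)  # spare-out: the failing chip is no longer used


if __name__ == "__main__":
    spares = [(good_add, good_add)]
    result, chip = run_checked(2, 3, (good_add, faulty_add), spares)
    print(result, "spares left:", len(spares))
```

The caller still gets a correct answer, which is the availability story. The spare pool is now empty, which is the reliability story: the next fault on this machine has nowhere to go.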
RAID-5 by definition uses multiple disks and parity checking to ensure the ability to rebuild data when a disk is lost. While vendors use high quality components, the action of adding components increases the chance of a failure. It is almost normal for a 10 to 100 terabyte disk installation to experience a disk failure several times a month. This is a non-event since RAID-5 protects the data. As disk subsystems age, more failures can be expected.
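To make the parity idea concrete, here is a minimal Python sketch of rebuilding a lost block with XOR parity. It is a simplification under stated assumptions: real RAID-5 stripes data and rotates the parity blocks across all disks, which is omitted here, and the helper name `xor_blocks` and the sample data are made up for the example.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


if __name__ == "__main__":
    # Three data "disks" holding one block each (illustrative values).
    data = [b"disk", b"fail", b"safe"]
    parity = xor_blocks(data)

    # Disk 1 is lost. XOR of the survivors and the parity block restores it.
    rebuilt = xor_blocks([data[0], data[2], parity])
    print(rebuilt == data[1])
```

The same arithmetic shows the limit of the protection: with two blocks missing from the same stripe, the XOR of what remains no longer determines either one.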
Since vendors have overcome or hidden these failures, why will reliability get worse? Due to economic pressure, older equipment will be kept longer. Older equipment tends to have lower reliability. Single failures have no impact on availability. Multiple failures may impact availability.
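A rough way to see why more components and older components both matter is to compute the chance of one failure versus overlapping failures. The sketch below assumes each component fails independently with the same probability `p` during some fixed window (for example, the time needed to replace and rebuild a disk). Both that assumption and the sample numbers are hypothetical, and the function names are invented for this illustration.

```python
def p_at_least_one(p: float, n: int) -> float:
    """Chance that at least one of n independent components fails."""
    return 1 - (1 - p) ** n


def p_two_or_more(p: float, n: int) -> float:
    """Chance that two or more of n independent components fail in the same window."""
    none_fail = (1 - p) ** n
    exactly_one = n * p * (1 - p) ** (n - 1)
    return 1 - none_fail - exactly_one


if __name__ == "__main__":
    # Hypothetical per-component failure probabilities: "young" vs "aged" equipment.
    for p in (0.001, 0.01):
        for n in (8, 64, 512):
            print(f"p={p} n={n} "
                  f"one or more: {p_at_least_one(p, n):.4f} "
                  f"two or more: {p_two_or_more(p, n):.6f}")
```

Under these assumptions, raising either `n` (more components) or `p` (older components) raises both figures. The first is the failure that redundancy hides; the second is the multiple failure that may reach the user.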
Another factor is the retirement of older staff trained to design availability into systems. That is another column.
About John Weinhoeft:
For the past 30-plus years John Weinhoeft has had his hand in the computer industry. He recently retired from designing and managing the State of Illinois' centralized computer systems that served 100 agencies. John has authored and edited a number of analytical books published by Computer Technology Research Corporation. He is, or has been, a member of several computer organizations including the Computer Measurement Group and Central Illinois Personal Computer Users Group.