No surefire way to avoid running into hardware failures
About four years ago, our shop purchased four IBM ESS 2105 (Shark). This July, we coupled the cache and added disk drives. This weekend, the non-volatile storage memory card failed. Our in-house customer engineer installed replacement NVS. Access to half our drives went away. Then, the other cluster attempted Initial Machine Load and the whole ESS was not accessible. Powered off and on six times, took an hour to reach the failure point.
Phoenix was finally able to block access to the new NVS, allowing one cluster to come up. Their post-failure analysis was that the NVS was recognized as valid at the start, then, just before becoming ready, timed out because the bad replacement NVS did not respond. We had most of our 13 MVS systems down for a while, then were able to get about half of them up.
Obvious step to reduce outage is mirroring all volumes. IBM will be modifying ESS code quickly to allow continued operation. Dividing volume assignments by system to limit number of systems down for this type of outage will be done. Any suggestions?
When I started to read your question, I feared that I would have to decline to answer it because it started to sound like something for IBM technical support. But in fact, it turned into an excellent question.
Thank you for that.
There is no surefire way to avoid running into hardware failures, but here are some things that you can do. One of the best is to pre-test parts before you put them into production. This can be an expensive and time-consuming process, and if you are forced to do it while systems are already down, it can only make things worse.
Therefore, my advice is to keep a small spare-part stock on hand. You can test these parts more or less at your leisure, and when you need to replace a part with something from your stock, you can order a replacement and begin testing that in advance of needing it in production.
Another approach that may work better for you is to cannibalize parts from non-production systems so that you are effectively using those non-production systems as test beds for replacement parts.
Of course, the greatest risk of damage to these parts comes as you transfer them from one system to another, so they must be handled with the greatest of care. Make sure you wear anti-static bands and minimize the amount of time that the boards are touched.
You also want to minimize the effects of failed boards. The way to do that is to make sure that the board itself is not a single point of failure. That means maintaining two boards and splitting the mirrors so that one side of each mirror is handled by each board.
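As a rough illustration of that rule, here is a minimal Python sketch that checks a mirror layout for volumes whose two sides both depend on the same board. The volume names, board names and the `layout` table are hypothetical, made up for the example; they are not taken from the ESS or any other real product, and a real check would have to draw this information from your own configuration records.

```python
def shared_board_mirrors(mirrors):
    """Return the mirrors whose two sides sit on the same board.

    `mirrors` maps a volume name to a (board, board) pair, one entry
    per side of the mirror. All names here are hypothetical.
    """
    return sorted(name for name, (side_a, side_b) in mirrors.items()
                  if side_a == side_b)


def volumes_lost(mirrors, failed_board):
    """Return the mirrors that lose both sides if `failed_board` fails."""
    return sorted(name for name, sides in mirrors.items()
                  if all(board == failed_board for board in sides))


# Hypothetical layout: vol03 has both sides of its mirror on board_A.
layout = {
    "vol01": ("board_A", "board_B"),
    "vol02": ("board_B", "board_A"),
    "vol03": ("board_A", "board_A"),
}

at_risk = shared_board_mirrors(layout)
print("Mirrors with a single point of failure:", at_risk)
print("Lost if board_A fails:", volumes_lost(layout, "board_A"))
```

In this made-up layout only `vol03` is exposed, because it is the only mirror that does not span both boards. Any volume the check reports would need one side moved to the other board.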
I must admit that I am not familiar with the particular hardware environment you describe, but you'll find that, subject to hardware peculiarities and limitations, this advice should work in most environments.
Thank you for writing. I hope this was helpful.
Evan L. Marcus
This was first published in October 2003