Earlier this week, we ran a story about email hosting provider Intermedia attributing a recent outage to a failure in its EMC SAN. After the story ran, we received feedback from Bob Adams, a storage systems engineer at a leading Boston teaching hospital, on the case:
“I can’t see how Intermedia can truly blame this on EMC,” Adams wrote in an email.
First of all, the EMC SAN referred to here is clearly an EMC CLARiiON, based on the information provided. One of the storage processors had a failure, probably a bugcheck panic (like a Windows BSOD; CXs run a Windows OS on the SPs) due to a bug in the firmware, aka FLARE code. That indicates their SAN admin hadn’t been patching/updating the FLARE code on a regular basis, as he/she should be doing.
Then, after the failure, running on one storage processor is something the CLARiiON is designed to be able to do, for fault tolerance as well as load balancing. Again the SAN admin was at fault, because this CLARiiON was clearly overutilized. The utilization on the storage processors has to be kept within a CPU percentage range such that if one SP fails, the second SP can handle its own load plus the load of the other. Meaning, if the utilization of, say, SPA was 75% and the utilization of SPB was 75%, there is no way SPB would be able to handle the load if SPA failed, which sounds like what happened here. I see this as more Intermedia’s own fault than EMC’s.
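To make Adams’s arithmetic concrete, here is a minimal sketch in Python. It is our illustration, not an EMC tool: the function names and the 100% ceiling are hypothetical, and it assumes both storage processors have equal capacity and that a failed SP’s load moves to the survivor one-for-one, which is a simplification of real array behavior.

```python
def failover_load(spa_pct: float, spb_pct: float) -> float:
    """Utilization the surviving SP would see if its peer failed.

    Assumes equal-capacity SPs and that load transfers one-for-one
    (a simplifying assumption for illustration only).
    """
    return spa_pct + spb_pct


def can_survive_sp_failure(spa_pct: float, spb_pct: float,
                           ceiling_pct: float = 100.0) -> bool:
    """True if one SP could carry both loads without exceeding the ceiling.

    ceiling_pct is a hypothetical threshold; an admin might choose a
    lower value to leave extra headroom.
    """
    return failover_load(spa_pct, spb_pct) <= ceiling_pct


if __name__ == "__main__":
    # Adams's example: both SPs at 75% utilization.
    print(failover_load(75, 75))
    print(can_survive_sp_failure(75, 75))
    # A lighter pair of loads that would fit on one SP.
    print(can_survive_sp_failure(40, 45))
```

Under these assumptions, two SPs at 75% each would leave the survivor facing a combined load of 150%, well past what a single processor can carry, while a pair running at 40% and 45% would fit.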
What do you think? Comments operators are standing by…