Intermedia Chief Operating Officer Jonathan McCormick sent a letter to impacted customers last week explaining the reasons for the April 16-17 outage. McCormick also posted an update last Thursday on Intermedia's official blog.
According to the statement sent to customers:
At approximately 6:15 a.m. PT on Thursday 4/16, a hardware failure occurred on one of the EMC storage area networks (SANs) located in Intermedia's New Jersey data center. The service processor for one of the controller nodes had a failure. This failure caused the entire load for that SAN to be shifted to the service processor on the redundant controller node.
The spare capacity on the single service processor was not enough to handle the entire load of all systems connected to the SAN, which caused a degradation of performance for the reading and writing of data to the SAN. The degradation of performance on the SAN in turn impacted the overall system's ability to process email messages, creating a queue of several hundred thousand messages within the system. The backlog was large enough that it took 32 hours for it to clear after the original event. At approximately 2 p.m. PT on Friday 4/17, all systems were functioning normally and mail delivery was considered to be "real-time."
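As a rough illustration of why a backlog of that size can take many hours to clear: a queue drains only at the rate by which processing throughput exceeds new arrivals. The sketch below is not Intermedia's model, and every figure in it is hypothetical, since the company did not disclose its message rates.

```python
def drain_hours(backlog, arrival_per_hour, service_per_hour):
    """Return the hours needed to clear a backlog, or None if it never clears.

    All three inputs are illustrative assumptions, not figures from Intermedia.
    """
    surplus = service_per_hour - arrival_per_hour  # net messages cleared per hour
    if surplus <= 0:
        return None  # arrivals keep pace with processing, so the queue never shrinks
    return backlog / surplus


# Hypothetical numbers: 300,000 queued messages, 50,000 new messages per hour,
# and degraded systems able to process 60,000 messages per hour.
print(drain_hours(300_000, 50_000, 60_000))  # 30.0
```

Under these assumed numbers the surplus is only 10,000 messages per hour, which is why even a modest shortfall in throughput translates into a recovery measured in days rather than minutes.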
The statement continues:
* The vendor [EMC] determined that the service processor failure occurred due to a unique bug in the specific version of firmware on the system. This bug caused the service processor to "panic" and automatically take itself off line. As the first corrective action, on Friday 4/17 at 11 p.m. PT, our vendor performed an emergency upgrade to the version of firmware running on the SAN. This newer version of firmware has a fix for the bug that caused the failure we experienced.
* Since the outage, as the second corrective action, we have added additional processing capacity to the SMTP hub farm in this domain. We have also performed performance tuning on the SMTP hubs to guarantee that they are able to more rapidly process a larger than normal queue of messages.
* Over the next several weeks, we will be taking additional corrective actions to make certain that there is enough spare capacity on the SAN to guarantee that it performs without performance degradation in the case of a single hardware failure. An additional SAN is being installed this week and starting as early as this weekend we will begin to migrate a portion of the existing systems to the new SAN. Additionally, we have engaged our SAN vendor to review the performance tuning of our SAN and implement adjustments to increase its overall performance capabilities. These events in tandem will guarantee that the SAN will be able to perform without an impact to the service in the event we experience another individual hardware error.
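The capacity goal described in the statement, that the SAN should keep performing after a single hardware failure, amounts to a simple headroom check: the load must still fit on whatever remains when any one component is lost. The following is a minimal sketch of that check under assumed, made-up capacity units; the function name and numbers are hypothetical and do not describe EMC's or Intermedia's actual sizing method.

```python
def survives_single_failure(total_load, node_capacities):
    """Check whether the load still fits after losing any one node.

    total_load and node_capacities use the same arbitrary units
    (for example, I/O operations per second). Values are assumptions.
    """
    if len(node_capacities) < 2:
        return False  # no redundancy at all
    total = sum(node_capacities)
    # Remove each node in turn and see whether the rest can carry the load.
    return all(total - capacity >= total_load for capacity in node_capacities)


# Two controller nodes of 100 units each, carrying 140 units in total:
print(survives_single_failure(140, [100, 100]))  # False

# The same pair carrying 90 units has enough headroom:
print(survives_single_failure(90, [100, 100]))  # True
```

With two equal nodes, the check passes only while the combined load stays at or below the capacity of one node, that is, at or below 50 percent of total capacity. Adding a SAN and moving some systems onto it, as the statement describes, lowers the load on each array and so restores that margin.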
Intermedia declined to comment on which of EMC's SAN products was involved, and also declined to disclose the firmware level before and after the outage, citing security concerns. An EMC spokesperson also declined to comment.
"We can confirm that the issue impacted customers on two of our 21 domains," wrote Intermedia's spokesperson in an email to SearchStorage.com. "Impacted customers will be proactively credited on 4/23 under the terms of our service level agreement."
Today I received their formal RFO (Reasons for Outage) letter via email, which goes into great detail describing why this outage occurred and what steps they are taking to prevent a recurrence for the same reasons in future. In a nutshell, there was a hardware failure in one of their EMC SAN devices, and this failure occurred in such a way that it prevented the device's own built-in fault tolerance mechanisms from keeping the SAN effectively "up"; that is, they are saying this is one of those failures that should not have happened. These devices are designed precisely NOT to fail under such circumstances, but nonetheless this one did fail.
Intermedia's letter goes on to describe the actions they are taking along with the hardware vendor to guard against this in future. All well and good. Now on to the little gem in the letter that I found the most surprising, and from which all technologists with "uptime" responsibility for Software as a Service (SaaS) systems would do well to learn.
Mok declined comment on the most recent outage, saying his email service had not been affected this time.
In its email to customers, Intermedia acknowledged having received "significant constructive feedback regarding our communication throughout the outage…we have developed a new client notification tool that will be used by the Technical Support organization to proactively notify and communicate with clients during a service interruption." Intermedia's spokesperson confirmed there were two outages, but did not disclose further details.