This article can also be found in the Premium Editorial Download "Storage magazine: Strategies to take the sting out of microcode upgrades."
In January 2005, Steve Schaub discovered a nasty bug in some RAID 5 software on his previous employer's EMC Clariion CX500. To reduce the amount of tape in its Tivoli Storage Manager backup environment, the company had installed a CX500 for disk-based backup. When it kicked off the migration from tape to disk, the CX500 problems began. "EMC [told] us a disk in our RAID 5 set had gone bad; then, during the rebuild, a second disk failed," says Schaub. The firm needed to back up Oracle database logs to the backup server, but the server was down all weekend. "It was really close to a crisis situation," adds Schaub, who's now a systems engineer for backup and recovery at BlueCross BlueShield of Tennessee.
IBM's DS8000 bug
One notable microcode upgrade story involved problems IBM Corp. users encountered with early models of the DS8000 storage array in 2005. SearchStorage.com reported last August that IBM's newly shipping flagship array, which can cost more than $1 million, contained a bug that automatically shut down the system without warning every 49 days. IBM quickly released a microcode revision to fix the problem. However, users in Europe and North America reported major outages because the fix was a disruptive upgrade.
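The article doesn't say what caused the 49-day cycle, but the interval is consistent with a well-known class of firmware bug: an unsigned 32-bit millisecond uptime counter overflows after 2^32 ms, roughly 49.7 days. The following is a hypothetical sketch of that failure mode, not IBM's actual code:

```python
# Hypothetical illustration: a 32-bit millisecond uptime counter wraps
# after 2**32 ms -- about 49.7 days, matching the reported ~49-day cycle.
# This is an assumed failure mode, not IBM's confirmed root cause.

MS_PER_DAY = 24 * 60 * 60 * 1000


def days_until_wrap(counter_bits: int = 32) -> float:
    """Days before an unsigned millisecond counter of this width wraps to zero."""
    return (2 ** counter_bits) / MS_PER_DAY


def uptime_ms(raw_ticks: int, counter_bits: int = 32) -> int:
    """Simulate the truncation a fixed-width hardware counter performs."""
    return raw_ticks % (2 ** counter_bits)


print(round(days_until_wrap(), 1))  # 49.7

# After the wrap, a naive "elapsed = now - then" calculation goes wrong:
then = 2 ** 32 - 1000           # 1 second before the counter wraps
now = uptime_ms(then + 5000)    # 4 seconds after the wrap
print(now - then)               # negative: elapsed time appears to run backward
```

A firmware health check that trusts such a counter can conclude the system is in an impossible state and shut down, which is also why rebooting the array every six weeks, as the retailer below did, resets the counter and sidesteps the bug.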
A large North American clothing retailer, which prefers to remain anonymous to avoid any conflict, received the bug fix from IBM before it was affected by the problem, but for several months was unable to block off a large enough window of time to apply the patch. The company relied on outsmarting the bug by rebooting the system every six weeks before the counter timed out. With only a few hosts connected to the DS8000, rebooting the array took less than an hour. "It's been manageable, but certainly not seamless," says the user. "IBM's testing was crazy. They know that every customer on the planet expects to keep these things up forever" without requiring a reboot.
The firm eventually found a long enough window to apply the fix. IBM estimated it would take between seven and 12 hours to patch the code, but it took approximately three hours; IBM provided the higher estimate as a cushion in case something went wrong, says the user. Since applying the upgrade, "everything is good, knock on wood," he says.
On the other hand, the initial bug was enough to scare a U.S. utility company off the system altogether. "We decided not to go with the DS8000 because of the code problems; we didn't feel safe," says Chris Andriano, storage administrator at the company. It was planning to migrate from two IBM 2105 "Shark" arrays to the DS8000, but went with an EMC Corp. DMX2000 instead. "We would still consider IBM in the future," says Andriano. "In 12 months, it'll be a more stable, more viable product." The company uses third-party storage management software from StoreAge Networking Technologies Inc., so it isn't dependent on a single disk vendor. "We can go from IBM to EMC effortlessly and back again if we choose," says Andriano.
IBM acknowledged that the clock problem occurred in a small number of early systems, but declined to explain why. However, it did e-mail us the following response: "IBM is able to continuously and incrementally improve the field quality of its storage products through regular upgrades to higher levels of microcode. For IBM's external disk systems like the DS8000, DS6000, DS4000 and ESS, this upgrade process is non-disruptive to customers' normal production usage of the product."
As for Schaub's CX500, EMC engineers eventually determined that the problem was caused by a microcode bug that tricked the RAID 5 software into seeing a "phantom failure on the second disk during the rebuild," says Schaub. To compound the problem, Schaub and his team didn't know if the bug had caused the first disk to fail as well. The company lost a 900GB logical unit number (LUN), and Schaub had also been running a pre-release version of the latest CX500 firmware. "I'll never do that again," he says.
Another Clariion user, Kelly Carpenter, senior technical manager at the Genome Sequencing Center at the Washington University School of Medicine in St. Louis, is sick of hearing about bug fixes for the Clariion microcode. Carpenter says that going back two years, his configuration included 146GB disks that were supposed to serve as hot spares for both 73GB and 146GB disks. If a 73GB disk failed, the 73GB hot spare was supposed to take over transparently with no loss of service; the 146GB hot spare was supposed to function the same way for the 146GB disks.
When Carpenter had a second 73GB disk fail, the 146GB hot spare attempted to take over, but it failed and the disk array "panicked and crashed with no data loss, but complete loss of service," he says. The ability to use both disk sizes for hot spares was a standard feature in the Flare code, says Carpenter, but it didn't work on several occasions. EMC and partner Dell Inc. later admitted that it was a bug in the Flare code that would require an outage to fix.
The problem reappeared a year ago, requiring another Flare code upgrade that solved that issue but produced "data access issues" and another upgrade, says Carpenter. "I'm told that the new code will fix everything," he says. "Even if it's true, I have a hard time believing it ... two years' worth of experience says different." He says the reliability problems are due to EMC not testing the code in a wide enough variety of scenarios. Jay Krone, EMC's director of Clariion platform marketing, adamantly disagrees. "We test an awful lot of scenarios," he says. He described current interoperability matrices as being as thick as "the Manhattan Yellow Pages."
Large number of fixes
EMC says major upgrades add support for new hardware or introduce new features and functionality, while patches deliver functional improvements. On average, vendors will issue two major releases, approximately eight minor releases and up to 20 patches a year, says Ashish Nadkarni, senior consultant at Framingham, MA-based GlassHouse Technologies Inc., adding that "it's often more than this."
Patches can be administered non-disruptively, claims EMC, but when it comes to upgrading code on the Clariion, the firm's Krone notes that because the array is a two-board system, you're required to upgrade one processor at a time. "There is a blip in performance as half the resources are offline," he says. EMC recommends that customers upgrade at a time when the load is lower so that overall performance isn't affected.
To help make upgrades less complicated, EMC ships PowerPath/SE with the Clariion. This supports back-end failover for users who don't have dual host bus adapter (HBA)-equipped servers, but still require failover capabilities for their Clariion. The upgrade process works as follows: Processor A is taken offline and its LUNs are failed over to Processor B (or failed over using PowerPath/SE); the upgrade is applied to Processor A, which is then brought back online. The process is repeated for Processor B. Throughout, PowerPath/SE provides the hosts with continuous access to the LUNs.
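The rolling-upgrade sequence described above can be sketched in pseudocode-style Python. Names and steps are illustrative assumptions; the real procedure is driven by EMC's own tooling:

```python
# Hypothetical sketch of the two-controller rolling upgrade described above.
# Controller and LUN names are illustrative, not EMC's actual tooling.

def rolling_upgrade(controllers: list[str], luns: dict[str, str]) -> list[str]:
    """Upgrade one controller at a time, failing its LUNs over to the peer.

    `luns` maps each LUN name to the controller that currently owns it and is
    mutated in place; the returned log lists each step taken. Hosts keep
    access throughout because one controller always serves every LUN.
    """
    log = []
    for i, ctrl in enumerate(controllers):
        peer = controllers[(i + 1) % len(controllers)]
        # Fail over every LUN owned by the controller about to be upgraded.
        for lun, owner in luns.items():
            if owner == ctrl:
                luns[lun] = peer
                log.append(f"failover {lun}: {ctrl} -> {peer}")
        log.append(f"upgrade {ctrl}")   # the "blip": half the resources offline
        log.append(f"online {ctrl}")
    return log


luns = {"LUN0": "A", "LUN1": "B"}
for step in rolling_upgrade(["A", "B"], luns):
    print(step)
```

Note that after the second pass every LUN has landed on the first controller; redistributing LUNs back to their preferred owners is a separate post-upgrade step omitted here. The "blip in performance" Krone mentions corresponds to each `upgrade` step, when one controller carries the entire load.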
This was first published in March 2006