The first version of IBM's flagship DS8000 storage array, shipping since March, contains a bug that automatically shuts down the entire system without warning every 49 days, users in Europe and North America report.
IT Austria, a service provider and systems integrator in Austria, provides storage resources to the number one and number two banks in the country. It bought three of the 192-terabyte DS8000s as soon as the product was released. The major selling point for iT Austria was the system's partitioning capability, which allows it to run different environments within the same box. However, this feature became irrelevant when a lethal bug shut down the entire system.
"At the microcode level there was a counter built in counting down from 49 days to zero. At day zero the DS8000 shut itself down automatically," said Karin Poschel, head of department for central storage management at iT Austria.
"In my personal opinion the DS8000 has been shipped too early since problems and failures pop up almost every time you do something with the storage system," Poschel said. IT Austria is running microcode level 18.104.22.1686 and has not experienced any automatic shutdowns yet.
A North American user, who preferred to remain anonymous, is also "familiar with the 49-day counter thing." The user described the problem as follows: "It's an internal counter that counts the hours that the thing has been up," and when that time is up "essentially causes a buffer overflow with all sorts of unpredictable results."
To IBM's credit, it alerted this user to the bug before they were directly impacted by it. IBM provided the user with the microcode upgrade to fix the problem, but so far the user hasn't been able to block off a large enough window of time during which to perform the upgrade. IBM estimates that it will take between seven and 12 hours to upgrade the system.
Until the user can find a long enough window, they rely on outsmarting the bug by rebooting the system every six weeks, before the counter times out. With only a few systems connected to the DS8000, rebooting the system takes under an hour. "It's been manageable, but certainly not seamless."
The user added, "IBM's testing was crazy. They know that every customer on the planet expects to keep these things up forever" without requiring a reboot.
On a positive note
The user hasn't completely turned on IBM. "This certainly doesn't reflect well on them, but we did get a very early box. We knew there was some risk, although this seems like a pretty silly risk. Frankly, I was expecting something a bit more spectacular." IBM, meanwhile, "has managed the problem very well."
In other ways, the DS8000 "has a lot of redeeming qualities," the user said. As a replacement for the Shark 2105 serving up data to an I/O bound application, the DS8000 has easily met the user's performance expectations. In addition: "It's been relatively easy to provision and implement."
Bob Venable, manager of enterprise systems at Blue Cross Blue Shield of Tennessee, a big IBM shop, said he always waits for the 1.1 release of any new system. "Every product has some kinks when first released." [Ed note: To put it lightly].
Back across the pond, the Danish postal service, Post Denmark, has retired two DS8000s until IBM can put it in touch with other customers that are running the system in production. It experienced the fatal clock crash but avoided total chaos when a mirrored system took over.
Tony Asaro, senior analyst with the Enterprise Strategy Group, said that these problems are a reminder of the issues IBM encountered with its older ESS Shark arrays. "Reliability issues slowed down momentum for the original Shark … the channel was concerned about the release of the DS8000 because of what IBM had promised with Shark, to repeat that is very bad," he said.
When IBM announced the DS8000 in October 2004, it claimed that a key virtue of the system was its processor-based partitioning and diagnostics, which exploit its Power5 chip's self-healing capabilities -- both designed to avoid operational failures and to limit downtime.
"Every vendor says it has nondisruptive code upgrades; however, that's rarely the case," said Chuck Standerfer, senior analyst with the Evaluator Group. He noted that IBM is not alone in shipping systems into the field that are not fully baked. "Today's systems are so complex that they can't do all the testing of all the possible scenarios, so they prioritize testing to satisfy major requirements."
IBM issued a statement over e-mail in response to questions about this problem. "We have addressed all concerns in regards to the DS8000 brought to us by customers, which is having extremely good market acceptance, and we are unaware of any problems with systems shipping today."