IBM DS8000 bug shuts down users like clockwork

A bug in the form of a timer built into the firmware of IBM's DS8000 storage array spontaneously shuts down the system every 49 days, users report.

The first version of IBM's flagship DS8000 storage array, shipping since March, contains a bug that automatically shuts down the entire system without warning every 49 days, users in Europe and North America report.

IT Austria, a service provider and systems integrator in Austria, provides storage resources to the country's two largest banks. It bought three DS8000s, each with 192 terabytes of capacity, as soon as the product was released. The major selling point for iT Austria was the system's partitioning capability, which allows it to run different environments within the same box. However, this feature became irrelevant when a lethal bug shut down the entire system.

"At the microcode level there was a counter built in counting down from 49 days to zero. At day zero the DS8000 shut itself down automatically," said Karin Poschel, head of department for central storage management at iT Austria.

IBM informed iT Austria of the problem on May 27, and by May 30 the firm had received microcode revision 6.0.0.388 to fix it. Unfortunately, the repair was a disruptive upgrade. Another outage, this time for 22 hours, occurred when iT Austria needed to add a second frame to the DS8000 to expand its capacity. Several microcode versions later, iT Austria is crossing its fingers that the system stays up and running.

"In my personal opinion the DS8000 has been shipped too early since problems and failures pop up almost every time you do something with the storage system," Poschel said. IT Austria is running microcode level 6.0.0.446 and has not experienced any automatic shutdowns yet.

A North American user, who preferred to remain anonymous, is also "familiar with the 49-day counter thing." The user described the problem as follows: "It's an internal counter that counts the hours that the thing has been up," and when that time is up "essentially causes a buffer overflow with all sorts of unpredictable results."
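Neither user specifies the counter's width or tick unit, and the accounts differ on whether it counts up or down. The 49-day interval is nonetheless consistent with a well-known piece of arithmetic: an unsigned 32-bit counter incremented once per millisecond wraps to zero after 2^32 ms, or about 49.7 days. The Python sketch below works through that arithmetic under that assumption. The function names are hypothetical and nothing here is taken from IBM's microcode.

```python
# Illustrative only. ASSUMPTION: a 32-bit unsigned counter that ticks once
# per millisecond. The article does not confirm the DS8000 counter's width
# or tick unit; this just shows why such a counter runs out near day 49.

COUNTER_BITS = 32                      # assumed counter width
MS_PER_DAY = 24 * 60 * 60 * 1000       # 86,400,000 ms in a day


def days_until_wrap(bits: int = COUNTER_BITS) -> float:
    """Days of uptime before a millisecond counter of this width wraps."""
    return (2 ** bits) / MS_PER_DAY


def advance(counter: int, elapsed_ms: int, bits: int = COUNTER_BITS) -> int:
    """Advance the counter, wrapping the way fixed-width hardware would."""
    return (counter + elapsed_ms) % (2 ** bits)


if __name__ == "__main__":
    print(f"Wraps after {days_until_wrap():.2f} days")
    last_value = 2 ** COUNTER_BITS - 1
    print(f"{last_value} + 1 ms -> {advance(last_value, 1)}")
```

Code that treats the wrapped value as a real elapsed time, or that never expected the counter to reach its limit, can then behave unpredictably, which matches the symptoms the user describes.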

To IBM's credit, it alerted this user to the bug before they were directly impacted by it. IBM provided the user with the microcode upgrade to fix the problem, but so far the user hasn't been able to block off a large enough window of time during which to perform the upgrade. IBM estimates that it will take between seven and 12 hours to upgrade the system.

Until the user can find a long enough window, they rely on outsmarting the bug by rebooting the system every six weeks, before the counter times out. With only a few systems connected to the DS8000, rebooting the system takes under an hour. "It's been manageable, but certainly not seamless."

The user added, "IBM's testing was crazy. They know that every customer on the planet expects to keep these things up forever" without requiring a reboot.

On a positive note

The user hasn't completely turned on IBM. "This certainly doesn't reflect well on them, but we did get a very early box. We knew there was some risk, although this seems like a pretty silly risk. Frankly, I was expecting something a bit more spectacular." IBM, meanwhile, "has managed the problem very well."

In other ways, the DS8000 "has a lot of redeeming qualities," the user said. As a replacement for the Shark 2105 serving up data to an I/O bound application, the DS8000 has easily met the user's performance expectations. In addition: "It's been relatively easy to provision and implement."

Bob Venable, manager of enterprise systems at Blue Cross Blue Shield of Tennessee, a big IBM shop, said he always waits for the 1.1 release of any new system. "Every product has some kinks when first released." [Ed note: To put it lightly].

Back across the pond, the Danish postal service, Post Denmark, has retired two DS8000s until IBM can put it in touch with customers that are running the system in production. It experienced the fatal clock crash but avoided total chaos when a mirrored system took over.

Tony Asaro, senior analyst with the Enterprise Strategy Group, said that these problems are a reminder of the issues IBM encountered with its older ESS Shark arrays. "Reliability issues slowed down momentum for the original Shark … the channel was concerned about the release of the DS8000 because of what IBM had promised with Shark, to repeat that is very bad," he said.

When IBM announced the DS8000 in October 2004, it claimed that key virtues of the system were its processor-based partitioning and its diagnostics that exploit the Power5 chip's self-healing capabilities -- both designed to avoid operational failures and to limit downtime.

"Every vendor says it has nondisruptive code upgrades however that's rarely the case," said Chuck Standerfer, senior analyst with the Evaluator Group. He noted that IBM is not alone in shipping systems into the field that are not fully baked. "Today's systems are so complex that they can't do all the testing of all the possible scenarios so they prioritize testing to satisfy major requirements."

IBM issued a statement over e-mail in response to questions about this problem. "We have addressed all concerns in regards to the DS8000 brought to us by customers, which is having extremely good market acceptance, and we are unaware of any problems with systems shipping today."
