Published: 12 Apr 2007
Recent independent studies from Google and Carnegie Mellon University have concluded that disk drive failure rates are considerably higher than the rates reported by disk drive manufacturers. But, it turns out, many users may not care.
At a Usenix conference in San Jose, CA, this past February, Google released its study, which found an 8% annual failure rate for drives in service for two years. That's one out of every 12 drives.
Manufacturers claim the mean time to failure (MTTF) of Fibre Channel (FC) and SATA drives ranges between 1,000,000 and 1,500,000 hours, suggesting a nominal annual failure rate of at most 0.88%.
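The 0.88% figure follows from dividing the hours in a year by the quoted MTTF. The short Python sketch below reproduces that arithmetic. It assumes a drive that is powered on around the clock and a constant failure rate. The function name `nominal_afr` is illustrative and does not come from either study.

```python
# Hours a drive accumulates in one year, assuming it is never powered off.
HOURS_PER_YEAR = 24 * 365  # 8,760

def nominal_afr(mttf_hours: float) -> float:
    """Approximate annual failure rate implied by a datasheet MTTF."""
    return HOURS_PER_YEAR / mttf_hours

# The two ends of the MTTF range quoted by manufacturers.
for mttf in (1_000_000, 1_500_000):
    print(f"MTTF {mttf:,} h -> nominal AFR {nominal_afr(mttf):.2%}")
# MTTF 1,000,000 h -> nominal AFR 0.88%
# MTTF 1,500,000 h -> nominal AFR 0.58%

# Google's reported 8% rate for two-year-old drives, against the higher nominal rate.
observed = 0.08
print(f"Observed vs. nominal: {observed / nominal_afr(1_000_000):.1f}x")
# Observed vs. nominal: 9.1x
```

By this rough measure, the 1,000,000-hour datasheet figure implies about one failure per 114 drives per year, against the roughly one in 12 that Google observed.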
"Typically, this problem does not hit home for me because vendor support contracts offset the cost associated with the drive replacements," says Earl Hartsell, senior IT analyst at Solvay Pharmaceuticals, Marietta, GA. "It would take a relatively large increase in support costs for this problem to become a pain point."
Similarly, Mark Holt, information technology specialist at Media General in Richmond, VA, says failure rates help manufacturers control support costs, but don't mean much to users. "We have very little interest in that magic number," says Holt. "The complexity of systems means a failure generally isn't worth chasing down; we only want to know if the vendor or supplier is going to be there quickly when we do lose a drive, for whatever reason."
Carnegie Mellon's study of approximately 100,000 consumer and enterprise drives concludes that failure rates are, in some cases, 13 times greater than a vendor's published MTTF would suggest. Furthermore, the study shows no evidence that FC drives are any more reliable than less-expensive, slower-performing SATA drives.
Google's research, begun in 2001, relies on data collected on more than 100,000 serial and parallel ATA consumer-grade disk drives, ranging in speed from 5,400 rpm to 7,200 rpm and in size from 80GB to 400GB. At least nine different drive models from many of the largest disk drive manufacturers were included; Google didn't release names.
Google's report found very little correlation between drive failure rates and either elevated temperature or activity levels. It also concluded that Self-Monitoring, Analysis and Reporting Technology (SMART), the hard drive feature that warns users of problems before failure, isn't a reliable predictor.
"How do you argue with Google?" says W. Curtis Preston, VP of data protection services at GlassHouse Technologies, Framingham, MA. "The drive vendors won't enjoy hearing that disk drives are a pure commodity and basically all the same thing."
Storage vendors were quick to point out holes in the studies. "Humidity, PoH [power-on hours], vibration, isolation and general cabinet design all make a huge difference," notes John Joseph, VP of marketing at EqualLogic and formerly head of marketing at Quantum. "The variability across such a large population [of drives] also makes it difficult to come to sound conclusions."
Regarding the Carnegie Mellon report, Joseph says the university took data from various high-performance computing and Internet service sites, suggesting different workloads. "This might explain the Fibre Channel vs. SATA results ... workload is a major factor and must be taken into consideration," he says.
A Seagate representative sent the company's response via email: "Unfortunately, after quite a lengthy debate amongst the management team, we have opted not to comment on the study except to address a few issues."
According to Seagate, the conditions surrounding drive failures are "complicated and require a detailed failure analysis to determine what the failure mechanisms were along with their dependencies." The company adds: "One must also be clear about the differences between drive replacements vs. drive failures." Drive replacements can include drives that haven't failed but may have been rejected by a system for a non-disk-related reason (e.g., OS damage due to virus or malware); these can make up as much as 40% of that replacement population.
Media General's Holt agrees. "The authors of the Google report explain the difference between replacement and failure, but continue to treat them as the same thing statistically," he says.
Fujitsu and Hitachi declined to comment on the studies.
"I think it points out the need to begin searching for a nonmechanical means to reliably store enterprise data," says Hartsell at Solvay Pharmaceuticals. He's been researching flash-based storage, but says development is pretty slow.
Meanwhile, Chuck Hollis, VP of technology alliances at EMC, calls the whole debate "much ado about nothing. The industry invented RAID precisely to mitigate this problem," he says.