Users Need Better Way to Predict Disk Failures

This article can also be found in the Premium Editorial Download: Storage magazine: Five cutting-edge storage technologies

All disk drive manufacturers happily publish their products' mean time between failure (MTBF) ratings, numbers like 500,000 hours for desktop-class drives, or 1 million hours for SCSI and Fibre Channel drives. But many users feel that the MTBF metric is largely irrelevant in a real-world storage environment.

MTBF "is basically a useless number" from a user's perspective, says Mike Chenery, vice president of advanced product engineering at Fujitsu Computer Products of America, a disk drive manufacturer. "What they really want to know is 'How many of my drives are going to fail?'"

Part of the problem with MTBF is that not everyone understands how it is calculated. An MTBF of 1 million hours, for example, does not mean that a drive will run for 114 years (1,000,000 hours divided by 24*365 hours per year) before it fails, as some users may conclude. Rather, it's a statistical metric derived from testing a large number of drives for a number of days and determining the mean failure rate for the entire group, explains Aloke Guha, CTO at Copan Systems, a startup that makes a disk-based backup system built with low-cost ATA drives.
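To make the arithmetic concrete, here is a minimal Python sketch. It assumes a constant failure rate (the textbook exponential model), which is an assumption of this example and not something drive vendors or the people quoted here describe. The function names and the `power_on_hours` parameter are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def mtbf_in_years(mtbf_hours):
    """Naive conversion of an MTBF rating into years (the misreading users make)."""
    return mtbf_hours / HOURS_PER_YEAR


def annual_failure_probability(mtbf_hours, power_on_hours=HOURS_PER_YEAR):
    """Chance that one drive fails within a year of use.

    Assumes a constant failure rate of 1 / MTBF per powered-on hour
    (exponential model). Real drives do not necessarily behave this way.
    """
    return 1.0 - math.exp(-power_on_hours / mtbf_hours)


if __name__ == "__main__":
    mtbf = 1_000_000  # hours, the rating quoted for SCSI and Fibre Channel drives
    print(f"Naive reading: {mtbf_in_years(mtbf):.1f} years")
    print(f"Estimated yearly failure probability: {annual_failure_probability(mtbf):.2%}")
```

Under that assumption, a 1 million hour MTBF works out to a yearly failure probability of a little under 1% per drive running around the clock, which is a very different statement from "this drive will last 114 years."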

The issue is further complicated because some drive vendors "tend to play games with the numbers," says Guha. For example, two drives may both have MTBFs of 500,000 hours, but one may be rated for a 24/7/365 duty cycle and the other for only an 8-hour duty cycle. "Read the fine print," he warns.

Unlike MTBF, another metric called annualized failure rate (AFR) will give you a sense of how often you can expect a drive to fail. AFR is calculated from the number of drives that are returned to the manufacturer and confirmed to be defective. Enterprise drives tend to have AFRs of under 1%. Thus, in a population of 1,000 enterprise drives, you can expect up to about 10 drives to fail per year.
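The fleet-level arithmetic can be sketched in a few lines of Python. This is an illustration only: it assumes every drive fails independently with the same AFR, and the function names are hypothetical.

```python
def expected_failures(drive_count, afr):
    """Expected number of drive failures per year for a fleet."""
    return drive_count * afr


def prob_at_least_one_failure(drive_count, afr):
    """Chance that at least one drive in the fleet fails within a year.

    Assumes drives fail independently of one another, which is a
    simplification made for this example.
    """
    return 1.0 - (1.0 - afr) ** drive_count


if __name__ == "__main__":
    drives = 1_000
    afr = 0.01  # 1%, the upper bound cited for enterprise drives
    print(f"Expected failures per year: {expected_failures(drives, afr):.0f}")
    print(f"Chance of at least one failure: {prob_at_least_one_failure(drives, afr):.4%}")
```

The point of the second function is that even a low per-drive AFR makes some failure nearly certain once the fleet is large, which is why users care about the fleet number and not the per-drive rating.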

In this day and age of very large disk drives, it's good to be able to predict your drive failure rates, says Dick Benton, senior consultant with Glasshouse Technologies in Framingham, MA. That's because the larger a disk drive in a RAID set, the longer it takes to rebuild, and the longer you are exposed to a second drive failure and total data loss.
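A rough Python sketch of that exposure window follows. Everything numeric in it is an assumption chosen for illustration: the capacities, the sustained rebuild rate, the eight-drive RAID set, and the constant-failure-rate model. None of these figures come from the vendors or consultants quoted here.

```python
import math

HOURS_PER_YEAR = 24 * 365


def rebuild_hours(capacity_gb, rebuild_mb_per_s):
    """Time to rebuild one drive at a sustained rate (decimal GB and MB)."""
    return capacity_gb * 1000 / rebuild_mb_per_s / 3600


def prob_second_failure(surviving_drives, afr, window_hours):
    """Chance that any surviving drive fails during the rebuild window.

    Converts the AFR to a constant hourly failure rate and assumes
    drives fail independently; both are simplifications.
    """
    hourly_rate = -math.log(1.0 - afr) / HOURS_PER_YEAR
    return 1.0 - math.exp(-surviving_drives * hourly_rate * window_hours)


if __name__ == "__main__":
    for capacity_gb in (250, 500):  # hypothetical drive sizes
        window = rebuild_hours(capacity_gb, rebuild_mb_per_s=25)  # assumed rate
        risk = prob_second_failure(surviving_drives=7, afr=0.01, window_hours=window)
        print(f"{capacity_gb} GB: rebuild {window:.1f} h, second-failure risk {risk:.5%}")
```

With the rebuild rate held fixed, doubling the capacity doubles the rebuild window and roughly doubles the chance of losing a second drive before the rebuild completes.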

The problem with AFR is that vendors don't generally publish it. But Copan's Guha, for one, thinks that "the time may have come for users to demand 'What is the reported AFR?'"

This was last published in October 2004
