A better way to predict disk failures

This article first appeared in "Storage" magazine in their October issue. For more articles of this type, please visit www.storagemagazine.com.

What you will learn from this tip: The question you should be asking your disk drive vendor when ascertaining real-world failure rates.

All disk drive manufacturers happily publish their products' mean time between failure (MTBF) ratings, numbers like 500,000 hours for desktop-class drives, or 1 million hours for SCSI and Fibre Channel drives. But many users feel that the MTBF metric is largely irrelevant in a real-world storage environment.

MTBF "is basically a useless number" from a user's perspective, says Mike Chenery, vice president of advanced product engineering at Fujitsu Computer Products of America, a disk drive manufacturer. "What they really want to know is 'How many of my drives are going to fail?'"

Part of the problem with MTBF is that not everyone understands how it is calculated. An MTBF of 1 million hours, for example, does not mean that a drive will last 114 years (1,000,000 ÷ (24 × 365) ≈ 114), as some users may conclude. Rather, it's a statistical metric derived from testing a large number of drives for a number of days and determining the mean failure rate for the entire group, explains Aloke Guha, chief technology officer at Copan Systems, a startup that makes a disk-based backup system built with low-cost ATA drives.
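The arithmetic behind that distinction can be sketched in a few lines. Under a constant-failure-rate (exponential) model, an MTBF quoted in hours translates into the fraction of a large drive population expected to fail each year. This is an illustrative model, not a formula any vendor publishes:

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def annual_failure_fraction(mtbf_hours: float) -> float:
    """Expected fraction of a large drive population failing per year,
    assuming a constant failure rate (exponential lifetime model)."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

# A 1,000,000-hour MTBF does not mean a single drive lasts ~114 years;
# it implies that just under 1% of a large population fails each year.
print(f"{annual_failure_fraction(1_000_000):.3%}")  # ~0.872%
print(f"{annual_failure_fraction(500_000):.3%}")    # ~1.737%
```

Note that halving the MTBF roughly doubles the expected yearly failure fraction, which is why the per-drive "years" reading of MTBF is so misleading.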

The issue is further complicated because some drive vendors "tend to play games with the numbers," says Guha. For example, two drives may both have MTBFs of 500,000 hours, but one may be rated for a 24/7/365 duty cycle while the other is rated for only an 8-hour duty cycle. "Read the fine print," he warns.

Unlike MTBF, another metric called annualized failure rate (AFR) will give you a sense of how often you can expect a drive to fail. AFR is calculated from the number of drives that are returned to the manufacturer and confirmed to be defective. Enterprise drives tend to have AFRs of under 1%. Thus, in a population of 1,000 enterprise drives, it's safe to assume that up to about 10 drives will fail per year.

In this day and age of very large disk drives, it's good to be able to predict your drive failure rates, says Dick Benton, senior consultant with Glasshouse Technologies in Framingham, Mass. That's because the larger the disk drives in a RAID set, the longer a rebuild takes, and the longer the rebuild, the longer you are exposed to a second drive failure and potential total data loss.
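The rebuild-window exposure Benton describes can also be roughed out numerically. Under the same constant-failure-rate assumption as before, the chance that one of the surviving drives in a RAID set fails during the rebuild grows with the length of the rebuild. The drive counts, rebuild times, and MTBF below are hypothetical illustration values, not figures from the article:

```python
import math

def second_failure_prob(surviving_drives: int,
                        rebuild_hours: float,
                        mtbf_hours: float) -> float:
    """Probability that any surviving drive in a RAID set fails during
    the rebuild window, under a constant-failure-rate model."""
    return 1 - math.exp(-surviving_drives * rebuild_hours / mtbf_hours)

# Hypothetical 7+1 RAID-5 set of 500,000-hour-MTBF drives:
# a 12-hour rebuild vs. a 48-hour rebuild on much larger drives.
print(f"{second_failure_prob(7, 12, 500_000):.4%}")
print(f"{second_failure_prob(7, 48, 500_000):.4%}")
```

The absolute probabilities are small, but quadrupling the rebuild time roughly quadruples the exposure, which is the point of Benton's warning.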

The problem with AFR is that vendors don't generally publish it. But Copan's Guha, for one, thinks that "the time may have come for users to demand 'What is the reported AFR?'"

For more information:

Tip: Steeling SATA for duty

Advice: Assessing failure rates for tape

Advice: SCSI/FC disks vs. IDE/ATA disks

About the author: Alex Barrett is "Storage" magazine's trends editor.

This was first published in November 2004
