This article can also be found in the Premium Editorial Download "Storage magazine: Exploring systems that detect and repair hard disk problems automatically."
Several storage system vendors claim their products can detect and repair hard disk problems automatically. Here's how they do it and the low-down on how well they work.
A fundamental change in the basic building blocks of storage is occurring, one that's as groundbreaking today as RAID was when it was introduced 20 years or so ago. The revolutionary development is commonly referred to as "autonomic self-healing storage," and it promises greater reliability from disk systems than ever before.
Autonomic self-healing storage might sound more like a trumped-up term than a fundamental change. After all, hasn't it been around for a while in the form of RAID, redundant array of independent nodes (RAIN), snapshots, continuous data protection (CDP) and mirroring?
If you define self-healing as the ability to restore from a failure situation, you'd be right. All of those familiar technologies are designed to restore data from a failure situation. But to be a bit more precise, those technologies are actually self-healing data, not self-healing storage. They restore data when there's a storage failure and mask storage failures from the apps -- they don't restore the actual storage hardware.
Self-healing storage is more accurately defined as transparently restoring both the data and storage from a failure. That might seem like splitting hairs, but it's not. It's the difference between treating the symptoms and fixing the cause.
What happens when a disk fails
The lowest common denominator in standard storage systems today is the hard disk drive (HDD). It's the only electro-mechanical device in the storage system, and it has the highest probability of failure, that is, the lowest mean time between failures (MTBF), of any component (see "MTBF: The odds of failure," below). It's well documented that the HDD is the Achilles' heel of a storage system.
MTBF: The odds of failure
A disk manufacturer's mean time between failures (MTBF) rating for a hard disk drive (HDD) lets you forecast the drive's useful operational life. The more HDDs there are in a system, the more often that system will see a drive failure. The general formula for calculating the average time between drive failures within a system is as follows:
Average time between drive failures = HDD MTBF / number of HDDs in the system
Using the manufacturers' MTBF numbers (approximately 1.5 million hours for enterprise-class Fibre Channel and SAS HDDs, and approximately 600,000 hours for SATA HDDs), a system with 240 enterprise drives should expect a hard disk drive failure every 260 days: 1,500,000/240 = 6,250 hours, or about 260 days (about 1.4 failures per year; rounded up to two HDDs per year, that's approximately a 0.8% replacement rate). If the HDDs are SATA, the system should expect an HDD failure every 104 days: 600,000/240 = 2,500 hours, or about 104 days (about 3.5 failures per year; rounded up to four HDDs per year, that's approximately a 1.67% replacement rate).
Unfortunately, manufacturer MTBF numbers don't reliably reflect real-world MTBFs. The Computer Science department at Carnegie Mellon University in Pittsburgh ran stress tests of 100,000 Fibre Channel, SAS and SATA hard disk drives. Its published results indicate that a typical drive (Fibre Channel, SAS or SATA) has a realistic MTBF of approximately six years, or 52,560 hours. Using Carnegie Mellon's MTBF numbers, a storage system with 240 HDDs can expect a drive failure approximately every nine to 10 days: 52,560/240 = 219 hours, or about 9.1 days (approximately 40 HDDs per year, or an annual replacement rate of 16.67%).
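The sidebar's arithmetic is easy to reproduce for other system sizes. The following is a minimal Python sketch, not a vendor tool: the function name `system_failure_stats` and the scenario labels are illustrative assumptions, and the only inputs taken from the sidebar are the three MTBF figures and the 240-drive system size. It assumes a 365-day year and a constant failure rate, as the sidebar's formula does.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def system_failure_stats(drive_mtbf_hours, drive_count):
    """Apply the sidebar formula: system interval = drive MTBF / number of drives.

    Returns (hours between failures, days between failures,
             expected failures per year, annual replacement rate as a fraction).
    """
    hours_between = drive_mtbf_hours / drive_count
    days_between = hours_between / HOURS_PER_DAY
    failures_per_year = HOURS_PER_YEAR / hours_between
    replacement_rate = failures_per_year / drive_count
    return hours_between, days_between, failures_per_year, replacement_rate


# MTBF figures quoted in the sidebar (hours); labels are illustrative.
scenarios = {
    "Enterprise FC/SAS, vendor MTBF": 1_500_000,
    "SATA, vendor MTBF": 600_000,
    "Carnegie Mellon estimate": 52_560,
}

for label, mtbf in scenarios.items():
    hours, days, per_year, rate = system_failure_stats(mtbf, 240)
    print(
        f"{label}: one failure every {hours:,.0f} hours "
        f"({days:.1f} days), {per_year:.1f} failures/year, "
        f"{rate:.2%} annual replacement rate"
    )
```

The sketch reports unrounded failure counts. The sidebar rounds those counts up to whole drives for the two vendor-MTBF cases, so its replacement rates for those cases come out slightly higher than the values printed here; the Carnegie Mellon case matches exactly.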
This was first published in June 2009