This article can also be found in the Premium Editorial Download "Storage magazine: Exploring systems that detect and repair hard disk problems automatically."
The unprecedented growth of data in most companies has led to an explosion of storage systems and hard disk drives. Statistically, as HDDs proliferate, so will the number of hard disk drive failures, which can lead to lost data. Analyzing what happens when a HDD fails illustrates the issue:
1) A hard disk drive fails.
2) The drive must be physically replaced, either manually or from an online pool of drives.
3) Depending on the RAID set level, the HDD's data is rebuilt on the spare:
4) The time it takes to rebuild the HDD's data depends on the hard disk drive's capacity, speed and RAID type.
- 3 with double parity.)
- SATA drives typically have a rated non-recoverable read error rate of 1 in 10^14: roughly 1 out of every 100,000,000,000,000 bits read will have a non-recoverable read error. This means that a seven-drive RAID 5 group with 1 TB SATA drives will have approximately a 50% chance of failing during a rebuild, resulting in the loss of the data in that RAID group.
- Enterprise-class drives (Fibre Channel or SAS) are rated at 1 in 10^15 for non-recoverable read errors, which translates into less than a 5% chance of the RAID 5 group having a failure during a rebuild.
- RAID 6 eliminates the risk of data loss should a second HDD fail. You pay for that peace of mind with decreased write performance vs. RAID 5, and an additional parity drive in the RAID group.
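The rebuild-risk figures in the bullets above can be approximated with a short calculation. The sketch below is a simplified model, not the method used to derive the article's numbers. It assumes the rebuild reads every bit of the six surviving 1 TB drives, that a terabyte is 10^12 bytes, and that each bit fails independently at the rated error rate. The function name and parameters are illustrative.

```python
import math

def rebuild_failure_probability(drives_read, drive_bytes, bits_per_error):
    """Chance of at least one non-recoverable read error while reading
    every bit of the surviving drives during a RAID 5 rebuild."""
    bits_read = drives_read * drive_bytes * 8
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_error))

TB = 10**12  # assumption: decimal terabyte, as drive vendors count it

# Seven-drive RAID 5 group: one drive has failed, six must be read in full.
sata = rebuild_failure_probability(6, TB, 1e14)
enterprise = rebuild_failure_probability(6, TB, 1e15)

print(f"SATA (1 in 10^14):       {sata:.1%}")
print(f"Enterprise (1 in 10^15): {enterprise:.1%}")
```

Under these assumptions the model gives about 38% for the SATA group and just under 5% for the enterprise group. The first figure is somewhat below the approximate 50% quoted above, and the second agrees with the "less than 5%" figure. Real-world results depend on how much of each drive is actually read and on how closely drives match their rated error rates.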
Eventually, the hard disk drive is sent back to the factory. Using the MTBF example in "MTBF: The odds of failure," this suggests that there'll be approximately 40 HDD "service events" per year.
Most storage admins might be surprised by what happens when a HDD is sent back to the factory. After being run through the factory's failure analysis process, the results for the vast majority of failed hard disk drives (somewhere between 67% and 90%) will be "no failure found" -- the HDD is fine. But the service event still took place and the RAID data rebuild still had to occur. That's a lot of operational hassle for "no trouble found."
Undetected data corruption
Another problem with HDDs that's rarely mentioned but actually quite prevalent is "silent data corruption." Silent data corruptions are storage errors that go unreported and undetected by most storage systems, resulting in corrupt data being provided to an application with no warning, logging, error messages or notification of any kind.
Most storage systems don't detect these errors, which affect on average 0.6% of SATA HDDs and 0.06% of enterprise HDDs over 17 months (from "An Analysis of Data Corruption in the Storage Stack," L.N. Bairavasundaram et al., presented at FAST '08 in San Jose, Calif.). Silent data corruption occurs when the RAID doesn't detect data corruption errors such as misdirected or lost writes. It also occurs with a torn write -- data that's partially written and merges with older data, so the data ends up part original data and part new data. Because the hard disk drive doesn't recognize the errors, the storage system isn't aware of them either, so there's no attempt at a fix.
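A torn write is easier to picture with a small simulation. The sketch below is illustrative only: the 4,096-byte block, the 512-byte sector and the `torn_write` helper are assumptions, not details from the article. The CRC comparison is included simply to show one way a mismatch could be noticed; it is not a description of how any particular storage system works.

```python
import zlib

BLOCK_SIZE = 4096   # hypothetical block size
SECTOR_SIZE = 512   # hypothetical sector size

def torn_write(old_block, new_block, sectors_written):
    """Return the block as it would sit on disk if only the first
    `sectors_written` sectors of the new data reached the media."""
    cut = sectors_written * SECTOR_SIZE
    return new_block[:cut] + old_block[cut:]

old_block = b"A" * BLOCK_SIZE
new_block = b"B" * BLOCK_SIZE

# The writer records a checksum of the data it intended to store.
expected_crc = zlib.crc32(new_block)

# Only 3 of the 8 sectors are written before the write is interrupted.
on_disk = torn_write(old_block, new_block, sectors_written=3)

# The block is full-sized and readable, so nothing looks wrong...
print("full-size block:", len(on_disk) == BLOCK_SIZE)
# ...but it starts with new data and ends with old data.
print("head is new data:", on_disk[:SECTOR_SIZE] == new_block[:SECTOR_SIZE])
print("tail is old data:", on_disk[-SECTOR_SIZE:] == old_block[-SECTOR_SIZE:])
# Comparing against the recorded checksum exposes the mismatch.
print("checksum matches:", zlib.crc32(on_disk) == expected_crc)
```

The read succeeds and returns a block of the right size, which is why the error stays silent: nothing at the drive level signals a problem. In this sketch, only the comparison with a checksum recorded at write time shows that the contents are not what the application wrote.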
This was first published in June 2009