This article can also be found in the Premium Editorial Download "Storage magazine: Exploring systems that detect and repair hard disk problems automatically."

The unprecedented growth of data in most companies has led to an explosion of storage systems and hard disk drives. Statistically, as HDDs proliferate, so will the number of hard disk drive failures, which can lead to lost data. Analyzing what happens when an HDD fails illustrates the issue:

1) A hard disk drive fails.

2) The drive must be physically replaced, either manually or from an online pool of drives.

3) Depending on the RAID set level, the HDD's data is rebuilt on the spare:

  • RAID 1 and RAID 10 rebuild the hard disk drive's data from its mirror; RAID 3/4/5/6/60 reconstruct it from parity
  • RAID 0 can't rebuild the HDD's data

4) The time it takes to rebuild the HDD's data depends on the hard disk drive's capacity, speed and RAID type.

  • A 1 TB 7,200 rpm SATA HDD with RAID 5 will take approximately 24 to 30 hours to rebuild, assuming the process is given a high priority.
  • If the rebuild process is given a low priority and made a background task to be completed in off hours, the rebuild can take as long as eight days. During the rebuild, the RAID group is at higher risk of a second disk failure or a non-recoverable read error, either of which would lead to lost data. This is because the rebuild must read every byte on every surviving drive in the RAID group to reconstruct the data. (Exceptions are RAID 6, RAID 60 and NEC Corp. of America's D-Series RAID 3 with double parity, which can tolerate a second failure.)
    • SATA drives typically have a rated non-recoverable read error rate of 1 in 10^14: roughly 1 out of every 100,000,000,000,000 bits read will have a non-recoverable read error. This means that a seven-drive RAID 5 group with 1 TB SATA drives will have approximately a 50% chance of failing during a rebuild, resulting in the loss of the data in that RAID group.
    • Enterprise-class drives (Fibre Channel or SAS) are rated at 1 in 10^15 for non-recoverable read errors, which translates into less than a 5% chance of the RAID 5 group having a failure during a rebuild.
    • RAID 6 eliminates the risk of data loss should a second HDD fail. You pay for that peace of mind with decreased write performance vs. RAID 5, and an additional parity drive in the RAID group.
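The read-error odds quoted above can be approximated with a short calculation. The sketch below is a simplified model, not the method behind the article's figures: it assumes bit errors are independent, that a rebuild reads every bit on the six surviving drives, and that 1 TB means 10^12 bytes. The function name and parameters are illustrative. Under those assumptions the SATA case comes out somewhat below the "approximately 50%" quoted above, and the enterprise case comes out just under 5%.

```python
import math

def rebuild_failure_probability(drives_in_group, drive_capacity_tb, bits_per_error):
    """Estimate the chance of at least one non-recoverable read error
    while rebuilding one failed drive in a RAID 5 group.

    Assumptions (not from the article): errors are independent per bit,
    and every bit on every surviving drive is read exactly once.
    """
    surviving_drives = drives_in_group - 1
    bits_read = surviving_drives * drive_capacity_tb * 1e12 * 8
    # P(at least one error) = 1 - (1 - 1/bits_per_error) ** bits_read,
    # computed with log1p/expm1 to stay accurate for tiny per-bit rates.
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_error))

sata = rebuild_failure_probability(7, 1.0, 1e14)        # rated 1 in 10^14
enterprise = rebuild_failure_probability(7, 1.0, 1e15)  # rated 1 in 10^15

print(f"SATA, seven 1 TB drives:       {sata:.0%}")
print(f"Enterprise, seven 1 TB drives: {enterprise:.1%}")
```

The result is driven by the ratio of bits read to the rated error interval, so larger drives or wider RAID groups push the probability up quickly.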

      Eventually, the hard disk drive is sent back to the factory. The MTBF example in "MTBF: The odds of failure" suggests that there'll be approximately 40 HDD "service events" per year.

      Most storage admins might be surprised by what happens when an HDD is sent back to the factory. After being run through the factory's failure analysis process, the results for the vast majority of failed hard disk drives (somewhere between 67% and 90%) will be "no failure found" -- the HDD is fine. But the service event still took place and the RAID data rebuild still had to occur. That's a lot of operational hassle for "no failure found."
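To put the two figures together, the quick arithmetic below combines the roughly 40 service events per year from the MTBF example with the 67% to 90% "no failure found" range. It is only a back-of-the-envelope sketch; the function name is illustrative and the inputs are simply the numbers quoted in this article.

```python
def unnecessary_service_events(events_per_year, no_failure_found_share):
    """Service events per year that end in a "no failure found" verdict."""
    return events_per_year * no_failure_found_share

events_per_year = 40  # approximate figure from the article's MTBF example

low = unnecessary_service_events(events_per_year, 0.67)
high = unnecessary_service_events(events_per_year, 0.90)

print(f"About {low:.0f} to {high:.0f} of {events_per_year} yearly service events "
      "would involve a drive the factory later finds to be fine.")
```

Each of those events still means a drive swap and a full RAID rebuild, with the exposure window that comes with it.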

      Undetected data corruption

      Another problem with HDDs that's rarely mentioned but actually quite prevalent is "silent data corruption." Silent data corruptions are storage errors that go unreported and undetected by most storage systems, resulting in corrupt data being provided to an application with no warning, logging, error messages or notification of any kind.

      Most storage systems don't detect these errors, which occur on average with 0.6% of SATA HDDs and 0.06% of enterprise HDDs over 17 months (from "An Analysis of Data Corruption in the Storage Stack," L.N. Bairavasundaram et al., presented at FAST '08 in San Jose, Calif.). Silent data corruption occurs when the RAID layer doesn't detect data corruption errors such as misdirected or lost writes. It also occurs with a torn write -- data that's only partially written and merges with older data, so the block ends up part original data and part new data. Because the hard disk drive doesn't recognize these errors, the storage system isn't aware of them either, so there's no attempt at a fix.
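A torn write is easier to picture with a small simulation. The sketch below is purely illustrative: the 8-byte block, the variable names and the idea of keeping a CRC-32 checksum separate from the data are assumptions made for the example, not a description of any particular storage system. It shows that the drive hands back the blended block with no error, and that only a comparison against an independently stored checksum reveals the problem.

```python
import zlib

# Hypothetical 8-byte blocks, kept tiny for readability.
old_block = b"AAAAAAAA"   # data already on disk
new_block = b"BBBBBBBB"   # data the application intends to write

# Assumed detection scheme: the storage layer records a checksum of the
# data it intends to write, separately from the data itself.
stored_checksum = zlib.crc32(new_block)

# Torn write: only the first half of the new data reaches the disk,
# so the block ends up part new data and part original data.
on_disk = new_block[:4] + old_block[4:]

# The drive returns on_disk without reporting any error. Recomputing the
# checksum on read is what exposes the corruption.
corruption_detected = zlib.crc32(on_disk) != stored_checksum

print(on_disk)               # b'BBBBAAAA'
print(corruption_detected)   # True
```

Without the stored checksum, nothing in the read path distinguishes `on_disk` from valid data, which is why this kind of corruption stays silent.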

    This was first published in June 2009
