This article can also be found in the Premium Editorial Download "Storage magazine: Exploring systems that detect and repair hard disk problems automatically."
Does end-to-end error correction work?
User evidence over the past 18 months suggests that HDD error-correction methods work. Interviews with IT organizations that store petabytes of data (where silent data corruption is statistically more likely to be noticed) for mission-critical applications, such as government labs, high-energy particle research, digital film/video production and delivery, and seismic processing, have revealed high levels of satisfaction. Perhaps the most telling remark came from an IT manager who wishes to remain anonymous: "I don't worry about silent data corruption anymore because it's no longer an issue for us."
In traditional disk subsystem designs, a sector error causes the HDD to be marked as failed. A failed HDD initiates a RAID data rebuild process that degrades performance and takes a long time. It can also be expensive, as there may still be useful life in the hard disk drive.
A heal-in-place system goes through a series of automated repair sequences designed to eliminate or reduce most of the "no failure found" HDD failures, as well as the subsequent unnecessary and costly RAID data rebuilds. As of now, there are five systems that provide heal-in-place capabilities: Atrato Inc.'s Velocity1000 (V1000), DataDirect Networks' S2A series, NEC's D-Series, Panasas' ActiveStor and Xiotech's Emprise 5000. Each provides a proven, albeit completely different, heal-in-place technology.
DataDirect Networks' S2A heal-in-place approach to disk failure attempts several levels of HDD recovery before a hard disk drive is removed from service. When an HDD shows aberrant behavior, the system begins keeping a journal of all writes to that drive and then attempts recovery operations. When recovery operations succeed, only a small portion of the HDD requires rebuilding using the journaled information. Having less data to rebuild greatly reduces overall rebuild times and avoids a service event.
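The journaling idea can be illustrated with a toy model. This is only a sketch under stated assumptions, not DataDirect Networks' implementation: the `Drive` and `JournalingController` classes, their method names, and the block-level granularity are all hypothetical. It shows why a successful recovery needs to replay only the blocks written while the drive was suspect, rather than rebuild the whole drive.

```python
from dataclasses import dataclass, field


@dataclass
class Drive:
    """Toy stand-in for an HDD: a mapping from block number to data."""
    blocks: dict = field(default_factory=dict)
    suspect: bool = False  # True while recovery operations are in progress


@dataclass
class JournalingController:
    """Hypothetical controller that journals writes to a suspect drive."""
    drive: Drive
    journal: dict = field(default_factory=dict)  # block number -> latest data

    def write(self, block: int, data: bytes) -> None:
        if self.drive.suspect:
            # Drive is under recovery: record the write instead of sending it.
            self.journal[block] = data
        else:
            self.drive.blocks[block] = data

    def begin_recovery(self) -> None:
        self.drive.suspect = True

    def finish_recovery(self) -> int:
        """Replay journaled writes and return how many blocks were rebuilt."""
        rebuilt = len(self.journal)
        self.drive.blocks.update(self.journal)
        self.journal.clear()
        self.drive.suspect = False
        return rebuilt


def demo_partial_rebuild() -> tuple:
    """Write 1,000 blocks, then touch only two of them during recovery."""
    ctrl = JournalingController(Drive())
    for block in range(1000):
        ctrl.write(block, b"old")
    ctrl.begin_recovery()
    ctrl.write(7, b"new")
    ctrl.write(42, b"new")
    ctrl.write(7, b"newer")  # a later write to the same block supersedes it
    rebuilt = ctrl.finish_recovery()
    return rebuilt, len(ctrl.drive.blocks), ctrl.drive.blocks[7]
```

In this toy run, the drive holds 1,000 blocks but only the two blocks written during recovery are replayed, which is the effect the article attributes to having "less data to rebuild."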
NEC's D-Series Phoenix technology detects sector errors but allows operation to continue with the other HDDs in the RAID group. If an alternative sector can be assigned, the hard disk drive is allowed to return to operation with the RAID group, avoiding a complete rebuild. Phoenix technology maintains performance throughout the detection and repair process.
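A rough sketch of sector reassignment follows. It is an illustration under assumptions rather than a description of NEC's Phoenix internals: the `RemappingDrive` class, the spare-sector pool, and the use of XOR parity to reconstruct the lost sector from the other drives are all hypothetical choices made for the example. The point is that a single bad sector can be healed by remapping it to a spare, and the drive is failed only when no alternative sector is available.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together (assumed RAID-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class RemappingDrive:
    """Toy drive with a small pool of spare sectors (hypothetical model)."""

    def __init__(self, sectors: int, spares: int):
        self.data = {}   # physical sector -> bytes
        self.remap = {}  # bad logical sector -> spare physical sector
        # Spare sectors are numbered just past the normal address range.
        self.free_spares = list(range(sectors, sectors + spares))

    def write(self, sector: int, payload: bytes) -> None:
        self.data[self.remap.get(sector, sector)] = payload

    def read(self, sector: int) -> bytes:
        return self.data[self.remap.get(sector, sector)]

    def handle_sector_error(self, sector: int, rebuilt: bytes) -> bool:
        """Return True if healed in place, False if the drive must be failed."""
        if not self.free_spares:
            return False  # no alternative sector: fall back to a full rebuild
        spare = self.free_spares.pop(0)
        self.remap[sector] = spare
        self.data[spare] = rebuilt  # data reconstructed from the RAID peers
        return True


def demo_heal_in_place() -> tuple:
    """Lose one sector, rebuild it from peers plus parity, and remap it."""
    drive = RemappingDrive(sectors=4, spares=1)
    drive.write(2, b"\x0f")
    peers = [b"\x01", b"\x02"]                 # same stripe on other drives
    parity = xor_blocks(peers + [b"\x0f"])     # parity written at stripe time
    rebuilt = xor_blocks(peers + [parity])     # recover the lost sector
    healed = drive.handle_sector_error(2, rebuilt)
    second = drive.handle_sector_error(3, b"\x00")  # spare pool now empty
    return healed, drive.read(2), second
```

Here the first error is healed without removing the drive from the RAID group, while the second error finds no spare left and would trigger the conventional failure path.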
This was first published in June 2009