This article can also be found in the Premium Editorial Download "Storage magazine: R.I.P. RAID?"
When an HDD fails in a RAID 5 set, the system rebuilds the data on a spare drive that replaces the failed hard disk drive. The storage system exercises every sector on every HDD in the RAID set to reconstruct the data. This heavy utilization of the other HDDs in the RAID set increases the likelihood of another HDD failure (usually a non-recoverable read error) by an order of magnitude, which significantly increases the likelihood of a data failure. Ten or 20 years ago, when disk capacities were much lower, rebuilds were measured in minutes. But with disk capacities in the terabytes, rebuilds can take hours, days or even weeks. If application users can't tolerate the performance degradation rebuilds cause, the rebuild is given a lower priority and rebuild times increase dramatically. Longer data reconstruction times typically equate to significantly higher risks of data loss. Because of this, many storage shops are stepping up their use of RAID 6.
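The scale of the problem is easy to see with back-of-the-envelope arithmetic: rebuild time is roughly drive capacity divided by the sustained rebuild rate, which shrinks when the rebuild is deprioritized in favor of application I/O. The capacities and throughput figures below are illustrative assumptions, not numbers from the article.

```python
def rebuild_hours(capacity_tb: float, rebuild_mb_per_s: float) -> float:
    """Hours to rewrite an entire drive at a sustained rebuild rate."""
    capacity_mb = capacity_tb * 1_000_000  # 1 TB = 10^6 MB (decimal, as drives are marketed)
    return capacity_mb / rebuild_mb_per_s / 3600

# A 2 TB drive rebuilt at a full 100 MB/s takes roughly 5.6 hours...
full_speed = rebuild_hours(2, 100)
# ...but throttled to 10 MB/s to protect application performance, roughly 56 hours.
throttled = rebuild_hours(2, 10)
print(f"{full_speed:.1f} h at full speed, {throttled:.1f} h throttled")
```

The same arithmetic shows why the trend only worsens: doubling capacity doubles the rebuild window, and every extra hour of rebuild is an extra hour during which a second failure is unprotected.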
RAID 6 adds a second parity stripe that protects the data even if two HDDs in the RAID set fail or suffer a non-recoverable read error. The risk of data loss drops dramatically, but the extra stripe consumes additional usable capacity, and system performance takes a bigger hit if two drives in the same RAID group must be reconstructed simultaneously. More disturbing is the increased risk of data loss if a third HDD fails or a non-recoverable read error occurs during the rebuild.
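The reason two independent parity stripes can survive two failures is that they give two equations in two unknowns. A minimal sketch of the common P+Q construction over GF(2^8), operating on one byte per "drive" for clarity (real arrays apply the same math across whole stripes; the 0x11d polynomial and generator g = 2 follow the textbook Reed-Solomon form, not any particular vendor's implementation):

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11d)."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return p

def gf_pow(a: int, n: int) -> int:
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def gf_inv(a: int) -> int:
    return gf_pow(a, 254)  # a^254 = a^-1 in GF(2^8)

def encode(data: list[int]) -> tuple[int, int]:
    """Compute the P (plain XOR) and Q (Reed-Solomon) parity bytes."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(gf_pow(2, i), d)
    return p, q

def recover_two(data: list[int], x: int, y: int, p: int, q: int) -> tuple[int, int]:
    """Rebuild the bytes of failed drives x and y (x < y) from survivors plus P and Q."""
    pxy, qxy = p, q
    for i, d in enumerate(data):
        if i not in (x, y):      # fold the surviving drives out of both parities
            pxy ^= d
            qxy ^= gf_mul(gf_pow(2, i), d)
    # Remaining system: d_x ^ d_y = pxy  and  g^x*d_x ^ g^y*d_y = qxy
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    dx = gf_mul(gf_inv(gx ^ gy), qxy ^ gf_mul(gy, pxy))
    return dx, pxy ^ dx

stripe = [0x11, 0x22, 0x33, 0x44]          # four data "drives"
p, q = encode(stripe)
dx, dy = recover_two(stripe, 1, 3, p, q)   # drives 1 and 3 both fail
print(hex(dx), hex(dy))                    # matches stripe[1], stripe[3]
```

The Q calculation is also why RAID 6 costs more compute than RAID 5: every byte written requires Galois-field multiplication on top of the XOR, which is the overhead the schemes discussed below try to reduce.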
Another RAID issue is documenting the chain of custody for a failed HDD: the trail (who, what, where, when) of the drive from the time it was pulled to the time it was destroyed or reconditioned. It's a tedious, manually intensive task that's only a bit less stringent if the HDD is encrypted. Even more frustrating, the vast majority of failed HDDs sent back to the factory for analysis or reconditioning (somewhere between 67% and 90%) are found to be good, or no failure is found. Regrettably, that discovery comes only after the system failed the HDD, the HDD was pulled, the data was reconstructed and the chain of custody was documented. That's a lot of operational pain for "no failure found."
Solid-state storage devices actually exacerbate these RAID problems. Because solid-state drives (SSDs) can handle high-performance applications, they allow for storage systems with fewer high-performance HDDs and more high-density, low-performance hard disk drives. Tom Georgens, NetApp's CEO, recently noted that "fast access data will come to be stored in flash with the rest in SATA drives." But a system's lower CapEx can end up translating into higher OpEx, because those high-density drives take longer to rebuild and thus amplify the RAID problems described above.
These RAID issues have inspired numerous vendors, academics and entrepreneurs to come up with alternatives to RAID. We categorize those innovative alternatives into three groups: RAID + innovation, RAID + transformation and paradigm shift.
RAID + innovation
Several vendors have addressed traditional RAID problems by taking an incremental approach to RAID that leverages its reliability while diminishing some of the tradeoffs (see "RAID enhancements," below). IBM's EVENODD (implemented by EMC on Symmetrix DMX) and NetApp's RAID-DP (implemented on NetApp's FAS and V-series) have enhanced RAID 6 by reducing algorithm overhead while increasing performance.
This was first published in May 2010