Recent empirical studies have established two things about data stored on magnetic disk RAID arrays, both suggesting the arrays aren't as reliable as once believed.
First, drives fail much more frequently than previously thought -- perhaps as much as 1,500 times more frequently than vendor mean time to failure (MTTF) statistics would suggest. That failure frequency places RAID arrays at high risk of multiple concurrent drive failures, which can compromise the virtual "uber-disk" that RAID creates.
Second, many studies have confirmed that bit errors, bit rot and silent corruption events -- problems that can introduce errors when data is written to magnetic disk, or degrade bit integrity over time -- pose a much greater challenge to data integrity in RAID arrays than previously thought. Corrupted bits may be limited in impact to an individual file or, if they occur in the wrong place, can take down a full RAID set.
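The standard defense against silent corruption is end-to-end checksumming: compute a digest when data is written and verify it again on every read, so corrupted bits are detected rather than silently returned. A minimal Python sketch (the helper names here are hypothetical, used only for illustration):

```python
import hashlib

def write_with_checksum(data: bytes) -> dict:
    """Store data alongside a SHA-256 digest computed at write time."""
    return {"data": data, "sha256": hashlib.sha256(data).hexdigest()}

def read_and_verify(record: dict) -> bool:
    """Recompute the digest on read; a mismatch reveals silent corruption."""
    return hashlib.sha256(record["data"]).hexdigest() == record["sha256"]

record = write_with_checksum(b"payload written to disk")
assert read_and_verify(record)          # a clean read verifies

# Simulate bit rot: flip a single bit in the stored payload.
rotted = bytearray(record["data"])
rotted[0] ^= 0x01
record["data"] = bytes(rotted)
assert not read_and_verify(record)      # the corruption is detected
```

Detection alone doesn't repair anything, of course; it tells the array which copy or parity path to rebuild from.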
Some RAID vendors have sought to downplay these factors, but with high-capacity drives imposing long rebuild times when a drive fails, RAID may be becoming part of the data availability problem rather than part of the solution.
This has spurred the introduction of a number of proprietary RAID techniques that are not part of the original five schemes (six, if you count no RAID at all) described in a 1987 white paper from the University of California, Berkeley.
Techniques predating RAID, such as erasure coding, have reentered discussions of storage array architecture. In a few cases, RAID is being rethought as a technique applied to storage systems that lay out their data across a grid rather than in stripes across concentric tracks. Alternatives to RAID have now taken center stage, driven in part by the engineering and evangelizing of established storage vendors ranging from NEC to Hitachi Data Systems (HDS), and by newcomers such as Amplidata and Cleversafe.
Erasure coding was originally conceived as a means to protect data transmitted over unreliable channels or networks. Erasure codes expand data, adding internal redundancy so that the original can be reconstructed from a portion of what was sent, even if part of the transmitted data is lost. Amplidata's BitSpread technology proposes to do the same thing for storage. Put simply, BitSpread takes incoming data, converts it into a binary object and then applies algorithms to generate a configurable number of reconstructive object fragments. These fragments are then dispersed across the storage target -- a grid, in the case of Amplidata. If any data thus encoded is found to be corrupt or lost, it can be reconstructed quickly by retrieving a sufficient subset of the surviving fragments and applying a reconstruction algorithm.
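As a toy illustration of the principle -- not Amplidata's actual algorithm -- the sketch below encodes an object into three fragments: two data halves plus an XOR parity fragment. Any two of the three fragments suffice to rebuild the object, so one fragment can be lost without losing data:

```python
def encode(obj: bytes) -> list:
    """Split an object into two data fragments plus one XOR parity fragment."""
    if len(obj) % 2:
        obj += b"\x00"  # pad to even length (real systems record the true length)
    half = len(obj) // 2
    a, b = obj[:half], obj[half:]
    p = bytes(x ^ y for x, y in zip(a, b))  # parity fragment
    return [a, b, p]                        # any 2 of these 3 rebuild the object

def decode(fragments: list) -> bytes:
    """Rebuild the object from any two surviving fragments (None = lost)."""
    a, b, p = fragments
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, p))  # recover first half from b, p
    elif b is None:
        b = bytes(x ^ y for x, y in zip(a, p))  # recover second half from a, p
    return a + b

obj = b"object data!"                    # even length keeps the demo simple
frags = encode(obj)
assert decode([None, frags[1], frags[2]]) == obj   # data fragment 0 lost
assert decode([frags[0], None, frags[2]]) == obj   # data fragment 1 lost
assert decode([frags[0], frags[1], None]) == obj   # parity fragment lost
```

Production erasure codes generalize this idea: k data fragments plus m coded fragments, with any k of the k+m sufficient for reconstruction, which is what allows wide dispersal across a grid.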
While Amplidata promotes its technology as a wholesale replacement for RAID, some vendors resist framing the discussion as either-or. NetApp, for example, sees applications for erasure coding, such as deep archive, alongside enhanced RAID techniques that leverage high-availability clustering, dispersal of data and parity information, and hardware parallelization.
NetApp has argued that some algorithms employed in erasure coding are more computationally intensive than RAID, imposing a processing burden that shows up as slower storage performance. High-availability clustering with failover, the company argues, is an effective way to mitigate the impact of a RAID set failure and restores data availability much faster than erasure coding does.
Hu Yoshida, vice president and chief technology officer at HDS, has noted that RAID 6 (preferred by HDS on its platforms) uses redundant parity to tolerate dual disk failures, but he acknowledges that future drive densities will eventually produce three or more concurrent drive failures in a RAID set, requiring a replacement for RAID.
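Yoshida's point about redundant parity can be made concrete. The sketch below -- an illustrative toy, not HDS's implementation -- computes RAID 6's two parity blocks, P (plain XOR) and Q (a generator-weighted sum over the Galois field GF(2^8), as in Linux md RAID 6), and shows how any two failed data drives can be rebuilt; a third concurrent failure would exceed what the two parities can solve for:

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo the polynomial 0x11d."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
    return p

def gf_pow(a: int, n: int) -> int:
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def parity(blocks: list) -> tuple:
    """Compute P (XOR) and Q (generator-weighted XOR) parity blocks."""
    P, Q = bytearray(len(blocks[0])), bytearray(len(blocks[0]))
    for i, block in enumerate(blocks):
        coeff = gf_pow(2, i)            # drive i is weighted by 2^i in GF(2^8)
        for j, byte in enumerate(block):
            P[j] ^= byte
            Q[j] ^= gf_mul(coeff, byte)
    return bytes(P), bytes(Q)

def recover_two(survivors: list, x: int, y: int, P: bytes, Q: bytes) -> tuple:
    """Rebuild lost data blocks x and y from the survivors plus P and Q."""
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    inv = gf_pow(gx ^ gy, 254)          # a^254 is the inverse of a in GF(2^8)
    Dx, Dy = bytearray(len(P)), bytearray(len(P))
    for j in range(len(P)):
        a, b = P[j], Q[j]               # strip the survivors' contributions
        for i, block in enumerate(survivors):
            if block is None:
                continue
            a ^= block[j]
            b ^= gf_mul(gf_pow(2, i), block[j])
        # Now a = Dx ^ Dy and b = gx*Dx ^ gy*Dy; solve the 2x2 system.
        Dx[j] = gf_mul(inv, b ^ gf_mul(gy, a))
        Dy[j] = a ^ Dx[j]
    return bytes(Dx), bytes(Dy)

blocks = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
P, Q = parity(blocks)
survivors = [blocks[0], None, blocks[2], None]  # drives 1 and 3 fail together
d1, d3 = recover_two(survivors, 1, 3, P, Q)
assert (d1, d3) == (blocks[1], blocks[3])
```

Real arrays use table-driven field arithmetic and rotate parity across drives, but the algebra -- two independent equations permitting recovery from two unknowns -- is the same, which is also why two parities cannot cover a triple failure.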
So, will RAID alternatives such as erasure coding and grid storage replace the technique? Much depends on the expense of the replacement technology and on the development of standards that can contain the costs of data protection, as RAID has done for more than 25 years (or closer to 40, if you start the clock with the first proto-RAID technique, patented by IBM in 1977).
This was first published in May 2013