RAID has formed the backbone of local (within the array) data availability for nearly 30 years, but the technology is starting to show its age. Exponential growth in drive capacity without the equivalent growth in drive performance means RAID systems struggle to be effective at scale. Enter the new kid on the block -- erasure coding.
Erasure coding meets the needs of large-scale storage deployments, and some say it could even eliminate the need for data backup. However, as we will see, erasure coding isn't the solution for all problems and RAID still has a place in modern storage systems.
To understand why RAID has started to reach the extent of its capabilities, we need to look in detail at how RAID systems work. In a RAID system, some of the storage capacity is set aside for recovery purposes, creating protection through data redundancy. For example, a RAID-1/10 system simply mirrors the data between two drives, giving an effective 100% overhead (only 50% of the raw capacity holds data). There is a range of other RAID schemes, including RAID 2, RAID 3 and RAID 4, all of which are now rarely used, if at all.
Today, the vast majority of data protection is achieved with RAID 5 and RAID 6 configurations. RAID 5 systems store redundant data in the form of parity, calculated from all of the application/host data, as the basis of data protection. For example, a RAID-5 3+1 configuration has three data disks and one parity disk. In practice, the data and parity are spread across all disks for performance purposes, but the overhead (33% relative to the data capacity, or 25% of the raw capacity) and effective capacity (75%) are still the same.
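The two ways of quoting parity overhead are easy to confuse, so a short calculation may help. This is a minimal sketch; the function name `parity_efficiency` and its dictionary keys are made up for illustration and do not come from any RAID product.

```python
def parity_efficiency(data_disks, parity_disks=1):
    """Express parity cost relative to data capacity and to raw capacity."""
    raw = data_disks + parity_disks
    return {
        "overhead_vs_data": parity_disks / data_disks,   # 3+1 -> 33.3%
        "parity_share_of_raw": parity_disks / raw,       # 3+1 -> 25.0%
        "usable_share_of_raw": data_disks / raw,         # 3+1 -> 75.0%
    }

# Compare a small RAID-5 set with progressively wider ones.
for layout in [(3, 1), (7, 1), (27, 1)]:
    stats = parity_efficiency(*layout)
    print(layout, {name: f"{value:.1%}" for name, value in stats.items()})
```

Wider sets push the usable share up, which is the capacity argument for large RAID groups.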
In order to maximize capacity, some vendors allow large RAID sets (as many as 28 total drives). However, each drive added to a RAID set increases the chance of failure of the RAID group as a whole. Recovery of data in large RAID groups also incurs a significant I/O penalty, because all drives are read to recover the failed data. In other words, there is a trade-off between the practical size of a RAID group and the overhead of data redundancy for protection.
Some vendors build volumes or disk pools from multiple RAID groups to mitigate the risk of disk failures within a group. This provides some additional protection (as disks can individually fail in multiple groups without representing a problem), but the effective overhead of parity still remains (multiple 3+1 RAID sets still use 25% of raw capacity for parity). In addition, a double disk failure in one group still affects the entire volume/pool, as data in this type of configuration will be striped across multiple RAID sets for performance purposes.
RAID protection and recovery
When data is written to a RAID 5 set (or group), a data stripe is built, including both the data and the parity component. This stripe is then written to all of the disks in the RAID group as a "stripe set." Parity calculation is based on XOR (exclusive OR) logic and performed either in software or on dedicated RAID controllers. Subsequent re-reads of data come from only the data parts of the stripe set; the parity data is only used for failure scenarios.
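To make the XOR parity step concrete, here is a minimal sketch of building one stripe for a 3+1 set. The four-byte blocks and the helper name `xor_blocks` are illustrative assumptions; real controllers work on much larger blocks and rotate the parity position across disks.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]   # one block per data disk
parity = xor_blocks(data)            # the block that goes to the parity position
stripe = data + [parity]

print(parity)
# XORing every block in a complete stripe always yields zeros.
print(xor_blocks(stripe))
```

A normal read touches only the three data blocks; the parity block matters only once a disk is missing.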
Updating data in a RAID stripe incurs additional I/O overhead compared to writing data without protection. When data is updated, the old data and parity in a stripe are read, parity is recalculated on the basis of the new data, and then both data and parity are written back to disk. Therefore, every small write I/O from the application or host results in four I/O operations on disk (two reads and two writes), independent of the size of the RAID group.
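The four-I/O write penalty can be sketched as follows. The `CountingDisk` class and `small_write` function are hypothetical stand-ins for real devices and controller logic; the point is only that the new parity can be derived from the old data, old parity and new data without touching the other disks.

```python
def xor(a, b):
    """XOR two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

class CountingDisk:
    """Hypothetical one-block disk that counts its I/O operations."""
    def __init__(self, block):
        self.block = block
        self.ios = 0

    def read(self):
        self.ios += 1
        return self.block

    def write(self, block):
        self.ios += 1
        self.block = block

def small_write(data_disk, parity_disk, new_data):
    old_data = data_disk.read()        # I/O 1: read old data
    old_parity = parity_disk.read()    # I/O 2: read old parity
    new_parity = xor(xor(old_parity, old_data), new_data)
    data_disk.write(new_data)          # I/O 3: write new data
    parity_disk.write(new_parity)      # I/O 4: write new parity

disks = [CountingDisk(b) for b in (b"AAAA", b"BBBB", b"CCCC")]
parity_disk = CountingDisk(b"@@@@")    # XOR of the three data blocks
small_write(disks[1], parity_disk, b"ZZZZ")
total_ios = sum(d.ios for d in disks) + parity_disk.ios
print(total_ios, parity_disk.block)
```

The untouched data disks record no I/O at all, which is why the penalty does not grow with the width of the group.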
Within a RAID group, the failure domain is a single disk drive. When drives were small in capacity and had a good capacity/IOPS ratio, rebuilding an entire drive was straightforward without significant overhead. The RAID logic read all of the good data disks and the parity and reversed the XOR parity calculation to recover the missing data.
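Reversing the parity calculation is the same XOR applied to the surviving blocks. A minimal sketch, again using illustrative four-byte blocks and a hypothetical `xor_blocks` helper:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# A 3+1 stripe: three data blocks followed by their XOR parity.
stripe = [b"AAAA", b"BBBB", b"CCCC", b"@@@@"]

failed = 1  # pretend the second disk has died
survivors = [block for i, block in enumerate(stripe) if i != failed]

# XOR of everything that is left reproduces the missing block.
rebuilt = xor_blocks(survivors)
print(rebuilt)
```

Note that the rebuild reads every surviving disk in the group for every stripe, which is where the rebuild I/O load comes from.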
However, as drive capacities increased, rebuild times became the Achilles' heel of RAID protection. When a single disk drive can only deliver 200 random IOPS, the rebuild of multi-terabyte drives can run into days or weeks. There are already scary stories circulating that predict rebuild times for 10 TB drives will be on the order of months.
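A back-of-the-envelope estimate shows how the numbers get this large. The sketch below assumes the rebuild is limited purely by random IOPS on one drive and that every I/O moves a fixed amount of data; the I/O sizes are assumptions chosen for illustration, and real arrays rebuilding sequentially on idle drives will do considerably better.

```python
def rebuild_days(capacity_tb, iops=200, io_size_kib=64):
    """Days to rewrite a whole drive at a fixed IOPS rate and I/O size."""
    capacity_bytes = capacity_tb * 10**12
    bytes_per_second = iops * io_size_kib * 1024
    return capacity_bytes / bytes_per_second / 86_400

for io_size in (64, 4):
    days = rebuild_days(10, iops=200, io_size_kib=io_size)
    print(f"10 TB at 200 IOPS, {io_size} KiB per I/O: {days:.1f} days")
```

Under these assumptions a 10 TB drive takes roughly nine days with 64 KiB I/Os and well over four months with 4 KiB I/Os, so the estimate depends heavily on how much useful work each I/O does.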
RAID rebuild time wouldn't be a problem if it weren't for the exposed position a RAID group with a failed disk is in. With RAID 5 protection, a single failed drive means data protection has been lost. Another drive failure within the same disk group before the first failed disk has been rebuilt will result in data loss and significant manual work to reconstruct data from the failing device.
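The size of that exposure can be estimated roughly. The sketch below assumes independent drive failures at a constant rate and an annualized failure rate (AFR) of 2%; both are illustrative assumptions rather than figures from any vendor, and drives from the same batch often fail in a more correlated way than this model allows.

```python
import math

def p_second_failure(drives_in_group, rebuild_days, afr=0.02):
    """Chance that at least one surviving drive fails before the rebuild ends."""
    rate_per_day = -math.log1p(-afr) / 365      # constant failure rate per drive
    survivors = drives_in_group - 1
    return -math.expm1(-survivors * rate_per_day * rebuild_days)

for drives, days in [(4, 1), (4, 9), (28, 9)]:
    p = p_second_failure(drives, days)
    print(f"{drives} drives, {days}-day rebuild: {p:.3%}")
```

Both a wider group and a longer rebuild window raise the risk, which is why large RAID sets and large drives compound each other.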
Vendors have mitigated this problem by implementing double-parity protection schemes like RAID 6. A RAID 6 group uses two parity drives per RAID group, so it can tolerate the failure of two drives and not suffer data loss. Of course, this level of protection comes with capacity and performance overhead. Triple-parity schemes also exist (ZFS RAID-Z3 is one example), but they are far less common.
Finally, we have to consider the problem of an unrecoverable read error (URE) when rebuilding data. A URE occurs when a drive fails to read a disk sector. For any single read it is extremely rare (vendors quote rates of one failure per 12.5 terabytes (TB) read for SATA drives), but with very large drives and large RAID sets, a rebuild reads enough data for a URE to become a real possibility. If it occurs during a parity rebuild, then the data affected by the URE cannot be recovered.
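One failure per 12.5 TB read corresponds to one bad bit in every 10^14 bits. Treating each bit read as an independent trial at that rate (a simplifying assumption, and a pessimistic one if real-world rates are lower than quoted), the chance of hitting at least one URE during a rebuild can be sketched as:

```python
import math

def p_ure(tb_read, bit_error_rate=1e-14):
    """Probability of at least one unrecoverable read error while reading tb_read TB."""
    bits = tb_read * 1e12 * 8
    # 1 - (1 - rate)**bits, computed in a numerically stable way
    return -math.expm1(bits * math.log1p(-bit_error_rate))

# Rebuilding one drive in a 3+1 set means reading the three surviving drives.
for drive_tb in (1, 4, 10):
    print(f"{drive_tb} TB drives: {p_ure(3 * drive_tb):.1%}")
```

Under this model, reading 12.5 TB gives roughly a 63% chance of at least one URE, not a certainty, because errors are not spaced evenly.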
In practice, the probability of a URE may be much lower than vendors quote, but drives are mechanical in nature and not immune from manufacturing issues. At scale, in systems with thousands of drives, these kinds of problems are more likely to be experienced.
How erasure coding works
Clearly, RAID data protection isn't suitable for very large volumes of data and drives, such as those found in object stores. Instead, vendors have turned to erasure coding as one technique for implementing data protection without the scale issues of RAID systems. Erasure coding (sometimes called forward error correction) operates in a similar way to RAID in that it uses additional redundant data as the protection method.
The erasure coding process first divides source data into "chunks." Then, a mathematical function is applied to these chunks to create smaller "slices" or "shards." As an example, an erasure coding system could take eight input data chunks and produce 10 slices, of which only eight are needed to recover data. In this instance, the overhead of using erasure coding is 25% (two out of eight), and the system can tolerate the loss of any two slices. If the slices are stored on 10 separate drives, that means tolerating a double disk failure.
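The "any eight of 10" property can be demonstrated with a toy code. The sketch below is not how production systems do it: it swaps in polynomial interpolation over the prime field GF(257) in place of the GF(256) Reed-Solomon arithmetic real implementations typically use, and each chunk is a single byte. The names `interpolate`, `encode` and `decode` are hypothetical. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
from itertools import combinations

P = 257  # smallest prime above 255, so every byte value is a field element

def interpolate(points, x):
    """Evaluate at x the unique polynomial through `points`, modulo P (Lagrange)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def encode(chunks, n):
    """Turn k data symbols into n slices; the first k slices are the data itself."""
    points = list(enumerate(chunks))
    return [(x, interpolate(points, x)) for x in range(n)]

def decode(slices, k):
    """Rebuild the k data symbols from any k surviving (index, value) slices."""
    if len(slices) < k:
        raise ValueError("not enough slices to recover the data")
    subset = slices[:k]
    return [interpolate(subset, x) for x in range(k)]

data = [ord(c) for c in "ERASURE!"]   # eight one-byte chunks
slices = encode(data, 10)             # ten slices, any eight suffice

# Try every possible pair of lost slices and confirm the data still comes back.
all_recovered = all(
    decode([s for s in slices if s[0] not in lost], 8) == data
    for lost in combinations(range(10), 2)
)
print(all_recovered)
```

Changing the arguments to `encode` and `decode` gives a different k-of-n scheme with no other changes, which is the flexibility that sets erasure coding apart from fixed RAID layouts.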
Also, erasure coding has the benefit that the number of slices produced and the number required for recovery can be easily varied, allowing many data protection schemes to be implemented, even on the same set of physical disks. And, unlike RAID, erasure coding improves efficiency and resiliency as the volume of data and number of drives increase. Having data spread across a greater number of drives with the same percentage overhead means the loss of more drives can be tolerated for any one erasure-coded set. In addition, unrecoverable read errors can be mitigated as any piece of original data can be recreated (and validated) using many combinations of chunks.
However, this resiliency benefit comes at a cost. The mathematical functions used to perform erasure coding (most of which are based on Reed-Solomon codes) are much more computationally intensive than traditional RAID XOR, resulting in a greater processor overhead that directly translates to increased host latency. In addition, servicing read I/O requests requires reconstituting the original data from the minimum subset of shards using another mathematical function, compared to RAID, where the data is simply read from disk. In a similar vein, updating data requires reading the entire set of shards that comprise the encoded data, re-applying the erasure coding process and re-writing the data back to disk.
With this kind of overhead, it's easy to see why erasure coding has initially been applied to object stores, where an object is typically immutable and generally re-written as a new version rather than being updated in place.
Replication vs. erasure coding
One area where erasure coding does have a distinct advantage is in data protection for disaster recovery. RAID systems protect data within a single storage array. As such, users must rely on remote replication to protect against array or site loss. Replication is an expensive proposition, requiring two identical sets of data, one of which is typically rarely (if ever) used. With erasure coding, shards can be geographically dispersed and used to mitigate the loss of a single site or appliance.
For example, imagine an erasure coding scheme that requires 12 of 16 slices to be available for data access. These 16 slices could be spread evenly across four data centers, enabling the loss of any one data center to be tolerated without losing information. This level of resiliency is achieved with only the capacity overhead of the erasure code itself (four extra slices for every 12, or 33%) rather than a full second copy of the data, although it is important to remember that performance is based on reading at least 12 slices, so inter-site latency would be a factor.
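The arithmetic behind this example is simple to check. The sketch below assumes slices are placed round-robin so that each of the four sites holds exactly four; the function name `surviving_slices` is hypothetical.

```python
def surviving_slices(total_slices, sites, failed_sites):
    """Count slices still reachable when slices are placed round-robin across sites."""
    placement = [i % sites for i in range(total_slices)]
    return sum(1 for site in placement if site not in failed_sites)

TOTAL, NEEDED, SITES = 16, 12, 4

one_site_down = surviving_slices(TOTAL, SITES, {0})
two_sites_down = surviving_slices(TOTAL, SITES, {0, 1})

print(one_site_down >= NEEDED)    # data remains readable
print(two_sites_down >= NEEDED)   # too few slices remain
print(f"raw capacity per unit of data: {TOTAL / NEEDED:.2f}x vs 2.00x for a full replica")
```

An uneven placement (for example, five slices at one site) would break the guarantee, so the even spread is part of the design rather than a detail.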
So, could erasure coding mean the end of backup? For traditional block-based data where updates are based on small block sizes (typically 4KB/8KB), erasure coding isn't appropriate due to the overhead of implementing the erasure-coding algorithm. For this reason, we see many vendors implementing RAID protection in their object stores when handling small object data. This data will still need some backup.
However, for larger objects (which covers most unstructured data and files), erasure coding provides a resilient and efficient scalable protection mechanism, which when implemented with versioning (to protect against data corruption or deletion) could negate the need for backup in many cases.
It is likely that we will see the gradual introduction of erasure coding into shared storage arrays as a protection method alongside RAID systems, with system intelligence used to choose the protection method, based on customer policies. RAID isn't going away, but will be one of a suite of protection solutions in future storage systems.