This article can also be found in the Premium Editorial Download "Storage magazine: Solid-state adds VROOM to virtual desktops."
Truth be told, RAID ought to be dying. Mean time to data loss (MTTDL) numbers from vendors are demonstrably fraudulent, and the actual rates of multiple disk failures (adequate to compromise both RAID 5 and 6) are approximately 1,500 times higher than advertised. Part of the explanation is linked to “silent corruption” caused by bit rot and other factors that generate about one non-recoverable error for every 67 TB of disk. Doesn’t seem like much until you realize how many TBs go into a single array these days, and how sensitive RAID controllers are to small read errors that mark a drive as bad even when it’s mostly good.
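To put the 67 TB figure in perspective, here is a rough sketch of the arithmetic. It assumes non-recoverable errors are independent and arrive at a constant rate per terabyte read (a Poisson model, which is a simplification of mine rather than anything the figure itself guarantees). The function name and the sample read sizes are hypothetical, chosen only for illustration.

```python
import math

# From the text: about one non-recoverable error for every 67 TB of disk.
TB_PER_ERROR = 67.0


def chance_of_read_error(tb_read, tb_per_error=TB_PER_ERROR):
    """Probability of at least one non-recoverable error while reading
    tb_read terabytes, assuming independent errors at a constant rate
    (Poisson assumption, not a vendor specification)."""
    expected_errors = tb_read / tb_per_error
    return 1.0 - math.exp(-expected_errors)


if __name__ == "__main__":
    # Illustrative read sizes in TB, e.g. the data touched during a rebuild.
    for tb in (8, 24, 67, 200):
        pct = chance_of_read_error(tb)
        print(f"{tb:>4} TB read: {pct:.1%} chance of at least one error")
```

Under that assumption, reading a full 67 TB carries roughly a 63% chance of hitting at least one such error, which is why the number stops looking small once arrays grow.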
Even mirroring the way we do it today is flawed. A RAID 0+1 array (two striped sets, mirrored against each other) dies when just one drive in each striped set goes south, while a RAID 1+0 array (mirrored pairs, striped together) dies only when both drives in the same mirrored pair fail. (If you’re confused, come see the talk at SD.) Could two drives fail in close order? Of course. Does it happen frequently? Yup. In an eight-drive RAID 0+1, once one drive has failed, the chances that a second failure takes down the whole array are four in seven. In an eight-drive 1+0, the chances are a bit longer: one in seven.
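The four-in-seven and one-in-seven odds can be checked by brute force. The sketch below assumes an eight-drive array in which every drive is equally likely to fail, and it counts how many two-drive failure sequences are fatal for each layout. The drive numbering and function names are hypothetical, used only for illustration.

```python
from fractions import Fraction
from itertools import permutations

DRIVES = range(8)
# RAID 0+1: two four-drive striped sets, mirrored against each other.
STRIPE_SETS = [{0, 1, 2, 3}, {4, 5, 6, 7}]
# RAID 1+0: four mirrored pairs, striped together.
MIRROR_PAIRS = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]


def raid01_survives(failed):
    # Alive while at least one whole striped set has lost no drives.
    return any(not (stripe & failed) for stripe in STRIPE_SETS)


def raid10_survives(failed):
    # Alive while every mirrored pair still has a working drive.
    return all(pair - failed for pair in MIRROR_PAIRS)


def fatal_second_failure_odds(survives):
    # Enumerate every ordered (first, second) failure; count the fatal ones.
    cases = list(permutations(DRIVES, 2))
    fatal = sum(1 for first, second in cases if not survives({first, second}))
    return Fraction(fatal, len(cases))


if __name__ == "__main__":
    print("RAID 0+1:", fatal_second_failure_odds(raid01_survives))  # 4/7
    print("RAID 1+0:", fatal_second_failure_odds(raid10_survives))  # 1/7
```

Both layouts survive any single failure. The difference is in what the second failure can hit: in 0+1 any of the four drives in the surviving striped set is fatal, while in 1+0 only the failed drive's mirror partner is.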
The math is indisputable. But it doesn’t mean RAID is dead.
As we see in the current election cycle, the comfort of ideology often trumps the inconvenience of facts for a lot of prospective voters. A wannabe candidate can state that HPV vaccines cause mental retardation, which is patently absurd.
So it is with RAID. Disk array vendors have spent a lot of money over the years promoting omnia in orbis (“everything on disk,” for those not fluent in Latin). They’ve done remarkable work to squeeze bits closer and closer together, and to shrink the electronics packs on disks to make spindles smaller. Bit rot -- the leakage of magnetic energy used to store binary state from a cell location on a disk or chip, whether from degraded insulation or cosmic radiation or humid climate -- happens. One study of 1.5 million disks found that one in 90 disks contained these “soft errors” that not only make data irretrievable but can also cause RAID errors.
The solution is to virtualize the disk media and use different kinds of recording patterns, like X-IO’s Redundancy Allocation Grid System (RAGS) at the hardware layer, or an implementation of erasure code technology, like Amplidata’s BitSpread technology, at the file system layer. Hadoop, ZFS and some other strategies may also work.
What is interesting is that even after these technologies are exhaustively explained, the customer usually says “Great. And will they work with RAID 5?” Go figure.
BIO: Jon William Toigo is a 30-year IT veteran, CEO and managing principal of Toigo Partners International, and chairman of the Data Management Institute.
This was first published in July 2012