Bit rot eroding RAID

RAID’s roots run deep and the technology does a decent job of data protection, but a little math and an open mind might lead you to some better alternatives.

Recently, in preparation for a talk I’m delivering this year at Storage Decisions (SD), I’ve been reacquainting myself with the latest technologies and methodologies for data protection. Naturally, this included rereading the many articles and blogs that, for the past few years, have constituted a debate framed as the “death of RAID.”

I find “death of” articles amusing. We’ve repeatedly been told that tape was dead (Gartner killed it in 1999, SANs killed it in 2000, deduplicating VTLs stuck a fork in it in 2005 and so on). We’ve been hearing since the 1980s that mainframes were dead (despite the hockey stick revenues that CA Technologies, IBM and others have generated from big iron over the past few years) and more recently that PCs are dead (replaced by tablets and smartphones or VDI technology that never quite seems to catch on). Application service providers and storage service providers (the same beasties as contemporary “clouds”) were to have killed off data centers and/or shrink-wrapped software -- until all the ASPs and SSPs went belly up in 2001, that is. The list of examples goes on and on.

So, excuse me if I have déjà vu when it comes to all the talk about the death of RAID. Not because the arguments are poorly made or the science is inaccurate, but because people are people.


Truth be told, RAID ought to be dying. Mean time to data loss (MTTDL) numbers from vendors are demonstrably fraudulent, and the actual rates of multiple disk failures (enough to compromise both RAID 5 and 6) are approximately 1,500 times higher than advertised. Part of the explanation is linked to “silent corruption” caused by bit rot and other factors that generate about one non-recoverable error for every 67 TB of disk. That doesn’t seem like much until you realize how many TBs go into a single array these days, and how sensitive RAID controllers are to small read errors: a controller will mark a drive as bad even when it’s mostly good.
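To put the 67 TB figure in perspective, here is a back-of-the-envelope sketch in Python. It assumes non-recoverable errors turn up independently and at a constant rate (a simple Poisson model, which is an assumption for illustration and not something the cited figure guarantees). The function name `prob_at_least_one_error` and the sample capacities are hypothetical.

```python
import math

# Figure cited in the text: about one non-recoverable error per 67 TB.
TB_PER_ERROR = 67.0


def prob_at_least_one_error(tb_read, tb_per_error=TB_PER_ERROR):
    """Chance of hitting one or more non-recoverable errors while reading
    tb_read terabytes, assuming independent errors at a constant rate
    (Poisson model; an illustrative simplification)."""
    expected_errors = tb_read / tb_per_error
    return 1.0 - math.exp(-expected_errors)


# Sample capacities are illustrative, not taken from any particular array.
for tb in (10, 67, 200):
    p = prob_at_least_one_error(tb)
    print(f"{tb:>4} TB read: about {p:.0%} chance of at least one error")
```

Under that assumption, the odds of a clean read drop off quickly as the amount of data grows, which is why a full-array rebuild is exactly the wrong moment to discover a soft error.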

Even mirroring the way we do it today is flawed. A RAID 0+1 array (two striped sets of drives, mirrored) dies when just one drive in each striped set goes south, while a RAID 1+0 array (mirrored pairs, striped together) dies when both drives in the same mirrored pair fail. (If you’re confused, come see the talk at SD.) Could two drives fail in close order? Of course. Does it happen frequently? Yup. Once one drive in an eight-drive RAID 0+1 has failed, the chances that a second failure takes down the whole array are four in seven. In an eight-drive 1+0, the odds are longer: one in seven.
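For anyone who wants to check the four-in-seven and one-in-seven figures, a brute-force sketch follows. It assumes an eight-drive array laid out as two four-drive stripes (0+1) or four mirrored pairs (1+0), and that any two drives are equally likely to be the ones that fail. The drive numbering and function names are hypothetical, chosen only for this illustration.

```python
from fractions import Fraction
from itertools import combinations

DRIVES = 8  # eight-drive array, as in the text


def raid01_dead(failed):
    """RAID 0+1: drives 0-3 form one stripe, drives 4-7 the other, and the
    two stripes mirror each other. The array is lost once each stripe
    contains at least one failed drive."""
    half = DRIVES // 2
    return any(d < half for d in failed) and any(d >= half for d in failed)


def raid10_dead(failed):
    """RAID 1+0: drives (0,1), (2,3), (4,5), (6,7) are mirrored pairs that
    are striped together. The array is lost once both drives of any one
    pair have failed."""
    failed = set(failed)
    return any({2 * p, 2 * p + 1} <= failed for p in range(DRIVES // 2))


def two_failure_odds(is_dead):
    """Fraction of all two-drive failure combinations that kill the array.
    By symmetry this equals the chance that a second failure is fatal
    once a first drive has already failed."""
    pairs = list(combinations(range(DRIVES), 2))
    fatal = sum(1 for pair in pairs if is_dead(pair))
    return Fraction(fatal, len(pairs))


print("RAID 0+1:", two_failure_odds(raid01_dead))  # 4/7
print("RAID 1+0:", two_failure_odds(raid10_dead))  # 1/7
```

Of the 28 possible two-drive failures, 16 straddle both stripes of the 0+1 layout, while only four land on the same mirrored pair in the 1+0 layout.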

The math is indisputable. But it doesn’t mean RAID is dead.

As we see in the current election cycle, the comfort of ideology often trumps the inconvenience of facts for a lot of prospective voters. A wannabe candidate can state that HPV vaccines cause mental retardation, which is patently absurd, and gain traction with a segment of voters who care less about whether the vaccine is scientifically proven to safeguard my daughters against a known cancer vector than whether the politician’s claim supports a pre-existing ideological view either of Big Pharma (pushing bad vaccines on unsuspecting people) or Big Government (mandating vaccinations).

So it is with RAID. Disk array vendors have spent a lot of money over the years promoting omnia in orbis (“everything on disk,” for those not fluent in Latin). They’ve done remarkable work to squeeze bits closer and closer together, and to shrink the electronics packs on disks to make spindles smaller. Bit rot -- the leakage of magnetic energy used to store binary state from a cell location on a disk or chip, whether from degraded insulation or cosmic radiation or humid climate -- happens. One study of 1.5 million disks found that one in 90 disks contained these “soft errors” that not only make data irretrievable but can also cause RAID errors.

The solution is to virtualize the disk media and use different kinds of recording patterns, like X-IO’s Redundancy Allocation Grid System (RAGS) at the hardware layer, or an implementation of erasure code technology, like Amplidata’s BitSpread technology, at the file system layer. Hadoop, ZFS and some other strategies may also work.

What is interesting is that even after these technologies are exhaustively explained, the customer usually says “Great. And will they work with RAID 5?” Go figure.

BIO: Jon William Toigo is a 30-year IT veteran, CEO and managing principal of Toigo Partners International, and chairman of the Data Management Institute.
