Remember the research paper Google made a splash with two years ago on disk drive failure rates? The one that showed that most failed drives didn’t raise significant SMART flags, found no correlation between temperature or utilization and failure rates, and instead established that failure rates correlate more with drive manufacturer, model, and age?
Well, there’s now a DRAM equivalent — and it doesn’t paint a much prettier picture than the one on hard drive failures.
According to a new paper, “DRAM Errors in the Wild: A Large-Scale Field Study”, engineers from Google and the University of Toronto found that once again, failure rates and patterns did not match the received wisdom in the industry about how Dual Inline Memory Modules (DIMMs) behave. According to the paper:
We find that DRAM error behavior in the field differs in many key aspects from commonly held assumptions. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode. We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account. Finally, unlike commonly feared, we don’t observe any indication that newer generations of DIMMs have worse error behavior.
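To get a feel for what those rates mean in practice, here's a back-of-the-envelope conversion of the quoted figures into expected correctable errors per DIMM per year. The 25,000–70,000 rate comes from the abstract above; the 1 GB DIMM size is an illustrative assumption, not a figure from the paper.

```python
# Convert the paper's rate (errors per billion device-hours per Mbit)
# into expected errors per DIMM per year. DIMM size is hypothetical.
HOURS_PER_YEAR = 24 * 365  # 8760

def expected_errors_per_year(rate_per_billion_hours_per_mbit, dimm_mbit):
    """Expected correctable errors per DIMM per year at the given rate."""
    return rate_per_billion_hours_per_mbit * dimm_mbit * HOURS_PER_YEAR / 1e9

dimm_mbit = 1 * 1024 * 8  # assume a 1 GB DIMM = 8192 Mbit

low = expected_errors_per_year(25_000, dimm_mbit)
high = expected_errors_per_year(70_000, dimm_mbit)
print(f"roughly {low:.0f} to {high:.0f} correctable errors per DIMM per year")
```

Even at the low end that works out to well over a thousand correctable errors per DIMM per year for a modest-sized module, which is why the paper's rates were described as orders of magnitude higher than previously reported.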
As in the disk drive study, temperature doesn’t play a huge role in DRAM failures. Unlike that study, however, vendor and model didn’t make as much of a difference here.
However, the study showed errors were more highly dependent on motherboard design than previously thought. And contrary to conventional wisdom about DRAM, more failures were hardware-based than software-based. According to Data Mobility Group’s Robin Harris, in an article analyzing the paper:
This means that some popular [motherboards] have poor EMI hygiene. Route a memory trace too close to a noisy component or skimp on grounding layers and instant error problems…For all platforms they found that 20% of the machines with errors make up more than 90% of all observed errors on that platform. There be lemons out there!
These two reports raise one common question, according to Harris — why didn’t we know about these things before? As he put it, “Big system vendors have scads of data on disk drives, DRAM, network adapters, OS and filesystem based on mortality and tech support calls, but do they share this with the consuming public? Nothing to see here folks, just move along.”