Paper review: DRAM errors in the wild : umbrant

Paper review: DRAM errors in the wild

February 5, 2013

Today, I'm looking at an excellent study by Schroeder et al., "DRAM errors in the wild: A Large-Scale Field Study". This is the definitive paper on the subject, covering two years, thousands of machines, and millions of DIMM hours. Memory errors are particularly important in the context of growing cluster sizes; one-in-a-million errors become common at scale.

ECC and memory errors

Schroeder and her Google co-authors lead with an overview of DRAM errors and the countermeasures present in today's hardware. Almost all server-grade memory is ECC, meaning it can detect double-bit errors and correct single-bit errors (via something like a 7-4 Hamming code). More advanced "chipkill" schemes can also correct some multi-bit errors. These errors are detected on read; uncorrectable errors (UEs) normally result in a system reboot, while correctable errors (CEs) are fixed up on-the-go. Some systems also have a hardware scrubber, which periodically checks and rewrites errors (at the rate of 1GB every 45 minutes). This is important since it can prevent correctable errors from accumulating and becoming uncorrectable.

Errors are also divided into hard and soft errors. Soft errors are the famed cosmic ray flipping a bit; a random, one-off fault caused by the environment. These are the errors that countermeasures like checksums and hardware scrubbers are designed for. Hard errors are structural, and more difficult to deal with. In practice, these emerge as "stuck bits" which can't be rewritten and fixed, and are caused by things like hardware faults or buggy firmware.

Failure rates

The biggest takeaway from their study is that correctable errors are far more common than previously thought. One third of machines experienced a correctable error in a year. Breaking it down, there's a 1.29% incidence per DIMM-year of a CE. Another important note is that errors are strongly correlated by node; 20% of machines cause 90% of errors. These errors are also strongly correlated by age; a DIMM that experiences 1 error in a month typically experiences 10-100 errors the following month. I think aligning with the aging hypothesis, the authors found that hard errors were much more likely than soft errors.

One surprising data point was that they did not find temperature and error rates to be correlated, once normalized for utilization. Error rates were correlated with utilization though, and utilization is correlated with temperature. I hypothesize that more heavily utilized machines are using more memory and doing more reads, and thus are more prone to errors (else, the random bit flip might just happen in an unused portion of memory and never be encountered).

The uncorrectable error rates plateau over time because Google aggressively decommissions machines that experience UEs. All said, their trace shows a 0.22% incidence per DIMM-year of a UE. Based on this, I don't think there's much danger of undetected errors (should be extremely rare).

In terms of technology, they found no differences between DDR1, DDR2, and FBDIMM. Furthermore, they did not find significant differences between manufacturers. Chipkill is seen as an extremely beneficial technology though, since hardware platforms with chipkill experienced far fewer uncorrectable errors.

Conclusion

Memory errors are scary, and they happen with some frequency. However, the situation seems to be manageable. The study is a strong advocate for ECC memory, chipkill, and hardware scrubbers. Getting proper monitoring to quickly detect and decommission machines with uncorrectable errors is also important. Overall, this paper was a very readable and thorough analysis of memory errors, and will probably remain the gold standard on the topic for some time.