I know that Jeff is a demigod to some people, but I interpret this article as: "As a software guy, I don't really understand why I need this fancy hardware, so this can't be important". IMO he's wrong.
The margins between working and non-working DRAM these days are extremely small. E.g. Rowhammer demonstrated that even user-space programs could readily obliterate main memory, without even trying very hard to do so.[1]
But, maybe in this case he's right. It's not like "open source Internet forum software" is anything that's mission critical. If there's an occasional garble in a character or two, will the latte-swilling hipsters even notice? :-)
Just like the original Google servers he points to. Who cares if they occasionally screwed up in reporting search results, because they didn't have ECC memory. Overall the experience was still 100x better than using something like Altavista.
We once had a new server with all new hardware that had weird problems and kept crashing mysteriously. Memory tests showed no errors, so we were all tearing our hair out. We took the server offline and set it to test continuously, and still no errors. Only after running Memtest86 on nothing but test #4 for about a day did a few memory errors finally show up. We replaced the memory, the problem was gone, and the server started working.
Memory errors are especially insidious compared to how common they are. ECC is worth it.
I tried to catch soft errors for about a year on a couple of Linux boxes I had. They were both desktop form factor machines, one being used as a home server and one as a desktop at work.
I had a background process [1] on each that simply allocated a 128 MB buffer, filled it with a known data pattern, and then went into an infinite loop that slept a while, woke up and checked the integrity of the buffer, and if any of the data had changed logged the change and restored the data pattern.
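Something along these lines; a minimal sketch rather than the actual program from [1], with the buffer size taken from the description and the data pattern and check interval made up:

    /* Minimal user-space soft-error catcher (sketch). Allocates a buffer,
       fills it with a known pattern, then periodically re-checks it,
       logging and repairing any word that changed. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define BUF_BYTES  (128UL * 1024 * 1024)   /* 128 MB, as described above   */
    #define PATTERN    0xA5A5A5A5A5A5A5A5ULL   /* arbitrary known data pattern */
    #define SLEEP_SECS 60                      /* arbitrary check interval     */

    int main(void)
    {
        size_t nwords = BUF_BYTES / sizeof(uint64_t);
        uint64_t *buf = malloc(BUF_BYTES);
        if (!buf) { perror("malloc"); return 1; }

        for (size_t i = 0; i < nwords; i++)
            buf[i] = PATTERN;

        for (;;) {
            sleep(SLEEP_SECS);
            for (size_t i = 0; i < nwords; i++) {
                if (buf[i] != PATTERN) {
                    /* Log the flipped word, then restore the pattern. */
                    fprintf(stderr, "%ld: word %zu read as 0x%016llx\n",
                            (long)time(NULL), i, (unsigned long long)buf[i]);
                    buf[i] = PATTERN;
                }
            }
        }
    }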
Based on the error rates I'd seen published, I expected to catch a few errors. For example, using the rate that Tomte's comment [2] cites I think I'd expect about 6 errors a year.
I never caught an error.
I also have two desktops with ECC (a 2008 Mac Pro and a 2009 Mac Pro). I've used the 2008 Mac Pro every working day since I bought it in 2008, and the 2009 Mac Pro every day since I bought it in 2009. Neither of them has ever reported correcting an error.
I have no idea why I have not been able to see an error.
Soft errors are fairly common; in fact, they enable attacks on DNS resolution such as bitsquatting: https://www.defcon.org/images/defcon-19/dc-19-presentations/...
Anyone who has bought a popular bitsquatted domain name can attest to this.
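To make the bitsquatting idea concrete, here's a toy sketch (not from the talk) that enumerates the single-bit-flip neighbours of a domain name; "example.com" is just a placeholder target:

    /* Prints every string that differs from the target domain by exactly one
       flipped bit and still consists of plausible hostname characters. These
       are the candidate bitsquat domains someone could register. */
    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    static int valid_host_char(char c)
    {
        return isalnum((unsigned char)c) || c == '-' || c == '.';
    }

    int main(void)
    {
        const char *target = "example.com";    /* placeholder target */
        char buf[64];

        for (size_t i = 0; i < strlen(target); i++) {
            for (int bit = 0; bit < 8; bit++) {
                strcpy(buf, target);
                buf[i] ^= (char)(1u << bit);
                if (valid_host_char(buf[i]))
                    printf("%s\n", buf);       /* candidate bitsquat */
            }
        }
        return 0;
    }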
IEC 61508 documents an estimate of 700 to 1200 FIT/Mbit (FIT = "failure in time", i.e. one failure per 10^9 hours of operation) and gives the following sources:
a) J.-L. Autran, P. Roche, C. Sudre et al., "Altitude SEE Test European Platform (ASTEP) and First Results in CMOS 130 nm SRAM", IEEE Transactions on Nuclear Science, vol. 54, no. 4, pp. 1002-1009, Aug. 2007
b) R. C. Baumann, "Radiation-Induced Soft Errors in Advanced Semiconductor Technologies", IEEE Transactions on Device and Materials Reliability, vol. 5, no. 3, Sep. 2005
c) R. Mastipuram and E. C. Wee, "Soft Errors' Impact on System Reliability", Cypress Semiconductor, 2004
d) C. Constantinescu (Intel), "Trends and Challenges in VLSI Circuit Reliability", IEEE Computer Society, 2003
e) P. E. Dodd and L. W. Massengill, "Basic Mechanisms and Modeling of Single-Event Upset in Digital Microelectronics", IEEE Transactions on Nuclear Science, vol. 50, no. 3, pp. 583-602, Jun. 2003
f) F. W. Sexton, "Destructive Single-Event Effects in Semiconductor Devices and ICs", IEEE Transactions on Nuclear Science, vol. 50, no. 3, pp. 603-621, Jun. 2003
g) Ronen and Mendelson, "Coming Challenges in Microarchitecture and Architecture", Proceedings of the IEEE, vol. 89, no. 3, pp. 325-340, Mar. 2001
h) A. Johnston, "Scaling and Technology Issues for Soft Error Rates", 4th Annual Research Conference on Reliability, Stanford University, Oct. 2000
i) International Technology Roadmap for Semiconductors (ITRS), several papers.
If that's correct, the math is simple: you have bit flips in your PC about once a day.
It's just that (a) you often won't notice those transient errors (one pixel in your multi-megapixel photo is one bit off) and (b) a lot of your RAM is probably unused.
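For scale, a rough back-of-the-envelope using the midpoint of that range and assuming 8 GB of RAM (neither figure comes from the IEC document itself):

    8 GB of RAM          = 65,536 Mbit
    ~1000 FIT/Mbit       = 1000 errors per 10^9 hours, per Mbit
    65,536 * 1000 / 10^9 ≈ 0.066 errors/hour ≈ 1.6 errors/day

which is indeed in the "about once a day" ballpark, before accounting for how much of that RAM is actually in use.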
These things do happen with reasonable frequency. I used to work at a division of a major memory manufacturer that wrote tests to find DIMMs exhibiting these sorts of failures; the semiconductor industry calls them "variable retention transfers". (Aside: numerous PhDs in semiconductor physics have built prosperous careers trying to understand why these soft failures happen. Short answer: we have some theories, but we don't really know.) It was provably worth millions of dollars to be able to screen for this sort of phenomenon, because a Google or an Apple or an IBM would return a whole manufacturing lot of your bleeding-edge, high-margin DIMMs if they found one bit error in one chip of one lot, and each lot shipped for millions and millions of dollars.
Anyone who has managed even a modest number of servers with ECC RAM for a reasonable amount of time has surely seen ECC events in their hardware logs. Most of these are one-time errors that never happen again on the same server, ever.
Without ECC these errors would have unknown consequences. They could happen in some unused region of memory, or they could happen in a dirty page in the filesystem cache. It's not fun to discover that your filesystem has been silently corrupted an unknown time after the fact.
Maybe Google doesn't need ECC. Their data is duplicated across several machines and it's extremely unlikely that a few corrupt servers would lead to any data loss.
However, on a smaller scale (and just like RAID) it's cheaper to have ECC than add more servers for extra redundancy.
Or he could have waited a few months and gotten ECC anyway: http://ark.intel.com/products/88171/Intel-Xeon-Processor-E3-...
What he's saying is essentially: "The code I write / the platform I choose scales poorly over multiple cores. Therefore I decide to blame the hardware and skip features that are good for me."
People need to adapt to a world where we have more cores instead of faster execution per core. You can't compare late 90's growth in execution speed per core with the situation we have today.
Write software for an environment where the number of cores scale, instead of an environment where the execution speed of a single core is more important.
This cannot possibly be right. There was a DC21 talk about DNS request misfires caused by bit flips in non-ECC DRAM, and the researcher was able to collect a surprisingly large number of misdirected requests this way.
Edit: found it: https://www.youtube.com/watch?v=ZPbyDSvGasw
If soft errors are rare, parity checking, without correction, might be more useful. It's better to have a server fail hard than make errors. In a "cloud" service, the systems are already in place to handle a hard failure and move the load to another machine. Unambiguous hardware failure detection is exactly what you want.
I don't think that data corruption was a huge issue for Google back then (really early on). Corrupt data? Big whoop. Re-index the internet in another X hours and it's gone. I doubt they had much persistent storage, as most of their data was transient and, well, the Internet.
Also, I still see "fire hazard" when I look at the early Google racks. No idea how Equinix let them get away with it. Too much ivory tower going on there. Not enough "you know we're liable if we burn down the colo with that crap, right?"
The article is wrong.
The Xeon E3-1270 v5 goes from 3.6 to 4.0 GHz and costs only 10% more than the i7-6700 (3.4-4.0 GHz).
Also, the Xeon E3-1230 v5 goes from 3.4 to 3.8 GHz (same base clock) and costs less than the Core i7-6700.
In general, you should never buy non-Xeon CPUs if you have the choice, for desktops or servers alike: ECC memory is essential unless you want to risk chasing mysterious system problems and eventually having to replace your RAM.
Isn't it a simple enough calculation:
Will someone die if the data gets corrupted? If not, then non-ECC should be enough. And you should have checksums everywhere anyway.
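As an illustration of "checksums everywhere", here is a minimal CRC-32 sketch in C (the standard reflected polynomial 0xEDB88320; the payload is just a placeholder):

    /* Compute a CRC-32 over a buffer so the application can notice silent
       corruption before acting on the data. Bitwise, table-free variant. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint32_t crc32_buf(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= p[i];
            for (int bit = 0; bit < 8; bit++)
                crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1));
        }
        return ~crc;
    }

    int main(void)
    {
        const char *payload = "important record";        /* placeholder data */
        uint32_t stored = crc32_buf(payload, strlen(payload));

        /* Later, after the data has made a round trip through RAM, disk or
           the network, recompute and compare before trusting it. */
        if (crc32_buf(payload, strlen(payload)) != stored)
            fprintf(stderr, "checksum mismatch: discard or re-fetch\n");
        else
            printf("checksum ok: 0x%08X\n", stored);
        return 0;
    }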
Please note that the main purpose of ECC is not to reduce the RAM error rate and make it look more reliable, but to help the system stop the process when an unrecoverable memory error occurs, instead of propagating it and producing unpredictable outcomes. The change in the effect of failure is what matters most, not its probability. Without ECC, there is often no clear way to tell whether the result of a computation is valid or garbage that should be discarded.
(Of course, in extreme scenarios, like at Google scale, even ECC can fail to fail, since a multi-bit error can slip past detection, but in almost all non-pathological scenarios, SECDED[1] is enough to catch the erroneous cases.)
[1]: http://cr.yp.to/hardware/ecc.html
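For the curious, here is a toy illustration of what SECDED buys you: a Hamming(7,4) code plus an overall parity bit on a 4-bit nibble. Real ECC DIMMs use a (72,64) code over 64-bit words, but the behaviour sketched here is the same: any single flipped bit is corrected, and any two flipped bits are detected and reported rather than silently "corrected".

    /* Toy SECDED demo: Hamming(7,4) plus an overall parity bit. */
    #include <stdint.h>
    #include <stdio.h>

    /* Encode 4 data bits into an 8-bit codeword.
       Bit layout (LSB first): [p0][p1][p2][d1][p3][d2][d3][d4] */
    static uint8_t secded_encode(uint8_t nibble)
    {
        int d1 = (nibble >> 0) & 1, d2 = (nibble >> 1) & 1;
        int d3 = (nibble >> 2) & 1, d4 = (nibble >> 3) & 1;

        int p1 = d1 ^ d2 ^ d4;   /* covers Hamming positions 1,3,5,7 */
        int p2 = d1 ^ d3 ^ d4;   /* covers Hamming positions 2,3,6,7 */
        int p3 = d2 ^ d3 ^ d4;   /* covers Hamming positions 4,5,6,7 */
        int p0 = p1 ^ p2 ^ d1 ^ p3 ^ d2 ^ d3 ^ d4;   /* overall parity */

        return (uint8_t)(p0 | p1 << 1 | p2 << 2 | d1 << 3 |
                         p3 << 4 | d2 << 5 | d3 << 6 | d4 << 7);
    }

    /* Decode: returns 0 = ok, 1 = single error corrected,
       2 = double error detected (uncorrectable). */
    static int secded_decode(uint8_t cw, uint8_t *nibble)
    {
        int b[8];
        for (int i = 0; i < 8; i++) b[i] = (cw >> i) & 1;

        /* Recompute the three Hamming parity checks (syndrome). */
        int s1 = b[1] ^ b[3] ^ b[5] ^ b[7];
        int s2 = b[2] ^ b[3] ^ b[6] ^ b[7];
        int s3 = b[4] ^ b[5] ^ b[6] ^ b[7];
        int syndrome = s1 | (s2 << 1) | (s3 << 2);

        /* Recompute overall parity over all 8 bits (should be even). */
        int overall = 0;
        for (int i = 0; i < 8; i++) overall ^= b[i];

        int status = 0;
        if (syndrome != 0 && overall != 0) {        /* single-bit error     */
            b[syndrome] ^= 1;                       /* flip it back         */
            status = 1;
        } else if (syndrome == 0 && overall != 0) { /* error in p0 itself   */
            status = 1;
        } else if (syndrome != 0 && overall == 0) { /* two bits flipped     */
            status = 2;
        }

        *nibble = (uint8_t)(b[3] | b[5] << 1 | b[6] << 2 | b[7] << 3);
        return status;
    }

    int main(void)
    {
        uint8_t out;
        uint8_t cw = secded_encode(0xB);          /* encode nibble 1011 */

        int s0 = secded_decode(cw, &out);         /* no error           */
        printf("clean:      status %d, data 0x%X\n", s0, out);

        int s1 = secded_decode(cw ^ 0x10, &out);  /* one bit flipped    */
        printf("1-bit flip: status %d, data 0x%X\n", s1, out);

        int s2 = secded_decode(cw ^ 0x30, &out);  /* two bits flipped   */
        printf("2-bit flip: status %d, data 0x%X\n", s2, out);
        return 0;
    }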