ECC Memory?

leman · Jan 13, 2023

Basic75 said:
We do know that RAM errors happen, and that they cause random crashes and data corruption. Another benefit of ECC RAM is that you can get an advance warning before the RAM fails outright, i.e. while it is starting to fail and producing a higher rate of errors.

And what if it isn't a crash but silent data corruption? Why is that only bad if it's on a server? A businessman's Excel sheet, a software developer's source code, your Amazon order and online banking transaction, are they not important?

The same persistent data that passes through RAM at least once on its way to and from persistent storage? ECC is not rocket science. We are investing so much into nicer screens, faster drives, faster everything actually, why not invest just a little bit to make your system more reliable, too?

I totally understand what you mean. However, at the end of the day this is about cost–benefit analysis. The main question for me is how real the chance of "proper" data corruption is, as opposed to a crash. I mean, if it costs 5–10% of performance at every given moment to avoid something that will probably not happen in my lifetime… is it really a must-have feature? But if one can instead demonstrate that data corruption is real and happens with a high enough frequency (even if it’s one file in a few months), then the ECC tradeoffs are definitely worth it.

Basic75 · Jan 13, 2023

leman said:
I totally understand what you mean. However, at the end of the day this is about cost–benefit analysis. The main question for me is how real the chance of "proper" data corruption is, as opposed to a crash. I mean, if it costs 5–10% of performance at every given moment to avoid something that will probably not happen in my lifetime… is it really a must-have feature? But if one can instead demonstrate that data corruption is real and happens with a high enough frequency (even if it’s one file in a few months), then the ECC tradeoffs are definitely worth it.

Even though Intel marketing has tried to brainwash everybody into believing that only servers need ECC, RAM errors are not that uncommon.

DRAM error rates: Nightmare on DIMM street

A two-and-a-half-year study of DRAM on tens of thousands of Google servers found DIMM error rates are hundreds to thousands of times higher than thought -- a mean of 3,751 correctable errors per DIMM per year. This is the world's first large-scale study of RAM errors in the field.

www.zdnet.com

leman · Jan 13, 2023

Basic75 said:
Even though Intel marketing has tried to brainwash everybody into believing that only servers need ECC, RAM errors are not that uncommon.

DRAM error rates: Nightmare on DIMM street

A two-and-a-half-year study of DRAM on tens of thousands of Google servers found DIMM error rates are hundreds to thousands of times higher than thought -- a mean of 3,751 correctable errors per DIMM per year. This is the world's first large-scale study of RAM errors in the field.

www.zdnet.com

Yeah, I know that study. I don’t find it convincing. The distribution of errors is too skewed, suggesting that a lot of errors were constrained to specific hardware.
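
To illustrate why a skewed distribution matters, here is a minimal Python sketch with made-up per-DIMM error counts (the numbers are purely hypothetical and not taken from the study). A single bad module can dominate the mean while the typical module sees no errors at all.

```python
from statistics import mean, median

# Hypothetical yearly correctable-error counts for ten DIMMs.
# These values are invented for illustration, NOT data from the Google study.
errors_per_dimm = [0, 0, 0, 0, 0, 0, 0, 0, 2, 37_508]

avg = mean(errors_per_dimm)        # dominated by the single faulty module
typical = median(errors_per_dimm)  # what a typical module experiences
affected = sum(1 for e in errors_per_dimm if e > 0) / len(errors_per_dimm)

print(f"mean: {avg:.0f}, median: {typical}, DIMMs with any error: {affected:.0%}")
```

Under these assumed numbers the mean says little about what an individual machine would see, which is the objection being raised here.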

Analog Kid · Jan 13, 2023

leman said:
I am under the impression that we are not yet at a point where RAM errors are ubiquitous enough to actually cause serious harm.

Unless you're running for public office in Belgium...

Electronic_voting_in_Belgium

Analog Kid · Jan 13, 2023

leman said:
I totally understand what you mean. However, at the end of the day this is about cost–benefit analysis. The main question for me is how real the chance of "proper" data corruption is, as opposed to a crash. I mean, if it costs 5–10% of performance at every given moment to avoid something that will probably not happen in my lifetime… is it really a must-have feature? But if one can instead demonstrate that data corruption is real and happens with a high enough frequency (even if it’s one file in a few months), then the ECC tradeoffs are definitely worth it.

Yeah, I think the reason most consumer equipment has gotten away without it is because most failures (like most genetic mutations) are generally fatal or corrected by internal consistency checks when saving a file or something. If the system crashes, it's annoying, but most people blame Windows or Tim Cook (this never would have happened when Steve was alive!), reboot and move on.

It's more important in scientific computing and such where there's a risk that a dataset is corrupted in memory leading to an invalid outcome. If you are spending hours or days processing terabytes of data, there's a reasonable chance that a random event will occur in that time that impacts your results.
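
As a rough sketch of that reasoning, assume memory errors arrive independently at a constant rate (a Poisson model, which is a simplification) and pick a made-up rate. The chance of at least one error during a job then grows with its duration:

```python
import math

def p_at_least_one_error(errors_per_year: float, job_hours: float) -> float:
    """P(>= 1 error during the job) under a Poisson model with a constant rate."""
    hours_per_year = 365 * 24
    expected = errors_per_year * job_hours / hours_per_year
    return 1.0 - math.exp(-expected)

# Hypothetical rate: one error per machine per year (an assumption, not a measurement).
for hours in (1, 24, 24 * 7):
    print(f"{hours:>4} h job: {p_at_least_one_error(1.0, hours):.2%}")
```

The rate is the whole question, of course. Plug in whatever figure you find credible for your hardware.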

leman · Jan 13, 2023

Analog Kid said:
Yeah, I think the reason most consumer equipment has gotten away without it is because most failures (like most genetic mutations) are generally fatal or corrected by internal consistency checks when saving a file or something. If the system crashes, it's annoying, but most people blame Windows or Tim Cook (this never would have happened when Steve was alive!), reboot and move on.

It's more important in scientific computing and such where there's a risk that a dataset is corrupted in memory leading to an invalid outcome. If you are spending hours or days processing terabytes of data, there's a reasonable chance that a random event will occur in that time that impacts your results.

Precisely. And that’s why those domains use ECC memory. But it’s not something I need when doing MCMC Bayesian estimation for example. I’m already running thousands of estimates, I don’t care if one of those gets corrupted once in a while.

Basic75 · Jan 14, 2023

Linus Torvalds’s faulty RAM slows kernel development

Emperor penguin swipes Intel's attitude to ECC memory and maybe wimpy Mac performance too

www.theregister.com

leman said:
But it’s not something I need when doing MCMC Bayesian estimation for example. I’m already running thousands of estimates, I don’t care if one of those gets corrupted once in a while.

What if the corruption is in the code and affects all of these thousands of estimates?

I'm sorry, I really cannot fathom why anybody would resist using a readily available technology that increases the reliability and stability of our digital devices!

The price would come down if it were deployed ubiquitously.

The only reason against it seems to be "I feel like I don't need it", even though we know that RAM cells are not perfect, that they occasionally produce errors, and that the error rate can increase at some point in the life of a RAM chip, which is another thing ECC would help detect early enough.

Nobody knows how many faults they have encountered on their non-ECC machines because they often wouldn't notice! Not every fault leads to the equivalent of a blue screen, and even then you'd probably blame MS or Apple and not your RAM because you don't know what happened.

Why don’t PCs use error correcting RAM? “Because Intel,” says Linus

Artificial market segmentation may have suppressed demand for ECC in desktops.

arstechnica.com

"The only reason Intel says 'ECC is for servers and embedded' is because Intel marketing people have convinced the powers that be that they can sell otherwise inferior chips for a higher price by enabling ECC functionality,"

And some people here have listened to Intel too much.

With DRAM cells shrinking all the time, they are not getting much more reliable. Meanwhile, RAM sizes are ever increasing, so the probability of an error somewhere in a device rises over the years.

Basic75 · Jan 14, 2023

Analog Kid said:
I think the reason most consumer equipment has gotten away without it is because most failures (like most genetic mutations) are generally fatal or corrected by internal consistency checks when saving a file or something.

I think the reason most consumer equipment has gotten away without ECC is because more often than not the cheaper option wins. And what consistency checks? Are applications performing hashes over the user data every time it changes in memory? That would be quite a hassle.
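
For illustration, this is roughly what such a check would involve. A minimal Python sketch, assuming the application keeps a SHA-256 digest next to its data and re-verifies it before use; the buffer contents and the flipped bit position are arbitrary choices:

```python
import hashlib

def digest(buf: bytes) -> str:
    return hashlib.sha256(buf).hexdigest()

def flip_bit(buf: bytes, bit_index: int) -> bytes:
    """Return a copy of buf with one bit inverted, simulating a memory error."""
    corrupted = bytearray(buf)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)

data = b"quarterly totals: 1024, 2048, 4096"
stored = digest(data)              # computed when the data was last written

damaged = flip_bit(data, 42)       # a single flipped bit
print(digest(data) == stored)      # unchanged data still verifies
print(digest(damaged) == stored)   # the flip is detected, but not corrected
```

Note that a check like this can only detect the flip, not repair it, and the digest has to be recomputed on every legitimate change to the data.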

leman · Jan 14, 2023

Basic75 said:
Linus Torvalds’s faulty RAM slows kernel development

Emperor penguin swipes Intel's attitude to ECC memory and maybe wimpy Mac performance too

www.theregister.com

So Torvalds's PC broke, what's your point? It's not like ECC RAM doesn't break. I don't see how ECC RAM would have prevented this. Sure, he might have realised that he had a problem a bit earlier. The end effect is the same: he has to wait until the new RAM arrives.

Basic75 said:
I'm sorry, I really cannot fathom why anybody would resist using a readily available technology that increases the reliability and stability of our digital devices!

The price would come down if it were deployed ubiquitously.

I'm not resisting anything. I am simply not convinced of the usefulness. You are trying to sell me a load of bulky protective gear for my light hike. Maybe this gear would protect me from bruising my knee once in a while. But for now I'd prefer to take the risk so that I don't have to sweat under all this stuff.

Don't get me wrong, please. If the next revision of Apple Silicon comes with error correction, I'd be delighted. I'd even take a RAM capacity hit for that (those codes have to be stored somewhere). But not if it will cut the performance by 20% or increase the price by 20%.

Basic75 · Jan 14, 2023

leman said:
I'd even take a RAM capacity hit for that (those codes have to be stored somewhere). But not if it will cut the performance by 20% or increase the price by 20%.

Implementations of ECC add extra RAM chips, so ECC does not cut into the capacity. Next, ECC RAM is like 2% slower, not 20%, which should be hardly noticeable given today's cache sizes. And that one extra RAM chip for every eight is 12.5% more; whether that would be added to Apple's already ridiculously overpriced RAM prices remains to be seen.
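
The "one extra chip for every eight" figure matches the usual arithmetic for a single-error-correcting, double-error-detecting (SECDED) Hamming code over a 64-bit word. A small Python sketch of that calculation (the function name is just for illustration):

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for an extended Hamming (SECDED) code over data_bits."""
    r = 0
    # Hamming bound for single-error correction: 2**r >= data_bits + r + 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall parity bit adds double-error detection

for k in (8, 32, 64):
    c = secded_check_bits(k)
    print(f"{k} data bits -> {c} check bits ({c / k:.1%} extra)")
```

The relative overhead shrinks as the protected word gets wider, which is why 64-bit words with 8 check bits are a common choice for side-band ECC.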

theluggage · Jan 14, 2023

Basic75 said:
Implementations of ECC add extra RAM chips, so ECC does not cut into the capacity.

Whichever way you cut it, you need to pay for an extra RAM chip per module.

Moreover, Apple Silicon uses LPDDR RAM, which would probably need inline ECC, and that does take part of the RAM capacity to store the ECC data.

Error Correction Code (ECC) in DDR Memories | Synopsys IP

Explore how ECC memory enhances DDR reliability, preventing data corruption and system failures effectively.

www.synopsys.com

...this would be a problem for Apple Silicon since the SoCs we've seen so far already have low-ish maximum RAM sizes, a trade-off for the power/speed advantages of having LPDDR RAM mounted directly on the package. Part of that is down to memory bus width, but some of it is physically how many RAM dies you can fit on the package... Adding ECC would eat ~20% of that. Even with "regular" ECC RAM you have to find space for the extra RAM chips somewhere.
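
As a back-of-the-envelope sketch of the capacity cost: with inline ECC the check bits live in the same DRAM as the data, so the usable share depends on the code's ratio. The ratios and the package size below are assumptions for illustration, not figures from any LPDDR controller documentation:

```python
def usable_fraction(data_bits: int, check_bits: int) -> float:
    """Share of raw capacity left for data when check bits are stored inline."""
    return data_bits / (data_bits + check_bits)

raw_gib = 64  # hypothetical package capacity
for data_bits, check_bits in ((64, 8), (64, 16)):
    frac = usable_fraction(data_bits, check_bits)
    print(f"{check_bits} check bits per {data_bits}: "
          f"{raw_gib * frac:.1f} GiB usable ({1 - frac:.1%} reserved)")
```

How much a real implementation reserves would depend on the code it uses and on how the controller lays out the ECC region.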

I don't think there's any real demand or justification for ECC below "Mac Pro" serious workstation level - and the main area where it starts to become essential (along with redundant PSUs, parity-checked RAID for storage etc.) is in the server/high-density computing market, and Apple haven't had a dog in that race since they dropped the XServe.

As for the Mac Pro, either Apple are going to create a new processor with expandable RAM and shedloads of GPU-capable PCIe lanes to create something comparable to the 2019 MP (in which case it would probably have ECC) or it's going to be a souped-up Mac Studio based on the next version of the M1 Max/Ultra, in which case people complaining about lack of ECC will have to queue up behind those who want upgradeable RAM, AMD GPUs and 64 lanes of PCIe...

Basic75 said:
I'm sorry, I really cannot fathom why anybody would resist using a readily available technology that increases the reliability and stability of our digital devices!

There's always one more thing that you can pay for to make your digital device a bit more reliable. It comes down to cost/benefit and weighing up vs. all the other things that can cause code to crash or produce bad results. Why not have RAID 6 storage, military EMP/solar-flare-proof chips and multiple redundancy for everything? Those things all exist and for some applications they are worthwhile but, like ECC, they cost extra.

leman · Jan 14, 2023

Basic75 said:
Implementations of ECC add extra RAM chips, so ECC does not cut into the capacity. Next, ECC RAM is like 2% slower, not 20%, which should be hardly noticeable given today's cache sizes. And that one extra RAM chip for every eight is 12.5% more; whether that would be added to Apple's already ridiculously overpriced RAM prices remains to be seen.

As @theluggage writes above, you seem to be talking about side-band ECC, which widens the RAM bus (and the RAM itself) to send error correction codes alongside the data. But LPDDR5 uses inline ECC, where the error correction codes are stored in the same RAM and fetched with a separate request. This eats into both bandwidth and latency. I wasn't able to find any hard numbers that describe the performance penalty, but it's fairly clear that it's going to be non-trivial, at least when using LPDDR5. Maybe Apple will roll their own totally custom DDR that can do it all. Who knows.

Dismayed · Jan 17, 2023

Keep in mind that 92.7% of statistics are made up on the spot.

Basic75 · Jan 21, 2023

RWT Forums - Real World Tech

www.realworldtech.com

sunny5 · Jan 21, 2023

Basic75 said:
You're literally saying that speed is more important than correctness. To me that's an insane point of view.

All ECC RAM trades some performance for better stability compared with normal RAM of the same generation, so I guess he knows nothing about ECC RAM.

sam_dean · Jan 21, 2023

TylerL said:
Intel's lack of ECC memory support on the Core line was a conscious effort to differentiate their Xeon line from "consumer" chips, as much or more so than simply a cost consideration.
With Intel out of the picture, would Apple make a point to move to ECC memory on Apple Silicon Macs?

The added assurance of stability can be a selling point for Pro workflows, and it's hard to argue that it wouldn't benefit The Rest Of Us at the same time.
Added cost has never held Apple back before... but what about the extra space required?

I think whatever Apple is doing is head and shoulders above what other PC makers do. So I do not think about something that is out of my hands.

Basic75 · Feb 10, 2023

theluggage said:
I don't think there's any real demand or justification for ECC below "Mac Pro" serious workstation level - and the main area where it starts to become essential (along with redundant PSUs, parity-checked RAID for storage etc.) is in the server/high-density computing market, and Apple haven't had a dog in that race since they dropped the XServe.

RWT Forums - Real World Tech

www.realworldtech.com

leman · Feb 11, 2023

The RWT discussion is very high quality and should be followed by anyone interested in this topic. After having read the very informative posts, I am even more convinced that the only way for Apple to add ECC support is to design a custom RAM solution.
