In my experience, memory latency is a much bigger bottleneck than bandwidth when working with large amounts of data. Apple has made two architectural choices that make Apple Silicon worse for this kind of work than traditional x86 hardware.
I would say that if your workload is constrained by memory latency, you are pretty much screwed regardless. Even the fastest RAM has latency in the ballpark of 60-70ns; at 4-5 GHz and several instructions per cycle, that is hundreds of instruction slots for any modern performance-oriented CPU. You just can’t afford stalls like that without performance tanking to zero.
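To make that concrete, here is a minimal C sketch (sizes and names are made up for illustration) of the worst case: a pointer chase through a ~128 MB buffer arranged as one random cycle. Every load depends on the previous one, so each iteration pays roughly the full DRAM latency no matter how wide the core is:

```c
#include <stdio.h>
#include <stdlib.h>

#define N (16 * 1024 * 1024)   /* 16M size_t entries = 128 MB on a 64-bit machine */

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;
    for (size_t i = 0; i < N; i++) next[i] = i;
    /* Sattolo-style shuffle: always swaps with a strictly smaller index,
       which yields a single-cycle permutation, so the chase visits every
       slot and the prefetcher gets no usable pattern. (rand() is good
       enough for a sketch; a biased j does not break the single cycle.) */
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j in [0, i-1] */
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }
    /* The chase itself: each load address depends on the previous load,
       so out-of-order execution cannot overlap the misses. */
    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = next[p];
    printf("%zu\n", p);   /* keep the compiler from discarding the loop */
    free(next);
    return 0;
}
```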
For inherently parallel problems, latency is usually not an issue. GPUs, for example, can be understood as devices that trade latency for compute throughput, and they have been very successful. So I would first look for bottlenecks elsewhere before blaming RAM latency.
Then again, maybe I am misunderstanding what you mean by “latency”? You seem to refer to RAM latency, but your post appears to be more concerned with the latency of dependent operations.
First, once working set size is larger than about 100 MB, memory latency is higher with Apple Silicon than with the Intel hardware that preceded it.
The latency is higher simply because LPDDR has higher latency. Also, Apple Silicon uses more memory controllers, which adds latency as well. As mentioned above, I doubt that this is really your problem, unless of course the nature of your memory accesses is truly unpredictable and your working set is larger than the cache. But again, if that were the case you’d see very bad performance on any hardware.
As a consequence, while my M2 Max MBP is much faster at compiling code than my Intel iMac, there is often no noticeable performance difference in running the code.
I would guess your code has long data dependency chains and does not play well with the massively wide Apple cores (they need high ILP to reach high performance)? Or maybe it’s bound by some other constraint, e.g. SIMD throughput (Apple’s weakness compared to x86)?
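To illustrate what I mean by dependency chains vs. ILP, here is a tiny C sketch (purely illustrative, not your code): the same sum written as one serial chain and as four independent chains. A wide OOO core can only bring its width to bear on the second form (the results can differ slightly because FP addition gets reassociated):

```c
#include <stddef.h>

/* One long dependency chain: each add must wait for the previous one,
   so the loop runs at the latency of a single FP add per element. */
double sum_serial(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent chains: the OOO core can execute them in parallel,
   so throughput rather than add latency becomes the limit. */
double sum_ilp(const double *a, size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++) s0 += a[i];   /* leftover elements */
    return (s0 + s1) + (s2 + s3);
}
```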
Second, Apple uses a few large and fast CPU cores instead of many smaller and slower ones, which is the wrong choice for workloads constrained by memory latency. There is also no support for SMT, which can improve CPU utilization with workloads like this. As a consequence, performance per die area is lower with Apple Silicon than with x86 for these workloads.
I don’t really see how using more cores would address the latency problem; wouldn’t the bottleneck be the memory controller/RAM interface anyway? Maybe you mean something like a barrel processor, where threads are switched out to hide latency? GPUs work that way (kind of, at least). But I am not aware of any SMT implementation that can do that: if a thread is stalled on a memory access, it stays stalled. From what I understand, SMT usually helps hide the latency of dependent operations, since the execution resources are shared between multiple threads, so if one thread can’t fully saturate the OOO machinery, two hopefully can.
Or do you mean that with more, smaller cores, memory stalls become “cheaper” and you still make more forward progress than when stalling fewer large cores? I’d need to think more about it. The whole thing is made more complicated by the fact that a single OOO core (especially one as large as Apple’s) can track hundreds of computation chains at once, hiding latency as it makes forward progress.
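As a sketch of that latency-hiding effect (reusing the hypothetical single-cycle 'next' permutation from the chase example above): interleaving several independent chases lets one core keep multiple misses in flight at once, so throughput improves even though each individual load is just as slow:

```c
#include <stddef.h>

/* 'next' is assumed to be a single-cycle permutation as built earlier;
   p0..p3 are distinct starting indices, 'steps' is the number of
   dependent loads per chain. */
size_t chase4(const size_t *next, size_t steps,
              size_t p0, size_t p1, size_t p2, size_t p3) {
    for (size_t i = 0; i < steps; i++) {
        /* The four loads are independent of each other, so the OOO core
           can overlap their miss latencies instead of adding them up. */
        p0 = next[p0];
        p1 = next[p1];
        p2 = next[p2];
        p3 = next[p3];
    }
    return p0 ^ p1 ^ p2 ^ p3;   /* combine results so nothing is optimised away */
}
```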
What’s also interesting is that Intel apparently plans to drop SMT from their future hardware, reportedly because of issues with performance scaling and power consumption. There is an in-depth discussion about SMT happening right now on RWT (most of it is way beyond me, but still interesting to read):
https://www.realworldtech.com/forum/?threadid=212944&curpostid=213086