
leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
Why is the M1 Ultra considered UMA and not NUMA?

Good question. I suppose that if you are pedantic enough you might characterize it as NUMA, but in practice it probably won't behave like a NUMA system. The interconnect should be fast enough. There is of course the question of memory topology; hopefully someone will investigate this in detail.
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Good question. I suppose that if you are pedantic enough you might characterize it as NUMA, but in practice it probably won't behave like a NUMA system. The interconnect should be fast enough. There is of course the question of memory topology; hopefully someone will investigate this in detail.

I don't think anyone considers it UMA. Not sure I've even heard Apple say it's UMA. They say it's unified memory architecture, which happens to have the same acronym.
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
I don't think anyone considers it UMA. Not sure I've even heard Apple say it's UMA. They say it's unified memory architecture, which happens to have the same acronym.

Fair enough. I guess I was just trying to comment on the second part of the question. Is it a NUMA architecture in the broad sense? That would depend on how things are connected, I guess. I wouldn't be surprised if the RAM access latency on the Ultra is identical no matter where the RAM is located (I do expect it to be slightly higher than for the Max).
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Fair enough. I guess I was just trying to comment on the second part of the question. Is it a NUMA architecture in the broad sense? That would depend on how things are connected, I guess. I wouldn't be surprised if the RAM access latency on the Ultra is identical no matter where the RAM is located (I do expect it to be slightly higher than for the Max).

I think the latency has to be higher when accessing RAM connected to the other chip. It’s gotta pass through the crossbar, which has to be latched on both sides, so it’s going to take at least a couple cycles longer.

Pretty sure the way it will work is chip A passes the address through the crossbar to chip B (adding at least a cycle or 2), chip B fetches the data, then chip B passes the data through the crossbar to chip A (adding at least a couple of cycles).

Due to clock skew, you may have another cycle wasted in each direction. And, of course, chip B may be busy fetching its own data (though each chip seems to have ample bandwidth to its own memory), which can also cause an indeterminate delay.

We don't know, though, how the cache is managed, etc.
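
If someone with an Ultra wants to probe this, even a crude pointer-chasing loop would give a ballpark load-to-use latency figure (just a sketch I'm making up here, not a rigorous benchmark; as far as I know macOS gives you no direct control over which die's memory a page lands on, so any cross-die penalty would only show up as a shift in the average):

Code:
// Rough pointer-chasing latency probe (illustrative sketch only).
// Chases a random cyclic permutation much larger than the caches and
// reports the average time per dependent load.
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const std::size_t n = (1ull << 28) / sizeof(uint32_t);   // ~256 MB working set
    std::vector<uint32_t> next(n), order(n);
    std::iota(order.begin(), order.end(), 0u);
    std::shuffle(order.begin(), order.end(), std::mt19937(42));
    for (std::size_t i = 0; i + 1 < n; ++i) next[order[i]] = order[i + 1];
    next[order[n - 1]] = order[0];                            // close the cycle

    const std::size_t steps = 20000000;
    uint32_t cur = order[0];
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < steps; ++i) cur = next[cur];  // each load depends on the previous one
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
    std::printf("~%.1f ns per dependent load (checksum %u)\n", ns, cur);
}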
 

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,665
OBX
I think the latency has to be higher when accessing RAM connected to the other chip. It’s gotta pass through the crossbar, which has to be latched on both sides, so it’s going to take at least a couple cycles longer.

Pretty sure the way it will work is chip A passes the address through the crossbar to chip B (adding at least a cycle or 2), chip B fetches the data, then chip B passes the data through the crossbar to chip A (adding at least a couple of cycles).

Due to clock skew, you may have another cycle wasted in each direction. And, of course, chip B may be busy fetching its own data (though each chip seems to have ample bandwidth to its own memory), which can also cause an indeterminate delay.

We don't know, though, how the cache is managed, etc.
Could this also be why the GPU side isn't reaching its full 2x potential? Having to reach across to the memory controller on another die must raise latency; I guess the question is how much of that can be hidden and whether the GPU cache is large enough to do so.
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Could this also be why the GPU side isn't reaching its full 2x potential? Having to reach across to the memory controller on another die must raise latency; I guess the question is how much of that can be hidden and whether the GPU cache is large enough to do so.
I think it’s rare for any workload to scale perfectly. But I also think that for GPUs bandwidth matters a lot more than latency. That said, I’m definitely not a GPU expert. I’ve never worked on a GPU, nor have I even thought about how I’d make one, so I’m really just guessing.

One of the questions I have about how Apple manages the cache is where things are cached (and how the system-level cache is managed). If the original requestor for address X is die 1, but die 2 fetches it from its memory, which cache(s) store the contents of address X? A cache on die 1? A cache on die 2? Both?

If it is stored in die 1’s cache, does die 1 write it into the cache first, then provide it to the core? Or is it written to the cache in parallel with providing it to the core? How is the SLC partitioned? Is the SLC 2x the size of an M1 Max? Does each die store addresses associated only with its own RAM in the SLC? Or is it first come first served? Or is it based on which core did the requesting?

And how are GPU cores associated with CPU cores? Is there a preferential affinity? What if CPU cores on die 1 are busy putting data into memory for use by GPU cores on die 2?

The answers to some of these questions may be obvious to GPU designers or system architects.
 

senttoschool

macrumors 68030
Nov 2, 2017
2,626
5,482
Just some news regarding this. It seems that there is a PR ( https://github.com/embree/embree/pull/330 ) for Embree that may increase the performance of CPU-based ray tracing (Cinebench uses it) by 8 to 13%. So Cinebench still does not use Arm NEON to its full potential, making it an unfair comparison, as the code is well optimized for AVX2.
Looks like a direct result of Apple contributing to Blender? Pretty cool.

You can see a lot of NEON extension code being added in the actual PR.
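
To give a flavour of what that kind of port involves (just an illustration of the general technique, not code lifted from the PR), the SSE-style operations the x86 paths are written against get re-implemented on top of NEON, along these lines:

Code:
// Illustration only: mapping 4-wide SSE-style operations onto NEON.
// Real projects typically pull in a full wrapper header (sse2neon and the like).
#include <arm_neon.h>

using m128 = float32x4_t;   // stand-in for the __m128 type the x86 code expects

static inline m128 mm_add_ps(m128 a, m128 b) { return vaddq_f32(a, b); }
static inline m128 mm_mul_ps(m128 a, m128 b) { return vmulq_f32(a, b); }
static inline m128 mm_min_ps(m128 a, m128 b) { return vminq_f32(a, b); }
static inline m128 mm_max_ps(m128 a, m128 b) { return vmaxq_f32(a, b); }
// Fused multiply-add (c + a*b), heavily used in ray/box slab tests.
static inline m128 mm_fmadd_ps(m128 a, m128 b, m128 c) { return vfmaq_f32(c, a, b); }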

 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
Just some news regarding this. It seems that there is a PR ( https://github.com/embree/embree/pull/330 ) for Embree that may increase the performance of CPU-based ray tracing (Cinebench uses it) by 8 to 13%. So Cinebench still does not use Arm NEON to its full potential, making it an unfair comparison, as the code is well optimized for AVX2.

No matter how Apple tries to patch it (I've been watching that particular PR for several months now, and not much is happening), Embree is still fundamentally limited by the fact that the code is designed around x86 SIMD. True performance potential can probably only be reached by rewriting substantial parts of the code in NEON directly.
 

McDaveH

macrumors member
Dec 14, 2017
30
15
Just some news regarding this. It seems that there is a PR ( https://github.com/embree/embree/pull/330 ) for Embree that may increase the performance of CPU-based ray tracing (Cinebench uses it) by 8 to 13%. So Cinebench still does not use Arm NEON to its full potential, making it an unfair comparison, as the code is well optimized for AVX2.
It's worse than that. Intel engineers Sven & Carsten at some point admitted they did have the fixes to enable 256-bit NEON, so Embree (used in Cinebench but also in Blender's CPU renderer) has been hobbled. There's still some work to do, as it doesn't address the AMX2 units (one per 4-core CPU cluster); these pack a punch of around 2 TFLOPS of matrix multiplication, often used in AI but also in ray tracing. As these sit on the CPU clusters and share the same cache, they should qualify as "CPU".

If you're interested: https://tlkh.dev/benchmarking-the-apple-m1-max
 

McDaveH

macrumors member
Dec 14, 2017
30
15
No matter how Apple tries to patch it (and that particular PR I’ve been watching for several months now, there is not much happening), Embree is still fundamentally limited by the fact that the code is designed around x86 SIMD. True performance potential can probably only be reached by rewriting substantial parts of the code in NEON directly.
"Embree is still fundamentally limited by the fact that" - it's an Intel-controlled software component.
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
Being open source it can just be forked.
Technically true, but if it's a hostile fork and the original project is still intact how do you get downstream users like Cinebench to adopt your fork of Embree over the original? Forking probably isn't a viable option in this case.
 

senttoschool

macrumors 68030
Nov 2, 2017
2,626
5,482
Being open source it can just be forked.
Any fork would need to be adopted by software makers.

Software makers don't want to maintain two different engines that could radically diverge at any point.

Thus, the only reasonable way to go about this is to share the same codebase and add a lot of if statements.

If/when Apple Silicon grabs more market share for Blender/Cinebench then these software makers will optimize more for AS.
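
As a toy illustration of the 'one codebase plus if statements' idea (my sketch, not how Embree actually organizes its kernels), the ISA-specific paths usually get picked at compile time, with a portable scalar fallback:

Code:
// Toy example: one function, per-ISA paths selected at build time.
#include <cstddef>
#if defined(__AVX2__)
  #include <immintrin.h>
#elif defined(__ARM_NEON)
  #include <arm_neon.h>
#endif

void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX2__)
    for (; i + 8 <= n; i += 8) {                       // 8-wide AVX2 path
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
#elif defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {                       // 4-wide NEON path
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    for (; i < n; ++i) out[i] = a[i] + b[i];           // scalar tail / fallback
}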
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
There's still some work to do, as it doesn't address the AMX2 units (one per 4-core CPU cluster); these pack a punch of around 2 TFLOPS of matrix multiplication, often used in AI but also in ray tracing. As these sit on the CPU clusters and share the same cache, they should qualify as "CPU".
How would you use matrix multiplication for ray tracing? I don't see how the AMX units are relevant here…
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,628
1,101
Technically true, but if it's a hostile fork and the original project is still intact how do you get downstream users like Cinebench to adopt your fork of Embree over the original? Forking probably isn't a viable option in this case.
If Intel didn't want to include the Apple patches, Apple could include the Intel patches in its fork. So, downstream projects like Blender could then adopt the Apple fork instead of the Intel project.
 

Andropov

macrumors 6502a
May 3, 2012
746
990
Spain
How would you use matrix multiplication for ray tracing? I don't see how the AMX units are relevant here…
It appears that some of the algorithms for computing the intersection of the ray with a given triangle do use matrices (like: https://www.scratchapixel.com/lesso...gle/moller-trumbore-ray-triangle-intersection).

Given how the AMX units are per-cluster instead of per-core, maybe the overhead of sending tiny 3x3 or 4x4 matrices there is just not worth it, though.
 

Andropov

macrumors 6502a
May 3, 2012
746
990
Spain
If Intel didn't want to include the Apple patches, Apple could include the Intel patches in its fork. So, downstream projects like Blender could then adopt the Apple fork instead of the Intel project.
That would become impossible for Apple to maintain in a short amount of time. Since the Intel version would not have Apple's patches, every time Intel changed a file, the patches to those files would have to be re-applied.
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
It appears that some of the algorithms for computing the intersection of the ray with a given triangle do use matrices (like: https://www.scratchapixel.com/lesso...gle/moller-trumbore-ray-triangle-intersection).

Given how the AMX units are per-cluster instead of per-core, maybe the overhead of sending tiny 3x3 or 4x4 matrices there is just not worth it, though.

Look at the code in that article: there are no matrix multiplications in there, only a bunch of vector products and conditionals. That's not the kind of work the AMX units, which are optimised for multiplying large matrices, are built for.
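
Condensed, the core of that routine looks like this (my paraphrase of the article's scalar code, not Embree's implementation). The 3x3 system in the derivation is solved with Cramer's rule, so everything collapses into cross and dot products plus a few comparisons:

Code:
// Condensed Möller–Trumbore ray/triangle test (illustrative paraphrase).
#include <array>
#include <cmath>

using Vec3 = std::array<float, 3>;

static Vec3  cross(const Vec3& a, const Vec3& b) {
    return { a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0] };
}
static float dot(const Vec3& a, const Vec3& b) { return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]; }
static Vec3  sub(const Vec3& a, const Vec3& b) { return { a[0]-b[0], a[1]-b[1], a[2]-b[2] }; }

bool intersect(const Vec3& orig, const Vec3& dir,
               const Vec3& v0, const Vec3& v1, const Vec3& v2, float& t) {
    const Vec3  e1 = sub(v1, v0), e2 = sub(v2, v0);
    const Vec3  pvec = cross(dir, e2);
    const float det = dot(e1, pvec);
    if (std::fabs(det) < 1e-8f) return false;           // ray parallel to triangle
    const float invDet = 1.0f / det;
    const Vec3  tvec = sub(orig, v0);
    const float u = dot(tvec, pvec) * invDet;
    if (u < 0.0f || u > 1.0f) return false;             // outside barycentric range
    const Vec3  qvec = cross(tvec, e1);
    const float v = dot(dir, qvec) * invDet;
    if (v < 0.0f || u + v > 1.0f) return false;
    t = dot(e2, qvec) * invDet;                          // distance along the ray
    return t > 0.0f;
}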

And besides, many formal algorithm descriptions deal with big-O notation, but things are not as simple in the real world. In practice, you are almost always better off using branchless SIMD even if it ends up doing more operations on paper. Consider, for example, the classical rectangle-line-clipping algorithm (https://en.wikipedia.org/wiki/Liang–Barsky_algorithm). Notice how it uses a bunch of conditionals to set up the parameters and minimise the number of ALU operations performed, but that is far from optimal on modern hardware. I have a version of that algorithm that uses just a couple of SIMD instructions with data-parallel reduction, no branches, no nothing. It runs very fast, especially on M1, since it has fast horizontal instructions (x86 doesn't).
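
To give a flavour of the idea (a simplified sketch, not my actual code), the four per-edge if/else blocks of the textbook version collapse into lane selects plus two horizontal reductions on NEON:

Code:
// Branchless Liang–Barsky segment/rectangle clip, AArch64 NEON (simplified sketch).
// Clips P0 + t*(P1-P0), t in [0,1], against [xmin,xmax] x [ymin,ymax].
#include <arm_neon.h>

bool clip_segment(float x0, float y0, float x1, float y1,
                  float xmin, float ymin, float xmax, float ymax,
                  float& t_enter, float& t_exit) {
    const float dx = x1 - x0, dy = y1 - y0;
    // One lane per rectangle edge: p is the direction component, q the signed distance.
    const float p_arr[4] = { -dx, dx, -dy, dy };
    const float q_arr[4] = { x0 - xmin, xmax - x0, y0 - ymin, ymax - y0 };
    const float32x4_t p = vld1q_f32(p_arr);
    const float32x4_t q = vld1q_f32(q_arr);
    const float32x4_t zero = vdupq_n_f32(0.0f);
    const float32x4_t one  = vdupq_n_f32(1.0f);

    // Parallel to an edge and outside it (p == 0 && q < 0 in any lane) -> reject.
    const uint32x4_t outside = vandq_u32(vceqq_f32(p, zero), vcltq_f32(q, zero));
    if (vmaxvq_u32(outside) != 0) return false;

    // Per-edge parameter; lanes with p == 0 are masked out below, so inf/NaN there is harmless.
    const float32x4_t r = vdivq_f32(q, p);
    // Entering edges (p < 0) bound t from below, exiting edges (p > 0) from above.
    const float32x4_t enter = vbslq_f32(vcltq_f32(p, zero), r, zero);
    const float32x4_t exit  = vbslq_f32(vcgtq_f32(p, zero), r, one);

    // Horizontal reductions -- the fast horizontal instructions mentioned above.
    t_enter = vmaxvq_f32(enter);   // the zero fill doubles as the t >= 0 bound
    t_exit  = vminvq_f32(exit);    // the one fill doubles as the t <= 1 bound
    return t_enter <= t_exit;
}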
 