
leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
Why is the M1 Ultra considered UMA and not NUMA?

Good question. I suppose that if you are pedantic enough you might characterize it as NUMA, but in practice it probably won't behave like a NUMA system. The interconnect should be fast enough. There is of course the question of memory topology; hopefully someone will investigate this in detail.
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Good question. I suppose that if you are pedantic enough you might characterize it as NUMA, but in practice it probably won't behave like a NUMA system. The interconnect should be fast enough. There is of course the question of memory topology; hopefully someone will investigate this in detail.

I don't think anyone considers it UMA. Not sure I've even heard Apple say it's UMA. They say it's unified memory architecture, which happens to have the same acronym.
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
I don't think anyone considers it UMA. Not sure I've even heard Apple say it's UMA. They say it's unified memory architecture, which happens to have the same acronym.

Fair enough. I guess I was just trying to comment on the second part of the question. Is it a NUMA architecture in the broad sense? That would depend on how things are connected, I guess. I wouldn't be surprised if the RAM access latency on the Ultra is identical no matter where the RAM is located (I do expect it to be slightly higher than for the Max).
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Fair enough. I guess I was just trying to comment on the second part of the question. Is it a NUMA architecture in the broad sense? That would depend on how things are connected, I guess. I wouldn't be surprised if the RAM access latency on the Ultra is identical no matter where the RAM is located (I do expect it to be slightly higher than for the Max).

I think the latency has to be higher when accessing RAM connected to the other chip. It’s gotta pass through the crossbar, which has to be latched on both sides, so it’s going to take at least a couple cycles longer.

Pretty sure the way it will work is chip A passes the address through the crossbar to chip B (adding at least a cycle or 2), chip B fetches the data, then chip B passes the data through the crossbar to chip A (adding at least a couple of cycles).

Due to clock skew, you may have another cycle wasted in each direction. And, of course, chip B may be busy fetching its own data (though each chip seems to have ample bandwidth to its own memory), which can also cause an indeterminate delay.

We don't know, though, how the cache is managed, etc.
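
If someone with an Ultra wants to probe this, even a crude pointer-chasing loop would give a ballpark load-to-use latency figure (just a sketch I'm making up here, not a rigorous benchmark; as far as I know macOS gives you no direct control over which die's memory a page lands on, so any cross-die penalty would only show up as a shift in the average):

Code:
// Rough pointer-chasing latency probe (illustrative sketch only).
// Chases a random cyclic permutation much larger than the caches and
// reports the average time per dependent load.
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const std::size_t n = (1ull << 28) / sizeof(uint32_t);   // ~256 MB working set
    std::vector<uint32_t> next(n), order(n);
    std::iota(order.begin(), order.end(), 0u);
    std::shuffle(order.begin(), order.end(), std::mt19937(42));
    for (std::size_t i = 0; i + 1 < n; ++i) next[order[i]] = order[i + 1];
    next[order[n - 1]] = order[0];                            // close the cycle

    const std::size_t steps = 20000000;
    uint32_t cur = order[0];
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < steps; ++i) cur = next[cur];  // each load depends on the previous one
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
    std::printf("~%.1f ns per dependent load (checksum %u)\n", ns, cur);
}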
 

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,665
OBX
I think the latency has to be higher when accessing RAM connected to the other chip. It’s gotta pass through the crossbar, which has to be latched on both sides, so it’s going to take at least a couple cycles longer.

Pretty sure the way it will work is chip A passes the address through the crossbar to chip B (adding at least a cycle or 2), chip B fetches the data, then chip B passes the data through the crossbar to chip A (adding at least a couple of cycles).

Due to clock skew, you may have another cycle wasted in each direction. And, of course, chip B may be busy fetching its own data (though each chip seems to have ample bandwidth to its own memory), which can also cause an indeterminate delay.

We don't know, though, how the cache is managed, etc.
Could this also be why the GPU side isn't reaching its full 2x potential? Having to reach across to the memory controller on another die must raise latency; I guess the question is how much of that can be hidden and whether the GPU cache is large enough to do so.
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Could this also be why the GPU side isn't reaching its full 2x potential? Having to reach across to the memory controller on another die must raise latency; I guess the question is how much of that can be hidden and whether the GPU cache is large enough to do so.
I think it’s rare for any workload to scale perfectly. But I also think that for GPUs bandwidth matters a lot more than latency. That said, I’m definitely not a GPU expert. I’ve never worked on a GPU, nor have I even thought about how I’d make one, so I’m really just guessing.

One of the questions I have about how Apple manages the cache is where things are cached (and how the system-level cache is managed). If the original requestor for address X is die 1, but die 2 fetches it from its memory, which cache(s) store the contents of address X? A cache on die 1? A cache on die 2? Both?

If it is stored in die 1’s cache, does die 1 write it into the cache first, then provide it to the core? Or is it written to the cache in parallel with providing it to the core? How is the SLC partitioned? Is the SLC 2x the size of an M1 Max? Does each die store addresses associated only with its own RAM in the SLC? Or is it first come first served? Or is it based on which core did the requesting?

And how are GPU cores associated with CPU cores? Is there a preferential affinity? What if CPU cores on die 1 are busy putting data into memory for use by GPU cores on die 2?

The answers to some of these questions may be obvious to GPU designers or system architects.
 

senttoschool

macrumors 68030
Nov 2, 2017
2,626
5,482
Just some news regarding this. It seems that there is a PR ( https://github.com/embree/embree/pull/330 ) for Embree that may increase the performance of CPU-based ray tracing (Cinebench uses it) by 8 to 13%. So Cinebench still does not use Arm NEON to its full potential, making it an unfair comparison, as the code is well optimized for AVX2.
Looks like a direct result of Apple contributing to Blender? Pretty cool.

You can see a lot of NEON extension code being added in the actual PR.
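
To give a flavour of what that kind of port involves (just an illustration of the general technique, not code lifted from the PR), the SSE-style operations the x86 paths are written against get re-implemented on top of NEON, along these lines:

Code:
// Illustration only: mapping 4-wide SSE-style operations onto NEON.
// Real projects typically pull in a full wrapper header (sse2neon and the like).
#include <arm_neon.h>

using m128 = float32x4_t;   // stand-in for the __m128 type the x86 code expects

static inline m128 mm_add_ps(m128 a, m128 b) { return vaddq_f32(a, b); }
static inline m128 mm_mul_ps(m128 a, m128 b) { return vmulq_f32(a, b); }
static inline m128 mm_min_ps(m128 a, m128 b) { return vminq_f32(a, b); }
static inline m128 mm_max_ps(m128 a, m128 b) { return vmaxq_f32(a, b); }
// Fused multiply-add (c + a*b), heavily used in ray/box slab tests.
static inline m128 mm_fmadd_ps(m128 a, m128 b, m128 c) { return vfmaq_f32(c, a, b); }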

 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
Just some news regarding this. It seems that there is a PR ( https://github.com/embree/embree/pull/330 ) for Embree that may increase the performance of CPU-based ray tracing (Cinebench uses it) by 8 to 13%. So Cinebench still does not use Arm NEON to its full potential, making it an unfair comparison, as the code is well optimized for AVX2.

No matter how Apple tries to patch it (I've been watching that particular PR for several months now, and not much is happening), Embree is still fundamentally limited by the fact that the code is designed around x86 SIMD. True performance potential can probably only be reached by rewriting substantial parts of the code in NEON directly.
 

McDaveH

macrumors member
Dec 14, 2017
30
15
Just some news regarding this. It seems that there is a PR ( https://github.com/embree/embree/pull/330 ) for Embree that may increase the performance of CPU-based ray tracing (Cinebench uses it) by 8 to 13%. So Cinebench still does not use Arm NEON to its full potential, making it an unfair comparison, as the code is well optimized for AVX2.
It's worse than that. Intel engineers Sven & Carsten at some point admitted they did have the fixes to enable 256-bit NEON, so Embree (used in Cinebench but also in Blender's CPU renderer) has been hobbled. There's still some work to do, as it doesn't address the AMX2 units (one per 4-core CPU cluster); these pack a punch of around 2 TFLOPS of matrix multiplication, often used in AI but also in ray tracing. As these sit on the CPU clusters and share the same cache, they should qualify as "CPU".

If you're interested: https://tlkh.dev/benchmarking-the-apple-m1-max
 

McDaveH

macrumors member
Dec 14, 2017
30
15
No matter how Apple tries to patch it (and that particular PR I’ve been watching for several months now, there is not much happening), Embree is still fundamentally limited by the fact that the code is designed around x86 SIMD. True performance potential can probably only be reached by rewriting substantial parts of the code in NEON directly.
"Embree is still fundamentally limited by the fact that" - it's an Intel-controlled software component.
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
Being open source it can just be forked.
Technically true, but if it's a hostile fork and the original project is still intact how do you get downstream users like Cinebench to adopt your fork of Embree over the original? Forking probably isn't a viable option in this case.
 

senttoschool

macrumors 68030
Nov 2, 2017
2,626
5,482
Being open source it can just be forked.
Any fork would need to be adopted by software makers.

Software makers don't want to maintain two different engines that could radically diverge at any point.

Thus, the only reasonable way to go about this is to share the same codebase and add a lot of if statements.

If/when Apple Silicon grabs more market share for Blender/Cinebench then these software makers will optimize more for AS.
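
As a toy illustration of the 'one codebase plus if statements' idea (my sketch, not how Embree actually organizes its kernels), the ISA-specific paths usually get picked at compile time, with a portable scalar fallback:

Code:
// Toy example: one function, per-ISA paths selected at build time.
#include <cstddef>
#if defined(__AVX2__)
  #include <immintrin.h>
#elif defined(__ARM_NEON)
  #include <arm_neon.h>
#endif

void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX2__)
    for (; i + 8 <= n; i += 8) {                       // 8-wide AVX2 path
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
#elif defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {                       // 4-wide NEON path
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    for (; i < n; ++i) out[i] = a[i] + b[i];           // scalar tail / fallback
}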
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
There's still some work to do, as it doesn't address the AMX2 units (one per 4-core CPU cluster); these pack a punch of around 2 TFLOPS of matrix multiplication, often used in AI but also in ray tracing. As these sit on the CPU clusters and share the same cache, they should qualify as "CPU".
How would you use matrix multiplication for ray tracing? I don't see how the AMX units are relevant here…
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,628
1,101
Technically true, but if it's a hostile fork and the original project is still intact how do you get downstream users like Cinebench to adopt your fork of Embree over the original? Forking probably isn't a viable option in this case.
If Intel didn't want to include the Apple patches, Apple could include the Intel patches in its fork. So, downstream projects like Blender could then adopt the Apple fork instead of the Intel project.
 

Andropov

macrumors 6502a
May 3, 2012
746
990
Spain
How would you use matrix multiplication for ray tracing? I don't see how the AMX units are relevant here…
It appears that some of the algorithms for computing the intersection of the ray with a given triangle do use matrices (like: https://www.scratchapixel.com/lesso...gle/moller-trumbore-ray-triangle-intersection).

Given how the AMX units are per-cluster instead of per-core, maybe the overhead of sending tiny 3x3 or 4x4 matrices there is just not worth it, though.
 

Andropov

macrumors 6502a
May 3, 2012
746
990
Spain
If Intel didn't want to include the Apple patches, Apple could include the Intel patches in its fork. So, downstream projects like Blender could then adopt the Apple fork instead of the Intel project.
That would become impossible for Apple to maintain in a short amount of time. Since the Intel version would not have Apple's patches, every time Intel changed a file, the patches to those files would have to be re-applied.
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
It appears that some of the algorithms for computing the intersection of the ray with a given triangle do use matrices (like: https://www.scratchapixel.com/lesso...gle/moller-trumbore-ray-triangle-intersection).

Given how the AMX units are per-cluster instead of per-core, maybe the overhead of sending tiny 3x3 or 4x4 matrices there is just not worth it, though.

Look at the code in that article: there are no matrix multiplications in there, only a bunch of vector products and conditionals. That's not the kind of work the AMX units, which are optimised for multiplying large matrices, are built for.
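
Condensed, the core of that routine looks like this (my paraphrase of the article's scalar code, not Embree's implementation). The 3x3 system in the derivation is solved with Cramer's rule, so everything collapses into cross and dot products plus a few comparisons:

Code:
// Condensed Möller–Trumbore ray/triangle test (illustrative paraphrase).
#include <array>
#include <cmath>

using Vec3 = std::array<float, 3>;

static Vec3  cross(const Vec3& a, const Vec3& b) {
    return { a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0] };
}
static float dot(const Vec3& a, const Vec3& b) { return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]; }
static Vec3  sub(const Vec3& a, const Vec3& b) { return { a[0]-b[0], a[1]-b[1], a[2]-b[2] }; }

bool intersect(const Vec3& orig, const Vec3& dir,
               const Vec3& v0, const Vec3& v1, const Vec3& v2, float& t) {
    const Vec3  e1 = sub(v1, v0), e2 = sub(v2, v0);
    const Vec3  pvec = cross(dir, e2);
    const float det = dot(e1, pvec);
    if (std::fabs(det) < 1e-8f) return false;           // ray parallel to triangle
    const float invDet = 1.0f / det;
    const Vec3  tvec = sub(orig, v0);
    const float u = dot(tvec, pvec) * invDet;
    if (u < 0.0f || u > 1.0f) return false;             // outside barycentric range
    const Vec3  qvec = cross(tvec, e1);
    const float v = dot(dir, qvec) * invDet;
    if (v < 0.0f || u + v > 1.0f) return false;
    t = dot(e2, qvec) * invDet;                          // distance along the ray
    return t > 0.0f;
}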

And besides, many formal algorithm descriptions deal with big-O notation, but things are not as simple in the real world. In practice, you are almost always better off using branchless SIMD even if it ends up doing more operations on paper. Consider, for example, the classical rectangle-line-clipping algorithm (https://en.wikipedia.org/wiki/Liang–Barsky_algorithm). Notice how it uses a bunch of conditionals to set up the parameters and minimise the number of ALU operations performed, but that is far from optimal on modern hardware. I have a version of that algorithm that uses just a couple of SIMD instructions with data-parallel reduction, no branches, no nothing. It runs very fast, especially on M1, since it has fast horizontal instructions (x86 doesn't).
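
To give a flavour of the idea (a simplified sketch, not my actual code), the four per-edge if/else blocks of the textbook version collapse into lane selects plus two horizontal reductions on NEON:

Code:
// Branchless Liang–Barsky segment/rectangle clip, AArch64 NEON (simplified sketch).
// Clips P0 + t*(P1-P0), t in [0,1], against [xmin,xmax] x [ymin,ymax].
#include <arm_neon.h>

bool clip_segment(float x0, float y0, float x1, float y1,
                  float xmin, float ymin, float xmax, float ymax,
                  float& t_enter, float& t_exit) {
    const float dx = x1 - x0, dy = y1 - y0;
    // One lane per rectangle edge: p is the direction component, q the signed distance.
    const float p_arr[4] = { -dx, dx, -dy, dy };
    const float q_arr[4] = { x0 - xmin, xmax - x0, y0 - ymin, ymax - y0 };
    const float32x4_t p = vld1q_f32(p_arr);
    const float32x4_t q = vld1q_f32(q_arr);
    const float32x4_t zero = vdupq_n_f32(0.0f);
    const float32x4_t one  = vdupq_n_f32(1.0f);

    // Parallel to an edge and outside it (p == 0 && q < 0 in any lane) -> reject.
    const uint32x4_t outside = vandq_u32(vceqq_f32(p, zero), vcltq_f32(q, zero));
    if (vmaxvq_u32(outside) != 0) return false;

    // Per-edge parameter; lanes with p == 0 are masked out below, so inf/NaN there is harmless.
    const float32x4_t r = vdivq_f32(q, p);
    // Entering edges (p < 0) bound t from below, exiting edges (p > 0) from above.
    const float32x4_t enter = vbslq_f32(vcltq_f32(p, zero), r, zero);
    const float32x4_t exit  = vbslq_f32(vcgtq_f32(p, zero), r, one);

    // Horizontal reductions -- the fast horizontal instructions mentioned above.
    t_enter = vmaxvq_f32(enter);   // the zero fill doubles as the t >= 0 bound
    t_exit  = vminvq_f32(exit);    // the one fill doubles as the t <= 1 bound
    return t_enter <= t_exit;
}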
 