
diamond.g

macrumors G4
Mar 20, 2007
11,438
2,665
OBX
The original M1 page on apple.com said the (regular) M1 supported 24,576 threads in flight. That'd mean (since the max number of threads per threadgroup is fixed at 1,024 and there are only 8 GPU cores) that a single GPU core can execute different threadgroups at once (up to 3).

For the M1 Ultra the number would be 196,608 threads in flight (just like you say). High, but still lower than even a 1920x1080 image, and that's a 'best case scenario' where you're running light shaders/kernels that don't push the register pressure too far (otherwise fewer threads per threadgroup would fit in memory, and the max threads in flight would decrease).

I'm not saying that scheduling doesn't play a role in the apparently poor scaling of the M1 Ultra on some tasks, but I'm not convinced that's the entirety of the problem.

As for how many shader cores/units the Ultra has: it does have 8,192 ALUs, but is every single one of them executing instructions from a different thread? Not sure how the scheduling works at that level.
So if scheduling isn't the issue (as in all ALUs are fed), then that would leave the memory subsystem as the culprit?
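As a back-of-the-envelope check of the numbers in the quoted post (the inputs are the figures cited above, not official specs, and the program is only an illustrative sketch):

Code:
#include <stdio.h>

int main(void)
{
    /* Figures quoted in the discussion above (assumptions, not official Apple specs). */
    const int threads_in_flight_m1 = 24576;   /* from Apple's original M1 page          */
    const int max_threads_per_tg   = 1024;    /* Metal's maxTotalThreadsPerThreadgroup  */
    const int gpu_cores_m1         = 8;

    /* Threadgroups resident per core when every threadgroup uses the maximum size. */
    printf("threadgroups per core: %d\n",
           threads_in_flight_m1 / (max_threads_per_tg * gpu_cores_m1));   /* -> 3 */

    /* Scale the same per-core occupancy to the Ultra's 64 cores and compare
       against the pixel count of a 1920x1080 image. */
    const int threads_in_flight_ultra = (threads_in_flight_m1 / gpu_cores_m1) * 64;
    printf("Ultra threads in flight: %d, 1080p pixels: %d\n",
           threads_in_flight_ultra, 1920 * 1080);   /* 196608 vs 2073600 */
    return 0;
}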
 

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,228
The original M1 page on apple.com said the (regular) M1 supported 24,576 threads in flight. That'd mean (since the max number of threads per threadgroup is fixed at 1,024 and there are only 8 GPU cores) that a single GPU core can execute different threadgroups at once (up to 3).

For the M1 Ultra the number would be 196,608 threads in flight (just like you say). High, but still lower than even a 1920x1080 image, and that's a 'best case scenario' where you're running light shaders/kernels that don't push the register pressure too far (otherwise fewer threads per threadgroup would fit in memory, and the max threads in flight would decrease).

I'm not saying that scheduling doesn't play a role in the apparently poor scaling of the M1 Ultra on some tasks, but I'm not convinced that's the entirety of the problem.

As for how many shader cores/units the Ultra has: it does have 8,192 ALUs, but is every single one of them executing instructions from a different thread? Not sure how the scheduling works at that level.

I noticed at the other place (don’t have an account there yet) you mentioned the GPU has a RAM limit of 48 out of 64GB - this isn’t true. The GPU can fill as much as it likes; the soft limit is just Apple recommending you not do so. I believe @leman or someone did the test a while back.

Found it: https://forums.macrumors.com/thread...ore-vram.2320970/?post=30563961#post-30563961

So it doesn’t seem like a full test was done (@Gnattu did one, but with a caveat), but @leman convinced me this was a soft limit, not a hard one.
 
Last edited:

Andropov

macrumors 6502a
May 3, 2012
746
990
Spain
So if scheduling isn't the issue (as in all ALUs are fed), then that would leave the memory subsystem as the culprit?
Hard to say for sure without profiling. Since the problem is that it doesn't scale as expected, it'd have to be something that is different on the M1 Ultra vs the M1 Max. Two things I can think of are the scheduling and the interconnect, but I'm not convinced with either. We'll have to wait until someone gets his hands on one of these.

So it doesn’t seem like a full test was done (@Gnattu did one, but with a caveat), but @leman convinced me this was a soft limit, not a hard one.
Maybe it is a soft limit after all. I remember reading a lot of people in my (twitter) timeline talking about a 48GB limit after the M1 Pro/Max was released, so I thought it had already been tried and tested by several people. Maybe it wasn't.

I was planning on making up a test for some GPU-related things tonight to better understand how Apple's GPU works (regarding how many threads are executing in parallel). I could test how much memory Metal can fill before something happens too (I only have the M1 Pro / 16GB though).
 
  • Like
Reactions: crazy dave

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,665
OBX
Hard to say for sure without profiling. Since the problem is that it doesn't scale as expected, it'd have to be something that is different on the M1 Ultra vs the M1 Max. Two things I can think of are the scheduling and the interconnect, but I'm not convinced with either. We'll have to wait until someone gets his hands on one of these.


Maybe it is a soft limit after all. I remember reading a lot of people in my (twitter) timeline talking about a 48GB limit after the M1 Pro/Max was released, so I thought it had already been tried and tested by several people. Maybe it wasn't.

I was planning on making up a test for some GPU-related things tonight to better understand how Apple's GPU works (regarding how many threads are executing in parallel). I could test how much memory Metal can fill before something happens too (I only have the M1 Pro / 16GB though).
Well they said the interconnect is high speed, but didn't mention how much latency is added by going across the link.
 

dugbug

macrumors 68000
Aug 23, 2008
1,929
2,147
Somewhere in Florida
Well they said the interconnect is high speed, but didn't mention how much latency is added by going across the link.

If one GPU core is working on data kept in the RAM of the 'far-side' chip, for example. With the stuff I throw at Xeons we never really cared about latency across chips (it's in the noise, really), but with GPUs split like this yet acting as one to the software, I gotta believe there is some sensitivity here? Very interested.


-d
 

Andropov

macrumors 6502a
May 3, 2012
746
990
Spain
Well they said the interconnect is high speed, but didn't mention how much latency is added by going across the link.
They mention that it has minimal latency, and I don't see any reason why the latency would skyrocket with their interposer.

If one GPU core is working on data kept in the RAM of the 'far-side' chip, for example. With the stuff I throw at Xeons we never really cared about latency across chips (it's in the noise, really), but with GPUs split like this yet acting as one to the software, I gotta believe there is some sensitivity here? Very interested.
Even if the interconnect's 2.5TB/s were the total across both directions, that'd still leave 1.25TB/s of bandwidth from one side to the other. The memory bandwidth is 'only' 400GB/s between a die and its closest RAM, so the interconnect should have more than enough bandwidth to handle a die accessing the full 400GB/s of memory from memory chips on 'the other side', so I can't see how that'd be a bottleneck. Latency will increase compared to the M1 Max, but GPUs are good at hiding latency.
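To put rough numbers on that, here is a small sketch using the figures from this thread; the even 50/50 split between the two directions is an assumption, not something Apple has specified:

Code:
#include <stdio.h>

int main(void)
{
    /* Figures from the discussion: 2.5 TB/s total UltraFusion bandwidth,
       400 GB/s of memory bandwidth per M1 Max die. Assume the link
       bandwidth splits evenly between the two directions. */
    const double interconnect_total_gbs = 2500.0;
    const double per_direction_gbs      = interconnect_total_gbs / 2.0;  /* 1250 GB/s */
    const double mem_bw_per_die_gbs     = 400.0;

    printf("per-direction link bandwidth: %.0f GB/s\n", per_direction_gbs);
    printf("headroom over one die's full memory traffic: %.2fx\n",
           per_direction_gbs / mem_bw_per_die_gbs);                      /* ~3.12x */
    return 0;
}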
 

Rigby

macrumors 603
Aug 5, 2008
6,257
10,215
San Jose, CA
I think that "dubious" is the wrong word. Just browsing quickly, you see almost no instructions with a 1:1 mapping.
Really? Here's, for example, the implementation of _mm_fmadd_ps, which is probably frequently used by something like Embree:

Code:
FORCE_INLINE __m128 _mm_fmadd_ps(__m128 a, __m128 b, __m128 c)
{
#if defined(__aarch64__)
    return vreinterpretq_m128_f32(vfmaq_f32(vreinterpretq_f32_m128(c),
                                            vreinterpretq_f32_m128(b),
                                            vreinterpretq_f32_m128(a)));
#else
...
There are tons more like that.
 

Rigby

macrumors 603
Aug 5, 2008
6,257
10,215
San Jose, CA
You are the one claiming that SSE and NEON are practically equivalent. The burden of proof should be on you.
Just looking at the code should tell you that.

Anyway, there are reasons why ARM has spent a lot of work on SVE, which should be more competitive with AVX2/512. Will be interesting to see if Apple adopts it (for >= 256 bit vectors) in the M-series CPUs.

ADL has a decisive core advantage, and Cinebench also scales extremely well with SMT.
No doubt. Similarly, the M1 Ultra is at a core advantage when compared to Alder Lake. But SMT is an advantage that Intel will always have over ARM.

Rewriting it just to get better CB scores? Please, that’s childish. Apple is pushing GPU rendering, where they have much better scaling potential.
Embree isn't just used for artificial benchmarks. It is for example used by Blender, which is a popular application in Apple's target group. Incidentally, Blender CPU rendering doesn't perform very well on M1.
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
Anyway, there are reasons why ARM has spent a lot of work on SVE, which should be more competitive with AVX2/512. Will be interesting to see if Apple adopts it (for >= 256 bit vectors) in the M-series CPUs.

SVE aims to solve a different problem. It does not replace NEON. Also, I don’t see Apple adopting larger than 128bit SIMD, SVE or not, as there is not much benefit in doing that. Their throughput strategy relies on the GPU and the AMX units. The CPU is for flexibility, not throughput, which is also why M1 is killing it in scientific applications. They could adopt more FP units in future architectures though.

Embree isn't just used for artificial benchmarks. It is for example used by Blender, which is a popular application in Apple's target group. Incidentally, Blender CPU rendering doesn't perform very well on M1.

For applications like Blender, GPU rendering is a much more lucrative target. That’s why Apple’s efforts are concentrated there. The Embree patches are just the obvious low-hanging fruit.
 

Gerdi

macrumors 6502
Apr 25, 2020
449
301
Really? Here's, for example, the implementation of _mm_fmadd_ps, which is probably frequently used by something like Embree:

Code:
FORCE_INLINE __m128 _mm_fmadd_ps(__m128 a, __m128 b, __m128 c)
{
#if defined(__aarch64__)
    return vreinterpretq_m128_f32(vfmaq_f32(vreinterpretq_f32_m128(c),
                                            vreinterpretq_f32_m128(b),
                                            vreinterpretq_f32_m128(a)));
#else
...
There are tons more like that.

Yes, there are tons more, as well as tons of examples where the emulated code takes 10 times the execution time of the original SSE code, or longer. I expect the average emulation penalty to be close to a factor of 2 compared to a pure NEON implementation - and even that requires that a large part of the instructions used are 1:1 mappings.
I'm not sure you understand that you have to look at the average emulation penalty, not just the best case.
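To illustrate the other end of the spectrum, here is a sketch (my own, not sse2neon's actual implementation) of how an intrinsic with no single-instruction NEON equivalent, _mm_movemask_ps, might be translated on AArch64: one SSE instruction becomes a shift, a constant load, a per-lane shift and a horizontal add.

Code:
#include <arm_neon.h>

/* Sketch of a _mm_movemask_ps replacement for AArch64: collect the sign bit
   of each of the four float lanes into a 4-bit integer mask. */
static inline int movemask_ps_neon(float32x4_t a)
{
    /* Isolate each lane's sign bit (each lane becomes 0 or 1). */
    uint32x4_t sign = vshrq_n_u32(vreinterpretq_u32_f32(a), 31);
    /* Shift lane i's bit into bit position i... */
    const int32_t shifts[4] = {0, 1, 2, 3};
    uint32x4_t weighted = vshlq_u32(sign, vld1q_s32(shifts));
    /* ...and add the lanes horizontally to form the mask. */
    return (int)vaddvq_u32(weighted);
}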
 
Last edited:

Gerdi

macrumors 6502
Apr 25, 2020
449
301
Anyway, there are reasons why ARM has spent a lot of work on SVE, which should be more competitive with AVX2/512. Will be interesting to see if Apple adopts it (for >= 256 bit vectors) in the M-series CPUs.

I'm not convinced you really understand why ARM is proposing SVE going forward. To understand this, you should ask yourself what features SVE has that neither NEON nor AVX2/512 support.
Also, the vector length is not that important - having more units with a smaller vector length can even be an advantage.
 

Rigby

macrumors 603
Aug 5, 2008
6,257
10,215
San Jose, CA
SVE aims to solve a different problem. It does not replace NEON. Also, I don’t see Apple adopting larger than 128bit SIMD, SVE or not, as there is not much benefit in doing that.
My experience with AVX says otherwise. The jump to 256-bit vectors brought huge performance gains for some workloads.

Their throughput strategy relies on the GPU and the AMX units.
That may well be the case. But if they neglect CPU SIMD performance, that affects some workloads that depend on it. That may be exactly what we are seeing here.

For applications like Blender, GPU rendering is a much more lucrative target. That’s why Apple’s efforts are concentrated there. The Embree patches are just the obvious low-hanging fruit.
GPU rendering cannot completely replace CPU rendering.
 

Rigby

macrumors 603
Aug 5, 2008
6,257
10,215
San Jose, CA
Yes, there are tons more, as well as tons of examples where the emulated code takes 10 times the execution time of the original SSE code, or longer. I expect the average emulation penalty to be close to a factor of 2 compared to a pure NEON implementation - and even that requires that a large part of the instructions used are 1:1 mappings.
I'm not sure you understand that you have to look at the average emulation penalty, not just the best case.
SSE2Neon is not emulation, but translation. And we can't judge the performance impact on a specific application without specifying the instruction mix that it uses. Also, not all intrinsics are equally performance-critical. The ones used in the inner loops matter most.
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
My experience with AVX says otherwise. The jump to 256-bit vectors brought huge performance gains for some workloads.

Your experience with AVX does not mean anything since x86 uses a completely different design paradigm. Apple goes for flexibility; this is why they increase the number of ALUs rather than just making the ALUs bigger. And that’s why their FP performance is off the charts. The beauty of SVE is that it doesn’t matter what size your ALUs are as long as you can schedule the instructions properly. You can use 8 128-bit ALUs and get an effective 1024-bit SIMD throughput from a single SVE instruction - or you can do 8 scalar/smaller SIMD FP ops. Intel with two AVX-512 units has the same SIMD throughput - but they can only do two smaller FP operations, because their ALUs are fundamentally limited.
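As a concrete illustration of that vector-length-agnostic point, here is a minimal sketch in C using the ARM C Language Extensions for SVE (the SAXPY example and the function name are mine, not from the thread, and it needs an SVE-enabled compiler target): the same compiled loop fills 128-bit ALUs or much wider ones without being rewritten.

Code:
#include <arm_sve.h>
#include <stdint.h>

/* y[i] += a * x[i], written vector-length-agnostically: svcntw() reports how
   many 32-bit lanes the hardware provides, and the predicate handles the tail. */
void saxpy_sve(float a, const float *x, float *y, int64_t n)
{
    for (int64_t i = 0; i < n; i += (int64_t)svcntw()) {
        svbool_t pg = svwhilelt_b32_s64(i, n);    /* active lanes for this iteration */
        svfloat32_t vx = svld1_f32(pg, x + i);
        svfloat32_t vy = svld1_f32(pg, y + i);
        vy = svmla_n_f32_x(pg, vy, vx, a);        /* vy + vx * a */
        svst1_f32(pg, y + i, vy);
    }
}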



That may well be the case. But if they neglect CPU SIMD performance, that affects some workloads that depend on it. That may be exactly what we are seeing here.

They do not neglect CPU SIMD performance. They simply do a different balancing act. M1 excels at general-purpose, complex FP and SIMD code. And x86 SIMD will have a slight edge in throughput if optimized well - mostly thanks to higher clocks. That’s it. It is up to you to decide what is more important to you personally.

GPU rendering cannot completely replace CPU rendering.

Why not? The biggest problem is memory. Apple’s UMA completely removes this factor.
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
SSE2Neon is not emulation, but translation. And we can't judge the performance impact on a specific application without specifying the instruction mix that it uses. Also, not all intrinsics are equally performance-critical. The ones used in the inner loops matter most.

You can call it whatever you want, the nature of the phenomenon does not change. It’s a different ISA and you are emulating semantics. Unless your algorithms are entirely trivial you won’t get an optimal implementation. And even then you will be foregoing advanced features of NEON like FMA that SSE simply doesn’t offer.
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
PC workstations can accommodate much more RAM than AS Macs.

We were talking about GPU rendering. Even the most powerful workstation GPUs are ultimately limited by their VRAM and the slow PCIe interface.
 

oz_rkie

macrumors regular
Apr 16, 2021
177
165
We were talking about GPU rendering. Even the most powerful workstation GPUs are ultimately limited by their VRAM and the slow PCIe interface.
Workstation cards use NVLink though, which can reach up to 600 GB/s of interconnect bandwidth as opposed to just 64GB/s on PCIe Gen4. Obviously these are super high-end cards.
 

jeanlain

macrumors 68020
Mar 14, 2009
2,460
954
We were talking about GPU rendering. Even the most powerful workstation GPUs are ultimately limited by their VRAM and the slow PCIe interface.
Your post suggested that GPU rendering could replace CPU rendering thanks to AS, but this assumes no one needs more than 128 GB of memory.
 

januarydrive7

macrumors 6502a
Oct 23, 2020
537
578
Your post suggested that GPU rendering could replace CPU rendering thanks to AS, but this assumes no one needs more than 128 GB of memory.
I believe what @leman was saying is that, given equal access to memory, the simple and highly parallel nature of GPUs makes them an ideal candidate for rendering as opposed to CPUs.

In other words, if you're limited to 128GB of memory, and both the CPU and GPU have equal access to it, then the GPU is the better target for your rendering job. If, on the other hand, you have M GB of VRAM and N >> M GB of RAM, then it's possible that CPU rendering is the better option. Apple Silicon sits in the former camp of designs, so GPU rendering ought to always win out.
 
  • Like
Reactions: leman and Andropov

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
The memory bandwidth is 'only' 400GB/s between a die and its closest RAM, so the interconnect should have more than enough bandwidth to handle a die accessing the full 400GB/s of memory from memory chips on 'the other side', so I can't see how that'd be a bottleneck.
The UltraFusion fabric is likely only used to ensure cache coherency across the L1, L2 and L3 caches of both M1 Max dies, probably similar to how the caches are synchronised across the various cache blocks within one M1 Max die. That's why it requires a tremendous amount of low-latency bandwidth.

My simple calculation shows that for a single CPU core accessing its L1 cache with a 1,024-bit (?) cache line at the max CPU frequency of 3.2 GHz, we would need:

(1,024 bits / 8) = 128 bytes per cycle x 3.2 GHz ≈ 400 GB/s of throughput

If the cache line is 512 bits, throughput would be halved to ~200 GB/s per CPU core.

If multiple cores can access the caches simultaneously (which I think they can), the internal fabric has to be able to push bits very, very quickly, likely in the TB/s range. That's why the UltraFusion fabric needs to offer high-throughput, low-latency performance.

Each M1 Max's 4 LPDDR5 memory controllers would just load from memory and push it into that die's L3 cache, and the UltraFusion fabric would then sync it across both dies' L3 caches. Same thing when the L2 and L1 caches change across the dies.
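The same calculation as a small sketch, for both cache-line widths mentioned (the 3.2 GHz clock and one cache line per cycle per core are the post's assumptions, not confirmed figures):

Code:
#include <stdio.h>

int main(void)
{
    /* One CPU core filling one cache line per cycle at 3.2 GHz,
       as assumed in the post above. */
    const double clock_hz = 3.2e9;
    const int line_bits[] = { 512, 1024 };

    for (int i = 0; i < 2; i++) {
        double bytes_per_cycle = line_bits[i] / 8.0;
        double gb_per_s = bytes_per_cycle * clock_hz / 1e9;
        printf("%4d-bit line: ~%.0f GB/s per core\n", line_bits[i], gb_per_s);
        /* 512-bit  -> ~205 GB/s (roughly the 200 GB/s above)
           1024-bit -> ~410 GB/s (roughly the 400 GB/s above) */
    }
    return 0;
}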
 

vladi

macrumors 65816
Jan 30, 2010
1,008
617
We were talking about GPU rendering. Even the most powerful workstation GPUs are ultimately limited by their VRAM and the slow PCIe interface.

Out-of-core rendering has been present in serious render solutions for quite a while now.
 

Rigby

macrumors 603
Aug 5, 2008
6,257
10,215
San Jose, CA
They do not neglect CPU SIMD performance. They simply do a different balancing act. M1 excels at general purpose, complex FP and SIMD code.
They spent an awful lot of silicon on highly specialized functionality for niche audiences (like the ProRes encoders). Given that their dies are already huge, it stands to reason that they had to save elsewhere.

And x86 SIMD will have a slight edge in throughput if optimized well - mostly thanks to higher clocks. That’s it. It is up to you to decide what is more important to you personally.
As I said above, there is a lot more to it. But since neither of us have in-depth knowledge of ARM SIMD instructions, that doesn't seem to be a productive discussion.
Why not? The biggest problem is memory. Apple’s UMA completely removes this factor.
Try running a custom shader (OSL) using the GPU renderer, for example.
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
Your post suggested that GPU rendering could replace CPU rendering thanks to AS, but this assumes no one needs more than 128 GB of memory.

Apple has not presented their high-end solution yet. The Mac Pro will likely have different RAM limits.

In other words: the Studio is a 128GB class machine. It can’t deal with bigger problem sizes, no matter whether you do your rendering on the CPU or on the GPU. So this is not an argument against GPU-based rendering.

They spent an awful lot of silicon on highly specialized functionality for niche audiences (like the ProRes encoders). Given that their dies are already huge, it stands to reason that they had to save elsewhere.

They have to build hardware that offers the best balance of features across a range of different products. That is why they focus on flexibility, latency and energy efficiency as opposed to pure throughput when it comes to SIMD.
Try running a custom shader (OSL) using the GPU renderer, for example.

What’s the problem? Apple GPUs support function pointers, dynamic linking, recursion and other goodies. Sure, the shaders will have to be written taking the peculiarities of the GPU architecture into account, but that is more straightforward than, say, optimizing for AVX-512, and lets you target a wider class of hardware.
 
Last edited:
  • Like
Reactions: JMacHack and altaic