3D Rendering on Apple Silicon, CPU&GPU

leman · Feb 3, 2025

@crazy dave Excellent write-up with a lot of attention to detail! The GPU implementation details are unfortunately closely guarded secrets, so it can be very difficult to get a clear picture of what is going on.

The idea I personally find most compelling is that it boils down to data movement. I've seen it suggested that Nvidia does not provide enough parallel data path bandwidth to actually feed all the processing units they have. Instead, they rely on a caching circuitry called the operand collector to optimize register storage access. This schema (if true) would mean that some of their ALUs are idle at least some of the time, which results in poorer efficiency. At the same time, it seems that Apple uses a different approach, trying to maximize the computational efficiency.

There is some evidence lending credibility to this world. First, Nvidia GPU cores are smaller than Apple GPU cores despite including more compute (e.g. tensor processing units) and larger storage banks. A narrower data path would explain this discrepancy. Second, Apple appears to be the only GPU on the market with four-input instructions (compare and select), which are difficult to feed.

Assuming that this theory is correct, it is hard to say which approach makes more sense. One might say that Nvidia is being wasteful, but they do achieve comparable or better compute per die area, and certainly better compute per $. By using smaller processing units, they can cram more of them into their device, and idling some of the units can be a good strategy for both energy movement and work balancing. It appears that both strategies excel at what they are doing, and both would have some difficulties with scaling.

Apple could — quite easily — implement concurrent FP32 pipes within their GPU cores. However, this would also require them widening the data path by ~ 33% if they want to keep the same efficiency. Right now, Apple cannot sustain parallel fetches of two 32-bit values, which becomes clear when running FP32 and Int code concurrently — it gets a penalty (no such penalty is observed when combining 32-bit and 16-bit pipes). In principle, it could be well worth it, as they are likely to improve the performance by way more than 30%.

crazy dave · Feb 4, 2025

leman said:
@crazy dave Excellent write-up with a lot of attention to detail! The GPU implementation details are unfortunately closely guarded secrets, so it can be very difficult to get a clear picture of what is going on.

The idea I personally find most compelling is that it boils down to data movement. I've seen it suggested that Nvidia does not provide enough parallel data path bandwidth to actually feed all the processing units they have. Instead, they rely on a caching circuitry called the operand collector to optimize register storage access. This schema (if true) would mean that some of their ALUs are idle at least some of the time, which results in poorer efficiency. At the same time, it seems that Apple uses a different approach, trying to maximize the computational efficiency.

There is some evidence lending credibility to this world. First, Nvidia GPU cores are smaller than Apple GPU cores despite including more compute (e.g. tensor processing units) and larger storage banks. A narrower data path would explain this discrepancy. Second, Apple appears to be the only GPU on the market with four-input instructions (compare and select), which are difficult to feed.

Assuming that this theory is correct, it is hard to say which approach makes more sense. One might say that Nvidia is being wasteful, but they do achieve comparable or better compute per die area, and certainly better compute per $. By using smaller processing units, they can cram more of them into their device, and idling some of the units can be a good strategy for both energy movement and work balancing. It appears that both strategies excel at what they are doing, and both would have some difficulties with scaling.

Apple could — quite easily — implement concurrent FP32 pipes within their GPU cores. However, this would also require them widening the data path by ~ 33% if they want to keep the same efficiency. Right now, Apple cannot sustain parallel fetches of two 32-bit values, which becomes clear when running FP32 and Int code concurrently — it gets a penalty (no such penalty is observed when combining 32-bit and 16-bit pipes). In principle, it could be well worth it, as they are likely to improve the performance by way more than 30%.

Thanks!

I like the data path argument, especially as it generalizes to the Apple situation with FP32 and Int32. I'm not as sold on the core size argument as while it is true that Apple lacks tensor cores altogether, they double the number of pipes per core that Nvidia has and Nvidia doesn't even have separate FP16 pipes. BTW found this for a description of modern Nvidia pipes and how they operate:

I need help understanding how concurrency of CUDA Cores and Tensor Cores works between Turing and Ampere/Ada?

If you are trying understand the microarchitecture of the SM the first step is to never use the term CUDA core or Tensor core as these are marketing terms that help provide a magnitude of performance. At a high level each SM consists of the following hierarchy (this can change slightly between...

forums.developer.nvidia.com

Very cool. The discussion about where the FP16 work goes (earlier in the thread), the tensor cores or the FP32 pipes is interesting and it appears there may be 3 FP32 pipes but I'm pretty sure some of these are logical pipelines, not all are physical.

Here's another resource:

Hardware Counters in GPU Captures

Contents Supported Hardware Intro: Hardware Counters in GPU Captures Event List Counters High Frequency Counters New: Occupancy Graphs Old: Occupancy Graph AMD Counters Intel Occupancy Graphs NVIDIA Counter Acronyms/Terms Supported Hardware Timing Analysis Event List Counters High Frequency...

devblogs.microsoft.com

Plus we don't know how big Apple's ray tracing cores are - in some ways while I show they can keep up with Nvidia's in a per TFLOPS basis, that means individually they have to coordinate with a full 128 FP32 units per core - while for Nvidia it's an up-to-128 FP32 units per core but as we've been discussing Nvidia never gets close to that it's 64+x where x << 64. So Apple's ray tracing cores might be more efficient from that perspective. But we don't know how big they are. So it's less clear to me how all that might impact core size - too many variables.

Plus there is another reason I mention below for why Nvidia fails to get anywhere close to double the FP32 throughput and why Apple might not either simply if they were to double up their FP32 cores, because it's intrinsic to the design: Occasionally one does have to do INT32 work! Make no mistake I don't think that explains everything and I like the data path argument, especially as it applies to Apple, but if Nvidia's double throughput methodology relies on using their INT32 pipes for FP work, that means anytime there is actual INT32 work to be done, which is not uncommon and why there are separate INT32 pipes to begin with!, then they can't take advantage of the double throughput.

To test this we'd need to run your utilization code (ported to CUDA ofc*) on an Ampere or later GPU and test not just pure FP, INT, FP & FP, FP & INT, and FP & FP & INT work but also FP & FP & occasional INT and see how much INT work is needed to drag performance down to 30% and if that's reasonable. *Edit: I do recognize you wrote it to test Apple's dual-issue and Nvidia's is different but philosophically we would need to do something similar.

With Apple of course if they wanted to go down the optional FP32 pipe they could also make the FP16 pipes optionally FP32 (although that would make them bigger). But which is better to augment might depend on which gets encountered in code more often alongside FP32, INT32 or FP16. You'd want to do whichever is least.

diamond.g said:
I could have sworn Chips and Cheese had an article about AMD pseudo dual issue where the compiler only did it in certain circumstances. It made me wonder if both them and Nvidia are having compiler issues.

As I said above, in addition to my earlier reply there is another more obvious reason I forgot to mention why Nvidia fails to get anywhere close to double the FP32 throughput ...

diamond.g · Feb 4, 2025

crazy dave said:
Thanks!

I like the data path argument, especially as it generalizes to the Apple situation with FP32 and Int32. I'm not as sold on the core size argument as while it is true that Apple lacks tensor cores altogether, they double the number of pipes per core that Nvidia has and Nvidia doesn't even have separate FP16 pipes. BTW found this for a description of modern Nvidia pipes and how they operate:

I need help understanding how concurrency of CUDA Cores and Tensor Cores works between Turing and Ampere/Ada?

If you are trying understand the microarchitecture of the SM the first step is to never use the term CUDA core or Tensor core as these are marketing terms that help provide a magnitude of performance. At a high level each SM consists of the following hierarchy (this can change slightly between...

forums.developer.nvidia.com

Very cool. The discussion about where the FP16 work goes (earlier in the thread), the tensor cores or the FP32 pipes is interesting and it appears there may be 3 FP32 pipes but I'm pretty sure some of these are logical pipelines, not all are physical.

Here's another resource:

Hardware Counters in GPU Captures

Contents Supported Hardware Intro: Hardware Counters in GPU Captures Event List Counters High Frequency Counters New: Occupancy Graphs Old: Occupancy Graph AMD Counters Intel Occupancy Graphs NVIDIA Counter Acronyms/Terms Supported Hardware Timing Analysis Event List Counters High Frequency...

devblogs.microsoft.com

Plus we don't know how big Apple's ray tracing cores are - in some ways while I show they can keep up with Nvidia's in a per TFLOPS basis, that means individually they have to coordinate with a full 128 FP32 units per core - while for Nvidia it's an up-to-128 FP32 units per core but as we've been discussing Nvidia never gets close to that it's 64+x where x << 64. So Apple's ray tracing cores might be more efficient from that perspective. But we don't know how big they are. So it's less clear to me how all that might impact core size - too many variables.

Plus there is another reason I mention below for why Nvidia fails to get anywhere close to double the FP32 throughput and why Apple might not either simply if they were to double up their FP32 cores, because it's intrinsic to the design: Occasionally one does have to do INT32 work! Make no mistake I don't think that explains everything and I like the data path argument, especially as it applies to Apple, but if Nvidia's double throughput methodology relies on using their INT32 pipes for FP work, that means anytime there is actual INT32 work to be done, which is not uncommon and why there are separate INT32 pipes to begin with!, then they can't take advantage of the double throughput.

To test this we'd need to run your utilization code (ported to CUDA ofc) on an Ampere or later GPU and test not just pure FP, INT, FP & FP, FP & INT, and FP & FP & INT work but also FP & FP & occasional INT and see how much INT work is needed to drag performance down to 30% and if that's reasonable.

With Apple of course if they wanted to go down the optional FP32 pipe they could also make the FP16 pipes optionally FP32 (although that would make them bigger). But which is better to augment might depend on which gets encountered in code more often alongside FP32, INT32 or FP16. You'd want to do whichever is least.

As I said above, in addition to my earlier reply there is another more obvious reason I forgot to mention why Nvidia fails to get anywhere close to double the FP32 throughput ...

Have you checked out their arch paper? It still kind of reads like a marketing slide, but they do talk about some higher level things.

https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf

Note that the number of possible INT32 integer operations in Blackwell are doubled compared to
Ada, by fully unifying them with FP32 cores, as depicted in Figure 6 below. However, the unified
cores can only operate as either FP32 or INT32 cores in any given clock cycle. Figure 6 below
shows how the SM architecture evolved between Ada and Blackwell.

So they still cannot do two different ops (well fp32 and int32) in the same cycle, which is something Apple is able to do?

leman · Feb 4, 2025

crazy dave said:
I like the data path argument, especially as it generalizes to the Apple situation with FP32 and Int32. I'm not as sold on the core size argument as while it is true that Apple lacks tensor cores altogether, they double the number of pipes per core that Nvidia has and Nvidia doesn't even have separate FP16 pipes.

That is true, Apple does seem to have more ALUs, although it is difficult to quantify due to hardware differences. Nvidia also has FP64 etc… You might be right that having dedicated Int and FP16 pipes might take more area.

diamond.g said:
So they still cannot do two different ops (well fp32 and int32) in the same cycle, which is something Apple is able to do?

It depends on how you count

Nvidia can execute FP and Int concurrently on the same partition (via separate execution units), but they can only schedule a single instruction per cycle. Their solution is quite neat: each instruction takes two cycles to dispatch (lower and upper half due to hardware being 16-wide and wave being 32-wide), so they start dispatching a different instruction every cycle. This looks like scheduling two instructions per cycle if you squint hard enough.

Apple hardware (and Intel I think) can instead schedule more than one instruction per cycle on the same partition.

All of this can be difficult to wrap one’s head around because of obfuscation. An important point is that these instructions come from different threads. We are not talking about super scalar execution like on CPU (where instructions from the same thread can run in parallel if they don’t have data dependencies). GPUs overlap execution across multiple threads. This is what makes the architecture feasible.

crazy dave · Feb 4, 2025

leman said:
That is true, Apple does seem to have more ALUs, although it is difficult to quantify due to hardware differences. Nvidia also has FP64 etc… You might be right that having dedicated Int and FP16 pipes might take more area.

I was wondering in your copious spare time if you'd use your program to test FP/INT64 throughput on the Apple Silicon? I'm just curious what that's like relative Nvidia/AMD. As far as I know, no one has tested that for Apple Silicon.

leman said:
It depends on how you count Nvidia can execute FP and Int concurrently on the same partition (via separate execution units), but they can only schedule a single instruction per cycle. Their solution is quite neat: each instruction takes two cycles to dispatch (lower and upper half due to hardware being 16-wide and wave being 32-wide), so they start dispatching a different instruction every cycle. This looks like scheduling two instructions per cycle if you squint hard enough.

Apple hardware (and Intel I think) can instead schedule more than one instruction per cycle on the same partition.

All of this can be difficult to wrap one’s head around because of obfuscation. An important point is that these instructions come from different threads. We are not talking about super scalar execution like on CPU (where instructions from the same thread can run in parallel if they don’t have data dependencies). GPUs overlap execution across multiple threads. This is what makes the architecture feasible.

Is the Nvidia one superscalar though? Because the execution of each half has to be from the same wave, right? So the two FP32 or one FP and one Int has to be from the same thread?

leman · Feb 4, 2025

crazy dave said:
I was wondering in your copious spare time if you'd use your program to test FP/INT64 throughput on the Apple Silicon? I'm just curious what that's like relative Nvidia/AMD. As far as I know, no one has tested that for Apple Silicon.

There is no FP64 on Apple GPUs. I looked at Int64 a while ago, can't find the results, it was quite a bit slower than Int32 as far as I remember.

crazy dave said:
Is the Nvidia one superscalar though? Because the execution of each half has to be from the same wave, right? So the two FP32 or one FP and one Int has to be from the same thread?

Each half is the same wave of course, but the different instructions come from different waves. So execution of two waves overlap. More importantly, I don't consider it superscalar because the next instruction from a wave cannot be scheduled before the previous instruction has completed. Here an example timeline:

Code:

           pipe0     pipe1
cycle 0:  wave0.lo     -         # schedule from wave0
cycle 1:  wave0.hi   wave1.lo    # schedule from wave1
cycle 2:  wave2.lo   wave1.hi    # schedule from wave2
cycle 3:  wave2.hi   wave3.lo    # schedule from wave3

crazy dave · Feb 4, 2025

leman said:
There is no FP64 on Apple GPUs. I looked at Int64 a while ago, can't find the results, it was quite a bit slower than Int32 as far as I remember.

Wait ... they can't do FP64 at all? I know Int64 can be done using two Int32s ... I guess the same isn't true for FP64, you need actual FP64 units?

EDIT: eoof there are techniques but I guess pretty slow:

Emulating FP64 with 2 FP32 on a GPU

If one were to emulate double precision floating point with two single precision floating points what would the performance be like, and can it be done well? Currently Nvidia is charging quite a p...

stackoverflow.com

leman said:
Each half is the same wave of course, but the different instructions come from different waves. So execution of two waves overlap. More importantly, I don't consider it superscalar because the next instruction from a wave cannot be scheduled before the previous instruction has completed. Here an example timeline:

Code:

pipe0 pipe1 cycle 0: wave0.lo - # schedule from wave0 cycle 1: wave0.hi wave1.lo # schedule from wave1 cycle 2: wave2.lo wave1.hi # schedule from wave2 cycle 3: wave2.hi wave3.lo # schedule from wave3

leman · Feb 4, 2025

crazy dave said:
Wait ... they can't do FP64 at all? I know Int64 can be done using two Int32s ... I guess the same isn't true for FP64, you need actual FP64 units?

Yep, no FP64 at all. To be honest, I am with Apple on this one. Dedicated FP64 needs a lot of circuitry, so GPUs have been neglecting it anyway. Software emulation is not that much slower than having just a few hardware units, so why bother?

If a GPU is interested in supporting wider scientific calculations, I would prefer to have some tailored integer instructions to make FP emulation faster. Like custom shift instructions to help with mantissa alignment, for example.

crazy dave said:
EDIT: eoof there are techniques but I guess pretty slow:

Emulating FP64 with 2 FP32 on a GPU

If one were to emulate double precision floating point with two single precision floating points what would the performance be like, and can it be done well? Currently Nvidia is charging quite a p...

stackoverflow.com

There are different methods, but yeah, you need a lot of instructions to achieve extended range. If all you need is extended precision (and not the range), this is a nice technique: https://link.springer.com/article/10.1007/PL00009321

diamond.g · Feb 4, 2025

leman said:
There is no FP64 on Apple GPUs. I looked at Int64 a while ago, can't find the results, it was quite a bit slower than Int32 as far as I remember.

Each half is the same wave of course, but the different instructions come from different waves. So execution of two waves overlap. More importantly, I don't consider it superscalar because the next instruction from a wave cannot be scheduled before the previous instruction has completed. Here an example timeline:

Code:

pipe0 pipe1 cycle 0: wave0.lo - # schedule from wave0 cycle 1: wave0.hi wave1.lo # schedule from wave1 cycle 2: wave2.lo wave1.hi # schedule from wave2 cycle 3: wave2.hi wave3.lo # schedule from wave3

Any ideas why RDNA is so crappy if it can do a whole wave32 in 1 cycle (AMD docs says that wave64 takes 2 cycles with some exceptions if I read it correctly). does it come down to utilization?

leman · Feb 4, 2025

diamond.g said:
Any ideas why RDNA is so crappy if it can do a whole wave32 in 1 cycle (AMD docs says that wave64 takes 2 cycles with some exceptions if I read it correctly). does it come down to utilization?

I think this again boils down to misleading marketing. AMD has advertised Navi3 as doubling FP32 compute. But this only applies to a very narrow set of codes. So the measurable improvement looks very disappointing compared to what has been advertised. But when you think of the 7900 XTX as a 30 TFLOPs part (and not 60 TFLOPs, as advertised), then actually performs as well as you'd expect.

OptimusGrime · Feb 4, 2025

leman said:
Yep, no FP64 at all. To be honest, I am with Apple on this one. Dedicated FP64 needs a lot of circuitry, so GPUs have been neglecting it anyway. Software emulation is not that much slower than having just a few hardware units, so why bother?

I have no idea how popular fp64 is, so perhaps it isn’t worth it, but I had thought showing a definitive advantage vs Nvidia (4090 = 1.3 tflops of fp64) might be a way to gain traction. Having said that, the M4 Max gets .85 tflops of fp64 on its cpu/amx so maybe that’s a viable route for them.

leman · Feb 4, 2025

OptimusGrime said:
I have no idea how popular fp64 is, so perhaps it isn’t worth it, but I had thought showing a definitive advantage vs Nvidia (4090 = 1.3 tflops of fp64) might be a way to gain traction. Having said that, the M4 Max gets .85 tflops of fp64 on its cpu/amx so maybe that’s a viable route for them.

You seem to be referring to matrix multiplication? What makes you focus on this particular niche? I've never heard of anyone using FP64 for either training or inference in deep learning models.

OptimusGrime · Feb 4, 2025

leman said:
You seem to be referring to matrix multiplication? What makes you focus on this particular niche? I've never heard of anyone using FP64 for either training or inference in deep learning models.

Lol. I have no idea. I just saw the figure of .85 tflops when the M4 came out and had heard that fp64 was useful for scientific computing/HPC. I had no idea it wasn’t comparable to your discussion. Apologies.

name99 · Feb 4, 2025

crazy dave said:
With respect to the second point, @leman and I were discussing this possibility and I was quite in favor of it but I have to admit, I've cooled somewhat based on this analysis. So from that perspective it makes sense for Apple to adopt dual FP32 pipes per core. Dual-issue FP32 per core increases performance and doesn't cost much if any silicon die area and Apple already has a dual-issue design, it just doesn't have two FP32 pipes. Intel interestingly as a triple issue design but also doesn't double the FP32 pipes, maybe they will later but it is noteworthy that neither Apple nor Intel doubled FP32 pipes despite having the ability to issue multiple instructions per clock. As ChipsandCheese says, it's possible occupancy and other architectural issues become more complicated. AMD certainly suffers here as well. I'm not going to provide full analysis here but the 890M has a dual issue design, without dual issue it's roughly a 6 TFLOPS design that can when dual issue can be 12 TFLOPS but is very temperamental. More here on the 890M in particular. But in performance metrics the 890M can struggle versus the base M4, even in non-native games like CP2077 (coming soon to the Mac) and often has the power profile of the M4 Pro. Similar the Strix Halo looks to be roughly the M4 Pro level (maybe a bit better) despite having 40 cores and presumably large amounts of theoretical TFLOPS. Lastly, despite having the same TFLOPS rating, the 890M is also seemingly not as performant in many of these benchmarks and games as the older 6500XT, which was RDNA 2.

Now as aforementioned, Apple has a different dual issue design. If they adopt double FP32-pipes per core, they may not suffer from whatever problems is holding Nvidia/AMD back, they might get better performance increases per additional TFLOPS. They may not get the full double, but they might be better. However, one of the key drivers of this dual-issue design is to increase the TFLOPS without impacting die size. But that doesn't necessarily stop power increases and given the above, the power increases may be bigger than the performance increases! Apple, not having to sell its GPUs as a separate part to OEMs and users, may decide that if they need to increase TFLOPS, they'll just increase the number of cores and eat the silicon costs.

If I understand the Apple GPU design correctly (and it's definitely possible that I don't!) Apple
- already gets the equivalent of some degree of superscalar from the clause-based execution (which can allow load/store instructions, datapath instructions, and special function unit instructions, though from different threads, to execute in parallel)

- to make the datapath superscalar would require the register cache to be able to supply more traffic per cycle which may be tough... The superscalarity they added (with what, the M3 was it?) allowed, I think I'm getting this correct, one FP32 or two FP16s per cycle, quite possibly with some constraints, so you could fit that into the existing register cache framework.
Now it doesn't have to be hopeless. For example it's possible, who knows, that the actual access into the register cache is a register-like hardware not SRAM-like hardware, and is substantially faster than the clock cycle. So you could double pump the access *into* the register file, and now "all" you need is twice the number of wires moving the registers into the latches before execution?
Or you could define the FP32 execution as SIMD-like (ie only operates on registers rN and rN+1) so that these are always stored adjacently in the register cache and now it's just a single wide access to the cache (and of course, still double the number of transport wires).

I don't think we're in a position yet to say whether or no it's a good idea. The M3 GPU mods were so wide ranging, and you have to limit what you do in any one re-design. Something like FP32 may well, IMHO, be on the list, but postponed because it's not QUITE as simple as just add more FP32 hardware; and they'll get to it when other higher priority work is done.

PROBABLY, in terms of thinking about the M5/A19 GPU the more important question is: how much does Apple care about FP32 vs FP16 when executing AI LLMs...
Obviously you want to execute as much of the LLM as possible on the ANE. But just as obviously you want the iPhones and Macs to make a decent showing over the next (very unpredictable!) few years. Given the uncertainly, it's PROBABLY the most flexible bet to do whatever you can to bump GPU FP16 (and BF16, and FP/BF16 matrix accumulate to FP32?) performance while leaving FP32 alone for now?
OTOH the above still requires wider register cache support...
On the third hand, is FP32 training important? If so, do we make the central project doubling register cache support, which we translate into both doubled FP32 (try to make a nice training workstation?) AND doubled FP16 (for flexible inference given we have no idea exactly what models will look like in three years)?

name99 · Feb 4, 2025

crazy dave said:
Plus we don't know how big Apple's ray tracing cores are - in some ways while I show they can keep up with Nvidia's in a per TFLOPS basis, that means individually they have to coordinate with a full 128 FP32 units per core - while for Nvidia it's an up-to-128 FP32 units per core but as we've been discussing Nvidia never gets close to that it's 64+x where x << 64. So Apple's ray tracing cores might be more efficient from that perspective. But we don't know how big they are. So it's less clear to me how all that might impact core size - too many variables.

There is a die shot here:

https://twitter.com/x/status/1886476438603215338

If we believe it..., then the RT unit of an Apple GPU core looks like about the size of 3 of the 4 datapath units (including their register caches).
Surprisingly large, IMHO, which is one of the reasons I keep hoping/suggesting that the goal is to make it a generic "memory transform engine" capable of doing all manner of lightweight memory manipulation (eg searches of complex graphs and similar data structures) beyond just eye candy for games.

name99 · Feb 4, 2025

name99 said:
If I understand the Apple GPU design correctly (and it's definitely possible that I don't!) Apple
- already gets the equivalent of some degree of superscalar from the clause-based execution (which can allow load/store instructions, datapath instructions, and special function unit instructions, though from different threads, to execute in parallel)

- to make the datapath superscalar would require the register cache to be able to supply more traffic per cycle which may be tough... The superscalarity they added (with what, the M3 was it?) allowed, I think I'm getting this correct, one FP32 or two FP16s per cycle, quite possibly with some constraints, so you could fit that into the existing register cache framework.
Now it doesn't have to be hopeless. For example it's possible, who knows, that the actual access into the register cache is a register-like hardware not SRAM-like hardware, and is substantially faster than the clock cycle. So you could double pump the access *into* the register file, and now "all" you need is twice the number of wires moving the registers into the latches before execution?
Or you could define the FP32 execution as SIMD-like (ie only operates on registers rN and rN+1) so that these are always stored adjacently in the register cache and now it's just a single wide access to the cache (and of course, still double the number of transport wires).

I don't think we're in a position yet to say whether or no it's a good idea. The M3 GPU mods were so wide ranging, and you have to limit what you do in any one re-design. Something like FP32 may well, IMHO, be on the list, but postponed because it's not QUITE as simple as just add more FP32 hardware; and they'll get to it when other higher priority work is done.

PROBABLY, in terms of thinking about the M5/A19 GPU the more important question is: how much does Apple care about FP32 vs FP16 when executing AI LLMs...
Obviously you want to execute as much of the LLM as possible on the ANE. But just as obviously you want the iPhones and Macs to make a decent showing over the next (very unpredictable!) few years. Given the uncertainly, it's PROBABLY the most flexible bet to do whatever you can to bump GPU FP16 (and BF16, and FP/BF16 matrix accumulate to FP32?) performance while leaving FP32 alone for now?
OTOH the above still requires wider register cache support...
On the third hand, is FP32 training important? If so, do we make the central project doubling register cache support, which we translate into both doubled FP32 (try to make a nice training workstation?) AND doubled FP16 (for flexible inference given we have no idea exactly what models will look like in three years)?

Oh I forgot. The other thing you might do to buy some degree of improved superscalar performance at low area is support for uniform execution. Right now Apple supports uniforms as a concept and as a register type but (as far as I can tell) uniform execution is splatted across all 32 lanes, even if the operands are all uniforms.

AMD seems to feel it's worth maintaining a separate lane for scalar execution, and if Apple were to provide a lane (a "33rd lane") plus hardware for uniform execution that's an addition of ~3% hardware. If uniforms operations make up more than 3% of the instruction stream that's a win. Do they? No idea... Anyone have a good guess or intuition?

name99 · Feb 4, 2025

crazy dave said:
Wait ... they can't do FP64 at all? I know Int64 can be done using two Int32s ... I guess the same isn't true for FP64, you need actual FP64 units?

EDIT: eoof there are techniques but I guess pretty slow:

Emulating FP64 with 2 FP32 on a GPU

If one were to emulate double precision floating point with two single precision floating points what would the performance be like, and can it be done well? Currently Nvidia is charging quite a p...

stackoverflow.com

Apple's solution to if you need to perform lots of FP64 is to use AMX/SME.
It's not actually a bad solution when you run the numbers... Not as flexible as a GPU (let alone scalar code on a CPU) but a good match to most of the practical use cases of massive FP64 throughput.

crazy dave · Feb 4, 2025

leman said:
Yep, no FP64 at all. To be honest, I am with Apple on this one. Dedicated FP64 needs a lot of circuitry, so GPUs have been neglecting it anyway. Software emulation is not that much slower than having just a few hardware units, so why bother?

If a GPU is interested in supporting wider scientific calculations, I would prefer to have some tailored integer instructions to make FP emulation faster. Like custom shift instructions to help with mantissa alignment, for example.

There are different methods, but yeah, you need a lot of instructions to achieve extended range. If all you need is extended precision (and not the range), this is a nice technique: https://link.springer.com/article/10.1007/PL00009321

You know ... I have a sneaking suspicion that you've told me all this before. Let's not look that up to confirm 😣 But yes I can see that 1/64th rate FP64 ain't great either.

diamond.g said:
Any ideas why RDNA is so crappy if it can do a whole wave32 in 1 cycle (AMD docs says that wave64 takes 2 cycles with some exceptions if I read it correctly). does it come down to utilization?

leman said:
I think this again boils down to misleading marketing. AMD has advertised Navi3 as doubling FP32 compute. But this only applies to a very narrow set of codes. So the measurable improvement looks very disappointing compared to what has been advertised. But when you think of the 7900 XTX as a 30 TFLOPs part (and not 60 TFLOPs, as advertised), then actually performs as well as you'd expect.

I don't know ... maybe I need to dig in further but at least the 890M seems to be even worse than their base TFLOPS suggest ...

AMD Radeon 890M Specs

AMD Strix Point, 2900 MHz, 1024 Cores, 64 TMUs, 32 ROPs

www.techpowerup.com

AMD Radeon RX 6500 XT Specs

AMD Navi 24, 2815 MHz, 1024 Cores, 64 TMUs, 32 ROPs, 4096 MB GDDR6, 2248 MHz, 64 bit

www.techpowerup.com

Techpowerup quotes the 890M base TFLOPS, not its double rate. So if we ignore the potential for it double throughput it should be comparable to the 6500XT if not slightly better especially since the 6500XT is a pretty old processor with not great bandwidth or ancillary stats and yet performance-wise the 890M only occasionally acts like it's the newer better GPU ... or bigger than the M4:

https://www.notebookcheck.net/AMD-Radeon-890M-Benchmarks-and-Specs.843536.0.html

vs.

https://www.notebookcheck.net/AMD-Radeon-RX-6500-XT-GPU-Benchmarks-and-Specs.665752.0.html

Apple M4 10-core GPU - Benchmarks and Specs

The 10-core M4 GPU is a graphics adapter built into the 9-core and 10-core Apple M4 processors that has direct access to fast on-package LPDDR5x-7500 RAM (120 GB/s bandwidth) as well as hardware support for ray tracing and mesh shading and other modern technologies. It appears its real-world...

www.notebookcheck.net

Often the 6500XT beats it (not always) and I feel like the M4 is competitive when it shouldn't be, even on nan-native games like CP2077 or TW3. Maybe the 7900XTX and other dGPU parts are better?

name99 said:
Oh I forgot. The other thing you might do to buy some degree of improved superscalar performance at low area is support for uniform execution. Right now Apple supports uniforms as a concept and as a register type but (as far as I can tell) uniform execution is splatted across all 32 lanes, even if the operands are all uniforms.

AMD seems to feel it's worth maintaining a separate lane for scalar execution, and if Apple were to provide a lane (a "33rd lane") plus hardware for uniform execution that's an addition of ~3% hardware. If uniforms operations make up more than 3% of the instruction stream that's a win. Do they? No idea... Anyone have a good guess or intuition?

If I understand right, uniform is basically an INT that has been made into a special kind of float for doing indices and loops so as to not interrupt a stream of FP32 with an INT, is that right?

name99 said:
Apple's solution to if you need to perform lots of FP64 is to use AMX/SME.
It's not actually a bad solution when you run the numbers... Not as flexible as a GPU (let alone scalar code on a CPU) but a good match to most of the practical use cases of massive FP64 throughput.

Aye, for myself, it was more occasional FP64 mixed with GPU FP32 ... then again if I were to ever port my stuff over to Metal I would have a lot more access to different parts of the SOC as opposed to just being stuck on the GPU. That ... might actually be useful though I couldn't say for certain for this particular use case.

leman · Feb 4, 2025

TLuc said:
If I understand the Apple GPU design correctly (and it's definitely possible that I don't!) Apple
- already gets the equivalent of some degree of superscalar from the clause-based execution (which can allow load/store instructions, datapath instructions, and special function unit instructions, though from different threads, to execute in parallel)

Are you sure they still use clauses? I only heard about them as an older IMG concept. I’ve been looking at the ISA docs for newer GPU (from the Asahi project), and I can’t find anything like clauses in them.

TLuc said:
- to make the datapath superscalar would require the register cache to be able to supply more traffic per cycle which may be tough... The superscalarity they added (with what, the M3 was it?) allowed, I think I'm getting this correct, one FP32 or two FP16s per cycle

It’s two instructions issued per cycle, to any two pipes. So you can do FP32+Int, FP32+FP16, or FP16+Int. Not quite sure where special operations are, they seem to be dispatched on the Int pipe if I understand it correctly. An interesting limitation is that FP32+Int has a 25% performance penalty. It seems that the register cache can only supply 48bits per thread per cycle.

TLuc said:
Or you could define the FP32 execution as SIMD-like (ie only operates on registers rN and rN+1) so that these are always stored adjacently in the register cache and now it's just a single wide access to the cache (and of course, still double the number of transport wires).

Yeah, this has been tried and so far without much success. It’s really annoying if you SIMD width changes depending on the type of data. You need to do extra work to make the programming model behave sanely again, and your hardware gets more complex too. This also has non-trivial performance implications for complex shaders, e.g. when you need ballot and shuffle functionality.

name99 said:
Oh I forgot. The other thing you might do to buy some degree of improved superscalar performance at low area is support for uniform execution. Right now Apple supports uniforms as a concept and as a register type but (as far as I can tell) uniform execution is splatted across all 32 lanes, even if the operands are all uniforms.

AMD seems to feel it's worth maintaining a separate lane for scalar execution, and if Apple were to provide a lane (a "33rd lane") plus hardware for uniform execution that's an addition of ~3% hardware. If uniforms operations make up more than 3% of the instruction stream that's a win. Do they? No idea... Anyone have a good guess or intuition?

I’m curious about this too. Could be a historical thing? Probably the biggest cost of foregoing scalar operations is that you are sacrificing precious register space.

P.S. “Uniform” registers on Apple are more like constant storage. They put stuff like base pointers or parameters there, things that don’t need to change. Whatever you declare in MSL as “constant” storage space goes in there.

crazy dave · Feb 4, 2025

name99 said:
There is a die shot here:

https://twitter.com/x/status/1886476438603215338

If we believe it..., then the RT unit of an Apple GPU core looks like about the size of 3 of the 4 datapath units (including their register caches).
Surprisingly large, IMHO, which is one of the reasons I keep hoping/suggesting that the goal is to make it a generic "memory transform engine" capable of doing all manner of lightweight memory manipulation (eg searches of complex graphs and similar data structures) beyond just eye candy for games.

Link doesn't work for me. But that does indeed sound quite large.

I found this die shot of the M4 (near of the bottom of the article), but the individual pieces of the shader core aren't annotated and I couldn't tell what is what.

Qualcomm Snapdragon X die shot reveals massive CPU cores with huge caches

GPU is quite modest though.

www.tomshardware.com

name99 · Feb 5, 2025

crazy dave said:
Link doesn't work for me. But that does indeed sound quite large.

I found this die shot of the M4 (near of the bottom of the article), but the individual pieces of the shader core aren't annotated and I couldn't tell what is what.

Qualcomm Snapdragon X die shot reveals massive CPU cores with huge caches

GPU is quite modest though.

www.tomshardware.com

name99 · Feb 5, 2025

crazy dave said:
If I understand right, uniform is basically an INT that has been made into a special kind of float for doing indices and loops so as to not interrupt a stream of FP32 with an INT, is that right?

That's interesting. I assumed uniform was basically a synonym for "scalar" data, but the use case you describe does suggest that a "uniform execution unit" would not be that valuable.
On the other hand, the situation where one lane of one thread of a threadblock has to perform some sort of scalar cleanup code at the end of the vector task is common enough that Apple have HW support to accelerate this...
So I suspect that overall there is enough scattered scalar work that providing a "33rd lane" to execute superscalar with the other work may be a win.

As I said, info from AMD experience would be nice to hear!

name99 · Feb 5, 2025

leman said:
Are you sure they still use clauses? I only heard about them as an older IMG concept. I’ve been looking at the ISA docs for newer GPU (from the Asahi project), and I can’t find anything like clauses in them.

I talk a lot about this in my GPU volume. There's a substantial patent trail, so presumably they used them initially. And the way they used them is so valuable that I can't see them giving up the idea. Even so, I'm not sure -- that whole section of the GPU volume is very tentative as you'll see if you read it!

It's possible that Asahi doesn't talk about it because they don't "see" clauses; they are not a shading languge level construct. I can find very little of how clauses were used by AMD, and nothing about how they were used by IMG.
So it's possible that, while Apple use the same term, they mean something rather different, much more "implementation" but with essentially no user-level visibility, except that you'd get lousy performance if you aggressively interleaved instructions of one type (eg load/store) with those of another type (eg datapath). You presumably could only do that by bypassing the Metal compiler and writing direct assembly, and is that even possible?

crazy dave · Feb 5, 2025

name99 said:
View attachment 2479569

Oh wow, you weren't kidding. If that's accurate, that's huge! Unfortunately for comparison I only ever found was an estimate for the original Nvidia Turing RT cores and not only have things changed substantially since then but the poster based his calculations on non-annotated die shots and I'm not exactly capable of double checking.

https://www.reddit.com/r/nvidia/comments/baaqb0/rtx_adds_195mm2_per_tpc_tensors_125_rt_07

People have posted die shots of later Nvidia GPUs but I couldn't find one where the RT/Tensor cores were annotated. Someone more knowledgeable about things than me might be able to suss that out. Regardless by comparing Turing GPU cores without RT (TU116 used in the 1650/1660) with Turing GPU cores with RT (TU106 used in the 2060) both on the same 12nm node, this poster estimated that the Turing RT core was quite a small percentage of the core size - 0.7mm2 out of 1.95mm2 for the RT features (1.25mm2 for the tensor core) and with the whole core complex being even bigger, like >10mm^2. Though how this poster differentiated between tensor core and RT core is unclear - the TU116 didn't have either as far as I know and that's how they said they did that but that's also how they did it for the whole tensor/RT core addition. So I'm a little confused there. Anyway, again, later RT cores are very different performance wise from Turing so even if accurate that may have changed dramatically. Also, even though Apple RT cores don't appear to drag relative performance up or down compared to Nvidia GPUs, Apple does make do with fewer of them so them being bigger would not be a surprise. That said, I agree that it would be super interesting to see if they had any other applications given their die area.

name99 said:
That's interesting. I assumed uniform was basically a synonym for "scalar" data, but the use case you describe does suggest that a "uniform execution unit" would not be that valuable.
On the other hand, the situation where one lane of one thread of a threadblock has to perform some sort of scalar cleanup code at the end of the vector task is common enough that Apple have HW support to accelerate this...
So I suspect that overall there is enough scattered scalar work that providing a "33rd lane" to execute superscalar with the other work may be a win.

As I said, info from AMD experience would be nice to hear!

I think I've also seen it describe that way! Truthfully until a couple of days ago when I was looking into the details of Nvidia execution units, I hadn't even heard of this before. Here's an arXiv paper on Nvidia's use of the "uniform register" which may be different from AMD's (which I know is what you're actually after, but this is as helpful as I can be):

https://arxiv.org/pdf/1903.07486 (page 32)

leman said:
Yep, no FP64 at all. To be honest, I am with Apple on this one. Dedicated FP64 needs a lot of circuitry, so GPUs have been neglecting it anyway. Software emulation is not that much slower than having just a few hardware units, so why bother?

If a GPU is interested in supporting wider scientific calculations, I would prefer to have some tailored integer instructions to make FP emulation faster. Like custom shift instructions to help with mantissa alignment, for example.

There are different methods, but yeah, you need a lot of instructions to achieve extended range. If all you need is extended precision (and not the range), this is a nice technique: https://link.springer.com/article/10.1007/PL00009321

crazy dave said:
You know ... I have a sneaking suspicion that you've told me all this before. Let's not look that up to confirm 😣

Ugh ... it's even worse. Looking at the other forum, apparently I already knew it had to be emulated 2 years ago. Anyway, the original link I posted there was just someone talking about doing an FP64 emulation project, I found an (archived? I didn't know GitHub did that) link to the actual project:

GitHub - philipturner/metal-float64: Emulating double-precision arithmetic on Apple GPUs

Emulating double-precision arithmetic on Apple GPUs - philipturner/metal-float64

github.com

And yeah he claims to get the same performance ratio as on consumer Nvidia GPU for FP64 - with the caveat that it may require a function call rather than be inlined depending on which operation you want. But your paper on using pseudo-64bit if you just need the range is very, very interesting.

OptimusGrime · Feb 5, 2025

crazy dave said:
Ugh ... it's even worse. Looking at the other forum, apparently I already knew it had to be emulated 2 years ago. Anyway, the original link I posted there was just someone talking about doing an FP64 emulation project, I found an (archived? I didn't know GitHub did that) link to the actual project:

GitHub - philipturner/metal-float64: Emulating double-precision arithmetic on Apple GPUs

Emulating double-precision arithmetic on Apple GPUs - philipturner/metal-float64

github.com

And yeah he claims to get the same performance ratio as on consumer Nvidia GPU for FP64 - with the caveat that it may require a function call rather than be inlined depending on which operation you want. But your paper on using pseudo-64bit if you just need the range is very, very interesting.

So…I’m confused that Apple gets the same performance ratio for fp64 as Nvidia consumer cards without fp64 circuitry. Is Nvidia slowing down it’s fp64, is Nvidia lacking fp64 hardware and I misunderstood or…?

I’d be interested to know why we wouldn’t just use the AMX fp64 facility which is faster than most Nvidia cards, on the M4 Max at least?

3D Rendering on Apple Silicon, CPU&GPU

macrumors Core

macrumors 68000

macrumors G5

macrumors Core

macrumors 68000

macrumors Core

macrumors 68000

macrumors Core

macrumors G5

macrumors Core

macrumors regular

macrumors Core

macrumors regular

macrumors 68030

macrumors 68030

macrumors 68030

macrumors 68030

macrumors 68000

macrumors Core

macrumors 68000

macrumors 68030

macrumors 68030

macrumors 68030

macrumors 68000

macrumors regular

Our Staff