Why remove the comparison?


His completion times on an overclocked 4080 Super are slower than my stock 4080 Super at a 52% power limit (165W vs the default 320W), but I see he has OBS and other apps running, which may impact rendering time. Also, his relative speed bars are off.

Monster Under the Bed (mine 16.7 secs vs his 22 secs)


Lone Monk (mine 1 min 54 secs vs his 1 min 58 secs)
 
He didn't use OptiX for rendering. How is it even possible to be so ignorant and still work with 3D? So all his results are totally irrelevant.
 
M4 Max 3s faster than desktop 4070 Super in Blender Classroom with OptiX.

Since in Blender's aggregate score the Max matches the regular 4070, I'm assuming the 4070 Super pulls ahead in the other scenes? Is it pulling ahead in the smaller or larger scenes I wonder ... it's such a pity they only report the aggregate scores on Open Data. Also, I'd love to see the results for something like the Moana scene ...

That's the thing: Apple's RAM upgrades are expensive (though better now that the base is 16GB), but VRAM in the PC world, especially beyond 24GB, is much, much worse. This is where Apple's approach really shines. When it comes to VRAM/bandwidth (above 24GB), Apple is actually the cost-effective solution ... feels weird, man, to write that :). Hopefully that results in price pressure on Nvidia in particular to bring its workstation GPU prices down or increase the max RAM on its consumer GPUs, but that may be unlikely given Nvidia's overall market dominance.

==============================

The die area devoted to the GPU is sometimes why I've seen people, only half-jokingly, refer to the Max and Ultra as a GPU with an integrated CPU ;).

The M3 Max has about 92 billion transistors and the GPU is roughly 36% of that, for about 33 billion transistors. The M4 Max is likely bigger and thus the GPU, in terms of raw transistor count, is likely quite a bit bigger than the 3090 (of course how you use those transistors is even more important).
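To put rough numbers on that (a quick sketch: the 36% GPU share is the estimate above, and the GA102/3090 transistor count is the commonly reported figure):

```python
# Rough comparison of GPU transistor budgets. The 36% GPU share is the
# estimate from the post above; ~28.3B for GA102 (the 3090 die) is the
# commonly reported figure. Illustrative only.
m3_max_total = 92e9
gpu_share = 0.36
m3_max_gpu = m3_max_total * gpu_share
ga102 = 28.3e9

print(f"M3 Max GPU portion: ~{m3_max_gpu / 1e9:.0f}B transistors")    # ~33B
print(f"GA102 (RTX 3090):   ~{ga102 / 1e9:.1f}B transistors (whole die)")
print(f"ratio: ~{m3_max_gpu / ga102:.2f}x")                           # ~1.17x
```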


Don't get me wrong, it's still very impressive! The 3090 packs more SMs at similar/slightly higher clocks (and thus has more raw TFLOPs).

One thing I didn't make clear in my original post: even if you compare Apple with, say, the desktop 4070, whose stats are more comparable to the M3/M4 Max since it's on TSMC 4N rather than Samsung 8 ... why is Nvidia seemingly able to fit more shader cores/SMs per transistor than Apple, and how does that relate to TFLOPs and so forth? Nvidia fits 46 SMs in 36 billion transistors (which also includes IO and tensor cores - the first of which I didn't include in the above for the M3 Max, and the second Apple doesn't have ... oh, and I didn't include the transistors for the huge 48MB SLC cache, which the Apple GPU takes advantage of but which doesn't have an analog on the Nvidia GPU) compared to 40 cores in 33 billion for the M3 Max (and presumably more transistors for the M4 Max).

I'm glad you asked! :)

So Nvidia in the last few generations reports 128 FP32 shader cores per SM, but, unlike Apple, Nvidia actually has 2 FP32 pipes per individual compute subunit, one of which can double as an INT32 pipe. Apple does not. So when Apple reports 128 FP32 shader units per core, that comes with 128 of each of the other pipes as well (integer, FP16, complex, etc ...). This means Apple's cores are going to be bigger than Nvidia's SMs, even though SMs also include 4 tensor cores. Further, because of the way the pipes work, Nvidia GPUs are not always going to be able to take advantage of the extra FP32 pipe - for example if they have integer work to do, or not enough parallel FP32 work within the subunit.


Side note: The AMD chips are, as far as I can tell, worse in this respect: RDNA 3 reports "TFLOPs" that are only achievable if there is instruction level parallelism (like on a CPU thread). Compare calculating the TFLOPs per shader core for the linked RDNA 3 GPU vs the 4070 above.
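To make that concrete, here's the peak-FP32 arithmetic (a sketch: clocks are approximate boost clocks, and since the linked RDNA 3 card isn't visible here I'm using the 7900 XTX as a stand-in):

```python
# Peak-FP32 arithmetic illustrating "TFLOPs per shader core". FMA counts as
# 2 FLOPs per ALU per clock. Clocks and core counts are approximate public
# figures, so treat the outputs as ballpark numbers only.

def tflops(shaders, clock_ghz, flops_per_shader_per_clock=2):
    return shaders * clock_ghz * flops_per_shader_per_clock / 1e3

print(f"RTX 4070:       {tflops(5888, 2.48):.1f} TFLOPs")  # 46 SMs x 128, ~2.48 GHz boost
print(f"M3 Max 40-core: {tflops(5120, 1.4):.1f} TFLOPs")   # 40 cores x 128, ~1.4 GHz (approx.)
print(f"7900 XTX:       {tflops(6144, 2.5):.1f} TFLOPs counting each ALU once")
print(f"7900 XTX dual:  {tflops(6144, 2.5, 4):.1f} TFLOPs as marketed "
      "(4 FLOPs/ALU/clock, only reachable when the compiler finds dual-issue ILP)")
```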

In addition, while Nvidia had much more L1/L2 cache than Apple's GPUs in the M2 generation, Apple dramatically overhauled its cache configuration in the M3 and I'm no longer as certain how much cache Apple has per core/GPU total. I suspect it has gone up by quite a bit.

Of course there is Apple's (ImgTech's) TBDR, which helps in rendering complex scenes where lots of objects occlude each other and speeds up the rasterization portion of the process. TBDR is not always a net win, but when it is, it is.
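For anyone who hasn't run into TBDR before, here's a toy model of the overdraw argument (purely illustrative - real hardware bins geometry per screen tile and handles blending, transparency, etc., none of which is modelled here):

```python
# Toy model of why deferred (TBDR-style) visibility resolution cuts fragment
# shading work on scenes with heavy occlusion.
import random

W, H, N_LAYERS = 64, 64, 20   # one "tile", 20 opaque full-screen layers

# Each layer covers every pixel with a random depth (smaller = nearer).
layers = [[[random.random() for _ in range(W)] for _ in range(H)]
          for _ in range(N_LAYERS)]

def immediate_mode_shades(layers):
    # Depth-test as primitives arrive and shade every fragment that passes at
    # that moment, so the amount of shading depends on submission order.
    shaded = 0
    zbuf = [[float("inf")] * W for _ in range(H)]
    for layer in layers:
        for y in range(H):
            for x in range(W):
                if layer[y][x] < zbuf[y][x]:
                    zbuf[y][x] = layer[y][x]
                    shaded += 1          # fragment shader runs
    return shaded

def tbdr_shades(layers):
    # Resolve visibility for the whole tile first, then shade exactly one
    # (front-most) fragment per covered pixel, regardless of depth order.
    return W * H

print(f"immediate mode shaded {immediate_mode_shades(layers)} fragments, "
      f"TBDR shaded {tbdr_shades(layers)}")
# With random submission order, immediate mode shades roughly W*H*ln(N_LAYERS)
# fragments (~3-4x more here); front-to-back sorting narrows the gap.
```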

Finally, as I mentioned above, Apple has much better bandwidth/VRAM per TFLOP than Nvidia typically offers on its consumer GPUs, while its workstation GPUs are extremely expensive. So when memory bandwidth and footprint are at a premium during a workload, Apple GPUs will also punch above their weight.
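Putting rough numbers on that, using approximate public specs (and counting unified memory as "VRAM" for the Apple part):

```python
# Rough GB/s and GB of memory per peak FP32 TFLOP; all figures approximate.

parts = {
    # name:             (TFLOPs, GB/s, max memory GB)
    "RTX 4070":         (29.0,   504,  12),
    "M3 Max 40-core":   (14.2,   400,  128),
}

for name, (tf, bw, mem) in parts.items():
    print(f"{name}: {bw / tf:.0f} GB/s and {mem / tf:.1f} GB per TFLOP")
# RTX 4070:       ~17 GB/s and ~0.4 GB per TFLOP
# M3 Max 40-core: ~28 GB/s and ~9.0 GB per TFLOP
```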

Given the above, I hope it's a little clearer how Apple competes with Nvidia in these Blender benchmarks with GPUs that seem smaller by SM/core count and run at slower clocks, but with similar if not greater transistor counts. Not every workload is going to be this favorable for Apple, of course. But these are probably the workstation/"prosumer" workloads Apple had in mind when designing the Max.
 
The really good thing here is that these days Apple is a useful platform for 3D work. Not the best, but clearly viable. Quite the difference from the bad old days when Nvidia was 10x the perf of an Apple solution. Just two and a half years ago, the 3090 in my Threadripper system was so much better than an M1 Max or a 2019 Mac Pro that the benchmarks weren't even meaningful to look at. Now a 4090 is just about twice as fast, and I imagine that the 5090 to M4 "Ultra" will have about the same relationship. Quite impressive how well the RT cores work on Apple compared to what AMD managed to produce. Also, let's not forget that for many tasks in 3D the CPU is still very important, so for sims and general data shuffling the complete Apple package packs quite a punch. Now, the thing that needs the most improvement is the software support for AI frameworks, but it seems Apple is making great progress there as well. Good times!
 
There's much more going on than just the above. eg
- Apple has distinct FP16 and FP32 pipes, nV does FP16 in the FP32 pipe. The Apple scheme uses more area but less power, since the FP16 pipe is optimized for only FP16.
- Apple (as of M3) has a much more abstract resource allocation scheme for registers and Threadblock Scratchpad, which takes additional area but allows much more flexibility (ie much more efficient use of the rest of the GPU).
- nV allows for some degree of superscalarity (ie the core executes more than one instruction per cycle), but most of the work to ensure this happens in the compiler. Apple gets to the same place, but much of the work to do this happens through the GPU core having multiple "mini-front-ends" feeding into a single set of execution pipelines. The nV scheme takes less area but is much less flexible.

Ultimately the Apple GPU is evolving to something like the Apple CPU, in the sense that Apple is willing to spend area for the sake of saving energy.
Much of the *actual* way this is done is by boosting the "IPC" of the GPU, making it spend a lot less time waiting for something to happen.
GPUs have obviously always operated by swapping in one warp when another warp cannot execute, but this was always something of an aspiration, with about 50% or so activity; Apple does everything they can to ramp it to 100%, from providing more registers (so more warps can be active) to a very weird instruction flow (which ensures that the core rarely has to wait for an instruction cache miss) to the mini-front-ends (which, among other things, ensure the system doesn't waste a cycle or two pulling in a register that's not yet in the operand cache).
You can always find microbenchmarks that push a particular GPU design to 100% activity, and choosing the right such benchmark may make nV (or Apple, or AMD) look good. But on REAL GPU code, with all its real messiness, we see the Apple "IPC" advantage, and the lower activity factors for other GPUs.
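To make the activity-factor framing concrete, here's the generic textbook latency-hiding model (the numbers are arbitrary, not Apple or Nvidia measurements):

```python
# A standard latency-hiding model: a scheduler can issue one instruction per
# cycle, but each warp can only issue a new dependent instruction every
# `dependent_latency_cycles`, so you need roughly that many eligible warps to
# keep the pipe full. Numbers below are arbitrary and purely illustrative.

def utilization(eligible_warps, dependent_latency_cycles):
    return min(1.0, eligible_warps / dependent_latency_cycles)

for warps in (2, 4, 8, 16):
    print(warps, f"{utilization(warps, dependent_latency_cycles=8):.0%}")
# 2 -> 25%, 4 -> 50%, 8 -> 100%, 16 -> 100%
# More registers / smarter allocation => more warps resident and eligible
# => higher "IPC", which is the knob being described here.
```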

There are similar techniques operating at higher levels, so that, for example, the traditional scheme might be something like one kernel executes, it finishes and resources are deallocated, then the next kernel starts. As of M4 Apple overlaps these stages so that as one kernel is approaching its end, the resources for the next kernel begin being allocated and pulled into the appropriate cache level. Then the new kernel starts executing, and simultaneously the resources of the previous kernel are now deallocated. So you're no longer wasting the 5% of time or so between kernels that was basically housekeeping, all that's overlapped with real work.

Each of these things seems minor, 1% here, 2% there, but keep at them over a few years, and you land up with a 50% or so "IPC" advantage (ie you get ~50% more performance out of what appear to be essentially the same count of GPU cores).
 
There's even more going on here than that. I had a longer post of my own in the pipeline detailing the differences between Nvidia and Apple GPUs, but since you covered some of the rest of that, I'll just add on the extra details at the end here.

In addition to the FP16 pipeline we both mentioned earlier, Apple also has a "complex" pipeline for doing sines, cosines, etc ... It has never been clear to me if Nvidia has the same, as they don't cover that explicitly in their white papers, but given its commonality across different GPU designs and the fact that "fast" approximations for those complex functions exist on Nvidia GPUs, I would assume they do. However, how many they have per core may differ from Apple's design: Apple has 1 complex data path for every FP16, FP32, and INT32 path, and Nvidia may not. There is precedent for this, e.g. Nvidia also has a smattering of FP64 lanes throughout its consumer GPUs, but only a few cores get such a lane, at least an active one - previous Nvidia GPUs simply fused off FP64 in consumer cores, but it's not clear if more recent ones just use a different mask. Also, I alluded to the structure within a subunit earlier, but that was an oversimplification, as each core (Apple and Nvidia are the same here) is actually split into 4:

1 Core/SM is 4 of the following subunits:
Nvidia: 16 FP32, 16 FP32/INT32, + optional 1 FP64, ? Complex
Apple: 32 FP32, 32 INT32, 32 FP16, 32 Complex

Since the M3 generation, Apple is also a superscalar design capable of issuing two commands at once. Previously, in the M2 generation and earlier, each warp could only issue a single INT32 or FP32 or FP16 or complex command. Since then, Apple has allowed the GPU to issue two commands to two pathways at once. I won't repost @leman 's charts without his permission here, but @leman actually found some odd results in his testing of the M3 GPU: namely that while FP16 and FP32 could double the number of operations when performed simultaneously, INT32 could not - it would still increase, just by a measly 20% instead of double. And issuing three commands (FP16, INT32, FP32) at once would increase the throughput by 70%, again instead of double. It's unclear why this is the case or if the M4 generation is the same. I know @leman is preparing a deep dive on the M3 GPU architecture so maybe he'll have more insight when he is ready to present his full findings.

The net result is actually fascinating: where Nvidia GPUs drop FP32 throughput in the presence of FP16/INT32 work, it is possible that Apple GPUs going forward won't, and, even more intriguingly, future Apple GPUs could, like Nvidia, turn one of their other pipelines into an optional FP32 pipe if they wanted to increase potential FP32 throughput without sacrificing additional die area (you'd still pay for the power, of course).

Interestingly, it's less clear if the modern Nvidia GPU actually is a true superscalar GPU. What's even more interesting is that even some Nvidia engineers themselves are unsure about it. Robert Crovella is an Nvidia engineer who is often seen on the forums fielding questions from those seeking help on CUDA and other issues. Even he doesn't know and admits it isn't documented. His hypothesis is that, unlike Kepler, which was superscalar, post-Kepler GPUs aren't at the warp level (except one brief generation where, oddly, Nvidia experimented with doubling FP16 through the FP32 pipe but abandoned that in subsequent generations - I think that was the 2000-series GPUs). Instead, what they do is issue commands to the different subunits on alternating clocks to give the appearance of being superscalar, at least at the level of throughput. I have to admit I am a little unsure if that would work and whether it wouldn't show up in throughput-level tests, given FMA calculations are typically done in a single clock cycle. So he stressed that he wasn't sure and the Nvidia GPU may indeed be superscalar.


It should be pointed out, though, that superscalar GPUs do not always deliver better performance, unlike on the CPU where it is always a net win. Post-Kepler, Nvidia deliberately moved away from superscalar designs to simplify their cores, make them smaller, and then put more of them in the same area - thus widening the design by adding the capability to simply run more threads at once. That can be more beneficial for a GPU's power-to-performance ratio than having more dedicated pipes IF the work you want to run is focused on a specific KIND of pipe. Everything is a tradeoff. Nvidia made one set of decisions, Apple made a different set, and sometimes Apple's pays off.

Similarly for the cache: as I said, and as you went into in more detail, Apple in the M3 has a brand-new cache structure which allows them to increase thread occupancy and - for those interested in not spending every waking moment compiling every damn option - to do so at run time instead of compile time. While it is not clear to me how big Apple's GPU caches are now (probably a lot bigger than they were), what is clear is that their structure is entirely different from the classic GPU data cache. Again, I won't post @leman's chart without his permission, but if you were able to look at it you would no longer see the telltale signs of bank conflicts in shared memory that you see on M1 and M2 and on Nvidia GPUs. This is not to say that there isn't variation in strided memory access on M3 - there is - but it no longer corresponds as neatly to the classic bank conflicts seen in earlier Apple GPUs. Also, for M3 the overall cache bandwidth is much lower than before as well (again, probably because the cache is much bigger now). So while overall M3 GPUs could outperform M2 GPUs clock for clock, that wasn't true for every benchmark and benchmark subtest - sometimes the new GPU would fall behind the old one. Again, it's unclear what is going on here and what the behavior of the M4 GPU is.

Anyway I gotta go and this post is already long enough, but I hope getting into the nitty gritty was interesting for people.
 
1 Core/SM is 4 of the following subunits:
Nvidia: 16 FP32, 16 FP32/INT32, + optional 1 FP64, ? Complex
Apple: 16 FP32, 16 INT32, 16 FP16, 16 Complex

Apple's pipelines are 32-wide (or at least they behave like they are).

Since the M3 generation, Apple is also a superscalar design capable of issuing two commands at once. Previously, in the M2 generation and earlier, each warp could only issue a single INT32 or FP32 or FP16 or complex command. Since then, Apple has allowed the GPU to issue two commands to two pathways at once. I won't repost @leman 's charts without his permission here, but @leman actually found some odd results in his testing of the M3 GPU: namely that while FP16 and FP32 could double the number of operations when performed simultaneously, INT32 could not - it would still increase, just by a measly 20% instead of double. And issuing three commands (FP16, INT32, FP32) at once would increase the throughput by 70%, again instead of double. It's unclear why this is the case or if the M4 generation is the same.

Here is the graph for illustration

[graph attachment]


As @crazy dave says, it is not clear why float+int is slower. There appears to be some penalty with dual-issuing int and fp instructions.


Interestingly, it's less clear if the modern Nvidia GPU actually is a true superscalar GPU. What's even more interesting is that even some Nvidia engineers themselves are unsure about it. Robert Crovella is an Nvidia engineer who is often seen on the forums fielding questions from those seeking help on CUDA and other issues. Even he doesn't know and admits it isn't documented. His hypothesis is that, unlike Kepler, which was superscalar, post-Kepler GPUs aren't at the warp level (except one brief generation where, oddly, Nvidia experimented with doubling FP16 through the FP32 pipe but abandoned that in subsequent generations - I think that was the 2000-series GPUs). Instead, what they do is issue commands to the different subunits on alternating clocks to give the appearance of being superscalar, at least at the level of throughput. I have to admit I am a little unsure if that would work and whether it wouldn't show up in throughput-level tests, given FMA calculations are typically done in a single clock cycle.

I thought these things were fairly well known? There is no difference on throughput tests since both subunits per scheduler operate concurrently. There is just one cycle at the beginning of the workload and one cycle at the end of the workload where one of the units is inactive. That is not measurable if your workload runs over millions of cycles.

It should also be fairly simple to verify experimentally. Run a very small (but long-duration) workload that will only target a single subunit. If the hypothesis is correct, you should see 1/8 of the performance that a SM is capable of.
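A toy cycle counter makes the fill/drain point concrete (a sketch with made-up occupancy numbers: two 16-wide pipes and 32-wide warps, so each instruction occupies a pipe for 2 cycles; not a claim about how Nvidia actually schedules):

```python
# Toy cycle-count model: whether the scheduler feeds the two 16-wide pipes on
# alternating clocks or truly dual-issues each clock, the steady-state
# throughput is the same once both pipes are busy; only about one cycle of
# fill/drain differs. Purely illustrative.

OCCUPANCY = 2   # cycles a 16-wide pipe is busy with one 32-thread warp instruction

def total_cycles(n_instr, issue_per_cycle):
    pipes_free_at = [0, 0]          # cycle at which each of the 2 pipes frees up
    cycle, issued = 0, 0
    while issued < n_instr:
        can_issue = issue_per_cycle
        for p in range(2):
            if can_issue and pipes_free_at[p] <= cycle and issued < n_instr:
                pipes_free_at[p] = cycle + OCCUPANCY
                issued += 1
                can_issue -= 1
        cycle += 1
    return max(pipes_free_at)

for n in (8, 1_000_000):
    alt, dual = total_cycles(n, 1), total_cycles(n, 2)
    print(n, alt, dual, f"throughput ratio {dual / alt:.6f}")
# 8 instructions:         9 vs 8 cycles  -> visible on a tiny workload
# 1,000,000 instructions: 1,000,001 vs 1,000,000 -> unmeasurable in practice
```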

Ultimately the Apple GPU is evolving to something like the Apple CPU, in the sense that Apple is willing to spend area for the sake of saving energy.
Much of the *actual* way this is done is by boosting the "IPC" of the GPU, making it spend a lot less time waiting for something to happen.
GPUs have obviously always operated by swapping in one warp when another warp cannot execute, but this was always something of an aspiration, with about 50% or so activity; Apple does everything they can to ramp it to 100%, from providing more registers (so more warps can be active) to a very weird instruction flow (which ensures that the core rarely has to wait for an instruction cache miss) to the mini-front-ends (which, among other things, ensure the system doesn't waste a cycle or two pulling in a register that's not yet in the operand cache).

I am also wondering about Nvidia's register and shared memory design. Nvidia's documentation is very adamant about the need to avoid bank conflicts, and some results I've seen suggest that this is a particularly significant concern for their hardware. Apple's caches have more uniform access timings, but they are also slower. There seem to be a lot of design tradeoffs and different approaches to balancing them.

There is also some evidence that Nvidia has a narrower register data path and is willing to sacrifice some hardware utilization efficiency to reduce the core die area. I think this is particularly visible on complex shaders like Blender where Nvidia has considerably lower performance per FLOP than Apple. Nvidia also has a dedicated hardware unit, the operand collector that tracks register usage and aligns bank access to compensate for this.
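For anyone following along, the bank-conflict arithmetic Nvidia's docs warn about looks like this (the standard 32-bank, 4-byte-word shared-memory layout):

```python
# Classic shared-memory bank-conflict arithmetic: 32 banks, 4-byte words,
# bank = word_index % 32. The worst-case number of threads in a warp hitting
# the same bank is how many times the access gets serialized.
from collections import Counter

N_BANKS, WARP = 32, 32

def conflict_degree(stride_in_words):
    banks = [(tid * stride_in_words) % N_BANKS for tid in range(WARP)]
    return max(Counter(banks).values())

for stride in (1, 2, 4, 32, 33):
    print(f"stride {stride:2d}: {conflict_degree(stride)}-way conflict")
# stride  1: 1-way (conflict-free)
# stride  2: 2-way, stride 4: 4-way, stride 32: 32-way (fully serialized)
# stride 33: 1-way - the classic reason for padding shared-memory arrays
```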
 
[attachments: Cinebench 2024 GPU benchmark chart and scene render]
Though no renderer is optimized for the M4 series so soon after its integration into the Mac, from what I've seen so far the Cinebench 2024 GPU benchmark gives a good indication of the generational performance gains Apple achieved with the M4 GPUs.

Credits: Andrew Cunningham, Ars Technica for the test chart; and the scene rendering included here for visual reference is from Tech4gamers.
 

That is very impressive given that the core count did not increase and the clock only increased marginally. 30% year-to-year is nothing to sneeze at.
 
I wonder what is the source of this improvement? I know they referred to an improved scheduler. Also do we know the clock speed improvement? I haven’t seen it mentioned.
 
Power draw for one ;)
Unfortunately I have no M3 Max to compare but the M4 Max sure uses a bit of power in that test and I got a very similar score.

I can install Asitop if someone wants the numbers.

My guess is it uses nearly 2x what the M1 Max did.

Just checked, as I am still at work: the M1 Max GPU seems to use a max of 24 W peak in the test. The average is probably quite a bit lower, like 20 or so, but I did not want to run the full 10 min now.
 

Well, given that the M1 Max scores ~4400 points in that test, I'd gladly take a 4x improvement in performance for a 2x increase in power draw. Especially since 50 watts is still a ridiculously low power draw for a large workstation laptop. Apple could go to at least 60-70 watts for the GPU alone without any detrimental effects.
 
Ooh I think we probably are at those 60-70 W to be honest.

I will check it when I get home.
 
Apple's pipelines are 32-wide (or at least they behave like they are).
Yeah, I typed the above too quickly and reused the Nvidia numbers, since, like Nvidia, Apple also has 4 subunits and 128 total FP32 pipes in a core - but the whole point was that Nvidia's numbers include the optional FP32 pipes while Apple's do not, resulting in double the number of pipes per subunit, even beyond the extra FP16 pipes. That was the entire point of writing it out like that and I completely goofed it. Sorry.

Edited to this:
1 Core/SM is 4 of the following subunits:
Nvidia: 16 FP32, 16 FP32/INT32, + optional 1 FP64, ? Complex
Apple: 32 FP32, 32 INT32, 32 FP16, 32 Complex

Nvidia has the same theoretical FP32 output per clock per core only if they don't have simultaneous integer or FP16 work, and Apple has double the possible integer output per clock per core.
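A crude way to put per-clock numbers on that claim (a time-averaged sketch of the theoretical limits only; real sustained rates depend on the dual-issue quirks above and on the actual instruction mix):

```python
# Per-core, per-clock issue potential implied by the corrected table above.
# "frac_int" is a crude time-averaged way to model the second Nvidia pipe
# switching between FP32 and INT32; it is not how issue actually works
# instruction by instruction.

def nvidia_sm(frac_int):
    # 4 subunits x (16 FP32 + 16 FP32-or-INT32 lanes)
    fp32 = 4 * (16 + 16 * (1 - frac_int))
    int32 = 4 * 16 * frac_int
    return fp32, int32

def apple_core(dual_issue=True):
    # 4 subunits x (32 FP32 + 32 INT32): dedicated pipes, but using both at
    # once requires the (M3+) dual-issue capability.
    return (128, 128) if dual_issue else (128, 0)

print(nvidia_sm(frac_int=0.0))   # (128.0, 0.0) - pure FP32 matches Apple's FP32 peak
print(nvidia_sm(frac_int=0.5))   # (96.0, 32.0) - FP32 falls as INT work appears
print(apple_core())              # (128, 128)   - double the INT32 ceiling per core
```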

I thought these things were fairly well known? There is no difference on throughput tests since both subunits per scheduler operate concurrently. There is just one cycle at the beginning of the workload and one cycle at the end of the workload where one of the units is inactive. That is not measurable if your workload runs over millions of cycles.

If Robert didn't know, then at least at the time of his posts that wasn't something Nvidia had officially clarified. If they have issued a clarification since then, I have not been able to find it so far. My issue is this - and I'm sure I'm wrong here somewhere - but if the FMA itself only takes a single cycle to complete, then attempting to issue subsequent FMAs should result in tandem, not parallel, execution, no? Perhaps my mental model here is wrong. Here's what I think is going on:

A warp consisting of 32 threads should execute over two subunits, right? And then, barring divergence of course, both subunits are scheduled simultaneously and execute together. If there are two independent FP32 calculations per thread, and if the Nvidia GPU is truly superscalar, then the warp scheduler can issue commands to the INT32/FP32 pipe simultaneously with the FP32 pipe, completing 64 FP32 operations per cycle across the two subunits. However, if the Nvidia GPU is not superscalar, then by the time it schedules the work for the combined pipe, the FP32 pipe should already be done, and you now have tandem execution and no increase in throughput per clock cycle - which should mean no increase in overall throughput.

It should also be fairly simple to verify experimentally. Run a very small (but long-duration) workload that will only target a single subunit. If the hypothesis is correct, you should see 1/8 of the performance that a SM is capable of.
Again, I think my mental model is wrong - or at least, given my mental model, I'm not quite following why we would need to do this at all, since the way I'm thinking about it, even a simple throughput test should tell the difference. Since that's presumably not the case, I must be mistaken somewhere. If both of you and Robert are saying this scheme is reasonable, it most likely is. :)

I am also wondering about Nvidia's register and shared memory design. Nvidia's documentation is very adamant about the need to avoid bank conflicts, and some results I've seen suggest that this is a particularly significant concern for their hardware. Apple's caches have more uniform access timings, but they are also slower. There seem to be a lot of design tradeoffs and different approaches to balancing them.

There is also some evidence that Nvidia has a narrower register data path and is willing to sacrifice some hardware utilization efficiency to reduce the core die area. I think this is particularly visible on complex shaders like Blender where Nvidia has considerably lower performance per FLOP than Apple. Nvidia also has a dedicated hardware unit, the operand collector that tracks register usage and aligns bank access to compensate for this.
I know Nvidia has 64K 32-bit registers per SM in its register file - do we know what Apple is using now? Chips and Cheese didn't go into that in their unfortunately brief look at the M2 (which is also not the M3, obviously, which probably dramatically changed things).
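For reference, here is the standard occupancy arithmetic that register file implies (simplified - real hardware also caps resident threads/warps per SM and allocates registers in granules):

```python
# Standard CUDA-style occupancy arithmetic for a 64K x 32-bit register file:
# the registers each thread needs cap how many threads (and warps) can be
# resident at once, before other per-SM limits kick in.

REGS_PER_SM = 64 * 1024          # 32-bit registers
print(REGS_PER_SM * 4 // 1024, "KB of register file per SM")   # 256 KB

for regs_per_thread in (32, 64, 128, 255):
    threads = REGS_PER_SM // regs_per_thread
    print(f"{regs_per_thread:3d} regs/thread -> up to {threads} resident threads "
          f"({threads // 32} warps), before other limits kick in")
# 32 -> 2048 threads (64 warps), 64 -> 1024 (32), 128 -> 512 (16), 255 -> 257 (8)
```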

I should also state I made another error in my above statements about caches.

... oh and I didn't include the transistors for the huge 48MB SLC cache which the Apple GPU takes advantage of but doesn't have an analog on the Nvidia GPU ...

... In addition, while Nvidia had much more L1/L2 cache than Apple's GPUs in the M2 generation, Apple dramatically overhauled its cache configuration in the M3 and I'm no longer as certain how much cache Apple has per core/GPU total. I suspect it has gone up by quite a bit. ...

I said Nvidia doesn't have the equivalent of the SLC, but that's not really true. Nvidia has 128KB of L1 per SM (256KB for compute cards like the H100) and tens of MB of L2, which in fact functions as a last-level cache similar to Apple's SLC. Apparently on some of Nvidia's compute cards there is even a "near L2" and a "far L2". What Nvidia doesn't have is the equivalent of Apple's L2 cache - though again, it isn't clear how the M3/M4 have changed things. Again, sorry for any confusion.



With the M3/M4 having their unique dynamic cache for both cache data and registers, and with Apple having that full 128-wide core with no (or smaller) limitations on achieving its theoretical FLOP output (and TBDR), yeah, we can definitely see where Apple might catch up in graphics performance per FLOP.

That is very impressive given that the core count did not increase and the clock only increased marginally. 30% year-to-year is nothing to sneeze at.

I do wonder about improvements to the L1 cache bandwidth and/or INT scheduling - like did the M4 improve those things over the M3 or is it literally just an upclocked M3 with better memory bandwidth? I keep going back and forth but given some of these results, maybe Apple made a few tweaks to improve them?

It's all a little over my head, but it is certainly fascinating! I love this corner of the forum for the knowledgeable people sharing their stuff.

Thanks @name99 and @crazy dave

Sorry, I don't mean to talk over people's heads, but I recognize that I went very fast - too fast - through the material. If you'd like, feel free to ask a question and I'll do my best to explain starting from the basics ... and without the mistakes I made above. :)

Actually, maybe iStat Menus needs an update - it is way lower than I expected.
Cinebench 2024
Just for comparison, my 4070 Ti Super uses about 50 W sitting on the desktop and peaks over 200 W running Cinebench.
That makes sense to me as the GPU should more or less scale okay in power - Apple didn't increase GPU clock speed in the M3 generation and so had some silicon node advantage to play with in the M4 compared to GPU clocks in the M2.
 