
mi7chy

macrumors G4
Oct 24, 2014
10,619
11,292
Why remove the comparison?


His completion times on an overclocked 4080 Super are slower than my stock 4080 Super at a 52% power limit (165W vs the default 320W), but I see he has OBS and other apps running, which may impact rendering time. Also, his relative speed bars are off.

Monster Under the Bed (mine 16.7secs vs his 22secs)


Lone Monk (mine 1min 54secs vs his 1min 58secs)
 

innerproduct

macrumors regular
Jun 21, 2021
222
353
He didn't use OptiX for rendering. How is it even possible to be so ignorant and still work with 3D? So all his results are totally irrelevant.
 

crazy dave

macrumors 65816
Sep 9, 2010
1,450
1,219
M4 Max 3s faster than desktop 4070 Super in Blender Classroom with Optix.

Since in Blender's aggregate score the Max matches the regular 4070, I'm assuming the 4070 Super pulls ahead in the other scenes? Is it pulling ahead in the smaller or larger scenes I wonder ... it's such a pity they only report the aggregate scores on Open Data. Also, I'd love to see the results for something like the Moana scene ...

That's the thing: Apple's RAM upgrades are expensive (though better now that the base is 16GB), but VRAM in the PC world, especially beyond 24GB, is much, much worse. This is where Apple's approach really shines. When it comes to VRAM/bandwidth (above 24GB), Apple is actually the cost-effective solution ... feels weird, man, to write that :). Hopefully that results in price pressure on Nvidia in particular to bring its workstation GPU prices down or increase the max RAM on its consumer GPUs, but that may be unlikely given Nvidia's overall market dominance.

==============================

The die area devoted to the GPU is sometimes why I've seen people, only half-jokingly, refer to the Max and Ultra as a GPU with an integrated CPU ;).

The M3 Max has about 92 billion transistors and the GPU is roughly 36% of that, for about 33 billion transistors. The M4 Max is likely bigger and thus the GPU, in terms of raw transistor count, is likely quite a bit bigger than the 3090 (of course how you use those transistors is even more important).


Don't get me wrong, it's still very impressive! The 3090 packs more SMs at similar/slightly higher clocks (and thus has more raw TFLOPs).
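To put rough numbers on that, here's the back-of-envelope TFLOPs math (the shader counts are the published ones; the clocks are my approximations, so treat the outputs as ballpark figures, not official specs):

```python
# Back-of-envelope FP32 TFLOPs = shader lanes * 2 (an FMA counts as 2 flops) * clock in GHz
def fp32_tflops(shader_lanes: int, clock_ghz: float) -> float:
    return shader_lanes * 2 * clock_ghz / 1000.0   # GFLOPs -> TFLOPs

# RTX 3090: 82 SMs x 128 FP32 lanes at ~1.7 GHz boost (approximate clock)
print(f"RTX 3090: {fp32_tflops(82 * 128, 1.7):.1f} TFLOPs")   # ~35.7
# M3 Max (40-core GPU): 40 cores x 128 FP32 lanes at ~1.4 GHz (approximate clock)
print(f"M3 Max  : {fp32_tflops(40 * 128, 1.4):.1f} TFLOPs")   # ~14.3
```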

One thing I didn't make clear in my original post: even if you compare Apple with, say, the desktop 4070, whose stats are more comparable to the M3/M4 Max since it's on TSMC 4N rather than Samsung 8 ... why is Nvidia seemingly able to fit more shader cores/SMs per transistor than Apple, and how does that relate to TFLOPs and so forth? Nvidia is able to fit 46 SMs in 36 billion transistors (which also includes IO and tensor cores, the first of which I didn't include in the above for the M3 Max and the second of which Apple doesn't have ... oh, and I didn't include the transistors for the huge 48MB SLC cache, which the Apple GPU takes advantage of and which has no analog on the Nvidia GPU) compared to 40 cores in 33 billion for the M3 Max (and presumably more transistors for the M4 Max).

I'm glad you asked! :)

So Nvidia in the last few generations reports 128 FP32 shader cores per SM, but, unlike Apple, Nvidia actually has 2 FP32 pipes per individual compute subunit, one of which can double as an INT32 pipe. Apple does not. So when Apple reports 128 FP32 shader units per core, that comes with 128 of each of the other pipes as well (Integer, FP16, Complex, etc ...). This means Apple's cores are going to be bigger than Nvidia's SMs, even though SMs also include 4 tensor cores. Further, because of the way the pipes work, Nvidia GPUs are not always going to be able to take advantage of the extra FP32 pipe - for example if they have integer work to do, or not enough parallel FP32 work within the subunit.
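As a very simplified sketch of why that extra pipe isn't always "free": per SM per clock there are 64 dedicated FP32 lanes plus 64 lanes shared between FP32 and INT32, so whatever fraction of the instruction stream is integer work eats into the shared lanes. This is just a toy model of the idea, not a cycle-accurate description of the scheduler:

```python
# Toy model of an Ampere/Ada-style SM: 64 dedicated FP32 lanes plus
# 64 shared FP32/INT32 lanes. Integer instructions occupy the shared lanes.
def effective_fp32_lanes(int_fraction: float) -> float:
    dedicated = 64
    shared = 64 * (1.0 - int_fraction)   # shared lanes left over for FP32
    return dedicated + shared

for frac in (0.0, 0.25, 0.5):
    print(f"{frac:.0%} integer work -> ~{effective_fp32_lanes(frac):.0f} FP32 lanes/SM/clock")
# 0%  -> 128 (the headline number), 25% -> 112, 50% -> 96
```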


Side note: the AMD chips are, as far as I can tell, worse in this respect: RDNA 3 reports "TFLOPs" that are only achievable if there is instruction-level parallelism (like on a CPU thread). Compare calculating the TFLOPs per shader core for the linked RDNA 3 GPU vs the 4070 above.
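Since the link isn't reproduced here, take the 7900 XTX as a stand-in to see the effect (published shader count and roughly the boost clock; the point is only how much of the headline figure depends on dual issue):

```python
# RDNA 3 headline TFLOPs assume every wave can dual-issue (VOPD); single issue is half.
shaders, clock_ghz = 6144, 2.5                      # RX 7900 XTX: 6144 stream processors, ~2.5 GHz
single_issue = shaders * 2 * clock_ghz / 1000.0     # FMA only, no dual issue
dual_issue = single_issue * 2                       # requires instruction-level parallelism
print(f"single-issue: {single_issue:.1f} TFLOPs")   # ~30.7
print(f"dual-issue  : {dual_issue:.1f} TFLOPs (the advertised number)")   # ~61.4
```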

In addition, while Nvidia had much more L1/L2 cache than Apple's GPUs in the M2 generation, Apple dramatically overhauled its cache configuration in the M3 and I'm no longer as certain how much cache Apple has per core/GPU total. I suspect it has gone up by quite a bit.

Of course there is Apple's (ImgTech's) TBDR which helps in rendering complex scenes where lots of objects are occluding each other and this speeds up the rasterization portion of the process. TBDR is not always a net win, but when it is, it is.

Finally, as I mentioned above, Apple has much better bandwidth/VRAM per TFLOP than Nvidia typically offers on its consumer GPUs and its workstation GPUs are extremely expensive. So when memory bandwidth and footprint are at a premium during a workload, Apple GPUs will also punch above their weight.
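Some rough bandwidth-per-TFLOP numbers to illustrate (the bandwidths are published specs; the M4 Max TFLOPs figure is my own estimate, so take the ratios as approximate):

```python
# GB/s of memory bandwidth per FP32 TFLOP; higher means less likely to be bandwidth-starved.
gpus = {
    "M4 Max (40-core)": (546.0, 16.5),   # ~546 GB/s unified memory; ~16-17 TFLOPs (estimate)
    "RTX 4070":         (504.0, 29.0),   # 504 GB/s GDDR6X; ~29 TFLOPs
    "RTX 4090":         (1008.0, 82.6),  # ~1 TB/s GDDR6X; ~82.6 TFLOPs
}
for name, (bw_gbs, tflops) in gpus.items():
    print(f"{name:17s}: {bw_gbs / tflops:5.1f} GB/s per TFLOP")
```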

Given the above, I hope it's a little clearer how Apple competes with Nvidia in these Blender benchmarks with seemingly smaller GPUs by SM/core count and slower clocks, but with similar if not greater transistor counts. Not every workload is going to be this favorable for Apple, of course. But these are probably the workstation/"prosumer" workloads Apple had in mind when designing the Max.
 

innerproduct

macrumors regular
Jun 21, 2021
222
353
The really good thing here is that these days Apple is a useful platform for 3D work. Not the best, but clearly viable. Quite the difference from the bad old days when Nvidia was 10x the perf of an Apple solution. Just two and a half years ago, the 3090 in my Threadripper system was so much better than an M1 Max or a 2019 Mac Pro that benchmarks were not even meaningful to look at. Now a 4090 is just about twice as fast, and I imagine that the 5090 to M4 "Ultra" will have about the same relationship. Quite impressive how well the RT cores work on Apple compared to what AMD managed to produce. Also, let's not forget that for many tasks in 3D the CPU is still very important, so for sims and general data shuffling the complete Apple package packs quite the punch. Now, the thing that needs the most improvement is software support for AI frameworks, but it seems Apple is making great progress there as well. Good times!
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
There's much more going on than just the above, e.g.
- Apple has distinct FP16 and FP32 pipes; nV does FP16 in the FP32 pipe. The Apple scheme uses more area but less power, since the FP16 pipe is optimized for only FP16.
- Apple (as of M3) has a much more abstract resource allocation scheme for registers and Threadblock Scratchpad, which takes additional area but allows much more flexibility (i.e. much more efficient use of the rest of the GPU).
- nV allows for some degree of superscalarity (i.e. the core executes more than one instruction per cycle), but most of the work to ensure this happens in the compiler. Apple gets to the same place, but much of the work happens through the GPU core having multiple "mini-front-ends" feeding into a single set of execution pipelines. The nV scheme takes less area but is much less flexible.

Ultimately the Apple GPU is evolving to something like the Apple CPU, in the sense that Apple is willing to spend area for the sake of saving energy.
Much of the *actual* way this is done is by boosting the "IPC" of the GPU, making it spend a lot less time waiting for something to happen.
GPUs have obviously always operated by swapping in one warp when another warp cannot execute, but this was always something of an aspiration, with about 50% or so activity; Apple does everything they can to ramp it to 100%, from providing more registers (so more warps can be active) to a very weird instruction flow (which ensures that the core rarely has to wait for an instruction cache miss) to the mini-front-ends (which, among other things, ensure the system doesn't waste a cycle or two pulling in a register that's not yet in the operand cache).
You can always find microbenchmarks that push a particular GPU design to 100% activity, and choosing the right such benchmark may make nV (or Apple, or AMD) look good. But on REAL GPU code, with all its real messiness, we see the Apple "IPC" advantage, and the lower activity factors for other GPUs.
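To make the activity-factor argument concrete, here's a toy latency-hiding model (the stall and work figures are invented for illustration, not measurements of any particular GPU):

```python
# Toy latency-hiding model: the ALUs stay busy only if enough other warps are ready
# to issue while one warp waits on memory (the standard occupancy argument).
def activity(resident_warps: int, stall_cycles: float, work_cycles: float) -> float:
    warps_needed = stall_cycles / work_cycles   # warps required to fully cover a stall
    return min(1.0, resident_warps / warps_needed)

# Illustrative numbers only: ~10 cycles of issuable work per ~400-cycle memory stall.
for warps in (8, 16, 32, 48):
    print(f"{warps:2d} resident warps -> ~{activity(warps, 400, 10):.0%} ALU activity")
# More registers / more flexible allocation -> more resident warps -> higher activity.
```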

There are similar techniques operating at higher levels, so that, for example, the traditional scheme might be something like one kernel executes, it finishes and resources are deallocated, then the next kernel starts. As of M4 Apple overlaps these stages so that as one kernel is approaching its end, the resources for the next kernel begin being allocated and pulled into the appropriate cache level. Then the new kernel starts executing, and simultaneously the resources of the previous kernel are now deallocated. So you're no longer wasting the 5% of time or so between kernels that was basically housekeeping, all that's overlapped with real work.
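A toy timeline for that kernel-overlap point, just to show where the ~5% comes back from (made-up numbers, purely illustrative):

```python
# Serial: run a kernel, then do allocation/deallocation housekeeping, then run the next.
# Overlapped: the housekeeping for kernel i+1 hides under the tail of kernel i.
def serial_time(kernel_ms, housekeeping_ms):
    return sum(kernel_ms) + housekeeping_ms * len(kernel_ms)

def overlapped_time(kernel_ms, housekeeping_ms):
    return sum(kernel_ms) + housekeeping_ms   # only the very first setup stays exposed

kernels = [10.0] * 20        # twenty 10 ms kernels (made-up)
hk = 0.5                     # ~5% housekeeping per kernel
print(serial_time(kernels, hk), overlapped_time(kernels, hk))   # 210.0 vs 200.5 ms
```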

Each of these things seems minor, 1% here, 2% there, but keep at them over a few years, and you land up with a 50% or so "IPC" advantage (ie you get ~50% more performance out of what appear to be essentially the same count of GPU cores).
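The compounding arithmetic is easy to check; twenty-odd small independent gains multiply up fast (percentages purely illustrative):

```python
# ~2% per improvement, compounded over ~21 independent improvements:
print(f"{1.02 ** 21:.2f}x")   # ~1.52x, i.e. roughly the ~50% "IPC" gap described above
```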
 

crazy dave

macrumors 65816
Sep 9, 2010
1,450
1,219
There's even more going on here than that. I had a longer post of my own in the pipeline detailing the differences between Nvidia and Apple GPUs, but since you covered some of the rest of that, I'll just add on the extra details here.

In addition to the FP16 pipeline we both mentioned earlier, Apple also has a "Complex" pipeline for doing sines, cosines, etc ... It has never been clear to me if Nvidia has the same, as they don't cover that explicitly in their white papers, but given its commonality across different GPU designs and the fact that "fast" approximations for those complex functions exist on Nvidia GPUs, I would assume they do. However, how many they have per core may differ from Apple's: Apple has one complex data path for every FP16, FP32, and INT32 path, and Nvidia may not. There is precedent for this, e.g. Nvidia also has a smattering of FP64 lanes throughout its consumer GPUs but only a few cores get such a lane, at least an active one - previous Nvidia GPUs simply fused off FP64 in consumer cores, but it's not clear if more recent ones just use a different mask. Also, when I alluded to the structure within a subunit earlier, that was an oversimplification: each core/SM (Apple and Nvidia are the same here) is actually split into 4:

1 Core/SM is made up of 4 of the following subunits:
Nvidia: 16 FP32, 16 FP32/INT32, + optional 1 FP64, ? Complex
Apple: 16 FP32, 16 INT32, 16 FP16, 16 Complex

Since the M3 generation, Apple's GPU has also been a superscalar design capable of issuing two commands at once. Previously, in the M2 generation and earlier, each warp could only issue a single INT32, FP32, FP16, or Complex command. Since then, Apple has allowed the GPU to issue two commands to two pathways at once. I won't repost @leman 's charts without his permission here, but @leman actually found some odd results in his testing of the M3 GPU: while FP16 and FP32 could double the number of operations when performed simultaneously, INT32 could not - it would still increase, just by a measly 20% instead of doubling. And issuing three commands (FP16, INT32, FP32) at once would increase throughput by 70%, again instead of doubling. It's unclear why this is the case or if the M4 generation is the same. I know @leman is preparing a deep dive on the M3 GPU architecture so maybe he'll have more insight when he is ready to present his full findings.

The net result is actually fascinating: where Nvidia GPUs drop FP32 throughput in the presence of FP16/INT32 work, it is possible that Apple GPUs won't going forward. Even more intriguing is the possibility that future Apple GPUs could, like Nvidia, turn one of their other pipelines into an optional FP32 pipe if they wanted to increase potential FP32 throughput without sacrificing additional die area (you'd still pay for the power, of course).

Interestingly, it's less clear if the modern Nvidia GPU actually is a true superscalar GPU. What's even more interesting is that even some Nvidia engineers themselves are unsure about it. Robert Crovella is an Nvidia engineer who is often seen on the forums fielding questions from those seeking help with CUDA and other issues. Even he doesn't know and admits it isn't documented. His hypothesis is that, unlike Kepler, which was superscalar, post-Kepler GPUs aren't at the warp level (except one brief generation where Nvidia oddly experimented with doubling FP16 through the FP32 pipe but abandoned that in subsequent generations - I think that was the 2000-series GPUs). Instead, what they may do is issue commands to the different subunits on alternating clocks to give the appearance of being superscalar, at least at the level of throughput. I have to admit I am a little unsure if that would work, and whether it wouldn't show up in throughput-level tests given that FMA calculations are typically done in a single clock cycle. So he stressed that he wasn't sure, and the Nvidia GPU may indeed be superscalar.
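Just as a thought experiment on that alternating-issue hypothesis (none of this is documented Nvidia behavior): if each 16-lane pipe takes two cycles to drain a 32-thread warp instruction, then issuing only one instruction per clock, alternating between the two pipes, keeps both pipes full and is indistinguishable from dual issue in a throughput test:

```python
# Thought experiment: two 16-lane pipes, 32-thread warps, so each instruction
# occupies a pipe for 32/16 = 2 cycles. Issue ONE instruction per clock,
# alternating pipes, and count completed lane-operations per cycle.
CYCLES, LANES_PER_PIPE, WARP = 1000, 16, 32
busy_until = [0, 0]              # cycle at which each pipe frees up
lane_ops = 0
for cycle in range(CYCLES):
    pipe = cycle % 2             # single issue, alternating target pipe
    if busy_until[pipe] <= cycle:
        busy_until[pipe] = cycle + WARP // LANES_PER_PIPE
        lane_ops += WARP
print(lane_ops / CYCLES)         # ~32 lane-ops/cycle per subunit, same as true dual issue
```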


It should be pointed out, though, that superscalar GPUs do not always deliver better performance, unlike on the CPU where it is always a net win. Post-Kepler, Nvidia deliberately moved away from superscalar designs to simplify their cores, make them smaller, and then put more of them in the same area - thus widening the design by adding the capability to simply run more threads at once. That can be more beneficial for a GPU's performance-to-power ratio than having more dedicated pipes per core, IF the work you care about is focused on one specific KIND of pipe. Everything is a tradeoff. Nvidia made one set of decisions, Apple made a different set, and sometimes Apple's pays off.

Similarly for the cache: as I said and as you went into in more detail, Apple in the M3 has a brand-new cache structure which allows them to increase thread occupancy and, for those interested in not spending every waking moment compiling every damn option, to do resource allocation at run time instead of compile time. While it is not clear to me how big Apple's GPU caches are now (probably a lot bigger than they were), what is clear is that their structure is entirely different from the classic GPU data cache. Again, I won't post @leman's chart without his permission, but if you were able to look at it you would no longer see the telltale signs of bank conflicts in shared memory that you see on M1 and M2 and on Nvidia GPUs. This is not to say that there isn't variation in strided memory access on M3 - there is - but it no longer corresponds as neatly to the classic bank conflicts seen in earlier Apple GPUs. Also, for M3, overall cache bandwidth is much lower than before (again, probably because the cache is much bigger now). So while M3 GPUs could overall outperform M2 GPUs clock for clock, that wasn't true for every benchmark and subtest - sometimes the new GPU would fall behind the old one. Again, it's unclear what is going on here and what the behavior of the M4 GPU is.
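For anyone who hasn't stared at those bank-conflict charts: with a classic 32-bank shared memory, the conflict degree for a strided access is just gcd(stride, 32), which is why the telltale spikes sit on power-of-two strides. A quick illustration of the generic model (not a claim about any specific chip):

```python
from math import gcd

# Classic shared-memory model: 32 banks, one 32-bit word per bank per cycle.
# 32 lanes accessing word addresses lane*stride touch 32/gcd(stride, 32) distinct
# banks, so the access gets serialized gcd(stride, 32) ways.
BANKS = 32
for stride in (1, 2, 4, 8, 16, 32, 33):
    degree = gcd(stride, BANKS)
    print(f"stride {stride:2d}: {degree:2d}-way conflict"
          f" ({'none' if degree == 1 else 'serialized'})")
```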

Anyway I gotta go and this post is already long enough, but I hope getting into the nitty gritty was interesting for people.
 