Did they add HWRT support finally? Their developer forum says not but that was a few weeks ago, and Nanite wasn’t mentioned then either.
Not sure.
Since in Blender's aggregate score the Max matches the regular 4070, I'm assuming the 4070 Super pulls ahead in the other scenes? Is it pulling ahead in the smaller or larger scenes, I wonder ... it's such a pity they only report the aggregate scores on Open Data. Also, I'd love to see the results for something like the Moana scene ...
M4 Max 3s faster than the desktop 4070 Super in Blender Classroom with OptiX.
The die area devoted to the GPU is sometimes why I've seen people, only half-jokingly, refer to the Max and Ultra as a GPU with an integrated CPU.
The M3 Max has about 92 billion transistors and the GPU is roughly 36% of that, or about 33 billion transistors. The M4 Max is likely bigger still, and thus its GPU, in terms of raw transistor count, is probably quite a bit bigger than the 3090 (of course, how you use those transistors is even more important).
Annotated Apple M3 Processor Die Shots Bring Chip Designs to Life (www.tomshardware.com)
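For the back-of-envelope version of that estimate (the 36% share is eyeballed from the annotated die shot, and the GA102 figure is the commonly cited one, so treat both as approximate):

```python
# Back-of-envelope transistor budgets, using approximate public figures.
m3_max_total = 92e9      # ~92 billion transistors for the whole M3 Max SoC
gpu_share = 0.36         # GPU's share of the die, eyeballed from the annotated die shot
m3_max_gpu = m3_max_total * gpu_share

ga102_total = 28.3e9     # commonly cited count for the full GA102 die (RTX 3090)

print(f"M3 Max GPU portion: ~{m3_max_gpu / 1e9:.0f}B transistors")
print(f"GA102 (RTX 3090):   ~{ga102_total / 1e9:.1f}B transistors")
print(f"ratio: {m3_max_gpu / ga102_total:.2f}x")
```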
Don't get me wrong, it's still very impressive! The 3090 packs more SMs at similar/slightly higher clocks (and thus has more raw TFLOPs).
There's much more going on than just the above, e.g. ...
That's the thing: Apple's RAM upgrades are expensive (though better now that the base is 16GB), but VRAM in the PC world, especially beyond 24GB, is much, much worse. This is where Apple's approach really shines. When it comes to VRAM/bandwidth (above 24GB), Apple is actually the cost-effective solution ... feels weird, man, to write that. Hopefully that results in price pressure on Nvidia in particular to bring its workstation GPU prices down or to increase the maximum RAM on its consumer GPUs, but that may be unlikely given Nvidia's overall market dominance.
==============================
One thing I didn't make clear in my original post: even if you compare Apple with, say, the desktop 4070, whose stats are more comparable to the M3/M4 Max since it's on TSMC 4N rather than Samsung 8 ... why is Nvidia seemingly able to fit more shader cores/SMs per transistor than Apple, and how does that relate to TFLOPs and so forth? Nvidia fits 46 SMs in ~36 billion transistors (a count that also includes IO and tensor cores, the first of which I didn't include above for the M3 Max and the second of which Apple doesn't have ... oh, and I also didn't include the transistors for the huge 48MB SLC cache, which the Apple GPU takes advantage of but which has no analog on the Nvidia GPU) compared to 40 cores in ~33 billion for the M3 Max (and presumably more transistors for the M4 Max).
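Dividing those (very rough) budgets out, just to make the density comparison concrete; the caveats in the parenthetical above apply to both numbers:

```python
# Very rough transistors-per-SM/core comparison. Both budgets include plenty of
# non-shader logic (IO, tensor cores, caches), so these are upper bounds meant
# only to show that the density gap isn't enormous.
nvidia_transistors = 36e9   # ~full AD104 die; the desktop 4070 enables 46 of its SMs
nvidia_sms = 46
apple_transistors = 33e9    # ~36% of the 92B M3 Max SoC (die-shot estimate)
apple_cores = 40

print(f"Nvidia: ~{nvidia_transistors / nvidia_sms / 1e6:.0f}M transistors per SM")
print(f"Apple:  ~{apple_transistors / apple_cores / 1e6:.0f}M transistors per core")
```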
I'm glad you asked!
So Nvidia, in the last few generations, reports 128 FP32 shader cores per SM, but, unlike Apple, Nvidia actually has 2 FP32 pipes per individual compute subunit, one of which can double as an INT32 pipe. Apple does not. So when Apple reports 128 FP32 shader units per core, the core also includes 128 of each of the other pipes (integer, FP16, complex, etc. ...). This means Apple's cores are going to be bigger than Nvidia's SMs, even though the SMs also include 4 tensor cores. Further, because of the way the pipes work, Nvidia GPUs are not always going to be able to take advantage of the extra FP32 pipe, like when they have integer work to do or not enough parallel FP32 work within the subunit.
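A toy way to see what the shared FP32/INT32 pipe costs in practice (numbers are purely illustrative, and the model just assumes the second FP32 op is lost whenever an INT32 instruction needs to issue):

```python
# Toy model of an Nvidia-style subunit: one dedicated FP32 pipe plus one shared
# FP32/INT32 pipe. Any cycle spent issuing INT32 work occupies the shared pipe,
# so the second FP32 op is lost that cycle.
def fp32_per_cycle(int_fraction):
    dedicated = 1.0
    shared = 1.0 - int_fraction
    return dedicated + shared

for int_fraction in (0.0, 0.2, 0.35, 0.5):
    rate = fp32_per_cycle(int_fraction)
    print(f"{int_fraction:4.0%} INT32 in the mix -> {rate:.2f} FP32/cycle "
          f"({rate / 2:.0%} of the marketed peak)")
```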
Side note: the AMD chips are, as far as I can tell, worse in this respect: RDNA 3 reports "TFLOPs" that are only achievable if there is instruction-level parallelism (like on a CPU thread). Compare calculating the TFLOPs per shader core for the linked RDNA 3 GPU vs the 4070 above.
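The arithmetic being referred to is just peak TFLOPS = 2 x shader count x clock. A quick sketch with illustrative spec numbers (my own approximations, roughly 7800 XT class for the RDNA 3 entry, not taken from the linked pages); the RDNA 3 "marketed" figure only materializes if the compiler can actually dual-issue:

```python
# Peak FP32 TFLOPS = 2 (an FMA counts as two ops) * shader count * clock in GHz / 1000.
def peak_tflops(shaders, clock_ghz, issue_width=1):
    return 2 * shaders * clock_ghz * issue_width / 1000

cards = [
    # name,                  shaders, GHz,  issue width
    ("RTX 4070",              5888,   2.48, 1),
    ("RDNA 3 card, no ILP",   3840,   2.43, 1),
    ("RDNA 3 card, marketed", 3840,   2.43, 2),   # dual-issue requires independent FP32 pairs
]

for name, shaders, ghz, width in cards:
    tflops = peak_tflops(shaders, ghz, width)
    print(f"{name:<22} ~{tflops:4.1f} TFLOPS, ~{tflops / shaders * 1000:.1f} GFLOPS per shader")
```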
In addition, while Nvidia had much more L1/L2 cache than Apple's GPUs in the M2 generation, Apple dramatically overhauled its cache configuration in the M3 and I'm no longer as certain how much cache Apple has per core/GPU total. I suspect it has gone up by quite a bit.
Of course there is Apple's (ImgTech's) TBDR which helps in rendering complex scenes where lots of objects are occluding each other and this speeds up the rasterization portion of the process. TBDR is not always a net win, but when it is, it is.
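One way to picture why occlusion-heavy scenes favor TBDR: the tiler resolves visibility before shading, so (for opaque geometry) each pixel is shaded roughly once, while an immediate-mode rasterizer shades every fragment it can't reject. A toy calculation with invented numbers:

```python
# Toy overdraw model (ignores early-Z, which helps immediate-mode GPUs when
# geometry happens to arrive front to back, and ignores blending/alpha test,
# which force a TBDR to shade more than once per pixel).
pixels = 3456 * 2234    # 16-inch MacBook Pro class framebuffer
overdraw = 4.0          # invented average: fragments rasterized per pixel in a busy scene

immediate_shaded = pixels * overdraw   # shades roughly every fragment it can't reject
tbdr_shaded = pixels                   # hidden-surface removal first, then shade visible pixels once

print(f"immediate mode shades ~{immediate_shaded / 1e6:.1f}M fragments")
print(f"TBDR shades           ~{tbdr_shaded / 1e6:.1f}M fragments")
print(f"fragment shading work avoided: {1 - tbdr_shaded / immediate_shaded:.0%}")
```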
Finally, as I mentioned above, Apple has much better bandwidth/VRAM per TFLOP than Nvidia typically offers on its consumer GPUs and its workstation GPUs are extremely expensive. So when memory bandwidth and footprint are at a premium during a workload, Apple GPUs will also punch above their weight.
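To put rough numbers on the bandwidth/footprint point (specs are approximate, the Apple memory pool is shared with the CPU so the capacity is an upper bound, and the M4 Max TFLOPS value is my own rough estimate rather than something from this thread):

```python
# Rough memory capacity and bandwidth per TFLOP, approximate public specs.
# M4 Max TFLOPS estimated as 40 cores * 128 ALUs * 2 * ~1.6 GHz, rounded down.
gpus = {
    #            GB,   GB/s,  ~FP32 TFLOPS
    "RTX 4070":  (12,   504,  29),
    "RTX 4090":  (24,  1008,  83),
    "M4 Max":    (128,  546,  16),
}

for name, (gb, gbps, tflops) in gpus.items():
    print(f"{name:>8}: {gb / tflops:5.2f} GB per TFLOP, {gbps / tflops:5.1f} GB/s per TFLOP")
```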
Given the above, I hope it becomes a little clearer how Apple competes with Nvidia in these Blender benchmarks with seemingly smaller GPUs by SM/core count and slower clocks, but with similar if not greater transistor counts. Not every workload is going to be this favorable for Apple, of course. But these are probably the workstation/"prosumer" workloads Apple had in mind when designing the Max.
There's even more going on here than that. I had a longer post of my own in the pipeline detailing the differences between Nvidia and Apple GPUs, but since you covered some of the rest of that, I'll just add on the extra details here.
- Apple has distinct FP16 and FP32 pipes, nV does FP16 in the FP32 pipe. The Apple scheme uses more area but less power, since the FP16 pipe is optimized for only FP16.
- Apple (as of M3) has a much more abstract resource allocation scheme for registers and threadblock scratchpad, which takes additional area but allows much more flexibility (i.e. much more efficient use of the rest of the GPU).
- nV allows for some degree of superscalarity (i.e. the core executes more than one instruction per cycle), but most of the work to ensure this happens in the compiler. Apple gets to the same place, but much of the work happens through the GPU core having multiple "mini-front-ends" feeding into a single set of execution pipelines. The nV scheme takes less area but is much less flexible.
Ultimately the Apple GPU is evolving to something like the Apple CPU, in the sense that Apple is willing to spend area for the sake of saving energy.
Much of the *actual* way this is done is by boosting the "IPC" of the GPU, making it spend a lot less time waiting for something to happen.
GPUs have obviously always operated by swapping in one warp when another warp cannot execute, but this was always something of an aspiration, with about 50% or so activity; Apple does everything they can to ramp it to 100%, from providing more registers (so more warps can be active) to a very weird instruction flow (which ensures that the core rarely has to wait for an instruction cache miss) to the mini-front-ends (which, among other things, ensure the system doesn't waste a cycle or two pulling in a register that's not yet in the operand cache).
You can always find microbenchmarks that push a particular GPU design to 100% activity, and choosing the right such benchmark may make nV (or Apple, or AMD) look good. But on REAL GPU code, with all its real messiness, we see the Apple "IPC" advantage, and the lower activity factors for other GPUs.
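A crude way to see why "more resident warps" turns into "IPC": if each instruction's result takes roughly N cycles and a warp (worst case) can't issue again until its previous result is ready, the core needs about N warps resident to keep the issue slot busy every cycle. A toy model, not a description of any real scheduler:

```python
# Toy latency-hiding model: a single issue slot, and each warp (pessimistically)
# stalls for `latency` cycles after every instruction, with no independent work
# of its own to overlap. Activity = fraction of cycles the issue slot is used.
def activity(resident_warps, latency):
    return min(1.0, resident_warps / latency)

latency = 10   # invented average stall, in cycles
for warps in (4, 6, 10, 16):
    print(f"{warps:2d} resident warps -> {activity(warps, latency):4.0%} issue-slot activity")
```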
There are similar techniques operating at higher levels, so that, for example, the traditional scheme might be something like one kernel executes, it finishes and resources are deallocated, then the next kernel starts. As of M4 Apple overlaps these stages so that as one kernel is approaching its end, the resources for the next kernel begin being allocated and pulled into the appropriate cache level. Then the new kernel starts executing, and simultaneously the resources of the previous kernel are now deallocated. So you're no longer wasting the 5% of time or so between kernels that was basically housekeeping, all that's overlapped with real work.
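As a toy timeline of that last point, with invented numbers, treating the inter-kernel housekeeping as a fixed cost that either serializes or hides under the neighbouring kernel:

```python
# Toy timeline for back-to-back kernels: each runs for `work` ms and carries
# `housekeeping` ms of allocation/teardown. Serialized, the housekeeping sits
# between kernels; overlapped, it hides under the neighbouring kernel's work.
kernels = 100
work = 2.0           # ms per kernel (invented)
housekeeping = 0.1   # ms per kernel, ~5% of the work, as in the post

serialized = kernels * (work + housekeeping)
overlapped = kernels * work + housekeeping   # only the very first setup is exposed

print(f"serialized: {serialized:.1f} ms")
print(f"overlapped: {overlapped:.1f} ms ({serialized / overlapped - 1:.1%} faster)")
```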
Each of these things seems minor, 1% here, 2% there, but keep at them over a few years and you end up with a 50% or so "IPC" advantage (i.e. you get ~50% more performance out of what appears to be essentially the same count of GPU cores).
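The "1% here, 2% there" point is just compounding; with invented per-feature gains:

```python
# Compounding a handful of small multiplicative wins.
from math import prod

tweaks_per_generation = [1.01, 1.02, 1.015, 1.01, 1.02]   # invented individual gains
generations = 5

per_gen = prod(tweaks_per_generation)
print(f"one generation of tweaks: {per_gen:.3f}x")
print(f"after {generations} generations: {per_gen ** generations:.2f}x")
```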
1 Core/SM is 4 of the following subunits:
Nvidia: 16 FP32, 16 FP32/INT32, + optional 1 FP64, ? Complex
Apple: 16 FP32, 16 INT32, 16 FP16, 16 Complex
Since the M3 generation, Apple is also a superscalar design capable of issuing two instructions at once. Previously, in the M2 generation and earlier, each warp could only issue a single INT32, FP32, FP16, or complex instruction. Since the M3, Apple has allowed the GPU to issue two instructions to two pathways at once. I won't repost @leman 's charts here without his permission, but @leman actually found some odd results in his testing of the M3 GPU: while FP16 and FP32 could double the number of operations when performed simultaneously, INT32 could not; it would still increase, but only by a measly 20% instead of doubling. And issuing three instructions (FP16, INT32, FP32) at once would increase throughput by 70%, again instead of doubling. It's unclear why this is the case, or whether the M4 generation is the same.
Interestingly, it's less clear whether the modern Nvidia GPU actually is a true superscalar GPU. What's even more interesting is that even some Nvidia engineers themselves are unsure about it. Robert Crovella is an Nvidia engineer who is often seen on the forums fielding questions from those seeking help with CUDA and other issues. Even he doesn't know and admits it isn't documented. His hypothesis is that, unlike Kepler, which was superscalar, post-Kepler GPUs aren't at the warp level (except one brief generation where Nvidia oddly experimented with doubling FP16 throughput through the FP32 pipe but abandoned that in subsequent generations; I think that was the 2000-series GPUs). Instead, what they do is issue instructions to the different subunits every other clock to give the appearance of being superscalar, at least at the level of throughput. I have to admit I am a little unsure whether that would work and whether it couldn't be seen by people doing throughput-level tests, given FMA calculations are typically done in a single clock cycle.
Though no renderer is optimized for the M4 series so soon after its integration in the Mac, from what I've seen so far the Cinebench 2024 GPU benchmark gives a good indication of the generational performance gains Apple achieved with the M4 GPUs.
That is very impressive given that the core count did not increase and the clock only increased marginally. 30% year-over-year is nothing to sneeze at.
I wonder what is the source of this improvement? I know they referred to an improved scheduler. Also, do we know the clock speed improvement? I haven't seen it mentioned.
The GPU should clock up to 1.6 GHz now.
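If both numbers are roughly right, a quick sanity check on how much of the ~30% the clock alone explains (the ~1.4 GHz M3-generation figure is a community estimate, not something Apple has published):

```python
# Rough split of the generational GPU gain into clock vs. everything else
# (scheduler, caches, "IPC"). Clock figures are community estimates, not official.
m3_gpu_clock = 1.40    # GHz, approximate
m4_gpu_clock = 1.60    # GHz, approximate
observed_gain = 1.30   # ~30% generation-over-generation in the benchmarks above

from_clock = m4_gpu_clock / m3_gpu_clock
from_everything_else = observed_gain / from_clock
print(f"clock alone: ~{from_clock - 1:.0%}; everything else: ~{from_everything_else - 1:.0%}")
```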
Power draw, for one.
I was just watching this. It’s a good review.
Well, given that the M1 Max scores ~4400 points in that test, I'd gladly take a 4x improvement in performance for a 2x increase in power draw. Especially since 50 watts is still a ridiculously low power draw for a large workstation laptop. Apple can go to at least 60-70 watts for the GPU alone without any detrimental effects.
Ooh, I think we probably are at those 60-70 W, to be honest.
Yeah, I typed the above too quickly and reused the Nvidia numbers, since, like Nvidia, Apple also has 4 subunits and 128 total FP32 pipes in a core. But the whole point was that Nvidia's numbers include the optional FP32 pipes while Apple's do not, resulting in double the number of pipes per subunit even beyond the extra FP16 pipes. That was the entire point of writing it out like that and I completely goofed it. Sorry.
Apple's pipelines are 32-wide (or at least they behave like they are).
Here is the graph for illustration
As @crazy dave says, it is not clear why float+int is slower. There appears to be some penalty with dual-issuing int and fp instructions.
I thought these things were fairly well known? There is no difference on throughput tests, since both subunits per scheduler operate concurrently. There is just one cycle at the beginning of the workload and one cycle at the end where one of the units is inactive. That is not measurable if your workload runs over millions of cycles.
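A little cycle-counting cartoon of that point: whether the scheduler starts two instructions in the same cycle or feeds the two pipes on alternating cycles, the totals differ by a single cycle at the edges of the run, which no throughput test over millions of cycles will see. (This is a cartoon of the two hypotheses, not a model of the real hardware.)

```python
# Cartoon of one subunit with two 16-lane pipes, where a 32-thread warp
# instruction occupies a pipe for 2 cycles. Compare a scheduler that can start
# two instructions in the same cycle against one that starts at most one per
# cycle, alternating between the pipes.
def cycles_to_finish(n_instructions, per_cycle_issue):
    pipe_free = [0, 0]          # cycle at which each pipe next becomes idle
    cycle, issued = 0, 0
    while issued < n_instructions:
        started = 0
        for p in range(2):
            if started == per_cycle_issue or issued == n_instructions:
                break
            if pipe_free[p] <= cycle:       # pipe is idle: start a 2-cycle instruction
                pipe_free[p] = cycle + 2
                issued += 1
                started += 1
        cycle += 1
    return max(pipe_free)

for n in (4, 100_000):
    dual = cycles_to_finish(n, per_cycle_issue=2)
    alternating = cycles_to_finish(n, per_cycle_issue=1)
    print(f"{n:>7} instructions: {dual} cycles (dual issue) vs {alternating} (alternating)")
```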
It should also be fairly simple to verify experimentally. Run a very small (but long-duration) workload that will only target a single subunit. If the hypothesis is correct, you should see 1/8 of the performance that an SM is capable of.
Again, I think my mental model is wrong, or at least, given my mental model, I'm not quite following why we would need to do this at all, since according to how I'm thinking about it even just a simple throughput test should tell the difference. Since that's presumably not the case, I must be mistaken somewhere. If you both and Robert are saying this scheme is reasonable, it most likely is.
I know Nvidia has 64K 32-bit registers per SM in its register file; do we know what Apple is using now? Chips and Cheese didn't go into that in their unfortunately brief look at the M2 (which is also not the M3, obviously, which probably changed things dramatically).
I am also wondering about Nvidia's register and shared memory design. Nvidia's documentation is very adamant about the need to avoid bank conflicts, and some results I've seen suggest that this is a particularly significant concern for their hardware. Apple's caches have more uniform access timings, but they are also slower. There seem to be a lot of design tradeoffs and different approaches to balancing them.
There is also some evidence that Nvidia has a narrower register data path and is willing to sacrifice some hardware utilization efficiency to reduce the core die area. I think this is particularly visible in complex shaders like Blender's, where Nvidia has considerably lower performance per FLOP than Apple. Nvidia also has a dedicated hardware unit, the operand collector, that tracks register usage and aligns bank accesses to compensate for this.
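For anyone following along who hasn't hit the bank-conflict issue before: Nvidia's shared memory is split into 32 four-byte banks, and a warp's accesses serialize whenever several lanes land in the same bank, which is why the documentation keeps hammering on access patterns. A quick sketch of how the stride determines the conflict degree (just counting, nothing hardware-specific beyond the 32-bank layout):

```python
# Count worst-case shared-memory bank conflicts for a 32-thread warp accessing
# 4-byte words at a given stride (Nvidia-style layout: 32 banks, 4 bytes wide,
# consecutive words in consecutive banks).
from collections import Counter

BANKS = 32

def conflict_degree(stride_words):
    hits = Counter((lane * stride_words) % BANKS for lane in range(32))
    return max(hits.values())    # 1 = conflict-free, N = N-way serialization

for stride in (1, 2, 4, 32, 33):
    n = conflict_degree(stride)
    label = "conflict-free" if n == 1 else f"~{n}x serialized"
    print(f"stride {stride:2d} words -> {n:2d}-way ({label})")
```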
It's all a little over my head, but it is certainly fascinating! I love this corner of the forum for the knowledgeable people sharing their stuff.
Thanks @name99 and @crazy dave
Actually, maybe iStat Menus needs an update; it is way lower than I expected.
That makes sense to me, as the GPU should more or less scale okay in power. Apple didn't increase GPU clock speed in the M3 generation and so had some silicon node advantage to play with in the M4 compared to GPU clocks in the M2.
Just for comparison, my 4070 Ti Super uses about 50 W sitting on the desktop and peaks at over 200 W running Cinebench 2024.