Did they add HWRT support finally? Their developer forum says no, but that was a few weeks ago, and Nanite wasn't mentioned then either.
Not sure.
Since in Blender's aggregate score the Max matches the regular 4070, I'm assuming the 4070 Super pulls ahead in the other scenes? Is it pulling ahead in the smaller or larger scenes I wonder ... it's such a pity they only report the aggregate scores on Open Data. Also, I'd love to see the results for something like the Moana scene ...

M4 Max 3s faster than desktop 4070 Super in Blender Classroom with Optix.
The die area devoted to the GPU is why I've sometimes seen people, only half-jokingly, refer to the Max and Ultra as a GPU with an integrated CPU.
The M3 Max has about 92 billion transistors and the GPU is roughly 36% of that, or about 33 billion transistors. The M4 Max is likely bigger still, so in terms of raw transistor count its GPU is probably quite a bit bigger than the 3090's (of course, how you use those transistors matters even more).
Annotated Apple M3 Processor Die Shots Bring Chip Designs to Life (www.tomshardware.com)
Don't get me wrong, it's still very impressive! The 3090 packs more SMs at similar/slightly higher clocks (and thus has more raw TFLOPs).
That's the thing: Apple's RAM upgrades are expensive (though better now that the base is 16GB), but VRAM in the PC world, especially beyond 24GB, is much, much worse. This is where Apple's approach really shines. When it comes to VRAM/bandwidth above 24GB, Apple is actually the cost-effective solution ... it feels weird to write that. Hopefully that results in price pressure on Nvidia in particular to bring its workstation GPU prices down or increase the max RAM on its consumer GPUs, but that may be unlikely given Nvidia's overall market dominance.
==============================
One thing I didn't make clear in my original post: even if you compare Apple with, say, the desktop 4070, whose stats are more comparable to the M3/M4 Max since it's on TSMC 4N rather than Samsung 8 ... why is Nvidia seemingly able to fit more shader cores/SMs per transistor than Apple, and how does that relate to TFLOPs and so forth? Nvidia fits 46 SMs into 36 billion transistors (a figure which also includes IO and tensor cores; I didn't include IO in the M3 Max estimate above, and Apple doesn't have tensor cores ... oh, and I also didn't count the transistors for the huge 48MB SLC cache, which the Apple GPU takes advantage of but which has no analog on the Nvidia GPU), compared to 40 Cores in 33 billion transistors for the M3 Max (and presumably more transistors for the M4 Max).
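As a quick back-of-envelope (a rough sketch only; these transistor totals are approximate and, as noted, not exactly comparable):

```python
# Rough transistors-per-SM vs transistors-per-core, using the approximate
# figures quoted above. Both totals include things that aren't shader
# hardware, so treat this as an order-of-magnitude comparison only.
ad104_transistors = 36e9       # desktop 4070 (AD104), ~36 billion
ad104_sms = 46                 # SMs enabled on the 4070

m3_max_gpu_transistors = 33e9  # ~36% of the M3 Max's ~92 billion
m3_max_cores = 40

print(f"~{ad104_transistors / ad104_sms / 1e6:.0f}M transistors per Nvidia SM")
print(f"~{m3_max_gpu_transistors / m3_max_cores / 1e6:.0f}M transistors per Apple core")
```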
I'm glad you asked!
So Nvidia in the last few generations reports 128 FP32 shader cores per SM, but, unlike Apple, Nvidia actually has 2 FP32 pipes per individual compute subunit, one of which can double as an INT32 pipe. Apple does not do this. So when Apple reports 128 FP32 shader units per Core, that comes alongside 128 lanes of each of the other pipes as well (integer, FP16, complex, etc ...). This means Apple's Cores are going to be bigger than Nvidia's SMs, even though the SMs also include 4 tensor cores. Further, because of the way the pipes work, Nvidia GPUs are not always able to take advantage of the extra FP32 pipe - for example, if they have integer work to do or not enough parallel FP32 work within the subunit.
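To see how that cashes out in the headline numbers, here's a rough sketch of the peak-FP32 arithmetic; the clocks are estimates (Apple doesn't publish GPU clocks), so treat the absolute TFLOPs as illustrative:

```python
# Peak FP32 = lanes * 2 (an FMA counts as two FLOPs) * clock.
# Nvidia's headline number assumes both FP32 datapaths in every subunit
# are doing FP32 work; if one is busy with INT32, the FP32 peak drops.
# Clocks are rough estimates, not official figures.

def tflops(fp32_lanes, clock_ghz):
    return fp32_lanes * 2 * clock_ghz / 1e3

print(f"4070, all lanes on FP32:  {tflops(46 * 128, 2.48):.1f} TFLOPs")
print(f"4070, INT32-heavy case:   {tflops(46 * 64, 2.48):.1f} TFLOPs")

# M3 Max: 40 cores * 128 FP32 lanes; integer/FP16/complex work runs in
# separate pipes, so it doesn't eat into the FP32 peak (assumed ~1.4 GHz).
print(f"M3 Max, estimated:        {tflops(40 * 128, 1.4):.1f} TFLOPs")
```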
Side note: the AMD chips are, as far as I can tell, worse in this respect: RDNA 3 reports "TFLOPs" that are only achievable if there is instruction-level parallelism (as on a CPU thread). Compare calculating the TFLOPs per shader core for the linked RDNA 3 GPU vs the 4070 above.
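Here's that comparison worked out numerically. I'm using the RX 7800 XT purely as an example RDNA 3 part (the GPU linked in the post isn't specified here), and all figures are approximate:

```python
# RDNA 3 "dual-issue" TFLOPs caveat, worked through with rough numbers.
# The RX 7800 XT is used only as an example RDNA 3 part.

def per_shader_gflops(tflops, shaders):
    return tflops * 1e3 / shaders

# RX 7800 XT: ~3840 shaders, ~2.43 GHz boost.
rdna3_peak = 3840 * 2 * 2 * 2.43 / 1e3  # extra x2 only with dual-issue (needs ILP)
rdna3_base = 3840 * 2 * 2.43 / 1e3      # without any dual-issue

# Desktop 4070: ~5888 shaders, ~2.48 GHz boost; its peak doesn't require
# ILP within a thread, just enough warps in flight.
ada_peak = 5888 * 2 * 2.48 / 1e3

print(f"RDNA 3 marketed peak: {rdna3_peak:.1f} TFLOPs, "
      f"{per_shader_gflops(rdna3_peak, 3840):.1f} GFLOPs per shader")
print(f"RDNA 3 without ILP:   {rdna3_base:.1f} TFLOPs")
print(f"4070 peak:            {ada_peak:.1f} TFLOPs, "
      f"{per_shader_gflops(ada_peak, 5888):.1f} GFLOPs per shader")
```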
In addition, while Nvidia had much more L1/L2 cache than Apple's GPUs in the M2 generation, Apple dramatically overhauled its cache configuration in the M3, and I'm no longer as certain how much cache Apple has per core or for the GPU in total. I suspect it has gone up by quite a bit.
Of course, there is also Apple's (ImgTech's) TBDR, which helps when rendering complex scenes where lots of objects occlude each other, speeding up the rasterization portion of the process. TBDR is not always a net win, but when it is, it is.
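For anyone who hasn't seen why that helps, here's a deliberately tiny CPU-side sketch of the idea (not Apple's implementation, just the principle): an immediate-mode renderer shades fragments as they arrive and pass the depth test, while a TBDR-style renderer bins a tile's fragments, resolves visibility first, and shades only the front-most fragment per pixel.

```python
# Toy illustration of why TBDR reduces shading work in scenes with heavy
# occlusion: resolve visibility per tile first, then shade each pixel once.
import random

random.seed(0)
TILE_PIXELS = 32 * 32
OVERDRAW = 6  # six overlapping opaque surfaces per pixel, random depth order

# Each fragment: (pixel index, depth). Smaller depth = closer to camera.
fragments = [(p, random.random()) for p in range(TILE_PIXELS) for _ in range(OVERDRAW)]
random.shuffle(fragments)

# Immediate-mode style: depth-test and shade fragments as they arrive.
depth_buffer = {}
immediate_shades = 0
for pixel, depth in fragments:
    if depth < depth_buffer.get(pixel, float("inf")):
        depth_buffer[pixel] = depth
        immediate_shades += 1  # shaded now, possibly overwritten later

# TBDR style: bin all the tile's fragments, keep only the front-most
# fragment per pixel, then shade exactly once per covered pixel.
front_most = {}
for pixel, depth in fragments:
    if depth < front_most.get(pixel, float("inf")):
        front_most[pixel] = depth
tbdr_shades = len(front_most)

print(f"immediate-mode shading invocations: {immediate_shades}")
print(f"TBDR shading invocations:           {tbdr_shades}")
```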
Finally, as I mentioned above, Apple offers much better bandwidth/VRAM per TFLOP than Nvidia typically does on its consumer GPUs, and Nvidia's workstation GPUs are extremely expensive. So when memory bandwidth and footprint are at a premium during a workload, Apple GPUs will also punch above their weight.
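A rough illustration of that last point, with approximate public figures (the M3 Max TFLOPs value is an estimate since Apple doesn't publish clocks):

```python
# Rough bandwidth and memory capacity per TFLOP. All numbers approximate;
# the M3 Max FP32 figure is an estimate.
gpus = {
    #               (GB/s, GB of (V)RAM, FP32 TFLOPs)
    "M3 Max (40c)": (400, 128, 14.3),   # unified memory, top RAM config
    "RTX 4070":     (504, 12, 29.2),
}

for name, (bw, mem, tf) in gpus.items():
    print(f"{name:13s} {bw / tf:5.1f} GB/s per TFLOP, {mem / tf:5.2f} GB per TFLOP")
```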
Given the above, I hope it's a little clearer how Apple competes with Nvidia in these Blender benchmarks with GPUs that look smaller by SM/core count and run at slower clocks, but have similar if not greater transistor counts. Not every workload is going to be this favorable for Apple, of course. But these are probably the workstation/"prosumer" workloads Apple had in mind when designing the Max.
There's even more going on here than that. I had a longer post of my own in the pipeline detailing the differences between Nvidia and Apple GPUs, but since you covered some of the rest of that, I'll just add on the extra details at the end here. There's much more going on than just the above, eg:
- Apple has distinct FP16 and FP32 pipes; nV does FP16 in the FP32 pipe. The Apple scheme uses more area but less power, since the FP16 pipe is optimized for FP16 only.
- Apple (as of M3) has a much more abstract resource allocation scheme for registers and Threadblock Scratchpad, which takes additional area but allows much more flexibility (ie much more efficient use of the rest of the GPU).
- nV allows for some degree of superscalarity (ie the core executes more than one instruction per cycle), but most of the work to ensure this happens in the compiler (see the sketch below). Apple gets to the same place, but much of the work to do this happens through the GPU core having multiple "mini-front-ends" feeding into a single set of execution pipelines. The nV scheme takes less area but is much less flexible.
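To give a concrete feel for the compiler's share of that work in the nV scheme, here's a toy (and very hand-wavy) bound: the superscalar speedup available in a straight-line instruction sequence is at most the number of ops divided by the length of its longest dependency chain, and it's the compiler's scheduling that exposes independent ops to the second pipe.

```python
# Toy ILP estimate: ops / critical-path-length bounds how much a
# dual-issue (superscalar) pipe can help. Purely illustrative; real
# GPU compilers and schedulers are far more involved.
def ilp(deps):
    """deps[i] = list of earlier instruction indices that instruction i reads."""
    depth = []
    for srcs in deps:
        depth.append(1 + max((depth[s] for s in srcs), default=0))
    return len(deps) / max(depth)

# a = x*y; b = z*w; c = a+b; d = c*2  -> two independent multiplies up front
independent_start = [[], [], [0, 1], [2]]
# a = x*y; b = a*z; c = b*w; d = c*2  -> one fully dependent chain
dependent_chain = [[], [0], [1], [2]]

print(f"ILP with independent work: {ilp(independent_start):.2f}")  # ~1.33
print(f"ILP with a serial chain:   {ilp(dependent_chain):.2f}")    # 1.00
```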
Ultimately the Apple GPU is evolving to something like the Apple CPU, in the sense that Apple is willing to spend area for the sake of saving energy.
Much of the *actual* way this is done is by boosting the "IPC" of the GPU, making it spend a lot less time waiting for something to happen.
GPUs have obviously always operated by swapping in one warp when another warp cannot execute, but this was always something of an aspiration, with about 50% or so activity; Apple does everything they can to ramp it to 100%, from providing more registers (so more warps can be active) to a very weird instruction flow (which ensures that the core rarely has to wait for an instruction cache miss) to the mini-front-ends (which, among other things, ensure the system doesn't waste a cycle or two pulling in a register that's not yet in the operand cache).
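A toy model of that latency-hiding point (nothing vendor-specific, just the arithmetic of keeping an issue slot busy): each warp does a short burst of compute and then stalls for a while, and utilization depends on how many warps are resident, which is exactly what more registers buy you.

```python
# Toy latency-hiding model: one issue slot, each warp runs COMPUTE cycles
# then stalls for LATENCY cycles (e.g. waiting on memory). Utilization is
# the fraction of cycles the issue slot is busy. Purely illustrative.
COMPUTE, LATENCY = 4, 20

def utilization(resident_warps, total_cycles=100_000):
    next_ready = [0] * resident_warps  # first cycle each warp can issue again
    busy = 0
    cycle = 0
    while cycle < total_cycles:
        ready = [w for w in range(resident_warps) if next_ready[w] <= cycle]
        if ready:
            w = ready[0]
            busy += COMPUTE
            cycle += COMPUTE
            next_ready[w] = cycle + LATENCY
        else:
            cycle = min(next_ready)  # idle until some warp's stall resolves
    return busy / cycle

for warps in (2, 4, 6, 8):
    print(f"{warps} resident warps -> ~{utilization(warps):.0%} pipe utilization")
```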
You can always find microbenchmarks that push a particular GPU design to 100% activity, and choosing the right such benchmark may make nV (or Apple, or AMD) look good. But on REAL GPU code, with all its real messiness, we see the Apple "IPC" advantage, and the lower activity factors for other GPUs.
There are similar techniques operating at higher levels. For example, the traditional scheme might be something like: one kernel executes, it finishes and its resources are deallocated, then the next kernel starts. As of M4, Apple overlaps these stages, so that as one kernel approaches its end, the resources for the next kernel begin being allocated and pulled into the appropriate cache level. Then the new kernel starts executing, and simultaneously the resources of the previous kernel are deallocated. So you're no longer wasting the 5% or so of time between kernels that was basically housekeeping; all of that is overlapped with real work.
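The arithmetic behind that is simple but worth spelling out; the ~5% housekeeping share below is just the ballpark from the description above, not a measurement:

```python
# Back-of-envelope for overlapping kernel setup/teardown with execution.
kernels = 100
kernel_time = 1.0      # arbitrary units of real work per kernel
housekeeping = 0.05    # allocate/deallocate time between kernels (~5% ballpark)

serial = kernels * (kernel_time + housekeeping)
overlapped = kernels * kernel_time + housekeeping  # only the very first setup is exposed

print(f"serial:     {serial:.1f}")
print(f"overlapped: {overlapped:.2f}  (~{serial / overlapped - 1:.1%} more throughput)")
```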
Each of these things seems minor, 1% here, 2% there, but keep at them over a few years, and you land up with a 50% or so "IPC" advantage (ie you get ~50% more performance out of what appear to be essentially the same count of GPU cores).
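And the compounding arithmetic behind "1% here, 2% there": a run of small, multiplicative wins adds up faster than intuition suggests (the mix below is hypothetical, just to show the shape of it).

```python
# Small, independent "IPC" wins compound multiplicatively, not additively.
small_wins = [0.01] * 10 + [0.02] * 15  # hypothetical mix of 1% and 2% gains

total = 1.0
for g in small_wins:
    total *= 1 + g

print(f"{len(small_wins)} small wins compound to ~{total - 1:.0%}")
```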