I'm also really surprised by the dishonest and misleading desktop RTX 2070 comparison, since your posts tend to be technical and of high quality. If you want to compare the M1 GPU to the RTX 2070, then the 2070 Super Max-Q is a fairer choice, with a peak power of 80-90W for the entire card and a multiple of the M1 GPU's performance. Of course a significant portion of this power, possibly even half, goes to the 8GB of VRAM, its controllers and buses, and even more goes to power significant parts of the chip that support functionality that doesn't even exist on an M1 GPU. The 2070 is also built on an ancient 12nm process vs. the bleeding-edge 5nm of the M1.
It's a fair point, so let me explain why I chose that particular comparison. It was not my intention to be misleading or to downplay the Nvidia GPU; it is obviously going to be much faster in a real-world scenario (as I did point out in my post). I picked the RTX 2070 Super because it has 2560 ALUs (what Nvidia calls "CUDA cores"), exactly the same number as a hypothetical 20-core Apple GPU (with 4x32 ALUs per core). My intention was to look at different designs with comparable compute cluster configurations at different power consumption levels and in different frequency ranges. You are right that I should have picked the Max-Q laptop variant, since it has clocks that make it very comparable to Apple Silicon. Looking at these side by side (all of them have 2560 ALUs):
2070 Super (1770 MHz, 215 W, ~9 TFLOPS) = 23.8 W/TFLOP
2070 Super Max-Q (1290 MHz, 90 W, ~6.6 TFLOPS) = 13.6 W/TFLOP
Radeon Pro 5600M (1030 MHz, 50 W, ~5.3 TFLOPS) = 9.4 W/TFLOP
20-core Apple GPU (1270 MHz, 25 W + 10 W RAM, ~6.5 TFLOPS) = 5.4 W/TFLOP
To be more "fair" I have added 10 watts to our hypothetical Apple GPU to account for RAM power usage (this is likely a very conservative estimate, since M1 RAM draws less than 1 W in GPU-related benchmarks and less than 1.5 W in memory-intensive tests).
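For transparency, here is the arithmetic behind those numbers as a quick sketch: peak FLOPS is just ALUs x 2 FLOPs per clock (one FMA) x clock speed, and the 10 W RAM budget for the Apple part is my own estimate.

```python
# Quick sanity check of the peak FLOPS and W/TFLOP figures above.
# Assumes 2 FLOPs per ALU per clock (one fused multiply-add).
# The +10 W RAM budget for the hypothetical Apple GPU is a rough estimate.
ALUS = 2560

designs = {
    # name: (clock in GHz, power in W)
    "2070 Super":        (1.770, 215),
    "2070 Super Max-Q":  (1.290, 90),
    "Radeon Pro 5600M":  (1.030, 50),
    "20-core Apple GPU": (1.270, 25 + 10),  # +10 W assumed for RAM
}

for name, (ghz, watts) in designs.items():
    tflops = ALUS * 2 * ghz / 1000           # peak FP32 TFLOPS
    print(f"{name:<20} {tflops:.1f} TFLOPS, {watts / tflops:.1f} W/TFLOP")
```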
What we see here is that the Apple design is capable of reaching the same peak FLOPS at roughly 60% of Navi's power consumption, or 40% of Turing's, at comparable clock speeds. There is no doubt that the superior process node plays a role here, but I don't think it explains all of the difference. For instance, consumer Ampere is manufactured on a denser node (Samsung 8nm), yet it's only 20-30% faster than Turing at the same power usage, and that includes all the microarchitectural improvements such as improved superscalar execution. Even Navi at 7nm, with highly binned 5600M parts and very power-efficient HBM2, is only about 30% more efficient than the Max-Q Turing on its 12nm node. No, I think a substantial portion of the difference here is the GPU microarchitecture itself. It seems to me that Apple is using a very streamlined approach to GPU design, with their cores ending up smaller, simpler, and more efficient overall, even if they trade some functionality for it. I think Apple GPUs are really underestimated, as people traditionally tend to brush mobile GPUs aside as underpowered and boring.
Now, please keep in mind that we are looking at TDP vs. theoretical peak FLOPS! The real world is different, and that's where other factors come into play. Apple's ALUs are much simpler than the competition's, and Apple Silicon (at least what we've seen so far) doesn't have the bandwidth to hold its own against dGPUs, so dedicated GPUs with similar on-paper performance end up considerably faster in many synthetic compute benchmarks. I would really like to see what Apple Silicon can do with 300-400 GB/s of memory bandwidth available...
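To illustrate why bandwidth matters so much here, a standard roofline-style estimate: attainable throughput is the minimum of peak compute and bandwidth times arithmetic intensity. The 4 FLOPs/byte figure below is purely illustrative, not a measurement.

```python
# Roofline-style back-of-the-envelope: a kernel is capped either by peak
# compute or by memory bandwidth * arithmetic intensity (FLOPs per byte).
# The 4 FLOPs/byte value is illustrative, not measured.
def attainable_tflops(peak_tflops, bandwidth_gbs, flops_per_byte):
    return min(peak_tflops, bandwidth_gbs * flops_per_byte / 1000)

print(attainable_tflops(6.5, 60, 4))    # 20-core Apple GPU on M1-like bandwidth -> ~0.24
print(attainable_tflops(5.3, 400, 4))   # Radeon Pro 5600M with HBM2             -> ~1.6
print(attainable_tflops(6.5, 400, 4))   # same Apple GPU with dGPU-class memory  -> ~1.6
```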
On the other hand, Apple GPUs don't need nearly as much bandwidth for rasterization, so they really punch above their weight class in gaming benchmarks. The 2070 Super Max-Q scores around 43246 in Wild Life, while the M1 with 8 GPU cores scores around 16K if I remember correctly; our 20-core 25W GPU would translate to somewhere around a very respectable 35-40k.
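That 35-40k figure is just linear scaling by core count from the quoted M1 score, shaved down a bit to account for shared bandwidth; roughly:

```python
# Naive linear scaling of the Wild Life score with GPU core count.
# Real scaling would be somewhat worse once shared memory bandwidth
# becomes the bottleneck, hence the 35-40k estimate rather than 40k.
m1_score, m1_cores = 16_000, 8
print(m1_score * 20 / m1_cores)   # 20 cores -> 40000.0
```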
That's rather unrealistic, though, since a single 5600M comes with a dedicated 8GB of RAM at ~400 GB/s vs. ~60 GB/s for the whole M1 SoC. Even doubling the RAM channels and assigning the entirety of the SoC bandwidth to the GPU only brings it up to about a quarter of a single 5600M. Regardless of how fast each M1X GPU core is (and M1 GPU cores are really not very fast, just power efficient), it can't do much when data-starved.
True, but I think it's obvious that any high-end GPU system Apple ships will need to have appropriate memory bandwidth available to it. The rumored M1X should be fine with 2x the M1's bandwidth (or maybe even 1.5x if they use LPDDR5 instead). To play with the big boys Apple would need a wider memory interface. I am fairly confident that 200 GB/s should be enough to reach 5600M/2060 Max-Q rasterization performance levels thanks to the savings from TBDR. Compute might be a different story, but then again I have no idea how all these things interact in the real world. Synthetic benchmarks will obviously fly on a GPU with very fast dedicated RAM, but do they take the time needed to copy the data to the GPU into account? Similarly, for many productivity workloads the elimination of PCIe data transfers can more than offset the loss of some RAM bandwidth, especially in a heterogeneous processing environment (data moving between the CPU, GPU, and ML coprocessors).
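For reference, the 1.5x/2x/200 GB/s figures fall straight out of bus width times transfer rate. The M1 ships with a 128-bit LPDDR4X-4266 interface; the wider and LPDDR5 configurations below are hypothetical "M1X"-style setups.

```python
# Peak memory bandwidth = bus width in bytes * transfer rate in MT/s.
# M1 uses a 128-bit LPDDR4X-4266 interface; the other configurations
# are hypothetical wider/faster setups.
def bandwidth_gbs(bus_bits, mt_per_s):
    return bus_bits / 8 * mt_per_s / 1000

print(bandwidth_gbs(128, 4266))   # M1 today             -> ~68 GB/s
print(bandwidth_gbs(128, 6400))   # LPDDR5, same width   -> ~102 GB/s (~1.5x)
print(bandwidth_gbs(256, 4266))   # doubled LPDDR4X bus  -> ~137 GB/s (~2x)
print(bandwidth_gbs(256, 6400))   # 256-bit LPDDR5       -> ~205 GB/s (the ~200 GB/s case)
```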