Is there something credible out there to support this? Genuinely curious. I'd love to see Apple put out such awesome GPU power, but I have my doubts that they can run circles around Nvidia the way they have around Intel.
You mean other than benchmarks and GPU specs? Not much. My post was based on available graphical benchmarks, in particular the new 3DMark Wild Life, which runs natively on both M1 machines and Windows PCs. The M1 scores around 5000 points in that benchmark, while a desktop RTX 3060 scores about 18000. Quadruple the GPU cores, assume linear performance scaling (as GPUs usually do, especially when they are this power efficient), and that's roughly what you get.
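As a back-of-the-envelope check of that extrapolation (just a sketch, using the approximate scores quoted above rather than fresh runs):

```swift
// Rough version of the scaling argument, using the approximate Wild Life scores above.
let m1Score = 5_000.0          // 3DMark Wild Life, 8-core M1 GPU (approximate)
let rtx3060Score = 18_000.0    // 3DMark Wild Life, desktop RTX 3060 (approximate)

// Quadrupling the M1's 8 GPU cores to 32, assuming perfectly linear scaling.
let projected32Core = m1Score * (32.0 / 8.0)
print(projected32Core, projected32Core / rtx3060Score)   // 20000.0, ~1.1x the RTX 3060 score
```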
What would be their fundamental design change that would allow such a step up?
No design change needed, you just use more GPU cores. Graphics (and compute) is a parallelizable task: you can shade more triangles and process more compute threads in parallel by using more cores (this is an oversimplification, but it gives you the idea).
A single M1-based G13 GPU core consists of four 32-wide FP32 ALUs, or 128 "compute units" per GPU core; the M1 has 8 such cores, for 1024 units in total. Apple's shader units are capable of a single floating-point or integer operation per clock. The GPU in the M1 runs at approximately 1.26 GHz, which gives it a peak compute throughput of 2.6 TFLOPS (1024 compute units * 1.26 GHz * 2, where the 2 accounts for fused multiply-add, a single operation that counts as two FLOPs).
To compare this with Nvidia Ampere: its "compute unit" (CUDA core) can do either an FP32+INT32 pair or two FP32 operations per clock. Take an RTX 3060, which has somewhere around 1920 of these units (it was never really clear to me), with a boost frequency of 1.77 GHz, and you get a peak throughput of 1920 * 2 (two FP32 per clock) * 1.77 GHz * 2 (again, fused multiply-add) ~ 13.5 TFLOPS, which is the number you can see in Nvidia's marketing material.
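Spelling that arithmetic out (a sketch using only the figures above; the 1920-unit count for the RTX 3060 is the rough estimate mentioned, not an official spec):

```swift
// Peak FP32 throughput = lanes * FP32 ops per lane per clock * clock (GHz) * 2 (FMA counts as two FLOPs).
func peakTFLOPS(lanes: Double, fp32OpsPerClock: Double, clockGHz: Double) -> Double {
    lanes * fp32OpsPerClock * clockGHz * 2.0 / 1_000.0   // GFLOPS -> TFLOPS
}

// M1 / G13: 8 GPU cores * 4 ALUs * 32 lanes = 1024 lanes, one FP32 op per lane per clock, ~1.26 GHz.
let m1Peak = peakTFLOPS(lanes: 8 * 4 * 32, fp32OpsPerClock: 1, clockGHz: 1.26)   // ~2.58

// RTX 3060 (Ampere), per the figures above: ~1920 units that can dual-issue FP32, ~1.77 GHz boost.
let rtxPeak = peakTFLOPS(lanes: 1920, fp32OpsPerClock: 2, clockGHz: 1.77)        // ~13.6

print(m1Peak, rtxPeak)
```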
Of course, these numbers should be taken with a big grain of salt, as they represent theoretical maximums that will never be reached under normal conditions. I did reach the declared 2.6 TFLOPS on the Apple GPU using a specially written compute shader that did nothing but a long chain of multiply-adds, but that's not what real-world code looks like.
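A microbenchmark along those lines can be sketched like this (this is not my exact shader; the kernel body, thread count, and iteration counts here are illustrative, and error handling is reduced to `try!`):

```swift
import Metal

// Each thread runs a long dependent chain of fused multiply-adds.
// Writing the result out prevents the compiler from removing the loop.
let source = """
#include <metal_stdlib>
using namespace metal;

kernel void fma_chain(device float *out [[buffer(0)]],
                      uint gid [[thread_position_in_grid]])
{
    float x = 1.0f + float(gid) * 1e-9f;
    const float a = 1.0000001f;
    const float b = 1e-7f;
    for (uint i = 0; i < 4096; ++i) {
        x = fma(x, a, b);
        x = fma(x, a, b);
        x = fma(x, a, b);
        x = fma(x, a, b);
    }
    out[gid] = x;
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: source, options: nil)
let pipeline = try! device.makeComputePipelineState(function: library.makeFunction(name: "fma_chain")!)
let queue = device.makeCommandQueue()!

let threadCount = 1 << 20
let buffer = device.makeBuffer(length: threadCount * MemoryLayout<Float>.stride,
                               options: .storageModeShared)!

let commands = queue.makeCommandBuffer()!
let encoder = commands.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(buffer, offset: 0, index: 0)
encoder.dispatchThreads(MTLSize(width: threadCount, height: 1, depth: 1),
                        threadsPerThreadgroup: MTLSize(width: 256, height: 1, depth: 1))
encoder.endEncoding()
commands.commit()
commands.waitUntilCompleted()

// 4096 iterations * 4 FMAs * 2 FLOPs per FMA, per thread.
let flops = Double(threadCount) * 4096 * 4 * 2
let seconds = commands.gpuEndTime - commands.gpuStartTime
print("\(flops / seconds / 1e12) TFLOPS")
```

With nothing but FMAs in flight and enough threads to hide latency, the reported number approaches the theoretical peak, which is exactly why it says little about real workloads.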
One thing about Apple GPUs is that they are just incredibly power efficient. The M1 (total power consumption including memory, around 12 watts) needs just 4.6 watts per TFLOPS, while the RTX 3060 needs about 13.5 watts (this number will be similar for all Ampere GPUs). So a hypothetical 32-core Apple GPU using G13 technology would be a ~10 TFLOPS part that uses just around 40-45 watts of power.
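In numbers (again just a sketch; the power figures are the approximate ones above, and how much of the M1's 12 W goes to memory versus the GPU cores is an assumption):

```swift
// Perf-per-watt figures quoted above.
let m1Watts = 12.0                          // whole M1 package, including memory (approximate)
let m1TFLOPS = 2.6
let m1WattsPerTFLOPS = m1Watts / m1TFLOPS   // ~4.6 W per TFLOPS
let ampereWattsPerTFLOPS = 13.5             // the figure quoted above for the RTX 3060 / Ampere

// Hypothetical 32-core G13 GPU: 4x the M1's 8 GPU cores, assuming linear scaling.
let scaledTFLOPS = m1TFLOPS * 4.0                  // ~10.4 TFLOPS
let scaledWatts = scaledTFLOPS * m1WattsPerTFLOPS  // ~48 W if the full 12 W scaled with the cores;
                                                   // the 40-45 W estimate assumes part of that 12 W
                                                   // is memory/uncore rather than GPU cores
print(scaledTFLOPS, scaledWatts)
```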