It’s hard to compare directly, so I extrapolated from Geekbench compute benchmarks. On Metal the M1 is a bit slower than an AMD Radeon RX 560. The Vulkan benchmark puts the RX 560 at roughly the level of the NVIDIA GeForce GTX 1050 Ti, at about 20,500, while the NVIDIA GeForce RTX 3090 is about 137,000. That is roughly a 6.5 to 7 times performance gap, so being conservative I rounded up to an 8x improvement needed for the ASi GPU to match the RTX 3090. Since the M1 has an 8-core GPU, that works out to 64 M1 GPU cores.
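Here’s that back-of-envelope arithmetic spelled out (the scores and the 8-core figure are just the numbers quoted above, not measured results):

```python
# Rough extrapolation from the Geekbench compute scores quoted above.
gtx_1050_ti = 20_500   # ~ RX 560 ~ M1 (8-core GPU) on this metric
rtx_3090 = 137_000

ratio = rtx_3090 / gtx_1050_ti   # ~6.7x raw gap
scaling = 8                      # rounded up conservatively to 8x
m1_cores_needed = 8 * scaling    # M1 ships with an 8-core GPU

print(f"{ratio:.1f}x gap, assume {scaling}x -> {m1_cores_needed} M1 GPU cores")
# 6.7x gap, assume 8x -> 64 M1 GPU cores
```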
It is interesting that from an FP32 perspective more cores are needed for the ASi GPU to match the 3090. But as you say, TFLOPS really aren’t reliable. I trust Geekbench a bit, but until real-world tests can be performed we won’t really know. Maybe it is just that Metal is more performant than Vulkan.
Might also depend on the size of the data sets. I don't know how large the datasets Geekbench works with are - whether it's fairly small chunks of data sent to the GPU and then massively manipulated. Apple's TBDR GPU has quite a substantial tile cache that could perhaps be speeding up the compute tasks if it can keep more of the data in cache than the other GPUs can. Main memory access will then be slower on the M1, but if the other GPUs need to fetch more of the data from GDDR while the M1 serves it from L2, that's a substantial difference.
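To put a shape on that, here's a toy model (all bandwidth numbers below are made up purely for illustration, not measured figures for any of these GPUs): when most of the working set stays in on-chip cache, the effective bandwidth the benchmark sees can be several times what the external memory alone would provide.

```python
# Toy model: effective bandwidth as a mix of on-chip cache hits and DRAM
# accesses. Illustrative numbers only - not real figures for the M1 or others.
def effective_bandwidth(hit_fraction, cache_gbps, dram_gbps):
    # Each byte is served either from cache or from DRAM, at its own rate.
    return 1.0 / (hit_fraction / cache_gbps + (1.0 - hit_fraction) / dram_gbps)

# Small working set that mostly fits on-chip vs. one that spills to DRAM.
print(effective_bandwidth(0.9, cache_gbps=400, dram_gbps=70))  # ~272 GB/s
print(effective_bandwidth(0.1, cache_gbps=400, dram_gbps=70))  # ~76 GB/s
```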
It's also worth noting that, generally speaking, TBDR GPUs will perform better with fragment shaders and all sorts of screen-space effects, and worse with geometry. I don't know the precise characteristics of the M1 relative to AMD/Nvidia GPUs, and I know that since Maxwell Nvidia has used some form of "tiling" in their GPUs even though they are still considered immediate-mode renderers, but the different GPUs will have different strengths and weaknesses.
Geekbench may also use a variety of precision levels - FP16, FP32, FP64, INT8 - and GPU performance will scale differently with different data types. Some GPUs, at least historically, have been artificially handicapped in FP64 to push those who need good FP64 performance towards Quadro/FirePro/Tesla cards.
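Just to illustrate how big those artificial ratios can be (the FP32 figure and the ratios below are hypothetical/typical values, not the specs of any particular card):

```python
# How the same GPU looks at different precisions. Consumer cards have often
# run FP64 at a tiny fraction of the FP32 rate, while compute-oriented cards
# keep a much higher ratio. Ratios here are typical values, not exact specs.
def fp64_tflops(fp32_tflops, fp64_ratio):
    return fp32_tflops * fp64_ratio

fp32 = 35.0                        # hypothetical high-end consumer card, TFLOPS
print(fp64_tflops(fp32, 1 / 32))   # ~1.1 TFLOPS - handicapped consumer ratio
print(fp64_tflops(fp32, 1 / 2))    # ~17.5 TFLOPS - compute-card ratio
```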
In short, I'm echoing the sentiment that FP32 peak performance isn't everything. Geekbench is one valid testing solution, but it won't show the whole picture and it only tests GPGPU. We also don't know how well the architecture will scale with more GPU cores - GPUs generally scale well, being massively parallel, but we don't know how Apple is going to build the larger-GPU chips, or whether they will hit a point of memory starvation. TBDRs are generally more efficient with respect to memory bandwidth, but the more you scale up, the more bandwidth is needed.

So really, time will tell, and we will have more data points to extrapolate from if/when we get a 16-core and/or 32-core (or other) GPU. I think seeing the scaling from the M1 to whatever options we get will be particularly fascinating, especially with respect to memory starvation - finding out whether those chips will do something special to try to prevent it, or whether it will even be a problem for the architecture at that scale at all.