I'm not really sure how a 16/32-core GPU that shares its memory with the system is going to "blow past" something like a mid-range Nvidia card with 1,500 cores and memory that is about 4x as fast as the memory in the M1 (14,000 MT/s vs. 4,000 MT/s for the M1's low-power RAM).
For the M1 and on-board graphics, the biggest bottleneck is absolutely going to be the "unified memory architecture", which is great marketing but it's just "shared memory", which is much slower than the GDDR6 in a modern GPU. That's a BIG bottleneck for just about every GPU application.
You are focusing too much on the fact that the memory is shared, when you should be focusing on the memory implementation instead. Quad-channel LPDDR5, for example, would yield in excess of 200 GB/s. That is still only about half the bandwidth of the faster desktop GPUs, but then again Apple GPUs use a different approach to rendering (tile-based deferred rendering) that requires less bandwidth.
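To put rough numbers on the bandwidth argument, here is a minimal sketch of the usual peak-bandwidth arithmetic (transfer rate times bus width). The specific configurations are my assumptions, not figures from the posts above: LPDDR4X-4266 on a 128-bit bus for the M1, LPDDR5-6400 on a 256-bit bus (four 64-bit channels) for the hypothetical wider design, and 14,000 MT/s GDDR6 on a 256-bit bus for a faster desktop card.

```python
def peak_bandwidth_gb_s(mega_transfers_per_sec: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s: transfers per second x bytes per transfer."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_sec * 1e6 * bytes_per_transfer / 1e9

# Assumed configurations (illustrative, not taken from the thread):
m1 = peak_bandwidth_gb_s(4266, 128)           # LPDDR4X-4266, 128-bit bus
wide_lpddr5 = peak_bandwidth_gb_s(6400, 256)  # LPDDR5-6400, four 64-bit channels
gddr6 = peak_bandwidth_gb_s(14000, 256)       # 14,000 MT/s GDDR6, 256-bit bus

for name, value in (("M1", m1), ("wide LPDDR5", wide_lpddr5), ("GDDR6", gddr6)):
    print(f"{name}: {value:.1f} GB/s")
```

Under those assumptions the M1 lands at roughly 68 GB/s, the wider LPDDR5 design at a little over 200 GB/s, and the GDDR6 card at about twice that. The point is that bus width matters as much as the per-pin transfer rate.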
For example, the M1 GPU, with its 10 W TDP and 68 GB/s, scores around 5000 in 3DMark Wild Life Extreme. A desktop 1650 (75+ W TDP, 130+ GB/s) scores 7500, and a mobile 1650 variant (50 W TDP) scores around 6900. That's a difference of around 30%, at 80% lower power consumption and half the bandwidth. You can also see it in real-world games (not that there are many right now, but the M1 manages a quite steady ~40 fps at 1920p high in Metro Exodus).
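The comparison above can be worked through as a quick sketch, using only the scores and TDP figures quoted in this post. The "75+" watt figure for the desktop card is taken at its lower bound, and the dictionary keys and function name are just labels I made up for illustration.

```python
# Figures as quoted in the post: (3DMark Wild Life Extreme score, TDP in watts).
# The "75+" W TDP of the desktop card is taken at its lower bound.
GPUS = {
    "M1": (5000, 10),
    "1650 desktop": (7500, 75),
    "1650 mobile": (6900, 50),
}

def compare_to_m1(name):
    """Return the M1's relative score, power saving and score-per-watt ratio vs. `name`."""
    m1_score, m1_watts = GPUS["M1"]
    score, watts = GPUS[name]
    return {
        "relative_score": m1_score / score,
        "power_saving": 1 - m1_watts / watts,
        "perf_per_watt_ratio": (m1_score / m1_watts) / (score / watts),
    }

desktop = compare_to_m1("1650 desktop")
mobile = compare_to_m1("1650 mobile")

for label, result in (("desktop", desktop), ("mobile", mobile)):
    print(f"M1 vs 1650 {label}: "
          f"{result['relative_score']:.0%} of the score, "
          f"{result['power_saving']:.0%} less power, "
          f"{result['perf_per_watt_ratio']:.1f}x the score per watt")
```

With these inputs the M1 reaches about two thirds of the desktop card's score and a bit over 70% of the mobile card's, while drawing 80% less power or better. TDP is only a rough proxy for actual power draw, so treat the per-watt ratio as a ballpark.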
(Nvidia's high-end GPUs have over 10,000 cores)
Because everyone counts cores in a different way. Nvidia counts individual ALUs, while Apple counts actual cores (as in, the minimal organisational unit in a GPU cluster). Counting things Nvidia's way, the Apple M1 has 1024 cores (each Apple GPU core contains 4x32 FP32 ALUs). Counting things Apple's way, something like the Nvidia RTX 3060 (GA106) is either a three-core GPU (if you count GPCs, with 40x32 FP32 ALUs each), a 15-core GPU (if you count TPCs, with 8x32 FP32 ALUs each) or a 30-core GPU (if you count SMs, with 4x32 FP32 ALUs each; I was never quite certain how Nvidia splits those internally).
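The counting argument is easy to check with a few lines of arithmetic. This is only a sketch: it assumes the 8-core M1 GPU, takes the GPC figure above (3 GPCs with 40x32 FP32 ALUs each) as the baseline for the full GA106 die, and derives the per-TPC and per-SM numbers by division rather than from a spec sheet. All variable names are made up for illustration.

```python
LANES = 32  # FP32 ALUs per SIMD group, the "x32" in the figures above

# Apple M1, counted Nvidia's way: 8 GPU cores (assumed) x 4 SIMD groups x 32 lanes.
m1_alus = 8 * 4 * LANES

# Full GA106 die, from the GPC figure: 3 GPCs x 40 SIMD groups x 32 lanes.
ga106_alus = 3 * 40 * LANES

# The same ALU total split over each candidate definition of a "core".
units = {"GPC": 3, "TPC": 15, "SM": 30}
alus_per_unit = {name: ga106_alus // count for name, count in units.items()}
simd_groups_per_unit = {name: alus // LANES for name, alus in alus_per_unit.items()}

# Counted Apple's way, with an Apple-sized core of 4 x 32 ALUs.
ga106_apple_style_cores = ga106_alus // (4 * LANES)

print("M1 ALUs:", m1_alus)
print("GA106 ALUs:", ga106_alus)
print("ALUs per unit:", alus_per_unit)
print("SIMD groups per unit:", simd_groups_per_unit)
print("GA106 in Apple-sized cores:", ga106_apple_style_cores)
```

The takeaway is that "core" counts only compare meaningfully once you settle on the same unit. In ALU terms, the full GA106 die is a bit under four times the size of the M1 GPU.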
But those never came close to what was available on the desktop. If the next version of the M1 GPU is twice as fast as the current one, then it's still slower than a current-generation mid-range desktop GPU.
As I have shown above, the M1 is already around 70% as fast as current entry-level desktop GPUs. Quadruple the number of processing units and give them more memory bandwidth, and you easily get RTX 3060 performance, at the power consumption of a 1650 Ti Max-Q.