The 3080 has 30 TFLOPS and the M1 Ultra has 20 TFLOPS, so the 3080 has 50% more raw compute. Whether the rest of the gap comes from software maturity, hardware maturity, or some other factor, we don't know. What's quite interesting, though, is that if the 24-core M1 Max scaled linearly to 64 cores, we would expect 712/24*64 = ~1900 points, and multiplying that by 1.5 (the TFLOPS advantage of the 3080) yields ~2850, which is oddly close to the CUDA score (2900) that the 3080 gets. Sure, it's just some napkin arithmetic, but the sheer coincidence of it all almost suggests that the Metal Blender backend has some massive thread scaling issues.
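For what it's worth, here is the same napkin math written out as a quick Python sketch. The variable and function names are mine, and the inputs are just the rough figures quoted above, not measurements. It assumes perfectly linear scaling with core count, which no real GPU delivers.

```python
# Rough figures quoted in the comment above (not measured here).
TFLOPS_3080 = 30.0
TFLOPS_M1_ULTRA = 20.0
SCORE_M1_MAX_24_CORE = 712   # Blender benchmark score of the 24-core M1 Max
CORES_M1_MAX = 24
CORES_M1_ULTRA = 64
SCORE_3080_CUDA = 2900


def linear_scaled_score(score, cores_from, cores_to):
    """Extrapolate a score assuming it scales linearly with GPU core count."""
    return score / cores_from * cores_to


# Raw compute advantage of the 3080 over the M1 Ultra (1.5x, i.e. 50% more).
tflops_ratio = TFLOPS_3080 / TFLOPS_M1_ULTRA

# Hypothetical 64-core score if the 24-core result scaled perfectly.
ultra_linear = linear_scaled_score(SCORE_M1_MAX_24_CORE, CORES_M1_MAX, CORES_M1_ULTRA)

# Scale that by the TFLOPS ratio to compare against the 3080's CUDA score.
projected = ultra_linear * tflops_ratio

print(f"Linear 64-core estimate: {ultra_linear:.0f}")
print(f"Scaled by TFLOPS ratio:  {projected:.0f}")
print(f"3080 CUDA score:         {SCORE_3080_CUDA}")
```

The projection lands within about 2% of the 3080's score, which is the coincidence in question. It proves nothing by itself, since both the linear-scaling assumption and the TFLOPS numbers are very rough.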
P.S. Your wattage estimates for the M1 in Blender seem quite off. I am not even breaking 30 W running the Blender benchmark on my 32-core Max...