I bet if Tom used a 100% stress situation the difference would be even more massive (encoding/rendering) - exactly what the Mac Pro is supposed to be for...
i don't think that's true.. there are (comparatively) humongous limitations which happen well before a gpu can fire all cores at max speed for rendering.. as well as limitations which arise downstream once the data has been through the gpu.
if we follow the logic that raw specs (number of processors, clock speed, wattage, and whatnot) translate directly into performance--
a d700 would be roughly 50x faster for perfectly parallelized tasks than a 12-core cpu @ 3GHz.
(2048 cores × 0.85GHz ≈ 1,741GHz --vs-- 12 cores × 3GHz = 36GHz.. about a 48x difference)
but in reality, the speedup you'll likely see in a well-coded application is more like 10-20x, not 50x.
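a quick back-of-the-envelope in python.. the "parallel fraction" figures below are illustrative guesses, not measurements of any real app:

```python
# spec-sheet ratio vs. what amdahl's law allows once part of the job is serial

gpu_cores, gpu_clock_ghz = 2048, 0.85   # FirePro D700
cpu_cores, cpu_clock_ghz = 12, 3.0      # 12-core cpu

naive_ratio = (gpu_cores * gpu_clock_ghz) / (cpu_cores * cpu_clock_ghz)
print(f"spec-sheet ratio: {naive_ratio:.0f}x")      # ~48x

def amdahl_speedup(parallel_fraction, workers):
    """upper bound on speedup when only part of the job parallelizes."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)

for p in (0.90, 0.95, 0.99):
    print(f"{p:.0%} parallel: at best {amdahl_speedup(p, 2048):.0f}x")
# 90% -> ~10x, 95% -> ~20x, 99% -> ~95x
```

a job that's 90-95% parallelizable lands right in that 10-20x range no matter how many gpu cores you throw at it.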
i think one problem is delivering usable data to the gpu at anywhere near the rate it can devour it.. the amount of video ram is critical in this delivery, and we may well see vram sizes leapfrogging typical system ram in the next decade.
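for a rough feel of that feeding problem (python again, and all three figures are ballpark assumptions, not D700 measurements):

```python
# how long it takes to ship a chunk of data to the gpu vs. chew through it
# once it's there.  all three numbers below are rough assumptions.

chunk_gb      = 0.5     # hypothetical batch of work
pcie_gb_per_s = 12.0    # practical PCIe 3.0 x16 host-to-device bandwidth, roughly
gpu_gb_per_s  = 200.0   # rate the gpu could process data already sitting in vram

transfer_ms = chunk_gb / pcie_gb_per_s * 1000
compute_ms  = chunk_gb / gpu_gb_per_s * 1000

print(f"transfer: {transfer_ms:.1f} ms, compute: {compute_ms:.1f} ms")
# transfer: 41.7 ms, compute: 2.5 ms -- the gpu spends most of its time waiting
# unless the data already lives in vram, which is why vram capacity matters
```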
another bottleneck is that you can't simply divide any given problem into enough smaller pieces to saturate the gpu.. say you are building a house with 1000 bricks.. you can't distribute those 1000 bricks to 5000 workers.. further, just delivering 1000 individual bricks takes considerable time compared to delivering one or two (non-parallel) bricks..
and even further-- to build the house, you can't just place all 1000 bricks at once.. the second row can't happen until the first row has been laid.. if the bricks go 100 rows tall with 10 per row, the most efficient way to lay them would be something like 5 workers (processors) handling 2 bricks per row.. (rough numbers in the sketch below)
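here's a toy model of that wall in python.. the 0.3 "coordination cost" per worker per row is a made-up number purely to show the trade-off, not anything measured:

```python
from math import ceil

# toy model of the brick wall: 100 rows, 10 bricks per row, and a row can't
# start until the row below it is finished

rows, bricks_per_row = 100, 10
brick_time, handoff_cost = 1.0, 0.3

def wall_time(workers):
    # within a row the workers lay bricks in parallel, but every extra worker
    # adds a bit of hand-off/coordination overhead; rows themselves are serial
    per_row = ceil(bricks_per_row / workers) * brick_time + workers * handoff_cost
    return rows * per_row

for w in (1, 2, 5, 10, 100, 5000):
    print(f"{w:>5} workers: {wall_time(w):>8.0f} time units")
# 1 -> 1030, 2 -> 560, 5 -> 350, 10 -> 400, 100 -> 3100, 5000 -> 150100
# a handful of workers is the sweet spot; throwing 5000 at it makes things worse
```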
then even further still... say you solved all of the above and can now process 100,000 requests on 100,000 cores in the same amount of time it takes to process one request on one core.
all of that data still needs to be combined into a unified whole once it comes out of the gpu.. and it will likely need to be combined as A+B = C, then C+D = E, etc. instead of one (A+B+C+D+E) operation.. combining the data is a chain of calculations that have to happen in order rather than a single calculation.. that takes time, and in this scenario the single cpu core doing the combining would still be the holdup while the gpu piles up a backlog of data.
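in code, that serial combine looks something like this (python, with stand-in numbers for whatever the gpu actually spit out):

```python
from functools import reduce

# combining the gpu's partial results on one cpu core: a chain of dependent steps.
# `partials` is stand-in data; in real life it'd be whatever each gpu work-group produced
partials = list(range(100_000))

# A+B = C, then C+D = E, ... : 99,999 additions, each one waiting on the last
total = reduce(lambda running, nxt: running + nxt, partials)
print(total)
```

100,000 results means 99,999 dependent additions on that one core, no matter how fast the gpu produced them.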
anyway, there are definite improvements happening with parallelizing large tasks.. new algorithms.. advanced mathematics.. next-level stuff.. but still, we're not at the point where you can just read a spec sheet and expect it to run X amount faster/slower than some other gpu.. 100MHz faster? cool.. but can the process actually make use of that?