Why do you continue to focus on "clock-for-clock"? (Other than maybe that's what you read in ATI press releases.)
And "theoretical TFLOPs" is nearly as useless a metric. If the GPU's internal datapaths and schedulers can't hit the theoretical number, it's just a fraud.
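For reference, "theoretical TFLOPs" is purely a paper calculation: shader cores × clock × 2 FLOPs per core per cycle (one fused multiply-add per cycle, which real workloads never sustain). A minimal sketch with a hypothetical 3840-core part, not any specific product:

```python
def theoretical_fp32_tflops(cores, clock_ghz, flops_per_cycle=2):
    """Peak FP32 throughput on paper: assumes every core retires
    one FMA (2 FLOPs) every single cycle -- the best case that
    datapaths and schedulers rarely let real code reach."""
    return cores * clock_ghz * flops_per_cycle / 1000.0

# Hypothetical 3840-core chip at 1.5 GHz (illustrative numbers):
peak = theoretical_fp32_tflops(3840, 1.5)
print(f"{peak:.2f} theoretical TFLOPs")  # 11.52 theoretical TFLOPs
```

The gap between this paper number and a measured benchmark score is exactly what the argument above is about.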
For example, say I have a GPU with a 1 GHz clock that delivers 5 FP32 TFLOPs in real benchmarks.
And say I have another GPU with a 2 GHz clock that delivers 7 TFLOPs in the same real benchmarks.
Will you say that the 7 TFLOP GPU is inferior to the 5 TFLOP GPU because it doesn't look as good on the ridiculous "clock for clock" metric?
I think most people would prefer the 7 TFLOP GPU over the 5 TFLOP GPU. (Especially if the 5 TFLOP GPU needs 100 W more for poorer performance - looking at you, Vega.)
It appears to me that you have zero clue what clock-for-clock, core-for-core means...
It basically means that, in FP32, a 3840-CUDA-core, 1.1 GHz GP100 chip is 38% faster than a 3840-core, 1.1 GHz GP102 chip.
It basically means that the normalized performance of a 3840-CUDA-core, 1.1 GHz GV100 chip is 7% higher than that of a GP100 chip.
Core-for-core, clock-for-clock has nothing to do with TFLOPs numbers. I'm dumbfounded that you had no idea about this. This is what defines "IPC" performance.
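Put another way, a clock-for-clock, core-for-core comparison normalizes a measured score by core count and clock so that only the architecture's per-core, per-cycle throughput remains. A minimal sketch; the core counts and clocks mirror the GP100/GP102 example above, but the benchmark scores are placeholders chosen to illustrate the 38% figure, not real data:

```python
def perf_per_core_clock(score, cores, clock_ghz):
    """Normalize a measured benchmark score to per-core,
    per-GHz throughput -- the "IPC" style metric."""
    return score / (cores * clock_ghz)

# Two chips with identical core count and clock; any remaining gap
# is purely architectural (scores are assumed, for illustration).
chip_a = perf_per_core_clock(138.0, 3840, 1.1)
chip_b = perf_per_core_clock(100.0, 3840, 1.1)
print(f"chip A is {chip_a / chip_b - 1:.0%} faster per core per clock")  # 38%
```

Because cores and clock are equal here, the normalization cancels out and the ratio of raw scores is the clock-for-clock result; the metric matters when comparing chips whose core counts or clocks differ.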
Even if a GPU has 7 TFLOPs it can still be slower, because the performance of a GPU is defined by two things: core throughput and software performance.
Code for the Kepler architecture was able to extract about 70% of its maximum potential; code for Maxwell/consumer Pascal, 75-80%. Those architectures were pretty straightforward: low IPC, high clock speeds.
For example, code for GCN1 extracted around 30% of its maximum potential. Code for GCN4, even though the architecture has not changed a bit from a compute perspective, extracts around 60%.
In the GP100-vs-GV100 IPC comparison, the code is the same for both architectures. Knowing that, and remembering that the code should still be slightly tweaked for GV100 because of its different scheduling and core layout, the fact that GV100 still maintained 7% higher IPC than GP100 is very meaningful, and very telling.
In essence: the 5 TFLOP GPU can still be faster than the 7 TFLOP GPU. And this has happened in the past: the GTX 980 was faster than the GTX 780 Ti in both compute and games because of its new, higher-throughput core layout.
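The arithmetic behind that claim follows directly from the extraction percentages above: delivered throughput is paper peak times the fraction of peak that real code can actually extract. A sketch with illustrative efficiencies in the spirit of the GCN/Maxwell figures (assumed numbers, not measurements of any real chip):

```python
def effective_tflops(theoretical, efficiency):
    """Throughput actually delivered: paper peak multiplied by the
    fraction of that peak real code extracts from the architecture."""
    return theoretical * efficiency

# 5 TFLOP chip whose architecture lets code extract 80% of peak,
# vs. a 7 TFLOP chip where code only extracts 55% (assumed values):
high_ipc = effective_tflops(5.0, 0.80)
low_ipc = effective_tflops(7.0, 0.55)
print(f"{high_ipc:.2f} vs {low_ipc:.2f} effective TFLOPs")  # 4.00 vs 3.85
```

With those assumed efficiencies, the "weaker" 5 TFLOP part ends up ahead, which is the GTX 980 vs GTX 780 Ti situation in miniature.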
And it smashes the GeForce Titan XP in other applications...
Games are a bit of an anomaly due to Nvidia GameWorks; in real-world compute, Vega hits Nvidia pretty hard.
You talk about "real apps" and then use games for reference...
This forum, if you hang around for a while, will tell you that the only way to judge a GPU's usefulness is by looking at game benchmarks. Not compute, not editing, not anything meaningful, just games. That's how professionals judge GPUs these days.
I suppose we are on a pro-gamer forum.