It is because AMD GCN is a compute-oriented architecture, whereas Nvidia's CUDA architecture is geometry-throughput oriented - at least it was pre-GP100/GV100.
If we talk about code execution, we have to get back to basics. AMD GCN has a shorter execution pipeline, which does not allow the architecture to clock very high, but it is able to do more work per clock cycle, because it executes 64-wide wavefronts, whereas each warp in the CUDA architecture is 32 threads wide.
Each CU in GCN is made from four 16-wide SIMDs, for a total of 64 cores per CU.
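To make that width argument concrete, here is a minimal sketch (my own illustration, not vendor code) of how one 64-wide wavefront maps onto a CU's 16-wide SIMDs:

```python
# One GCN wavefront is 64 work-items; each SIMD in a CU is 16 lanes wide,
# so a wavefront issues over 4 consecutive cycles on a single SIMD.

WAVEFRONT_SIZE = 64   # work-items per GCN wavefront
SIMD_WIDTH = 16       # lanes per GCN SIMD
SIMDS_PER_CU = 4      # SIMDs per Compute Unit

cycles_per_wavefront = WAVEFRONT_SIZE // SIMD_WIDTH   # 64 / 16 = 4 cycles
cores_per_cu = SIMD_WIDTH * SIMDS_PER_CU              # 16 * 4 = 64 "cores"

print(f"One wavefront occupies one SIMD for {cycles_per_wavefront} cycles")
print(f"Each CU exposes {cores_per_cu} cores")
```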
Pre-Vega GCN GPUs were never graphics-throughput masters - this is why the architecture shows weak efficiency in games, and much stronger compute throughput - great efficiency in compute.
Both architectures work with a 256 KB Register File Size (RFS), but differ in the number of cores that have access to those 256 KB.
Nvidia Kepler had 192 cores/256 KB RFS.
Nvidia Maxwell had 128 cores/256 KB RFS.
Nvidia consumer Pascal had 128 cores/256 KB RFS.
GP100 has 64 cores/256 KB RFS.
GV100 has 64 cores/256 KB RFS.
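Divide the register file by the cores sharing it and the "starved for resources" argument falls out directly - a rough calculation using the figures quoted above, nothing more:

```python
# Register file bytes available per core, per the core counts listed above.
RFS_KB = 256
cores_per_rfs = {
    "Kepler SMX": 192,
    "Maxwell SMM": 128,
    "Consumer Pascal SM": 128,
    "GP100 SM": 64,
    "GV100 SM": 64,
    "GCN CU (all generations)": 64,
}

for name, cores in cores_per_rfs.items():
    print(f"{name}: {RFS_KB / cores:.2f} KB of register file per core")
# Kepler ends up with ~1.33 KB/core, Maxwell/Pascal with 2 KB,
# and GP100/GV100/GCN with 4 KB per core.
```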
Why have we seen an increase in both gaming and compute performance, per clock and per core, with Maxwell versus Kepler? Because the cores in Maxwell were less starved for resources.
Why haven't we seen an increase in performance per clock and per core with consumer Pascal GPUs? Because it is essentially the Maxwell architecture on a 16 nm process, with higher core clocks.
Why do we see an increase in performance per clock and per core with GP100? Because those cores are less starved for resources compared to consumer Pascal.
On a side note, Nvidia will reuse this layout (actually, they already did with GV100) in consumer GPUs, so also expect a per-clock and per-core increase in performance, and in efficiency.
How did it look with GCN?
GCN1 - 64 cores/256 KB RFS
GCN2 - 64 cores/256 KB RFS
GCN3 - 64 cores/256 KB RFS
GCN4 - 64 cores/256 KB RFS
Compute performance was always dead flat compared to previous generations of GCN, per clock and per core.
What AMD always lacked was software optimization, and its GPUs were underutilized, on the compute side too. If you optimize your code and work boundaries to avoid conflicts on an architecture with a 32-thread warp and 256 KB RFS per 128 cores, and then run it on an architecture with a 64-wide wavefront and 256 KB RFS per 64 cores, this is what you get: underutilization of the latter architecture. It can be partially blamed on lazy developers, partially not, because it actually requires quite a lot of code optimization to get things done properly on GCN. But if you get it right, expect up to two times higher performance on, for example, a Hawaii GPU (R9 390X, 2816 GCN cores) compared to a GTX 980 Ti (2816 CUDA cores). This is why I have called on you, professionals, many, many times on this forum: demand optimization of software for all IEMs.
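A hedged sketch of that underutilization argument - the padding-to-whole-wavefronts rule is standard warp/wavefront behavior, the function itself is just my illustration:

```python
# A work-group size tuned for 32-thread warps can leave half of a 64-wide
# GCN wavefront idle, because dispatches are padded up to whole wavefronts.
import math

def lane_utilization(work_group_size: int, exec_width: int) -> float:
    """Fraction of hardware lanes doing useful work for one work-group."""
    wavefronts = math.ceil(work_group_size / exec_width)
    return work_group_size / (wavefronts * exec_width)

for wg in (32, 48, 64, 96, 128):
    print(f"group={wg:3d}  warp32: {lane_utilization(wg, 32):.0%}  "
          f"wave64: {lane_utilization(wg, 64):.0%}")
# group=32 -> 100% on a 32-wide warp, but only 50% on a 64-wide wavefront.
```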
This is the compute side.
What AMD GCN always lacked is geometry throughput, hence its bad efficiency in games. GCN was able to register 1 triangle per clock per Shader Engine. If you had a 1 GHz GPU, this meant it was able to register 1 billion triangles per second, per Shader Engine.
Nvidia, on the other hand, was able with Kepler to register the same number of triangles, but per SM (compute cluster). So not only were Nvidia GPUs able to clock higher, they also had more SMs in their high-level layout than GCN had Shader Engines.
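Back-of-the-envelope setup rates from those claims (1 triangle per clock per Shader Engine for GCN, roughly the same per SM for Kepler) - the unit counts and clocks here are illustrative assumptions, not measured values:

```python
# Triangle setup rate = units * triangles-per-clock * clock.
def tris_per_second(units: int, clock_ghz: float, tris_per_clock: float = 1.0) -> float:
    return units * tris_per_clock * clock_ghz * 1e9

gcn_rate = tris_per_second(units=4, clock_ghz=1.0)      # 4 Shader Engines @ 1 GHz
kepler_rate = tris_per_second(units=15, clock_ghz=1.0)  # e.g. a 15-SMX Kepler @ ~1 GHz

print(f"GCN:    {gcn_rate / 1e9:.0f} billion triangles/s")
print(f"Kepler: {kepler_rate / 1e9:.0f} billion triangles/s")
```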
With Maxwell, however, Nvidia added Tile-Based Rasterization, which resulted in massively increased efficiency. Tile-Based Rasterization removes the need to move data from memory to the GPU every time it has to be refreshed, because the needed data is kept in the ROP caches and the L2 cache. So all of the important graphics parts in the execution pipeline (PolyMorph Engine, Pixel Engines, etc.) were connected directly to the L2 cache, not the memory controller; the L2 cache was then connected to the memory controller.
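A rough, assumption-heavy estimate of what keeping ROP traffic on-chip can save - resolution, overdraw, and frame rate below are my assumed numbers, only the shape of the comparison matters:

```python
# Immediate-mode pipelines pay DRAM bandwidth for every framebuffer write;
# a tiled pipeline keeps intermediate writes on-chip and flushes each tile once.
WIDTH, HEIGHT = 1920, 1080
BYTES_PER_PIXEL = 4          # 32-bit color
OVERDRAW = 3                 # assumed average writes per pixel
FPS = 60

pixels = WIDTH * HEIGHT
immediate = pixels * BYTES_PER_PIXEL * OVERDRAW * FPS   # every write hits DRAM
tiled = pixels * BYTES_PER_PIXEL * FPS                  # one flush per pixel

print(f"Immediate-mode color traffic: {immediate / 1e9:.1f} GB/s")
print(f"Tiled color traffic:          {tiled / 1e9:.1f} GB/s")
```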
In GCN, on the other hand, the L2 cache and the other important parts of the graphics pipeline (Pixel Engine, Geometry Engine, etc.) were always connected to the memory controller rather than to the L2 cache. This increased inefficiency, because with unoptimized software GCN always ended up with pipeline stalls and underutilization of the Compute Units. To be fair, Asynchronous Compute is actually one way to deal with this, but it is like trying to fix the sinking Titanic with a bandage.
GCN4 resolved quite a lot of GCN's problems with geometry throughput, but it had a few design flaws (an unbalanced design in the first place). In the end, Polaris is exactly on par in geometry throughput with consumer Pascal and Maxwell GPUs, but it outpaces them in compute throughput per clock and per core. My company did a little experiment.
We tried to downclock a GTX 1050 Ti with 768 CUDA cores to 855 MHz (we were actually only able to get it down to 900 MHz, but close enough), the same core clock the Radeon Pro 555 has with the same core count (768 GCN4 cores at 855 MHz). The end result? In Nvidia-optimized games we saw a 3 FPS advantage for the Nvidia GPU. The AMD GPU was still faster than Nvidia in compute.
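Peak FP32 math for that experiment, assuming the usual 2 FP32 ops per core per clock (FMA) - with core counts matched and clocks nearly matched, peak FLOPS are almost identical, so any remaining gap has to come from the rest of the pipeline, which is the point of the test:

```python
# Peak FP32 throughput: cores * 2 ops/clock (FMA) * clock.
def peak_tflops(cores: int, clock_ghz: float) -> float:
    return cores * 2 * clock_ghz / 1000.0

print(f"GTX 1050 Ti @ 900 MHz:    {peak_tflops(768, 0.900):.2f} TFLOPS")
print(f"Radeon Pro 555 @ 855 MHz: {peak_tflops(768, 0.855):.2f} TFLOPS")
```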
As for the Vega GPU: it has a new, longer execution pipeline, which is what allows it to clock up to 1.6 GHz. It has Tile-Based Rasterization, which should increase the efficiency of the memory bandwidth and reduce stalls in the pipeline (increased utilization). Vega has better load balancing: AMD dynamically underclocks or overclocks different Shader Engines to balance the load between short shaders and long shaders. A pretty neat feature, which helps balance the load even more. Also helping to balance the load and fully utilize the GPU is Infinity Fabric, which underneath everything connects every part of the GPU together. And on top of all that, GCN finally has its geometry features connected to the L2 cache, not the memory controller, which will help balance the load even more and should fix its efficiency.
AMD GCN, on the other hand, always had "some form" of Tile-Based Rasterization. Each Shader Engine worked on its own part of the display/screen. If you had, for example, 4 Shader Engines, the display was partitioned into 4 pieces, and each Shader Engine worked on its own piece. However, differences in complexity between the parts of the screen resulted in a smaller load on some Shader Engines and a bigger load on others, creating an unbalanced pipeline and lowering the performance of the GPU, because it created stalls. Previous generations of GCN, as I have said, were able to register only 1 triangle per clock.
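A toy model (purely my own illustration, with an assumed geometry split) of why that static screen partitioning stalls: a frame finishes only when the busiest Shader Engine finishes, so uneven geometry across the four regions wastes cycles on the others.

```python
# Assumed triangle counts per screen region, one region per Shader Engine.
per_region_triangles = [100_000, 250_000, 600_000, 50_000]

ideal = sum(per_region_triangles) / len(per_region_triangles)  # perfect balance
actual = max(per_region_triangles)                             # gated by busiest SE

print(f"Ideal per-SE load:  {ideal:,.0f} triangles")
print(f"Actual gating load: {actual:,.0f} triangles")
print(f"Utilization: {ideal / actual:.0%}")   # the rest is pipeline stall
```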
One last thing to put on top of the Vega architecture.
One of the ways AMD could increase the throughput of the GPU was by making a larger amount of L2 cache available to the cores. It would affect both compute and graphics.
Here is a per-clock compute comparison of Vega versus Fiji (Fury X):
And here is a comparison of Vega with the Fury X in the most AMD-friendly games:
Now, the question is: Is AMD incompetent?
Simplest answer - no.
Previously, there was a little argument on this forum between a few people on this topic. Some people believed that Vega outpaces the Titan Xp in compute just because it is a geometry-based benchmark (SpecPerf). If that were the case, then Vega should be smoking the Titan Xp in games. Wiping the floor with it. Flushing it down the toilet. And yet, in games, per clock, the architecture is slower than Fiji. At least that is how it appears at first glance.
What is the problem, then? Unfortunately, I do not have an answer. Most probably the software is not ready: both drivers and applications (applications have to be rewritten to use, for example, Primitive Shaders, which increase the geometry throughput of the GPU). I do not believe that AMD is incompetent. Vega has almost the same features as Pascal and Volta, but it will still have the 64-wide wavefront, so it will do (slightly...) more work per clock cycle, even now that Nvidia has finally matched their compute layout with Vega (64 cores/256 KB RFS).
As a side note: in a perfect world, in GCN-optimized games, Vega 10 should be around 40% faster than Fiji XT was, per clock, per core, in gaming scenarios. It has around 15% higher core throughput and a much, much more robust graphics pipeline, which should result in that level of performance. A second note: just increasing the clock from 1050 MHz on Fiji to 1.6 GHz on Vega, without the new features, should make it on par with the GTX 1080 Ti/Titan Xp. Something is bottlenecking the architecture. But on the hardware level there is absolutely nothing that could cause this. Absolutely NOTHING. It has the same features and the same graphics pipeline layout as Nvidia GPUs have. If Nvidia can have great gaming performance, Vega also should.
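Sanity-checking that clock argument with peak-FP32 numbers - core counts and clocks below are commonly quoted figures, used here only as assumptions:

```python
# Peak FP32 throughput: cores * 2 ops/clock (FMA) * clock.
def peak_tflops(cores: int, clock_ghz: float) -> float:
    return cores * 2 * clock_ghz / 1000.0

print(f"Fiji XT (4096 cores @ 1.05 GHz):     {peak_tflops(4096, 1.05):.1f} TFLOPS")
print(f"Vega 10 (4096 cores @ 1.60 GHz):     {peak_tflops(4096, 1.60):.1f} TFLOPS")
print(f"GTX 1080 Ti (3584 cores @ ~1.58 GHz): {peak_tflops(3584, 1.58):.1f} TFLOPS")
# Clock alone lifts Vega roughly 52% over Fiji and past the 1080 Ti's peak
# FP32, which is why the gaming results smell like a software bottleneck.
```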
If you have got to the end of this post, congratulations.