"I downloaded Luxmark v3.1 and completed the test. The result is 28985 (I even saw a bit over 30000 somewhere in the middle of the test), but I'm still not convinced about the CPU bottleneck."
It's up to you whether to believe it or not. But I correctly predicted that your card is normal and able to get ~27000 in Luxmark, which indicates that your GPU can actually perform normally on the cMP.
"I can't believe that only 1 core is used when, for example, playing games. Or that PCIe v2.1 is the bottleneck against PCIe v3, because I'm talking about more than half the speed lost: 3200 vs 7000 points in 3DMark Time Spy."
No matter how many cores the game / benchmark can use, there is always a main thread. And if the multi-threading optimisation isn't good enough, then that main thread will become the bottleneck.
You can run Unigine Heaven (Extreme preset). This runs the benchmark in windowed mode, so you can watch Activity Monitor at the same time. You should be able to see a process that stays at a bit more than 100%. That's the sign of CPU single thread limiting.
But if you do the same thing for Luxmark, you won't see that. The CPU load is very low and never reaches 100% during the test. That's why Luxmark is so effective at avoiding the CPU bottleneck in a GPU benchmark.
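If you prefer a script to staring at Activity Monitor, here is a rough sketch of the same check. It assumes you take the snapshot with `ps -A -o pid,%cpu,comm`; the function names `busy_processes` and `snapshot` are made up for this example. Note that `ps` does not average CPU usage the same way Activity Monitor does, so treat the numbers as a hint only.

```python
import subprocess


def busy_processes(ps_output, threshold=100.0):
    """Return (pid, cpu_percent, command) rows at or above the threshold.

    ps_output is the text printed by `ps -A -o pid,%cpu,comm`.
    """
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)  # the command itself may contain spaces
        if len(parts) < 3:
            continue
        pid, cpu, command = parts
        try:
            # Some locales print a decimal comma instead of a point.
            cpu_value = float(cpu.replace(",", "."))
        except ValueError:
            continue
        if cpu_value >= threshold:
            rows.append((int(pid), cpu_value, command.strip()))
    # Busiest process first.
    return sorted(rows, key=lambda row: row[1], reverse=True)


def snapshot():
    """Take one process snapshot (assumes a `ps` that accepts these columns)."""
    result = subprocess.run(
        ["ps", "-A", "-o", "pid,%cpu,comm"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Run `busy_processes(snapshot())` in a second terminal while Heaven is running in its window. A process sitting just above 100% is the pattern described above; with Luxmark the list should stay empty.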
"I think I can check the speed of the individual cores to see if any is saturated and compare."
You can try, but most likely it won't work. The system will try to balance the workload between cores (to spread the heat, and therefore reduce the chance of thermal throttling on a single core). e.g. For a process at ~110% CPU load, you may see ~9% load on each core, rather than 100% on one of the cores.
The CPU runs at around 3 billion cycles per second. Even if the process is CPU single thread limited, it can go like this...
Core 0 completes the 1st step of the computation in the first CPU cycle.
Core 1 uses the result from Core 0 to perform the 2nd step in the 2nd cycle.
Core 2 uses the result from Core 1 to perform the 3rd step in the 3rd cycle.
......
And after a second, each core has already performed around 300 million computations, but you only see ~9% workload on each core, and nothing is saturated. However, since each step must wait for the result of the previous step (which comes from another core), the process is still CPU single thread limited.
CPU single thread limiting means the process itself cannot utilise multiple cores effectively. It does not mean the CPU will only use one core to finish the whole process.
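To put numbers on the example above, here is a toy model (not a real scheduler). It assumes every step takes the same time and that the work hops to the next core after each step. In reality the OS moves threads around far less often than every cycle, but the averages come out the same. `simulate_chain` is just a name made up for this sketch.

```python
def simulate_chain(steps, cores, step_time=1.0):
    """Model a chain of dependent steps that hops from core to core.

    Returns (wall_time, per_core_load), where per_core_load is the
    fraction of the wall time each core spent busy.
    """
    busy = [0.0] * cores
    clock = 0.0
    for step in range(steps):
        core = step % cores      # the work moves on to the next core
        busy[core] += step_time  # only this one core is working right now
        clock += step_time       # the next step has to wait for this result
    return clock, [b / clock for b in busy]


wall_12, load_12 = simulate_chain(steps=1200, cores=12)
wall_1, load_1 = simulate_chain(steps=1200, cores=1)
print(wall_12 == wall_1)     # 12 cores finish no sooner than 1 core
print(round(load_12[0], 3))  # each core shows only about 8% load
print(round(sum(load_12), 3))  # yet the process totals about 100%
```

The wall time is identical with 1 core or 12, which is exactly what "single thread limited" means here, even though no individual core ever looks saturated.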
"Looking at the GFX benchmarks, it's like something is pulling the card back from time to time, and then it speeds up again to full speed for a second or two. e.g. The FPS is at ~30, suddenly falls to single digits, then goes up again, in a loop. I think I can check the speed of the individual cores to see if any is saturated and compare."
Unfortunately, it can also be because the CPU is too slow. As I said before, the GPU driver is also a CPU single thread process. If it isn't fast enough to feed / control the GPU within that 0.1 second, the FPS will drop. But you can't even see any CPU-limited process in Activity Monitor, because it only happens for a split second. The average CPU load over the whole second is still low.
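A quick bit of arithmetic shows why a split-second stall disappears in a per-second reading. The numbers are made up for illustration: ten 100 ms samples per second, one of them pegged at 100% and the rest idling at 5%. `windowed_load` is a hypothetical helper, not something from any monitoring tool.

```python
def windowed_load(samples, per_window):
    """Group load samples into windows and return (average, peak) per window."""
    windows = []
    for start in range(0, len(samples), per_window):
        chunk = samples[start:start + per_window]
        windows.append((sum(chunk) / len(chunk), max(chunk)))
    return windows


# Assumed data: 100 ms samples over two seconds, with one 100 ms stall
# in the first second where the driver thread is completely saturated.
samples = [100.0] + [5.0] * 9 + [5.0] * 10

for average, peak in windowed_load(samples, per_window=10):
    print(f"average {average:.1f}%  peak {peak:.1f}%")
```

The first second averages only 14.5% even though the thread was fully saturated for a tenth of it, and a monitor that refreshes once per second only ever shows you the average.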
"For comparison, my HD7970, which is a simpler card design without power management (i.e. power limits), sits at a constant 14~15 FPS without ups and downs on the same tests when power is requested, and it feels playable in comparison with the Vega. The HD7970 scores about 2100, so the Vega only scores 50% more, but I find it worse to watch the scenes."
In general, whenever you see a CPU load of more than 100%, that can be CPU single thread limiting, and the test is not a good GPU benchmark (at least not on that particular computer).
Then why can the HD7970 do better (more consistent)? TBH, I don't know, but it may be because the newer driver (for Vega) expects to run on a faster CPU, which our Xeon simply can't keep up with. The 7970 is a GPU of the Westmere generation, so its driver should be expected to run nicely on a Westmere CPU. Or maybe the HD7970 simply has better driver optimisation overall.
Anyway, Luxmark isn't good software for comparing performance between GPUs. e.g. The Vega 56 clearly isn't 10x faster than an HD7970. So don't compare the numbers between GPUs from different families. That's meaningless. However, it's a good tool to check whether the GPU itself can perform on a computer (when there is no other limitation). e.g. In your case, we know the Vega 56 is working because it can get ~28000.
"I'm still not convinced about the CPU bottleneck."
Another way to check if you are CPU single thread limited on those 3D benchmarks is to reduce the resolution.
e.g. If you get ~30FPS at 1920x1080, then reduce the resolution to 1280x720 and still get ~30FPS, then reduce further to 640x480 and still only get ~30FPS, then it's clearly not GPU limiting, but something else. [N.B. The FPS may improve a little bit because some objects may be missing at lower resolution, which lowers the CPU's workload. However, if the benchmark is GPU limiting, the FPS should increase a lot when you lower the resolution like that.]
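The resolution test can be written down as a simple rule of thumb. This is only a sketch: the 1.5x cut-off for "increases a lot" is my own assumption rather than an established figure, and `looks_gpu_limited` is a name made up for this example.

```python
def looks_gpu_limited(fps_by_resolution, min_gain=1.5):
    """Guess whether a benchmark is GPU limited from a resolution sweep.

    fps_by_resolution: list of ((width, height), fps) measurements.
    min_gain: assumed minimum FPS ratio (lowest vs highest resolution)
              that counts as "the FPS increased a lot".
    """
    # Order the runs from the most pixels to the fewest.
    ordered = sorted(
        fps_by_resolution,
        key=lambda item: item[0][0] * item[0][1],
        reverse=True,
    )
    fps_at_highest = ordered[0][1]
    fps_at_lowest = ordered[-1][1]
    return fps_at_lowest / fps_at_highest >= min_gain


# FPS barely moves as the pixel count drops: something other than the GPU.
flat = [((1920, 1080), 30), ((1280, 720), 31), ((640, 480), 33)]
# FPS climbs steeply as the pixel count drops: the GPU was the limit.
scaling = [((1920, 1080), 30), ((1280, 720), 58), ((640, 480), 120)]

print(looks_gpu_limited(flat))     # False
print(looks_gpu_limited(scaling))  # True
```

The small rise from 30 to 33 FPS in the first set is the kind of minor improvement mentioned in the N.B. above, so it stays below the cut-off.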
In Windows, you can use HWMonitor etc. to monitor the PCIe bus interface load. I am quite sure you won't see it stay at 100% during those benchmarks. You can give it a try.
If the problem is not coming from the GPU, nor from the PCIe bus interface, but the CPU load is above 100%, then I think we can safely assume the problem is coming from the CPU single thread performance.