I'm not sure what you are trying to say by stating it's "just marketing". 10nm and 7nm are hard physical measurements. You can't fudge your numbers to say 7 is equal 10. That's not how it works.
Unfortunately, and to our confusion, this is not the case.
The biggest cool factor in server chips is the nanometer. AMD beating Intel to a CPU built on a 7nm process node* – with 5nm and 3nm on the way […]
www.hpcwire.com
https://en.wikipedia.org/wiki/7_nm_process (check out the comparison tables)
And reality shows that linear scaling is not how it works. There is no assumption with regards to scaling. Twice the core count just doesn't equate to twice the performance. That only happens in an ideal model, and real world is not ideal.
The problem is not parallelism, but that there is no way to estimate the "weight" of a workload, so a linear distribution system just doesn't work. For instance, for a GPU, you can definitely segment rendering workload into sections of a frame and have them all execute in parallel, but certain sections will definitely render faster than others (say, if you're just rendering a background in one section versus in another section where you have to render models + shaders + background). So performance is more dependent on how efficient your scheduler is versus how many processing units you have. And you can't make a scheduler parallel.
Most of the time we are talking about distributing a workload of X items over N parallel processors. If we assume that the items are drawn from the same distribution, and there are no further bottlenecks, then by doubling the number of processors (while keeping the processors exactly the same) you can reasonably expect an amortized halving of execution time.

The work scheduler on GPUs is a batch system, similar in spirit to what you find on HPC clusters. For compute workloads on GPUs you explicitly state "I want this routine to run over an array of these 10000 items". Something like the Apple A12Z (8 cores) will then have to run ~1250 items per core, while an RTX 2080 (46 cores) only needs to do ~220 items per core. For graphical work, these items could be triangles and/or screen-space tiles (in the case of Apple's TBDR GPUs). E.g. a 4-core Apple GPU can shade 4 tiles in parallel while an 8-core Apple GPU can work on 8 tiles in parallel, etc. If you want to know more details about how work is scheduled on a modern Nvidia GPU, this document is gold.
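To make the amortized arithmetic explicit, here is a toy sketch in plain Swift. The core counts are just the figures quoted above (8 GPU cores for the A12Z, 46 SMs for the RTX 2080), not a model of either GPU:

```swift
// Amortized work distribution: with identical items and no other
// bottlenecks, execution time is roughly proportional to items per core.
let totalItems = 10_000

let gpus = [
    ("Apple A12Z", 8),   // 8 GPU cores
    ("RTX 2080", 46)     // 46 SMs
]

for (name, cores) in gpus {
    let itemsPerCore = (totalItems + cores - 1) / cores  // round up
    print("\(name): ~\(itemsPerCore) items per core")
}
```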
Of course, as you rightfully point out, the real world is much more complicated. You have latencies, shared limited resources (memory bandwidth, caches etc.), inter-core synchronization (in rare cases), setup overheads etc. But I think that linear scaling is a reasonable approximation if we can assume that the work is indeed compute-limited (and not, say, memory- or setup-limited), which is often the case with relatively low-powered GPUs. Again, when you look at benchmarks of iPhone vs iPad GPUs, you will observe scaling very close to linear. Studying scaling effects on other GPUs is more complicated since there are a lot of factors that change (bus width, memory bandwidth, clocks etc.), but still, something like a GTX 1660 Ti (24 cores) is approximately 50-60% faster than a GTX 1650 (16 cores), which is fairly close to linear scaling when you consider the other differences. Beyond those performance levels, I agree with you that you start noticing diminishing returns as setup overhead becomes more pronounced.
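Just to spell out why I call that "fairly close to linear" (the core counts are from the spec sheets, the 50-60% is the rough benchmark gap mentioned above):

```swift
// If performance scaled purely with core count (same clocks, same memory),
// the expected speedup would simply be the ratio of core counts.
let cores1660Ti = 24.0
let cores1650 = 16.0
let idealSpeedup = cores1660Ti / cores1650   // 1.5x
print("Ideal scaling: \(idealSpeedup)x, observed: roughly 1.5-1.6x")
```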
There's a reason why RTX 3080 with twice the core count can only achieve at most a 80% performance boost over 2080.
Well, one of the reasons is simply that the RTX 3080 does not actually have twice the core count. This is again Nvidia's marketing department talking. An RTX 3080 core is basically the same as an RTX 2080 core, it's just that one of its ALUs gained additional capabilities. More specifically, an execution unit in Turing contains a floating-point and an integer ALU. In Ampere, Nvidia has updated the integer unit to be capable of executing floating-point operations. So where Turing could do 16-wide float+int per clock per execution unit, Ampere can potentially do float+float or float+int. Since Nvidia counts "shader cores" as the theoretical peak of FP32 MADD throughput, they can claim a 100% increase in this metric over Turing. This would be similar to Intel claiming that Ice Lake doubles the number of cores compared to Coffee Lake (because, you know, AVX-512). Of course, we don't take this kind of crap from CPU makers, but we happily swallow the claims of GPU marketing divisions.
Put differently, Ampere increases instruction level parallelism (more work potentially done per core), but I was talking about thread level parallelism (increasing the number of cores that can execute threads).
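To illustrate the counting trick, here is a hedged sketch of the issue model described above. The 16-wide figures are the ones from this post; the 30% integer share is a made-up number purely for illustration, not a measured workload:

```swift
// Per clock, per execution unit (as described above):
//   Turing: 16-wide FP32 ALU + 16-wide INT32 ALU
//   Ampere: 16-wide FP32 ALU + 16-wide ALU that can do FP32 *or* INT32
// Marketing counts peak FP32 lanes, which is why the "core" count doubles
// on paper even though the number of execution units does not.

func effectiveFP32Lanes(ampere: Bool, intFraction: Double) -> Double {
    let dedicated = 16.0
    // The second ALU only contributes FP32 throughput on Ampere, and only
    // for the cycles it is not occupied by integer work.
    let flexible = ampere ? 16.0 * (1.0 - intFraction) : 0.0
    return dedicated + flexible
}

let intFraction = 0.3  // hypothetical: 30% of the instruction stream is integer math
let turing = effectiveFP32Lanes(ampere: false, intFraction: intFraction)
let ampere = effectiveFP32Lanes(ampere: true, intFraction: intFraction)
print("Turing: \(turing) FP32 lanes per unit, Ampere: \(ampere)")
print("Actual gain: \(ampere / turing)x vs. the marketed 2x")
```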
Apple's Metal API is nowhere near the maturity needed for efficiency and performance as Vulkan or OpenGL, for instance. The most one can say about Metal is... well, it gives Apple more control, and it requires less effort from developers, but that's about it.
Could you elaborate on how you arrive at this conclusion? Metal is a fairly simple API and it brings all (or at least most of) the tools one needs. What else do you need for efficiency and performance?
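For reference, this is roughly what a complete "run this kernel over 10000 items" program looks like in Metal. It's a minimal sketch: the kernel is a trivial placeholder compiled at runtime, and error handling is reduced to try!/force-unwraps:

```swift
import Metal

// Trivial placeholder kernel, compiled at runtime to keep the sketch self-contained.
let source = """
#include <metal_stdlib>
using namespace metal;
kernel void scale(device float *data [[buffer(0)]],
                  uint id [[thread_position_in_grid]]) {
    data[id] *= 2.0;
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: source, options: nil)
let pipeline = try! device.makeComputePipelineState(function: library.makeFunction(name: "scale")!)

let itemCount = 10_000
let buffer = device.makeBuffer(length: itemCount * MemoryLayout<Float>.stride,
                               options: .storageModeShared)!

let queue = device.makeCommandQueue()!
let commandBuffer = queue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(buffer, offset: 0, index: 0)

// "Run this routine over 10000 items": the driver splits the grid across the GPU cores.
let threadsPerGroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)
encoder.dispatchThreads(MTLSize(width: itemCount, height: 1, depth: 1),
                        threadsPerThreadgroup: threadsPerGroup)
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
```

That's the whole compute path: device, pipeline, buffer, encode, dispatch. I have a hard time seeing what maturity is missing there for everyday work.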
I am also surprised that you mention OpenGL and "efficiency" in the same sentence. OpenGL is just a confusing mess of obsolete abstractions at this point. Yes, you can write reasonably performant OpenGL code by leveraging extensions and whatnot, but you are way better off just using Vulkan or DX12.