Apple is also in a unique position to pull this off, because their TBDR architecture allows them to focus on async compute, where forward rendering GPUs have other concerns like locality and pixel ordering. Of course, Apple is also currently held back by the mobile heritage of their GPUs (relatively small register file, tiny L2, limited coordination between individual cores). There is still a lot of work to be done.
I'm not sure these SPECIFIC complaints hold water.
As I understand the numbers, on a per-core basis (the only counting that makes sense; if you give a BS marketing "make the number as big as possible" response, I'll immediately tune you out...):
- Apple's visible register space is up to 128 4B registers (or 256 2B registers) per thread. I think nV and AMD are the same. So up to 512B of register space per lane, or 16KB per 32-wide SIMDgroup.
- a core is the equivalent of a current (not past) nV SM, and is split into four essentially independent quadrants.
- register storage is 384KB per core, i.e. 96KB per quadrant.
That's enough for 6 "full" threadblocks per quadrant (each using the entire register set) or up to 24 threadblocks (if each uses only a quarter of the maximum register space it's entitled to).
I believe these numbers are essentially 1.5x the equivalents on the nV side.
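To make the arithmetic above concrete, here's the same occupancy math as a quick sketch. The 384KB/core and four-quadrant figures are as stated above; the 32-lane SIMD width is my assumption:

```python
# Back-of-the-envelope register-file occupancy for one Apple GPU core,
# using the figures quoted above. A 32-lane SIMD width is assumed.
REG_BYTES_PER_LANE = 128 * 4                 # 128 4B registers -> 512 B/lane
SIMD_WIDTH = 32
BYTES_PER_SIMDGROUP = REG_BYTES_PER_LANE * SIMD_WIDTH   # 16 KB

CORE_REGISTER_FILE = 384 * 1024              # 384 KB per core
QUADRANT_FILE = CORE_REGISTER_FILE // 4      # 96 KB per quadrant

# "Full" blocks use all 128 registers; the alternative uses a quarter of that.
full_blocks = QUADRANT_FILE // BYTES_PER_SIMDGROUP
quarter_blocks = QUADRANT_FILE // (BYTES_PER_SIMDGROUP // 4)

print(full_blocks, quarter_blocks)           # 6 full, or 24 at quarter usage
```

Which reproduces the 6-or-24 figure per quadrant.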
Apple also have small L1 and local (threadgroup) address space compared to nV. It's unclear how much of an issue this is: only about 3% of nV apps are constrained by local address space (as opposed to by register count).
Apple have very clever (and basically unknown) functionality to identify data streams relevant to each GPU client and optimize their use in cache, from L1 to L2 to SLC. That probably gives them better effective cache performance than nV and AMD. (A dumb naive cache is not especially great for GPUs, because all the data of any real interest consists of large streams...)
I'm not sure that the "limited coordination between individual cores" matters much. What I have SEEN in this space is that most of the nV work in this area is hacks around their performance issues, followed by devs who don't understand Apple Silicon assuming those same hacks are necessary on Apple. I've seen ZERO comparisons of the issue in terms of real (important) algorithms, written by people who know what they are doing and have genuinely optimized for each platform, comparing and contrasting.
Where Apple is lacking compared to nV and AMD (i.e. genuine, not BS, complaints):
- no support for 64b FP (even just at the MSL level, even if poorly performing) [STEP 1]
- no hardware support for 64b integers and FP (the integer support could be added at fairly low cost, even the multiplies if we allow them to take a few cycles; the FP support could be added as slow multicycle in the FP32 unit, or as something like a single FP64 unit shared by sixteen lanes). Along with this, decent "complete" support for 64b atomics. [STEP 2]
- coalescing of both divergent control and divergent memory access. nV has support for the control side (kinda, in a lame fashion, and for ray tracing only?), nothing as far as I know for divergent memory access. I'd hope an Apple solution would be better engineered, both at the HW level (supporting both) and at the MSL level (letting a kernel indicate that coalescing by the hardware is allowed; so not just stochastic processes like ray tracing, but also things like fancy multipole PDE solvers, where there's no natural spatial structure to the problem anyway, so letting the hardware reorder work is no big deal).
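On [STEP 1], software FP64 on FP32-only hardware is a well-known trick: keep a value as an unevaluated sum of two floats, built from Knuth's error-free "two-sum". A minimal Python sketch (the helper names are mine; FP32 rounding is simulated via struct, since Python floats are doubles):

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE binary32 value."""
    return struct.unpack('f', struct.pack('f', x))[0]

def two_sum(a, b):
    """Knuth's error-free addition: returns (s, err) with s + err == a + b
    exactly, using only FP32 operations. This is the building block of
    float-float ("double-single") arithmetic for emulating FP64 on
    FP32-only hardware."""
    s = f32(a + b)
    bb = f32(s - a)
    err = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, err

# 2**-30 is below the ulp of 1.0 in FP32, so a plain FP32 add drops it...
assert f32(1.0 + 2**-30) == 1.0
# ...but the (hi, lo) pair preserves it exactly.
hi, lo = two_sum(1.0, 2**-30)
assert hi + lo == 1.0 + 2**-30
```

This is the sort of thing an "even if poorly performing" MSL-level FP64 could be built on, before any hardware support arrives.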
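And the payoff of control-flow coalescing is easy to model: if every divergent path taken within a SIMDgroup serializes, then grouping (sorting) work items by branch target before dispatch cuts the cost. A toy Python model; the cost model and names are mine, not any vendor's hardware:

```python
import random

SIMD_WIDTH = 32

def simd_cost(branch_ids, width=SIMD_WIDTH):
    """Toy divergence model: each 32-wide group pays one pass per distinct
    branch target taken by its lanes (divergent paths serialize)."""
    return sum(len(set(branch_ids[i:i + width]))
               for i in range(0, len(branch_ids), width))

random.seed(0)
work = [random.randrange(4) for _ in range(1024)]  # 4 divergent shader paths

naive = simd_cost(work)              # dispatch in arrival order
coalesced = simd_cost(sorted(work))  # group items taking the same path

assert coalesced <= naive
# Sorted, at most 3 of the 32 groups straddle a path boundary: cost is at
# most 35 passes, vs. up to 128 for fully mixed dispatch.
assert coalesced <= 35
```

The hardware version of this is harder (you can't just sort), but the model shows why stochastic workloads like ray tracing, and anything else without a fixed spatial ordering, benefit so much from it.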
These three would at least start Apple down the path of being a reasonable throughput engine for a variety of compute tasks, not just the most obvious GPU/ray tracing/AI work that gets all the attention today. My guess is they will come; everything just takes time. (And there's the constraint that they can't evolve Metal too fast until they're willing to drop support for AMD and Intel iGPUs... Until then, they mostly have to keep Metal functionality, and thus visible Apple Silicon functionality, kinda in sync across all three sets of devices.)