and if you give a BS marketing "make the number as big as possible" response, I'll immediately tune you out...
Why would I do that? I am interested in the objective truth, not some internet dick measuring contest. If I am wrong, that would be an opportunity to learn something, which would be great.
- Apple's visible register space is up to 128 4B registers (or 256 2B registers) per thread; I think nV and AMD are the same. So up to 512B of register space per lane, or 16KB per 32-wide SIMD.
- a core is the equivalent of a current (not past) nV SM, and is split into four essentially independent quadrants.
- the register file is 384KB per core, i.e. 96KB per quadrant.
Which is 6 "full" threadblocks per quadrant (each using its entire 16KB of registers) or up to 24 threadblocks (if each uses only a quarter of the maximum register space it is entitled to); the arithmetic is spelled out below.
I believe these numbers are essentially 1.5x the equivalents on the nV side.
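For concreteness, here is the occupancy arithmetic behind those numbers, written as compile-time checks (my own worked example in C++, using the 384KB estimate; the smaller estimates discussed below shrink everything proportionally):
Code:
// per-quadrant occupancy arithmetic for the figures above
constexpr unsigned bytes_per_thread = 128 * 4;               // 512 B max
constexpr unsigned bytes_per_simd   = 32 * bytes_per_thread; // 16 KB
constexpr unsigned quadrant_bytes   = 384 * 1024 / 4;        // 96 KB
static_assert(quadrant_bytes / bytes_per_simd == 6,
              "6 resident SIMDs at full register use");
static_assert(quadrant_bytes / (bytes_per_simd / 4) == 24,
              "24 resident SIMDs at quarter register use");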
I suppose you are looking at Philip Turner's numbers? I'd be curious how he arrives at them.
Rosenzweig estimates the register file to be 208KB per core (52KB per partition), so that's the minimum (and the number I personally consider more realistic). Then there is the argument about 24 threadgroups running in parallel on M1, but the details of this were never clear to me. I think this is definitely an area that needs more investigation...
At any rate, the estimates we have right now range from 52KB per partition (12KB less than Nvidia's 64KB) to 96KB per partition (1.5x more than Nvidia). This is different from what I remembered, so I stand corrected.
I'm not sure that the "limited coordination between individual cores" matters much. What I have SEEN in this space is that most of the nV work in this area consists of hacks around their performance issues, followed by devs who don't understand Apple Silicon assuming those same hacks are necessary on Apple. I've seen ZERO comparisons of the issue in terms of real (important) algorithms, written by people who know what they are doing and have genuinely optimized for each platform, comparing and contrasting.
I fully agree that it's a niche requirement. But I think this shortcoming of Apple GPUs is relevant in the context of the current discussion, as it makes them less interesting to advanced GPU programmers. Which in turn has the effect that certain technical discussions and advances focus only on Nvidia GPUs.
Where Apple is lacking compared to nV and AMD (i.e. genuine, not BS, complaints):
- no support for 64b FP (even if only in MSL as a software fallback, and even if poorly performing) [STEP 1]
- no hardware support for 64b integers and FP (the integer support could be added at fairly low cost, even the multiplies if we allow them to take a few cycles; the FP support could be added as a slow multicycle path in the FP32 unit, or as something like a single FP64 unit shared by sixteen lanes). Along with this, decent "complete" support for 64b atomics; a sketch of the CAS-loop emulation you're otherwise stuck with follows this list. [STEP 2]
- coalescing of both divergent control and divergent memory access. nV has support for the control side (kinda, in a lame fashion, and for ray tracing only?), and nothing as far as I know for divergent memory access.
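To make the atomics point concrete, this is the kind of CAS loop that missing native atomics forces on you today. A minimal MSL sketch (my illustration, with float add standing in), and note you can't even build 64b atomics this way when the widest compare-exchange available is 32 bits:
Code:
#include <metal_stdlib>
using namespace metal;

// Emulate an atomic float add with a 32-bit compare-exchange loop.
void atomic_add_float(device atomic_uint *addr, float val)
{
    uint expected = atomic_load_explicit(addr, memory_order_relaxed);
    uint desired;
    do {
        desired = as_type<uint>(as_type<float>(expected) + val);
    } while (!atomic_compare_exchange_weak_explicit(
                 addr, &expected, desired,
                 memory_order_relaxed, memory_order_relaxed));
}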
I would add support for texture atomics (currently emulated as a fairly convoluted combination of device memory operations and some SIMD black magic; I would be curious to know how NV does it). A sketch of the core of that emulation is below. On the topic of divergent memory access, if I remember correctly Apple mentioned a technique in a WWDC session last year, something about using SIMD intrinsics to reorder threads based on RT hits, but I can't remember the details.
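Here is that emulation reduced to its core idea: back the image with a plain device buffer and do the atomics there. A hedged MSL sketch (the linear layout and `width` are my assumptions; the SIMD trickery that makes the real thing convoluted, e.g. combining lanes that hit the same texel before touching device memory, is omitted):
Code:
#include <metal_stdlib>
using namespace metal;

// "Texture atomic add" on an R32Uint image that aliases a linear buffer.
kernel void splat(device atomic_uint *image  [[buffer(0)]],
                  constant uint      &width  [[buffer(1)]],
                  uint2 gid [[thread_position_in_grid]])
{
    atomic_fetch_add_explicit(&image[gid.y * width + gid.x], 1u,
                              memory_order_relaxed);
}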
One thing that I am not sure about is FP64 support (and 64-bit support in general). It seems like it would introduce a lot of additional complexity to support a very niche case. There is a good reason why others do it with an anaemic auxiliary processing unit (which Apple doesn't have). What I would find more interesting is the addition of assist instructions that make it cheaper to perform 64-bit operations using 32-bit hardware. E.g. a 64-bit integer addition on my M1 Max compiles to
Code:
// 64-bit int summands are in r0_r1 and r2_r3, respectively
iadd r2.cache, r0.cache, r2.discard           // low words: r2 = r0 + r2
iadd r1.cache, r1.discard, r3.discard         // high words: r1 = r1 + r3
icmpsel ult, r0l.cache, r2, r0.discard, 1, 0  // carry: r0l = (r2 < r0) ? 1 : 0
iadd r0, r0l.discard, r1.discard              // high result: r0 = r1 + carry
which I think is a neat way to do it (also, Apple's cool icmpsel instruction really comes in handy in all kinds of contexts). They could accelerate it further by implementing add-with-carry: still strictly 32-bit math, but 2x faster. And I suppose something similar could be done for 64-bit FP: maybe some instructions to help align mantissas etc. that would bring the number of required instructions down to around 8.
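For anyone who doesn't read Apple asm, the same four instructions written out as MSL-flavored C (my sketch of the semantics, not compiler source):
Code:
// 64-bit add from 32-bit halves; .x = low word, .y = high word
static uint2 add_u64(uint2 a, uint2 b)
{
    uint lo    = a.x + b.x;
    uint carry = (lo < a.x) ? 1u : 0u;  // the icmpsel ult step
    uint hi    = a.y + b.y + carry;     // the two remaining iadds
    return uint2(lo, hi);
}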
I'd hope an Apple solution would be better engineered, both at the HW level (supporting both divergent control and divergent memory access) and at the MSL level, with a way to indicate that a particular kernel is one where coalescing by the hardware is allowed (so not just stochastic processes like ray tracing, but also things like fancy multipole PDE solvers, where there's no natural spatial structure to the problem anyway, so coalescing is no big deal).
Their RT patents explicitly talk about the RT unit launching new compacted and coalesced threadblocks to process hits. From what I understand, this would be a completely new way to handle these things. Nvidia still uses monolithic shaders where each thread launches a request and receives a result. An analogy I like is that Nvidia's RT is like async programming (each thread starts an `await test_ray()`) while Apple's is more like continuation passing style (the threadblock executes `yield test_rays(&hit_epilogue_program)`). The second is obviously more efficient.
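A toy sketch of the two control-flow shapes, in plain C++ (every name here is made up for illustration; neither vendor's actual API looks like this):
Code:
struct Ray {};
struct Hit {};

Hit  trace(Ray r) { return {}; }  // stand-in for the RT unit
void shade(Hit h) {}

// Nvidia-style monolithic shader: the thread issues a query and logically
// waits on its own result, occupying its slot the whole time.
void shade_monolithic(Ray ray) {
    Hit hit = trace(ray);  // like `await`
    shade(hit);
}

// Apple-style, per the patents: hand over a continuation; the RT unit
// gathers hits and launches fresh, compacted threadblocks that run it.
void hit_epilogue(Hit hit) { shade(hit); }

void trace_async(Ray r, void (*k)(Hit)) { k(trace(r)); }  // toy stand-in

void shade_cps(Ray ray) {
    trace_async(ray, &hit_epilogue);  // like CPS `yield`; thread retires
}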