This is interesting, but not in the way you think.
Suppose you have a scalar CPU that's calculating a long dot product. Essentially the code looks like a loop of C=A*B+C, perhaps with A and B cycling through registers 0, 1, 2, 3, ... while some other code sequentially loads fresh values into those registers.
So a consequence of this design is that we're continually reading register C from the register file, feeding it into the multiply-add, then storing C back to the register file. That's not ideal! Now in any decent modern CPU we'll in fact pull C off the bypass bus as it is being written back to the register file, so we can at least avoid the latency of re-reading it from the register file. (It's still a shame we keep writing back the intermediate C result, which is immediately overwritten, but that's a problem for later.)
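To make that concrete, here's the scalar pattern as rough C (my own illustration, nothing from the patent); the thing to notice is that the accumulator c is conceptually read and written back on every single iteration:

    /* Scalar dot product: the accumulator c is read, updated, and written
       back every iteration -- register-file (or bypass-network) traffic
       for a value nothing else needs yet. */
    float dot(const float *a, const float *b, int n) {
        float c = 0.0f;
        for (int i = 0; i < n; i++) {
            c = a[i] * b[i] + c;   /* FMA: C = A*B + C */
        }
        return c;
    }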
Switch to the GPU, and imagine we're similarly calculating a whole lot of long dot products in parallel. We'd have a similar code structure of C=A*B+C, only with A and B each 32 elements long and being pointwise multiplied, then accumulated into a 32-wide register C, to calculate 32 dot products in parallel.
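In rough C the 32-wide version looks something like this (again my own sketch; the inner lane loop is written sequentially here but stands in for 32 SIMD lanes working in parallel):

    #define LANES 32   /* one lane per dot product; illustrative constant */

    /* 32 dot products side by side: each lane k accumulates its own c[k],
       and the whole 32-wide C register still bounces between the operand
       cache and the FMA units on every iteration. */
    void dot32(const float a[][LANES], const float b[][LANES],
               float c[LANES], int n) {
        for (int k = 0; k < LANES; k++) c[k] = 0.0f;
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < LANES; k++) {
                c[k] = a[i][k] * b[i][k] + c[k];   /* per-lane FMA */
            }
        }
    }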
Once again we have a situation where we waste time and energy moving the register C back and forth between the execution unit and the operand cache. Can we avoid this?
Suppose we provided a single register of storage attached to each lane of each execution unit (maybe the FP16, FP32 and INT pipelines each have their own separate storage per lane, maybe they share a common 32-bit storage; that's a detail). Now we can define a new instruction that looks something like
local=A*B+local, and avoid all that back-and-forth movement of C.
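As a sketch of the transformed loop (again my own illustration; "local" is just a C variable standing in for the hypothetical per-lane latch next to the FMA unit):

    /* Same 32 dot products, but each lane's accumulator now lives in the
       per-lane local storage, so the C operand never travels to or from
       the operand cache inside the loop; only the final result is written
       out once at the end. */
    void dot32_local(const float a[][32], const float b[][32],
                     float c_out[32], int n) {
        for (int k = 0; k < 32; k++) {          /* lanes, in parallel on the GPU */
            float local = 0.0f;                 /* per-lane local storage */
            for (int i = 0; i < n; i++) {
                local = a[i][k] * b[i][k] + local;   /* local = A*B + local */
            }
            c_out[k] = local;                   /* single write-back at the end */
        }
    }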
This is essentially what the (2024) Matrix Multiplier Caching patent application,
https://patents.google.com/patent/US20250103292A1 , is about.
The more common task we care about is matrix multiplication, not exactly dot products, and you should recall that the Apple GPU provides a matrix multiply instruction that handles something like the multiplication of two side-by-side 4*4 submatrices by a 4*4 submatrix to give two 4*4 results. This is achieved by packing these matrices in the appropriate order into the 32 lanes of the A and B registers, then shuffling the values around as the registers pass repeatedly through the FMAC execution units.
This is nice if we want to calculate a 4*4 matrix product, but usually we have larger matrices, so we have to repeat this operation. At that point (you can work through the algebra if you like, or just imagine it operating like dot products, only with the "basic units" being 4*4 matrices that are multiplied together and accumulated) we are back to our earlier dot product situation.
As long as we have one register of local storage available to each lane, we can keep accumulating our matrix multiplication while avoiding the latency and energy costs of frequent movement of the C values.
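A rough C sketch of that blocked accumulation (my own illustration; it assumes the matrix dimension n is a multiple of 4 and doesn't model the actual lane packing and shuffling the hardware instruction performs):

    /* Blocked matrix multiply with 4*4 tiles as the "basic units": the C
       tile is accumulated across the whole k loop -- the dot-product
       pattern again, just with 4*4 matrix FMAs instead of scalar FMAs. */
    void matmul_blocked(const float *A, const float *B, float *C, int n) {
        for (int bi = 0; bi < n; bi += 4)
        for (int bj = 0; bj < n; bj += 4) {
            float acc[4][4] = {{0}};            /* stand-in for the local C tile */
            for (int bk = 0; bk < n; bk += 4) {
                /* one 4*4 by 4*4 multiply-accumulate into acc */
                for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                for (int k = 0; k < 4; k++)
                    acc[i][j] += A[(bi+i)*n + (bk+k)] * B[(bk+k)*n + (bj+j)];
            }
            for (int i = 0; i < 4; i++)         /* single write-back per C tile */
            for (int j = 0; j < 4; j++)
                C[(bi+i)*n + (bj+j)] = acc[i][j];
        }
    }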
Interestingly, it also avoids using the third operand track from the operand cache into the execution units, which in turn means we don't have to expand the bandwidth of the operand cache very much to get interesting consequences.
Suppose the operand cache today can supply three operands per cycle, and suppose we boosted it to supply four operands per cycle. This would in turn allow us to supply two operands to each of two pipelines, meaning that, for example, we could now run two large matrix multiplies in parallel without as much expansion of the hardware as earlier seemed necessary...
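Back-of-the-envelope, as C constants (my own numbers, purely illustrative: assume an FMA wants three source operands from the operand cache normally, but only two once C lives in local storage):

    enum { PIPES = 2, OPS_PER_FMA = 3, OPS_PER_FMA_LOCAL = 2 };

    enum {
        PORTS_WITHOUT_LOCAL = PIPES * OPS_PER_FMA,       /* 6 operands/cycle */
        PORTS_WITH_LOCAL    = PIPES * OPS_PER_FMA_LOCAL  /* 4 operands/cycle --
                                                            the boosted operand
                                                            cache is enough */
    };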
Of course this still requires two of the appropriate execution pipelines, but it mostly solves what looked like the hardest part of the problem of making GPU execution wider.
So to summarize:
- This doesn't (directly) point to any new FMA hardware. The large examples they give (like a 16*16 matrix) aren't saying "we can multiply that in one cycle"; they're saying "this is the kind of example that benefits from being able to use this local per-lane FMA storage".
- It's building on the existing GPU matrix multiplication (which is interesting in its own way, because the permute network it uses can also be used for other things...). You can look at the original matrix multiply patent,
https://patents.google.com/patent/US11256518B2 , but I think I'm correct in describing it as essentially multiplying two side-by-side 4*4s against a separate 4*4.
- The local storage could be used for other things (eg genuine dot products), and probably will be whenever the compiler is smart enough to see a data reuse pattern like I described. The performance and energy boosts are nice, not earth-shattering, but nice, and every bit helps.
- I'm most interested in how it probably makes it easier to widen GPU core execution (and how that's probably the best overall option for boosting performance at low energy, as compared to simply adding more simple GPU cores, i.e. the nVidia-type solution).