OK, I've had a chance to look at the patent. (God, Patentscope is awful! SO slow, and it keeps "refreshing" pages, i.e. losing your place. I don't care if Google Patents is a few weeks slower, at least it's usable!)

Ah, I see, so you haven't looked at the details yet. Would be great to hear your thoughts once you've had the opportunity to study the patents more precisely. In particular, here is the depiction of the dot product circuit from the matrix multiplier caching patent.
[Attachment: figure of the dot product circuit from the patent]
This is very different from what they've been doing until now. M1-M4 use stepwise dot product calculation using SIMD and quad shifts. The circuit in the figure instead uses reduction to calculate the dot product directly (Nvidia tensor cores use the same basic approach). That's a 4x increase in compute, if implemented as depicted. Of course, it is possible that the figure is merely cosmetic; however, I find that unlikely.
Some points.
1. The original matrix multiply patent references 8*8 multiplication - but for FP16...
I think it's easier to reason in terms of FP32, where I believe we're restricted to 4*4 blocks. 8*8 FP16 works because you can pack two entries into a single "register" using the .H and .L components.
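For what it's worth, here's a minimal sketch of that packing argument, treating the register as two raw 16-bit halves (the pack_h_l/get_h/get_l helpers and the .H/.L naming are just my illustration):

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the packing argument: a 32-bit register carries two FP16 values
 * in its low (.L) and high (.H) halves, so an 8-wide FP16 row needs the same
 * register footprint as a 4-wide FP32 row. Helper names are mine; only the
 * 16-vs-32-bit arithmetic comes from the post. */

static uint32_t pack_h_l(uint16_t hi, uint16_t lo) { return ((uint32_t)hi << 16) | lo; }
static uint16_t get_h(uint32_t r) { return (uint16_t)(r >> 16); }
static uint16_t get_l(uint32_t r) { return (uint16_t)(r & 0xFFFF); }

int main(void) {
    /* 8 FP16 row entries fit in 4 x 32-bit registers, vs 4 FP32 entries. */
    uint16_t fp16_row[8] = {0x3C00, 0x4000, 0x4200, 0x4400,  /* 1,2,3,4 as FP16 bit patterns */
                            0x4500, 0x4600, 0x4700, 0x4800}; /* 5,6,7,8 as FP16 bit patterns */
    uint32_t regs[4];
    for (int i = 0; i < 4; i++)
        regs[i] = pack_h_l(fp16_row[2 * i + 1], fp16_row[2 * i]);
    printf("reg0 = 0x%08X (H = 0x%04X, L = 0x%04X)\n",
           (unsigned)regs[0], (unsigned)get_h(regs[0]), (unsigned)get_l(regs[0]));
    return 0;
}
```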
2. The big idea in the Operand Selection Circuitry patent is to move some of the shuffle network required by the various supported Apple shuffle operations into the wiring of the Operand Cache SRAM. Doing this (in the way they describe) seems to allow for a slightly richer set of shuffle operations; in particular, they get a free transpose for 4*4 matrices. Their scheme also allows for augmenting the existing set of shuffles (within-quad, cross-lane shifts, and what's needed for matrix multiply) with a few permutes (i.e. "rearrangements that might also duplicate a value"), but I don't know if they will expose these or use them.
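A rough sketch of what "transpose falls out of the read wiring" could look like, assuming each output lane can simply be steered to a different source slot (the flat index layout is my assumption, not the patent's):

```c
#include <stdio.h>

/* If lane i of an operand read can be steered to slot perm[i] of the stored
 * data, a fixed 4x4 transpose is just one steering pattern, and a permute
 * (which may duplicate a source value) is any other. Illustration only. */

enum { N = 4 };

static void read_permuted(const float src[N * N], const int perm[N * N], float dst[N * N]) {
    for (int i = 0; i < N * N; i++)
        dst[i] = src[perm[i]];   /* a shuffle is a bijective perm; a permute may repeat slots */
}

int main(void) {
    float m[N * N], t[N * N];
    int transpose_idx[N * N];
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++) {
            m[r * N + c] = (float)(r * N + c);
            transpose_idx[r * N + c] = c * N + r;  /* fixed "wiring" pattern for transpose */
        }
    read_permuted(m, transpose_idx, t);
    printf("t[0][1] = %.0f (was m[1][0] = %.0f)\n", t[1], m[N]); /* transposed for free */
    return 0;
}
```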
3. There may be some process benefit to the way they moved some part of the shuffle into the SRAM (maybe you can get some free wiring for data movement above the first few metal layers in a way that's not available for logic?). If this is so, the scheme does allow for a way to provide some degree of shuffling between two *different* registers. This in turn allows for a zip primitive (see the sketch at the end of this point), which is an important part of being able to transpose large matrices faster.
Alternatively (or in addition) maybe doing one stage of data rearrangement in the SRAM saves a little energy?
Either way, I did not see anything in the patent that definitively indicated support for a larger base unit of matrix multiplication (e.g. 8*8 FP32), though that's certainly possible.
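Here's the kind of zip primitive I mean, sketched for 4-wide quads (the zip_lo/zip_hi naming follows the usual SIMD convention and is mine, not the patent's):

```c
#include <stdio.h>

/* Zip interleaves lanes from two *different* registers, which is exactly the
 * cross-register shuffle the SRAM scheme would enable. Repeated zips, plus
 * the existing in-quad shuffles, are the standard way to transpose tiles
 * larger than 4x4. Illustration only. */

static void zip_lo(const float a[4], const float b[4], float out[4]) {
    out[0] = a[0]; out[1] = b[0]; out[2] = a[1]; out[3] = b[1];
}
static void zip_hi(const float a[4], const float b[4], float out[4]) {
    out[0] = a[2]; out[1] = b[2]; out[2] = a[3]; out[3] = b[3];
}

int main(void) {
    float a[4] = {0, 1, 2, 3}, b[4] = {4, 5, 6, 7}, lo[4], hi[4];
    zip_lo(a, b, lo);  /* lo = {0, 4, 1, 5} */
    zip_hi(a, b, hi);  /* hi = {2, 6, 3, 7} */
    printf("%g %g %g %g | %g %g %g %g\n",
           lo[0], lo[1], lo[2], lo[3], hi[0], hi[1], hi[2], hi[3]);
    return 0;
}
```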
4. I'm still not sure what the "meaning" is of the dot product circuit you provided.
If we stick to matrix multiplication, it doesn't help because we need balanced multiplies and adds.
It does give us a "single cycle" (throughput, not latency) dot product for 4-long vectors, but at the cost of "disaggregated" FMAs.
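A quick operation count shows the balance point, under the obvious accounting (no claim about the actual RTL):

```c
#include <stdio.h>

/* A 4x4 by 4x4 matrix multiply is 64 multiplies and 64 adds -- one
 * accumulate per product -- which maps exactly onto FMA units. A 4-wide
 * reduction tree spends 4 multipliers but only 3 adders per result, so for
 * matmul its adders can't all be kept busy. Counting only; illustration. */

int main(void) {
    enum { N = 4 };
    int muls = 0, adds = 0;
    float A[N][N] = {{0}}, B[N][N] = {{0}}, C[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++) {
                C[i][j] += A[i][k] * B[k][j];  /* one multiply, one add: an FMA */
                muls++; adds++;
            }
    printf("4x4 matmul: %d muls, %d adds (balanced)\n", muls, adds); /* 64 and 64 */
    return 0;
}
```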
Maybe the real point is that
- the FMAs *were* (in some sense) disaggregated
- some minor cross-lane data movement was provided
so that 4 FMA units can ALSO be used as four ("multiply, then add") units following the pattern of the diagram?
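Something like this, with the same four multipliers and four adders wired two different ways (my illustration of the pattern, not the patent's circuit):

```c
#include <stdio.h>

/* Mode 1: four independent FMA lanes, each adder paired with its multiplier.
 * Mode 2: the same units rearranged into multiply-then-reduce -- three adders
 * form the tree from the diagram and the fourth can fold in an accumulator. */

static void fma4(const float a[4], const float b[4], float acc[4]) {
    for (int lane = 0; lane < 4; lane++)
        acc[lane] = a[lane] * b[lane] + acc[lane];
}

static float dot4(const float a[4], const float b[4], float acc) {
    float p[4];
    for (int lane = 0; lane < 4; lane++)
        p[lane] = a[lane] * b[lane];              /* same 4 multipliers */
    return ((p[0] + p[1]) + (p[2] + p[3])) + acc; /* 3 adders as a tree + 1 for the accumulator */
}

int main(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, acc[4] = {0, 0, 0, 0};
    fma4(a, b, acc);
    printf("FMA lanes: %g %g %g %g, dot product: %g\n",
           acc[0], acc[1], acc[2], acc[3], dot4(a, b, 0.0f));
    return 0;
}
```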
Adding additional adders to the existing FMAs seems a terrible idea for such a niche use case, but maybe adding some additional wiring + buffering (given that GPU timing is not very aggressive) was quite feasible? So you get a faster dot product for the cases that care (presumably a lot of graphics) at close to free?
I could imagine a line of thinking that starts with "let's add a register to each FMA lane to avoid data movement costs", then, having done that, realizes "hell, we could also add a few buffers with very minimal area and no cycle hit, hmm?", which flows into "hey, with a few extra buffers we could rearrange the data flow between our adders and multipliers to speed up dot product!"
This is not 4x in compute, not really. It does get you your dot product at lower latency, but it's the same number of multipliers and adders as before. You just get to use the adders in a different way.
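To make that concrete, here is a toy scalar sketch of the two dataflows as I read them (4-wide quads assumed; my illustration, not the patent's circuit). The stepwise form chains four dependent FMAs; the tree form does all four multiplies at once and sums them in two adder levels, so latency drops while the multiplier and adder count stays put:

```c
#include <stdio.h>

/* Toy sketch, my reading only -- not Apple's RTL.
 * Stepwise: one lane's product is folded into the accumulator per step,
 * the way a SIMD unit with quad shifts would work through 4 issue slots.
 * Tree: all four products are formed at once and summed by a reduction
 * tree, which is how the circuit in the figure appears to work. */

static float dot4_stepwise(const float a[4], const float b[4]) {
    float acc = 0.0f;
    for (int step = 0; step < 4; step++)      /* 4 dependent FMAs: acc = a*b + acc */
        acc = a[step] * b[step] + acc;
    return acc;
}

static float dot4_tree(const float a[4], const float b[4]) {
    float p0 = a[0] * b[0], p1 = a[1] * b[1]; /* 4 multiplies in parallel */
    float p2 = a[2] * b[2], p3 = a[3] * b[3];
    float s0 = p0 + p1, s1 = p2 + p3;         /* first adder level */
    return s0 + s1;                           /* second adder level */
}

int main(void) {
    const float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
    printf("%f %f\n", dot4_stepwise(a, b), dot4_tree(a, b)); /* both print 70 */
    return 0;
}
```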