Ah, I see, so you haven't looked at the details yet. It would be great to hear your thoughts once you've had a chance to study the patents more closely. In particular, here is the depiction of the dot product circuit from the matrix multiplier caching patent.


[Attached figure: the dot product circuit from the matrix multiplier caching patent]

This is very different from what they've been doing until now. M1-M4 use stepwise dot-product calculation using SIMD and quad shifts. The circuit in the figure instead uses reduction to calculate the dot product directly (Nvidia tensor cores use the same basic approach). That's a 4x increase in compute, if implemented as depicted. Of course, it is possible that the figure is merely cosmetic; however, I find that unlikely.
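To make the contrast concrete, here is a rough Swift sketch of the two dataflows (function names and structure are mine, not from the patents):

```swift
// Stepwise (M1-M4 style, per the older patent): one dependent FMA per
// step, so a 4-long dot product takes 4 sequential multiply-accumulates.
func dotChained(_ a: SIMD4<Float>, _ b: SIMD4<Float>) -> Float {
    var acc: Float = 0
    for i in 0..<4 {
        acc = a[i] * b[i] + acc   // each step is one FMA, dependent on the previous
    }
    return acc
}

// Reduction (as the new figure depicts it, and as tensor cores work):
// all four multiplies issue in parallel, then a two-level adder tree.
func dotReduced(_ a: SIMD4<Float>, _ b: SIMD4<Float>) -> Float {
    let p = a * b                          // 4 multipliers in parallel
    return (p[0] + p[1]) + (p[2] + p[3])   // log2(4) = 2 adder levels
}
```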
OK, I've had a chance to look at the patent. (God, Patentscope is awful! So slow, and it keeps "refreshing" pages, i.e. losing your place. I don't care if Google Patents is a few weeks slower, at least it's usable!)

Some points.
1. The original matrix multiply patent references 8*8 multiplication - but for FP16...
It's easier to think in terms of FP32, where I believe we're restricted to 4*4 blocks. 8*8 FP16 works because you can pack two entries into a single "register" using the .H and .L components.
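For illustration, here's what that packing looks like (a sketch only; Float16 needs a recent Swift toolchain, and the .H/.L naming is the register convention, not Swift):

```swift
// Two FP16 values packed into one 32-bit register: the .H and .L halves.
// This is why an 8*8 FP16 tile fits where FP32 is limited to 4*4 --
// twice the elements per "register".
func packHalves(_ h: Float16, _ l: Float16) -> UInt32 {
    (UInt32(h.bitPattern) << 16) | UInt32(l.bitPattern)
}

func unpackHalves(_ r: UInt32) -> (h: Float16, l: Float16) {
    (Float16(bitPattern: UInt16(truncatingIfNeeded: r >> 16)),
     Float16(bitPattern: UInt16(truncatingIfNeeded: r)))
}
```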

2. The big idea in the Operand Selection Circuitry patent is to move some of the shuffle network required by the various supported Apple shuffle operations into the wiring of the Operand Cache SRAM. Doing this (in the way they describe) seems to allow for a slightly richer set of shuffle operations; in particular, they get a free transpose for 4*4 matrices. Their scheme also allows for augmenting the existing set of shuffles (within-quad and cross-lane shifts, and what's needed for matrix multiply) with a few permutes (i.e. "rearrangements that might also duplicate a value"), but I don't know if they will expose these or use them.
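To see why the transpose comes out free: a 4*4 transpose is just a fixed relabeling of the 16 lanes, which read-out wiring can implement with no compute. A sketch (the permutation table is mine, not from the patent):

```swift
// Transpose of a 4x4 matrix stored row-major across 16 lanes:
// output lane (r*4 + c) reads input lane (c*4 + r) -- pure wiring.
let transposePerm: [Int] = (0..<16).map { ($0 % 4) * 4 + ($0 / 4) }

// A "permute" differs from a shuffle in that source lanes may repeat,
// duplicating a value (e.g. broadcasting lane 0 everywhere):
let broadcastPerm = [Int](repeating: 0, count: 16)

func applyPerm<T>(_ lanes: [T], _ perm: [Int]) -> [T] {
    perm.map { lanes[$0] }
}
```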

3. There may be some process benefit to the way they moved part of the shuffle into the SRAM (maybe you can get some free wiring for data movement above the first few metal layers in a way that's not available for logic?). If this is so, the scheme does allow for a way to provide some degree of shuffling between two *different* registers. This in turn allows for a zip primitive, which is an important part of being able to transpose large matrices faster (sketched below).
Alternatively (or in addition) maybe doing one stage of data rearrangement in the SRAM saves a little energy?
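The zip primitive would interleave the lanes of two registers; larger transposes then decompose into a few rounds of zips. A sketch (my naming):

```swift
// Interleave the lanes of two *different* registers -- the primitive
// that cross-register operand-cache wiring would enable.
func zipLo(_ a: SIMD4<Float>, _ b: SIMD4<Float>) -> SIMD4<Float> {
    SIMD4(a[0], b[0], a[1], b[1])
}
func zipHi(_ a: SIMD4<Float>, _ b: SIMD4<Float>) -> SIMD4<Float> {
    SIMD4(a[2], b[2], a[3], b[3])
}
// An 8x8 transpose, for instance, decomposes into log2(8) = 3 rounds
// of such zips, which is why the primitive matters for large matrices.
```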

Either way, I did not see anything in the patent that definitively indicated support for a larger base unit of matrix multiplication (e.g. 8*8 FP32), though that's certainly possible.

4. I'm still not sure what the "meaning" is of the dot product circuit you provided.
If we stick to matrix multiplication, it doesn't help because we need balanced multiplies and adds.
It does give us "single cycle" (throughput, not latency) dot product for 4-long vectors, but at the cost of "disaggregated" FMAs.
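The balance point is easiest to see in the inner loop of a matrix multiply, where every multiply already comes paired with an add (a plain scalar sketch):

```swift
// 4x4 GEMM inner loop: each step is one multiply plus one add -- a
// perfect fit for an FMA pipe. The extra adders of a reduction tree
// sit idle here, which is why the tree doesn't raise GEMM throughput.
func matmul4(_ a: [[Float]], _ b: [[Float]], _ c: inout [[Float]]) {
    for i in 0..<4 {
        for j in 0..<4 {
            for k in 0..<4 {
                c[i][j] += a[i][k] * b[k][j]   // one FMA per step
            }
        }
    }
}
```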

Maybe the real point is that
- the FMAs *were* (in some sense) disaggregated
- some minor cross-lane data movement was provided
so that 4 FMA units can ALSO be used as four ("multiply, then add") units following the pattern of the diagram?
Adding additional adders to the existing FMAs seems a terrible idea for such a niche use case, but maybe adding some additional wiring + buffering (and given that GPU timing is not very aggressive) was quite feasible? So you get a faster dot product for the cases that care (presumably a lot of graphics) almost for free?

I could imagine a line of thinking that starts with "let's add a register to each FMA lane to avoid data movement costs", then, after doing this, realizes "hell, we could also add a few buffers with very minimal area and no cycle hit, hmm?", which flows into "hey, with a few extra buffers we could rearrange the data flow between our adders and multipliers to speed up dot product!"

This is not 4x in compute, not really. It does get you your dot product at lower latency, but it's the same number of multipliers and adders as before. You just get to use the adders in a different way.
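Put differently, the same four multipliers and four adders can be wired two ways; here is a sketch of the rewired mode (my framing, not the patent's):

```swift
// FMA mode: each lane computes (a*b + acc) independently.
// Dot-product mode: the four lane multipliers feed three of the four
// lane adders, rewired as a tree; the fourth adder is free to
// accumulate. Unit counts are identical -- only the routing changes.
func dotWithBorrowedAdders(_ a: SIMD4<Float>, _ b: SIMD4<Float>) -> Float {
    let p = a * b              // lanes 0-3: the four existing multipliers
    let s0 = p[0] + p[1]       // lane 0's adder
    let s1 = p[2] + p[3]       // lane 2's adder
    return s0 + s1             // lane 1's adder; lane 3's adder is spare
}
```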
 
4. I'm still not sure what the "meaning" is of the dot product circuit you provided.
If we stick to matrix multiplication, it doesn't help because we need balanced multiplies and adds.
It does give us "single cycle" (throughput, not latency) dot product for 4-long vectors, but at the cost of "disaggregated" FMAs.

The way I interpret it, it's a new hardware pipe just for matrix multiplication (quite similar in principle to Nvidia's "tensor cores"). Each lane would have 4x multipliers and 4x adders and process 4-wide vectors per cycle. This would also require a new permute network, since you have to forward 4x data elements to each lane.
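If that reading is right, the throughput arithmetic is straightforward (my assumptions: a 32-wide SIMD partition, one FMA per lane per cycle today):

```swift
let lanes = 32
let macsPerLaneToday = 1               // one FMA per lane per cycle (M1-M4)
let macsPerLaneNew   = 4               // four multipliers + adder tree per lane
let today = lanes * macsPerLaneToday   // 32 MACs/cycle per SIMD partition
let new   = lanes * macsPerLaneNew     // 128 MACs/cycle -- the 4x figure
```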

Note also that they explicitly mention that the dot product circuit contains both an FP and an INT data path. This is a significant departure from the current implementation, where FP and INT are separate pipes. This is another reason why I think we are looking at a fundamentally new pipe.

I think if they went this way, the die area increase would remain modest, likely under 10%, which sounds to me like a decent investment for much-improved GEMM performance. This is a domain where Apple lags behind, and given the interest in machine learning, closing this gap could dramatically improve the perceived value of Apple hardware to data professionals and hobbyists alike. Personally, I wouldn't be sad if they even dropped SME in favor of dedicated GPU-based GEMM.
 
1. The original matrix multiply patent references 8*8 multiplication - but for FP16...

I interpret that patent somewhat differently. From what I see, it describes multiplying two 8x8 matrices using 32-wide SIMD over 16 cycles (Figures 5A-D). They discuss some FP16-specific storage and routing optimizations, but the computation should be identical for FP16 and FP32 (incidentally, M1-M4 have the same GEMM throughput for FP16 and FP32 data).
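Under that reading, the 16-cycle figure is simple accounting (my arithmetic; the patent doesn't spell it out this way):

```swift
let outputs = 8 * 8                          // 64 elements in the result tile
let simdWidth = 32                           // lanes available
let outputsPerLane = outputs / simdWidth     // 2 result elements per lane
let fmasPerOutput = 8                        // an 8-term dot product each
let cycles = outputsPerLane * fmasPerOutput  // 2 * 8 = 16 cycles
```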


Which, again, is quite a significant difference from the newer patent. The 2019 patent calculates the dot product iteratively, by chaining FMAs. The 2024 patent depicts dot-product calculation as a parallel operation with reduction.
 
The way I interpret it, it's a new hardware pipe just for matrix multiplication (quite similar in principle to Nvidia's "tensor cores"). Each lane would have 4x multipliers and 4x adders and process 4-wide vectors per cycle. This would also require a new permute network, since you have to forward 4x data elements to each lane.

Well, I guess we'll see soon, maybe this year, maybe next year!
 
I think if they went this way, the die area increase would remain modest, likely under 10%, which sounds to me like a decent investment for much-improved GEMM performance. This is a domain where Apple lags behind, and given the interest in machine learning, closing this gap could dramatically improve the perceived value of Apple hardware to data professionals and hobbyists alike. Personally, I wouldn't be sad if they even dropped SME in favor of dedicated GPU-based GEMM.

If these improvements are slated for a newer process such as 2nm or smaller, couldn't that theoretically result in a similar overall die size? The move to 3nm was similar in that regard, as M3 and M4 both have more transistors in the same die area. With M4, Apple even broke out display management functions into the separate "Display Engine" without significant increases in overall die size.
 
If these improvements are slated for a newer process such as 2nm or smaller, couldn't that theoretically result in a similar overall die size? The move to 3nm was similar in that regard, as M3 and M4 both have more transistors in the same die area.

It might even be feasible on N3. Most of this is logic (and some data routing), which should scale well on newer processes.

With M4, Apple even broke out display management functions into the separate "Display Engine" without significant increases in overall die size.

What do you mean? Display Engine has been a thing since earlier A-series.
 
What do you mean? Display Engine has been a thing since earlier A-series.

With the debut of Tandem OLED on the M4 iPad Pro, Apple needed a dedicated section of the SoC to offload that work from the CPU and GPU. Previously that logic was part of the CPU/GPU rather than its own section of the silicon.

From PC Mag:
What is new, however, is that the M4 is the first Apple chip with dedicated display engine hardware, offloading display-related tasks from the CPU and GPU.

From Apple:
M4 also features an entirely new display engine designed with pioneering technologies, enabling the stunning precision, color accuracy, and brightness uniformity of the Ultra Retina XDR display, a state-of-the-art display created by combining the light of two OLED panels.
 
With the debut of Tandem OLED on the M4 iPad Pro, Apple needed a dedicated section of the SoC to offload that work from the CPU and GPU. Previously that logic was part of the CPU/GPU rather than its own section of the silicon.

All M-series SoCs have a dedicated display engine. PC Mag must have confused something. M4 might have a new version of the display engine, but it's not a new component.

We've also had a number of discussions here in the past about the display engine and how it limited multi-monitor support on previous Macs. It's a fairly beefy IP block as well, as it contains a lot of SRAM to hold framebuffer data.

You can read about the display engine (DCP) here: https://asahilinux.org/2021/08/progress-report-august-2021/
 
With the debut of Tandem OLED on the M4 iPad Pro, Apple needed a dedicated section of the SoC to offload that work from the CPU and GPU. Previously that logic was part of the CPU/GPU rather than its own section of the silicon.

From PC Mag:

From Apple:
These fall under "never trust magazine article writers to get technical details right" and "never trust PR either". There are exceptions, of course, but if you assume something got garbled in translation from engineer to article/PR, you won't go far wrong.

The offloaded work in question is the act of reading out the contents of a frame buffer in RAM and using that data to construct a signal that is sent to a display. Offloading this work to a hardware engine is a very, very old concept. (*)

M1 onwards all have Display Engines. The thing that's new in M4 is support for Tandem OLED, which needs to drive two display pixels from each frame buffer pixel. I think they probably need to do some math on the values sent to one or both layers to make it work better as a single display.
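Purely as a guess at the kind of per-pixel math involved (nothing below is from Apple; real hardware would presumably apply calibrated per-panel response curves rather than a naive split):

```swift
// Tandem OLED: two stacked panels whose light output adds, so the
// display engine must derive a drive value for each layer from one
// framebuffer value -- shown here as a naive even split of brightness.
func tandemDrive(_ framebufferValue: Float) -> (front: Float, back: Float) {
    let perLayer = framebufferValue / 2
    return (front: perLayer, back: perLayer)
}
```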


* - As far as I'm aware, the one and only example of a mass-manufactured personal computing device which supported video display but didn't have this offload was the Atari 2600 game console. They left it out to reduce parts count, so it was cheaper to build; their more expensive 8-bit home computers had a 'display engine'.
 