OK, I've had a chance to look at the patent. (God, Patentscope is awful! SO slow, and it keeps "refreshing" pages, i.e. losing your place. I don't care if Google Patents is a few weeks slower, at least it's usable!)

Ah, I see, so you haven't looked at the details yet. Would be great to hear your thoughts once you've had the opportunity to study the patents more precisely. In particular, here is the depiction of the dot product circuit from the matrix multiplier caching patent.
[Attachment: figure of the dot product circuit from the patent]
This is very different from what they've been doing until now. M1-M4 use stepwise dot product calculation using SIMD and quad shifts. The circuit in the figure instead uses reduction to calculate the dot product directly (Nvidia tensor cores use the same basic approach). That's a 4x increase in compute, if implemented as depicted. Of course, it is possible that the figure is merely cosmetic; however, I find that unlikely.
Some points.
1. The original matrix multiply patent references 8*8 multiplication - but for FP16...
I think it's easier to reason in terms of FP32, where I believe we're restricted to 4*4 blocks. 8*8 FP16 works because you can pack two entries into a single "register" using the .H and .L components.
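For what it's worth, here's a minimal sketch of that packing argument, treating the register as two raw 16-bit halves (the pack_h_l/get_h/get_l helpers and the .H/.L naming are just my illustration):

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the packing argument: a 32-bit register carries two FP16 values
 * in its low (.L) and high (.H) halves, so an 8-wide FP16 row needs the same
 * register footprint as a 4-wide FP32 row. Helper names are mine; only the
 * 16-vs-32-bit arithmetic comes from the post. */

static uint32_t pack_h_l(uint16_t hi, uint16_t lo) { return ((uint32_t)hi << 16) | lo; }
static uint16_t get_h(uint32_t r) { return (uint16_t)(r >> 16); }
static uint16_t get_l(uint32_t r) { return (uint16_t)(r & 0xFFFF); }

int main(void) {
    /* 8 FP16 row entries fit in 4 x 32-bit registers, vs 4 FP32 entries. */
    uint16_t fp16_row[8] = {0x3C00, 0x4000, 0x4200, 0x4400,  /* 1,2,3,4 as FP16 bit patterns */
                            0x4500, 0x4600, 0x4700, 0x4800}; /* 5,6,7,8 as FP16 bit patterns */
    uint32_t regs[4];
    for (int i = 0; i < 4; i++)
        regs[i] = pack_h_l(fp16_row[2 * i + 1], fp16_row[2 * i]);
    printf("reg0 = 0x%08X (H = 0x%04X, L = 0x%04X)\n",
           (unsigned)regs[0], (unsigned)get_h(regs[0]), (unsigned)get_l(regs[0]));
    return 0;
}
```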
2. The big idea in the Operand Selection Circuitry patent is to move some of the shuffle network required by the various supported Apple shuffle operations into the wiring of the Operand Cache SRAM. Doing this (in the way they describe) seems to allow for a slightly richer set of shuffle operations; in particular, they get a free transpose for 4*4 matrices. Their scheme also allows for augmenting the existing set of shuffles (within-quad, cross-lane shifts, and what's needed for matrix multiply) with a few permutes (i.e. "rearrangements that might also duplicate a value"), but I don't know if they will expose these or use them.
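A rough sketch of what "transpose falls out of the read wiring" could look like, assuming each output lane can simply be steered to a different source slot (the flat index layout is my assumption, not the patent's):

```c
#include <stdio.h>

/* If lane i of an operand read can be steered to slot perm[i] of the stored
 * data, a fixed 4x4 transpose is just one steering pattern, and a permute
 * (which may duplicate a source value) is any other. Illustration only. */

enum { N = 4 };

static void read_permuted(const float src[N * N], const int perm[N * N], float dst[N * N]) {
    for (int i = 0; i < N * N; i++)
        dst[i] = src[perm[i]];   /* a shuffle is a bijective perm; a permute may repeat slots */
}

int main(void) {
    float m[N * N], t[N * N];
    int transpose_idx[N * N];
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++) {
            m[r * N + c] = (float)(r * N + c);
            transpose_idx[r * N + c] = c * N + r;  /* fixed "wiring" pattern for transpose */
        }
    read_permuted(m, transpose_idx, t);
    printf("t[0][1] = %.0f (was m[1][0] = %.0f)\n", t[1], m[N]); /* transposed for free */
    return 0;
}
```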
3. There may be some process benefit to the way they moved some part of the shuffle into the SRAM (maybe you can get some free wiring for data movement above the first few metal layers in a way that's not available for logic?). If this is so, the scheme does allow for a way to provide some degree of shuffling between two *different* registers. This in turn allows for a zip primitive (see the sketch at the end of this point), which is an important part of being able to transpose large matrices faster.
Alternatively (or in addition) maybe doing one stage of data rearrangement in the SRAM saves a little energy?
Either way, I did not see anything in the patent that definitively indicated support for a larger base unit of matrix multiplication (e.g. 8*8 FP32), though that's certainly possible.
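Here's the kind of zip primitive I mean, sketched for 4-wide quads (the zip_lo/zip_hi naming follows the usual SIMD convention and is mine, not the patent's):

```c
#include <stdio.h>

/* Zip interleaves lanes from two *different* registers, which is exactly the
 * cross-register shuffle the SRAM scheme would enable. Repeated zips, plus
 * the existing in-quad shuffles, are the standard way to transpose tiles
 * larger than 4x4. Illustration only. */

static void zip_lo(const float a[4], const float b[4], float out[4]) {
    out[0] = a[0]; out[1] = b[0]; out[2] = a[1]; out[3] = b[1];
}
static void zip_hi(const float a[4], const float b[4], float out[4]) {
    out[0] = a[2]; out[1] = b[2]; out[2] = a[3]; out[3] = b[3];
}

int main(void) {
    float a[4] = {0, 1, 2, 3}, b[4] = {4, 5, 6, 7}, lo[4], hi[4];
    zip_lo(a, b, lo);  /* lo = {0, 4, 1, 5} */
    zip_hi(a, b, hi);  /* hi = {2, 6, 3, 7} */
    printf("%g %g %g %g | %g %g %g %g\n",
           lo[0], lo[1], lo[2], lo[3], hi[0], hi[1], hi[2], hi[3]);
    return 0;
}
```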
4. I'm still not sure what the "meaning" is of the dot product circuit you provided.
If we stick to matrix multiplication, it doesn't help because we need balanced multiplies and adds.
It does give us a "single cycle" (throughput, not latency) dot product for 4-long vectors, but at the cost of "disaggregated" FMAs.
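A quick operation count shows the balance point, under the obvious accounting (no claim about the actual RTL):

```c
#include <stdio.h>

/* A 4x4 by 4x4 matrix multiply is 64 multiplies and 64 adds -- one
 * accumulate per product -- which maps exactly onto FMA units. A 4-wide
 * reduction tree spends 4 multipliers but only 3 adders per result, so for
 * matmul its adders can't all be kept busy. Counting only; illustration. */

int main(void) {
    enum { N = 4 };
    int muls = 0, adds = 0;
    float A[N][N] = {{0}}, B[N][N] = {{0}}, C[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++) {
                C[i][j] += A[i][k] * B[k][j];  /* one multiply, one add: an FMA */
                muls++; adds++;
            }
    printf("4x4 matmul: %d muls, %d adds (balanced)\n", muls, adds); /* 64 and 64 */
    return 0;
}
```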
Maybe the real point is that
- the FMAs *were* (in some sense) disaggregated
- some minor cross-lane data movement was provided
so that 4 FMA units can ALSO be used as four ("multiply, then add") units following the pattern of the diagram?
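Something like this, with the same four multipliers and four adders wired two different ways (my illustration of the pattern, not the patent's circuit):

```c
#include <stdio.h>

/* Mode 1: four independent FMA lanes, each adder paired with its multiplier.
 * Mode 2: the same units rearranged into multiply-then-reduce -- three adders
 * form the tree from the diagram and the fourth can fold in an accumulator. */

static void fma4(const float a[4], const float b[4], float acc[4]) {
    for (int lane = 0; lane < 4; lane++)
        acc[lane] = a[lane] * b[lane] + acc[lane];
}

static float dot4(const float a[4], const float b[4], float acc) {
    float p[4];
    for (int lane = 0; lane < 4; lane++)
        p[lane] = a[lane] * b[lane];              /* same 4 multipliers */
    return ((p[0] + p[1]) + (p[2] + p[3])) + acc; /* 3 adders as a tree + 1 for the accumulator */
}

int main(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, acc[4] = {0, 0, 0, 0};
    fma4(a, b, acc);
    printf("FMA lanes: %g %g %g %g, dot product: %g\n",
           acc[0], acc[1], acc[2], acc[3], dot4(a, b, 0.0f));
    return 0;
}
```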
Adding additional adders to the existing FMAs seems a terrible idea for such a niche use case, but maybe adding some additional wiring + buffering (given that GPU timing is not very aggressive) was quite feasible? So you get a faster dot product for the cases that care (presumably a lot of graphics) at close to free?
I could imagine a line of thinking that starts with "let's add a register to each FMA lane to avoid data movement costs", then, having done that, realizes "hell, we could also add a few buffers with very minimal area and no cycle hit, hmm?", which flows into "hey, with a few extra buffers we could rearrange the data flow between our adders and multipliers to speed up dot product!"
This is not 4x in compute, not really. It does get you your dot product at lower latency, but it's the same number of multipliers and adders as before. You just get to use the adders in a different way.
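To make that concrete, here is a toy scalar sketch of the two dataflows as I read them (4-wide quads assumed; my illustration, not the patent's circuit). The stepwise form chains four dependent FMAs; the tree form does all four multiplies at once and sums them in two adder levels, so latency drops while the multiplier and adder count stays put:

```c
#include <stdio.h>

/* Toy sketch, my reading only -- not Apple's RTL.
 * Stepwise: one lane's product is folded into the accumulator per step,
 * the way a SIMD unit with quad shifts would work through 4 issue slots.
 * Tree: all four products are formed at once and summed by a reduction
 * tree, which is how the circuit in the figure appears to work. */

static float dot4_stepwise(const float a[4], const float b[4]) {
    float acc = 0.0f;
    for (int step = 0; step < 4; step++)      /* 4 dependent FMAs: acc = a*b + acc */
        acc = a[step] * b[step] + acc;
    return acc;
}

static float dot4_tree(const float a[4], const float b[4]) {
    float p0 = a[0] * b[0], p1 = a[1] * b[1]; /* 4 multiplies in parallel */
    float p2 = a[2] * b[2], p3 = a[3] * b[3];
    float s0 = p0 + p1, s1 = p2 + p3;         /* first adder level */
    return s0 + s1;                           /* second adder level */
}

int main(void) {
    const float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
    printf("%f %f\n", dot4_stepwise(a, b), dot4_tree(a, b)); /* both print 70 */
    return 0;
}
```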