This is interesting, but not in the way you think.
Suppose you have a scalar CPU that's calculating a long dot product. Essentially the code looks like a loop of C=A*B+C, perhaps with A and B cycling through registers 0, 1, 2, 3, ... while some other code sequentially loads fresh values into those registers.
So a consequence of this design is that we're continually reading register C from the register file, feeding it into the multiply-add, then storing C back to the register file. That's not ideal! Now in any decent modern CPU we'll in fact pull C off the bypass bus as it is being written back to the register file, so we can at least avoid the latency of re-reading it from the register file. (It's still a shame we keep writing back the intermediate C result, which is immediately overwritten, but that's a problem for later.)
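To make that concrete, here's the scalar pattern as rough C (my own illustration, nothing from the patent); the thing to notice is that the accumulator c is conceptually read and written back on every single iteration:

    /* Scalar dot product: the accumulator c is read, updated, and written
       back every iteration -- register-file (or bypass-network) traffic
       for a value nothing else needs yet. */
    float dot(const float *a, const float *b, int n) {
        float c = 0.0f;
        for (int i = 0; i < n; i++) {
            c = a[i] * b[i] + c;   /* FMA: C = A*B + C */
        }
        return c;
    }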
Switch to the GPU, and imagine we're similarly calculating a whole lot of long dot products in parallel. We'd have a similar code structure of C=A*B+C, only with A and B each 32 elements long and being pointwise multiplied, then accumulated into a 32-wide register C, to calculate 32 dot products in parallel.
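In rough C the 32-wide version looks something like this (again my own sketch; the inner lane loop is written sequentially here but stands in for 32 SIMD lanes working in parallel):

    #define LANES 32   /* one lane per dot product; illustrative constant */

    /* 32 dot products side by side: each lane k accumulates its own c[k],
       and the whole 32-wide C register still bounces between the operand
       cache and the FMA units on every iteration. */
    void dot32(const float a[][LANES], const float b[][LANES],
               float c[LANES], int n) {
        for (int k = 0; k < LANES; k++) c[k] = 0.0f;
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < LANES; k++) {
                c[k] = a[i][k] * b[i][k] + c[k];   /* per-lane FMA */
            }
        }
    }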
Once again we have a situation where we waste time and energy moving the register C back and forth between the execution unit and the operand cache. Can we avoid this?
Suppose we provided a single register of storage attached to each lane of each execution unit (maybe the FP16, FP32 and INT pipelines each have their own separate storage per lane, maybe they share a common 32-bit storage; that's a detail). Now we can define a new instruction that looks something like
local=A*B+local, and avoid all that back-and-forth movement of C.
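As a sketch of the transformed loop (again my own illustration; "local" is just a C variable standing in for the hypothetical per-lane latch next to the FMA unit):

    /* Same 32 dot products, but each lane's accumulator now lives in the
       per-lane local storage, so the C operand never travels to or from
       the operand cache inside the loop; only the final result is written
       out once at the end. */
    void dot32_local(const float a[][32], const float b[][32],
                     float c_out[32], int n) {
        for (int k = 0; k < 32; k++) {          /* lanes, in parallel on the GPU */
            float local = 0.0f;                 /* per-lane local storage */
            for (int i = 0; i < n; i++) {
                local = a[i][k] * b[i][k] + local;   /* local = A*B + local */
            }
            c_out[k] = local;                   /* single write-back at the end */
        }
    }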
This is essentially what the (2024) Matrix Multiplier Caching patent application,
https://patents.google.com/patent/US20250103292A1 , is about.
The more common task we care about is matrix multiplication, not exactly dot products, and you should recall that the Apple GPU provides a matrix multiply instruction that handles something like the multiplication of two side-by-side 4*4 submatrices by a 4*4 submatrix to give two 4*4 results. This is achieved by packing these matrices in the appropriate order into the 32 lanes of the A and B registers, then shuffling the values around as the registers pass repeatedly through the FMAC execution units.
This is nice if we want to calculate a 4*4 matrix product, but usually we have larger matrices, so we have to repeat this operation. At that point (you can work through the algebra if you like, or just imagine it operating like dot products, only with the "basic units" being 4*4 matrices that are multiplied together and accumulated) we are back to our earlier dot product situation.
As long as we have one register of local storage available to each lane, we can keep accumulating our matrix multiplication while avoiding the latency and energy costs of frequent movement of the C values.
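A rough C sketch of that blocked accumulation (my own illustration; it assumes the matrix dimension n is a multiple of 4 and doesn't model the actual lane packing and shuffling the hardware instruction performs):

    /* Blocked matrix multiply with 4*4 tiles as the "basic units": the C
       tile is accumulated across the whole k loop -- the dot-product
       pattern again, just with 4*4 matrix FMAs instead of scalar FMAs. */
    void matmul_blocked(const float *A, const float *B, float *C, int n) {
        for (int bi = 0; bi < n; bi += 4)
        for (int bj = 0; bj < n; bj += 4) {
            float acc[4][4] = {{0}};            /* stand-in for the local C tile */
            for (int bk = 0; bk < n; bk += 4) {
                /* one 4*4 by 4*4 multiply-accumulate into acc */
                for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                for (int k = 0; k < 4; k++)
                    acc[i][j] += A[(bi+i)*n + (bk+k)] * B[(bk+k)*n + (bj+j)];
            }
            for (int i = 0; i < 4; i++)         /* single write-back per C tile */
            for (int j = 0; j < 4; j++)
                C[(bi+i)*n + (bj+j)] = acc[i][j];
        }
    }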
Interestingly, it also avoids using the third operand track from the operand cache into the execution units, which in turn means we don't have to expand the bandwidth of the operand cache very much to get interesting consequences.
Suppose the operand cache today can supply three operands per cycle, and suppose we boosted it to supply four operands per cycle. This would in turn allow us to supply two operands to each of two pipelines, meaning that, for example, we could now run two large matrix multiplies in parallel without as much expansion of the hardware as earlier seemed necessary...
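Back-of-the-envelope, as C constants (my own numbers, purely illustrative: assume an FMA wants three source operands from the operand cache normally, but only two once C lives in local storage):

    enum { PIPES = 2, OPS_PER_FMA = 3, OPS_PER_FMA_LOCAL = 2 };

    enum {
        PORTS_WITHOUT_LOCAL = PIPES * OPS_PER_FMA,       /* 6 operands/cycle */
        PORTS_WITH_LOCAL    = PIPES * OPS_PER_FMA_LOCAL  /* 4 operands/cycle --
                                                            the boosted operand
                                                            cache is enough */
    };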
Of course this still requires two of the appropriate execution pipelines, but it mostly solves what looked like the hardest part of the problem of making GPU execution wider.
So to summarize:
- This doesn't (directly) point to any new FMA hardware. The large examples they give (like a 16*16 matrix) aren't saying "we can multiply that in one cycle"; they're saying "this is the kind of example that benefits from being able to use this local per-lane FMA storage".
- It's building on the existing GPU matrix multiplication (which is interesting in its own way, because the permute network it uses can also be used for other things...). You can look at the original matrix multiply patent,
https://patents.google.com/patent/US11256518B2 , but I think I'm correct in describing it as essentially multiplying two side-by-side 4*4s against a separate 4*4.
- The local storage could be used for other things (eg genuine dot products), and probably will be whenever the compiler is smart enough to see a data reuse pattern like I described. The performance and energy boosts are nice, not earth-shattering, but nice, and every bit helps.
- I'm most interested in how it probably makes it easier to widen GPU core execution (and how that's probably the best overall option for boosting performance at low energy, as compared to simply adding more simple GPU cores, i.e. the nVidia-type solution).