Apple Silicon deep learning performance

name99 · Sep 12, 2025

OptimusGrime said:
Eagerly awaiting a tech talk for these gpus, similar to the one they made for the A17/M3.

OK guys, I think I know what happened with the new GPUs. It's NOT addition of a "tensor core" and it's not addition of quantization support (or for that matter sparsity support). It's the addition of
1. a third FP pipeline (so we we now have an FP16 pipelines, an FP32 pipeline, and a mixed pipeline, which multiplies FP16 with FP16, but accumulates to FP32)
2. modification of the FP32 pipeline so that it can take in FP16 results and perform FP16 arithmetic
3. the addition I described earlier of a local register attached to each pipeline cutting down the back and forth traffic between the register cache and the pipelines when performing dot products (eg as a component of matrix multiply).

Put these together and the net result is that
- a lot of generic and graphics code will see a nice boost because FP16 ops can execute in two, rather than one, pipeline AND
- matrix multiply code can take advantage of all the above features to in fact schedule FP16 matrix multiply across three pipelines rather than one, giving us the 3x factor that Apple talked about in the A19 presentation.

My current guess is that this is the technical meaning of the GPU "neural accelerator". Not a tensor core, and not support any sort of quantization. Maybe those will come, I have no idea. Or maybe Apple felt it was most important to prioritize flexibility over specific techniques (like quantization or structured sparsity) that right now seem very important but which may be replaced by something else in two or three years?
The local register patent was https://patents.google.com/patent/US20250103292A1
My writeup for the new patent is ...................................................................

Suppose you want to increase your GPU compute, but not to spend much area, so you can't increase the number of cores. What are your options? The most obvious choice, as we have already discussed, is to go superscalar to 2-wide, so that more of your execution units can be active every cycle. Apple's first pass at this, presumably just getting infrastructure in place, had a disappointing performance boost because while three execution pipes are available (FP32, FP16, and INT) INT is rarely used, so its dual issue doesn't help much, and it's rare to have either a single warp that uses both FP16 and FP32, or two independent warps that use these two different types simultaneously.
Other GPUs have either modified their pipelines so that the same pipeline can handle both FP16 and FP32, or even more ambitiously, so that the same pipeline can handle FP32 and dual FP16 operation. Both of these make sense. An FP32 unit has to have a multiplier and an adder/shifter, and running those slightly less wide gives you essentially FP16; while you need to add a little more logic to get dual FP16 (ie an FP16 pair) but you can transport the data into/out of the execution unit along the existing 32-bit wide path to the operand cache, so you're making better use of the operand cache, which is one of your tightest resources.

Apple has so far avoided this path, saying in patents that to save power they wanted the FP16 and FP32 pipes to be maximally energy efficient with no extra overhead.
This changes with (2022) https://patents.google.com/patent/US12405803B1 Superscalar execution using pipelines that support different precisions, which appears to be one of the features we know is present in the A19, and presumably coming in the M5.

So the above are, in a sense the obvious elements of the patent. There's one less obvious element.
The patent speaks of each SIMD possessing not two but three pipelines, for FP16, FP32, and "mixed precision" meaning FP16 and FP32. Essentially we seem to have pipelines specialized for
- FP16 only,
- FP32 only, and
- FP32=FP16+FP16*FP16 and FP32=FP32+FP16*FP16 (so FP16 multiplies, but accumulating to FP32)

So it seems that in fact
- the amount of logic has increased (the additional, mixed, pipeline added)
- under different circumstances (wider instruction issue, or specialized instructions) not two but three FP16 operations might be possible per cycle.
The A19 presentation talked about "Neural Accelerators" added to the GPU, and 3x the "OPs" (whatever that might refer to) possible per cycle. One possible interpretation of this is that for LLM purposes, executing mostly matrix multiples in FP16, we go from having one FP16 pipeline available to three pipelines available; and while generic FP16 code cannot access all three simultaneously, the specific matrix multiply instructions (making use of the local per FMA register we discussed in the previous patent) can use all pipelines simultaneously?

leman · Sep 12, 2025

name99 said:
OK guys, I think I know what happened with the new GPUs. It's NOT addition of a "tensor core" and it's not addition of quantization support (or for that matter sparsity support). It's the addition of
1. a third FP pipeline (so we we now have an FP16 pipelines, an FP32 pipeline, and a mixed pipeline, which multiplies FP16 with FP16, but accumulates to FP32)
2. modification of the FP32 pipeline so that it can take in FP16 results and perform FP16 arithmetic
3. the addition I described earlier of a local register attached to each pipeline cutting down the back and forth traffic between the register cache and the pipelines when performing dot products (eg as a component of matrix multiply).

Put these together and the net result is that
- a lot of generic and graphics code will see a nice boost because FP16 ops can execute in two, rather than one, pipeline AND
- matrix multiply code can take advantage of all the above features to in fact schedule FP16 matrix multiply across three pipelines rather than one, giving us the 3x factor that Apple talked about in the A19 presentation.

My current guess is that this is the technical meaning of the GPU "neural accelerator". Not a tensor core, and not support any sort of quantization. Maybe those will come, I have no idea. Or maybe Apple felt it was most important to prioritize flexibility over specific techniques (like quantization or structured sparsity) that right now seem very important but which may be replaced by something else in two or three years?
The local register patent was https://patents.google.com/patent/US20250103292A1
My writeup for the new patent is ...................................................................

Suppose you want to increase your GPU compute, but not to spend much area, so you can't increase the number of cores. What are your options? The most obvious choice, as we have already discussed, is to go superscalar to 2-wide, so that more of your execution units can be active every cycle. Apple's first pass at this, presumably just getting infrastructure in place, had a disappointing performance boost because while three execution pipes are available (FP32, FP16, and INT) INT is rarely used, so its dual issue doesn't help much, and it's rare to have either a single warp that uses both FP16 and FP32, or two independent warps that use these two different types simultaneously.
Other GPUs have either modified their pipelines so that the same pipeline can handle both FP16 and FP32, or even more ambitiously, so that the same pipeline can handle FP32 and dual FP16 operation. Both of these make sense. An FP32 unit has to have a multiplier and an adder/shifter, and running those slightly less wide gives you essentially FP16; while you need to add a little more logic to get dual FP16 (ie an FP16 pair) but you can transport the data into/out of the execution unit along the existing 32-bit wide path to the operand cache, so you're making better use of the operand cache, which is one of your tightest resources.

Apple has so far avoided this path, saying in patents that to save power they wanted the FP16 and FP32 pipes to be maximally energy efficient with no extra overhead.
This changes with (2022) https://patents.google.com/patent/US12405803B1 Superscalar execution using pipelines that support different precisions, which appears to be one of the features we know is present in the A19, and presumably coming in the M5.

So the above are, in a sense the obvious elements of the patent. There's one less obvious element.
The patent speaks of each SIMD possessing not two but three pipelines, for FP16, FP32, and "mixed precision" meaning FP16 and FP32. Essentially we seem to have pipelines specialized for
- FP16 only,
- FP32 only, and
- FP32=FP16+FP16*FP16 and FP32=FP32+FP16*FP16 (so FP16 multiplies, but accumulating to FP32)

View attachment 2547872

So it seems that in fact
- the amount of logic has increased (the additional, mixed, pipeline added)
- under different circumstances (wider instruction issue, or specialized instructions) not two but three FP16 operations might be possible per cycle.
The A19 presentation talked about "Neural Accelerators" added to the GPU, and 3x the "OPs" (whatever that might refer to) possible per cycle. One possible interpretation of this is that for LLM purposes, executing mostly matrix multiples in FP16, we go from having one FP16 pipeline available to three pipelines available; and while generic FP16 code cannot access all three simultaneously, the specific matrix multiply instructions (making use of the local per FMA register we discussed in the previous patent) can use all pipelines simultaneously?

I came to a similar conclusion when I saw that patent a few days ago. We’d still need to test the performance to understand these things better, but it seems likely that we are dealing with some clever pipe reuse to implement MXU. The math does not complete add up to me yet since on the IPhone Pro the advertised improvement is 4x compute, but we will see.

At the end of the day however it’s still effectively hardware-accelerated dot product, so it plays out the same way as Nvidia’s tech (who are also using general SIMD for accumulate and whatnot). So in my book it qualifies as a “tensor core” ( just let’s please not call them like this).

name99 · Sep 13, 2025

leman said:
I came to a similar conclusion when I saw that patent a few days ago. We’d still need to test the performance to understand these things better, but it seems likely that we are dealing with some clever pipe reuse to implement MXU. The math does not complete add up to me yet since on the IPhone Pro the advertised improvement is 4x compute, but we will see.

At the end of the day however it’s still effectively hardware-accelerated dot product, so it plays out the same way as Nvidia’s tech (who are also using general SIMD for accumulate and whatnot). So in my book it qualifies as a “tensor core” ( just let’s please not call them like this).

To my mind the significant difference is that the right way to design a DEDICATED matrix multiply unit is like the AMX/SME unit - a 2D array of FMA units performing an outer product - rather than accumulating dot products. The downside is such hardware is very specialized.

I assumed that nV tensor cores are based on outer product based on how "separate" they appear to be from the rest of the datapath, and in distinction to Apple who, so far at least, have preferred to add a permute unit to the existing datapath rather than add specialized HW that wouldn't be relevant to the primary task of graphics.

Am I wrong in this? Do you believe that nV execute a matrix multiply op on the generic datapath? I don't think so because so much seems incompatible with that. For just example, matrix multiply accumulates to some intermediate FP format (like FP24), whereas using datapath would accumulate to a standard FP format like either FP16 or FP32. And the numbers don't seem to match. You can easily scale up the performance of an outer product engine when handling smaller widths (see eg how AMS/SME scales up), as nV does, whereas it's more difficult to do that in datapath.

leman · Sep 13, 2025

name99 said:
To my mind the significant difference is that the right way to design a DEDICATED matrix multiply unit is like the AMX/SME unit - a 2D array of FMA units performing an outer product - rather than accumulating dot products. The downside is such hardware is very specialized.

I assumed that nV tensor cores are based on outer product based on how "separate" they appear to be from the rest of the datapath, and in distinction to Apple who, so far at least, have preferred to add a permute unit to the existing datapath rather than add specialized HW that wouldn't be relevant to the primary task of graphics.

I have no background in designing processing circuits, so all I can do is speculate. I agree with you that outer product engines have a lot of really interesting properties which makes them great for large matrix engines (not to mention the added flexibility). From what I understand one advantage of using hardware dot product circuits is that they can be made smaller - you don’t need a full FMA per input pair, and you can reduce (add) the product in multiple steps using reduced precision. This might be advantageous for GPUs where transistor area is at a premium.

Here are the main reasons why I believe that these new units use some sort of hardware-accelerated dot product rather than outer products:

- relevant parents (both for Apple and Nvidia) explicitly talk about dot product hardware, Apples matrix patent even describes the circuitry in great detail (incidentally they mention 4-way dot product which is consistent with advertised 4x increase in matrix compute)
- dot product hardware likely offer opportunities for die area optimization and synergies with the existing SIMD pipes
- low-level operations for matrix multiplication in Nvidia hardware are instructions rather than loops
- Nvidia’s marketing material always portrays tensor cores as dot product engines
- I’ve seen people likely familiar with the matter (e.g. RWT) mention that Nvidia uses dot products

Last two arguments are weak of course, but provide some food for thought.

Overall, I do agree with you that there is likely some pipe reuse happening and that the new backend works together somehow to implement these reductions. But it’s more than just running matrix multiplication on the data path with shifts like they did until now. I hope that Apple will have a tech note shortly giving us more details.

name99 said:
Am I wrong in this? Do you believe that nV execute a matrix multiply op on the generic datapath? I don't think so because so much seems incompatible with that. For just example, matrix multiply accumulates to some intermediate FP format (like FP24), whereas using datapath would accumulate to a standard FP format like either FP16 or FP32. And the numbers don't seem to match. You can easily scale up the performance of an outer product engine when handling smaller widths (see eg how AMS/SME scales up), as nV does, whereas it's more difficult to do that in datapath.

I believe that Nvidia tensor cores are dot product engines and not outer product engines. I also believe that they do the final accumulate in FP16xFP16 + FP32 mode using general datapath, which is where the 2x performance penalty comes from.

leman · Sep 14, 2025

@name99 Ive been pondering the topic some more this morning, and while I’m far from presenting a coherent argument, I think there are at least some thoughts worth exploring as to why outer product might not be optimal fit for a GPU.

The main observation is that a GPU is fundamentally a vector processor which we want to enhance with some matrix capability. We also want to be as economical as possible, and we want to reuse the existing data pathways and register interfaces as much as possible while minimizing data moves.

Let’s take a GPU with SIMD width of 32, like what Apple uses. How can we build a MXU unit around this width? Full 32 x 32 outer product is obviously out of the question, as that would require a huge FMA and storage tiles. So we need to interpret those 32 elements as a matrix let’s say we pick the primary matrix size of 8x4. Then we could use a 8x8 outer product. That would require 64 FMAs elements (easily done, we already have more compute per partition), and 64 elements (two registers worth) of tile storage. We could work on 8x8 sub-vectors (row-columns) per cycle, and produce a 8x8 matrix output in 8 cycles using a pipelined design.

How do we scale things, however? As said previously, increasing the outer product size is impractical. And the described arrangement strikes me as suboptimal: we only use 64 FMAs, fetch 8 elements per input per cycle (although our data path is 32-wide), and need to provide 64 elements of temporary accumulator storage (more in practice, if we want to process multiple matrices and cache the intermediate results). How do we make things more symmetrical?

I think an obvious solution is to introduce intermediate accumulation of the results. Using a four-wide fold, we need the equivalent of 128 FMAs (again, easily done - and folding can be done at reduced precision, saving us some transistors), data transfer becomes more symmetrical, we don’t need a larger intermediate storage tile, and we can do the same multiplication in two passes only. We can also scale in the future, e.g. to a 8-wide dot product, without changing the data layout or needing any additional intermediate storage. And of course, what I am describing here is just a dot product.

I hope any of this makes sense?

name99 · Sep 14, 2025

leman said:
@name99 Ive been pondering the topic some more this morning, and while I’m far from presenting a coherent argument, I think there are at least some thoughts worth exploring as to why outer product might not be optimal fit for a GPU.

The main observation is that a GPU is fundamentally a vector processor which we want to enhance with some matrix capability. We also want to be as economical as possible, and we want to reuse the existing data pathways and register interfaces as much as possible while minimizing data moves.

Let’s take a GPU with SIMD width of 32, like what Apple uses. How can we build a MXU unit around this width? Full 32 x 32 outer product is obviously out of the question, as that would require a huge FMA and storage tiles. So we need to interpret those 32 elements as a matrix let’s say we pick the primary matrix size of 8x4. Then we could use a 8x8 outer product. That would require 64 FMAs elements (easily done, we already have more compute per partition), and 64 elements (two registers worth) of tile storage. We could work on 8x8 sub-vectors (row-columns) per cycle, and produce a 8x8 matrix output in 8 cycles using a pipelined design.

How do we scale things, however? As said previously, increasing the outer product size is impractical. And the described arrangement strikes me as suboptimal: we only use 64 FMAs, fetch 8 elements per input per cycle (although our data path is 32-wide), and need to provide 64 elements of temporary accumulator storage (more in practice, if we want to process multiple matrices and cache the intermediate results). How do we make things more symmetrical?

I think an obvious solution is to introduce intermediate accumulation of the results. Using a four-wide fold, we need the equivalent of 128 FMAs (again, easily done - and folding can be done at reduced precision, saving us some transistors), data transfer becomes more symmetrical, we don’t need a larger intermediate storage tile, and we can do the same multiplication in two passes only. We can also scale in the future, e.g. to a 8-wide dot product, without changing the data layout or needing any additional intermediate storage. And of course, what I am describing here is just a dot product.

I hope any of this makes sense?

Remember you don't have to convince me on the Apple side. I know that Apple (right now anyway) perform matrix multiply via dot product!

The question is on the nV side. I could believe it either way, but nV not using an outer product seems to me sub-optimal. Outer product engine is simply lower power (because less data movement). Of course because of history they may have painted themselves into a corner...
Apple is in a different situation in that (for now...) their GPU has to cover all eventualities so they need flexible HW. Of course if the future of graphics is some combination of NERFs and Gaussian Splats, as we keep being told, in five years the optimal hardware for a "Graphics" Processing Unit might be very different.

leman · Sep 15, 2025

name99 said:
Remember you don't have to convince me on the Apple side. I know that Apple (right now anyway) perform matrix multiply via dot product!

The question is on the nV side. I could believe it either way, but nV not using an outer product seems to me sub-optimal. Outer product engine is simply lower power (because less data movement). Of course because of history they may have painted themselves into a corner...
Apple is in a different situation in that (for now...) their GPU has to cover all eventualities so they need flexible HW. Of course if the future of graphics is some combination of NERFs and Gaussian Splats, as we keep being told, in five years the optimal hardware for a "Graphics" Processing Unit might be very different.

Is it really the case that outer product require less data movement? You need to iterate through all column-row pairs to compute the final matrix product. It’s a great model for large matrices, but GPUs operate on very small matrices. There are a lot of GPU cores, and it does not make sense to equip watch one with a large matrix engine. If I remember correctly earlier Nvidia designs operated on 8x4 or even 4x4 matrices. If you look at the problem from this perspective, I think for product engines make a lot of sense since they allow you to achieve maximal compute density on small matrices.

Besides, what is a dot product engine if not a parallel outer product engine with built-in reduce?

P.S. I believe this to be the relevant patent from Nvidia: https://patents.google.com/patent/US10338919B2 I only briefly skimmed it, but the drawings appear to depict dot product circuitry and discussion how it can be optimized.

Populus · Sep 15, 2025

OptimusGrime said:
Eagerly awaiting a tech talk for these gpus, similar to the one they made for the A17/M3.

Yeah, me too.

name99 · Sep 15, 2025

leman said:
Is it really the case that outer product require less data movement? You need to iterate through all column-row pairs to compute the final matrix product. It’s a great model for large matrices, but GPUs operate on very small matrices. There are a lot of GPU cores, and it does not make sense to equip watch one with a large matrix engine. If I remember correctly earlier Nvidia designs operated on 8x4 or even 4x4 matrices. If you look at the problem from this perspective, I think for product engines make a lot of sense since they allow you to achieve maximal compute density on small matrices.

Besides, what is a dot product engine if not a parallel outer product engine with built-in reduce?

P.S. I believe this to be the relevant patent from Nvidia: https://patents.google.com/patent/US10338919B2 I only briefly skimmed it, but the drawings appear to depict dot product circuitry and discussion how it can be optimized.

"GPUs operate on very small matrices"
For GRAPHICS yes. But we are talking AI... I thought that was the obvious background context to everything we were saying – nV has already restructured their design to put AI frontmost, whereas Apple is still trying to straddle the halfway position. Of course, as I said nV are building on prior history.

A basic problem is that "we" (that is the public) have no real idea, as far as I can tell, what a "tensor core" actually is.
Is it a dedicated piece of hardware that does NOTHING but execute matrix multiplies?
Or is it a marketing fiction for some instructions and permute hardware that allow the standard FMA pipelines to run 4x8 or 4x4 matrix multiplies at full speed? The latter is what Apple do and is, I suspect, what "Neural Cores" will turn out to be.
nV (and now Apple) draw these things as separate blocks on their marketing diagrams. Which means what? I assumed it meant dedicated HW, but, as I said, marketing fiction?

To quote from https://semianalysis.com/2025/06/23/nvidia-tensor-core-evolution-from-volta-to-blackwell/
"An SM of a Tesla V100 GPU contains 8 Tensor Cores, grouped in partitions of two. Each Tensor Core is capable of computing an equivalent of 4x4x4 matrix multiplication per cycle".
Let's go through this.
So each quad of an SM (conceptually a 32 wide SIMD) has a pair of 4x4 matrix multipliers, which matches 32 FMAs, and matches Apple. And requires two pairs of inputs each 4*4=16 values wide = two 32 wide input registers. So far so good, again matches Apple, and all you need is the permute network.
BUT
if they were performing this via say dot product, they'd need to iterate this over 4 cycles, repermuting the data each cycle. That's what Apple does.

But nV claim
1. they do not need this, they can execute the whole thing in one cycle...
2. they accumulate to something like FP24, whereas reusing the FMA pipes would accumulate to either FP16 or FP32?
(To be fair, I think the FP24 comes somewhat later, in these early generations it's accumulate to FP16 or FP32.)

So I don't know! How do we resolve this?
The only solution I see is something like an nV tensor core is NOT just the existing FMA hardware (although it may be based on it). Rather it's something like four sets of this HW (specialized to only do the FMA part, dropping everything else) and some internal register latches. So every cycle two 32 wide input registers corresponding to two 4*4 input matrices get delivered to the engine, but get latched by a rotating modulo 4 version of the FMA HW, and over the four cycles until the next inputs are latched, the 4*4*4 matrix multiply is performed. (Whether the HW is exactly dot product or outer product no longer matters; it will take the same number of cycles and it's really a question of which would require a smaller internal permute network to rearrange the latched inputs every cycle.)

This is, in a sense, an astonishing claim - that nV have, in each tensor core, 4x the FP HW of their mainline data path! It's presumably only viable because EVERYTHING ELSE has been dropped -- all the instruction control flow, various SRAMs, the FP32, integer and SFU HW, etc etc. Meaning the actually FP16 pipeline was not THAT large a component of an SM quad, small enough that adding a quadruple of it was feasible.

If what I'm saying is correct (and who knows?) this suggests that old Apple could never have hoped to match nV on, let's specify precisely, FP16 matrix multiply. A18 was going into the game with one FP16 pipe per quadrant that could handle matrix multiply, whereas nV was going in with 4 of these. Assuming they're both equally good at large scale scheduling, pulling in the data, etc, then nV has 4x the IPC available.

Secondly, if my theory of what's in the A19 is correct, Apple now has three pipes available (the baseline FP16 pipe, a new pipe that can handle FP16*FP16+ and seems dedicated to that task, and modifications to the FP32 pipe to allow it to handle FP16*FP16+). So Apple is now at 3/4 parity with nV rather than 1/4. Still behind, but not so bad, AND with HW that's somewhat more usable for generic tasks (graphics and compute) not just endless matrix multiply.

Does that seem plausible?

Of course all this is Apple 3/4 matching Volta. After Volta we
- with Turing add INT4 and INT8 support to the "tensor core". I think this is feasible and "easy" given my model for what nV has done. The constraint is the connection to the outside world, the delivery every cycle of those two input registers packing 32 values. If those values are now, say, INT8, the obvious presentation is something like a baseline 8*8*8 matrix multiply. If we treat each FP16 FMA as also incorporating (shared HW or side by side) two INT8*INT8 multipliers, this means we could handle with over 16 cycles (not 4), so INT8 runs both at 2x the FP16 multiply rate, but also at 1/4 the absolute max INT8 rate possible if hardware density were no object. Likewise for four INT4*INT4 multipliers, given a further 2x boost over INT8, but also likewise running at 1/16th the rate if HW density were no limit.

- before Ampere data had to flow from the memory system through the register file to shared memory to the tensor core. This obviously costs some energy, and it also suggests to that, regardless of other constraints, you can't get much useful simultaneous work of the tensor core and the rest of the GPU because the rest of the system is each cycle (more or less) engaged in this moving of data into where the tensor core can access it.
With Ampere some of this is fixed, so we get a path straight from memory to shared memory bypassing some caching and the register files. But the flow per cycle is still shared memory to register file to tensor core.

- with Hopper the tensor core can now read data directly from shared memory. (Much of this actually looks like the tensor core picking up functionality from the ANE, in that it can autonomously loop over n-D data structures and perform data movement without requiring the help of HW that's better used doing more sophisticated data movement).
This at least opens up the hope of doing other useful work simultaneously with matrix multiply.
Of course Apple can't really replicate this with their system, if they don't have the extra HW that the tensor core represents.
Somewhere around Hopper nV also shrink the accumulator for FP16*FP16 down to FP24.

- with Blackwell we replace the shared memory path with a new block of SRAM called tensor memory that only supports specialized access patterns

I didn't expect to go so deep into the weeds!
I guess the big summary is
- as far as I can tell Tensor Core is real, SEPARATE HW within each nV SM that (hah!) each upgrade starts to look more like ANE in various ways

- At the Volta vs A18 level, nV had about a 4x IPC advantage
- At the Volta vs A19 level, nV has about a 1.33x IPC advantage
- At the Turing level nV picks up an additional 2x advantage if you care about INT8, or 4x if you care about INT4, which later translates to a 2x FP8 IPC advantage
- at the Hopper level, nV picks up an additional "autonomous execution" advantage in that the rest of the GPU can do quite a lot while the tensor core grinds away, using its HW to pull data into its SRAM, largely decoupled from the rest of the GPU.
So you were (and probably still are) correct that, in some sense, the lowest level of the tensor core still operates (for FP16*FP16). And I was correct that (for Hopper, and for most purposes) the tensor engine presents as something that takes in and multiplies much larger matrices. As far as dot product vs outer, we may never know, unless someone wants to start the project of deep diving into nV patents. It looks to me like everything is now abstracted enough that they could for example, still use 4*4*4 based dot product as of Blackwell then, if the energy or density simulations show an advantage, switch to an 8*8 based outer product scheme as of Rubin.

leman · Sep 15, 2025

We are talking about ML, yes. The point I’m trying to make that each Nvidia GPU core (SM) is fairly small, and only operates on small portion of the matrix. It’s hundreds of SMs/cores solving the problem cooperatively where the throughput comes from. It’s a very different approach from using just a few large outer product engines to achieve high throughput.

I don’t find the 4*4*4 matmul per cycle to be an astonishing claim either. That’s merely 64 multiply+add operating at half precision, using extremely optimized circuitry for adders. We also know that Nvidia’s SM’s are fairly small, and it doesn’t sound like the tensor core is the largest component. An SM also contains a full SIMD array of 32-bit FP and Int FMA pipes, a few FP64 ALUs, and other stuff.

Besides, Apples claim is not that far off. They claim 4x matrix throughout with these new cores, that would mean 128 matrix multiply-add elements per partition (or 512 per core!) - whether they do some smart pipe reuse trickery to achieve this or something else remains to be determined (and I think your logic makes a lot of sense here).

name99 · Sep 17, 2025

leman said:
We are talking about ML, yes. The point I’m trying to make that each Nvidia GPU core (SM) is fairly small, and only operates on small portion of the matrix. It’s hundreds of SMs/cores solving the problem cooperatively where the throughput comes from. It’s a very different approach from using just a few large outer product engines to achieve high throughput.

I don’t find the 4*4*4 matmul per cycle to be an astonishing claim either. That’s merely 64 multiply+add operating at half precision, using extremely optimized circuitry for adders. We also know that Nvidia’s SM’s are fairly small, and it doesn’t sound like the tensor core is the largest component. An SM also contains a full SIMD array of 32-bit FP and Int FMA pipes, a few FP64 ALUs, and other stuff.

Besides, Apples claim is not that far off. They claim 4x matrix throughout with these new cores, that would mean 128 matrix multiply-add elements per partition (or 512 per core!) - whether they do some smart pipe reuse trickery to achieve this or something else remains to be determined (and I think your logic makes a lot of sense here).

I think I've figured it out.

We have
- definitely improved FP16 "graphics/compute" performance (presumably bcs of extra FP16 pipe)
- definitely improved FP32 "graphics/compute" performance (bcs of just dynamic caching 2? still unclear)
- definitely improved ray tracing performance (in large part due to dynamic caching 2, but also due to now HW support for instancing)

Two big mysteries left:
- is FP16 matrix multiply 2x or 3x improved? Still have not seen a test for this, and maybe we won't see anything legit until code is recompiled.

- what are the Neural Accelerators? I think I now have an answer for this.
Look at https://patents.google.com/patent/US12405786B1
As is usual for just released patents on Google, this comes with no diagrams. You can try the official USPTO web site but, as just one of the many services it provides in its attempt to be worst fscking website on earth, that site won;t allow you to link into it... Good luck trying to navigate it to find the images...

Anyway. I haven't yet read it closely but the GIST is two elements are described
1. hardware that takes in integers (eg 8 bit, but quantized and so 6 or 4b are also feasible), and converts it to FP, along with optional dequant and zero shift; and that can do the reverse, to transform FP into int subject to a quant and zero offset.

2. an INTEGER matrix multiplier that takes values from the register file and outputs integers

The essence seems to be recreation of something like a Turing (ie 2nd gen) nV tensor core.
Differences include
- only integer support
- scaling and offset are already part of the system (so in this respect they're matching what nV has as of around Hopper I think, maybe even Blackwell's block-based scaling?)
- some addition HW (looks like it's in an FP pipeline, maybe that new "mixed" pipeline) to allow rapid conversion between FP material and this block integer material [maybe nV also has this, they've just never bothered to talk about it?]

- presumably if they like this direction they can head down the nV path as rapidly as they like, including adding more HW (so one engine can perform more INT8 or INT4 multiply-add's per cycle) AND, the big next leap forward, making the engine more autonomous from the rest of the GPU, so that it can load/store data in parallel with the rest of GPU compute

This may look somewhat chaotic, providing both extra FP16 AND extra INT8/INT4 (even as they run Apple's own AFM models on the ANE)! My best guess is that when the HW was designed (and remember, this is say 3 to 4 yrs ago) they're operating on ignorance of the future, and rumors about nV. Safest path is to provide what you hope is useful along every dimension!
- ANE still there,
- extra FP16 (generally useful for GPU even if ML inference goes down a path that doesn't use it [as appears to be the case right now?])
- a decent amount of INT8/INT4 support just in case it turns out that really is where LLM inference converges (as again seems to be the case)
- and all this in the GPU, again just in case on-device training takes off (?) or the next big thing, whatever it is, is a poor fit to ANE, or developers want to run their own models on Apple GPU (something like matching CUDA) rather than on ANE?

leman · Sep 17, 2025

Could be. I’ve seen that patent a week ago and interpreted it mostly as trying to eliminate additional overhead from doing data conversion using the regular datapath.

What puzzles me the most is Apples claim of 4x matrix compute improvement combined with the fact that we only see 2x improvement in Geekbench CoreML GPU. I hope someone will play around with the new tensor APIs soon to figure out what’s going on. My iPhone will sadly only arrive in n October…

name99 · Sep 18, 2025

leman said:
Could be. I’ve seen that patent a week ago and interpreted it mostly as trying to eliminate additional overhead from doing data conversion using the regular datapath.

What puzzles me the most is Apples claim of 4x matrix compute improvement combined with the fact that we only see 2x improvement in Geekbench CoreML GPU. I hope someone will play around with the new tensor APIs soon to figure out what’s going on. My iPhone will sadly only arrive in n October…

I think it's clear.
You definitely get 2x FP16 down the normal path.
I *think* you will probably get 3x FP16 down matrix multiply paths (may require recompile)
You get 4x INT8 down matrix multiply paths, and this will likely also require a recompile.

But the existing benchmarks won't show this. GB6 ML maybe could, after recompile?

I think in a month this should all be more clear.

leman · Sep 18, 2025

name99 said:
I think it's clear.
You definitely get 2x FP16 down the normal path.
I *think* you will probably get 3x FP16 down matrix multiply paths (may require recompile)
You get 4x INT8 down matrix multiply paths, and this will likely also require a recompile.

But the existing benchmarks won't show this. GB6 ML maybe could, after recompile?

I think in a month this should all be more clear.

Looking forward to learning more!

Xiao_Xi · Sep 22, 2025

Mojo will support Apple Silicon GPUs very soon.

Apple Silicon GPU support in Mojo

The latest nightly releases of Mojo (and our next stable release) include initial support for a new accelerator architecture: Apple Silicon GPUs! We know that one of the biggest barriers to programming GPUs is access to hardware. It’s our hope that by making it possible to use Mojo to develop...

forum.modular.com

impulse462 · Sep 30, 2025

Aren't the tensor cores systolic arrays?

Also, any idea or insight if apple will add fp8 hardware?

leman · Oct 3, 2025

impulse462 said:
Aren't the tensor cores systolic arrays?

Depends on your definition I suppose. Classical systolic array matrix multiplication has less parallelism than Nvidias tensor cores.

impulse462 said:
Also, any idea or insight if apple will add fp8 hardware?

Current metal shading language does not define a fp8 type. It does seem that A19 has matmul acceleration for i8 however, I measure around 10TFLOPS (~5.5x speed up per core) for the 5-core model.

crazy dave · Oct 3, 2025

leman said:
Depends on your definition I suppose. Classical systolic array matrix multiplication has less parallelism than Nvidias tensor cores.

Current metal shading language does not define a fp8 type. It does seem that A19 has matmul acceleration for i8 however, I measure around 10TFLOPS (~5.5x speed up per core) for the 5-core model.

Is that 5.5x speed up relative to 32bit?

leman · Oct 3, 2025

crazy dave said:
Is that 5.5x speed up relative to 32bit?

It’s relative to what you could do before. Technically, yes, 32bit, but if you wanted to do i8 matmul on pre-A19, that’s what you had at your disposal.

I am preparing a more detailed write-up with the results (it’s slow-going since Apple documentation is absolutely terrible and does not actually match the shipped implementation). But the preliminary findings align with 4x improvement for fp16 with fp32 accumulate and around 5.5x for int8. While nvidia tensor cores are still considerably more performant for low-precision inference, Apple should at least be competitive with Blackwell for fp16 with fp32 accumulate.

OptimusGrime · Oct 3, 2025

leman said:
It’s relative to what you could do before. Technically, yes, 32bit, but if you wanted to do i8 matmul on pre-A19, that’s what you had at your disposal.

I am preparing a more detailed write-up with the results (it’s slow-going since Apple documentation is absolutely terrible and does not actually match the shipped implementation). But the preliminary findings align with 4x improvement for fp16 with fp32 accumulate and around 5.5x for int8. While nvidia tensor cores are still considerably more performant for low-precision inference, Apple should at least be competitive with Blackwell for fp16 with fp32 accumulate.

Is lower precision performance something that can be improved in software, or would it require hardware changes?

leman · Oct 3, 2025

OptimusGrime said:
Is lower precision performance something that can be improved in software, or would it require hardware changes?

You generally need hardware support.

OptimusGrime · Oct 3, 2025

leman said:
You generally need hardware support.

Ok thanks.

Macintosh IIcx · Oct 3, 2025

leman said:
I am preparing a more detailed write-up with the results (it’s slow-going since Apple documentation is absolutely terrible and does not actually match the shipped implementation). But the preliminary findings align with 4x improvement for fp16 with fp32 accumulate and around 5.5x for int8. While nvidia tensor cores are still considerably more performant for low-precision inference, Apple should at least be competitive with Blackwell for fp16 with fp32 accumulate.

This is rather encouraging! Is this based on testing with your iPhone 17 Pro?

leman · Oct 3, 2025

Macintosh IIcx said:
This is rather encouraging!

It’s not bad of a start. Quite frankly though, the software and documentation quality is extremely poor.

Macintosh IIcx said:
Is this based on testing with your iPhone 17 Pro?

iPhone 17, non pro.

OptimusGrime · Oct 3, 2025

leman said:
It’s not bad of a start. Quite frankly though, the software and documentation quality is extremely poor.

Sigh. That seems to be a common thing with Apple lately. I wish they would improve it.

Apple Silicon deep learning performance

macrumors 68030

macrumors Core

macrumors 68030

macrumors Core

macrumors Core

macrumors 68030

macrumors Core

macrumors 604

macrumors 68030

macrumors Core

macrumors 68030

macrumors Core

macrumors 68030

macrumors Core

macrumors 68000

macrumors 68020

macrumors Core

macrumors 68000

macrumors Core

macrumors 6502

macrumors Core

macrumors 6502

macrumors 6502a

macrumors Core

macrumors 6502

Our Staff