16 Execution Units per core
8 Arithmetic Logic Units per Execution Unit

This information is wrong, as @crazy dave has stated previously. I understand the confusion, since this is what Wikipedia claims. I tried correcting it, but my edits were rejected on the grounds that "it does not matter whether the information is correct or not, it must have a source". When I pointed out that "16 units" has no source either, the reply was "doesn't matter, since it's already on the Wikipedia". Which tells you a lot about the editorial process and the reliability of Wikipedia.

Each Apple GPU core (this holds for every Apple Silicon chip to date) contains four compute partitions (the cores proper), each with its own register file, instruction scheduler, and execution units. The execution units are generally 32-wide SIMD. So if we look only at FP32 units, that's one 32-wide unit per compute partition, or 128 FP32 lanes per core. The complete picture is much more complex, of course, since there are other execution units as well as instruction co-issue (since M3).
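To make the arithmetic concrete, here is a minimal Swift sketch of the peak-FLOPS math this layout implies. Only the 4 partitions x 32 lanes structure comes from the description above; the core count and clock in the example are placeholder values, not measurements.

```swift
// Peak-FLOPS arithmetic for the layout described above. The 4 x 32 structure
// comes from the post; the core count and clock below are made-up placeholders.
let partitionsPerCore = 4                              // compute partitions per GPU core
let simdWidth = 32                                     // lanes per 32-wide FP32 unit
let fp32LanesPerCore = partitionsPerCore * simdWidth   // = 128

// An FMA counts as 2 FLOPs per lane per clock.
func peakFP32GFLOPS(cores: Int, clockGHz: Double) -> Double {
    Double(cores * fp32LanesPerCore) * 2.0 * clockGHz
}

print(peakFP32GFLOPS(cores: 10, clockGHz: 1.4))   // hypothetical 10-core GPU: 3584.0
```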
I THINK (patents and some benchmarks I've seen both suggest this) that M5 FP16 throughput is 3x FP32 -- but ONLY for matrix multiply.

M5 has 2x FP16 SIMD throughput (due to the ability to issue two FP16 instructions per clock per partition, courtesy of this patent: https://patentscope.wipo.int/search/en/detail.jsf?docId=US462286342). Note that I only measure ~70% higher FP16 throughput on A19, which is likely due to throttling of the phone chip I was able to test.

For matrix multiplication, the throughput is 4x for FP16 and 8x for INT8. For A19, I measure ~1850 GFLOPS for SIMD/matrix FP32, ~7500 GFLOPS for matrix FP16/BF16, and ~14,800 GOPS for matrix INT8. This is based on the latest benchmarks, done on 26.1 with BFloat16 support enabled. I still need to update my report.
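As a quick sanity check, the measured figures line up with the stated multipliers; a trivial Swift snippet using only the numbers quoted above:

```swift
// Ratios of the A19 figures quoted above (measured values, not theoretical peaks).
let fp32SIMD   = 1_850.0   // GFLOPS, SIMD/matrix FP32
let fp16Matrix = 7_500.0   // GFLOPS, matrix FP16/BF16
let int8Matrix = 14_800.0  // GOPS, matrix INT8

print(fp16Matrix / fp32SIMD)   // ~4.05 -> consistent with 4x FP16 matrix
print(int8Matrix / fp32SIMD)   // ~8.0  -> consistent with 8x INT8 matrix
```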
 
Yikes. Does Stack Overflow serve as a link? An execution width of 32 makes 16 units an impossibility.


The alternative is to just link to Apple's Metal API threadExecutionWidth and say the answer is 32; again, 16 becomes an impossibility. The only other alternative I can think of is to make a GitHub page with the sole purpose of running and displaying the answer, and link to that (maybe waaaaayyyy too much effort).
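For reference, this is all it takes to ask the driver directly. A minimal Swift sketch: the trivial kernel is made up for illustration, but threadExecutionWidth is the actual MTLComputePipelineState property, and it reports 32 on Apple GPUs.

```swift
import Metal

// Compile a trivial kernel and ask the pipeline for its SIMD-group width.
let source = """
#include <metal_stdlib>
using namespace metal;
kernel void noop(device float *out [[buffer(0)]],
                 uint tid [[thread_position_in_grid]]) {
    out[tid] = float(tid);
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: source, options: nil)
let pipeline = try! device.makeComputePipelineState(function: library.makeFunction(name: "noop")!)

print(pipeline.threadExecutionWidth)   // prints 32 on Apple Silicon
```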

 
Moderator's Suggestion

If all the information is put into the first post, and that's made into a WikiPost, then people who spot errors can fix them directly.

None of the individual posts is overly long, so a single WikiPost with all the info wouldn't be insane.
If things start getting too long, you can use the Spoiler tag to make pieces hideable.

This would also simplify things when the M6 comes out. And the M7. They can all be edited into the first WikiPost.

A moderator can merge posts and convert the thread to a WikiPost. Use the "Report" button to make the request.
 
Do you think that patent is active?

1. It suggests 3x throughput for FP16 (instructions routing through the FP16, mixed, and FP32/FP16-rounding pipelines).
2. Which in turn requires the ability to read the register file 3-wide.
3. Because of 2, I assumed that 1 only held for FP16 matrix multiply (with the ability to use the pipeline-local register as an accumulator, thus avoiding some register reads).
4. Which in turn implies 3x for FP16 matrix multiply, and no FP16 on the tensor unit.

But what your numbers suggest is that the tensor unit does have FP16.

The best way I can square this is something like: the idea of the mixed unit was perhaps suggested as one way to boost FP16 (and maybe also INT8) matrix performance, but was overridden by the more aggressive idea of the tensor unit. And so (at least for now) there is no mixed pipeline, but there is still FP16 rounding in the INT32 pipeline, so that under the right conditions you can get 2x FP16 (because, let's face it, there just aren't that many occasions when you want to run lots of FP32 and lots of FP16 simultaneously).
 
The patent I linked is very specific and was made public just shortly before the iPhone 17 made its debut, which follows Apple's pattern of publishing hardware-specific patents. Whether it is implemented in A19/M5 exactly as described, or whether A19/M5 uses some other scheme to double FP16 instruction dispatch, is hard to say.

I think an important point is that instruction issue on the new GPU is still limited to at most two instructions per clock per partition, so it cannot take advantage of more pipelines even if they were available. Regarding register file bandwidth, we already know that the GPU can fetch at least 4x 32-bit operands per register file per clock. There are two pieces of evidence for this: M3 can co-issue FP32 FMA and INT32 ADD, with the addition running at half speed; and the GPU has a dedicated 4-operand compare-and-select instruction.

What is also interesting to me is that the mixed-precision unit mentioned in the patent is said to be capable of FP32 addition. One should test whether there are any interesting effects when mixing FP32 FMA + FP32 ADD.
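Something along these lines would be my starting point. This is a sketch only: the loop count and instruction mix are arbitrary, and the host-side dispatch and timing code is omitted (it would look much like the threadExecutionWidth example earlier in the thread). The kernel chains dependent FP32 FMAs and interleaves independent FP32 and INT32 adds, so any co-issue shows up as the mixed loop running no slower than the FMA chain alone.

```swift
import Metal

// Sketch of a co-issue probe: a dependent FP32 FMA chain interleaved with
// independent FP32 and INT32 adds. Loop count and mix are arbitrary; the
// host-side dispatch and timing code is omitted for brevity.
let kernelSource = """
#include <metal_stdlib>
using namespace metal;

kernel void fma_add_mix(device float *out [[buffer(0)]],
                        uint tid [[thread_position_in_grid]]) {
    float acc = float(tid) * 1e-6f;
    float add = 0.25f;
    int   cnt = int(tid);
    for (uint i = 0; i < 4096; ++i) {
        acc = fma(acc, 1.0001f, 0.5f);  // dependent FP32 FMA chain
        add = add + 1.5f;               // independent FP32 ADD (the case to test)
        cnt = cnt + 3;                  // independent INT32 ADD (co-issues since M3)
    }
    out[tid] = acc + add + float(cnt);
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: kernelSource, options: nil)
print(library.functionNames)   // ["fma_add_mix"] -- it compiles; the timing is the interesting part
```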
 