16 Execution Units per core
8 Arithmetic Logic Units per Execution Unit
This information is wrong, just as @crazy dave has stated previously. I understand the confusion, since this is what Wikipedia claims. I tried correcting it, but my edits were rejected on the grounds that "it does not matter whether the information is correct or not, it must have a source". When I pointed out that the "16 units" figure has no source either, the reply was "doesn't matter, since it's already on Wikipedia". Which tells you a lot about Wikipedia's editorial process and reliability.
Each Apple GPU core (this holds for every Apple Silicon chip to date) contains four compute partitions (the cores proper), with each partition having its own register file, instruction scheduler, and execution units. The execution units are generally 32-wide SIMD. So if we look only at FP32 units, that's 128 such units per core: one 32-wide unit per compute partition, times four partitions. The complete picture is much more complex, of course, since there are other execution units and, since M3, instruction co-issue.
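The core layout above can be turned into a quick back-of-the-envelope calculation. This is just a sketch of the arithmetic; the 1.4 GHz clock below is a hypothetical placeholder for illustration, not a measured value for any specific chip.

```python
# Back-of-the-envelope Apple GPU core math, following the partition layout
# described above. Only the partition count and SIMD width come from the post;
# the clock value is a hypothetical placeholder.

PARTITIONS_PER_CORE = 4   # compute partitions ("cores proper") per GPU core
SIMD_WIDTH = 32           # lanes per 32-wide FP32 SIMD unit

fp32_lanes_per_core = PARTITIONS_PER_CORE * SIMD_WIDTH
print(fp32_lanes_per_core)  # 128, matching the figure above

clock_ghz = 1.4  # hypothetical clock, varies by chip
peak_gflops_per_core = fp32_lanes_per_core * 2 * clock_ghz  # FMA = 2 FLOPs
print(peak_gflops_per_core)  # ~358 GFLOPS per core under these assumptions
```

Multiply by the number of GPU cores to get the chip-level peak; real sustained throughput will land below this due to clocks and scheduling.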
I THINK (patents and some benchmarks I've seen both suggest this) that M5 FP16 throughput is 3x FP32 -- but ONLY for matrix multiply.
M5 has 2x FP16 SIMD throughput (due to the ability to issue two FP16 instructions per clock per partition, courtesy of this patent: https://patentscope.wipo.int/search/en/detail.jsf?docId=US462286342). Note that I only measure ~70% higher FP16 throughput on A19, which is likely due to throttling of the phone chip I was able to test.
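Putting the dual-issue claim next to the measurement: a tiny sketch of how far the measured A19 gain falls short of the theoretical 2x (the throttling hypothesis above would explain the gap).

```python
# Expected vs. measured FP16 SIMD gain on A19. The 1.7 figure is the
# ~70%-higher throughput reported above; 2.0 is the theoretical dual-issue gain.
expected_gain = 2.0   # two FP16 instructions per clock per partition
measured_gain = 1.7   # ~70% higher than FP32, measured on the phone chip

efficiency = measured_gain / expected_gain
print(f"{efficiency:.0%}")  # 85% of the theoretical dual-issue gain
```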
For matrix multiplication, the throughput is 4x for FP16 and 8x for INT8. For A19, I measure ~1850 GFLOPS for SIMD/matrix FP32, ~7500 GFLOPS for matrix FP16/BF16, and ~14800 GOPS (i.e. ~14.8 TOPS) for matrix INT8. This is based on the latest benchmarks, done on 26.1 with BFloat16 support enabled. I still need to update my report.
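The measured numbers can be sanity-checked against the stated multipliers; the ratios come out very close to the 4x and 8x figures. A minimal check using the values from this post:

```python
# Sanity check: do the measured A19 numbers match the stated multipliers?
fp32 = 1850.0    # GFLOPS, SIMD/matrix FP32
fp16 = 7500.0    # GFLOPS, matrix FP16/BF16
int8 = 14800.0   # GOPS, matrix INT8

print(round(fp16 / fp32, 2))  # ~4x FP16 matrix multiplier (4.05)
print(round(int8 / fp32, 2))  # ~8x INT8 matrix multiplier (8.0)
```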