I was testing it in Python, using MLX. I just call the matmul function provided by MLX and select the CPU as the device type.
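Roughly like this, a minimal sketch of the call; the size and dtype here are placeholders, not the exact ones I benchmarked:

```python
# Minimal sketch: MLX matmul routed to the CPU device.
import mlx.core as mx

n = 2048
a = mx.random.normal((n, n), dtype=mx.float16)
b = mx.random.normal((n, n), dtype=mx.float16)

c = mx.matmul(a, b, stream=mx.cpu)  # select the CPU backend for this op
mx.eval(c)                          # MLX is lazy; force the computation to run
```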

I just had a quick look at the MLX sources; it seems they delegate it to the BNNS framework. I assume that has optimized implementations and takes advantage of all the hardware features, so you should be getting the best possible performance.

The GPU indeed performs FP16 at the same rate as FP32, but as @name99 and @Philip Turner explained earlier in the thread, register pressure is lower in half precision, so the GPU manages to achieve throughput closer to the theoretical maximum.

I would be surprised if you were running into register pressure issues with matrix multiplication; it's not very register-heavy. My guess would be better utilization of the available memory bandwidth, which helps keep the ALUs fed.
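A quick way to see the size of the effect is to time the same GPU matmul at both precisions; a rough sketch (matrix size and repeat count are arbitrary, and the explicit mx.eval calls matter because MLX is lazy):

```python
# Rough sketch: time the same square matmul on the GPU at FP32 and FP16 in MLX.
# Size and repeat count are arbitrary; this is not a rigorous benchmark.
import time
import mlx.core as mx

def gpu_gflops(dtype, n=4096, reps=10):
    a = mx.random.normal((n, n), dtype=dtype)
    b = mx.random.normal((n, n), dtype=dtype)
    mx.eval(a, b)
    mx.eval(mx.matmul(a, b, stream=mx.gpu))  # warm-up outside the timed loop
    t0 = time.perf_counter()
    for _ in range(reps):
        # mx.eval forces each lazy matmul to execute before the next one
        mx.eval(mx.matmul(a, b, stream=mx.gpu))
    dt = time.perf_counter() - t0
    return reps * 2 * n**3 / dt / 1e9  # 2*n^3 FLOPs per square matmul

print("FP32 GFLOPS:", round(gpu_gflops(mx.float32)))
print("FP16 GFLOPS:", round(gpu_gflops(mx.float16)))
```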


When I ran the benchmark from corsix, I found that matfp_f16f16_x*y+z achieved double the GFLOPS of matfp_f32f32_x*y+z. As the author explained, the code doesn't issue any loads/stores, so it's not representative of real-world performance. I'm not sure what to make of that: FP16 matmul appears to be supported by AMX at the instruction level, but is not used by Apple Accelerate?

I found it quite surprising that the M4's SME does not expose FEAT_SME_F16F16, even though FP16 does seem to be supported by AMX. No idea what explains the disparity.
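One way to check what macOS actually advertises is the hw.optional.arm.FEAT_* sysctl keys; a quick sketch (whether Apple exposes a key for every SME sub-feature is an assumption on my part):

```python
# Rough sketch: list the SME-related feature flags macOS reports via sysctl.
# Assumption: SME sub-features show up under hw.optional.arm.FEAT_*, the same
# convention used for the other architecture feature flags.
import subprocess

out = subprocess.run(["sysctl", "-a"], capture_output=True, text=True).stdout
for line in out.splitlines():
    if line.startswith("hw.optional.arm.") and "SME" in line:
        print(line)  # e.g. hw.optional.arm.FEAT_SME: 1
```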
 
If it’s forwarding to BNNS, that would explain why I observed a notable performance difference when doing matmul in Julia on AMX via AppleAccelerate. The latter forwards it to cblas and produces superior performance to MLX, and the gap grows even more pronounced with matrix size. However, in Julia it outright refuses to run FP16 matmul on AMX and instead runs it on NEON, completely tanking the performance.
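For anyone wanting to poke at the same cblas path from Python rather than Julia, here is a rough sketch calling Accelerate's cblas_sgemm directly via ctypes. It only covers FP32 (that's all the cblas interface offers) and mirrors what AppleAccelerate.jl does, not MLX's own code path:

```python
# Rough sketch: FP32 GEMM through Accelerate's cblas interface from Python.
import ctypes
from ctypes.util import find_library
import numpy as np

accelerate = ctypes.CDLL(find_library("Accelerate"))
sgemm = accelerate.cblas_sgemm
sgemm.restype = None
sgemm.argtypes = [
    ctypes.c_int, ctypes.c_int, ctypes.c_int,       # order, transA, transB
    ctypes.c_int, ctypes.c_int, ctypes.c_int,       # M, N, K
    ctypes.c_float, ctypes.c_void_p, ctypes.c_int,  # alpha, A, lda
    ctypes.c_void_p, ctypes.c_int,                  # B, ldb
    ctypes.c_float, ctypes.c_void_p, ctypes.c_int,  # beta, C, ldc
]

CblasRowMajor, CblasNoTrans = 101, 111  # standard CBLAS enum values

n = 2048
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)
c = np.zeros((n, n), dtype=np.float32)

# C = 1.0 * A @ B + 0.0 * C
sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
      n, n, n,
      1.0, a.ctypes.data, n,
      b.ctypes.data, n,
      0.0, c.ctypes.data, n)

print("max abs diff vs NumPy:", np.abs(c - a @ b).max())  # sanity check
```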
 

As far as I am aware, Apple's BLAS does not have FP16 support. MLX seems to use cblas for FP32 and BNNS for FP16. I'd expect the codebases for Apple's cblas and BNNS to be comparable, as they are part of the same framework and hopefully worked on by the same team.
 
The fact that it dispatches to different libraries for FP32 and FP16 could explain the performance differences, then.
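If so, the split should show up as a GFLOPS gap between the two dtypes on the CPU device. A rough sketch to check, with sizes picked arbitrarily:

```python
# Rough sketch: CPU matmul throughput in MLX at FP32 (which would take the
# cblas path, per the discussion above) vs FP16 (the BNNS path), by size.
import time
import mlx.core as mx

def cpu_gflops(n, dtype, reps=5):
    a = mx.random.normal((n, n), dtype=dtype)
    b = mx.random.normal((n, n), dtype=dtype)
    mx.eval(a, b)
    mx.eval(mx.matmul(a, b, stream=mx.cpu))  # warm-up
    t0 = time.perf_counter()
    for _ in range(reps):
        mx.eval(mx.matmul(a, b, stream=mx.cpu))
    dt = time.perf_counter() - t0
    return reps * 2 * n**3 / dt / 1e9

for n in (512, 1024, 2048, 4096):
    print(n,
          "FP32:", round(cpu_gflops(n, mx.float32)),
          "FP16:", round(cpu_gflops(n, mx.float16)))
```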

What does Apple actually use AMX for? For small matrices it is faster at matmul than the GPU and more power efficient, but the ANE is even more so for FP16 and INT8, and I doubt they are doing inference in FP32.

Edit: The ANE does seem to suffer from anemic memory bandwidth compared to AMX and the GPU. This could perhaps be one scenario where AMX would be preferred. My observation is based on these benchmark results: https://github.com/Anemll/anemll-bench/blob/main/Results.MD
 