I was testing it in Python, using MLX. I just called the matmul function provided by MLX and selected the CPU as the device.
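Roughly something like this, if it helps — a minimal sketch of the kind of call I mean (the shapes and dtype here are just an example, not my exact script):

```
import mlx.core as mx

a = mx.random.normal((2048, 2048), dtype=mx.float16)
b = mx.random.normal((2048, 2048), dtype=mx.float16)
c = mx.matmul(a, b, stream=mx.cpu)  # run this op on the CPU device
mx.eval(c)                          # MLX is lazy, so force the computation
```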
I just had a quick look at the MLX sources; it seems they delegate this to the BNNS framework. I assume BNNS has optimized implementations and takes advantage of all the hardware features, so you should be getting the best possible performance.
The GPU indeed performs FP16 at the same rate as FP32, but as @name99 and @Philip Turner explained earlier in the thread, the register pressure is lower in half precision, so the GPU manages to achieve throughput closer to its theoretical maximum.
I would be surprised if you were running into register pressure issues on matrix multiplication; it's not very register-heavy. My guess would be better utilization of the available memory bandwidth, which helps keep the ALUs fed.
When I ran the benchmark from corsix, I found that matfp_f16f16_x*y+z achieved double the GFLOPS of matfp_f32f32_x*y+z. As the author explained, the code doesn't issue any loads/stores, so it's not representative of real-world performance. I'm not sure what to make of that: f16 matmul seems to be supported by AMX at the instruction level, but isn't used by Apple's Accelerate?
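One rough way to probe this from the MLX side is to time the same matmul in fp16 and fp32 on both the CPU stream (which goes through BNNS/Accelerate per the earlier post) and the GPU stream, and compare GFLOPS. This is only a sketch — the matrix size and iteration count are arbitrary, and it measures whole dispatches rather than isolating AMX instructions the way corsix's benchmark does — but if the CPU fp16 numbers don't pull ahead of fp32, that would suggest the f16 AMX path isn't being used:

```
import time
import mlx.core as mx

def gflops(n, dtype, device, iters=10):
    a = mx.random.normal((n, n), dtype=dtype)
    b = mx.random.normal((n, n), dtype=dtype)
    mx.eval(a, b)
    # Warm-up so first-dispatch overhead isn't timed.
    mx.eval(mx.matmul(a, b, stream=device))
    start = time.perf_counter()
    outs = [mx.matmul(a, b, stream=device) for _ in range(iters)]
    mx.eval(outs)  # force all the lazy matmuls to actually run
    elapsed = time.perf_counter() - start
    return 2 * n**3 * iters / elapsed / 1e9

for device, name in ((mx.cpu, "cpu"), (mx.gpu, "gpu")):
    for dtype in (mx.float16, mx.float32):
        print(f"{name} {dtype}: {gflops(4096, dtype, device):.0f} GFLOPS")
```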
I found it quite surprising that the M4's SME does not expose FEAT_SME_F16F16, since it does seem to be supported by AMX. No idea what explains the disparity.
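For anyone who wants to check on their own machine: macOS advertises CPU feature flags under the hw.optional.arm sysctl keys, so something like the snippet below lists which SME features the OS reports (the exact set of keys depends on the macOS version; on M4 the F16F16 flag is absent or zero, as noted above):

```
import subprocess

# List all advertised ARM feature flags and keep the SME-related ones.
out = subprocess.run(["sysctl", "hw.optional.arm"],
                     capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    if "SME" in line:
        print(line)
```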