I was testing it in Python, using MLX. I just called the matmul function provided by MLX and selected the CPU as the device.
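Roughly something like this, if it helps — a minimal sketch of the kind of call I mean (the shapes and dtype here are just an example, not my exact script):

```
import mlx.core as mx

a = mx.random.normal((2048, 2048), dtype=mx.float16)
b = mx.random.normal((2048, 2048), dtype=mx.float16)
c = mx.matmul(a, b, stream=mx.cpu)  # run this op on the CPU device
mx.eval(c)                          # MLX is lazy, so force the computation
```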
I just had a quick look at the MLX sources; it seems they delegate this to the BNNS framework. I assume BNNS has optimized implementations and takes advantage of all the hardware features, so you should be getting the best possible performance.
The GPU indeed performs FP16 at the same rate as FP32, but as @name99 and @Philip Turner explained earlier in the thread, the register pressure is lower in half precision, so the GPU manages to achieve throughput closer to its theoretical maximum.
I would be surprised if you were running into register pressure issues on matrix multiplication; it's not very register-heavy. My guess would be better utilization of the available memory bandwidth, which helps keep the ALUs fed.
When I ran the benchmark from corsix, I found that matfp_f16f16_x*y+z achieved double the GFLOPS of matfp_f32f32_x*y+z. As the author explained, the code doesn't issue any loads/stores, so it's not representative of real-world performance. I'm not sure what to make of that: f16 matmul seems to be supported by AMX at the instruction level, but isn't used by Apple's Accelerate?
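One rough way to probe this from the MLX side is to time the same matmul in fp16 and fp32 on both the CPU stream (which goes through BNNS/Accelerate per the earlier post) and the GPU stream, and compare GFLOPS. This is only a sketch — the matrix size and iteration count are arbitrary, and it measures whole dispatches rather than isolating AMX instructions the way corsix's benchmark does — but if the CPU fp16 numbers don't pull ahead of fp32, that would suggest the f16 AMX path isn't being used:

```
import time
import mlx.core as mx

def gflops(n, dtype, device, iters=10):
    a = mx.random.normal((n, n), dtype=dtype)
    b = mx.random.normal((n, n), dtype=dtype)
    mx.eval(a, b)
    # Warm-up so first-dispatch overhead isn't timed.
    mx.eval(mx.matmul(a, b, stream=device))
    start = time.perf_counter()
    outs = [mx.matmul(a, b, stream=device) for _ in range(iters)]
    mx.eval(outs)  # force all the lazy matmuls to actually run
    elapsed = time.perf_counter() - start
    return 2 * n**3 * iters / elapsed / 1e9

for device, name in ((mx.cpu, "cpu"), (mx.gpu, "gpu")):
    for dtype in (mx.float16, mx.float32):
        print(f"{name} {dtype}: {gflops(4096, dtype, device):.0f} GFLOPS")
```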
I found it quite surprising that the M4's SME does not expose FEAT_SME_F16F16, since it does seem to be supported by AMX. No idea what explains the disparity.
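For anyone who wants to check on their own machine: macOS advertises CPU feature flags under the hw.optional.arm sysctl keys, so something like the snippet below lists which SME features the OS reports (the exact set of keys depends on the macOS version; on M4 the F16F16 flag is absent or zero, as noted above):

```
import subprocess

# List all advertised ARM feature flags and keep the SME-related ones.
out = subprocess.run(["sysctl", "hw.optional.arm"],
                     capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    if "SME" in line:
        print(line)
```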