A quick comment on this: if I remember correctly, most integer operations on M3 execute at half the rate of FP arithmetic. You can, however, execute FP and integer operations simultaneously (similar to Nvidia), which can improve performance. Apple also offers a native 24-bit integer MAC (multiply-accumulate), which might be faster. Division and modulo are very slow because they are implemented via injected subroutines. Bit operations, on the other hand, are fast (Apple GPUs have native bit-field extract and popcount instructions).
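To make the practical upshot concrete, here is a minimal host-side C++ sketch of the integer idioms the points above favor: shift/mask in place of division and modulo by a power of two, a bit-field extract, a popcount, and a 24-bit multiply-add. All function names here are illustrative, not an Apple API. If I recall correctly, Metal Shading Language exposes similar built-ins (`extract_bits`, `popcount`, `mad24`), but treat that mapping as an assumption and check the spec. The sketch only shows the arithmetic; it says nothing about actual GPU throughput.

```cpp
#include <cassert>
#include <cstdint>

// Extract `bits` bits starting at bit `offset`.
// Assumes 0 < bits < 32 and offset + bits <= 32.
constexpr std::uint32_t extract_bits(std::uint32_t x, unsigned offset, unsigned bits) {
    return (x >> offset) & ((1u << bits) - 1u);
}

// Portable popcount: clears the lowest set bit on each iteration.
constexpr unsigned popcount32(std::uint32_t x) {
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;
        ++n;
    }
    return n;
}

// Division and modulo by a power of two (divisor = 1 << log2_d, log2_d < 32),
// written as shift and mask so no general divide is needed.
constexpr std::uint32_t div_pow2(std::uint32_t x, unsigned log2_d) {
    return x >> log2_d;
}
constexpr std::uint32_t mod_pow2(std::uint32_t x, unsigned log2_d) {
    return x & ((1u << log2_d) - 1u);
}

// 24-bit multiply-add: only the low 24 bits of a and b are used,
// and the result wraps to 32 bits.
constexpr std::uint32_t mad24(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
    return (a & 0xFFFFFFu) * (b & 0xFFFFFFu) + c;
}

// Small self-check: split a linear index into (row, col) for a
// hypothetical row width of 64 without using / or %.
inline void self_check() {
    const std::uint32_t idx = 1003;
    assert(div_pow2(idx, 6) == idx / 64);
    assert(mod_pow2(idx, 6) == idx % 64);
    assert(extract_bits(0xABCD1234u, 8, 8) == 0x12u);
    assert(popcount32(0xF0F0u) == 8u);
    assert(mad24(3, 4, 5) == 17u);
}
```

The shift/mask forms only apply when the divisor is a power of two known ahead of time; for arbitrary divisors the slow path described above would still be taken.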
I am currently rewriting my benchmarking code for Apple GPUs and hope to publish a comprehensive overview in the next couple of months.
Looking forward to it!