metal-benchmarks 2.0!
Running linear algebra as fast as possible on Apple silicon - philipturner/amx-benchmarks
Thanks for this update! Did you see the recent action in the Twitterverse as it was realized that M2/A15 AMX has BF16 (along with some other improvements like being able to load more X and Y registers in a single instruction)?
I think we continue to differ in our root understanding of AMX. I don't find your (4x4 block) analysis helpful; it may correspond to some aspect of the physical design, but it doesn't scale understanding beyond that. I'd prefer to root my understanding in the patents, which (as far as I can tell) matches dougall's and corsix's analyses. The patents describe an 8x8 array, with a sidebar describing how this could be implemented as a quadruple-pumped 8x2 array (presumably what is used for the E-cores).
The basic unit of that 8x8 array (replicated 64 times) is some Z storage, an FP64 MAC, four FP32 MACs, and some number (maybe four, maybe eight, I can't remember) of FP16 MACs, along with some integer stuff.
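If those per-unit MAC counts are right, the peak math rate of the full 8x8 array falls out directly. A quick sanity check (my arithmetic only; the FP16 count per unit is left at the uncertain high end of four-or-eight):

```python
# Back-of-envelope peak throughput for the hypothesized 8x8 AMX array.
# Assumptions from the discussion above, not measurements: 64 compute units,
# each with 1 FP64 MAC and 4 FP32 MACs; the FP16 count per unit is uncertain.
UNITS = 8 * 8
MACS_PER_UNIT = {"fp64": 1, "fp32": 4, "fp16": 8}  # fp16 count is a guess

for dtype, macs in MACS_PER_UNIT.items():
    flops = UNITS * macs * 2  # one MAC = 2 FLOPs (multiply + add)
    print(f"{dtype}: {UNITS * macs} MACs/cycle = {flops} FLOPs/cycle")
```

So at, say, 3 GHz this hypothesized array would peak around 384 FP64 GFLOPS, with FP32 at 4x that.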
The instructions are embedded in the CPU instruction stream and "executed" by the load-store unit, which packs them into a transaction that it sends to the AMX unit. It's unclear (to me anyway) whether these are treated as stores (so only two can be handled per cycle on the CPU side) or as generic LSU instructions (in which case four can be handled per cycle on the CPU side). The patent record definitely suggests the option to pack multiple instructions in a single transaction; my best guess is that:
- on M1, AMX instructions were treated as writes, so a max of two per cycle, packed as two into a single transaction to AMX
- on A15/M2, AMX instructions were treated as generic LS, so a max of four per cycle, packed as up to four in a single transaction to AMX
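Those two hypotheses can be written down as a toy issue-bandwidth model (the slot counts and packing factors are my guesses from the patent reading, not measurements):

```python
# Toy model: how many AMX instructions can leave the CPU per cycle under
# each hypothesis above. All numbers are speculative.
def amx_instructions_per_cycle(lsu_slots, pack_per_transaction):
    # The CPU can issue at most `lsu_slots` AMX ops per cycle, and the LSU
    # bundles at most `pack_per_transaction` of them into one AMX transaction.
    return min(lsu_slots, pack_per_transaction)

m1_rate = amx_instructions_per_cycle(lsu_slots=2, pack_per_transaction=2)  # store-like
m2_rate = amx_instructions_per_cycle(lsu_slots=4, pack_per_transaction=4)  # generic LS
print(m1_rate, m2_rate)  # 2 4
```

Either way, the front-end rate only matters insofar as it can keep the matrix unit fed; doubling it mostly helps the smaller, non-matmul operations.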
The data transactions (e.g. load X or Y, store Z) go through L2, not directly between the CPU and AMX.
One thing that I think would help in the above document is if you were to clarify what you have actually measured on your various devices vs what you project based on your understanding.
At the very least this would clarify various points, like your assumption (which I suspect to be false) that A15 has essentially double the AMX FP64 matrix-multiply throughput of A14. (I could very well believe that it has twice the throughput for OTHER operations; in particular, as I have stated before, I think Apple is trying to steer AMX toward something that's also, in a somewhat hand-waving way, like AVX-512. Not the AVX-512 permute stuff, but with ever stronger support for matrix-vector and vector-vector operations. That sort of vector-vector throughput seems to have doubled, from what I can tell of the internet chatter.)
Out of curiosity, does your chemistry code alternate between higher- and lower-precision relaxations, so you can actually take advantage of 16- and 32-bit FP on the way to the final result? Or is everything stable enough that you can get useful answers with pure BF16?
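For what it's worth, the pattern I have in mind is classic iterative refinement: do the heavy solve at low precision, then correct the answer with full-precision residuals. A minimal NumPy sketch (float32 standing in for BF16/FP16, since NumPy has no BF16; the matrix and names are purely illustrative):

```python
# Mixed-precision iterative refinement for Ax = b: cheap low-precision
# solves, with residuals and corrections accumulated in FP64.
import numpy as np

rng = np.random.default_rng(0)
n = 64
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
b = rng.standard_normal(n)

# Low-precision solve (stand-in for the BF16/FP16 AMX work).
x = np.linalg.solve(A.astype(np.float32), b.astype(np.float32)).astype(np.float64)

# High-precision refinement: residual in FP64, correction solved in float32.
for _ in range(3):
    r = b - A @ x
    dx = np.linalg.solve(A.astype(np.float32), r.astype(np.float32))
    x += dx.astype(np.float64)

print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))  # near FP64 roundoff
```

If the relaxations are stable enough, each refinement pass recovers several digits, so most of the FLOPs land in the fast low-precision unit.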