Apple Silicon in Sciences

Xiao_Xi · Mar 13, 2023

Philip Turner said:
https://github.com/xrq-phys/blis_apple has some graphs demonstrating GEMM performance. Accelerate is not optimized to switch to a different execution mode once L2 is exceeded. The first peak is exceeding L2, the second is exceeding P-CPU L2D + SLC.

If those graphs are true, Apple has a problem. How is it possible that an open source BLAS implementation based on limited information can beat Apple's BLAS implementation?

It looks like Apple chose the worst option for their BLAS/LAPLACK implementation: underfund their proprietary solution and ignore open source solutions. If Apple was more open, those developers would have given Apple a better implementation of BLAS/LAPLACK for free.

Philip Turner · Mar 13, 2023

Not exactly. The interface to standard BLAS requires putting complex numbers in an interleaved format, which deterioriates AMX performance. There's only so much a standardized linear algebra interface can provide. My work will probably use assembly language to store real and complex numbers separately, maybe even use BF16 on my iPhone's AMX. None of that is possible through BLAS.

name99 · Mar 13, 2023

Philip Turner said:
metal-benchmarks 2.0!

GitHub - philipturner/amx-benchmarks: Running linear algebra as fast as possible on Apple silicon

Running linear algebra as fast as possible on Apple silicon - philipturner/amx-benchmarks

github.com

Thanks for this update! Did you see the recent action in the Twitterverse as it was realized that M2/A15 AMX has BF16 (along with some other improvements like being able to load more X and Y registers in a single instruction)?

I think we continue to differ in our root understanding of AMX. I think your (4x4 block) analysis is not helpful; it may correspond to some aspect of the physical design, but it doesn't scale understanding further than that. I'd prefer to root my understanding in the patents, which is (as far as I can tell) the same as dougall's and corsix' analyses. The patents talk about an 8x8 array with a sidebar that describes how this could be implemented as a quadruple-pumped 8x2 array (presumably what is used for the E-cores).
The basic unit of that 8x8 array (replicated 64 times) is some Z storage, an FP64 MAC, four FP32MACs, and some number (maybe four, maybe eight, I can't remember) of FP16 MACs, along with some integer stuff.

The instructions are embedded in the CPU instruction stream and "executed" by the load-store unit which packs them up into a transaction that sends to to the AMX unit. It's unclear (to me anyway) whether these are treated as stores (so only two can be handled per cycle on the CPU side) or as generic LSU instructions (in which case four can be handled per cycle on the CPU side). The patent record definitely suggests the option to pack multiple instructions in a single transaction; my best guess is that on M1
- AMX instructions were treated as writes, so max of two per cycle, packed as two into a single transaction to AMX
on A15/M2
- AMX instructions were treated as generic LS, so max of four per cycle, packed as up to four in a single transaction to AMX
The data transactions (eg load X or Y, store Z) happen to L2, not between the CPU and AMX.

One thing that I think would help in the above document is if you were to clarify what you have actually measured on your various devices vs what you project based on your understanding.
At the very least this would clarify various points like your assumption (which I suspect to be false) that A15 has essentially double the AMX matrix multiply FP64 throughput of A14. (I could very well believe that it has twice the throughput for OTHER operations, in particular as I have stated before, I think Apple is trying to steer AMX to something that's also, in a somewhat hand-waving way, like AVX-512. Not the AVX512 permute stuff, but with ever stronger support for matrix-vector and vector-vector operations. This sort of vector-vector throughput seems to have double from what I can tell of the internet chatter.)

Out of curiosity does your chemistry code alternate between higher and lower res relaxations, so you can actually take advantage of 16 and 32 bit FP on the way to the final result? Or is everything stable enough that we can get useful answers with pure BF16?

name99 · Mar 13, 2023

leman said:
If your data is very large, you will be limited by the system RAM bandwidth. AMX units have tremendous throughput and are connected to the fast L2 cache directly, but as your dimensions increase you'll spend increasing amount of time waiting on RAM.

At least this would be my first intuition, I could very well be wrong.

For anything else, yes. Not for matrix multiply...
You can see how this plays out with the BLIS link Phillip included, xrq-phys/blis_apple

BLIS are (I assume) doing adequate blocking so that they can mostly run compute-limited; looks like Accelerate just aren't bother with blocking (perhaps beyond some super-basic register-sized blocking) so at larger sizes they're at the mercy of whatever prefetching notices and gives them. (Which may be nothing! Unlike most designs, Apple's pre-fetcher sits in the CPUs watching the entire stream, it doesn't sit in the L1 or L2; so if there's no stream visible to the CPU prefetcher ...!
I never thought of this before -- seems like there's scope for a nice win here just by tacking some prefetcher logic onto each AMX unit...?)

The numbers kinda make sense. DGEMM Accelerate peaks at around 500x500, so around 2MB of storage -- fits nicely in the 3MB of "near" L2 of a single core; and we retain good performance out to about 1000x1000 8MB fitting in the entire L2. (Of course we don't know how exactly how the near/far L2 distinction plays into AMX, but it does seem to be present.) Then for Accelerate it's off the cliff as we thrash L2, whereas BLIS is (I assume) using some sort of large blocking, so gets much more reuse out of each block that's shifted into L2.

Of course we don't know the motivations driving Apple's optimizations. They've clearly done a great job for smaller matrices so that may be the primary use case within the company, with larger matrices not (yet?) a priority.

Philip Turner · Mar 13, 2023

name99 said:
quadruple-pumped 8x2 array

You're saying that on the A13/A14/M1 P-AMX and A13/A14/M1/A15 E-AMX, each block pumps 8x2 each cycle? That would explain the geometry of the AMX and the lopsided aspect ratio of each individual block. If this analysis is correct, I'll change 4x4 to 2x8 for FP64, scaling accordingly to other precisions. The P-AMX coprocessor would have 4 times (2x8) = 8x8 = 1 GEMM instruction/cycle. E-AMX coprocessor would have 2 times (2x8) = 4x8 = 1/2 GEMM instruction/cycle.

name99 said:
One thing that I think would help in the above document is if you were to clarify what you have actually measured on your various devices vs what you project based on your understanding.
At the very least this would clarify various points like your assumption (which I suspect to be false) that A15 has essentially double the AMX matrix multiply FP64 throughput of A14.

I suspect is to be false as well: https://github.com/corsix/amx/issues/6

name99 said:
Out of curiosity does your chemistry code alternate between higher and lower res relaxations, so you can actually take advantage of 16 and 32 bit FP on the way to the final result? Or is everything stable enough that we can get useful answers with pure BF16?

I don't know yet, but Jack Dongarra made some HPC code that uses FP16 tensor cores for FP64 linear algebra. The Fugaku supercomputer also cheated some FP64 benchmarks to achieve 1 ExaFLOPS with FP16 for the bulk of calculations, using FP64 error corrections. Another research paper said FP32 alone won't suffice for DFT, but maybe mixed FP32-FP64 will.

I hope to perform some kind of error analysis, then find the precision I really need for DFT. If FP32 has 4x the throughput of FP64, BF16 has 4x the throughput of FP32, that really motivates me to use those lower precisions. It's much faster even if it costs slightly more iterations of the Newton-Raphson iteration. If it was only 2x better, it probably wouldn't be worth trying. Another core realization is, my use case likely doesn't have ill-structured problems. Dongarra said that his mixed-precision linear algebra works, as long as the problem isn't ill-structured. That would be true for heavy metals with close d-orbitals, but not carbon atoms used for molecular nanotechnology. My goal is to try experimenting with APM tooltips vastly larger than Freitas could access with a 4-core Intel CPU from 2010.

name99 · Mar 13, 2023

Xiao_Xi said:
If those graphs are true, Apple has a problem. How is it possible that an open source BLAS implementation based on limited information can beat Apple's BLAS implementation?

It looks like Apple chose the worst option for their BLAS/LAPLACK implementation: underfund their proprietary solution and ignore open source solutions. If Apple was more open, those developers would have given Apple a better implementation of BLAS/LAPLACK for free.

No, it only shows that Apple have not bothered to implement blocking at the larger levels (by page and/or L2 cache size). I assume at some point they will get to it.

We have no idea what their priorities are, but they PROBABLY reflect the scramble that is happening in the AI world right now, meaning that Accelerate (and whatever gets added to HW) is driven by whatever are the currently hot sizes required for diffusion and LLMs. Look at the Apple performance at smaller sizes -- they are nicely (slightly) faster than BLIS, which suggests they have worked hard to optimize for those sizes.

Philip Turner · Mar 13, 2023

name99 said:
Of course we don't know the motivations driving Apple's optimizations. They've clearly done a great job for smaller matrices so that may be the primary use case within the company, with larger matrices not (yet?) a priority.

My use case will likely be those smaller matrices. 100-200 atoms is unheard of in quantum chemistry on consumer devices; probably taking days to simulate because DFT scales O(n^3). With 4 electrons per carbon atom, N=electrons, bfloat16-complex, matrices could be 640 KB to 2.5 MB for my use cases. That might also fit into the "near L2" you describe. How much "near L2" is available to any single core on A15, and does that differ from the M1 Max?

Philip Turner · Mar 13, 2023

name99 said:
reflect the scramble that is happening in the AI world right now, meaning that Accelerate (and whatever gets added to HW) is driven by whatever are the currently hot sizes required for diffusion and LLMs.

In your work, did you find the ANE supporting native unpacking from Int4 to FP16? I'm trying to get the LLM leaked from 4chan to run on Apple silicon.

leman · Mar 13, 2023

name99 said:
, I think Apple is trying to steer AMX to something that's also, in a somewhat hand-waving way, like AVX-512. Not the AVX512 permute stuff, but with ever stronger support for matrix-vector and vector-vector operations.

I think there is a lot of correspondence between AMX and the SVE streaming mode. In fact, I wouldn't be surprised if ARM got some design input from Apple here. If I am not mistaken, Apple was one of the first to implement a hardware division between latency-optimised short vectors and throughtput-optimised large vector units?

Anyway, I wouldn't be surprised if the next generation exposed AMX units to developers via SVE streaming mode, while retaining the regular 128-bit vectors in the "normal" SVE mode. Looking at corsix repo, there doesn't seem to be much functionality missing.

name99 · Mar 13, 2023

Philip Turner said:
You're saying that on the A13/A14/M1 P-AMX and A13/A14/M1/A15 E-AMX, each block pumps 8x2 each cycle? That would explain the geometry of the AMX and the lopsided aspect ratio of each individual block. If this analysis is correct, I'll change 4x4 to 2x8 for FP64, scaling accordingly to other precisions. The P-AMX coprocessor would have 4 times (2x8) = 8x8 = 1 GEMM instruction/cycle. E-AMX coprocessor would have 2 times (2x8) = 4x8 = 1/2 GEMM instruction/cycle.

No. P AMX units have the full 8x8 array.
E AMX units have the 8x2 array.

The 2019 patent https://patents.google.com/patent/US20200272597A1 seems the best in terms of explaining this (as opposed to the older 2016/17 patents that most people refer to). The whole patent is interesting, but the discussion around FIGs 15..17 talks about alternative possible ways to shrink an implementation.
Think eg

Screenshot 2023-03-13 at 12.58.30 PM.png

You have one quarter the amount of FP logic (but the same amount of register storage) and you perform each operation by sending, over four cycles, successive two elements of Y to be multiplies against the full X, and stored in the appropriate Z.
There are multiple ways you can slice this (as the patent discusses) but this one is the easiest to explain and (IMHO) the most likely given the AVX-512-like future.

I discuss this a lot more in the AMX section of volume 3 of my PDFs: https://github.com/name99-org/AArch64-Explore

name99 · Mar 13, 2023

Philip Turner said:
In your work, did you find the ANE supporting native unpacking from Int4 to FP16? I'm trying to get the LLM leaked from 4chan to run on Apple silicon.

There is stuff for doing that. Probably the genLUT instruction is the best choice.

amx/genlut.md at main · corsix/amx

Apple AMX Instruction Set. Contribute to corsix/amx development by creating an account on GitHub.

github.com

You know that LLaMa has already been ported! The internet is fast...

https://twitter.com/x/status/1634488664150487041

But seems like they are running on CPU, not AMX or GPU? At least that's the impression I get from quickly scanning the thread.

name99 · Mar 13, 2023

Philip Turner said:
My use case will likely be those smaller matrices. 100-200 atoms is unheard of in quantum chemistry on consumer devices; probably taking days to simulate because DFT scales O(n^3). With 4 electrons per carbon atom, N=electrons, bfloat16-complex, matrices could be 640 KB to 2.5 MB for my use cases. That might also fit into the "near L2" you describe. How much "near L2" is available to any single core on A15, and does that differ from the M1 Max?

The CPU L2 details are unclear but what SEEMS to be the case is something like (I'll give M1 numbers, details differ of course for other designs)
- we have four P cores in a cluster, and 12MB of L2
- but that L2 is not "flat" rather it seems like
- each core definitely has 3MB of "near" L2 that is slightly faster (say 15 cycles)
- it can use the other quadrants as L2 BUT this is slightly slower (say 18 cycles) AND there appears to be a limit of quite how much you can use, something like half of it but not all of it.
So what appears to be the case is a single core has access to something like 3+(3*1.5)=7.5M of L2.

You see these effects in graphs like

The 2020 Mac Mini Unleashed: Putting Apple Silicon M1 To The Test

www.anandtech.com

and

Apple Announces The Apple Silicon M1: Ditching x86 - What to Expect, Based on A14

www.anandtech.com

(for the A14)

where we get glitches at unexpected places. (The A14 glitches at what appears to be 4MB [near] then at 6MB [near+2MB of far]. The M1 graph doesn't seem to show anything at 3, but does glitch at about 7.5.)

I think the takeaway is that if you are pushing L2 occupancy (and are limiting code to a single core) see what happens if you constrain yourself to 3 and 7.5MB as opposed to rushing to 12MB. (And of course, like I said we have no idea how this actually plays with AMX, I'm just looking at the BLIS graph...)

For A15 the graph is

The Apple A15 SoC Performance Review: Faster & More Efficient

www.anandtech.com

If we believe the design to be like A14/M1, you presumably have something like 6M of near L2, and access to a further 3MB of far L2. The graph definitely suggests that (first glitch at ~6MB, second glitch at what could be 8 or 9MB).

I thought with M1 Max the CPU L2 is same as M1, so my numbers above would be the same per cluster?

name99 · Mar 13, 2023

leman said:
I think there is a lot of correspondence between AMX and the SVE streaming mode. In fact, I wouldn't be surprised if ARM got some design input from Apple here. If I am not mistaken, Apple was one of the first to implement a hardware division between latency-optimised short vectors and throughtput-optimised large vector units?

Anyway, I wouldn't be surprised if the next generation exposed AMX units to developers via SVE streaming mode, while retaining the regular 128-bit vectors in the "normal" SVE mode. Looking at corsix repo, there doesn't seem to be much functionality missing.

Oh, there's massive functionality missing if you want to implement SVE on AMX. Totally different design principles.
I am unaware of any list of exactly what the SVE streaming mode omits, but I would expect it still includes the shuffles (which do not fit into what AMX right now can do) and predicates (which means predicate registers and various logic to set and manipulate those registers) which is missing today.

Honestly I now suspect Apple will bypass SVE. It seems like the germ of a good idea that got swallowed by marketing so it was re-prioritized from "optimal performance and developer ease of use" to "optimal for slideware that says it can be used everywhere from watches to supercomputers". Some of what's in there is just SO DUMB (vectors that 3 or 5 or 7 times 128B in length?) that it's hard to believe this came out of ARM (rather than Intel...)

Jeff Gonion, a guy at Apple with lots of seniority put a lot of work (like fifteen years or so) into an ISA design called Macroscalar, with a LARGE patent footprint. The idea behind this is a few primitives added to a base ISA that allow
- the instruction set to specify loops of variable length that are like SIMD loops AND
- the HW to detect these loops at Decode and execute them on specialized HW.

The first installment is here, and it goes on for years filling in ever more details:

US7975134B2 - Macroscalar processor architecture - Google Patents

A macroscalar processor architecture is described herein. In one embodiment, an exemplary processor includes one or more execution units to execute instructions and one or more iteration units coupled to the execution units. The one or more iteration units receive one or more primary...

patents.google.com

(There is a public proposal for the same sort of thing for POWER, but that's not [yet anyway] endorsed by IBM.)
This scheme gives you basically the advantages of SVE without the disadvantages. We'll see what happens!

leman · Mar 13, 2023

name99 said:
Oh, there's massive functionality missing if you want to implement SVE on AMX. Totally different design principles.
I am unaware of any list of exactly what the SVE streaming mode omits, but I would expect it still includes the shuffles (which do not fit into what AMX right now can do) and predicates (which means predicate registers and various logic to set and manipulate those registers) which is missing today.

I was also unable to locate an exact list of features retained in streaming SVE mode, although I’d be surprised if complex shuffles are still part of it. The SVE matrix extension on the other hand appears to be very similar to AMX in functionality, as both are based on outer product engines and have equivalent instructions.

name99 said:
Honestly I now suspect Apple will bypass SVE.

It’s possible, but it were advantageous if the ISA were stabilized and directly useable by software writers at some point. Could really unlock new markets for Macs.

name99 said:
Some of what's in there is just SO DUMB (vectors that 3 or 5 or 7 times 128B in length?) that it's hard to believe this came out of ARM (rather than Intel...)

They fixed that. Turns out it’s easy to retroactively edit a spec if there are no hardware implementations

name99 said:
Jeff Gonion, a guy at Apple with lots of seniority put a lot of work (like fifteen years or so) into an ISA design called Macroscalar, with a LARGE patent footprint. The idea behind this is a few primitives added to a base ISA that allow
- the instruction set to specify loops of variable length that are like SIMD loops AND
- the HW to detect these loops at Decode and execute them on specialized HW.

Kind of sound like the approach RISC–V took with RVV

Xiao_Xi · Mar 13, 2023

Philip Turner said:
My work will probably use assembly language to store real and complex numbers separately, maybe even use BF16 on my iPhone's AMX. None of that is possible through BLAS.

Are you going to make an addon for BLIS?

Simply put, an addon in BLIS provides additional APIs, operations, and/or implementations that may be useful to certain users. An addon can be thought of as a standalone extension of BLIS that does not depend on any other addon, although addons may utilize existing functionality or kernels within the core framework.

blis/docs/Addons.md at master · flame/blis

BLAS-like Library Instantiation Software Framework - flame/blis

github.com

Philip Turner · Mar 13, 2023

Xiao_Xi said:
Are you going to make an addon for BLIS?

No; I'll probably hard-code a small subset of the LAPACK functionality into a custom quantum chemistry library. I need to check name99's PDF - or just run some Accelerate benchmarks - to see whether the A15 AMX can process interleaved complex at full throughout. If it can't, my work will be impossible to expose in BLAS or LAPACK anyway.

Philip Turner · Mar 13, 2023

name99 said:
But seems like they are running on CPU, not AMX or GPU? At least that's the impression I get from quickly scanning the thread.

No; I explained that in a GitHub thread I linked.

name99 said:
There is stuff for doing that. Probably the genLUT instruction is the best choice.

I was asking about the Apple Neural Engine. However, the GPT architecture seems to be O(n^2) when not doing batches, so it's probably memory bound. People are running shader code-only code on Nvidia GPUs with tensor cores, and that's the fastest implementation out there. The O(n^3) AMX probably won't provide much speedup over multicore CPU.

Philip Turner · Mar 13, 2023

name99 said:
If we believe the design to be like A14/M1, you presumably have something like 6M of near L2, and access to a further 3MB of far L2. The graph definitely suggests that (first glitch at ~6MB, second glitch at what could be 8 or 9MB).

I thought with M1 Max the CPU L2 is same as M1, so my numbers above would be the same per cluster?

The M1 Max should have the same numbers, because I'm considering sticking to single-core only. If the A15 can reach the same or higher throughput than both P-blocks combined on M1 Max, it's easier to just write an algorithm for single-core/single-block. Develop on M1, then adapt to A15 when I'm ready for production. Glitching at a higher number of megabytes also means the A15 has more working memory than M1/Max.

This is kind of deja vu, where the phone has more processing power than the desktop. I was once in a similar situation with my A14 iPhone 12 Pro Max and the Intel Mac mini. I wanted to run machine learning frameworks on the A14, but didn't have access to tensorflow-macos because the A14 wasn't running macOS. To be fair, I did overestimate the A14 GPU's FP32 and underestimate the Intel CPU's FP32 by a factor of 1.5-2 each.

I have a desk fan ready to blow at my phone while it runs a day-long simulation. I can transform a rolled up yoga mat into a compressed air tunnel, amplifying the air velocity at the other end. I've always wanted to do this...

name99 · Mar 13, 2023

Philip Turner said:
No; I'll probably hard-code a small subset of the LAPACK functionality into a custom quantum chemistry library. I need to check name99's PDF - or just run some Accelerate benchmarks - to see whether the A15 AMX can process interleaved complex at full throughout. If it can't, my work will be impossible to expose in BLAS or LAPACK anyway.

In theory this sort of de-interleaving SHOULD be possible in that the AMX engine has a front-end between the actual load/store and the X or Y registers, and this front-end can perform a variety of shuffles and interleaves on the data. I'm not sure the extent to which this functionality has been exposed in Dougall and corsix' work, but if you are willing to play with assembly you may find you can get what you want at close to free...

Philip Turner said:
The M1 Max should have the same numbers, because I'm considering sticking to single-core only. If the A15 can reach the same or higher throughput than both P-blocks combined on M1 Max, it's easier to just write an algorithm for single-core/single-block. Develop on M1, then adapt to A15 when I'm ready for production. Glitching at a higher number of megabytes also means the A15 has more working memory than M1/Max.

This is kind of deja vu, where the phone has more processing power than the desktop. I was once in a similar situation with my A14 iPhone 12 Pro Max and the Intel Mac mini. I wanted to run machine learning frameworks on the A14, but didn't have access to tensorflow-macos because the A14 wasn't running macOS. To be fair, I did overestimate the A14 GPU's FP32 and underestimate the Intel CPU's FP32 by a factor of 1.5-2 each.

I have a desk fan ready to blow at my phone while it runs a day-long simulation. I can transform a rolled up yoga mat into a compressed air tunnel, amplifying the air velocity at the other end. I've always wanted to do this...

The main hassle with doing it on your phone is the dev environment is, of course, more hassle. But if you have the code debugged and ready to run, go for it!

leman · Mar 14, 2023

Philip Turner said:
The M1 Max should have the same numbers, because I'm considering sticking to single-core only.

I was under impression that one had to use at least two cores from the same cluster to saturate an AMX unit?

Xiao_Xi · Mar 14, 2023

name99 said:
Look at the Apple performance at smaller sizes -- they are nicely (slightly) faster than BLIS, which suggests they have worked hard to optimize for those sizes.

Aren't you giving too much credit to the team that didn't bother to update BLAS/LAPLACK for so long so that 3rd party math libraries could use them?

Could those graphs show the legacy behavior of those libraries when Apple developed them for Intel CPUs? What should those graphs look like for Apple's Intel-based hardware?

Out of curiosity, is there any difference between BLIS and BLAS?

leman · Mar 14, 2023

Xiao_Xi said:
Aren't you giving too much credit to the team that didn't bother to update BLAS/LAPLACK for so long so that 3rd party math libraries could use them?

Updating the interfaces wasn’t a priority for a while, it obviously changed recently. Generally, Apple‘s team skills in numerical optimizations are top notch. A few years ago I spent some time trying to match some geometric routines they implement in Accelerate (this was on x86), but after using every trick in the book known to me I was still 5–10% slower.

Policy (which features to implement when) is another matter entirely.

Philip Turner · Mar 14, 2023

leman said:
A few years ago I spent some time trying to match some geometric routines they implement in Accelerate (this was on x86), but after using every trick in the book known to me I was still 5–10% slower.

That's why if I want to achieve maximum performance, I have to invest a disproportionate amount of time into just the narrow subset of operations that I need. Also, I'm only considering square matrices. George Hotz did supposedly achieve perfect 80% utilization - maximum performance - on M1 Max GEMM using custom simdgroup_matrix code, but not optimized for edge cases.

leman said:
I was under impression that one had to use at least two cores from the same cluster to saturate an AMX unit?

That was an old interpretation that I recently did away with. Accelerate's GEMM performance deteriorates when you have >1 threads/P-block, hopefully just because of a software thread scheduling issue and not hardware limitations. corsix/amx did achieve slightly higher ALU utilization using multithreaded. I might give 4 threads (M1 Max) and 2 threads (A15) a shot to improve ALU utilization, but only after investing extensive time into the single-threaded algorithm.

I think my current README on philipturner/amx-benchmarks is misleading about CPU-AMX bandwidth. I come from the background of GPGPU, where register-ALU bandwidth is a major performance bottleneck. But @name99's clarification reminded me, there's not a direct data link between any CPU core and the P/E-block's AMX coprocessor. It just sends multiple instructions/cycle, which is 128-256 bits of instructions plus 128-256 bits of pointers. The bulk data - 1024-2048 bits - is fed directly from L2 to AMX. Sort of like the AMX coprocessor is its own CPU core. The only way multicore CPU improves AMX utilization is by maybe issuing more instructions/cycle.

Also note that theoretically, I can offload any FP32 steps to the GPU. There's just tradeoffs between the latency of synchronizing multiple threads/accelerators and the marginal throughput gains of doing so. The M1 Max's theoretical GPU FP32 throughput slightly exceeds the A15's AMX BF16 throughput.

leman · Mar 14, 2023

name99 said:
Oh, there's massive functionality missing if you want to implement SVE on AMX. Totally different design principles.

P.S. Another interesting thing that ARM SVE/SME and Apple use outer products as their basic operation, while Intel/Nvidia use inner product (dot product). I am curious about the relative merits of both approaches. The dot product route allows matrix multiplication to be done as a single operation, but you can't really do anything else; the outer product requires multiple instruction invocations to do matrix multiplication, but I suppose that it has more potential applications?

Philip Turner · Mar 14, 2023

No, I think it’s the other way around. Outer product does GEMM in a single instruction.

Apple Silicon in Sciences

macrumors 68000

macrumors regular

macrumors 68030

macrumors 68030

macrumors regular

macrumors 68030

macrumors regular

macrumors regular

macrumors Core

macrumors 68030

macrumors 68030

macrumors 68030

macrumors 68030

macrumors Core

macrumors 68000

macrumors regular

macrumors regular

macrumors regular

macrumors 68030

macrumors Core

macrumors 68000

macrumors Core

macrumors regular

macrumors Core

macrumors regular

Our Staff