> Which mental model is correct?
I just came to a realization that I was wrong about this. When you're reverse-engineering something, you make a mental model of it, test whether it explains your observations, and accept it if it does. I recently read a 2016 research paper about how GPUs hide latencies. It heavily referenced Little's Law, which in simple terms says that the amount of work you need in flight equals latency multiplied by throughput.
> Wouldn't this suggest that the result of the FMA instruction is available immediately for the next instruction to use?
I highly suggest you read this research paper on how GPUs hide latency: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.pdf. It explains things like Little's Law and how ALU utilization varies with occupancy.
Both these points are correct.
I think the idea of AMX being a coprocessor simply means that it's invoked from the CPU, using CPU instructions, as opposed to other units which operate on their own and are controlled by passing messages or commands.
What makes AMX special, however, is that it's shared between all the CPUs in a cluster. If you want, you can consider it a specialized fifth processor that shares the same L2 cache.
Phillip, we KNOW how the system is designed; you can read the (long stream of) patents!
You can easily see how the old (incorrect) model leads to a certain understanding of the M1 Pro's AMX cluster. I thought each block on the die shot was a separate station. These stations are incredibly inefficient: their multipliers are active for only one out of every 4 cycles. They duplicate the circuitry required to generate an outer product from two 64-byte operands. This matched my understanding that all FFMA circuitry has 4x redundancy, 4x latency, and 1x throughput.
Let's redesign it like the ANE. Every cycle, you dispatch a 64-byte by 64-byte FFMA to the AMX. It breaks that up into 32-byte by 32-byte chunks, then passes each chunk to a separate block on the die shot. The hardware multipliers can multiply this data in a single cycle, but the train station is complex: other circuitry must preprocess the data before multiplying. The experienced latency should therefore jump to 4 cycles, not 1 cycle.
The way the AMX actually dispatches is more complex. A central dispatcher receives data from the CPU cores, then reroutes it to the 4 AMX blocks I talked about. There's also no need for the Z registers to physically reside in the blocks. Each cycle, every block simultaneously does the following (a small behavioral sketch in C++ follows the stage list):
Stage 1:
(a) Read an 8x8 tile of floats or a 4x4 tile of doubles of the Z register from the central dispatcher.
(b) Read 8 floats or 4 doubles of the X register from the central dispatcher, and scatter them across the 8x8/4x4 grid like I visually depicted in another comment.
(c) Read 8 floats or 4 doubles of the Y register from the central dispatcher, and scatter them across the 8x8/4x4 grid like I visually depicted in another comment.
Stage 2:
(d) Multiply whatever data's currently in the 8x8/4x4 grid, writing the outputs to another 8x8/4x4 grid of registers.
Stage 3:
(e) Add whatever data's in the second set of registers to the Z data you originally fetched from the central dispatcher. Write these outputs to yet another 8x8/4x4 grid of registers.
Stage 4:
(f) Write the final Z registers back to the central dispatcher.
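To make the stage description concrete, here's a tiny behavioral sketch in C++ of one block processing one fp32 chunk (8 floats of X, 8 floats of Y, and an 8x8 Z tile). It only models the text above; it is not derived from the actual hardware:

```cpp
#include <array>
#include <cstdio>

using Vec8  = std::array<float, 8>;
using Tile8 = std::array<std::array<float, 8>, 8>;

// Behavioral model of one AMX block handling one 32 B x 32 B fp32 chunk.
Tile8 amx_block_step(const Tile8 &z_in, const Vec8 &x, const Vec8 &y) {
    // Stage 2: outer product of the scattered X and Y operands.
    Tile8 products{};
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j)
            products[i][j] = x[i] * y[j];

    // Stage 3: accumulate onto the Z tile fetched from the dispatcher in stage 1.
    Tile8 z_out{};
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j)
            z_out[i][j] = z_in[i][j] + products[i][j];

    // Stage 4: the caller (the central dispatcher) receives the updated tile.
    return z_out;
}

int main() {
    Tile8 z{};               // Stage 1: Z tile arrives from the central dispatcher
    Vec8 x, y;
    x.fill(1.0f);
    y.fill(2.0f);
    z = amx_block_step(z, x, y);
    printf("%f\n", z[0][0]); // 2.0 after one accumulation
}
```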
I also assumed there's immense overhead in writing the entire chunk of Z registers back to the scheduler. Let's see what the implications are if the Z registers cannot just be trivially written back to the scheduler.
(a) You want to keep Z-data inside the AMX-blocks for as long as possible. Outer products accumulate to existing data.
(b) When just one CPU thread is using the AMX, there are no issues. You split a 64-byte by 64-byte multiplication into four 32-byte by 32-byte multiplications. Each chunk writes to a separate, unrelated block of results.
(c) The next AMX instruction ideally performs an outer product on that same set of Z registers. You can dispatch the 32Bx32B chunks to the same AMX blocks in the same order, without moving the Z data around.
(d) If all four CPUs used the AMX simultaneously, you'd want to partition one AMX block to each CPU. Each CPU would simultaneously work on a different set of Z-registers.
(e) If two CPUs used the AMX simultaneously, you'd similarly avoid excessive shuffling by partitioning two AMX blocks to each CPU. The CPU cores themselves wouldn't know which blocks are doing their multiplies, but the AMX internally knows (see the toy sketch right after this list).
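Purely as a toy illustration of points (d) and (e), here's the partitioning idea in C++. The routing policy is entirely hypothetical on my part; it just shows how 4 blocks could be divided among 1, 2, or 4 active cores so each core's Z data stays resident in "its" blocks:

```cpp
#include <cstdio>

// Hypothetical routing: give each active CPU core a fixed share of the 4 AMX
// blocks, so that core's Z data can stay resident in those blocks.
void blocks_for_core(int core, int active_cores, int out_blocks[4], int *count) {
    int per_core = 4 / active_cores; // 4, 2, or 1 block(s) per core
    *count = per_core;
    for (int i = 0; i < per_core; ++i)
        out_blocks[i] = core * per_core + i;
}

int main() {
    for (int active = 1; active <= 4; active *= 2) {
        printf("%d active core(s):\n", active);
        for (int core = 0; core < active; ++core) {
            int blocks[4], n;
            blocks_for_core(core, active, blocks, &n);
            printf("  core %d ->", core);
            for (int i = 0; i < n; ++i)
                printf(" block %d", blocks[i]);
            printf("\n");
        }
    }
}
```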
Except this model doesn't match the ISA. Each block supposedly stores an 8x8 tile of floats (256 B) or a 4x4 tile of doubles (128 B) of Z registers. However, the ISA supports 64 Z registers, each 64 bytes wide. That would imply each block stores 1024 B worth of registers. Or it doesn't store anything at all: the bandwidth requirements are so low (256 B/cycle/block) that you could shuffle Z contents in and out of every block, every cycle. There's no need to try to route certain multiplications to certain blocks in an effort to reuse locally cached Z registers.
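Spelling out those numbers (assuming 4 blocks and the 64 x 64-byte Z register file the ISA exposes):

```cpp
#include <cstdio>

int main() {
    // Z tile produced by one 32 B x 32 B outer-product chunk:
    constexpr int z_tile_fp32 = 8 * 8 * 4;             // 8x8 floats  = 256 B
    constexpr int z_tile_fp64 = 4 * 4 * 8;             // 4x4 doubles = 128 B

    // Z register file as exposed by the ISA: 64 registers x 64 B each.
    constexpr int z_file_total     = 64 * 64;          // 4096 B
    constexpr int z_file_per_block = z_file_total / 4; // 1024 B if split across 4 blocks

    printf("Z tile: %d B (fp32), %d B (fp64)\n", z_tile_fp32, z_tile_fp64);
    printf("Z file: %d B total, %d B per block\n", z_file_total, z_file_per_block);
    printf("Shuffle traffic needed: %d B/cycle/block\n", z_tile_fp32);
}
```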
Show me your benchmark, and the execution times. I'm very concerned that you're off by a factor of 2. Also, be careful that the compiler isn't optimizing anything away. Experiment with rearranging the operands so it's fma(x, x, y) as well. Is this single or half precision?
My benchmarks differed, in that each thread was working with numerous different operands simultaneously. This created very long cyclical dependencies that the compiler can't optimize away. Perhaps that overfilled the register cache?
#include <metal_stdlib>
using namespace metal;

// computes the geometric series 1/(1-x) as 1 + x + x^2 + x^3 + ... (truncated after N terms)
template <typename T, uint N>
struct fma_sum_t {
    static T compute(T x) {
        return fma(x, fma_sum_t<T, N - 1>::compute(x), T(1));
    }
};
template <typename T>
struct fma_sum_t<T, 0> {
    static T compute(T x) {
        return T(1);
    }
};

template <typename T, uint N>
[[kernel]]
void fma_sum(device float *out [[buffer(0)]],
             uint index [[thread_position_in_grid]],
             uint nthreads [[threads_per_grid]])
{
    T x = ((T) index) / ((T) nthreads);
    out[index] = fma_sum_t<T, N>::compute(x);
}
// benchmark function instances
template [[host_name("fma_sum_float_512")]]
[[kernel]]
void fma_sum<float, 512>(device float*, uint, uint);
template [[host_name("fma_sum_float_1024")]]
[[kernel]]
void fma_sum<float, 1024>(device float*, uint, uint);
template [[host_name("fma_sum_float_2048")]]
[[kernel]]
void fma_sum<float, 2048>(device float*, uint, uint);
> The computed FLOPs are within 95% of theoretical peak FMA for these GPUs
Could you provide the exact numbers you used to get 95%? From the numbers you provided, there were 8192 threads * 2048 FFMAs/thread * 2 FLOPs/FFMA, i.e. 33,554,432 FLOPs total. That occurred in 22 microseconds, so you have 1.52 TFLOPS. That's nowhere close to 10.09 TFLOPS.
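For reference, that arithmetic as a standalone snippet (every number here is one quoted above):

```cpp
#include <cstdio>

int main() {
    const double threads          = 8192;
    const double ffmas_per_thread = 2048;
    const double flops_per_ffma   = 2;
    const double seconds          = 22e-6;  // 22 microseconds

    const double total_flops = threads * ffmas_per_thread * flops_per_ffma; // 33,554,432
    printf("total FLOPs: %.0f\n", total_flops);
    printf("achieved: %.3f TFLOPS\n", total_flops / seconds / 1e12);        // ~1.5, not 10.09
}
```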
> I have to raise the amount of threads to over 8192
Look at the graphs under "ALU Bottlenecks" on my reverse-engineering repo. You must dispatch at least 8 simds/core (8192 threads) and have ILP=4 to reach 90% ALU utilization if there are two unique operands per instruction. You have only one unique operand (X itself), which may lower the register bandwidth required. Therefore you might get away with ILP=1, but that doesn't represent typical workloads. I never tested that situation because the compiler always optimized it away.
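In case the ILP figures are unclear, here's a scalar C++ analogy (not the Metal kernel above): ILP=1 is one long dependent FMA chain per thread, while ILP=4 is four independent chains whose latencies the hardware can overlap. The total FMA count is the same in both:

```cpp
#include <cmath>
#include <cstdio>

// ILP=1: every fma depends on the previous result (one serial chain of n fmas).
float chain_ilp1(float x, int n) {
    float acc = 1.0f;
    for (int i = 0; i < n; ++i)
        acc = std::fma(x, acc, 1.0f);
    return acc;
}

// ILP=4: the same n fmas, but split into four independent chains of length n/4.
float chain_ilp4(float x, int n) {
    float a = 1.0f, b = 1.0f, c = 1.0f, d = 1.0f;
    for (int i = 0; i < n; i += 4) {
        a = std::fma(x, a, 1.0f);
        b = std::fma(x, b, 1.0f);
        c = std::fma(x, c, 1.0f);
        d = std::fma(x, d, 1.0f);
    }
    return a + b + c + d;
}

int main() {
    printf("%f %f\n", chain_ilp1(0.5f, 2048), chain_ilp4(0.5f, 2048));
}
```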
BTW, dispatching 4096+ threadgroups of 1024 threads (4M+ threads) consistently delivers over 10 TFLOPS.
> Did you try the same code but using dgeev (or sgeev), i.e. float/double rather than complex?
Regarding eigendecomposition, some bad news regarding CPU performance. Timothy Liu's benchmark divided the matrix's size by 2 before passing it into NumPy, so the actual number of performed operations was overestimated by 8x (eigendecomposition cost scales as O(n^3), so halving n cuts the work by 2^3 = 8x). GFLOPS wasn't 88 (utilizing AMX), it was 10 (not utilizing AMX). Doing multiple eigendecompositions at once didn't help either for practical matrix sizes; it only improved throughput by 3x using 4 caller threads. Accelerate probably won't be faster than OpenBLAS here, because this workload can't utilize the AMX.
https://gist.github.com/philipturner/6bf642260406fa2398fb25a04d68e405
In short, I strongly doubt emulated FP64 will be useful for quantum chemistry. If consumer Nvidia GPUs can't pull it off with native FP64 and fine-tuned cuBLAS, my GPU can't. Second, Accelerate has no advantage over OpenBLAS. I could probably configure a custom build of OpenBLAS that only uses one CPU core per eigendecomposition; then it might efficiently scale to several parallel eigendecompositions.
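A rough sketch of that idea with a stock OpenBLAS build (no custom build yet): pin OpenBLAS to one thread, then run several DSYEVD calls from separate caller threads. Whether this scales depends on the OpenBLAS build being safe for concurrent callers, and the matrix below is just a placeholder:

```cpp
// Build (paths may differ): g++ -O2 parallel_eig.cpp -lopenblas -lpthread
#include <lapacke.h>
#include <cstdio>
#include <thread>
#include <vector>

// Normally declared in OpenBLAS's cblas.h.
extern "C" void openblas_set_num_threads(int num_threads);

void eig(int n) {
    std::vector<double> a(n * n, 0.0), w(n);
    for (int i = 0; i < n; ++i)
        a[i * n + i] = i + 1.0; // placeholder symmetric (diagonal) matrix

    // DSYEVD: eigenvalues ('V' also requests eigenvectors) of a symmetric matrix.
    int info = LAPACKE_dsyevd(LAPACK_ROW_MAJOR, 'V', 'U', n, a.data(), n, w.data());
    printf("info = %d, largest eigenvalue = %f\n", info, w[n - 1]);
}

int main() {
    openblas_set_num_threads(1); // one core per eigendecomposition
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i)
        pool.emplace_back(eig, 1000);
    for (auto &t : pool)
        t.join();
}
```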
Somewhere above, I said that DSYEVD achieved 160 GFLOPS. Correcting the error, that becomes 20 GFLOPS. Accelerate achieved similar performance when OpenBLAS was using multiple cores. Perhaps a minority of the computations were level-3 BLAS, which could benefit from acceleration via AMX or multicore. I should investigate why eigenvalue decomposition takes much longer than the theoretical GFLOPS suggests, and why Nvidia GPUs can't outperform decent multicore Intel CPUs.
> sgemm, dgemm, cgemm, zgemm
First, there is a repository containing tests from n=1 to n=200. I used a combination of that and my existing SGEMM/DGEMM data to extract information about ZGEMM. Small complex GEMM has half the GFLOPS (150) of the real version (300). Hopefully, large enough matrices can scale to enough cores.
> I think it might be interesting and helpful to the entire community to run a series of tests
I'll consider that when I have enough time. We need larger tests than Daniel Chalef's, and ones that might utilize two P-blocks worth of AMX. Also, we should test how much concurrent multiplications help or harm performance, and call into OpenBLAS from Swift.
> - "how MUCH of LAPACK is accelerated by AMX? Is triangular matrix multiply? Is complex matrix multiply?" etc.
Anyone should be able to scan the openblas-benchmarks-1 repository for `results.arm64/*.veclib.txt` files. If something achieves over 51.2 GFLOPS FP64 / 102.4 GFLOPS FP32 with vecLib, it's certainly AMX-accelerated.
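Those cutoffs are just the single-core NEON peak, assuming the commonly quoted ~3.2 GHz P-core clock and four 128-bit FMA pipes; those two figures are my assumption, not something measured in this thread:

```cpp
#include <cstdio>

int main() {
    // Assumed single P-core figures (see note above): ~3.2 GHz, 4x 128-bit NEON FMA pipes.
    const double ghz           = 3.2;
    const int    fma_pipes     = 4;
    const int    lanes_fp64    = 2; // doubles per 128-bit vector
    const int    lanes_fp32    = 4; // floats per 128-bit vector
    const int    flops_per_fma = 2; // multiply + add

    printf("FP64 peak: %.1f GFLOPS\n", ghz * fma_pipes * lanes_fp64 * flops_per_fma); // 51.2
    printf("FP32 peak: %.1f GFLOPS\n", ghz * fma_pipes * lanes_fp32 * flops_per_fma); // 102.4
}
```

The table below groups the routines in that benchmark by this criterion.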
| Examples that don't exceed single-core NEON GFLOPS | Examples that do | Not present in the benchmark |
| --- | --- | --- |
| CGEMV, DGEMV, SGEMV, ZGEMV | CGEMM, DGEMM, SGEMM, ZGEMM | |
| CHEMV, DSYMV, SSYMV, ZHEMV | CHEMM, ZHEMM | DSYMM, SSYMM |
| CTRMV, DTRMV, STRMV, ZTRMV | CTRMM, DTRMM, STRMM, ZTRMM | |
| CCHOLESKY, ZCHOLESKY | SCHOLESKY, DCHOLESKY | CPOTRF, DPOTRF, SPOTRF, ZPOTRF |
| CSYRK, DSYRK, SSYRK, ZSYRK, CHERK, ZHERK | | |
| CTRSV, DTRSV, STRSV, ZTRSV | | |
| CTPMV, DTPMV, STPMV, ZTPMV | | |
| CGESV, ZGESV | SGESV, DGESV | CHEEV, DSYEV, SSYEV, ZHEEV |
| CGEEV, DGEEV, SGEEV, ZGEEV | | |
| CGETRI, ZGETRI | SGETRI, DGETRI | |
Very interesting! Thanks for that.
Thanks for the update!
> FYI @casperes1996 @richmlow @name99:
> I just updated my Mathematica comparison post with new results from @leman. Same benchmarks as before, but these were run plugged in on high-power mode. They are essentially the same as casperes1996's results on battery, indicating that neither of these makes a difference.
Since you use Mathematica, do you want to post a few code examples corresponding to each of these areas of concern? Or you could PM me to arrange to email a notebook. [You'll want to annotate it so we know which code corresponds to which potential issue.]
It would also be interesting (even without tracking precise details) to see how the other known problematic areas are going. These include:
- anything involving bignums (int and fp)
This has been an issue with ARMv8 for a while, but presumably it is on Wolfram's list.
- special functions on packed long arrays
Some of these are supported in Accelerate (e.g. sqrt or exp), but MMA has to route to the relevant code. On x86, MKL will definitely run these on AVX-512 or AVX2 or whatever. (See the Accelerate sketch just after this list.)
- basic BLAS on large packed matrices. Again, the question is whether MMA uses its own code or routes to the system BLAS and so uses the AMX.
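To make "supported in Accelerate" concrete, this is the kind of vForce call a packed sqrt/exp over a long array would ideally be routed to. It's just the public Accelerate API with placeholder data; I'm not claiming this is what MMA does internally:

```cpp
// Build on macOS: clang++ -O2 vforce_demo.cpp -framework Accelerate
#include <Accelerate/Accelerate.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20; // one million doubles of placeholder data
    std::vector<double> x(n, 2.0), y(n);

    vvsqrt(y.data(), x.data(), &n); // y[i] = sqrt(x[i]), vectorized by Accelerate
    vvexp(y.data(), x.data(), &n);  // y[i] = exp(x[i])

    printf("%f %f\n", y[0], y[n - 1]);
}
```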
And of course it would be interesting to see if there are unexpected performance improvements on M2-class machines! My personal timing is that I'm waiting for the N3 machines (possibly announced at WWDC?) before I buy, at which point I'll do some of this investigation myself.
Hello theorist9,
> FYI @casperes1996 @richmlow @name99:
> I just updated my Mathematica comparison post with new results from @leman. Same benchmarks as before, but these were run plugged in on high-power mode. They are essentially the same as casperes1996's results on battery, indicating that neither of these makes a difference.