> Which mental model is correct?
I just came to a realization that I was wrong about this. When you're reverse-engineering something, you make a mental model of it, test whether it explains your observations, and accept it if it does. I recently read a 2016 research paper about how GPUs hide latencies. It heavily referenced Little's Law, which in simple terms says that the amount of work you need in flight equals latency multiplied by throughput.
> Wouldn't this suggest that the result of the FMA instruction is available immediately for the next instruction to use?
I highly suggest you read this research paper on how GPUs hide latency: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.pdf. It explains things like Little's Law and how ALU utilization varies with occupancy.
Both these points are correct.
I think the idea of AMX being a coprocessor simply means that it's invoked from the CPU, using CPU instructions, as opposed to other units which operate on their own and are controlled by passing messages or commands.
What makes AMX special, however, is that it's shared between all the CPUs in a cluster. If you want, you can consider it a specialized fifth processor that shares the same L2 cache.
Phillip, we KNOW how the system is designed; you can read the (long stream of) patents!
You can easily see how the old (incorrect) model leads to a certain understanding of the M1 Pro's AMX cluster. I thought each block on the die shot was a separate station. These stations are incredibly inefficient: their multipliers are active for only one out of every 4 cycles. They duplicate the circuitry required to generate an outer product from two 64-byte operands. This matched my understanding that all FFMA circuitry has 4x redundancy, 4x latency, and 1x throughput.
Let's redesign it like the ANE. Every cycle, you dispatch a 64-byte by 64-byte FFMA to the AMX. It breaks that up into 32-byte by 32-byte chunks, then passes each chunk to a separate block on the die shot. The hardware multipliers can multiply this data in a single cycle, but the train station is complex: other circuitry must preprocess the data before multiplying. The experienced latency should therefore jump to 4 cycles, not 1 cycle.
The way the AMX actually dispatches is more complex. A central dispatcher receives data from the CPU cores, then reroutes it to the 4 AMX blocks I talked about. There's also no need for the Z registers to physically reside in the blocks. Each cycle, every block simultaneously does the following (a small behavioral sketch in C++ follows the stage list):
Stage 1:
(a) Read an 8x8 tile of floats or a 4x4 tile of doubles of the Z register from the central dispatcher.
(b) Read 8 floats or 4 doubles of the X register from the central dispatcher, and scatter them across the 8x8/4x4 grid like I visually depicted in another comment.
(c) Read 8 floats or 4 doubles of the Y register from the central dispatcher, and scatter them across the 8x8/4x4 grid like I visually depicted in another comment.
Stage 2:
(d) Multiply whatever data's currently in the 8x8/4x4 grid, writing the outputs to another 8x8/4x4 grid of registers.
Stage 3:
(e) Add whatever data's in the second set of registers to the Z data you originally fetched from the central dispatcher. Write these outputs to yet another 8x8/4x4 grid of registers.
Stage 4:
(f) Write the final Z registers back to the central dispatcher.
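To make the stage description concrete, here's a tiny behavioral sketch in C++ of one block processing one fp32 chunk (8 floats of X, 8 floats of Y, and an 8x8 Z tile). It only models the text above; it is not derived from the actual hardware:

```cpp
#include <array>
#include <cstdio>

using Vec8  = std::array<float, 8>;
using Tile8 = std::array<std::array<float, 8>, 8>;

// Behavioral model of one AMX block handling one 32 B x 32 B fp32 chunk.
Tile8 amx_block_step(const Tile8 &z_in, const Vec8 &x, const Vec8 &y) {
    // Stage 2: outer product of the scattered X and Y operands.
    Tile8 products{};
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j)
            products[i][j] = x[i] * y[j];

    // Stage 3: accumulate onto the Z tile fetched from the dispatcher in stage 1.
    Tile8 z_out{};
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j)
            z_out[i][j] = z_in[i][j] + products[i][j];

    // Stage 4: the caller (the central dispatcher) receives the updated tile.
    return z_out;
}

int main() {
    Tile8 z{};               // Stage 1: Z tile arrives from the central dispatcher
    Vec8 x, y;
    x.fill(1.0f);
    y.fill(2.0f);
    z = amx_block_step(z, x, y);
    printf("%f\n", z[0][0]); // 2.0 after one accumulation
}
```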
I also assumed there's immense overhead in writing the entire chunk of Z registers back to the scheduler. Let's see what the implications are if the Z registers cannot just be trivially written back to the scheduler.
(a) You want to keep Z-data inside the AMX-blocks for as long as possible. Outer products accumulate to existing data.
(b) When just one CPU thread is using the AMX, there are no issues. You split a 64-byte by 64-byte multiplication into four 32-byte by 32-byte multiplications. Each chunk writes to a separate, unrelated block of results.
(c) The next AMX instruction ideally performs an outer product on that same set of Z registers. You can dispatch the 32Bx32B chunks to the same AMX blocks in the same order, without moving the Z data around.
(d) If all four CPUs used the AMX simultaneously, you'd want to partition one AMX block to each CPU. Each CPU would simultaneously work on a different set of Z-registers.
(e) If two CPUs used the AMX simultaneously, you'd similarly avoid excessive shuffling by partitioning two AMX blocks to each CPU. The CPU cores themselves wouldn't know which blocks are doing their multiplies, but the AMX internally knows (see the toy sketch right after this list).
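Purely as a toy illustration of points (d) and (e), here's the partitioning idea in C++. The routing policy is entirely hypothetical on my part; it just shows how 4 blocks could be divided among 1, 2, or 4 active cores so each core's Z data stays resident in "its" blocks:

```cpp
#include <cstdio>

// Hypothetical routing: give each active CPU core a fixed share of the 4 AMX
// blocks, so that core's Z data can stay resident in those blocks.
void blocks_for_core(int core, int active_cores, int out_blocks[4], int *count) {
    int per_core = 4 / active_cores; // 4, 2, or 1 block(s) per core
    *count = per_core;
    for (int i = 0; i < per_core; ++i)
        out_blocks[i] = core * per_core + i;
}

int main() {
    for (int active = 1; active <= 4; active *= 2) {
        printf("%d active core(s):\n", active);
        for (int core = 0; core < active; ++core) {
            int blocks[4], n;
            blocks_for_core(core, active, blocks, &n);
            printf("  core %d ->", core);
            for (int i = 0; i < n; ++i)
                printf(" block %d", blocks[i]);
            printf("\n");
        }
    }
}
```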
Except this model doesn't match the ISA. Each block supposedly stores an 8x8 tile of floats (256 B) or a 4x4 tile of doubles (128 B) of Z registers. However, the ISA supports 64 Z registers, each 64 bytes wide. That would imply each block stores 1024 B worth of registers. Or it doesn't store anything at all: the bandwidth requirements are so low (256 B/cycle/block) that you could shuffle Z contents in and out of every block, every cycle. There's no need to try to route certain multiplications to certain blocks in an effort to reuse locally cached Z registers.
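Spelling out those numbers (assuming 4 blocks and the 64 x 64-byte Z register file the ISA exposes):

```cpp
#include <cstdio>

int main() {
    // Z tile produced by one 32 B x 32 B outer-product chunk:
    constexpr int z_tile_fp32 = 8 * 8 * 4;             // 8x8 floats  = 256 B
    constexpr int z_tile_fp64 = 4 * 4 * 8;             // 4x4 doubles = 128 B

    // Z register file as exposed by the ISA: 64 registers x 64 B each.
    constexpr int z_file_total     = 64 * 64;          // 4096 B
    constexpr int z_file_per_block = z_file_total / 4; // 1024 B if split across 4 blocks

    printf("Z tile: %d B (fp32), %d B (fp64)\n", z_tile_fp32, z_tile_fp64);
    printf("Z file: %d B total, %d B per block\n", z_file_total, z_file_per_block);
    printf("Shuffle traffic needed: %d B/cycle/block\n", z_tile_fp32);
}
```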
Show me your benchmark, and the execution times. I'm very concerned that you're off by a factor of 2. Also, be careful that the compiler isn't optimizing anything away. Experiment with rearranging the operands so it's fma(x, x, y) as well. Is this single or half precision?
My benchmarks differed, in that each thread was working with numerous different operands simultaneously. This created very long cyclical dependencies that the compiler can't optimize away. Perhaps that overfilled the register cache?
#include <metal_stdlib>
using namespace metal;

// computes the geometric series 1/(1-x) as 1 + x + x^2 + x^3 + ... (truncated after N terms)
template <typename T, uint N>
struct fma_sum_t {
    static T compute(T x) {
        return fma(x, fma_sum_t<T, N - 1>::compute(x), T(1));
    }
};
template <typename T>
struct fma_sum_t<T, 0> {
    static T compute(T x) {
        return T(1);
    }
};

template <typename T, uint N>
[[kernel]]
void fma_sum(device float *out [[buffer(0)]],
             uint index [[thread_position_in_grid]],
             uint nthreads [[threads_per_grid]])
{
    T x = ((T) index) / ((T) nthreads);
    out[index] = fma_sum_t<T, N>::compute(x);
}
// benchmark function instances
template [[host_name("fma_sum_float_512")]]
[[kernel]]
void fma_sum<float, 512>(device float*, uint, uint);
template [[host_name("fma_sum_float_1024")]]
[[kernel]]
void fma_sum<float, 1024>(device float*, uint, uint);
template [[host_name("fma_sum_float_2048")]]
[[kernel]]
void fma_sum<float, 2048>(device float*, uint, uint);
> The computed FLOPs are within 95% of theoretical peak FMA for these GPUs
Could you provide the exact numbers you used to get 95%? From the numbers you provided, there were 8192 threads * 2048 FFMAs/thread * 2 FLOPs/FFMA, i.e. 33,554,432 FLOPs total. That occurred in 22 microseconds, so you have 1.52 TFLOPS. That's nowhere close to 10.09 TFLOPS.
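For reference, that arithmetic as a standalone snippet (every number here is one quoted above):

```cpp
#include <cstdio>

int main() {
    const double threads          = 8192;
    const double ffmas_per_thread = 2048;
    const double flops_per_ffma   = 2;
    const double seconds          = 22e-6;  // 22 microseconds

    const double total_flops = threads * ffmas_per_thread * flops_per_ffma; // 33,554,432
    printf("total FLOPs: %.0f\n", total_flops);
    printf("achieved: %.3f TFLOPS\n", total_flops / seconds / 1e12);        // ~1.5, not 10.09
}
```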
> I have to raise the amount of threads to over 8192
Look at the graphs under "ALU Bottlenecks" on my reverse-engineering repo. You must dispatch at least 8 simds/core (8192 threads) and have ILP=4 to reach 90% ALU utilization if there are two unique operands per instruction. You have only one unique operand (X itself), which may lower the register bandwidth required. Therefore you might get away with ILP=1, but that doesn't represent typical workloads. I never tested that situation because the compiler always optimized it away.
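In case the ILP figures are unclear, here's a scalar C++ analogy (not the Metal kernel above): ILP=1 is one long dependent FMA chain per thread, while ILP=4 is four independent chains whose latencies the hardware can overlap. The total FMA count is the same in both:

```cpp
#include <cmath>
#include <cstdio>

// ILP=1: every fma depends on the previous result (one serial chain of n fmas).
float chain_ilp1(float x, int n) {
    float acc = 1.0f;
    for (int i = 0; i < n; ++i)
        acc = std::fma(x, acc, 1.0f);
    return acc;
}

// ILP=4: the same n fmas, but split into four independent chains of length n/4.
float chain_ilp4(float x, int n) {
    float a = 1.0f, b = 1.0f, c = 1.0f, d = 1.0f;
    for (int i = 0; i < n; i += 4) {
        a = std::fma(x, a, 1.0f);
        b = std::fma(x, b, 1.0f);
        c = std::fma(x, c, 1.0f);
        d = std::fma(x, d, 1.0f);
    }
    return a + b + c + d;
}

int main() {
    printf("%f %f\n", chain_ilp1(0.5f, 2048), chain_ilp4(0.5f, 2048));
}
```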
BTW, dispatching 4096+ threadgroups of 1024 threads (4M+ threads) consistently delivers over 10 TFLOPS.
> Did you try the same code but using dgeev (or sgeev), i.e. float/double rather than complex?
Regarding eigendecomposition, some bad news regarding CPU performance. Timothy Liu's benchmark divided the matrix's size by 2 before passing it into NumPy, so the actual number of performed operations was overestimated by 8x (eigendecomposition cost scales as O(n^3), so halving n cuts the work by 2^3 = 8x). GFLOPS wasn't 88 (utilizing AMX), it was 10 (not utilizing AMX). Doing multiple eigendecompositions at once didn't help either for practical matrix sizes; it only improved throughput by 3x using 4 caller threads. Accelerate probably won't be faster than OpenBLAS here, because this workload can't utilize the AMX.
https://gist.github.com/philipturner/6bf642260406fa2398fb25a04d68e405
In short, I strongly doubt emulated FP64 will be useful for quantum chemistry. If consumer Nvidia GPUs can't pull it off with native FP64 and fine-tuned cuBLAS, my GPU can't. Second, Accelerate has no advantage over OpenBLAS. I could probably configure a custom build of OpenBLAS that only uses one CPU core per eigendecomposition; then it might efficiently scale to several parallel eigendecompositions.
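A rough sketch of that idea with a stock OpenBLAS build (no custom build yet): pin OpenBLAS to one thread, then run several DSYEVD calls from separate caller threads. Whether this scales depends on the OpenBLAS build being safe for concurrent callers, and the matrix below is just a placeholder:

```cpp
// Build (paths may differ): g++ -O2 parallel_eig.cpp -lopenblas -lpthread
#include <lapacke.h>
#include <cstdio>
#include <thread>
#include <vector>

// Normally declared in OpenBLAS's cblas.h.
extern "C" void openblas_set_num_threads(int num_threads);

void eig(int n) {
    std::vector<double> a(n * n, 0.0), w(n);
    for (int i = 0; i < n; ++i)
        a[i * n + i] = i + 1.0; // placeholder symmetric (diagonal) matrix

    // DSYEVD: eigenvalues ('V' also requests eigenvectors) of a symmetric matrix.
    int info = LAPACKE_dsyevd(LAPACK_ROW_MAJOR, 'V', 'U', n, a.data(), n, w.data());
    printf("info = %d, largest eigenvalue = %f\n", info, w[n - 1]);
}

int main() {
    openblas_set_num_threads(1); // one core per eigendecomposition
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i)
        pool.emplace_back(eig, 1000);
    for (auto &t : pool)
        t.join();
}
```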
Somewhere above, I said that DSYEVD achieved 160 GFLOPS. Correcting the error, that becomes 20 GFLOPS. Accelerate achieved similar performance when OpenBLAS was using multiple cores. Perhaps a minority of the computations were level-3 BLAS, which could benefit from acceleration via AMX or multicore. I should investigate why eigenvalue decomposition takes much longer than the theoretical GFLOPS suggests, and why Nvidia GPUs can't outperform decent multicore Intel CPUs.
> sgemm, dgemm, cgemm, zgemm
First, there is a repository containing tests from n=1 to n=200. I used a combination of that and my existing SGEMM/DGEMM data to extract information about ZGEMM. Small complex GEMM has half the GFLOPS (150) of the real version (300). Hopefully, large enough matrices can scale to enough cores.
> I think it might be interesting and helpful to the entire community to run a series of tests
I'll consider that when I have enough time. We need larger tests than Daniel Chalef's, and ones that might utilize two P-blocks worth of AMX. Also, we should test how much concurrent multiplications help or harm performance, and call into OpenBLAS from Swift.
> - "how MUCH of LAPACK is accelerated by AMX? Is triangular matrix multiply? Is complex matrix multiply?" etc.
Anyone should be able to scan the openblas-benchmarks-1 repository for `results.arm64/*.veclib.txt` files. If something achieves over 51.2 GFLOPS FP64 / 102.4 GFLOPS FP32 with vecLib, it's certainly AMX-accelerated.
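Those cutoffs are just the single-core NEON peak, assuming the commonly quoted ~3.2 GHz P-core clock and four 128-bit FMA pipes; those two figures are my assumption, not something measured in this thread:

```cpp
#include <cstdio>

int main() {
    // Assumed single P-core figures (see note above): ~3.2 GHz, 4x 128-bit NEON FMA pipes.
    const double ghz           = 3.2;
    const int    fma_pipes     = 4;
    const int    lanes_fp64    = 2; // doubles per 128-bit vector
    const int    lanes_fp32    = 4; // floats per 128-bit vector
    const int    flops_per_fma = 2; // multiply + add

    printf("FP64 peak: %.1f GFLOPS\n", ghz * fma_pipes * lanes_fp64 * flops_per_fma); // 51.2
    printf("FP32 peak: %.1f GFLOPS\n", ghz * fma_pipes * lanes_fp32 * flops_per_fma); // 102.4
}
```

The table below groups the routines in that benchmark by this criterion.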
| Examples that don't exceed single-core NEON GFLOPS | Examples that do | Not present in the benchmark |
| --- | --- | --- |
| CGEMV, DGEMV, SGEMV, ZGEMV | CGEMM, DGEMM, SGEMM, ZGEMM | |
| CHEMV, DSYMV, SSYMV, ZHEMV | CHEMM, ZHEMM | DSYMM, SSYMM |
| CTRMV, DTRMV, STRMV, ZTRMV | CTRMM, DTRMM, STRMM, ZTRMM | |
| CCHOLESKY, ZCHOLESKY | SCHOLESKY, DCHOLESKY | CPOTRF, DPOTRF, SPOTRF, ZPOTRF |
| CSYRK, DSYRK, SSYRK, ZSYRK, CHERK, ZHERK | | |
| CTRSV, DTRSV, STRSV, ZTRSV | | |
| CTPMV, DTPMV, STPMV, ZTPMV | | |
| CGESV, ZGESV | SGESV, DGESV | CHEEV, DSYEV, SSYEV, ZHEEV |
| CGEEV, DGEEV, SGEEV, ZGEEV | | |
| CGETRI, ZGETRI | SGETRI, DGETRI | |
Very interesting! Thanks for that.
Thanks for the update!
> FYI @casperes1996 @richmlow @name99:
> I just updated my Mathematica comparison post with new results from @leman. Same benchmarks as before, but these were run plugged in on high-power mode. They are essentially the same as casperes1996's results on battery, indicating that neither of these makes a difference.
Since you use Mathematica, do you want to post a few code examples corresponding to each of these areas of concern? Or you could PM me to arrange to email a notebook. [You'll want to annotate it so we know which code corresponds to which potential issue.]
It would also be interesting (even without tracking precise details) to see how the other known problematic areas are going. These include:
- anything involving bignums (int and fp)
This has been an issue with ARMv8 for a while, but presumably it is on Wolfram's list.
- special functions on packed long arrays
Some of these are supported in Accelerate (e.g. sqrt or exp), but MMA has to route to the relevant code. On x86, MKL will definitely run these on AVX-512 or AVX2 or whatever. (See the Accelerate sketch just after this list.)
- basic BLAS on large packed matrices. Again, the question is whether MMA uses its own code or routes to the system BLAS and so uses the AMX.
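To make "supported in Accelerate" concrete, this is the kind of vForce call a packed sqrt/exp over a long array would ideally be routed to. It's just the public Accelerate API with placeholder data; I'm not claiming this is what MMA does internally:

```cpp
// Build on macOS: clang++ -O2 vforce_demo.cpp -framework Accelerate
#include <Accelerate/Accelerate.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20; // one million doubles of placeholder data
    std::vector<double> x(n, 2.0), y(n);

    vvsqrt(y.data(), x.data(), &n); // y[i] = sqrt(x[i]), vectorized by Accelerate
    vvexp(y.data(), x.data(), &n);  // y[i] = exp(x[i])

    printf("%f %f\n", y[0], y[n - 1]);
}
```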
And of course it would be interesting to see if there are unexpected performance improvements on M2-class machines! My personal timing is that I'm waiting for the N3 machines (possibly announced at WWDC?) before I buy, at which point I'll do some of this investigation myself.
Hello theorist9,
> FYI @casperes1996 @richmlow @name99:
> I just updated my Mathematica comparison post with new results from @leman. Same benchmarks as before, but these were run plugged in on high-power mode. They are essentially the same as casperes1996's results on battery, indicating that neither of these makes a difference.