
leman

macrumors Core
Oct 14, 2008
19,516
19,664
No, I think it’s the other way around. Outer product does GEMM in a single instruction.

How would that work? The outer product is the pairwise multiplication of the elements of two vectors, i.e. z[i,j] = x[i]*y[j]. Matrix multiplication instead requires you to compute dot products of rows and columns, i.e. z[i,j] = x[i,:] · y[:,j]. There is obviously some duality between the two operations, but given that my mathematics days are long over, this is as much as I dare say :) Anyway, for GEMM you need to perform an outer product + accumulate for each matching index, i.e. accumulate the outer product of the k-th column of x with the k-th row of y over all k.
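
To make that last point concrete, here is a toy NumPy sketch (my own illustration with made-up sizes, nothing AMX-specific): a single outer product is one rank-1 matrix, and summing the outer products of matching columns/rows reproduces the full matrix product.

import numpy as np

# Hypothetical small sizes, just to check the identity numerically.
m, k, n = 4, 3, 5
A = np.random.rand(m, k)
B = np.random.rand(k, n)

# One outer product: z[i, j] = x[i] * y[j], a rank-1 matrix.
Z = np.outer(A[:, 0], B[0, :])
print(np.linalg.matrix_rank(Z))   # 1 -- a single outer product is rank-1

# GEMM as a sum of outer products: one rank-1 update per matching index p.
C = np.zeros((m, n))
for p in range(k):
    C += np.outer(A[:, p], B[p, :])

assert np.allclose(C, A @ B)      # same result as the dot-product definition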

P.S. Wikipedia contains more detailed formulas and discussion: https://en.wikipedia.org/wiki/Outer_product

P.P.S. My intuition tells me that outer products can be more useful if you know that your matrices have some special constraints or need to perform some other operations. Unfortunately, much of the mainstream linalg education is focused on dot and matrix products, and I suppose that not many people are trained to think about these things in terms of outer products. Then again, linear algebra is a historical misunderstanding and a corner case of the much more general geometric algebra, so maybe we should all relearn :)
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Imagine you have a sheet of graph paper with 100 x 100 grid squares. Take a pencil, lay it horizontally oriented at the top of the paper. Roll the pencil downward until it reaches the bottom. This is your X vector. Take another pencil, lay it vertically oriented at the left. Roll it sideways until it reaches the right. This is your Y vector. Reposition each pencil to the starting position. Now, roll both pencils simultaneously. You just covered 10,000 grid squares - 10,000 multiplications - in a single instruction.

Now take 99 more pairs of pencils and repeat that same process 99 more times. Imagine that each time, you spawn a new sheet of paper. After everything is done, you have a stack 100 pages thick. It has a volume of 1,000,000 voxels, or 1,000,000 multiplications. Then take a hammer and smash the entire book into a single page. That's the addition step that sums everything into the output matrix. With fused multiply-add, you can apply a mini-hammer after generating each page, which reduces the space complexity from O(n^3) to O(n^2).

Now, try another approach. Locate the top left square on the grid paper. Take a pencil, lay it vertically oriented at the top of the paper. Move it downward, like a missile shooting through that grid square. Take another pencil, lay it horizontally oriented at the left of the paper. Move it rightward, like a missile shooting through the grid square. Move both back to the starting position, then shoot both pencils simultaneously. That was one dot product, completing 1/10,000th of the matrix multiplication. Repeat the entire process 9,999 more times.

Which requires less time and energy spent moving pencils around?
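
If you prefer counting to pencils, here is the same argument as a little Python sketch (my own counting, not a model of any particular chip). It tallies how often elements of X and Y must be fetched, assuming the 100x100 accumulator grid (the "sheet of paper") stays put.

n = 100

# Outer-product order: per step, fetch one column of X and one row of Y
# (the two pencils), then reuse them across all n*n grid squares.
outer_fetches = 0
for k in range(n):
    outer_fetches += 2 * n           # one pair of "pencils" per rank-1 update

# Dot-product order: every grid square fetches a full row of X and column of Y.
dot_fetches = 0
for i in range(n):
    for j in range(n):
        dot_fetches += 2 * n         # one pair of "missiles" per output element

print(outer_fetches, dot_fetches)    # 20,000 vs 2,000,000 operand fetches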
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664

Intel and Nvidia hardware does full matrix multiplication (dot product of every row/column combination) as a single operation. You only need to invoke one _tile_dpbf16ps to multiply two matrices. With outer product, you need multiple instructions. That was what I was trying to point out.

As to how matrix multiplication is actually implemented by Intel, I have no idea. I very much doubt that they loop through rows/columns like you describe; that would be inefficient and would render moot the entire idea of a dedicated matrix accelerator. Maybe there is a patent somewhere that explains how it's done. I believe Nvidia operates on 4x4 matrices and computes dot products in parallel (that's 64 multiplications + 16 reductions). Could be Intel does something similar, but with much larger matrices.
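
For illustration, here is roughly what one tensor-core-style step amounts to, written in NumPy (a sketch based only on the public description - 4x4 tiles, FP16 inputs, FP32 accumulation - not on Nvidia's actual datapath):

import numpy as np

# One tile step: D = A @ B + C on 4x4 tiles.
# Operation count at this size: 4*4*4 = 64 multiplies, and 16 dot-product
# reductions (one per output element), each folded into the accumulator.
A = np.random.rand(4, 4).astype(np.float16)
B = np.random.rand(4, 4).astype(np.float16)
C = np.zeros((4, 4), dtype=np.float32)

D = A.astype(np.float32) @ B.astype(np.float32) + C   # FP16 in, FP32 accumulate
print(D.shape, 4 * 4 * 4, "multiplies per tile step")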
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
The Apple AMX - called the "outer product engine" by @name99 - uses the outer product approach. There are many ways of parallelizing the operation, and either approach performs O(n^3) computations. The outer product approach just makes that possible with incredibly small bandwidth.

Apple's simdgroup_matrix uses a tiled approach. Each thread in the SIMD group inputs two elements of a 64-wide tile. Within 20 cycles, 512 multiplications are performed. This is more in line with Nvidia GPUs.
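
A quick back-of-the-envelope check of those numbers, assuming (Apple doesn't document this) that the "64-wide tile" is an 8x8 matrix:

simd_width = 32                      # threads per SIMD group on Apple GPUs
tile_elems = 8 * 8                   # 64 elements per operand tile
print(tile_elems // simd_width)      # 2 elements held per thread
print(8 * 8 * 8)                     # 512 multiplications per 8x8 @ 8x8 tile product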
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
Intel and Nvidia hardware does full matrix multiplication (dot product of every row/column combination) as a single operation. You only need to invoke one _tile_dpbf16ps to multiply two matrices. With outer product, you need multiple instructions. That was what I was trying to point out.
You don't get the result in a single cycle. So it's basically moot how it happens under the sheets, where the loop is occurring...
The Apple scheme does not require sideways data movement from one MAC unit to another so it's lower energy, but is also apparently less flexible. The story of AMX after the first version shipped is one clever tweak after another to get more flexibility from the design *without* giving up that energy advantage.
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
Apple's simdgroup_matrix uses a tiled approach. Each thread in the SIMD group inputs two elements of a 64-wide tile. Within 20 cycles, 512 multiplications are performed. This is more in line with Nvidia GPUs.

With the caveat that Nvidia GPUs have dedicated dot product engines (tensor cores) for these operations.

You don't get the result in a single cycle. So it's basically moot how it happens under the sheets, where the loop is occurring...

Ah, so Intel employs some sort of iterative scheme under the hood? Nvidia describes their tensor cores as doing a full 4x4 matmul per clock (64 mul + 16 accumulate), but I suppose it's more feasible with a small matrix size.


The Apple scheme does not require sideways data movement from one MAC unit to another so it's lower energy, but is also apparently less flexible.

Why less flexible? Apple's AMX units seem to support a much wider range of operations compared to Intel/Nvidia. Or do you mean something else?
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Generally, the Apple team's skills in numerical optimizations are top notch.
It looks like they have used some of that talent for the update of Accelerate, and now it could be a game changer.

The Julia developer in charge of wrapping the BLAS libraries has written:
Accelerate (in the VM) was faster than OpenBLAS (in the VM) by about a factor of 3x when running peakflops().
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
It's been sitting right in front of me the entire time.

Fine-tune one of the leaked LLMs to port all this high-performance scientific software to Apple silicon for me. It would probably be very capable of translating CUDA to Metal, and PyTorch calls to Accelerate or OpenBLAS. An M1 Max would have enough compute to perform transfer learning, I think. I could embed domain-specific parameters like SLC size, tradeoffs of different precisions, overhead of CPU-GPU communication, and performance of AMX instructions. It might take much less effort, and be more effective, than finishing the SYCL backend. No LLVM compiler engineering needed.

Someone please correct me if I'm wrong, but maybe it could:
- Almost-perfectly translate massive Python code bases (such as AlphaFold) to Swift
- Port GPTQ-for-LLaMa to Accelerate and MPSGraph (although I'll probably port it beforehand because LLaMa-13B seems ideal for this job)
- Auto-generate benchmarks for AMX investigations and other similar endeavors
- Understand linear algebra and O(n^3) scaling
- Scan through oxDNA and convert the CUDA code to Metal, eventually to OpenMM
- Look through Quantum Espresso and similar code bases, helping me port numerous niche variations of DFT
- Embed FP64 emulation into Metal shaders
- Point out code sections it struggled to translate
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Fine-tune one of the leaked LLMs to port all this high-performance scientific software to Apple silicon for me. It would probably be very capable of translating CUDA to Metal, and PyTorch calls to Accelerate or OpenBLAS. An M1 Max would have enough compute to perform transfer learning, I think. I could embed domain-specific parameters like SLC size, tradeoffs of different precisions, overhead of CPU-GPU communication, and performance of AMX instructions. It might take much less effort, and be more effective, than finishing the SYCL backend. No LLVM compiler engineering needed.
Tuning a model is still very difficult and time consuming. What data are you going to use to tune the model?

You remind me of the author of VkFFT, who needed CUDA-only software to finish his thesis without having an Nvidia GPU on hand.

Out of curiosity, which CUDA-based software do you need?
 
  • Like
Reactions: Philip Turner

Philip Turner

macrumors regular
Dec 7, 2021
170
111
I've been doing a ton of order-of-magnitude calculations, and it seems possible. The gap between M1 Max and Nvidia GPUs is less than you'd expect: square-root scaling means diminishing returns with the larger batch sizes needed to utilize tensor cores (128-256 vs 32). M1 Max could train on the order of 10 million tokens/day, 35,000 prompts, 1,000 prompt batches on LLaMa-7B. These general-purpose models excel in not needing to be trained from scratch, but I anticipate needing some amount of transfer learning or debugging. There's also the option of linking multiple models in an ensemble (like Stable Diffusion), and I've found some promising candidates.

The biggest issue is generating enough data to train the models. The reason ChatGPT and a usage-limited GPT-4 won't work is that I'd have to embed the entire documentation for MPSGraph and other frameworks into every prompt. A custom model could absorb that data somehow, whether in model weights or an auxiliary database. There's also a pretty efficient RNN that's easily trained; I could potentially layer it over Alpaca-7B. However, I don't think a documentation file counts as a training data set, and MPSGraph is only one of many capabilities it must have.

If GPT-4 does get leaked like LLaMa, I've imagined a few compression tricks to stream it from the SSD. I estimated that GPT-4 is no more than 1-10 trillion parameters, which might just fit on my 1 TB drive. One token would take dozens of seconds, an entire prompt multiple hours. I'm really trying to recreate its superior codegen capabilities without paying for ChatGPT Plus, putting up with "Open"AI's closedness, or hitting Bing's GPT-4 usage limits. I wish it would just leak on 4chan...
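
For what it's worth, here is the storage arithmetic behind that estimate (the parameter counts are my own speculation; only the arithmetic is certain):

def weights_size_tb(params, bits_per_weight):
    """Raw weight storage in decimal TB, ignoring any index/metadata overhead."""
    return params * bits_per_weight / 8 / 1e12

for params in (1e12, 10e12):          # speculative 1T and 10T parameter counts
    for bits in (16, 8, 4):           # FP16 vs 8-bit vs 4-bit quantization
        print(f"{params / 1e12:.0f}T params @ {bits}-bit: "
              f"{weights_size_tb(params, bits):.1f} TB")
# Only the 1T estimate at roughly 4-8 bits per weight squeezes onto a 1 TB drive;
# 10T parameters do not fit at any of these widths.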
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Sometimes, just coming up with the first code prototype/pseudocode is a major hurdle. Testing and refining it can be much easier. I would have spent hours or days thinking that up myself.
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
What did you try to build?

Some stuff in R, including non-trivial data transformation pipelines that require recursive table joins, as well as some low-level stuff that manipulates the runtime state. GPT-4 did very well on the simpler data transformations but started producing really weird output on more complex questions. There were a lot of subtle mistakes, only apparent if one is already an expert on the topic.

I can imagine that it does better on the question you asked because there is probably more training data pertaining to the topic. I would be curious to know whether the code it spit out works or whether it's broken on a fundamental level...
 
  • Like
Reactions: Philip Turner

Philip Turner

macrumors regular
Dec 7, 2021
170
111
I don't think it's flawed at a fundamental level. For a long time, I've known this is the way to emulate FP64 correctly. There's a working implementation at SoftFloat that looks similar. It's just so conceptually complex that extracting their code into a single function would take too much effort to be worthwhile. GPT-4 basically summarized their code base and simplified it by removing checks for NaNs, denormals, etc. There will be some bugs, just like a human's first try.
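
For anyone curious what FP32-based FP64 emulation can look like, here's a minimal sketch of the double-single (two-float) approach built on error-free transformations. To be clear, this is a different technique from the bit-exact, integer-style emulation in SoftFloat mentioned above; it only illustrates the general idea of recovering extra precision on FP32-only hardware, and the test values are made up.

import numpy as np

f32 = np.float32

def two_sum(a, b):
    """Knuth's branch-free TwoSum: returns (s, err) with a + b == s + err exactly, all in FP32."""
    s = f32(a + b)
    bb = f32(s - a)
    err = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, err

def two_prod(a, b):
    """Returns (p, err) with a * b == p + err exactly. On a GPU you'd use
    fma(a, b, -p); here FP64 recovers the error, since the product of two
    FP32 values is always exactly representable in FP64."""
    p = f32(a * b)
    err = f32(np.float64(a) * np.float64(b) - np.float64(p))
    return p, err

p, pe = two_prod(f32(1.1), f32(1.3))
assert np.float64(p) + np.float64(pe) == np.float64(f32(1.1)) * np.float64(f32(1.3))

# Accumulate 1000 * 0.7 into 1e8: plain FP32 loses every addition,
# while the (hi, lo) pair keeps the low-order bits.
naive = f32(1e8)
hi, lo = f32(1e8), f32(0.0)
for _ in range(1000):
    naive = f32(naive + f32(0.7))
    hi, e = two_sum(hi, f32(0.7))
    lo = f32(lo + e)

print(float(naive))            # 100000000.0 -- the 0.7s vanished
print(float(hi) + float(lo))   # ~100000700.0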

Fundamentally, AI compresses vast amounts of information and retrieves it efficiently. If you want to extract the right knowledge, you need to understand the AI's limitations, the scope of the problem, and your own biases. Phrase a question with a very specific task; don't assume the AI understands what you're talking about. I think my experience mastering ChatGPT, as well as explaining complex ideas to other humans, helped me extract what I needed from GPT-4.
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
I'm trying to decide whether to employ the 30B LLaMa model, which consumes 19 GB quantized. That would exceed the RAM on 16 GB devices. I'm not sure whether lossless compression techniques will squeeze that into 16 GB with reasonable real-time performance.

What RAM sizes do all of you have?
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
I can imagine that it does better on the question you asked because there is probably more training data pertaining to the topic. I would be curious to know whether the code it spit out works or whether it's broken on a fundamental level...
I tried it for a different task - everything compiled and ran on the first try. This is probably the fastest I've ever gone from concept to first official release of a Swift package.

 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
My intuition tells me that outer products can be more useful if you know that your matrices have some special constraints or need to perform some other operations.
The lead developer of BLIS gave a course on how to improve matrix multiplication a couple of years ago. It appears that the multiplication of two matrices requires fewer memory accesses when performed with outer products.
 

dgdosen

macrumors 68030
Dec 13, 2003
2,817
1,463
Seattle
Is the code correct? I’ve been experimenting with GPT-4 for non-trivial stuff and it’s a total disaster. But even for simpler things it makes subtle mistakes.
In all things? I'm finding it scary good. That doesn't stop me or anyone from casting a critical eye on any results.

An analogy for me is something like stock screening. Throw a bunch of quantitative filters at something, then qualitatively take a look at what pops out.

Hopefully, that qualitative reasoning will always help set us apart from the models...
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Does Numpy use Accelerate?
From my anecdotal experience with ICA (running in Python via NumPy), I found that Accelerate is between 5x and 15x faster than OpenBLAS. The OpenBLAS implementation is so slow that even a 10-year-old Intel Mac has pretty much the same performance.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
What kind of workloads should work better with Accelerate than with OpenBlas?

My contact at Apple has asked for any examples of real-world usage that can be shown to work well on Accelerate; [...] They are looking for good examples that they can use to bolster the importance of developing good tools for people like us, so don't be shy!

Edit: added context to question
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
What kind of workloads should work better with Accelerate than with OpenBlas?

Anything related to matrix multiplication, certainly. Other computationally heavy stuff, most likely. In general, Apple has people who are very good at numerical optimisation, so I'd expect most of the algorithms to be very well optimised for Apple hardware specifically.
 
  • Like
Reactions: Xiao_Xi

chris11

macrumors newbie
Nov 30, 2008
16
12
What kind of workloads should work better with Accelerate than with OpenBlas?

Hi,

conda-forge offers a choice of OpenBLAS and Accelerate.

When using NumPy with OpenBLAS (the default), multiplying 8192x8192 matrices of random single-precision floats, I get ~600 GFLOPS on an M2 Pro (6+4).

Switching to Accelerate with

conda install "libblas=*=*accelerate"

ref. https://conda-forge.org/docs/maintainer/knowledge_base.html#switching-blas-implementation

this number goes to ~ 2 TFLOPS :oops:
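
For reference, here's roughly how such a number can be measured (a sketch; the matrix size matches the above, but the timing loop is my own and the exact figure will depend on the machine and BLAS build):

import time
import numpy as np

n = 8192
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

a @ b                        # warm-up, so first-call overhead isn't timed
t0 = time.perf_counter()
c = a @ b                    # dispatched to whichever BLAS NumPy is linked against
dt = time.perf_counter() - t0

flops = 2 * n**3             # n^3 multiplies + n^3 adds for a dense GEMM
print(f"{flops / dt / 1e9:.0f} GFLOPS")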

That's quite the bump!

-- Chris
 
  • Like
Reactions: Xiao_Xi and name99