
leman

macrumors Core
Oct 14, 2008
19,516
19,664
No, I think it’s the other way around. Outer product does GEMM in a single instruction.

How would that work? The outer product is the pairwise multiplication of the elements of two vectors, i.e. z[i,j] = x[i]*y[j]. Matrix multiplication instead requires you to compute dot products of rows and columns, i.e. z[i,j] = x[i,:] · y[:,j]. There is obviously some duality between the two operations, but given that my mathematics days are long over, this is as much as I dare say :) Anyway, for GEMM you need to perform an outer product + accumulate for each matching index, i.e. accumulate the outer product of the k-th column of x with the k-th row of y over all k.
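
To make that last point concrete, here is a toy NumPy sketch (my own illustration with made-up sizes, nothing AMX-specific): a single outer product is one rank-1 matrix, and summing the outer products of matching columns/rows reproduces the full matrix product.

import numpy as np

# Hypothetical small sizes, just to check the identity numerically.
m, k, n = 4, 3, 5
A = np.random.rand(m, k)
B = np.random.rand(k, n)

# One outer product: z[i, j] = x[i] * y[j], a rank-1 matrix.
Z = np.outer(A[:, 0], B[0, :])
print(np.linalg.matrix_rank(Z))   # 1 -- a single outer product is rank-1

# GEMM as a sum of outer products: one rank-1 update per matching index p.
C = np.zeros((m, n))
for p in range(k):
    C += np.outer(A[:, p], B[p, :])

assert np.allclose(C, A @ B)      # same result as the dot-product definition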

P.S. Wikipedia contains more detailed formulas and discussion: https://en.wikipedia.org/wiki/Outer_product

P.P.S. My intuition tells me that outer products can be more useful if you know that your matrices have some special constraints or need to perform some other operations. Unfortunately, much of the mainstream linalg education is focused on dot and matrix products, and I suppose that not many people are trained to think about these things in terms of outer products. Then again, linear algebra is a historical misunderstanding and a corner case of the much more general geometric algebra, so maybe we should all relearn :)
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Imagine you have a sheet of graph paper with 100 x 100 grid squares. Take a pencil, lay it horizontally oriented at the top of the paper. Roll the pencil downward until it reaches the bottom. This is your X vector. Take another pencil, lay it vertically oriented at the left. Roll it sideways until it reaches the right. This is your Y vector. Reposition each pencil to the starting position. Now, roll both pencils simultaneously. You just covered 10,000 grid squares - 10,000 multiplications - in a single instruction.

Now take 99 more pairs of pencils and repeat that same process 99 more times. Imagine that each time, you spawn a new sheet of paper. After everything is done, you have a stack 100 pages thick. It has a volume of 1,000,000 voxels, or 1,000,000 multiplications. Then take a hammer and smash the entire book into a single page. That's the addition step that sums everything into the output matrix. With fused multiply-add, you can apply a mini-hammer after generating each page, which reduces the space complexity from O(n^3) to O(n^2).

Now, try another approach. Locate the top left square on the grid paper. Take a pencil, lay it vertically oriented at the top of the paper. Move it downward, like a missile shooting through that grid square. Take another pencil, lay it horizontally oriented at the left of the paper. Move it rightward, like a missile shooting through the grid square. Move both back to the starting position, then shoot both pencils simultaneously. That was one dot product, completing 1/10,000th of the matrix multiplication. Repeat the entire process 9,999 more times.

Which requires less time and energy spent moving pencils around?
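
If you prefer counting to pencils, here is the same argument as a little Python sketch (my own counting, not a model of any particular chip). It tallies how often elements of X and Y must be fetched, assuming the 100x100 accumulator grid (the "sheet of paper") stays put.

n = 100

# Outer-product order: per step, fetch one column of X and one row of Y
# (the two pencils), then reuse them across all n*n grid squares.
outer_fetches = 0
for k in range(n):
    outer_fetches += 2 * n           # one pair of "pencils" per rank-1 update

# Dot-product order: every grid square fetches a full row of X and column of Y.
dot_fetches = 0
for i in range(n):
    for j in range(n):
        dot_fetches += 2 * n         # one pair of "missiles" per output element

print(outer_fetches, dot_fetches)    # 20,000 vs 2,000,000 operand fetches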
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664

Intel and Nvidia hardware does full matrix multiplication (dot product of every row/column combination) as a single operation. You only need to invoke one _tile_dpbf16ps to multiply two matrices. With outer product, you need multiple instructions. That was what I was trying to point out.

As to how matrix multiplication is actually implemented by Intel, I have no idea. I very much doubt that they loop through rows/columns like you describe; that would be inefficient and would render moot the entire idea of a dedicated matrix accelerator. Maybe there is a patent somewhere that explains how it's done. I believe Nvidia operates on 4x4 matrices and computes dot products in parallel (that's 64 multiplications + 16 reductions). Could be Intel does something similar, but with much larger matrices.
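
For illustration, here is roughly what one tensor-core-style step amounts to, written in NumPy (a sketch based only on the public description - 4x4 tiles, FP16 inputs, FP32 accumulation - not on Nvidia's actual datapath):

import numpy as np

# One tile step: D = A @ B + C on 4x4 tiles.
# Operation count at this size: 4*4*4 = 64 multiplies, and 16 dot-product
# reductions (one per output element), each folded into the accumulator.
A = np.random.rand(4, 4).astype(np.float16)
B = np.random.rand(4, 4).astype(np.float16)
C = np.zeros((4, 4), dtype=np.float32)

D = A.astype(np.float32) @ B.astype(np.float32) + C   # FP16 in, FP32 accumulate
print(D.shape, 4 * 4 * 4, "multiplies per tile step")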
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
The Apple AMX - called the "outer product engine" by @name99 - uses the outer product approach. There are many ways of parallelizing the operation, and either approach performs O(n^3) computations. The outer product approach just makes that possible with incredibly small bandwidth.

Apple's simdgroup_matrix uses a tiled approach. Each thread in the SIMD group inputs two elements of a 64-wide tile. Within 20 cycles, 512 multiplications are performed. This is more in line with Nvidia GPUs.
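
A quick back-of-the-envelope check of those numbers, assuming (Apple doesn't document this) that the "64-wide tile" is an 8x8 matrix:

simd_width = 32                      # threads per SIMD group on Apple GPUs
tile_elems = 8 * 8                   # 64 elements per operand tile
print(tile_elems // simd_width)      # 2 elements held per thread
print(8 * 8 * 8)                     # 512 multiplications per 8x8 @ 8x8 tile product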
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
Intel and Nvidia hardware does full matrix multiplication (dot product of every row/column combination) as a single operation. You only need to invoke one _tile_dpbf16ps to multiply two matrices. With outer product, you need multiple instructions. That was what I was trying to point out.
You don't get the result in a single cycle. So it's basically moot how it happens under the sheets, where the loop is occurring...
The Apple scheme does not require sideways data movement from one MAC unit to another so it's lower energy, but is also apparently less flexible. The story of AMX after the first version shipped is one clever tweak after another to get more flexibility from the design *without* giving up that energy advantage.
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
Apple's simdgroup_matrix uses a tiled approach. Each thread in the SIMD group inputs two elements of a 64-wide tile. Within 20 cycles, 512 multiplications are performed. This is more in line with Nvidia GPUs.

With the caveat that Nvidia GPUs have dedicated dot product engines (tensor cores) for these operations.

You don't get the result in a single cycle. So it's basically moot how it happens under the sheets, where the loop is occurring...

Ah, so Intel employs some sort of iterative scheme under the hood? Nvidia describes their tensor cores as doing a full 4x4 matmul per clock (64 mul + 16 accumulate), but I suppose it's more feasible with a small matrix size.


The Apple scheme does not require sideways data movement from one MAC unit to another so it's lower energy, but is also apparently less flexible.

Why less flexible? Apple's AMX units seem to support a much wider range of operations compared to Intel/Nvidia. Or do you mean something else?
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Generally, the Apple team's skills in numerical optimizations are top notch.
It looks like they have used some of that talent for the update of Accelerate, and now it could be a game changer.

The Julia developer in charge of wrapping the BLAS libraries has written:
Accelerate (in the VM) was faster than OpenBLAS (in the VM) by about a factor of 3x when running peakflops().
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
It's been sitting right in front of me the entire time.

Fine-tune one of the leaked LLMs to port all this high-performance scientific software to Apple silicon for me. It would probably be very capable of translating CUDA to Metal, and PyTorch calls to Accelerate or OpenBLAS. An M1 Max would have enough compute to perform transfer learning, I think. I could embed domain-specific parameters like SLC size, tradeoffs of different precisions, overhead of CPU-GPU communication, and performance of AMX instructions. It might take much less effort, and be more effective, than finishing the SYCL backend. No LLVM compiler engineering needed.

Someone please correct me if I'm wrong, but maybe it could:
- Almost-perfectly translate massive Python code bases (such as AlphaFold) to Swift
- Port GPTQ-for-LLaMa to Accelerate and MPSGraph (although I'll probably port it beforehand because LLaMa-13B seems ideal for this job)
- Auto-generate benchmarks for AMX investigations and other similar endeavors
- Understand linear algebra and O(n^3) scaling
- Scan through oxDNA and convert the CUDA code to Metal, eventually to OpenMM
- Look through Quantum Espresso and similar code bases, helping me port numerous niche variations of DFT
- Embed FP64 emulation into Metal shaders
- Point out code sections it struggled to translate
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Fine-tune one of the leaked LLMs to port all this high-performance scientific software to Apple silicon for me. It would probably be very capable of translating CUDA to Metal, and PyTorch calls to Accelerate or OpenBLAS. An M1 Max would have enough compute to perform transfer learning, I think. I could embed domain-specific parameters like SLC size, tradeoffs of different precisions, overhead of CPU-GPU communication, and performance of AMX instructions. It might take much less effort, and be more effective, than finishing the SYCL backend. No LLVM compiler engineering needed.
Tuning a model is still very difficult and time consuming. What data are you going to use to tune the model?

You remind me of the author of VkFFT, who needed CUDA-only software to finish his thesis without having an Nvidia GPU on hand.

Out of curiosity, which CUDA-based software do you need?
 
  • Like
Reactions: Philip Turner

Philip Turner

macrumors regular
Dec 7, 2021
170
111
I've been doing a ton of order-of-magnitude calculations, and it seems possible. The gap between M1 Max and Nvidia GPUs is less than you'd expect: square-root scaling means diminishing returns with the larger batch sizes needed to utilize tensor cores (128-256 vs 32). M1 Max could train on the order of 10 million tokens/day, 35,000 prompts, 1,000 prompt batches on LLaMa-7B. These general-purpose models excel in not needing to be trained from scratch, but I anticipate needing some amount of transfer learning or debugging. There's also the option of linking multiple models in an ensemble (like Stable Diffusion), and I've found some promising candidates.

The biggest issue is generating enough data to train the models. The reason ChatGPT and a usage-limited GPT-4 won't work is that I'd have to embed the entire documentation for MPSGraph and other frameworks into every prompt. A custom model could absorb that data somehow, whether in model weights or an auxiliary database. There's also a pretty efficient RNN that's easily trained; I could potentially layer it over Alpaca-7B. However, I don't think a documentation file counts as a training data set, and MPSGraph is only one of many capabilities it must have.

If GPT-4 does get leaked like LLaMa, I've imagined a few compression tricks to stream it from the SSD. I estimated that GPT-4 is no more than 1-10 trillion parameters, which might just fit on my 1 TB drive. One token would take dozens of seconds, an entire prompt multiple hours. I'm really trying to recreate its superior codegen capabilities without paying for ChatGPT Plus, putting up with "Open"AI's closedness, or hitting Bing's GPT-4 usage limits. I wish it would just leak on 4chan...
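
For what it's worth, here is the storage arithmetic behind that estimate (the parameter counts are my own speculation; only the arithmetic is certain):

def weights_size_tb(params, bits_per_weight):
    """Raw weight storage in decimal TB, ignoring any index/metadata overhead."""
    return params * bits_per_weight / 8 / 1e12

for params in (1e12, 10e12):          # speculative 1T and 10T parameter counts
    for bits in (16, 8, 4):           # FP16 vs 8-bit vs 4-bit quantization
        print(f"{params / 1e12:.0f}T params @ {bits}-bit: "
              f"{weights_size_tb(params, bits):.1f} TB")
# Only the 1T estimate at roughly 4-8 bits per weight squeezes onto a 1 TB drive;
# 10T parameters do not fit at any of these widths.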
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Sometimes, just coming up with the first code prototype/pseudocode is a major hurdle. Testing and refining it can be much easier. I would have spent hours or days thinking that up myself.
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
What did you try to build?

Some stuff in R, including non-trivial data transformation pipelines that require recursive table joins, as well as some low-level stuff that manipulates the runtime state. GPT-4 did very well on the simpler data transformations but started producing really weird output on more complex questions. There were a lot of subtle mistakes, only apparent if one is already an expert on the topic.

I can imagine that it does better on the question you asked because there is probably more training data pertaining to the topic. I would be curious to know whether the code it spit out works or whether it's broken on a fundamental level...
 
  • Like
Reactions: Philip Turner

Philip Turner

macrumors regular
Dec 7, 2021
170
111
I don't think it's flawed at a fundamental level. For a long time, I've known this is the way to emulate FP64 correctly. There's a working implementation at SoftFloat that looks similar. It's just so conceptually complex that extracting their code into a single function would take too much effort to be worthwhile. GPT-4 basically summarized their code base and simplified it by removing checks for NaNs, denormals, etc. There will be some bugs, just like a human's first try.
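
For anyone curious what FP32-based FP64 emulation can look like, here's a minimal sketch of the double-single (two-float) approach built on error-free transformations. To be clear, this is a different technique from the bit-exact, integer-style emulation in SoftFloat mentioned above; it only illustrates the general idea of recovering extra precision on FP32-only hardware, and the test values are made up.

import numpy as np

f32 = np.float32

def two_sum(a, b):
    """Knuth's branch-free TwoSum: returns (s, err) with a + b == s + err exactly, all in FP32."""
    s = f32(a + b)
    bb = f32(s - a)
    err = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, err

def two_prod(a, b):
    """Returns (p, err) with a * b == p + err exactly. On a GPU you'd use
    fma(a, b, -p); here FP64 recovers the error, since the product of two
    FP32 values is always exactly representable in FP64."""
    p = f32(a * b)
    err = f32(np.float64(a) * np.float64(b) - np.float64(p))
    return p, err

p, pe = two_prod(f32(1.1), f32(1.3))
assert np.float64(p) + np.float64(pe) == np.float64(f32(1.1)) * np.float64(f32(1.3))

# Accumulate 1000 * 0.7 into 1e8: plain FP32 loses every addition,
# while the (hi, lo) pair keeps the low-order bits.
naive = f32(1e8)
hi, lo = f32(1e8), f32(0.0)
for _ in range(1000):
    naive = f32(naive + f32(0.7))
    hi, e = two_sum(hi, f32(0.7))
    lo = f32(lo + e)

print(float(naive))            # 100000000.0 -- the 0.7s vanished
print(float(hi) + float(lo))   # ~100000700.0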

Fundamentally, AI compresses vast amounts of information and retrieves it efficiently. If you want to extract the right knowledge, you need to understand the AI's limitations, the scope of the problem, and your own biases. Phrase a question with a very specific task; don't assume the AI understands what you're talking about. I think my experience mastering ChatGPT, as well as explaining complex ideas to other humans, helped me extract what I needed from GPT-4.
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
I'm trying to decide whether to employ the 30B LLaMa model, which consumes 19 GB quantized. That would exceed the RAM on 16 GB devices. I'm not sure whether lossless compression techniques will squeeze that into 16 GB with reasonable real-time performance.

What RAM sizes do all of you have?
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
I can imagine that it does better on the question you asked because there is probably more training data pertaining to the topic. I would be curious to know whether the code it spit out works or whether it's broken on a fundamental level...
I tried it for a different task - everything compiled and ran on the first try. This is probably the fastest I've ever gone from concept to first official release of a Swift package.

 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
My intuition tells me that outer products can be more useful if you know that your matrices have some special constraints or need to perform some other operations.
The lead developer of BLIS gave a course on how to improve matrix multiplication a couple of years ago. It appears that the multiplication of two matrices requires fewer memory accesses when performed with outer products.
 

dgdosen

macrumors 68030
Dec 13, 2003
2,817
1,463
Seattle
Is the code correct? I’ve been experimenting with GPT-4 for non-trivial stuff and it’s a total disaster. But even for simpler things it makes subtle mistakes.
In all things? I'm finding it scary good. That doesn't stop me or anyone from casting a critical eye on any results.

An analogy for me is something like stock screening. Throw a bunch of quantitative filters at something, then qualitatively take a look at what pops out.

Hopefully, that qualitative reasoning will always help set us apart from the models...
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Does Numpy use Accelerate?
From my anecdotal experience with ICA (running in Python via NumPy), I found that Accelerate is between 5x and 15x faster than OpenBLAS. The OpenBLAS implementation is so slow that even a 10-year-old Intel Mac has pretty much the same performance.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
What kind of workloads should work better with Accelerate than with OpenBlas?

My contact at Apple has asked for any examples of real-world usage that can be shown to work well on Accelerate; [...] They are looking for good examples that they can use to bolster the importance of developing good tools for people like us, so don't be shy!

Edit: added context to question
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
What kind of workloads should work better with Accelerate than with OpenBlas?

Anything related to matrix multiplication, certainly. Other computationally heavy stuff, most likely. In general, Apple has people who are very good at numerical optimisation, so I'd expect most of the algorithms to be very well optimised for Apple hardware specifically.
 
  • Like
Reactions: Xiao_Xi

chris11

macrumors newbie
Nov 30, 2008
16
12
What kind of workloads should work better with Accelerate than with OpenBlas?

Hi,

conda-forge offers a choice of OpenBLAS and Accelerate.

When using NumPy with OpenBLAS (the default), multiplying 8192x8192 matrices of random single-precision floats, I get ~600 GFLOPS on an M2 Pro (6+4).

Switching to Accelerate with

conda install "libblas=*=*accelerate"

ref. https://conda-forge.org/docs/maintainer/knowledge_base.html#switching-blas-implementation

this number goes to ~ 2 TFLOPS :oops:
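
For reference, here's roughly how such a number can be measured (a sketch; the matrix size matches the above, but the timing loop is my own and the exact figure will depend on the machine and BLAS build):

import time
import numpy as np

n = 8192
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

a @ b                        # warm-up, so first-call overhead isn't timed
t0 = time.perf_counter()
c = a @ b                    # dispatched to whichever BLAS NumPy is linked against
dt = time.perf_counter() - t0

flops = 2 * n**3             # n^3 multiplies + n^3 adds for a dense GEMM
print(f"{flops / dt / 1e9:.0f} GFLOPS")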

That's quite the bump!

-- Chris
 
  • Like
Reactions: Xiao_Xi and name99