
Philip Turner

macrumors regular
Dec 7, 2021
170
111
I made this thread because I thought coverage of Apple Silicon Macs in science was lacking, and I hoped this thread would be a helpful place for current and future scientists to find information about Apple Silicon Macs.

Please post only things related to the use of Apple Silicon Macs in science: articles, GitHub repositories, sites, and so on.

I'm gonna start with this:
Apple Silicon Performance in Scientific Computing

PDF version:
I found out why the M1 supposedly beat other GPUs in OpenCL benchmarks. Apple's OpenCL queue profiling API is not Khronos-conformant. It reports Mach timebase ticks (24 MHz) as if they were nanoseconds (1 GHz), producing times that are 1/41.7 of the actual value. It's highly probable that some profiling software assumed CL_QUEUE_PROFILING_ENABLE works correctly and did not correct for the error.

Take whatever results these researchers got and divide the speed by 41.7. If the M1 is reported as 10x faster than a V100, it actually becomes ~4x slower.
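For anyone who wants to sanity-check timing numbers themselves, here's a minimal sketch in C of the correction (assuming Apple silicon's 24 MHz Mach timebase, where mach_timebase_info reports numer/denom = 125/3 ≈ 41.67):

Code:
#include <mach/mach_time.h>
#include <stdint.h>

/* Rescale a raw profiling value (actually Mach timebase ticks) into true
 * nanoseconds. On Apple silicon, numer/denom = 125/3, i.e. each 24 MHz
 * tick is ~41.67 ns; the same 41.7x factor described above. */
uint64_t corrected_ns(uint64_t reported_value) {
    mach_timebase_info_data_t info;
    mach_timebase_info(&info);
    return reported_value * info.numer / info.denom;
}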
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
I found out why the M1 supposedly beat other GPUs in OpenCL benchmarks. Apple's OpenCL queue profiling API is not Khronos-conformant. It reports Mach timebase ticks (24 MHz) as if they were nanoseconds (1 GHz), producing times that are 1/41.7 of the actual value. It's highly probable that some profiling software assumed CL_QUEUE_PROFILING_ENABLE works correctly and did not correct for the error.

Take whatever results these researchers got and divide the speed by 41.7. If the M1 is reported as 10x faster than a V100, it actually becomes ~4x slower.
This is why peer review is important :). At least I assume it was not peer-reviewed, since it was published as a conference proceeding:


Are these your own findings? If so, you should email the authors, cc'ing IEEE. Since a comparison to the NVIDIA GPUs is the whole point of the paper, if your findings are correct, the authors should withdraw the paper while they work on a corrected version.
 
Last edited:

Philip Turner

macrumors regular
Dec 7, 2021
170
111
I found contact info for one author: https://github.com/cdcapano, but not a specific email for IEEE Xplore. I'll email the author and see if he responds.

In the meantime, I found some more red flags. The authors didn't check their numbers against the M1 Ultra's theoretical 21,232 GFLOPS of FP32 compute, yet their SGEMM benchmark reported 248,000 GFLOPS. Dividing by 41.7 and accounting for ~25% ALU utilization without simdgroup_matrix, I would expect a faulty benchmark to report about 221,000 GFLOPS.
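A quick back-of-the-envelope check of that arithmetic, where the peak, utilization, and error factor are the assumptions stated above:

Code:
#include <stdio.h>

/* Assumptions from above: M1 Ultra FP32 peak = 21,232 GFLOPS, ~25% ALU
 * utilization for SGEMM without simdgroup_matrix, 41.7x timer error. */
int main(void) {
    double peak_gflops = 21232.0;
    double utilization = 0.25;
    double timer_error = 41.7;
    /* prints ~221,344 GFLOPS, in the same ballpark as the paper's 248,000 */
    printf("expected faulty result: %.0f GFLOPS\n",
           peak_gflops * utilization * timer_error);
    return 0;
}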
 

iPadified

macrumors 68020
Apr 25, 2017
2,014
2,257
I found contact info for one author: https://github.com/cdcapano, but not a specific email for IEEE Xplore. I'll email the author and see if he responds.

In the meantime, I found some more red flags. The authors didn't check their numbers against the M1 Ultra's theoretical 21,232 GFLOPS of FP32 compute, yet their SGEMM benchmark reported 248,000 GFLOPS. Dividing by 41.7 and accounting for ~25% ALU utilization without simdgroup_matrix, I would expect a faulty benchmark to report about 221,000 GFLOPS.
Great work! This is the reason for showing data and discussing it: we learn! The correct thing is to send a note to the authors rather than to IEEE, especially for conference proceedings.
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
Great work! This is the reason for showing data and discussing it: we learn! The correct thing is to send a note to the authors rather than to IEEE, especially for conference proceedings.
Yeah, it would probably be better to contact just the authors first.
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Following up on my earlier posts in this thread about using Apple silicon for computational chemistry...

Me:

I have heard that DFT algorithms have O(n^3) (cubic) computational complexity with respect to the number of electrons. Both eigendecomposition and matrix multiplication have the same computational complexity. I cannot recall whether diagonalization was the bottleneck in DFT simulations. Can you help me figure this out?

ChatGPT:

In DFT calculations, the most computationally intensive part is the solution of the Kohn-Sham equations, which has an O(n^3) computational complexity with respect to the number of electrons (n). However, the eigendecomposition is not the bottleneck in DFT simulations, and diagonalization is not explicitly performed during the iterative solution of the Kohn-Sham equations.

Instead, the most computationally demanding operation in DFT simulations is the repeated multiplication of large matrices, which has an O(n^3) computational complexity. This operation is required to solve the Kohn-Sham equations and obtain the self-consistent electron density and potential. The iterative solution of the Kohn-Sham equations involves the use of numerical methods such as the self-consistent field (SCF) method or the density mixing method, which do not explicitly involve diagonalization.

While the eigendecomposition is not explicitly performed during the iterative solution of the Kohn-Sham equations, it is required in the initial setup of the Kohn-Sham equations to obtain the set of orbitals that are used to construct the density matrix. However, this eigendecomposition is typically performed only once, and the resulting set of orbitals is then used for the iterative solution of the Kohn-Sham equations.

Overall, the computational complexity of DFT simulations is dominated by the repeated multiplication of large matrices, rather than the eigendecomposition. Therefore, optimizing the matrix multiplication operations is essential for improving the performance of DFT simulations.
It looks like I'm going to use my AMX for quantum chemistry! A later conversation revealed there's some inherent parallelism, such that multiple matmul operations can happen concurrently. This may lead to greater AMX utilization, because multiple CPU cores can harness it (see the sketch below). There may also be methods for using GPU mixed precision, requiring different numerical integrators fine-tuned for the characteristics of this precision mode. The difference between AMX FP64 and GPU FP32 is 1:13, so I'll have to determine whether the more complex integrators counteract the performance boost.
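As a rough illustration of that parallelism (my own sketch with made-up sizes, not code from any package), several independent SGEMMs can be dispatched from multiple CPU cores at once, each issuing work to the AMX:

Code:
#include <Accelerate/Accelerate.h>
#include <dispatch/dispatch.h>
#include <stdlib.h>

/* Sketch: a batch of independent n x n SGEMMs issued concurrently from
 * several CPU cores. Matrix contents and sizes are illustrative only. */
int main(void) {
    const int n = 512, batches = 4;
    float *a = calloc((size_t)batches * n * n, sizeof(float));
    float *b = calloc((size_t)batches * n * n, sizeof(float));
    float *c = calloc((size_t)batches * n * n, sizeof(float));

    dispatch_apply(batches, DISPATCH_APPLY_AUTO, ^(size_t i) {
        const float *ai = a + i * n * n;
        const float *bi = b + i * n * n;
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0f, ai, n, bi, n,
                    0.0f, c + i * n * n, n);
    });

    free(a); free(b); free(c);
    return 0;
}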

Another insight is that, since eigendecomposition is not the bottleneck, it could happen on the CPU in FP64; eigendecomposition is generally an ill-structured problem. The bulk of the computation (matmuls) could then happen on the GPU using mixed precision. I recall that Fugaku performed FP64 linear algebra operations in FP16 with error-correction methods, so this approach may be possible. Still, I'm suspicious about why Nvidia declined to post quantum chemistry benchmarks for the GPUs without large FP64 compute. Hopefully it's because the software packages required FP64, and not because DFT cannot be rewritten to use mixed precision.
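As a toy illustration of that error-correction idea (my own sketch of Dongarra-style mixed-precision iterative refinement, not Fugaku's actual method), the expensive factorization happens in FP32 while FP64 is only used for residuals and accumulation:

Code:
#include <Accelerate/Accelerate.h>
#include <stdio.h>

/* Factorize once in FP32, then refine the FP64 solution of A*x = b with
 * FP64 residuals. A 3x3 system is used purely for illustration. */
int main(void) {
    __CLPK_integer n = 3, nrhs = 1, info = 0, ipiv[3];
    double A[9] = {4, 1, 0,  1, 3, 1,  0, 1, 2};  /* column-major */
    double b[3] = {1, 2, 3};
    double x[3] = {0, 0, 0};
    float Af[9], rf[3];
    char trans = 'N';

    for (int i = 0; i < 9; i++) Af[i] = (float)A[i];
    sgetrf_(&n, &n, Af, &n, ipiv, &info);  /* FP32 LU factorization */

    for (int iter = 0; iter < 3; iter++) {
        /* FP64 residual: r = b - A*x */
        double r[3] = {b[0], b[1], b[2]};
        cblas_dgemv(CblasColMajor, CblasNoTrans, 3, 3,
                    -1.0, A, 3, x, 1, 1.0, r, 1);

        /* FP32 correction solve A*dx = r, reusing the LU factors */
        for (int i = 0; i < 3; i++) rf[i] = (float)r[i];
        sgetrs_(&trans, &n, &nrhs, Af, &n, ipiv, rf, &n, &info);

        /* accumulate the correction in FP64 */
        for (int i = 0; i < 3; i++) x[i] += (double)rf[i];
    }
    printf("x = %.6f %.6f %.6f\n", x[0], x[1], x[2]);
    return 0;
}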

If (big if) I do decide on this approach, it would necessitate finishing metal-float64 and the Metal backend for Open SYCL (formerly hipSYCL). That means more scientific software for you Apple silicon users.
 

Boil

macrumors 68040
Oct 23, 2018
3,477
3,173
Stargate Command
Looked in here a few times, my brain glazed over while skimming, not my area of expertise by any means...

But the latest rumor is that the forthcoming ASi Mac Pro may have an option for an MPX-style variant of an AMD Instinct add-in card...

The question then being how this might impact those in the assorted scientific fields working in macOS (just a curiosity)...?
 

leman

macrumors Core
Oct 14, 2008
19,517
19,664
Looked in here a few times, my brain glazed over while skimming, not my area of expertise by any means...

But the latest rumor is that the forthcoming ASi Mac Pro may have an option for an MPX-style variant of an AMD Instinct add-in card...

The question then being how this might impact those in the assorted scientific fields working in macOS (just a curiosity)...?

Impossible to answer this without additional information. Given the current state of affairs, software is more important than hardware. Even if they release hardware like that (which IMO is extremely unlikely), it won’t be very useful without sweeping changes to programmability.
 
  • Like
Reactions: Boil

Philip Turner

macrumors regular
Dec 7, 2021
170
111
I think what they meant is a modular system for swapping different Apple chips, for example options for an M3 Max or M3 Ultra. That could be standardized so future generations of Max and Ultra chips would fit as well. The thermal load would need to stay constant, though, which means smaller performance improvements as Moore's Law draws to a halt. On that note, we haven't even gotten close to the Landauer limit.
 

Zest28

macrumors 68030
Jul 11, 2022
2,581
3,931
Who cares about all this Apple Silicon stuff.

If you need performance, just rent it, just like how ChatGPT was trained on Azure.

It's probably a waste of time trying to get ARM working, due to missing libraries or the correct versions not being available, rather than just doing it in the cloud.
 
Last edited:

Philip Turner

macrumors regular
Dec 7, 2021
170
111
That is how Nvidia makes money. Everything in the cloud is an Nvidia GPU running CUDA software. ChatGPT runs on OpenAI's own Triton software, a domain-specific language that harnesses CUDA for AI inference. Some people don't want to contribute to a growing GPGPU monopoly.

missing libraries or the correct versions not being available
That's standard software engineering for you; even Linux users deal with issues like this. Also, the cloud (e.g. Google Colab) doesn't immediately solve all the problems; you just have a different environment with a different set of conditions. That might help, or it might make things worse.
 

JouniS

macrumors 6502a
Nov 22, 2020
638
399
Who cares about all this M3 Max and M3 Ultra stuff.

If you need performance, just rent it, just like how ChatGPT was trained on Azure.

It's probably a waste of time trying to get ARM working, due to missing libraries or the correct versions not being available, rather than just doing it in the cloud.
The cloud is good for production but not for development.

I've been using cloud instances as a workstation replacement quite often recently. The main weakness is that you have to choose between fast storage and permanent storage. All permanently stored files are effectively on a network drive that's slow by today's standards. There is local SSD, but all data stored on it is lost when you stop the instance. And you have to stop the instance, because at $5/hour, the cost quickly exceeds the price of a nice workstation.

When you don't have fast persistent storage, you can't leave large temporary files around to allow you to resume working faster. You have to download and/or generate them every time you start the instance, which adds significant delays and lowers your productivity.
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Apple just deprecated vecLib's BLAS and replaced it with an ILP64 version. Meanwhile, I think I found a workaround for boosting ZGEMM performance: split the real and imaginary matrices into separate contiguous memory blocks. You might think to run each of the 4 sub-multiplications on a separate CPU core, but the more efficient approach seems to be having each of 4 CPU cores handle part of each sub-multiplication (see the sketch below). Another option is to force it to use both P-blocks: sub-divide a 2x2x2 = 8 multiplication and carefully coordinate the 4 sum operations afterward, so they occur between CPUs in the same block.
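Here is a minimal sketch of the split-storage idea (my own illustration; square row-major matrices assumed). The complex product (A + iB)(C + iD) = (AC - BD) + i(AD + BC) becomes four real DGEMMs over contiguous blocks:

Code:
#include <Accelerate/Accelerate.h>

/* Split-complex GEMM: real and imaginary parts stored as separate
 * contiguous n x n matrices, multiplied with four real DGEMMs. */
void split_complex_gemm(int n,
                        const double *a_re, const double *a_im,
                        const double *b_re, const double *b_im,
                        double *c_re, double *c_im) {
    /* c_re = a_re*b_re - a_im*b_im */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, a_re, n, b_re, n, 0.0, c_re, n);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                -1.0, a_im, n, b_im, n, 1.0, c_re, n);
    /* c_im = a_re*b_im + a_im*b_re */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, a_re, n, b_im, n, 0.0, c_im, n);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, a_im, n, b_re, n, 1.0, c_im, n);
}

Only the two accumulating DGEMMs must be ordered after their partners; otherwise the four multiplications are independent, which is what makes the work easy to spread across cores.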

Also, for complex eigendecomposition, I think my numbers were off. I found one source that said LAPACK's implementation takes ~9n^3 flops instead of (4/3)n^3, which would put Accelerate's DSYEVD at ~120 GFLOPS. I also found some work by Dongarra, who says the majority of operations can be FP32, with the only FP64 parts being O(n^2). The only exception is ill-conditioned matrices, which occur with very close atomic orbitals in metals. My nanotech research involves upper p-block elements, not transition metals with close d-orbitals. In short, I think I can accelerate quantum chemistry on ASi by 8x using the AMX. If the FP32 parts are still the bottleneck, then move to the GPU with ~330 GFLOPS FP59e.

 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Apple just deprecated vecLib’s BLAS and replaced it with an ILP64 version.
What are the pros/cons of the change? Will LAPACK/BLAS-based software in macOS 13.3 stop working or be slower?

The BLAS and LAPACK libraries under the Accelerate framework are now inline with reference version 3.9.1. [1]
3.9.1 was released almost two years ago [2]. Why does Apple use such old versions of LAPACK and BLAS?

[1] https://developer.apple.com/documentation/macos-release-notes/macos-13_3-release-notes
[2] https://netlib.org/lapack/#_lapack_version_3_9_1
 
Last edited:

leman

macrumors Core
Oct 14, 2008
19,517
19,664
Apple just deprecated vecLib’s BLAS and replaced it with an ILP64 version.

Why do you say deprecated? The notes say it was updated to a much newer version (BLAS and LAPACK were previously stuck on a decade-old version).
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Look up the Swift API documentation for BLAS. They deprecated every function in the module. They will probably document the new, un-deprecated 64-bit versions of each function soon.
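For C code, the macOS 13.3 release notes describe opting into the new interfaces via preprocessor macros; if I'm reading them correctly, it looks like this:

Code:
/* Select the new LAPACK 3.9.1 headers and the ILP64 (64-bit integer)
 * variants. Without these macros you still get the old, now-deprecated
 * LP64 symbols, so existing code keeps compiling. */
#define ACCELERATE_NEW_LAPACK
#define ACCELERATE_LAPACK_ILP64
#include <Accelerate/Accelerate.h>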
 
  • Like
Reactions: Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Why do you say deprecated? The notes say it was updated to a much newer version (BLAS and LAPACK were previously stuck on a decade-old version)

These new interfaces provide additional functionality and a new ILP64 interface.
It seems Apple is keeping the LP64 interface.

The previous ones were much older. These new ones are cutting-edge by comparison :)
Hopefully Apple can go the extra mile and make life a little easier for developers of BLAS/LAPACK-based software to program for macOS.
 

ahurst

macrumors 6502
Oct 12, 2021
410
815
The previous ones were much older. These new ones are cutting-edge by comparison :)
Hoping this means we’ll see Accelerate.framework support added back to SciPy soon! I know they dropped it a few years ago due to the ancient/buggy LAPACK implementation, so I’m curious what kind of performance boost the AMX would provide for those sorts of workloads if it came back.
 
  • Like
Reactions: Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Anecdotal data suggests that the performance difference between Apple's BLAS and OpenBLAS decreases as the matrix dimensions increase. Is there any comparison between OpenBLAS and Apple's BLAS?
 

leman

macrumors Core
Oct 14, 2008
19,517
19,664
Anecdotal data suggests that the performance difference between Apple's BLAS and OpenBLAS decreases as the matrix dimensions increase.

If your data is very large, you will be limited by the system RAM bandwidth. AMX units have tremendous throughput and are connected directly to the fast L2 cache, but as your dimensions increase, you'll spend an increasing amount of time waiting on RAM.

At least this would be my first intuition; I could very well be wrong.
 
  • Like
Reactions: Xiao_Xi

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Anecdotal data suggests that the performance difference between Apple's BLAS and OpenBLAS decreases as the matrix dimensions increase. Is there any comparison between OpenBLAS and Apple's BLAS?
If your data is very large, you will be limited by the system RAM bandwidth. AMX units have tremendous throughput and are connected directly to the fast L2 cache, but as your dimensions increase, you'll spend an increasing amount of time waiting on RAM.

At least this would be my first intuition; I could very well be wrong.
https://github.com/xrq-phys/blis_apple has some graphs demonstrating GEMM performance. Accelerate is not optimized to switch to a different execution mode once L2 is exceeded: the first performance peak ends when the working set exceeds L2, and the second when it exceeds the P-CPU L2D + SLC.
 
  • Like
Reactions: Xiao_Xi