
Philip Turner

macrumors regular
Dec 7, 2021
170
111
I made this thread because I thought coverage of Apple Silicon Macs in science was lacking, and I hoped this thread would be a helpful place for current and future scientists to find information about Apple Silicon Macs.

Please post only things related to the use of Apple Silicon Macs in science: articles, GitHub repositories, sites, and so on.

I'm gonna start with this:
Apple Silicon Performance in Scientific Computing

PDF version:
I found out why the M1 supposedly beat other GPUs in OpenCL benchmarks. Apple's OpenCL queue profiling API is not Khronos-conformant. It reports Mach timebase ticks (24 MHz) as if they were nanoseconds (1 GHz), producing times that are 1/41.7 of the actual value. It's highly probable that some profiling software assumed CL_QUEUE_PROFILING_ENABLE works correctly and did not correct for the error.

Take whatever results these researchers got and divide the speed by 41.7. If the M1 is reported as 10x faster than a V100, it actually becomes ~4x slower.
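For anyone who wants to sanity-check timing numbers themselves, here's a minimal sketch in C of the correction (assuming Apple silicon's 24 MHz Mach timebase, where mach_timebase_info reports numer/denom = 125/3 ≈ 41.67):

Code:
#include <mach/mach_time.h>
#include <stdint.h>

/* Rescale a raw profiling value (actually Mach timebase ticks) into true
 * nanoseconds. On Apple silicon, numer/denom = 125/3, i.e. each 24 MHz
 * tick is ~41.67 ns; the same 41.7x factor described above. */
uint64_t corrected_ns(uint64_t reported_value) {
    mach_timebase_info_data_t info;
    mach_timebase_info(&info);
    return reported_value * info.numer / info.denom;
}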
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
I found out why the M1 supposedly beat other GPUs in OpenCL benchmarks. Apple's OpenCL queue profiling API is not Khronos-conformant. It reports Mach timebase ticks (24 MHz) as if they were nanoseconds (1 GHz), producing times that are 1/41.7 of the actual value. It's highly probable that some profiling software assumed CL_QUEUE_PROFILING_ENABLE works correctly and did not correct for the error.

Take whatever results these researchers got and divide the speed by 41.7. If the M1 is reported as 10x faster than a V100, it actually becomes ~4x slower.
This is why peer review is important :). At least I assume it was not peer-reviewed, since it was published as a conference proceeding:


Are these your own findings? If so, you should email the authors, cc'ing IEEE. Since a comparison to the NVIDIA GPUs is the whole point of the paper, if your findings are correct, the authors should withdraw the paper while they work on a corrected version.
 
Last edited:

Philip Turner

macrumors regular
Dec 7, 2021
170
111
I found contact info for one author: https://github.com/cdcapano, but not a specific email for IEEE Xplore. I'll email the author and see if he responds.

In the meantime, I found some more red flags. The authors didn't check their numbers against the M1 Ultra's theoretical 21,232 GFLOPS of FP32 compute, yet their SGEMM benchmark reported 248,000 GFLOPS. Dividing by 41.7 and accounting for ~25% ALU utilization without simdgroup_matrix, I would expect a faulty benchmark to report about 221,000 GFLOPS.
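A quick back-of-the-envelope check of that arithmetic, where the peak, utilization, and error factor are the assumptions stated above:

Code:
#include <stdio.h>

/* Assumptions from above: M1 Ultra FP32 peak = 21,232 GFLOPS, ~25% ALU
 * utilization for SGEMM without simdgroup_matrix, 41.7x timer error. */
int main(void) {
    double peak_gflops = 21232.0;
    double utilization = 0.25;
    double timer_error = 41.7;
    /* prints ~221,344 GFLOPS, in the same ballpark as the paper's 248,000 */
    printf("expected faulty result: %.0f GFLOPS\n",
           peak_gflops * utilization * timer_error);
    return 0;
}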
 

iPadified

macrumors 68020
Apr 25, 2017
2,014
2,257
I found contact info for one author: https://github.com/cdcapano, but not a specific email for IEEE Xplore. I'll email the author and see if he responds.

In the meantime, I found some more red flags. The authors didn't check their numbers against the M1 Ultra's theoretical 21,232 GFLOPS of FP32 compute, yet their SGEMM benchmark reported 248,000 GFLOPS. Dividing by 41.7 and accounting for ~25% ALU utilization without simdgroup_matrix, I would expect a faulty benchmark to report about 221,000 GFLOPS.
Great work! This is the reason for showing data and discussing it: we learn! The correct thing is to send a note to the authors rather than to IEEE, especially for conference proceedings.
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
Great work! This is the reason for showing data and discussing it: we learn! The correct thing is to send a note to the authors rather than to IEEE, especially for conference proceedings.
Yeah, it would probably be better to contact just the authors first.
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Following up on my earlier posts in this thread about using Apple silicon for computational chemistry...

Me:

I have heard that DFT algorithms have O(n^3) (cubic) computational complexity with respect to the number of electrons. Both eigendecomposition and matrix multiplication have the same computational complexity. I cannot recall whether diagonalization was the bottleneck in DFT simulations. Can you help me figure this out?

ChatGPT:

In DFT calculations, the most computationally intensive part is the solution of the Kohn-Sham equations, which has an O(n^3) computational complexity with respect to the number of electrons (n). However, the eigendecomposition is not the bottleneck in DFT simulations, and diagonalization is not explicitly performed during the iterative solution of the Kohn-Sham equations.

Instead, the most computationally demanding operation in DFT simulations is the repeated multiplication of large matrices, which has an O(n^3) computational complexity. This operation is required to solve the Kohn-Sham equations and obtain the self-consistent electron density and potential. The iterative solution of the Kohn-Sham equations involves the use of numerical methods such as the self-consistent field (SCF) method or the density mixing method, which do not explicitly involve diagonalization.

While the eigendecomposition is not explicitly performed during the iterative solution of the Kohn-Sham equations, it is required in the initial setup of the Kohn-Sham equations to obtain the set of orbitals that are used to construct the density matrix. However, this eigendecomposition is typically performed only once, and the resulting set of orbitals is then used for the iterative solution of the Kohn-Sham equations.

Overall, the computational complexity of DFT simulations is dominated by the repeated multiplication of large matrices, rather than the eigendecomposition. Therefore, optimizing the matrix multiplication operations is essential for improving the performance of DFT simulations.
It looks like I'm going to use my AMX for quantum chemistry! A later conversation revealed there's some inherent parallelism, such that multiple matmul operations can happen concurrently. This may lead to greater AMX utilization, because multiple CPU cores can harness it (see the sketch below). There may also be methods for using GPU mixed precision, requiring different numerical integrators fine-tuned for the characteristics of this precision mode. The difference between AMX FP64 and GPU FP32 is 1:13, so I'll have to determine whether the more complex integrators counteract the performance boost.
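As a rough illustration of that parallelism (my own sketch with made-up sizes, not code from any package), several independent SGEMMs can be dispatched from multiple CPU cores at once, each issuing work to the AMX:

Code:
#include <Accelerate/Accelerate.h>
#include <dispatch/dispatch.h>
#include <stdlib.h>

/* Sketch: a batch of independent n x n SGEMMs issued concurrently from
 * several CPU cores. Matrix contents and sizes are illustrative only. */
int main(void) {
    const int n = 512, batches = 4;
    float *a = calloc((size_t)batches * n * n, sizeof(float));
    float *b = calloc((size_t)batches * n * n, sizeof(float));
    float *c = calloc((size_t)batches * n * n, sizeof(float));

    dispatch_apply(batches, DISPATCH_APPLY_AUTO, ^(size_t i) {
        const float *ai = a + i * n * n;
        const float *bi = b + i * n * n;
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0f, ai, n, bi, n,
                    0.0f, c + i * n * n, n);
    });

    free(a); free(b); free(c);
    return 0;
}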

Another insight is that, since eigendecomposition is not the bottleneck, it could happen on the CPU in FP64; eigendecomposition is generally an ill-structured problem. The bulk of the computation (matmuls) could then happen on the GPU using mixed precision. I recall that Fugaku performed FP64 linear algebra operations in FP16 with error-correction methods, so this approach may be possible. Still, I'm suspicious about why Nvidia declined to post quantum chemistry benchmarks for the GPUs without large FP64 compute. Hopefully it's because the software packages required FP64, and not because DFT cannot be rewritten to use mixed precision.
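As a toy illustration of that error-correction idea (my own sketch of Dongarra-style mixed-precision iterative refinement, not Fugaku's actual method), the expensive factorization happens in FP32 while FP64 is only used for residuals and accumulation:

Code:
#include <Accelerate/Accelerate.h>
#include <stdio.h>

/* Factorize once in FP32, then refine the FP64 solution of A*x = b with
 * FP64 residuals. A 3x3 system is used purely for illustration. */
int main(void) {
    __CLPK_integer n = 3, nrhs = 1, info = 0, ipiv[3];
    double A[9] = {4, 1, 0,  1, 3, 1,  0, 1, 2};  /* column-major */
    double b[3] = {1, 2, 3};
    double x[3] = {0, 0, 0};
    float Af[9], rf[3];
    char trans = 'N';

    for (int i = 0; i < 9; i++) Af[i] = (float)A[i];
    sgetrf_(&n, &n, Af, &n, ipiv, &info);  /* FP32 LU factorization */

    for (int iter = 0; iter < 3; iter++) {
        /* FP64 residual: r = b - A*x */
        double r[3] = {b[0], b[1], b[2]};
        cblas_dgemv(CblasColMajor, CblasNoTrans, 3, 3,
                    -1.0, A, 3, x, 1, 1.0, r, 1);

        /* FP32 correction solve A*dx = r, reusing the LU factors */
        for (int i = 0; i < 3; i++) rf[i] = (float)r[i];
        sgetrs_(&trans, &n, &nrhs, Af, &n, ipiv, rf, &n, &info);

        /* accumulate the correction in FP64 */
        for (int i = 0; i < 3; i++) x[i] += (double)rf[i];
    }
    printf("x = %.6f %.6f %.6f\n", x[0], x[1], x[2]);
    return 0;
}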

If (big if) I do decide on this approach, it would necessitate finishing metal-float64 and the Metal backend for Open SYCL (formerly hipSYCL). That means more scientific software for you Apple silicon users.
 

Boil

macrumors 68040
Oct 23, 2018
3,477
3,173
Stargate Command
Looked in here a few times, my brain glazed over while skimming, not my area of expertise by any means...

But the latest rumor is that the forthcoming ASi Mac Pro may have an option for an MPX-style variant of an AMD Instinct add-in card...

The question then being how this might impact those in the assorted scientific fields working in macOS (just a curiosity)...?
 

leman

macrumors Core
Oct 14, 2008
19,517
19,664
Looked in here a few times, my brain glazed over while skimming, not my area of expertise by any means...

But the latest rumor is that the forthcoming ASi Mac Pro may have an option for an MPX-style variant of an AMD Instinct add-in card...

The question then being how this might impact those in the assorted scientific fields working in macOS (just a curiosity)...?

Impossible to answer this without additional information. Given the current state of affairs, software is more important than hardware. Even if they release hardware like that (which IMO is extremely unlikely), it won’t be very useful without sweeping changes to programmability.
 
  • Like
Reactions: Boil

Philip Turner

macrumors regular
Dec 7, 2021
170
111
I think what they meant is a modular system for swapping different Apple chips, for example options for an M3 Max or M3 Ultra. That could be standardized so future generations of Max and Ultra chips would fit as well. The thermal load would need to stay constant, though, which means smaller performance improvements as Moore's Law draws to a halt. On that note, we haven't even gotten close to the Landauer limit.
 

Zest28

macrumors 68030
Jul 11, 2022
2,581
3,931
Who cares about all this Apple Silicon stuff.

If you need performance, just rent it, just like how ChatGPT was trained on Azure.

It's probably a waste of time trying to get ARM working, due to missing libraries or the correct versions not being available, rather than just doing it in the cloud.
 
Last edited:

Philip Turner

macrumors regular
Dec 7, 2021
170
111
That is how Nvidia makes money. Everything in the cloud is an Nvidia GPU running CUDA software. ChatGPT runs on OpenAI's own Triton software, a domain-specific language that harnesses CUDA for AI inference. Some people don't want to contribute to a growing GPGPU monopoly.

missing libraries or the correct versions not being available
That's standard software engineering for you; even Linux users deal with issues like this. Also, the cloud (e.g. Google Colab) doesn't immediately solve all the problems; you just have a different environment with a different set of conditions. That might help, or it might make things worse.
 

JouniS

macrumors 6502a
Nov 22, 2020
638
399
Who cares about all this M3 Max and M3 Ultra stuff.

If you need performance, just rent it, just like how ChatGPT was trained on Azure.

It's probably a waste of time trying to get ARM working, due to missing libraries or the correct versions not being available, rather than just doing it in the cloud.
The cloud is good for production but not for development.

I've been using cloud instances as a workstation replacement quite often recently. The main weakness is that you have to choose between fast storage and permanent storage. All permanently stored files are effectively on a network drive that's slow by today's standards. There is local SSD, but all data stored on it is lost when you stop the instance. And you have to stop the instance, because at $5/hour, the cost quickly exceeds the price of a nice workstation.

When you don't have fast persistent storage, you can't leave large temporary files around to allow you to resume working faster. You have to download and/or generate them every time you start the instance, which adds significant delays and lowers your productivity.
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Apple just deprecated vecLib's BLAS and replaced it with an ILP64 version. Meanwhile, I think I found a workaround for boosting ZGEMM performance: split the real and imaginary matrices into separate contiguous memory blocks. You might think to run each of the 4 sub-multiplications on a separate CPU core, but the more efficient approach seems to be having each of 4 CPU cores handle part of each sub-multiplication (see the sketch below). Another option is to force it to use both P-blocks: sub-divide a 2x2x2 = 8 multiplication and carefully coordinate the 4 sum operations afterward, so they occur between CPUs in the same block.
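Here is a minimal sketch of the split-storage idea (my own illustration; square row-major matrices assumed). The complex product (A + iB)(C + iD) = (AC - BD) + i(AD + BC) becomes four real DGEMMs over contiguous blocks:

Code:
#include <Accelerate/Accelerate.h>

/* Split-complex GEMM: real and imaginary parts stored as separate
 * contiguous n x n matrices, multiplied with four real DGEMMs. */
void split_complex_gemm(int n,
                        const double *a_re, const double *a_im,
                        const double *b_re, const double *b_im,
                        double *c_re, double *c_im) {
    /* c_re = a_re*b_re - a_im*b_im */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, a_re, n, b_re, n, 0.0, c_re, n);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                -1.0, a_im, n, b_im, n, 1.0, c_re, n);
    /* c_im = a_re*b_im + a_im*b_re */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, a_re, n, b_im, n, 0.0, c_im, n);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, a_im, n, b_re, n, 1.0, c_im, n);
}

Only the two accumulating DGEMMs must be ordered after their partners; otherwise the four multiplications are independent, which is what makes the work easy to spread across cores.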

Also, for complex eigendecomposition, I think my numbers were off. I found one source that said LAPACK's implementation takes ~9n^3 flops instead of (4/3)n^3, which would put Accelerate's DSYEVD at ~120 GFLOPS. I also found some work by Dongarra, who says the majority of operations can be FP32, with the only FP64 parts being O(n^2). The only exception is ill-conditioned matrices, which occur with very close atomic orbitals in metals. My nanotech research involves upper p-block elements, not transition metals with close d-orbitals. In short, I think I can accelerate quantum chemistry on ASi by 8x using the AMX. If the FP32 parts are still the bottleneck, then move to the GPU with ~330 GFLOPS FP59e.

 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Apple just deprecated vecLib’s BLAS and replaced it with an ILP64 version.
What are the pros/cons of the change? Will LAPACK/BLAS-based software in macOS 13.3 stop working or be slower?

The BLAS and LAPACK libraries under the Accelerate framework are now inline with reference version 3.9.1. [1]
3.9.1 was released almost two years ago [2]. Why does Apple use such old versions of LAPACK and BLAS?

[1] https://developer.apple.com/documentation/macos-release-notes/macos-13_3-release-notes
[2] https://netlib.org/lapack/#_lapack_version_3_9_1
 
Last edited:

leman

macrumors Core
Oct 14, 2008
19,517
19,664
Apple just deprecated vecLib’s BLAS and replaced it with an ILP64 version.

Why do you say deprecated? The notes say it was updated to a much newer version (BLAS and LAPACK were previously stuck on a decade-old version).
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Look up the Swift API documentation for BLAS. They deprecated every function in the module. They will probably document the new, un-deprecated 64-bit versions of each function soon.
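For C code, the macOS 13.3 release notes describe opting into the new interfaces via preprocessor macros; if I'm reading them correctly, it looks like this:

Code:
/* Select the new LAPACK 3.9.1 headers and the ILP64 (64-bit integer)
 * variants. Without these macros you still get the old, now-deprecated
 * LP64 symbols, so existing code keeps compiling. */
#define ACCELERATE_NEW_LAPACK
#define ACCELERATE_LAPACK_ILP64
#include <Accelerate/Accelerate.h>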
 
  • Like
Reactions: Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Why do you say deprecated? The notes say it was updated to a much newer version (BLAS and LAPACK were previously stuck on a decade-old version)

These new interfaces provide additional functionality and a new ILP64 interface.
It seems Apple is keeping the LP64 interface.

The previous ones were much older. These new ones are cutting-edge by comparison :)
Hopefully Apple can go the extra mile and make life a little easier for developers of BLAS/LAPACK-based software to program for macOS.
 

ahurst

macrumors 6502
Oct 12, 2021
410
815
The previous ones were much older. These new ones are cutting-edge by comparison :)
Hoping this means we’ll see Accelerate.framework support added back to SciPy soon! I know they dropped it a few years ago due to the ancient/buggy LAPACK implementation, so I’m curious what kind of performance boost the AMX would provide for those sorts of workloads if it came back.
 
  • Like
Reactions: Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Anecdotal data suggests that the performance difference between Apple's BLAS and OpenBLAS decreases as the matrix dimensions increase. Is there any comparison between OpenBLAS and Apple's BLAS?
 

leman

macrumors Core
Oct 14, 2008
19,517
19,664
Anecdotal data suggests that the performance difference between Apple's BLAS and OpenBLAS decreases as the matrix dimensions increase.

If your data is very large, you will be limited by the system RAM bandwidth. AMX units have tremendous throughput and are connected directly to the fast L2 cache, but as your dimensions increase, you'll spend an increasing amount of time waiting on RAM.

At least this would be my first intuition; I could very well be wrong.
 
  • Like
Reactions: Xiao_Xi

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Anecdotal data suggests that the performance difference between Apple's BLAS and OpenBLAS decreases as the matrix dimensions increase. Is there any comparison between OpenBLAS and Apple's BLAS?
If your data is very large, you will be limited by the system RAM bandwidth. AMX units have tremendous throughput and are connected directly to the fast L2 cache, but as your dimensions increase, you'll spend an increasing amount of time waiting on RAM.

At least this would be my first intuition; I could very well be wrong.
https://github.com/xrq-phys/blis_apple has some graphs demonstrating GEMM performance. Accelerate is not optimized to switch to a different execution mode once L2 is exceeded: the first performance peak ends when the working set exceeds L2, and the second when it exceeds the P-CPU L2D + SLC.
 
  • Like
Reactions: Xiao_Xi