
flashflood101

macrumors member
Feb 11, 2022
33
86
Knowingly sharing disinformation on this forum should result in a final warning, and doing it twice should result in a permanent ban. Absolutely appalling behaviour.
 
  • Wow
Reactions: sorgo †

Grey Area

macrumors 6502
Jan 14, 2008
433
1,030
I agree the paper is of poor quality and its results are questionable. That is beside the point. The scientific approach would be to show data that refutes their results and, preferably, to explain them, e.g. by identifying a bug.

Assuming something is wrong because it goes against the knowledge or politics of the time is a very dangerous path to take. A recent example is global warming, and a few hundred years ago you could literally lose your head for claiming the world is round. There are plenty of such examples in the literature.
In principle I agree. However, there is also the principle that extraordinary claims require extraordinary evidence, and the claims of the authors are quite extraordinary.

The authors used the SHOC benchmark suite. This benchmark was developed mostly before 2015, and the last minor updates predate the release of the M1. It seems safe to say the benchmark was not developed with Apple Silicon in mind.

There is a recent issue report on the SHOC GitHub repository, "MaxFlops reports crazy numbers for Apple M1": "...suggest that my M1 Air laptop is the second most powerful supercomputer in Europe...". According to one of the benchmark authors, the likely reason is that the compiler for the M1 optimizes away "useless" work, thus sidestepping the purpose of the benchmark; the benchmark would have to be updated to prevent this from happening.
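To illustrate the failure mode (a hypothetical sketch, not SHOC's actual kernel): if the result of the arithmetic loop is never consumed, the compiler is free to delete the loop entirely, and the timer then measures an empty function.

```cpp
// Hypothetical sketch of the failure mode, not SHOC's actual kernel.
float maxflops_broken(int n) {
    float acc = 1.0f;
    for (int i = 0; i < n; ++i)
        acc = acc * 1.000001f + 1.0e-7f; // "useless" work
    return 0.0f;  // acc is dead, so the compiler may delete the loop
}

// Common mitigation: make the result observable so the optimizer
// cannot prove the work is useless.
float maxflops_fixed(int n, volatile float *sink) {
    float acc = 1.0f;
    for (int i = 0; i < n; ++i)
        acc = acc * 1.000001f + 1.0e-7f;
    *sink = acc;  // side effect forces the FLOPs to actually happen
    return acc;
}
```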

Going by this, I feel the burden of proof is on the authors. If you want to demonstrate that a laptop is orders of magnitude faster than supercomputers, and faster even than the manufacturer's own claims, you need to bring more than an unsuitable benchmark.
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Regarding atomics, the documents say that 64-bit int atomics are supported, but I haven’t tried them out. Are you saying they don’t work?
Both 64-bit floats and 64-bit atomics are unsupported. The M2 may (unconfirmed) have one obscure non-returning UInt64 max, just to run Nanite - not the full set required for e.g. cl_khr_int64_base_atomics. I've been investing a lot of time into predicting the performance of emulated 64-bit floats. It stands at 1:70 for FP64 and 1:40 for FP59. The reduced-precision version solves the same problem as FP64: FP32 is not precise enough for sensitive scientific quantities. Many apps like OpenMM do most of their work in FP32 and sum important values in FP64. Those mixed-precision use cases are what the lack of hardware FP64 kills off.
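As a simplified sketch of that mixed-precision pattern (assuming nothing about OpenMM's actual internals):

```cpp
#include <vector>

// Sketch of the mixed-precision pattern: per-term math in FP32,
// the sensitive running total accumulated in FP64.
double accumulate_energy(const std::vector<float>& terms) {
    double total = 0.0;      // FP64 accumulator
    for (float e : terms)
        total += e;          // each term was computed in FP32
    return total;
}
```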

Regarding 64-bit atomics emulation, I have a real-world P.O.C. where it's 5x slower than 32-bit. Not bad, and it aligns perfectly with theory. My intent was to accelerate INQ, a scientific software library for GPUs, on M1. It was entirely FP64 and had some atomic additions. Turns out the CPU is much more powerful at FP64 than I realized: <400 GFLOPS vector (multi-core) and <700 GFLOPS matrix (single-core) - the same as an RTX 3090's FP64.
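For the curious, one common way to emulate a 64-bit atomic add on 32-bit atomic hardware (a generic MSL sketch, not necessarily what the P.O.C. above does): add the low words, detect the unsigned wrap-around, and propagate the carry into the high word. The final sum is exact for add-only accumulation, though intermediate reads can be torn.

```cpp
#include <metal_stdlib>
using namespace metal;

// Generic sketch: 64-bit atomic add emulated with two 32-bit atomics.
kernel void add64(device atomic_uint *lo   [[buffer(0)]],
                  device atomic_uint *hi   [[buffer(1)]],
                  device const ulong *vals [[buffer(2)]],
                  uint id [[thread_position_in_grid]]) {
    uint v_lo = uint(vals[id]);
    uint v_hi = uint(vals[id] >> 32);
    uint old_lo = atomic_fetch_add_explicit(lo, v_lo, memory_order_relaxed);
    // Detect unsigned wrap-around of the low word.
    uint carry = (old_lo + v_lo < v_lo) ? 1u : 0u;
    atomic_fetch_add_explicit(hi, v_hi + carry, memory_order_relaxed);
}
```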

preprocessor macros and C++ features of Metal
To port existing software libraries, we need more than just preprocessor macros. Many libraries, such as INQ, rely on single-source APIs; CUDA, HIP, and SYCL are examples, and they improve developer productivity for compute. Dual-source APIs like OpenCL, Metal, and DirectX never caught on for this kind of work. I'm working on a Metal backend for hipSYCL, so the Apple GPU can actually be viable for scientific computing. Many scientific libraries use SYCL (single-source) in addition to HIP/CUDA, or are easily portable with Intel's SYCLomatic. There is no such thing as a "Metal-omatic", and one may be impossible because Metal is dual-source.
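To make "single-source" concrete, a minimal SYCL 2020 sketch; the kernel lambda lives in the same C++ file as the host code, rather than in a separate shader file:

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    sycl::queue q;                                   // default device
    float *x = sycl::malloc_shared<float>(1024, q);  // USM allocation
    q.parallel_for(sycl::range<1>(1024), [=](sycl::id<1> i) {
        x[i[0]] = 2.0f * float(i[0]);                // device code, inline
    }).wait();
    std::printf("x[100] = %f\n", x[100]);            // host reads same pointer
    sycl::free(x, q);
}
```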

Another issue is something called "Unified Shared Memory" (USM) in CUDA/OpenCL 2.0/SYCL 2020. The idea is that the CPU and GPU can read from the same memory address; you'd think Apple allows that, since it's a perfect fit for the Unified Memory Architecture. No. The Apple GPU is designed for graphics and TensorFlow, not GPGPU.

USM is supported by Nvidia's, AMD's, and Intel's drivers for their discrete GPUs. It's accomplished by a hardware-accelerated "address translation service" that maps CPU pointers to GPU pointers. Another hardware feature automatically migrates data between the CPU and GPU over PCIe. USM boosts developer productivity and makes massive C++ code bases easily portable from CPU to GPU. Luckily, it can also be emulated with good performance on the M1.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
I'm working on a Metal backend for hipSYCL, so the Apple GPU can actually be viable for scientific computing.
Is the Metal IR specification publicly available?

Another issue is something called "Unified Shared Memory" (USM) in CUDA/OpenCL 2.0/SYCL 2020. The idea is that the CPU and GPU can read from the same memory address; you'd think Apple allows that, since it's a perfect fit for the Unified Memory Architecture. No. The Apple GPU is designed for graphics and
Is it a hardware or software limitation?
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
Both 64-bit floats and 64-bit atomics are unsupported. The M2 may (unconfirmed) have one obscure non-returning UInt64 max, just to run Nanite - not the full set required for e.g. cl_khr_int64_base_atomics.

Weird, since the Metal docs mention atomic_ulong, a 64-bit atomic type. Is that a typo or a mistake on Apple's side?

I've been investing a lot of time into predicting the performance of emulated 64-bit floats. It stands at 1:70 for FP64 and 1:40 for FP59. The reduced-precision version solves the same problem as FP64: FP32 is not precise enough for sensitive scientific quantities. Many apps like OpenMM do most of their work in FP32 and sum important values in FP64. Those mixed-precision use cases are what the lack of hardware FP64 kills off.

Does this take into account FMA optimizations as described in this paper: http://andrewthall.org/papers/df64_qf128.pdf
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
Is the Metal IR specification publicly available?

No, it’s not, but compiled Metal files are just LLVM bitcode. There is a library for converting Metal AIR to SPIR-V: https://github.com/darlinghq/indium

Is it a hardware or software limitation?

The GPU has its own TLB and binds memory at a different virtual address. A unified address space is tricky: you need to ensure that the same address ranges are available on both devices. From what I understand, CUDA does this by reserving a huge area of VM addresses and allocating data from that range. But this is much trickier for Apple, where you can bind any client-allocated memory page to the GPU. To do this correctly you’d need to share the same page tables between the CPU and the GPU, leaking GPU state into the application, etc.
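A sketch of that reservation trick on macOS (a hypothetical helper, not how CUDA actually implements it internally): grab a large span of addresses up front without committing physical memory, then carve shared allocations out of it so both sides can agree on addresses.

```cpp
#include <sys/mman.h>
#include <cstddef>

// Reserve a large address range without committing physical memory.
void *reserve_address_range(std::size_t bytes) {
    void *p = mmap(nullptr, bytes, PROT_NONE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    return (p == MAP_FAILED) ? nullptr : p;
}
```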

I wouldn’t be surprised if Apple implements this in the future, but it will need substantial work on hardware, software and kernel level.
 
  • Like
Reactions: jmcube and Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
There is a library for converting Metal AIR to SPIR-V: https://github.com/darlinghq/indium
I can imagine that a library that translates Metal AIR to SPIR-V might be useful for running iOS binaries on Android, but how can it be useful for compiling SYCL code into Metal AIR?

Another issue is something called "Unified Shared Memory" (USM) in CUDA/OpenCL 2.0/SYCL 2020.
Does Vulkan/DirectX/WebGPU have Unified Shared Memory? Is it only possible on single-source APIs?

By the way, how is the development of WebGPU going?
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Does this take into account FMA optimizations as described in this paper: http://andrewthall.org/papers/df64_qf128.pdf
I did consider that approach, but it's inexact and not IEEE-compliant. I once tried to GPU-accelerate a physics simulation, and the double-single approach had trouble splitting a single. Perfect emulation requires manipulating the mantissa as integers. FP59 is faster because the integer mantissa can be 48-bit aligned.
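For reference, the basic error-free building block the df64 paper composes (Knuth's TwoSum), which is exact for the sum but still leaves the splitting problem mentioned above:

```cpp
// A float pair representing the unevaluated sum hi + lo.
struct df32 { float hi, lo; };

// Knuth's TwoSum: s is the rounded sum, err the exact rounding error.
inline df32 two_sum(float a, float b) {
    float s   = a + b;
    float bb  = s - a;                      // the part of b that made it into s
    float err = (a - (s - bb)) + (b - bb);  // exact residual
    return {s, err};
}
```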

From what I understand, CUDA does this by reserving a huge area of VM addresses and allocating data from that range.
I have a working P.O.C. that uses this approach. It allocates a giant chunk of VM and maps it to the GPU. You translate addresses simply through integer addition. With compiler tricks, you can even pre-convert some pointers before sending them to the GPU.
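The translation itself is trivial under the single-big-allocation assumption (names hypothetical):

```cpp
#include <cstdint>

// Assuming one large shared allocation mapped at cpu_base on the CPU
// and gpu_base on the GPU, pointers differ only by a constant offset.
inline uint64_t cpu_to_gpu(const void *p, uint64_t cpu_base, uint64_t gpu_base) {
    return gpu_base + (reinterpret_cast<uint64_t>(p) - cpu_base);
}
```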

how can it be useful for compiling SYCL code into Metal AIR?
I've been exploring a lot of design approaches for SYCL, involving LLVM IR, SPIR-V dialects, AIR, and MSL. The eventual approach is to structurize the LLVM IR, then lower it to Vulkan SPIR-V, and finally to MSL using SPIRV-Cross. I'll have to modify SPIRV-Cross to add more compute instructions.
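The SPIR-V to MSL leg can be sketched with SPIRV-Cross's C++ API roughly like this (untested; the MSL version target is an assumption):

```cpp
#include <spirv_cross/spirv_msl.hpp>
#include <string>
#include <vector>

// Compile a Vulkan SPIR-V module down to Metal Shading Language source.
std::string spirv_to_msl(std::vector<uint32_t> spirv_words) {
    spirv_cross::CompilerMSL compiler(std::move(spirv_words));
    spirv_cross::CompilerMSL::Options opts;
    opts.set_msl_version(2, 4);    // assumed target MSL version
    compiler.set_msl_options(opts);
    return compiler.compile();     // emits MSL source text
}
```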

Does Vulkan/DirectX/WebGPU have Unified Shared Memory? Is it only possible on single-source APIs?
None of the graphics APIs have Unified Shared Memory.
 
  • Like
Reactions: leman and Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
I've been exploring a lot of design approaches for SYCL, involving LLVM IR, SPIR-V dialects, AIR, and MSL. The eventual approach is to structurize the LLVM IR, then lower it to Vulkan SPIR-V, and finally to MSL using SPIRV-Cross. I'll have to modify SPIRV-Cross to add more compute instructions.
How can SYCL be used on Nvidia GPUs?
Update: This presentation explains how it is done.

How can Julia be used on Apple GPUs?
 
Last edited:

jmho

macrumors 6502a
Jun 11, 2021
502
996
Weird, since the Metal docs mention atomic_ulong, a 64-bit atomic type. Is that a typo or a mistake on Apple's side?
I actually found a post of Philip's on the Unreal forums where he mentions that those functions only work with Metal family 8, aka A15+ / M2, while the M1 is only family 7.

And since Philip is here I'd like to say incredible work!
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Metal family 8, aka A15+ / M2, while the M1 is only family 7.
Just M2 (maybe), and not the A15. I got an unconfirmed private message in which someone said Apple engineers told them. Perhaps this is like the M1, which has some differences from the A14: for example, the M1 supports BC texture compression and MetalFX upscaling, while the A14 does not.

Apple's A16 was originally going to be another GPU architecture, supporting power-efficient ray tracing (Apple 9?). I wonder whether they would have supported MetalFX and Nanite on it, without their setback, or whether they would have limited those features to the M-series to emphasize that Macs are for gaming.
 
  • Like
Reactions: jeanlain and leman

jmho

macrumors 6502a
Jun 11, 2021
502
996
Just M2 (maybe), and not the A15. I got an unconfirmed private message in which someone said Apple engineers told them. Perhaps this is like the M1, which has some differences from the A14: for example, the M1 supports BC texture compression and MetalFX upscaling, while the A14 does not.

Apple's A16 was originally going to be another GPU architecture, supporting power-efficient ray tracing (Apple 9?). I wonder whether they would have supported MetalFX and Nanite on it, without their setback, or whether they would have limited those features to the M-series to emphasize that Macs are for gaming.
That's a shame. Out of curiosity I tried your test program on my iPhone 14 Pro, and I can confirm that the A16 also gets the same "Compiler failed with XPC_ERROR_CONNECTION_INTERRUPTED" error.

Unfortunately I don't have an M2 to test on.
 
  • Like
Reactions: leman

leman

macrumors Core
Oct 14, 2008
19,516
19,664
I actually found a post of Philip's on the Unreal forums where he mentions that those functions only work with Metal family 8, aka A15+ / M2, while the M1 is only family 7.

And since Philip is here I'd like to say incredible work!

I re-read Apple's docs and indeed they say that 64-bit atomics are only supported for min and max, and only for “hardware as specified in the Metal feature table”. Of course, there is no mention of that in the tables. Typical Apple documentation quality…
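For reference, a sketch of what using one of those 64-bit calls looks like, going by the spec's non-returning min/max (reportedly Apple family 8 hardware only):

```cpp
#include <metal_stdlib>
using namespace metal;

// 64-bit non-returning atomic max per the MSL spec; reportedly only
// runs on Apple family 8 GPUs and fails pipeline creation on M1.
kernel void max64(device atomic_ulong *acc [[buffer(0)]],
                  device const ulong *vals [[buffer(1)]],
                  uint id [[thread_position_in_grid]]) {
    atomic_max_explicit(acc, vals[id], memory_order_relaxed);
}
```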
 

jmho

macrumors 6502a
Jun 11, 2021
502
996
I re-read Apple's docs and indeed they say that 64-bit atomics are only supported for min and max, and only for “hardware as specified in the Metal feature table”. Of course, there is no mention of that in the tables. Typical Apple documentation quality…
Yeah. I tried everything from offline compilation of the shader to every possible permutation of setting up a kernel, hoping that some random combo might work, but nope: all you get is a shader compiler crash when creating the pipeline state.

Only Apple knows what the deal is with these two weird functions, I guess.
 

Stene

macrumors newbie
May 2, 2002
15
3
Luleå, Sweden
When making these comparisons, are the latest and mutually compatible versions of the components installed? Are Java (Zulu), NumPy, and Numba configured to use the Apple Accelerate framework, etc.? There is a beta of MATLAB that is semi-optimized for Apple M1 / M2.
 
Last edited:

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
When making these comparisons, are the latest and mutually compatible versions of the components installed? Are Java (Zulu), NumPy, and Numba configured to use the Apple Accelerate framework, etc.?
Some time ago I read that some libraries had problems with the Apple Accelerate framework because it was based on an old version of BLAS/LAPACK. Have those problems been solved?
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Have those problems been solved?
Not sure, but any OS is supposed to come with an optimized BLAS library. Accelerate is optimized for the AMX, which has 10x the compute power of SIMD NEON in a single-core process. If libraries aren't using it, they're killing CPU performance. PyTorch seemed to be using OpenBLAS rather than Accelerate for quite a while; it was never well clarified which they were using.
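For anyone checking whether their stack actually goes through Accelerate, a minimal direct call looks like this (sketch):

```cpp
#include <Accelerate/Accelerate.h>

// Double-precision GEMM via Accelerate's BLAS; on Apple Silicon this
// dispatches to the AMX units. Computes C = A * B, all n x n, row-major.
void gemm(const double *A, const double *B, double *C, int n) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
}
```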
 

richmlow

macrumors 6502
Jul 17, 2002
390
285
I created two Mathematica benchmarks and sent them to two posters on another forum who have M1 Maxes. These calculate the % difference in wall-clock runtime between whatever machine they're run on and my 2019 i9 iMac (see config details below). They were run using Mathematica 13.0.1 (the current version is 13.1).

The tables below show the results from one M1 Max user; the results for the other user didn't differ significantly.

It's been opined that, to the extent Mathematica doesn't do well on AS vs. Intel, it's because AS's math libraries aren't as well optimized as Intel's MKL. Thus my upper table consists entirely of symbolic tasks and, indeed, all of these are faster on AS than on my i9 iMac. However, they are not that much faster: the M1 Max averages only 2% to 18% faster across these suites of symbolic computations.

The lower table features a graphing suite, where AS was 21% faster. I also gave it an image-processing task, where AS was 46% slower, possibly because it uses numeric computations.

Details:

Symbolic benchmark:
Consists of six suites of tests: three integration suites, a simplify suite, a solve suite, and a miscellaneous suite. There are a total of 58 calculations. On my iMac, this takes 37 min, so an average of ~40 s/calculation. It produces a summary table at the end, which shows the percentage difference in run time between my iMac and whatever device it's run on. Most of these calculations appear to be single-core only (the Wolfram Kernel shows ~100% CPU in Activity Monitor). However, the last one (polynomial expansion) appears to be multi-core (CPU ~500%).

Graphing and image processing benchmark: Consists of five graphs (2D and 3D) and one set of image-processing tasks (processing an image taken by JunoCam, the public-outreach wide-field visible-light camera on NASA’s Juno Jupiter orbiter). It takes 2 min on my 2019 i9 iMac. As with the above, it produces a summary table at the end. The four graphing tasks appear to be single-core only (the Wolfram Kernel shows ~100% CPU in Activity Monitor). However, the image-processing task appears to be multi-core (CPU ~250% – 400%).

Here's how the percent differences in the summary tables are calculated (ASD = Apple Silicon Device, or whatever computer it's run on):

% difference = (ASD time/(average of ASD time and iMac time) – 1)*100.

Thus if the iMac takes 100 s and the ASD takes 50 s, the ASD would get a value of –33, meaning the ASD is 33% faster; if the ASD takes 200 s, it would get a value of 33, meaning it is 33% slower. By dividing by the average of the iMac and ASD times, we get the same absolute percentage difference regardless of which direction the two-fold difference goes. For instance, if we instead divided by the iMac time, we'd get 50% faster and 100% slower, respectively, for the above two examples.
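In code form (a hypothetical helper reproducing the tables' metric):

```cpp
// Negative = ASD faster than the iMac, positive = slower.
double percent_difference(double asd_seconds, double imac_seconds) {
    double mean = 0.5 * (asd_seconds + imac_seconds);
    return (asd_seconds / mean - 1.0) * 100.0;
}
// percent_difference(50.0, 100.0)  -> -33.3  (33% faster)
// percent_difference(200.0, 100.0) ->  33.3  (33% slower)
```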

I also provide a mean and standard deviation for the percentages from each suite of tests. I decided to average the percentages rather than the times so that all processes within a test suite are weighted equally, i.e., so that processes with long run times don't dominate.

iMac details:
2019 27" iMac (19,1), i9-9900K (8 cores, Coffee Lake, 3.6 GHz/5.0 GHz), 32 GB DDR4-2666 RAM, Radeon Pro 580X (8 GB GDDR5)
Mathematica 13.0.1
MacOS Monterey 12.4

View attachment 2131848

Hello theorist9,

Thank you for running your suite of Mathematica benchmarks on Mathematica 13.0.1.

Would it be possible to rerun your benchmarks on the most recent version of Mathematica (v.13.2) and report back your findings? According to the Wolfram Research release notes (for v. 13.2), "extensive" optimizations were made to increase the efficiency of many Mathematica capabilities.


All the best,
richmlow
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
Hello theorist9,

Thank you for running your suite of Mathematica benchmarks on Mathematica 13.0.1.

Would it be possible to rerun your benchmarks on the most recent version of Mathematica (v.13.2) and report back your findings? According to the Wolfram Research release notes (for v. 13.2), "extensive" optimizations were made to increase the efficiency of many Mathematica capabilities.


All the best,
richmlow
Hi richmlow. I'd be happy to rerun them on my i9 iMac, but I'll need to contact those who did the comparison runs for me on their M1 Maxes, and I'm not sure if they still have Mathematica running on their machines (at least one, and possibly both, downloaded the 7-day trial version to run the test). I'd offer to send my benchmark tests to you to run, but it doesn't appear you have an M1.
 