> So the advantage of the Mac over a linux/unix workstation was Office, not the hardware.

Please see the two paragraphs I just added.
> I agree the paper is of poor quality and the results are questionable. That is beside the point. The scientific approach would be to show data that refutes their results, and preferably to explain them, for example as a bug.
>
> Assuming something is wrong because it goes against the knowledge or politics of the time is a very dangerous path to take. A recent example is global warming, and a few hundred years ago you could literally lose your head for claiming the world is round. There are plenty of such examples in the literature.

In principle I agree. However, extraordinary claims require extraordinary evidence, and the claims of the authors are quite extraordinary.
> Regarding atomics, the documents say that 64-bit int atomics are supported, but I haven't tried it out. Are you saying it doesn't work?

Both 64-bit floats and 64-bit atomics are unsupported. The M2 may (unconfirmed) have one obscure non-returning UInt64 max, just enough to run Nanite, but not the full set required for e.g. cl_khr_int64_base_atomics. I've been investing a lot of time into predicting the performance of emulated 64-bit floats. It currently stands at 1:70 for FP64 and 1:40 for FP59. The reduced-precision version solves the same problem as FP64: FP32 is not good enough for highly precise scientific numbers. Many apps, such as OpenMM, do most of their work in FP32 and sum important values in FP64. It is these mixed-precision use cases that the lack of hardware FP64 kills off.
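To make the mixed-precision point concrete, here is a small pure-Python sketch (the function names are mine, not from OpenMM or any GPU API) that simulates FP32 arithmetic by rounding every intermediate result through `struct`, and contrasts it with keeping only the running sum in FP64:

```python
import struct

def to_f32(x):
    # Round a Python float (an IEEE double) to the nearest float32.
    return struct.unpack('f', struct.pack('f', x))[0]

def sum_f32(values):
    # Accumulate entirely in float32, rounding after every addition,
    # as a GPU without any FP64 support would.
    acc = 0.0
    for v in values:
        acc = to_f32(acc + to_f32(v))
    return acc

def sum_mixed(values):
    # Mixed precision: inputs are float32, but the running sum is kept
    # in float64 (Python's native float).
    acc = 0.0
    for v in values:
        acc += to_f32(v)
    return acc

values = [0.1] * 1_000_000
print(sum_f32(values))    # drifts far from 100000 due to f32 rounding
print(sum_mixed(values))  # stays close to 100000
```

The f32-only accumulator picks up a rounding error on every one of the million additions, while the mixed-precision version only pays the one-time cost of storing 0.1 as a float32.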
> preprocessor macros and C++ features of Metal

To port existing software libraries, we need more than just preprocessor macros. Many libraries, such as INQ, rely on single-source APIs; CUDA, HIP, and SYCL are examples, and they improve developer productivity for compute. Dual-source APIs like OpenCL, Metal, and DirectX never caught on. I'm working on a Metal backend for hipSYCL, so the Apple GPU can actually be viable for scientific computing. Many scientific libraries use SYCL (single-source) in addition to HIP/CUDA, or are easily portable with Intel SYCLomatic. There is no such thing as a "Metal-omatic", and one may be impossible, because Metal is dual-source.
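The single-source vs. dual-source distinction can be sketched conceptually (this is illustrative pseudocode, not a real GPU API): in a dual-source API the kernel lives in a string written in a separate language, while in a single-source API the kernel is ordinary code in the same translation unit.

```python
# Dual-source style (OpenCL/Metal): the kernel is a string in a separate
# shading language, compiled at runtime. The host compiler cannot
# type-check it, and host templates/types cannot be shared with it.
KERNEL_SRC = """
__kernel void scale(__global float* x, float a) { /* ... */ }
"""

# Single-source style (CUDA/HIP/SYCL): the kernel is an ordinary
# function in the same source file, so host-side types, templates, and
# lambdas carry over directly. Modeled here as a plain Python function.
def scale(x, a):
    return [a * xi for xi in x]

print(scale([1.0, 2.0], 3.0))
```

This is why porting a single-source library to a dual-source API is so much harder than swapping preprocessor macros: every kernel has to be extracted into a second language with its own type system.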
> Knowingly sharing disinformation on this forum should result in a final warning, and doing it twice should result in a permanent ban. Absolutely appalling behaviour.

"Knowingly"?
> I'm working on a Metal backend for hipSYCL, so the Apple GPU can actually be viable for scientific computing.

Is the Metal IR specification publicly available?
> Another issue is something called "Unified Shared Memory" (USM) in CUDA/OpenCL 2.0/SYCL 2020. The idea is that the CPU and GPU can read from the same memory address; you'd think Apple would allow that. After all, it's perfect for the Unified Memory Architecture. No. The Apple GPU is designed for graphics and…

Is it a hardware or software limitation?
> There is a library for converting Metal AIR to SPIR-V: https://github.com/darlinghq/indium

I can imagine that a library that translates Metal AIR to SPIR-V might be useful for playing iOS binaries on Android, but how can it be useful for compiling SYCL code into Metal AIR?
> Another issue is something called "Unified Shared Memory" (USM) in CUDA/OpenCL 2.0/SYCL 2020.

Does Vulkan/DirectX/WebGPU have Unified Shared Memory? Is it only possible on single-source APIs?
> Does this take into account FMA optimizations as described in this paper: http://andrewthall.org/papers/df64_qf128.pdf

I did consider that approach, but it's inexact and not IEEE-compliant. I once tried to GPU-accelerate a physics simulation, and the double-single approach had trouble splitting a single. Perfect emulation requires manipulating the mantissa as integers. FP59 is faster because the integer mantissa can be 48-bit aligned.
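The "double-single" (df64) approach in that paper builds on error-free transformations such as Knuth's two-sum. A minimal sketch in Python, using native doubles to play the role of the two halves (function names are mine, for illustration):

```python
def two_sum(a, b):
    # Knuth's error-free transformation: returns (s, e) with
    # s = fl(a + b) and a + b == s + e exactly (barring overflow).
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    e = (a - a_virtual) + (b - b_virtual)
    return s, e

def df_add(hi, lo, x):
    # Add a scalar x to an extended-precision value (hi, lo),
    # then renormalize so hi carries the leading bits.
    s, e = two_sum(hi, x)
    e += lo
    return two_sum(s, e)

# The low word captures a contribution far below the precision of hi:
print(two_sum(1.0, 1e-20))
print(df_add(1.0, 1e-20, 1.0))
```

The catch mentioned above is the splitting step: these transformations assume round-to-nearest IEEE arithmetic for each operation, which is exactly where non-compliant fast-math GPU hardware can break them.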
> From what I understand, CUDA does this by reserving a huge area of VM addresses and allocating data from that range.

I have a working proof of concept that uses this approach. It allocates a giant chunk of VM and maps it to the GPU. You translate addresses simply through integer addition. With compiler tricks, you can even pre-convert some pointers before sending them to the GPU.
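The translation itself is trivial once the region is reserved contiguously on both sides. A sketch (the base addresses below are invented for illustration; a real implementation would get them from the driver's VM reservation):

```python
# Hypothetical base addresses, for illustration only.
CPU_BASE = 0x0000_7000_0000_0000  # start of the shared region in CPU VM
GPU_BASE = 0x0000_0001_4000_0000  # where the same region maps in GPU VM

def cpu_to_gpu(cpu_addr: int) -> int:
    # Because the whole region is one contiguous reservation on both
    # sides, translating a pointer is a single integer addition.
    return cpu_addr - CPU_BASE + GPU_BASE

def gpu_to_cpu(gpu_addr: int) -> int:
    return gpu_addr - GPU_BASE + CPU_BASE

# Any offset into the region is preserved by the translation:
ptr = CPU_BASE + 0x1234
print(hex(cpu_to_gpu(ptr)))
```

"Pre-converting" pointers at compile time just means the compiler folds the constant `GPU_BASE - CPU_BASE` offset into address computations before the buffer is handed to the GPU.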
> how can it be useful for compiling SYCL code into Metal AIR?

I've been exploring a lot of design approaches for SYCL, involving LLVM IR, SPIR-V dialects, AIR, and MSL. The eventual approach was to structurize the LLVM IR, then lower it to Vulkan SPIR-V and finally to MSL using SPIRV-Cross. I'll have to modify SPIRV-Cross to add more compute instructions.
> Does Vulkan/DirectX/WebGPU have Unified Shared Memory? Is it only possible on single-source APIs?

None of the graphics APIs have Unified Shared Memory.
> I've been exploring a lot of design approaches for SYCL, involving LLVM IR, SPIR-V dialects, AIR and MSL. The eventual approach was to structurize the LLVM IR, then -> Vulkan SPIR-V -> MSL using SPIRV-Cross.

How can SYCL be used on Nvidia GPUs?
How can Julia be used on Apple GPUs?
> Weird, since Metal docs mention atomic_ulong - a 64-bit atomic type. Is that a typo or mistake on Apple's side?

I actually found a post of Philip's on the Unreal forums where he mentions that those functions only work with Metal family 8, i.e. A15+/M2, while the M1 is only family 7.
> metal family 8 aka A15+ / M2, while the M1 is only family 7.

Just M2 (maybe), and not the A15. I got an unconfirmed private message in which someone said Apple engineers told them. Perhaps this is like the M1, which has some differences from the A14: for example, the M1 supports BC compression and MetalFX upscaling, while the A14 does not.
> Just M2 (maybe), and not A15. I got an unconfirmed private message where someone said Apple engineers told them.

That's a shame. Out of curiosity, I tried your test program on my iPhone 14 Pro, and I can confirm that the A16 also gets the same "Compiler failed with XPC_ERROR_CONNECTION_INTERRUPTED".
Apple's A16 was originally going to be another GPU architecture, supporting power-efficient ray tracing (Apple 9?). I wonder whether they would have supported MetalFX and Nanite on it, had there been no setback, or whether they would have limited those features to the M-series to emphasize that Macs are for gaming.
And since Philip is here I'd like to say incredible work!
> I re-read Apple's docs and indeed they say that 64-bit atomics are only supported for min and max, and only for "hardware as specified in the Metal feature table". Of course, no mention of that in the tables. Typical Apple documentation quality…

Yeah. I tried everything from offline compilation of the shader to every possible permutation of setting up a kernel, hoping that some random combo might work, but nope: all you get is a shader compiler crash when setting the pipeline state.
> When making these comparisons - are the latest versions and compatible versions of components installed? Java (Zulu), Numpy, Numba configured to use the Apple Accelerate frameworks etc.

Some time ago I read that some libraries had problems with Apple's Accelerate framework because it was based on an old version of BLAS/LAPACK. Have those problems been solved?
> Have those problems been solved?

Not sure, but any OS is supposed to come with an optimized BLAS library. Accelerate is optimized for the AMX, which has roughly 10x the compute power of SIMD NEON in a single-core process. If libraries aren't using it, they're killing CPU performance. PyTorch, for instance, seemed not to be using Accelerate (but rather OpenBLAS) for quite a while; it was never well clarified which one they were using.
I created two Mathematica benchmarks and sent them to two posters on another forum who have M1 Maxes. These calculate the percent difference in wall-clock runtime between whatever machine they're run on and my 2019 i9 iMac (see config details below). They were run using Mathematica 13.0.1 (current version is 13.1).
The tables below show the results from one M1 Max user; the results for the other user didn't differ significantly.
It's been opined that, to the extent Mathematica doesn't do well on AS vs. Intel, it's because AS's math libraries aren't as well optimized as Intel's MKL. Thus my upper table consists entirely of symbolic tasks and, indeed, all of these are faster on AS than my i9 iMac. However, they are not that much faster. You can see the M1 Max averages only 2% to 18% faster for these suites of symbolic computations.
The lower table features a graphing suite, where AS was 21% faster. I also gave it an image-processing task, and it was 46% slower, possibly because it uses numeric computations.
Details:
Symbolic benchmark: Consists of six suites of tests: three integration suites, a simplify suite, a solve suite, and a miscellaneous suite, for a total of 58 calculations. On my iMac this takes 37 min, an average of ~40 s/calculation. It produces a summary table at the end, which shows the percentage difference in run time between my iMac and whatever device it's run on. Most of these calculations appear to be single-core only (Wolfram Kernel shows ~100% CPU in Activity Monitor). However, the last one (polynomial expansion) appears to be multi-core (CPU ~500%).
Graphing and image processing benchmark: Consists of five graphs (2D and 3D) and one set of image processing tasks (processing an image taken by JunoCam, which is the public-outreach wide-field visible-light camera on NASA’s Juno Jupiter orbiter). It takes 2 min on my 2019 i9 iMac. As with the above, it produces a summary table at the end. The graphing tasks appear to be single-core only (Wolfram Kernel shows ~100% CPU in Activity Monitor). However, the image-processing task appears to be multi-core (CPU ~250% – 400%).
Here's how the percent differences in the summary tables are calculated (ASD = Apple Silicon Device, or whatever computer it's run on):
% difference = (ASD time/(average of ASD time and iMac time) – 1)*100.
Thus if the iMac takes 100 s and the ASD takes 50 s, the ASD would get a value of –33, meaning the ASD is 33% faster; if the ASD takes 200 s, it would get a value of 33, meaning it is 33% slower. By dividing by the average of the iMac and ASD times, we get the same absolute percentage difference regardless of whether the two-fold difference goes in one direction or the other. For instance, if we instead divided by the iMac time, we'd get 50% faster and 100% slower, respectively, for the above two examples.
I also provide a mean and standard deviation for the percentages from each suite of tests. I decided to average the percentages rather than the times so that all processes within a test suite are weighted equally, i.e., so that processes with long run times don't dominate.
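The metric above can be written out directly (the function names here are mine, for illustration; the actual benchmarks compute this inside Mathematica):

```python
from statistics import mean, stdev

def pct_difference(asd_time: float, imac_time: float) -> float:
    # Percent difference relative to the mean of the two times, so a
    # two-fold speedup and a two-fold slowdown have the same magnitude.
    avg = (asd_time + imac_time) / 2
    return (asd_time / avg - 1) * 100

def suite_summary(pcts):
    # Averaging the percentages (not the raw times) weights each
    # process in a suite equally, so long-running tasks don't dominate.
    return mean(pcts), stdev(pcts)

# The worked example from the text: 50 s vs 100 s, and 200 s vs 100 s.
print(pct_difference(50, 100))   # about -33 (ASD faster)
print(pct_difference(200, 100))  # about +33 (ASD slower)
```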
iMac details:
2019 27" iMac (19,1), i9-9900K (8 cores, Coffee Lake, 3.6 GHz/5.0 GHz), 32 GB DDR4-2666 RAM, Radeon Pro 580X (8 GB GDDR5)
Mathematica 13.0.1
MacOS Monterey 12.4
[Attachment 2131848: benchmark results tables]
> Hello theorist9,
>
> Thank you for running your suite of Mathematica benchmarks on Mathematica 13.0.1.
>
> Would it be possible to rerun your benchmarks on the most recent version of Mathematica (v13.2) and report back your findings? According to the Wolfram Research release notes for v13.2, "extensive" optimizations were made to increase the efficiency of many Mathematica capabilities.
>
> All the best,
> richmlow

Hi richmlow. I'd be happy to rerun them on my i9 iMac, but I'll need to contact those who did the comparison runs for me on their M1 Maxes, and I'm not sure whether they still have Mathematica running on their machines (at least one, and possibly both, downloaded the 7-day trial version to run the test). I'd offer to send my benchmark tests to you to run, but it doesn't appear you have an M1.