
The Mercurian

macrumors 68020
Mar 17, 2012
2,159
2,442
Sounds awesome. I'm toying with getting a Mac mini just to have a foot in the door with this architecture, but I need to sell a 2018 laptop first to justify it (and it's not easy to sell 2018 Macs right now).
 

jeanlain

macrumors 68020
Mar 14, 2009
2,460
954
With a native version of R, the gap goes up to 30-40%, which is absolutely insane for this thin and light machine.
So there's already a native version somewhere?
Have you tried using the Accelerate framework with R?
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
It looks like they're using the GPU rather than the neural engine.

I don't really care what they are using as long as it's fast :) It's possible that the Neural Engine is only made to accelerate specific use cases and is not flexible enough to implement general-purpose neural networks.

A14 did add dedicated matrix multiplication hardware to the GPU, so why not use it?

So there's already a native version somewhere?
Have you tried using the Accelerate framework with R?

You have to build it yourself; there are instructions in the R admin manual. I will run the R benchmark later with and without the Accelerate framework to see how it impacts the performance of numerical code.
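In the meantime, a quick way to sanity-check a build from inside R itself (just a sketch, nothing here is specific to the M1):

# Confirm the session is running natively on ARM64 rather than under Rosetta
Sys.info()[["machine"]]   # "arm64" for a native build, "x86_64" under Rosetta

# sessionInfo() also reports which BLAS/LAPACK libraries the build is linked
# against, so you can see whether Accelerate (vecLib) is actually in use
sessionInfo()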
 

jerryk

macrumors 604
Nov 3, 2011
7,421
4,208
SF Bay Area
If you use TensorFlow, Apple has added a driver to let you access the GPUs on the M1, and access to those GPUs has been added to TensorFlow. The performance numbers look good. I assume it will not be too long before someone adds support for PyTorch.

So we can more rapidly train models on an M1 system and, hopefully, the system will not sound like a 747 taking off for hours like my 16" MBP does.

Here is the blog post with some performance numbers:

 

jeanlain

macrumors 68020
Mar 14, 2009
2,460
954
Many/most of the scientific programs we use are compiled from source. We may have the compilers, but I wonder how much the source code is "ARM compatible". I know endianness won't be a problem, but can we take any C code that works on X86 and just compile it to some ARM executable?
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
Many/most of the scientific programs we use are compiled from source. We may have the compilers, but I wonder how much the source code is "ARM compatible". I know endianness won't be a problem, but can we take any C code that works on X86 and just compile it to some ARM executable?

Honestly, it works surprisingly well in practice. The only R packages with C source code that I couldn't compile are ones that have some really messy dependencies (e.g. gdal). But most of the stuff worked out of the box. Basically, if it works on x86-64, doesn't use inline assembler or CPU-specific intrinsics (very few scientific packages do, but even libraries like Eigen compile perfectly fine) and the build configuration is competent, it is almost guaranteed to work on ARM64.

And even the cases where the software won't build usually boil down to an invalid build configuration, e.g. the build system having some hard-coded settings that don't make sense on an ARM platform.
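For what it's worth, a from-source install is usually all it takes (a quick sketch; RcppEigen is just an arbitrary example of a package with heavy C++/Eigen code):

# Build the package's C++ sources natively for ARM64 instead of grabbing a binary
install.packages("RcppEigen", type = "source")
library(RcppEigen)   # if it compiles and loads, you're done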
 

The Mercurian

macrumors 68020
Mar 17, 2012
2,159
2,442
Honestly, it works surprisingly well in practice. The only R packages with C source code that I couldn't compile are ones that have some really messy dependencies (e.g. gdal). But most of the stuff worked out of the box. Basically, if it works on x86-64, doesn't use inline assembler or CPU-specific intrinsics (very few scientific packages do, but even libraries like Eigen compile perfectly fine) and the build configuration is competent, it is almost guaranteed to work on ARM64.

And even the cases where the software won't build usually boil down to an invalid build configuration, e.g. the build system having some hard-coded settings that don't make sense on an ARM platform.

Gdal - so some of the spatial stats packages then?

Have you tried R-INLA on it?
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
So, I have run R benchmark 2.5 using different scenarios and settings. It's a really old tool and probably not too representative of real-world workflows (unless you do a lot of linear algebra), but it's still an interesting thing to look at. Keep in mind that these are single-threaded tests (but most of R is single-threaded anyway)!

Detailed results are here: https://pastebin.com/ASywDd7h

The most interesting results:

  • The M1 running under Rosetta is faster on average than my i9 16", in some tasks up to 100% faster (matrix operations). The only exception is the "Escoufier's method" benchmark (whatever it does), which is about 60 times slower under Rosetta, probably an unfortunate interaction between R's suboptimal interpreter and Rosetta's translation
  • The M1 running native code is much faster than both the Intel machine and the Rosetta-translated version. We are talking about a 2x performance increase in most tests. The only regression is the matrix cross-product, which is about 20% slower on the M1 machine (a rough sketch of that test is below). My guess is that the Intel code is vectorized but the ARM build is not, but one should check the sources
  • Linking against the Apple Accelerate framework does not seem to make a big difference here. If anything, the result is slightly better without it
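For context, the cross-product test boils down to something like the following (a rough sketch from memory, not the exact benchmark code, and the matrix size is approximate):

# Roughly what the benchmark's cross-product test does
set.seed(1)
a <- matrix(rnorm(2800 * 2800), 2800, 2800)
system.time(crossprod(a))   # t(a) %*% a, normally dispatched to the BLAS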

Gdal - so some of the spatial stats packages then?

Yeah, geospatial stuff (I use it for maps)

Have you tried R-INLA on it?

Just tried installing it, but it has this ridiculous list of dependencies, including X11 and rgl, so it won't build.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
A14 did add dedicated matrix multiplication hardware to the GPU, so why not use it?

They came in with the A13; Apple iterated to higher performance with the A14. (I suspect the A14's gains come more from the extension being less memory-bandwidth bottlenecked than from a much larger transistor budget being thrown at it.)

A13 overview

"...
A big change to the CPU cores which we don’t have very much information on is Apple’s integration of “machine learning accelerators” into the microarchitecture. At heart these seem to be matrix-multiply units with DSP-like instructions, and Apple puts their performance at up to 1 Tera Operations (TOPs) of throughput, claiming an up-to 6x increase over the regular vector pipelines. This AMX instruction set is seemingly a superset of the ARM ISA that is running on the CPU cores.

There’s been a lot of confusion as to what this means, ... but one thing that is clear is that Apple isn’t publicly exposing these new instructions to developers, and they’re not included in Apple’s public compilers. We do know, however, that Apple internally does have compilers available for it, and libraries such as the Acclerate.framework seem to be able to take advantage of AMX. ..."
The Apple A13 SoC: Lightning & Thunder - The Apple iPhone 11, 11 Pro & 11 Pro Max Review: Performance, Battery, & Camera Elevated (anandtech.com)



A list of opcode names
Real World Technologies - Forums - Thread: Apple Math/Matrix (?) Extensions


I think Apple put the AMX extension in the CPU because of its synergy with "AI/ML", "computational photography", and other, more general DSP workloads. Some of the matrix "horsepower" that would typically have gone to the "Neural cores" is still in the CPU because they both work out of a common pool of memory. Keeping it on the CPU cuts down on the instruction "hand-off" for a subset of the work.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
I am not talking about AMX, I am talking about the matrix multiplication units on the A14 GPU.

More details here: https://developer.apple.com/videos/play/tech-talks/10858/

The CPU's AMX function units share the same memory and system cache with the GPU. Unless the GPU's local cache has completely swallowed every last byte of the matrices the computation is about to be applied to, the CPU's function units aren't any farther away than a set of redundant matrix function units in the GPU would be.

Apple has the transistor budget to throw in "redundant" units that are more local to the L2 caches of the CPU, GPU (and Neural) cores, and to tune them to different matrix workloads (in size as well as locality).
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
The CPU's AMX function units share the same memory and system cache with the GPU. Unless the GPU's local cache has completely swallowed every last byte of the matrices the computation is about to be applied to, the CPU's function units aren't any farther away than a set of redundant matrix function units in the GPU would be.

Not 100% sure I understand what you wrote :) Anyway, I think it's quite clear that these new Metal intrinsics expose the GPU hardware — they execute with the GPU vector width as part of the GPU programs and they are local to GPU threadgroup memory, which is a level deeper than the SLC cache.

Apple has the transistor budget to throw in "redundant" units that are more local to the L2 caches of the CPU, GPU (and Neural) cores, and to tune them to different matrix workloads (in size as well as locality).

It's quite interesting, isn't it? There must be a decent benefit in replicating so much specialized functionality across the processing units. Maybe that kind of logic is also cheaper than one would conventionally assume.
 

The Mercurian

macrumors 68020
Mar 17, 2012
2,159
2,442
@leman - random question. With these fast and slow cores, how many cores do you see in Activity Monitor - does it distinguish between fast and slow cores? (I assume it's not hyperthreaded - for example, on a Core i9 Mac you see 16 logical cores for 8 actual cores.)

I'm just wondering: if I run something in parallel in R, let's say, and I choose the max cores, how does it assign work to a fast or slow core? Randomly?
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
@leman - random question. With these fast and slow cores, how many cores do you see in Activity Monitor - does it distinguish between fast and slow cores? (I assume it's not hyperthreaded - for example, on a Core i9 Mac you see 16 logical cores for 8 actual cores.)

I'm just wondering: if I run something in parallel in R, let's say, and I choose the max cores, how does it assign work to a fast or slow core? Randomly?

It shows 8 CPU cores and I don't see any way to distinguish between fast and slow ones. Apple CPUs don't have hyper-threading (probably because they don't need it — quite paradoxically).

I don't think that the core scheduling algorithm is officially documented. I would speculate that the OS assigns threads to cores based on priority and availability. E.g. it will try to schedule your normal- and high-priority threads on the fast cores, and only fall back to the slow cores if the fast cores are already fully occupied. If a fast core becomes available, it will move execution from a slow core to the fast one. Also, if I understand it correctly, the slow cores lack some functionality of the fast cores, so if certain instructions are encountered, the thread will be automatically moved to a fast core as well. A low-priority thread, on the other hand, might always stay on a slow core.

So basically, if you start a bunch of threads some of them will run on the fast cores and some of them will run on the slow cores, and they will switch around as necessary. I am sure that the system is optimized to minimize the overall expected runtime.
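For the R side of your question: if you just ask for all the cores, the OS handles the placement. Something like this (a minimal sketch; the workload is a placeholder and mclapply is the usual fork-based route on macOS):

library(parallel)

n_cores <- detectCores()   # reports 8 on the M1 (4 performance + 4 efficiency cores)

# One forked worker per task; macOS decides which land on fast vs. slow cores
# and can migrate them as cores free up
res <- mclapply(1:32, function(i) sum(rnorm(1e6)), mc.cores = n_cores)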
 

ahurst

macrumors 6502
Oct 12, 2021
410
815
Resurrecting an old thread here, but I'm curious if any resident data scientists have their hands on an M1 Pro or M1 Max chip yet and have any benchmarks or observations to share!

I mostly work in R and my heavy lifting comes from Bayesian models in RStan, though I also do some occasional heavy signal processing for EEG data using MNE in Python. Given the Anandtech report saying that the M1 Max can't make full use of its 400 GB/s memory bandwidth under normal conditions I'm thinking the Pro is the chip to get, though the Max also has double the cache so I'm curious what kind of difference that might make.

Also interested in hearing updates from people with the regular M1, since there's been an official native R version for a while now and the dust has settled a bit with the experimental Fortran compilers (last I checked, the guy who did the experimental gfortran port has gotten official funding of some kind to get his work upstreamed so I'm eager to see that happen).
 

jeanlain

macrumors 68020
Mar 14, 2009
2,460
954
I'm also interested in this. I use R all the time on huge tables containing mostly integer data (for genomics).
I'm looking forward to the new iMacs. I'd rather have 64 GB of RAM, but I don't need much GPU power...
 

Andropov

macrumors 6502a
May 3, 2012
746
990
Spain
I've been running some benchmarks on my M1 Pro & i9 16" MacBooks (more number-crunching than data science, really). Basically I ran every simulation script I could find on my hard drive, with no attempt to optimise or profile anything, to see how good the M1 Pro would be at running things not specifically tailored for it. Here are the results (seconds, lower is better):

[Screenshots of the benchmark timing charts, M1 Pro vs. i9]

In case you're wondering: the first two are a (stochastic) ODE solver (Runge-Kutta 4) on a very simple potential field, the first compiled using Numba and the second in pure Python. The third is a SIS epidemic model simulation running on a mean field for a varying set of parameters, and the fourth is the same but on a Barabási-Albert network model instead of a mean field. The fifth is just a test of how fast NumPy can fill a 50000x50000 matrix with random numbers.

The longer ones are a polymer folding simulation compiled with Numba, and the SIS network epidemic model from before with better parameter resolution.

My main takeaways from this:
  • Installing some scientific packages is still a pain in Apple Silicon (but nothing you can't solve in a day).
  • Numba jit-compiled code consistently offers more improvement in the Intel setup than on the M1 Pro. I'm not sure why, but it might be related to the heuristics of how cores are handled. Maybe it's freaking out because the cores are not homogeneous. This is kind of a niche thing anyway, as most people use Cython.
  • Cython compiled code on the M1 Pro beats the i9, but it's still a lower relative speedup compared to pure Python. I won't read too much into this as I just had one Cython source to try.
  • NumPy seems to be a bit less than twice as fast on the M1 Pro at everything I tried (I didn't bother to add more graphs, so you'll have to take my word for it), which is very nice.
  • The largest simulation I ran took 1h 34 minutes on the M1 Pro, and 3h 11 minutes on the i9. In fact, I had never run the simulation at such a high resolution, since this script was written when I had a 2015 15" MBP. The biggest resolution I could run back then was 200x200, which the M1 Pro does in less than 15 minutes (it used to take hours).
  • The M1 Pro MacBook itself feels like it's not even trying. No fans spin up at all. It just feels moderately warm to the touch. The Intel MacBook had the fans spinning at near max speed. This would be the difference between being able to run a simulation while sleeping in the same room and having to run it during the day.
It's mostly anecdotal evidence, but it's still cool to see how fast the M1 Pro is at tasks that took hours just a couple years ago. Maybe the setup can be made to run faster with some compiler flags/dependencies, but most people will leave it in the standard config.

Also, the Numba problems are weird, but I didn't have more time to look into it.
 

kk16kk16

macrumors newbie
Apr 18, 2022
3
0
Resurrecting this old thread, hoping I could hear from someone doing similar stuff... I haven't been able to use all of the 10 cores on my M1 pro MBP (14'') simultaneously. In particular, when I'm using Matlab (parpool) and Stan (reduce-sum on CmdStan), even though I'm telling them to use all of the cores, they only use four cores (interestingly, always core 1, 2, 6, and 10). What I'm doing is highly parallelizable and not super RAM intensive. Does this ring a bell to anyone?
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
Resurrecting this old thread, hoping I could hear from someone doing similar stuff... I haven't been able to use all of the 10 cores on my M1 pro MBP (14'') simultaneously. In particular, when I'm using Matlab (parpool) and Stan (reduce-sum on CmdStan), even though I'm telling them to use all of the cores, they only use four cores (interestingly, always core 1, 2, 6, and 10). What I'm doing is highly parallelizable and not super RAM intensive. Does this ring a bell to anyone?

If they are using the Accelerate framework, this might be AMX (there is one unit per CPU cluster). How is the performance?
 

The Mercurian

macrumors 68020
Mar 17, 2012
2,159
2,442
Resurrecting this old thread, hoping I could hear from someone doing similar stuff... I haven't been able to use all of the 10 cores on my M1 pro MBP (14'') simultaneously. In particular, when I'm using Matlab (parpool) and Stan (reduce-sum on CmdStan), even though I'm telling them to use all of the cores, they only use four cores (interestingly, always core 1, 2, 6, and 10). What I'm doing is highly parallelizable and not super RAM intensive. Does this ring a bell to anyone?
How are you telling CmdStan to use the cores? I'm using CmdStan as a backend for brms and it works with multiple cores - but you gotta tell it a few times.

First I set options for brms after I load the package into R:
# Set brms options for multithreading
options(mc.cores = 8, brms.backend = "cmdstanr")

Then when I call brms I also need to specify threading:
fit_model <- brm(formula, data, chains = 4, threads = threading(2))

If I don't add the threading argument it will only use 4 cores; with the threading it uses 8 (4 chains × 2 threads per chain). I don't work with CmdStan directly, but maybe you also need to specify threading as well as the number of cores?
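If you'd rather stay close to CmdStan but script it from R, cmdstanr is supposed to expose the same knobs. Something along these lines (an untested sketch on my part; model.stan and data_list are placeholders, and the options are the ones the cmdstanr docs describe):

library(cmdstanr)

# Compile with threading support so reduce_sum can actually fan out
mod <- cmdstan_model("model.stan", cpp_options = list(stan_threads = TRUE))

# 4 chains in parallel x 2 reduce_sum threads per chain = up to 8 cores
fit <- mod$sample(
  data = data_list,
  chains = 4,
  parallel_chains = 4,
  threads_per_chain = 2
)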
 

kk16kk16

macrumors newbie
Apr 18, 2022
3
0
@leman: That's a plausible theory; I hadn't thought about it! Given the nature of the work I wouldn't be surprised if that's the case. I don't know about the apparent efficiency-core usage, though.

I must say I'm underwhelmed by the performance, given the hype surrounding the M1 series. When I'm using a single core, my MBP is comparable to my 2019 27'' iMac (i5-8500, 6 cores). But since I can easily use all six cores on the iMac, it outperforms the MBP, with the cost being the fan blowing all the time. I'm thankful for your interesting theory, but I'd be pretty sad if I couldn't force the M1 Pro to use the CPU cores instead of AMX.

@The Mercurian: that's helpful to know, thanks! I'm calling CmdStan directly, and it allows specifying the number of threads (the num_threads option), but that doesn't change the number of cores used on my MBP. I'm using reduce-sum to make my program multithreaded (I can use 6 cores running 4 chains on my aforementioned iMac). One of the reasons I'm using CmdStan directly is that I hate R, but I'll look into how brms controls the number of cores.
 