
The Mercurian

macrumors 68020
Mar 17, 2012
2,159
2,442
Sounds awesome. I'm toying with getting a Mac mini just to have a foot in the door with this architecture, but I need to sell a 2018 laptop first to justify it (and it's not easy to sell 2018 Macs right now).
 

jeanlain

macrumors 68020
Mar 14, 2009
2,460
954
With a native version of R, the gap goes up to 30-40%, which is absolutely insane for this thin and light machine.
So there's already a native version somewhere?
Have you tried using the Accelerate framework with R?
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
It looks like they're using the GPU rather than the neural engine.

I don't really care what they are using as long as it's fast :) It's possible that the Neural Engine is only made to accelerate specific use cases and is not flexible enough to implement general-purpose neural networks.

A14 did add dedicated matrix multiplication hardware to the GPU, so why not use it?

So there's already a native version somewhere?
Have you tried using the Accelerate framework with R?

You have to build it yourself; there are instructions in the R admin manual. I will run the R benchmark later with and without the Accelerate framework to see how it impacts the performance of numerical code.
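In the meantime, a quick way to sanity-check a build from inside R itself (just a sketch, nothing here is specific to the M1):

# Confirm the session is running natively on ARM64 rather than under Rosetta
Sys.info()[["machine"]]   # "arm64" for a native build, "x86_64" under Rosetta

# sessionInfo() also reports which BLAS/LAPACK libraries the build is linked
# against, so you can see whether Accelerate (vecLib) is actually in use
sessionInfo()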
 

jerryk

macrumors 604
Nov 3, 2011
7,421
4,208
SF Bay Area
If you use TensorFlow, Apple has added a driver to let you access the GPUs on the M1, and access to those GPUs has been added to TensorFlow. The performance numbers look good. I assume it will not be too long before someone adds support for PyTorch.

So we can more rapidly train models on an M1 system and, hopefully, the system will not sound like a 747 taking off for hours like my 16" MBP does.

Here is the blog post with some performance numbers:

 

jeanlain

macrumors 68020
Mar 14, 2009
2,460
954
Many/most of the scientific programs we use are compiled from source. We may have the compilers, but I wonder how much the source code is "ARM compatible". I know endianness won't be a problem, but can we take any C code that works on X86 and just compile it to some ARM executable?
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
Many/most of the scientific programs we use are compiled from source. We may have the compilers, but I wonder how much the source code is "ARM compatible". I know endianness won't be a problem, but can we take any C code that works on X86 and just compile it to some ARM executable?

Honestly, it works surprisingly well in practice. The only R packages with C source code that I couldn't compile are ones that have some really messy dependencies (e.g. gdal). But most of the stuff worked out of the box. Basically, if it works on x86-64, doesn't use inline assembler or CPU-specific intrinsics (very few scientific packages do, but even libraries like Eigen compile perfectly fine) and the build configuration is competent, it is almost guaranteed to work on ARM64.

And even the cases where the software won't build usually boil down to an invalid build configuration, e.g. the build system having some hard-coded settings that don't make sense on an ARM platform.
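For what it's worth, a from-source install is usually all it takes (a quick sketch; RcppEigen is just an arbitrary example of a package with heavy C++/Eigen code):

# Build the package's C++ sources natively for ARM64 instead of grabbing a binary
install.packages("RcppEigen", type = "source")
library(RcppEigen)   # if it compiles and loads, you're done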
 

The Mercurian

macrumors 68020
Mar 17, 2012
2,159
2,442
Honestly, it works surprisingly well in practice. The only R packages with C source code that I couldn't compile are ones that have some really messy dependencies (e.g. gdal). But most of the stuff worked out of the box. Basically, if it works on x86-64, doesn't use inline assembler or CPU-specific intrinsics (very few scientific packages do, but even libraries like Eigen compile perfectly fine) and the build configuration is competent, it is almost guaranteed to work on ARM64.

And even the cases where the software won't build usually boil down to an invalid build configuration, e.g. the build system having some hard-coded settings that don't make sense on an ARM platform.

Gdal - so some of the spatial stats packages then?

Have you tried R-INLA on it?
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
So, I have run R benchmark 2.5 using different scenarios and settings. It's a really old tool and probably not too representative of real-world workflows (unless you do a lot of linear algebra), but it's still an interesting thing to look at. Keep in mind that these are single-threaded tests (but most of R is single-threaded anyway)!

Detailed results are here: https://pastebin.com/ASywDd7h

The most interesting results:

  • The M1 running under Rosetta is faster on average than my i9 16", in some tasks up to 100% faster (matrix operations). The only exception is the "Escoufier's method" benchmark (whatever it does), which is about 60 times slower under Rosetta, probably an unfortunate interaction between R's suboptimal interpreter and Rosetta's translation
  • The M1 running native code is much faster than both the Intel machine and the Rosetta-translated version. We are talking about a 2x performance increase in most tests. The only regression is the matrix cross-product, which is about 20% slower on the M1 machine (a rough sketch of that test is below). My guess is that the Intel code is vectorized but the ARM build is not, but one should check the sources
  • Linking against the Apple Accelerate framework does not seem to make a big difference here. If anything, the result is slightly better without it
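For context, the cross-product test boils down to something like the following (a rough sketch from memory, not the exact benchmark code, and the matrix size is approximate):

# Roughly what the benchmark's cross-product test does
set.seed(1)
a <- matrix(rnorm(2800 * 2800), 2800, 2800)
system.time(crossprod(a))   # t(a) %*% a, normally dispatched to the BLAS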

Gdal - so some of the spatial stats packages then?

Yeah, geospatial stuff (I use it for maps)

Have you tried R-INLA on it?

Just tried installing it, but it has this ridiculous list of dependencies, including X11 and rgl, so it won't build.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
A14 did add dedicated matrix multiplication hardware to the GPU, so why not use it?

They came in with the A13; Apple iterated to higher performance with the A14. (I suspect the A14's gains come more from the extension being less memory-bandwidth bottlenecked than from a much larger transistor budget being thrown at it.)

A13 overview

"...
A big change to the CPU cores which we don’t have very much information on is Apple’s integration of “machine learning accelerators” into the microarchitecture. At heart these seem to be matrix-multiply units with DSP-like instructions, and Apple puts their performance at up to 1 Tera Operations (TOPs) of throughput, claiming an up-to 6x increase over the regular vector pipelines. This AMX instruction set is seemingly a superset of the ARM ISA that is running on the CPU cores.

There’s been a lot of confusion as to what this means, ... but one thing that is clear is that Apple isn’t publicly exposing these new instructions to developers, and they’re not included in Apple’s public compilers. We do know, however, that Apple internally does have compilers available for it, and libraries such as the Acclerate.framework seem to be able to take advantage of AMX. ..."
The Apple A13 SoC: Lightning & Thunder - The Apple iPhone 11, 11 Pro & 11 Pro Max Review: Performance, Battery, & Camera Elevated (anandtech.com)



A list of opcode names
Real World Technologies - Forums - Thread: Apple Math/Matrix (?) Extensions


I think Apple put the AMX extension in the CPU because of its synergy with "AI/ML", "computational photography", and other, more general DSP workloads. Some of the matrix "horsepower" that would typically have gone to the "Neural cores" is still in the CPU because they both work out of a common pool of memory. Keeping it on the CPU cuts down on the instruction "hand-off" for a subset of the work.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
I am not talking about AMX, I am talking about the matrix multiplication units on the A14 GPU.

More details here: https://developer.apple.com/videos/play/tech-talks/10858/

The CPU's AMX function units share the same memory and system cache with the GPU. Unless the GPU's local cache has completely swallowed every last byte of the matrices the computation is about to be applied to, the CPU's function units aren't any farther away than a set of redundant matrix function units in the GPU would be.

Apple has the transistor budget to throw in "redundant" units that are more local to the L2 caches of the CPU, GPU (and Neural) cores, and to tune them to different matrix workloads (in size as well as locality).
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
The CPU's AMX function units share the same memory and system cache with the GPU. Unless the GPU's local cache has completely swallowed every last byte of the matrices the computation is about to be applied to, the CPU's function units aren't any farther away than a set of redundant matrix function units in the GPU would be.

Not 100% sure I understand what you wrote :) Anyway, I think it's quite clear that these new Metal intrinsics expose the GPU hardware — they execute with the GPU vector width as part of the GPU programs and they are local to GPU threadgroup memory, which is a level deeper than the SLC cache.

Apple has the transistor budget to throw in "redundant" units that are more local to the L2 caches of the CPU, GPU (and Neural) cores, and to tune them to different matrix workloads (in size as well as locality).

It's quite interesting, isn't it? There must be a decent benefit in replicating so much specialized functionality across the processing units. Maybe that kind of logic is also cheaper than one would conventionally assume.
 

The Mercurian

macrumors 68020
Mar 17, 2012
2,159
2,442
@leman - random question. With these fast and slow cores, how many cores do you see in Activity Monitor - does it distinguish between fast and slow cores? (I assume it's not hyperthreaded - for example, on a Core i9 Mac you see 16 logical cores for 8 actual cores.)

I'm just wondering: if I run something in parallel in R, let's say, and I choose the max cores, how does it assign work to a fast or slow core? Randomly?
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
@leman - random question. With these fast and slow cores, how many cores do you see in Activity Monitor - does it distinguish between fast and slow cores? (I assume it's not hyperthreaded - for example, on a Core i9 Mac you see 16 logical cores for 8 actual cores.)

I'm just wondering: if I run something in parallel in R, let's say, and I choose the max cores, how does it assign work to a fast or slow core? Randomly?

It shows 8 CPU cores and I don't see any way to distinguish between fast and slow ones. Apple CPUs don't have hyper-threading (probably because they don't need it — quite paradoxically).

I don't think that the core scheduling algorithm is officially documented. I would speculate that the OS assigns threads to cores based on priority and availability. E.g. it will try to schedule your normal- and high-priority threads on the fast cores, and only fall back to the slow cores if the fast cores are already fully occupied. If a fast core becomes available, it will move execution from a slow core to the fast one. Also, if I understand it correctly, the slow cores lack some functionality of the fast cores, so if certain instructions are encountered, the thread will be automatically moved to a fast core as well. A low-priority thread, on the other hand, might always stay on a slow core.

So basically, if you start a bunch of threads some of them will run on the fast cores and some of them will run on the slow cores, and they will switch around as necessary. I am sure that the system is optimized to minimize the overall expected runtime.
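For the R side of your question: if you just ask for all the cores, the OS handles the placement. Something like this (a minimal sketch; the workload is a placeholder and mclapply is the usual fork-based route on macOS):

library(parallel)

n_cores <- detectCores()   # reports 8 on the M1 (4 performance + 4 efficiency cores)

# One forked worker per task; macOS decides which land on fast vs. slow cores
# and can migrate them as cores free up
res <- mclapply(1:32, function(i) sum(rnorm(1e6)), mc.cores = n_cores)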
 

ahurst

macrumors 6502
Oct 12, 2021
410
815
Resurrecting an old thread here, but I'm curious if any resident data scientists have their hands on an M1 Pro or M1 Max chip yet and have any benchmarks or observations to share!

I mostly work in R and my heavy lifting comes from Bayesian models in RStan, though I also do some occasional heavy signal processing for EEG data using MNE in Python. Given the Anandtech report saying that the M1 Max can't make full use of its 400 GB/s memory bandwidth under normal conditions I'm thinking the Pro is the chip to get, though the Max also has double the cache so I'm curious what kind of difference that might make.

Also interested in hearing updates from people with the regular M1, since there's been an official native R version for a while now and the dust has settled a bit with the experimental Fortran compilers (last I checked, the guy who did the experimental gfortran port has gotten official funding of some kind to get his work upstreamed so I'm eager to see that happen).
 

jeanlain

macrumors 68020
Mar 14, 2009
2,460
954
I'm also interested in this. I use R all the time on huge tables containing mostly integer data (for genomics).
I'm looking forward to the new iMacs. I'd rather have 64 GB of RAM, but I don't need much GPU power...
 

Andropov

macrumors 6502a
May 3, 2012
746
990
Spain
I've been running some benchmarks on my M1 Pro & i9 16" MacBooks (more number-crunching than data science, really). Basically I ran every simulation script I could find on my hard drive, with no attempt to optimise or profile anything, to see how good the M1 Pro would be at running things not specifically tailored for it. Here are the results (seconds, lower is better):

[Screenshots of the benchmark timing charts, M1 Pro vs. i9]

In case you're wondering: the first two are a (stochastic) ODE solver (Runge-Kutta 4) on a very simple potential field, the first compiled using Numba and the second in pure Python. The third is a SIS epidemic model simulation running on a mean field for a varying set of parameters, and the fourth is the same but on a Barabási-Albert network model instead of a mean field. The fifth is just a test of how fast NumPy can fill a 50000x50000 matrix with random numbers.

The longer ones are a polymer folding simulation compiled with Numba, and the SIS network epidemic model from before with better parameter resolution.

My main takeaways from this:
  • Installing some scientific packages is still a pain in Apple Silicon (but nothing you can't solve in a day).
  • Numba jit-compiled code consistently offers more improvement in the Intel setup than on the M1 Pro. I'm not sure why, but it might be related to the heuristics of how cores are handled. Maybe it's freaking out because the cores are not homogeneous. This is kind of a niche thing anyway, as most people use Cython.
  • Cython compiled code on the M1 Pro beats the i9, but it's still a lower relative speedup compared to pure Python. I won't read too much into this as I just had one Cython source to try.
  • NumPy seems to be a bit less than twice as fast on the M1 Pro at everything I tried (I didn't bother to add more graphs, so you'll have to take my word for it), which is very nice.
  • The largest simulation I ran took 1h 34 minutes on the M1 Pro, and 3h 11 minutes on the i9. In fact, I had never run the simulation at such a high resolution, since this script was written when I had a 2015 15" MBP. The biggest resolution I could run back then was 200x200, which the M1 Pro does in less than 15 minutes (it used to take hours).
  • The M1 Pro MacBook itself feels like it's not even trying. No fans spin up at all. It just feels moderately warm to the touch. The Intel MacBook had the fans spinning at near max speed. This would be the difference between being able to run a simulation while sleeping in the same room and having to run it during the day.
It's mostly anecdotal evidence, but it's still cool to see how fast the M1 Pro is at tasks that took hours just a couple years ago. Maybe the setup can be made to run faster with some compiler flags/dependencies, but most people will leave it in the standard config.

Also, the Numba problems are weird, but I didn't have more time to look into it.
 

kk16kk16

macrumors newbie
Apr 18, 2022
3
0
Resurrecting this old thread, hoping I could hear from someone doing similar stuff... I haven't been able to use all of the 10 cores on my M1 pro MBP (14'') simultaneously. In particular, when I'm using Matlab (parpool) and Stan (reduce-sum on CmdStan), even though I'm telling them to use all of the cores, they only use four cores (interestingly, always core 1, 2, 6, and 10). What I'm doing is highly parallelizable and not super RAM intensive. Does this ring a bell to anyone?
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
Resurrecting this old thread, hoping I could hear from someone doing similar stuff... I haven't been able to use all of the 10 cores on my M1 pro MBP (14'') simultaneously. In particular, when I'm using Matlab (parpool) and Stan (reduce-sum on CmdStan), even though I'm telling them to use all of the cores, they only use four cores (interestingly, always core 1, 2, 6, and 10). What I'm doing is highly parallelizable and not super RAM intensive. Does this ring a bell to anyone?

If they are using the Accelerate framework, this might be AMX (there is one unit per CPU cluster). How is the performance?
 

The Mercurian

macrumors 68020
Mar 17, 2012
2,159
2,442
Resurrecting this old thread, hoping I could hear from someone doing similar stuff... I haven't been able to use all of the 10 cores on my M1 pro MBP (14'') simultaneously. In particular, when I'm using Matlab (parpool) and Stan (reduce-sum on CmdStan), even though I'm telling them to use all of the cores, they only use four cores (interestingly, always core 1, 2, 6, and 10). What I'm doing is highly parallelizable and not super RAM intensive. Does this ring a bell to anyone?
How are you telling CmdStan to use the cores? I'm using CmdStan as a backend for brms and it works with multiple cores - but you gotta tell it a few times.

First I set options for brms after I load the package into R:
# Set brms options for multithreading
options(mc.cores = 8, brms.backend = "cmdstanr")

Then when I call brms I also need to specify threading:
fit_model <- brm(formula, data, chains = 4, threads = threading(2))

If I don't add the threading argument it will only use 4 cores; with the threading it uses 8 (4 chains × 2 threads per chain). I don't work with CmdStan directly, but maybe you also need to specify threading as well as the number of cores?
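If you'd rather stay close to CmdStan but script it from R, cmdstanr is supposed to expose the same knobs. Something along these lines (an untested sketch on my part; model.stan and data_list are placeholders, and the options are the ones the cmdstanr docs describe):

library(cmdstanr)

# Compile with threading support so reduce_sum can actually fan out
mod <- cmdstan_model("model.stan", cpp_options = list(stan_threads = TRUE))

# 4 chains in parallel x 2 reduce_sum threads per chain = up to 8 cores
fit <- mod$sample(
  data = data_list,
  chains = 4,
  parallel_chains = 4,
  threads_per_chain = 2
)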
 

kk16kk16

macrumors newbie
Apr 18, 2022
3
0
@leman: That's a plausible theory; I hadn't thought about it! Given the nature of the work I wouldn't be surprised if that's the case. I don't know about the apparent efficiency-core usage, though.

I must say I'm underwhelmed by the performance, given the hype surrounding the M1 series. When I'm using a single core, my MBP is comparable to my 2019 27'' iMac (i5-8500, 6 cores). But since I can easily use all six cores on the iMac, it outperforms the MBP, with the cost being the fan blowing all the time. I'm thankful for your interesting theory, but I'd be pretty sad if I couldn't force the M1 Pro to use the CPU cores instead of AMX.

@The Mercurian: that's helpful to know, thanks! I'm calling CmdStan directly, and it allows specifying the number of threads (the num_threads option), but that doesn't change the number of cores used on my MBP. I'm using reduce-sum to make my program multithreaded (I can use 6 cores running 4 chains on my aforementioned iMac). One of the reasons I'm using CmdStan directly is that I hate R, but I'll look into how brms controls the number of cores.
 