Sounds awesome. I'm toying with getting a Mac mini just to have a foot in the door with this architecture, but I need to sell a 2018 laptop first to justify it (it is now not easy to sell 2018 Macs 🤷‍♂️)
> It looks like they're using the GPU rather than the neural engine.

Yeah, come join the party! With Apple submitting patches to TensorFlow (and I hope that PyTorch and friends will soon follow suit), the Mac will become a powerhouse for this kind of stuff. Just check this out:
> So there's already a native version somewhere?

With the native version of R, the gap goes up to 30-40%, which is absolutely insane for such a thin and light machine.
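If you want to see the gap for yourself, here's a rough-and-ready sketch (nothing rigorous - just a dense matrix multiply, which is exactly the kind of workload where the native/Accelerate difference shows up). Run the same snippet under the Rosetta build and the native build of R and compare elapsed times:

```r
# Quick timing sketch: multiply two 2000x2000 matrices.
# The absolute number doesn't matter; the ratio between the
# Rosetta build and the native build of R is what's interesting.
n <- 2000
a <- matrix(rnorm(n * n), n, n)
elapsed <- system.time(b <- a %*% a)[["elapsed"]]
elapsed
```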
It looks like they're using the GPU rather than the neural engine.
So there's already a native version somewhere?
Have you tried using the Accelerate framework with R?
Many/most of the scientific programs we use are compiled from source. We may have the compilers, but I wonder how "ARM compatible" the source code is. I know endianness won't be a problem, but can we take any C code that works on x86 and just compile it to an ARM executable?
Honestly, it works surprisingly well in practice. The only R packages with C source code that I couldn't compile are ones that have some really messy dependencies (e.g. GDAL). But most of the stuff worked out of the box. Basically, if it works on x86-64, doesn't use inline assembly or CPU-specific intrinsics (very few scientific packages do, and even libraries like Eigen compile perfectly fine), and the build configuration is competent, it is almost guaranteed to work on ARM64.
And even those cases where the software won't build usually boil down to an invalid build configuration, e.g. the build system having some hard-coded settings that don't make sense on an ARM platform, or something similar.
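For what it's worth, you can check from inside R which architecture your build actually targets - a native build reports arm64/aarch64, while the same R session run under Rosetta 2 still reports x86_64:

```r
# Architecture the running R binary was compiled for.
# Native Apple Silicon build: "aarch64" / "arm64";
# Rosetta-translated build: "x86_64".
R.version$arch
Sys.info()[["machine"]]
```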
GDAL - so some of the spatial stats packages, then?
Have you tried R-INLA on it?
A14 did add dedicated matrix multiplication hardware to the GPU, so why not use it?
They came in with the A13; Apple iterated to higher performance in the A14. (I suspect the A14 gain is more about the extension being less memory-bandwidth bottlenecked than about a much larger transistor budget getting thrown at it.)
I am not talking about AMX, I am talking about the matrix multiplication units on the A14 GPU.
More details here: https://developer.apple.com/videos/play/tech-talks/10858/
The CPU's AMX function units share the same memory and system cache with the GPU. Unless the GPU's local cache has completely swallowed every last drop of the matrices the computation is about to be applied to, the CPU function units aren't any farther away than a set of redundant matrix function units in the GPU would be.
Apple has the transistor budget to throw at "redundant" units that sit more local to the L2 caches of the CPU, GPU (and Neural) cores, and to tune them to different matrix workloads (in size as well as locality).
@leman - random question. With these fast and slow cores, how many cores do you see in Activity Monitor - does it distinguish between fast and slow cores? (I assume it's not hyperthreaded - for example, on a Core i9 Mac you see 16 logical cores for 8 physical cores.)
I'm just wondering: if I run something in parallel in R, let's say, and I choose the max cores, how does it assign work to a fast or slow core? Randomly?
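For what it's worth, R itself has no say in the placement: `detectCores()` just counts every core (performance and efficiency alike), and the macOS scheduler decides which physical core each forked worker actually lands on, migrating them as it sees fit. A minimal sketch:

```r
library(parallel)

n_cores <- detectCores()  # counts P- and E-cores together (8 on the M1)

# mclapply() forks one worker per task; the OS scheduler, not R,
# chooses which physical core each worker runs on.
res <- mclapply(seq_len(n_cores),
                function(i) sum(rnorm(1e5)),
                mc.cores = n_cores)
length(res)  # one result per task
```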
Resurrecting this old thread, hoping I could hear from someone doing similar stuff... I haven't been able to use all 10 cores on my M1 Pro MBP (14") simultaneously. In particular, when I'm using Matlab (parpool) and Stan (reduce_sum in CmdStan), even though I'm telling them to use all of the cores, they only use four (interestingly, always cores 1, 2, 6, and 10). What I'm doing is highly parallelizable and not super RAM-intensive. Does this ring a bell to anyone?
How are you telling CmdStan to use the cores? I'm using CmdStan as a backend for brms and it works with multiple cores - but you gotta tell it a few times:
# Set brms options for multithreading: 4 chains x 2 threads each = 8 cores
options(mc.cores = 8, brms.backend = "cmdstanr")
fit_model <- brm(formula, data, chains = 4, threads = threading(2))
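If you're calling CmdStan directly through cmdstanr rather than through brms, note that within-chain threading also has to be switched on when the model is compiled, or `reduce_sum` runs serially no matter what you pass to `sample()`. A configuration sketch (the model path and data list are placeholders, not from the thread):

```r
library(cmdstanr)

# stan_threads must be enabled at compile time for reduce_sum to
# actually run in parallel ("model.stan" is a placeholder path).
mod <- cmdstan_model("model.stan",
                     cpp_options = list(stan_threads = TRUE))

# 4 chains x 2 threads per chain = 8 cores in use.
fit <- mod$sample(data = my_data,  # placeholder data list
                  chains = 4,
                  parallel_chains = 4,
                  threads_per_chain = 2)
```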