
leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
@leman: That's a plausible theory, I hadn't thought about it! Given the nature of the work, I wouldn't be surprised if that's the case. I don't know about the apparent efficiency core usage though.

I must say I'm underwhelmed by the performance, given the hype surrounding the M1 series. When I'm using a single core, my MBP is comparable to my 2019 27'' iMac (i5-8500, 6 cores). But since I can easily use all six cores on the iMac, it outperforms the MBP, at the cost of the fan blowing all the time. I'm thankful for your interesting theory, but I'd be pretty sad if I couldn't force the M1 Pro to use the CPU cores instead of AMX.

AMX is much faster than using the FP pipeline, so if you say that the performance is underwhelming, that’s not it. Could be an interfacing problem though. Did you try asking on the CmdStan mailing list? The people there are usually very helpful.
 

ahurst

macrumors 6502
Oct 12, 2021
410
815
I must say I'm underwhelmed by the performance, given the hype surrounding the M1 series. When I'm using a single core, my MBP is comparable to my 2019 27'' iMac (i5-8500, 6 cores). But since I can easily use all six cores on the iMac, it outperforms the MBP, at the cost of the fan blowing all the time. I'm thankful for your interesting theory, but I'd be pretty sad if I couldn't force the M1 Pro to use the CPU cores instead of AMX.
This sounds like something's wrong with your compiler settings (or how CmdStan handles things relative to RStan): I wrote a brms benchmark script to compare performance on a basic linear mixed-effects model between computers, and my M1 Pro a) used 6 full cores when I asked it to (the fans came on and everything!), and b) outperformed my Intel iMac's i7-4771 by a clean factor of 2. Per Geekbench 5, your i5-8500 iMac only has ~10% faster single-core performance than mine, so the performance boost of the M1 should be similar for you.

In addition to the much faster raw single-core performance and RAM bandwidth, the M1 Pro series of chips also has absurd amounts of cache, which according to the Stan forums should be a major benefit to performance (people have reported old Xeon chips outperforming newer and faster i7s thanks to hefty amounts of cache).

Have you tried running your code in PyStan or RStan and seeing if you have the same problem? I can also clean up the benchmark script I wrote and post it here (since it uses open-access data) so we can compare numbers.
 

The Mercurian

macrumors 68020
Mar 17, 2012
2,159
2,442
@The Mercurian: that's helpful to know, thanks! I'm calling CmdStan directly, and it lets me specify the number of threads (num_threads option), but I cannot use it to change the number of cores used on my MBP. I'm using reduce_sum to make my program multithreaded (I can use 6 cores running 4 chains on my aforementioned iMac). One of the reasons I'm using CmdStan directly is that I hate R, but I'll look into how brms controls the number of cores.
Afraid I could not get my head around coding reduce_sum in Stan. Another thought - did you download the arm version of the relevant toolchains? https://stackoverflow.com/a/70664229/2498193
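
One quick way to check whether a toolchain-adjacent process is running natively or under Rosetta 2 is sketched below. This is only a rough sketch: the helper names `rosetta_translated` and `describe_arch` are made up for illustration, and it assumes macOS exposes the `sysctl.proc_translated` key (it is absent on Intel Macs and on other systems, in which case the check just reports "unknown").

```python
import platform
import subprocess
from typing import Optional


def rosetta_translated() -> Optional[bool]:
    """True if this process runs under Rosetta 2, False if not, None if unknown.

    Assumption: macOS on Apple silicon reports "1" (translated) or "0" (native)
    for the sysctl key below; anywhere else the lookup fails and we return None.
    """
    try:
        out = subprocess.run(
            ["sysctl", "-n", "sysctl.proc_translated"],
            capture_output=True,
            text=True,
            check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None
    return {"1": True, "0": False}.get(out)


def describe_arch(machine: str, translated: Optional[bool]) -> str:
    """Turn the machine string and translation flag into a readable label."""
    if machine == "arm64":
        # A process that reports arm64 is not being translated.
        return "native arm64"
    if machine == "x86_64" and translated is True:
        return "x86_64 under Rosetta 2"
    return f"{machine} (not translated, or status unknown)"


if __name__ == "__main__":
    print(describe_arch(platform.machine(), rosetta_translated()))
```

If the interpreter (or the R session, checked the equivalent way) reports x86_64 under Rosetta 2, the compilers and libraries it picks up are likely the Intel builds too, which would be worth ruling out first.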
 

kk16kk16

macrumors newbie
Apr 18, 2022
3
0
Wow thank you all for helpful responses! ☺️

@leman: I see, yes, that's my current thinking too. I will look into Stan/CmdStan settings (incl. toolchains) a bit more myself and then ask around in the CmdStan community.

@ahurst: Wow, those numbers restored my faith in the M1 Pro, haha. I'll see what I can achieve with RStan; it definitely seems more likely now that it's due to the way I compiled CmdStan and/or the way I parallelize in my own code.

@The Mercurian: Thanks for the link! I compiled CmdStan on its own (not via CmdStanR) using Apple clang, which could be the culprit. Perhaps I should also try map_rect with OpenMP (once I've installed it successfully), which I haven't tried yet.
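
For reference, a minimal sketch of how the thread setting is passed when calling a compiled CmdStan model directly. The assumptions here: the model was built with `STAN_THREADS=true` in CmdStan's `make/local` (without that, `num_threads` should not give reduce_sum any extra threads), the CmdStan version is recent enough to accept `num_chains`, and `./my_model` and `data.json` are hypothetical placeholder names. The helper `cmdstan_sample_args` is made up for illustration and only builds the argument list.

```python
from typing import List


def cmdstan_sample_args(exe: str, data_file: str, chains: int, threads: int) -> List[str]:
    """Build the argument list for a threaded CmdStan sampling run.

    Assumes the executable was compiled with STAN_THREADS=true; the names
    passed in (executable path, data file) are placeholders.
    """
    if chains < 1 or threads < 1:
        raise ValueError("chains and threads must both be at least 1")
    return [
        exe,
        "sample",
        f"num_chains={chains}",
        "data",
        f"file={data_file}",
        f"num_threads={threads}",  # total threads shared across all chains
    ]


if __name__ == "__main__":
    # Print the command instead of running it, since the paths are hypothetical.
    print(" ".join(cmdstan_sample_args("./my_model", "data.json", chains=4, threads=6)))
```

To actually launch it, the list can be handed to `subprocess.run(args, check=True)`; watching Activity Monitor while it runs would show whether more than one core per chain is really being used.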
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,677
@The Mercurian: Thanks for the link! I compiled CmdStan on its own (not via CmdStanR) using Apple clang, which could be the culprit. Perhaps I should also try map_rect with OpenMP (once I've installed it successfully), which I haven't tried yet.

Nah, using the Apple toolchain is more than fine. Stan doesn’t use OpenMP anyway and you probably shouldn’t use it either.

Anyway, if you are open to coding things yourself, why not try Metal for GPU acceleration? The shading language is basically C++ and Apple recently released C++ bindings.
 