Status
Not open for further replies.

leman

macrumors Core
Oct 14, 2008
19,521
19,679
5800U is great, but it does use more power. I'm glad to help you find out if this statement is true.

To add to this, it's almost impossible to find a 5800U out in the wild. I had a quick look at a popular local online store - they list 8 machines with that APU, with delivery dates between late May and November (!!!). The M1 MacBook Air instead ships immediately, is considerably cheaper, and offers much better display and build quality.
 
  • Like
Reactions: JMacHack

Significant1

macrumors 68000
Dec 20, 2014
1,686
780
AMD Zen 4 will get 5nm, AVX512, DDR5, etc., so it'll be even faster. M1 only looks good compared to Intel, but it has too many trade-offs compared to AMD.
And by that time (late 2022 - early 2023) Apple Silicon will very likely already support the ARMv9 instruction set with SVE2 support on 3nm.
 

EntropyQ3

macrumors 6502a
Mar 20, 2009
718
824
And by that time (late 2022 - early 2023) Apple Silicon will very likely already support the ARMv9 instruction set with SVE2 support on 3nm.
Not that it will matter much if they do or don't. SVE's main benefit is arguably vector code portability being (relatively) independent of vector length. But it is a mistake to assume that "longer is better" when it comes to vector hardware. Given Apple's use cases (and hardware features beyond vanilla ARM), a 128-bit hardware vector length may well be the best compromise. Be that as it may, I suspect that Apple will standardize on a single hardware vector length across their product line, so the vector-length-agnostic property of SVE, as well as its support of very long vector implementations, is arguably redundant for Apple.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,679
Not that it will matter much if they do or don't. SVE's main benefit is arguably vector code portability being (relatively) independent of vector length.

SVE also offers a really nice SIMD programming model (predicates, gather/scatter etc.) and advanced auto-vectorisation support. It’s a straight-up upgrade from Neon even if the ALU width stays unchanged.

And of course, the ability to write portable SIMD code is just amazing altogether. It’s so much more annoying with Intel’s paradigm where you have to completely redesign your code for every supported SIMD length. Not to mention dealing with function multiversioning as well as unpredictable performance. Look at optimal memcpy using x86 and using SVE - it’s a joke…
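The contrast described above can be sketched conceptually. Below is a plain-Python simulation of an SVE-style vector-length-agnostic loop: the vector length is passed in as a parameter where real SVE code would query the hardware (e.g. with svcntw) and build the tail predicate with a whilelt instruction. No real intrinsics are used here — this is an illustration of the model only.

```python
# Conceptual simulation of an SVE-style vector-length-agnostic loop.
# Real SVE would query the width from the hardware and build the tail
# predicate in one instruction; here VL is simply a parameter.

def vla_add(a, b, vl):
    """Add two equal-length lists in VL-wide chunks with a tail predicate."""
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        # Predicate: which lanes of this chunk are still in bounds.
        pred = [i + lane < n for lane in range(vl)]
        for lane in range(vl):
            if pred[lane]:          # inactive lanes are simply skipped
                out[i + lane] = a[i + lane] + b[i + lane]
        i += vl
    return out

# The same source works unmodified for any hardware vector length:
a, b = list(range(10)), list(range(10, 20))
assert vla_add(a, b, 4) == vla_add(a, b, 16) == [x + y for x, y in zip(a, b)]
```

The point is that one piece of source code handles any hardware width, tail included, with no per-width rewrite — unlike SSE/AVX code, where the width is baked into the instruction selection.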

But it is a mistake to assume that "longer is better" when it comes to vector hardware. Given Apple's use cases (and hardware features beyond vanilla ARM), a 128-bit hardware vector length may well be the best compromise.

I agree. For general-purpose code, 128-bit SIMD (with more independent ALUs) is almost always a superior choice. I could see Apple using wider units on Mac Pro and similar hardware though. At any rate, I’d be surprised if they skip SVE, it just offers too many benefits.
 

EntropyQ3

macrumors 6502a
Mar 20, 2009
718
824
SVE also offers a really nice SIMD programming model (predicates, gather/scatter etc.) and advanced auto-vectorisation support. It’s a straight-up upgrade from Neon even if the ALU width stays unchanged.
Well, I pretty much agree. Playing the devil's advocate, I have to note that in the seminal white paper, they did recognise that the vector length agnosticism carried a performance cost to the tune of 10-15%.

But it is a modernised take across the board, and probably worth it for that alone.
And of course, the ability to write portable SIMD code is just amazing altogether. It’s so much more annoying with Intel’s paradigm where you have to completely redesign your code for every supported SIMD length. Not to mention dealing with function multiversioning as well as unpredictable performance. Look at optimal memcpy using x86 and using SVE - it’s a joke…
Well, that's the way they made their bed, from MMX via SSE to AVX. It's not pretty. And the various levels of AVX support across their product line remain a headache. But the one thing I agree with in that mess is the recognition that 512-bit hardware vector support just doesn't make sense in a lot of client hardware implementations. The cost/benefit analysis is atrocious. Rosetta doesn't even support AVX, which is an indication of how little market penetration it has achieved, regardless of technical arguments for/against.
I agree. For general-purpose code, 128-bit SIMD (with more independent ALUs) is almost always a superior choice. I could see Apple using wider units on Mac Pro and similar hardware though. At any rate, I’d be surprised if they skip SVE, it just offers too many benefits.
I'd be surprised too. SVE seems neat. But Apple's addition of AMX instructions, on top of their NPU block, also shows that they are eminently prepared to go their own way, so I wouldn't stake my life on it.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,679
Well, I pretty much agree. Playing the devil's advocate, I have to note that in the seminal white paper, they did recognise that the vector length agnosticism carried a performance cost to the tune of 10-15%.

I think it depends on the use case and how sophisticated the CPU is. For many algorithms, predicated SIMD alone will bring a healthy performance boost. And a CPU should be able to speculatively unroll SVE-based loops. Anyway, I'm looking forward to seeing all these things in practice and I hope it won't take too long until we can play around with it :)

But the one thing I agree with in that mess is the recognition that 512-bit hardware vector support just doesn't make sense in a lot of client hardware implementations. The cost/benefit analysis is atrocious.

The interesting thing is that Intel does ship AVX512 in their consumer Tiger Lake CPUs. And on paper, it could be quite useful. For example, Google's Swiss Tables are the new standard hash table implementation and rely heavily on SIMD to probe for hash entries, and things like UTF8 and JSON parsing are also SIMD-accelerated these days. Of course, it is all useless in practice if AVX512 always comes with a high warmup cost.
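The group-probing idea behind Swiss Tables can be illustrated without any AVX at all: compare eight control bytes against a tag in parallel using plain 64-bit arithmetic (the classic SWAR zero-byte trick). This is a sketch of the concept only — the real implementation uses SSE/AVX byte compares over 16-byte groups:

```python
# Sketch of SIMD group probing a la Swiss Tables, using portable 64-bit
# SWAR (SIMD within a register) instead of SSE/AVX byte compares.
# Eight control bytes are tested against a tag at once.

LO = 0x0101010101010101   # 0x01 in every byte
HI = 0x8080808080808080   # 0x80 in every byte
M64 = (1 << 64) - 1

def match_mask(group, tag):
    """Bit i of the result is set iff group[i] == tag (group is 8 bytes)."""
    word = int.from_bytes(group, "little")
    x = word ^ (tag * LO)                  # bytes equal to tag become 0x00
    hits = ((x - LO) & ~x & HI) & M64      # high bit set in each zero byte
    return sum(1 << i for i in range(8) if hits >> (8 * i + 7) & 1)

group = bytes([0x12, 0x7F, 0x12, 0x00, 0x55, 0x12, 0xFE, 0x7F])
assert match_mask(group, 0x12) == 0b100101   # bytes 0, 2 and 5 match
```

The real code does this with a single byte-compare instruction per group; the SWAR version above is just a portable illustration of the same probe-many-slots-at-once idea.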

Rosetta doesn't even support AVX, which is an indication of how little market penetration it has achieved, regardless of technical arguments for/against.

I thought that Rosetta's lack of AVX was due to licensing issues? According to the Steam hardware survey, AVX is supported on 95% of the client base, which sounds about right (Haswell or later + Jaguar and later). Of course, it's less useful on anything before Skylake because of the heavy state switch penalty it incurred on earlier CPUs.

But Apple's addition of AMX instructions, on top of their NPU block, also shows that they are eminently prepared to go their own way, so I wouldn't stake my life on it.

AMX fulfills a completely different niche though.
 

EntropyQ3

macrumors 6502a
Mar 20, 2009
718
824
The interesting thing is that Intel does ship AVX512 in their consumer Tiger Lake CPUs. And on paper, it could be quite useful. For example, Google's Swiss Tables are the new standard hash table implementation and rely heavily on SIMD to probe for hash entries, and things like UTF8 and JSON parsing are also SIMD-accelerated these days. Of course, it is all useless in practice if AVX512 always comes with a high warmup cost.

For clarification, I meant "market penetration" as "meaningfully used in software". The lack of AVX support in Rosetta doesn't seem to mean much of anything in terms of software compatibility. And this is for Mac software, mind you, not Windows, where developers have had to take AMD's SIMD implementations into consideration. Mac software could assume Intel processors, and not the most cut-back ones either, so you would actually think that AVX would have a larger presence in Mac software.
But I think we are digressing a bit here. Mine was mostly a reaction to tossing technical buzzwords around which felt a lot like "My daddy can beat your daddy any day of the week! No he can't! Yes he can!", and just like those schoolyard arguments it picked a feature that not only doesn't matter much to the majority of users, but where chasing larger numbers arguably makes a worse product.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,679
For clarification, I meant "market penetration" as "meaningfully used in software". The lack of AVX support in Rosetta doesn't seem to mean much of anything in terms of software compatibility. And this is for Mac software, mind you, not Windows, where developers have had to take AMD's SIMD implementations into consideration. Mac software could assume Intel processors, and not the most cut-back ones either, so you would actually think that AVX would have a larger presence in Mac software.

Got it.

But I think we are digressing a bit here. Mine was mostly a reaction to tossing technical buzzwords around which felt a lot like "My daddy can beat your daddy any day of the week! No he can't! Yes he can!", and just like those schoolyard arguments it picked a feature that not only doesn't matter much to the majority of users, but where chasing larger numbers arguably makes a worse product.

So true :)
 
  • Like
Reactions: BigMcGuire

mercurysquad2

macrumors newbie
May 11, 2021
6
3
Just registered to voice my opinion. In this thread there are many people who are completely dismissing the OP, with sarcastic responses about divorce and offering a cookie. That's not productive.

It doesn't matter whether this "benchmark" is standardised or not, this is absolutely a real world usage. If you're a chess enthusiast, and you bought an M1 MacBook today, you will truly take a huge performance hit for now. This deserves to be publicly discoverable info as it can affect purchase decisions.

I have both in front of me and came to this thread because I was surprised to see that the new M1 is processing positions at 15-20% of the speed of my 2020 Intel MBP! A more productive discourse would be to find out what's the bottleneck here? I'd be interested to see if a rewrite of Stockfish to leverage the Neural Engine would help. I also saw some people saying only the NNUE mode of Stockfish is affected; sorry to say, that's not the case. With NNUE turned either on or off, the Intel 2020 MBP gets around 7500 kN/s (8 cores) while the M1 only nets around 1200 kN/s. I checked various hash sizes (RAM dependent) as well; no difference. I also tried both the M1 and Rosetta versions; the result was about the same. My 8-core iMac actually nets 13000-14000 kN/sec. So clearly something's very wrong.

Once again: do not dismiss this result just because it's not relevant to you. Chess is not so niche anymore. This needs more publicity and to be picked up by more people, so that corrections/optimisations can be made to Stockfish and possibly other engines. It also goes to show that not all programs will automatically run faster; in fact, some might end up running considerably slower. And that info is relevant to a lot of other developers. Don't just go by the Geekbench score.
 
  • Like
Reactions: Appletoni

leman

macrumors Core
Oct 14, 2008
19,521
19,679
It doesn't matter whether this "benchmark" is standardised or not, this is absolutely a real world usage. If you're a chess enthusiast, and you bought an M1 MacBook today, you will truly take a huge performance hit for now. This deserves to be publicly discoverable info as it can affect purchase decisions.

This is definitely not the case. Several users (myself included) in this thread have benchmarked M1 machines using the Stockfish engine and got numbers around 11-13 million nodes/s.

Now compare this to the results published on https://openbenchmarking.org/test/pts/stockfish

I have both in front of me and came to this thread because I was surprised to see that the new M1 is processing positions at 15-20% of the speed of my 2020 Intel MBP! A more productive discourse would be to find out what's the bottleneck here?

My guess: probably something wrong with your laptop or your methodology.

According to the link above, a Tiger Lake i7-1165G7 (which is definitely a faster CPU than your Intel MBP) gets around 7.5 million nodes/s (which makes it around 50% slower than M1, as expected based on other benchmarks).

And it also goes to show that not all programs will automatically run faster, in fact some might end up running considerably slower. And that info is relevant to a lot of other developers. Don't just go by Geekbench score.

This is definitely the case. But I can't say that I agree with your analysis of the situation.

The original post was comparing the quad-core, entry level M1 chips to high-end server CPUs with 16 or more cores and declaring this as "disappointing". Some other posters joined in, explaining how "slow" this processor is. This is obviously not a rational position given the fact that M1 performs extremely well in a variety of real world workloads as well as industry standard benchmarks.

In retrospect, it does seem like M1 underperforms in Stockfish relative to expectations. My guess as a programmer with some exposure to low-level optimization is that Stockfish lacks optimizations for the ARM architecture. It is also possible that the Stockfish code has inherently low instruction-level parallelism (that could be the case if Stockfish relies on long chains of interdependent computations), which would be the worst possible match for M1 with its comparatively low-clocked, very wide cores. If I understood correctly, a Stockfish developer read this thread and promised to investigate. But according to the benchmarks, the quad-core M1 is still way faster than any quad-core Intel or AMD chip and on par with 6-core SMT x86 chips.
 

Jorbanead

macrumors 65816
Aug 31, 2018
1,209
1,438
Once again: do not dismiss this result just because it's not relevant to you.
They are claiming that the CPU is "slow" in the title, but specifically they're talking about this one benchmark when compared to high-end server CPUs. They gave us no information on how they ran this benchmark, and people have debunked their claims based on what we do know. So yes, people are going to be negative in their responses.

See these responses:
The Stockfish engine is extensively optimized for x86's SIMD instructions, and its algorithm cannot utilize M1's super-wide ALU effectively; this is why you are seeing "disappointing speed." If chess analysis is your main use case, you should not buy an M1 Mac and should stick with x86 for now.
It’s up to the person producing the claim to show that their methodology makes sense and their result is valid. Then we can try to have a productive discussion. OP didn’t show any interest in being constructive, they just came with an isolated statement that contradicts all we know up to now.
In fact, Stockfish appears to perform rather well on the M1, if these results are to be believed (two posters got consistent results). The M1 is 64% faster (single core) than an iMac Pro's Xeon, and almost twice as fast as my i5 7600k, while it's "just" 65% faster on Geekbench (I get a single-core score of 1041).
The OP must have seen this page, which reports Stockfish scores by CPUs consuming >10 times the power of the M1.

EDIT: my Stockfish score of 1560 kN/s seems too low compared to the M1. I may be using a different version from the one tested there. It appears that Stockfish and Geekbench give very similar relative scores.

TL;DR: move along. Nothing to see here.
Stockfish scales very well with the number of CPU cores. The actual performance matters very little; more cores = better score. Your 10-year-old CPU is 6c/12t, the M1 is 4P/4E. That makes the comparison useless, as it's not the actual per-core performance being compared here. It's also the reason why AMD 32c/64t CPUs are on top: sheer number of cores, independent of performance.
I ran the benchmarks with the settings suggested by @mi7chy (stockfish bench 128 NUMCORES 24 default depth). In a nutshell, M1 with its 4 cores is 10% slower than a top-shelf Intel mobile CPU with 8 cores while consuming 80% less power. Looking at the linked https://openbenchmarking.org/test/pts/stockfish, M1 is faster than Tiger Lake. 8-core desktop CPUs are considerably faster. Ah, and M1 is about 50% faster in the single-threaded variant than a 4.8GHz Intel Skylake refresh.

My conclusion: no, M1's performance is not disappointing at all. Its performance is comparable to much larger (and 4-5x hotter!) CPUs.

A few additional observations: AMD seems to do really well in Stockfish (for unclear reasons) and this benchmark really loves SMT. Stockfish seems to support Neon (ARM SIMD) but it's not really clear how it is utilized. The fact that it scales so well with SMT suggests to me that the SIMD code is suboptimal and could probably be improved to achieve better ILP.
You just disproved your point.

SPEC is a good benchmark. Chess is not. Chess is a good benchmark component. But most computing has a very different profile than chess algorithms. That’s why good benchmarks blend the results of many different kinds of computing activity.
 
Last edited:

mercurysquad2

macrumors newbie
May 11, 2021
6
3
Several users (myself included) in this thread have benchmarked M1 machines using Stockfish engine and got numbers around 11-13 million nodes/s.

Hi @leman, could you point me to how you got those results? That is exactly what I was expecting, seeing as the M1 benchmarks about 1.5-2x faster than 10th-gen Intel.

How I got my result: Downloaded Stockfish from the Mac App Store, and ran the analysis on a few positions. This is what I also did on the Intel MBP and iMac, and it corresponds to actual usage of this app.

I'm not interested in finding out whether M1 outperforms Intel in controlled conditions or its performance per watt etc. Those are well known already.

I'm interested in finding out how I can get better or at least the same performance in Stockfish as my other MBP which it's supposed to replace. If not, it's going back...
 

leman

macrumors Core
Oct 14, 2008
19,521
19,679
Hi @leman, could you point me to how you got those results? That is exactly what I was expecting, seeing as the M1 benchmarks about 1.5-2x faster than 10th-gen Intel.

I followed the instructions for the standard benchmark here: https://openbenchmarking.org/innhold/7b48e20eaf790461ec7160590babc3d317fe4f66

Specifically, I have installed the latest version of Stockfish engine using homebrew and then run the following benchmark command (the number 8 refers to the number of threads to run, I chose 8 for M1 since it's the performance sweet spot):

stockfish bench 128 8 24 default depth


How I got my result: Downloaded Stockfish from the Mac App Store, and ran the analysis on a few positions. This is what I also did on the Intel MBP and iMac, and it corresponds to actual usage of this app.

Hm, maybe the App Store version is outdated or does not use multiple threads? I've never seen the app, so I really can't say anything relevant.
 

deathdruid

macrumors member
Oct 24, 2006
38
2
Cambridge, MA, USA
Screen Shot 2021-05-11 at 4.19.01 PM.png

Default setting for Stockfish from App Store. Clicking recommended settings changes cores to 4, hash to 8192, etc. I did not actually run any benchmarks, but was just curious.
 

mercurysquad2

macrumors newbie
May 11, 2021
6
3
I did a lot of debugging and I am happy to say: the problem has been solved.

The bottleneck was RAM.

I actually downloaded the source code of the Mac GUI of Stockfish, replaced the engine binary with the one from Homebrew, and still no difference. Then I ran the benchmark and confirmed 12000 kN/s. However, if you run stockfish from the command line and type "setoption name Threads value 8" and then "go infinite", you'll see an "nps" value around 6000 kN/s. But if you also type "setoption name Hash value 4096" before go infinite, the speed decreases to only 800-1000 kN/s.
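For anyone repeating this measurement, the reported nps figure can be scraped from the engine's UCI "info" lines with a few lines of script. A sketch — the sample output below is made up for illustration, not captured from a real run:

```python
import re

def latest_nps(uci_output):
    """Return the last 'nps' value reported in UCI 'info' lines, or None."""
    nps = None
    for line in uci_output.splitlines():
        m = re.search(r"\bnps (\d+)", line)
        if m:
            nps = int(m.group(1))
    return nps

# Illustrative UCI output (numbers invented for the example):
sample = (
    "info depth 20 seldepth 28 nodes 18231407 nps 5977511 hashfull 412\n"
    "info depth 21 seldepth 30 nodes 40122870 nps 6014234 hashfull 577\n"
)
assert latest_nps(sample) == 6014234
```

Sending the same setoption commands before each "go" and scraping nps this way makes the before/after hash-size comparison easy to reproduce.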

According to the Stockfish documentation, 16MB is the default value (far too small!). That's probably why the benchmark was doing great: it was running with a minuscule hash table size. Based on some recommendations, I checked the hash table utilisation at different sizes for a 75-second search (suggested for a rapid chess game of 15min+10sec increment).

Up to a value of 2-3 GB, things are still fast. I was able to get 11000 kN/s (I am super happy, that's almost as fast as my iMac). However, the hash utilisation was touching 85-90%, begging for it to be increased to at least 5GB or so. Otherwise you're not using the engine's full strength.

BUT if you set the hash table size above roughly 3GB on an 8GB machine, memory pressure will be high and swap will begin to get used. 4GB is already too much, and you will see performance drop to under 10% of what it should be.

I did the same tests on my Intel, and even 8 GB hash works fine there -- but it also has 16 GB RAM.

Bottom line: an 8GB MBP is not suitable for chess enthusiasts. Go with 16GB. With 16GB, the M1 will reach 11-12000 kN/sec, which is close to what a 6- or 8-core desktop CPU can get. And if you can only use an 8GB MacBook, restrict the hash size to 2048.
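This rule of thumb can be expressed as a tiny helper. The ~6 GB of headroom below is my own reading of these findings (it reproduces the 2048 MB cap on an 8 GB machine), not an official recommendation:

```python
def safe_hash_mb(total_ram_gb, headroom_gb=6):
    """Suggest a Stockfish hash size (MB) that should avoid swapping.

    Leaves `headroom_gb` of RAM for the OS and other apps, per the
    findings in this thread. The headroom figure is an assumption.
    """
    usable_gb = max(total_ram_gb - headroom_gb, 0)
    return int(usable_gb * 1024)

assert safe_hash_mb(8) == 2048     # matches the cap suggested above
assert safe_hash_mb(16) == 10240   # plenty of room on a 16GB machine
```

The key point the helper encodes: the hash size that maximizes engine strength on paper is not the one to use if it pushes the machine into swap.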
 
Last edited:

BigMcGuire

Cancelled
Jan 10, 2012
9,832
14,032
I followed the instructions for the standard benchmark here: https://openbenchmarking.org/innhold/7b48e20eaf790461ec7160590babc3d317fe4f66

Specifically, I have installed the latest version of Stockfish engine using homebrew and then run the following benchmark command (the number 8 refers to the number of threads to run, I chose 8 for M1 since it's the performance sweet spot):

stockfish bench 128 8 24 default depth




Hm, maybe the App Store version is outdated or does not use multiple threads? I've never seen the app, so I really can't say anything relevant.
With these settings on my M1 MBP I got:

Total time (ms) : 58088
Nodes Searched : 655373913
Nodes/Second : 11282432

Nice.
 

mercurysquad2

macrumors newbie
May 11, 2021
6
3
Sorry, I meant restrict it to 2GB. Even 3GB is too much on an 8GB MBP.

And a tip for anyone else trying to test if your CPU-bound process is bottlenecking on RAM: check Activity Monitor CPU Usage History window. If you see all of them maxing out in green colour, it's fine. But if you see a substantial chunk in red, that's time the OS is taking away from your process to swap in/out.
 
Last edited:

dmccloud

macrumors 68040
Sep 7, 2009
3,142
1,900
Anchorage, AK
I did a lot of debugging and I am happy to say: the problem has been solved.

The bottleneck was RAM.

I actually downloaded the source code of the Mac GUI of Stockfish, replaced the engine binary with the one from Homebrew, and still no difference. Then I ran the benchmark and confirmed 12000 kN/s. However, if you run stockfish from the command line and type "setoption name Threads value 8" and then "go infinite", you'll see an "nps" value around 6000 kN/s. But if you also type "setoption name Hash value 4096" before go infinite, the speed decreases to only 800-1000 kN/s.

According to the Stockfish documentation, 16MB is the default value (far too small!). That's probably why the benchmark was doing great: it was running with a minuscule hash table size. Based on some recommendations, I checked the hash table utilisation at different sizes for a 75-second search (suggested for a rapid chess game of 15min+10sec increment).

Up to a value of 2-3 GB, things are still fast. I was able to get 11000 kN/s (I am super happy, that's almost as fast as my iMac). However, the hash utilisation was touching 85-90%, begging for it to be increased to at least 5GB or so. Otherwise you're not using the engine's full strength.

BUT if you set the hash table size above roughly 3GB on an 8GB machine, memory pressure will be high and swap will begin to get used. 4GB is already too much, and you will see performance drop to under 10% of what it should be.

I did the same tests on my Intel, and even 8 GB hash works fine there -- but it also has 16 GB RAM.

Bottom line: an 8GB MBP is not suitable for chess enthusiasts. Go with 16GB. With 16GB, the M1 will reach 11-12000 kN/sec, which is close to what a 6- or 8-core desktop CPU can get. And if you can only use an 8GB MacBook, restrict the hash size to 2048.

Actually, the only thing you said there is that 8GB is not suitable for people running this one benchmark under specific conditions. Furthermore, Stockfish is still not updated for the M1, so you're trying to compare an application running through Rosetta to the same app running natively on x86 processors, which alone invalidates the comparisons. Also, using one chess app as an indicator of the M1's overall performance is laughable at best and is the very definition of cherry-picking data to support your preconceived notions.
 
  • Like
Reactions: JMacHack

mercurysquad2

macrumors newbie
May 11, 2021
6
3
Actually, the only thing you said there is that 8GB is not suitable for people running this one benchmark under specific conditions. Furthermore, Stockfish is still not updated for the M1, so you're trying to compare an application running through Rosetta to the same app running natively on x86 processors, which alone invalidates the comparisons. Also, using one chess app as an indicator of the M1's overall performance is laughable at best and is the very definition of cherry-picking data to support your preconceived notions.

Hello, I tested all 3 versions of Stockfish: M1-optimised, x86 under Rosetta 2, and native x86 on Intel.

Secondly, this is not just a benchmark; it's a real-world app that many, many people use. The recent World Championship Candidates Tournament used Stockfish during its commentary.

And lastly, it was not run "under specific conditions." This was an app downloaded from Mac App Store, which ran at 10% of its expected performance with the recommended settings. That deserved investigation, which I did, and posted my findings. I hope it helps someone.

Sorry for the impolite tone of my previous reply.
 
Last edited:

Significant1

macrumors 68000
Dec 20, 2014
1,686
780
Actually, the only thing you said there is that 8GB is not suitable for people running this one benchmark under specific conditions. Furthermore, Stockfish is still not updated for the M1, so you're trying to compare an application running through Rosetta to the same app running natively on x86 processors, which alone invalidates the comparisons. Also, using one chess app as an indicator of the M1's overall performance is laughable at best and is the very definition of cherry-picking data to support your preconceived notions.
Activity Monitor says the kind is Apple, not Intel (add the Kind column by right-clicking a column header).
 

dmccloud

macrumors 68040
Sep 7, 2009
3,142
1,900
Anchorage, AK
Activity Monitor says the kind is Apple, not Intel (add the Kind column by right-clicking a column header).
At the time the OP posted his complaint, Stockfish had NOT been updated for the M1. Second, I know how to add that column to Activity Monitor. The point remains that Stockfish is an extremely poor tool when it comes to comparing the M1 to Intel processors. The vast majority of Mac users will never install Stockfish or ever use the app, so tests using it are only relevant to a very small percentage of the userbase. The OP used this one very specific and niche example to justify his perception that the M1 is underpowered. However, real-world testing under more common workloads shows that the M1 actually keeps pace with higher end Intel CPUs, and in some tests actually beat out both the 16" MBP and a beefy Mac Pro (go look at the MaxTech comparison videos on YouTube to see this in action).

In automotive terms, you're using the performance of a car on a test track to determine how well it will perform on the interstate in the middle of rush hour.
 

mercurysquad2

macrumors newbie
May 11, 2021
6
3
Stockfish is an extremely poor tool when it comes to comparing the M1 to Intel processors. [...] However, real-world testing under more common workloads shows that the M1 actually keeps pace with higher end Intel CPUs

Hello again, I think you misunderstood, I am not the OP of this thread. As I said before after adjusting certain settings, Stockfish (a real world app not a benchmark) does perform great and almost beats out Intel.
 