Status
Not open for further replies.

AdamNC

macrumors 6502a
Feb 3, 2018
751
1,052
Leland NC
Benchmarks are only good for one thing: mass panic. The best way to test a computer is with timing tests. Take your old computer, run your tasks, time them, then repeat on the new computer. The best times win. Kinda like tuning a dragster: if you improve your ET, you are running better.
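A minimal sketch of that kind of timing test, in Python. The workload `example_task` is a hypothetical stand-in, not anything from this thread; swap in whatever job you actually run. The idea is to run the same script on both machines and compare the medians.

```python
import statistics
import time


def time_task(task, repeats=5):
    """Run task() `repeats` times and return the median wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def example_task():
    # Hypothetical stand-in workload; replace with the job you care about.
    return sum(i * i for i in range(200_000))


def speedup(old_seconds, new_seconds):
    """How many times faster the new machine finished the same task."""
    return old_seconds / new_seconds


if __name__ == "__main__":
    print(f"median time: {time_task(example_task):.4f} s")
```

Using the median rather than a single run keeps one noisy measurement from deciding the winner.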
 
  • Like
Reactions: Martyimac

jeanlain

macrumors 68020
Mar 14, 2009
2,462
957
M1 has an 8-wide decoder, which is by far the widest commercialized design in the industry. Intel Skylake and AMD Zen (1, 2, 3) only have a 4-wide decoder, mainly due to the complexity of x86 instructions. To utilize such a wide decoder (and the execution units in the backend) we have to rely on some superscalar technique, but combinatorial search, as used in this chess engine, usually has a very sequential instruction dependency, which makes instruction-level parallelism hard. This particular case is where SMT, providing us thread-level parallelism, shines.

We don't downplay unfavorable benchmarks, but this particular workload is one of many cases that the M1 is not good at (for now). If this particular case is very relevant to the user, that user should not buy an M1-equipped Mac. A lot of the disagreement here only amounts to saying 'M1 is fast for me' because the OP comes with a scary title.
The M1 does appear to perform as it should (= on par with the best cores) in "classic" Stockfish. The underperformance appears specific to Stockfish NNUE. I don't know how it's coded, but is it more sequential than classic Stockfish?
There are apparently some x86-specific instructions related to SIMD and bit manipulation. Maybe they've not been translated into ARM instructions yet.
I won't bother to look at the code as I don't know how to recognise those instructions anyway, but the Mac stockfish app has three different executables for X86-64, one whose name contains "sse41", another "bmi2" and a third one "vnni-256", which apparently corresponds to AVX 512. There is just one ARM executable: "stockfish-arm64".

As for SMT, it would be nice to have it on ARM Mac.
 

Andropov

macrumors 6502a
May 3, 2012
746
990
Spain
That would be relevant if Stockfish used a Neural Engine (or its equivalent) on x86 CPUs but does it?
Well, if the SoC is technically able to run the benchmark much faster but doesn't (as of now) because no one has ported it to the correct tooling, I do think it is relevant. Missing important optimizations available for the M1 does not make the comparison fairer; quite the opposite.
 
  • Like
Reactions: BigMcGuire

jeanlain

macrumors 68020
Mar 14, 2009
2,462
957
Which makes me wonder about the apps using the Neural Engine (Core ML). The only ones I know of use ML-assisted face/object/scene recognition for Animoji, smart selection, reframing, etc. Nothing really "serious".
 

mi7chy

macrumors G4
Oct 24, 2014
10,625
11,296
That would be relevant if Stockfish used a Neural Engine (or its equivalent) on x86 CPUs but does it?

LeelaChessZero utilizes the GPU for its NN and there's also a native M1 version. Its ELO rating is high and it has edged out wins against Stockfish 11 (don't know how it fares against the improved ELO of Stockfish 12 and 13), but its nps rate looks to be relatively slower.

https://github.com/LeelaChessZero/lc0

https://docs.google.com/spreadsheet...7Vul4DpRNfn6K8oeCjBILe6uA/edit#gid=1508569046

Bit of trivia: elite super grandmaster human players have an ELO of ~2800, Stockfish's ELO is ~3500+, but Tensor Processing Units (TPUs) have a godly ELO of 5000+. 4 TPUs > 176 GPUs. Don't see where a TPU can be bought, but you can rent them.

1619379473724.png
 

EntropyQ3

macrumors 6502a
Mar 20, 2009
718
824
Yes. SMT is a crutch for designs where the design is bad and you can’t keep the pipelines full another way.
That’s a bit harsh.
You have to save some blame for the code they are asked to run. Can’t allow the software guys a clean getaway.
 
  • Like
Reactions: bobcomer

leman

macrumors Core
Oct 14, 2008
19,521
19,679
I ran the benchmarks with the settings suggested by @mi7chy (stockfish bench 128 NUMCORES 24 default depth) with the following results (power usage checked using Apple's powermetrics):

M1 MBP 13":
~ 12000 knodes/s for 8 threads using 15W of power
~ 2500 knodes/s for 1 thread using 5W of power

Intel i9 16" MBP:
~ 13000 knodes/s for 16 threads using 65W of power
~ 1700 knodes/s for 1 thread using 35W of power

In a nutshell, the M1 with its 4 cores is 10% slower than a top-shelf Intel mobile CPU with 8 cores while consuming 80% less power. Looking at the linked https://openbenchmarking.org/test/pts/stockfish, the M1 is faster than Tiger Lake. 8-core desktop CPUs are considerably faster. Ah, and the M1 is about 50% faster in the single-threaded variant than a 4.8 GHz Intel Skylake refresh.
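For anyone who wants to redo the arithmetic, here is a small Python sketch using only the knodes/s and wattage figures quoted above. The helper names are made up for illustration. On those numbers the M1 comes out roughly 8% slower multi-threaded at roughly 77% less power, which matches the rounded "10%" and "80%", and about 1.5x faster single-threaded.

```python
# Figures copied from the results above: (knodes/s, watts).
RESULTS = {
    ("M1", "multi"): (12000, 15),
    ("M1", "single"): (2500, 5),
    ("i9", "multi"): (13000, 65),
    ("i9", "single"): (1700, 35),
}


def knodes_per_watt(chip, mode):
    """Throughput divided by measured power for one chip and thread mode."""
    knodes, watts = RESULTS[(chip, mode)]
    return knodes / watts


def percent_less(a, b):
    """How many percent lower a is than b."""
    return (1 - a / b) * 100


if __name__ == "__main__":
    for chip in ("M1", "i9"):
        for mode in ("multi", "single"):
            print(chip, mode, round(knodes_per_watt(chip, mode), 1), "knodes/s per W")
    print("M1 multi-thread deficit:", round(percent_less(12000, 13000), 1), "%")
    print("M1 power saving:", round(percent_less(15, 65), 1), "%")
    print("M1 single-thread ratio:", round(2500 / 1700, 2), "x")
```

Per watt, that works out to 800 vs 200 knodes/s for the multi-threaded runs.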

My conclusion: no, the M1's performance is not disappointing at all. Its performance is comparable to much larger (and 4-5x hotter!) CPUs.

A few additional observations: AMD seems to do really well in Stockfish (for unclear reasons), and this benchmark really loves SMT. Stockfish seems to support Neon (ARM SIMD), but it's not really clear how it is utilized. The fact that it scales so well with SMT suggests to me that the SIMD code is suboptimal and could probably be improved to achieve better ILP.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,679
That’s a bit harsh.
You have to save some blame for the code they are asked to run. Can’t allow the software guys a clean getaway.

SMT is also great for code with long dependency chains and low inherent ILP - as long as you can multithread it of course. Interestingly enough, this seems to be the case with Stockfish.
 
  • Like
Reactions: BigMcGuire

leman

macrumors Core
Oct 14, 2008
19,521
19,679
Where do you see an AMD mobile 8-core on openbenchmarking.org? Only one I see is 4500U which is 6-core 6-thread.

Sorry, my mistake!

8-core 16-thread is 4800U or 5800U which are both 15W parts.

They are only 15W on paper (disregarding the fact that most "high" scores are for 25W configured laptops). For the short duration of this type of test, they will operate at PL2, which is over 40 watts. That, incidentally, is the reason why these CPUs do so well in multi-core benchmarks: they essentially operate as "large" mobile/desktop CPUs. But they can only maintain this for a couple of minutes at most.
 

mi7chy

macrumors G4
Oct 24, 2014
10,625
11,296
They are only 15W on paper (disregarding the fact that most "high" scores are for 25W configured laptops). For the short duration of this type of test, they will operate at PL2, which is over 40 watts. That, incidentally, is the reason why these CPUs do so well in multi-core benchmarks: they essentially operate as "large" mobile/desktop CPUs. But they can only maintain this for a couple of minutes at most.

That's actually false. The default power limit is 15W and is configurable from 20 to 25W for the AMD U series CPUs. It can even go as low as 10W with core boost disabled. That's the beauty of it: the power limit and performance are configurable, so, for example, you can set it to 10W on battery to maximize battery life and 20W when plugged in for performance. It never peaks over what it's configured for.

1619465616735.png
 

Andropov

macrumors 6502a
May 3, 2012
746
990
Spain
That's actually false. The default power limit is 15W and is configurable from 20 to 25W for the AMD U series CPUs. It can even go as low as 10W with core boost disabled. That's the beauty of it: the power limit and performance are configurable, so, for example, you can set it to 10W on battery to maximize battery life and 20W when plugged in for performance. It never peaks over what it's configured for.
I haven't found any test for the 4800U/5800U, but the 5980HS surely does peak above its claimed 35W TDP at the start of the benchmarks exactly as @leman says:

Power-P95-5980HS_575px.png


From this AnandTech article.

It seems odd to market the 4800U/5800U as having a 15W TDP and then actually limit the power to that amount. Every other CPU I can think of peaks at a higher power consumption than its listed TDP.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,679
That's actually false. The default power limit is 15W and is configurable from 20 to 25W for the AMD U series CPUs. It can even go as low as 10W with core boost disabled.

There are different power limits. PL1 (the one you are talking about) is the long-term sustained power limit. PL2 is the short-term power limit that can be active for a couple of minutes. In the default configuration, most benchmarks will test CPU performance under PL2, which again is the main reason why these 8-core Ryzen parts do so well in benchmarks.
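To make the PL1/PL2 point concrete, here is a deliberately simplified Python sketch. It assumes a plain step model (full PL2 for a fixed boost window, then PL1), which is not how real firmware budgets power. The 15W and 40W values come from the figures discussed in this thread; the 120-second window is an assumption standing in for "a couple of minutes".

```python
def average_power(duration_s, pl1_w, pl2_w, boost_window_s):
    """Average package power over a run under a simple step model:
    the chip draws pl2_w for the first boost_window_s seconds, then pl1_w.
    This is an illustrative simplification, not a vendor algorithm."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    boosted = min(duration_s, boost_window_s)
    sustained = duration_s - boosted
    return (boosted * pl2_w + sustained * pl1_w) / duration_s


if __name__ == "__main__":
    # Assumed values: PL1 = 15 W, PL2 = 40 W, boost window = 120 s.
    for seconds in (60, 600):
        watts = average_power(seconds, pl1_w=15, pl2_w=40, boost_window_s=120)
        print(f"{seconds:>4} s run: {watts:.1f} W average")
```

Under those assumptions a one-minute benchmark runs entirely at the short-term limit, while a ten-minute run averages much closer to the sustained limit.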

That's the beauty of it: the power limit and performance are configurable, so, for example, you can set it to 10W on battery to maximize battery life and 20W when plugged in for performance. It never peaks over what it's configured for.

And what scores are you getting when configuring PL1 and PL2 to 10 Watts?
 

hans1972

Suspended
Apr 5, 2010
3,759
3,398
Benchmark: Stockfish (chess) speed

M1 CPU = 13000 kn/s

i7 3930k overclocked (from 2011 = 10 years old) = 13000 kn/s

Others = 40000 kn/s - 80000 kn/s

Desktop CPUs = 230000 kn/s and much stronger

My solution: I live just a short walk from Magnus Carlsen.
 
  • Love
Reactions: BigMcGuire

Appletoni

Suspended
Original poster
Mar 26, 2021
443
177

Apple M1 CPU speed is very disappointing

To answer some questions, take a look here:

To measure and compare engine speed, you can use a free chess program and click the analysis button at the beginning of a game (starting position). After 1 minute, click the button again to stop the analysis, note the kn/s, and run the test again with other engines on different hardware.
You only need to decide at the beginning whether you want to compare tests with one core or more cores, or maybe something like 2 vs 4 cores.

You will never see exactly the same kn/s, but that is not important for the comparison.
Stockfish:
kn/s: 1234 (only the first digit (here 1xxx) is important to compare). On other hardware it could be 2000 (2xxx).
kn/s: 12345 (only the first + second digits (here 12xxx) are important to compare).
kn/s: 123456 (only the first + second + third digits (here 123xxx) are important to compare).
The last 3 digits (xxx) are always different, even on exactly the same hardware.
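The rule above amounts to dropping the last three digits before comparing. A minimal Python sketch of that idea, with made-up helper names:

```python
def comparable_part(kn_per_s):
    """Drop the last three digits, which vary from run to run."""
    return kn_per_s // 1000


def same_speed_class(a, b):
    """True when two kn/s readings agree once the noisy digits are dropped."""
    return comparable_part(a) == comparable_part(b)


if __name__ == "__main__":
    for reading in (1234, 12345, 123456):
        print(reading, "->", comparable_part(reading))
```

So 12345 and 12999 would count as the same result, while 12345 and 13001 would not.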

Hardware speed:

Measure speed:
Build Tester v.1.4.7.0

Compare benchmarks with your hardware:

Apple M1 chip problems:

Stockfish:
See also Stockfish Discord developer channel https://discord.com/invite/nv8gDtt

CFish:

LC0:

Ceres:
 