Apple M1 CPU & GPU speed is very disappointing

AdamNC · Apr 24, 2021

Benchmarks are only good for one thing: mass panic. The best way to test a computer is with timing tests. Take your old computer, run your tasks and time them, then repeat with the new computer. The best times win. Kinda like tuning a dragster: if you improve your ET, you are running better.
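
A minimal sketch of that kind of timing test, assuming Python is available on both machines. The names `time_task` and `example_workload` are made up for illustration; swap in whatever task you actually care about and run the same script on the old and the new computer.

```python
import statistics
import time


def time_task(task, repeats=5):
    """Run task() `repeats` times and return the wall-clock times in seconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        times.append(time.perf_counter() - start)
    return times


def example_workload():
    # Hypothetical stand-in for a real task (an export, a compile, a render...).
    return sum(i * i for i in range(200_000))


if __name__ == "__main__":
    times = time_task(example_workload)
    # Report the best and the median run; compare these between machines.
    print(f"best {min(times):.4f}s, median {statistics.median(times):.4f}s")
```

Repeating the run a few times and looking at the best or median time smooths out background noise on either machine.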

jeanlain · Apr 25, 2021

Gnattu said:
M1 has an 8-wide decoder, which is by far the widest commercialized design in the industry. Intel Skylake and AMD Zen (1, 2, 3) only have a 4-wide decoder, mainly due to the complexity of x86 instructions. To utilize such a wide decoder (and the execution units in the backend) we have to rely on superscalar techniques, but combination, as used in this chess engine, usually has a very sequential instruction dependency, which makes instruction-level parallelism hard. This particular case is where SMT, which provides thread-level parallelism, shines.

We don't downplay unfavorable benchmarks, but this particular workload is one of many cases that the M1 is not good at (for now). If this particular case is very relevant to the user, that user should not buy an M1-equipped Mac. A lot of the disagreement here only wants to say 'M1 is fast for me' because the OP comes with a scary title.

The M1 does appear to perform as it should (= on par with the best cores) in "classic" Stockfish. The underperformance appears specific to Stockfish NNUE. I don't know how it's coded, but is it more sequential than classic Stockfish?
There are apparently some x86-specific instructions related to SIMD and bit manipulation. Maybe they've not been translated into ARM instructions yet.
I won't bother to look at the code as I don't know how to recognise those instructions anyway, but the Mac Stockfish app has three different executables for x86-64: one whose name contains "sse41", another "bmi2" and a third "vnni-256", which apparently corresponds to AVX-512. There is just one ARM executable: "stockfish-arm64".

As for SMT, it would be nice to have it on ARM Mac.

leman · Apr 25, 2021

jeanlain said:
As for SMT, it would be nice to have it on ARM Mac.

Why? Apple CPUs don’t seem to need it. They are able to achieve high backend utilization perfectly well without SMT.

jeanlain · Apr 25, 2021

leman said:
Why? Apple CPUs don’t seem to need it. They are able to achieve high backend utilization perfectly well without SMT.

Is utilisation higher than competing CPUs that benefit from SMT?

leman · Apr 25, 2021

jeanlain said:
Is utilisation higher than competing CPUs that benefit from SMT?

From what I remember, yes (Andrei Frumusanu has published some results on Twitter, but I can't find them now).

Andropov · Apr 25, 2021

falainber said:
That would be relevant if Stockfish used a Neural Engine (or its equivalent) on x86 CPUs but does it?

Well, if the SoC is technically able to run the benchmark much faster but doesn't (as of now) because no one has ported it to the right tooling, I do think it is relevant. Missing important optimizations available for the M1 does not make the comparison fairer; quite the opposite.

jeanlain · Apr 25, 2021

Which makes me wonder about the apps using the Neural Engine (Core ML). The only ones I know of use ML-assisted face/object/scene recognition for Animoji, smart selection, reframing, etc. Nothing really "serious".

Andropov · Apr 25, 2021

jeanlain said:
Which makes me wonder about the apps using the Neural Engine (Core ML). The only ones I know of use ML-assisted face/object/scene recognition for Animoji, smart selection, reframing, etc. Nothing really "serious".

Pixelmator Pro uses Core ML for a few things too.

cmaier · Apr 25, 2021

jeanlain said:
Is utilisation higher than competing CPUs that benefit from SMT?

Yes. SMT is a crutch for designs where the design is bad and you can’t keep the pipelines full another way.

mi7chy · Apr 25, 2021

falainber said:
That would be relevant if Stockfish used a Neural Engine (or its equivalent) on x86 CPUs but does it?

LeelaChessZero uses the GPU for its NN, and there's also a native M1 version. Its Elo rating is high and it has edged out wins against Stockfish 11 (I don't know how it fares against the improved Elo of Stockfish 12 and 13), but its nps rate looks to be relatively low.

https://github.com/LeelaChessZero/lc0

https://docs.google.com/spreadsheet...7Vul4DpRNfn6K8oeCjBILe6uA/edit#gid=1508569046

Bit of trivia: elite super-grandmaster human players have an Elo of ~2800 and Stockfish's Elo is ~3500+, but Tensor Processing Units (TPUs) have a godly Elo of 5000+. 4 TPUs > 176 GPUs. I don't see where a TPU can be bought, but you can rent them.

JMacHack · Apr 25, 2021

Congratulations, you’ve found a benchmark where the M1 loses handily to x86. Here’s your cookie.

xraydoc · Apr 25, 2021

Andropov said:
Pixelmator Pro uses Core ML for a few things too.

Doing an ML-based image super-resolution enlargement takes seconds to complete on the M1. It takes around 3-4 minutes on my dual core i7 2014 Mac mini.

EntropyQ3 · Apr 26, 2021

cmaier said:
Yes. SMT is a crutch for designs where the design is bad and you can’t keep the pipelines full another way.

That’s a bit harsh. 😀
You have to save some blame for the code they are asked to run. Can’t allow the software guys a clean getaway.

leman · Apr 26, 2021

I ran the benchmarks with the settings suggested by @mi7chy (stockfish bench 128 NUMCORES 24 default depth) with the following results (power usage checked using Apple's powermetrics):

M1 MBP 13":
~ 12000 knodes/s for 8 threads using 15W of power
~ 2500 knodes/s for 1 thread using 5W of power

Intel i9 16" MBP:
~ 13000 knodes/s for 16 threads using 65W of power
~ 1700 knodes/s for 1 thread using 35W of power

In a nutshell, the M1 with its 4 cores is about 10% slower than a top-shelf Intel mobile CPU with 8 cores while consuming roughly 80% less power. Looking at the linked https://openbenchmarking.org/test/pts/stockfish, the M1 is faster than Tiger Lake. 8-core desktop CPUs are considerably faster. Ah, and the M1 is about 50% faster in the single-threaded variant than a 4.8 GHz Intel Skylake refresh.

My conclusion: no, the M1's performance is not disappointing at all. Its performance is comparable to much larger (and 4-5x hotter!) CPUs.
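
For anyone who wants to check the arithmetic, here is a small sketch using only the rounded figures from this post. The helper names `knodes_per_watt` and `percent_change` are just for illustration.

```python
# (knodes/s, watts) as reported above; rounded, approximate measurements.
M1_MULTI = (12000, 15)
M1_SINGLE = (2500, 5)
I9_MULTI = (13000, 65)
I9_SINGLE = (1700, 35)


def knodes_per_watt(knodes_per_s, watts):
    """Throughput per watt: higher means more efficient."""
    return knodes_per_s / watts


def percent_change(value, baseline):
    """Signed difference of `value` relative to `baseline`, in percent."""
    return (value - baseline) / baseline * 100


if __name__ == "__main__":
    print(f"multi-thread speed: {percent_change(M1_MULTI[0], I9_MULTI[0]):+.1f}%")
    print(f"multi-thread power: {percent_change(M1_MULTI[1], I9_MULTI[1]):+.1f}%")
    print(f"single-thread speed: {percent_change(M1_SINGLE[0], I9_SINGLE[0]):+.1f}%")
    print(f"M1 efficiency: {knodes_per_watt(*M1_MULTI):.0f} knodes/s per watt")
    print(f"i9 efficiency: {knodes_per_watt(*I9_MULTI):.0f} knodes/s per watt")
```

With these inputs the M1 comes out about 8% slower multi-threaded at about 77% less power, about 47% faster single-threaded, and about 4x the multi-threaded throughput per watt (800 vs 200 knodes/s per watt).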

A few additional observations: AMD seems to do really well in Stockfish (for unclear reasons) and this benchmark really loves SMT. Stockfish seems to support Neon (ARM SIMD) but it's not really clear how it is utilized. The fact that it scales so well with SMT suggests to me that the SIMD code is suboptimal and could probably be improved to achieve better ILP.

leman · Apr 26, 2021

EntropyQ3 said:
That’s a bit harsh. 😀
You have to save some blame for the code they are asked to run. Can’t allow the software guys a clean getaway.

SMT is also great for code with long dependency chains and low inherent ILP - as long as you can multithread it of course. Interestingly enough, this seems to be the case with Stockfish.

mi7chy · Apr 26, 2021

leman said:
M1 is ... on par with 8-core Ryzen mobile CPUs (which will consume around 3x power for the duration of the test).

Where do you see an AMD mobile 8-core on openbenchmarking.org? Only one I see is 4500U which is 6-core 6-thread. 8-core 16-thread is 4800U or 5800U which are both 15W parts.

https://www.amd.com/en/products/apu/amd-ryzen-7-4800u

https://www.amd.com/en/products/apu/amd-ryzen-7-5800u

leman · Apr 26, 2021

mi7chy said:
Where do you see an AMD mobile 8-core on openbenchmarking.org? Only one I see is 4500U which is 6-core 6-thread.

Sorry, my mistake!

mi7chy said:
8-core 16-thread is 4800U or 5800U which are both 15W parts.

They are only 15W on paper (disregarding the fact that most "high" scores are for 25W configured laptops). For the short duration of this type of test, they will operate at PL2, which is over 40 watts. Which incidentally is the reason why these CPUs do that well in multi-core benchmarks: they essentially operate as "large" mobile/desktop CPUs. But they can only maintain this for a couple of minutes at most.

mi7chy · Apr 26, 2021

leman said:
They are only 15W on paper (disregarding the fact that most "high" scores are for 25W configured laptops). For the short duration of this type of test, they will operate at PL2, which is over 40 watts. Which incidentally is the reason why these CPUs do that well in multi-core benchmarks: they essentially operate as "large" mobile/desktop CPUs. But they can only maintain this for a couple of minutes at most.

That's actually false. Default power limit is 15W and configurable for 20 to 25W for the AMD U series CPUs. It can even go as low as 10W with core boost disabled. The beauty of it is the configurable power limit and performance: for example, you can have it set to 10W when on battery to maximize battery life and 20W when plugged in for performance. It never peaks over what it's configured for.

Andropov · Apr 26, 2021

mi7chy said:
That's actually false. Default power limit is 15W and configurable for 20 to 25W for the AMD U series CPUs. It can even go as low as 10W with core boost disabled. The beauty of it is the configurable power limit and performance: for example, you can have it set to 10W when on battery to maximize battery life and 20W when plugged in for performance. It never peaks over what it's configured for.

I haven't found any test for the 4800U/5800U, but the 5980HS surely does peak above its claimed 35W TDP at the start of the benchmarks exactly as @leman says:

[Image: chart from this AnandTech article.]

It would seem odd to market the 4800U/5800U as having a 15W TDP and then actually limit the power to that amount. Every other CPU I can think of peaks at a higher power consumption than the listed TDP.

leman · Apr 26, 2021

mi7chy said:
That's actually false. Default power limit is 15W and configurable for 20 to 25W for the AMD U series CPUs. It can even go as low as 10W with core boost disabled.

There are different power limits. PL1 (the one you are talking about) is the long-term sustained power limit. PL2 is the short-term power limit that can be active for a couple of minutes. In the default configuration most benchmarks will test the CPU performance under PL2. Which again is the main reason why these 8-core Ryzen parts do so well in benchmarks.
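
To put rough numbers on this, here is a toy model, not how any real firmware works: it assumes the CPU simply runs at PL2 for a fixed boost window and then drops to PL1. The 15 W and 40 W values are the ones mentioned in this thread; the 120-second boost window and the function name `average_power` are assumptions for illustration only.

```python
def average_power(run_seconds, pl1_watts, pl2_watts, boost_seconds):
    """Average power of a run that starts at PL2, then falls back to PL1.

    Simplified two-step model; real power management is more involved.
    """
    boosted = min(run_seconds, boost_seconds)
    sustained = run_seconds - boosted
    return (boosted * pl2_watts + sustained * pl1_watts) / run_seconds


if __name__ == "__main__":
    # Assumed values: PL1 = 15 W, PL2 = 40 W, boost window = 120 s.
    for seconds in (60, 600):
        watts = average_power(seconds, 15, 40, 120)
        print(f"{seconds:4d} s run: {watts:.1f} W on average")
```

Under these assumptions a one-minute benchmark runs entirely at 40 W, while a ten-minute run averages 20 W, which is why short benchmarks say little about sustained behaviour.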

mi7chy said:
The beauty of it is the configurable power limit and performance: for example, you can have it set to 10W when on battery to maximize battery life and 20W when plugged in for performance. It never peaks over what it's configured for.

And what scores are you getting when configuring PL1 and PL2 to 10 Watts?

hans1972 · Apr 26, 2021

Appletoni said:
Benchmark: Stockfish (chess) speed

M1 CPU = 13000 kn/s

i7 3930k overclocked (from 2011 = 10 years old) = 13000 kn/s

Others = 40000 kn/s - 80000 kn/s

Desktop CPUs = 230000 kn/s and much stronger

My solution: I live just a short walk from Magnus Carlsen.

mi7chy · Apr 26, 2021

hans1972 said:
My solution: I live just a short walk from Magnus Carlsen.

Not to take anything away from Magnus, but his Elo is ~2850 vs Stockfish's ~3800.

Appletoni · Apr 27, 2021

Apple M1 CPU speed is very disappointing

To answer some of the questions, take a look here:

To measure and compare engine speed, you can use a free chess program and click the analysis button at the beginning of a game (starting position). After 1 minute, click the button again to stop the analysis, note the kn/s, and run the test again with other engines on different hardware.
You only need to decide at the beginning whether you want to compare tests with one core or several, or maybe something like 2 vs 4 cores.

You will never see exactly the same kn/s twice, but that is not important for the comparison.
Stockfish:
kn/s: 1234 (only the first digit (here 1xxx) is important to compare). On other hardware it could be 2000 (2xxx).
kn/s: 12345 (only the first + second digits (here 12xxx) are important to compare).
kn/s: 123456 (only the first + second + third digits (here 123xxx) are important to compare).
The last three digits (xxx) are always different, even on exactly the same hardware.
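
The rule above amounts to dropping the last three digits before comparing. A minimal sketch, with made-up helper names `comparable` and `same_speed`:

```python
def comparable(kn_per_s):
    """Drop the last three digits of a kn/s reading, which vary run to run."""
    return kn_per_s // 1000


def same_speed(a, b):
    """True if two kn/s readings agree once the noisy digits are dropped."""
    return comparable(a) == comparable(b)


if __name__ == "__main__":
    for reading in (1234, 12345, 123456):
        print(f"{reading} kn/s -> compare as {comparable(reading)}xxx")
```

So two runs at 12345 and 12987 kn/s count as the same speed, while 12345 and 13010 kn/s do not.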

Hardware speed:

Ipman Chess

Measure speed:
Build Tester v.1.4.7.0