Stockfish patch to improve performance on M1 by up to 80%.

tomO2013 · Dec 13, 2021

Screen Shot 2021-12-13 at 8.19.34 PM.png

Pulled down the latest release of Blender and took it for a test drive on an M1 Max (32-core GPU) running Monterey 12.1.

Executed a quick render on GPU.

Initial GPU compute support in Cycles is available in the nightly builds.

Worked without crashing for me on the nightly build that I used. Interestingly, the CPU and GPU were not showing as fully utilized in Activity Monitor when I ran it. Not sure what's going on there, but there you go!

mi7chy · Dec 13, 2021

tomO2013 said:
View attachment 1928044
Pulled down the latest release of Blender and took it for a test drive on an M1 Max (32-core GPU) running Monterey 12.1.

Executed a quick render on GPU.

Initial GPU compute support in Cycles is available in the nightly builds.

Worked without crashing for me on the nightly build that I used. Interestingly, the CPU and GPU were not showing as fully utilized in Activity Monitor when I ran it. Not sure what's going on there, but there you go!

Interesting result. I was expecting around 36s when scaling from the 24-core GPU (48s) to the 32-core GPU. Added your time to the Blender thread.

https://forums.macrumors.com/threads/3d-rendering-on-apple-silicon-cpu-gpu.2269416/post-30676733
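The 36s expectation above is what ideal linear scaling with GPU core count would predict. A minimal sketch of that arithmetic, assuming render time scales inversely with the number of GPU cores (the function name is only illustrative, and real renders rarely scale perfectly):

```python
def expected_render_time(base_seconds, base_cores, new_cores):
    """Render time predicted if time scales inversely with GPU core count."""
    return base_seconds * base_cores / new_cores


# 48 s measured on the 24-core GPU, extrapolated to the 32-core GPU
predicted = expected_render_time(48.0, 24, 32)
print(f"{predicted:.0f} s")  # 36 s
```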

jeanlain · Dec 13, 2021

mi7chy said:
Monterey has broken OpenCL apps.

Which apps?

BTW, why are we discussing blender in this thread? There's another thread about 3D rendering.

Appletoni · Dec 13, 2021

jeanlain said:
Which apps?

BTW, why are we discussing blender in this thread? There's another thread about 3D rendering.

LC0

Appletoni · Dec 19, 2021

solouki said:
Hi all,

You all are probably already familiar with these results, and I apologize for wasting your time if you are, but I found some of the implications intriguing.

So, first of all, I want to thank Sopel for his fine optimization work on Stockfish.

I decided to benchmark Stockfish 14.1 as well as Sopel's neon_opt version that I compiled with PGO on a few different benchmarks both with and without nnue and whose run times spanned 15 to 42 minutes on three different laptops: a 2019 15" Intel i9 MBP, a 2020 13" Apple M1 MBP, and a 2021 16" Apple M1 Max MBP.

The following PDF includes the summary results of these benchmarks. Both values as well as relative ratios of values are provided. [If you can't read the red text below, then the tables at the end include measurements and relative ratios separated by a colon. For instance, the typical units are (nps:rel), which denotes "nodes per second:relative ratios".]
View attachment 1928161
View attachment 1928162

Some of the things to notice in the above tables:

(1) Sopel's optimizations increase the nps for both Apple Silicon as well as for Intel Silicon.

(2) The 2020 MBP 13" with the Apple M1 uses much less power than the 2021 MBP 16" Apple M1 Max, especially relative to its lower nps values.

(3) While the Package Power measurements are not directly comparable between Apple Silicon and Intel, since the Intel power measurements do not include DRAM, GPUs, neural engines, etc. that are included in the Apple Silicon measurements, the Apple Silicon package powers are still significantly lower than Intel's while at the same time providing comparable (M1) or superior (M1 Max) nps values. [More on this topic to follow.]

(4) In my hands, the nps results are not much affected by hash table sizes between 1024 MB and 8192 MB, but I decided to use 8192 MB tables in all of the above benchmark runs.

(5) For the optimized code without nnue at the maximum number of threads for each CPU, the M1 Max had nps values 2.04 times larger than the M1 while the Intel i9 had nps values 1.04 times higher than the M1.
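The "nps:rel" pairs in the tables are just each machine's nodes per second divided by a baseline machine's value. A small sketch of that calculation follows; the nps numbers in it are placeholders chosen to reproduce the 2.04 and 1.04 ratios from point (5), not the measured values from the tables.

```python
def relative_ratios(nps_by_machine, baseline):
    """Divide every machine's nps by the baseline machine's nps."""
    base = nps_by_machine[baseline]
    return {name: round(nps / base, 2) for name, nps in nps_by_machine.items()}


# Placeholder nps values (NOT the measured ones), picked only to match the quoted ratios
placeholder_nps = {
    "M1": 10_000_000,
    "M1 Max": 20_400_000,
    "Intel i9": 10_400_000,
}

for machine, ratio in relative_ratios(placeholder_nps, "M1").items():
    print(f"{machine}: {ratio}")
```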

Even more intriguing were the temperatures, fan speeds, and CPU histories. The following figures show these.

View attachment 1927699
The above figure shows the 2020 MBP 13" M1's CPU histories for 8 threads (left-hand diagram), for 4 performance threads (middle diagram), and 1 thread (right-hand diagram). These histories are as expected, but more intriguing are the corresponding temperatures and fans, as shown below:

View attachment 1927701
The above figure shows the 2020 MBP 13" M1's temperatures and fan speeds for 8, 4, and 1 threads. Notice that the fan speed never reaches its maximum rate, even for the 8 thread benchmark.

Now for the 2021 MBP 16" M1 Max:

View attachment 1927705
The above histories for the 2021 MBP 16" M1 Max are for 10 threads, 8 performance threads, and 1 thread (left to right diagrams). Interestingly, Stockfish, when utilizing just 1 thread, only appeared to use the P0-Cluster (3 to 6) and not the P1-Cluster (7 to 10) of CPUs.

View attachment 1927709
The above figure shows the corresponding temperatures and fan speeds. Notice that the CPU temperatures for the performance cores basically maxed out at 100C, but the fans never got above about 41% of their maximum RPM speeds. On the other hand, when I set the fans to their maximum RPM speeds and ran the 10 thread benchmark, the following temperatures were found:
View attachment 1927713
So notice in the above figure that when the fans are set to their maximum RPMs, the CPU temperatures fall to 80C or less.

And on to the 2019 MBP 15" Intel i9:

View attachment 1927714
The above CPU histories are for the 2019 MBP 15" Intel i9 laptop using 16, 8, and 1 thread (left to right diagrams). Notice that when using just 8 Stockfish threads, the hyperthreaded character of the Intel architecture was not employed, as expected. This was also true for the benchmark using just 1 Stockfish thread.

View attachment 1927733
The above figure shows the temperatures and fan speeds for 16, 8, and 1 Stockfish thread running the benchmarks on the 2019 MBP 15" Intel i9 laptop. Notice that the fans were pegged at their maximum RPM values for both the 16 and 8 thread benchmarks, unlike these benchmarks running on Apple Silicon.

Some things to notice about the above temperature and fan speed results:

(1) Stockfish does not maximize the fans' RPM speeds on Apple Silicon for the benchmarks that I ran. [I know that it is possible to maximize the fans' speeds, however, as my own QFT/perturbation optimization code does it -- it is just that Stockfish is apparently not testing the Apple Silicon laptops to their maximum power performance. It's probably that I just don't know how to get Stockfish to do so, i.e., I'm sure it's my fault.]

(2) The Apple Silicon laptops were essentially "quiet" during these benchmark runs while the Intel i9 laptop sounded like a jet engine (its usual fan noise, in other words).

Sorry about such a long and tedious post; I apologize.

Solouki

Edit: Replaced the PDF results file with images.

Don’t forget that the MacBook Pro 16-inch M1 Max has only 8 TB SSD, only 64 GB RAM and only 10 cores.
Stockfish can use 65536 cores and 33554432 MB RAM (33554 GB RAM) (33 TB RAM).
Even the Stockfish syzygy endgame tablebases need more than 16 TB SSD, so even if Apple would use 2x 8 TB SSD that wouldn’t be enough.
But I still hope that Sopel and other people will find the mysterious improvements to get +300% faster Stockfish kn/s.
Somehow they need to catch up with, or be faster than, Intel and AMD.
Do you have any ideas what could be improved or what to look at on Apple devices to make the Slowfish faster?

lcubed · Dec 19, 2021

Appletoni said:
But I still hope that Sopel and other people will find the mysterious improvements to get +300% faster Stockfish kn/s.
Somehow they need to catch up with, or be faster than, Intel and AMD.
Do you have any ideas what could be improved or what to look at on Apple devices to make the Slowfish faster?

Stockfish is an extremely niche case. I imagine that Apple is concentrating on avenues that are broader and of wider application, ones that drive the sale of more than a handful of new machines.

solouki · Dec 19, 2021

Hi all,

I've had a couple requests for further information, so I am substituting a PDF document with more details for this post.

EDIT: Corrected a left-hand/right-hand mixup. Thanks.

EDIT: Recompiled Stockfish 14.1 with PGO to see what difference this makes over the Homebrew Stockfish 14.1 executable.

EDIT: Since the PDF was tl;dr, and the requests for more information were satisfied, I wanted to take this post back to the original one having just the graphs.

So, SF:14.1 is the Homebrew Stockfish 14.1 executable while SF:NEON_opt is Sopel97's optimized version compiled with PGO. These benchmarks were run with a hash table size of 8192MB, a depth of 28, and the thread counts listed. These runs all took between 15 and 55 minutes.
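For anyone wanting to try something similar, here is a rough sketch of how runs with these settings could be scripted. It assumes Stockfish's built-in `bench` command with its positional arguments (hash size in MB, threads, depth) and a summary line starting with "Nodes/second"; the `./stockfish` path is hypothetical, and this is not the bash script from the PDF.

```python
import re


def bench_command(executable, hash_mb, threads, depth):
    """Argument list for a `bench <hash MB> <threads> <depth>` invocation."""
    return [executable, "bench", str(hash_mb), str(threads), str(depth)]


def parse_nps(bench_output):
    """Extract the 'Nodes/second' figure from the bench summary text."""
    match = re.search(r"Nodes/second\s*:\s*(\d+)", bench_output)
    if match is None:
        raise ValueError("no 'Nodes/second' line found in bench output")
    return int(match.group(1))


# One command per thread count used on the M1 Max (executable path is hypothetical)
for threads in (10, 8, 1):
    print(" ".join(bench_command("./stockfish", 8192, threads, 28)))
```

The resulting list can be handed to `subprocess.run(..., capture_output=True, text=True)`; in the builds I have seen the summary is written to stderr, so it is worth checking both output streams before parsing.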

Below are the temperatures for the M1 Max at thread counts of 10, 8, and 1.

MBP16-M1Max-Stockfish-8192-10.8.1-28-TG-Pro.png

If the fans on the M1 Max are run at maximum speeds, the temperatures are given below.

On the M1 Max, if Stockfish is run to maximize the CPU usage and lc0 is run to maximize the GPU usage in a machine-on-machine chess GUI, the temperatures are as shown below.

Regards,
Solouki

Sopel · Jan 13, 2022

solouki said:
Hi all,

I've had a couple requests for further information, so I am substituting a PDF document with more details for this post.

EDIT: Corrected a left-hand/right-hand mixup. Thanks.
[...]
Regards,
Solouki

The results don't make any sense. How did neon_opt result in a better score on x86 CPUs if nothing changed for them?

januarydrive7 · Jan 13, 2022

Sopel said:
The results don't make any sense. How did neon_opt result in a better score on x86 CPUs if nothing changed for them?

I believe the non-neon-opt versions are using the Homebrew version of Stockfish, which, iirc, does not have up-to-date optimizations for x86.

solouki · Jan 13, 2022

Sopel said:
The results don't make any sense. How did neon_opt result in a better score on x86 CPUs if nothing changed for them?

Hi Sopel,

I found that sort of strange myself, but I'm not the one to answer this issue as it is not my code. I did recompile the neon_opt source code using PGO for the x86 and M1 on both the M1 Max and on the x86 (version numbers are listed in the text), so perhaps PGO makes a difference? I assume the Homebrew Stockfish 14.1 was not compiled with PGO. (Perhaps I should have recompiled Stockfish 14.1 with PGO also, but I wanted to use a readily available, without recompiling, version so this is why I chose Homebrew Stockfish 14.1.)

My PDF file lists absolutely everything I did, including the commands to compile the neon_opt executables and run the benchmarks. The results I tabulate and plot are what they are ... and I even ran the benchmarks multiple times, each time after a fresh reboot, and the results were always within 5% of each other. In addition, the full specifications of the six Apple computers are also given in the PDF file, and I used the same bash script on each machine to run the benchmarks (with the appropriate executable, of course), time the runs, collect the results, and plot the data.
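As an illustration of the "within 5% of each other" check, a minimal sketch is given below. It measures the gap between the fastest and slowest run relative to the slowest; the run values are placeholders, not the actual measurements, and the exact definition used for the 5% figure is an assumption.

```python
def spread_percent(nps_runs):
    """Gap between fastest and slowest run, as a percentage of the slowest."""
    lowest, highest = min(nps_runs), max(nps_runs)
    return (highest - lowest) / lowest * 100.0


# Placeholder nps values for three repeated runs (NOT measured data)
runs = [10_000_000, 10_200_000, 10_400_000]
spread = spread_percent(runs)
print(f"spread: {spread:.1f}%")  # spread: 4.0%
print("within 5%" if spread <= 5.0 else "more than 5% apart")
```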

So, since you wrote the neon_opt code, how would PGO compilation of neon_opt lead to the better scores on x86?

Thank you,
Solouki

P.S. While I have never coded for chess engines, I do significant coding for scientific applications. I may examine your neon_opt code, but I haven't taken the time to do so yet.

Sopel · Jan 13, 2022

homebrew might not have specialized builds for different architectures, idk how it works. That definitely solves the mystery though.

leman · Jan 13, 2022

Homebrew Stockfish is significantly slower than a manually compiled version with PGO on Intel hardware. I tried it and can confirm it.

solouki · Jan 13, 2022

leman said:
Homebrew Stockfish is significantly slower than a manually compiled version with PGO on Intel hardware. I tried it and can confirm it.

Sopel said:
homebrew might not have specialized builds for different architectures, idk how it works. That definitely solves the mystery though.

Thanks all, I appreciate the fast answers. As I mentioned, I chose Homebrew Stockfish 14.1 not for its speed but rather because it was a readily available version of Stockfish that anyone can obtain without having to compile it from source.

I compiled the neon_opt code with PGO because I wanted to show what the currently optimized version and thus perhaps the fastest version of Stockfish was capable of achieving on the M1 Max compared with the other CPUs.

Solouki

leman · Jan 13, 2022

solouki said:
Thanks all, I appreciate the fast answers. As I mentioned, I chose Homebrew Stockfish 14.1 not for its speed but rather because it was a readily available version of Stockfish that anyone can obtain without having to compile it from source.

I compiled the neon_opt code with PGO because I wanted to show what the currently optimized version and thus perhaps the fastest version of Stockfish was capable of achieving on the M1 Max compared with the other CPUs.

Solouki

If you want to compare performance between different CPUs, it would make sense to choose compiler options that produce the fastest result on that particular CPU. Otherwise there are too many confounding variables.

solouki · Jan 13, 2022

leman said:
If you want to compare performance between different CPUs, it would make sense to choose compiler options that produce the fastest result on that particular CPU. Otherwise there are too many confounding variables.

Hi leman,

I totally agree that it would make sense to choose the compiler options for the fastest results on each CPU, but I don't code chess engines and especially not Stockfish, so I don't know what compiler options and/or source codes are the fastest for any particular CPU. Thus I chose the neon_opt code (as a reasonably optimized Stockfish) and compiled it with PGO for both Apple Silicon and Intel and ran benchmarks on these executables. I was assuming that this would yield fairly good performances, but probably not the best, on both Apple Silicon and on Intel CPUs. The back row of candlesticks in my Figure 60.9.1 plots these comparisons.

I was not attempting to obtain the fastest possible results on each CPU, rather I was just attempting to compare CPUs with relatively optimized codes/compilations.

In a nutshell, I don't have the expertise with the Stockfish code necessary to perform the benchmarks for the absolute best results using the best optimized codes and compilations. And I'd have to study the Stockfish code to be able to do this -- while this would be fun, I personally don't really have the time it would take me to do so. Perhaps someone familiar with the Stockfish source code would be in a better position to carry out these tests?

Thank you,
Solouki

Sopel · Jan 13, 2022

`make profile-build ARCH=...` is enough to get close to max possible performance as long as you use a recent (>9.1) gcc. For available archs one can look into the makefile.
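A small sketch of how the build command could be assembled per machine. The ARCH names in the mapping (`apple-silicon`, `x86-64-avx2`) and the `COMP=` variable are assumptions based on my recollection of the Stockfish makefile and are not stated in this thread; `make help` in the `src` directory lists what a given checkout actually supports.

```python
import platform

# Assumed mapping from platform.machine() to a makefile ARCH target.
# Verify the names with `make help` before relying on them.
ARCH_BY_MACHINE = {
    "arm64": "apple-silicon",
    "x86_64": "x86-64-avx2",
}


def profile_build_command(machine=None, comp=None):
    """Build the `make profile-build` argument list for the given machine type."""
    machine = machine or platform.machine()
    arch = ARCH_BY_MACHINE.get(machine)
    if arch is None:
        raise ValueError(f"no ARCH mapping for machine type {machine!r}")
    command = ["make", "profile-build", f"ARCH={arch}"]
    if comp:
        command.append(f"COMP={comp}")  # e.g. gcc or clang
    return command


print(" ".join(profile_build_command("arm64")))
print(" ".join(profile_build_command("x86_64", comp="gcc")))
```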

solouki · Jan 13, 2022

Sopel said:
`make profile-build ARCH=...` is enough to get close to max possible performance as long as you use a recent (>9.1) gcc. For available archs one can look into the makefile.

Thanks for the information.

I did look at the makefile to find the proper ARCH and used a >9.1 gcc.

Edit: Besides gcc 11.2+, I also compiled with the Apple clang version 13.0.0 (clang-1300.0.29.30) on the M1 Max, with no significant difference for neon_opt's NPS.

EDIT: I also went back and compiled Stockfish 14.1 with PGO and found between a 5% and 11% increase in NPS over the precompiled Homebrew Stockfish 14.1.

Leifi · Jan 14, 2022

Not a big surprise that new Macs beat old Macs in performance... The annoying part is that expensive new Macs get whupped by older AMD 8-cores that cost many hundreds of dollars less than the Apple Silicon based laptops. And surely Macs are beaten by both Alder Lake and AMD options in the high-end laptop space currently.

Leifi · Jan 15, 2022

leman said:
I am not doubting any of this. However, the link I provided above clearly shows that Rog Duo maintains over 4Ghz running Prime95, with CPU cluster power consumption over 70 watts and full laptop power of 130W. That is not 3.3Ghz. So if someone running on an overclocked laptop like this gets 20m nodes/second in Stockfish — 80% more than a "normal" 4900H, I am not surprised. These are numbers well expected from Zen3 running at 4.2-4.3Ghz. Just not at 45W.

Numbers are in the notebookcheck article, did you read it at all? This has nothing to do with GPU, Prime95 does not use the GPU.

Furthermore, the 35W power draw you keep throwing around for the M1 Max (during a full SF bench) is something you should back up with some real evidence.

Some tests have seen the 10-core MacBooks draw almost 120W, and as most of the computing chipset is Apple silicon, most of the power draw obviously comes from the M1 CPU.

So make sure you compare the actual power draw from the compute nodes on the CPU before jumping to some Apple-biased conclusions here.

ref.

MacBook Pro 2021 M1 Max: Power Consumption – Low And High Power Mode

Much has been said and written about the new MacBook Pro with M1 Pro and M1 Max. A new test series now takes a look at how many watts the 16-inch notebook draws from the power socket - during everyday tasks as well as under full load.

www.mtech.news

leman · Jan 15, 2022

Leifi said:
Furthermore, the 35W power draw you keep throwing around for the M1 Max (during a full SF bench) is something you should back up with some real evidence.

Here, happy? That's the latest Stockfish trunk.

Leifi said:
Some tests have seen the 10-core MacBooks draw almost 120W, and as most of the computing chipset is Apple silicon, most of the power draw obviously comes from the M1 CPU.

Utter nonsense. The most you'll see an M1 CPU cluster consume is around 20W (5W*4 cores) at its maximal clock. That's 40W for 8 P-cores; in practice it's lower, since they do not maintain the full 3.2 GHz under sustained load (and their power consumption at a sustained 3 GHz is closer to 4.5W per core).
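The arithmetic behind those figures, as a quick sketch. It uses only the per-core numbers quoted above (5W at maximal clock, about 4.5W sustained) and assumes cluster power is simply the per-core power times the number of active P-cores.

```python
def cluster_power_watts(cores, watts_per_core):
    """CPU cluster power, assuming it is per-core power times active core count."""
    return cores * watts_per_core


peak_one_cluster = cluster_power_watts(4, 5.0)   # 20.0 W for one 4-core P-cluster
peak_eight_cores = cluster_power_watts(8, 5.0)   # 40.0 W for all 8 P-cores
sustained_eight = cluster_power_watts(8, 4.5)    # 36.0 W at the sustained clock

print(peak_one_cluster, peak_eight_cores, sustained_eight)  # 20.0 40.0 36.0
```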

The only way you are getting even close to 120W is full laptop power consumption with display set to max and CPU+GPU+RAM torture test.

I have personally never seen the M1 Max go over 70W (that's CPU+GPU+RAM), even when playing games. Here is the power metric output when running Pathfinder: Wrath of the Righteous (4K, highest settings).

Romain_H · Jan 15, 2022

Leifi said:
Furthermore, the 35W power draw you keep throwing around for the M1 Max (during a full SF bench) is something you should back up with some real evidence.

Some tests have seen the 10-core MacBooks draw almost 120W, and as most of the computing chipset is Apple silicon, most of the power draw obviously comes from the M1 CPU.

So make sure you compare the actual power draw from the compute nodes on the CPU before jumping to some Apple-biased conclusions here.

ref.

MacBook Pro 2021 M1 Max: Power Consumption – Low And High Power Mode

Much has been said and written about the new MacBook Pro with M1 Pro and M1 Max. A new test series now takes a look at how many watts the 16-inch notebook draws from the power socket - during everyday tasks as well as under full load.

www.mtech.news

Speaking of evidence, what are the „some tests“ you are speaking of?

Leifi · Jan 15, 2022

Was the link too hard for you to click?

leman · Jan 15, 2022

Leifi said:
Was the link too hard for you to click?

The link you’ve posted contains no useful information, just some numbers with vague explanations.

Leifi · Jan 15, 2022

leman said:
Here, happy? That's the latest Stockfish trunk.

View attachment 1944389

So you compare the full chipset-draw of an AMD with an 8-core compute-core only draw of the M1?

It seems the argument has moved from "the software is not optimized" to "power consumption at 5nm, at the expense of top performance compared to bigger-die chipsets, is king" (desperate to post a "win" for the M1, right?).

I also am a bit puzzled by your picture posted showing a "package power" of 40.3W...

The nps shown in your picture seems much lower than what has been posted for the M1 Max earlier in this thread (11M vs 16M, or 20M w/o nnue). Can you explain this low nps in that picture, please?
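Since the disagreement is about performance versus power, one way to put both numbers on the same footing is nodes per joule (nps divided by package watts). The sketch below uses only the approximate figures mentioned in this post (about 11M nps at 40.3W package power); it is an illustration, not a measurement, and package power is not measured the same way on every platform.

```python
def nodes_per_joule(nps, package_watts):
    """Search efficiency: nodes searched per joule of package energy."""
    if package_watts <= 0:
        raise ValueError("package power must be positive")
    return nps / package_watts


# Approximate figures read off the posted picture: ~11M nps at 40.3 W
efficiency = nodes_per_joule(11_000_000, 40.3)
print(f"{efficiency:,.0f} nodes per joule")
```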

Leifi · Jan 15, 2022

leman said:
The link you’ve posted contains no useful information, just some numbers with vague explanations.

It provides measurements of how much the macbook draws under load compared to low-load. Probably at least as useful as the links you posted about Asus laptop power-draw?
