Apple M1 CPU & GPU speed is very disappointing

hans1972 · Nov 30, 2021

Leifi said:
What makes you think no one have done code analyses of Stockfish code? And that there are any specific issues for Apples ARM version.. StockFish has had ARM versions for years... and even special assembler and C versions for ARM ....

That's the problem. They are treating the M1 Mac as an ARM CPU.

If you want performance on the M1 Mac, you must treat as a macOS applications using relevant macOS APIs.

I just briefly looked at some of the source code and it seems to be just low-level C++ code without almost any platform specific coding.

hans1972 · Nov 30, 2021

Sopel said:
Apple users: This workload is bad on M1 because it's for ****ing peasants. Who would even use a chess engine? PATHETIC
Also apple users: People laugh at us because they are envious!

btw. I'll let you know that Stockfish is optimized for M1, we have manually written NEON code and compiler option to optimize for apple-silicon. If you think you can do better the burden of proof is on you.

What kind of macOS specific APIs are you using?

Accelerate and ML Compute?
Metal?
Grand Central Dispatch?

cmaier · Nov 30, 2021

Appletoni said:
Maybe not 400W but look at other high end devices. The consumption can be very high. I believe that Apple can build 40 performance cores inside a 16, 18 and 20-inch MacBook Pro Plus with better passive cooling and better fans. The new performance cores will be 3nm.

The performance cores will be 3nm, but the efficiency cores will be 5nm?

bcortens · Nov 30, 2021

Leifi said:
I guess the M1 at least wins the personal-attack benches here anyway

You dismissed benchmarks you perceive to have been optimized for Apple Silicon and then post as your example of a good and honest benchmark one which has clearly had more optimization time put in for x86_64.
That is not a personal attack but a pointing out of a double standard.

Leifi · Nov 30, 2021

throAU said:
That's not a desktop CPU, its a datacenter part

Apart from the price ordering one or two of these in a eeb motherboard and a desktop-case is no problem.

Plenty of workstations with XEON "server" CPUs around.

bcortens · Nov 30, 2021

Leifi said:
Please go read the referenced links to talkchess.com discussion.. Apples GPU is indeed used (trough OpenCL libraries).. Metal is not possible to use effeciently for these tensor AI backends.

Why not?
Metal compute has replaced OpenCL for many peoples compute workloads, why is this particular use case special?

Taz Mangus · Nov 30, 2021

hans1972 said:
Become Mac developers and develop using a Mac

Don't use the architecture which works on Intel/AMD/NVidia and Windows/Linux

Use Xcode and don't rely on cross-plattform C/C++ too much (they tend not to be optimised for the Mac at all)

Use Grand Central Dispatch, Metal, Accelerate and ML Compute

Don't use OpenCL

Don't use TensorFlow (yet)

I don't think anyone would know how much it would help, since none are doing it the correct way for the Mac.

Apple has released a TensorFlow plugin that uses native Metal API.

Tensorflow Plugin - Metal - Apple Developer

Accelerate the training of machine learning models with TensorFlow right on your Mac.

developer.apple.com

leman · Nov 30, 2021

hans1972 said:
If you want performance on the M1 Mac, you must treat as a macOS applications using relevant macOS APIs.

I just briefly looked at some of the source code and it seems to be just low-level C++ code without almost any platform specific coding.

This is certainly not correct. Sure, platform APIs might help - and sometimes they can even deliver better performance by offloading work to specialized hardware, but you do not need to use them to achieve good performance on M1, which already has a very fast CPU. After all, they don’t use any special acceleration on x86 either. The problem is not lack of use of platform APIs, the problem is that the code is likely not properly tested/optimized on M1 specifically.

hans1972 · Nov 30, 2021

Leifi said:
I think you have missed the mark on benchmarking overall.. A benchmark is not "faulty" beacuse it is not specifically tweaked for a certain vendor..

Stockfish isn't a benchmark about CPUs ang GPUs. Its purpose is to calculate chess moves.

Making a general statement about the M1 SOC based on this very specific program isn't smart. What it seems to me, by looking just briefly at the source code, is that they didn't use the tools which are available to Mac developers to make an effective program on the Mac.

You have the same with Adobe Premiere and Final Cut Pro. If you use Premiere as a benchmark for the M1 you will get worse result than using Final Cut Pro. Why? Final Cut Pro is optimised for macOS.

What would be much more interesting is two chess programs where one is build, architectured and optimised for Winter and one is build, architecture and optimised for M1 Mac and macOS, and comparing those.

In my experience, cross plattform programs tends to perform badly on Macs.

januarydrive7 · Nov 30, 2021

leman said:
This is certainly not correct. Sure, platform APIs might help - and sometimes they can even deliver better performance by offloading work to specialized hardware, but you do not need to use them to achieve good performance on M1, which already has a very fast CPU. After all, they don’t use any special acceleration on x86 either. The problem is not lack of use of platform APIs, the problem is that the code is likely not properly tested/optimized on M1 specifically.

In addition, from a brief glance, there are at least some optimization efforts tailored specifically to Linux and Windows environments that simply don't exist for macOS: e.g., misc.cpp has paths to set page sizes to be very large (~2mb) for linux/windows, but uses standard 4kb pages for other environments, including macOS.

januarydrive7 · Nov 30, 2021

hans1972 said:
In my experience, cross plattform programs tends to perform badly on Macs.

In general, poorly architected cross-platform programs might perform poorly on whatever platform is not properly tested.

Also, in general, properly architected cross-platform programs should perform well on all platforms. If there are specific low-level optimization paths for some data paths, then those optimized paths will, in general, run better than non-optimized data paths.

Homy · Nov 30, 2021

Leifi said:
I am quite disappointed in Apple Silicon's real-life performance in open source chess software. Despite all the hype, it gets hammered by cheaper older 7nm CPUs in open source chess benchmarks. And of course, it is nowhere near a modern Nvidia+Amd or Nvidia+AMD laptop these days performance-wise in open source chess software.

Fixed that for you.

throAU · Nov 30, 2021

Leifi said:
Apart from the price ordering one or two of these in a eeb motherboard and a desktop-case is no problem.

Plenty of workstations with XEON "server" CPUs around.

These aren't the regular socket 2066 Xeons mate.

Go find a 12 memory channel desktop board with the correct socket for the 9282. Find it retail, and post the price/link to buy it here.

I'll wait.

edit:
I'll even make it easier. If you can't find a board - find an OEM workstation using one of these bad boys with a link to buy it. Doesn't even have to be an ATX motherboard.

Leifi · Nov 30, 2021

bcortens said:
You dismissed benchmarks you perceive to have been optimized for Apple Silicon and then post as your example of a good and honest benchmark one which has clearly had more optimization time put in for x86_64.
That is not a personal attack but a pointing out of a double standard.

I think it is a fair statement that there should be a similar level of optimization if different hw is going to be benched for software.

But I don't think it is fair to equate optimizing for documented instruction-sets, generic architectures, or using compiler directives, reasonably sized code tweaks with optimizing by completely redesigning or rewriting the benchmark more or less from scratch with a completely different non-compatible toolset, no cross-platform support, only proprietary stuff, no Open-standards, etc.. for completely re-engineered software for comparison.

It is also a value proposition. To do a huge effort to optimize for a proprietary platform with tweaks and optimizations that are completely useless outside a proprietary locked-in system. The evidence that the gains by doing this would have to be big enough to justify the effort also becomes crucial, much greater, compared to more open, and developer-friendly systems.

I think the main takeaway from this long thread is that no one was able to provide any tangible indication that the performance increase would be even close to being worth the effort to tweak chess code for M1 macs as they are so far behind, to begin with running basic cross-platform ARM, x86 code or even base c code compiled with native directives.

bcortens · Nov 30, 2021

Leifi said:
I think it is a fair statement that there should be a similar level of optimization if different hw is going to be benched for software.

But I don't think it is fair to equate optimizing for documented instruction-sets, generic architectures, or using compiler directives, reasonably sized code tweaks with optimizing by completely redesigning or rewriting the benchmark more or less from scratch with a completely different non-compatible toolset, no cross-platform support, only proprietary stuff, no Open-standards, etc.. for completely re-engineered software for comparison.

It is also a value proposition. To do a huge effort to optimize for a proprietary platform with tweaks and optimizations that are completely useless outside a proprietary locked-in system. The evidence that the gains by doing this would have to be big enough to justify the effort also becomes crucial, much greater, compared to more open, and developer-friendly systems.

I think the main takeaway from this long thread is that no one was able to provide any tangible indication that the performance increase would be even close to being worth the effort to tweak chess code for M1 macs as they are so far behind, to begin with running basic cross-platform ARM, or x86 code.

The point I’m making is that stockfish is not just basic cross platform code but features more optimizations for x86_64 than it does for M1.
Many of the benchmarks posted like SPEC from Anandtech, and Cinebench, which show the M1 ahead or at par with the competition, are at least maintained adequately for both Apple Silicon and x86_64.
one of the supposed equal footing benchmarks you posted has had no updates in 4 years for its arm branch, that is not equal effort.

Time and money, yes, that is precisely why people look to proprietary software that costs money, because it is maintained.

Homy · Nov 30, 2021

No Stockfish here, no M1. It seems that the best way to be great at chess is to actually play it using brain power and wooden chess pieces, instead of wasting your time on internet complaining about benchmarks and hardware you don't know and have no interest in.

Firouzja Alireja becomes the youngest chess player to cross 2800

The European Team Championship has finished with Ukraine getting their first gold medals. The above the fold story of the event though is Alireza Firouzja

worldchess.com

JimmyjamesEU · Nov 30, 2021

Leifi said:
I think it is a fair statement that there should be a similar level of optimization if different hw is going to be benched for software.

But I don't think it is fair to equate optimizing for documented instruction-sets, generic architectures, or using compiler directives, reasonably sized code tweaks with optimizing by completely redesigning or rewriting the benchmark more or less from scratch with a completely different non-compatible toolset, no cross-platform support, only proprietary stuff, no Open-standards, etc.. for completely re-engineered software for comparison.

It is also a value proposition. To do a huge effort to optimize for a proprietary platform with tweaks and optimizations that are completely useless outside a proprietary locked-in system. The evidence that the gains by doing this would have to be big enough to justify the effort also becomes crucial, much greater, compared to more open, and developer-friendly systems.

I think the main takeaway from this long thread is that no one was able to provide any tangible indication that the performance increase would be even close to being worth the effort to tweak chess code for M1 macs as they are so far behind, to begin with running basic cross-platform ARM, x86 code or even base c code compiled with native directives.

Once again you make claims without providing a single chess benchmark to back your outlandish claims. Not even a hashcat run. Mi7chy would be embarrassed. For shame.

throAU · Nov 30, 2021

Leifi said:
I think the main takeaway from this long thread is that no one was able to provide any tangible indication that the performance increase would be even close to being worth the effort to tweak chess code for M1 macs as they are so far behind, to begin with running basic cross-platform ARM, x86 code or even base c code compiled with native directives.

I think the main takeaway from this is that you're living in fantasy land, with thoughts that optimisation can't make significant improvements, and that a 40 core laptop cpu is currently feasible at any sort of comparable price point, form factor or thermal envelope.

I mean even for current easily demonstrated examples.

Download 3dmark
Run the mesh shaders demo on hardware with mesh shader support and compare the mesh shaders on vs. off case.

on my 6900xt I get something way better than 10x performance improvement. from memory its about 30x but its been several months since I ran it. and that's ONE new hardware feature - in a contrived use case specifically to demonstrate the difference, but still. no different to your chess benchmark in that respect.

mr_roboto · Nov 30, 2021

throAU said:
I think the main takeaway from this is that you're living in fantasy land, with thoughts that optimisation can't make significant improvements, and that a 40 core laptop cpu is currently feasible at any sort of comparable price point, form factor or thermal envelope.

It's the other one who's talking about 40 core laptop CPUs. Leifi is slightly more grounded in reality, but only slightly.

Leifi · Nov 30, 2021

throAU said:
These aren't the regular socket 2066 Xeons mate.

Go find a 12 memory channel desktop board with the correct socket for the 9282. Find it retail, and post the price/link to buy it here.

I'll wait.

edit:
I'll even make it easier. If you can't find a board - find an OEM workstation using one of these bad boys with a link to buy it. Doesn't even have to be an ATX motherboard.

Mate,

Yea i agree with you there ..very tricky to get those MBS with BGA-5903, if you are not deep into intel shills territory

.. and the price/performance is also even worse than Apples HW these days .. lol... (but it dose not mean it's impossible by any means given the budget and will)

But hey why even bother just pop in an AMD Epyc 3990x with 64 wonderous cores and 128 threads in a desktop and you are good to go.. bet it will run chess programs really, really well.

And hey this has gone waaaay of topic... now...

Leifi · Nov 30, 2021

throAU said:
I think the main takeaway from this is that you're living in fantasy land, with thoughts that optimisation can't make significant improvements, and that a 40 core laptop cpu is currently feasible at any sort of comparable price point, form factor or thermal envelope.

I'm living in a "fantasy land" ?

I am observing reality as it is with the current available open-source chess engines programs at hand, both CPU and GPU bound ones., and observing how they perform (or do not perform) as is and compiled with NEON on M1. I think you are the only one making bold claims that the current "reality" can be changed and "recoded" into a new "better" "reality". Without any solid evidence of that .. This "much faster m1-tweaked-world" is really the fantasy-land here.

throAU · Nov 30, 2021

Leifi said:
I am observing reality as it is with the current available open-source chess engines programs at hand., and observing how they perform (or do not perform) as is and compiled with NEON on M1. I think you are the only one making bold claims that the current "reality" can be changed and "recoded" into a new "better" "reality". Without any solid evidence of that .. This "much faster m1-tweaked-world" is really the fantasy-land here.

There is plenty of evidence of the M1 and M1 Pro/Max beating far more power hungry machines in various applications.

Your single application that is not optimised for the platform is not proof that the processor is not performant.

It is proof that it is not performant on THAT specific code. Why? Well based on other comments in this thread the compiled version is from 2019 and the source code treats M1 as generic ARM. May as well run a bunch of math heavy code on a current cpu using x87 floating point...

You've listed one specific app that is slow and claimed there's nothing that could be done to speed it up and that 300% performance improvements are totally not feasible. Because... I don't know... you say so?

I've given you examples of 10-50x personally observed performance improvements, including an example (3dmark mesh shaders demo) you (or anyone else) can go download and run for yourself that shows 10-30x performance on modern GPUs when modern features are used, as an example of how actually using hardware features properly helps.

Leifi · Nov 30, 2021

hans1972 said:
Stockfish isn't a benchmark about CPUs ang GPUs. Its purpose is to calculate chess moves.

yes.. but it has a benmark option to report benches when calculating

hans1972 said:
You have the same with Adobe Premiere and Final Cut Pro. If you use Premiere as a benchmark for the M1 you will get worse result than using Final Cut Pro. Why? Final Cut Pro is optimised for macOS.

Stocfish != benchmark - Adobe premiere = benchmark .. Strange way to reason..

If you think benhmarking two completely different applciations on differnt hw is a good general idea to dompare hw-performance.. just think harder.

hans1972 said:
What would be much more interesting is two chess programs where one is build, architectured and optimised for Winter and one is build, architecture and optimised for M1 Mac and macOS, and comparing those.

In my experience, cross plattform programs tends to perform badly on Macs.

Not really optimizing for "intel" is not 1 thing...It is also different based on specific CPUs, CPUs, chipset storage subsets, etc. There are many Intel CPUs and architectures and possible GPU + memory combos out there you now..It's called a "complex" instruction set for a reason, not a "reduced instruction set .. ya know.."

Anyone is free to write a super-duper chess engine for M1... But without any proof of PoCs that the M1 souls even match lowe end last-gen AMD CPUs no one will waste their time on this. There are plenty of examples on chess forums like talk chess when coders looked to upgrade their pc run some quick beaches on M1 and realized they can get faster cheaper computers that even provide good AI GPUs for less money.. It is a no-brainer for most programmers who aren't knee-deep and locked into an Apple-only proprietary dev environment.

Leifi · Nov 30, 2021

throAU said:
There is plenty of evidence of the M1 and M1 Pro/Max beating far more power hungry machines in various applications.

I have seen no evidence here that M1 is beating an 8-core AMD CPU in benchmarks using all cores to its fullest capacity on similar workloads. Please post your evidence of this. and how that evidence would even remotely indicate chess engine performance would beat a modern 16-thread capable CPU being pushed to its limit on all cores.

throAU said:
Your single application that is not optimised for the platform is not proof that the processor is not performant.

True, but you cannot disprove such double negation - kind of general statements. Current benches can be seen as a pretty good indication until someone can prove the CPU "is able" to get higher throughput. And by the nature of things, different architectures will have always have weaker and better benchmarks based on test cases.

I could also for example claim the stockfish intel code could be extensively tweaked using the features of gen 12 Alderlakes to gain an extreme amount of performance over M1 Max. And you would not be able to disprove this in a similar way. It's impossible to disprove "fictive unproven future gains" by the nature of the claim itself. It's simply just guesswork or some reasonble estimations until someone actually implements and runs a real bench, and then you could still "guess" that more could be "tweaked" and question the results.

throAU · Nov 30, 2021

Just for some very disappointing CPU/GPU performance...

No fan noise.

Lets go closer... and turn off the game volume... after playing through for a few minutes.. if you hear anything its the liquid cooled RX6900 based PC next to it. idle.

M1 Pro... M1+Metal optimised game...

Apple M1 CPU & GPU speed is very disappointing

macrumors 68040

macrumors 68040

Suspended

macrumors 65816

macrumors regular

macrumors 65816

macrumors 604

macrumors Core

macrumors 68040

macrumors 6502a

macrumors 6502a

macrumors 68030

macrumors G4

macrumors regular

macrumors 65816

macrumors 68030

Suspended

macrumors G4

macrumors 6502a

macrumors regular

macrumors regular

macrumors G4

macrumors regular

macrumors regular

macrumors G4

Our Staff