> Ok, once my M1 16" has arrived I will be happy to contribute profiler output. Do you have instructions on how to collect it?
Not on Mac, no; I've never had any contact with Apple hardware or software.
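For reference, one way such a profile could be collected on macOS is with Instruments (the Time Profiler or CPU Counters templates) or its command-line front end, xctrace. A rough sketch, assuming Xcode and its command-line tools are installed; the exact options can vary between Xcode versions, so check `xctrace help record` first:

```sh
# Hypothetical invocation: record a Time Profiler trace of a Stockfish bench run.
# The binary path and bench arguments are illustrative, not taken from this thread.
xcrun xctrace record --template 'Time Profiler' \
    --output stockfish.trace \
    --launch -- ./stockfish bench

# Open the recorded trace in the Instruments GUI for inspection.
open stockfish.trace
```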
> Apple M1 CPU & GPU speed is very disappointing
The megahertz wars ended decades ago. You need to move into the 21st century, where what matters is how much your workflows have improved, not the raw speed of the components. Can a setup with a lower clock speed but better optimisation be better? Often, yes. The Apple SoC proves this by using each clock cycle more efficiently and moving tasks to other parts of the SoC, such as video encoding.
> The megahertz wars ended decades ago. You need to move into the 21st century, where what matters is how much your workflows have improved, not the raw speed of the components. …
I know this is a long thread, but you would only have to read the last page to understand that what you wrote is useless for this discussion. We don't need marketing babble here.
> I know this is a long thread, but you would only have to read the last page to understand that what you wrote is useless for this discussion. We don't need marketing babble here.
I addressed the whole point of the topic.
> Speaking of performance running on battery: if there were a loudest-fan test, the AMD and Intel laptops would win every single time. Can't say the same about how well they perform running on battery alone.
It's always good to see good results.
> Apple does not have any "magic fairy dust". They simply have a very wide out-of-order core capable of having hundreds of instructions in flight. The execution capabilities of the M1 far surpass those of Zen 3 for most real-world code, and your insistence on cherry-picking software that has not been profiled or tuned for M1 further suggests that you are not at all interested in a proper discussion but are just here to repeat arbitrary, uneducated claims.
Of course, everybody can download the Stockfish code and take a look at the makefile and other things.
It's not about optimizing by hand. It's about making sure your code behaves well on a certain platform. It shouldn't matter for most code (compilers do a good job here), but you need to be careful if your code relies on hardware-specific assumptions.
Hypothetical example: let's say you are a low-level programmer working on a piece of code that leverages SIMD (either directly via intrinsics or indirectly via compiler vectorization). Your primary platform is x86 and you want to support both SSE and AVX hardware. Older CPUs with SSE support can do at most two 128-bit SIMD operations per cycle. Let's say your algorithm needs to compute the sum of a long vector, so in your SSE path you repeatedly fetch 128-bit values and add them to an accumulator. You do some profiling and find that your code is pretty much optimal: the bottlenecks are the loads from L1D as well as the vector units. You are happy with your result.
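To make the hypothetical concrete, a single-accumulator SSE sum could look roughly like the sketch below (illustrative only; the function name and the assumption that the length is a multiple of 4 are mine, not from any real codebase):

```cpp
#include <immintrin.h>  // SSE intrinsics
#include <cstddef>

// Sum a float vector with a single 128-bit accumulator.
// One long dependency chain: each add waits for the previous one,
// matching the narrow-SSE assumptions described above.
float sum_sse(const float* x, std::size_t n) {  // n assumed to be a multiple of 4
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 4) {
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));  // serial dependency on acc
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);                       // horizontal reduction
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```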
Now the fancy new M1 CPU comes out and you decide to implement a NEON path. NEON uses 128-bit SIMD, which sounds the same as SSE, right? So you copy your SSE code and replace the basic SSE intrinsics with NEON intrinsics, which takes you just five minutes. The code compiles, the tests pass. But then a user opens an issue telling you that their M1 is slower than an AMD CPU on your code. You close the ticket, telling the user that you already have native NEON support and that they should stop drinking the Kool-Aid: the problem is that the M1 is not as fast as YouTubers claim.
Here is the problem, though: your NEON code still bakes in the assumptions of your SSE hardware, and those assumptions are not correct for M1. It has at least twice as many SIMD processing units and can do more loads from L1D per cycle. Your optimal SSE code is not optimal on M1, as it is not able to feed the hardware properly. If you had profiled your code on a native machine, you would have seen that half of your vector units are starving for work. To fix it, you have to change your code to work on multiple SIMD values simultaneously. Maybe you want to interleave your sums, and instead of x[0] + x[1] + x[2] + x[3]... do something like a = x[0] + x[1], b = x[2] + x[3], then a + b; this allows the compiler to generate code with higher ILP. Maybe you also want to split your large vector into multiple chunks and process them simultaneously, which might help you avoid cache bank conflicts and maximize your loads. Maybe there are other tricks to maximize your ILP.
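A sketch of what the interleaved-accumulator idea could look like with NEON intrinsics; the four independent chains give a wide core like the M1 several additions it can issue in parallel (again purely illustrative; the unroll factor and function name are my assumptions, not anything taken from Stockfish):

```cpp
#include <arm_neon.h>
#include <cstddef>

// Sum a float vector with four independent 128-bit accumulators.
// The four add chains do not depend on each other, so an out-of-order core
// with several vector units can work on all of them in the same cycle.
float sum_neon_ilp(const float* x, std::size_t n) {  // n assumed to be a multiple of 16
    float32x4_t a0 = vdupq_n_f32(0.0f);
    float32x4_t a1 = a0, a2 = a0, a3 = a0;
    for (std::size_t i = 0; i < n; i += 16) {
        a0 = vaddq_f32(a0, vld1q_f32(x + i));
        a1 = vaddq_f32(a1, vld1q_f32(x + i + 4));
        a2 = vaddq_f32(a2, vld1q_f32(x + i + 8));
        a3 = vaddq_f32(a3, vld1q_f32(x + i + 12));
    }
    // Combine the partial sums, then reduce the final vector horizontally.
    return vaddvq_f32(vaddq_f32(vaddq_f32(a0, a1), vaddq_f32(a2, a3)));
}
```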
Getting back to Stockfish: I have no idea what their code looks like or how they use NEON. I have no idea what hardware assumptions they operate under. One would need to profile the code to figure out where the problems are, and one might need to change the algorithms (using the techniques I describe above) to get maximal performance on M1.
There is one thing, however, that we do know for certain, because there are plenty of low-level tests and analyses around: M1 is a real monster when it comes to SIMD performance. It is able to do 512 bits' worth of SIMD operations per cycle (4x 128-bit vector units) with very low latency, which is comparable to AVX-512 CPUs, except that on M1 you get there with regular NEON, with no need for 512-bit SIMD extensions. So if you show me a piece of SIMD-optimized software that runs significantly slower on M1 than on a laptop Cezanne, my first guess is that the problem is a lack of SIMD ILP in the code. This would be my guess for Stockfish as well.
Hopefully at some point in the future there will be a motivated developer with native hardware and a passion for chess engines who will spend some time profiling the code and figuring out why it doesn't run faster on M1 than it does now. And maybe that developer will be able to submit a patch that fixes the performance issue. Or, who knows, maybe the result of the technical discussion will be that the M1 architecture is indeed bad for this kind of workload and what we see now is already the best it can do. Who knows. But that discussion is still pending.
P.S. This discussion should give some ideas about how complex these things can get: https://www.realworldtech.com/forum/?threadid=204001&curpostid=204001
> does your code prefetch the data?
There is no need for prefetching, as the most costly part of the network is only about 16 KiB in size. The largest layer is mostly sequential accesses that cannot really be prefetched because they are known too late.
> does your code use multiple SIMD streams to make sure you have enough ILP for multiple vector units?
This is a valid concern, because it's not done explicitly. I cannot do any profiling because I don't own an M1 machine. If you have proof that the vector units are not saturated during inference, then I may be able to help. I've been saying the same thing for the last few months, but apparently no M1 user is capable of providing a profile.
> does your SIMD code rely on long dependency chains that might reduce ILP?
The NEON one, possibly. Again, no one ever provided profiler output so I don't know.
> have you profiled the code on M1 and found that it results in optimal occupancy of vector units?
No, I don't own an M1, and no one who has access to one has done so (which is weird considering this discussion is already more than 20 pages long; have you guys just been throwing empty words around all this time?).
Sorry, what exactly do you still want from me? You've been vocally complaining that nobody can offer a more technical explanation of why having native ARM and NEON support is not the same as proper optimization. In your own words:
Well, I have explained it to you, in depth, in #550. Your reaction? "But Stockfish has had an ARM version for years...!" How is one supposed to talk to you on a professional level if all you do is squirm and wriggle like an eel? I know this kind of behavior from my five-year-old nephew: if you tell him something he doesn't want to hear, he covers his ears and goes "la la la". But dealing with this from an (allegedly) adult person? It's embarrassing, really.
> Oh, this whole discussion has already happened before in a different thread when the M1 came out. I guess this forum has short-term memory loss, or the anti-M1 crowd just uses the same talking points.
> Tune in next week for "Intel's 6,000W new processor smokes the M1 Max chip! Apple made a huge mistake."
> Followed by a "The M1 Max can't compete with a liquid-cooled, overclocked 3090 Ti. Apple is doomed" post.
That's exactly the same thread.
> … The Apple SoC proves this by using each clock cycle more efficiently and moving tasks to other parts of the SoC, such as video encoding.
Of course, if you want, you can call Apple's media engines "optimization of hardware":
Hardware-accelerated H.264, HEVC, ProRes, and ProRes RAW; a video decode engine; two video encode engines; two ProRes encode and decode engines…
This way it is easy to break every video (etc.) benchmark.
Some engines for integer math and other stuff would be great too.
> By the way, is it possible that you don't really have an M1 problem, but rather a macOS problem?
Apple's Intel Macs are known for overheating, and I wouldn't expect anything different with a heavy AVX2 workload like Stockfish. A LOT of people discovered their cooling solutions were insufficient when trying to run Stockfish, and not only on laptops.
I just ran the Stockfish 14 bench on my 2019 16" MacBook Pro with an i9-9880HK, using the same settings as the ones on https://openbenchmarking.org/test/pts/stockfish-1.3.0: stockfish bench 128 16 24 default depth
Result: 11283109 nodes/second
This is around 5% slower than the i7-9750H result on OpenBenchmarking. The 8-core i9 should be at least 20-25% faster than the 6-core i7-9750H...
> According to discussion on talkchess, the Apple M1 even got beaten on battery performance.
> The M1 was able to do 60 Gig positions of analysis before the battery went from 100% to 0% (the battery lasted 3 hours).
> An Asus 5900H laptop ran on battery and was finished with the same amount of positions (60G) after only 1 hour, with 50% of its battery left.
>
> Laptop     | Positions      | Hours | Battery left
> MacBook M1 | 60.000.000.000 | 3h    | 0%
> Asus 5900H | 60.000.000.000 | 1h    | 50%
>
> Granted, the Asus probably was louder and had a larger battery, but it is still quite sad, when the Macs' biggest selling point is power efficiency, for them to be less productive and run out of battery with less work done.
The Asus has a big battery, and this is good.
> The Asus has a big battery, and this is good.
Apple could have used a bigger battery too. It's their fault/decision. But maybe the M1 Max 16-inch case is too small, which again was Apple's decision.
> When 4A wanted to optimize Metro Exodus and Larian Studios wanted to optimize Baldur's Gate 3, they turned to Apple for help, and Apple was more than happy to help them. That has made BG3 the best-optimized native game for Apple Silicon, doing around 100 fps at Ultra 1080p on a MBP with the 32-core M1 Max. That's faster than an Alienware X17 with an i7-11800H and a 140W RTX 3070.
> That's the power of optimization. So the important question here is: why don't chess software developers ask Apple for help? Or do they? People here really should ask the developers all these questions instead of asking random forum members who don't care for chess (including me) to provide proof. Come back here and tell us what the developers said and what answer Apple gave them before this turns into a 100-page-long discussion about what M1 can't do and why it sucks at Stockfish.
> As someone already said, this exact discussion about chess and Stockfish came up when the M1 was released in 2020. I guess chess hasn't got more popular since then.
> Optimize high-end games for Apple GPUs (WWDC21 session, developer.apple.com)
1080p?
> Yeah, one thing people forget when comparing, say, TFLOPS of the Apple devices vs. Nvidia or AMD is that Apple does Tile-Based Deferred Rendering (TBDR).
> What's that? It's basically not rendering stuff that isn't visible. Saves power, improves performance.
> Nvidia and AMD cards not doing TBDR (or mesh shaders) are calculating and drawing a bunch of stuff in the background just to draw over it again with the stuff visible in front, due to the painter's algorithm. Wasting throughput, compute power and energy.
> Yes, mesh shaders will fix this to a degree if used, but that is Turing and Navi 2 onward only, so no PC games will rely on it for some time, and the vast majority of PC hardware out there simply doesn't support it.
> This is why you can't just compare on-paper specs and say "oh, the TFLOPS are lower, it SUCKS".
The Apple M1 Max doesn't even have tensor cores.
> So what you are telling us is that the M1 could go 3 hours but the Asus 5900H could only go 2 hours on 100% battery. Got it, good to know. Might be the only valuable thing the chess benchmark is good for on the M1 hardware: a battery-life test.
> Come back when the chess engine code has been fully optimized to take advantage of the M1 hardware, such as the machine learning and the Neural Engine. Until then, it is pretty much a useless benchmark for the M1 hardware.
If someone wants to have great battery tests, use Stockfish, LC0 and Ceres.
> Based on your text I assume you are one of the developers. Great! Finally someone with a little sense here. If I understand it correctly, your SIMD code does computation on neural networks. We know that the M1 has plenty of vector units and generally does excellently in SIMD throughput workloads (I am talking NEON here, not AVX or the NPU). That it performs poorly in your code can have two explanations: a) there is indeed something about the nature of your workload that hits a slow path on M1 hardware, or b) your code is not optimal. Personally, I think that a) is less likely, since the M1 has the bandwidth and the ALUs to excel at most types of matrix code.
> So here are a few questions:
> - does your code prefetch the data?
> - does your code use multiple SIMD streams to make sure you have enough ILP for multiple vector units?
> - does your SIMD code rely on long dependency chains that might reduce ILP?
> - have you profiled the code on M1 and found that it results in optimal occupancy of vector units?
You could at least take a short look at the Stockfish code; if there are some obvious problems, pick one of them.
This is fair, but unfortunately I personally have little interest in chess engines. Anyway, just because you wrote some NEON code does not mean that it runs optimally (I have written about it in #550).
> The Apple M1 Max doesn't even have tensor cores.
> All positive benchmarks for M1 involve stuff for "creators". I find it funny that people consider chess a niche while "creators" are somehow common. Also, most "creator" benchmarks are export-only. How often do you export, ffs? None of them compare file sizes, btw.
Nonsense. Let's look at these SPEC MP benchmarks, shall we?
> Yeah, but what about chess benchmarks, DP Review?
Asking the important questions...! ;^p