Apple M1 CPU & GPU speed is very disappointing

Taz Mangus · Nov 29, 2021

pshufd said:
I've seen algorithms that are optimized for L1 and L2 cache sizes - that's not necessarily a bad idea as cache sizes have increased over time. But they can run poorly on older hardware. In some cases, you have separate code paths depending on the architecture. So you sniff the processor capabilities and then execute the best code for the hardware capabilities.

Another way to do a fair test would be to disable all hardware optimizations in software. But even this has problems because you can use algorithms that may favor one architecture over another. Or the compiler may not make best use of hardware. Or there may be some risks taken with compiler and linker optimizations.

It is never going to be perfect and who really cares that it needs to be perfect. It is about how well these computers help people do more with less time and hassle.

bobcomer · Nov 29, 2021

cmaier said:
Yeah “there’s not much an attacker can do with it”

Yeah, that's why I said it wasn't as bad, but it does exist.

cmaier · Nov 29, 2021

leman said:
Apple does not have any "magic fairy dust". They simply have a very wide out of order core capable of having hundreds of instructions in flight. Execution capabilities of M1 far surpass the execution capabilities of Zen3 for most real world code and your insistence to cherry picking software that has not been profiled or tuned for M1 further suggests that you are not at all interested in a proper discussion but are just here to repeat arbitrary uneducated claims.

It's not about optimizing by hand. It's about making sure your code behaves well on a certain platform. It shouldn't matter for most code — compilers can make good work here, but you need to be careful if your code relies on hardware-specific assumptions.

Hypothetic example: let's say you are a low-level programmer and you are working on a piece of code that leverages SIMD (either directly via intrinsics or indirectly via compiler vectorization) . Your primary platform is x86 and you want to support both SSE and AVX hardware platforms. Older CPUs with SSE support can do at most two 128-bit SIMD operations per cycle. Let's say you algorithm needs to compute a sum of a long vector, so in your SSE path are repeatedly fetching 128-bit values and adding them to an accumulator. You do some profiling and find out that your code is pretty much optimal — the bottleneck are loads from L1D as well as the vector units. You are happy with your result.

Now the new fancy M1 CPU comes out and you decide to implement a NEON path. Now, NEON uses 128bit SIMD, sounds same as SSE, right? So you copy your SSE code, replace the basic SSE intrinsics with the NEON intrinsics, which takes you just five minutes. Code compiles, tests are passing. But then a user opens an issue telling you that their M1 is slower than an AMD CPU on your code. You close the ticket, telling the user that you already have native NEON support and they should stop drinking coolaid — the problem is that M1 is not as fast as YouTubers claim.

Here is a problem though — your NEON code still uses the assumptions of your SSE hardware. But these assumptions are not correct for M1. It has at least twice as many SIMD processing units and can do more loads from L1D per cycle. Your optimal SSE code is not optimal on M1, as it is not able to feed the hardware properly. If you would have profiled your code on a native machine, you would have seen that half of your vector units are starving for work. To fix is, you have to change your code to work on multiple SIMD values simultaneously. Maybe you want to interleave your sums and instead of x[0] + x[1] + x[2] + x[3]... do something like a = x[0] + x[1], b = x[2] + x[3], a + b — this would allow the compiler to generate code that has higher ILP. Maybe you also want to split your large vector in multipel chunks and process them simultaneously — this might help you avoid cache bank conflicts and maximize your loads. Maybe there are some other tricks to maximizing your ILP.

Getting back to Stockfisch — I have no idea how their code looks like and how they use NEON. I have no idea what hardware assumptions they operate under. One would need to profile the code to figure out where the problems are, and one might need to change the algorithms (using the techniques I describe above) to get maximal performance on M1.

There is one thing I however we do know for certain, because there are plenty of low-level tests and analyses around. M1 is a real monster when it comes to SIMD performance. It is able to do a 512-bit SIMD operation per cycle (4x 128-bit vector units) with very low latency, which is comparable with AVX-512 CPUs — except on M1 you can get there with regular NEON, no need for 512-bit SIMD extensions. So if you show me a piece of SIMD-optimized software that runs significantly on M1 than on a laptop Cezanne, my first guess is that the problem is lack of SIMD ILP in the code. This would be my guess for Stockfish as well.

Hopefully at some point in a future there will be some motivated developer with native hardware who has a passion for chess engines and will spend some time profiling the code and figuring out why it doesn't run faster on M1 than it does now. And maybe that developer will be able to submit a patch that will fix the performance issue. Or, who knows, maybe the result of the technical discussion will be that M1 architecture is indeed bad for this kind of workloads and what we see now is already the best it can do. Who knows. But such discussion is still pending.

Bless your patience.

pshufd · Nov 29, 2021

leman said:
Apple does not have any "magic fairy dust". They simply have a very wide out of order core capable of having hundreds of instructions in flight. Execution capabilities of M1 far surpass the execution capabilities of Zen3 for most real world code and your insistence to cherry picking software that has not been profiled or tuned for M1 further suggests that you are not at all interested in a proper discussion but are just here to repeat arbitrary uneducated claims.

It's not about optimizing by hand. It's about making sure your code behaves well on a certain platform. It shouldn't matter for most code — compilers can make good work here, but you need to be careful if your code relies on hardware-specific assumptions.

Hypothetic example: let's say you are a low-level programmer and you are working on a piece of code that leverages SIMD (either directly via intrinsics or indirectly via compiler vectorization) . Your primary platform is x86 and you want to support both SSE and AVX hardware platforms. Older CPUs with SSE support can do at most two 128-bit SIMD operations per cycle. Let's say you algorithm needs to compute a sum of a long vector, so in your SSE path are repeatedly fetching 128-bit values and adding them to an accumulator. You do some profiling and find out that your code is pretty much optimal — the bottleneck are loads from L1D as well as the vector units. You are happy with your result.

Now the new fancy M1 CPU comes out and you decide to implement a NEON path. Now, NEON uses 128bit SIMD, sounds same as SSE, right? So you copy your SSE code, replace the basic SSE intrinsics with the NEON intrinsics, which takes you just five minutes. Code compiles, tests are passing. But then a user opens an issue telling you that their M1 is slower than an AMD CPU on your code. You close the ticket, telling the user that you already have native NEON support and they should stop drinking coolaid — the problem is that M1 is not as fast as YouTubers claim.

Here is a problem though — your NEON code still uses the assumptions of your SSE hardware. But these assumptions are not correct for M1. It has at least twice as many SIMD processing units and can do more loads from L1D per cycle. Your optimal SSE code is not optimal on M1, as it is not able to feed the hardware properly. If you would have profiled your code on a native machine, you would have seen that half of your vector units are starving for work. To fix is, you have to change your code to work on multiple SIMD values simultaneously. Maybe you want to interleave your sums and instead of x[0] + x[1] + x[2] + x[3]... do something like a = x[0] + x[1], b = x[2] + x[3], a + b — this would allow the compiler to generate code that has higher ILP. Maybe you also want to split your large vector in multipel chunks and process them simultaneously — this might help you avoid cache bank conflicts and maximize your loads. Maybe there are some other tricks to maximizing your ILP.

Getting back to Stockfisch — I have no idea how their code looks like and how they use NEON. I have no idea what hardware assumptions they operate under. One would need to profile the code to figure out where the problems are, and one might need to change the algorithms (using the techniques I describe above) to get maximal performance on M1.

There is one thing I however we do know for certain, because there are plenty of low-level tests and analyses around. M1 is a real monster when it comes to SIMD performance. It is able to do a 512-bit SIMD operation per cycle (4x 128-bit vector units) with very low latency, which is comparable with AVX-512 CPUs — except on M1 you can get there with regular NEON, no need for 512-bit SIMD extensions. So if you show me a piece of SIMD-optimized software that runs significantly on M1 than on a laptop Cezanne, my first guess is that the problem is lack of SIMD ILP in the code. This would be my guess for Stockfish as well.

Hopefully at some point in a future there will be some motivated developer with native hardware who has a passion for chess engines and will spend some time profiling the code and figuring out why it doesn't run faster on M1 than it does now. And maybe that developer will be able to submit a patch that will fix the performance issue. Or, who knows, maybe the result of the technical discussion will be that M1 architecture is indeed bad for this kind of workloads and what we see now is already the best it can do. Who knows. But such discussion is still pending.

Great post.

BTW, you may even get better performance if you optimize by implementation. Latency can vary one the same instruction from one processor generation to another. Netburst cycle times were typically longer than those on core architectures but there were some things that were more expensive and some things that were less expensive. So you might make an assumption for one generation which would be incorrect on the next.

I wouldn't mind optimizing something for Chrome or Firefox because hundreds of millions of people would benefit from that work. I play chess and enjoy watching chess streamers but have no motivation to optimize chess programs. People need to remember that this is hobbyist stuff and people buy their own hardware and software tools and can spend hundreds or thousands of hours doing this hobbyist stuff just because they love doing it. There's a lot of stuff that's given away for free on the internet - just remember that the people that give of their time, labor and treasure have real lives to attend to.

pshufd · Nov 29, 2021

xWhiplash said:
Why would you disable part of the hardware optimizations? M1 is not just CPU and GPU. There are built in neural engines, encode/decode engines and more. Its not just a CPU. So why not leave hardware optimizations in place because things WILL actively use it to do their work. And that should be how you compare things. As a purchaser of a Windows or Mac system, I want to see how it compares in real world tests. Not "disable this" and "disable that" then test because when I get my own product, I will NOT be disabling those things. So it is not a good test.

For a fair comparison. Everyone just use GCC-generated scalar code.

pshufd · Nov 29, 2021

Taz Mangus said:
It is never going to be perfect and who really cares that it needs to be perfect. It is about how well these computers help people do more with less time and hassle.

You. Don't. Understand.

Do you have an EE or CS degree?

Ethosik · Nov 29, 2021

pshufd said:
For a fair comparison. Everyone just use GCC-generated scalar code.

And how will that help me as someone looking to buy a product? I won't be disabling all these things. I want the device to be tested as I would be using it. THAT is a fair comparison. If the M1 has advantages due to hardware benefits, then the M1 has advantages. No getting around that.

pshufd · Nov 29, 2021

xWhiplash said:
And how will that help me as someone looking to buy a product? I won't be disabling all these things. I want the device to be tested as I would be using it. THAT is a fair comparison. If the M1 has advantages due to hardware benefits, then the M1 has advantages. No getting around that.

It doesn't help if you are considering a purchase. It's just for the purposes of comparison. The alternative is to maximally optimize for both platforms. Nobody seems to want to do that.

Taz Mangus · Nov 29, 2021

pshufd said:
You. Don't. Understand.

Do you have an EE or CS degree?

What are you taking about. I do have a CS degree and have writing embedded software for about 30 years.. But that has nothing to do with me as a consumer comparing 2 systems and deciding which will reduce my time to complete the work I need to get done.

I. do. understand.

pshufd · Nov 29, 2021

Taz Mangus said:
What are you taking about. I do have a CS degree and have writing embedded software for about 30 years.. But that has nothing to do with me as a consumer comparing 2 systems and deciding which will reduce my time to complete the work I need to get done.

I. do. understand.

Then why don't you understand the optimization issue?

And why don't you just optimize your chess program for Apple Silicon for a fair comparison?

Ethosik · Nov 29, 2021

pshufd said:
It doesn't help if you are considering a purchase. It's just for the purposes of comparison. The alternative is to maximally optimize for both platforms. Nobody seems to want to do that.

So you want these reviewers to test these products in a way where someone wanting information on whats best to purchase won't get the right information and might lead them to buying a worse product as a result? Is that what you are saying?

pshufd · Nov 29, 2021

xWhiplash said:
So you want these reviewers to test these products in a way where someone wanting information on whats best to purchase won't get the right information and might lead them to buying a worse product as a result? Is that what you are saying?

No. I want the person who wants the comparison here to do the work. Reviewers don't care about chess programs. We have someone here with a complaint that nobody else cares about. So he should do the work in a fair comparison.

Leifi · Nov 29, 2021

leman said:
Hopefully at some point in a future there will be some motivated developer with native hardware who has a passion for chess engines and will spend some time profiling the code and figuring out why it doesn't run faster on M1 than it does now. And maybe that developer will be able to submit a patch that will fix the performance issue. Or, who knows, maybe the result of the technical discussion will be that M1 architecture is indeed bad for this kind of workloads and what we see now is already the best it can do. Who knows. But such discussion is still pending.

Hopefully also at some point Apple may provide better info on its proprietary hardware and much more openess about the platform, libraries, documentaion etc. in order to assist any efforts of the open source comunity in better optimizing for cross compiling to Apple's proprietary closed HW-platforms.

leman said:
Hopefully
Getting back to Stockfisch — I have no idea how their code looks like and how they use NEON.

Just look at the code.. https://github.com/official-stockfish/Stockfish create a PR if you can increase nps speed by optimizing SIMD by any significancy. I am sure it will be added to the codebase, and avialable for MacOS builds.

Maybe some code samles on the AMX module would be helpful as well?

there is a profile build avialble for M1..

make -j profile-build ARCH=apple-silicon

leman · Nov 29, 2021

Leifi said:
Hopefully also at some point Apple may provide better info on its proprietary hardware and much more openess about the platform, libraries, documentaion etc. in order to assist any efforts of the open source comunity in better optimizing for cross compiling to Apple's proprietary closed HW-platforms.

What specific info are you looking for? It’s an ARM CPU… there is plenty of information available. And it’s not like x86 folks publish micro-optimization manuals for their CPUs either.

Leifi said:
Just look at the code.. https://github.com/official-stockfish/Stockfish create a PR if you can increase nps speed by optimizing SIMD by any significancy. I am sure it will be added to the codebase, and avialable for MacOS builds.

Maybe some code samles on the AMX module would be helpful as well?

there is a profile build avialble for M1..

make -j profile-build ARCH=apple-silicon

Sure, I can do that, once my laptop arrives. You are welcome to contact me via personal message so that we can work something out. It sounds like a fun project, so I would even be willing to lower my usual consultancy fee.

Taz Mangus · Nov 29, 2021

pshufd said:
Then why don't you understand the optimization issue?

And why don't you just optimize your chess program for Apple Silicon for a fair comparison?

I think you are replying to the wrong person.

Leifi · Nov 29, 2021

Taz Mangus said:
Speaking of performance running on battery. If there is the loudest fan test, the AMD and Intel laptops would win every single time. Can't say the same about how well they perform running on battery alone.

Lol..Then my phone wins over Apple-silicon every time.. both cooler and less noisy... Most AMD laptops can beat the M1 on chess benches with so much downclocking that fans will never kick in.. It wins on battery as well.

A 4800U laptop on battery beats the M1 plugged in on performance in chess currenlty. Thats the sad truth.

Leifi · Nov 29, 2021

leman said:
Sure, I can do that, once my laptop arrives. You are welcome to contact me via personal message so that we can work something out. It sounds like a fun project, so I would even be willing to lower my usual consultancy fee.

You don't have the laptop, you havent looked at the current SF code, don't know much about the AMX parts of the M1... and you make bold claims on how much faster you can get it to run, without even digging deeper into this.. Lol.. And on top pf that you want a consultancy fee ..

throAU · Nov 29, 2021

Leifi said:
nd using 100 CPU/GPU parts drans the Macbook M1 pro prettry quick to, dont underestimate the power-draw of the GPU+CPU cores of the M1 Pro and max..

I own one, and yeah it draws down faster than doing nothing, but NOWHERE NEAR any intel machine I've used.

Jorbanead · Nov 29, 2021

Wando64 said:
I am trying to summarise after 20 pages of arguing and I concluded that if you want to play chess you should buy a chess-board.
For anything else, an M1 based computer will do just fine.

Phew

Oh this whole discussion has already happened before on a different thread when M1 came out. I guess this forum has short term memory loss or the anti-M1 crowd just uses the same talking points.

Tune in next week for “Intels 6,000W new processor smokes the M1 Max chip! Apple made a huge mistake”

Followed by “The M1 Max can’t compete with a liquid cooled, overclocked 3090Ti Apple is doomed” post.

leman · Nov 29, 2021

Leifi said:
You don't have the laptop, you havent looked at the current SF code, don't know much about the AMX parts of the M1... and you make bold claims on how much faster you can get it to run, without even digging deeper into this.. Lol.. And on top pf that you want a consultancy fee ..

I never claimed I would be able to make it run much faster. I said that someone would need first to do a code analysis to understand where potential issues are. And sure, I could be that someone but I won’t do it for free. I am not a charity and that’s not a project I have personal interest in.

Leifi · Nov 29, 2021

According to discussion on talkchess Apple M1 even got beaten on battery/perfromance..

The M1 was able to tto 60 Gig positions of analysis before the battery went from 100% to 0% (battery lasted 3 hours)

An Asus 5900 laptop ran on battery and was finsihed with the same amount of positions 60G after only 1 hour with 50% left of battery.

Laptop	Positions	Hours	Battery left
Macbook M1	60.000.000.000	3h	0%
Asus 5900H	60.000.000.000	1h	50%

Granted the Asus probably was louder and had a larger battery it is still quite sad when the biggest selling point is power-efficiency for the Macs to be less productive and run out of battery with less work done.

Homy · Nov 29, 2021

When 4A wanted to optimize Metro Exodus and Larian Studios wanted to optimize Baldur's Gate 3 they turned to Apple for help and Apple was more than happy to help them. That has made BG3 the best optimized native game for Apple Silicon doing around 100 fps at Ultra 1080p on MBP 32-core M1 Max. That's faster than an Alienware X17 with i7 11800H and 140W RTX 3070.

That's the power of optimization. So the important question here is why chess software developers don't ask Apple for help? or do they? People here really should ask the developers all these questions instead of asking random forum members who don't care for chess (including me) to provide proof. Come back here and tell us what the developers said and what answer Apple gave them before this turns to a 100 page long discussion about what M1 can't do and why it sucks at Stockfish.

As someone already said this exact discussion about chess and Stockfish came up when M1 was released in 2020. I guess chess hasn't got more popular since then.

Optimize high-end games for Apple GPUs - WWDC21 - Videos - Apple Developer

Optimize your high-end games for Apple GPUs: We'll show you how you can use our rendering and debugging tools to eliminate performance...

developer.apple.com

throAU · Nov 29, 2021

Leifi said:
According to discussion on talkchess

So, ok, we will give you that the M1 sucks at your chess benchmark.

Next? What else you got for it being crap?

throAU · Nov 29, 2021

Homy said:
When 4A wanted to optimize Metro Exodus and Larian Studios wanted to optimize Baldur's Gate 3 they turned to Apple for help and Apple was more than happy to help them. That has made BG3 the best optimized native game for Apple Silicon doing around 100 fps at Ultra 1080p on MBP 32-core M1 Max. That's faster than an Alienware X17 with i7 11800H and 140W RTX 3070.

That's the power of optimization. So the important question here is why chess software developers don't ask Apple for help? or do they? People here really should ask the developers all these questions instead of asking random forum members who don't care for chess (including me) to provide proof. Come back here and tell us what the developers said and what answer Apple gave them before this turns to a 100 page long discussion about what M1 can't do and why it sucks at Stockfish.

Optimize high-end games for Apple GPUs - WWDC21 - Videos - Apple Developer

Optimize your high-end games for Apple GPUs: We'll show you how you can use our rendering and debugging tools to eliminate performance...

developer.apple.com

Yeah one thing people forget when comparing say, TFLOPs of the Apple devices vs. Nvidia or AMD is that Apple does Tile Based Deferred Rendering.

What's that?

It's basically not rendering stuff that isn't visible.

Saves power, improves performance.

Nvidia and AMD cards not doing TBDR (or mesh shaders) are calculating and drawing a bunch of stuff in the background to just draw over it again with stuff visible in front due to the painter's algorithm. Wasting throughput, compute power and energy.

Yes, mesh shaders will fix this to a degree if used but that is Turing and Navi2 onward only, so no PC games will rely on it for some time and the vast majority of PC hardware out there simply doesn't support it.

This is why you can't just compare on paper spec, say "oh the TFLOPS are lower, it SUCKS".

Jorbanead · Nov 29, 2021

Leifi said:
According to discussion on talkchess Apple M1 even got beaten on battery/perfromance..

The M1 was able to tto 60 Gig positions of analysis before the battery went from 100% to 0% (battery lasted 3 hours)

An Asus 5900 laptop ran on battery and was finsihed with the same amount of positions 60G after only 1 hour with 50% left of battery.

Laptop Positions Hours Battery left
Macbook M1 60.000.000.000 3h 0%
Asus 5900H 60.000.000.000 1h 50%

Granted the Asus probably was louder and had a larger battery it is still quite sad when the biggest selling point is power-efficiency for the Macs to be less productive and run out of battery with less work done.

Well it’s a good thing I didn’t buy my Mac to run chess benchmarks! I guess I’ll just have to deal with extremely fast render times and only being able to run 30 streams of 4K all on battery instead… dang this computer sucks

Apple M1 CPU & GPU speed is very disappointing

macrumors 604

macrumors 601

Suspended

macrumors G4

macrumors G4

macrumors G4

Contributor

macrumors G4

macrumors 604

macrumors G4

Contributor

macrumors G4

macrumors regular

macrumors Core

macrumors 604

macrumors regular

macrumors regular

macrumors G4

macrumors 65816

macrumors Core

macrumors regular

macrumors 68030

macrumors G4

macrumors G4

macrumors 65816

Our Staff