Apple M1 CPU & GPU speed is very disappointing

Leifi · Nov 29, 2021

pshufd said:
What's your USCF?

2398 Elo (FIDE). But please let us discuss the topic rather than discuss "persons". I am appalled that so many regulars here are more interested in making things personal than in keeping the discussion on topic and presenting some solid arguments and facts.

pshufd · Nov 29, 2021

Leifi said:
2398 Elo (FIDE). But please let us discuss the topic rather than discuss "persons". I am appalled that so many regulars here are more interested in making things personal than in keeping the discussion on topic and presenting some solid arguments and facts.

Just curious why you care so much and that explains why.

People don't care about chess.

Look at the net worth of Magnus Carlsen and compare it to LeBron James. Or Tom Brady. Or even Roger Federer. Tennis is not even in the big money leagues but it's still way more than chess. Look at the sponsorships in chess. Do chess players get multimillion dollar clothing contracts? Do they get clothing contracts at all?

If you want chess to run well on Apple Silicon, optimize it yourself.

cmaier · Nov 29, 2021

bobcomer said:
Covert channel in Apple’s M1 is mostly harmless, but it sure is interesting

Technically, it’s a vulnerability, but there’s not much an attacker can do with it.

arstechnica.com

Yeah “there’s not much an attacker can do with it”

Leifi · Nov 29, 2021

throAU said:
Blender is also still in very early days with Apple Silicon support.

"I didn't do any GPU comparisons because Metal isn't supported yet."

Yeah, turning off your 3070 GPU on a less expensive laptop to make the special M1-specific build shine makes "a lot" of sense for a CUDA-supporting application. LOL. And then also running the render in battery-saving mode... amazing ;D

throAU · Nov 29, 2021

Leifi said:
"I didn't do any GPU comparisons because Metal isn't supported yet."

Yeah, turning off your 3070 GPU on a less expensive laptop to make the special M1-specific build shine makes "a lot" of sense for a CUDA-supporting application. LOL. And then also running the render in battery-saving mode... amazing ;D

Blender 3.x roadmap — Blender Developers Blog

Ton Roosendaal shares on what's next after Blender 3.0 and beyond.

code.blender.org

It's coming. Soon.

And yeah, your 3070 will likely be turned off when on battery if you want battery life of more than 45 minutes.

Leifi · Nov 29, 2021

pshufd said:
Just curious why you care so much and that explains why.

People don't care about chess.

Look at the net worth of Magnus Carlsen and compare it to LeBron James. Or Tom Brady. Or even Roger Federer. Tennis is not even in the big money leagues but it's still way more than chess. Look at the sponsorships in chess. Do chess players get multimillion dollar clothing contracts? Do they get clothing contracts at all?

If you want chess to run well on Apple Silicon, optimize it yourself.

I am not sure what your point is. So the M1's performance is irrelevant because more people know about LeBron than about the M1 or chess? And the fact that fashion brands sponsor people who can throw a ball through a hoop makes everything that requires a higher IQ less relevant?

Is that what you are trying to argue?

cmaier · Nov 29, 2021

Leifi said:
I am not sure what your point is. So the M1's performance is irrelevant because more people know about LeBron than about the M1 or chess? And the fact that fashion brands sponsor people who can throw a ball through a hoop makes everything that requires a higher IQ less relevant?

Is that what you are trying to argue?

I believe his highly accurate point is computer chess is lame and nobody much cares - people use computers for other things.

robco74 · Nov 29, 2021

Leifi said:
"I didn't do any GPU comparisons because Metal isn't supported yet."

Yeah, turning off your 3070 GPU on a less expensive laptop to make the special M1-specific build shine makes "a lot" of sense for a CUDA-supporting application. LOL. And then also running the render in battery-saving mode... amazing ;D

You deride Apple for using proprietery (sic) technology, but then praise CUDA? You also complain about comparing performance on battery mode, which is perfectly valid for a laptop comparison? How many more escape hatches are there to get through?

M1 has been out for little more than a year. The fact that chess engines haven't yet been fully optimized isn't earth shattering news. Apple, and others, have put resources toward the areas of greatest need. Chess engines aren't exactly at the top of the list.

pshufd · Nov 29, 2021

Leifi said:
I am not sure what your point is. So the M1's performance is irrelevant because more people know about LeBron than about the M1 or chess? And the fact that fashion brands sponsor people who can throw a ball through a hoop makes everything that requires a higher IQ less relevant?

Is that what you are trying to argue?

1) People don't care about chess. I think that we agree there.
2) Performance in an area requires hard work in optimization. You can't say what the M1's performance is in an area until you maximally optimize your software for that area. M1 performance isn't irrelevant. The chess performance you cite is. So when are you going to learn the architecture and code up assembly routines for the high-cost code in whatever application you are running?

Taz Mangus · Nov 29, 2021

How about a real-world comparison, on real-world usage, of the M1 Max against an RTX 3090 and AMD Ryzen 9 5950X? I am sorry I could not find an example of a chess program simulation benchmark, so you will just have to settle for this.

Leifi · Nov 29, 2021

throAU said:
Blender 3.x roadmap — Blender Developers Blog

Ton Roosendaal shares on what's next after Blender 3.0 and beyond.

code.blender.org

It's coming. Soon.

And yeah, your 3070 will likely be turned off when on battery if you want battery life of more than 45 minutes.

It doesn't matter much if the rendering is done much more quickly; then you can switch to low power mode after the render is done.

... and running the CPU/GPU at 100% drains the MacBook Pro with the M1 Pro pretty quickly too. Don't underestimate the power draw of the GPU and CPU cores of the M1 Pro and Max.

Taz Mangus · Nov 29, 2021

Leifi said:
You would maybe gain a few percent of performance by optimizing by hand compared to the compiler, which already optimizes for the CPU.

Apple probably has the disadvantage of being a very closed, proprietary system, with poor or non-existent support and documentation for the open-source community trying to optimize code for Apple hardware. That said, Apple sadly does not have some magic fairy dust that could be sprayed on its ARMv8.4 SoC to increase performance enough to even get close to the lower-end gen 4 8-core CPUs from AMD.

Can you be specific, please? Exactly which optimizations do you think can be done on an M1 that are not already done in Stockfish and cFish C builds compiled with NEON? And how much do you think could be gained, in theory and in practice?

This is a question everyone here avoids and just tries to sidestep, because they have no good answer!

There is a lot of room to gain from software optimizations. Poorly written code is the starting point. Well, the fact that no one really cares about optimizing a chess program says it all. You find this obscure benchmark and you think it means something. It means nothing.

Leifi · Nov 29, 2021

pshufd said:
1) People don't care about chess. I think that we agree there.
2) Performance in an area requires hard work in optimization. You can't say what the M1's performance is in an area until you maximally optimize your software for that area. M1 performance isn't irrelevant. The chess performance you cite is. So when are you going to learn the architecture and code up assembly routines for the high-cost code in whatever application you are running?

1) No. Millions of people care about chess... More people than care about anything written on Macforum for sure .-)

2) Point is moot. If you can't get or prove the performance in practice, it is useless. If you would like to analyze chess (or even use Blender), an M1 Mac is a slouch compared to less pricey alternatives.

pshufd · Nov 29, 2021

Taz Mangus said:
How about a real-world comparison, on real-world usage, of the M1 Max against an RTX 3090 and AMD Ryzen 9 5950X? I am sorry I could not find an example of a chess program simulation benchmark, so you will just have to settle for this.

The problem with comparisons is that they are not apples to apples comparisons.

Back around 2004 or 2005, I created an optimized build of Firefox and put it out there and asked people to play with it. A lot of people tried it out and asked me what I did because it ran faster than the official distribution. Open Source rules require that you disclose what changes you make for a distribution (I later had a practice of including the code changes in my distributions).

I figured out how to get SIMD acceleration working in image decoding so that image-heavy websites rendered faster. I made some additional optimizations using hand-written assembler code. This was all after a careful study of Intel's architecture manuals. I received a job offer from the Mozilla CTO but I was happy where I was. SIMD acceleration is commonplace today but it wasn't back then.

x86 has a very big lead on software because there are so many x86 systems out there and some developers optimize expensive algorithms for it. You can look through the source code of large open source projects to find x86-specific code variants for hardware acceleration, or special code modules with assembler or intrinsics.

You can't really do a fair comparison unless you use the limits of the hardware capabilities of the architectures in software. It is a lot to learn an architecture well enough so that you can make best use of it. Consider doubling or tripling that to learn the details of multiple architectures. I did the same thing for PowerPC as well but it was on its way out at the time.

pshufd · Nov 29, 2021

Leifi said:
1) No. Millions of people care about chess... More people than care about anything written on Macforum for sure .-)

2) Point is moot. If you can't get or prove the performance in practice, it is useless. If you would like to analyze chess (or even use Blender), an M1 Mac is a slouch compared to less pricey alternatives.

Millions play chess. They don't care about computer chess benchmarks, and they have neither the skills, the education, nor the understanding of hardware architecture needed to optimize chess programs for the hardware that they run on.

You're just being lazy.

Optimize your target program for the Apple Silicon architecture to the limits possible and you'll have your answer.

Leifi · Nov 29, 2021

Taz Mangus said:
There is a lot of room to gain from software optimizations. Poorly written code is the starting point. Well, the fact that no one really cares about optimizing a chess program says it all. You find this obscure benchmark and you think it means something. It means nothing.

Claiming that the best chess engines today are "poorly written code" and that developers don't care about optimizing just indicates that you know very little about these matters.

Leifi · Nov 29, 2021

pshufd said:
Optimize your target program for the Apple Silicon architecture to the limits possible and you'll have your answer.

Are you just trolling? You can't be serious, right?

pshufd · Nov 29, 2021

Leifi said:
Claiming that the best chess engines today are "poorly written code" and that developers don't care about optimizing just indicates that you know very little about these matters.

Have they optimized for Apple Silicon? Have you tried optimizing for Apple Silicon at the assembler level?

Do you have a CS or EE degree?

pshufd · Nov 29, 2021

Leifi said:
Are you just trolling? You can't be serious, right?

No, I'm not kidding. It's something that I've done myself though in the image processing world.

I was the first person to port Thunderbird to 64-bit Windows x86. I was the second person to port Firefox to 64-bit Windows x86. It did require hardware-specific assembler work to get running.

Taz Mangus · Nov 29, 2021

pshufd said:
The problem with comparisons is that they are not apples to apples comparisons.

Back around 2004 or 2005, I created an optimized build of Firefox and put it out there and asked people to play with it. A lot of people tried it out and asked me what I did because it ran faster than the official distribution. Open Source rules require that you disclose what changes you make for a distribution (I later had a practice of including the code changes in my distributions).

I figured out how to get SIMD acceleration working in image decoding so that image-heavy websites rendered faster. I made some additional optimizations using hand-written assembler code. This was all after a careful study of Intel's architecture manuals. I received a job offer from the Mozilla CTO but I was happy where I was. SIMD acceleration is commonplace today but it wasn't back then.

x86 has a very big lead on software because there are so many x86 systems out there and some developers optimize expensive algorithms for it. You can look through the source code of large open source projects to find x86-specific code variants for hardware acceleration, or special code modules with assembler or intrinsics.

You can't really do a fair comparison unless you use the limits of the hardware capabilities of the architectures in software. It is a lot to learn an architecture well enough so that you can make best use of it. Consider doubling or tripling that to learn the details of multiple architectures. I did the same thing for PowerPC as well but it was on its way out at the time.

Yes it is true that if one program is optimized for certain hardware that can skew the test. This is not some synthetic benchmark. This is real world usage. Both using the same programs and the same codecs.

Taz Mangus · Nov 29, 2021

Leifi said:
Claiming that the best chess engines today are "poorly written code" and that developers don't care about optimizing just indicates that you know very little about these matters.

I never claimed anything. I just made a statement that software optimization starts with how well or badly written the code is. You are one-track, on auto-repeat. Getting a little defensive now.

pshufd · Nov 29, 2021

Taz Mangus said:
Yes it is true that if one program is optimized for certain hardware that can skew the test. This is not some synthetic benchmark. This is real world usage. Both using the same programs and the same codecs.

I've seen algorithms that are optimized for L1 and L2 cache sizes - that's not necessarily a bad idea as cache sizes have increased over time. But they can run poorly on older hardware. In some cases, you have separate code paths depending on the architecture. So you sniff the processor capabilities and then execute the best code for the hardware capabilities.
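To make the "sniff the capabilities, then pick a code path" idea concrete, here is a rough sketch in C. Everything in it is an assumption for illustration: the names `sum_generic`, `sum_wide`, `cpu_has_wide_simd` and `sum_dispatch` are made up, the "wide" kernel is plain C standing in for a hand-tuned path, and the detection uses the GCC/Clang `__builtin_cpu_supports` builtin on x86 with a compile-time fallback elsewhere. A real project would likely use its own detection code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t (*sum_fn)(const int32_t *, size_t);

/* Portable fallback path: one accumulator, works everywhere. */
int64_t sum_generic(const int32_t *x, size_t n) {
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) acc += x[i];
    return acc;
}

/* Placeholder for a path tuned for wider hardware (hypothetical).
   It only uses two independent accumulators; a real project would
   put intrinsics or assembler here. */
int64_t sum_wide(const int32_t *x, size_t n) {
    int64_t a = 0, b = 0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) { a += x[i]; b += x[i + 1]; }
    if (i < n) a += x[i];
    return a + b;
}

/* Hypothetical capability check. On x86 with GCC/Clang it asks the CPU
   at run time; on other targets it falls back to compile-time macros. */
int cpu_has_wide_simd(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#elif defined(__ARM_NEON) || defined(__aarch64__)
    return 1;
#else
    return 0;
#endif
}

/* Picks a kernel on first use and caches the choice.
   Sketch only: the lazy initialization is not thread-safe. */
int64_t sum_dispatch(const int32_t *x, size_t n) {
    static sum_fn impl = NULL;
    assert(x != NULL || n == 0);
    if (impl == NULL) impl = cpu_has_wide_simd() ? sum_wide : sum_generic;
    return impl(x, n);
}
```

Callers only ever see `sum_dispatch`, so the result is the same on every machine and only the speed differs. That is also why a benchmark can end up measuring the chosen code path as much as the hardware.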

Another way to do a fair test would be to disable all hardware optimizations in software. But even this has problems because you can use algorithms that may favor one architecture over another. Or the compiler may not make best use of hardware. Or there may be some risks taken with compiler and linker optimizations.

Taz Mangus · Nov 29, 2021

Leifi said:
"I didn't do any GPU comparisons because Metal isn't supported yet."

Yeah, turning off your 3070 GPU on a less expensive laptop to make the special M1-specific build shine makes "a lot" of sense for a CUDA-supporting application. LOL. And then also running the render in battery-saving mode... amazing ;D

Speaking of performance running on battery: if there were a loudest-fan test, the AMD and Intel laptops would win every single time. Can't say the same about how well they perform running on battery alone.

leman · Nov 29, 2021

Leifi said:
Apple probably has the disadvantage of being a very closed, proprietary system, with poor or non-existent support and documentation for the open-source community trying to optimize code for Apple hardware. That said, Apple sadly does not have some magic fairy dust that could be sprayed on its ARMv8.4 SoC to increase performance enough to even get close to the lower-end gen 4 8-core CPUs from AMD.

Apple does not have any "magic fairy dust". They simply have a very wide out-of-order core capable of having hundreds of instructions in flight. The execution capabilities of M1 far surpass those of Zen 3 for most real-world code, and your insistence on cherry-picking software that has not been profiled or tuned for M1 further suggests that you are not at all interested in a proper discussion but are just here to repeat arbitrary uneducated claims.

Leifi said:
You would maybe gain a few percent of performance by optimizing by hand compared to the compiler, which already optimizes for the CPU.

Leifi said:
Can you be specific, please? Exactly which optimizations do you think can be done on an M1 that are not already done in Stockfish and cFish C builds compiled with NEON? And how much do you think could be gained, in theory and in practice?

This is a question everyone here avoids and just tries to sidestep, because they have no good answer!

It's not about optimizing by hand. It's about making sure your code behaves well on a certain platform. It shouldn't matter for most code, since compilers can do good work here, but you need to be careful if your code relies on hardware-specific assumptions.

Hypothetical example: let's say you are a low-level programmer and you are working on a piece of code that leverages SIMD (either directly via intrinsics or indirectly via compiler vectorization). Your primary platform is x86 and you want to support both SSE and AVX hardware platforms. Older CPUs with SSE support can do at most two 128-bit SIMD operations per cycle. Let's say your algorithm needs to compute the sum of a long vector, so in your SSE path you are repeatedly fetching 128-bit values and adding them to an accumulator. You do some profiling and find out that your code is pretty much optimal: the bottlenecks are the loads from L1D and the vector units. You are happy with your result.

Now the new fancy M1 CPU comes out and you decide to implement a NEON path. Now, NEON uses 128-bit SIMD, which sounds the same as SSE, right? So you copy your SSE code and replace the basic SSE intrinsics with the NEON intrinsics, which takes you just five minutes. The code compiles, the tests are passing. But then a user opens an issue telling you that their M1 is slower than an AMD CPU on your code. You close the ticket, telling the user that you already have native NEON support and they should stop drinking the Kool-Aid: the problem is that M1 is not as fast as YouTubers claim.
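For illustration, the five-minute port could look roughly like the sketch below. This is not taken from any real project: the wrapper names (`vec4`, `v_zero`, `v_load`, `v_add`, `v_hsum`) and the function `sum_one_acc` are hypothetical, and there is a plain C stand-in so the file also builds where neither NEON nor SSE is available. The point is that the SSE and NEON paths map one-to-one and both keep a single accumulator.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
typedef float32x4_t vec4;
static inline vec4 v_zero(void) { return vdupq_n_f32(0.0f); }
static inline vec4 v_load(const float *p) { return vld1q_f32(p); }
static inline vec4 v_add(vec4 a, vec4 b) { return vaddq_f32(a, b); }
static inline float v_hsum(vec4 v) {
    float t[4];
    vst1q_f32(t, v);
    return (t[0] + t[1]) + (t[2] + t[3]);
}
#elif defined(__SSE__)
#include <xmmintrin.h>
typedef __m128 vec4;
static inline vec4 v_zero(void) { return _mm_setzero_ps(); }
static inline vec4 v_load(const float *p) { return _mm_loadu_ps(p); }
static inline vec4 v_add(vec4 a, vec4 b) { return _mm_add_ps(a, b); }
static inline float v_hsum(vec4 v) {
    float t[4];
    _mm_storeu_ps(t, v);
    return (t[0] + t[1]) + (t[2] + t[3]);
}
#else
/* Plain C stand-in so the sketch builds on any target. */
typedef struct { float f[4]; } vec4;
static inline vec4 v_zero(void) { vec4 v = {{0.0f, 0.0f, 0.0f, 0.0f}}; return v; }
static inline vec4 v_load(const float *p) { vec4 v = {{p[0], p[1], p[2], p[3]}}; return v; }
static inline vec4 v_add(vec4 a, vec4 b) {
    for (int i = 0; i < 4; i++) a.f[i] += b.f[i];
    return a;
}
static inline float v_hsum(vec4 v) { return (v.f[0] + v.f[1]) + (v.f[2] + v.f[3]); }
#endif

/* One 128-bit accumulator: every vector add has to wait for the one
   before it, no matter how many vector units the CPU has. */
float sum_one_acc(const float *x, size_t n) {
    assert(x != NULL || n == 0);
    vec4 acc = v_zero();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) acc = v_add(acc, v_load(x + i));
    float total = v_hsum(acc);
    for (; i < n; i++) total += x[i];  /* leftover elements */
    return total;
}
```

Both paths give the same answer, which is why the tests pass. Nothing in the code says how many adds the hardware could have had in flight at once.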

Here is the problem, though: your NEON code still uses the assumptions of your SSE hardware. But these assumptions are not correct for M1. It has at least twice as many SIMD processing units and can do more loads from L1D per cycle. Your optimal SSE code is not optimal on M1, as it is not able to feed the hardware properly. If you had profiled your code on a native machine, you would have seen that half of your vector units are starving for work. To fix it, you have to change your code to work on multiple SIMD values simultaneously. Maybe you want to interleave your sums and instead of x[0] + x[1] + x[2] + x[3]... do something like a = x[0] + x[1], b = x[2] + x[3], a + b. This would allow the compiler to generate code that has higher ILP. Maybe you also want to split your large vector into multiple chunks and process them simultaneously, which might help you avoid cache bank conflicts and maximize your loads. Maybe there are some other tricks for maximizing your ILP.
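A scalar sketch of the interleaving idea, assuming plain `float` data. The function names `sum_serial` and `sum_interleaved` are made up for this example, and the same shape would apply with vector accumulators instead of scalars. The first version forms one long dependency chain; the second keeps four independent chains that the core can overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: a single accumulator, so each add depends on the previous one. */
float sum_serial(const float *x, size_t n) {
    assert(x != NULL || n == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) acc += x[i];
    return acc;
}

/* Four independent accumulators: the four adds in the loop body do not
   depend on each other, so a wide core can execute them in parallel. */
float sum_interleaved(const float *x, size_t n) {
    assert(x != NULL || n == 0);
    float a = 0.0f, b = 0.0f, c = 0.0f, d = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a += x[i];
        b += x[i + 1];
        c += x[i + 2];
        d += x[i + 3];
    }
    float acc = (a + b) + (c + d);
    for (; i < n; i++) acc += x[i];  /* leftover elements */
    return acc;
}
```

One caveat: this changes the order of the floating-point additions, so results can differ in the last bits from the serial loop. That is also why compilers generally will not do this transformation on their own unless you allow it (for example with fast-math style flags). How many accumulators pay off depends on the machine, so it is something to measure rather than assume.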

Getting back to Stockfish: I have no idea what their code looks like or how they use NEON. I have no idea what hardware assumptions they operate under. One would need to profile the code to figure out where the problems are, and one might need to change the algorithms (using the techniques I describe above) to get maximal performance on M1.

There is one thing, however, that we do know for certain, because there are plenty of low-level tests and analyses around. M1 is a real monster when it comes to SIMD performance. It can do 512 bits of SIMD work per cycle (4x 128-bit vector units) with very low latency, which is comparable with AVX-512 CPUs, except that on M1 you can get there with regular NEON, no need for 512-bit SIMD extensions. So if you show me a piece of SIMD-optimized software that runs significantly slower on M1 than on a laptop Cezanne, my first guess is that the problem is lack of SIMD ILP in the code. This would be my guess for Stockfish as well.

Hopefully at some point in the future there will be some motivated developer with native hardware who has a passion for chess engines and will spend some time profiling the code and figuring out why it doesn't run faster on M1 than it does now. And maybe that developer will be able to submit a patch that will fix the performance issue. Or, who knows, maybe the result of the technical discussion will be that the M1 architecture is indeed bad for this kind of workload and what we see now is already the best it can do. But such a discussion is still pending.

P.S. This discussion should give some ideas about how complex these things can get: https://www.realworldtech.com/forum/?threadid=204001&curpostid=204001

Ethosik · Nov 29, 2021

pshufd said:
I've seen algorithms that are optimized for L1 and L2 cache sizes - that's not necessarily a bad idea as cache sizes have increased over time. But they can run poorly on older hardware. In some cases, you have separate code paths depending on the architecture. So you sniff the processor capabilities and then execute the best code for the hardware capabilities.

Another way to do a fair test would be to disable all hardware optimizations in software. But even this has problems because you can use algorithms that may favor one architecture over another. Or the compiler may not make best use of hardware. Or there may be some risks taken with compiler and linker optimizations.

Why would you disable part of the hardware optimizations? M1 is not just CPU and GPU. There are built-in neural engines, encode/decode engines and more. It's not just a CPU. So why not leave hardware optimizations in place, because things WILL actively use them to do their work. And that should be how you compare things. As a purchaser of a Windows or Mac system, I want to see how it compares in real-world tests. Not "disable this" and "disable that" and then test, because when I get my own product, I will NOT be disabling those things. So it is not a good test.
