Status
Not open for further replies.

Leifi

macrumors regular
Nov 6, 2021
128
121
I never claimed I would be able to make it run much faster. I said that someone would need first to do a code analysis to understand where potential issues are. And sure, I could be that someone but I won’t do it for free. I am not a charity and that’s not a project I have personal interest in.

What makes you think no one has done code analyses of the Stockfish code? And that there are any specific issues with Apple's ARM version? Stockfish has had ARM versions for years, and even special assembler and C versions for ARM.

"I never claimed I would be able to make it run much faster."

Noted. This is the important fact.
 
Last edited:

pshufd

macrumors G4
Oct 24, 2013
10,151
14,574
New Hampshire
When 4A wanted to optimize Metro Exodus and Larian Studios wanted to optimize Baldur's Gate 3 they turned to Apple for help and Apple was more than happy to help them. That has made BG3 the best optimized native game for Apple Silicon doing around 100 fps at Ultra 1080p on MBP 32-core M1 Max. That's faster than an Alienware X17 with i7 11800H and 140W RTX 3070.

That's the power of optimization. So the important question here is: why don't chess software developers ask Apple for help? Or do they? People here really should ask the developers all these questions instead of asking random forum members who don't care for chess (including me) to provide proof. Come back here and tell us what the developers said and what answer Apple gave them before this turns into a 100-page discussion about what M1 can't do and why it sucks at Stockfish.

As someone already said this exact discussion about chess and Stockfish came up when M1 was released in 2020. I guess chess hasn't got more popular since then.



This is a video on how to run Think or Swim from TD Ameritrade, natively on Apple Silicon. Performance generally stinks if you use their installer kit as you have interpreted Java code that generates x86 code that's translated by Rosetta 2. Running it on native Apple Silicon Java means that it takes one-third the time for starting up and doing general operations. TD Ameritrade has almost $4 trillion in assets under management and Schwab just bought them out. But they don't have a native Apple Silicon kit for their professional trading platform.

So people who want great performance on Apple Silicon just follow the directions in this video, whether to earn their living, to make money for their companies or just to manage their 401k accounts. And that's where it is worth it to put some effort into optimization, because it provides a direct financial payback.

 
  • Like
Reactions: Homy

pshufd

macrumors G4
Oct 24, 2013
10,151
14,574
New Hampshire
What makes you think no one has done code analyses of the Stockfish code? And that there are any specific issues with Apple's ARM version? Stockfish has had ARM versions for years, and even special assembler and C versions for ARM.

"I never claimed I would be able to make it run much faster."

Have you looked at it yourself? I've done optimizations that nobody has thought to do in the past because I had an intense interest in making something faster. All it takes is one interested party with the right education, background, skillset and tools.
 
  • Like
Reactions: ddhhddhh2

Taz Mangus

macrumors 604
Mar 10, 2011
7,815
3,504
What makes you think no one has done code analyses of the Stockfish code? And that there are any specific issues with Apple's ARM version? Stockfish has had ARM versions for years, and even special assembler and C versions for ARM.

"I never claimed I would be able to make it run much faster."

Noted. This is the important fact.
I love learning something new all the time. I now know there is something called Stockfish. If I had not known better I would have thought we were discussing fishing licenses.
 

pshufd

macrumors G4
Oct 24, 2013
10,151
14,574
New Hampshire
Performance optimization encompasses a variety of skills, but you need in-depth knowledge of the architecture, the low-level APIs and the best tools for profiling. Apple Silicon is a lot more than ARM. Have you actually optimized a large application before? This is all pretty basic software engineering stuff.
 
  • Like
Reactions: Homy and throAU

Taz Mangus

macrumors 604
Mar 10, 2011
7,815
3,504
According to a discussion on talkchess, the Apple M1 even got beaten on battery/performance.

The M1 was able to do 60 G positions of analysis before the battery went from 100% to 0% (the battery lasted 3 hours).

An Asus 5900H laptop running on battery finished the same 60 G positions after only 1 hour, with 50% of its battery left.


Laptop        Positions        Hours   Battery left
Macbook M1    60.000.000.000   3h      0%
Asus 5900H    60.000.000.000   1h      50%

Granted, the Asus probably was louder and had a larger battery, but it is still quite sad, when power efficiency is the biggest selling point for the Macs, to see them be less productive and run out of battery with less work done.
So what you are telling us is that the M1 could go 3 hrs but the Asus 5900H could only go 2 hrs on 100% battery. Got it, good to know. That might be the only valuable thing the chess benchmark is good for on the M1 hardware: a battery life test.

Come back when the chess engine code has been fully optimized to take advantage of the M1 hardware, such as the machine learning hardware and the Neural Engine. Until then, it is pretty much a useless benchmark for the M1 hardware.
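
For reference, the arithmetic behind the quoted table can be worked through in a few lines. This is only a sketch that takes the quoted talkchess figures at face value and assumes battery drain is linear over the run (an assumption, not something stated in the posts); the function name `battery_run` is made up for illustration. It also says nothing about battery capacity, which the quoted post notes was probably larger on the Asus.

```python
def battery_run(positions, hours, battery_used_fraction):
    """Return (positions per hour, positions per full charge) for one run.

    Assumes the battery drains linearly, so a run that used half the
    battery could be repeated once more on the same charge.
    """
    rate = positions / hours
    per_charge = positions / battery_used_fraction
    return rate, per_charge


# Figures as quoted in the table (60.000.000.000 = 60e9 positions).
m1 = battery_run(60e9, 3.0, 1.0)    # 100% -> 0% in 3 hours
asus = battery_run(60e9, 1.0, 0.5)  # 100% -> 50% in 1 hour

for name, (rate, per_charge) in (("Macbook M1", m1), ("Asus 5900H", asus)):
    print(f"{name}: {rate / 1e9:.0f} G positions/hour, "
          f"{per_charge / 1e9:.0f} G positions per full charge")
```

Under these assumptions the Asus would indeed last about 2 hours against the M1's 3, but it would analyse roughly twice as many positions on one charge.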
 
  • Like
Reactions: jdb8167 and Tagbert

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
Have you looked at it yourself? I've done optimizations that nobody has thought to do in the past because I had an intense interest in making something faster. All it takes is one interested party with the right education, background, skillset and tools.
This is the key. Stockfish is an open source chess engine. If none of the regular contributors have an M1 Mac, they're not going to be able to put much (or any) effort into properly optimizing for M1 (*). If nobody outside that group of people has the right combination of interest, background, and access to tools, it's just not going to happen. Such is the nature of open source.
 
Last edited by a moderator:

throAU

macrumors G3
Feb 13, 2012
9,204
7,357
Perth, Western Australia
This is the key. Stockfish is an open source chess engine. If none of the regular contributors have an M1 Mac, they're not going to be able to put much (or any) effort into properly optimizing for M1 (*). If nobody outside that group of people has the right combination of interest, background, and access to tools, it's just not going to happen. Such is the nature of open source.

Exactly, and I'd wager that the crossover between
  • chess nerd
  • M1 owner
  • fairly professional coder
  • spare time to devote to chess code optimisation
is exceedingly small.

The crossover between chess nerd and Linux nerd is probably a lot larger, but even then you'll likely find the app is primarily optimised for Intel on Windows, because that's the dominant platform.
 
  • Like
Reactions: Homy

JouniS

macrumors 6502a
Nov 22, 2020
638
399
Exactly, and I'd wager that the crossover between
  • chess nerd
  • M1 owner
  • fairly professional coder
  • spare time to devote to chess code optimisation
is exceedingly small.
The first three groups overlap quite a lot. Professional software developers are far more likely to use a Mac and play chess than the general public. The fourth point is the real problem. Your employer is not going to pay you for contributing to an open source chess engine, so you must do it on your own time.

Then we run into the social dynamics of volunteer projects. People in general do not contribute significantly to somebody else's projects. They prefer contributing to something where they are a part of the core team and have ownership of the project or a part of it. However, most volunteer projects already have exactly as many core members as there is room for. Most would prefer having a larger team, but they can't accommodate any additional people without changing team structure and dynamics, which could cause them to lose existing people.
 
  • Like
Reactions: Appletoni

leman

macrumors Core
Oct 14, 2008
19,522
19,679
What makes you think no one has done code analyses of the Stockfish code? And that there are any specific issues with Apple's ARM version? Stockfish has had ARM versions for years, and even special assembler and C versions for ARM.

"I never claimed I would be able to make it run much faster."

Noted. This is the important fact.

Sorry, what exactly do you still want from me? You've been vocally complaining that nobody can offer a more technical explanation why having native ARM and Neon is not the same as proper optimization. In your own words:

Can you be specific, please? Exactly what optimizations do you think can be done on an M1 that are not already done in Stockfish and the Cfish C builds with NEON? And how much do you think could be gained, exactly, in theory and in practice?

This is a question everyone here avoids and just tries to sidestep, because they have no good answer!

Well, I have explained it to you, in depth, in #550. Your reaction? "but Stockfish has had ARM version for years...!" How is one supposed to talk to you on a professional level if all you do is squirm and wriggle like an eel? I know this kind of behavior from my five-year-old nephew: if you tell him something he doesn't want to hear, he closes his ears and goes "la la la". But dealing with this from an (allegedly) adult person? It's embarrassing, really.
 

ddhhddhh2

macrumors regular
Jun 2, 2021
242
374
Taipei
I am not sure what your point is. So the M1's performance is irrelevant because more people know more about LeBron than they do about the M1 or chess? And the fact that fashion brands sponsor people who can throw a ball through a hoop makes everything that requires a higher IQ less relevant?

Is that what you are trying to argue?

Yes, the cruel truth.

Of course people care about the performance of the M1, coding, 3D, 2D, video processing, music, even surfing and text work.

But obviously, not chess.

I even wonder how many people in this discussion really understand chess (I do). How many chess engines are there? Are all chess engines on M1 unoptimized? Do all of them underperform? Let's just say that until you claimed this, almost no one knew.

Let me give you an example. A long, long time ago, when I wanted to do 3D creation, there were a lot of options, but they did not include the most popular one, 3D Studio Max. I later chose C4D + SketchUp, both of which worked very well on Macs, from G4 to G5 to Intel Macs, and continue to work very well up to the M series.

If I had insisted on learning 3D Studio Max back then, I would have had to consider a Windows PC instead of macOS.

Although this example is rather extreme, as far as I can see, you find that your beloved chess cannot get the same performance on the M1 as everyone else is experiencing. Of course, you have the right to complain, but I think you have a few options.

1. Sponsor money, lots of money, for chess optimization. And by the way, don't post that stupid pic again; if you're an angel investor, you have the right to talk, unless you don't believe optimizing will help after all.

2. Return the M1. You have spent so much time on here preaching your arguments that there was enough time for you to return it.

All your arguments make you look like a child, you are not happy with the teddy bear in your hand, but you are holding on to it.

All software that has ever performed poorly on macOS may or may not improve, but if it keeps performing poorly, people usually just blame the software developer, even if it is open source.

However, you're different from everyone else, and your unique and targeted remarks make me think you have an ulterior motive.

Ah yes, I've seen Intel's advertising budget go down, so you have to do the right thing and let the world know that, even though most people are happy with the performance, power savings and low temperatures that the M series offers, the M series can't run your favorite chess game well, so the M series is garbage. That's your genius logic.

Checkmate!
 
Last edited:
  • Like
Reactions: cbum and pshufd

ddhhddhh2

macrumors regular
Jun 2, 2021
242
374
Taipei
You deride Apple for using proprietery (sic) technology, but then praise CUDA? You also complain about comparing performance on battery mode, which is perfectly valid for a laptop comparison? How many more escape hatches are there to get through?

M1 has been out for little more than a year. The fact that chess engines haven't yet been fully optimized isn't earth shattering news. Apple, and others, have put resources toward the areas of greatest need. Chess engines aren't exactly at the top of the list.

No, he did not believe that optimization could help, so he preferred to blame the M1 in his hands rather than return it. He argued for so long that he lost the possibility of returning the product, I guess.

I can well understand that. My nephew is not happy with his teddy bear, but if I dare to take it away, he will immediately make a fuss.
 
  • Haha
  • Like
Reactions: pshufd and Andropov

Sopel

macrumors member
Nov 30, 2021
41
85
Apple users: This workload is bad on M1 because it's for ****ing peasants. Who would even use a chess engine? PATHETIC
Also apple users: People laugh at us because they are envious!

btw. I'll let you know that Stockfish is optimized for M1, we have manually written NEON code and compiler option to optimize for apple-silicon. If you think you can do better the burden of proof is on you.
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Apple users: This workload is bad on M1 because it's for ****ing peasants. Who would even use a chess engine? PATHETIC
Also apple users: People laugh at us because they are envious!

btw. I'll let you know that Stockfish is optimized for M1, we have manually written NEON code and compiler option to optimize for apple-silicon. If you think you can do better the burden of proof is on you.
You manually wrote a compiler option? What?
 
  • Like
Reactions: jdb8167

Appletoni

Suspended
Original poster
Mar 26, 2021
443
177
It is called optimization of hardware. So now you are going to complain that Apple has some sort of unfair advantage.
Of course, if you want, you can call Apple's media engines optimization of hardware:
Hardware-accelerated H.264, HEVC, ProRes, and ProRes RAW, video decode engine, two video encode engines, two ProRes encode and decode engines…
This way it is easy to break every video… benchmark.
Some engines for integer math and other stuff would be great too.
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Of course, if you want, you can call Apple's media engines optimization of hardware:
Hardware-accelerated H.264, HEVC, ProRes, and ProRes RAW, video decode engine, two video encode engines, two ProRes encode and decode engines…
This way it is easy to break every video… benchmark.
Some engines for integer math and other stuff would be great too.
They have engines for integer math. They are called CPUs.
 

Appletoni

Suspended
Original poster
Mar 26, 2021
443
177
Just curious why you care so much and that explains why.

People don't care about chess.

Look at the net worth of Magnus Carlsen and compare it to LeBron James. Or Tom Brady. Or even Roger Federer. Tennis is not even in the big money leagues but it's still way more than chess. Look at the sponsorships in chess. Do chess players get multimillion dollar clothing contracts? Do they get clothing contracts at all?

If you want chess to run well on Apple Silicon, optimize it yourself.
Funny joke.
It’s very obvious that you haven’t read the latest newspapers xD
 

Appletoni

Suspended
Original poster
Mar 26, 2021
443
177
You deride Apple for using proprietery (sic) technology, but then praise CUDA? You also complain about comparing performance on battery mode, which is perfectly valid for a laptop comparison? How many more escape hatches are there to get through?

M1 has been out for little more than a year. The fact that chess engines haven't yet been fully optimized isn't earth shattering news. Apple, and others, have put resources toward the areas of greatest need. Chess engines aren't exactly at the top of the list.
If chess engines are slow on the M1 Max, then automatically there is other stuff that is slow too.
 
  • Like
Reactions: Leifi

leman

macrumors Core
Oct 14, 2008
19,522
19,679
btw. I'll let you know that Stockfish is optimized for M1, we have manually written NEON code and compiler option to optimize for apple-silicon.

Based on your text I assume you are one of the developers. Great! Finally someone with a little sense here. If I understand it correctly, your SIMD code does computation on neural networks. We know that M1 has plenty of vector units and generally does excellently in SIMD throughput workloads (I am talking NEON here, not AVX or the NPU). That it performs poorly in your code can have two explanations: a) there is indeed something about the nature of your workload that hits a slow path on M1 hardware or b) your code is not optimal. Personally, I think that a) is less likely, since M1 has the bandwidth and the ALUs to excel at most types of matrix code.

So here are a few questions:

- does your code prefetch the data?
- does your code use multiple SIMD streams to make sure you have enough ILP for multiple vector units?
- does your SIMD code rely on long dependency chains that might reduce ILP?
- have you profiled the code on M1 and found that it results in optimal occupancy of vector units?
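
As an illustration of the second and third questions above (a sketch only, not Stockfish's actual code): a reduction written with a single accumulator forms one long dependency chain, while splitting it across several independent accumulators gives the CPU separate chains that it can keep in flight on multiple vector units. Python is used here purely to show the shape of the transformation; it will not show any speedup, and real code would do this with NEON vector registers in C++. The function names and the `lanes` parameter are hypothetical, and both functions assume inputs of equal length.

```python
def dot_single(a, b):
    """Dot product with one accumulator: each add waits on the previous one."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def dot_multi(a, b, lanes=4):
    """Dot product with `lanes` independent accumulators, combined at the end."""
    accs = [0] * lanes
    n = len(a) - len(a) % lanes      # largest multiple of `lanes` that fits
    for i in range(0, n, lanes):
        for k in range(lanes):       # these updates do not depend on each other
            accs[k] += a[i + k] * b[i + k]
    total = sum(accs)                # single combine step after the loop
    for i in range(n, len(a)):       # leftover elements that did not fill a block
        total += a[i] * b[i]
    return total


weights = list(range(-8, 9))  # 17 integers
inputs = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3, 2]
assert dot_single(weights, inputs) == dot_multi(weights, inputs)
```

With integer data the two versions give identical results; with floating point, the changed summation order could alter rounding slightly. Whether the real NEON path is actually limited by such a chain is exactly what a profile on M1 would have to show.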


If you think you can do better the burden of proof is on you.

This is fair, but unfortunately I personally have little interest in chess engines. Anyway, just because you wrote some NEON code it does not mean that it runs optimally (I have written about it in #550).
 
  • Like
Reactions: jdb8167 and pshufd

Appletoni

Suspended
Original poster
Mar 26, 2021
443
177
How about a real world comparison, on a real world usage, of the M1 Max against a RTX3090, AMD Ryzen 9 5950x. I am sorry I could not find an example of a chess program simulation benchmark, so you will just have to settle for this.

You know that he earns money by showing benchmarks of the new M1 Max to Apple fans?
Obviously no one will show bad results to (Apple) fans.
If I sold VW cars, I wouldn’t show bad results to the customers either.
 
  • Like
Reactions: Leifi

Sopel

macrumors member
Nov 30, 2021
41
85
Based on your text I assume you are one of the developers. Great! Finally someone with a little sense here. If I understand it correctly, your SIMD code does computation on neural networks. We know that M1 has plenty of vector units and generally does excellently in SIMD throughput workloads (I am talking NEON here, not AVX or the NPU). That it performs poorly in your code can have two explanations: a) there is indeed something about the nature of your workload that hits a slow path on M1 hardware or b) your code is not optimal. Personally, I think that a) is less likely, since M1 has the bandwidth and the ALUs to excel at most types of matrix code.

So here are a few questions:

- does your code prefetch the data?
- does your code use multiple SIMD streams to make sure you have enough ILP for multiple vector units?
- does your SIMD code rely on long dependency chains that might reduce ILP?
- have you profiled the code on M1 and found that it results in optimal occupancy of vector units?




This is fair, but unfortunately I personally have little interest in chess engines. Anyway, just because you wrote some NEON code it does not mean that it runs optimally (I have written about it in #550).
> does your code prefetch the data?

There is no need for prefetching, as the most costly part of the network is about 16 KiB in size. The largest layer is a lot of sequential accesses that cannot really be prefetched because they are known too late.

> does your code use multiple SIMD streams to make sure you have enough ILP for multiple vector units?

This is a valid concern, because it's not done explicitly. I cannot do any profiling because I don't own an M1 machine. If you have proof that the vector units are not saturated during inference then I may be able to help. I've stated the same thing for the last few months but apparently no M1 user is capable of providing a profile.

> does your SIMD code rely on long dependency chains that might reduce ILP?

The NEON one, possibly. Again, no one ever provided profiler output so I don't know.

> have you profiled the code on M1 and found that it results in optimal occupancy of vector units?

No, I don't own an M1. No one who has access to an M1 did (which is weird considering this discussion is already more than 20 pages long. Have you guys been throwing empty words for so long?).
 

leman

macrumors Core
Oct 14, 2008
19,522
19,679
> does your code prefetch the data?

There is no need for prefetching, as the most costly part of the network is about 16 KiB in size. The largest layer is a lot of sequential accesses that cannot really be prefetched because they are known too late.

> does your code use multiple SIMD streams to make sure you have enough ILP for multiple vector units?

This is a valid concern, because it's not done explicitly. I cannot do any profiling because I don't own an M1 machine. If you have proof that the vector units are not saturated during inference then I may be able to help. I've stated the same thing for the last few months but apparently no M1 user is capable of providing a profile.

> does your SIMD code rely on long dependency chains that might reduce ILP?

The NEON one, possibly. Again, no one ever provided profiler output so I don't know.

> have you profiled the code on M1 and found that it results in optimal occupancy of vector units?

No, I don't own an M1. No one who has access to an M1 did (which is weird considering this discussion is already more than 20 pages long. Have you guys been throwing empty words for so long?).

Ok, once my M1 16" has arrived I will be happy to contribute profiler output. Do you have instructions on how to collect it?
 
  • Like
Reactions: Appletoni