> Ok, once my M1 16" has arrived I will be happy to contribute profiler output. Do you have instructions on how to collect it?
Not on Mac, no; I've never had any contact with Apple hardware or software.
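For reference, one way such a profile could be collected on macOS is with Instruments (the Time Profiler or CPU Counters templates) or its command-line front end, xctrace. A rough sketch, assuming Xcode and its command-line tools are installed; the exact options can vary between Xcode versions, so check `xctrace help record` first:

```sh
# Hypothetical invocation: record a Time Profiler trace of a Stockfish bench run.
# The binary path and bench arguments are illustrative, not taken from this thread.
xcrun xctrace record --template 'Time Profiler' \
    --output stockfish.trace \
    --launch -- ./stockfish bench

# Open the recorded trace in the Instruments GUI for inspection.
open stockfish.trace
```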
> Apple M1 CPU & GPU speed is very disappointing
The megahertz wars ended decades ago. You need to move into the 21st century, where what matters is how much your workflows have improved, not the raw speed of the components. Can a setup with a lower clock speed but better optimisation be better? Often, yes. The Apple SoC proves this by using each clock cycle more efficiently and moving tasks to other parts of the SoC, such as video encoding.
> The megahertz wars ended decades ago. You need to move into the 21st century, where what matters is how much your workflows have improved, not the raw speed of the components. …
I know this is a long thread, but you would only have to read the last page to understand that what you wrote is useless for this discussion. We don't need marketing babble here.
> I know this is a long thread, but you would only have to read the last page to understand that what you wrote is useless for this discussion. We don't need marketing babble here.
I addressed the whole point of the topic.
> Speaking of performance running on battery: if there were a loudest-fan test, the AMD and Intel laptops would win every single time. Can't say the same about how well they perform running on battery alone.
It's always good to see good results.
> Apple does not have any "magic fairy dust". They simply have a very wide out-of-order core capable of having hundreds of instructions in flight. The execution capabilities of the M1 far surpass those of Zen 3 for most real-world code, and your insistence on cherry-picking software that has not been profiled or tuned for M1 further suggests that you are not at all interested in a proper discussion but are just here to repeat arbitrary, uneducated claims.
Of course, everybody can download the Stockfish code and take a look at the makefile and other things.
It's not about optimizing by hand. It's about making sure your code behaves well on a certain platform. It shouldn't matter for most code (compilers do a good job here), but you need to be careful if your code relies on hardware-specific assumptions.
Hypothetical example: let's say you are a low-level programmer working on a piece of code that leverages SIMD (either directly via intrinsics or indirectly via compiler vectorization). Your primary platform is x86 and you want to support both SSE and AVX hardware. Older CPUs with SSE support can do at most two 128-bit SIMD operations per cycle. Let's say your algorithm needs to compute the sum of a long vector, so in your SSE path you repeatedly fetch 128-bit values and add them to an accumulator. You do some profiling and find that your code is pretty much optimal: the bottlenecks are the loads from L1D as well as the vector units. You are happy with your result.
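To make the hypothetical concrete, a single-accumulator SSE sum could look roughly like the sketch below (illustrative only; the function name and the assumption that the length is a multiple of 4 are mine, not from any real codebase):

```cpp
#include <immintrin.h>  // SSE intrinsics
#include <cstddef>

// Sum a float vector with a single 128-bit accumulator.
// One long dependency chain: each add waits for the previous one,
// matching the narrow-SSE assumptions described above.
float sum_sse(const float* x, std::size_t n) {  // n assumed to be a multiple of 4
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 4) {
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));  // serial dependency on acc
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);                       // horizontal reduction
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```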
Now the fancy new M1 CPU comes out and you decide to implement a NEON path. NEON uses 128-bit SIMD, which sounds the same as SSE, right? So you copy your SSE code and replace the basic SSE intrinsics with NEON intrinsics, which takes you just five minutes. The code compiles, the tests pass. But then a user opens an issue telling you that their M1 is slower than an AMD CPU on your code. You close the ticket, telling the user that you already have native NEON support and that they should stop drinking the Kool-Aid: the problem is that the M1 is not as fast as YouTubers claim.
Here is the problem, though: your NEON code still bakes in the assumptions of your SSE hardware, and those assumptions are not correct for M1. It has at least twice as many SIMD processing units and can do more loads from L1D per cycle. Your optimal SSE code is not optimal on M1, as it is not able to feed the hardware properly. If you had profiled your code on a native machine, you would have seen that half of your vector units are starving for work. To fix it, you have to change your code to work on multiple SIMD values simultaneously. Maybe you want to interleave your sums, and instead of x[0] + x[1] + x[2] + x[3]... do something like a = x[0] + x[1], b = x[2] + x[3], then a + b; this allows the compiler to generate code with higher ILP. Maybe you also want to split your large vector into multiple chunks and process them simultaneously, which might help you avoid cache bank conflicts and maximize your loads. Maybe there are other tricks to maximize your ILP.
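A sketch of what the interleaved-accumulator idea could look like with NEON intrinsics; the four independent chains give a wide core like the M1 several additions it can issue in parallel (again purely illustrative; the unroll factor and function name are my assumptions, not anything taken from Stockfish):

```cpp
#include <arm_neon.h>
#include <cstddef>

// Sum a float vector with four independent 128-bit accumulators.
// The four add chains do not depend on each other, so an out-of-order core
// with several vector units can work on all of them in the same cycle.
float sum_neon_ilp(const float* x, std::size_t n) {  // n assumed to be a multiple of 16
    float32x4_t a0 = vdupq_n_f32(0.0f);
    float32x4_t a1 = a0, a2 = a0, a3 = a0;
    for (std::size_t i = 0; i < n; i += 16) {
        a0 = vaddq_f32(a0, vld1q_f32(x + i));
        a1 = vaddq_f32(a1, vld1q_f32(x + i + 4));
        a2 = vaddq_f32(a2, vld1q_f32(x + i + 8));
        a3 = vaddq_f32(a3, vld1q_f32(x + i + 12));
    }
    // Combine the partial sums, then reduce the final vector horizontally.
    return vaddvq_f32(vaddq_f32(vaddq_f32(a0, a1), vaddq_f32(a2, a3)));
}
```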
Getting back to Stockfish: I have no idea what their code looks like or how they use NEON. I have no idea what hardware assumptions they operate under. One would need to profile the code to figure out where the problems are, and one might need to change the algorithms (using the techniques I describe above) to get maximal performance on M1.
There is one thing, however, that we do know for certain, because there are plenty of low-level tests and analyses around: M1 is a real monster when it comes to SIMD performance. It is able to do 512 bits' worth of SIMD operations per cycle (4x 128-bit vector units) with very low latency, which is comparable to AVX-512 CPUs, except that on M1 you get there with regular NEON, with no need for 512-bit SIMD extensions. So if you show me a piece of SIMD-optimized software that runs significantly slower on M1 than on a laptop Cezanne, my first guess is that the problem is a lack of SIMD ILP in the code. This would be my guess for Stockfish as well.
Hopefully at some point in the future there will be a motivated developer with native hardware and a passion for chess engines who will spend some time profiling the code and figuring out why it doesn't run faster on M1 than it does now. And maybe that developer will be able to submit a patch that fixes the performance issue. Or, who knows, maybe the result of the technical discussion will be that the M1 architecture is indeed bad for this kind of workload and what we see now is already the best it can do. Who knows. But that discussion is still pending.
P.S. This discussion should give some ideas about how complex these things can get: https://www.realworldtech.com/forum/?threadid=204001&curpostid=204001
> does your code prefetch the data?
There is no need for prefetching, as the most costly part of the network is only about 16 KiB in size. The largest layer is mostly sequential accesses that cannot really be prefetched because they are known too late.
> does your code use multiple SIMD streams to make sure you have enough ILP for multiple vector units?
This is a valid concern, because it's not done explicitly. I cannot do any profiling because I don't own an M1 machine. If you have proof that the vector units are not saturated during inference, then I may be able to help. I've been saying the same thing for the last few months, but apparently no M1 user is capable of providing a profile.
> does your SIMD code rely on long dependency chains that might reduce ILP?
The NEON one, possibly. Again, no one ever provided profiler output so I don't know.
> have you profiled the code on M1 and found that it results in optimal occupancy of vector units?
No, I don't own an M1, and no one who has access to one has done so (which is weird considering this discussion is already more than 20 pages long; have you guys just been throwing empty words around all this time?).
Sorry, what exactly do you still want from me? You've been vocally complaining that nobody can offer a more technical explanation of why having native ARM and NEON support is not the same as proper optimization. In your own words:
Well, I have explained it to you, in depth, in #550. Your reaction? "But Stockfish has had an ARM version for years...!" How is one supposed to talk to you on a professional level if all you do is squirm and wriggle like an eel? I know this kind of behavior from my five-year-old nephew: if you tell him something he doesn't want to hear, he covers his ears and goes "la la la". But dealing with this from an (allegedly) adult person? It's embarrassing, really.
> Oh, this whole discussion has already happened before in a different thread when the M1 came out. I guess this forum has short-term memory loss, or the anti-M1 crowd just uses the same talking points.
> Tune in next week for "Intel's 6,000W new processor smokes the M1 Max chip! Apple made a huge mistake."
> Followed by a "The M1 Max can't compete with a liquid-cooled, overclocked 3090 Ti. Apple is doomed" post.
That's exactly the same thread.
> … The Apple SoC proves this by using each clock cycle more efficiently and moving tasks to other parts of the SoC, such as video encoding.
Of course, if you want, you can call Apple's media engines "optimization of hardware":
Hardware-accelerated H.264, HEVC, ProRes, and ProRes RAW; a video decode engine; two video encode engines; two ProRes encode and decode engines…
This way it is easy to break every video (etc.) benchmark.
Some engines for integer math and other stuff would be great too.
> By the way, is it possible that you don't really have an M1 problem, but rather a macOS problem?
Apple's Intel Macs are known for overheating, and I wouldn't expect anything different with a heavy AVX2 workload like Stockfish. A LOT of people discovered their cooling solutions were insufficient when trying to run Stockfish, and not only on laptops.
I just ran the Stockfish 14 bench on my 2019 16" MacBook Pro with an i9-9880HK, using the same settings as the ones on https://openbenchmarking.org/test/pts/stockfish-1.3.0: stockfish bench 128 16 24 default depth
Result: 11283109 nodes/second
This is around 5% slower than the i7-9750H result on OpenBenchmarking. The 8-core i9 should be at least 20-25% faster than the 6-core i7-9750H...
> According to discussion on talkchess, the Apple M1 even got beaten on battery performance.
> The M1 was able to do 60 Gig positions of analysis before the battery went from 100% to 0% (the battery lasted 3 hours).
> An Asus 5900H laptop ran on battery and was finished with the same amount of positions (60G) after only 1 hour, with 50% of its battery left.
>
> Laptop     | Positions      | Hours | Battery left
> MacBook M1 | 60.000.000.000 | 3h    | 0%
> Asus 5900H | 60.000.000.000 | 1h    | 50%
>
> Granted, the Asus probably was louder and had a larger battery, but it is still quite sad, when the Macs' biggest selling point is power efficiency, for them to be less productive and run out of battery with less work done.
The Asus has a big battery, and this is good.
> The Asus has a big battery, and this is good.
Apple could have used a bigger battery too. It's their fault/decision. But maybe the M1 Max 16-inch case is too small, which again was Apple's decision.
> When 4A wanted to optimize Metro Exodus and Larian Studios wanted to optimize Baldur's Gate 3, they turned to Apple for help, and Apple was more than happy to help them. That has made BG3 the best-optimized native game for Apple Silicon, doing around 100 fps at Ultra 1080p on a MBP with the 32-core M1 Max. That's faster than an Alienware X17 with an i7-11800H and a 140W RTX 3070.
> That's the power of optimization. So the important question here is: why don't chess software developers ask Apple for help? Or do they? People here really should ask the developers all these questions instead of asking random forum members who don't care for chess (including me) to provide proof. Come back here and tell us what the developers said and what answer Apple gave them before this turns into a 100-page-long discussion about what M1 can't do and why it sucks at Stockfish.
> As someone already said, this exact discussion about chess and Stockfish came up when the M1 was released in 2020. I guess chess hasn't got more popular since then.
> Optimize high-end games for Apple GPUs (WWDC21 session, developer.apple.com)
1080p?
> Yeah, one thing people forget when comparing, say, TFLOPS of the Apple devices vs. Nvidia or AMD is that Apple does Tile-Based Deferred Rendering (TBDR).
> What's that? It's basically not rendering stuff that isn't visible. Saves power, improves performance.
> Nvidia and AMD cards not doing TBDR (or mesh shaders) are calculating and drawing a bunch of stuff in the background just to draw over it again with the stuff visible in front, due to the painter's algorithm. Wasting throughput, compute power and energy.
> Yes, mesh shaders will fix this to a degree if used, but that is Turing and Navi 2 onward only, so no PC games will rely on it for some time, and the vast majority of PC hardware out there simply doesn't support it.
> This is why you can't just compare on-paper specs and say "oh, the TFLOPS are lower, it SUCKS".
The Apple M1 Max doesn't even have tensor cores.
> So what you are telling us is that the M1 could go 3 hours but the Asus 5900H could only go 2 hours on 100% battery. Got it, good to know. Might be the only valuable thing the chess benchmark is good for on the M1 hardware: a battery-life test.
> Come back when the chess engine code has been fully optimized to take advantage of the M1 hardware, such as the machine learning and the Neural Engine. Until then, it is pretty much a useless benchmark for the M1 hardware.
If someone wants to have great battery tests, use Stockfish, LC0 and Ceres.
> Based on your text I assume you are one of the developers. Great! Finally someone with a little sense here. If I understand it correctly, your SIMD code does computation on neural networks. We know that the M1 has plenty of vector units and generally does excellently in SIMD throughput workloads (I am talking NEON here, not AVX or the NPU). That it performs poorly in your code can have two explanations: a) there is indeed something about the nature of your workload that hits a slow path on M1 hardware, or b) your code is not optimal. Personally, I think that a) is less likely, since the M1 has the bandwidth and the ALUs to excel at most types of matrix code.
> So here are a few questions:
> - does your code prefetch the data?
> - does your code use multiple SIMD streams to make sure you have enough ILP for multiple vector units?
> - does your SIMD code rely on long dependency chains that might reduce ILP?
> - have you profiled the code on M1 and found that it results in optimal occupancy of vector units?
You could at least take a short look at the Stockfish code; if there are some obvious problems, pick one of them.
This is fair, but unfortunately I personally have little interest in chess engines. Anyway, just because you wrote some NEON code does not mean that it runs optimally (I have written about it in #550).
> The Apple M1 Max doesn't even have tensor cores.
> All positive benchmarks for M1 involve stuff for "creators". I find it funny that people consider chess a niche while "creators" are somehow common. Also, most "creator" benchmarks are export-only. How often do you export, ffs? None of them compare file sizes, btw.
Nonsense. Let's look at these SPEC MP benchmarks, shall we?
> Yeah, but what about chess benchmarks, DP Review?
Asking the important questions...! ;^p