Apple probably have a dissadvantage of being a very closed, proprietary system, with poor or non-existing support and documentation to the open-source community trying to optimize code for Apple HW. This said Apple's sadly does not have some magic fairy dust that could be sprayed on their ARMv8.4 SoC to increase perfomance to even get closer to the lower end gen 4 8-core CPUs of AMD.
Apple does not have any "magic fairy dust". They simply have a very wide out of order core capable of having hundreds of instructions in flight. Execution capabilities of M1 far surpass the execution capabilities of Zen3 for most real world code and your insistence to cherry picking software that has not been profiled or tuned for M1 further suggests that you are not at all interested in a proper discussion but are just here to repeat arbitrary uneducated claims.
You would maybe gain a few percent of performance by optimizing by hand compraed to the compiler that already optimizes for the CPU.
Can you be specific please.. What exactly optimizations do you think can be done on an M1 nott already done on stockfish and cFish c compiles with NEON. And how much would you think could gained exactly in theory and in practice..
This is a question everyone avoids here and just tries to sidestep, becuase they have no good answer!
It's not about optimizing by hand. It's about making sure your code behaves well on a certain platform. It shouldn't matter for most code — compilers can make good work here, but you need to be careful if your code relies on hardware-specific assumptions.
Hypothetic example: let's say you are a low-level programmer and you are working on a piece of code that leverages SIMD (either directly via intrinsics or indirectly via compiler vectorization) . Your primary platform is x86 and you want to support both SSE and AVX hardware platforms. Older CPUs with SSE support can do at most two 128-bit SIMD operations per cycle. Let's say you algorithm needs to compute a sum of a long vector, so in your SSE path are repeatedly fetching 128-bit values and adding them to an accumulator. You do some profiling and find out that your code is pretty much optimal — the bottleneck are loads from L1D as well as the vector units. You are happy with your result.
Now the new fancy M1 CPU comes out and you decide to implement a NEON path. Now, NEON uses 128bit SIMD, sounds same as SSE, right? So you copy your SSE code, replace the basic SSE intrinsics with the NEON intrinsics, which takes you just five minutes. Code compiles, tests are passing. But then a user opens an issue telling you that their M1 is slower than an AMD CPU on your code. You close the ticket, telling the user that you already have native NEON support and they should stop drinking coolaid — the problem is that M1 is not as fast as YouTubers claim.
Here is a problem though — your NEON code still uses the assumptions of your SSE hardware. But these assumptions are not correct for M1. It has at least twice as many SIMD processing units and can do more loads from L1D per cycle. Your optimal SSE code is not optimal on M1, as it is not able to feed the hardware properly. If you would have profiled your code on a native machine, you would have seen that half of your vector units are starving for work. To fix is, you have to change your code to work on multiple SIMD values simultaneously. Maybe you want to interleave your sums and instead of x[0] + x[1] + x[2] + x[3]... do something like a = x[0] + x[1], b = x[2] + x[3], a + b — this would allow the compiler to generate code that has higher ILP. Maybe you also want to split your large vector in multipel chunks and process them simultaneously — this might help you avoid cache bank conflicts and maximize your loads. Maybe there are some other tricks to maximizing your ILP.
Getting back to Stockfisch — I have no idea how their code looks like and how they use NEON. I have no idea what hardware assumptions they operate under. One would need to profile the code to figure out where the problems are, and one might need to change the algorithms (using the techniques I describe above) to get maximal performance on M1.
There is one thing I however we do know for certain, because there are plenty of low-level tests and analyses around. M1 is a real monster when it comes to SIMD performance. It is able to do a 512-bit SIMD operation per cycle (4x 128-bit vector units) with very low latency, which is comparable with AVX-512 CPUs — except on M1 you can get there with regular NEON, no need for 512-bit SIMD extensions. So if you show me a piece of SIMD-optimized software that runs significantly on M1 than on a laptop Cezanne, my first guess is that the problem is lack of SIMD ILP in the code. This would be my guess for Stockfish as well.
Hopefully at some point in a future there will be some motivated developer with native hardware who has a passion for chess engines and will spend some time profiling the code and figuring out why it doesn't run faster on M1 than it does now. And maybe that developer will be able to submit a patch that will fix the performance issue. Or, who knows, maybe the result of the technical discussion will be that M1 architecture is indeed bad for this kind of workloads and what we see now is already the best it can do. Who knows. But such discussion is still pending.
P.S. This discussion should give some ideas about how complex these things can get:
https://www.realworldtech.com/forum/?threadid=204001&curpostid=204001