I guess my musing here is about AVX2/etc vs NEON. While M1 is clearly a strong chip, a lot of benchmarks and real world apps aren’t heavy on SIMD. And I haven’t seen a lot of discussion comparing the two SIMD implementations to get an idea how close they are in reality. So I’m not sure how much impact things like AVX2 being 256-bit versus NEON being 128-bit have. How much would having SVE improve things on the ARM side? Would leveraging AMX on M1 via the Accelerate framework help at all (it can be ~2x faster than NEON in certain cases)? Does M1 have a similar number of SIMD units per core as modern Intel? Things I just honestly don’t know enough about that would have an impact on the final results here.
This is not a trivial question. Yes, NEON is 128-bit, but Apple Silicon currently offers 4x SIMD units (for 4x 128-bit ops per cycle) while most AVX2 implementations can do 2x 256-bit ops per cycle, so it's more or less the same. Some Intel CPUs with AVX-512 can do more than one 512-bit operation per cycle, but it is unclear to me under which conditions.
To make comparisons more difficult though, there is also a difference in clocks. M1 will run at ~2.9-3.0 GHz while running heavy SIMD code, while x86 CPUs can clock significantly higher. However, since we are talking about sustained multithreaded throughput, the x86 chip is probably operating close to its base frequency (which is ~3 GHz for modern mobile x86), so we should get comparable throughput.
And then... there is the matter of data loads and stores. Modern x86 CPUs can sustain two 256-bit fetches from L1D, while M1 can't. If I understand the available Firestorm info correctly, it will only do two 128-bit loads from L1D per cycle on most data (and it can do an additional 128-bit load if the load overlaps with a previous store). This might limit M1 performance on some workloads where the ratio of operations to memory transfers is low. Stockfish could be one of those cases (they are processing large batches of data, but only do a few operations per load), so it might be limited by L1D performance - but who knows. At the same time, M1 has much larger caches, so it might have an advantage if you have to process a large amount of data with less predictable access patterns (which does not seem to be the case for Stockfish specifically).
Finally, looking at the ISA itself, it's a bit of a mixed bag. There are some patterns in x86 SIMD that NEON can't do directly, where you need a bunch of instructions to do the same thing. And vice versa.
Overall, for throughput oriented stream processing style code, I'd expect the Firestorm core to perform similarly to, or slightly worse than, a modern AVX2 x86 core, assuming well optimized code for both platforms. On more complex workloads that involve dependent data fetches, non-trivial conditions and a high ratio of SIMD operations to data transfers, Firestorm might pull ahead. On less optimized code, M1 should be faster, since it's overall a more flexible architecture. We can already see this in benchmarks like SPEC that do not employ architecture-specific SIMD optimizations (or where using SIMD would not work in the first place): M1's four independent FP units allow it to execute the code much faster.
It was Apple's decision not to sell chips with 16 to 20 high-performance cores, and that was a bad decision.
Please just go away.