LOL. Do you want me to count them? I don't see any "complex mappings". Most of the lines in the source are just type casts which don't produce any code.
Such as? I have done quite a bit of optimization using various versions of SSE and AVX in my time, but don't have any experience in ARM coding on that level.
Anyway, Alder Lake is more than twice as fast as the M1 Max in Cinebench multi-core. It would take some pretty sensational Neon intrinsics to achieve that much of a boost. If that were possible, I suspect Apple would have merged them into Embree long ago, given that it is used by some very popular projects such as Blender. But on the Embree GitHub page you can see them contributing optimizations that gain something like 8% ...
I doubt that very much. There was a significant performance improvement from AVX to AVX2, the only big difference being the longer vectors (well, and FMA for FP calculations). There are a lot of factors playing into SIMD performance besides just the execution units. For example, Alder Lake can load two 256-bit vectors per cycle from L1D (strictly speaking even two 512-bit vectors, but the consumer Alder Lakes can't make use of that since AVX512 is disabled). If you have shorter vectors, the throughput will obviously be that much lower, everything else being equal. Up to some point there is no replacement for displacement.
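To put rough numbers on the load-throughput point, here is a back-of-the-envelope sketch. The two-loads-per-cycle figure and the 256/512-bit widths are the ones mentioned above; the 128-bit row is a hypothetical "everything else being equal" comparison, not a measurement of any real core, and the function name is made up for illustration.

```python
def l1d_load_bytes_per_cycle(loads_per_cycle: int, vector_bits: int) -> int:
    """Peak bytes loaded from L1D per cycle, ignoring every other bottleneck."""
    return loads_per_cycle * vector_bits // 8


# Figures from the post: two vector loads per cycle from L1D.
avx2 = l1d_load_bytes_per_cycle(2, 256)     # 256-bit vectors
avx512 = l1d_load_bytes_per_cycle(2, 512)   # 512-bit vectors (AVX512 enabled)

# Hypothetical design with the same two load ports but 128-bit vectors.
narrow = l1d_load_bytes_per_cycle(2, 128)

print(avx2, avx512, narrow)
```

With the same number of loads per cycle, halving the vector width halves the peak bytes per cycle. A narrower design can make up for it with more load ports, which is exactly the "everything else being equal" caveat.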
On the upcoming workstation Alder Lakes the difference may be even starker, since they have AVX512 enabled.