I'd guess that a lot of video editing and transcoding programs use x86-specific vector acceleration, and that will need to be ported for best performance on ARM unless Apple has a layer that makes it transparent. Most compilers added intrinsics about ten years ago so you don't have to program in assembler or worry about register allocation yourself, but the intrinsics themselves are geared to x86 architectures.
ARM has a full set of intrinsics for the various versions of Neon, though it's not a complete 1:1 mapping of Intel's interface.
Intel® Intrinsics Guide (software.intel.com) – C-style functions that provide access to instructions without writing assembly code.
Intrinsics – Arm Developer (developer.arm.com)
Clang also implements OpenMP's #pragma omp simd directive, but I don't know if Apple's fork supports it. Apple doesn't seem to bundle an OpenMP runtime with Xcode, but #pragma omp simd doesn't need to link against one anyway.
As far as writing in intrinsics, I prefer ARM here.
Intel has been kind of a mess. Right now they support 128-, 256-, and 512-bit vectors. Some architectures top out at 256 bits, and very old ones only support 128, but some of those still ship today in inexpensive parts. The 128-bit operations come in two forms: the newer VEX-prefixed three-operand instructions and the older destructive two-operand ones. Intrinsics do an imperfect job of papering over this, since the correspondence between VEX-prefixed and non-prefixed instructions isn't exact, even at 128-bit widths.
Complicating things further, Intel makes a lot of incremental changes to which instructions will optionally micro-fuse with a load operation. You don't deal with that directly when writing intrinsics, but it's common to unroll loops by hand in intrinsics-based code, since compilers are very inconsistent about what and how they unroll, and you typically need multiple independent operations in flight to fully hide latency. Micro-fusion affects how far you can unroll, because past that point the compiler may spill extra values into stack-based copies.
ARM does not expose micro-fusion concerns at the assembly layer. Instead they expose 32 register names (AVX-512 also does this, but it's the least common Intel variant), which tends to accommodate things that would require micro-fusion with only 16 names. Neon also has no opcodes or intrinsics for cross-lane shuffles, because its register width matches its lane width; on Intel, a 256-bit register is two 128-bit lanes, and moving data across them needs special instructions. You could sidestep that by restricting Intel code to 128 bits, but that results in sub-optimal performance.
Neon doesn't have masked load / write, but it provides clean options for scalar loads and stores to/from simd register names.
Coding for Neon - Part 2: Dealing with Leftovers (community.arm.com) – dealing with an often encountered problem: input data that is not a multiple of the length of the vectors you want to process.
I could go on. I don't mind using Intel intrinsics for a small amount of code where necessary, but I don't see this as a drawback for ARM. Rather, I think ARM allows for simpler vectorization back ends and cleaner use of intrinsics.