What are the advantages of Arm's memory model over x86's memory model? Does it facilitate out-of-order CPU designs? Does it help compilers perform better optimizations?
The memory model is (more or less) the promises the hardware makes about the order in which CPU B will see changes made to multiple different memory addresses by CPU A.
The reason this matters is that CPU B mostly doesn't care what CPU A is doing to "its" memory (i.e. most addresses that CPU A touches, CPU B never even looks at). This means that CPU A doesn't have to care very much about the order in which it sends those changes out to memory, and doesn't have to spend transistors and energy tracking that ordering.
Consider code like the following:
CPU A makes a bunch of changes to various shared variables (call them the PAYLOAD)
CPU A makes a change to one final shared variable (call it the FLAG)
Meanwhile CPU B is constantly checking the value of the FLAG, so that when it sees the FLAG set, B knows the contents of the PAYLOAD are valid.
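To make that concrete, here is a minimal C++ sketch of the pattern (the names payload_a, payload_b, and flag are purely illustrative; relaxed atomics are used so the example is legal C++ while still giving the hardware no ordering guarantee at all):

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int>  payload_a{0}, payload_b{0};  // the PAYLOAD
    std::atomic<bool> flag{false};                 // the FLAG

    void cpu_a() {
        payload_a.store(1, std::memory_order_relaxed);
        payload_b.store(2, std::memory_order_relaxed);
        flag.store(true, std::memory_order_relaxed);  // nothing forces this to be seen last
    }

    void cpu_b() {
        while (!flag.load(std::memory_order_relaxed)) { /* spin until FLAG is set */ }
        // On a weakly ordered machine these loads may still return 0 (stale PAYLOAD):
        std::printf("%d %d\n", payload_a.load(std::memory_order_relaxed),
                               payload_b.load(std::memory_order_relaxed));
    }

    int main() {
        std::thread b(cpu_b), a(cpu_a);
        a.join();
        b.join();
    }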
On ARM, code like this is not good enough. CPU A, after setting the PAYLOAD, essentially has to emit a "PUBLISH" instruction that forces all those changes out to memory BEFORE it publishes the FLAG. If this is not done, other CPUs can see the FLAG's new value before they see the new PAYLOAD, and so read stale PAYLOAD values.
(As a technicality we won't go into, CPU B also has to execute a "SUBSCRIBE" instruction after reading the FLAG but before reading the PAYLOAD. The official terms are not PUBLISH and SUBSCRIBE but release and acquire; I think my words make it much easier to understand what's going on.)
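In C++ terms, PUBLISH is a release store of the FLAG and SUBSCRIBE is an acquire load of it. A sketch of the fixed versions of the two functions above (same illustrative names, reusing the payload_a/payload_b/flag globals from the earlier sketch):

    void cpu_a() {
        payload_a.store(1, std::memory_order_relaxed);
        payload_b.store(2, std::memory_order_relaxed);
        // "PUBLISH": release store -- every store before this one must be
        // visible to any CPU that sees the new flag value.
        flag.store(true, std::memory_order_release);
    }

    void cpu_b() {
        // "SUBSCRIBE": acquire load -- once we see the flag set, we are
        // guaranteed to also see everything published before it.
        while (!flag.load(std::memory_order_acquire)) { /* spin */ }
        std::printf("%d %d\n", payload_a.load(std::memory_order_relaxed),
                               payload_b.load(std::memory_order_relaxed));
    }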
On x86 these PUBLISH/SUBSCRIBE instructions are not necessary: the hardware's stronger (TSO) ordering already guarantees that stores become visible in the order they were issued.
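(Concretely, with a typical compiler the release store and acquire load above become the AArch64 stlr and ldar instructions, or plain accesses wrapped in dmb barriers, whereas on x86-64 the very same C++ compiles to ordinary mov instructions; TSO already provides the ordering, so from the programmer's point of view it comes for free.)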
The consequence of this is that ARM can be much more flexible in reordering loads and stores, both within the CPU proper and in (eventually, at some point) moving data between the various caches and RAM. In principle, the x86 folks would say, you can track the correct ordering in each CPU and fake it to get the same results.
Well, to some extent, but nothing is free. There are limits to just how aggressively you can reorder even if you are doing that, and tracking the reordering does cost transistors and energy. The primary problem is that you have to treat every interaction with another CPU as potentially requiring correct ordering; there's no external mechanism like the PUBLISH/SUBSCRIBE instructions to tell you exactly where ordering is required (meaning that everywhere else you can ignore it).
It definitely allows the CPU to be more aggressive with its internal load/store reordering, and it also allows more flexibility in how cache lines propagate between different caches (e.g. Apple would have no problem with, say, a modified cache line from CPU A propagating up to the cluster L2 and being seen there by CPU B, while that change does not propagate out to a different cluster on the SoC until much later).
There appears to be no serious work quantifying the "cost" of strong versus weak memory models. We do know that when people have flipped on the force-TSO bit on the M1 (normally used only by Rosetta), ARM code runs 15 to 25% slower. Defenders of x86 have said this proves nothing because Apple has no incentive to make its TSO mode run fast, and that's not a totally unreasonable claim. But it is worth noting that all the GPU vendors run their designs with very aggressively weak ordering (even Intel's iGPU), which suggests that the cost of strong ordering is high enough to drop it whenever you can...