Thanks for the link! Unfortunately there isn't much information there. Linus is speculating that the feature might only make the decoder turn memory operations into ones with release/acquire semantics. That doesn't make much sense to me, since Rosetta 2 could do that on its own when emitting code.
The GitHub code mostly inserts a value at a certain offset in the thread structure, as far as I can tell. And it can most likely be shown that this offset is used by the kernel when returning to Rosetta 2 at EL0 - but does it really do what people think it's doing?
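For what it's worth, here is roughly the mechanism people seem to be assuming - just a sketch in C, every name and the stub are mine, nothing here is taken from the actual code:

    #include <stdint.h>

    /* Hypothetical per-thread structure; the real field name and offset
       are unknown, this only illustrates the claimed idea. */
    struct thread_state {
        /* ... other per-thread fields ... */
        uint32_t tso_enable;   /* written by Rosetta 2 at a fixed offset */
    };

    /* Stand-in for whatever implementation-defined control register
       write would flip the core into TSO mode. */
    static void cpu_set_tso_mode(int on) { (void)on; }

    /* Kernel side, on the return-to-EL0 path: honor the per-thread flag. */
    static void return_to_el0(struct thread_state *t)
    {
        cpu_set_tso_mode(t->tso_enable != 0);
        /* ... eret back to user mode ... */
    }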
Elsewhere in that RWT thread, someone mentions that ARM's acquire/release loads and stores support a reduced set of addressing modes. That's okay for their normal use case, where most memory ops don't need acquire/release, and for the rare ones which do, the overhead of a few extra instructions for address gen disappears into the noise. But it's not okay for software like Rosetta or Microsoft's JIT, where the code generator needs to emit the minimum number of AArch64 instructions per x86-64 instruction to improve performance. So it may make sense even if it's only the Linus idea - a mode bit telling the load/store unit to use acquire/release semantics on all accesses.
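To make the addressing-mode point concrete, here's a small C sketch. The exact output depends on the compiler, but roughly: a relaxed load can fold the field offset into the load instruction, while an acquire load has to use LDAR, which only takes a bare base register, so the address gets computed separately first.

    #include <stdatomic.h>
    #include <stdint.h>

    struct obj {
        char pad[64];            /* put the field at a non-zero offset */
        _Atomic uint64_t field;
    };

    /* Typically a single instruction on AArch64: ldr x0, [x0, #64] */
    uint64_t load_relaxed(struct obj *o)
    {
        return atomic_load_explicit(&o->field, memory_order_relaxed);
    }

    /* LDAR has no offset form, so this usually becomes two instructions:
       add x0, x0, #64 ; ldar x0, [x0] */
    uint64_t load_acquire(struct obj *o)
    {
        return atomic_load_explicit(&o->field, memory_order_acquire);
    }

One extra add per access is nothing in native code, where acquire loads are rare; in translated code, where nearly every x86 load would need it, it adds up fast.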
You can just test it with any Windows ARM device. The description is right there in the emulation settings. There are four options for multi-core: fast, strict, very strict, and single core. The last one forces all threads onto a single core, and the JIT compiler no longer emits any barriers at all.
I don't have any Windows device to test, ARM or not. If I did, it'd be x86 because I wouldn't want to waste money on a bad Windows computer (this is a hint about where the rest of my post is going). Finally, I don't know anybody who has an ARM Windows computer to lend to me. So testing is a little difficult.
While reading your posts about Microsoft's JIT, I kept thinking "this guy must be all wrong, no way Microsoft did things that way". For example, as I read the list of modes you wrote about, I could not help but read them as "extremely buggy, buggy, not buggy, not buggy but lol single threaded only". I didn't think Microsoft's JIT could possibly be that bad, so I googled a bit, and found that you were correct. Sorry for doubting you. Microsoft admits it all in one of their support pages:
"These settings change the number of memory barriers used to synchronize memory accesses between cores in apps during emulation. Fast is the default mode, but the strict and very strict options will increase the number of barriers. This slows down the app, but reduces the risk of app errors. The single-core option removes all barriers but forces all app threads to run on a single core."
So, Microsoft wrote a JIT whose default mode inflates benchmark numbers at the cost of automatically injecting bugs into x86 software. They push the problems this causes off onto the end user, expecting them to notice that an app crashed, figure out that it might be an x86 memory ordering issue, and then tweak a relatively obscure and technical setting to find the best balance between performance and correctness.
This is insane and, combined with the unavailability of native ARM software for Windows, makes it crystal clear to me why ARM Windows has been a failure so far. I am a bit shocked that they shipped a JIT with deliberate bugs in it. I'm sure it was some manager demanding higher performance at any cost so that reviewers wouldn't savage ARM Windows for running x86 software extremely slowly, but the cost is that when people buy the product, it crashes a lot and gets a bad reputation.
The sad thing is that this didn't even result in performance good enough to review well. While googling for info, I kept finding links to Surface Pro X reviews which praised the hardware design but complained about poor x86 performance (and bugs).
Speaking of performance, you keep saying Microsoft's JIT is close to Rosetta, but it just isn't. According to you, Microsoft is at ~0.5x of native performance. So far, the one data point we have for Rosetta is that in Geekbench 5, Rosetta is ~0.75x of native performance. 0.5x and 0.75x may sound close, but it actually means that if Rosetta and Microsoft's JIT were able to run on the same processor, Rosetta would run x86 software 1.5x faster (0.75 / 0.5 = 1.5) than the Microsoft JIT. That's a huge performance gain.
And if the criterion is "run x86 software correctly", Rosetta might be 2x faster, or more. You seem to be puzzled by Apple choosing to add TSO to their CPUs. Hopefully this explains why. For Apple, the only acceptable design goal would have been one mode which is fast, correct, and doesn't force bad restrictions like single-threaded execution. TSO helps them get there.
You have to understand that "TSO is off" is the normal case for any software running natively on ARM systems, yet you will be hard pressed to find barriers anywhere in application source code (or anywhere x86 and ARM code differs, for that matter). The reason is that well written code only requires ordering at synchronization points.
I think you know this, but the issue isn't the frequency of real barriers in well written code. It's that there is no way for a JIT or translator to examine an x86 binary and determine where all the synchronization points are. The strong memory ordering semantics of x86 simply require everything to be treated as a barrier on ARM. (Unless you know you're running on a specific ARM implementation which provides stronger ordering than required by the ARM architecture...)
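Here's the kind of thing I mean - written as C for readability, but think of it as the instruction sequence sitting in an already-compiled x86 binary (as C source this is technically a data race and should use atomics; the translator never gets to see that distinction, only the plain movs):

    /* Plain-store message passing.  Correct on x86 because TSO forbids
       store-store and load-load reordering, so once the consumer sees
       ready != 0 it is guaranteed to see data.  There is no LOCK prefix,
       no CAS, nothing a binary translator could key on. */
    int data;
    int ready;

    void publish(int value)      /* thread 1 */
    {
        data  = value;           /* plain mov store */
        ready = 1;               /* plain mov store, ordered after the one above on x86 */
    }

    int consume(void)            /* thread 2 */
    {
        while (ready == 0)       /* plain mov load, spin until the flag flips */
            ;
        return data;             /* on x86 this is guaranteed to see 'value' */
    }

Translate those as ordinary AArch64 str/ldr with no barriers and the consumer can observe ready == 1 while data is still stale. Nothing marks publish() as a synchronization point.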
Likewise on the Surface Pro X, most x86 apps just run fine with the default emulation behavior. Most of those that have issues rely on a kernel mode extension - for instance copy protection for games. And then there might be a small subset where you have to change the emulation setting to stricter ordering - emulation settings can be changed per app, by the way.
I strongly disagree with the idea that this is just fine, and that it only affects a small subset of apps. Multithreaded applications are the norm these days. Ubiquitous threading means the only way a JIT can safely fail to emulate TSO semantics is to force the software to be single threaded. (And then performance sucks for a different reason.)
This as a general statement is totally correct. However, as I explained above, a less strict emulator can assume that ordering only needs to be guaranteed at synchronization points: when trapping into the kernel, or in user mode when using spinlocks or atomics. The latter are typically implemented as CAS operations on x86 and can easily be detected by the emulator.
A less strict emulator is just a buggy emulator. I don't know how you convinced yourself that Microsoft's defective-by-design JIT is a good thing. You seem to know better, but are telling yourself everything's fine.
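To spell out why "just detect the CAS" doesn't get you there, here's a textbook x86 spinlock sketched in C (my own example): the acquire side is a LOCK CMPXCHG loop an emulator can recognize, but the release side compiles to an ordinary mov, because on x86 TSO a plain store already is a release.

    #include <stdatomic.h>

    _Atomic int lock;

    void spin_lock(void)
    {
        int expected = 0;
        /* On x86 this is a LOCK CMPXCHG loop - easy to detect and to
           give acquire semantics in the emulator. */
        while (!atomic_compare_exchange_weak_explicit(
                   &lock, &expected, 1,
                   memory_order_acquire, memory_order_relaxed)) {
            expected = 0;
        }
    }

    void spin_unlock(void)
    {
        /* The source says release, but on x86 this compiles to a plain
           mov dword ptr [lock], 0 - which is all the emulator gets to
           see.  Keyed only on CAS/LOCK instructions, it emits a plain
           str on ARM, and writes done inside the critical section can
           leak past the unlock. */
        atomic_store_explicit(&lock, 0, memory_order_release);
    }

So the scheme catches the lock and misses the unlock of every lock in every x86 binary. That's not a corner case, that's the common case.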