Thanks for the link! Unfortunately there isn't much information there. Linus is speculating that the feature might only make the decoder turn memory operations into ones with release/acquire semantics. That doesn't make much sense to me, since Rosetta 2 could do that on its own when emitting code.
The GitHub code mostly inserts a value at a certain offset in the thread structure, as far as I can tell. And it can most likely be shown that this offset is used by the kernel when returning to Rosetta 2 at EL0 - but does it really do what people think it's doing?
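For what it's worth, here is roughly the mechanism people seem to be assuming - just a sketch in C, every name and the stub are mine, nothing here is taken from the actual code:

    #include <stdint.h>

    /* Hypothetical per-thread structure; the real field name and offset
       are unknown, this only illustrates the claimed idea. */
    struct thread_state {
        /* ... other per-thread fields ... */
        uint32_t tso_enable;   /* written by Rosetta 2 at a fixed offset */
    };

    /* Stand-in for whatever implementation-defined control register
       write would flip the core into TSO mode. */
    static void cpu_set_tso_mode(int on) { (void)on; }

    /* Kernel side, on the return-to-EL0 path: honor the per-thread flag. */
    static void return_to_el0(struct thread_state *t)
    {
        cpu_set_tso_mode(t->tso_enable != 0);
        /* ... eret back to user mode ... */
    }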
Elsewhere in that RWT thread, someone mentions that ARM's acquire/release loads and stores support a reduced set of addressing modes. That's okay for their normal use case, where most memory ops don't need acquire/release, and for the rare ones which do, the overhead of a few extra instructions for address gen disappears into the noise. But it's not okay for software like Rosetta or Microsoft's JIT, where the code generator needs to emit the minimum number of AArch64 instructions per x86-64 instruction to improve performance. So it may make sense even if it's only the Linus idea - a mode bit telling the load/store unit to use acquire/release semantics on all accesses.
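To make the addressing-mode point concrete, here's a small C sketch. The exact output depends on the compiler, but roughly: a relaxed load can fold the field offset into the load instruction, while an acquire load has to use LDAR, which only takes a bare base register, so the address gets computed separately first.

    #include <stdatomic.h>
    #include <stdint.h>

    struct obj {
        char pad[64];            /* put the field at a non-zero offset */
        _Atomic uint64_t field;
    };

    /* Typically a single instruction on AArch64: ldr x0, [x0, #64] */
    uint64_t load_relaxed(struct obj *o)
    {
        return atomic_load_explicit(&o->field, memory_order_relaxed);
    }

    /* LDAR has no offset form, so this usually becomes two instructions:
       add x0, x0, #64 ; ldar x0, [x0] */
    uint64_t load_acquire(struct obj *o)
    {
        return atomic_load_explicit(&o->field, memory_order_acquire);
    }

One extra add per access is nothing in native code, where acquire loads are rare; in translated code, where nearly every x86 load would need it, it adds up fast.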
You can just test it with any Windows ARM device. The description is right there in the emulation settings. There are four options for multi-core: fast, strict, very strict, and single core. The last one forces all threads onto a single core, and the JIT compiler no longer emits any barriers at all.
I don't have any Windows device to test, ARM or not. If I did, it'd be x86 because I wouldn't want to waste money on a bad Windows computer (this is a hint about where the rest of my post is going). Finally, I don't know anybody who has an ARM Windows computer to lend to me. So testing is a little difficult.
While reading your posts about Microsoft's JIT, I kept thinking "this guy must be all wrong, no way Microsoft did things that way". For example, as I read the list of modes you wrote about, I could not help but read them as "extremely buggy, buggy, not buggy, not buggy but lol single threaded only". I didn't think Microsoft's JIT could possibly be that bad, so I googled a bit, and found that you were correct. Sorry for doubting you. Microsoft admits it all in one of their support pages:
"These settings change the number of memory barriers used to synchronize memory accesses between cores in apps during emulation. Fast is the default mode, but the strict and very strict options will increase the number of barriers. This slows down the app, but reduces the risk of app errors. The single-core option removes all barriers but forces all app threads to run on a single core."
So, Microsoft wrote a JIT whose default mode inflates benchmark numbers at the cost of automatically injecting bugs into x86 software. They push the problems this causes off onto the end user, expecting them to notice that an app crashed, figure out that it might be an x86 memory ordering issue, and then tweak a relatively obscure and technical setting to find the best balance between performance and correctness.
This is insane and, combined with the unavailability of native ARM software for Windows, makes it crystal clear to me why ARM Windows has been a failure so far. I am a bit shocked that they shipped a JIT with deliberate bugs in it. I'm sure it was some manager demanding higher performance at any cost so that reviewers wouldn't savage ARM Windows for running x86 software extremely slowly, but the cost is that when people buy the product, it crashes a lot and gets a bad reputation.
The sad thing is that this didn't even result in performance good enough to review well. While googling for info, I kept finding links to Surface Pro X reviews which praised the hardware design but complained about poor x86 performance (and bugs).
Speaking of performance, you keep saying Microsoft's JIT is close to Rosetta, but it just isn't. According to you, Microsoft is at ~0.5x of native performance. So far, the one data point we have for Rosetta is that in Geekbench 5, Rosetta is ~0.75x of native performance. 0.5x and 0.75x may sound close, but it actually means that if Rosetta and Microsoft's JIT were able to run on the same processor, Rosetta would run x86 software 1.5x faster (0.75 / 0.5 = 1.5) than the Microsoft JIT. That's a huge performance gain.
And if the criterion is "run x86 software correctly", Rosetta might be 2x faster, or more. You seem to be puzzled by Apple choosing to add TSO to their CPUs. Hopefully this explains why. For Apple, the only acceptable design goal would have been one mode which is fast, correct, and doesn't force bad restrictions like single-threaded execution. TSO helps them get there.
You have to understand that "TSO is off" is the normal case for any software running natively on ARM systems, yet you will be hard pressed to find barriers anywhere in application source code (or anywhere x86 and ARM code differs, for that matter). The reason is that well written code only requires ordering at synchronization points.
I think you know this, but the issue isn't the frequency of real barriers in well written code. It's that there is no way for a JIT or translator to examine an x86 binary and determine where all the synchronization points are. The strong memory ordering semantics of x86 simply require everything to be treated as a barrier on ARM. (Unless you know you're running on a specific ARM implementation which provides stronger ordering than required by the ARM architecture...)
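Here's the kind of thing I mean - written as C for readability, but think of it as the instruction sequence sitting in an already-compiled x86 binary (as C source this is technically a data race and should use atomics; the translator never gets to see that distinction, only the plain movs):

    /* Plain-store message passing.  Correct on x86 because TSO forbids
       store-store and load-load reordering, so once the consumer sees
       ready != 0 it is guaranteed to see data.  There is no LOCK prefix,
       no CAS, nothing a binary translator could key on. */
    int data;
    int ready;

    void publish(int value)      /* thread 1 */
    {
        data  = value;           /* plain mov store */
        ready = 1;               /* plain mov store, ordered after the one above on x86 */
    }

    int consume(void)            /* thread 2 */
    {
        while (ready == 0)       /* plain mov load, spin until the flag flips */
            ;
        return data;             /* on x86 this is guaranteed to see 'value' */
    }

Translate those as ordinary AArch64 str/ldr with no barriers and the consumer can observe ready == 1 while data is still stale. Nothing marks publish() as a synchronization point.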
Likewise on the Surface Pro X, most x86 apps just run fine with the default emulation behavior. Most of those that have issues rely on a kernel mode extension - for instance copy protection for games. And then there might be a small subset where you have to change the emulation setting to stricter ordering - emulation settings can be changed per app, by the way.
I strongly disagree with the idea that this is just fine, and that it only affects a small subset of apps. Multithreaded applications are the norm these days. Ubiquitous threading means the only way a JIT can safely fail to emulate TSO semantics is to force the software to be single threaded. (And then performance sucks for a different reason.)
This as a general statement is totally correct. However, as I explained above, a less strict emulator can assume that ordering only needs to be guaranteed at synchronization points: when trapping into the kernel, or in user mode when using spinlocks or atomics. The latter are typically implemented as CAS operations on x86 and can easily be detected by the emulator.
A less strict emulator is just a buggy emulator. I don't know how you convinced yourself that Microsoft's defective-by-design JIT is a good thing. You seem to know better, but are telling yourself everything's fine.
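To spell out why "just detect the CAS" doesn't get you there, here's a textbook x86 spinlock sketched in C (my own example): the acquire side is a LOCK CMPXCHG loop an emulator can recognize, but the release side compiles to an ordinary mov, because on x86 TSO a plain store already is a release.

    #include <stdatomic.h>

    _Atomic int lock;

    void spin_lock(void)
    {
        int expected = 0;
        /* On x86 this is a LOCK CMPXCHG loop - easy to detect and to
           give acquire semantics in the emulator. */
        while (!atomic_compare_exchange_weak_explicit(
                   &lock, &expected, 1,
                   memory_order_acquire, memory_order_relaxed)) {
            expected = 0;
        }
    }

    void spin_unlock(void)
    {
        /* The source says release, but on x86 this compiles to a plain
           mov dword ptr [lock], 0 - which is all the emulator gets to
           see.  Keyed only on CAS/LOCK instructions, it emits a plain
           str on ARM, and writes done inside the critical section can
           leak past the unlock. */
        atomic_store_explicit(&lock, 0, memory_order_release);
    }

So the scheme catches the lock and misses the unlock of every lock in every x86 binary. That's not a corner case, that's the common case.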