
name99

macrumors 68020
Jun 21, 2004
2,407
2,309
Not really, though. The M2 Ultra has 16P+8E cores, 96MB L3, 76 GPU cores and 8 DRAM modules in the same size package; the Xeon has no GPU cores, no neural engine (though AVX2 does have targeted ops), no H2xx type or other accelerators, and no DRAM. They are pretty dissimilar. It looks like the entire Ultra die is around a quarter the size of the Xeon die.
You are obsessing over the wrong thing.
The point is that at high end packaging, what Intel is shipping in volume and what Apple is shipping in volume are not that different in size. Neither team has special magic that's vastly beyond the capabilities of the other (though it's likely that Apple's solution is cheaper).

PV will be about 50% larger (if it actually ships and sells), but then again, if (as is likely) the Ultra expands outward, to four compute chiplets and/or memory spikes, it will likely grow to twice the size.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
What are the advantages of Arm's memory model over x86's memory model? Does it facilitate out-of-order CPU designs? Does it help compilers perform better optimizations?
 

leman

macrumors Core
Oct 14, 2008
19,520
19,670
What are the advantages of Arm's memory model over x86's memory model? Does it facilitate out-of-order CPU designs? Does it help compilers perform better optimizations?

From what I understand, it can help with performance, transistor count, and probably energy consumption. The strong memory ordering model of x86 means more synchronisation overhead for most memory instructions, while in the weak memory model the burden of correct synchronisation is placed on the developer instead (you have to use correct instructions or your multithreaded code won't work properly). But this is very simplified and there are of course many nuances.
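
To make the "burden on the developer" bit concrete, here is a minimal C++ sketch (my own illustration, names made up) of a flag handoff written with deliberately weak orderings. On x86 hardware the strong ordering happens to hide the mistake; on an Arm machine the consumer can see the flag before the payload:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                     // ordinary, non-atomic data
std::atomic<bool> flag{false};

void producer() {
    payload = 42;
    // Bug on weakly ordered hardware: relaxed imposes no ordering, so
    // the flag store may become visible to other cores before the
    // payload store. x86's TSO happens to mask this; Arm does not.
    flag.store(true, std::memory_order_relaxed);  // should be _release
}

void consumer() {
    while (!flag.load(std::memory_order_relaxed)) {}  // should be _acquire
    assert(payload == 42);  // may fire on Arm; "works" on x86 in practice
}

int main() {
    std::thread a(producer), b(consumer);
    a.join();
    b.join();
}
```

(Strictly speaking this program has a data race on payload and is undefined behaviour in ISO C++ on any ISA; the point is the hardware-level ordering the compiled code ends up relying on.)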
 
  • Like
Reactions: Xiao_Xi

Basic75

macrumors 68020
May 17, 2011
2,101
2,446
Europe
What are the advantages of Arm's memory model over x86's memory model?
In a nutshell, ARM is allowed to be more aggressive with its reordering while x86 has stronger ordering constraints. This potentially allows an ARM processor to extract more instruction-level parallelism from the linear instruction stream. The tradeoff is that on ARM you might need more memory barriers to prevent reordering where it's undesired, which the developer and/or compiler need to take care of. The weaker the memory model, the easier it is to create a bug in multi-threaded code. That's why Apple's M1 and M2 can be switched into an x86-compatible memory model; otherwise the translated code would need to be littered with barrier instructions, because Rosetta doesn't know which parts of the code are critical to correct multi-threaded operation.
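
To illustrate the "littered with barriers" point, here is a rough C++ sketch (my own illustration, not how Rosetta actually works internally) of what a translator would have to emit if it could not rely on a hardware TSO mode: fences around essentially every access, because it can't tell which ones the original x86 code depends on:

```cpp
#include <atomic>

std::atomic<int> a{0}, b{0};

// Pessimistic translation of two x86 stores and two x86 loads. On x86,
// stores are implicitly ordered like releases and loads like acquires
// (TSO). Without a hardware TSO switch, a translator must restore that
// ordering with explicit fences, which compile to DMB barriers on Arm.
void translated_stores() {
    a.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);  // a DMB on Arm
    b.store(2, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);  // another DMB
}

int translated_loads() {
    int y = b.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);  // a DMB on Arm
    int x = a.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);  // another DMB
    return x + y;
}

int main() {
    translated_stores();
    return translated_loads();  // single-threaded here; just a shape demo
}
```

Flipping the TSO bit instead makes all of those barriers unnecessary, which is exactly what Apple did for Rosetta.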
 
  • Like
Reactions: Xiao_Xi

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
What are the advantages of Arm's memory model over x86's memory model? Does it facilitate out-of-order CPU designs? Does it help compilers perform better optimizations?
The memory model is (more or less) the promises the hardware makes about the order in which CPU B will see changes made to multiple different memory addresses by CPU A.

The reason this matters is that CPU B mostly doesn't care about what CPU A is doing to "its" memory (i.e. most addresses that CPU A touches, CPU B never even looks at). This means that CPU A doesn't have to care very much about the order in which it sends those changes out to memory, and doesn't have to spend transistors and energy tracking that ordering.

Consider code like the following:
CPU A makes a bunch of changes to various shared variables (call them the PAYLOAD)
CPU A makes a change to one final shared variable (call it the FLAG)
meanwhile CPU B is constantly checking the value of the FLAG, so that when it is set, B knows that the contents of the PAYLOAD are valid.
On ARM, code like this is not good enough. CPU A, after setting the PAYLOAD, essentially has to emit a "PUBLISH" instruction that forces all the changes out to memory BEFORE it publishes the FLAG. If this is not done, other CPUs can potentially see the FLAG's value before they see the values of the PAYLOAD, and so read the old PAYLOAD.
(As a technicality we won't go into, CPU B also has to execute a "SUBSCRIBE" instruction after reading the FLAG but before reading the PAYLOAD. Technically the words used are not PUBLISH and SUBSCRIBE but release and acquire, but I think my words make it much easier to understand what's going on.)
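
In C++ atomics this maps directly onto release ("PUBLISH") and acquire ("SUBSCRIBE"); a minimal sketch of the pattern just described (my names, illustrative only):

```cpp
#include <atomic>
#include <thread>

int payload[4];                 // the PAYLOAD: several shared variables
std::atomic<bool> flag{false};  // the FLAG

void cpu_a() {
    payload[0] = 1; payload[1] = 2;   // fill in the PAYLOAD
    payload[2] = 3; payload[3] = 4;
    // "PUBLISH": the release store guarantees that any core which later
    // acquires the flag also sees all the payload writes above.
    flag.store(true, std::memory_order_release);
}

int cpu_b() {
    // "SUBSCRIBE": the acquire load pairs with the release store, so
    // once the flag reads true, the payload reads below see new values.
    while (!flag.load(std::memory_order_acquire)) {}
    return payload[0] + payload[1] + payload[2] + payload[3];  // 10
}

int main() {
    std::thread a(cpu_a);
    int sum = cpu_b();
    a.join();
    return sum == 10 ? 0 : 1;
}
```

On ARM the release store typically compiles to an STLR instruction, or to a barrier plus a plain store.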

On x86 these PUBLISH/SUBSCRIBE instructions are not necessary.
The consequence of this is that ARM can be much more flexible in reordering loads and stores, both within the CPU proper and (eventually, at some point) in moving data between various caches and RAM. In principle, the x86 folks would say, you can track the correct ordering in each CPU and fake it to get the same results.

Well, to some extent, but nothing is free. There are limits to just how aggressively you can reorder even if you do track everything, and tracking the reordering costs transistors and energy. The primary problem is that you have to treat every interaction with another CPU as potentially requiring correct ordering; there's no external mechanism, like the PUBLISH/SUBSCRIBE instructions, that tells you exactly where ordering is required (meaning it can be ignored everywhere else).
A weak model definitely allows the CPU to be more aggressive with its internal load/store reordering, and it also allows more flexibility in how cache lines are propagated between different caches (e.g. Apple would have no problem with, say, a modified cache line from CPU A propagating up to the cluster L2 and then being seen by CPU B, while that change does not propagate out to a different cluster on the SoC until much later).

There appears to be no serious published work on the "cost" of weaker memory models. We do know that when people have flipped on the force-TSO bit on the M1 (normally used only by Rosetta), ARM code runs 15 to 25% slower. Defenders of x86 have said this proves nothing, because Apple has no incentive to make its TSO mode run fast, and that's not a totally unreasonable claim. But it is worth noting that all the GPU vendors run their designs with very aggressive weak ordering (even Intel's iGPUs), which suggests that the cost of strong ordering is high enough to drop it whenever you can...
 

Icelus

macrumors 6502
Nov 3, 2018
421
574
Is it true that only x86 has TSO and all other ISAs have a weaker memory model?
No, x86 isn't the only one.

[Image: table of permitted memory reorderings (e.g. "stores reordered after stores") across ISAs, weak vs. strong]


 
  • Like
Reactions: Basic75 and Xiao_Xi

Sydde

macrumors 68030
Aug 17, 2009
2,563
7,061
IOKWARDI
in the weak memory model the burden of correct synchronisation is placed on the developer instead (you have to use correct instructions or your multithreaded code won't work properly)

But not really a huge burden on most third-party devs. The vast bulk of most apps is coded by a compiler, which must be designed to handle the memory model. Compilers for AS/64-bit-ARM have been around/refined for at least a decade, so dealing with memory ordering is basically not a concern.

Is it true that only x86 has TSO and all other ISAs have a weaker memory model? Is there a reason for that?

The main reason x86 has stronger ordering is that it has a smaller register file than almost every comparable alternative out there and hence has to do more work with values in memory.
 
  • Like
Reactions: Xiao_Xi

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
One thing I think is important to understand for those not familiar with this corner of computing technology: Arm's ordering rules permit stores by one core to become visible to other cores out of order, but they do not require it. If you draw a Venn diagram, TSO ordering behaviors are a smaller circle fully contained inside the bigger circle representing the set of ordering behaviors permitted by Arm. The only piece of Apple's TSO support that's a custom, non-standard extension is the register mode bit they provided to make TSO optional.

Apple's not the only architectural license holder which has created TSO Arm cores. Fujitsu's A64FX supercomputer chip and NVidia's "Denver" CPU cores use TSO ordering too. Unlike Apple's cores, these are TSO-only.
 
  • Like
Reactions: Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
The only piece of Apple's TSO support that's a custom, non-standard extension is the register mode bit they provided to make TSO optional.

Apple's not the only architectural license holder which has created TSO Arm cores. Fujitsu's A64FX supercomputer chip and NVidia's "Denver" CPU cores use TSO ordering too. Unlike Apple's cores, these are TSO-only.
What is the difference between a core with and without TSO? Is there an instruction that one can execute and the other cannot?

If you draw a Venn diagram, TSO ordering behaviors are a smaller circle fully contained inside the bigger circle representing the set of ordering behaviors permitted by Arm.
Like this?
[Image: Venn diagram with TSO ordering behaviors as a subset of the orderings permitted by Arm]

 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
What is the difference between a core with and without TSO? Is there an instruction that one can execute and the other cannot?
No, TSO is a set of behaviors rather than an instruction. Let's consider a behavior from the table you found, "Stores can be reordered after stores". This affects any program containing a sequence of stores, like this example.

Store #1: Write contents of register 1 to memory address A
Store #2: Write contents of register 2 to memory address B

If you're on a TSO machine like x86, all CPU cores in the system always see these two stores become visible in "program order". #1 always occurs before #2, because stores cannot be moved past other stores. But on a more weakly ordered machine like Arm, when other cores observe these stores (*), #2 could happen before #1. If you need to make sure all threads observe a specific ordering, you (or the compiler) have to insert extra instructions which force the CPU to do specific memory accesses in order. (Note that even x86 has some order-forcing instructions, it's just that they're seldom necessary.)
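
The classic "message passing" litmus test makes this concrete; a small C++ sketch (illustrative only; run it in a tight loop if you actually want to catch the reordering):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> A{0}, B{0};

void writer() {
    A.store(1, std::memory_order_relaxed);  // Store #1
    B.store(1, std::memory_order_relaxed);  // Store #2
}

void reader() {
    int b = B.load(std::memory_order_relaxed);
    int a = A.load(std::memory_order_relaxed);
    // Under TSO, b == 1 implies a == 1: stores never pass stores and
    // loads never pass loads. On Arm, b == 1 && a == 0 is a permitted
    // outcome (and with relaxed ordering the compiler may reorder too);
    // ruling it out requires release/acquire or explicit fences.
    if (b == 1 && a == 0) std::puts("weak ordering observed");
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}
```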

On a Fujitsu A64FX supercomputer, even though the Arm architecture spec claims you can't depend on store #1 happening before #2, in practice it always does, and Fujitsu probably formally documents it in their manuals. If you're writing Arm software which will never be deployed anywhere but an A64FX supercomputer, it's safe to omit nearly all instructions which force memory ordering.

Same applies to an Apple Arm core while it's configured for TSO mode, except here Apple doesn't document it at all. In fact, not only is it undocumented, in an unmodified copy of macOS running in the standard, default secured mode, the only context in which TSO mode is ever enabled is when a CPU core is running a Rosetta process. Native ARM code always runs in the more weakly ordered (and faster) memory model.


* - A single thread observing only its own effects never sees anything appear to happen out of order, even on a weakly-ordered architecture like Arm. All this memory ordering model stuff is only relevant for analyzing situations when a thread running on one core interacts with memory reads and writes being performed by a thread running on a different core.
 

Retskrad

macrumors regular
Apr 1, 2022
200
672
I'm sorry, but Apple Silicon on desktop is a poor proposition. On laptops and phones, it's leading edge because of the efficiency. However, on desktop, where that advantage is moot, $4K for the Mac Studio with the M2 Ultra is objectionable value. I just saw this article where the M2 Ultra is only capable of competing with laptop CPUs and GPUs from AMD, Intel and Nvidia - forget competing against their desktop counterparts.

 
  • Like
Reactions: Romain_H

pshufd

macrumors G4
Oct 24, 2013
10,145
14,572
New Hampshire
I'm sorry, but Apple Silicon on desktop is a poor proposition. On laptops and phones, it's leading edge because of the efficiency. However, on desktop, where that advantage is moot, $4K for the Mac Studio with the M2 Ultra is objectionable value. I just saw this article where the M2 Ultra is only capable of competing with laptop CPUs and GPUs from AMD, Intel and Nvidia - forget competing against their desktop counterparts.


I built a Windows system in 2020, before prices went up, but ran into a Windows issue with my workflow and started moving back to macOS because of it. I tried turning a few systems into Hackintoshes and had some success. I bought an M1 mini and a 2014 iMac in 2021, and then a Mac Studio in 2022. The Mac Studio resolved my workflow issues, so I'm now 100% macOS on the desktop and on laptops. The side benefits of Apple Silicon desktops are low power consumption and a quiet work environment. I did build my Windows desktop to run quietly (very large case, a lot of large fans, a non-K CPU and a low-power GPU), but the fans can come on under heavy loads and it can warm up my basement when it is hot outside.

So for me, the benefits of Apple Silicon on the desktop are performance per watt, quiet operation, low heat output, macOS, and the ecosystem. Since I have an Apple Silicon MacBook Pro, it makes sense to have a Mac desktop so that all of the productivity programs I run are consistent between laptop and desktop: Calendar, Reminders, Notes, Numbers and Mail are programs I use a lot, and it's nice to have them on my iPhone, iPad, desktop and laptop.

The software issue that I have with Windows: Microsoft said that they were going to fix it in an update of Windows 10. Never happened. They said it would be in Windows 11. Then they said a Windows 11 update. So far it hasn't happened and there's no date given as to when it will be fixed. So Windows isn't an option for me on the desktop at this time. It would also be unpleasant working in a mixed Windows/macOS world for productivity software and ecosystem.

The iMac 27 was never competitive on performance, as it has a tight case without a lot of airflow, which you could easily beat with a full tower. Yet the iMac was a very popular product and spawned copycat products from the PC makers.
 
  • Like
Reactions: Juraj22

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
TSO is a set of behaviors rather than an instruction.
The RISC-V documentation lists the following differences between TSO and WMO.
RVTSO makes the following adjustments to RVWMO:
• All load operations behave as if they have an acquire-RCpc annotation.
• All store operations behave as if they have a release-RCpc annotation.
• All AMOs behave as if they have both acquire-RCsc and release-RCsc annotations.
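
If I read that right, in C++ terms it means that on a Ztso core every plain load and store already carries the acquire-RCpc/release-RCpc ordering, so a compiler targeting Ztso can emit plain accesses where an RVWMO build needs real ordering instructions. A hedged sketch (my own illustration):

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> ready{false};
int data = 0;

// Under RVWMO the release store / acquire load below must be compiled
// into instructions (or fences) that enforce ordering. Under Ztso
// (RVTSO) ordinary loads and stores already behave that way, so a
// Ztso-targeted binary can use plain accesses - and is therefore
// incorrect on a core that only implements RVWMO.
void send() {
    data = 7;
    ready.store(true, std::memory_order_release);
}

int recv() {
    while (!ready.load(std::memory_order_acquire)) {}
    return data;  // guaranteed to see 7
}

int main() {
    std::thread t(send);
    int v = recv();
    t.join();
    return v == 7 ? 0 : 1;
}
```

Which would also explain why a binary compiled only for Ztso has to be refused by a platform that doesn't implement it.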

I understand that a TSO-based core and a WMO-based core run different code.
[C]ode written assuming RVTSO will not run correctly on implementations not supporting Ztso. Binaries compiled to run only under Ztso should indicate as such via a flag in the binary, so that platforms which do not implement Ztso can simply refuse to run them.
What are the hardware differences between a TSO-based core and a WMO-based core? Is the dispatch unit different?
 

jeanlain

macrumors 68020
Mar 14, 2009
2,459
953
I'm sorry, but Apple Silicon on desktop is a poor proposition. On laptops and phones, it's leading edge because of the efficiency. However, on desktop, where that advantage is moot, $4K for the Mac Studio with the M2 Ultra is objectionable value. I just saw this article where the M2 Ultra is only capable of competing with laptop CPUs and GPUs from AMD, Intel and Nvidia - forget competing against their desktop counterparts.

?
According to that article, the M2 Ultra ranks among workstation CPUs, and GPU performance isn't discussed.
 

Appletoni

Suspended
Mar 26, 2021
443
177
I'm sorry, but Apple Silicon on desktop is a poor proposition. On laptops and phones, it's leading edge because of the efficiency. However, on desktop, where that advantage is moot, $4K for the Mac Studio with the M2 Ultra is objectionable value. I just saw this article where the M2 Ultra is only capable of competing with laptop CPUs and GPUs from AMD, Intel and Nvidia - forget competing against their desktop counterparts.

The MacBook 24-core M2 ULTRA will be great.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
It looks like the first 7840U laptops are coming soon and some Geekbench results are appearing.

The 7840U ties with the M2 in single core, but some benchmarks show a big difference between them.

Some results don't make any sense. The 7840U wins in all but two of the multicore tests. Of those two, the Object Detection results are the strangest, because the M2 wins by 20% on multicore but loses by 10% on single core. How is this possible?

It's a shame that the Geekbench developers explain so little about the benchmarks. They don't even say which version of the Embree library Geekbench uses.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Has anyone else confirmed this?
Found some more bit defines yesterday, and this is the first hint I've seen so far that Apple CPUs have microcode.

Makes you wonder how much, though! We haven't seen anything resembling microcode updates, and macOS is in charge of initializing CPU cores and flipping chicken bits, so it would likely do the updates if they existed.

My guess is it's very simple microcode compared to Intel, perhaps similar to what IBM do on POWER chips. There may be no general update mechanism, perhaps only a few match/patch registers if that.
 

dmccloud

macrumors 68040
Sep 7, 2009
3,138
1,899
Anchorage, AK
I'm sorry, but Apple Silicon on desktop is a poor proposition. On laptops and phones, it's leading edge because of the efficiency. However, on desktop, where that advantage is moot, $4K for the Mac Studio with the M2 Ultra is objectionable value. I just saw this article where the M2 Ultra is only capable of competing with laptop CPUs and GPUs from AMD, Intel and Nvidia - forget competing against their desktop counterparts.


This is a strange position to take, especially when the article you linked to literally states the opposite in the preview part of your post. The quoted part of the article states that in their test bench, the M2 Ultra rubs shoulders with both AMD's EPYC and Intel's Xeon server processors. That is completely different from only being competitive with laptop CPUs and GPUs...
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
Has anyone else confirmed this?

People can get carried away with the term "microcode". Look at, for example, this patent:
(2011) https://patents.google.com/patent/US9280352B2 Lookahead scanning and cracking of microcode instructions in a dispatch queue

What appears to be referred to as microcode (and what would make sense) is mostly cracked instructions, for example perhaps the weirder NEON SoA to AoS instructions (like LD4).

It's probably also the case that some of the very rare but essential OS/cache control/TLB instructions (things like walk some cache and invalidate all entries matching some field) are implemented in this way.

In both cases this is probably better thought of as millicode, PALcode, or even something like A-line or F-line instructions (to reach REALLY far back into Apple history), rather than Intel-style microcode.
 

leman

macrumors Core
Oct 14, 2008
19,520
19,670
This is a strange position to take, especially when the article you linked to literally states the opposite in the preview part of your post. The quoted part of the article states that in their test bench, the M2 Ultra rubs shoulders with both AMD's EPYC and Intel's Xeon server processors. That is completely different from only being competitive with laptop CPUs and GPUs...

I don’t disagree with @Retskrad on this. The M2 Ultra is very competitive with mid-tier workstation processors, and at $7k it is fairly good value for money compared to similar-performance workstations.

But there are two nuances. First, high-end workstation processors will be almost twice as fast for just a 30-50% higher price. Second, an enthusiast-level CPU will deliver similar (or better) performance to the M2 Ultra for a significantly lower price. Given that the M2 Ultra lacks ECC memory and has only 16 PCIe lanes, it makes at least some sense to position it against enthusiast-level CPUs (with the caveat that it has significantly more RAM bandwidth as well as large vector/matrix hardware).

The point is, if you are interested in a desktop computer with the highest possible CPU and GPU performance, the Mac value proposition is currently indeed in a fairly bad spot, as you can get both better perf/$ and better absolute performance on other platforms. Macs are excellent in terms of perf/watt and perf/chassis size.

I am curious how things will evolve going forward. Right now Apple uses the exact same parts on mobile and desktop. One would expect, for example, a higher-clocked M2 Ultra in the Mac Pro (the chassis has hundreds of watts of thermal headroom) - and it was indeed rumored - but in the end Apple shipped it with the exact same clocks as the M2 Max in the 16” MBP. This suggests that the chip design itself precludes running at desktop thermal levels. Will future designs relax this restriction? We will see.
 
  • Like
Reactions: Basic75

Joe Dohn

macrumors 6502a
Jul 6, 2020
840
748
I don’t disagree with @Retskrad on this. The M2 Ultra is very competitive with mid-tier workstation processors, and at $7k it is fairly good value for money compared to similar-performance workstations.

But there are two nuances. First, high-end workstation processors will be almost twice as fast for just a 30-50% higher price. Second, an enthusiast-level CPU will deliver similar (or better) performance to the M2 Ultra for a significantly lower price. Given that the M2 Ultra lacks ECC memory and has only 16 PCIe lanes, it makes at least some sense to position it against enthusiast-level CPUs (with the caveat that it has significantly more RAM bandwidth as well as large vector/matrix hardware).
Let's not forget the expandability that Apple axed and the PC architecture still has.
 
  • Like
Reactions: Basic75

jeanlain

macrumors 68020
Mar 14, 2009
2,459
953
Second, an enthusiast-level CPU will deliver similar (or better) performance to the M2 Ultra for a significantly lower price.
What about the graphics performance of those enthusiast-level GPUs?
The M2 Ultra is competitive with high-end Radeons.
 

leman

macrumors Core
Oct 14, 2008
19,520
19,670
Let's not forget the expandability that Apple axed and the PC architecture still has.

I don’t consider this to be very interesting and/or important. The practical consequences are a lack of upgradeability (which is not a concern for the majority of users) as well as a hard limit on capability (e.g. no multi-GPU, limited RAM). Those are reasonable trade-offs IMO, which only exclude a small portion of users. Besides, the Mac Pro is plenty expandable in the narrow sense of the word.


What about the graphics performance of those enthusiast-level GPUs?
The M2 Ultra is competitive with high-end Radeons.

Kind of depends, doesn’t it? In terms of compute or gaming capability the M2 Ultra is roughly comparable to a $700 mainstream gaming GPU, with the caveat that it’s going to be significantly slower in the two main application domains Apple is focusing on: raytracing and machine learning. Apple’s big saving grace is the huge pool of GPU RAM, which definitely helps with some high-end applications and would cost a pretty penny to match on the PC side, but again, these are fairly niche use cases. Apple’s offering would be much more convincing if it had the raw performance to back it up.

Again, I’m just trying to give a realistic assessment of the current status quo. I do actually think that Apple’s offering is very impressive given that the M2 family is still essentially a replicated iPhone chip. The level of performance and capability they are offering is very impressive compared to the industry veterans, so I think there are good reasons to be optimistic going forward. I believe that the next-gen Apple GPU will bring performance and feature parity with Nvidia. As to Radeons… I don’t think AMD is a serious competitor at this point. They are significantly lagging in key technology areas; the only thing they have going for them is manufacturing, which allows them to offer good prices. But that only helps them in the gaming market; in professional applications AMD is currently very weak IMO compared to Nvidia.
 
  • Like
Reactions: pshufd