What are the advantages of Arm's memory model over x86's memory model? Does it facilitate out-of-order CPU designs? Does it help compilers perform better optimizations?
The memory model is (more or less) the promises the hardware makes about the order in which CPU B will see changes made to multiple different memory addresses by CPU A.
The reason this matters is that CPU B mostly doesn't care what CPU A is doing to "its" memory (i.e. most addresses that CPU A touches, CPU B never even looks at). This means that CPU A doesn't have to care very much about the order in which it sends those changes out to memory, and doesn't have to spend transistors and energy tracking that ordering.
Consider code like the following:
CPU A makes a bunch of changes to various shared variables (call them the PAYLOAD)
CPU A makes a change to one final shared variable (call it the FLAG)
Meanwhile CPU B is constantly checking the value of the FLAG, so that when it sees the FLAG set, B knows the contents of the PAYLOAD are valid.
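To make that concrete, here is a minimal C++ sketch of the pattern (the names payload_a, payload_b, and flag are purely illustrative; relaxed atomics are used so the example is legal C++ while still giving the hardware no ordering guarantee at all):

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int>  payload_a{0}, payload_b{0};  // the PAYLOAD
    std::atomic<bool> flag{false};                 // the FLAG

    void cpu_a() {
        payload_a.store(1, std::memory_order_relaxed);
        payload_b.store(2, std::memory_order_relaxed);
        flag.store(true, std::memory_order_relaxed);  // nothing forces this to be seen last
    }

    void cpu_b() {
        while (!flag.load(std::memory_order_relaxed)) { /* spin until FLAG is set */ }
        // On a weakly ordered machine these loads may still return 0 (stale PAYLOAD):
        std::printf("%d %d\n", payload_a.load(std::memory_order_relaxed),
                               payload_b.load(std::memory_order_relaxed));
    }

    int main() {
        std::thread b(cpu_b), a(cpu_a);
        a.join();
        b.join();
    }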
On ARM, code like this is not good enough. CPU A, after setting the PAYLOAD, essentially has to emit a "PUBLISH" instruction that forces all those changes out to memory BEFORE it publishes the FLAG. If this is not done, other CPUs can see the FLAG's new value before they see the new PAYLOAD, and so read stale PAYLOAD values.
(As a technicality we won't go into, CPU B also has to execute a "SUBSCRIBE" instruction after reading the FLAG but before reading the PAYLOAD. The official terms are not PUBLISH and SUBSCRIBE but release and acquire; I think my words make it much easier to understand what's going on.)
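In C++ terms, PUBLISH is a release store of the FLAG and SUBSCRIBE is an acquire load of it. A sketch of the fixed versions of the two functions above (same illustrative names, reusing the payload_a/payload_b/flag globals from the earlier sketch):

    void cpu_a() {
        payload_a.store(1, std::memory_order_relaxed);
        payload_b.store(2, std::memory_order_relaxed);
        // "PUBLISH": release store -- every store before this one must be
        // visible to any CPU that sees the new flag value.
        flag.store(true, std::memory_order_release);
    }

    void cpu_b() {
        // "SUBSCRIBE": acquire load -- once we see the flag set, we are
        // guaranteed to also see everything published before it.
        while (!flag.load(std::memory_order_acquire)) { /* spin */ }
        std::printf("%d %d\n", payload_a.load(std::memory_order_relaxed),
                               payload_b.load(std::memory_order_relaxed));
    }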
On x86 these PUBLISH/SUBSCRIBE instructions are not necessary: the hardware's stronger (TSO) ordering already guarantees that stores become visible in the order they were issued.
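(Concretely, with a typical compiler the release store and acquire load above become the AArch64 stlr and ldar instructions, or plain accesses wrapped in dmb barriers, whereas on x86-64 the very same C++ compiles to ordinary mov instructions; TSO already provides the ordering, so from the programmer's point of view it comes for free.)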
The consequence of this is that ARM can be much more flexible in reordering loads and stores, both within the CPU proper and in (eventually, at some point) moving data between the various caches and RAM. In principle, the x86 folks would say, you can track the correct ordering in each CPU and fake it to get the same results.
Well, to some extent, but nothing is free. There are limits to just how aggressively you can reorder even if you are doing that, and tracking the reordering does cost transistors and energy. The primary problem is that you have to treat every interaction with another CPU as potentially requiring correct ordering; there's no external mechanism like the PUBLISH/SUBSCRIBE instructions to tell you exactly where ordering is required (meaning that everywhere else you can ignore it).
It definitely allows the CPU to be more aggressive with its internal load/store reordering, and it also allows more flexibility in how cache lines propagate between different caches (e.g. Apple would have no problem with, say, a modified cache line from CPU A propagating up to the cluster L2 and being seen there by CPU B, while that change does not propagate out to a different cluster on the SoC until much later).
There appears to be no serious work quantifying the "cost" of strong versus weak memory models. We do know that when people have flipped on the force-TSO bit on the M1 (normally used only by Rosetta), ARM code runs 15 to 25% slower. Defenders of x86 have said this proves nothing because Apple has no incentive to make its TSO mode run fast, and that's not a totally unreasonable claim. But it is worth noting that all the GPU vendors run their designs with very aggressively weak ordering (even Intel's iGPU), which suggests that the cost of strong ordering is high enough to drop it whenever you can...