theorist9

macrumors 68040
May 28, 2015
3,880
3,059
Yes, Windows generally forces text to the pixel grid, so there is none of the font blurring macOS has, at the expense of font shapes. I think it's a workable compromise for better-looking text, especially on lower-res displays. Windows does tend to do worse on oddball pixel structures like those of LG and Samsung OLED panels (WRGB and triangular RGB).

MacOS uses a very simplistic way to handle HiDPI scaling: target resolution (aka "looks like AxB") -> render at 2x -> downscale to native resolution. So, e.g., "looks like 2560x1440" -> render at 5120x2880 -> downscale to native res, which just happens to perfectly fit the 5K Apple Studio Display, resulting in no loss of fidelity. At any other scaling level you will have quality loss, but with the high PPI of the ASD it's unlikely you will notice it much.
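To put numbers on that pipeline (a minimal sketch, not Apple's actual implementation):

```python
def hidpi_pipeline(looks_like, native):
    """Model macOS HiDPI scaling: render at 2x the 'looks like'
    resolution, then scale that buffer to the panel's native size."""
    render = (looks_like[0] * 2, looks_like[1] * 2)
    # Scale factor applied to reach the native panel resolution.
    factor = native[0] / render[0]
    lossless = render == native  # a 1:1 match means no resampling blur
    return render, factor, lossless

# "looks like 2560x1440" on a 5K Studio Display: 2x render == native, lossless.
print(hidpi_pipeline((2560, 1440), (5120, 2880)))  # ((5120, 2880), 1.0, True)
# Same target on a 4K panel: the 5120x2880 buffer is resampled to 3840x2160.
print(hidpi_pipeline((2560, 1440), (3840, 2160)))  # ((5120, 2880), 0.75, False)
```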

Personally I don't mind the same scaling on a 4K display even if it is less sharp, and I appreciate the massively lower cost, more flexible inputs, higher refresh rate, better stand options, etc. of my 28" 4K 144 Hz screen.

I'm not quite sure how Windows handles this by comparison, but it seems much more complicated, with the details of how to set it up left more in the hands of developers. That results in a mixed HiDPI experience, but most apps nowadays are fine with it; I tend to only see scaling issues in things like installers looking blurry.

A big issue on MacOS is that Apple imposes arbitrary rules on what counts as "HiDPI capable". Apps like BetterDisplay would not need to exist if macOS weren't so often doing the wrong thing. For example, the LG DualUp 28" 16:18 2560x2880 display has the same PPI as a 32" 4K screen, but because its horizontal resolution is less than 3840, MacOS just decides it's "not a HiDPI display". They should simply adopt Windows' "the user can decide for themselves" approach for this stuff.
Agreed. Though I'm now thinking (and this is largely my fault) we should return to the discussion of the AMD vs. M-series CPUs before a mod takes us (me) to task for hijacking this thread. ;)
 
  • Like
Reactions: kasakka

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
[Image attachment 2051265]


One of the advantages of RISC-based CPUs over CISC-based CPUs is that they don't need a micro-operation cache, right? Are there any other blocks that RISC-based CPUs don't need?
 

leman

macrumors Core
Oct 14, 2008
19,518
19,667
View attachment 2051265

One of the advantages of RISC-based CPUs over CISC-based CPUs is that they don't need a micro-operation cache, right? Are there any other blocks that RISC-based CPUs don't need?

Apple uses micro-ops but I have no idea whether they also use a micro-op cache. Probably the decode is fast enough that they don’t need it?
 
  • Like
Reactions: Xiao_Xi

kvic

macrumors 6502a
Sep 10, 2015
516
460
Along the same train of thought as the micro-op cache: microcode ROM should be exclusive to x86/x86_64 and absent in ARM. As indicated by AMD in the below breakdown of one Zen 2 core, the microcode ROM's size is not trivial:

[Image: AMD's area breakdown of one Zen 2 core]
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
Locuza has a video describing the Zen 4 in detail.


His lament that the L2 is not an effective use of die space is a bit premature. Yes, AMD presented 22 workloads and a geometric mean over those workloads. But those were also a blend of mid-to-upper-range Windows PC workloads (games, Adobe Photoshop, ray tracing, etc.). Those aren't server workloads (they're certainly not multi-user workloads). There's an extremely high probability that Epyc will have a different set of apps that demonstrate its IPC uplift gen over gen. The full list is at about 5:18; if you pause there you can see it without the presenter in the way. :)
[Image: AMD Zen 4 IPC uplift chart]

Adobe Photoshop and Premiere were under the IPC uplift average. wPrime appears to be some kind of threading stressor that is likely perfectly happy with a just-big-enough L2 cache. ( https://www.wprime.net/About/ )
If that L2 cache increase is helping those outliers at the right-hand side, then they got "bang for the buck" out of that space outlay. Those two entries at the end raise the IPC uplift by 1% just by themselves. Similarly, if the Photoshop/Premiere scores had sagged back to a 1% increase, that also would have set the overall mean result back 1%.

[And when you take the geomean of the percentages listed on the chart, it is ~11%, not 13%. The plain arithmetic average is close to 13%.]
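For anyone re-checking that math, the two averages are computed differently; here is a quick sketch with placeholder uplift values (the chart's exact numbers aren't retyped here):

```python
import math

# Placeholder per-workload IPC uplifts as fractions; the chart's real
# values span roughly 1% to 39% across the 22 workloads.
uplifts = [0.01, 0.04, 0.08, 0.10, 0.13, 0.20, 0.39]

arith = sum(uplifts) / len(uplifts)
# Geomean of the growth factors (1 + uplift), converted back to a percent.
geo = math.prod(1 + u for u in uplifts) ** (1 / len(uplifts)) - 1

print(f"arithmetic mean: {arith:.1%}, geometric mean: {geo:.1%}")
# The geometric mean is always at or below the arithmetic mean, which is
# consistent with the chart's geomean (~11%) trailing its average (~13%).
```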

There's a pretty good chance that the L2 is either substantively propping up some of those low-uplift entries on the left-hand side or goosing the outliers on the far right-hand side.


As the video's coverage of the floorplan shows, the layout is approximately the same: the major blocks are still in the same general areas. The block shapes have shifted, but this is largely a shrink for many units, with some blocks shrinking at different rates/percentages. So yes, there is work done here, but not start-completely-over-from-scratch work. Moving the L2 subsection to a new node doesn't radically perturb the FP section, and vice versa.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
View attachment 2051265

One of the advantages of RISC-based CPUs over CISC-based CPUs is that they don't need a micro-operation cache, right? Are there any other blocks that RISC-based CPUs don't need?

The uop cache doesn't necessarily have to be a strict subset of the L1 instruction cache. If you have the decoded version, why do you need to keep the 'raw', untranslated version? (Perhaps to simplify lookup; I'm not sure that's necessary, but it probably has lower overhead.)

So if their instruction coverage doesn't fully intersect, the non-intersecting part of the uop cache plus the non-intersecting part of the L1 cache can give more coverage than the L1 cache by itself. The "wasted" overhead would be the size of the 'redundant' intersection of the two; that is where you would be using 'twice' as much space to store the same thing.

It depends on what the prefetcher is doing. It could bump L1 cache lines out that are still being executed in a loop. There would be no 'miss' because the loop executes entirely out of the uop cache, and when you exit the loop, the post-loop code could already be there.

In contrast, if you are always fetching from the L1 cache over and over again, then any prefetch cache fill could generate a 'miss' if it brings something into the cache that displaces something still being used.
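The coverage argument in miniature (made-up address sets, not real cache geometry):

```python
# Toy model: which instruction addresses each structure currently holds.
uop_cache = {0x100, 0x104, 0x108}          # decoded loop body
l1_icache = {0x108, 0x10C, 0x110, 0x114}   # raw bytes, partly prefetched

combined = uop_cache | l1_icache           # total unique coverage
redundant = uop_cache & l1_icache          # stored twice, the "wasted" space

print(f"coverage: {len(combined)} lines, duplicated: {len(redundant)} lines")
# If the two sets barely overlap, the pair covers more code than the L1
# alone; the overlap is the only space spent storing the same thing twice.
```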


Even most Arm implementations actually execute uops at the actual execution level, all the more so when doing deep reordering and rewriting: what is being executed doesn't use the same explicit registers listed in the assembler code/binary. Even if you shift from 0.0010 W to translate a raw instruction to 0.0002 W, if you go through the loop 10M times you still reach a substantial aggregate power consumption for translation, even at the lower rate, because the iteration count is so high. However, if you could skip that translation power entirely because a uop lookup and transfer costs 0.0001 W, that would still be a net savings on power. You'd be trading space for less power usage in very-large-iteration loops. If the uop lookup and transfer costs the same power as decoding from scratch, then not so much.
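Plugging those illustrative numbers into the loop (treating the per-instruction figures as energy per operation, purely for the sake of the ratio):

```python
# The post's illustrative per-instruction translation costs, treated here
# as energy per decode (joules) so the totals can be compared directly.
full_decode = 0.0010
cheap_decode = 0.0002
uop_lookup = 0.0001
iterations = 10_000_000

for name, cost in [("full decode", full_decode),
                   ("cheaper decode", cheap_decode),
                   ("uop-cache lookup", uop_lookup)]:
    print(f"{name}: {cost * iterations:,.0f} J over {iterations:,} iterations")
# Even the cheaper decoder pays 2,000 J across the loop; the uop cache
# halves that again: space traded for power, but only if the lookup
# really is cheaper than decoding from scratch.
```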

Most classic Arm implementations have area constraints, so they won't do it even if it would save power. As transistor budgets get bigger within the same typical area, there are options to add more things that consume resources.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
Along the same train of thought as the micro-op cache: microcode ROM should be exclusive to x86/x86_64 and absent in ARM. As indicated by AMD in the below breakdown of one Zen 2 core, the microcode ROM's size is not trivial:

[Image: AMD's area breakdown of one Zen 2 core]

Not trivial, but also not particularly significant. If you go to the video above, the chiplet test modules and memory interconnect subsystems are bigger than that ROM (although the ROM is replicated in each core).

If Apple's SoCs were dragging around Arm instruction baggage from the late 80's, they would be bulkier also. A hefty portion of this isn't a 'Complex' vs 'Reduced' issue; it is a constipated versus 'willing to let go' issue. If you are dragging around micro-op sequences that will never be executed during the lifetime of a Windows 11 system running modern, optimized apps, because no one uses those instructions anymore, then that is a clear waste of space. Replicating that non-usage several times in a package is a bigger one.

It would be slightly different if most folks were using some of these ancient instructions, but most don't. (And yes, there is a 0.1% of the population that still uses those instructions, but why should the rest drag around that baggage rather than that 1% getting some special-case SoCs in another part of the Intel/AMD product lineup?)
 

kvic

macrumors 6502a
Sep 10, 2015
516
460
Not trivial, but also not particularly significant. If you go to...

That's not the gist of the discussion started in #152 at all. Even if the Resident Expert of Everything on MR gets rid of the legacy x86 in x86_64, I believe he still has to include microcode ROM (on the die of each core), which ARM doesn't need, just like modern x86_64 has a micro-op cache which ARM doesn't need.
 

Sydde

macrumors 68030
Aug 17, 2009
2,563
7,061
IOKWARDI
Anandtech found M1 has 8 instruction decoders. By now x86_64 chips have pretty much caught up and are close (Intel already did; Zen 4 is likely the moment for AMD).

But what is the difference in size? An ARM decoder is barely more than a moderate-sized gate array (an elaborate demultiplexer), whereas an x86-64 decoder has to be a complex beast, probably a couple of orders of magnitude larger.

I believe both x86 and ARM μarches are also doing the inverse of μop parceling, fusing ops together where it can make sense. The name "micro-op" is kind of deceptive, as the μop object is a large piece of info, so if you can sensibly fit two or three steps into it, so much the better.

And I seriously doubt that "ARM processors also use μops" is anything close to reasonably accurate. Many ARM instructions do 2 or 3 things, but all those things would fit easily into one μop record. The only real exceptions in the ISA are a handful of atomic writebacks designed for semaphores (i.e., somewhat rarely used); for those, μop parceling makes sense, for everything else, not so much. Hence, to say that ARM uses μops is true to the extent that there are in fact a few lightly used instructions that are split up, but nowhere near anything like the heavy use of μops in a modern x86 processor.
 
  • Like
Reactions: pshufd

leman

macrumors Core
Oct 14, 2008
19,518
19,667
And I seriously doubt that "ARM processors also use μops" is anything close to reasonably accurate. Many ARM instructions do 2 or 3 things, but all those things would fit easily into one μop record. The only real exceptions in the ISA are a handful of atomic writebacks designed for semaphores (i.e., somewhat rarely used); for those, μop parceling makes sense, for everything else, not so much. Hence, to say that ARM uses μops is true to the extent that there are in fact a few lightly used instructions that are split up, but nowhere near anything like the heavy use of μops in a modern x86 processor.

From what I gathered, modern ARM CPUs indeed use micro-ops and micro-op caches, as described in official ARM product slides. I have no idea how Apple does it, but they will probably still use some sort of micro-ops, given that certain complex instructions (e.g. register auto-increment load/store) are executed using multiple ports over multiple cycles. There is no doubt, of course, that the instruction decode for ARM64 is simpler overall and that one needs fewer uops per instruction (usually just one) for most operations.
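A plausible illustration of that cracking (the uop names and fields here are invented; Apple's real internal format is undocumented):

```python
# Hypothetical cracking of the ARM64 post-indexed load "LDR X0, [X1], #8":
# one architectural instruction becomes two internal ops on different ports.
uops = [
    {"op": "load64",  "dst": "X0", "addr_base": "X1", "port": "load"},
    {"op": "add_imm", "dst": "X1", "src": "X1", "imm": 8, "port": "alu"},
]
for u in uops:
    print(u)
```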
 

diamond.g

macrumors G4
Mar 20, 2007
11,437
2,659
OBX
There was Zen 4 TDP talk earlier. It looks like AMD is being very conservative with the voltage.
 

kvic

macrumors 6502a
Sep 10, 2015
516
460
But what is the difference in size? An ARM decoder is barely more than a moderate-sized gate array (an elaborate demultiplexer), whereas an x86-64 decoder has to be a complex beast, probably a couple of orders of magnitude larger.
The nature of an ISA predominantly determines the complexity of its instruction decoders; nothing much can change that. So in terms of front-end cost, i.e. engineering effort and transistor counts, x86_64 is at a disadvantage against Apple silicon. Much less so in terms of performance now, though, because x86_64 already has (or will very likely have) 6 decoders (compared to 8 in M1/M2), which can fetch and decode more instructions in parallel and feed them to the execution units.

Also, I believe both x86 and ARM μarches are also doing the inverse of μop parceling, by fusing ops together, where it can make sense. The name "micro-op" is kind of deceptive, as the μop object is a large piece of info, so, if you can sensibly fit two or three steps into it, so much the better.
At the risk of being a laughing stock to x86/ARM designers, let me try. x86/x64 fetches and decodes instructions into micro-ops, i.e. it expands an instruction into much simpler, internal-only operations which are then executed by highly optimised hardware. I would think the decode process performs that breakdown as optimally as it can.

Interestingly, Intel has long been doing fusion of adjacent instructions into fewer instructions before instruction decoding. IDK. I would guess that's a technique borrowed from RISC designs.

ARM fetches and decodes instructions into macro-ops. I would think RISC's optimisation strategy is the opposite of classic CISC's: it tries to reduce adjacent instructions into fewer macro-ops, preparing for faster and more efficient execution later. Macro-ops are then split into micro-ops; one macro-op can be split into up to two micro-ops, which are executed by highly optimised hardware.

Interestingly also, recent designs by ARM Ltd added a macro-op cache. It sounds like they borrowed the technique from CISC designs. Nobody knows much about Apple silicon; not unexpectedly, Apple was simply ahead of ARM Ltd.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
And I seriously doubt that "ARM processors also use μops" is anything close to reasonably accurate. Many ARM instructions do 2 or 3 things, but all those things would fit easily into one μop record.

" ...
MOPs are macro ops. They are the instructions from the Architecture, which you can write in assembly language or generate from high level language compilation, and that you feed the processor with.

µOPs are generated on-the-fly by the processor and are not directly visible to the programmer. ..."
https://community.arm.com/support-f...um/48877/mops-macro-ops-uops-micro-ops/170857

External instruction format requirements and internal implementation requirements are different things. Binary code doesn't operate in an out-of-order context; there is an exact order right there in the code. What is happening inside a modern, aggressive processor is not necessarily the same.

Super-rigid instructions lead to super-rigid implementations. That is a decent trade-off if you want to stay small and cheap. If you're trying to outcompete everyone else, it is not so good.
 

Sydde

macrumors 68030
Aug 17, 2009
2,563
7,061
IOKWARDI
What is the difference between micro-ops and microcode?
They are kind of the opposite thing. On a CISC processor, when a multi-cycle complex instruction was encountered, the CPU would resolve the arguments and deliver them to an internal micro-program to handle the operation. Perhaps the foremost examples of this are the string-move and string-compare ops that involve an arbitrary loop of memory operations that usually execute a tiny bit faster than program code to do the same thing would.

Micro-ops (μops) involve breaking a complex operation into its constituent parts and executing the parts individually, which is often faster than dispatching to micro-programs. An example would be add (bx),ax, which is broken into
  1. obtain the value at memory location (bx)
  2. add the value of register ax to it
  3. store the sum at memory location (bx)
which functions well enough as a micro-program, but greater efficiencies are to be had by dispatching it into the instruction stream as a series of μops (see the sketch below). There may be an untried microarchitecture scheme that could parallelize micro-program segments for some performance gain, but so far, μop parceling appears to be the more efficient approach for CISC processors.
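Written out as data (a sketch only; real µop encodings are internal and far richer):

```python
# The three-step parceling of "add (bx), ax" as a micro-op sequence.
uops = [
    ("load",  {"dst": "tmp", "addr": "bx"}),            # 1. read memory at (bx)
    ("add",   {"dst": "tmp", "srcs": ("tmp", "ax")}),   # 2. tmp += ax
    ("store", {"src": "tmp", "addr": "bx"}),            # 3. write the sum to (bx)
]
for name, fields in uops:
    print(name, fields)
```

Issued individually like this, the load can start early and the add and store can be scheduled around unrelated work, which is where the win over a fixed micro-program comes from.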
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
They are kind of the opposite thing. On a CISC processor, when a multi-cycle complex instruction was encountered, the CPU would resolve the arguments and deliver them to an internal micro-program to handle the operation. Perhaps the foremost examples of this are the string-move and string-compare ops that involve an arbitrary loop of memory operations that usually execute a tiny bit faster than program code to do the same thing would.

Micro-ops (μops) involve breaking a complex operation into its constituent parts and executing the parts individually, which is often faster than dispatching to micro-programs.
Microcode and uops aren't opposites at all. One uop is simply one microcode instruction. Every modern x86 implements those string instructions by decoding them to a loop composed of µops.

You know how Intel high performance cores had 4-wide decode for a very long time? One technicality is that only one of the four decoders can handle the entire x86 ISA. The other three are only capable of decoding 'simple' x86 instructions - that is, those which translate to a single uop. Or maybe it's at most two µops? I don't remember for sure, but regardless of that limit, the point is that only the one 'complex' decoder is capable of generating long sequences of µops from single x86 instructions. (IIRC, this is still true in Intel's new 6-wide decoder for high performance cores, it's just 5+1 instead of 3+1.)

You might think that only having one complex decoder is a problem, but in practice the vast majority of x86 instructions executed by a CPU fall into the 'simple' category. Furthermore, complex things like the string instructions generate a µop loop which might execute hundreds of times. Nobody cares if decode temporarily bottlenecks to 1-wide to handle them.
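A toy model of that asymmetric decode (purely illustrative; real decoders also let the complex slot take simple instructions, among many other subtleties):

```python
def decode_width(window, complex_slots=1, simple_slots=3):
    """Count how many instructions from a fetch window can be decoded
    this cycle when only one decoder handles complex (multi-uop) ops."""
    decoded, used_complex = 0, 0
    for inst in window:
        if inst == "simple":
            decoded += 1
        elif used_complex < complex_slots:
            decoded += 1
            used_complex += 1
        else:
            break  # a second complex op stalls until the next cycle
        if decoded == complex_slots + simple_slots:
            break  # total decode width exhausted
    return decoded

print(decode_width(["simple"] * 4))                    # 4: full width
print(decode_width(["complex", "simple", "complex"]))  # 2: stalls at 2nd complex
```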

People get very wrong ideas about how complex average x86 instructions are. Most are quite simple. The encoding is awful, the register count is low, but if you kinda squint the ISA almost looks like an extremely awkward RISC.

The final thing I'll say is that, counterintuitively, µops aren't actually all that micro. They're a decoded version of the instruction, fully prepped for ease of handling in the execution backend. So, if the instruction encodes a constant in some funky format, that constant is decoded and expanded to full width. When the instruction references a register, the register's architectural name is translated to the correct dynamic register name, which is larger (register renaming means the physical register file or the ROB is much larger than the architectural register file). So on and so forth.

In fact, µops can easily be hundreds of bits wide! The in-memory version of an instruction has to be compact because it's desirable to use less RAM, but the in-core version can be as wide as it needs to be. The job of a decoder is to translate that compact form to the full set of information necessary for the execution backend to do its job.
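A rough sketch of why a decoded µop balloons past its in-memory encoding (all field choices here are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Uop:
    """Decoded form of one instruction: everything pre-expanded so the
    backend never re-parses bytes. Fields and widths are illustrative."""
    opcode: int        # internal op id, no longer raw x86 bytes
    dst_phys: int      # renamed physical register (e.g. 0..223, not 0..15)
    src1_phys: int
    src2_phys: int
    immediate: int     # constant expanded to full 64-bit width
    flags_wr: bool     # does this uop write condition flags?
    rob_index: int     # slot in the reorder buffer

# A few bytes of compact encoding easily expand past a hundred bits here.
u = Uop(opcode=0x12, dst_phys=147, src1_phys=23, src2_phys=88,
        immediate=0x1122334455667788, flags_wr=True, rob_index=42)
print(u)
```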

And yes, that means RISC CPUs also have µops. They too have all sorts of tricks to encode things as compactly as possible. Not as many as x86, which is why their decoders are much simpler than x86 decoders, but still, at the end of the decoding process, you end up with more bits, and that's OK and expected. The main difference is that you don't tend to need large microcode ROMs with complex microprograms to handle the deep weirdness lurking in x86; a clean RISC design decodes nearly everything to single µops.
 
  • Like
Reactions: Andropov

Sydde

macrumors 68030
Aug 17, 2009
2,563
7,061
IOKWARDI
And yes, that means RISC CPUs also have µops. They too have all sorts of tricks to encode things as compactly as possible. Not as many as x86, which is why their decoders are much simpler than x86 decoders, but still …
The biggest code-compaction "trick" up the x86 sleeve is 8-bit VLE. And a good trick it is, as it leaves room for expanding the ISA by adding instructions composed of more bytes. The downside is that the decoders have to work harder to discern op boundaries. It is a tad more challenging to add instructions to a fixed-length architecture, but that is why the ARM-based SoC community is squarely in the heterogeneous computing camp.
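The boundary-finding problem in miniature (the 'ISA' and length rule below are fabricated; real x86 length decoding involves prefixes, ModRM, SIB, etc.):

```python
def find_boundaries(code, length_of):
    """Walk a byte stream sequentially; each instruction's start is only
    known once every earlier instruction's length has been decoded."""
    offsets, pc = [], 0
    while pc < len(code):
        offsets.append(pc)
        pc += length_of(code[pc])
    return offsets

# Fabricated 'ISA': opcode byte 0x0F starts a 3-byte instruction, else 1 byte.
x86ish = bytes([0x90, 0x0F, 0x00, 0x00, 0x90])
print(find_boundaries(x86ish, lambda op: 3 if op == 0x0F else 1))  # [0, 1, 4]

# Fixed 4-byte ARM-style encoding: boundaries are known with no decoding at all.
print(list(range(0, 12, 4)))  # [0, 4, 8]
```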

Having more tools to do a job is often more efficient than coding in new CPU features, and when you get down to N3 or N2, a lot more specialized logic will fit in the same space. Specialized logic blocks use somewhat less juice to do a job and can be gated off when idle so they draw no power.

The reality is that CPU performance is at a wall-of-practicality. The cores and core complexes can, yes, be made even faster (the wall is not like a physical limit), but, why? The work that CPUs do is becoming less and less consequential as workloads get shunted over to GPUs and internal ASICs which handle the work more efficiently. Improving processor performance is a path of rapidly diminishing returns, because CPUs are already way fast enough to do what they need to do.
 

pshufd

macrumors G4
Oct 24, 2013
10,145
14,572
New Hampshire
The reality is that CPU performance is at a wall-of-practicality. The cores and core complexes can, yes, be made even faster (the wall is not like a physical limit), but, why? The work that CPUs do is becoming less and less consequential as workloads get shunted over to GPUs and internal ASICs which handle the work more efficiently. Improving processor performance is a path of rapidly diminishing returns, because CPUs are already way fast enough to do what they need to do.

This is why I like the focus on efficiency. Most consumers can get by with older tech, at least on the desktop. Specialized users needing more performance have that with the competition between Apple, Intel and AMD.
 
  • Like
Reactions: EntropyQ3 and Sydde

mi7chy

macrumors G4
Oct 24, 2014
10,619
11,292
AMD and Intel have some work to do. This video shows the impact of smart boosting in AMD/Intel CPUs on battery life.

He should learn how to turn off boost on x64 when comparing with AS without boost. Otherwise, it's by design for x64 to boost the clock as high as the TDP setting and thermals allow when boost is enabled. For example, on an AMD 4650U with boost disabled it's ~4W for Cinebench R23 single-core vs ~11W multi-core.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
He should learn how to turn off boost on x64 when comparing with AS without boost. Otherwise, it's by design for x64 to boost the clock as high as the TDP setting and thermals allow when boost is enabled. For example, on an AMD 4650U with boost disabled it's ~4W for Cinebench R23 single-core vs ~11W multi-core.
Benchmarks should be run with the default settings. It is up to the manufacturer to change them to improve the performance of the notebook. In fact, it could be considered cheating if he changes the default settings to improve a benchmark result.
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
He should learn how to turn off boost on x64 when comparing with AS without boost. Otherwise, it's by design for x64 to boost the clock as high as the TDP setting and thermals allow when boost is enabled. For example, on an AMD 4650U with boost disabled it's ~4W for Cinebench R23 single-core vs ~11W multi-core.
M1 and M2 both have single core boost. It's just that Apple-designed big cores reach full performance at ~5W. Thanks to this, users who want good battery life aren't forced to find and tweak obscure settings which reduce the performance of their computer.

No matter how much you try to spin it, Apple Silicon's power efficiency is better than AMD and Intel by enormous margins. That has consequences, as this reviewer showed with their practical battery life test. Truth hurts, mi7chy. Truth hurts.
 

mi7chy

macrumors G4
Oct 24, 2014
10,619
11,292
He doesn't know the difference between "simple core" and "single core", so you can't expect him to know the difference between base clock plus race-to-sleep vs. base clock to boost clock. Truth hurts.

 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
He doesn't know the difference between "simple core" and "single core", so you can't expect him to know the difference between base clock plus race-to-sleep vs. base clock to boost clock. Truth hurts.
Ah, so now you're moving on to making fun of someone who's clearly not a native English speaker just because they accidentally used the wrong word. Classy.
 