
NT1440

macrumors Pentium
May 18, 2008
15,092
22,158
It does not really matter, as long as you interpret the formula I posted above correctly. Given two identical cores where one runs at a higher frequency and voltage, the faster one will never win the efficiency back by finishing sooner: the time you gain back is linear in frequency, while the power rises roughly cubically (because the voltage has to rise along with the frequency). Pdyn = Cdyn * V^2 * f

In other words, you always lose efficiency if you increase the voltage, no matter how you calculate.

Another conclusion from this formula is that power always increases faster than frequency (while performance does not increase faster than frequency), hence: the lower the voltage, the higher the efficiency.
See, I’m not a math guy, but assuming your formula is correct, you’ve answered my question (and then some). Thanks
 

Tagbert

macrumors 603
Jun 22, 2011
6,256
7,281
Seattle
Apple attempted to make it more performant, especially on the iGPU side, to catch up with the competition, but the lack of active cooling negated any benefit. If you're coming from the M1, it's better to wait for the 3 nm M3 with a better-designed MBA.
Untrue.

At any load below its thermal limits, the M2 is significantly faster than the M1. Even at its thermal limits, the M2 is still faster than the M1.

[Attached image: throttling-chart-2.png]
 

mi7chy

macrumors G4
Oct 24, 2014
10,622
11,294
EDIT: I found out from https://www.notebookcheck.net/Our-Test-Criteria.15394.0.html that they use a multimeter. So they are measuring at the wall instead of using powermetrics. That's disappointing.

That's outdated, from 2016. Look at some of the more recent MacBook reviews, where they specifically mention powermetrics.

https://www.notebookcheck.net/Apple...dia-Laptop-for-Content-Creators.579013.0.html

"The M1 Pro is also very efficient when we have a look at the internal consumption (via powermetrics) and there is no throttling or turbo peaks at the start of the benchmarks."
 

Xiao_Xi

macrumors 68000
Original poster
Oct 27, 2021
1,627
1,101
Not sure if I understand the question. The M2 happens to run at a higher frequency and voltage when not thermally restricted, hence it will be less efficient, but that has little to do with architectural efficiency (remember the formula in my last post).

I thought Apple's ARM SoCs were more efficient than Intel/AMD x86 CPUs because ARM SoCs take about the same time as x86 CPUs to finish the same task at lower frequency. Mx can do that because Apple has chosen to use wider/shorter pipelines, faster memory access, better branch predictors...

I wonder where cmaier is when you need him most.
 
  • Like
Reactions: Argoduck

leman

macrumors Core
Oct 14, 2008
19,521
19,677
I thought Apple's ARM SoCs were more efficient than Intel/AMD x86 CPUs because ARM SoCs take about the same time as x86 CPUs to finish the same task at lower frequency. Mx can do that because Apple has chosen to use wider/shorter pipelines, faster memory access, better branch predictors...

Yes, and this is still true. The M2 is still much more efficient than x86 CPUs at similar performance.
 
  • Like
Reactions: Tagbert

Gerdi

macrumors 6502
Apr 25, 2020
449
301
I thought Apple's ARM SoCs were more efficient than Intel/AMD x86 CPUs because ARM SoCs take about the same time as x86 CPUs to finish the same task at lower frequency. Mx can do that because Apple has chosen to use wider/shorter pipelines, faster memory access, better branch predictors...

I wonder where cmaier is when you need him most.

That is still the case. However, the way the YouTubers measure efficiency does not tell you much about architectural efficiency. As I tried to explain, Cdyn (the so-called switching capacitance) is the only architecture-dependent parameter that impacts efficiency, while V and f can be freely chosen from the frequency-voltage curve. In addition, V is quadratic in the equation, so it has a big impact.
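
To see how much the voltage term dominates, here is a minimal sketch (the Cdyn value and the frequency/voltage pairs are made up, not measurements of any real core) that plugs a few operating points into Pdyn = Cdyn * V^2 * f and reports work per joule:

```python
# Toy illustration of Pdyn = Cdyn * V^2 * f with invented numbers.
# Cdyn and the (frequency, voltage) pairs are hypothetical, not data
# for any real core.

CDYN = 1e-9            # switching capacitance in farads (assumed constant)
WORK_PER_CYCLE = 1.0   # arbitrary units of work retired per cycle

operating_points = [
    (2.0e9, 0.70),  # 2.0 GHz at 0.70 V
    (3.0e9, 0.90),  # 3.0 GHz at 0.90 V
    (3.5e9, 1.05),  # 3.5 GHz at 1.05 V
]

for f, v in operating_points:
    power = CDYN * v ** 2 * f           # dynamic power in watts
    performance = WORK_PER_CYCLE * f    # work per second
    efficiency = performance / power    # work per joule = (work/cycle) / (Cdyn * V^2)
    print(f"{f / 1e9:.1f} GHz @ {v:.2f} V: {power:.2f} W, {efficiency:.2e} work/J")
```

Frequency cancels out of the work-per-joule figure, so the only thing separating the three operating points is V^2, exactly as the formula says.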
 
  • Like
Reactions: Argoduck

Sydde

macrumors 68030
Aug 17, 2009
2,563
7,061
IOKWARDI
I thought Apple's ARM SoCs were more efficient than Intel/AMD x86 CPUs because ARM SoCs take about the same time as x86 CPUs to finish the same task at lower frequency. Mx can do that because Apple has chosen to use wider/shorter pipelines, faster memory access, better branch predictors...

Mainly, x86 has to have an elaborate instruction stream parser that has to look at a bunch of bytes and figure out which ones go together to form each instruction; it then has to convert each instruction into its constituent μops to be fed into the pipe.

ARMv8+ does away with both of these issues: instructions are concise, so they very rarely need to be split into μops, since that is basically what they already are; and since all instructions are 32 bits long, the pipeline can just grab 16 bytes and already know it has 4 opcodes there, just about ready to be stuffed right into the pipe as is.

In other words, x86 starts with a disadvantage.
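
A toy sketch of that difference (purely illustrative: the byte values and the length rule below are invented, not real x86 or Arm encodings). With a fixed 32-bit encoding the instruction boundaries are known before looking at a single byte, while a variable-length encoding has to work out each instruction's length before it knows where the next one starts:

```python
# Illustrative only: fixed-width vs variable-length instruction splitting.
# The bytes and the length rule are made up, not real x86 or Arm encodings.

def split_fixed_width(fetch_bytes, width=4):
    """Fixed 32-bit instructions: boundaries are known up front,
    so all slots can be carved out independently (i.e. in parallel)."""
    return [fetch_bytes[i:i + width] for i in range(0, len(fetch_bytes), width)]

def split_variable_length(fetch_bytes, length_of):
    """Variable-length instructions: each boundary depends on decoding the
    previous instruction, so the scan is inherently serial."""
    out, i = [], 0
    while i < len(fetch_bytes):
        n = length_of(fetch_bytes, i)   # must inspect bytes to learn the length
        out.append(fetch_bytes[i:i + n])
        i += n
    return out

fetch_group = bytes(range(16))

# 16 fetched bytes -> exactly 4 fixed-width instructions, no inspection needed.
print(split_fixed_width(fetch_group))

# A made-up length rule standing in for an x86-style prefix/opcode parse.
fake_length = lambda buf, i: 1 + (buf[i] % 3)
print(split_variable_length(fetch_group, fake_length))
```

The fixed-width split is what lets a wide front end carve a fetch group into instruction slots cheaply; the serial length scan is the extra work the paragraph above is describing.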

I wonder where cmaier is when you need him most.

He got banned, AIUI, for... reasons.
 
  • Like
Reactions: Xiao_Xi

leman

macrumors Core
Oct 14, 2008
19,521
19,677
Mainly, x86 has to have an elaborate instruction stream parser that has to look at a bunch of bytes and figure out which ones go together to form each instruction; it then has to convert each instruction into its constituent μops to be fed into the pipe.

Offtopic, but it made me wonder why Apple is using VLE for their custom GPU ISA. Of course, that's an in-order architecture, so decoding probably isn't a bottleneck in the first place and can be done efficiently, and code density matters a lot.
 

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,664
OBX
Offtopic, but it made me wonder why Apple is using VLA for their custom GPU ISA. Of course, that's an in-order architecture, so decoding probably isn't a bottleneck in the first place and can be done efficiently, and code density matters a lot.
I tried searching, but my search-fu isn't as strong as I'd like. What does VLA stand for?
 
  • Like
Reactions: Xiao_Xi

Sydde

macrumors 68030
Aug 17, 2009
2,563
7,061
IOKWARDI
What did he do this time? 🤣

His side of the story is something like "I looked under the bridge and became angry that it had been allowed to reach such a state". If you were to ask, say, weaselbot, you would probably get a completely different narrative, about snowflakes and hurt fee-fees and how I should be reprimanded for even mentioning it.
 

tomO2013

macrumors member
Feb 11, 2020
67
102
Canada
That is still the case. However, the way the YouTubers measure efficiency does not tell you much about architectural efficiency. As I tried to explain, Cdyn (the so-called switching capacitance) is the only architecture-dependent parameter that impacts efficiency, while V and f can be freely chosen from the frequency-voltage curve. In addition, V is quadratic in the equation, so it has a big impact.
I love math :)

Would you mind quoting the source of the formula you’re using, and perhaps explaining how it accounts for architectural differences (e.g. path layout, as opposed to switching capacitance in isolation)? I’m thinking of fundamental differences such as the M1 having no hardware-accelerated logic for ProRes or ProRes RAW, or the differences AnandTech identified in the newer Blizzard efficiency cores, which complete work faster and use less energy to do so relative to their predecessors.

I apologize if I have misunderstood your point on the formula; really I’m just looking for clarity. I find myself questioning not necessarily the formula itself (which looks to determine efficiency as a measure of die size and capacitance) but its relevance or applicability, in isolation, when comparing two fundamentally different pieces of silicon :)

Surely we cannot say that the M1 decodes ProRes RAW more efficiently than the M2 at the same power envelope, so we should avoid hyperbole in the description and state the scenarios where a claim holds true (or not).

Please can you help me understand better :).

Many thanks,

Tom.
 

Xiao_Xi

macrumors 68000
Original poster
Oct 27, 2021
1,627
1,101
I thought AMD used VLIW, but that appears to be a GCN/CDNA thing. Not 100% sure how RDNA2 would be classified.
Wikipedia calls GCN "RISC SIMD".
TeraScale is a VLIW SIMD architecture, while Tesla is a RISC SIMD architecture, similar to TeraScale's successor Graphics Core Next.

More info about AMD's GPU:
 
  • Like
Reactions: diamond.g

Gerdi

macrumors 6502
Apr 25, 2020
449
301
I love math :)

Would you mind quoting the source of the formula you’re using, and perhaps explaining how it accounts for architectural differences (e.g. path layout, as opposed to switching capacitance in isolation)? I’m thinking of fundamental differences such as the M1 having no hardware-accelerated logic for ProRes or ProRes RAW, or the differences AnandTech identified in the newer Blizzard efficiency cores, which complete work faster and use less energy to do so relative to their predecessors.

I apologize if I have misunderstood your point on the formula; really I’m just looking for clarity. I find myself questioning not necessarily the formula itself (which looks to determine efficiency as a measure of die size and capacitance) but its relevance or applicability, in isolation, when comparing two fundamentally different pieces of silicon :)

Surely we cannot say that the M1 decodes ProRes RAW more efficiently than the M2 at the same power envelope, so we should avoid hyperbole in the description and state the scenarios where a claim holds true (or not).

Please can you help me understand better :).

Many thanks,

Tom.

The formula is just basic physics; you might find it in schoolbooks. You can also convince yourself of the units: 1 W = 1 F * 1 V^2 / 1 s.
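
Spelling that unit check out (nothing chip-specific, just the SI definitions of the farad, coulomb and joule):

F * V^2 / s = (C/V) * V^2 / s = C * V / s = J / s = W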

I am using it to demonstrate a common fallacy, which the OP fell for. Let's be more precise and define efficiency:

Efficiency = Performance / Power = ((work/cycle) * f) / Pdyn = ((work/cycle) * f) / (Cdyn * V^2 * f) = (work/cycle) / (Cdyn * V^2)

As expected, since f is linear in power as well as in performance, it cancels out. Still, a higher voltage is required in order to reach higher frequencies.
Looking at the formula, only (work/cycle)/Cdyn is architecture-dependent and constant per architecture. Beyond that, efficiency decreases with the square of the voltage for any architecture.

My point was that you cannot conclude from the presented data, which shows the M2 being less efficient than the M1, that this has anything to do with architecture; it has everything to do with voltage. If you look at an iso-voltage scenario, you will see that the M2 is architecturally more efficient than the M1 (i.e. (work/cycle)/Cdyn is higher for the M2 than for the M1).

Corollary: if you want to compare the efficiency of two different architectures, make sure you compare them at an iso-voltage point, not at an iso-performance point (as many YouTubers or people here in the forum would suggest).
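
To make the fallacy concrete, here is a minimal sketch with invented numbers (the work/cycle, Cdyn, frequency and voltage values are hypothetical placeholders, not M1 or M2 data):

```python
# Hypothetical illustration of the fallacy, under the model
# Pdyn = Cdyn * V^2 * f  and  Efficiency = (work/cycle) / (Cdyn * V^2).
# Every number below is invented; none of this is M1/M2 measurement data.

def perf_per_watt(work_per_cycle, cdyn, volts):
    # Frequency cancels out of performance/power.
    return work_per_cycle / (cdyn * volts ** 2)

# "Chip B" is assumed architecturally better (more work per cycle, same Cdyn),
# but it ships at a higher default frequency and voltage than "chip A".
chip_a = {"work_per_cycle": 1.0, "cdyn": 1.0, "default_f": 3.2e9, "default_v": 0.85}
chip_b = {"work_per_cycle": 1.1, "cdyn": 1.0, "default_f": 3.5e9, "default_v": 0.95}

for name, c in (("A", chip_a), ("B", chip_b)):
    performance = c["work_per_cycle"] * c["default_f"]
    efficiency = perf_per_watt(c["work_per_cycle"], c["cdyn"], c["default_v"])
    print(f"{name} at its default point: perf={performance:.3g}, perf/W={efficiency:.3g}")

# The same two chips compared iso-voltage: B's architectural advantage shows.
for name, c in (("A", chip_a), ("B", chip_b)):
    print(f"{name} at 0.85 V: perf/W={perf_per_watt(c['work_per_cycle'], c['cdyn'], 0.85):.3g}")
```

With these invented numbers, B is both faster and architecturally more efficient, yet reading off perf/W at the two default operating points ranks it below A, which is exactly the fallacy described above.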
 

Sydde

macrumors 68030
Aug 17, 2009
2,563
7,061
IOKWARDI
As expected, since f is linear in power as well as in performance, it cancels out. Still, a higher voltage is required in order to reach higher frequencies.

Say what? The trajectory of IC design has been the opposite of that. At higher frequencies, the cycle goes up to and across the peak more smoothly when the difference between 0 and the peak is smaller. '70s/'80s machines used +5 V as the peak, operating in the roughly 1~300 MHz range; once we started getting close to the GHz mark, voltages started to drop, in part because gating was being designed to respond properly at lower differentials. So, please stop blorting nonsense.
 
