Oh man, I dove deeper into this than I'd meant to but now feel committed to posting it... Apologies for the mess.
That is still the case. However, the way the YouTubers measure efficiency does not tell you much about architectural efficiency. As I tried to explain, Cdyn (the so-called switching capacitance) is the only architecture-dependent parameter that impacts efficiency, while V and f can be freely chosen from the frequency-voltage curve. In addition, V is quadratic in the equation, so it has a big impact.
I get what you're saying, but since the thread specifically asks about microarchitecture it's worth being clear on what's process, what's microarchitecture, and what's instruction set architecture.
I think the capacitance is a process-dependent parameter, not an architectural parameter. Resistance (actually a collection of parameters that all restrict current flow into the structures that behave like a capacitor) is also a process parameter.
Architecture is more logical than physical. Process sets the physical parameters, and then there's some engineering involved in deciding what tradeoffs to make in using that process.
The instruction set is mapped onto a microarchitecture, which is implemented in a process.
I love math
Would you mind quoting the source of the formula that you’re using, and perhaps explaining how it accounts for architectural differences (e.g. path layout as opposed to switching capacitance in isolation)? For example, how does it account for fundamental differences such as M1 having no hardware-accelerated logic for ProRes or ProRes RAW, or the differences AnandTech identified in the newer Blizzard efficiency cores, which complete work faster and use less energy to do so?
I apologize if I have misunderstood your point on the formula. Really I’m just looking for clarity: I find myself questioning not necessarily the formula itself (which looks to determine efficiency as a measure of die size and capacitance) but its relevance, or applicability, in isolation when comparing two fundamentally different pieces of silicon.
Surely we cannot say that M1 decodes ProRes RAW more efficiently than M2 at the same power envelope, so we should avoid hyperbole in the description and state the scenarios where a claim holds true (or not).
Please can you help me understand better.
Many thanks,
Tom.
The math comes from the fact that you can, if you squint really hard and wave your hands frantically, treat each transistor stage like an RC (resistor-capacitor) circuit. The gate (input) of each transistor behaves largely like a capacitor, and the circuit feeding that gate (the conducting channels of the transistors before it, and the wires connecting them) behaves resistively.
You can get a few PhDs exploring all the various ways that paragraph oversimplifies the situation, but to understand the few bits of math discussed so far, that's a reasonable mental framework.
Ok, so where does the power go in a modern microprocessor?
There are two main sources of power consumption: static leakage, and dynamic switching currents.
Static leakage is just current that slips through the chip because voltage is applied. Insulators and junctions aren't perfect, and electrons aren't billiard balls passing through pipes (they're probability distributions), so there's always a chance that an electron disappears on this side of the insulator and shows up on the other side. As process sizes get smaller, leakage goes up. It's roughly proportional to the area of the chip, so bigger chips leak more. One way around this problem is to control what parts of a chip are powered, so leakage can be reduced by removing the supply voltage from unused portions of the chip.
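To make that concrete, here's a toy sketch of the idea; the leakage density is a completely made-up number, just to show how area and power gating interact:

```python
# Toy static-leakage model. LEAKAGE_DENSITY_W_PER_MM2 is an invented
# number, purely for illustration; real values depend on the process.
LEAKAGE_DENSITY_W_PER_MM2 = 0.05

def leakage_power(die_area_mm2: float, powered_fraction: float) -> float:
    """Static leakage scales roughly with the die area left powered."""
    return LEAKAGE_DENSITY_W_PER_MM2 * die_area_mm2 * powered_fraction

print(leakage_power(120, 1.0))  # whole die powered: ~6 W
print(leakage_power(120, 0.4))  # 60% of the die gated off: ~2.4 W
```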
Dynamic switching power is the P=fCV^2 equation that's being talked about. So where does it come from? To switch a transistor, you raise and lower the voltage on its "gate". The gate structure looks a lot like a teeny, tiny capacitor, so when you apply a voltage to the terminal, you transfer charge to it. How much charge?
q=CV
Charge (q, number of electrons) equals the capacitance times the voltage across that capacitor.
As you switch the transistor on and off, you push charge onto the gate from the power rail, and then off the gate to ground. The flow of charge per second is current. Each on-off cycle carries q electrons from the rail to ground with a pitstop at the gate, and there are f cycles per second, so the current through the transistor is:
I=fq=fCV
Power dissipated is typically voltage times current, so:
P=IV=(fCV)V=fCV^2
This is the power dissipated in a single transistor. In this case, f isn't really the clock frequency. f is the number of times that particular transistor switches each second. If it's the clock buffer, then f is the clock frequency. If it's a transistor in a second level cache, then it'll only switch on a first level cache miss and only if that bit is a one. Each transistor consumes different power depending on the job it's doing.
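As a sanity check on the units, here's that equation in a couple of lines of Python (the node values are invented, purely to show the scale):

```python
def dynamic_power(f_hz: float, c_farads: float, v_volts: float) -> float:
    """Dynamic switching power of a single node: P = f * C * V^2."""
    return f_hz * c_farads * v_volts ** 2

# A node toggling at 3 GHz with 1 fF of gate capacitance at 1.0 V
# (illustrative numbers, not measurements of any real chip):
print(dynamic_power(3e9, 1e-15, 1.0))  # ~3e-06 W, about 3 microwatts
```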
Typically people might assume that, in a relative sense, if you double the clock frequency, you're doing the work twice as fast and thus doubling the power (ignoring the necessary voltage scaling that I'll get to in a minute). To a first order, sure. But this too will break down under scrutiny for something as complex as an M-series SoC.
And, it turns out, f and V aren't fully independent. Why? Because the way a transistor works in a digital circuit is essentially as a switch, so you can think of there being a threshold voltage on the gate, less than the supply voltage, that's necessary to turn it on. Because it's a capacitor, and there's resistance leading into that capacitor, the gate voltage is described by a differential equation, and its value after time t solves to:
Vgate=Vsource(1-e^(-t/(RC)))
Picture a plot of gate voltage (vertical axis) against time (horizontal axis), with RC set arbitrarily to 1: the curve with a source voltage of 2.5V crosses the 1V threshold more quickly than the one with a 2V source, and thus can operate at a higher frequency.
The time of interest is half the cycle time (one cycle is an on and an off and it needs to reach threshold for on) so:
Vgate=Vsource(1-e^(-1/(2fRC)))
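If you set Vgate to the threshold voltage and solve for f, you get the fastest clock a stage can support at a given supply voltage. A quick sketch (the RC constant is one I picked arbitrarily for illustration):

```python
import math

def max_frequency(v_source: float, v_threshold: float, rc: float) -> float:
    """Highest f at which the gate still reaches threshold in a half cycle,
    derived from Vgate = Vsource * (1 - exp(-1/(2*f*R*C)))."""
    return -1.0 / (2.0 * rc * math.log(1.0 - v_threshold / v_source))

RC = 1e-10  # assumed 100 ps time constant, purely illustrative
print(max_frequency(2.5, 1.0, RC) / 1e9)  # ~9.8 GHz at 2.5 V
print(max_frequency(2.0, 1.0, RC) / 1e9)  # ~7.2 GHz at 2.0 V
```

Same shape as the plot described above: the higher supply voltage buys you a faster usable clock.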
And if V is a function of f, what would be the typical functional form of the relationship between the two? E.g., is it V(f) = a f^0.5, V(f) = a Exp(f), etc.?
So V(f)=Vsource=Vgate/(1-e^(-1/(2fRC)))
Where Vgate would be set to whatever the threshold voltage is. The devices may be broken into blocks that each have different clocks, and there may be several different supply voltages applied depending on the needs of each block, but f in this case likely is the clock frequency for the block, because you have to choose one source voltage for a large fraction of the chip; you don't get to apply different voltages to each individual transistor.
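Putting the two formulas together shows why power climbs so fast when you chase frequency: the supply voltage has to rise with f, and it enters the power equation squared. A sketch using the same arbitrary RC constant and an assumed 1 fF node:

```python
import math

RC = 1e-10   # assumed 100 ps RC time constant, illustrative only
V_TH = 1.0   # assumed threshold voltage
C = 1e-15    # assumed 1 fF node capacitance

def required_voltage(f_hz: float) -> float:
    """Supply voltage needed for the gate to reach V_TH in a half cycle."""
    return V_TH / (1.0 - math.exp(-1.0 / (2.0 * f_hz * RC)))

for f in (2e9, 4e9, 8e9):
    v = required_voltage(f)
    p = f * C * v * v  # P = f*C*V^2, now superlinear in f
    print(f"{f/1e9:.0f} GHz: V = {v:.2f} V, P = {p*1e6:.1f} uW")
```

In this toy model, going from 2 GHz to 4 GHz more than triples the power, and 4 GHz to 8 GHz nearly quintuples it; that's the superlinear curve everyone fights against.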
Anyway, that's the general idea behind the math and it's a good way to understand how power and performance trade off for small deviations from base, but it's not going to give you a precise measurement of anything. It's a terribly complex system in there. Not only does each transistor behave differently, but every process parameter is temperature dependent (so things will begin to shift as the system heats up).
And, again, this is only the process parameters; it doesn't really say anything about ProRes performance. The macro and micro architecture will also drive logical differences in power/performance. Partitioning work between efficiency and performance cores is an obvious one. Cache sizes and strategies will impact power. Fine details that I'm not really aware of in the core architecture will clock more or less logic at higher or lower rates. "Race to sleep" means that it is often more efficient to execute small jobs quickly and start shutting down parts of the device again, rather than execute more slowly over a longer time (part of this is to power down blocks to reduce static leakage, part is because running clocks into idle blocks still burns power).

At the OS level, at least in the iOS devices but likely in the macOS systems too, they delay execution of non-critical tasks so they can power the chip up once, keep it really busy for a while, and then put everything back to sleep. The power supplies will be more or less efficient at different voltages and currents...
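Here's a toy version of the race-to-sleep arithmetic, reusing the RC voltage model from the sketches above (every number is invented; the point is only the shape of the tradeoff):

```python
import math

RC, V_TH, C = 1e-10, 1.0, 1e-15  # same toy constants as the earlier sketches
P_LEAK = 5e-6                    # assumed leakage while the block stays powered (W)
EVENTS = 2e9                     # switching events the job requires on this node

def job_energy(f_hz: float) -> float:
    v = V_TH / (1.0 - math.exp(-1.0 / (2.0 * f_hz * RC)))  # required supply
    t_on = EVENTS / f_hz               # time until the block can be gated off
    return EVENTS * C * v * v + P_LEAK * t_on  # dynamic energy + leakage

print(job_energy(2e9))  # ~7.4e-06 J: slow clock, block powered for 1.0 s
print(job_energy(4e9))  # ~6.4e-06 J: fast clock wins once leakage counts
```

The faster run burns more dynamic energy because it needs a higher voltage, but it halves the time the block sits there leaking, which is the whole trick.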
The only way to get our heads around all that complexity is to ignore most of it at the start and begin from rules of thumb: P=fCV^2
The N5P process used for M2 is quoted as giving 7% better performance or 15% lower power than the N5 process used for M1. That's not an either-or; those are probably the two extremes of "use it all for performance" versus "use it all for efficiency". I don't know the details of the process, but it comes down to this: they did something to improve the R and C parameters, and you can spend that on higher clocks, or on lower voltage at the same clock. Those numbers don't map well to the M1-to-M2 changes, so Apple obviously did more than just port the M1 design to the N5P process, which we knew.
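For what it's worth, you can back out roughly what the 15% figure implies under the P=fCV^2 rule of thumb (my own back-of-envelope reading, not TSMC's math):

```python
# If all of the process gain is spent lowering V at the same clock,
# P = f*C*V^2 says a 15% power drop corresponds to roughly:
v_scale = (1 - 0.15) ** 0.5
print(f"{(1 - v_scale) * 100:.1f}% lower supply voltage")  # ~7.8%
```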