Oh man, I dove deeper into this than I'd meant to but now feel committed to posting it... Apologies for the mess.
That is still the case. However, the way the YouTubers measure efficiency does not tell you much about architectural efficiency. As I tried to explain, Cdyn (the so-called switching capacitance) is the only architecture-dependent parameter that impacts efficiency, while V and f can be freely chosen from the frequency-voltage curve. In addition, V is quadratic in the equation, so it has a big impact.
I get what you're saying, but since the thread specifically asks about microarchitecture it's worth being clear on what's process, what's microarchitecture, and what's instruction set architecture.
I think the capacitance is a process-dependent parameter, not an architectural parameter. Resistance (actually a collection of parameters that all restrict current flow into the structures that behave like a capacitor) is also a process parameter.
Architecture is more logical than physical. Process sets the physical parameters, and then there's some engineering involved in deciding what tradeoffs to make in using that process.
The instruction set is mapped onto a microarchitecture, which is implemented in a process.
I love math
Would you mind quoting the source of the formula that you’re using, and perhaps explaining how it accounts for architectural differences (e.g. path layout as opposed to switching capacitance in isolation)? For example, how does it account for fundamental differences such as M1 having no hardware-accelerated logic for ProRes or ProRes RAW, or the differences AnandTech identified in the newer Blizzard efficiency cores, which complete work faster and use less energy to do so?
I apologize if I have misunderstood your point on the formula. Really I’m just looking for clarity: I find myself questioning not necessarily the formula itself (which looks to determine efficiency as a measure of die size and capacitance) but its relevance, or applicability, in isolation when comparing two fundamentally different pieces of silicon.
Surely we cannot say that M1 decodes ProRes RAW more efficiently than M2 at the same power envelope, so we should avoid hyperbole in the description and state the scenarios where a claim holds true (or not).
Please can you help me understand better.
Many thanks,
Tom.
The math comes from the fact that you can, if you squint really hard and wave your hands frantically, treat each transistor stage like an RC (resistor-capacitor) circuit. The gate (input) of each transistor behaves largely like a capacitor, and the circuit feeding that gate (the conducting channels of the transistors before it, and the wires connecting them) behaves resistively.
You can get a few PhDs exploring all the various ways that paragraph oversimplifies the situation, but to understand the few bits of math discussed so far, that's a reasonable mental framework.
Ok, so where does the power go in a modern microprocessor?
There are two main sources of power consumption: static leakage, and dynamic switching currents.
Static leakage is just current that slips through the chip because voltage is applied. Insulators and junctions aren't perfect, and electrons aren't billiard balls passing through pipes (they're probability distributions), so there's always a chance that an electron disappears on this side of the insulator and shows up on the other side. As process sizes get smaller, leakage goes up. It's roughly proportional to the area of the chip, so bigger chips leak more. One way around this problem is to control what parts of a chip are powered, so leakage can be reduced by removing the supply voltage from unused portions of the chip.
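To make that concrete, here's a toy sketch of the idea; the leakage density is a completely made-up number, just to show how area and power gating interact:

```python
# Toy static-leakage model. LEAKAGE_DENSITY_W_PER_MM2 is an invented
# number, purely for illustration; real values depend on the process.
LEAKAGE_DENSITY_W_PER_MM2 = 0.05

def leakage_power(die_area_mm2: float, powered_fraction: float) -> float:
    """Static leakage scales roughly with the die area left powered."""
    return LEAKAGE_DENSITY_W_PER_MM2 * die_area_mm2 * powered_fraction

print(leakage_power(120, 1.0))  # whole die powered: ~6 W
print(leakage_power(120, 0.4))  # 60% of the die gated off: ~2.4 W
```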
Dynamic switching power is the P=fCV^2 equation that's being talked about. So where does it come from? To switch a transistor, you raise and lower the voltage on its "gate". The gate structure looks a lot like a teeny, tiny capacitor, so when you apply a voltage to the terminal, you transfer charge to it. How much charge?
q=CV
Charge (q, number of electrons) equals the capacitance times the voltage across that capacitor.
As you switch the transistor on and off, you push charge onto the gate from the power rail, and then off the gate to ground. The flow of charge per second is current. Each on-off cycle carries q electrons from the rail to ground with a pitstop at the gate, and there are f cycles per second, so the current through the transistor is:
I=fq=fCV
Power dissipated is typically voltage times current, so:
P=IV=(fCV)V=fCV^2
This is the power dissipated in a single transistor. In this case, f isn't really the clock frequency. f is the number of times that particular transistor switches each second. If it's the clock buffer, then f is the clock frequency. If it's a transistor in a second level cache, then it'll only switch on a first level cache miss and only if that bit is a one. Each transistor consumes different power depending on the job it's doing.
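As a sanity check on the units, here's that equation in a couple of lines of Python (the node values are invented, purely to show the scale):

```python
def dynamic_power(f_hz: float, c_farads: float, v_volts: float) -> float:
    """Dynamic switching power of a single node: P = f * C * V^2."""
    return f_hz * c_farads * v_volts ** 2

# A node toggling at 3 GHz with 1 fF of gate capacitance at 1.0 V
# (illustrative numbers, not measurements of any real chip):
print(dynamic_power(3e9, 1e-15, 1.0))  # ~3e-06 W, about 3 microwatts
```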
Typically people might assume that, in a relative sense, if you double the clock frequency, you're doing the work twice as fast and thus doubling the power (ignoring the necessary voltage scaling that I'll get to in a minute). To a first order, sure. But this too will break down under scrutiny for something as complex as an M-series SoC.
And, it turns out, f and V aren't fully independent. Why? Because the way a transistor works in a digital circuit is essentially as a switch, so you can think of there being a threshold voltage on the gate, less than the supply voltage, that's necessary to turn it on. Because it's a capacitor, and there's resistance leading into that capacitor, the gate voltage is described by a differential equation, and its value after time t solves to:
Vgate=Vsource(1-e^(-t/(RC)))
Picture a plot of gate voltage (vertical axis) against time (horizontal axis), with RC set arbitrarily to 1: the curve with a source voltage of 2.5V crosses the 1V threshold more quickly than the one with a 2V source, and thus can operate at a higher frequency.
The time of interest is half the cycle time (one cycle is an on and an off and it needs to reach threshold for on) so:
Vgate=Vsource(1-e^(-1/(2fRC)))
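If you set Vgate to the threshold voltage and solve for f, you get the fastest clock a stage can support at a given supply voltage. A quick sketch (the RC constant is one I picked arbitrarily for illustration):

```python
import math

def max_frequency(v_source: float, v_threshold: float, rc: float) -> float:
    """Highest f at which the gate still reaches threshold in a half cycle,
    derived from Vgate = Vsource * (1 - exp(-1/(2*f*R*C)))."""
    return -1.0 / (2.0 * rc * math.log(1.0 - v_threshold / v_source))

RC = 1e-10  # assumed 100 ps time constant, purely illustrative
print(max_frequency(2.5, 1.0, RC) / 1e9)  # ~9.8 GHz at 2.5 V
print(max_frequency(2.0, 1.0, RC) / 1e9)  # ~7.2 GHz at 2.0 V
```

Same shape as the plot described above: the higher supply voltage buys you a faster usable clock.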
And if V is a function of f, what would be the typical functional form of the relationship between the two? E.g., is it V(f) = a f^0.5, V(f) = a Exp(f), etc.?
So V(f)=Vsource=Vgate/(1-e^(-1/(2fRC)))
Where Vgate would be set to whatever the threshold voltage is. The devices may be broken into blocks that each have different clocks, and there may be several different supply voltages applied depending on the needs of each block, but f in this case likely is the clock frequency for the block, because you have to choose one source voltage for a large fraction of the chip; you don't get to apply different voltages to each individual transistor.
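Putting the two formulas together shows why power climbs so fast when you chase frequency: the supply voltage has to rise with f, and it enters the power equation squared. A sketch using the same arbitrary RC constant and an assumed 1 fF node:

```python
import math

RC = 1e-10   # assumed 100 ps RC time constant, illustrative only
V_TH = 1.0   # assumed threshold voltage
C = 1e-15    # assumed 1 fF node capacitance

def required_voltage(f_hz: float) -> float:
    """Supply voltage needed for the gate to reach V_TH in a half cycle."""
    return V_TH / (1.0 - math.exp(-1.0 / (2.0 * f_hz * RC)))

for f in (2e9, 4e9, 8e9):
    v = required_voltage(f)
    p = f * C * v * v  # P = f*C*V^2, now superlinear in f
    print(f"{f/1e9:.0f} GHz: V = {v:.2f} V, P = {p*1e6:.1f} uW")
```

In this toy model, going from 2 GHz to 4 GHz more than triples the power, and 4 GHz to 8 GHz nearly quintuples it; that's the superlinear curve everyone fights against.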
Anyway, that's the general idea behind the math and it's a good way to understand how power and performance trade off for small deviations from base, but it's not going to give you a precise measurement of anything. It's a terribly complex system in there. Not only does each transistor behave differently, but every process parameter is temperature dependent (so things will begin to shift as the system heats up).
And, again, this is only the process parameters; it doesn't really say anything about ProRes performance. The macro and micro architecture will also drive logical differences in power/performance. Partitioning work between efficiency and performance cores is an obvious one. Cache sizes and strategies will impact power. Fine details that I'm not really aware of in the core architecture will clock more or less logic at higher or lower rates. "Race to sleep" means that it is often more efficient to execute small jobs quickly and start shutting down parts of the device again, rather than execute more slowly over a longer time (part of this is to power down blocks to reduce static leakage, part is because running clocks into idle blocks still burns power).

At the OS level, at least in the iOS devices but likely in the macOS systems too, they delay execution of non-critical tasks so they can power the chip up once, keep it really busy for a while, and then put everything back to sleep. The power supplies will be more or less efficient at different voltages and currents...
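Here's a toy version of the race-to-sleep arithmetic, reusing the RC voltage model from the sketches above (every number is invented; the point is only the shape of the tradeoff):

```python
import math

RC, V_TH, C = 1e-10, 1.0, 1e-15  # same toy constants as the earlier sketches
P_LEAK = 5e-6                    # assumed leakage while the block stays powered (W)
EVENTS = 2e9                     # switching events the job requires on this node

def job_energy(f_hz: float) -> float:
    v = V_TH / (1.0 - math.exp(-1.0 / (2.0 * f_hz * RC)))  # required supply
    t_on = EVENTS / f_hz               # time until the block can be gated off
    return EVENTS * C * v * v + P_LEAK * t_on  # dynamic energy + leakage

print(job_energy(2e9))  # ~7.4e-06 J: slow clock, block powered for 1.0 s
print(job_energy(4e9))  # ~6.4e-06 J: fast clock wins once leakage counts
```

The faster run burns more dynamic energy because it needs a higher voltage, but it halves the time the block sits there leaking, which is the whole trick.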
The only way to get our heads around all that complexity is to ignore most of it at the start and begin from rules of thumb: P=fCV^2
The N5P process used for M2 is quoted as giving 7% better performance or 15% lower power than the N5 process used for M1. That's not an either-or; those are probably the two extremes of "use it all for performance" versus "use it all for efficiency". I don't know the details of the process, but it comes down to this: they did something to improve the R and C parameters, and you can spend that on higher clocks, or on lower voltage at the same clock. Those numbers don't map well to the M1-to-M2 changes, so Apple obviously did more than just port the M1 design to the N5P process, which we knew.
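For what it's worth, you can back out roughly what the 15% figure implies under the P=fCV^2 rule of thumb (my own back-of-envelope reading, not TSMC's math):

```python
# If all of the process gain is spent lowering V at the same clock,
# P = f*C*V^2 says a 15% power drop corresponds to roughly:
v_scale = (1 - 0.15) ** 0.5
print(f"{(1 - v_scale) * 100:.1f}% lower supply voltage")  # ~7.8%
```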