It's better than great. It's industry-defining performance at this thermal level.
Incredible.
Intel, good luck!
> There's a sweet spot for frequency/watt since it isn't linear and more exponential.

I'm going to be that guy for a moment here and point out that it's quadratic, not exponential.
> Just a quick comment on this: microarchitecture definitely has an effect on clocks. The way you set up your transistors limits their synchronization capability (I have no idea how it works since I am not a semiconductor person, hopefully someone can explain better). Apple chose to implement more processing units and arrange them in a more complex way, so they can't go as fast, but they can do much more work per clock. Current x86 CPUs choose to do less work per clock but go very fast instead. The latter is arguably "easier" (in terms of chip design), but the power consumption will suffer.

From my limited understanding of the problem (I may be wrong, wish we had a semiconductor person around), the problem is that data is stored on the rising edge of the clock, and since the paths between transistors have non-zero lengths, if you keep upping the frequency then the signal has less and less time to travel between transistors, since the speed at which the signal travels is fixed (it's a property of the semiconductor material itself).

If the frequency is high enough, the transistors after the longest paths may be outputting the wrong value at the start of the cycle, since the signal from the previous transistor hasn't reached them yet, leading to desynchronization and a crash.

But I have no idea how close the Apple Silicon designs would be to this kind of limit. My guess is: very far. I think Apple just limits the frequency to avoid excessive power consumption since the CPUs are already plenty powerful.
> I'm not jumping at a 70% performance increase from the M1; I hope the leaked benchmark is flawed.

70% is for specific benchmarks, or an average of some sort. There will be scenarios where it will CRUSH the M1: for example, loads that take advantage of the crazy DRAM bandwidth. Note that in scenarios where you cannot fully feed all the cores, due to the way your code is written, the advantage will be less significant. If you run ProRes content it will be 10x better, and if you use the other ASICs that have double the instances in the Max, the gap will be larger than 70%.
So my suggestion is to look at your use case and see the performance there. You might be extremely happy or extremely disappointed. For example, if all you do is run CPU-only single-threaded workloads, this machine will be overkill, as it will have the same performance as the M1.

> 70% is for specific benchmarks, or an average of some sort. […]

90% Audio (Logic Pro).
> Does Logic use GPU acceleration? In Apple's CPU-only benchmarks the M1 Max scores the same as the M1 Pro. I would assume that's because the CPU alone can't saturate the 200GB/s bandwidth of the M1 Pro, but I don't know.

No, not much anyway. A few analysers do, but it's fairly insignificant.
> Letting them boost the frequency almost infinitely like Intel does just seems like a bad idea in a (not plugged-in) laptop.

From my quick search, power loss in electronics is the product of the square of the voltage used to switch a transistor and the frequency of switching it. So the relationship is linear as far as frequency is concerned.
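A quick back-of-the-envelope sketch of that scaling (all constants are invented, only the ratios matter): at a fixed voltage, power grows linearly with frequency, but if a higher clock also requires a higher voltage, the V² term bites.

```python
# Toy model of CMOS dynamic switching power: P = alpha * C * V^2 * f.
# alpha (activity factor) and C (switched capacitance) are invented here.

def dynamic_power(v_volts, f_hz, alpha=0.2, c_farads=1e-9):
    """Dynamic switching power in watts."""
    return alpha * c_farads * v_volts ** 2 * f_hz

base = dynamic_power(1.0, 3.0e9)
print(dynamic_power(1.0, 4.5e9) / base)  # 1.5x frequency at fixed V -> 1.5x power
print(dynamic_power(1.2, 4.5e9) / base)  # same 1.5x clock, but at 1.2 V -> ~2.16x power
```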
> From my limited understanding of the problem […] if you keep upping the frequency then the signal has less and less time to travel between transistors […]

This is not exactly how us chip people would describe it, and there are a few minor mistakes, but it is conceptually correct.
We rarely think about individual transistors; we're mostly concerned with gates. Gates are logic circuit elements built from at least two (in CMOS logic, anyways) transistors, and often more.
There are two categories of gate, combinatorial and sequential. Combinatorial logic gates aren't clocked. A common example is the 2-input AND gate. These have two input terminals A and B, and one output terminal Y. The gate continuously computes the logic function Y = A AND B. When A or B changes, there is a small propagation delay before the output Y changes too.
There are several types of sequential gates, but by far the most common (and the one I'll be using as shorthand for all the rest) is the D-type flipflop, or DFF. A basic DFF has a clock input C, a data input D, and a data output Q. The DFF holds Q stable at all times except when clock input C transitions from logic 0 to logic 1. At that moment in time, it copies the value present on input D to output Q.
Just as with a combinatorial gate, there is propagation delay: it takes some time to copy the value of D to Q. This is referred to as clock-to-out time. The other key DFF timing parameters are setup and hold time: for the DFF to work correctly, input D must be stable for a period of time surrounding the arrival of a rising edge on C. Stable time before the rising edge is setup time, stable time after is hold time.
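To make the setup/hold idea concrete, here's a toy check in Python. The picosecond values are invented for illustration; real numbers come from the cell library's datasheet.

```python
# Toy setup/hold check for the DFF timing terms described above.
# All picosecond values are invented for illustration.

SETUP_PS = 30  # D must be stable this long *before* the rising clock edge
HOLD_PS = 20   # ...and stay stable this long *after* it

def dff_capture_ok(d_last_change_ps, d_next_change_ps, clk_edge_ps):
    """True if D is stable across the setup/hold window around the edge."""
    return (d_last_change_ps <= clk_edge_ps - SETUP_PS and
            d_next_change_ps >= clk_edge_ps + HOLD_PS)

print(dff_capture_ok(900, 1100, clk_edge_ps=1000))  # True: comfortably stable
print(dff_capture_ok(985, 1100, clk_edge_ps=1000))  # False: setup violation
```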
For any given DFF D input, there is a "cone" of logic feeding it which can be traced back to the Q outputs of one or more DFFs. You find the worst case path in that cone. Whatever its delay is (the sum of combinatorial logic propagation delays, wire delays, the launching DFF's clock-to-out delay, and the destination DFF's setup time), that's the minimum period for the clock being sent to both DFFs. Any shorter period would result in errors.
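And the worst-case-path rule as arithmetic, a minimal sketch with invented delays:

```python
# Toy minimum-clock-period calculation for the "cone of logic" above.
# Delays are invented picosecond values, not from any real design.

CLK_TO_Q_PS = 50  # launching DFF's clock-to-out delay
SETUP_PS = 30     # destination DFF's setup time

# Each candidate path in the cone: (total gate delay, total wire delay), in ps.
paths = [(120, 80), (200, 150), (90, 60)]

min_period_ps = max(CLK_TO_Q_PS + gates + wires + SETUP_PS
                    for gates, wires in paths)
print(min_period_ps)        # 430 ps, set by the worst path (200 + 150)
print(1e3 / min_period_ps)  # ~2.33 GHz maximum clock for this cone
```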
This might seem long-winded, but I have simplified it a lot to avoid writing a book.
> But I have no idea how close the Apple Silicon designs would be to this kind of limit. My guess is: very far. I think Apple just limits the frequency to avoid excessive power consumption since the CPUs are already plenty powerful.

So this is the place where you got something conceptually wrong, paradoxically by getting something else right.
Apple does want to limit power consumption, and it is precisely because of this that we should expect there's little frequency headroom in M1. You see, when you look at the actual gate libraries designers work with, AKA cell libraries, you won't find just one 2-input AND gate. You'll find a whole family of them, each with a different tradeoff between power consumption and performance. There probably isn't much difference in propagation time between any of them, but there is a big difference in output drive strength. This matters because, particularly at advanced nodes like TSMC 5nm, timing tends to be dominated by wire delay. How do you force a given wire to change state faster? You drive it with a more powerful gate. But the more powerful the gate's output drive structure is, the more power it consumes.
Therefore, to minimize power, Apple will have used some combination of automated tooling and manual designer choice to select the versions of gates with the minimum drive strength required for each path to make its frequency target. There will only be enough timing margin for safety, because anything more than that is an opportunity to reduce power.
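Conceptually, the selection step is something like this sketch. Cell names and numbers are invented, loosely mimicking the drive-strength variants (X1/X2/X4) found in real standard-cell libraries:

```python
# Pick the weakest (lowest-power) drive strength that still meets timing.
# Variants and numbers are invented for illustration.

# (variant name, path delay at this drive strength in ps, relative power)
AND2_VARIANTS = [
    ("AND2_X1", 180, 1.0),  # weakest drive, cheapest
    ("AND2_X2", 120, 1.6),
    ("AND2_X4", 80, 2.7),   # strongest drive, most power-hungry
]

def pick_cell(delay_budget_ps):
    """Return the lowest-power variant whose path delay fits the budget."""
    for name, delay_ps, rel_power in AND2_VARIANTS:  # ordered weakest-first
        if delay_ps <= delay_budget_ps:
            return name, rel_power
    raise ValueError("no variant meets timing; the path must be redesigned")

print(pick_cell(150))  # ('AND2_X2', 1.6): X1 is too slow, X4 wastes power
```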
Very interesting, thank you for taking the time to explain.

> This is not exactly how us chip people would describe it, and there are a few minor mistakes, but it is conceptually correct. […]

Yeah, I imagined I wouldn't get the details right. Back when I studied this, I did it from a physics perspective, so we rarely thought about gates and were mostly concerned with transistors 😂 (actually, it was mostly about the physics of the p-n junction, so even transistors were a bit too 'high level' for us).
> Therefore, to minimize power, Apple will have used some combination of automated tooling and manual designer choice to select the versions of gates with the minimum drive strength required […]

That makes a lot of sense. I got a bit confused, because I knew some people overclock desktop chips a lot higher than their factory frequencies, so I thought Apple's would be no different. But I guess desktop parts are less concerned about reducing power and can use gates with more drive strength, therefore having a bigger margin for frequency increases.
> Here is another Geekbench score! The Mac is back, guys!!!
> MacBookPro18,2 - Geekbench: benchmark results for a MacBookPro18,2 with an Apple M1 Max processor (browser.geekbench.com)

I thought the M1 single-core score (in the M1 mini) was in the range of mid-1600s. So the gain is around 5-6% per core from M1 to M1X, assuming the scores in the first post are accurate. Is that right?
> Here is a whole list of results:

It's like here.
I'm not sure what the various model numbers indicate (18,2 ... 18,3 ... etc.).
> Yeah, I imagined I wouldn't get the details right. Back when I studied this, I did it from a physics perspective […]

In turn, people like me often work on this from the other side without ever thinking about junction physics. Chip design requires a lot of contributions from different specialists!
> That makes a lot of sense. […]

> While I agree that the M1 as-is likely doesn't have headroom, as you state, it sounds like there are options for scaling up in terms of clocks if Apple were willing to take the tradeoffs required to do so, without making major changes to the uarch? Or does that have knock-on effects in other areas? I'm mostly thinking of the Mac Pro: if Apple is going to be willing to up the power limits a bit anywhere, it's going to be there, but not if it means a lot of effort.

The main knob they might have to get more frequency out of the Firestorm core with no changes is boosting voltage. As both of you no doubt know, that comes with a substantial power penalty, because the power scaling effect is at least proportional to V^2 (IIRC it's worse on modern process tech, for physics reasons which I've forgotten). Exactly how much of this headroom is available, and how bad the power penalty will be, depends on how close they're already running to the maximum voltage TSMC N5/N5P can handle safely (meaning, without long-term reliability concerns).
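Rough numbers, using the same simple V² × f model from earlier in the thread (the operating points here are hypothetical, not Apple's actual V/f curve):

```python
# Rough voltage-boost power penalty under the simple P ~ V^2 * f model.
# Real silicon scales worse, as noted above. Example numbers are invented.

def power_ratio(v0, f0, v1, f1):
    """Power at operating point (v1, f1) relative to (v0, f0)."""
    return (v1 / v0) ** 2 * (f1 / f0)

# Say the core ran at 3.2 GHz at 1.0 V, and a 3.8 GHz boost needed 1.15 V:
print(power_ratio(1.0, 3.2, 1.15, 3.8))  # ~1.57x the power for ~19% more clock
```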
> …the chip is the same and the thermal envelope is what dictates the performance. You're stating it like that wasn't the intention of Apple's design.

How do we know that's not the case this time around?
> I thought the M1 single-core score (in the M1 mini) was in the range of mid-1600s. […]

No. The M1 gets mid-1700 single-core results, so there is no improvement in the M1 Pro/Max. This implies it is using the same one-year-old Firestorm & Icestorm cores as the M1, which I find disappointing.
> No. The M1 gets mid-1700 single-core results, so there is no improvement in the M1 Pro/Max. […]

They focused on the GPU, added media blocks, and improved bandwidth. These are much more important than increasing clock speed, which would raise single-core perf.
> They focused on the GPU, added media blocks, and improved bandwidth. […]

I agree, and of course the other improvements are far more significant than the single-core speed, which is already excellent.
I am happy they did the first.