There's a sweet spot for frequency per watt, since the relationship isn't linear but closer to exponential.
I'm going to be that guy for a moment here and point out that it's quadratic, not exponential. :p

Just a quick comment on this: microarchitecture definitely has an effect on clocks. The way you set up your transistors limits their synchronization capability (I have no idea how it works since I am not a semiconductor person; hopefully someone can explain it better). Apple chose to implement more processing units and arrange them in a more complex way, so they can't go as fast, but they can do much more work per clock. Current x86 CPUs choose to do less work per clock but go very fast instead. The latter is arguably "easier" (in terms of chip design), but the power consumption will suffer.
From my limited understanding of the problem (I may be wrong; I wish we had a semiconductor person around), the issue is that data is stored on the rising edge of the clock. Since the paths between transistors have non-zero length, raising the frequency gives the signal less and less time to travel between transistors, because the speed at which the signal travels is fixed (it's a property of the semiconductor material itself).

If the frequency is high enough, the transistors after the longest paths may be outputting the wrong value at the start of the cycle, since the signal from the previous transistor hasn't reached them yet, leading to desynchronization and a crash.

But I have no idea how close the Apple Silicon designs would be to this kind of limit. My guess is: very far. I think Apple just limits the frequency to avoid excessive power consumption since the CPUs are already plenty powerful. IMHO this is how all laptop CPUs should work. Otherwise you end up boosting the frequency too much and getting diminishing returns for a massive amount of power used. Letting them boost the frequency almost infinitely like Intel does just seems like a bad idea in a (not plugged-in) laptop.
 
I'm not jumping at a 70% performance increase over the M1; I hope the leaked benchmark is flawed.
The 70% figure is for specific benchmarks, or an average of some sort; there will be scenarios where it will CRUSH the M1, for example loads that take advantage of the crazy DRAM bandwidth. You should note that in scenarios where you cannot fully feed all the cores, due to the way your code is written, the advantage will be less significant. If you run ProRes content it will be 10x better, and if you use the other ASICs that got double the instances in the Max, the gain will be larger than 70%.

So my suggestion is to look at your use case and see the performance there; you might be extremely happy or extremely disappointed. For example, if all you do is run CPU-only single-threaded workloads, this machine will be overkill, as it will have the same performance as the M1.
 
So my suggestion is to look at your use case and see the performance there; you might be extremely happy or extremely disappointed.
90% audio (Logic Pro).

There are a few "Logic" benchmarks floating around; I'd hope for about 2x the track count compared to current benchmarks.
I'm aiming for the low-tier M1 Max, because I think the €200 price hike over the M1 Pro is worth it for the memory bandwidth.

But a lot of what made the M1 good for audio is also down to its crazy single-core speed.
 
Does Logic use GPU acceleration? In Apple's CPU only benchmarks the M1 Max scores the same as the M1 Pro. I would assume that's because the CPU alone can't saturate the 200GB/s bandwidth of the M1 Pro, but I don't know.
 
Does Logic use GPU acceleration? In Apple's CPU only benchmarks the M1 Max scores the same as the M1 Pro. I would assume that's because the CPU alone can't saturate the 200GB/s bandwidth of the M1 Pro, but I don't know.
No, not much anyway. A few analysers do, but it's fairly insignificant.

I guess the M1 Max is pointless for me - unless GTA V is finally ported to the M1.
 
Letting them boost the frequency almost infinitely like Intel does just seems like a bad idea in a (not plugged-in) laptop.
From my quick search, power loss in electronics is the product of the square of the voltage used to switch the transistor and the frequency of switching that transistor. So the relationship is linear as far as frequency is concerned.

So if one core is dissipating 5W at 3.2GHz, pushing the frequency to, say, 3.5GHz would mean the power goes to about 5.47W, assuming everything else is equal. But from what I understand, going faster means you have to use a higher voltage for stability reasons, so the voltage-squared term makes power climb much faster than linearly.

I think Apple will wait for the node shrink before going up in frequency, as smaller nodes mean lower voltages driving the transistors.
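
To make the arithmetic concrete, here is a minimal Python sketch of that relationship (the standard CMOS dynamic-power approximation, P ≈ C·V²·f). The voltage and capacitance values are assumed purely for illustration; only the 5 W at 3.2 GHz starting point comes from the post above.

```python
# Toy illustration of CMOS dynamic power scaling, P ≈ C_eff * V^2 * f.
# Only the "5 W per core at 3.2 GHz" starting point comes from the post above;
# the voltage and effective capacitance are assumed values for illustration.

def dynamic_power(c_eff, voltage, freq_hz):
    """Dynamic switching power in watts: P = C_eff * V^2 * f."""
    return c_eff * voltage ** 2 * freq_hz

BASE_FREQ = 3.2e9      # Hz
BASE_POWER = 5.0       # W per core (figure quoted above)
BASE_VOLTAGE = 0.9     # V, assumed for the example
# Back out an effective switched capacitance that reproduces the 5 W figure.
C_EFF = BASE_POWER / (BASE_VOLTAGE ** 2 * BASE_FREQ)

# Frequency-only scaling is linear in f: 5 W * 3.5/3.2 ≈ 5.47 W.
print(f"3.5 GHz, same voltage: {dynamic_power(C_EFF, BASE_VOLTAGE, 3.5e9):.2f} W")

# If the higher clock also needs a voltage bump (say +8%), power grows much faster.
print(f"3.5 GHz, +8% voltage:  {dynamic_power(C_EFF, BASE_VOLTAGE * 1.08, 3.5e9):.2f} W")
```

Scaling frequency alone is linear, while any accompanying voltage bump multiplies power by the square of the voltage ratio.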
 
I hope Apple leaves a bag of cash outside Take-Two's doorstep; getting GTA would be a BIG win for the platform, and Battlefield would also be big. It's also time to make peace with Valve: getting a native Steam client is a must to start getting traction. The only disappointment in the whole show was the lack of a gaming segment; a kick-ass 120Hz 3.5K 10-bit mini-LED display would've been amazing for showcasing some games.

The $70 million spent on the movie Greyhound for ATV+ would net you around 12-14 AAA ports to the Mac, and in my personal opinion would generate more buzz and more value for Apple's platform as a whole.
And it doesn't even have to be an either/or choice between them; Apple makes so much money overall that I find it weird not to see them push on that front now that they have the HW to support such a push.
 
From my limited understanding of the problem (I may be wrong; I wish we had a semiconductor person around), the issue is that data is stored on the rising edge of the clock. Since the paths between transistors have non-zero length, raising the frequency gives the signal less and less time to travel between transistors, because the speed at which the signal travels is fixed (it's a property of the semiconductor material itself).

If the frequency is high enough, the transistors after the longest paths may be outputting the wrong value at the start of the cycle, since the signal from the previous transistor hasn't reached them yet, leading to desynchronization and a crash.
This is not exactly how us chip people would describe it, and there are a few minor mistakes, but it is conceptually correct.

We rarely think about individual transistors; we're mostly concerned with gates. Gates are logic circuit elements built from at least two (in CMOS logic, anyways) transistors, and often more.

There are two categories of gate, combinatorial and sequential. Combinatorial logic gates aren't clocked. A common example is the 2-input AND gate. These have two input terminals A and B, and one output terminal Y. The gate continuously computes the logic function Y = A AND B. When A or B changes, there is a small propagation delay before the output Y changes too.

There are several types of sequential gates, but by far the most common (and the one I'll be using as shorthand for all the rest) is the D-type flipflop, or DFF. A basic DFF has a clock input C, a data input D, and a data output Q. The DFF holds Q stable at all times except when clock input C transitions from logic 0 to logic 1. At that moment in time, it copies the value present on input D to output Q.

Just as with a combinatorial gate, there is propagation delay: it takes some time to copy the value of D to Q. This is referred to as clock-to-out time. The other key DFF timing parameters are setup and hold time: for the DFF to work correctly, input D must be stable for a period of time surrounding the arrival of a rising edge on C. Stable time before the rising edge is setup time, stable time after is hold time.

For any given DFF D input, there is a "cone" of logic feeding it which can be traced back to the Q outputs of one or more DFFs. You find the worst case path in that cone. Whatever its delay is (the sum of combinatorial logic propagation delays, wire delays, the launching DFF's clock-to-out delay, and the destination DFF's setup time), that's the minimum period for the clock being sent to both DFFs. Any shorter period would result in errors.

This might seem long winded but I have simplified it a lot to avoid writing a book. :)
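
To make that arithmetic concrete, here is a minimal Python sketch of the worst-case-path sum described above. All the delays are invented placeholders; a real static timing analyzer evaluates millions of such register-to-register paths extracted from the netlist.

```python
# Toy version of the critical-path calculation described above. Every delay
# here is an invented placeholder (in nanoseconds); a real static timing
# analyzer evaluates millions of register-to-register paths from the netlist.

def min_clock_period(paths):
    """Each path is (launch DFF clock-to-out, [combinational/wire delays], capture DFF setup).
    The clock period must be at least as long as the slowest path, end to end."""
    return max(clk_to_out + sum(delays) + setup for clk_to_out, delays, setup in paths)

paths = [
    (0.05, [0.08, 0.12, 0.06], 0.04),        # 0.35 ns
    (0.05, [0.10, 0.15, 0.09, 0.07], 0.04),  # 0.50 ns  <- critical path
    (0.05, [0.05, 0.05], 0.04),              # 0.19 ns
]

t_min_ns = min_clock_period(paths)
f_max_ghz = 1.0 / t_min_ns  # a 1 ns period corresponds to 1 GHz
print(f"Minimum clock period: {t_min_ns:.2f} ns -> Fmax ≈ {f_max_ghz:.2f} GHz")
```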

But I have no idea how close the Apple Silicon designs would be to this kind of limit. My guess is: very far. I think Apple just limits the frequency to avoid excessive power consumption since the CPUs are already plenty powerful.
So this is the place where you got something conceptually wrong, paradoxically by getting something else right.

Apple does want to limit power consumption, and it is precisely because of this that we should expect there's little frequency headroom in M1. You see, when you look at the actual gate libraries designers work with, AKA cell libraries, you won't find just one 2-input AND gate. You'll find a whole family of them, each with a different tradeoff between power consumption and performance. There probably isn't much difference in propagation time between any of them, but there is a big difference in output drive strength. This matters because, particularly at advanced nodes like TSMC 5nm, timing tends to be dominated by wire delay. How do you force a given wire to change state faster? You drive it with a more powerful gate. But the more powerful the gate's output drive structure is, the more power it consumes.

Therefore, to minimize power, Apple will have used some combination of automated tooling and manual designer choice to select the versions of gates with the minimum drive strength required for each path to make its frequency target. There will only be enough timing margin for safety, because anything more than that is an opportunity to reduce power.
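
To make that idea concrete, here is a hypothetical Python sketch of picking the weakest cell variant that still meets a path's timing budget. The cell names, delays, and power numbers are all invented; real flows do this across the whole design with EDA tooling, not a hand-written loop.

```python
# Hypothetical sketch of choosing the weakest (lowest-power) cell variant that
# still meets a path's timing budget. Cell names, delays, and power numbers are
# all invented; real flows do this across the whole design with EDA tooling.

# (variant name, delay in ns when driving this particular wire, relative power)
AND2_VARIANTS = [
    ("AND2_X1", 0.20, 1.0),   # weakest drive, cheapest
    ("AND2_X2", 0.14, 1.8),
    ("AND2_X4", 0.10, 3.2),   # strongest drive, most power
]

def pick_cell(timing_budget_ns):
    """Return the lowest-power variant whose delay fits the budget, or None."""
    feasible = [v for v in AND2_VARIANTS if v[1] <= timing_budget_ns]
    return min(feasible, key=lambda v: v[2]) if feasible else None

for budget in (0.25, 0.15, 0.11, 0.09):
    choice = pick_cell(budget)
    print(f"budget {budget:.2f} ns -> {choice[0] if choice else 'no variant meets timing'}")
```

Relaxed paths get the cheap variant; only the tight ones pay for the strong (power-hungry) drive, which is why a design tuned this way has little frequency headroom left over.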
 
This is not exactly how us chip people would describe it, and there are a few minor mistakes, but it is conceptually correct.
Very interesting, thank you for taking time to explain.
 
This is not exactly how us chip people would describe it, and there are a few minor mistakes, but it is conceptually correct.

We rarely think about individual transistors; we're mostly concerned with gates.
Yeah, I imagined I wouldn't get the details right. Back when I studied this, I did it from a physics perspective, so we rarely thought about gates, and were mostly concerned with transistors 😂 (actually, it was mostly about physics of the p-n junction, so even transistors were a bit too 'high level' for us).

Therefore, to minimize power, Apple will have used some combination of automated tooling and manual designer choice to select the versions of gates with the minimum drive strength required for each path to make its frequency target. There will only be enough timing margin for safety, because anything more than that is an opportunity to reduce power.
That makes a lot of sense. I got a bit confused, because I knew some people overclock desktop chips a lot higher than their factory frequency, so I thought Apple's would be no different. But I guess desktop parts are less concerned with reducing power and can use gates with more drive strength, therefore leaving a bigger margin for frequency increases.

Either way, very interesting read. Thank you for the explanation.
 
This is not exactly how us chip people would describe it, and there are a few minor mistakes, but it is conceptually correct.

This was incredibly informative and interesting. Thank you!
 
Therefore, to minimize power, Apple will have used some combination of automated tooling and manual designer choice to select the versions of gates with the minimum drive strength required for each path to make its frequency target. There will only be enough timing margin for safety, because anything more than that is an opportunity to reduce power.

Always good to get a refresher on my college courses, thanks for sharing.

But it does lead me to one question here: While I agree that the M1 as-is likely doesn’t have headroom as you state, it sounds like there are options for scaling up in terms of clocks if Apple was willing to take the tradeoffs required to do so, without making major changes to the uarch? Or does that have knock on effects in other areas? Mostly just thinking of the Mac Pro, where if Apple is going to be willing to up the power limits a bit anywhere, it’s going to be there, but not if it means a lot of effort.

That said, they might not care to up the power limits on the M1 cores at all, considering the M1 Pro already eats away at the bottom of the Mac Pro CPU performance scale as it is.
 
This is not exactly how us chip people would describe it, and there are a few minor mistakes, but it is conceptually correct.

Wow, thank you! Very well written and explained.
 
Yeah, I imagined I wouldn't get the details right. Back when I studied this, I did it from a physics perspective, so we rarely thought about gates, and were mostly concerned with transistors 😂 (actually, it was mostly about physics of the p-n junction, so even transistors were a bit too 'high level' for us).
In turn, people like me often work on this from the other side without ever thinking about junction physics. Chip design requires a lot of contributions from different specialists!

That makes a lot of sense. I got a bit confused, because I knew some people overclock desktop chips a lot higher than their factory frequency, so I thought Apple's would be no different. But I guess desktop parts are less concerned with reducing power and can use gates with more drive strength, therefore leaving a bigger margin for frequency increases.
While I agree that the M1 as-is likely doesn’t have headroom as you state, it sounds like there are options for scaling up in terms of clocks if Apple was willing to take the tradeoffs required to do so, without making major changes to the uarch? Or does that have knock on effects in other areas? Mostly just thinking of the Mac Pro, where if Apple is going to be willing to up the power limits a bit anywhere, it’s going to be there, but not if it means a lot of effort.
The main knob they might have to get more frequency out of the Firestorm core with no changes is boosting voltage. As both of you no doubt know, that comes with a substantial power penalty because the power scaling effect is at least proportional to V^2 (iirc it's worse on modern process tech, for physics reasons which I've forgotten). Exactly how much of this headroom is available, and how bad the power penalty will be, depends on how close they're already running to the maximum voltage TSMC N5/N5P can handle safely (meaning, without long term reliability concerns).

It's hard to say how much headroom there might be for a revision pass on physical design as Krevnik suggests. There's entire other universes of design tradeoffs lurking there. One of them is simply that Apple probably optimized for as few pipeline stages as possible to meet their frequency targets, because minimizing that metric reduces power and improves performance. That means there might be some really long critical paths in the design where they already pulled out all the stops on using the fastest gates in the cell library. If there's only a few places you must do it, it's OK to do so even in a low power design.

But I honestly don't expect it. It costs a lot to do such a revision, and Firestorm should be fine as-is for a big machine.

The thing you have to look at is what the competition is doing. Intel's current biggest 28-core Xeon has a max turbo frequency of 4.30 GHz, and a base frequency (guaranteed minimum if all 28 cores are at 100% load) of just 2.90 GHz. These are the same Intel core designs which can hit 5.0 GHz in their 8/10-core desktop SKUs, but they must rein them in a bit to jam 28 of them into a single die or package, even with a 250W TDP.

Apple's Firestorm core is in a pretty good place, there. Its single thread performance is competitive with 5 GHz Intel cores, but they only need about 5W to get there. Apple should be able to run 16 or 32 of them at full 3.2 GHz frequency in a single package. A workstation whose CPU cores stay locked at Fmax no matter how many cores you load up should be very compelling, even if the peak single thread performance isn't quite as good as the competition. That's something Apple can build with zero revisions to Firestorm (or perhaps Avalanche?), and no voltage boosting.
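
As a back-of-envelope check of that argument, here is a tiny Python sketch using only the figures quoted in this thread (roughly 5 W per Firestorm core at 3.2 GHz and the 250 W Xeon TDP); uncore, fabric, and memory power are ignored.

```python
# Back-of-envelope package power for an all-cores-at-Fmax design, using only
# the figures quoted in this thread: ~5 W per Firestorm core at 3.2 GHz, and a
# 250 W TDP for the 28-core Xeon. Uncore, fabric, and memory power are ignored.

PER_CORE_W = 5.0
XEON_TDP_W = 250.0

for cores in (16, 32):
    total_w = cores * PER_CORE_W
    print(f"{cores} cores at Fmax: ~{total_w:.0f} W "
          f"({total_w / XEON_TDP_W:.0%} of the Xeon's 250 W TDP)")
```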
 
I thought the M1 single-core score (in the M1 mini) was in the range of the mid-1600s. So the gain is around 5-6% per core from the M1 to the M1x, assuming the scores in the first post are accurate; is that right?
No. The M1 gets mid 1700 single core results, so there is no improvement in the M1 Pro/Max. This implies it is using the same 1-year-old Firestorm & Ice Storm cores as the M1, which I find disappointing.
 
No. The M1 gets mid 1700 single core results, so there is no improvement in the M1 Pro/Max. This implies it is using the same 1-year-old Firestorm & Ice Storm cores as the M1, which I find disappointing.
They focused on the GPU, added media blocks, and improved bandwidth. These are much more important than increasing the clock speed, which would increase single-core perf.

I am happy they did the former.
 
It makes more sense to scale out with more cores than to scale up in frequency; plus, the CPU has taken a back seat to the GPU for compute for at least the last decade. I remember first doing GPU compute with a 9800 GX2, which was over a decade ago.
 
They focused on the GPU, added media blocks, and improved bandwidth. These are much more important than increasing the clock speed, which would increase single-core perf.

I am happy they did the former.
I agree, and of course the other improvements are far more significant than the single-core speed, which is already excellent.

I'm just being greedy… or maybe desperately looking for a reason to avoid being tempted to buy one :) These are very impressive machines, but I don't actually need one for my work or recreation, and I suspect the same will apply to the majority of buyers.
 
Next year this SoC will be in the next iMac... so for the first time, Apple would have to say that their SoC is slower than, or equal to, the top-of-the-line (2-year-old) GPU it replaces. Until now, Apple could always claim something like "2x faster", "800 faster", etc. compared with the device the new one replaces.
 