
JMacHack

Suspended
Mar 16, 2017
1,965
2,424
Thanks for the links. To be sure, Apple is beautifully situated for expanding to large core counts. And I understand increasing single-core performance is a challenge for all chip makers. But since most programs remain single-threaded, until chips get fast enough for these programs to respond with no noticeable CPU-bound delay, increased single-core performance will continue to improve the user experience.* And this would probably improve the experience for more users than those who need high core counts.

*I'd say there are two cutoffs for the maximum amount of time you'd want a program to take to complete routine, repeated tasks. One is the cutoff for "no perceptible delay". The other is "perceptible, but not irritating". I don't know exactly what either of these would be, but I'd guess the former is maybe <4 ms, and the latter ~<100 ms, depending on the type of work you're doing (obviously you'd want much less latency for gaming, but I'm thinking about productivity work).

So it's a conundrum. Chip makers can no longer generationally improve single-core performance the way they could during the sub-GHz days, so most of the improvement has been in multi-core performance, through scaling to higher core counts. Yet, with a few key exceptions (audio/photography/video apps), software makers seem unable or unwilling to parallelize their programs to take advantage of this.

I did not know this. What are the frequencies it uses when running software that can make use of all cores?
Not that I’m smart enough to understand this, but I think ASi is in a better position than x86 chips because of the parallelism problem. If the SoC is more efficient at retrieving and moving data, then it could run single-threaded tasks without entering nuclear-furnace territory.

Knowing that, I do want to know how well ASi scales with clock speed. IIRC all the M1 cores run at 3.2 GHz(?), while comparable-performance x86 chips run closer to 4.5-5 GHz.

I assume the m series has clocks locked, but I’m curious to know if there’s any barriers to clock speed on the m series. Will they become unstable at higher clocks or immediately go past the efficiency curve?

And for parallelism, I’m more excited to be able to run multiple programs at the same time with little to no slowdown. That’s more appealing than running a single program faster.
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
So, IR drop is an effect of great concern in CPU designs, caused by resistance on the on-chip power grid. If you run a core faster, it increases the current, because you have to charge and discharge the FET capacitances (and any wires that are switching), and current is, by definition, the movement of charge.

Since resistance on the grid is constant, increasing the current increases the voltage that is dropped across the power grid (Ohm’s law: V = IR).

The result is that the transistors see a lower voltage difference across gate and source/drain, which slows them down (because the switching speed of a transistor is affected by the voltage).
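A back-of-the-envelope sketch of that chain of effects, with purely hypothetical numbers for the switched capacitance and grid resistance (real chips are far more complex, with distributed grids and localized droop):

```python
# Illustrative IR-drop estimate with made-up numbers (not real chip data).
# Average switching current drawn by a block: I = C * V * f
# Voltage lost across the power grid:         V_drop = I * R_grid

C = 2e-9        # switched capacitance, farads (hypothetical)
V = 1.0         # nominal supply voltage, volts
R_grid = 0.005  # effective power-grid resistance, ohms (hypothetical)

for f in (3.0e9, 3.6e9):                 # clock frequencies, Hz
    I = C * V * f                        # average switching current, amps
    V_drop = I * R_grid                  # IR drop across the grid
    V_seen = V - V_drop                  # voltage the transistors actually see
    print(f"{f/1e9:.1f} GHz: I = {I:.1f} A, drop = {V_drop*1000:.0f} mV, "
          f"transistors see {V_seen:.3f} V")
```

With these invented numbers, the 20% clock bump alone adds 6 mV of extra droop; the point is only the direction of the effect, not the magnitudes.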

Informative post?

:)
 

JMacHack

Suspended
Mar 16, 2017
1,965
2,424
So, IR drop is an effect of great concern in CPU designs, caused by resistance on the on-chip power grid. If you run a core faster, it increases the current, because you have to charge and discharge the FET capacitances (and any wires that are switching), and current is, by definition, the movement of charge.

Since resistance on the grid is constant, increasing the current increases the voltage that is dropped across the power grid (Ohm’s law: V = IR).

The result is that the transistors see a lower voltage difference across gate and source/drain, which slows them down (because the switching speed of a transistor is affected by the voltage).

Informative post?

:)
Very, but if I were smart enough to understand it, I wouldn’t make my living with pretty shapes and colors.
 
  • Like
Reactions: cmaier

bobcomer

macrumors 601
May 18, 2015
4,949
3,699
Informative post?
Very!

I'm curious: I take it that's why Intel is less heat- and power-efficient, because they're forcing the issue with more current to get a higher clock? And Apple would have the same problem if they did the same thing? (Say, for a Mac Pro-level processor.)

Would alternating P and E cores make any difference? (other than to screw up timing?)
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Very!

I'm curious, I take it that's why Intel is less heat and power efficient, they're forcing the issue with more current to get a higher clock? And Apple would have the same problem if they did the same thing? (say, for a Mac Pro level processor)

Would alternating P and E cores make any difference? (other than to screw up timing?)

Intel's low efficiency is due to architectural issues caused by x86, in combination with having to run at a higher clock rate to compensate (a higher clock rate does require more current, and power consumption is proportional to the capacitance charged/discharged, times voltage squared, times clock frequency).
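To make that capacitance-times-voltage-squared-times-frequency relationship concrete, a toy calculation (all numbers hypothetical, chosen only to show the scaling):

```python
# Dynamic power: P = alpha * C * V^2 * f (alpha = activity factor).
# All numbers below are hypothetical, for illustration only.

def dynamic_power(C, V, f, alpha=1.0):
    """Switching power in watts."""
    return alpha * C * V**2 * f

base = dynamic_power(C=2e-9, V=0.9, f=3.0e9)   # ~4.9 W
fast = dynamic_power(C=2e-9, V=1.0, f=3.6e9)   # ~7.2 W

# A 20% clock bump that also needs ~11% more voltage costs ~48% more power:
print(f"{base:.2f} W -> {fast:.2f} W ({fast/base - 1:.0%} increase)")
```

Because voltage enters squared, chasing frequency (which usually requires more voltage) hits power twice over, which is the heart of the efficiency gap being described.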

Apple won’t have that problem because the pipeline for Arm needs far fewer stages, and the decoders can be much simpler.

Alternating cores is more of a local-hot-spot solution than a voltage drop solution.

I will put together a post explaining in more detail at talkedabout.com when I get a chance.
 

bobcomer

macrumors 601
May 18, 2015
4,949
3,699
Intel‘s low efficiency is due to architectural issues caused by x86 in combination with having to run at a higher clock rate to compensate (higher clock rate does require more current, and power consumption is proportional to capacitance charged/discharged times voltage squared times clock frequency. )

Apple won’t have that problem because pipeline for Arm needs many fewer stages, and decoders can be much simpler.

Alternating cores is more of a local-hot-spot solution than a voltage drop solution.

I will put together a post explaining in more detail at talkedabout.com when I get a chance.
Thanks!
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Intel‘s low efficiency is due to architectural issues caused by x86 in combination with having to run at a higher clock rate to compensate (higher clock rate does require more current, and power consumption is proportional to capacitance charged/discharged times voltage squared times clock frequency. )

Apple won’t have that problem because pipeline for Arm needs many fewer stages, and decoders can be much simpler.

Alternating cores is more of a local-hot-spot solution than a voltage drop solution.

I will put together a post explaining in more detail at talkedabout.com when I get a chance.

I have a thread started here, where I will eventually go into a lot more detail on the Arm/M1 vs x86 issue:

 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
Here it is for the iPhone:


The regular M1 is similar:


I’m not sure about the Pro/Max, but I think it’s the same. If so, it’s 3 GHz when running all P-cores (obviously less for E-cores), with a ~7% boost to 3.2 GHz when running a single thread.
I cannot confirm Hector's statements in that twitter thread. I have an M1 Air, and whenever I've tried investigating its frequency limits with powermetrics and CPU-eater tasks (the same thing he suggested), I see all four P cores jump to 3.204 GHz and stay there until the P-core temps rise into roughly the 90-100 °C range. After that, Apple's DVFS control loop starts responding to the high temperature and uses frequency and voltage as the control variables to regulate temperature.

This is true whether I use my usual CPU eater ("yes >/dev/null"), Hector Martin's ("cat /dev/zero >/dev/null"), or POVRay ("povray +WTn -benchmark", with n being the number of worker threads you want). The only difference between any of these is how much power per core they consume.

As an aside: The unlikely champion of that fight is 'yes'. One thread of 'yes' translates to about 5.7 W in the P cluster, and with the enhanced powermetrics in Monterey you can see that the average P-cluster IPC is about 6.1! That's a very, very high IPC number. A single thread of 'cat' only manages 4.2 W and 3.5 IPC, and a single thread of 'povray' 3.9 W and almost 4 IPC. I have no idea why 'yes' is an M1 power virus.

Anyways, I wonder if Hector was taking care to start all threads in his multi-thread tests simultaneously. With the Air, doing that and starting with the CPU cold is important when trying to observe 4-thread behavior.
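Taking those reported figures at face value, here's what they imply per instruction (the wattage and IPC numbers are from the post above; the assumption that the cores stay pinned at 3.204 GHz, and the arithmetic, are mine):

```python
# Energy per instruction implied by the reported powermetrics figures.
# Assumes the P core stays at 3.204 GHz for the whole measurement.

F_GHZ = 3.204  # reported P-core frequency

workloads = {          # (watts in P cluster, average IPC), as reported above
    "yes":    (5.7, 6.1),
    "cat":    (4.2, 3.5),
    "povray": (3.9, 4.0),
}

for name, (watts, ipc) in workloads.items():
    instr_per_sec = ipc * F_GHZ * 1e9           # instructions retired per second
    nj_per_instr = watts / instr_per_sec * 1e9  # energy per instruction, nJ
    print(f"{name:7s}: {instr_per_sec/1e9:5.1f} G instr/s, "
          f"{nj_per_instr:.2f} nJ/instruction")
```

Amusingly, by this arithmetic the 'yes' "power virus" is also the most energy-efficient of the three per instruction; it just retires so many instructions per second that the total cluster power is highest.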
 
  • Like
Reactions: altaic and cmaier

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,229
I cannot confirm Hector's statements in that twitter thread. I have a M1 Air, and whenever I've tried investigating its frequency limits with powermetrics and CPU eater tasks (same thing he suggested), I see all four P cores jump to 3.204 GHz and stay there until the P core temps rise into roughly the 90-100C range. After that, Apple's DVFS control loop starts responding to the high temperature and uses frequency and voltage as the control variable to regulate temperature.

This is true whether I use my usual CPU eater ("yes >/dev/null"), Hector Martin's ("cat /dev/zero >/dev/null"), or POVRay ("povray +WTn -benchmark", with n being the number of worker threads you want). The only difference between any of these is how much power per core they consume.

As an aside: The unlikely champion of that fight is 'yes'. One thread of 'yes' translates to about 5.7W in the P cluster, and with the enhanced powermetrics in Monterey you can see that the average P cluster IPC is about 6.1! That's a very, very high IPC number. A single thread of 'cat' only manages 4.2W and 3.5 IPC, and a single thread of 'povray' 3.9W and almost 4 IPC. I have no idea why 'yes' is a M1 power virus.

Anyways, I wonder if Hector was taking care to start all threads in his multi-thread tests simultaneously. With the Air, doing that and starting with the CPU cold is important when trying to observe 4-thread behavior.

Interesting! Are you looking at the per-core or cluster line?


Edit:

Unless I’m misunderstanding, Andrei wrote the same for the Pro/Max:

 
Last edited:

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
Interesting! Are you looking at the per-core or cluster line?


Edit:

Unless I’m misunderstanding Andrei wrote the same for the Pro/Max:

I was indeed looking at the per-core reporting, trusting "residency" to mean "this is the frequency bin this core was actually in". Interesting!
 
  • Like
Reactions: crazy dave

theorist9

macrumors 68040
May 28, 2015
3,881
3,060
I’m only a little bit back :). Most of my CPU posts will be at another place, but I will poke my head in here now and again to discuss ohm’s law and thermodynamics :)
Since I teach thermo, I'm all for that.

How on earth did you manage to get yourself banned (or were they kidding?).

Is there, at least in principle, room for Apple to reoptimize its architecture for higher single-core performance (for the same process size)? E.g., if they could get a 20% increase in performance for a 30% increase in wattage, that might be a good tradeoff for a desktop machine, which is not constrained by battery life, and would presumably have better thermals than their mobile devices.

Of course, if their focus is on scaling horizontally, this might not be of interest to them. That tradeoff would make more sense for an 8-10 perf. core prosumer desktop (e.g., the large iMac) than a 20+ perf. core workstation, and they may not want to invest the added dev costs to make an additional core design (preferring to instead share the same core design across all their prosumer/pro devices, differentiating them only by core count).

[If you'd prefer, I can repost this -- less the banned question -- at your other site.]
 
Last edited:

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Since I teach thermo, I'm all for that.

How on earth did you manage to get yourself banned (or were they kidding?).

Is there, at least in principle, room for Apple to reoptimize its architecture for higher single-core performance (for the same process size)? E.g., if they could get a 20% increase in performance for a 30% increase in wattage, that might be a good tradeoff for a desktop machine, which is not constrained by battery life, and would presumably have better thermals than their mobile devices.

Of course, if their focus is on scaling horizontally, this might not be of interest to them. That tradeoff would make more sense for an 8-10 perf. core prosumer desktop (e.g., the large iMac) than a 20+ perf. core workstation, and they may not want to invest the added dev costs to make an additional core design (preferring to instead share the same core design across all their prosumer/pro devices, differentiating them only by core count)

I assume that they eked out close to the maximum single core performance they could get given the process and reasonable power budget. That is, they designed for the knee in the power/performance curve, so that raising the performance any more would increase power consumption by much more. If they wanted to improve single core at the expense of multicore, they could remove cores or spread the cores out, so that local heating isn’t as much of an issue, and then raise the clock frequency, of course. I think A15 is an interesting signpost. Next generation core, running on essentially the same process, with performance increase in single core that is not all that much more than the difference in process gets them. (I think A15 cores give them more headroom, though, and may allow higher clock speeds in M2).
 
  • Like
Reactions: theorist9

huge_apple_fangirl

macrumors 6502a
Aug 1, 2019
769
1,301
I’m only a little bit back :). Most of my CPU posts will be at another place, but I will poke my head in here now and again to discuss ohm’s law and thermodynamics :)
Only a little bit? Well, it’s better than nothing. Great to see you back though!
 

theorist9

macrumors 68040
May 28, 2015
3,881
3,060
I assume that they eked out close to the maximum single core performance they could get given the process and reasonable power budget. That is, they designed for the knee in the power/performance curve, so that raising the performance any more would increase power consumption by much more. If they wanted to improve single core at the expense of multicore, they could remove cores or spread the cores out, so that local heating isn’t as much of an issue, and then raise the clock frequency, of course. I think A15 is an interesting signpost. Next generation core, running on essentially the same process, with performance increase in single core that is not all that much more than the difference in process gets them. (I think A15 cores give them more headroom, though, and may allow higher clock speeds in M2).
That all makes sense. But what I'm asking is whether, with a different version of their current architecture (still keeping to all its essential components, including unified memory, various coprocessors, etc.), they could have (at the expense of some loss of efficiency) created a design whose knee was at a higher clock (assuming the same process size).

Your observation about the A15 suggests this might be possible—though for the actual A15 they chose not to (staying with an equivalent knee after renormalizing for process size), likely to continue to maximize efficiency.
 

tomO2013

macrumors member
Feb 11, 2020
67
102
Canada
Just wanted to say publicly that it is so good to see you back on these forums @cmaier . Your contributions were missed around here. :)

I have to say that I find Apple’s approach to performance, and in particular their solution of targeted single-core, single-threaded performance improvements via co-processors and accelerators, to be a very exciting one and the future!

I could envisage M2 being a mixture of the A15’s improved E-cores, A14/15 P-cores, as well as additional silicon budget invested in video encode/decode, AMX, <insert new co-processor accelerator that targets a particular and very specific type of desktop workload>.
 
  • Love
  • Like
Reactions: JMacHack and cmaier

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
That all makes sense. But what I'm asking is whether, with a different version of their current architecture (still keeping to all its essential components, including unified memory, various coprocessors, etc.), they could have (at the expense of some loss of efficiency) created a design whose knee was at a higher clock (assuming the same process size).

Your observation about the A15 suggests this might be possible—though for the actual A15 they chose not to (staying with an equivalent knee after renormalizing for process size), likely to continue to maximize efficiency.

They could have picked a point to the right of the knee, for sure. For example, trade off 30% more power for 15% more performance. Tough to say whether they could have shifted the knee point itself much more to the right, though. Maybe by like 10% or 20%.

Edit: btw, by “knee,” I mean the highest point where you still get a 1:1 performance-vs-power increase. Beyond that, you pay a larger percent power increase for a given percent performance increase.
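A toy model of that knee: assume a hypothetical minimum stable voltage below which frequency can scale with no voltage increase (so power tracks performance 1:1), after which voltage must rise with frequency. All constants here are invented for illustration:

```python
# Toy model of the power/performance "knee". Below F_KNEE the chip runs at
# its minimum voltage, so power scales 1:1 with frequency (performance);
# above it, voltage must rise with frequency and power grows much faster.
# All constants are hypothetical.

C = 2e-9          # switched capacitance, F
V_MIN = 0.8       # minimum stable voltage, V
F_KNEE = 3.0e9    # highest frequency reachable at V_MIN, Hz
K = 0.1 / 0.5e9   # extra volts needed per extra Hz above the knee

def voltage(f):
    return V_MIN if f <= F_KNEE else V_MIN + K * (f - F_KNEE)

def power(f):
    return C * voltage(f)**2 * f   # P = C * V^2 * f

p_knee = power(F_KNEE)
for f in (3.0e9, 3.3e9, 3.6e9):
    perf_gain = f / F_KNEE - 1
    power_gain = power(f) / p_knee - 1
    print(f"{f/1e9:.1f} GHz: +{perf_gain:.0%} perf for +{power_gain:.0%} power")
```

In this made-up curve, +10% performance past the knee costs about +27% power, and +20% costs about +59%, which is the shape of tradeoff being described (for example, the "30% more power for 15% more performance" point above would sit to the right of the knee).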
 
  • Like
Reactions: theorist9

theorist9

macrumors 68040
May 28, 2015
3,881
3,060
So since this seems to be the thread for more technical questions: You'll recall there was a lot of discussion about whether Apple would use HBM2 for the AS Mac Pro.

Does what we've seen with LPDDR5 in the MBPs tell us HBM2 wouldn't be needed? And even if there would be a benefit, would too many architectural changes be needed to switch to HBM2 (i.e., would it be qualitatively different from the relatively easy switch they made from LPDDR4 in the M1 to LPDDR5 in the M1 Pro/Max)?
 

iPadified

macrumors 68020
Apr 25, 2017
2,014
2,257
I have to say that I find Apple’s approach to performance and in particular their solution to single core, single threaded, targeted performance improvments via co-processors and accelerators to be a very exciting one and the future!

I could envisage M2 being a mixture of the A15 improved e-cores , A14/15 p-cores as well as additional silicon budget invested in video encode/decode, AMX, <insert new co-processor accelerator that targets a particular and very specific type of desktop workload>.
That's what so many missed. Next step: a ray tracer (please). I once saw a video (can't find it now) where a 10 W chip outperformed a monster NVIDIA card at ray tracing because the former was dedicated to the task. I believe that was 5-7 years ago...
 
  • Like
Reactions: tomO2013

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,678
So since this seems to be the thread for more technical questions: You'll recall there was a lot of discussion about whether Apple would use HBM2 for the AS Mac Pro.

Does what we've seen with LPDDR5 in the MBPs tell us HBM2 wouldn't be needed? And even if there would be a benefit, would too many architectural changes be needed to switch to HBM2 (i.e., would it be qualitatively different from the relatively easy switch they made from LPDDR4 in the M1 to LPDDR5 in the M1 Pro/Max)?

HBM2 still has an advantage in bandwidth per area. If you want, say, 1.6 TB/s of bandwidth (4x M1 Max), you’d need 16 RAM modules with the current implementation, which can get messy. HBM3 could probably do it in a much smaller package. But then again you lose RAM capacity, so there are drawbacks too.
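The module-count arithmetic, assuming each LPDDR5 package contributes roughly what one of the M1 Max's four does (the even per-package split is my assumption):

```python
# How many LPDDR5 packages a 1.6 TB/s design would need, extrapolating
# from the M1 Max (400 GB/s across 4 packages -> ~100 GB/s per package).

M1_MAX_BW = 400e9                           # bytes/s
M1_MAX_PACKAGES = 4
per_package = M1_MAX_BW / M1_MAX_PACKAGES   # ~100 GB/s per package

target = 1.6e12                             # desired bandwidth, bytes/s
packages_needed = target / per_package
print(f"{packages_needed:.0f} packages")    # 16
```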
 

Kpjoslee

macrumors 6502
Sep 11, 2007
417
269
HBM2 still has an advantage in bandwidth per area. If you want, say, 1.6 TB/s of bandwidth (4x M1 Max), you’d need 16 RAM modules with the current implementation, which can get messy. HBM3 could probably do it in a much smaller package. But then again you lose RAM capacity, so there are drawbacks too.

Instead of integrating LPDDR5 on the SoC, I think they could shoot for 16 channels of DDR5-6400 on the Mac Pro. They'd get both the bandwidth and the capacity (theoretically up to 4 TB, with ~600 GB/s of bandwidth).
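For reference, the nominal arithmetic if each of those is a full 64-bit channel; the ~600 GB/s figure above presumably bakes in a real-world efficiency factor or a narrower channel definition:

```python
# Peak bandwidth of n channels of DDR5-6400, counting full 64-bit channels.

MT_PER_S = 6400e6                         # transfers per second per channel
BYTES_PER_XFER = 8                        # 64 bits per transfer
per_channel = MT_PER_S * BYTES_PER_XFER   # 51.2 GB/s per channel

channels = 16
peak = per_channel * channels             # theoretical peak, bytes/s
print(f"{peak/1e9:.1f} GB/s peak")        # 819.2 GB/s
```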
 
  • Like
Reactions: jdb8167