Not that I’m smart enough to understand this, but I think ASi is in a better position than x86 chips because of the parallelism problem. If the SoC is more efficient at retrieving and moving data then it would be able to run single threaded tasks without entering nuclear furnace territory.
Thanks for the links. To be sure, Apple is beautifully situated for expanding to large core counts. And I understand increasing single-core performance is a challenge for all chip makers. But since most programs remain single-threaded, until chips get fast enough for these programs to respond with no noticeable CPU-bound delay, increased single-core performance will continue to improve the user experience.* And this would probably improve the experience for more users than those who need high core counts.
*I'd say there are two cutoffs for the maximum amount of time you'd want a program to take to complete routine, repeated tasks. One is the cutoff for "no perceptible delay"; the other is "perceptible, but not irritating". I don't know exactly what either of these would be, but I'd guess the former is maybe <4 ms and the latter roughly <100 ms, depending on the type of work you're doing (obviously you'd want much less latency for gaming, but I'm thinking about productivity work).
So it's a conundrum. Chip makers can no longer generationally improve single-core performance the way they could during the sub-GHz days, so most of the improvement has been in multi-core performance, through scaling to higher core counts. Yet, with a few key exceptions (audio/photography/video apps), software makers seem unable or unwilling to parallelize their programs to take advantage of this.
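To put a rough number on that conundrum, here is a quick Amdahl's-law sketch (my own illustration, not anything from the posts above) showing how little extra cores help a program that is mostly serial:

```python
# Back-of-envelope Amdahl's law: ideal speedup of a program that is only
# partly parallelizable, as a function of core count.
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup when `parallel_fraction` of the work scales across cores."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

for p in (0.0, 0.5, 0.9):          # fraction of the program that is parallel
    for n in (1, 4, 10, 20):       # number of cores thrown at it
        print(f"parallel={p:.0%} cores={n:2d} speedup={amdahl_speedup(p, n):.2f}x")
# A fully serial program (parallel=0%) gains nothing from extra cores, which is
# why single-core improvements still matter for most software.
```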
I did not know this. What are the frequencies it uses when running software that can make use of all cores?
An interesting follow-up question, though, is why Apple only lets their cores run at absolute peak frequency when only one core is active, regardless of the thermal capacity of the system?
Glad to see you back, @cmaier
Very, but if I were smart enough to understand it, I wouldn’t make my living with pretty shapes and colors.
So, IR drop is an effect of great concern in CPU designs, caused by resistance on the on-chip power grid. If you run a core faster, it increases the current, because you have to charge and discharge the FET capacitances (and any wires that are switching), and current is, by definition, the movement of charge.
Since resistance on the grid is constant, increasing the current increases the voltage that is dropped across the power grid (Ohm’s law: V = IR).
The result is that the transistors see a lower voltage difference across gate and source/drain, which slows them down (because the switching speed of a transistor is affected by the voltage).
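Here is a back-of-envelope sketch of the effect being described, with made-up capacitance and grid-resistance numbers (nothing below reflects actual Apple or Intel figures):

```python
# Rough illustration of IR drop: faster clocks mean more average switching
# current, which drops more voltage across the on-chip power grid.
C_switched = 2e-9      # effective switched capacitance per cycle, farads (assumed)
V_supply   = 1.0       # nominal supply voltage, volts (assumed)
R_grid     = 0.01      # effective power-grid resistance, ohms (assumed)

for f in (2.0e9, 3.2e9, 4.5e9):                 # clock frequency, Hz
    I_avg  = C_switched * V_supply * f          # average switching current, I = C*V*f
    droop  = I_avg * R_grid                     # Ohm's law: V = I*R dropped on the grid
    V_seen = V_supply - droop                   # voltage the transistors actually see
    print(f"{f/1e9:.1f} GHz: I ≈ {I_avg:.2f} A, grid drop ≈ {droop*1e3:.0f} mV, "
          f"effective V ≈ {V_seen:.3f} V")
```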
Informative post?
Very!
I'm curious: I take it that's why Intel is less heat- and power-efficient, because they're forcing the issue with more current to get a higher clock? And would Apple have the same problem if they did the same thing (say, for a Mac Pro-level processor)?
Would alternating P and E cores make any difference (other than to screw up timing)? Thanks!
Intel's low efficiency is due to architectural issues caused by x86, in combination with having to run at a higher clock rate to compensate. (A higher clock rate does require more current, and power consumption is proportional to the capacitance charged/discharged, times voltage squared, times clock frequency.)
Apple won’t have that problem, because an Arm pipeline needs many fewer stages and the decoders can be much simpler.
Alternating cores is more of a local-hot-spot solution than a voltage drop solution.
I will put together a post explaining in more detail at talkedabout.com when I get a chance.
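As a rough illustration of the C·V²·f relationship from the post above (all values are invented for the example, not real chip data):

```python
# Dynamic power scales as P ≈ C_eff * V^2 * f. Hypothetical comparison of a
# wide, lower-clocked core against a narrower core pushed to a high clock.
def dynamic_power(c_eff: float, v: float, f: float) -> float:
    return c_eff * v * v * f   # watts, for c_eff in farads, v in volts, f in Hz

wide_low_clock    = dynamic_power(c_eff=3e-9, v=0.9, f=3.2e9)   # assumed wide core @ 3.2 GHz
narrow_high_clock = dynamic_power(c_eff=2e-9, v=1.2, f=5.0e9)   # assumed narrow core @ 5 GHz

print(f"wide core   @ 3.2 GHz: {wide_low_clock:.1f} W")
print(f"narrow core @ 5.0 GHz: {narrow_high_clock:.1f} W")
# Higher clocks usually demand higher voltage, and the V^2 term is what makes
# chasing frequency so expensive in power.
```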
Here it is for the iPhone:
The iPhone 12 & 12 Pro Review: New Design and Diminishing Returns
www.anandtech.com
The regular M1 is similar:
I’m not sure about the Pro/Max, but I think it's the same. If so, it's 3GHz when running all P-cores (obviously less for E-cores) and a 7% boost to 3.2GHz when running a single thread.
I cannot confirm Hector's statements in that twitter thread. I have a M1 Air, and whenever I've tried investigating its frequency limits with powermetrics and CPU eater tasks (same thing he suggested), I see all four P cores jump to 3.204 GHz and stay there until the P core temps rise into roughly the 90-100C range. After that, Apple's DVFS control loop starts responding to the high temperature and uses frequency and voltage as the control variable to regulate temperature.
This is true whether I use my usual CPU eater ("yes >/dev/null"), Hector Martin's ("cat /dev/zero >/dev/null"), or POVRay ("povray +WTn -benchmark", with n being the number of worker threads you want). The only difference between any of these is how much power per core they consume.
As an aside: The unlikely champion of that fight is 'yes'. One thread of 'yes' translates to about 5.7W in the P cluster, and with the enhanced powermetrics in Monterey you can see that the average P cluster IPC is about 6.1! That's a very, very high IPC number. A single thread of 'cat' only manages 4.2W and 3.5 IPC, and a single thread of 'povray' 3.9W and almost 4 IPC. I have no idea why 'yes' is an M1 power virus.
Anyways, I wonder if Hector was taking care to start all threads in his multi-thread tests simultaneously. With the Air, doing that and starting with the CPU cold is important when trying to observe 4-thread behavior.
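For anyone who wants to try reproducing that kind of test, here is a minimal sketch of one way to start all the workers at (nearly) the same instant; this is not the script either poster used, just a hypothetical busy-loop launcher you would watch with powermetrics in another terminal:

```python
# Spawn N busy-loop workers that all begin at a shared start time, so a whole
# P-cluster ramps up together from a cold start.
import multiprocessing as mp
import time

def burn(start_at: float, seconds: float) -> None:
    # Wait for the shared start time, then spin doing pointless work.
    while time.time() < start_at:
        pass
    end = time.time() + seconds
    x = 0
    while time.time() < end:
        x += 1

if __name__ == "__main__":
    n_workers = 4                     # e.g. the four P cores of a plain M1
    start_at = time.time() + 1.0      # common start time, one second from now
    procs = [mp.Process(target=burn, args=(start_at, 30.0)) for _ in range(n_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```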
Interesting! Are you looking at the per-core or cluster line?
I was indeed looking at the per-core reporting, trusting "residency" to mean "this is the frequency bin this core was actually in".
Edit:
Unless I’m misunderstanding, Andrei wrote the same for the Pro/Max:
Intel Alder Lake vs. Apple M1
Found a reference for the Pro/Max. As you thought, they show the same behavior: "The CPU cores clock up to 3228 MHz peak, however vary in frequency depending on how many cores are active within a cluster, clocking down to 3132 at 2, and 3036 MHz at 3 and 4 cores active. I say 'per cluster'..."
forums.macrumors.com
I’m only a little bit back. Most of my CPU posts will be at another place, but I will poke my head in here now and again to discuss Ohm’s law and thermodynamics.
Since I teach thermo, I'm all for that.
How on earth did you manage to get yourself banned (or were they kidding?).
Is there, at least in principle, room for Apple to reoptimize its architecture for higher single-core performance (for the same process size)? E.g., if they could get a 20% increase in performance for a 30% increase in wattage, that might be a good tradeoff for a desktop machine, which is not constrained by battery life, and would presumably have better thermals than their mobile devices.
Of course, if their focus is on scaling horizontally, this might not be of interest to them. That tradeoff would make more sense for an 8-10 perf. core prosumer desktop (e.g., the large iMac) than a 20+ perf. core workstation, and they may not want to invest the added dev costs to make a separate core design (preferring instead to share the same core design across all their prosumer/pro devices, differentiating them only by core count).
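As a back-of-envelope check on that hypothetical 20%-for-30% trade, using the P ∝ V²·f relationship discussed earlier in the thread (the voltage assumptions here are mine):

```python
# Relative dynamic power when frequency (and possibly voltage) are scaled up,
# assuming performance tracks frequency. Real voltage/frequency curves differ.
def relative_power(freq_scale: float, volt_scale: float) -> float:
    return volt_scale ** 2 * freq_scale

print(relative_power(1.20, 1.00))   # ~1.20x power if no voltage bump is needed
print(relative_power(1.20, 1.10))   # ~1.45x power with a modest voltage bump
print(relative_power(1.20, 1.20))   # ~1.73x power if voltage must scale with frequency
# Whether +20% performance costs +30% power or far more depends on where the
# part already sits on its voltage/frequency curve.
```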
Only a little bit? Well, it’s better than nothing. Great to see you back though!
I assume that they eked out close to the maximum single core performance they could get given the process and reasonable power budget. That is, they designed for the knee in the power/performance curve, so that raising the performance any more would increase power consumption by much more. If they wanted to improve single core at the expense of multicore, they could remove cores or spread the cores out, so that local heating isn’t as much of an issue, and then raise the clock frequency, of course. I think A15 is an interesting signpost. Next generation core, running on essentially the same process, with performance increase in single core that is not all that much more than the difference in process gets them. (I think A15 cores give them more headroom, though, and may allow higher clock speeds in M2.)
That all makes sense. But what I'm asking is whether, with a different version of their current architecture (still keeping to all its essential components, including unified memory, various coprocessors, etc.), they could have (at the expense of some loss of efficiency) created a design whose knee was at a higher clock (assuming the same process size).
Your observation about the A15 suggests this might be possible—though for the actual A15 they chose not to (staying with an equivalent knee after renormalizing for process size), likely to continue to maximize efficiency.
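To make the "knee" idea concrete, here is a toy calculation with invented voltage/frequency points (not a real DVFS table) showing how the marginal power cost of each extra bit of single-core performance grows:

```python
# Marginal power cost per extra GHz along a hypothetical voltage/frequency curve,
# using P ∝ V^2 * f and assuming performance ~ frequency.
points = [  # (GHz, volts) -- invented operating points for illustration
    (2.4, 0.80), (2.8, 0.85), (3.0, 0.90), (3.2, 0.95), (3.4, 1.05), (3.6, 1.20),
]
prev_f = prev_p = None
for f, v in points:
    power = v * v * f                 # arbitrary units (capacitance factor dropped)
    if prev_f is not None:
        marginal = (power - prev_p) / (f - prev_f)
        print(f"{f:.1f} GHz: power ≈ {power:.2f}, marginal cost ≈ {marginal:.2f} per GHz")
    prev_f, prev_p = f, power
# Past the knee, each additional bit of single-core performance costs
# disproportionately more power -- the tradeoff being discussed above.
```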
Thanks! I will be posting longer explanations and discussions of things over on talkedabout.com in the tech part of the forum, but will chime in a bit here, too.
I have to say that I find Apple’s approach to performance, and in particular their solution to single-core, single-threaded, targeted performance improvements via co-processors and accelerators, to be a very exciting one and the future!
I could envisage M2 being a mixture of the A15’s improved E-cores and A14/15 P-cores, as well as additional silicon budget invested in video encode/decode, AMX, <insert new co-processor accelerator that targets a particular and very specific type of desktop workload>.
That so many missed. Next step, a ray tracer (please). I once saw a video (can't find it) where a 10W chip performed better at ray tracing than a monster NVIDIA card because the former was dedicated to the task. I believe that was 5-7 years ago...
So since this seems to be the thread for more technical questions: You'll recall there was a lot of discussion about whether Apple would use HBM2 for the AS Mac Pro.
Does what we've seen with LPDDR5 in the MBPs tell us HBM2 wouldn't be needed? And even if there would be a benefit, would too many architectural changes be needed to switch to HBM2 (i.e., would it be qualitatively different from the relatively easy switch they made from LPDDR4 in the M1 to LPDDR5 in the M1 Pro/Max)?
HBM2 still has an advantage in bandwidth per area. If you want to go, say, 1.6TB/s bandwidth (4x M1 Max), you’d need 16x RAM modules with the current implementation which can get messy. HBM3 can probably do it with a much smaller package. But then again you lose RAM capacity, so there are drawbacks too.
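The package-count arithmetic behind that, using the round numbers implied above (roughly 400 GB/s from the M1 Max's four LPDDR5 packages; approximate figures, not a spec sheet):

```python
# How many LPDDR5 packages would it take to reach ~4x the M1 Max's bandwidth?
gb_per_s_per_package = 400 / 4          # ≈ 100 GB/s per LPDDR5 package (assumed)
target_bandwidth_gb_s = 1600            # GB/s, i.e. about 4x M1 Max

packages_needed = target_bandwidth_gb_s / gb_per_s_per_package
print(f"LPDDR5 packages for {target_bandwidth_gb_s} GB/s: {packages_needed:.0f}")   # -> 16
# A few HBM stacks could reach similar bandwidth in far less package area, at the
# cost of capacity and price -- the tradeoff noted above.
```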