You're living in a fantasy land.
In single core that’s true (around 4). Multicore it’s basically ~2x, maybe 3x for some memory bound loads.
You're living in a fantasy land.
For the same number of threads used, the performance/W advantage should still be around 4X I think.In single core that’s true (around 4). Multicore it’s basically ~2x, maybe 3x for some memory bound loads.
1. Node
2. ISA
3. Microarchitecture
(4. Software optimization, but we're going to ignore that for now and assume it is the same here. I think I came up with a fifth earlier but can't remember; it's the first three we're concerned with here.)
How does the frequency of the SOC influence the performance/consumption ratio?
I have read that performance increases linearly and consumption quadratically with frequency. Is it true?
It's not me who's living there… I know, it's hard to believe, in particular if you're standing on the Intel side of things. But in actuality that's what it is.
It's not nearly that simple when it comes to consumption. It kind of works that way with performance, though, provided you are not bottlenecked by anything outside the domain of the frequency increase, which you are likely to be.
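To put rough numbers on the first-order intuition, here is a toy sketch assuming the classic CMOS dynamic-power relation P ≈ C·V²·f and that voltage has to rise roughly with frequency near the top of the curve; all constants are invented for illustration, not measured from any chip.

```python
# Toy model: why power climbs much faster than performance as frequency rises.
# Assumes dynamic power P = C * V^2 * f and a made-up voltage/frequency curve.

def required_voltage(freq_ghz, v0=0.7, slope=0.1):
    """Illustrative V/f curve: higher clocks need disproportionately more voltage."""
    return v0 + slope * freq_ghz

def dynamic_power(freq_ghz, volts, capacitance=1.0):
    """First-order CMOS dynamic power, in arbitrary units."""
    return capacitance * volts ** 2 * freq_ghz

for f in (2.0, 3.0, 4.0, 5.0):
    v = required_voltage(f)
    print(f"{f:.1f} GHz: V = {v:.2f}, relative power = {dynamic_power(f, v):.2f}")
```

Going from 2 GHz to 5 GHz buys at most 2.5x performance in this toy model but costs about 4.4x the power, and real chips are usually worse because performance stops scaling with frequency once memory becomes the bottleneck.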
Now, given that Intel's got a roughly 4-5x disadvantage in terms of performance per watt
That depends a bit on where on the frequency curve you are looking. At its top, yes - because Intel is obsessed with being the fastest. Around the base frequency of their mobile chips the difference is more like 20-30%, I think.
I've often said around here that if you give the same designers the same process node and ask them to design an x86 vs a RISC processor, you will see a 20% x86 penalty.
For the same number of threads used, the performance/W advantage should still be around 4X I think.
The 2-3X advantage in MT tasks involves Intel CPUs with more cores/threads than Apple's.
People must know that power efficiency (as far as CPUs are concerned) should be compared core-for-core.
Doubling the number of cores and cutting the core frequency by half will drastically increase power efficiency in tasks like Cinebench MT. There is nothing impressive about this, and single-threaded tasks will run twice as slowly (at this frequency).
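As a toy illustration of that trade-off (same made-up P ≈ V²·f assumption as the frequency discussion above; none of these numbers come from a real chip):

```python
# Toy comparison of a "narrow and fast" chip vs. a "wide and slow" chip.
# Assumes MT throughput ~ cores * frequency and per-core power ~ V^2 * f,
# with the lower-clocked cores also running at a lower voltage.

def config(cores, freq_ghz, volts):
    throughput = cores * freq_ghz                 # idealised multi-threaded throughput
    power = cores * (volts ** 2) * freq_ghz       # arbitrary units
    return throughput, power

for name, cores, f, v in (("8 cores @ 4 GHz", 8, 4.0, 1.1),
                          ("16 cores @ 2 GHz", 16, 2.0, 0.8)):
    tput, pwr = config(cores, f, v)
    print(f"{name}: throughput = {tput:.0f}, power = {pwr:.1f}, perf/W = {tput / pwr:.2f}")
```

Both configurations deliver the same idealised MT throughput, but the wide, slow one does it at roughly half the power, while any single thread on it runs at half the clock.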
The curves Apple published suggest that at low frequency, the difference is much higher than 20-30%.
At 23 W (package power), the 12900H scores 6119. This is at a rather low core frequency.
Is that with P-cores disabled? I find it a bit confusing that the average CPU utilisation is around 30%...
I don't know. Could it be that the P-cores are disabled when unplugged?
The run yielding an impressive score of 16917 reports 46% average core usage, so I don't know what that metric means.
How do benchmarks that measure multicore performance work? Is it more complex to measure the multicore performance of CPUs with performant/efficient cores?
I don't see why it should be. All tools just measure the time it takes to perform a task.
How do benchmarks factor in that efficient cores take longer than performant cores to finish the same task?
Of course, they are designed so that the workload can be split into many threads.
They don't take longer in a multithread run. All cores work to complete the same task, and the program just measures the time it took. There is one time. Just look at a Cinebench MT run. All cores work together. The program does not repeat the task separately for each core.
How can this be? Icestorm takes twice as long as Firestorm for almost all instructions.
Source:
That's not entirely true. It holds if you have a task that can be split into many small 'chunks' where the result of a chunk is not needed for enqueuing the following chunks. For example: imagine you're trying to apply a filter to an image. You divide the image into tiles, and then apply the filter to each of these tiles. A P core will likely process twice as many tiles as an E core before the image is finished. You just have to wait until all tiles are processed, but since processing a single tile is a relatively short job, the fact that each core takes a different amount of time per tile is not a problem in practice.
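A minimal sketch of that structure (not Cinebench's or any real renderer's actual code; the names and tile count are just placeholders): every worker pulls tiles from one shared pool, a fast core simply ends up finishing more tiles than a slow one, and the benchmark only times the job as a whole.

```python
# Sketch of a tile-based multicore benchmark: split the job into many small tiles,
# let a pool of workers (one per core) drain the pool, and time the whole thing once.
# Faster cores just complete more tiles; there is a single elapsed time at the end.
import time
from concurrent.futures import ProcessPoolExecutor

def render_tile(tile_id: int) -> int:
    # Stand-in for the real per-tile work (filtering, ray tracing, ...).
    return sum(i * tile_id for i in range(200_000))

if __name__ == "__main__":
    tiles = range(200)                      # a couple of hundred tiles, Cinebench-style
    start = time.perf_counter()
    with ProcessPoolExecutor() as pool:     # defaults to one worker per logical core
        list(pool.map(render_tile, tiles, chunksize=1))
    print(f"Finished {len(tiles)} tiles in {time.perf_counter() - start:.2f} s")
```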
Note that this can also become problematic in homogeneous core designs: running 9 such simulations on a homogeneous 8-core machine could take significantly more time than running 8 simulations, since you can't run the 9th simulation until one of the others has finished.
Would it go crazy like this? Anaconda ran a microbenchmark (calculating a cosine a million times) to test the differences between the P and E cores of an M1 Mac and got this.
This is more of a problem if you have few, but very long-running tasks.
I assume that benchmark tools run many more tasks than a common CPU can process in parallel; otherwise low granularity would indeed be a problem.
The decrease between 8 and 9 processes is weird and may indicate that the last process finished while the 8 cores were idle for a while.
Yes. You usually choose a number of threads equal to the number of cores precisely to avoid this (before Apple Silicon and heterogeneous cores). If you try to run more threads than physical cores at the same time, cores have to (needlessly) context switch from one thread to another, and that decreases performance (context switches are expensive, mess with the data in the caches, and should be avoided as much as possible).
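The "9 long simulations on 8 cores" case above needs nothing more than a ceiling division to model, assuming identical cores and tasks that cannot be subdivided (an idealisation, of course):

```python
# Idealised makespan for N indivisible, equally long tasks on C identical cores:
# each core runs one task at a time, so wall-clock time = ceil(N / C) * task_time.
import math

def makespan(n_tasks: int, n_cores: int, task_minutes: float = 60.0) -> float:
    return math.ceil(n_tasks / n_cores) * task_minutes

for n in (7, 8, 9, 16, 17):
    print(f"{n:2d} tasks on 8 cores -> {makespan(n, 8):.0f} min")
```

Eight one-hour tasks finish in an hour, but the ninth pushes the total to two hours while seven cores sit idle for the whole second hour.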
Many benchmarks get around this problem by measuring throughput (like the Anaconda graph above) instead of time to finish a task. For example, in the N-body physics simulation, Geekbench measures interactions computed per second, not the time to compute, say, 1M interactions. That takes granularity and task scheduling out of the equation.
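A sketch of that difference (just the idea of counting work per unit of time instead of timing a fixed job; this is not Geekbench's actual harness):

```python
# Throughput-style measurement: run for a fixed interval and report completed
# work units per second, instead of timing how long a fixed amount of work takes.
import time

def work_unit() -> int:
    # Stand-in for one small chunk of work (e.g. one batch of N-body interactions).
    return sum(i * i for i in range(10_000))

def throughput(seconds: float = 2.0) -> float:
    done = 0
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        work_unit()
        done += 1
    return done / seconds

print(f"~{throughput():.0f} work units per second on one thread")
```

Because the measurement window is fixed, it doesn't matter that the last unit of work on a slow core straddles the deadline; the granularity and scheduling issues above mostly drop out.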
Cinebench divides the scene into 200 tiles or so.