
jeanlain

macrumors 68020
Mar 14, 2009
2,463
958
In single core that’s true (around 4). Multicore it’s basically ~2x, maybe 3x for some memory bound loads.
For the same number of threads used, the performance/W advantage should still be around 4X I think.
The 2-3X advantage in MT tasks involves Intel CPUs with more cores/threads than Apple's.

People should keep in mind that power efficiency (as far as CPUs are concerned) should be compared core-for-core.
Doubling the number of cores and cutting the core frequency in half will drastically increase power efficiency in tasks like Cinebench MT. There is nothing impressive about this, and single-threaded tasks will run twice as slowly (at that frequency).
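A toy calculation of that trade-off, assuming only that dynamic power scales roughly with V²·f and that halving the clock also permits a lower voltage; the voltage and frequency figures below are invented for illustration, not taken from any real chip:

```python
# Toy illustration of the "double the cores, halve the clock" trade-off,
# assuming dynamic power ~ V^2 * f and that a lower clock permits a lower
# voltage. All numbers are invented for illustration.

def rel_power(voltage: float, freq: float) -> float:
    return voltage**2 * freq        # relative dynamic power, arbitrary units

one_fast_core = rel_power(voltage=1.2, freq=2.0)        # ~2.88
two_slow_cores = 2 * rel_power(voltage=0.8, freq=1.0)   # ~1.28

# Roughly the same MT throughput (2 cores at half the clock) for ~2.3x
# less power, but any single-threaded task now runs at half speed.
print(one_fast_core / two_slow_cores)                   # ~2.25
```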
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,628
1,101
1. Node
2. ISA
3. Microarchitecture
(4. software optimization, but we're going to ignore that for now and assume it is the same here - I think I came up with a 5th earlier but can't remember, and it's the first three we're concerned with here)
How does the frequency of the SOC influence the performance/consumption ratio?

I have read that performance increases linearly and consumption quadratically with frequency. Is it true?
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
How does the frequency of the SOC influence the performance/consumption ratio?

I have read that performance increases linearly and consumption quadratically with frequency. Is it true?

Power consumption is the capacitance being driven, times voltage squared, times frequency. If you keep the voltage the same and increase the frequency, power increases linearly with frequency. Sometimes you need to increase the voltage to increase frequency, however. It depends on the details of the physical design.

Performance increases linearly with frequency.
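A minimal sketch of that relation (P ≈ C·V²·f); the capacitance and voltage values are made-up illustrative numbers, not measurements of any real chip:

```python
# Sketch of the dynamic-power relation P ≈ C * V^2 * f.
# The capacitance and voltage values are made up for illustration.

def dynamic_power(c_farads: float, volts: float, freq_hz: float) -> float:
    """Switched capacitance times voltage squared times frequency."""
    return c_farads * volts**2 * freq_hz

C = 1e-9  # hypothetical effective switched capacitance

# Constant voltage: power grows linearly with frequency.
print(dynamic_power(C, 1.0, 2e9))  # 2.0 W
print(dynamic_power(C, 1.0, 4e9))  # 4.0 W (2x frequency -> 2x power)

# If the higher frequency also needs more voltage, the V^2 term makes
# power grow much faster than linearly.
print(dynamic_power(C, 1.3, 4e9))  # ~6.8 W
```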
 

Romain_H

macrumors 6502a
Sep 20, 2021
520
438
You're living in a fantasy land.
It's not me who's living there… I know, it's hard to believe, particularly if you're standing on the Intel side of things. But in actuality, that's what it is.
 
Last edited:

EntropyQ3

macrumors 6502a
Mar 20, 2009
718
824
How does the frequency of the SOC influence the performance/consumption ratio?

I have read that performance increases linearly and consumption quadratically with frequency. Is it true?
It’s not nearly that simple when it comes to consumption. It kind of works that way with performance though, provided you are not bottlenecked by anything outside the domain of frequency increase, which you are likely to be.

Power vs. frequency is a function that starts out roughly constant, then enters a linear region, and gradually moves into higher-order growth once you need to increase drive voltage to increase frequency.

This is a practical example from a desktop processor. The curves I’ve seen from CPUs and GPUs are all similar but not identical in behaviour, due to for instance process specifics.

[Attached image: measured power vs. frequency curve for a desktop processor]
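A toy model of that curve shape, assuming a minimum stable voltage at low clocks and a voltage that has to rise with frequency past a knee point; all constants are invented for illustration, not read off the chart:

```python
# Toy model of the power-vs-frequency curve described above: near-flat
# while the chip sits at its minimum voltage, then roughly linear, then
# super-linear once voltage has to rise with frequency.
# All constants are invented for illustration.

C = 1e-9              # hypothetical effective switched capacitance
V_MIN = 0.7           # minimum stable voltage
F_KNEE = 2.0e9        # frequency above which voltage must be raised
V_SLOPE = 0.25 / 1e9  # extra volts per extra Hz above the knee

def required_voltage(freq_hz: float) -> float:
    return max(V_MIN, V_MIN + V_SLOPE * (freq_hz - F_KNEE))

def package_power(freq_hz: float, static_w: float = 1.0) -> float:
    v = required_voltage(freq_hz)
    return static_w + C * v**2 * freq_hz  # static floor + dynamic power

for ghz in (0.5, 1.0, 2.0, 3.0, 4.0, 5.0):
    print(f"{ghz:.1f} GHz -> {package_power(ghz * 1e9):.1f} W")
```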
 
Last edited:

leman

macrumors Core
Original poster
Oct 14, 2008
19,523
19,679
Now, given that Intel's got a roughly 4-5x disadvantage in terms of performance per watt

That depends a bit on where on the frequency curve you are looking. At its top, yes - because Intel is obsessed with being the fastest. Around the base frequency of their mobile chips the difference is more like 20-30%, I think.
 
  • Like
Reactions: Xiao_Xi

crazy dave

macrumors 65816
Sep 9, 2010
1,454
1,232
That depends a bit on where on the frequency curve you are looking. At its top, yes - because Intel is obsessed with being the fastest. Around the base frequency of their mobile chips the difference is more like 20-30%, I think.

Efficiency is substantially improved there, but performance suffers too, which is why they push clocks so high. The i5 chips probably offer the best balance between performance and efficiency for Intel.
 
  • Like
Reactions: bobcomer

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
That depends a bit on where on the frequency curve you are looking. At its top, yes - because Intel is obsessed with being the fastest. Around the base frequency of their mobile chips the difference is more like 20-30%, I think.
I‘ve often said around here that if you give the same designers the same process node and ask them to design an x86 vs a RISC processor, you will see a 20% x86 penalty.
 
  • Like
Reactions: Xiao_Xi

crazy dave

macrumors 65816
Sep 9, 2010
1,454
1,232
For the same number of threads used, the performance/W advantage should still be around 4X I think.
The 2-3X advantage in MT tasks involves Intel CPUs with more cores/threads than Apple's.

People should keep in mind that power efficiency (as far as CPUs are concerned) should be compared core-for-core.
Doubling the number of cores and cutting the core frequency in half will drastically increase power efficiency in tasks like Cinebench MT. There is nothing impressive about this, and single-threaded tasks will run twice as slowly (at that frequency).

I understand where you are coming from, and it can be useful as a theoretical perspective: "what could be built with these cores?" - like, say, in an upcoming M1 Max Duo. ;) However, for comparing the M1 Max to an Intel i7/i9, the design each chipmaker (or even individual OEM, on things like RAM, thermals, and power) goes with is the design, and the holistic comparison is what matters.

And even when normalizing by thread count there can be nuances. A major reason AMD and Intel chips claw back MT efficiency is SMT2/HT, which comes with its own set of nuances and complications. So when normalizing by threads for AMD and Intel, is it one thread per core or more? And for Intel and Apple, what percentage of threads are on "E" cores (I use the term loosely for Intel)?

How does the frequency of the SOC influence the performance/consumption ratio?

I have read that performance increases linearly and consumption quadratically with frequency. Is it true?

@cmaier and @EntropyQ3 both responded with everything I was going to say and more. :)
 

jeanlain

macrumors 68020
Mar 14, 2009
2,463
958
That depends a bit on where on the frequency curve you are looking. At its top, yes - because Intel is obsessed with being the fastest. Around the base frequency of their mobile chips the difference is more like 20-30%, I think.
The curves Apple published suggest that at low frequency, the difference is much higher than 20-30%.
But no one else has established these curves, so we have to trust Apple.

The Cinebench test on battery that was published by that Romanian site can still inform us.
At 23W (package power), the 12900H scores 6119. This is at a rather low core frequency.
The M1 scores 7500 at 15.4W package power.
The Intel CPU has 14 cores, the M1 has 8. This is a very crude estimate because each CPU has different types of cores and some Intel cores have hyperthreading, but the average is 19 Cinebench points per watt per core for the Intel part running at low frequency, and 60.8 points per watt per core for the M1 at nominal frequency.
I must admit I find the 6119 score way too low, even considering that it only required 23W on average. Perhaps something went wrong in that test.
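For reference, those per-core efficiency figures are just the score divided by package power and core count; a quick reproduction of that crude arithmetic:

```python
# Reproducing the crude points-per-watt-per-core arithmetic above.
# "Per core" ignores that both chips mix core types and that the Intel
# P-cores run two threads each, so treat it as a rough normalisation only.

def points_per_watt_per_core(score: float, watts: float, cores: int) -> float:
    return score / watts / cores

print(points_per_watt_per_core(6119, 23.0, 14))  # ~19.0 (i9-12900H on battery)
print(points_per_watt_per_core(7500, 15.4, 8))   # ~60.9 (M1, 4P + 4E)
```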
 

jeanlain

macrumors 68020
Mar 14, 2009
2,463
958
Is that with P-cores disabled? I find it a bit confusing that the average CPU utilisation is around 30%...
I don't know. Could it be that the P-cores are disabled when unplugged?
That would be about 40 points per watt per E-core in that case (at <20W, assuming the disabled P-cores consume >3W). That's still lower than the M1's average, which includes the Icestorm cores. If we ignore those cores, we're probably close to 110 points per watt per Firestorm core (the Icestorm cores only account for ~15% of the performance, if we trust AnandTech's SPEC tests, and they consume ~1.4W).

If the Intel E-cores are indeed almost 3X less efficient than the Firestorm cores, I don't think the P-cores can come anywhere near the Firestorm cores on the frequency curve.
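The back-of-the-envelope behind those figures, using the assumptions stated above (P-cores drawing >3W while disabled, Icestorm contributing ~15% of the score at ~1.4W); these are the post's assumptions, not measurements:

```python
# Back-of-the-envelope behind the estimates above, using the stated
# assumptions (not measurements).

# Alder Lake battery run, if only the 8 E-cores were doing the work:
intel_score, intel_watts, intel_ecores = 6119, 23.0 - 3.0, 8
print(intel_score / intel_watts / intel_ecores)   # ~38 pts/W per E-core

# M1, Firestorm cores only:
m1_score, m1_watts = 7500, 15.4
fs_score = m1_score * 0.85      # strip the ~15% Icestorm contribution
fs_watts = m1_watts - 1.4       # strip the ~1.4 W of Icestorm power
print(fs_score / fs_watts / 4)  # ~114 pts/W per Firestorm core
```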
 
Last edited:

jeanlain

macrumors 68020
Mar 14, 2009
2,463
958
I find it a bit confusing that the average CPU utilisation is around 30%...
The run yielding an impressive score of 16917 reports 46% average core usage, so I don't know what that metric means.
Interestingly, that run yielded 18.6 points per watt per core on average. That's very close to the run on battery power (assuming that run used all 14 cores), which would be very weird. It suggests that the run on battery did not use all the cores.
 
Last edited:

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,628
1,101
How do benchmarks that measure multicore performance work?

Is it more complex to measure the multicore performance of CPUs with performant/efficient cores?
 

jeanlain

macrumors 68020
Mar 14, 2009
2,463
958
How do benchmarks that measure multicore performance work?

Is it more complex to measure the multicore performance of CPUs with performant/efficient cores?
I don't see why it should be. All tools just measure the time it takes to perform a task.
Of course, they are designed so that the workload can be split into many threads.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,628
1,101
All tools just measure the time it takes to perform a task.
Of course, they are designed so that the workload can be split into many threads.
How do benchmarks factor in that efficient cores take longer than performant cores to finish the same task?
 

jeanlain

macrumors 68020
Mar 14, 2009
2,463
958
How do benchmarks factor in that efficient cores take longer than performant cores to finish the same task?
They don't take longer in a multithreaded run. All cores work to complete the same task, and the program just measures the time it took. There is one time. Just look at a Cinebench MT run. All cores work together. The program does not repeat the task separately for each core.
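A minimal sketch of that idea: one shared pool of work items, every core helps drain it, and the benchmark records a single wall-clock time. The workload is a stand-in, not what Cinebench actually renders:

```python
# Minimal sketch of an MT benchmark: one pool of work items, all cores
# drain it, and a single wall-clock time is measured.
import math
import os
import time
from concurrent.futures import ProcessPoolExecutor

def render_tile(tile_id: int) -> float:
    """Stand-in for rendering one tile: some CPU-bound math."""
    return sum(math.sin(i) for i in range(200_000))

if __name__ == "__main__":
    tiles = range(200)                      # many more tiles than cores
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        list(pool.map(render_tile, tiles))  # faster cores simply take more tiles
    elapsed = time.perf_counter() - start
    print(f"score ~ work/time: {len(tiles) / elapsed:.1f} tiles/s")
```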
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,523
19,679
How can this be? Icestorm takes twice as long as Firestorm for almost all instructions.

Source:

I think there has been a bit of a misunderstanding about what @jeanlain meant. Sure, on a heterogeneous CPU the slower cores will take longer, but that doesn't really matter, because you are measuring the total time it takes the CPU to run a workload. For a measure of multicore performance you just need to make sure that there are enough work packages to saturate all the cores. The CPU/OS schedulers will take care of the rest. And it is entirely possible that the P-cores will finish a task and go idle while the E-cores still need a second or two to finish. On most modern systems, however, the OS will move the task from an E-core to a P-core in such a situation. That's an implementation detail, though.

Bottom line: as long as you have enough work, you probably don't have to worry about these things.
 

Andropov

macrumors 6502a
May 3, 2012
746
990
Spain
They don't take longer in a multithreaded run. All cores work to complete the same task, and the program just measures the time it took. There is one time. Just look at a Cinebench MT run. All cores work together. The program does not repeat the task separately for each core.
That's not entirely true. It holds if you have a task that can be split into many small 'chunks' where the result of a chunk is not needed for enqueuing the following chunks. For example: imagine you're trying to apply a filter to an image. You divide the image into tiles, and then apply the filter on each of these tiles. A P core will likely process twice as many tiles as an E core before the image is finished. You just have to wait until all tiles are processed, but since processing a single tile is a relatively short operation, the fact that each core takes a different amount of time per tile is not a problem in practice.

This is more of a problem if you have few, but very long running tasks. For example, running several instances of a numerical simulation (like an N-body particle simulation) with different parameters for each thread. A single run of the simulation could take, for example, 10 minutes on a P core and 20 minutes on an E core. So what happens if you run 10 of them on an M1 Pro/Max? Do 8 of the threads finish at the 10-minute mark, but you still have to wait until the 20-minute mark for the whole process to finish? Well, no. The 2 remaining threads get promoted from the E cores to the P cores as soon as those are free, so it would finish at the 15-minute mark (half the work done on the E cores = 10 minutes + the other half done on the P cores = 5 minutes), and the CPU would be at low utilization for those last minutes.

(Note that this can also become problematic in homogeneous core designs: running 9 such simulations on a homogeneous 8-core machine could take significantly more time than running 8 simulations since you can't run the 9th simulation until one of the others has finished. That's why, for small simulation counts, the number of simulations run is usually made a multiple of the number of CPU cores.)
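Rough completion-time arithmetic for those two scenarios, assuming 10 minutes per simulation on a P core, 20 on an E core, and a scheduler that promotes E-core threads once P cores free up (as described above):

```python
# Completion-time arithmetic for the two scheduling scenarios above.
import math

T_P, T_E = 10.0, 20.0   # minutes per simulation on a P core / an E core

# 10 simulations on an M1 Pro/Max (8P + 2E), with promotion: after 10 min
# the 2 E-core jobs are T_P / T_E = 50% done, then finish on free P cores.
remaining = 1.0 - T_P / T_E
print(T_P + remaining * T_P)      # 15.0 minutes total

# 9 simulations on a homogeneous 8-core machine: the 9th job cannot start
# until one of the first 8 finishes, so it takes two full batches.
print(math.ceil(9 / 8) * T_P)     # 20.0 minutes total
```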
 
  • Like
Reactions: Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,628
1,101
Note that this can also become problematic in homogeneous core designs: running 9 such simulations on a homogeneous 8-core machine could take significantly more time than running 8 simulations since you can't run the 9th simulation until one of the others has finished.
Would it go crazy like this? Anaconda ran a microbenchmark (calculating a cosine a million times) to test the differences between the P and E cores of an M1 Mac and got this.
[Attached image: apple-m1-blog-post.png - Anaconda's microbenchmark results on the M1]

Source: https://www.anaconda.com/blog/apple-silicon-transition
 

jeanlain

macrumors 68020
Mar 14, 2009
2,463
958
This is more of a problem if you have few, but very long running tasks.
I assume that benchmark tools run many more tasks than a typical CPU can process in parallel, otherwise low granularity would indeed be a problem.
Cinebench divides the scene into 200 tiles or something.
 

Andropov

macrumors 6502a
May 3, 2012
746
990
Spain
Would it go crazy like this? Anaconda ran a microbenchmark (calculating a cosine a million times) to test the differences between the P and E cores of an M1 Mac and got this.
Yes. You usually choose a number of threads equal* to the number of cores *precisely* to avoid this (before Apple Silicon and heterogeneous cores). If you try to run more threads than physical cores at the same time, cores have to (needlessly) context switch from one thread to another, and that decreases performance (context switches are expensive, mess with the data in the caches, and should be avoided as much as possible).

*Sometimes the number of threads is set to twice the number of cores on SMT-capable (multithreading) machines, since they were designed to be able to run two threads per core. That didn't make much of a difference in any workload I ever tried, since the point of SMT (someone correct me if I'm wrong) is to keep the core's ALUs busy, and that's not a problem for numerical simulations since they're ALU-heavy by definition. I assume context switches here are less of a problem since SMT-capable cores have two sets of registers, so context switching between (just) two threads should be cheaper; hence performance neither improves nor decreases.

I assume that benchmark tools run many more tasks than a typical CPU can process in parallel, otherwise low granularity would indeed be a problem.
Cinebench divides the scene into 200 tiles or something.
Many benchmarks get around this problem by measuring throughput (like the Anaconda graph above) instead of time to finish a task. For example, Geekbench's N-body physics test measures interactions computed per second, not the time to compute, say, 1M interactions. That takes granularity and task scheduling out of the equation.
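A hedged sketch of the two ways of scoring the same workload (shown single-threaded for brevity; the same distinction applies per thread in an MT benchmark). The work function is a placeholder, not Geekbench's actual N-body kernel:

```python
# Fixed-work timing vs. fixed-time throughput for the same placeholder task.
import math
import time

def one_unit_of_work() -> float:
    return sum(math.cos(i) for i in range(100_000))

# (a) Time to finish a fixed amount of work: sensitive to how the last few
#     work units land on fast vs. slow cores.
start = time.perf_counter()
for _ in range(50):
    one_unit_of_work()
print("time-to-finish:", time.perf_counter() - start, "s")

# (b) Throughput: count how many units complete in a fixed window, which
#     sidesteps granularity and the scheduling of the final stragglers.
done, start = 0, time.perf_counter()
while time.perf_counter() - start < 1.0:
    one_unit_of_work()
    done += 1
print("throughput:", done, "units/s")
```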
 