You're living in a fantasy land.
In single core that’s true (around 4). Multicore it’s basically ~2x, maybe 3x for some memory bound loads.
You're living in a fantasy land.
For the same number of threads used, the performance/W advantage should still be around 4X I think.In single core that’s true (around 4). Multicore it’s basically ~2x, maybe 3x for some memory bound loads.
1. Node
2. ISA
3. Microarchitecture
(4. Software optimization, but we're going to ignore that for now and assume it is the same here. I think I came up with a fifth earlier but can't remember; it's the first three we're concerned with here.)
How does the frequency of the SOC influence the performance/consumption ratio?
I have read that performance increases linearly and consumption quadratically with frequency. Is it true?
It's not me who's living there… I know, it's hard to believe, in particular if you're standing on the Intel side of things. But in actuality that's what it is.
It's not nearly that simple when it comes to consumption. It kind of works that way with performance, though, provided you are not bottlenecked by anything outside the domain of the frequency increase, which you are likely to be.
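To put rough numbers on the first-order intuition, here is a toy sketch assuming the classic CMOS dynamic-power relation P ≈ C·V²·f and that voltage has to rise roughly with frequency near the top of the curve; all constants are invented for illustration, not measured from any chip.

```python
# Toy model: why power climbs much faster than performance as frequency rises.
# Assumes dynamic power P = C * V^2 * f and a made-up voltage/frequency curve.

def required_voltage(freq_ghz, v0=0.7, slope=0.1):
    """Illustrative V/f curve: higher clocks need disproportionately more voltage."""
    return v0 + slope * freq_ghz

def dynamic_power(freq_ghz, volts, capacitance=1.0):
    """First-order CMOS dynamic power, in arbitrary units."""
    return capacitance * volts ** 2 * freq_ghz

for f in (2.0, 3.0, 4.0, 5.0):
    v = required_voltage(f)
    print(f"{f:.1f} GHz: V = {v:.2f}, relative power = {dynamic_power(f, v):.2f}")
```

Going from 2 GHz to 5 GHz buys at most 2.5x performance in this toy model but costs about 4.4x the power, and real chips are usually worse because performance stops scaling with frequency once memory becomes the bottleneck.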
Now, given that Intel's got a roughly 4-5x disadvantage in terms of performance per watt
That depends a bit on where on the frequency curve you are looking. At its top, yes - because Intel is obsessed with being the fastest. Around the base frequency of their mobile chips the difference is more like 20-30%, I think.
I've often said around here that if you give the same designers the same process node and ask them to design an x86 vs a RISC processor, you will see a 20% x86 penalty.
For the same number of threads used, the performance/W advantage should still be around 4X I think.
The 2-3X advantage in MT tasks involves Intel CPUs with more cores/threads than Apple's.
People must know that power efficiency (as far as CPUs are concerned) should be compared core-for-core.
Doubling the number of cores and cutting the core frequency by half will drastically increase power efficiency in tasks like Cinebench MT. There is nothing impressive about this, and single-threaded tasks will run twice as slowly (at this frequency).
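As a toy illustration of that trade-off (same made-up P ≈ V²·f assumption as the frequency discussion above; none of these numbers come from a real chip):

```python
# Toy comparison of a "narrow and fast" chip vs. a "wide and slow" chip.
# Assumes MT throughput ~ cores * frequency and per-core power ~ V^2 * f,
# with the lower-clocked cores also running at a lower voltage.

def config(cores, freq_ghz, volts):
    throughput = cores * freq_ghz                 # idealised multi-threaded throughput
    power = cores * (volts ** 2) * freq_ghz       # arbitrary units
    return throughput, power

for name, cores, f, v in (("8 cores @ 4 GHz", 8, 4.0, 1.1),
                          ("16 cores @ 2 GHz", 16, 2.0, 0.8)):
    tput, pwr = config(cores, f, v)
    print(f"{name}: throughput = {tput:.0f}, power = {pwr:.1f}, perf/W = {tput / pwr:.2f}")
```

Both configurations deliver the same idealised MT throughput, but the wide, slow one does it at roughly half the power, while any single thread on it runs at half the clock.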
The curves Apple published suggest that at low frequency, the difference is much higher than 20-30%.
At 23 W (package power), the 12900H scores 6119. This is at a rather low core frequency.
Is that with P-cores disabled? I find it a bit confusing that the average CPU utilisation is around 30%...
I don't know. Could it be that the P-cores are disabled when unplugged?
The run yielding an impressive score of 16917 reports 46% average core usage, so I don't know what that metric means.
How do benchmarks that measure multicore performance work? Is it more complex to measure the multicore performance of CPUs with performant/efficient cores?
I don't see why it should be. All tools just measure the time it takes to perform a task.
How do benchmarks factor in that efficient cores take longer than performant cores to finish the same task?
Of course, they are designed so that the workload can be split into many threads.
They don't take longer in a multithread run. All cores work to complete the same task, and the program just measures the time it took. There is one time. Just look at a Cinebench MT run. All cores work together. The program does not repeat the task separately for each core.
How can this be? Icestorm takes twice as long as Firestorm for almost all instructions.
Source:
That's not entirely true. It holds if you have a task that can be split into many small 'chunks' where the result of a chunk is not needed for enqueuing the following chunks. For example: imagine you're trying to apply a filter to an image. You divide the image into tiles, and then apply the filter to each of these tiles. A P core will likely process twice as many tiles as an E core before the image is finished. You just have to wait until all tiles are processed, but since processing a single tile is a relatively short job, the fact that each core takes a different amount of time per tile is not a problem in practice.
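A minimal sketch of that structure (not Cinebench's or any real renderer's actual code; the names and tile count are just placeholders): every worker pulls tiles from one shared pool, a fast core simply ends up finishing more tiles than a slow one, and the benchmark only times the job as a whole.

```python
# Sketch of a tile-based multicore benchmark: split the job into many small tiles,
# let a pool of workers (one per core) drain the pool, and time the whole thing once.
# Faster cores just complete more tiles; there is a single elapsed time at the end.
import time
from concurrent.futures import ProcessPoolExecutor

def render_tile(tile_id: int) -> int:
    # Stand-in for the real per-tile work (filtering, ray tracing, ...).
    return sum(i * tile_id for i in range(200_000))

if __name__ == "__main__":
    tiles = range(200)                      # a couple of hundred tiles, Cinebench-style
    start = time.perf_counter()
    with ProcessPoolExecutor() as pool:     # defaults to one worker per logical core
        list(pool.map(render_tile, tiles, chunksize=1))
    print(f"Finished {len(tiles)} tiles in {time.perf_counter() - start:.2f} s")
```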
Note that this can also become problematic in homogeneous core designs: running 9 such simulations on a homogeneous 8-core machine could take significantly more time than running 8 simulations, since you can't run the 9th simulation until one of the others has finished.
Would it go crazy like this? Anaconda ran a microbenchmark (calculating a cosine a million times) to test the differences between the P and E cores of an M1 Mac and got this.
This is more of a problem if you have few, but very long-running tasks.
I assume that benchmark tools run many more tasks than a common CPU can process in parallel; otherwise low granularity would indeed be a problem.
The decrease between 8 and 9 processes is weird and may indicate that the last process finished while the 8 cores were idle for a while.
Yes. You usually choose a number of threads equal to the number of cores precisely to avoid this (before Apple Silicon and heterogeneous cores). If you try to run more threads than physical cores at the same time, cores have to (needlessly) context switch from one thread to another, and that decreases performance (context switches are expensive, mess with the data in the caches, and should be avoided as much as possible).
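The "9 long simulations on 8 cores" case above needs nothing more than a ceiling division to model, assuming identical cores and tasks that cannot be subdivided (an idealisation, of course):

```python
# Idealised makespan for N indivisible, equally long tasks on C identical cores:
# each core runs one task at a time, so wall-clock time = ceil(N / C) * task_time.
import math

def makespan(n_tasks: int, n_cores: int, task_minutes: float = 60.0) -> float:
    return math.ceil(n_tasks / n_cores) * task_minutes

for n in (7, 8, 9, 16, 17):
    print(f"{n:2d} tasks on 8 cores -> {makespan(n, 8):.0f} min")
```

Eight one-hour tasks finish in an hour, but the ninth pushes the total to two hours while seven cores sit idle for the whole second hour.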
Many benchmarks get around this problem by measuring throughput (like the Anaconda graph above) instead of time to finish a task. For example, in the N-body physics simulation, Geekbench measures interactions computed per second, not the time to compute, say, 1M interactions. That takes granularity and task scheduling out of the equation.
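A sketch of that difference (just the idea of counting work per unit of time instead of timing a fixed job; this is not Geekbench's actual harness):

```python
# Throughput-style measurement: run for a fixed interval and report completed
# work units per second, instead of timing how long a fixed amount of work takes.
import time

def work_unit() -> int:
    # Stand-in for one small chunk of work (e.g. one batch of N-body interactions).
    return sum(i * i for i in range(10_000))

def throughput(seconds: float = 2.0) -> float:
    done = 0
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        work_unit()
        done += 1
    return done / seconds

print(f"~{throughput():.0f} work units per second on one thread")
```

Because the measurement window is fixed, it doesn't matter that the last unit of work on a slow core straddles the deadline; the granularity and scheduling issues above mostly drop out.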
Cinebench divides the scene into 200 tiles or so.