
Gerdi

macrumors 6502
Apr 25, 2020
449
301
I disagree. What you're suggesting addresses a different question from the one being raised here. Here we're simply asking about the efficiency of the M1 vs. Zen 4/Eco Mode, i.e., the relative efficiencies available to consumers in normal use.

Then it is not a technical analysis of the architecture but just a comparison of two arbitrary voltage points you happen to use, and the architecture that happens to run at the lower voltage wins.

If you're talking about adjusting the voltage of the M1, you're asking an entirely different question that opens up a whole new can of worms. Just to start, you'd need benchmark measurements for the M1 at the lower voltage, which we don't have. I.e., you can't just consider the 33% power reduction you'd get by reducing the M1's voltage without also accounting for the reduction in performance you'd get.

I suggest you try to understand the presented formula - the reduction in performance is fully accounted for.
 

Gerdi

macrumors 6502
Apr 25, 2020
449
301
Further, once you started deviating from mfr-specified voltages you can't just look at two voltages. Rather, you'd need to look at efficiency vs. voltage curves for both processors, and find the most efficient point at which each operated stably. But then the performance at those points might be too low to be desirable. Etc., etc...

You still do not understand the theory here. I am not suggesting that you decrease the voltages even further - I said that we need to use an adjustment factor in case the voltage points at which we take the performance and power measurements differ between our two DUTs.
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,060
You still do not understand the theory here. I am not suggesting that you decrease the voltages even further - I said that we need to use an adjustment factor in case the voltage points at which we take the performance and power measurements differ between our two DUTs.
I do understand it. You just don't understand my critique. It's the same as our last interaction, where I had to go back and forth with you several times before you finally understood (see post #291). I'm sorry, but I just don't have the patience to do that again this time. If we were in person, it would be easier. But online it's difficult.
 

mi7chy

macrumors G4
Oct 24, 2014
10,622
11,294
The 7950X dethrones Intel at the top stock tier, where Intel had reigned with the strong single-core performance that favors console emulation.

https://docs.google.com/spreadsheets/d/1Rpq_2D4Rf3g6O-x2R1fwTSKWvJH7X63kExsVxHnT2Mc/edit#gid=0
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
As I said, you can just use a factor. Current is implicit in the calculation. To be more precise:

efficiency = performance / power = (C0 * f) / (C1 * V^2 * f) = (C0 / C1) * (1 / V^2) - this holds for any digital circuit.

Ok, I think I understand now. C1 will be smaller if the chip needs less current to work, and thus the coefficient ratio will be higher for the chip that uses less power at the same performance level. Question: do C0 and C1 indeed remain constant across the entire voltage range? And how is all of this related to the fact that increasing voltage has diminishing returns on frequency increases? Is there also a formula for that?

Back to reality however it doesn’t seem like we have a practical way to get M1 voltages. There surely is a sensor for those but it doesn’t seem like Apple exposes it. And finally, I still don’t get why this formula is ultimately useful to me as an interested user since the voltages these chips operate at are fixed anyway. For M1, I care about performance/power at maximal voltage as this is where the chip is doing demanding work. It’s a different story of course for other systems where these things are adjustable.
 

Gerdi

macrumors 6502
Apr 25, 2020
449
301
Ok, I think I understand now. C1 will be smaller if the chip needs less current to work, and thus the coefficient ratio will be higher for the chip that uses less power at the same performance level. Question: do C0 and C1 indeed remain constant across the entire voltage range? And how is all of this related to the fact that increasing voltage has diminishing returns on frequency increases? Is there also a formula for that?

C0 is constant as long as we assume that the work done per cycle is on average constant. C1 is constant as long as we assume that the number of gates switching per cycle is on average constant. I believe both are very reasonable assumptions - and, as you may observe, both depend only on the architecture and the benchmark in question.

The interesting observation is that the frequency cancels out. This means that if you keep the voltage constant, any change in frequency has no impact on efficiency. In other words, when you increase the voltage you take the quadratic efficiency hit whether you increase the frequency or not.

As a disclaimer I want to add that I am only considering the most important contributor - namely dynamic power. There is also static power, for instance, which I neglected on purpose to simplify the model.

Back to reality however it doesn’t seem like we have a practical way to get M1 voltages. There surely is a sensor for those but it doesn’t seem like Apple exposes it. And finally, I still don’t get why this formula is ultimately useful to me as an interested user since the voltages these chips operate at are fixed anyway. For M1, I care about performance/power at maximal voltage as this is where the chip is doing demanding work. It’s a different story of course for other systems where these things are adjustable.

It depends on what conclusion you want to draw. If you want to reason about the inherent architectural efficiency of Zen vs. M1, you really want to know C0/C1. Everything else does not tell you anything about the architecture but rather about the ability to reduce voltage.
There is also the misconception that the big increase in efficiency in Eco mode is somehow a feature of the Zen 4 architecture - it is not - it is only voltage scaling of the process technology - the M1's efficiency would scale the same way, assuming the same technology.

For additional motivation, here is an example:
We have the following voltages: Veco65 = 0.866 V, Veco105 = 1.038 V, Vnormal = 1.183 V.
Based on the KitGuru review we know that the efficiency of the 7950X in normal mode is 179 pts/W. Based on the given formula, the estimated efficiency for Eco 105 is 235.4 pts/W and for Eco 65 it is 338 pts/W. The measured values are 249 pts/W and 335 pts/W - quite close to the estimates.
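To make the arithmetic easy to reproduce, here is a minimal Python sketch of that 1/V^2 scaling, using the voltages and the 179 pts/W baseline quoted above. It models dynamic power only, and the couple of pts/W by which it differs from the estimates in the post presumably come down to rounding of the voltage readings.

```python
# Minimal sketch of the efficiency ~ 1/V^2 scaling discussed above.
# Voltages and the 179 pts/W stock baseline are the values quoted in this thread;
# static (leakage) power is ignored, so treat the output as a rough estimate.

V_NORMAL = 1.183    # volts, stock 7950X
V_ECO105 = 1.038    # volts, Eco 105 W mode
V_ECO65  = 0.866    # volts, Eco 65 W mode
EFF_NORMAL = 179.0  # Cinebench pts/W at stock, per the KitGuru numbers above

def scaled_efficiency(eff_ref: float, v_ref: float, v_new: float) -> float:
    """Efficiency scales with (v_ref / v_new)^2 when only dynamic power matters."""
    return eff_ref * (v_ref / v_new) ** 2

print(f"Eco 105 estimate: {scaled_efficiency(EFF_NORMAL, V_NORMAL, V_ECO105):.1f} pts/W")
print(f"Eco 65  estimate: {scaled_efficiency(EFF_NORMAL, V_NORMAL, V_ECO65):.1f} pts/W")
# Prints roughly 233 and 334 pts/W, versus the measured 249 and 335 pts/W.
```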
 
Last edited:

altaic

macrumors 6502a
Jan 26, 2004
711
484
C0 is constant as long as we assume that the work done per cycle is on average constant.
Why would it be even constant-ish for some undefined benchmark procedure? Also, you use ‘average constant’, which is ill defined.
C1 is constant as long as we assume that the number of gates switching per cycle is on average constant.
That is not a good assumption.
 

Gerdi

macrumors 6502
Apr 25, 2020
449
301
Why would it be even constant-ish for some undefined benchmark procedure? Also, you use ‘average constant’, which is ill defined.

That is not a good assumption.

It is a valid assumption - I do not care whether you consider it "good". I am not sure what there is even to debate here. For a given distribution (like gates switching per cycle), the variance and the mean are well defined. A well-defined mean does not imply that the variance is zero.

I was also clearly pointing out that both C0 and C1 depend on the architecture and the benchmark. Therefore, when evaluating efficiency between different architectures, we use the same benchmark (and for the very same reason should also use the same voltage), which leaves only architecture metrics as variables in the efficiency calculation.
 
Last edited:

leman

macrumors Core
Oct 14, 2008
19,521
19,674
It depends on what conclusion you want to draw. If you want to reason about the inherent architectural efficiency of Zen vs. M1, you really want to know C0/C1. Everything else does not tell you anything about the architecture but rather about the ability to reduce voltage.

I think answering the question of inherent architectural efficiency is simply not feasible with the data available to us. But we do know that Apple Silicon is locked at maximal voltage (whatever that might be) when running a sustained workload at a nominal frequency, and personally, I would already be happy if we could obtain a performance/power ratio for that fixed point.

To put it differently, since I do not believe that the question of architectural efficiency is tractable, I would be ok settling for the question of device efficiency at common configuration points.

Why would it be even constant-ish for some undefined benchmark procedure? Also, you use ‘average constant’, which is ill defined.

That is not a good assumption.

I think what @Gerdi writes makes sense. Obviously you can't make these assumptions when you look at small time intervals, but if you are running a workload over a reasonably long period of time you can average (integrate) out the bumps. It's similar to marginalisation in probability theory. That's why these constants are workload-specific.
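As a toy illustration of that averaging argument (all numbers hypothetical): per-cycle switching activity can vary a lot, but its long-run mean converges, which is what lets you treat C1 as a single workload-specific constant.

```python
# Toy model: per-cycle switching activity is noisy, but its average over a long
# workload converges, which justifies treating C1 as a workload-specific constant.
import random

random.seed(0)
N_CYCLES = 1_000_000
# Hypothetical activity factor: mean ~0.15, bounded in [0, 1] via a Beta distribution.
samples = [random.betavariate(1.5, 8.5) for _ in range(N_CYCLES)]

mean = sum(samples) / N_CYCLES
print(f"long-run mean activity: {mean:.4f}")  # prints ~0.15 despite large per-cycle variation
```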
 
  • Like
Reactions: tomO2013

Gerdi

macrumors 6502
Apr 25, 2020
449
301
I think answering the question of inherent architectural efficiency is simply not feasible with the data available to us. But we do know that Apple Silicon is locked at maximal voltage (whatever that might be) when running a sustained workload at a nominal frequency, and personally, I would already be happy if we could obtain a performance/power ratio for that fixed point.

To put it differently, since I do not believe that the question of architectural efficiency is tractable, I would be ok settling for the question of device efficiency at common configuration points.

Sure, that is totally fine as long as you do not try to generalize to claims like "architecture A is more efficient than architecture B". After all, my first point was that you cannot reasonably compare architectural efficiency at different voltage points, because voltage has the single most significant impact on efficiency.

In addition, my second point is that once we know the voltage the M1 is running at, we can reason about architectural efficiency with sufficient accuracy - as we already can for the 7950X, which comes out to around 258 pts/F for Cinebench.
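For what it's worth, that ~258 figure can be sanity-checked from the three data points already in this thread if one reads it as efficiency normalised for voltage, i.e. efficiency x V^2 (my reading of the "pts/F" unit, so treat it as an assumption):

```python
# Sanity check (my own reading of the ~258 pts/F figure): normalise each measured
# efficiency by its voltage, eff * V^2, which should be roughly constant per
# architecture and benchmark if the dynamic-power model above holds.
points = [
    ("normal",  179.0, 1.183),  # pts/W, volts - values quoted earlier in the thread
    ("eco 105", 249.0, 1.038),
    ("eco 65",  335.0, 0.866),
]
normalised = [(name, eff * v ** 2) for name, eff, v in points]
for name, value in normalised:
    print(f"{name}: {value:.0f}")
print(f"mean: {sum(v for _, v in normalised) / len(normalised):.0f}")
# Prints roughly 251, 268, 251 with a mean near 257 - close to the ~258 quoted above.
```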
 

bcortens

macrumors 65816
Aug 16, 2007
1,324
1,796
Canada
Sure, that is totally fine as long as you do not try to generalize to claims like "architecture A is more efficient than architecture B". After all, my first point was that you cannot reasonably compare architectural efficiency at different voltage points, because voltage has the single most significant impact on efficiency.

In addition, my second point is that once we know the voltage the M1 is running at, we can reason about architectural efficiency with sufficient accuracy - as we already can for the 7950X, which comes out to around 258 pts/F for Cinebench.
But you can say something like: architecture "A", as deployed in a manufacturer's recommended configuration, is more/less efficient than architecture "B".
Given that this is the only metric actually relevant to most people when comparing CPUs, it is often just assumed…
 

roach1245

macrumors member
Oct 26, 2021
77
172
As soon as you put more than 16GB of RAM in a machine with the Ryzen 7950X (or Intel 13th gen Raptor Lake), performance declines quickly. When you put in 128GB of DDR5 (4 x 32GB) you'll actually get DDR4-class speeds (3600 MT/s) with high latency (unless you start overclocking, which may get you to a stable 4100... in the future... if you're lucky with the silicon...). Your memory bandwidth will be about 50 GB/s, or 60 GB/s overclocked (because the Ryzen 7950X has only 2 meager memory channels), versus e.g. the M1 Max, which has 400 GB/s of memory bandwidth, and the M1 Ultra at 800 GB/s (insanely high).

If you're doing (parallel) data analysis, the 16 cores on the Ryzen 7950X are mostly useless, as they'll be bottlenecked by the memory bandwidth; you can only feed 4-6 cores with that. The M1 Ultra, on the other hand, can truly feed its 20 cores with 128GB of memory. There's a great analysis on this here:


which rightly concludes that:

"M1 Ultra has an extremely high 800 GB/sec memory bandwidth.... which leads to a level of CPU performance scaling that I don’t even see on supercomputers, and is the result of a SoC (system on a chip) design""

The Ryzen 7950X (and Intel 13th gen) is made for gaming galore, where 16GB of precisely-timed RAM suffices (and that is what all these CPU benchmarks are based on); go beyond that and you enter a different league with the M1 Pro/Max/Ultra chips or the AMD Threadripper Pros with 8 memory channels and ~200GB/s of memory bandwidth (e.g. the Threadripper Pro 5965WX with 24 cores - the chip alone costs $2400 and a compatible motherboard $1000... you'll end up paying more for that than for the M1 Ultra).
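To put rough numbers on that bandwidth argument, here is a small sketch. The peak-bandwidth formula (channels x 8 bytes x transfer rate) is standard; the ~10 GB/s per-core streaming demand is a hypothetical figure for illustration only.

```python
# Back-of-the-envelope numbers for the bandwidth argument above.
# Peak DDR bandwidth = channels * 8 bytes per transfer * transfer rate (MT/s).
# The per-core streaming demand below is a hypothetical value, not a measurement.

def peak_bandwidth_gbs(channels: int, mts: float) -> float:
    """Theoretical peak bandwidth in GB/s for a DDR configuration with 64-bit channels."""
    return channels * 8 * mts / 1000

configs = [
    ("7950X, 2ch DDR5-3600 (128GB)", peak_bandwidth_gbs(2, 3600)),  # ~57.6 GB/s
    ("M1 Max (Apple's quoted figure)", 400.0),
    ("M1 Ultra (Apple's quoted figure)", 800.0),
]

PER_CORE_STREAM_GBS = 10.0  # hypothetical demand of one core on a streaming workload

for name, bw in configs:
    print(f"{name}: {bw:.1f} GB/s peak, enough for ~{bw / PER_CORE_STREAM_GBS:.0f} streaming cores")
```

Under that assumed per-core demand, the dual-channel setup saturates at around 5-6 streaming cores, in line with the 4-6 cores claim above, while 800 GB/s is nowhere near a limit for the M1 Ultra's 20 CPU cores.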
 
Last edited:

theorist9

macrumors 68040
May 28, 2015
3,880
3,060
As soon as you put more than 16GB RAM in a machine with the Ryzen 7950X (or Intel 13th gen Raptor Lake) performance declines quickly. When you put in 128GB DDR5 (4*32GB) you'll actually get DDR4 speeds (3200 MT/s) with high latency (unless you start overclocking which may raise it to stable 4100... in the future... if you're lucky with the silicon....). Your memory bandwidth will be about 50 GB/s or 60GB/s overclocked (because the Ryzen 7950X only has 2 meager memory channels) versus e.g. the M1 Max which has 400gb/s memory bandwidth and M1 Ultra at 800 GB/s (insanely high).
Do you have a source for this? That seems like a curious claim.

According to Paul Alcorn on Tom's Hardware, "[In Raptor Lake,] Intel has increased its DDR5 memory support up to 5600 MT/s if you use one DIMM per channel (1DPC), a big increase over the prior 4800 MT/s speed with Alder Lake. Just as importantly, Intel increased 2DPC speeds up to 4400 MT/s, an improvement over the previous-gen 3600 MT/s."

So 1DPC => 5600 MT/s DDR5; 2DPC => 4400 MT/s DDR5.

Neither of these restrict the machine to 16 GB RAM to get full performance.
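Applying the same peak-bandwidth arithmetic as in the earlier sketch (assuming two standard 64-bit channels), those supported speeds work out to:

```python
# Peak bandwidth implied by the supported Raptor Lake speeds quoted above,
# assuming two 64-bit DDR5 channels.
for config, mts in [("1DPC @ 5600 MT/s", 5600), ("2DPC @ 4400 MT/s", 4400)]:
    print(f"{config}: {2 * 8 * mts / 1000:.1f} GB/s peak")  # 89.6 and 70.4 GB/s
```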

 

roach1245

macrumors member
Oct 26, 2021
77
172
Do you have a source for this? That seems like a curious claim.

According to Paul Alcorn on Tom's Hardware, "[In Raptor Lake,] Intel has increased its DDR5 memory support up to 5600 MT/s if you use one DIMM per channel (1DPC), a big increase over the prior 4800 MT/s speed with Alder Lake. Just as importantly, Intel increased 2DPC speeds up to 4400 MT/s, an improvement over the previous-gen 3600 MT/s."

So 1DPC => 5600 MT/s DDR5; 2DPC => 4400 MT/s DDR5.

Neither of these restrict the machine to 16 GB RAM to get full performance.

It's inherent to DDR; AMD even mentions it explicitly on the Ryzen 9 7950X page:


Max Memory Speed
1x1R 5200 MT/s
1x2R 5200 MT/s
2x1R 3600 MT/s
2x2R 3600 MT/s


You're in the 2x1R/2x2R category when you put 128GB DDR5 in (32GB x 4 since 64GB RAM sticks don't exist). Thus 3600 MT/s.

Same goes for Raptor Lake - 4 sticks on 2 memory channels greatly reduces memory bandwidth (also with DDR4).
 
Last edited:

theorist9

macrumors 68040
May 28, 2015
3,880
3,060
It's inherent to DDR; AMD even mentions it explicitly on the Ryzen 9 7950X page:


Max Memory Speed
1x1R 5200 MT/s
1x2R 5200 MT/s
2x1R 3600 MT/s
2x2R 3600 MT/s


You're in the 2x1R/2x2R category when you put 128GB DDR5 in (32GB x 4 since 64GB RAM sticks don't exist). Thus 3600 MT/s.

Same goes for Raptor Lake - 4 sticks on 2 memory channels greatly reduces memory bandwidth (also with DDR4).
Still not seeing where you're limited to 16 GB to get full performance on Ryzen 7000 or Raptor Lake. What about a single 32 GB chip?
Also not seeing how you're limited to 3200 MT/s with 128 GB. This is saying 3600 MT/s for AMD. And what about Raptor Lake?
 
Last edited:

roach1245

macrumors member
Oct 26, 2021
77
172
That benchmark has nothing to do with memory bandwidth. And besides, almost all benchmarks are run on a Ryzen 7950X with 16GB of DDR5 RAM, which has the highest clock speeds (partly because 128GB of DDR5 RAM is hardly stable yet).

You need a benchmark that explicitly compares a Ryzen 7950X with 128GB of RAM to an M1 Ultra with 128GB of RAM - in parallel data analysis where all available memory and cores are used in conjunction (something like computational fluid dynamics comes close) - and you'll see the huge differences in favor of the M1 Ultra.

As I mentioned


is a great source on that.
 

roach1245

macrumors member
Oct 26, 2021
77
172
Still not seeing where you're limited to 16 GB to get full performance on AMD. What about a single 32 GB chip?
Also not seeing how you're limited to 3200 MT/s with 128 GB. This is saying 3600 MT/s for AMD. And what about Raptor Lake?
3600 MT/s indeed, instead of 3200 MT/s (which is DDR4 speed). The limit applies as the amount of RAM goes up, but it is most severe when you need 4 sticks of RAM (> 64GB). Below that you can get further ahead by overclocking your memory. This is the same for Ryzen 7000 / Intel Raptor Lake, as they both have only 2 memory channels - it's really a limitation of DDR memory, which has always been much harder on the memory controller when there are more sticks than channels.

If 64GB DDR5 sticks exist at some point (and they need to have decent speed too), that will solve this issue, but I don't see it happening in the next few years for consumers (there are some 64GB DDR4 sticks, but they are for enterprise servers only and have low clock speeds).
 
Last edited:

theorist9

macrumors 68040
May 28, 2015
3,880
3,060
Max Memory Speed
1x1R 5200 MT/s
1x2R 5200 MT/s
2x1R 3600 MT/s
2x2R 3600 MT/s


You're in the 2x1R/2x2R category when you put 128GB DDR5 in (32GB x 4 since 64GB RAM sticks don't exist). Thus 3600 MT/s.
I'm not familiar with this nomenclature. Can you configure 2 x 32 GB sticks as 1 x 2R, and thus get 5200 MT/s?
 

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
AS macOS software definitely has to be re-architected to take advantage of the higher memory bandwidth provided, as well as of UMA, where all processing cores can work in unison to solve a single problem.

Most benchmarks fit nicely into the L1/L2/L3 caches. In that case, higher CPU core clocks win out, e.g. Intel's 13th gen CPUs are going past 6GHz, almost twice as fast as the fastest M1 Max core clock of 3.2GHz.

Once the task involves pushing memory around (which relates more closely to real-world tasks IMHO), the high CPU clocks become a liability, as the cores will be waiting for data most of the time. Case in point: dGPUs burn a major portion of their power budget on the need for high-speed memory; otherwise the GPU cores would spend a lot of time waiting for data to process.

Can't wait for what Apple has in store for the AS Mac Pro.
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
Sure, that is totally fine as long as you do not try to generalize to claims like "architecture A is more efficient than architecture B". After all, my first point was that you cannot reasonably compare architectural efficiency at different voltage points, because voltage has the single most significant impact on efficiency.

In addition, my second point is that once we know the voltage the M1 is running at, we can reason about architectural efficiency with sufficient accuracy - as we already can for the 7950X, which comes out to around 258 pts/F for Cinebench.
I don't understand why you're so obsessed with voltage. You can and should just measure power, energy, and performance. These are enough to draw conclusions about architectural efficiency. You don't need to equalize voltage at all.

Your thinking on this topic just seems generally off. For example, you've also claimed that the number of gates switching per cycle is irrelevant, but that's nonsense too. Microarchitectures which target higher operating frequencies generally have more pipeline stages, not just a voltage boost. More pipeline stages means more flops switching state and higher loads on clock distribution networks. It may even increase the quantity of combinatorial logic, because dividing a task into more pipeline stages often eliminates opportunities to optimize. The floorplan bloat from all the extra combinatorial and sequential logic required to go faster increases average wire length, which also costs power. And more pipeline stages hurts efficiency by increasing the amount of work discarded on any misprediction, which often drives adding even more additional logic to compensate (better branch predictors, and so on).

Much like pipeline stages, voltage is one of many tools which architects and circuit designers use to achieve their goals. You don't generate interesting or useful test data by forcing equal voltage across very different chips when, in the real world, by design, the two chips will never actually be running at equal voltage under equivalent loads.
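To illustrate the misprediction point with a toy calculation (all numbers hypothetical, chosen only to show how the cost scales with the depth of the pipeline that has to be flushed):

```python
# Toy model of wasted work on branch mispredictions: every misprediction throws away
# roughly (stages to resolve the branch) * (issue width) instruction slots.
# All parameters are hypothetical and only meant to show the scaling with depth.

def wasted_slots_per_kiloinstruction(resolve_depth: int, issue_width: int,
                                     mispredicts_per_ki: float) -> float:
    """Issue slots discarded per 1000 instructions due to mispredictions."""
    return resolve_depth * issue_width * mispredicts_per_ki

shallow_wide = wasted_slots_per_kiloinstruction(resolve_depth=12, issue_width=8,
                                                mispredicts_per_ki=4)
deep_fast    = wasted_slots_per_kiloinstruction(resolve_depth=20, issue_width=6,
                                                mispredicts_per_ki=4)

print(f"shallow/wide design: ~{shallow_wide:.0f} wasted slots per 1000 instructions")
print(f"deep/high-clock design: ~{deep_fast:.0f} wasted slots per 1000 instructions")
# Even though the deep design is narrower, it discards more work per misprediction,
# which is part of the efficiency cost of chasing frequency.
```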
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,317
Blender benchmarks are a fine datapoint if you want to run Blender TODAY.
They are a terrible datapoint for understanding the Apple Silicon CPUs because Blender optimization (like Apple Silicon optimization of other heavyweight apps) is all over the place.
I took this discussion to be about Apple HW, not about the performance of a particular app today.

As for whether there are memory bandwidth issues on AMD many-core systems, I have no idea; I'd like to see a real discussion from someone who knows what they are talking about.
 
  • Like
Reactions: jdb8167

name99

macrumors 68020
Jun 21, 2004
2,410
2,317
I'm not sure if you can make such a generalization. According to Geekbench developer John Poole, its scores are affected by RAM performance (https://www.geekbench.com/blog/2007/06/matched-ram-performance/). And SPEC2017's are as well (https://www.anandtech.com/show/1704...hybrid-performance-brings-hybrid-complexity/8)
The very claim being made reveals ignorance.

What is it that is being claimed to fit in L2/L3 caches? Instructions or data?
It is trivial to scale up a benchmark to ensure that the data footprint is larger than L3, and SPEC definitely does this. GB probably tries to, but GB5 may be behind the curve and we may need to wait for GB6.

It's much harder to scale up a benchmark to ensure that the instruction footprint exceeds cache size. Neither SPEC nor GB do this. The professional code bases that do this are larger server databases, and the academic papers that investigate this tend to look and feel rather different from papers that investigate data issues like data prefetch because they are different communities. About the closest you can get to testing large I-footprint performance without seriously hard work is using browser benchmarks that test a wide variety of functionality.

But ultimately the question is: do you want to make tribal claims, or do you want to understand? Because if you want understanding, then simplistic measurements of the size of caches are not helpful. Substantially more important than these raw sizes, in any modern machine, are how well prefetching works (on both I and D-side) and how well the branch predictor machinery works in the face of "not recently executed" code (on the I side). It does you no good to be able to get to new I code rapidly if you're then constantly flushing in the face of each newly encountered branch... There are multiple ways to handle this (better static predictors, L2 branch predictor, analysis of code for branches before it is even executed) and all of those probably matter more than the trivial question of I-footprint size compared to cache sizes.
 
  • Like
Reactions: jdb8167