The techheads on this thread might enjoy reading
https://eclecticlight.co/2023/11/27/evaluating-m3-pro-cpu-cores-1-general-performance/
As usual, Howard is more interested in understanding the chips than in getting page views, so he looks at things from a few unusual angles.
Points to note include
- once genuine frequencies are taken into account, the IPC gains look higher than zero.
(You might ask why the M3 P frequency didn't ramp to full max. Presumably the heuristics for DVFS ramping have changed slightly, though who knows how exactly. Perhaps max frequency is only engaged after a few seconds of lower frequency, on the grounds that there's no point in doubling power if the issue is saving .25 seconds of wait time?)
- The "NEON" and "simd" changes are substantial and interesting.
The relevant gating code is essentially a stream of dependent FADDs. If FADD takes 3 cycles (the case for M1) then a 3GHz M1 can do 1B successive dependent FADDs per second, and that's in fact the case.
So how do we get to ~1.6x as fast? The best-case frequency difference is 4/3, which isn't enough.
My GUESS is that FADD latency has dropped to 2.5 cycles (or more precisely something like 2 dependent FADDs in 5 cycles), perhaps only under some specific conditions, like when a result feeds directly back into the same unit so there's no waiting for the register from a common dispatch bus.
If that's the case then we expect performance to increase by something like 4/3*3/2.5=1.6.
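
For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It only restates the numbers above (3GHz and 3-cycle FADD for M1, a best-case 4/3 frequency ratio, and my guessed 2.5-cycle latency); the function and variable names are made up for illustration, and the 2.5-cycle figure is still a guess, not a measurement.

```python
from fractions import Fraction

def dependent_fadd_rate(freq_hz, latency_cycles):
    """Dependent FADDs per second: a chain retires one FADD per latency period."""
    return freq_hz / latency_cycles

# M1 baseline from the post: 3 GHz, 3-cycle FADD latency.
m1_rate = dependent_fadd_rate(3.0e9, 3)

# Best-case frequency ratio quoted in the post.
freq_ratio = Fraction(4, 3)

# Guessed latency change: 3 cycles down to 2.5 cycles (2 FADDs in 5 cycles).
latency_ratio = Fraction(3) / Fraction(5, 2)

# Combined speedup on a chain of dependent FADDs.
speedup = freq_ratio * latency_ratio

print(f"M1 dependent FADDs per second: {m1_rate:.3g}")
print(f"frequency alone: {float(freq_ratio):.2f}x")
print(f"frequency and latency: {float(speedup):.2f}x")
```

Frequency alone only gets you to about 1.33x, which is why some latency change is needed to explain ~1.6x.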