Not that I’m smart enough to understand this, but I think ASi is in a better position than x86 chips because of the parallelism problem. If the SoC is more efficient at retrieving and moving data then it would be able to run single threaded tasks without entering nuclear furnace territory.
Thanks for the links. To be sure, Apple is beautifully situated for expanding to large core counts. And I understand increasing single-core performance is a challenge for all chip makers. But since most programs remain single-threaded, until chips get fast enough for these programs to respond with no noticeable CPU-bound delay, increased single-core performance will continue to improve the user experience.* And this would probably improve the experience for more users than those who need high core counts.
*I'd say there are two cutoffs for the maximum amount of time you'd want a program to take to complete routine, repeated tasks. One is the cutoff for "no perceptible delay"; the other is "perceptible, but not irritating". I don't know exactly what either of these would be, but I'd guess the former is maybe <4 ms and the latter roughly <100 ms, depending on the type of work you're doing (obviously you'd want much less latency for gaming, but I'm thinking about productivity work).
So it's a conundrum. Chip makers can no longer generationally improve single-core performance the way they could during the sub-GHz days, so most of the improvement has been in multi-core performance, through scaling to higher core counts. Yet, with a few key exceptions (audio/photography/video apps), software makers seem unable or unwilling to parallelize their programs to take advantage of this.
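To put a rough number on that conundrum, here is a quick Amdahl's-law sketch (my own illustration, not anything from the posts above) showing how little extra cores help a program that is mostly serial:

```python
# Back-of-envelope Amdahl's law: ideal speedup of a program that is only
# partly parallelizable, as a function of core count.
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup when `parallel_fraction` of the work scales across cores."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

for p in (0.0, 0.5, 0.9):          # fraction of the program that is parallel
    for n in (1, 4, 10, 20):       # number of cores thrown at it
        print(f"parallel={p:.0%} cores={n:2d} speedup={amdahl_speedup(p, n):.2f}x")
# A fully serial program (parallel=0%) gains nothing from extra cores, which is
# why single-core improvements still matter for most software.
```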
I did not know this. What are the frequencies it uses when running software that can make use of all cores?
An interesting follow-up question, though, is why Apple only lets their cores run at absolute peak frequency when only one core is active, regardless of the thermal capacity of the system?
Glad to see you back, @cmaier
Very, but if I were smart enough to understand it, I wouldn’t make my living with pretty shapes and colors.
So, IR drop is an effect of great concern in CPU designs, caused by resistance on the on-chip power grid. If you run a core faster, it increases the current, because you have to charge and discharge the FET capacitances (and any wires that are switching), and current is, by definition, the movement of charge.
Since resistance on the grid is constant, increasing the current increases the voltage that is dropped across the power grid (Ohm’s law: V = IR).
The result is that the transistors see a lower voltage difference across gate and source/drain, which slows them down (because the switching speed of a transistor is affected by the voltage).
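Here is a back-of-envelope sketch of the effect being described, with made-up capacitance and grid-resistance numbers (nothing below reflects actual Apple or Intel figures):

```python
# Rough illustration of IR drop: faster clocks mean more average switching
# current, which drops more voltage across the on-chip power grid.
C_switched = 2e-9      # effective switched capacitance per cycle, farads (assumed)
V_supply   = 1.0       # nominal supply voltage, volts (assumed)
R_grid     = 0.01      # effective power-grid resistance, ohms (assumed)

for f in (2.0e9, 3.2e9, 4.5e9):                 # clock frequency, Hz
    I_avg  = C_switched * V_supply * f          # average switching current, I = C*V*f
    droop  = I_avg * R_grid                     # Ohm's law: V = I*R dropped on the grid
    V_seen = V_supply - droop                   # voltage the transistors actually see
    print(f"{f/1e9:.1f} GHz: I ≈ {I_avg:.2f} A, grid drop ≈ {droop*1e3:.0f} mV, "
          f"effective V ≈ {V_seen:.3f} V")
```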
Informative post?
Very!
I'm curious: I take it that's why Intel is less heat- and power-efficient, because they're forcing the issue with more current to get a higher clock? And would Apple have the same problem if they did the same thing (say, for a Mac Pro-level processor)?
Would alternating P and E cores make any difference (other than to screw up timing)? Thanks!
Intel's low efficiency is due to architectural issues caused by x86, in combination with having to run at a higher clock rate to compensate. (A higher clock rate does require more current, and power consumption is proportional to the capacitance charged/discharged, times voltage squared, times clock frequency.)
Apple won’t have that problem, because an Arm pipeline needs many fewer stages and the decoders can be much simpler.
Alternating cores is more of a local-hot-spot solution than a voltage drop solution.
I will put together a post explaining in more detail at talkedabout.com when I get a chance.
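As a rough illustration of the C·V²·f relationship from the post above (all values are invented for the example, not real chip data):

```python
# Dynamic power scales as P ≈ C_eff * V^2 * f. Hypothetical comparison of a
# wide, lower-clocked core against a narrower core pushed to a high clock.
def dynamic_power(c_eff: float, v: float, f: float) -> float:
    return c_eff * v * v * f   # watts, for c_eff in farads, v in volts, f in Hz

wide_low_clock    = dynamic_power(c_eff=3e-9, v=0.9, f=3.2e9)   # assumed wide core @ 3.2 GHz
narrow_high_clock = dynamic_power(c_eff=2e-9, v=1.2, f=5.0e9)   # assumed narrow core @ 5 GHz

print(f"wide core   @ 3.2 GHz: {wide_low_clock:.1f} W")
print(f"narrow core @ 5.0 GHz: {narrow_high_clock:.1f} W")
# Higher clocks usually demand higher voltage, and the V^2 term is what makes
# chasing frequency so expensive in power.
```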
Here it is for the iPhone:
The iPhone 12 & 12 Pro Review: New Design and Diminishing Returns
www.anandtech.com
The regular M1 is similar:
I’m not sure about the Pro/Max, but I think it's the same. If so, it's 3GHz when running all P-cores (obviously less for E-cores) and a 7% boost to 3.2GHz when running a single thread.
I cannot confirm Hector's statements in that twitter thread. I have a M1 Air, and whenever I've tried investigating its frequency limits with powermetrics and CPU eater tasks (same thing he suggested), I see all four P cores jump to 3.204 GHz and stay there until the P core temps rise into roughly the 90-100C range. After that, Apple's DVFS control loop starts responding to the high temperature and uses frequency and voltage as the control variable to regulate temperature.
This is true whether I use my usual CPU eater ("yes >/dev/null"), Hector Martin's ("cat /dev/zero >/dev/null"), or POVRay ("povray +WTn -benchmark", with n being the number of worker threads you want). The only difference between any of these is how much power per core they consume.
As an aside: The unlikely champion of that fight is 'yes'. One thread of 'yes' translates to about 5.7W in the P cluster, and with the enhanced powermetrics in Monterey you can see that the average P cluster IPC is about 6.1! That's a very, very high IPC number. A single thread of 'cat' only manages 4.2W and 3.5 IPC, and a single thread of 'povray' 3.9W and almost 4 IPC. I have no idea why 'yes' is an M1 power virus.
Anyways, I wonder if Hector was taking care to start all threads in his multi-thread tests simultaneously. With the Air, doing that and starting with the CPU cold is important when trying to observe 4-thread behavior.
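For anyone who wants to try reproducing that kind of test, here is a minimal sketch of one way to start all the workers at (nearly) the same instant; this is not the script either poster used, just a hypothetical busy-loop launcher you would watch with powermetrics in another terminal:

```python
# Spawn N busy-loop workers that all begin at a shared start time, so a whole
# P-cluster ramps up together from a cold start.
import multiprocessing as mp
import time

def burn(start_at: float, seconds: float) -> None:
    # Wait for the shared start time, then spin doing pointless work.
    while time.time() < start_at:
        pass
    end = time.time() + seconds
    x = 0
    while time.time() < end:
        x += 1

if __name__ == "__main__":
    n_workers = 4                     # e.g. the four P cores of a plain M1
    start_at = time.time() + 1.0      # common start time, one second from now
    procs = [mp.Process(target=burn, args=(start_at, 30.0)) for _ in range(n_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```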
Interesting! Are you looking at the per-core or cluster line?
I was indeed looking at the per-core reporting, trusting "residency" to mean "this is the frequency bin this core was actually in".
Edit:
Unless I’m misunderstanding, Andrei wrote the same for the Pro/Max:
Intel Alder Lake vs. Apple M1
Found a reference for the Pro/Max. As you thought, they show the same behavior: "The CPU cores clock up to 3228 MHz peak, however vary in frequency depending on how many cores are active within a cluster, clocking down to 3132 at 2, and 3036 MHz at 3 and 4 cores active. I say 'per cluster'..."
forums.macrumors.com
I’m only a little bit back. Most of my CPU posts will be at another place, but I will poke my head in here now and again to discuss Ohm’s law and thermodynamics.
Since I teach thermo, I'm all for that.
How on earth did you manage to get yourself banned (or were they kidding?).
Is there, at least in principle, room for Apple to reoptimize its architecture for higher single-core performance (for the same process size)? E.g., if they could get a 20% increase in performance for a 30% increase in wattage, that might be a good tradeoff for a desktop machine, which is not constrained by battery life, and would presumably have better thermals than their mobile devices.
Of course, if their focus is on scaling horizontally, this might not be of interest to them. That tradeoff would make more sense for an 8-10 perf. core prosumer desktop (e.g., the large iMac) than a 20+ perf. core workstation, and they may not want to invest the added dev costs to make a separate core design (preferring instead to share the same core design across all their prosumer/pro devices, differentiating them only by core count).
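As a back-of-envelope check on that hypothetical 20%-for-30% trade, using the P ∝ V²·f relationship discussed earlier in the thread (the voltage assumptions here are mine):

```python
# Relative dynamic power when frequency (and possibly voltage) are scaled up,
# assuming performance tracks frequency. Real voltage/frequency curves differ.
def relative_power(freq_scale: float, volt_scale: float) -> float:
    return volt_scale ** 2 * freq_scale

print(relative_power(1.20, 1.00))   # ~1.20x power if no voltage bump is needed
print(relative_power(1.20, 1.10))   # ~1.45x power with a modest voltage bump
print(relative_power(1.20, 1.20))   # ~1.73x power if voltage must scale with frequency
# Whether +20% performance costs +30% power or far more depends on where the
# part already sits on its voltage/frequency curve.
```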
Only a little bit? Well, it’s better than nothing. Great to see you back though!
I assume that they eked out close to the maximum single core performance they could get given the process and reasonable power budget. That is, they designed for the knee in the power/performance curve, so that raising the performance any more would increase power consumption by much more. If they wanted to improve single core at the expense of multicore, they could remove cores or spread the cores out, so that local heating isn’t as much of an issue, and then raise the clock frequency, of course. I think A15 is an interesting signpost. Next generation core, running on essentially the same process, with performance increase in single core that is not all that much more than the difference in process gets them. (I think A15 cores give them more headroom, though, and may allow higher clock speeds in M2.)
That all makes sense. But what I'm asking is whether, with a different version of their current architecture (still keeping to all its essential components, including unified memory, various coprocessors, etc.), they could have (at the expense of some loss of efficiency) created a design whose knee was at a higher clock (assuming the same process size).
Your observation about the A15 suggests this might be possible—though for the actual A15 they chose not to (staying with an equivalent knee after renormalizing for process size), likely to continue to maximize efficiency.
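To make the "knee" idea concrete, here is a toy calculation with invented voltage/frequency points (not a real DVFS table) showing how the marginal power cost of each extra bit of single-core performance grows:

```python
# Marginal power cost per extra GHz along a hypothetical voltage/frequency curve,
# using P ∝ V^2 * f and assuming performance ~ frequency.
points = [  # (GHz, volts) -- invented operating points for illustration
    (2.4, 0.80), (2.8, 0.85), (3.0, 0.90), (3.2, 0.95), (3.4, 1.05), (3.6, 1.20),
]
prev_f = prev_p = None
for f, v in points:
    power = v * v * f                 # arbitrary units (capacitance factor dropped)
    if prev_f is not None:
        marginal = (power - prev_p) / (f - prev_f)
        print(f"{f:.1f} GHz: power ≈ {power:.2f}, marginal cost ≈ {marginal:.2f} per GHz")
    prev_f, prev_p = f, power
# Past the knee, each additional bit of single-core performance costs
# disproportionately more power -- the tradeoff being discussed above.
```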
Thanks! I will be posting longer explanations and discussions of things over on talkedabout.com in the tech part of the forum, but will chime in a bit here, too.
I have to say that I find Apple’s approach to performance, and in particular their solution to single-core, single-threaded, targeted performance improvements via co-processors and accelerators, to be a very exciting one and the future!
I could envisage M2 being a mixture of the A15’s improved E-cores and A14/15 P-cores, as well as additional silicon budget invested in video encode/decode, AMX, <insert new co-processor accelerator that targets a particular and very specific type of desktop workload>.
That so many missed. Next step, a ray tracer (please). I once saw a video (can't find it) where a 10W chip performed better at ray tracing than a monster NVIDIA card because the former was dedicated to the task. I believe that was 5-7 years ago...
So since this seems to be the thread for more technical questions: You'll recall there was a lot of discussion about whether Apple would use HBM2 for the AS Mac Pro.
Does what we've seen with LPDDR5 in the MBPs tell us HBM2 wouldn't be needed? And even if there would be a benefit, would too many architectural changes be needed to switch to HBM2 (i.e., would it be qualitatively different from the relatively easy switch they made from LPDDR4 in the M1 to LPDDR5 in the M1 Pro/Max)?
HBM2 still has an advantage in bandwidth per area. If you want to go, say, 1.6TB/s bandwidth (4x M1 Max), you’d need 16x RAM modules with the current implementation which can get messy. HBM3 can probably do it with a much smaller package. But then again you lose RAM capacity, so there are drawbacks too.
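The package-count arithmetic behind that, using the round numbers implied above (roughly 400 GB/s from the M1 Max's four LPDDR5 packages; approximate figures, not a spec sheet):

```python
# How many LPDDR5 packages would it take to reach ~4x the M1 Max's bandwidth?
gb_per_s_per_package = 400 / 4          # ≈ 100 GB/s per LPDDR5 package (assumed)
target_bandwidth_gb_s = 1600            # GB/s, i.e. about 4x M1 Max

packages_needed = target_bandwidth_gb_s / gb_per_s_per_package
print(f"LPDDR5 packages for {target_bandwidth_gb_s} GB/s: {packages_needed:.0f}")   # -> 16
# A few HBM stacks could reach similar bandwidth in far less package area, at the
# cost of capacity and price -- the tradeoff noted above.
```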