Yeah, maybe Apple should just leave the three-level crap to Android.
They need to have SOMETHING to crow about after all...
I would think the difference in energy usage would be substantial. LPDDR4X needs a minimum of 0.6V (maybe 1.1V) to drive more than 100 data + address lines for the A15 running at over 2GHz. I would think the voltage used inside the A15 would be in terms of mV? So the wattage used would be a lot less if it's in the A15, even if the L3 cache is running at CPU speed at 3.2GHz.
While reducing main memory traffic reduces power draw, increasing the amount of on-chip cache (and cache traffic) represents an increase in power draw, and any gain is only in the delta between the two, which certainly hasn't seemed to be anywhere near this scale before, or even necessarily a net gain. And it isn't as if doubling the SLC at these sizes makes a world of difference in hit rate either, typically, so the actual difference producing the delta can't be all that much to begin with.
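A back-of-envelope sketch of that delta argument, in C, with made-up placeholder numbers (not measured A15 or LPDDR4X figures): the saving is whatever the extra hits no longer spend at the DRAM pins, minus whatever the larger cache itself costs.

/* All numbers below are invented placeholders for illustration only. */
#include <stdio.h>

int main(void) {
    double e_dram_nj = 2.0;         /* assumed energy per 64-byte LPDDR4X access */
    double e_slc_nj  = 0.2;         /* assumed energy per 64-byte on-chip SLC hit */
    double extra_hits_per_s = 50e6; /* assumed extra hits gained by doubling the SLC */
    double extra_slc_mw = 30.0;     /* assumed extra leakage + lookup power of the bigger SLC */

    /* Power saved at the memory interface by the extra hits. */
    double saved_mw = extra_hits_per_s * (e_dram_nj - e_slc_nj) * 1e-9 * 1e3;
    /* Net gain is only the delta between that saving and the cache's own cost. */
    double net_mw = saved_mw - extra_slc_mw;

    printf("saved at DRAM: %.0f mW, net after cache cost: %.0f mW\n", saved_mw, net_mw);
    return 0;
}

With different (equally plausible) placeholder numbers the net figure shrinks or goes negative, which is the point being made about the delta.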
Which is, by the way, something that is generally underappreciated about Apple's hardware effort, and one of the reasons I am so excited by it: they simply don't give a damn about fads or fashion. They don't optimize for popular benchmarks, they don't do things "because all the other kids do it", and they are not afraid to invest in power savings instead of performance despite the PR hit. They simply do a very honest job designing some damn good hardware without projecting much drama. Which is very rare in the industry nowadays.
Yeah, maybe Apple should just leave the three-tier core crap to Android.
They need to have SOMETHING to crow about after all...
Apple already has lots of what you could call U-cores. They're tiny microcontroller-class ultra low power Arm cores dedicated to specific low-level jobs. They are perhaps not exactly what you were thinking of, in that the macOS and iOS schedulers don't run on them. They aren't part of the main general purpose SMP computing resources. Instead, each of these embedded cores runs its own instance of a bare-bones realtime OS, RTKit, and communicates with a macOS driver through a mailbox interface.
Pretty sure he's talking about M-cores, which seem like they could be useful. However, as a non-electrical engineer, I find M-cores a bit conceptually boring.
What about U-cores that are ultra-low power? For doing things like tasks during sleep (dealing with WiFi, pre-loading emails, siphoning RAM to SSD before switching to hibernate, reporting 'Find My' locations while shut down, etc.)
Apple already has lots of what you could call U-cores. They're tiny microcontroller-class ultra low power Arm cores dedicated to specific low-level jobs. They are perhaps not exactly what you were thinking of, in that the macOS and iOS schedulers don't run on them. They aren't part of the main general purpose SMP computing resources. Instead, each of these embedded cores runs its own instance of a bare-bones realtime OS, RTKit, and communicates with a macOS driver through a mailbox interface.
The Asahi Linux project published a writeup about one of these in this blog post:
Progress Report: August 2021 - Asahi Linux
asahilinux.org
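For anyone wondering what "communicates through a mailbox interface" means in general terms, here is a deliberately generic sketch in C of a shared-memory ring plus doorbell. The struct layout and names are invented for illustration; this is not Apple's actual RTKit/mailbox protocol (see the Asahi writeup above for the real thing).

/* Generic doorbell + ring-buffer mailbox sketch; not Apple's real layout. */
#include <stdint.h>
#include <stdio.h>

#define RING_SLOTS 16

struct mbox_msg { uint64_t endpoint; uint64_t payload; };

struct mailbox {
    volatile uint32_t doorbell;          /* host sets this to signal new work */
    uint32_t head, tail;                 /* producer / consumer indices */
    struct mbox_msg ring[RING_SLOTS];    /* messages in shared memory */
};

/* Host (kernel driver) side: enqueue a message and ring the doorbell. */
static void host_send(struct mailbox *mb, struct mbox_msg m) {
    mb->ring[mb->head % RING_SLOTS] = m;
    mb->head++;
    mb->doorbell = 1;
}

/* Coprocessor (firmware) side: drain the ring when the doorbell is set. */
static void firmware_poll(struct mailbox *mb) {
    if (!mb->doorbell)
        return;
    while (mb->tail != mb->head) {
        struct mbox_msg m = mb->ring[mb->tail % RING_SLOTS];
        mb->tail++;
        printf("firmware got endpoint %llu, payload %llu\n",
               (unsigned long long)m.endpoint, (unsigned long long)m.payload);
    }
    mb->doorbell = 0;
}

int main(void) {
    struct mailbox mb = {0};
    host_send(&mb, (struct mbox_msg){ .endpoint = 1, .payload = 42 });
    firmware_poll(&mb);
    return 0;
}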
Look at it this way.
I would think the difference in energy usage would be substantial. LPDDR4X needs a minimum of 0.6V (maybe 1.1V) to drive more than 100 data + address lines for the A15 running at over 2GHz. I would think the voltage used inside the A15 would be in terms of mV? So the wattage used would be a lot less if it's in the A15, even if the L3 cache is running at CPU speed at 3.2GHz.
The longer the data and address lines of the LPDDR4X modules stay in a high-impedance state, the less power will be expended.
If you could get such improvements from doubling the SLC, everyone would do it, particularly since everyone other than Apple would have to invest far fewer transistors. Andrei points out that Qualcomm's 888 only has 3MB of SLC and the Exynos 2100 seems to have 6MB (so he is chewing on these issues); it would be a total no-brainer.
Yeah, I know Apple already has dedicated modules; I probably didn't explain what I meant well enough. I would assume lots of that unlabelled misc area in the die shot is made up of these random little dedicated modules. I was more pointing out that what these little ultra-low power accelerators do, and could do in the future, is likely a lot more interesting than the addition of an extra 'M-core' tier that just does what the P-cores could do with less power, or what the E-cores do, but faster.
Apple already has lots of what you could call U-cores. They're tiny microcontroller-class ultra low power Arm cores dedicated to specific low-level jobs.
Well, I threw it out there as a question, not as a suggestion. As I originally noted, I would need to profile actual code in quite some detail to figure out if it would be useful in any way. My gut, though, is that an even lower power core might make sense. It should be easy to reduce power in an "ultra efficient" core by around 30% (by looking at what Apple is doing in comparison with what Qualcomm is doing). The net performance of such a core might be, say, several times less than the current low power cores. Say they can fit two of them in the same space as a low power core. Are there enough threads running around on iOS and macOS that are just background threads that monitor things and are not performing useful calculations, such that a couple of these ultra efficient cores would be worth it (by removing workload from the other cores)? I don't know.
In your three-tier hypothetical, would you see Apple adding a smaller, more efficient E-core (making the current E-core the new M-core)? Or adding a new M-core in between their current E- and P-cores? Or a bigger, more hungry P-core (making the current P-core the new M-core)? The last one seems unlikely for a phone SOC (though I suppose if they went with something like the X1, where there's only one in the SOC, maybe not too bad …), so I'm guessing one of the first two? Are the potential gains worth it from a complexity/scheduling POV? I mean, Android does it, and Apple held off on big-little until they finally did it, but is adding a third tier worth it?
I am guessing, but I would imagine that the Apple design team has an excellent store of instruction traces. I would bet that it is those traces that they are tuning their designs to, and not synthetic benchmarks.
Which is, by the way, something that is generally underappreciated about Apple's hardware effort, and one of the reasons I am so excited by it: they simply don't give a damn about fads or fashion. They don't optimize for popular benchmarks, they don't do things "because all the other kids do it", and they are not afraid to invest in power savings instead of performance despite the PR hit. They simply do a very honest job designing some damn good hardware without projecting much drama. Which is very rare in the industry nowadays.
Just curious, since we're talking about maximum efficiency cores, is the branch predictor necessary at all? Since every mis-predicted branch means 'wasted' energy, and we're already discussing an in-order processor, wouldn't a significant amount of energy (potentially) be saved by waiting for the jump instruction to finish execution? Or would that be unreasonably slow for a modern processor, even in a high efficiency core?
Not really, at least I surmise not. There is far more to be gained by differentiating microarchitecture. For example, a completely in-order core with no hardware multiplier, no reservation stations, etc. can have far fewer pipe stages and a lot denser hardware, which means the wires are shorter, the branch prediction miss penalty is much lower, etc., which brings much higher power efficiency for instruction streams that don't need to worry about speed.
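A quick sketch of why the shorter pipeline matters here: the cycles lost to branches scale with branch frequency × miss rate × flush penalty, so a crude predictor in a 4-ish stage in-order core can cost about the same as a great predictor in a 15-ish stage out-of-order core. The rates and depths below are assumed round numbers, not figures for any real core.

/* Assumed round numbers for illustration only. */
#include <stdio.h>

int main(void) {
    double branch_freq = 0.20;  /* ~1 in 5 instructions is a branch (assumed) */

    /* Deep out-of-order core: very accurate predictor, expensive flush. */
    double deep_loss    = branch_freq * 0.03 * 15.0;
    /* Shallow in-order core: cruder predictor, cheap flush. */
    double shallow_loss = branch_freq * 0.10 * 4.0;

    printf("cycles lost per instruction to mispredicts: deep %.3f, shallow %.3f\n",
           deep_loss, shallow_loss);
    return 0;
}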
I would guess that would be far too slow, especially because you can implement a very simple branch predictor that doesn't cost much energy at all. For example, just assume all backward branches are taken and forward are not. That probably works 80% of the time.
Just curious, since we're talking about maximum efficiency cores, is the branch predictor necessary at all? Since every mis-predicted branch means 'wasted' energy, and we're already discussing an in-order processor, wouldn't a significant amount of energy (potentially) be saved by waiting for the jump instruction to finish execution? Or would that be unreasonably slow for a modern processor, even in a high efficiency core?
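A minimal sketch of that static rule (backward taken, forward not taken) in C; the trace here is invented, just a loop-style branch plus an error-check-style branch:

/* Static backward-taken / forward-not-taken predictor over a made-up trace. */
#include <stdio.h>
#include <stdbool.h>

struct branch { long pc, target; bool taken; };

int main(void) {
    struct branch trace[] = {
        {0x120, 0x100, true}, {0x120, 0x100, true}, {0x120, 0x100, true},
        {0x120, 0x100, false},                       /* loop finally exits      */
        {0x200, 0x240, false}, {0x200, 0x240, false},
        {0x200, 0x240, true},                        /* rare error path taken   */
    };
    int n = sizeof trace / sizeof trace[0], correct = 0;

    for (int i = 0; i < n; i++) {
        bool predict_taken = trace[i].target < trace[i].pc;  /* backward => taken */
        if (predict_taken == trace[i].taken)
            correct++;
    }
    printf("static BTFN predictor: %d/%d correct\n", correct, n);
    return 0;
}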
Oh. I had never looked at real-life misprediction rate data for branch predictors. I imagined for some reason that they were much much worse at predicting which branch was taken. Makes sense to keep them then.
I would guess that would be far too slow, especially because you can implement a very simple branch predictor that doesn't cost much energy at all. For example, just assume all backward branches are taken and forward are not. That probably works 80% of the time.
If you can afford a tiny bit more power, just keep, say, 6 registers with target (or source) addresses and one extra bit that keeps track of whether that target was branched to (or source branched from) last time. A tiny bit more power? Increment or decrement a counter for each of those addresses.
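A sketch of that table in C: six entries keyed by branch address, each holding a small saturating counter instead of the single bit. The replacement policy and trace are deliberately naive, just to show the mechanism:

/* Tiny dynamic predictor: a handful of address-tagged saturating counters. */
#include <stdio.h>
#include <stdbool.h>

#define ENTRIES 6

struct entry { long pc; int counter; bool valid; };  /* counter: 0..3 */

static struct entry table[ENTRIES];

static bool predict(long pc) {
    for (int i = 0; i < ENTRIES; i++)
        if (table[i].valid && table[i].pc == pc)
            return table[i].counter >= 2;            /* weakly/strongly taken */
    return false;                                    /* unknown branch: not taken */
}

static void update(long pc, bool taken) {
    for (int i = 0; i < ENTRIES; i++) {
        if (table[i].valid && table[i].pc == pc) {
            if (taken  && table[i].counter < 3) table[i].counter++;
            if (!taken && table[i].counter > 0) table[i].counter--;
            return;
        }
    }
    /* Not tracked yet: claim a slot (very naive replacement: always slot 0). */
    table[0] = (struct entry){ pc, taken ? 2 : 1, true };
}

int main(void) {
    bool outcomes[] = { true, true, true, false, true, true, true, false };
    int n = sizeof outcomes / sizeof outcomes[0], correct = 0;
    for (int i = 0; i < n; i++) {
        if (predict(0x120) == outcomes[i]) correct++;
        update(0x120, outcomes[i]);
    }
    printf("tiny counter table: %d/%d correct\n", correct, n);
    return 0;
}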
Oh. I had never looked at real-life misprediction rate data for branch predictors. I imagined for some reason that they were much much worse at predicting which branch was taken. Makes sense to keep them then.
Ah, this stuff is sooo cool.
Yeah, I guessed loops were the reason. Still, I find the fact that most conditional branches are loops incredibly amusing. It makes sense, obviously, since a for loop will have N jump instructions while ifs only jump once. But in the codebases I have worked with (iOS dev here), explicit for loops are exceedingly rare, while if statements are commonplace. I guess that's why I thought branch predictors would perform much worse.
The vast majority of conditional branches are loops:
x = 0
A: x = x + 1
…
if x < 100 go to A
So most of the time, if the branch is backwards, it is a taken branch (because the loop counter hasn’t hit its final value). That fun fact actually gets you pretty far.
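The same loop written as C, just to make the arithmetic explicit: the backward branch at the bottom is taken 99 times and falls through once, so "backward means taken" is right 99% of the time for this one branch.

/* Count how often the loop's backward branch is taken vs not taken. */
#include <stdio.h>

int main(void) {
    int taken = 0, not_taken = 0;
    for (int x = 0; x < 100; ) {
        x = x + 1;                 /* A: loop body */
        if (x < 100) taken++;      /* branch back to A */
        else         not_taken++;  /* fall through, loop exits */
    }
    printf("backward branch: taken %d, not taken %d\n", taken, not_taken);
    return 0;
}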
Yeah, I guessed loops were the reason. Still, I find the fact that most conditional branches are loops incredibly amusing. It makes sense, obviously, since a for loop will have N jump instructions while ifs only jump once. But in the codebases I have worked with (iOS dev here), explicit for loops are exceedingly rare, while if statements are commonplace. I guess that's why I thought branch predictors would perform much worse.
Obviously loops must be everywhere in the compiled code, but they're often hidden inside higher-order functions such as map, filter, etc., which are preferred (by most, at least; I'm not yet convinced. I suspect people unknowingly nest or unnecessarily repeat loops when coding this way, and I'm not sure the compiler is smart enough to optimize all that away).
Well, I threw it out there as a question, not as a suggestion. As I originally noted, I would need to profile actual code in quite some detail to figure out if it would be useful in any way. My gut, though, is that an even lower power core might make sense. It should be easy to reduce power in an "ultra efficient" core by around 30% (by looking at what Apple is doing in comparison with what Qualcomm is doing). The net performance of such a core might be, say, several times less than the current low power cores. Say they can fit two of them in the same space as a low power core. Are there enough threads running around on iOS and macOS that are just background threads that monitor things and are not performing useful calculations, such that a couple of these ultra efficient cores would be worth it (by removing workload from the other cores)? I don't know.
That's effectively the role of the A55. The problem is that while they sip power and do indeed take up next to no space, even compared to Apple's E-core, they're also so slow that Apple's E-core is in the end more power efficient overall. Thus the only thing you end up saving by using such U-cores, as opposed to simply adding more Apple-style E-cores, is die area. That's not nothing: die area is expensive and becoming increasingly so as nodes progress. But, so far, Apple seems willing to pay if they need to.
ARM does have new in-order A510 designs coming out that are better, but a lot of their focus is still die area. However, in the article about the A510, Andrei notes that ARM themselves make a similar point to yours: sure, if you benchmark these littlest cores independently they look like crap, but in normal "real life" scenarios, handling the kinds of small, integer-heavy background workloads they're meant for while the rest of the cores are active, they are actually the most efficient way to fulfill that role, not just in die area but also in energy efficiency. Trouble is, those tests are internal to ARM, so they're kind of hard to verify.
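To put rough numbers on the "so slow they're less efficient overall" point: energy for a fixed chunk of work is power × time, so a core that draws 30% less power but runs several times slower can still burn more energy per task. The figures below just reuse the hypothetical 30% / 3x numbers from the earlier post; they are not measurements of the A55 or Apple's E-core.

/* Hypothetical numbers only: 30% lower power, 3x lower performance. */
#include <stdio.h>

int main(void) {
    double e_power = 1.00, e_perf = 1.0;        /* Apple-style E-core baseline      */
    double u_power = 0.70, u_perf = 1.0 / 3.0;  /* hypothetical ultra-efficient core */

    /* Relative energy for the same amount of background work = power * time. */
    double e_energy = e_power / e_perf;
    double u_energy = u_power / u_perf;

    printf("relative energy per task: E-core %.2f, U-core %.2f\n", e_energy, u_energy);
    return 0;
}

Of course this ignores the scenario ARM describes, where the little cores run alongside already-active big cores, which is exactly why their internal numbers are hard to verify from outside.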
Yep. I've never profiled code for Arm, so I can only guess. I would imagine, for example, that Windows on x86 would benefit greatly from this sort of thing, because Windows has a tendency to have a lot of threads sitting around just polling. (Or at least it did 10 years ago; not sure about today.) Getting those threads out of the way of more important threads may well be worth it.
While not quite as extreme as adding small in-order cores, that’s also Intel’s big bet with Alder Lake.
Well, it sounds like the driver, not the interrupt controller, supports multiple dies (the idea being that each die has an interrupt controller, so any die can trigger interrupts and needs support for assignment of interrupt vectors). But, still, cool.
Probably not a big shock to anyone here, but the A15 has a new interrupt controller that is explicitly built for multiple dies:
Prepping for the Mac Pro … and any other larger SOCs …
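A generic illustration of what "built for multiple dies" implies for software, not Apple's actual interrupt controller layout: with one controller per die, a global interrupt number has to encode both the die and the vector within that die so the driver can route and assign vectors across dies.

/* Invented encoding for illustration; not Apple's real hardware interface. */
#include <stdio.h>

#define IRQS_PER_DIE 4096u   /* placeholder per-die vector space */

static unsigned global_irq(unsigned die, unsigned local_vector) {
    return die * IRQS_PER_DIE + local_vector;
}

static void decode_irq(unsigned g, unsigned *die, unsigned *local_vector) {
    *die = g / IRQS_PER_DIE;
    *local_vector = g % IRQS_PER_DIE;
}

int main(void) {
    unsigned die, vec;
    unsigned g = global_irq(1, 37);   /* vector 37 on the second die */
    decode_irq(g, &die, &vec);
    printf("global irq %u -> die %u, vector %u\n", g, die, vec);
    return 0;
}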