Yeah, maybe Apple should just leave the three-level crap to Android.
They need to have SOMETHING to crow about after all...
I would think the difference in energy usage would be substantial. LPDDR4X needs a minimum of 0.6V (maybe 1.1V) to drive more than 100 data + address lines for the A15 running at over 2GHz. I would think the voltage used inside the A15 would be in terms of mV? So the wattage used would be a lot less if it's in the A15, even if the L3 cache is running at CPU speed at 3.2GHz.
While reducing main memory traffic reduces power draw, increasing the amount of on-chip cache (and cache traffic) represents an increase in power draw, and any gain is only in the delta between the two, which certainly hasn't seemed to be anywhere near this scale before, or even necessarily a net gain. And it isn't as if doubling the SLC at these sizes makes a world of difference in hit rate either, typically, so the actual difference producing the delta can't be all that much to begin with.
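A back-of-envelope sketch of that delta argument, in C, with made-up placeholder numbers (not measured A15 or LPDDR4X figures): the saving is whatever the extra hits no longer spend at the DRAM pins, minus whatever the larger cache itself costs.

/* All numbers below are invented placeholders for illustration only. */
#include <stdio.h>

int main(void) {
    double e_dram_nj = 2.0;         /* assumed energy per 64-byte LPDDR4X access */
    double e_slc_nj  = 0.2;         /* assumed energy per 64-byte on-chip SLC hit */
    double extra_hits_per_s = 50e6; /* assumed extra hits gained by doubling the SLC */
    double extra_slc_mw = 30.0;     /* assumed extra leakage + lookup power of the bigger SLC */

    /* Power saved at the memory interface by the extra hits. */
    double saved_mw = extra_hits_per_s * (e_dram_nj - e_slc_nj) * 1e-9 * 1e3;
    /* Net gain is only the delta between that saving and the cache's own cost. */
    double net_mw = saved_mw - extra_slc_mw;

    printf("saved at DRAM: %.0f mW, net after cache cost: %.0f mW\n", saved_mw, net_mw);
    return 0;
}

With different (equally plausible) placeholder numbers the net figure shrinks or goes negative, which is the point being made about the delta.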
Which is, by the way, something that is generally underappreciated about Apple's hardware effort, and one of the reasons I am so excited by it: they simply don't give a damn about fads or fashion. They don't optimize for popular benchmarks, they don't do things "because all the other kids do it", and they are not afraid to invest in power savings instead of performance despite the PR hit. They simply do a very honest job designing some damn good hardware without projecting much drama. Which is very rare in the industry nowadays.
Yeah, maybe Apple should just leave the three-tier core crap to Android.
They need to have SOMETHING to crow about after all...
Apple already has lots of what you could call U-cores. They're tiny microcontroller-class ultra low power Arm cores dedicated to specific low-level jobs. They are perhaps not exactly what you were thinking of, in that the macOS and iOS schedulers don't run on them. They aren't part of the main general purpose SMP computing resources. Instead, each of these embedded cores runs its own instance of a bare-bones realtime OS, RTKit, and communicates with a macOS driver through a mailbox interface.
Pretty sure he's talking about M-cores, which seem like they could be useful. However, as a non-electrical engineer, I find M-cores a bit conceptually boring.
What about U-cores that are ultra-low power? For doing things like tasks during sleep (dealing with WiFi, pre-loading emails, siphoning RAM to SSD before switching to hibernate, reporting 'Find My' locations while shut down, etc.)
Apple already has lots of what you could call U-cores. They're tiny microcontroller-class ultra low power Arm cores dedicated to specific low-level jobs. They are perhaps not exactly what you were thinking of, in that the macOS and iOS schedulers don't run on them. They aren't part of the main general purpose SMP computing resources. Instead, each of these embedded cores runs its own instance of a bare-bones realtime OS, RTKit, and communicates with a macOS driver through a mailbox interface.
The Asahi Linux project published a writeup about one of these in this blog post:
Progress Report: August 2021 - Asahi Linux
asahilinux.org
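For anyone wondering what "communicates through a mailbox interface" means in general terms, here is a deliberately generic sketch in C of a shared-memory ring plus doorbell. The struct layout and names are invented for illustration; this is not Apple's actual RTKit/mailbox protocol (see the Asahi writeup above for the real thing).

/* Generic doorbell + ring-buffer mailbox sketch; not Apple's real layout. */
#include <stdint.h>
#include <stdio.h>

#define RING_SLOTS 16

struct mbox_msg { uint64_t endpoint; uint64_t payload; };

struct mailbox {
    volatile uint32_t doorbell;          /* host sets this to signal new work */
    uint32_t head, tail;                 /* producer / consumer indices */
    struct mbox_msg ring[RING_SLOTS];    /* messages in shared memory */
};

/* Host (kernel driver) side: enqueue a message and ring the doorbell. */
static void host_send(struct mailbox *mb, struct mbox_msg m) {
    mb->ring[mb->head % RING_SLOTS] = m;
    mb->head++;
    mb->doorbell = 1;
}

/* Coprocessor (firmware) side: drain the ring when the doorbell is set. */
static void firmware_poll(struct mailbox *mb) {
    if (!mb->doorbell)
        return;
    while (mb->tail != mb->head) {
        struct mbox_msg m = mb->ring[mb->tail % RING_SLOTS];
        mb->tail++;
        printf("firmware got endpoint %llu, payload %llu\n",
               (unsigned long long)m.endpoint, (unsigned long long)m.payload);
    }
    mb->doorbell = 0;
}

int main(void) {
    struct mailbox mb = {0};
    host_send(&mb, (struct mbox_msg){ .endpoint = 1, .payload = 42 });
    firmware_poll(&mb);
    return 0;
}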
Look at it this way.
I would think the difference in energy usage would be substantial. LPDDR4X needs a minimum of 0.6V (maybe 1.1V) to drive more than 100 data + address lines for the A15 running at over 2GHz. I would think the voltage used inside the A15 would be in terms of mV? So the wattage used would be a lot less if it's in the A15, even if the L3 cache is running at CPU speed at 3.2GHz.
The longer the data and address lines of the LPDDR4X modules stay in a high-impedance state, the less power will be expended.
If you could get such improvements from doubling the SLC, everyone would do it, particularly since everyone other than Apple would have to invest far fewer transistors. Andrei points out that Qualcomm's 888 only has 3MB of SLC and the Exynos 2100 seems to have 6MB (so he is chewing on these issues); it would be a total no-brainer.
Yeah, I know Apple already has dedicated modules; I probably didn't explain what I meant well enough. I would assume lots of that unlabelled misc area in the die shot is made up of these random little dedicated modules. I was more pointing out that what these little ultra-low power accelerators do, and could do in the future, is likely a lot more interesting than the addition of an extra 'M-core' tier that just does what the P-cores could do with less power, or what the E-cores do, but faster.
Apple already has lots of what you could call U-cores. They're tiny microcontroller-class ultra low power Arm cores dedicated to specific low-level jobs.
Well, I threw it out there as a question, not as a suggestion. As I originally noted, I would need to profile actual code in quite some detail to figure out if it would be useful in any way. My gut, though, is that an even lower power core might make sense. It should be easy to reduce power in an "ultra efficient" core by around 30% (by looking at what Apple is doing in comparison with what Qualcomm is doing). The net performance of such a core might be, say, several times less than the current low power cores. Say they can fit two of them in the same space as a low power core. Are there enough threads running around on iOS and macOS that are just background threads that monitor things and are not performing useful calculations, such that a couple of these ultra efficient cores would be worth it (by removing workload from the other cores)? I don't know.
In your three-tier hypothetical, would you see Apple adding a smaller, more efficient E-core (making the current E-core the new M-core)? Or adding a new M-core in between their current E- and P-cores? Or a bigger, more hungry P-core (making the current P-core the new M-core)? The last one seems unlikely for a phone SOC (though I suppose if they went with something like the X1, where there's only one in the SOC, maybe not too bad …), so I'm guessing one of the first two? Are the potential gains worth it from a complexity/scheduling POV? I mean, Android does it, and Apple held off on big-little until they finally did it, but is adding a third tier worth it?
I am guessing, but I would imagine that the Apple design team has an excellent store of instruction traces. I would bet that it is those traces that they are tuning their designs to, and not synthetic benchmarks.
Which is, by the way, something that is generally underappreciated about Apple's hardware effort, and one of the reasons I am so excited by it: they simply don't give a damn about fads or fashion. They don't optimize for popular benchmarks, they don't do things "because all the other kids do it", and they are not afraid to invest in power savings instead of performance despite the PR hit. They simply do a very honest job designing some damn good hardware without projecting much drama. Which is very rare in the industry nowadays.
Just curious, since we're talking about maximum efficiency cores, is the branch predictor necessary at all? Since every mis-predicted branch means 'wasted' energy, and we're already discussing an in-order processor, wouldn't a significant amount of energy (potentially) be saved by waiting for the jump instruction to finish execution? Or would that be unreasonably slow for a modern processor, even in a high efficiency core?
Not really, at least I surmise not. There is far more to be gained by differentiating microarchitecture. For example, a completely in-order core with no hardware multiplier, no reservation stations, etc. can have far fewer pipe stages and a lot denser hardware, which means the wires are shorter, the branch prediction miss penalty is much lower, etc., which brings much higher power efficiency for instruction streams that don't need to worry about speed.
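A quick sketch of why the shorter pipeline matters here: the cycles lost to branches scale with branch frequency × miss rate × flush penalty, so a crude predictor in a 4-ish stage in-order core can cost about the same as a great predictor in a 15-ish stage out-of-order core. The rates and depths below are assumed round numbers, not figures for any real core.

/* Assumed round numbers for illustration only. */
#include <stdio.h>

int main(void) {
    double branch_freq = 0.20;  /* ~1 in 5 instructions is a branch (assumed) */

    /* Deep out-of-order core: very accurate predictor, expensive flush. */
    double deep_loss    = branch_freq * 0.03 * 15.0;
    /* Shallow in-order core: cruder predictor, cheap flush. */
    double shallow_loss = branch_freq * 0.10 * 4.0;

    printf("cycles lost per instruction to mispredicts: deep %.3f, shallow %.3f\n",
           deep_loss, shallow_loss);
    return 0;
}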
I would guess that would be far too slow, especially because you can implement a very simple branch predictor that doesn't cost much energy at all. For example, just assume all backward branches are taken and forward are not. That probably works 80% of the time.
Just curious, since we're talking about maximum efficiency cores, is the branch predictor necessary at all? Since every mis-predicted branch means 'wasted' energy, and we're already discussing an in-order processor, wouldn't a significant amount of energy (potentially) be saved by waiting for the jump instruction to finish execution? Or would that be unreasonably slow for a modern processor, even in a high efficiency core?
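A minimal sketch of that static rule (backward taken, forward not taken) in C; the trace here is invented, just a loop-style branch plus an error-check-style branch:

/* Static backward-taken / forward-not-taken predictor over a made-up trace. */
#include <stdio.h>
#include <stdbool.h>

struct branch { long pc, target; bool taken; };

int main(void) {
    struct branch trace[] = {
        {0x120, 0x100, true}, {0x120, 0x100, true}, {0x120, 0x100, true},
        {0x120, 0x100, false},                       /* loop finally exits      */
        {0x200, 0x240, false}, {0x200, 0x240, false},
        {0x200, 0x240, true},                        /* rare error path taken   */
    };
    int n = sizeof trace / sizeof trace[0], correct = 0;

    for (int i = 0; i < n; i++) {
        bool predict_taken = trace[i].target < trace[i].pc;  /* backward => taken */
        if (predict_taken == trace[i].taken)
            correct++;
    }
    printf("static BTFN predictor: %d/%d correct\n", correct, n);
    return 0;
}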
Oh. I had never looked at real-life misprediction rate data for branch predictors. I imagined for some reason that they were much much worse at predicting which branch was taken. Makes sense to keep them then.
I would guess that would be far too slow, especially because you can implement a very simple branch predictor that doesn't cost much energy at all. For example, just assume all backward branches are taken and forward are not. That probably works 80% of the time.
If you can afford a tiny bit more power, just keep, say, 6 registers with target (or source) addresses and one extra bit that keeps track of whether that target was branched to (or source branched from) last time. A tiny bit more power? Increment or decrement a counter for each of those addresses.
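A sketch of that table in C: six entries keyed by branch address, each holding a small saturating counter instead of the single bit. The replacement policy and trace are deliberately naive, just to show the mechanism:

/* Tiny dynamic predictor: a handful of address-tagged saturating counters. */
#include <stdio.h>
#include <stdbool.h>

#define ENTRIES 6

struct entry { long pc; int counter; bool valid; };  /* counter: 0..3 */

static struct entry table[ENTRIES];

static bool predict(long pc) {
    for (int i = 0; i < ENTRIES; i++)
        if (table[i].valid && table[i].pc == pc)
            return table[i].counter >= 2;            /* weakly/strongly taken */
    return false;                                    /* unknown branch: not taken */
}

static void update(long pc, bool taken) {
    for (int i = 0; i < ENTRIES; i++) {
        if (table[i].valid && table[i].pc == pc) {
            if (taken  && table[i].counter < 3) table[i].counter++;
            if (!taken && table[i].counter > 0) table[i].counter--;
            return;
        }
    }
    /* Not tracked yet: claim a slot (very naive replacement: always slot 0). */
    table[0] = (struct entry){ pc, taken ? 2 : 1, true };
}

int main(void) {
    bool outcomes[] = { true, true, true, false, true, true, true, false };
    int n = sizeof outcomes / sizeof outcomes[0], correct = 0;
    for (int i = 0; i < n; i++) {
        if (predict(0x120) == outcomes[i]) correct++;
        update(0x120, outcomes[i]);
    }
    printf("tiny counter table: %d/%d correct\n", correct, n);
    return 0;
}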
Oh. I had never looked at real-life misprediction rate data for branch predictors. I imagined for some reason that they were much much worse at predicting which branch was taken. Makes sense to keep them then.
Ah, this stuff is sooo cool.
Yeah, I guessed loops were the reason. Still, I find the fact that most conditional branches are loops incredibly amusing. It makes sense, obviously, since a for loop will have N jump instructions while ifs only jump once. But in the codebases I have worked with (iOS dev here), explicit for loops are exceedingly rare, while if statements are commonplace. I guess that's why I thought branch predictors would perform much worse.
The vast majority of conditional branches are loops:
x = 0
A: x = x + 1
…
if x < 100 go to A
So most of the time, if the branch is backwards, it is a taken branch (because the loop counter hasn’t hit its final value). That fun fact actually gets you pretty far.
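The same loop written as C, just to make the arithmetic explicit: the backward branch at the bottom is taken 99 times and falls through once, so "backward means taken" is right 99% of the time for this one branch.

/* Count how often the loop's backward branch is taken vs not taken. */
#include <stdio.h>

int main(void) {
    int taken = 0, not_taken = 0;
    for (int x = 0; x < 100; ) {
        x = x + 1;                 /* A: loop body */
        if (x < 100) taken++;      /* branch back to A */
        else         not_taken++;  /* fall through, loop exits */
    }
    printf("backward branch: taken %d, not taken %d\n", taken, not_taken);
    return 0;
}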
Yeah, I guessed loops were the reason. Still, I find the fact that most conditional branches are loops incredibly amusing. It makes sense, obviously, since a for loop will have N jump instructions while ifs only jump once. But in the codebases I have worked with (iOS dev here), explicit for loops are exceedingly rare, while if statements are commonplace. I guess that's why I thought branch predictors would perform much worse.
Obviously loops must be everywhere in the compiled code, but they're often hidden inside higher-order functions such as map, filter, etc., which are preferred (by most, at least; I'm not yet convinced. I suspect people unknowingly nest or unnecessarily repeat loops when coding this way, and I'm not sure the compiler is smart enough to optimize all that away).
Well, I threw it out there as a question, not as a suggestion. As I originally noted, I would need to profile actual code in quite some detail to figure out if it would be useful in any way. My gut, though, is that an even lower power core might make sense. It should be easy to reduce power in an "ultra efficient" core by around 30% (by looking at what Apple is doing in comparison with what Qualcomm is doing). The net performance of such a core might be, say, several times less than the current low power cores. Say they can fit two of them in the same space as a low power core. Are there enough threads running around on iOS and macOS that are just background threads that monitor things and are not performing useful calculations, such that a couple of these ultra efficient cores would be worth it (by removing workload from the other cores)? I don't know.
That's effectively the role of the A55. The problem is that while they sip power and do indeed take up next to no space, even compared to Apple's E-core, they're also so slow that Apple's E-core is in the end more power efficient overall. Thus the only thing you end up saving by using such U-cores, as opposed to simply adding more Apple-style E-cores, is die area. That's not nothing: die area is expensive and becoming increasingly so as nodes progress. But, so far, Apple seems willing to pay if they need to.
ARM does have new in-order A510 designs coming out that are better, but a lot of their focus is still die area. However, in the article about the A510, Andrei notes that ARM themselves make a similar point to yours: sure, if you benchmark these littlest cores independently they look like crap, but in normal "real life" scenarios, handling the kinds of small, integer-heavy background workloads they're meant for while the rest of the cores are active, they are actually the most efficient way to fulfill that role, not just in die area but also in energy efficiency. Trouble is, those tests are internal to ARM, so they're kind of hard to verify.
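To put rough numbers on the "so slow they're less efficient overall" point: energy for a fixed chunk of work is power × time, so a core that draws 30% less power but runs several times slower can still burn more energy per task. The figures below just reuse the hypothetical 30% / 3x numbers from the earlier post; they are not measurements of the A55 or Apple's E-core.

/* Hypothetical numbers only: 30% lower power, 3x lower performance. */
#include <stdio.h>

int main(void) {
    double e_power = 1.00, e_perf = 1.0;        /* Apple-style E-core baseline      */
    double u_power = 0.70, u_perf = 1.0 / 3.0;  /* hypothetical ultra-efficient core */

    /* Relative energy for the same amount of background work = power * time. */
    double e_energy = e_power / e_perf;
    double u_energy = u_power / u_perf;

    printf("relative energy per task: E-core %.2f, U-core %.2f\n", e_energy, u_energy);
    return 0;
}

Of course this ignores the scenario ARM describes, where the little cores run alongside already-active big cores, which is exactly why their internal numbers are hard to verify from outside.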
Yep. I've never profiled code for Arm, so I can only guess. I would imagine, for example, that Windows on x86 would benefit greatly from this sort of thing, because Windows has a tendency to have a lot of threads sitting around just polling. (Or at least it did 10 years ago; not sure about today.) Getting those threads out of the way of more important threads may well be worth it.
While not quite as extreme as adding small in-order cores, that’s also Intel’s big bet with Alder Lake.
Well, it sounds like the driver, not the interrupt controller, supports multiple dies (the idea being that each die has an interrupt controller, so any die can trigger interrupts and needs support for assignment of interrupt vectors). But, still, cool.
Probably not a big shock to anyone here, but the A15 has a new interrupt controller that is explicitly built for multiple dies:
Prepping for the Mac Pro … and any other larger SOCs …
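A generic illustration of what "built for multiple dies" implies for software, not Apple's actual interrupt controller layout: with one controller per die, a global interrupt number has to encode both the die and the vector within that die so the driver can route and assign vectors across dies.

/* Invented encoding for illustration; not Apple's real hardware interface. */
#include <stdio.h>

#define IRQS_PER_DIE 4096u   /* placeholder per-die vector space */

static unsigned global_irq(unsigned die, unsigned local_vector) {
    return die * IRQS_PER_DIE + local_vector;
}

static void decode_irq(unsigned g, unsigned *die, unsigned *local_vector) {
    *die = g / IRQS_PER_DIE;
    *local_vector = g % IRQS_PER_DIE;
}

int main(void) {
    unsigned die, vec;
    unsigned g = global_irq(1, 37);   /* vector 37 on the second die */
    decode_irq(g, &die, &vec);
    printf("global irq %u -> die %u, vector %u\n", g, die, vec);
    return 0;
}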