M3 Chip Generation - Discussion Megathread

deconstruct60 · Aug 12, 2023

leman said:
Isn’t the only difference between LPDDR5/X/T the operating frequency? I’d expect a controller that supports faster memory be backwards compatible with slower standards.

There are differences, but mainly indirect ones to support more speed: adjustments for read and write and the mode registers, and dropping 8 wide banks (which I don’t think Apple was using much anyway). So, more stuff to handle the quirks of running at higher speeds.

Taking LPDDR5 to the Next Level

community.cadence.com

AnandTech Forums: Technology, Hardware, Software, and Deals

www.anandtech.com

But yes… you basically need to implement effectively all of LPDDR5 to do 5X. Apple is far more focused on adding more memory controller paths/channels than on putting maximum dies on just one channel. For a “poor man’s HBM” you need more channels than capacity.

But there are pragmatic limits on just how wide they can go while keeping die size limited, along with costs (high-cost interposers for memory connections … lest they run into HBM-like costs).

The A series has a chopped-down GPU cluster anyway. If Apple needs to control bill-of-materials price creep, then going with a fallback LPDDR5 mode isn’t going to hurt much. Outside the phone space, though, Apple has a competition problem where staying throttled on plain 5 isn’t buying much of anything.

deconstruct60 · Aug 12, 2023

leman said:
This would mean moving from four-core clusters to six-core clusters which I personally find less likely.

Not necessarily. Apple puts a 'half' P-cluster into the A-series. Conceptually, they could toss a 'half' (2P) cluster into the mix (4 + 2 -> 6P). That way, if they don't actually need the extra two P cores, they can just turn them and the associated L2 cache completely off (or in some corner cases ... vice-versa; put 4+L2 to 'off', since a relatively low number of promotions out of the E cores may not be worth 4P cores). But yes, 6 cores clustered around an "even bigger" L2 ... not really a good way to manage complexity or save power. Just skip the "even bigger L2" though.

It would be suggestive that they ran into a space limitation on the die budget. ( similar to how the E cluster was chopped to 2E in the M1 Pro/Max. )

Confused-User · Aug 12, 2023

Alternatively they could move to a 3-core cluster. I originally wrote off that idea but (IIRC) name99 pointed out that this would increase cache available per core, which might be quite desirable. (Of course, increasing cache roughly negates the area advantage of N3. Doesn't mean Apple can't decide that that's what they need anyway.)

There is a rather poor article on wccftech today pointing out some new A17 geekbench results. Obviously suspect, but they say 31% better single-P-core performance... *isoclock*. That's really significant, if accurate, and roughly in line with my prediction from early this year. (Which would be nice, because I was way off on the timing!)

dgdosen · Aug 12, 2023

Confused-User said:
Alternatively they could move to a 3-core cluster. I originally wrote off that idea but (IIRC) name99 pointed out that this would increase cache available per core, which might be quite desirable. (Of course, increasing cache roughly negates the area advantage of N3. Doesn't mean Apple can't decide that that's what they need anyway.)

There is a rather poor article on wccftech today pointing out some new A17 geekbench results. Obviously suspect, but they say 31% better single-P-core performance... *isoclock*. That's really significant, if accurate, and roughly in line with my prediction from early this year. (Which would be nice, because I was way off on the timing!)

3269/7666! That would be good news for the M3...

leman · Aug 12, 2023

deconstruct60 said:
Not necessarily. Apple puts a 'half' P-cluster into the A-series. Conceptually, they could toss a 'half' ( 2P) cluster into the mix.

Sure, that’s also possible, but would be slightly odd in the context of Pro/Max chips. At any rate, I expect them to keep the 4-core configuration and maybe throw in an extra cluster for the Max.

Confused-User said:
There is a rather poor article on wccftech today pointing out some new A17 geekbench results. Obviously suspect, but they say 31% better single-P-core performance... *isoclock*.

That article also claims that A16 is clocked at 3.7GHz (it’s not), so their IPC estimates are a bit inflated. And the ratio of single to multi scores doesn’t really make sense to me. All in all, I doubt these results are real.

senttoschool · Aug 13, 2023

Confused-User said:
There is a rather poor article on wccftech today pointing out some new A17 geekbench results. Obviously suspect, but they say 31% better single-P-core performance... *isoclock*. That's really significant, if accurate, and roughly in line with my prediction from early this year. (Which would be nice, because I was way off on the timing!)

100% BS. 3,000 Twitter followers. Giving away free stuff in Twitter account.

Anyone can edit the HTML on a website to anything they want.

Confused-User · Aug 13, 2023

leman said:
That article also claims that A16 is clocked at 3.7GHz (it’s not), so their IPC estimates are a bit inflated. And the ratio of single to multi scores doesn’t really make sense to me. All in all, I doubt these results are real.

I agree, they're definitely off base, but I don't know by how much. I do think it's entirely plausible we'll see >30% improvement single-core in the A17. The big question, as I think I mentioned a few months ago in another thread, is whether the new cores are designed to scale up clocks more than the current ones (which most likely can't). +31% perf on the A17 is nice, even if you're boosting clocks from (I think) 3.46GHz to the article's 3.7GHz, giving you a 22.5% IPC bump. So... If they can get clocks up to 4.5GHz on desktops (uncharacteristic, but hardly implausible), you're now looking at a 22% boost from clocks, for a total of almost exactly 50% higher single-core performance, assuming cache/memory can keep up (it should).
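
The arithmetic in that paragraph can be sketched in a few lines of Python. Every input is a guess or a claim from this thread (the 3.46GHz A16 clock is a recollection, 3.7GHz is the clock the article assumed, 4.5GHz is purely a what-if), so treat the output as a sanity check rather than a prediction.

```python
# Inputs taken from the post above; all are guesses or claims, not measurements.
claimed_perf_gain = 1.31   # +31% single-core, per the leaked (suspect) scores
a16_clock_ghz = 3.46       # recollection of the real A16 clock
a17_clock_ghz = 3.70       # clock the article assumed for the test run
desktop_clock_ghz = 4.50   # hypothetical desktop clock, purely a what-if


def ipc_gain(perf_gain, old_clock, new_clock):
    """Split a performance ratio into its IPC part by dividing out the clock ratio."""
    return perf_gain / (new_clock / old_clock)


ipc = ipc_gain(claimed_perf_gain, a16_clock_ghz, a17_clock_ghz)
clock_boost = desktop_clock_ghz / a17_clock_ghz
total = ipc * clock_boost

print(f"IPC gain:    {ipc - 1:.1%}")
print(f"Clock boost: {clock_boost - 1:.1%}")
print(f"Combined:    {total - 1:.1%}")
```

With these inputs the IPC share lands at about 22.5% and the combined figure a little under 50%, which matches the rounding used above.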

That would be a major statement - 30+% faster than the fastest Intel and AMD cores out today. Put 12 of these in a chip along with 4 E cores and you've got a monster that will likely destroy the AMD7950 or the Intel 13900 by a wide margin. ("Likely" because it still depends on Apple's NoC making strides too - which I definitely expect.) And well it should, really, given the price of the Max.

How wide a margin? Well, scale the P cores up linearly by 20%, ignore the E cores, and you're looking at a total improvement of 1.8x. That's a reasonable approximation, though a really good NoC improvement could move it higher, and a poor one lower. Using the GB6 score of the M2Max, 14446, then that brings you to 26,000. That's ~25% higher than the Intel 13900 numbers, even better compared to AMD 7950. That would likely even stave off the next generation of chips from both... which will surely be necessary if the M4 takes another two years. :-/ Though I'm an optimist, I don't think it will.
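
As a rough check of that projection, here is the same back-of-envelope calculation in Python. The 1.5x and 1.2x factors are the assumptions argued above, and compounding them cleanly is itself an assumption; only the M2 Max score is a quoted number.

```python
# Inputs from the post above; the projection is a back-of-envelope estimate.
m2_max_gb6_multi = 14446    # GB6 multi-core score quoted for the M2 Max
single_core_gain = 1.5      # the ~50% single-core figure argued earlier
extra_core_gain = 1.2       # assumed 20% uplift from more P cores, E cores ignored


def project_multicore(baseline, per_core_gain, core_count_gain):
    """Multiply the baseline by both gains, assuming they compound cleanly."""
    return baseline * per_core_gain * core_count_gain


projected = project_multicore(m2_max_gb6_multi, single_core_gain, extra_core_gain)
print(f"Overall factor: {single_core_gain * extra_core_gain:.2f}x")
print(f"Projected GB6 multi-core: {projected:,.0f}")
```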

Will they actually do this? I don't know, but we're a lot closer to finding out than we were last time I played with these numbers.

BTW, as an aside, the GB6 multicore numbers look like crap for all processors - they show pretty poor scaling. Is that reasonable? I dunno... Presumably they're indicative of real-world test cases, but you have to hope that the stuff that really needs all that CPU is going to do better. And I don't just mean ZIP...

Confused-User · Aug 13, 2023

Confused-User said:
BTW, as an aside, the GB6 multicore numbers look like crap for all processors - they show pretty poor scaling. Is that reasonable? I dunno... Presumably they're indicative of real-world test cases, but you have to hope that the stuff that really needs all that CPU is going to do better. And I don't just mean ZIP...

Further on this, that means that the multicore numbers I projected could be a little optimistic - I scaled linearly for the extra 2 P cores and ignored the E cores. But even so I don't think that changes the shape of my prediction.

It all comes down to whether or not they designed the P cores to scale up clocks substantially compared to the M2. They've clearly been working on that. But are they going to deliver in the M3? I am considering holding my breath.

Edit: Actually, going from 6P4E to 8P4E (33% more P, 0% more E) translates to about a 17% bump on GB6. Going from 8P4E to 12P4E is 50% more P, so that suggests rather better than 20% effective increase - except, this won't scale linearly. NoC improvements will tend to drive numbers higher, while the larger number of cores drives the number lower (scaling is never linear and falls off with increasing core counts, typically). Still, my spitballed 20% bump in GB6 multicore due to extra cores is looking a lot more reasonable in this light, and 25% isn't all that unlikely.
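
A small sketch of the reasoning in that edit: derive how much of the added P-core capacity showed up in the score for 6P4E to 8P4E, then reuse that fraction for 8P4E to 12P4E. Reusing the same efficiency is an assumption, and as noted it ignores both NoC improvements and the usual fall-off at higher core counts.

```python
def scaling_efficiency(core_increase, score_increase):
    """Fraction of added P-core capacity that shows up in the score."""
    return score_increase / core_increase


# 6P4E -> 8P4E: 2 more P cores on 6, about a 17% GB6 gain (figure from the post).
observed = scaling_efficiency(core_increase=2 / 6, score_increase=0.17)

# 8P4E -> 12P4E: 4 more P cores on 8. Reusing the same efficiency is an
# assumption; real scaling usually falls off as core counts grow.
projected_gain = (4 / 8) * observed

print(f"Observed efficiency: {observed:.2f}")
print(f"Projected gain for 12P4E: {projected_gain:.1%}")
```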

leman · Aug 13, 2023

Confused-User said:
I agree, they're definitely off base, but I don't know by how much. I do think it's entirely plausible we'll see >30% improvement single-core in the A17.

Oh, I agree, but that’s just common sense and extrapolating from past developments. It doesn’t mean that the numbers are real. Anyone can draw a bunch of numbers like that.

Confused-User said:
The big question, as I think I mentioned a few months ago in another thread, is whether the new cores are designed to scale up clocks more than the current ones (which most likely can't)

Yep, that’s the big question. Apple’s 5nm family is not designed for vertical scaling. Will the 3nm family be? We’ll see…

Confused-User said:
BTW, as an aside, the GB6 multicore numbers look like crap for all processors - they show pretty poor scaling. Is that reasonable? I dunno... Presumably they're indicative of real-world test cases, but you have to hope that the stuff that really needs all that CPU is going to do better. And I don't just mean ZIP...

GB6 changed how multicore scores are computed. Before, GB used to run copies of tasks on multiple cores (naive MC estimation); GB6 now uses multiple cores to cooperate on solving a single task. There are pros and cons to each approach, and these kinds of benchmarks can be very difficult to implement properly. Personally, I agree with the GB6 authors that the new methodology is a better match for personal computers, while the naive MC is more typical of server/some HPC use.
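
To illustrate why the two approaches scale so differently, here is a toy model in Python. It is not how Geekbench computes its scores: Amdahl's law is swapped in as a stand-in for "cores cooperating on one task", and the 10% serial fraction is a made-up value chosen only for illustration.

```python
def independent_copies_speedup(cores):
    """One task copy per core: throughput grows with core count (ideal case)."""
    return float(cores)


def shared_task_speedup(cores, serial_fraction):
    """Cores cooperate on one task: Amdahl's law caps the speedup."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


SERIAL_FRACTION = 0.10  # hypothetical; not a Geekbench figure

for cores in (4, 8, 16, 64):
    naive = independent_copies_speedup(cores)
    shared = shared_task_speedup(cores, SERIAL_FRACTION)
    print(f"{cores:>2} cores: copies {naive:5.1f}x, shared task {shared:5.2f}x")
```

Under this model the shared-task speedup can never exceed 1 / SERIAL_FRACTION no matter how many cores are added, which is the kind of flattening seen on high-core-count chips.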

Regardless, it’s the ratio of MC/SC in the claimed A17 leak I was talking about. It doesn’t make much sense to me that the ratio would suffer with the new generation. This would mean that either Apple uses more aggressive SC boost or that the performance of E-cores has decreased. I find both to be unlikely.

Confused-User · Aug 13, 2023

leman said:
GB6 changed how multicore scores are computed. Before, GB used to run copies of tasks on multiple cores (naive MC estimation); GB6 now uses multiple cores to cooperate on solving a single task. There are pros and cons to each approach, and these kinds of benchmarks can be very difficult to implement properly. Personally, I agree with the GB6 authors that the new methodology is a better match for personal computers, while the naive MC is more typical of server/some HPC use.

Yes, I'd read about that at the time GB6 came out. I was just extending my reasoning, based on the observation about how poorly it scales. (And it's *bad*. The numbers for Intel and AMD server chips are ridiculous. Nobody would buy them if they were at all representative of what we do with those chips.)

leman said:
Regardless, it’s the ratio of MC/SC in the claimed A17 leak I was talking about. It doesn’t make much sense to me that the ratio would suffer with the new generation. This would mean that either Apple uses more aggressive SC boost or that the performance of E-cores has decreased. I find both to be unlikely.

I agree with your conclusion but you're missing the effect of the NoC as core counts climb. And not just CPU cores - in fact not just all the cores, rather all functional units on the NoC. If Apple were very lazy, and not working on improving their NoC, this ratio could be vaguely plausible. But that in itself isn't really any more plausible than your two discounted hypotheses. We know they have been, and there's a ton of patents out there showing improvements that haven't yet appeared in the M2.

Either way, note my edit to the previous comment, made after your post.

leman · Aug 13, 2023

Confused-User said:
I agree with your conclusion but you're missing the effect of the NoC as core counts climb. And not just CPU cores - in fact not just all the cores, rather all functional units on the NoC. If Apple were very lazy, and not working on improving their NoC, this ratio could be vaguely plausible. But that in itself isn't really any more plausible than your two discounted hypotheses. We know they have been, and there's a ton of patents out there showing improvements that haven't yet appeared in the M2.

A17 will have the same number of cores as A16, so not sure how your argument applies? What I meant is that on this “leak” the MC/SC ratio of A17 is a significant regression from A16, while historically this ratio has been fairly constant.

I also don’t think that Apple needs to increase the number of CPU cores. The phone and base M-series configurations are perfectly fine core-wise. Maybe high-end desktop could use a slight boost. In general given the massive advantage in perf-watt Apple holds I’d rather like them to increase the performance of individual cores + implement more flexible power curves.

souko · Aug 13, 2023

For multi-threaded tests, Geekbench 6 has all threads work on one shared task, while Geekbench 5 runs one task per thread. So you cannot expect the same scaling in Geekbench 6 as in Geekbench 5. Even when you compare A12 through A16, you get proportionally worse scaling in Geekbench 6 than in Geekbench 5.

Confused-User · Aug 13, 2023

leman said:
A17 will have the same number of cores as A16, so not sure how your argument applies? What I meant is that on this “leak” the MC/SC ratio of A17 is a significant regression from A16, while historically this ratio has been fairly constant.

I was talking about M3 all along. I didn't notice you were referring to A17... different story there, as you say.

leman said:
I also don’t think that Apple needs to increase the number of CPU cores. The phone and base M-series configurations are perfectly fine core-wise. Maybe high-end desktop could use a slight boost. In general given the massive advantage in perf-watt Apple holds I’d rather like them to increase the performance of individual cores + implement more flexible power curves.

Yes, agreed, at least about the phone. I wouldn't mind seeing a couple more P cores (or even E cores) on the base M3 but it's not critical, especially if we get the kind of IPC boost I expect, along with at least a modest clock bump as a possibility (higher top end in DVFS, as you're saying above).

name99 · Aug 13, 2023

Confused-User said:
I agree, they're definitely off base, but I don't know by how much. I do think it's entirely plausible we'll see >30% improvement single-core in the A17. The big question, as I think I mentioned a few months ago in another thread, is whether the new cores are designed to scale up clocks more than the current ones (which most likely can't). +31% perf on the A17 is nice, even if you're boosting clocks from (I think) 3.46GHz, giving you a 22.5% IPC bump. So... If they can get clocks up to 4.5GHz on desktops (uncharacteristic, but hardly implausible), you're now looking at a 22% boost from clocks, for a total of almost exactly 50% higher single-core performance, assuming cache/memory can keep up (it should).

That would be a major statement - 30+% faster than the fastest Intel and AMD cores out today. Put 12 of these in a chip along with 4 E cores and you've got a monster that will likely destroy the AMD7950 or the Intel 13900 by a wide margin. ("Likely" because it still depends on Apple's NoC making strides too - which I definitely expect.) And well it should, really, given the price of the Max.

Apple can (and is, IMHO, far more likely to) achieve most of a 30% speedup through IPC improvements. There remains so much that can be done to boost IPC. I've listed many ideas in many different places, but options include
- marking some kinds of lines (eg instruction, page table) as critical, and treating them appropriately (maximum QoS when loading, higher priority in retaining them in caches). Known patent.
- zero content cache for L1 or L2 (low area low power cache that only stores lines that are zero). Known patent for something slightly more general.
- marking instructions as critical, and scheduling based on criticality rather than age
- use schemes like Long Term Parking to allow shunting aside instructions that are predicted to wait for DRAM
(This requires virtualization of registers and LSU slots, but almost all of that machinery is already in place)
- placing a queue between Decode and Allocate. Allows Decode to easily scale from 8 to 10 wide while not requiring scaling up of the (much more difficult to scale) Allocate stage
- more aggressive op-fusion. (Was of limited value when Decode and Allocate width are equal, but of much more value with wider Decode)
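
Of the ideas in that list, the zero content cache is the easiest to sketch. The Python below is only a toy model of the concept as described (a structure that keeps tags for all-zero lines and no data); the class name, the 64-byte line size, and the FIFO eviction are assumptions for illustration, not details from any patent.

```python
LINE_BYTES = 64  # assumed line size for the sketch


class ZeroContentCache:
    """Tracks only the addresses of lines known to be all zero; stores no data."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tags = {}  # insertion-ordered dict used as a simple FIFO of line addresses

    def fill(self, line_addr, data):
        """Record the line only if every byte is zero; returns True when recorded."""
        if any(data):
            self.tags.pop(line_addr, None)  # line is no longer zero, forget it
            return False
        if line_addr not in self.tags and len(self.tags) >= self.capacity:
            self.tags.pop(next(iter(self.tags)))  # evict the oldest entry
        self.tags[line_addr] = True
        return True

    def lookup(self, line_addr):
        """A hit can synthesize the zero line without reading any data array."""
        return bytes(LINE_BYTES) if line_addr in self.tags else None


zc = ZeroContentCache(capacity=2)
zc.fill(0x1000, bytes(LINE_BYTES))                  # all-zero line: recorded
zc.fill(0x2000, b"\x01" + bytes(LINE_BYTES - 1))    # non-zero line: ignored
print(zc.lookup(0x1000) == bytes(LINE_BYTES), zc.lookup(0x2000))
```

Because only tags are kept, each entry costs a few bytes instead of a full line, which is where the "low area low power" claim comes from.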

There are also obvious ideas available to improve the E-cores. One cute one is Seznec's Omnipredictor, which uses the TAGE storage to also predict indirect branches and load-store aliasing; not quite as accurate as three state of the art predictors, but a whole lot less area.

And the NoC improvements are already known. Both the "wire layout" and the coherence protocol have had substantial redesigns, so I'm not worried on that score.

Confused-User · Aug 13, 2023

name99 said:
Apple can (and is, IMHO, far more likely to) achieve most of a 30% speedup through IPC improvements. There remains so much that can be done to boost IPC.[...]

You misunderstand- I was saying that getting 30% from clocks is plausible *after* they take the big win on IPC. I said that if the score quoted was accurate, then after factoring in their error about the A16's clock, we were looking at a 22.5% IPC improvement from A16 to A17. Then take that as the core going into the M3, bump up speed to 4.5GHz, and you have a roughly 50% improvement in single-core over the M2.

Again, I'm not saying they're going to do that. I'm saying they probably could. Of course it's also possible they're so intent on optimizing the cores for the iphone case (low power) that the M3 still won't be able to run anywhere near that fast. We might not know for a while, since the first meaningful indication will be when the M3 Studio ships (or maaaaaybe the Mini Pro).

name99 said:
And the NoC improvements are already known. Both the "wire layout" and the coherence protocol have had substantial redesigns, so I'm not worried on that score.

Right, up through the Max I think it's a no-brainer they're going to be fine. I'm more curious about whether they've managed to make significant advances with the Ultrafusion interconnect. I'd bet on it, but I don't feel certain.

leman · Aug 13, 2023

name99 said:
Apple can (and is, IMHO, far more likely to) achieve most of a 30% speedup through IPC improvements. There remains so much that can be done to boost IPC. I've listed many ideas in many different places, but options include
- marking some kinds of lines (eg instruction, page table) as critical, and treating them appropriately (maximum QoS when loading, higher priority in retaining them in caches). Known patent.
- zero content cache for L1 or L2 (low area low power cache that only stores lines that are zero). Known patent for something slightly more general.
- marking instructions as critical, and scheduling based on criticality rather than age
- use schemes like Long Term Parking to allow shunting aside instructions that are predicted to wait for DRAM
(This requires virtualization of registers and LSU slots, but almost all of that machinery is already in place)
- placing a queue between Decode and Allocate. Allows Decode to easily scale from 8 to 10 wide while not requiring scaling up of the (much more difficult to scale) Allocate stage
- more aggressive op-fusion. (Was of limited value when Decode and Allocate width are equal, but of much more value with wider Decode)

There are also obvious ideas available to improve the E-cores. One cute one is Seznec's Omnipredictor, which uses the TAGE storage to also predict indirect branches and load-store aliasing; not quite as accurate as three state of the art predictors, but a whole lot less area.

And the NoC improvements are already known. Both the "wire layout" and the coherence protocol have had substantial redesigns, so I'm not worried on that score.

Maynard, have you seen these recently published patents on prediction?

WIPO - Search International and National Patent Collections

Would be curious to hear your analysis and how these relate to the state of the art.

name99 · Aug 14, 2023

leman said:
Maynard, have you seen these recently published patents on prediction?

WIPO - Search International and National Patent Collections

WIPO - Search International and National Patent Collections

Would be curious to hear your analysis and how these relate to the state of the art.

Woah, those are really strange! I need some time to read through them fully and figure out what they mean.
As is frequently the case, there seem to be two separate ideas bundled into one patent. (I don't think this is malicious on anyone's part; it's just that engineers have limited time to talk to the lawyers, and the lawyers are not competent/confident to split a dump of "here's how we changed X" into two or more separate individual concepts.)

So the more obvious idea is to add to the Prefetch and Address Generation stages of the front end a dedicated "fully-biased" predictor to catch the (surprisingly common) cases of branches that always go in the same direction. [eg things like "if (divisor!=0)" or "if(error condition)" ]
This probably allows these address generation stages to be a little larger, and saves some power (we don't need to run the TAGE machinery subsequent to the address generation to confirm the prediction). It's "fairly obvious" (which doesn't mean I thought of it! but means once I see it, it's clearly "oh yeah, that makes sense, neat idea").

Along with that there's a very weird idea, unlike anything I have ever seen discussed anywhere else, for how to handle mispredicted branches. I really have to think a lot about how it might work; right now I have no real idea.
(One possibility is that it implements an idea I have suggested elsewhere of snipping out short segments of mispredicted code. Consider something like
if(condition){ three instructions }
The traditional prediction/fetch mechanism would, at address generation, effectively either predict the condition true and then load the trace of the three instructions, so the instruction stream looks like {conditional branch, three instructions, next instruction},
or predict it false and not load the trace, so the instruction stream looks like {conditional branch, next instruction}.

But that's suboptimal with an Apple-type design! Better is not to care about these sorts of branches (conditional followed by up to about 3 or 4 instructions). Just generate the instruction stream as {conditional branch, three instructions, next instruction} and let TAGE (which is a much better predictor than the Fetch Address predictor) decide the condition of the branch. Then mark the three instructions, if they are not to be executed, as NOPs, and drop them in the transfer between Instruction Queue and Decode.
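
A minimal sketch of that idea in Python, with made-up instruction names. The front end always fetches the fall-through path, and a stand-in for the main (TAGE-style) predictor decides later whether the short body executes; the `guarded` flag and function names are hypothetical, and this is not a description of any shipping design.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    guarded: bool = False  # True for the short body controlled by the branch


def fetch_stream():
    """Always fetch the fall-through path: branch, its short body, then the rest."""
    return [
        Instr("cond_branch"),
        Instr("body_1", guarded=True),
        Instr("body_2", guarded=True),
        Instr("body_3", guarded=True),
        Instr("next"),
    ]


def to_decode(stream, body_executes):
    """Drop the guarded body between instruction queue and decode when the
    (stand-in) main predictor says the body will not execute."""
    return [i.name for i in stream if body_executes or not i.guarded]


print(to_decode(fetch_stream(), body_executes=True))
print(to_decode(fetch_stream(), body_executes=False))
```

Either way the fetch address stream is identical, so the weaker fetch-address predictor never has to guess this branch.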

I don't believe Apple (or anyone else) does this yet, though they should! It's POSSIBLE that the mechanism they describe essentially does something like this. It seems more heavyweight, more overhead and energy than what I have described, but maybe there are some advantages to balance that? Anyway I need to look at this much more carefully.

How did you discover these? I have my ways of looking for new technical patents (basically following up on the names on previous patents) but it's not very streamlined or efficient for discovering new patents as they come in; so sometimes it's a year before I find something new.
And doing a search for all "new" Apple patents is hopeless, you get lost in a stream of design patents, or OS patents, or just general stuff of much less interest to me.

There is a new edition of my PDFs coming out soon. (The goal would be to release it just before the next iPhone event, we'll see. The main new stuff is a new GPU volume, plus a quick update to the previous 5 volumes covering every new patent I've discovered in the past year; the above two will of course go into that!)

kiranmk2 · Aug 14, 2023

Never mind all this CPU/GPU core counts, is the M3 finally going to move from 8/16/32 GB RAM configs to 12/24/48 GB RAM configs?

Confused-User · Aug 14, 2023

name99 said:
There is a new edition of my PDFs coming out soon. (The goal would be to release it just before the next iPhone event, we'll see. The main new stuff is a new GPU volume, plus a quick update to the previous 5 volumes covering every new patent I've discovered in the past year; the above two will of course go into that!)

Excellent! I thought you were done with those. I'm looking forward to the new volume. (And, I guess, the diffs on the rest.)

leman · Aug 14, 2023

name99 said:
Woah, those are really strange! I need some time to read through them fully and figure out what they mean.

What was particularly interesting to me is the notion of two types of execution units, one which can recover quickly from misprediction, and another one where recovery is much more costly. Could this be simply a way to optimize the die area by avoiding the costly rollback machinery on every execution path? But then again, I’m looking at this from a very naive amateur perspective.

name99 said:
How did you discover these?

I just developed a habit of scanning any new Apple patents twice per month. https://patentscope.wipo.int/search/en/search.jsf is great for this as they update their patent database weekly. It’s several hundred patents in every batch, but it takes just a few minutes to scan the titles and identify things related to chip design (I suppose one could also filter by category, but I didn’t have much luck with that). You can’t do the same with Google Patents because the database is updated more sporadically and also somewhat randomly.

name99 said:
There is a new edition of my PDFs coming out soon. (The goal would be to release it just before the next iPhone event, we'll see. The main new stuff is a new GPU volume, plus a quick update to the previous 5 volumes covering every new patent I've discovered in the past year; the above two will of course go into that!)

Nice, looking forward to it!

name99 · Aug 14, 2023

name99 said:
Woah, those are really strange! I need some time to read through them fully and figure out what they mean.
As is frequently the case, there seem to be two separate ideas bundled into one patent. (I don't think this is malicious on anyone's part; it's just that engineers have limited time to talk to the lawyers, and the lawyers are not competent/confident to split a dump of "here's how we changed X" into two or more separate individual concepts.)

So the more obvious idea is to add to the Prefetch and Address Generation stages of the front end a dedicated "fully-biased" predictor to catch the (surprisingly common) cases of branches that always go in the same direction. [eg things like "if (divisor!=0)" or "if(error condition)" ]
This probably allows these address generation stages to be a little larger, and saves some power (we don't need to run the TAGE machinery subsequent to the address generation to confirm the prediction). It's "fairly obvious" (which doesn't mean I thought of it! but means once I see it, it's clearly "oh yeah, that makes sense, neat idea").

Along with that there's a very weird idea, unlike anything I have ever seen discussed anywhere else, for how to handle mispredicted branches. I really have to think a lot about how it might work; right now I have no real idea.
(One possibility is that it implements an idea I have suggested elsewhere of snipping out short segments of mispredicted code. Consider something like
if(condition){ three instructions }
The traditional prediction/fetch mechanism would, at address generation, effectively either predict the condition true and then load the trace of the three instructions, so the instruction stream looks like {conditional branch, three instructions, next instruction},
or predict it false and not load the trace, so the instruction stream looks like {conditional branch, next instruction}.

But that's suboptimal with an Apple-type design! Better is not to care about these sorts of branches (conditional followed by up to about 3 or 4 instructions). Just generate the instruction stream as {conditional branch, three instructions, next instruction} and let TAGE (which is a much better predictor than the Fetch Address predictor) decide the condition of the branch. Then mark the three instructions, if they are not to be executed, as NOPs, and drop them in the transfer between Instruction Queue and Decode.

I don't believe Apple (or anyone else) does this yet, though they should! It's POSSIBLE that the mechanism they describe essentially does something like this. It seems more heavyweight, more overhead and energy than what I have described, but maybe there are some advantages to balance that? Anyway I need to look at this much more carefully.

How did you discover these? I have my ways of looking for new technical patents (basically following up on the names on previous patents) but it's not very streamlined or efficient for discovering new patents as they come in; so sometimes it's a year before I find something new.
And doing a search for all "new" Apple patents is hopeless, you get lost in a stream of design patents, or OS patents, or just general stuff of much less interest to me.

There is a new edition of my PDFs coming out soon. (The goal would be to release it just before the next iPhone event, we'll see. The main new stuff is a new GPU volume, plus a quick update to the previous 5 volumes covering every new patent I've discovered in the past year; the above two will of course go into that!)

OK, I got it!

There are two, very different parts of the patents.
The first part takes advantage of the fact that many branches are strongly biased (essentially all instructions testing against errors of various sorts - NULL ptr, buffer overflow, etc). These branches take space in the expensive TAGE machinery and global branch history, but have no useful content.
So what the patent does is
(a) have a bias predictor (can be small and lightweight as these things go) associated with the path bringing instructions from L2 into L1I. This path already pre-decodes instructions (marks branches and suchlike) so we just reuse that machinery. We mark branches that are strongly biased.
(b) subsequently no prediction machinery bothers with strongly biased branches. They are treated either as NOPs (untaken) or unconditionally taken. We don't waste TAGE space on them, global history bits, or anything else.
This happy state lasts until one of those biased branches mispredicts, in which case we change its setting in the bias predictor and in the L1I.

Net result is substantial energy savings for TAGE and substantial area savings for TAGE at the same prediction performance (or increased performance if we don't shrink the TAGE buffers).
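
The first part can be modelled roughly as follows. This Python sketch is only a reading of the description above: the class, the three state names, and the "assume biased on first sighting" rule are assumptions, and the real mechanism lives in the L2-to-L1I predecode path rather than in a lookup table like this.

```python
class BiasFilter:
    """Per-branch state: 'taken' or 'not_taken' while the branch looks biased,
    'dynamic' once it has been seen going both ways."""

    def __init__(self):
        self.state = {}  # branch PC -> "taken" | "not_taken" | "dynamic"

    def predict(self, pc, fallback_predictor):
        bias = self.state.get(pc)
        if bias in ("taken", "not_taken"):
            return bias == "taken"      # no main-predictor lookup needed
        return fallback_predictor(pc)   # unknown or dynamic: use the main predictor

    def update(self, pc, taken):
        direction = "taken" if taken else "not_taken"
        bias = self.state.get(pc)
        if bias is None:
            self.state[pc] = direction  # first sighting: assume biased (assumption)
        elif bias != "dynamic" and bias != direction:
            self.state[pc] = "dynamic"  # bias broken: hand over to the main predictor


main_predictor_lookups = []


def main_predictor(pc):
    """Stand-in for TAGE; records each lookup so the savings are visible."""
    main_predictor_lookups.append(pc)
    return True


bf = BiasFilter()
bf.update(0x40, taken=False)               # e.g. an error check that never fires
print(bf.predict(0x40, main_predictor))    # answered by the bias filter alone
print(len(main_predictor_lookups))         # main predictor was not consulted
```

Every branch answered by the filter is one that takes no entry and no history bits in the main predictor, which is where the area and energy savings come from.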

The second part is very different; it's about branches that are high confidence in prediction. Obviously strongly biased branches have a high confidence in the prediction, but we may also be (and usually are) confident about non-biased branches. The TAGE machinery has a very natural way to provide the confidence of a branch as well as its outcome.
The issue is that (at least for M1 and M2) we have two branch handling pipelines, and both of them need a tiny amount of circuitry to "execute" the branch (test a flag, or whatever) and a large amount of circuitry to deal with a misprediction (in which case we need to send instructions to the front-end to flush everything queued up and start fetching from the correct address, to the ROB to flush various instructions and roll back to a checkpoint, to the middle-end to flush instructions, to the branch training/updating machinery, etc etc). Huge drama.
The insight of the patent is that MOST OF THE TIME this machinery is not required; most branches are correctly predicted. So why provide two sets of this recovery machinery?
Instead we provide a lightweight branch execution unit and a heavyweight unit. We preferentially send branches to the lightweight unit, only sending them to the heavyweight unit if we're not confident in our prediction. If the lightweight unit says the branch was mispredicted (hopefully a very uncommon event, requiring both that the predictor fail and that our confidence in the prediction be misplaced) then we resubmit the branch to the heavyweight unit.
This allows us to save a whole lot of area associated with the branch execution unit.
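The steering policy can be sketched in a few lines of Python. This is a toy model under stated assumptions: the 3-bit signed counter, the "saturated means confident" rule, and the names `Branch`, `Stats` and `run` are illustrative choices, not details from the patent.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    counter: int        # assumed 3-bit signed prediction counter, range -4..3
    actual_taken: bool


@dataclass
class Stats:
    light: int = 0      # branches executed on the lightweight unit
    heavy: int = 0      # branches executed on the heavyweight unit
    replays: int = 0    # lightweight mispredicts resubmitted to the heavyweight unit
    flushes: int = 0    # full misprediction recoveries


def run(branches):
    stats = Stats()
    for b in branches:
        predicted_taken = b.counter >= 0
        confident = b.counter in (-4, 3)        # assumed rule: saturated counter = high confidence
        mispredicted = predicted_taken != b.actual_taken
        if confident:
            stats.light += 1
            if not mispredicted:
                continue                        # common case: no recovery machinery needed
            stats.replays += 1                  # light unit can detect but not recover
        stats.heavy += 1
        if mispredicted:
            stats.flushes += 1                  # only the heavy unit drives the flush
    return stats


trace = (
    [Branch(3, True)] * 90      # confident and correct
    + [Branch(3, False)] * 2    # confident but wrong: replayed on the heavy unit
    + [Branch(0, False)] * 3    # low confidence and wrong
    + [Branch(-1, False)] * 5   # low confidence but correct
)
s = run(trace)
print(s.light, s.heavy, s.replays, s.flushes)   # 92 10 2 5
```

With this made-up trace the heavyweight unit handles only 10 of 100 branches, at the cost of two replays, which is the trade the patent is making: extra latency on a rare double failure in exchange for not duplicating the recovery logic.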

My suspicion is that, as part of this nice reduction in the area of a branch execution unit, for the M3 we might see three branch execution units: two lightweight, and one heavyweight.
And if there is a common theme running through both of these patents, it's "support for a wider machine"...
I've suggested in various places ways in which Apple could bump the effective width of the A17/M3 to 10-wide without paying much of an area or frequency cost, and these patents, while by themselves very nice additions to say an M1 or M2 class machine, also very much fit into the sort of thing you would need to do if you wanted to widen the M3.

name99 · Aug 14, 2023

leman said:
What was particularly interesting to me is the notion of two types of execution units, one which can recover quickly from misprediction, and another one where recovery is much more costly. Could this be simply a way to optimize the die area by avoiding the costly rollback machinery on every execution path? But then again, I’m looking at this from a very naive amateur perspective.

I just developed a habit of scanning any new Apple patents twice per month. https://patentscope.wipo.int/search/en/search.jsf is great for this as they update their patent database weekly. It’s several hundred patents in every batch but it just takes a few minutes to scan the titles and identify things related to chip design (I suppose one could also filter by category but I didn’t have much luck with that). You can’t do the same with Google Patents because the database is updated more sporadically and also somewhat randomly.

I answered/explained this in my first reply, above.

Many thanks for the link! I'll try that going forward. And feel free to let me know of anything you see that looks especially interesting. I may have missed it!

koyoot · Aug 14, 2023

name99 said:
Apple can (and are, IMHO) far more likely to achieve most of a 30% speedup by IPC improvements. There remains so much that can be done to boost IPC. I've listed many ideas in many different places, but options include

Considering that smaller nodes are heading into diminishing-returns territory for performance and clock-speed improvements, everybody, including AMD, Nvidia, and Intel, has to face the reality that the only way left to get meaningful perf improvements is IPC increases in their architectures.

leman · Aug 14, 2023

name99 said:
I answered/explained this in my first reply, above.

Thanks! Nice to see that Apple continues to do it with smarts rather than brute force.

name99 said:
Many thanks for the link! I'll try that going forward. And feel free to let me know of anything you see that looks especially interesting. I may have missed it!

Will do! Btw, here is another one in case you didn’t see it. It’s Apple’s “poor man’s ECC RAM” (I think there was some speculation about this type of thing on rwt some months ago): https://patentscope.wipo.int/search/en/detail.jsf?docId=US403904970&_cid=P12-LLBQX0-80508-2

scottrichardson · Aug 15, 2023

Assuming these RAM options?

M3 - 12GB / 24GB / 36GB
M3 Pro - 24GB / 36GB / 48GB
M3 Max - 48GB / 96GB / 144GB
M3 Ultra - 48GB / 96GB / 144GB / 288GB
