Ah, sorry about that, I was certain they include the query in the link. At any rate, you can use the following query —
Code:
FP:(neural) AND ALLNAMES:(Apple)
Some of that was converted to emoji 🫢
The original discrepancy was/is between the M3 number Apple used, and the A17 number. This flew in the face of everything that had come before, from the introduction of the ANE in 2017, and then A12/A12X+; A14/M1+; and A15/M2+ -- each of those generations had the same ANE in the base iPhone and the iPad/Mac variants.

This is one of these claims that is endlessly repeated on the internet. But repetition doesn't make it true.
I keep waiting for the evidence of this supposed 2x INT8 throughput and I never see it. [...] F16 and I8 results mostly run on ANE, and we usually see very little difference either between F16 and I8, or not much beyond the 10% or so from usual tweaks and higher GHz from A16 to A17. There's a lot of variation, sometimes explicable, usually not, but nothing that suggests a pattern of ANE I8 runs at 2x ANE F16. Certainly nothing that supports these claims.
I continue to hold that the most likely meaning of Apple's claim of doubled TOPs in A17 is basically a redefinition of TOPs in reaction to other companies, not redefining I8 vs F16 but redefining what counts as an "op" to include who knows what other elements that the ANE can perform in parallel with FMAC's - maybe activations?, maybe pooling and other planar ops?, maybe applying some standardized scaling to the timing of some standardized benchmark?
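One way to see how a redefinition of "op" can double a headline number without the silicon changing is to write the accounting out. A rough sketch, with purely hypothetical unit counts and clocks (nothing below reflects the real ANE configuration):
Code:
# Illustrative only: the same hypothetical hardware quoted under three
# different op-counting conventions. All numbers are made up.

MAC_UNITS = 2048        # hypothetical number of multiply-accumulate units
CLOCK_HZ = 1.0e9        # hypothetical 1 GHz clock

# Convention 1: one FMAC = 2 ops (multiply + add), quoted at FP16.
fp16_tops = 2 * MAC_UNITS * CLOCK_HZ / 1e12

# Convention 2: quote "INT8 TOPS", counting each MAC lane as doing twice
# the work at 8-bit, even if the unit issues the same MACs per cycle.
int8_tops = 2 * fp16_tops

# Convention 3: also count a "free" fused op (activation, pooling, ...)
# performed alongside each FMAC, i.e. 3 counted ops per FMAC.
fused_tops = fp16_tops * 3 / 2

print(f"FP16-counted: {fp16_tops:.1f} TOPS")
print(f"INT8-counted: {int8_tops:.1f} TOPS")
print(f"FMAC+fused-counted: {fused_tops:.1f} TOPS")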
What makes inference not just trivial linear algebra is that after each linear step (eg a convolution or matrix multiplication) there is a non-linear step that maps the result through a non-linear function. Simplest of these is ReLU, but many different options exist, some "biologically inspired", some based on one hypothesis or other.

What do you mean by activations? Are you referring to its usage in ML or something else?
Look at volume 7 of my PDFs at name99-org/AArch64-Explore on GitHub.

I’d love to have a look at those NPU-related patents if you have the links handy.
I don't know what you are saying here.

The original discrepancy was/is between the M3 number Apple used, and the A17 number. This flew in the face of everything that had come before, from the introduction of the ANE in 2017, and then A12/A12X+; A14/M1+; and A15/M2+ -- each of those generations had the same ANE in the base iPhone and the iPad/Mac variants.
The initial assumption was that it must be what you're saying here, Apple's marketing being forced to keep up with the Joneses, so to speak. Look at how Qualcomm presents this here, explaining the math (which is beyond me):
A guide to AI TOPS and NPU performance metrics
TOPS is a measurement of the potential peak AI inferencing performance based on the architecture and frequency required of the NPU. (www.qualcomm.com)
"The current industry standard for measuring AI inference in TOPS is at INT8 precision."
So Apple had to start doing that. Simple enough. We know that they used FP16 in the past, they say it explicitly here:
Deploying Transformers on the Apple Neural Engine
An increasing number of the machine learning (ML) models we build at Apple each year are either partly or fully adopting the [Transformer… (machinelearning.apple.com)
I'm sorry to say, however, that I may have been the one to introduce the idea of "support" for INT8. This was based on something Ryan Smith said in Anandtech re: the A17/M3 discrepancy in 2023: "The question is [...] whether the NPU in the M3 even supports INT8, given that it was just recently added for the A17. Either it does support INT8, in which case Apple is struggling with consistent messaging here, or it’s an older generation of the NPU architecture that lacks INT8 support."
But you'll notice that in his more recent mention of it, with regard to M4, he does not use the word "support" while saying, "freely mixing precisions, or even just not disclosing them, is headache-inducing to say the least" ... His failure to mention the question of architectural "support" for the lower 8-bit precision would seem to align with your argument, that it's not a thing, or at least that there's no evidence that it is a thing.
2023 M3: https://www.anandtech.com/show/2111...-family-m3-m3-pro-and-m3-max-make-their-marks
2024 M4: https://www.anandtech.com/show/21387/apple-announces-m4-soc-latest-and-greatest-starts-on-ipad-pro
If folks are still interested, Apple's own diffusion repo, with its highly optimized options, provides results for already-compressed models; it seems the A17 and A15 aren't much different (last column is iter/sec).
So it seems that they indeed just changed F16 to I8 for the TOPS calculation, as most NPU providers are doing at the moment.
Device | Compute Unit | Attention | End-to-End Latency (s) | Diffusion Speed (iter/s) |
iPhone 12 Pro | CPU_AND_NE | SPLIT_EINSUM | 116* | 0.50 |
iPhone 13 Pro Max | CPU_AND_NE | SPLIT_EINSUM | 86* | 0.68 |
iPhone 14 Pro Max | CPU_AND_NE | SPLIT_EINSUM | 77* | 0.83 |
iPhone 15 Pro Max | CPU_AND_NE | SPLIT_EINSUM | 31 | 0.85 |
iPad Pro (M1) | CPU_AND_NE | SPLIT_EINSUM | 36 | 0.69 |
iPad Pro (M2) | CPU_AND_NE | SPLIT_EINSUM | 27 | 0.98 |
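For anyone who wants to gather numbers of this kind themselves, the measurement is conceptually just: load the converted Core ML model pinned to CPU+ANE and time repeated predictions. A minimal coremltools sketch; the model path, input name and shape are placeholders, not Apple's actual benchmark harness:
Code:
# Placeholder sketch: time a converted Core ML model on CPU + Neural Engine.
import time
import numpy as np
import coremltools as ct

model = ct.models.MLModel(
    "Unet.mlpackage",                          # placeholder path to a converted model
    compute_units=ct.ComputeUnit.CPU_AND_NE,   # restrict execution to CPU + ANE
)

x = {"sample": np.random.rand(1, 4, 64, 64).astype(np.float32)}  # placeholder input

model.predict(x)                    # warm-up: first call loads/compiles onto the ANE
n = 20
t0 = time.perf_counter()
for _ in range(n):
    model.predict(x)
print(f"{n / (time.perf_counter() - t0):.2f} iterations/second")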
By me, I gather you mean Ryan Smith at Anandtech? I’m just the messenger. Are you saying Apple did not “freely mix precisions” (to paraphrase Smith) in the starkly different TOPS numbers cited for A17 (and now M4) versus M3?

I don't know what you are saying here.
We know how the ANE arithmetic unit works: it takes in FP16 or INT8 (or a combination of the two), multiplies, and accumulates to FP32 or INT32. I couldn't be bothered to look up the patent here, but it's one of the first ANE patents, and is mentioned in my volume 7.
It's not a hypothesis that INT8 exists, and it's not true that it was in any sense recently added. There are multiple lines of evidence, from the patent trail to performance data, to CoreML APIs.
Apple are always reluctant to say anything even slightly technical about their hardware, not least because idiots then use the slightest technical mis-statement as an excuse for a lawsuit. So I wouldn't spend too much effort on the sort of parsing you're engaged in.
With the ANE it's even more complex than with other NPUs because ANE supports mixed precision, namely one of the inputs to the FMAC can be FP16 while the other is INT8 (that's one reason they have a patent).
So do you count this as an INT8 OP or an FP16 OP? You can see how stupid this counting becomes (even before the fact that, as just mentioned above, and as I have pointed out multiple times, TOPS count has little correlation with exhibited inference performance).
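To make that concrete, here is a numpy emulation of the mixed-precision FMAC just described: one FP16 operand, one scaled INT8 operand, accumulation in FP32. This is a numerical illustration of the counting ambiguity only, not a model of the actual ANE datapath:
Code:
# Emulate an FP16 x INT8 multiply-accumulate with an FP32 accumulator.
import numpy as np

activations = np.random.randn(256).astype(np.float16)              # FP16 operand
w_int8 = np.random.randint(-128, 128, size=256, dtype=np.int8)     # INT8 operand
w_scale = np.float32(0.02)                                         # per-tensor weight scale

acc = np.float32(0.0)                                              # FP32 accumulator
for a, w in zip(activations, w_int8):
    acc += np.float32(a) * (np.float32(w) * w_scale)               # multiply, then accumulate

print(acc)   # is each of those 256 MACs an "INT8 op" or an "FP16 op"?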
I simply could not parse the set of sentences you provided to reach any sort of conclusion. Maybe that's because Ryan had nothing explicit to say? Or perhaps you were trying so hard to distance yourself from what Ryan was saying that the resultant levels of indirection became impenetrable?

By me, I gather you mean Ryan Smith at Anandtech? I’m just the messenger. Are you saying Apple did not “freely mix precisions” (to paraphrase Smith) in the starkly different TOPS numbers cited for A17 (and now M4) versus M3?
You can see in that earnings call that TSMC can't talk about their progress in advanced packaging without leaking their customers' plans, so they just ignore and/or deflect specific questions about it. One person asks about silicon interposers (pp. 12-13, the CEO's answer is "I'm not going to share with you because this is related to my customers' demand."), . . .
Well, all right. . . . You know the CoWoS-R, CoWoS-S, CoWoS-L, blah, blah, blah. All these kind of thing is because of our customers' requirement. . . . - C.C. Wei - TSMC CEO
I wonder, have you (people of this thread) come to any conclusions on why the “cadence” was lost and we got an M3 Max with RT last year but not an Ultra? Why is the M4 parallel to the M3, and what rumors are at hand for high-end chips for the Studio and Mac Pro, with a plausible timeline? I really hope for an Ultra (or better) desktop solution this fall, but is that even in the cards?
Just to be clear — it wasn’t rhetorical, it was a genuine question. I’m still confused — is the A17 Pro NPU twice as fast as the M3 NPU or not? I guess there’s something fundamental about the question that I don’t understand, because the answers I’m getting don’t make sense to me.

There is no evidence that the Apple NPU has different performance for different data types. There is at least some evidence that it has the same performance for both INT8 and FP16. I think a lot of the conversation here is going in circles…
It seems it is in INT8 (or they just changed how they calculate TOPS, from F16 to I8). In every Apple repo with their custom implementations of open-source ML models the difference seems to be minimal, so M3 -> A17 looks like just a TOPS-over-FLOPS difference.

Just to be clear — it wasn’t rhetorical, it was a genuine question. I’m still confused — is the A17 Pro NPU twice as fast as the M3 NPU or not? I guess there’s something fundamental about the question that I don’t understand, because the answers I’m getting don’t make sense to me.
I argue that the A17 chip is slightly faster, or that they could have modified the performance calculation method (e.g., TOPS vs. FLOPS, and I8 was always twice as fast as F16?). However, I lack the ability to test this claim in low-level code because there’s no public API that provides access to any low-level NPU calls, and exporting I8 and F16 consistently yields almost the same performance on Mac (M1 and M2), nowhere near twice the performance.

According to the currently available benchmarks and tests the A17 NPU is only marginally faster, if at all.
I don’t understand this. If you want to make an argument like that you‘d need to demonstrate actual performance difference at various precisions. Why do you expect INT8 to be any faster than FP16?
Furthermore, it’s possible that this performance difference is due to software limitations rather than hardware. Additionally, I don’t have access to an M4 or an A17 device to test any of these claims on the latest beta.
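For what it's worth, the comparison described above can be sketched with coremltools' weight-quantization path (API names as in coremltools 7.x, to the best of my recollection; the model path, input name and shape are placeholders). On M1/M2 this sort of test is exactly what keeps showing near-identical FP16 vs INT8 latency:
Code:
# Placeholder sketch: compare latency of an FP16 model vs the same model
# with weights linearly quantized to INT8, both pinned to CPU + ANE.
import time
import numpy as np
import coremltools as ct
import coremltools.optimize.coreml as cto

def bench(model, inputs, n=50):
    model.predict(inputs)                      # warm-up / ANE load
    t0 = time.perf_counter()
    for _ in range(n):
        model.predict(inputs)
    return (time.perf_counter() - t0) / n

fp16 = ct.models.MLModel("Model_fp16.mlpackage",        # placeholder path
                         compute_units=ct.ComputeUnit.CPU_AND_NE)

# Quantize weights to INT8; activations remain FP16 on the ANE.
cfg = cto.OptimizationConfig(
    global_config=cto.OpLinearQuantizerConfig(mode="linear_symmetric"))
int8 = cto.linear_quantize_weights(fp16, config=cfg)

x = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}   # placeholder input
print("FP16 s/iter:", bench(fp16, x))
print("INT8 s/iter:", bench(int8, x))
Of course, whether the quantized weights are actually executed as INT8 on the ANE, or simply decompressed to FP16 at load time, is exactly the part you cannot observe from this level, which is the point about the missing low-level API.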
As I wrote before, I don't argue, just speculate, as APPLE IS SAYING they doubled the speed of the NPU, not me man xD

Ok, in other words all evidence you have suggests that I8 and FP16 run at the same speed. What reason do you have then to argue that they are different?
The most recent patent you cite is from 2022, no?

Look at volume 7 of my PDFs at
GitHub - name99-org/AArch64-Explore (github.com)
I know about ML, but nice explanation for others 👍

What makes inference not just trivial linear algebra is that after each linear step (eg a convolution or matrix multiplication) there is a non-linear step that maps the result through a non-linear function. Simplest of these is ReLU, but many different options exist, some "biologically inspired", some based on one hypothesis or other.
These non-linear lookups are frequently called activations.
ReLU is trivial to implement in hardware. Some of the other activation functions are (if you implement the exact spec) fairly expensive. It's unclear what Apple does, especially on ANE. Certainly in early versions of ANE, ReLU was cheap and implemented in multiple places (so that whatever the previous step was - convolution, pooling, normalization, whatever, it was free to just add a ReLU after that if required).
The Apple Foundation Models for Language use a fairly complex activation function, SiLU (I believe this is biologically inspired) so I suspect that ANE has been upgraded to handle more generic activation functions (maybe something like hardware where you program ten points into a table and the hardware does a linear interpolation based on those ten points?) but probably still able to give you a "free" activation after every "serious" layer type.
(BTW what activation is optimal where is still an unsolved, not even really an understood, problem. Like so much in neural nets there's a massive amount of optimization still waiting to be applied even to fairly old networks, to truly understand what matters vs what computation can be simplified.)
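To illustrate the lookup-table idea above: exact ReLU and SiLU, plus a ten-breakpoint piecewise-linear approximation of SiLU. Purely a toy sketch of the interpolation approach, not a claim about what the ANE actually implements:
Code:
# ReLU, SiLU, and a 10-point piecewise-linear table approximation of SiLU.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)            # trivial in hardware

def silu(x):
    return x / (1.0 + np.exp(-x))        # x * sigmoid(x)

# Ten breakpoints, linear interpolation between them (clamped outside).
xs = np.linspace(-6.0, 6.0, 10)
ys = silu(xs)

def silu_lut(x):
    return np.interp(x, xs, ys)

x = np.linspace(-8.0, 8.0, 1000)
print("max abs error of the 10-point table:", np.max(np.abs(silu(x) - silu_lut(x))))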
Attempts to go beyond this (ie to train an already trained model on a large quantity of something new) result in catastrophic forgetting of the old material and (as far as I know) there is no robust way of "locking in" or "segregating" the earlier material to achieve something fairly reliably like how a human learns year after year.
So the question is: can you engage in continual modification of LoRA layers to achieve lightweight ongoing learning?
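For readers who haven't met LoRA, the question is easier to picture with the structure written out: the large pretrained weight matrix stays frozen and only a small low-rank correction is trained (or swapped out per task). A minimal numpy sketch with arbitrary sizes:
Code:
# LoRA-style layer: y = W x + B (A x), with W frozen and only A, B trained.
import numpy as np

d_in, d_out, rank = 512, 512, 8

W = np.random.randn(d_out, d_in) * 0.02   # frozen pretrained weights
A = np.random.randn(rank, d_in) * 0.01    # trainable low-rank factor
B = np.zeros((d_out, rank))               # trainable; zero init => no change at start

def lora_linear(x):
    return x @ W.T + (x @ A.T) @ B.T      # bias omitted for brevity

x = np.random.randn(4, d_in)
print(lora_linear(x).shape)               # (4, 512)
print(W.size, "frozen params vs", A.size + B.size, "trainable params")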
They've done GPU binning on iPhone-grade silicon before with the A15. The Pro model had 1 more GPU core (5 vs. 4) than the A15 in the regular phone. What would them doing different mask sets mean for the Pro chip, that they have bigger plans for it? When the M3 Pro went for a more density optimized design, that enabled them to go bigger with the M3 Max.

This part is not very convincing to me.
I don't think the CPU core area covers enough of the chip to make binning a viable path forwards. That is, you won't recover enough chips that are usable with one fewer CPU core working that you can supply all (or even much) of your needs for an A18 non-Pro.
It's also not clear to me what tier you'd establish for a non-Pro chip. 2P+4E seems like a minimum viable A18 these days. So what does the Pro look like? 3P (much less 4P) seems unlikely, you're verging on M4 territory. 6E seems possible, but the amount of space going to those two extra E cores isn't much, so the chances that a defect will be found within them is very small, again making binning not a viable strategy for recovering lost silicon (of course it's totally viable for market segmentation).
It's maybe a bit more plausible that the Pro would have more GPU cores than the non-Pro. Those are larger, making the viability of binning a bit better. Still not great though. There's also precedent for this, Apple having done it before.
I think if Apple wants to pay less for silicon, it'll make two mask sets, one for A18 and one for A18Pro. Will the savings be worth it? ...maybe. I think only they really know enough to even have a good opinion.
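The area argument above (a defect is unlikely to land in the two extra E cores) can be made concrete with a toy Poisson defect model. Every number below is invented for illustration; none are real A18 or TSMC figures:
Code:
# Toy Poisson yield model: P(no defect in area A) = exp(-D0 * A).
import math

D0 = 0.1                 # hypothetical defect density, defects per cm^2
die_area = 1.0           # hypothetical A18-class die, cm^2
e_pair_area = 0.02       # hypothetical area of two extra E cores, cm^2

p_any_defect = 1 - math.exp(-D0 * die_area)
p_defect_in_e_pair = 1 - math.exp(-D0 * e_pair_area)

print(f"dies with any defect:               {p_any_defect:.1%}")
print(f"dies with a defect in the E pair:   {p_defect_in_e_pair:.1%}")
# Only a small fraction of defective dies would be recoverable by fusing
# off the extra E cores, which is why binning alone can't supply the non-Pro.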
The GPU binning, as I said before, was likely as much a matter of market segmentation as of recovering chips due to defects. Although I think (I am not sure) that the GPU "cores" are rather larger than CPU P cores, so binning for a failure in one of them is more practically useful compared to binning for a defect in a single P core (as it recovers more chips that would otherwise be scrap).
If you can find later patents, by all means share.

The most recent patent you cite is from 2022, no?
As discussed a month or two ago, "failure" includes "runs too hot"/"leaks too much", not just "doesn't work at all".

If I recall correctly, it has been speculated that these chips were discarded based on power consumption (and of course market segmentation) rather than defects. As you rightly pointed out, it’s highly unlikely that a defect would fall on a GPU core.