Apple GPUs speculation: TBDR, RT and the benefit of wide ALUs

diamond.g · Dec 9, 2020

theorist9 said:
According to several reports, they don't. They knew a certain percentage of the chips were coming out with a bad GPU core. Thus, rather than throwing those out, they created an SKU for the Air with 7 GPU cores, and use them in that. The process is called "binning".

To the extent they offer different CPU/GPU core counts within each model in the upcoming generation of AS Macs, some (but not all) of that will likely result from binning as well.

So there should eventually be a way to see if the 8th core on the 7 core systems is actually bad or just hidden via firmware (like folks used to do with ATI and Nvidia cards back in the day).

@leman I found this link that shows that A14 FP32 rate is half of M1 FP32. https://www.realworldtech.com/forum/?threadid=197759&curpostid=197985 of course this test was written by a person and not sure if they are gonna give the code out for verification. What is also interesting is they state that M1 FP16 is the same speed as FP32, M1 currently doesn't support dual issue FP16.

Pressure · Dec 9, 2020

diamond.g said:
So there should eventually be a way to see if the 8th core on the 7 core systems is actually bad or just hidden via firmware (like folks used to do with ATI and Nvidia cards back in the day).

@leman I found this link that shows that A14 FP32 rate is half of M1 FP32. https://www.realworldtech.com/forum/?threadid=197759&curpostid=197985 of course this test was written by a person and not sure if they are gonna give the code out for verification. What is also interesting is they state that M1 FP16 is the same speed as FP32, M1 currently doesn't support dual issue FP16.

No, they are fused off nowadays.

diamond.g · Dec 9, 2020

Pressure said:
No, they are fused off nowadays.

Right, I know AMD/Nvidia/Intel got hip to that trick and fused the "binned" parts off.

leman · Dec 9, 2020

diamond.g said:
@leman I found this link that shows that A14 FP32 rate is half of M1 FP32. https://www.realworldtech.com/forum/?threadid=197759&curpostid=197985 of course this test was written by a person and not sure if they are gonna give the code out for verification. What is also interesting is they state that M1 FP16 is the same speed as FP32, M1 currently doesn't support dual issue FP16.

Yeah, I was about to post that here. That "K.K." fellow must be very smart and handsome

I will most likely publish the source code for the benchmark, but I have to clean it up and comment it a bit.

Anyway, I did some more benchmark mixing FP32 and FP16 computations (you will find it further down the realworldech thread) , and I am fairly convinced now that A14 has FP16 ALUs (that can also run FP32 but at half the speed) but the M1 has proper FP32 ALUs that can also run FP16 at the same speed. Would also explain why the M1 GPU is so large compared to the A14 (GPU on M1 takes much more space on the die than GPU on A14 when looking at the annotated pictures). This was quite a surprise for me, since I thought that A14 and M1 are practically the same thing... but it turns out Apple did implement some significant hardware changes for the Mac SoC.

Boil · Dec 9, 2020

leman said:
I don't think that is going to be enough bandwidth. For that kind info chip, you really want 200+GB/s... so something "HBM-like". Frankly, I am starting thinking that Apple will have a DYI-HBM with stacked LPDDR chips and very wide memory bus, with 8 memory controllers or more. They already stack RAM on top of the iPhone chip, so I don't see why it wouldn't be possible.

I don't think it's unprecedented. There are laptops shipping with large and hot GPUs. A mobile RTX 2080 is 545mm2 with a TDP of 80W in the Max-Q configuration. An Apple SoC with a 12+4 CPU and a 32-core GPU will probably have a combined TDP of around 60-70watts. Shouldn't be that much of a challenge to cool in the current 16" chassis.

For high-end configs (like the Mac Pro), I don't really see them doing monolithic chips — yields will probably be abysmal. But a multi-chip package, with CPU+GPU dies connected to a shared I/O+cache die (possibly stacked with RAM) — that should be doable. AMD does it with Zen3 and it seems to work just fine.

But than again, it's Apple we are talking about. I can totally see them using a 1000mm2 monolithic die that costs $1000 to make just to prove a point. Still cheaper than the Xeons and the GPUs they have to buy from a third party

However they do it, I hope Apple keeps some GPU capability in the "high-end" (32-Pcore/4-Ecore SoC w/32-core GPU & 32-core Neural Engine), allowing folks who need the PCIe slots (of which there will be fewer in the forthcoming smaller Mac Pro), meaning the audio guys mainly, to not need to fill one of the remaining PCIe slots with a GPU...

And with the reduced power needs for Apple's GPU (GPGPU) cores, maybe the extra beefy 4-slot heat sinks are not needed...?

I wish Apple would ditch the feet (and ever so overpriced optional wheels) for the (again, forthcoming) smaller Mac Pro, and maybe have the finish be that new matte black dealio rumored for the laptops...?

By dropping the feet & handles, and reducing the chassis from three to two fans, you have basically reduced the overall height by half...? Width would stay the same, but depth (front to back) might also be reduced...!

And the PSU can definitely be of a lower output...!

I lol'ed at your 1000mm2 mega monolithic die, and I could see Apple doing something like that...!

But I think I would rather see a MCM approach; something like a M1X main SoC packaged with Pcore & GPU chiplets & a bunch of HBM3; with LDDR5 / DDR5 slots in a tiered memory approach, and maybe some sort of implementation or something of resizeable BAR to bring any memory on Apple GPUs (64-core & 128-core variants) hanging off PCIe Gen5 x16 slots into the UMA "pool"...?!?

awesomedeluxe · Dec 9, 2020

leman said:
I don't think that is going to be enough bandwidth. For that kind info chip, you really want 200+GB/s... so something "HBM-like". Frankly, I am starting thinking that Apple will have a DYI-HBM with stacked LPDDR chips and very wide memory bus, with 8 memory controllers or more. They already stack RAM on top of the iPhone chip, so I don't see why it wouldn't be possible.

It would definitely be bandwidth constrained, but I do think 4x LPDDR5-6400 modules could reach 200+GB/s (barely). I added 50% to the speed of the two LPDDR4X modules in the Mac Mini to account for the higher speed rating of LPDDR5 and then doubled it to account for the extra channels. Pretty lazy math and I could be wrong.

leman said:
I don't think it's unprecedented. There are laptops shipping with large and hot GPUs. A mobile RTX 2080 is 545mm2 with a TDP of 80W in the Max-Q configuration. An Apple SoC with a 12+4 CPU and a 32-core GPU will probably have a combined TDP of around 60-70watts. Shouldn't be that much of a challenge to cool in the current 16" chassis.

This is a really useful comparison. I see where your head is at, but I think I can show you what I mean by unprecedented. Let's say for argument the M99 has a 65W TDP.

1. 65W over 3 square centimeters is actually much more heat dense than 80W over 5.5 square centimeters: 21.7 W/cm vs 14.5 W/cm.

2. Most of these machines using the RTX 2080 are four to five centimeters thick. That provides for a lot more cooling than the roughly three centimeter thick MBP16.

3. GPUs have a built-in advantage: because they work in parallel, they tend to distribute heat evenly.

4. Lastly, that 80W TDP number for the 2080 Max Q includes VRAM. We're already looking at a more heat-dense part with less space for cooling, and we haven't even added high bandwidth memory yet.

I am asking around about this btw, so I'll let you know if someone more knowledgeable than me weighs in. All I can really say right now is that I don't think anyone has put a part this heat-dense in an enclosure this small.

leman said:
For high-end configs (like the Mac Pro), I don't really see them doing monolithic chips — yields will probably be abysmal. But a multi-chip package, with CPU+GPU dies connected to a shared I/O+cache die (possibly stacked with RAM) — that should be doable. AMD does it with Zen3 and it seems to work just fine.

But than again, it's Apple we are talking about. I can totally see them using a 1000mm2 monolithic die that costs $1000 to make just to prove a point. Still cheaper than the Xeons and the GPUs they have to buy from a third party

I guess I can see it. Some kind of weird memory and I/O die designed to communicate with a separate GPU and CPU that have their own memory controllers. I don't know enough about memory to really put it in my mind's eye, though. Wouldn't separate GPU and CPU controllers be competing for access to memory channels?

Pressure said:
No, that’s 12 High Performance cores and 4 High Efficiency cores.

But let us be honest with ourselves.

I think the 16” MacBook Pro will get at maximum 8 High Performance cores, 4 High Efficiency cores and up to 16-core GPU (12- to 16-cores depending on configuration).

That's certainly where my head was at before the Bloomberg article. But I don't mind being more optimistic if that's what the evidence suggests. The 8+16 design idea was always rooted in the original Bloomberg APU leak - it just sort of takes that, assumes the die will be 150mm2 or less, and assumes Apple really likes multiples of four. Reasonable assumptions. But it also had the major failing that it would fall just short of the Radeon Pro 5600M.

The Bloomberg article presents a different, but fully coherent strategy: design a big APU, let the yield fall where it may, and bin it aggressively to cover the MBP14, MBP16, and the iMacs.

Pressure · Dec 9, 2020

awesomedeluxe said:
That's certainly where my head was at before the Bloomberg article. But I don't mind being more optimistic if that's what the evidence suggests. The 8+16 design idea was always rooted in the original Bloomberg APU leak - it just sort of takes that, assumes the die will be 150mm2 or less, and assumes Apple really likes multiples of four. Reasonable assumptions. But it also had the major failing that it would fall just short of the Radeon Pro 5600M.

The Bloomberg article presents a different, but fully coherent strategy: design a big APU, let the yield fall where it may, and bin it aggressively to cover the MBP14, MBP16, and the iMacs.

I can’t come up with a good reason as to why Apple would offer quadruple the performance in the next tier up. They can milk their position over the next few years instead with a reasonable approach and still offer 30-40% higher performance in multithreaded workloads or beefier hardware decode and encode capabilities for video and machine learning.

leman · Dec 9, 2020

awesomedeluxe said:
This is a really useful comparison. I see where your head is at, but I think I can show you what I mean by unprecedented. Let's say for argument the M99 has a 65W TDP.

Yeah, I see what you mean. It's an issue of initial heat transfer more than overall heat dissipation. No idea what modern vapor chambers are capable of...

By the way, when I say "65W for SoC", this of course includes RAM power usage.

awesomedeluxe said:
I guess I can see it. Some kind of weird memory and I/O die designed to communicate with a separate GPU and CPU that have their own memory controllers. I don't know enough about memory to really put it in my mind's eye, though. Wouldn't separate GPU and CPU controllers be competing for access to memory channels?

In this model neither CPU not the GPU have they own memory controllers. It's really the same implementation as in a monolithic SoC — CPU and GPU access the system-level cache and cache misses involve memory requests. You would just "unstack" the entire implementation over multiple chips. This is how Zen3 works. The obvious disadvantage is that this significantly increases the latency.

Also, I know I wrote that I don't see the going monolithic SoC for the Mac Pro, but then again I was thinking and I wrote this. I suppose in the end I don't really have a strong opinion. Knowing how perfectionist Apple is, I actually tend to think that they will go the monolithic die route, even if it's much more expensive. They will probably view chiplet design as a "hack". And they love their low-latency caches

EntropyQ3 · Dec 9, 2020

leman said:
But than again, it's Apple we are talking about. I can totally see them using a 1000mm2 monolithic die that costs $1000 to make just to prove a point. Still cheaper than the Xeons and the GPUs they have to buy from a third party

Just to keep your dreams realistic (*cough*), TSMC maximum reticle field size is 26 mm by 33 mm or 858 mm2. 😀
Yields 62 dies per 300mm wafer.

leman · Dec 9, 2020

EntropyQ3 said:
Just to keep your dreams realistic (*cough*), TSMC maximum reticle field size is 26 mm by 33 mm or 858 mm2. 😀
Yields 62 dies per 300mm wafer.

How much does the wafer usually cost?

diamond.g · Dec 9, 2020

Cause I am dense, @leman are you rikrak on B3D?

theorist9 · Dec 9, 2020

EntropyQ3 said:
Just to keep your dreams realistic (*cough*), TSMC maximum reticle field size is 26 mm by 33 mm or 858 mm2. 😀
Yields 62 dies per 300mm wafer.

What percent yield do they get with such a large die?

My coarse-grained calculation suggests the chance of getting a fully-working chip (i.e., the chance of getting a chip that doesn't need to be binned or discarded) decreases exponentially with die area:

Suppose the chance of fatal error (due to crystal defect, mask defect, etc.) is 0.1% per mm^2. Then the chance of having a fully-working chip would be q = 0.999^a, where a is the area in mm^2.

Thus, for a = 100, 200, 400, and 800 mm^2, respectively, the chance of producing a fully-working chip would be 90%, 82%, 67%, and 45%, respectively.

Of course, chips are heterogeneous, and errors may be correlated, and a calculation that incorporated such details would be much more complicated.

EntropyQ3 · Dec 10, 2020

leman said:
How much does the wafer usually cost?

That’s a really tricky one. I’ll give an answer, but first a list of reasons why the error bars are huge.
Pricing depends on where you are in the life cycle of a node, your order volume, your bargaining strength, the competitive landscape for the node you want to produce on, if there is a competitor (Samsung) that drastically drops prices to improve fab line utilization, ...
Any wafer cost you may see is just a snapshot that is dependent on all of the above. And that data is not public information but negotiated on a case by case basis, so any leaked number qualifies as a rumour.

All that said - TSMC 7nm, a bit under $10000, 5nm likely just over but it’s currently pretty much filled by Apple, and they don’t necessarily have the same deal as others with TSMC.

leman · Dec 10, 2020

EntropyQ3 said:
That’s a really tricky one. I’ll give an answer, but first a list of reasons why the error bars are huge.
Pricing depends on where you are in the life cycle of a node, your order volume, your bargaining strength, the competitive landscape for the node you want to produce on, if there is a competitor (Samsung) that drastically drops prices to improve fab line utilization, ...
Any wafer cost you may see is just a snapshot that is dependent on all of the above. And that data is not public information but negotiated on a case by case basis, so any leaked number qualifies as a rumour.

All that said - TSMC 7nm, a bit under $10000, 5nm likely just over but it’s currently pretty much filled by Apple, and they don’t necessarily have the same deal as others with TSMC.

I have played a bit with the die yield calculator using the Anandtech estimated defect rate of 1.2 per cm2, and the yield is basically 0%

diamond.g said:
Cause I am dense, @leman are you rikrak on B3D?

There goes my anonymity (no worries, I pretty much broke it myself, and that is a throwaway account anyway).

EntropyQ3 · Dec 10, 2020

theorist9 said:
Of course, chips are heterogeneous, and errors may be correlated, and a calculation that incorporated such details would be much more complicated.

Complex matters are complex. Throw in the ability to improve yields by being able to shut off some parts of the circuitry, such as CPU/GPU/NPU-cores, sometimes blocks of cache - but not others, and the picture gets awfully blurry for the armchair expertise.
We know that Apple already used those tactics on the A12x (one GPU cluster disabled) and by the time the A12z was launched, their yields were good enough that disabling a GPU-cluster didn’t provide any tangible benefit over simply discarding the odd defective die.
Also, the yield of their 5nm phone SoCs have to be pretty damn good, because there is no sign of any cutting down being used anywhere to improve yields.

With some not-up-to-date single figure of merit data for 5nm defect density, and no information about yield enhancing measures taken, nor perfect die size estimations for various hypothetical configurations we might be just as well off with decent extrapolations from 7nm products.

diamond.g · Dec 10, 2020

leman said:
I have played a bit with the die yield calculator using the Anandtech estimated defect rate of 1.2 per cm2, and the yield is basically 0%

There goes my anonymity (no worries, I pretty much broke it myself, and that is a throwaway account anyway).

Oh cool. The Apple GPU/CPU thread over there doesn’t get a lot of love. So thanks!

leman · Dec 10, 2020

diamond.g said:
Oh cool. The Apple GPU/CPU thread over there doesn’t get a lot of love. So thanks!

Yeah, it’s weird. There are so many smart and knowledgeable people out there, and I really feel humbled by the amount and quality of information - but that particular discussion is filled with people repeating some old dogmas and using ARM Mali (which is not even a TBDR architecture) to argue that “TBDR doesn’t work”. Anyway, it doesn’t seem like the folks there are interested in Apple GPUs. Which is a shame I think because they are so different. A GPU geek should find this alone exiting, regardless of their opinion towards Apple or their products.

diamond.g · Dec 10, 2020

leman said:
Yeah, it’s weird. There are so many smart and knowledgeable people out there, and I really feel humbled by the amount and quality of information - but that particular discussion is filled with people repeating some old dogmas and using ARM Mali (which is not even a TBDR architecture) to argue that “TBDR doesn’t work”. Anyway, it doesn’t seem like the folks there are interested in Apple GPUs. Which is a shame I think because they are so different. A GPU geek should find this alone exiting, regardless of their opinion towards Apple or their products.

I think most of the animosity is that Apple is very closed with regards to their architecture compare to nvidia and AMD.

leman · Dec 10, 2020

diamond.g said:
I think most of the animosity is that Apple is very closed with regards to their architecture compare to nvidia and AMD.

AMD is open, Nvidia not really. They are obfuscating a lot of their hardware details behind catchy marketing and Nvidia PTX itself is quite different from the real hardware. For example, Nvidia GPUs are VLIW designs, but Nvidia itself is carefully hiding this fact - probably because VLIW has the stigma of being “bad” or slow.

Edit: scrap that last part, I misremembered. Nvidia is not VLIW. They just use software scheduling.

jeanlain · Dec 11, 2020

The may be a silly question, but why aren't the geekbench Metal scores for the M1 higher, if the M1 has much better FP32 performance than the A14? Scores aren't even twice the A14 scores, despite the doubling of cores.

Pressure · Dec 11, 2020

jeanlain said:
The may be a silly question, but why aren't the geekbench Metal scores for the M1 higher, if the M1 has much better FP32 performance than the A14? Scores aren't even twice the A14 scores, despite the doubling of cores.

Performance rarely scales linearly. There will be some other bottleneck, like memory bandwidth or something else.

leman · Dec 11, 2020

jeanlain said:
The may be a silly question, but why aren't the geekbench Metal scores for the M1 higher, if the M1 has much better FP32 performance than the A14? Scores aren't even twice the A14 scores, despite the doubling of cores.

They are pretty much doubled, 9k vs 18k, but as Pressure says, FLOPS is not everything. It also depends on how the test itself is set up, how much data needs to be processed etc. Also, it depends on how the benchmark is timed. If they only measure the compute kernel time and exclude the data synchronization time, a dGPU with dedicated RAM will most certainly end up with faster scores, even if a unified memory system like M1 might end up faster in practice.

My guess is that GB5 benchmarks are almost entirely bandwidth limited. That would explain a bunch of things, such as 1650 being almost twice as fast in GB5 despite having the same compute throughout, and M1 itself being only twice as fast as A14. What I am still curious about however is why iPad is faster than the iPhone in some selected tests, that does t make any sense to me.

awesomedeluxe · Dec 11, 2020

theorist9 said:
What percent yield do they get with such a large die?

My coarse-grained calculation suggests the chance of getting a fully-working chip (i.e., the chance of getting a chip that doesn't need to be binned or discarded) decreases exponentially with die area:

Suppose the chance of fatal error (due to crystal defect, mask defect, etc.) is 0.1% per mm^2. Then the chance of having a fully-working chip would be q = 0.999^a, where a is the area in mm^2.

Thus, for a = 100, 200, 400, and 800 mm^2, respectively, the chance of producing a fully-working chip would be 90%, 82%, 67%, and 45%, respectively.

Of course, chips are heterogeneous, and errors may be correlated, and a calculation that incorporated such details would be much more complicated.

This makes for a fun way to test the viability of the strategy we saw in Bloomberg. Just test a 300mm2 die to see if 16+32 will actually print.

But do we know anything about where N5 yield is right now? Just searching for "N5 Yield" or "TSMC Yield 5nm" only tells me about yield projections from last summer. I don't have a source for this, so take it with a grain of salt, but I heard from some AMD engineers that TSMC kind of overstated their N7 yield. So I'm not sure how much stock I put in articles from last summer saying they were at .11 per cm2 and improving. But I'll use it since it's all I've got.

Screen Shot 2020-12-11 at 11.42.24 AM.png

This is pretty low, but far from abysmal. It helps that around 40% of the die would be GPU cores, and a lot of the defects in that area would be salvageable.

Pressure · Dec 11, 2020

The price for a 300mm N5 wafer is allegedly $16,988.

That brings it down to a mere $130 per chip at those yields.

leman · Dec 11, 2020

Pressure said:
The price for a 300mm N5 wafer is allegedly $16,988.

That brings it down to a mere $130 per chip at those yields.

Even if it costs Apple $500 to manufacture one MacBook Pro 16” SoC, it’s still significantly cheaper than buying chips from Intel and AMD. Not to mention that they are also saving costs on the main board as well. They could live with worse yields and still make a good profit over the Intel Macs.

Apple GPUs speculation: TBDR, RT and the benefit of wide ALUs

macrumors G5

macrumors 603

macrumors G5

macrumors Core

macrumors 68040

macrumors 6502

macrumors 603

macrumors Core

macrumors 6502a

macrumors Core

macrumors G5

macrumors 601

macrumors 6502a

macrumors Core

macrumors 6502a

macrumors G5

macrumors Core

macrumors G5

macrumors Core

macrumors 68020

macrumors 603

macrumors Core

macrumors 6502

macrumors 603

macrumors Core

Our Staff