
Confused-User

macrumors 6502a
Oct 14, 2014
850
984
I'm relatively unfamiliar with display hardware at a low level, so I have some questions...

At the moment, it would be prohibitively expensive to buffer an entire frame on-chip. A 6k monitor would require minimally 60MB (more likely 80MB, allowing for 10bpc). This isn't going to change in the next 5 years due to process technology; SRAM scaling is effectively dead, and eDRAM is not likely to be used. However, the industry is moving towards chiplets and complex packaging, so that leads me to wonder if it might not be feasible over time, even for the base Mx chips, to stack enough RAM on top of the base chip to hold entire frame buffers, much like AMD is stacking cache RAM on their X3D CPUs.

Is this likely to be a practical possibility in a few years? ISTM that this would dramatically lower DRAM utilization in the idle case, and would be a not insubstantial reduction in bandwidth use in all cases (~5 GB/s for a 6K60 monitor).
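For anyone who wants to check my arithmetic, here's a quick sketch. The resolution is the 6016x3384 of the Pro Display XDR; the bytes-per-pixel figures are my assumptions about the framebuffer format, not anything Apple has confirmed:

```python
# Back-of-the-envelope framebuffer math for a 6K panel (6016 x 3384).
# Bytes-per-pixel values are assumptions: 3 B/px for packed 8bpc RGB,
# 4 B/px for a 10bpc format such as RGB10A2.
width, height, refresh_hz = 6016, 3384, 60
pixels = width * height  # ~20.4 Mpx

for label, bytes_per_px in [("8bpc (3 B/px)", 3), ("10bpc (4 B/px)", 4)]:
    frame_mb = pixels * bytes_per_px / 1e6        # one frame, in MB
    scanout_gb_s = frame_mb * refresh_hz / 1e3    # continuous scan-out, GB/s
    print(f"{label}: {frame_mb:.0f} MB/frame, {scanout_gb_s:.1f} GB/s at {refresh_hz} Hz")

# -> ~61 MB and 3.7 GB/s for 8bpc, ~81 MB and 4.9 GB/s for 10bpc,
#    which is where the ~60-80 MB and ~5 GB/s figures above come from.
```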

Of course if you're going to do this you might want to move the entire controller off the base chip anyway, which might open up other possibilities.

ISTM that Apple has a big advantage over other architectures if they wanted to do this, because they'd have an easier time knowing when they'd need to copy from system RAM into the display buffer. Assuming they don't make the stacked-chiplet buffer the only buffer - I'm not clear on whether double-buffering would be better or worse here. (I mean, if the CPU or GPU is writing to the display, why even bother writing to main memory, if you have this buffer handy on-chip?)

As you can see I'm well out of my comfort zone here, so if this is utterly stupid feel free to say so, though I'd prefer to know why.
I'm a little disappointed nobody has answered me. :-( Is it because the idea is so dumb? @name99, @mr_roboto, anyone else?
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
I'm a little disappointed nobody has answered me. :-( Is it because the idea is so dumb? @name99, @mr_roboto, anyone else?
Stacked chiplet DRAM is technically feasible but it seems like an awful lot of trouble to go to just for a frame buffer. If you were Apple and you wanted to do it, you'd just do it for all the DRAM, not merely the frame buffer.

It's not a new idea. About 10 years ago JEDEC published standards for "Wide IO" DRAM, where DRAM die are stacked on top of logic die using through-silicon vias for interconnect. The 'wide' is because this interface specified wide data busses clocked at lower speeds compared to traditional DRAM interfaces, made possible by the much smaller size of TSV interconnect compared to traditional DRAM wire-bond pad structures. (Similar to how Apple's Ultra Fusion bridge saves power by using many thousands of connections clocked at a slow speed.) There was never much adoption, probably because it costs a lot more to build than ordinary DRAM.
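A rough way to see the wide-and-slow trade-off in numbers (the bus widths and transfer rates below are illustrative, roughly in the spirit of first-gen Wide IO rather than exact JEDEC figures):

```python
# Illustrative only: a wide, slow TSV bus and a narrow, fast package bus can
# deliver the same bandwidth; the width/rate figures are assumptions, not
# exact JEDEC numbers.
def bandwidth_gb_s(bus_bits: int, transfers_per_s: float) -> float:
    return bus_bits / 8 * transfers_per_s / 1e9

wide_slow   = bandwidth_gb_s(512, 200e6)   # ~12.8 GB/s: 512-bit TSV interface at 200 MT/s
narrow_fast = bandwidth_gb_s(32, 3200e6)   # ~12.8 GB/s: 32-bit interface at 3200 MT/s
print(wide_slow, narrow_fast)

# Per-pin dynamic power scales roughly with C * V^2 * f, and TSVs have far less
# capacitance than bond wires/package traces while also toggling less often --
# that's where the power advantage of the wide, slow approach comes from.
```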
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
In the real world, for any given feature, the answer will always be "both", because the two are not separable.
Nope, in the real world, they are separable. There are many examples. Companies will often produce a single silicon die to cover an entire product line, and then disable features on it to create their lower-end models. When a company disables features on those lower-end models, it's not doing so to optimize their functionality. In fact, it's doing the opposite. It's reducing functionality to create product segmentation, thus explicitly separating what's optimum for the product (allowing that functionality to remain) from what it believed was optimum for its business (disabling it). I.e., this is an action taken purely for a business reason, even though it runs counter to product optimization considerations.

Now you may argue that Apple not offering an additional feature is different from disabling an existing feature. Yes, but this case still falls under the same general concept I'm articulating: Suppose this additional feature would only add marginally to the cost, yet provide a significant benefit. Thus if Apple were operating based purely on product optimization considerations, they would offer it. But suppose that doing so would hurt their product segmentation strategy, and they thus decided not to. That would be a clear case of Apple doing something purely for business reasons (product segmentation), while running counter to product optimization.

I'm not saying Apple did this for either one reason, or the other, or both. I'm simply saying all three are possibilities. As I stated before, I don't know which of these applies—only Apple does.

They set a budget for chip size (cost, really, but it's mostly the same thing, until the advent of chiplets) and then decide what to spend that budget on. Sometimes someone will go to bat with management to argue that a certain feature is worth increasing the budget for. Sometimes they might even get their way.

In your terms, the budget is a "business reason". Deciding what to spend it on is "product optimization".
Nope, you misunderstand the terms as I've defined them. You seem to be presuming to lecture me (including presuming to tell me how the "real world" works) on what you don't understand. Take another look at my posts about this.
 
Last edited:
  • Like
Reactions: nateo200

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
The proper way to think about this is that GPUs use power to draw into buffers residing somewhere in RAM. They aren't too closely coupled to how many displays happen to be around. While an external display is showing a static image that's not being recomputed at all each refresh interval, it needs 0W of additional GPU power.

This is true (or can be) even for PC-style discrete GPUs, but you might not think it since I've long heard of dumb driver/firmware stuff producing the illusion of something else. For example, you might run into driver/firmware stacks that force the GPU to clock higher as long as there's more than 1 display attached, even while there's no actual demand for more GPU throughput.
My 2014 15" MBP thermally throttled more readily (indicating significantly more power consumption) when it was driving three externals instead of one, i.e., a total of four displays instead of two (one external through each of the two TB ports, and the third external via the HDMI port). As you say, the images the monitors are displaying are mostly static.

When at least one external display is connected, this MBP automatically uses the dGPU (an NVIDIA GeForce GT 750M w/ 2 GB GDDR5). Thus based on what you wrote, this sounds like an example of poorly-designed "driver/firmware stacks that force the GPU to clock higher as long as there's more than 1 display attached, even while there's no actual demand for more GPU throughput."

If so, the Apple engineers were at fault for not catching this during testing. But as to how they ended up with this design in the first place, would this have been the fault of Apple, NVIDIA, or both plus a product of bad communication/collaboration between the two?
 

Confused-User

macrumors 6502a
Oct 14, 2014
850
984
Stacked chiplet DRAM is technically feasible but it seems like an awful lot of trouble to go to just for a frame buffer. If you were Apple and you wanted to do it, you'd just do it for all the DRAM, not merely the frame buffer.

It's not a new idea. About 10 years ago JEDEC published standards for "Wide IO" DRAM, where DRAM die are stacked on top of logic die using through-silicon vias for interconnect. The 'wide' is because this interface specified wide data busses clocked at lower speeds compared to traditional DRAM interfaces, made possible by the much smaller size of TSV interconnect compared to traditional DRAM wire-bond pad structures. (Similar to how Apple's Ultra Fusion bridge saves power by using many thousands of connections clocked at a slow speed.) There was never much adoption, probably because it costs a lot more to build than ordinary DRAM.
I vaguely recall reading about Wide I/O. It's basically HBM, but even slower and wider, IIRC. I think even for M chips, heat dissipation would likely be too challenging for them to use this for all DRAM.

I was suggesting a much more specific optimization, though: a much smaller RAM element mounted only on the display controller part of the chip, which presumably generates less heat than a GPU or CPU core. I know this isn't feasible now, but it may well be a few years down the road as packaging tech continues to improve.
 

Analog Kid

macrumors G3
Mar 4, 2003
9,360
12,603
By the way, what really shocks me is the size of the thunderbolt controllers. They are crazy large.
I agree they're a good fraction of the die, but just comparing to other things around it they don't seem out of proportion. They're not huge compared to the DDR controller, for example. And TB4 is a pretty complex agglomeration of technologies-- PCIe, video controllers, USB controllers, bus mastering, I'm pretty sure they'd have a microcontroller embedded to handle negotiations...

These types of discussions make it clear why certain decisions are made... Each Thunderbolt bus costs actual money to add, which explains why Apple limits them on the lower end products. It also explains why USB has survived for so long when Thunderbolt appears superior in every way-- USB wins on cost.
 
  • Like
Reactions: altaic and Basic75

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
So if @ArkSingularity is counting right, then it's something over 128MB for two frame buffers, the SLC on the M3 base is 48MB, I believe? A display buffer can be relatively slow, but it's going to be hard to get the numbers to work with the floorplans shown, I think.
Double it to 256MB if using double buffering for two video buffer outputs.
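Spelled out, using the ~64 MB/frame implied by the 128 MB estimate quoted above (the per-frame size is an assumption about the pixel format):

```python
# Rough tally behind the 256 MB figure: two 6K outputs, double-buffered,
# assuming ~64 MB per frame (the figure implied by the quoted 128 MB estimate).
frame_mb = 64
displays, buffers_per_display = 2, 2

total_mb = displays * buffers_per_display * frame_mb
print(total_mb, "MB")   # 256 MB -- far beyond the M3's ~48 MB SLC
```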
 

Chancha

macrumors 68020
Mar 19, 2014
2,307
2,134
I agree they're a good fraction of the die, but just comparing to other things around it they don't seem out of proportion. They're not huge compared to the DDR controller, for example. And TB4 is a pretty complex agglomeration of technologies-- PCIe, video controllers, USB controllers, bus mastering, I'm pretty sure they'd have a microcontroller embedded to handle negotiations...

These types of discussions make it clear why certain decisions are made... Each Thunderbolt bus costs actual money to add, which explains why Apple limits them on the lower end products. It also explains why USB has survived for so long when Thunderbolt appears superior in every way-- USB wins on cost.
The TB controllers are like 1.5-2x the size on Intel Alder Lake, but of course they usually have 4 of them, and that they are nowhere near TSMC 3nm in node size.
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
My 2014 15" MBP thermally throttled more readily (indicating significantly more power consumption) when it was driving three externals instead of one, i.e., a total of four displays instead of two (one external through each of the two TB ports, and the third external via the HDMI port). As you say, the images the monitors are displaying are mostly static.

When at least one external display is connected, this MBP automatically uses the dGPU (an NVIDIA GeForce GT 750M w/ 2 GB GDDR5). Thus based on what you wrote, this sounds like an example of poorly-designed "driver/firmware stacks that force the GPU to clock higher as long as there's more than 1 display attached, even while there's no actual demand for more GPU throughput."

If so, the Apple engineers were at fault for not catching this during testing. But as to how they ended up with this design in the first place, would this have been the fault of Apple, NVIDIA, or both plus a product of bad communication/collaboration between the two?
Impossible to say. It could even be true that the 750M genuinely must clock up when driving more displays, due to quirks of how its hardware was designed. They didn't have to design it that way; as I said, modern GPUs don't have to be tightly coupled to display refresh hardware at all. However, they might have designed in such a coupling anyways. Whenever you dig into PC industry designs you tend to find lots of weird legacy nonsense that makes no sense in the modern context, but that's how things were done 20 years ago, and if it isn't too badly broken, just keep shipping it.

(or like... I just dreamt up a scenario in my head. Refresh taps into the same memory bandwidth which serves the GPU compute elements. If they designed the 750M such that the lowest power state for the memory controller has about enough bandwidth to serve one display out plus a low level of GPU compute activity, then turning on a second display will probably require moving out of that lowest power state. If, in turn, they didn't design with a lot of granularity in their power states, and/or coupled GPU frequency to memory frequency in some way, well, there's your problem.)
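To put made-up numbers on that scenario (everything here is hypothetical, not measured 750M behavior):

```python
# Hypothetical illustration of the scenario above: if the lowest memory power
# state only has headroom for one display's scan-out plus light GPU activity,
# attaching a second display forces a jump to a higher-power state.
def scanout_gb_s(w, h, hz, bytes_per_px=4):
    return w * h * bytes_per_px * hz / 1e9

low_state_bandwidth_gb_s = 8.0   # assumed ceiling of the lowest memory power state
background_gpu_gb_s      = 5.0   # assumed light compute/compositing traffic

one_display  = background_gpu_gb_s + scanout_gb_s(3840, 2160, 60)      # ~7.0 GB/s: fits
two_displays = background_gpu_gb_s + 2 * scanout_gb_s(3840, 2160, 60)  # ~9.0 GB/s: doesn't
print(one_display, two_displays)
```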
 
  • Like
Reactions: theorist9

pgolik

macrumors member
Sep 13, 2011
67
49
My 2014 15" MBP thermally throttled more readily (indicating significantly more power consumption) when it was driving three externals instead of one, i.e., a total of four displays instead of two (one external through each of the two TB ports, and the third external via the HDMI port). As you say, the images the monitors are displaying are mostly static.

When at least one external display is connected, this MBP automatically uses the dGPU (an NVIDIA GeForce GT 750M w/ 2 GB GDDR5). Thus based on what you wrote, this sounds like an example of poorly-designed "driver/firmware stacks that force the GPU to clock higher as long as there's more than 1 display attached, even while there's no actual demand for more GPU throughput."

If so, the Apple engineers were at fault for not catching this during testing. But as to how they ended up with this design in the first place, would this have been the fault of Apple, NVIDIA, or both plus a product of bad communication/collaboration between the two?
The same thing happened on my 2015 15” which had an AMD dGPU, so it wasn’t Nvidia.
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
The same thing happened on my 2015 15” which had an AMD dGPU, so it wasn’t Nvidia.
Interesting...or it could be that the NVIDIA and AMD GPUs share the design characteristic that causes the issue. E.g., perhaps they both unnecessarily ramp their clocks when more displays are attached.
 
Last edited:

Analog Kid

macrumors G3
Mar 4, 2003
9,360
12,603
The TB controllers are like 1.5-2x the size on Intel Alder Lake, but of course they usually have 4 of them, and that they are nowhere near TSMC 3nm in node size.
I haven't looked too much at modern Intel designs, but in the past Intel has favored dumb peripherals that require more CPU intervention over smart peripherals that can offload demand from the CPU and manage themselves. I'd expect Apple to favor the opposite.
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
The TB controllers are like 1.5-2x the size on Intel Alder Lake, but of course they usually have 4 of them, and that they are nowhere near TSMC 3nm in node size.
How did you arrive at that 1.5–2x estimate? If you were comparing the die shots by eye, did you first adjust the two pics to the same scale?
 

Analog Kid

macrumors G3
Mar 4, 2003
9,360
12,603
My 2014 15" MBP thermally throttled more readily (indicating significantly more power consumption) when it was driving three externals instead of one, i.e., a total of four displays instead of two (one external through each of the two TB ports, and the third external via the HDMI port). As you say, the images the monitors are displaying are mostly static.

When at least one external display is connected, this MBP automatically uses the dGPU (an NVIDIA GeForce GT 750M w/ 2 GB GDDR5). Thus based on what you wrote, this sounds like an example of poorly-designed "driver/firmware stacks that force the GPU to clock higher as long as there's more than 1 display attached, even while there's no actual demand for more GPU throughput."

If so, the Apple engineers were at fault for not catching this during testing. But as to how they ended up with this design in the first place, would this have been the fault of Apple, NVIDIA, or both plus a product of bad communication/collaboration between the two?
I suspect it can also depend on the connection method being used. It can take a lot of power to run high-speed data down long cables. Just look at the 10Gb Ethernet equipment out there-- it's all heat-sinked and huge, while the 40Gb Thunderbolt standard is more modern and built to tighter tolerances, so it runs at much lower power. I'm not sure where HDMI and DisplayPort fit into the spectrum, but it's another variable.
 

Analog Kid

macrumors G3
Mar 4, 2003
9,360
12,603
I still personally find it very convenient that the energy consumption is low even when connected to a display
This has nothing to do with your larger point, but just a random musing after having just read @theorist9's discussion of energy and power consumption.

Leaving aside that it's illegal to consume energy (conservation laws only permit transforming it), this looks like a good place to reference power rather than energy. When you're talking about a setup like this, it's for a rather indeterminate amount of time. Energy is a useful metric for discrete tasks, but power is a better fit for ongoing activities.
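A toy example of the distinction, with invented numbers:

```python
# Energy suits a bounded task; power suits an open-ended one. Numbers invented.
export_power_w, export_duration_h = 30, 0.5
display_idle_power_w = 2

export_energy_wh = export_power_w * export_duration_h   # 15 Wh: a meaningful total
# For "connected to a display indefinitely" there's no natural end time, so the
# 2 W figure itself is the useful number; any Wh total depends entirely on how
# long you leave it plugged in.
print(export_energy_wh, display_idle_power_w)
```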
 
  • Haha
Reactions: jido

Wando64

macrumors 68020
Jul 11, 2013
2,338
3,109
Intel's chips (with only a tiny fraction of the transistor counts) have long been able to power multiple external displays even on chips that are very small in comparison, so it's clear that Intel has not necessarily been building display engines that are nearly as large in silicon.

I am sorry but, before I go any further in reading this thread, have you actually checked this or are you just making what you think is a logical deduction?
Someone else later makes a comment about the thunderbolt controllers being very large.
Compared to which other thunderbolt controller of same specifications?

Maybe you guys are completely correct in your statements, but can you give any factual equivalent comparison reference?
 

Chancha

macrumors 68020
Mar 19, 2014
2,307
2,134
How did you arrive at that 1.5–2x estimate? If you were comparing the die shots by eye, did you first adjust the two pics to the same scale?
Well yes, I pulled that out of fuzzy memory and conjecture. Since you questioned what I posted, I went on to actually check. Can't say I was *that* far off, but I wasn't exactly correct either:


Here are known die details of M1 family vs Tiger Lake H, similar release time frame in 2021.
By counting the pixels here are the respective dimensions:
Tiger Lake H : (4.6057 mm) x (2.0783 mm)
M1 Max : (5.7623 mm) x (2.1583 mm)

Using the M1 Max to compare, since it carries 4 controllers with a corresponding 4 ports. Also, the I/O portion of Apple Silicon is off to the side; it is unclear to me whether we have to add that on top of the TB controllers to be functionally equivalent to what's in the Intel part.

With the M3 gen and its process shrink, I naturally expect a slight size down but we don't have enough info to know exactly how much.
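For reference, the pixel-counting step is just a scale factor derived from a feature of known physical size in the same shot. The numbers below are placeholders, not my actual measurements:

```python
# Scale measured pixel spans to mm using a reference of known physical size in
# the same die shot. All values are placeholders, not the actual measurements.
reference_mm_w = 10.0     # assumed known width of a reference block/die edge
reference_px_w = 500      # its measured width in the image, in pixels
mm_per_px = reference_mm_w / reference_px_w

tb_block_px = (288, 104)  # measured TB-controller span in pixels (assumed)
tb_block_mm = tuple(round(px * mm_per_px, 2) for px in tb_block_px)
print(tb_block_mm)        # -> (5.76, 2.08) mm, in the ballpark of the figures above
```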

[Attached: annotated M1 Max and Tiger Lake H die shots]
 

CWallace

macrumors G5
Aug 17, 2007
12,525
11,542
Seattle, WA
The base M SoCs have to support 6K @ 60Hz with 10-bit color because Apple sells a 60Hz 6K display with 10-bit color. And yes, the number of people who connect an MBA or 13"/14" MBP to such a display are probably very rare, but it does happen because I know them. :)

As to why Apple designed their SOC display controller configurations the way they did, I do believe it is both from a power-savings angle and a product differentiation angle.
 

ArkSingularity

macrumors 6502a
Original poster
Mar 5, 2022
928
1,130
I am sorry but, before I go any further in reading this thread, have you actually checked this or are you just making what you think is a logical deduction?
Someone else later makes a comment about the thunderbolt controllers being very large.
Compared to which other thunderbolt controller of same specifications?

Maybe you guys are completely correct in your statements, but can you give any factual equivalent comparison reference?
It's an assumption based on the estimated transistor counts of most of the 8th-gen lineup of U-series Intel CPUs (most of which have under one billion transistors, according to most estimates). Die shots of Coffee Lake era chips can be found on this page, which show the relative size of the system agent (which contains the display controllers and IO bus, among other things).

I am using the 8th gen CPUs primarily because this is what I am familiar with from the research I had done previously. If anyone knows where more precise transistor count information can be found, it would be very useful for making better comparisons on this front (Intel doesn't usually share official information for these, so we have to rely on estimations).
 
Last edited:

JPack

macrumors G5
Mar 27, 2017
13,535
26,158
I am sorry but, before I go any further in reading this thread, have you actually checked this or are you just making what you think is a logical deduction?
Someone else later makes a comment about the thunderbolt controllers being very large.
Compared to which other thunderbolt controller of same specifications?

Maybe you guys are completely correct in your statements, but can you give any factual equivalent comparison reference?

It's common knowledge for anyone who has used a PC in the past two decades.

If you've worked in an office of any kind that uses Dell desktops, those systems at minimum supported two displays. The Intel chip itself supported at least three. MacBook Pros of that era also used Intel chips. Here's a $42 Celeron launched in 2013 that supports three displays.


Transistor count? Ivy Bridge is 1.4B. M3 is 25B.
 
  • Like
Reactions: MRMSFC

Chancha

macrumors 68020
Mar 19, 2014
2,307
2,134
There is a usefully stark example: Apple shipped two MBA releases in 2020:

MacBook Air Early 2020, Intel 10th gen Ice Lake Y / Iris Plus G7 iGPU
External display supported: 2

MacBook Air Late 2020, M1 SoC GPU
External display supported: 1

In retrospect, this was quite obviously a conscious design decision from the M1 onwards.
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
Locuza wrote the following in their die annotation of Alder Lake-P:

"Starting at the top-left again, that's where Intel invested into 4x Thunderbolt 4 ports. The bidirectional bandwidth per port can be 5 GB/s and it’s 20 GB/s in total. That’s a lot of throughput which can be used to exchange data and to drive high-resolution display output. The area required for this feature is quite large, around 10.23 mm²."


Locuza's 10.23 mm^2 figure refers to the entire area outlined in the first screenshot below (the two 2x PHY blocks plus the Multiprotocol block).

Since the entire die has an area of 217.18 mm^2, I was able to do my own measurements by displaying the entire die on my screen. I then determined the area of each block as follows:
block area = (area of block on screen)/(area of die on screen) x 217.18 mm^2

Here's what I got (the 10.3 mm^2 is very close to Locuza's figure):

ALDER LAKE-P
Two 2x PHY blocks: 4.1 mm^2
Multiprotocol block: 6.2 mm^2
Total: 10.3 mm^2

This is harder to do for the M3 because we don't know the die area. Ryan Smith of Anandtech estimated <400 mm^2 for the Max (which, like Alder Lake P, also has 4 x TB4 controllers) (https://www.anandtech.com/show/2111...-family-m3-m3-pro-and-m3-max-make-their-marks)

So let's consider a range of 350-400 mm^2. Given this, I obtained (see screenshot at bottom):

M3 MAX
Four purple "I/O" blocks: 4.0 mm^2 – 4.6 mm^2
Four orange "Thunderbolt blocks": 8.9 mm^2 – 10.1 mm^2
Total: 12.9 mm^2 – 14.7 mm^2
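For concreteness, here is the proportional-area step in code form (the on-screen pixel areas below are placeholders, not my actual measurements):

```python
# block area = (block area on screen / die area on screen) * known die area
ADL_P_DIE_AREA_MM2 = 217.18   # Alder Lake-P die area used above

def block_area_mm2(block_px_area: float, die_px_area: float) -> float:
    return block_px_area / die_px_area * ADL_P_DIE_AREA_MM2

# Placeholder pixel counts chosen to land near the measured ~10.3 mm^2 total.
print(round(block_area_mm2(18_900, 400_000), 1))   # -> 10.3
```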

Unfortunately, because they are made on different processes, these areas aren't comparable. I should really be comparing based on transistor counts, but I was unable to find that value for Alder Lake-P. Does anyone know this? Or equivalently, do we know the relative transistor density of Intel 10 nm (Alder Lake) and TSMC N3B (M3)?

Further, someone who understands this better than I do will need to let us know if the total region identified for Alder Lake-P TB4 functionality is equivalent to the total identified for the M3 Max, i.e., have these captured the same functional areas?


ALDER LAKE-P (FROM: https://locuza.substack.com/p/die-walkthrough-alder-lake-sp-and)
[attached annotated die shots]


M3 MAX
[attached annotated die shot]
 
Last edited:
  • Love
Reactions: ArkSingularity

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
There is a usefully stark example: Apple shipped two MBA releases in 2020:

MacBook Air Early 2020, Intel 10th gen Ice Lake Y / Iris Plus G7 iGPU
External display supported: 2

MacBook Air Late 2020, M1 SoC GPU
External display supported: 1

In retrospect, this was quite obviously a conscious design decision from the M1 onwards.
Yeah, the Intel Air's display support likely wasn't an explicit design decision—that was presumably the best chip for the Air at the time, and support for 3 displays (internal + 2 external) was simply what that chip happened to offer (3 display pipes, each capable of 5k60). So Apple simply needed to decide whether 3 displays was acceptable.

That's qualitatively different from the case with the AS Air, where Apple needed to explicitly decide how many display pipes to include in the chip they would be designing for it (and the other models that would be using this chip).
 
Last edited: