
Confused-User

macrumors 6502a
Oct 14, 2014
850
984
I'm relatively unfamiliar with display hardware at a low level, so I have some questions...

At the moment, it would be prohibitively expensive to buffer an entire frame on-chip. A 6k monitor would require minimally 60MB (more likely 80MB, allowing for 10bpc). This isn't going to change in the next 5 years due to process technology; SRAM scaling is effectively dead, and eDRAM is not likely to be used. However, the industry is moving towards chiplets and complex packaging, so that leads me to wonder if it might not be feasible over time, even for the base Mx chips, to stack enough RAM on top of the base chip to hold entire frame buffers, much like AMD is stacking cache RAM on their X3D CPUs.

Is this likely to be a practical possibility in a few years? ISTM that this would dramatically lower DRAM utilization in the idle case, and would be a not insubstantial reduction in bandwidth use in all cases (~5 GB/s for a 6K60 monitor).
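For anyone who wants to check my arithmetic, here's a quick sketch. The resolution is the 6016x3384 of the Pro Display XDR; the bytes-per-pixel figures are my assumptions about the framebuffer format, not anything Apple has confirmed:

```python
# Back-of-the-envelope framebuffer math for a 6K panel (6016 x 3384).
# Bytes-per-pixel values are assumptions: 3 B/px for packed 8bpc RGB,
# 4 B/px for a 10bpc format such as RGB10A2.
width, height, refresh_hz = 6016, 3384, 60
pixels = width * height  # ~20.4 Mpx

for label, bytes_per_px in [("8bpc (3 B/px)", 3), ("10bpc (4 B/px)", 4)]:
    frame_mb = pixels * bytes_per_px / 1e6        # one frame, in MB
    scanout_gb_s = frame_mb * refresh_hz / 1e3    # continuous scan-out, GB/s
    print(f"{label}: {frame_mb:.0f} MB/frame, {scanout_gb_s:.1f} GB/s at {refresh_hz} Hz")

# -> ~61 MB and 3.7 GB/s for 8bpc, ~81 MB and 4.9 GB/s for 10bpc,
#    which is where the ~60-80 MB and ~5 GB/s figures above come from.
```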

Of course if you're going to do this you might want to move the entire controller off the base chip anyway, which might open up other possibilities.

ISTM that Apple has a big advantage over other architectures if they wanted to do this, because they'd have an easier time knowing when they'd need to copy from system RAM into the display buffer. Assuming they don't make the stacked-chiplet buffer the only buffer - I'm not clear on whether double-buffering would be better or worse here. (I mean, if the CPU or GPU is writing to the display, why even bother writing to main memory, if you have this buffer handy on-chip?)

As you can see I'm well out of my comfort zone here, so if this is utterly stupid feel free to say so, though I'd prefer to know why.
I'm a little disappointed nobody has answered me. :-( Is it because the idea is so dumb? @name99, @mr_roboto, anyone else?
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
I'm a little disappointed nobody has answered me. :-( Is it because the idea is so dumb? @name99, @mr_roboto, anyone else?
Stacked chiplet DRAM is technically feasible but it seems like an awful lot of trouble to go to just for a frame buffer. If you were Apple and you wanted to do it, you'd just do it for all the DRAM, not merely the frame buffer.

It's not a new idea. About 10 years ago JEDEC published standards for "Wide IO" DRAM, where DRAM die are stacked on top of logic die using through-silicon vias for interconnect. The 'wide' is because this interface specified wide data busses clocked at lower speeds compared to traditional DRAM interfaces, made possible by the much smaller size of TSV interconnect compared to traditional DRAM wire-bond pad structures. (Similar to how Apple's Ultra Fusion bridge saves power by using many thousands of connections clocked at a slow speed.) There was never much adoption, probably because it costs a lot more to build than ordinary DRAM.
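A rough way to see the wide-and-slow trade-off in numbers (the bus widths and transfer rates below are illustrative, roughly in the spirit of first-gen Wide IO rather than exact JEDEC figures):

```python
# Illustrative only: a wide, slow TSV bus and a narrow, fast package bus can
# deliver the same bandwidth; the width/rate figures are assumptions, not
# exact JEDEC numbers.
def bandwidth_gb_s(bus_bits: int, transfers_per_s: float) -> float:
    return bus_bits / 8 * transfers_per_s / 1e9

wide_slow   = bandwidth_gb_s(512, 200e6)   # ~12.8 GB/s: 512-bit TSV interface at 200 MT/s
narrow_fast = bandwidth_gb_s(32, 3200e6)   # ~12.8 GB/s: 32-bit interface at 3200 MT/s
print(wide_slow, narrow_fast)

# Per-pin dynamic power scales roughly with C * V^2 * f, and TSVs have far less
# capacitance than bond wires/package traces while also toggling less often --
# that's where the power advantage of the wide, slow approach comes from.
```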
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
In the real world, for any given feature, the answer will always be "both", because the two are not separable.
Nope, in the real world, they are separable. There are many examples. Companies will often produce a single silicon die to cover an entire product line, and then disable features on it to create their lower-end models. When a company disables features on those lower-end models, it's not doing so to optimize their functionality. In fact, it's doing the opposite. It's reducing functionality to create product segmentation, thus explicitly separating what's optimum for the product (allowing that functionality to remain) from what it believed was optimum for its business (disabling it). I.e., this is an action taken purely for a business reason, even though it runs counter to product optimization considerations.

Now you may argue that Apple not offering an additional feature is different from disabling an existing feature. Yes, but this case still falls under the same general concept I'm articulating: Suppose this additional feature would only add marginally to the cost, yet provide a significant benefit. Thus if Apple were operating based purely on product optimization considerations, they would offer it. But suppose that doing so would hurt their product segmentation strategy, and they thus decided not to. That would be a clear case of Apple doing something purely for business reasons (product segmentation), while running counter to product optimization.

I'm not saying Apple did this for either one reason, or the other, or both. I'm simply saying all three are possibilities. As I stated before, I don't know which of these applies—only Apple does.

They set a budget for chip size (cost, really, but it's mostly the same thing, until the advent of chiplets) and then decide what to spend that budget on. Sometimes someone will go to bat with management to argue that a certain feature is worth increasing the budget for. Sometimes they might even get their way.

In your terms, the budget is a "business reason". Deciding what to spend it on is "product optimization".
Nope, you misunderstand the terms as I've defined them. You seem to be presuming to lecture me (including presuming to tell me how the "real world" works) on what you don't understand. Take another look at my posts about this.
 
Last edited:
  • Like
Reactions: nateo200

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
The proper way to think about this is that GPUs use power to draw into buffers residing somewhere in RAM. They aren't too closely coupled to how many displays happen to be around. While an external display is showing a static image that's not being recomputed at all each refresh interval, it needs 0W of additional GPU power.

This is true (or can be) even for PC-style discrete GPUs, but you might not think it since I've long heard of dumb driver/firmware stuff producing the illusion of something else. For example, you might run into driver/firmware stacks that force the GPU to clock higher as long as there's more than 1 display attached, even while there's no actual demand for more GPU throughput.
My 2014 15" MBP thermally throttled more readily (indicating significantly more power consumption) when it was driving three externals instead of one, i.e., a total of four displays instead of two (one external through each of the two TB ports, and the third external via the HDMI port). As you say, the images the monitors are displaying are mostly static.

When at least one external display is connected, this MBP automatically uses the dGPU (an NVIDIA GeForce GT 750M w/ 2 GB GDDR5). Thus based on what you wrote, this sounds like an example of poorly-designed "driver/firmware stacks that force the GPU to clock higher as long as there's more than 1 display attached, even while there's no actual demand for more GPU throughput."

If so, the Apple engineers were at fault for not catching this during testing. But as to how they ended up with this design in the first place, would this have been the fault of Apple, NVIDIA, or both plus a product of bad communication/collaboration between the two?
 

Confused-User

macrumors 6502a
Oct 14, 2014
850
984
Stacked chiplet DRAM is technically feasible but it seems like an awful lot of trouble to go to just for a frame buffer. If you were Apple and you wanted to do it, you'd just do it for all the DRAM, not merely the frame buffer.

It's not a new idea. About 10 years ago JEDEC published standards for "Wide IO" DRAM, where DRAM die are stacked on top of logic die using through-silicon vias for interconnect. The 'wide' is because this interface specified wide data busses clocked at lower speeds compared to traditional DRAM interfaces, made possible by the much smaller size of TSV interconnect compared to traditional DRAM wire-bond pad structures. (Similar to how Apple's Ultra Fusion bridge saves power by using many thousands of connections clocked at a slow speed.) There was never much adoption, probably because it costs a lot more to build than ordinary DRAM.
I vaguely recall reading about Wide I/O. It's basically HBM, but even slower and wider, IIRC. I think even for M chips, heat dissipation would likely be too challenging for them to use this for all DRAM.

I was suggesting a much more specific optimization, though: a much smaller RAM element mounted only on the display controller part of the chip, which presumably generates less heat than a GPU or CPU core. I know this isn't feasible now, but it may well be a few years down the road as packaging tech continues to improve.
 

Analog Kid

macrumors G3
Mar 4, 2003
9,360
12,603
By the way, what really shocks me is the size of the thunderbolt controllers. They are crazy large.
I agree they're a good fraction of the die, but just comparing to other things around it they don't seem out of proportion. They're not huge compared to the DDR controller, for example. And TB4 is a pretty complex agglomeration of technologies-- PCIe, video controllers, USB controllers, bus mastering, I'm pretty sure they'd have a microcontroller embedded to handle negotiations...

These types of discussions make it clear why certain decisions are made... Each Thunderbolt bus costs actual money to add, which explains why Apple limits them on the lower end products. It also explains why USB has survived for so long when Thunderbolt appears superior in every way-- USB wins on cost.
 
  • Like
Reactions: altaic and Basic75

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
So if @ArkSingularity is counting right, then it's something over 128MB for two frame buffers, the SLC on the M3 base is 48MB, I believe? A display buffer can be relatively slow, but it's going to be hard to get the numbers to work with the floorplans shown, I think.
Double it to 256MB if using double buffering for two video buffer outputs.
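Spelled out, using the ~64 MB/frame implied by the 128 MB estimate quoted above (the per-frame size is an assumption about the pixel format):

```python
# Rough tally behind the 256 MB figure: two 6K outputs, double-buffered,
# assuming ~64 MB per frame (the figure implied by the quoted 128 MB estimate).
frame_mb = 64
displays, buffers_per_display = 2, 2

total_mb = displays * buffers_per_display * frame_mb
print(total_mb, "MB")   # 256 MB -- far beyond the M3's ~48 MB SLC
```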
 

Chancha

macrumors 68020
Mar 19, 2014
2,307
2,134
I agree they're a good fraction of the die, but just comparing to other things around it they don't seem out of proportion. They're not huge compared to the DDR controller, for example. And TB4 is a pretty complex agglomeration of technologies-- PCIe, video controllers, USB controllers, bus mastering, I'm pretty sure they'd have a microcontroller embedded to handle negotiations...

These types of discussions make it clear why certain decisions are made... Each Thunderbolt bus costs actual money to add, which explains why Apple limits them on the lower end products. It also explains why USB has survived for so long when Thunderbolt appears superior in every way-- USB wins on cost.
The TB controllers are like 1.5-2x the size on Intel Alder Lake, but of course they usually have 4 of them, and that they are nowhere near TSMC 3nm in node size.
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
My 2014 15" MBP thermally throttled more readily (indicating significantly more power consumption) when it was driving three externals instead of one, i.e., a total of four displays instead of two (one external through each of the two TB ports, and the third external via the HDMI port). As you say, the images the monitors are displaying are mostly static.

When at least one external display is connected, this MBP automatically uses the dGPU (an NVIDIA GeForce GT 750M w/ 2 GB GDDR5). Thus based on what you wrote, this sounds like an example of poorly-designed "driver/firmware stacks that force the GPU to clock higher as long as there's more than 1 display attached, even while there's no actual demand for more GPU throughput."

If so, the Apple engineers were at fault for not catching this during testing. But as to how they ended up with this design in the first place, would this have been the fault of Apple, NVIDIA, or both plus a product of bad communication/collaboration between the two?
Impossible to say. It could even be true that the 750M genuinely must clock up when driving more displays, due to quirks of how its hardware was designed. They didn't have to design it that way; as I said, modern GPUs don't have to be tightly coupled to display refresh hardware at all. However, they might have designed in such a coupling anyways. Whenever you dig into PC industry designs you tend to find lots of weird legacy nonsense that makes no sense in the modern context, but that's how things were done 20 years ago, and if it isn't too badly broken, just keep shipping it.

(or like... I just dreamt up a scenario in my head. Refresh taps into the same memory bandwidth which serves the GPU compute elements. If they designed the 750M such that the lowest power state for the memory controller has about enough bandwidth to serve one display out plus a low level of GPU compute activity, then turning on a second display will probably require moving out of that lowest power state. If, in turn, they didn't design with a lot of granularity in their power states, and/or coupled GPU frequency to memory frequency in some way, well, there's your problem.)
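To put made-up numbers on that scenario (everything here is hypothetical, not measured 750M behavior):

```python
# Hypothetical illustration of the scenario above: if the lowest memory power
# state only has headroom for one display's scan-out plus light GPU activity,
# attaching a second display forces a jump to a higher-power state.
def scanout_gb_s(w, h, hz, bytes_per_px=4):
    return w * h * bytes_per_px * hz / 1e9

low_state_bandwidth_gb_s = 8.0   # assumed ceiling of the lowest memory power state
background_gpu_gb_s      = 5.0   # assumed light compute/compositing traffic

one_display  = background_gpu_gb_s + scanout_gb_s(3840, 2160, 60)      # ~7.0 GB/s: fits
two_displays = background_gpu_gb_s + 2 * scanout_gb_s(3840, 2160, 60)  # ~9.0 GB/s: doesn't
print(one_display, two_displays)
```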
 
  • Like
Reactions: theorist9

pgolik

macrumors member
Sep 13, 2011
67
49
My 2014 15" MBP thermally throttled more readily (indicating significantly more power consumption) when it was driving three externals instead of one, i.e., a total of four displays instead of two (one external through each of the two TB ports, and the third external via the HDMI port). As you say, the images the monitors are displaying are mostly static.

When at least one external display is connected, this MBP automatically uses the dGPU (an NVIDIA GeForce GT 750M w/ 2 GB GDDR5). Thus based on what you wrote, this sounds like an example of poorly-designed "driver/firmware stacks that force the GPU to clock higher as long as there's more than 1 display attached, even while there's no actual demand for more GPU throughput."

If so, the Apple engineers were at fault for not catching this during testing. But as to how they ended up with this design in the first place, would this have been the fault of Apple, NVIDIA, or both plus a product of bad communication/collaboration between the two?
The same thing happened on my 2015 15” which had an AMD dGPU, so it wasn’t Nvidia.
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
The same thing happened on my 2015 15” which had an AMD dGPU, so it wasn’t Nvidia.
Interesting...or it could be that the NVIDIA and AMD GPUs share the design characteristic that causes the issue. E.g., perhaps they both unnecessarily ramp their clocks when more displays are attached.
 
Last edited:

Analog Kid

macrumors G3
Mar 4, 2003
9,360
12,603
The TB controllers are like 1.5-2x the size on Intel Alder Lake, but of course they usually have 4 of them, and that they are nowhere near TSMC 3nm in node size.
I haven't looked too much at modern Intel designs, but in the past Intel has favored dumb peripherals that require more CPU intervention over smart peripherals that can offload demand from the CPU and manage themselves. I'd expect Apple to favor the opposite.
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
The TB controllers are like 1.5-2x the size on Intel Alder Lake, but of course they usually have 4 of them, and that they are nowhere near TSMC 3nm in node size.
How did you arrive at that 1.5–2x estimate? If you were comparing the die shots by eye, did you first adjust the two pics to the same scale?
 

Analog Kid

macrumors G3
Mar 4, 2003
9,360
12,603
My 2014 15" MBP thermally throttled more readily (indicating significantly more power consumption) when it was driving three externals instead of one, i.e., a total of four displays instead of two (one external through each of the two TB ports, and the third external via the HDMI port). As you say, the images the monitors are displaying are mostly static.

When at least one external display is connected, this MBP automatically uses the dGPU (an NVIDIA GeForce GT 750M w/ 2 GB GDDR5). Thus based on what you wrote, this sounds like an example of poorly-designed "driver/firmware stacks that force the GPU to clock higher as long as there's more than 1 display attached, even while there's no actual demand for more GPU throughput."

If so, the Apple engineers were at fault for not catching this during testing. But as to how they ended up with this design in the first place, would this have been the fault of Apple, NVIDIA, or both plus a product of bad communication/collaboration between the two?
I suspect it can also depend on the connection method being used. It can take a lot of power to run high-speed data down long cables. Just look at the 10Gb Ethernet equipment out there-- it's all heat-sinked and huge, while the 40Gb Thunderbolt standard is more modern and built to tighter tolerances, so it runs at much lower power. I'm not sure where HDMI and DisplayPort fit into the spectrum, but it's another variable.
 

Analog Kid

macrumors G3
Mar 4, 2003
9,360
12,603
I still personally find it very convenient that the energy consumption is low even when connected to a display
This has nothing to do with your larger point, but just a random musing after having just read @theorist9's discussion of energy and power consumption.

Leaving aside that it's illegal to consume energy (conservation laws only permit transforming it), this looks like a good place to reference power rather than energy. When you're talking about a setup like this, it's for a rather indeterminate amount of time. Energy is a useful metric for discrete tasks, but power is a better fit for ongoing activities.
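A toy example of the distinction, with invented numbers:

```python
# Energy suits a bounded task; power suits an open-ended one. Numbers invented.
export_power_w, export_duration_h = 30, 0.5
display_idle_power_w = 2

export_energy_wh = export_power_w * export_duration_h   # 15 Wh: a meaningful total
# For "connected to a display indefinitely" there's no natural end time, so the
# 2 W figure itself is the useful number; any Wh total depends entirely on how
# long you leave it plugged in.
print(export_energy_wh, display_idle_power_w)
```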
 
  • Haha
Reactions: jido

Wando64

macrumors 68020
Jul 11, 2013
2,338
3,109
Intel's chips (with only a tiny fraction of the transistor counts) have long been able to power multiple external displays even on chips that are very small in comparison, so it's clear that Intel has not necessarily been building display engines that are nearly as large in silicon.

I am sorry but, before I go any further in reading this thread, have you actually checked this or are you just making what you think is a logical deduction?
Someone else later makes a comment about the thunderbolt controllers being very large.
Compared to which other thunderbolt controller of same specifications?

Maybe you guys are completely correct in your statements, but can you give any factual equivalent comparison reference?
 

Chancha

macrumors 68020
Mar 19, 2014
2,307
2,134
How did you arrive at that 1.5–2x estimate? If you were comparing the die shots by eye, did you first adjust the two pics to the same scale?
Well yes, I pulled that out of fuzzy memory and conjecture. Since you questioned what I posted, I went on to actually check. Can't say I was *that* far off, but I wasn't exactly correct either:


Here are known die details of M1 family vs Tiger Lake H, similar release time frame in 2021.
By counting the pixels here are the respective dimensions:
Tiger Lake H : (4.6057 mm) x (2.0783 mm)
M1 Max : (5.7623 mm) x (2.1583 mm)

Using the M1 Max to compare, since it carries 4 controllers with a corresponding 4 ports. Also, the I/O portion of Apple Silicon is off to the side; it is unclear to me whether we have to add that on top of the TB controllers to be functionally equivalent to what's in the Intel part.

With the M3 gen and its process shrink, I naturally expect a slight size down but we don't have enough info to know exactly how much.
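For reference, the pixel-counting step is just a scale factor derived from a feature of known physical size in the same shot. The numbers below are placeholders, not my actual measurements:

```python
# Scale measured pixel spans to mm using a reference of known physical size in
# the same die shot. All values are placeholders, not the actual measurements.
reference_mm_w = 10.0     # assumed known width of a reference block/die edge
reference_px_w = 500      # its measured width in the image, in pixels
mm_per_px = reference_mm_w / reference_px_w

tb_block_px = (288, 104)  # measured TB-controller span in pixels (assumed)
tb_block_mm = tuple(round(px * mm_per_px, 2) for px in tb_block_px)
print(tb_block_mm)        # -> (5.76, 2.08) mm, in the ballpark of the figures above
```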

[Attached: annotated M1 Max and Tiger Lake H die shots]
 

CWallace

macrumors G5
Aug 17, 2007
12,525
11,542
Seattle, WA
The base M SoCs have to support 6K @ 60Hz with 10-bit color because Apple sells a 60Hz 6K display with 10-bit color. And yes, the number of people who connect an MBA or 13"/14" MBP to such a display are probably very rare, but it does happen because I know them. :)

As to why Apple designed their SOC display controller configurations the way they did, I do believe it is both from a power-savings angle and a product differentiation angle.
 

ArkSingularity

macrumors 6502a
Original poster
Mar 5, 2022
928
1,130
I am sorry but, before I go any further in reading this thread, have you actually checked this or are you just making what you think is a logical deduction?
Someone else later makes a comment about the thunderbolt controllers being very large.
Compared to which other thunderbolt controller of same specifications?

Maybe you guys are completely correct in your statements, but can you give any factual equivalent comparison reference?
It's an assumption based on the estimated transistor counts of most of the 8th-gen lineup of U-series Intel CPUs (most of which have under one billion transistors, according to most estimates). Die shots of Coffee Lake era chips can be found on this page, which show the relative size of the system agent (which contains the display controllers and IO bus, among other things).

I am using the 8th gen CPUs primarily because this is what I am familiar with from the research I had done previously. If anyone knows where more precise transistor count information can be found, it would be very useful for making better comparisons on this front (Intel doesn't usually share official information for these, so we have to rely on estimations).
 
Last edited:

JPack

macrumors G5
Mar 27, 2017
13,535
26,158
I am sorry but, before I go any further in reading this thread, have you actually checked this or are you just making what you think is a logical deduction?
Someone else later makes a comment about the thunderbolt controllers being very large.
Compared to which other thunderbolt controller of same specifications?

Maybe you guys are completely correct in your statements, but can you give any factual equivalent comparison reference?

It's common knowledge for anyone who has used a PC in the past two decades.

If you've worked in an office of any kind that uses Dell desktops, those systems at minimum supported two displays. The Intel chip itself supported at least three. MacBook Pros of that era also used Intel chips. Here's a $42 Celeron launched in 2013 that supports three displays.


Transistor count? Ivy Bridge is 1.4B. M3 is 25B.
 
  • Like
Reactions: MRMSFC

Chancha

macrumors 68020
Mar 19, 2014
2,307
2,134
There is a usefully stark example: Apple shipped two MBA releases in 2020:

MacBook Air Early 2020, Intel 10th gen Ice Lake Y / Iris Plus G7 iGPU
External display supported: 2

MacBook Air Late 2020, M1 SoC GPU
External display supported: 1

In retrospect, this was quite obviously a conscious design decision from the M1 onwards.
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
Locuza wrote the following in their die annotation of Alder Lake-P:

"Starting at the top-left again, that's where Intel invested into 4x Thunderbolt 4 ports. The bidirectional bandwidth per port can be 5 GB/s and it’s 20 GB/s in total. That’s a lot of throughput which can be used to exchange data and to drive high-resolution display output. The area required for this feature is quite large, around 10.23 mm²."


Locuza's 10.23 mm^2 figure refers to the entire area outlined in the first screenshot below (the two 2x PHY blocks plus the Multiprotocol block).

Since the entire die has an area of 217.18 mm^2, I was able to do my own measurements by displaying the entire die on my screen. I then determined the area of each block as follows:
block area = (area of block on screen)/(area of die on screen) x 217.18 mm^2

Here's what I got (the 10.3 mm^2 is very close to Locuza's figure):

ALDER LAKE-P
Two 2x PHY blocks: 4.1 mm^2
Multiprotocol block: 6.2 mm^2
Total: 10.3 mm^2

This is harder to do for the M3 because we don't know the die area. Ryan Smith of Anandtech estimated <400 mm^2 for the Max (which, like Alder Lake P, also has 4 x TB4 controllers) (https://www.anandtech.com/show/2111...-family-m3-m3-pro-and-m3-max-make-their-marks)

So let's consider a range of 350-400 mm^2. Given this, I obtained (see screenshot at bottom):

M3 MAX
Four purple "I/O" blocks: 4.0 mm^2 – 4.6 mm^2
Four orange "Thunderbolt blocks": 8.9 mm^2 – 10.1 mm^2
Total: 12.9 mm^2 – 14.7 mm^2
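For concreteness, here is the proportional-area step in code form (the on-screen pixel areas below are placeholders, not my actual measurements):

```python
# block area = (block area on screen / die area on screen) * known die area
ADL_P_DIE_AREA_MM2 = 217.18   # Alder Lake-P die area used above

def block_area_mm2(block_px_area: float, die_px_area: float) -> float:
    return block_px_area / die_px_area * ADL_P_DIE_AREA_MM2

# Placeholder pixel counts chosen to land near the measured ~10.3 mm^2 total.
print(round(block_area_mm2(18_900, 400_000), 1))   # -> 10.3
```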

Unfortunately, because they are made on different processes, these areas aren't comparable. I should really be comparing based on transistor counts, but I was unable to find that value for Alder Lake-P. Does anyone know this? Or equivalently, do we know the relative transistor density of Intel 10 nm (Alder Lake) and TSMC N3B (M3)?

Further, someone who understands this better than I do will need to let us know if the total region identified for Alder Lake-P TB4 functionality is equivalent to the total identified for the M3 Max, i.e., have these captured the same functional areas?


ALDER LAKE-P (FROM: https://locuza.substack.com/p/die-walkthrough-alder-lake-sp-and)
[attached annotated die shots]


M3 MAX
[attached annotated die shot]
 
Last edited:
  • Love
Reactions: ArkSingularity

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
There is a usefully stark example: Apple shipped two MBA releases in 2020:

MacBook Air Early 2020, Intel 10th gen Ice Lake Y / Iris Plus G7 iGPU
External display supported: 2

MacBook Air Late 2020, M1 SoC GPU
External display supported: 1

In retrospect, this was quite obviously a conscious design decision from the M1 onwards.
Yeah, the Intel Air's display support likely wasn't an explicit design decision—that was presumably the best chip for the Air at the time, and support for 3 displays (internal + 2 external) was simply what that chip happened to offer (3 display pipes, each capable of 5k60). So Apple simply needed to decide whether 3 displays was acceptable.

That's qualitatively different from the case with the AS Air, where Apple needed to explicitly decide how many display pipes to include in the chip they would be designing for it (and the other models that would be using this chip).
 
Last edited: