I think the important topic to discuss is the merit of Apple's approach. To refresh, it has been observed that the display controllers on M-series chips occupy a significant amount of space — the two display controllers on the M1 die are similar in size to the P-CPU cluster.
Two? The die floorplan labeling in the first post is of the M1 Pro, not the M1. The Pro/Max have two, but they are also substantially bigger dies. The M1 also has fewer Thunderbolt (TB) controllers. Can't really increase the TB controller count without increasing the display controller count (and still hit the TB4/TB3 certification requirements). Four TB controllers soak up substantial die space too. But having them integrated saves a substantial amount of power.
Second, the two display controller complexes are approximately as big as one (of the two) P-core clusters; not bigger than the two P-core clusters combined. If you count just the P cores without the L2, the DispExts are close in size to just the P cores alone. Throw in the P-core cluster's L2 cache and it isn't that close if you compare two to two. That was a bit of an apples-to-oranges comparison: display cores with substantial framebuffer working-space RAM versus P cores with registers and L1 cache. Yes, substantial RAM capacity takes up more space on the die.
The cache replacement policy for the display controllers is going to be largely different from the GPU's and CPU's (and NPU's). So there are several upsides to giving them their own working space if sharing would partially trash the shared L3 cache of the CPU and GPU (both cache-line replacement-policy mismatches and bus congestion to/from the L3)... those could take a performance hit. (There are similar coupled-resource issues when the display controller runs hotter and there is thermal bleed-over into the other function units. How is having more P cores that run into a thermal limit helpful on general workloads?)
One of the issues Apple is juggling here is that there are more than a few relatively 'hungry' bandwidth consumers sharing the same resources. The GPU wants lots. The CPU, when fired up on many threads, wants lots. The video decoders are shipping lots of output around on the internal bus. The NPU is bursting on computations. The display controllers have tight latency tolerances to meet service-level requirements, etc. They all have to share, and also not share, as appropriate.
This of course explains the lack of additional display support — die space comes at a high cost, after all. And these display controllers are truly high-end devices, capable of very efficient operation thanks to their own large RAM buffers as well as 6K resolution support.
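To put rough numbers on why a 6K-capable controller needs substantial buffer RAM and bandwidth, here's a back-of-the-envelope sketch. The figures assumed here (Pro Display XDR's 6016×3384 resolution, 4 bytes per pixel, 60 Hz refresh) are illustrative choices, not Apple's actual controller parameters:

```python
# Back-of-the-envelope scan-out math for a 6K display.
# Assumptions (illustrative, not Apple's actual figures):
# Pro Display XDR resolution, 32-bit pixels, 60 Hz refresh.
width, height = 6016, 3384
bytes_per_pixel = 4          # 8-bit RGBA; 10-bit formats would be larger
refresh_hz = 60

frame_bytes = width * height * bytes_per_pixel
frame_mib = frame_bytes / 2**20
scanout_gbps = frame_bytes * refresh_hz / 1e9   # GB/s just to scan out

print(f"one frame: {frame_mib:.1f} MiB")
print(f"scan-out bandwidth: {scanout_gbps:.2f} GB/s")
```

Roughly 78 MiB per frame and close to 5 GB/s of sustained scan-out traffic per 6K stream — before compositing or compression — which is why a controller with its own working-space RAM, rather than one leaning on the shared L3 and fabric, is attractive.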
It is also likely that the DispExt allocation is coupled to the TB controller allocation. So where the TB port count is low, the display count is going to go lower in lockstep.
Folks get caught up in the P-core counts. If Apple is trying to build a system on a chip, then they need more than the maximum P-core count. If there were primarily just P cores on the die, it wouldn't be much of a 'system' (a well-rounded, complete package).
Still, one can argue whether it's a good use of die space. Is an ultra-efficient display controller really necessary for external displays, for example? Maybe it would make sense to have asymmetric display controllers — a "fancy" large one for the internal display (for excellent battery life) and multiple smaller ones for the external displays. These smaller controllers could, for example, use smaller internal memory buffers and operate less efficiently, but one could fit more of them on the same die.
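A quick sketch of the area argument: if a controller's internal buffer scales with the panel it drives, the gap between resolutions is large. The panel resolutions below are illustrative assumptions (a 3024×1964 internal panel like the 14-inch MacBook Pro, common external sizes), not a claim about how Apple actually sizes these buffers:

```python
# Sketch: full-frame buffer sizes at different panel resolutions,
# to illustrate why smaller external-display controllers could get
# by with less on-die RAM. Resolutions are illustrative assumptions.
def frame_buffer_mib(width, height, bytes_per_pixel=4):
    """Bytes needed to hold one full frame, in MiB."""
    return width * height * bytes_per_pixel / 2**20

panels = {
    "internal 14-inch (3024x1964)": (3024, 1964),
    "4K external (3840x2160)":      (3840, 2160),
    "6K external (6016x3384)":      (6016, 3384),
}
for name, (w, h) in panels.items():
    print(f"{name}: {frame_buffer_mib(w, h):.1f} MiB per frame")
```

On these assumptions a 6K-sized buffer is over three times the internal-panel-sized one, so two or three "small" controllers could plausibly fit in the RAM footprint of one 6K-class controller.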
I/O off the chip is where they are going to take lumps on power. That is one reason Apple is using PCIe v4: more to reduce the number of pins out of the SoC than to generally increase bandwidth off the chip. (x4 PCIe v4 == x8 PCIe v3, as opposed to trying to hit larger overall off-SoC bandwidth than the Intel packages being replaced. They aren't.)
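The x4 v4 == x8 v3 equivalence checks out with simple lane math (per-lane signaling rates and 128b/130b line coding are from the PCIe 3.0/4.0 specs; protocol overhead above line coding is ignored here):

```python
# Per-lane PCIe throughput after 128b/130b line coding (GB/s).
# GT/s figures are from the PCIe 3.0/4.0 specs.
def lane_gbps(gt_per_s):
    return gt_per_s * (128 / 130) / 8   # bits -> bytes, minus coding overhead

v3_lane = lane_gbps(8.0)    # PCIe 3.0: 8 GT/s per lane
v4_lane = lane_gbps(16.0)   # PCIe 4.0: 16 GT/s per lane

x8_v3 = 8 * v3_lane
x4_v4 = 4 * v4_lane
print(f"x8 v3: {x8_v3:.2f} GB/s, x4 v4: {x4_v4:.2f} GB/s")
# Same bandwidth, half the lanes (and pins) off the SoC.
```

Halving the lane count for the same bandwidth is exactly the pin-and-power trade the post describes.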
I suspect the single internal display (which gets nowhere near 6K) is the problematic power and bandwidth consumer here.
And if you want to have the "world's fastest" iGPU, then some of the other internal function units have to get some compensation for the "bandwidth hog" that the GPU is going to be. (The GPU gets the bulk of the transistor budget, but the die is going to grow some because the other units need room too, just to share with that "problem child".)