You wouldn't. But several replies back you put forward that cache and TBDR were going to solve the problem.
The argument was that it doesn't matter if the bandwidths can't exactly match, because less bandwidth is needed. The problem here is that the cache isn't going to be effective at mitigating bandwidth if you are not using it. Same issue with TBDR. It has next to no traction here because this is a display controller issue, not a "tile" (or tile cache) issue.
I think you might be overestimating the bandwidth needs for driving displays. In a typical demanding graphics workload (e.g. a game), the real bandwidth problem is fetching texture data. And TBDR+large cache can effectively reduce the real bandwidth required to do this.
This bypass 'eats' significantly into the savings of the other techniques. Take some approximate numbers: 60 GB/s for the memory system and 3 GB/s for a display stream. Your position is that it is a small fraction (5%). True, but that small fraction diminishes your savings. For example, if the savings are 55%, that is only applied to 95% of the workload, which gives an overall effective rate of 52%. With two displays that falls to 50%; with three displays, 47%. And 47% isn't going to scale up as well as 55% might.
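To make the arithmetic in that example easy to check, here is a rough sketch in Python. It assumes the simple model the numbers above imply: display scan-out bypasses the cache entirely, each display adds the same fixed share of traffic, and the savings apply only to what is left. The function name `effective_savings` and its parameters are made up for illustration.

```python
def effective_savings(base_savings, display_share, displays):
    """Savings over total memory traffic when display traffic sees no savings.

    base_savings:  fraction saved on cacheable traffic (0.55 in the example)
    display_share: fraction of memory bandwidth one display stream uses
    displays:      number of attached displays
    """
    # Assumption: display traffic bypasses the cache, so only the
    # remaining fraction of the traffic benefits from the savings.
    cacheable = 1.0 - display_share * displays
    return base_savings * cacheable


MEMORY_BW_GBS = 60.0   # approximate memory system bandwidth from the post
DISPLAY_BW_GBS = 3.0   # approximate bandwidth of one display stream
share = DISPLAY_BW_GBS / MEMORY_BW_GBS  # 5% per display

for n in (1, 2, 3):
    pct = effective_savings(0.55, share, n) * 100
    print(f"{n} display(s): about {pct:.0f}% effective savings")
```

Under those assumptions the loop gives roughly 52%, 50%, and 47%, matching the figures quoted above. Any partial redraw or framebuffer compression would shrink `display_share` and pull the result back toward 55%.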
You seem to be assuming that every display is fully redrawn on every frame (it’s almost never the case) and you are dismissing framebuffer compression.
In the real world, multi-monitor workloads are desktop compositing workloads, where changes are infrequent and confined to small regions most of the time. Also, real-world GUIs consist of large homogeneous regions, which are a gift for the compressor. Sure, if you want to play a game on three 4K displays simultaneously, the bandwidth hit will be noticeable, but your GPU hardly has the performance to do this anyway.
You are drinking lots of Cupertino kool-aid. Apple plays some games with their L2 cache numbers. Those L2s are attached to multiple cores.
Exactly, which is why I am saying that Apple’s L2 fulfills the same role as L2+L3 on more traditional designs. They seem to have around 3MB of "fast" cache region per core, but each core can access other regions at a slight loss of performance (would be interesting to see some benchmarks btw). Sure, something like Zen3 Vermeer has a bit more effective cache per core (thanks to their humongous L3), but the amount of fast per-core cache in Apple’s designs is unprecedented.