Why would you trash your cache with the contents of the screen? That’s probably just written out to RAM, bypassing the cache. Frankly, I don’t know how this stuff works… it’s possible that the chips have some sort of dedicated buffers for video outputs and they don’t land in the RAM at all.
You wouldn't. But several replies back you put forward that cache and TBDR were going to solve the problem.
It doesn't matter that you can't exactly match bandwidths, because you need less bandwidth. The problem here is that the cache isn't going to be effective at mitigating bandwidth pressure if you're not using it. Same issue with TBDR: it has next to no traction here, because this is a display controller issue, not a "tile" (or tile cache) issue.
There is a significant factor here: this bypass 'eats' into the savings of the other techniques. Take some approximate numbers: 60 GB/s for the memory system and 3 GB/s for a display stream. Your position is that it's a small fraction (5%). True, but that small fraction diminishes your savings. If the cache saves 55%, that savings only applies to 95% of the workload, which gives an overall effective rate of about 52%. With two displays that falls to 50%; with three displays, 47%. 47% isn't going to scale up as well as 55% might (the same arithmetic is sketched below).
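A minimal sketch of that arithmetic, reusing the same illustrative 60 GB/s / 3 GB/s / 55% figures (assumptions, not measured values):

```python
# Rough sketch of how an uncached display stream erodes cache savings.
# All numbers are illustrative assumptions, not measured values.

TOTAL_BW_GBS = 60.0      # assumed memory system bandwidth
DISPLAY_BW_GBS = 3.0     # assumed bandwidth of one uncompressed display stream
CACHE_SAVINGS = 0.55     # assumed fraction of traffic the system cache absorbs

def effective_savings(num_displays: int) -> float:
    """Cache savings only apply to the traffic that isn't display scan-out."""
    bypass_fraction = (num_displays * DISPLAY_BW_GBS) / TOTAL_BW_GBS
    cacheable_fraction = 1.0 - bypass_fraction
    return CACHE_SAVINGS * cacheable_fraction

for n in range(4):
    print(f"{n} display(s): effective savings ~{effective_savings(n):.1%}")
# prints roughly the ~55 / 52 / 50 / 47 % figures used above
```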
So the real gap relative to GDDR6 and HBM2 will be larger than the raw numbers suggest. HDR (10-12 bit color) and/or high refresh rates (90-120 Hz) also expose the problem.
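Roughly how fast the per-stream number grows with bit depth and refresh rate; the resolutions and rates here are assumptions for illustration, not any particular product's spec:

```python
# Scan-out bandwidth per display: width * height * bytes_per_pixel * refresh.
# RGB only, ignoring padding/alpha; all configurations are illustrative assumptions.

def scanout_gbs(width: int, height: int, bits_per_component: int, hz: int) -> float:
    bytes_per_pixel = (bits_per_component * 3) / 8
    return width * height * bytes_per_pixel * hz / 1e9

print(f"4K  8-bit  60 Hz: {scanout_gbs(3840, 2160,  8,  60):.1f} GB/s")   # ~1.5 GB/s
print(f"4K 10-bit 120 Hz: {scanout_gbs(3840, 2160, 10, 120):.1f} GB/s")   # ~3.7 GB/s
print(f"6K 10-bit  60 Hz: {scanout_gbs(6016, 3384, 10,  60):.1f} GB/s")   # ~4.6 GB/s
```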
The geometry render function units (and video decode units) have to "pass off" the rendered screen elements. There are many of them and far fewer display controllers, so they need a central, shared area to communicate through. If we have banned them from using the system cache, the next level up of shared area is RAM. You're trying to direct this away from RAM to some large on-die cache area... possible, but with substantially larger die area consumption, which won't scale up very well either (rough numbers sketched below).
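A back-of-the-envelope look at why holding whole frames on-die gets expensive. The 4 bytes/pixel and the SRAM density figure are my assumptions for illustration, not measured values for any Apple part:

```python
# How big the shared "hand-off" buffers are if kept on-die, assuming full frames
# at 4 bytes/pixel and a rough effective SRAM macro density for a leading-edge node.

MB = 1024 * 1024
SRAM_MM2_PER_MB = 0.25   # assumed effective macro density (mm^2 per MB) -- rough guess

def frame_mb(width: int, height: int, bytes_per_pixel: int = 4) -> float:
    return width * height * bytes_per_pixel / MB

for name, w, h in [("4K", 3840, 2160), ("5K", 5120, 2880), ("6K", 6016, 3384)]:
    size = frame_mb(w, h)
    # double-buffered: one frame being composed while the display controller scans the other
    area = 2 * size * SRAM_MM2_PER_MB
    print(f"{name}: {size:5.1f} MB/frame, ~{area:.0f} mm^2 if double-buffered on-die")
```

Even at these rough numbers, a couple of double-buffered frames eat a noticeable slice of a die that is only on the order of a hundred square millimetres.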
Fancy up/down scaling (XeSS/DLSS/FSR up and "Retina" down) brings similar bandwidth pressures: pragmatically you need to hold both the "old" and the "new" frame in order to get to the altered state.
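A rough sketch of that extra "old + new" traffic, with assumed resolutions and 4 bytes/pixel:

```python
# Up/down scaling touches both the source ("old") and target ("new") frames,
# so per-frame traffic is roughly read(source) + write(target).
# Resolutions, 4 bytes/pixel, and 60 fps are illustrative assumptions.

def frame_bytes(w: int, h: int, bpp: int = 4) -> int:
    return w * h * bpp

def scale_gbs(src: tuple, dst: tuple, fps: int = 60) -> float:
    return (frame_bytes(*src) + frame_bytes(*dst)) * fps / 1e9

# DLSS/FSR-style upscale: render at 1440p, present at 4K
print(f"1440p -> 4K @60: ~{scale_gbs((2560, 1440), (3840, 2160)):.1f} GB/s")   # ~2.9 GB/s
# "Retina"-style downscale: render at 5K, present at 4K
print(f"5K -> 4K @60:    ~{scale_gbs((5120, 2880), (3840, 2160)):.1f} GB/s")   # ~5.5 GB/s
```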
Trying to balloon-squeeze this problem off to the monitors isn't going to work well either. Some monitors have a big enough buffer to self-refresh a completely static screen, but that is an optional feature. Similarly, some monitors can accept DSC-compressed streams. Again optional, not standard. And in order to produce a DSC-compressed stream you need the original frame. [Even if Apple applied something like DSC to every "internal only" framebuffer and/or framebuffer contribution fragment, they could decompress just-in-time into external streams for non-DSC monitors. That cuts down on the display diminishing rate, but the number of monitors can still scale right back up to cover the compression "wins". Probably at the cost of increased area consumption, though.]
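A sketch of that trade-off, assuming a nominal 3:1 DSC-like ratio on top of the earlier illustrative numbers:

```python
# If internal frames were kept in a DSC-like compressed form and decompressed
# just-in-time for scan-out, the bypass fraction shrinks -- but more monitors
# scale it right back up. The 3:1 ratio and all other figures are assumptions.

TOTAL_BW_GBS = 60.0
DISPLAY_BW_GBS = 3.0
CACHE_SAVINGS = 0.55
DSC_RATIO = 3.0          # assumed compression ratio

def effective_savings(num_displays: int, compression: float = 1.0) -> float:
    bypass = num_displays * (DISPLAY_BW_GBS / compression) / TOTAL_BW_GBS
    return CACHE_SAVINGS * (1.0 - bypass)

for n in (1, 3, 6):
    plain = effective_savings(n)
    dsc = effective_savings(n, DSC_RATIO)
    print(f"{n} display(s): {plain:.1%} uncompressed vs {dsc:.1%} compressed")
```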
I still think that the screen output capability of M1 is a limitation of early implementation. I don’t know if it has anything to do with memory or caches.
Die space is probably a significant factor. Bandwidth is another, as I've already illustrated. Those two are actually coupled, as the die-area comparison below shows:
"....
| | |
---|
| Change (factor) | Comments |
---|
CPU 1 (mm2) | 2.1 | - | CPU 1 # cores | 2.0 | M1 has twice as many CPU 1 cores | CPU 2 (mm2) | 1.0 | Same size as CPU 2 |
| | |
| | |
...
DDR (mm2) | 2.7 | M1 has double the DDR interfaces |
....
GPU (mm2) | 2.1 | M1 has double the GPU cores |
....
System cache (mm2) | 0.8 | M1 has 25% smaller system cache |
...."
www.techinsights.com
The "go super wide" is a dual edge sword "solution" . It works much better on mid-size to larger dies. It doesn't work so well once get into the smaller die "half" of the spectrum. Relatively, the system cache shrank and the DDR system is closer to 3 than 2 in scaling factor. ( Not really surprising since to interface with off chip elements through the bottom bumps on package that isn't going to happen a 5nm kind of sizes. Going off die is bigger ). There isn't a good reason to shrink the system cache other running out of area budget.
Already demonstrative that it is "cheaper" to add more GPU cores than it is to add more DDR interfaces . If apply a 4x to the GPU cores there isn't really a way for a pragmatic 6x DDR expansion to keep up.
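Treating the quoted die-area factors as per-doubling costs (my assumption, just to show the divergence):

```python
# Using the factors from the quoted comparison: GPU ~2.1x area for 2x cores,
# DDR ~2.7x area for 2x interfaces. Treating them as per-doubling costs is an
# assumption for illustration.

import math

GPU_FACTOR_PER_DOUBLING = 2.1
DDR_FACTOR_PER_DOUBLING = 2.7

def area_multiple(per_doubling: float, scale: float) -> float:
    return per_doubling ** math.log2(scale)

for scale in (2, 4, 8):
    gpu = area_multiple(GPU_FACTOR_PER_DOUBLING, scale)
    ddr = area_multiple(DDR_FACTOR_PER_DOUBLING, scale)
    print(f"{scale}x units: GPU area ~{gpu:.1f}x, DDR area ~{ddr:.1f}x")
# 2x: ~2.1x vs ~2.7x; 4x: ~4.4x vs ~7.3x; 8x: ~9.3x vs ~19.7x
```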
Similar issue with cache. First, hit rates tend to go "flat" (or at least increase at a sub-"slope 1", sub-linear rate) as capacity grows. Second, SRAM implementation density scales differently than logic density does.
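As an illustration of that flattening, here is a common rule of thumb (an assumption on my part, not a measurement of Apple's system cache) that miss rate falls roughly with the square root of capacity:

```python
# Rule-of-thumb sketch: miss rate ~ 1/sqrt(capacity). Doubling the cache
# does not double its effectiveness. Base miss rate and capacity are assumed.

BASE_MISS_RATE = 0.40    # assumed miss rate at the base capacity
BASE_MB = 8.0

def miss_rate(cache_mb: float) -> float:
    return BASE_MISS_RATE * (BASE_MB / cache_mb) ** 0.5

for mb in (8, 16, 32, 64):
    print(f"{mb:3d} MB: ~{1 - miss_rate(mb):.0%} hit rate")
# each doubling of capacity buys a smaller and smaller hit-rate gain
```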
Sure, but that’s why the CPU clusters have their own shared L2 cache that fulfills the role of the traditional CPU L3.
You are drinking lots of Cupertino kool-aid. Apple plays some games with their L2 cache numbers: those L2s are attached to multiple cores. The E cores' 4MB L2 cache sounds "L3" big right up until you divide by 4 (a 1MB L3 cache isn't a traditional L3 cache in the 2020-2021 era for laptops and desktops). The P cores' 8MB doesn't really measure up as one huge uniform whole either.
www.anandtech.com
That "8+MB" L2 is collapsing in effectiveness around 3MB ( 3072KB ) and basically "done" at 4MB ( 4096) .
[ apple did something similar with the iMac Pro and Mac Pro tech specs were cobbled together some aggregate caches to present a "bigger number than those other guys". ]
There is a comment you quoted about "x TFLOPS needs y amount of bandwidth". The NPU has a substantial number of TFLOPs of its own and sits largely coupled to the L3 system cache. The issue isn't just a GPU silo issue: the M-series SoCs have multiple TFLOP consumers sitting on the L3 system cache. They can do RAM en/decompression to offload bandwidth pressure somewhat, but they also need to be able to walk and chew gum at the same time.
dGPU implementations using GDDR6 or HBM2 have more headroom to walk and chew gum at the same time: you can plug in 2-4 monitors and they cope with the larger workload. Apple's is a cut-off-the-scaling approach: limit the RAM capacity (a rough bytes-per-TFLOP comparison is sketched below).
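A crude bytes-per-TFLOP comparison; the throughput and bandwidth figures are rough assumptions for illustration, not spec-sheet values for any particular part:

```python
# "x TFLOPS needs y GB/s": compare bandwidth per TFLOP when several consumers
# share one unified pool vs. a dGPU with its own GDDR6/HBM2. All numbers are
# assumed placeholders, not measured or published figures.

def gbs_per_tflop(bandwidth_gbs: float, total_tflops: float) -> float:
    return bandwidth_gbs / total_tflops

GPU_TFLOPS = 2.6        # assumed integrated GPU throughput
NPU_TFLOPS = 5.5        # assumed NPU throughput sharing the same pool
SHARED_BW_GBS = 60.0    # same illustrative memory-system bandwidth as above

DGPU_TFLOPS = 4.6       # assumed mid-range dGPU throughput
GDDR6_BW_GBS = 224.0    # assumed dedicated GDDR6 bandwidth

shared = gbs_per_tflop(SHARED_BW_GBS, GPU_TFLOPS + NPU_TFLOPS)
dgpu = gbs_per_tflop(GDDR6_BW_GBS, DGPU_TFLOPS)

print(f"shared unified pool : ~{shared:.0f} GB/s per TFLOP")
print(f"dGPU + GDDR6        : ~{dgpu:.0f} GB/s per TFLOP")
```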