I am not even sure that this is such a big limitation in practical terms. A single 5K frame is around 60Mb uncompressed. To sustain a 60 FPS stream of such frames you need less than 3.6GB/s. M1 can likely support multiple 5K monitors before the bandwidth impact is felt. Not to mention that the display image will be compressed in practice, saving you tons of bandwidth.
I think you mean 60MB, not Mb. (5120 × 2880 × 30 bits = 442Mb ≈ 55MB)
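For anyone who wants to check that arithmetic, here is a quick back-of-the-envelope sketch in Python, assuming 30-bit color (10 bits per channel) as in the numbers above:

```python
# Back-of-the-envelope math for a 5K frame, assuming 30-bit color.
width, height = 5120, 2880
bits_per_pixel = 30            # 10 bits per channel, RGB

frame_bits = width * height * bits_per_pixel
frame_bytes = frame_bits / 8
print(f"one 5K frame: {frame_bits / 1e6:.0f} Mb = {frame_bytes / 1e6:.0f} MB")

fps = 60
print(f"60 FPS stream: {frame_bytes * fps / 1e9:.2f} GB/s")  # ~3.3 GB/s
```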
And yet the M1 Macs are capped at support for just one 6K XDR display.
Additionally, you have to prep the next frame buffers while the display controller is reading the finished ones out (the display controller is going to "write to" the wires going out of the system, not into memory). And if the compression scheme isn't exactly the same as DisplayPort's, then all of that has to be decompressed and then recompressed into the different encoding scheme.
Reading and writing don't happen at the same time, so pragmatically you are looking at about twice what you calculated above. And if the display controller and GPU cores write on the same memory bus, you'll have that read/write contention.
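A hypothetical extension of the earlier math, doubling the traffic for the read-plus-write case described above (an illustration, not a measurement):

```python
# If the GPU writes the next frame while the display controller reads the
# finished one over the same bus, framebuffer traffic roughly doubles.
frame_gbs = 5120 * 2880 * 30 / 8 * 60 / 1e9   # one direction, ~3.3 GB/s
print(f"read + write: ~{2 * frame_gbs:.1f} GB/s of framebuffer traffic")
```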
That's a great reference and analysis! Just to add to this — M1 uses a 128-bit memory interface with 8 independent 16-bit memory channels. Having this many memory channels is one of the features that allows M1 to achieve high memory-level parallelism: multiple memory requests can be in flight simultaneously, so memory is used efficiently (a wider bus means that you might fetch more memory than you actually need, wasting the channel).
Depends upon what is going on. On a dense set of vector/matrix ops, fetching too much isn't all that likely, so you aren't really limited by 16-bit wide sources. However, yes: keeping 8 CPU and 8 GPU cores "happy" with 8 to 16 different memory-address request streams will work better on the "wider" setup that Apple has. And if doing vectors over a relatively tiny 16-bit wide path, it is much better if no other requests have to time-slice in with the much longer stream needed to get some work done. (Four 16-bit requests only amount to one 64-bit word... not a decent-sized data vector.)
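To put numbers on that granularity trade-off, a sketch assuming LPDDR4's standard burst length of 16 beats (an illustration, not a die shot):

```python
# Minimum fetch size per access, for a narrow channel vs. a monolithic bus.
burst_length = 16              # standard LPDDR4 burst length

for bus_bits in (16, 128):
    min_fetch = bus_bits // 8 * burst_length
    print(f"{bus_bits:>3}-bit channel: minimum fetch = {min_fetch} bytes")

# Eight independent 16-bit channels pull 32 bytes per access, so eight
# scattered request streams waste little; a single monolithic 128-bit bus
# would pull 256 bytes per access whether a core needs them or not.
```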
Careful measurements of M1 RAM show that it performs just as you'd expect of regular LPDDR4X-4267 RAM. Peak bandwidth and latency are identical to Intel's Tiger Lake platform, for example (as measured by AnandTech).
LPDDR4 for 8 concurrent core requests (www.anandtech.com)
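The theoretical peak for that configuration falls straight out of standard JEDEC figures (nothing Apple-specific here):

```python
# Theoretical peak bandwidth: LPDDR4X-4267 on a 128-bit interface.
bus_bits = 128
transfers_per_s = 4267e6       # 4267 MT/s

peak_gbs = bus_bits / 8 * transfers_per_s / 1e9
print(f"theoretical peak: {peak_gbs:.1f} GB/s")   # ~68.3 GB/s
```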
Pragmatically, that is probably lower effective latency than if you had 8+ x86 and GPU core reads stacked up on top of each other. Not sure that test is really pointing at total bisection bandwidth. In fact, probably not, since memory copies actually run higher than that rate.
"... Most importantly, memory copies land in at 60 to 62GB/s depending if you’re using scalar or vector instructions. ..."
(unless copying a single small fragment out of the L2 cache.)
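A minimal copy-bandwidth probe in the spirit of the measurement quoted above. Results will vary by machine, and per the caveat, the buffer must be far larger than the LLC so the copy actually hits RAM:

```python
import time
import numpy as np

N = 512 * 1024 * 1024                      # 512 MB, far bigger than a 16 MB LLC
src = np.ones(N, dtype=np.uint8)
dst = np.empty_like(src)

t0 = time.perf_counter()
np.copyto(dst, src)
elapsed = time.perf_counter() - t0

# A copy moves N bytes of reads plus N bytes of writes.
print(f"~{2 * N / elapsed / 1e9:.1f} GB/s effective copy bandwidth")
```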
LPDDR5 would give them a nice boost in bandwidth, but they would still need to use multi-channel controllers. That's why the conservative prediction for the upcoming prosumer chip is at least a 256-bit RAM interface (double that of M1).
LPDDR5 alone would be a reduction in channels. It just gives them an outlet to grow up closer to the 3080/6900 class in future iterations. It is a growth path because they have used up the "go wider" approach.
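For scale, here is what the speculated jump would buy. Both numbers below are assumptions: LPDDR5-6400 as the speed grade and the rumored 256-bit interface, neither of which is confirmed:

```python
# Peak bandwidth comparison: shipping M1 vs. a speculated prosumer config.
def peak_gbs(bus_bits, transfers_per_s):
    return bus_bits / 8 * transfers_per_s / 1e9

print(f"M1, 128-bit LPDDR4X-4267: {peak_gbs(128, 4267e6):6.1f} GB/s")
print(f"256-bit LPDDR5-6400:      {peak_gbs(256, 6400e6):6.1f} GB/s")
```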
True, but let's not forget that Apple has large system-level caches — something that GPUs lack. An RTX 3080 has only 5MB of L2 cache; even M1 has a 16GB LLC.
Again, probably not the right number: 16MB LLC. Pointing at the system RAM and claiming it is some kind of "cache" is just wrong. A memory cache is some subset of the whole, not the whole thing.
If you go back to the TechInsights article, the M1 system cache is actually smaller than the A14's. It is nowhere near 1GB of capacity (let alone 16).
As for the claim that no GPUs have large caches: not really true.
"..Navi 21 will have a 128MB Infinity Cache. Meanwhile AMD isn’t speaking about other GPUs, but those will presumably include smaller caches as fast caches eat up a lot of die space.
...
On that note, let's talk about the size of the infinity Cache. In short, based on currently available data it would appear that AMD has dedicated a surprisingly large portion of their transistor budget to it. Doing some quick paper napkin math and assuming AMD is using standard 6T SRAM, Navi 21’s Infinity Cache would be at least
6 billion transistors in size, which is a significant number of transistors even on TSMC’s 7nm process (for reference, the entirety of Navi 10 is 10.3B transistors)... "
https://www.anandtech.com/show/1620...-starts-at-the-highend-coming-november-18th/2
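Reproducing the quote's paper-napkin math, counting only the bit cells (no tags, sense amps, or routing overhead):

```python
# 128 MB of 6T SRAM: 6 transistors per bit cell.
cache_bits = 128 * 1024 * 1024 * 8
transistors = cache_bits * 6
print(f"~{transistors / 1e9:.1f}B transistors")  # ~6.4B, matching "at least 6 billion"
```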
Even the 6700 has 96MB of Infinity Cache (en.wikipedia.org).
So Apple "way out in front of everyone else" .... not really. It is going to be tough for them to out compete the 6900 which has higher baseline VRAM bandwidth and a larger "system cache" already. let alone when get to 6nm or 5nm iterations.
Large caches help to compensate for the limited system RAM bandwidth in both compute and graphics workloads, and on the graphics side Apple also has TBDR. And again, let's not forget compression — A14/M1 has hardware memory compression between the cache and the RAM for GPU compute workloads, which saves memory bandwidth.
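As an illustration of why lossless compression between cache and RAM saves bandwidth, here is a sketch with zlib standing in for Apple's undisclosed hardware scheme; the ratio below is for this synthetic buffer only, not a real workload:

```python
import zlib
import numpy as np

# A synthetic "framebuffer": a smooth gradient, which compresses well,
# as large flat regions of rendered frames often do.
frame = np.linspace(0, 255, 1 << 20).astype(np.uint8).tobytes()

compressed = zlib.compress(frame, 1)
ratio = len(frame) / len(compressed)
print(f"{ratio:.0f}x ratio: each byte of bus traffic carries ~{ratio:.0f} bytes of payload")
```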
TBDR isn't going to help on compute tasks (e.g., de-Bayering RED RAW, or anything else the M-series fixed-function media logic doesn't cover). TBDR is why Apple is trying to herd developers into fully optimizing for their specific GPUs, though. (You can't optimize for any other GPU implementation on M-series Macs at the moment.)
AMD and Nvidia have various texture and frame compression features too. Again, they are sitting on a faster baseline VRAM tech and adding in the same tricks... Apple isn't going to pass them up at the top levels. Apple can eat into old "discrete exclusive" territory, but blow past the entire spectrum? Probably not.
At the same time, Apple is less reliant on PCIe lanes. GPU, SSD — all the usual culprits run on some sort of internal chip magic, so you are pretty much left with Thunderbolt for your PCIe requirements. Of course, it remains to be seen how (and whether) they will solve modularity issues for a new Mac Pro. But even if it is modular, I doubt that it will allow much in terms of third-party PCIe device expansion.
That's the same "paint ourselves into a corner" problem Apple already said was a problem before. Only one internal storage drive is good enough for all high-end users? Nope. Only an embedded proprietary GPU is good enough? Nope. Thunderbolt as a panacea for high-bandwidth I/O cards? Nope. (A single RX 6900 is whipping the Vega II Duos of a couple of years ago on intense computational tasks. So what would two RX 6900s do to a single Apple iGPU? More than likely leave it in the dust, let alone what a 7900 or 8900 would do in a few years.)
For the laptops? Yes. Apple never used the full PCIe lane allocation anyway; they didn't use it all in the iMac Pro. If Apple kneecaps the large-screen iMac to fit inside the chin so they can maximize how thin the system is, yeah, you can see them hand-waving away the lanes. But that is basically a retreat back to desktops built on laptop-focused CPUs and GPUs.