Not quite accurate, at least from how Apple described it. The old integrated memory systems still treated the RAM allocated to the GPU as separate from the CPU. It was taken from the same pool but kept separate. From what I understand, however, the M1 has RAM that can be read by both the CPU and GPU simultaneously.
That might have been the case with some really old systems, but it doesn't really apply to newer Intel and AMD iGPUs, from what I understand. We don't really know what (if anything) is different about Apple's implementation, except of course that with Apple Silicon, unified memory is a primary feature and everything else is tuned around it. Apple also has a lot of shared cache that allows very fast data synchronization between the CPU, GPU, Neural Engine and other processors. I expect it to be faster in general than what we have on other systems, where unified memory is kind of built out of necessity rather than embraced as an essential foundation of a high-performance infrastructure.
Other systems where we have unified memory enabling high performance are game consoles and some custom supercomputers, like the Fujitsu A64FX, which is essentially a CPU/GPU hybrid (an ARM CPU with very wide vector processing units) that uses HBM2 for its memory.
Note that the new AMD GPUs have a sort-of similar speed advantage when plugged into an AMD motherboard with an AMD CPU: the CPU gets full native access to the GPU's 16GB of RAM.
It doesn’t have native access; it can just execute copies to the GPU memory a bit more efficiently.
So in the old systems with integrated graphics, say you have 16GB of RAM and 3GB are allocated to the integrated GPU. A 2GB asset gets loaded, leaving 11GB of system RAM, and then that asset gets copied into the GPU's RAM pool, leaving 1GB of GPU-allocated RAM, so that one 2GB asset ends up taking 4GB overall?
Most integrated GPUs can just use that asset directly.
In a discrete GPU setting, that same 2GB asset will still reside in system RAM, so still eating up 2GB, while also being copied over to GPU VRAM?
Sometimes, depending on what you want to do with it. If the GPU driver thinks that you might need to read the data back from the GPU, for example, it might decide to keep a copy in system memory.
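To make the accounting concrete, here's a toy sketch (purely illustrative Python; the function name and the "keep a system copy" flag are made up, not real driver behavior or APIs) of the discrete-GPU case, where the driver may keep a staging copy in system RAM for readback:

```python
# Toy model of memory accounting when loading an asset for a discrete GPU.
# Purely illustrative; real drivers are far more sophisticated.

def discrete_gpu_load(asset_gb, keep_system_copy=True):
    """Return (system_ram_used_gb, vram_used_gb) after loading an asset."""
    vram_used = asset_gb  # the asset is always copied into GPU VRAM
    # The driver may keep a copy in system RAM in case the CPU reads it back.
    system_ram_used = asset_gb if keep_system_copy else 0
    return system_ram_used, vram_used

sys_gb, vram_gb = discrete_gpu_load(2)
print(sys_gb + vram_gb)  # 4 -- a 2GB asset occupies 4GB when a system copy is kept
print(sum(discrete_gpu_load(2, keep_system_copy=False)))  # 2 -- VRAM only
```

So in this toy model the "double occupancy" only happens when the driver decides the CPU might want the data back, which matches the "sometimes, depending" answer above.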
So the improvement we're seeing with Apple Silicon is that the 2GB asset gets loaded once into unified system RAM and the GPU renders it directly from there, so no duplication of assets as in the previous integrated-GPU cases?
So it seems like with Apple Silicon's unified memory, dedicated GPU VRAM would essentially be redundant and no longer relevant?
Apple's implementation avoids the need to copy data (caveat: sometimes the data will be optimized for GPU use, which could mean, for example, applying compression or rearranging the data layout so that the GPU can access it faster). Furthermore, Apple's implementation allows very fast data exchange between different processors, since they all share the same last-level cache. And finally, Apple's implementation radically simplifies memory management for the OS, since all memory is just... memory. No matter which processors use a particular RAM page, it can be compressed, swapped out and swapped in as necessary. This makes things simpler, faster and essentially more robust: there is less room for memory management bugs and fewer special cases to take care of.
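A matching toy sketch for the unified case (again illustrative Python, not Apple's actual allocator; the names are invented): the CPU and GPU reference a single allocation, with the GPU-optimized copy from the caveat above modeled as a worst-case extra allocation:

```python
# Toy model of unified-memory accounting: CPU and GPU share one allocation.
# Illustrative only; names are made up and this is not Apple's real allocator.

def unified_load(asset_gb, gpu_optimized_copy=False):
    """Return total RAM used in GB after loading an asset into unified memory."""
    total = asset_gb  # single shared allocation; no separate CPU/GPU copies
    if gpu_optimized_copy:
        # Caveat: data may be compressed or re-laid-out for faster GPU access.
        # Model that pessimistically as a second, same-sized allocation.
        total += asset_gb
    return total

print(unified_load(2))  # 2 -- one copy, visible to both CPU and GPU
print(unified_load(2, gpu_optimized_copy=True))  # 4 -- upper bound with an optimized copy
```

Compared with the discrete-GPU accounting, the common case drops from 4GB to 2GB for the same asset; only the GPU-optimization caveat can bring duplication back, and even then only for data that benefits from it.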