...maybe?
I don't know anything about academic research on hierarchical memories. So I don't know if certain schemes have already been modeled or tested. I do have some understanding of the related problem in storage, but I don't even know if that understanding is portable - does the relationship between disk and SSD match that between RAM and HBM? No idea.
But I can at least imagine a few different ways you might want to approach the problem of the OS supporting some level of memory-tier awareness. Obviously you could support direct allocations from HBM or DRAM, with varying failure modes when HBM isn't available. You could have priority levels for determining who gets evicted from HBM down to DRAM. This might need to interact with OS-level permissions.
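To make that concrete, here's a minimal sketch of what a tier-aware allocation call could look like. Nothing here is a real Darwin/macOS API - the names, the policy flags, and the eviction priority are all invented for illustration, and the "HBM" tier is simulated with plain malloc so the sketch actually compiles and runs.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical tiers and policies - not a real OS interface. */
typedef enum { MEM_TIER_HBM, MEM_TIER_DRAM } mem_tier_t;
typedef enum { TIER_STRICT, TIER_FALLBACK } tier_policy_t;

/* Hypothetical allocation call: prefer 'tier', and either fail or fall back
 * to DRAM when the faster tier is exhausted. 'evict_prio' is the knob the
 * kernel could use to decide who gets demoted to DRAM first under pressure.
 * Here both tiers are just malloc, so the sketch is self-contained. */
static void *tier_alloc(size_t size, mem_tier_t tier,
                        tier_policy_t policy, int evict_prio)
{
    (void)tier; (void)evict_prio;      /* simulated: no real tiers here   */
    void *p = malloc(size);            /* stand-in for the "fast" tier    */
    if (!p && policy == TIER_FALLBACK)
        p = malloc(size);              /* stand-in for falling back       */
    return p;                          /* NULL is the strict failure mode */
}

int main(void)
{
    /* Ask for 64 MiB of "HBM", but allow demotion to DRAM under pressure. */
    void *buf = tier_alloc(64u << 20, MEM_TIER_HBM, TIER_FALLBACK, /*prio=*/3);
    printf("allocation %s\n", buf ? "succeeded" : "failed");
    free(buf);
    return 0;
}
```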
When you talk about a "dedicated cache die", I assume you're talking about something like AMD's X3D. Does that make sense for Apple? I mean, what does it buy them, in their target markets? So far, X3D has proven useful mostly (though not exclusively) in games. That's not exactly a motivating factor for Apple.
Of course Apple would have a minor advantage with a cache die, in that they wouldn't have to reduce their clock speed to take advantage of it the way AMD does, since they clock lower anyway.
The real question is what problem does this solve for Apple?
You only care about these bandwidths if you have a lot of GPU performance, in which case you have the die perimeter (with plenty of ideas out there for how to extend it) to support lots of memory controllers, PHYs, and separate DRAM chips. Why not use a scheme that works well (including with all the Apple infrastructure no one ever talks about, like QoS and use of the SLC for IP block communication), which can grow the bandwidth as required, and which uses standard DRAM rather than more expensive HBM?
A 3090 has ~936 GB/s to its GDDR6X.
An M2 Ultra has 800 GB/s to DRAM, and that's with the older LPDDR5. An M4 Ultra would presumably have about 1.3x that, so ~1050 GB/s. Then an Extreme doubles this...
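Back-of-the-envelope check on those numbers, assuming the Ultra keeps its 1024-bit LPDDR bus (the data rates are the standard LPDDR5-6400 and LPDDR5X-8533 figures; everything else is just multiplication):

```c
#include <stdio.h>

/* Peak bandwidth in GB/s = (bus width in bytes) x (transfer rate in GT/s). */
static double bw_gbs(int bus_bits, double gts)
{
    return (bus_bits / 8.0) * gts;
}

int main(void)
{
    printf("1024-bit LPDDR5-6400  (M2 Ultra):  %.0f GB/s\n", bw_gbs(1024, 6.4));
    printf("1024-bit LPDDR5X-8533 (M4 Ultra?): %.0f GB/s\n", bw_gbs(1024, 8.533));
    printf("Doubled (an Extreme?):             %.0f GB/s\n", 2 * bw_gbs(1024, 8.533));
    return 0;
}
```

That gives roughly 819, 1090, and 2180 GB/s of theoretical peak, in line with the rough 1.3x scaling above.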
I can see value in adding an additional die on top of the SoC that provides something "different" - possibly a very large SLC (so, the equivalent of V-cache), more likely IMHO an augmentation of the SLC that provides persistence (i.e. MRAM or ReRAM). But I just don't see the point of adding HBM.
Certainly in the V-cache version there's no issue of APIs; you could just use the device as an extension of the SLC.
There's a reason I always refer to what Apple offers as an SLC, not an L3. The SLC does a lot more than, and is a lot smarter than, a standard L3.
Remember that the SLC:
(a) acts as a memory-side cache. In other words, ALL memory traffic is visible to the SLC, which is why it can act as a communication mechanism between devices communicating via DMA - rather than one device's DMA writing to RAM and another then reading it back, the producer's DMA writes to the SLC (which effectively IS RAM as far as addresses and the memory controller are concerned) and the consumer's DMA reads from the SLC.
(b) has a massive amount of controllability. Different clients are given quotas; within a client's quota, different streams of use can be given streamIDs, so that different elements can fight amongst themselves over who lives in cache without affecting other clients; clients can lock items in cache (e.g. the GPU can lock items that are used every frame, but which might otherwise be evicted as the textures and so on of each frame are pulled in); and so on. There are amazing features there, like large items (think textures) having only every n'th line cached, so that a cached line is pulled in rapidly and, while it is being processed, the next (n-1) lines are being transferred from memory, etc.
You don't want to give up any of this goodness!
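Purely as an illustration of what those controls amount to (none of this is from Apple - the types, fields, and numbers below are all invented, just to make the quota / streamID / locking / strided-caching ideas concrete):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Invented, illustrative types only - not an Apple interface. */
typedef struct {
    uint32_t client_id;     /* e.g. GPU, ISP, NPU, a CPU cluster        */
    uint32_t quota_lines;   /* max cache lines this client may occupy   */
} slc_client_quota_t;

typedef struct {
    uint32_t client_id;
    uint32_t stream_id;     /* streams within a client compete only with
                               each other, inside that client's quota   */
    bool     locked;        /* pin e.g. per-frame GPU data in the cache */
    uint8_t  cache_every_n; /* for large streaming objects (textures):
                               keep only every n'th line resident, so a
                               cached line is consumed while the next
                               n-1 are still in flight from DRAM        */
} slc_stream_policy_t;

int main(void)
{
    /* What a GPU client's setup might look like in this toy model. */
    slc_client_quota_t  gpu        = { .client_id = 2, .quota_lines = 65536 };
    slc_stream_policy_t frame_data = { .client_id = 2, .stream_id = 0,
                                       .locked = true,  .cache_every_n = 1 };
    slc_stream_policy_t textures   = { .client_id = 2, .stream_id = 1,
                                       .locked = false, .cache_every_n = 4 };

    printf("GPU quota: %u lines; textures cached every %u'th line\n",
           (unsigned)gpu.quota_lines, (unsigned)textures.cache_every_n);
    (void)frame_data;
    return 0;
}
```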
The obvious way to do things, IF Apple goes down this path, would IMHO be to copy the way things were done back in the early 90s - keep the SLC tags on the main SoC die, taking up the space of the SLC on, say, a Max chip, and put the SLC storage in the separate die mounted above. This is Apple, not AMD - they don't have to cater to every imaginable config; they can just say that (as an example) when you buy a Pro you get an on-die SLC (of 36MB or whatever) and when you buy a Max you get an off-die SLC of 256MB (or whatever).
On-die tags will mean that MOST transactions run just as fast as on the Pro; only actually transferring data to or from the off-die SLC storage will add a few cycles (to the 50 or so already taken).
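A toy latency model of that split, using the ~50-cycle figure above and a made-up handful of cycles for crossing to the stacked die:

```c
#include <stdio.h>

int main(void)
{
    const int slc_hit_cycles   = 50; /* rough on-die SLC hit latency (from above)    */
    const int die_cross_cycles = 6;  /* assumed penalty for reaching the stacked die */

    /* Tags stay on the SoC die, so deciding hit vs. miss costs the same as
     * today; only moving data to or from the stacked SLC storage pays extra. */
    printf("on-die SLC hit:      %d cycles\n", slc_hit_cycles);
    printf("stacked-die SLC hit: %d cycles\n", slc_hit_cycles + die_cross_cycles);
    printf("miss (tags on die):  ~%d cycles, then off to DRAM as before\n",
           slc_hit_cycles);
    return 0;
}
```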