Any RAM beyond the UMA on the SoC / SiP / MCM would be secondary RAM, faster than swapping to the SSD, henceforth to be known as AppleCache...
The idea is that the RAM would be split in two: 1) unified memory, either LPDDR5 or HBM, a smaller pool with much higher bandwidth that feeds the SoC directly, and 2) standard DRAM, a larger pool that could reach that 1.5-2 TB limit and would act as a cache for the huge data set. Technically not *necessary*, but it's an approach Apple *could* adopt.
This is where I was going, though whether the on-package or off-package memory is "the cache" is probably a matter of whether you view it like a processor cache or a disk cache.
Either way, this was essentially my point. The on-package RAM has a number of key benefits. Power, certainly, but more important in a desktop is that this memory is lower latency than going out to a big bus of DIMMs, and it is equally accessible to all the on-chip computation units, so there's no need to push GPU or Neural Engine data across a PCIe bus or whatever and then pull the results back.
If you try to pool external RAM into the unified pool, then you'll have address-dependent access profiles (effectively NUMA), which really complicates how the system handles memory.
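To make the complication concrete, here's a minimal sketch in Swift with made-up numbers (the 256 GB / 2 TB split and the latency figures are illustrative assumptions, not Apple specs): in one flat pool, the cost of an ordinary load depends on where the physical address happens to land, and the allocator and scheduler would have to reason about that for every allocation.

```swift
import Foundation

// Illustrative two-tier split; capacities and latencies are assumptions.
let onPackageBytes: UInt64 = 256 << 30   // fast LPDDR5/HBM on package
let totalBytes: UInt64 = 2 << 40         // 2 TB flat unified pool

// In a single flat address space, a plain load's cost depends on which
// tier the physical address lands in, which is effectively NUMA.
func loadLatencyNs(_ physAddr: UInt64) -> Double {
    precondition(physAddr < totalBytes)
    return physAddr < onPackageBytes ? 100.0 : 350.0
}

print(loadLatencyNs(1 << 20))   // lands on package: fast
print(loadLatencyNs(1 << 40))   // lands off package: several times slower
```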
It seems more sensible to use off-package RAM as a distinct pool in the virtual memory stack. There are essentially two ways to look at it: you can think of the package RAM as an L4 cache (or whatever), or you can think of the off-package RAM as essentially a RAM disk for swap. In the end, the difference is academic. I'd imagine that the system would page less-used memory off package, and page needed memory into the package.
This seems simpler than @crazy dave 's idea of massive bandwidth to the off-package RAM, which would be necessary to keep the workload from slowing down depending on address if it were all treated as one massive unified pool. The cores would execute against the on-package RAM at all times, and it would retain unified access for all cores. The cores wouldn't have direct access to off-package RAM; data would be paged into the package before it was accessed.
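As a toy illustration of that model, here's a demand-paging sketch in Swift with LRU eviction, where the on-package pool is the only memory the cores ever touch (the four-page capacity and the LRU policy are assumptions for illustration, not how the real pager would work):

```swift
import Foundation

// Toy model of the "on-package RAM as cache" view: cores only ever see
// resident (on-package) pages; a miss pages data in from the off-package
// pool, evicting the least-recently-used page to make room.
struct PackagePager {
    let capacity: Int            // on-package pages available
    var resident: [Int] = []     // page numbers in LRU order (front = oldest)

    // Returns true on a hit; on a miss, simulates paging in from off-package.
    mutating func access(page: Int) -> Bool {
        if let i = resident.firstIndex(of: page) {
            resident.remove(at: i)
            resident.append(page)     // refresh to most-recently-used
            return true
        }
        if resident.count == capacity {
            resident.removeFirst()    // evict LRU page to off-package RAM
        }
        resident.append(page)         // page in the requested page
        return false
    }
}

var pager = PackagePager(capacity: 4)
for p in [0, 1, 2, 3, 0, 4, 1] {
    print("page \(p):", pager.access(page: p) ? "hit (on-package)" : "miss (paged in)")
}
```

Under this view the off-package DRAM sits between the on-package pool and SSD swap, i.e. the "faster than swapping to the SSD" tier from the top of the thread.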
You might see a difference if you set all the cores to calculating a parallel running sum over the entire address space, but I'd suspect that 256 GB is a sufficiently large local workspace not to be a major bottleneck for most workloads.
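For what it's worth, that worst case is easy to sketch: a parallel running sum streams every element exactly once, so no tier gets any reuse and a large local workspace can't help (the 1M-element array below stands in for a multi-TB data set; the chunking is a made-up example, not a benchmark):

```swift
import Foundation

// Worst case for any caching tier: a parallel running sum that streams
// the entire data set exactly once, so nothing is ever reused.
let data = (0..<1_000_000).map { _ in Double.random(in: 0..<1) }
let chunks = 8
let chunkSize = data.count / chunks
var partials = [Double](repeating: 0, count: chunks)

// Each worker reads its own chunk once; distinct output slots avoid races.
partials.withUnsafeMutableBufferPointer { buf in
    DispatchQueue.concurrentPerform(iterations: chunks) { i in
        let range = (i * chunkSize)..<((i + 1) * chunkSize)
        buf[i] = data[range].reduce(0, +)
    }
}
print("sum =", partials.reduce(0, +))
```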