There are 3 approaches I hypothesize:
1. Stick with the unified memory approach: whatever program uses the accelerator card will have its compute done on the card, with access only to the memory on that package.
The major problem with that approach is that Apple has spent the last 3+ years telling software developers to assume otherwise. So where are these "programs only for an accelerator" coming from?
The farther away the memory is placed from the main system RAM, the less "unified" it will be. It might be flatly addressable, but the latency will creep up. Apple's "unified" basically covers both to make it highly transparent to software, and Apple has been pushing folks to assume that while writing highly optimized code.
That is especially true if these cards fit into standard PCI-e slots. (And Apple has shown about zero interest in CXL, which doesn't avoid the latency issue either.)
If you push the "programs run on local resources" idea to a more complete state, then the whole stack (except for network-mounted storage drives) is there: the OS/libraries/application/etc. all on local resources.
2. Take the Nvidia approach and use HBM, which will necessarily be memory bandwidth limited (relatively speaking, it'll still be /fast/), but that goes against the design philosophy of Apple Silicon.
Apple Silicon's approach to memory already is a "poor man's" HBM. Going to more expensive HBM really isn't going to 'solve' much if you have introduced non-unified memory. Going to a small A-series chip isn't going to have much affinity for expensive HBM either (smaller dies have smaller edges, which makes HBM less tractable).
If you have a 'full size' Mn/Mn Pro/Mn Max package, then you already have the 'poor man's HBM' for free (no additional R&D costs or divergence from the semi-custom RAM packages used for those systems). Apple reuses silicon like this all the time: the tweaked A10 for the T2, the A13 tossed into the Studio Display, the Watch SoC tossed into the HomePod, etc.
3. Some other way I haven't thought of.
" ... Using this communication, the worker processes synchronize gradients before each iteration. I'll show this in action using four Mac Studios connected to each other with Thunderbolt cables. For this example, I will train ResNet, a classifier for images. The bar to the side of each Mac Studio shows the GPU utilization while training this network. For a single Mac Studio, the performance is about 200 images per second. When I add another Mac Studio connected via Thunderbolt, the performance almost doubles to 400 images per second since both GPUs are utilized to the fullest. Finally, when I connect two more Mac Studios, the performance is elevated to 800 images per second. This is almost linear scaling on your compute bound training workloads. ..."
Accelerate machine learning with Metal - WWDC22 - Videos - Apple Developer (developer.apple.com)
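For a concrete sense of what "the worker processes synchronize gradients before each iteration" looks like, here is a minimal sketch of multi-node data-parallel training in PyTorch. It illustrates the technique the demo describes, not necessarily Apple's exact stack: it assumes the MPS backend is available, uses the gloo process group (NCCL isn't available on macOS), runs one worker per Mac, and expects the usual RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT environment variables to be set by whatever launches the workers.

```python
# Sketch only: data-parallel ResNet training with explicit gradient sync.
# Assumes PyTorch with the MPS backend and the gloo distributed backend.
import torch
import torch.distributed as dist
import torchvision


def main():
    # Reads RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT from the environment;
    # MASTER_ADDR is the IP of node 0 on the Thunderbolt-bridge or 10GbE network.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

    model = torchvision.models.resnet50().to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    for step in range(100):
        # Placeholder batch; a real run would shard an image dataset per rank.
        images = torch.randn(32, 3, 224, 224, device=device)
        labels = torch.randint(0, 1000, (32,), device=device)

        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()

        # Gradient synchronization before each optimizer step: average every
        # gradient across all workers (gloo collectives run on CPU tensors).
        for p in model.parameters():
            if p.grad is not None:
                grad_cpu = p.grad.detach().to("cpu")
                dist.all_reduce(grad_cpu, op=dist.ReduceOp.SUM)
                p.grad.copy_(grad_cpu / world_size)

        optimizer.step()
        if rank == 0 and step % 10 == 0:
            print(f"step {step} loss {loss.item():.3f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```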
If Apple were looking to get rid of the Thunderbolt cables (because Apple 'hates' wires), a simple way of doing it would be to substitute a standard PCI-e card edge (running a virtual Ethernet connection over it) for the TB/10GbE connection.
The cluster compute for that demo would work just as well, with the exact same software, over 10GbE between the Studio boxes. (It is just easier to do three one-to-one connections with Thunderbolt, because there are 4+ TB ports on the Mac Studio and only one 10GbE port; over Ethernet you could at best hook to only one other Mac Studio.) What a Mac-on-a-card could do is cluster with no wires and no external Ethernet switches. Also no external power cables (one per Mac Studio in the WWDC cluster)... so more 'wires eliminated' for the fewer-wires jihad.
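To the clustering software, the choice of interconnect only shows up as which IP address the workers rendezvous on, so moving between a Thunderbolt bridge and 10GbE (or a virtual Ethernet link over a card edge) is configuration, not code. A hypothetical per-node environment for the sketch above, with names and addresses purely illustrative:

```python
# Hypothetical per-node settings for the training sketch above.
# Only MASTER_ADDR changes between interconnects: it can be node 0's
# Thunderbolt-bridge (bridge0) address or its 10GbE address.
import os

os.environ.update({
    "MASTER_ADDR": "169.254.10.1",  # node 0's IP on the chosen link (illustrative)
    "MASTER_PORT": "29500",
    "WORLD_SIZE": "4",              # four Mac Studios in the WWDC-style cluster
    "RANK": "0",                    # unique per machine: 0, 1, 2, 3
})
```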
Put a Mac on a card and just tell folks to use the same 'compute' clustering software Apple has been encouraging them to write for several years now.
A bit of copying what Nvidia does with NVLink for intra-box clustering using a consistent software memory model, only probably over standard PCI-e v4 (which Apple already has). So it ends up closer to a "poor man's NVLink". The following generation might go to PCI-e v5, which would be an incrementally better "poor man's NVLink" for relatively little 'extra' work.