Actually wrong; the LSI and its substrate are not restricted this way. Look at AMD Epyc Genoa for a better understanding, and at TSMC's papers.
Epyc Genoa doesn't have LSI 3D packaging.
AMD EPYC 7004 "Genoa" processor has been pictured, features twelve Zen4 chiplets - VideoCardz.com
The chiplets there use the same "PCB interposer" tech that AMD has used all along with the general Zen chiplet strategy.
The 3D cache variants of Genoa are not the "LSI 3D packaging" technology that TSMC does. The cache is on the 'wrong' side, and the connection is only through the bond, not to another chiplet. (Only two chips are involved, where LSI sits between three.)
If you want to hand-wave at something, then MI-300 has a kitchen sink of chip-bonding techniques going on. You'd have a better chance of hitting the broad side of a barn there.
I theorize Apple could join four M2s using a single square UltraFusion hub, each edge matching an M2, with the inner silicon dedicated to buffers, cache, and related logic. Like this:
Code:
          [m2 m]
            ||
[m2 m] = [UF HUB] = [m2 m]
            ||
          [m2 m] ======= extra PCIe lanes multiplexed from each M2 Max
( = and || denote the UltraFusion interface; ======= denotes the extra PCIe bus )
That layout has significant problems. First, it really hasn't accounted for where the memory packages go, and those are NOT directly attached to the same interposer that the main core dies are attached to. The core-die interposer is attached to another one, similar to the "PCB interposer" that Epyc uses to couple to the memory chips. So if you draw a square around your thing and then try to place the memory chips outside that square, you run into substantive issues.
Second, the NUMA blow-up there is likely quite high. If you want to get from the top-right corner of the top [m2 m] chip to the lower-left corner of the [m2 m] chip on the hub's left, then you have to go all the way down and then over to the left. Nothing at all like a straight line, or even the semi-straight line you could do with a rectangular mesh.
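To put rough numbers on that detour, here is a quick geometry sketch. All dimensions are assumptions for illustration (22 mm square dies around a 22 mm square hub), not Apple specs; it just compares the straight-line distance between those two corners against a Manhattan-style route through the hub.

```python
import math

DIE = 22.0  # mm, assumed edge length for both the dies and the hub

# Hub center at the origin; die centers sit one die-width away
top_die  = (0.0,  DIE)
left_die = (-DIE, 0.0)

# Top-right corner of the top die, lower-left corner of the left die
a = (top_die[0] + DIE / 2, top_die[1] + DIE / 2)    # (11, 33)
b = (left_die[0] - DIE / 2, left_die[1] - DIE / 2)  # (-33, -11)

# What a mesh could approximate: a straight shot between the corners
straight = math.dist(a, b)

# What the hub topology forces: down through the top<->hub interface,
# across the hub, then out the hub<->left interface (via the hub center)
routed = (abs(a[0]) + abs(a[1])) + (abs(b[0]) + abs(b[1]))

print(f"straight-line: {straight:.1f} mm, routed via hub: {routed:.1f} mm")
```

Even with these toy numbers the hub route comes out roughly 40% longer than the straight line, and every extra millimeter is latency and power.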
Making that show up to programmers transparently as a single GPU and a smoothly coherent CPU complex is likely going to be problematic. Similar to how AMD MI-250 systems show up as two GPUs, not one, even though the dies are LSI-like coupled. Four would even more likely be in that substantial-NUMA zone. (There's a decent chance MI-300 isn't going to be a completely 'flat', uniform solution either, but it'll be much better than pure discrete cards.)
Finally, the last problem is that with TSMC LSI 3D packaging, the connecting LSI die is completely covered by the dies placed on top. There is no way that four same-sized dies are going to completely cover that hub. The edges of the [m2 m] dies will come into conflict with one another as you try to draw them closer to the center.
Furthermore, the "UltraFusion"-containing edge of the M2 Max is around 22mm (the M1 Max's is 19.96mm, and the M2-series dies bloat out bigger). If that hub die is 22x22, that is yet another 484mm^2 die sitting there. (Even if it only bloats to 21mm, that's a 441mm^2 die there; again bigger than the original M1 Max.) The bigger that die is, the longer the distances between the main [m2 m] dies. The longer the distances, the bigger the latencies (and power loss).
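The die-area arithmetic above, spelled out. The 22mm / 21mm edge lengths are the estimates from this post; the ~432 mm^2 M1 Max figure is an approximate published die-size number, used only for comparison:

```python
M1_MAX_AREA = 432.0  # mm^2, approximate published M1 Max die size

for edge_mm in (22.0, 21.0):
    area = edge_mm ** 2  # square hub die: edge length squared
    verdict = "bigger" if area > M1_MAX_AREA else "smaller"
    print(f"{edge_mm:.0f} mm hub edge -> {area:.0f} mm^2 ({verdict} than an M1 Max)")
# 22 mm -> 484 mm^2, 21 mm -> 441 mm^2: either way the hub alone
# exceeds the largest monolithic die Apple has shipped in this family
```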
Pointing at the Epyc above and saying 'but, but, but, the outer chiplet die that is two hops away from the main I/O hub of that package works'... Yeah, AMD takes a 'make everything slower' approach to minimize the NUMA hit. CPUs are going to be much more tolerant of that than GPUs will be. If you're trying to make this look like one big unified GPU to the customer, that won't work. That's precisely why the 7900 reverses the chiplet decomposition approach (memory controllers and cache go out into the chiplets, and the cores stay on one centralized, unified, single-die mesh).
Trying to pound the 'round peg' of the laptop monolithic chip into the 'square hole' of a >2-chiplet design is how you get a bad design. It is just physically aggregated sub-optimally.
[UF Hub]: a square silicon 4x4 bridge, interfacing at each side with an M2 Ultra; indeed slightly bigger than an M2 Ultra. This solution un-obstructs the DRAM/integrated-peripheral buses,
Un-obstructed but significantly longer isn't going to solve the problem. That is a 'jumping out of the frying pan and into the fire' kind of solution.
while allowing optimal cooling and reducing required parts, notwithstanding it would be as expensive as an M1 Max on 5/7nm. The UF Hub allows for L3 cache/UF buffers, as well as sharing PCIe lanes from each M2 Ultra into 4 muxers, allowing up to 8 PCIe 4/5 x16 peripherals (assuming the M2 Max includes undisclosed provisions for a single PCIe 5 x16 bus)..
Yeah, you have something as monstrous as a >400mm^2 die sitting there in the middle, and it physically pushes the core dies farther apart. Putting the largest possible thing in the middle doesn't help get the separate GPU core clusters closer together. Nor does it bring the even more distant memory packages closer to the GPU cores.