Edit: On further thought, Apple could basically create each "cube" as a separate computer or node. Treat the hard drives as if they were on a SAN (to be shared with all of the CPU nodes). Still not sure how this would interact with the PCIe lanes or other ports, but one can dream, right? (Maybe Thunderbolt?)
Actually, every computer is already split into various areas (e.g. mass storage, central processing, I/O). They just happen to sit in one common housing and often on a common PCB as well. Sometimes, however, a unit has its own dedicated PCB, such as the external ports or the mass storage backplane. This is quite common, even in industrial computers.
Now if one took the areas that are less sensitive to latency (e.g. external ports or mass storage), put them into separate boxes, and made sure that the various nodes share the available resources properly even when they sit in different boxes, you would come closer to what we are dreaming up here.
PCIe would probably be a bit tricky, but I could imagine an approach like the one you already sketched: put 1-2 PCIe slots together with one local processor and some local RAM (to avoid latency and timing problems) in one common box and let this box act as an individual node. Need more slots? Add a node! Need more RAM? Add a node! And so on...
The OS (and the user) would see the individual resources of each box as one big pool, presented to them transparently by low-level routines.
So for example:
Box 1: Central coordination
Box 2: Mass storage
Box 3: Graphics unit ("graphics card")
Box 4: I/O (USB, FireWire, Ethernet etc. Even Thunderbolt could be located here: if they can transfer data over Thunderbolt externally over a comparably long distance, they can do so internally as well)
Boxes 1 and 3 would probably have more powerful local processors (e.g. Core i5/i7 or even Xeon class), while boxes 2 and 4 would be fine with a lesser processor, such as a Core i3 or even ARM/A5.
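To make the "one big pool" idea a bit more tangible, here is a rough Python sketch of the four boxes and the summed view the OS might be shown. The `Node` class, the `pooled_view` helper and all capacity figures are made up for illustration; nothing here reflects an actual Apple design.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One box: a local processor plus whatever resources it contributes."""
    name: str
    role: str
    cpu_class: str
    ram_gb: int = 0
    pcie_slots: int = 0
    storage_tb: float = 0.0


def pooled_view(nodes):
    """Sum the per-box resources into the single pool the OS would see."""
    return {
        "ram_gb": sum(n.ram_gb for n in nodes),
        "pcie_slots": sum(n.pcie_slots for n in nodes),
        "storage_tb": sum(n.storage_tb for n in nodes),
    }


# Hypothetical capacities, chosen only to have something to add up.
boxes = [
    Node("box1", "central coordination", "Core i7", ram_gb=16),
    Node("box2", "mass storage", "Core i3", ram_gb=4, storage_tb=8.0),
    Node("box3", "graphics unit", "Core i7", ram_gb=8, pcie_slots=2),
    Node("box4", "I/O", "ARM", ram_gb=2),
]

pool = pooled_view(boxes)

# "Need more slots? Add a node!"
boxes.append(Node("box5", "extra PCIe", "Core i3", ram_gb=4, pcie_slots=2))
bigger_pool = pooled_view(boxes)
```

Adding a fifth box simply grows the totals, which is the whole point of the node approach.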
Let the central hub delegate the work of copying data, e.g. from the mass storage in box 2 to the RAM in box 3, to the local processor in either box 2 or box 3 (whichever has more spare resources and can provide more "crunch").
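The delegation decision could be as simple as comparing spare capacity on both ends of the copy. A minimal sketch, assuming each box reports a relative processor strength and its current load; `BoxStatus`, `compute_units` and `pick_copy_worker` are hypothetical names, not anything from an existing API.

```python
from dataclasses import dataclass


@dataclass
class BoxStatus:
    name: str
    compute_units: float  # hypothetical relative strength of the local processor
    load: float           # fraction of that processor currently busy, 0.0 to 1.0

    @property
    def spare(self):
        """Processing capacity that is currently idle."""
        return self.compute_units * (1.0 - self.load)


def pick_copy_worker(source, dest):
    """Return whichever end of the copy has more idle capacity.

    On a tie the source box is chosen, since max() keeps the first maximum.
    """
    return max((source, dest), key=lambda box: box.spare)


# Box 3 has the stronger processor, but it is busy rendering,
# so the weaker but idle storage box does the copying.
storage = BoxStatus("box2", compute_units=2.0, load=0.1)
graphics = BoxStatus("box3", compute_units=8.0, load=0.9)
worker = pick_copy_worker(storage, graphics)
```

A real scheduler would also have to weigh interconnect load and memory pressure; this only covers the "who has more crunch left" part.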
Perhaps one could even differentiate between time-critical data that has to be close to the graphics card in box 3 (and thus needs to be in box 3's local RAM) and non-critical data that can stay in the local RAM of box 2, thus saving both RAM capacity in box 3 and interconnect bandwidth.
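Such a placement rule might look roughly like this in Python. The buffer names and sizes are invented, and the fallback (time-critical data that no longer fits in box 3 stays in box 2) is purely my assumption about how overflow would be handled.

```python
def place_buffers(buffers, box3_free_mb):
    """Decide per buffer whether it lives in box 3 or box 2 local RAM.

    buffers is a list of (name, size_mb, time_critical) tuples.
    Time-critical buffers go to box 3 while space lasts; everything
    else, including critical buffers that no longer fit, stays in box 2.
    """
    placement = {}
    for name, size_mb, time_critical in buffers:
        if time_critical and size_mb <= box3_free_mb:
            placement[name] = "box3"
            box3_free_mb -= size_mb
        else:
            placement[name] = "box2"
    return placement


# Hypothetical workload with 1024 MB free next to the graphics card.
placement = place_buffers(
    [
        ("frame_textures", 512, True),
        ("asset_archive", 4096, False),
        ("shadow_maps", 1024, True),
    ],
    box3_free_mb=1024,
)
```

Here `frame_textures` ends up in box 3, while `asset_archive` and the oversized `shadow_maps` stay in box 2.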
In the meantime, the "director" CPU in box 1 can communicate with the local processor in box 4 to process user input arriving on USB, FireWire or other ports. Maybe a stream is coming in from box 4 (the user is recording from a video camera) and has to be routed to mass storage in box 2. Or mouse/keyboard events have to be processed.
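The routing in box 4 could start out as a plain lookup table. Again just a sketch: the event names and the default of sending unknown traffic to the director in box 1 are assumptions on my part.

```python
# Hypothetical routing table for traffic arriving at the I/O box (box 4).
ROUTES = {
    "video_stream": "box2",  # recordings go straight to mass storage
    "keyboard": "box1",      # user input is handled by the director CPU
    "mouse": "box1",
}


def route(event_kind):
    """Return the box that should receive this kind of input.

    Anything not listed falls back to box 1 and lets the director decide.
    """
    return ROUTES.get(event_kind, "box1")


destinations = [route(kind) for kind in ("video_stream", "mouse", "midi")]
```

The video stream never has to pass through box 1 at all, which keeps the director free for coordination work.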
As I wrote in another post, Apple already has quite a few of the building blocks required for such an approach in its portfolio (Grand Central Dispatch, Xsan, etc.). They just need to put the pieces together!