I have been suggesting this solution since the Mac Studio was first announced: most of us would be happy to sacrifice a Thunderbolt port (or two) to gain a fully fledged PCIe gen 5 slot, or dual industry-standard M.2 slots.
A couple of things.
One, getting an x16 PCI-e controller out of the space of one (or two) TB controllers, each with just an x4 PCI-e v3 controller embedded inside, would be a problem. Getting a full x16 controller would more likely mean four TB controllers go, and that is probably more than most (who have multiple monitors and devices) would want to give up. Nor would Apple.
I do think that needing more than four TB ports is kind of out there on the fringe. However, the fringe buying 3-4 XDRs is attractive to Apple. And -4 on the Max Mac Studio would wind the TB port count all the way back to zero.
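A minimal sketch of that lane math (the x4-per-controller figure is from the point above; the rest is just counting):

```python
# Counting lanes for the trade-off above: each TB controller embeds
# roughly an x4 PCI-e v3 back end, so a full x16 slot's worth of lanes
# costs four whole controllers.
LANES_PER_TB_CONTROLLER = 4   # x4 PCI-e v3 per TB controller
TARGET_SLOT_LANES = 16        # full-width x16 slot

controllers_gone = TARGET_SLOT_LANES // LANES_PER_TB_CONTROLLER
print(controllers_gone)       # 4 -- every TB port on a Max Mac Studio
```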
Second, Apple doesn't have to sacrifice any of the TB controllers if they just add the PCI-e controller on another chiplet. Which appears to be what they did.
If Apple did PCI-e v5, then that would probably be done more to get more bandwidth out through fewer lanes (making the impact on Apple's die edge space smaller). They'd run that out to an external switch and fan out from there. Conceptually, Apple could have fed all of these PCI-e v4 slots from a single x16 PCI-e v5 backhaul. I suspect it is two x16 PCI-e v4 links ... but on TSMC N3 or later implementations they'd slide backwards in lane count and up in backhaul speed.
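Back-of-envelope numbers for that fan-out idea (nominal per-lane rates from the PCI-e specs; real throughput is lower after protocol overhead):

```python
# Nominal per-lane bandwidth in GB/s (after 128b/130b encoding).
GBPS_PER_LANE = {"v3": 0.985, "v4": 1.969, "v5": 3.938}

backhaul = 16 * GBPS_PER_LANE["v5"]       # x16 PCI-e v5 uplink: ~63 GB/s
slots = 2 * 16 * GBPS_PER_LANE["v4"]      # two x16 PCI-e v4 slots: ~63 GB/s

# A single x16 v5 backhaul nominally matches two full x16 v4 slots hanging
# off an external switch -- half the die-edge lanes for the same aggregate.
print(f"backhaul {backhaul:.0f} GB/s vs slots {slots:.0f} GB/s")
```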
PCI-e v5 (and v6) without CXL 2.0 (and up) doesn't make a lot of sense to me. And I don't see Apple saying much about being enamored with CXL. So I don't expect Apple to get there anytime soon, except perhaps for backhaul lane-count footprint reduction. Decent chance CXL might have conflicts with their PCI-e security model, and v5 SerDes probably have problems with their Perf/Watt goals.
Apple seems intent on wholly kneecapping the Apple silicon lineup. What is the point of an 800GB/s bus when the solitary internal SSD can only handle at best 7GB/s, and the Thunderbolt ports only half that each?
The point of the 800GB/s bus is the GPU. Period. The other stuff, including the CPU, is segmented to a more limited bandwidth subset, with QoS (quality of service) metrics all around (so nobody starves anybody else for data).
It appears that Apple moved a decent amount of the intra-GPU communication traffic off to a separate mesh, so it is easier to get better QoS for the other competing compute clusters of different types. Still 800GB/s, but the GPU performance jumped up significantly.
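For scale, a quick comparison of the numbers being thrown around here (the memory and SSD figures are from this thread; the TB figure is the raw 40Gb/s link rate I'm assuming, and the usable PCI-e tunnel inside it carries less):

```python
# Rough scale of each consumer against the 800GB/s unified memory bus.
MEMORY_BUS = 800.0        # GB/s, Ultra-class unified memory
consumers = {
    "internal SSD": 7.0,  # GB/s, best case cited above
    "TB4 port":     5.0,  # GB/s, 40Gb/s raw; the PCI-e tunnel carries less
}

for name, bw in consumers.items():
    print(f"{name:12s} {bw:4.1f} GB/s = {bw / MEMORY_BUS:.1%} of the memory bus")
# Neither comes close to denting the bus -- the GPU is what it is sized for.
```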
As for a 'faster' SSD: well, with the Mac Pro you can have one in the PCI-e v4 x8 or x16 slots.
The rest of the lineup? Thunderbolt 5 is a bit logjammed. That will probably come with faster LPDDR5 and an even more upgraded internal mesh backhaul for inter-cluster data traffic. Once the internal mesh aggregate bandwidth is up, the SSD controller will probably get a bump.
TBv5 is probably coming over the next two years. The internal SSD could/should get a bump with the M3 generation family.
[ Some of that is likely also Intel and USB-IF USB4 v2 stumbling blocks. Apple can't make the USB-IF move faster. And Intel is running behind schedule too. Looks like Meteor Lake won't have TBv5, so Intel won't go forward until 2H 2024 at the earliest. ]
There is no way to keep the CPU and GPU fully saturated with data.
Aggregate memory bandwidth is a shared resource, and dies pragmatically have limited edge space. The 'poor man's HBM' is wide, and only so many memory controllers can come out of a fixed die size (which also has to carry some other I/O that isn't memory).
That is what caches are for: so every core doesn't have to go to memory all the time.
I suppose that is why Apple silicon only clocks at a relatively slow rate, and at such low temps. The board architecture fails to deliver the raw data those chips need to really perform.
Apple Silicon clocks at a slower rate (than a top-fuel drag-racing desktop) because the whole system is designed around wider units and moderately high clock speeds, rather than narrow channels and fanatical clock speeds that bleed extra power losses.
GPUs don't clock at desktop CPU rates and still get more aggregate FLOPs of work done.
Apple's P cores can issue something like 10 uOps in a clock; AMD/Intel, about 6-8. So Apple is at least 25% wider at issue. If AMD/Intel clocks are 20% higher, are they really doing more work every second?
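Working that question through with the numbers above (illustrative, and it assumes both designs keep their issue slots fed):

```python
# Width x clock, using the figures in this post.
apple_width, x86_width = 10, 8      # uOps issued per clock (upper end for x86)
apple_clock, x86_clock = 1.0, 1.2   # normalized; x86 clocks "20% higher"

apple_rate = apple_width * apple_clock   # 10.0 uOps per unit time
x86_rate = x86_width * x86_clock         #  9.6 uOps per unit time
print(apple_rate, x86_rate)
# A 20% clock edge doesn't cover a 25% width deficit -- the wider,
# slower core still retires slightly more work per second.
```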
The other problem is that a faster clock often means a bigger area footprint for the core. Intel has to double and triple up on E-cores in part because their P-cores are so big. Those E-cores can't 'top fuel drag race' clock either.