Given that compute is arguably the most important aspect of high-end workstations, a dGPU is the only way to carry the Mac Pro forward.
A discrete GPGPU doesn't necessarily have to be a video-output GPU. As noted several replies above, the Nvidia DGX Station uses a completely different class of GPU for driving the workstation's display than for compute. The notion that this all has to be a completely homogeneous mix of GPUs doesn't carry much weight.
If the Mac Pro has one x16 PCI-e v4 slot, that could be connected to a PCI-e expansion box (the same thing happened with the 2009-2012 models for folks who had lots of cards). For scaled compute in one or two boxes under the desk, that could work. It wouldn't cover all of the addressable market, but it would cover some of it.
I remain convinced that Afterburner is not a one-off product category, but an initial test toward a lasting product line of AS-designed, task-focused discrete compute hardware.
ProRes video acceleration being added next to the A15 iPhone SoC points to exactly the opposite. (Yes, the Pro/Max SoCs were likely also done at the time, but they shipped after the A15.) Afterburner looks more like a prototype of something to be matured before incorporating it into the SoC die (make it work first, then commit it to a die, where fixes are much harder).
ProRes was a "CPU offload" that ended up getting folded back onto the SoC die.
As Apple follows TSMC's N3 to N2 to 10A progression, the transistor budget on the SoC die is just going to get bigger. That likely means more compute in the same package, if not on the same die.
Modern advances in 3D packaging, with cost- and power-effective layered interposers, only reinforce that trend line. If you have 80 billion transistors on one die, then 2-4 dies get you budgets of 160-320 billion.
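The arithmetic is simple enough to sketch; the 80-billion figure is just the illustrative number above, not a shipping die:

```swift
// Back-of-the-envelope transistor budgets for a multi-die package.
// 80 billion per die is the illustrative figure from the post, not a shipping part.
let transistorsPerDie = 80.0  // in billions

for dieCount in [1, 2, 4] {
    print("\(dieCount) die(s): \(transistorsPerDie * Double(dieCount)) billion transistors")
}
```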
The M1 Ultra can handle more 8K streams than the Afterburner in part because the Afterburner is choked on x16 PCI-e v3. PCI-e slots are fixed in time on a normal system board: if Apple puts v4 on the baseline motherboard today, then 4-5 years from now there will be something much faster than v4. A slot buys modularity more than it buys lower latency, more bandwidth, and/or higher perf/watt. The intra-package bus in modern packages is far faster and more power-effective than PCI-e. Going out through PCI-e means taking a hit on bandwidth, latency, and power consumption.
PCI-e v5 and up is good enough for inter-module memory traffic on par with the older links used for NUMA connections between CPU packages. There will be more use cases where a dGPU and CPU couple in a more unified manner, with some latency overhead. But single unit/node performance is going to see more packaging over time from multiple vendors, not just Apple.
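For a rough sense of the numbers being compared here, this is a back-of-the-envelope sketch of one-direction x16 PCI-e bandwidth per generation next to an older NUMA-class CPU link; the Intel UPI figure is only an illustrative reference point, and real-world payload efficiency varies:

```swift
import Foundation

// Approximate one-direction bandwidth of an x16 PCI-e link per generation,
// versus an older NUMA-class CPU interconnect (Intel UPI at 10.4 GT/s,
// used here purely as an illustrative reference point).
let lanes = 16.0
let encodingEfficiency = 128.0 / 130.0   // PCI-e 3.0+ uses 128b/130b encoding

let generations: [(name: String, gtPerSecPerLane: Double)] = [
    ("PCI-e 3.0", 8.0),
    ("PCI-e 4.0", 16.0),
    ("PCI-e 5.0", 32.0)
]

for gen in generations {
    // GT/s -> GB/s: apply encoding overhead, divide by 8 bits per byte, scale by lane count
    let gbPerSec = gen.gtPerSecPerLane * encodingEfficiency / 8.0 * lanes
    print("\(gen.name) x16: ~\(String(format: "%.1f", gbPerSec)) GB/s per direction")
}

// Intel UPI link at 10.4 GT/s moving ~2 bytes of payload per transfer
let upiGBps = 10.4 * 2.0
print("UPI link:    ~\(String(format: "%.1f", upiGBps)) GB/s per direction")
```

Even v5 x16 is still well over an order of magnitude short of the ~2.5 TB/s Apple quotes for the M1 Ultra's UltraFusion die-to-die connection, which is the bandwidth/latency/power hit mentioned above.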
If Apple is to truly empower AR/VR creators and take a seat at the AI/ML table, it has no choice but to go beyond the SoC to deliver these experiences.
AR/VR that doesn't render on an Apple headset does what for Apple? Not a lot. There are folks who drift into AR being more "artificial reality" than "augmented reality". That is probably not where Apple is trying to go.
Some 3D creation tools are still being ported over to the M-series and optimized for Apple GPUs. It isn't just Apple that is in play here; the software has to catch up also.
With only Apple's GPUs as an optimization target, that will likely happen faster, because there is a tractable set of targets to optimize for. (A super broad set of widely diverging targets is one reason there is such a huge barrier to entry in the mainstream GPU market: between game/app + driver + hardware permutations, there is a never-ending pile of tweaks to do. Having to do everything for everybody acts as a barrier to entry.)
As far as AI/ML goes, on the inference side, if this is a "quad Max" die set-up then there would be 64 NPU cores. You'd be smoking a whole lot of something to position that as not taking a seat at the AI/ML inference table. It is. Even at 32 NPU cores it is past substantive. A similar evolutionary issue applies here as process shrinks come along: Apple isn't going to lose tons of ground on inference if it allocates decent transistor budget increases to bigger caches, more function units, etc.
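A quick sketch of that NPU math, assuming each Max-class die keeps the 16-core Neural Engine the M1 Max ships with (TOPS scaled linearly from Apple's M1-generation figure, purely as an approximation):

```swift
// NPU core counts for hypothetical multi-die packages, assuming each
// Max-class die keeps the 16-core Neural Engine found in the M1 Max.
// ~11 TOPS is Apple's M1-generation Neural Engine figure, scaled
// linearly here only as a rough approximation.
let npuCoresPerDie = 16
let topsPerDie = 11.0

for (label, dies) in [("Duo (Ultra-style)", 2), ("Quad Max", 4)] {
    print("\(label): \(npuCoresPerDie * dies) NPU cores, ~\(topsPerDie * Double(dies)) TOPS")
}
```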
Is Apple after the mega-trainer business? Probably not. Cerebras is throwing whole wafers at the mega-scale workloads. For reasonably sized models, Apple's four-die set-up is likely going to be decent enough to get some work done (between GPU cores, AMX "cores", and NPU cores, as appropriately delegated). Apple is more limited by the lack of a software foundation for training than by hardware. And training is more of a non-GUI endeavor. When it is racks of extremely tightly coupled nodes in a data center, then it really is outside of Apple's wheelhouse.