...
How would you even know? Apple never used NUMA architecture in their computers. At any rate, these are things that can be tweaked in software.
Not true on the "never". The Mac Pro 2009-2012 used CPU packages with the memory controller built into the package. The dual-package models required a 'remote' memory access through QPI to get to the 'other' set of DIMM slots. And yes, there were gaps between macOS and Windows/Linux on systems implemented with dual packages from that era. Huge? No. macOS slowest of the pack on large-footprint, highly parallel jobs? Yes.
Frankly, I don't see any way for them to support unified memory on a Mac Pro class computer without adopting some variant of NUMA.
You don't really need unified memory so much as a flat, shared, virtual address space. Some data doesn't really need to move if you have decent, coherent caching. And some data doesn't need to hop through system RAM on the way from the storage drive to VRAM (a PCIe drive can do a DMA move to VRAM; or, if the drive is too "dumb", a CPU-driven DMA move to VRAM). Read-only textures... it isn't that hard (AMD had a pro product at some point with an SSD attached).
Making an SoC powerful enough is not feasible (at least not at the higher performance levels). Maintaining the modular design while also keeping unified memory is not feasible. This is why I believe we will see modules that contain CPU + GPU clusters, with shared on-board memory. These devices will be organized into "local groups" of some sort and you will have APIs to discover such groups and schedule work on them. Transferring data between groups is going to be slower.
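Nobody outside Apple knows what that API would look like, but purely as a hypothetical sketch (every type and name below is invented for illustration), discovery and placement might look something like this:

[CODE]
// Purely hypothetical sketch: none of these types exist in any Apple SDK.
// The idea: enumerate "local groups" (CPU + GPU clusters sharing on-module
// memory), then keep a workload's data and compute inside one group,
// since transfers between groups would be the slow path.

struct ComputeGroup {
    let id: Int
    let cpuCores: Int
    let gpuCores: Int
    let localMemoryBytes: UInt64
}

protocol GroupScheduler {
    // Discover the CPU + GPU clusters the machine is built from.
    func availableGroups() -> [ComputeGroup]
    // Run work inside a specific group, next to its local memory.
    func submit(_ work: @escaping () -> Void, to group: ComputeGroup)
}
[/CODE]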
If you need 10x as many GPU cores, you shouldn't be saddled with having to take 10x as many CPU cores as well. That artificial coupling is quite similar to the "painted into a corner" situation Apple got into with the Mac Pro 2013. Unnecessarily coupling things that don't need to be coupled is dubious. The scaling benefit for GPU cores is way different than for general-purpose CPU code.
As opposed to artificial tight couplings, Apple should be looking at a roadmap with CXL or CCIX as those standards merge into the mainstream, along with adopting PCIe v5 or v6. That's what other folks are doing.
www.anandtech.com
Apple already has elements of such an API with their Metal Peer Groups for multi-GPU systems.
Which is entirely between homogeneous AMD GPUs.
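For reference, those peer groups surface through a handful of MTLDevice properties; a minimal sketch of enumerating them on a multi-GPU Mac:

[CODE]
import Metal

// Bucket GPUs by Metal peer group. peerGroupID is non-zero only for
// devices that can share resources over a direct link (e.g. the
// Infinity Fabric Link between the duo cards in the Mac Pro).
var groups: [UInt64: [MTLDevice]] = [:]
for device in MTLCopyAllDevices() where device.peerGroupID != 0 {
    groups[device.peerGroupID, default: []].append(device)
}

for (id, devices) in groups {
    print("Peer group \(id):")
    for d in devices {
        print("  \(d.name): peer \(d.peerIndex) of \(d.peerCount)")
    }
}
[/CODE]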
How will the Mac Pro talk to GPU boards?
Why do you think Apple developed MPX modules in the first place? They are free to implement whatever interface they see fit. Also check out how they implemented Infinity Link connections — a future Apple Silicon Mac Pro could use something similar to provide fast data transfer between modules.
First, Apple didn't implement Infinity Fabric links any more than Dell or HP did. It is an AMD implementation. AMD did a semi-custom Fabric width for them that stayed inside the parameters of the Fabric.
Second, all data movement between MPX modules and the host system travels via PCIe. Apple's MPX implementation is entirely synergistic with PCIe. You can actually put a standard PCIe card into an MPX bay in the same "slot" that the MPX module fits into. MPX adds a second connector to provision data for Thunderbolt and to get out of the Molex power cable game. That's MPX.
Of course they would need more robust I/O. Why do you assume that they won't have it? This is not something that was important for the mobile SoC, but it is essential for the Macs.
The one-port-wonder MacBook (where Apple stripped away almost every port). Rumors of a port-less future iPhone. The slot-less Mac Pro 2013. The four-port 13" MBP with one pair of ports provisioned at different bandwidth than the other pair. ...
... It is Apple. They have a track record of blowing it on I/O in the pro space.
Begrudgingly, more robust workstation-level I/O will probably get pulled into doing a decent Mac Pro SoC, but quite likely Apple is going to move as many systems as they can to the "iGPU only and no more than 4-5 ports" side. The Mac Mini and iMac level of port I/O, and optional discrete graphics, probably will come sometime later in 2021.
We already pretty much know that Apple is working on a custom I/O controller that provides better memory isolation to connected devices (it's in the video linked by the OP), and we know that Apple Silicon Macs will support high-performance external devices such as SCSI controllers.
An IOMMU has nothing to do with this. They need PCIe controllers. More than a couple of memory controllers feeding DIMM slots. More than two USB ports. Etc.
You don't need dedicated GPU memory to have a high-performance GPU. As Apple's graphics guy Gokhan Avkarogullari says, "bandwidth is the function of ALU power". You need a certain amount of bandwidth to increase your GPU performance, and there are multiple ways of getting there. You can use faster system RAM, more caches, memory compression, etc. LPDDR5 already offers over 100 GB/s of bandwidth — this is competitive with GDDR5. For higher-end applications, Apple will need to use something faster or utilize more channels. Since they control the platform, there are many paths they can take here.
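The back-of-the-envelope math here is just transfer rate times bus width; a quick sketch (nominal spec figures, not any actual Apple configuration):

[CODE]
// Peak bandwidth (GB/s) = transfer rate (MT/s) x bus width (bits) / 8 / 1000
func peakGBps(mtPerSec: Double, busBits: Double) -> Double {
    mtPerSec * busBits / 8.0 / 1000.0
}

print(peakGBps(mtPerSec: 6400, busBits: 128))   // LPDDR5-6400, 128-bit bus: ~102 GB/s
print(peakGBps(mtPerSec: 8000, busBits: 256))   // GDDR5 at 8 Gbps, 256-bit: ~256 GB/s
print(peakGBps(mtPerSec: 14000, busBits: 352))  // GDDR6 at 14 Gbps, 352-bit: ~616 GB/s
[/CODE]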
The leading-edge AMD and Nvidia GPUs all use GDDR6 (and GDDR6X). So LPDDR5 is showing up behind the curve. Apple isn't going to match their ALU power in the top-end systems because they don't match the bandwidth.
And let's not forget that Apple GPUs need significantly less bandwidth to provide the same level of performance.
Not on everything. Apple has some vertex removal and tiling tricks that lower bandwidth, but if you're transcoding 10K REDRAW they don't have anything special. Finite element analysis? Nope. Quantum chemistry modeling? Nope.
Why pay for features that don't make any sense? No current GPU has good support for FP64 anyway,
Chuckle. You know how many systems on the top500 list do not support FP64? Zero.
GPUs in Nvidia's pro lineup that don't support FP64? Zero.
GPUs in AMD's pro lineup that don't support FP64? Zero.
Metal doesn't support FP64. No Apple GPU has FP64 support (which makes the absence in Metal not surprising at all). Apple's Cupertino Kool-Aid about how Metal is a complete replacement for OpenCL doesn't "need" FP64. The gamer tech-porn crowd doesn't need FP64. But it is used every day by some pro users.
not that it's needed. If you need more precision — roll your own extended precision data structure. It's still going to be faster than native FP64 on most hardware
That is nonsense. Just more Kool-Aid. A software-implemented data type faster than fixed-function logic?
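For the record, "roll your own" here usually means float-float (double-single) arithmetic built out of tricks like Knuth's TwoSum; a minimal sketch of the building blocks:

[CODE]
// Float-float ("double-single") arithmetic: a value is carried as an
// unevaluated sum hi + lo of two Floats, roughly doubling the precision.
struct FloatFloat {
    var hi: Float
    var lo: Float
}

// Knuth's TwoSum: the exact sum of two Floats as a rounded result
// plus the rounding error, with no branches.
func twoSum(_ a: Float, _ b: Float) -> (s: Float, err: Float) {
    let s = a + b
    let v = s - a
    let err = (a - (s - v)) + (b - v)
    return (s, err)
}

// One extended-precision add costs on the order of ten FP32 operations.
// That per-operation multiplier is exactly what's in dispute versus
// native FP64 hardware.
func add(_ x: FloatFloat, _ y: FloatFloat) -> FloatFloat {
    let (s, e) = twoSum(x.hi, y.hi)
    let (hi, lo) = twoSum(s, e + x.lo + y.lo)
    return FloatFloat(hi: hi, lo: lo)
}
[/CODE]

Whether that many FP32 operations per emulated op beats a real FP64 unit depends entirely on how badly the hardware throttles FP64 throughput.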
and you can optimize it for your needs. Memory controllers — why have a separate one on the GPU if you could have a multi-channel one at the SoC level? Things like multi-monitor support can be easily added depending on the platform's needs.
Large-resolution, multi-monitor support requires bandwidth. The SoC has to be able to walk and chew gum at the same time. If you put 4-5 heavy bandwidth consumers on a single SoC, you eventually run out of bandwidth. It will work great on "one at a time" synthetic benchmarks, but when trying to do lots of work at once it will probably throttle.
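Rough scanout math makes the point; a quick sketch assuming uncompressed frames at ~4 bytes per pixel (illustrative figures only):

[CODE]
// Continuous framebuffer scanout for one display:
// width x height x bytes per pixel x refresh rate.
func scanoutGBps(width: Double, height: Double,
                 bytesPerPixel: Double, hz: Double) -> Double {
    width * height * bytesPerPixel * hz / 1e9
}

// A 6K panel (6016 x 3384) at 60 Hz, ~4 bytes per pixel:
let oneDisplay = scanoutGBps(width: 6016, height: 3384, bytesPerPixel: 4, hz: 60)
print(oneDisplay)        // ~4.9 GB/s, all the time
print(oneDisplay * 4)    // four of them: ~19.5 GB/s off the top of the SoC's budget
[/CODE]

And that is before the GPU has touched a single texture or the CPU a single cache line.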