
quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
This is easier with CPUs anyway as each CPU core is logically autonomous. Apple GPU tech does not have this limitation as shown by Ultra.
Wouldn't the GPU cores (be they Apple's, AMD's, Nvidia's or Intel's) be fully autonomous as well?

In Apple's case, each core (CPU, NPU, GPU, etc. in the M1/Pro/Max/Ultra) can see the entire addressable memory space. AMD's CPUs would be similar in being able to see the entire addressable memory space.

dGPUs have their own memory, but they would have to be individually programmed as well, right, each with its own data, program and stack memory space? For a linked multi-dGPU setup, I would imagine the GPU cores would be able to either copy from or directly access the linked dGPUs' VRAM, depending on the programming model.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
Wouldn't the GPU cores (be they Apple's, AMD's, Nvidia's or Intel's) be fully autonomous as well?

In Apple's case, each core (CPU, NPU, GPU, etc. in the M1/Pro/Max/Ultra) can see the entire addressable memory space. AMD's CPUs would be similar in being able to see the entire addressable memory space.

dGPUs have their own memory, but they would have to be individually programmed as well, right, each with its own data, program and stack memory space? For a linked multi-dGPU setup, I would imagine the GPU cores would be able to either copy from or directly access the linked dGPUs' VRAM, depending on the programming model.

It's not about the memory the core can address but about what it can do. Someone has to set up the work queues and MMU descriptors, allocate the memory and make sure that everything points where it should so that the GPU can do its work. You could theoretically put a general-purpose processor in every GPU core, but that would be incredibly wasteful. From what I understand, Apple uses an auxiliary ARM core that takes care of these things, assigns work to the individual GPU cores and collects the results. On older GPUs this is done on the main CPU (the driver sets these things up, then initiates GPU code execution). I have no idea how modern dGPUs do it; I suppose they also have a general-purpose orchestrator (or several) on the GPU.

You don't have this problem with the CPUs since every core is capable of flexible code execution, so each CPU core can make its own decisions.
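If it helps to picture the split, here is a toy sketch of the general idea in Python (my own illustration, not Apple's actual firmware or driver): a single orchestrator sets up and hands out the work, the "cores" just execute whatever they're given, and the orchestrator collects the results.

[CODE]
from queue import Queue
from threading import Thread

# Toy model only: one orchestrator assigns work to otherwise "dumb" cores and
# collects the results. The cores never decide on their own what to run next.
work, results = Queue(), Queue()
NUM_CORES, NUM_ITEMS = 8, 64

def gpu_core(core_id):
    while True:
        item = work.get()
        if item is None:                     # shutdown signal from the orchestrator
            break
        results.put((core_id, item * 2))     # stand-in for the actual shader work
        work.task_done()

cores = [Thread(target=gpu_core, args=(i,)) for i in range(NUM_CORES)]
for t in cores:
    t.start()

for item in range(NUM_ITEMS):                # orchestrator: set up and hand out the work...
    work.put(item)
work.join()
for _ in cores:                              # ...then tell the cores to stop...
    work.put(None)
outputs = [results.get() for _ in range(NUM_ITEMS)]   # ...and collect the results
[/CODE]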
 

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,664
OBX
Latency is more critical for CPUs than bandwidth; the opposite is true for GPUs. This is why multi-die GPUs weren't made until the M1 Ultra. The M1 Ultra is literally the first multi-die GPU.

AMD is rumored to be using a multi-die design for their next-generation discrete GPU.
AMD released a multi-GCD "GPU" in November of last year (AMD Instinct MI250), technically beating Apple to the punch.
It's not about the memory the core can address but about what it can do. Someone has to set up the work queues and MMU descriptors, allocate the memory and make sure that everything points where it should so that the GPU can do its work. You could theoretically put a general-purpose processor in every GPU core, but that would be incredibly wasteful. From what I understand, Apple uses an auxiliary ARM core that takes care of these things, assigns work to the individual GPU cores and collects the results. On older GPUs this is done on the main CPU (the driver sets these things up, then initiates GPU code execution). I have no idea how modern dGPUs do it; I suppose they also have a general-purpose orchestrator (or several) on the GPU.

You don't have this problem with the CPUs since every core is capable of flexible code execution, so each CPU core can make its own decisions.
Modern GPUs all have hardware-accelerated GPU scheduling (Nvidia exposes it via regular drivers; AMD only exposes it via the Windows Store driver). From my understanding, performance is meh, so no one recommends enabling it.

This may or may not be the same thing that Apple is doing.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
AMD released a multi-GCD "GPU" in November of last year (AMD Instinct MI250), technically beating Apple to the punch.

From what I understand, the MI250 is still exposed as two logical GPUs, so it's not the same thing.

Modern GPUs all have hardware-accelerated GPU scheduling (Nvidia exposes it via regular drivers; AMD only exposes it via the Windows Store driver). From my understanding, performance is meh, so no one recommends enabling it.

This may or may not be the same thing that Apple is doing.

"Hardware accelerated GPU scheduling" can literally mean anything. Based on the Asahi docs, Apple uses a dedicated coprocessor to manage many the GPU related things, so the system driver can be fairly thin. Does that qualify as "hardware accelerated GPU scheduling"?
 
  • Like
Reactions: senttoschool

senttoschool

macrumors 68030
Nov 2, 2017
2,626
5,482
AMD released a multi-GCD "GPU" in November of last year (AMD Instinct MI250), technically beating Apple to the punch.
M1 Ultra is the first multi-die GPU that is presented as a single GPU to software. This is the actual breakthrough.

Instinct is still presented as two separate GPUs.

First, from a software perspective, a single accelerator is exposed as two separate GPUs. This means that your algorithm needs to be multi-GPU aware in order to fully utilize the accelerator. Second, despite being connected by 4 high-speed IF links, the chip-to-chip bandwidth is still much lower than the HBM memory bandwidth, leading to moving data between GCDs being rather painful, and significantly slower than memory accesses. This is actually the reason AMD opted to expose the accelerators as two distinct GPUs; they couldn’t match the HBM bandwidth between the two dies to be able to expose both GPUs as one, nor at least half of it to support a distributed mode configuration, so they decided to expose them as different GPUs.

From what I understand, the MI250 is still exposed as two logical GPUs, so it's not the same thing.
You literally posted this seconds before me.
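To make the "multi-GPU aware" point from the quote above concrete, here is a minimal sketch (my own assumption of what that looks like in PyTorch, not the MI250/ROCm toolchain specifically): because the accelerator shows up as two devices, the workload has to be split across them and the results gathered explicitly.

[CODE]
import torch

# Hypothetical illustration: the MI250 is exposed as two logical GPUs, so the
# workload has to be partitioned across them and the results gathered by hand.
devices = [torch.device("cuda:0"), torch.device("cuda:1")]   # the two GCDs

a = torch.randn(8192, 8192)
b = torch.randn(8192, 8192)

# Split the left matrix by rows, run each half on its own GCD, gather on the host.
halves = a.chunk(2, dim=0)
partials = [h.to(dev) @ b.to(dev) for h, dev in zip(halves, devices)]
c = torch.cat([p.cpu() for p in partials], dim=0)
[/CODE]

On a GPU presented as a single device, none of that splitting would be needed, which is the point being made about the M1 Ultra.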
 

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,664
OBX
"Hardware accelerated GPU scheduling" can literally mean anything. Based on the Asahi docs, Apple uses a dedicated coprocessor to manage many the GPU related things, so the system driver can be fairly thin. Does that qualify as "hardware accelerated GPU scheduling"?
That is a good question. Because performance is so meh on the PC side, Microsoft isn't forcing it to be used. No clue how long that stance will last, though. No one seems to show it in architecture shots either. I know RDNA has a Graphics Command Processor, but it isn't clear if that is what HAGS leverages.
M1 Ultra is the first multi-die GPU that is presented as a single GPU to software. This is the actual breakthrough.

Instinct is still presented as two separate GPUs.




You literally posted this seconds before me.
To be fair, you said first multi-die GPU, which Aldebaran meets the definition of, even if it shows up as two GPUs.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
That is a good question. Because performance is so meh on the PC side, Microsoft isn't forcing it to be used. No clue how long that stance will last, though. No one seems to show it in architecture shots either. I know RDNA has a Graphics Command Processor, but it isn't clear if that is what HAGS leverages.

I remember Apple's GPU driver team lead commenting on Twitter that Apple has had this kind of stuff for years. But it's so difficult to imagine what these things are doing without in-depth information.

To be fair, you said first multi-die GPU, which Aldebaran meets the definition of, even if it shows up as two GPUs.

Ah, sorry, I meant "single" GPU. I would classify the MI250 as two GPUs on one package. From the system configuration point of view, it's not that different from any other multi-GPU system connected via an Infinity Fabric link.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
Latency is more critical for CPUs than bandwidth; the opposite is true for GPUs. This is why multi-die GPUs weren't made until the M1 Ultra. The M1 Ultra is literally the first multi-die GPU.

That is only for the GPU cores. The GPU, from the end-user perspective, covers getting output all the way to the display where it can be consumed. Most folks won't consider a GPU card with no display output a GPU card.




AMD is rumored to be using a multi-die design for their next-generation discrete GPU.

The MI210 isn't a rumor (nor is the MI250X).




Those multi-die GPUs present as two GPUs to the user.


The upcoming 7900 is not the multi-die design of the early rumors (which were trying to bring the HPC Instinct setup mainstream into the high-frame-rate market).

[Image: AMD Navi 31 GCD/MCD chiplet layout]




Similar to the "hub and spoke" set up of the desktop CPU dies/chiplets , but the role of I/O is reversed (from 'hub' to 'spoke' ) . The compute is on the hub and the memory I/O is on the spoked ( MCD memory cache die ( TSMC N6) . GCD graphics compute die (TSMC N5) ). [ the PCI-e is still in the hub though. ]

The GCD presents as one GPU, but it is also only one chip. AMD is pursuing both paths, but the multi-die compute path isn't presenting as one unified GPU (or doing any significant display-driving duties). It is essentially a better-packaged, better-performing W6800X Duo minus the display output subsystem.
 
Last edited:

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
Ah, sorry, I meant "single" GPU. I would classify the MI250 as two GPUs on one package. From the system configuration point of view, it's not that different from any other multi-GPU system connected via an Infinity Fabric link.


What would significantly matter, though, is whether AMD implemented the Infinity Fabric connections differently. If they are 2.5D/3D interposer-only implementations, they don't have to be as large as the 'long distance' Infinity Fabric implementations. The die-to-die IF connections are not exactly like what Apple did with UltraFusion or what AMD is doing elsewhere with IF.



[Image: AMD MI200 package]


https://www.anandtech.com/show/1705...i200-accelerator-family-cdna2-exacale-servers

" ... But it still comes with caveats, as each MI200 accelerator is presented as two GPUs, and even as fast as 4 IF links are, moving data between GPUs is still much slower than within a single, monolithic GPU. ..." (**)

An initial search for a public annotated floor plan turned up blank, but clearly the side where the die-to-die, fixed point-to-point connection most likely lies is not the same size as the general Infinity Fabric or PCIe external package links. Four significantly smaller IF links make a difference in die-space usage allocations.


For example, Nvidia completely dumped NVLink off the 4080 die to allocate more space to RTv3 units.


" ... The news is coming directly from NVIDIA founder and CEO Jensen Huang, who on a call with the tech press he said that Ada Lovelace drops the NVLink connector so that the I/O room for "something else". We do know that NVIDIA has crammed a helluva lot more into Ada Lovelace GPUs, with transistors, CUDA core, and ROP counts going through the roof.
....
Jensen said that NVIDIA engineers wanted to use as much of the silicon area they had their hands-on to "cram in as much AI processing as we could". We also now know: Ada Lovelace does NOT support PCIe 5.0, but NVIDIA included the Gen5 power connector. ..."
https://www.tweaktown.com/news/8860...-for-ada-lovelace-silent-death-sli/index.html


Apple is getting high bandwidth and lower latencies with UltraFusion, but they are also tossing any significant high-bandwidth link off the package down the drain to get there. There is no 'free lunch' there. If scaling to four dies means an even bigger off-die bandwidth trade-off, then there is even less of a "free lunch". A path that pragmatically blocks PCIe v5 (and CXL) for a very long time would come at a cost.


(**) In the AnandTech article's tech specs table, the MI200 line-up shows a substantive uptick in the number of IF links versus the previous generation. How they got that without expending gobs of edge space is significant. But there is a trade-off (on distance versus space).
 
Last edited:

EntropyQ3

macrumors 6502a
Mar 20, 2009
718
824
AMD uses a central IO die that routes all the communication between individual chips. That's not the most latency-efficient multi-chip technology. Apple's UltraFusion directly connects on-chip networks.

I disagree with your conclusion that AMD's GPUs are monolithic because of the latency concerns. They are probably monolithic simply because that's how AMD's technology works. In their current architecture, the GPU is a single isolated device and they have no means (currently) to orchestrate workloads across multiple dies. This is easier with CPUs anyway as each CPU core is logically autonomous. Apple GPU tech does not have this limitation as shown by Ultra.
Oh, I didn’t say (directly) that interchip latency is the killer in terms of multi-die rendering. I’m sure you’ve seen the far more in-depth discussions of the issues in other places as well. I just offered it as a rare example of cross-die latencies actually being measured in the real world.

Unfortunately I no longer know a place where engineers from AMD or Nvidia discuss things like this.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Unfortunately I no longer know a place where engineers from AMD or Nvidia discuss things like this.

I don't know if these two videos from last year's Hot Chips, where Intel explains their technology, or the two videos about CXL from this year's Hot Chips can help.
 
Last edited:

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,664
OBX

altaic

macrumors 6502a
Jan 26, 2004
711
484
Diamond Rapids will use CXL 3.0. Is it faster than UltraFusion?
Depends on how many lanes per die. UltraFusion is 2.5 TB/s and one CXL 3.0 lane is 64 GT/s (8 GB/s), so about 313 lanes would be equivalent. However, in a quad tiling like that, I’d think it’d require double to triple that, so 626 to 939 lanes per die.
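The arithmetic, spelled out (using Apple's 2.5 TB/s figure and assuming roughly 8 GB/s of usable bandwidth per CXL 3.0 lane):

[CODE]
# Rough comparison of UltraFusion bandwidth to CXL 3.0 lanes.
ultrafusion_gb_s = 2500        # Apple's 2.5 TB/s figure
cxl3_lane_gb_s = 8             # one x1 lane at 64 GT/s is roughly 8 GB/s

lanes_equivalent = ultrafusion_gb_s / cxl3_lane_gb_s
print(lanes_equivalent)        # 312.5 -> about 313 lanes

# Quad-tiling guess: two to three times that per die.
print(2 * 313, 3 * 313)        # 626, 939
[/CODE]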
 
  • Like
Reactions: Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Apple’s innovative UltraFusion uses a silicon interposer that connects the chips across more than 10,000 signals, providing a massive 2.5TB/s of low latency, inter-processor bandwidth — more than 4x the bandwidth of the leading multi-chip interconnect technology.

Are CXL lanes equivalent to UltraFusion signals?
 

darngooddesign

macrumors P6
Jul 4, 2007
18,362
10,114
Atlanta, GA
What are the odds that Apple will move away from integrated graphics for their Mac Pro and iMac Pro?
While I get what you meant, "integrated graphics" usually means a low-performance GPU, which is certainly not the case with the M# chips. I think Apple is fully capable of building a stunningly fast GPU in the upcoming Mac Pro; something that isn't just 4*Ultra.
 
  • Love
Reactions: spaz8

leman

macrumors Core
Oct 14, 2008
19,521
19,674
Diamond Rapids will use CXL 3.0. Is it faster than UltraFusion?

CXL is a device connection and state synchronization protocol that uses PCIe to transmit data. It’s something entirely different. It’s a technology designed to facilitate cache-coherent communication in large systems comprised of different devices connected via PCIe lanes. It has no value for Apple, who rely on tight physical integration. CXL is not suitable for multi-chip connections and is not an alternative to UltraFusion. I would be very, very surprised if Intel uses CXL to connect the individual dies in a Sapphire Rapids package, unless they are OK with the die-to-die connection being dead slow.
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
I don’t know much about CXL, but apparently it uses the same PHY as PCIe. I imagine UltraFusion is completely different physically.
CXL doesn't just use the PHY; it is the full PCIe stack plus extensions that provide hardware cache-coherency features. All the additions to the base PCIe spec should be well above the PHY layer.

UltraFusion is a very different physical layer - we can figure that out just from the few numbers Apple has disclosed. Doing the math on 2.5 TB/s transported over 10,000 signals results in only 2 Gbps per signal. Even Gen1 PCIe uses a line rate of 2.5 Gbps per lane or 'signal'.

Given that they've claimed UltraFusion is extremely low latency, and that it's only used to connect two dies abutted together and connected by a passive interposer or bridge device, I'd guess that UF signals are single-ended with no SERDES. There are no serializer/deserializer gearboxes, in other words, hence the low clock rate. They're simply taking advantage of the signal integrity and wire density made possible by the bridge to run their internal ASIC interconnect fabric directly between the two chips.
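The per-signal math, for reference (assuming Apple's round numbers of 2.5 TB/s over 10,000 signals):

[CODE]
# Back-of-envelope per-signal rate implied by Apple's UltraFusion figures.
total_tb_s = 2.5               # 2.5 TB/s aggregate bandwidth
signals = 10_000               # "more than 10,000 signals"

per_signal_gbps = total_tb_s * 8 * 1000 / signals
print(per_signal_gbps)         # 2.0 Gbps per signal, below even PCIe Gen1's 2.5 Gbps line rate
[/CODE]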
 
  • Like
Reactions: jdb8167 and Xiao_Xi

jeanlain

macrumors 68020
Mar 14, 2009
2,459
953
I didn't want to add a new topic for that, but this video shows how awful Apple's implementation of OpenGL for their GPU is:
Have Apple done such a bad job on purpose?
 