
quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
This is easier with CPUs anyway as each CPU core is logically autonomous. Apple GPU tech does not have this limitation as shown by Ultra.
Wouldn't the GPU cores (be they Apple's, AMD's, Nvidia's or Intel's) be fully autonomous as well?

In Apple's case, each core (CPU, NPU, GPU, etc. in the M1/Pro/Max/Ultra) can see the entire addressable memory space. AMD's CPUs would be similar in being able to see the entire addressable memory space.

dGPUs have their own memory, but they would have to be individually programmed as well, right, each with its own data, program and stack memory space? For a linked multi-dGPU setup, I would imagine the GPU cores would be able to either copy from or directly access the linked dGPUs' VRAM, depending on the programming model.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
Wouldn't the GPU cores (be they Apple's, AMD's, Nvidia's or Intel's) be fully autonomous as well?

In Apple's case, each core (CPU, NPU, GPU, etc. in the M1/Pro/Max/Ultra) can see the entire addressable memory space. AMD's CPUs would be similar in being able to see the entire addressable memory space.

dGPUs have their own memory, but they would have to be individually programmed as well, right, each with its own data, program and stack memory space? For a linked multi-dGPU setup, I would imagine the GPU cores would be able to either copy from or directly access the linked dGPUs' VRAM, depending on the programming model.

It's not about the memory the core can address but about what it can do. Someone has to set up the work queues and MMU descriptors, allocate the memory and make sure that everything points where it should so that the GPU can do its work. You could theoretically put a general-purpose processor in every GPU core, but that would be incredibly wasteful. From what I understand, Apple uses an auxiliary ARM core that takes care of these things, assigns work to the individual GPU cores and collects the results. On older GPUs this is done on the main CPU (the driver sets these things up, then initiates GPU code execution). I have no idea how modern dGPUs do it; I suppose they also have a general-purpose orchestrator (or several) on the GPU.

You don't have this problem with the CPUs since every core is capable of flexible code execution, so each CPU core can make its own decisions.
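If it helps to picture the split, here is a toy sketch of the general idea in Python (my own illustration, not Apple's actual firmware or driver): a single orchestrator sets up and hands out the work, the "cores" just execute whatever they're given, and the orchestrator collects the results.

[CODE]
from queue import Queue
from threading import Thread

# Toy model only: one orchestrator assigns work to otherwise "dumb" cores and
# collects the results. The cores never decide on their own what to run next.
work, results = Queue(), Queue()
NUM_CORES, NUM_ITEMS = 8, 64

def gpu_core(core_id):
    while True:
        item = work.get()
        if item is None:                     # shutdown signal from the orchestrator
            break
        results.put((core_id, item * 2))     # stand-in for the actual shader work
        work.task_done()

cores = [Thread(target=gpu_core, args=(i,)) for i in range(NUM_CORES)]
for t in cores:
    t.start()

for item in range(NUM_ITEMS):                # orchestrator: set up and hand out the work...
    work.put(item)
work.join()
for _ in cores:                              # ...then tell the cores to stop...
    work.put(None)
outputs = [results.get() for _ in range(NUM_ITEMS)]   # ...and collect the results
[/CODE]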
 

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,664
OBX
Latency is more critical for CPUs than bandwidth; the opposite is true for GPUs. This is why multi-die GPUs weren't made until the M1 Ultra. The M1 Ultra is literally the first multi-die GPU.

AMD is rumored to be using a multi-die design for their next-generation discrete GPU.
AMD released a multi-GCD "GPU" in November of last year (AMD Instinct MI250), technically beating Apple to the punch.
It's not about the memory the core can address but about what it can do. Someone has to set up the work queues and MMU descriptors, allocate the memory and make sure that everything points where it should so that the GPU can do its work. You could theoretically put a general-purpose processor in every GPU core, but that would be incredibly wasteful. From what I understand, Apple uses an auxiliary ARM core that takes care of these things, assigns work to the individual GPU cores and collects the results. On older GPUs this is done on the main CPU (the driver sets these things up, then initiates GPU code execution). I have no idea how modern dGPUs do it; I suppose they also have a general-purpose orchestrator (or several) on the GPU.

You don't have this problem with the CPUs since every core is capable of flexible code execution, so each CPU core can make its own decisions.
Modern GPUs all have hardware-accelerated GPU scheduling (Nvidia exposes it via regular drivers; AMD only exposes it via the Windows Store driver). From my understanding, performance is meh, so no one recommends enabling it.

This may or may not be the same thing that Apple is doing.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
AMD released a multi-GCD "GPU" in November of last year (AMD Instinct MI250), technically beating Apple to the punch.

From what I understand, the MI250 is still exposed as two logical GPUs, so it's not the same thing.

Modern GPUs all have hardware-accelerated GPU scheduling (Nvidia exposes it via regular drivers; AMD only exposes it via the Windows Store driver). From my understanding, performance is meh, so no one recommends enabling it.

This may or may not be the same thing that Apple is doing.

"Hardware accelerated GPU scheduling" can literally mean anything. Based on the Asahi docs, Apple uses a dedicated coprocessor to manage many the GPU related things, so the system driver can be fairly thin. Does that qualify as "hardware accelerated GPU scheduling"?
 
  • Like
Reactions: senttoschool

senttoschool

macrumors 68030
Nov 2, 2017
2,626
5,482
AMD released a multi-GCD "GPU" in November of last year (AMD Instinct MI250), technically beating Apple to the punch.
M1 Ultra is the first multi-die GPU that is presented as a single GPU to software. This is the actual breakthrough.

Instinct is still presented as two separate GPUs.

First, from a software perspective, a single accelerator is exposed as two separate GPUs. This means that your algorithm needs to be multi-GPU aware in order to fully utilize the accelerator. Second, despite being connected by 4 high-speed IF links, the chip-to-chip bandwidth is still much lower than the HBM memory bandwidth, leading to moving data between GCDs being rather painful, and significantly slower than memory accesses. This is actually the reason AMD opted to expose the accelerators as two distinct GPUs; they couldn’t match the HBM bandwidth between the two dies to be able to expose both GPUs as one, nor at least half of it to support a distributed mode configuration, so they decided to expose them as different GPUs.

From what I understand, the MI250 is still exposed as two logical GPUs, so it's not the same thing.
You literally posted this seconds before me.
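To make the "multi-GPU aware" point from the quote above concrete, here is a minimal sketch (my own assumption of what that looks like in PyTorch, not the MI250/ROCm toolchain specifically): because the accelerator shows up as two devices, the workload has to be split across them and the results gathered explicitly.

[CODE]
import torch

# Hypothetical illustration: the MI250 is exposed as two logical GPUs, so the
# workload has to be partitioned across them and the results gathered by hand.
devices = [torch.device("cuda:0"), torch.device("cuda:1")]   # the two GCDs

a = torch.randn(8192, 8192)
b = torch.randn(8192, 8192)

# Split the left matrix by rows, run each half on its own GCD, gather on the host.
halves = a.chunk(2, dim=0)
partials = [h.to(dev) @ b.to(dev) for h, dev in zip(halves, devices)]
c = torch.cat([p.cpu() for p in partials], dim=0)
[/CODE]

On a GPU presented as a single device, none of that splitting would be needed, which is the point being made about the M1 Ultra.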
 

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,664
OBX
"Hardware accelerated GPU scheduling" can literally mean anything. Based on the Asahi docs, Apple uses a dedicated coprocessor to manage many the GPU related things, so the system driver can be fairly thin. Does that qualify as "hardware accelerated GPU scheduling"?
That is a good question. Because performance is so meh on the PC side, Microsoft isn't forcing it to be used. No clue how long that stance will last, though. No one seems to show it in architecture shots either. I know RDNA has a Graphics Command Processor, but it isn't clear if that is what HAGS leverages.
M1 Ultra is the first multi-die GPU that is presented as a single GPU to software. This is the actual breakthrough.

Instinct is still presented as two separate GPUs.




You literally posted this seconds before me.
To be fair, you said first multi-die GPU, which Aldebaran meets the definition of, even if it shows up as two GPUs.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
That is a good question. Because performance is so meh on the PC side, Microsoft isn't forcing it to be used. No clue how long that stance will last, though. No one seems to show it in architecture shots either. I know RDNA has a Graphics Command Processor, but it isn't clear if that is what HAGS leverages.

I remember Apple's GPU driver team lead commenting on Twitter that Apple has had this kind of stuff for years. But it's so difficult to imagine what these things are doing without in-depth information.

To be fair, you said first multi-die GPU, which Aldebaran meets the definition of, even if it shows up as two GPUs.

Ah, sorry, I meant "single" GPU. I would classify the MI250 as two GPUs on one package. From the system configuration point of view, it's not that different from any other multi-GPU system connected via an Infinity Fabric link.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
Latency is more critical for CPUs than bandwidth; the opposite is true for GPUs. This is why multi-die GPUs weren't made until the M1 Ultra. The M1 Ultra is literally the first multi-die GPU.

That is only for the GPU cores. The GPU, from the end-user perspective, covers getting output all the way to the display where it can be consumed. Most folks won't consider a GPU card with no display output a GPU card.




AMD is rumored to be using a multi-die design for their next-generation discrete GPU.

The MI210 isn't a rumor (nor is the MI250X).




Those multi-die GPUs present as two GPUs to the user.


The upcoming 7900 is not the multi-die design of the early rumors (which were trying to bring the HPC Instinct setup mainstream into the high-frame-rate market).

[Image: AMD Navi 31 GCD/MCD chiplet layout]




Similar to the "hub and spoke" set up of the desktop CPU dies/chiplets , but the role of I/O is reversed (from 'hub' to 'spoke' ) . The compute is on the hub and the memory I/O is on the spoked ( MCD memory cache die ( TSMC N6) . GCD graphics compute die (TSMC N5) ). [ the PCI-e is still in the hub though. ]

The GCD presents as one GPU, but it is also only one chip. AMD is pursuing both paths, but the multi-die compute path isn't presenting as one unified GPU (or doing any significant display-driving duties). It is essentially a better-packaged, better-performing W6800X Duo minus the display output subsystem.
 
Last edited:

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
Ah, sorry, I meant "single" GPU. I would classify the MI250 as two GPUs on one package. From the system configuration point of view, it's not that different from any other multi-GPU system connected via an Infinity Fabric link.


What would significantly matter, though, is whether AMD implemented the Infinity Fabric connections differently. If they are 2.5D/3D interposer-only implementations, they don't have to be as large as the 'long distance' Infinity Fabric implementations. The die-to-die IF connections are not exactly like what Apple did with UltraFusion or what AMD is doing elsewhere with IF.



[Image: AMD MI200 package]


https://www.anandtech.com/show/1705...i200-accelerator-family-cdna2-exacale-servers

" ... But it still comes with caveats, as each MI200 accelerator is presented as two GPUs, and even as fast as 4 IF links are, moving data between GPUs is still much slower than within a single, monolithic GPU. ..." (**)

An initial search for a public annotated floor plan turned up blank, but clearly the side where the die-to-die, fixed point-to-point connection most likely lies is not the same size as the general Infinity Fabric or PCIe external package links. Four significantly smaller IF links make a difference in die-space usage allocations.


For example, Nvidia completely dumped NVLink off the 4080 die to allocate more space to RTv3 units.


" ... The news is coming directly from NVIDIA founder and CEO Jensen Huang, who on a call with the tech press he said that Ada Lovelace drops the NVLink connector so that the I/O room for "something else". We do know that NVIDIA has crammed a helluva lot more into Ada Lovelace GPUs, with transistors, CUDA core, and ROP counts going through the roof.
....
Jensen said that NVIDIA engineers wanted to use as much of the silicon area they had their hands-on to "cram in as much AI processing as we could". We also now know: Ada Lovelace does NOT support PCIe 5.0, but NVIDIA included the Gen5 power connector. ..."
https://www.tweaktown.com/news/8860...-for-ada-lovelace-silent-death-sli/index.html


Apple is getting high bandwidth and lower latencies with UltraFusion, but they are also tossing any significant high-bandwidth link off the package down the drain to get there. There is no 'free lunch' there. If scaling to four dies means an even bigger off-die bandwidth trade-off, then there is even less of a "free lunch". A path that pragmatically blocks PCIe v5 (and CXL) for a very long time would come at a cost.


(**) In the AnandTech article's tech specs table, the MI200 line-up shows a substantive uptick in the number of IF links versus the previous generation. How they got that without expending gobs of edge space is significant. But there is a trade-off (on distance versus space).
 
Last edited:

EntropyQ3

macrumors 6502a
Mar 20, 2009
718
824
AMD uses a central IO die that routes all the communication between individual chips. That's not the most latency-efficient multi-chip technology. Apple's UltraFusion directly connects on-chip networks.

I disagree with your conclusion that AMD's GPUs are monolithic because of the latency concerns. They are probably monolithic simply because that's how AMD's technology works. In their current architecture, the GPU is a single isolated device and they have no means (currently) to orchestrate workloads across multiple dies. This is easier with CPUs anyway as each CPU core is logically autonomous. Apple GPU tech does not have this limitation as shown by Ultra.
Oh, I didn’t say (directly) that interchip latency is the killer in terms of multi-die rendering. I’m sure you’ve seen the far more in-depth discussions of the issues in other places as well. I just offered it as a rare example of cross-die latencies actually being measured in the real world.

Unfortunately I no longer know a place where engineers from AMD or Nvidia discuss things like this.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Unfortunately I no longer know a place where engineers from AMD or Nvidia discuss things like this.

I don't know if these two videos from last year's Hot Chips, where Intel explains their technology, or the two videos about CXL from this year's Hot Chips can help.
 
Last edited:

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,664
OBX

altaic

macrumors 6502a
Jan 26, 2004
711
484
Diamond Rapids will use CXL 3.0. Is it faster than UltraFusion?
Depends on how many lanes per die. UltraFusion is 2.5 TB/s and one CXL 3.0 lane is 64 GT/s (8 GB/s), so about 313 lanes would be equivalent. However, in a quad tiling like that, I’d think it’d require double to triple that, so 626 to 939 lanes per die.
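The arithmetic, spelled out (using Apple's 2.5 TB/s figure and assuming roughly 8 GB/s of usable bandwidth per CXL 3.0 lane):

[CODE]
# Rough comparison of UltraFusion bandwidth to CXL 3.0 lanes.
ultrafusion_gb_s = 2500        # Apple's 2.5 TB/s figure
cxl3_lane_gb_s = 8             # one x1 lane at 64 GT/s is roughly 8 GB/s

lanes_equivalent = ultrafusion_gb_s / cxl3_lane_gb_s
print(lanes_equivalent)        # 312.5 -> about 313 lanes

# Quad-tiling guess: two to three times that per die.
print(2 * 313, 3 * 313)        # 626, 939
[/CODE]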
 
  • Like
Reactions: Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Apple’s innovative UltraFusion uses a silicon interposer that connects the chips across more than 10,000 signals, providing a massive 2.5TB/s of low latency, inter-processor bandwidth — more than 4x the bandwidth of the leading multi-chip interconnect technology.

Are CXL lanes equivalent to UltraFusion signals?
 

darngooddesign

macrumors P6
Jul 4, 2007
18,362
10,114
Atlanta, GA
What are the odds that Apple will move away from integrated graphics for their Mac Pro and iMac Pro?
While I get what you meant, "integrated graphics" usually means a low-performance GPU, which is certainly not the case with the M# chips. I think Apple is fully capable of building a stunningly fast GPU in the upcoming Mac Pro; something that isn't just 4*Ultra.
 
  • Love
Reactions: spaz8

leman

macrumors Core
Oct 14, 2008
19,521
19,674
Diamond Rapids will use CXL 3.0. Is it faster than UltraFusion?

CXL is a device connection and state synchronization protocol that uses PCIe to transmit data. It’s something entirely different. It’s a technology designed to facilitate cache-coherent communication in large systems comprised of different devices connected via PCIe lanes. It has no value for Apple, who rely on tight physical integration. CXL is not suitable for multi-chip connections and is not an alternative to UltraFusion. I would be very, very surprised if Intel uses CXL to connect the individual dies in a Sapphire Rapids package, unless they are OK with the die-to-die connection being dead slow.
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
I don’t know much about CXL, but apparently it uses the same PHY as PCIe. I imagine UltraFusion is completely different physically.
CXL doesn't just use the PHY; it is the full PCIe stack plus extensions that provide hardware cache-coherency features. All the additions to the base PCIe spec should be well above the PHY layer.

UltraFusion is a very different physical layer - we can figure that out just from the few numbers Apple has disclosed. Doing the math on 2.5 TB/s transported over 10,000 signals results in only 2 Gbps per signal. Even Gen1 PCIe uses a line rate of 2.5 Gbps per lane or 'signal'.

Given that they've claimed UltraFusion is extremely low latency, and that it's only used to connect two dies abutted together and connected by a passive interposer or bridge device, I'd guess that UF signals are single-ended with no SERDES. There are no serializer/deserializer gearboxes, in other words, hence the low clock rate. They're simply taking advantage of the signal integrity and wire density made possible by the bridge to run their internal ASIC interconnect fabric directly between the two chips.
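The per-signal math, for reference (assuming Apple's round numbers of 2.5 TB/s over 10,000 signals):

[CODE]
# Back-of-envelope per-signal rate implied by Apple's UltraFusion figures.
total_tb_s = 2.5               # 2.5 TB/s aggregate bandwidth
signals = 10_000               # "more than 10,000 signals"

per_signal_gbps = total_tb_s * 8 * 1000 / signals
print(per_signal_gbps)         # 2.0 Gbps per signal, below even PCIe Gen1's 2.5 Gbps line rate
[/CODE]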
 
  • Like
Reactions: jdb8167 and Xiao_Xi

jeanlain

macrumors 68020
Mar 14, 2009
2,459
953
I didn't want to add a new topic for that, but this video shows how awful Apple's implementation of OpenGL for their GPU is:
Have Apple done such a bad job on purpose?
 