
Pressure

macrumors 603
May 30, 2006
5,079
1,413
Denmark
A practical example of a device family with greater theoretical TFLOPS numbers than practical graphics performance is AMD's Vega. The big Vegas were very good compute cards with excellent raw numbers, but for graphics work they are often beaten by much weaker (FLOPS-wise) RDNA-based GPUs.
That is because graphics cards usually have 1/32 FP64 rates and are bandwidth-limited.

The Vega 10 die (used in the Radeon RX Vega 56 and 64) had a 1/16 FP64 rate but was paired with (at the time) fast HBM2 with low latency.

The Vega 20 die (Radeon VII) natively supported a 1/2 FP64 rate (the Radeon VII shipped capped at 1/4) and had even more bandwidth, so that was a very fast compute card.

Something like the modern NVIDIA GeForce RTX 4090 has a 1/64 FP64 rate.
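To put rough numbers on that, peak FP64 throughput is just the FP32 number multiplied by the FP64 rate. A quick sketch (the FP32 figures below are approximate spec-sheet values, not measurements):

```swift
import Foundation

// Rough illustration: peak FP64 TFLOPS = peak FP32 TFLOPS x FP64 rate.
// Spec numbers are approximate, taken from public data sheets.
struct GPU {
    let name: String
    let fp32TFLOPS: Double
    let fp64Rate: Double          // FP64 throughput as a fraction of FP32
    var fp64TFLOPS: Double { fp32TFLOPS * fp64Rate }
}

let cards = [
    GPU(name: "Radeon RX Vega 64 (Vega 10)", fp32TFLOPS: 12.7, fp64Rate: 1.0 / 16),
    GPU(name: "GeForce RTX 4090 (Ada)",      fp32TFLOPS: 82.6, fp64Rate: 1.0 / 64),
]

for card in cards {
    print("\(card.name): ~\(String(format: "%.2f", card.fp64TFLOPS)) TFLOPS FP64")
}
// Radeon RX Vega 64 (Vega 10): ~0.79 TFLOPS FP64
// GeForce RTX 4090 (Ada): ~1.29 TFLOPS FP64
```

So despite a roughly 6.5x FP32 advantage, the 4090 ends up only about 60% ahead of the old Vega 64 in peak FP64.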
 
  • Like
Reactions: throAU

Basic75

macrumors 68000
May 17, 2011
1,995
2,341
Europe
A practical example of a device family with greater theoretical TFLOPS numbers than practical graphics performance is AMD's Vega. The big Vegas were very good compute cards with excellent raw numbers, but for graphics work they are often beaten by much weaker (FLOPS-wise) RDNA-based GPUs.
Another example is AMD's RDNA 3. The dual-issue capability that should double the FLOPS looks to be nearly useless in practice.
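My understanding (a conceptual sketch, not actual RDNA 3 ISA or compiler output) is that the doubled figure only shows up when two independent FP32 operations can be paired into one issue slot, which real shader code often doesn't provide:

```swift
// Illustrative only: dual issue helps when two FP32 ops are independent.
let a: Float = 1.0
let c: Float = 2.0

// Pairable: these two FMAs don't depend on each other, so a dual-issue slot could run both at once.
let b = a * 2 + 1
let d = c * 3 + 2

// Not pairable: each step consumes the previous result, so the second slot sits idle
// and the theoretical "doubled" FLOPS never materialise.
let e = a * 2 + 1
let f = e * 3 + 2

print(b, d, e, f)
```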
 

casperes1996

macrumors 604
Jan 26, 2014
7,497
5,675
Horsens, Denmark
That is because graphics cards usually have 1/32 FP64 rates and are bandwidth-limited.

The Vega 10 die (used in the Radeon RX Vega 56 and 64) had a 1/16 FP64 rate but was paired with (at the time) fast HBM2 with low latency.

The Vega 20 die (Radeon VII) natively supported a 1/2 FP64 rate (the Radeon VII shipped capped at 1/4) and had even more bandwidth, so that was a very fast compute card.

Something like the modern NVIDIA GeForce RTX 4090 has a 1/64 FP64 rate.
Those FP64 rates are often artificially limited to get you to buy a Quadro or FirePro. At least they used to be. With consumer and workstation GPUs now using more divergent architectures, the limits may be more real.
 

leman

macrumors Core
Oct 14, 2008
19,316
19,329
That is because graphics cards usually have 1/32 FP64 rates and are bandwidth-limited.

The Vega 10 die (used in the Radeon RX Vega 56 and 64) had a 1/16 FP64 rate but was paired with (at the time) fast HBM2 with low latency.

The Vega 20 die (Radeon VII) natively supported a 1/2 FP64 rate (the Radeon VII shipped capped at 1/4) and had even more bandwidth, so that was a very fast compute card.

Something like the modern NVIDIA GeForce RTX 4090 has a 1/64 FP64 rate.

Very few applications outside academia care about FP64 performance. The Vega architecture was simply better optimized for compute than for graphics. It was comparatively light on dedicated graphics resources like ROPs and texturing units, and from what I understand its efficiency when running fragment shaders was not the best either.

The interesting thing about Apple GPUs is that, thanks to TBDR, they can achieve very decent graphics performance with a compute-focused architecture. In the TBDR world, running a fragment shader is as simple as invoking a compute program over a small pixel grid. You don't need to deal with ROPs or any other kind of data races.
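A plain-Swift sketch of that idea (a conceptual model, not actual Metal or Apple hardware behaviour): shading a tile looks like a small compute dispatch over the tile's pixels, and since exactly one tile owns each pixel, nothing races with it before the finished tile is written out.

```swift
// Conceptual model of TBDR fragment shading (illustrative, not real driver/hardware code).
struct Tile {
    static let size = 32                                               // e.g. a 32x32 pixel tile
    var colors = [SIMD4<Float>](repeating: SIMD4<Float>(repeating: 0), // stand-in for on-chip tile memory
                                count: Tile.size * Tile.size)
}

func shadeTile(_ tile: inout Tile, fragment: (Int, Int) -> SIMD4<Float>) {
    // One "thread" per pixel of the tile, the same shape as a compute grid.
    for y in 0..<Tile.size {
        for x in 0..<Tile.size {
            tile.colors[y * Tile.size + x] = fragment(x, y)
        }
    }
    // Only after the whole tile is shaded is it flushed to the framebuffer in one go,
    // so there is no cross-tile blending race for ROP-style hardware to arbitrate.
}

var tile = Tile()
shadeTile(&tile) { x, y in SIMD4<Float>(Float(x) / 32, Float(y) / 32, 0, 1) }
```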
 
  • Like
Reactions: casperes1996

Pressure

macrumors 603
May 30, 2006
5,079
1,413
Denmark
Very few applications outside academia care about FP64 performance. The Vega architecture was simply better optimized for compute than for graphics. It was comparatively light on dedicated graphics resources like ROPs and texturing units, and from what I understand its efficiency when running fragment shaders was not the best either.

The interesting thing about Apple GPUs is that, thanks to TBDR, they can achieve very decent graphics performance with a compute-focused architecture. In the TBDR world, running a fragment shader is as simple as invoking a compute program over a small pixel grid. You don't need to deal with ROPs or any other kind of data races.
I mean, it was during the crypto boom, and Vega was very good at mining because of its compute performance.

These days FP64 rates are largely irrelevant since, as you say, nearly nothing uses them.
 

throAU

macrumors G3
Feb 13, 2012
8,966
7,123
Perth, Western Australia
A practical example of a device family with greater theoretical TFLOPS numbers than practical graphics performance is AMD's Vega. The big Vegas were very good compute cards with excellent raw numbers, but for graphics work they are often beaten by much weaker (FLOPS-wise) RDNA-based GPUs.

The RX Vega 64 (I have two) was (I suspect) crippled by not having enough ROPs. The 1080 Ti had more ROPs and crushed it, but as you say, Vega was good for compute.
 

throAU

macrumors G3
Feb 13, 2012
8,966
7,123
Perth, Western Australia
That is because graphics cards usually have 1/32 FP64 rates and are bandwidth-limited.

The Vega 10 die (used in the Radeon RX Vega 56 and 64) had a 1/16 FP64 rate but was paired with (at the time) fast HBM2 with low latency.

The Vega 20 die (Radeon VII) natively supported a 1/2 FP64 rate (the Radeon VII shipped capped at 1/4) and had even more bandwidth, so that was a very fast compute card.

Something like the modern NVIDIA GeForce RTX 4090 has a 1/64 FP64 rate.

Yup, and the D500/D700 in the 2013 Mac Pro have a 1/4 FP64 rate!

Still pretty strong cards today for FP64, just bugger all actually uses it :D
 

crazy dave

macrumors 65816
Sep 9, 2010
1,300
999
Hey! I'm one of those academics who wouldn't mind better FP64 throughput in my consumer cards and there are dozens of us! DOZENS! (okay there are a lot more of us than that, but yeah that ship has sailed)

I believe for Nvidia GPUs basically only Hopper has decent FP64 throughput now. The graphics workstation cards have dropped the Quadro branding; even before, the difference between the consumer cards and the Quadros was almost always in the drivers, with only a few models being compute-oriented. Now it's only the drivers and the amount of VRAM (especially at the high end).
 

BanjoDudeAhoy

macrumors 6502a
Aug 3, 2020
847
1,493
[...] The graphics workstation cards have dropped the Quadro branding; even before, the difference between the consumer cards and the Quadros was almost always in the drivers, with only a few models being compute-oriented. Now it's only the drivers and the amount of VRAM (especially at the high end).

Thanks for this. I didn't know that, but had always wondered what the difference was between the consumer and Quadro cards.
 
  • Like
Reactions: crazy dave

dmccloud

macrumors 68030
Sep 7, 2009
2,994
1,736
Anchorage, AK
That was my initial thought as well, but then I looked at PCIe latency and bandwidth numbers and now I am not sure whether the incurred delays are high enough to be noticeable in practice. And we do know that the latency of initializing operations on Apple GPUs can actually be higher, substantially so, than on Nvidia. This is what makes me suspect there is some sort of power/clock optimization going on. Maybe it's not the latency of the PCIe bus transfer as much as the latency of changing the transfer bus power state? I don't know much about these things though.

The other complicating factor on the PC side is that, depending on the motherboard in use, the PCIe slots might be running at lower speeds than they're rated for. There are multiple AM4 and Intel 12th-gen boards where one of the M.2 SSD slots shares PCIe lanes with the lower x16 slot, meaning the x16 slot actually runs at x8 or even x4 in certain configurations.
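For a rough sense of what is at stake when a slot drops from x16 to x8 or x4, here is a back-of-the-envelope sketch of per-direction link bandwidth (approximate; real-world throughput is lower):

```swift
import Foundation

// Approximate usable bandwidth per lane, per direction, after encoding overhead (GB/s).
let perLaneGBps: [(gen: String, rate: Double)] = [
    ("PCIe 3.0", 0.985),
    ("PCIe 4.0", 1.969),
    ("PCIe 5.0", 3.938),
]

for (gen, rate) in perLaneGBps {
    for lanes in [16, 8, 4] {
        print("\(gen) x\(lanes): ~\(String(format: "%.1f", rate * Double(lanes))) GB/s")
    }
}
// e.g. PCIe 4.0 x16 ~31.5 GB/s, x8 ~15.8 GB/s, x4 ~7.9 GB/s
```

So a GPU demoted to x8 because the slot shares lanes with an M.2 drive loses half its host-link bandwidth; whether that matters in practice depends on how much CPU-to-GPU traffic the workload actually generates.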
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,524
953
I just read from a random user on another forum that Imagination Tech licenses its GPU instruction set to Apple, and not its IP, like ARM does with its instruction set. Is there any truth to that?
 

MrGunny94

macrumors 65816
Dec 3, 2016
1,108
649
Malaga, Spain
I just read from a random user on another forum that Imagination Tech licenses its GPU instruction set to Apple, and not its IP, like ARM does with its instruction set. Is there any truth to that?
Absolutely, take a read here: https://www.imaginationtech.com/news/imagination-and-apple-sign-new-agreement/

They give full access to the IP to develop on, but Apple for sure does a lot in-house while using the GPU IP from them. I'm not sure how much of Imagination's input is taken into consideration at Apple though. That's something worth asking John Gruber; maybe he has some sort of insight.
 

leman

macrumors Core
Oct 14, 2008
19,316
19,329
I just read from a random user on another forum that Imagination Tech licenses its GPU instruction set to Apple, and not its IP, like ARM does with its instruction set. Is there any truth to that?

No. Apple does not use Imagination's shader cores or GPU instruction set. Apple does use some of Imagination's fixed-function IP, such as the rasterization logic and possibly parts of the ray-tracing functionality.

Apple GPUs have been described as a PowerVR GPU with the shader cores and a lot of the logic swapped out. Imagine you buy a car and then replace the engine and the drivetrain with some custom advanced solution. People working on the reverse-engineered drivers have noted that there are a lot of traces of PowerVR left in the basic configuration interface.
 
  • Like
Reactions: obamacarefan

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,524
953
Apple GPUs have been described as a PowerVR GPU with the shader cores and a lot of the logic swapped out. [...] People working on the reverse-engineered drivers have noted that there are a lot of traces of PowerVR left in the basic configuration interface.
Is this still the case for the M3 GPU or only for the M1/M2 GPU? Didn't Apple consider the M3 GPU to be a new GPU design?

They give full access to the IP to develop on
It is not clear to me what they mean by IP: just the GPU IP blocks or also the instruction set?
 

leman

macrumors Core
Oct 14, 2008
19,316
19,329
Is this still the case for the M3 GPU or only for the M1/M2 GPU? Didn't Apple consider the M3 GPU to be a new GPU design?

It's the case, I believe, for all Apple GPUs since the A10. The M3 GPU uses a new design with a new instruction set, different from the M1/M2.

It is not clear to me what they mean by IP: just the GPU IP blocks or also the instruction set?

You'd probably need to read the contracts to understand that. It is possible that Apple has access to everything and then they can choose what exactly they want to use. At any rate, they do not use IMG's instruction set.
 
  • Like
Reactions: Xiao_Xi

MrGunny94

macrumors 65816
Dec 3, 2016
1,108
649
Malaga, Spain
Is this still the case for the M3 GPU or only for the M1/M2 GPU? Didn't Apple consider the M3 GPU to be a new GPU design?


It is not clear to me what they mean by IP: just the GPU IP blocks or also the instruction set?
I believe it's the IP, but I doubt we'll be able to find the specifics online...
 

name99

macrumors 68020
Jun 21, 2004
2,277
2,139
It's the case, I believe, for all Apple GPUs since the A10. The M3 GPU uses a new design with a new instruction set, different from the M1/M2.
Can you tell if the M3 is using clauses?
To my eyes, clauses (as implemented in the earlier GPUs) provide a lot of benefit, but of course also come with obvious downsides. So I still can't get a read on whether they are still being used (most obviously as load/store vs. datapath instructions).
 

leman

macrumors Core
Oct 14, 2008
19,316
19,329
Can you tell if the M3 is using clauses?
To my eyes, clauses (as implemented in the earlier GPUs) provide a lot of benefit, but of course also come with obvious downsides. So I still can't get a read on whether they are still being used (most obviously as load/store vs. datapath instructions).

I do not believe that Apple used clauses in any of their more recent GPU designs. The A14 is a straightforward in-order SIMD architecture that interleaves instructions from different program threads to hide latency. Then again, my knowledge of the Apple GPU ISA is entirely based on the work of Dougall Johnson and others; I do not have the tools or the knowledge to do similar work for the M3.
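As a toy illustration of that latency-hiding scheme (a made-up scheduler model, not the real hardware): with only one resident SIMD group the core stalls on every load, while with several resident groups the stall cycles get filled by other groups' instructions.

```swift
// Toy model of latency hiding (illustrative only, not the actual Apple scheduler):
// an in-order core issues one instruction per cycle from whichever resident SIMD group
// is ready; groups that are waiting on memory simply don't get picked.
struct Group {
    var remainingOps = 16      // instructions left to issue
    var readyAt = 0            // cycle at which this group may issue again
}

func cyclesToDrain(groupCount: Int, memoryLatency: Int) -> Int {
    var groups = [Group](repeating: Group(), count: groupCount)
    var cycle = 0
    while groups.contains(where: { $0.remainingOps > 0 }) {
        if let i = groups.indices.first(where: { groups[$0].remainingOps > 0 && groups[$0].readyAt <= cycle }) {
            groups[i].remainingOps -= 1
            // Every fourth instruction is a load that parks this group for memoryLatency cycles.
            if groups[i].remainingOps % 4 == 0 {
                groups[i].readyAt = cycle + memoryLatency
            }
        }
        cycle += 1             // if nobody was ready, this cycle is an exposed stall
    }
    return cycle
}

print(cyclesToDrain(groupCount: 1, memoryLatency: 20))   // 73 cycles for 16 instructions: mostly stalls
print(cyclesToDrain(groupCount: 8, memoryLatency: 20))   // 128 cycles for 128 instructions: no exposed stalls
```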
 
