Apple Silicon GPU performance

Pressure · May 7, 2024

casperes1996 said:
A practical example of a device family with greater theoretical tflops numbers than practical graphics performance is AMD’s Vega. The big Vegas were very good compute cards with excellent raw numbers. But for graphics work are often beaten by much weaker (flops wise) RDNA based gpus.

That is because usually graphic cards have 1/32nd FP64 rates and are bandwidth limited.

The VEGA 10 die (used for the Radeon RX 56 and 64) had 1/16th FP64 rate but paired with (at the time) fast HBM with low latency.

The VEGA 20 die (Radeon VII) had 1/2 FP64 with even more bandwidth, so that was a very fast compute card.

Something like the modern NVIDIA Geforce RTX 4090 has 1/64th FP64 rate.

Basic75 · May 7, 2024

casperes1996 said:
A practical example of a device family with greater theoretical tflops numbers than practical graphics performance is AMD’s Vega. The big Vegas were very good compute cards with excellent raw numbers. But for graphics work are often beaten by much weaker (flops wise) RDNA based gpus.

Another example is AMD RDNA3. The dual issue architecture that should double the FLOPS looks to be nearly useless.

casperes1996 · May 7, 2024

Pressure said:
That is because usually graphic cards have 1/32nd FP64 rates and are bandwidth limited.

The VEGA 10 die (used for the Radeon RX 56 and 64) had 1/16th FP64 rate but paired with (at the time) fast HBM with low latency.

The VEGA 20 die (Radeon VII) had 1/2 FP64 with even more bandwidth, so that was a very fast compute card.

Something like the modern NVIDIA Geforce RTX 4090 has 1/64th FP64 rate.

Those fp64 rates are often artificial to get you to buy Quadro or Fire Pro. At least they used to be. With more diverging architectures for consumer end workstation gpus it may be more real

leman · May 7, 2024

Pressure said:
That is because usually graphic cards have 1/32nd FP64 rates and are bandwidth limited.

The VEGA 10 die (used for the Radeon RX 56 and 64) had 1/16th FP64 rate but paired with (at the time) fast HBM with low latency.

The VEGA 20 die (Radeon VII) had 1/2 FP64 with even more bandwidth, so that was a very fast compute card.

Something like the modern NVIDIA Geforce RTX 4090 has 1/64th FP64 rate.

Very few applications outside academia care about FP64 performance. Vega architecture was simply better optimized for compute than for graphics. It lacked dedicated graphics resources like ROPs and texturing units, and from what I understand its efficiency when running fragment shaders was not the best either.

The interesting thing about Apple GPUs is that thanks to TBDR it can achieve very decent graphical performance with a compute-focused architecture. In TBDR world, running a fragment shader is as simple as invoking a compute program over a small pixel grid. You don't need to deal with ROPs or any other kind of data races.

leman · May 7, 2024

Basic75 said:
Another example is AMD RDNA3. The dual issue architecture that should double the FLOPS looks to be nearly useless.

It's because it is not really dual issue.

Pressure · May 7, 2024

leman said:
Very few applications outside academia care about FP64 performance. Vega architecture was simply better optimized for compute than for graphics. It lacked dedicated graphics resources like ROPs and texturing units, and from what I understand its efficiency when running fragment shaders was not the best either.

The interesting thing about Apple GPUs is that thanks to TBDR it can achieve very decent graphical performance with a compute-focused architecture. In TBDR world, running a fragment shader is as simple as invoking a compute program over a small pixel grid. You don't need to deal with ROPs or any other kind of data races.

I mean, it was during the crypto boom. VEGA was very good at that because of its compute performance.

These days FP64 rates are largely irrelevant as nearly nothing uses it as you say.

throAU · May 7, 2024

casperes1996 said:
A practical example of a device family with greater theoretical tflops numbers than practical graphics performance is AMD’s Vega. The big Vegas were very good compute cards with excellent raw numbers. But for graphics work are often beaten by much weaker (flops wise) RDNA based gpus.

Vega RX64 (I have two) was (I suspect) crippled by not having enough ROPS. The 1080TI had more ROPs and crushed it, but as you say Vega was good for compute.

throAU · May 7, 2024

Pressure said:
That is because usually graphic cards have 1/32nd FP64 rates and are bandwidth limited.

The VEGA 10 die (used for the Radeon RX 56 and 64) had 1/16th FP64 rate but paired with (at the time) fast HBM with low latency.

The VEGA 20 die (Radeon VII) had 1/2 FP64 with even more bandwidth, so that was a very fast compute card.

Something like the modern NVIDIA Geforce RTX 4090 has 1/64th FP64 rate.

Yup, and the D500/D700 in the 2013 Mac Pro has 1/4 FP64!

Still pretty strong cards today for FP64, just bugger all actually uses it

crazy dave · May 7, 2024

Hey! I'm one of those academics who wouldn't mind better FP64 throughput in my consumer cards and there are dozens of us! DOZENS! (okay there are a lot more of us than that, but yeah that ship has sailed)

I believe for Nvidia GPUs basically only Hopper has decent FP64 throughput now - the graphics workstation cards have dropped the Quadro branding, the difference with the consumer cards and Quadros was almost always in drivers even before with only a few being compute oriented, now it's only drivers and the size of the VRAM (especially at the high end).

BanjoDudeAhoy · May 7, 2024

crazy dave said:
[...] the graphics workstation cards have dropped the Quadro branding, the difference with the consumer cards and Quadros was almost always in drivers even before with only a few being compute oriented, now it's only drivers and the size of the VRAM (especially at the high end).

Thanks for this. I didn't know that, but was always somewhat wondering what the difference was between the consumer and Quadro cards.

dmccloud · May 8, 2024

leman said:
That was my initial thought as well, then I looked at PCIe latency and bandwidth numbers and now I am not sure whether the incurred delays are high enough to be noticeable in practice. And we do know that latency of initializing operations on Apple GPUs can actually be higher — substantially so — than on Nvidia. This is what makes me suspect some sort of power/clock optimization going on. Maybe it's not the latency of the PCI-e bus transfer as much as the latency of changing the transfer bus power state? I don't know anything about these things though.

The other complicating factor on the PC side is that depending on the motherboard in use, the PCIe slots might be running at different speeds than what they're rated for. There are multiple AM4 and Intel 12th gen boards where the lower M.2 SSD slot actually shares PCIe lanes with the lower x16 slot, meaning that it is actually running at x8 or even x4 under certain configurations.

Xiao_Xi · May 10, 2024

I just read from a random user on another forum that Imagination Tech licenses its GPU instruction set to Apple, and not its IP, like ARM does with its instruction set. Is there any truth to that?

MrGunny94 · May 10, 2024

Xiao_Xi said:
I just read from a random user on another forum that Imagination Tech licenses its GPU instruction set to Apple, and not its IP, like ARM does with its instruction set. Is there any truth to that?

Absolutely take a read here: https://www.imaginationtech.com/news/imagination-and-apple-sign-new-agreement/

They give full access to the IP to develop on that, but for sure Apple a lot in house but use the GPU IP from them. I'm not sure how much imagination input is taken in consideration at Apple though.. That's something worth asking to John Grubber, maybe he has some sort of insight

leman · May 10, 2024

Xiao_Xi said:
I just read from a random user on another forum that Imagination Tech licenses its GPU instruction set to Apple, and not its IP, like ARM does with its instruction set. Is there any truth to that?

No. Apple does not use Imagination's shader cores or GPU instruction set. Apple does use some of the Img's fixed-function IP, such as the rasterization logic and possibly parts of the raytracing functionality.

Apple GPUs have been described as an PowerVR GPU with shader cores and a lot of logic swapped out. Imagine you buy a car and then replace the engine and the drivetrain with some custom advanced solution. People working on reverse-engineered drivers have noted that there are a lot of traces of PowerVR left in the basic configuration interface.

Basic75 · May 10, 2024

leman said:
It's because it is not really dual issue.

It does seem to fall short.

Xiao_Xi · May 10, 2024

leman said:
Apple GPUs have been described as an PowerVR GPU with shader cores and a lot of logic swapped out. [...] People working on reverse-engineered drivers have noted that there are a lot of traces of PowerVR left in the basic configuration interface.

Is this still the case for the M3 GPU or only for the M1/M2 GPU? Didn't Apple consider the M3 GPU to be a new GPU design?

MrGunny94 said:
They give full access to the IP to develop on that

It is not clear to me what they mean by IP: just the GPU IP blocks or also the instruction set?

leman · May 10, 2024

Xiao_Xi said:
Is this still the case for the M3 GPU or only for the M1/M2 GPU? Didn't Apple consider the M3 GPU to be a new GPU design?

It's the case for all Apple GPUs I believe since A10. M3 GPU uses a new design with a new instruction set, different from M1/2.

Xiao_Xi said:
It is not clear to me what they mean by IP: just the GPU IP blocks or also the instruction set?

You'd probably need to read the contracts to understand that. It is possible that Apple has access to everything and then they can choose what exactly they want to use. At any rate, they do not use IMG's instruction set.

MrGunny94 · May 10, 2024

Xiao_Xi said:
Is this still the case for the M3 GPU or only for the M1/M2 GPU? Didn't Apple consider the M3 GPU to be a new GPU design?

It is not clear to me what they mean by IP: just the GPU IP blocks or also the instruction set?

I believe it's the IP.. but the specifics I doubt we'll be able to find them online...

name99 · May 27, 2024

leman said:
It's the case for all Apple GPUs I believe since A10. M3 GPU uses a new design with a new instruction set, different from M1/2.

Can you tell if M3 is using clauses?
To my eyes, clauses (as implemented in the earlier GPUs) provide a lot of benefit, but of course also with obvious downsides. So I still can’t get a read on whether they are still being used (most obviously as load/store vs datapath instructions).

leman · May 27, 2024

name99 said:
Can you tell if M3 is using clauses?
To my eyes, clauses (as implemented in the earlier GPUs) provide a lot of benefit, but of course also with obvious downsides. So I still can’t get a read on whether they are still being used (most obviously as load/store vs datapath instructions).

I do not believe that Apple used clauses in any of their more recent GPU designs. A14 is a straightforward in-order SIMD architecture that interleaves instructions from different program threads to hide latency. Then again, my knowledge of Apple GPU ISA is entirely based on the work of Dougall Johnson and others. I do not have the tools or the knowledge to do similar work for M3.

casperes1996 · May 27, 2024

What are clauses in this context?

leman · May 27, 2024

casperes1996 said:
What are clauses in this context?

It’s an ISA mechanism by which instructions are grouped into interdependent blocks. Apparently it can help with scheduling and register access. My familiarity with the concept is only superfluous. One would need to look at the details of specific architectures to understand how it works exactly.

Here are some slides that give examples: https://community.arm.com/cfs-file/...00-71-76/HotChips-Presentation-Jem-Davies.pdf

Kronsteen · Oct 6, 2024

I came across this thread by chance and have enjoyed reading it. I realise it’s a few months old but would like to offer a couple of observations and examples that I hope may be of interest.

As several people have said, GPU performance — and even comparing the relative performance of GPUs — is not straightforward. But, in my modest experience of GPU programming, on and off, over the last decade or so, the overriding factor in achieving good performance (and realising a GPU’s potential) is keeping it supplied with work. Doing so can be constrained, not least, by both transferring data to/from the CPU and the cost of the GPU reading data from main memory (its own or shared). By implication, making effective use of the GPU’s cache (sometimes referred to as shared memory within a GPU workgroup) can also make a big difference, as can memory access contention (across cache lines, for instance) and latency.

Note: the different GPU manufacturers use different terms for what is often, or essentially, the same thing. For example, thread group (Apple, NVIDIA) and workgroup (AMD / OpenCL). I use ALU to refer to an individual execution unit that executes one thread at a given moment in time.

The first decent GPU I used was an AMD FirePro D700 in the 2013 Mac Pro. More recently, an AMD Radeon Pro 5500M in the 2019 MacBook Pro, an NVIDIA Ampere in the Jetson AGX Orin Developer Kit and the M3 Max in a new MacBook Pro. The D700 and NVIDIA both have 2,048 ALUs, the M3 Max 5,120. Here are a few benchmark numbers.

Metal (GeekBench6):

D700 33,713
5500M 37,812
M3 Max 156,963 (x4.2 5500M, x4.7 D700)

fp32 TFLOPS:

Jetson 5.3
M3Max 14.2 (x2.7 Jetson)

(NB. I fully accept that comparing GPU benchmarks is fraught with difficulties; at best, there should be taken only as a very rough guide.)

To illustrate the relative performance, here is how much quicker the M3 Max executes two of my GPU applications (doing a particular form of linear algebra and quite large modular arithmetic, respectively) relative to the others. The M3 uses Metal in both cases; the D700 and 5500M OpenCL; and the Jetson CUDA.

Linear algebra:

M3 vs D700 x4.3 faster
M3 vs 5500M x4.9 faster

Modular arithmetic:

M3 vs 5500M x4.6 faster
M3 vs Jetson x2.3 faster

Some comments and a conclusion:

1. The relative performance of the M3 and the older GPUs I’ve achieved in these applications broadly matches the various benchmark figures.
2. One slight exception to that is that the M3 achieves slightly less of a gain over the NVIDIA GPU than the fp32 figures might lead one to expect (this well may be because this particular workload, which is predominantly long unsigned integers, is rather better suited to the NVIDIA).
3. It is probably stating the obvious to say that an optimisation that works for one GPU will not necessarily suit another (particularly with resect to memory accesses). For example, one attempted optimisation that failed miserably on the D700 worked quite well for the M3.
4. ALU for ALU, the M3 Max and Ampere GPUs are actually quite similar (in the Jetson, the GPU frequency is 1.3GHz).
5. The 5500M does less well, in both cases, than the Metal benchmarks suggest it might. I wouldn’t be surprised if this is because of the OpenCL drivers (OpenCL is what I originally used on the D700 and then reused with the 5500M).

Overall, I am quite impressed by the M3’s GPU. Of course, given the benchmarks, it’s not a surprise that it is so much faster than these older GPUs, although I think the “little” NVIDIA GPU in the Jetson machine is also quite impressive. For sure, the “bigger” NVIDIA GPUs have much greater performance but, for a mobile device which consumes so much less power, the M3 Max is quite something, IMHO (YMMV). An M4 Ultra would be quite a tempting proposition ….

PS. I appreciate that these workloads may not be of great interest to most of you. When I have time, I may try doing a neural net training comparison between the M3 and the Jetson.

leman · Oct 6, 2024

Kronsteen said:
2. One slight exception to that is that the M3 achieves slightly less of a gain over the NVIDIA GPU than the fp32 figures might lead one to expect (this well may be because this particular workload, which is predominantly long unsigned integers, is rather better suited to the NVIDIA).

A quick comment on this: if I remember correctly most int operations execute at half the rate compared to fp arithmetics on M3. You can however execute fp and int operations simultaneously (similar to Nvidia) which can improve performance. Apple also offers native 24-bit int MAC, which might be faster. Division and modulo operations are very slow as they are implemented via injected subroutines. Bit operations on the other hand are fast (Apple GPUs have native bit sequence extract and popcount instructions).

I am currently rewriting my benchmarking code for Apple GPU and hope to publish a comprehensive overview in the next couple of months.

Kronsteen · Oct 6, 2024

leman said:
A quick comment on this: if I remember correctly most int operations execute at half the rate compared to fp arithmetics on M3. You can however execute fp and int operations simultaneously (similar to Nvidia) which can improve performance. Apple also offers native 24-bit int MAC, which might be faster. Division and modulo operations are very slow as they are implemented via injected subroutines. Bit operations on the other hand are fast (Apple GPUs have native bit sequence extract and popcount instructions).

I am currently rewriting my benchmarking code for Apple GPU and hope to publish a comprehensive overview in the next couple of months.

I believe that’s correct (about integer operations on the M3). I don’t think 24 bit MAC would help in my cases. I don’t go anywhere near division or modulo, though …. 😬 (other than to verify that things work).

Bitwise operations (which I use a lot in the linear algebra code) seem good on both, yes. The lack of double precision floats is mildly tiresome, though (but I suppose it’s simply a reflection of what Apple think is important).

I’d certainly be interested to see any results you get. Thanks for taking the time to reply, anyway.

Apple Silicon GPU performance

macrumors 603

macrumors 68020

macrumors 604

macrumors Core

macrumors Core

macrumors 603

macrumors G4

macrumors G4

macrumors 68000

macrumors 6502a

macrumors 68040

macrumors 68000

macrumors 65816

macrumors Core

macrumors 68020

macrumors 68000

macrumors Core

macrumors 65816

macrumors 68030

macrumors Core

macrumors 604

macrumors Core

macrumors member

macrumors Core

macrumors member

Our Staff