
macman4789

macrumors 6502
Original poster
Jul 12, 2007
320
22
Hi,

I wondered if someone out there more knowledgeable than me could answer a question I have about the GPU performance of Apple silicon.

I’ve had a look around and it’s common knowledge that ‘on paper’ the Apple Silicon GPUs are less powerful than offerings from NVidia and AMD. If I remember rightly, the likes of the M2 (possibly M2 Pro?) performed similarly to a 1080 Ti? When measured in TFLOPS, I think?

However, when comparing recent Apple Silicon machines on certain tasks, such as flicking through RAW images in LR, the responsiveness of brushes in LR, or moving exposure sliders, they seem to be quicker and more responsive than laptops with, say, an NVidia 3080 from what I have seen. My question is: how/why is this? With so much more raw power ‘on paper’ than Apple’s GPUs, why are the NVidia machines less performant in certain tasks? Is it simply better software optimisation? Surely optimisation would be better on a platform that has been around much longer than ARM-based software from 2020 onwards? I’m using LR as an example but I’m sure there are other examples.

Can anyone explain?

Thanks
 

crazy dave

macrumors 65816
Sep 9, 2010
1,299
997
@leman can answer more thoroughly but to briefly cover a couple of basic reasons why:

1) In laptops, especially smaller ones like 14", big power-hungry GPUs tend to throttle and aren't always able to hit their maximum performance (sometimes they can in "boost" mode, at the cost of decibels and battery life, but even then not always). Such GPUs are often thermally limited and can't sustain their paper specs.

2) Apple's mobile heritage:

A) For normal rasterization workloads (outputting pixels calculated with the standard graphics pipeline), Apple's GPUs have a major advantage that comes from their ImgTech lineage (the company whose GPUs Apple used to use and whose ray tracing tech Apple almost certainly licenses): tile-based deferred rendering, or TBDR. Briefly, this technique lets the GPU work out what the user is actually going to see on screen before shading it, so it doesn't waste memory bandwidth or computation rendering objects that are never going to be visible.

B) Apple also has separate 16-bit floating point (FP16) pipes, which are more efficient (especially in memory bandwidth) than FP32 (which in turn gives more numerical precision, but that precision is not always needed) when they can be used (appropriateness depends on the workload). Other GPUs have different ways of issuing FP16 work. TFLOPS are typically measured as the number of FP32 fused multiply-add operations the GPU can perform per second, so that one number may not be the best performance metric for every workload and may mean slightly (or very) different things for different GPU designs.

C) Unified memory can also be helpful. Apple has pretty good memory bandwidth relative to its TFLOPS (this also helps in CPU workloads) and, for some models, large amounts of graphics-accessible RAM relative to what you get in the PC space, especially for the GPU size. So for workloads that are memory bound (requiring lots of trips back and forth between the GPU and RAM, such that the GPU's compute units are often waiting around) or that simply need lots of memory, Apple's GPUs can be more performant than expected. Appropriately programmed, the GPU and CPU can also share resources, avoiding expensive copies over PCIe when both need to work on the same data (see the sketch below). Again, unified memory of this kind is more common in the mobile space than in the traditional PC market.
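
As a concrete (if simplified) illustration of that last point, here is a minimal sketch assuming Metal on an Apple Silicon Mac; the sizes and values are made up for the example:

```swift
import Metal

// Minimal sketch, assuming an Apple Silicon Mac with a default Metal device.
// With .storageModeShared (the normal case on Apple Silicon) the CPU and GPU
// see the same physical memory, so there is no upload to "VRAM" at all.
let device = MTLCreateSystemDefaultDevice()!

let pixelCount = 48_000_000   // e.g. a 48 MP image, 4 bytes per pixel (assumption)
let buffer = device.makeBuffer(length: pixelCount * 4,
                               options: .storageModeShared)!

// CPU-side write, directly into the memory the GPU will later read:
let pixels = buffer.contents().bindMemory(to: UInt32.self, capacity: pixelCount)
pixels[0] = 0xFF00_00FF

// Later, a compute or render encoder binds the very same buffer with no copy:
// encoder.setBuffer(buffer, offset: 0, index: 0)
```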

I’ve had a look around and it’s common knowledge that ‘on paper’ the Apple Silicon GPUs are less powerful than offerings from NVidia and AMD. If I remember rightly, the likes of the M2 (possibly M2 Pro?) performed similarly to a 1080 Ti? When measured in TFLOPS, I think?

An M2 Pro actually has fewer TFLOPS (~7) than a 1080 Ti (~11). But yes, overall the M2 Pro will "punch above its weight" for many graphics-related tasks.
 

casperes1996

macrumors 604
Jan 26, 2014
7,485
5,650
Horsens, Denmark
GPU performance is also a complicated matter. There are fixed-function units, programmable shader execution units, memory transfers and so much more to consider, even at just a fairly high level. Performance superiority in one area does not necessarily guarantee a win in another.
 

leman

macrumors Core
Oct 14, 2008
19,302
19,284
However, when comparing recent Apple Silicon machines on certain tasks, such as flicking through RAW images in LR, the responsiveness of brushes in LR, or moving exposure sliders, they seem to be quicker and more responsive than laptops with, say, an NVidia 3080 from what I have seen. My question is: how/why is this?

What you are asking about is perceived performance (responsiveness) in one specific type of application: image and video editing. The crucial point here is that these are one-off, user-initiated tasks. Most GPUs are really optimized to run heavy workloads continuously (like games); the applications you mention ask for burst performance instead.

I am not aware of any detailed analysis that would really answer this question. It could be that the unified memory architecture of Apple Silicon allows it to do these tasks with lower latency. It is also possible that Apple does some sort of additional latency optimization; we do know that this is a big focus area for them. Or maybe Apple is just better at shifting frequencies and going into high-performance mode (although in my testing this takes a few ms on Apple hardware). I think overall it is likely that Apple Silicon is responsive in content creation tasks because these are traditionally strong areas for the Mac and Apple designed the hardware with these applications in mind.

B) Apple also has separate 16-bit floating point (FP16) pipes, which are more efficient (especially in memory bandwidth) than FP32 (which in turn gives more numerical precision, but that precision is not always needed) when they can be used (appropriateness depends on the workload). Other GPUs have different ways of issuing FP16 work. TFLOPS are typically measured as the number of FP32 fused multiply-add operations the GPU can perform per second, so that one number may not be the best performance metric for every workload and may mean slightly (or very) different things for different GPU designs.

@crazy dave did a great job summarizing what is unique about Apple GPUs, so just a quick note about FP16 from me. While it is true that Apple uses dedicated FP16 pipes, they cannot run FP16 operations any faster than full-precision operations. Some other architectures have that ability. I do not think it is very likely that FP16 gives Apple a performance edge in content creation software.

GPU performance is also a complicated matter. There are fixed-function units, programmable shader execution units, memory transfers and so much more to consider, even at just a fairly high level. Performance superiority in one area does not necessarily guarantee a win in another.

Amen to that.
 

casperes1996

macrumors 604
Jan 26, 2014
7,485
5,650
Horsens, Denmark
For the image editing case, I think the UMA CPU->GPU->CPU latency is the biggest win, if I were to guess. My assumption is that the time it takes to copy the image data to GPU VRAM and back once processing is completed on the GPU outweighs the actual GPU compute time on most high-end Nvidia GPUs. Pure guesswork though; I haven't attempted to measure it.
 

leman

macrumors Core
Oct 14, 2008
19,302
19,284
For the image editing case, I think the UMA CPU->GPU->CPU latency is the biggest win, if I were to guess. My assumption is that the time it takes to copy the image data to GPU VRAM and back once processing is completed on the GPU outweighs the actual GPU compute time on most high-end Nvidia GPUs. Pure guesswork though; I haven't attempted to measure it.

That was my initial thought as well, but then I looked at PCIe latency and bandwidth numbers and now I am not sure whether the incurred delays are high enough to be noticeable in practice. And we do know that the latency of initializing operations on Apple GPUs can actually be substantially higher than on Nvidia. This is what makes me suspect some sort of power/clock optimization is going on. Maybe it's not the latency of the PCIe bus transfer so much as the latency of changing the bus power state? I don't know much about these things though.
 

quarkysg

macrumors 65816
Oct 12, 2019
1,233
823
A quick calculation for a 48 MP photo (my napkin math: 48M pixels x 4 bytes) comes to around 180 MiB.

Assuming a PCIe 4.0 x16 GPU card (around 32 GiB/s transfer rate, and that's being generous since it ignores protocol overhead and setup latency), a one-way transfer from system RAM to GPU VRAM will take about 5 ms, and the copy back from VRAM to system RAM will take another 5 ms. We can probably consider the GPU processing time itself to be negligible.

The CPU then needs to process the image for display, which likely means sending it to the GPU again via PCIe to be rasterized onto the display.

On top of that, the Apple Silicon SoC's memory bandwidth, even for the base M1, is already around 60 GiB/s available to the CPU. The Pro, Max and Ultra variants have higher bandwidth still.

A 60 Hz display shows a new frame about every 17 ms, so a 10+ ms transfer is almost the duration of an entire 60 Hz frame. Higher refresh rate displays will make this even more apparent, I think.

So I think the above likely contributes to the perceived responsiveness of the Lightroom experience on Apple Silicon.
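
If anyone wants to play with those numbers, here is the same napkin math as a quick sketch (assuming uncompressed 4-byte pixels and an idealized 32 GiB/s link with zero protocol overhead):

```swift
// Napkin math from the post above. Assumptions: 48 MP, 4 bytes/pixel,
// an idealized 32 GiB/s PCIe 4.0 x16 link, and negligible GPU compute time.
let imageBytes = 48_000_000.0 * 4.0                  // ~192 MB (~183 MiB)
let pcieBytesPerSec = 32.0 * 1024 * 1024 * 1024      // ~32 GiB/s
let oneWayMs = imageBytes / pcieBytesPerSec * 1000   // ~5.6 ms each way
let roundTripMs = oneWayMs * 2                       // ~11 ms there and back
let frameBudgetMs = 1000.0 / 60.0                    // ~16.7 ms per 60 Hz frame

print(roundTripMs, "ms of transfer vs a", frameBudgetMs, "ms frame budget")
```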
 

sunny5

Suspended
Jun 11, 2021
1,712
1,581
For photography software, single-core CPU performance is way more important unless you are exporting, which still requires more CPU power than GPU power. LR utilizes all CPU cores while C1P does not, which has been proven.
 

leman

macrumors Core
Oct 14, 2008
19,302
19,284
The CPU then needs to process the image for display, which likely means sending it to the GPU again via PCIe to be rasterized onto the display.

Why do you assume this? On a desktop system the GPU is directly responsible for the video output. On a laptop, the system might need to send the data back to the iGPU for video output and there might be additional overhead in managing the textures, true.
 

quarkysg

macrumors 65816
Oct 12, 2019
1,233
823
Why do you assume this? On a desktop system the GPU is directly responsible for the video output. On a laptop, the system might need to send the data back to the iGPU for video output and there might be additional overhead in managing the textures, true.
Well, once you get the processed image back from the GPU, it needs, among other tasks, to be resized by the CPU for display, which means manipulating the full image data and then re-sampling it into the video buffer, which is likely in GPU VRAM.
 

leman

macrumors Core
Oct 14, 2008
19,302
19,284
Well, once you get the processed image back from the GPU, it needs, among other tasks, to be resized by the CPU for display, which means manipulating the full image data and then re-sampling it into the video buffer, which is likely in GPU VRAM.

Why would one resize the image on the CPU? User interface rendering has been done on the GPU in all major systems for many years now.
 

quarkysg

macrumors 65816
Oct 12, 2019
1,233
823
Why would one resize the image on the CPU? User interface rendering has been done on the GPU in all major systems for many years now.
Wouldn’t the image then need to be sent back to the GPU for resizing, after the CPU decided which part of the image needed to be displayed? My point is that the image will need to be shuttled back and forth between system RAM and GPU VRAM.
 

leman

macrumors Core
Oct 14, 2008
19,302
19,284
Wouldn’t the image then need to be sent back to the GPU for resizing, after the CPU decided which part of the image needed to be displayed? My point is that the image will need to be shuttled back and forth between system RAM and GPU VRAM.

The CPU usually doesn't need to examine the image data to select the portion to be displayed. The application running on the CPU will simply send the coordinate region to the GPU for the UI render.
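
As a hypothetical sketch of that idea (not how Lightroom actually works internally): the CPU only computes a normalized rectangle and hands those few floats to the GPU, while the pixel data itself stays GPU-resident.

```swift
import CoreGraphics

// Hypothetical sketch: the CPU works out *which* region of an already
// GPU-resident image should be visible and passes only these four floats
// (e.g. as shader uniforms / texture coordinates) to the GPU.
struct CropUV { var u0, v0, u1, v1: Float }

func normalizedCrop(visible: CGRect, imageSize: CGSize) -> CropUV {
    CropUV(u0: Float(visible.minX / imageSize.width),
           v0: Float(visible.minY / imageSize.height),
           u1: Float(visible.maxX / imageSize.width),
           v1: Float(visible.maxY / imageSize.height))
}

// A 2000x1500 viewport into an 8000x6000 image, starting at (1000, 500):
let uv = normalizedCrop(visible: CGRect(x: 1000, y: 500, width: 2000, height: 1500),
                        imageSize: CGSize(width: 8000, height: 6000))
// uv = (0.125, 0.083, 0.375, 0.333); no pixel data crosses any bus.
```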
 

splitpea

macrumors 65816
Oct 21, 2009
1,137
397
Among the starlings
C) Unified memory can also be helpful. Apple has pretty good memory bandwidth relative to its TFLOPS (this also helps in CPU workloads) and, for some models, large amounts of graphics-accessible RAM relative to what you get in the PC space, especially for the GPU size. So for workloads that are memory bound (requiring lots of trips back and forth between the GPU and RAM, such that the GPU's compute units are often waiting around) or that simply need lots of memory, Apple's GPUs can be more performant than expected. Appropriately programmed, the GPU and CPU can also share resources, avoiding expensive copies over PCIe when both need to work on the same data.

Thank you for this explanation! The assertion that unified memory decreases rather than increases total memory needs finally makes sense to me.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,509
945
The assertion that unified memory decreases rather than increases total memory needs finally makes sense to me.
I may be wrong, but I thought that programs still need to copy from CPU memory to GPU memory, even though they are side by side, because the GPU can't read CPU memory. Am I wrong?
 

leman

macrumors Core
Oct 14, 2008
19,302
19,284
I may be wrong, but I thought that programs still need to copy from CPU memory to GPU memory, even though they are side by side, because the GPU can't read CPU memory. Am I wrong?

On Apple Silicon? Anything with access to DRAM can pretty much read anything in DRAM, for all practical intents and purposes. However, while the GPU can read CPU memory, it is not always straightforward to map CPU memory into GPU memory. This has to do with the memory ownership model and the way the GPU driver works. That said, you can make pretty much any CPU memory page visible to the GPU.
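
For the curious, a hedged sketch of what making an existing CPU allocation visible to the GPU can look like via Metal's bytesNoCopy path; the sizes are made up, and the page-alignment requirement is the important part:

```swift
import Metal
import Darwin

// Sketch, assuming Apple Silicon: wrap an existing, page-aligned CPU allocation
// in an MTLBuffer without copying it. The GPU then reads the same physical pages.
let device = MTLCreateSystemDefaultDevice()!
let pageSize = Int(getpagesize())
let length = 16 * pageSize

var raw: UnsafeMutableRawPointer?
precondition(posix_memalign(&raw, pageSize, length) == 0)  // page-aligned, as bytesNoCopy requires

let buffer = device.makeBuffer(bytesNoCopy: raw!,
                               length: length,
                               options: .storageModeShared,
                               deallocator: { ptr, _ in free(ptr) })!
// `buffer.contents()` and the original pointer refer to the same memory.
```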
 

throAU

macrumors G3
Feb 13, 2012
8,944
7,106
Perth, Western Australia
I’ve had a look around and it’s common knowledge that ‘on paper’ the Apple Silicon GPUs are less powerful than offerings from NVidia and AMD. If I remember rightly, the likes of the M2 (possibly M2 Pro?) performed similarly to a 1080 Ti? When measured in TFLOPS, I think?

It depends very much on the workload that the GPU is working on.

TFLOPS etc. aren't the be-all and end-all; memory bandwidth, shader count, ROP count and so on all matter for different things.

I can confirm that my 14" M1 Pro runs Baldur's Gate 3 pretty acceptably given the power consumption, noise, etc.

So... figure out what workloads you care about and find some benchmarks...



In terms of optimization, Apple has an edge here because the same company, and maybe even closely aligned teams, are working on both the software side and the hardware side.

In PC land you have the OS vendor separate from the device manufacturer and possibly teams on both of those sides working towards a standard from a third party.

Apple is the one company responsible for the hardware, the OS and the drivers for that hardware. They can also build exactly the hardware they want to run their APIs, with full knowledge of both how their API currently works in practice and any future plans...
 

Homy

macrumors 68020
Jan 14, 2006
2,137
1,994
Sweden
Hi,

I wondered if someone out there more knowledgeable than me could answer a question I have about the GPU performance of Apple silicon.

I’ve had a look around and it’s common knowledge that ‘on paper’ the Apple Silicon GPUs are less powerful than offerings from NVidia and AMD. If I remember rightly, the likes of the M2 (possibly M2 Pro?) performed similarly to a 1080 Ti? When measured in TFLOPS, I think?

However, when comparing recent Apple Silicon machines on certain tasks, such as flicking through RAW images in LR, the responsiveness of brushes in LR, or moving exposure sliders, they seem to be quicker and more responsive than laptops with, say, an NVidia 3080 from what I have seen. My question is: how/why is this? With so much more raw power ‘on paper’ than Apple’s GPUs, why are the NVidia machines less performant in certain tasks? Is it simply better software optimisation? Surely optimisation would be better on a platform that has been around much longer than ARM-based software from 2020 onwards? I’m using LR as an example but I’m sure there are other examples.

Can anyone explain?

Thanks

It depends on many things, as people have already explained, but that's the way it is. Under the right circumstances an M3 Max or M2 Ultra GPU can be twice as fast as a 4090.

 

leifp

macrumors 6502
Feb 8, 2008
367
355
Canada
Also: TFLOPs are only comparable within a system architecture. For example 10 TFLOPs on an nVidia RTX 40x0 are not directly comparable to 10 TFLOPs on an AMD RX 7x00. It has to do with the efficiency of a given architecture.

To explain a bit: a Trillion Floating point OPerations per second depends on what those operations actually are… CAVEAT: I am making the next part up as I do not know the actual efficiency per architecture. The numbers I use aren’t even a “best guess”.

Calculating a game scene with an AMD GPU may require a million operations. That same scene might require only eight hundred thousand operations on an nVidia GPU. Thus if you see 10 TFLOPs from AMD vs 9 TFLOPs from nVidia, on the surface the AMD card seems more powerful. In this case, however, the nVidia wins: 10M “scenes” per second vs 11.25M (it’s a stupid example but I wanted to show the maths with potential TFLOPs values). Those who have more than my superficial knowledge of TFLOPs, feel free to correct my understanding…
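
For completeness, the maths in that (deliberately made-up) example as a tiny sketch:

```swift
// Using the made-up numbers from the post above: throughput = TFLOPS / ops-per-scene.
func scenesPerSecond(tflops: Double, opsPerScene: Double) -> Double {
    tflops * 1e12 / opsPerScene
}

print(scenesPerSecond(tflops: 10, opsPerScene: 1_000_000))  // AMD:    10.00 million scenes/s
print(scenesPerSecond(tflops: 9,  opsPerScene: 800_000))    // nVidia: 11.25 million scenes/s
```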
 

Gelam

macrumors regular
Aug 31, 2021
149
50
@leman can answer more thoroughly but to briefly cover a couple of basic reasons why:

1) In laptops, especially smaller ones like 14", big power-hungry GPUs tend to throttle and aren't always able to hit their maximum performance (sometimes they can in "boost" mode, at the cost of decibels and battery life, but even then not always). Such GPUs are often thermally limited and can't sustain their paper specs.

2) Apple's mobile heritage:

A) For normal rasterization workloads (outputting pixels calculated with the standard graphics pipeline), Apple's GPUs have a major advantage that comes from their ImgTech lineage (the company whose GPUs Apple used to use and whose ray tracing tech Apple almost certainly licenses): tile-based deferred rendering, or TBDR. Briefly, this technique lets the GPU work out what the user is actually going to see on screen before shading it, so it doesn't waste memory bandwidth or computation rendering objects that are never going to be visible.

B) Apple also has separate 16-bit floating point (FP16) pipes, which are more efficient (especially in memory bandwidth) than FP32 (which in turn gives more numerical precision, but that precision is not always needed) when they can be used (appropriateness depends on the workload). Other GPUs have different ways of issuing FP16 work. TFLOPS are typically measured as the number of FP32 fused multiply-add operations the GPU can perform per second, so that one number may not be the best performance metric for every workload and may mean slightly (or very) different things for different GPU designs.

C) Unified memory can also be helpful. Apple has pretty good memory bandwidth relative to its TFLOPS (this also helps in CPU workloads) and, for some models, large amounts of graphics-accessible RAM relative to what you get in the PC space, especially for the GPU size. So for workloads that are memory bound (requiring lots of trips back and forth between the GPU and RAM, such that the GPU's compute units are often waiting around) or that simply need lots of memory, Apple's GPUs can be more performant than expected. Appropriately programmed, the GPU and CPU can also share resources, avoiding expensive copies over PCIe when both need to work on the same data. Again, unified memory of this kind is more common in the mobile space than in the traditional PC market.



An M2 Pro actually has fewer TFLOPS (~7) than a 1080 Ti (~11). But yes, overall the M2 Pro will "punch above its weight" for many graphics-related tasks.
Interesting! Which ray tracing tech does Apple end up using then? And how good/efficient/etc is it?
 

crazy dave

macrumors 65816
Sep 9, 2010
1,299
997
Thank you for this explanation! The assertion that unified memory decreases rather than increases total memory needs finally makes sense to me.
Welcome!
I may be wrong, but I thought that programs still need to copy from CPU memory to GPU memory, even though they are side by side, because the GPU can't read CPU memory. Am I wrong?
Yeah, as @leman said there are some caveats, but you can share data between them without copies. However, some PC iGPUs did (do?) in fact work exactly as you describe: they technically share a memory system, but that memory was hard-partitioned between the CPU and GPU such that, for say 16GB of RAM, each side only had access to 8GB and data had to be copied to be shared. That said, copies between the partitions were relatively fast; you still had to do one, but at least it wasn't going over PCIe. A further problem with these systems was that the memory often sat behind a standard DDR/LPDDR memory controller, meaning it was bandwidth limited, though the iGPU was often so small and ineffectual that this wasn't always the bottleneck.

Consoles (PlayStation/Xbox with AMD-based SoCs), however, I believe operate like Apple's SoCs, as do other mobile phone systems. Intel's plan for Lunar Lake (their upcoming mobile SoC later this year) is to have on-package RAM like Apple, but I don't think their iGPU plans are overall as ambitious as Apple's yet - i.e. not scaling out to Max size.

Also: TFLOPs are only comparable within a system architecture. For example 10 TFLOPs on an nVidia RTX 40x0 are not directly comparable to 10 TFLOPs on an AMD RX 7x00. It has to do with the efficiency of a given architecture.

To explain a bit: a Trillion Floating point OPerations per second depends on what those operations actually are… CAVEAT: I am making the next part up as I do not know the actual efficiency per architecture. The numbers I use aren’t even a “best guess”.

Calculating a game scene with an AMD GPU may require a million operations. That same scene might require only eight hundred thousand operations on an nVidia GPU. Thus if you see 10 TFLOPs from AMD vs 9 TFLOPs from nVidia, on the surface the AMD card seems more powerful. In this case, however, the nVidia wins: 10M “scenes” per second vs 11.25M (it’s a stupid example but I wanted to show the maths with potential TFLOPs values). Those who have more than my superficial knowledge of TFLOPs, feel free to correct my understanding…

It's even worse than that. As I was trying to get at with my FP16 discussion, and as @casperes1996 and @leman expounded on further, things get complicated. They start off really simple though: to get the theoretical TFLOPS of a GPU, you just sum up the number of FP32 units (execution units per core x number of cores), multiply by 2 because each can do a fused multiply-add per cycle, multiply by the clock speed, and voila (sketched below). That's it. Pretty simple metric. How that actually translates to performance, though, outside of simple FP calculations that don't interact much with the memory system or fixed-function hardware or other Int/FP16/tensor core/ray tracing core/whatever crazy **** they've added, gets complicated.

For instance, some architectures perform FP16 operations using a dedicated FP16 unit (Apple), some use their FP32 units (Nvidia), some use a single FP32 unit that can perform two FP16 calculations simultaneously (AMD/ARM) if two are available within a thread (instruction-level parallelism, ILP, within a single GPU thread, just like how Apple's M1 and M2 CPU cores can potentially do 8 instructions simultaneously within a thread), and some can perform two FP32 operations that way (AMD and Nvidia ... I think), but again, if there isn't such ILP to be found in the code, their practical TFLOPS gets halved. Different architectures make different tradeoffs depending on their energy targets and simply what they think the code run on them is going to look like. And all of that's before you get into the hardware to turn these calculations into pixels (Apple doesn't even have ROPs) or more esoteric things like tensor cores and ray tracing cores. Nor have I dug into how that theoretical TFLOPS number interacts with the cache system and memory hierarchy, which can be absolutely critical for performance.

EDIT: And I didn't even mention differing APIs or drivers, where a lot of the "secret sauce" for performance hides out (or the fact that games, especially AAA ones with custom engines, deliberately break graphics APIs for performance, and day-0 driver updates are meant to account for this). For some types of pure compute workloads, despite everything I just said, theoretical TFLOPS can actually be a shockingly okay metric if two architectures are sufficiently similar, but yeah ...
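
Here is that formula as a quick sketch. The M2 Pro figures are commonly cited estimates (19 GPU cores, 128 FP32 lanes per core, roughly 1.4 GHz), not official Apple specs:

```swift
// Theoretical TFLOPS = FP32 lanes x 2 (fused multiply-add) x clock.
// The M2 Pro numbers below are assumptions for illustration only.
func theoreticalTFLOPS(cores: Int, fp32LanesPerCore: Int, clockGHz: Double) -> Double {
    Double(cores * fp32LanesPerCore) * 2 * clockGHz / 1000  // GFLOPS -> TFLOPS
}

print(theoreticalTFLOPS(cores: 19, fp32LanesPerCore: 128, clockGHz: 1.4))  // ≈ 6.8
```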

Interesting! Which ray tracing tech does Apple end up using then? And how good/efficient/etc is it?

When Apple and their long-time GPU supplier Imagination Technologies (ImgTech) split, it was a little acrimonious, with ImgTech suing Apple because Apple wanted to stop paying ImgTech royalties, and ImgTech alleging that Apple was still fundamentally using ImgTech's patented TBDR system and related GPU tech. Previously, under financial strain, ImgTech had wanted Apple to buy them, but Apple had been hiring away a lot of their people (and engineers from Nvidia/AMD/Intel) to form its own GPU division and decided against it (also because ImgTech has/had a few other customers despite Apple being far and away their biggest, regulators might've blocked it, you never know). Then suddenly a few years ago they not only buried the hatchet but did a brand new licensing deal, and rumors were that ImgTech had developed a new ray tracing system that Apple wanted. Sure enough, patents confirmed ImgTech had such a thing. A couple of years later, Apple debuted the M3 and A17 Pro with ray tracing cores. While I don't know if anyone has absolutely confirmed that they are the same system (I think the terms of the deal were secret), that probably wouldn't be hard to do: you could match Apple's developer video against ImgTech's patents, and just given the timing of everything, well, it seems really likely.

ImgTech makes claims about how wonderful their ray tracing cores are, but I'm not an expert here, and in very broad strokes their approach seems similar to Nvidia's. The devil is often in the details though. Performance-wise, based on things like Blender, the ray tracing ability seems good, very comparable. Efficiency is where I've seen ImgTech make its claims, but I don't know of any direct perf/W measurements comparing Apple and Nvidia ray tracing (ARM and Qualcomm have their own systems too; again, I believe they're all broadly similar). No doubt, like Nvidia, there will be further iterations and improvements over time.
 

casperes1996

macrumors 604
Jan 26, 2014
7,485
5,650
Horsens, Denmark
A practical example of a device family with greater theoretical TFLOPS numbers than practical graphics performance is AMD's Vega. The big Vegas were very good compute cards with excellent raw numbers, but for graphics work they are often beaten by much weaker (FLOPS-wise) RDNA-based GPUs.
 

leman

macrumors Core
Oct 14, 2008
19,302
19,284
ImgTech makes claims about how wonderful their ray tracing cores are, but I'm not an expert here, and in very broad strokes their approach seems similar to Nvidia's. The devil is often in the details though. Performance-wise, based on things like Blender, the ray tracing ability seems good, very comparable. Efficiency is where I've seen ImgTech make its claims, but I don't know of any direct perf/W measurements comparing Apple and Nvidia ray tracing (ARM and Qualcomm have their own systems too; again, I believe they're all broadly similar). No doubt, like Nvidia, there will be further iterations and improvements over time.

The main similarity between Apple's and Nvidia's solutions is that both use a wide tree to store the spatial index, which helps them minimize the number of node traversals required to locate a specific primitive. AMD's hardware is more limited, for example, and has to do more checks.

I do not know much about the inner workings of Nvidia's solution. Apple's RT is well described in their patents. They use a graph traversal coprocessor (integrated with the GPU cores) to asynchronously locate the primitives and then send this data back to the shader program. What is special about Apple's solution is that it uses limited-precision conservative arithmetic, which reduces the area and energy cost.
 