I came across this thread by chance and have enjoyed reading it. I realise it’s a few months old but would like to offer a couple of observations and examples that I hope may be of interest.
As several people have said, GPU performance, and even comparing the relative performance of GPUs, is not straightforward. But in my modest experience of GPU programming, on and off, over the last decade or so, the overriding factor in achieving good performance (and realising a GPU's potential) is keeping it supplied with work. Doing so is constrained, not least, by the cost of transferring data to/from the CPU and by the cost of the GPU reading data from main memory (its own or shared). By implication, making effective use of the GPU's fast on-chip memory (the cache, and the explicitly managed shared memory available to a workgroup) can make a big difference, as can memory access contention (across cache lines, for instance) and latency.
Note: the different GPU manufacturers use different terms for what is often, or essentially, the same thing. For example, threadgroup (Apple), thread block (NVIDIA) and workgroup (AMD / OpenCL). I use ALU to refer to an individual execution unit that executes one thread at a given moment in time.
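To make the shared-memory point concrete: the classic trick is for a workgroup to stage a tile of the operands in fast local memory and reuse it many times before going back to main memory. A toy CPU-side sketch of the same tiling idea (Python here just for brevity; a real kernel would be Metal/CUDA/OpenCL, with the staged tile living in threadgroup/shared memory):

```python
# Toy illustration of tiled matrix multiply. The "tile" of B is loaded
# once per block and reused across a block of rows, mimicking how a GPU
# workgroup stages data in shared/threadgroup memory before computing.

TILE = 2  # tile edge; must divide the matrix size in this toy version

def matmul_naive(A, B):
    # Straightforward triple loop: every element of B is re-read n times.
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matmul_tiled(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, TILE):
        for jj in range(0, n, TILE):
            for kk in range(0, n, TILE):
                # "Stage" a TILE x TILE block of B -- the shared-memory
                # analogue: fetched from "main memory" once, reused below.
                tile = [[B[k][j] for j in range(jj, jj + TILE)]
                        for k in range(kk, kk + TILE)]
                for i in range(ii, ii + TILE):
                    for k in range(kk, kk + TILE):
                        a = A[i][k]
                        for j in range(jj, jj + TILE):
                            C[i][j] += a * tile[k - kk][j - jj]
    return C
```

Both functions compute the same product; the tiled version simply reorders the work so each staged block is reused heavily, which is exactly what pays off on a GPU (how much it pays off varies a lot between GPUs, as per my point 3 below).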
The first decent GPU I used was an AMD FirePro D700 in the 2013 Mac Pro. More recently, an AMD Radeon Pro 5500M in the 2019 MacBook Pro, an NVIDIA Ampere in the Jetson AGX Orin Developer Kit and the M3 Max in a new MacBook Pro. The D700 and NVIDIA both have 2,048 ALUs, the M3 Max 5,120. Here are a few benchmark numbers.
Metal (Geekbench 6):
D700 33,713
5500M 37,812
M3 Max 156,963 (x4.2 5500M, x4.7 D700)
fp32 TFLOPS:
Jetson 5.3
M3 Max 14.2 (x2.7 Jetson)
(NB. I fully accept that comparing GPU benchmarks is fraught with difficulties; at best, they should be taken only as a very rough guide.)
To illustrate the relative performance, here is how much quicker the M3 Max executes two of my GPU applications (doing a particular form of linear algebra and quite large modular arithmetic, respectively) relative to the others. The M3 uses Metal in both cases; the D700 and 5500M OpenCL; and the Jetson CUDA.
Linear algebra:
M3 vs D700 x4.3 faster
M3 vs 5500M x4.9 faster
Modular arithmetic:
M3 vs 5500M x4.6 faster
M3 vs Jetson x2.3 faster
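For anyone curious what "quite large modular arithmetic" means in practice, think of something with the flavour of modular exponentiation on large unsigned integers. This is only a host-side toy to show the shape of the work, not my actual kernel (which operates on multi-word limb arrays, one value per GPU thread):

```python
# Toy sketch of the workload class: square-and-multiply modular
# exponentiation. Python's bignums hide the multi-word (limb)
# arithmetic that a GPU kernel has to do explicitly.

def mod_pow(base, exp, mod):
    result = 1
    base %= mod
    while exp:
        if exp & 1:                      # low bit set: multiply in base
            result = (result * base) % mod
        base = (base * base) % mod       # square for the next bit
        exp >>= 1
    return result
```

On the GPU, it is the long-integer multiplies and reductions inside that loop that dominate, which is why integer throughput (rather than fp32) is what matters for this workload; see point 2 below.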
Some comments and a conclusion:
1. The relative performance I have achieved with the M3 and the older GPUs in these applications broadly matches the various benchmark figures.
2. One slight exception is that the M3 achieves a somewhat smaller gain over the NVIDIA GPU than the fp32 figures might lead one to expect (this may well be because this particular workload, which is predominantly long unsigned-integer arithmetic, is rather better suited to the NVIDIA).
3. It is probably stating the obvious to say that an optimisation that works for one GPU will not necessarily suit another (particularly with respect to memory accesses). For example, one attempted optimisation that failed miserably on the D700 worked quite well on the M3.
4. ALU for ALU, the M3 Max and Ampere GPUs are actually quite similar (in the Jetson, the GPU frequency is 1.3GHz).
5. The 5500M does less well, in both cases, than the Metal benchmarks suggest it might. I wouldn’t be surprised if this is because of the OpenCL drivers (OpenCL is what I originally used on the D700 and then reused with the 5500M).
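On point 4, the numbers are consistent with the usual peak fp32 estimate of ALUs x 2 (an FMA counts as two FLOPs) x clock:

```python
# Peak fp32 estimate: ALUs x 2 (FMA = two ops) x clock (GHz) -> TFLOPS.

def peak_tflops(alus, clock_ghz):
    return alus * 2 * clock_ghz / 1000.0

print(peak_tflops(2048, 1.3))        # Jetson Orin: ~5.3 TFLOPS, as above

# Working backwards for the M3 Max from the 14.2 TFLOPS figure:
m3_clock_ghz = 14.2e12 / (5120 * 2) / 1e9
print(m3_clock_ghz)                  # ~1.39 GHz implied GPU clock
```

So the two designs really do come out at a similar per-ALU clock and throughput, which is what I meant by "ALU for ALU, quite similar".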
Overall, I am quite impressed by the M3’s GPU. Of course, given the benchmarks, it’s not a surprise that it is so much faster than these older GPUs, although I think the “little” NVIDIA GPU in the Jetson machine is also quite impressive. For sure, the “bigger” NVIDIA GPUs have much greater performance but, for a mobile device which consumes so much less power, the M3 Max is quite something, IMHO (YMMV). An M4 Ultra would be quite a tempting proposition ….
PS. I appreciate that these workloads may not be of great interest to most of you. When I have time, I may try doing a neural net training comparison between the M3 and the Jetson.