
l0stl0rd

macrumors 6502
Jul 25, 2009
483
420
I never expected Apple to beat the 3080 in compute, as the 3080 does:
FP16 (half) performance: 29.77 TFLOPS; FP32 (float) performance: 29.77 TFLOPS

Apple said in their keynote that the 32 cores do 10.4 TFLOPS.

It is also about as good as the desktop 5700 XT, which makes sense as that is 9.4 TFLOPS, or the Vega 56, which is 10.5 TFLOPS.

If we talk Nvidia, it is in between the 3050 Ti (7.6 TFLOPS) and the 3060 (12.74 TFLOPS).

I still think it is very respectable for an integrated GPU.
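For anyone wondering where these rated figures come from, they are basically just ALU count x 2 (one FMA per clock) x boost clock. A rough sketch, assuming the commonly cited 8704 CUDA cores at ~1.71 GHz for the desktop 3080, and an estimated ~1.27 GHz for the M1 Max (Apple doesn't publish the clock):

Python:
# Theoretical FP32 throughput: ALUs x 2 ops/clock (one FMA) x clock in GHz
def fp32_tflops(alu_count, clock_ghz):
    return alu_count * 2 * clock_ghz / 1000  # TFLOPS

print(fp32_tflops(8704, 1.71))  # desktop RTX 3080 -> ~29.8 TFLOPS
print(fp32_tflops(4096, 1.27))  # M1 Max: 32 cores x 128 ALUs -> ~10.4 TFLOPS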
 
Last edited:

leman

macrumors Core
Oct 14, 2008
19,522
19,679
It'd be useful to have a plot showing the Geekbench compute score as a function of FLOPS, for Apple Silicon, AMD 6XXX and Nvidia 3XXX.
Maybe I'll do it when I have the time. The challenge is finding reliable results, excluding overclocked GPUs and such. I think the ranking page shows average scores rather than median scores, which is not ideal.

If you just plot everything, the layers will be clearly visible. The real issue is getting the FLOPS figures, since there are multiple variants of GPUs with different clocks etc. Not that it's a hard problem, but a tedious and boring one.
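Something along these lines would do it; a rough matplotlib sketch with only a few example entries filled in from scores quoted in this thread (a real version would need the full chart data):

Python:
import matplotlib.pyplot as plt

# (GB5 compute score, rated FP32 TFLOPS, vendor) -- example entries only,
# taken from numbers quoted in this thread; extend with real chart data
gpus = {
    "RTX 3060":           (96400, 12.74, "Nvidia"),
    "Radeon Pro Vega II": (81785, 14.09, "AMD"),
    "Radeon Pro Vega 56": (55927,  8.96, "AMD"),
    # "M1 Max 32-core":   (score, 10.4, "Apple"), etc.
}

colors = {"Nvidia": "green", "AMD": "red", "Apple": "grey"}
for name, (score, tflops, vendor) in gpus.items():
    plt.scatter(tflops, score, color=colors[vendor])
    plt.annotate(name, (tflops, score))

plt.xlabel("Rated FP32 TFLOPS")
plt.ylabel("Geekbench 5 compute score")
plt.xlim(left=0)
plt.ylim(bottom=0)
plt.show()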
 

Erasmus

macrumors 68030
Jun 22, 2006
2,756
300
Australia
I never expected Apple to beat the 3080 in compute, as the 3080 does:
FP16 (half) performance: 29.77 TFLOPS; FP32 (float) performance: 29.77 TFLOPS

Apple said in their keynote that the 32 cores do 10.4 TFLOPS.

It is also about as good as the desktop 5700 XT, which makes sense as that is 9.4 TFLOPS, or the Vega 56, which is 10.5 TFLOPS.

If we talk Nvidia, it is in between the 3050 Ti (7.6 TFLOPS) and the 3060 (12.74 TFLOPS).

I still think it is very respectable for an integrated GPU.
30 TFLOPS is the desktop RTX 3080. Apple are comparing against the mobile RTX 3080, which is nowhere near as fast as the desktop version. Then again, the desktop 3080 pulls 320W, twice that of the higher-TDP variants of the mobile 3080 that Apple is comparing the M1 Max to (and which Apple admits marginally beats the M1 Max).

Blame NVIDIA and their poor naming scheme and high level of product line fragmentation for calling a 320W monster, a 160W space heater and a 100W gimped version (plus various in-between) all the RTX 3080.
 

crazy dave

macrumors 65816
Sep 9, 2010
1,454
1,230
It'd be useful to have a plot showing the Geekbench compute score as a function of FLOPS, for Apple Silicon, AMD 6XXX and Nvidia 3XXX.
Maybe I'll do it when I have the time. The challenge is finding reliable results, excluding overclocked GPUs and such. I think the ranking page shows average scores rather than median scores, which is not ideal.


If you just plot everything, the layers will be clearly visible. The real issue is getting the FLOPS figures, since there are multiple variants of GPUs with different clocks etc. Not that it's a hard problem, but a tedious and boring one.

I’ve been looking at this. Unfortunately, some of the more recent desktop GPUs are too big to make a meaningful perf/TFLOP comparison with Apple’s; the laptop GPUs vary too much in terms of what they are actually rated at; different brands and different generations within a brand can have different scores wrt TFLOPS; and even beyond all this, the ranking page can sometimes show very odd results for the same GPU (even beyond seemingly different clocks). So a big graph just pulling in all the numbers might just produce noise.

Given all this, doing the best I can, I compared the 3090 to the 3060 Ti, and the 2070 Super to the 2060, in my last posts here and here. For the 2070 Super and 2060, GB scores increased linearly with TFlops for both OpenCL and CUDA. This does break down when going from the 3060 Ti to the 3090 (less so for CUDA), but the 3060 Ti desktop GPU already starts at a higher TFLOPs figure than the M1 Max.

For AMD, they get much lower absolute OpenCL scores per TFLOP compared to Nvidia, something @leman mentioned, but again the increase in perf between two AMD GPUs, the Pro Vega II and Pro Vega 56, is roughly linear wrt their stated TFLOPS, depending on which numbers you believe.

OpenCL
ATI Radeon Pro Vega II Compute Engine (14.09 TFLOPs): 81785
ATI Radeon Pro Vega 56 Compute Engine (8.96 TFLOPs): 55927

Ratio of perf increase to TFLOPs increase: 0.93

There were multiple entries for each, with ratios ranging from 0.85 (bad, but still better than the M1) to 0.97 (near perfect); this *seemed* the best average for OpenCL and dovetailed with the Metal scores below.


Metal
AMD Radeon Pro Vega II: 101649
AMD Radeon Pro Vega 56: 69229

Ratio of perf increase to TFLOPs increase: 0.93

So GB doesn't scale with TFlops for AMD GPUs quite as linearly as for Nvidia chips ... but still better than from the M1 to the M1 Max, and it even scales pretty linearly under Metal! Now, could the lack of linear scaling up to the M1 Max in GB wrt TFLOPs still be an interaction between GB, Metal, and the M1 Max? Yup. But that's what we're down to, beyond GPU throttling and/or lower peak clocks that high performance mode is meant to alleviate/raise.
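For anyone who wants to check the arithmetic, the 0.93 is just the score increase divided by the TFLOPs increase, using the numbers above:

Python:
# Scaling efficiency: how much of the TFLOPs increase shows up as score increase
def scaling_efficiency(score_hi, score_lo, tflops_hi, tflops_lo):
    return (score_hi / score_lo) / (tflops_hi / tflops_lo)

# OpenCL: Pro Vega II (14.09 TFLOPs) vs Pro Vega 56 (8.96 TFLOPs)
print(scaling_efficiency(81785, 55927, 14.09, 8.96))   # ~0.93
# Metal: same two GPUs
print(scaling_efficiency(101649, 69229, 14.09, 8.96))  # ~0.93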
 
Last edited:

jeanlain

macrumors 68020
Mar 14, 2009
2,463
958
If you just plot everything, the layers will be clearly visible. The real issue is getting the FLOPS figures, since there are multiple variants of GPUs with different clocks etc. Not that it's a hard problem, but a tedious and boring one.
I did a very rough plot. For AMD/Nvidia, I relied on the chart page. For Apple Silicon, I used several scores, including all the scores I could find for the M1 Pro and M1 Max. I used only OpenCL scores.
[attached plot: Geekbench OpenCL compute score vs. FLOPS]

It shows that Apple GPUs yield fewer points/FLOPS than the RTX 3000 series, and probably the Radeon RX 6000 series. Nvidia does not seem to perform better than AMD in terms of score/FLOPS.

The performance increase from 24 to 32 M1 GPU cores is mediocre.

EDIT: I inverted team Red and team Green. I just relied on alphabetical order and how colours are coded in R.
 

Attachments

  • table.txt
    931 bytes
Last edited:

crazy dave

macrumors 65816
Sep 9, 2010
1,454
1,230
I did a very rough plot. For AMD/Nvidia, I relied on the chart page. For Apple Silicon, I used several scores, including all the scores I could find for the M1 Pro and M1 Max. I used only OpenCL scores.

It shows that Apple GPUs yield fewer points/FLOPS than the RTX 3000 series, and probably the Radeon RX 6000 series. Nvidia does not seem to perform better than AMD in terms of score/FLOPS.

The performance increase from 24 to 32 M1 GPU cores is mediocre.

EDIT: I inverted team Red and team Green. I just relied on alphabetical order and how colours are coded in R.

Could you add a table view to see raw numbers?
 

l0stl0rd

macrumors 6502
Jul 25, 2009
483
420
30 TFLOPS is the desktop RTX 3080. Apple are comparing against the mobile RTX 3080, which is nowhere near as fast as the desktop version. Then again, the desktop 3080 pulls 320W, twice that of the higher-TDP variants of the mobile 3080 that Apple is comparing the M1 Max to (and which Apple admits marginally beats the M1 Max).

Blame NVIDIA and their poor naming scheme and high level of product line fragmentation for calling a 320W monster, a 160W space heater and a 100W gimped version (plus various in-between) all the RTX 3080.
You are right, the mobile version has about half: the 3080 Max-Q is 15.3 TFLOPS.

Still, the one that gets close is the 3060 Max-Q with 9.8 TFLOPS.
 

crazy dave

macrumors 65816
Sep 9, 2010
1,454
1,230
I have attached the txt file.
Unfortunately you can't really use the 3050 Ti and 3050. They're mobile GPUs, and the TFLOPs listed for them on TechPowerUp are base figures. They can be boosted up to 7 or 8 TFLOPs depending on how they've been configured.

Tom's Hardware has them at 7.6 and 6.1 TFLOPs respectively.


That would give ratios of about 1 to 1 for TFLOPs and GB performance wrt the 3060 (see the quick check after the table below), but that's only if they are actually running at those TFLOPs.

GPU       GB compute score   TFLOPs   Family
3060      96400              12.74    RTX 3000
3050m     45898              6.1      RTX 3000
3050tim   55425              7.6      RTX 3000
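And the quick check of that "about 1 to 1" claim, normalising points per TFLOP to the 3060 (still assuming the Tom's Hardware TFLOPs are right):

Python:
# Points per TFLOP, normalised to the 3060, using the figures in the table above
scores = {
    "3060":           (96400, 12.74),
    "3050 mobile":    (45898, 6.1),
    "3050 Ti mobile": (55425, 7.6),
}
base = scores["3060"][0] / scores["3060"][1]
for name, (score, tflops) in scores.items():
    print(name, round((score / tflops) / base, 2))
# 3060 1.0, 3050 mobile ~0.99, 3050 Ti mobile ~0.96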
 
  • Like
Reactions: l0stl0rd

jeanlain

macrumors 68020
Mar 14, 2009
2,463
958
Regression lines must pass through the point (0, 0). If we consider that point, Nvidia GPUs do not show linear scaling, even though the points are relatively well aligned. Based on the results from the 3050 mobile, the 3080 and 3090 should perform better.
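To make that concrete, a through-origin fit just has slope = sum(x*y) / sum(x*x); a quick sketch with the three Nvidia points quoted above (still assuming those mobile TFLOPs figures are right) shows where a linear extrapolation would put the bigger cards:

Python:
import numpy as np

# TFLOPs and GB5 scores for the three Nvidia entries quoted above
x = np.array([6.1, 7.6, 12.74])      # 3050 mobile, 3050 Ti mobile, 3060
y = np.array([45898, 55425, 96400])

# Least-squares line forced through (0, 0): slope = sum(x*y) / sum(x*x)
slope = (x * y).sum() / (x * x).sum()
print(slope)            # ~7500 points per TFLOP

# Linear extrapolation to a ~29.8 TFLOPs desktop 3080:
print(slope * 29.77)    # ~223k, which the big cards don't actually reach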
 

crazy dave

macrumors 65816
Sep 9, 2010
1,454
1,230
Regression lines must pass through the point (0, 0). If we consider that point, Nvidia GPUs do not show linear scaling, even though the points are relatively well aligned. Based on the results from the 3050 mobile, the 3080 and 3090 should perform better.
They show linear scaling up to the 3060 Ti. Your mobile 3050 TFLOP figures are skewed. After that, yes, scaling drops off, especially for OpenCL, less so for CUDA.
 

jeanlain

macrumors 68020
Mar 14, 2009
2,463
958
GPU       GB compute score   TFLOPs   Family
3060      96400              12.74    RTX 3000
3050m     45898              6.1      RTX 3000
3050tim   55425              7.6      RTX 3000
OK, here is the corresponding plot. The points are well aligned for Nvidia GPUs.
EDIT: up to the 3060. The line assumes linear scaling based on the M1 GPU (8 cores).
 

Attachments

  • plot.jpg
    84.6 KB
Last edited:

crazy dave

macrumors 65816
Sep 9, 2010
1,454
1,230
OK, here is the corresponding plot. The points are well aligned for Nvidia GPUs.

Yeah, my guess is that after a point GB isn't able to fill the really big desktop Nvidia GPUs with enough work to flex them, or possibly, as @jmho said, small differences in the scaling can eventually accumulate so the increased parallelism doesn't help. But my bet is the former, at least with some of the subtests.

But the Nvidia GPUs around the M1 GPU's size and even a little bigger? Yup, those show perfect linear scaling. Some of the OpenCL/Metal scores for AMD GPUs around that size show okay scaling too (though not as good).

Hopefully we'll find out soon what is going on!
 
  • Like
Reactions: jeanlain

jeanlain

macrumors 68020
Mar 14, 2009
2,463
958
The plot shows that GPUs from various vendors perform similarly in terms of points/FLOPS, up to a level where efficiency starts to plateau.
But that drop-off occurs sooner for the M1 Max. It could be due to the default power mode used in the laptops. Didn't we see something similar with the A14 on the iPhone? Its compute score is much lower than that of the A14 in the iPad Air.
 
  • Like
Reactions: crazy dave

leman

macrumors Core
Oct 14, 2008
19,522
19,679
Didn't we see something similar with the A14 on the iPhone? Its compute score is much lower than that of the A14 in the iPad Air.

My suspicion is that the iPad enabled full FP32 throughput, while on the iPhone the A14 was throttled to half the FP32 rate.
 

crazy dave

macrumors 65816
Sep 9, 2010
1,454
1,230
My suspicion is that the iPad enabled full FP32 throughput, while on the iPhone the A14 was throttled to half the FP32 rate.

Well, Apple's done something with the M1 Pro/Max (especially with 32 cores), although it's weird that Aztec Ruins High gives roughly the expected scaling. That's why I keep going back to Apple aggressively throttling the GPU during GB. Why? I dunno 🤷‍♂️. This makes no sense.

But we do know that there's a high performance mode for the 16" model aimed at the GPU. Can't help but feel it's related, especially given the huge drop-off from 24 to 32 cores. From 8 to 16 cores, near perfect 1.0 scaling; from 8 to 24, 0.93 scaling (same as the AMD chips on Metal!); but from 8 to 32? 0.8 scaling or worse. So from 24 to 32 it's a scaling of 0.77. Ouch.
 

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
Well, Apple's done something with the 32-core M1 Max, although it's weird that Aztec Ruins High gives roughly the expected scaling. That's why I keep going back to Apple aggressively throttling the GPU during GB. Why? I dunno 🤷‍♂️. This makes no sense.
Most likely when the benchmark is rasterisation rather than pure compute, the scaling will be more in line with the increase in core count. I suspect Aztec Ruins High did not stress the compute aspect of the GPU enough.
 

crazy dave

macrumors 65816
Sep 9, 2010
1,454
1,230
Most likely when the benchmark is rasterisation rather than pure compute, the scaling will be more in line with the increase in core count. I suspect Aztec Ruins High did not stress the compute aspect of the GPU enough.

My experience is that it's usually the other way around: pure compute scales better with core count. I mean, that's logical, and even here it by and large works that way for other GPUs (up to a point waaay away from where the M1 is). But maybe not with the M1. It is a very different architecture from what I'm used to, but Apple claims 4x the performance and 10 TFlops of 32-bit compute. So I dunno. Something somewhere doesn't add up.

Edit: no, I don't think this line of argumentation works either, even in this case. Aztec Normal doesn't stress the GPU as much, and as expected it doesn't show the full scaling between Apple GPUs, because it's not stressing the GPUs' compute abilities enough. Even the small ones are completing their calculations too quickly. By the same logic, if Aztec High also weren't stressing the GPU enough, then running the benchmark on each chip shouldn't give the right scaling either.
 
Last edited:

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
My experience is that it's usually the other way around: pure compute scales better with core count. I mean, that's logical, and even here it by and large works that way for other GPUs (up to a point waaay away from where the M1 is). But maybe not with the M1. I dunno.
Usually yes, but the M1 Max has a turbo mode which is likely not turned on during the GB5 run. So I'm assuming that the M1 Max does not need the turbo mode to run Aztec Ruins High effectively, as that run is likely interspersed with other operations (e.g. memory transfers) instead of just pure compute.
 

crazy dave

macrumors 65816
Sep 9, 2010
1,454
1,230
Usually yes, but the M1 Max has a turbo mode which is likely not turned on during the GB5 run. So I'm assuming that the M1 Max does not need the turbo mode to run Aztec Ruins High effectively, as that run is likely interspersed with other operations (e.g. memory transfers) instead of just pure compute.

That's why I'm going with the M1 Max being throttled in some way that high performance mode is meant to relieve, but as I edited my last post to say:

No, I don't think this line of argumentation works either, even in this case. Aztec Normal doesn't stress the GPU as much, and as expected it doesn't show the full scaling between Apple GPUs, because it's not stressing the GPUs' compute abilities enough. Even the small ones are completing their calculations too quickly. By the same logic, if Aztec High also weren't stressing the GPU enough, then running the benchmark on each M1 shouldn't give the right scaling either.
 

leman

macrumors Core
Oct 14, 2008
19,522
19,679
Well, Apple's done something with the M1 Pro/Max (especially with 32 cores), although it's weird that Aztec Ruins High gives roughly the expected scaling. That's why I keep going back to Apple aggressively throttling the GPU during GB. Why? I dunno 🤷‍♂️. This makes no sense.

I agree, it has double the bandwidth, double the ALUs, and the benchmarks are massively parallel, so near-linear scaling is expected (and observed with other models). It could still be Geekbench btw, maybe they are launching kernels in a way that does not saturate the M1 Max.
 

crazy dave

macrumors 65816
Sep 9, 2010
1,454
1,230
I agree, it has double the bandwidth, double the ALUs, and the benchmarks are massively parallel, so near-linear scaling is expected (and observed with other models). It could still be Geekbench btw, maybe they are launching kernels in a way that does not saturate the M1 Max.

Yeah maybe … it’s still possible but it’s weird that it’s okay on Metal for the AMD chips. So it’s not (completely) an API programming problem.
 