Important: Geekbench Compute scores for Apple Silicon are not accurate

jeanlain · Mar 19, 2022

MayaUser said:
From my obsessive friends the difference in WoW same settings between 3090 and m1 ultra is 7-9 fps,
If that helps

Your friends already have an M1 ultra?

MayaUser · Mar 19, 2022

jeanlain said:
Your friends already have an M1 ultra?

Yes he got it on the 18th
I will get it, hopefully on 24

ikir · Mar 19, 2022

This video confirm M1 Ultra doesn’t ramp up in Geekbench

leman · Mar 19, 2022

ikir said:
This video confirm M1 Ultra doesn’t ramp up in Geekbench

According to that review Ultra doesn’t ramp up anywhere. Bug? Or some other weirdness? BTW, fully agree with Maxim that it’s a shame that the software is not ready.

Xiao_Xi · Mar 19, 2022

leman said:
According to that review Ultra doesn’t ramp up anywhere. Bug? Or some other weirdness? BTW, fully agree with Maxim that it’s a shame that the software is not ready.

Could Apple have a problem with the drivers?

sunny5 · Mar 19, 2022

I guess many software doesn't even use more than 60W for M1 Ultra GPU! Not M1 Ultra's fault at all.

jeanlain · Mar 19, 2022

leman said:
According to that review Ultra doesn’t ramp up anywhere. Bug?

You're talking about the GPU?
It does ramp up in 3DMark and in the GFXBench tests where it is much faster than the M1 Max. Some Tomb Raider results also showed good scaling at high resolution (not far from the RTX 3090).
The Blender test is disappointing. Wattage was very low although GPU frequency was normal and usage was reported at 100%. I suppose the software is not mature enough and doesn't use GPU ressources very well.

Analog Kid · Mar 19, 2022

leman said:
Again, this is not reflected in real-world usage and this is not how GPUs are utilized.

leman said:
It may not make it inaccurate in the narrow sense of the word, but it does make it uninteresting. GB5 compute is essentially measuring GPU performance on very short workloads. That’s not how GPUs are used.

You keep saying that's not how GPUs are used, but that's clearly not true. GPUs are used that way all the time-- acceleration of highly parallel computations, even if they're discrete operations rather than massive, continuous computation. We're just learning that the M1 series doesn't necessarily scale as well at that kind of workload.

That's interesting to me because it means it undercuts some of the benefit of the unified memory. The reason small workloads aren't often run on GPUs is because the acceleration benefit is often swamped by the memory transfer time. Unified memory should alleviate that tradeoff, but apparently the clock frequency ramps limit the effectiveness.

I had a project I was trying to accelerate through OpenCL (when that was still a thing) and I translated some of the array computations to the discrete GPU in my MBP with limited success. When I targeted the integrated Intel GPU, I actually did much better because I saved the memory transfer time even though there were fewer cores available to parallelize across. You discover things like that by profiling (or benchmarking) your code in different environments. If I'd relied on some form of gaming benchmark or AI training benchmark, I'd have insisted the discrete GPU was always the right one.

It's also interesting because this slow ramp runs counter to the "race to sleep" approach taken elsewhere.

This is what I mean when I say the benchmark is accurate, but needs to be taken in the context of other benchmarks. It's not a narrow sense of accurate, it is a true representation of how that kind of legitimate workload performs. Apple made a tradeoff in how they implement power savings in their GPUs-- the sacrificed performance of certain workloads to keep overall power down. That doesn't somehow make the workloads the M1 struggles with invalid.

leman said:
Running stuff under Rosetta 2 is fair in my book - a lot of apps are still Intel and knowing how fast they run on AS is an important metric. Depends on what you are after though. I agree that running some sort of CPU test using x86 code and then making conclusions about M1 CPU performance is dumb.

Yes, that's what I mean by "native". Running under Rosetta 2 to see how your x86 code will run on the new CPU versus your current Intel CPU in the short term is valid, but comparing an x86 application under Rosetta 2 with an i9 running native and saying the M1 or Arm is somehow limited beyond the need to migrate applications is not valid. Most critical applications have quickly been ported to AS native.

Analog Kid · Mar 19, 2022

jeanlain said:
You're talking about the GPU?
It does ramp up in 3DMark and in the GFXBench tests where it is much faster than the M1 Max. Some Tomb Raider results also showed good scaling at high resolution (not far from the RTX 3090).
The Blender test is disappointing. Wattage was very low although GPU frequency was normal and usage was reported at 100%. I suppose the software is not mature enough and doesn't use GPU ressources very well.

It's possible that 100% is referenced to a Max, not a dual-Max Ultra. The number says 100%, but the bar graph only goes half way up. I suspect the code is only using 100% of 32 cores.

ikir · Mar 20, 2022

leman said:
According to that review Ultra doesn’t ramp up anywhere. Bug? Or some other weirdness? BTW, fully agree with Maxim that it’s a shame that the software is not ready.

Software is young, nerds (us) needs to calm down, the product is still shipping to costumers, new products gets new crucial updates for months after release and it is a good thing.

jeanlain · Mar 20, 2022

Analog Kid said:
It's possible that 100% is referenced to a Max, not a dual-Max Ultra. The number says 100%, but the bar graph only goes half way up. I suspect the code is only using 100% of 32 cores.

The graph also goes halfway up during the 3DMark test, although both GPUs are used (we know it since the M1 ultra is much faster than the Max).

jeanlain · Mar 20, 2022

ikir said:
Software is young, nerds (us) needs to calm down, the product is still shipping to costumers, new products gets new crucial updates for months after release and it is a good thing.

If apps need to be updated to support the M1 Ultra, this is not good news. Apple strongly implied that the dual nature of the M1 ultra was invisible to software.

MayaUser · Mar 20, 2022

jeanlain said:
If apps need to be updated to support the M1 Ultra, this is not good news. Apple strongly implied that the dual nature of the M1 ultra was invisible to software.

still must be a bug since other apps see the Ultra extra hardware
If it wanst invisible by software then probably all of the apps get the M1 Max performance
From what i can see the Ultra gpu shines at 1440p or higher, but in 1080p is even worse than the Max

jeanlain · Mar 20, 2022

MayaUser said:
From what i can see the Ultra gpu shines at 1440p or higher, but in 1080p is even worse than the Max

In a sense, if the M1 ultra does not use its full potential at low res when it gives more than 120 fps anyway, this is totally fine. Reaching >200 fps is purely a benchmarking exercice, but it has not practical use on the Mac (no one uses a gaming monitor on a Mac, and I believe that improvements past 120 fps are hardly noticeable).
What's important is to avoid low frame rates.

Xiao_Xi · Mar 20, 2022

How did Apple test GPU drivers if the software and benchmarks don't accurately reflect Ultra performance? How did Apple do performance testing for marketing?

MayaUser · Mar 20, 2022

Xiao_Xi said:
How did Apple test GPU drivers if the software and benchmarks don't accurately reflect Ultra performance? How did Apple do performance testing for marketing?

they did it on beta version and i quote from Apple itself
"Prerelease Final Cut Pro 10.6.2" is just one example

leman · Mar 20, 2022

Analog Kid said:
You keep saying that's not how GPUs are used, but that's clearly not true. GPUs are used that way all the time-- acceleration of highly parallel computations, even if they're discrete operations rather than massive, continuous computation. We're just learning that the M1 series doesn't necessarily scale as well at that kind of workload.

You think so? I am not aware of many contexts where your just fire off a single very small kernel and that's it. Image/video editing software is probably the most common user of one-time kernels, but that type of software also continuously uses the graphical capabilities of the GPUs, thus giving the system more information about how the GPU is used and what level of performance is expected.

Of course, it all boils down to how Geekbench uses the GPU. If they do each of the benchmarks in isolation, creating a new pipeline separately for each benchmark run and then tearing it down, then no, that's definitely not how these devices are supposed to be used.

Analog Kid · Mar 20, 2022

leman said:
You think so? I am not aware of many contexts where your just fire off a single very small kernel and that's it. Image/video editing software is probably the most common user of one-time kernels, but that type of software also continuously uses the graphical capabilities of the GPUs, thus giving the system more information about how the GPU is used and what level of performance is expected.

Of course, it all boils down to how Geekbench uses the GPU. If they do each of the benchmarks in isolation, creating a new pipeline separately for each benchmark run and then tearing it down, then no, that's definitely not how these devices are supposed to be used.

It’s not sending a small task once and only once, it’s a sharing of resources between the parallelizable bits on GPU and the flow control and logic on the CPU.

How do you think Apple is handling the clock, then? GB5 is running a stereo disparity map, for example. How is the GPU supposed to know there’s another one coming after the first one? It’s starting to look more like an incomplete implementation in the drivers or something than a question of ramping the clock but if you’re right about a clock ramp I assumed it ramped up as a long job continued. Are you saying it can somehow predict how many small jobs it will see in the future?

And why ramp up, why not step up? This still seems to go against the race to sleep approach used elsewhere…

leman · Mar 20, 2022

Analog Kid said:
It’s not sending a small task once and only once, it’s a sharing of resources between the parallelizable bits on GPU and the flow control and logic on the CPU.

How do you think Apple is handling the clock, then? GB5 is running a stereo disparity map, for example. How is the GPU supposed to know there’s another one coming after the first one? It’s starting to look more like an incomplete implementation in the drivers or something than a question of ramping the clock but if you’re right about a clock ramp I assumed it ramped up as a long job continued. Are you saying it can somehow predict how many small jobs it will see in the future?

And why ramp up, why not step up? This still seems to go against the race to sleep approach used elsewhere…

I don’t know, I’m speculating

My guess is that Apple would use GPU utilization counters as well as additional heuristics to manage the GPU power. That’s something one should experiment with.

Or maybe Geekbench is simply not launching enough threadgroups to properly utilize the Ultra. Could also be.

ikir · Mar 21, 2022

jeanlain said:
If apps need to be updated to support the M1 Ultra, this is not good news. Apple strongly implied that the dual nature of the M1 ultra was invisible to software.

It is not that easy, M1 runs everything tat was running before, but with every computer realised comes firmware/bios/drivers updates in the first weeks to improve performance/stability, this computer will not be the only one expection. I still make daily BIOS and drivers update to Windows client that several years old. I would be worried that no updates come with a new machine.

diamond.g · Mar 21, 2022

It seems strange that Apple would release such a flagship device and have not embedded the required hardware to actually use all of the resources it can bring to bear.

ikir said:
It is not that easy, M1 runs everything tat was running before, but with every computer realised comes firmware/bios/drivers updates in the first weeks to improve performance/stability, this computer will not be the only one expection. I still make daily BIOS and drivers update to Windows client that several years old. I would be worried that no updates come with a new machine.

ikir · Mar 21, 2022

diamond.g said:
It seems strange that Apple would release such a flagship device and have not embedded the required hardware to actually use all of the resources it can bring to bear.

Doesn't seems strange to me since I use computer since I was a baby and this is the way for everyone, check firmware updates on new related Mac or PC even 10 years ago. Check Apple Support download and see that every big launch was followed by firmware updates. It is complex, it is not Apple or other company issue, it is normal.

Analog Kid · Mar 21, 2022

leman said:
I don’t know, I’m speculating My guess is that Apple would use GPU utilization counters as well as additional heuristics to manage the GPU power. That’s something one should experiment with.

Or maybe Geekbench is simply not launching enough threadgroups to properly utilize the Ultra. Could also be.

I'd say if the behavior we're seeing is due to the first paragraph, then the benchmark is accurate. If it's due to the second paragraph the benchmark is not accurate.

If Apple is playing games with clock rates based on workload, then it's going to do better at some than others-- that's the tradeoff they've designed in and I think benchmarking should be (well) designed to show the strengths and weaknesses. I think this test done by Anandtech, for example, is excellent at using different workloads to show the performance of a device in detail:

It's starting to look, though, like the problem is that GB may not have written their code to exploit the full parallelism. That's a failure. Apple made a point that the Ultra architecture makes the device look like one chip to software, but that could mean software that's well written to query and utilize the available resources.

leman · Mar 22, 2022

I've been doing some experiments on my M1 Max... so far, it seems to take about 10ms for the GPU to ramp up to the high power mode. That is, if the compute shaders are done in less than 10ms, you won't get peak performance. And 10ms is a very long time in GPU land. My experiment involved doing long chains of trivial dependent computation (1024 operations per GPU thread) without memory access and launching various number of computation groups (threagroups in Metal lingo) with 1024 threads each. Launching 1024 theadgroups (that's 1 million threads or 1 billion computations) takes less than a millisecond, so it's not enough to trigger high performance mode — I only get around 30% of peak performance (or 1.5 TFLOPS). I need to launch more then 64K threadgrouops (that's over 64 billion computations) for the speed to start ramping up, and it takes over 256k threadgroups to reach close to the maximal aggregated 5 TFLOPS for M1 Max.

Bottomline is: if you want to benchmark the maximal performance on M1 GPUs, you need to "warm up" the device by doing excess of 10ms of work, which is A LOT OF WORK on these devices. Benchmark runs should ideally push the GPU for at least 1 second sustained work, ideally longer. It is very unlikely that Geekbench uses any workloads that adhere to these criteria and their style of running benchmarks sequentially basically means that the GPU will never leave the minimal/medium power mode.

P.S. Note that I say that 5TFLOPS is maximal for M1 Max where the usual figures state 10TFLOPS — that's because fused multiply-add (a*b + c) is executed as a single operation. So peak FLOPS using FMA operations is obviously double, if you prefer to count it that way. That's how GPU makers "cheat" with their peak figures (and yes, everyone does it).

Analog Kid · Mar 22, 2022

leman said:
I've been doing some experiments on my M1 Max... so far, it seems to take about 10ms for the GPU to ramp up to the high power mode. That is, if the compute shaders are done in less than 10ms, you won't get peak performance. And 10ms is a very long time in GPU land. My experiment involved doing long chains of trivial dependent computation (1024 operations per GPU thread) without memory access and launching various number of computation groups (threagroups in Metal lingo) with 1024 threads each. Launching 1024 theadgroups (that's 1 million threads or 1 billion computations) takes less than a millisecond, so it's not enough to trigger high performance mode — I only get around 30% of peak performance (or 1.5 TFLOPS). I need to launch more then 64K threadgrouops (that's over 64 billion computations) for the speed to start ramping up, and it takes over 256k threadgroups to reach close to the maximal aggregated 5 TFLOPS for M1 Max.

Bottomline is: if you want to benchmark the maximal performance on M1 GPUs, you need to "warm up" the device by doing excess of 10ms of work, which is A LOT OF WORK on these devices. Benchmark runs should ideally push the GPU for at least 1 second sustained work, ideally longer. It is very unlikely that Geekbench uses any workloads that adhere to these criteria and their style of running benchmarks sequentially basically means that the GPU will never leave the minimal/medium power mode.

P.S. Note that I say that 5TFLOPS is maximal for M1 Max where the usual figures state 10TFLOPS — that's because fused multiply-add (a*b + c) is executed as a single operation. So peak FLOPS using FMA operations is obviously double, if you prefer to count it that way. That's how GPU makers "cheat" with their peak figures (and yes, everyone does it).

When you say “warm up”, does it also ramp down slowly, or does it need to start ramping up again after each work package completes? This might be interesting to see graphically, if you have the time. Is it linear? Would we expect the Ultra to take twice as long?

This is a strange design approach. I wonder if the reason it ramps up slowly is because the packaging approach can’t handle the inrush current required to just step the clocks up. Still, 10ms seems a long time to compensate for that.

Important: Geekbench Compute scores for Apple Silicon are not accurate

macrumors 68020

macrumors 68040

macrumors 68020

macrumors Core

macrumors 68000

Suspended

macrumors 68020

macrumors G3

macrumors G3

macrumors 68020

macrumors 68020

macrumors 68020

macrumors 68040

macrumors 68020

macrumors 68000

macrumors 68040

macrumors Core

macrumors G3

macrumors Core

macrumors 68020

macrumors G5

macrumors 68020

macrumors G3

macrumors Core

macrumors G3

Our Staff