
leman

macrumors Core
Oct 14, 2008
19,521
19,677
When you say “warm up”, does it also ramp down slowly, or does it need to start ramping up again after each work package completes? This might be interesting to see graphically, if you have the time. Is it linear? Would we expect the Ultra to take twice as long?

My methodology is probably much more primitive than you might think :) I am not aware of any way to query the GPU frequency, and the GPU power consumption counter is neither informative nor high-resolution, so I simply looked at the output of the Metal trace profiler. There is an indicator that shows the GPU performance state (low, medium, high). The moment a kernel starts executing, the GPU goes into the “medium” state, and it takes about 10ms to go into the “high” state. I will try to investigate how long it stays in the “high” state after the work is completed; not sure whether the profiler will give me this information.

This is a strange design approach. I wonder if the reason it ramps up slowly is that the packaging approach can’t handle the inrush current required to just step the clocks up. Still, 10ms seems like a long time to compensate for that.

I don’t think it’s strange if you consider that M1 aims to be efficient. It would be wasteful to trigger the high performance mode every time the GPU needs to render a button, so their solution of keeping the clocks low as long as there is no urgent work seems like a reasonable strategy. 10ms is brief enough to appear instantaneous to human perception, and if you can do the work within that period of time - great! If not, well, time to gather those performance reserves.

In other words, you probably don’t care whether your image filter application takes 10ms or 1ms - both are quicker than the button press animation. But you probably care whether it takes 10 seconds or 40 seconds.

In the end, the only class of applications that “suffers” from this style of power management is benchmarks. And that’s why benchmarks should include a warmup phase to make sure that they are measuring the correct thing.
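
If you want a number without the profiler, Metal also exposes per-command-buffer GPU timestamps, which is enough to watch the per-dispatch time change as the GPU warms up. A minimal Swift helper (just a sketch; it assumes the command buffer has already been fully encoded):

Code:
import Metal

// Returns how long the GPU spent executing an already-encoded command buffer.
// gpuStartTime/gpuEndTime mark when the GPU actually started and finished the
// work, so CPU-side encoding and queueing overhead is excluded.
func gpuSeconds(_ commandBuffer: MTLCommandBuffer) -> Double {
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
    return commandBuffer.gpuEndTime - commandBuffer.gpuStartTime
}

Comparing that value across dispatches of increasing size should paint a similar picture to the profiler’s performance-state indicator.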
 

Homy

macrumors 68030
Jan 14, 2006
2,507
2,459
Sweden
I've been doing some experiments on my M1 Max... so far, it seems to take about 10ms for the GPU to ramp up to the high power mode. That is, if the compute shaders are done in less than 10ms, you won't get peak performance. And 10ms is a very long time in GPU land. My experiment involved doing long chains of trivial dependent computation (1024 operations per GPU thread) without memory access and launching various numbers of computation groups (threadgroups in Metal lingo) with 1024 threads each. Launching 1024 threadgroups (that's 1 million threads or 1 billion computations) takes less than a millisecond, so it's not enough to trigger high performance mode — I only get around 30% of peak performance (or 1.5 TFLOPS). I need to launch more than 64K threadgroups (that's over 64 billion computations) for the speed to start ramping up, and it takes over 256K threadgroups to reach close to the maximal aggregated 5 TFLOPS for M1 Max.

Bottom line is: if you want to benchmark the maximal performance on M1 GPUs, you need to "warm up" the device by doing in excess of 10ms of work, which is A LOT OF WORK on these devices. Benchmark runs should push the GPU for at least 1 second of sustained work, ideally longer. It is very unlikely that Geekbench uses any workloads that adhere to these criteria, and their style of running benchmarks sequentially basically means that the GPU will never leave the minimal/medium power mode.

P.S. Note that I say that 5TFLOPS is maximal for M1 Max where the usual figures state 10TFLOPS — that's because fused multiply-add (a*b + c) is executed as a single operation. So peak FLOPS using FMA operations is obviously double, if you prefer to count it that way. That's how GPU makers "cheat" with their peak figures (and yes, everyone does it).
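
For reference, here is a minimal Swift + Metal sketch of the kind of experiment described in the quote above: a chain of 1024 dependent FMAs per thread, dispatched in threadgroups of 1024 threads. The kernel and all names are illustrative rather than the quoted poster's actual code, and it reuses the gpuSeconds() helper from the earlier sketch.

Code:
import Metal

// Illustrative reconstruction: each GPU thread runs a chain of 1024 dependent
// FMAs, with a single store at the end so the compiler cannot discard the work.
let source = """
#include <metal_stdlib>
using namespace metal;
kernel void fmaChain(device float *out [[buffer(0)]],
                     uint gid [[thread_position_in_grid]]) {
    float a = float(gid) * 1e-7f + 1.0f;
    for (uint i = 0; i < 1024; ++i) {
        a = fma(a, 0.9999999f, 1e-7f);   // each FMA depends on the previous result
    }
    out[gid % 1024] = a;                 // overlapping stores are harmless here
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: source, options: nil)
let pipeline = try! device.makeComputePipelineState(function: library.makeFunction(name: "fmaChain")!)
let queue = device.makeCommandQueue()!
let out = device.makeBuffer(length: 1024 * MemoryLayout<Float>.stride, options: .storageModeShared)!

// One threadgroup = 1024 threads x 1024 FMAs. A trivial kernel like this allows
// 1024 threads per threadgroup (check pipeline.maxTotalThreadsPerThreadgroup to be sure).
func run(threadgroups: Int) -> Double {
    let cb = queue.makeCommandBuffer()!
    let enc = cb.makeComputeCommandEncoder()!
    enc.setComputePipelineState(pipeline)
    enc.setBuffer(out, offset: 0, index: 0)
    enc.dispatchThreadgroups(MTLSize(width: threadgroups, height: 1, depth: 1),
                             threadsPerThreadgroup: MTLSize(width: 1024, height: 1, depth: 1))
    enc.endEncoding()
    return gpuSeconds(cb)
}

for n in [1_024, 65_536, 262_144] {
    let seconds = run(threadgroups: n)
    let flops = Double(n) * 1024 * 1024 * 2   // counting each FMA as two operations
    print("\(n) threadgroups: \(flops / seconds / 1e12) TFLOPS")
}

With large dispatch counts the reported rate should approach the roughly 10 TFLOPS (FMA counted as two operations) quoted above, while the small dispatches land well below it.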

What kind of benchmark do you suggest? I ran Geekbench Metal 7-8 times continuously. The first time I got the highest score. Then I got lower scores, but the last time the score went back up to around the first result. Can you "warm up" the GPU with e.g. Tomb Raider?
 
Last edited:

leman

macrumors Core
Oct 14, 2008
19,521
19,677
What kind of benchmark do you suggest? I ran Geekbench Metal 7-8 times continuously. The first time I got the highest score. Then I got lower scores, but the last time the score went back up to around the first result. Can you "warm up" the GPU with e.g. Tomb Raider?

I doubt you can do it reliably “by hand”. Instead, Geekbench should be updated to include a warm-up phase where they just run a dumb kernel for a couple of seconds immediately before running the actual benchmark.
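
Such a warm-up phase only takes a few lines. A hypothetical version, reusing the run(threadgroups:) helper from the sketch above (the two-second target is just a guessed safety margin, not anything Geekbench actually does):

Code:
// Keep the GPU busy until roughly two seconds of accumulated GPU time have
// elapsed, then start the measured benchmark runs.
var warmedUp = 0.0
while warmedUp < 2.0 {
    warmedUp += run(threadgroups: 65_536)   // run() returns GPU-busy seconds
}
// ...the measured benchmark dispatches would start here...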
 

Homy

macrumors 68030
Jan 14, 2006
2,507
2,459
Sweden
I doubt you can do it reliably “by hand”. Instead, Geekbench should be updated to include a warm-up phase where they just run a dumb kernel for a couple of seconds immediately before running the actual benchmark.
When I run game benchmarks on my MBP 14" 14c, the fans max out after 10 minutes and the GPU gets hot, so that seems to warm it up. :)
 
Last edited:
  • Haha
Reactions: Juraj22

LinkRS

macrumors 6502
Oct 16, 2014
402
331
Texas, USA
Benchmarks should inform users of the performance expectations of the subject under test. If the benchmark fails in this respect, it is a "bad" benchmark. In this case, it seems that Apple needs to tweak how the GPU behaves on the Ultra and Max when in the Mac Studio. The only reason to lower power or ramp it down (so to speak) in a desktop is to save power or reduce temps. Most GPUs in the Windows world have a "Windows" clock and a "game" clock. What I mean by this is that when the load on the GPU is lower (like at the Windows desktop), the GPU throttles down as it doesn't need its full power to render the workload. However, as soon as the workload increases, it ramps up to "game" mode and runs at its peak performance. The M1 Ultra (and Max, for that matter), while in desktop form, should ideally behave the same. Some folks may argue that it does that now, but clearly it is different from other GPU vendors. The benchmark shouldn't have to be modified to run correctly, as that would indicate that software vendors would need to modify their code to get the most out of the Ultra, which is not what Apple messaged about this SoC.

Also, since Apple deprecated OpenGL and OpenCL, it really isn't a fair comparison. Those drivers are probably not optimized, and would likely produce subpar results.

For an interesting story of driver manipulation, Google "ATI quack.exe."
 

leman

macrumors Core
Oct 14, 2008
19,521
19,677
Benchmarks should inform users of the performance expectations of the subject under test. If the benchmark fails in this respect, it is a "bad" benchmark. In this case, it seems that Apple needs to tweak how the GPU behaves on the Ultra and Max when in the Mac Studio. The only reason to lower power or ramp it down (so to speak) in a desktop is to save power or reduce temps. Most GPUs in the Windows world have a "Windows" clock and a "game" clock. What I mean by this is that when the load on the GPU is lower (like at the Windows desktop), the GPU throttles down as it doesn't need its full power to render the workload. However, as soon as the workload increases, it ramps up to "game" mode and runs at its peak performance. The M1 Ultra (and Max, for that matter), while in desktop form, should ideally behave the same. Some folks may argue that it does that now, but clearly it is different from other GPU vendors. The benchmark shouldn't have to be modified to run correctly, as that would indicate that software vendors would need to modify their code to get the most out of the Ultra, which is not what Apple messaged about this SoC.

Also, since Apple deprecated OpenGL and OpenCL, it really isn't a fair comparison. Those drivers are probably not optimized, and would likely produce subpar results.

For an interesting story of driver manipulation, Google "ATI quack.exe."

I am not sure I would agree. The slightly longer warmup time of Apple GPUs does not impact user experience or performance under normal circumstances. As I argued before, if your GPU workload takes under 10ms, performance doesn’t matter anyway. The main difference between Apple and other vendors seems to be that Apple allows you to use the GPU without using much power (if appropriate), whereas other vendors turn on max boost simply when you initialize the graphical device. No wonder Apple laptops have superior battery life.

P.S. And sure, maybe on desktop the warmup duration could be shorter… but does it really impact performance outside benchmarks?
 

LinkRS

macrumors 6502
Oct 16, 2014
402
331
Texas, USA
I am not sure I would agree. The slightly longer warmup time of Apple GPUs does not impact user experience or performance under normal circumstances. As I argued before, if your GPU workload takes under 10ms, performance doesn’t matter anyway. The main difference between Apple and other vendors seems to be that Apple allows you to use the GPU without using much power (if appropriate), whereas other vendors turn on max boost simply when you initialize the graphical device. No wonder Apple laptops have superior battery life.

P.S. And sure, maybe on desktop the warmup duration could be shorter… but does it really impact performance outside benchmarks?
As a benchmark, Geekbench should be exercising the hardware using generally available methods (APIs, if you will) and demonstrating the maximum performance of the hardware. If it does not demonstrate the maximum performance, is it a fault of the benchmark itself, or of the underlying hardware? I postulate it is the latter. If it were the benchmark, you would see the same behavior on non-Apple Silicon hardware. Unless Geekbench writes specialized code for each device architecture it tests, this would be the only explanation for the shortfall. If Primate Labs does write custom code for each architecture, I would call it a suspect benchmark for anything. In order for a benchmark to be credible, there should be no appearance of, or act of, favoritism. If the Primate Labs software engineers were Apple fans, they could theoretically spend more time optimizing the code path for their preferred architecture. Putting aside the fan aspect, inferior performance could also just be the result of unfamiliarity with the platform, calling all results into question.

P.S. During the quest for speed, the faster you ramp up, the more work you can get done, so it *could* impact real world performance. :)

Thanks for the discussion!
 

leman

macrumors Core
Oct 14, 2008
19,521
19,677
As a benchmark, Geekbench should be exercising the hardware using generally available methods (APIs, if you will) and demonstrating the maximum performance of the hardware. If it does not demonstrate the maximum performance, is it a fault of the benchmark itself, or of the underlying hardware? I postulate it is the latter. If it were the benchmark, you would see the same behavior on non-Apple Silicon hardware. Unless Geekbench writes specialized code for each device architecture it tests, this would be the only explanation for the shortfall. If Primate Labs does write custom code for each architecture, I would call it a suspect benchmark for anything. In order for a benchmark to be credible, there should be no appearance of, or act of, favoritism. If the Primate Labs software engineers were Apple fans, they could theoretically spend more time optimizing the code path for their preferred architecture. Putting aside the fan aspect, inferior performance could also just be the result of unfamiliarity with the platform, calling all results into question.

All of this is true. But then again, one would expect a benchmark to provide realistic estimates of what performance can be expected when doing certain tasks. It is indeed the case that Geekbench is running afoul of Apple Silicon-specific behaviour, but the end result is that it systematically underestimates the capability of the hardware.

In the end, the question is: what exactly is being tested? As it stands, Geekbench is testing the M1 GPU's performance for running extremely short compute kernels in the absence of any other GPU work. The result is that other hardware is faster here because dGPUs go to 100% the moment you request a GPU context. This is an interesting and useful result on its own, but it hardly matches the intuition of a user running a benchmark, because what the user actually wants to know is "how fast is my GPU when doing GPU-intensive work". Workloads of just a few ms are irrelevant for this. Where do people actually care about GPU compute performance? It's things like rendering, scientific computation, machine learning, video processing — these workloads take hundreds and hundreds of times longer than the timespans we are discussing here. And if you do launch an occasional short kernel or two (e.g. when doing photo processing) — well, as I mentioned before, you don't really care whether it takes 1ms or 10ms; from your perspective it's instantaneous anyway. Now, if you are batch processing 1000 images, you no longer have a short-running kernel and will get the maximal performance your GPU is capable of. Apple's approach appears fairly reasonable to me.

I wouldn't go as far as to claim that Geekbench is buggy here. It is not. It just does not do what the user expects it to do. When you are benchmarking a platform, you need to take its particularities into account. And this has nothing to do with favouritism. It's just that different platforms might require a different setup to achieve your goal.

P.S. During the quest for speed, the faster you ramp up, the more work you can get done, so it *could* impact real world performance. :)

Oh, sure, and it does. In my testing I need to have the compute pass run for 100ms or more in order for the aggregated rate to reach the expected peak values. But these things don't matter in practice. These are user-facing devices, not server farms. Some milliseconds of warm-up phase are not going to be noticeable.
 

Analog Kid

macrumors G3
Mar 4, 2003
9,360
12,603
My methodology is probably much more primitive than you might think :) I am not aware of any way to query the GPU frequency, and the GPU power consumption counter is neither informative nor high-resolution, so I simply looked at the output of the Metal trace profiler. There is an indicator that shows the GPU performance state (low, medium, high). The moment a kernel starts executing, the GPU goes into the “medium” state, and it takes about 10ms to go into the “high” state. I will try to investigate how long it stays in the “high” state after the work is completed; not sure whether the profiler will give me this information.



I don’t think it’s strange if you consider that M1 aims to be efficient. It would be wasteful to trigger the high performance mode every time the GPU needs to render a button, so their solution of keeping the clocks low as long as there is no urgent work seems like a reasonable strategy. 10ms is brief enough to appear instantaneous to human perception, and if you can do the work within that period of time - great! If not, well, time to gather those performance reserves.

In other words, you probably don’t care whether your image filter application takes 10ms or 1ms - both are quicker than the button press animation. But you probably care whether it takes 10 seconds or 40 seconds.

In the end, the only class of applications that “suffers” from this style of power management is benchmarks. And that’s why benchmarks should include a warmup phase to make sure that they are measuring the correct thing.
I appreciate you doing the benchmark-- this seems to support that this is an architectural decision, not a badly written benchmark.

I don't like to ask other people to put in work for satisfying my curiosity, but, um, if you find yourself also curious, it might be interesting to see if this behavior changes with the "high power" energy mode in the Battery syspref.

Now the remaining question is whether the benchmark is meaningful. I'd emphatically say yes. Look what the benchmark, and the testing you've done as a result of it, has already taught us in this thread about how the M1 series is optimized and the tradeoffs it presents. The M1 isn't designed to use the GPU as what I'd call a coprocessor; it's meant to be used as what I'd call a processing engine. The Geekbench workload looks like they're using something like OpenCV to accelerate highly parallel operations-- things that are more than you'd run through Neon, but not infinite workloads. That's what I'd consider a coprocessor. I need a stereo disparity or Canny image, or whatever, so I put the images out to the GPU and get a result back. Then I take the next image pair and send that. Each operation takes much less than 10ms to process, but if done on the CPU it would severely limit my execution time.

You seem focused on what I'd consider a processing engine-- it just keeps running, endlessly rendering or training a neural net, or simulating the weather all in one dispatch or, if we find it doesn't spin down after each call, a series of very lengthy dispatches.

What these results are saying is that the M1Ultra might double the performance of the M1Max if used as an engine, but will not if used as a coprocessor.

The Unified memory offers two benefits to the GPU: it gives the GPU access to much larger pools of memory (good for an engine) and it avoids memory to memory transfer costs from CPU to GPU (good for coprocessors). That second benefit is not fully realized with the Apple approach.

I'm also curious why the race to sleep paradigm is not seen as beneficial in this context. For dice this enormous, and process geometries this small, I'd have thought being able to power down large portions of the logic would be important to constraining leakage. Better to sweat hard, but leak half as long-- but maybe my assumptions on leakage versus voltage scaling are wrong here.

These are user-facing devices, not server farms. Some milliseconds of warm-up phase are not going to be noticeable.
I see that argument for iPhone, but not a high performance desktop like this. The Ultra is meant to compute, not provide better UX... If the milliseconds of spin up are recurring, it impedes the system's ability to compute.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,677
The M1 isn't designed to use the GPU as what I'd call a coprocessor; it's meant to be used as what I'd call a processing engine. The Geekbench workload looks like they're using something like OpenCV to accelerate highly parallel operations-- things that are more than you'd run through Neon, but not infinite workloads. That's what I'd consider a coprocessor. I need a stereo disparity or Canny image, or whatever, so I put the images out to the GPU and get a result back. Then I take the next image pair and send that. Each operation takes much less than 10ms to process, but if done on the CPU it would severely limit my execution time.

You seem focused on what I'd consider a processing engine-- it just keeps running, endlessly rendering or training a neural net, or simulating the weather all in one dispatch or, if we find it doesn't spin down after each call, a series of very lengthy dispatches.

Why wouldn’t the kind of usage you describe qualify as “processing engine”? It doesn’t sound to me like the operations you describe are one-off things; you are continuously sending new images to be processed on the GPU, right? This is not about the duration of a single task or dispatch; it is about the utilization of the GPU as a coprocessor. If you are doing some work repeatedly within a reasonably brief period of time, you should get maximal performance. I will test this later :)

It is no different in the CPU world. The power state is driven by CPU utilization. If your CPU is just doing a few cycles’ worth of work and then pausing/closing the thread, turbo won’t kick in and your CPU will remain in the low power state. You need to actually do something the CPU considers “work” for peak performance to trigger. It’s just that CPUs are configured to be more agile when it comes to these things.

P.S. got a reply from Apple on this: https://developer.apple.com/forums/thread/702651
 

leman

macrumors Core
Oct 14, 2008
19,521
19,677
I appreciate you doing the benchmark-- this seems to support that this is an architectural decision, not a badly written benchmark.

I don't like to ask other people to put in work for satisfying my curiosity, but, um, if you find yourself also curious, it might be interesting to see if this behavior changes with the "high power" energy mode in the Battery syspref.

Now the remaining question is whether the benchmark is meaningful. I'd emphatically say yes. Look what the benchmark, and the testing you've done as a result of it, has already taught us in this thread about how the M1 series is optimized and the tradeoffs it presents. The M1 isn't designed to use the GPU as what I'd call a coprocessor; it's meant to be used as what I'd call a processing engine. The Geekbench workload looks like they're using something like OpenCV to accelerate highly parallel operations-- things that are more than you'd run through Neon, but not infinite workloads. That's what I'd consider a coprocessor. I need a stereo disparity or Canny image, or whatever, so I put the images out to the GPU and get a result back. Then I take the next image pair and send that. Each operation takes much less than 10ms to process, but if done on the CPU it would severely limit my execution time.

You seem focused on what I'd consider a processing engine-- it just keeps running, endlessly rendering or training a neural net, or simulating the weather all in one dispatch or, if we find it doesn't spin down after each call, a series of very lengthy dispatches.

What these results are saying is that the M1Ultra might double the performance of the M1Max if used as an engine, but will not if used as a coprocessor.

The Unified memory offers two benefits to the GPU: it gives the GPU access to much larger pools of memory (good for an engine) and it avoids memory to memory transfer costs from CPU to GPU (good for coprocessors). That second benefit is not fully realized with the Apple approach.

I'm also curious why the race to sleep paradigm is not seen as beneficial in this context. For dice this enormous, and process geometries this small, I'd have thought being able to power down large portions of the logic would be important to constraining leakage. Better to sweat hard, but leak half as long-- but maybe my assumptions on leakage versus voltage scaling are wrong here.


I see that argument for iPhone, but not a high performance desktop like this. The Ultra is meant to compute, not provide better UX... If the milliseconds of spin up are recurring, it impedes the system's ability to compute.

So, I did some testing on how frequently one has to submit GPU work to get peak performance. It's not good. If I wait for as little as 1ms to submit the next kernel, I won't trigger the maximal performance mode. I didn't see any difference in behaviour using battery power or the power adapter, or using the maximal performance profile vs. automatic, but my testing was not overly systematic, so who knows.

So indeed, if your use case requires you to submit short GPU work packages with brief delays between them (real-time processing, streaming, sequential CPU+GPU processing), you might not get the maximal possible performance. I don't think these use cases are very frequent, but it would be good if Apple had a power assertion for triggering the high performance GPU state programmatically.
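
For what it's worth, the gap test itself is simple. A hypothetical fragment, again reusing the run(threadgroups:) helper from the earlier sketch (Foundation is imported for Thread.sleep):

Code:
import Foundation

// Submit the same short dispatch repeatedly with a configurable pause in
// between, and watch whether the per-dispatch GPU time ever drops to the
// fully warmed-up value.
for gapMilliseconds in [0.0, 0.5, 1.0, 2.0] {
    for i in 0..<100 {
        let seconds = run(threadgroups: 1_024)   // ~1M threads, well under 10ms of work
        if i % 20 == 0 {
            print("gap \(gapMilliseconds) ms, iteration \(i): \(seconds * 1000) ms on the GPU")
        }
        Thread.sleep(forTimeInterval: gapMilliseconds / 1000.0)
    }
}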
 
  • Like
Reactions: Analog Kid

Analog Kid

macrumors G3
Mar 4, 2003
9,360
12,603
So, I did some testing on how frequently one has to submit GPU work to get peak performance. It's not good. If I wait for as little as 1ms to submit the next kernel, I won't trigger the maximal performance mode. I didn't see any difference in behaviour using battery power or the power adapter, or using the maximal performance profile vs. automatic, but my testing was not overly systematic, so who knows.

So indeed, if your use case requires you to submit short GPU work packages with brief delays between them (real-time processing, streaming, sequential CPU+GPU processing), you might not get the maximal possible performance. I don't think these use cases are very frequent, but it would be good if Apple had a power assertion for triggering the high performance GPU state programmatically.
Thanks for running those tests. Interesting results for sure.

I suppose "frequent" is in the eye of the developer, so my view may be skewed by my experience. The unified memory did seem like an opportunity to do more of these smaller jobs with less overhead expense, but the power scaling witholds some of that benefit. Not all of it though. It should still be faster than sequential processing.

This also prevents any sort of predictable latencies. Not that a desktop computer is meant to run real-time systems...

Interesting... It seems Apple has the ability to force a performance state (they implemented it in Instruments), but they aren't exposing it to developers. This seems like a place they could better differentiate laptops from desktops-- take the request, grant it almost always for desktops, and maybe vet the request for laptops?
 