...I am puzzled why the apps don't use the double encoder or the full GPU power. Apps should need no change to use more GPU capacity, shouldn't they?...
That is a good question. While it may intuitively seem that apps doing video decode/encode should require no tweaking to use additional hardware accelerators, there are several things to consider.
Prior to Apple Silicon, CPUs typically had a single decode/encode accelerator shared among all cores. E.g., Intel CPUs had only one Quick Sync block, not one per core. I believe it's the same with NVIDIA's NVENC/NVDEC and AMD's UVD/VCE: one per card. With the M1 Max, and even more so with the M1 Ultra, for the first time there are multiple accelerators available.
For Long GOP codecs such as H264 and HEVC, in theory each GOP (Group of Pictures) is independent and can be processed in parallel by a separate thread. It is impossible to use multiple threads within a GOP because the algorithm is inherently sequential -- you cannot process frame 10 until you are finished with frame 9. A typical GOP size might be 20 frames.
Thus a 10-minute video (18,000 frames at 30 fps) might contain 900 GOPs. If only you had a 900-core CPU...
The hundreds of lightweight threads a GPU can muster cannot be used for this because they are individually too slow, and only one thread per GOP is allowed. So with software decoding, an 8-core CPU can use all eight cores, one per GOP, but it's still slow.
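To make the thread-per-GOP idea concrete, here's a minimal sketch in Swift. The GOP/DecodedGOP types and the decode step are placeholders I made up for illustration, not a real codec:

```swift
import Foundation

// Placeholder types -- a real decoder would come from FFmpeg, VideoToolbox, etc.
struct GOP { let frames: [Data] }          // ~20 compressed frames
struct DecodedGOP { let frames: [Data] }   // decoded output

// Frames inside a GOP must be decoded in order: frame N needs frame N-1.
func decodeSequentially(_ gop: GOP) -> DecodedGOP {
    var out: [Data] = []
    for frame in gop.frames {
        out.append(frame)   // stand-in for the real, inherently serial decode step
    }
    return DecodedGOP(frames: out)
}

// Separate GOPs are independent, so dispatch one per thread:
// an 8-core CPU works on 8 GOPs at a time.
func decodeAll(_ gops: [GOP]) -> [DecodedGOP] {
    var results = [DecodedGOP?](repeating: nil, count: gops.count)
    let lock = NSLock()   // the results array is shared, so writes need a lock
    DispatchQueue.concurrentPerform(iterations: gops.count) { i in
        let decoded = decodeSequentially(gops[i])
        lock.lock(); results[i] = decoded; lock.unlock()
    }
    return results.compactMap { $0 }
}
```

With 900 GOPs queued, all eight cores stay busy, but each individual GOP is still limited by single-thread software decode speed.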
A hardware decode/encode accelerator is much faster since it executes the core algorithm in fixed-function hardware. However, it must either be used by a single thread, or a pool of threads must share it, saving state or coordinating via some synchronization method. Either way the limit then becomes accelerator performance. If only somebody would make a CPU with multiple accelerators...
Wait -- Apple just did!
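To illustrate the sharing problem just described, here's a hypothetical sketch of worker threads funneling through a fixed number of hardware engines via a counting semaphore (EnginePool and withEngine are made-up names, not a real API):

```swift
import Foundation

// Hypothetical pool: engineCount is 1 on older designs, 2 for the
// M1 Max H264/HEVC encoders, 4 on the M1 Ultra.
final class EnginePool {
    private let available: DispatchSemaphore
    init(engineCount: Int) {
        available = DispatchSemaphore(value: engineCount)
    }
    // Every GOP-worker thread funnels through here.
    func withEngine<T>(_ work: () -> T) -> T {
        available.wait()              // block until an engine is free
        defer { available.signal() }  // hand the engine back, even on early exit
        return work()
    }
}
```

Set engineCount to 1 and the accelerator becomes a strictly serial bottleneck no matter how many worker threads you spin up; set it to 2 or 4 and that many GOPs can be in flight at once.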
Anytime a single hardware entity has existed for a long time -- whether a single-core machine or a single video accelerator -- software tends to develop assumptions and dependencies on it. E.g., when the first multi-core machines became available, they uncovered all kinds of timing windows and thread-safety issues, even in code that was already multithreaded.
With M1 Max we have one H264/HEVC decoder, two encoders, and two separate ProRes decode/encode units. With M1 Ultra we have two H264/HEVC decoders, four encoders, and four separate ProRes engines. It's possible some software at the app layer or system layer has dependencies built on the assumption of a single media engine, and that's not yet fixed. It may not be a simple matter of spinning up another thread to work on another GOP in parallel: all threads must use proper synchronization methods (mutex, semaphore, lock, etc.) or the app will crash. Adding that logic and properly testing it on many workloads and platforms is difficult. Especially difficult is the "exception case," where a thread unpredictably hits an exception and must gracefully restore state -- e.g., 1 time out of 10 a user cancels an export, and that creates a condition where the system may crash 15 minutes later.

There can also be latent problems in multithreaded code that only appear when additional functional units (cores, accelerators, etc.) are added, because having a single functional unit subtly synchronized the threads, hiding a concurrency bug.
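Here's a contrived, self-contained sketch of that last point -- shared state with a missing lock that stays hidden as long as a single "engine" serializes the workers:

```swift
import Foundation

// `engines` plays the role of the hardware pool. With value 1, only one
// worker is ever inside the critical region, so the missing lock around
// `framesEncoded` never matters -- the single engine subtly synchronizes
// the threads. Change the value to 2 or 4 and workers can interleave the
// unsynchronized read-modify-write, losing updates (or worse).
let engines = DispatchSemaphore(value: 1)   // try 2: the latent bug surfaces
var framesEncoded = 0                       // shared state, no lock (the bug)

DispatchQueue.concurrentPerform(iterations: 900) { _ in
    engines.wait()
    // ... submit one 20-frame GOP to a hardware encoder ...
    framesEncoded += 20                     // data race once engines > 1
    engines.signal()
}
print(framesEncoded)                        // 18000 only if the race never fires
```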
However, it seems the ProRes engines already scale with the number of units. That might be because, unlike Quick Sync (which has existed in unitary form since 2011), the ProRes accelerators are new, hence the supporting code is newer and was designed from the outset for the possibility of multiple accelerators.
It seems very likely that newer versions of macOS and/or FCP and other NLEs will better leverage the multiple H264/HEVC decode/encode accelerators present on the M1 Max and Ultra. This is different from a GPU, which from the outset had many functional units, so GPU software and development frameworks were always written with that in mind. When that software improvement is made to harness multiple video accelerators, it seems possible that an M1 Ultra could suddenly decode H264/HEVC video 2x faster and encode it 4x faster.
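For context, this is roughly what requesting hardware encode looks like from the app side through VideoToolbox today (a sketch, not FCP's actual code). Notice there's no parameter naming an engine or an engine count: the app just asks for "hardware," so any scaling across multiple engines has to come from the OS scheduling work, or from apps running multiple sessions concurrently:

```swift
import VideoToolbox

// Ask for a hardware-accelerated HEVC encoder. Which engine(s)
// actually do the work is entirely up to the OS.
let spec = [
    kVTVideoEncoderSpecification_EnableHardwareAcceleratedVideoEncoder: true
] as CFDictionary

var session: VTCompressionSession?
let status = VTCompressionSessionCreate(
    allocator: kCFAllocatorDefault,
    width: 3840, height: 2160,        // 4K frames
    codecType: kCMVideoCodecType_HEVC,
    encoderSpecification: spec,
    imageBufferAttributes: nil,
    compressedDataAllocator: nil,
    outputCallback: nil, refcon: nil, // frames are submitted with an output handler
    compressionSessionOut: &session)
assert(status == noErr, "hardware HEVC encoder unavailable")
```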
OTOH, there's a separate possible issue with encoding formats that use "dependent GOPs", aka "open GOPs". In those cases one GOP refers to another GOP and concurrent processing of separate GOPs is more difficult. I don't know how widespread that is or how it affects performance scalability of multiple accelerators, but the academic literature says it's a more difficult case.
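As a rough sketch of why open GOPs hurt scalability: runs of GOPs that reference their predecessors collapse into serial chains, and only whole chains can run in parallel across engines (dependsOnPrevious is a hypothetical per-GOP flag):

```swift
// dependsOnPrevious[i] is true when GOP i references GOP i-1 (an open GOP).
// Consecutive dependent GOPs are grouped into one chain that must run serially.
func independentChains(gopCount: Int, dependsOnPrevious: [Bool]) -> [[Int]] {
    var chains: [[Int]] = []
    for i in 0..<gopCount {
        if i == 0 || !dependsOnPrevious[i] {
            chains.append([i])                   // a new independent chain starts
        } else {
            chains[chains.count - 1].append(i)   // must wait for its predecessor
        }
    }
    return chains
}
// All closed: 900 chains of length 1 -> plenty of parallel work for 2 or 4 engines.
// All open: one 900-GOP chain -> effectively serial, extra engines sit idle.
```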