Anyone else feel the M1 Ultra could be a bit faster?

mr_roboto · Jun 16, 2022

joema2 said:
Thanks for pointing that out, it is very informative. The last slide mentioned that Xcode has several new performance counters related to the MMU (Memory Management Unit), which is the hardware block that translates logical to physical addresses. Those were MMU Limiter, MMU Utilization Counter, and MMU TLB Miss Rate. TLB = Translation Lookaside Buffer, which is the MMU's cache for recent logical-to-physical translations.

The speaker threw that in at the last moment of his talk, with no other explanation. However, it implies that besides the normal GPU-related bottlenecks, there are possible MMU-related bottlenecks the programmer must be aware of. In some of his videos Max Yuryev has speculated that MMU TLB bottlenecks may be limiting GPU scalability on the M1 Max and Ultra.

Correction: in some of his videos Max Yuryev has baselessly speculated that MMU TLB bottlenecks may be limiting GPU scalability on the M1 Max and Ultra. He's not a good source for technical information. He's a YouTube talking head who doesn't even seem to really understand what a TLB is, much less how it could affect graphics performance.

You shouldn't take the existence of performance counters for various pieces of the system as corroboration for Yuryev's ideas.

leman · Jun 17, 2022

joema2 said:
The speaker threw that in at the last moment of his talk, with no other explanation. However, it implies that besides the normal GPU-related bottlenecks, there are possible MMU-related bottlenecks the programmer must be aware of.

TLB misses pose a huge problem for GPU performance, since a GPU can launch thousands of memory requests simultaneously. This is not new and not exclusive to Apple GPUs; it has been an active field of academic (and practical) research since GPGPU became a reality. A TLB miss counter can be very valuable when designing your memory access patterns.
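
To make the access-pattern point concrete, here is a rough toy model in Python. It is only a sketch: the 16 KiB page size, the 64-entry capacity, the fully associative LRU policy, and the function name tlb_miss_rate are all assumptions picked for illustration, not a description of how Apple's GPU MMU actually works.

```python
from collections import OrderedDict

PAGE_SIZE = 16 * 1024  # assumed page size in bytes (illustrative only)
TLB_ENTRIES = 64       # assumed TLB capacity (illustrative, not a hardware spec)


def tlb_miss_rate(addresses, page_size=PAGE_SIZE, entries=TLB_ENTRIES):
    """Fraction of accesses that miss in a toy fully associative LRU TLB."""
    tlb = OrderedDict()  # keys are virtual page numbers, ordered by recency
    misses = 0
    total = 0
    for addr in addresses:
        total += 1
        page = addr // page_size
        if page in tlb:
            tlb.move_to_end(page)        # hit: mark as most recently used
        else:
            misses += 1                  # miss: a page-table walk would be needed
            tlb[page] = True
            if len(tlb) > entries:
                tlb.popitem(last=False)  # evict the least recently used entry
    return misses / total if total else 0.0


N = 4096   # number of accesses
ELEM = 4   # bytes per element

# Contiguous pattern: neighbouring accesses share a page.
sequential = [i * ELEM for i in range(N)]
# Page-strided pattern: every access touches a different page.
strided = [i * PAGE_SIZE for i in range(N)]

print("sequential miss rate:", tlb_miss_rate(sequential))
print("strided miss rate:   ", tlb_miss_rate(strided))
```

In this model the contiguous pattern misses once for the whole run, while the page-strided pattern misses on every access. A real GPU has many threads in flight and a multi-level translation hierarchy, so real numbers will look different, but this is the kind of difference a TLB miss rate counter would help you spot.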

joema2 · Jun 17, 2022

leman said:
TLB misses pose a huge problem for GPU performance, since a GPU can launch thousands of memory requests simultaneously. This is not new and not exclusive to Apple GPUs; it has been an active field of academic (and practical) research since GPGPU became a reality. A TLB miss counter can be very valuable when designing your memory access patterns.

Interesting. I see that in "Reducing GPU Address Translation Overhead with Virtual Caching", Yoon (2016) discusses how traditional TLB caching may not work well for integrated CPU/GPU unified memory designs due to poor locality of reference: https://minds.wisconsin.edu/bitstream/handle/1793/75577/TR1842.pdf?sequence=3&isAllowed=y

In that paper he proposes a method of alleviating the bottleneck of MMU TLB misses caused by shared CPU/GPU address translation hardware.

With that in mind, it's understandable why the WWDC2022-10159 session you mentioned advised that Xcode developers improve GPU scalability by monitoring the new MMU-related counters, including MMU TLB miss rate.
