Here's my best analysis of what Dynamic Caching REALLY is.
This is cut from a much larger document describing the Apple GPU in depth, but I think it stands alone enough to explain the relevant points.
resource allocation (2017 and now)
We've now seen that resource allocation is an important part of scheduling, not least in the sense that we cannot start the execution of a new task (whether threadblock or warp) until certain hardware is allocated.
Consider now the allocation of "memory-like" resources, eg Scratchpad storage for a new threadblock, or registers for a new warp.
How might we do this?
The simplest answer is we don't really try. One threadblock gets the entire Scratchpad, each warp gets a hardwired set of registers, and no attempt is made to do better. This may sound dumb, but it is, for example, how AMX handles registers right now, with a dedicated register set for each potential client core in a cluster...
If we want to do better, the next step up is to provide each active threadblock with a base and a length register. The first threadblock sets its base to 0 and its length to whatever was allocated. The next threadblock sets its base to the end of the previous allocation (base plus length), and so on. We keep going until we run out of Scratchpad storage. Each access to Scratchpad adds the nominal address to the base, while simultaneously testing that it is smaller than length, and we're set. Once an allocation fails, we wait until an earlier threadblock completes and releases its range, then try again. There are a variety of ways we can improve on this basic idea (read any discussion of malloc!) but what's worth doing depends on the details: what makes sense if you expect hundreds of thousands of simultaneous allocations is different from if you expect up to maybe eight simultaneous allocations.
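In software terms this is essentially a bump allocator. Here's a minimal sketch (names and types are mine, and I've omitted the release of completed threadblocks' ranges for brevity; the hardware does this with a pair of registers per threadblock, not a data structure):

```cpp
#include <cstdint>
#include <optional>

// Hypothetical sketch of the base/length scheme. Each resident threadblock
// owns a contiguous [base, base+length) slice of Scratchpad.
struct Allocation { uint32_t base, length; };

class ScratchpadAllocator {
    uint32_t capacity_;   // total Scratchpad size
    uint32_t next_ = 0;   // first free address
public:
    explicit ScratchpadAllocator(uint32_t capacity) : capacity_(capacity) {}

    // Returns an allocation, or nothing if we've run out; the scheduler
    // must then wait for an earlier threadblock to release its range.
    std::optional<Allocation> allocate(uint32_t length) {
        if (capacity_ - next_ < length) return std::nullopt;
        Allocation a{next_, length};
        next_ += length;
        return a;
    }
};

// Each Scratchpad access adds the nominal address to base while
// simultaneously bounds-checking it against length (a fault in hardware).
inline uint32_t translate(const Allocation& a, uint32_t addr) {
    return addr < a.length ? a.base + addr : /* fault */ ~0u;
}
```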
This scheme as described is not utterly terrible (and readers may recognize that variants of this were used in early mainframes, and in the very earliest PC operating systems).
If you want to make this work better you might want to add a few elements.
Firstly you might want to aggregate units together. Allocate Scratchpad at the granularity of maybe a "cache line" (64B or 128B?) or even larger, like 1KB. Allocate warp registers at a granularity of maybe 16 registers.
Second you will want a bitmap or something similar to track the free vs in-use granules, along with hardware that can rapidly find runs of bits of a particular length, so that you can easily find a run of three (or whatever) bits set to 1 to match a three-granule request.
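A toy version of that bitmap, with a naive run search, might look like this (the hardware, of course, evaluates all positions in parallel rather than scanning; the class and method names are mine):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical free-granule bitmap: true = free. Assumes count >= 1.
class GranuleBitmap {
public:
    explicit GranuleBitmap(size_t granules) : free_(granules, true) {}

    // Find a run of `count` consecutive free granules; -1 if none exists.
    int find_run(size_t count) const {
        size_t run = 0;
        for (size_t i = 0; i < free_.size(); ++i) {
            run = free_[i] ? run + 1 : 0;
            if (run == count) return int(i + 1 - count);
        }
        return -1;
    }

    // Flip `count` granules starting at `start` between free and in-use.
    void mark(size_t start, size_t count, bool free) {
        for (size_t i = start; i < start + count; ++i) free_[i] = free;
    }

private:
    std::vector<bool> free_;
};
```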
At this point you have an allocator adequate for small purposes, and this is essentially the content of (2017):
https://patents.google.com/patent/US10698687B1 Pipelined resource allocation using size aligned allocation.
The specific detail being patented is how, after a few rounds of allocation and deallocation have been performed, you choose where to perform the next allocation. For example, options you might imagine include:
- allocate from the smallest free block that meets the requirements
- allocate from the largest free block that meets the requirements
- allocate from the "oldest" (in some sense) or "youngest" free block that meets the requirements
- allocate round-robin style (ie just keep moving left till you find a block that's big enough)
What Apple does is none of these, but it should seem familiar from some malloc designs: allocate blocks at "alignment" granularity.
So if we want to allocate a block that is, say, 5 granules in size, we will look for runs of 5 consecutive free granules starting at locations 0, 5, 10, 15, ...
If we find more than one such location, we will choose the smallest fit.
This behaves somewhat like a malloc that keeps separate free lists for blocks of different sizes, without requiring the overhead of such a design.
This scheme is implemented in hardware, and performs the allocation (or reports a failure) in one cycle. Not bad!
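For concreteness, here's roughly what the size-aligned policy might look like as (slow, sequential) software. The function name is mine, and I'm guessing that "smallest fit" means the candidate sitting inside the smallest enclosing free run; the hardware evaluates all candidate positions in parallel:

```cpp
#include <cstddef>
#include <vector>

// Sketch of the size-aligned policy (names and tie-break mine, not Apple's):
// try start positions 0, count, 2*count, ...; among positions whose `count`
// granules are all free, pick the one inside the smallest enclosing free run.
int allocate_size_aligned(const std::vector<bool>& free_bits, size_t count) {
    if (count == 0) return -1;
    int best = -1;
    size_t best_run = free_bits.size() + 1;
    for (size_t start = 0; start + count <= free_bits.size(); start += count) {
        bool ok = true;
        for (size_t i = start; i < start + count; ++i)
            if (!free_bits[i]) { ok = false; break; }
        if (!ok) continue;
        // Measure the enclosing free run for the "smallest fit" tie-break.
        size_t lo = start, hi = start + count;
        while (lo > 0 && free_bits[lo - 1]) --lo;
        while (hi < free_bits.size() && free_bits[hi]) ++hi;
        if (hi - lo < best_run) { best_run = hi - lo; best = int(start); }
    }
    return best;  // -1 if no aligned run exists
}
```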
This is far from the end of the line, however! Suppose you want to do better. The big problem with the design above is that we get fragmentation. We may have three granules free, but not contiguous.
The first step to a more powerful design is to introduce a level of indirection. Add something like a TLB that maps a virtual granuleID to a physical granuleID. Now we no longer need to care about contiguity; a three-granule request merely needs to find three free bits anywhere in the bitmap, and set the appropriate mappings from (sequential) virtualIDs to (non-sequential) physicalIDs. And every granule access goes through the TLB lookup. This is, of course, basic VM and should be very familiar.
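A toy remap table (all names mine; the real thing is presumably a small content-addressed structure per core):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint16_t kInvalid = 0xFFFF;

// Hypothetical per-task remap table: virtual granule IDs are dense
// (0, 1, 2, ...); the physical granules backing them can be anywhere
// in the SRAM pool.
struct RemapTable {
    std::vector<uint16_t> phys;   // indexed by virtual granule ID

    explicit RemapTable(size_t n) : phys(n, kInvalid) {}

    // A three-granule request just needs any three free physical granules:
    // map sequential virtual IDs to whatever physical IDs the bitmap found.
    void map(uint16_t virt, uint16_t physical) { phys[virt] = physical; }

    // Every granule access goes through this lookup (the "TLB").
    uint16_t translate(uint16_t virt) const { return phys[virt]; }
};
```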
The next step is to also allow oversubscription. Now we no longer even require that there be three free physical granules, we simply hand out three virtualIDs, all mapped to "invalid" in the TLB. To make this work, we also need
- a faulting mechanism, so that when access is made to an "invalid" entry, some sort of hardware kicks in to perform a physical allocation
- a means to handle oversubscription, ie a policy for what to do if there is no physical allocation available
For memory-like resources, both of these are fairly easy, in the sense that you simply need to copy some memory out to alternative storage and reallocate the block that was copied out. Obviously you know how this works for standard virtual memory, and we can do the same thing here. For example, if the resource of interest is Scratchpad, we can copy a granule of Scratchpad storage out to the L1D, mark that granule as invalid in the mapping of its current user, set it up as valid for the new user, and then keep going.
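Sketching that fault path for Scratchpad (every helper here is an assumed stand-in for whatever mechanism the hardware actually uses):

```cpp
#include <cstdint>

// Assumed stand-ins for real hardware mechanisms:
int  find_free_physical_granule();            // scan the free bitmap
struct Victim { uint16_t physical; };         // plus owner + virtual ID
Victim pick_victim();                         // choose a granule to evict
void spill_to_L1D(uint16_t physical);         // copy granule contents out
void invalidate_mapping(uint16_t physical);   // old owner now faults on access
void install_mapping(uint16_t virt, uint16_t physical);  // for faulting task

// Hypothetical fault path: on access to an "invalid" entry, back it with a
// free granule if one exists; otherwise spill a victim's granule to the L1D
// and steal it.
uint16_t handle_granule_fault(uint16_t virt) {
    int phys = find_free_physical_granule();
    if (phys < 0) {
        Victim v = pick_victim();
        spill_to_L1D(v.physical);         // its data now lives in the L1D
        invalidate_mapping(v.physical);   // the old owner will re-fault
        phys = v.physical;
    }
    install_mapping(virt, uint16_t(phys));
    return uint16_t(phys);
}
```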
Just like with VM, you can then use various heuristics to ensure that this swapping does not get out of control.
For a GPU two obvious such heuristics are
- we already know which tasks may be inactive waiting on either RAM or a synchronization event, and obviously it makes more sense to reallocate from them than from an active task
- we also expect that, generally, warps and threadblocks will tend to behave in a similar fashion. So we can do something like
+ for the first round of warps of a threadblock, begin them with just one of their register granule allocations hooked up to a physical allocation and see what happens. If no faults are generated, we are fine! If faults are generated, then for the next round of warps, try to ensure that two register granules (or however many are required) are hooked up before starting the warps.
Of course the details of exactly how aggressively you oversubscribe will depend on the cost of re-allocation faults.
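As a toy version of that warm-up heuristic (the feedback rule here is pure speculation on my part):

```cpp
#include <algorithm>
#include <cstdint>

// Speculative sketch: start warps with few register granules pre-mapped,
// and grow that number for later rounds if the earlier rounds faulted.
struct WarmupPolicy {
    uint32_t premapped = 1;   // granules hooked up before a warp starts
    uint32_t maximum;         // the warp's full architectural allocation

    explicit WarmupPolicy(uint32_t max_granules) : maximum(max_granules) {}

    void observe_round(uint32_t faults_per_warp) {
        if (faults_per_warp > 0)
            premapped = std::min(premapped + faults_per_warp, maximum);
        // No faults: keep the current, cheaper setting for the next round.
    }
};
```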
This next technology stage was introduced with the A17/M3 GPU and is given the brand name Dynamic Caching by Apple. It is especially valuable for "unpredictable" shaders. For example a shader may have one common (data dependent) path that requires only a small number of registers or Scratchpad, and a different, rare (data dependent) path that requires much more. The shader has to allocate the full possible set of registers or Scratchpad storage it might require, but usually will not use most of its allocation. By only allocating resources at the point they are truly required, we can frequently pack more threadblocks and warps onto each core, meaning fewer dead cycles when the core simply cannot find a warp to schedule.
However Apple's scheme is even more ambitious!
What I've described above is a scheme that's basically a simple remap table per core, with the ability to fault on certain accesses.
Compare this to the virtual memory system of a modern SoC. Two differences are
- The remapping is common across the entire SoC. There are multiple local remap tables (ie TLBs), but these are all just caches of a single system-wide collection of remappings that we collectively call the "page tables".
- To deal with the size of all the remappings, the "page tables" have a hierarchical structure which lends itself to multiple levels of caching.
This same sort of machinery is duplicated across the GPU. Superficially it looks like the CPU page tables (because of course it's designed using the same ideas), but it's not a remapping of multiple process virtual address spaces into physical DRAM; rather it's a remapping of multiple GPU-specific address spaces (eg the Register address space of each Warp, or the Scratchpad address space of each Threadblock) into a common address space which refers to SRAM storage in the GPU.
We could make this common address space simply refer to SRAM on the GPU, but Apple go one stage further. They define the target address space not as physical SRAM on the GPU but as the device-wide virtual address space of the relevant process. Thus, in principle, a register lookup translates the (warpID, register number) via a GPU-specific TLB into an (ASID, virtual address), which in turn is translated via a system-wide TLB into a physical address.
That's obviously clumsy if taken literally, and obviously there will be short-circuiting (for example, virtual addresses with a particular set of high bits can be routed directly to GPU SRAM), but it allows for expansion into the main memory system as required. For example, any oversubscription of GPU SRAM resources can change the mappings to route to the "real" virtual address space and thus be backed by the SLC and DRAM. (If/when Apple start doing crazy things like expanding into high-end systems, you can also use this sort of technology to provide a distributed virtual address space across multiple computers, in a way that simultaneously allows for transparent growth of the GPUs.)
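Written out as pseudocode, the full, un-short-circuited path might look like this (all structures hypothetical, inferred from the patent):

```cpp
#include <cstdint>

struct GpuAddress  { uint32_t warp_id; uint32_t reg_number; };
struct VirtAddress { uint32_t asid; uint64_t va; };

// Stage 1 (assumed): a GPU-specific TLB maps (warpID, register number)
// into the process's device-wide virtual address space.
VirtAddress gpu_tlb_lookup(GpuAddress a);

// Stage 2 (assumed): the ordinary system-wide TLB maps (ASID, VA) to a
// physical address. In practice, VAs with a particular set of high bits
// would be routed straight to GPU SRAM, short-circuiting this stage.
uint64_t system_tlb_lookup(VirtAddress v);

uint64_t translate_register(GpuAddress a) {
    return system_tlb_lookup(gpu_tlb_lookup(a));
}
```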
This scheme (virtualize resources, and handle oversubscription by faulting) is described in (2020):
https://patents.google.com/patent/US20210271606A1 On-demand Memory Allocation. The patent as described would appear to apply to literal memory spaces (obviously Scratchpad, and also the private address space used by Ray Tracing), but experimental investigation suggests that it is more abstract and may apply to other resources; specifically, it does appear to apply to warp registers. In other words you should think of this as a scheme for virtualization, generically considered, with resource granules sized as appropriate (perhaps 1kB or 4kB for Scratchpad storage? perhaps 16 or 32 registers?), rather than as an extension of the "standard" paging system forced to use 16kB pages.
Once you have a scheme that looks like VM, you can in principle use it in the same ways as VM!
The patent suggests three possible extensions to how this might be used.
First suppose you need to move work from one GPU core to another. In principle each task (at the appropriate granularity: warp, threadblock, kernel, whatever) could use its taskID as an ASID, and you simply move the task from one core to another. The first access by the new core would miss (ASID+address) in its TLB, so the TLB would be filled in from the central TLB. Then the reference to the address (say a register) would miss in the new core, and the register data would be moved from SRAM in the previous core to SRAM in the new core. Very much like moving a process on a CPU. This seems to be what Apple ultimately envision, and you might want to do this for all the reasons you might want to move a process, from load-balancing across cores to consolidating work on a few cores and powering down the rest.
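A speculative sketch of that migration flow (the helpers are stand-ins; the point is that no bulk copy happens at migration time):

```cpp
#include <cstdint>

struct Task { uint32_t asid; };   // taskID doubling as an ASID

// Assumed stand-ins for hardware/runtime mechanisms:
void flush_local_tlb(int core, uint32_t asid);   // drop stale local entries
void enqueue_on_core(int core, Task& t);

// The first access on the new core misses in its local TLB (refilled from
// the central tables), then faults on the granule itself, pulling the data
// (eg register contents) across from the old core's SRAM on demand.
void migrate_task(Task& t, int from_core, int to_core) {
    flush_local_tlb(from_core, t.asid);
    enqueue_on_core(to_core, t);
}
```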
Second you can share "memory" between tasks. We can already share registers between the lanes of a warp in the form of uniforms, but a uniform has to be reinitialized for each warp. You could imagine something like a threadblock, or even an entire kernel, setting a block of registers to generally useful values, with that block then "mapped into" every warp when the warp is activated. This both avoids the initialization instructions and allows the storage of these registers to be shared across warps rather than allocated per warp.
Something similar may be useful across threadblocks. You could even do things like Copy-on-Write, though it's unclear how useful this might be.
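One way to picture the shared-block idea, as a speculative sketch: map the same physical granules read-only into every warp's remap table, with copy-on-write kicking in if a warp ever writes to them:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint16_t kInvalid = 0xFFFF;

// Speculative: one mapping entry per virtual granule, with a read-only
// bit so shared granules can be handled copy-on-write.
struct Mapping { uint16_t physical = kInvalid; bool read_only = false; };
using WarpTable = std::vector<Mapping>;

// Map the same physical granules (holding pre-initialized registers) into
// every warp's table: one physical copy, visible to all warps, and no
// per-warp initialization instructions.
void map_shared_block(std::vector<WarpTable>& warps,
                      const std::vector<uint16_t>& shared_granules,
                      uint16_t first_virt) {
    for (WarpTable& table : warps)
        for (size_t i = 0; i < shared_granules.size(); ++i)
            table[first_virt + i] = {shared_granules[i], /*read_only=*/true};
}
```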
Of course it will take changes to MSL and the compiler to make these ideas visible, but we can imagine them coming; the primary constraint on how powerful they might be is the extent to which "faulting" is a GPU-wide notification versus a core-wide notification.
Third you can consolidate all storage in a single physical pool of SRAM, with this "page" indirection determining whether the storage is used by Scratchpad, Ray Tracing, Texture, Registers, or anything else; and you could even have the indirection route to the SRAM pool on a different core.
As we have seen, nVidia has gone some distance down this path: a common address space, somewhat common SRAM pools, and the ability for threadblocks on one core to use Scratchpad storage on another core.
Perhaps Apple ultimately plans to go even further (eg making registers part of the common storage pool?)
The patent as described by Apple is extremely ambitious. It is likely that elements of it will roll out over the next few years, and that the version we have right now in the M3, as of 2023, is the simplest possible version (remapping and oversubscription within a single core, but not yet a common cross-core address space, moving address spaces, shared address spaces, etc).