
aytan

macrumors regular
Dec 20, 2022
161
110
I am torn between upgrading my M1 Ultra to a 16-inch M3 Max or waiting another 6 to 8 months for an M3 Ultra. I will decide after the first real-life results. Maybe we will see some benchmarks next week.
I do not have good memories of the M1 Max Studio while working on 3D projects. The M1 Max Studio did the job at the time, but the Ultra is a completely different machine.

Sure, hardware ray tracing is a big thing, but I have to see results from the Redshift benchmark. And the Redshift team already has some issues with Sonoma; you can find these posts in the Redshift forums.

"macOS Sonoma 14 is supported on Apple Silicon devices, however there are currently issues using Sonoma with x86/AMD hardware which we are working with Apple to resolve."

And finally, Redshift released its latest version a couple of hours ago:

"Apple Engineers have helped solve a lot of long standing issues recently with MacOS in the release today of Redshift 3.5.21."

Also, Maxon released M3 support for Cinebench a few hours ago.

Everything looks planned and executed on time, both by Apple and Maxon.

However, once you get any new M3 Mac, you cannot downgrade your OS, so waiting a couple of weeks could be a wise choice. Otherwise you could be stuck with a brand new M3 Mac and old software that does not work properly. Something similar happened with the M2 Ultra, and the problems could not be solved for a couple of weeks.

I hope performance will increase at least 50% between M1 and M3 in the major 3D software. Otherwise any recent M1/M2 Ultra will rule for another year until the M3 Ultra arrives.
 
  • Like
Reactions: ader42

aytan

macrumors regular
Dec 20, 2022
161
110
While dynamic caching is kind of fuzzy at the moment I think this article attempts to explain it pretty well

Thanks for the link. I am not sure whether "Dynamic Caching" will work and deliver better results right now. There could be an Adobe issue until they upgrade their whole set of software for that shiny new "Dynamic Caching" thing. I guess After Effects will not like "Dynamic Caching" :).
 

aytan

macrumors regular
Dec 20, 2022
161
110
There is a new post on the Redshift forum about M3 Max performance:


Assuming you extrapolate the 2.5x improvement over the M1 Max in Cinebench stated in Apple's presentation: the M1 Max Cinebench score is ~4,332, and 4,332 * 2.5 ≈ 10,830.


Then you can assume the score will be near ~11k when using the new M3 Max and its new RT-core ray-tracing acceleration. That puts it at roughly a third of a full desktop 4090 in Cinebench. A pretty impressive jump in one generation, especially for such a slim laptop with low energy consumption. So it looks likely to land between an Nvidia 4070 and 4060.


So not bad at all for a laptop achieving 4070-class performance. I can compare against the results from the systems I currently own, an M1 Ultra and a 4070 PC. That gives an estimated 40-50% performance boost for the M3 Max over my M1 Ultra.
 
  • Like
Reactions: komuh and sirio76

NT1440

macrumors Pentium
May 18, 2008
15,092
22,158
Thanks for the link. I am not sure whether "Dynamic Caching" will work and deliver better results right now. There could be an Adobe issue until they upgrade their whole set of software for that shiny new "Dynamic Caching" thing. I guess After Effects will not like "Dynamic Caching" :).
Apple claimed it was “transparent to developers”. I can’t make out if that means it’s easy to take advantage of in code, or if it means the handling of it is completely automated by the OS.

If it’s in developer hands, yeah, don’t ever count on Adobe to put in the minimal effort to support it. They are not that company.
 
  • Like
Reactions: aytan

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
Apple claimed it was “transparent to developers”. I can’t make out if that means it’s easy to take advantage of in code, or if it means the handling of it is completely automated by the OS.
I think it’s even transparent to the OS.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
Apple claimed it was “transparent to developers”. I can’t make out if that means it’s easy to take advantage of in code, or if it means the handling of it is completely automated by the OS.

If it’s in developer hands, yeah, don’t ever count on Adobe to put in the minimal effort to support it. They are not that company.

Well, it’s very interesting to look at the Blender patch Apple has submitted for M3.


This is code that sets up optimal compute group sizes (number of GPU threads per shader invocation etc.) for various parts of the GPU renderer. On previous models, they used complicated looking number combinations determined by empirical testing. On M3, they just set everything to 64 (which is a rather small dispatch size!!) with a comment saying that Dynamic Caching in the GPU will take care of the rest. It’s quite impressive, if true. It would also suggest that optimizing GPU code for occupancy will be much easier.
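To make that concrete, here is a rough C++ sketch of what such a per-generation group-size table might look like. The type and function names (GPUGeneration, kernel_group_size) and the pre-M3 numbers are purely illustrative; this is not the actual Cycles/Metal code, only the shape of the change.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical sketch: choosing a threadgroup size per kernel and GPU generation.
// On older GPUs the "optimal" sizes were found by empirical tuning; on M3-class
// hardware Dynamic Caching is assumed to make one small uniform size good enough.
enum class GPUGeneration { M1, M2, M3 };
enum class Kernel { IntegratorShadeSurface, IntegratorIntersectClosest, FilmConvert };

int kernel_group_size(GPUGeneration gen, Kernel k) {
    if (gen == GPUGeneration::M3) {
        // With Dynamic Caching, occupancy is assumed to be managed by the hardware,
        // so everything is dispatched with the same small group size.
        return 64;
    }
    // Placeholder values standing in for per-kernel empirical tuning on M1/M2.
    switch (k) {
        case Kernel::IntegratorShadeSurface:     return 512;
        case Kernel::IntegratorIntersectClosest: return 1024;
        case Kernel::FilmConvert:                return 256;
    }
    return 256;
}

int main() {
    std::cout << "M1 shade_surface group size: "
              << kernel_group_size(GPUGeneration::M1, Kernel::IntegratorShadeSurface) << "\n";
    std::cout << "M3 shade_surface group size: "
              << kernel_group_size(GPUGeneration::M3, Kernel::IntegratorShadeSurface) << "\n";
}
```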
 

APCX

Suspended
Sep 19, 2023
262
337
Well, it’s very interesting to look at the Blender patch Apple has submitted for M3.


This is code that sets up optimal compute group sizes (number of GPU threads per shader invocation etc.) for various parts of the GPU renderer. On previous models, they used complicated looking number combinations determined by empirical testing. On M3, they just set everything to 64 (which is a rather small dispatch size!!) with a comment saying that Dynamic Caching in the GPU will take care of the rest. It’s quite impressive, if true. It would also suggest that optimizing GPU code for occupancy will be much easier.
Very curious to see the real world results from these advances.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
@leman Did you miss the GPU dynamic caching patent?

 
  • Like
Reactions: komuh

leman

macrumors Core
Oct 14, 2008
19,521
19,674
@leman Did you miss the GPU dynamic caching patent?

Ah, that’s what this thing was for! I remember @name99 was discussing it a while ago…
 
  • Wow
Reactions: Xiao_Xi

ggCloud

Suspended
Oct 29, 2023
34
40
I am torn between upgrading my M1 Ultra to a 16-inch M3 Max or waiting another 6 to 8 months for an M3 Ultra. […]
Same for Blender. V4.0 will have RT support for M3. Coming next week
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Same for Blender. V4.0 will have RT support for M3. Coming next week
From the Blender 4.0 release notes:
The new Apple M3 processor supports hardware ray-tracing. Cycles takes advantage of this by default, with the MetalRT in the GPU device preferences set to "Auto".

For the M1 and M2 processors MetalRT is no longer an experimental feature, and fully supported now. However it is off by default, as Cycles' own intersection code still has better performance when there is no hardware ray-tracing.
Metal AMD GPU rendering has a significant performance regression in scenes that use the Principled Hair BSDF, in addition to existing lack of support for the light tree and shadow caustics. Due to GPU driver limitations that are unlikely to be fixed, it is expected that AMD GPU rendering on macOS will be disabled entirely in a future release.
 
Last edited:
  • Wow
Reactions: diamond.g

name99

macrumors 68020
Jun 21, 2004
2,410
2,315
Here's my best analysis of what Dynamic Caching REALLY is.
This is cut from a much larger document describing the Apple GPU in depth, but I think it stands alone enough to explain the relevant points.



resource allocation (2017 and now)

We've now seen that resource allocation is an important part of scheduling, not least in the sense that we cannot start the execution of a new task (whether threadblock or warp) until certain hardware is allocated.
Consider now the allocation of "memory-like" resources, eg Scratchpad storage for a new threadblock, or registers for a new warp.
How might we do this?

The simplest answer is we don't really try. One threadblock gets the entire scratchpad, each warp gets a hardwired set of registers, no attempt to do better. This may sound dumb but it is, for example, right now the way AMX handles registers, with a dedicated register set for each potential client core in a cluster...

If we want to do better, the next step up is to provide each active threadblock with a base and a length register. The first threadblock sets its base to 0 and its length to whatever was allocated. The next threadblock sets its base to the end of the previous allocation, and so on. We keep going until we run out of Scratchpad storage. Each access to Scratchpad adds the nominal address to the base, while simultaneously testing that it is smaller than length, and we're set. Once an allocation fails, we wait until an earlier threadblock completes and releases its range, then try again. There are a variety of ways we can improve on this basic idea (read any discussion of malloc!) but what's worth doing depends on the details: what makes sense if you expect hundreds of thousands of simultaneous allocations is different from if you expect up to maybe eight simultaneous allocations.
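To make this concrete, here is a toy C++ sketch of the base-and-length scheme; all names and the bounds-checking details are illustrative, not taken from any real driver.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

// Illustrative base+length scratchpad allocator: threadblocks carve out
// contiguous ranges in allocation order and release them when they finish.
struct Range { uint32_t base; uint32_t length; };

class BumpScratchpad {
public:
    explicit BumpScratchpad(uint32_t total) : total_(total) {}

    // Try to allocate `length` units after the highest in-use range.
    std::optional<Range> allocate(uint32_t length) {
        uint32_t base = next_free();
        if (base + length > total_) return std::nullopt;  // caller waits for a release
        live_.push_back({base, length});
        return live_.back();
    }

    void release(Range r) {
        for (std::size_t i = 0; i < live_.size(); ++i)
            if (live_[i].base == r.base) { live_.erase(live_.begin() + i); return; }
    }

    // Bounds-checked access: nominal address added to base, tested against length.
    uint32_t translate(Range r, uint32_t nominal) const {
        return nominal < r.length ? r.base + nominal : UINT32_MAX;  // out-of-range marker
    }

private:
    uint32_t next_free() const {
        uint32_t end = 0;
        for (const Range& r : live_) end = std::max(end, r.base + r.length);
        return end;
    }
    uint32_t total_;
    std::vector<Range> live_;
};
```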
This scheme as described is not utterly terrible (and readers may recognize that variants of this were used in early mainframes, and in the very earliest PC operating systems).
If you want to make this work better you might want to add a few elements.
Firstly you might want to aggregate units together. Allocate Scratchpad maybe at the granularity of a "cache line" (64B or 128B?) or even larger like 1KB. Allocate warp registers at a granularity of maybe 16.
Second you will want a bitmap or something similar to track the free vs in-use granules. And hardware that can rapidly find runs of bits of a particular length, so that you can easily find a run of three (or whatever) bits set to 1, to match a three granule request.

At this point you have an allocator adequate for small purposes, and this is essentially the content of (2017) https://patents.google.com/patent/US10698687B1 Pipelined resource allocation using size aligned allocation.
The specific detail being patented is how, after a few rounds of allocation and deallocation have been performed, you choose where to perform the next allocation. For example, options you might imagine include:
- allocate from the smallest free block that meets the requirements
- allocate from the largest free block that meets the requirements
- allocate from the "oldest" (in some sense) or "youngest" free block that meets the requirements
- allocate round-robin style (ie just keep moving left till you find a block that's big enough)
What Apple does is none of these, but should seem familiar from some malloc designs. Allocate blocks by "alignment" granularity.
So if we want to allocate a block that is say 5 units in size, we will look for runs of 5 consecutive free blocks, at locations 0, 5, 10, 15, ...
If we find more than one such location, we will choose the smallest fit.
This behaves somewhat like a malloc that keeps separate free lists for blocks of different sizes; while not requiring the overhead of such a design.
This scheme is implemented in hardware, and performs the allocation (or reports a failure) in one cycle. Not bad!
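A toy C++ sketch of size-aligned allocation over a granule bitmap, purely to illustrate the idea; the real thing is a single-cycle hardware circuit, and the granule count and smallest-fit handling here are simplified assumptions.

```cpp
#include <bitset>
#include <cstddef>
#include <optional>

// Toy model of size-aligned granule allocation: a request of n granules is only
// placed at offsets that are multiples of n, which behaves a bit like keeping
// separate free lists per size class without the bookkeeping overhead.
constexpr std::size_t kGranules = 64;  // arbitrary pool size for illustration

class SizeAlignedAllocator {
public:
    // Returns the starting granule index, or nullopt if no aligned run is free.
    std::optional<std::size_t> allocate(std::size_t n) {
        if (n == 0 || n > kGranules) return std::nullopt;
        for (std::size_t base = 0; base + n <= kGranules; base += n) {
            bool free_run = true;
            for (std::size_t i = base; i < base + n; ++i)
                if (in_use_[i]) { free_run = false; break; }
            if (free_run) {
                for (std::size_t i = base; i < base + n; ++i) in_use_[i] = true;
                return base;  // hardware would also apply the smallest-fit tie-break
            }
        }
        return std::nullopt;  // allocation failure reported to the scheduler
    }

    void release(std::size_t base, std::size_t n) {
        for (std::size_t i = base; i < base + n; ++i) in_use_[i] = false;
    }

private:
    std::bitset<kGranules> in_use_;  // the free/in-use bitmap from the text
};
```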

This is far from the end of the line, however! Suppose you want to do better. The big problem with the design above is that we get fragmentation. We may have three granules free, but not contiguous.
The first step to a more powerful design is you introduce a level of indirection. Add something like a TLB that maps a virtual granuleID to a physical granuleID. Now we no longer need to care about contiguity; a three granule request merely needs to find three free bits anywhere in the bitmap, and set the appropriate (sequential) mappings from the virtualID to the (non-sequential) physicalID. And every granule access goes through the TLB lookup. This is, of course, basic VM and should be very familiar.
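Continuing the toy model, here is a C++ sketch of that indirection step: sequential virtual granule IDs mapped through a small remap table onto whatever physical granules happen to be free. Again, names and sizes are invented for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Toy remap table: a contiguous run of virtual granule IDs is backed by
// whichever physical granules are free, so fragmentation no longer matters.
constexpr std::size_t kPhysGranules = 64;
constexpr uint32_t kInvalid = UINT32_MAX;

class GranuleRemapper {
public:
    GranuleRemapper() : phys_free_(kPhysGranules, true) {}

    // Allocate n virtual granules; each maps to any free physical granule.
    std::optional<uint32_t> allocate(uint32_t n) {
        std::vector<uint32_t> picked;
        for (uint32_t p = 0; p < kPhysGranules && picked.size() < n; ++p)
            if (phys_free_[p]) picked.push_back(p);
        if (picked.size() < n) return std::nullopt;  // not enough physical storage
        uint32_t first_virtual = static_cast<uint32_t>(map_.size());
        for (uint32_t p : picked) { phys_free_[p] = false; map_.push_back(p); }
        return first_virtual;                        // virtual IDs are sequential
    }

    // Every access goes through the lookup, exactly like a TLB hit.
    uint32_t translate(uint32_t virt) const {
        return virt < map_.size() ? map_[virt] : kInvalid;
    }

private:
    std::vector<uint32_t> map_;    // virtual granule ID -> physical granule ID
    std::vector<bool> phys_free_;  // free/in-use bitmap over physical granules
};
```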

The next step is to also allow oversubscription. Now we no longer even require that there be three free physical granules, we simply hand out three virtualIDs, all mapped to "invalid" in the TLB. To make this work, we also need
- a faulting mechanism, so that when access is made to an "invalid" entry, some sort of hardware kicks in to perform a physical allocation
- a means to handle oversubscription, ie what to do if there is no physical allocation available?
For memory-like resources, both of these are fairly easy in the sense that you simply need to copy out some memory to some alternative storage, and reallocate the block that was copied out. Obviously you know how this works for standard virtual memory, and we can do the same thing here. For example if the resource of interest is Scratchpad, we can copy a granule of Scratchpad storage out to the L1D, reallocate that granule as invalid in the mapping for the current user, and set up as valid for the new user, and then keep going.
Just like with VM, you can then use various heuristics to ensure that this swapping does not get out of control.
For a GPU two obvious such heuristics are
- we already know which tasks may be inactive waiting on either RAM or a synchronization event, and obviously it makes more sense to reallocate from them than from an active task
- we also expect that, generally, warps and threadblocks will tend to behave in a similar fashion. So we can do something like
+ for the first round of warps of a threadblock, begin them with just one of their register granule allocations hooked up to a physical allocation and see what happens. If no faults are generated, we are fine! If faults are generated, then for the next round of warps, try to ensure that two register granules (or however many are required) are hooked up before starting the warps.
Of course the details of exactly how aggressively you oversubscribe will depend on the cost of re-allocation faults.
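A final piece of the toy model: a C++ sketch of oversubscription with fault-driven backing. Virtual granules start unmapped, a fault on first touch grabs a free physical granule or spills an existing one, and the FIFO eviction here is just a stand-in for the smarter heuristics above.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Toy oversubscription model: virtual granules are handed out with no physical
// backing; the first access faults and either takes a free physical granule or
// evicts one. A real design would spill the evicted granule to the L1D / memory
// hierarchy and prefer victims belonging to stalled tasks.
constexpr uint32_t kPhysGranules = 8;
constexpr uint32_t kInvalid = UINT32_MAX;

struct VirtualGranule { uint32_t phys = kInvalid; bool spilled = false; };

class OversubscribedPool {
public:
    OversubscribedPool() { for (uint32_t p = 0; p < kPhysGranules; ++p) free_.push_back(p); }

    // Hand out a virtual granule immediately; no physical backing yet.
    uint32_t allocate_virtual() { table_.push_back({}); return static_cast<uint32_t>(table_.size() - 1); }

    // Access path: fault in a physical granule on demand.
    uint32_t access(uint32_t virt) {
        VirtualGranule& g = table_[virt];
        if (g.phys != kInvalid) return g.phys;  // fast path: already mapped
        if (free_.empty()) evict_one();         // oversubscribed: spill someone
        g.phys = free_.front(); free_.pop_front();
        g.spilled = false;
        resident_.push_back(virt);
        return g.phys;
    }

private:
    void evict_one() {
        uint32_t victim = resident_.front(); resident_.pop_front();
        // Real hardware would copy the victim granule out to alternative storage here.
        table_[victim].spilled = true;
        free_.push_back(table_[victim].phys);
        table_[victim].phys = kInvalid;
    }
    std::vector<VirtualGranule> table_;  // virtual granule table, "invalid" = unmapped
    std::deque<uint32_t> free_;          // free physical granules
    std::deque<uint32_t> resident_;      // FIFO of currently backed virtual granules
};
```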

This next technology stage was introduced with the A17/M3 GPU and is given the brand name Dynamic Caching by Apple. It is especially valuable for "unpredictable" shaders. For example a shader may have one common (data dependent) path that only requires a small number of registers or Scratchpad, and a different rare (data dependent) path. The shader has to allocate the full possible set of registers or Scratchpad storage it might require, but usually will not use most of its allocation. By only allocating resources at the point they are truly required, we can frequently pack more threadblocks and warps onto each core, meaning fewer dead cycles when the core simply cannot find a warp to schedule.
However Apple's scheme is even more ambitious!
What I've described above is a scheme that's basically a simple remap table per core, with the ability to fault on certain accesses.
Compare this to the virtual memory system of a modern SoC. Two differences are
- The remapping is common across the entire SoC. There are multiple local remap tables (ie TLBs) but these are all caches of a single system wide collection of remappings that we might collectively call the "page tables". The TLBs are local caches of this entire set of remappings.
- To deal with the size of all the remappings, the "page tables" have a hierarchical structure which lends itself to multiple levels of caching.
This same sort of machinery is duplicated across the GPU. Superficially it looks like the CPU page tables (because of course it's designed using the same ideas), but it's not a remapping of multiple process virtual address spaces into physical DRAM; rather it's a remapping of multiple GPU-specific address spaces (eg the Register address space of each Warp, or the Scratchpad address space of each Threadblock) into a common address space which refers to SRAM storage in the GPU.
We could make this common address space storage just refer to SRAM on the GPU, but Apple go one stage further. They define the target address space not as physical SRAM on the GPU but as the device-wide virtual address space for the relevant process. Thus, in principle, a register lookup translates the (warpID, register number) via a GPU-specific TLB into an (ASID, virtual address) which in turn is translated via a system-wide TLB into a physical address.
That's obviously clumsy if taken literally, and obviously there will be short-circuiting (for example virtual addresses with a particular set of high bits can be routed directly to GPU SRAM), but it allows for expansion into the main memory system as required. For example any oversubscription of GPU SRAM resources can change the mappings to route to the "real" virtual address space and thus be backed by SLC, and DRAM. (If/when Apple start doing crazy things like expanding into high-end systems, you can also use this sort of technology to provide a distributed virtual address space across multiple computers in a way that simultaneously allows for transparent growth of the GPUs.)
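Here is a deliberately speculative C++ sketch of that two-level lookup, (warpID, register) to a process virtual address to a physical location, with a short-circuit for a reserved GPU-SRAM window. Every name and constant is invented; this only illustrates my reading of the patent, not anything Apple has documented.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Speculative two-level translation sketch: a GPU-local table turns
// (warp, register) into a process virtual address; addresses in a reserved
// high-bits window route straight to on-GPU SRAM, anything else falls through
// to the ordinary system translation. The window constant is purely invented.
constexpr uint64_t kGpuSramWindow = 0xF000'0000'0000ULL;

struct GpuKey { uint32_t warp_id; uint32_t reg; };
struct GpuKeyHash {
    std::size_t operator()(const GpuKey& k) const {
        return (static_cast<std::size_t>(k.warp_id) << 16) ^ k.reg;
    }
};
inline bool operator==(const GpuKey& a, const GpuKey& b) {
    return a.warp_id == b.warp_id && a.reg == b.reg;
}

struct Translation { uint64_t address; bool in_gpu_sram; };

class TwoLevelTranslator {
public:
    void map_register(uint32_t warp, uint32_t reg, uint64_t vaddr) {
        gpu_tlb_[{warp, reg}] = vaddr;
    }

    Translation translate(uint32_t warp, uint32_t reg) const {
        uint64_t vaddr = gpu_tlb_.at({warp, reg});          // level 1: GPU-specific TLB
        if ((vaddr & kGpuSramWindow) == kGpuSramWindow)      // short-circuit to GPU SRAM
            return {vaddr & ~kGpuSramWindow, true};
        return {system_walk(vaddr), false};                  // level 2: normal system VM
    }

private:
    static uint64_t system_walk(uint64_t vaddr) {
        // Stand-in for the system-wide TLB / page-table walk.
        return vaddr + 0x1000;  // fake "physical" address, for illustration only
    }
    std::unordered_map<GpuKey, uint64_t, GpuKeyHash> gpu_tlb_;
};
```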

This scheme (virtualize resources, and handle oversubscription by faulting) is described in (2020) https://patents.google.com/patent/US20210271606A1 On-demand Memory Allocation. The patent as described would appear to apply to literal memory spaces (obviously Scratchpad, and also the private address space used by Ray Tracing) but experimental investigations suggest that it is more abstract and may apply to other resources; specifically, it does appear to apply to warp registers. In other words you should think of this as a scheme for virtualization, generically considered, with resource granules sized as appropriate (perhaps 1kB or 4kB for Scratchpad storage? perhaps 16 or 32 registers?) rather than as an extension of the "standard" paging system forced to use 16kB pages.
Once you have a scheme that looks like VM, you can in principle use it in ways like VM!
The patent suggests three possible extensions to how this might be used.
First, suppose you need to move work from one GPU core to another. In principle you could have each task (at the appropriate granularity: warp, threadblock, kernel, whatever) use its taskID as an ASID, and simply move the task from one core to another. The first access by the new core would miss (ASID+address) in that core's TLB, so the TLB would be filled in from the central TLB. Then the reference to the address (say a register) would miss in the new core, and the register data would be moved from SRAM in the previous core to SRAM in the new core. Very much like moving a process on a CPU. This seems to be what Apple ultimately envision, and you might want to do this for all the reasons you might want to move a process, from load-balancing across cores to consolidating work on a few cores and powering down the rest.
Second you can share "memory" between tasks. We can already share registers between the lanes of a warp in the form of uniforms, but the uniform has to be reinitialized for each warp. You could imagine something like a threadblock, or even an entire kernel, sets a block of registers to generally useful values, then that block gets "mapped into" every warp when the warp is activated. This both avoids the initialization instructions and allows the storage of these registers to be reused across each warp, rather than per-warp register allocation.
Something similar may be useful across threadblocks. You could even do things like Copy-on-Write, though it's unclear how useful this might be.
Of course it will take changes to MSL and the compiler to make these ideas visible, but we can imagine them coming; the primary constraint on how powerful they might be is the extent to which "faulting" is a GPU-wide notification versus a core-wide notification.
Third you can consolidate all storage in a single physical pool of SRAM, with this "page" indirection determining whether the storage is used by Scratchpad, Ray Tracing, Texture, Registers, or anything else; and you could even have the indirection route to the SRAM pool on a different core.
As we have seen, nVidia has gone some distance down this path: a common address space, somewhat common SRAM pools, and the ability for threadblocks on one core to use Scratchpad storage on another core.
Perhaps Apple ultimately plans to go even further (eg making registers part of the common storage pool?)
The patent as described by Apple is extremely ambitious, and it is likely that elements of this will be rolling out over the next few years, and that the version we have right now in the M3 as of 2023 is the simplest possible version (remapping and oversubscription within a single core, but not yet common cross-core address space, moving address spaces, shared address spaces, etc).
 

Kronsteen

macrumors member
Nov 18, 2019
76
66
Here's my best analysis of what Dynamic Caching REALLY is.
This is cut from a much larger document describing the Apple GPU in depth, but I think it stands alone enough to explain the relevant points.
[…]
Great analysis, @name99

I’d be really interested to see the document you mention, if it’s something you can post a link to (but completely understand if you can’t).

And, as an aside, if you can suggest any pointers to material that describes how to optimise compute (non-graphical) workloads using Metal (I’m going to convert from OpenCL when I have time), to run on Mx GPUs, I’d be most grateful.

Thanks …. Andrew
 
Last edited:
  • Love
Reactions: Romain_H

name99

macrumors 68020
Jun 21, 2004
2,410
2,315
Great analysis, @name99

I’d be really interested to see the document you mention, if it’s something you can post a link to (but completely understand if you can’t).

And, as an aside, if you can suggest any pointers to material that describes how to optimise compute (non-graphical) workloads using Metal (I’m going to convert from OpenCL when I have time), to run on Mx GPUs, I’d be most grateful.

Thanks …. Andrew
Eventually. Like the other documents it takes time, and I began knowing much less about GPUs, so much more background research was necessary.
 

l0stl0rd

macrumors 6502
Jul 25, 2009
483
416
[attached benchmark chart]


So now add Hardware RT to that 🤔

Ah, sorry: Blender Open Data, if anyone wonders.
 

waltteriii

macrumors newbie
Sep 5, 2017
13
8



Screenshot from the iPhonedo video:

Blender 3.6 / White Lands demo scene using only GPU.
For comparison, my own results:

MacBook Pro M2 Max, 38-core GPU (64GB RAM)
Result: 132

PC with Nvidia 4090
Result: 19
 
  • Like
Reactions: komuh and aytan