Here's my best analysis of what Dynamic Caching REALLY is.
This is cut from a much larger document describing the Apple GPU in depth, but I think it stands alone enough to explain the relevant points.
resource allocation (2017 and now)
We've now seen that resource allocation is an important part of scheduling, not least in the sense that we cannot start the execution of a new task (whether threadblock or warp) until certain hardware is allocated.
Consider now the allocation of "memory-like" resources, eg Scratchpad storage for a new threadblock, or registers for a new warp.
How might we do this?
The simplest answer is we don't really try. One threadblock gets the entire Scratchpad, each warp gets a hardwired set of registers, and no attempt is made to do better. This may sound dumb, but it is, for example, how AMX handles registers right now, with a dedicated register set for each potential client core in a cluster...
If we want to do better, the next step up is to provide each active threadblock with a base and a length register. The first threadblock sets its base to 0 and its length to whatever was allocated. The next threadblock sets its base to the end of the previous allocation (base plus length), and so on. We keep going until we run out of Scratchpad storage. Each access to Scratchpad adds the nominal address to the base, while simultaneously testing that it is smaller than length, and we're set. Once an allocation fails, we wait until an earlier threadblock completes and releases its range, then try again. There are a variety of ways we can improve on this basic idea (read any discussion of malloc!) but what's worth doing depends on the details: what makes sense if you expect hundreds of thousands of simultaneous allocations is different from if you expect up to maybe eight simultaneous allocations.
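In software terms this is essentially a bump allocator. Here's a minimal sketch (names and types are mine, and I've omitted the release of completed threadblocks' ranges for brevity; the hardware does this with a pair of registers per threadblock, not a data structure):

```cpp
#include <cstdint>
#include <optional>

// Hypothetical sketch of the base/length scheme. Each resident threadblock
// owns a contiguous [base, base+length) slice of Scratchpad.
struct Allocation { uint32_t base, length; };

class ScratchpadAllocator {
    uint32_t capacity_;   // total Scratchpad size
    uint32_t next_ = 0;   // first free address
public:
    explicit ScratchpadAllocator(uint32_t capacity) : capacity_(capacity) {}

    // Returns an allocation, or nothing if we've run out; the scheduler
    // must then wait for an earlier threadblock to release its range.
    std::optional<Allocation> allocate(uint32_t length) {
        if (capacity_ - next_ < length) return std::nullopt;
        Allocation a{next_, length};
        next_ += length;
        return a;
    }
};

// Each Scratchpad access adds the nominal address to base while
// simultaneously bounds-checking it against length (a fault in hardware).
inline uint32_t translate(const Allocation& a, uint32_t addr) {
    return addr < a.length ? a.base + addr : /* fault */ ~0u;
}
```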
This scheme as described is not utterly terrible (and readers may recognize that variants of this were used in early mainframes, and in the very earliest PC operating systems).
If you want to make this work better you might want to add a few elements.
Firstly you might want to aggregate units together. Allocate Scratchpad at the granularity of maybe a "cache line" (64B or 128B?) or even larger, like 1KB. Allocate warp registers at a granularity of maybe 16 registers.
Second you will want a bitmap or something similar to track the free vs in-use granules, along with hardware that can rapidly find runs of bits of a particular length, so that you can easily find a run of three (or whatever) bits set to 1 to match a three-granule request.
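A toy version of that bitmap, with a naive run search, might look like this (the hardware, of course, evaluates all positions in parallel rather than scanning; the class and method names are mine):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical free-granule bitmap: true = free. Assumes count >= 1.
class GranuleBitmap {
public:
    explicit GranuleBitmap(size_t granules) : free_(granules, true) {}

    // Find a run of `count` consecutive free granules; -1 if none exists.
    int find_run(size_t count) const {
        size_t run = 0;
        for (size_t i = 0; i < free_.size(); ++i) {
            run = free_[i] ? run + 1 : 0;
            if (run == count) return int(i + 1 - count);
        }
        return -1;
    }

    // Flip `count` granules starting at `start` between free and in-use.
    void mark(size_t start, size_t count, bool free) {
        for (size_t i = start; i < start + count; ++i) free_[i] = free;
    }

private:
    std::vector<bool> free_;
};
```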
At this point you have an allocator adequate for small purposes, and this is essentially the content of (2017):
https://patents.google.com/patent/US10698687B1 Pipelined resource allocation using size aligned allocation.
The specific detail being patented is how, after a few rounds of allocation and deallocation have been performed, you choose where to perform the next allocation. For example, options you might imagine include:
- allocate from the smallest free block that meets the requirements
- allocate from the largest free block that meets the requirements
- allocate from the "oldest" (in some sense) or "youngest" free block that meets the requirements
- allocate round-robin style (ie just keep moving left till you find a block that's big enough)
What Apple does is none of these, but it should seem familiar from some malloc designs: allocate blocks at "alignment" granularity.
So if we want to allocate a block that is, say, 5 granules in size, we will look for runs of 5 consecutive free granules starting at locations 0, 5, 10, 15, ...
If we find more than one such location, we will choose the smallest fit.
This behaves somewhat like a malloc that keeps separate free lists for blocks of different sizes, without requiring the overhead of such a design.
This scheme is implemented in hardware, and performs the allocation (or reports a failure) in one cycle. Not bad!
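For concreteness, here's roughly what the size-aligned policy might look like as (slow, sequential) software. The function name is mine, and I'm guessing that "smallest fit" means the candidate sitting inside the smallest enclosing free run; the hardware evaluates all candidate positions in parallel:

```cpp
#include <cstddef>
#include <vector>

// Sketch of the size-aligned policy (names and tie-break mine, not Apple's):
// try start positions 0, count, 2*count, ...; among positions whose `count`
// granules are all free, pick the one inside the smallest enclosing free run.
int allocate_size_aligned(const std::vector<bool>& free_bits, size_t count) {
    if (count == 0) return -1;
    int best = -1;
    size_t best_run = free_bits.size() + 1;
    for (size_t start = 0; start + count <= free_bits.size(); start += count) {
        bool ok = true;
        for (size_t i = start; i < start + count; ++i)
            if (!free_bits[i]) { ok = false; break; }
        if (!ok) continue;
        // Measure the enclosing free run for the "smallest fit" tie-break.
        size_t lo = start, hi = start + count;
        while (lo > 0 && free_bits[lo - 1]) --lo;
        while (hi < free_bits.size() && free_bits[hi]) ++hi;
        if (hi - lo < best_run) { best_run = hi - lo; best = int(start); }
    }
    return best;  // -1 if no aligned run exists
}
```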
This is far from the end of the line, however! Suppose you want to do better. The big problem with the design above is that we get fragmentation. We may have three granules free, but not contiguous.
The first step to a more powerful design is to introduce a level of indirection. Add something like a TLB that maps a virtual granuleID to a physical granuleID. Now we no longer need to care about contiguity; a three-granule request merely needs to find three free bits anywhere in the bitmap, and set the appropriate mappings from (sequential) virtualIDs to (non-sequential) physicalIDs. And every granule access goes through the TLB lookup. This is, of course, basic VM and should be very familiar.
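A toy remap table (all names mine; the real thing is presumably a small content-addressed structure per core):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint16_t kInvalid = 0xFFFF;

// Hypothetical per-task remap table: virtual granule IDs are dense
// (0, 1, 2, ...); the physical granules backing them can be anywhere
// in the SRAM pool.
struct RemapTable {
    std::vector<uint16_t> phys;   // indexed by virtual granule ID

    explicit RemapTable(size_t n) : phys(n, kInvalid) {}

    // A three-granule request just needs any three free physical granules:
    // map sequential virtual IDs to whatever physical IDs the bitmap found.
    void map(uint16_t virt, uint16_t physical) { phys[virt] = physical; }

    // Every granule access goes through this lookup (the "TLB").
    uint16_t translate(uint16_t virt) const { return phys[virt]; }
};
```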
The next step is to also allow oversubscription. Now we no longer even require that there be three free physical granules, we simply hand out three virtualIDs, all mapped to "invalid" in the TLB. To make this work, we also need
- a faulting mechanism, so that when access is made to an "invalid" entry, some sort of hardware kicks in to perform a physical allocation
- a means to handle oversubscription, ie a policy for what to do if there is no physical allocation available
For memory-like resources, both of these are fairly easy, in the sense that you simply need to copy some memory out to alternative storage and reallocate the block that was copied out. Obviously you know how this works for standard virtual memory, and we can do the same thing here. For example, if the resource of interest is Scratchpad, we can copy a granule of Scratchpad storage out to the L1D, mark that granule as invalid in the mapping of its current user, set it up as valid for the new user, and then keep going.
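Sketching that fault path for Scratchpad (every helper here is an assumed stand-in for whatever mechanism the hardware actually uses):

```cpp
#include <cstdint>

// Assumed stand-ins for real hardware mechanisms:
int  find_free_physical_granule();            // scan the free bitmap
struct Victim { uint16_t physical; };         // plus owner + virtual ID
Victim pick_victim();                         // choose a granule to evict
void spill_to_L1D(uint16_t physical);         // copy granule contents out
void invalidate_mapping(uint16_t physical);   // old owner now faults on access
void install_mapping(uint16_t virt, uint16_t physical);  // for faulting task

// Hypothetical fault path: on access to an "invalid" entry, back it with a
// free granule if one exists; otherwise spill a victim's granule to the L1D
// and steal it.
uint16_t handle_granule_fault(uint16_t virt) {
    int phys = find_free_physical_granule();
    if (phys < 0) {
        Victim v = pick_victim();
        spill_to_L1D(v.physical);         // its data now lives in the L1D
        invalidate_mapping(v.physical);   // the old owner will re-fault
        phys = v.physical;
    }
    install_mapping(virt, uint16_t(phys));
    return uint16_t(phys);
}
```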
Just like with VM, you can then use various heuristics to ensure that this swapping does not get out of control.
For a GPU two obvious such heuristics are
- we already know which tasks may be inactive waiting on either RAM or a synchronization event, and obviously it makes more sense to reallocate from them than from an active task
- we also expect that, generally, warps and threadblocks will tend to behave in a similar fashion. So we can do something like
+ for the first round of warps of a threadblock, begin them with just one of their register granule allocations hooked up to a physical allocation and see what happens. If no faults are generated, we are fine! If faults are generated, then for the next round of warps, try to ensure that two register granules (or however many are required) are hooked up before starting the warps.
Of course the details of exactly how aggressively you oversubscribe will depend on the cost of re-allocation faults.
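As a toy version of that warm-up heuristic (the feedback rule here is pure speculation on my part):

```cpp
#include <algorithm>
#include <cstdint>

// Speculative sketch: start warps with few register granules pre-mapped,
// and grow that number for later rounds if the earlier rounds faulted.
struct WarmupPolicy {
    uint32_t premapped = 1;   // granules hooked up before a warp starts
    uint32_t maximum;         // the warp's full architectural allocation

    explicit WarmupPolicy(uint32_t max_granules) : maximum(max_granules) {}

    void observe_round(uint32_t faults_per_warp) {
        if (faults_per_warp > 0)
            premapped = std::min(premapped + faults_per_warp, maximum);
        // No faults: keep the current, cheaper setting for the next round.
    }
};
```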
This next technology stage was introduced with the A17/M3 GPU and is given the brand name Dynamic Caching by Apple. It is especially valuable for "unpredictable" shaders. For example a shader may have one common (data dependent) path that requires only a small number of registers or Scratchpad, and a different, rare (data dependent) path that requires much more. The shader has to allocate the full possible set of registers or Scratchpad storage it might require, but usually will not use most of its allocation. By only allocating resources at the point they are truly required, we can frequently pack more threadblocks and warps onto each core, meaning fewer dead cycles when the core simply cannot find a warp to schedule.
However Apple's scheme is even more ambitious!
What I've described above is a scheme that's basically a simple remap table per core, with the ability to fault on certain accesses.
Compare this to the virtual memory system of a modern SoC. Two differences are
- The remapping is common across the entire SoC. There are multiple local remap tables (ie TLBs), but these are all just caches of a single system-wide collection of remappings that we collectively call the "page tables".
- To deal with the size of all the remappings, the "page tables" have a hierarchical structure which lends itself to multiple levels of caching.
This same sort of machinery is duplicated across the GPU. Superficially it looks like the CPU page tables (because of course it's designed using the same ideas), but it's not a remapping of multiple process virtual address spaces into physical DRAM; rather it's a remapping of multiple GPU-specific address spaces (eg the Register address space of each Warp, or the Scratchpad address space of each Threadblock) into a common address space which refers to SRAM storage in the GPU.
We could make this common address space simply refer to SRAM on the GPU, but Apple go one stage further. They define the target address space not as physical SRAM on the GPU but as the device-wide virtual address space of the relevant process. Thus, in principle, a register lookup translates the (warpID, register number) via a GPU-specific TLB into an (ASID, virtual address), which in turn is translated via a system-wide TLB into a physical address.
That's obviously clumsy if taken literally, and obviously there will be short-circuiting (for example, virtual addresses with a particular set of high bits can be routed directly to GPU SRAM), but it allows for expansion into the main memory system as required. For example, any oversubscription of GPU SRAM resources can change the mappings to route to the "real" virtual address space and thus be backed by the SLC and DRAM. (If/when Apple start doing crazy things like expanding into high-end systems, you can also use this sort of technology to provide a distributed virtual address space across multiple computers, in a way that simultaneously allows for transparent growth of the GPUs.)
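Written out as pseudocode, the full, un-short-circuited path might look like this (all structures hypothetical, inferred from the patent):

```cpp
#include <cstdint>

struct GpuAddress  { uint32_t warp_id; uint32_t reg_number; };
struct VirtAddress { uint32_t asid; uint64_t va; };

// Stage 1 (assumed): a GPU-specific TLB maps (warpID, register number)
// into the process's device-wide virtual address space.
VirtAddress gpu_tlb_lookup(GpuAddress a);

// Stage 2 (assumed): the ordinary system-wide TLB maps (ASID, VA) to a
// physical address. In practice, VAs with a particular set of high bits
// would be routed straight to GPU SRAM, short-circuiting this stage.
uint64_t system_tlb_lookup(VirtAddress v);

uint64_t translate_register(GpuAddress a) {
    return system_tlb_lookup(gpu_tlb_lookup(a));
}
```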
This scheme (virtualize resources, and handle oversubscription by faulting) is described in (2020):
https://patents.google.com/patent/US20210271606A1 On-demand Memory Allocation. The patent as described would appear to apply to literal memory spaces (obviously Scratchpad, and also the private address space used by Ray Tracing), but experimental investigation suggests that it is more abstract and may apply to other resources; specifically, it does appear to apply to warp registers. In other words you should think of this as a scheme for virtualization, generically considered, with resource granules sized as appropriate (perhaps 1kB or 4kB for Scratchpad storage? perhaps 16 or 32 registers?), rather than as an extension of the "standard" paging system forced to use 16kB pages.
Once you have a scheme that looks like VM, you can in principle use it in the same ways as VM!
The patent suggests three possible extensions to how this might be used.
First suppose you need to move work from one GPU core to another. In principle each task (at the appropriate granularity: warp, threadblock, kernel, whatever) could use its taskID as an ASID, and you simply move the task from one core to another. The first access by the new core would miss (ASID+address) in its TLB, so the TLB would be filled in from the central TLB. Then the reference to the address (say a register) would miss in the new core, and the register data would be moved from SRAM in the previous core to SRAM in the new core. Very much like moving a process on a CPU. This seems to be what Apple ultimately envision, and you might want to do this for all the reasons you might want to move a process, from load-balancing across cores to consolidating work on a few cores and powering down the rest.
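A speculative sketch of that migration flow (the helpers are stand-ins; the point is that no bulk copy happens at migration time):

```cpp
#include <cstdint>

struct Task { uint32_t asid; };   // taskID doubling as an ASID

// Assumed stand-ins for hardware/runtime mechanisms:
void flush_local_tlb(int core, uint32_t asid);   // drop stale local entries
void enqueue_on_core(int core, Task& t);

// The first access on the new core misses in its local TLB (refilled from
// the central tables), then faults on the granule itself, pulling the data
// (eg register contents) across from the old core's SRAM on demand.
void migrate_task(Task& t, int from_core, int to_core) {
    flush_local_tlb(from_core, t.asid);
    enqueue_on_core(to_core, t);
}
```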
Second you can share "memory" between tasks. We can already share registers between the lanes of a warp in the form of uniforms, but a uniform has to be reinitialized for each warp. You could imagine something like a threadblock, or even an entire kernel, setting a block of registers to generally useful values, with that block then "mapped into" every warp when the warp is activated. This both avoids the initialization instructions and allows the storage of these registers to be shared across warps rather than allocated per warp.
Something similar may be useful across threadblocks. You could even do things like Copy-on-Write, though it's unclear how useful this might be.
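One way to picture the shared-block idea, as a speculative sketch: map the same physical granules read-only into every warp's remap table, with copy-on-write kicking in if a warp ever writes to them:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint16_t kInvalid = 0xFFFF;

// Speculative: one mapping entry per virtual granule, with a read-only
// bit so shared granules can be handled copy-on-write.
struct Mapping { uint16_t physical = kInvalid; bool read_only = false; };
using WarpTable = std::vector<Mapping>;

// Map the same physical granules (holding pre-initialized registers) into
// every warp's table: one physical copy, visible to all warps, and no
// per-warp initialization instructions.
void map_shared_block(std::vector<WarpTable>& warps,
                      const std::vector<uint16_t>& shared_granules,
                      uint16_t first_virt) {
    for (WarpTable& table : warps)
        for (size_t i = 0; i < shared_granules.size(); ++i)
            table[first_virt + i] = {shared_granules[i], /*read_only=*/true};
}
```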
Of course it will take changes to MSL and the compiler to make these ideas visible, but we can imagine them coming; the primary constraint on how powerful they might be is the extent to which "faulting" is a GPU-wide notification versus a core-wide notification.
Third you can consolidate all storage in a single physical pool of SRAM, with this "page" indirection determining whether the storage is used by Scratchpad, Ray Tracing, Texture, Registers, or anything else; and you could even have the indirection route to the SRAM pool on a different core.
As we have seen, nVidia has gone some distance down this path: a common address space, somewhat common SRAM pools, and the ability for threadblocks on one core to use Scratchpad storage on another core.
Perhaps Apple ultimately plans to go even further (eg making registers part of the common storage pool?)
The patent as described by Apple is extremely ambitious. It is likely that elements of it will roll out over the next few years, and that the version we have right now in the M3, as of 2023, is the simplest possible version (remapping and oversubscription within a single core, but not yet a common cross-core address space, moving address spaces, shared address spaces, etc).