Here's my best analysis of what Dynamic Caching REALLY is.
This is cut from a much larger document describing the Apple GPU in depth, but I think it stands alone enough to explain the relevant points.
resource allocation (2017 and now)
We've now seen that resource allocation is an important part of scheduling, not least in the sense that we cannot start the execution of a new task (whether threadblock or warp) until certain hardware is allocated.
Consider now the allocation of "memory-like" resources, e.g. Scratchpad storage for a new threadblock, or registers for a new warp.
How might we do this?
[snip]
Apple released a video today that basically confirmed the above analysis.
The primary point is virtualization, with faulting of memory-like resources (registers, Scratchpad and Tile memory).
This allows substantially more simultaneous threadgroups and warps to be packed onto each GPU core, so that fewer cycles are wasted while no warp is runnable.
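To make the occupancy argument concrete, here's a back-of-the-envelope sketch in C++ (all the numbers are invented; the real pool and footprint sizes are not public). A static allocator has to reserve each warp's declared worst-case footprint up front, while a virtualized, fault-on-demand scheme only needs the storage a warp actually touches:

```cpp
// Illustration only: invented sizes, not Apple's real numbers.
#include <cstdio>

int main() {
    const int pool_kb            = 192;  // hypothetical on-core SRAM pool
    const int worst_case_kb_warp = 8;    // compiler's declared maximum per warp
    const int typical_kb_warp    = 3;    // what most warps actually touch

    // Static allocation must assume the worst case for every warp.
    printf("static allocation:       %d resident warps\n",
           pool_kb / worst_case_kb_warp);
    // On-demand (faulting) allocation is limited only by real usage.
    printf("on-demand (virtualized): ~%d resident warps\n",
           pool_kb / typical_kb_warp);
}
```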
BUT Apple also implemented another idea I suggested as possible.
A few years ago nVidia provided a unified pool of SRAM that can have one portion used as GPU L1D and the other portion as Scratchpad, so that after all the Scratchpad storage is allocated, whatever is left over, large or small, gets at least some use as L1D. I suggested Apple might do this, but they have actually gone one better!
They provide a single pool of SRAM storage that is shared by registers, Scratchpad/Tile, and the L1D cache, with each type of use dynamically "competing" against the others, so that this SRAM is always in use by the most deserving clients, whether those clients mostly use registers, Scratchpad/Tile, or even L1D (which, remember, includes things like stack usage). Whatever does not fit in this SRAM pool will spill to L2, but that spilling is tracked (in the same way that an OS tracks paging activity) to ensure that most of the time the core is not thrashing, by putting some warps to sleep so that fewer of them compete for the pool of SRAM.
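My mental model of this (emphasis: my model, with invented numbers and structure; Apple has not published the mechanism) looks something like the following C++ sketch: a single pool is carved up on demand among register, Tile/Scratchpad, and cache clients, anything that doesn't fit spills to L2, and a spill-rate counter decides whether to put some warps to sleep:

```cpp
// Toy model of a shared on-core SRAM pool with paging-style thrash control.
// All structure and numbers are my own invention, for illustration only.
#include <cstdio>

struct CorePool {
    int capacity_kb;      // total on-core SRAM
    int used_kb = 0;      // currently resident (registers + tile + L1D lines)
    int spills = 0;       // allocations that had to go to L2 this interval

    // Try to make 'kb' of some client resident; otherwise count it as a spill.
    bool allocate(int kb) {
        if (used_kb + kb <= capacity_kb) { used_kb += kb; return true; }
        ++spills;
        return false;
    }
    void release(int kb) { used_kb -= kb; }
};

// Crude occupancy governor: if the core is spilling heavily, sleep some
// warps so fewer clients compete for the pool (thresholds are made up).
int throttle(const CorePool& p, int active_warps) {
    const int kSpillLimit = 8;
    if (p.spills > kSpillLimit && active_warps > 4) return active_warps - 4;
    return active_warps;
}

int main() {
    CorePool pool{/*capacity_kb=*/192};
    int active_warps = 48;
    // Pretend each warp touches 6 KB of mixed register/tile/cache state.
    for (int w = 0; w < active_warps; ++w) pool.allocate(6);
    active_warps = throttle(pool, active_warps);
    printf("resident: %d KB, spilled allocations: %d, warps kept awake: %d\n",
           pool.used_kb, pool.spills, active_warps);
}
```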
A second nice feature was added.
The shader core, to simplify, has one execution unit that handles FP32, one that handles FP16, one that handles Integer, and one that handles complex computations (sqrt, that sort of thing). It seems a shame to only use one of these units every cycle.
In a sense we want to allow superscalar execution (more than one instruction per cycle) but the details of doing this are tricky, in that we cannot afford the massive machinery used by the CPU to achieve this. So what can we do?
Different vendors solve this in different ways.
- nVidia does it by making each pipeline only 16 lanes wide. So in the first cycle I send an instruction to execution unit one, and that instruction dispatches over two cycles (the first 16 lanes of the warp, then the second 16). In the second cycle I can then send an instruction to execution unit two, and so we go. This sort of staggered execution keeps two units busy a reasonable fraction of the time, even if it is one unit executing the first half of one warp while the second unit executes the second half of a different warp.
- AMD does it by having a few instructions that are SIMD2, so that I send out a single instruction which performs the same operation on a register and on the next register (register# and register#+1). This works well for them given the sets of paired execution units they have, and their use of something called Wavefront64, which, especially for graphics, uses warps that are 64 wide rather than 32.
- Apple's solution needs to take into account some specifically-Apple issues, which are:
+ they have separate FP32 and FP16 pipelines (rather than doing the "obvious" thing of executing FP16 on the FP32 pipeline), because each of the two pipelines can be energy/area optimized if it doesn't have to do two tasks
+ which means they can't easily do something like an AMD SIMD2 solution. They also make substantial use of hardware that is optimized for 32-wide warps (the way they handle matrix multiplies, or shuffles, or reductions, or non-uniform warp size, or ...) and presumably they don't want to redesign all of that for 16-wide operation, as an nVidia-like solution would require.
+ BUT Apple has the advantage that (even before Dynamic Caching) their GPU cores were sized so that they could sustain about 1.5x as many active warps as the nVidia or AMD high end, and with Dynamic Caching they can sustain even more. This gives them a larger pool of warps to schedule from, meaning that in any cycle there is a good chance of finding two instructions, from two different warps, that will execute on different execution units (see the sketch below). The advantage of this, compared to nVidia's solution, is that because the two instructions come from different warps, no dependency checking is required to ensure that register dependencies are satisfied; relative to AMD's solution it doesn't require compiler changes. The "disadvantage" is that it won't show up in many benchmarks (which tend to do the same thing in every warp); it requires real-world code executing a wide variety of warps simultaneously to really shine.
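Here's a sketch of how I imagine the issue logic (the structure and names are mine, not Apple's): each cycle the scheduler walks the ready warps and issues at most one instruction per execution pipe, so it can dual-issue exactly when two ready warps want different pipes:

```cpp
// Toy dual-issue-from-different-warps scheduler. My own model, for
// illustration; not a description of Apple's actual hardware.
#include <cstdio>
#include <vector>

enum class Pipe { FP32, FP16, INT, SFU, kCount };

struct Warp {
    int id;
    Pipe next_pipe;   // execution unit needed by this warp's next instruction
    bool ready;       // operands available, not stalled on memory
};

// Pick up to two warps whose next instructions target different pipes.
std::vector<int> issue_cycle(const std::vector<Warp>& warps) {
    bool pipe_busy[(int)Pipe::kCount] = {false};
    std::vector<int> issued;
    for (const Warp& w : warps) {
        if (!w.ready || pipe_busy[(int)w.next_pipe]) continue;
        pipe_busy[(int)w.next_pipe] = true;
        issued.push_back(w.id);
        if (issued.size() == 2) break;   // 2x superscalar at most
    }
    return issued;
}

int main() {
    // A varied mix of warps can dual-issue; a uniform mix cannot.
    std::vector<Warp> varied  = {{0, Pipe::FP32, true}, {1, Pipe::FP32, true},
                                 {2, Pipe::FP16, true}, {3, Pipe::INT,  true}};
    std::vector<Warp> uniform = {{0, Pipe::FP32, true}, {1, Pipe::FP32, true},
                                 {2, Pipe::FP32, true}, {3, Pipe::FP32, true}};
    printf("varied mix issues %zu instruction(s) this cycle\n",
           issue_cycle(varied).size());
    printf("uniform mix issues %zu instruction(s) this cycle\n",
           issue_cycle(uniform).size());
}
```

Note how the uniform mix (every warp doing FP32 math, which is roughly what a simple benchmark looks like) only issues one instruction per cycle; that is exactly why this feature is hard to see in benchmarks but valuable for mixed real-world workloads.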
So we see in the new GPUs, at the very least:
- ray tracing (for the crowd that wants that)
- mesh shading (so more flexible ways of geometry generation/modification)
- dynamic caching
- 2x superscalar execution (when a mix of varied warps is present on the core)
It's notable that, while these are all good and important, to match nVidia (not just in performance, but as the tool of choice for anyone doing leading-edge GPU work) Apple needs to add at least two more large and difficult classes of functionality:
- per-lane progress (as added by nVidia in Volta; this allows a whole range of much more sophisticated synchronization schemes and thus algorithms; see the sketch after this list)
- much larger co-operating groups than just threadgroups. (I think nVidia added this in Ampere? Apple has enough machinery in place for the first part of this, namely a large shared Scratchpad address space extending across multiple cores. But they need all the other parts including cross-core synchronization.)
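To see why per-lane progress matters, here's a toy lockstep-SIMT model in C++ (my own simplification, not any vendor's actual hardware): every lane wants to grab a lock, do something, then release it. Under reconvergence-based lockstep scheduling the lane that wins the lock is masked off until the others stop spinning, which they never do; with per-lane forward progress the same code completes.

```cpp
// Toy model of why per-lane forward progress matters. Each lane runs:
// spin until it acquires a lock, then release it. Purely illustrative.
#include <cstdio>
#include <vector>

enum class PC { Spin, Unlock, Done };

// Returns true if all lanes finish within max_steps.
static bool run_warp(bool per_lane_progress, int max_steps = 1000) {
    const int kLanes = 4;
    std::vector<PC> pc(kLanes, PC::Spin);
    int lock = 0;                                   // 0 = free, 1 = held

    for (int step = 0; step < max_steps; ++step) {
        bool all_done = true;
        for (int lane = 0; lane < kLanes; ++lane) {
            if (pc[lane] == PC::Done) continue;
            all_done = false;

            // Lockstep rule: while ANY lane is still spinning, lanes that
            // already left the loop sit masked off at the reconvergence
            // point. Per-lane progress removes that restriction.
            bool someone_spinning = false;
            for (PC p : pc) someone_spinning |= (p == PC::Spin);
            if (!per_lane_progress && pc[lane] != PC::Spin && someone_spinning)
                continue;                           // masked off, cannot run

            if (pc[lane] == PC::Spin) {
                if (lock == 0) { lock = 1; pc[lane] = PC::Unlock; }  // CAS wins
            } else {                                // PC::Unlock
                lock = 0; pc[lane] = PC::Done;      // release the lock
            }
        }
        if (all_done) return true;
    }
    return false;                                   // livelocked
}

int main() {
    printf("lockstep model finishes:          %s\n",
           run_warp(/*per_lane_progress=*/false) ? "yes" : "no (livelock)");
    printf("per-lane-progress model finishes: %s\n",
           run_warp(/*per_lane_progress=*/true) ? "yes" : "no (livelock)");
}
```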
I am sure these are very much on Apple's radar; but they are not trivial!
People (who understood the implications) were right to be shocked when nVidia introduced them; just as people (who understand the implications) should be shocked at the actual implementation and consequences of Dynamic Caching.