I wonder if this will help with BVH building.
I had prototyped an algorithm for BVH building in my molecular ray tracer. It sorts the world into sparse "voxels" containing ~10000 atoms, then sorts those voxels into smaller ones with ~10 atoms. The second stage is computationally intensive, and keeping the data on-chip provides maximum performance. 10000 x 4 bytes = 40 KB, the same order of magnitude as local memory on a GPU core.
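Roughly, the two-level binning looks like this. This is a CPU-side C++ sketch rather than the actual GPU kernel, and the cell widths and names are just illustrative:

```cpp
// CPU-side sketch of the two-level binning idea (illustrative only; the real
// version runs as a GPU kernel). Atom positions are binned into coarse voxels
// holding on the order of 10,000 atoms, then each coarse voxel is re-binned
// into fine voxels of ~10 atoms. All names and cell widths are hypothetical.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Atom { float x, y, z; };

using VoxelKey = uint64_t;

// Quantize a position to a voxel grid with the given cell width.
VoxelKey makeKey(const Atom &a, float cellWidth) {
  auto q = [&](float v) { return uint64_t(int64_t(v / cellWidth) + (1 << 20)); };
  return (q(a.x) << 42) | (q(a.y) << 21) | q(a.z);
}

int main() {
  std::vector<Atom> atoms = /* ... load molecule ... */ {};

  // Stage 1: coarse voxels, sized so each holds roughly 10,000 atoms.
  std::unordered_map<VoxelKey, std::vector<uint32_t>> coarse;
  for (uint32_t i = 0; i < atoms.size(); ++i)
    coarse[makeKey(atoms[i], /*cellWidth=*/16.0f)].push_back(i);

  // Stage 2: each coarse voxel is re-binned into fine voxels of ~10 atoms.
  // On the GPU this is the expensive stage, and the ~40 KB of indices per
  // coarse voxel (10,000 x 4 bytes) is what you want resident on-chip.
  for (auto &[key, indices] : coarse) {
    std::unordered_map<VoxelKey, std::vector<uint32_t>> fine;
    for (uint32_t i : indices)
      fine[makeKey(atoms[i], /*cellWidth=*/0.5f)].push_back(i);
    // ... emit fine voxels as BVH leaves ...
  }
}
```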
The algorithm has a few intermediate steps that require a large amount of scratch space. At other times, that memory could be freed and lent to other threads. The unevenness in memory demand was so severe that I planned to allocate an entire GPU core's resources to a single threadgroup, then manually manage the memory and track which threads owned which part. This approach leads to very low occupancy, as only 1-2 threadgroups are resident on a GPU core.
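Something like this is what I mean by manually managing which threads own which part of the on-chip memory. Again a hypothetical CPU-side stand-in, not the real kernel:

```cpp
// Sketch of manual scratch-space management (hypothetical, CPU-side stand-in
// for on-chip memory). One threadgroup owns the whole arena; sub-regions are
// handed out to SIMD groups and returned when a step no longer needs them.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class ScratchArena {
public:
  explicit ScratchArena(size_t bytes) : storage_(bytes), inUse_(bytes, false) {}

  // Claim `bytes` of scratch for one owner (e.g. one SIMD group).
  // Returns a byte offset, or SIZE_MAX if nothing large enough is free.
  size_t claim(size_t bytes) {
    size_t run = 0;
    for (size_t i = 0; i < inUse_.size(); ++i) {
      run = inUse_[i] ? 0 : run + 1;
      if (run == bytes) {
        size_t start = i + 1 - bytes;
        for (size_t j = start; j <= i; ++j) inUse_[j] = true;
        return start;
      }
    }
    return SIZE_MAX;
  }

  // Give the region back so other work can use it. The contents are lost;
  // the original owner must be able to re-compute them later.
  void release(size_t offset, size_t bytes) {
    for (size_t j = offset; j < offset + bytes; ++j) inUse_[j] = false;
  }

  uint8_t *data(size_t offset) { return storage_.data() + offset; }

private:
  std::vector<uint8_t> storage_;
  std::vector<bool> inUse_;
};

int main() {
  // Pretend the core's on-chip memory (~64 KB) belongs to one threadgroup.
  ScratchArena arena(64 * 1024);
  size_t a = arena.claim(40 * 1024);  // scratch for the memory-hungry step
  assert(a != SIZE_MAX);
  // ... run the memory-hungry step using arena.data(a) ...
  arena.release(a, 40 * 1024);        // lend the space to other work
}
```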
Dynamic caching would make this case more feasible. Let's say a section of space is half-filled with atoms, half-filled with void. You'd only need threadgroup memory for 5000 atoms, but would have to allocate memory for 10000 up front. Dynamic caching could increase occupancy whenever sparse voxels are encountered, packing more threadgroups onto the same core. You just need to modify the algorithm so that threadgroups temporarily relinquish control of a piece of memory and re-compute its contents later.
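The relinquish-and-recompute idiom would look roughly like this. Another hypothetical CPU-side sketch; the buffer and function names are made up:

```cpp
// Sketch of the "relinquish and re-compute" idiom (hypothetical). Instead of
// holding a derived buffer across the whole kernel, a threadgroup drops it
// between phases and rebuilds it on demand, so a dynamic-caching GPU could
// hand the freed SRAM to other resident threadgroups in the meantime.
#include <cstdint>
#include <optional>
#include <vector>

struct Atom { float x, y, z; };

// Expensive derived data: e.g. the per-voxel list of atom indices.
std::vector<uint32_t> buildVoxelList(const std::vector<Atom> &atoms) {
  std::vector<uint32_t> indices(atoms.size());
  for (uint32_t i = 0; i < atoms.size(); ++i) indices[i] = i;  // stand-in work
  return indices;
}

int main() {
  std::vector<Atom> atoms(5000);  // a half-empty voxel: 5000 of 10,000 slots

  // Phase 1 needs the voxel list.
  std::optional<std::vector<uint32_t>> voxelList = buildVoxelList(atoms);
  // ... phase 1 ...

  // Phase 2 doesn't: relinquish the memory instead of keeping it allocated.
  voxelList.reset();
  // ... phase 2, during which the freed memory can serve other threadgroups ...

  // Phase 3 needs it again: re-compute it rather than holding it the whole time.
  voxelList = buildVoxelList(atoms);
  // ... phase 3 ...
}
```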
This may also increase occupancy in FlashAttention, a notoriously SRAM-heavy algorithm for AI. It needs a large volume of register file, but a little bit of scratch space goes unused during certain steps. While implementing FlashAttention, I learned that Apple actually has the smallest register file of any vendor, not the largest. That is why their ISA has 16-bit registers, while other vendors only have 32-bit registers: run the same general algorithms with less SRAM. For example, pointers to local memory (<64 KB) could be 16-bit integers. I imagine dynamic caching lets Apple get by with an even smaller SRAM size than before. With TSMC N3, SRAM density stopped scaling for the first time, while logic density kept improving. This may parallel the decrease in SLC size from the A15 to the A16.
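The 16-bit local-memory pointer trick, sketched on the CPU with made-up names (a stand-in for what the ISA does in registers, not Apple's actual implementation):

```cpp
// Sketch of "16-bit pointers into local memory" (hypothetical, CPU-side).
// Because threadgroup memory is under 64 KB, a byte offset into it fits in
// a 16-bit integer, halving the register cost of storing an address.
#include <cstddef>
#include <cstdint>
#include <cstdio>

constexpr size_t kLocalMemBytes = 64 * 1024;
alignas(16) static uint8_t localMem[kLocalMemBytes];  // stand-in for threadgroup memory

// A "pointer" into local memory is just a 16-bit offset.
using LocalPtr = uint16_t;

inline float *asFloatPtr(LocalPtr p) {
  return reinterpret_cast<float *>(localMem + p);
}

int main() {
  LocalPtr tileA = 0;          // first 32 KB tile
  LocalPtr tileB = 32 * 1024;  // second 32 KB tile

  *asFloatPtr(tileA) = 1.0f;
  *asFloatPtr(tileB) = 2.0f;
  std::printf("%f %f\n", *asFloatPtr(tileA), *asFloatPtr(tileB));
}
```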
As a side note, there's theoretical literature suggesting logic is inherently denser than memory, in any substrate. Eric Drexler's Nanosystems has a nanomechanical logic gate using ~2000 atoms, while one bit of RAM takes ~6000 atoms. The hardware could also be shaped to perform a specific function, which is more effective than making the computer general-purpose. I even contemplated a general-purpose 1-bit computer with all its functionality emulated through onboard instruction tape. The read-only instruction tape wasn't large enough to be generally useful, access latencies were a severe bottleneck, and it was quite prone to radiation damage, potentially requiring the data to be duplicated.