I wonder if this will help with BVH building.
I had prototyped an algorithm for BVH building in my molecular ray tracer. It sorts the world into sparse "voxels" containing ~10000 atoms, then sorts those voxels into smaller ones with ~10 atoms. The second stage is computationally intensive, and keeping the data on-chip provides maximum performance. 10000 x 4 bytes = 40 KB, the same order of magnitude as local memory on a GPU core.
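Roughly, the two-level binning looks like this. This is a CPU-side C++ sketch rather than the actual GPU kernel, and the cell widths and names are just illustrative:

```cpp
// CPU-side sketch of the two-level binning idea (illustrative only; the real
// version runs as a GPU kernel). Atom positions are binned into coarse voxels
// holding on the order of 10,000 atoms, then each coarse voxel is re-binned
// into fine voxels of ~10 atoms. All names and cell widths are hypothetical.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Atom { float x, y, z; };

using VoxelKey = uint64_t;

// Quantize a position to a voxel grid with the given cell width.
VoxelKey makeKey(const Atom &a, float cellWidth) {
  auto q = [&](float v) { return uint64_t(int64_t(v / cellWidth) + (1 << 20)); };
  return (q(a.x) << 42) | (q(a.y) << 21) | q(a.z);
}

int main() {
  std::vector<Atom> atoms = /* ... load molecule ... */ {};

  // Stage 1: coarse voxels, sized so each holds roughly 10,000 atoms.
  std::unordered_map<VoxelKey, std::vector<uint32_t>> coarse;
  for (uint32_t i = 0; i < atoms.size(); ++i)
    coarse[makeKey(atoms[i], /*cellWidth=*/16.0f)].push_back(i);

  // Stage 2: each coarse voxel is re-binned into fine voxels of ~10 atoms.
  // On the GPU this is the expensive stage, and the ~40 KB of indices per
  // coarse voxel (10,000 x 4 bytes) is what you want resident on-chip.
  for (auto &[key, indices] : coarse) {
    std::unordered_map<VoxelKey, std::vector<uint32_t>> fine;
    for (uint32_t i : indices)
      fine[makeKey(atoms[i], /*cellWidth=*/0.5f)].push_back(i);
    // ... emit fine voxels as BVH leaves ...
  }
}
```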
The algorithm has a few intermediate steps that require a large amount of scratch space. At other times, that memory could be freed and lent to other threads. The unevenness in memory demand was so severe that I planned to allocate an entire GPU core's resources to a single threadgroup, then manually manage the memory and track which threads owned which part. This approach leads to very low occupancy, as only 1-2 threadgroups are resident on a GPU core.
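Something like this is what I mean by manually managing which threads own which part of the on-chip memory. Again a hypothetical CPU-side stand-in, not the real kernel:

```cpp
// Sketch of manual scratch-space management (hypothetical, CPU-side stand-in
// for on-chip memory). One threadgroup owns the whole arena; sub-regions are
// handed out to SIMD groups and returned when a step no longer needs them.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class ScratchArena {
public:
  explicit ScratchArena(size_t bytes) : storage_(bytes), inUse_(bytes, false) {}

  // Claim `bytes` of scratch for one owner (e.g. one SIMD group).
  // Returns a byte offset, or SIZE_MAX if nothing large enough is free.
  size_t claim(size_t bytes) {
    size_t run = 0;
    for (size_t i = 0; i < inUse_.size(); ++i) {
      run = inUse_[i] ? 0 : run + 1;
      if (run == bytes) {
        size_t start = i + 1 - bytes;
        for (size_t j = start; j <= i; ++j) inUse_[j] = true;
        return start;
      }
    }
    return SIZE_MAX;
  }

  // Give the region back so other work can use it. The contents are lost;
  // the original owner must be able to re-compute them later.
  void release(size_t offset, size_t bytes) {
    for (size_t j = offset; j < offset + bytes; ++j) inUse_[j] = false;
  }

  uint8_t *data(size_t offset) { return storage_.data() + offset; }

private:
  std::vector<uint8_t> storage_;
  std::vector<bool> inUse_;
};

int main() {
  // Pretend the core's on-chip memory (~64 KB) belongs to one threadgroup.
  ScratchArena arena(64 * 1024);
  size_t a = arena.claim(40 * 1024);  // scratch for the memory-hungry step
  assert(a != SIZE_MAX);
  // ... run the memory-hungry step using arena.data(a) ...
  arena.release(a, 40 * 1024);        // lend the space to other work
}
```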
Dynamic caching would make this case more feasible. Let's say a section of space is half-filled with atoms, half-filled with void. You'd only need threadgroup memory for 5000 atoms, but would have to allocate memory for 10000 up front. Dynamic caching could increase occupancy whenever sparse voxels are encountered, packing more threadgroups onto the same core. You just need to modify the algorithm so that threadgroups temporarily relinquish control of a piece of memory and re-compute its contents later.
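The relinquish-and-recompute idiom would look roughly like this. Another hypothetical CPU-side sketch; the buffer and function names are made up:

```cpp
// Sketch of the "relinquish and re-compute" idiom (hypothetical). Instead of
// holding a derived buffer across the whole kernel, a threadgroup drops it
// between phases and rebuilds it on demand, so a dynamic-caching GPU could
// hand the freed SRAM to other resident threadgroups in the meantime.
#include <cstdint>
#include <optional>
#include <vector>

struct Atom { float x, y, z; };

// Expensive derived data: e.g. the per-voxel list of atom indices.
std::vector<uint32_t> buildVoxelList(const std::vector<Atom> &atoms) {
  std::vector<uint32_t> indices(atoms.size());
  for (uint32_t i = 0; i < atoms.size(); ++i) indices[i] = i;  // stand-in work
  return indices;
}

int main() {
  std::vector<Atom> atoms(5000);  // a half-empty voxel: 5000 of 10,000 slots

  // Phase 1 needs the voxel list.
  std::optional<std::vector<uint32_t>> voxelList = buildVoxelList(atoms);
  // ... phase 1 ...

  // Phase 2 doesn't: relinquish the memory instead of keeping it allocated.
  voxelList.reset();
  // ... phase 2, during which the freed memory can serve other threadgroups ...

  // Phase 3 needs it again: re-compute it rather than holding it the whole time.
  voxelList = buildVoxelList(atoms);
  // ... phase 3 ...
}
```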
This may also increase occupancy in FlashAttention, a notoriously SRAM-heavy algorithm for AI. It needs a large volume of register file, but a little bit of scratch space goes unused during certain steps. While implementing FlashAttention, I learned that Apple actually has the smallest register file of any vendor, not the largest. That is why their ISA has 16-bit registers, while other vendors only have 32-bit registers: run the same general algorithms with less SRAM. For example, pointers to local memory (<64 KB) could be 16-bit integers. I imagine dynamic caching lets Apple get by with an even smaller SRAM size than before. With TSMC N3, SRAM density stopped scaling for the first time, while logic density kept improving. This may parallel the decrease in SLC size from the A15 to the A16.
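The 16-bit local-memory pointer trick, sketched on the CPU with made-up names (a stand-in for what the ISA does in registers, not Apple's actual implementation):

```cpp
// Sketch of "16-bit pointers into local memory" (hypothetical, CPU-side).
// Because threadgroup memory is under 64 KB, a byte offset into it fits in
// a 16-bit integer, halving the register cost of storing an address.
#include <cstddef>
#include <cstdint>
#include <cstdio>

constexpr size_t kLocalMemBytes = 64 * 1024;
alignas(16) static uint8_t localMem[kLocalMemBytes];  // stand-in for threadgroup memory

// A "pointer" into local memory is just a 16-bit offset.
using LocalPtr = uint16_t;

inline float *asFloatPtr(LocalPtr p) {
  return reinterpret_cast<float *>(localMem + p);
}

int main() {
  LocalPtr tileA = 0;          // first 32 KB tile
  LocalPtr tileB = 32 * 1024;  // second 32 KB tile

  *asFloatPtr(tileA) = 1.0f;
  *asFloatPtr(tileB) = 2.0f;
  std::printf("%f %f\n", *asFloatPtr(tileA), *asFloatPtr(tileB));
}
```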
As a side note, there's theoretical literature suggesting logic is inherently denser than memory, in any substrate. Eric Drexler's Nanosystems has a nanomechanical logic gate using ~2000 atoms, while one bit of RAM takes ~6000 atoms. The hardware could also be shaped to perform a specific function, which is more effective than making the computer general-purpose. I even contemplated a general-purpose 1-bit computer with all its functionality emulated through onboard instruction tape. The read-only instruction tape wasn't large enough to be generally useful, access latencies were a severe bottleneck, and it was quite prone to radiation damage, potentially requiring the data to be duplicated.