Why is Apple slow to implement Hardware based Ray-Tracing?

leman · Nov 3, 2023

diamond.g said:
For Blender (rendering) it is all about the speed of Path Tracing, and any trick they can do behind the scenes to speed it up (I assume Nvidia uses brute force here).

Based on IMGTech's "Ray Tracing Levels" do we think Apple is at level 4?

If their RT hardware implements what is described in the patents, it will likely have coherency sorting and ray compacting in hardware. Apple doesn’t need extra APIs in the shader to support this, it should be done automatically by the hardware.

diamond.g · Nov 3, 2023

leman said:
If their RT hardware implements what is described in the patents, it will likely have coherency sorting and ray compacting in hardware. Apple doesn’t need extra APIs in the shader to support this, it should be done automatically by the hardware.

I actually had a blurb in my last post about Apple not needing API updates as they can just do stuff behind the scenes with no code changes, but removed it because I wasn't sure if that was true.

I have no clue why nvidia made SER a separate API call and not something the hardware does natively.

leman · Nov 3, 2023

diamond.g said:
I actually had a blurb in my last post about Apple not needing API updates as they can just do stuff behind the scenes with no code changes, but removed it because I wasn't sure if that was true.

I have no clue why nvidia made SER a separate API call and not something the hardware does natively.

I think there might be a significant different difference how Nvidia and Apple do raytracing. For Nvidia, it seems that the raytracer hardware is invoked and will eventually return the intersection results to the calling thread (this works a bit like texturing). Reordering the threads (and how to do it) is then the task of the shader itself. Nvidia's approach also allows one to incorporate additional sorting key, e.g. by intersecting object material type or some dynamic state. Apple patents however describe a very different execution model, where the raytracing hardware literally takes over the shader execution and launches a new shader with the intersection results. The way I understand it is that the shader is broken up into different pieces and these pieces are all invoked separately (just like coroutines are implemented in C++). Explicit reordering API makes no sense here simply because the thread launching an RT request and the thread receiving the RT results are different threads to begin with. I have no idea whether Apple also will support supplying additional coherency hints.

We really need some documentation from Apple's side, then we will hopefully know more.

diamond.g · Nov 3, 2023

Does Apple lose the ability to do other shader stuff while running the RT shader?

leman said:
I think there might be a significant different difference how Nvidia and Apple do raytracing. For Nvidia, it seems that the raytracer hardware is invoked and will eventually return the intersection results to the calling thread (this works a bit like texturing). Reordering the threads (and how to do it) is then the task of the shader itself. Nvidia's approach also allows one to incorporate additional sorting key, e.g. by intersecting object material type or some dynamic state. Apple patents however describe a very different execution model, where the raytracing hardware literally takes over the shader execution and launches a new shader with the intersection results. The way I understand it is that the shader is broken up into different pieces and these pieces are all invoked separately (just like coroutines are implemented in C++). Explicit reordering API makes no sense here simply because the thread launching an RT request and the thread receiving the RT results are different threads to begin with. I have no idea whether Apple also will support supplying additional coherency hints.

We really need some documentation from Apple's side, then we will hopefully know more.

leman · Nov 3, 2023

diamond.g said:
Does Apple lose the ability to do other shader stuff while running the RT shader?

Why would it? The general-purpose shader cores are free to do any other work in the meantime.

diamond.g · Nov 3, 2023

leman said:
Why would it? The general-purpose shader cores are free to do any other work in the meantime.

Good deal. AMD for sure doesn't have that ability.

leman · Nov 3, 2023

diamond.g said:
Good deal. AMD for sure doesn't have that ability.

They do! AMD RT is done on the texturing unit, it's just like any other texture read. The shader is paused until the results are available, in the meantime the hardware can run some other shader.

diamond.g · Nov 3, 2023

leman said:
They do! AMD RT is done on the texturing unit, it's just like any other texture read. The shader is paused until the results are available, in the meantime the hardware can run some other shader.

I could have sworn the whole GPU stalled, but you are right!

Now trying to look at HIP-RT stuff to see why it is still fairly far behind Optix.

leman · Nov 3, 2023

diamond.g said:
Now trying to look at HIP-RT stuff to see why it is still fairly far behind Optix.

Because AMD's RT implementation is crap. They accelerate the interaction computation itself, but not BVH traversal or the core reasons why RT is slow on GPUs (divergence, spatial locality...). Also, from what I understand their BHV layout is kind of crappy (it's a deep tree that requires a lot of hops).

By the look of it, AMD has rushed a very simple implementation just to get the feature done, and so it's not surprising that the performance is bad. In fact, AMD's RT performance seems comparable to Apple pre-G15, simply because Apple has a compare and select instruction that can be used to compute intersections efficiently on the main SIMD ALUs.

diamond.g · Nov 3, 2023

leman said:
Because AMD's RT implementation is crap. They accelerate the interaction computation itself, but not BVH traversal or the core reasons why RT is slow on GPUs (divergence, spatial locality...). Also, from what I understand their BHV layout is kind of crappy (it's a deep tree that requires a lot of hops).

By the look of it, AMD has rushed a very simple implementation just to get the feature done, and so it's not surprising that the performance is bad. In fact, AMD's RT performance seems comparable to Apple pre-G15, simply because Apple has a compare and select instruction that can be used to compute intersections efficiently on the main SIMD ALUs.

But it still sucks with RDNA3 which they claimed they made fixes on.

leman · Nov 3, 2023

diamond.g said:
But it still sucks with RDNA3 which they claimed they made fixes on.

ChipsAndCheese has a nice analysis on this. They say that RNDA3 has increased the cache size and added a dedicated instruction for loading the BVH nodes faster. But this is still treating the symptoms, not the cause. AMD's approach simply doesn't scale.

name99 · Nov 4, 2023

singhs.apps said:
Why do we assume raytracing is just for games ? And since gamers prefer higher fps over visual quality , therefore raytracing should be a lower priority for Apple ?

Considering games that support Ray tracing would generally run slower on macs anyway compared to PCs, even with raytracing switched off ? So that Linus poll is an irrelevant example in the land of Macs (considering, again, that Macs have NEVER been a first choice for gamers, and Apple never pushed it’s Mac offerings as such ?)

Switching off ray tracing … well a far more comprehensive data would be available to the game hardware manufacturers, game engines, game publishers etc. Yet AMD and Intel are trying to push their solutions at the hardware level to catch-up with Nvidia…. Why ? Just marketing?

Apart from whether ray tracing does or does not have multiple uses, there is a separate question, namely whether ray tracing HARDWARE has separate uses.
Think about a GPU core. There's a small amount of scheduling logic, a whole lot of memory logic, and a whole lot of FMA logic.
The configuration is (more or less) balanced if you are engaged in graphics work, where a certain amount of FPU computation is balanced with a constant stream of accesses to memory.
But for other sorts of GPGPU work the configuration is less balanced. For example if you are engaged in massive graph processing, mostly what you care about is the memory side, the ability to launch vast numbers of "in process" memory operations, and continually switch between them as items are returned from memory.

So consider now "ray tracing hardware". For nV this is essentially two very different pieces of hardware: a constantly active memory machine that is walking the BVH tree, some frequently accessed simple logic that tests against the bounds of each tree node then accesses the next level in the tree, and some infrequently accessed special purpose logic that performs a triangle test at the very end.
Decouple these into separately useful hardware...
So imagine that you augment a shader core not with a ray tracing engine but with better memory traversal skills. Right now traversing a tree (without RT hardware) means each access to memory pauses a warp and its entire functionality (including, most importantly, all its registers). Imagine instead a separate core (kinda like a Texture unit) with much smaller state (only a few registers) that can be given a simple program to walk a tree (or list or nested hashtables or whatever). We don't give this core an FP unit, just a simple integer unit and maybe FP comparisons.
This can be programmed and can act like a "ray tracing" core, giving the advantages of nVidia (where ray tracing and shading happen in parallel), but it can also be used for many other high-bandwidth tasks. It's cheaper in area than a full core, so it's not a ridiculous area increase to add it to each existing core, and thus get full-time on-going use of the memory machinery, but WITHOUT forcing each warp to grind to a halt on a memory access. (Or, to put it differently, providing a whole lot of much cheaper "memory-only" warps, say 16x as many, that can be active simultaneously with the "computational" warps.)

nVidia seems to (so far) have dedicated ray tracing hardware which they see as only of interest to games, and which they are specializing ever more in a game-ward direction (eg the mixed-opacity or micro-facet stuff in the 2nd generation).
AMD seems to do some weird low-performing thing I don't understand on the Texture engines.
But Apple has the chance (we shall see...) to do this correctly, building a generic and reusable "complex data structure traversal engine" which just happens to have, as one of its use cases, BVH tree traversal.

name99 · Nov 4, 2023

leman said:
ChipsAndCheese has a nice analysis on this. They say that RNDA3 has increased the cache size and added a dedicated instruction for loading the BVH nodes faster. But this is still treating the symptoms, not the cause. AMD's approach simply doesn't scale.

But I don't trust ChipsAndCheese analyses of GPU's much. They don't seem to appreciate the specific trade-offs required by GPUs.
For example they don't seem to understand how (for now... this may change in future, but not yet) a GPU write-back L1D is SW-coherent, not HW-coherent, meaning it needs to be flushed on a variety of boundaries, but most obviously when certain kernels begin or end.
This *substantially* changes the tradeoff as to how large you want your L1D to be, and the sort of logic you want to add to it (much more tracking of which lines belong to which kernels, much less tracking to optimize reuse or criticality).

Likewise they obsess over unimportant latency details (the sorts of things that determine CPU performance) while ignoring the latency details that matter for GPUs (latency to launch a kernel, threadblock, or warp, latency of synchronization, latency of cache flush, etc). I am no expert in these, but I know enough to know that they are what matters. For example consider a STEM type application running on a GPU, like Gromacs (molecular dynamics). This runs about 40% faster on raw CUDA than on SYCL on the same hardware (eg through OneAPI). There are many small reasons for this, but the single biggest reasons are the high-level latencies (kernel launch, synchronization between kernels, that sort of thing) handled better (hardly optimally! but better) by CUDA than by SYCL.

Or to put it differently, perhaps they are optimized for seeing how a GPU performs on games (and especially the way older games were written) whereas I care (and I think overall Apple cares) a lot more about future GPU cases (obviously AI/ML is part of this, but so is content creator type work, so is STEM, so is up-coming server work). The stuff Apple is adding like Dynamic Caching (and for that matter, the most interesting stuff nVidia has added over the past few years) are mostly not seen by "traditional" games, they are part of the future, not relevant to the past.

name99 · Nov 4, 2023

theorist9 said:
I'm not sure if Apple is interested in creating MacOS AS computing clusters. They worked with a couple of US universities (UCLA and Virginia Tech) to create PowerMac clusters about 20 years ago, but I don't believe anything more was done after that.

They also had a clustering application called XGrid, but it was deprecated around 2012.

Olivier Giroux now works at Apple. If you don't think this is significant look up his name, his interests (he has some very interesting YouTube talks), and what he has done in the past.

There is no way he would leave his very high-level job at nVidia (and his specific interests, which are all about REALLY pushing GPGPU, not graphics and playing games) for Apple if he didn't see some sort of convergence of Apple's interests with his interests...

leman · Nov 4, 2023

name99 said:
Apart from whether ray tracing does or does not have multiple uses, there is a separate question, namely whether ray tracing HARDWARE has separate uses.
Think about a GPU core. There's a small amount of scheduling logic, a whole lot of memory logic, and a whole lot of FMA logic.
The configuration is (more or less) balanced if you are engaged in graphics work, where a certain amount of FPU computation is balanced with a constant stream of accesses to memory.
But for other sorts of GPGPU work the configuration is less balanced. For example if you are engaged in massive graph processing, mostly what you care about is the memory side, the ability to launch vast numbers of "in process" memory operations, and continually switch between them as items are returned from memory.

So consider now "ray tracing hardware". For nV this is essentially two very different pieces of hardware: a constantly active memory machine that is walking the BVH tree, some frequently accessed simple logic that tests against the bounds of each tree node then accesses the next level in the tree, and some infrequently accessed special purpose logic that performs a triangle test at the very end.
Decouple these into separately useful hardware...
So imagine that you augment a shader core not with a ray tracing engine but with better memory traversal skills. Right now traversing a tree (without RT hardware) means each access to memory pauses a warp and its entire functionality (including, most importantly, all its registers). Imagine instead a separate core (kinda like a Texture unit) with much smaller state (only a few registers) that can be given a simple program to walk a tree (or list or nested hashtables or whatever). We don't give this core an FP unit, just a simple integer unit and maybe FP comparisons.
This can be programmed and can act like a "ray tracing" core, giving the advantages of nVidia (where ray tracing and shading happen in parallel), but it can also be used for many other high-bandwidth tasks. It's cheaper in area than a full core, so it's not a ridiculous area increase to add it to each existing core, and thus get full-time on-going use of the memory machinery, but WITHOUT forcing each warp to grind to a halt on a memory access. (Or, to put it differently, providing a whole lot of much cheaper "memory-only" warps, say 16x as many, that can be active simultaneously with the "computational" warps.)

nVidia seems to (so far) have dedicated ray tracing hardware which they see as only of interest to games, and which they are specializing ever more in a game-ward direction (eg the mixed-opacity or micro-facet stuff in the 2nd generation).
AMD seems to do some weird low-performing thing I don't understand on the Texture engines.
But Apple has the chance (we shall see...) to do this correctly, building a generic and reusable "complex data structure traversal engine" which just happens to have, as one of its use cases, BVH tree traversal.

General-purpose graph traversal hardware sounds intriguing, but in the context of raytracing other domain-specific optimisations are possible, and likely preferable. Both Nvidia and Apple describe very wide, shallow trees in order to minimise the number of node hops and allow more parallelism in intersection testing (this structure also likely has better cache reuse on higher levels). Apple patents even mention generating multiple bounding rectangles for a single large triangle, all to reduce the tree depth. With this kind of structure they can probably locate the final node in just a few iteration steps. I would assume that the memory hierarchy optimisation and the tree walker are tightly coupled with the properties of the data structure they use. A general-purpose tree walker engine is unlikely to reach the same level of performance/energy efficiency in this domain.

leman · Nov 4, 2023

name99 said:
But I don't trust ChipsAndCheese analyses of GPU's much. They don't seem to appreciate the specific trade-offs required by GPUs.

That might be the case, their ray-tracing analysis did make some good points however (especially about the structure of the spacial indices used by AMD and Nvidia, and how AMD raytracing works in practice).

Search

Search

Why is Apple slow to implement Hardware based Ray-Tracing?

leman

macrumors Core

diamond.g

macrumors G5

leman

macrumors Core

diamond.g

macrumors G5

leman

macrumors Core

diamond.g

macrumors G5

leman

macrumors Core

diamond.g

macrumors G5

leman

macrumors Core

diamond.g

macrumors G5

leman

macrumors Core

name99

macrumors 68030

name99

macrumors 68030

name99

macrumors 68030

leman

macrumors Core

leman

macrumors Core

Our Staff