A user on the Blender forum ran an experiment comparing the speedup that ray tracing hardware provides on Intel, AMD, and Nvidia GPUs in Blender. You can't draw many conclusions from it, but the charts are very interesting. For instance:
[attached chart comparing RT hardware speedups across GPUs]
I’ve been interested in comparing the Ray Accelerators in the AMD GPUs to the Ray Tracing Units in Intel GPUs to the Ray Tracing Cores in Nvidia GPUs. And now that we have HIPRT, OptiX, and Embree GPU in Cycles, we can do that. And with Cycles we can extract some extra debug information to help...
devtalk.blender.org
I wonder what those charts would look like if Apple GPUs had ray tracing hardware.
What this clearly shows is that "RT hardware" is a technical misnomer. We need to understand what exactly these GPUs are doing rather than just saying "they have hardware RT".
Take AMD, for example. They have a dedicated instruction for intersecting a ray against a BVH node. This instruction fetches the BVH data, runs a lengthy math sequence that checks for intersections, sorts the results, and hands them back to your shader. At the same time, it's not very clear how this instruction is implemented. There is probably at least some dedicated hardware logic behind it, but I wouldn't be surprised if the base implementation uses some sort of microcode-like sequence to run the math itself. And I very much doubt that this instruction is particularly fast. At any rate, while it does seem to accelerate the compute-heavy part of BVH traversal, it does nothing to address the main problem of RT: memory access patterns. GPUs are designed with data locality in mind, and tree traversal tends to be fairly random. When you run these kinds of tasks on massive SIMD processors, you end up with cache misses and divergent instruction flow, which kills your performance. Both Nvidia and Intel clearly have a technically superior solution that does more than just accelerate the compute sequence. As to how exactly they do it, one would probably need to study their patents (and hope that the answer is in there).
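To make the memory-access point concrete, here is a minimal CPU-style sketch of what one traversal step involves. The node layout and names are purely illustrative (not AMD's actual format, and not their microcode): the per-box arithmetic is the part a dedicated instruction can speed up, while the dependent, irregular node fetches and per-ray branching in the loop are the part it can't.

```cpp
// Hypothetical sketch of one ray vs. BVH-node step and a traversal loop.
// The math is cheap; the pain points are the irregular loads of `node`
// and the divergent control flow between neighbouring rays.
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct Ray  { float ox, oy, oz, inv_dx, inv_dy, inv_dz, t_max; };
struct AABB { float min[3], max[3]; };
struct Node { AABB box; int32_t left, right; bool is_leaf; int32_t prim; };

// Classic "slab test": intersect a ray with an axis-aligned box.
bool hit_aabb(const Ray& r, const AABB& b) {
    float t0 = 0.0f, t1 = r.t_max;
    const float o[3]   = { r.ox, r.oy, r.oz };
    const float inv[3] = { r.inv_dx, r.inv_dy, r.inv_dz };
    for (int a = 0; a < 3; ++a) {
        float near = (b.min[a] - o[a]) * inv[a];
        float far  = (b.max[a] - o[a]) * inv[a];
        if (near > far) std::swap(near, far);
        t0 = std::max(t0, near);
        t1 = std::min(t1, far);
    }
    return t0 <= t1;
}

// Depth-first BVH traversal with an explicit stack. On a SIMD GPU,
// neighbouring rays take different branches and touch different nodes,
// so lanes diverge and node fetches miss cache -- the part a single
// intersection instruction does not fix.
int traverse(const std::vector<Node>& bvh, const Ray& r) {
    int stack[64]; int sp = 0; stack[sp++] = 0;   // root at index 0
    int closest_prim = -1;
    while (sp > 0) {
        const Node& n = bvh[stack[--sp]];         // dependent, irregular load
        if (!hit_aabb(r, n.box)) continue;
        if (n.is_leaf) { closest_prim = n.prim; continue; } // real code: ray/triangle test here
        stack[sp++] = n.left;
        stack[sp++] = n.right;
    }
    return closest_prim;
}
```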
I hope you can see how the notion of "hardware RT" becomes useless here. AMD can legitimately claim that they have hardware RT, but their solution is just a band-aid that doesn't really do much. Apple, for example, doesn't have a dedicated GPU RT instruction, but they do have a general-purpose four-argument compare-and-select instruction that can be used to compute a ray/AABB intersection in six GPU cycles. In fact, I wouldn't be shocked if Apple's "software RT" ends up as fast as or faster than AMD's "hardware RT" in practice.
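For context on why that one instruction matters: the ray/AABB slab test is essentially per-axis min/max plus a final compare, so a fused compare-and-select primitive covers most of the work. Below is a sketch where cmpsel() is my own stand-in for such an instruction; the exact semantics of Apple's op, and the six-cycle figure, are from the claim above, not from this code.

```cpp
// Sketch only: the ray/AABB slab test written so every step is a
// min/max/compare-select. cmpsel(a, b, x, y) == (a < b ? x : y) is a
// stand-in for a hypothetical 4-argument compare-and-select op.
#include <algorithm>

static inline float cmpsel(float a, float b, float x, float y) {
    return a < b ? x : y;
}

bool slab_test(const float org[3], const float inv_dir[3],
               const float bmin[3], const float bmax[3], float t_max) {
    float t_near = 0.0f, t_far = t_max;
    for (int a = 0; a < 3; ++a) {
        float t0 = (bmin[a] - org[a]) * inv_dir[a];
        float t1 = (bmax[a] - org[a]) * inv_dir[a];
        // order the pair, then tighten the interval -- all min/max/select
        float lo = cmpsel(t0, t1, t0, t1);   // min(t0, t1)
        float hi = cmpsel(t0, t1, t1, t0);   // max(t0, t1)
        t_near = std::max(t_near, lo);
        t_far  = std::min(t_far,  hi);
    }
    return t_near <= t_far;
}
```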
Apple has published a bunch of patents describing an RT solution, and if that's what they end up shipping, I think there are good reasons to expect good performance. They describe a dedicated RT coprocessor (integrated into the GPU, but with its own cache) that performs intersection tests in parallel on a wide BVH. These tests are done using limited-precision ALUs, which allows for a compact and very energy-efficient hardware implementation. Intersection hits are ordered and compacted in hardware, and a new shader program is launched asynchronously to process the intersection results. There are many interesting aspects to this approach, and the main advantage appears to be low power consumption compared to existing solutions. Of course, all this is purely theoretical until we see the actual hardware.
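As a purely illustrative software model of the flow described above (node width, layout, and names are my assumptions, not anything from the patents themselves): test a ray against all children of a wide node at once, then sort and compact the surviving hits before handing them off to the next stage.

```cpp
// Illustrative model of a wide-BVH intersection step: test all children
// of an 8-wide node "in parallel", then order and compact the hits
// (what the patents are said to do in hardware). A real unit would also
// use reduced-precision arithmetic; full floats keep this sketch simple.
#include <algorithm>
#include <array>
#include <cstdint>
#include <utility>

constexpr int kWidth = 8;                      // assumed branching factor

struct WideNode {
    float bmin[kWidth][3], bmax[kWidth][3];    // child bounding boxes
    int32_t child[kWidth];                     // child index, or -1 if unused
};

struct Hit { float t_near; int32_t child; };

// Returns surviving hits, nearest first; the caller would then launch
// further work per hit (asynchronously, in the described design).
int intersect_wide_node(const WideNode& n,
                        const float org[3], const float inv_dir[3],
                        float t_max, std::array<Hit, kWidth>& out) {
    int count = 0;
    for (int c = 0; c < kWidth; ++c) {         // conceptually done in parallel
        if (n.child[c] < 0) continue;
        float t0 = 0.0f, t1 = t_max;
        for (int a = 0; a < 3; ++a) {
            float lo = (n.bmin[c][a] - org[a]) * inv_dir[a];
            float hi = (n.bmax[c][a] - org[a]) * inv_dir[a];
            if (lo > hi) std::swap(lo, hi);
            t0 = std::max(t0, lo);
            t1 = std::min(t1, hi);
        }
        if (t0 <= t1) out[count++] = { t0, n.child[c] };   // compact hits
    }
    std::sort(out.begin(), out.begin() + count,
              [](const Hit& a, const Hit& b) { return a.t_near < b.t_near; });
    return count;
}
```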