The Apple M1 GPU has hardware acceleration for ray tracing, similar in performance to RDNA 2. The majority of the compute cost in RT is ray-box intersections, which is why dedicated hardware runs that operation faster than ray-triangle intersections: 4x faster on RDNA 2, 12x faster on Intel Arc. Here is an optimized slab method for ray-box intersections:
C++:
bool intersection(const struct ray *ray, const struct box *box) {
    float tmin = 0.0, tmax = INFINITY;

    for (int d = 0; d < 3; ++d) {
        bool sign = signbit(ray->dir_inv[d]);
        float bmin = box->corners[sign][d];
        float bmax = box->corners[!sign][d];

        float dmin = (bmin - ray->origin[d]) * ray->dir_inv[d];
        float dmax = (bmax - ray->origin[d]) * ray->dir_inv[d];

        tmin = max(dmin, tmin);
        tmax = min(dmax, tmax);
    }

    return tmin < tmax;
}
Here is one iteration of the inner loop in Apple GPU assembly. Notice the magic CMPSEL instruction, which performs a comparison and a selection in a single cycle. Both GPT-4 and an AMD expert confirmed that no other architecture has this. It is also absent from the M1 CPU, where MAX, MIN, and similar compare-select operations take 2 cycles.
C++:
Apple:
// precomputed ori_dir = origin * dir_inv
float bmin = CMPSEL(dir_inv < 0 ? corner[1] : corner[0]); // 1 cyc
float bmax = CMPSEL(dir_inv < 0 ? corner[0] : corner[1]); // 1 cyc
float dmin = FFMA(bmin, dir_inv, -ori_dir); // 1 cyc
float dmax = FFMA(bmax, dir_inv, -ori_dir); // 1 cyc
tmin = CMPSEL(dmin > tmin ? dmin : tmin); // 1 cyc
tmax = CMPSEL(dmax < tmax ? dmax : tmax); // 1 cyc
// 6 cycles total
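For comparison, here is the same iteration sketched on an ISA without a fused compare-select, assuming each compare-plus-select pair costs 2 cycles (the same model used for the M1 CPU above; instruction names are illustrative):

```
float bmin = dir_inv < 0 ? corner[1] : corner[0]; // CMP + SEL, 2 cyc
float bmax = dir_inv < 0 ? corner[0] : corner[1]; // CMP + SEL, 2 cyc
float dmin = FFMA(bmin, dir_inv, -ori_dir);       // 1 cyc
float dmax = FFMA(bmax, dir_inv, -ori_dir);       // 1 cyc
tmin = dmin > tmin ? dmin : tmin;                 // CMP + SEL, 2 cyc
tmax = dmax < tmax ? dmax : tmax;                 // CMP + SEL, 2 cyc
// 10 cycles total, 30 cycles for all 3 iterations
```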
On Apple, all 3 iterations take 18 cycles, while the same sequence on AMD or Nvidia takes 30 cycles. Granted, this ignores the memory and control-flow instructions afterward. In that respect it is similar to AMD's ray accelerators, which only accelerate the intersection part. First, pretend CMPSEL isn't a native instruction and compare Apple's throughput:
Code:
Apple: 30 cycles
AMD: 64 ALU / 4 ray-box/CU = 16 cycles
Intel: 128 ALU / 12 ray-box/Xe-core = 10.7 cycles
30 / 16 = 1.88x AMD
30 / 10.7 = 2.80x Intel
Now, Apple adds CMPSEL to the GPU ISA:
Code:
Apple: 18 cycles
AMD: 64 ALU / 4 ray-box/CU = 16 cycles
Intel: 128 ALU / 12 ray-box/Xe-core = 10.7 cycles
18 / 16 = 1.13x AMD
18 / 10.7 = 1.68x Intel
An Apple GPU has roughly the same ray-box throughput as an RDNA 2 GPU of the same F32 TFLOPS. Furthermore, NV boasts a lot about ray-triangle performance, but my use case has zero triangle intersections. For ray-box performance, the M1 could be much closer to an Ampere GPU of the same F32 TFLOPS than people think. The compute:memory ratio is also lopsided, with much more memory bandwidth per FLOP in the M1's case. Meanwhile, NV optimized its memory subsystem for the memory divergence of ray tracing. Say NV has 2.5x better memory and 2.5x better ray-box throughput: memory might not be any more of a bottleneck on M1 than on NV.
The M1 SLC also has a similar (occupancy-scaled) latency-to-clock-speed ratio and size range to NV's L2. The SLC is effectively an L2; the L2 is effectively an L1 instruction cache.
This is not for Philip (who surely understands the issues better than I do!) but for other readers.
Talking about "native ray tracing" bundles together a large number of issues; it's more useful to see what elements Apple has already picked up and what elements remain to be handled.
First we have the SW side.
Apple's first round of ray tracing support was to provide Metal Performance Shaders that do the job. By a very rough analogy, this is like providing a command-line tool for the CPU: you, a developer, can perform some task by calling that tool (e.g., via the C system() call). The point is, the overhead is high.
The next year Apple provided the equivalent of a function call so that you could interleave calls to ray tracing with the rest of your compute/render pipeline. That's obviously lower overhead, so a nice little boost.
The next year Apple provided the equivalent of changes in the Metal language itself (ie changes to the compiler) so that ray tracing becomes basically as efficient as it can possibly be, compiled directly with no function call overhead, scheduled optimally, blah blah.
Alongside all this, over the same years Apple introduced changes to the rest of the Metal runtime to allow much faster operation of the sorts of operations usually performed while ray tracing (for example, when a ray intersects a triangle, you need to look up the material properties of that triangle to figure out what to do: extinguish the ray? reflect the ray? change the color of the reflected ray? etc.). This involves things like how data is laid out in memory and how the Metal runtime knows what data is stored in which buffers.
All these SW changes, even if nothing were done to the HW, would be a nice performance improvement. But another way to look at them is that they were essential infrastructure that would need to be in place to take advantage of any future ray tracing HW, and now they are fully in place...
Now for the HW side.
Ray tracing involves two primary operations. The first is building the so-called Acceleration Structure (based on the scene geometry), the second is using that structure to test for ray intersections. You can provide hardware to perform either or both parts of these two jobs.
Apple provided Metal API for creating the Acceleration Structure and, a year later, API for updating an existing Acceleration Structure. (You do this if you are creating animation rather than a single ray-traced image. Hopefully, if your geometry has not changed too much relative to the previous frame, you can make small cheap modifications to the Acceleration Structure rather than rebuilding from scratch.)
NV (and maybe AMD, I am not sure about them) has hardware to improve building the Acceleration Structure. I don't know if they have a special path for updating it after small geometry changes.
Apple seem to view these issues in terms of (as far as possible) making each lane and core of the GPU more capable, rather than adding side hardware that can only do one task. We have seen this already with matrix multiply: rather than adding the equivalent of a dedicated matrix-multiply engine, Apple added permute to move data cheaply between GPU lanes (functionality that is widely useful) and built matrix multiply on top of this permute ability. Similarly, they have accelerated one part of ray tracing by adding a new instruction which speeds up the relevant operation but may well have value for other tasks the GPU needs to perform.
So if we ask about "ray tracing hardware" the parts that are left are
- building/updating an acceleration structure (the API/language part of this is done) and
- ray-triangle testing
They probably hope to implement both of these via per-lane improvements rather than dedicated hardware. I don't know what's involved in building the acceleration structure, but ray-triangle intersection may be an operation that can be performed at the per-quad level (like dfdx and dfdy)?