
Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
When using numpy with OpenBLAS (the default), multiplying an 8192x8192 matrix of random
singles I get ~600 GFLOPS on an M2 Pro (6+4).

Switching to Accelerate with [...] this number goes to ~ 2 TFLOPS :oops:
How much faster does Accelerate make matrix multiplication? In Julia, Accelerate cuts the time to multiply an 8192x8192 matrix by almost 40% (from 2.9 s to 1.8 s).
 

chris11

macrumors newbie
Nov 30, 2008
16
12
How much faster does Accelerate make matrix multiplication? In Julia, Accelerate cuts the time to multiply an 8192x8192 matrix by almost 40% (from 2.9 s to 1.8 s).

Oh. Unless I did something wrong, it's much faster than that.

Here is the code:

Python:
import numpy as np
import time

rg = np.random.default_rng(1)
fp = "single"

for size in (2048, 4096, 8192):

    for repeat in range(1, 5):
        big1 = rg.random((size, size), dtype=fp)
        big2 = rg.random((size, size), dtype=fp)

        t0 = time.time()
        prod = big1 @ big2
        t1 = time.time()

        print("matrix mul %d x %d took %.3f seconds -> %.1f GFLOPS (%s)" % (
            size, size, t1 - t0, (2 * size ** 3 - size ** 2) / (t1 - t0) / 1.0E9, fp))

and here is the output with Accelerate on the M2Pro (6+4):

Code:
matrix mul 2048 x 2048 took 0.013 seconds -> 1340.4 GFLOPS (single)
matrix mul 2048 x 2048 took 0.010 seconds -> 1719.1 GFLOPS (single)
matrix mul 2048 x 2048 took 0.010 seconds -> 1759.8 GFLOPS (single)
matrix mul 2048 x 2048 took 0.010 seconds -> 1756.5 GFLOPS (single)
matrix mul 4096 x 4096 took 0.075 seconds -> 1844.1 GFLOPS (single)
matrix mul 4096 x 4096 took 0.073 seconds -> 1873.3 GFLOPS (single)
matrix mul 4096 x 4096 took 0.080 seconds -> 1708.4 GFLOPS (single)
matrix mul 4096 x 4096 took 0.073 seconds -> 1885.2 GFLOPS (single)
matrix mul 8192 x 8192 took 0.525 seconds -> 2094.3 GFLOPS (single)
matrix mul 8192 x 8192 took 0.539 seconds -> 2038.6 GFLOPS (single)
matrix mul 8192 x 8192 took 0.526 seconds -> 2089.2 GFLOPS (single)
matrix mul 8192 x 8192 took 0.529 seconds -> 2080.0 GFLOPS (single)

With OpenBLAS, the 8192 x 8192 case is about 3.3 times slower.
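For anyone reproducing this, a quick way to confirm which BLAS backend a given NumPy build is actually linked against (a minimal sketch; the exact output format varies by NumPy version):

Python:
import numpy as np

# Prints the BLAS/LAPACK build configuration; look for "accelerate"
# or "openblas" in the library names to confirm which backend is in use.
np.show_config()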

-- Chris
 
  • Like
Reactions: Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
With OpenBLAS, the 8192 x 8192 case is about 3.3 times slower.
Your numbers are consistent with those of other benchmarks. For example, in Julia:
- multiplying 8192x8192 arrays in single precision with Accelerate takes 0.5s instead of 1.5s with OpenBLAS.
- multiplying 8192x8192 matrices in double precision with Accelerate takes 1.8s instead of 2.9s with OpenBLAS.
- multiplying complex 8192x8192 matrices in double precision with Accelerate takes 7.8s instead of 14s with OpenBLAS.
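Converting those timings into effective GFLOPS with the usual 2n^3 FLOP count lines up with the numbers above (a quick sketch; counting a complex multiply-add as four real FLOPs is an approximation):

Python:
# Effective GFLOPS implied by the Julia timings above (n = 8192).
n = 8192
timings = [
    ("single, Accelerate", 0.5, 1), ("single, OpenBLAS", 1.5, 1),
    ("double, Accelerate", 1.8, 1), ("double, OpenBLAS", 2.9, 1),
    ("complex double, Accelerate", 7.8, 4), ("complex double, OpenBLAS", 14.0, 4),
]
for label, seconds, flop_scale in timings:
    gflops = flop_scale * 2 * n**3 / seconds / 1e9
    print(f"{label}: {gflops:.0f} GFLOPS")
# single/Accelerate works out to ~2200 GFLOPS, matching the ~2 TFLOPS reported above.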
 
  • Like
Reactions: Romain_H

Romain_H

macrumors 6502a
Sep 20, 2021
520
438
Your numbers are consistent with those of other benchmarks. For example, in Julia:
- multiplying 8192x8192 arrays in single precision with Accelerate takes 0.5s instead of 1.5s with OpenBLAS.
- multiplying 8192x8192 matrices in double precision with Accelerate takes 1.8s instead of 2.9s with OpenBLAS.
- multiplying complex 8192x8192 matrices in double precision with Accelerate takes 7.8s instead of 14s with OpenBLAS.
How would this compare with CUDA? Is it even comparable? If so, what GPU would one compare this to?
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
How would this compare with CUDA? Is it even comparable? If so, what GPU would one compare this to?
What do you have in mind? FP64? FP32? FP16? FP8 or Int8? Sparse vs dense matrices? etc etc.

The handwaving answer is that the most common GPU (by Steam Survey) is nV GeForce RTX 3060.
This has 28 SMs which are more or less analogous to an Apple Silicon GPU "Core" (so an SM has 128 execution lanes, just like an Apple Silicon GPU core has 128 execution lanes).
An M2 Pro has 16 or 19 GPU cores depending on whether you get the low or high end version; whereas a Max has 30 or 38.
With those numbers, *for the most part* both nV and Apple do a reasonable job of executing one FP op per cycle per lane with well-written code. Apple (just like with their CPUs) do a slightly better job of not wasting time on busywork overhead (eg address calculations); on the other hand (just like with x86) there's a much larger pool of code that's been pre-optimized for nVidia.

When it comes specifically to matrix multiply, the points I raised become even more relevant.
If you want FP64, well AMX covers that, whereas doing it on the GPU (whether Apple or nV) is a lot slower (like ~18 to ~50 ops per FP64 FMAC depending on exactly how you fake FP64-like precision).

For FP32 and FP16 nV do it via a dedicated Tensor Unit, whereas Apple do it via dedicated cross-lane permute/multiply/add instructions. As far as I can tell, performance is essentially the same given the same amount of hardware (appropriately scaled, since exact matches between the two don't exist). Apple take advantage of the fact that, unusually for a GPU, they have very rich permute operations to move data between lanes; and they use these operations in other useful ways. nV don't have those operations (as far as I know) and so use dedicated HW for matrix multiply, but there are limits to how much of that HW they can include on the GPU.

But Apple's hardware does not scale down to FP8 or Int8 multiply, whereas nV's tensor units do scale down to that, so they will look really good for that purpose. The most recent nV GPUs also support sparse matrices in a way that Apple does not yet (in either AMX or the GPU), but I don't know if the RTX 3060 is recent enough to have that support.

Of course, if you are willing to go up in price, nV scales a lot larger -- and a LOT more expensive. (As does Apple, but not as large, not yet.) I'm trying to compare the kind of range of things that most people actually buy rather than professional or data center equipment.

So I think it's fair to say that the most popular nV GPU (among people who pay specifically for a GPU, but don't have crazy budgets) is for most purposes above an Apple Pro (what people who don't really care about the GPU but want a better-than-basic computer would buy) and below an Apple Max (what people who do care about the GPU but have realistic budgets would buy). FWIW I bought a mac mini M2 Pro (low end 6+4 core, 16 GPU core) and would not have bought an M2 Max even if such existed. But then for my purposes I don't play games on the machine, so I care more about CPU speed than GPU.
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
But Apple's hardware does not scale down to FP8 or Int8 multiply, whereas nV's tensor units do scale down to that, so they will look really good for that purpose.
Apple does use the neural engine (FP16) and GPU (FP32) simultaneously for MetalFX upscaling. That's why they recommend avoiding Metal resource dependencies: to hide the massive (~1 ms)* latency of accessing the ANE, it overlaps GPU work across the next frame's render pass. The main (practical) use case for tensor cores on the RTX 3060 is running ConvNets for DLSS upscaling, and Apple achieved the same objective in a different way.

*Take this conclusion and number with a grain of salt though.

Chip | FP16 Matrix | TF32 Matrix | FP32 Vector | Power during AI upscaling
M2 Pro (19-core) | 15.8 TFLOPS | 5.47 TFLOPS | 6.84 TFLOPS | 2 W :oops:
RTX 3060 | 50.96 TFLOPS | 12.74 TFLOPS | 12.74 TFLOPS | maybe 10% of 170 W

I got 2 W at 120 Hz, 768x768 -> 1536x1536. Extrapolate what you will to other resolutions. The ANE part was 80 mW.
 
  • Like
Reactions: name99

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Chip | FP16 Matrix | TF32 Matrix | FP32 Vector | Power during AI upscaling
M2 Pro (19-core) | 15.8 TFLOPS | 5.47 TFLOPS | 6.84 TFLOPS | 2 W :oops:
RTX 3060 | 50.96 TFLOPS | 12.74 TFLOPS | 12.74 TFLOPS | maybe 10% of 170 W

I got 2 W at 120 Hz, 768x768 -> 1536x1536. Extrapolate what you will to other resolutions. The ANE part was 80 mW.

Rescaling seems like a cheap operation. An RTX 3050 needs less than 3 W to upscale a 720p video to 4K.

To show this, we tested VSR on the same video sequence — the 720p NHL game upscaled to 4K — on two different extremes of the VSR spectrum, with both the VSR Quality 1 and VSR Quality 4 settings.

Nvidia GPU Power Use With / Without VSR
GPU | VSR Off | VSR On (1) | VSR On (4)
RTX 4090 (Watts) | 28.9 | 32.8 | 36.9
RTX 3050 (Watts) | 13.0 | 15.9 | 15.9
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
An RTX 3050 needs less than 3 W to upscale a 720p video to 4K.
"In contrast, the RTX 3050 needed just 3W more power for either VSR setting."

YouTube is typically 60 Hz, but I was ray tracing at 120 Hz. 2 W is the combined cost of both the ray tracing part and the upscaling part. The RTX 3050 was consuming 16 W total.

Resolution | FPS | Power (staring at empty background) | Power (every pixel intersects a ray)
768x768 -> 1536x1536 | 60 Hz | 950 mW | 1150 mW
1024x1024 -> 2048x2048 | 60 Hz | 1700 mW | 2000 mW
768x768 -> 1536x1536 | 120 Hz | 1600 mW | 2200 mW
1024x1024 -> 2048x2048 | 120 Hz | 3500 mW | 5300 mW

Extrapolate this to the Nvidia GPUs, which are inputting 921,600 pixels @ 60 Hz. Pretend they are rendering my 1024x1024 input pixel scene. Apple's algorithm does not have quality settings, but you were referencing Nvidia's lowest quality (VSR 1). I will show both lowest -> highest quality. I am also assuming Nvidia would consume 300 mW without upscaling, but in reality the minimum power is >10000 mW. Notice that power consumption is higher on their newer GPU.

Resolution | FPS | Power (staring at empty background) | Power (every pixel intersects a ray)
1024x1024 in, M1 Max | 60 Hz | 1700 mW | 2000 mW
720x1280 in, 3050 | <= 60 Hz | 2900 -> 2900 mW | 3200 -> 3200 mW
720x1280 in, 4090 | <= 60 Hz | 3900 -> 8000 mW | 4200 -> 8300 mW
 
  • Like
Reactions: Xiao_Xi

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
Rescaling seems like a cheap operation. An RTX 3050 needs less than 3 W to upscale a 720p video to 4K.


That's just spatial?
My experience (not using either nV or Apple but an expensive Apple Silicon 3rd party app [in test/demo mode!]) is that what's REALLY expensive is temporal resampling. That's also where I want this technology to go...

The combination of Apple TV and/or my TV do an acceptable job of spatial upsampling today, but the combination of incompetences at every level of pouring digital content from film to some eventual stream results in absolutely horrendous jerkiness.
(Things like company A uses 3:2 pulldown to generate 60/s interleaved fields from film, then company B streams this as 30/s upscaled fields->frames.
When the 60fps fields are broadcast, a lot of hardware, including mine, can undo that back to 24 fps; but when these bastardized 30fps upscaled fields are broadcast, they are helpless!)
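For readers who haven't met the cadence before, a toy illustration of what 3:2 pulldown does to 24 fps film and why the field-level version is recoverable (purely illustrative):

Python:
# 3:2 pulldown: every 4 film frames (A,B,C,D) become 10 interlaced fields,
# in a repeating 3-2-3-2 cadence (24 fps -> ~60 fields/s).
film = ["A", "B", "C", "D"]
fields = []
for i, frame in enumerate(film):
    fields += [frame] * (3 if i % 2 == 0 else 2)
print(fields)  # ['A','A','A','B','B','C','C','C','D','D']
# Detecting that repeating 3-2 pattern is what lets hardware reverse it back to 24p.
# Once the fields are fused into 30 fps frames (some frames mixing two film frames),
# the cadence is much harder to recover -- hence the jerkiness described above.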

I"m hoping for AI that, at some point, does the job of
- tracking the jerkiness
- figuring out the jerk pattern
- retemporalizing back to the correct frame rate
- temporal upsampling from 24fps (which ain't good enough!) to 120fps

We're a long way from that, but basic AI temporal upsampling is the first step.
Unfortunately the usual suspects that report on AI upsampling quality, for whatever reason, seem uninterested in telling us how well nV or Apple or anyone else handle this particular task. I get that it's harder to test and to demonstrate (freezing a frame to see upscaling artifacts is easy). But I do wish there were *some* reporting on this! After all the TV review folks (like rtings.com) have a similarly tough task, but they bite the bullet and buy the hardware or whatever necessary to do the job!
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
"According to the Nvidia documents, a new and different neural network is used than the one in DLSS, and training is done on different data sets. RTX Video Super Resolution works purely with pixels of the image (video frame), unlike DLSS it doesn’t use any “meta” inputs like motion vector and depth information from the game engine. On the one hand, the filter detects and enhances edges, and on the other hand it also tries to remove compression artifacts."

I'm using MetalFX with temporal upscaling. This technique is much more computationally intensive than spatial upscaling. So the comparison with NV was not fair at all. Temporal upscaling needs both motion vectors and a depth buffer, although I think video compression uses motion vectors somehow.
 
  • Like
Reactions: name99

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
I'm using MetalFX with temporal upscaling. This technique is much more computationally intensive than spatial upscaling.
The DLSS benchmarks focus on FPS improvement. Igor did a cool benchmark, but it doesn't show how much power DLSS consumes.
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Ideally, TAA will lead to a 4x FPS improvement, because you can render at half the resolution in each dimension. However, now some time in the frame is spent on the upscaler. I would set 3-3.5x as the actual theoretical limit. Another FPS boost might be interpolating frames, which could be more than 6-7x because the second frame doesn't need to be rendered. For my use case, I need the exact geometry data for each frame, not just an AI's guess, to capture high-frequency motions.
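To make that arithmetic concrete, a toy model of the limit (all numbers are assumptions, just to show where 3-3.5x comes from):

Python:
# Rendering at half resolution per axis shades 4x fewer pixels,
# but the upscaler adds a roughly fixed cost per output frame.
native_ms = 16.0              # assumed cost of a native-resolution frame
half_res_ms = native_ms / 4   # quarter the pixels
upscaler_ms = 1.0             # assumed TAAU/MetalFX cost per frame
speedup = native_ms / (half_res_ms + upscaler_ms)
print(f"{speedup:.1f}x")      # ~3.2x rather than the ideal 4x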

Then there's the geometry pass, which will not change in cost. In rasterization or hybrid RT, this is the vertex stage. My use case is entirely RT, but builds a new acceleration structure each frame. Lower resolutions let me have a tradeoff, where the accel takes less time (1/32x) to build, but the intersection functions are more expensive (32x). I can process groups of 32 nearby atoms instead of each atom individually. MetalFX "makes or breaks" whether I can render supermassive molecules, each atom having a different position in each different frame.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
We're a long way from that, but basic AI temporal upsampling is the first step.
Do you want something like RIFE?

Video size | Generic (any GPU) | TensorRT (Nvidia only)
720p @48 fps | >= RTX 2060 or M1 Max | 1660 (?)
1080p @48 fps, 720p @60-72 fps | >= RTX 3070 | >= RTX 2060
1080p @60-72 fps | >= RTX 3080 | >= RTX 3060
4K @48 fps | RTX 4090 | >= RTX 4080
4K @60-72 fps | wait for RTX 5xxx... | RTX 4090
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
This technique is much more computationally intensive than spatial upscaling. So the comparison with NV was not fair at all.
I want to reiterate that Apple, under their highest-quality, most computationally intensive algorithm, used a fraction of the power of both Ampere and Ada GPUs, while Nvidia was on their lowest-quality, least computationally intensive algorithm.
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
Do you want something like RIFE?

Video size | Generic (any GPU) | TensorRT (Nvidia only)
720p @48 fps | >= RTX 2060 or M1 Max | 1660 (?)
1080p @48 fps, 720p @60-72 fps | >= RTX 3070 | >= RTX 2060
1080p @60-72 fps | >= RTX 3080 | >= RTX 3060
4K @48 fps | RTX 4090 | >= RTX 4080
4K @60-72 fps | wait for RTX 5xxx... | RTX 4090

Optical flow is one input into a decent temporal resampler.

But for the purposes *I* care about, more important (and more difficult) is detecting (and undoing) the jerkiness that comes from the uneven frame rate of the input after the 3:2 pulldown followed by alternate field removal...
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
The Apple M1 GPU has hardware acceleration for ray tracing, similar in performance to RDNA 2. The majority of the compute cost in RT is ray-box intersections, which is why that operation happens 4x faster (RDNA 2), 12x faster (Intel Arc) than ray-triangle intersections. Here is the optimal slab method for ray-box intersections:

C++:
bool intersection(const struct ray *ray, const struct box *box) {
    float tmin = 0.0, tmax = INFINITY;

    for (int d = 0; d < 3; ++d) {
        bool sign = signbit(ray->dir_inv[d]);
        float bmin = box->corners[sign][d];
        float bmax = box->corners[!sign][d];

        float dmin = (bmin - ray->origin[d]) * ray->dir_inv[d];
        float dmax = (bmax - ray->origin[d]) * ray->dir_inv[d];

        tmin = max(dmin, tmin);
        tmax = min(dmax, tmax);
    }

    return tmin < tmax;
}

Here is one iteration of the inner loop in Apple GPU assembly. Notice the magic CMPSEL instruction - performing a comparison and selection in a single cycle. Both GPT-4 and an AMD expert confirmed no other architectures have this. It is also absent from the M1 CPU, which requires 2 cycles for MAX, MIN, etc.

C++:
Apple:
// precomputed ori_dir = origin * dir_inv
float bmin = CMPSEL(dir_inv < 0 ? corner[1] : corner[0]); // 1 cyc
float bmax = CMPSEL(dir_inv < 0 ? corner[0] : corner[1]); // 1 cyc
float dmin = FFMA(bmin, dir_inv, -ori_dir); // 1 cyc
float dmax = FFMA(bmax, dir_inv, -ori_dir); // 1 cyc
tmin = CMPSEL(dmin > tmin ? dmin : tmin); // 1 cyc
tmax = CMPSEL(dmax < tmax ? dmax : tmax); // 1 cyc
// 6 cycles total

On Apple, all 3 iterations take 18 cycles, while AMD or Nvidia take 30 cycles. Granted this ignores the memory and control flow instructions afterward. This makes it similar to AMD's ray accelerators, which only accelerate the intersection part. Pretend CMPSEL isn't a native instruction, compare Apple's throughput:

Code:
Apple:                                30 cycles
AMD:    64 ALU /  4 ray-box/CU      = 16 cycles
Intel: 128 ALU / 12 ray-box/Xe-core = 10.7 cycles
30 / 16   = 1.88x AMD
30 / 10.7 = 2.80x Intel

Now, Apple adds CMPSEL to the GPU ISA:

Code:
Apple:                                18 cycles
AMD:    64 ALU /  4 ray-box/CU      = 16 cycles
Intel: 128 ALU / 12 ray-box/Xe-core = 10.7 cycles
18 / 16   = 1.13x AMD
18 / 10.7 = 1.68x Intel

An Apple GPU has ~same throughput as an RDNA2 GPU of the same F32 TFLOPS. Furthermore, NV boasts a lot about ray-triangle performance, but my use case has zero triangle intersections. For ray-box perf, it could be much closer to an Ampere GPU of the same F32 TFLOPS than people think. The ratio of compute:memory is also lopsided: much more memory bandwidth in the M1's case. Meanwhile, NV optimized the memory subsystem for memory divergence in ray tracing. Say NV has 2.5x better memory, 2.5x better ray-box throughput. Memory might not be any more of a bottleneck on M1 than on NV.

The M1 SLC also has similar (occupancy-scaled) latency/clock speed ratio and size range as NV's L2. The SLC is effectively an L2, the L2 is effectively an L1 instruction cache.
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
The Apple M1 GPU has hardware acceleration for ray tracing, similar in performance to RDNA 2. The majority of the compute cost in RT is ray-box intersections, which is why that operation happens 4x faster (RDNA 2), 12x faster (Intel Arc) than ray-triangle intersections. Here is the optimal slab method for ray-box intersections:

C++:
bool intersection(const struct ray *ray, const struct box *box) {
    float tmin = 0.0, tmax = INFINITY;

    for (int d = 0; d < 3; ++d) {
        bool sign = signbit(ray->dir_inv[d]);
        float bmin = box->corners[sign][d];
        float bmax = box->corners[!sign][d];

        float dmin = (bmin - ray->origin[d]) * ray->dir_inv[d];
        float dmax = (bmax - ray->origin[d]) * ray->dir_inv[d];

        tmin = max(dmin, tmin);
        tmax = min(dmax, tmax);
    }

    return tmin < tmax;
}

Here is one iteration of the inner loop in Apple GPU assembly. Notice the magic CMPSEL instruction - performing a comparison and selection in a single cycle. Both GPT-4 and an AMD expert confirmed no other architectures have this. It is also absent from the M1 CPU, which requires 2 cycles for MAX, MIN, etc.

C++:
Apple:
// precomputed ori_dir = origin * dir_inv
float bmin = CMPSEL(dir_inv < 0 ? corner[1] : corner[0]); // 1 cyc
float bmax = CMPSEL(dir_inv < 0 ? corner[0] : corner[1]); // 1 cyc
float dmin = FFMA(bmin, dir_inv, -ori_dir); // 1 cyc
float dmax = FFMA(bmax, dir_inv, -ori_dir); // 1 cyc
tmin = CMPSEL(dmin > tmin ? dmin : tmin); // 1 cyc
tmax = CMPSEL(dmax < tmax ? dmax : tmax); // 1 cyc
// 6 cycles total

On Apple, all 3 iterations take 18 cycles, while AMD or Nvidia take 30 cycles. Granted this ignores the memory and control flow instructions afterward. This makes it similar to AMD's ray accelerators, which only accelerate the intersection part. Pretend CMPSEL isn't a native instruction, compare Apple's throughput:

Code:
Apple:                                30 cycles
AMD:    64 ALU /  4 ray-box/CU      = 16 cycles
Intel: 128 ALU / 12 ray-box/Xe-core = 10.7 cycles
30 / 16   = 1.88x AMD
30 / 10.7 = 2.80x Intel

Now, Apple adds CMPSEL to the GPU ISA:

Code:
Apple:                                18 cycles
AMD:    64 ALU /  4 ray-box/CU      = 16 cycles
Intel: 128 ALU / 12 ray-box/Xe-core = 10.7 cycles
18 / 16   = 1.13x AMD
18 / 10.7 = 1.68x Intel

An Apple GPU has ~same throughput as an RDNA2 GPU of the same F32 TFLOPS. Furthermore, NV boasts a lot about ray-triangle performance, but my use case has zero triangle intersections. For ray-box perf, it could be much closer to an Ampere GPU of the same F32 TFLOPS than people think. The ratio of compute:memory is also lopsided: much more memory bandwidth in the M1's case. Meanwhile, NV optimized the memory subsystem for memory divergence in ray tracing. Say NV has 2.5x better memory, 2.5x better ray-box throughput. Memory might not be any more of a bottleneck on M1 than on NV.

The M1 SLC also has similar (occupancy-scaled) latency/clock speed ratio and size range as NV's L2. The SLC is effectively an L2, the L2 is effectively an L1 instruction cache.

This is not for Philip (who surely understands the issues more than me!) but for other readers.
Talking about "native ray tracing" bundles together a large number of issues; it's more useful to see what elements Apple has already picked up and what elements remain to be handled.

First we have the SW side.
Apple's first round of ray tracing support was to provide Metal Performance Shaders that do the job. Using a very rough analogy, this is like providing a command line tool for the CPU, and you, a developer, can perform some task by calling that command line tool (eg via the C system() call). Point is, overhead is high.
The next year Apple provided the equivalent of a function call so that you could interleave calls to ray tracing with the rest of your compute/render pipeline. That's obviously lower overhead, so a nice little boost.
The next year Apple provided the equivalent of changes in the Metal language itself (ie changes to the compiler) so that ray tracing becomes basically as efficient as it can possibly be, compiled directly with no function call overhead, scheduled optimally, blah blah.

Alongside all this, over the same years Apple introduced changes to the rest of the Metal runtime to allow much faster operation of the sorts of operations usually performed while ray tracing (for example, when a ray intersects a triangle, you need to look up the material properties of that triangle to figure out what to do -- extinguish the ray? reflect the ray? change the color of the reflected ray? etc). This involves things like how data is laid out in memory and how the Metal runtime knows what data is stored in what buffers.

All these SW changes, even if nothing were done to the HW, would be a nice performance improvement. But another way to look at them is that they were essential infrastructure that would need to be in place to take advantage of any future ray tracing HW, and now they are fully in place...

Now for the HW side.
Ray tracing involves two primary operations. The first is building the so-called Acceleration Structure (based on the scene geometry), the second is using that structure to test for ray intersections. You can provide hardware to perform either or both parts of these two jobs.
Apple has Metal API for creating the Acceleration Structure and, a year later, provided API to allow for updating an existing Acceleration Structure. (You do this if you are creating animation rather than a single ray traced image. Hopefully, if your geometry has not changed too much relative to the previous frame, you can make small cheap modifications to the Acceleration Structure rather than rebuilding from scratch.)

nV (and maybe AMD, I am not sure about them) have hardware to improve building the Acceleration Structure. I don't know if they have a special path for updating it for small geometry changes.

Apple seem to view these issues in terms of (as far as possible) making each lane and core of the GPU more capable, rather than adding side hardware that can only do one task. We have seen this already with matrix multiply - rather than adding the equivalent of a dedicated matrix multiply engine, Apple added permute to move data cheaply between GPU lanes (functionality that is widely useful), and built matrix multiply on top of this permute ability. Similarly they have accelerated one part of ray tracing by adding a new instruction which speeds up the relevant operation but may well have value for other tasks the GPU needs to perform.
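A rough sketch of that matrix-multiply-from-permute idea in NumPy terms: matmul built out of per-lane FMAs plus cross-lane broadcasts of one operand. This is the general flavour of a simdgroup-style implementation, not Apple's actual microcode:

Python:
import numpy as np

# Toy 8x8 "simdgroup" matmul: accumulate rank-1 updates, where each step
# broadcasts a column of A and a row of B across the lanes (the permute step)
# and every lane does one FMA into its element of C.
N = 8
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)
C = np.zeros((N, N), dtype=np.float32)
for k in range(N):
    C += np.outer(A[:, k], B[k, :])   # broadcast + per-lane multiply-accumulate
assert np.allclose(C, A @ B, atol=1e-3)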
So if we ask about "ray tracing hardware" the parts that are left are
- building/updating an acceleration structure (the API/language part of this is done) and
- ray triangle testing
They probably hope to implement both of these via per-lane improvements rather than dedicated hardware. I don't know what's involved in building the acceleration structure, but the ray triangle intersection may be an operation that can be performed on a per-quad level (like dfdx and dfdy)?
 
  • Like
Reactions: dgdosen and Xiao_Xi

Philip Turner

macrumors regular
Dec 7, 2021
170
111
I recommend reading over Imagination's CXT whitepaper. Apple stopped using Imagination's PowerVR architecture when they introduced the A11 GPU, but the two have probably still been working together for at least the past few years. Imagination created "ray sorting" hardware to improve the power efficiency of ray tracing, which reorders rays to reduce divergence. This is basically a ton of data permutation, swapping rays and state variables between different threads. Remember all of this is for their AR/VR headset, which is extremely power constrained - even with the most efficient GPU in the world, only 2 hours on battery.

Apple has also sacrificed performance for the sake of efficiency before - their AMX has half the vector throughput of NEON, making Accelerate slower than OpenBLAS. I wouldn't rule out Apple having less theoretical performance ("RT TFLOPS") than Nvidia. They might reuse the FP32 ALUs for intersections; the bottleneck is how to coordinate data movement (BVH decoding, ray sorting) to reach high utilization of existing compute hardware. This would also line up with their equalizing of FP16 and FP32 performance. It is not possible to perform ray intersection tests in FP16, only FP32. AMD, Intel, and Nvidia still have dedicated FP16 ALUs, although on recent generations only accessible through matmul instructions. Apple removed FP16 from the GPU and put it in the neural engine (ahem, MetalFX ray tracing denoiser).
 
  • Like
Reactions: name99

Philip Turner

macrumors regular
Dec 7, 2021
170
111
- building/updating an acceleration structure (the API/language part of this is done) and

Good point - for my use case, the animation is so costly (1 million atoms @ 120 Hz) that I have to consider things like streaming a pre-built accel straight from the disk. This is where Metal fast resource loading and the LZBITMAP codec become extremely helpful. I need on the order of 1 GB per second of disk bandwidth.
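Back-of-the-envelope for that bandwidth figure (the bytes-per-atom value is my assumption for illustration, not a number from the post, and a pre-built acceleration structure would carry extra data on top of raw coordinates):

Python:
# Streaming fresh per-frame geometry for 1 million atoms at 120 Hz.
atoms = 1_000_000
bytes_per_atom = 12        # assumed: three float32 coordinates per atom
fps = 120
print(atoms * bytes_per_atom * fps / 1e9, "GB/s")   # ~1.4 GB/s before compression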
 

leman

macrumors Core
Oct 14, 2008
19,517
19,664
Nice catch regarding CMPSEL! It is basically a fusion of compare and select operations, which is a common pattern in a bunch of data-parallel algorithms. I don't think that raytracing acceleration is the primary idea behind this instruction (e.g. Apple also has an integer variant), but it does help to make intersection calculation a bit cheaper, which is certainly nice.

But I wouldn’t call it hardware-accelerated RT. To be honest, I wouldn’t call AMDs implementation hardware accelerated RT either. A hardware RT implementation would do the entirety of the traversal using dedicated circuitry.
 
  • Like
Reactions: Philip Turner

Philip Turner

macrumors regular
Dec 7, 2021
170
111
But I wouldn’t call it hardware-accelerated RT. To be honest, I wouldn’t call AMDs implementation hardware accelerated RT either. A hardware RT implementation would do the entirety of the traversal using dedicated circuitry.
Would you consider simdgroup_matrix to be hardware-accelerated matrix multiplication? Does the chip have to dedicate space to separate circuitry that will only be utilized during a small fraction of everyday workloads?
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
To be honest, I wouldn’t call AMDs implementation hardware accelerated RT either. A hardware RT implementation would do the entirety of the traversal using dedicated circuitry.
Chips and Cheese wrote a post about ray tracing on AMD and Nvidia GPUs some time ago.
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Chips and Cheese wrote a post about ray tracing on AMD and Nvidia GPUs some time ago.
I read this today while checking whether the CMPSEL conclusion was right.

"Each SM tracked 13.23 waves on average out of 64 theoretical,"

Yikes! Apple prevents register allocation from ever pushing occupancy below 24 simds/core (max 96), while NV lets you go as low as 8 simds/core. You need a lot of resident simds to hide the latency of the L2 (NV) / L3 (M1) cache.
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
I recommend reading over Imagination's CXT whitepaper. Apple stopped using Imagination's PowerVR architecture when they introduced the A11 GPU, but the two have probably still been working together for at least the past few years. Imagination created "ray sorting" hardware to improve the power efficiency of ray tracing, which reorders rays to reduce divergence. This is basically a ton of data permutation, swapping rays and state variables between different threads. Remember all of this is for their AR/VR headset, which is extremely power constrained - even with the most efficient GPU in the world, only 2 hours on battery.

Apple has also sacrificed performance for the sake of efficiency before - their AMX has half the vector throughput of NEON, making Accelerate slower than OpenBLAS. I wouldn't rule out Apple having less theoretical performance ("RT TFLOPS") than Nvidia. They might reuse the FP32 ALUs for intersections; the bottleneck is how to coordinate data movement (BVH decoding, ray sorting) to reach high utilization of existing compute hardware. This would also line up with their equalizing of FP16 and FP32 performance. It is not possible to perform ray intersection tests in FP16, only FP32. AMD, Intel, and Nvidia still have dedicated FP16 ALUs, although on recent generations only accessible through matmul instructions. Apple removed FP16 from the GPU and put it in the neural engine (ahem, MetalFX ray tracing denoiser).
Do we know for a fact that Apple has removed FP16 from the GPU? When? With A15/M2?
 