
Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
When using numpy with OpenBLAS (the default), multiplying an 8192x8192 matrix of random
singles I get ~600 GFLOPS on an M2 Pro (6+4).

Switching to Accelerate with [...] this number goes to ~ 2 TFLOPS :oops:
How much faster does Accelerate make matrix multiplication? In Julia, Accelerate cuts the time to multiply an 8192x8192 matrix by almost 40% (from 2.9 s to 1.8 s).
 

chris11

macrumors newbie
Nov 30, 2008
16
12
How much faster does Accelerate make matrix multiplication? In Julia, Accelerate cuts the time to multiply an 8192x8192 matrix by almost 40% (from 2.9 s to 1.8 s).

Oh. Unless I did something wrong, it's much faster than that.

Here is the code:

Python:
import numpy as np
import time

rg = np.random.default_rng(1)
fp = "single"

for size in (2048, 4096, 8192):

    for repeat in range(1, 5):
        big1 = rg.random((size, size), dtype=fp)
        big2 = rg.random((size, size), dtype=fp)

        t0 = time.time()
        prod = big1 @ big2
        t1 = time.time()

        print("matrix mul %d x %d took %.3f seconds -> %.1f GFLOPS (%s)" % (
            size, size, t1 - t0, (2 * size ** 3 - size ** 2) / (t1 - t0) / 1.0E9, fp))

and here is the output with Accelerate on the M2Pro (6+4):

Code:
matrix mul 2048 x 2048 took 0.013 seconds -> 1340.4 GFLOPS (single)
matrix mul 2048 x 2048 took 0.010 seconds -> 1719.1 GFLOPS (single)
matrix mul 2048 x 2048 took 0.010 seconds -> 1759.8 GFLOPS (single)
matrix mul 2048 x 2048 took 0.010 seconds -> 1756.5 GFLOPS (single)
matrix mul 4096 x 4096 took 0.075 seconds -> 1844.1 GFLOPS (single)
matrix mul 4096 x 4096 took 0.073 seconds -> 1873.3 GFLOPS (single)
matrix mul 4096 x 4096 took 0.080 seconds -> 1708.4 GFLOPS (single)
matrix mul 4096 x 4096 took 0.073 seconds -> 1885.2 GFLOPS (single)
matrix mul 8192 x 8192 took 0.525 seconds -> 2094.3 GFLOPS (single)
matrix mul 8192 x 8192 took 0.539 seconds -> 2038.6 GFLOPS (single)
matrix mul 8192 x 8192 took 0.526 seconds -> 2089.2 GFLOPS (single)
matrix mul 8192 x 8192 took 0.529 seconds -> 2080.0 GFLOPS (single)

With OpenBLAS, the 8192 x 8192 case is about 3.3 times slower.
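For anyone reproducing this, a quick way to confirm which BLAS backend a given NumPy build is actually linked against (a minimal sketch; the exact output format varies by NumPy version):

Python:
import numpy as np

# Prints the BLAS/LAPACK build configuration; look for "accelerate"
# or "openblas" in the library names to confirm which backend is in use.
np.show_config()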

-- Chris
 
  • Like
Reactions: Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
With OpenBLAS, the 8192 x 8192 case is about 3.3 times slower.
Your numbers are consistent with those of other benchmarks. For example, in Julia:
- multiplying 8192x8192 arrays in single precision with Accelerate takes 0.5s instead of 1.5s with OpenBLAS.
- multiplying 8192x8192 matrices in double precision with Accelerate takes 1.8s instead of 2.9s with OpenBLAS.
- multiplying complex 8192x8192 matrices in double precision with Accelerate takes 7.8s instead of 14s with OpenBLAS.
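Converting those timings into effective GFLOPS with the usual 2n^3 FLOP count lines up with the numbers above (a quick sketch; counting a complex multiply-add as four real FLOPs is an approximation):

Python:
# Effective GFLOPS implied by the Julia timings above (n = 8192).
n = 8192
timings = [
    ("single, Accelerate", 0.5, 1), ("single, OpenBLAS", 1.5, 1),
    ("double, Accelerate", 1.8, 1), ("double, OpenBLAS", 2.9, 1),
    ("complex double, Accelerate", 7.8, 4), ("complex double, OpenBLAS", 14.0, 4),
]
for label, seconds, flop_scale in timings:
    gflops = flop_scale * 2 * n**3 / seconds / 1e9
    print(f"{label}: {gflops:.0f} GFLOPS")
# single/Accelerate works out to ~2200 GFLOPS, matching the ~2 TFLOPS reported above.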
 
  • Like
Reactions: Romain_H

Romain_H

macrumors 6502a
Sep 20, 2021
520
438
Your numbers are consistent with those of other benchmarks. For example, in Julia:
- multiplying 8192x8192 arrays in single precision with Accelerate takes 0.5s instead of 1.5s with OpenBLAS.
- multiplying 8192x8192 matrices in double precision with Accelerate takes 1.8s instead of 2.9s with OpenBLAS.
- multiplying complex 8192x8192 matrices in double precision with Accelerate takes 7.8s instead of 14s with OpenBLAS.
How would this compare with CUDA? Is it even comparable? If so, what GPU would one compare this to?
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
How would this compare with CUDA? Is it even comparable? If so, what GPU would one compare this to?
What do you have in mind? FP64? FP32? FP16? FP8 or Int8? Sparse vs dense matrices? etc etc.

The handwaving answer is that the most common GPU (by Steam Survey) is nV GeForce RTX 3060.
This has 28 SMs which are more or less analogous to an Apple Silicon GPU "Core" (so an SM has 128 execution lanes, just like an Apple Silicon GPU core has 128 execution lanes).
An M2 Pro has 16 or 19 GPU cores depending on whether you get the low or high end version; whereas a Max has 30 or 38.
With those numbers, *for the most part* both nV and Apple do a reasonable job of executing one FP op per cycle per lane with well-written code. Apple (just like with their CPUs) do a slightly better job of not wasting time on busywork overhead (eg address calculations); on the other hand (just like with x86) there's a much larger pool of code that's been pre-optimized for nVidia.

When it comes specifically to matrix multiply, the points I raised become even more relevant.
If you want FP64, well AMX covers that, whereas doing it on the GPU (whether Apple or nV) is a lot slower (like ~18 to ~50 ops per FP64 FMAC depending on exactly how you fake FP64-like precision).

For FP32 and FP16 nV do it via a dedicated Tensor Unit, whereas Apple do it via dedicated cross-lane permute/multiply/add instructions. As far as I can tell, performance is essentially the same given the same amount of hardware (appropriately scaled, since exact matches between the two don't exist). Apple take advantage of the fact that, unusually for a GPU, they have very rich permute operations to move data between lanes; and they use these operations in other useful ways. nV don't have those operations (as far as I know) and so use dedicated HW for matrix multiply, but there are limits to how much of that HW they can include on the GPU.

But Apple's hardware does not scale down to FP8 or Int8 multiply, whereas nV's tensor units do scale down to that, so they will look really good for that purpose. The most recent nV GPUs also support sparse matrices in a way that Apple does not yet (in either AMX or the GPU), but I don't know if the RTX 3060 is recent enough to have that support.

Of course, if you are willing to go up in price, nV scales a lot larger -- and a LOT more expensive. (As does Apple, but not as large, not yet.) I'm trying to compare the kind of range of things that most people actually buy rather than professional or data center equipment.

So I think it's fair to say that the most popular nV GPU (among people who pay specifically for a GPU, but don't have crazy budgets) is for most purposes above an Apple Pro (what people who don't really care about the GPU but want a better-than-basic computer would buy) and below an Apple Max (what people who do care about the GPU but have realistic budgets would buy). FWIW I bought a mac mini M2 Pro (low end 6+4 core, 16 GPU core) and would not have bought an M2 Max even if such existed. But then for my purposes I don't play games on the machine, so I care more about CPU speed than GPU.
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
But Apple's hardware does not scale down to FP8 or Int8 multiply, whereas nV's tensor units do scale down to that, so they will look really good for that purpose.
Apple does use the neural engine (FP16) and GPU (FP32) simultaneously for MetalFX upscaling. That's why they recommend avoiding Metal resource dependencies: to hide the massive (~1 ms)* latency of accessing the ANE, it overlaps GPU work across the next frame's render pass. The main (practical) use case for tensor cores on the RTX 3060 is running ConvNets for DLSS upscaling, and Apple achieved the same objective in a different way.

*Take this conclusion and number with a grain of salt though.

Chip | FP16 Matrix | TF32 Matrix | FP32 Vector | Power during AI upscaling
M2 Pro (19-core) | 15.8 TFLOPS | 5.47 TFLOPS | 6.84 TFLOPS | 2 W :oops:
RTX 3060 | 50.96 TFLOPS | 12.74 TFLOPS | 12.74 TFLOPS | maybe 10% of 170 W

I got 2 W at 120 Hz, 768x768 -> 1536x1536. Extrapolate what you will to other resolutions. The ANE part was 80 mW.
 
  • Like
Reactions: name99

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Chip | FP16 Matrix | TF32 Matrix | FP32 Vector | Power during AI upscaling
M2 Pro (19-core) | 15.8 TFLOPS | 5.47 TFLOPS | 6.84 TFLOPS | 2 W :oops:
RTX 3060 | 50.96 TFLOPS | 12.74 TFLOPS | 12.74 TFLOPS | maybe 10% of 170 W

I got 2 W at 120 Hz, 768x768 -> 1536x1536. Extrapolate what you will to other resolutions. The ANE part was 80 mW.

Rescaling seems like a cheap operation. An RTX 3050 needs less than 3 W to upscale a 720p video to 4K.

To show this, we tested VSR on the same video sequence — the 720p NHL game upscaled to 4K — on two different extremes of the VSR spectrum, with both the VSR Quality 1 and VSR Quality 4 settings.

Nvidia GPU Power Use With / Without VSR
GPU | VSR Off | VSR On (1) | VSR On (4)
RTX 4090 (Watts) | 28.9 | 32.8 | 36.9
RTX 3050 (Watts) | 13.0 | 15.9 | 15.9
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
An RTX 3050 needs less than 3 W to upscale a 720p video to 4K.
"In contrast, the RTX 3050 needed just 3W more power for either VSR setting."

YouTube is typically 60 Hz, but I was ray tracing at 120 Hz. 2 W is the combined cost of both the ray tracing part and the upscaling part. The RTX 3050 was consuming 16 W total.

Resolution | FPS | Power (staring at empty background) | Power (every pixel intersects a ray)
768x768 -> 1536x1536 | 60 Hz | 950 mW | 1150 mW
1024x1024 -> 2048x2048 | 60 Hz | 1700 mW | 2000 mW
768x768 -> 1536x1536 | 120 Hz | 1600 mW | 2200 mW
1024x1024 -> 2048x2048 | 120 Hz | 3500 mW | 5300 mW

Extrapolate this to the Nvidia GPUs, which are inputting 921,600 pixels @ 60 Hz. Pretend they are rendering my 1024x1024 input pixel scene. Apple's algorithm does not have quality settings, but you were referencing Nvidia's lowest quality (VSR 1). I will show both lowest -> highest quality. I am also assuming Nvidia would consume 300 mW without upscaling, but in reality the minimum power is >10000 mW. Notice that power consumption is higher on their newer GPU.

Resolution | FPS | Power (staring at empty background) | Power (every pixel intersects a ray)
1024x1024 in, M1 Max | 60 Hz | 1700 mW | 2000 mW
720x1280 in, 3050 | <= 60 Hz | 2900 -> 2900 mW | 3200 -> 3200 mW
720x1280 in, 4090 | <= 60 Hz | 3900 -> 8000 mW | 4200 -> 8300 mW
 
  • Like
Reactions: Xiao_Xi

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
Rescaling seems like a cheap operation. An RTX 3050 needs less than 3 W to upscale a 720p video to 4K.


That's just spatial?
My experience (not using either nV or Apple but an expensive Apple Silicon 3rd party app [in test/demo mode!]) is that what's REALLY expensive is temporal resampling. That's also where I want this technology to go...

The combination of Apple TV and/or my TV do an acceptable job of spatial upsampling today, but the combination of incompetences at every level of pouring digital content from film to some eventual stream results in absolutely horrendous jerkiness.
(Things like company A uses 3:2 pulldown to generate 60/s interleaved fields from film, then company B streams this as 30/s upscaled fields->frames.
When the 60fps fields are broadcast, a lot of hardware, including mine, can undo that back to 24 fps; but when these bastardized 30fps upscaled fields are broadcast, they are helpless!)
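For readers who haven't met the cadence before, a toy illustration of what 3:2 pulldown does to 24 fps film and why the field-level version is recoverable (purely illustrative):

Python:
# 3:2 pulldown: every 4 film frames (A,B,C,D) become 10 interlaced fields,
# in a repeating 3-2-3-2 cadence (24 fps -> ~60 fields/s).
film = ["A", "B", "C", "D"]
fields = []
for i, frame in enumerate(film):
    fields += [frame] * (3 if i % 2 == 0 else 2)
print(fields)  # ['A','A','A','B','B','C','C','C','D','D']
# Detecting that repeating 3-2 pattern is what lets hardware reverse it back to 24p.
# Once the fields are fused into 30 fps frames (some frames mixing two film frames),
# the cadence is much harder to recover -- hence the jerkiness described above.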

I"m hoping for AI that, at some point, does the job of
- tracking the jerkiness
- figuring out the jerk pattern
- retemporalizing back to the correct frame rate
- temporal upsampling from 24fps (which ain't good enough!) to 120fps

We're a long way from that, but basic AI temporal upsampling is the first step.
Unfortunately the usual suspects that report on AI upsampling quality, for whatever reason, seem uninterested in telling us how well nV or Apple or anyone else handle this particular task. I get that it's harder to test and to demonstrate (freezing a frame to see upscaling artifacts is easy). But I do wish there were *some* reporting on this! After all the TV review folks (like rtings.com) have a similarly tough task, but they bite the bullet and buy the hardware or whatever necessary to do the job!
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
"According to the Nvidia documents, a new and different neural network is used than the one in DLSS, and training is done on different data sets. RTX Video Super Resolution works purely with pixels of the image (video frame), unlike DLSS it doesn’t use any “meta” inputs like motion vector and depth information from the game engine. On the one hand, the filter detects and enhances edges, and on the other hand it also tries to remove compression artifacts."

I'm using MetalFX with temporal upscaling. This technique is much more computationally intensive than spatial upscaling. So the comparison with NV was not fair at all. Temporal upscaling needs both motion vectors and a depth buffer, although I think video compression uses motion vectors somehow.
 
  • Like
Reactions: name99

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
I'm using MetalFX with temporal upscaling. This technique is much more computationally intensive than spatial upscaling.
The DLSS benchmarks focus on FPS improvement. Igor did a cool benchmark, but it doesn't show how much power DLSS consumes.
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Ideally, TAA will lead to a 4x FPS improvement, because you can render at half the resolution in each dimension. However, now some time in the frame is spent on the upscaler. I would set 3-3.5x as the actual theoretical limit. Another FPS boost might be interpolating frames, which could be more than 6-7x because the second frame doesn't need to be rendered. For my use case, I need the exact geometry data for each frame, not just an AI's guess, to capture high-frequency motions.
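To make that arithmetic concrete, a toy model of the limit (all numbers are assumptions, just to show where 3-3.5x comes from):

Python:
# Rendering at half resolution per axis shades 4x fewer pixels,
# but the upscaler adds a roughly fixed cost per output frame.
native_ms = 16.0              # assumed cost of a native-resolution frame
half_res_ms = native_ms / 4   # quarter the pixels
upscaler_ms = 1.0             # assumed TAAU/MetalFX cost per frame
speedup = native_ms / (half_res_ms + upscaler_ms)
print(f"{speedup:.1f}x")      # ~3.2x rather than the ideal 4x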

Then there's the geometry pass, which will not change in cost. In rasterization or hybrid RT, this is the vertex stage. My use case is entirely RT, but builds a new acceleration structure each frame. Lower resolutions let me have a tradeoff, where the accel takes less time (1/32x) to build, but the intersection functions are more expensive (32x). I can process groups of 32 nearby atoms instead of each atom individually. MetalFX "makes or breaks" whether I can render supermassive molecules, each atom having a different position in each different frame.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
We're a long way from that, but basic AI temporal upsampling is the first step.
Do you want something like RIFE?

Video size | Generic (any GPU) | TensorRT (Nvidia only)
720p @48 fps | >= RTX 2060 or M1 Max | 1660 (?)
1080p @48 fps, 720p @60-72 fps | >= RTX 3070 | >= RTX 2060
1080p @60-72 fps | >= RTX 3080 | >= RTX 3060
4K @48 fps | RTX 4090 | >= RTX 4080
4K @60-72 fps | wait for RTX 5xxx... | RTX 4090
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
This technique is much more computationally intensive than spatial upscaling. So the comparison with NV was not fair at all.
I want to reiterate that Apple, under their highest-quality, most computationally intensive algorithm, used a fraction of the power of both Ampere and Ada GPUs, while Nvidia was on their lowest-quality, least computationally intensive algorithm.
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
Do you want something like RIFE?

Video size | Generic (any GPU) | TensorRT (Nvidia only)
720p @48 fps | >= RTX 2060 or M1 Max | 1660 (?)
1080p @48 fps, 720p @60-72 fps | >= RTX 3070 | >= RTX 2060
1080p @60-72 fps | >= RTX 3080 | >= RTX 3060
4K @48 fps | RTX 4090 | >= RTX 4080
4K @60-72 fps | wait for RTX 5xxx... | RTX 4090

Optical flow is one input into a decent temporal resampler.

But for the purposes *I* care about, more important (and more difficult) is detecting (and undoing) the jerkiness that comes from the uneven frame rate of the input after the 3:2 pulldown followed by alternate field removal...
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
The Apple M1 GPU has hardware acceleration for ray tracing, similar in performance to RDNA 2. The majority of the compute cost in RT is ray-box intersections, which is why that operation happens 4x faster (RDNA 2), 12x faster (Intel Arc) than ray-triangle intersections. Here is the optimal slab method for ray-box intersections:

C++:
bool intersection(const struct ray *ray, const struct box *box) {
    float tmin = 0.0, tmax = INFINITY;

    for (int d = 0; d < 3; ++d) {
        bool sign = signbit(ray->dir_inv[d]);
        float bmin = box->corners[sign][d];
        float bmax = box->corners[!sign][d];

        float dmin = (bmin - ray->origin[d]) * ray->dir_inv[d];
        float dmax = (bmax - ray->origin[d]) * ray->dir_inv[d];

        tmin = max(dmin, tmin);
        tmax = min(dmax, tmax);
    }

    return tmin < tmax;
}

Here is one iteration of the inner loop in Apple GPU assembly. Notice the magic CMPSEL instruction - performing a comparison and selection in a single cycle. Both GPT-4 and an AMD expert confirmed no other architectures have this. It is also absent from the M1 CPU, which requires 2 cycles for MAX, MIN, etc.

C++:
Apple:
// precomputed ori_dir = origin * dir_inv
float bmin = CMPSEL(dir_inv < 0 ? corner[1] : corner[0]); // 1 cyc
float bmax = CMPSEL(dir_inv < 0 ? corner[0] : corner[1]); // 1 cyc
float dmin = FFMA(bmin, dir_inv, -ori_dir); // 1 cyc
float dmax = FFMA(bmax, dir_inv, -ori_dir); // 1 cyc
tmin = CMPSEL(dmin > tmin ? dmin : tmin); // 1 cyc
tmax = CMPSEL(dmax < tmax ? dmax : tmax); // 1 cyc
// 6 cycles total

On Apple, all 3 iterations take 18 cycles, while AMD or Nvidia take 30 cycles. Granted this ignores the memory and control flow instructions afterward. This makes it similar to AMD's ray accelerators, which only accelerate the intersection part. Pretend CMPSEL isn't a native instruction, compare Apple's throughput:

Code:
Apple:                                30 cycles
AMD:    64 ALU /  4 ray-box/CU      = 16 cycles
Intel: 128 ALU / 12 ray-box/Xe-core = 10.7 cycles
30 / 16   = 1.88x AMD
30 / 10.7 = 2.80x Intel

Now, Apple adds CMPSEL to the GPU ISA:

Code:
Apple:                                18 cycles
AMD:    64 ALU /  4 ray-box/CU      = 16 cycles
Intel: 128 ALU / 12 ray-box/Xe-core = 10.7 cycles
18 / 16   = 1.13x AMD
18 / 10.7 = 1.68x Intel

An Apple GPU has ~same throughput as an RDNA2 GPU of the same F32 TFLOPS. Furthermore, NV boasts a lot about ray-triangle performance, but my use case has zero triangle intersections. For ray-box perf, it could be much closer to an Ampere GPU of the same F32 TFLOPS than people think. The ratio of compute:memory is also lopsided: much more memory bandwidth in the M1's case. Meanwhile, NV optimized the memory subsystem for memory divergence in ray tracing. Say NV has 2.5x better memory, 2.5x better ray-box throughput. Memory might not be any more of a bottleneck on M1 than on NV.

The M1 SLC also has similar (occupancy-scaled) latency/clock speed ratio and size range as NV's L2. The SLC is effectively an L2, the L2 is effectively an L1 instruction cache.
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
The Apple M1 GPU has hardware acceleration for ray tracing, similar in performance to RDNA 2. The majority of the compute cost in RT is ray-box intersections, which is why that operation happens 4x faster (RDNA 2), 12x faster (Intel Arc) than ray-triangle intersections. Here is the optimal slab method for ray-box intersections:

C++:
bool intersection(const struct ray *ray, const struct box *box) {
    float tmin = 0.0, tmax = INFINITY;

    for (int d = 0; d < 3; ++d) {
        bool sign = signbit(ray->dir_inv[d]);
        float bmin = box->corners[sign][d];
        float bmax = box->corners[!sign][d];

        float dmin = (bmin - ray->origin[d]) * ray->dir_inv[d];
        float dmax = (bmax - ray->origin[d]) * ray->dir_inv[d];

        tmin = max(dmin, tmin);
        tmax = min(dmax, tmax);
    }

    return tmin < tmax;
}

Here is one iteration of the inner loop in Apple GPU assembly. Notice the magic CMPSEL instruction - performing a comparison and selection in a single cycle. Both GPT-4 and an AMD expert confirmed no other architectures have this. It is also absent from the M1 CPU, which requires 2 cycles for MAX, MIN, etc.

C++:
Apple:
// precomputed ori_dir = origin * dir_inv
float bmin = CMPSEL(dir_inv < 0 ? corner[1] : corner[0]); // 1 cyc
float bmax = CMPSEL(dir_inv < 0 ? corner[0] : corner[1]); // 1 cyc
float dmin = FFMA(bmin, dir_inv, -ori_dir); // 1 cyc
float dmax = FFMA(bmax, dir_inv, -ori_dir); // 1 cyc
tmin = CMPSEL(dmin > tmin ? dmin : tmin); // 1 cyc
tmax = CMPSEL(dmax < tmax ? dmax : tmax); // 1 cyc
// 6 cycles total

On Apple, all 3 iterations take 18 cycles, while AMD or Nvidia take 30 cycles. Granted this ignores the memory and control flow instructions afterward. This makes it similar to AMD's ray accelerators, which only accelerate the intersection part. Pretend CMPSEL isn't a native instruction, compare Apple's throughput:

Code:
Apple:                                30 cycles
AMD:    64 ALU /  4 ray-box/CU      = 16 cycles
Intel: 128 ALU / 12 ray-box/Xe-core = 10.7 cycles
30 / 16   = 1.88x AMD
30 / 10.7 = 2.80x Intel

Now, Apple adds CMPSEL to the GPU ISA:

Code:
Apple:                                18 cycles
AMD:    64 ALU /  4 ray-box/CU      = 16 cycles
Intel: 128 ALU / 12 ray-box/Xe-core = 10.7 cycles
18 / 16   = 1.13x AMD
18 / 10.7 = 1.68x Intel

An Apple GPU has ~same throughput as an RDNA2 GPU of the same F32 TFLOPS. Furthermore, NV boasts a lot about ray-triangle performance, but my use case has zero triangle intersections. For ray-box perf, it could be much closer to an Ampere GPU of the same F32 TFLOPS than people think. The ratio of compute:memory is also lopsided: much more memory bandwidth in the M1's case. Meanwhile, NV optimized the memory subsystem for memory divergence in ray tracing. Say NV has 2.5x better memory, 2.5x better ray-box throughput. Memory might not be any more of a bottleneck on M1 than on NV.

The M1 SLC also has similar (occupancy-scaled) latency/clock speed ratio and size range as NV's L2. The SLC is effectively an L2, the L2 is effectively an L1 instruction cache.

This is not for Philip (who surely understands the issues more than me!) but for other readers.
Talking about "native ray tracing" bundles together a large number of issues; it's more useful to see what elements Apple has already picked up and what elements remain to be handled.

First we have the SW side.
Apple's first round of ray tracing support was to provide Metal Performance Shaders that do the job. Using a very rough analogy, this is like providing a command line tool for the CPU, and you, a developer, can perform some task by calling that command line tool (eg via the C system() call). Point is, overhead is high.
The next year Apple provided the equivalent of a function call so that you could interleave calls to ray tracing with the rest of your compute/render pipeline. That's obviously lower overhead, so a nice little boost.
The next year Apple provided the equivalent of changes in the Metal language itself (ie changes to the compiler) so that ray tracing becomes basically as efficient as it can possibly be, compiled directly with no function call overhead, scheduled optimally, blah blah.

Alongside all this, over the same years Apple introduced changes to the rest of the Metal runtime to allow much faster operation of the sorts of operations usually performed while ray tracing (for example, when a ray intersects a triangle, you need to look up the material properties of that triangle to figure out what to do -- extinguish the ray? reflect the ray? change the color of the reflected ray? etc). This involves things like how data is laid out in memory and how the Metal runtime knows what data is stored in what buffers.

All these SW changes, even if nothing were done to the HW, would be a nice performance improvement. But another way to look at them is that they were essential infrastructure that would need to be in place to take advantage of any future ray tracing HW, and now they are fully in place...

Now for the HW side.
Ray tracing involves two primary operations. The first is building the so-called Acceleration Structure (based on the scene geometry), the second is using that structure to test for ray intersections. You can provide hardware to perform either or both parts of these two jobs.
Apple has Metal API for creating the Acceleration Structure and, a year later, provided API to allow for updating an existing Acceleration Structure. (You do this if you are creating animation rather than a single ray traced image. Hopefully, if your geometry has not changed too much relative to the previous frame, you can make small cheap modifications to the Acceleration Structure rather than rebuilding from scratch.)

nV (and maybe AMD, I am not sure about them) have hardware to improve building the Acceleration Structure. I don't know if they have a special path for updating it for small geometry changes.

Apple seem to view these issues in terms of (as far as possible) making each lane and core of the GPU more capable, rather than adding side hardware that can only do one task. We have seen this already with matrix multiply - rather than adding the equivalent of a dedicated matrix multiply engine, Apple added permute to move data cheaply between GPU lanes (functionality that is widely useful), and built matrix multiply on top of this permute ability. Similarly they have accelerated one part of ray tracing by adding a new instruction which speeds up the relevant operation but may well have value for other tasks the GPU needs to perform.
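A rough sketch of that matrix-multiply-from-permute idea in NumPy terms: matmul built out of per-lane FMAs plus cross-lane broadcasts of one operand. This is the general flavour of a simdgroup-style implementation, not Apple's actual microcode:

Python:
import numpy as np

# Toy 8x8 "simdgroup" matmul: accumulate rank-1 updates, where each step
# broadcasts a column of A and a row of B across the lanes (the permute step)
# and every lane does one FMA into its element of C.
N = 8
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)
C = np.zeros((N, N), dtype=np.float32)
for k in range(N):
    C += np.outer(A[:, k], B[k, :])   # broadcast + per-lane multiply-accumulate
assert np.allclose(C, A @ B, atol=1e-3)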
So if we ask about "ray tracing hardware" the parts that are left are
- building/updating an acceleration structure (the API/language part of this is done) and
- ray triangle testing
They probably hope to implement both of these via per-lane improvements rather than dedicated hardware. I don't know what's involved in building the acceleration structure, but the ray triangle intersection may be an operation that can be performed on a per-quad level (like dfdx and dfdy)?
 
  • Like
Reactions: dgdosen and Xiao_Xi

Philip Turner

macrumors regular
Dec 7, 2021
170
111
I recommend reading over Imagination's CXT whitepaper. Apple stopped using Imagination's PowerVR architecture when they introduced the A11 GPU, but the two have probably still been working together for at least the past few years. Imagination created "ray sorting" hardware to improve the power efficiency of ray tracing, which reorders rays to reduce divergence. This is basically a ton of data permutation, swapping rays and state variables between different threads. Remember all of this is for their AR/VR headset, which is extremely power constrained - even with the most efficient GPU in the world, only 2 hours on battery.

Apple has also sacrificed performance for the sake of efficiency before - their AMX has half the vector throughput of NEON, making Accelerate slower than OpenBLAS. I wouldn't rule out Apple having less theoretical performance ("RT TFLOPS") than Nvidia. They might reuse the FP32 ALUs for intersections; the bottleneck is how to coordinate data movement (BVH decoding, ray sorting) to reach high utilization of existing compute hardware. This would also line up with their equalizing of FP16 and FP32 performance. It is not possible to perform ray intersection tests in FP16, only FP32. AMD, Intel, and Nvidia still have dedicated FP16 ALUs, although on recent generations only accessible through matmul instructions. Apple removed FP16 from the GPU and put it in the neural engine (ahem, MetalFX ray tracing denoiser).
 
  • Like
Reactions: name99

Philip Turner

macrumors regular
Dec 7, 2021
170
111
- building/updating an acceleration structure (the API/language part of this is done) and

Good point - for my use case, the animation is so costly (1 million atoms @ 120 Hz) that I have to consider things like streaming a pre-built accel straight from the disk. This is where Metal fast resource loading and the LZBITMAP codec become extremely helpful. I need on the order of 1 GB per second of disk bandwidth.
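Back-of-the-envelope for that bandwidth figure (the bytes-per-atom value is my assumption for illustration, not a number from the post, and a pre-built acceleration structure would carry extra data on top of raw coordinates):

Python:
# Streaming fresh per-frame geometry for 1 million atoms at 120 Hz.
atoms = 1_000_000
bytes_per_atom = 12        # assumed: three float32 coordinates per atom
fps = 120
print(atoms * bytes_per_atom * fps / 1e9, "GB/s")   # ~1.4 GB/s before compression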
 

leman

macrumors Core
Oct 14, 2008
19,517
19,664
Nice catch regarding CMPSEL! It is basically a fusion of compare and select operations, which is a common pattern in a bunch of data-parallel algorithms. I don't think that raytracing acceleration is the primary idea behind this instruction (e.g. Apple also has an integer variant), but it does help to make intersection calculation a bit cheaper, which is certainly nice.

But I wouldn’t call it hardware-accelerated RT. To be honest, I wouldn’t call AMDs implementation hardware accelerated RT either. A hardware RT implementation would do the entirety of the traversal using dedicated circuitry.
 
  • Like
Reactions: Philip Turner

Philip Turner

macrumors regular
Dec 7, 2021
170
111
But I wouldn’t call it hardware-accelerated RT. To be honest, I wouldn’t call AMDs implementation hardware accelerated RT either. A hardware RT implementation would do the entirety of the traversal using dedicated circuitry.
Would you consider simdgroup_matrix to be hardware-accelerated matrix multiplication? Does the chip have to dedicate space to separate circuitry that will only be utilized during a small fraction of everyday workloads?
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
To be honest, I wouldn’t call AMDs implementation hardware accelerated RT either. A hardware RT implementation would do the entirety of the traversal using dedicated circuitry.
Chips and Cheese wrote a post about ray tracing on AMD and Nvidia GPUs some time ago.
 

Philip Turner

macrumors regular
Dec 7, 2021
170
111
Chips and Cheese wrote a post about ray tracing on AMD and Nvidia GPUs some time ago.
I read this today while checking whether the CMPSEL conclusion was right.

"Each SM tracked 13.23 waves on average out of 64 theoretical,"

Yikes! Apple prevents register allocation from ever pushing occupancy below 24 simds/core (max 96), while NV lets you go as low as 8 simds/core. You need a lot of resident simds to hide the latency of the L2 (NV) / L3 (M1) cache.
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
I recommend reading over Imagination's CXT whitepaper. Apple stopped using Imagination's PowerVR architecture when they introduced the A11 GPU, but the two have probably still been working together for at least the past few years. Imagination created "ray sorting" hardware to improve the power efficiency of ray tracing, which reorders rays to reduce divergence. This is basically a ton of data permutation, swapping rays and state variables between different threads. Remember all of this is for their AR/VR headset, which is extremely power constrained - even with the most efficient GPU in the world, only 2 hours on battery.

Apple has also sacrificed performance for the sake of efficiency before - their AMX has half the vector throughput of NEON, making Accelerate slower than OpenBLAS. I wouldn't rule out Apple having less theoretical performance ("RT TFLOPS") than Nvidia. They might reuse the FP32 ALUs for intersections; the bottleneck is how to coordinate data movement (BVH decoding, ray sorting) to reach high utilization of existing compute hardware. This would also line up with their equalizing of FP16 and FP32 performance. It is not possible to perform ray intersection tests in FP16, only FP32. AMD, Intel, and Nvidia still have dedicated FP16 ALUs, although on recent generations only accessible through matmul instructions. Apple removed FP16 from the GPU and put it in the neural engine (ahem, MetalFX ray tracing denoiser).
Do we know for a fact that Apple has removed FP16 from the GPU? When? With A15/M2?
 