A user on the Blender forum ran an experiment comparing the speedup that ray tracing hardware provides on Intel, AMD, and Nvidia GPUs in Blender. You can't draw many conclusions from it, but the charts are very interesting. For instance:
[attached chart comparing RT hardware speedups across GPUs]
I’ve been interested in comparing the Ray Accelerators in the AMD GPUs to the Ray Tracing Units in Intel GPUs to the Ray Tracing Cores in Nvidia GPUs. And now that we have HIPRT, OptiX, and Embree GPU in Cycles, we can do that. And with Cycles we can extract some extra debug information to help...
devtalk.blender.org
I wonder what those charts would look like if Apple GPUs had ray tracing hardware.
What this clearly shows is that "RT hardware" is a technical misnomer. We need to understand what exactly these GPUs are doing rather than just saying "they have hardware RT".
Take AMD, for example. They have a dedicated instruction for intersecting a ray against a BVH node. This instruction fetches the BVH data, runs a lengthy math sequence that checks for intersections, sorts the results, and hands them back to your shader. At the same time, it's not very clear how this instruction is implemented. There is probably at least some dedicated hardware logic behind it, but I wouldn't be surprised if the base implementation uses some sort of microcode-like sequence to run the math itself. And I very much doubt that this instruction is particularly fast. At any rate, while it does seem to accelerate the compute-heavy part of BVH traversal, it does nothing to address the main problem of RT: memory access patterns. GPUs are designed with data locality in mind, and tree traversal tends to be fairly random. When you run these kinds of tasks on massive SIMD processors, you end up with cache misses and divergent instruction flow, which kills your performance. Both Nvidia and Intel clearly have a technically superior solution that does more than just accelerate the compute sequence. As to how exactly they do it, one would probably need to study their patents (and hope that the answer is in there).
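To make the memory-access point concrete, here is a minimal CPU-style sketch of what one traversal step involves. The node layout and names are purely illustrative (not AMD's actual format, and not their microcode): the per-box arithmetic is the part a dedicated instruction can speed up, while the dependent, irregular node fetches and per-ray branching in the loop are the part it can't.

```cpp
// Hypothetical sketch of one ray vs. BVH-node step and a traversal loop.
// The math is cheap; the pain points are the irregular loads of `node`
// and the divergent control flow between neighbouring rays.
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct Ray  { float ox, oy, oz, inv_dx, inv_dy, inv_dz, t_max; };
struct AABB { float min[3], max[3]; };
struct Node { AABB box; int32_t left, right; bool is_leaf; int32_t prim; };

// Classic "slab test": intersect a ray with an axis-aligned box.
bool hit_aabb(const Ray& r, const AABB& b) {
    float t0 = 0.0f, t1 = r.t_max;
    const float o[3]   = { r.ox, r.oy, r.oz };
    const float inv[3] = { r.inv_dx, r.inv_dy, r.inv_dz };
    for (int a = 0; a < 3; ++a) {
        float near = (b.min[a] - o[a]) * inv[a];
        float far  = (b.max[a] - o[a]) * inv[a];
        if (near > far) std::swap(near, far);
        t0 = std::max(t0, near);
        t1 = std::min(t1, far);
    }
    return t0 <= t1;
}

// Depth-first BVH traversal with an explicit stack. On a SIMD GPU,
// neighbouring rays take different branches and touch different nodes,
// so lanes diverge and node fetches miss cache -- the part a single
// intersection instruction does not fix.
int traverse(const std::vector<Node>& bvh, const Ray& r) {
    int stack[64]; int sp = 0; stack[sp++] = 0;   // root at index 0
    int closest_prim = -1;
    while (sp > 0) {
        const Node& n = bvh[stack[--sp]];         // dependent, irregular load
        if (!hit_aabb(r, n.box)) continue;
        if (n.is_leaf) { closest_prim = n.prim; continue; } // real code: ray/triangle test here
        stack[sp++] = n.left;
        stack[sp++] = n.right;
    }
    return closest_prim;
}
```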
I hope you can see how the notion of "hardware RT" becomes useless here. AMD can legitimately claim that they have hardware RT, but their solution is just a band-aid that doesn't really do much. Apple, for example, doesn't have a dedicated GPU RT instruction, but they do have a general-purpose four-argument compare-and-select instruction that can be used to compute a ray/AABB intersection in six GPU cycles. In fact, I wouldn't be shocked if Apple's "software RT" ends up as fast as or faster than AMD's "hardware RT" in practice.
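For context on why that one instruction matters: the ray/AABB slab test is essentially per-axis min/max plus a final compare, so a fused compare-and-select primitive covers most of the work. Below is a sketch where cmpsel() is my own stand-in for such an instruction; the exact semantics of Apple's op, and the six-cycle figure, are from the claim above, not from this code.

```cpp
// Sketch only: the ray/AABB slab test written so every step is a
// min/max/compare-select. cmpsel(a, b, x, y) == (a < b ? x : y) is a
// stand-in for a hypothetical 4-argument compare-and-select op.
#include <algorithm>

static inline float cmpsel(float a, float b, float x, float y) {
    return a < b ? x : y;
}

bool slab_test(const float org[3], const float inv_dir[3],
               const float bmin[3], const float bmax[3], float t_max) {
    float t_near = 0.0f, t_far = t_max;
    for (int a = 0; a < 3; ++a) {
        float t0 = (bmin[a] - org[a]) * inv_dir[a];
        float t1 = (bmax[a] - org[a]) * inv_dir[a];
        // order the pair, then tighten the interval -- all min/max/select
        float lo = cmpsel(t0, t1, t0, t1);   // min(t0, t1)
        float hi = cmpsel(t0, t1, t1, t0);   // max(t0, t1)
        t_near = std::max(t_near, lo);
        t_far  = std::min(t_far,  hi);
    }
    return t_near <= t_far;
}
```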
Apple has published a bunch of patents describing an RT solution, and if that's what they end up shipping, I think there are good reasons to expect good performance. They describe a dedicated RT coprocessor (integrated into the GPU, but with its own cache) that performs intersection tests in parallel on a wide BVH. These tests are done using limited-precision ALUs, which allows for a compact and very energy-efficient hardware implementation. Intersection hits are ordered and compacted in hardware, and a new shader program is launched asynchronously to process the intersection results. There are many interesting aspects to this approach, and the main advantage appears to be low power consumption compared to existing solutions. Of course, all this is purely theoretical until we see the actual hardware.
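As a purely illustrative software model of the flow described above (node width, layout, and names are my assumptions, not anything from the patents themselves): test a ray against all children of a wide node at once, then sort and compact the surviving hits before handing them off to the next stage.

```cpp
// Illustrative model of a wide-BVH intersection step: test all children
// of an 8-wide node "in parallel", then order and compact the hits
// (what the patents are said to do in hardware). A real unit would also
// use reduced-precision arithmetic; full floats keep this sketch simple.
#include <algorithm>
#include <array>
#include <cstdint>
#include <utility>

constexpr int kWidth = 8;                      // assumed branching factor

struct WideNode {
    float bmin[kWidth][3], bmax[kWidth][3];    // child bounding boxes
    int32_t child[kWidth];                     // child index, or -1 if unused
};

struct Hit { float t_near; int32_t child; };

// Returns surviving hits, nearest first; the caller would then launch
// further work per hit (asynchronously, in the described design).
int intersect_wide_node(const WideNode& n,
                        const float org[3], const float inv_dir[3],
                        float t_max, std::array<Hit, kWidth>& out) {
    int count = 0;
    for (int c = 0; c < kWidth; ++c) {         // conceptually done in parallel
        if (n.child[c] < 0) continue;
        float t0 = 0.0f, t1 = t_max;
        for (int a = 0; a < 3; ++a) {
            float lo = (n.bmin[c][a] - org[a]) * inv_dir[a];
            float hi = (n.bmax[c][a] - org[a]) * inv_dir[a];
            if (lo > hi) std::swap(lo, hi);
            t0 = std::max(t0, lo);
            t1 = std::min(t1, hi);
        }
        if (t0 <= t1) out[count++] = { t0, n.child[c] };   // compact hits
    }
    std::sort(out.begin(), out.begin() + count,
              [](const Hit& a, const Hit& b) { return a.t_near < b.t_near; });
    return count;
}
```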