Why is Apple slow to implement Hardware based Ray-Tracing?

leman · Jul 17, 2022

tmoerel said:
Blender is only used by a small minority and gaming is irrelevant. That is why RT is not implemented.

If that’s the case why did Apple invest substantial amount of resources to implement state of the art RT APIs and debugging tools? It does seem like ray tracing is a particular focus of interest for them.

neinjohn · Jul 17, 2022

Probably Apple wants Ray Tracing hardware to be implemented on all levels of Apple silicon. Meaning all the way from the very low thermal, and power budget, that is an iPhone. We don't have any of the kind shipping yet and it's certainly is a harder issue than the Nvidia implementation. So I would say Apple is not slow as it's a first-of-the-kind hardware and due to its silo nature we won't hear anything about it years before shipping anything (as Imagination).

As AS enthusiast I like the rumours about Apple diverging the SOCs between the vanilla and Pro's iPhones as I see it pointing to faster escalation on their chip efforts that translates even better to the computer side.

theorist9 · Jul 17, 2022

leman said:
FP64 support requires non-trivial precious transistor budget, so Nvidia/AMDs solution has been to implement it as part of the special function unit, with dramatically lower throughput compared to the normal ALUs. In other words, it’s very very slow unless you get a specialized data enter GPU like the A100, but that’s a completely different category.

If you need extended precision and want to use the GPU, you are almost always better off using numerical precision-extending algorithms like the double-float technique. It’s going to be faster than FP64 on GPUs and you can tweak it to suit your numerical needs. And Metal is C++, so creating custom types with overloaded operators and compile-type transformations is easy.

So basically you're saying implement the extended-precision using software rather than hardware. That's what Mathematica does with CPU calculations. You have a choice of machine-precision calculations, which are limited by the precision of the hardware, or arbitrary-precision calculations, which are done in the software, and can be extended to whatever precision you like. [There's also a third option—exact (infinite precision) calculations—which isn't germane here.] However, the arbitrary-precision ("software") calculations are always much slower.

But I see your point—to do them efficiently in hardware requires a costly transitior budget, which wouldn't make sense for Apple's GPU's given their most likely use cases.

Here's a nice (though somewhat outdated) table showing the throughput penalty for FP64 (rel. to FP32) in various NVIDIA GPUs; the're nearly all either 1/2X or 1/32X (though the 1/32X may in part be deliberate nerfing by NVIDIA to force data centers towards their more expensive commercial products):

Hardware for Deep Learning. Part 3: GPU

This is a part on GPUs in a series “Hardware for Deep Learning”. It is the most content-heavy part, mostly because GPUs are the current…

blog.inten.to

leman · Jul 17, 2022

theorist9 said:
So basically you're saying implement the extended-precision using software rather than hardware. That's what Mathematica does with CPU calculations. You have a choice of machine-precision calculations, which are limited by the precision of the hardware, or arbitrary-precision calculations, which are done in the software, and can be extended to whatever precision you like. [There's also a third option—exact (infinite precision) calculations—which isn't germane here.] However, the arbitrary-precision ("software") calculations are always much slower.

But I see your point—to do them efficiently in hardware requires a costly transitior budget, which wouldn't make sense for Apple's GPU's given their most likely use cases.

Here's a nice (though somewhat outdated) table showing the throughput penalty for FP64 (rel. to FP32) in various NVIDIA GPUs; the're nearly all either 1/2X or 1/32X (though the 1/32X may in part be deliberate nerfing by NVIDIA to force data centers towards their more expensive commercial products):

Hardware for Deep Learning. Part 3: GPU

This is a part on GPUs in a series “Hardware for Deep Learning”. It is the most content-heavy part, mostly because GPUs are the current…

blog.inten.to

It’s even worse with current crop of Nvidia cards where FP64 has 1/64 throughput of FP32. Performance-wise, native FP64 is a buster. Vendors keep it around for software compatibility but you’d likely get better performance using extended precision algorithms anyway. Even division - most expensive operation to implement - using the double-float technique needs around 10 instructions on modern hardware. So your get a guaranteed throughput better than 1/16 on any hardware. Why even bother with hardware FP64 at this point?

theorist9 · Jul 17, 2022

leman said:
It’s even worse with current crop of Nvidia cards where FP64 has 1/64 throughput of FP32. Performance-wise, native FP64 is a buster. Vendors keep it around for software compatibility but you’d likely get better performance using extended precision algorithms anyway. Even division - most expensive operation to implement - using the double-float technique needs around 10 instructions on modern hardware. So your get a guaranteed throughput better than 1/16 on any hardware. Why even bother with hardware FP64 at this point?

I think 1/64X is the result of NVIDIA's strategy to nerf FP64 performance on everything but its data center cards, and thus can't be used as an indicator of the kind of FP64 performance you could achieve with even a limited FP64 transistor budget. For instance, AMD's consumer cards are at 1/16X.

I'm not saying that Apple should be expending that budget. As I already acknowledged, given Apple's current market, maybe it shouldn't. Rather, I'm saying you can't use NVIDIA's current non-data-center FP64 performance as an indicator of what a limited FP64 transitior budget would get you. I have no idea here, so I'm purely speculating, but often one sees cases where the first, say, 20% of the effort gets you 80% of the performance, etc., etc. So is it possible that a limited transistor budget could get you to, say, 1/8X, and the big expense is getting you from there to 1/2X? The benefit, even if the speed were the same, would be reduced software complexity if you needed double-precision (no need for those 10 instructions to do division).

Fun discussion, BTW.

singhs.apps · Jul 17, 2022

jujoje said:
Maya unfortunately is still MIA with Autodesk claiming it is looking into a native version but not committing. Autodesk and The Foundry are really the two big holdouts from a VFX perspective.

Me thinks Autodesk has been wanting to go to cloud for a while. Foundry tried something along those lines and it went nowhere.

Also I think the Weta-M initiative with Maya might also be throwing a spanner into the works. Remains to be seen how it pans out.

13.55 onwards. Apple gets a mention.

theorist9 · Jul 17, 2022

singhs.apps said:
13.55 onwards. Apple gets a mention.

To save others the effort:

The only mentions of Apple after 13:55 concerned their role as a direct-to-consumer content producer (like Netflix and Amazon); there was nothing about hardware or software:

There was one mention of Macs, at 7:24, but it was just about how using the cloud would make their programs OS-agnostic:

leman · Jul 18, 2022

theorist9 said:
I think 1/64X is the result of NVIDIA's strategy to nerf FP64 performance on everything but its data center cards, and thus can't be used as an indicator of the kind of FP64 performance you could achieve with even a limited FP64 transistor budget. For instance, AMD's consumer cards are at 1/16X.

I wouldn’t read that much into this. The FP64 throughput is simply the direct side effect of balancing choices. Mainstream GPUs do FP64 in the special function unit. Nvidia says they have two such units per SM. Now, an SM in Turing has 64 FP32 and 64 INT32 units. That gives you the ratio of 1/32 FP64 throughput. For Ampere Nvidia has improved the INT32 units to also do FP32 ops - this doubled the FP32 throughput but effectively halved the relative FP64 throughout. In other words, the FP64 performance didn’t change for Ampere but minor architectural tweaks improved FP32 capability.

theorist9 said:
I'm not saying that Apple should be expending that budget. As I already acknowledged, given Apple's current market, maybe it shouldn't. Rather, I'm saying you can't use NVIDIA's current non-data-center FP64 performance as an indicator of what a limited FP64 transitior budget would get you. I have no idea here, so I'm purely speculating, but often one sees cases where the first, say, 20% of the effort gets you 80% of the performance, etc., etc. So is it possible that a limited transistor budget could get you to, say, 1/8X, and the big expense is getting you from there to 1/2X? The benefit, even if the speed were the same, would be reduced software complexity if you needed double-precision (no need for those 10 instructions to do division).

You still need those FP64 capable ALUs. This is not really the case of 20 for 80 etc, but simply the question of what you want your FP64 to be. In CPU SIMD, two FP32 units are usually “fused” to do a FP64 op - I’m not a hardware engineer and don’t know how much die space overhead such fusion needs. But on all die shots I’ve seen GPU ALU clusters are tiny compared to CPU ones, so probably the overhead is non-trivial. So you have to decide how much of this overhead you have the budget for and what your target performance is. It's a linear tradeoff.

There is no doubt that hardware FP64 results in better performance — that much is entirely obvious, but if providing competent FP64 performance means that you are sacrificing performance potential for the vast majority of your customers, that's probably not a good tradeoff. The advantage of "software" solutions is that they are predictable and scale with the base FP32 performance. And of course, one can go for special-purpose accelerators that have better FP64 throughput (like the Nvidia A100), but given the cost and lack of availability, you are probably better off just using CPU clusters.

Anyway, it's rather unlikely that Apple is interested in building specialised supercomputers for physics or meteorology research, so this discussion is not really relevant to them. If you need extended precision and want your software to run on wide range of hardware with best possible performance, software solutions are the way to go.

jujoje · Jul 18, 2022

singhs.apps said:
Me thinks Autodesk has been wanting to go to cloud for a while. Foundry tried something along those lines and it went nowhere.

Also I think the Weta-M initiative with Maya might also be throwing a spanner into the works. Remains to be seen how it pans out.

13.55 onwards. Apple gets a mention.

The Weta M / Weta H initiatives looked really interesting. Curious to see what actually ends up getting released, how much of a black box solution it is and how much the subscription price will be. Many questions, but is intriguing.

The Foundry's cloud platform was a bit underwhelming, but the main issue was that it was too early; the pandemic forced movie studios to relax security, so now cloud hosted tools are hopefully more viable.

singhs.apps · Jul 20, 2022

jujoje said:
The Weta M / Weta H initiatives looked really interesting. Curious to see what actually ends up getting released, how much of a black box solution it is and how much the subscription price will be. Many questions, but is intriguing.

The Foundry's cloud platform was a bit underwhelming, but the main issue was that it was too early; the pandemic forced movie studios to relax security, so now cloud hosted tools are hopefully more viable.

Weta-H might also be in the pipeline.
And oh, in the karma XPU preview for Houdini 19.5, keating seems confident we will see memory size access for GPUs in line with those traditionally enjoyed by CPUs in a couple of years…

Maxon’ s move to port redshift to support CPUs seems inline with that prediction

Boil · Jul 21, 2022

Weta M, Weta H; where's Weta B...? ;^p

jujoje · Jul 21, 2022

singhs.apps said:
Weta-H might also be in the pipeline.

It totally is

https://weta-h.com/

singhs.apps said:
And oh, in the karma XPU preview for Houdini 19.5, keating seems confident we will see memory size access for GPUs in line with those traditionally enjoyed by CPUs in a couple of years…

I was pretty curious about this too. There was someone on the forums who ran into a hard limit with Karma xPU where it would only recognise 6 GPUs ( they had 8 x 3080) so someone's already living the dream.

The other interesting thing for me was Christin's mention of the new viewport coming in the not to distant future; rebuilt from the ground up around Vulkan. Not Metal I guess, but really curious to see a modern 3D Viewport.

singhs.apps · Jul 21, 2022

jujoje said:
It totally is https://weta-h.com/

I was pretty curious about this too. There was someone on the forums who ran into a hard limit with Karma xPU where it would only recognise 6 GPUs ( they had 8 x 3080) so someone's already living the dream.

The other interesting thing for me was Christin's mention of the new viewport coming in the not to distant future; rebuilt from the ground up around Vulkan. Not Metal I guess, but really curious to see a modern 3D Viewport.

Yesss..Vulkan !
Houdini might be first out of the gate with transitioning to a new viewport tech. OpenGL was great, but long in the tooth now.

I was more curious viz keating’s claim of memory not really being a limit on the XPU = AS SOC… maybe AMD/Intel/Nvidia have plans for unified memory, whatever form it takes.

jujoje · Aug 6, 2022

theorist9 said:
Naive question: Is part of the reason we don't yet see Autodesk on AS that Autodesk incorporates tools like Moonray*, which is designed for Intel Advanced Vector Exensions, making it challenging to create an optimized port?

https://jo.dreggn.org/path-tracing-in-production/2017/MoonrayV3.pdf

"In our shading system, each vector lane represents an intersection point to be shaded. For example, on AVX2 hardware, 8 different intersection points are passed into the shade function by the renderer...."

[*The author says that Maya uses Moonray, so I assume that includes Autodesk, but he wasn't specific.]

Looks like Moonray will be open sourced, so there's yet another xPU renderer on the horizon. Looks pretty nice, although probably not something you'd really use without a farm (might have missed it but they didn't specify what spec the 'hosts' were that it was using, or GPUs for that matter). Scalability wise looks pretty great, particular with the CPU and GPU renders matching perfectly.

Anyways, promo thing for it here:

MrGunny94 · Aug 6, 2022

jujoje said:
Looks like Moonray will be open sourced, so there's yet another xPU renderer on the horizon. Looks pretty nice, although probably not something you'd really use without a farm (might have missed it but they didn't specify what spec the 'hosts' were that it was using, or GPUs for that matter). Scalability wise looks pretty great, particular with the CPU and GPU renders matching perfectly.

Anyways, promo thing for it here:

Was really surprised about this going open-source! What we definitely need is more tools within this space becoming open-source so not only we can help and train the artists of tomorrow but also improve the artists of today.

4nNtt · Jan 9, 2023

Rumors are that Apple is working on realtime raytracing. It might be out in the M3 and A17 series of processors if it doesn't get pushed again. I think it may be taking longer than other vendors because Apple needs a low power version of realtime raytracing. RTX cards are great for rendering, but they burn a lot of power doing so. Snapdragon 8 Gen 2 will also be adding realtime raytracing so it will be interesting to compare the two. Eventually by being lower power, Apple should actually outperform RTX cards because they will be able to sustain the performance better.

4nNtt · Jan 9, 2023

tmoerel said:
Blender is only used by a small minority and gaming is irrelevant. That is why RT is not implemented.

Apple cares enough about Blender to have one of their engineers dedicated to optimizing it.

4nNtt · Jan 9, 2023

jujoje said:
I think Renderman has a whole lot of x86 / AVX2 optimisations as well - I'm not sure how much that affects render speed, but must be a consideration when porting things to Apple Silicon. As a side note, Renderman xPU is heavily Nvidia focused, so I'm pretty sure we're not going to see that any time soon. Given Pixar's history it sucks somewhat.

Pixar and Apple are very close, so if they switch away from Nvidia this might change.

Right now, I think Otoy Octane gives the most love to Apple Silicon. That will probably be better optimized then anything else on the market.

4nNtt · Jan 9, 2023

Xiao_Xi said:
Is a 40% reduction in rendering time a real benefit for 3D artists? Because Nvidia's ray tracing cores reduce rendering time in Blender by 40%.

It is a little more complicated then this since the Blender Cycles Optix backend (the backend for RTX cards) also has other optimizations besides just raytracing acceleration that speed it up. Optix has many optimizations implemented that have not yet been implemented in the other backends. Right now Optix is probably the best backend for most renders, but this is a shifting landscape. Both AMD and Apple are also putting effort in to their Cycles backends. AMD is planning to have all the same optimizations the Optix backend has. Apple is mostly hardware limited right now, but they are very likely to have raytracing acceleration at some point.

Apple has also been putting a lot of effort in to Pixar's Hydra which is the API that is usually used to interface with raytracers by professional tools. Apple is definitely putting in the groundwork to be competitive in raytraced rendering they just are not there quite yet. Apple could be an interesting competitor in this space since they are less tied to legacy GPU pipelines and have an amazing performance per watt.

leman · Jan 9, 2023

4nNtt said:
Rumors are that Apple is working on realtime raytracing. It might be out in the M3 and A17 series of processors if it doesn't get pushed again. I think it may be taking longer than other vendors because Apple needs a low power version of realtime raytracing. RTX cards are great for rendering, but they burn a lot of power doing so. Snapdragon 8 Gen 2 will also be adding realtime raytracing so it will be interesting to compare the two. Eventually by being lower power, Apple should actually outperform RTX cards because they will be able to sustain the performance better.

Precisely!

And since Apple has published a bunch of RT-related patents last year we have a fairly good idea about their approach. They perform ray intersection tests using limited precision computation (idea borrowed from IMG technologies) and retest potential hits using regular compute shaders. This would potentially allow them to minimise the area and power consumption needed to do do hardware RT while doing more tests in parallel. Intersection hits are bundled, ordered and dispatched in coherent thread groups in order to improve hardware utilisation. And there are some other patents that discuss more efficient shader program execution (needed for dispatching multiple small on-demand programs) as well as new BHV partitioning schemes.

Hope we will get this hardware soon and curious to see what it can do!

4nNtt · Jan 9, 2023

theorist9 said:
Then again, while Macs are particularly popular among scientists for their personal/professional use, and many scientists thus use Macs to do development work for programs they send to computer clusters*, I suppose Apple doesn't see scientific computing as the main market for its upper-tier machines. [*This would be more for CPU computing than GPU computing. For the latter, they'll probably use NVIDIA/CUDA.]

I think Apple realizes this is an important group for the Mac. Generally the scientists use Macs, but their clusters/GPUs are on Linux so this isn't a huge deal at the moment. I'm sure Apple would like that to change and Metal compute shaders would be well suited since they are almost the same as C++14. From what I understand (I don't have CUDA experience) some projects actually target both CUDA and Metal using a subset of C++ using the same code with a shim layer. Certainly for CPU bound tasks, the M-series processors must look pretty compelling to scientific workloads.

Xiao_Xi · Jan 9, 2023

4nNtt said:
It is a little more complicated then this since the Blender Cycles Optix backend (the backend for RTX cards) also has other optimizations besides just raytracing acceleration that speed it up. Optix has many optimizations implemented that have not yet been implemented in the other backends.

I compared (apparently badly) the rendering times between OptiX and CUDA. It seems that OptiX can improve rendering time by more than 40%. By the way, shouldn't Optix at 4090/4080 reduce the rendering time in percentage more than at 3090/3080 due to ray sorting?

Blender-3.4-Cycles-GPU-Render-Performance-Scanlands-CUDA-vs-OptiX.jpg

Blender 3.4 Performance Deep-dive: Cycles, Eevee & Viewport

Blender 3.4 released just a few weeks ago, and as usual, we couldn't wait long before jumping in and exploring its performance with current GPUs from AMD, Intel, and NVIDIA. With 22 graphics cards and six Cycles and Eevee projects in-hand for rendering and viewport testing, let's find out which...

techgage.com

leman said:
They perform ray intersection tests using limited precision computation (idea borrowed from IMG technologies) and retest potential hits using regular compute shaders. This would potentially allow them to minimise the area and power consumption needed to do do hardware RT while doing more tests in parallel. Intersection hits are bundled, ordered and dispatched in coherent thread groups in order to improve hardware utilisation.

How is it different from how Nvidia does it?

leman · Jan 9, 2023

Xiao_Xi said:
How is it different from how Nvidia does it?

I don't know how Nvidia does it. Do they use limited-precision approximate tests?

diamond.g · Jan 9, 2023

Xiao_Xi said:
I compared (apparently badly) the rendering times between OptiX and CUDA. It seems that OptiX can improve rendering time by more than 40%. By the way, shouldn't Optix at 4090/4080 reduce the rendering time in percentage more than at 3090/3080 due to ray sorting?
View attachment 2139538

Blender 3.4 Performance Deep-dive: Cycles, Eevee & Viewport

Blender 3.4 released just a few weeks ago, and as usual, we couldn't wait long before jumping in and exploring its performance with current GPUs from AMD, Intel, and NVIDIA. With 22 graphics cards and six Cycles and Eevee projects in-hand for rendering and viewport testing, let's find out which...

techgage.com

How is it different from how Nvidia does it?

It looks like SER isn't used "automatically", you have to code for it.

Improve Shader Performance and In-Game Frame Rates with Shader Execution Reordering | NVIDIA Technical Blog

Learn about Shader Execution Reordering (SER), a performance optimization that unlocks the potential for better ray and memory coherency in ray tracing shaders.

developer.nvidia.com

Since only 3 cards support it, it will probably take a while for it to be added to programs.

maflynn · Jan 9, 2023

Unpopular opinion: Ray Tracing is over rated and unless you have a RTX 4090 class GPU, then you probably don't have it enabled in the game.

I have a rather anemic GPU that technically supports it, but is rather ineffective. I used it on a handful of cards and while it did add visual depth and better looking images, its not a night and day difference and playing games the scenes change enough that you won't miss it

I don't see why Apple has to implement it, particularly when gaming is not (at least as of yet) a major priority for them. I'd much rather see them improve other aspects of the GPU and CPU

Why is Apple slow to implement Hardware based Ray-Tracing?

macrumors Core

macrumors regular

macrumors 601

macrumors Core

macrumors 601

macrumors 6502a

macrumors 601

macrumors Core

macrumors 6502

macrumors 6502a

macrumors 68040

macrumors 6502

macrumors 6502a

macrumors 6502

macrumors 65816

macrumors 6502a

macrumors 6502a

macrumors 6502a

macrumors 6502a

macrumors Core

macrumors 6502a

macrumors 68000

macrumors Core

macrumors G5

macrumors Haswell

Our Staff