3D Rendering on Apple Silicon, CPU&GPU

Xiao_Xi · Oct 7, 2022

leman said:
Only Nvidia has proper hardware RT, Intel and AMD just slapped-on some intersection acceleration.

According to Imagination's ray tracing levels system,

Introducing the Ray Tracing Levels System – and what it will mean for gaming - Imagination

Not all ray tracing solutions are created equal. Learn about the Ray Tracing Levels System here.

blog.imaginationtech.com

Intel's ray tracing hardware is more advanced than Nvidia's because it can sort rays.

Intel Gaming Access

game.intel.com

leman · Oct 7, 2022

Xiao_Xi said:
Intel's ray tracing hardware is more advanced than Nvidia's because it can sort rays.

Ah, thanks for pointing out that Intel's hardware is more sophisticated than I thought! Nvidia has that stuff too though, probably since the beginning.

jmho · Oct 7, 2022

Xiao_Xi said:
Intel's ray tracing hardware is more advanced than Nvidia's because it can sort rays.

The 40 series does Shader Execution Reordering, which I believe is the exact same thing.

Xiao_Xi · Oct 7, 2022

jmho said:
The 40 series does Shader Execution Reordering, which I believe is the exact same thing.

You're right! Although I saw the presentation, I completely forgot about it.

leman · Oct 8, 2022

Btw, I was just made aware of this very interesting Apple patent:

https://patentimages.storage.googleapis.com/4f/c5/15/a9366cb4c0fffc/US20220207690A1.pdf

It describes an energy-efficient way to do RT intersection tests - dedicated hardware that runs intersection tests using limited precision and then follows up in a compute shader if the initial test indicates a hit. Since most tests in RT are misses it should massively cut down the time hardware spends doing RT checks.

diamond.g · Oct 8, 2022

leman said:
Btw, I was just made aware of this very interesting Apple patent:

https://patentimages.storage.googleapis.com/4f/c5/15/a9366cb4c0fffc/US20220207690A1.pdf

It describes an energy-efficient way to do RT intersection tests - dedicated hardware that runs intersection tests using limited precision and then follows up in a compute shader if the initial test indicates a hit. Since most tests in RT are misses it should massively cut down the time hardware spends doing RT checks.

Is that different than what AMD is currently doing?

terminator-jq · Oct 8, 2022

leman said:
Btw, I was just made aware of this very interesting Apple patent:

https://patentimages.storage.googleapis.com/4f/c5/15/a9366cb4c0fffc/US20220207690A1.pdf

It describes an energy-efficient way to do RT intersection tests - dedicated hardware that runs intersection tests using limited precision and then follows up in a compute shader if the initial test indicates a hit. Since most tests in RT are misses it should massively cut down the time hardware spends doing RT checks.

Wow that was interesting! Great find!

Based on the file date and publish date, it seems like we could see this ray tracing tech debut in the Mac Pro and then make its way into the M3 series (the M2 Pro/ Max could be possible as well).

It’s good to see Apple making a move like this on the GPU front. On the CPU side, they have definitely demonstrated their ability to beat Intel and AMD with better performance and efficiency but on the GPU side… they need some work. There’s no excuse for a “Pro” level chip to not have at least some sort of hardware raytracing built in. Let’s hope Apple can implement this quickly and effectively.

leman · Oct 8, 2022

diamond.g said:
Is that different than what AMD is currently doing?

No idea. The way I interpret the documentation is that AMD has dedicated instructions that do intersection tests. I do not know whether they have a separate traversal hardware unit or any other optimisations.

diamond.g · Oct 8, 2022

leman said:
No idea. The way I interpret the documentation is that AMD has dedicated instructions that do intersection tests. I do not know whether they have a separate traversal hardware unit or any other optimisations.

From my understanding the texture processing unit does that and intersection traversal. So in a way I guess that part isn't dedicated, but AMD has shaders do the rest (where Nvidia has more hardware dedicated for other stuff that helps the speed along). It seems like Apple is leaning towards the AMD route, per the patent, or at least my understanding of it. Which admittedly could be wrong.

exoticSpice · Oct 8, 2022

I found another RT patent.
Can the persons here explain it a bit more?

Could this patent work together with the other patent posted above?

https://patentimages.storage.googleapis.com/b3/02/76/59b3b1bc715f29/US20220036630A1.pdf

jmho · Oct 8, 2022

exoticSpice said:
I found another RT patent.
Can the persons here explain it a bit more?

Could this patent work together with the other patent posted above?

https://patentimages.storage.googleapis.com/b3/02/76/59b3b1bc715f29/US20220036630A1.pdf

Shaders in metal have essentially three levels:

Threads - this is the number of times you want the shader to run, often in the millions.
Threadgroups - Threads that run together in a batch. The GPU will often take say 512 threads and run them together, these threads can potentially work together and share memory.
Simdgroups: Threadgroups are then broken down even further into simdgroups for actual parallel execution on the GPU. Every thread in a simdgroup executes the exact same code, so it's really important that shaders in a simdgroup do basically the same thing. If shaders in a simdgroup do task A or B or C, then every single shader in the group will have to do A and B and C which is obviously very slow. Ideally you would want to put all task A shaders in one simd group, and all task B in another etc.

This patent seems to be about forming optimal simd groups while raytracing, which is probably a bit like nVidia's Shader Execution Reordering.

leman · Oct 9, 2022

exoticSpice said:
I found another RT patent.
Can the persons here explain it a bit more?

Could this patent work together with the other patent posted above?

https://patentimages.storage.googleapis.com/b3/02/76/59b3b1bc715f29/US20220036630A1.pdf

I find patents notoriously difficult to read, but from a cursory glance it seems to me that this patent simply describes hardware-accelerated RT as such. It’s appears to be about the ability of the ray-tracing hardware to issue work to compute cores and via versa. In a nutshell, the compute engine initiates ray tracing, the dedicated scene traversal hardware performs the ray tracing and invokes up a new compute task that processes the results. Other parts of the patent discuss in detail the interaction between the RT hardware and the compute engine. I can’t really tell if the patent mentions work compacting or ray reordering at all.

To add to @jmho’s reply: Apple GPUs are vector processors that operate on 1024-bit vectors (32 FP32 numbers) per cycle. A single add operation for example operates on inputs of 32 numbers at once. This is what is known as SIMD (single instruction multiple data). So a simdgroup is essentially just a one instance of a shader program. As @jmho mentions, with this kind of hardware you really want every instruction to do as much useful work as possible, so it’s advantageous to design the program (and your data layouts!) in a way that 32 values can be processed at once. Otherwise you are just wasting compute resources. Example: most modern GPUs have massive problems with small or “long” triangles, because they process blocks of 4x8 pixels at once and most of this block will be empty along the edges of such triangles.

And finally, let’s talk about timing. Nvidia RT patents were filed in 2014/2015. If I remember correctly the first RT hardware came out in 2018 - tests three to four years to the market. Apple patents are filed in 2020, so I’d expect the hardware to arrive in 2013 at the earliest. But they might surprise us and give us the M2 Pro with this stuff included already this year. Anyway, it kind of explains why there were no changes to the GPU from the A15 on. Apples Team must be very busy

P.S. These are the Apple RT patents I found, maybe there are more, no idea (we already discussed two of them)

US20220036630A1 - SIMD Group Formation Techniques during Ray Intersection Traversal - Google Patents

Disclosed techniques relate to forming single-instruction multiple-data (SIMD) groups during ray intersection traversal. In particular, ray intersection circuitry may include dedicated circuitry configured to traverse an acceleration data structure, but may dynamically form a SIMD group to...

patents.google.com

US20220207690A1 - Primitive Testing for Ray Intersection at Multiple Precisions - Google Patents

Techniques are disclosed relating to testing whether a ray intersects a graphics primitive, e.g., for ray tracing. In some embodiments, intersection circuitry performs a reduced-precision conservative intersection test and shader circuitry performs an original-precision intersection test if the...

patents.google.com

CN114092614A - Ray Intersection Circuit with Parallel Ray Test - Google Patents

The present disclosure relates to ray intersection circuits with parallel ray testing. The disclosed technology relates to ray intersection processing for ray tracing. In some embodiments, the ray intersection circuitry traverses an accelerated data structure organized in a spatial form and...

patents.google.com

exoticSpice · Oct 9, 2022

leman said:
I find patents notoriously difficult to read, but from a cursory glance it seems to me that this patent simply describes hardware-accelerated RT as such. It’s appears to be about the ability of the ray-tracing hardware to issue work to compute cores and via versa. In a nutshell, the compute engine initiates ray tracing, the dedicated scene traversal hardware performs the ray tracing and invokes up a new compute task that processes the results. Other parts of the patent discuss in detail the interaction between the RT hardware and the compute engine. I can’t really tell if the patent mentions work compacting or ray reordering at all.

To add to @jmho’s reply: Apple GPUs are vector processors that operate on 1024-bit vectors (32 FP32 numbers) per cycle. A single add operation for example operates on inputs of 32 numbers at once. This is what is known as SIMD (single instruction multiple data). So a simdgroup is essentially just a one instance of a shader program. As @jmho mentions, with this kind of hardware you really want every instruction to do as much useful work as possible, so it’s advantageous to design the program (and your data layouts!) in a way that 32 values can be processed at once. Otherwise you are just wasting compute resources. Example: most modern GPUs have massive problems with small or “long” triangles, because they process blocks of 4x8 pixels at once and most of this block will be empty along the edges of such triangles.

And finally, let’s talk about timing. Nvidia RT patents were filed in 2014/2015. If I remember correctly the first RT hardware came out in 2018 - tests three to four years to the market. Apple patents are filed in 2020, so I’d expect the hardware to arrive in 2013 at the earliest. But they might surprise us and give us the M2 Pro with this stuff included already this year. Anyway, it kind of explains why there were no changes to the GPU from the A15 on. Apples Team must be very busy

P.S. These are the Apple RT patents I found, maybe there are more, no idea (we already discussed two of them)

US20220036630A1 - SIMD Group Formation Techniques during Ray Intersection Traversal - Google Patents

Disclosed techniques relate to forming single-instruction multiple-data (SIMD) groups during ray intersection traversal. In particular, ray intersection circuitry may include dedicated circuitry configured to traverse an acceleration data structure, but may dynamically form a SIMD group to...

patents.google.com

US20220207690A1 - Primitive Testing for Ray Intersection at Multiple Precisions - Google Patents

Techniques are disclosed relating to testing whether a ray intersects a graphics primitive, e.g., for ray tracing. In some embodiments, intersection circuitry performs a reduced-precision conservative intersection test and shader circuitry performs an original-precision intersection test if the...

patents.google.com

CN114092614A - Ray Intersection Circuit with Parallel Ray Test - Google Patents

The present disclosure relates to ray intersection circuits with parallel ray testing. The disclosed technology relates to ray intersection processing for ray tracing. In some embodiments, the ray intersection circuitry traverses an accelerated data structure organized in a spatial form and...

patents.google.com

Interesting leman. Thank you for your explanation and you as well @jmho. These next few years are going to be fun!

deconstruct60 · Oct 10, 2022

terminator-jq said:
Based on the file date and publish date, it seems like we could see this ray tracing tech debut in the Mac Pro and then make its way into the M3 series (the M2 Pro/ Max could be possible as well).

debut in Mac Pro? Errr, a tether-less , battery only powered, VR/AR headset likely has much higher pressing need for a hyper low power consumption RT hardware.

Several of these "work way smarter , not harder" patents smell far more driven by being limited by a small battery than on primarily focused on constructing some Nvidia x090/x080 or AMD x900/x800 'killer' large GPU.

It would trickle out the Macs ( and other SoCs) , but it won't be surprising if it doesn't start there.

Boil · Oct 10, 2022

Ray-Tracing debuting in the ASi Mac Pro as part of the toolset for creating content for the forthcoming Apple Mixed Reality glasses/goggles/headset/whatever...?

leman · Oct 10, 2022

Boil said:
Ray-Tracing debuting in the ASi Mac Pro as part of the toolset for creating content for the forthcoming Apple Mixed Reality glasses/goggles/headset/whatever...?

What would be the value of that? Metal has RT for what, two years now? And something like M1 Pro is already fast enough for real-time RT on simpler geometries if you forego denoising.

deconstruct60 · Oct 10, 2022

Boil said:
Ray-Tracing debuting in the ASi Mac Pro as part of the toolset for creating content for the forthcoming Apple Mixed Reality glasses/goggles/headset/whatever...?

If the current API has a call that takes a 'ray' and a point to a function to do the ray computation if necessary then don't really need to change that call at all if there is now a "hardware" to support whether can skip the compute function call. The Apple library code invokes the 'test if a complete miss' call and then skips the user supplied call as appropriate. The application level code wouldn't change at all.

Similar issues if the work scheduling is largely being handled by the Apple foundation library code.

Basically same principle of how Apple dispatches to a ProRes accelerate if there is hardware present or 'falls back' to software ProRes decoding if there isn't one. The applications are suppose to make the same Apple library API call regardless.

Apple has been promoting their AR libraries for more than several years now. The hardware acceleration point should be to make the existing code go faster at far lower power consumption; not that app developers have to rewrite for every iteration of hardware evolution. Few folks are going to use a RT hardware interface that is constantly churning every couple of years.

There is a huge , increasingly legacy preconception that the VR/AR app has to run tether to some gigantic , firebreathing GPU to construct a AR/VR app that runs on tethered headset . That should not be the case at all. For better or worse, Apple killed off that development approach years ago. Their efforts have been to go around that generally across the whole line up so that what folks built toward Apple GPU would move straightforwardly onto the headset. The headset is going to use the baseline Apple GPU tech also.

theorist9 · Oct 10, 2022

Given that NVIDIA's 40-series hardware RT is the product of three generations of refinement (it was introduced with the 20-series), I wonder if Apple's first offering, which I assume will use Imagination's RT tech, will be as performant.

Imagination's marketing materials say it will be, but has there been any independent testing of Imagination's commercial implementation of this, which can be seen in the IMG CXT GPU, to see if the reality lives up to the claims? I did a quick Google search and didn't turn up anything.

leman · Oct 10, 2022

deconstruct60 said:
If the current API has a call that takes a 'ray' and a point to a function to do the ray computation if necessary then don't really need to change that call at all if there is now a "hardware" to support whether can skip the compute function call. The Apple library code invokes the 'test if a complete miss' call and then skips the user supplied call as appropriate. The application level code wouldn't change at all.

Precisely. Metal already has a state of the art raytracing support and it has obviously been designed with hardware acceleration in mind. The software layer won’t change.

theorist9 said:
I wonder if Apple's first offering, which I assume will use Imagination's RT tech, will be as advanced.

Why do you think they will use Imaginations IP? Apples RT patents seem to be an entirely new effort.

theorist9 · Oct 10, 2022

leman said:
Why do you think they will use Imaginations IP? Apples RT patents seem to be an entirely new effort.

I don't know for certain, but I assumed that would be the case because of the deal Apple signed with Imagination in 2020:

Apple’s Imagination Technologies deal is all about ray tracing and AR

Apple and GPU designer Imagination Technologies forged a mysterious new multi-year agreement, almost certainly to bring ray tracing to Apple devices.

venturebeat.com

The patents seem less definitive, since Apple implements only a tiny percentage of what they patent. Often patents are filed either speculatively (for technology a company doesn't intend to use now, but wants the option of using in the future); or defensively (for tech a company has no intention of using now or in the futurre, but wants to prevent competitiors from using).

Do you have anything definitive showing what Apple will use?

And regardless of what Apple uses, my essential question remains: How likely is it that Apple, with their first hardware RT implementation, will be able to offer something as performant as that in NVIDIA's 40-series which, again, is the product of three generations of refinement?

Remember that while Apple's M-series chips were a game-changer, they were an extension of something with which Apple had years of practical experience: The A-series chips. Hardware RT is qualitatively different, since it's something they've never done before, in any form (right?). Hence my question.

Xiao_Xi · Oct 10, 2022

theorist9 said:
How likely is it that Apple, with their first hardware RT implementation, will be able to offer something as performant as that in NVIDIA's 40-series?

For reference, ARC A770 can render almost as fast as an RTX 3050.

Intel-Arc-A770-and-A750-Performance-Blender-BMW-and-Classroom-680x383.jpg

A Creative Arc By Intel: A750 & A770 GPU Workstation Performance...

We recently took an in-depth look at Intel's low-end Arc A380 graphics card in creator workloads, and now, we're able to add the top-end A750 and A770 models to the results. We found some notable highlights of Arc with our A380 look, and as it happens, Arc 7 has created even more.

techgage.com

And it should render faster than the M1 Ultra with 64 cores.

deconstruct60 · Oct 10, 2022

theorist9 said:
I don't know for certain, but I assumed that would be the case because of the deal Apple signed with Imagination in 2020:

Apple’s Imagination Technologies deal is all about ray tracing and AR

Apple and GPU designer Imagination Technologies forged a mysterious new multi-year agreement, almost certainly to bring ray tracing to Apple devices.

venturebeat.com

The patents seem less definitive, since Apple implements only a tiny percentage of what they patent.

Do you have anything definitive showing what Apple will use?

Pretty good chance Apple implementation will not be exactly the same. The bigger issue is how much do the approaches overlap. Patents usually have a section where the reference prior art and related patents. I haven't done an exhaustive search of these new two but there is a pretty good chance they reference some 'concepts' in some other Imagination Tech patents.

For example this first new one listed.

"...

Patent Citations (3)

Publication numberPriority datePublication dateAssigneeTitle
Family To Family Citations
US9928640B2 *2015-12-182018-03-27 Intel Corporation Decompression and traversal of a bounding volume hierarchy
US10825230B2 *2018-08-102020-11-03 Nvidia Corporation Watertight ray triangle intersection
US10970914B1 *2019-11-152021-04-06 Imagination Technologies Limited Multiple precision level intersection testing in a ray tracing system
* Cited by examiner, † Cited by third party
..."

US20220207690A1 - Primitive Testing for Ray Intersection at Multiple Precisions - Google Patents

Techniques are disclosed relating to testing whether a ray intersects a graphics primitive, e.g., for ray tracing. In some embodiments, intersection circuitry performs a reduced-precision conservative intersection test and shader circuitry performs an original-precision intersection test if the...

patents.google.com

Surprise , surprise , surprise. There is an Imagination Tech patent with almost the same title. Completely zero overlap in how those two approached leverage the TBDR tile memory cache or other foundationally common infrastructure. Apple has a different GPU implementation but the stack of patents it is built off of shares lots of infrastructure patents with ImgTech. Apple is still licensing stuff.

Rather than get into an exhaustive litgative match it makes sense just to license the stuff from Imgination tech. It also a defensive move because can see Nvidia is still there also with some similar techniques.

Similar with other one dug up
https://patents.google.com/patent/US20220036630A1/en?oq=US20220036630A1

Go to citations list and look for absence of Imagination Tech. Doesn't happen.

theorist9 said:
And regardless of what Apple uses, my essential question remains: How likely is it that Apple, with their first hardware RT implementation, will be able to offer something as performant as that in NVIDIA's 40-series which, again, is the product of three generations of refinement?

Your presumption is that they are solely after 'preformance' as opposed to 'perf/watt'. Every year from 2020 through 2022 at every new Apple Silicon SoC introduction, Apple gets up and preaches another sermon on "Pref/Watt". When Apple rolls out hardware RT there is a very good chance going to get another "Pref/Watt" sermon.

Apple isn't out to built a x090 'killer' GPU. Nothing that Nvidia makes is even remotely suitable for a AR/VR headset. Qualcomm is the vendor who has real, shipping volume headsets with a SoC. Not Nvidia.

If the point was to make something that was plugged into a wall for power and completely tethered, then Nvidia might be highly relevant.

The other issue is the software. The Nvidia 40 series is benefitting from a multiple years of laying foundation for the software RT calls that calls the hardware. That first 1-1.5 years after Nvidia rollout out their first generation RT hardware pragmatically didn't buy a whole lot.

theorist9 said:
Hardware RT is qualitatively different, since it's something they've never done before, in any form (right?). Hence my question.

Pragmatically, It is not purely a hardware issue. Nvidia GPU hardware is how useful to macOS 13 how with no GPU drivers?

P.S. Just like it is in Apple's best interest to keep getting new Arch licenses for new ARM architectures ( to keep the ecosystem healthy), it is also in Apple's general interest to keep ImagTech somewhat afloat with new licensing also.
If ImaginTech patents fell into the hands of an entity that was hyper hostile to Apple ... may not be saving any money long term by feeding ImaginTech to the wolves.

leman · Oct 10, 2022

theorist9 said:
And regardless of what Apple uses, my essential question remains: How likely is it that Apple, with their first hardware RT implementation, will be able to offer something as performant as that in NVIDIA's 40-series which, again, is the product of three generations of refinement?

Remember that while Apple's M-series chips were a game-changer, they were an extension of something with which Apple had years of practical experience: The A-series chips. Hardware RT is qualitatively different, since it's something they've never done before, in any form (right?). Hence my question.

I think that Apple has enough resources and talent to deliver a good implementation. Hardware RT is a new domain in general, and while Nvidia is an absolute pioneer in this area, it doesn't mean that there is only one viable approach. As @deconstruct60 writes, Apple will probably pursue efficiency (which is not Nvidia's strongest point).

theorist9 · Oct 10, 2022

deconstruct60 said:
Your presumption is that they are solely after 'preformance' as opposed to 'perf/watt'.

I think you misunderstand what I'm asking. I'm wondering about how advanced/performant Apple's RT implementation will be compared to NVIDIA's. I.e., NVIDIA's implementation gives tasks that benefit from RT a certain percent increase in performance over what the performance would be without RT. So when I ask how performant Apple's application of RT is compared to NVIDIA's, I'm asking how Apple's percentage performance gain from RT will compare. I'm not asking about absolute performance.

Or are are you saying that you can get a higher percentage increase in performance from RT by having a bigger area of the die devoted to RT, would would cost more watts?

leman said:
I think that Apple has enough resources and talent to deliver a good implementation. Hardware RT is a new domain in general, and while Nvidia is an absolute pioneer in this area, it doesn't mean that there is only one viable approach. As @deconstruct60 writes, Apple will probably pursue efficiency (which is not Nvidia's strongest point).

See my reply to deconstruct60, above.

deconstruct60 said:
Pragmatically, It is not purely a hardware issue. Nvidia GPU hardware is how useful to macOS 13 how with no GPU drivers?

This is just word salad. When you write this way, you're asking your readers to do the work to translate your words into something comprehensible. Don't make me work so hard to understand you!

leman · Oct 10, 2022

theorist9 said:
I think you misunderstand what I'm asking. I'm wondering about how advanced/performant Apple's RT implementation will be compared to NVIDIA's. I.e., NVIDIA's implementation gives tasks that benefit from RT a certain percent increase in performance over what the performance would be without RT. So when I ask how performant Apple's application of RT is compared to NVIDIA's, I'm asking how Apple's percentage performance gain from RT will compare. I'm not asking about absolute performance.

Since you also directed this question at me, the simple truth is that we don’t know. As Apple is clearly pursuing a different approach than other companies, we’ll have to wait and see. But judging by the patents we’ve seen I think there are reasons to be cautiously optimistic. Using reduced precision for conservative preliminary tests suggests an area-efficient solution, so there is a possibility that Apples hardware might end up with a higher ray testing throughput than the competitors, at lower power.

3D Rendering on Apple Silicon, CPU&GPU

macrumors 68000

macrumors Core

macrumors 6502a

macrumors 68000

macrumors Core

macrumors G5

macrumors 6502a

macrumors Core

macrumors G5

Suspended

macrumors 6502a

macrumors Core

Suspended

macrumors G5

macrumors 68040

macrumors Core

macrumors G5

macrumors 601

macrumors Core

macrumors 601

macrumors 68000

macrumors G5

Patent Citations (3)​

macrumors Core

macrumors 601

macrumors Core

Our Staff

Patent Citations (3)