
richinaus

macrumors 68020
Oct 26, 2014
2,430
2,186
If the claims of 4x RT performance translate to production ray tracing, the M3 Ultra could be as fast as an RTX 4090 in Blender. Of course, I am very skeptical. But 2-2.5x improvements are probably not unrealistic. That would already be phenomenal and make Mac laptops the best choice for CAD and 3D (at least platform-wise).

I wonder what other GPU improvements they made. They mentioned a full redesign. Could it be something to do with closing the feature-parity gap with Nvidia chips?
Leman - your tech talk is music to my ears :)
 

NT1440

macrumors Pentium
May 18, 2008
15,092
22,158
I can see the Asahi team thinking “here we go again” 😁
I think you are actually the reason why this was a main point for me.

Are you the user here who has pointed out that Apple’s performance and low-power cores are miles ahead of the competition in perf/watt while being on a 3+ year old microarchitecture?

This could be BIG news for the Macs when scaled up…
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
Are you the user here who has pointed out that Apple’s performance and low-power cores are miles ahead of the competition in perf/watt while being on a 3+ year old microarchitecture?

A bunch of folks have pointed this out over time, myself included.
 
  • Like
Reactions: NT1440

l0stl0rd

macrumors 6502
Jul 25, 2009
483
416
The first-gen Nvidia RT was about a 4x jump too, so yeah, possible. Will it match the 40 series? Who knows.
 

innerproduct

macrumors regular
Jun 21, 2021
222
353
If you take a look at RT performance for offline rendering in Octane and Redshift, a “normal” scene usually only gains 20-30%, since RT cores only solve BVH traversal/intersections while shading still uses the regular cores. But if you really want to showcase RT perf, make a scene with a huge number of triangles and simple materials, and then performance will be “insane”, as they say on YouTube.
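
To put rough numbers on that (a back-of-the-envelope Amdahl's-law calculation with made-up fractions, not measurements): if BVH traversal/intersection is only ~30% of the frame and that part gets 4x faster, the whole render barely improves, while a traversal-dominated showcase scene gets close to the headline number.

```cpp
#include <cstdio>

// Back-of-the-envelope Amdahl's law: overall speedup when only the
// intersection part of the workload is accelerated.
// The fractions below are illustrative guesses, not measurements.
int main() {
    const double intersect_fraction = 0.30; // share of frame time in BVH/intersection ("normal" scene)
    const double rt_speedup         = 4.0;  // claimed speedup of that part alone

    const double normal_scene =
        1.0 / ((1.0 - intersect_fraction) + intersect_fraction / rt_speedup);
    std::printf("\"normal\" scene: %.2fx overall\n", normal_scene); // ~1.29x

    // Showcase scene: huge triangle count, trivial shading, so traversal dominates.
    const double showcase_fraction = 0.90;
    const double showcase_scene =
        1.0 / ((1.0 - showcase_fraction) + showcase_fraction / rt_speedup);
    std::printf("showcase scene: %.2fx overall\n", showcase_scene); // ~3.1x
}
```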
 
  • Like
Reactions: sirio76

leman

macrumors Core
Oct 14, 2008
19,521
19,674
If you take a look at RT performance for offline rendering in Octane and Redshift, a “normal” scene usually only gains 20-30%, since RT cores only solve BVH traversal/intersections while shading still uses the regular cores. But if you really want to showcase RT perf, make a scene with a huge number of triangles and simple materials, and then performance will be “insane”, as they say on YouTube.

Nvidia OptiX results in much higher improvements than 20-30%. Computing intersections is not much work in itself, but there are a lot of them, and graph traversal is inherently GPU-unfriendly. The main problem with naive GPU RT is that it results in massive stalls, with shaders waiting for final intersection results before they can do, well, shading. Dedicated accelerators help minimize the stalls, dramatically improving shader utilization. It will be even more the case on Apple hardware, where BVH walk and shading are executed asynchronously.
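
To make the stall point concrete, here is a toy lockstep model (plain C++ with invented cycle counts, not any vendor's real figures): with in-shader traversal the whole SIMD group waits on its slowest ray before shading can start, whereas an asynchronous traversal unit lets those cycles be spent on other work.

```cpp
#include <algorithm>
#include <array>
#include <cstdio>

// Toy model of an 8-wide SIMD group doing in-shader BVH traversal.
// Cycle counts are invented purely to illustrate the stall problem.
int main() {
    std::array<int, 8> traversal_cycles = {40, 55, 38, 900, 47, 62, 51, 44};

    // Coprocessor-style RT: the shader issues the query and waits, so the
    // whole group is stalled until its slowest ray returns.
    const int group_stall =
        *std::max_element(traversal_cycles.begin(), traversal_cycles.end());
    std::printf("group stalls %d cycles before any shading starts\n", group_stall);

    // With a dedicated asynchronous traversal unit those cycles can be
    // overlapped with other warps' work instead of sitting idle.
    long idle_lane_cycles = 0;
    for (int c : traversal_cycles) idle_lane_cycles += group_stall - c;
    std::printf("%ld lane-cycles of shader time could be reclaimed\n", idle_lane_cycles);
}
```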
 

richinaus

macrumors 68020
Oct 26, 2014
2,430
2,186
Nvidia OptiX results in much higher improvements than 20-30%. Computing intersections is not much work in itself, but there are a lot of them, and graph traversal is inherently GPU-unfriendly. The main problem with naive GPU RT is that it results in massive stalls, with shaders waiting for final intersection results before they can do, well, shading. Dedicated accelerators help minimize the stalls, dramatically improving shader utilization. It will be even more the case on Apple hardware, where BVH walk and shading are executed asynchronously.
Can we have a Leman Layman translation :) Will Apple be better or worse at RT?
 

Boil

macrumors 68040
Oct 23, 2018
3,477
3,173
Stargate Command
Maybe Otoy will be one of the first with A17 Pro ray-tracing benchmarks, since in the past Jules has touted how well the An-series of SoCs work with the Octane renderer...?
 

MRMSFC

macrumors 6502
Jul 6, 2023
371
381
I do wish our old friend CMaier were here to help unpack all this. I don’t care about the iPhone (I’ve got a 13 mini), but man, this seems like the M3s are going to be a huge upgrade…
I always enjoyed lurking and reading CMaier’s posts.

Certainly better than before :p As to the rest, let's wait for benchmarks. Personally, I expect at least a 2x improvement in Blender, hopefully more.
I feel like, strategically, Apple is focused on improving Blender performance right now.

It kinda fits with their “prosumer” branding.
 

Homy

macrumors 68030
Jan 14, 2006
2,506
2,459
Sweden
I do wish our old friend CMaier were here to help unpack all this. I don’t care about the iPhone (I’ve got a 13 mini), but man, this seems like the M3s are going to be a huge upgrade…

I always enjoyed lurking and reading CMaier’s posts.


I feel like, strategically, Apple is focused on improving Blender performance right now.

It kinda fits with their “prosumer” branding.

You can find him here.
 

dmccloud

macrumors 68040
Sep 7, 2009
3,142
1,899
Anchorage, AK
I made a comparison video between Nvidia cards and M-series Macs using Cinebench. Unfortunately, the Macs are nowhere near a 4090 or even a 3090.
It's not unexpected, but it's still discouraging if you're a 3D user who wants to use a Mac.

For the same price as a maxed-out M2 Mac Studio you can get a PC with two 4090s. The PC setup will be five times faster than the Mac when rendering.

Cinebench stresses the CPU, not the GPU.
 
  • Like
Reactions: MRMSFC

leman

macrumors Core
Oct 14, 2008
19,521
19,674
One might wonder if the GPU benchmark is skewed towards Nvidia GPUs like the CPU benchmark is skewed towards Intel CPUs...?

The new version of the CPU benchmark appears to be skewed in Apple's favour, actually (which makes total sense if they have used a larger scene, as they claim; Apple CPUs will have an advantage due to their humongous caches).

The GPU version, on the other hand, runs really well on AMD for some reason. Nvidia is still on top, of course, but AMD performs much better than one would expect based on Blender.
 
  • Like
Reactions: Boil and MRMSFC

name99

macrumors 68020
Jun 21, 2004
2,410
2,317
Yes, but there is a but: Apple’s RT approach is inherently much more power-efficient than Nvidia’s, at least based on the patents I’ve read. It could be possible for Apple to beat a much larger and more power-hungry GPU on the RT metric. In pure shader power, not really (here die area and power consumption constraints reign supreme). But with smarter tech Apple could punch way above its weight in certain applications, like we already see the M2 performing surprisingly well in rasterization thanks to TBDR.

I think the more interesting question is the extent to which dedicated HW vs clever reuse of existing hardware is involved.
If you look at Nvidia, they have dedicated RT hardware (which appears to have no value when you're not ray tracing) and dedicated tensor hardware. They have done a reasonable job of adding value to the tensor hardware (it can now perform FP64 multiplications at a useful speed, and can even perform FP16 non-matrix calculations). But as far as I know, there is zero value to the RT hardware apart from ray tracing.

Apple have a different set of solutions for matrix multiply (NPU for one set, AMX for a different set, a cute reuse of the GPU FP hardware for a third set) so they've chosen a very different path from nV in that respect.
Have they also chosen a different path for ray tracing? A way to fit RT (and in particular autonomous execution by RT code) onto the standard shader hardware (maybe a core is switched to an "RT mode" and runs some local millicode)? Certainly it seems sub-optimal to devote area to hardware that gets little other use.

(The other interesting question is whether Apple have dropped dedicated geometry hardware, i.e. specific vertex and tessellation IP blocks, for a generic mesh shader path that operates on the generic shader hardware, like Nvidia did with, what, Turing?
This is kinda the same as the RT situation - drop hardware that's only useful when you're doing one particular thing in favor of generic hardware...)
 
  • Like
Reactions: Chuckeee

name99

macrumors 68020
Jun 21, 2004
2,410
2,317
Any particular reason they cannot backport any scheduler changes to M1 Ultra? That kind of thing shouldn't be "hardcoded" right?
Making optimal scheduling decisions depends on knowing what's going on in each core, which means that the core has to count the appropriate events and report them upward.
It's possible that the M1 Max chips don't count some events that are relevant to distributing work between chiplet A and chiplet B. Or it's possible that, even though these are counted, there isn't a mechanism in place to report them from one chiplet to the other.

It's not easy to know, in advance, everything that you will need to monitor and report cross-chiplet for optimal performance!
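
Loosely, what "report them upward" buys you (a purely hypothetical sketch; the counters and the placement policy are invented for illustration and have nothing to do with Apple's actual interfaces): the scheduler can only act on events the silicon actually counts and exposes.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// Hypothetical per-chiplet telemetry a scheduler might consult when placing
// new work. All names and the policy here are invented for illustration.
struct ChipletCounters {
    uint64_t busy_cycles;
    uint64_t cross_die_traffic; // only usable if the hardware actually counts
                                // and reports this event off-chiplet
};

// Naive placement policy: prefer the chiplet reporting less cross-die traffic.
// If that counter simply doesn't exist on a given chip, the OS has nothing
// better than load or round-robin to go on.
int pick_chiplet(const std::array<ChipletCounters, 2>& c) {
    return (c[0].cross_die_traffic <= c[1].cross_die_traffic) ? 0 : 1;
}

int main() {
    const std::array<ChipletCounters, 2> counters = {{{900000, 12000}, {950000, 3000}}};
    std::printf("schedule the next thread on chiplet %d\n", pick_chiplet(counters));
}
```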
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
Have they also chosen a different path for ray tracing? A way to fit RT (and in particular autonomous execution by RT code) onto the standard shader hardware (maybe a core is switched to an "RT mode" and runs some local millicode)? Certainly it seems sub-optimal to devote area to hardware that gets little other use.

If one goes by the patents, Apple’s RT is indeed a separate module, running its own programs. But the way they describe it makes me think that it won’t need too much area. They run computations at low precision (allowing for false positives), so those circuits probably won’t need too much space. The rest is sorting and compacting circuitry, which should be fairly small as well. The biggest problem is likely cache, but if I understand it correctly the RT cache gets sub-allocated from the shader cache.

What I find particularly interesting, though, is the execution model. From what I understand, RT functionality is associated with the texture fetch units in AMD’s and Nvidia’s implementations (certainly in AMD’s). The RT assist hardware is a classical coprocessor - the shader program fires a graph traversal request and stalls until the result is available. Apple describes a continuation-based approach: the shader program prepares ray data and quits, passing control to the RT unit. The latter does its job and launches a new shader program to validate and process the results. This style has far-reaching consequences. For example, there are fewer stalled programs to track. The RT unit is free to reorganize results to improve shader occupancy (reduce divergence). With this model it’s OK to wait hundreds or thousands of GPU cycles before the intersection results are back, giving the hardware more opportunities to build larger SIMD bundles. What’s more, this approach requires hardware that’s optimized for running multiple small shaders and launching shaders from shaders. I wouldn’t be surprised if it turns out that the A17 introduces these capabilities to Metal. Hope the documentation will be out soon.
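
A heavily simplified illustration of the two control-flow styles, in plain C++ rather than any real shading language (the types, the queue, and the scheduling are all invented; this is just my reading of the description, not Apple’s actual mechanism):

```cpp
#include <cstdio>
#include <functional>
#include <queue>

// Toy contrast between coprocessor-style and continuation-style ray tracing.
// Everything here is invented for illustration only.
struct Ray { float origin[3]; float dir[3]; };
struct Hit { int primitive; float t; };

Hit traverse_bvh(const Ray&) { return Hit{42, 1.5f}; } // stand-in for the RT unit's work

// (1) Coprocessor style: the shader fires the query and *waits* for the
//     answer before shading, occupying registers and scheduler slots.
void shade_blocking(const Ray& r) {
    const Hit h = traverse_bvh(r);   // shader stalls here
    std::printf("blocking: shade primitive %d\n", h.primitive);
}

// (2) Continuation style: the shader packages the ray and *exits*; the RT
//     unit traverses at its leisure, regroups results, and launches a fresh
//     small shader to validate and shade them. Nothing sits around waiting.
struct Continuation { Ray ray; std::function<void(const Hit&)> resume; };
std::queue<Continuation> rt_unit_queue;

void shade_continuation(const Ray& r) {
    rt_unit_queue.push({r, [](const Hit& h) {
        std::printf("continuation: shade primitive %d\n", h.primitive);
    }});
    // the shader ends here and its resources are freed
}

int main() {
    const Ray r{};
    shade_blocking(r);
    shade_continuation(r);

    // The "RT unit" drains its queue whenever results are ready (possibly
    // thousands of cycles later), relaunching shading work in batches.
    while (!rt_unit_queue.empty()) {
        Continuation c = rt_unit_queue.front();
        rt_unit_queue.pop();
        c.resume(traverse_bvh(c.ray));
    }
}
```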


(The other interesting question is whether Apple have dropped dedicated Geometry hardware, ie specific Vertex and Tesselation IP blocks, for a generic Mesh shader path that operates on the generic shader hardware? Like nVidia did with, what, Turing?
This is kinda the same as the RT situation - drop hardware that's only useful when you're doing one particular thing in favor of generic hardware...)

I don’t think Apple ever had this kind of hardware, not for some years at least. Vertex shading and tessellation are just compute kernels over buffers of memory. So are mesh shaders for what it’s worth. Fragment shaders are also compute kernels operating on tiles (e.g. 32x32 dispatch size). It’s just that you have to insert some fixed function hardware between these kernels (e.g. binner and rasterizer).
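
Spelled out as a sketch of that mental model (conceptual C++ with stubbed-out stages; nothing here is Metal’s or any driver’s real code):

```cpp
#include <cstdio>
#include <vector>

// Conceptual view of a TBDR frame: compute-style kernels over buffers, with
// fixed-function hardware slotted between them. All stages are stubs.
struct Vertex   { float position[4]; };
struct Triangle { Vertex v[3]; };
struct Tile     { int x, y; std::vector<Triangle> tris; }; // e.g. a 32x32-pixel tile

// "Vertex shading" is just a kernel over a buffer of vertices.
std::vector<Vertex> vertex_kernel(const std::vector<Vertex>& in) { return in; }

// Fixed-function step between the kernels: binning/rasterizing into tiles.
std::vector<Tile> bin_and_rasterize(const std::vector<Vertex>& verts) {
    (void)verts;               // a real binner would consume the post-transform vertices
    return { Tile{0, 0, {}} }; // stub: one empty tile
}

// "Fragment shading" is another kernel, dispatched per tile.
void fragment_kernel(const Tile& t) { std::printf("shading tile (%d, %d)\n", t.x, t.y); }

int main() {
    const std::vector<Vertex> vertices(3);
    const auto transformed = vertex_kernel(vertices);        // compute kernel
    const auto tiles       = bin_and_rasterize(transformed); // fixed-function step
    for (const auto& t : tiles) fragment_kernel(t);           // compute kernel per tile
}
```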
 
  • Like
Reactions: name99

name99

macrumors 68020
Jun 21, 2004
2,410
2,317
No, I don't. Because this means competing on a very different market with very different margins. It's not Apple's business and I doubt they could make it profitable with their technology. They are simply not positioned for this kind of push.

This is a fair point. Very possible that you are right. But it's also possible that by offering a compelling product for a reasonable price a certain niche can be carved out. Apple's technology would work well in this scenario. Whether they are interested is a whole different question.
This is a business decision, not a technology decision. Twenty years ago Apple's assumption was that you store music locally; now it's that you rent it. Things change.
Already with XCode Cloud Apple has started moving down this path (admittedly for a very limited sort of workload).

The real, technical, issue is network performance. SOME tasks (like compiling) can work well in the cloud; others (like video editing) would be ludicrous. Everything is changing so fast that it's not clear quite what the best sort of solution might be, and for what use cases. Is it better if I "log in" to what looks like a remote Mac and operate it as such (my own account on that Mac, my own local file system, etc.)? Or is it better that Apple provide APIs that allow apps that so desire (XCode? Mathematica? Blender?) to ship work to an Apple cloud?
How about an innovative business strategy, like sharing revenue with a company when this functionality is used, to encourage Adobe or Autodesk or whatever to add the functionality?

Apple always plays the long game. Being a me-too cloud provider like Amazon or Azure is not a game they will win; even many of their developers, let alone customers, are not interested in dealing with those sorts of UIs. But generalizing the way XCode Cloud works (where the user feels like they are just using standard XCode, only way faster) could be a real game changer.
My guess is that, with XCode Cloud, there's already *some sort* of Apple Silicon-based "server" hardware setup being tested. But going further requires
- learning from the (internal) APIs used by XCode, to design public APIs
- learning from the hardware limitations hit by the XCode "server" hardware
Maybe this is far enough along that we will see something by WWDC 2024? But that seems optimistic to me. Maybe at least one more generation of hardware learning is required.
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,317
Exactly. I don't think there is anything.
As I said in my earlier reply: the functionality Apple can add that others cannot is
- APIs that are extensions of current APIs but perform large units of work remotely
- these APIs could make use of CPU, GPU, and/or NPU functionality
- and even a business model that encourages developers to make use of these APIs

MS can probably copy this once they see Apple do it, but their developers are notoriously uninterested in adopting new APIs. And going beyond CPU is hard (GPU means all sorts of endless negotiations to get the next round of D3D out the door).
Google can maybe do it within Android?
AMZ has no obvious way to get on this track.
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,317
Sure, I have used C4D Team Render with a couple of Mac Pros and iMacs several times, but never used VRay for this. Also, there is another problem: the shiny, magnificent, superior, magical "Subscription Model" :). For Maxon/Redshift you had to buy another license for each machine, no matter what. I have no idea how it works with VRay DR; if it is reasonably priced, it could work within a sensible budget.

So you're saying that existing solutions have value but are not aggressively used because people
- find them clumsy to set up
- find the extra billing irritating
?

Mmm, sounds EXACTLY like a business opportunity...

Hell, the Apple scheme I have in mind could also (with appropriately written APIs) direct requests to local hardware (i.e. one set of APIs and one setup to distribute work across my random Macs and various apps, rather than the current disaster where every large app has its own local farm-creation software, and they all suck!)
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,317
If one goes by the patents, Apple’s RT is indeed a separate module, running its own programs. But the way they describe it makes me think that it won’t need too much area. They run computations at low precision (allowing for false positives), so those circuits probably won’t need too much space. The rest is sorting and compacting circuitry, which should be fairly small as well. The biggest problem is likely cache, but if I understand it correctly the RT cache gets sub-allocated from the shader cache.

What I find particularly interesting, though, is the execution model. From what I understand, RT functionality is associated with the texture fetch units in AMD’s and Nvidia’s implementations (certainly in AMD’s). The RT assist hardware is a classical coprocessor - the shader program fires a graph traversal request and stalls until the result is available. Apple describes a continuation-based approach: the shader program prepares ray data and quits, passing control to the RT unit. The latter does its job and launches a new shader program to validate and process the results. This style has far-reaching consequences. For example, there are fewer stalled programs to track. The RT unit is free to reorganize results to improve shader occupancy (reduce divergence). With this model it’s OK to wait hundreds or thousands of GPU cycles before the intersection results are back, giving the hardware more opportunities to build larger SIMD bundles. What’s more, this approach requires hardware that’s optimized for running multiple small shaders and launching shaders from shaders. I wouldn’t be surprised if it turns out that the A17 introduces these capabilities to Metal. Hope the documentation will be out soon.


I don’t think Apple ever had this kind of hardware, not for some years at least. Vertex shading and tessellation are just compute kernels over buffers of memory. So are mesh shaders for what it’s worth. Fragment shaders are also compute kernels operating on tiles (e.g. 32x32 dispatch size). It’s just that you have to insert some fixed function hardware between these kernels (e.g. binner and rasterizer).
(a) Can you give a reference to the Ray Tracing patents you've been looking at?

(b) The reason I mentioned geometry hardware is that even many of the recent Apple patents talk about a Vertex Master (along with a Pixel Master and a Data Master). It's hard to know what these refer to, but Imagination documentation from, say, around 2018 says that these ultimately refer to dedicated hardware as opposed to the generic shader hardware.
 