
altaic

macrumors 6502a
Jan 26, 2004
711
484
[Attachment: Blender Open Data benchmark results]

So now add Hardware RT to that 🤔

Ah, sorry: that's Blender Open Data, if anyone wonders.

I do not think RT is used, as 3.6 should not be ready for it, but I could be wrong.

Blender 4 should be.

@leman mentioned this strong showing could be due to the dynamic cache.

It is very impressive.

Yes, MetalRT in older Blender versions can already use HW RT. See the Geekerwan test posted above.
Performance might be a bit different on more recent versions.

I just came across a very interesting post on Blender devtalk by Brecht Van Lommel: "The benchmark is currently not using MetalRT. When Blender 4.0 is out, the benchmark will use MetalRT for the M3, but not the M1 and M2." If the name sounds familiar, it's because he’s a Principal Engineer at Blender HQ, the lead Cycles dev, etc.

Those M3 Blender benchmark results should get a big boost with 4.0.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Those M3 Blender benchmark results should get a big boost with 4.0.
While we wait until next week for the official release of Blender 4.0, those with an M3 MBP could test the Blender 4.0 beta.

Out of curiosity, which demo file on the Blender website could most clearly show the capabilities of Apple's new GPU ray tracing hardware IP?
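For anyone who wants to time a demo file themselves, here is a rough sketch of a headless render wrapped in Python for timing; the Blender path and scene name are placeholders, and the arguments after the double dash are passed through to Cycles.

```python
import subprocess, time

# Placeholders: point these at your Blender 4.0 beta install and a demo .blend file.
BLENDER = "/Applications/Blender.app/Contents/MacOS/Blender"
SCENE = "lone_monk_gpu.blend"

start = time.time()
# -b renders without the UI, -f 1 renders frame 1; everything after "--" goes to Cycles.
subprocess.run(
    [BLENDER, "-b", SCENE, "-f", "1", "--", "--cycles-device", "METAL"],
    check=True,
)
print(f"Wall-clock render time: {time.time() - start:.1f} s")
```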
 

l0stl0rd

macrumors 6502
Jul 25, 2009
483
416
While we wait until next week for the official release of Blender 4.0, those with an M3 MBP could test the Blender 4.0 beta.

Out of curiosity, which demo file on the Blender website could most clearly show the capabilities of Apple's new GPU ray tracing hardware IP?
Not sure which one is best.

I will just say that the Lone Monk one is over 1 min faster on the M3 Pro compared to the M2 30-core ;)

Perhaps even more impressive, viewport animation playback is faster too, by 2.5 fps.

Just don't use BMW; that one is worthless.
 

leman

macrumors Core
Oct 14, 2008
19,520
19,671
Dave2D tested the M3 Max in Blender 4.0:

[Attachment: Dave2D's Blender 4.0 results]

I tried to look up the Gooseberry test scene and it seems it uses the CPU only. I am still surprised by the very small improvement over the M2 Max. In the official Blender benchmark results, M3 Max CPU renders are over 60% faster than M2 Max CPU renders. Something went wrong here. Maybe it's limiting the number of cores in use.
 

galad

macrumors 6502a
Apr 22, 2022
610
492
I just came across a very interesting post on Blender devtalk by Brecht Van Lommel: "The benchmark is currently not using MetalRT. When Blender 4.0 is out, the benchmark will use MetalRT for the M3, but not the M1 and M2." If the name sounds familiar, it's because he’s a Principal Engineer at Blender HQ, the lead Cycles dev, etc.

Those M3 Blender benchmark results should get a big boost with 4.0.
That's about the Blender Benchmark app; you can already benchmark it yourself with MetalRT enabled, like the people above did.
 

l0stl0rd

macrumors 6502
Jul 25, 2009
483
416
That's about the Blender Benchmark app; you can already benchmark it yourself with MetalRT enabled, like the people above did.
Exactly. However, the difference in the Blender 4 beta with and without it is smaller than I expected, at least for now.
 

aytan

macrumors regular
Dec 20, 2022
161
110
While we wait until next week for the official release of Blender 4.0, those with an M3 MBP could test the Blender 4.0 beta.

Out of curiosity, which demo file on the Blender website could most clearly show the capabilities of Apple's new GPU ray tracing hardware IP?
I mostly use the Scanlands demo file if I need to look for improvements across Blender versions.
With 4.0, Blender is ready for M3 hardware RT:

The new Apple M3 processor supports hardware ray-tracing. Cycles takes advantage of this by default, with the MetalRT in the GPU device preferences set to "Auto".

For the M1 and M2 processors MetalRT is no longer an experimental feature, and fully supported now. However it is off by default, as Cycles' own intersection code still has better performance when there is no hardware ray-tracing.

There are improvements for the M1 and M2 generations too with Blender 4.0, which is a Release Candidate now. I do not rely on benchmarks generally; your workflow determines whether any given software is fine for you or not.
The point is what you are doing and how you will do it.
An example of this is Substance Painter: it is great software, however it adds extra steps, consumes time, and adds more and more branches to your file structure. If it does not fit the budget limits for a project, I tend not to use it.
I suggest measuring new versions with your own old scenes if possible. There will always be minus and plus results with every hardware/software generation.
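If you want to confirm from a script what Cycles will actually use on your machine, here is a minimal sketch using Blender's Python API; the preference property names are an assumption and have changed between releases, so treat it as a starting point rather than gospel.

```python
import bpy

# Assumed property names; check them in Blender's Python console if this errors.
cprefs = bpy.context.preferences.addons["cycles"].preferences
cprefs.compute_device_type = "METAL"   # use the Metal backend
cprefs.get_devices()                   # refresh the device list

# Older builds exposed a boolean use_metalrt; 4.0 builds may expose an Off/Auto/On choice instead.
setting = getattr(cprefs, "use_metalrt", getattr(cprefs, "metalrt", "unknown"))
print("MetalRT preference:", setting)
print("Devices:", [(d.name, d.type, d.use) for d in cprefs.devices])
```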
 

altaic

macrumors 6502a
Jan 26, 2004
711
484
And that, my friends, is why you discuss big redesign plans with the project owners before you submit a huge patch :)
It seems to have started a worthwhile discussion with significant benefits, however, so it’s not wasted effort 🙂
 

leman

macrumors Core
Oct 14, 2008
19,520
19,671
Exactly. However, the difference in the Blender 4 beta with and without it is smaller than I expected, at least for now.

Don't look at Dave2D's results; he seems to be using a CPU-only test which behaves weirdly anyway. Geekerwan and others who have actually tried MetalRT Cycles renders have observed a 2x speedup.
 

l0stl0rd

macrumors 6502
Jul 25, 2009
483
416
Don't look at Dave2D's results; he seems to be using a CPU-only test which behaves weirdly anyway. Geekerwan and others who have actually tried MetalRT Cycles renders have observed a 2x speedup.
I am not; I am looking at my own results.

One example: Lone Monk
No RT: 8:54
RT: 6:31

Blender 4 beta.

Never trust “common” YouTubers and Blender anyway.
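For context, a quick back-of-the-envelope on those timings (read as mm:ss):

```python
# Quick sanity check of the Lone Monk timings above.
def to_seconds(t: str) -> int:
    m, s = t.split(":")
    return int(m) * 60 + int(s)

no_rt, with_rt = to_seconds("8:54"), to_seconds("6:31")
print(f"MetalRT speedup: {no_rt / with_rt:.2f}x")  # ~1.37x, short of the ~2x reported elsewhere
```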
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
I mostly use the Scanlands demo file if I need to look for improvements across Blender versions.
Geekerwan and others who have actually tried MetalRT Cycles renders have observed a 2x speedup.
If the new ray tracing IP really does deliver 2x improvements, Scanlands seems like a good benchmark, as other GPUs get a similar speed increase.
[Image: Blender Cycles GPU rendering performance chart for the Scanlands scene]


I am not; I am looking at my own results.

One example: Lone Monk
No RT: 8:54
RT: 6:31

Blender 4 beta.
For reference, this video shows that an RTX 3070 takes about 8 minutes with CUDA and 3.5 minutes with OptiX to render Lone Monk.
 

leman

macrumors Core
Oct 14, 2008
19,520
19,671
I am not; I am looking at my own results.

One example: Lone Monk
No RT: 8:54
RT: 6:31

Blender 4 beta.

Never trust “common” YouTubers and Blender anyway.

Which config are you using? Do you also have an earlier Apple Silicon machine? If so, what kind of results are you getting there?

The improvements will vary from scene to scene, but your scores seem a bit low to me.
 

l0stl0rd

macrumors 6502
Jul 25, 2009
483
416
Which config are you using? Do you also have an earlier Apple Silicon machine? If so, what kind of results are you getting there?

The improvements will vary from scene to scene, but your scores seem a bit low to me.
M3 Pro with 36 GB

Yes, of course. With Monster Under the Bed, the difference is way less.

Sure, give me a sec, I will render it on the M1 Max with Blender 4.

There you go: 10:22 for the full M1 Max.
 

Pressure

macrumors 603
May 30, 2006
5,179
1,544
Denmark
What scene are you rendering?

Just did a few scenes on the newest beta, but I honestly don't know what settings I am supposed to be using.

Scanlands: 05:05.77
Lone Monk: 07:28.44
Monster under the bed: 01:28.28

This is on the base M2 Max (30-core GPU).

Update: Redid the demos with the GPU only (I had wrongly enabled the CPU as well before).
 

l0stl0rd

macrumors 6502
Jul 25, 2009
483
416
What scene are you rendering?

Just did Scanlands on the newest beta and got a time of 7:44.85 but I honestly don't know what settings I am supposed to be using.

This is on the base M2 Max (30-core GPU).
I used this with the settings the scene comes with, as it is already set to GPU.


I did not try Scanlands, but perhaps I will, as it has quite a bit of geometry.

This one might be interesting too.

I did try it on some machines in the past and they ran out of VRAM, especially when not using tiling.
 

aytan

macrumors regular
Dec 20, 2022
161
110
What scene are you rendering?

Just did Scanlands on the newest beta and got a time of 7:44.85 but I honestly don't know what settings I am supposed to be using.

This is on the base M2 Max (30-core GPU).
If you use a 4.0 Alpha or Beta, don't mess with any render settings; it is OK with the demo file as it comes. Just enable only the GPU in the render device section, otherwise the CPU will get involved in the render and that will end up with longer render times. But your time looks OK; an M1 Ultra with 48 GPU cores renders the scene in around 4:26 with the Blender 4.0 Beta.
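If you prefer to set that from the Python console, something like this should do it; the exact property and device-type names are an assumption, so double-check them on your build.

```python
import bpy

# Assumed property names; adjust if your build differs.
cprefs = bpy.context.preferences.addons["cycles"].preferences
cprefs.compute_device_type = "METAL"
cprefs.get_devices()

# Tick only the non-CPU devices so the CPU does not join the render.
for dev in cprefs.devices:
    dev.use = dev.type != "CPU"

bpy.context.scene.cycles.device = "GPU"
```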
 


innerproduct

macrumors regular
Jun 21, 2021
222
353
If you use a 4.0 Alpha or Beta, don't mess with any render settings; it is OK with the demo file as it comes. Just enable only the GPU in the render device section, otherwise the CPU will get involved in the render and that will end up with longer render times. But your time looks OK; an M1 Ultra with 48 GPU cores renders the scene in around 4:26 with the Blender 4.0 Beta.
If you check this German review, you get the impression that the Scanlands scene renders in 1:09 on the M3 Max (and more than 7 minutes on the M1 Max). Sadly it is not explicitly stated. Watch at 4:40.
 

aytan

macrumors regular
Dec 20, 2022
161
110
If you check this German review, you get the impression that the Scanlands scene renders in 1:09 on the M3 Max (and more than 7 minutes on the M1 Max). Sadly it is not explicitly stated. Watch at 4:40.
Yes, actually it is :)
I watched this video yesterday and linked it at the Redshift forums.
I think a full 40-GPU-core M3 Max could achieve a render time under 2 minutes; my 48-GPU-core M1 Ultra gets 4:26.09, and a 40-core M3 Max is roughly 2.5 times faster than my Ultra.
I got an RTX 4070 result from my PC with the Blender 4.0.0 Release Candidate: OptiX 0:40.80 / CUDA 1:59.10.
These results suggest the upcoming M3 Ultra could maybe catch the 4080, or might only reach 4070 level.
Anyway, the 40-GPU-core M3 Max beats the 4070's CUDA score right now.
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,311
And that, my friends, is why you discuss big redesign plans with the project owners before you submit a huge patch :)
And that, my friends, is why you should understand something about what you are commenting on before simply operating on autopilot (i.e. "whatever just happened proves my theories of the world are completely correct").

To quote some items from the back-and-forth regarding the patch:

"Hey @brecht, here is the PR that I was mentioning. There's quite a bit to take in, and some details that need finessing on the pipeline caching side, but when you get a moment it would be great to get some initial feedback."
Doesn't sound exactly like "here's a huge set of changes that you haven't been informed about"...

"
I was imagining this to go even further, eliminating the SVM stack and using stack variables instead, so that the compiler can put them registers when possible. That would requires more significant changes of course, refactoring all SVM nodes to have a function that directly takes the inputs and outputs as function arguments. And then for the non-specialized case, much of the ugly SVM pack/unpacking/offsetting code can be auto generated as well.

Of course that's a bigger change, and the performance impact of those code changes would need to be tested. But it would make the code much cleaner and easier to maintain I think, including the existing SVM code."
Doesn't sound like a complaint that the big redesign is too big, or was not discussed by all stakeholders....


Looks to me very much like the way these things usually happen. An initial discussion that "this sucks, we should do better" "OK, how about A and B and C", "Sounds good let me work on it",
"Here's a patch that implements A and B", "What about C?", "I agree C is a good idea, but I wanted to get A and B out there and working, then we'll add C",
"To my eyes, A and B without C creates a complicated intermediate state that we should try to avoid committing. How about we do things in a different order to get to the same ultimate goal, but avoiding the intermediate complexity"

And that's life, when multiple competent engineers are working towards a common goal.
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,311
Here's my best analysis of what Dynamic Caching REALLY is.
This is cut from a much larger document describing the Apple GPU in depth, but I think it stands alone enough to explain the relevant points.



resource allocation (2017 and now)

We've now seen that resource allocation is an important part of scheduling, not least in the sense that we cannot start the execution of a new task (whether threadblock or warp) until certain hardware is allocated.
Consider now the allocation of "memory-like" resources, eg Scratchpad storage for a new threadblock, or registers for a new warp.
How might we do this?
[snip]

Apple released a video today that basically confirmed the above analysis.
The primary point is virtualization, with faulting of memory-like resources (registers, Scratchpad and Tile memory).
This allows packing substantially more simultaneous threadgroups and warps onto each GPU core, and thus fewer cycles wasted while no warp is runnable.
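A toy illustration of why that matters; every number here is made up purely to show the shape of the argument.

```python
# Entirely hypothetical core: 64 KB of register SRAM, 32 lanes per warp, 4-byte registers.
REG_SRAM = 64 * 1024
LANES, REG_BYTES = 32, 4

worst_case_regs = 128   # what a static allocator must reserve per lane (worst-case path)
typical_regs = 48       # what the hot path actually touches most of the time

static_warps = REG_SRAM // (worst_case_regs * REG_BYTES * LANES)   # -> 4 resident warps
dynamic_warps = REG_SRAM // (typical_regs * REG_BYTES * LANES)     # -> 10 resident warps
print(static_warps, dynamic_warps)   # more resident warps = more latency hiding
```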

BUT Apple also implemented another idea I suggested as possible.
A few years ago nVidia provided a unified pool of SRAM that can have one portion used as GPU L1D, the other portion as Scratchpad, so that after all the Scratchpad storage is allocated, whatever is left over, large or small, gets at least some use as L1D. I suggested Apple might do this, but they have actually gone one better!
They provide a single pool of SRAM storage that is shared by registers, Scratchpad/Tile, and L1D cache, with each type of use case dynamically "competing" against the others so that this SRAM is always in use by the most deserving clients, whether those clients mostly use registers, Scratchpad/Tile, or even L1D (which, remember, includes things like stack usage). Whatever excess storage is not available in this SRAM pool will spill to L2, but will be tracked (in the same way that an OS tracks paging activity) to ensure that most of the time the core is not thrashing (by sleeping some warps so that not so many are competing for the pool of SRAM).

A second nice feature was added.
The shader core, to simplify, has one execution unit that handles FP32, one that handles FP16, one that handles Integer, and one that handles complex computations (sqrt, that sort of thing). It seems a shame to only use one of these units every cycle.
In a sense we want to allow superscalar execution (more than one instruction per cycle) but the details of doing this are tricky, in that we cannot afford the massive machinery used by the CPU to achieve this. So what can we do?
Different vendors solve this in different ways.
- nVidia does it by making each pipeline only 16-wide. So in the first cycle I send an instruction to execution unit one, and that instruction will dispatch over two cycles (the first 16 lanes of the warp, then the second 16 lanes). In the second cycle I can then send an instruction to execution unit two, and so we go. This sort of staggered execution allows a reasonable amount of two units executing per cycle, even if it is one unit executing the first half of a warp, and the second unit executing the second half of a different warp.
- AMD does it by having a few instructions that are SIMD2, so that I send out a single instruction which performs the same operation on a register# and that register#+1. This works well for them given the sets of paired execution units they have, and their use of something called Wavefront64, which, especially for graphics, uses warps that are 64 rather than 32 wide.

- Apple's solution needs to take into account Apple-specific issues, which are:
+ they have a separate FP32 and FP16 pipeline (rather than doing the "obvious" thing of executing FP16 on the FP32 pipeline, because each of the two pipelines can be energy/area optimized if it doesn't have to do two tasks)
+ which means they can't easily do something like an AMD SIMD2 solution. They also make substantial use of hardware that is optimized for 32-wide lanes (the way they handle matrix multiplies, or shuffles, or reductions, or non-uniform warp size, or ...) and presumably they don't want to have to redesign that for 16-wide hardware, for an nVidia-like solution.
+ BUT Apple has the advantage that (even before Dynamic Caching) their GPU cores were sized so that they could sustain about 1.5x as many active warps as the nVidia or AMD high end, and with Dynamic caching they can sustain even more active warps. This gives them a larger pool of possible warps they can schedule from, meaning that in any cycle there is the potential to find and schedule two instructions, from two different warps, that will execute on different execution units. The advantage of this, compared to nVidia's solutions is that because the two warps are different no dependency checking is required to ensure that register dependencies are satisfied, and relative to AMD's solution it doesn't require compiler changes. The "disadvantage" is that it won't show up in many benchmarks (which tend to do the same thing in every warp); it requires real world code executing a wide variety of warps simultaneously to really shine.

So we see in the new GPUs at the very least
- ray tracing (for the crowd that wants that)
- mesh shading (so more flexible ways of geometry generation/modification)
- dynamic caching
- 2x superscalar execution (when a mix of varied warps is present on the core)

It's notable that, while these are all good and important, to match nVidia (not just in performance, but as the tool of choice for anyone doing leading-edge GPU work) Apple needs to add at least two more large and difficult classes of functionality
- per-lane progress (as added by nV in Volta; allows for a whole range of much more sophisticated synchronization schemes and thus algorithms)
- much larger co-operating groups than just threadgroups. (I think nVidia added this in Ampere? Apple has enough machinery in place for the first part of this, namely a large shared Scratchpad address space extending across multiple cores. But they need all the other parts including cross-core synchronization.)

I am sure these are very much on Apple's radar; but they are not trivial!
People (who understood the implications) were right to be shocked when nVidia introduced them; just like people (who understand the implications) should be shocked at the actual implementation and consequences of Dynamic Caching.
 

l0stl0rd

macrumors 6502
Jul 25, 2009
483
416
Yes, actually it is :)
I watched this video yesterday and linked it at the Redshift forums.
I think a full 40-GPU-core M3 Max could achieve a render time under 2 minutes; my 48-GPU-core M1 Ultra gets 4:26.09, and a 40-core M3 Max is roughly 2.5 times faster than my Ultra.
I got an RTX 4070 result from my PC with the Blender 4.0.0 Release Candidate: OptiX 0:40.80 / CUDA 1:59.10.
These results suggest the upcoming M3 Ultra could maybe catch the 4080, or might only reach 4070 level.
Anyway, the 40-GPU-core M3 Max beats the 4070's CUDA score right now.
Ooh, you can bet it will, and by a lot.

The M3 Pro can render it in 2:15.

So the M3 Max should be about 1:07-ish.
Perhaps even 1 min in a 16” with perfect scaling.

Depending on how the Ultra scales, 35-40 sec?
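For what it's worth, the perfect-scaling arithmetic behind those guesses (real-world scaling is never quite 2x):

```python
# Perfect-scaling sketch based on the 2:15 M3 Pro time quoted above.
m3_pro = 2 * 60 + 15      # 2:15 on the M3 Pro, in seconds
m3_max = m3_pro / 2       # roughly double the GPU cores -> ~67 s, i.e. about 1:07
ultra = m3_max / 2        # a hypothetical Ultra doubling again -> ~34 s
print(m3_max, ultra)      # 67.5, 33.75 -> close to the 35-40 s guess once losses are factored in
```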
 