
aytan

macrumors regular
Dec 20, 2022
161
110
Ooh you can bet it will and by a lot.

M3 Pro can render it in 2:15

So M3 Max should be about 1:07 ish.
Perhaps even 1 min in a 16” with perfect scaling.

Depending on how the Ultra scales 35-40 sec?
We did not see perfect scaling on the M1 / M2 Ultra, so I hope for 40 sec on the M3 Ultra :) Even with that, an M3 Studio Max and M3 Studio Ultra would be quite an impressive lineup for Apple, and yes, Apple can compete against others with that lineup.
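
Rough sketch of the math behind these guesses (Python; the scaling efficiencies are pure assumptions on my part, not measurements):

Code:
def scaled_time(base_seconds, base_cores, target_cores, efficiency=1.0):
    """Estimate render time, assuming time scales inversely with
    core count times a scaling-efficiency fudge factor."""
    return base_seconds / ((target_cores / base_cores) * efficiency)

m3_pro = 2 * 60 + 15                      # 2:15 on an 18-core M3 Pro GPU
print(scaled_time(m3_pro, 18, 40, 0.90))  # ~67 s: the "1:07 ish" M3 Max guess
print(scaled_time(m3_pro, 18, 80, 0.85))  # ~36 s: a hypothetical 80-core Ultra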
I am happy now that I did not upgrade my M1 Ultra to the M2 Ultra and instead waited for the M3.
I am excited for the 32-core CPU on the Ultra; it could be a perfect alternative to the Threadripper Pro 7000 series and will be cheaper compared to a workstation. Actually, this is what I am waiting for: an old-fashioned CPU rendering beast.
Maybe we will get some unexpected possibilities with the M3 Mac Pro. I feel like these products will arrive earlier than the usual upgrade cycle. Who knows?
 

leman

macrumors Core
Oct 14, 2008
19,520
19,670
Apple released a video today that basically confirmed the above analysis.

I think it's very cool how following the topic closely and reading the patents can offer a deeper understanding before official documentation is released. It doesn't always work, but it worked very well in this case. My recent hobby (inspired by your own example, I must add) has been paying off! :)

They provide a single pool of SRAM storage that is shared by registers, Scratchpad/Tile, and the L1D cache, with each type of use case dynamically "competing" against the others so that this SRAM is always in use by the most deserving clients, whether those clients mostly use registers, Scratchpad/Tile, or even L1D (which, remember, includes things like stack usage). Whatever does not fit in this SRAM pool will spill to L2, but will be tracked (much as an OS tracks paging activity) to ensure that most of the time the core is not thrashing (by putting some warps to sleep so that fewer compete for the pool of SRAM).

This is very impressive and, I must admit, has caught me a bit by surprise. In retrospect, it makes a lot of sense. Much respect to the Apple team for pulling it off. This is really advanced GPU tech that will enable new heights for their GPUs going forward. They must have been working on it for a very long time, at least the last 4-5 years.
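
As a toy model of the quoted mechanism (pool size, threshold, and policy are all invented by me; the real hardware is surely far more sophisticated), the "sleep warps until spill traffic is tolerable" idea looks roughly like this:

Code:
POOL_BYTES = 64 * 1024      # hypothetical per-core unified SRAM pool
SPILL_THRESHOLD = 0.25      # tolerate up to 25% of demand spilling to L2

def schedule(warp_demands):
    """warp_demands: per-warp on-chip byte demand (registers + tile + L1D).
    Returns the indices of warps kept awake."""
    awake = list(range(len(warp_demands)))
    while awake:
        demand = sum(warp_demands[i] for i in awake)
        spilled = max(0, demand - POOL_BYTES)
        if demand == 0 or spilled / demand <= SPILL_THRESHOLD:
            return awake            # spill traffic to L2 is acceptable
        awake.pop()                 # thrashing: put the newest warp to sleep
    return awake

print(schedule([25_000] * 4))       # -> [0, 1, 2]: one warp gets parked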

The shader core, to simplify, has one execution unit that handles FP32, one that handles FP16, one that handles Integer, and one that handles complex computations (sqrt, that sort of thing). It seems a shame to only use one of these units every cycle. In a sense we want to allow superscalar execution (more than one instruction per cycle) but the details of doing this are tricky, in that we cannot afford the massive machinery used by the CPU to achieve this.

Watching the tech note video immediately made me think of this patent:


I was asking myself what this is for. I suppose now we know. By the way, is this the first time that a SIMT GPU is doing truly superscalar execution, with two instructions from two programs decoded and dispatched per cycle? To my knowledge, until now only Nvidia did something like that, but they still dispatch only one SIMD per cycle (alternating between two SIMDs). I do wonder what this means for operand bandwidth and data path contention; one needs a fairly wide bus to sustain two 32-wide SIMD pipelines simultaneously...

The beauty of this is that it opens Apple up to Nvidia-style expansion of pipeline functionality. Right now they seem to have separate FP32, FP16, and INT pipelines. But say they make the FP pipelines more flexible so that both can do FP32/16: then they can double the peak FP throughput for only a minor increase in area (all the other machinery is already in place). And maybe three-way superscalar execution with three symmetrical pipelines will also be possible in the future. Looking forward to what they will cook up! At any rate, there is a lot of potential here, and they seem to be designing things in a way that will allow bigger advancements later.
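
To make the idea concrete, here is a toy dual-issue picker (my own sketch, not Apple's actual scheduling logic): up to two instructions per cycle, from two different warps, into two different pipelines. Note how two instructions wanting the same pipeline cannot pair, which is exactly the constraint that symmetrical FP pipelines would relax.

Code:
from collections import deque

def dual_issue(warp_queues):
    """warp_queues: dict warp_id -> deque of pipeline names ('fp32', 'fp16',
    'int', 'sfu'). Issues at most two instructions per cycle, from distinct
    warps and into distinct pipelines."""
    issued, used_pipes = [], set()
    for wid, q in warp_queues.items():
        if not q or q[0] in used_pipes:
            continue
        used_pipes.add(q[0])
        issued.append((wid, q.popleft()))
        if len(issued) == 2:
            break
    return issued

queues = {0: deque(["fp32", "int"]), 1: deque(["fp32", "fp16"])}
print(dual_issue(queues))   # [(0, 'fp32')]: warp 1's fp32 conflicts, no pair
print(dual_issue(queues))   # [(0, 'int'), (1, 'fp32')]: a full dual-issue cycle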
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
I was asking myself what this is for. I suppose now we know. By the way, is this the first time that a SIMT GPU is doing truly superscalar execution, with two instructions from two programs decoded and dispatched per cycle? To my knowledge, until now only Nvidia did something like that, but they still dispatch only one SIMD per cycle (alternating between two SIMDs). I do wonder what this means for operand bandwidth and data path contention; one needs a fairly wide bus to sustain two 32-wide SIMD pipelines simultaneously...

The beauty of this is that it opens Apple up to Nvidia-style expansion of pipeline functionality. Right now they seem to have separate FP32, FP16, and INT pipelines. But say they make the FP pipelines more flexible so that both can do FP32/16: then they can double the peak FP throughput for only a minor increase in area (all the other machinery is already in place). And maybe three-way superscalar execution with three symmetrical pipelines will also be possible in the future. Looking forward to what they will cook up! At any rate, there is a lot of potential here, and they seem to be designing things in a way that will allow bigger advancements later.
Yes, I didn't want to get into this, but one reason that going superscalar in this way is not completely trivial is reading registers.
Apple, like everyone, uses a register cache to hold "active" registers and avoid the delay and energy cost of reading from SRAM each time, but Apple's scheme seems to be more aggressive than anyone else's (it can hold destination registers, not just source registers; there are more sophisticated compiler controls for suggesting that certain registers are high priority for retention, vs. no longer of interest; and an instruction can indicate the prefetching into the register cache of a register to be used a few cycles later).
So going superscalar means substantially increasing the bandwidth of that cache, which is surely not trivial (at least not if you care about limiting energy). And I think that's the most immediate constraint on going wider.
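
As a toy illustration of such a cache (policies and names invented by me, nothing like the real design), with pin/discard/prefetch hints:

Code:
from collections import OrderedDict

class RegCache:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.lines = OrderedDict()              # reg -> pinned flag, LRU order

    def _evict_if_full(self):
        if len(self.lines) < self.capacity:
            return
        victim = next((r for r, p in self.lines.items() if not p), None)
        if victim is None:
            self.lines.popitem(last=False)      # all pinned: drop LRU anyway
        else:
            del self.lines[victim]              # prefer evicting unpinned lines

    def touch(self, reg, pin=False, last_use=False):
        if reg in self.lines:
            self.lines.move_to_end(reg)         # hit: refresh LRU position
        else:
            self._evict_if_full()               # miss: fill from the SRAM file
            self.lines[reg] = False
        if pin:
            self.lines[reg] = True              # hint: high priority to retain
        if last_use:
            del self.lines[reg]                 # hint: dead value, free the line

    prefetch = touch                            # a prefetch is just an early touch

rc = RegCache(capacity=2)
rc.touch("r0", pin=True)
rc.touch("r1")
rc.touch("r2")                                  # evicts unpinned r1, keeps r0
print(list(rc.lines))                           # ['r0', 'r2']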

You could imagine other designs, like a kinda Bulldozer design, where we have, say, two register caches (one associated with even-numbered warps, one with odd-numbered warps) above and below a common pool of execution pipelines, so that, between the two register caches, say three instructions per cycle are dispatched into the pool of execution units...

Apple seem to believe that, at least for now (maybe this will split with future M vs A designs?), it's worth having separate (and separately optimized) FP16 and FP32 pipelines.
What they could do at the cost of very little area is add an additional uniform pipeline to handle values (think e.g. a loop counter) that are common across the entire warp. Metal as a language has the concept of uniform registers, but it is unclear to me if they are actually implemented that way in hardware (the way AMD has done for many years, and nV added a few years ago). Do you know? I've found zero reference to uniforms in hardware in all my explorations.
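
The win is easy to see with a toy operation count (illustrative only; 32-wide warp assumed):

Code:
WARP = 32

def loop_ops_per_lane_counters(n):
    """Every lane maintains its own copy of the loop counter."""
    ops, counters = 0, [0] * WARP
    while counters[0] < n:
        counters = [c + 1 for c in counters]    # 32 INT ops per iteration
        ops += WARP
    return ops

def loop_ops_uniform_counter(n):
    """One scalar counter in a uniform register: the '33rd lane'."""
    ops, counter = 0, 0
    while counter < n:
        counter, ops = counter + 1, ops + 1     # 1 op per iteration
    return ops

print(loop_ops_per_lane_counters(100), loop_ops_uniform_counter(100))  # 3200 100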

The other obvious next improvement could be to add a full FP64 pipeline (perhaps at low density, so each warp has to loop through it four times or so), which would allow them to start being interesting to various STEM use cases. Another way to handle this could be, on the M designs, to add an additional FP32 pipeline AND some helper instructions. Right now you can fake various FP64-like formats (not exactly IEEE FP64, but close enough) in FP32, but it costs a lot of instructions, say 18 or so per operation. Perhaps, with just minor tweaks and new instructions for the FP32 unit, you could define something like a GF48 or GF64 format (compare with, say, BF16) that's lightweight to implement on a GPU, but good enough for most cases of interest for using doubles on a GPU?
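
For reference, this is what FP32-pair ("float-float") emulation looks like: the standard TwoSum building block, not anything Apple-specific, and it shows where the op count balloons. This simplified add already costs around 14 FP32 operations; a full multiply lands in the ballpark of the ~18 quoted above.

Code:
import numpy as np
f32 = np.float32

def two_sum(a, b):
    """Error-free FP32 addition: returns (s, e) with s + e == a + b exactly."""
    s = f32(a + b)
    bp = f32(s - a)
    e = f32(f32(a - f32(s - bp)) + f32(b - bp))
    return s, e

def ff_add(x, y):
    """Add two (hi, lo) FP32 pairs, giving roughly double FP32 precision."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    return two_sum(s, e)                # renormalize the result

a = (f32(1.0), f32(1e-9))               # 1.0 + 1e-9: invisible to plain FP32
b = (f32(1.0), f32(2e-9))
hi, lo = ff_add(a, b)
print(float(hi) + float(lo))            # ~2.000000003; plain FP32 gives 2.0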
 

leman

macrumors Core
Oct 14, 2008
19,520
19,670
Yes, I didn't want to get into this, but one reason that going superscalar in this way is not completely trivial is reading registers.
Apple, like everyone, uses a register cache to hold "active" registers and avoid the delay and energy cost of reading from SRAM each time, but Apple's scheme seems to be more aggressive than anyone else's (it can hold destination registers, not just source registers; there are more sophisticated compiler controls for suggesting that certain registers are high priority for retention, vs. no longer of interest; and an instruction can indicate the prefetching into the register cache of a register to be used a few cycles later).
So going superscalar means substantially increasing the bandwidth of that cache, which is surely not trivial (at least not if you care about limiting energy). And I think that's the most immediate constraint on going wider.

That is very true. But it does seem like they are moving in that direction. If I remember correctly, Philip Turner did remark that Apple Silicon seems to have more register file bandwidth than other designs (which is odd for a GPU with mobile roots). In retrospect, it all makes sense if they were preparing for dual issue. I think there is a lot of meticulous high-level planning behind all this, and really impressive execution over many years. What’s the next step? I don’t know, but I would be surprised if Apple didn’t have a plan.

Apple seem to believe that, at least for now (maybe this will split with future M vs A designs?), it's worth having separate (and separately optimized) FP16 and FP32 pipelines.
What they could do at the cost of very little area is add an additional uniform pipeline to handle values (think e.g. a loop counter) that are common across the entire warp. Metal as a language has the concept of uniform registers, but it is unclear to me if they are actually implemented that way in hardware (the way AMD has done for many years, and nV added a few years ago). Do you know? I've found zero reference to uniforms in hardware in all my explorations.

Dougall Johnson and the Asahi team did quite impressive work reverse engineering G13, so we do know quite a bit about these things. There is a set of registers called uniform registers, shared by all threads in the SIMD group - these are often used to store pointers or constants. Maybe they are used for loops already, I’m not sure. G13 uses a very interesting method to handle conditional execution - one register per thread is dedicated to tracking how many inactive branches deep a thread is (it’s all described in the docs linked below).
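
A toy reconstruction of that counter scheme (simplified from my reading of the docs; the real ISA encodes this differently):

Code:
def branch_if(counters, conditions):
    """Threads failing the condition (or already inactive) sink one level."""
    return [0 if (c == 0 and cond) else c + 1
            for c, cond in zip(counters, conditions)]

def branch_else(counters):
    """Active threads (0) become inactive (1); level-1 threads wake up."""
    return [1 if c == 0 else (0 if c == 1 else c) for c in counters]

def branch_pop(counters):
    """Leave the if/else: everyone one level shallower, floored at zero."""
    return [max(0, c - 1) for c in counters]

active = [0, 0, 0, 0]                        # a 4-thread "warp", all running
active = branch_if(active, [True, False, True, False])
print(active)                                # [0, 1, 0, 1]: threads 1, 3 sleep
active = branch_else(active)
print(active)                                # [1, 0, 1, 0]: roles flip
print(branch_pop(active))                    # [0, 0, 0, 0]: reconverged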


The other obvious next improvement could be to add a full FP64 pipeline (perhaps at low density, so each warp has to loop through it four times or so), which would allow them to start being interesting to various STEM use cases. Another way to handle this could be, on the M designs, to add an additional FP32 pipeline AND some helper instructions. Right now you can fake various FP64-like formats (not exactly IEEE FP64, but close enough) in FP32, but it costs a lot of instructions, say 18 or so per operation. Perhaps, with just minor tweaks and new instructions for the FP32 unit, you could define something like a GF48 or GF64 format (compare with, say, BF16) that's lightweight to implement on a GPU, but good enough for most cases of interest for using doubles on a GPU?

Personally, I’d prefer it if they added some instructions that could be used to accelerate FP64 rather than adding a dedicated pipeline. Maybe some fancy 64-bit shift for aligning mantissas. If they could bring FP64 emulation down to 8 instructions, it would already be very good.
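
To show why one fused shift helps: a software FP64 add spends a chunk of its budget aligning mantissas, and a single 64-bit shift collapses several 32-bit steps. Both functions below are hypothetical, made up purely for illustration:

Code:
def align64(m_hi, m_lo, exp_diff):
    """Hypothetical fused op: shift the 64-bit mantissa (m_hi:m_lo) right
    by exp_diff in a single instruction."""
    return ((m_hi << 32) | m_lo) >> exp_diff

def align32(m_hi, m_lo, exp_diff):
    """The same alignment out of 32-bit pieces: shifts, ors, and (on real
    hardware) extra compares and carry handling."""
    if exp_diff >= 32:
        return m_hi >> (exp_diff - 32)
    return (m_hi << (32 - exp_diff)) | (m_lo >> exp_diff)

print(align64(0x00100000, 0xDEADBEEF, 7) ==
      align32(0x00100000, 0xDEADBEEF, 7))   # True, but align32 costs more ops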
 
  • Like
Reactions: altaic

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
That is very true. But it does seem like they are moving in that direction. If I remember correctly, Philip Turner did remark that Apple Silicon seems to have more register file bandwidth than other designs (which is odd for a GPU with mobile roots). In retrospect, it all makes sense if they were preparing for dual issue. I think there is a lot of meticulous high-level planning behind all this, and really impressive execution over many years. What’s the next step? I don’t know, but I would be surprised if Apple didn’t have a plan.



Dougall Johnson and the Asahi team did quite impressive work reverse engineering G13, so we do know quite a bit about these things. There is a set of registers called uniform registers, shared by all threads in the SIMD group - these are often used to store pointers or constants. Maybe they are used for loops already, I’m not sure. G13 uses a very interesting method to handle conditional execution - one register per thread is dedicated to tracking how many inactive branches deep a thread is (it’s all described in the docs linked below).




Personally, I’d prefer it if they added some instructions that could be used to accelerate FP64 rather than adding a dedicated pipeline. Maybe some fancy 64-bit shift for aligning mantissas. If they could bring FP64 emulation down to 8 instructions, it would already be very good.
Thanks for the G13 reference. I read that a while ago but forgot about it as a reference.
But you will notice that it only refers to uniform registers and the instructions to load/store them, with NO ALU instructions to modify them. I'm not sure what to make of that. Does that mean they don't exist? Or just that Dougall didn't bother looking for them when he was exploring the assembly encodings?

Obviously one solution is simply to provide for faster emulation of FP64 using FP32 hardware. But is there also value in making the emulation easier by defining a "simpler" version of FP64, or at least one that better matches FP32 hardware, like was done with BF16, or nVidia's TF19 (kinda sorta, but not really, a 32/16-bit tensor accumulator format)?

For example (just throwing this out there), you could imagine a W-process relaxation solver that moves from FP16 to FP32 to GF64, doing as much work as possible in lower precision, then, for the very last IEEE FP64 pass, moving the work to AMX. Get as much "FP64-like" performance "for free" on the GPU, but resort to AMX if you absolutely have to have 100% IEEE FP64 semantics?
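
The flavor of that scheme is classic mixed-precision iterative refinement. A minimal sketch (numpy standing in for both the GPU and AMX; FP32 plays the role of the low-precision stages, GF64 omitted):

Code:
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50)) + 50 * np.eye(50)   # well-conditioned system
b = rng.standard_normal(50)

# "GPU" stage: factor and solve entirely in FP32.
A32 = A.astype(np.float32)
x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)

# Refinement: FP64 residual (the "AMX" pass), corrections via the FP32 solve.
for _ in range(3):
    r = b - A @ x
    x += np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)

print(np.linalg.norm(b - A @ x))   # residual near FP64 machine epsilon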
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
@leman You may be interested to know that there is a new game engine in its infancy that focuses on unified memory architectures.
 

leman

macrumors Core
Oct 14, 2008
19,520
19,670
@leman You may be interested to know that there is a new game engine in its infancy that focuses on unified memory architectures.

Thanks, looks like an interesting little project!
 

leman

macrumors Core
Oct 14, 2008
19,520
19,670
Thanks for the G13 reference. I read that a while ago but forgot about it as a reference.
But you will notice that it only refers to uniform registers and the instructions to load/store them, with NO ALU instructions to modify them. I'm not sure what to make of that. Does that mean they don't exist? Or just that Dougall didn't bother looking for them when he was exploring the assembly encodings?

I just spent some time playing around with the Apple GPU assembler, and it was using SIMD integer operations for loops (though a loop counter limit may be stored in a uniform register).

Obviously one solution is simply to provide for faster emulation of FP64 using FP32 hardware. But is there also value in making the emulation easier by defining a "simpler" version of FP64, or at least one that better matches FP32 hardware, like was done with BF16, or nVidia's TF19 (kinda sorta, but not really, a 32/16-bit tensor accumulator format)?

For example (just throwing this out there), you could imagine a W-process relaxation solver that moves from FP16 to FP32 to GF64, doing as much work as possible in lower precision, then, for the very last IEEE FP64 pass, moving the work to AMX. Get as much "FP64-like" performance "for free" on the GPU, but resort to AMX if you absolutely have to have 100% IEEE FP64 semantics?

Could be interesting. But probably quite a low priority for Apple.
 

altaic

macrumors 6502a
Jan 26, 2004
711
484
@leman You may be interested to know that there is a new game engine in its infancy that focuses on unified memory architectures.
Ooh that’s cool!
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
I just spent some time playing around with the Apple GPU assembler, and it was using SIMD integer operations for loops (though a loop counter limit may be stored in a uniform register).
Thanks, this confirms what it looked like to me. No uniform EXECUTION...
One hopes that at some point this will be added; like I said, you could do it with about a 3% execution area addition (effectively adding a 33rd lane), and now that we have a form of superscalar execution, there's a performance win as well as just an energy win.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Blender 4.0 is now available and the Blender benchmark database now includes Blender 4.0 data.

Update: I'm not sure what went wrong, but Blender benchmark scores seem strange.

Device Name | Blender Version | Median Score | Number of Benchmarks
Apple M2 Max (GPU - 30 cores) | 3.6.0 | 1550.73 | 39
Apple M2 Max (GPU - 30 cores) | 4.0.0 | 1447 | 2
Apple M1 Max (GPU - 32 cores) | 3.6.0 | 1006.63 | 68
Apple M1 Max (GPU - 32 cores) | 4.0.0 | 932.14 | 2

Device Name | Blender Version | Median Score | Number of Benchmarks
NVIDIA GeForce RTX 4090 | 3.6.0 | 13095.21 | 1195
NVIDIA GeForce RTX 4090 | 4.0.0 | 11587.33 | 3
NVIDIA GeForce RTX 3090 | 3.6.0 | 6292.49 | 487
NVIDIA GeForce RTX 3090 | 4.0.0 | 5734.46 | 3

At least M3 Pro scores better in Blender 4.0 than in Blender 3.6, but the score jump is meaningless until Blender developers realise what's going on.

Device Name | Blender Version | Median Score | Number of Benchmarks
Apple M3 Pro (GPU - 18 cores) | 4.0.0 | 1510.37 | 5
Apple M3 Pro (GPU - 14 cores) | 4.0.0 | 1432.08 | 1
Apple M3 Pro (GPU - 18 cores) | 3.6.0 | 1314.46 | 5
Apple M3 Pro (GPU - 14 cores) | 3.6.0 | 1185.06 | 4


Can anyone confirm that Blender 4.0 renders faster than Blender 3.6 on M1/M2 hardware?
 

l0stl0rd

macrumors 6502
Jul 25, 2009
483
415
Blender 4.0 is now available and the Blender benchmark database now includes Blender 4.0 data.

Update: I'm not sure what went wrong, but Blender benchmark scores seem strange.

Device Name | Blender Version | Median Score | Number of Benchmarks
Apple M2 Max (GPU - 30 cores) | 3.6.0 | 1550.73 | 39
Apple M2 Max (GPU - 30 cores) | 4.0.0 | 1447 | 2
Apple M1 Max (GPU - 32 cores) | 3.6.0 | 1006.63 | 68
Apple M1 Max (GPU - 32 cores) | 4.0.0 | 932.14 | 2

Device Name | Blender Version | Median Score | Number of Benchmarks
NVIDIA GeForce RTX 4090 | 3.6.0 | 13095.21 | 1195
NVIDIA GeForce RTX 4090 | 4.0.0 | 11587.33 | 3
NVIDIA GeForce RTX 3090 | 3.6.0 | 6292.49 | 487
NVIDIA GeForce RTX 3090 | 4.0.0 | 5734.46 | 3

At least M3 Pro scores better in Blender 4.0 than in Blender 3.6, but the score jump is meaningless until Blender developers realise what's going on.

Device Name | Blender Version | Median Score | Number of Benchmarks
Apple M3 Pro (GPU - 18 cores) | 4.0.0 | 1514.01 | 3
Apple M3 Pro (GPU - 18 cores) | 3.6.0 | 1314.46 | 5
Apple M3 Pro (GPU - 14 cores) | 3.6.0 | 1185.06 | 4

Can anyone confirm that Blender 4.0 renders faster than Blender 3.6 on M1/M2 hardware?
Normal; even Nvidia cards are slower in 4.0, just have a look. It has nothing to do with the M1 or M2.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Normal; even Nvidia cards are slower in 4.0, just have a look. It has nothing to do with the M1 or M2.
I find it strange that neither Nvidia, Apple, nor the Blender developers check whether Cycles got faster or not.

If Blender's Cycles got slower in 4.0, as the benchmark scores suggest, Blender is not a good measure for inferring how much Apple's ray tracing hardware helps rendering.
 

altaic

macrumors 6502a
Jan 26, 2004
711
484
I find it strange that neither Nvidia, Apple, nor the Blender developers check whether Cycles got faster or not.

If Blender's Cycles got slower in 4.0, as the benchmark scores suggest, Blender is not a good measure for inferring how much Apple's ray tracing hardware helps rendering.
It looks like a pretty bad performance regression to me. Hopefully we’ll see a fix in 4.0.1.
 

leman

macrumors Core
Oct 14, 2008
19,520
19,670
I find it strange that neither Nvidia, Apple, nor the Blender developers check whether Cycles got faster or not.

If Blender's Cycles got slower in 4.0, as the benchmark scores suggest, Blender is not a good measure for inferring how much Apple's ray tracing hardware helps rendering.

Looking at the M3 Pro scores and trying to correct for the regression gives around a 30% improvement.

But it’s important to note that these things do not exist in isolation. Dynamic Caching is a crucial component in enabling good RT performance.
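
For the curious, the arithmetic behind my ~30% figure, using the RTX 4090 to estimate the 3.6 -> 4.0 regression (that choice of baseline is my assumption; using the M1/M2 numbers instead gives roughly 23-24%):

Code:
regression = 11587.33 / 13095.21      # RTX 4090: 4.0 / 3.6 median, ~0.88
observed = 1510.37 / 1314.46          # M3 Pro 18-core: 4.0 / 3.6 median, ~1.15
adjusted = observed / regression      # what the 4.0 score implies in 3.6 terms
print(f"{(adjusted - 1) * 100:.0f}%") # ~30% underlying improvement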
 

Basic75

macrumors 68020
May 17, 2011
2,101
2,447
Europe
  • Like
Reactions: aytan

thunng8

macrumors 65816
Feb 8, 2006
1,032
417
It looks like a pretty bad performance regression to me. Hopefully we’ll see a fix in 4.0.1.
There won’t be a fix in the near term. The performance regression was intentional, as they wanted to increase the rendering quality. Therefore you cannot compare 3.6 vs 4.0 scores.
 
  • Like
Reactions: jido

leman

macrumors Core
Oct 14, 2008
19,520
19,670
There won’t be a fix in the near term. The performance regression was intentional, as they wanted to increase the rendering quality. Therefore you cannot compare 3.6 vs 4.0 scores.

Do you happen to have a link to a relevant discussion or explanation from the Blender team? I'd be curious to learn more.
 
  • Like
Reactions: jido and altaic

altaic

macrumors 6502a
Jan 26, 2004
711
484
Do you happen to have a link to a relevant discussion or explanation from the Blender team? I'd be curious to learn more.
I spent a few hours earlier looking for exactly that, and I didn’t see anything in bug reports, devtalk, or benchmark commits. I’m not super familiar with their code base, but nothing jumped out at me WRT “rebalancing” or “increased rendering quality.” Looks like a regression to me; this release touched a lot of things, so that’s my current opinion.

@thunng8 Inquiring minds would like to know the basis of your claim. TIA 🙂
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Early benchmark scores put the 40-core M3 Max at similar performance to the RTX 4070 laptop.

Device Name | Blender Version | Median Score | Number of Benchmarks
NVIDIA GeForce RTX 4070 Laptop GPU | 4.0.0 | 3449.21 | 5
Apple M3 Max (GPU - 40 cores) | 4.0.0 | 3417.29 | 1
NVIDIA GeForce RTX 4060 Laptop GPU | 4.0.0 | 3245.13 | 3
NVIDIA GeForce RTX 3070 Laptop GPU | 4.0.0 | 3102.84 | 2
Apple M3 Max (GPU - 30 cores) | 4.0.0 | 2942.11 | 1
 

l0stl0rd

macrumors 6502
Jul 25, 2009
483
415
I find it strange that neither Nvidia, Apple, nor the Blender developers check whether Cycles got faster or not.

If Blender's Cycles got slower in 4.0, as the benchmark scores suggest, Blender is not a good measure for inferring how much Apple's ray tracing hardware helps rendering.
True, there is some strange stuff going on with Blender on Mac anyway. Why does the same scene use more memory on a Mac than Windows RAM and VRAM combined?

8 GB on a Mac is like 16 GB on Windows... don't make me laugh, Apple ;)
 

l0stl0rd

macrumors 6502
Jul 25, 2009
483
415
Some say Blender 4.0 might be slower because of the new material nodes, but I'm not sure.
The Principled BSDF shader, for example, has been updated to a new version.

Might be the conversion that is happening, as the benchmark scenes still use the old BSDF node 🤷‍♂️
 