I was asking myself what this is for; I suppose now we know. By the way, is this the first time that a SIMT GPU is doing genuinely superscalar execution, with two instructions from two programs decoded and dispatched per cycle? To my knowledge, until now only Nvidia did something like that, but they still dispatch only one SIMD instruction per cycle (and alternate between two SIMDs). I do wonder what this means for operand bandwidth and data-path contention; one needs a fairly wide bus to sustain two 32-wide SIMD pipelines simultaneously. With three-source FMAs, that's 2 x 32 lanes x 3 operands x 4 bytes = 768 bytes of operand reads per cycle, before you even account for banking conflicts...
The beauty of this is that it opens Apple up to Nvidia-style expansion of pipeline functionality. Right now they seem to have separate FP32, FP16, and INT pipelines. But say they make the FP pipelines more flexible so that both can do FP32 and FP16: then they can double peak FP throughput for only a minor increase in area (all the other machinery is already in place). And maybe three-way superscalar with three symmetrical pipelines will also be possible in the future. Looking forward to what they cook up! At any rate, there is a lot of potential here, and they seem to be designing their hardware in a way that will allow bigger advancements later.
Yes, I didn't want to get into this, but one reason going superscalar in this way is not completely trivial is register reads.
Apple, like everyone, uses a register cache to hold "active" registers and avoid the delay and energy cost of reading from SRAM each time, but Apple's scheme seems to be more aggressive than anyone else's: it can hold destination registers, not just source registers; it has more sophisticated compiler controls for marking certain registers as high priority for retention, or as no longer of interest; and an instruction can request that a register to be used a few cycles later be prefetched into the register cache.
So going superscalar means substantially increasing the bandwidth of that cache, which is surely not trivial (at least not if you care about limiting energy). And I think that's the most immediate constraint keeping them from going wider.
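To make that concrete, here is a toy C++ model of such an operand register cache: a small fully associative, LRU-managed window over the register-file SRAM, with stand-ins for the retention/discard and prefetch hints described above. The capacity, policy, and hint names are all made up for illustration; the real encodings on Apple's GPUs aren't public.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <list>
    #include <unordered_map>

    class RegCache {
    public:
        explicit RegCache(size_t entries) : entries_(entries) {}

        // Source-operand read: either hit in the cache, or model a
        // (slower, more energy-hungry) demand read from SRAM.
        bool read(uint16_t reg) {
            if (touch(reg)) return true;
            ++sram_reads_;
            fill(reg);
            return false;
        }

        // Destination write: results are cached too (not just sources),
        // so a freshly produced value can be consumed without touching SRAM.
        void write(uint16_t reg) { fill(reg); }

        // Compiler hint: "prefetch reg, it's needed a few cycles out".
        void prefetch(uint16_t reg) { fill(reg); }

        // Compiler hint: "reg is no longer of interest" -- free the slot early.
        void done(uint16_t reg) {
            auto it = map_.find(reg);
            if (it != map_.end()) { lru_.erase(it->second); map_.erase(it); }
        }

        size_t sram_reads() const { return sram_reads_; }

    private:
        bool touch(uint16_t reg) {          // refresh LRU position on a hit
            auto it = map_.find(reg);
            if (it == map_.end()) return false;
            lru_.splice(lru_.begin(), lru_, it->second);
            return true;
        }
        void fill(uint16_t reg) {
            if (touch(reg)) return;
            if (lru_.size() == entries_) {  // evict the least-recently-used
                map_.erase(lru_.back());
                lru_.pop_back();
            }
            lru_.push_front(reg);
            map_[reg] = lru_.begin();
        }

        size_t entries_;
        std::list<uint16_t> lru_;           // front = most recently used
        std::unordered_map<uint16_t, std::list<uint16_t>::iterator> map_;
        size_t sram_reads_ = 0;
    };

    int main() {
        RegCache rc(4);     // capacity is made up
        rc.write(0);        // result lands in the cache
        rc.read(0);         // hit: no SRAM traffic
        rc.prefetch(7);     // hint: r7 needed soon
        rc.read(7);         // hit thanks to the prefetch
        rc.read(5);         // cold miss: one demand SRAM read
        rc.done(0);         // hint: r0 is dead, release its slot
        std::printf("demand SRAM reads: %zu\n", rc.sram_reads());  // prints 1
    }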
You could imagine other designs, like a kind of Bulldozer arrangement where we have, say, two register caches (one associated with even-numbered warps, one with odd-numbered warps) above and below a common pool of execution pipelines, so that, between the two register caches, say three instructions per cycle are dispatched into the pool of execution units; a toy sketch of the idea follows.
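A throwaway sketch of the dispatch pattern I mean, with made-up widths (two operand-read ports per register cache, three shared issue slots); this is a thought experiment, not a model of any real hardware:

    #include <array>
    #include <cstdio>
    #include <queue>

    int main() {
        // half[0] owns even-numbered warps, half[1] owns odd-numbered ones;
        // each half has its own register cache feeding a shared pipe pool.
        std::array<std::queue<int>, 2> half;
        for (int w = 0; w < 8; ++w) half[w % 2].push(w);

        const int kSharedSlots = 3;    // total dispatch into the shared pool
        const int kPortsPerHalf = 2;   // operand reads each cache can feed

        for (int cycle = 0; cycle < 3; ++cycle) {
            std::array<int, 2> ports = {kPortsPerHalf, kPortsPerHalf};
            int issued = 0;
            // Alternate which half picks first, so neither starves.
            for (int pick = 0; issued < kSharedSlots; ++pick) {
                bool stuck0 = half[0].empty() || ports[0] == 0;
                bool stuck1 = half[1].empty() || ports[1] == 0;
                if (stuck0 && stuck1) break;   // nothing left to issue
                int h = (cycle + pick) % 2;
                if (half[h].empty() || ports[h] == 0) continue;
                int w = half[h].front();
                half[h].pop();
                half[h].push(w);               // warp stays resident
                --ports[h];
                ++issued;
                std::printf("cycle %d: slot %d <- warp %d (half %d)\n",
                            cycle, issued, w, h);
            }
        }
    }

Note how the per-cache port budgets (2 + 2) exceed the shared issue width (3): either half can burst, but sustained dispatch is limited by the common pool.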
Apple seem to believe that, at least for now (maybe this will diverge between future M and A designs?), it's worth having separate (and separately optimized) FP16 and FP32 pipelines.
What they could do at the cost of very little area is add a uniform pipeline to handle values (think e.g. a loop counter) that are common across the entire warp; see the sketch below. Metal as a language has the concept of uniform registers, but it is unclear to me whether they are actually implemented that way in hardware (the way AMD has done for many years, and Nvidia added a few years ago). Do you know? I've found zero reference to uniforms in hardware in all my explorations.
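To illustrate what a uniform pipeline would exploit, here's a toy SIMT loop in plain C++ standing in for a shader (the 4-lane warp width is arbitrary). The loop bound and counter are the same for every lane, which is the property Metal's uniform concept expresses at the language level; hardware with a uniform pipe could keep one scalar copy and do one compare/increment per warp instead of 32.

    #include <array>
    #include <cstdio>

    constexpr int kLanes = 4;   // toy warp width, arbitrary

    int main() {
        std::array<float, kLanes> x{1, 2, 3, 4};  // divergent: per-lane values
        std::array<float, kLanes> acc{};          // divergent: per-lane state

        const int n = 8;              // uniform: same for the whole warp
        for (int i = 0; i < n; ++i) { // i is uniform too: the increment and
                                      // the branch could run once per warp,
                                      // on a scalar pipe, not once per lane
            for (int lane = 0; lane < kLanes; ++lane)
                acc[lane] += x[lane];              // the only per-lane work
        }
        std::printf("acc[0] = %f\n", acc[0]);      // 8.0
    }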
The other obvious next improvement could be to add a full FP64 pipeline (perhaps at low density, so each warp has to loop through it four times or so), which would start making them interesting for various STEM use cases. Another way to handle this could be, on the M designs, to add an additional FP32 pipeline AND some helper instructions. Right now you can fake various FP64-like formats (not exactly IEEE FP64, but close enough) in FP32, but it costs a lot of instructions, something like 18 or so per operation; see the sketch below. Perhaps, with just minor tweaks and new instructions for the FP32 unit, you could define something like a GF48 or GF64 format (compare with, say, BF16) that's lightweight to implement on a GPU but good enough for most of the cases where people want doubles on a GPU?
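For reference, this is roughly what such emulation looks like, written as plain C++ (the error-free transformations map directly onto an FMA-capable FP32 pipe). It's textbook "float-float"/double-float arithmetic in the Dekker/Knuth style, not anything Apple-specific, and the op counts are approximate:

    #include <cmath>
    #include <cstdio>

    // "float-float": an unevaluated sum hi + lo with |lo| <= 0.5 ulp(hi),
    // giving ~49 significand bits -- roughly the "GF48" idea above.
    struct ff { float hi, lo; };

    // Error-free sum (Knuth two-sum): s + e == a + b exactly. 6 FP32 ops.
    static ff two_sum(float a, float b) {
        float s  = a + b;
        float bb = s - a;
        float e  = (a - (s - bb)) + (b - bb);
        return {s, e};
    }

    // Error-free product via FMA: p + e == a * b exactly. 2 FP32 ops.
    static ff two_prod(float a, float b) {
        float p = a * b;
        float e = std::fmaf(a, b, -p);   // exact residual, thanks to FMA
        return {p, e};
    }

    // float-float add, fast variant: ~11 FP32 ops. A fully accurate version
    // (two two_sums) lands around 20, in line with the estimate above.
    static ff ff_add(ff x, ff y) {
        ff s = two_sum(x.hi, y.hi);
        float lo = s.lo + (x.lo + y.lo);
        float hi = s.hi + lo;
        lo -= hi - s.hi;                 // quick renormalization
        return {hi, lo};
    }

    // float-float multiply: ~7 FP32 ops, all FMA-friendly.
    static ff ff_mul(ff x, ff y) {
        ff p = two_prod(x.hi, y.hi);
        p.lo = std::fmaf(x.hi, y.lo, std::fmaf(x.lo, y.hi, p.lo));
        float hi = p.hi + p.lo;
        float lo = p.lo - (hi - p.hi);   // quick renormalization
        return {hi, lo};
    }

    int main() {
        // Build a float-float approximation of 1/3 (lo captures what the
        // float hi misses), then square it.
        float hi = 1.0f / 3.0f;
        float lo = std::fmaf(-3.0f, hi, 1.0f) / 3.0f;  // residual correction
        ff third = {hi, lo};
        ff ninth = ff_mul(third, third);
        std::printf("1/9 ~ %.12f\n", (double)ninth.hi + (double)ninth.lo);
    }

The appeal of dedicated helper instructions is exactly that each two_sum/two_prod above collapses from a handful of dependent FP32 ops into one or two, which is where most of the 18-or-so instruction cost goes.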