3D Rendering on Apple Silicon, CPU&GPU

crazy dave · Feb 5, 2025

OptimusGrime said:
So… I’m confused that Apple, without any fp64 circuitry, gets the same fp64 performance ratio as Nvidia consumer cards. Is Nvidia slowing down its fp64, is Nvidia lacking fp64 hardware and I misunderstood, or…?

I’d be interested to know why we wouldn’t just use the AMX fp64 facility, which is faster than most Nvidia cards, at least on the M4 Max?

On consumer Nvidia hardware only a few of the subcores have hardware FP64 enabled, so throughput is about 1:64. To be technical: each SM would have 1 hardware FP64 unit for every 64 FP32 and 64 FP32/INT32 units, which are split amongst two "Streaming Multiprocessor Sub-Partitions" (SMSPs), so only one of the two halves would have an FP64 unit. These days, for the last few generations at least*, professional hardware with lots of FP64 throughput uses a different die with tweaked core designs. Whether Nvidia does anything beyond that, I don't know.

*Basically since Pascal/Volta, then Ampere consumer vs. Ampere professional, Ada vs. Hopper, and now Blackwell 2.0 vs. Blackwell (the professional variant). The professional chips with lots of FP64 (there are other pro chips, but I mean specifically the FP64-heavy ones) use the xx100 dies, while all other professional and consumer chips use xx102 or higher. Before that, I'm not sure Nvidia had any designs better than 1:32 FP64 throughput at all; Turing was the last time a consumer card had even that much (apart from the occasional Titan, like the Titan V, which had 1:2).

For Macs, because of the UMA, using the AMX in addition to the GPU may indeed be possible. It would depend on the application: how separate the FP64 calculations are from the rest of the calculations, in other words the latency cost of communicating through main memory versus local cache, weighed against FP64 throughput on the AMX versus the GPU.
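To make that trade-off concrete, here is a toy cost model, a sketch only: the function names are hypothetical, and the rates and latency in the usage lines are made-up placeholders, not measurements of any Apple chip. It treats the GPU-to-AMX hand-off as a fixed synchronization cost and compares it with the time saved by a higher FP64 rate.

```python
# Toy cost model (hypothetical): is it worth sending an FP64-heavy chunk of
# work to the AMX instead of running it at the GPU's (low or emulated) FP64 rate?
# All rates and latencies are placeholders you would have to measure yourself.


def gpu_time_s(n_fp64_ops, gpu_fp64_flops):
    """Time to run the FP64 work on the GPU."""
    return n_fp64_ops / gpu_fp64_flops


def amx_time_s(n_fp64_ops, amx_fp64_flops, sync_latency_s):
    """Time on the AMX plus a fixed hand-off cost.

    Unified memory avoids a copy, but this models an assumed cost for GPU/CPU
    synchronization and for data coming from main memory rather than cache.
    """
    return sync_latency_s + n_fp64_ops / amx_fp64_flops


def offload_pays_off(n_fp64_ops, gpu_fp64_flops, amx_fp64_flops, sync_latency_s):
    """True if the AMX path (including hand-off) beats staying on the GPU."""
    return amx_time_s(n_fp64_ops, amx_fp64_flops, sync_latency_s) < gpu_time_s(
        n_fp64_ops, gpu_fp64_flops
    )


def break_even_ops(gpu_fp64_flops, amx_fp64_flops, sync_latency_s):
    """Smallest amount of FP64 work for which offloading wins."""
    if amx_fp64_flops <= gpu_fp64_flops:
        return float("inf")  # a slower AMX never wins in this model
    return sync_latency_s / (1.0 / gpu_fp64_flops - 1.0 / amx_fp64_flops)


# Made-up numbers purely for illustration.
gpu, amx, latency = 1e11, 1e12, 1e-3
print(offload_pays_off(1e9, gpu, amx, latency))  # True: big chunk amortizes the hand-off
print(offload_pays_off(1e6, gpu, amx, latency))  # False: hand-off dominates
print(break_even_ops(gpu, amx, latency))
```

The point of the model is that the answer flips with chunk size: small, tightly interleaved FP64 work stays on the GPU, while large, separable FP64 work can justify the hand-off.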

OptimusGrime · Feb 5, 2025

I suppose my next question would be: if fp64 can be emulated without hardware, to a similar performance level as the hardware Nvidia currently sells, why doesn’t Nvidia just emulate it?

crazy dave · Feb 5, 2025

OptimusGrime said:
I suppose my next question would be: if fp64 can be emulated without hardware, to a similar performance level as the hardware Nvidia currently sells, why doesn’t Nvidia just emulate it?

Consistency of the software platform between consumer and professional hardware (a lot of people develop on consumer cards and deploy on professional ones). And, as I said, there is the caveat that emulation is a lot more code-intensive: throughput might be the same, but calling function pointers or having huge inline code has its own performance implications on the GPU.
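One common way to get more precision without FP64 hardware is "double-float" arithmetic, where a value is stored as an unevaluated sum of two FP32 numbers (hi, lo). The sketch below is an assumption about technique, not a description of what Apple, Nvidia, or any particular emulation library actually does; it is written in Python with every step rounded to FP32 to mimic GPU registers, and the helper names are hypothetical. It shows why emulation is code-intensive: one emulated addition costs 11 FP32 operations in this simple variant. Note that double-float gives roughly 48 bits of significand and only FP32's exponent range, so it is not a drop-in IEEE FP64, and on a real GPU the compiler must be kept from reassociating these operations (fast-math), or the error terms vanish.

```python
import math
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value.

    Each FP32 op below is computed in binary64 and rounded once to binary32;
    for +, - and * that matches native FP32 round-to-nearest results.
    """
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:  # struct refuses values that round to +/-inf
        return math.copysign(math.inf, x)


def two_sum(a, b):
    """Knuth's TwoSum: s + err == a + b exactly (6 FP32 ops)."""
    s = f32(a + b)
    bb = f32(s - a)
    err = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, err


def fast_two_sum(a, b):
    """Dekker's FastTwoSum, valid when |a| >= |b| (3 FP32 ops)."""
    s = f32(a + b)
    err = f32(b - f32(s - a))
    return s, err


def df_add(x, y):
    """Add two double-floats (hi, lo): 6 + 2 + 3 = 11 FP32 ops instead of 1.

    This is the simple variant; more accurate versions cost even more.
    """
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    return fast_two_sum(s, e)


x = (16777216.0, 0.0)  # 2**24: above this FP32 cannot represent every integer
one = (1.0, 0.0)
print(f32(x[0] + one[0]))  # 16777216.0: plain FP32 loses the +1
print(df_add(x, one))      # (16777216.0, 1.0), i.e. 2**24 + 1 held exactly
```

Multiplication needs an error-free product as well (via FMA, or a splitting trick without it), so emulated multiply is longer still, which is the kind of code bloat described above.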

leman · Feb 5, 2025

OptimusGrime said:
So… I’m confused that Apple, without any fp64 circuitry, gets the same fp64 performance ratio as Nvidia consumer cards. Is Nvidia slowing down its fp64, is Nvidia lacking fp64 hardware and I misunderstood, or…?

Nvidia has FP64 hardware, but its performance is only 1/64 of FP32. Basically, it’s there just for compatibility with legacy code.

OptimusGrime said:
I’d be interested to know why we wouldn’t just use the AMX fp64 facility, which is faster than most Nvidia cards, at least on the M4 Max?

Sure, you can, if your workload maps well to AMX/SME.

Regulus67 · Feb 6, 2025

leman said:
Nvidia has FP64 hardware, but its performance is only 1/64 of FP32. Basically, it’s there just for compatibility with legacy code.

The AMD Radeon Pro Vega II has 1:2 FP64, at 7.045 TFLOPS, and the Duo card has double that, in case it matters.
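For scale, a "1:N" FP64 ratio converts to a peak FP64 rate by simple division. A small sketch with hypothetical helper names; the only input taken from the thread is the 7.045 TFLOPS figure, and the 1:64 line is a what-if comparison at the same FP32 peak, not the spec of any real card.

```python
def fp64_tflops(fp32_tflops, fp64_ratio):
    """Peak FP64 rate for a 1:N FP64:FP32 ratio (N = 2, 32, 64, ...)."""
    return fp32_tflops / fp64_ratio


def fp32_tflops_from_fp64(fp64_tflops_value, fp64_ratio):
    """Inverse: the FP32 peak implied by an FP64 figure and its ratio."""
    return fp64_tflops_value * fp64_ratio


vega_ii_fp64 = 7.045                              # figure quoted in the thread
vega_ii_fp32 = fp32_tflops_from_fp64(vega_ii_fp64, 2)
print(vega_ii_fp32)                               # implied FP32 peak
print(2 * vega_ii_fp64)                           # Duo: two GPUs, so FP64 doubles
# What-if: the same FP32 peak at the 1:64 ratio quoted for consumer Nvidia parts.
print(fp64_tflops(vega_ii_fp32, 64))
```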

leman · Feb 6, 2025

Regulus67 said:
The AMD Radeon Pro Vega II has 1:2 FP64, at 7.045 TFLOPS, and the Duo card has double that, in case it matters.

AMD still offers a strong FP64 implementation in the professional Radeon Instinct series. Judging by the financial performance, there is not much interest.

Regulus67 · Feb 6, 2025

leman said:
AMD still offers a strong FP64 implementation in the professional Radeon Instinct series. Judging by the financial performance, there is not much interest.

I thought so. From time to time I search for information on how it could be utilised, as I have one of these cards, but I have not seen any use cases that I could take advantage of. So I assume it's used mainly in academic research.

leman · Feb 6, 2025

Regulus67 said:
I thought so. From time to time I search for information on how it could be utilised, as I have one of these cards, but I have not seen any use cases that I could take advantage of. So I assume it's used mainly in academic research.

People like to quote academic research as a user of FP64 precision, but as a researcher myself I am confused by that. I am sure there are fields that need extended precision for their calculations. I am also quite sure that most of those people don't run their algorithms on your standard GPU.

crazy dave · Feb 6, 2025

Regulus67 said:
I thought so. From time to time I search for information on how it could be utilised, as I have one of these cards, but I have not seen any use cases that I could take advantage of. So I assume it's used mainly in academic research.

leman said:
People like to quote academic research as a user of FP64 precision, but as a researcher myself I am confused by that. I am sure there are fields that need extended precision for their calculations. I am also quite sure that most of those people don't run their algorithms on your standard GPU.

For myself, I use FP64 on occasion. Most of my calculations can get by with FP32, but there are times when I am doing accumulations or calculations where, at the very least, the extended range of FP64 is necessary, and possibly the precision as well. I tried using FP32 everywhere, but it greatly constrained my ability to run certain parameters. One of the selling points of my research, while I was still doing it, was indeed to "democratize" the computations needed for the intense simulations: move them off the big iron and onto people's desktops. So I try to make my algorithm as performant as possible on standard GPUs. Having said that, my FP64 usage is low enough that, as far as I can tell, it doesn't really drag down performance. I might get a little better performance, and other models/algorithms might be different if I were ever to pursue them, but I admit that even in my case a lack of FP64 compute isn't hamstringing me so far.

EDIT: I love that I talk about this as though I still actually get to work on this ... ah well ... maybe one day
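To illustrate the accumulation problem described above, the sketch below emulates FP32 by rounding every partial sum to binary32. The inputs are made-up: from 2^24 upward FP32 cannot represent every integer, so repeatedly adding 1.0 is lost entirely, while FP64 keeps it. Kahan (compensated) summation is shown as one known FP32-only mitigation, not as what was used in the research above; the function names are hypothetical. The last line shows the range issue: a product FP64 holds easily overflows FP32.

```python
import math
import struct


def f32(x):
    """Round to the nearest binary32 value (emulates an FP32 register)."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:  # value rounds past FP32's max, IEEE gives +/-inf
        return math.copysign(math.inf, x)


def sum_fp32(values, start=0.0):
    acc = f32(start)
    for v in values:
        acc = f32(acc + v)  # every partial sum rounded to FP32
    return acc


def sum_fp64(values, start=0.0):
    acc = start
    for v in values:
        acc += v  # Python floats are binary64
    return acc


def sum_kahan_fp32(values, start=0.0):
    """Kahan (compensated) summation using only FP32 operations."""
    acc, comp = f32(start), 0.0
    for v in values:
        y = f32(v - comp)             # re-inject what was lost last time
        t = f32(acc + y)
        comp = f32(f32(t - acc) - y)  # (t - acc) was actually added; comp is the shortfall
        acc = t
    return acc


start, ones = 2.0 ** 24, [1.0] * 10     # 2**24 = 16777216
print(sum_fp32(ones, start))            # 16777216.0: every +1 rounds away
print(sum_fp64(ones, start))            # 16777226.0
print(sum_kahan_fp32(ones, start))      # 16777226.0
print(f32(f32(1e20) * f32(1e20)))       # inf: beyond FP32's ~3.4e38 range
```

Compensation fixes the lost low-order bits here, but it does nothing for range: values that overflow FP32 still need FP64 (or rescaling).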

jujoje · Mar 5, 2025

Bit disappointed that the Mac Studio went to M3, but at least there's a pretty nice performance increase for Redshift (presumably down to hardware ray tracing).

Also, having nearly 512GB of memory for rendering would be nice.

I wonder how much faster the GPU is in general; I don't render that much, but I do use OpenCL a fair bit, so any speed-up there would be nice.

jujoje · Mar 5, 2025

"Redshift 2025.0.2 tested using a 29.2MB scene"

Really pushing the system there.

Xiao_Xi · Mar 30, 2025

This is a bit off-topic, but may be interesting to a few.

Analyzing Modern NVIDIA GPU cores

GPUs are the most popular platform for accelerating HPC workloads, such as artificial intelligence and science simulations. However, most microarchitectural research in academia relies on GPU core pipeline designs based on architectures that are more than 15 years old. This paper reverse...

arxiv.org

jujoje · Jun 6, 2025

Out of curiosity, does anyone use Marmoset Toolbag?

Just discovered that the latest version has a native app (and UDIM support!). Looks like it's getting close to Substance Painter in terms of features and has the benefit of not being Adobe and having a perpetual license.

Being in film, I'm vaguely familiar with Mari, Substance and, to a certain extent, 3D Coat, but I've never seen Marmoset used, so I was wondering how it compares these days.

singhs.apps · Jun 7, 2025

It’s good and getting more and more features over time.
Not a Painter killer yet, but it’s getting there.

jujoje · Jun 8, 2025

Nice; will def give it a shot once I have something I need to texture (didn't want to waste the trial). TBH my texturing needs are pretty simple most of the time, so I suspect its current functionality will more than cover my use case.

With a bit of luck between Toolbag for manual texturing and Houdini's COPs for procedural textures I can continue avoiding anything Adobe based.

singhs.apps · Jun 9, 2025

Yeah. The v5 release has some neat texturing features.
It also comes with tons of material libraries analogous to Substance materials, so it will be easier to mock up some textures.

Besides, its renderer is also a useful feature.

jujoje · Jul 25, 2025

Blender for iPad:

Beyond Mouse & Keyboard — Blender Developers Blog

Get ready for blending on-the-go.

code.blender.org

hovscorpion12 · Jul 25, 2025

Oh, we're definitely getting a supercharged iPad Pro with M5. Blender on iPad is a GAME CHANGER.

Blender's minimum is 8GB of RAM, but it recommends 32GB. Just to put that into perspective.

While the ultimate goal is a complete Blender experience, the initial focus will be on basic object manipulation and sculpting. These will be followed by Grease Pencil and storyboarding, which are slightly more advanced as they require animation tools.

Xiao_Xi · Jul 25, 2025

jujoje said:
Blender for iPad:

Blender developers haven't clarified how they will distribute Blender for iPad.

Blender and Tablets

(Sorry in advance if this is the wrong place for this thread) How would porting to iOS work? Blender is GPL-2 and I thought GPL is incompatible with iOS. See: https://opensource.stackexchange.com/questions/9500/is-apple-allowed-to-distribute-gplv3-licensed-software-through-its-ios-app-store

devtalk.blender.org
