So…I’m confused that Apple gets the same fp64 performance ratio as Nvidia consumer cards without dedicated fp64 circuitry. Is Nvidia slowing down its fp64, is Nvidia lacking fp64 hardware and I misunderstood, or…?

I’d be interested to know why we wouldn’t just use the AMX fp64 facility, which is faster than most Nvidia cards (on the M4 Max at least)?
On consumer Nvidia hardware only a few of the subcores have hardware FP64 enabled - throughput is about 1:64. To be technical: each SM would have one hardware FP64 unit for every 64 FP32 and 64 FP32/INT32 units, which are split between two "Streaming Multiprocessor Sub-Partitions" (SMSPs), so only one of the two sub-partitions would have an FP64 unit. These days, for the last few generations at least*, professional hardware with lots of FP64 throughput uses a different die with tweaked core designs. Whether Nvidia does anything beyond that is unknown to me.

*basically since Pascal/Volta: Ampere consumer vs Ampere professional, Ada vs Hopper, and now Blackwell 2.0 vs Blackwell (the professional variant). Professional chips with lots of FP64 (there are more pro chips, but specifically the ones with lots of FP64) use xx100 dies, while all other professional and consumer chips use xx102 dies or higher. Prior to that, I'm not sure Nvidia had any designs better than 1:32 FP64 throughput at all; Turing was the last time a consumer card had that much (beyond the occasional Titan, like the Titan V, which had 1:2).

For Macs, because of the UMA, using the AMX in addition to the GPU may indeed be possible. It would depend on the application - how separate the FP64 calculations are from the rest of the calculations; in other words, the latency cost of communicating through main memory vs local cache, weighed against FP64 throughput on the AMX vs the GPU.
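To make that tradeoff concrete, here is a toy cost model. Every name and number in it is hypothetical - it only shows the shape of the decision: offloading an FP64 chunk to a separate unit pays a round-trip latency through shared memory, so it wins only when the chunk is large enough to amortize that cost.

```python
# Toy cost model (all numbers hypothetical): is it worth shipping an FP64
# chunk from the GPU to a faster coprocessor through shared memory?
def offload_wins(n_flops, coproc_flops_per_s, gpu_flops_per_s, roundtrip_s):
    t_gpu = n_flops / gpu_flops_per_s                    # keep it on the GPU
    t_coproc = n_flops / coproc_flops_per_s + roundtrip_s  # pay latency first
    return t_coproc < t_gpu

# A large, well-separated FP64 batch amortizes the round trip...
print(offload_wins(1e9, 2e12, 1e11, 1e-3))   # True
# ...a tiny one is dominated by it.
print(offload_wins(1e6, 2e12, 1e11, 1e-3))   # False
```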
 
On consumer Nvidia hardware only a few of the subcores have hardware FP64 enabled - throughput is about 1:64. To be technical: each SM would have one hardware FP64 unit for every 64 FP32 and 64 FP32/INT32 units, which are split between two "Streaming Multiprocessor Sub-Partitions" (SMSPs), so only one of the two sub-partitions would have an FP64 unit. These days, for the last few generations at least*, professional hardware with lots of FP64 throughput uses a different die with tweaked core designs. Whether Nvidia does anything beyond that is unknown to me.

*basically since Pascal/Volta: Ampere consumer vs Ampere professional, Ada vs Hopper, and now Blackwell 2.0 vs Blackwell (the professional variant). Professional chips with lots of FP64 (there are more pro chips, but specifically the ones with lots of FP64) use xx100 dies, while all other professional and consumer chips use xx102 dies or higher. Prior to that, I'm not sure Nvidia had any designs better than 1:32 FP64 throughput at all; Turing was the last time a consumer card had that much (beyond the occasional Titan, like the Titan V, which had 1:2).

For Macs, because of the UMA, using the AMX in addition to the GPU may indeed be possible. It would depend on the application - how separate the FP64 calculations are from the rest of the calculations; in other words, the latency cost of communicating through main memory vs local cache, weighed against FP64 throughput on the AMX vs the GPU.
I suppose my next question would be: if fp64 can be emulated without hardware, to a similar performance level as the hardware Nvidia currently sells, why don't Nvidia just emulate it?
 
I suppose my next question would be: if fp64 can be emulated without hardware, to a similar performance level as the hardware Nvidia currently sells, why don't Nvidia just emulate it?
Consistency in software platforms between consumer and professional hardware (a lot of people develop on consumer and deploy on professional), and, as I said, there is the caveat that the emulation is a lot more code intensive - throughput might be the same, but calling function pointers or having huge inline code has its own performance implications on the GPU.
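For a feel of that code intensity, here is a minimal sketch of the standard emulation trick: double-single ("float-float") arithmetic built from Dekker/Knuth error-free transformations, with NumPy float32 standing in for the GPU's FP32 units. Note how many extra FP32 operations a single add costs - that is the bloat referred to above.

```python
import math
import numpy as np

# Double-single arithmetic: represent one higher-precision value as an
# unevaluated sum of two float32s (hi + lo).

def split64(x):
    """Split a float64 into a (hi, lo) pair of float32s."""
    hi = np.float32(x)
    lo = np.float32(x - np.float64(hi))
    return hi, lo

def two_sum(a, b):
    """Knuth's error-free transformation: returns (s, e) with s + e == a + b."""
    s = a + b
    bv = s - a
    av = s - bv
    e = (a - av) + (b - bv)
    return s, e

def ds_add(x, y):
    """Add two double-single numbers - six-plus FP32 ops for one 'add'."""
    s, e = two_sum(x[0], y[0])
    e = e + x[1] + y[1]
    return two_sum(s, e)  # renormalize back to (hi, lo)

hi, lo = ds_add(split64(math.pi), split64(math.e))
emulated = float(hi) + float(lo)
naive = float(np.float32(math.pi) + np.float32(math.e))
print(abs(emulated - (math.pi + math.e)))  # far smaller error than
print(abs(naive - (math.pi + math.e)))     # the plain float32 sum
```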
 
So…I’m confused that Apple gets the same fp64 performance ratio as Nvidia consumer cards without dedicated fp64 circuitry. Is Nvidia slowing down its fp64, is Nvidia lacking fp64 hardware and I misunderstood, or…?

Nvidia has FP64 hardware, but the performance is only 1/64 of the FP32 rate. Basically, it’s there just for compatibility with legacy code.
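To put that ratio in absolute terms (the FP32 figure below is a ballpark number for a current high-end consumer card, used only to illustrate the arithmetic, not a spec-sheet value):

```python
# What a 1:64 FP64 ratio means in practice, vs the old 1:32 consumer tier.
fp32_tflops = 80.0                      # ballpark high-end consumer FP32
consumer_fp64 = fp32_tflops / 64        # legacy-compatibility tier
old_1_to_32_fp64 = fp32_tflops / 32     # what a 1:32 design would give
print(consumer_fp64, old_1_to_32_fp64)  # 1.25 2.5
```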


I’d be interested to know why we wouldn’t just use the AMX fp64 facility, which is faster than most Nvidia cards (on the M4 Max at least)?

Sure, you can, if your workload maps well to AMX/SME.
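For what it's worth, the simplest way to exercise that path from user code is a double-precision GEMM through a BLAS: on macOS, NumPy built against Apple's Accelerate framework routes this call to Accelerate, which is Apple's supported route to the matrix coprocessor (whether AMX/SME actually handles a given call is an internal detail of Accelerate). A sketch, runnable anywhere with NumPy:

```python
import numpy as np

# An FP64 workload that maps well to matrix units: plain double-precision
# matrix multiply. On an Accelerate-backed NumPy this dispatches to
# Accelerate's BLAS dgemm; elsewhere it's an ordinary BLAS call.
rng = np.random.default_rng(42)
a = rng.standard_normal((256, 256))   # float64 by default
b = rng.standard_normal((256, 256))
c = a @ b                             # BLAS dgemm under the hood

# Independently recompute one entry as a spot-check.
check = a[0] @ b[:, 0]
print(c.dtype, np.allclose(c[0, 0], check))
```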
 
Nvidia has FP64 hardware, but the performance is only 1/64 of the FP32 rate. Basically, it’s there just for compatibility with legacy code.
AMD Radeon Pro Vega II has 1:2 FP64, at 7.045 TFLOPS, and the Duo card doubles that - in case that matters?

[Screenshot: Radeon Pro Vega II spec listing]
 
AMD still offers a strong FP64 implementation in the professional Radeon Instinct series. Judging by the financial performance, there is not much interest.
I thought so. From time to time I search for information on how it could be utilised, but I have not seen any use cases that I could take advantage of, and I have one of these cards. So I assume it's used mainly in academic research.
 
I thought so. From time to time I search for information on how it could be utilised, but I have not seen any use cases that I could take advantage of, and I have one of these cards. So I assume it's used mainly in academic research.

People like to cite academic research as a user of FP64 precision, but as a researcher myself I am confused by that. I am sure there are fields that need extended precision for their calculations. I am also quite sure that most of those people don't run their algorithms on your standard GPU.
 
I thought so. From time to time I search for information on how it could be utilised, but I have not seen any use cases that I could take advantage of, and I have one of these cards. So I assume it's used mainly in academic research.

People like to cite academic research as a user of FP64 precision, but as a researcher myself I am confused by that. I am sure there are fields that need extended precision for their calculations. I am also quite sure that most of those people don't run their algorithms on your standard GPU.

For myself, I use FP64 on occasion: most of my calculations can get by on FP32, but there are times when I am doing accumulations or calculations where at the very least the extended range of FP64 is necessary, and possibly the precision as well. I tried using FP32 everywhere, but it greatly constrained my ability to run certain parameters. One of the selling points of my research, while I was still doing it, was indeed to "democratize" the computations needed for these intense simulations - to move them off big iron and onto people's desktops - so I try to make my algorithm as performant as possible on standard GPUs. Having said that, my FP64 usage is low enough that, as far as I can tell, it doesn't really drag down performance. I might get a little better, and other models/algorithms might be different if I were ever to pursue them, but I admit that even in my case a lack of FP64 compute isn't hamstringing me so far.
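The accumulation failure mode is easy to reproduce. A minimal illustration: sequentially accumulate 20 million ones in FP32 and the sum stalls, because once the running total reaches 2**24 each new 1.0 is only half an ulp and ties round to even; an FP64 accumulator handles it exactly.

```python
import numpy as np

n = 20_000_000
ones = np.ones(n, dtype=np.float32)

# cumsum accumulates sequentially in the requested dtype.
seq32 = np.cumsum(ones, dtype=np.float32)[-1]
seq64 = np.cumsum(ones, dtype=np.float64)[-1]
print(seq32)  # 16777216.0 -- stalled at 2**24
print(seq64)  # 20000000.0 -- exact
```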

EDIT: I love that I talk about this as though I actually still get to work on it ... ah well ... maybe one day
 
Bit disappointed that the Mac Studio went to M3, but at least there's a pretty nice performance increase for Redshift (presumably down to hardware ray tracing):

[Screenshot: Redshift benchmark results]


Also, having nearly 512GB of memory for rendering would be nice :)

Wonder how much faster the GPU is generally; I don't render that much, but I do use OpenCL a fair bit, so any speedup there would be nice.
 
This is a bit off topic, but may be interesting to a few.
 
Out of curiosity does anyone use Marmoset Toolbag?

Just discovered that the latest version has a native app (and UDIM support!). Looks like it's getting close to Substance Painter in terms of features, and it has the benefit of not being Adobe and having a perpetual license.

Being in film, I'm vaguely familiar with Mari, Substance and, to a certain extent, 3D Coat, but I've never seen Marmoset used, so I was wondering how it compares these days.
 
Nice; will definitely give it a shot once I have something I need to texture (didn't want to waste the trial). TBH my texturing needs are pretty simple most of the time, so I suspect its current functionality will more than cover my use case.

With a bit of luck between Toolbag for manual texturing and Houdini's COPs for procedural textures I can continue avoiding anything Adobe based.
 
Nice; will definitely give it a shot once I have something I need to texture (didn't want to waste the trial). TBH my texturing needs are pretty simple most of the time, so I suspect its current functionality will more than cover my use case.

With a bit of luck between Toolbag for manual texturing and Houdini's COPs for procedural textures I can continue avoiding anything Adobe based.
Yeah. The V5 release has some neat texturing features.
It also comes with tons of material libraries, analogous to Substance materials, so it will be easier to mock up some textures.

Its renderer is also a useful feature.
 