So…I’m confused that Apple gets the same performance ratio for fp64 as Nvidia consumer cards without fp64 circuitry. Is Nvidia slowing down its fp64, does Nvidia lack fp64 hardware and I misunderstood, or…?

I’d be interested to know why we wouldn’t just use the AMX fp64 facility which is faster than most Nvidia cards, on the M4 Max at least?
On consumer Nvidia hardware only a few of the subcores have hardware FP64 enabled - throughput is about 1:64. To be technical: each SM would thus have 1 hardware FP64 unit for every 64 FP32 and 64 FP32/INT32 units, which are split amongst two "Streaming Multiprocessor Sub-Partitions" (SMSPs), so only one of the two sub-partitions would have an FP64 unit. These days, for the last few generations at least*, professional hardware that has lots of FP64 throughput uses a different die with tweaked core designs. Whether Nvidia does anything beyond that is unknown to me.

*basically since Pascal/Volta: then Ampere consumer vs Ampere professional, Ada vs Hopper, and now Blackwell 2.0 vs Blackwell (the professional variant). Professional chips with lots of FP64 (there are more pro chips, but specifically the ones with lots of FP64) use the xx100 dies, while all other professional and consumer chips use xx102 dies or higher. Prior to that I'm not sure Nvidia had any designs that were better than 1:32 FP64 throughput at all; Turing was the last time a consumer card had even that much (beyond the occasional Titan, like the Titan V, which had 1:2).

For Macs, because of the UMA, using the AMX in addition to the GPU may indeed be possible. It would depend on the application and how separate the FP64 calculations are from the rest of the calculations - in other words, the latency cost of communicating through main memory vs local cache, weighed against the FP64 throughput of the AMX vs the GPU.
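
To make that concrete, here is a minimal sketch (plain C, placeholder matrix sizes, built with -framework Accelerate) of what the CPU/AMX half of such a split could look like, assuming the FP64-heavy portion happens to be a dense matrix multiply. Accelerate's BLAS is commonly reported to be backed by the AMX blocks on Apple Silicon, and with UMA the data at least stays in the same physical memory as the GPU's buffers:

```c
#include <Accelerate/Accelerate.h>

// Run the double-precision part of the pipeline on the CPU/AMX side while
// the GPU keeps working on the FP32 parts. C = A * B, all n x n, row-major.
void fp64_portion(const double *A, const double *B, double *C, int n)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,   // alpha, A, lda
                     B, n,   //        B, ldb
                0.0, C, n);  // beta,  C, ldc
}
```

Whether that wins over keeping everything on the GPU comes down to exactly the trade-off above: how often the FP64 results have to be handed back and forth versus how much FP64 throughput the AMX buys you.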
 
I suppose my next question would be: if fp64 can be emulated without hardware, to a similar performance level as the hardware Nvidia currently sells, why doesn't Nvidia just emulate it?
 
Consistency in software platforms between consumer and professional hardware (a lot of people develop on consumer and deploy on professional), and as I said there is the caveat that the emulation is a lot more code intensive - throughput might be the same, but calling function pointers or having huge inline code has its own performance implications on the GPU.
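
To make "code intensive" concrete, here is a minimal sketch of the float-float ("double-single") arithmetic that software extended precision is commonly built from - a generic illustration, not necessarily what any particular driver or library does. It roughly doubles FP32's significand bits (to ~49, vs FP64's 53) but does not extend the exponent range, and every emulated add or multiply expands into several hardware instructions:

```c
#include <math.h>   // fmaf

// A value stored as an unevaluated sum hi + lo of two floats.
// Requires strict IEEE arithmetic (e.g. no -ffast-math), or the
// error terms get optimized away.
typedef struct { float hi, lo; } ffloat;

// Knuth's TwoSum: hi + lo == a + b exactly (6 FP32 ops).
static ffloat two_sum(float a, float b) {
    float s  = a + b;
    float bb = s - a;
    float e  = (a - (s - bb)) + (b - bb);
    return (ffloat){ s, e };
}

// TwoProd via FMA: hi + lo == a * b exactly (2 FP32 ops).
static ffloat two_prod(float a, float b) {
    float p = a * b;
    float e = fmaf(a, b, -p);        // exact rounding error of the product
    return (ffloat){ p, e };
}

// float-float add: ~10 FP32 ops instead of 1 (a simplified variant).
static ffloat ff_add(ffloat x, ffloat y) {
    ffloat s  = two_sum(x.hi, y.hi);
    float  lo = s.lo + (x.lo + y.lo);
    float  hi = s.hi + lo;           // renormalize so |lo| is tiny vs hi
    lo = lo - (hi - s.hi);
    return (ffloat){ hi, lo };
}

// float-float multiply: similarly several ops per multiply.
static ffloat ff_mul(ffloat x, ffloat y) {
    ffloat p  = two_prod(x.hi, y.hi);
    float  lo = p.lo + (x.hi * y.lo + x.lo * y.hi);
    float  hi = p.hi + lo;
    lo = lo - (hi - p.hi);
    return (ffloat){ hi, lo };
}
```

Inlining that everywhere FP64 is used is exactly the kind of code bloat I mean.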
 
So…I’m confused that Apple gets the same performance ratio for fp64 as Nvidia consumer cards without fp64 circuitry. Is Nvidia slowing down its fp64, does Nvidia lack fp64 hardware and I misunderstood, or…?

Nvidia has FP64 hardware, but the performance is only 1/64 that of FP32. Basically, it’s there just for compatibility with legacy code.


I’d be interested to know why we wouldn’t just use the AMX fp64 facility which is faster than most Nvidia cards, on the M4 Max at least?

Sure, you can, if your workload maps well to AMX/SME.
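
For illustration, a contiguous double-precision reduction is the sort of thing that maps well, since Accelerate's vDSP routines are commonly reported to be backed by the AMX/SME units on Apple Silicon. The function name and array length here are placeholders:

```c
#include <Accelerate/Accelerate.h>

// Double-precision dot product on the CPU/AMX side via vDSP.
double fp64_dot(const double *a, const double *b, vDSP_Length n)
{
    double result = 0.0;
    vDSP_dotprD(a, 1, b, 1, &result, n);   // stride 1 over both inputs
    return result;
}
```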
 
Nvidia has FP64 hardware, but the performance is only 1/64 that of FP32. Basically, it’s there just for compatibility with legacy code.
The AMD Radeon Pro Vega II has 1:2 FP64 at 7.045 TFLOPS, and the Duo card doubles that - if that is important?

 
AMD still offers a strong FP64 implementation in the professional Radeon Instinct series. Judging by the financial performance, there is not much interest.
I thought so. From time to time I search for information on how it could be utilised, since I have one of these cards, but I have not seen any use cases that I could take advantage of. So I assume it's used mainly in academic research.
 

People like to cite academic research as a user of FP64 precision, but as a researcher myself I am puzzled by this. I am sure that there are fields that need extended precision for their calculations. I am also quite sure that most of these people don't run their algorithms on your standard GPU.
 

For myself, I use FP64 on occasion: most of my calculations can get by on FP32, but there are times when I am doing accumulations or calculations where at the very least the extended range of FP64 is necessary, and possibly the precision as well. I tried using FP32 everywhere, but it greatly constrained my ability to run certain parameters. One of the selling points of my research, while I was still doing it, was indeed to "democratize" the computations needed for these intense simulations - move them off the big iron and onto people's desktops - so I try to make my algorithm as performant as possible on standard GPUs. Having said that, my FP64 usage is low enough that, as far as I can tell, it doesn't really drag down performance. I might be able to do a little better, and other models/algorithms might be different if I were to ever pursue them, but I admit that even in my case the lack of FP64 compute isn't hamstringing me so far.
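
As a toy illustration of the accumulation point (made-up numbers, nothing to do with my actual model): summing a long series of small FP32 values into an FP32 accumulator silently stalls once the running total gets large enough, while an FP64 accumulator does not:

```c
#include <stdio.h>

int main(void)
{
    const long  n   = 20000000;   // 2e7 addends
    const float val = 1.0f;

    float  acc32 = 0.0f;
    double acc64 = 0.0;

    for (long i = 0; i < n; ++i) {
        acc32 += val;   // stalls at 2^24 = 16777216: adding 1 no longer changes it
        acc64 += val;   // exact over this range
    }

    printf("FP32 accumulator: %.1f\n", acc32);   // 16777216.0
    printf("FP64 accumulator: %.1f\n", acc64);   // 20000000.0
    return 0;
}
```

Compensated (Kahan) summation in FP32 can patch the precision side of this, but it does nothing for the exponent range, which is where FP64 is sometimes unavoidable for me.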

EDIT: I love that I talk about this as though I still actually get to work on it ... ah well ... maybe one day
 
Bit disappointed that the Mac Studio went with M3, but at least there's a pretty nice performance increase for Redshift (presumably down to hardware ray tracing):

[Attached Redshift benchmark chart]


Also, having nearly 512GB of memory for rendering would be nice :)

Wonder how much faster the GPU is generally; don't render that much, but do use OpenCL a fair bit, so any speed up there would be nice.
 