On consumer Nvidia hardware only a few of the subcores have hardware FP64 enabled, so throughput is about 1:64. To be technical: on Ampere and Ada, for example, each SM has only 2 hardware FP64 units alongside its 64 FP32 and 64 FP32/INT32 units, which are split amongst four "Streaming Multiprocessor Sub-Partitions" (SMSPs), so not every sub-partition has its own FP64 unit. For the last few generations at least*, professional hardware with lots of FP64 throughput has used a different die with tweaked core designs. Whether Nvidia does anything beyond that is unknown to me.
So…I’m confused that Apple gets the same performance ratio for fp64 as Nvidia consumer cards without fp64 circuitry. Is Nvidia slowing down its fp64, is Nvidia lacking fp64 hardware and I misunderstood or…?
I’d be interested to know why we wouldn’t just use the AMX fp64 facility which is faster than most Nvidia cards, on the M4 Max at least?
*Basically since Pascal/Volta, then Ampere consumer vs Ampere professional, Ada vs Hopper, and now Blackwell 2.0 vs Blackwell (the professional variant). Professional chips with lots of FP64 (there are other pro chips, but I mean specifically the FP64-heavy ones) use xx100 dies, while all other professional and consumer chips use xx102 dies or higher. Prior to that, I'm not sure Nvidia had any designs that were better than 1:32 FP64 throughput at all, and Turing was the last time a consumer card had that much (beyond the occasional Titan, like the Titan V, which had 1:2).
For Macs, because of the UMA, using the AMX in addition to the GPU may indeed be possible. It would depend on the application and on how separate the FP64 calculations are from the rest of the calculations; in other words, the latency cost of communicating through main memory vs local cache, weighed against FP64 throughput on the AMX vs the GPU.
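To make that trade-off a bit more concrete, here is a rough back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function names are made up, the model is a simple "compute time plus transfer time plus a fixed sync cost", and the example numbers are placeholders, not measurements of any Apple or Nvidia chip.

```python
def gpu_time(flops, gpu_fp64_rate):
    """Seconds to run the FP64 work in place on the GPU (hypothetical model)."""
    return flops / gpu_fp64_rate


def amx_time(flops, amx_fp64_rate, bytes_moved, mem_bandwidth, sync_latency):
    """Seconds to hand the FP64 work to the AMX instead.

    Assumes the data has to round-trip through main memory (bytes_moved at
    mem_bandwidth) and that each hand-off costs a fixed sync_latency.
    """
    return flops / amx_fp64_rate + bytes_moved / mem_bandwidth + sync_latency


def offload_pays_off(flops, bytes_moved, gpu_fp64_rate, amx_fp64_rate,
                     mem_bandwidth, sync_latency):
    """True if, under this toy model, using the AMX beats staying on the GPU."""
    return amx_time(flops, amx_fp64_rate, bytes_moved,
                    mem_bandwidth, sync_latency) < gpu_time(flops, gpu_fp64_rate)


# Placeholder rates: AMX assumed 10x faster at FP64 than the GPU.
rates = dict(gpu_fp64_rate=1e11, amx_fp64_rate=1e12,
             mem_bandwidth=1e11, sync_latency=1e-4)

# A large, self-contained FP64 chunk: the hand-off cost is amortised.
big_chunk = offload_pays_off(flops=1e9, bytes_moved=1e6, **rates)

# A tiny FP64 step interleaved with other GPU work: the hand-off dominates.
tiny_chunk = offload_pays_off(flops=1e5, bytes_moved=1e6, **rates)
```

With these placeholder numbers the big chunk comes out in favour of the AMX and the tiny one does not, which is the point: the more tightly the FP64 work is interleaved with everything else, the less the AMX's raw throughput helps.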