When I was doing research on Apple/Nvidia rendering performance, I came across an interesting blog post from Chips and Cheese (chipsandcheese.com) on the massive increase in power requirements for high-end graphics cards.
In it they wondered whether parts of Nvidia's new architecture were partially responsible for Ampere's higher power cost per unit of performance, one candidate being its "dual-issue" design:
Another potential source of its power efficiency issues would be the large number of CUDA cores intrinsic to the design. With Ampere, Nvidia converted their separate integer pipeline added in Turing to a combined floating point/integer pipeline, essentially doubling the number of FP units per streaming multiprocessor. While this did result in a potential performance uplift of around 30% per SM, it also presents challenges for keeping the large number of cores occupied.[21] Ampere has been shown to exhibit lower shader utilization and more uneven workload distribution when compared to the earlier Pascal architecture.[22] These issues with occupancy resulting from a large core count are ones Ampere shares with other high core-count and power-hungry architectures such as Vega; like Vega, Ampere appears to be a better fit for compute-heavy workloads than many gaming workloads.
Now there wasn't a supporting figure for this up-to-30% claim (though @name99 also repeated this number in his description of Nvidia's pseudo-dual-issue), so I decided to test it myself! I used the 3DMark search, since there are multiple tests and you can filter for specific GPUs and CPUs, even down to GPU core/memory clocks for different memory bandwidths and compute, which I take advantage of below.

Some caveats: for this particular part I avoid ray tracing tests because the ray tracing cores evolved significantly between Nvidia generations (I will use them later), and 2K tests like Steel Nomad Light degrade in performance per TFLOPS as the GPUs get bigger far more significantly than 4K tests like Steel Nomad (the latter of which is also more sensitive to bandwidth). I tested this myself and found that moving from a 4060 to a 4090, Steel Nomad Light loses about 12% more perf/TFLOPS than Steel Nomad, and the difference would probably have been greater if the 4090's bandwidth had been bigger (I have 5080/5090 results for Steel Nomad, but not Steel Nomad Light, which would confirm this). To alleviate these issues I try to match GPUs as best I can by bandwidth and compute power, though when comparing pre-Ampere and post-Ampere GPUs I match the pre-Ampere GPU with a post-Ampere GPU of roughly double the TFLOPS.

To calculate the performance uplift from Nvidia's "dual-issue" I use 2 x perf_per_TFLOPS(40XX) / perf_per_TFLOPS(20XX); see the sketch below the table. If Nvidia's scheme gave the full uplift you'd get a result of 2, no uplift gives 1, and the ~30% expected from the earlier research gives 1.3. The Steel Nomad tests were both DX12 for the Nvidia GPUs, and I selected the median result for each GPU.
| GPU | Wild Life Extreme | Steel Nomad Light | Steel Nomad | Bandwidth (GB/s) | TFLOPS |
| --- | --- | --- | --- | --- | --- |
| 4090 mobile | 38365 | 18194 | 4439 | 576.4 | 33.3 |
| 4060 | 17980 | 9408 | 2088 | 272 | 15.11 |
| 2080 Ti (overclocked) | 28810 | 13498 | 3493 | 616 | 15.5 |
| 2060 (overclocked) | 15909 | 7742 | 1788 | 336 | 7.78 |
The 4090 mobile result was a slight overclock, as stock numbers either didn't exist or were poor quality. The CPU can also come into play, although these are pretty intense tests (especially the two Steel Nomads, though Wild Life Extreme is still 4K). The 4090 mobile will be compared with the overclocked 2080 Ti, while the 2060 will be compared with the 4060.
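As a quick sanity check on the arithmetic, here's a minimal Python sketch of the uplift calculation using the median scores and TFLOPS figures from the table above. The pairings and the ratio formula are exactly the ones described earlier; the helper names are just illustrative.

```python
# Uplift ratio = 2 * (perf/TFLOPS of the 40-series GPU) / (perf/TFLOPS of the 20-series GPU).
# Scores and TFLOPS are the median results from the table above.
scores = {
    # GPU: (Wild Life Extreme, Steel Nomad Light, Steel Nomad, TFLOPS)
    "4090 mobile":  (38365, 18194, 4439, 33.3),
    "4060":         (17980,  9408, 2088, 15.11),
    "2080 Ti (OC)": (28810, 13498, 3493, 15.5),
    "2060 (OC)":    (15909,  7742, 1788, 7.78),
}

def per_tflops(gpu, test_idx):
    *tests, tflops = scores[gpu]
    return tests[test_idx] / tflops

def uplift(new_gpu, old_gpu, test_idx):
    # 2.0 would mean the doubled FP32 pipes deliver their full theoretical
    # benefit; 1.0 would mean no benefit at all; 1.3 is the ~30% claim.
    return 2 * per_tflops(new_gpu, test_idx) / per_tflops(old_gpu, test_idx)

tests = ["Wild Life Extreme", "Steel Nomad Light", "Steel Nomad"]
for new, old in [("4090 mobile", "2080 Ti (OC)"), ("4060", "2060 (OC)")]:
    for i, name in enumerate(tests):
        print(f"{new} vs {old}, {name}: {uplift(new, old, i):.2f}")
# Prints ratios of roughly 1.16-1.26 across the six comparisons.
```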
Here we see achieved increases of about 16-25% across these tests. So expecting an uplift of around 30% was not far off the mark, and it will likely be achieved depending on the GPUs compared and the test (there's a high degree of variance in these tests). Interestingly, Steel Nomad Light, the 2K test, achieved the best uplift of the three, while the two 4K tests had worse increases even though Steel Nomad Light is a much harder-hitting test than Wild Life Extreme.

Since two of the three tests also run on the Mac (sadly there is no native full Steel Nomad), I decided to repeat the above analysis, matching M4 GPUs with the 4000-series GPUs above. Now, the M4 has dual-issue, but not double FP32 pipes, among many other differences of course (TBDR design, Dynamic Caching, etc.). So I wanted to see how they compared. Since the 3DMark search doesn't cover Macs, I used NotebookCheck results.
| GPU | Wild Life Extreme | Steel Nomad Light | Bandwidth (GB/s) | TFLOPS |
| --- | --- | --- | --- | --- |
| M4 Max (40-core) | 37434 | 13989 | 546 | 16.1 |
| M4 Pro (20-core) | 19314.5 | 7795 | 273 | 8.1 |
| M4 Pro (16-core) | 16066.5 | 6681 | 273 | 6.46 |
The M4 Max will be compared with the 4090 mobile while the two M4 Pros will be matched with the 4060.
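Here is the same calculation repeated for these pairings, again just a sketch built from the table values above. The ratio is defined the same way, with the M4 GPU taking the place of the 20-series GPU in the denominator, so a value below 1 means the M4 is beating even the "doubled" Nvidia per-TFLOPS throughput.

```python
# Scores are the NotebookCheck medians from the M4 table plus the earlier Nvidia numbers.
m4 = {
    # GPU: (Wild Life Extreme, Steel Nomad Light, TFLOPS)
    "M4 Max (40-core)": (37434,   13989, 16.1),
    "M4 Pro (20-core)": (19314.5,  7795,  8.1),
    "M4 Pro (16-core)": (16066.5,  6681,  6.46),
}
nvidia = {
    "4090 mobile": (38365, 18194, 33.3),
    "4060":        (17980,  9408, 15.11),
}

def ratio(nv_gpu, m4_gpu, test_idx):
    # 2 * (perf/TFLOPS of the 40-series GPU) / (perf/TFLOPS of the M4 GPU)
    nv_score, nv_tflops = nvidia[nv_gpu][test_idx], nvidia[nv_gpu][-1]
    m4_score, m4_tflops = m4[m4_gpu][test_idx], m4[m4_gpu][-1]
    return 2 * (nv_score / nv_tflops) / (m4_score / m4_tflops)

for nv, mac in [("4090 mobile", "M4 Max (40-core)"),
                ("4060", "M4 Pro (20-core)"),
                ("4060", "M4 Pro (16-core)")]:
    wle, snl = ratio(nv, mac, 0), ratio(nv, mac, 1)
    print(f"{nv} vs {mac}: Wild Life Extreme {wle:.2f}, Steel Nomad Light {snl:.2f}")
# Comes out at roughly 0.96-1.0 for Wild Life Extreme and 1.2-1.3 for Steel Nomad Light.
```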
Fascinatingly, the M4 GPU behaves almost exactly like a 2000-series GPU in Steel Nomad Light, but very, very differently in Wild Life Extreme. In the former, the two M4 Pro GPUs straddle the perf per TFLOPS of the overclocked RTX 2060, while the M4 Max gets almost identical perf per TFLOPS to the overclocked 2080 Ti; both 2000-series GPUs have better bandwidth than their respective M4 GPUs. In Wild Life Extreme, though, the ratio is below one! That means the M4 GPU, per TFLOPS, is outperforming double the per-TFLOPS performance of the comparative 4000-series GPU. I also checked a similar benchmark, GFXBench 4K Aztec Ruins High Tier Offscreen, and got results similar to Wild Life Extreme: ratios from 0.84 (overclocked 4090 mobile vs M4 Max) to 1.05 (4060 vs M4 Pro 20-core; the 4060 result is from the GFXBench website and may not be at stock - the 4090 mobile definitely wasn't, but I knew its configuration from the NotebookCheck website).

Now, I'm not an expert in these engines, but something different is happening with these older, less intense engines, even though they are at 4K, that isn't happening with the 2K Steel Nomad Light. If anyone has any ideas, I'd love to hear them. It might also be fun to test non-native Steel Nomad, but I don't have the relevant Macs. Even though Wild Life Extreme and Aztec High 4K are both 4K tests, they are older, lighter tests: it's possible the CPU has more influence, or that they were more heavily optimized for mobile, but I'm not sure what those optimizations would be or why the 4000-series wouldn't benefit from them.
I also took a look at ray tracing. I already knew that the 2000-series GPUs fare much more poorly against the 4000-series in ray tracing tests, so I didn't compare them, but I wanted to see how the results differed from the above for the Macs as a way to gauge the effectiveness of the ray tracing cores in Apple's GPUs. Unfortunately, for the life of me I couldn't find M4 Max Solar Bay results! Extremely annoying. However, I did get results for the M4 Pro to compare against the 4060, with ratios of 1.2 and 1.3 for the 16-core and 20-core M4 Pro respectively. While I couldn't control Blender results as well as I could 3DMark, I looked at those as well. Assuming the median Blender results were close to stock and my own results were reasonably close to the median, I got ratios ranging from 0.98 (Junkshop, 4060 vs 16-core M4 Pro) to 1.39 (Monster, 4090 mobile vs 40-core M4 Max), with most in the 1.2-1.3 range. Combined with the Steel Nomad results above, I would say the ray tracing cores certainly don't hamper the M4's performance relative to the RTX 4000-series GPUs. Apple's ray tracing cores appear at least comparable to Nvidia's.
I also played around a little with Geekbench 5 and 6 (CUDA/OpenCL on Nvidia, Metal/OpenCL on the Mac) and found similar ratios in the top-line numbers, but I didn't look at subtests. I probably won't get to that, but interestingly, at least on average, there weren't any glaring differences in the Geekbench compute tasks.
So there are two interesting corollary topics from this analysis:
1) How should one compare Nvidia and Apple GPUs?
2) Should Apple adopt double-piped FP32 graphics cores?
With regards to the first point, I think there is sometimes confusion over how an Apple GPU can compete with an Nvidia GPU rated at up to twice its TFLOPS. Here we see that, depending on the test and the two GPUs in question, the Nvidia GPU sometimes behaves as if it had only 0.5-0.7 of its rated TFLOPS relative to the Apple GPU. So a 40-core M4 Max rated at ~16 TFLOPS could be comparable to an Nvidia GPU rated anywhere from 23 to 32 TFLOPS, and that's not actually unreasonable. With the exception of Wild Life Extreme and Aztec Ruins, where Apple does incredibly well (factors of 0.5 and lower), it's actually similar to how older Nvidia GPUs compare to the newer ones (in non-ray-tracing workloads).
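To make that arithmetic explicit, here's a trivial sketch of the conversion; the 0.5-0.7 factor is simply the range observed in the ratios above, nothing more precise than that.

```python
# Back-of-the-envelope: if an Nvidia GPU effectively delivers only 0.5-0.7 of its
# rated TFLOPS relative to the Apple GPU, a ~16 TFLOPS M4 Max lands in the same
# neighborhood as a much larger-rated Nvidia part.
m4_max_tflops = 16.1
for factor in (0.7, 0.5):
    print(f"Effective factor {factor}: comparable to ~{m4_max_tflops / factor:.0f} TFLOPS Nvidia GPU")
# ~23 TFLOPS at 0.7 and ~32 TFLOPS at 0.5, matching the 23-32 TFLOPS range above.
```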
With respect to the second point, @leman and I were discussing this possibility and I was quite in favor of it: dual FP32 pipes per core increase performance without costing much, if any, silicon die area, and Apple already has a dual-issue design, it just doesn't have two FP32 pipes per core. From that perspective it makes sense for Apple to adopt them. I have to admit, though, that I've cooled somewhat based on this analysis. Interestingly, Intel has a triple-issue design but also doesn't double the FP32 pipes; maybe they will later, but it is noteworthy that neither Apple nor Intel doubled their FP32 pipes despite having the ability to issue multiple instructions per clock.
As ChipsandCheese says, it's possible occupancy and other architectural issues become more complicated. AMD certainly suffers here as well. I'm not going to provide a full analysis here, but the 890M has a dual-issue design: without dual issue it's roughly a 6 TFLOPS design, and with dual issue it can reach 12 TFLOPS, but that dual issue is very temperamental. More here on the 890M in particular. But in performance metrics the 890M can struggle against the base M4, even in non-native games like CP2077 (coming soon to the Mac), and often has the power profile of the M4 Pro. Similarly, Strix Halo looks to be at roughly the M4 Pro level (maybe a bit better) despite having 40 cores and a presumably large number of theoretical TFLOPS. Lastly, despite having the same TFLOPS rating, the 890M is also seemingly not as performant in many of these benchmarks and games as the older 6500 XT, which was RDNA 2.
Now, as aforementioned, Apple has a different dual-issue design. If they adopt double FP32 pipes per core, they may not suffer from whatever problem is holding Nvidia/AMD back and might get better performance increases per additional TFLOPS. They may not get the full doubling, but they might do better. However, one of the key drivers of this dual-issue design is to increase TFLOPS without impacting die size. That doesn't necessarily stop power from increasing, and given the above, the power increases may be bigger than the performance increases! Apple, not having to sell its GPUs as separate parts to OEMs and users, may decide that if they need to increase TFLOPS, they'll just increase the number of cores and eat the silicon costs.