It's about a 36-40% improvement, and while there are 33% more cores, core clocks were reduced, so TFLOPS is actually only about 27% higher. Meanwhile, memory bandwidth is about 80% higher. So it's fair to say that some of the improvement in the Blender score does indeed appear to come from the memory improvements.

True, true, it's just that I am a bit surprised the memory bandwidth effect is so small. I think it also poses some fascinating questions about the compute efficiency of modern GPUs and the relationship between compute and data transfer. An interesting point about Nvidia consumer GPUs has been their relatively low bandwidth-to-compute ratio, so I'd naively expect a behemoth like the 4090 to be bandwidth-limited on these types of workloads. The 5090 should alleviate this bottleneck, but the effect is tiny, meaning it wasn't really a bottleneck at all. So the question I have now: is it some property of modern GDDR that hampers effective data transfers in these types of applications, does the Blender code fail to scale up to such a large parallel processor, or is Nvidia's compute inefficient at scale when running complex asymmetric workloads? Despite massive bandwidth and cache advantages, the 5090 achieves only ~150 points/TFLOP in Blender benchmarks, while the much smaller RTX 4060 achieves ~200 pt/TFLOP and the M4 Max a whopping ~270 pt/TFLOP. I'd really like to understand this discrepancy.
 
My money is on this being a scaling issue with the test. Blender benchmarks have to run on tiny GPUs with little bandwidth and limited VRAM, so by the time you get to these behemoths there has to be a limit to how much compute and bandwidth such a test can really push. And things can get complicated, like the mobile 4080 beating the desktop 4070 in Blender 4.2, but not in CB R24 or even in Blender 4.3. And in Blender 4.3 the M4 Max beats both. Though you have to be careful, as user-submitted data can be problematic, with no way to know what power settings or other applications were in play during the run (with enough data you hope the median would be fine, but ... that depends on the extent of the contamination). Also, as the Blender score is a composite of scores from multiple scenes, you'd have to see the sub-scores to see whether outliers are causing any issues. If one test stops scaling, it can drag the whole score down.


 
I agree that the high variability in results is odd. Would be interesting to look at the score distributions and the sub-tests. The raw data is accessible from what I understand, yes?
 


It is interesting that the Nvidia RTX 6000 Ada Generation is closer to the 4090 than the 5090, considering it is halfway between the two with respect to SM count. I wonder how much GPU clocks matter for Blender testing, as the RTX 6000 Ada Generation doesn't clock as high as the 4090. Interestingly, the 5090 has lower clocks (base/boost) than the 4090 as well.
 
In general, GPU compute is a function of clocks and SM count (2 x shading units x clock speed), so you have to take both into account simultaneously. So the RTX 6000 Ada has an expected ~91 TFLOPS vs ~83 (4090) vs ~105 (5090). It also has worse bandwidth than the 4090, so its Blender results landing between the 4090 and 5090 suggest that, at its level of compute and bandwidth, compute may matter more.
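For anyone who wants to check those numbers, here's the arithmetic spelled out: a quick sketch using the commonly quoted shading-unit counts and boost clocks (approximate figures, and actual boost behavior varies by card):

```python
# FP32 TFLOPS ~= 2 ops per FMA x shading units x clock (GHz) / 1000
# Shader counts and boost clocks below are approximate published specs.
gpus = {
    "RTX 4090":     (16384, 2.52),   # (shading units, boost clock in GHz)
    "RTX 6000 Ada": (18176, 2.505),
    "RTX 5090":     (21760, 2.41),
}

for name, (units, clock_ghz) in gpus.items():
    tflops = 2 * units * clock_ghz / 1000
    print(f"{name}: ~{tflops:.0f} TFLOPS")
# -> roughly 83, 91, and 105 TFLOPS, matching the figures above
```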


Yeah, unfortunately as far as I can tell the only way to do so is to download the daily snapshot of the full Blender data set. Downloading the JSON/CSV of just what you search for merely recapitulates the main table with no additional information. So you'd have to read the full JSON file and parse it to get what you want. Not impossible of course, but I haven't done it.
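For anyone who wants to try, here's roughly what that would look like. This is a minimal sketch that assumes the snapshot is one big JSON array and guesses at field names (device_name, scenes, label, samples_per_minute); the real schema will differ, so treat it only as a starting point:

```python
import json
from collections import defaultdict

# Minimal sketch: load the daily snapshot and bucket scores by (device, scene).
# "opendata-latest.json" and the field names below are guesses, not the real schema.
with open("opendata-latest.json") as f:
    entries = json.load(f)  # assumed: a list of benchmark submissions

scores = defaultdict(list)  # (device_name, scene label) -> list of samples/minute
for entry in entries:
    device = entry.get("device_name", "unknown")
    for scene_result in entry.get("scenes", []):
        label = scene_result.get("label")            # e.g. "monster", "junkshop", "classroom"
        spm = scene_result.get("samples_per_minute")
        if label and spm:
            scores[(device, label)].append(spm)

# Example: count all Monster submissions for a given GPU (device string is also a guess)
print(len(scores[("NVIDIA GeForce RTX 5090", "monster")]))
```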

 
Going through the latest snapshot of Blender data, we can see the following:

| GPU | Monster | Junkshop | Classroom | Total Score |
| --- | --- | --- | --- | --- |
| 5090 | 7367.90547830103 | 3890.35490074279 | 3645.547999789463 | 14903.8083788333 |
| 4090 | 5513.46414977437 | 2719.98915071775 | 2761.0141095246 | 10994.4674100167 |
| 3090 | 2385.67801958354 | 1508.31362266432 | 1333.10109865781 | 5227.09274090567 |
| 4090 -> 5090 | 1.33 | 1.43 | 1.32 | 1.35 |
| 3090 -> 4090 | 2.31 | 1.80 | 2.07 | 2.07 |

Each of the chosen samples was either the median score or the next highest score to the median.

EDIT: Corrected the Classroom score from “3890.35490074279” to “3645.547999789463”. Thanks to @crazy dave for pointing it out.
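For reference, the uplift rows are just per-scene ratios of those (near-)median scores; a quick sketch of that arithmetic, with the values above rounded to two decimals:

```python
# Per-scene samples/minute from the table above (rounded to 2 decimals)
medians = {
    "5090": {"monster": 7367.91, "junkshop": 3890.35, "classroom": 3645.55},
    "4090": {"monster": 5513.46, "junkshop": 2719.99, "classroom": 2761.01},
    "3090": {"monster": 2385.68, "junkshop": 1508.31, "classroom": 1333.10},
}

def uplift(newer, older):
    """Generation-over-generation ratio, per scene."""
    return {scene: round(medians[newer][scene] / medians[older][scene], 2)
            for scene in medians[newer]}

print(uplift("5090", "4090"))  # ~{'monster': 1.34, 'junkshop': 1.43, 'classroom': 1.32}
print(uplift("4090", "3090"))  # ~{'monster': 2.31, 'junkshop': 1.8, 'classroom': 2.07}
```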
 
Going through the latest snapshot of Blender data, we can see the following:

| GPU | Monster | Junkshop | Classroom | Total Score |
| --- | --- | --- | --- | --- |
| 5090 | 7367.90547830103 | 3890.35490074279 | 3890.35490074279 | 14903.8083788333 |
| 4090 | 5513.46414977437 | 2719.98915071775 | 2761.0141095246 | 10994.4674100167 |
| 3090 | 2385.67801958354 | 1508.31362266432 | 1333.10109865781 | 5227.09274090567 |
| 4090 -> 5090 | 1.33 | 1.43 | 1.32 | 1.35 |
| 3090 -> 4090 | 2.31 | 1.80 | 2.07 | 2.07 |

Each of the chosen samples was either the median score or the next highest score to the median.
Fascinating! Thanks so much for digging around to get that. I think for the 5090 Classroom score you accidentally copied and pasted the Junkshop score. Still, judging by the uplifts and the tech specs, I'd say Junkshop is the most sensitive to bandwidth. It has the least uplift between the 3090 and 4090, where compute but not bandwidth changed a lot, and the most uplift between the 4090 and 5090, where bandwidth changed much more than compute. That was also my conclusion based on the scores @Homy posted earlier. Classroom may be equally (or more) taxing on the compute, but Junkshop seems to be more sensitive to memory bandwidth. Again, my guess is even Junkshop has limits in terms of how much it can stress memory bandwidth, and we've probably more than hit that limit for the other two.
 
Damn yeah you’re correct. Thanks for spotting it!

Here is the same presentation of M3 Max (40 Core) to M4 Max (40 Core)
| Chip | Monster | Junkshop | Classroom | Total Score |
| --- | --- | --- | --- | --- |
| M4 Max (40 Core) | 2462.07638375756 | 1322.10820108297 | 1302.27569870296 | 5086.46028354349 |
| M3 Max (40 Core) | 2006.5650469189 | 1048.99339927041 | 1064.5654517731 | 4120.12389796241 |
| M3 Max -> M4 Max | 1.23 | 1.26 | 1.22 | 1.23 |
 
Did you write a parser for the JSON file or do you have a favored program for reading JSON?
 
More 5090 performance data…

From Puget Systems:

In V-Ray 6.0 RTX rendering the RTX 5090 is … 38% faster than the 4090… (15,062 vs 10,927).

In Unreal Engine 5.5 the RTX 5090 is 19% (Raytraced) and 16% (Rasterized) faster than the 4090.


Puget found the RTX 5090 is not yet supported in either the Redshift (Cinebench GPU) or Octanebench benchmarks - so for some additional context, below are summary points from Gamers Nexus testing:

  • Generally, we saw 27-35% uplift in 4K gaming over the RTX 4090.
  • In one instance, we saw 50%
  • In one RT instance, we saw 44%
  • Most of the time, it hung around 30%
  • At 1440p, we saw those drop down into the 20s typically, even when not CPU bound
  • The card is one of the most power hungry idle power consumers we’ve tested
  • The card is the most power hungry for maximum power consumption and we worry that NVIDIA has elected to continue using 12VHPWR while pushing so close to the limit of the cable spec.
  • Power Efficiency (FPS/W) is about the same as the 4090.
 
The 5090 idle power is hopefully a bug; I can't fathom why it uses so much, especially on the FE where the PCB design is so tight. It would be great if someone got a comment from Nvidia about it, but as far as I know there hasn't been one (please correct me if I'm wrong).

It seems to handle power scaling fairly well considering it's on the same node as the 4090: at a ~70-80% power limit it's around 20-25% faster than the 4090, which is pretty good. Stock settings are nuts though; I don't think I'd run it that way if I had one, and the idle power at 50W+ is atrocious, but hopefully that's a software rather than hardware issue.

It's a shame they weren't able to shrink the node. Apple has a real chance of catching it in some productivity benchmarks if the M6 has an Ultra-esque variant, which I was not expecting. There's no way Nvidia can ship a 5090 Ti unless there's a node shrink, and as far as I'm aware they have never done that outside of a generational leap.

INT4 performance is 300-400% faster than the 4090, but of course they kneecapped the card at higher precisions to push you to the pro offerings.
 
I'd be curious to see if the GDDR7 is being kept at high speeds, thus drawing more power needlessly.

The 5090 isn't even the full die; they are leaving ~3000 CUDA cores on the table. But based on how much power the 5090 is pulling, a dual 12V-2x6 card may not be well received in the marketplace.
 
NV seems to be in the spot Intel was in ~10 years ago: they are on top, no one is close, and the revolution in node shrinking is stalling (which is also the case for Apple and all the other parties outside of Intel at the moment). If there isn't a breakthrough at TSMC in the next few years, AMD, Intel, and NV will probably end up in the same performance/cost position; right now it seems NV >> AMD = Intel for dGPUs.

I would love to see Apple try to get into dGPUs, but I can't see it happening in the next ~4-5 years.
 
Inspired by conversations with @leman and @OptimusGrime, I took a deeper look into Blender benchmarks for Apple Silicon:

| Chip | Monster | Junkshop | Classroom | Total Score | Bandwidth GB/s | FP32 TFLOPS |
| --- | --- | --- | --- | --- | --- | --- |
| M4 Max (40 Core) | 2462.07638375756 | 1322.10820108297 | 1302.27569870296 | 5086.46028354349 | 546 | 15.5 |
| M4 Max (32 Core) | 2069.45050595834 | 1207.0921655042 | 1067.13432988062 | 4343.67700134316 | 410 | 12.44 |
| M4 Pro (20 Core) | 1212.1372188498 | 622.664482836412 | 655.990101194567 | 2490.79180288078 | 273 | 7.78 |
| M4 Pro (16 Core) | 1110.11827782035 | 655.284051736463 | 579.166581942269 | 2344.56891149908 | 273 | 6.22 |
| M4 (10 Core) | 524.36837536322 | 236.747660119091 | 296.818109862276 | 1057.93414534459 | 120 | 3.89 |
| M3 Max (40 Core) | 2006.5650469189 | 1048.99339927041 | 1064.5654517731 | 4120.12389796241 | 409.6 | 14.1 |
| M3 Max (30 Core) | 1609.33848662039 | 951.054299518103 | 829.451646599097 | 3389.84443273759 | 307.2 | 10.6 |
| M3 Pro (18 Core) | 873.526014019736 | 422.836002956528 | 438.089215474811 | 1734.45123245108 | 153.6 | 6.4 |
| M3 Pro (14 Core) | 781.586407609112 | 399.948252134939 | 413.590422200261 | 1595.12508194431 | 153.6 | 4.98 |
| M3 (10 Core) | 443.621508494821 | 212.386703520502 | 241.873586551229 | 897.881798566552 | 102.4 | 3.5 |


These are relatively close to Blender's median values. I also took a look at Nvidia cards; however, they are not shown because, after playing with the data set and similar user-generated data sets, I'm fairly convinced that exact numerical analysis using stock bandwidth/TFLOPS would be erroneous, as so many of the Nvidia cards in the data will be overclocked variants. A lot of people who submit benchmarks built their own systems or bought premium ones, and AIBs for desktop dGPUs and OEMs for laptop GPUs love to overclock their offerings to differentiate themselves both from each other and from Nvidia's FE models, as well as to justify charging more than MSRP. Since Blender doesn't record GPU core/memory clocks, it is impossible to weed those out, and the median likely reflects their presence. Side note: this means Apple likely does even better than one might think in Blender benchmarks versus Nvidia when comparing their relative stock numbers against their observed performance.

Let's take a look at score per TFLOPS:

Screenshot 2025-01-28 at 9.25.18 AM.png
We can see visually that Classroom shows the most stable performance behavior across Apple Silicon (Monster is almost as stable, but all the numbers are bigger, making it look more variable); it and Junkshop vie for the most demanding, while Monster is clearly the least demanding. Junkshop would appear to be the more sensitive to bandwidth (backed up by the Nvidia data, but see above for caveats on that), but there are a lot of oddities with Junkshop, especially absolute performance, not just normalized performance, actually going down when moving from the 16-core to the 20-core M4 Pro. I checked this against multiple data entries and it was relatively consistent; the full data might tell a different story, but at the very least it's no better. Overall, the biggest jump in performance across all scenarios is from the base chip to the binned Pro for both the M3 and M4. Further, you can see that, especially for Monster and Junkshop (and especially Junkshop), the binned models of the Max and Pro do much better per TFLOP than the full ones. To some extent this is expected for the Pro models, as the full chips don't get any extra bandwidth to go with their extra compute, but it also holds true for the Max chips, which absolutely do get extra bandwidth, and quite a percentage increase too! So I'm not sure what to make of that. Further, I was expecting a bigger uplift in the M4 relative to the M3 given the new ray tracing cores, but that doesn't really show up in this data. Most of the improvement in perf/TFLOPS to my eye looks explainable by increases in bandwidth rather than the newfangled ray tracers.

Anyway, what do you all make of it? Why is the binned Max better per FLOPS than the full Max? What is going on with Junkshop? And why might we not have seen a bigger M3-to-M4 uplift per FLOPS given the new ray tracers?

MAJOR EDIT: I screwed up TFLOPS for the M4s, plugging in the wrong clock speed. The M4s do a little better now relative to the M3s in terms of performance/TFLOPS. I still contend that the performance improvements appear largely bandwidth driven. Take the base M4 vs M3: its bandwidth-to-TFLOPS ratio improved by about 5%, and the score/TFLOPS improvements in Monster, Junkshop, and Classroom are 7%, 0%, and 10% respectively. Meanwhile, for the M4 Max vs M3 Max (40 core), the BW/TFLOPS ratio shows a 21% improvement and the uplifts in score/TFLOPS for Monster, Junkshop, and Classroom are 11%, 14%, and 11%. While I don't expect performance to improve linearly with bandwidth, there isn't a lot of room here for saying the new ray tracing cores are having a large effect on the final score.
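To make the arithmetic in that edit explicit, here is the calculation using the table values above; a sketch only, since the TFLOPS figures carry the clock-speed uncertainty discussed below:

```python
# (per-scene score, bandwidth GB/s, FP32 TFLOPS) from the table above, rounded
chips = {
    "M3":           {"monster": 443.62,  "junkshop": 212.39,  "classroom": 241.87,  "bw": 102.4, "tflops": 3.5},
    "M4":           {"monster": 524.37,  "junkshop": 236.75,  "classroom": 296.82,  "bw": 120.0, "tflops": 3.89},
    "M3 Max (40c)": {"monster": 2006.57, "junkshop": 1048.99, "classroom": 1064.57, "bw": 409.6, "tflops": 14.1},
    "M4 Max (40c)": {"monster": 2462.08, "junkshop": 1322.11, "classroom": 1302.28, "bw": 546.0, "tflops": 15.5},
}

def gen_over_gen(new, old):
    n, o = chips[new], chips[old]
    bw_ratio = (n["bw"] / n["tflops"]) / (o["bw"] / o["tflops"])   # change in GB/s per TFLOP
    eff = {s: round((n[s] / n["tflops"]) / (o[s] / o["tflops"]), 2)  # change in score per TFLOP
           for s in ("monster", "junkshop", "classroom")}
    return round(bw_ratio, 2), eff

print(gen_over_gen("M4", "M3"))
# -> ~(1.05, {'monster': 1.06, 'junkshop': 1.0, 'classroom': 1.1})
print(gen_over_gen("M4 Max (40c)", "M3 Max (40c)"))
# -> ~(1.21, {'monster': 1.12, 'junkshop': 1.15, 'classroom': 1.11})
```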
 
Screwed up TFLOPS for the M4's, plugged in wrong clockspeed.

Sigh ... I might've been right the first time ... or at least closer. It's also possible that the M4 Max has a different clock speed from the base M4, and I'm not sure about the Pro. Geekerwan reported that the base M4 had just under a 10% increase in clocks (to roughly 1.5GHz). But that was in the iPad Pro, I think, so it's also possible it's higher in the mini and the Pro. I noticed in their M4 Max video asitop reporting 1.577GHz, which would put the Max back at what I had originally, 16 TFLOPS. Not sure about the others.
 
When I was doing research on Apple/Nvidia performance in rendering I came across an interesting blog post from ChipsandCheese here:


In it they wondered whether aspects of Nvidia's architecture were partially responsible for Ampere's higher power-to-performance cost, one of those aspects being its "dual-issue" design, saying:

Another potential source of its power efficiency issues would be the large number of CUDA cores intrinsic to the design. With Ampere, Nvidia converted their separate integer pipeline added in Turing to a combined floating point/integer pipeline, essentially doubling the number of FP units per streaming multiprocessor. While this did result in a potential performance uplift of around 30% per SM, it also presents challenges for keeping the large number of cores occupied.[21] Ampere has been shown to exhibit lower shader utilization and more uneven workload distribution when compared to the earlier Pascal architecture.[22] These issues with occupancy resulting from a large core count are ones Ampere shares with other high core-count and power-hungry architectures such as Vega; like Vega, Ampere appears to be a better fit for compute-heavy workloads than many gaming workloads.

Now, there wasn't a supporting figure for this up-to-30% claim (though @name99 also repeated this number in his description of Nvidia's pseudo-dual-issue), so I decided to test it myself! I used the 3DMark search, since there are multiple tests and you can search a specific GPU and CPU, even down to GPU core/memory clocks, for different memory bandwidths and compute ... which I take advantage of below.

Some caveats: for this particular part I avoid ray tracing tests, because the ray tracing cores evolved significantly between Nvidia generations (I will use them later), and 2K tests like Steel Nomad Light degrade in performance per TFLOPS as the GPUs get bigger far more significantly than 4K tests like Steel Nomad (the latter of which will also be more sensitive to bandwidth). I tested this myself and found that, moving from a 4060 to a 4090, Steel Nomad Light loses about 12% more perf/TFLOPS than Steel Nomad, and the difference would probably have been greater if the 4090's bandwidth had been bigger (I have 5080/5090 results for Steel Nomad, but not Steel Nomad Light, which will confirm). So to alleviate these issues I try to match GPUs as best I can by bandwidth and compute power (though when comparing pre-Ampere and post-Ampere GPUs I try to match the pre-Ampere GPU with a post-Ampere GPU that is double the TFLOPS).

To calculate the performance uplift from Nvidia's "dual-issue" I use 2 x (perf/TFLOPS of the 40-series GPU) / (perf/TFLOPS of the 20-series GPU). If Nvidia's scheme gave the full uplift, you'd get a result of 2; no uplift, 1; and 30% (the expectation based on the earlier research), 1.3. The Steel Nomad tests were both DX12 for the Nvidia GPUs. I selected the median result for each GPU.

| GPU | Wild Life Extreme | Steel Nomad Light | Steel Nomad | Bandwidth (GB/s) | TFLOPS |
| --- | --- | --- | --- | --- | --- |
| 4090 mobile | 38365 | 18194 | 4439 | 576.4 | 33.3 |
| 4060 | 17980 | 9408 | 2088 | 272 | 15.11 |
| 2080 Ti (overclocked) | 28810 | 13498 | 3493 | 616 | 15.5 |
| 2060 (overclocked) | 15909 | 7742 | 1788 | 336 | 7.78 |

The 4090 mobile result was a slight overclock, as stock numbers either didn't exist or were of poor quality. The CPU can also come into play, although these are pretty intense tests (especially the two Steel Nomads, although Wild Life Extreme is still 4K). The 4090 mobile will be compared with the overclocked 2080 Ti, while the 2060 will be compared with the 4060.
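For anyone who wants to reproduce the ratio calculation, here's a short sketch using the values in the table above (the stock-versus-overclocked caveats already mentioned still apply):

```python
# Median 3DMark graphics scores and rated TFLOPS from the table above
gpus = {
    "4090 mobile":  {"wle": 38365, "snl": 18194, "sn": 4439, "tflops": 33.3},
    "2080 Ti (OC)": {"wle": 28810, "snl": 13498, "sn": 3493, "tflops": 15.5},
    "4060":         {"wle": 17980, "snl": 9408,  "sn": 2088, "tflops": 15.11},
    "2060 (OC)":    {"wle": 15909, "snl": 7742,  "sn": 1788, "tflops": 7.78},
}

def dual_issue_uplift(newer, older, test):
    """2 x (perf/TFLOPS of the 40-series card) / (perf/TFLOPS of the 20-series card).
    2.0 = the doubled FP32 pipes deliver in full, 1.0 = they deliver nothing."""
    n, o = gpus[newer], gpus[older]
    return round(2 * (n[test] / n["tflops"]) / (o[test] / o["tflops"]), 2)

for test in ("wle", "snl", "sn"):
    print(test,
          dual_issue_uplift("4090 mobile", "2080 Ti (OC)", test),
          dual_issue_uplift("4060", "2060 (OC)", test))
# -> wle 1.24 / 1.16, snl 1.25 / 1.25, sn 1.18 / 1.2: roughly a 16-25% effective uplift
```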

Screenshot 2025-02-03 at 10.47.19 AM.png


Here we see achieved increases of about 16-25% across these tests. Thus expecting an uplift of around 30% was not far off the mark, and it may be reached depending on the GPUs compared and the test (there's a high degree of variance in these tests). Interestingly, Steel Nomad Light achieved the best uplift of the three tests despite being a 2K test, while the two 4K tests had worse increases, even though Steel Nomad Light is a much harder-hitting test than Wild Life Extreme. Since two of the three tests are also available on the Mac (sadly there is no native full Steel Nomad), I decided to repeat the above analysis, matching M4 GPUs with the above 4000-series GPUs. Now, the M4 has dual issue, but not double FP32 pipes, amongst many other differences of course (TBDR design, dynamic cache, etc.). So I wanted to see how they compared. Since the 3DMark search doesn't cover Macs, I used NotebookCheck results.

| Chip | Wild Life Extreme | Steel Nomad Light | Bandwidth (GB/s) | TFLOPS |
| --- | --- | --- | --- | --- |
| M4 Max (40-core) | 37434 | 13989 | 546 | 16.1 |
| M4 Pro (20-core) | 19314.5 | 7795 | 273 | 8.1 |
| M4 Pro (16-core) | 16066.5 | 6681 | 273 | 6.46 |

The M4 Max will be compared with the 4090 mobile while the two M4 Pros will be matched with the 4060.

Screenshot 2025-02-03 at 10.53.33 AM.png


Fascinatingly, the M4 GPU behaves almost exactly like a 2000-series GPU in Steel Nomad Light, but very, very differently in Wild Life Extreme. In the former, the two M4 Pro GPUs straddle the perf per TFLOPS of the overclocked RTX 2060, while the M4 Max gets almost identical perf per TFLOPS to the overclocked 2080 Ti. Both 2000-series GPUs have better bandwidth than their respective M4 GPUs. In Wild Life Extreme, though, the ratio is below one! That means the M4 GPU is delivering more than double the performance per TFLOPS of the comparative 4000-series GPU. I also checked a similar benchmark, GFXBench 4K Aztec Ruins High Tier Offscreen, and got results similar to Wild Life Extreme: ratios of 0.84 (overclocked 4090 mobile vs M4 Max) to 1.05 (4060* vs M4 Pro 20-core; *result from the GFXBench website, which may not be at stock; the 4090 mobile definitely wasn't, but I knew its configuration from the NotebookCheck website). Now, I'm not an expert in these engines, but something different is happening with these older, less intense engines, even though they are at 4K, that isn't happening with the 2K Steel Nomad Light. If anyone has any ideas, I'd love to hear them. It might also be fun to test non-native Steel Nomad, but I don't have the relevant Macs. Even though Wild Life Extreme and Aztec High 4K are both 4K tests, they are older, lighter tests ... it's possible the CPU has more influence, it's possible they made more optimizations for mobile, but I'm not sure what those would be or why the 4000-series wouldn't benefit.

I also took a look at ray tracing. I already knew that the 2000-series GPUs would fare much more poorly against the 4000-series in ray tracing tests, so I didn't compare them, but I wanted to see how the results differed from the above for the Macs as a way to gauge the effectiveness of the ray tracing cores in the Apple GPUs. Unfortunately, for the life of me I couldn't find M4 Max Solar Bay results! Extremely annoying. However, I got results for the M4 Pro to compare against the 4060, with ratios of 1.2 and 1.3 for the 16-core and 20-core M4 Pro respectively. While I couldn't control Blender results as well as I could 3DMark, I looked at them as well. Assuming the median Blender results were close to stock and my results were reasonably close to the median, I got increases ranging from 0.98 (Junkshop, 4060 vs 16-core M4 Pro) to 1.39 (Monster, 4090 mobile vs 40-core M4 Max), with most being in the 1.2-1.3 range. Combined with the Steel Nomad results above, I would say the ray tracing cores certainly don't hamper the M4's performance with respect to the RTX 4000-series GPUs. Apple's ray tracing cores appear at least comparable to Nvidia's.

I did play around with Geekbench 5 and 6 a little with CUDA/OpenCL and Metal/OpenCL and found similar ratios in the top line numbers but didn't look at subtests. I probably won't get to that but at least on average there weren't any glaring differences with Geekbench compute tasks interestingly.

So there are two interesting corollary topics from this analysis:

1) how to compare Nvidia and Apple GPUs?

2) should Apple adopt double-piped FP32 graphics cores?

With regards to the first point, I think there is sometimes confusion over how an Apple GPU can compete with an Nvidia GPU that is sometimes twice its rated TFLOPS, and here we see that in fact, depending on the test and the two GPUs in question, the Nvidia GPU sometimes performs as if it had only 0.5-0.7x its rated TFLOPS relative to the Apple GPU. So a 40-core M4 Max rated at ~16 TFLOPS could be comparable with an Nvidia GPU rated anywhere from 23 TFLOPS (16/0.7) to 32 TFLOPS (16/0.5), and that's not actually unreasonable. With the exception of Wild Life Extreme and Aztec Ruins, where Apple does incredibly well (0.5 and lower), it's actually similar to how older Nvidia GPUs compare to the newer ones (in non-ray-tracing workloads).

With respect to the second point, @leman and I were discussing this possibility and I was quite in favor of it, but I have to admit I've cooled somewhat based on this analysis. On the one hand, it makes sense for Apple to adopt dual FP32 pipes per core: doubling FP32 per core increases performance, doesn't cost much if any silicon die area, and Apple already has a dual-issue design, it just doesn't have two FP32 pipes. Intel, interestingly, has a triple-issue design but also doesn't double the FP32 pipes; maybe they will later, but it is noteworthy that neither Apple nor Intel doubled FP32 pipes despite having the ability to issue multiple instructions per clock. As ChipsandCheese says, it's possible occupancy and other architectural issues become more complicated. AMD certainly suffers here as well. I'm not going to provide a full analysis here, but the 890M has a dual-issue design: without dual issue it's roughly a 6 TFLOPS design that, when dual issuing, can be 12 TFLOPS, but it is very temperamental. More here on the 890M in particular. But in performance metrics the 890M can struggle versus the base M4, even in non-native games like CP2077 (coming soon to the Mac), and it often has the power profile of the M4 Pro. Similarly, Strix Halo looks to be roughly at the M4 Pro level (maybe a bit better) despite having 40 cores and presumably a large amount of theoretical TFLOPS. Lastly, despite having the same TFLOPS rating, the 890M is also seemingly not as performant in many of these benchmarks and games as the older 6500 XT, which was RDNA 2.

Now, as aforementioned, Apple has a different dual-issue design. If they adopt double FP32 pipes per core, they may not suffer from whatever problems are holding Nvidia/AMD back, and they might get better performance increases per additional TFLOPS. They may not get the full doubling, but they might do better. However, one of the key drivers of this dual-issue design is to increase TFLOPS without impacting die size. That doesn't necessarily stop power increases, and given the above, the power increases may be bigger than the performance increases! Apple, not having to sell its GPUs as a separate part to OEMs and users, may decide that if they need to increase TFLOPS, they'll just increase the number of cores and eat the silicon costs.
 
I could have sworn Chips and Cheese had an article about AMD's pseudo dual issue where the compiler only did it in certain circumstances. It made me wonder if both they and Nvidia are having compiler issues.
 
They do. It's the link where I talk about the dual issue being temperamental:


They might have another one. With Nvidia, it seems to be less about compiler problems and more multifactorial (see the first link in my post above and the second reference in my quote from them; I repeat that second link near the end, about architectural issues becoming more complicated partly because of the dual FP32 pipes):



Like AMD, they may also have problems properly extracting instruction-level parallelism to reliably issue dual FP32 instructions. I believe this is not the first generation in which AMD/Nvidia have tried this; both stopped and have now gone back to try again (I'm more sure that AMD did this once before, less sure about Nvidia). So true dual issue seems to be an even harder problem for GPUs than for CPUs. AMD in particular with RDNA 3/3.5 ... I haven't done a detailed analysis like the one above, but I am so far even less impressed based on what I have seen with the 890M. Apple and Intel can issue multiple instructions per cycle, but neither tries to do so with multiple FP32 ops.

As for what to expect if Apple does go down this route themselves (multiple FP32 instructions per cycle), even if all they get is the same uplift as Nvidia, I suppose a 20-30% performance increase for the same die area is nothing to sneeze at (caveat: unsure about the power increase required though!). However, even if I wasn't expecting a full doubling, I guess I was expecting more from Nvidia's architecture. But again, Apple has a different architecture, so who knows?
 
It does make me excited to see C&C look into Blackwell as that has changed the supported ops split again. Just to get clarification on why the architecture change hasn't shown the expected performance improvements. I am also sad that C&C doesn't do a deep dive into how AS works (especially in comparison).
 
They did a short one for the M2 but obviously the M3 and M4 had substantial changes and it would be great to get something similar, if not more in depth for them:

 