Apple GPUs speculation: TBDR, RT and the benefit of wide ALUs

theorist9 · Oct 16, 2020

leman said:
....Since compute tasks are usually large (you don't start a GPU compute on just 100 work items, it's usually multiple or even hundreds of thousands), making your architecture as wide as possible works well — you can get many more ALUs in without wasting the die space on controlling logic. So you get better performance at lower power investment....

...Now, for graphic, the situation is a bit different....With smaller triangles, you won't be able to collect enough pixels to keep reasonable GPU utilization and all this ALU power goes to waste...

Because of the above considerations, both AMD and Nvidia settled down on an ALU width of 32 values (32-wide SIMD). This appears to be a sweet spot between managing ALU occupation and conserving die space.

If I'm summarizing your argument correctly, wider ALU's are better for GPU computing but, for graphics (at least without TBDR), they can go to waste because of the small-triangle problem. Hence NVIDIA and AMD have found a 32-bit ALU width is the best compromise for GPUs that are used for both graphical and computing tasks.

But if graphical inefficiencies are in fact the principal reason NVIDIA and AMD don't use wider ALU's in their graphical GPU's, then we would find wider ALU's in their non-graphical datacenter GPU's. Is this case?

jeanlain · Oct 17, 2020

leman said:
Those are different things, you can have more memory channels without increasing RAM... anyway, GB4 (which has memory benchmarks) now has some new iPhone and iPad benchmarks and RAM bandwidth is identical. So must be something else.

I'm probably saying something stupid, but could the memory bandwidth differ between the CPU and GPU? I know they share the same memory pool, but could the "connections" to that pool differ?

leman · Oct 17, 2020

theorist9 said:
If I'm summarizing your argument correctly, wider ALU's are better for GPU computing but, for graphics (at least without TBDR), they can go to waste because of the small-triangle problem. Hence NVIDIA and AMD have found a 32-bit ALU width is the best compromise for GPUs that are used for both graphical and computing tasks.

But if graphical inefficiencies are in fact the principal reason NVIDIA and AMD don't use wider ALU's in their graphical GPU's, then we would find wider ALU's in their non-graphical datacenter GPU's. Is this case?

I don't think that "better" is a correct term. It's more a question of "what can one get away with"? For raw performance, independent scalar units are always the best, because they allow 100% hardware utilization in every case. But the transistor overhead makes them not feasible. Wider ALUs allow chip manufacturers to deliver more performance by saving on transistors required for dispatch, logic, synchronization etc. — but wider ALUs also suffer from control flow divergence.

Getting back to your question, I don't think that wider ALUs are better for compute — I do think however that compute tasks can still perform reasonably well on wider ALUs. Forward-shaded graphics, not so much (because of small triangles), unless you use something like TBDR to coalesce your shader invocations. And wider ALUs potentially means more performance without having to significantly increase power consumption or chip size.

Anyway, given the discrepancy in new iPhone and iPad scores, I am getting more and more skeptical about this whole idea. I have a strong suspicion that we simply have a bug in Geekbench and it messes up compute shader setup somehow..

jeanlain said:
I'm probably saying something stupid, but could the memory bandwidth differ between the CPU and GPU? I know they share the same memory pool, but could the "connections" to that pool differ?

No idea, but wouldn't that suggest that the iPad Air and iPhone 12 are using different chips? This kind of change has to be implemented inside the SoC. I can totally imagine that the GPU has a "wider" access to the system level cache (no idea whether it's actually technically feasible), but we should see the same results for all A14-equipped devices.

Zackmd1 · Oct 17, 2020

@leman

Could Apple be using LPDDR4 for the iPhone and LPDDR5 for the iPad air? Would that help explain some of the differences? Maybe the iPhone uses LPDDR4 as another cost savings measure that the iPad Air was not constrained to?

leman · Oct 17, 2020

Zackmd1 said:
@leman

Could Apple be using LPDDR4 for the iPhone and LPDDR5 for the iPad air? Would that help explain some of the differences? Maybe the iPhone uses LPDDR4 as another cost savings measure that the iPad Air was not constrained to?

Memory bandwidth as reported by Geekbench 4 so far is identical and consistent with LPDDR5 speeds.

theorist9 · Oct 17, 2020

leman said:
I don't think that "better" is a correct term. It's more a question of "what can one get away with"? For raw performance, independent scalar units are always the best, because they allow 100% hardware utilization in every case. But the transistor overhead makes them not feasible. Wider ALUs allow chip manufacturers to deliver more performance by saving on transistors required for dispatch, logic, synchronization etc. — but wider ALUs also suffer from control flow divergence.

Getting back to your question, I don't think that wider ALUs are better for compute ...

But "better" was your word: You said that with wider ALU's, "you get better performance at lower power investment". I didn't see anywhere in your original argument where you provided a non-graphical downside to wider ALU's:

"Since compute tasks are usually large (you don't start a GPU compute on just 100 work items, it's usually multiple or even hundreds of thousands), making your architecture as wide as possible works well — you can get many more ALUs in without wasting the die space on controlling logic. So you get better performance at lower power investment."

I.e., this was the logical structure of your argument:
1) For compute, wider ALU's are better. You didn't provide any downsides, in your original argument, for wide ALU's for compute.
2) For graphics, however, wider ALU's can lead to inefficiencies (unless you have TBDR).
3) Thus, because of graphics considerations, AMD and NVIDIA (which don't have TBDR) have chosen to compromise with an intermediate ALU width.
4) Consequently, if (because of TBDR)Apple doesn't have this graphical constraint, there's no reason for them not to use wide ALU's.

So I think my analysis of your argument (as you stated it) was correct--that, if there are no significant downsides to wider ALU's for compute then, logically, one would expect to see wider ALU's in non-graphical GPU's from NVIDIA and AMD. If that's not the case (i.e., if even non-graphical GPU's don't have wider ALU's), then your argument (as stated) can't be correct: there must be some other (non-graphical) reason wider ALU's aren't used by NVIDIA and AMD. And that non-graphical reason could apply to Apple as well. A possible non-graphical reason is given in your follow-up post: control flow divergence.

jeanlain · Oct 17, 2020

Don't AMD/Nvidia reuse some design paradigms for compute and pro/gaming GPUs? Using the same ALU width could just represents the tradeoff between optimal performance and ressources spend in GPU development.

diamond.g · Oct 17, 2020

jeanlain said:
Don't AMD/Nvidia reuse some design paradigms for compute and pro/gaming GPUs? Using the same ALU width could just represents the tradeoff between optimal performance and ressources spend in GPU development.

We haven’t seen it yet, but CDNA is supposed to be a departure from RDNA2, which is why AMD hadn’t formally launched replacement FirePro cards with RDNA architecture.

Edit I mean desktop Radeon Pro cards. The Mobile units are one offs for Apple as far as I can tell.

theorist9 · Oct 17, 2020

jeanlain said:
Don't AMD/Nvidia reuse some design paradigms for compute and pro/gaming GPUs? Using the same ALU width could just represents the tradeoff between optimal performance and ressources spend in GPU development.

That's a good point, but given how optimized their datacenter GPUs are for compute, it seems unlikely they would compromise their design (if that were a compromise) just to re-use the architechture from their graphical GPUs.

leman · Oct 18, 2020

theorist9 said:
But "better" was your word: You said that with wider ALU's, "you get better performance at lower power investment". I didn't see anywhere in your original argument where you provided a non-graphical downside to wider ALU's:

My point was simply that wider ALUs are a straightforward way to offer higher theoretical peak performance while keeping power consumption and chip size in check. My (certainly challengeable) speculation is that divergence is less of a problem for compute shaders than it is for graphics where tasks have certain predictable granularity to them. Unfortunately, as we don't have wide GPUs on the market (I don't think that Imagination's A-series are in any customer product right now), this is not something we can test empirically.

As to why AMD or Nvidia don't use wider designs for their professional GPUs, I think it's primarily because of a) they use the same architecture (with minor tweaks) for both pro and gaming hardware and b) their pro hardware is still graphically oriented. Besides, recent investments were about improving ML performance which requires different kind of hardware. I am curious about AMD's upcoming CNDA because it is expected to be compute-oriented architecture. What kind of changes will we see?

And finally, I want to be perfectly clear that all I wrote is just speculation. In fact, after iPhone A14 benchmarks appeared, I am very doubtful that my speculation is accurate. The disparity in compute scores between the iPad and the iPhone don't make any sense. There has to be a different factor.

leman · Oct 23, 2020

Anyway, Apple has today released a tech note detailing the capabilities of the A14 GPUs and there seem to be little surprises. Just some internal optimizations and a bunch of new features that make them more compatible to the desktop GPUs (most likely in preparation for Apple Silicon Macs). Based on their examples I doubt there are any fundamental micro-architectural changes. SIMD width still seems to be 32.

diamond.g · Oct 23, 2020

Still no hardware RT then? Do we think they are pulling an Intel Xe here? Not all of the Xe units support hardware RT, but the hardware block is there to be added in higher end configs. Can Apple do the same here? Have a desktop "A14" that includes hardware RT?

leman · Oct 23, 2020

diamond.g said:
Still no hardware RT then? Do we think they are pulling an Intel Xe here? Not all of the Xe units support hardware RT, but the hardware block is there to be added in higher end configs. Can Apple do the same here? Have a desktop "A14" that includes hardware RT?

No idea, someone would need to check the metal capabilities for whether RT is supported. I doubt it though — they would have most certainly mentioned it in the tech note.

RT on Macs is probably coming with AMD GPUs first and then with A14+ or whatever it's going to be next year. Metal support is already there.

diamond.g · Oct 23, 2020

leman said:
No idea, someone would need to check the metal capabilities for whether RT is supported. I doubt it though — they would have most certainly mentioned it in the tech note.

RT on Macs is probably coming with AMD GPUs first and then with A14+ or whatever it's going to be next year. Metal support is already there.

So metal supports software fallback for RT? Or do they have you do a “wisdom check” before you can use it?

leman · Oct 23, 2020

diamond.g said:
So metal supports software fallback for RT? Or do they have you do a “wisdom check” before you can use it?

Haven't played around with it yet, so can't say for sure

If I understand it correctly, RT is currently implemented on all compatible models using compute shaders.

Homy · Oct 23, 2020

Discover Metal enhancements for A14 Bionic - Tech Talks - Videos - Apple Developer

Explore how Metal is bringing sophisticated rendering and powerful compute features to A14 Bionic. We'll take you through the Metal...

developer.apple.com

leman · Oct 25, 2020

Now that there are a lot of results on the Geekbench website, I am getting more and more convinced that it's a bug in GB itself. Compare the score differences in SFFT test for example — which has identical description for GB4 and GB5:

iPad Air (4th generation) vs iPhone 12 Pro - Geekbench

browser.geekbench.com

iPad13,1 vs iPhone13,3 - Geekbench

browser.geekbench.com

They probably either messed up the timing or the setup.

jeanlain · Oct 25, 2020

What kind of bug would make iPads appear faster than iPhones equipped with the same SoCs, and on both geekbench versions?

EntropyQ3 · Oct 25, 2020

jeanlain said:
What kind of bug would make iPads appear faster than iPhones equipped with the same SoCs, and on both geekbench versions?

That’s hard to imagine. It’s more likely that there are hardware related wrinkles that we are as yet unaware of. Thermal throttling conditions, memory specs, .... something.

leman · Oct 25, 2020

EntropyQ3 said:
That’s hard to imagine. It’s more likely that there are hardware related wrinkles that we are as yet unaware of. Thermal throttling conditions, memory specs, .... something.

That would result in as much as 100% difference in test scores on same hardware, same microbenchmark between GB4 and GB5? I find that very unlikely... even if they changed the benchmark code it shouldn’t show such high disparity.

EntropyQ3 · Oct 25, 2020

leman said:
That would result in as much as 100% difference in test scores on same hardware, same microbenchmark between GB4 and GB5? I find that very unlikely... even if they changed the benchmark code it shouldn’t show such high disparity.

Yeah, there is a story there, but I wouldn’t dare speculate as to what it is. Another theory could be local hot-spotting on the die that doesn’t kick in on the iPad due to better heat dissipation, as that could hit specific subtests. It actually isn’t the same hardware, after all, just the same SoC.
The anomaly has been noted, so I guess we’ll get an explanation at some point, but I don’t think we get much further by speculating. If someone notifies John Poole, he will probably check the code for correctness, but if no such inexplicable results have turned up before, it’s difficult to see why the same binary running on the A14 should behave differently.

deconstruct60 · Oct 25, 2020

jeanlain said:
What kind of bug would make iPads appear faster than iPhones equipped with the same SoCs, and on both geekbench versions?

It could be a 'feature'. iPhones and iPads may not be using the same function units for acceleration/computation in the two systems. The binary code that is produced from Geekbench's code is not dynamically the whole code that is executed. If the calls to Apple's libraries branch into two different function units on iPhone and iPad then results don't have to be the same at all. The iPhone branch probably conserves power dissipation . The iPad may not ( there is no 5G radio subsystem on the iPad. ) . The iPhone has a logic board 'sandwich' and perhaps sometimes chooses more conservative options. Apple likes to leverage "race to sleep" but sometimes really do need to use more resources to go faster.

For example if dispatching small computational fragements of the benchmark that are essentially matrix multiple off to the ML subsystem on one system and on the other handling it with a different part of the SoC. Bascially one issue is that there are multiple sets of cores here. It is not just a legacy CPU with only one kind of core present. If dispatch to different core types then shouldn't expect the same results.

PortoMavericks · Oct 25, 2020

The world is moving towards ML upscaling.

The dev can simply render scenes at much, much lower resolutions while maintaining the same scene complexity. Also, they don’t have to deal with engine optimizations to be TBDR compatible.

It means rendering internally a scene at 720p targeting a 1080p output.

To put it simply, Apple is going this route because they lack IPs and Imagination Technologies is a company willing to share their GPU IPs And Apple wants to keep control on everything. AMD has a custom design division but they don’t work that way, they want to design the chip.

I’m sure if they could they’d license all the AMD stuff.

JMacHack · Oct 25, 2020

diamond.g said:
Still no hardware RT then? Do we think they are pulling an Intel Xe here? Not all of the Xe units support hardware RT, but the hardware block is there to be added in higher end configs. Can Apple do the same here? Have a desktop "A14" that includes hardware RT?

"When it's ready™" would be Apple's reply. I don't think hardware RT will come with the first round of ASi. Maybe when they ship their dGPUs. HWRT is still in the wild west phase.

PortoMavericks said:
I’m sure if they could they’d license all the AMD stuff.

From what I've read and (obligatory NOT AN EXPERT) heard, both AMD and NVidia's RT and upscaling tech is based on Vulkan and DirectX libraries. From what I understand that means it's off-limits to Apple.

Also, if my experience with AI upscaling is any indication, there's going to be a market for diehards who want non-upsacled imagery. Upscaling just doesn't work well with some things.

leman · Oct 26, 2020

PortoMavericks said:
The world is moving towards ML upscaling.

The dev can simply render scenes at much, much lower resolutions while maintaining the same scene complexity. Also, they don’t have to deal with engine optimizations to be TBDR compatible.

It means rendering internally a scene at 720p targeting a 1080p output.

To put it simply, Apple is going this route because they lack IPs and Imagination Technologies is a company willing to share their GPU IPs And Apple wants to keep control on everything. AMD has a custom design division but they don’t work that way, they want to design the chip.

ML upscaling is not a magical solution. It requires explicit game support (as the game needs to provide additional data to the NN), it requires the training infrastructure, and it requires local ML performance. Apples ML accelerators are fast, but they are optimized for mobile use — tensor cores in something like a RTX 2060 are still around 5 times faster (and a RTX 3080 is a whopping 25 times faster for ML). And of course, you need a lot of training data, which Nvidia has been collecting for years. Apple — not so much.

Generally I'd wager that tweaking your engine to take explicit advantage of TBDR is not more complicated than supporting ML upscaling. And you don't need to deal with the time-consuming and potentially expensive training step.

Not to mention that ML upscaling has little sense for the resolutions you are talking about. A14 is already offers 2x performance per watt compared to Nvidia's best. Apple GPUs won't have any problems dealing with 1080 resolutions. Where ML upscaling becomes really interesting is getting to resolution that GPU can't realistically do on it's own. But for that you need ML performance and bandwidth... which means needing a beefy GPU in the first place. I am not sure how this tech is suitable for entry level, I have a suspicion that it will stay reserved for high-end only for some time.

JMacHack said:
From what I've read and (obligatory NOT AN EXPERT) heard, both AMD and NVidia's RT and upscaling tech is based on Vulkan and DirectX libraries. From what I understand that means it's off-limits to Apple.

API doesn't matter. Nvidia's upscaling technology is based on a) custom API extensions and b) data for training the NN model and c) excellent ML performance of their GPUs. They use supercomputers to gather high-resolution game image data and build the upscaling network. They also seem to have collected plenty of data to quickly support new games.

But there is nothing special about any of this. If you want, you can build an autoencoder network for your iOS game, train it with high-res output and roll your own ML upscaling. Not sure whether it will be worth the effort though.

By the way, this kind of stuff is obligatory for RT. Ray-tracing produces "grainy" images (since you are casting individual rays), you need to use a filter that will produce a smooth image.

Apple GPUs speculation: TBDR, RT and the benefit of wide ALUs

macrumors 601

macrumors 68020

macrumors Core

macrumors 6502a

macrumors Core

macrumors 601

macrumors 68020

macrumors G5

macrumors 601

macrumors Core

macrumors Core

macrumors G5

macrumors Core

macrumors G5

macrumors Core

macrumors 68030

macrumors Core

macrumors 68020

macrumors 6502a

macrumors Core

macrumors 6502a

macrumors G5

macrumors 6502

Suspended

macrumors Core

Our Staff