Update: after A14 benchmarks for iPhone and iPad became more widely available, it is fairly clear that A14 did not make any deep changes to its ALU configuration. This pretty much throws my speculation out the window.
In light of A14's much improved compute benchmark scores, I was thinking a bit and had a curious thought. If I am even remotely correct, there might just be a way for Apple to utilize their TBDR architecture to get an unexpected advantage. Of course, this is all just academic speculation, but since a lot of folks here are interested in GPU performance and we all love speculation, I thought it would be interesting to share.
Here is a TL;DR: Apple could be moving to very wide GPU ALUs, something they can only do because of their TBDR architecture, in order to secure performance-per-watt superiority in compute and ray tracing over Nvidia and AMD (see also the Imagination A-series). More detailed reasoning follows below.
First, let's talk compute. GPUs are fast because they are wide parallel processors. "Wide" means that they process multiple data items at the same time within a single program. This saves valuable silicon space, because you only need one instruction dispatcher/scheduler to feed multiple computation units. Since compute tasks are usually large (you don't launch a GPU compute job on just 100 work items; it's usually many thousands or even hundreds of thousands), making your architecture as wide as possible works well: you can fit in many more ALUs without wasting die space on control logic, so you get better performance for a lower power investment.
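To make the "wide and parallel" point concrete, here is a minimal Swift + Metal sketch that dispatches a million work items and sizes the threadgroups as a multiple of the hardware SIMD width, so every ALU lane has something to do. The `scaleKernel` shader is a made-up example and error handling is skipped; this just shows the shape of a typical compute dispatch, nothing A14-specific.

```swift
// Minimal sketch of a "wide" compute dispatch in Swift + Metal.
// scaleKernel is a made-up example; force-unwraps and try! keep the sketch short.
import Metal

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!

// Trivial kernel compiled at runtime, just so we have a pipeline to dispatch.
let source = """
#include <metal_stdlib>
using namespace metal;
kernel void scaleKernel(device float *data [[buffer(0)]],
                        uint id [[thread_position_in_grid]]) {
    data[id] *= 2.0f;
}
"""
let library = try! device.makeLibrary(source: source, options: nil)
let pipeline = try! device.makeComputePipelineState(function: library.makeFunction(name: "scaleKernel")!)

// Compute workloads are big: a million work items is nothing unusual.
let count = 1_000_000
let buffer = device.makeBuffer(length: count * MemoryLayout<Float>.stride,
                               options: .storageModeShared)!

let cmd = queue.makeCommandBuffer()!
let enc = cmd.makeComputeCommandEncoder()!
enc.setComputePipelineState(pipeline)
enc.setBuffer(buffer, offset: 0, index: 0)

// Size the threadgroup as a multiple of the SIMD width so that every lane of
// every ALU gets a work item; with this much work, wide hardware stays busy.
let simdWidth = pipeline.threadExecutionWidth
let groupSize = min(simdWidth * 4, pipeline.maxTotalThreadsPerThreadgroup)
enc.dispatchThreads(MTLSize(width: count, height: 1, depth: 1),
                    threadsPerThreadgroup: MTLSize(width: groupSize, height: 1, depth: 1))
enc.endEncoding()
cmd.commit()
cmd.waitUntilCompleted()
```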
Now, for graphics, the situation is a bit different. Graphics is also about processing multiple values at the same time: after all, you are drawing triangles which consist of multiple pixels, and you want to execute the same shader program for each of those pixels. But if your triangles are small and your GPU is very wide, you won't be able to utilize the hardware to 100%. The problem is that all lanes of a GPU ALU have to execute the same code. If you have an ALU that can process 64 items at the same time but you only need to process 6, the remaining 58 slots won't do anything useful. With small triangles, you can't collect enough pixels to keep GPU utilization reasonable, and all that ALU power goes to waste.
Because of the above considerations, both AMD and Nvidia have settled on an ALU width of 32 values (32-wide SIMD). This appears to be the sweet spot between keeping ALU occupancy up and conserving die space. With this configuration, a single ALU can process something like an 8x4 pixel block in one go (which should work reasonably well for small triangles) while still being fine for compute. Apple has also been using 32-wide SIMD in their GPUs.
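To put some numbers on the small-triangle problem, here is a toy utilization model in Swift. It assumes the hardware always issues full SIMD waves and that a wave only contains pixels from one triangle, which is a simplification of how real GPUs pack work, but it shows why going wider hurts when triangles are tiny:

```swift
// Toy model: what fraction of a SIMD unit does useful work when a triangle
// covers only a handful of pixels? (Assumes full waves and no cross-triangle
// packing, which is a simplification.)
func laneUtilization(pixels: Int, simdWidth: Int) -> Double {
    let waves = (pixels + simdWidth - 1) / simdWidth   // a partially filled wave still occupies the whole ALU
    return Double(pixels) / Double(waves * simdWidth)
}

for width in [32, 64, 128] {
    let percent = Int(laneUtilization(pixels: 6, simdWidth: width) * 100)
    print("SIMD\(width): a 6-pixel triangle keeps \(percent)% of the lanes busy")
}
// Prints roughly: SIMD32 -> 18%, SIMD64 -> 9%, SIMD128 -> 4%
```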
So basically, 32-wide SIMD is the optimal SIMD width for a general-purpose GPU, right? Well, maybe not... here is the funny thing. Since Apple uses TBDR, they don't really have the "small triangle" problem. They do not shade triangles directly like AMD or Nvidia do. They first collect the triangle visibility information per tile (usually 32×32 pixels) and then run the shader over the entire tile. So they could use much wider hardware without having to worry about ALU occupancy nearly as much. This is the idea behind the Imagination A-series IP announced last year (for those who don't know, Imagination is the company behind PowerVR and ultimately behind most of Apple's GPU IP): it is the only GPU IP currently on the market that uses 128-wide SIMD. Since Apple and Imagination have signed a new licensing contract, Apple designs could be using some of these ideas.
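The same toy model, but for a TBDR tile: once visibility for a whole 32×32 tile has been gathered, shading can be launched over 1024 fragments at once, so even a 128-wide ALU is trivially filled (assuming a fully covered tile; tiles at the edges of geometry will be sparser):

```swift
// Same toy model, TBDR-style: shade a whole 32x32 tile at once instead of
// triangle by triangle. (Assumes the tile is fully covered.)
let tilePixels = 32 * 32        // 1024 fragments queued up per tile
for width in [32, 64, 128] {
    let fullWaves = tilePixels / width
    print("SIMD\(width): \(fullWaves) completely full waves, 100% lane occupancy")
}
```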
In fact, we see A14 Geekbench compute scores having a commanding lead over the A13 with the same number of GPU cores, at the same power consumption. The only reasonable explanation for this is that A14 can do more work per core, which means either more ALUs, wider ALUs or higher clocks, and given the massive compute performance increase at the same power consumption, I'd bet on wider ALUs. Metal has an API to query the SIMD width of a pipeline, so I would be curious to see what it reports on the A14. I'd bet it's larger than 32.
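For anyone who wants to check on a real device, the property in question is `threadExecutionWidth` on `MTLComputePipelineState`. A rough sketch (the no-op kernel exists only so we have a pipeline to ask; the reported width can also depend on the kernel itself, so a single trivial kernel is only a hint):

```swift
// Query the SIMD width Metal reports for a trivial compute pipeline.
// On the Apple GPUs I have seen, this prints 32; the speculation above is that
// a wider design would report a larger value here.
import Metal

let device = MTLCreateSystemDefaultDevice()!
let source = """
#include <metal_stdlib>
using namespace metal;
kernel void noop(uint id [[thread_position_in_grid]]) {}
"""
let library = try! device.makeLibrary(source: source, options: nil)
let pipeline = try! device.makeComputePipelineState(function: library.makeFunction(name: "noop")!)

print("GPU: \(device.name)")
print("threadExecutionWidth: \(pipeline.threadExecutionWidth)")
```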
Another thing is ray tracing, which is even more interesting. Ray tracing poses some unique challenges for GPUs, one of which is memory access. When you are shading a triangle using the "classical" rasterization approach, you can more or less assume that neighboring pixels will access memory at similar locations (think textures: if pixel A reads the texture at location T, then pixel A+1 will probably read it at location T+1 or somewhere nearby). This allows GPUs to "bundle" their memory reads, which is the only way they can get reasonable performance. But in ray tracing, rays can bounce pretty much unpredictably, so you end up with fragmented memory reads that can kill GPU efficiency.

This is where TBDR can bring unique advantages over immediate-mode renderers as well. Imagination already has IP to coalesce rays in hardware, collecting rays that will access memory at similar locations. And one can use the on-chip tile buffer to reorder pixels in the processing pipeline to make sure that the ALUs are utilized as efficiently as possible; in fact, Apple mentioned that they already do this for some applications (I don't remember where, but I saw it in one of the WWDC videos). What I am trying to say is that because of their tile-based deferred architecture, in combination with Imagination IP, Apple could be in a unique position with regard to high-performance ray tracing. And while I doubt that A14 already has hardware ray tracing, we might see some surprises next year. A mobile GPU that outperforms an Nvidia RTX in RT? Who knows.
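To illustrate what "coalescing rays" means in practice, here is a little CPU-side Swift sketch of the idea. The real thing happens in dedicated hardware and the binning scheme here is completely made up for illustration: incoherent rays get sorted into coarse direction bins, and each bin then maps nicely onto a SIMD wave because its rays tend to walk similar parts of the scene and touch similar memory.

```swift
// CPU-side sketch of ray coalescing: bin rays by a coarse quantization of their
// direction so that each batch is memory-coherent. The binning scheme is made up
// purely for illustration.
import simd

struct Ray {
    var origin: SIMD3<Float>
    var direction: SIMD3<Float>   // assumed to be normalized
}

// Map a direction to one of 6 * 4 * 4 = 96 coarse bins: dominant axis plus two
// quantized components.
func coherenceBin(of ray: Ray) -> Int {
    let d = ray.direction
    let a = simd_abs(d)
    let major: Int
    if a.x >= a.y && a.x >= a.z { major = d.x >= 0 ? 0 : 1 }
    else if a.y >= a.z          { major = d.y >= 0 ? 2 : 3 }
    else                        { major = d.z >= 0 ? 4 : 5 }
    let u = min(Int((d.x * 0.5 + 0.5) * 4), 3)   // 2 bits
    let v = min(Int((d.y * 0.5 + 0.5) * 4), 3)   // 2 bits
    return major * 16 + u * 4 + v
}

// Regroup an incoherent stream of rays into coherent batches; each batch would
// then be handed to the traversal hardware as one wave-friendly chunk of work.
func coalesce(_ rays: [Ray]) -> [[Ray]] {
    var bins: [Int: [Ray]] = [:]
    for ray in rays {
        bins[coherenceBin(of: ray), default: []].append(ray)
    }
    return Array(bins.values)
}
```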
Some relevant links:
- https://browser.geekbench.com/v5/compute/compare/1587656?baseline=1581541
- https://www.anandtech.com/show/15156/imagination-announces-a-series-gpu-architecture
- https://www.imgtec.com/blog/coheren...racing-the-benefits-of-hardware-ray-tracking/