Update: after A14 benchmarks for iPhone and iPad became more widely available, it is fairly clear that A14 did not make any deep changes to its ALU configuration. This pretty much throws my speculation out the window.
In light of A14's much improved compute benchmark scores, I was thinking a bit and had a curious thought. If I am even remotely correct, there might just be a way for Apple to utilize their TBDR architecture to get an unexpected advantage. Of course, this is all just academic speculation, but since a lot of folks here are interested in GPU performance and we all love speculation, I thought it would be interesting to share.
Here is a TL;DR: Apple could be moving to very wide GPU ALUs, something they can only do because of their TBDR architecture, in order to secure performance-per-watt superiority in compute and ray tracing over Nvidia and AMD (see also the Imagination A-series). More detailed reasoning follows below.
First, let's talk compute. GPUs are fast because they are wide parallel processors. "Wide" means that they process multiple data items at the same time within a single program. This saves valuable silicon space, because you only need one instruction dispatcher/scheduler to feed multiple computation units. Since compute tasks are usually large (you don't launch a GPU compute job on just 100 work items; it's usually many thousands or even hundreds of thousands), making your architecture as wide as possible works well: you can fit in many more ALUs without wasting die space on control logic, so you get better performance for a lower power investment.
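To make the "wide and parallel" point concrete, here is a minimal Swift + Metal sketch that dispatches a million work items and sizes the threadgroups as a multiple of the hardware SIMD width, so every ALU lane has something to do. The `scaleKernel` shader is a made-up example and error handling is skipped; this just shows the shape of a typical compute dispatch, nothing A14-specific.

```swift
// Minimal sketch of a "wide" compute dispatch in Swift + Metal.
// scaleKernel is a made-up example; force-unwraps and try! keep the sketch short.
import Metal

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!

// Trivial kernel compiled at runtime, just so we have a pipeline to dispatch.
let source = """
#include <metal_stdlib>
using namespace metal;
kernel void scaleKernel(device float *data [[buffer(0)]],
                        uint id [[thread_position_in_grid]]) {
    data[id] *= 2.0f;
}
"""
let library = try! device.makeLibrary(source: source, options: nil)
let pipeline = try! device.makeComputePipelineState(function: library.makeFunction(name: "scaleKernel")!)

// Compute workloads are big: a million work items is nothing unusual.
let count = 1_000_000
let buffer = device.makeBuffer(length: count * MemoryLayout<Float>.stride,
                               options: .storageModeShared)!

let cmd = queue.makeCommandBuffer()!
let enc = cmd.makeComputeCommandEncoder()!
enc.setComputePipelineState(pipeline)
enc.setBuffer(buffer, offset: 0, index: 0)

// Size the threadgroup as a multiple of the SIMD width so that every lane of
// every ALU gets a work item; with this much work, wide hardware stays busy.
let simdWidth = pipeline.threadExecutionWidth
let groupSize = min(simdWidth * 4, pipeline.maxTotalThreadsPerThreadgroup)
enc.dispatchThreads(MTLSize(width: count, height: 1, depth: 1),
                    threadsPerThreadgroup: MTLSize(width: groupSize, height: 1, depth: 1))
enc.endEncoding()
cmd.commit()
cmd.waitUntilCompleted()
```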
Now, for graphics, the situation is a bit different. Graphics is also about processing multiple values at the same time: after all, you are drawing triangles which consist of multiple pixels, and you want to execute the same shader program for each of those pixels. But if your triangles are small and your GPU is very wide, you won't be able to utilize the hardware to 100%. The problem is that all lanes of a GPU ALU have to execute the same code. If you have an ALU that can process 64 items at the same time but you only need to process 6, the remaining 58 slots won't do anything useful. With small triangles, you can't collect enough pixels to keep GPU utilization reasonable, and all that ALU power goes to waste.
Because of the above considerations, both AMD and Nvidia have settled on an ALU width of 32 values (32-wide SIMD). This appears to be the sweet spot between keeping ALU occupancy up and conserving die space. With this configuration, a single ALU can process something like an 8x4 pixel block in one go (which should work reasonably well for small triangles) while still being fine for compute. Apple has also been using 32-wide SIMD in their GPUs.
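To put some numbers on the small-triangle problem, here is a toy utilization model in Swift. It assumes the hardware always issues full SIMD waves and that a wave only contains pixels from one triangle, which is a simplification of how real GPUs pack work, but it shows why going wider hurts when triangles are tiny:

```swift
// Toy model: what fraction of a SIMD unit does useful work when a triangle
// covers only a handful of pixels? (Assumes full waves and no cross-triangle
// packing, which is a simplification.)
func laneUtilization(pixels: Int, simdWidth: Int) -> Double {
    let waves = (pixels + simdWidth - 1) / simdWidth   // a partially filled wave still occupies the whole ALU
    return Double(pixels) / Double(waves * simdWidth)
}

for width in [32, 64, 128] {
    let percent = Int(laneUtilization(pixels: 6, simdWidth: width) * 100)
    print("SIMD\(width): a 6-pixel triangle keeps \(percent)% of the lanes busy")
}
// Prints roughly: SIMD32 -> 18%, SIMD64 -> 9%, SIMD128 -> 4%
```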
So basically, 32-wide SIMD is the optimal SIMD width for a general-purpose GPU, right? Well, maybe not... here is the funny thing. Since Apple uses TBDR, they don't really have the "small triangle" problem. They do not shade triangles directly like AMD or Nvidia do. They first collect the triangle visibility information per tile (usually 32×32 pixels) and then run the shader over the entire tile. So they could use much wider hardware without having to worry about ALU occupancy nearly as much. This is the idea behind the Imagination A-series IP announced last year (for those who don't know, Imagination is the company behind PowerVR and ultimately behind most of Apple's GPU IP): it is the only GPU IP currently on the market that uses 128-wide SIMD. Since Apple and Imagination have signed a new licensing contract, Apple designs could be using some of these ideas.
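The same toy model, but for a TBDR tile: once visibility for a whole 32×32 tile has been gathered, shading can be launched over 1024 fragments at once, so even a 128-wide ALU is trivially filled (assuming a fully covered tile; tiles at the edges of geometry will be sparser):

```swift
// Same toy model, TBDR-style: shade a whole 32x32 tile at once instead of
// triangle by triangle. (Assumes the tile is fully covered.)
let tilePixels = 32 * 32        // 1024 fragments queued up per tile
for width in [32, 64, 128] {
    let fullWaves = tilePixels / width
    print("SIMD\(width): \(fullWaves) completely full waves, 100% lane occupancy")
}
```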
In fact, we see A14 Geekbench compute scores having a commanding lead over the A13 with the same number of GPU cores, at the same power consumption. The only reasonable explanation for this is that A14 can do more work per core, which means either more ALUs, wider ALUs or higher clocks, and given the massive compute performance increase at the same power consumption, I'd bet on wider ALUs. Metal has an API to query the SIMD width of a pipeline, so I would be curious to see what it reports on the A14. I'd bet it's larger than 32.
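For anyone who wants to check on a real device, the property in question is `threadExecutionWidth` on `MTLComputePipelineState`. A rough sketch (the no-op kernel exists only so we have a pipeline to ask; the reported width can also depend on the kernel itself, so a single trivial kernel is only a hint):

```swift
// Query the SIMD width Metal reports for a trivial compute pipeline.
// On the Apple GPUs I have seen, this prints 32; the speculation above is that
// a wider design would report a larger value here.
import Metal

let device = MTLCreateSystemDefaultDevice()!
let source = """
#include <metal_stdlib>
using namespace metal;
kernel void noop(uint id [[thread_position_in_grid]]) {}
"""
let library = try! device.makeLibrary(source: source, options: nil)
let pipeline = try! device.makeComputePipelineState(function: library.makeFunction(name: "noop")!)

print("GPU: \(device.name)")
print("threadExecutionWidth: \(pipeline.threadExecutionWidth)")
```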
Another thing is ray tracing, which is even more interesting. Ray tracing poses some unique challenges for GPUs, one of which is memory access. When you are shading a triangle using the "classical" rasterization approach, you can more or less assume that neighboring pixels will access memory at similar locations (think textures: if pixel A reads the texture at location T, then pixel A+1 will probably read it at location T+1 or somewhere nearby). This allows GPUs to "bundle" their memory reads, which is the only way they can get reasonable performance. But in ray tracing, rays can bounce pretty much unpredictably, so you end up with fragmented memory reads that can kill GPU efficiency.

This is where TBDR can bring unique advantages over immediate-mode renderers as well. Imagination already has IP to coalesce rays in hardware, collecting rays that will access memory at similar locations. And one can use the on-chip tile buffer to reorder pixels in the processing pipeline to make sure that the ALUs are utilized as efficiently as possible; in fact, Apple mentioned that they already do this for some applications (I don't remember where, but I saw it in one of the WWDC videos). What I am trying to say is that because of their tile-based deferred architecture, in combination with Imagination IP, Apple could be in a unique position with regard to high-performance ray tracing. And while I doubt that A14 already has hardware ray tracing, we might see some surprises next year. A mobile GPU that outperforms an Nvidia RTX in RT? Who knows.
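To illustrate what "coalescing rays" means in practice, here is a little CPU-side Swift sketch of the idea. The real thing happens in dedicated hardware and the binning scheme here is completely made up for illustration: incoherent rays get sorted into coarse direction bins, and each bin then maps nicely onto a SIMD wave because its rays tend to walk similar parts of the scene and touch similar memory.

```swift
// CPU-side sketch of ray coalescing: bin rays by a coarse quantization of their
// direction so that each batch is memory-coherent. The binning scheme is made up
// purely for illustration.
import simd

struct Ray {
    var origin: SIMD3<Float>
    var direction: SIMD3<Float>   // assumed to be normalized
}

// Map a direction to one of 6 * 4 * 4 = 96 coarse bins: dominant axis plus two
// quantized components.
func coherenceBin(of ray: Ray) -> Int {
    let d = ray.direction
    let a = simd_abs(d)
    let major: Int
    if a.x >= a.y && a.x >= a.z { major = d.x >= 0 ? 0 : 1 }
    else if a.y >= a.z          { major = d.y >= 0 ? 2 : 3 }
    else                        { major = d.z >= 0 ? 4 : 5 }
    let u = min(Int((d.x * 0.5 + 0.5) * 4), 3)   // 2 bits
    let v = min(Int((d.y * 0.5 + 0.5) * 4), 3)   // 2 bits
    return major * 16 + u * 4 + v
}

// Regroup an incoherent stream of rays into coherent batches; each batch would
// then be handed to the traversal hardware as one wave-friendly chunk of work.
func coalesce(_ rays: [Ray]) -> [[Ray]] {
    var bins: [Int: [Ray]] = [:]
    for ray in rays {
        bins[coherenceBin(of: ray), default: []].append(ray)
    }
    return Array(bins.values)
}
```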
Some relevant links:
- https://browser.geekbench.com/v5/compute/compare/1587656?baseline=1581541
- https://www.anandtech.com/show/15156/imagination-announces-a-series-gpu-architecture
- https://www.imgtec.com/blog/coheren...racing-the-benefits-of-hardware-ray-tracking/