@crazy dave Excellent write-up with a lot of attention to detail! The GPU implementation details are unfortunately closely guarded secrets, so it can be very difficult to get a clear picture of what is going on.
The idea I personally find most compelling is that it boils down to data movement. I've seen it suggested that Nvidia does not provide enough parallel data path bandwidth to actually feed all the processing units they have. Instead, they rely on caching circuitry called the operand collector to optimize register file access. This scheme (if true) would mean that some of their ALUs sit idle at least some of the time, which results in poorer efficiency. Apple, by contrast, seems to take a different approach, trying to maximize computational efficiency.
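To make the idea concrete, here is a toy model (plain C++, nothing to do with the real hardware, bank count and register-to-bank mapping entirely made up) of a banked register file where each bank serves one read per cycle: a three-source FMA whose operands collide in the same bank needs several cycles to collect its inputs, and the ALU idles in the meantime.

```cpp
// Toy model of operand collection, NOT Nvidia's actual design.
// Each register bank serves one read per cycle; an FMA needs three sources,
// so bank conflicts stretch collection over multiple cycles while the ALU waits.
#include <algorithm>
#include <array>
#include <cstdio>
#include <vector>

int main() {
    const int kBanks = 4;  // illustrative bank count
    // Each instruction lists three source registers; bank = reg % kBanks.
    std::vector<std::array<int, 3>> fmas = {
        {0, 1, 2},   // three different banks -> 1 cycle to collect
        {0, 4, 8},   // all land in bank 0    -> 3 cycles to collect
        {1, 5, 2},   // two hit bank 1        -> 2 cycles to collect
    };
    int cycles = 0, issued = 0;
    for (const auto& srcs : fmas) {
        std::array<int, kBanks> readsPerBank{};
        for (int r : srcs) readsPerBank[r % kBanks]++;
        int collectCycles = 0;
        for (int n : readsPerBank) collectCycles = std::max(collectCycles, n);
        cycles += collectCycles;  // ALU idles while operands are gathered
        issued += 1;
    }
    std::printf("issued %d FMAs in %d collection cycles (utilization %.0f%%)\n",
                issued, cycles, 100.0 * issued / cycles);
}
```

For this contrived mix it prints 3 FMAs over 6 collection cycles, i.e. 50% utilization, which is the kind of idling the theory implies.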
There is some evidence lending credibility to this view. First, Nvidia GPU cores are smaller than Apple GPU cores despite including more compute (e.g. tensor processing units) and larger storage banks. A narrower data path would explain this discrepancy. Second, Apple's appears to be the only GPU on the market with four-input instructions (compare-and-select), which are difficult to feed.
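For reference, the compare-and-select pattern looks something like this in Metal (the kernel and buffer layout below are just an illustration, not anything from Apple): if the compiler fuses the comparison and the select into a single instruction, that instruction reads four source operands per lane.

```cpp
// Metal Shading Language (a C++ dialect). select(x, y, cond) returns y where
// cond is true and x otherwise; fused with the comparison, the result is one
// instruction with four inputs: a, b, x, y.
#include <metal_stdlib>
using namespace metal;

kernel void cmpsel(device const float* a   [[buffer(0)]],
                   device const float* b   [[buffer(1)]],
                   device const float* x   [[buffer(2)]],
                   device const float* y   [[buffer(3)]],
                   device float*       out [[buffer(4)]],
                   uint i [[thread_position_in_grid]])
{
    // out[i] = x[i] where a[i] < b[i], y[i] otherwise.
    out[i] = select(y[i], x[i], a[i] < b[i]);
}
```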
Assuming this theory is correct, it is hard to say which approach makes more sense. One might say that Nvidia is being wasteful, but they do achieve comparable or better compute per die area, and certainly better compute per dollar. By using smaller processing units, they can cram more of them into their device, and idling some of the units can be a good strategy for both energy management and work balancing. Both strategies appear to excel at what they were designed for, and both would have some difficulty scaling.
Apple could quite easily implement concurrent FP32 pipes within their GPU cores. However, this would also require widening the data path by ~33% if they want to keep the same efficiency. Right now, Apple cannot sustain parallel fetches of two 32-bit values, which becomes clear when running FP32 and Int32 code concurrently: there is a penalty (no such penalty is observed when combining the 32-bit and 16-bit pipes). In principle, it could be well worth it, as they would likely improve performance by considerably more than 30%.
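If someone wanted to poke at this, a pair of hypothetical Metal kernels along these lines is roughly how I'd probe it: one interleaves FP32 FMAs with Int32 math (two 32-bit operand streams competing for the data path), the other interleaves FP32 with FP16 (32-bit plus 16-bit). On the theory above, the first should show a throughput penalty and the second should not. Kernel names, loop counts, and constants are arbitrary.

```cpp
#include <metal_stdlib>
using namespace metal;

// FP32 + Int32: two 32-bit operand streams per iteration.
kernel void fp32_plus_int32(device float* f [[buffer(0)]],
                            device int*   n [[buffer(1)]],
                            uint i [[thread_position_in_grid]])
{
    float x = f[i];
    int   k = n[i];
    for (int it = 0; it < 1024; ++it) {   // iteration count is arbitrary
        x = fma(x, 1.0001f, 0.5f);        // FP32 pipe
        k = k * 3 + 1;                    // Int32 pipe
    }
    f[i] = x; n[i] = k;
}

// FP32 + FP16: a 32-bit and a 16-bit operand stream per iteration.
kernel void fp32_plus_fp16(device float* f [[buffer(0)]],
                           device half*  h [[buffer(1)]],
                           uint i [[thread_position_in_grid]])
{
    float x = f[i];
    half  y = h[i];
    for (int it = 0; it < 1024; ++it) {
        x = fma(x, 1.0001f, 0.5f);        // FP32 pipe
        y = fma(y, 1.01h, 0.5h);          // FP16 pipe
    }
    f[i] = x; h[i] = y;
}
```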