This isn't just about primitives crossing tiles (overlap) ... each tile also needs to accumulate, sort, and maintain its own primitive list. That's the entire reason a tile exists ... to be an independent work unit. Decreasing the tile size increases that overhead.
The big difference in cost between TBDR and IMR is that a TBDR needs to store the output of the vertex processing stage until all primitives are rasterized, whereas an IMR can rasterize a primitive immediately and discard its data. This is the primary cause of the scaling concerns. In a sense, TBDR's bet is that processing the geometry is less costly than processing the pixels. If your geometry list is long and the geometry itself is complex, with many very small triangles (we are talking a few pixels in size) and little to no overdraw, that assumption can be violated.

In comparison, the overhead of managing bins is negligible. Clipping and binning are done on very fast fixed-function hardware, and the bins only need to store a list of the primitive ids intersecting the tile (in practice these are 16-bit indices, so the bandwidth overhead is very small compared to the size of the transformed vertex data). From what I've read, GPUs manage the geometry recording phase by aggressively compressing data and by putting a limit on the buffer size. Once you get too many primitives per tile, that tile is flushed, which keeps things from getting stuck.

Probably one of the biggest challenges of TBDR (and only TBDR) is transparency. If you encounter a transparent pixel, you have to flush the tile as well, since that pixel needs access to the underlying image. Keeping track of all this while maintaining good performance is an engineering nightmare, which is probably why only one company to date has managed to figure out TBDR in practice.
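To make the binning and flushing idea concrete, here's a toy sketch (my own illustration, not any vendor's actual data structures; the 4096-primitive budget is an arbitrary assumption): a per-tile bin that records 16-bit primitive indices and signals when the tile has to be flushed, either because the bin overflowed or because a transparent primitive needs the pixels underneath it resolved first.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy illustration of a TBDR-style tile bin. Real hardware compresses this
// data aggressively; this only shows the bookkeeping, not the layout.
struct TileBin {
    static constexpr std::size_t kMaxPrims = 4096;  // assumed on-chip budget
    std::vector<std::uint16_t> primIds;             // indices into the post-transform vertex data

    // Record a primitive that intersects this tile; returns true if the tile
    // should be rasterized and shaded now (an early "partial" flush).
    bool record(std::uint16_t primId, bool isTransparent) {
        primIds.push_back(primId);
        // Overflow keeps the parameter buffer bounded; transparency forces a
        // resolve because blending needs the underlying color.
        return primIds.size() >= kMaxPrims || isTransparent;
    }
};
```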
So far, it seems like an IMR should always do better with complex geometry and small triangles. But it's not that simple. First of all, modern IMRs also use binning, so they have to deal with some of the above issues as well (the common approach here seems to be large tiles and early flushes). Second, modern IMRs have an even bigger problem with small triangles, since they break SIMD coherency. A modern desktop GPU such as Navi or Ampere needs 32 data items to be processed simultaneously in order to get good shading unit utilization. This means that pixel shaders are executed on blocks of pixels (Navi, for example, uses 8x8 blocks). For optimal performance, all pixels in the block have to belong to the same triangle. Edges are always a problem, since they mean that parts of the SIMD units are not doing anything useful. Small triangles are basically all edges, so if most of your triangles are like this, your performance will tank big time. A TBDR, in contrast, has no problem here because it always shades an entire tile at once: it can shade pixels belonging to multiple different triangles in one go, which an IMR simply can't do. So you know, you win some, you lose some.
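To put rough numbers on it (a purely illustrative calculation, using the same simplifying assumption as above that every pixel in a block has to come from one triangle):

```cpp
#include <cstdio>

// Estimates SIMD lane utilization when an IMR shades one triangle at a time
// over a fixed-size pixel block (e.g. 8x8 = 64 lanes). Illustrative only.
double laneUtilization(int coveredPixels, int blockLanes = 64) {
    return static_cast<double>(coveredPixels) / blockLanes;
}

int main() {
    // A large triangle covering most of the block keeps the SIMD units busy...
    std::printf("56/64 pixels covered -> %.0f%% utilization\n", 100 * laneUtilization(56));
    // ...while a tiny 4-pixel triangle wastes almost the whole wave.
    std::printf(" 4/64 pixels covered -> %.0f%% utilization\n", 100 * laneUtilization(4));
}
```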
But the question is whether moving the work to the middle of the pipeline is worth it (which is where the scaling issue is). And saving memory bandwidth matters on mobile devices, but it's generally not the bottleneck on desktop GPUs.
The scaling issue is entirely in the frontend. Bandwidth is always a problem. Why would desktop GPUs use faster and faster RAM, with wider and wider memory buses, otherwise? Latency is an even bigger issue.
TBDR is simply about more efficient utilization of resources. There is no reason why a TBDR GPU could not be paired with high-bandwidth RAM, for example; it's just that it doesn't necessarily need that RAM at a given performance level. But if Apple is serious about higher performance tiers, they will have to use fast RAM. LPDDR5 will only get you so far, even with a TBDR.
Well that's a whole other can of worms to deal with. The differences between IMR and TBDR appear to be difficult to abstract over via any sort of API, which means ports are going to be suboptimal without a lot of extra work by developers.
Depends on how much of the underlying architecture detail you want to expose. On a fundamental level, you don't need to abstract much. Any standard API uses the same abstraction: triangles in, rasterized in order, pixels out. Both IMR and TBDR are very happy to work with this model. It only starts getting tricky if you want to get more out of the hardware.
Not to mention that modern APIs already have some optimizations for TBDR GPUs. Apple, for example, introduced render passes: sort of contracts that describe how framebuffer resources are being used. These can be useful for tilers (any tiler, not just TBDR) in a low-bandwidth environment, since they can use the information to optimize memory transfers. For IMR GPUs they don't do much (though this could change in the near future as tiling becomes more important). Render pass APIs were subsequently adopted in Vulkan, DX12 and WebGPU, so it's now an industry-standard abstraction. So if your Vulkan game uses render passes, it will automatically run "better" on Apple Silicon.
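As a concrete example of what that "contract" looks like on the Vulkan side (the formats and values here are just illustrative choices), the declared load/store ops are what let a tiler skip memory traffic entirely for data it only needs on-chip:

```cpp
#include <vulkan/vulkan.h>
#include <array>

// A tiler reads the declared load/store ops to decide what actually has to
// touch main memory. Here, depth lives entirely in on-chip tile memory.
std::array<VkAttachmentDescription, 2> describeAttachments() {
    VkAttachmentDescription color{};
    color.format        = VK_FORMAT_B8G8R8A8_UNORM;
    color.samples       = VK_SAMPLE_COUNT_1_BIT;
    color.loadOp        = VK_ATTACHMENT_LOAD_OP_CLEAR;      // no need to fetch old contents from RAM
    color.storeOp       = VK_ATTACHMENT_STORE_OP_STORE;     // result is needed, write it out once per tile
    color.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
    color.finalLayout   = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;

    VkAttachmentDescription depth{};
    depth.format         = VK_FORMAT_D32_SFLOAT;
    depth.samples        = VK_SAMPLE_COUNT_1_BIT;
    depth.loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR;      // cleared in tile memory
    depth.storeOp        = VK_ATTACHMENT_STORE_OP_DONT_CARE; // never written back: no bandwidth cost on a tiler
    depth.stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
    depth.stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
    depth.initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED;
    depth.finalLayout    = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL;

    return {color, depth};
}
```

On an IMR the same declarations are mostly a no-op; on a tiler they are the difference between writing the depth buffer to RAM every frame and never writing it at all.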
Beyond that, Metal on Apple GPUs of course has a lot of additional features that wouldn't make sense for an IMR at all. Which is no different from any other GPU out there: you can use DX12 to make a game that runs well on both AMD and Nvidia GPUs, or you can use Nvidia's mesh shader extensions to make your game run better on Nvidia GPUs... etc. I remember the good old OpenGL days, where you ended up using non-standard vendor extensions because core OpenGL was outdated crap.
It isn't clear: can macOS users participate in Early Access?
Yes, they have a Mac version. Which came as a bit of a surprise, since Sven Winkle wasn't sure whether the Mac version would make it into early access just a short while ago.
Where is the information saying they will do it? I didn't see anything regarding this.
There was some mention of ARM strings in data-mined patches on the WoW forums; unfortunately, I can't find a link to the source...