This is not my deep area of expertise, so you'll have to bear with me if I'm getting the concepts/terminology wrong. But it's a different level of parallelism, isn't it? Work begins "immediately" in an immediate mode renderer, whereas in order to receive the benefits of tile-based rendering, you must wait until the geometry in a tile is processed. Waiting to reduce the usage of other resources has its own kind of cost to performance. This makes sense to me as a drawback ... I see it all the time in high-performance parallel systems. People try to introduce "smarts" and it ends up bottlenecking scalability.
Ah, I think I understand what you mean. You are saying that in a TBDR model the shader invocation has to wait until all the geometry in a tile has been processed. I wouldn't use the term parallelism here; it's more about latency. Anyway, this would be a cause for concern if you had shaders waiting to do pixel work... but this is a unified shader architecture. There is always work to do. You have compute shaders, vertex shaders, other tiles to process... one of the strong points of modern GPUs is automatic work balancing and latency hiding.
What is really relevant here is the time it takes to draw a tile. A TBDR will invoke the pixel shader later, but it will only need to do one "large" pixel shader dispatch. An IR will dispatch "small" pixel shader tasks continuously (see below for more details), but it will end up processing many more pixels. If you have a decent amount of overdraw, TBDR will end up doing much less work overall. And even more importantly, it will significantly reduce cache thrashing and trips to the off-chip memory.
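Just to put a number on the overdraw point, here is a back-of-the-envelope sketch. The overdraw factor of 3 is a toy value I picked for illustration, and it assumes an idealized TBDR that shades every covered pixel exactly once, while the IR shades every fragment:

```python
# Back-of-the-envelope comparison of pixel shader invocations.
# Assumptions (mine, for illustration): the IR shades every covered fragment,
# while an ideal TBDR shades each visible pixel exactly once.

WIDTH, HEIGHT = 5120, 2880          # 5K framebuffer
OVERDRAW = 3.0                      # toy value: average fragments per pixel

pixels = WIDTH * HEIGHT

ir_invocations = pixels * OVERDRAW  # immediate renderer shades every fragment
tbdr_invocations = pixels           # ideal TBDR shades only the front-most fragment

print(f"IR:   {ir_invocations:,.0f} pixel shader invocations")
print(f"TBDR: {tbdr_invocations:,.0f} pixel shader invocations")
print(f"TBDR does {ir_invocations / tbdr_invocations:.1f}x less shading work")
```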
Mobile-class GPUs are sharing memory bandwidth with the system to begin with, and on top of that, desktop GPUs are using dedicated memory with over 10x the bandwidth. The bottlenecks simply aren't the same ... deferring work to save memory bandwidth isn't going to take the same priority in a higher-performance system.
Exactly! We see TBDR in the mobile space because it's the only way to get reasonable performance out of a bandwidth- and energy-starved environment. We don't see it in the desktop space because IRs are simpler devices, and one can afford to be a bit wasteful there anyway.
OK, maybe I didn't mean "drawn" multiple times, but much of the processing work is done for a large triangle redundantly for each tile, on top of having to calculate the tiles each triangle falls in. I'm quite certain this drawback is correct. Having to process the same triangle multiple times when it is large enough to cross multiple tiles is cited in multiple places as a drawback of tile-based rendering.
One citation:
https://graphics.stanford.edu/papers/bucket_render/paper_all.pdf
One problem with theoretical computer science is that it can be 100% correct and yet quite useless in the real world.
Also, the paper you are quoting is from 1998... a lot has changed since then. The authors are obviously correct that there is a per-triangle-per-tile setup cost (you need to find the tiles that the triangle intersects and store that information), but in practical terms this cost is zero, or at least very, very small.

I have no idea how tiling is actually done in hardware, but a naive algorithm that comes to mind is simply using a rasterizer to "draw" the triangle onto a grid where each tile is represented by exactly one pixel. Let's consider the worst-case scenario: a full-screen triangle at 5K (5120x2880) resolution will touch every single tile. At a 32x32 tile size, that is a 160x90 tile grid, or 14,400 tiles to be marked. Sounds like a lot, right? But when we actually start rasterizing the triangle properly, which we need to do anyway, it will generate a whopping 14,745,600 pixels, and that is the amount of work our rasterizer needs to run through anyway! It's an overhead of about 0.1% on an operation that is already really, really cheap. For smaller triangles, the overhead will be much lower, as they touch fewer tiles.

Not to mention that you can do other things with tiling, like running multiple rasterizers in parallel on different tiles. And this is the overhead of the most naive approach to tiling; I am sure the experienced engineers who have worked on this for years came up with something better than an amateur like me could in a minute.
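For what it's worth, here is the arithmetic above as a quick sketch, using the same naive one-pixel-per-tile assumption:

```python
# Worst case: a full-screen triangle at 5K with 32x32 tiles.
# Naive tiling assumption from above: mark tiles by "rasterizing" the
# triangle onto a grid where one pixel represents one tile.

WIDTH, HEIGHT = 5120, 2880
TILE = 32

tiles_x, tiles_y = WIDTH // TILE, HEIGHT // TILE    # 160 x 90
tiles_marked = tiles_x * tiles_y                    # 14,400 tile entries
pixels_rasterized = WIDTH * HEIGHT                  # 14,745,600 pixels

print(f"tile grid:       {tiles_x} x {tiles_y} = {tiles_marked:,} tiles")
print(f"pixels to shade: {pixels_rasterized:,}")
print(f"tiling overhead: {tiles_marked / pixels_rasterized:.2%}")  # ~0.10%
```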
But even disregarding all that — modern desktop GPUs already use tiling! So it must be worth it for them. Here is a screenshot from a tool I wrote to study rasterization patterns on Macs:
Here, I am drawing two large overlapping triangles, yellow over red, while limiting the number of pixels shaded, on a MacBook Pro with an AMD Navi GPU. As you can see, the GPU tiles the screen into large rectangular areas, which are rasterized in tile order (left to right, top to bottom). Inside the large tiles you get smaller square block artifacts. These blocks correspond to the invoked pixel shader groups: since Navi SIMD units are 32 wide, the hardware needs to receive work in batches of 32 items to achieve good utilization. The blocks themselves are 8x8 pixels, so once the rasterizer has output an 8x8 block (64 pixels in total), the pixel shader is executed.

The tiling capability of Navi is fairly limited: once I go over a certain number of triangles (I think it's about 512, but I need to verify that), it resets. I am sure the details are much more complex, but this gives a general idea of how this stuff works. Nvidia is similar (I can't test it since I don't have one, but I have seen results from other people), just with different tile and block sizes. Intel (at least in the Skylake gen) does not use tiling (see the other screenshot I have attached), and it also uses different SIMD dispatch blocks (4x4 pixels, which makes sense given that Intel Gen uses 4-wide SIMD).
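I don't know the actual tile dimensions Navi uses, so purely as a way to picture the pattern in the screenshot, here is a toy sketch (with made-up tile sizes) that maps a pixel to the screen tile and 8x8 shader block it would fall into, in the left-to-right, top-to-bottom order described above:

```python
# Rough illustration of the dispatch pattern described above (not the actual
# hardware algorithm): pixels are grouped into 8x8 shader blocks, and blocks
# are grouped into larger screen tiles walked left to right, top to bottom.
# TILE_W/TILE_H are made-up values; I don't know Navi's real tile dimensions.

BLOCK = 8                   # 8x8 pixels per pixel shader group (64 invocations)
TILE_W, TILE_H = 256, 256   # hypothetical screen-tile size
SCREEN_W = 5120

def dispatch_order(x, y):
    """Return a (tile_index, block_index) pair that sorts pixels roughly in
    the order the screenshot shows them being shaded."""
    tiles_per_row = SCREEN_W // TILE_W
    tile_index = (y // TILE_H) * tiles_per_row + (x // TILE_W)
    blocks_per_row = TILE_W // BLOCK
    bx, by = (x % TILE_W) // BLOCK, (y % TILE_H) // BLOCK
    block_index = by * blocks_per_row + bx
    return tile_index, block_index

# Pixels inside the same 8x8 block share a block_index, i.e. they get shaded
# together as one pixel shader group.
print(dispatch_order(0, 0), dispatch_order(7, 7), dispatch_order(8, 0))
```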
As you can see, tiling must be cheap and beneficial enough that big houses like Nvidia and AMD use it as well. But while they use tiled rendering, this is not deferred rendering: you can see the bottom red triangle drawn before the yellow triangle paints over it. Each overlapped pixel gets shaded twice. On a TBDR GPU (iPad, iPhone, etc.) you won't ever see any red pixels, because the pixel shader is never called for the red triangle.
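To illustrate why the pixel shader is never called for the red triangle, here is a toy model of per-tile deferred shading (opaque geometry only, no depth-test subtleties, and definitely not how the hardware actually works): each tile first records which primitive wins each pixel, and only then runs shading once per covered pixel.

```python
# Toy model of why a TBDR never shades the covered red triangle: per tile,
# opaque geometry only records *which* primitive won each pixel; shading runs
# once per pixel after all geometry for the tile is in. Purely illustrative;
# real hardware (HSR on PowerVR/Apple GPUs) is far more involved.

TILE = 32

def render_tile(primitives):
    """primitives: list of (name, covers) in submission order, where
    covers(x, y) says whether the primitive covers that pixel."""
    winner = [[None] * TILE for _ in range(TILE)]   # front-most opaque primitive
    for name, covers in primitives:                 # "red" first, "yellow" second
        for y in range(TILE):
            for x in range(TILE):
                if covers(x, y):
                    winner[y][x] = name             # later opaque prim wins
    # Deferred shading pass: one pixel shader invocation per covered pixel.
    invocations = {}
    for row in winner:
        for name in row:
            if name is not None:
                invocations[name] = invocations.get(name, 0) + 1
    return invocations

# Red covers the whole tile, then yellow covers the whole tile on top of it.
print(render_tile([("red",    lambda x, y: True),
                   ("yellow", lambda x, y: True)]))
# -> {'yellow': 1024}; red is never shaded, so no red pixels ever appear.
```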
Well, I assume TBDR has a cost in silicon compared to a simpler approach, so it's not really "free", is it?
Definitely not free
TBDR is a much more complex design: it needs more meticulously designed fixed-function hardware, and there are a lot of details that I can only assume are extremely difficult to get right. That's why there is only one company that has really gotten TBDR right, and that's Imagination Technologies, which has been researching and designing these GPUs for the last 20+ years. Apple GPUs are built on top of Imagination IP, with Apple possibly refining the architecture further.