As a side note, Apple puts a lot of emphasis in its talks on a couple of things that minimize these problems: render passes that use tile memory, and function specialization for pipeline variants. But Nvidia will often have an advantage if the shaders themselves are large/complex, as its GPUs have more registers and more compute throughput.
For IMR GPUs, I guess that if a render pass needs a resource produced in a previous step and you don't want to go through system memory or VRAM, the way to go is to just merge those two passes into a single one. This puts more pressure on registers. But on TBDR GPUs you can run two separate render passes, where the first pass writes to tile memory and the second pass reads from there. Register pressure is minimized, as you don't need to keep registers from the previous pass around.
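To make the two-pass idea a bit more concrete, here is a toy CPU model in C++17. It is only a sketch, not GPU code: the sizes, the `first_pass`/`second_pass` functions, and the `offchip_accesses` counter are all made up for illustration. It just counts how often the intermediate result has to leave the chip when the passes communicate through memory versus through a small per-tile buffer.

```cpp
// Toy CPU model, not GPU code. Requires C++17.
#include <array>
#include <cstddef>

constexpr std::size_t kImagePixels = 8;  // hypothetical framebuffer size
constexpr std::size_t kTilePixels = 4;   // hypothetical tile size

using Image = std::array<int, kImagePixels>;

struct Result {
    Image image{};                     // final output of the second pass
    std::size_t offchip_accesses = 0;  // reads + writes of the intermediate
};

// Stand-ins for the work done by each render pass.
constexpr int first_pass(int x) { return x * 2; }
constexpr int second_pass(int x) { return x + 1; }

// Two passes that communicate through a full-size intermediate buffer,
// standing in for system memory or VRAM.
constexpr Result through_memory(const Image& in) {
    Result r{};
    Image intermediate{};
    for (std::size_t i = 0; i < kImagePixels; ++i) {
        intermediate[i] = first_pass(in[i]);
        ++r.offchip_accesses;  // store of the intermediate
    }
    for (std::size_t i = 0; i < kImagePixels; ++i) {
        r.image[i] = second_pass(intermediate[i]);
        ++r.offchip_accesses;  // load of the intermediate
    }
    return r;
}

// Two passes per tile: the intermediate lives in a small buffer that
// stands in for on-chip tile memory and never goes off chip.
constexpr Result through_tile(const Image& in) {
    Result r{};
    for (std::size_t base = 0; base < kImagePixels; base += kTilePixels) {
        std::array<int, kTilePixels> tile{};
        for (std::size_t i = 0; i < kTilePixels; ++i) {
            tile[i] = first_pass(in[base + i]);
        }
        for (std::size_t i = 0; i < kTilePixels; ++i) {
            r.image[base + i] = second_pass(tile[i]);
        }
    }
    return r;
}

constexpr Image kInput{1, 2, 3, 4, 5, 6, 7, 8};
static_assert(through_memory(kInput).offchip_accesses == 16, "8 stores + 8 loads");
static_assert(through_tile(kInput).offchip_accesses == 0, "intermediate stays in the tile");
static_assert(through_tile(kInput).image[3] == through_memory(kInput).image[3], "same result");
```

Both versions produce the same image; the difference is only where the intermediate lives between the two passes.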
The same thing goes for pipeline variants. Branch instructions need registers to keep branch directions and counters. When you use a pipeline variant, whether the branch is taken (and when) is known at compile time, so the compiler can optimize those registers and flow control instructions out.
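A rough CPU-side analogy in C++17, in case it helps. This is not Metal code, and the `shade_*` names and the fog math are hypothetical. In Metal the equivalent mechanism is function constants, but the idea is the same: once the flag is a compile-time constant, each variant only contains one side of the branch.

```cpp
// CPU-side analogy only, not Metal code. Requires C++17.
#include <cstdint>

// "Uber-shader" style: the flag is a runtime value, so the compiled code
// has to keep it live and emit a conditional for it.
constexpr std::uint32_t shade_runtime(std::uint32_t color, std::uint32_t fog,
                                      bool use_fog) {
    if (use_fog) {
        return (color + fog) / 2;
    }
    return color;
}

// Specialized variant: the flag is a template parameter, so each
// instantiation is compiled with only the branch it actually takes.
template <bool UseFog>
constexpr std::uint32_t shade_variant(std::uint32_t color, std::uint32_t fog) {
    if constexpr (UseFog) {
        return (color + fog) / 2;
    } else {
        return color;
    }
}

// Both forms agree on the result; only the generated code differs.
static_assert(shade_variant<true>(200, 100) == shade_runtime(200, 100, true),
              "fog variant matches runtime branch");
static_assert(shade_variant<false>(200, 100) == shade_runtime(200, 100, false),
              "no-fog variant matches runtime branch");
```

The trade-off is that you now have one compiled variant per combination of flags, which is exactly the pipeline variant count you have to manage.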