I was asking myself what this is for; I suppose now we know. By the way, is this the first time that a SIMT GPU is doing genuinely superscalar execution, with two instructions from two programs decoded and dispatched per cycle? To my knowledge, until now only Nvidia did something like that, but they still dispatch only one SIMD instruction per cycle (and alternate between two SIMDs). I do wonder what this means for operand bandwidth and data-path contention; one needs a fairly wide bus to sustain two 32-wide SIMD pipelines simultaneously. With three-source FMAs, that's 2 x 32 lanes x 3 operands x 4 bytes = 768 bytes of operand reads per cycle, before you even account for banking conflicts...
The beauty of this is that it opens Apple up to Nvidia-style expansion of pipeline functionality. Right now they seem to have separate FP32, FP16, and INT pipelines. But say they make the FP pipelines more flexible so that both can do FP32 and FP16: then they can double peak FP throughput for only a minor increase in area (all the other machinery is already in place). And maybe three-way superscalar with three symmetrical pipelines will also be possible in the future. Looking forward to what they cook up! At any rate, there is a lot of potential here, and they seem to be designing their hardware in a way that will allow bigger advancements later.
Yes, I didn't want to get into this, but one reason going superscalar in this way is not completely trivial is register reads.
Apple, like everyone, uses a register cache to hold "active" registers and avoid the delay and energy cost of reading from SRAM each time, but Apple's scheme seems to be more aggressive than anyone else's: it can hold destination registers, not just source registers; it has more sophisticated compiler controls for marking certain registers as high priority for retention, or as no longer of interest; and an instruction can request that a register to be used a few cycles later be prefetched into the register cache.
So going superscalar means substantially increasing the bandwidth of that cache, which is surely not trivial (at least not if you care about limiting energy). And I think that's the most immediate constraint keeping them from going wider.
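To make that concrete, here is a toy C++ model of such an operand register cache: a small fully associative, LRU-managed window over the register-file SRAM, with stand-ins for the retention/discard and prefetch hints described above. The capacity, policy, and hint names are all made up for illustration; the real encodings on Apple's GPUs aren't public.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <list>
    #include <unordered_map>

    class RegCache {
    public:
        explicit RegCache(size_t entries) : entries_(entries) {}

        // Source-operand read: either hit in the cache, or model a
        // (slower, more energy-hungry) demand read from SRAM.
        bool read(uint16_t reg) {
            if (touch(reg)) return true;
            ++sram_reads_;
            fill(reg);
            return false;
        }

        // Destination write: results are cached too (not just sources),
        // so a freshly produced value can be consumed without touching SRAM.
        void write(uint16_t reg) { fill(reg); }

        // Compiler hint: "prefetch reg, it's needed a few cycles out".
        void prefetch(uint16_t reg) { fill(reg); }

        // Compiler hint: "reg is no longer of interest" -- free the slot early.
        void done(uint16_t reg) {
            auto it = map_.find(reg);
            if (it != map_.end()) { lru_.erase(it->second); map_.erase(it); }
        }

        size_t sram_reads() const { return sram_reads_; }

    private:
        bool touch(uint16_t reg) {          // refresh LRU position on a hit
            auto it = map_.find(reg);
            if (it == map_.end()) return false;
            lru_.splice(lru_.begin(), lru_, it->second);
            return true;
        }
        void fill(uint16_t reg) {
            if (touch(reg)) return;
            if (lru_.size() == entries_) {  // evict the least-recently-used
                map_.erase(lru_.back());
                lru_.pop_back();
            }
            lru_.push_front(reg);
            map_[reg] = lru_.begin();
        }

        size_t entries_;
        std::list<uint16_t> lru_;           // front = most recently used
        std::unordered_map<uint16_t, std::list<uint16_t>::iterator> map_;
        size_t sram_reads_ = 0;
    };

    int main() {
        RegCache rc(4);     // capacity is made up
        rc.write(0);        // result lands in the cache
        rc.read(0);         // hit: no SRAM traffic
        rc.prefetch(7);     // hint: r7 needed soon
        rc.read(7);         // hit thanks to the prefetch
        rc.read(5);         // cold miss: one demand SRAM read
        rc.done(0);         // hint: r0 is dead, release its slot
        std::printf("demand SRAM reads: %zu\n", rc.sram_reads());  // prints 1
    }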
You could imagine other designs, like a kind of Bulldozer arrangement where we have, say, two register caches (one associated with even-numbered warps, one with odd-numbered warps) above and below a common pool of execution pipelines, so that, between the two register caches, say three instructions per cycle are dispatched into the pool of execution units; a toy sketch of the idea follows.
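A throwaway sketch of the dispatch pattern I mean, with made-up widths (two operand-read ports per register cache, three shared issue slots); this is a thought experiment, not a model of any real hardware:

    #include <array>
    #include <cstdio>
    #include <queue>

    int main() {
        // half[0] owns even-numbered warps, half[1] owns odd-numbered ones;
        // each half has its own register cache feeding a shared pipe pool.
        std::array<std::queue<int>, 2> half;
        for (int w = 0; w < 8; ++w) half[w % 2].push(w);

        const int kSharedSlots = 3;    // total dispatch into the shared pool
        const int kPortsPerHalf = 2;   // operand reads each cache can feed

        for (int cycle = 0; cycle < 3; ++cycle) {
            std::array<int, 2> ports = {kPortsPerHalf, kPortsPerHalf};
            int issued = 0;
            // Alternate which half picks first, so neither starves.
            for (int pick = 0; issued < kSharedSlots; ++pick) {
                bool stuck0 = half[0].empty() || ports[0] == 0;
                bool stuck1 = half[1].empty() || ports[1] == 0;
                if (stuck0 && stuck1) break;   // nothing left to issue
                int h = (cycle + pick) % 2;
                if (half[h].empty() || ports[h] == 0) continue;
                int w = half[h].front();
                half[h].pop();
                half[h].push(w);               // warp stays resident
                --ports[h];
                ++issued;
                std::printf("cycle %d: slot %d <- warp %d (half %d)\n",
                            cycle, issued, w, h);
            }
        }
    }

Note how the per-cache port budgets (2 + 2) exceed the shared issue width (3): either half can burst, but sustained dispatch is limited by the common pool.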
Apple seem to believe that, at least for now (maybe this will diverge between future M and A designs?), it's worth having separate (and separately optimized) FP16 and FP32 pipelines.
What they could do at the cost of very little area is add a uniform pipeline to handle values (think e.g. a loop counter) that are common across the entire warp; see the sketch below. Metal as a language has the concept of uniform registers, but it is unclear to me whether they are actually implemented that way in hardware (the way AMD has done for many years, and Nvidia added a few years ago). Do you know? I've found zero reference to uniforms in hardware in all my explorations.
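To illustrate what a uniform pipeline would exploit, here's a toy SIMT loop in plain C++ standing in for a shader (the 4-lane warp width is arbitrary). The loop bound and counter are the same for every lane, which is the property Metal's uniform concept expresses at the language level; hardware with a uniform pipe could keep one scalar copy and do one compare/increment per warp instead of 32.

    #include <array>
    #include <cstdio>

    constexpr int kLanes = 4;   // toy warp width, arbitrary

    int main() {
        std::array<float, kLanes> x{1, 2, 3, 4};  // divergent: per-lane values
        std::array<float, kLanes> acc{};          // divergent: per-lane state

        const int n = 8;              // uniform: same for the whole warp
        for (int i = 0; i < n; ++i) { // i is uniform too: the increment and
                                      // the branch could run once per warp,
                                      // on a scalar pipe, not once per lane
            for (int lane = 0; lane < kLanes; ++lane)
                acc[lane] += x[lane];              // the only per-lane work
        }
        std::printf("acc[0] = %f\n", acc[0]);      // 8.0
    }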
The other obvious next improvement could be to add a full FP64 pipeline (perhaps at low density, so each warp has to loop through it four times or so), which would start making them interesting for various STEM use cases. Another way to handle this could be, on the M designs, to add an additional FP32 pipeline AND some helper instructions. Right now you can fake various FP64-like formats (not exactly IEEE FP64, but close enough) in FP32, but it costs a lot of instructions, something like 18 or so per operation; see the sketch below. Perhaps, with just minor tweaks and new instructions for the FP32 unit, you could define something like a GF48 or GF64 format (compare with, say, BF16) that's lightweight to implement on a GPU but good enough for most of the cases where people want doubles on a GPU?
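For reference, this is roughly what such emulation looks like, written as plain C++ (the error-free transformations map directly onto an FMA-capable FP32 pipe). It's textbook "float-float"/double-float arithmetic in the Dekker/Knuth style, not anything Apple-specific, and the op counts are approximate:

    #include <cmath>
    #include <cstdio>

    // "float-float": an unevaluated sum hi + lo with |lo| <= 0.5 ulp(hi),
    // giving ~49 significand bits -- roughly the "GF48" idea above.
    struct ff { float hi, lo; };

    // Error-free sum (Knuth two-sum): s + e == a + b exactly. 6 FP32 ops.
    static ff two_sum(float a, float b) {
        float s  = a + b;
        float bb = s - a;
        float e  = (a - (s - bb)) + (b - bb);
        return {s, e};
    }

    // Error-free product via FMA: p + e == a * b exactly. 2 FP32 ops.
    static ff two_prod(float a, float b) {
        float p = a * b;
        float e = std::fmaf(a, b, -p);   // exact residual, thanks to FMA
        return {p, e};
    }

    // float-float add, fast variant: ~11 FP32 ops. A fully accurate version
    // (two two_sums) lands around 20, in line with the estimate above.
    static ff ff_add(ff x, ff y) {
        ff s = two_sum(x.hi, y.hi);
        float lo = s.lo + (x.lo + y.lo);
        float hi = s.hi + lo;
        lo -= hi - s.hi;                 // quick renormalization
        return {hi, lo};
    }

    // float-float multiply: ~7 FP32 ops, all FMA-friendly.
    static ff ff_mul(ff x, ff y) {
        ff p = two_prod(x.hi, y.hi);
        p.lo = std::fmaf(x.hi, y.lo, std::fmaf(x.lo, y.hi, p.lo));
        float hi = p.hi + p.lo;
        float lo = p.lo - (hi - p.hi);   // quick renormalization
        return {hi, lo};
    }

    int main() {
        // Build a float-float approximation of 1/3 (lo captures what the
        // float hi misses), then square it.
        float hi = 1.0f / 3.0f;
        float lo = std::fmaf(-3.0f, hi, 1.0f) / 3.0f;  // residual correction
        ff third = {hi, lo};
        ff ninth = ff_mul(third, third);
        std::printf("1/9 ~ %.12f\n", (double)ninth.hi + (double)ninth.lo);
    }

The appeal of dedicated helper instructions is exactly that each two_sum/two_prod above collapses from a handful of dependent FP32 ops into one or two, which is where most of the 18-or-so instruction cost goes.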