What would be the mechanism by which IPC is traded for higher clocks? Dougall Johnson reported that latency on some operations actually decreased rather than increased.
I would rather speculate that there is some other bottleneck in the core. Maybe cache bandwidth, or the number of load/store units. If that's the case, we might see some IPC improvements in the future if Apple redesigns those parts. Another explanation is that IPC is already approaching the limits imposed by the code itself, and that higher core utilization will only be noticeable in some corner cases.
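To make the load/store idea concrete, here's a toy back-of-envelope (the port counts and the loop mix are made-up assumptions, just to show how such a bottleneck can cap IPC below the machine's width):

```python
# Toy back-of-envelope: a load/store-unit bottleneck can cap IPC well below the
# machine's width. All counts are illustrative assumptions, not Apple's numbers.
machine_width = 8     # assumed sustainable uops/cycle through the front end
load_ports    = 3     # assumed number of load units
store_ports   = 2     # assumed number of store units

# Hypothetical copy-and-accumulate loop body: 2 loads, 2 stores, 2 ALU ops.
loads, stores, alu = 2, 2, 2
uops = loads + stores + alu

cycles_per_iter = max(
    uops   / machine_width,   # width limit: 0.75 cycles/iteration
    loads  / load_ports,      # load-port limit: ~0.67 cycles/iteration
    stores / store_ports,     # store-port limit: 1.0 cycles/iteration  <- the bottleneck
)
print(f"best-case IPC ≈ {uops / cycles_per_iter:.1f} on an {machine_width}-wide machine")
# Making the front end wider changes nothing here; adding a store port would.
```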
To be honest I’m getting less and less optimistic about this. But let’s see. Maybe the larger red tops will feature overclocked “Pro/Max Plus” chips 😅
It's hard to say anything informed because no one is providing the sort of information that would be required.
Why might a CPU not run at full width every cycle?
Obvious issues are
- I-cache misses
- D-cache misses
- TLB misses (on both the I and D sides)
- branch mispredictions.
Those are all (in principle) easy to track; even so, I'm unaware of any serious investigation of them for Apple's designs. You can get some info on the Intel side (for example the paper I have referred to occasionally on how a few hard-to-predict branches dominate branch misprediction, and should be handled via other means).
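To get a feel for why mispredictions in particular matter so much on a very wide core, here's a crude back-of-envelope (the width and redirect penalty are illustrative assumptions, not Apple's numbers):

```python
# Rough back-of-envelope: the cost of branch mispredictions on a wide machine.
# Width and redirect penalty are illustrative assumptions, not measured values.
width            = 8    # assumed sustained issue width (instructions/cycle)
redirect_penalty = 13   # assumed cycles lost per mispredict (flush + refill)

def effective_ipc(mispredicts_per_kilo_instr):
    """IPC if each mispredict simply stalls the machine for redirect_penalty cycles."""
    instrs = 1000
    base_cycles  = instrs / width
    stall_cycles = mispredicts_per_kilo_instr * redirect_penalty
    return instrs / (base_cycles + stall_cycles)

for mpki in (0, 1, 5, 10):
    print(f"{mpki:2d} mispredicts per 1000 instructions -> IPC ≈ {effective_ipc(mpki):.2f}")
```

Even a handful of mispredicts per thousand instructions eats most of the width, which is why a few chronically hard-to-predict branches are worth handling by other means.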
But there are successor issues, in particular:
- there are runs of dependent instructions BUT
- these are interleaved with other independent runs of dependent instructions
In other words, the extent to which you can get these independent runs executing together (and thus utilizing your full width) depends on how far you look ahead in the instruction stream relative to current execution. This is basically a question of how deep your dispatch/issue scheduling queues are (as opposed to everything else like ROB depth).
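Here's a deliberately dumb toy model of that point: a stream of dependent runs, interleaved run-by-run across several chains, scheduled by a machine that can only consider the oldest N not-yet-issued instructions each cycle. Everything here (run length, widths, single-cycle latency) is an assumption for illustration, not a model of any real core:

```python
# Toy model: achievable IPC vs. how far ahead the scheduler can look.
def make_program(num_chains, runs_per_chain, run_len):
    """Runs of dependent instructions, the runs interleaved across chains.
    Returns deps[i] = index of the instruction i depends on (or None)."""
    deps, last = [], [None] * num_chains
    for _ in range(runs_per_chain):
        for c in range(num_chains):
            for _ in range(run_len):
                deps.append(last[c])
                last[c] = len(deps) - 1
    return deps

def run_sim(deps, window, issue_width, latency=1):
    """Each cycle, issue up to issue_width ready instructions, but only from
    among the oldest `window` not-yet-issued instructions in program order."""
    done_at = [None] * len(deps)
    issued  = [False] * len(deps)
    head, cycle = 0, 0
    while head < len(deps):
        view  = [i for i in range(head, len(deps)) if not issued[i]][:window]
        slots = issue_width
        for i in view:
            if slots == 0:
                break
            d = deps[i]
            if d is None or (done_at[d] is not None and done_at[d] <= cycle):
                issued[i] = True
                done_at[i] = cycle + latency
                slots -= 1
        while head < len(deps) and issued[head]:
            head += 1
        cycle += 1
    return len(deps) / cycle

deps = make_program(num_chains=8, runs_per_chain=4, run_len=16)
for window in (16, 32, 64, 128):
    print(f"window {window:3d}: IPC ≈ {run_sim(deps, window, issue_width=8):.2f}")
```

The code is nothing like real hardware, but the shape of the result is the point: the same 8-wide back end extracts more and more of the available parallelism as the window over the instruction stream deepens, until it saturates at the machine width.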
We know that, IN THEORY, there is a lot of such parallelism, but
- the last serious limit studies were done 30 years ago on designs that you wouldn't want to put into a lightbulb today
- those limit studies did a poor job of investigating THIS particular issue.
A different way to tackle the problem is by tracking instruction criticality and the consequences that flow from this. If you DO track instruction criticality, you can then start to do things like ensure that only the critical instructions sit in expensive high-performance scheduling queues, while the bulk of non-critical instructions sit in cheaper buffers. As far as I can tell (which is limited because, as I say, there do not seem to be any good studies tracking the issues that really matter), this is the next frontier in wide OoO, given that Apple has already implemented all the lower-hanging fruit.
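To show what I mean by segregation, here's a tiny sketch. It computes a crude criticality measure (the length of the chain of dependants hanging off each instruction) and routes only the tall ones to the small fast queue. A real core would have to predict this from past behaviour rather than compute it after the fact; the sketch only illustrates the partitioning, not anyone's actual mechanism:

```python
# Crude criticality: length of the longest chain of dependants below each instruction.
def criticality(deps):
    """deps[i] = index of the producer instruction i depends on (or None)."""
    height = [0] * len(deps)
    for i in reversed(range(len(deps))):      # consumers appear after producers
        d = deps[i]
        if d is not None:
            height[d] = max(height[d], height[i] + 1)
    return height

def route(deps, threshold=3):
    """Critical instructions -> small fast scheduling queue; the rest -> cheap buffer."""
    height = criticality(deps)
    fast  = [i for i, h in enumerate(height) if h >= threshold]
    cheap = [i for i, h in enumerate(height) if h <  threshold]
    return fast, cheap

# Tiny example stream: one long dependent chain (even slots) interleaved with
# independent one-off instructions (odd slots).
deps, prev = [], None
for i in range(20):
    if i % 2 == 0:
        deps.append(prev)     # next link of the long chain
        prev = i
    else:
        deps.append(None)     # independent filler op
fast, cheap = route(deps)
print(f"{len(fast)} instructions to the fast queue, {len(cheap)} to the cheap buffer")
```

Most of the chain (the instructions that a lot of later work waits on) ends up in the small expensive structure; the fillers, which nothing much depends on, can sit somewhere slower and cheaper.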
In other words, IMHO (and of course I could be wrong, but this is informed speculation, not wishful thinking)
- there remains a substantial degree of IPC improvement available
- some of this is achievable in the obvious ways (more execution units, wider decode, etc) but only with very low returns to that hardware. What's needed is not more hardware but better algorithms to utilize that hardware.
- the most important of those algorithms will track critical instructions, and segregate them from the rest of the instruction stream
- some academic work on this has been done, but far too little
Do Apple know about this? As always, who knows? There IS a very recent (like past month) patent for tracking critical D-CACHE lines (i.e. lines whose absence resulted in a substantial build-up of instructions that could not progress). Once these lines are detected, they are held onto more tightly in L1, and the criticality bit is preserved if they are moved out to L2 and SLC so that they are also preferentially retained there.
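As a concrete (and heavily simplified) reading of that idea, here's a toy multi-level cache where replacement prefers to evict non-critical lines, and the criticality bit travels with a line when it is demoted. This is my loose sketch of the concept as described above, not the patent's actual mechanism, and every size is made up:

```python
# Toy criticality-aware cache hierarchy: prefer to evict non-critical lines,
# and keep the criticality bit with a line as it is demoted to the next level.
from collections import OrderedDict

class Level:
    def __init__(self, name, num_sets, ways, next_level=None):
        self.name, self.ways, self.next_level = name, ways, next_level
        # each set: OrderedDict of addr -> critical bit, oldest entry first
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def insert(self, addr, critical):
        s = self.sets[addr % len(self.sets)]
        if addr in s:                        # already present: refresh LRU, keep the bit sticky
            critical = critical or s.pop(addr)
        elif len(s) >= self.ways:            # need a victim
            candidates = [a for a, c in s.items() if not c] or list(s.keys())
            victim = candidates[0]           # oldest non-critical line, if any exist
            victim_bit = s.pop(victim)
            if self.next_level is not None:
                self.next_level.insert(victim, victim_bit)   # bit travels to L2/SLC
        s[addr] = critical

slc = Level("SLC", num_sets=1, ways=8)
l2  = Level("L2",  num_sets=1, ways=4, next_level=slc)
l1  = Level("L1",  num_sets=1, ways=2, next_level=l2)

l1.insert(0x1000, critical=True)     # a line whose misses stalled a lot of work
for a in range(8):                   # followed by a burst of ordinary streaming lines
    l1.insert(0x2000 + a * 64, critical=False)
print("critical line still resident in L1:", 0x1000 in l1.sets[0])
```

With plain LRU the streaming burst would have flushed the critical line out of the tiny L1; with the bit it survives, and even if it were evicted it would still be favoured in L2 and the SLC.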
This suggests that Apple in fact has a first round of criticality tracking hardware implemented (or at least being designed), and the obvious next step after that is as I suggested above.
I don't think they would blindly engage in the widening they have engaged in without being aware that, by itself, it's not enough.
Is GW3 aware of this? Who knows...?
Apart from criticality, what else limits going wider?
Decode is easy, and scheduling stays about as easy as it currently is IF you can segregate many more pending instructions into cheaper, lower-power queues. What is hard is single-cycle rename (i.e. resource allocation). My suggestion for this is a decoupling queue between Decode and Rename. This allows for, say, 10-wide decode with 8-wide allocation, given that a substantial number of decode slots fuse two operations into one.
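A toy model of why that combination works (the widths, queue size, and fusion rate are all assumptions for illustration, not claims about any real design):

```python
# Toy decode -> decoupling queue -> rename pipeline. If enough adjacent pairs
# fuse into single uops, 8-wide rename keeps up with 10-wide decode.
import random
from collections import deque

DECODE_WIDTH, RENAME_WIDTH, QUEUE_CAP = 10, 8, 32   # assumed widths / queue size

def simulate(fusion_rate, cycles=100_000, seed=0):
    """Sustained instructions renamed per cycle for a given pair-fusion probability."""
    rng = random.Random(seed)
    queue = deque()        # decoupling queue; each entry = instructions carried by that uop
    instrs_renamed = 0
    for _ in range(cycles):
        # Rename end: drain up to RENAME_WIDTH uops per cycle.
        for _ in range(min(RENAME_WIDTH, len(queue))):
            instrs_renamed += queue.popleft()
        # Decode end: emit up to DECODE_WIDTH instructions' worth of uops,
        # fusing some adjacent pairs into one uop, while the queue has room.
        taken = 0
        while taken < DECODE_WIDTH and len(queue) < QUEUE_CAP:
            if taken + 2 <= DECODE_WIDTH and rng.random() < fusion_rate:
                queue.append(2)        # fused pair -> a single uop
                taken += 2
            else:
                queue.append(1)        # ordinary instruction -> one uop
                taken += 1
    return instrs_renamed / cycles

for rate in (0.0, 0.15, 0.3):
    print(f"fusion rate {rate:.2f}: ~{simulate(rate):.1f} instructions/cycle through rename")
```

With no fusion, the 8-wide rename stage is the ceiling; once a reasonable fraction of pairs fuse, the machine sustains close to the full 10 instructions per cycle even though rename never got wider.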
I was surprised that Apple don't seem to have already done this with the A17/M3, but perhaps they have and the tentative A17 explorations did not pick this up because they did not know what they were looking for?
So my view is that IPC is not yet mined out. But the EASY stuff is mined out: there were two decades' worth of good ideas just sitting out there ignored by everyone (totally by Intel and AMD, and mostly by ARM) until Apple lit a fire under everyone. Going forward will require years to design, simulate, and perfect each new big idea (like criticality), and while that is going on the best you can hope for in the intermediate designs is small tweaks, things that don't look valuable until the big picture all comes together.