Here's the thing: it's certainly not coming from Apple. One of the earlier such diagrams was done by Dougall Johnson, who wrote a series of tests to measure these details. These newer diagrams are likely the work of the Geekerwan team, using tools written by Johnson.
And one can get quite a lot of info by manipulating data dependencies and measuring how long an instruction chain takes to execute. Apple also provides a bunch of performance counters that let you measure all kinds of code statistics.
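To make the dependency-chain trick concrete, here's a minimal sketch (not Dougall's actual harness; the loop bodies and iteration count are illustrative, and real harnesses use hand-written assembly plus cycle counters rather than wall-clock timing):

```c
/* A minimal sketch of the dependency-chain trick, assuming nothing about
 * Dougall's actual harness. A long chain of serially dependent adds
 * measures add LATENCY; the same number of independent adds measures
 * THROUGHPUT, and the ratio hints at how many ALUs the core has.
 * Build with -O1 and inspect the assembly: at higher optimization levels
 * compilers will fold or vectorize these loops, which is exactly why
 * real harnesses use hand-written assembly and cycle counters. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    const long iters = 100000000;
    uint64_t a = 1, b = 1, c = 1, d = 1;

    /* Dependent chain: every add needs the previous result. */
    double t0 = now();
    for (long i = 0; i < iters; i++)
        a = a + i;                      /* serial dependency through 'a' */
    double t1 = now();

    /* Independent chains: four accumulators can execute in parallel. */
    double t2 = now();
    for (long i = 0; i < iters / 4; i++) {
        a += i; b += i; c += i; d += i;
    }
    double t3 = now();

    printf("dependent:   %.3f ns/add\n", (t1 - t0) / iters * 1e9);
    printf("independent: %.3f ns/add\n", (t3 - t2) / iters * 1e9);
    printf("(sink: %llu)\n", (unsigned long long)(a + b + c + d));
    return 0;
}
```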
It took Dougall and me a fairly long time to put together this information, and most of it only makes sense IF you are willing to understand, really understand, the underlying micro-architecture.
What I have been seeing since then is a bunch of people who are capable of compiling and executing microbenchmark code, but who lack the skill and/or the interest to UNDERSTAND it. So we get numbers that are, in some technical sense, correct, but not especially important or interesting.
For example, look at Dougall's discussion of measuring the M1 load/store queues:
Apple M1: Load and Store Queue Measurements (dougallj.wordpress.com)
Even after correcting his various errors and misunderstandings, Dougall and I still disagree on a bunch of things. Relevant to the LSQ measurements: I think Apple is using, even in the M1, virtual LSQ slots, which is a more important detail than the exact size of the LSQ (and they may well have graduated to virtual register allocation for the M4, again more important than any other detail).
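For flavor, the shape of the experiment behind numbers like those looks roughly as follows. This is a hypothetical sketch, not code from Dougall's post: the array sizes and filler counts are made up, and real probes need hand-written assembly and cycle-accurate counters to find a clean knee.

```c
/* A hypothetical sketch of the structure-size probe (the shape of the
 * experiment only). Two cache-missing pointer-chase loads bracket N
 * independent filler loads. While the fillers fit in the load queue, the
 * second miss can issue under the first miss's shadow; once N exceeds
 * the queue size the misses serialize and ns/step jumps. Plot ns/step
 * against N and look for the knee. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define CHAIN (1u << 22)   /* 16 MB of chase pointers: misses in all caches */
#define REPS  (1 << 16)

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    /* Sattolo's algorithm: one random cycle, so every load's address
     * depends on the previous load's result (a true latency chain). */
    uint32_t *next = malloc(CHAIN * sizeof *next);
    for (uint32_t i = 0; i < CHAIN; i++) next[i] = i;
    srand(1);
    for (uint32_t i = CHAIN - 1; i > 0; i--) {
        uint32_t j = (((uint32_t)rand() << 16) ^ (uint32_t)rand()) % i;
        uint32_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    static volatile int64_t filler[64];   /* cache-resident, independent */

    for (int n = 0; n <= 64; n += 8) {
        uint32_t p = 0;
        int64_t sum = 0;
        double t0 = now();
        for (int r = 0; r < REPS; r++) {
            p = next[p];                  /* cache-missing load #1 */
            for (int k = 0; k < n; k++)
                sum += filler[k];         /* N filler loads */
            p = next[p];                  /* cache-missing load #2 */
        }
        double t1 = now();
        printf("N=%2d  %6.2f ns/step  (p=%u sum=%lld)\n",
               n, (t1 - t0) / (2.0 * REPS) * 1e9, p, (long long)sum);
    }
    free(next);
    return 0;
}
```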
[An OoO CPU allocates resources early in the pipeline, at the stage usually called Rename, but which I call Allocate because that's more descriptive.
BUT "allocating" a resource often performs multiple roles...
One role is to ensure that a physical piece of hardware (a slot in a queue, a physical register, whatever) will be available when needed. A DIFFERENT role is tracking dependencies (consider the job of register rename and how the rename table works).
Traditionally a single allocation performs both roles, but this is inefficient. Dependency tracking generally needs to persist from Allocation until at least Execution, and often past Execution if you want to engage in some sort of Replay or Flush. But physical hardware allocation only needs to persist from Issue to Execution.
Part of what Apple has been doing, ever since the A7, is splitting apart, one by one, tasks and hardware that have always been bundled together, turning them into separately optimized pieces. Separating register dependency tracking ("virtual registers") from physical register allocation would be just one more step in this process.]
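To make that split concrete, here is a structural sketch in code. This is my speculation about the shape of such a design, nothing documented by Apple; every type and function name is invented for illustration.

```c
/* A structural sketch of the two roles that "allocation" bundles
 * together, and what splitting them might look like. Speculative;
 * all names here are invented for illustration. */
#include <stdio.h>
#include <stdint.h>

#define NUM_ARCH  32
#define NUM_PHYS  512

/* Role 1: dependency tracking. The rename table maps each architectural
 * register to a tag for its in-flight producer; younger readers consult
 * it. This mapping must live from Allocate until at least Execute. */
typedef struct {
    uint16_t producer_tag[NUM_ARCH];
} rename_table_t;

/* Role 2: physical storage. A free list of real PRF entries. */
typedef struct {
    uint16_t free_regs[NUM_PHYS];
    int      top;
} phys_free_list_t;

/* Conventional Rename: one step performs both roles, so the scarce
 * physical register is tied up for the instruction's whole lifetime. */
uint16_t rename_conventional(rename_table_t *rt, phys_free_list_t *fl,
                             int dest_arch) {
    uint16_t preg = fl->free_regs[--fl->top];  /* storage reserved NOW */
    rt->producer_tag[dest_arch] = preg;        /* tag == physical reg */
    return preg;
}

/* Split design: Rename hands out only a cheap "virtual" tag... */
static uint16_t next_vtag;
uint16_t rename_split(rename_table_t *rt, int dest_arch) {
    uint16_t vtag = next_vtag++;               /* cheap, nearly unbounded */
    rt->producer_tag[dest_arch] = vtag;
    return vtag;
}

/* ...and the scarce physical register is bound much later, around
 * Issue/Execute, held for a far shorter slice of the instruction's life. */
uint16_t bind_storage(phys_free_list_t *fl) {
    return fl->free_regs[--fl->top];           /* storage reserved LATE */
}

int main(void) {
    rename_table_t rt = {{0}};
    phys_free_list_t fl = {{0}, 0};
    for (int i = 0; i < NUM_PHYS; i++) fl.free_regs[fl.top++] = (uint16_t)i;

    uint16_t p = rename_conventional(&rt, &fl, 5);
    uint16_t v = rename_split(&rt, 6);
    uint16_t late = bind_storage(&fl);
    printf("conventional: preg %u held from Rename on\n", p);
    printf("split: vtag %u at Rename, preg %u bound only at Issue\n", v, late);
    return 0;
}
```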
It is details like this that really matter in a new design. But you won't capture this by mindlessly running microbenchmarks, especially if you don't ever tweak the benchmarks to see what behaves unexpectedly, what doesn't match your mental model.
Another issue (super-important if true) is whether 10-wide Decode (if real...) goes along with 10-wide Allocate or not. 10-wide Decode is easy. 10-wide Allocate is hard (much harder than 9-wide, which was much harder than 8-wide). But you often do NOT NEED 10-wide Allocate: usually at least one of the ten decoded instructions is fused with a subsequent instruction...
What you need instead is a decoupling queue between Decode and Allocate, one that absorbs bursts of Decode activity and smooths them into continuous Allocate activity; see the toy model below. But again, to discover such a thing you need careful new microbenchmarks, not just mindless execution of a pre-existing suite.
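Here's a toy simulation of that decoupling idea (my hypothesis about how such a front end could work, not a measured fact; the widths, the 70% fetch duty cycle, and the 32-entry queue capacity are all made-up parameters):

```c
/* Toy model of a decoupling queue between Decode and Allocate.
 * Decode is bursty: some cycles deliver the full 10 slots, others
 * deliver nothing (fetch bubble). Allocate drains a steady 8 per cycle.
 * A small queue absorbs the bursts, so 10-wide Decode does not require
 * 10-wide Allocate as long as the AVERAGE decode rate stays at or
 * below 8. All parameters here are invented for illustration. */
#include <stdio.h>
#include <stdlib.h>

enum { DECODE_W = 10, ALLOC_W = 8, QUEUE_CAP = 32, CYCLES = 100000 };

int main(void) {
    srand(2);
    long queued = 0, decoded = 0, allocated = 0, q_full_stalls = 0;
    for (long c = 0; c < CYCLES; c++) {
        /* Bursty decode: full width 70% of cycles, nothing otherwise,
         * for an average of 7 uops/cycle, below ALLOC_W. */
        int burst = (rand() % 10 < 7) ? DECODE_W : 0;
        if (queued + burst > QUEUE_CAP) {        /* back-pressure */
            burst = (int)(QUEUE_CAP - queued);
            q_full_stalls++;
        }
        queued  += burst;
        decoded += burst;

        int take = queued < ALLOC_W ? (int)queued : ALLOC_W;
        queued    -= take;
        allocated += take;
    }
    printf("avg decode:   %.2f uops/cycle\n", (double)decoded / CYCLES);
    printf("avg allocate: %.2f uops/cycle\n", (double)allocated / CYCLES);
    printf("cycles with queue back-pressure: %ld\n", q_full_stalls);
    return 0;
}
```

With those numbers the queue almost never back-pressures, which is the point: a modest buffer turns bursty 10-wide Decode into steady sub-8 Allocate traffic.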
The basic problem (and I see this even with well-meaning sites like ChipsAndCheese) is that CPUs are not all the same!!! There was a time when perhaps they were all similar, but now there is so much specialized detail that you can't transfer your in-depth knowledge of Intel and just assume you are doing something useful when you microbenchmark Apple (and probably soon Qualcomm/Nuvia). You have to actually understand how Apple is different to know what to even test for. And if there is one thing people hate even more than admitting they were ever wrong, it's having to learn something new.
So expect five years or so of basically garbage microbenchmarks for both Apple and Qualcomm, until a new generation of teens gets into the game, one that hasn't grown up thinking the world is an x86 :-(