Let me guess, you have never worked on CPU architectures?
Certainly haven't, and I have no idea what the constraints are
But I am talking about algorithmic complexity — if you can do it in constant time in "software" you can probably do it in constant time in hardware, no?
In any case, the decoding properties depend on the properties of the code, in particular whether it has certain prefix properties. In addition, an algorithm being O(n) in depth does not mean that there are no local parallelization opportunities; we are talking about asymptotic complexities, after all.*
The same can be said for UTF-8 validation, JSON parsing, or sorting, and yet there are constant-time algorithms for those tasks which rely on local parallelization. I don't see why x86 parsing is a fundamentally different problem: there are signaling bit patterns, and one can design detectors that work over multiple bytes in parallel, detect those patterns, and then reduce them to find instruction boundaries.
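To make the idea concrete, here is a toy sketch in plain C (the fixed window size and the insn_len length-decode stand-in are assumptions for illustration, not how a real decoder is wired). The speculative per-byte length decode is the part that can run in parallel; the boundary chase that follows is written as a loop, but it is bounded by the constant window size:

    #include <stdint.h>
    #include <stddef.h>

    #define WINDOW 16  /* fixed-size fetch window */

    /* Hypothetical stand-in for the length-decode logic that inspects a few
       bytes starting at 'pos'; assumed to return a length of at least 1. */
    typedef uint8_t (*length_fn)(const uint8_t *window, size_t pos);

    uint16_t mark_boundaries(const uint8_t *window, length_fn insn_len)
    {
        uint8_t len_at[WINDOW];

        /* Step 1: every byte position speculatively decodes "if an instruction
           started here, how long would it be". The positions are independent,
           so hardware can evaluate them in parallel. */
        for (size_t i = 0; i < WINDOW; i++)
            len_at[i] = insn_len(window, i);

        /* Step 2: starting from the known first boundary, follow the
           precomputed lengths to mark the real instruction starts. At most
           WINDOW iterations, i.e. bounded by a constant. */
        uint16_t boundary_mask = 0;
        for (size_t pos = 0; pos < WINDOW && len_at[pos] > 0; pos += len_at[pos])
            boundary_mask |= (uint16_t)1u << pos;

        return boundary_mask;
    }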
Yes, typically I refer to the 64-bit instruction set when referring to x64. I fully believe that a single instruction can be decoded with a constant upper bound, i.e. O(1). The question I am raising is about the complexity of decoding n instructions (as n goes towards infinity).
Ah, well, maybe that's where the misunderstanding lies... I am not talking about parsing n instructions in constant time, I am talking about parsing instructions within a given fixed-size window in constant time. If I remember correctly, all modern x86 CPUs parse instructions in windows of 16 bytes.
*SIMD is a good example. Not all SIMD operations are really O(1) with respect to vector length. In particular, reduction operations (which include the dot product) are inherently O(log n). So the observation that SIMD operations can be used does not imply that the underlying algorithm has O(1) complexity.
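As a generic illustration (a sketch of the idea, not any particular ISA's implementation): the summation step of a dot product can be arranged as a pairwise tree. The total work is O(n), but even if every addition within a level runs in parallel, there are still ceil(log2(n)) dependent levels:

    #include <stddef.h>

    /* Tree reduction: total work O(n), depth ceil(log2(n)). */
    float tree_reduce(float *v, size_t n)
    {
        for (size_t stride = 1; stride < n; stride *= 2) {       /* one pass per level */
            for (size_t i = 0; i + stride < n; i += 2 * stride)  /* independent within a level */
                v[i] += v[i + stride];
        }
        return n ? v[0] : 0.0f;  /* v[0] holds the sum after the last level */
    }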
O(log n) with a constant n is constant time; the rest is just an implementation detail. Reduction operations on modern ARM CPUs (and GPUs) are O(1): check out the M1 instruction timings, for example. They are obviously slower than the parallel versions (a reducing ADDV needs 3 cycles, for example, while a parallel ADDP needs 2), but it's still constant time, and any combination of those will be constant time as well. And of course, as you increase the vector size, the latency of these operations will go up as well, but constant factors don't change the algorithmic complexity.
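For a fixed 128-bit vector, the whole reduction unrolls into a fixed number of operations, e.g. with AArch64 NEON intrinsics (a minimal sketch of the point, nothing M1-specific):

    #include <arm_neon.h>

    /* One across-vector reduction (ADDV)... */
    float hsum_addv(float32x4_t v)
    {
        return vaddvq_f32(v);
    }

    /* ...or the equivalent log2(4) = 2 pairwise steps (ADDP). Either way the
       number of operations is fixed by the vector width, i.e. constant time. */
    float hsum_pairwise(float32x4_t v)
    {
        float32x4_t p = vpaddq_f32(v, v);  /* [v0+v1, v2+v3, v0+v1, v2+v3] */
        p = vpaddq_f32(p, p);              /* full sum in every lane */
        return vgetq_lane_f32(p, 0);
    }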
And sure, reduction operations in x86 perform terribly, but that's just x86... they have a history of adding seemingly useful instructions that are completely useless in practice... both DPPS and HADDPS are good examples of those.
Yup, the reason is that 5 decoders are sufficient to feed the backend. Adding more decoders would not have helped with performance, because you would be backend-limited most of the time.
I don't see that Apple has cracked something here. They just made a very wide design; there is no fundamental problem Apple has solved. As I pointed out earlier, it is, for instance, trivial for AArch64 architectures to increase the number of decoders, unlike in x64 land.
I don't think it's that straightforward. The Cortex-X1, for example, is almost as wide, and yet it can't even approach Apple's IPC.