I will give a few pointers below.
That's just wrong - of course it depends on what you mean by "scanning". In a general computation model you compute your output from the input - there is no notion of scanning. First of all, a parallel machine model or a uniform circuit has essentially parallel access to the input - there is no notion of going sequentially through it. Of course the algorithm (or function) itself might have dependencies, which can force you to sequentialize some computations - and that is precisely what we are talking about here.
Coming back to what you were trying to say above, there are some theorems about lower limits once we remove the notion of scanning. One of them states that a function whose output depends on all n inputs cannot be in NC^0 and needs at least logarithmic depth, i.e. NC^1 (informally, NC^x means O(log^x n) depth on a circuit with bounded fan-in using standard gates) - see the small sketch below. As reading material I suggest looking at the classes NC, AC and TC - but I guess Nick's Class (NC) is the most relevant one for this discussion.
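To make the depth argument concrete, here is a minimal sketch of my own (not from the discussion), assuming the usual bounded fan-in of two: a gate tree of depth d can only look at 2^d inputs, so any function that really depends on all n inputs needs depth of at least log2(n) - which is exactly where NC^0 ends and NC^1 begins.

import math

def and_tree_depth(n: int) -> int:
    # Depth of a balanced binary tree of 2-input AND gates over n inputs.
    # With fan-in 2, depth d reaches at most 2**d inputs, so ceil(log2 n) is forced.
    return math.ceil(math.log2(n)) if n > 1 else 0

for n in (2, 8, 64, 1024):
    print(n, "inputs ->", and_tree_depth(n), "gate levels")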
You should convince yourself that when talking about NC, we are already talking about the best-case parallel implementation - indeed, I am not subscribing to such mundane arguments.
As you correctly noted, the O(n) nature of finding the instruction boundaries is something you cannot escape with any clever implementation - and I am sure the current 4-wide decode in x86 is already extremely clever; it just gets increasingly harder to "hide" the O(n) nature of the problem as you go wider. This is opposed to ARM (or any other fixed-length ISA), where the problem is O(1) by nature. A tiny sketch below makes the difference explicit.
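The sketch is purely illustrative - instr_length() below is a made-up stand-in for real x86 length decoding, not actual x86 semantics. The point is the dependency structure: in a variable-length encoding the start offset of instruction k is a running sum over all previous instruction lengths (a sequential chain of length n), while in a fixed-length encoding every decoder slot can compute its offset as 4*k independently of the others.

def instr_length(first_byte: int) -> int:
    # Made-up length rule standing in for real x86 length decoding (1..15 bytes).
    return 1 + (first_byte % 15)

def variable_length_starts(code: bytes, count: int) -> list[int]:
    # Variable-length ISA: each start offset depends on the previous one -> O(n) chain.
    starts, offset = [], 0
    for _ in range(count):
        starts.append(offset)
        offset += instr_length(code[offset])
    return starts

def fixed_length_starts(count: int) -> list[int]:
    # Fixed 4-byte ISA: every start offset is independent -> trivially parallel.
    return [4 * k for k in range(count)]

variable_length_starts cannot produce offset k before offset k-1 is known - that chain is exactly the part a wide x86 decoder has to speculate or brute-force around.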
Right, ILP is limited, but on the other hand it is very application-dependent. We should separate concerns here, though: even if you predict branches 100% correctly, you still hit the same ILP limit, because that limit comes from the true data dependencies of the program.
Beyond the theoretical ILP limit, you also have limits on how the ISA allows you to express the parallelism. All the parallelism a backend based on dynamic scheduling (aka out-of-order execution) is able to extract is based on observing register dependencies. Therefore register renaming plays a huge role, as it removes anti-dependencies (WAR/WAW) between instructions. However, the size of your architectural register file also limits how parallelism can be expressed. If the compiler ever gets into a situation where it needs to emit spill code because it ran out of architectural registers, register renaming does not help you anymore - the potential parallelism is gone. The conclusion here would be that with ARM you can potentially extract more ILP due to the larger GP register file compared to x64 (31 general-purpose registers vs 16). And while you can never exceed the inherent ILP limit, there still is an ISA dependency here - namely, architectures with larger GP register files do have an advantage. A small sketch of the renaming effect follows below.
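Here is a toy model of my own (nothing from the post): it measures the critical path of a short instruction trace once with all register dependencies counted (no renaming) and once with only true RAW dependencies (ideal renaming); the ILP limit is then simply instruction count divided by critical path length.

# Each instruction: (destination register, [source registers]).
trace = [
    ("r1", ["r0"]),   # r1 = f(r0)
    ("r2", ["r1"]),   # r2 = f(r1)   true dependency on the previous instruction
    ("r1", ["r3"]),   # r1 = f(r3)   WAR on instr 1 / WAW on instr 0 (reuses r1)
    ("r4", ["r1"]),   # r4 = f(r1)   true dependency on instr 2
]

def critical_path(trace, with_renaming: bool) -> int:
    # Deliberately crude: any earlier write to a source register counts as RAW.
    depth = [0] * len(trace)
    for i, (dst, srcs) in enumerate(trace):
        for j in range(i):
            prev_dst, prev_srcs = trace[j]
            raw = prev_dst in srcs          # true (read-after-write) dependency
            war = dst in prev_srcs          # anti-dependency
            waw = dst == prev_dst           # output dependency
            if raw or (not with_renaming and (war or waw)):
                depth[i] = max(depth[i], depth[j] + 1)
    return max(depth) + 1

for renaming in (False, True):
    cp = critical_path(trace, renaming)
    print("renaming" if renaming else "no renaming",
          "-> critical path", cp, " ILP limit", len(trace) / cp)

In this toy trace the reuse of r1 serializes everything without renaming (critical path 4, ILP 1.0), while with ideal renaming only the true dependencies remain (critical path 2, ILP 2.0) - and the spill-code argument above is about the cases where the renamer can no longer give you that back.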
That having been said, I truly believe that if ARM decided to make their cores the same size as Firestorm, they would achieve very similar IPC - contrary to the x64 competition. In some cases ARM is very explicit about this; they say: had we increased feature xyz by this amount, we would have gained uvw amount of performance - "so we are not doing this". The X1 is the first ARM core where ARM is somewhat deviating from these considerations.