Fair enough. But in my defense, most of the data is already in the thread, is pretty easy to check, or involves concepts I thought could be explained with logic alone.
Anyway, the fact that the LTT video is a year and a half old is right there in its timestamp, and the NotebookCheck article covers much of what I was talking about, both in general and specifically for single-core Cinebench R24 versus the M2, as well as power measurements versus Intel and AMD. For multi-core (and single-core) Geekbench 6.2 ray tracing:
Benchmark results for a SAMSUNG ELECTRONICS CO., LTD. Galaxy Book4 Edge with a Snapdragon X Elite - X1E80100 - Qualcomm Oryon processor.
browser.geekbench.com
Benchmark results for a Mac Studio (2023) with an Apple M2 Max processor.
browser.geekbench.com
That covers my first paragraph on Cinebench R24. On to the second. Is it the concept that a downclocked many-core processor will be more efficient than a smaller, higher-clocked processor at an embarrassingly parallel task like Cinebench that you don't understand? I think the clearest, simplest way to show this in action is with GPUs.
NVIDIA AD106, 1230 MHz, 4608 Cores, 144 TMUs, 48 ROPs, 8192 MB GDDR6, 2000 MHz, 128 bit
www.techpowerup.com
NVIDIA AD107, 1890 MHz, 3072 Cores, 96 TMUs, 48 ROPs, 8192 MB GDDR6, 2000 MHz, 128 bit
www.techpowerup.com
Here are two GPUs with identical architectures and identical performance, yet the top one uses 70% less power (unlike for CPUs, GPU TDP is still meaningful, and since these share an architecture we can compare performance with just TFLOPs). Why? Is it because the Max-Q's core design is "better"? Nope, the cores are identical; it's because the 4070 Max-Q has 50% more cores clocked 35% lower. Basically, the reviewer you linked to has done something very similar, just on the CPU side with AMD versus Qualcomm processors. He power-limited the AMD processor, which has 33% more threads, and yes, that lets it catch up to the Qualcomm processor in a test that heavily favors many threads (there's a reason Cinebench R24 is now as much a GPU benchmark as a CPU one, given the spread of ray-tracing hardware on GPUs). I haven't watched the whole thing, but it doesn't seem like he then compared single-core scores with the AMD processor power-limited. Had he done so, that would also have demonstrated what I'm talking about. Even from what he did present, though, think about it: if at the same wattage design A needs more threads than design B to compete, then thread for thread B is more efficient; and if B is also cheaper to make as a result (less silicon), as is apparently the case, A is at a disadvantage even when it tries to substitute more cores and threads for lower per-thread performance in multithreaded workloads. You can also find power curves online for multi-core processors with different core counts, which demonstrate the same thing, as will the next section.
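To put rough numbers on that wide-and-slow versus narrow-and-fast tradeoff, here's a minimal back-of-the-envelope sketch using the core counts and clocks from the two TechPowerUp listings above. The power model (dynamic power roughly proportional to cores × frequency × voltage², with voltage rising roughly linearly with frequency) is my simplifying assumption, not a measurement:

```python
# Back-of-the-envelope: why more cores at lower clocks wins on perf/W.
# Assumed model (not measured): dynamic power ~ cores * f * V^2, with
# voltage scaling roughly linearly with frequency.

def throughput(cores, mhz):
    # Same architecture, so cores * clock is a fair throughput proxy.
    return cores * mhz

def relative_power(cores, mhz):
    v = mhz  # stand-in: voltage assumed ~proportional to frequency
    return cores * mhz * v ** 2

wide_slow   = dict(cores=4608, mhz=1230)  # AD106 listing above
narrow_fast = dict(cores=3072, mhz=1890)  # AD107 listing above

print(f"throughput ratio (wide/narrow): "
      f"{throughput(**wide_slow) / throughput(**narrow_fast):.2f}")          # ~0.98
print(f"power ratio      (wide/narrow): "
      f"{relative_power(**wide_slow) / relative_power(**narrow_fast):.2f}")  # ~0.41
```

The toy model won't reproduce the exact TDP gap (real voltage/frequency curves aren't linear), but it shows the direction and rough magnitude: essentially the same throughput at a fraction of the power.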
That two-way SMT allows x86 to catch up to ARM in performance per watt (and that having more threads helps perf/W on multi-core tests) can be seen here:
https://www.anandtech.com/show/17024/apple-m1-max-performance-review/3 and in the NotebookCheck article referenced above. Compare the single-threaded and multithreaded performance per watt: sure, the M1 Max always "wins", but the 16-thread Intel machine claws back a huge amount of the perf/W deficit on the multi-core tests (okay, except 503.bwaves, where the M1 Max is untouchable regardless and, I think, actually extends its lead because of its massive RAM bandwidth, but that's about the SoC's memory bandwidth, not its CPU cores).
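If it helps to see the arithmetic of that "clawing back", here's a tiny sketch. The scores and watts are invented purely for illustration (they are not the SPEC or NotebookCheck numbers), and I'm assuming SMT adds throughput at roughly no extra power:

```python
# Illustrative arithmetic only: the scores/watts below are made up,
# not taken from the AnandTech or NotebookCheck data.

arm_perf, arm_watts = 110, 10   # hypothetical ARM core, one thread
x86_perf, x86_watts = 100, 20   # hypothetical x86 core, one thread
smt_uplift = 1.25               # assume SMT adds ~25% throughput for ~free

gap_1t = (arm_perf / arm_watts) / (x86_perf / x86_watts)
gap_mt = (arm_perf / arm_watts) / (x86_perf * smt_uplift / x86_watts)

print(f"perf/W gap, one thread per core:     {gap_1t:.2f}x")  # 2.20x
print(f"perf/W gap, x86 running two threads: {gap_mt:.2f}x")  # 1.76x
```

The gap doesn't close, but it narrows by roughly the SMT uplift, which is the kind of pattern the multi-core perf/W numbers show.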
On the third paragraph from my original post: again, you can compare single-core performance and performance per watt in the NotebookCheck article from this thread, and
here's Lion Cove abandoning SMT. The rumor is that this will be the new standard for Intel processors going forward.
Here's an article relaying the rumor that AMD is developing its own ARM processor, possibly a custom core, possibly an off-the-shelf Cortex. So, in relation to your statement below, it may not just be "new guys" without access to x86 developing ARM SoCs going forward. I doubt AMD will abandon x86 anytime soon, but the rumors suggest they are certainly hedging their bets just in case.
As I hopefully have demonstrated sufficiently: right now, AMD and Intel x86 core designs are generally at a significant disadvantage in performance per watt, especially in single-threaded tasks. If you need more, check the NotebookCheck articles, old AnandTech articles (
this is a good one; Andrei explains in the comments why Intel/AMD aren't even on the bubble power chart but are listed to the side), and the many Chips and Cheese articles discussing performance and performance per watt - Geekerwan's videos, if you don't mind auto-translated Chinese, also do great work measuring performance per watt. For heavily multithreaded scenarios, SMT lets x86 claw back some of that perf/W deficit. No arguments there. If embarrassingly parallel multithreaded tasks are more "real world" to you, then I would disagree in general for desktop, laptop, and mobile users (
as would the developer of Geekbench, John Poole), but everyone has their own use cases and I'm a great believer in following the benchmarks that most closely resemble your own workflows.
So why do ARM designs have this perf/W advantage (currently)? The most commonly cited culprits, in rough order of importance: 1) decode, 2) "long pipes", 3) vectors, 4) old-ISA cruft. You'll note the last two have nothing to do with x86-64 per se (and even the second is debatable, but we'll get into that). Indeed, Intel has
proposals for new, better vector extensions,
finally clearing out the old cruft, and increasing the number of general-purpose registers - basically making x86-64 more ARMv8/9-like. However, the first two, decode and pipe length, are the interesting ones and are somewhat interlinked (again, I'll get to pipes in a bit). You'll see varying theories about how crucial decode is: I've seen Ian and Andrei quote AMD and ARM engineers reckoning that x86 decode costs roughly 10% in performance/perf-per-watt, and I've seen people say it doesn't matter at all. Unfortunately that was on Twitter and I no longer have access. More importantly, the reason I don't link to it is that I don't actually believe any of them: truthfully, we don't know, and this gets into where I agree with you that comparing architectures in the complete abstract is not just difficult, but almost impossible without comparing concrete designs on actual benchmarks.
So what do we actually know about the ISAs and how they affect these concrete microarchitecture designs? This is a great blog post to start with:
https://queue.acm.org/detail.cfm?id=3639445. Basically, microarchitecture does indeed matter more, much more, than architecture (ISA). But this gets at the heart of why Apple was able to design the microarchitecture of the Firestorm core and beat x86 designs so handily in both performance and performance per watt when it came out four years ago. The particular reason I'm going to focus on (it isn't the only one) is that ARM is easier to decode in parallel than x86. Prior to Firestorm, x86 was pretty much stuck at 4-wide decode; I even remember an AMD engineer saying back in the day that they didn't think it was possible to go past that (it is). With 8-wide decode, Apple could make everything else wider too: caches, the number of parallel pipes (how many instructions the CPU can work on at once), the reorder buffer, and so on. Similar to what we were just discussing in the GPU section, a wider core that processes more instructions in parallel, operates at a lower frequency, and has shorter pipes, yet is still smaller in silicon, uses much less power for the same or better performance, especially on a single thread. To get more out of longer pipes, x86 designs have to run the clock faster, and in doing so, much of the time parts of an individual pipeline sit empty, wasting watts. That's why SMT works so well in current x86 designs: you can interleave instructions from another thread into the same pipe and suddenly the core is filled with work. But it's also why SMT (currently) doesn't add much for ARM designs: the cores are built around getting as much as possible out of shorter pipes, just more of them. The architecture aided the natural development of the microarchitecture; indeed, on Twitter (deleted now, I think), Shac Ron basically said ARM and Apple collaborated quite closely on ARMv8's design with future Firestorm-like cores in mind. In other words, the ISA was designed around getting to where we are today.
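To make the "empty pipes" point concrete, here's a toy model (my simplification, not anything from the blog post above): treat each cycle as having some probability that a thread is stalled and leaves the issue slots empty. A deep-pipe, high-clock design stalls more often per thread, so a second SMT thread has lots of bubbles to fill; a wide, short-pipe core that already keeps itself busy gains much less. The stall probabilities are assumptions chosen for illustration:

```python
# Toy model: SMT helps bubble-prone (deep-pipe, high-clock) designs more
# than cores that already keep their pipes full. Stall probabilities are
# illustrative assumptions, not measurements.

def utilization(stall_prob: float, threads: int) -> float:
    # Idealized SMT: a cycle is wasted only if every resident thread
    # happens to be stalled that cycle.
    return 1.0 - stall_prob ** threads

for name, stall in [("deep pipes, high clock", 0.40),
                    ("wide core, short pipes", 0.15)]:
    one_t, two_t = utilization(stall, 1), utilization(stall, 2)
    gain = (two_t / one_t - 1) * 100
    print(f"{name}: 1T {one_t:.0%}, 2T {two_t:.0%}, SMT gain ~{gain:.0f}%")
```

Real SMT gains are also capped by shared caches and front-end contention, so treat those percentages as directional only.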
Now, x86 isn't standing still. 4-wide decode is a thing of the past on its newest, best cores. Even beyond the larger-scale changes mentioned above for the future, Intel and AMD are charging ahead with wider core architectures using either larger micro-op caches (Zen 4/5, Golden/Lion Cove) or more exotic decode strategies (Gracemont/Skymont). Intel is also abandoning SMT-based designs and trying to make more efficient use of its cores on a single thread of work. Will the 10% estimate prove prophetic? Will x86 catch ARM entirely? Will the deficit end up being more than 10%? I don't know, and I don't think anyone really does. Like you, I'm not confident enough to say ARM will beat x86 forever. But what I can say with more confidence is that, right now, ARM designs have a substantial performance-per-watt advantage in many workloads overall and in most single-threaded workloads, and the ISA helped with the design of those cores.
Looking back at my own wall of text above: I know this was a giant info dump and I apologize for the length, but this is an expansive topic with a lot of nuance, and even all of this barely touches on it. We're also starting to approach the limits of my own CPU understanding; there are others here with more depth and breadth of knowledge if you want to delve further into what is a really interesting topic.