
komuh

macrumors regular
May 13, 2023
126
113
As @Xiao_Xi linked to, there were problems with Windows and "Modern Standby," but I'm not sure of the state of play now. The issue was a confluence of OEMs not implementing all possible sleep states in their products, Windows making bad assumptions, and power-hungry x86 chips. That was a couple of years ago and MS claimed to be working on it. To be fair, Macs could occasionally experience similar issues, just far less often and more easily circumvented (turning off "Wake for network access" basically means it will never happen on the Mac).

As for Linux ... I think Hector is working hard to make sure sleep states work as intended on Asahi Macs, but a lot of OEMs don't do that, and my memory is that sleep states, and power management in general, on Linux laptops have a really bad reputation overall. Here's one recent-ish Reddit thread:

https://www.reddit.com/r/AsahiLinux/comments/1907jp4
Power-hungry "x86" is such a myth. AMD and even Intel chips set to the same power limits as current X Elite devices have almost the same performance per watt; it is like a 10-15% extra score vs Intel, and it seems to lose vs a power-limited AMD U series.

 
  • Like
Reactions: Xiao_Xi

crazy dave

macrumors 65816
Sep 9, 2010
1,450
1,221
Power-hungry "x86" is such a myth.

It's not really though. Especially not when that LTT video was taken a year and a half ago, which is what I was referring to when I listed it amongst the toxic confluence of factors that resulted in LTT's Windows experience relative to the Mac.
AMD and even Intel chips set to the same power limits as current X Elite devices have almost the same performance per watt; it is like a 10-15% extra score vs Intel, and it seems to lose vs a power-limited AMD U series.

That's an interesting data point and I don't deny that x86 has made significant gains in recent generations but there are a few things to note: for some reason yet again CB R24 might not be behaving itself - not as bad as R23 but there is some weirdness. Take a look at CB R24 SC and MC scores for the Elite X - they look the same or worse than the M2 Pro/Max when they should be beating it. 12 effectively upclocked M2 P-cores (Oryon) should beat the M2 Pro/Max's 8+4 in multicore and the SC for the Oryon cores should likewise be higher, but they aren't ... in CB. In contrast, GB 6.2 ray tracing, similar in concept to CB R24 and also using Intel Embree for its vector math but shorter in execution, has much higher relative scores for the Elite compared to the M2 Pro/Max for both SC and MC. I suspect that is closer to the "ground truth" of the silicon. Again, very unclear what is going on here but something is odd since the Oryon and the M2 should be similar here in the broad strokes and the higher end Oryons are running at higher clocks.

Secondly, just like comparing the Elite to the M2 Pro/Max or, even worse, the base M2, comparing multicore performance is very tricky and depends on what you want to emphasize. For instance, the AMD Ryzen 7 8840U is an 8-core, 16-thread CPU. So it's 12 versus 16 threads, and CB loves threads, the more the merrier. In other words, to make the comparison ridiculous, if I were to run a Threadripper at 20-something watts (provided it was stable) I could get even more performance at that wattage, vastly more. Again it depends on what we want to know: if it's what kind of device can I buy with $x whose default performance/watt states give me y, then this doesn't matter as much. Compare away (although again, supposedly Qualcomm is cheaper than Intel/AMD here for OEMs). But if it's whose cores are "better" in some abstract sense, then normalizing by thread count really begins to matter. And to be clear, workloads like CB are where 2-way SMT really makes up for the deficit in SC performance per watt that x86 typically suffers from with respect to ARM designs. That's always been true even back when Apple first released the M1. That's one of the strengths of current x86 designs. They can claw back some, or in this case all, of that deficit in perf/W in high-thread-count multicore workloads with 2-way SMT.

But it's telling that even Intel is moving away from SMT (in their consumer processors) going forwards. For a lot of users, single core performance (and SC performance per watt) is the most important day-to-day driver of how well it feels to use a device. And, right now, ARM designs are tough to beat here, especially in laptops, which predominate the market*. Improving the x86 cores so they don't need SMT is the direction that Intel, at least, is moving in. We'll see if they pull that off and, yes, currently Intel are behind AMD in perf/watt.

That said, this is why I've repeatedly admitted that while the Snapdragon Elite looks very good relative to x86, it may have arrived later than it needed to to have the impact that Qualcomm probably wanted. x86, while still largely behind in this area, isn't standing still, and x86 SOCs just have to be "good enough" here for Qualcomm to potentially struggle unless they also excel in other areas that users care about, which is debatable.

*It should be noted that AMD are rumored to be hedging their bets and developing custom ARM-based cores too.
 
Last edited:
  • Like
Reactions: MiniApple

Icelus

macrumors 6502
Nov 3, 2018
421
574
Arm64EC - Build and port apps for native performance on ARM
Arm64EC (“Emulation Compatible”) enables you to build new native apps or incrementally transition existing x64 apps to take advantage of the native speed and performance possible with Arm-powered devices, including better power consumption, battery life, and accelerated AI & ML workloads.

Arm64EC is a new application binary interface (ABI) for apps running on Arm devices with Windows 11. It is a Windows 11 feature that requires the use of the Windows 11 SDK and is not available on Windows 10 on Arm.
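For anyone wanting to poke at this from the code side, here is a minimal sketch of detecting at compile time which flavor a translation unit is being built as. The macro names and the /arm64EC switch are as I understand them from the MSVC documentation, so treat this as a starting point and verify against your own toolchain.

Code:
#include <cstdio>

// Note: per my reading of the docs, Arm64EC builds also define _M_X64/_M_AMD64
// (to stay source-compatible with x64 code), so check _M_ARM64EC first.
const char* build_flavor() {
#if defined(_M_ARM64EC)
    // Arm64EC: native Arm64 code following the x64 ABI, so it can be mixed
    // in-process with emulated x64 DLLs.
    return "Arm64EC";
#elif defined(_M_ARM64)
    return "classic Arm64";
#elif defined(_M_X64)
    return "x64";
#else
    return "something else";
#endif
}

int main() {
    // Built e.g. with "cl /arm64EC /EHsc flavor.cpp" from an Arm64 developer
    // prompt (switch name as I recall it; verify locally).
    std::printf("Compiled as: %s\n", build_flavor());
}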
 

komuh

macrumors regular
May 13, 2023
126
113
Without data you can write even more, but I won't hear your argument. x86 for sure is just not viable if you are not Intel or AMD; that's why new guys in this space will always use ARM and some accelerators, or even try RISC-V, but power usage isn't much different in real-world scenarios.

I'm not an expert in designing microarchitectures, but I never heard the claim that CISC is worse for perf per watt compared to RISC; it is just too complex of a case to be sure we can compare one architecture to another, and right now Qualcomm isn't better compared to AMD with power limits.
 

salamanderjuice

macrumors 6502a
Feb 28, 2020
580
613
Microsoft seems to have messed up with Modern Standby, and AMD/Intel CPUs did not help to hide the problem.

Not just Microsoft. I've had the same issues with Apple too where for whatever reason my Intel MBP tried to cook itself by waking up in my bag. Qualcomm has a lot more experience with connected standby from their time with mobile devices and phones and it shows. My Surface Pro X has been very good about not draining too much on standby so I'm not surprised the newer CPUs seem just as capable.
 
  • Like
Reactions: bousozoku

crazy dave

macrumors 65816
Sep 9, 2010
1,450
1,221
Not just Microsoft. I've had the same issues with Apple too where for whatever reason my Intel MBP tried to cook itself by waking up in my bag. Qualcomm has a lot more experience with connected standby from their time with mobile devices and phones and it shows. My Surface Pro X has been very good about not draining too much on standby so I'm not surprised the newer CPUs seem just as capable.
That's covered in the LTT video. They admit that it happens rarely (compared to MS) on the Mac too, but it's an easy fix to make sure it doesn't happen: turn off "Wake for network access."
 

crazy dave

macrumors 65816
Sep 9, 2010
1,450
1,221
Without data you can write even more, but I won't hear your argument.
Fair enough. But in my defense, most of the data is already in the thread or pretty easy to check, or covers concepts I thought were explainable with just logic alone.

Anyway, that the LTT video is from a year and a half ago is in the timestamp of the video, and the NotebookCheck article covers much of what I was talking about, both in general and in particular for single-core CB R24 versus the M2, as well as power measures versus Intel and AMD. For multicore GB 6.2 ray tracing (and single core):



That covers my first paragraph on Cinebench R24. Onto the second. Is it the concept that if you downclock a multicore processor then it will be more efficient than a smaller processor doing an embarrassingly parallel task, like Cinebench, that you don't understand? I think the clearest, simplest way to show this in action is with GPUs.



Here are two GPUs, identical architectures, identical performance, but the top GPU uses 70% lower power (unlike for CPUs, GPU TDP is still meaningful and we can compare performance with just TFLOPs since these are the same architecture). Why? Is it because the Max-Q's core design is "better"? Nope identical cores, rather it is because the 4070 MaxQ has 50% more cores, clocked 35% lower. Basically what the reviewer you link to has done is very similar albeit on the CPU and with AMD vs Qualcomm processors. He's power limited the AMD processor with 33% more threads and, yes, that allows it to catch up to the Qualcomm processor in a test that heavily favors multiple threads (there's a reason why Cinebench R24 is now as much a GPU benchmark as CPU with the spread of ray tracing hardware on GPUs). I haven't watched the whole thing but it doesn't seem like he then compared single core scores when he power limited the AMD processor. Had he done so, that would also have demonstrated what I'm talking about. Even just from what he has presented though, think about it: if at the same wattage design A requires more threads than design B to compete, thread to thread B is more efficient and if B is cheaper to make as a result (less silicon) as is apparently the case, A is at a disadvantage even if they try to substitute lower performance with more cores and threads in multithreaded workloads. Online you can also find power curves for multicore processors with different numbers of cores and this will also demonstrate what I'm talking about although the next section will as well.
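If it helps, here is a toy, first-order model of that wide-and-slow versus narrow-and-fast tradeoff. The scaling assumptions are mine (throughput ~ cores x clock, dynamic power ~ cores x clock x V^2, with voltage roughly tracking clock near the top of the range); real silicon is messier, so treat the numbers as illustration rather than measurement.

Code:
#include <cstdio>

struct Config {
    double cores;   // relative core count
    double clock;   // relative clock speed
};

double throughput(Config c) { return c.cores * c.clock; }

double power(Config c) {
    double voltage = c.clock;                 // crude V ~ f assumption
    return c.cores * c.clock * voltage * voltage;
}

int main() {
    Config narrow_fast{1.0, 1.0};             // baseline design
    Config wide_slow{1.5, 0.65};              // +50% cores, -35% clock

    std::printf("throughput: %.2f vs %.2f\n",
                throughput(narrow_fast), throughput(wide_slow));
    std::printf("power:      %.2f vs %.2f\n",
                power(narrow_fast), power(wide_slow));
    // With these made-up numbers the wide/slow config lands at ~0.98x the
    // throughput for ~0.41x the power -- the same qualitative picture as the
    // GPU comparison above.
}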

That two-way SMT allows x86 to catch up to ARM in performance-per-watt (and that having more threads helps in performance/watt on multicore tests) can be found here: https://www.anandtech.com/show/17024/apple-m1-max-performance-review/3 and in the Notebookcheck article referenced above. Compare the single threaded and multithreaded performance/watt. Sure, the M1 Max always "wins", but the 16-thread Intel machine claws back a huge amount of the performance/watt on the multicore tests (okay, except 503.bwaves where the M1 Max CPU is untouchable regardless and I think actually extends its lead because of its massive RAM bandwidth, but that's about the SOC's memory bandwidth, not its CPU cores).

On the third paragraph from my original post: again you can compare single core performance and performance per watt in the NotebookCheck article from this thread, and here's Lion Cove abandoning SMT. The rumors are that this will be the new standard for Intel processors moving forwards. Here's an article relaying the rumor that AMD is developing its own ARM processor, possibly a custom core, possibly an off-the-shelf Cortex. So in relation to your below statement, it may not just be "new guys" without access to x86 developing ARM SOCs going forwards. I doubt AMD will be abandoning x86 anytime soon, but the rumors are that they are certainly hedging their bets just in case.

x86 for sure is just not viable if you are not Intel or AMD; that's why new guys in this space will always use ARM and some accelerators, or even try RISC-V, but power usage isn't much different in real-world scenarios.

I'm not an expert in designing microarchitectures, but I never heard the claim that CISC is worse for perf per watt compared to RISC; it is just too complex of a case to be sure we can compare one architecture to another, and right now Qualcomm isn't better compared to AMD with power limits.
As I hopefully have demonstrated to you sufficiently, right now AMD and Intel x86 core designs are generally at a significant disadvantage in terms of performance/watt, especially in single threaded tasks. If you need more, you can check the Notebookcheck articles, old Anandtech articles (this is a good one; Andrei explains in the comments why Intel/AMD aren't even on the bubble power chart but listed to the side), and multiple chipsandcheese articles where they discuss performance and performance/watt - Geekerwan's videos, if you don't mind auto-translations of Chinese, also do great work on measuring performance per watt. For heavily multithreaded scenarios, SMT allows x86 to claw back some of that perf/W deficit. No arguments. If embarrassingly parallel multithreaded tasks are more "real world" to you, then I would disagree in general for desktop, laptop and mobile users (as would John Poole, the developer of Geekbench), but everyone has their own use cases and I'm a great believer in following the benchmarks that most closely resemble your own workflows.

So why do ARM designs have this perf/Watt advantage (currently)? The most common set of culprits comes down to multiple reasons, listed in order of importance: 1) decode, 2) "long pipes", 3) vectors, 4) old ISA cruft. Now you'll note that the last two have nothing to do with x86-64 per se (and even the second is debatable, but we'll get into that). Indeed Intel has proposals for new, better vector extensions, finally clearing out the old cruft, and increasing general purpose registers - basically making x86-64 more ARM v8/9-like. However, the first two, decode and pipe length, are interesting and kinda interlinked (again, I'll get to pipes in a bit). You'll see varying theories about how crucial decode is - I've seen Ian and Andrei quoting AMD and ARM engineers who reckon x86 decode is about a 10% performance/PPW penalty, and I've seen people say it doesn't matter at all. Unfortunately this was on Twitter and I no longer have access. But even more importantly, the reason I don't link to it is that I don't actually believe any of them: truthfully we don't know, and this gets into where I agree with you that comparing architectures in the complete abstract is not just difficult, but almost impossible without comparing concrete designs on actual benchmarks.

So what do we actually know about the ISAs and how they affect these concrete microarchitecture designs? This is a great blog post to start with: https://queue.acm.org/detail.cfm?id=3639445. Basically microarchitecture does indeed matter more, much more than architecture (ISA). But this gets at the heart of why Apple was able to design the microarchitecture of a Firestorm core and beat x86 designs so handily in both performance and performance per watt when it came out 4 years ago. The particular reason I'm going to focus on (but it isn't the only one) is that ARM is easier to decode in parallel than x86. Prior to the introduction of Firestorm, x86 was pretty much stuck on 4-wide decode; I even remember an AMD engineer saying back in the day that they didn't think it was possible to go past that (it is). With an 8-wide decode, Apple could make everything else wider too: caches, number of parallel pipes (the number of different instructions a CPU can work on at once), reorder buffer, etc. Similar to what we were just discussing in the GPU section, a wider core processing more instructions in parallel, operating at lower frequency and with shorter pipes, and that is still smaller in silicon, uses much less power for the same or better performance, especially on a single thread. To get more out of longer pipes, x86 designs have to speed up the clock; in doing so, though, much of the time parts of an individual pipeline sit empty - wasting watts. That's why SMT works so well in current x86 designs: you can interleave instructions from another thread into the same pipe and suddenly the core is filled with work. But it's also why SMT (currently) doesn't add much for ARM. The cores are designed around getting as much as possible out of shorter pipes, but more of them. The architecture aided in the natural development of the microarchitecture, and indeed on Twitter (deleted now, I think) Shac Ron basically said ARM and Apple collaborated quite closely on ARM v8's design with future Firestorm-like cores in mind. In other words, the ISA was designed around getting to where we are today.

Now x86 isn't staying still. 4-wide decode is a thing of the past on their newest, best cores. Even beyond the larger scale changes mentioned above for the future, Intel and AMD are charging ahead on wider core architectures using either larger instruction caches (Zen 4/5, Golden/Lion Cove) or more exotic decode strategies (Gracemont/Skymont). Intel is also abandoning SMT-based designs and trying to make more efficient use of its cores on a single thread of work. Will the 10% estimation prove prophetic? Will x86 catch ARM entirely? Will it be more than a 10% deficit in the end? I don't know. And I don't think anyone really does. Like you, I'm not confident enough to say that ARM will always beat x86 forever. But what I can say with more confidence is that, right now, ARM designs have a substantial performance per watt advantage in many workloads overall and most single threaded workloads, and the ISA helped with the design of those cores.

Looking at my own work above, I know this was a giant info dump and I apologize for the length, but this is an expansive topic with a lot of nuance to convey - barely touched on even here. And we're starting to approach my particular limits of CPU understanding, and there are others here with more depth and breadth of knowledge if you want to delve further into what is a really interesting topic.
 
Last edited:

bousozoku

Moderator emeritus
Jun 25, 2002
16,120
2,397
Lard
Not just Microsoft. I've had the same issues with Apple too where for whatever reason my Intel MBP tried to cook itself by waking up in my bag. Qualcomm has a lot more experience with connected standby from their time with mobile devices and phones and it shows. My Surface Pro X has been very good about not draining too much on standby so I'm not surprised the newer CPUs seem just as capable.
My mid-2012 quad-core i7 MacBook Pro would wake partway through my drive and I'd have almost no battery life remaining to process photos. I'm glad my backpack didn't melt.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Does using a different memory consistency model give Arm implementations any advantage over x64 implementations?
 
  • Like
Reactions: crazy dave

komuh

macrumors regular
May 13, 2023
126
113
OK, thank you for such a big update, I really appreciate that, but I still don't get the whole "RISC is better for perf/power" case.

I know about the decoder and pipeline differences etc. RISC is easier to implement and is more power efficient at the moment, especially if you don't use any complex instructions (otherwise it is just a lot slower, so worse on the performance/watt metric) and you just "spam" a ton of simple instructions without any data dependencies on each other. But where is this limit coming from? If it is just that decoding is slower (as is only natural when you have a lot more instructions), you can mix and match your designs for just that (different CPU clusters with different instruction sets?), and even if it takes, let's say, 10% more power/latency to decode a single instruction, there is the potential to be A LOT faster every time you can use a complex instruction instead of multiple simple instructions (which v8/v9 ARM is already kind of doing with, for example, NEON instructions, which are complex).

I feel like we agree on most things; it is just that x86-64, like you say, is currently worse on performance because both architectures come from different spaces and are converging on a single middle ground. I feel like we just see x86 -> simpler and wider pipes, thinking about removing older instructions, and ARM -> more and more complex instructions with each new version, right now focused on cryptography and SIMD stuff.

And I think we can see it mostly on the server side: new Intel and AMD are beating ARM designs in that space (not much choice there atm, we got only Nvidia servers), and hyperscalers aren't pushing ARM as much as before (for example, price/performance at Amazon is now close between x86 and ARM (Graviton), and in most classic cases it is a lot cheaper on x86-based servers (mostly Xeons), compared to a few years ago when it was almost always cheaper to just use Graviton).
 
  • Like
Reactions: crazy dave

crazy dave

macrumors 65816
Sep 9, 2010
1,450
1,221
Does using a different memory consistency model give Arm implementations any advantage over x64 implementations?
My impression is that it definitely can for multithreaded scenarios, but in terms of how much it helps ... well, we're getting past my point of usefulness here. I've seen it discussed in the context of the M1 (example here), and they found a ~9% performance loss in TSO compared to the weak-ordering model - in some ways that's the best, most controlled way to test, the same microarchitecture with a switch that flips between the two. However, my expertise again being stretched here, maybe one could also argue that the M1 isn't really designed with TSO in mind, and a processor that was designed knowing TSO was the primary memory model might behave differently.
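To make the difference concrete, here is a toy example of mine (not from the linked paper) using the classic message-passing pattern with C++ atomics. Under x86-style TSO even a relaxed version of the flag happens to work in practice, because stores are not reordered with stores and loads are not reordered with loads; under ARM's weaker model it can fail, so portable code has to ask for release/acquire and the hardware is otherwise free to reorder.

Code:
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> data{0};
std::atomic<bool> flag{false};

void producer() {
    data.store(42, std::memory_order_relaxed);
    // Release: everything written before this store becomes visible to a
    // thread that observes the flag with an acquire load.
    flag.store(true, std::memory_order_release);
}

void consumer() {
    while (!flag.load(std::memory_order_acquire)) { /* spin */ }
    // Guaranteed by release/acquire; with relaxed ordering on the flag this
    // assert could fire on weakly-ordered ARM but not (in practice) under TSO.
    assert(data.load(std::memory_order_relaxed) == 42);
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}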
OK, thank you for such a big update, I really appreciate that, but I still don't get the whole "RISC is better for perf/power" case.

I know about the decoder and pipeline differences etc. RISC is easier to implement and is more power efficient at the moment, especially if you don't use any complex instructions (otherwise it is just a lot slower, so worse on the performance/watt metric) and you just "spam" a ton of simple instructions without any data dependencies on each other. But where is this limit coming from? If it is just that decoding is slower (as is only natural when you have a lot more instructions), you can mix and match your designs for just that (different CPU clusters with different instruction sets?), and even if it takes, let's say, 10% more power/latency to decode a single instruction, there is the potential to be A LOT faster every time you can use a complex instruction instead of multiple simple instructions (which v8/v9 ARM is already kind of doing with, for example, NEON instructions, which are complex).
I'm afraid I'm not quite following your paragraph here; apologies, it is late here and I'm responding because I'm having trouble sleeping. So I'm not sure if I'm addressing your post right, but I'll try: the decoding thing isn't about the number of instructions - in fact ARM and x86-64 have similar instruction density - but rather that x86-64 is simply harder to decode. With ARM you know every instruction is 32 bits long, but an x86 instruction can be anywhere from 1 to 15 bytes. So you've got an instruction stream and you've got to figure out where each instruction actually is. Most instructions are well behaved, but the fact that any one *could* be any length means the decoders have to do a lot of scanning, and doing more of that scanning in parallel naively simply wastes more work, whereas for ARM you can build as many decoders as you can build a back end for; in other words, the decoders themselves are no longer a bottleneck to building a wider core. Because of this decode bottleneck, x86 by contrast historically built narrower cores with longer pipelines run much faster. But if you think of a pipe as a series of stages in an assembly line, the faster you run them, the more gaps appear where a stage isn't doing anything - power is on, but nothing is happening until the next instruction comes through. That's why, while not even all multithreaded workloads benefit equally, SMT can work quite well and increase perf/watt for a core designed with it in mind. The first page of the linked article also gives a really good explanation of what I'm trying to say here.
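A crude sketch of the boundary-finding problem, purely illustrative: with fixed 32-bit instructions every decode lane can compute its own starting offset, while with variable-length instructions each start depends on the length of everything before it - the serial chain real x86 decoders spend extra logic (length pre-decode, uop caches, etc.) working around.

Code:
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Fixed-length ISA: instruction starts are a pure function of the index,
// so an 8-wide decoder's lanes can all find their instruction independently.
std::vector<std::size_t> boundaries_fixed(std::size_t code_bytes) {
    std::vector<std::size_t> starts;
    for (std::size_t off = 0; off < code_bytes; off += 4)
        starts.push_back(off);
    return starts;
}

// Variable-length ISA: each start is known only after the previous length is
// decoded (lengths here are made up; real x86 instructions are 1-15 bytes).
std::vector<std::size_t> boundaries_variable(const std::vector<std::uint8_t>& lengths) {
    std::vector<std::size_t> starts;
    std::size_t off = 0;
    for (std::uint8_t len : lengths) {
        starts.push_back(off);
        off += len;  // serial dependency: lane N needs lanes 0..N-1 resolved
    }
    return starts;
}

int main() {
    auto fixed = boundaries_fixed(32);                            // 8 instructions
    auto variable = boundaries_variable({3, 1, 7, 2, 5, 1, 15, 4});
    std::printf("fixed: %zu starts, variable: %zu starts\n",
                fixed.size(), variable.size());
}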
I feel like we agree on most things; it is just that x86-64, like you say, is currently worse on performance because both architectures come from different spaces and are converging on a single middle ground. I feel like we just see x86 -> simpler and wider pipes, thinking about removing older instructions, and ARM -> more and more complex instructions with each new version, right now focused on cryptography and SIMD stuff.

And I think we can see it mostly on the server side: new Intel and AMD are beating ARM designs in that space (not much choice there atm, we got only Nvidia servers), and hyperscalers aren't pushing ARM as much as before (for example, price/performance at Amazon is now close between x86 and ARM (Graviton), and in most classic cases it is a lot cheaper on x86-based servers (mostly Xeons), compared to a few years ago when it was almost always cheaper to just use Graviton).
Servers and server workloads are emblematic of what I was getting at, where the tradeoffs x86 made pay off - huge core counts processing massively threaded workloads really allow SMT to shine* (and why Intel will continue to push SMT on Xeons, at least for now). In the Notebookcheck article, you can see Nuvia and Apple's single core performance/watt in CB R24 is double or even triple what AMD/Intel currently offer (next generation AMD/Intel chips obviously untested). But in multicore, AMD and Intel close that gap substantially - those 16 threads compared to 8-12 help a lot. That's why, to compete in the server space, we see larger and larger ARM-based server chips; they're trying to match or even exceed x86 designs on thread count. x86 has SMT to pack more threads into the same silicon area, while ARM has smaller cores (Intel and AMD are trying this as well with all-E-core or, in AMD's case, "c"-core server chips). Also it has to be said that ARM's own cores aren't as good as the designs from Nuvia and Apple (again, rumored future PCs with the Cortex X925 not being tested yet). That was in fact the original conceit of Nuvia: building server chips out of an M2-like core. Supposedly Qualcomm will indeed be coming out with Oryon-based servers later this year so we'll get to see how those compete.

*Heck in some circles it's why POWER10 with SMT8 remains a popular choice!
 
Last edited:

leman

macrumors Core
Oct 14, 2008
19,518
19,669
I know about the decoder and pipeline differences etc. RISC is easier to implement and is more power efficient at the moment

As @crazy dave writes, ARM is easier to decode. I am not so sure that ARM is easier to implement. Once you get past the decode stage, there is not much difference between contemporary x86 and ARM CPUs.


especially if you don't use any complex instructions (otherwise it is just a lot slower, so worse on the performance/watt metric) and you just "spam" a ton of simple instructions without any data dependencies on each other. But where is this limit coming from? If it is just that decoding is slower (as is only natural when you have a lot more instructions), you can mix and match your designs for just that (different CPU clusters with different instruction sets?), and even if it takes, let's say, 10% more power/latency to decode a single instruction, there is the potential to be A LOT faster every time you can use a complex instruction instead of multiple simple instructions (which v8/v9 ARM is already kind of doing with, for example, NEON instructions, which are complex).

What makes you think that ARM instructions are simpler or that “complex” instructions are more difficult to decode? I would say that average ARM instructions are more “complex” than average x86 instructions. ARM generally packs more information (operations) per instruction and has more instruction variants. In addition, performance of modern CPUs is not achieved by using many simple instructions, but by tracking data dependencies of hundreds of instructions simultaneously and progressing concurrently across multiple operations at once. ARM designs just happen to have some features that make this kind of progress a bit easier than x86.

Maybe you are thinking about early RISC vs. CISC? Your description applies quite well to those historical designs. CPUs were much simpler back then though.

I feel like we agree on most things; it is just that x86-64, like you say, is currently worse on performance because both architectures come from different spaces and are converging on a single middle ground. I feel like we just see x86 -> simpler and wider pipes, thinking about removing older instructions, and ARM -> more and more complex instructions with each new version, right now focused on cryptography and SIMD stuff.

I would say that x86 is trying to become more like ARM in order to catch up on power-efficient performance. Intel's proposed APX instructions are exactly that - they introduce ARM-style instructions that help increase operation density and facilitate superscalar execution.
 

komuh

macrumors regular
May 13, 2023
126
113
What makes you think that ARM instructions are simpler or that “complex” instructions are more difficult to decode? I would say that average ARM instructions are more “complex” than average x86 instructions. ARM generally packs more information (operations) per instruction and has more instruction variants. In addition, performance of modern CPUs is not achieved by using many simple instructions, but by tracking data dependencies of hundreds of instructions simultaneously and progressing concurrently across multiple operations at once. ARM designs just happen to have some features that make this kind of progress a bit easier than x86.
It is not about a single instruction, it is that the whole instruction set for x86-64 is just bigger; there are more instructions to implement right now. I'm agreeing that ARM is getting bigger and moving closer to x86, but if I remember correctly the modern AMD instruction set is around 1.5k instructions while ARM v8.6 is closer to 500.

And of course I agree with that whole "complexity" point; I only mean complex instructions in the CISC sense (for example REP SCASB etc.).
 

leman

macrumors Core
Oct 14, 2008
19,518
19,669
It is not about a single instruction, it is that the whole instruction set for x86-64 is just bigger; there are more instructions to implement right now. I'm agreeing that ARM is getting bigger and moving closer to x86, but if I remember correctly the modern AMD instruction set is around 1.5k instructions while ARM v8.6 is closer to 500.

Counting instructions can be tricky. For example, ARM lacks a register move instruction - those are often implemented as OR with a zero second operand. It’s a nice trick to manage the limited opcode space. But of course, it would be extremely wasteful to actually implement it this way. So the decoder contains specialized circuitry to detect these kinds of patterns and handle them specially. And there are a lot of examples like that: combined ALU+shift (which are standard on ARM), advanced addressing modes - all these are technically “one” instruction, but they are implemented as a sequence of separate steps on contemporary ARM CPUs.

So all this depends on the kind of the argument you want to make. You can argue that ARM has a more efficient instruction encoding scheme - but this is not the same as arguing it is simpler to implement in hardware (decoder complexity aside). I don’t think there is much difference in instruction complexity between ARM and x86 once you’ve moved past decode. What I like about ARM instructions is that they chunk some frequently co-occurring operations into a single instruction (e.g. shift and add or compute address and save the incremented register). This not only improves code density but also potentially simplifies dependency logic.
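To illustrate the chunking point, here are two tiny functions with the AArch64 a typical compiler emits at -O2 in the comments (from my memory of Compiler Explorer output, so worth re-checking against your own compiler):

Code:
#include <cstdint>

// add x0, x0, x1, lsl #3    ; the shift is folded into the add: one instruction
std::uint64_t scale_and_add(std::uint64_t base, std::uint64_t idx) {
    return base + (idx << 3);
}

// mov x0, x1                ; register-to-register MOV is an alias that the
// ret                       ; assembler encodes as orr x0, xzr, x1
std::uint64_t pass_through(std::uint64_t /*unused*/, std::uint64_t b) {
    return b;
}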


And of course I agree with that whole "complexity" point; I only mean complex instructions in the CISC sense (for example REP SCASB etc.).

Ah, I see. You are right, ARM traditionally lacked these types of instructions (to be fair, they are also very rarely used on x86, only one or two variants are actually properly accelerated - so it’s not a big implementation burden). On Apple CPUs, a bunch of instructions are microcoded, such as some table lookup variants.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
Supposedly Qualcomm will indeed be coming out with Oryon-based servers later this year so we'll get to see how those compete.

Which rumors say that? This one???

That is actually dubious. The whole front of the article is lots of hype about this, but at the end ...

"... Furthermore, the status of this project is currently unconfirmed, but Qualcomm partners reportedly received briefings about it at the end of 2021 and early 2022, aligning with previous rumors ..."

This appears to be 'echo chamber' stuff.

2021 was when Arm was making noises about suing Qualcomm. 2022 was when Arm did sue Qualcomm

That server 'product' was pretty likely a 'plan C' in case the suit didn't go all that well (for Qualcomm). [ Also a bit of a 'rope a dope' delaying tactic for Qualcomm, which drags out the timeline with Arm until it is almost 'too late' for Arm to do anything proactive. ] Basically, it is just continuing with the plan that Nuvia had in the beginning. That market is not why Qualcomm bought them. It is plain as day what the objective was in the current shipping product. Highly doubtful, once Qualcomm went 'whole hog' into pushing Oryon cores into PCs and reportedly into the smartphone zone (which probably needs an E-core, or at least a much lower-footprint core), that they would have any designer bandwidth left to do server processors. Nuvia wasn't some large design house with multiple completed cores shipping. They had not shipped anything. Nothing.

The whole X Elite/Plus product line is just one die. There is no track record of chasing after vastly different die sizes and markets at the same time.

[ The Centriq stuff? Qualcomm abandoned that after Broadcom came sniffing around for a takeover. To enhance the business financials they dropped it. A big chunk of those folks are at Ampere Computing and Microsoft now. If they had not dumped it they wouldn't have needed to pay Nuvia $1B to jump back into doing architecture license designs. There is a limit to Qualcomm's resources. ]

In server land Qualcomm is more likely to be putting money into follow ups to their inference card


as that stuff is 'red hot'. Not a server 'CPU' chip market. Neoverse is working relatively well and lots of hyperscale folks are doing "roll their own" CPU chips. If both Neoverse and Ampere Computing had badly stumbled, then maybe Qualcomm would have stepped in. So it was productive to have explored that line. It wasn't all fake, but it is doubtful it was the most fully funded effort. ]

If Qualcomm is diluting their talent, then this foray into the PC world is on very thin ice. This initial core is decent, but AMD and Intel are closing lots of gaps (and it is not all that competitive on the iGPU front). Qualcomm's position in the smartphone space is under higher threat also. Oryon v2 is likely going to be incremental improvements. If Qualcomm is relatively late with Oryon v3, then they'll probably lose some of the share they took (Arm itself will be deeply committed to taking away their share at that point also; the number of competitors is only going up).
 

crazy dave

macrumors 65816
Sep 9, 2010
1,450
1,221
Here's the original claim from Android Authority. But yes, they admit that they can't confirm the status of the project, so indeed a Qualcomm server chip would be classified as a rumor - hence the "supposedly" in the original post to denote some measure of doubt, though perhaps I should've been even more circumspect in my original wording to further emphasize that this was a rumor and may not happen.

Unfortunately it is too late to edit the original.
 
Last edited:

Sydde

macrumors 68030
Aug 17, 2009
2,563
7,061
IOKWARDI
For example, ARM lacks a register move instruction - those are often implemented as OR with a zero second operand. It’s a nice trick to manage the limited opcode space. But of course, it would be extremely wasteful to actually implement it this way.
Well, I am not sure why you would say that. Given the C=A op B design, as opposed to x86's A=A op B, the need for a move instruction is vanishing, so there might be no point in streamlining something that occurs extremely rarely. Compilers just put the result where it needs to go, no futzing with moving registers around.

One thing AArch64 lacks is a dedicated compare instruction: CMP is just a subtract that sends its result to register 31 (the zero register), so the result gets discarded and only the flags remain. What AArch64 also lacks is elaborate addressing modes like d32(BX,DX*4). There are several addressing modes - indexed, flat or scaled-to-operand-size, displacement, displacement-with-update and just register indirect. It is cool to have the symmetry of the x86 address mode structure that works the same in just about every op, but coolness draws more juice, makes more heat. The various addressing modes are tailored for particular instructions, and if there is need for some elaborate address calculation, it is handled by code and the large register file helps accommodate that.
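A quick illustration of those addressing modes; the assembly in the comments is typical -O2 AArch64 output as I remember it, so verify locally before quoting me on it.

Code:
#include <cstddef>
#include <cstdint>

// ldr x0, [x0, x1, lsl #3]   ; register offset, scaled to the element size
std::uint64_t load_scaled(const std::uint64_t* p, std::size_t i) {
    return p[i];
}

// In a simple pointer-bump loop the compiler can fold the address update into
// the load itself with a post-indexed form, e.g.  ldr x2, [x0], #8
std::uint64_t sum_all(const std::uint64_t* p, std::size_t n) {
    std::uint64_t s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += *p++;
    return s;
}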

It is almost as if the designers studied how stuff is coded and targeted their ISA to favor the most heavily used ops, at the cost of ease-of-coding – but, basically no one codes by hand any more. The x86 compilers are designed to optimise the object code they generate, using registers instead of memory as much as possible, because that creates a more smoothly-flowing program. But then, the decoders and logic have to be set up so that they could handle a mov d32(BX,DX), imm64, even if that instruction never gets used.

x86 has a lot of opcode patterns that it has to implement but that are very rarely used, which adds up to deadweight cruft – gates that have to exist, and draw some power even unused. AArch64 does have some rarely used instructions, but they do get used from time to time (e.g., the very elaborate RMW ops that are there for handling mutex locks, which are pretty critical in a multithreaded OS).
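Those RMW ops are easy to see from C++: with ARMv8.1 LSE atomics enabled (e.g. -march=armv8.1-a, or the outline-atomics path GCC/Clang use on newer cores), the fetch_add below can compile to a single ldaddal instead of a load-exclusive/store-exclusive retry loop. The exact instruction choice is the compiler's call, so treat the mapping as typical rather than guaranteed.

Code:
#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> counter{0};

std::uint64_t bump() {
    // Typically something like:  ldaddal x1, x0, [x2]
    // (atomic add that returns the old value, with acquire+release semantics)
    return counter.fetch_add(1, std::memory_order_seq_cst);
}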
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Given the C=A op B design, as opposed to x86's A=A op B, the need for a move instruction is vanishing, so there might be no point in streamlining something that occurs extremely rarely. Compilers just put the result where it needs to go, no futzing with moving registers around.
Register-register ISAs don't seem to need a specific move instruction. For instance, RISC-V has the pseudo instruction "MV rd, rs" which translates into the instruction "ADDI rd, rs, 0".
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
I don't know if Qualcomm meant to turn the launch of its Snapdragon X Elite/Plus SoCs into an excellent marketing opportunity for Apple, but this launch has helped to highlight Apple's competitive advantage even more clearly.

 

MrGunny94

macrumors 65816
Dec 3, 2016
1,148
675
Malaga, Spain
I had access to the Dell Latitude laptops, and I have to say I'm deeply disappointed with the Prism support because of VPNs. Also, half my games on Steam do not open; Digital Foundry's examples are pretty much my experience.

One thing I did notice when talking with someone who had the Surface is that my Latitude has different performance levels when plugged in versus unplugged; I'm not sure if they will fix this via a BIOS update or something.


Some information about Linux:

The bootloader is currently locked so I can't try Linux just yet; they said they would provide an Ubuntu image with everything working, including the camera. (Dell said audio/webcam are not supported outside of their image until kernel 6.11 is released, as the drivers have not been upstreamed.)
 
  • Like
Reactions: Xiao_Xi