
leman

macrumors Core
Oct 14, 2008
19,520
19,671
I am a software engineer myself and typically I have several instances of VSCODE (plus Nedit and Notepad+ etc.) running simultaneously plus several Chrome windows (with dozens of tabs), plus MS Outlook and Teams, plus several VNC sessions, File Manager(s), remote desktop(s) etc. depending on the task at hand. I understand that not everyone needs that many apps running at the same time, but at least several windows with API references and other documentation are a must. While I could manage those windows on a small screen with virtual desktops and other tricks, it's just not as productive. I use a 38" monitor and a laptop screen.

All depends on your workflow I suppose. I put my references on a separate virtual desktop, so they are just a swipe away. The only time where I find myself really needing a secondary monitor is when doing occasional web frontend work.
 

Gudi

Suspended
May 3, 2013
4,590
3,267
Berlin, Berlin
There is no such thing as "inherent power inefficiency of x86 technology". There are low power x86 chips and there are high power ARM chips.
No, there aren’t. For heat reasons alone x86 chips will throttle in a laptop chassis. If you need to cut down performance in order to build a so-called Intel "laptop chip", then your technology is crippled by power inefficiency.
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
There is no such thing as "inherent power inefficiency of x86 technology".
Yes there is. x86 instruction set encoding is objectively difficult to deal with in high performance designs. This causes substantial growth in transistor and power budget in a CPU's front end.

The front end is not the whole CPU, of course. If it was, x86 decode would be such a huge millstone around x86's neck that x86 would have ended up in the dustbin of history a long time ago. But it's not, so x86 had and still has a chance to compete... just wearing a weight belt that bogs it down a bit, compared to architectures which don't have x86's weird and nasty instruction encoding.

There are low power x86 chips and there are high power ARM chips. For example, TDP of the ARM processor Ampere Altra Max Q80-30 is 210W.
That chip might not be the best example to prove your point. The only reason its TDP is 210W is that it's an 80-core chip, which works out to about 2.625W per core. Ampere's latest is a 128-core chip with a 250W TDP, 1.95W/core.

Compare to Intel and AMD. Intel's biggest seems to be the Xeon Platinum 8490H, with 60 cores and a 350W TDP, or 5.83W/core (almost 3x more). AMD's biggest 4th gen Epycs are 96 cores with a 360W TDP, or 3.75W/core (about 2x more).
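If you want to sanity-check those ratios yourself, here is a quick script (TDP figures as quoted above; TDP is only a rough proxy for per-core power, and the 128-core Ampere entry is just a placeholder label for the part described above):

Code:
# Rough W/core comparison using the TDP figures quoted in this post.
# TDP is a coarse proxy for per-core power (it ignores uncore, I/O, etc.).
chips = {
    "Ampere Altra Max Q80-30 (Arm, 80c)": (210, 80),
    "Ampere 128-core part (Arm)":         (250, 128),
    "Xeon Platinum 8490H (x86, 60c)":     (350, 60),
    "4th gen Epyc (x86, 96c)":            (360, 96),
}
for name, (tdp_w, cores) in chips.items():
    print(f"{name}: {tdp_w / cores:.2f} W/core")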

I will say that this is not an apples to apples comparison. Ampere is using Neoverse series cores designed by Arm, and they simply aren't architected to hit the same peak frequencies as x86 cores. That matters. Maximum frequency has area and power costs independent of ISA (deeper pipelines, more gates switching), as well as ISA-dependent costs (those x86 decoders are much nastier the higher you push Fmax).

Point is, Intel and AMD do pay a price for economizing by sharing core designs between the desktop and server markets. Those 5+ GHz capable Golden Cove cores are restricted to 3.5 GHz max single-core turbo in Intel's 60-core server chip, and Golden Cove could be more power and area efficient if it had been targeted to run no faster than 3.5. (It would probably also be noticeably faster at 3.5 than the real Golden Cove is: increasing pipeline depth to target higher clocks hurts performance at the same clock; you only do it because you want to push frequency higher.)

Even with all these confounding factors, and more I didn't cover, those ratios are so big that you can't really handwave them away, IMO.
 
  • Like
Reactions: dmccloud

dmccloud

macrumors 68040
Sep 7, 2009
3,142
1,899
Anchorage, AK
Why? Do you unplug your computer while sitting at the desk or do you roam around while working? Neither are typical behaviors.

I take my laptop with me quite often, and not having to bring along my AC adapter all the time is a welcome thing. It's weird that you specifically call out the latter example as not being a "typical behavior" when you have an entire post promoting laptops over desktops for some of the same reasons I already use my laptop.


There is no such thing as "inherent power inefficiency of x86 technology". There are low power x86 chips and there are high power ARM chips. For example, TDP of the ARM processor Ampere Altra Max Q80-30 is 210W. Apple, Intel and AMD design their architecture to cover a range of chips. For Apple this range stretches from smartphones to desktops (with the bulk of their profits coming from phones). Intel and AMD design chips for computers ranging from laptops to supercomputers, with heavy emphasis on server chips. This explains the design trade-offs. Apple's architecture is more power efficient but it does not scale well (hence no AS-based Mac Pro in sight).

What does a non-Apple ARM-based processor have to do with the Mac? Apple is not designing silicon for the market Ampere Altra even reaches.
 

leman

macrumors Core
Oct 14, 2008
19,520
19,671
What does a non-Apple ARM-based processor have to do with the Mac? Apple is not designing silicon for the market Ampere Altra even reaches.

The conversation was x86 vs. ARM. Nothing here has anything to do with the Mac.
 

falainber

macrumors 68040
Mar 16, 2016
3,539
4,136
Wild West
What does a non-Apple ARM-based processor have to do with the Mac? Apple is not designing silicon for the market Ampere Altra even reaches.
The original post was about the inherent disadvantages of x86 architecture and, thus, conversely, the inherent advantages of ARM architectures. Those are not Apple-specific.
 

falainber

macrumors 68040
Mar 16, 2016
3,539
4,136
Wild West
Yes there is. x86 instruction set encoding is objectively difficult to deal with in high performance designs. This causes substantial growth in transistor and power budget in a CPU's front end.
This line of reasoning is too simplistic to be taken at face value. Instruction decoding uses a minuscule fraction of silicon. The 8086 processor was able to decode this instruction set just fine while using 0.000x percent of the transistors that current chips have. If "instruction decoding" also implies their execution then, again, it's not enough to just point out their complexity. While the time required for execution of such instructions might be greater than the time for execution of "simple" instructions, one has to account for the fact that complex instructions also do more than simple instructions, so a given algorithm may need fewer complex instructions than it would simple ones.
 

leman

macrumors Core
Oct 14, 2008
19,520
19,671
This line of reasoning is too simplistic to be taken at face value. Instruction decoding uses a minuscule fraction of silicon. The 8086 processor was able to decode this instruction set just fine while using 0.000x percent of the transistors that current chips have.

You opened with a great statement but then kind of lost it mentioning the 8086. Roughly half of that chip was dedicated to the decoder and microcode storage :) The 8086 also didn't have to decode 6 instructions in a 32-byte window per cycle, like current Intel CPUs do.

Detecting the boundaries and types of this style of variable-length instruction in parallel, in constant time, does carry a substantial cost compared to some other ISAs; there is just no way around it. What one can argue about is the extent of this cost. I do agree that it sometimes might be a bit exaggerated (especially with energy optimisations like micro-op caches), but it's a disadvantage x86 CPUs have to deal with.

If "instruction decoding" also implies their execution then, again, it's not enough to just point out their complexity. While the time required for execution of such instructions might be greater than the time for execution of "simple" instructions one has to account for the fact that complex instructions also do more than simple instructions so given algorithm may use fewer complex instructions than simple extractions.

None of this has anything to do with execution. And x86 instructions are not more "complex" in any useful sense. They are just bloated and confusing. The 64-bit ARM generally does more work per instruction.
 

dmccloud

macrumors 68040
Sep 7, 2009
3,142
1,899
Anchorage, AK
You opened with a great statement but then kind of lost it mentioning the 8086. Roughly half of that chip was dedicated to the decoder and microcode storage :) The 8086 also didn't have to decode 6 instructions in a 32-byte window per cycle, like current Intel CPUs do.

Detecting the boundaries and types of this style of variable-length instruction in parallel, in constant time, does carry a substantial cost compared to some other ISAs; there is just no way around it. What one can argue about is the extent of this cost. I do agree that it sometimes might be a bit exaggerated (especially with energy optimisations like micro-op caches), but it's a disadvantage x86 CPUs have to deal with.



None of this has anything to do with execution. And x86 instructions are not more "complex" in any useful sense. They are just bloated and confusing. The 64-bit ARM generally does more work per instruction.

I may be oversimplifying things a bit, but isn't part of the performance penalty associated with VLI due to the system having to check for the start and/or endpoint of the instruction continuously? With fixed length instructions, the boundaries can be hard coded, which frees up resources for other tasks on the CPU.
 

leman

macrumors Core
Oct 14, 2008
19,520
19,671
I may be oversimplifying things a bit, but isn't part of the performance penalty associated with VLI due to the system having to check for the start and/or endpoint of the instruction continuously? With fixed length instructions, the boundaries can be hard coded, which frees up resources for other tasks on the CPU.

Which performance penalty are you referring to? The current crop of x86 cores are all quite a bit faster than ARM processors (albeit at a power consumption premium).

VLI instruction boundaries can be detected in constant time by wiring together multiple levels of logic (e.g. you put circuits that detect bit patterns that could mark the starts or ends of certain instruction sequences, then combine their results to figure out the actual instructions). All it takes is die area, some complexity and obviously some extra power. Alder Lake, for example, can decode up to six instructions per cycle, examining a 32-byte window in one step.
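If it helps, here is a toy sketch of that idea in Python (completely made-up length rules, nothing like the real x86 encoding): every byte offset gets its own speculative length guess, and a selection pass then keeps only the offsets that are genuine instruction starts. In hardware the guesses all happen in parallel; it's making the selection fast that costs the area and power.

Code:
# Toy sketch: speculative length decode at every offset, then select the real starts.
# The length rule is invented for illustration; real x86 length decode is far messier.
def toy_length(window, i):
    return 1 + (window[i] % 3)          # pretend instructions are 1-3 bytes long

def decode_window(window):
    guesses = [toy_length(window, i) for i in range(len(window))]  # parallel in hardware
    starts, i = [], 0
    while i < len(window):              # the serial "which guesses are real?" part
        starts.append(i)
        i += guesses[i]
    return starts

print(decode_window(bytes([7, 2, 9, 4, 4, 1, 0, 5])))   # -> [0, 2, 3, 5, 7]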
 
  • Sad
Reactions: Gudi

falainber

macrumors 68040
Mar 16, 2016
3,539
4,136
Wild West
You opened with a great statement but then kind of lost it mentioning the 8086. Roughly half of that chip was dedicated to the decoder and microcode storage :) The 8086 also didn't have to decode 6 instructions in a 32-byte window per cycle, like current Intel CPUs do.

Detecting the boundaries and types of this style of variable-length instruction in parallel, in constant time, does carry a substantial cost compared to some other ISAs; there is just no way around it. What one can argue about is the extent of this cost. I do agree that it sometimes might be a bit exaggerated (especially with energy optimisations like micro-op caches), but it's a disadvantage x86 CPUs have to deal with.



None of this has anything to do with execution. And x86 instructions are not more "complex" in any useful sense. They are just bloated and confusing. The 64-bit ARM generally does more work per instruction.
Well, assuming you are not exaggerating (about the half of transistors used for decoding), 8086 had 29K transistors. 29000 / 2 = 15K. This translates to 300K for, say, a 20-core processor. Current Intel processors have 3B transistors, which means that the decoding part would use about 0.01% of all transistors?
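Spelling that back-of-the-envelope arithmetic out (the per-core scaling and the 3B transistor figure are, of course, very rough assumptions):

Code:
# Very rough back-of-the-envelope using the numbers above.
decode_8086 = 29_000 // 2          # ~15K transistors for decode + microcode on the 8086
scaled      = decode_8086 * 20     # naively scaled to a 20-core chip
total       = 3_000_000_000        # assumed transistor budget of a modern CPU
print(f"{scaled / total:.4%}")     # ~0.0097%, i.e. on the order of 0.01%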

And if ARM instructions do more than x86/64 instructions then it is likely they use more time. I am not an expert here. I am just saying that people are claiming things they don't understand. And we did not even discuss the microcode thing which, as I understand it, makes all this discussion moot (at least to a degree).
 

leman

macrumors Core
Oct 14, 2008
19,520
19,671
Well, assuming you are not exaggerating (about the half of transistors used for decoding), 8086 had 29K transistors. 29000 / 2 = 15K.

Luckily, a great article about how the 8086 decodes instructions was published just yesterday. Notice the area taken by the microcode, translation ROM and decoder — those are all parts involved in decoding.

This translates to 300K for, say, a 20-core processor. Current Intel processors have 3B transistors, which means that the decoding part would use about 0.01% of all transistors?

Judging by annotated die shots, more like 5-7%. But of course, instruction decoding in a modern x86 is vastly more challenging than in the 8086.

And if ARM instructions do more than x86/64 instructions then it is likely they use more time. I am not an expert here.

Sure, doing more work either needs more time or more resources. I was merely pointing out that decoding and execution are different things. In the end, no matter which ISA you use, it will run the same program. It's just that ARM64 will often use fewer instructions on average, because, well, it often packs more useful information per instruction (which tends to surprise folks who get too zeroed in on the RISC vs. CISC story).

I am just saying that people are claiming things they don't understand. And we did not even discuss the microcode thing which, as I understand it, makes all this discussion moot (at least to a degree).

I think we are in agreement here. It is easy to jump to the conclusion that the x86 ISA alone is responsible for the extreme power consumption of modern x86 designs, but as usual, the devil is in the details.

That said, I think it's beyond any doubt that decoding in x86 is more expensive. Not only is the ISA itself much more complicated (with different prefixes and many more instruction formats to consider), but detecting instruction boundaries adds extra work and complexity. So there is indeed an inherent disadvantage in x86 from this perspective; it's just that actually quantifying the cost is not trivial.
 

Sydde

macrumors 68030
Aug 17, 2009
2,563
7,061
IOKWARDI
VLI instruction boundaries can be detected in constant time by wiring together multiple levels of logic (e.g. you put circuits that detect bit patterns that could mark the starts or ends of certain instruction sequences, then combine their results to figure out the actual instructions). All it takes is die area, some complexity and obviously some extra power. Alder Lake, for example, can decode up to six instructions per cycle, examining a 32-byte window in one step.
Yeah, I suspect that that is not how it works. An x86 decoder would scan the instruction stream rather like a string. Here is a prefix, put that in the bucket; here is another prefix, put that in the decode scheme; here is an opcode, decode it; here is an address spec, decode it; here is an operand, collect it; move on with the next scan. It has to be a very different type of process, that scans pieces of an instruction and feeds them in as needed. Actual instruction boundaries do not have quite the same weight (apart from the fact that you damn well better not be branching code into the middle of an instruction).
 

Gudi

Suspended
May 3, 2013
4,590
3,267
Berlin, Berlin
@leman Since you don’t want to answer why there are no x86 chips in smartwatches, here’s another question: Why didn’t Apple design its own x86 SoC and build it on TSMC’s 5nm node? If technically possible, this would’ve saved them so much in software development costs! Just admit that x86 is inherently energy inefficient. To fix this problem one must cut down the bloated instruction set and voilà, an ARM processor. 💁
 

leman

macrumors Core
Oct 14, 2008
19,520
19,671
Yeah, I suspect that that is not how it works. An x86 decoder would scan the instruction stream rather like a string. Here is a prefix, put that in the bucket; here is another prefix, put that in the decode scheme; here is an opcode, decode it; here is an address spec, decode it; here is an operand, collect it; move on with the next scan. It has to be a very different type of process, that scans pieces of an instruction and feeds them in as needed. Actual instruction boundaries do not have quite the same weight (apart from the fact that you damn well better not be branching code into the middle of an instruction).

That's possible as well, although in that case I'm not sure how they can do six instructions per cycle that way. Unless each step of the scan you mention can be done in a fraction of a cycle. We do know that x86 CPUs inspect 16 or 32 bytes (in Alder Lake) per cycle.
 

leman

macrumors Core
Oct 14, 2008
19,520
19,671
@leman Since you don’t want to answer why there are no x86 chips in smartwatches, here’s another question: Why didn’t Apple design its own x86 SoC and build it on TSMC’s 5nm node? If technically possible, this would’ve saved them so much in software development costs! Just admit that x86 is inherently energy inefficient. To fix this problem one must cut down the bloated instruction set and voilà, an ARM processor. 💁

Sorry, I’m just not that interested in engaging you. I’ve tried it in the past and I don’t think it’s a good use of either of our time.
 

Sydde

macrumors 68030
Aug 17, 2009
2,563
7,061
IOKWARDI
Why didn’t Apple design its own x86 SoC and build it on TSMC’s 5nm node? If technically possible, this would’ve saved them so much in software development costs

Apple was using an ARM CPU to power the iPod. After they switched the Macintosh line to x86, they did approach Intel about developing the iPhone processor, which would have unified the phone and Mac architectures. Intel told them to p*** off, so Apple turned to Samsung. By 2012, Apple was making their own chips and expanding their handheld line; Android was growing rapidly as well, and Intel was left out of the broader market segment.

That there is exactly why: Intel just wanted no part of it.
 
  • Like
  • Angry
Reactions: Gudi and leman

Sydde

macrumors 68030
Aug 17, 2009
2,563
7,061
IOKWARDI
And if ARM instructions do more than x86/64 instructions then it is likely they use more time. I am not an expert here.

Yes, you are not.

Most of the ARM instruction set uses a = b + c design, which incorporates a register-move operation within the calculation. The heavily-used basic math/logic operations embed a shift for one operand, which would use extra code on x86. These instructions do multiple things in a single cycle.

There are the load/store-with-update operations, which allow any general register to function similar to a stack pointer or as an arbitrary array scanner, which takes more code on x86. All in a single cycle (plus memory access time), with no μops.

The vector gather-load/scatter-store operations may take a little longer, but mostly because they may require multiple memory accesses.

The only compound operations that genuinely take longer than a cycle are a set of 10 atomics that perform math on memory with writeback. The atomics do appear in code periodically – not a great deal – because a multiprocessor system needs them for synchronizing semaphores.

So, the ARM architecture is designed to accomplish as much as possible with each operation (and some ops in code do not use the full capability of a given instruction) and do almost all of what needs to be done in one cycle (or less). This is greatly simplified in the core by making use of the big register rename array rather than relying on a physically-fixed register set, something that is a little more difficult to do on x86 with its dedicated-use registers.

On top of this, the large generic register set makes it eminently practical for subroutine arguments to be passed in registers rather than on the stack. x86-64 is a little better here than 32-bit, but the 64-bit ARM architecture makes it easy for nearly all calls to pass their parameters in registers, which greatly improves efficiency.
 
Last edited:

leman

macrumors Core
Oct 14, 2008
19,520
19,671
Most of the ARM instruction set uses a = b + c design, which incorporates a register-move operation within the calculation. The heavily-used basic math/logic operations embed a shift for one operand, which would use extra code on x86. These instructions do multiple things in a single cycle.

There are the load/store-with-update operations, which allow any general register to function similar to a stack pointer or as an arbitrary array scanner, which takes more code on x86. All in a single cycle (plus memory access time), with no μops.

The vector gather-load/scatter-store operations may take a little longer, but mostly because they may require multiple memory accesses.

The only compound operations that genuinely take longer than a cycle are a set of 10 atomics that perform math on memory with writeback. The atomics do appear in code periodically – not a great deal – because a multiprocessor system needs them for synchronizing semaphores.

That’s not how it works on Apple Silicon though. Instructions with more complex semantics (op + shift, load pair, increment addressing modes) occupy multiple execution slots and have longer latency, and latencies differ between instructions as well. I don’t know whether Apple uses micro-ops in their usual definition but there is certainly some sort of operation decomposition going on.

Instruction timing and throughput: https://dougallj.github.io/applecpu/firestorm-int.html
 
  • Like
Reactions: Xiao_Xi

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
I don’t know whether Apple uses micro-ops in their usual definition but there is certainly some sort of operation decomposition going on.
The authors of the paper The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V estimated the ARMv8 micro-op count "by breaking up load-increment address, load-pair, and load-pair-increment-address into multiple micro-ops." They "assumed that any instruction that writes multiple registers would be broken down into additional microops (one micro-op per register write-back destination)."
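A minimal sketch of that counting rule, if it helps (the instruction list below is my own illustration of the heuristic, not data from the paper):

Code:
# Toy version of the paper's heuristic: one micro-op per register write-back destination.
def estimated_uops(dest_regs_written):
    return max(1, dest_regs_written)

examples = {                          # illustrative AArch64 instructions
    "add x0, x1, x2":        1,       # one destination register
    "ldr x0, [x1], #8":      2,       # post-index load: writes x0 and updates x1
    "ldp x0, x1, [sp]":      2,       # load pair: two destinations
    "ldp x0, x1, [x2], #16": 3,       # load pair + address update
}
for insn, writes in examples.items():
    print(f"{insn:24} ~{estimated_uops(writes)} micro-op(s)")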
 
  • Like
Reactions: Gudi

dmccloud

macrumors 68040
Sep 7, 2009
3,142
1,899
Anchorage, AK
@leman Since you don’t want to answer why there are no x86 chips in smartwatches, here’s another question: Why didn’t Apple design its own x86 SoC and build it on TSMC’s 5nm node? If technically possible, this would’ve saved them so much in software development costs! Just admit that x86 is inherently energy inefficient. To fix this problem one must cut down the bloated instruction set and voilà, an ARM processor. 💁

The only reason AMD can still make x86 CPUs is because it won a court case against Intel way back when (after Intel tried to say AMD couldn't make anything from the 486 on). The irony here is that Intel has to license the x64 extensions to the instruction set from AMD because of Itanium tanking in the 64-bit market, but that's a completely different story. Intel wouldn't license x86 to Apple for its own processors, because that could lead to more competition à la AMD's presence in that market space. Additionally, since Apple was already building out its own version of the ARM instruction set for the A-series SoCs, it was relatively trivial to extend that architecture to the Mac platform.
 
Last edited:

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
This line of reasoning is too simplistic to be taken at face value. Instruction decoding uses a minuscule fraction of silicon. The 8086 processor was able to decode this instruction set just fine while using 0.000x percent of the transistors that current chips have. If "instruction decoding" also implies their execution then, again, it's not enough to just point out their complexity. While the time required for execution of such instructions might be greater than the time for execution of "simple" instructions, one has to account for the fact that complex instructions also do more than simple instructions, so a given algorithm may need fewer complex instructions than it would simple ones.
I'd say the person being simplistic isn't me.

No, decoding doesn't imply execution. The decoder is responsible for scanning bytes coming from the instruction fetch unit, figuring out which instruction those bytes represent, performing several tasks to expand the instruction into a format suitable for execution (in x86 land, these expanded instructions are called micro-ops or uops), and finally handing uops off to a dispatch unit which sends them on to execution units.

The ISA decoded by the original 8086 was much smaller and simpler to decode than x86-64 is today. They hadn't yet severely abused the prefix byte mechanism to add tons of new instructions. More to the point, the 8086 only had one execution pipeline and therefore only needed to decode one instruction at a time. Decoding multiple instructions in parallel is the big pain point today.

This is because x86-64 instructions are one to fifteen bytes long, in one-byte increments, and it's not trivial to determine exactly how long an instruction is. You have to examine the first byte, make decisions based on its contents, possibly move on to the second byte, make decisions based on its contents, etc. Not one-by-one all the way to 15 bytes, but it takes some serially dependent work to figure out how long an instruction is.

This isn't so bad if you're doing just one and you can decode the next instruction in the next cycle, but modern CPUs need to decode many instructions in the same cycle. If you're building an x86, you don't know where the second instruction begins until you've finished figuring out the length of the first, or where the third begins until you've figured out both the first and the second, and so on. This is not fun when you want to decode four or six instructions in parallel. (Six is the current state of the art, seen in Intel Golden Cove.)

The solutions x86 chip architects turn to are a bit insane. As far as I know the basic approach is still brute force: just build a ton of decoders. Let each one start work on a different byte offset. Later, figure out which ones to pay attention to (because they happened to start at the beginning of a real instruction) and which ones to ignore (because they produced meaningless garbage after starting in the middle of a real instruction). You still end up with a nasty dependency chain problem selecting which decoders to pay attention to, and you do a lot of work which gets thrown away, but you can make it run fast. At a price.

Parallel decode is trivial on Arm. All Arm v8/v9 A64 instructions are exactly 4 bytes long, and they are always aligned to 4-byte boundaries. You just grab bytes from the fetcher, stuff the first four into decoder 1, the second four into decoder 2, and so on. Simple, and no work is wasted.
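To make the contrast concrete, here's a toy sketch (it models nothing about real hardware, it just shows where the discarded work comes from): fixed 4-byte instructions split by pure arithmetic, while the variable-length case speculates at every byte offset and throws most of it away.

Code:
# Toy contrast only; real decoders are far more involved.
def fixed_width_starts(window):
    # Arm-style: 4-byte instructions, boundaries known by construction.
    return list(range(0, len(window), 4))

def variable_width_starts(window):
    # x86-style brute force: a (made-up) length guess at EVERY byte offset,
    # then a serial pass keeps only the guesses that were real starts.
    guesses = [1 + (b % 5) for b in window]
    starts, i = [], 0
    while i < len(window):
        starts.append(i)
        i += guesses[i]
    return starts, len(window) - len(starts)   # real starts, discarded guesses

block = bytes(range(32))
print(fixed_width_starts(block))      # 8 instructions, no wasted work
print(variable_width_starts(block))   # some starts, most per-offset work discarded

The serial "keep only the real starts" pass is the part a real x86 front end has to make very fast and very wide, and that's where the extra transistors and watts go.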

Intel's low-power cores (Atom series) can't do 6-wide decode, it's too much to ask. Their current state of the art is dual 3-wide decoders, with the second bank of 3 coming into play only on branches. (When the branch predictor has cached the target address of a branch, it's possible to start the second group of 3 decoders off on that branch target address ahead of time.)

Atoms don't really scale down to phone power levels any more; Intel gave up on that segment long ago, so Atom cores are now targeted at low-end laptops, some embedded applications, and the "efficiency" cores in Alder Lake. Apple is able to put full 8-wide decode units into the performance cores in their phones, even though phones have tighter power constraints than the things Atoms go into. Does that tell you anything about how serious this issue is for x86?
 