
leman

macrumors Core
Original poster
Oct 14, 2008
19,523
19,679
I have always thought that RISC-V and ARM have the same addressing modes. What is the name of this addressing mode?

I am not talking about addressing modes but about address computation, that is, regular arithmetic instructions. Regarding addressing modes, RISC-V does not have the pointer-increment addressing modes of ARM, which are useful for loops for example.
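To make the contrast concrete, here is a toy Python model of the semantics only (addresses as dict keys, nothing about real encodings): ARM's post-index load, e.g. `ldr x0, [x1], #8`, reads through the pointer and advances it in a single instruction, while RISC-V needs a separate add.

```python
# Sketch of the semantics, assuming a simple dict-of-words "memory" model.
# ARM post-index load (e.g. `ldr x0, [x1], #8`): one instruction both loads
# from the pointer and advances it, which is handy in loops.

def arm_ldr_post_index(mem, addr, offset):
    """Load mem[addr], then bump the pointer: two effects, one instruction."""
    value = mem[addr]
    return value, addr + offset

# RISC-V has no such mode: the same loop body needs a load plus a separate add.
def riscv_load_then_add(mem, addr, offset):
    value = mem[addr]      # ld rd, 0(rs1)
    addr = addr + offset   # addi rs1, rs1, offset
    return value, addr

mem = {0: 10, 8: 20, 16: 30}
total, p = 0, 0
for _ in range(3):
    v, p = arm_ldr_post_index(mem, p, 8)
    total += v
print(total)  # 60
```

In a tight loop the post-index form saves one integer instruction per iteration, which is exactly why it's useful for array traversal.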
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,628
1,101
What I admire about Aarch64 is that the designers have considered these things in the basic ISA. The core arithmetic routines (such as addition) have a shift parameter which essentially multiplies one of the arguments by a constant. That is, the ARM addition instruction is not a + b, but actually a + b * shift, where shift is a constant that can be 1, 2, 4, 8, etc. This means that the same operation can be used for both regular addition and the more complex (but very common) address computation.
Now I understand it. ARM implementations have a barrel shifter that can shift an operand before the ALU. It's pretty cool!
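In Python terms, the shifted-operand add described above boils down to something like this (a toy model; the left shift and the 64-bit wrap are the only ARM-specific behaviors assumed here):

```python
# Minimal model of AArch64's shifted-register ADD: the second operand can be
# left-shifted by a small constant encoded in the instruction, so
# `add x0, x1, x2, lsl #3` computes x1 + (x2 << 3) in one instruction.

def add_shifted(a, b, shift=0):
    return (a + (b << shift)) & 0xFFFFFFFFFFFFFFFF  # wrap to 64 bits

# Address of element i in an array of 8-byte values: base + i*8, done as a
# single add with lsl #3 instead of a separate shift/multiply instruction.
base, i = 0x1000, 5
print(hex(add_shifted(base, i, 3)))  # 0x1028
```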
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Now I understand it. ARM implementations have a barrel shifter that can shift an operand before the ALU. It's pretty cool!

Doesn’t exactly come for free. Ideally you’d set your cycle time by the time it takes to do a 64-bit add, not a 64-bit add plus a barrel shift. So the barrel shift may need its own pipeline stage, anyway, or you might be slowing down your clock by some amount. Alternatively, in the decode stage you split it into two instructions and run through the ALU twice. A lot of things seem neat to developers but cause problems in implementation.
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,523
19,679
Doesn’t exactly come for free. Ideally you’d set your cycle time by the time it takes to do a 64-bit add, not a 64-bit add plus a barrel shift. So the barrel shift may need its own pipeline stage, anyway, or you might be slowing down your clock by some amount. Alternatively, in the decode stage you split it into two instructions and run through the ALU twice. A lot of things seem neat to developers but cause problems in implementation.

From a cursory search it appears that all modern ARM implementations indeed have this for “free”. The barrel shifter is part of the ALU, only applies to the second register operand, and is controlled by a small constant field inside the instruction. That’s exactly what I mean when I say the modern ARM ISA is driven by pragmatism and real-world applicability. I don’t think the cost of including a shifter in each ALU is too high; it might make things a bit “longer” of course, but it’s still only a 5-bit shift, so nothing too fancy. It would be weird for ARM to design the ISA this way if it meant that the implementation has to split up the instructions.
 
  • Like
Reactions: Xiao_Xi

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,869
From a cursory search it appears that all modern ARM implementations indeed have this for “free”. The barrel shifter is part of the ALU, only applies to the second register operand, and is controlled by a small constant field inside the instruction. That’s exactly what I mean when I say the modern ARM ISA is driven by pragmatism and real-world applicability. I don’t think the cost of including a shifter in each ALU is too high; it might make things a bit “longer” of course, but it’s still only a 5-bit shift, so nothing too fancy. It would be weird for ARM to design the ISA this way if it meant that the implementation has to split up the instructions.
I think you're saying the shift amount is limited to 5 bit positions? But it's not - I just had a look at the manual and the shift amount is encoded in a 6 bit field, and the documentation states the value in the field can be any integer from 0 to 63 when operating on 64-bit data, or 0 to 31 on 32-bit data. Furthermore, the shift operation can be any of logical shift left, logical shift right, or arithmetic shift right. So it's a full width, full featured barrel shifter.
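A toy decoder for that operand field might look like the following (Python; the 2-bit type / 6-bit amount layout is assumed from the description above, not taken from the actual instruction encoding):

```python
# Toy model of the shifted-register operand: a shift type (LSL/LSR/ASR)
# plus a 6-bit amount, applied to a 64-bit value.
MASK64 = (1 << 64) - 1

def apply_shift(value, shift_type, amount):
    amount &= 0x3F                      # 6-bit field: 0..63
    if shift_type == 0:                 # LSL: logical shift left
        return (value << amount) & MASK64
    if shift_type == 1:                 # LSR: logical shift right
        return (value & MASK64) >> amount
    if shift_type == 2:                 # ASR: arithmetic shift right
        signed = value - (1 << 64) if value & (1 << 63) else value
        return (signed >> amount) & MASK64
    raise ValueError("reserved shift type")

# ASR by 63 smears the sign bit across the whole register:
print(hex(apply_shift(0x8000000000000000, 2, 63)))  # 0xffffffffffffffff
```

Even this software model hints at the hardware cost: supporting any of 64 shift distances in any of three modes is a wide mux tree, not a single gate.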

What @cmaier was telling you is that barrel shifters ain't cheap. It might seem like a trivial operation from a software perspective, but if you're ever asked to implement one in hardware, you'll find it isn't. I've designed a hardware block to do something which isn't really a barrel shift, but does something quite similar, and it was a challenge. It took me quite a bit of work to minimize area and close timing, and accomplishing the latter required several pipeline stages. (This was in FPGA, not ASIC, for what it's worth, but the problems are often similar even if their magnitude and solutions are different.)

I'm sure that A64's designers put a lot of thought into whether it would actually be worth it to offer this merged instruction. It's not a cheap, consequence-free thing.
 
  • Like
Reactions: EntropyQ3

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,628
1,101
Ideally you’d set your cycle time by the time it takes to do a 64-bit add, not a 64-bit add plus a barrel shift.
Does the ALU stage limit the cycle time? I thought some other stages like the writeback take longer than the ALU stage.

I think you're saying the shift amount is limited to 5 bit positions? But it's not - I just had a look at the manual and the shift amount is encoded in a 6 bit field, and the documentation states the value in the field can be any integer from 0 to 63 when operating on 64-bit data, or 0 to 31 on 32-bit data. Furthermore, the shift operation can be any of logical shift left, logical shift right, or arithmetic shift right. So it's a full width, full featured barrel shifter.
You need two bits to choose the type of shift and another 5 bits (for 32-bit ARM) or 6 bits (for 64-bit ARM) to indicate how many bit positions you want to shift the operand.

The link explains the add instruction.
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,523
19,679
I think you're saying the shift amount is limited to 5 bit positions? But it's not - I just had a look at the manual and the shift amount is encoded in a 6 bit field, and the documentation states the value in the field can be any integer from 0 to 63 when operating on 64-bit data, or 0 to 31 on 32-bit data. Furthermore, the shift operation can be any of logical shift left, logical shift right, or arithmetic shift right. So it's a full width, full featured barrel shifter.

What @cmaier was telling you is that barrel shifters ain't cheap. It might seem like a trivial operation from a software perspective, but if you're ever asked to implement one in hardware, you'll find it isn't. I've designed a hardware block to do something which isn't really a barrel shift, but does something quite similar, and it was a challenge. It took me quite a bit of work to minimize area and close timing, and accomplishing the latter required several pipeline stages. (This was in FPGA, not ASIC, for what it's worth, but the problems are often similar even if their magnitude and solutions are different.)

I'm sure that A64's designers put a lot of thought into whether it would actually be worth it to offer this merged instruction. It's not a cheap, consequence-free thing.

Indeed, I stand corrected. I looked again and the data processing instructions that make use of the shifter do increase the latency to 2 cycles (the throughput remains unchanged though): https://dougallj.github.io/applecpu/firestorm-int.html

Thanks for pointing this out @cmaier and @mr_roboto, I learned something today :) I still think it’s a cool ISA design :)
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,869
Indeed, I stand corrected. I looked again and the data processing instructions that make use of the shifter do increase the latency to 2 cycles (the throughput remains unchanged though): https://dougallj.github.io/applecpu/firestorm-int.html

Thanks for pointing this out @cmaier and @mr_roboto, I learned something today :) I still think its a cool ISA design :)
Dougall's data shows halved throughput too, though? Plain ADD has TP of 0.167 (he's listing reciprocal throughput here, so 0.167 = 1/6 => 6 ADD per cycle), while ADD (shift) is 0.333 (3 per cycle).
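For anyone skimming: those figures are reciprocal throughputs, so converting to instructions per cycle is just a reciprocal. A trivial helper makes the arithmetic explicit:

```python
# Reciprocal throughput (rTP) = steady-state cycles per instruction,
# so instructions per cycle = 1 / rTP.
def per_cycle(reciprocal_tp):
    return round(1.0 / reciprocal_tp)

print(per_cycle(0.167))  # plain ADD: 6 per cycle
print(per_cycle(0.333))  # ADD (shifted register): 3 per cycle
```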
 
  • Like
Reactions: Xiao_Xi

leman

macrumors Core
Original poster
Oct 14, 2008
19,523
19,679
Dougall's data shows halved throughput too, though? Plain ADD has TP of 0.167 (he's listing reciprocal throughput here, so 0.167 = 1/6 => 6 ADD per cycle), while ADD (shift) is 0.333 (3 per cycle).

And again I stand corrected. Not my finest day. Spreading my cognitive capacity too thin at the moment.
 
  • Like
Reactions: Andropov

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Does the ALU stage limit the cycle time? I thought some other stages like the writeback take longer than the ALU stage.


You need two bits to choose the type of shift and other 5 (for 32-bit ARM) or 6 bits (for 64-bit ARM) to indicate how many bits you want to shift the operand.

The link explains the add instruction.
You never know where the critical path will be. You typically try to make the cycle time equal to what your ALU can accomplish. But typically there’s some other random path that you just can’t make fit. At the start, when you are floor planning, you need to set a goal, though, and every chip I ever worked on went by the ALU.
 
  • Like
Reactions: Xiao_Xi

JMacHack

Suspended
Mar 16, 2017
1,965
2,424
You never know where the critical path will be. You typically try to make the cycle time equal to what your ALU can accomplish. But typically there’s some other random path that you just can’t make fit. At the start, when you are floor planning, you need to set a goal, though, and every chip I ever worked on went by the ALU.
Purely hypothetical, but I wanna pick your brain on this: how much performance do you think x86 could gain if it adopted some of the attributes of Apple Silicon, namely unified memory and dropping 32-bit instructions?
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Purely hypothetical, but I wanna pick your brain on this: how much performance do you think x86 could gain if it adopted some of the attributes of Apple Silicon, namely unified memory and dropping 32-bit instructions?
It would probably cut half the deficit? I’d have to model it. The issue is more the 16- and 8-bit instructions. But even getting rid of all that would still not help a ton, because as long as you have variable-length instructions you need to limit your issue width or take heroic measures (like burning power or adding pipe stages) to compensate. The decoding will always be a problem. Fixed-length instructions would be a huge improvement. So if you excise everything that doesn’t have the same instruction width (regardless of which generation of the ISA we are talking about), that would help a lot. The unified memory helps a lot (for workloads where that matters).
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,628
1,101
I have read that x64 CPUs internally behave as RISC CPUs. Is it true? What does it mean?

What should I read/study to better understand CPU design?
 
  • Like
Reactions: ingambe

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
I have read that x64 CPUs internally behave as RISC CPUs. Is it true? What does it mean?

What should I read/study to better understand CPU design?

It’s not quite true.

I recommend the textbooks by Hennessy & Patterson. They are sort of the bible and very approachable.

Going back to the first point: when people say this, it’s people who know just enough to get themselves in trouble. What they probably mean is that the core doesn’t directly execute ISA instructions. They go through a decoder, which breaks them down into (mostly) fixed-width instructions, and those fixed-width instructions are what all the rest of the hardware deals with.

But that’s been the case since the 80186. There’s always a translation. Even RISC CPUs do that.

And CISC CPUs still have to pay a huge penalty - the decoder is big and power-hungry, and because it’s hard to find where ISA instructions start and end, and you never know how many you have in a given instruction memory read, you can’t easily schedule a large number of them to issue in parallel. And when you do issue, say, the 8 micro-instructions that correspond to one given ISA instruction, you need extra hardware to keep track of the fact that they are tied together. Otherwise, what happens if there is a context switch - an interrupt or an exception - while you are in the middle of those 8 instructions? They need to execute atomically, so either you unroll all of them or you let the rest of them execute. There are lots of other complications due to addressing modes, etc.
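A toy illustration of the boundary-finding problem (deliberately not real x86 encoding; each fake instruction here is just a length byte followed by that many payload bytes):

```python
# Finding boundaries in a variable-length stream is inherently sequential:
# you must size instruction N before you know where N+1 starts.
def find_boundaries_variable(code):
    starts, pc = [], 0
    while pc < len(code):
        starts.append(pc)
        pc += 1 + code[pc]   # must decode the length before moving on
    return starts

# Fixed 4-byte instructions: every boundary is known up front, so a wide
# decoder can slice out many instructions per cycle with no serial dependency.
def find_boundaries_fixed(code, width=4):
    return list(range(0, len(code), width))

code = bytes([2, 0xAA, 0xBB, 0, 3, 1, 2, 3])
print(find_boundaries_variable(code))   # [0, 3, 4]
print(find_boundaries_fixed(bytes(8)))  # [0, 4]
```

Real x86 decoders attack this with predecode bits, length speculation, and micro-op caches, which is exactly the "heroic measures" cost mentioned above.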
 

Gerdi

macrumors 6502
Apr 25, 2020
449
301
the unified memory helps a lot (for workloads where that matters).

How does unified memory help the CPU at all? This does not make any sense to me; it is not even a CPU feature but a system feature.
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
How does unified memory help the CPU at all? This does not make any sense to me; it is not even a CPU feature but a system feature.

I referred to “workloads where that matters.” Those are obviously workloads where the CPU shares memory with other types of processing cores (typically GPUs, but conceivably other types of coprocessors). Intel chips often have GPUs on them.
 

Gerdi

macrumors 6502
Apr 25, 2020
449
301
I referred to “workloads where that matters.” Those are obviously workloads where the CPU shares memory with other types of processing cores (typically GPUs, but conceivably other types of coprocessors). Intel chips often have GPUs on them.

Sure, but this is not in any way related to an instruction set architecture like x86. In general, x86 SoCs are UMA systems as well; the Xbox, for instance.
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Sure, but this is not in any way related to an instruction set architecture like x86. In general, x86 SoCs are UMA systems as well; the Xbox, for instance.

Sure. I guess I didn’t read the question I was responding to as being limited to the ISA, but interpreted it as “what if AMD or Intel adopted an M1-like system architecture?”
 

jeanlain

macrumors 68020
Mar 14, 2009
2,463
958
If I may post on topic... PCWorld has some performance data for the Intel 12900HK:
15000 Cinebench R23 MT
1892 Cinebench ST
13575 Geekbench MT
1899 Geekbench ST

I think this is consistent with earlier reports.
PCWorld is keen on pointing out that the Intel part beats the M1 Pro, but no word on power consumption.
 
  • Like
Reactions: JMacHack

StoneJack

macrumors 68030
Dec 19, 2009
2,732
1,983
If I may post on topic... PCWorld has some performance data for the Intel 12900HK:
15000 Cinebench R23 MT
1892 Cinebench ST
13575 Geekbench MT
1899 Geekbench ST

I think this is consistent with earlier reports.
PCWorld is keen on pointing out that the Intel part beats the M1 Pro, but no word on power consumption.
Yes, it exceeds M1 benchmarks, but at higher power consumption. This was always true, even for earlier Intel CPUs. So nothing new. In addition, all tested systems had RTX 3080 graphics cards and 4800 DDR RAM.
 
  • Like
Reactions: JMacHack

jinnyman

macrumors 6502a
Sep 2, 2011
762
671
Lincolnshire, IL
I believe all tests were CPU-only, so having an RTX 3080 doesn't affect the results.
Given that they worked with the MSI GE76, the power consumption while doing the test would probably be at least 90 watts long-term (perhaps not for Geekbench, as it isn't that demanding for a long time).
 

crazy dave

macrumors 65816
Sep 9, 2010
1,454
1,232
I believe all tests were CPU-only, so having an RTX 3080 doesn't affect the results.
Given that they worked with the MSI GE76, the power consumption while doing the test would probably be at least 90 watts long-term (perhaps not for Geekbench, as it isn't that demanding for a long time).

Actually it’s the opposite problem for Geekbench: unless you run the test many times in a row, the tests aren’t long enough to trigger throttling. So you get a measure of max performance at peak watts, but often little information about sustained performance, and the wattage never drops (e.g. all the 2020 M1 base machines - Air, Pro, mini - get the same MT score, even though the Air especially should obviously throttle under sustained load).

In fact according to Andrei and Ryan of Anandtech, the GPU tests are so short that the M1 Max had trouble spinning the GPU to full speed before completing the task. Now that is the GPU version of the test, but you get the idea.
 

UBS28

macrumors 68030
Oct 2, 2012
2,893
2,340
Why don't we wait until next year, when Intel launches their 3nm chip made by TSMC? Going from the super old 10nm to 3nm is huge.
 

T'hain Esh Kelch

macrumors 603
Aug 5, 2001
6,478
7,444
Denmark
Why don't we wait until next year, when Intel launches their 3nm chip made by TSMC? Going from the super old 10nm to 3nm is huge.
TSMC is not in any way capable of delivering the amount of 3nm chips that Intel would need, while supplying to Apple at the same time.
 