
leman

macrumors Core
Original poster
Oct 14, 2008
19,523
19,679
I have always thought that RISC-V and ARM have the same addressing modes. What is the name of this addressing mode?

I am not talking about addressing modes but about address computation, that is, regular arithmetic instructions. Regarding addressing modes, RISC-V does not have the pointer-increment addressing modes of ARM, which are useful for loops for example.
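To make the contrast concrete, here is a toy Python model of the semantics only (addresses as dict keys, nothing about real encodings): ARM's post-index load, e.g. `ldr x0, [x1], #8`, reads through the pointer and advances it in a single instruction, while RISC-V needs a separate add.

```python
# Sketch of the semantics, assuming a simple dict-of-words "memory" model.
# ARM post-index load (e.g. `ldr x0, [x1], #8`): one instruction both loads
# from the pointer and advances it, which is handy in loops.

def arm_ldr_post_index(mem, addr, offset):
    """Load mem[addr], then bump the pointer: two effects, one instruction."""
    value = mem[addr]
    return value, addr + offset

# RISC-V has no such mode: the same loop body needs a load plus a separate add.
def riscv_load_then_add(mem, addr, offset):
    value = mem[addr]      # ld rd, 0(rs1)
    addr = addr + offset   # addi rs1, rs1, offset
    return value, addr

mem = {0: 10, 8: 20, 16: 30}
total, p = 0, 0
for _ in range(3):
    v, p = arm_ldr_post_index(mem, p, 8)
    total += v
print(total)  # 60
```

In a tight loop the post-index form saves one integer instruction per iteration, which is exactly why it's useful for array traversal.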
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,628
1,101
What I admire about Aarch64 is that the designers have considered these things in the basic ISA. The core arithmetic routines (such as addition) have a shift parameter which essentially multiplies one of the arguments by a constant. That is, the ARM addition instruction is not a + b, but actually a + b * shift, where shift is a constant that can be 1, 2, 4, 8, etc. This means that the same operation can be used for both regular addition and the more complex (but very common) address computation.
Now I understand it. ARM implementations have a barrel shifter that can shift an operand before the ALU. It's pretty cool!
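In Python terms, the shifted-operand add described above boils down to something like this (a toy model; the left shift and the 64-bit wrap are the only ARM-specific behaviors assumed here):

```python
# Minimal model of AArch64's shifted-register ADD: the second operand can be
# left-shifted by a small constant encoded in the instruction, so
# `add x0, x1, x2, lsl #3` computes x1 + (x2 << 3) in one instruction.

def add_shifted(a, b, shift=0):
    return (a + (b << shift)) & 0xFFFFFFFFFFFFFFFF  # wrap to 64 bits

# Address of element i in an array of 8-byte values: base + i*8, done as a
# single add with lsl #3 instead of a separate shift/multiply instruction.
base, i = 0x1000, 5
print(hex(add_shifted(base, i, 3)))  # 0x1028
```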
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Now I understand it. ARM implementations have a barrel shifter that can shift an operand before the ALU. It's pretty cool!

Doesn’t exactly come for free. Ideally you’d set your cycle time by the time it takes to do a 64-bit add, not a 64-bit add plus a barrel shift. So the barrel shift may need its own pipeline stage, anyway, or you might be slowing down your clock by some amount. Alternatively, in the decode stage you split it into two instructions and run through the ALU twice. A lot of things seem neat to developers but cause problems in implementation.
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,523
19,679
Doesn’t exactly come for free. Ideally you’d set your cycle time by the time it takes to do a 64-bit add, not a 64-bit add plus a barrel shift. So the barrel shift may need its own pipeline stage, anyway, or you might be slowing down your clock by some amount. Alternatively, in the decode stage you split it into two instructions and run through the ALU twice. A lot of things seem neat to developers but cause problems in implementation.

From a cursory search it appears that all modern ARM implementations indeed have this for “free”. The barrel shifter is part of the ALU, only applies to the second register operand, and is controlled by a small constant field inside the instruction. That’s exactly what I mean when I say the modern ARM ISA is driven by pragmatism and real-world applicability. I don’t think the cost of including a shifter in each ALU is too high; it might make things a bit “longer” of course, but it’s still only a 5-bit shift, so nothing too fancy. It would be weird for ARM to design the ISA this way if it meant that the implementation has to split up the instructions.
 
  • Like
Reactions: Xiao_Xi

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,869
From a cursory search it appears that all modern ARM implementations indeed have this for “free”. The barrel shifter is part of the ALU, only applies to the second register operand, and is controlled by a small constant field inside the instruction. That’s exactly what I mean when I say the modern ARM ISA is driven by pragmatism and real-world applicability. I don’t think the cost of including a shifter in each ALU is too high; it might make things a bit “longer” of course, but it’s still only a 5-bit shift, so nothing too fancy. It would be weird for ARM to design the ISA this way if it meant that the implementation has to split up the instructions.
I think you're saying the shift amount is limited to 5 bit positions? But it's not - I just had a look at the manual and the shift amount is encoded in a 6 bit field, and the documentation states the value in the field can be any integer from 0 to 63 when operating on 64-bit data, or 0 to 31 on 32-bit data. Furthermore, the shift operation can be any of logical shift left, logical shift right, or arithmetic shift right. So it's a full width, full featured barrel shifter.
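A toy decoder for that operand field might look like the following (Python; the 2-bit type / 6-bit amount layout is assumed from the description above, not taken from the actual instruction encoding):

```python
# Toy model of the shifted-register operand: a shift type (LSL/LSR/ASR)
# plus a 6-bit amount, applied to a 64-bit value.
MASK64 = (1 << 64) - 1

def apply_shift(value, shift_type, amount):
    amount &= 0x3F                      # 6-bit field: 0..63
    if shift_type == 0:                 # LSL: logical shift left
        return (value << amount) & MASK64
    if shift_type == 1:                 # LSR: logical shift right
        return (value & MASK64) >> amount
    if shift_type == 2:                 # ASR: arithmetic shift right
        signed = value - (1 << 64) if value & (1 << 63) else value
        return (signed >> amount) & MASK64
    raise ValueError("reserved shift type")

# ASR by 63 smears the sign bit across the whole register:
print(hex(apply_shift(0x8000000000000000, 2, 63)))  # 0xffffffffffffffff
```

Even this software model hints at the hardware cost: supporting any of 64 shift distances in any of three modes is a wide mux tree, not a single gate.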

What @cmaier was telling you is that barrel shifters ain't cheap. It might seem like a trivial operation from a software perspective, but if you're ever asked to implement one in hardware, you'll find it isn't. I've designed a hardware block to do something which isn't really a barrel shift, but does something quite similar, and it was a challenge. It took me quite a bit of work to minimize area and close timing, and accomplishing the latter required several pipeline stages. (This was in FPGA, not ASIC, for what it's worth, but the problems are often similar even if their magnitude and solutions are different.)

I'm sure that A64's designers put a lot of thought into whether it would actually be worth it to offer this merged instruction. It's not a cheap, consequence-free thing.
 
  • Like
Reactions: EntropyQ3

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,628
1,101
Ideally you’d set your cycle time by the time it takes to do a 64-bit add, not a 64-bit add plus a barrel shift.
Does the ALU stage limit the cycle time? I thought some other stages like the writeback take longer than the ALU stage.

I think you're saying the shift amount is limited to 5 bit positions? But it's not - I just had a look at the manual and the shift amount is encoded in a 6 bit field, and the documentation states the value in the field can be any integer from 0 to 63 when operating on 64-bit data, or 0 to 31 on 32-bit data. Furthermore, the shift operation can be any of logical shift left, logical shift right, or arithmetic shift right. So it's a full width, full featured barrel shifter.
You need two bits to choose the type of shift and another 5 bits (for 32-bit ARM) or 6 bits (for 64-bit ARM) to indicate how many bit positions you want to shift the operand.

The link explains the add instruction.
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,523
19,679
I think you're saying the shift amount is limited to 5 bit positions? But it's not - I just had a look at the manual and the shift amount is encoded in a 6 bit field, and the documentation states the value in the field can be any integer from 0 to 63 when operating on 64-bit data, or 0 to 31 on 32-bit data. Furthermore, the shift operation can be any of logical shift left, logical shift right, or arithmetic shift right. So it's a full width, full featured barrel shifter.

What @cmaier was telling you is that barrel shifters ain't cheap. It might seem like a trivial operation from a software perspective, but if you're ever asked to implement one in hardware, you'll find it isn't. I've designed a hardware block to do something which isn't really a barrel shift, but does something quite similar, and it was a challenge. It took me quite a bit of work to minimize area and close timing, and accomplishing the latter required several pipeline stages. (This was in FPGA, not ASIC, for what it's worth, but the problems are often similar even if their magnitude and solutions are different.)

I'm sure that A64's designers put a lot of thought into whether it would actually be worth it to offer this merged instruction. It's not a cheap, consequence-free thing.

Indeed, I stand corrected. I looked again and the data processing instructions that make use of the shifter do increase the latency to 2 cycles (the throughput remains unchanged though): https://dougallj.github.io/applecpu/firestorm-int.html

Thanks for pointing this out @cmaier and @mr_roboto, I learned something today :) I still think it’s a cool ISA design :)
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,869
Indeed, I stand corrected. I looked again and the data processing instructions that make use of the shifter do increase the latency to 2 cycles (the throughput remains unchanged though): https://dougallj.github.io/applecpu/firestorm-int.html

Thanks for pointing this out @cmaier and @mr_roboto, I learned something today :) I still think its a cool ISA design :)
Dougall's data shows halved throughput too, though? Plain ADD has TP of 0.167 (he's listing reciprocal throughput here, so 0.167 = 1/6 => 6 ADD per cycle), while ADD (shift) is 0.333 (3 per cycle).
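For anyone skimming: those figures are reciprocal throughputs, so converting to instructions per cycle is just a reciprocal. A trivial helper makes the arithmetic explicit:

```python
# Reciprocal throughput (rTP) = steady-state cycles per instruction,
# so instructions per cycle = 1 / rTP.
def per_cycle(reciprocal_tp):
    return round(1.0 / reciprocal_tp)

print(per_cycle(0.167))  # plain ADD: 6 per cycle
print(per_cycle(0.333))  # ADD (shifted register): 3 per cycle
```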
 
  • Like
Reactions: Xiao_Xi

leman

macrumors Core
Original poster
Oct 14, 2008
19,523
19,679
Dougall's data shows halved throughput too, though? Plain ADD has TP of 0.167 (he's listing reciprocal throughput here, so 0.167 = 1/6 => 6 ADD per cycle), while ADD (shift) is 0.333 (3 per cycle).

And again I stand corrected. Not my finest day. Spreading my cognitive capacity too thin at the moment.
 
  • Like
Reactions: Andropov

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Does the ALU stage limit the cycle time? I thought some other stages like the writeback take longer than the ALU stage.


You need two bits to choose the type of shift and other 5 (for 32-bit ARM) or 6 bits (for 64-bit ARM) to indicate how many bits you want to shift the operand.

The link explains the add instruction.
You never know where the critical path will be. You typically try to make the cycle time equal to what your ALU can accomplish. But typically there’s some other random path that you just can’t make fit. At the start, when you are floor planning, you need to set a goal, though, and every chip I ever worked on went by the ALU.
 
  • Like
Reactions: Xiao_Xi

JMacHack

Suspended
Mar 16, 2017
1,965
2,424
You never know where the critical path will be. You typically try to make the cycle time equal to what your ALU can accomplish. But typically there’s some other random path that you just can’t make fit. At the start, when you are floor planning, you need to set a goal, though, and every chip I ever worked on went by the ALU.
Purely hypothetical, but I wanna pick your brain on this: how much performance do you think x86 could gain if it adopted some of the attributes of Apple Silicon, namely unified memory and dropping 32-bit instructions?
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Purely hypothetical, but I wanna pick your brain on this: how much performance do you think x86 could gain if it adopted some of the attributes of Apple Silicon, namely unified memory and dropping 32-bit instructions?
It would probably cut half the deficit? I’d have to model it. The issue is more the 16- and 8-bit instructions. But even getting rid of all that would still not help a ton, because as long as you have variable-length instructions you need to limit your issue width or take heroic measures (like burning power or adding pipe stages) to compensate. The decoding will always be a problem. Fixed-length instructions would be a huge improvement. So if you excise everything that doesn’t have the same instruction width (regardless of which generation of the ISA we are talking about), that would help a lot. The unified memory helps a lot (for workloads where that matters).
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,628
1,101
I have read that x64 CPUs internally behave as RISC CPUs. Is it true? What does it mean?

What should I read/study to better understand CPU design?
 
  • Like
Reactions: ingambe

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
I have read that x64 CPUs internally behave as RISC CPUs. Is it true? What does it mean?

What should I read/study to better understand CPU design?

It’s not quite true.

I recommend the textbooks by Hennessy & Patterson. They are sort of the bible and very approachable.

Going back to the first point: when people say this, it’s people who know just enough to get themselves in trouble. What they probably mean is that the core doesn’t directly execute ISA instructions. They go through a decoder, which breaks them down into (mostly) fixed-width instructions, and those fixed-width instructions are what all the rest of the hardware deals with.

But that’s been the case since the 80186. There’s always a translation. Even RISC CPUs do that.

And CISC CPUs still have to pay a huge penalty - the decoder is big and power-hungry, and because it’s hard to find where ISA instructions start and end, and you never know how many you have in a given instruction memory read, you can’t easily schedule a large number of them to issue in parallel. And when you do issue, say, the 8 micro-instructions that correspond to one given ISA instruction, you need extra hardware to keep track of the fact that they are tied together. Otherwise, what happens if there is a context switch - an interrupt or an exception - while you are in the middle of those 8 instructions? They need to execute atomically, so either you unroll all of them or you let the rest of them execute. There are lots of other complications due to addressing modes, etc.
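A toy illustration of the boundary-finding problem (deliberately not real x86 encoding; each fake instruction here is just a length byte followed by that many payload bytes):

```python
# Finding boundaries in a variable-length stream is inherently sequential:
# you must size instruction N before you know where N+1 starts.
def find_boundaries_variable(code):
    starts, pc = [], 0
    while pc < len(code):
        starts.append(pc)
        pc += 1 + code[pc]   # must decode the length before moving on
    return starts

# Fixed 4-byte instructions: every boundary is known up front, so a wide
# decoder can slice out many instructions per cycle with no serial dependency.
def find_boundaries_fixed(code, width=4):
    return list(range(0, len(code), width))

code = bytes([2, 0xAA, 0xBB, 0, 3, 1, 2, 3])
print(find_boundaries_variable(code))   # [0, 3, 4]
print(find_boundaries_fixed(bytes(8)))  # [0, 4]
```

Real x86 decoders attack this with predecode bits, length speculation, and micro-op caches, which is exactly the "heroic measures" cost mentioned above.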
 

Gerdi

macrumors 6502
Apr 25, 2020
449
301
the unified memory helps a lot (for workloads where that matters).

How does unified memory help the CPU at all? This does not make any sense to me; it is not even a CPU feature but a system feature.
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
How does unified memory help the CPU at all? This does not make any sense to me; it is not even a CPU feature but a system feature.

I referred to “workloads where that matters.” Those are obviously workloads where the CPU shares memory with other types of processing cores (typically GPUs, but conceivably other types of coprocessors). Intel chips often have GPUs on them.
 

Gerdi

macrumors 6502
Apr 25, 2020
449
301
I referred to “workloads where that matters.” Those are obviously workloads where the CPU shares memory with other types of processing cores (typically GPUs, but conceivably other types of coprocessors). Intel chips often have GPUs on them.

Sure, but this is not in any way related to an instruction set architecture like x86. In general, x86 SoCs are UMA systems as well; the Xbox, for instance.
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Sure, but this is not in any way related to an instruction set architecture like x86. In general, x86 SoCs are UMA systems as well; the Xbox, for instance.

Sure. I guess I didn’t read the question I was responding to as being limited to the ISA, but interpreted it as “what if AMD or Intel adopted an M1-like system architecture?”
 

jeanlain

macrumors 68020
Mar 14, 2009
2,463
958
If I may post on topic... PCWorld has some performance data for the Intel 12900HK:
15000 Cinebench R23 MT
1892 Cinebench ST
13575 Geekbench MT
1899 Geekbench ST

I think this is consistent with earlier reports.
PCWorld is keen on pointing out that the Intel part beats the M1 Pro, but no word on power consumption.
 
  • Like
Reactions: JMacHack

StoneJack

macrumors 68030
Dec 19, 2009
2,732
1,983
If I may post on topic... PCWorld has some performance data for the Intel 12900HK:
15000 Cinebench R23 MT
1892 Cinebench ST
13575 Geekbench MT
1899 Geekbench ST

I think this is consistent with earlier reports.
PCWorld is keen on pointing out that the Intel part beats the M1 Pro, but no word on power consumption.
Yes, it exceeds M1 benchmarks, but at higher power consumption. This was always true, even for earlier Intel CPUs. So nothing new. In addition, all tested systems had RTX 3080 graphics cards and 4800 DDR RAM.
 
  • Like
Reactions: JMacHack

jinnyman

macrumors 6502a
Sep 2, 2011
762
671
Lincolnshire, IL
I believe all tests were CPU-only, so having an RTX 3080 doesn't affect the results.
Given that they worked with the MSI GE76, the power consumption while doing the test would probably be at least 90 watts long-term (perhaps not for Geekbench, as it isn't that demanding for a long time).
 

crazy dave

macrumors 65816
Sep 9, 2010
1,454
1,232
I believe all tests were CPU-only, so having an RTX 3080 doesn't affect the results.
Given that they worked with the MSI GE76, the power consumption while doing the test would probably be at least 90 watts long-term (perhaps not for Geekbench, as it isn't that demanding for a long time).

Actually it’s the opposite problem for Geekbench: unless you run the test many times in a row, the tests aren’t long enough to trigger throttling. So you get a measure of max performance at peak watts, but often little information about sustained performance, and the wattage never drops (e.g. all the 2020 M1 base machines - Air, Pro, mini - get the same MT score, even though the Air especially should obviously throttle under sustained load).

In fact according to Andrei and Ryan of Anandtech, the GPU tests are so short that the M1 Max had trouble spinning the GPU to full speed before completing the task. Now that is the GPU version of the test, but you get the idea.
 

UBS28

macrumors 68030
Oct 2, 2012
2,893
2,340
Why don't we wait until next year, when Intel launches their 3nm chip made by TSMC? Going from the super old 10nm to 3nm is huge.
 

T'hain Esh Kelch

macrumors 603
Aug 5, 2001
6,478
7,444
Denmark
Why don't we wait until next year, when Intel launches their 3nm chip made by TSMC? Going from the super old 10nm to 3nm is huge.
TSMC is not in any way capable of delivering the amount of 3nm chips that Intel would need, while supplying to Apple at the same time.
 