Can you give an example of such an instruction?
I can try! Please note that I am not a hardware person and what I am describing here might be amateurish and naive. I hope that the real experts who read this will correct me should I make any blunders.
So, here is the example. There is a very common pattern when loading or storing data: accessing an element of an array. In C, it could be something like a[b], which means "get the b-th element of the array a". From the CPU's perspective, this involves taking the value of b, multiplying it by the size of an array element and adding the resulting offset to the base address of a. Let's say we are working with 32-bit integers, so our multiplier is 4. Then the operation we want to perform is "load a 32-bit word at the address a + 4*b". Let's see how this looks in ARM64 and RISC-V (note: I have replaced the architectural registers with the names a and b to make it clearer):
Code:
-- ARM64
ldr result, [a, b, sxtw #2] -- load the value at address a + b*4
-- RISC-V
slli b, b, 2 -- shift b by 2 (same as multiply by 4)
add a, a, b -- add a and b and store the value in a
lw result, 0(a) -- load the word at address a
As you can see, this is one instruction in ARM64 and three instructions in RISC-V. The latter follows the RISC idea of "one instruction = one operation", so every individual operation gets its own instruction. ARM, however, has special forms for frequent operations like this and lets you encode address computations with the common scaling factors in a single instruction.
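For reference, here is the C-level picture of that pattern (a minimal sketch; the function names and types are my own, purely for illustration). The second function just spells out the address arithmetic that the first one implies:
Code:
#include <stdint.h>

/* a[b] is defined as *(a + b); pointer arithmetic scales the index by
 * the element size, so for 32-bit integers the byte address is a + 4*b. */
int32_t load_element(const int32_t *a, int64_t b)
{
    return a[b];
}

/* The same load with the address computation written out by hand. */
int32_t load_element_explicit(const int32_t *a, int64_t b)
{
    const char *base = (const char *)a;
    return *(const int32_t *)(base + 4 * b);
}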
Since this kind of addressing modification is very common, a CPU might choose to implement it in hardware, as part of the load/store unit. You can have a dedicated adder + shifter that directly combine the values as they come in and produce the final address. The hardware overhead is very small and you have a big win in energy efficiency and latency. You also free up the general-purpose ALUs in the integer unit as well as the precious dependency-tracking resources.
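As a rough mental model (purely illustrative, not lifted from any real design), the address-generation step is just a shifter feeding an adder:
Code:
#include <stdint.h>

/* Toy model of a load/store unit's address generation: one shift and
 * one add produce the final byte address (a + b*4 when scale_log2 is 2).
 * No general-purpose ALU is involved. */
uint64_t agu_address(uint64_t base, uint64_t index, unsigned scale_log2)
{
    return base + (index << scale_log2);
}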
Now, in an ARM CPU this can be done immediately, since all the information you need to feed to this address computation unit is already there. The decoder can trivially extract it and schedule it as one convenient package. And this is also how it works on modern ARM CPUs: if you look at instruction timings, the M1 does not take any longer for loads with more complex addressing modes.
With RISC-V, things get tricky. The basic implementation will use the regular integer ALUs for the address computation, scheduling and tracking three dependent instructions. If you want the optimisation of a specialised address computation unit, your decoder needs to be able to detect these kinds of instruction sequences, analyse the dependencies between them and collapse them into a single "address compute + load" instruction of the kind ARM already has. This is of course possible and routinely done, but it introduces additional complexity and overhead in the decoder stage. In a way you are also going back from fixed-size instructions to unpredictable variable-size instructions (even worse, since you have to analyse dependencies!). The other non-trivial aspect of the story is that the RISC-V instruction sequence has visible side effects, as it modifies the values of the registers a and b. So even if you do the fusion so that the execution itself is efficient, you still need to track additional data dependencies. What's more, most of the time you don't care about these intermediate values, which means you waste registers, reducing the number of data slots available for your algorithm (sure, RISC-V has a lot of registers, but there are limits).
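To make the decoder-side work a bit more concrete, here is a hedged sketch of what detecting and fusing the slli/add/lw pattern might look like. This is my own toy model, not how any real decoder is written, and it only handles this exact operand order:
Code:
#include <stdbool.h>
#include <stdint.h>

/* Toy decoded-instruction representation; the field names are made up. */
typedef enum { OP_SLLI, OP_ADD, OP_LW, OP_OTHER, OP_FUSED_LOAD } opcode_t;

typedef struct {
    opcode_t op;
    uint8_t  rd, rs1, rs2;   /* destination and source registers */
    int32_t  imm;            /* shift amount or load offset */
} insn_t;

/* Try to collapse "slli t, b, s; add u, a, t; lw rd, 0(u)" into a single
 * fused "load rd from a + (b << s)" operation. The decoder has to check
 * both the opcodes and the register dependencies between the three
 * instructions before it is allowed to fuse them. */
bool try_fuse_indexed_load(const insn_t in[3], insn_t *out)
{
    if (in[0].op != OP_SLLI || in[1].op != OP_ADD || in[2].op != OP_LW)
        return false;
    /* the add must consume the shifted index, the lw must consume the sum */
    if (in[1].rs2 != in[0].rd || in[2].rs1 != in[1].rd || in[2].imm != 0)
        return false;

    out->op  = OP_FUSED_LOAD;
    out->rd  = in[2].rd;      /* result register */
    out->rs1 = in[1].rs1;     /* base address a */
    out->rs2 = in[0].rs1;     /* index b, before shifting */
    out->imm = in[0].imm;     /* scale (shift amount) */
    return true;
}
Even this sketch ignores the side-effect problem mentioned above: in the original sequence the shifted index and the computed address are written back into b and a, so a real implementation still has to produce those architectural values (or prove nobody reads them later), which is exactly the extra bookkeeping I am describing.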
And there are a lot of things like that. ARM has instructions that let you load two registers at once, automatically modify an indexing register, shift an instruction operand, and so on. This adds complexity, but it also opens up opportunities for more efficient hardware execution of common patterns. RISC-V on the other hand puts the burden of making common patterns fast entirely on the CPU. Which brings me to the original point: ARM64 requires a more complex CPU to begin with, but it comes with features that make the CPU's life easier when things have to go really, really fast. RISC-V can be implemented on a much, much simpler core, but it requires progressively more complex logic to make things go really, really fast.
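As a small illustration (my own examples, not taken from any manual), these are the kinds of C patterns that typically end up using those ARM64 features:
Code:
#include <stdint.h>
#include <stddef.h>

/* Copying two adjacent 64-bit fields: ARM64 compilers commonly turn
 * this into a single load-pair / store-pair (ldp/stp). */
typedef struct { uint64_t lo, hi; } pair_t;

void copy_pair(pair_t *dst, const pair_t *src)
{
    *dst = *src;
}

/* Walking a buffer with a moving pointer: the "load, then bump the
 * pointer" step maps naturally onto a post-indexed load, where a single
 * instruction does both. */
uint64_t sum_words(const uint64_t *p, size_t n)
{
    uint64_t sum = 0;
    while (n--)
        sum += *p++;
    return sum;
}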
Which, by the way, is confirmed by the latest experiments in code density. This excellent writeup shows that on mostly integer programs, compressed RISC-V code is approximately 10-15% smaller than the equivalent ARM64 code (which is great for RISC-V), but the actual number of instructions is about 10% higher! That is, compressed RISC-V manages to save 10-15% on instruction caches but has to pay for it by decoding and executing 10% more instructions. That's not a good trade-off IMO. And that's mostly integer code, where RISC-V has the biggest advantage. It means that if you are pursuing a state-of-the-art, ultra-high single-core performance design, you will need wider decode plus complex sequence- and dependency-detection logic for instruction fusion just to maintain parity with ARM64. This can be done, absolutely, but it's an extra cost that has to be accounted for.
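To put rough numbers on that trade-off (my own back-of-the-envelope arithmetic using the figures above): a program that compiles to 1 MB of ARM64 code, i.e. about 260 thousand 4-byte instructions, would come out to roughly 0.85-0.90 MB of compressed RISC-V code but contain around 290 thousand instructions, an average of a bit over 3 bytes per instruction.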
In the end, what I am trying to say is that RISC-V is not the magical, super-innovative panacea people make it out to be. It's an ISA with its own set of strengths and weaknesses, and that's it. Personally I think it's great for microcontrollers and other specialised hardware, but I don't find it even remotely interesting for general-purpose computing. There is just no added value.
I thought it showed that out-of-order RISC-V SoCs have similar performance to out-of-order ARM SoCs.
I don't think it shows much except that RISC-V is good for microcontrollers, where area efficiency matters a lot. I am however talking about state-of-the-art general-purpose performance. And of course, it should be entirely possible to make a very fast RISC-V core. It just won't be very easy, due to the aforementioned problems with its design.
I don't have a specific reference. But operation fusion is the common answer given by RISC-V proponents when asked about these problems.