I still don't find this definition very satisfactory. For example, ARM has a single instruction that will increment a register value and load two consecutive 64-bit values at the resulting address. I wouldn't call this a "simple" instruction. The second part of your definition is more interesting. Maybe a better way to define RISC would be as only using instructions that are "fast" and have no variable costs? But then again ARM has dedicated memcpy instructions...
It sounds like you’re describing SIMD. (Which would be pretty impressive if it's part of the standard instruction set and not from an extension set like NEON.) Either that, or an equivalent to pure SIMD like “SIMD within a register.”
Reading from memory is “expensive.” (That’s why fast SRAM in caches is so computationally valuable compared to DRAM — and as pricey, too!) But one fetch with two executions sounds even more efficient, not merely as efficient — let alone less. Two separate increments performed on two sets of 64 “bits” sounds costlier — a cost that may seem minute until it’s done a billion times.
I say “bits” because one or even two sets of 64 “bits” do not necessarily equal one or two “values” or operands. 64 bits can mean two 32-bit “values,” four 16-bit values, or fewer/more depending on how narrow a slice the architecture allows. (It should allow anywhere from 4 to 64 bits in a string, but maybe not.)
In this case, handling two sets of 64 bits with only one increment instruction — however many “values” those two sets mean to the programmer — sounds less costly (at least to me).
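For what it’s worth, here’s a minimal C sketch of the access pattern in question (my own toy illustration, not anything lifted from the ARM manuals). When targeting AArch64, compilers commonly fold the two adjacent 64-bit loads plus the pointer bump into a single “ldp” with post-index addressing: one fetch-and-increment instruction instead of three separate ops. The function name and the exact codegen shown are just assumptions for illustration.

```c
#include <stdint.h>

/* Walk an array of 64-bit values two at a time.
   Compiling for AArch64, a compiler will typically emit something like
       ldp x8, x9, [x0], #16   // load a pair of 64-bit values, then bump the pointer
   for the loads below -- one instruction doing the "increment + load two
   64-bit values" described above. (Illustrative; exact output varies by
   compiler and flags.) */
uint64_t sum_pairs(const uint64_t *p, long n_pairs)
{
    uint64_t total = 0;
    for (long i = 0; i < n_pairs; i++) {
        uint64_t first  = p[0];
        uint64_t second = p[1];
        p += 2;               /* advance past the pair just consumed */
        total += first + second;
    }
    return total;
}
```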
It’s all up to the coder how many “values” any given set of bits represents; the processor has no idea what they mean to the programmer.
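Here’s a quick, hedged C sketch of that point (again, my own toy example): the same 64 bits can be read back as two 32-bit values or four 16-bit values, and a single 64-bit operation touches all of those “values” at once. The hardware just sees one 64-bit operand; this is the “SIMD within a register” (SWAR) idea mentioned earlier.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    /* 64 bits; how many "values" they represent is entirely the programmer's call. */
    uint64_t word = 0x0004000300020001ULL;

    uint32_t as32[2];
    uint16_t as16[4];
    memcpy(as32, &word, sizeof word);   /* same bits viewed as two 32-bit values  */
    memcpy(as16, &word, sizeof word);   /* same bits viewed as four 16-bit values */

    /* One 64-bit XOR flips the low bit of every 16-bit "lane" in a single op:
       SIMD within a register (SWAR). The CPU just sees one 64-bit operand. */
    uint64_t toggled = word ^ 0x0001000100010001ULL;

    /* Lane order shown assumes a little-endian machine. */
    printf("as two 32-bit values:  %u %u\n", as32[0], as32[1]);
    printf("as four 16-bit values: %u %u %u %u\n", as16[0], as16[1], as16[2], as16[3]);
    printf("all lanes toggled:     0x%016llx\n", (unsigned long long)toggled);
    return 0;
}
```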
Bear in mind, the whole paradoxical-seeming concept of RISC came from statistics showing that if compiled code broke tasks down into smaller ops before handing them to the CPU to process/execute, overall performance was appreciably higher — not lower, as you’d think (because software is always slower than hardware).
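As a rough sketch of that decomposition (the assembly in the comments is illustrative and simplified, not any particular compiler’s output): a memory-operand, CISC-style ISA can encode the whole statement as roughly one instruction, while a load/store, RISC-style ISA has the compiler spell it out as separate, fixed-cost steps.

```c
#include <stdint.h>

/* The one-line C "task" below, and two ways an ISA might encode it:
 *
 *   CISC-style (memory operand, roughly one instruction):
 *       add   qword ptr [x], 1
 *
 *   RISC-style (load/store, three small fixed-cost instructions):
 *       ldr   x8, [x0]        // load the value into a register
 *       add   x8, x8, #1      // operate on registers only
 *       str   x8, [x0]        // store the result back
 *
 * (Both sequences are illustrative/simplified, not exact compiler output.) */
void bump(uint64_t *x)
{
    *x += 1;
}
```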
That discovery had the consequence of requiring fewer instructions in the ISA — but “fewer” always has been and always will be relative. The number of instructions can increase as the architecture evolves while it remains a RISC architecture.
Simple vector extensions, for example, usually add new instructions but still comport with the RISC design philosophy. And a matrix coprocessor is incredibly simple yet incredibly fast and powerful at what it specializes in doing — processing numbers ordered in a way that CPUs, GPUs, and ALUs just aren’t architected to handle.
Incidentally, RISC engineers didn’t get it perfectly right the first time: the first thing they jettisoned was floating point.
Then software evolved from simple “mass calculator” work, massive & fast telephone-switching operations, financial accounting, etc., to scientific applications, graphics, 3D and simulation software — and, later, visual FX and games, 3D FPSs being one salient example. For today’s needs, floating point is a must, and even way back when you could buy a PC with an empty socket on the motherboard for an optional dedicated floating-point math coprocessor, it still had to communicate over a bus. Floating point is now an on-die feature of RISC chips (save for a few embedded designs), yet it maintains congruence with the RISC design philosophy.
It’s natural to think that software doing more processing itself equals slower (because it usually is), but think of the overhead of high-level programming languages versus Assembly language.
Assembly is a lot harder for people because it “speaks” closer to the level of the hardware its instructions will be performed on. It requires a lot more work by the programmer AND the program — BUT the software working harder doesn’t translate to slower; just the opposite. In contrast, high-level programming languages — depending on the efficiency of the compiler — do almost no “prechewing” or breaking down of tasks into smaller instructions in software; they just throw all the work at the processor to handle for them.
If you rewrote a simple Python program in Assembly, you’d be doing a lot more work on behalf of the processor, and your uncompiled code might even be longer, but that extra work on behalf of the processor would pay off in greatly improved speed of execution, smoothness, and overall better UX.
It’s all relative: a programmer can write “slow” Python code, while a more skilled programmer can write fast(er) Python code — but never as fast as low-level code.
To your last point, modern ARM designs now include the instructions and perform the functions that were traditionally handled by a dedicated Memory Management Unit. (Another example of additional instructions while still being RISC.)
Tightly coupled memory is but one of the many other features that make the ARM design so fast and power-efficient (so fast and efficient that it now powers desktop Macs as well as one of the world’s fastest supercomputers).
Its many design advantages probably account for why ARM is giving RISC-V such a “run for its money.” (Despite ARM’s proprietary IP and relatively tight control over it compared to RISC-V’s inherently open nature.)
YET! “ARM” probably “wouldn’t be a thing” today if it weren’t for Apple. Apple chose an ARM CPU iteration for its Newton PDA, then formed a joint venture with the chip’s inventor, Acorn, which was spun off as an independent company called Advanced RISC Machines (ARM). Then Apple came along again to bolster the company just ahead of its IPO by inking a long-term agreement with ARM that extends beyond the year 2040. (Did Apple “make” and then “save” ARM? 🤔)
Personally, I suspect Microsoft/Microsoft Windows will ultimately go all-ARM — while keeping Intel happy (though ditching Intel’s longtime proprietary IP) by having Intel fab and supply custom ARM designs — custom like Apple’s. Microsoft always wants to be Apple; always has, always will. Storied chip designer Intel will have to swallow its pride as it fabs ARM-based designs for Microsoft, but, hey, x86-maker AMD is already doing it. x86 will go the way of DOS, IMHO. (We’ll see.)
It’s fascinating that today’s most cutting-edge ARM technology evolved from a design that began a long time ago. But then, the best, most modern operating systems in the world today are Unixes, and Unix’s development also began a long time ago — in the late 1960s. It all has to do with the design philosophy at the start — and the imagination required to build things, from the very beginning, with headroom for a limitless future in mind. The fruits of this philosophy can be seen in Unix, ARM, and NeXT — and all three are now at Apple.
So Acorn’s design outlook, from the start of its efforts in 1984 to design a RISC processor, was versatility and extensibility. The result today is an ARM ecosystem and a broad family of scalable ARM designs used in embedded systems, controllers, inexpensive IoT products, phones & “devices,” Macs — and even supercomputers.
The simplicity of instructions in RISC designs means less power and fewer transistors, so, given that, think how powerful it must be to have roughly 19 billion transistors on Apple’s new A17 Pro!
And I can’t wait to learn all the new things Apple will introduce in the upcoming M3.