I still don't find this definition very satisfactory. For example, ARM has a single instruction that will increment a register value and load two consecutive 64-bit values at the resulting address. I wouldn't call this a "simple" instruction. The second part of your definition is more interesting. Maybe a better way to define RISC would be as only using instructions that are "fast" and have no variable costs? But then again, ARM has dedicated memcpy instructions...
		
		
	 
It sounds like you’re describing SIMD. (Which would be pretty impressive if it’s part of the standard instruction set and not from an extension set like NEON.) Either that or an equivalent to pure SIMD like “SIMD within a register.”

Reading from memory is “expensive.” (That’s why fast SRAM in caches is so computationally valuable compared to DRAM — and as pricey, too!) But one fetch with two executions sounds even more efficient, not merely as efficient, let alone less.
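To make the “one fetch, two results” point concrete, here’s a rough C sketch (my own illustration, not anything from the thread; exact codegen depends on the compiler and optimization flags). On AArch64, compilers commonly turn the two adjacent loads plus the pointer bump below into a single ldp with post-increment addressing, i.e. the kind of instruction you described:

    #include <stddef.h>
    #include <stdint.h>

    /* Sum an array two 64-bit values at a time. On AArch64, the two adjacent
     * loads and the pointer bump are commonly folded into one instruction:
     * "ldp x?, x?, [x?], #16" (load a pair of 64-bit values, then post-increment
     * the address register). Actual codegen varies by compiler and flags. */
    uint64_t sum_pairs(const uint64_t *p, size_t n_pairs)
    {
        uint64_t total = 0;
        for (size_t i = 0; i < n_pairs; ++i) {
            uint64_t a = p[0];   /* first value of the pair        */
            uint64_t b = p[1];   /* second, adjacent value         */
            total += a + b;
            p += 2;              /* address bump folded into the load */
        }
        return total;
    }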
Two separate increments performed on two sets of 64 “bits” sounds costlier — which may seem only minutely costly until it’s done a billion times. I say “bits” because one or even two sets of 64 “bits” does not necessarily equal one or two “values” or operands. 64 bits can mean two 32-bit “values,” four 16-bit values, or fewer/more depending on how narrow a shift the architecture allows. (It should allow 4 to 64 in a string, but maybe not.) In this case, two sets of 64 bits, but one increment instruction applied to however many “values” the two sets mean to the programmer, sounds less costly (at least to me). It’s all up to the coder how many “values” any given set of bits represents; the processor has no idea what they mean to the programmer.
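Here’s a tiny C sketch of that “SIMD within a register” idea (again, my own illustration): one ordinary 64-bit add that the programmer chooses to treat as four independent 16-bit increments. The CPU just sees a single 64-bit value; the lane boundaries exist only in the programmer’s head, and a real SWAR routine would also mask carries so one lane can’t overflow into its neighbor:

    #include <stdint.h>
    #include <stdio.h>

    /* Treat one 64-bit word as four packed 16-bit "values" and add n to each
     * with a single 64-bit addition. Correct only while no lane overflows into
     * the lane above it (a production SWAR routine would guard against that). */
    static uint64_t add_four_u16(uint64_t packed, uint16_t n)
    {
        uint64_t addend = (uint64_t)n * 0x0001000100010001ULL; /* broadcast n */
        return packed + addend;                   /* one add, four results */
    }

    int main(void)
    {
        uint64_t v = 0x0004000300020001ULL;       /* the "values" 4, 3, 2, 1 */
        v = add_four_u16(v, 10);
        printf("%016llx\n", (unsigned long long)v); /* prints 000e000d000c000b */
        return 0;
    }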
Bear in mind, the whole paradoxical-seeming concept of RISC is that it was discovered through statistics that showed that if compiled code broke down tasks into smaller ops before handing them to the CPU to process/execute, the overall performance was appreciably higher, not lower as you’d think (because software is always slower than hardware).
This had the consequence of requiring fewer instructions in the ISA — but this always has been and always will be relative. The number of instructions can increase as the architecture evolves while remaining a RISC architecture. Simple vector extensions, for example, usually add new instructions but still comport with RISC design philosophy.
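For instance (a hedged sketch, assuming an AArch64 toolchain where <arm_neon.h> is available), each NEON intrinsic below maps to a single fixed-cost load, store, or register-to-register operation: new instructions added by the extension, but each one still very much in the RISC mold:

    #include <arm_neon.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int32_t a[4] = {1, 2, 3, 4};
        int32_t b[4] = {10, 20, 30, 40};
        int32_t out[4];

        int32x4_t va = vld1q_s32(a);        /* one 128-bit load         */
        int32x4_t vb = vld1q_s32(b);        /* one 128-bit load         */
        int32x4_t vc = vaddq_s32(va, vb);   /* four 32-bit adds at once */
        vst1q_s32(out, vc);                 /* one 128-bit store        */

        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* 11 22 33 44 */
        return 0;
    }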
And a Matrix coprocessor is incredibly simple yet incredibly fast and powerful for what it specializes in doing — processing numbers ordered in a way that CPUs and GPUs and ALUs just aren’t architected to handle.
Incidentally, RISC engineers didn’t get it perfectly right the first time: the first thing they jettisoned was floating point.
Then software evolved from simple “mass calculator” and massive & fast telephone switching operations and financial accounting, etc., to scientific applications, graphics, 3D and simulation software — and, later, visual FX and games, like 3D FPSs as one salient example.

For today’s needs, floating point is a must, and even way back when you could buy a PC with an empty socket on the motherboard for an optional dedicated floating-point math coprocessor, it still had to communicate over a bus. Floating point is now an on-die feature of RISC chips (save for a few embedded designs), yet it maintains congruence with RISC design philosophy.
It’s natural to think that software doing more processing itself equals slower (because it usually is), but think of the overhead of high-level programming languages versus Assembly language.

Assembly is a lot harder for people because it “speaks” closer to the level of the hardware that its instructions will be performed on. It requires a lot more work by the programmer AND the program — BUT the software working harder doesn’t translate to slower — just the opposite.
In contrast, high-level programming languages — depending on the efficiency of the compiler — do almost no “prechewing” or breaking down of tasks into smaller instructions in software; they just throw all the work at the processor to handle for them.
If you rewrote a simple Python program in Assembly, you’d be doing a lot more work on behalf of the processor, and your uncompiled code might even be longer, but your extra work on behalf of the processor would pay off in greatly improved speed of execution, smoothness, and an overall better UX.

All is relative, and a programmer can write “slow” Python code, while a more skilled programmer can write fast(er) Python code — but never as fast as low-level.
To your last point, modern ARM designs now include the instructions and perform the functions that were traditionally handled by a dedicated Memory Management Unit. (Another example of additional instructions while still being RISC.)
Tightly coupled memory is but one of the many other features that make the ARM design so fast and power efficient (so fast and efficient that it now powers desktop Macs as well as the world’s fastest supercomputer).
Its many design advantages probably account for why ARM is giving RISC-V such a “run for its money.” (Despite ARM’s proprietary IP and relatively tight control over it compared to RISC-V’s inherently open nature.)
YET! “ARM” probably “wouldn’t be a thing” today if it weren’t for Apple.
Apple chose an ARM CPU iteration for its Newton PDA, then formed a joint partnership with the chip’s inventor, Acorn, that was spun off as an independent company called Advanced RISC Machines (ARM). Then Apple came along again to bolster the company just ahead of its IPO by inking a long-term agreement with ARM that extends beyond the year 2040. (Did Apple “make” and then “save” ARM? 🤔)
Personally, I suspect Microsoft/Microsoft Windows will ultimately go all-ARM — while keeping Intel happy (though ditching Intel’s longtime proprietary IP) by having them fab and supply custom ARM designs — custom like Apple’s. Microsoft always wants to be Apple; always has, always will.
Historical chip designer Intel will have to swallow its pride as it fabs ARM-based designs for Microsoft, but, hey, x86-maker AMD is already doing it.
x86 will go the way of DOS, IMHO. (We’ll see.)
It’s fascinating that today’s most cutting-edge ARM technology evolved from a design that began a long time ago. But the best, most modern operating systems in the world today are Unixes, and Unix’s development also began a long time ago — in the late 1960s.
It all has to do with the design philosophy at the start — and the imagination required to build things — originally — with headroom for a limitless future in mind. The fruits of this philosophy can be seen in Unix, ARM, and NeXT — and all three are now at Apple.
So Acorn’s design outlook, from the start of its efforts in 1984 to design a RISC processor, was versatility and extensibility. The result today is an ARM ecosystem and a broad family of scalable ARM designs used in embedded systems, controllers, inexpensive IoT products, phones & “devices,” Macs — and even supercomputers.
The simplicity of instructions in RISC designs equals less power and requires fewer transistors, so, given that, think how powerful it must be to have 19 billion transistors on Apple’s new A17 Pro!
And I can’t wait to learn all the new things Apple will introduce in the upcoming M3.