This line of reasoning is too simplistic to be taken at face value. Instruction decoding uses a minuscule fraction of silicon. The 8086 processor was able to decode this instruction set just fine while using 0.000x percent of the transistors that current chips have. If "instruction decoding" also implies their execution then, again, it's not enough to just point out their complexity. While the time required to execute such instructions might be greater than the time to execute "simple" instructions, one has to account for the fact that complex instructions also do more than simple instructions, so a given algorithm may need fewer complex instructions than it would simple ones.
I'd say the person being simplistic isn't me.
No, decoding doesn't imply execution. The decoder is responsible for scanning bytes coming from the instruction fetch unit, figuring out which instruction those bytes represent, performing several tasks to expand the instruction into a format suitable for execution (in x86 land, these expanded instructions are called micro-ops or uops), and finally handing uops off to a dispatch unit which sends them on to execution units.
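To make that division of labor concrete, here's a minimal sketch in Python of what the decode stage does, with a made-up two-entry instruction table; the instruction names and uop splits are invented for illustration, not taken from any real decoder.

```python
# Toy model of the decode stage: identify the instruction, expand it into
# micro-ops, hand the uops off to dispatch. The table below is invented.
FAKE_ISA = {
    0x01: ("add reg, reg", ["uop_add"]),                # simple op: one uop
    0x02: ("add reg, [mem]", ["uop_load", "uop_add"]),  # load-op: cracked into two uops
}

def decode(opcode_byte):
    """Figure out which instruction the byte represents and expand it to uops."""
    name, uops = FAKE_ISA[opcode_byte]
    return name, uops

def dispatch(uops):
    """Stand-in for the dispatch unit sending uops on to execution units."""
    for uop in uops:
        print("dispatch", uop)

name, uops = decode(0x02)
dispatch(uops)   # prints "dispatch uop_load" then "dispatch uop_add"
```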
The ISA decoded by the original 8086 was much smaller and simpler than x86-64 is today. They hadn't yet severely abused the prefix byte mechanism to add tons of new instructions. More to the point, the 8086 only had one execution pipeline and therefore only needed to decode one instruction at a time. Decoding multiple instructions in parallel is the big pain point today.
This is because x86-64 instructions are one to fifteen bytes long, in one-byte increments, and it's not trivial to determine exactly how long an instruction is. You have to examine the first byte, make decisions based on its contents, possibly move on to the second byte, make decisions based on its contents, etc. Not one-by-one all the way to 15 bytes, but it takes some serially dependent work to figure out how long an instruction is.
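Here's a rough Python sketch of that serial dependence. The prefix set and the one-byte opcodes below are real x86 bytes, but everything else is pretended to be "opcode plus ModRM"; the real rules are far messier.

```python
# Toy length decoder: each decision depends on the byte examined before it.
# Real x86 length rules (escape bytes, REX/VEX/EVEX, SIB, displacements,
# immediates) are much worse; this keeps only the shape of the problem.
PREFIXES = {0x66, 0x67, 0xF0, 0xF2, 0xF3}   # a few real x86 prefix bytes
ONE_BYTE = {0x90, 0xC3}                      # NOP, RET

def insn_length(code, start):
    i = start
    while i < len(code) and code[i] in PREFIXES:   # decision: prefix or not?
        i += 1
    if i >= len(code):
        return len(code) - start                   # truncated tail; toy answer
    if code[i] in ONE_BYTE:                        # decision: one-byte opcode?
        return i + 1 - start
    return i + 2 - start                           # pretend: opcode + ModRM
```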
This isn't so bad if you're doing just one and you can decode the next instruction in the next cycle, but modern CPUs need to decode many instructions in the same cycle. If you're building an x86, you don't know where the second instruction begins until you've finished figuring out the length of the first, or where the third begins until you've figured out both the first and the second, and so on. This is not fun when you want to decode four or six instructions in parallel. (Six is the current state of the art, seen in Intel Golden Cove.)
The solutions x86 chip architects turn to are a bit insane. As far as I know, the basic approach is still brute force: just build a ton of decoders. Let each one start work on a different byte offset. Later, figure out which ones to pay attention to (because they happened to start at the beginning of a real instruction) and which ones to ignore (because they produced meaningless garbage after starting in the middle of a real instruction). You still end up with a nasty dependency chain when selecting which decoders to believe, and you do a lot of work that gets thrown away, but you can make it run fast. At a price.
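In software terms, the brute-force scheme looks roughly like this, reusing the toy insn_length from the sketch above. In hardware the first step is a row of decoders working in parallel, not a loop.

```python
def brute_force_decode(code, width=6):
    # Step 1: speculatively decode starting at every byte offset. In hardware
    # this is a bank of parallel decoders; most of the results are garbage.
    candidates = {off: insn_length(code, off) for off in range(len(code))}

    # Step 2: the serial part. Walk from offset 0, keeping only the candidates
    # that start exactly where the previous real instruction ended.
    picked, off = [], 0
    while off < len(code) and len(picked) < width:
        length = candidates[off]
        picked.append((off, length))
        off += length
    return picked   # every candidate not picked was wasted work
```

For example, brute_force_decode(bytes([0x66, 0x01, 0xC0, 0x90, 0xC3])) speculatively decodes at five offsets, keeps three of them, and throws the other two away.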
Parallel decode is trivial on Arm. All Arm v8/v9 A64 instructions are exactly 4 bytes long, and they are always aligned to 4-byte boundaries. You just grab bytes from the fetcher, stuff the first four into decoder 1, the second four into decoder 2, and so on. Simple, and no work is wasted.
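For contrast, here's the fixed-width case in the same toy style; every decoder slot knows its start offset up front, with no dependence on its neighbors.

```python
# With fixed 4-byte, 4-byte-aligned instructions, splitting a fetch block
# across decoder slots is just slicing.
def a64_decode_slots(fetch_block, width=8):
    return [fetch_block[i * 4:(i + 1) * 4] for i in range(width)]
```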
Intel's low-power cores (the Atom series) can't do 6-wide decode; it's too much to ask. Their current state of the art is dual 3-wide decoders, with the second bank of 3 coming into play only on branches. (When the branch predictor has cached the target address of a branch, it's possible to start the second group of 3 decoders off at that branch target ahead of time.)
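A loose sketch of that arrangement, again in terms of the toy brute_force_decode above; the function name and the way the predicted target is passed in are mine, not a description of Intel's actual hardware interface.

```python
def dual_cluster_decode(code, predicted_target, width=3):
    # Cluster A decodes from the top of the fetch block and runs into the branch.
    cluster_a = brute_force_decode(code, width)
    # Cluster B starts at the cached branch target ahead of time, so its work
    # overlaps with cluster A instead of waiting in line behind it.
    cluster_b = [(predicted_target + off, length)
                 for off, length in brute_force_decode(code[predicted_target:], width)]
    return cluster_a, cluster_b
```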
Atoms don't really scale down to phone power levels anymore; Intel gave up on that segment long ago, so Atom cores are now targeted at low-end laptops, some embedded applications, and the "efficiency" cores in Alder Lake. Apple is able to put full 8-wide decode units into the performance cores in their phones, even though phones have tighter power constraints than the things Atoms go into. Does that tell you anything about how serious this issue is for x86?