I doubt either Intel or AMD will ever be explicit about their decision-making process, but I think the safe assumption is that both have very smart engineers making these decisions. If a wider decoder were worth the transistor budget, I expect they would build it. The fact that they haven't is a pretty good indication that 4-wide is about as good as it is going to get.
What about ARM itself? Their Cortex-X1 has 5 instruction decoders. Apple's A7 had 6 decoders (and a 192-entry ROB) back in 2013! What has ARM been doing these past seven years?
The problem is probably a basic computer-science issue of cascading complexity. An x86_64 instruction can be anywhere from 1 to 15 bytes long. When you start decoding the stream, you have no idea how long each instruction is; you have to finish decoding it to find out. So every additional decoder has to start somewhere in the middle of another decoder's stream. You can see how the problem compounds as you add decoders: there is no easy way for a decoder to know in advance where its instruction ends and the next one begins.
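To see that serial dependency in its simplest form, here is a toy sketch in C (this is not real x86 decoding; the "length in the low nibble of the first byte" encoding is invented purely to show the chain): the start of instruction N+1 cannot be computed until the length of instruction N is known.

```c
/* Toy model of sequential variable-length decode. toy_length() stands in
 * for the real x86 rules (prefixes, opcode maps, ModRM, SIB, displacement,
 * immediate); here the low nibble of the first byte simply encodes a
 * length of 1..15 bytes. The point is the loop-carried dependency. */
#include <stdio.h>
#include <stddef.h>

static size_t toy_length(const unsigned char *p) {
    size_t len = *p & 0x0F;
    return len ? len : 1;              /* clamp to at least one byte */
}

int main(void) {
    unsigned char code[] = {0x03, 0, 0, 0x01, 0x04, 0, 0, 0, 0x02, 0};
    size_t off = 0;
    while (off < sizeof code) {
        size_t len = toy_length(code + off);
        printf("instruction at offset %zu, length %zu\n", off, len);
        off += len;                    /* next start depends on this length */
    }
    return 0;
}
```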
You are making this problem more complex than it has to be. Just because x86 has variable-length instructions doesn't mean the instruction stream has to be scanned sequentially. It is possible to examine a block of instructions (say, 16 bytes) at once, detecting the significant bit patterns. Similar techniques are used to process sequential data with SIMD hardware, for instance for UTF-8 validation or JSON parsing. Basically, you test each byte against every possible significant bit pattern (of which there are only so many) and then use masking and logical operations to mark element boundaries.
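To make the masking idea concrete, here is a minimal sketch in C using the UTF-8 analogy above (SSE2 intrinsics; it illustrates the bit-pattern-plus-mask approach, not how Intel's or AMD's decoders are actually wired): a byte starts a code point unless its top two bits are 10, so one AND, one compare and one movemask flag every boundary in a 16-byte block at once.

```c
/* Mark UTF-8 code-point boundaries in a 16-byte block in parallel.
 * Every byte is tested simultaneously; the result is a 16-bit mask with
 * a 1 wherever a new element starts. */
#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t utf8_start_mask(const uint8_t block[16]) {
    __m128i bytes = _mm_loadu_si128((const __m128i *)block);
    /* Isolate the top two bits of each byte... */
    __m128i top2  = _mm_and_si128(bytes, _mm_set1_epi8((char)0xC0));
    /* ...and check for the continuation pattern 10xxxxxx. */
    __m128i cont  = _mm_cmpeq_epi8(top2, _mm_set1_epi8((char)0x80));
    /* One bit per byte; invert so that 1 = "this byte starts an element". */
    return ~(uint32_t)_mm_movemask_epi8(cont) & 0xFFFFu;
}

int main(void) {
    uint8_t block[16] = {0};
    memcpy(block, "a\xC3\xA9" "b\xE2\x82\xAC" "c", 8);   /* a, é, b, €, c */
    printf("start mask: 0x%04x\n", (unsigned)utf8_start_mask(block));
    return 0;
}
```

Real x86 length determination obviously needs far more predicates (prefixes, opcode maps, ModRM/SIB, immediates), but the structure is the same: per-byte tests combined with bitwise logic across the whole block.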
I see no reason why Intel or AMD wouldn't use a similar technique (or maybe something even smarter): it maps easily to a hardware implementation, can be pipelined, and delivers results for the whole block at once. It would obviously be more complex than fixed-length decoding, and I can imagine the silicon ending up more power-hungry, but it would be just as fast, with minimal added latency.
Again, I have no proof, and I will freely admit that my opinion is just uneducated speculation. But I have also seen no clear proof that x86 is limited by the variable length of its ISA. My gut feeling is that the reason Intel and AMD are stuck at 4-wide decode (with some optimizations here and there) is essentially the same reason ARM's Cortex designs are stuck at 4-5 wide decode(*): neither of them has figured out how to execute that far ahead in the instruction stream. It's not the ISA; it's the architectural approach taken by these companies that cannot scale beyond a certain width. Apple, however, managed to circumvent this.
(*) Since x86 packs load/store+op into a single instruction and ARM needs a separate load/store instruction, 4 x86 decoders probably just about match 5-6 ARM decoders in practice.
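To illustrate the footnote's point with one concrete (and hedged) example in C: a statement that loads from memory and adds typically compiles to a single memory-operand instruction on x86-64 but to a separate load plus add on AArch64 (the assembly in the comment is indicative of typical compiler output; exact registers, return-value moves and scheduling vary by compiler and flags).

```c
/* One C statement, two ISAs. Conceptually (register names and the
 * return-value shuffle omitted):
 *
 *   x86-64:  add  reg, [mem]          ; load and add in one instruction
 *   AArch64: ldr  tmp, [mem]          ; separate load...
 *            add  reg, reg, tmp       ; ...then the add
 */
long add_from_memory(long acc, const long *p) {
    return acc + *p;
}
```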