What about the memory bandwidth? Also, are we talking SoC or CPU? This whole thread is about compute performance. Now we're bringing in memory bandwidth, which is something completely different: it matters for larger data chunks and introduces additional dependencies, including the OS itself. So we're not talking chip level anymore?
Also, what the heck are you talking about with 8-wide ALUs and 4-wide ALUs? Have you actually read the article I linked to? I'm going to quote:
I'm simply going to assume you do not understand this. I can't explain this in a single sentence. Books have been written about it, and lectures are taught on it. Here's a starter:
https://www.robots.ox.ac.uk/~dwm/Courses/2CO_2014/2CO-N2.pdf
You'll probably need additional sources to fully understand it.