Looking up MacroScalar I came across this, written about 13 years ago. It was a good read for multiple reasons, not just for the technical description of how MacroScalar might work but also for the historical perspective on Apple's entrance into microarchitecture design:
A couple months ago, a story made the nerd-press rounds about Apple’s trademark application and several patents for something called a “macroscalar” processor architecture. I’ve taken a stab at decoding the publicly available information about macroscalar architectures to give a coherent picture...
www.cs.cornell.edu
The nutshell version of something like MacroScalar is that rather than working with fixed-size registers and SIMD operations (even SVE-style), you write your code as standard scalar loops, but some of those loops are captured by the compiler and annotated as being amenable to SIMD'ization.
Everything after that happens in the CPU; at some point (possibly when the instructions are loaded into L2, possibly when they're loaded into L1, possibly even at Decode time) certain patterns of instructions are converted on the fly into a SIMD version appropriate for that particular hardware.
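To make that concrete, here's a minimal sketch of what the ISA-visible code would look like. The pragma is purely hypothetical (no shipping compiler accepts it); the point is just that the source and the emitted instructions stay scalar, and any widening happens inside the core:

```c
/* Hypothetical illustration only: the code the compiler emits stays a plain
 * scalar loop.  In a MacroScalar-style design the compiler would merely mark
 * the loop as vectorizable; the CPU front end would then expand it to
 * whatever SIMD width this particular core happens to have. */
void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
    /* Imagined annotation -- not a real pragma for any real compiler. */
    /* #pragma macroscalar vectorizable */
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];   /* hardware could fuse 4, 8, ... iterations */
    }
}
```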
The primary win of this is that you get everything SVE gives you, without the various pain points; AND you don't burn up your 32-bit ISA encoding space the way SVE has done.
The primary loss is that you give up perhaps even more control than you do with SVE (e.g. over things like data rearrangement).
The idea admittedly sounds some combination of cool and insane. But others have suggested the same sort of idea, for example here's a version for POWER:
https://libre-soc.org/openpower/sv/
Another way to look at this is that it's an implementation of "real", Cray-style vector computing ["hardware for loop"] rather than SIMD, and whether the implementation underneath is SIMD or something else is a low-level detail unimportant to the developer.
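If you haven't seen the Cray/RVV model before, here's the "hardware for loop" idea written out as plain C. MAXVL and set_vl() are stand-ins for the hardware choosing how many elements each pass handles, not any real API; the programmer's view is simply "do the whole loop", and the width is a machine detail:

```c
/* Sketch of the Cray-style "set vector length" model, in scalar C so the idea
 * is visible without committing to any real ISA. */
enum { MAXVL = 64 };                     /* whatever width this implementation has */

static inline int set_vl(int remaining)  /* hardware: vl = min(remaining, MAXVL) */
{
    return remaining < MAXVL ? remaining : MAXVL;
}

void vadd(float *dst, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; ) {
        int vl = set_vl(n - i);           /* hardware picks the chunk size     */
        for (int j = 0; j < vl; j++)      /* conceptually one vector operation */
            dst[i + j] = a[i + j] + b[i + j];
        i += vl;
    }
}
```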
Yet another way to look at this is to assume (who knows?) that Apple considered THE important part of MacroScalar to be not the "hardware for loop" part but the offloading to separate hardware part, and that's already present with AMX/SME so Mission Achieved, move on...
As I keep saying, the primary win in all these schemes is that you get a better compiler target, and that's worth a lot. Hell, if Apple just adopts SVE128, and says they will NEVER change the width, we get NEON plus predication, no need to test for load/store boundary conditions, and no loop tails; that's worth a few percent on "most" code, including non-FP code, and getting a few percent on current CPUs any other way is not easy...
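For what that buys you, here's a minimal sketch using the standard ACLE SVE intrinsics (nothing Apple-specific assumed): one predicated loop, no separate tail, and the same source works whether the vector width is 128b or anything larger.

```c
#include <arm_sve.h>

/* y[i] += a * x[i].  The predicate from svwhilelt masks off lanes where
 * i + lane >= n, so the final partial iteration needs no scalar tail loop
 * and no extra bounds checks on the loads and stores. */
void saxpy_sve(float *y, const float *x, float a, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntw()) {   /* svcntw() = 32-bit lanes per vector */
        svbool_t pg = svwhilelt_b32_s64(i, n);    /* active only where i + lane < n     */
        svfloat32_t vx = svld1_f32(pg, &x[i]);
        svfloat32_t vy = svld1_f32(pg, &y[i]);
        vy = svmla_n_f32_x(pg, vy, vx, a);        /* vy + vx * a */
        svst1_f32(pg, &y[i], vy);
    }
}
```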
BTW I think the question of increasing transistor densities is interesting. Here's what I wrote recently:
www.realworldtech.com
You can read this and view it as meaning that RAPID transistor density increases are over, so SVE256b is a luxury that will not make sense for most users, and that area would be better spent in other ways.
Or you can view it as meaning that all the machinery we hope might make SVE256b less necessary (e.g. 5 or 6 NEON units) will not come as easily as it did in the past, and that widening the vector length [with the obvious concomitant features, like not activating unused lanes to save power] is in fact easier than those alternatives.
I honestly don't know.