Think on this: Intel and AMD are holding to a certain pattern of core counts because of the way their market works. Low-power chips with 2 or 4 cores, standard desktop chips with 4 or 6 cores, high-end desktops with 8 or 10 cores. Then server chips with up to 32 or 64 cores. This allows them to maximise their revenue by charging what the market will bear in different market segments.
I don't think that's the full story. Different markets have different needs. A consumer at home or at the office cares more about responsiveness and versatility (fewer cores with higher peak clocks), while a workstation or server user cares more about multi-threaded throughput (more cores with balanced clocks). Core counts, dynamic clock ranges, and RAM configurations all depend heavily on these considerations.
And of course, there is the price issue. Making larger chips is disproportionately more difficult and expensive. And it's not just the chip production itself, it's everything around it. The memory subsystem is already tricky to deal with: to feed many cores you need more RAM bandwidth, which means more memory controllers and a wider RAM bus, and all of that is really expensive.
Now, the ARM Neoverse architecture already allows for 64-core ARM server chips, and these are already being made for Amazon Web Services in the form of their Graviton2 processors, which largely follow ARM's reference design. This means there is already an example implementation that solves the problems associated with putting ARM CPU cores into a large-scale setup.
There are some users here who don't believe in Apple's ability to execute a new Mac Pro, but frankly, this is where I have very little doubt. The energy efficiency of Apple's CPU cores makes them inherently scalable. Intel and AMD have to severely underclock their server CPU cores relative to their desktop models; Apple does not have to, because their cores' peak power usage is four to five times lower.
One disclaimer though: ARM Neoverse has nothing to do with this since Apple uses custom designs. Performance and thermal characteristics of Apple CPUs are completely different.
So in order to maximise the impact of the new machines, we might see processors with 16 performance cores for desktop iMacs, and perhaps even 64 performance cores for a future Mac Pro. After all, Apple doesn't make chips for the server market, so they are free to use these technologies to fill out their own product lineup however it suits them. In one fell swoop Apple could redefine the market so that every iMac could function as a high-class workstation, and the Mac Pro would compete with machines bearing the most expensive Xeon or EPYC processors.
I have little doubt that the upcoming Mac Pro will offer excellent performance and plenty of cores (Apple doesn't need to go to 64 cores to compete with the likes of EPYC; a 32-core Apple CPU will be more than sufficient), but I am skeptical about 16 cores in the iMac. It is overkill for an iMac user and it makes little sense financially.
Now, I do agree with you that Apple Silicon will disrupt the desktop market, but in a different way. The interesting thing about Apple Silicon is that it brings the user technology that is usually associated with the datacenter. Look at Nvidia's new Grace datacenter systems, for example: a large CPU and GPU connected by a super-fast fabric, using high-bandwidth unified memory. Does this ring a bell? It's like it came straight from Apple's M1 announcement slides. The traditional desktop setup up to now has been
(CPU + a large amount of slow but low-latency RAM) --- slow data link --- (GPU + a small amount of fast but high-latency RAM)
Apple Silicon is changing this to
(CPU + GPU + coprocessors) + a large amount of fast, low-latency RAM
This is the "true" revolution for the consumer market as it enables new levels of performance, efficiency, and also completely new applications. On Apple Silicon I can for example make a game that adaptively generates geometry on the CPU and immediately renders it on the GPU. Or I can have a professional content creation app that applies complex effects to a video stream using the GPU and simultaneously does image detection using the ML coprocessor. I can't do any of these things in the traditional architecture because PCI-e link between CPU and GPU is too slow. Why do you think Nvidia is packing ML processors on the GPU? Exactly, to prevent data sending back and forth...
There's a difference that a lot of people don't appreciate: Apple has a much better idea of the software their customers actually run. So they can add APIs, and custom silicon behind them, so that expensive compute patterns which would normally be done with standard instructions get performed a lot more efficiently. Intel and AMD can't really do that. They added vector extensions, which can kind of do that, but those are more building blocks than APIs for a particular operation. Intel sells software packages that provide APIs to do things more efficiently, but that's still a software layer, not dedicated silicon.
So Apple can make specific applications or functions faster. It may or may not show up in generic benchmarks, but things will feel a lot faster for the people who use the software that was optimized.
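To illustrate the idea, here is a small Swift sketch using Accelerate. The array contents are placeholders. The caller only asks for the operation (a dot product here); how it maps onto the silicon (NEON, reportedly a matrix coprocessor on recent chips, whatever a given generation has) is entirely up to Apple's framework, which is exactly the kind of API-level contract Intel and AMD can't impose on third-party software.

import Accelerate

// Sketch: the application states *what* it wants done through a standard API
// (Accelerate/vDSP); which execution units actually do the work is decided by
// the framework and can change from chip to chip without touching app code.
let a: [Float] = (0..<4096).map { Float($0) }
let b: [Float] = (0..<4096).map { Float($0) * 0.5 }

// Dot product via vDSP instead of a hand-written loop.
var dot: Float = 0
vDSP_dotpr(a, 1, b, 1, &dot, vDSP_Length(a.count))

// Equivalent naive version, compiled to ordinary scalar/SIMD instructions.
let naive = zip(a, b).reduce(Float(0)) { $0 + $1.0 * $1.1 }
print(dot, naive)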
The specialized coprocessors Apple includes in their systems are not that different from what others are doing. Intel has an ML acceleration block in their CPUs (Intel AMX), and Nvidia and AMD have them in their GPUs. I don't see much difference in principle here. Now, Apple does have an obvious advantage: they have an easier time integrating all these accelerators into the standard APIs. Intel and co. don't really have that luxury because they first have to convince users to adopt their API (and that's how unhealthy market phenomena like CUDA are born).
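As a concrete example of what "integrated into the standard APIs" means, here is a hedged Swift sketch using Core ML. The model file name is a placeholder; the relevant part is that the app only declares which compute units it is willing to use, and the framework decides whether the model actually runs on the CPU, the GPU, or the Neural Engine.

import CoreML
import Foundation

// The accelerator is hidden behind the platform API: the app picks a policy,
// Core ML picks the hardware (CPU, GPU, or Neural Engine) at load time.
let config = MLModelConfiguration()
config.computeUnits = .all   // allow the Neural Engine if the model supports it

// "Classifier.mlmodelc" is a placeholder for whatever compiled model you ship.
let modelURL = URL(fileURLWithPath: "Classifier.mlmodelc")
do {
    let model = try MLModel(contentsOf: modelURL, configuration: config)
    // Predictions go through model.prediction(from:) as usual; nothing in the
    // calling code changes depending on which unit ends up executing the model.
    _ = model
} catch {
    print("Failed to load model:", error)
}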
There is also another part of this story, and this is really where Apple is very different: optimizing commonly used software patterns. Apple CPUs, for example, include hardware features that accelerate critical Objective-C and Swift APIs. These are not specialized instructions, but simply CPU behavior. Apple's frameworks rely heavily on reference counting, so they have designed their CPUs to do reference counting very quickly. I've also seen mention that their CPUs have a hardware predictor for Objective-C method dispatch. These features won't make much sense in code that doesn't use these software patterns, but on Apple platforms they can provide a massive increase in performance.
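I can't show the hardware side, obviously, but here is a tiny Swift example of why this matters: completely ordinary framework-style code generates a constant stream of retain/release operations (atomic reference-count updates), so a core that executes those very cheaply speeds up "normal" app code rather than any particular benchmark. Whether a given retain/release pair survives the optimizer is up to the compiler; the point is the pattern, not this exact count.

import Foundation

// An ordinary ARC-managed class, the bread and butter of Apple's frameworks.
final class Node {
    let value: Int
    init(value: Int) { self.value = value }
}

let nodes = (0..<100_000).map { Node(value: $0) }

var sum = 0
for node in nodes {
    // Copying each reference out of the array implies retain/release traffic
    // (some of it may be optimized away); this is the pattern Apple's cores
    // are reported to handle much faster than the ISA alone would suggest.
    sum += node.value
}
print(sum)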