Pro Desktop chips won't have 'efficiency' cores as they're useless and consume space on the chip.
I put off responding to this earlier because I decided to wait until there was more quantitative data to highlight just how substantively wrong this is. So let's jump to AnandTech's review of the Mini.
"... I also included multi-threaded scores of the M1 when ignoring the 4 efficiency cores of the system. Here although it’s an “8-core” design, the heterogeneous nature of the CPUs means that performance is lop-sided towards the big cores. That doesn’t mean that the efficiency cores are absolutely weak: Using them still increases total throughput by 20-33%, depending on the workload, favouring compute-heavy tasks. ..."
https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested/5
The efficiency cores were more effective the "harder" the workload was. Their impact is lower where there is just lightweight work to get done (on a single task). They are far more effective at handling a broader array of processes/threads that have completely different instruction streams and/or data streams.
Why? The same reason embarrassingly parallel, high-intensity workloads are more effective on GPU cores (which are relatively much less power consuming than the "super power consuming" performance cores in CPUs). More functional units tend to trump more clock speed.
8 arithmetic units at 1.8 GHz = 14.4 max compute results ( 2 units × 4 cores = 8 units )
4 arithmetic units at 3.6 GHz = 14.4 max compute results ( 4 units × 1 core = 4 units )
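The back-of-envelope math above can be sketched in a few lines. The unit counts and clocks here are the illustrative numbers from the text, not actual M1 specs:

```python
# Aggregate peak throughput scales with
# (arithmetic units per core) x (cores) x (clock in GHz).
# Illustrative numbers only -- not real M1 microarchitecture figures.

def peak_results_per_ns(units_per_core: int, cores: int, ghz: float) -> float:
    """Max arithmetic results per nanosecond across a cluster of cores."""
    return units_per_core * cores * ghz

# 4 "efficiency-style" cores: 2 units each at 1.8 GHz -> 8 units total
e_cluster = peak_results_per_ns(units_per_core=2, cores=4, ghz=1.8)

# 1 "performance-style" core: 4 units at 3.6 GHz -> 4 units total
p_core = peak_results_per_ns(units_per_core=4, cores=1, ghz=3.6)

print(e_cluster)  # 14.4 -- wider-but-slower matches...
print(p_core)     # 14.4 -- ...narrower-but-faster on peak throughput
```

Same peak throughput either way; the E-core cluster just gets there at a much lower clock (and much lower power).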
Where the E cores would be "useless" is if you have some narrow code that invokes a functional unit the E cores don't have (e.g., the Apple Matrix Extension, AMX). [ E cores do have the basic vector handling and mainstream math functional units. ] In that case, yeah:
0 AMX units at 1.2 GHz = 0
1 AMX unit at 3.0 GHz = 1
That is where an additional P core would be a major "win".
One big "power" core isn't going to out-hustle 4 "smaller" ones if you split the work up evenly across the cores.
( Same page as the first link above. )
The four performance cores have a result of 21.60 on the "int" benchmark. That boils down to about 5.4 per core. Adding another P core would get you to about 27. And yet adding the 4 efficiency cores boosts the score to... 28.85. Which is a bigger number.
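That scaling argument, using the AnandTech numbers quoted here (and assuming roughly linear per-core scaling, which is an assumption):

```python
# Numbers from the AnandTech "int" aggregate quoted above.
four_p_score = 21.60           # measured: 4 P cores
per_p_core = four_p_score / 4  # ~5.4 per P core, assuming linear scaling

# Hypothetical 5th P core (occupying roughly the die area of 4 E cores)
five_p_estimate = per_p_core * 5   # ~27

# Measured: 4 P cores + 4 E cores
four_p_plus_e = 28.85

print(per_p_core)                       # ~5.4
print(five_p_estimate)                  # ~27
print(four_p_plus_e > five_p_estimate)  # True: the 4 E cores win
```

Same die-area budget, better multi-threaded score with the E cores than with one more P core.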
Apple matched the gen 11 ( Tiger Lake ) Intel SoC there with the 4 P cores, and the E cores are letting them open up a gap. At some point Intel will release H-series parts ( 8 cores ) that will do better on "int" but probably worse on iGPU ( and power ).
That pretty much demonstrates quantitatively that the notion you are peddling here is at best flawed and based more on "smack talking" than insight.
If there were tons more bandwidth, a bigger power/TDP budget, and several other things Apple may or may not do because they drop performance/watt, then adding two more P cores would get a better number.
With bigger TDP and transistor budgets, Apple probably will add more P cores ( along with more memory I/O bandwidth ), but they are also quite likely to shoot for lower core counts than the mainstream desktop AMD and Intel models. ( Intel is about to shift gears with what should be gen 12 desktops, Alder Lake. ) But the E cores and iGPU ( and Neural Engine and Secure Enclave, and the rest of the SoC die here ) probably are not going away.
Besides, earlier in the review:
"... Because one core is able to make use of almost the whole memory bandwidth, having multiple cores access things at the same time don’t actually increase the system bandwidth, but actually due to congestion lower the effective achieved aggregate bandwidth. I’ve especially noted this when using the performance cores in tandem with the efficiency cores in memory copies – 4 big cores peak out at 59GB/s memory copies, but as soon as an efficiency core is added this reduces to 49GB/s, going down to 46GB/s when all cores are active, pointing out to a bottleneck in the system somewhere. ..."
What Apple needs for the desktop variants further up the stack is more I/O bandwidth. A single core saturating the memory controller is nice for "single thread drag racing" benchmark "porn" contests, but suppose you have an application with a complicated model coupled to graphics work, all sharing the same bandwidth. You're not going to beat an RX 560 while 1-2 P cores are almost completely soaking up the whole bandwidth to the unified memory.
For most apps this sharing will work because there is a balance between display and computation. But walk and chew gum at the same time... that won't work as well. Going up the CPU food chain toward the Xeon W class, Apple has far more of an I/O problem than a "too many E cores" problem.
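As a rough sketch of that contention: the M1's LPDDR4X link peaks around 68 GB/s per the same review, and the 59 GB/s copy figure is in the quote above. Treating bandwidth as a simple fixed pool (a deliberate simplification; real memory controllers arbitrate more gracefully) shows how little headroom the GPU inherits:

```python
# Simplified shared-bandwidth model: whatever the CPU cluster streams,
# the GPU gets the leftovers of the unified-memory link.
# TOTAL_BW is the rough LPDDR4X peak from the review; cpu_demand is
# the quoted 4-P-core memory-copy figure.

TOTAL_BW_GBPS = 68.0   # approximate unified-memory peak
cpu_demand    = 59.0   # 4 P cores doing memory copies (quoted figure)

gpu_leftover = max(TOTAL_BW_GBPS - cpu_demand, 0.0)
print(gpu_leftover)  # ~9 GB/s left over while the CPU is streaming
```

An RX 560 alone has over 100 GB/s of dedicated VRAM bandwidth, so a GPU living on single-digit leftovers is not going to keep up while the CPU is hammering memory.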
Apple's design here is skewed toward maxing out the single thread drag racing score. I'd be very surprised if Apple ever got to the point of sacrificing that just to beat AMD/Intel in some maximum "core count" marketing war. Folks looking at the Mac Pro 2019 or Threadripper and saying that's the core count target Apple has to hit... are probably on the wrong path.
What you'll probably have is two divisions of chips, for Pro and for power savings. We got the latter, which is why those products were dumped on the market last week.
I'm highly doubtful Apple is going to do fundamentally different cores for the desktop. For better or worse, people tend to buy mid-to-high range systems based on core counts. With the E cores, Apple can bump that count by 4 in the same space that pragmatically just 1 P core could fit. As long as folks are just buying "12 is better than 6, which is better than 4", that gets them to a high count with a lower-cost die ( and they probably don't have the I/O bandwidth or transistor budget to swap in more P cores anyway ).
The follow-on Macs will be Pro, and the laptops will be the systems they didn't have time to build yet. Remember, designing unique chips is a very expensive process, so Apple will be trying to reuse designs across as many platforms as possible to reduce costs.
That huge expense is highly likely why they aren't going to go to a completely different, focused internal bus and memory hierarchy strategy.
A path to a larger system level cache where "half" of the cache is fed by another memory subsystem, with another 4 P cores and another 8 GPU cores sitting around a bigger "round table" with the stuff that is already there? Maybe. But doing that 4-5 times? I'm more than a bit skeptical that Apple is going to pay to "fork" that far off the design building blocks that are largely being paid for by the iPhone/iPad.
And for scaling up the GPU, at some point it is just "cheaper" to put a PCIe bus handling system in there and just enable large GPU-only chips with an incrementally different graphics driver model ( unified, just with much larger latencies ).