
deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053

M3 Max up to 20 cores, M3 Ultra up to 40 CPU cores?

It makes sense.

Makes sense in which universe?????

That link dates from 2021. The '40' came from multiplying 10 by 4, not by 2. The M1 Max has 8 P + 2 E cores (roughly 10).

The M2 put 4 E cores (a full E cluster) into the Max. It would make about zero sense for Apple to backslide on E core count in the move to TSMC N3 for the M3. None.

The quad was tried and failed to be a marketable solution. There are multiple indications that it isn't coming any time soon.

If Apple pushes the Max to 16 cores (3 × 4 P cores + 4 E cores), then the Ultra can hit 32 cores, which is decently close to the 40 that would have required twice as much silicon area. (Smaller is far more affordable and therefore more viable in the market. Remember that Apple makes folks who want more CPU cores also pay for more RAM and GPU cores at the same time, and vice versa: those looking for maximum RAM capacity or maximum GPU count have to pay for the ride-alongs too. So affordability matters, because the bigger die would drag a substantial number of folks out of the price range they'd like to pay.)

24 P-cores would likely be substantially better, on a wide variety of workloads, than any 24-28 core Mac Pro 2019 setup that some folks ran. Plus there would be 8 E cores to offload overhead onto.


If Apple wanted to get into a core-count pissing contest, it could go 3 P clusters + 2 E clusters (12P + 8E) to get to 20. Yeah, that would win some marketing sizzle points (almost shrinking an M1 Ultra back into a "Max" at the M3 stage), at least superficially on core counts. But that's basically loading up on E cores just to win the 'core count' wars, and Apple really doesn't need that. Lots of higher-end customers don't really need it either, since they're not likely running code that purposely pushes large amounts of work to E cores.

However, there's a pretty good chance that those 24 M3 P cores would do a better job than the projections for 32 M1-era P cores, especially if Apple attaches a better memory-hierarchy backhaul to them.



Even more warped, if count were the sole objective: 2 P clusters + 3 E clusters (8P + 12E), going down the same rabbit hole Intel is falling into. It gets even goofier when packaged as an Ultra, with a gross number of E cores that would likely have few tasks to do for many users in that workload space. I really doubt Apple is myopically fixated on specific core-count numbers (or on trying to win 'count wars' versus Intel/AMD).
 
Last edited:
  • Like
Reactions: Confused-User

name99

macrumors 68020
Jun 21, 2004
2,410
2,313
The quad was tried and failed to be a marketable solution. There are multiple indications that it isn't coming any time soon.
You lose all credibility by making strong assertions about something NONE of us know anything about.
Was there an M1 Extreme for internal testing? Who knows? Maybe, maybe not.
Was it considered for sale but rejected for whatever reason? Again, who knows, maybe, maybe not.
It could have been rejected for any number of reasons. By DESIGN it may be that some aspects of the M1 Max (something low-level like power co-ordination or interrupt delivery) were never planned for more than two chips. There certainly seems to be no obvious physical way to connect two Ultras together. Maybe quads were constructed using a messy, power-hungry, thermally hot mounting, but PURELY for the purpose of debugging the setup by the OS team and the designers of the future, real, quad-chip targeted design.

Anything could have happened and you know no more than the rest of us.

This is a shame because you have some reasonable points. But blathering on about the utterly unknowable drowns those points in a sea of skepticism.
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,313
Makes sense in which universe?????

Even more warped, if count were the sole objective: 2 P clusters + 3 E clusters (8P + 12E), going down the same rabbit hole Intel is falling into. It gets even goofier when packaged as an Ultra, with a gross number of E cores that would likely have few tasks to do for many users in that workload space. I really doubt Apple is myopically fixated on specific core-count numbers (or on trying to win 'count wars' versus Intel/AMD).

Given the current design, I agree that 4 E-cores on the A and M/Pro/Max appear to make sense.
BUT it's always possible that "the design" changes.
Consider for example the T2-based Intel Macs. The T2 provided various things (e.g. camera control, flash control), but a primary thing it provided was security. And much of that was achieved by substantially isolating the T2 from the Intel chip, in the same way that this is done for the SEP.

Suppose now that Apple restructured Darwin so that the absolutely most critical elements ran on separate cores (my guess is 2 E-cores would be enough), isolated from the rest of the chip in the same way that the SEP is isolated. The isolation can't be quite as total, because the OS has to write a lot of RAM and has to send many messages (interrupts, TLB invalidations, and so on) to other cores, but it can be substantial.
At least initially much of the OS, in the form of daemons, User Space Networking, and so on, would still run on normal cores; just a minimal OS would run on the secure E-cores, though obviously more and more functionality would be moved to those cores as feasible.
This would eliminate a whole range of security problems we currently face, both in the form of exploiting bugs in the OS, and in the form of one piece of code attempting to "spy" on the OS (Spectre, Zenbleed, and all that).

If Apple goes down this path (and honestly, my guess is that this is the long term evolution) then it makes sense to provide at least a second E-cluster (perhaps only two cores, and with limited, very controlled, communication to the rest of the SoC). At which point two E-clusters would make sense (even on the phone).

Apple has, in a way, already prototyped much of this with the T2, so it's not completely outrageous. It's non-trivial, of course – it's not easy to split up a complex OS into a minimal piece that can run separately, along with minimal and lightweight communication primitives to control that OS. But the payoff is large enough to justify the effort...
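
To make the shape of that concrete, here's a toy C sketch of the communication primitive such a split would need: application cores never run kernel code themselves, they just post requests into a small shared mailbox that the isolated core services. Everything here (mailbox_t, secure_call, a pthread standing in for the isolated E-core) is invented for illustration; it's a minimal sketch of the idea, not Apple's design.

Code:
/* Toy model of an "isolated OS core": application threads post requests
 * into a shared mailbox and spin for the reply; only the secure core
 * touches the (stand-in) privileged state. All names are invented. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    _Atomic uint32_t state;   /* 0 = empty, 1 = request posted, 2 = reply ready */
    uint64_t opcode;          /* which OS service is requested */
    uint64_t arg;
    uint64_t result;
} mailbox_t;

static mailbox_t mbox;

/* Runs on the hypothetical isolated E-core; services one request and exits. */
static void *secure_core_loop(void *unused) {
    (void)unused;
    while (atomic_load_explicit(&mbox.state, memory_order_acquire) != 1)
        ;                                /* real hardware would use a doorbell interrupt */
    mbox.result = mbox.arg + 1;          /* stand-in for real kernel work */
    atomic_store_explicit(&mbox.state, 2, memory_order_release);
    return NULL;
}

/* Called from an application core: post a request and wait for the reply. */
static uint64_t secure_call(uint64_t opcode, uint64_t arg) {
    mbox.opcode = opcode;
    mbox.arg = arg;
    atomic_store_explicit(&mbox.state, 1, memory_order_release);
    while (atomic_load_explicit(&mbox.state, memory_order_acquire) != 2)
        ;                                /* a real caller would block or use WFE/monitor */
    return mbox.result;
}

int main(void) {
    pthread_t core;
    pthread_create(&core, NULL, secure_core_loop, NULL);
    printf("reply: %llu\n", (unsigned long long)secure_call(1, 41));
    pthread_join(core, NULL);
    return 0;
}

Even in this trivial form you can see where the cross-cluster cost comes from: every call bounces at least one cache line between the calling core and the isolated core.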
 
  • Like
Reactions: ArkSingularity

Confused-User

macrumors 6502a
Oct 14, 2014
852
986
Suppose now that Apple restructured Darwin so that the absolutely most critical elements ran on separate cores (my guess is 2 E-cores would be enough), isolated from the rest of the chip in the same way that the SEP is isolated. The isolation can't be quite as total, because the OS has to write a lot of RAM and has to send many messages (interrupts, TLB invalidations, and so on) to other cores, but it can be substantial.
At least initially much of the OS, in the form of daemons, User Space Networking, and so on, would still run on normal cores; just a minimal OS would run on the secure E-cores, though obviously more and more functionality would be moved to those cores as feasible.
This is a fascinating idea, though so far I'm not convinced (which is not to say that I won't be in the future).

I'm trying to see where this would actually work well - that is, for what parts of the OS - and it's not clear to me that much of the OS could usefully be isolated that way. Perhaps, if you restructured it significantly, you could do a bit better (I'm thinking: a master OS running on your superSEP, worker OS threads on regular cores that only execute tasks assigned by the OS on the superSEP and never accept input directly from user processes... maybe, I'm just spitballing). Anything that went to this superSEP would at the least be going through an L3$/SLC, so that cache would take more pressure than it does today. (As opposed to the OS being able to run on the same core, or at least in the same cluster.)

I can definitely see uses for this idea though - at least, for hardware access control. What else? Tell me what I'm missing.

Edit: Obviously, more generally, hardware control like dynamic power management, which is what I meant to write (yeah, it's late).
 
Last edited:

Basic75

macrumors 68020
May 17, 2011
2,101
2,447
Europe
But the payoff is large enough to justify the effort.
The payoff would be that macOS becomes significantly slower than it already is. Think of what you are doing to the caches if system calls or messages to the operating system switch to a different processor core in a different cluster.
 

Confused-User

macrumors 6502a
Oct 14, 2014
852
986
The payoff would be that macOS becomes significantly slower than it already is. Think of what you are doing to the caches if system calls or messages to the operating system switch to a different processor core in a different cluster.
That's what I was talking about in the message above. But... it really depends on what particular functions of the OS are involved. Or possibly not even that?

My first thought was that this would be an overwhelming issue. But on further thought I'm not so sure. I realized that I don't have a good feel for what the critical paths are in a modern OS (and more to the point, in the interactions of system calls). For example:
- Memory allocation. Does this happen often enough *in the OS* to matter? (I'm talking about allocations to userspace processes, not kmalloc calls, slab and cache manipulation, etc., which are relevant purely to internal kernel function.) That's a different question from, say, how often your code instantiates objects. Maybe it's not important because it's extremely rare. That seems likely to me.
- I/O. At first thought this is an immediate killer, but is it? If you're doing networking in user space, and especially doing zero-copy, maybe you're *not* actually going to trash your L3/SLC. If you're doing scatter/gather disk I/O, is most of it typically in cache? What about normal writes and reads? If it mostly involves data out of cache, then this might not matter at all - the small amount of data being passed between clusters would take orders of magnitude less time than the actual I/O.

And since this is Apple's silicon team we're talking about, you can expect that they'll find some clever optimizations. Like, maybe, bypassing normal caching for data going between regular clusters and superSEP cluster. Maybe building a special queue adjacent to (in?) the NoC for such messages. Who knows what optimizations you could find for such a specific setup?
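
The "special queue" idea doesn't have to be exotic, either; a fixed-size single-producer/single-consumer ring per application cluster, sitting in (or next to) the fabric, would already do. A purely illustrative C sketch of that shape, with made-up sizes and names:

Code:
/* Illustrative single-producer/single-consumer ring for cross-cluster OS
 * messages. One application cluster produces, the OS cluster consumes;
 * head and tail each stay warm in their owner's cache, so only the slots
 * themselves move between clusters. Sizes and names are invented. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 64                    /* power of two, made-up size */

typedef struct {
    uint64_t opcode;
    uint64_t args[3];
} os_msg_t;

typedef struct {
    _Atomic uint32_t head;               /* written only by producer (app cluster) */
    _Atomic uint32_t tail;               /* written only by consumer (OS cluster) */
    os_msg_t slots[RING_SLOTS];
} os_ring_t;

/* Producer side: returns false if the ring is full. */
static bool ring_push(os_ring_t *r, const os_msg_t *m) {
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return false;
    r->slots[head % RING_SLOTS] = *m;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty. */
static bool ring_pop(os_ring_t *r, os_msg_t *out) {
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[tail % RING_SLOTS];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

The nice property is that the producer only ever writes head and the slots, and the consumer only ever writes tail, so beyond the message slots themselves the two clusters never fight over cache lines.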

It occurs to me that there's one more giant hairball this sort of scheme would cough up, if it came to fruition. Virtualization would be difficult or impossible. You'd need to either lie about having a superSEP and then fake it, or you'd need a different build of the OS for "hardware" that doesn't have that feature (or a single OS build that understands both environments).
 
Last edited:

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
The payoff would be that macOS becomes significantly slower than it already is. Think of what you are doing to the caches if system calls or messages to the operating system switch to a different processor core in a different cluster.
Actually, what is the difference compared to the current situation? From my limited understanding, more cache is required when the addressable memory gets larger, to mitigate cache thrashing. More cores do not require more cache, other than their local L1/L2.

Ring-fencing the core OS from the rest, like @name99 suggested, would slow down OS API calls, but macOS is already somewhat like this with its micro-kernel-ish architecture, compared to the monolithic kernels of Linux and (?) Windows.
 

Basic75

macrumors 68020
May 17, 2011
2,101
2,447
Europe
From my limited understanding, more cache is required when the addressable memory gets larger, to mitigate cache thrashing.
Required cache size and organisation depend on working sets and memory access patterns, not on the size of addressable memory. My point was that switching to a different core in a different cluster lets you start with cold L1 and L2 caches up to two times, once when switching there and once when switching back. Cache size has nothing to do with it.
 

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
Required cache size and organisation depend on working sets and memory access patterns, not on the size of addressable memory. My point was that switching to a different core in a different cluster lets you start with cold L1 and L2 caches up to two times, once when switching there and once when switching back. Cache size has nothing to do with it.
But this is already happening with existing macOS.
 

Confused-User

macrumors 6502a
Oct 14, 2014
852
986
Required cache size and organisation depend on working sets and memory access patterns, not on the size of addressable memory. My point was that switching to a different core in a different cluster lets you start with cold L1 and L2 caches up to two times, once when switching there and once when switching back. Cache size has nothing to do with it.
Is that true? Again, I don't feel like I have good instincts for this, so I'm interested in more knowledgeable peoples' reasoning, but my naive take on this is that the caches aren't much more likely to be cold that way than if it all ran on the same core. Consider: the cache is pre-warmed for user process "X" in the sense that X is running, filling the cache with its used memory. Then X makes an OS call, and it sleeps until the OS returns. What factors can make the cache cold for X when the OS returns? Just, how much it is used by other processes while X is sleeping.

So, how does that change if the OS is local, vs. remote (superSEP cluster)? If the OS is local, you get the fastest call time. But that doesn't tell you how fast X wakes up again. Perhaps some other processes get time first, in which case the cache is likely cold when you get it back. And the OS itself will dirty the cache, just servicing X's call. On the other hand, if the OS is remote, it takes slightly longer to get the call, but it runs potentially quicker (no contention with user processes), and it doesn't dirty X's cache with OS cache lines.

It's not obvious to me that the local OS situation is drastically better. Am I missing something?
 

jha

macrumors 6502
May 4, 2021
271
191
Will the M3 be worth holding off for over the M2? Will it really make a substantial difference?
 

koyoot

macrumors 603
Jun 5, 2012
5,939
1,853
Makes sense in which universe?????

That link dates from 2021. The '40' came from multiplying 10 by 4, not by 2. The M1 Max has 8 P + 2 E cores (roughly 10).
"Up to 4 dies" could mean this:
M3 - first die.
M3 Pro - second die
M3 Max - third die
M3 Ultra - fourth die.

So if the M3 Ultra really is the fourth die in this context, and it's supposed to have up to 40 cores, it may mean that everything Gurman leaked is just the base models, and the M3 Max has 20 CPU cores and 48 GPU cores.

That would be 100% more CPU cores and 50% more GPU cores than the M1 Max. Which is why there exists a universe in which it makes sense.
 

Basic75

macrumors 68020
May 17, 2011
2,101
2,447
Europe
Is that true? Again, I don't feel like I have good instincts for this, so I'm interested in more knowledgeable peoples' reasoning, but my naive take on this is that the caches aren't much more likely to be cold that way than if it all ran on the same core. Consider: the cache is pre-warmed for user process "X" in the sense that X is running, filling the cache with its used memory. Then X makes an OS call, and it sleeps until the OS returns. What factors can make the cache cold for X when the OS returns? Just, how much it is used by other processes while X is sleeping.
We're talking L1 and perhaps L2 cache here. Think of a simple read(2) or write(2) system call. For a small read, everything will be in the operating system core's L1 after it copies the data. When the application starts working on the data, it has to go down through L2 to the SLC and then back up through the L2 and L1 of the application's core. That sounds like a lot of overhead.
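
For concreteness, here's what that path looks like for an ordinary small read in the hypothetical remote-OS-core scheme being debated. The program itself is standard C; the comments describing where the bytes would sit are the speculative part:

Code:
/* Ordinary buffered read; the comments describe where the data would live in
 * the hypothetical "OS runs on a remote E-cluster" design discussed above. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[512];                        /* small read, fits easily in L1 */

    int fd = open("/etc/hosts", O_RDONLY);
    if (fd < 0)
        return 1;

    /* The copy out of the page cache would be done by the remote OS core,
     * so right after read() returns these 512 bytes would be warm in *that*
     * core's L1/L2, not in this core's. */
    ssize_t n = read(fd, buf, sizeof buf);

    /* The first touch here would have to pull the lines down through the
     * remote cluster's L2, across the SLC/fabric, and back up into the
     * local L1 before the application can use them. */
    long sum = 0;
    for (ssize_t i = 0; i < n; i++)
        sum += buf[i];

    printf("read %zd bytes, checksum %ld\n", n, sum);
    close(fd);
    return 0;
}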
So, how does that change if the OS is local, vs. remote (superSEP cluster)? If the OS is local, you get the fastest call time. But that doesn't tell you how fast X wakes up again. Perhaps some other processes get time first, in which case the cache is likely cold when you get it back. And the OS itself will dirty the cache, just servicing X's call. On the other hand, if the OS is remote, it takes slightly longer to get the call, but it runs potentially quicker (no contention with user processes), and it doesn't dirty X's cache with OS cache lines.
In the normal case a system call is performed in the context of the application thread making the call; there is no context switch, as that would be much too slow. And I don't see how the "remote OS core" would remove contention; it would just be somewhere else.
 

Basic75

macrumors 68020
May 17, 2011
2,101
2,447
Europe
Will the M3 be worth holding off for over the M2? Will it really make a substantial difference?
Impossible to know before it's announced and Apple shows its cards. And it'll likely depend on what you will be doing with your Mac.
 

Basic75

macrumors 68020
May 17, 2011
2,101
2,447
Europe
However, there's a pretty good chance that those 24 M3 P cores would do a better job than the projections for 32 M1-era P cores, especially if Apple attaches a better memory-hierarchy backhaul to them.
What options does Apple have on the memory side? Wider doesn't seem possible, that would be too expensive. AFAIK DDR6 isn't ready yet. Slightly higher clocks? HBM?
 

ArkSingularity

macrumors 6502a
Mar 5, 2022
928
1,130
I'm curious, does anyone know how the last level cache architecture works on the Ultra? I know that memory access latency is fairly high on Apple Silicon (if I'm not mistaken, they're literally encrypting RAM through the secure enclave, which would explain 300+ cycle latency). On the Max, this doesn't seem like it should be a huge issue since they have such a ludicrously large amount of cache, but are they combining the last-level cache on the ultra and treating it as one pool?

If they're still splitting these up into two separate pools, I'd assume that it might explain some of the scaling issues we're seeing in benchmarks (since cores would have to communicate through main memory rather than having access to a shared pool of cache). I've tried to do some research on this, but information has been a bit difficult to find.
 
  • Like
Reactions: Xiao_Xi and Basic75

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
System calls already switch to a different core?
Depends. The app thread will block when calling a system API. A different CPU core, or the same one, will service the request, depending on how the OS scheduler schedules the OS threads.
 

koyoot

macrumors 603
Jun 5, 2012
5,939
1,853
What options does Apple have on the memory side? Wider doesn't seem possible, that would be too expensive. AFAIK DDR6 isn't ready yet. Slightly higher clocks? HBM?
The 36 GB and 48 GB M3 Pro and M3 Max configurations that leaked suggest that Apple will indeed bump the memory bus width by 50%, to 192-, 384-, and 768-bit LPDDR5.
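
Back-of-the-envelope on what those widths would buy, assuming LPDDR5-6400 as on the M2 family (the speed grade is an assumption, not part of the leak):

Code:
/* Rough bandwidth math for the rumoured 50% wider buses, assuming
 * LPDDR5-6400 (6400 MT/s) as on the M2 family. Purely illustrative. */
#include <stdio.h>

int main(void) {
    const double mt_per_s = 6400e6;                          /* assumed transfer rate */
    const int widths[] = { 128, 192, 256, 384, 512, 768 };   /* bus widths in bits */

    for (int i = 0; i < 6; i++) {
        /* bytes per transfer (width/8) times transfers per second */
        double gbps = widths[i] / 8.0 * mt_per_s / 1e9;
        printf("%4d-bit bus -> %6.1f GB/s\n", widths[i], gbps);
    }
    return 0;
}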
 

leman

macrumors Core
Oct 14, 2008
19,521
19,673
I'm curious, does anyone know how the last level cache architecture works on the Ultra? I know that memory access latency is fairly high on Apple Silicon (if I'm not mistaken, they're literally encrypting RAM through the secure enclave, which would explain 300+ cycle latency). On the Max, this doesn't seem like it should be a huge issue since they have such a ludicrously large amount of cache, but are they combining the last-level cache on the ultra and treating it as one pool?

If they're still splitting these up into two separate pools, I'd assume that it might explain some of the scaling issues we're seeing in benchmarks (since cores would have to communicate through main memory rather than having access to a shared pool of cache). I've tried to do some research on this, but information has been a bit difficult to find.

Here is how I understand it works (and keep in mind that my understanding might be wrong): the last level cache on Apple Silicon is essentially part of the memory controller and only caches data that this controller has access to. When a data request is made, the hardware identifies the memory controller that the request goes to (based on the physical RAM address) and checks the cache block associated with that controller. That is, given a memory address you always know which cache to check; a given line cannot be in multiple caches simultaneously — this is very different from the CPU caches, where different CPUs can cache the same line. What this means is that data requests for memory blocks attached to the other die always have to go via UltraFusion. This will likely increase the latency, but the latency of communicating between different Apple Silicon clusters is already fairly high (130-150 ns between CPU clusters on the same die), so you likely won't notice it.

It's a very straightforward implementation that avoids complexity while still offering opportunities for scaling and optimisation. Some more recent Apple patents, for example, describe a power- and area-efficient internal bus design that drops address bits as the data packet gets closer to its destination. They also describe methods of moving data between memory controllers (memory folding) to bring it closer to its consumers, as well as to power down unused memory controllers.
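
If the interleave really is at cacheline (or small-block) granularity, the routing decision is just a bit-field extraction from the physical address. A sketch of that, where the granularity and the controller count are assumptions for illustration, not documented values:

Code:
/* Sketch of address-interleaved LLC/memory-controller routing: each physical
 * address maps to exactly one controller (and hence one cache block), so a
 * line can never live in two LLC slices at once. The interleave granularity
 * and controller count below are assumed, not documented. */
#include <stdint.h>
#include <stdio.h>

#define INTERLEAVE_SHIFT 7      /* assume 128-byte interleave granularity */
#define NUM_CONTROLLERS  16     /* assume 16 controllers across an Ultra */

static unsigned controller_for(uint64_t phys_addr) {
    return (unsigned)((phys_addr >> INTERLEAVE_SHIFT) % NUM_CONTROLLERS);
}

int main(void) {
    /* Consecutive 128-byte blocks land on consecutive controllers, so a
     * streaming access spreads across all LLC slices (and both dies). */
    for (uint64_t addr = 0; addr < 8 * 128; addr += 128)
        printf("phys 0x%04llx -> controller %u\n",
               (unsigned long long)addr, controller_for(addr));
    return 0;
}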
 

Basic75

macrumors 68020
May 17, 2011
2,101
2,447
Europe
Depends. The app thread will block when calling a system API. A different CPU core, or the same one, will service the request, depending on how the OS scheduler schedules the OS threads.
Do you have a source for system calls generally being executed in the context of a different thread? I'm having a hard time believing Darwin could do this without being dog slow.
 

Basic75

macrumors 68020
May 17, 2011
2,101
2,447
Europe
Here is how I understand it works (and keep in mind that my understanding might be wrong): ...
Do you also know how memory addresses are distributed across the different memory controllers? Interleaved with LLC cacheline granularity?
 