> Assuming these RAM options?
> M3 - 12GB / 24GB / 36GB
> M3 Pro - 24GB / 36GB / 48GB
> M3 Max - 48GB / 96GB / 144GB
> M3 Ultra - 48GB / 96GB / 144GB / 288GB

What I suggested in 2021:
Future Apple Silicon Macs Will Reportedly Use 3nm Chips With Up to 40 Cores
The Information's Wayne Ma today shared alleged details about future Apple silicon chips that will succeed the first-generation M1, M1 Pro, and... (www.macrumors.com)
M3 Max up to 20 cores, M3 Ultra up to 40 CPU cores?
It makes sense.
> The quad was tried and failed to be a marketable solution. There are multiple indications that it is not coming any time soon.

You lose all credibility by making strong assertions about something NONE of us know anything about.
Makes sense in which universe?????
Even more warped if core count were the sole objective: 2 P clusters + 3 E clusters = 8P + 12E ... going down the same rabbit hole Intel is falling into. And it gets even goofier when packaged as an Ultra, with a gross number of E cores that likely have few tasks to do for many users in that workload space. I really doubt Apple is myopically fixated on specific core-count numbers (or on trying to win 'count wars' versus Intel/AMD).
This is a fascinating idea, though so far I'm not convinced (which is not to say that I won't be in the future). Suppose now that Apple restructured Darwin so that the absolutely most critical elements ran on separate cores (my guess is 2 E-cores would be enough), isolated from the rest of the chip in the same way that the SEP is isolated. The isolation can't be quite as total, because the OS has to write a lot of RAM and has to send many messages (interrupts, TLB invalidations, and so on) to other cores, but it can be substantial.
At least initially, much of the OS, in the form of daemons, User Space Networking, and so on, would still run on normal cores; just a minimal OS would run on the secure E-cores, though obviously, as it became feasible, more and more functionality would be moved to those cores.
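To make the idea concrete, here is a minimal sketch of how application cores might hand requests to such an isolated "OS core" through shared memory: a single-producer/single-consumer ring buffer per application core, drained by the minimal OS on the secure E-cores. This is purely hypothetical; none of these names or structures are Darwin's, and a real design would use doorbells/IPIs rather than polling.

```c
/* Hypothetical sketch of the "isolated OS core" idea: each application
 * core owns a single-producer/single-consumer ring in shared memory, and
 * the minimal OS running on the secure E-cores drains it.  None of this
 * is Darwin API; every name here is invented for illustration. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 64u            /* power of two, so masking works */

typedef struct {
    uint32_t opcode;              /* hypothetical OP_READ, OP_WRITE, ... */
    uint64_t args[4];             /* fd, buffer address, length, ... */
} request_t;

typedef struct {
    _Atomic uint32_t head;        /* next slot the app core fills */
    _Atomic uint32_t tail;        /* next slot the OS core drains */
    request_t slots[RING_SLOTS];
} ring_t;

/* Application side: runs on a normal P/E core. */
static bool submit(ring_t *r, const request_t *req)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return false;                       /* ring full; caller retries */
    r->slots[head & (RING_SLOTS - 1)] = *req;
    /* Release ordering publishes the slot before the new head is seen. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    /* A real design would ring a doorbell / send an IPI here, not poll. */
    return true;
}

/* OS side: runs only on the isolated E-cores. */
static void service_one(ring_t *r)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail == head)
        return;                             /* nothing pending */
    request_t req = r->slots[tail & (RING_SLOTS - 1)];
    /* ... dispatch on req.opcode and do the work ... */
    (void)req;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
}
```

Note what this buys and costs: every request and every byte of result data now crosses the fabric between clusters, which is exactly the cache traffic debated below.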
> But the payoff is large enough to justify the effort.

The payoff would be that macOS becomes significantly slower than it already is. Think of what you are doing to the caches if system calls or messages to the operating system switch to a different processor core in a different cluster.
> The payoff would be that macOS becomes significantly slower than it already is. Think of what you are doing to the caches if system calls or messages to the operating system switch to a different processor core in a different cluster.

That's what I was talking about in the message above. But... it really depends on what particular functions of the OS are involved. Or possibly not even that?
> The payoff would be that macOS becomes significantly slower than it already is. Think of what you are doing to the caches if system calls or messages to the operating system switch to a different processor core in a different cluster.

Actually, what is the difference compared to the current situation? From my limited understanding, more cache will be required if the addressable memory gets larger, to mitigate cache thrashing. More cores do not require more cache, other than their local L1/L2.
> From my limited understanding, more cache will be required if the addressable memory gets larger, to mitigate cache thrashing.

Required cache size and organisation depend on working sets and memory access patterns, not on the size of addressable memory. My point was that switching to a different core in a different cluster lets you start with cold L1 and L2 caches up to two times, once when switching there and once when switching back. Cache size has nothing to do with it.
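For a rough sense of scale, here is a back-of-envelope estimate of what re-warming a cold L1D after a cross-cluster migration might cost. All numbers are assumptions for illustration, not measurements:

```c
/* Back-of-envelope cost of re-warming a cold L1D after migrating clusters.
 * All numbers are rough assumptions for illustration, not measurements. */
#include <stdio.h>

int main(void)
{
    const double l1_bytes   = 128 * 1024; /* assumed L1D size per core    */
    const double line_bytes = 128;        /* Apple uses 128-byte lines    */
    const double miss_ns    = 20.0;       /* assumed refill latency/line  */
    const double mlp        = 8.0;        /* assumed overlapping misses   */

    double lines   = l1_bytes / line_bytes;            /* ~1024 misses    */
    double cost_us = lines * miss_ns / mlp / 1000.0;   /* rough refill    */
    printf("~%.0f misses, ~%.1f us to re-warm L1D\n", lines, cost_us);
    return 0;
}
```

Under these assumptions, that is a few microseconds per migration, which only matters if it happens on every system call rather than occasionally.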
> Required cache size and organisation depend on working sets and memory access patterns, not on the size of addressable memory. My point was that switching to a different core in a different cluster lets you start with cold L1 and L2 caches up to two times, once when switching there and once when switching back. Cache size has nothing to do with it.

But this is already happening with existing macOS.
> Required cache size and organisation depend on working sets and memory access patterns, not on the size of addressable memory. My point was that switching to a different core in a different cluster lets you start with cold L1 and L2 caches up to two times, once when switching there and once when switching back. Cache size has nothing to do with it.

Is that true? Again, I don't feel like I have good instincts for this, so I'm interested in more knowledgeable people's reasoning, but my naive take is that the caches aren't much more likely to be cold that way than if it all ran on the same core. Consider: the cache is pre-warmed for user process "X" in the sense that X is running, filling the cache with its used memory. Then X makes an OS call, and it sleeps until the OS returns. What factors can make the cache cold for X when the OS returns? Just how much it is used by other processes while X is sleeping.
> But this is already happening with existing macOS.

System calls already switch to a different core?
"Up to 4 dies" could mean this:Makes sense in which universe?????
That link comes from 2021. The '40' came from multiplying 10 by 4, not 2: the M1 Max has 8 P + 2 E cores (kind of 10).
> Is that true? Again, I don't feel like I have good instincts for this, so I'm interested in more knowledgeable people's reasoning, but my naive take is that the caches aren't much more likely to be cold that way than if it all ran on the same core. Consider: the cache is pre-warmed for user process "X" in the sense that X is running, filling the cache with its used memory. Then X makes an OS call, and it sleeps until the OS returns. What factors can make the cache cold for X when the OS returns? Just how much it is used by other processes while X is sleeping.

We're talking L1 and perhaps L2 cache here. Think of a simple read(2) or write(2) system call. For a small read, everything will be in the operating system core's L1 after it copies the data. When the application starts working on the data, it has to go through L2 to SLC and then back up through the L2 and L1 of the application's core. That sounds like a lot of overhead.
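To ground the example, here is the call in question as code (a sketch; error handling elided). The comments mark where the data would live under the hypothetical split:

```c
/* A small read(2): the kernel copies data into `buf` while running, in
 * today's design, on the same core as the calling thread.  Under the
 * hypothetical "OS on a remote cluster" design, that copy would land in
 * the remote core's L1/L2, and the application would then pull each line
 * back across the fabric via the SLC.  Sketch only; error checks elided. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    int fd = open("/etc/hosts", O_RDONLY);
    ssize_t n = read(fd, buf, sizeof buf);   /* kernel fills buf */

    long sum = 0;
    for (ssize_t i = 0; i < n; i++)          /* app touches every byte:  */
        sum += buf[i];                       /* cheap only if the lines  */
                                             /* are warm in local L1/L2  */
    printf("%zd bytes, checksum %ld\n", n, sum);
    close(fd);
    return 0;
}
```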
> So, how does that change if the OS is local, vs. remote (superSEP cluster)? If the OS is local, you get the fastest call time. But that doesn't tell you how fast X wakes up again. Perhaps some other processes get time first, in which case the cache is likely cold when you get it back. And the OS itself will dirty the cache, just servicing X's call. On the other hand, if the OS is remote, it takes slightly longer to get the call, but it runs potentially quicker (no contention with user processes), and it doesn't dirty X's cache with OS cache lines.

In the normal case a system call is performed in the context of the application thread making the call; there is no context switch, which would be much too slow. And I don't see how the "remote OS core" would remove contention; it would just be somewhere else.
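This is easy to sanity-check: timing a trivial system call in a tight loop typically shows per-call costs far too small to hide a context switch on every call. A sketch (getppid(2) chosen as a call that should genuinely enter the kernel):

```c
/* Timing a trivial system call in a loop.  If every call implied a
 * context switch or a cross-cluster hop, the per-call cost would be far
 * higher than what this typically reports.  Sketch only. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const long iters = 1000000;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        (void)getppid();                 /* a real kernel round trip */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per getppid()\n", ns / iters);
    return 0;
}
```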
> Will M3 be worth holding off for over M2? Will it really be a substantial difference?

Impossible to know before it's announced and Apple shows its cards. And it'll likely depend on what you will be doing with your Mac.
> However, pretty good chance that those 24 M3 P cores do a better job than those projections of what 32 M1-era P cores did. Especially if they are attaching a better memory hierarchy backhaul to them.

What options does Apple have on the memory side? Wider doesn't seem possible, that would be too expensive. AFAIK DDR6 isn't ready yet. Slightly higher clocks? HBM?
> The small amount of data being passed between clusters would involve orders of magnitude less time than the actual I/O.

Not when the I/O requests are being served from the buffer cache.
> System calls already switch to a different core?

Depends. The app thread will block when calling a system API. A different or the same CPU core will service requests, depending on how the OS scheduler schedules the OS threads.
> What options does Apple have on the memory side? Wider doesn't seem possible, that would be too expensive. AFAIK DDR6 isn't ready yet. Slightly higher clocks? HBM?

The 36 GB and 48 GB M3 Pro and M3 Max configurations that leaked suggest that Apple will indeed bump the memory bus width by 50%, to 192-, 384- and 768-bit LPDDR5.
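The bandwidth arithmetic, for scale (the 6400 MT/s LPDDR5 rate matches the M2 generation; whether M3 keeps it is an assumption):

```c
/* Peak-bandwidth arithmetic for the rumoured 50% wider buses.  The
 * 6400 MT/s LPDDR5 rate matches the M2 generation; whether M3 keeps
 * that rate is an assumption. */
#include <stdio.h>

int main(void)
{
    const double mts = 6400e6;                    /* transfers/s per pin */
    const int widths[] = { 128, 192, 256, 384, 512, 768 };

    for (int i = 0; i < 6; i++) {
        double gbs = widths[i] / 8.0 * mts / 1e9; /* bytes/s -> GB/s */
        printf("%3d-bit bus: %6.1f GB/s\n", widths[i], gbs);
    }
    return 0;
}
```

The leaked capacities also fall out naturally from the wider buses: 36 GB could be, for example, three 12 GB packages on a 192-bit bus.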
I'm curious, does anyone know how the last-level cache architecture works on the Ultra? I know that memory access latency is fairly high on Apple Silicon (if I'm not mistaken, they're literally encrypting RAM through the Secure Enclave, which would explain 300+ cycle latency). On the Max this doesn't seem like it should be a huge issue, since they have such a ludicrously large amount of cache, but are they combining the last-level cache on the Ultra and treating it as one pool?
If they're still splitting these up into two separate pools, I'd assume that it might explain some of the scaling issues we're seeing in benchmarks (since cores would have to communicate through main memory rather than having access to a shared pool of cache). I've tried to do some research on this, but information has been a bit difficult to find.
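One way to probe this from userspace is a core-to-core "ping-pong" latency test: two threads bouncing a flag through one shared cache line. On an Ultra, thread pairs the scheduler happens to place on different dies should show visibly longer round trips than same-die pairs. Note that macOS has no public API to pin threads to specific cores, so placement is the scheduler's choice. A sketch:

```c
/* Core-to-core "ping-pong" latency probe: two threads bounce a flag
 * through a shared cache line.  On an Ultra, pairs of cores on
 * different dies should show a higher round-trip time than pairs on
 * the same die.  macOS exposes no public core-pinning API, so which
 * cores get sampled is up to the scheduler.  Sketch only. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000

static _Atomic int flag = 0;  /* the cache line being bounced */

static void *pong(void *arg)
{
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
            ;                                   /* wait for ping */
        atomic_store_explicit(&flag, 0, memory_order_release);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, pong, NULL);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {
        atomic_store_explicit(&flag, 1, memory_order_release);
        while (atomic_load_explicit(&flag, memory_order_acquire) != 0)
            ;                                   /* wait for pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per round trip\n", ns / ROUNDS);
    return 0;
}
```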
> Depends. The app thread will block when calling a system API. A different or the same CPU core will service requests, depending on how the OS scheduler schedules the OS threads.

Do you have a source for system calls generally being executed in the context of a different thread? I'm having a hard time believing Darwin could do this without being dog slow.
> Here is how I understand it works (and keep in mind that my understanding might be wrong): ...

Do you also know how memory addresses are distributed across the different memory controllers? Interleaved with LLC cacheline granularity?
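As an illustration of what "interleaved with LLC cacheline granularity" would mean, here is a toy mapping from physical address to controller. This is purely hypothetical: Apple does not document its hash, and real interleave functions often fold higher address bits in (e.g. via XOR) so that power-of-two strides don't all land on one channel:

```c
/* Purely illustrative: mapping a physical address to one of N memory
 * controllers with cache-line-granularity interleave.  Apple does not
 * document its actual hash, and the controller count here is a
 * hypothetical example. */
#include <stdint.h>
#include <stdio.h>

#define LINE_SHIFT   7   /* 128-byte cache lines on Apple Silicon */
#define CONTROLLERS  8   /* hypothetical count on a Max-class die */

static unsigned controller_for(uint64_t paddr)
{
    uint64_t line = paddr >> LINE_SHIFT;         /* drop offset in line */
    return (unsigned)(line % CONTROLLERS);       /* round-robin by line */
}

int main(void)
{
    /* Consecutive lines spread across all controllers in turn. */
    for (uint64_t a = 0; a < 8 * 128; a += 128)
        printf("paddr 0x%04llx -> controller %u\n",
               (unsigned long long)a, controller_for(a));
    return 0;
}
```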