Do you also know how memory addresses are distributed across the different memory controllers? Interleaved with LLC cacheline granularity?
That I do not know. I am sure Maynard's document contains more info, I didn't get that far yet 😅
> That's what I was talking about in the message above. But... it really depends on what particular functions of the OS are involved. Or possibly not even that?

Your thoughts are a lot closer to reality than the people who reflexively criticize this idea. To take just one example, Apple doesn’t copy data through the CPU from IO to target, it uses DMA, targeting the appropriate cache of the target.
My first thought was that this would be an overwhelming issue. But on further thought I'm not so sure. I realized that I don't have a good feel for what the critical paths are in a modern OS (and more to the point, in the interactions of system calls). For example:
- memory allocation. Does this happen often enough *in the OS* to matter? (I'm talking about allocations to userspace processes, not kmalloc calls, slab and cache manipulation, etc., which are relevant purely to internal kernel operation.) That's a different question from, say, how often your code instantiates objects. Maybe it's not important because it's extremely rare. That seems likely to me.
- I/O. At first thought this is an immediate killer, but is it? If you're doing networking in user space and especially doing zero-copy, maybe you're *not* actually going to trash your L3/SLC. If you're doing scatter/gather disk I/O is most of it typically in cache? What about normal writes and reads? If it mostly involves data out of cache, then this might not matter at all - the small amount of data being passed between clusters would involve orders of magnitude less time than the actual I/O.
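For reference, here is a minimal sketch of what a scatter/gather read looks like from the application's side, assuming plain POSIX readv(2); the file path and buffer sizes are made up for illustration:

```c
/* Minimal scatter/gather read sketch: one readv(2) call fills several user
 * buffers; the kernel (or the device via DMA) does the placement, so the
 * calling core never has to touch the data just to "forward" it.
 * File name and sizes are illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void) {
    char hdr[512], body[4096];
    struct iovec iov[2] = {
        { .iov_base = hdr,  .iov_len = sizeof hdr  },
        { .iov_base = body, .iov_len = sizeof body },
    };

    int fd = open("/etc/hosts", O_RDONLY);   /* any readable file */
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n = readv(fd, iov, 2);            /* one syscall, two buffers */
    if (n < 0) { perror("readv"); close(fd); return 1; }

    printf("read %zd bytes across 2 buffers\n", n);
    close(fd);
    return 0;
}
```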
And since this is Apple's silicon team we're talking about, you can expect that they'll find some clever optimizations. Like, maybe, bypassing normal caching for data going between regular clusters and superSEP cluster. Maybe building a special queue adjacent to (in?) the NoC for such messages. Who knows what optimizations you could find for such a specific setup?
It occurs to me that there's one more giant hairball this sort of scheme would cough up, if it came to fruition. Virtualization would be difficult or impossible. You'd need to either lie about having a superSEP and then fake it, or you'd need a different build of the OS for "hardware" that doesn't have that feature (or a single OS build that understands both environments).
We're talking L1 and perhaps L2 cache here. Think of a simple read(2) or write(2) system call. For a small read everything will be in the operating system core's L1 after it copies the data. When the application starts working on the data it has to go through L2 to the SLC and then back up through the L2 and L1 of the application's core. That sounds like a lot of overhead.
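To make the scenario concrete, here is roughly the kind of call being described, as a minimal sketch; the path and buffer size are illustrative only:

```c
/* A small read(2): the kernel's copy into buf is what pulls the data into the
 * L1/L2 of whichever core executed the system call. If that core is not the
 * application's core, the lines then have to migrate back through the cache
 * hierarchy before the application can work on them. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[256];                          /* small read: fits easily in L1 */
    int fd = open("/etc/hosts", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n = read(fd, buf, sizeof buf);  /* kernel copies into buf */
    if (n > 0) {
        long sum = 0;
        for (ssize_t i = 0; i < n; i++)     /* "application starts working on the data" */
            sum += buf[i];
        printf("read %zd bytes, checksum %ld\n", n, sum);
    }
    close(fd);
    return 0;
}
```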
In the normal case a system call is performed in the context of the application thread making the call; there is no context switch, that would be much too slow. And I don't see how the "remote OS core" would remove contention, it would just be somewhere else.
> What options does Apple have on the memory side? Wider doesn't seem possible, that would be too expensive. AFAIK DDR6 isn't ready yet. Slightly higher clocks? HBM?

Faster LPDDR5X. The M2s use 6400 MT/s RAM. You can get substantially (not slightly) faster RAM now. It's more expensive, but may not be more expensive than the M2's RAM was when it was first shipped, I don't know.
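As a back-of-the-envelope check on what "substantially faster" buys, raw peak bandwidth is just bus width times transfer rate; treating 8533 MT/s as the top LPDDR5X speed grade is an assumption here, not a claim about any specific Apple part:

```c
/* Raw peak LPDDR bandwidth: (bus_width_bits / 8) bytes per transfer * MT/s.
 * 6400 MT/s is the M2 figure mentioned above; 8533 MT/s is assumed to be the
 * top LPDDR5X grade; 192-bit shows the effect of a wider bus at the same speed. */
#include <stdio.h>

static double gbs(int bus_bits, int mts) {
    return (double)bus_bits / 8.0 * mts / 1000.0;  /* GB/s */
}

int main(void) {
    printf("128-bit @ 6400 MT/s: %.1f GB/s\n", gbs(128, 6400));  /* ~102.4 */
    printf("128-bit @ 8533 MT/s: %.1f GB/s\n", gbs(128, 8533));  /* ~136.5 */
    printf("192-bit @ 6400 MT/s: %.1f GB/s\n", gbs(192, 6400));  /* ~153.6 */
    return 0;
}
```

(Apple's quoted figure for the base M2 is 100 GB/s, consistent with the 128-bit / 6400 MT/s line.)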
> 36 and 48 GB M3 Pro and M3 Max that leaked suggest that Apple will indeed bump the memory bus width by 50% to 192, 384 and 768 bit LPDDR5.

No, it doesn't. I thought so too before the M2 shipped with a 24GB option, but it turns out, no, you just have non-powers-of-two amounts of memory on each bus.
> Not when the I/O requests are being served from the buffer cache.

I believe it's still orders of magnitude, though maybe only two? That is, getting data out of RAM is still a lot slower than getting it out of any layer of CPU/chip cache.
> We're talking L1 and perhaps L2 cache here. Think of a simple read(2) or write(2) system call. For a small read everything will be in the operating system core's L1 after it copies the data. When the application starts working on the data it has to go through L2 to the SLC and then back up through the L2 and L1 of the application's core. That sounds like a lot of overhead.

I don't think that's the way it will usually work. The OS will normally use DMA for I/O. In that case, you're wrong because the data is never in the OS's cache lines. (Edit: In fact, as name99 reminds us above, the DMA will drop the data directly into the user process's cache.)
And in another post:

> In the normal case a system call is performed in the context of the application thread making the call; there is no context switch, that would be much too slow. And I don't see how the "remote OS core" would remove contention, it would just be somewhere else.
> Do you have a source for system calls generally being executed in the context of a different thread? I'm having a hard time believing Darwin could do this without being dog slow.

I think you are confused about how OSes (not just MacOS) work.
> (Speaking of which, a scheduler would be a fantastic candidate for running on a separate core. If you could guarantee that the scheduler and some other OS bits always ran on a separate core, that might simplify some tricky bits in the OS where you have to be careful about dead/livelocking. It might even be a good candidate for special hardware on that core. But it's been way too long since I poked at the guts of an OS for me to have any real feel for this.)

This is known as tickless CPU scheduling. It's something that Linux already supports (though most distributions haven't configured this to be enabled). I've heard that MacOS does this too, however I have never been able to confirm this with my own research, so I'm unsure if this is true.
> This is known as tickless CPU scheduling. It's something that Linux already supports (though most distributions haven't configured this to be enabled). I've heard that MacOS does this too, however I have never been able to confirm this with my own research, so I'm unsure if this is true.

No, it's not. Tickless is something else - and I suspect MacOS has it, since they're so concerned about power efficiency, but I don't actually know.
> *System* calls *must* operate in a "different thread" than the user processes that call them. This is fundamental to security and memory protection. This is different from *library* calls, which are code executing in the same context as the rest of the user process. (In Unix speak, that basically corresponds to the difference between sections 2 and 3 in the man pages.) Library calls will obviously run on the same core since they're in the same thread (unless/until the OS moves the thread, which is orthogonal to when such calls are made), whereas system calls will execute wherever they're scheduled by the OS scheduler.

Userland-to-kernel-space switches are very expensive.
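For readers following the man-page reference, a minimal illustration of the section 2 vs. section 3 distinction: write(2) traps into the kernel, while fputs(3)/fwrite(3) are ordinary library code that buffer inside the process:

```c
/* Section 2 vs. section 3 in code: write(2) is a system call that enters the
 * kernel; fputs(3) is library code that just fills a userspace stdio buffer
 * until it is flushed (at which point libc itself calls write(2)). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const char *msg = "hello via write(2)\n";
    (void)write(STDOUT_FILENO, msg, strlen(msg)); /* section 2: kernel entry */

    fputs("hello via fputs(3)\n", stdout);        /* section 3: buffered in the process... */
    fflush(stdout);                               /* ...until flushed, which issues write(2) */
    return 0;
}
```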
> Userland-to-kernel-space switches are very expensive.

FSVO very.
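To put a rough number on "very", here is a minimal timing sketch of a do-nothing system call (getppid exists on both Linux and macOS; results vary by machine, kernel and mitigations, and this measures only the trap itself, not copies or cache effects):

```c
/* Rough cost of entering and leaving the kernel: time a trivial system call
 * (getppid) in a loop. On current hardware this typically lands in the low
 * hundreds of nanoseconds per call - real overhead on a hot path, but nothing
 * like a context switch to another thread. Numbers will vary. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    enum { N = 1000000 };
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        (void)getppid();                 /* trivial syscall, result unused */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.0f ns per getppid() call\n", ns / N);
    return 0;
}
```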
> I was trying to make use of the hardware crypto engine to try speeding up OpenSSL/OpenVPN for my OpenWrt router and found out the hard way that any benefit obtained from the crypto engine was completely wiped out by the data transfers between userland memory space and kernel memory space and vice versa. The end result is slower compared to just using optimized software crypto code running in user space memory. So device DMA is the way to go.

There are other possibilities. Various types of specialized hardware assists, and shared pages, and probably other stuff I'm not thinking about. But yes, there's a reason DMA is ubiquitous now (it wasn't, in the early days of the PC).
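For context, this is roughly what driving a kernel crypto engine from userspace looks like via Linux's AF_ALG interface; the router setup above may well have used cryptodev or a vendor driver instead, so treat AF_ALG here as an illustrative assumption. The point is that every request sends the input into the kernel and reads the result back, which is exactly the copy-and-mode-switch overhead being described:

```c
/* Sketch of Linux's AF_ALG socket interface (one way userspace reaches the
 * kernel crypto API and any hardware engine behind it). The plaintext is
 * send()'t into the kernel and the digest is read() back, so every request
 * pays user<->kernel copies plus the mode switches around them. */
#include <linux/if_alg.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    struct sockaddr_alg sa = {
        .salg_family = AF_ALG,
        .salg_type   = "hash",
        .salg_name   = "sha256",
    };
    int tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
    if (tfm < 0 || bind(tfm, (struct sockaddr *)&sa, sizeof sa) < 0) {
        perror("AF_ALG"); return 1;
    }
    int op = accept(tfm, NULL, 0);                /* one fd per operation */
    if (op < 0) { perror("accept"); return 1; }

    const char msg[] = "packet payload";          /* illustrative data */
    unsigned char digest[32];
    send(op, msg, strlen(msg), 0);                /* copy user -> kernel */
    ssize_t n = read(op, digest, sizeof digest);  /* copy kernel -> user */

    for (ssize_t i = 0; i < n; i++) printf("%02x", digest[i]);
    printf("\n");
    close(op);
    close(tfm);
    return 0;
}
```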
> There are other possibilities. Various types of specialized hardware assists, and shared pages, and probably other stuff I'm not thinking about.

Yeah, I tried shared pages and most (I think) features of what the board and SoC provided that I know of. Context switches between user and kernel land are just expensive.
> Yeah, I tried shared pages and most (I think) features of what the board and SoC provided that I know of. Context switches between user and kernel land are just expensive.

What I meant was, other better hardware (such as Apple's SoCs) can implement features to reduce the cost of such switches, even though a lot is irreducible. SoCs on consumer routers are going to be cheap, with only enough performance to do their job.
| Memory Density | Memory Size Per Chip (x64) | 128-bit Memory Bus | 256-bit Memory Bus | 512-bit Memory Bus |
|---|---|---|---|---|
| Memory Chips Required | | 2 pcs | 4 pcs | 8 pcs |
| 32 Gb | 4 GB | 8 GB | 16 GB | 32 GB |
| 48 Gb | 6 GB | 12 GB | 24 GB | 48 GB |
| 64 Gb | 8 GB | 16 GB | 32 GB | 64 GB |
| 96 Gb | 12 GB | 24 GB | 48 GB | 96 GB |
| 128 Gb | 16 GB | 32 GB | 64 GB | 128 GB |
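How the rows above are generated, assuming x64 LPDDR packages as in the table header (pure arithmetic, not a claim about actual Apple SKUs):

```c
/* Each LPDDR package is x64, so a bus of N bits uses N/64 packages, and a
 * package of D Gbit holds D/8 GB. This just regenerates the table above. */
#include <stdio.h>

int main(void) {
    int densities_gbit[] = {32, 48, 64, 96, 128};
    int buses_bit[]      = {128, 256, 512};

    for (int i = 0; i < 5; i++) {
        for (int j = 0; j < 3; j++) {
            int chips    = buses_bit[j] / 64;
            int total_gb = chips * densities_gbit[i] / 8;
            printf("%3d Gbit x %d chips (%d-bit bus) = %3d GB\n",
                   densities_gbit[i], chips, buses_bit[j], total_gb);
        }
    }
    return 0;
}
```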
| Total Memory Size | M3 128-bit (2 Chips) | Size Up | M3 Pro 256-bit (4 Chips) | Size Up | M3 Max 512-bit (8 Chips) | Size Up |
|---|---|---|---|---|---|---|
| 12 GB | 4 GB + 8 GB OR 6 GB * 2 | Base | | | | |
| 24 GB | 12 GB * 2 | + 12 GB | 4 GB + 4 GB + 8 GB + 8 GB OR 6 GB * 4 | Base | | |
| 32 GB | 16 GB * 2 | + 8 GB | | | | |
| 36 GB | | | 8 GB + 8 GB + 8 GB + 12 GB | + 12 GB | | |
| 48 GB | | | 12 GB * 4 | + 12 GB | 4+4+4+4+8+8+8+8 OR 6 GB * 8 | Base |
| 64 GB | | | 16 GB * 4 | + 16 GB | 8 GB * 8 | + 16 GB |
| 96 GB | | | | | 12 GB * 8 | + 32 GB |
| 128 GB | | | | | 16 GB * 8 | + 32 GB |
> *System* calls *must* operate in a "different thread" than the user processes that call them.

Short answer, no.
> Short answer, no.

Ok, now you're getting very technical, which I was trying not to do. Mostly because I don't remember enough to talk about this without screwing up, if we get deep enough into the weeds. FWIW, if I just say "context switch" I'm not bothering with the distinction you're citing. They're both context switches, though not exactly the same, as you point out.
"System calls in most Unix-like systems are processed in kernel mode, which is accomplished by changing the processor execution mode to a more privileged one, but no process context switch is necessary – although a privilege context switch does occur. The hardware sees the world in terms of the execution mode according to the processor status register, and processes are an abstraction provided by the operating system. A system call does not generally require a context switch to another process; instead, it is processed in the context of whichever process invoked it."
> You might want to read the Linux kernel source code for details.

I'd rather bang my head against a wall than do that again...
It might be more complicated on Darwin with its monolithic-kernel-on-top-of-message-passing-microkernel approach.
> Except... since Meltdown, maybe it's again the case, sometimes? I think I remember TLB flushes being a big part of the mitigations.

Speaking of these security vulnerabilities, does everybody agree that the M3 and future Apple processors will not have SMT?
> Speaking of these security vulnerabilities, does everybody agree that the M3 and future Apple processors will not have SMT?

Since when has Apple had an SMT design to begin with? 😅
> Since when has Apple had an SMT design to begin with? 😅

Who said that they did?
> Who said that they did?

Probably he means that it's easier to say what Apple processors have than what they don't have (a shorter list).
> Speaking of these security vulnerabilities, does everybody agree that the M3 and future Apple processors will not have SMT?

Nobody, very soon, will use SMT in their architectures.