
leman

macrumors Core
Oct 14, 2008
19,521
19,673
Do you also know how memory addresses are distributed across the different memory controllers? Interleaved with LLC cacheline granularity?

That I do not know. I am sure Maynard's document contains more info, I didn't get that far yet 😅
 
  • Like
Reactions: Basic75

name99

macrumors 68020
Jun 21, 2004
2,410
2,313
That's what I was talking about in the message above. But... it really depends on what particular functions of the OS are involved. Or possibly not even that?

My first thought was that this would be an overwhelming issue. But on further thought I'm not so sure. I realized that I don't have a good feel for what the critical paths are in a modern OS (and more to the point, in the interactions of system calls). For example:
- memory allocation. Does this happen often enough *in the OS* to matter? (I'm talking about allocations to userspace processes, not kmalloc calls, slab and cache manipulation, etc. that's relevant purely to internal kernel function.) That's a different question from, say, how often does your code instantiate objects. Maybe it's not important because it's extremely rare. That seems likely to me.
- I/O. At first thought this is an immediate killer, but is it? If you're doing networking in user space and especially doing zero-copy, maybe you're *not* actually going to trash your L3/SLC. If you're doing scatter/gather disk I/O is most of it typically in cache? What about normal writes and reads? If it mostly involves data out of cache, then this might not matter at all - the small amount of data being passed between clusters would involve orders of magnitude less time than the actual I/O.

And since this is Apple's silicon team we're talking about, you can expect that they'll find some clever optimizations. Like, maybe, bypassing normal caching for data going between regular clusters and superSEP cluster. Maybe building a special queue adjacent to (in?) the NoC for such messages. Who knows what optimizations you could find for such a specific setup?

It occurs to me that there's one more giant hairball this sort of scheme would cough up, if it came to fruition. Virtualization would be difficult or impossible. You'd need to either lie about having a superSEP and then fake it, or you'd need a different build of the OS for "hardware" that doesn't have that feature (or a single OS build that understands both environments).
Your thoughts are a lot closer to reality than those of the people who reflexively criticize this idea. To take just one example, Apple doesn't copy data through the CPU from I/O to target; it uses DMA, targeting the appropriate cache of the target.

Virtualization would (at first) have to run the virtualized OS as it does today, though the hypervisor has more flexibility. As usual, Apple will lead the way and maybe, kicking and screaming at every step, the other OSes will follow ten years later. (cf Android now switching to 16kB pages — after 10 years of Linus telling us loudly why this would suck.)
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,313
We're talking L1 and perhaps L2 cache here. Think of a simple read(2) or write(2) system call. For a small read everything will be in the operating system core's L1 after it copies the data. When the application starts working on the data it has to go through L2 to the SLC and then back up through the L2 and L1 of the application's core. That sounds like a lot of overhead.

In the normal case a system call is performed on the context of the application thread making the call, there is no context switch, that would be much too slow. And I don't see how the "remote OS core" would remove contention, it would just be somewhere else.

Apple DMA can deliver data to a cache. And DMA is the intelligent way to move read()/write() data, not the CPU of the OS…
 

Confused-User

macrumors 6502a
Oct 14, 2014
852
986
What options does Apple have on the memory side? Wider doesn't seem possible, that would be too expensive. AFAIK DDR6 isn't ready yet. Slightly higher clocks? HBM?
Faster LPDDR5X. The M2s use 6400 MT/s RAM. You can get substantially (not slightly) faster RAM now. It's more expensive, but it may not be more expensive than the M2's RAM was when it first shipped; I don't know.

For the obscene prices Apple charges for more RAM, they'll still make huge gobs of money on it, so it's not likely an issue. Heat might matter more, but I think the LPDDR5X runs at lower voltages, too, which will counter that.
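To put rough numbers on "substantially faster": peak bandwidth is just transfer rate times bus width. A back-of-envelope sketch (my own arithmetic, not from any Apple spec; 8533 MT/s is the fastest LPDDR5X grade I'm aware of):

```c
/* Back-of-envelope peak DRAM bandwidth from transfer rate and bus width.
 * Illustrative only; the speed grades and bus widths are assumptions. */
#include <stdio.h>

static double peak_gb_per_s(double mt_per_s, int bus_bits) {
    /* MT/s * (bus bits / 8 bits-per-byte) = MB/s; divide by 1000 for GB/s */
    return mt_per_s * (bus_bits / 8.0) / 1000.0;
}

int main(void) {
    printf("LPDDR5-6400,  128-bit (M2-class): %.1f GB/s\n", peak_gb_per_s(6400, 128));
    printf("LPDDR5X-8533, 128-bit:            %.1f GB/s\n", peak_gb_per_s(8533, 128));
    printf("LPDDR5X-8533, 256-bit:            %.1f GB/s\n", peak_gb_per_s(8533, 256));
    return 0;
}
```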
 

Confused-User

macrumors 6502a
Oct 14, 2014
852
986
36 and 48 GB M3 Pro and M3 Max that leaked suggest that Apple will indeed bump the memory bus width by 50% to 192, 384 and 768 bit LPDDR5.
No, it doesn't. I thought so too before the M2 shipped with a 24GB option, but it turns out, no, you just have non-power-of-two amounts of memory on each bus.
 

Confused-User

macrumors 6502a
Oct 14, 2014
852
986
I wrote "the small amount of data being passed between clusters would involve orders of magnitude less time than the actual I/O." and then:
Not when the I/O requests are being served from the buffer cache.
I believe it's still orders of magnitude, though maybe only two? That is, getting data out of RAM is still a lot slower than getting it out of any layer of CPU/chip cache.

Your point still might have some weight (and perhaps more on the M chips which have relatively high latency). I think the only way to see if it actually mattered would be to try to implement it...
 

Confused-User

macrumors 6502a
Oct 14, 2014
852
986
We're talking L1 and perhaps L2 cache here. Think of a simple read(2) or write(2) system call. For a small read everything will be in the operating system core's L1 after it copies the data. When the application starts working on the data it has to go through L2 to the SLC and then back up through the L2 and L1 of the application's core. That sounds like a lot of overhead.
I don't think that's the way it will usually work. The OS will normally use DMA for I/O. In that case, you're wrong because the data is never in the OS's cache lines. (Edit: In fact, as name99 reminds us above, the DMA will drop the data directly into the user process' cache.)

That said, there is an advantage if the OS call executes on the same core, to the extent that arguments are passed directly in to the kernel and back out. But I doubt that that's significant compared to the run time of the call, in most cases. Certainly in the case of any I/O.

But more importantly:
In the normal case a system call is performed on the context of the application thread making the call, there is no context switch, that would be much too slow. And I don't see how the "remote OS core" would remove contention, it would just be somewhere else.
And in another post:
Do you have a source for system calls generally being executed in the context of a different thread? I'm having a hard time believing Darwin could do this without being dog slow.
I think you are confused about how OSes (not just MacOS) work.

*System* calls *must* operate in a "different thread" than the user processes that call them. This is fundamental to security and memory protection. This is different from *library* calls, which are code executing in the same context as the rest of the user process. (In Unix speak, that basically corresponds to the difference between sections 2 and 3 in the man pages.) Library calls will obviously run on the same core since they're in the same thread (unless/until the OS moves the thread, which is orthogonal to when such calls are made), whereas system calls will execute wherever they're scheduled by the OS scheduler.
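If a concrete example helps, here's a tiny, entirely generic illustration of that section 2 vs section 3 split (my own sketch, nothing Apple-specific):

```c
/* snprintf(3) is a library call: ordinary user-space code in the process.
 * write(2) is a system call: the CPU switches to kernel mode, runs the
 * kernel's handler, and returns. (Whether that counts as a "different
 * thread" is exactly what's being debated in this thread.) */
#include <stdio.h>      /* snprintf(3) - man section 3 */
#include <unistd.h>     /* write(2)    - man section 2 */

int main(void) {
    char buf[64];
    int len = snprintf(buf, sizeof(buf), "hello from user space\n"); /* no trap */
    write(STDOUT_FILENO, buf, (size_t)len);                          /* traps into the kernel */
    return 0;
}
```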

(Speaking of which, a scheduler would be a fantastic candidate for running on a separate core. If you could guarantee that the scheduler and some other OS bits always ran on a separate core, that might simplify some tricky bits in the OS where you have to be careful about dead/livelocking. It might even be a good candidate for special hardware on that core. But it's been way too long since I poked at the guts of an OS for me to have any real feel for this.)

This distinction doesn't exist just in MacOS. It's in basically every OS in use today, possibly aside from some very tiny RTOSes that link directly to their apps. The last significant OS that didn't do this was... MacOS 9, first released in late 1999. (I don't recall when Windows switched to this - it was certainly by XP. NT also used this model. Win 9x had partial memory protection, and I'm not sure how OS calls were handled.)
 
Last edited:
  • Like
Reactions: quarkysg

ArkSingularity

macrumors 6502a
Mar 5, 2022
928
1,130
(Speaking of which, a scheduler would be a fantastic candidate for running on a separate core. If you could guarantee that the scheduler and some other OS bits always ran on a separate core, that might simplify some tricky bits in the OS where you have to be careful about dead/livelocking. It might even be a good candidate for special hardware on that core. But it's been way too long since I poked at the guts of an OS for me to have any real feel for this.)
This is known as tickless CPU scheduling. It's something that Linux already supports (though most distributions haven't configured this to be enabled). I've heard that MacOS does this too, however I have never been able to confirm this with my own research, so I'm unsure if this is true.
 

Confused-User

macrumors 6502a
Oct 14, 2014
852
986
This is known as tickless CPU scheduling. It's something that Linux already supports (though most distributions haven't configured this to be enabled). I've heard that MacOS does this too, however I have never been able to confirm this with my own research, so I'm unsure if this is true.
No, it's not. Tickless is something else - and I suspect MacOS has it, since they're so concerned about power efficiency, but I don't actually know.

Tickless basically means you get rid of the kernel "tick", the periodic interrupt, and instead schedule each future kernel wakeup based on what's coming up. I first saw this when Linux added NO_HZ, and they had a good doc that explains it... ah, here it is: https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt - which looks like it's been updated since.
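For reference, the knobs that doc describes (these are the upstream Kconfig symbols as I understand them; check the doc itself for the fine print):

```
# Mutually exclusive tick modes (pick one at kernel build time):
#   CONFIG_HZ_PERIODIC=y  # classic behaviour: the tick fires on every CPU, always
#   CONFIG_NO_HZ_IDLE=y   # "tickless idle": stop the tick on idle CPUs (common default)
#   CONFIG_NO_HZ_FULL=y   # also stop the tick on busy CPUs running a single task
#
# NO_HZ_FULL additionally needs a boot parameter listing the tick-free CPUs, e.g.
#   nohz_full=1-7
```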

I was loosely speculating about deadlock and livelock issues. Sometimes the kernel and user processes contend for resources, which in the worst case can produce deadlocks or livelocks. A bunch of very tricky/subtle code tries to avoid these issues. I'm wondering: if you know for sure the kernel will never execute on the same core as any user process, can any of that code be avoided?
 

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
*System* calls *must* operate in a "different thread" than the user processes that call them. This is fundamental to security and memory protection. This is different from *library* calls, which are code executing in the same context as the rest of the user process. (In Unix speak, that basically corresponds to the difference between sections 2 and 3 in the man pages.) Library calls will obviously run on the same core since they're in the same thread (unless/until the OS moves the thread, which is orthogonal to when such calls are made), whereas system calls will execute wherever they're scheduled by the OS scheduler.
User-land to kernel-space switches are very expensive.

I was trying to make use of the hardware crypto engine to speed up OpenSSL/OpenVPN on my OpenWrt router and found out the hard way that any benefit obtained from the crypto engine was completely wiped out by the data transfers between user-land memory and kernel memory and vice versa. The end result was slower than just using optimized software crypto code running in user-space memory.

So device DMA is the way to go.
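To illustrate where the cost comes from: on Linux, one common way user space reaches an in-kernel crypto driver is the AF_ALG socket interface (shown below purely as an example; not necessarily the exact interface I was using). Every request copies the input into the kernel and the result back out, and for small buffers those copies plus the syscalls dominate.

```c
/* Illustrative only: hashing via Linux AF_ALG. Each request pays a
 * user->kernel copy on send() and a kernel->user copy on read(), which is
 * the overhead described above. Error handling omitted. */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_alg.h>

int main(void) {
    struct sockaddr_alg sa = {
        .salg_family = AF_ALG,
        .salg_type   = "hash",
        .salg_name   = "sha256",
    };
    int tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
    bind(tfm, (struct sockaddr *)&sa, sizeof(sa));
    int op = accept(tfm, NULL, NULL);

    const char msg[] = "packet payload";
    unsigned char digest[32];
    send(op, msg, sizeof(msg) - 1, 0);   /* copy: user space -> kernel */
    read(op, digest, sizeof(digest));    /* copy: kernel -> user space */

    for (size_t i = 0; i < sizeof(digest); i++)
        printf("%02x", digest[i]);
    printf("\n");
    close(op);
    close(tfm);
    return 0;
}
```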
 

Confused-User

macrumors 6502a
Oct 14, 2014
852
986
User-land to kernel-space switches are very expensive.
FSVO very.
I was trying to make use of the hardware crypto engine to speed up OpenSSL/OpenVPN on my OpenWrt router and found out the hard way that any benefit obtained from the crypto engine was completely wiped out by the data transfers between user-land memory and kernel memory and vice versa. The end result was slower than just using optimized software crypto code running in user-space memory.

So device DMA is the way to go.
There are other possibilities. Various types of specialized hardware assists, and shared pages, and probably other stuff I'm not thinking about. But yes, there's a reason DMA is ubiquitous now (it wasn't, in the early days of the PC).
 
  • Like
Reactions: quarkysg

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
There are other possibilities. Various types of specialized hardware assists, and shared pages, and probably other stuff I'm not thinking about.
Yeah, I tried shared pages and most (I think) of the features the board and SoC provide that I know of. Context switches between user and kernel land are just expensive.

Anyway, apologies for getting off-track.

Now back to our usual programming ...
 

Confused-User

macrumors 6502a
Oct 14, 2014
852
986
Yeah, I tried shared pages and most (I think) of the features the board and SoC provide that I know of. Context switches between user and kernel land are just expensive.
What I meant was, other better hardware (such as Apple's SoCs) can implement features to reduce the cost of such switches, even though a lot is irreducible. SoCs on consumer routers are going to be cheap, with only enough performance to do their job.
 

Basic75

macrumors 68020
May 17, 2011
2,101
2,447
Europe
@Confused-User @name99 If the operating system satisfies uncached read(2) requests from applications by performing a DMA straight to application memory and/or the processor cache, how does this data end up in the buffer cache to satisfy future requests for the same data from memory? Or doesn't it? That sounds weird.
 

TigeRick

macrumors regular
Oct 20, 2012
144
153
Malaysia
While waiting for the M3 family to arrive, there is some discussion about how Apple is going to implement the memory bus design. Below is the standard config with different memory bus widths. LPDDR5 and LPDDR5X share similar memory specifications and configurations. So far Apple has only validated the 96Gb memory density for the M2 family with LPDDR5. That's why the maximum memory size for the MBA is 24GB; the MBP with M2 Pro doesn't get that density, because Apple wants you to upgrade to the M2 Max if you want more memory.


Memory Density | GB per Chip (x64) | 128-bit Bus (2 chips) | 256-bit Bus (4 chips) | 512-bit Bus (8 chips)
32Gb           | 4 GB              | 8 GB                  | 16 GB                 | 32 GB
48Gb           | 6 GB              | 12 GB                 | 24 GB                 | 48 GB
64Gb           | 8 GB              | 16 GB                 | 32 GB                 | 64 GB
96Gb           | 12 GB             | 24 GB                 | 48 GB                 | 96 GB
128Gb          | 16 GB             | 32 GB                 | 64 GB                 | 128 GB
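The arithmetic behind the table, for anyone who wants to check it (my own sketch): each chip presents a x64 interface, so the bus width fixes the chip count, and a die of D gigabits holds D/8 gigabytes.

```c
/* Reproduces the capacity table above: chips = bus_bits / 64,
 * GB per chip = density_gbit / 8, total = chips * GB per chip. */
#include <stdio.h>

int main(void) {
    const int densities_gbit[] = { 32, 48, 64, 96, 128 };
    const int bus_bits[]       = { 128, 256, 512 };

    for (int i = 0; i < 5; i++) {
        int per_chip_gb = densities_gbit[i] / 8;
        printf("%3dGb (%2d GB per x64 chip):", densities_gbit[i], per_chip_gb);
        for (int j = 0; j < 3; j++)
            printf("  %3d-bit = %3d GB", bus_bits[j], (bus_bits[j] / 64) * per_chip_gb);
        printf("\n");
    }
    return 0;
}
```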

128Gb memory density would be the next upgrade for the M3 family, with the M3 Max most likely supporting up to 128GB. But where does the rumored 12GB fit in? Some users thought Apple would increase the memory bus to accommodate the extra 4GB of memory. I don't think so; I believe Apple is currently testing mixtures of different memory sizes on the same memory bus. Below is my thinking on how Apple might implement it:

Total Memory | M3, 128-bit (2 chips) | Size up | M3 Pro, 256-bit (4 chips) | Size up | M3 Max, 512-bit (8 chips)      | Size up
12 GB        | 4+8 GB or 6 GB x 2    | base    | -                         |         | -                              |
24 GB        | 12 GB x 2             | +12 GB  | 4+4+8+8 GB or 6 GB x 4    | base    | -                              |
32 GB        | 16 GB x 2             | +8 GB   | -                         |         | -                              |
36 GB        | -                     |         | 8+8+8+12 GB               | +12 GB  | -                              |
48 GB        | -                     |         | 12 GB x 4                 | +12 GB  | 4+4+4+4+8+8+8+8 GB or 6 GB x 8 | base
64 GB        | -                     |         | 16 GB x 4                 | +16 GB  | 8 GB x 8                       | +16 GB
96 GB        | -                     |         | -                         |         | 12 GB x 8                      | +32 GB
128 GB       | -                     |         | -                         |         | 16 GB x 8                      | +32 GB

  • Apple will most likely increase the base memory size from 8GB to 12GB for the M3, 16GB to 24GB for the M3 Pro, and 32GB to 48GB for the M3 Max.
  • As for 32GB support on the M3, Apple might not validate it because the increment over 24GB is less than 12GB. That's also why I suspect the M3 only supports LPDDR5, as the rumored specs are similar to the M2's. The M3 SoC would then be available in 12GB and 24GB memory configs only. Now, anyone want to guess how much Apple is going to charge for the extra 12GB memory upgrade? :p
  • The M3 Pro most likely supports 24GB and 36GB memory sizes. If you want more memory, then the M3 Max with 48GB is your only choice, very much the Apple way :rolleyes:
  • We should know whether I am correct when Apple announces the first M3 Max in two months' time 😅
 
Last edited:

Basic75

macrumors 68020
May 17, 2011
2,101
2,447
Europe
*System* calls *must* operate in a "different thread" than the user processes that call them.
Short answer, no.

"System calls in most Unix-like systems are processed in kernel mode, which is accomplished by changing the processor execution mode to a more privileged one, but no process context switch is necessary – although a privilege context switch does occur. The hardware sees the world in terms of the execution mode according to the processor status register, and processes are an abstraction provided by the operating system. A system call does not generally require a context switch to another process; instead, it is processed in the context of whichever process invoked it."

You might want to read the Linux kernel source code for details.

It might be more complicated on Darwin with its monolithic-kernel-on-top-of-message-passing-microkernel approach.
 

Confused-User

macrumors 6502a
Oct 14, 2014
852
986
It's been a LONG time since I looked at this stuff. So I may be misremembering entirely.

You're asking about disk I/O now. What I think I recall is that DMA will put the data into the buffer cache first, operating in page-sized chunks. Then the data is copied from the cache to userspace; the userspace target address is typically not page-aligned and not sized in pages. You can also map files into memory (see the mmap() man page); when you do that, things might be different.
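A rough sketch of the two paths I mean (illustrative Unix-ish C, not how any particular kernel implements it):

```c
/* read(2): the kernel fills its page/buffer cache (by DMA or a cache hit),
 * then copies into the caller's buffer, which need not be page-aligned.
 * mmap(2): the same cached pages are mapped into the process, so the copy
 * is replaced by page-table setup plus faults on first touch. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("/etc/hosts", O_RDONLY);   /* any convenient file */
    struct stat st;
    fstat(fd, &st);

    char buf[256];
    ssize_t n = read(fd, buf, sizeof(buf));  /* path 1: copy out of the cache */

    char *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)   /* path 2: touch the mapped pages */
        sum += p[i];

    printf("read %zd bytes; checksum over mmap: %ld\n", n, sum);
    munmap(p, (size_t)st.st_size);
    close(fd);
    return 0;
}
```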

There are surely clever OS and chip optimizations for this, possibly targeting specific scenarios. I vaguely recall something about delivering data direct to cache lines in the M series (in fact it was probably in name99's PDF, and he referenced it above somewhere). Someone else who's more recently been arms-deep in a kernel can surely give you a better answer.

Note that different kernels will have very different answers. Linux, Windows, MacOS, as well as various vendor Unixes (Solaris et al) may all do things differently, though I expect the basics are going to be pretty similar, at least for most of them.

The performance problem that's bugging you has more generally been bugging everyone for a long time. See for example https://en.wikipedia.org/wiki/Zero-copy
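The classic example from that article is Linux's sendfile(2): the kernel moves file data to a socket without it ever passing through a user-space buffer. A minimal sketch (assumes sockfd is an already-connected socket; error handling mostly omitted):

```c
/* Zero-copy file-to-socket transfer on Linux. The kernel reads from the
 * page cache and writes to the socket directly; contrast with a
 * read()+write() loop, which copies through user space in both directions. */
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

ssize_t send_whole_file(int sockfd, const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    struct stat st;
    fstat(fd, &st);
    off_t off = 0;
    ssize_t sent = sendfile(sockfd, fd, &off, (size_t)st.st_size);
    close(fd);
    return sent;
}
```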
 
  • Like
Reactions: Basic75

Confused-User

macrumors 6502a
Oct 14, 2014
852
986
Short answer, no.

"System calls in most Unix-like systems are processed in kernel mode, which is accomplished by changing the processor execution mode to a more privileged one, but no process context switch is necessary – although a privilege context switch does occur. The hardware sees the world in terms of the execution mode according to the processor status register, and processes are an abstraction provided by the operating system. A system call does not generally require a context switch to another process; instead, it is processed in the context of whichever process invoked it."
Ok, now you're getting very technical, which I was trying not to do. Mostly because I don't remember enough to talk about this without screwing up, if we get deep enough into the weeds. FWIW, if I just say "context switch" I'm not bothering with the distinction you're citing. They're both context switches, though not exactly the same, as you point out.

The cost of those switches has changed over time. In the early days, TLBs didn't have tags, and a switch to kernel mode would flush the TLB, which is pretty expensive. Now that's no longer the case. Except... since Meltdown, maybe it's again the case, sometimes? I think I remember TLB flushes being a big part of the mitigations. But I'm not up to date on this. (And yes, this will vary between processor architectures. The fundamentals should be close to the same, though.)

Anyway, system calls execute in the kernel address space. That's enough for me to say they have their own context.
You might want to read the Linux kernel source code for details.

It might be more complicated on Darwin with its monolithic-kernel-on-top-of-message-passing-microkernel approach.
I'd rather bang my head against a wall than do that again...
 
  • Haha
Reactions: Basic75

Basic75

macrumors 68020
May 17, 2011
2,101
2,447
Europe
Except... since Meltdown, maybe it's again the case, sometimes? I think I remember TLB flushes being a big part of the mitigations.
Speaking of these security vulnerabilities, does everybody agree that the M3 and future Apple processors will not have SMT?
 

Confused-User

macrumors 6502a
Oct 14, 2014
852
986
We won't know about SMT on M3 until it ships but I would be *astonished* if they did that. There is *every* indication that they won't.
 