Do you also know how memory addresses are distributed across the different memory controllers? Interleaved with LLC cacheline granularity?
That I do not know. I am sure Maynard's document contains more info, I didn't get that far yet 😅
> That's what I was talking about in the message above. But... it really depends on what particular functions of the OS are involved. Or possibly not even that?

Your thoughts are a lot closer to reality than the people who reflexively criticize this idea. To take just one example, Apple doesn’t copy data through the CPU from IO to target, it uses DMA, targeting the appropriate cache of the target.
My first thought was that this would be an overwhelming issue. But on further thought I'm not so sure. I realized that I don't have a good feel for what the critical paths are in a modern OS (and more to the point, in the interactions of system calls). For example:
- memory allocation. Does this happen often enough *in the OS* to matter? (I'm talking about allocations to userspace processes, not kmalloc calls, slab and cache manipulation, etc., which are relevant purely to internal kernel operation.) That's a different question from, say, how often your code instantiates objects. Maybe it's not important because it's extremely rare. That seems likely to me.
- I/O. At first thought this is an immediate killer, but is it? If you're doing networking in user space and especially doing zero-copy, maybe you're *not* actually going to trash your L3/SLC. If you're doing scatter/gather disk I/O is most of it typically in cache? What about normal writes and reads? If it mostly involves data out of cache, then this might not matter at all - the small amount of data being passed between clusters would involve orders of magnitude less time than the actual I/O.
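For reference, here is a minimal sketch of what a scatter/gather read looks like from the application's side, assuming plain POSIX readv(2); the file path and buffer sizes are made up for illustration:

```c
/* Minimal scatter/gather read sketch: one readv(2) call fills several user
 * buffers; the kernel (or the device via DMA) does the placement, so the
 * calling core never has to touch the data just to "forward" it.
 * File name and sizes are illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void) {
    char hdr[512], body[4096];
    struct iovec iov[2] = {
        { .iov_base = hdr,  .iov_len = sizeof hdr  },
        { .iov_base = body, .iov_len = sizeof body },
    };

    int fd = open("/etc/hosts", O_RDONLY);   /* any readable file */
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n = readv(fd, iov, 2);            /* one syscall, two buffers */
    if (n < 0) { perror("readv"); close(fd); return 1; }

    printf("read %zd bytes across 2 buffers\n", n);
    close(fd);
    return 0;
}
```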
And since this is Apple's silicon team we're talking about, you can expect that they'll find some clever optimizations. Like, maybe, bypassing normal caching for data going between regular clusters and superSEP cluster. Maybe building a special queue adjacent to (in?) the NoC for such messages. Who knows what optimizations you could find for such a specific setup?
It occurs to me that there's one more giant hairball this sort of scheme would cough up, if it came to fruition. Virtualization would be difficult or impossible. You'd need to either lie about having a superSEP and then fake it, or you'd need a different build of the OS for "hardware" that doesn't have that feature (or a single OS build that understands both environments).
We're talking L1 and perhaps L2 cache here. Think of a simple read(2) or write(2) system call. For a small read everything will be in the operating system core's L1 after it copies the data. When the application starts working on the data it has to go through L2 to the SLC and then back up through the L2 and L1 of the application's core. That sounds like a lot of overhead.
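To make the scenario concrete, here is roughly the kind of call being described, as a minimal sketch; the path and buffer size are illustrative only:

```c
/* A small read(2): the kernel's copy into buf is what pulls the data into the
 * L1/L2 of whichever core executed the system call. If that core is not the
 * application's core, the lines then have to migrate back through the cache
 * hierarchy before the application can work on them. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[256];                          /* small read: fits easily in L1 */
    int fd = open("/etc/hosts", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n = read(fd, buf, sizeof buf);  /* kernel copies into buf */
    if (n > 0) {
        long sum = 0;
        for (ssize_t i = 0; i < n; i++)     /* "application starts working on the data" */
            sum += buf[i];
        printf("read %zd bytes, checksum %ld\n", n, sum);
    }
    close(fd);
    return 0;
}
```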
In the normal case a system call is performed in the context of the application thread making the call; there is no context switch, that would be much too slow. And I don't see how the "remote OS core" would remove contention, it would just be somewhere else.
> What options does Apple have on the memory side? Wider doesn't seem possible, that would be too expensive. AFAIK DDR6 isn't ready yet. Slightly higher clocks? HBM?

Faster LPDDR5X. The M2s use 6400 MT/s RAM. You can get substantially (not slightly) faster RAM now. It's more expensive, but may not be more expensive than the M2's RAM was when it was first shipped, I don't know.
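As a back-of-the-envelope check on what "substantially faster" buys, raw peak bandwidth is just bus width times transfer rate; treating 8533 MT/s as the top LPDDR5X speed grade is an assumption here, not a claim about any specific Apple part:

```c
/* Raw peak LPDDR bandwidth: (bus_width_bits / 8) bytes per transfer * MT/s.
 * 6400 MT/s is the M2 figure mentioned above; 8533 MT/s is assumed to be the
 * top LPDDR5X grade; 192-bit shows the effect of a wider bus at the same speed. */
#include <stdio.h>

static double gbs(int bus_bits, int mts) {
    return (double)bus_bits / 8.0 * mts / 1000.0;  /* GB/s */
}

int main(void) {
    printf("128-bit @ 6400 MT/s: %.1f GB/s\n", gbs(128, 6400));  /* ~102.4 */
    printf("128-bit @ 8533 MT/s: %.1f GB/s\n", gbs(128, 8533));  /* ~136.5 */
    printf("192-bit @ 6400 MT/s: %.1f GB/s\n", gbs(192, 6400));  /* ~153.6 */
    return 0;
}
```

(Apple's quoted figure for the base M2 is 100 GB/s, consistent with the 128-bit / 6400 MT/s line.)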
> 36 and 48 GB M3 Pro and M3 Max that leaked suggest that Apple will indeed bump the memory bus width by 50% to 192, 384 and 768 bit LPDDR5.

No, it doesn't. I thought so too before the M2 shipped with a 24GB option, but it turns out, no, you just have non-powers-of-two amounts of memory on each bus.
> Not when the I/O requests are being served from the buffer cache.

I believe it's still orders of magnitude, though maybe only two? That is, getting data out of RAM is still a lot slower than getting it out of any layer of CPU/chip cache.
> We're talking L1 and perhaps L2 cache here. Think of a simple read(2) or write(2) system call. For a small read everything will be in the operating system core's L1 after it copies the data. When the application starts working on the data it has to go through L2 to the SLC and then back up through the L2 and L1 of the application's core. That sounds like a lot of overhead.

I don't think that's the way it will usually work. The OS will normally use DMA for I/O. In that case, you're wrong because the data is never in the OS's cache lines. (Edit: In fact, as name99 reminds us above, the DMA will drop the data directly into the user process's cache.)
And in another post:

> In the normal case a system call is performed in the context of the application thread making the call; there is no context switch, that would be much too slow. And I don't see how the "remote OS core" would remove contention, it would just be somewhere else.
> Do you have a source for system calls generally being executed in the context of a different thread? I'm having a hard time believing Darwin could do this without being dog slow.

I think you are confused about how OSes (not just MacOS) work.
> (Speaking of which, a scheduler would be a fantastic candidate for running on a separate core. If you could guarantee that the scheduler and some other OS bits always ran on a separate core, that might simplify some tricky bits in the OS where you have to be careful about dead/livelocking. It might even be a good candidate for special hardware on that core. But it's been way too long since I poked at the guts of an OS for me to have any real feel for this.)

This is known as tickless CPU scheduling. It's something that Linux already supports (though most distributions haven't configured this to be enabled). I've heard that MacOS does this too, however I have never been able to confirm this with my own research, so I'm unsure if this is true.
> This is known as tickless CPU scheduling. It's something that Linux already supports (though most distributions haven't configured this to be enabled). I've heard that MacOS does this too, however I have never been able to confirm this with my own research, so I'm unsure if this is true.

No, it's not. Tickless is something else - and I suspect MacOS has it, since they're so concerned about power efficiency, but I don't actually know.
> *System* calls *must* operate in a "different thread" than the user processes that call them. This is fundamental to security and memory protection. This is different from *library* calls, which are code executing in the same context as the rest of the user process. (In Unix speak, that basically corresponds to the difference between sections 2 and 3 in the man pages.) Library calls will obviously run on the same core since they're in the same thread (unless/until the OS moves the thread, which is orthogonal to when such calls are made), whereas system calls will execute wherever they're scheduled by the OS scheduler.

Userland-to-kernel-space switches are very expensive.
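For readers following the man-page reference, a minimal illustration of the section 2 vs. section 3 distinction: write(2) traps into the kernel, while fputs(3)/fwrite(3) are ordinary library code that buffer inside the process:

```c
/* Section 2 vs. section 3 in code: write(2) is a system call that enters the
 * kernel; fputs(3) is library code that just fills a userspace stdio buffer
 * until it is flushed (at which point libc itself calls write(2)). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const char *msg = "hello via write(2)\n";
    (void)write(STDOUT_FILENO, msg, strlen(msg)); /* section 2: kernel entry */

    fputs("hello via fputs(3)\n", stdout);        /* section 3: buffered in the process... */
    fflush(stdout);                               /* ...until flushed, which issues write(2) */
    return 0;
}
```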
> Userland-to-kernel-space switches are very expensive.

FSVO very.
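To put a rough number on "very", here is a minimal timing sketch of a do-nothing system call (getppid exists on both Linux and macOS; results vary by machine, kernel and mitigations, and this measures only the trap itself, not copies or cache effects):

```c
/* Rough cost of entering and leaving the kernel: time a trivial system call
 * (getppid) in a loop. On current hardware this typically lands in the low
 * hundreds of nanoseconds per call - real overhead on a hot path, but nothing
 * like a context switch to another thread. Numbers will vary. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    enum { N = 1000000 };
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        (void)getppid();                 /* trivial syscall, result unused */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.0f ns per getppid() call\n", ns / N);
    return 0;
}
```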
> I was trying to make use of the hardware crypto engine to try speeding up OpenSSL/OpenVPN for my OpenWrt router and found out the hard way that any benefit obtained from the crypto engine was completely wiped out by the data transfers between userland memory space and kernel memory space and vice versa. The end result is slower compared to just using optimized software crypto code running in user space memory. So device DMA is the way to go.

There are other possibilities. Various types of specialized hardware assists, and shared pages, and probably other stuff I'm not thinking about. But yes, there's a reason DMA is ubiquitous now (it wasn't, in the early days of the PC).
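For context, this is roughly what driving a kernel crypto engine from userspace looks like via Linux's AF_ALG interface; the router setup above may well have used cryptodev or a vendor driver instead, so treat AF_ALG here as an illustrative assumption. The point is that every request sends the input into the kernel and reads the result back, which is exactly the copy-and-mode-switch overhead being described:

```c
/* Sketch of Linux's AF_ALG socket interface (one way userspace reaches the
 * kernel crypto API and any hardware engine behind it). The plaintext is
 * send()'t into the kernel and the digest is read() back, so every request
 * pays user<->kernel copies plus the mode switches around them. */
#include <linux/if_alg.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    struct sockaddr_alg sa = {
        .salg_family = AF_ALG,
        .salg_type   = "hash",
        .salg_name   = "sha256",
    };
    int tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
    if (tfm < 0 || bind(tfm, (struct sockaddr *)&sa, sizeof sa) < 0) {
        perror("AF_ALG"); return 1;
    }
    int op = accept(tfm, NULL, 0);                /* one fd per operation */
    if (op < 0) { perror("accept"); return 1; }

    const char msg[] = "packet payload";          /* illustrative data */
    unsigned char digest[32];
    send(op, msg, strlen(msg), 0);                /* copy user -> kernel */
    ssize_t n = read(op, digest, sizeof digest);  /* copy kernel -> user */

    for (ssize_t i = 0; i < n; i++) printf("%02x", digest[i]);
    printf("\n");
    close(op);
    close(tfm);
    return 0;
}
```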
> There are other possibilities. Various types of specialized hardware assists, and shared pages, and probably other stuff I'm not thinking about.

Yeah, I tried shared pages and most (I think) features of what the board and SoC provided that I know of. Context switches between user and kernel land are just expensive.
> Yeah, I tried shared pages and most (I think) features of what the board and SoC provided that I know of. Context switches between user and kernel land are just expensive.

What I meant was, other better hardware (such as Apple's SoCs) can implement features to reduce the cost of such switches, even though a lot is irreducible. SoCs on consumer routers are going to be cheap, with only enough performance to do their job.
| Memory Density | Memory Size Per Chip (x64) | 128-bit Memory Bus | 256-bit Memory Bus | 512-bit Memory Bus |
|---|---|---|---|---|
| Memory Chips Required | | 2 pcs | 4 pcs | 8 pcs |
| 32 Gb | 4 GB | 8 GB | 16 GB | 32 GB |
| 48 Gb | 6 GB | 12 GB | 24 GB | 48 GB |
| 64 Gb | 8 GB | 16 GB | 32 GB | 64 GB |
| 96 Gb | 12 GB | 24 GB | 48 GB | 96 GB |
| 128 Gb | 16 GB | 32 GB | 64 GB | 128 GB |
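How the rows above are generated, assuming x64 LPDDR packages as in the table header (pure arithmetic, not a claim about actual Apple SKUs):

```c
/* Each LPDDR package is x64, so a bus of N bits uses N/64 packages, and a
 * package of D Gbit holds D/8 GB. This just regenerates the table above. */
#include <stdio.h>

int main(void) {
    int densities_gbit[] = {32, 48, 64, 96, 128};
    int buses_bit[]      = {128, 256, 512};

    for (int i = 0; i < 5; i++) {
        for (int j = 0; j < 3; j++) {
            int chips    = buses_bit[j] / 64;
            int total_gb = chips * densities_gbit[i] / 8;
            printf("%3d Gbit x %d chips (%d-bit bus) = %3d GB\n",
                   densities_gbit[i], chips, buses_bit[j], total_gb);
        }
    }
    return 0;
}
```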
| Total Memory Size | M3 128-bit (2 Chips) | Size Up | M3 Pro 256-bit (4 Chips) | Size Up | M3 Max 512-bit (8 Chips) | Size Up |
|---|---|---|---|---|---|---|
| 12 GB | 4 GB + 8 GB OR 6 GB * 2 | Base | | | | |
| 24 GB | 12 GB * 2 | + 12 GB | 4 GB + 4 GB + 8 GB + 8 GB OR 6 GB * 4 | Base | | |
| 32 GB | 16 GB * 2 | + 8 GB | | | | |
| 36 GB | | | 8 GB + 8 GB + 8 GB + 12 GB | + 12 GB | | |
| 48 GB | | | 12 GB * 4 | + 12 GB | 4+4+4+4+8+8+8+8 OR 6 GB * 8 | Base |
| 64 GB | | | 16 GB * 4 | + 16 GB | 8 GB * 8 | + 16 GB |
| 96 GB | | | | | 12 GB * 8 | + 32 GB |
| 128 GB | | | | | 16 GB * 8 | + 32 GB |
> *System* calls *must* operate in a "different thread" than the user processes that call them.

Short answer, no.
> Short answer, no.

Ok, now you're getting very technical, which I was trying not to do. Mostly because I don't remember enough to talk about this without screwing up, if we get deep enough into the weeds. FWIW, if I just say "context switch" I'm not bothering with the distinction you're citing. They're both context switches, though not exactly the same, as you point out.
"System calls in most Unix-like systems are processed in kernel mode, which is accomplished by changing the processor execution mode to a more privileged one, but no process context switch is necessary – although a privilege context switch does occur. The hardware sees the world in terms of the execution mode according to the processor status register, and processes are an abstraction provided by the operating system. A system call does not generally require a context switch to another process; instead, it is processed in the context of whichever process invoked it."
> You might want to read the Linux kernel source code for details.

I'd rather bang my head against a wall than do that again...
It might be more complicated on Darwin with its monolithic-kernel-on-top-of-message-passing-microkernel approach.
> Except... since Meltdown, maybe it's again the case, sometimes? I think I remember TLB flushes being a big part of the mitigations.

Speaking of these security vulnerabilities, does everybody agree that the M3 and future Apple processors will not have SMT?
> Speaking of these security vulnerabilities, does everybody agree that the M3 and future Apple processors will not have SMT?

Since when has Apple had an SMT design to begin with? 😅
> Since when has Apple had an SMT design to begin with? 😅

Who said that they did?
> Who said that they did?

Probably he means that it's easier to say what Apple processors have than what they don't have (a shorter list).
> Speaking of these security vulnerabilities, does everybody agree that the M3 and future Apple processors will not have SMT?

Nobody, very soon, will use SMT in their architectures.