Is the M1 GPU sharing LPDDR4 RAM going to be enough?

leman · Sep 26, 2021

poorcody said:
If you compare the 3090 to the A6000, they basically have similar performance, but the 3090 uses GDDR6X, and the A6000 uses straight GDDR6. The "X" version is faster, but overheats easily. The "non-X" RAM does not see similar problems, even with much simpler coolers.

Modern GPUs live and die by memory bandwidth (especially true for Nvidia, AMD has advanced cache technology to help them out). If you are having thousands of processing units in your cluster, you‘d better have the ability to fetch and store the data quickly enough to keep them occupied. GDDR is very fast, but it pays for it by running hot. HBM needs less power for thevsamw bandwidth, but it’s too expensive for a mainstream gaming card.

poorcody said:
I only mention this because the RAM Apple is using for the M1's GPU might be limited by heat issues more than anything else.

Nah, Apple uses custom versions of RAM already designed for low-power operation. The M1 RAM is drawing less than 2 watts of power under load (and most of the time, less than half a watt), which is crazy low. I am sure their prosumer chips will use similarly efficient RAM tech. Of course, the fact that Apple has huge caches (their latest phone chip has more last level GPU cache than the entirety of a 3090) + bandwidth saving rendering technology helps a lot.

Why don’t GPUs use similar RAM I hear you asking? Because using low-power RAM with very wide memory bus would would be crazy expensive and make a very large board. One could ask, wait, can’t one do something smart to cut those drawbacks down? One can indeed! That’s what HBM is

ChrisA · Sep 26, 2021

leman said:
They kind of already do. From what we know, the SSD is connected directly to the M1 internal bus and the on-chip controller emulates NVMe protocol to communicate with the OS. The next logical step is to drop NVMe altogether and just go full custom, potentially exposing the SSD storage as byte-addressable physical RAM in a common address space. This would allow the kernel to map SSD storage directly, eliminating the need for any logical layer and dramatically improving latency. But I have no idea about these things and I don't know whether there are any special requirements that would make such direct mapping approach non-viable.

Yes. It is about time to revisit Multics.

Just over 50 years ago, in 1969, The OS called "Multics" was released. It was revolutionary, but one feature really was brilliant. It used a unified RAM and storage. Just like the above quote, the OS did not make a distinction between the files on the disk and RAM. It was all just byte addressable random access storage. This was done in 1969 so the idea is not new.

Later a couple of engineers at AT&T wrote a smaller and more simple OS, and they wanted a name. So they used a word-play on "Multics" and called their new OS "Unix". This was their way of saying that their OS was simpler and maybe more elegant.

Later still Steve Jobs needed an OS for his new computer he called "Next" so they used the Unix kernel and then wrote a graphical interface over it. Jobs moved back to Apple and took his OS with him and re-named it "macOS". But really macOS is Unix which is a re-written and streamlined Multics.

Would it not be great if macOS regained the best idea from Multics, unified RAM and storage.

BTW Multics is now an open source project at https://www.multicians.org/ It is nor suitable for practical use but it exists for study.

mr_roboto · Sep 27, 2021

leman said:
They kind of already do. From what we know, the SSD is connected directly to the M1 internal bus and the on-chip controller emulates NVMe protocol to communicate with the OS. The next logical step is to drop NVMe altogether and just go full custom, potentially exposing the SSD storage as byte-addressable physical RAM in a common address space. This would allow the kernel to map SSD storage directly, eliminating the need for any logical layer and dramatically improving latency. But I have no idea about these things and I don't know whether there are any special requirements that would make such direct mapping approach non-viable.

Saw this because ChrisA quoted it, and here's the reason it's not viable: NAND flash memory isn't byte-addressable. To read a single byte, you must read the entire page containing that byte, and a page is generally at least 2112 bytes (intended to be used as 2048 bytes data + 64 bytes extra for ECC data or other overhead). Additionally, page read latency is quite high compared to memory technology like DRAM. Things are even worse for writes, where there's a larger erase block size which means you really cannot write to random pages as doing so implies erasing a bunch of other pages first. So long as NAND flash is the memory technology of choice for high capacity SSDs, byte addressability is impossible.

(Intel/Micron 3D Xpoint, aka Optane memory, is byte addressable nonvolatile memory. It has random performance close to DRAM, but lower density and thus higher cost per bit than NAND flash. You've probably heard of Optane SSDs, but Intel also has support for Optane DIMMs in some Xeon processors, which is the kind of interface you need if you want to have nonvolatile storage exposed directly to the CPU. However, the cost has made it a non-starter for consumer systems, even in SSD form.)

Another point about Apple Silicon SSDs - There is no emulation of NVMe protocol going on. It's just a NVMe SSD with some proprietary Apple extensions to NVMe designed to offload FileVault encrypt/decrypt to the SSD controller.

ChrisA said:
Yes. It is about time to revisit Multics.

Just over 50 years ago, in 1969, The OS called "Multics" was released. It was revolutionary, but one feature really was brilliant. It used a unified RAM and storage. Just like the above quote, the OS did not make a distinction between the files on the disk and RAM. It was all just byte addressable random access storage. This was done in 1969 so the idea is not new.

Multics had a normal file system. (though when I say "normal", I should mention that it actually pioneered a lot of the file system attributes we now regard as "normal".)

It didn't unify RAM and storage in any way we cannot do today, right now, on macOS or Linux. What I think you're interpreting as unifying RAM and storage was just that the standard API for interacting with files on Multics was the equivalent of mmap() on UNIX-descended operating systems like macOS.

There has been operating systems research around the concept of truly globally persistent RAM, but as far as I know nobody has ever built anything like a general purpose desktop computer based on it.

leman · Sep 27, 2021

mr_roboto said:
Saw this because ChrisA quoted it, and here's the reason it's not viable: NAND flash memory isn't byte-addressable. To read a single byte, you must read the entire page containing that byte, and a page is generally at least 2112 bytes (intended to be used as 2048 bytes data + 64 bytes extra for ECC data or other overhead). Additionally, page read latency is quite high compared to memory technology like DRAM. Things are even worse for writes, where there's a larger erase block size which means you really cannot write to random pages as doing so implies erasing a bunch of other pages first. So long as NAND flash is the memory technology of choice for high capacity SSDs, byte addressability is impossible.

Thanks for the clarification! I suppose that byte-addressability can be "faked" with a sufficiently smart caching controller that does coalescing. At the same time, if one wants the best performance, endurance and reliability, the OS should stay aware of the implementation details.

mr_roboto said:
Another point about Apple Silicon SSDs - There is no emulation of NVMe protocol going on. It's just a NVMe SSD with some proprietary Apple extensions to NVMe designed to offload FileVault encrypt/decrypt to the SSD controller.

You are right again, sorry, I misremembered. It's PCIe that they are emulating, not NVMe. It's just that the NVMe protocol in M1 is non-standard (different queue format etc.)

jdb8167 · Sep 28, 2021

mr_roboto said:
There has been operating systems research around the concept of truly globally persistent RAM, but as far as I know nobody has ever built anything like a general purpose desktop computer based on it.

HP did a large amount of research on this and was seemingly about to release a system designed around it when HP management pulled the plug. I'm not sure if it didn't work as designed or there was some other issue (like may Intel didn't like the research.)

Krevnik · Sep 28, 2021

leman said:
You are right again, sorry, I misremembered. It's PCIe that they are emulating, not NVMe. It's just that the NVMe protocol in M1 is non-standard (different queue format etc.)

While we are getting into the weeds, even that seems a bit imprecise.

On macOS, Apple isn't emulating a PCIe bus or device to make this work, and instead using an ANS-specific NVMe driver, which reports that the physical interconnect is "Apple Fabric".

At least according to Correlium's driver, they suggest that PCIe is used behind the ANS coprocessor, while presenting an NVMe BAR as MMIO to the CPU. But the Linux driver does make the SSD appear as a PCIe device on a virtual bus.

I'm not convinced there's a PCIe bus behind the ANS just to provide a link between the ANS and SSD controller, but since the ANS seems to be part of the secure enclave, who knows. At the end of the day, it probably doesn't matter a whole lot what is back there unless you are researching the enclave and how it functions wrt/SSD security.

/*
* Quick explanation of this driver:
*
* M1 contains a coprocessor, called ANS, that fronts the actual PCIe
* bus to the NVMe device. This coprocessor does not, unfortunately,
* expose a PCIe like interface. But it does have a MMIO range that is
* a mostly normal NVMe BAR, with a few quirks (handled in NVMe code).
*
* So, to reduce code duplication in NVMe code (to add a non-PCI backend)
* we add a synthetic PCI bus for this device.
*/

mr_roboto · Sep 28, 2021

Krevnik said:
At least according to Correlium's driver, they suggest that PCIe is used behind the ANS coprocessor, while presenting an NVMe BAR as MMIO to the CPU. But the Linux driver does make the SSD appear as a PCIe device on a virtual bus.

I'm not convinced there's a PCIe bus behind the ANS just to provide a link between the ANS and SSD controller, but since the ANS seems to be part of the secure enclave, who knows.

I'm not convinced by the comment either, it sounds like it was written by a software person who didn't quite connect the dots and realize the reason why they had to "add a synthetic PCI bus" is that there's no PCIe there.

For some reason, a lot of people seem to be convinced that NVMe is tied to PCIe in some way, so if they see a NVMe device they think it must also be a PCIe endpoint. But the NVMe spec supports multiple transports, not just PCIe, so it's totally a legitimate thing to just have a NVMe device appear as MMIO registers somewhere in a SoC's address space.

Krevnik · Sep 28, 2021

mr_roboto said:
I'm not convinced by the comment either, it sounds like it was written by a software person who didn't quite connect the dots and realize the reason why they had to "add a synthetic PCI bus" is that there's no PCIe there.

I’m willing to give them more credit then that. They do outline why they added a virtual PCIe device in the comment, which is for code sharing with the Linux NVMe PCIe drivers (as Linux doesn’t seem to support non-PCIe NVMe devices).

mr_roboto said:
For some reason, a lot of people seem to be convinced that NVMe is tied to PCIe in some way, so if they see a NVMe device they think it must also be a PCIe endpoint. But the NVMe spec supports multiple transports, not just PCIe, so it's totally a legitimate thing to just have a NVMe device appear as MMIO registers somewhere in a SoC's address space.

It’s legitimate, but it’s also the case that Apple’s design uses the secure enclave as a proxy to handle encryption so the kernel doesn’t have access to the encryption keys (on paper anyhow). The enclave is the one hooked up to the SSD controller, which then re-exposes that to the fabric interconnect. How the enclave talks to (or integrates) the SSD controller is not documented anywhere that I’m aware of, so it could be PCIe, but I don’t know why it would need to be, or how this dev would know for sure, so I assume the comment about the ANS using PCIe to talk to the SSD controller itself is a guess.

It’s also rare to not be tied to PCIe. To the point that it was easier to piggy back on the PCIe NVMe drivers in the Linux kernel. And to the point that it’s not far fetched to guess that PCIe is involved between the ANS and the SSD controller. Just not likely in the case of an Apple SoC IMO, with Apple’s history of extensive custom silicon on the SoC.

mr_roboto · Sep 29, 2021

Krevnik said:
It’s legitimate, but it’s also the case that Apple’s design uses the secure enclave as a proxy to handle encryption so the kernel doesn’t have access to the encryption keys (on paper anyhow). The enclave is the one hooked up to the SSD controller, which then re-exposes that to the fabric interconnect. How the enclave talks to (or integrates) the SSD controller is not documented anywhere that I’m aware of, so it could be PCIe, but I don’t know why it would need to be, or how this dev would know for sure, so I assume the comment about the ANS using PCIe to talk to the SSD controller itself is a guess.

Not only does it not need to be PCIe, there's good reasons for it not to be.

PCIe is designed to connect chips separated by many inches of PCB trace with a connector somewhere in the middle. That requirement drives a lot of complexity. When you need two things on the same chip to talk to each other, PCIe is the wrong tool, because it has too much unnecessary overhead: performance, power, die area, design effort, you name it.

Because PCIe is a layered protocol (think: the OSI 7-layer network model), you can strip away pairs of layers on each end of the link when they're not required. But any SoC designer going through this exercise would end up removing everything and just letting the two ends talk through an interconnect designed for internal SoC use.

Krevnik said:
It’s also rare to not be tied to PCIe. To the point that it was easier to piggy back on the PCIe NVMe drivers in the Linux kernel. And to the point that it’s not far fetched to guess that PCIe is involved between the ANS and the SSD controller. Just not likely in the case of an Apple SoC IMO, with Apple’s history of extensive custom silicon on the SoC.

I haven't looked to be sure, but the thing they had to fake out was almost certainly just init code. The driver's runtime code shouldn't have any hard PCIe dependency; it's just going to be interacting with memory-mapped IO registers and memory-resident command/completion queues.

altaic · Oct 18, 2021

400GB/s LPDDR5!!!

BenRacicot · Oct 18, 2021

altaic said:
400GB/s LPDDR5!!!

WOOOOOOOOO!!!!!! CHAMPAGNE TONIGHT MAN!

leman · Oct 18, 2021

Yeah, I really didn’t see that coming. So much for „bandwidth starved chips“

altaic · Oct 18, 2021

Yeah, what a beast.

Homy · Oct 18, 2021

MBP 14” comes with 14, 16, 24 or 32 GPU cores. 16” comes with 16, 24 or 32 cores. The memory bandwidth seems to be between Radeon RX 6600 and RTX 3070.

M1 Pro 200 GB/s
M1 Max 400 GB/s

RX 6600 224 GB/s
RX 6600 XT 256 GB/s
RTX 3060 360 GB/s
RX 6700 384 GB/s
RX 6700 XT 384 GB/s
RTX 3060 Ti 448 GB/s
RTX 3070 448 GB/s
RX 6800 512 GB/s
RX 6800 XT 512 GB/s
RX 6900 XT 512 GB/s
RTX 3070 Ti 608,3 GB/s
RTX 3080 760.3 GB/s
RTX 3080 Ti (20 GB) 760.3 GB/s
RTX 3080 Ti (12 GB) 912.4 GB/s
RTX 3090 936.2 GB/s
RTX 3090 Ti 1018 GB/s

BenRacicot · Oct 18, 2021

Yeah I wonder what model the comparison of "the best gaming laptop they could find" is.

Boil · Oct 18, 2021

Homy said:
MBP 14” comes with 14, 16, 24 or 32 GPU cores. 16” comes with 16, 24 or 32 cores. The performance seems to be between Radeon RX 6600 and RTX 3070.

M1 Pro 200 GB/s
M1 Max 400 GB/s

RX 6600 224 GB/s
RX 6600 XT 256 GB/s
RTX 3060 360 GB/s
RX 6700 384 GB/s
RX 6700 XT 384 GB/s
RTX 3060 Ti 448 GB/s
RTX 3070 448 GB/s
RX 6800 512 GB/s
RX 6800 XT 512 GB/s
RX 6900 XT 512 GB/s
RTX 3070 Ti 608,3 GB/s
RTX 3080 760.3 GB/s
RTX 3080 Ti (20 GB) 760.3 GB/s
RTX 3080 Ti (12 GB) 912.4 GB/s
RTX 3090 936.2 GB/s
RTX 3090 Ti 1018 GB/s

Memory bandwidth, while a good thing, is not an accurate measure of overall GPU performance...

Homy · Oct 18, 2021

Boil said:
Memory bandwidth, while a good thing, is not an accurate measure of overall GPU performance...

of course not but it's all we have for now and it says something at least.

diamond.g · Oct 18, 2021

BenRacicot said:
Yeah I wonder what model the comparison of "the best gaming laptop they could find" is.

It said on the slide but it flashed by really quick.

EDIT: Found it

https://live.arstechnica.com/apple-october-18-unleashed-event/images/Screen%20Shot%202021-10-18%20at%2012.23.12%20PM.jpg

https://live.arstechnica.com/apple-october-18-unleashed-event/images/Screen%20Shot%202021-10-18%20at%2012.22.21%20PM.jpg

diamond.g · Oct 18, 2021

MSI GE76 Raider 11UH-053 17.3" FHD Intel Core i9 Gaming Laptop - MSI-US Official Store

MSI GE76 Raider 17.3" 360Hz Gaming Laptop Intel Core i9-11980HK RTX3080 32GB 1TB NVMe SSD Win10

us-store.msi.com

This is the system they said was the fastest Laptop you could buy.

leman · Oct 18, 2021

Homy said:
MBP 14” comes with 14, 16, 24 or 32 GPU cores. 16” comes with 16, 24 or 32 cores. The memory bandwidth seems to be between Radeon RX 6600 and RTX 3070.

M1 Pro 200 GB/s
M1 Max 400 GB/s

RX 6600 224 GB/s
RX 6600 XT 256 GB/s
RTX 3060 360 GB/s
RX 6700 384 GB/s
RX 6700 XT 384 GB/s
RTX 3060 Ti 448 GB/s
RTX 3070 448 GB/s
RX 6800 512 GB/s
RX 6800 XT 512 GB/s
RX 6900 XT 512 GB/s
RTX 3070 Ti 608,3 GB/s
RTX 3080 760.3 GB/s
RTX 3080 Ti (20 GB) 760.3 GB/s
RTX 3080 Ti (12 GB) 912.4 GB/s
RTX 3090 936.2 GB/s
RTX 3090 Ti 1018 GB/s

Note that you are quoting desktop bandwidth. And these chips also have much more cache than those GPUs.

BenRacicot said:
Yeah I wonder what model the comparison of "the best gaming laptop they could find" is.

diamond.g said:
It said on the slide but it flashed by really quick.

EDIT: Found it

https://live.arstechnica.com/apple-october-18-unleashed-event/images/Screen%20Shot%202021-10-18%20at%2012.23.12%20PM.jpg

https://live.arstechnica.com/apple-october-18-unleashed-event/images/Screen%20Shot%202021-10-18%20at%2012.22.21%20PM.jpg

A mobile RTX 3080 with a TDP of 165W

BenRacicot · Oct 18, 2021

diamond.g said:
It said on the slide but it flashed by really quick.

Oh nice, I went to look for it. Found this at 25:56:

"Compact pro PC laptop performance data from testing Razer Blade 15 Advanced (RZ09-0409CE53)
High-end PC laptop performance data from testing MSI GE76 Raider 11UH-053"

And this just for fun:

Screen Shot 2021-10-18 at 2.47.51 PM.png

Like @Boil said GPU bandwidth isnt everything but that's also true for the competitors. The specs we saw today mean something. It could be that they are trying very hard to embarrass x86 and the current GPU offerings.

BenRacicot · Oct 18, 2021

leman said:
Note that you are quoting desktop bandwidth. And these chips also have much more cache than those GPUs.

A mobile RTX 3080 with a TDP of 165W

Nice find! Wow a mobile RTX 3080! Considering I was hoping to compete with the RX 6600 this is great news.

leman · Oct 18, 2021

BenRacicot said:
I know GPU bandwidth isnt everything but that's also true for the competitors. The specs we saw today mean something. It could be that they are trying very hard to embarrass x86 and the current GPU offerings.

Well, 400GB/s is at the same level as a mobile 3080 RTX…

BenRacicot · Oct 18, 2021

leman said:
Well, 400GB/s is at the same level as a mobile 3080 RTX…

Oh, hmm. Well now let's get into it! Question, does 64GB of shared memory play into this somehow?

diamond.g · Oct 18, 2021

leman said:
Note that you are quoting desktop bandwidth. And these chips also have much more cache than those GPUs.

A mobile RTX 3080 with a TDP of 165W

Looks like the whole M1 Max can pull ~90 watts? Was I seeing those slides correctly?

Is the M1 GPU sharing LPDDR4 RAM going to be enough?

macrumors Core

macrumors G5

macrumors 6502a

macrumors Core

macrumors 601

macrumors 601

macrumors 6502a

macrumors 601

macrumors 6502a

macrumors 6502a

macrumors member

macrumors Core

macrumors 6502a

macrumors 68030

macrumors member

macrumors 68040

macrumors 68030

macrumors G5

macrumors G5

macrumors Core

macrumors member

macrumors member

macrumors Core

macrumors member

macrumors G5

Our Staff