Yes, and that's partially the point.
I wonder if ~1 TB of unified memory, which would make an unusual amount of RAM accessible to the GPU, would allow coders to accomplish things not commonly possible with a workstation.
Some people seem to be having a hard time understanding the difference I was drawing between unified memory and the traditional memory architecture. I decided to get explicit to better distinguish between the UM on package and the off-package RAM architecture I was describing.
So, Unified Memory Architecture...? ;^p
No, to avoid slowdowns reading/writing to SSD. If you have 2 TB of RAM in your system, it's not so you can open lots of Chrome tabs at human speeds, it's so you can work on massive datasets at machine speeds.
^^^ This, to avoid any issues with overworked SSDs down the road...?
Actually, if the unified RAM is treated as a cache for main memory, it'll make the design extremely expensive, and that unified memory will not be meaningfully added to the available memory capacity. Say you need 1 TB of usable memory: if we have L1, L2, SLC, unified memory (e.g. 256 GB, 1.6 TB/s), and slower last-level high-capacity RAM (1 TB, 800 GB/s, i.e. 16 DDR5-6400 ECC DIMM slots), total memory capacity would still be 1 TB, although you have 1.25 TB of physical RAM installed. In fact you would have to have at least 256 GB installed in the DIMM slots, for a total of 512 GB physical RAM installed but only 256 GB visible for use. That's the nature of a cache as I understand it.
It seems more sensible to use off-package RAM as a distinct pool in the virtual memory stack. There are essentially two ways to look at it: you can think of package RAM as an L4 cache (or whatever), or you can think of the off-package RAM essentially as a RAM-disk for swap. In the end, the difference is academic-- I'd imagine that the package would page less-used memory off package, and page needed memory into the package.
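To put a number on that, here's a tiny back-of-the-envelope sketch in C. The 256 GB / 1 TB figures are just the hypothetical example sizes from the post above, not anything Apple has announced:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t on_package_gb = 256;   /* hypothetical on-package UMA pool */
    uint64_t dimm_gb       = 1024;  /* hypothetical off-package DIMM capacity */

    /* Cache model: on-package RAM only holds copies of data that also
       lives in the DIMMs, so it adds nothing to visible capacity. */
    uint64_t visible_as_cache = dimm_gb;

    /* Tiered-pool model: both regions are mapped as distinct physical
       memory and the OS pages between them, so the capacities add. */
    uint64_t visible_as_tier = on_package_gb + dimm_gb;

    printf("physical RAM installed:    %llu GB\n",
           (unsigned long long)(on_package_gb + dimm_gb));
    printf("visible if UMA is a cache: %llu GB\n",
           (unsigned long long)visible_as_cache);
    printf("visible if UMA is a tier:  %llu GB\n",
           (unsigned long long)visible_as_tier);
    return 0;
}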
I see your point. I agree it would be a problem if we needed to mirror everything on package to off-package memory. I'm not sure that's how it has to be, though. It sounds like you get the spirit of what I'm suggesting anyway. That's what I meant when I said it's two ways of looking at the same thing. I just look at the memory hierarchy as concentric memory spaces, each ring having more capacity but slower access: Registers->L1->L2->SLC->Package RAM->Off-Package RAM->Disk. In addition to different capacities and access speeds, each ring shares data across consumers differently.
Actually, if the unified RAM is treated as a cache for main memory, it'll make the design extremely expensive, and that unified memory will not be meaningfully added to the available memory capacity. Say you need 1 TB of usable memory: if we have L1, L2, SLC, unified memory (e.g. 256 GB, 1.6 TB/s), and slower last-level high-capacity RAM (1 TB, 800 GB/s, i.e. 16 DDR5-6400 ECC DIMM slots), total memory capacity would still be 1 TB, although you have 1.25 TB of physical RAM installed. In fact you would have to have at least 256 GB installed in the DIMM slots, for a total of 512 GB physical RAM installed but only 256 GB visible for use. That's the nature of a cache as I understand it.
What Apple may opt to do (and I got the inspiration from your comments) is to use the memory installed in the DIMM slots as a RAM disk, but treat it as slower system RAM. So UMA memory pages to the slower DIMM RAM, and the DIMM RAM pages to SSD. This benefits workloads with datasets smaller than 256 GB (assuming 256 GB of UMA memory installed), but not so much if the datasets are larger.
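A minimal sketch of that two-level paging idea, just to make the flow concrete. The tier names and the eviction order are made up for illustration; nothing here reflects how XNU actually manages its pageout queues:

#include <stdio.h>

/* Hypothetical memory tiers, fastest to slowest. */
enum tier { TIER_UMA, TIER_DIMM, TIER_SSD };

/* Cold pages are pushed one level outward: UMA -> DIMM, DIMM -> SSD.
   The SSD is the last stop. */
static enum tier evict_target(enum tier t)
{
    switch (t) {
    case TIER_UMA:  return TIER_DIMM;
    case TIER_DIMM: return TIER_SSD;
    default:        return TIER_SSD;
    }
}

int main(void)
{
    static const char *names[] = { "UMA", "DIMM", "SSD" };
    for (int t = TIER_UMA; t <= TIER_DIMM; t++)
        printf("cold pages in %s get paged out to %s\n",
               names[t], names[evict_target((enum tier)t)]);
    return 0;
}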
Just to push back at one aspect of this: on-package memory doesn’t necessarily have lower latency than normal DRAM, just much lower power for that latency.
Is that true? I assumed that once the bus gets longer than the clock wavelength, and with the buffers and drivers at both ends, there's going to be a propagation delay associated with it.
They're the only one talking about the rumors in a video format.
Actually, if the unified RAM is treated as a cache for main memory, it'll make the design extremely expensive, and that unified memory will not be meaningfully added to the available memory capacity. Say you need 1 TB of usable memory: if we have L1, L2, SLC, unified memory (e.g. 256 GB, 1.6 TB/s), and slower last-level high-capacity RAM (1 TB, 800 GB/s, i.e. 16 DDR5-6400 ECC DIMM slots), total memory capacity would still be 1 TB, although you have 1.25 TB of physical RAM installed. In fact you would have to have at least 256 GB installed in the DIMM slots, for a total of 512 GB physical RAM installed but only 256 GB visible for use. That's the nature of a cache as I understand it.
What Apple may opt to do (and I got the inspiration from your comments) is to use the memory installed in the DIMM slots as a RAM disk, but treat it as slower system RAM. So UMA memory pages to the slower DIMM RAM, and the DIMM RAM pages to SSD. This benefits workloads with datasets smaller than 256 GB (assuming 256 GB of UMA memory installed), but not so much if the datasets are larger.
Well, the definition of a cache in this instance is a copy of data. So the faster cache memory and the slower RAM would both have to hold the same copies, and what's in cache must be in memory, or else we get data inconsistencies. But yeah, I get your point.
Swap doesn't take disk space if you don't exceed physical memory. Likewise, I'd imagine you wouldn't need off-chip RAM if you could execute just in cache; we just don't see that because just about everything we do exceeds code/data cache, and we generally transfer data through RAM to other systems (such as the GPU for display).
Well, I guess that'll be the million-dollar question … heh heh.
How much memory needs to be "unified?" I take it the number is higher for non-graphics GPU workloads than for graphics GPU workloads?
Is that true? I assumed that once the bus gets longer than the clock wavelength, and with the buffers and drivers at both ends, there's going to be a propagation delay associated with it.
I think both statements are true.
You need about 6ps for every mm of travel (depends on the substrate, metal, etc.). So the farther away the memory, the longer that part of the latency is.
But the latency also includes driver delays (and the delay caused by the ramp at the receiver). So you could have off-package memory that needs 250ps for the wire + 10ps for the driver and 10ps for the receiver, and have on-package memory that needs 150ps for the wire + 70ps for the driver and 70ps for the receiver.
The power consideration is also important - burn enough power and you can, at least, minimize the driver and receiver delay (nothing you can do about the wire delay).
Of course, all else being equal, closer-by is less latency.
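Plugging the example figures above into a quick calculation (the 6 ps/mm and the driver/receiver numbers are just the illustrative values from this post, not measurements of any real part):

#include <stdio.h>

/* Roughly 6 ps of wire delay per mm of travel (substrate/metal dependent). */
static double wire_delay_ps(double length_mm)
{
    return 6.0 * length_mm;
}

int main(void)
{
    /* ~42 mm of trace costs about 250 ps of wire delay alone. */
    printf("42 mm of wire: about %.0f ps\n", wire_delay_ps(42.0));

    /* Total link latency = wire + driver + receiver. */
    double off_pkg = 250.0 + 10.0 + 10.0;  /* longer wire, fast driver/receiver */
    double on_pkg  = 150.0 + 70.0 + 70.0;  /* shorter wire, slow driver/receiver */

    printf("off-package example: %.0f ps\n", off_pkg);  /* 270 ps */
    printf("on-package example:  %.0f ps\n", on_pkg);   /* 290 ps */
    return 0;
}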
Given those numbers, they look like read latencies and not signaling latencies. Different issue.
In practice, the wires aren't that long and the other factors overwhelm it.
Here you can see DDR4/5 latency on Alder Lake is at least as good if not much better than the lpDDR4/5 on-package memory:
The 2020 Mac Mini Unleashed: Putting Apple Silicon M1 To The Test (www.anandtech.com)
DDR4/5: 85-92ns
lpDDR4/5: 96-111ns
Yes, people who earn a living are disingenuous.
They'd do anything to make a buck on YouTube. Their content is made for monetization purposes only.
Given those numbers, they look like read latencies and not signaling latencies. Different issue.
Since Apple Silicon is an SoC, I wonder if the "upgrade" method Apple would design for the Mac Pro would be adding expansion cards containing additional set(s) of the M1 Pro/Max. So you'll be upgrading all CPU, GPU, RAM, and storage by the package. Akin to adding multiple Mac minis inside a Mac Pro shell...
I'm not sure the numbers you cited include the transmission delays (which are the delays that would vary). But, yes, they should be smallish compared to the read latency.
Right, but unless I misunderstood his initial post, I think read/write latency was what he was concerned about: that moving to DDR4/5 would slow down the ability to read/write from RAM compared to on-package RAM.
This seems simpler than @crazy dave's idea of massive bandwidth to the off-package RAM, which would be necessary to keep the workload from slowing down depending on which addresses it touches if it were all treated as one massive unified pool.
Don't feel bad for those who bought the Mac Pro in 2019. In 3 years they made more money than they spent on the Mac Pro.
He makes a very compelling case for the iMac Pro & Mac Pro's Apple Silicon chips.
I still feel bad for 2019 Mac Pro owners. That desktop should have debuted in 2017 instead of the iMac Pro, so that owners would have enjoyed over 5 years of use before it was phased out.
It's not quite accurate to say slots will "negate" that. See discussion above.
Not so much.
Part of Apple's performance advantage is that the RAM is so close to the SOC. Slots will negate that - the longer data path will make it hard to push the RAM so hard.
I don't believe making huge amounts of RAM (~1 TB) available to the GPU was ever intended as one of the purposes (points) of UMA, since it would be a rare use case where the GPU would need that much RAM, and Apple doesn't design its architecture for rare use cases.
Yes, and that's partially the point.
Since Apple Silicon is an SoC, I wonder if the "upgrade" method Apple would design for the Mac Pro would be adding expansion cards containing additional set(s) of the M1 Pro/Max. So you'll be upgrading all CPU, GPU, RAM, and storage by the package. Akin to adding multiple Mac minis inside a Mac Pro shell...
That’s what I was thinking too a while ago, until I remembered at present macOS only supports 64 threads.
I assume you mean cores, not threads? (I.e. currently-executing streams, not executable streams)
Cores. Corrected.
If they do, it will be only when they plan to move past 40 cores as speculated.
Not sure that it would be very hard for them to change:
#define MAX_CPUS 64 /* 8 * sizeof(cpumask_t) */
…
typedef volatile uint64_t cpumask_t;
static inline cpumask_t
Also not sure where that math comes from.
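My guess is the 8 in that comment is just bits per byte: cpumask_t is a 64-bit word with one bit per CPU, so the mask tops out at 8 * sizeof(cpumask_t) = 64. A quick sanity check of that arithmetic (the typedef is the one quoted above; the #define here spells the comment out as an expression rather than the literal 64):

#include <stdio.h>
#include <stdint.h>
#include <limits.h>

typedef volatile uint64_t cpumask_t;       /* as in the quoted header */
#define MAX_CPUS (8 * sizeof(cpumask_t))   /* 8 bits/byte * 8 bytes = 64 bits */

int main(void)
{
    printf("sizeof(cpumask_t) = %zu bytes\n", sizeof(cpumask_t));
    printf("CHAR_BIT          = %d bits per byte\n", CHAR_BIT);
    printf("MAX_CPUS          = %zu (one mask bit per core)\n", (size_t)MAX_CPUS);
    return 0;
}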