
Analog Kid

macrumors G3
Mar 4, 2003
9,360
12,603
So, Unified Memory Architecture...? ;^p
Some people seem to be having a hard time understanding the difference I was drawing between unified memory and the traditional memory architecture, so let me be explicit to better distinguish between the on-package UM and the off-package RAM architecture I was describing.

^^^ This, to avoid any issues with overworked SSDs down the road...?
No, to avoid slowdowns reading/writing to SSD. If you have 2TB of RAM in your system, it’s not so you can open lots of Chrome tabs at human speeds— it’s so you can work on massive datasets at machine speeds.
 

throAU

macrumors G3
Feb 13, 2012
9,204
7,354
Perth, Western Australia
I think with the multi-SOC machines the memory won't be quite as "unified".

It will be package-local.

So you'll have 2-4 packages (or more), each complete with CPU cores, GPU cores, and local RAM, with a somewhat slower bus between the packages.

So it won't be entirely unified, but unified within the local package, with packages linked to each other via a bus that is still fast, but not as fast as the on-package bus (distance vs. physics).

Think of an M1 Pro or M1 Max cluster as something like an EPYC multi-die socket. The EPYC has multiple processor dies on the package, and multiple CCXs within each die - but there's a slower motherboard bus between the sockets.

I can see each M1-based SOC package talking to a cache controller which then accesses resources on the other SOCs via the SOC-to-SOC bus.
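
To make "unified within the package" concrete, here's a toy C sketch of that topology - every number in it is invented for illustration, not an Apple spec: a core pays its package-local RAM latency, plus one crossing of the slower SOC-to-SOC bus when the data is homed in another package.

#include <stdio.h>

#define LOCAL_NS 100.0 /* assumed latency to package-local RAM (made up) */
#define HOP_NS    40.0 /* assumed extra cost of one SOC-to-SOC bus crossing (made up) */

/* Latency seen by a core in package 'cpu' touching memory homed in package 'mem'. */
static double access_ns(int cpu, int mem)
{
    return LOCAL_NS + (cpu == mem ? 0.0 : HOP_NS);
}

int main(void)
{
    printf("package-local access: %.0f ns\n", access_ns(0, 0)); /* 100 ns */
    printf("cross-package access: %.0f ns\n", access_ns(0, 3)); /* 140 ns */
    return 0;
}

Software still sees one unified address space either way; the cross-package case just pays the extra hop.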



edit:
this is all entirely speculation on my part, but based on what both Intel and AMD have been doing, and on industry trends/discussion over the past decade, it's how I see Apple scaling the M1 out further.
 

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
It seems more sensible to use off-package RAM as a distinct pool in the virtual memory stack. There are essentially two ways to look at it: you can think of the package RAM as an L4 cache (or whatever), or you can think of the off-package RAM essentially as a RAM disk for swap. In the end, the difference is academic -- I'd imagine that the package would page less-used memory off package, and page needed memory into the package.
Actually, if the unified RAM is treated as a cache for main memory, it’ll make the design extremely expensive, and the unified memory will not meaningfully add to the available memory capacity. Say you need 1 TB of usable memory: if we have L1, L2, SLC, unified memory (e.g. 256 GB at 1.6 TB/s), and a slower, last-level, high-capacity RAM (1 TB at 800 GB/s, i.e. 16 DDR5-6400 ECC DIMM slots), the total memory capacity would still be 1 TB, although you have 1.25 TB of physical RAM installed. In fact, you would have to have at least 256 GB installed in the DIMM slots, for a total of 512 GB of physical RAM installed but only 256 GB visible for use. That’s the nature of a cache as I understand it.

What Apple may opt to do (and I got the inspiration from your comments) is to use the memory installed in the DIMM slots like a RAM disk, but treat it as slower system RAM. So UMA memory pages to the slower DIMM RAM, and the DIMM RAM pages to SSD. This benefits workloads with datasets under 256 GB (assuming 256 GB of UMA memory installed), but not so much if datasets are larger.
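
A rough sketch in C of how that tiering could work - the policy and the names are purely illustrative, mine and not macOS internals: cold pages demote UMA -> DIMM -> SSD, and touched pages promote back toward the fast tier.

#include <stdio.h>

enum tier { TIER_UMA, TIER_DIMM, TIER_SSD }; /* fast -> slow */

typedef struct {
    unsigned long vpn; /* virtual page number */
    enum tier where;   /* which tier currently holds the page */
    unsigned age;      /* scans since last access */
} page_t;

/* Demote a page one tier per scan once it has gone cold. */
static void scan_page(page_t *p, unsigned cold)
{
    if (++p->age < cold)
        return;
    if (p->where != TIER_SSD)
        p->where++; /* UMA -> DIMM RAM, or DIMM RAM -> SSD swap */
    p->age = 0;
}

/* On access, pull the page back toward the fast tier. */
static void touch_page(page_t *p)
{
    if (p->where != TIER_UMA)
        p->where--; /* SSD -> DIMM RAM, or DIMM RAM -> UMA */
    p->age = 0;
}

int main(void)
{
    page_t p = { .vpn = 42, .where = TIER_UMA, .age = 0 };
    for (int i = 0; i < 6; i++)
        scan_page(&p, 3);                       /* page sits idle and demotes */
    printf("after idling: tier %d\n", p.where); /* 2 = TIER_SSD */
    touch_page(&p);
    printf("after touch:  tier %d\n", p.where); /* 1 = TIER_DIMM */
    return 0;
}

The point is that capacities add: nothing in the fast tier has to be mirrored in the slow one.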
 
  • Like
Reactions: jido and Argoduck

Analog Kid

macrumors G3
Mar 4, 2003
9,360
12,603
Actually, if the unified RAM is treated as a cache for main memory, it’ll make the design extremely expensive, and the unified memory will not meaningfully add to the available memory capacity. Say you need 1 TB of usable memory: if we have L1, L2, SLC, unified memory (e.g. 256 GB at 1.6 TB/s), and a slower, last-level, high-capacity RAM (1 TB at 800 GB/s, i.e. 16 DDR5-6400 ECC DIMM slots), the total memory capacity would still be 1 TB, although you have 1.25 TB of physical RAM installed. In fact, you would have to have at least 256 GB installed in the DIMM slots, for a total of 512 GB of physical RAM installed but only 256 GB visible for use. That’s the nature of a cache as I understand it.

What Apple may opt to do (and I got the inspiration from your comments) is to use the memory installed in the DIMM slots like a RAM disk, but treat it as slower system RAM. So UMA memory pages to the slower DIMM RAM, and the DIMM RAM pages to SSD. This benefits workloads with datasets under 256 GB (assuming 256 GB of UMA memory installed), but not so much if datasets are larger.
I see your point. I agree it would be a problem if we needed to mirror everything on package to off-package memory. I’m not sure that’s how it has to be, though. It sounds like you get the spirit of what I’m suggesting anyway; that’s what I meant when I said it’s two ways of looking at the same thing. I just look at the memory hierarchy as concentric memory spaces, each ring having more capacity but slower access: registers -> L1 -> L2 -> SLC -> package RAM -> off-package RAM -> disk. In addition to different capacities and access speeds, each ring shares data across consumers differently.

Swap doesn’t take disk space if you don’t exceed physical memory. Likewise, I’d imagine you wouldn’t need off-chip RAM if you could execute just in cache; we just don’t see that because just about everything we do exceeds the code/data caches, and we generally transfer data through RAM to other systems (such as the GPU for display).
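
To put ballpark numbers on those rings (every figure here is a rough illustration of the trend, not a measurement of any specific machine):

#include <stdio.h>

/* Each ring out gains capacity and loses speed - illustrative orders of magnitude only. */
static const struct {
    const char *ring;
    const char *capacity;
    const char *latency;
} hier[] = {
    { "registers",       "~KB",     "~0.2 ns" },
    { "L1",              "~100 KB", "~1 ns"   },
    { "L2",              "~10 MB",  "~5 ns"   },
    { "SLC",             "~50 MB",  "~20 ns"  },
    { "package RAM",     "~100 GB", "~100 ns" },
    { "off-package RAM", "~1 TB",   "~150 ns" },
    { "disk (swap)",     "~10 TB",  "~10 us"  },
};

int main(void)
{
    for (size_t i = 0; i < sizeof hier / sizeof hier[0]; i++)
        printf("%-16s %-8s %s\n", hier[i].ring, hier[i].capacity, hier[i].latency);
    return 0;
}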
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Just to push back at one aspect of this: on-package memory doesn’t necessarily have lower latency than normal DRAM, just much lower power for that latency.
Is that true? I assumed that once the bus gets longer than the clock wavelength, and with the buffers and drivers at both ends, there’s going to be an associated propagation delay.

I think both statements are true.

You need about 6 ps for every mm of travel (depends on the substrate, metal, etc.). So the farther away the memory, the longer that part of the latency is.

But the latency also includes driver delays (and the delay caused by the ramp at the receiver). So you could have off-package memory that needs 250 ps for the wire + 10 ps for the driver and 10 ps for the receiver, and on-package memory that needs 150 ps for the wire + 70 ps for the driver and 70 ps for the receiver.

The power consideration is also important - burn enough power and you can, at least, minimize the driver and receiver delay (nothing you can do about the wire delay).

Of course, all else being equal, closer-by is less latency.
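
Plugging the example numbers above into that arithmetic - a quick C check that total signaling latency is wire + driver + receiver:

#include <stdio.h>

#define PS_PER_MM 6.0 /* ~6 ps per mm of travel, per the figure above */

/* Total one-way signaling latency: wire + driver + receiver. */
static double signal_ps(double dist_mm, double driver_ps, double receiver_ps)
{
    return dist_mm * PS_PER_MM + driver_ps + receiver_ps;
}

int main(void)
{
    /* The example above: far memory driven by fast (power-hungry) drivers vs.
       near memory driven by slow (low-power) drivers. Distances backed out of
       the wire delays: 250 ps / 6 ~= 42 mm, 150 ps / 6 = 25 mm. */
    printf("off-package: %.0f ps\n", signal_ps(250.0 / 6.0, 10.0, 10.0)); /* 270 ps */
    printf("on-package:  %.0f ps\n", signal_ps(150.0 / 6.0, 70.0, 70.0)); /* 290 ps */
    return 0;
}

So in that (contrived) corner of the design space the on-package part is actually slower end to end, despite the shorter wire.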
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Actually, if the unified RAM is treated as a cache for main memory, it’ll make the design extremely expensive, and the unified memory will not meaningfully add to the available memory capacity. Say you need 1 TB of usable memory: if we have L1, L2, SLC, unified memory (e.g. 256 GB at 1.6 TB/s), and a slower, last-level, high-capacity RAM (1 TB at 800 GB/s, i.e. 16 DDR5-6400 ECC DIMM slots), the total memory capacity would still be 1 TB, although you have 1.25 TB of physical RAM installed. In fact, you would have to have at least 256 GB installed in the DIMM slots, for a total of 512 GB of physical RAM installed but only 256 GB visible for use. That’s the nature of a cache as I understand it.

What Apple may opt to do (and I got the inspiration from your comments) is to use the memory installed in the DIMM slots like a RAM disk, but treat it as slower system RAM. So UMA memory pages to the slower DIMM RAM, and the DIMM RAM pages to SSD. This benefits workloads with datasets under 256 GB (assuming 256 GB of UMA memory installed), but not so much if datasets are larger.

How much memory needs to be “unified?” I take it the number is higher for non-graphics GPU workloads than for graphics GPU workloads?
 

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
Swap doesn’t take disk space if you don’t exceed physical memory. Likewise, I’d imagine you wouldn’t need off-chip RAM if you could execute just in cache; we just don’t see that because just about everything we do exceeds the code/data caches, and we generally transfer data through RAM to other systems (such as the GPU for display).
Well, the definition of a cache in this instance is a copy of the data. So the faster cache memory and the slower RAM would both have to hold the same copies, and what’s in cache must be in memory, else we get data inconsistencies. But yeah, I get your point.
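
The capacity consequence from my earlier post, as arithmetic - a sketch using the same example figures (256 GB of UMA, 1 TB in DIMM slots): with an inclusive cache the usable capacity is only the backing tier, while distinct tiers add.

#include <stdio.h>

int main(void)
{
    double uma_gb  = 256.0;  /* on-package unified memory (example figure) */
    double dimm_gb = 1024.0; /* DIMM-slot RAM (example figure) */

    /* Inclusive cache: everything in UMA is a copy of DIMM contents,
       so the UMA adds nothing to the visible capacity. */
    printf("as cache:         %4.0f GB usable of %4.0f GB installed\n",
           dimm_gb, uma_gb + dimm_gb);

    /* Distinct tiers (paging between them): the capacities add. */
    printf("as separate tier: %4.0f GB usable of %4.0f GB installed\n",
           uma_gb + dimm_gb, uma_gb + dimm_gb);
    return 0;
}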
 

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
How much memory needs to be “unified?” I take it the number is higher for non-graphics GPU workloads than for graphics GPU workloads?
Well, I guess that’ll be the million-dollar question … heh heh.

I still think Apple may just go for 16 DDR5-6400 ECC DIMM slots for their AS Mac Pro, for a total of 800 GB/s of memory bandwidth, and call it a day. That’s a whopping 1024 + 128 bits of data + ECC traces. So all of the memory will be unified?
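
The arithmetic behind those figures, as a quick check (assuming 64 data + 8 ECC bits per DIMM, which is where the 1024 + 128 comes from):

#include <stdio.h>

int main(void)
{
    const double mts       = 6400.0; /* DDR5-6400: mega-transfers per second */
    const int    dimms     = 16;
    const int    data_bits = 64;     /* data width per DIMM */
    const int    ecc_bits  = 8;      /* assumed ECC bits per DIMM */

    double gbps = mts * (data_bits / 8.0) * dimms / 1000.0;
    printf("aggregate bandwidth: %.1f GB/s\n", gbps); /* 819.2, i.e. ~800 GB/s */
    printf("traces: %d data + %d ECC bits\n",
           dimms * data_bits, dimms * ecc_bits);      /* 1024 + 128 */
    return 0;
}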
 
  • Like
Reactions: Argoduck

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,229
Is that true? I assumed that once the bus gets longer than the clock wavelength, and with the buffers and drivers at both ends, there’s going to be an associated propagation delay.

I think both statements are true.

You need about 6 ps for every mm of travel (depends on the substrate, metal, etc.). So the farther away the memory, the longer that part of the latency is.

But the latency also includes driver delays (and the delay caused by the ramp at the receiver). So you could have off-package memory that needs 250 ps for the wire + 10 ps for the driver and 10 ps for the receiver, and on-package memory that needs 150 ps for the wire + 70 ps for the driver and 70 ps for the receiver.

The power consideration is also important - burn enough power and you can, at least, minimize the driver and receiver delay (nothing you can do about the wire delay).

Of course, all else being equal, closer-by is less latency.


In practice, the wires aren’t that long and the other factors overwhelm it.

Here you can see DDR4/5 latency on Alder Lake is at least as good as, if not much better than, the LPDDR4/5 on-package memory:

[latency charts]

DDR4/5: 85-92 ns
LPDDR4/5: 96-111 ns
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
In practice, the wires aren’t that long and the other factors overwhelm it.

Here you can see DDR4/5 latency on Alder Lake is at least as good as, if not much better than, the LPDDR4/5 on-package memory:

[latency charts]

DDR4/5: 85-92 ns
LPDDR4/5: 96-111 ns
Given those numbers, they look like read latencies and not signaling latencies. Different issue.
 

Adarna

Suspended
Original poster
Jan 1, 2015
685
429
They’d do anything to make a buck on YouTube. Their content is made for monetization purposes only.
Yes, people who earn a living are disingenuous.

They presented the concept of a multi-die M1 Max for pro desktops in a way that's accessible to more people.

Especially those who do not want to read.
 
  • Like
Reactions: Argoduck

ian87w

macrumors G3
Feb 22, 2020
8,704
12,638
Indonesia
Since Apple Silicon is an SoC, I wonder if the "upgrade" method Apple would design for the Mac Pro would be adding expansion cards containing additional set(s) of the M1 Pro/Max. So you'll be upgrading CPU, GPU, RAM, and storage by the package. Akin to adding multiple Mac minis inside a Mac Pro shell... :D
 
  • Like
Reactions: Argoduck

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,229
Given those numbers, they look like read latencies and not signaling latencies. Different issue.

Right, but unless I misunderstood his initial post, I think read/write latency was what he was concerned about - that moving to DDR4/5 would slow down reading from and writing to RAM compared to on-package RAM.
 
  • Like
Reactions: Analog Kid

throAU

macrumors G3
Feb 13, 2012
9,204
7,354
Perth, Western Australia
Since Apple Silicon is an SoC, I wonder if the "upgrade" method Apple would design for the Mac Pro would be adding expansion cards containing additional set(s) of the M1 Pro/Max. So you'll be upgrading CPU, GPU, RAM, and storage by the package. Akin to adding multiple Mac minis inside a Mac Pro shell... :D

This is pretty much exactly what I expect. A Mac Pro will have 2-4 sockets that fit M1 Pro/Max packages.
 
  • Like
Reactions: Argoduck

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
Right, but unless I misunderstood his initial post, I think read/write latency was what he was concerned about - that moving to DDR4/5 would slow down reading from and writing to RAM compared to on-package RAM.
I’m not sure the numbers you cited include the transmission delays (which are the delays that would vary). But, yes, they should be smallish compared to the read latency.
 

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,229
This seems simpler than @crazy dave 's idea of massive bandwidth to the off-package RAM, which would be necessary to prevent slowing down the workload depending on address if it was all treated as one massive unified pool.

Not really my idea, as I don’t think they’ll do this, but the idea with the massive DDR5 pool is that there is zero on-package memory. It’s all off-package, with just a massive number of filled slots to make up the bandwidth. In my opinion: technically possible, practically unlikely, as then all slots *have to be filled*. Huge minimum amount of RAM and just awkward all around, but there’s nothing technically stopping them from doing it.
 
  • Like
Reactions: Analog Kid

Serban55

Suspended
Oct 18, 2020
2,153
4,344

He makes a very compelling case for the iMac Pro & Mac Pro's Apple Silicon chips.

I still feel bad for 2019 Mac Pro owners. That desktop should have debuted in 2017 instead of the iMac Pro, so that owners would enjoy over 5 years of use before being phased out.
Don't feel bad for those who bought the Mac Pro in 2019. In 3 years they made more money than they spent on the Mac Pro.
 
  • Like
Reactions: Argoduck and Adarna

theorist9

macrumors 68040
May 28, 2015
3,881
3,060
Not so much.

Part of Apple's performance advantage is that the RAM is so close to the SOC. Slots will negate that - the longer data path will make it hard to push the RAM so hard.
It's not quite accurate to say slots will "negate" that. See discussion above.


Yes, and that's partially the point.
I don't believe making huge amounts of RAM (~1 TB) available to the GPU was ever intended as one of the purposes (points) of UMA, since it would be a rare use case where the GPU would need that much RAM, and Apple doesn't design its architecture for rare use cases.

Apple's developer talks at WWDC explain that the purpose of UMA was to let the GPU and CPU access the same RAM pool, obviating the need to copy data back and forth between GPU and CPU RAM, which is of general benefit (see the first few minutes of the video below).

I don't recall Apple ever saying they also designed it for those needing hundreds of GB of GPU RAM. [If you can find such a reference from Apple, I'd be interested to see it.] So, rather, I think this is better characterized as an intriguing potential side benefit of UMA than as one of its designed purposes.
[embedded WWDC video]
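
A sketch of the difference being described - this is toy C standing in for the idea, not Apple's actual Metal API: with discrete GPU memory the data gets copied into the GPU's pool, while with UMA both sides reference one allocation.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Discrete-memory model: CPU data must be copied into the GPU's pool. */
static void *upload_to_gpu(const void *cpu_buf, size_t len)
{
    void *gpu_buf = malloc(len);   /* stand-in for a GPU-memory allocation */
    memcpy(gpu_buf, cpu_buf, len); /* the copy that UMA eliminates */
    return gpu_buf;
}

/* Unified-memory model: CPU and GPU reference the same allocation. */
static void *share_with_gpu(void *cpu_buf)
{
    return cpu_buf; /* no copy: same physical pages seen by both consumers */
}

int main(void)
{
    size_t len = 1 << 20;
    void *buf = malloc(len);
    memset(buf, 0, len);

    void *discrete = upload_to_gpu(buf, len); /* extra copy + double the RAM */
    void *unified  = share_with_gpu(buf);     /* zero-copy */

    printf("discrete copy at %p, unified alias at %p\n", discrete, unified);
    free(discrete);
    free(buf);
    return 0;
}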
 
  • Like
Reactions: Argoduck

singhs.apps

macrumors 6502a
Oct 27, 2016
660
400
Since Apple Silicon is an SoC, I wonder if the "upgrade" method Apple would design for the Mac Pro would be adding expansion cards containing additional set(s) of the M1 Pro/Max. So you'll be upgrading CPU, GPU, RAM, and storage by the package. Akin to adding multiple Mac minis inside a Mac Pro shell... :D
:D
That’s what I was thinking too a while ago, until I remembered that at present macOS only supports 64 cores.
So it’s a moot point if the system has to treat those as nodes. This may be expensive on the software side too; you may have to make multiple license purchases.

If the AS Mac Pro has in-chassis expansion, I suspect it will be non-CPU modules.
 
  • Like
Reactions: Argoduck

singhs.apps

macrumors 6502a
Oct 27, 2016
660
400
Not sure that it would be very hard for them to change:

#include <stdint.h>

typedef volatile uint64_t cpumask_t; /* one bit per CPU */

/* The math: 8 bits per byte * sizeof(cpumask_t) (8 bytes) = 64. */
#define MAX_CPUS 64 /* 8 * sizeof(cpumask_t) */

static inline cpumask_t

Also not sure where that math comes from.
If they do, it will be only when they plan to move past 40 cores as speculated.
 