Yes, and that's partially the point.
I wonder if ~1 TB of unified memory, which would make an unusual amount of RAM accessible to the GPU, would allow coders to accomplish things not commonly possible with a workstation.
Some people seem to be having a hard time understanding the difference I was drawing between unified memory and the traditional memory architecture. I decided to get explicit to better distinguish between the UM on package and the off-package RAM architecture I was describing.
So, Unified Memory Architecture...? ;^p
No, to avoid slowdowns reading/writing to SSD. If you have 2 TB of RAM in your system, it's not so you can open lots of Chrome tabs at human speeds, it's so you can work on massive datasets at machine speeds.
^^^ This, to avoid any issues with overworked SSDs down the road...?
Actually, if the unified RAM is treated as a cache for main memory, it'll make the design extremely expensive, and that unified memory will not be meaningfully added to the available memory capacity. Say you need 1 TB of usable memory: if we have L1, L2, SLC, unified memory (e.g. 256 GB, 1.6 TB/s), and slower last-level high-capacity RAM (1 TB, 800 GB/s, i.e. 16 DDR5-6400 ECC DIMM slots), total memory capacity would still be 1 TB, although you have 1.25 TB of physical RAM installed. In fact you would have to have at least 256 GB installed in the DIMM slots, for a total of 512 GB physical RAM installed but only 256 GB visible for use. That's the nature of a cache as I understand it.
It seems more sensible to use off-package RAM as a distinct pool in the virtual memory stack. There are essentially two ways to look at it: you can think of package RAM as an L4 cache (or whatever), or you can think of the off-package RAM essentially as a RAM-disk for swap. In the end, the difference is academic-- I'd imagine that the package would page less-used memory off package, and page needed memory into the package.
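To put a number on that, here's a tiny back-of-the-envelope sketch in C. The 256 GB / 1 TB figures are just the hypothetical example sizes from the post above, not anything Apple has announced:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t on_package_gb = 256;   /* hypothetical on-package UMA pool */
    uint64_t dimm_gb       = 1024;  /* hypothetical off-package DIMM capacity */

    /* Cache model: on-package RAM only holds copies of data that also
       lives in the DIMMs, so it adds nothing to visible capacity. */
    uint64_t visible_as_cache = dimm_gb;

    /* Tiered-pool model: both regions are mapped as distinct physical
       memory and the OS pages between them, so the capacities add. */
    uint64_t visible_as_tier = on_package_gb + dimm_gb;

    printf("physical RAM installed:    %llu GB\n",
           (unsigned long long)(on_package_gb + dimm_gb));
    printf("visible if UMA is a cache: %llu GB\n",
           (unsigned long long)visible_as_cache);
    printf("visible if UMA is a tier:  %llu GB\n",
           (unsigned long long)visible_as_tier);
    return 0;
}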
I see your point. I agree it would be a problem if we needed to mirror everything on package to off-package memory. I'm not sure that's how it has to be, though. It sounds like you get the spirit of what I'm suggesting anyway. That's what I meant when I said it's two ways of looking at the same thing. I just look at the memory hierarchy as concentric memory spaces, each ring having more capacity but slower access: Registers->L1->L2->SLC->Package RAM->Off-Package RAM->Disk. In addition to different capacities and access speeds, each ring shares data across consumers differently.
Actually, if the unified RAM is treated as a cache for main memory, it'll make the design extremely expensive, and that unified memory will not be meaningfully added to the available memory capacity. Say you need 1 TB of usable memory: if we have L1, L2, SLC, unified memory (e.g. 256 GB, 1.6 TB/s), and slower last-level high-capacity RAM (1 TB, 800 GB/s, i.e. 16 DDR5-6400 ECC DIMM slots), total memory capacity would still be 1 TB, although you have 1.25 TB of physical RAM installed. In fact you would have to have at least 256 GB installed in the DIMM slots, for a total of 512 GB physical RAM installed but only 256 GB visible for use. That's the nature of a cache as I understand it.
What Apple may opt to do (and I got the inspiration from your comments) is to use the memory installed in the DIMM slots as a RAM disk, but treat it as slower system RAM. So UMA memory pages to the slower DIMM RAM, and the DIMM RAM pages to SSD. This benefits workloads with datasets smaller than 256 GB (assuming 256 GB of UMA memory installed), but not so much if the datasets are larger.
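A minimal sketch of that two-level paging idea, just to make the flow concrete. The tier names and the eviction order are made up for illustration; nothing here reflects how XNU actually manages its pageout queues:

#include <stdio.h>

/* Hypothetical memory tiers, fastest to slowest. */
enum tier { TIER_UMA, TIER_DIMM, TIER_SSD };

/* Cold pages are pushed one level outward: UMA -> DIMM, DIMM -> SSD.
   The SSD is the last stop. */
static enum tier evict_target(enum tier t)
{
    switch (t) {
    case TIER_UMA:  return TIER_DIMM;
    case TIER_DIMM: return TIER_SSD;
    default:        return TIER_SSD;
    }
}

int main(void)
{
    static const char *names[] = { "UMA", "DIMM", "SSD" };
    for (int t = TIER_UMA; t <= TIER_DIMM; t++)
        printf("cold pages in %s get paged out to %s\n",
               names[t], names[evict_target((enum tier)t)]);
    return 0;
}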
Just to push back at one aspect of this: on-package memory doesn’t necessarily have lower latency than normal DRAM, just much lower power for that latency.
Is that true? I assumed that once the bus gets longer than the clock wavelength, and with the buffers and drivers at both ends, there's going to be a propagation delay associated with it.
They're the only one talking about the rumors in a video format.
Actually, if the unified RAM is treated as a cache for main memory, it'll make the design extremely expensive, and that unified memory will not be meaningfully added to the available memory capacity. Say you need 1 TB of usable memory: if we have L1, L2, SLC, unified memory (e.g. 256 GB, 1.6 TB/s), and slower last-level high-capacity RAM (1 TB, 800 GB/s, i.e. 16 DDR5-6400 ECC DIMM slots), total memory capacity would still be 1 TB, although you have 1.25 TB of physical RAM installed. In fact you would have to have at least 256 GB installed in the DIMM slots, for a total of 512 GB physical RAM installed but only 256 GB visible for use. That's the nature of a cache as I understand it.
What Apple may opt to do (and I got the inspiration from your comments) is to use the memory installed in the DIMM slots as a RAM disk, but treat it as slower system RAM. So UMA memory pages to the slower DIMM RAM, and the DIMM RAM pages to SSD. This benefits workloads with datasets smaller than 256 GB (assuming 256 GB of UMA memory installed), but not so much if the datasets are larger.
Well, the definition of a cache in this instance is a copy of data. So the faster cache memory and the slower RAM would both have to hold the same copies, and what's in cache must be in memory, or else we get data inconsistencies. But yeah, I get your point.
Swap doesn't take disk space if you don't exceed physical memory. Likewise, I'd imagine you wouldn't need off-chip RAM if you could execute just in cache; we just don't see that because just about everything we do exceeds code/data cache, and we generally transfer data through RAM to other systems (such as the GPU for display).
Well, I guess that'll be the million-dollar question … heh heh.
How much memory needs to be "unified?" I take it the number is higher for non-graphics GPU workloads than for graphics GPU workloads?
Is that true? I assumed that once the bus gets longer than the clock wavelength, and with the buffers and drivers at both ends, there's going to be a propagation delay associated with it.
I think both statements are true.
You need about 6ps for every mm of travel (depends on the substrate, metal, etc.). So the farther away the memory, the longer that part of the latency is.
But the latency also includes driver delays (and the delay caused by the ramp at the receiver). So you could have off-package memory that needs 250ps for the wire + 10ps for the driver and 10ps for the receiver, and have on-package memory that needs 150ps for the wire + 70ps for the driver and 70ps for the receiver.
The power consideration is also important - burn enough power and you can, at least, minimize the driver and receiver delay (nothing you can do about the wire delay).
Of course, all else being equal, closer-by is less latency.
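Plugging the example figures above into a quick calculation (the 6 ps/mm and the driver/receiver numbers are just the illustrative values from this post, not measurements of any real part):

#include <stdio.h>

/* Roughly 6 ps of wire delay per mm of travel (substrate/metal dependent). */
static double wire_delay_ps(double length_mm)
{
    return 6.0 * length_mm;
}

int main(void)
{
    /* ~42 mm of trace costs about 250 ps of wire delay alone. */
    printf("42 mm of wire: about %.0f ps\n", wire_delay_ps(42.0));

    /* Total link latency = wire + driver + receiver. */
    double off_pkg = 250.0 + 10.0 + 10.0;  /* longer wire, fast driver/receiver */
    double on_pkg  = 150.0 + 70.0 + 70.0;  /* shorter wire, slow driver/receiver */

    printf("off-package example: %.0f ps\n", off_pkg);  /* 270 ps */
    printf("on-package example:  %.0f ps\n", on_pkg);   /* 290 ps */
    return 0;
}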
Given those numbers, they look like read latencies and not signaling latencies. Different issue.
In practice, the wires aren't that long and the other factors overwhelm it.
Here you can see DDR4/5 latency on Alder Lake is at least as good if not much better than the lpDDR4/5 on-package memory:
The 2020 Mac Mini Unleashed: Putting Apple Silicon M1 To The Test (www.anandtech.com)
DDR4/5: 85-92ns
lpDDR4/5: 96-111ns
Yes, people who earn a living are disingenuous.
They'd do anything to make a buck on YouTube. Their content is made for monetization purposes only.
Given those numbers, they look like read latencies and not signaling latencies. Different issue.
Since Apple Silicon is an SoC, I wonder if the "upgrade" method Apple would design for the Mac Pro would be adding expansion cards containing additional set(s) of the M1 Pro/Max. So you'll be upgrading all CPU, GPU, RAM, and storage by the package. Akin to adding multiple Mac minis inside a Mac Pro shell...
I'm not sure the numbers you cited include the transmission delays (which are the delays that would vary). But, yes, they should be smallish compared to the read latency.
Right, but unless I misunderstood his initial post, I think read/write latency was what he was concerned about: that moving to DDR4/5 would slow down the ability to read/write from RAM compared to on-package RAM.
This seems simpler than @crazy dave's idea of massive bandwidth to the off-package RAM, which would be necessary to keep the workload from slowing down depending on which addresses it touches if it were all treated as one massive unified pool.
Don't feel bad for those who bought the Mac Pro in 2019. In 3 years they made more money than they spent on the Mac Pro.
He makes a very compelling case for the iMac Pro & Mac Pro's Apple Silicon chips.
I still feel bad for 2019 Mac Pro owners. That desktop should have debuted in 2017 instead of the iMac Pro, so that owners would have enjoyed over 5 years of use before it was phased out.
It's not quite accurate to say slots will "negate" that. See discussion above.
Not so much.
Part of Apple's performance advantage is that the RAM is so close to the SOC. Slots will negate that - the longer data path will make it hard to push the RAM so hard.
I don't believe making huge amounts of RAM (~1 TB) available to the GPU was ever intended as one of the purposes (points) of UMA, since it would be a rare use case where the GPU would need that much RAM, and Apple doesn't design its architecture for rare use cases.
Yes, and that's partially the point.
Since Apple Silicon is an SoC, I wonder if the "upgrade" method Apple would design for the Mac Pro would be adding expansion cards containing additional set(s) of the M1 Pro/Max. So you'll be upgrading all CPU, GPU, RAM, and storage by the package. Akin to adding multiple Mac minis inside a Mac Pro shell...
That’s what I was thinking too a while ago, until I remembered at present macOS only supports 64 threads.
I assume you mean cores, not threads? (I.e. currently-executing streams, not executable streams)
Cores. Corrected.
If they do, it will be only when they plan to move past 40 cores as speculated.
Not sure that it would be very hard for them to change:
#define MAX_CPUS 64 /* 8 * sizeof(cpumask_t) */
…
typedef volatile uint64_t cpumask_t;
static inline cpumask_t
Also not sure where that math comes from.
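My guess is the 8 in that comment is just bits per byte: cpumask_t is a 64-bit word with one bit per CPU, so the mask tops out at 8 * sizeof(cpumask_t) = 64. A quick sanity check of that arithmetic (the typedef is the one quoted above; the #define here spells the comment out as an expression rather than the literal 64):

#include <stdio.h>
#include <stdint.h>
#include <limits.h>

typedef volatile uint64_t cpumask_t;       /* as in the quoted header */
#define MAX_CPUS (8 * sizeof(cpumask_t))   /* 8 bits/byte * 8 bytes = 64 bits */

int main(void)
{
    printf("sizeof(cpumask_t) = %zu bytes\n", sizeof(cpumask_t));
    printf("CHAR_BIT          = %d bits per byte\n", CHAR_BIT);
    printf("MAX_CPUS          = %zu (one mask bit per core)\n", (size_t)MAX_CPUS);
    return 0;
}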