
theorist9

macrumors 68040
May 28, 2015
3,714
2,820
In most cases, the CPU and GPU are handling different data, so there is minimal swapping between system RAM and VRAM. There are other factors at play which preclude VRAM from being treated as additional system RAM. For starters, most dedicated video cards are actually running higher-spec memory than modern CPUs support. For example, even a previous-generation Radeon 6700 XT runs GDDR6, while current AM5 CPUs only run DDR5 and Intel's 13th/14th-gen parts can run either DDR4 or DDR5 depending on the motherboard. Beyond the simple generational differences, there are significant differences between DDR and GDDR, including a wider data bus with GDDR and lower power consumption compared to desktop RAM. DDR is also optimized for latency (which is why system builders often talk about memory timings, aka the CLxx numbers) rather than bandwidth (which is what GDDR is designed for).
I think you misunderstood my question. I wasn't asking whether the CPU could access GDDR VRAM to bolster its own RAM pool. I was asking about something entirely different: whether there are cases in which the presence of the separate VRAM reduces the amount of data that needs to be stored in CPU RAM.

That is, most of the data that's in the VRAM needs to be copied to it from CPU RAM. Thus, for such data, the separate VRAM doesn't represent an effective increase in available memory, since having separate VRAM just means you need to keep a duplicate copy. Here's a toy illustration of this:

Separate CPU and GPU RAM:
16 GB CPU RAM: Contains 12 GB CPU-only data, plus 4 GB that the CPU had to access to send to the GPU
4 GB GPU RAM: Contains the 4 GB copied over from the CPU

UMA:
16 GB Unified RAM: Contains 12 GB CPU-only data, plus 4 GB for the GPU

So while you have 4 GB extra RAM in the former case, that also requires that 4 GB of the data is duplicated, so there's no net gain.

My question was thus about whether there are cases in which some of the data in the GPU RAM either never was in, or could be deleted from, the CPU RAM. In those cases, having that separate GPU RAM would free up some CPU RAM.
These charts compare GDDR, DDR, and LPDDR in terms of both overall bandwidth and power efficiency. What's interesting to me is that LPDDR beats standard DDR on both counts.
Interestingly, NVIDIA uses LPDDR5x in its Grace-Hopper superchip.
 
Last edited:
  • Like
Reactions: pipo2

leman

macrumors Core
Oct 14, 2008
19,319
19,336
Anyway, the topic is whether MacOS+AS is more memory efficient, and my admittedly limited experience says it isn't.

There are different ways to define memory efficiency. The most popular is “using less RAM overall”, which I personally don’t find very useful because of how memory usage is measured. Different operating systems simply behave differently.
 

Sydde

macrumors 68030
Aug 17, 2009
2,557
7,059
IOKWARDI
My question was thus about whether there are cases in which some of the data in the GPU RAM either never was in, or could be deleted from, the CPU RAM. In those cases, having that separate GPU RAM would free up some CPU RAM.

For a scene, the GPU may do some off-screen composition, expanding shape maps and textures into output image data and compositing the elements. Also, the CPU may send text data to the GPU which is then expanded into font image data, which is usually considerably larger than the source text stream (and the font outline data does not have to reside in CPU memory after being copied over).
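As a rough sketch of that last point (plain OpenGL here, which is just an assumption for illustration; the macOS window server may handle it differently): once the data has been handed to the driver, the application can release its CPU-side copy.

Code:
#include <stdlib.h>
#include <OpenGL/gl3.h>   /* macOS OpenGL header; any GL 3+ loader would do */

/* Upload a texture and immediately free the CPU-side pixel buffer.
 * glTexImage2D copies the data out, so only the driver's copy
 * (typically in VRAM on a discrete card) remains afterwards. */
void upload_and_release(GLuint tex, int w, int h, unsigned char *pixels)
{
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, pixels);
    free(pixels);   /* this CPU RAM is no longer needed */
}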

When you get into heavy math jobs, though, I believe the situation is different: transient data will usually not be considerably larger than source data (I could be wrong, though, or it may vary a lot).

Interestingly, NVIDIA uses LPDDR5x in its Grace-Hopper superchip.

Grace-Hopper uses a semi-UMA approach.
 
  • Like
Reactions: pipo2 and theorist9

theorist9

macrumors 68040
May 28, 2015
3,714
2,820
Grace-Hopper uses a semi-UMA approach.
Their approach is interesting. This is partly a matter of semantics, but I'm not sure I'd even call it semi-UMA, since I'm not sure there is any contemporaneous shared access to the same memory locations by the CPU and GPU, which I think is the essence of UMA.

All I can tell is that their approach allows the GPU to utilize the CPU's LPDDR5x as an additional memory pool if it runs out of VRAM, giving it access to up to 480 + 144 = 624 GB RAM, less whatever the CPU is using (but it doesn't work the other way around—the CPU can't use the GPU's HBM3/3e VRAM).

That is a neat feature, to be sure. It effectively provides the GPU with a huge amount of video memory. I'd call it "expandable VRAM".

But in order for it to qualify even as semi-UMA, when the GPU does that, both the CPU and GPU would need to be able to access those memory locations in the CPU RAM where the GPU data is temporarily stored. Then, at least for that data, you'd have shared access. But I don't know if it works that way. It may be that, when the GPU uses the CPU's RAM, that RAM becomes reserved for the GPU only, and thus there is never any joint CPU/GPU access to a single pool of data.
 
Last edited:

MacInMotion

macrumors member
Feb 24, 2008
36
4
I don't think our opinions differ that much. And I fully agree with you that these terms can sometimes be useful in a technical discussion, as long as all the interlocutors understand the nuances. But I also think that these notions are potentially dangerous in a casual non-technical discussion for wider audiences (taking @MacInMotion's post for example), because they obfuscate the reality. Instead of discussing what is actually going on (which might be interesting and educational for a hobbyist curious about these things), labels perpetuate unhealthy myths and overzealous generalizations (like "RISC is low-power, CISC is high-power" or "integrated is slow, dedicated is fast"). Labels are easy, and they tend to get repeated a lot, which makes them seem "right". And the end effect is that people stop at the labels and don't bother learning the actual interesting effects hiding behind them.

First, let me point out that we are in agreement about all of the conclusions of my post except possibly the impact on memory usage caused by the GPU lacking dedicated VRAM.

My main point was that several people were making posts which seemed to be saying the question of whether an M* Mac uses more or less memory than an Intel Mac was a ridiculous question, and my post was an effort to explain why it was a legitimate question, regardless of the answer.

For all of that, it seems to have generated a lot of unnecessary pushback. Please let us all chill a bit.

Regarding RISC vs CISC, as I said, in agreement with @leman, the distinctions have significantly lessened over time. To me, the primary distinction is that RISC chips execute primarily (if not exclusively) one instruction per clock cycle and do not use microcode, whereas CISC chips make heavy use of microcode and multi-clock instructions. The poster child for this is the integer divide instruction, which ARM did not have until v7, by which point I will agree any clear-cut distinction between RISC and CISC was lost.

At the same time, I want to point out that my only reference to Intel as CISC and M*/ARM as RISC was to point out what those terms meant historically, and that, both historically and currently, the same source code compiles to machine language code of different sizes (historically, by a very large amount). The only other thing I said was that, in current practice, M* binaries tend to be smaller. I did not perpetuate any myths (healthy or not) or make any generalizations other than that, over the years, "RISC programs got shorter and CISC programs got longer". I don't think that is overzealous.


Regarding GPUs and VRAM, I stand corrected that the Intel integrated graphics hardware uses system memory. I was under the general impression that all Macs had some kind of dedicated graphics card with dedicated VRAM, and I thought that at the low end, say 8 GiB of system memory in the Intel Macs, even 2 GiB of VRAM moved to unified RAM would be noteworthy. So let's just say you need to check your current system's video hardware to make a better prediction of the impact on memory of moving to M*.


Regarding how VRAM is used, to the best of my knowledge, where a graphics card with VRAM is used and graphics "acceleration" is not disabled, the data in the VRAM is only mirrored in system RAM when the VRAM is full and buffers/pages need to be swapped out. In the general case (excluding some special cases such as photos and videos), drivers send graphics commands to the GPU and the GPU expands those commands into pixel buffers. For example, font definitions are uploaded to the GPU (in the form of Bézier curves) and then text is sent to the GPU as characters (1-4 bytes each) and the GPU renders the text in the font at the desired size, which is many more bytes than the text itself. Every open window is backed by such a buffer in VRAM, usually even multiple buffers (such as one for each embedded image in a web page, one for the scroll bar, one for the window frame, etc.). Window buffers are built by overlaying these buffers on top of each other, and monitor images (desktops) are built by overlaying window buffers on top of the desktop buffer. So one desktop can take up a lot more video memory than 3*pixel count.

On top of that, a single window may be rendered in multiple resolutions, e.g. one for the built-in Retina display, one for an external Full HD display, and one for an external 4K display, so that the windows can be dragged from display to display, or split across displays. Multiple external displays can mean multiple resolutions and multiple color profiles to render, and probably mean more open windows, too. Little to none of what is in VRAM should be duplicated in system RAM. I am 100% confident that the system never stores code in VRAM, and only stores non-graphics data in VRAM in very special cases, such as heavy-duty math (like Bitcoin mining) where the GPU can perform the required calculations much faster than the CPU (and in parallel).
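As a back-of-the-envelope illustration of how those buffers add up (the window size, the 4 bytes per pixel, and the per-display copies are assumptions made just for the arithmetic):

Code:
#include <stdio.h>

/* Rough arithmetic for the backing store of a single full-screen window,
 * assuming 4 bytes per pixel (BGRA). */
int main(void)
{
    double mib = 1024.0 * 1024.0;
    long retina = 2880L * 1800 * 4;   /* rendered for a 2880x1800 Retina panel */
    long uhd    = 3840L * 2160 * 4;   /* rendered again for a 4K external display */

    printf("Retina copy: %.1f MiB\n", retina / mib);          /* ~19.8 MiB */
    printf("4K copy:     %.1f MiB\n", uhd / mib);             /* ~31.6 MiB */
    printf("Both:        %.1f MiB\n", (retina + uhd) / mib);  /* ~51.4 MiB */
    return 0;
}

A few dozen windows like that, plus the per-element layers mentioned above, get into the gigabytes quickly.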

I only have usage information for myself, and expect I am on the high end of usage, but offer my data for whatever it is worth. My Intel Mac has 8 GiB of VRAM. It nearly always is using at least half of that, and it routinely gets to 98% full, at which point I have to assume it starts swapping VRAM out to system RAM, because the computer experiences frequent "freezes" of a few moments that I cannot attribute to anything else. So when I'm looking at switching to an M* Mac, I'm mentally reserving 8+ GiB of unified RAM for graphics.
 

pipo2

macrumors newbie
Jan 24, 2023
20
8
Awfully sorry.
The thumbs-ups I gave theorist9 and Sydde have nothing to do with the thread's subject, but everything to do with the mention of the Grace-Hopper superchip. I had no idea a chip was named after Grace Hopper. For me, this is a somewhat valuable find for an otherwise totally unrelated context.
No doubt accidental, nevertheless thank you very much!
 

mr_roboto

macrumors 6502a
Sep 30, 2020
777
1,668
What’s your opinion on memset/memcpy instructions?
I think they're a good idea with a lot of ways to get them wrong that could cause future pain. I don't know whether Arm's implementation of them has anything wrong, though. I'm not a real expert on the pitfalls; I've just seen enough discussions of them over the years to know that "here be dragons".

Microcode - probably not, but micro-ops, yes. I also wonder whether microcode is used much in modern x86 designs, it’s mostly about implementing legacy stuff, right?
re: not microcode but micro-ops - I suppose you can look at things that way. Post-decode instructions are never the same as pre-decode, and you might as well call them micro-ops. An example transformation is that in any uarch which renames registers, architectural register numbers are rewritten to point at the real register currently mapped to the architectural register.

Note that the rewritten register numbers need more bits of storage, since the real register file is bigger than the architectural register file. Most of the other transformations done during decode also expand the word. So, the name "micro-op" is a little ironic: they're always much larger in bit count than the original representation of the instruction, even if the CPU is not microcoded.
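To make that concrete, here is a toy sketch of a rename table (the register counts and the structure are made up for illustration, not any real microarchitecture):

Code:
#include <stdint.h>

#define ARCH_REGS 32    /* architectural registers, e.g. x0..x30 plus zero */
#define PHYS_REGS 192   /* the physical register file is much larger */

/* Toy register alias table: maps architectural register numbers to
 * physical register numbers. Note the mapped value needs 8 bits where
 * the architectural number only needed 5. */
typedef struct {
    uint8_t map[ARCH_REGS];
    uint8_t next_free;      /* stand-in for a real free list */
} renamer;

/* Sources just read the current mapping. */
static uint8_t rename_src(const renamer *r, uint8_t arch) {
    return r->map[arch];
}

/* Destinations get a fresh physical register, so older in-flight
 * instructions can still read the previous value. */
static uint8_t rename_dst(renamer *r, uint8_t arch) {
    uint8_t phys = r->next_free++ % PHYS_REGS;
    r->map[arch] = phys;
    return phys;
}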

But as far as I know, microcode isn't just for legacy in x86. x86 always permits the two operands of any ALU instruction to be one register and one memory reference. (Register-register is also allowed, of course.) Since the load and store units in modern x86 cores are separate from ALUs, most mem/reg ALU instructions have to be cracked into two uops, one to access the memory and the other to do the math. While it's encouraged to treat modern x86 as if it's a load/store architecture these days, those are still perfectly valid instructions and they do show up.
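As a conceptual illustration of that cracking (C standing in for the two uops; this is the split the decoder produces, not how a decoder is written):

Code:
#include <stdint.h>

/* x86:  add rax, [rbx]
 * is cracked into roughly:
 *   uop 1 (load/store unit):  tmp <- mem[rbx]
 *   uop 2 (ALU):              rax <- rax + tmp
 */
uint64_t add_reg_mem(uint64_t rax, const uint64_t *rbx)
{
    uint64_t tmp = *rbx;   /* uop 1: the memory access */
    rax = rax + tmp;       /* uop 2: the arithmetic */
    return rax;
}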

Microcode is so deep in the bones of x86 that in most implementations, all instructions can be patched with a microcode update, even ones that'd ordinarily decode to just one uop.

Unless I am misremembering, they should be comparable (especially if you take the cache size into account)? Around 3.5-4mm2?
M1 Firestorm core size including L1 cache and excluding AMX (which I think is fair) is about 2.3mm^2 according to Semi Analysis. A Zen 4 core + 1MB L2 is 3.84mm^2, according to AMD itself.

I do think it's fair to compare Z4 with L2 to M1 without. I have two arguments in favor of it. One is that in both cases we're including all the private cache not shared with any other core; Apple does cluster-level cache sharing at L2, AMD does it at L3. The other reason is that Apple has gigantic L1 caches, which is probably why they're able to have fewer levels of cache in their hierarchy. (M1: 128KB D + 192KB I. Zen 4: 32KB D + 32KB I.) Big L1 caches have a disproportionate impact on die area - L1 is almost always implemented using far less area-efficient SRAM cells than higher levels, since it has to simultaneously provide more access ports and operate at lower latency than higher levels of the cache hierarchy.

With such dramatic differences in L1 size, there's no way to make this comparison truly clean, but basically I think it's fair to include all core-local memory each design team thought was necessary.

All that said, you can kinda fudge the numbers if you like. Semi Analysis has the 12MB shared L2 cache for a M1 P cluster at 3.578mm^2, so if you want to add a 1MB SRAM penalty to M1, you could do 3.578/12 = 0.298mm^2. That number includes only the data array, no tags, but even if you double it Firestorm's still far smaller than a Zen 4 core.
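For what it's worth, plugging those quoted figures into a quick check (all numbers in mm^2, straight from the sources above):

Code:
#include <stdio.h>

int main(void)
{
    double firestorm    = 2.3;          /* M1 P-core incl. L1, excl. AMX */
    double zen4_plus_l2 = 3.84;         /* Zen 4 core + 1MB L2 */
    double l2_per_mb    = 3.578 / 12.0; /* M1 P-cluster L2 data array, per MB */

    /* Even charging Firestorm double the per-MB figure to cover tags etc. */
    printf("Firestorm + ~1MB SRAM: %.2f vs Zen 4: %.2f\n",
           firestorm + 2.0 * l2_per_mb, zen4_plus_l2);
    return 0;
}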

I meant instructions like ENTER/LEAVE etc.
I suppose, but those are not super high level in my eyes. Just shorthand for creating and destroying stack frames.
 
  • Like
Reactions: leman

TzunamiOSX

macrumors 65816
Oct 4, 2009
1,013
411
Germany
After using an M1 Mini for a while, I can say: 16 GB of M1 memory is effectively less than 16 GB of Intel memory, because you lose capacity to all the video "things".

For example: on my Mac Studio, Topaz Video AI is using 48 GB as video memory. I think it was 10 GB on my M1 Mini.
 
  • Like
Reactions: mlts22

Sydde

macrumors 68030
Aug 17, 2009
2,557
7,059
IOKWARDI
I don't know whether Arm's implementation of them has anything wrong, though.
The strict ARM specification does not include tailored instructions for those operations. They have to be implemented in software, as loops. Apple apparently built a single undocumented instruction pair for handling 1K-block memory compression, but there is no indication that they added any other string/memory block instructions, which would not be consonant with the ARM ethos.
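To show what I mean by loops, this is the shape of a purely software copy (byte-at-a-time for clarity; real library code copies in wider chunks and unrolls, but it is still a loop):

Code:
#include <stddef.h>

/* The kind of software loop a memcpy falls back to when there is no
 * dedicated copy instruction. */
static void *copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}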
 

name99

macrumors 68020
Jun 21, 2004
2,282
2,139
The strict ARM specification does not include tailored instructions for those operations. They have to be implemented in software, as loops. Apple apparently built a single undocumented instruction pair for handling 1K-block memory compression, but there is no indication that they added any other string/memory block instructions, which would not be consonant with the ARM ethos.
If you're going to comment on technology, I don't ask for much, but I do ask that you keep quiet if you don't know anything about the issue being discussed...
The issue in question is the CPYx and SETP instructions which ARE part of "strict ARM specifications", added as part of ARMv8.8A and 9.3A in 2021.

The reason that people are asking about these is specifically that not very much is known yet except that they exist. The best explanation, probably, is
and yeah, that's not much to go on!

If someone IS looking for info, you might be able to find it in the LLVM sources. ARM claims there is support for these instructions in LLVM15, but I can't find it! My preferred methodology would be to look at ARM patents and see if anything suggests what they have in mind, but I have more important patents to look at.

As for Apple custom instructions, the ones I can remember include (but are not limited to)
- the full AMX set (which changes every year)
- hardware page compression/decompression
- a 53b integer multiply instruction (exploits FP hardware)
- some NEON instructions to improve the FSE part of LZ-FSE
- various stuff to toggle Rosetta functionality (eg the TSO stuff)

There are *probably* also substantial Apple changes to how page tables and virtual machines are handled, allowing for a variety of page sizes. There's a collection of patents about this from a few years ago.
My guess, in terms of timing, is that this is present in M3, but will be announced as part of the Ultra/Extreme announcement, maybe at WWDC. Of course you can never know with timing – maybe it's in M3 but doesn't 100% work, was just there for initial testing? Maybe it was never meant for M3, only M4? Maybe it's for Apple internal use as part of some long-term project we will only understand in five years?
 

leman

macrumors Core
Oct 14, 2008
19,319
19,336
All you have to do is follow links – there is plenty to go on (the navburger at the upper left lets you browse the whole ISA).

It does seem like ARM provides a fair amount of information about these instructions, explicitly mentioning options A and B as something implementors have to choose. Maybe I misunderstood your previous post? It's not really clear to me what you mean by "they have to be implemented in software, as loops" in this context.
 

MrGunny94

macrumors 65816
Dec 3, 2016
1,113
651
Malaga, Spain
You can see a bottleneck when you start hooking up your M1/M2 Pro to a pair of 4K or 5K displays. In my case I have two Huawei MateViews, and my WindowServer usage ranges anywhere between 1.8 GB and 3 GB.

Pair that with a good amount of other memory usage and you'll get to 60-70% memory pressure.

This is my usage on a base model M2 Pro without any external displays.

[attached screenshot: memory usage]


Yes, I have 3 browsers, but each of them has different profiles.
 

name99

macrumors 68020
Jun 21, 2004
2,282
2,139
All you have to do is follow links – there is plenty to go on (the navburger at the upper left lets you browse the whole ISA).
Nice find. The places I looked for these instructions were much less helpful.
A big problem with ARM (and almost every other company with web-only documentation) is an endless trail of material of every possible version, with no natural way to find the one true latest version :-(
 

pshufd

macrumors G3
Oct 24, 2013
9,982
14,455
New Hampshire
Nice find. The places I looked for these instructions were much less helpful.
A big problem with ARM (and almost every other company with web-only documentation) is an endless trail of material of every possible version, with no natural way to find the one true latest version :-(

The Intel x86 architecture books were always a good one-stop place to get everything.
 