
theorist9

macrumors 68040
May 28, 2015
3,880
3,059
In most cases, the CPU and GPU are handling different data, so there is minimal swapping between system RAM and VRAM. There are other factors at play which preclude VRAM from being treated as additional system RAM. For starters, most dedicated video cards are actually running higher-spec DDR than modern CPUs support. For example, even a previous-generation Radeon 6700 XT runs GDDR6, while current AM5 CPUs only run DDR5 and Intel's 13th/14th-gen parts can run either DDR4 or DDR5 depending on the motherboard being used. Beyond the simple generational differences, there are significant differences between DDR and GDDR, including a wider data bus with GDDR and lower power consumption compared to desktop RAM. DDR is also optimized for latency (which is why system builders often talk about memory timings, aka the CLxx numbers) rather than bandwidth (which is what GDDR is designed for).
I think you misunderstood my question. I wasn't asking whether GDDR could be accessed by the CPU to bolster its RAM pool. I was asking something entirely different: whether there are cases in which the presence of the separate VRAM results in a reduction in the amount of data that needs to be stored in the CPU RAM.

I.e., most data that's in the VRAM needs to be copied to it from CPU RAM. Thus, for such data, the separate VRAM doesn't represent an effective increase in available memory, since having separate VRAM just means you need to have a duplicate copy. Here's a toy illustration of this:

Separate CPU and GPU RAM:
16 GB CPU RAM: Contains 12 GB CPU-only data, plus 4 GB that the CPU had to access to send to the GPU
4 GB GPU RAM: Contains the 4 GB copied over from the CPU

UMA:
16 GB Unified RAM: Contains 12 GB CPU-only data, plus 4 GB for the GPU

So while you have 4 GB extra RAM in the former case, that also requires that 4 GB of the data is duplicated, so there's no net gain.

My question was thus about whether there are cases in which some of the data in the GPU RAM either never was in, or could be deleted from, the CPU RAM. In those cases, having that separate GPU RAM would free up some CPU RAM.
These charts compare GDDR, DDR, and LPDDR in terms of both overall bandwidth and power efficiency. What's interesting to me is that LPDDR beats standard DDR in both bandwidth and power efficiency.
Interestingly, NVIDIA uses LPDDR5x in its Grace-Hopper superchip.
 
Last edited:
  • Like
Reactions: pipo2

leman

macrumors Core
Oct 14, 2008
19,516
19,664
Anyway, the topic is whether MacOS+AS is more memory efficient, and my admittedly limited experience says it isn't.

There are different ways to define what memory-efficiency is. The most popular is “using less RAM overall”, which I personally don’t find very useful due to how memory usage is measured. Different operating systems simply behave differently.
 

Sydde

macrumors 68030
Aug 17, 2009
2,563
7,061
IOKWARDI
My question was thus about whether there are cases in which some of the data in the GPU RAM either never was in, or could be deleted from, the CPU RAM. In those cases, having that separate GPU RAM would free up some CPU RAM.

For a scene, the GPU may do some off-screen composition, expanding shape maps and textures into output image data and compositing the elements. Also, the CPU may send text data to the GPU which is then expanded into font image data, which is usually considerably larger than the source text stream (and the font outline data does not have to reside in CPU memory after being copied over).
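To put rough numbers on that expansion, here is a toy back-of-the-envelope sketch; the glyph cell size and pixel format are illustrative assumptions, not measurements:

```c
#include <stdio.h>

/* Toy arithmetic: a character reaches the GPU as 1-4 bytes of text, but
   rasterizes into a glyph bitmap many times larger. The 24x32 cell and
   BGRA8 format are illustrative assumptions, not measured values. */
int main(void) {
    int glyph_w = 24, glyph_h = 32;   /* hypothetical glyph cell at 2x scale */
    int bytes_per_px = 4;             /* BGRA8 with anti-aliased alpha       */
    int glyph_bytes = glyph_w * glyph_h * bytes_per_px;

    printf("source text   : 1-4 bytes per character\n");
    printf("rendered glyph: %d bytes (%dx%d @ %d B/px)\n",
           glyph_bytes, glyph_w, glyph_h, bytes_per_px);
    printf("expansion     : roughly %dx over a 1-byte character\n", glyph_bytes);
    return 0;
}
```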

When you get into heavy math jobs, though, I believe the situation is different: transient data will usually not be considerably larger than source data (I could be wrong, though, or it may vary a lot).

Interestingly, NVIDIA uses LPDDR5x in its Grace-Hopper superchip.

Grace-Hopper uses a semi-UMA approach.
 
  • Like
Reactions: pipo2 and theorist9

theorist9

macrumors 68040
May 28, 2015
3,880
3,059
Grace-Hopper uses a semi-UMA approach.
Their approach is interesting. This is somewhat semantics, but I'm not sure I'd call it even semi-UMA, since I'm not sure if there is any concurrent shared access to the same memory locations by the CPU and GPU, which I think is the essence of UMA.

All I can tell is that their approach allows the GPU to utilize the CPU's LPDDR5x as an additional memory pool if it runs out of VRAM, giving it access to up to 480 + 144 = 624 GB RAM, less whatever the CPU is using (but it doesn't work the other way around—the CPU can't use the GPU's HBM3/3e VRAM).

That is a neat feature, to be sure. It effectively provides the GPU with a huge amount of video memory. I'd call it "expandable VRAM".

But in order for it to qualify even as semi-UMA, when the GPU does that, both the CPU and GPU would need to be able to access those memory locations on the CPU RAM where the GPU data is temporarily stored. Then, at least for that data, you'd have shared access. But I don't know if it works that way. It may be that, when the GPU uses the CPU's RAM, that RAM becomes reserved for the GPU only, and thus there is never any joint CPU/GPU access to a single pool of data.
 
Last edited:

MacInMotion

macrumors member
Feb 24, 2008
36
4
I don't think our opinions differ that much. And I fully agree with you that these terms can sometimes be useful in a technical discussion, as long as all the interlocutors understand the nuances. But I also think that these notions are potentially dangerous in a casual, non-technical discussion for wider audiences (taking @MacInMotion's post for example), because they obfuscate the reality. Instead of discussing what is actually going on (which might be interesting and educational for a hobbyist curious about these things), labels perpetuate unhealthy myths and overzealous generalizations (like "RISC is low-power, CISC is high-power" or "integrated is slow, dedicated is fast"). Labels are easy, and they tend to get repeated a lot, which makes them seem "right". And the end effect is that people stop at labels and don't bother learning the actual interesting effects hiding behind them.

First, let me point out that we are in agreement about all of the conclusions of my post, except possibly the impact on memory usage caused by the GPU lacking dedicated VRAM.

My main point was that several people were making posts which seemed to be saying the question of whether an M* Mac uses more or less memory than an Intel Mac was a ridiculous question, and my post was an effort to explain why it was a legitimate question, regardless of the answer.

For all of that, it seems to have generated a lot of unnecessary pushback. Please let us all chill a bit.

Regarding RISC vs CISC, as I said, in agreement with @leman, the distinctions have significantly lessened over time. To me, the primary distinction is that RISC chips primarily (if not necessarily exclusively) execute one instruction per clock cycle and do not use microcode, whereas CISC chips make heavy use of microcode and multi-clock instructions. The poster child for this is the Integer Divide instruction, which ARM did not have until v7, by which point I will agree any clear-cut distinction between RISC and CISC was lost.

At the same time, I want to point out that my only reference to Intel as CISC and M*/ARM as RISC was to point out what those terms meant historically and that both historically and currently, the same source code compiles to machine language code of different sizes, and historically it was by a very large amount. The only other thing I said was that in current practice, M* binaries tend to be smaller. I did not perpetuate any myths (healthy or not) or make any generalizations other than that, over the years, "RISC programs got shorter and CISC programs got longer". I don't think that is overzealous.


Regarding GPUs and VRAM, I stand corrected that the Intel integrated graphics hardware uses system memory. I was under the general impression that all Macs had some kind of dedicated graphics card with dedicated VRAM, and thought that at the low end (say, 8 GiB of system memory in an Intel Mac), even 2 GiB of VRAM moved to unified RAM would be noteworthy. So let's just say you need to check your current system's video hardware to make a better prediction of the impact on memory of moving to M*.


Regarding how VRAM is used, to the best of my knowledge, where a graphics card with VRAM is used and graphics "acceleration" is not disabled, the data in the VRAM is only mirrored in system RAM when the VRAM is full and buffers/pages need to be swapped out.

In the general case (excluding some special cases such as photos and videos), drivers send graphics commands to the GPU and the GPU expands those commands into pixel buffers. For example, font definitions are uploaded to the GPU (in the form of Bézier curves) and then text is sent to the GPU as characters (1-4 bytes each), and the GPU renders the text in the font at the desired size, which is many more bytes than the text itself.

Every open window is backed by such a buffer in VRAM, usually even multiple buffers (such as one for each embedded image in a web page, one for the scroll bar, one for the window frame, etc.). Window buffers are built by overlaying these buffers on top of each other, and monitor images (desktops) are built by overlaying window buffers on top of the desktop buffer. So one desktop can take up a lot more video memory than 3*pixel count.

On top of that, a single window may be rendered in multiple resolutions, e.g. one for the built-in Retina display, one for an external Full HD display, and one for an external 4K display, so that windows can be dragged from display to display, or split across displays. Multiple external displays can mean multiple resolutions and multiple color profiles to render, and probably mean more open windows, too. Little to none of what is in VRAM should be duplicated in system RAM. I am 100% confident that the system never stores code in VRAM, and only uses VRAM for non-graphics data in very special cases such as heavy-duty math (like Bitcoin mining) where the GPU can perform the required calculations much faster than the CPU (and in parallel).
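As a rough illustration of how this multiplies (resolutions, windows, layers), here is a back-of-the-envelope sketch; the window count, sizes, and pixel format are made-up assumptions:

```c
#include <stdio.h>

/* Back-of-the-envelope VRAM estimate for window backing stores.
   Assumes 4 bytes per pixel (BGRA8) and that each logical window keeps a
   copy per output resolution -- illustrative assumptions, not measurements. */
static double buffer_mib(int w, int h) {
    return (double)w * h * 4 / (1024.0 * 1024.0);
}

int main(void) {
    /* One 1920x1080-point window backed at three outputs. */
    double full_hd = buffer_mib(1920, 1080);   /* external Full HD, 1x scale */
    double retina  = buffer_mib(3840, 2160);   /* built-in display, 2x scale */
    double uhd     = buffer_mib(3840, 2160);   /* external 4K                */
    double per_window = full_hd + retina + uhd;

    printf("per window : %.1f MiB across three outputs\n", per_window);
    printf("20 windows : %.1f MiB before counting per-layer buffers\n",
           20 * per_window);
    return 0;
}
```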

I only have usage information for myself, and expect I am on the high end of usage, but offer my data for whatever it is worth. My Intel Mac has 8 GiB of VRAM. It nearly always is using at least half of that, and it routinely gets to 98% full, at which point I have to assume it starts swapping VRAM out to system RAM, because the computer experiences frequent "freezes" of a few moments that I cannot attribute to anything else. So when I'm looking at switching to an M* Mac, I'm mentally reserving 8+ GiB of unified RAM for graphics.
 

pipo2

macrumors newbie
Jan 24, 2023
24
9
Awfully sorry.
The thumbs-ups I gave theorist9 and Sydde have nothing to do with the thread (subject), but everything to do with the mention of the Grace-Hopper superchip. I had no idea a chip was named after Grace Hopper. For me, this is a somewhat valuable find for an otherwise totally unrelated context.
No doubt accidental, nevertheless thank you very much!
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
What’s your opinion on memset/memcpy instructions?
I think they're a good idea with a lot of ways to get them wrong that could cause future pain. I don't know whether Arm's implementation of them has anything wrong, though. Not a real expert on the pitfalls, I've just seen enough discussions of them over the years to know that "here be dragons".

Microcode - probably not, but micro-ops, yes. I also wonder whether microcode is used much in modern x86 designs, it’s mostly about implementing legacy stuff, right?
re: not microcode but micro-ops - I suppose you can look at things that way. Post-decode instructions are never the same as pre-decode, and you might as well call them micro-ops. An example transformation is that in any uarch which renames registers, architectural register numbers are rewritten to point at the real register currently mapped to the architectural register.

Note that the rewritten register numbers need more bits of storage, since the real register file is bigger than the architectural register file. Most of the other transformations done during decode also expand the word. So, the name "micro-op" is a little ironic: they're always much larger in bit count than the original representation of the instruction, even if the CPU is not microcoded.
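To make the "post-decode form is wider" point concrete, here's a toy rename-map sketch; the 192-entry physical register file and the naive allocator with no recycling are simplifications, not any real core's design:

```c
#include <stdint.h>
#include <stdio.h>

/* Toy register-rename sketch: 32 architectural registers mapped onto a
   larger physical register file. Purely illustrative; real renamers also
   track free lists, checkpoints, and flag registers. */
#define ARCH_REGS 32
#define PHYS_REGS 192   /* hypothetical physical register file size */

static uint16_t rename_map[ARCH_REGS]; /* arch reg -> phys reg */
static uint16_t next_free = ARCH_REGS; /* naive allocator, no recycling */

/* Rename "add xd, xn, xm": sources read the current mapping, the
   destination gets a fresh physical register. */
static void rename_add(int xd, int xn, int xm) {
    uint16_t pn = rename_map[xn];
    uint16_t pm = rename_map[xm];
    uint16_t pd = next_free++;
    rename_map[xd] = pd;
    /* 5-bit architectural numbers became 8-bit physical numbers: the
       post-decode form of the instruction is wider than the encoded form. */
    printf("add p%u, p%u, p%u\n", pd, pn, pm);
}

int main(void) {
    for (int i = 0; i < ARCH_REGS; i++) rename_map[i] = (uint16_t)i;
    rename_add(0, 1, 2);  /* add x0, x1, x2 */
    rename_add(3, 0, 0);  /* add x3, x0, x0 -- reads the renamed x0 */
    return 0;
}
```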

But as far as I know, microcode isn't just for legacy in x86. x86 always permits the two operands of any ALU instruction to be one register and one memory reference. (Register-register is also allowed, of course.) Since the load and store units in modern x86 cores are separate from ALUs, most mem/reg ALU instructions have to be cracked into two uops, one to access the memory and the other to do the math. While it's encouraged to treat modern x86 as if it's a load/store architecture these days, those are still perfectly valid instructions and they do show up.

Microcode is so deep in the bones of x86 that in most implementations, all instructions can be patched with a microcode update, even ones that'd ordinarily decode to just one uop.

Unless I am misremembering, they should be comparable (especially if you take the cache size into account)? Around 3.5-4 mm^2?
M1 Firestorm core size including L1 cache and excluding AMX (which I think is fair) is about 2.3mm^2 according to Semi Analysis. A Zen 4 core + 1MB L2 is 3.84mm^2, according to AMD itself.

I do think it's fair to compare Z4 with L2 to M1 without. I have two arguments in favor of it. One is that in both cases we're including all the private cache not shared with any other core. Apple does cluster-level cache sharing at L2, AMD does it at L3. The other reason is that Apple has gigantic L1 caches, which is probably why they're able to have fewer levels of cache in their hierarchy. (M1: 128KB D + 192KB I. Zen4: 32KB D + 32KB I.) Big L1 caches have a disproportionate impact on die area - L1 is almost always implemented using far less area efficient SRAM cells than higher levels, since it has to simultaneously provide more access ports and operate at lower latency than higher levels of the cache hierarchy.

With such dramatic differences in L1 size, there's no way to make this comparison truly clean, but basically I think it's fair to include all core-local memory each design team thought was necessary.

All that said, you can kinda fudge the numbers if you like. Semi Analysis has the 12MB shared L2 cache for a M1 P cluster at 3.578mm^2, so if you want to add a 1MB SRAM penalty to M1, you could do 3.578/12 = 0.298mm^2. That number includes only the data array, no tags, but even if you double it Firestorm's still far smaller than a Zen 4 core.
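If you want the fudge spelled out, it's just this arithmetic, using only the figures quoted above:

```c
#include <stdio.h>

/* Back-of-the-envelope area comparison from the figures quoted above
   (Semi Analysis for M1, AMD for Zen 4); all numbers are mm^2. */
int main(void) {
    double firestorm    = 2.3;          /* core + L1, excluding AMX     */
    double zen4_with_l2 = 3.84;         /* core + 1 MB private L2       */
    double l2_per_mb    = 3.578 / 12.0; /* 12 MB shared L2 data array   */
    double tags_fudge   = 2.0;          /* double it to cover tags etc. */

    printf("1 MB of M1 L2 data array: %.3f mm^2\n", l2_per_mb);
    printf("Firestorm + 1 MB L2 (data only): %.2f mm^2\n",
           firestorm + l2_per_mb);
    printf("Firestorm + 1 MB L2 (doubled for tags): %.2f mm^2 vs Zen 4 %.2f mm^2\n",
           firestorm + tags_fudge * l2_per_mb, zen4_with_l2);
    return 0;
}
```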

I meant instructions like ENTER/LEAVE etc.
I suppose, but those are not super high level in my eyes. Just shorthand for creating and destroying stack frames.
 
  • Like
Reactions: leman

TzunamiOSX

macrumors 65816
Oct 4, 2009
1,057
434
Germany
After using an M1 Mini for a while, I can say: 16 GB of M1 memory is less than 16 GB of Intel memory, because you lose capacity to all the video "things".

For example: On my Mac Studio, Topaz Video AI is using 48 GB as video memory. I think it was 10 GB on my M1 Mini.
 
  • Like
Reactions: mlts22

Sydde

macrumors 68030
Aug 17, 2009
2,563
7,061
IOKWARDI
I don't know whether Arm's implementation of them has anything wrong, though.
The strict ARM specification does not include tailored instructions for those operations. They have to be implemented in software, as loops. Apple apparently built a single undocumented instruction pair for handling 1K-block memory compression, but there is no indication that they added any other string/memory block instructions, which would not be consonant with the ARM ethos.
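For reference, the software fallback being described is just a copy/set loop; a deliberately naive sketch (real library versions copy in wide, aligned chunks and special-case small sizes):

```c
#include <stddef.h>

/* Deliberately naive byte-at-a-time loops -- the kind of software fallback
   used when no dedicated copy/set instructions are available. */
void *naive_memcpy(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--) *d++ = *s++;
    return dst;
}

void *naive_memset(void *dst, int c, size_t n) {
    unsigned char *d = dst;
    while (n--) *d++ = (unsigned char)c;
    return dst;
}
```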
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
The strict ARM specification does not include tailored instructions for those operations. They have to be implemented in software, as loops. Apple apparently built a single undocumented instruction pair for handling 1K-block memory compression, but there is no indication that they added any other string/memory block instructions, which would not be consonant with the ARM ethos.
If you're going to comment on technology, I don't ask for much, but I do ask that you keep quiet if you don't know anything about the issue being discussed...
The issue in question is the CPYx and SETP instructions which ARE part of "strict ARM specifications", added as part of ARMv8.8A and 9.3A in 2021.

The reason that people are asking about these is specifically that not very much is known yet except that they exist. The best explanation, probably, is
and yeah, that's not much to go on!

If someone IS looking for info, you might be able to find it in the LLVM sources. ARM claims there is support for these instructions in LLVM15, but I can't find it! My preferred methodology would be to look at ARM patents and see if anything suggests what they have in mind, but I have more important patents to look at.

As for Apple custom instructions, the ones I can remember include (but are not limited to)
- the full AMX set (which changes every year)
- hardware page compression/decompression
- a 53b integer multiply instruction (exploits FP hardware)
- some NEON instructions to improve the FSE part of LZ-FSE
- various stuff to toggle Rosetta functionality (eg the TSO stuff)

There are *probably* also substantial Apple changes to how page tables and virtual machines are handled, allowing for a variety of page sizes. There's a collection of patents about this from a few years ago.
My guess, in terms of timing, is that this is present in M3, but will be announced as part of the Ultra/Extreme announcement, maybe at WWDC. Of course you can never know with timing – maybe it's in M3 but doesn't 100% work, was just there for initial testing? Maybe it was never meant for M3, only M4? Maybe it's for Apple internal use as part of some long-term project we will only understand in five years?
 

leman

macrumors Core
Oct 14, 2008
19,516
19,664
All you have to do is follow links – there is plenty to go on (the navburger at the upper left lets you browse the whole ISA).

It does seem like ARM provides a fair amount of information about these instructions, explicitly mentioning options A and B as something implementors have to choose. Maybe I misunderstood your previous post? It's not really clear to me what you mean by "they have to be implemented in software, as loops" in this context.
 

MrGunny94

macrumors 65816
Dec 3, 2016
1,148
675
Malaga, Spain
You can see a bottleneck when you start hooking up your M1/M2 Pro to a pair of 4K or 5K displays. In my case I have two Huawei MateView displays and my WindowServer goes all the way from 1.8 GB to 3 GB.

Pair that with a good amount of memory usage and you'll get up to 60-70% memory pressure.

This is my usage on a base model M2 Pro without any external displays.

[Attached screenshot: memory usage on the base model M2 Pro]


Yes, I have 3 browsers, but each of them has different profiles.
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
All you have to do is follow links – there is plenty to go on (the navburger at the upper left lets you browse the whole ISA).
Nice find. The places I looked for these instructions were much less helpful.
A big problem with ARM (and almost every other company with web-only documentation) is an endless trail of material of every possible version, with no natural way to find the one true latest version :-(
 

pshufd

macrumors G4
Oct 24, 2013
10,145
14,571
New Hampshire
Nice find. The places I looked for these instructions were much less helpful.
A big problem with ARM (and almost every other company with web-only documentation) is an endless trail of material of every possible version, with no natural way to find the one true latest version :-(

The Intel x86 architecture books were always a good one-stop place to get everything.
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
Has this ever been confirmed?
I think if it existed (as opposed to hardware accelerated page compression, which does exist) there would be at least one somewhat relevant patent, and I have seen nothing.

There is a patent that's kinda-sorta relevant to cache compression (not the most general cache compression, but still useful, something analogous to a zero-content cache), but when I experimented on the M1 I saw no evidence of this.
Of course it could be something added later, contributing its 5% or whatever to general IPC boost at lower power; and you wouldn't see evidence for it unless you especially probed for weirdness.


OB RANT: I continue to be EXTREMELY disappointed with Geekerwan, James Aslan, and Chips&Cheese. They're happy to run their standard suite of (x86-inspired) micro-benchmarks on every new chip, but have ZERO interest in investigating any aspects of a new chip (Apple, ARM, QC, ...) that don't match the model of a generic 2018 or so CPU.
Andrei F at least had genuine curiosity, to see something strange in his results, say "that's funny", and make some effort to investigate.
 

casperes1996

macrumors 604
Jan 26, 2014
7,597
5,769
Horsens, Denmark
I think if it existed (as opposed to hardware accelerated page compression, which does exist) there would be at least one somewhat relevant patent, and I have seen nothing.
A few questions:

1) What exactly is meant by page compression? Are we talking compressing the contents of a page (if so, I would say that is exactly the same as generic memory compression, isn't it?) Or some form of compression of the page table structures? Seems like a very small win in that case.

2) For memory compression to be considered hardware accelerated, surely it's enough to just have general purpose compression hardware and then apply it to memory with the kernel knowing how to page fault, flag and decompress it, right? Does that tie back to question 1? Perhaps when you distinguish page compression and memory compression, you think of memory compression as completely opaque to software and memory not needing to be effectively paged out to compression and page faulted back in uncompressed?
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,308
A few questions:

1) What exactly is meant by page compression? Are we talking compressing the contents of a page (if so, I would say that is exactly the same as generic memory compression, isn't it?) Or some form of compression of the page table structures? Seems like a very small win in that case.

2) For memory compression to be considered hardware accelerated, surely it's enough to just have general purpose compression hardware and then apply it to memory with the kernel knowing how to page fault, flag and decompress it, right? Does that tie back to question 1? Perhaps when you distinguish page compression and memory compression, you think of memory compression as completely opaque to software and memory not needing to be effectively paged out to compression and page faulted back in uncompressed?
Page compression is what it sounds like, the compression of the contents of a page. The page is marked "invalid" in page tables, just like a page on disk, and when the page fault is handled, the page is decompressed.
Page compression was applied by NeXT to pixmaps, and then expanded by Apple to general pages. It generally provides 2 to 4x compression, with faster fault servicing than going to disk.
There are a variety of compressors available, and there is hardware for at least WKDM compression/decompression: https://patents.google.com/patent/US11500638B1 (the original Mac page compressor) and LZ https://patents.google.com/patent/US10331558B2 (a more aggressive compressor).

Memory compression works on the CACHE LINE level, not on the page level. No fault is taken when a compressed line is accessed, rather the line is simply decompressed on access. This obviously requires a simpler compressor but still gets 1.5x to 2x compression on "generic" use cases. Obviously if your memory is primarily filled with non-compressible material (eg compressed video, or LLM weights) you will not see much benefit, but "generically" there is a benefit.

I use standard terminology from the academic literature.
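To make the page-level flavor concrete, here's a deliberately toy sketch of compress-on-evict / decompress-on-fault; the "compressor" just drops trailing zero bytes and is nothing like WKdm, LZ, or Apple's actual implementation:

```c
#include <stdio.h>
#include <string.h>

/* Toy illustration of fault-driven page compression (not Apple's code).
   "Compression" here is just dropping trailing zero bytes; a real system
   uses WKdm or LZ-class codecs and larger (e.g. 16 KiB) pages. */
#define PAGE 4096

static unsigned char compressed[PAGE]; /* compressed pool slot */
static size_t comp_len;
static int resident;                   /* 0 = page faults on access */

static void page_out(const unsigned char *page) {
    size_t n = PAGE;
    while (n > 0 && page[n - 1] == 0) n--;   /* drop trailing zeros */
    memcpy(compressed, page, n);
    comp_len = n;
    resident = 0;
}

static void page_fault(unsigned char *page) {
    memcpy(page, compressed, comp_len);      /* "decompress"        */
    memset(page + comp_len, 0, PAGE - comp_len);
    resident = 1;
}

int main(void) {
    unsigned char page[PAGE] = {0};
    memcpy(page, "mostly-zero page", 16);
    page_out(page);
    printf("stored %zu of %d bytes in the compressed pool\n", comp_len, PAGE);
    memset(page, 0xAA, PAGE);                /* frame reused elsewhere        */
    page_fault(page);                        /* access triggers decompression */
    printf("first bytes after fault: %.16s (resident=%d)\n",
           (const char *)page, resident);
    return 0;
}
```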
 
  • Like
Reactions: casperes1996

casperes1996

macrumors 604
Jan 26, 2014
7,597
5,769
Horsens, Denmark
Page compression is what it sounds like, the compression of the contents of a page. The page is marked "invalid" in page tables, just like a page on disk, and when the page fault is handled, the page is decompressed.
Page compression was applied by NeXT to pixmaps, and then expanded by Apple to general pages. It generally provides 2 to 4x compression, with faster fault servicing than going to disk.
There are a variety of compressors available, and there is hardware for at least WKDM compression/decompression: https://patents.google.com/patent/US11500638B1 (the original Mac page compressor) and LZ https://patents.google.com/patent/US10331558B2 (a more aggressive compressor).

Memory compression works on the CACHE LINE level, not on the page level. No fault is taken when a compressed line is accessed, rather the line is simply decompressed on access. This obviously requires a simpler compressor but still gets 1.5x to 2x compression on "generic" use cases. Obviously if your memory is primarily filled with non-compressible material (eg compressed video, or LLM weights) you will not see much benefit, but "generically" there is a benefit.

I use standard terminology from the academic literature.
Thanks, and fair play. I just never even considered memory compression beyond page compression.
 

cassmr

macrumors member
Apr 12, 2021
58
62
OB RANT: I continue to be EXTREMELY disappointed with Geekerwan, James Aslan, and Chips&Cheese. They're happy to run their standard suite of (x86-inspired) micro-benchmarks on every new chip, but have ZERO interest in investigating any aspects of a new chip (Apple, ARM, QC, ...) that don't match the model of a generic 2018 or so CPU.
Andrei F at least had genuine curiosity, to see something strange in his results, say "that's funny", and make some effort to investigate.
A bit off topic, but is there a reviewer/writer/youtuber looking at chips these days that you do recommend?

I've enjoyed some of Geekerwan's stuff since I found it recently, merely because there hasn't seemed to be much content these days that even goes to that level.
 

throAU

macrumors G3
Feb 13, 2012
9,198
7,342
Perth, Western Australia
One thing I will add:

Compared to previous Intel machines, memory pressure seems much less noticeable on M series chips. I've had my M1 Pro well into the orange and didn't even notice - I forgot I'd left a second 4 GB VM running alongside a 6 GB VM on a 16 GB machine with a whole suite of other productivity apps open.

If I hadn't seen the orange memory pressure graph I never would have known.

An Intel MacBook Pro would have been screaming its fan and chugging at that point.

So, while RAM is RAM - in my experience the M series chips handle high memory usage scenarios MUCH more gracefully.
 

NJRonbo

macrumors 68040
Original poster
Jan 10, 2007
3,233
1,224
I am amazed at how many programs I can run at once on my M2 MacBook Air with 16 GB of RAM.

On an Intel machine, it would choke

There is something in the sauce that gives one more bang for the buck when it comes to RAM
 
  • Like
Reactions: throAU