
leman

macrumors Core
Oct 14, 2008
19,522
19,679
Clearly the GPU has RAM. It’s confirmed in the BIOS and the more I give it, the less RAM I have available for actual programs.

Again, that's the RAM that the system reserves for GPU operation. This is an OS-level abstraction, not a hardware-level one. How this plays out in the end depends on the driver and the OS. The OS could dedicate some of the RAM for GPU use and say "that's it, you are not getting any more", it could reserve a minimal amount but allow the driver to allocate more if needed, or there could be some arrangement in between. At any rate, regardless of what your system tells you, the Intel iGPU has full access to the entirety of the system RAM, as clearly shown by Vulkan capabilities reports.

For example: https://vulkan.gpuinfo.org/displayreport.php?id=10970#memory shows an Intel machine with 8GB of GPU-accessible memory. This memory is DEVICE_LOCAL_BIT (which means it can be accessed by the GPU directly, without an intermediate transport layer) as well as HOST_VISIBLE_BIT + HOST_COHERENT_BIT, which means that the CPU can also access this memory directly.
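
If you want to check this on your own machine instead of relying on gpuinfo.org, a rough sketch of the query that produces such a report looks like this (instance and physical-device setup omitted; physicalDevice is assumed to already exist):

Objective-C:
#import <Foundation/Foundation.h>
#include <vulkan/vulkan.h>

static void dumpMemoryTypes(VkPhysicalDevice physicalDevice) {
    // Ask the driver which memory heaps/types it exposes.
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(physicalDevice, &props);

    for (uint32_t i = 0; i < props.memoryTypeCount; i++) {
        VkMemoryPropertyFlags flags = props.memoryTypes[i].propertyFlags;
        VkDeviceSize heapSize = props.memoryHeaps[props.memoryTypes[i].heapIndex].size;
        NSLog(@"type %u: heap %llu MB, DEVICE_LOCAL=%d HOST_VISIBLE=%d HOST_COHERENT=%d",
              i, (unsigned long long)(heapSize >> 20),
              (flags & VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT) != 0,
              (flags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) != 0,
              (flags & VK_MEMORY_PROPERTY_HOST_COHERENT_BIT) != 0);
    }
}

On an Intel iGPU you should see a large DEVICE_LOCAL heap whose memory types are also HOST_VISIBLE and HOST_COHERENT, which is exactly what the linked report shows.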

Anyway, Apple on the M1 also reserves some of the RAM for GPU operation (this would probably show up as "kernel pinned memory"). After all, the GPU is critical to the user experience and you can't just swap out the main framebuffer because a website needs more RAM. That's exactly what "pinned" memory means — RAM reserved by the driver for core system operation, which cannot be moved or swapped under any circumstances.

But overall, macOS cares very little whether the RAM has been allocated for GPU use or not. I can totally imagine that a Windows system might prevent the GPU driver from grabbing too much RAM (I simply don't know whether they do), but I was unable to find such a limitation on macOS.
 

Ethosik

Contributor
Oct 21, 2009
8,142
7,120
Again, that's the RAM that the system reserves for GPU operation. This is an OS-level abstraction, not a hardware-level one. How this plays out in the end depends on the driver and the OS. The OS could dedicate some of the RAM for GPU use and say "that's it, you are not getting any more", it could reserve a minimal amount but allow the driver to allocate more if needed, or there could be some arrangement in between. At any rate, regardless of what your system tells you, the Intel iGPU has full access to the entirety of the system RAM, as clearly shown by Vulkan capabilities reports.

For example: https://vulkan.gpuinfo.org/displayreport.php?id=10970#memory shows an Intel machine with 8GB of GPU-accessible memory. This memory is DEVICE_LOCAL_BIT (which means it can be accessed by the GPU directly, without an intermediate transport layer) as well as HOST_VISIBLE_BIT + HOST_COHERENT_BIT, which means that the CPU can also access this memory directly.

Anyway, Apple on the M1 also reserves some of the RAM for GPU operation (this would probably show up as "kernel pinned memory"). After all, the GPU is critical to the user experience and you can't just swap out the main framebuffer because a website needs more RAM. That's exactly what "pinned" memory means — RAM reserved by the driver for core system operation, which cannot be moved or swapped under any circumstances.

But overall, macOS cares very little whether the RAM has been allocated for GPU use or not. I can totally imagine that a Windows system might prevent the GPU driver from grabbing too much RAM (I simply don't know whether they do), but I was unable to find such a limitation on macOS.

Not by default. Vulkan might do this by default, but not every piece of software uses it. And if I don’t, I need to explicitly state in code to use zero memory buffer.
 

leman

macrumors Core
Oct 14, 2008
19,522
19,679
I think we get it, but you are ignoring the fact that the iGPU in Intel SoCs is emulating a discrete GPU at the driver level. That means that it really isn't using zero copy at runtime but is acting more like a traditional dGPU. Unless you can show me where any Intel graphics driver uses zero copy, the mere fact that the hardware supports a broader use of RAM really isn't very interesting.

I'm not particularly knowledgeable about this stuff, so I'm prepared to be wrong, but nothing I've seen so far says that the graphics drivers for Intel iGPUs allow the whole RAM space to be used for zero-copy buffers. Like you said earlier, zero-copy buffers are more useful for compute than for custom graphics engines, which aren't likely to be written to specifically target Intel iGPUs.

It's not that the Intel iGPU is "emulating" a discrete GPU — it's simply that mainstream high-performance GPUs are discrete GPUs and the API is designed accordingly. Metal is the same. For example, the basic Metal API for creating a GPU-visible data buffer is this function:

Objective-C:
- (id<MTLBuffer>)newBufferWithBytes:(const void *)pointer
                             length:(NSUInteger)length
                            options:(MTLResourceOptions)options;

This function takes a pointer to the data already residing in the system memory. What it does is create a GPU buffer and copy the data into it. If you are running it on a dGPU, the buffer will be allocated in the GPU's own dedicated memory pool(1) and the data will be copied over the PCI-e bus. If you are running it on an iGPU (Intel or M1), the buffer will be allocated in the system memory and the data will be copied. You always get a copy, no matter whether you are running on an UMA system or not. Why? Because you ask the driver to manage the buffer, and it can only do that by taking ownership of the memory allocation. The only difference is that the copy on an UMA system is much faster, since it does not have to go through the low-bandwidth PCI-e channel.
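
To make the copying explicit, a minimal (illustrative, not from any real project) use of that API looks like this:

Objective-C:
#import <Metal/Metal.h>

id<MTLDevice> device = MTLCreateSystemDefaultDevice();

// Data prepared by the application in ordinary system memory.
float vertices[] = { 0.0f, 1.0f, -1.0f, -1.0f, 1.0f, -1.0f };

// Metal allocates its own storage and copies the bytes into it.
id<MTLBuffer> buffer = [device newBufferWithBytes:vertices
                                           length:sizeof(vertices)
                                          options:MTLResourceStorageModeShared];

// From here on the GPU works with Metal's copy; `vertices` can be reused or freed.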

Now, if you want to achieve real zero-copy behavior, you can use a different API on an UMA system:

Objective-C:
- (id<MTLBuffer>)newBufferWithBytesNoCopy:(void *)pointer
                                   length:(NSUInteger)length
                                  options:(MTLResourceOptions)options
                              deallocator:(void (^)(void *pointer, NSUInteger length))deallocator;

This basically lets you transfer an existing system memory allocation to the GPU driver, and if this is an UMA system, no copy is done — at all.
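
A minimal sketch of that no-copy path, for illustration: on macOS the pointer has to be page-aligned and the length a multiple of the page size (hence mmap rather than malloc below); treat the exact requirements as something to double-check in the MTLDevice docs.

Objective-C:
#import <Metal/Metal.h>
#import <sys/mman.h>
#import <unistd.h>

id<MTLDevice> device = MTLCreateSystemDefaultDevice();

size_t pageSize = (size_t)getpagesize();
size_t length   = 16 * pageSize;   // must be a multiple of the page size

// Page-allocated system memory that the application already owns.
void *storage = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);

// Hand the allocation over to the driver; on an UMA system no copy is made.
id<MTLBuffer> buffer = [device newBufferWithBytesNoCopy:storage
                                                 length:length
                                                options:MTLResourceStorageModeShared
                                            deallocator:^(void *pointer, NSUInteger len) {
                                                munmap(pointer, len);   // runs when the buffer is released
                                            }];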

The bottom line is that "zero copy" can mean different things. UMA systems can be "zero copy", but the regular use of GPU APIs is certainly not "zero copy", nor do you even want it to be "zero copy" most of the time. This is the case for Intel iGPUs, and this is the case for the M1. The real benefit of UMA (and yet another definition of "zero copy") is that you can access GPU-side memory allocations from the CPU with very low latency, which allows you to combine CPU and GPU work. Add to this Apple Silicon with its additional accelerators (like the NPU), all sharing the same RAM, and you can achieve some really interesting applications.

(1) I am massively oversimplifying this because it's far from being this straightforward, but you get the basic idea...

P.S. Other APIs, like Vulkan, give you more explicit control over memory, but also require you to manage all the memory manually. As usual, this is even more confusing in practice. For example, Nvidia GPU drivers will tell you that they have shared GPU+CPU memory (which they clearly do not), so Vulkan is far from perfect.
 

leman

macrumors Core
Oct 14, 2008
19,522
19,679
Not by default. Vulkan might do this by default, but not every piece of software uses it. And if I don’t, I need to explicitly state in code to use zero memory buffer.

Which is exactly how it works with Metal and M1. Was not your point to show that Intel GPUs are different from Apple GPUs?
 

jdb8167

macrumors 601
Nov 17, 2008
4,859
4,599
It's not that the Intel iGPU is "emulating" a discrete GPU — it's simply that mainstream high-performance GPUs are discrete GPUs and the API is designed accordingly. Metal is the same. For example, the basic Metal API for creating a GPU-visible data buffer is this function:

Objective-C:
- (id<MTLBuffer>)newBufferWithBytes:(const void *)pointer
                             length:(NSUInteger)length
                            options:(MTLResourceOptions)options;

This function takes a pointer to the data already residing in the system memory. What it does is create a GPU buffer and copy the data into it. If you are running it on a dGPU, the buffer will be allocated in the GPU's own dedicated memory pool(1) and the data will be copied over the PCI-e bus. If you are running it on an iGPU (Intel or M1), the buffer will be allocated in the system memory and the data will be copied. You always get a copy, no matter whether you are running on an UMA system or not. Why? Because you ask the driver to manage the buffer, and it can only do that by taking ownership of the memory allocation. The only difference is that the copy on an UMA system is much faster, since it does not have to go through the low-bandwidth PCI-e channel.

Now, if you want to achieve real zero-copy behavior, you can use a different API on an UMA system:

Objective-C:
- (id<MTLBuffer>)newBufferWithBytesNoCopy:(void *)pointer
                                   length:(NSUInteger)length
                                  options:(MTLResourceOptions)options
                              deallocator:(void (^)(void *pointer, NSUInteger length))deallocator;

This basically lets you transfer an existing system memory allocation to the GPU driver, and if this is an UMA system, no copy is done — at all.

The bottom line is that "zero copy" can mean different things. UMA systems can be "zero copy", but the regular use of GPU APIs is certainly not "zero copy", nor do you even want it to be "zero copy" most of the time. This is the case for Intel iGPUs, and this is the case for the M1. The real benefit of UMA (and yet another definition of "zero copy") is that you can access GPU-side memory allocations from the CPU with very low latency, which allows you to combine CPU and GPU work. Add to this Apple Silicon with its additional accelerators (like the NPU), all sharing the same RAM, and you can achieve some really interesting applications.

(1) I am massively oversimplifying this because it's far from being this straightforward, but you get the basic idea...

P.S. Other APIs, like Vulkan, give you more explicit control over memory, but also require you to manage all the memory manually. As usual, this is even more confusing in practice. For example, Nvidia GPU drivers will tell you that they have shared GPU+CPU memory (which they clearly do not), so Vulkan is far from perfect.
Thanks for this. I've been reading a few Stack Overflow posts (sometimes better than nothing) that contain a lot of arguments about whether DirectX and/or Vulkan allow for UMA zero-copy buffers, or whether they are "near" zero copy, which looks to be marketing speak for mostly one-copy.

So, do you know if the Metal newBufferWithBytesNoCopy function works on Intel iGPUs? If so, then I guess that settles it in my mind.
 

leman

macrumors Core
Oct 14, 2008
19,522
19,679
Thanks for this. I've been reading a few Stack Overflow posts (sometimes better than nothing) that contain a lot of arguments about whether DirectX and/or Vulkan allow for UMA zero-copy buffers, or whether they are "near" zero copy, which looks to be marketing speak for mostly one-copy.

So, do you know if the Metal newBufferWithBytesNoCopy function works on Intel iGPUs? If so, then I guess that settles it in my mind.

It works even on AMD GPUs… you get a behavior guarantee, not a performance guarantee. AFAIK the API does not promise a zero-copy implementation at all; it's more of an implicit contract.
 

jdb8167

macrumors 601
Nov 17, 2008
4,859
4,599
It works even on AMD GPUs… you get a behavior guarantee, not a performance guarantee. AFAIK the API does not promise a zero-copy implementation at all; it's more of an implicit contract.
Hmm. So even with Metal, it is possible that the Intel driver doesn't support UMA after all. Is there any way to find out?
 

Ethosik

Contributor
Oct 21, 2009
8,142
7,120
Which is exactly how it works with Metal and M1. Was not your point to show that Intel GPUs are different from Apple GPUs?

Do we know if Rosetta changes the way the code works with the GPU? I never said this also didn't apply to the M1. However, in my research I have not seen any place that shows the GPU-reserved memory on the M1, or that it even reserves any. Which makes me think Rosetta does something to change it.

And since this is an API call, Apple could just change the guts of the API on ARM Macs to just default to the zero buffer feature.

Apple is typically not in favor of running legacy software, but I can still run a DirectX 8 game on Windows 10 today with the latest updates.
 

Thomas Davie

macrumors 6502a
Jan 20, 2004
746
528
Panzer Battles: Normandy and Panzer Battles Kursk also work w/o issues (and under W10 x64 bottles in Crossover).

Tom
 

leman

macrumors Core
Oct 14, 2008
19,522
19,679
Hmm. So even with Metal, it is possible that the Intel driver doesn't support UMA after all. Is there any way to find out?

UMA is a hardware feature; it's not really about driver support. I suppose what you mean is that the driver might perform a copy even if one uses newBufferWithBytesNoCopy...

One can probably test it by measuring the time needed to synchronize CPU and GPU access to the same memory buffer, but I have no idea what kind of pitfalls one might find.
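
One crude check I can think of (just my guess, not something the documentation promises): create the buffer with newBufferWithBytesNoCopy and compare the pointer you passed in with what contents gives you back. If the driver quietly made its own copy, the two addresses should differ.

Objective-C:
#import <Metal/Metal.h>
#import <sys/mman.h>
#import <unistd.h>

id<MTLDevice> device = MTLCreateSystemDefaultDevice();

size_t length  = 16 * (size_t)getpagesize();   // page-sized allocation, as the API requires
void *original = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);

id<MTLBuffer> buffer = [device newBufferWithBytesNoCopy:original
                                                 length:length
                                                options:MTLResourceStorageModeShared
                                            deallocator:^(void *p, NSUInteger l) { munmap(p, l); }];

// Same address back means the driver is using our allocation directly.
NSLog(@"no-copy honored: %@", buffer.contents == original ? @"yes" : @"no");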

Do we know if Rosetta changes the way the code works with the GPU? I never said this also didn’t apply to the M1. I have not seen in my research however any place to see the GPU reserved memory, or that it even does. Which makes me think Rosetta does something to change it.

There is no evidence that Rosetta changes anything. Maybe some compatibility settings (e.g. for common misuse of render passes, but that has nothing to do with memory). You don't see GPU-reserved memory because Apple does not advertise it openly. But they definitely reserve some — otherwise you'd see GPU crashes and black screens in low-memory situations. It's called "pinned memory" in Apple terminology.

And since this is an API call, Apple could just change the guts of the API on ARM Macs to just default to the zero buffer feature.

What do you mean by "zero buffer feature"? Most Metal APIs that deal with memory are explicitly copying. There is no way Apple can change their behavior on ARM Macs without breaking applications. Not to mention that it would directly violate Apple's documentation of the API behavior.

As I mentioned before, zero-copy buffers (i.e., passing ownership of a system memory allocation to the GPU driver) are not a universally useful feature. Most of the time a copy is either perfectly fine or even desirable (because it gives the GPU an opportunity to optimize the data layout, e.g. for texture access). The core benefit of UMA instead is being able to do heterogeneous processing on the same data without the need to move it around.
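
To illustrate that last point with a toy example (mine, nothing official): the CPU writes into a shared buffer, the GPU copies it in a blit pass, and the CPU then reads the result straight back through contents, with no staging buffers or PCI-e style transfers anywhere.

Objective-C:
#import <Metal/Metal.h>
#import <string.h>

id<MTLDevice> device = MTLCreateSystemDefaultDevice();
id<MTLCommandQueue> queue = [device newCommandQueue];

// Both buffers live in memory that the CPU and the GPU can address directly.
id<MTLBuffer> src = [device newBufferWithLength:1024 options:MTLResourceStorageModeShared];
id<MTLBuffer> dst = [device newBufferWithLength:1024 options:MTLResourceStorageModeShared];

memset(src.contents, 0x42, 1024);   // CPU writes into GPU-visible memory

// GPU copies src into dst.
id<MTLCommandBuffer> cmd = [queue commandBuffer];
id<MTLBlitCommandEncoder> blit = [cmd blitCommandEncoder];
[blit copyFromBuffer:src sourceOffset:0 toBuffer:dst destinationOffset:0 size:1024];
[blit endEncoding];
[cmd commit];
[cmd waitUntilCompleted];

// CPU reads the GPU's output directly, no readback transfer needed.
NSLog(@"first byte after GPU copy: 0x%02x", ((unsigned char *)dst.contents)[0]);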
 

Ethosik

Contributor
Oct 21, 2009
8,142
7,120
Most Metal APIs that deal with memory are explicitly copying.
Then what is the entire point of unified memory architecture if it's still copying? I have seen developer documentation and reports saying that the M1 SoC can use any and all of the 8 or 16GB of RAM you have on your device, but it's still copying things? Then what is the point? And how much memory is available for that copy? Why is this so much more complicated than what I have now, with 64GB of system RAM and 8GB of dedicated GPU memory? If my 3D workflow is using too much GPU memory, I can get a GPU with more memory. That is easier to deal with than all of this mess.

I am not seeing any copying going on in the developer videos I have watched, or in this video (which shows one of the developer videos).


So this is getting way too confusing with the M1 and Apple Silicon Macs now. If I have a workflow that works just fine on 8GB of RAM and 8GB AMD GPU, should I just assume I need 16GB of Unified Memory? How much will the GPU take up for all of these copies? How much will be available for my applications? Will 8GB of RAM and 8GB of GPU just completely not work on 8GB of unified memory architecture? How much will my apps use for system RAM and how much will the M1 use for GPU tasks? Do I need to bump it to 16 and just assume it will be fine as it's 8 + 8?
 

leman

macrumors Core
Oct 14, 2008
19,522
19,679
Then what is the entire point of unified memory architecture if it's still copying? I have seen developer documentation and reports saying that the M1 SoC can use any and all of the 8 or 16GB of RAM you have on your device, but it's still copying things? Then what is the point? And how much memory is available for that copy?

I think you might be overthinking all this. UMA simply means that the CPU and GPU can access the same physical memory, as opposed to a dGPU where data has to be continuously transferred over the PCI-e bus to make changes visible to both. That's the quintessence of "zero copy". Traditional dGPUs make workflows where you want both the CPU and the GPU to access the data non-viable, as these background data transfers are slow and introduce a lot of latency, stalling both devices and costing you performance. UMA instead makes it trivial, giving you more options.

The API is a different matter altogether. Yes, basic GPU APIs will copy data. But it doesn't mean that you need more RAM or that your performance will suffer; it's just the steps you need to take to get the data where it has to be. As we discussed before, UMA does give you an option to avoid copying altogether, but again, it is only useful in a handful of very niche applications. For games, for example, the "zero-copy" API is of little interest, because games don't own memory allocations that they want to share with the GPU. They are usually streaming the data in, using driver-provided allocations. Again, the big difference is that if you stream the data to a dGPU, it has to go through the slow PCI-e bus. On an UMA system, streaming data simply means doing a fast system memory copy.

If this is confusing, blame the marketing and YouTubers such as MaxTech who have been actively spreading misinformation.

Why is this so much more complicated than what I have now, with 64GB of system RAM and 8GB of dedicated GPU memory? If my 3D workflow is using too much GPU memory, I can get a GPU with more memory. That is easier to deal with than all of this mess.

It's not more complicated. UMA is actually much, much simpler. If your GPU workflow needs 8GB RAM, well, that's the amount of system memory that will be reserved for it. If you need more, it will reserve more. If you need more yet, it will reserve even more and swap some other application data out. In fact, there is no need to think about the GPU at all. Think about application data instead. The application uses the GPU, and it needs a certain amount of memory for operation. Whether the contents of this memory are processed by the CPU or the GPU does not matter.

Now, dGPUs are actually much more messy. For example, GPU-side memory is often mirrored in the system RAM. So your 8GB of GPU data might very well eat 8GB of the system RAM as well. You might actually only have half of this data on the GPU, with the rest being streamed in from the system memory on demand. There is all kinds of stuff that could be going on in the background. But you simply have no way of knowing this, since all these details are hidden by the driver and the OS. There are no such complications with an UMA system, since there is only one memory pool to manage, and no need to make multiple copies of the data.

I am not seeing any copying going on in the developer videos I have watched, or in this video (which shows one of the developer videos).

Sorry, but how do you want to "see copying"? Your dGPU system is copying data in the background all the time, do you see that somehow? Furthermore, your CPU is copying data all the time, because that's how software works.

So this is getting way too confusing with the M1 and Apple Silicon Macs now. If I have a workflow that works just fine on 8GB of RAM and 8GB AMD GPU, should I just assume I need 16GB of Unified Memory? How much will the GPU take up for all of these copies? How much will be available for my applications? Will 8GB of RAM and 8GB of GPU just completely not work on 8GB of unified memory architecture? How much will my apps use for system RAM and how much will the M1 use for GPU tasks? Do I need to bump it to 16 and just assume it will be fine as it's 8 + 8?

Again, you are overthinking it. There is no memory that "GPU takes" and memory that "application takes". It's all the same. GPU memory is the application memory on an UMA system, and the rest is up to your application.

If I have a workflow that works just fine on 8GB of RAM and 8GB AMD GPU, should I just assume I need 16GB of Unified Memory?

You'd probably be fine with 8GB of UMA, to be honest. As I wrote before, a dGPU will often need to maintain copies of the data in system memory for various reasons, so your 8+8GB system gets much less effective total RAM than 16GB. The UMA system does not need to maintain any data copies, so you don't get any of that wastage.

The bottom line is that it's not UMA that's difficult to reason about; it's actually the dGPUs. With UMA, you always have a good sense of memory usage. On M1, if you allocate a 1GB GPU buffer, then that's how much memory you need (and this memory can be swapped etc. just like regular application memory). With a dGPU, it's all up to the driver and the OS. You might be getting a 1GB GPU-side allocation + 1GB system-side copy, or a 1GB GPU-side allocation and a 256MB system-side partial copy etc... you just have no way of knowing, because it's not communicated to you.
 

Colstan

macrumors 6502
Jul 30, 2020
330
711
Back to MrMacRight, and his videos, as featured in the original post. He recently released a video about the native M1 port of Baldur's Gate 3:


He goes into some details of how the game was ported, support for Metal and Apple's active participation in the process, how Steam is dragging down native ARM binaries, and the difficulties with low-end M1 systems. While this is still in beta, the game currently requires 10GB of unified memory with M1 machines. I think a lot of folks were overoptimistic about the "magic" of Apple Silicon's memory architecture, where you allegedly need significantly less memory than traditional computers. While Apple's accomplishment is impressive, with some tasks you simply need more RAM. BG3 is an example of that.

Regardless, this is still in beta, so performance requirements could certainly change. However, one of the Elverils devs (Larian's partner for the Mac port) posted this on Reddit:

Hey Elverils team here. Just want to stress that Patch 5 does not significantly change 8Gb situation vs patch 4. We’ve had numerous reports about Patch 4 memory stress on 8Gb devices. So frankly speaking Gpu perf on patch 5 is even better but memory pressure is bigger so yeah 8Gb devices are now in not so good shape.

But as pointed out this is one of the most important issues now for our team. We will spend time and resources to optimise the game. We have some time so please trust in us and don’t be upset too much. We’re working on the game to improve things. Cheers!
 

leman

macrumors Core
Oct 14, 2008
19,522
19,679
Back to MrMacRight, and his videos, as featured in the original post. He recently released a video about the native M1 port of Baldur's Gate 3:


He goes into some details of how the game was ported, support for Metal and Apple's active participation in the process, how Steam is dragging down native ARM binaries, and the difficulties with low-end M1 systems. While this is still in beta, the game currently requires 10GB of unified memory with M1 machines. I think a lot of folks were overoptimistic about the "magic" of Apple Silicon's memory architecture, where you allegedly need significantly less memory than traditional computers. While Apple's accomplishment is impressive, with some tasks you simply need more RAM. BG3 is an example of that.

Regardless, this is still in beta, so performance requirements could certainly change. However, one of the Elverils devs (Larian's partner for the Mac port) posted this on Reddit:

I think they are still tweaking these things. Besides, it's not like you need 16GB just to run the game; it's the high-quality assets that make it problematic. It's not like the game will run better at high settings on a Windows PC with 8GB RAM or anything.

There is a good chance that the game will be playable on high settings with 8GB Macs before long. There are plenty of optimization opportunities, from texture compression to on-demand texture streaming.
 

Ethosik

Contributor
Oct 21, 2009
8,142
7,120
Again, you are overthinking it. There is no memory that "GPU takes" and memory that "application takes". It's all the same. GPU memory is the application memory on an UMA system, and the rest is up to your application.
But you are saying a GPU process is issuing a copy in Metal. Which means there IS duplication going on (the definition of a copy) that an application is taking. And this duplication is specific to Metal/the GPU, which means the GPU is in fact "taking" some of the application memory space away. Which means part of the application memory will be doubled in UMA (again, the definition of a copy). So is 8GB enough, or do I need 16GB since it's 8 + 8?
 

Ethosik

Contributor
Oct 21, 2009
8,142
7,120
I think they are still tweaking these things. Besides, it's not like you need 16GB just to run the game; it's the high-quality assets that make it problematic. It's not like the game will run better at high settings on a Windows PC with 8GB RAM or anything.

There is a good chance that the game will be playable on high settings with 8GB Macs before long. There are plenty of optimization opportunities, from texture compression to on-demand texture streaming.
I agree. Things are all about optimization. Batman Arkham Knight required 12-16GB of RAM when it first launched, now it recommends 8.
 

leman

macrumors Core
Oct 14, 2008
19,522
19,679
But you are saying a GPU process is issuing a copy in Metal. Which means there IS duplication going on (the definition of a copy) that an application is taking. And this duplication is specific to Metal/the GPU, which means the GPU is in fact "taking" some of the application memory space away. Which means part of the application memory will be doubled in UMA (again, the definition of a copy). So is 8GB enough, or do I need 16GB since it's 8 + 8?

I think it's easiest to explain using an example.

Let's say I am writing a game and I have to load some textures to be used on the GPU. So what I can do is allocate some system memory, load my texture data from the disk into that memory, and then call a Metal API to copy that data into a Metal-managed texture storage.

At this point I indeed have two copies of texture data — one in the memory that I have allocated, and one in the memory that Metal has allocated. But! I don't really need to keep the two copies around, do I? After I have populated the texture data, I don't need my own local copy anymore, so I can safely release that memory. Or, I can use that memory to load the next texture from the disk and use it to create a new texture. In pseudocode:

Code:
allocate enough memory <data> to store texture data
for each texture file
  load texture data from the file into <data>
  create a new metal texture by copying the contents of <data>
done
free memory allocation bound to <data>

As you can see, after this is done, we have loaded all the textures and have no RAM duplication whatsoever. The <data> allocation is just a temporary area that you have to maintain before the texture contents reach their final destination. In a real system, of course, you will probably stream multiple textures simultaneously, so you'll use a larger area that can store multiple textures — but the principle remains the same.
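
For the curious, the same loop written against the actual Metal API might look roughly like this (a sketch: loadTextureFile, texturePaths and the fixed RGBA8 size are hypothetical stand-ins for whatever your asset pipeline provides):

Objective-C:
#import <Metal/Metal.h>
#import <stdlib.h>

// Hypothetical decoder: fills `scratch` with width * height RGBA8 pixels for one file.
extern void loadTextureFile(NSString *path, void *scratch, NSUInteger width, NSUInteger height);

id<MTLDevice> device = MTLCreateSystemDefaultDevice();
NSUInteger width = 1024, height = 1024;
void *scratch = malloc(width * height * 4);   // the temporary <data> area

for (NSString *path in texturePaths) {        // texturePaths: your (hypothetical) list of files
    loadTextureFile(path, scratch, width, height);

    MTLTextureDescriptor *desc =
        [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormatRGBA8Unorm
                                                            width:width
                                                           height:height
                                                        mipmapped:NO];
    id<MTLTexture> texture = [device newTextureWithDescriptor:desc];

    // Metal copies the pixels out of `scratch` into its own (possibly reorganized) storage.
    [texture replaceRegion:MTLRegionMake2D(0, 0, width, height)
               mipmapLevel:0
                 withBytes:scratch
               bytesPerRow:width * 4];
}
free(scratch);   // no duplicate copies left behind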

Now, if you are on an UMA system, you can prevent the copying altogether. You can create a Metal buffer (which will allocate driver-managed system memory), obtain the address of the buffer contents, and load the data directly into that address. Then you can create a texture using the buffer data. This will avoid the copy, but it's not necessarily a good thing. Loading textures like this will also prevent all kinds of GPU-specific optimizations. If you let Metal copy the data, that copy is owned by Metal and it can do all kinds of things without you noticing. For example, it can reorganize the data layout to make it more cache-friendly, or it can apply lossless compression to improve bandwidth.
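
And a sketch of the no-copy variant just described, where a shared Metal buffer serves as the texture's backing store (my illustration; the alignment rules you'd get from minimumLinearTextureAlignmentForPixelFormat: are glossed over, and decodePixelsInto is a hypothetical decoder):

Objective-C:
#import <Metal/Metal.h>

// Hypothetical decoder that writes RGBA8 pixels straight into the given destination.
extern void decodePixelsInto(void *destination);

id<MTLDevice> device = MTLCreateSystemDefaultDevice();
NSUInteger width = 1024, height = 1024, bytesPerRow = width * 4;

// Driver-managed system memory, directly visible to the CPU.
id<MTLBuffer> backing = [device newBufferWithLength:bytesPerRow * height
                                            options:MTLResourceStorageModeShared];

decodePixelsInto(backing.contents);   // load the data directly into GPU-visible memory

MTLTextureDescriptor *desc =
    [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormatRGBA8Unorm
                                                        width:width
                                                       height:height
                                                    mipmapped:NO];
desc.storageMode = MTLStorageModeShared;

// The texture aliases the buffer's memory, so no copy is made.
id<MTLTexture> texture = [backing newTextureWithDescriptor:desc
                                                    offset:0
                                               bytesPerRow:bytesPerRow];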

The bottom line is that we have to distinguish "zero copy" as a fundamental hardware feature (i.e., the CPU and GPU can access the same physical memory without the need to copy data behind the scenes) vs. "zero copy" as an API feature. Most uses of the API do involve copying data, which is perfectly normal. The true value of UMA is not in preventing API-driven data copies, but in preventing behind-the-scenes copies between different memory pools. If you want to do some GPU-side processing and then access the results on the CPU, well, on an UMA system (be it Intel or M1) this is instantaneous, because the data is in memory directly accessible by both the CPU and the GPU. On a system with a dGPU, however, you have to first copy the data from the GPU memory to the system RAM where the CPU can access it, costing you time and PCI-e bandwidth. And that last bit holds regardless of the API abstraction. The API might well present you with a "zero-copy" interface, but it will actually be doing PCI-e transfers in the background.
 

Ethosik

Contributor
Oct 21, 2009
8,142
7,120
I think it's easiest to explain using an example.

Let's say I am writing a game and I have to load some textures to be used on the GPU. So what I can do is allocate some system memory, load my texture data from the disk into that memory, and then call a Metal API to copy that data into a Metal-managed texture storage.

At this point I indeed have two copies of texture data — one in the memory that I have allocated, and one in the memory that Metal has allocated. But! I don't really need to keep the two copies around, do I? After I have populated the texture data, I don't need my own local copy anymore, so I can safely release that memory. Or, I can use that memory to load the next texture from the disk and use it to create a new texture. In pseudocode:

Code:
allocate enough memory <data> to store texture data
for each texture file
  load texture data from the file into <data>
  create a new metal texture by copying the contents of <data>
done
free memory allocation bound to <data>

As you can see, after this is done, we have loaded all the textures and have no RAM duplication whatsoever. The <data> allocation is just a temporary area that you have to maintain before the texture contents reach their final destination. In a real system, of course, you will probably stream multiple textures simultaneously, so you'll use a larger area that can store multiple textures — but the principle remains the same.

Now, if you are on an UMA system, you can prevent the copying altogether. You can create a Metal buffer (which will allocate driver-managed system memory), obtain the address of the buffer contents, and load the data directly into that address. Then you can create a texture using the buffer data. This will avoid the copy, but it's not necessarily a good thing. Loading textures like this will also prevent all kinds of GPU-specific optimizations. If you let Metal copy the data, that copy is owned by Metal and it can do all kinds of things without you noticing. For example, it can reorganize the data layout to make it more cache-friendly, or it can apply lossless compression to improve bandwidth.

The bottom line is that we have to distinguish "zero copy" as a fundamental hardware feature (i.e., the CPU and GPU can access the same physical memory without the need to copy data behind the scenes) vs. "zero copy" as an API feature. Most uses of the API do involve copying data, which is perfectly normal. The true value of UMA is not in preventing API-driven data copies, but in preventing behind-the-scenes copies between different memory pools. If you want to do some GPU-side processing and then access the results on the CPU, well, on an UMA system (be it Intel or M1) this is instantaneous, because the data is in memory directly accessible by both the CPU and the GPU. On a system with a dGPU, however, you have to first copy the data from the GPU memory to the system RAM where the CPU can access it, costing you time and PCI-e bandwidth. And that last bit holds regardless of the API abstraction. The API might well present you with a "zero-copy" interface, but it will actually be doing PCI-e transfers in the background.

Thanks, that example made it clearer. So it's not really a copy, since we can free up the memory after it "moves" it. So it's only copied for a brief time, right?
 

Ethosik

Contributor
Oct 21, 2009
8,142
7,120

Yeah, terms like "copy" make me think it will be double the space for the lifetime of the given object or resource.

What is essentially happening here is that the resource is undergoing a cut or move operation, similar to file shares. It's copied to the destination and then deleted.
 

poked

macrumors 6502
Nov 19, 2014
267
150
Can it play Path of Exile? Can anyone do a benchmark test, if they've gotten the M1 chip compatible with it yet?
 

Thomas Davie

macrumors 6502a
Jan 20, 2004
746
528
Playing Fallout 3 on my iPad Pro, Sidecar'ed from an M1 MBA running Windows 10 ARM64.



Just to see if I could. And I can. No lag that I can feel (local network is rock solid at 320 Mbps).

Tom
 

ha1o2surfer

macrumors 6502
Sep 24, 2013
425
46
It is determined in the BIOS. And Intel DOES have the feature to use shared RAM, but it requires a programming change. Besides, Intel iGPUs are horrible and NVIDIA/AMD have their own memory.
Hmm, I have it set to 512MB; there's only a minimum setting, no maximum.
 