So true, I can't believe they are even making discrete cards of these as well for 11th Gen.
I've seen a couple of early reviews of the dedicated Xe cards, the performance is laughable to say the least.
I have too. I seriously wonder why they want to go through all the trouble of producing these given how poor they are. I mean, they fall somewhere between an old GT 1030 and a GTX 1050.
For operations where the data has to be manipulated by both the CPU and iGPU, it is copied twice across the system bus into RAM (once per partition), and the system then has to reconcile the two sets of data when they are passed back from RAM, which adds processing time. With the UMA setup the M1 uses, both the CPU and iGPU can access the full system RAM simultaneously (i.e., there is no partitioning of the RAM between CPU and GPU). This means that data is only copied to RAM once, and since all operations can happen simultaneously, there is no overhead associated with reconciling two versions of the same data once they are passed back to the CPU from RAM.
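To make the two models concrete, here is a minimal Metal sketch (buffer sizes and contents are placeholders, and this is only an illustration of the two storage modes, not of any particular app): a managed buffer keeps separate CPU-side and GPU-side copies that have to be reconciled explicitly, while a shared buffer on a UMA machine is a single allocation that both processors see.

```swift
import Foundation
import Metal

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let length = 4096

// Managed mode: two copies of the data, reconciled on demand.
let managed = device.makeBuffer(length: length, options: .storageModeManaged)!
memset(managed.contents(), 0xAB, length)
managed.didModifyRange(0..<length)            // push the CPU-side copy to the GPU side

// ... after the GPU writes into `managed` ...
let cmd = queue.makeCommandBuffer()!
let blit = cmd.makeBlitCommandEncoder()!
blit.synchronize(resource: managed)           // pull the GPU-side copy back for the CPU
blit.endEncoding()
cmd.commit()
cmd.waitUntilCompleted()

// Shared mode on a UMA system: one allocation, no reconciliation step.
let shared = device.makeBuffer(length: length, options: .storageModeShared)!
memset(shared.contents(), 0xCD, length)       // the GPU sees this without any sync pass
```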
x86 does NOT use unified memory for system RAM.
UMA refers to the RAM setup in the system, not CPU cache.
With the x86 platform, the system partitions the RAM into a CPU section and an iGPU section. For the iGPU, the system usually allocates around 2GB for GPU operations, meaning that on an 8GB machine only about 6GB are available to the CPU.
One thing you have to realize is that Intel's use of the term "UMA" is misleading. For Intel's purposes, they just renamed Intel HD to UMA, but made no changes to the underlying architecture. On the other hand, Apple's approach is essentially what AMD has been trying to do for years with the development of their Infinity Fabric technology for Ryzen-series CPUs.
Intel still partitions the CPU and GPU RAM, so you have to copy data from the CPU side to the GPU side and vice versa, as described above.
My understanding after re-skimming the video (it’s been a few months since I last watched it) is that this doesn’t necessarily impact the amount of video memory needed all that much, but rather the pressure placed on memory bandwidth.
I still need X MB for a texture of a given size, and X MB for the frame buffer, in either design. However, TBDR reduces how often you need to reach out to (V)RAM, especially in situations where you need to make multiple passes. It *might* reduce intermediate buffers a little, but that assumes intermediate buffers are a noticeable contribution compared to the other buffers in use. My understanding was that the back buffer itself was used as the intermediate buffer, so I am a bit skeptical that there are big gains to be had there. Draw-to-texture seems to be common these days, so there might be more than I expect, assuming these scenarios can all be done at the tile level rather than on a texture.
The claim that both Intel and Apple GPUs can address the entirety of the system RAM is not true, at least on macOS. The Intel GPU cannot address all of the system RAM. The area of memory it can address is static, and you have to reboot if you want to change the size of the addressable area. Even if you manually set that value very high, the Intel GPU driver on macOS has a hard-coded upper limit of 2GB and will not use any memory beyond that (I don't know if that has changed now).
By zero copy, we mean that no buffer copy is necessary since the physical memory is shared.
Which means Intel's UMA is available only for buffers, not textures: if a texture is accessed frequently by both CPU and GPU, you need to keep two copies, one in each "memory space", and sync the changes, which uses extra memory as a result. On iOS and tvOS, where all hardware features an Apple GPU, textures and buffers are both shared between CPU and GPU. Apple's documentation still states that shared mode is only available for buffers on macOS. I don't know if they are doing anything special for Apple Silicon Macs, but Apple Silicon should have the capability to share textures, because it is already available on iOS.
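As a rough Metal sketch of that buffer-versus-texture distinction (the architecture check, sizes, and pixel format below are just illustrative assumptions):

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!

// Buffers can use .shared storage on macOS regardless of whether the GPU is Intel or Apple.
let sharedBuffer = device.makeBuffer(length: 1 << 20, options: .storageModeShared)

// Textures are the sticking point: on Intel Macs they historically had to be .managed
// (the driver keeps two synchronized copies), while Apple GPUs (iOS, Apple Silicon) allow .shared.
let desc = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .rgba8Unorm,
                                                    width: 1024,
                                                    height: 1024,
                                                    mipmapped: false)
#if arch(arm64)
desc.storageMode = .shared       // Apple GPU: CPU and GPU see the same texture memory
#else
desc.storageMode = .managed      // Intel GPU on macOS: two copies kept in sync by the driver
#endif
let texture = device.makeTexture(descriptor: desc)
```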
@dmcloud I am afraid your understanding of these things is a bit incomplete. I will do my best to clarify some points below. Most importantly: yes, Intel's CPU and GPU access the same physical memory, and their implementation is a fully cache-coherent UMA. Your claims about Intel's implementation are unfortunately only speculation that is directly refuted by technical documentation and API behavior. I assume that AMD's iGPUs are also full UMA, but I haven't looked into the technical documentation there, so I can't comment.
I think this is where the true source of confusion is. A UMA system can achieve zero-copy data sharing between CPU and GPU. It does not mean that it is always zero-copy. Our graphics and compute APIs work with driver-managed buffers. The usual programming model is that the data comes from somewhere and gets copied into a buffer for GPU use. This is no different in Metal: when you create a buffer or a texture, you will usually incur a memory copy. On a system with a dGPU, this copy is mandatory anyway.
However, both Apple and Intel offer means to avoid the copy. Metal has a way to use a CPU-side memory allocation as a buffer on UMA systems. Intel offers very similar functionality (@mi7chy has provided a link in their post). Furthermore, the restrictions on both platforms are very similar: the data has to be page-aligned and its size has to be aligned, which again tells us that the technical implementation is very similar.
To sum it up: you need to do some extra work on a UMA system to avoid the copy. Using the API as normal will still require you to copy your data.
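For reference, here is roughly what that "extra work" looks like in Metal. This is a sketch: the page-alignment requirements are the point, everything else (sizes, data) is placeholder.

```swift
import Foundation
import Metal

let device = MTLCreateSystemDefaultDevice()!

// The zero-copy path: wrap an existing page-aligned CPU allocation as a MTLBuffer.
let pageSize = Int(getpagesize())
let length = 4 * pageSize                        // size must be a multiple of the page size

var raw: UnsafeMutableRawPointer? = nil
guard posix_memalign(&raw, pageSize, length) == 0, let pointer = raw else {
    fatalError("page-aligned allocation failed")
}

// No bytes are copied here: the GPU works on the same physical pages the CPU writes.
let zeroCopyBuffer = device.makeBuffer(bytesNoCopy: pointer,
                                       length: length,
                                       options: .storageModeShared,
                                       deallocator: { ptr, _ in free(ptr) })

// The "API as normal" path for comparison: makeBuffer(bytes:...) copies the data
// into a driver-managed allocation even on a UMA system.
var input = [Float](repeating: 1.0, count: 1024)
let copiedBuffer = device.makeBuffer(bytes: &input,
                                     length: input.count * MemoryLayout<Float>.stride,
                                     options: .storageModeShared)
```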
First and foremost, x86 is a processor architecture. It does not mandate a GPU setup. There are different implementations of the x86 platform, with different characteristics. Some are UMA, some are not.
UMA refers to everything. The RAM setup alone is not enough: you need to synchronize GPU and CPU caches on common memory accesses or you risk getting garbage (e.g., the CPU not seeing the GPU's latest writes). This is what "cache coherency" means. In both Apple's and Intel's current implementations, this is partly achieved by having the CPU and the GPU share the last-level cache (the SoC-level cache, which Intel also calls the L3 cache). The CPU and the GPU share the same memory subsystem (memory controllers etc.) and can address exactly the same RAM.
This is just driver behavior. Some memory has to be reserved for GPU-only operation. Apple also does the same on M1 machines (RAM for the frame buffers and various GPU-internal state, plus some overhead for the window manager). It has nothing to do with the ability of the GPU to address physical RAM. Both Intel and Apple GPUs are able to address the entirety of the system RAM.
As I explained above, Intel has offered a full UMA implementation for many years now. Apple didn't invent UMA. We do not know what the difference between Intel's and Apple's implementations is. I have a suspicion that in Apple's implementation virtual memory is fully under OS control (all M1 processors share the same page table), so the OS can very efficiently swap GPU memory when necessary (it is unclear whether an Intel-based system can do the same). And yes, it's very similar in spirit to Infinity Fabric. Again, Intel has had this implemented in their iGPUs for a couple of years now; it's just that their implementation is not particularly performant or scalable.
By the way, the one really mismarketing UMA is Nvidia. They claimed to have "CUDA UMA" some time ago, but it's just BS.
These quotes do describe some older implementations of iGPUs (roughly 10 years old). Again, that's not how more recent Intel GPUs (Sandy Bridge and up) work.
How is the AnandTech article on the M1 supposed to support the claim you are making? What I am asking is: do you have any factual evidence, or a reference to a source presenting such factual evidence, that GPU memory allocation works differently on Intel and Apple GPUs? Both use a unified memory architecture with the last-level cache shared between CPU and GPU. The M1 definitely reserves some memory for GPU use, although I am unsure how much.
Go look at any Intel laptop on the shelf at Best Buy, Office Max/Staples, etc. If you check the system settings, it will show the amount of available system RAM (i.e., the RAM NOT partitioned off for the iGPU) as right around 1.5-2GB less than the installed system RAM. That's because Intel still partitions off the iGPU RAM separately from the RAM allocated to CPU operations. Even on my 10th Gen MSI gaming laptop, it shows 16.0 GB installed RAM with 14.8 GB usable, and this is a system with both an iGPU and a dedicated GPU.
Just read the AnandTech article on the M1...
Disable iGPU in BIOS.
It only requires a few megabytes to run a display but GPUs are also used for machine learning, pixel arrays, and specialized calculations. It really depends on the software you are using.
Look around in the prefs and see if you can spot GPU settings. For example, Photoshop can make heavy use of GPUs but it can also be disabled.
I assumed it was a Rosetta app and I assumed it would be duplicating buffers and arrays and such things. I don't actually know how Apple deals with that internally, though, so I can see how I would be wrong.
I fail to see an upside. Using the GPU could greatly improve Photoshop performance. With unified memory in the M1, the data being accessed by the GPU would be exactly the same memory that Photoshop would use the CPU to work on. Actual memory usage would not change significantly, if at all, so what is there to gain?
I must agree with @leman that your understanding is a bit incomplete. He's not disagreeing with what you're saying above. AFAIK the newer Intel UMA chips really do partition some RAM off for the iGPU.
Software written before that API was introduced is not able to use the functionality, so there still needs to be the notion of GPU-exclusive memory. The UMA capabilities are completely optional.
Howdy majormike,
Hi,
How much RAM does the M1 take up from the memory pool to run the GPU?
I want to get it for music production, and I'd be happy with 16 gigs, but I'm unsure as to how much memory the GPU uses, especially with an external display.
You are exactly spot on. TBDR reduces the number of data accesses in fragment shaders, which means a reduction in required bandwidth. You still need the RAM to hold the data though. There are of course things like texture compression etc. and Apple's support here is industry-leading.
In regards to intermediate buffers, TBDR makes it possible to eliminate them in many common cases. Since all processing occurs in tile memory, you can mix multiple shader invocations (compute and fragment) that apply different processing effects to the tile data without the need to move it to main memory. You only need to store it if you are using this data for something else entirely (e.g. a sharpening pass), but even then you can store only what you need and eliminate other intermediate buffers. It also allows completely new applications, such as fast vector rendering using compute shaders. I don't know of any API other than Metal that would allow you to chain compute shaders that share on-chip memory.
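One concrete (if simplified) example of dropping an intermediate buffer on an Apple GPU is a memoryless attachment: it lives only in tile memory and never gets a system-memory allocation at all. The pixel format and dimensions below are just placeholders.

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!

// Intermediate G-buffer-style attachment that only ever lives in tile memory.
let gbufferDesc = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .rgba16Float,
                                                           width: 1920,
                                                           height: 1080,
                                                           mipmapped: false)
gbufferDesc.usage = .renderTarget
gbufferDesc.storageMode = .memoryless          // no backing allocation in system RAM

let gbuffer = device.makeTexture(descriptor: gbufferDesc)!

let pass = MTLRenderPassDescriptor()
pass.colorAttachments[0].texture = gbuffer
pass.colorAttachments[0].loadAction = .clear
pass.colorAttachments[0].storeAction = .dontCare   // consumed in-tile by later stages,
                                                   // never written back to (V)RAM
```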
The point is that the "GPU-exclusive" memory in this case is an artificial construct. The memory is still allocated in the CPU-visible area and both CPU and GPU have equal access to it. It's just that you as a user don't have the pointer to the data, so you can't write to it without using the appropriate APIs. This is the same on Apple GPUs as well — when you create a texture using the standard API, Metal will reserve the texture data in the system unified memory, but your application won't have direct access to that memory.
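In Metal terms that looks roughly like this (sizes and format are placeholders): the app never gets a raw pointer into the texture's storage, and updates go through the API, which is where the copy happens.

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!

// Standard-API texture: the storage is allocated and owned by Metal in unified memory.
let desc = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .rgba8Unorm,
                                                    width: 256,
                                                    height: 256,
                                                    mipmapped: false)
let texture = device.makeTexture(descriptor: desc)!

// There is no contents() accessor on a texture; updates go through API calls like
// replace(), and the copy into the texture's storage happens inside the driver.
var pixels = [UInt8](repeating: 255, count: 256 * 256 * 4)
texture.replace(region: MTLRegionMake2D(0, 0, 256, 256),
                mipmapLevel: 0,
                withBytes: &pixels,
                bytesPerRow: 256 * 4)
```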
The bit about intermediate buffers is where I get a bit fuzzy. Mostly because I’ve been out of the loop long enough that I couldn’t really say how much memory gets spent on these sorts of things, and how much TBDR actually saves.
Although since you bring up vector rendering, I can’t help but think Apple has been chomping at the bit for such functionality. Is that really that new? I know there are definite wins if you can avoid rasterizing CoreGraphics draw calls on the CPU or to textures for later compositing, but for some reason I thought that was doable a while ago. But maybe I’m mis-reading your comment here.