
leman

macrumors Core
Oct 14, 2008
19,515
19,661
You wouldn't. But several replies back you put forward how cache and TBDR were going to solve the problem.
It doesn't matter that they can't exactly match bandwidths, because they need less bandwidth. The problem here is that the cache isn't going to be effective at mitigating bandwidth if you are not using it. Same issue with TBDR: it has next to no traction here because this is a display controller issue, not a "tile" (or tile cache) issue.

I think you might be overestimating the bandwidth needs for driving displays. In a typical demanding graphics workload (e.g. a game), the real bandwidth problem is fetching texture data. And TBDR+large cache can effectively reduce the real bandwidth required to do this.

There is a significant factor where this bypass 'eats' into the savings of the other techniques. For example, with some approximate numbers: 60 GB/s for the memory system and 3 GB/s for a display stream. Your position is that it is a small fraction (5%). True, but that small fraction diminishes your savings. For example, if the savings are 55%, they only apply to 95% of the workload, which gives an overall effective rate of 52%. With two displays that falls to 50%; with three displays, 47%. 47% isn't going to scale up as well as 55% might.

You seem to be assuming that every display is fully redrawn on every frame (it’s almost never the case) and you are dismissing framebuffer compression.

In the real world, multi-monitor workloads are desktop compositing workloads, where changes are infrequent and confined to small regions most of the time. Also, real-world GUIs consist of large homogeneous regions, which are a gift for the compressor. Sure, if you want to play a game on three 4K displays simultaneously, the bandwidth hit will be noticeable, but your GPU hardly has the performance to do this anyway.


You are drinking lots of Cupertino kool-aid. Apple plays some games with their L2 cache numbers. Those L2 caches are attached to multiple cores.

Exactly, which is why I am saying that Apple's L2 fulfills the same role as L2+L3 on more traditional designs. They seem to have around 3MB of "fast" cache region per core, but each core can access the other regions at a slight loss of performance (would be interesting to see some benchmarks, btw). Sure, something like Zen3 Vermeer has a bit more effective cache per core (thanks to its humongous L3), but the amount of fast per-core cache in Apple's designs is unprecedented.
 

theorist9

macrumors 68040
May 28, 2015
3,879
3,059
Exactly, which is why I am saying that Apple's L2 fulfills the same role as L2+L3 on more traditional designs. They seem to have around 3MB of "fast" cache region per core, but each core can access the other regions at a slight loss of performance (would be interesting to see some benchmarks, btw). Sure, something like Zen3 Vermeer has a bit more effective cache per core (thanks to its humongous L3), but the amount of fast per-core cache in Apple's designs is unprecedented.
I've read some programmers optimize their code so what's sent to the processor is in cache-size chunks ("cache-friendly" code). Does this mean code has to be rewritten at a low level to be fully optimized for AS's unique cache sizes and configuration?
 

Krevnik

macrumors 601
Sep 8, 2003
4,101
1,312
I've read some programmers optimize their code so what's sent to the processor is in cache-size chunks ("cache-friendly" code). Does this mean code has to be rewritten at a low level to be fully optimized for AS's unique cache sizes and configuration?

When talking about cache-friendly code, it's code/data locality that is generally discussed, along with avoiding shared cache lines that can cause contention over the data within the line, which either messes up instruction ordering or creates issues when there's multithreaded access.

These techniques are applicable across multiple architectures, as they aren't necessarily optimizing for a specific cache size. And many, many apps don't chase this until they get quite complex. I've worked on projects that cared about locality and projects that didn't, and I'd wager the ones that do are outnumbered by the ones that don't. One project I worked on that did care about locality is M1 native these days, and the locality efforts were applicable across both ARM and x64, rather than anything we did for one or the other (the project was already cross-platform for iOS and Android).
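
To make the locality point concrete, here's a minimal sketch (generic C, not code from any project mentioned here): summing values stored contiguously in an array versus chasing pointers through heap-allocated nodes. The first walk touches memory sequentially, so each cache-line fill serves several subsequent reads; the second can miss on nearly every node.

```c
#include <stddef.h>

/* Contiguous layout: neighbouring elements share cache lines,
   so each line fill (and the hardware prefetcher) does useful work. */
double sum_array(const double *values, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += values[i];
    return total;
}

/* Pointer-chasing layout: every node may sit on a different
   cache line (or page), so each step can be a fresh miss. */
struct node {
    double value;
    struct node *next;
};

double sum_list(const struct node *head)
{
    double total = 0.0;
    for (const struct node *p = head; p != NULL; p = p->next)
        total += p->value;
    return total;
}
```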
 

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
I think you might be overestimating the bandwidth needs for driving displays. In a typical demanding graphics workload (e.g. a game), the real bandwidth problem is fetching texture data. And TBDR+large cache can effectively reduce the real bandwidth required to do this.
From what I understand of video circuitry designs, they usually have their own dedicated RAMs that are used by the video circuits to render into the respective supported video output signals, e.g. HDMI, VGA, etc. These RAMs are usually dual-ported, in that the CPU can write into the video buffer RAM via one path while the video circuits read from another path simultaneously. I would think the video buffer memory would be memory-mapped into the CPU's address space.

Using system RAM as video buffer RAM would be sub-optimal, as any cores requiring access to RAM would have to give way to the video circuitry, since it must have the highest priority or you will see jerky video output.

For the M1, as far as the GPU is concerned, RAM is used solely to render the final image into a temporary frame buffer, where it will be copied to the video buffer RAM (most likely using DMA.)
 

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
I've read some programmers optimize their code so what's sent to the processor is in cache-size chunks ("cache-friendly" code). Does this mean code has to be rewritten at a low level to be fully optimized for AS's unique cache sizes and configuration?
From what little I know from messing around with Linux driver code, programmers usually drop hints to the compiler to ensure that their data structures are cache-aligned, and they try to fit their data into cache lines to make data fetches more efficient. This results in fewer cache-line fetches into the CPU, thereby improving performance, especially if you have code running in a tight loop.

Since most code is now written in higher-level languages like C, C++, Swift, etc., a lot of these optimisation techniques are now delegated to the compiler's optimiser.
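
For illustration, this is roughly what such a hint looks like (a generic sketch; the struct and field names are invented, not taken from any real driver): C11's alignas, or the GCC/Clang aligned attribute used throughout the kernel, pins a structure to a cache-line boundary so its hot fields fit in a single line.

```c
#include <stdalign.h>
#include <stdint.h>

/* Assume a 64-byte cache line, the common case on x86 and many ARM cores. */
#define CACHE_LINE_SIZE 64

/* Aligning the struct to a line boundary and keeping it no larger than one
   line means a single fetch brings in every hot field at once. */
struct ring_slot {
    alignas(CACHE_LINE_SIZE) uint64_t sequence; /* starts on a line boundary */
    uint32_t length;
    uint32_t flags;
    uint8_t  payload[CACHE_LINE_SIZE - 16];     /* fill out exactly one line */
};

/* Compile-time check that the layout really is one cache line. */
_Static_assert(sizeof(struct ring_slot) == CACHE_LINE_SIZE,
               "ring_slot must fit in a single cache line");
```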
 
  • Like
Reactions: theorist9

theorist9

macrumors 68040
May 28, 2015
3,879
3,059
When talking about cache-friendly code, it's code/data locality that is generally discussed, along with avoiding shared cache lines that can cause contention over the data within the line, which either messes up instruction ordering or creates issues when there's multithreaded access.

These techniques are applicable across multiple architectures, as they aren't necessarily optimizing for a specific cache size. And many, many apps don't chase this until they get quite complex. I've worked on projects that cared about locality and projects that didn't, and I'd wager the ones that do are outnumbered by the ones that don't. One project I worked on that did care about locality is M1 native these days, and the locality efforts were applicable across both ARM and x64, rather than anything we did for one or the other (the project was already cross-platform for iOS and Android).
I thought there were two parts to this. One is locality. But I thought the other involved sending data to cache in cache-sized chunks. I recall one vectorized database that, as part of its marketing material, claimed to have achieved processor-specific optimizations by doing that. But I don't know if it was mere marketing, or actually substantive.

Also, did you mean "I’d wager the ones that don't [care about locality] are outnumbered by the ones that do."?
 
Last edited:

Krevnik

macrumors 601
Sep 8, 2003
4,101
1,312
I thought there were two parts to this. One is locality. But I thought the other involved sending data to cache in cache-sized chunks. I recall one vectorized database that, as part of its marketing material, claimed to have achieved processor-specific optimizations by doing that. But I don't know if it was mere marketing, or actually substantive.
That's part of locality, but a very specific niche application of it. An application can't "send data to cache" so much as it can slice and dice the data to try to ensure things fit in a cache line. But since cache lines are pretty small, there's a limit to what you can do without really getting into the weeds. One of the biggest things we did in the project I worked on was to avoid references to objects that could be anywhere, and instead embed/append types to ensure a data object was contiguous whenever possible, use the stack instead of the heap, etc.

The cache line size of the latest Intel Macs is 64 bytes, while the M1 doubles that to 128 bytes. So if you have already gone through the trouble of aggressively optimizing locality for a 64-byte cache line, the M1 doesn't change that much. It may even be better in many cases if you focused strictly on data locality rather than specific cache-line sizes. The place where it might hurt is if the second aspect I mentioned (avoiding cache-line contention) is a problem.
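
As a toy illustration of the contention point (not code from that project; the names are made up): two per-thread counters that land in the same cache line will ping-pong that line between cores every time either thread writes. Padding each counter out to 128 bytes keeps them on separate lines on both 64-byte (Intel) and 128-byte (M1) hardware.

```c
#include <stdalign.h>
#include <stdint.h>

/* Pad to the larger of the two line sizes in play (Intel: 64 B, M1: 128 B),
   so each counter owns a whole line on either machine. */
#define LINE_PAD 128

struct padded_counter {
    alignas(LINE_PAD) uint64_t count;
    /* trailing bytes of the line are padding supplied by the alignment */
};

/* One slot per worker thread; only its owner ever writes it, and no two
   slots share a cache line, so one thread's increments never invalidate
   another core's cached copy. */
struct padded_counter per_thread_hits[8];

static inline void record_hit(int thread_id)
{
    per_thread_hits[thread_id].count++;
}
```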

Also, did you mean "I’d wager the ones that don't [care about locality] are outnumbered by the ones that do."?
Optimizations like this are not common. As I said before, this sort of optimization is really getting into the weeds. Throw in things like Electron/React Native/Flutter/etc. becoming more and more popular, and there's a clear trend away from this sort of low-level optimization.

Objective-C uses the heap heavily, making some locality optimizations difficult compared to C/C++. Swift lets you define your types as reference or value types, but makes any stack-vs-heap decisions behind the scenes, limiting what you can do there as well.
 

leman

macrumors Core
Oct 14, 2008
19,515
19,661
I've read some programmers optimize their code so what's sent to the processor is in cache-size chunks ("cache-friendly" code). Does this mean code has to be rewritten at a low level to be fully optimized for AS's unique cache sizes and configuration?

As far as I can tell, not at all. The basic rules are the same, and Apple Silicon uses the same 64-byte cache lines as any mainstream hardware. If anything, the large caches allow the CPU to be more lenient with less-than-optimal code.

Of course, if you are optimizing low-level code specifically for M1, you can probably squeeze out some extra performance by taking its larger caches into account, but it sounds like a lot of bother for a marginal win.
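
If someone did want to chase that win, the usual knob is cache blocking: pick a tile size so the inner loops' working set stays resident in cache, and retune it when the cache grows. A generic sketch (the block size below is an illustrative guess, not a measured M1 number):

```c
#include <stddef.h>

/* Blocked matrix multiply, C += A * B, all n x n, row-major.
   BLOCK is chosen so three BLOCK x BLOCK tiles of doubles fit in cache;
   a chip with more per-core cache can afford a larger BLOCK. */
#define BLOCK 64  /* 3 * 64 * 64 * 8 bytes ~= 96 KB working set; tune per CPU */

void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK)
                /* finish one tile triple before moving to the next */
                for (size_t i = ii; i < n && i < ii + BLOCK; i++)
                    for (size_t k = kk; k < n && k < kk + BLOCK; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < n && j < jj + BLOCK; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```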

From what I understand of video circuitry designs, they usually have their own dedicated RAMs that are used by the video circuits to render into the respective supported video output signals, e.g. HDMI, VGA, etc. These RAMs are usually dual-ported, in that the CPU can write into the video buffer RAM via one path while the video circuits read from another path simultaneously. I would think the video buffer memory would be memory-mapped into the CPU's address space.

Using system RAM as video buffer RAM would be sub-optimal, as any cores requiring access to RAM would have to give way to the video circuitry, since it must have the highest priority or you will see jerky video output.

That's interesting, thanks. Do you have any recommendations on where I can read more about it? What confused me a bit is that Intel's documentation seems to indicate that their display engine operates from RAM.

For the M1, as far as the GPU is concerned, RAM is used solely to render the final image into a temporary frame buffer, where it will be copied to the video buffer RAM (most likely using DMA.)

Hm, but how would it make things better? You still need to write the framebuffer and then read it back. Doesn’t seem like it would save you any bandwidth …
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
You seem to be assuming that every display is fully redrawn on every frame (it’s almost never the case) and you are dismissing framebuffer compression.

In the real world, multi-monitor workloads are desktop compositing workloads, where changes are infrequent and confined to small regions most of the time. Also, real-world GUIs consist of large homogeneous regions, which are a gift for the compressor. Sure, if you want to play a game on three 4K displays simultaneously, the bandwidth hit will be noticeable, but your GPU hardly has the performance to do this anyway.
Frame buffers are seldom stored in a compressed format, and typically external monitors get a full refresh for every frame. Partial refresh is a thing for tightly integrated internal displays, however, due to the fact that these are usually part of battery operated devices and partial refresh reduces the energy required to update the display.
 

leman

macrumors Core
Oct 14, 2008
19,515
19,661
Frame buffers are seldom stored in a compressed format, and typically external monitors get a full refresh for every frame. Partial refresh is a thing for tightly integrated internal displays, however, due to the fact that these are usually part of battery operated devices and partial refresh reduces the energy required to update the display.

Apple's chips implement lossless image compression for textures as a bandwidth-saving feature, so it would only make sense to assume it is applied to framebuffer textures as well. Regarding refresh: yes, the display gets a refresh every frame, but that doesn't mean the frame in RAM has to be completely rebuilt every frame as well.

But again, I am just speculating. I would really love to see how modern desktop compositing systems do it. There are a lot of optimization opportunities, and if we know anything about Apple, it's that they love to optimize these things. I'm certain their display engines are designed to play very well with the display compositor. Naive framebuffer refresh would take too much bandwidth and energy; it's just such an obvious optimization target.
 
  • Like
Reactions: BenRacicot

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
From what I understand of video circuitry designs, they usually have their own dedicated RAMs that are used by the video circuits to render into the respective supported video output signals, e.g. HDMI, VGA, etc. These RAMs are usually dual-ported, in that the CPU can write into the video buffer RAM via one path while the video circuits read from another path simultaneously. I would think the video buffer memory would be memory-mapped into the CPU's address space.

Using system RAM as video buffer RAM would be sub-optimal, as any cores requiring access to RAM would have to give way to the video circuitry, since it must have the highest priority or you will see jerky video output.
Your understanding is correct, but only for a bygone era. As far as PCs are concerned, dual ported video RAM began fading away in the late 1990s, and it's completely gone today.

I don't know exactly why it fell out of favor, but my educated guess is that it wasn't a great tradeoff to dedicate a bunch of pins on memory chips just to video refresh. Pins are a scarce resource on chips. Instead of spending a lot of resources on a second port, you can put everything into making a single ported DRAM as fast as it can possibly be, then share that bandwidth between multiple clients.

The priority problem is solved exactly as you'd expect. There always has to be an arbiter/scheduler sitting between the DRAM and the many clients requesting access to the DRAM, deciding whose requests are submitted in what order. Modern DRAM arbiters have sophisticated quality-of-service features, so it's pretty easy to set them up so that video refresh is priority #1.
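
Purely to illustrate the idea (a toy model, nothing to do with Apple's actual memory controller): a fixed-priority arbiter just scans requesters from highest priority to lowest and grants the first one with a pending request, so a display engine at priority 0 always wins whenever it needs data.

```c
#include <stdbool.h>

/* Requesters in fixed priority order: index 0 is highest. */
enum requester { DISPLAY_ENGINE, GPU, CPU_CLUSTER, IO, NUM_REQUESTERS };

/* Return the highest-priority requester with a pending request, or -1 if
   the memory system is idle. A real arbiter layers QoS weights, deadlines
   and starvation avoidance on top of this basic rule. */
int arbitrate(const bool pending[NUM_REQUESTERS])
{
    for (int r = 0; r < NUM_REQUESTERS; r++)
        if (pending[r])
            return r;
    return -1;
}
```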
 
  • Like
Reactions: leman and quarkysg

altaic

macrumors 6502a
Jan 26, 2004
708
484
Bygone era is apt. This era is about approaching thermodynamic limits, although we’re a long way away from running into that. Clever designs still rule.
 
Last edited:

vladi

macrumors 65816
Jan 30, 2010
1,008
617
The reason why I am asking is because unified memory is pretty much the default assumption across the entire range of Apple Silicon Macs (Apple communication has been fairly clear on this so far). So every time someone suggests that they are going to use separate VRAM I have to ask why that person thinks that Apple would break their elegant design and what kind of benefit that would bring in their opinion.

Pretty simple. Rendering. If Apple is at all serious about at least the prosumer market, let alone the production market, they need RAM and they need VRAM. But maybe they have an elegant solution for 512GB of shared memory.
 
  • Like
Reactions: BenRacicot

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
Your understanding is correct, but only for a bygone era. As far as PCs are concerned, dual ported video RAM began fading away in the late 1990s, and it's completely gone today.

I don't know exactly why it fell out of favor, but my educated guess is that it wasn't a great tradeoff to dedicate a bunch of pins on memory chips just to video refresh. Pins are a scarce resource on chips. Instead of spending a lot of resources on a second port, you can put everything into making a single ported DRAM as fast as it can possibly be, then share that bandwidth between multiple clients.
Man I'm getting old.

The design of the bygone era actually makes a lot of sense to me, but even with refresh rates at 240Hz, I suppose it won't put too much strain on the memory sub-systems of today. Though the bandwidth will add up with more displays supported. I suppose that's the reason the M1 only supports two 6K 60Hz displays.

Apple may go back to this design in their SoCs, as the pins need not be exposed now, if pins are a concern. They just need the digital video output pins from the SoC.

Anyway I was trying to search for the Linux video driver for the M1, hoping to find out how the video hardware is controlled, but couldn't find anything.
 
  • Like
Reactions: BenRacicot

leman

macrumors Core
Oct 14, 2008
19,515
19,661
Pretty simple. Rendering. If Apple is at all serious about at least the prosumer market, let alone the production market, they need RAM and they need VRAM. But maybe they have an elegant solution for 512GB of shared memory.

Why do they need VRAM for rendering? I would think that just having, well, RAM would be sufficient?

Btw, Metal ray tracing is clearly designed with production GPU-accelerated rendering in mind.
 

diamond.g

macrumors G4
Mar 20, 2007
11,433
2,656
OBX
Why do they need VRAM for rendering? I would think that just having, well, RAM would be sufficient?

Btw, Metal ray tracing is clearly designed with production GPU-accelerated rendering in mind.
Isn't it just using the compute shaders on the GPU to do the BVH traversal and intersection checks?
 

leman

macrumors Core
Oct 14, 2008
19,515
19,661
Isn't it just using the compute shaders on the GPU to do the BVH traversal and intersection checks?

Sure, but I was referring to features like extended acceleration structure limits (that allow you to use much larger scenes) and raytracing motion blur.
 

diamond.g

macrumors G4
Mar 20, 2007
11,433
2,656
OBX
Sure, but I was referring to features like extended acceleration structure limits (that allow you to use much larger scenes) and raytracing motion blur.
Ah, that makes sense. Though I do wonder how they plan on "hardware" accelerating those things (the latter seems ripe for motion vectors which can be used for other things *cough* ML upscaling *cough*).
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
Apple may go back to this design in their SoCs, as the pins need not be exposed now, if pins are a concern. They just need the digital video output pins from the SoC.

Anyway I was trying to search for the Linux video driver for the M1, hoping to find out how the video hardware is controlled, but couldn't find anything.
I don't think anybody makes dual-ported VRAM any more, so it'd be difficult to go back. :) Note that despite what you may have heard, M1's DRAM is not "on die", meaning it's not part of the SoC itself; it's just a pair of off-the-shelf LPDDR4X DRAMs soldered to the SoC package (known as Package-on-Package, or PoP).

I've been following the Asahi Linux efforts and the short version is that you'll never find that out by reading their driver. The long version is in this blog post:


You should read it; it'll blow your mind how Apple did video mode setting on the M1.
 
  • Like
Reactions: quarkysg

vladi

macrumors 65816
Jan 30, 2010
1,008
617
Why do they need VRAM for rendering? I would think that just having, well, RAM would be sufficient?

Btw, Metal ray tracing is clearly designed with production GPU-accelerated rendering in mind.

Oh, I meant GPU rendering. Personally I ditched CPU rendering two years ago, went with Octane for product rendering shots, and never looked back. Now we have GPU V-Ray for interiors and I don't think CPU rendering will hold up anymore. We are not Hollywood 3D production render experts, and there are some VFX hubs that still hold onto CPU render farms due to physics accuracy and because they've invested something like $250,000 into their farms.

But take, for example, BMD Fusion, a simple tool for prosumer video compositing (it can get very advanced as well, but BMD has not been developing it properly). Most of its tools are now GPU accelerated and need memory for that, but Fusion itself eats up as much RAM as you can feed it. It's not so much CPU-stressing as it loves to choke on all available RAM in order to render or give you a realtime workflow. If RAM is shared, I could see the problem.
 

leman

macrumors Core
Oct 14, 2008
19,515
19,661
Oh, I meant GPU rendering. Personally I ditched CPU rendering two years ago, went with Octane for product rendering shots, and never looked back. Now we have GPU V-Ray for interiors and I don't think CPU rendering will hold up anymore. We are not Hollywood 3D production render experts, and there are some VFX hubs that still hold onto CPU render farms due to physics accuracy and because they've invested something like $250,000 into their farms.

But take, for example, BMD Fusion, a simple tool for prosumer video compositing (it can get very advanced as well, but BMD has not been developing it properly). Most of its tools are now GPU accelerated and need memory for that, but Fusion itself eats up as much RAM as you can feed it. It's not so much CPU-stressing as it loves to choke on all available RAM in order to render or give you a realtime workflow. If RAM is shared, I could see the problem.

But isn’t Apple‘s approach with shared memory actually much better for this? Systems with a lot of GPU memory are rare and expensive. Not to mention that transferring data over to GPU memory takes a long time.

With plenty of shared memory on an Apple Silicon system you can dedicate as much to the GPU needs as you have to, and you can quickly combine CPU and GPU work for more advanced applications.
 

theorist9

macrumors 68040
May 28, 2015
3,879
3,059
The iPhone probably doesn't need LPDDR5; it is already fast enough. The new Mac prosumer chips could be a different matter, though: they are much more likely to need significantly improved bandwidth (especially the rumored 32-core GPU). Given the relative scarcity of LPDDR5, it would make perfect sense for Apple to continue using the cheaper, more ubiquitous LPDDR4X for the phones while reserving the harder-to-get faster RAM for the Macs.

P.S. Some Android flagships can afford to ship LPDDR5 because their volume is nowhere near the iPhone's…
TLDR: This back-of-the-envelope calculation suggests that, in order to put LPDDR5 RAM in all the "M1X"-equipped MBPs and Minis it's introducing in Q4, Apple would need to use about 3% (2M/70M) of global LPDDR5 production, which seems doable. That percentage will decrease in subsequent quarters, as LPDDR5 production is increasing rapidly.

From a March 7 2021 article announcing that Hynix has started mass-production of 18GB (gigabyte) LPDDR5 mobile DRAM (https://www.prnewswire.com/news-rel...ith-industrys-largest-capacity-301241338.html):

"According to market research institution Omdia, the demand of LPDDR5 DRAM accounts for about 10% of the entire mobile DRAM market currently, and the share is estimated to increase to more than 50% by 2023 as high-tech devices are adopted in a widening range of applications."

From this I estimate that 45M LPDDR5-equipped mobile devices were shipped per quarter in Q1 2021 (0.45B laptop+tablet+phone units shipped globally per quarter x 10% = 45M LPDDR5-equipped units/quarter). And that number is rapidly increasing as mass production ramps up; let's estimate it's now 70M/quarter (it's estimated to reach about 250M/quarter, i.e., half of all mobile units, by 2023).

Mac sales are about 4M units/quarter. No idea what percent of those will be "M1X"-equipped MBPs + Minis, but let's say it's half = 2M units/quarter. That means, to put LPDDR5 in all of them, Apple would need to use 2M/70M = 3% of global LPDDR5 production. That seems doable.

If anyone has actual numbers, as opposed to my guesses, feel free to update my calculations accordingly.
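
For anyone who wants to plug in better figures, here is the same back-of-the-envelope math spelled out; every input is one of the guesses above, not an actual figure:

```c
#include <stdio.h>

int main(void)
{
    /* All inputs are rough guesses from the post above, per quarter. */
    double mobile_units      = 450e6; /* laptops + tablets + phones shipped */
    double lpddr5_share_2021 = 0.10;  /* Omdia: ~10% of mobile DRAM demand  */
    double lpddr5_units_now  = 70e6;  /* guess after the ramp-up from ~45M  */
    double mac_units         = 4e6;   /* Mac sales per quarter              */
    double m1x_fraction      = 0.5;   /* guessed share of "M1X" MBPs/Minis  */

    printf("LPDDR5-equipped units in Q1 2021: %.0fM\n",
           mobile_units * lpddr5_share_2021 / 1e6);              /* ~45M  */
    printf("Share of LPDDR5 output Apple would need: %.1f%%\n",
           100.0 * mac_units * m1x_fraction / lpddr5_units_now); /* ~2.9% */
    return 0;
}
```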
 
Last edited:

altaic

macrumors 6502a
Jan 26, 2004
708
484
There are a few shallow teardowns of the iPhone 13 Pro out right now, where they don’t get all the way down to the ICs. I haven’t managed to find a proper teardown yet, but I did find out that tomorrow both iFixit and TechInsights will be doing theirs. We should know more soon!

Also FYI, in the past, TechInsights has done die analysis of the A and M SoCs, so I’m keeping my eye on them. There’s also ICmasters to keep an eye on, though I think their work is published by semianalysis (whose conclusions I don’t usually agree with).
 
  • Like
Reactions: AgentMcGeek

altaic

macrumors 6502a
Jan 26, 2004
708
484
It turns out that someone has taken a high res photo of the A15 and added it to the Wikipedia entry for the SoC.

Unfortunately, the chip is marked with "D9XMR", which corresponds to the same Micron D9XMR RAM that iFixit identified on the A14. So, unless the D9XMR lineup includes LPDDR5 memory (pretty sure it doesn't), or there are multiple A15 variants, it's looking like the A15 uses LPDDR4X.

I suppose that's not a huge surprise, and the higher-end Mac SoCs could certainly use different memory controller IP, but yeah.
 
  • Like
Reactions: BenRacicot