I see Apple Silicon as a return to the old days when each company (IBM, DEC, Data General, Perkin Elmer, CDC, Sperry-Univac) had its own architecture. Don't laugh. The new model is that the company designs it and TSMC makes it. Even a hobbyist like me can make a low-performance CPU architecture by programming a "soft CPU" in an FPGA, and if I don't like it, I can reflash the FPGA. All for under $100. The bar for entry is now very low, so I expect many to do this.

That is probably not going to happen, for similar reasons that CDC, Sperry-Univac, DG and DEC are not prime-time players anymore.

There will be differences across implementations, but they will be mostly confined to the non-CPU cores. The SoC packages will have a non-uniform collection of accelerators across system vendors, but we probably won't see much expansion in the generic application cores (CPU cores). There is a ton of money that goes into developer tools and training that really isn't going to be supportable by several dozen different system vendors rolling their own gratuitously different stuff.

What works and doesn't work on generic CPU cores has been optimized to death over the last 30+ years. There is no "huge breakout" that someone is going to get, short of going quantum and/or abandoning the basic von Neumann approach in a huge way. Arm v9, RISC-V, modern Power, and modern Z aren't that bad. x86 has some 25-40 year old dubious design tangents coupled to it, but if AMD/Intel would do a decent amount of "failed experiment" garbage collection, along with letting go of code generated in the '80s and '90s, they'd be in "not that bad" status too.

We're also not going to get major new fragmentation in major operating systems. (Those also make applications less portable, because you have to link against different application libraries. That can be managed by adding some layers of abstraction and/or customization.) If you go back to that '60s-'70s era, each vendor had their own OS too. There are too many multi-platform ports of very viable operating systems now for that to return.

What Apple is doing here, with a lightly forked CPU core and a separate OS (well, there is some standard Unix there too for a small segment of apps), is primarily driven by the scale of the iPhone and iPad. Without that scale it isn't going to be as tractable to move very far away from the shared-R&D mainstream implementations. If for some reason iPhone volume collapsed in 4-5 years, then the M-series variants would be in a deep, deep, deep manure pit.
 
It is relevant. You still need memory for both the GPU and CPU. Apple is using LPDDR for both of them through unified memory, but higher-end GPUs still require much faster memory such as HBM2. What you are saying is to use LPDDR5 for the GPU. Are there any high-end GPUs using LPDDR5 as VRAM?

Do you really think LPDDR5 can replace HBM2? I think not. Even Nvidia's Grace CPU with A100 uses both LPDDR5 and HBM2 and then links them with NVLink. So far, nobody has ever used LPDDR5 to replace GDDR6 or HBM2 in high-end and workstation GPUs.

LPDDR5 is far from replacing HBM2 memory, and there is no way for it to beat HBM2's bandwidth. How would you overcome that bandwidth gap physically?
A large SRAM cache can help offset bandwidth deficits.
 
What do you mean not yet proven?
If you're going to say the M1 is proof, then it's not enough to prove anything. Also, you are ignoring the physics, and you still haven't been able to prove that LPDDR5 can replace HBM2. So how?
 
If you're going to say the M1 is proof, then it's not enough to prove anything. Also, you are ignoring the physics, and you still haven't been able to prove that LPDDR5 can replace HBM2. So how?
I've said before Apple can go the "Infinity Cache" route (which is what AMD is using) as long as the workloads are designed to take advantage of it.
 
I've said before Apple can go the "Infinity Cache" route (which is what AMD is using) as long as the workloads are designed to take advantage of it.
lol, even those AMD GPUs with Infinity Cache are still using GDDR6. Still not enough to reach HBM2 level. At this point, I'm going to ignore this thread, as nobody has proven anything without breaking the physics.
 
lol, even those AMD GPUs with Infinity Cache are still using GDDR6. Still not enough to reach HBM2 level. At this point, I'm going to ignore this thread, as nobody has proven anything without breaking the physics.
Well, it stands to reason it must work well enough, since the workstation cards using RDNA2 don't have HBM (the CDNA cards do use HBM, and don't have IC either).

Again, TBDR GPUs don't care nearly as much about memory bandwidth. I've not seen anyone post anything that shows the M1 actually needs more GPU bandwidth with its current 8-core GPU config. I do agree that more GPU cores will need more bandwidth (proportionate to core count) to keep up, but I am not sure Apple is going to be anywhere near needing a TB/s of it.
 
It is relevant. You still need memory for both the GPU and CPU. Apple is using LPDDR for both of them through unified memory, but higher-end GPUs still require much faster memory such as HBM2. What you are saying is to use LPDDR5 for the GPU. Are there any high-end GPUs using LPDDR5 as VRAM?

You are completely missing the forest for the trees. Two big issues.

1. The CPU and GPU will not necessarily both be asking for data at the same time.
2. It isn't "faster"; it is higher bandwidth.

HBM largely gets its high bandwidth by having a wide bus. Apple can implement a wide bus using LPDDR4X. They need to put more memory controllers on the die, but compare the bus widths:

HBM2e bus width: 1024 bits
32 x 16-bit LPDDR channels: 512 bits

and that still hasn't accounted for clock speed or parallelism across multiple data streams. Apple's semi-custom LPDDR4X RAM stacks probably have a 4 x 16-bit (64-bit) wide concurrent data path, instead of generic LPDDR4 RAM (multiple dies with shared bandwidth to "banks" inside the RAM package). What Apple is doing is more HBM-like than generic, standard LPDDR4.
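As a back-of-envelope sketch of that bus-width argument (the per-pin data rates below are illustrative figures I'm assuming for HBM2e and LPDDR4X-4266, not Apple specs), peak bandwidth is roughly bus width times per-pin rate:

```python
# Peak bandwidth ~= bus width (bits) * per-pin data rate (Gbit/s) / 8.
# Per-pin rates are assumed, illustrative values: ~3.2 Gbps for HBM2e,
# ~4.266 Gbps for LPDDR4X-4266.

def peak_gb_per_s(bus_width_bits: int, gbps_per_pin: float) -> float:
    """Peak bandwidth in GB/s for a given bus width and per-pin data rate."""
    return bus_width_bits * gbps_per_pin / 8

print(peak_gb_per_s(1024, 3.2))        # one 1024-bit HBM2e stack      -> ~410 GB/s
print(peak_gb_per_s(32 * 16, 4.266))   # hypothetical 512-bit LPDDR4X  -> ~273 GB/s
print(peak_gb_per_s(8 * 16, 4.266))    # M1-style 128-bit LPDDR4X      -> ~68 GB/s
```

Going wide rather than fast is the same trick HBM uses, just built out of lower-power parts.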


And if Apple didn't add a relatively larger cache to backstop the gap, then yeah, they would be in bigger trouble.
[Another missing factor is that the GPU and CPU run so hot in the MBP 16" enclosure that the GPU is downclocked quite a bit. It is primarily getting traction because it is "wide" in compute also.]


I am skeptical that Apple is going to cover the build-to-order (BTO) GPU HBM2 models without throwing more than one die at the problem (more doable with 8 custom RAM packages than with just 4). They'll need a high multiple of their custom packages to match 1-2 HBM ones: at least a multiple greater than 2, if not 4. If they did a 1-to-1 ratio then yeah, they'd probably "lose".


Do you really think LPDDR5 can replace HBM2? I think not. Even Nvidia's Grace CPU with A100 uses both LPDDR5 and HBM2 and then links them with NVLink.

The MBP 16" doesn't cover that space, so there is little reason to think Apple is trying to cover that.
The top 20% of performance in the GPU space I don't think Apple is going to cover. 90% of their entire Mac product line doesn't cover that space. The laptops certainly do not.

For the next year or so, I think Apple will keep 3rd party GPUs sidelined so that app developers fully optimize for the Apple GPU. There are several features in Apple GPU's Metal API aimed at some of these bandwidth issues. If folks use those heavily, that too will take some bandwidth pressure off the table.

But over the long term... Apple probably does need to figure out how to either let some 3rd party GPU players back into the Mac ecosystem, or separate Unified Memory from Homogeneous Memory. However, on laptops with higher Performance/Watt pressures it will probably remain essentially the same thing.

So far, nobody has ever used LPDDR5 to replace GDDR6 or HBM2 in high-end and workstation GPUs.

Is Apple going to make a very high-end workstation with Apple Silicon? That is an assumption.


P.S. Bandwidth problems aren't always GB problems.



[Attached chart from the Hot Chips 2021 coverage linked below]


https://www.anandtech.com/show/16912/hot-chips-2021-live-blog-graphics-intel-amd-google-xilinx

Those curves bend with cache capacity. It isn't linear across the whole scale.
 
Speed and bandwidth are separate things. LPDDR5 is still far from being able to replace GDDR6. Not only that, the 16-inch MBP used HBM2 graphics cards, which are even faster than GDDR6.

There are two measures of RAM performance: bandwidth (the maximum amount of data transmitted within a time window) and latency (the time from a data request to when the data becomes available). Most of the time, when we speak about RAM performance, we speak about bandwidth. There are also other factors such as cost, power consumption, space requirements and component availability.

Yes, a single HBM2 module will have higher bandwidth (and capacity) than a single LPDDR5 module. However, you can use multiple LPDDR5 modules in parallel to reach the same bandwidth. After all, that's what HBM2 is - multiple relatively slow stacked DRAM chips that communicate over a very wide data interface.

It is relevant. You still need memory for both the GPU and CPU. Apple is using LPDDR for both of them through unified memory, but higher-end GPUs still require much faster memory such as HBM2. What you are saying is to use LPDDR5 for the GPU. Are there any high-end GPUs using LPDDR5 as VRAM?

I think the question we need to be asking ourselves is how much memory bandwidth Apple GPUs really need? We are so used to the traditional „dumb“ GPU design where the performance is achieved by stacking the bandwidth to incredible levels that we often forget that alternative engineering solutions are possible.

Apple Silicon has two advantages over traditional GPUs. First is a large fast cache (AMD used similar technology in the recent GPUs, allowing them to use slower
VRAM without sacrificing performance). Second is TBDR, which dramatically reduces memory access requests, and, what’s even more important, makes them more predictable and therefore cacheable. It is entirely possible that Apple GPUs will quickly run into diminishing returns as you are increasing the bandwidth, because you are limited by the GPU and the cache performance, not the RAM bandwidth. We can already see it with M1, where a meager 60Gbps can offer performance not far off GPUs that have 200Gbps VRAM.

Basically, the idea is that by carefully matching the capability of your hardware to cache size and performance, you can control how much bandwidth you really need. You don't have to believe me on this. This is from Apple's GPU driver team lead:


LPDDR5 is far from replacing HBM2 memory, and there is no way for it to beat HBM2's bandwidth. How would you overcome that bandwidth gap physically?

By not needing that much bandwidth. I mean, give the chip 64/128MB of fast LLC plus 200GBps (four stacks of LPDDR5) and you can probably reach the same aggregate bandwidth as a 500Gbps GPU, especially taking into account the TBDR optimizations.
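A minimal sketch of that reasoning, assuming (purely for illustration) that the LLC absorbs around 60% of GPU memory traffic:

```python
# Back-of-envelope model: DRAM bandwidth B with an LLC hit rate h services
# roughly B / (1 - h) of total memory traffic, since only misses reach DRAM.
# The 0.60 hit rate is an assumed, illustrative number, not a measured one.

def effective_bandwidth(dram_gb_per_s: float, llc_hit_rate: float) -> float:
    assert 0.0 <= llc_hit_rate < 1.0
    return dram_gb_per_s / (1.0 - llc_hit_rate)

print(effective_bandwidth(200.0, 0.60))   # 200 GB/s of LPDDR5 looks like ~500 GB/s
```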
 
lol, even those AMD GPUs with Infinity Cache are still using GDDR6. Still not enough to reach HBM2 level. At this point, I'm going to ignore this thread, as nobody has proven anything without breaking the physics.

You have to look at the physics too. A 256-bit wide GDDR6 bus and a 256-bit wide LPDDR4X bus really aren't hugely far apart from each other.

If the Apple GPU objective were to crush the upcoming Ponte Vecchio, MI200/MI300, or "next gen" A100-class cards, they would have a large problem. But none of those are going into even a current Intel Mac, let alone a Mac laptop, any time soon.

If Apple thins out the large-screen iMac even half as much as they thinned out the 24" version, then the power budget requirements really aren't "to infinity and beyond".
 
By not needing that much bandwidth. I mean, give the chip 64/128MB of fast LLC plus 200GBps (four stacks of LPDDR5) and you can probably reach the same aggregate bandwidth as a 500Gbps GPU, especially taking into account the TBDR optimizations.

128MB covers a single 4K/1080p screen workload, but what about multiple concurrent screens? Apple can wind the scope of the solution space down so that it works (the single-screen laptop solution space versus a three-discrete-6K-screens space).

The M1 systems (Mini and MBP 13) dropped off in screen-driving capacity. Max RAM dropped off also. That cache is also going to get consumed by the CPU, so it isn't going to benefit the GPU exclusively.

CPU L3 cache access patterns (and replacement policies) are not necessarily the same as those for all GPU workloads (compute versus raster render and display signal generation).

There is a pretty good likelihood that Apple is going to hit a smaller solution space for the power usage, die space, and cost constraints they are going to put on the Apple GPUs. The question is more whether it will be just big enough or too small. "Exactly matching" or "covers more" are probably off the table.

Where Apple has used HBM2 in Mac systems, it has also been partly to cut power, rather than solely for absolute bandwidth.
 
128MB covers a single 4K/1080p screen workload, but what about multiple concurrent screens? Apple can wind the scope of the solution space down so that it works (the single-screen laptop solution space versus a three-discrete-6K-screens space).

Why would you trash your cache with the contents of the screen? That's probably just written out to RAM, bypassing the cache. Frankly, I don't know how this stuff works… it's possible that the chips have some sort of dedicated buffers for video outputs and they don't land in the RAM at all.


The M1 systems (Mini and MBP 13) dropped off in screen-driving capacity. Max RAM dropped off also. That cache is also going to get consumed by the CPU, so it isn't going to benefit the GPU exclusively.

I still think that the screen output capability of M1 is a limitation of early implementation. I don’t know if it has anything to do with memory or caches.

CPU L3 cache access patterns (and replacement policies) are not necessarily the same as those for all GPU workloads (compute versus raster render and display signal generation).

Sure, but that’s why the CPU clusters have their own shared L2 cache that fulfills the role of the traditional CPU L3.
 
lol, even those AMD GPUs with Infinity Cache are still using GDDR6. Still not enough to reach HBM2 level. At this point, I'm going to ignore this thread, as nobody has proven anything without breaking the physics.
How do you want to be proven wrong when you ignore this thread? I think it's better this way. Bye bye and have a nice week.
 
Why would you trash your cache with the contents of the screen? That's probably just written out to RAM, bypassing the cache. Frankly, I don't know how this stuff works… it's possible that the chips have some sort of dedicated buffers for video outputs and they don't land in the RAM at all.




I still think that the screen output capability of M1 is a limitation of early implementation. I don’t know if it has anything to do with memory or caches.



Sure, but that’s why the CPU clusters have their own shared L2 cache that fulfills the role of the traditional CPU L3.
Wow, possibly the best comment. We aren't all exactly sure how this works; it appears that shared LPDDR4 RAM would bottleneck, but we don't know for sure.

We also seem to have jumped from LPDDR4 to HBM2 RAM.
Is it safe to say that LPDDR4 is a bottleneck for shared memory if we have multiple screens at 4K?
 
We also seem to have jumped from LPDDR4 to HBM2 RAM.
Is it safe to say that LPDDR4 is a bottleneck for shared memory if we have multiple screens at 4K?

A 5K framebuffer frame is around 60MB, you need about 3.4Gbps to resend such a framebuffer at 60fps. M1 alone has 60Gbps. No, LPDDR4 is definitely not a bottleneck here, especially when one takes into account that full-screen redraw is exceedingly rare.
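Here is a minimal sketch of that framebuffer arithmetic, assuming 4 bytes per pixel and a worst-case full redraw of every frame:

```python
# Framebuffer refresh traffic, assuming 4 bytes/pixel (8-bit RGBA) and a full
# redraw every frame -- a worst case, since most screen updates are partial.

def framebuffer_gb_per_s(width: int, height: int, fps: int = 60,
                         bytes_per_pixel: int = 4) -> float:
    return width * height * bytes_per_pixel * fps / 1e9

print(framebuffer_gb_per_s(5120, 2880))   # 5K@60       -> ~3.5 GB/s
print(framebuffer_gb_per_s(3840, 2160))   # 4K@60       -> ~2.0 GB/s
print(framebuffer_gb_per_s(6016, 3384))   # 6K XDR @60  -> ~4.9 GB/s
```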

I don't believe we will see HBM2 in the baseline prosumer chip for one simple reason: battery life. HBM2 is not designed with low-power operation in mind. Yes, it's power-efficient, but that's in the context of power-hungry GPU and server RAM. There is little doubt that Apple will continue building their own „HBM" from low-power-optimized RAM, as they have been doing for years.
 
  • Like
Reactions: BenRacicot
A 5K framebuffer frame is around 60MB, you need about 3.4Gbps to resend such a framebuffer at 60fps. M1 alone has 60Gbps. No, LPDDR4 is definitely not a bottleneck here, especially when one takes into account that full-screen redraw is exceedingly rare.

I don't believe we will see HBM2 in the baseline prosumer chip for one simple reason: battery life. HBM2 is not designed with low-power operation in mind. Yes, it's power-efficient, but that's in the context of power-hungry GPU and server RAM. There is little doubt that Apple will continue building their own „HBM" from low-power-optimized RAM, as they have been doing for years.
OK, so 5K@30fps = 3.4Gbps? And that would make 5K@60fps = 6.8Gbps?
By that math, 4K@30fps = 2.8Gbps? Times 3 screens at 4K = 8.4Gbps. This feels like the minimum expectation for pro setups.

If we could get gaming compute power that would be an awesome start.
 
OK, so 5K@30fps = 3.4Gbps? And that would make 5K@60fps = 6.8Gbps?
By that math, 4K@30fps = 2.8Gbps? Times 3 screens at 4K = 8.4Gbps. This feels like the minimum expectation for pro setups.

If we could get gaming compute power that would be an awesome start.
What @leman wrote was at 60fps.

Unless I am mistaken, framebuffer memory hasn't been a problem for a couple of decades.
 
OK, so 5K@30fps = 3.4Gbps? And that would make 5K@60fps = 6.8Gbps?
By that math, 4K@30fps = 2.8Gbps? Times 3 screens at 4K = 8.4Gbps. This feels like the minimum expectation for pro setups.

If we could get gaming compute power that would be an awesome start.

3.4GB/s is the maximum required bandwidth to send 5K@60fps. Frankly, I doubt that you ever need that much, since screen redraw is usually only partial (at least outside of gaming). And I have no clue how modern GPUs handle display output. It is entirely possible that they have a dedicated framebuffer memory pool for the display connection that bypasses RAM. Maybe someone with detailed knowledge can comment.
 
A 5K framebuffer frame is around 60MB, you need about 3.4Gbps to resend such a framebuffer at 60fps. M1 alone has 60Gbps. No, LPDDR4 is definitely not a bottleneck here, especially when one takes into account that full-screen redraw is exceedingly rare.

One nit here: 60MB * 60 frames is around 3.5GBps, not Gbps. While the M1 bandwidth is also measured in GBps, it was throwing me for a loop since data rates for DisplayPort are in Gbps, and it wasn’t tracking with what I knew about 5K data rates over DisplayPort. :p

But I do wonder how the display hardware works here. The DisplayPort controller being on die means that the SoC either needs to read the framebuffer back from memory, or have some other access to the data on a streaming basis from the GPU. If it has to read it back, then that's more bandwidth potentially used if there are cache misses.
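Since the bytes-versus-bits distinction is exactly what trips people up here (memory bandwidth is quoted in GB/s, display link rates in Gbps), a quick conversion for the same 5K@60 stream:

```python
# Same 5K@60 framebuffer stream in both conventions: memory traffic in GB/s
# (bytes), link rates in Gbit/s (bits). Assumes 4 bytes/pixel, full redraw.

frame_bytes = 5120 * 2880 * 4        # ~59 MB per frame
gb_per_s = frame_bytes * 60 / 1e9    # ~3.5 GB/s of memory traffic
gbit_per_s = gb_per_s * 8            # ~28 Gbit/s if pushed raw over a link

print(f"{gb_per_s:.1f} GB/s ~= {gbit_per_s:.1f} Gbit/s")
```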
 
One nit here: 60MB * 60 frames is around 3.5GBps, not Gbps. While the M1 bandwidth is also measured in GBps, it was throwing me for a loop since data rates for DisplayPort are in Gbps, and it wasn’t tracking with what I knew about 5K data rates over DisplayPort. :p

Yeah, sorry, I still find these abbreviations asinine. I meant GB/s everywhere.

But I do wonder how the display hardware works here. The DisplayPort controller being on die means that the SoC either needs to read the framebuffer back from memory, or have some other access to the data on a streaming basis from the GPU. If it has to read it back, then that's more bandwidth potentially used if there are cache misses.

It is surprisingly difficult to find any relevant information on this. The only bit I could find was from Intel's GPU white paper, where they state that the display engine can be a major consumer of bandwidth (suggesting that the framebuffer is indeed stored in system RAM). The interesting bit is that the display engine in Intel's SoC is not part of the GPU but a separate SoC unit (makes sense in retrospect); I really have no idea how Apple treats it.

BTW, let's not forget about framebuffer compression. Apple GPUs will apply lossless compression to the images, so the required bandwidth is likely to be much lower than the theoretical max.
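Purely as an illustration of the effect (the 2:1 ratio below is an assumed average, not a published Apple figure; lossless ratios depend entirely on content):

```python
# Effect of lossless framebuffer compression on display-engine read traffic.
# The 2:1 ratio is a hypothetical average, not an Apple spec.

uncompressed_gb_per_s = 3.5     # 5K@60 full-redraw stream from the earlier arithmetic
assumed_ratio = 2.0             # hypothetical average lossless compression ratio

print(uncompressed_gb_per_s / assumed_ratio)   # ~1.75 GB/s actually read from RAM
```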
 
Yeah, sorry, I still find these abbreviations asinine. I meant GB/s everywhere.



It is surprisingly difficult to find any relevant information on this. The only bit I could find was from Intel's GPU white paper, where they state that the display engine can be a major consumer of bandwidth (suggesting that the framebuffer is indeed stored in system RAM). The interesting bit is that the display engine in Intel's SoC is not part of the GPU but a separate SoC unit (makes sense in retrospect); I really have no idea how Apple treats it.

BTW, let's not forget about framebuffer compression. Apple GPUs will apply lossless compression to the images, so the required bandwidth is likely to be much lower than the theoretical max.
Yeah, I don't think anyone uses a dedicated framebuffer anymore (separate from the pool of memory either on the GPU board or shared with the system). The last time I can recall it being mentioned was with Xenos on the Xbox 360 (it was like a 10MB buffer).
 
Why would you trash your cache with the contents of the screen? That's probably just written out to RAM, bypassing the cache. Frankly, I don't know how this stuff works… it's possible that the chips have some sort of dedicated buffers for video outputs and they don't land in the RAM at all.

You wouldn't. But several replies back you put forward how cache and TBDR were going to solve the problem.
The argument was that it doesn't matter that you can't exactly match bandwidths, because you need less bandwidth. The problem here is that the cache isn't going to be effective at mitigating bandwidth if you are not using it. Same issue with TBDR: it has next to no traction here, because this is a display controller issue, not a "tile" (or tile cache) issue.

There is a significant factor where this bypass 'eats' into the savings of the other techniques. For example, with some approximate numbers: 60 GB/s for the memory system and 3 GB/s for a display stream. Your position is that it is a small fraction (5%). True, but that small fraction diminishes your savings. For example, if the savings rate is 55%, it only applies to 95% of the workload, which gives an overall effective rate of about 52%. With two displays that falls to 50%; with three displays, 47%. 47% isn't going to scale up as well as 55% might.
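The dilution arithmetic from that paragraph, written out (the 55% savings rate, 60 GB/s, and 3 GB/s per display are the same illustrative numbers as above, not measurements):

```python
# Dilution of a bandwidth-savings technique when a fixed per-display stream
# bypasses the cache/TBDR path. Numbers are the illustrative ones from the post.

def effective_savings(savings_rate: float, total_gb_per_s: float,
                      per_display_gb_per_s: float, displays: int) -> float:
    bypass_fraction = per_display_gb_per_s * displays / total_gb_per_s
    return savings_rate * (1.0 - bypass_fraction)

for n in (1, 2, 3):
    print(n, f"{effective_savings(0.55, 60.0, 3.0, n):.1%}")
# -> 52.2%, 49.5%, 46.8%: roughly the 52% / 50% / 47% figures above
```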

So the real gap versus GDDR6 and HBM2 will be larger. HDR (10-12 bit color) and/or high refresh rates (90-120 Hz) also expose the problem.

The geometry render function units (and video decode units) have to "pass off" the rendered screen elements. There are many of them and far fewer display controllers, so they need a central, shared area to communicate through. If we have banned them from using the system cache, the next level up of shared area is RAM. You're trying to direct this away from RAM to some large on-die cache area... possible, but at a substantially larger die area consumption, which won't scale up very well either.

Fancy up/down scaling (XeSS/DLSS/FSR up and "Retina" down) brings similar bandwidth pressures, where pragmatically you need to hold both the "old" and "new" frames in order to get to the altered state.

Trying to balloon-squeeze this problem off to the monitors isn't going to work well either. Some monitors have a big enough buffer to refresh a completely static screen on their own, but that is an optional feature. Similarly, some monitors can accept DSC-compressed streams. Again, optional, not standard. And in order to get to a DSC-compressed stream you need to have the original frame. [Even if Apple applied something like DSC to every "internal only" framebuffer and/or framebuffer contribution fragment, they could just-in-time decompress into external streams for non-DSC monitors. That cuts down on the display-driven dilution, but the number of monitors can still scale right back up to cover the compression "wins". It would probably cost increased area consumption, though.]


I still think that the screen output capability of M1 is a limitation of early implementation. I don’t know if it has anything to do with memory or caches.

Die space is probably a significant factor. Bandwidth is another, as I've already illustrated. Those are actually coupled, as the die-area scaling comparison quoted below shows:

"....
Change (factor)Comments
CPU 1 (mm2)2.1-
CPU 1 # cores2.0M1 has twice as many CPU 1 cores
CPU 2 (mm2)1.0Same size as CPU 2
...
DDR (mm2)2.7M1 has double the DDR interfaces

....
GPU (mm2)2.1M1 has double the GPU cores

....
System cache (mm2)0.8M1 has 25% smaller system cache

...."




The "go super wide" is a dual edge sword "solution" . It works much better on mid-size to larger dies. It doesn't work so well once get into the smaller die "half" of the spectrum. Relatively, the system cache shrank and the DDR system is closer to 3 than 2 in scaling factor. ( Not really surprising since to interface with off chip elements through the bottom bumps on package that isn't going to happen a 5nm kind of sizes. Going off die is bigger ). There isn't a good reason to shrink the system cache other running out of area budget.

It is already demonstrable that it is "cheaper" to add more GPU cores than it is to add more DDR interfaces. If you apply 4x to the GPU cores, there isn't really a way for a pragmatic 6x DDR expansion to keep up.
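A rough sketch of why, naively compounding the per-doubling area factors from the table quoted above (the extrapolation to 4x is my own assumption, not a measured figure):

```python
# If doubling GPU cores costs ~2.1x GPU area and doubling DDR interfaces costs
# ~2.7x DDR area (factors from the quoted table), naive compounding suggests the
# DRAM interfaces run out of area budget well before the GPU cores do.

import math

def area_factor(per_doubling: float, unit_multiple: float) -> float:
    return per_doubling ** math.log2(unit_multiple)

print(f"4x GPU cores      -> ~{area_factor(2.1, 4):.1f}x GPU area")   # ~4.4x
print(f"4x DDR interfaces -> ~{area_factor(2.7, 4):.1f}x DDR area")   # ~7.3x
```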

Similar issue with cache. First, hit rates tend to go "flat" (or at least increase at a sub-"slope 1" linear rate). Second, SRAM implementation density scales differently than logic density does.


Sure, but that’s why the CPU clusters have their own shared L2 cache that fulfills the role of the traditional CPU L3.

You are drinking lots of Cupertino Kool-Aid. Apple plays some games with their L2 cache numbers. Those L2 caches are attached to multiple cores. The E cores' 4MB L2 cache sounds "L3 big" up until you divide by 4 (a 1MB L3 cache isn't a traditional L3 cache for laptops and desktops in the 2020-2021 era). The P cores' 8MB doesn't really measure up as a huge uniform whole either.


[Attached chart (pbw-m1_575px.png): bandwidth vs. test size for the M1 cache hierarchy]



That "8+MB" L2 is collapsing in effectiveness around 3MB ( 3072KB ) and basically "done" at 4MB ( 4096) .

[Apple did something similar with the iMac Pro and Mac Pro tech specs, where aggregate caches were cobbled together to present a "bigger number than those other guys".]


There is a comment you quoted about "x TFLOPS needs y amount of bandwidth". The NPU has a substantial amount of TFLOPS and sits largely coupled to the L3 system cache. The issue isn't just a GPU silo issue: the M-series SoCs have multiple TFLOP consumers sitting on the L3 system cache. They can do RAM en/decompression to offload bandwidth pressure somewhat, but they also need to be able to walk and chew gum at the same time.


dGPU implementations using GDDR6 or HBM2 have more headroom to walk and chew gum at the same time: you can plug in 2-4 monitors and they cope with the larger workload. Apple's approach is to cut the scale off: limit the RAM capacity.
 
Do we have any real-world app examples showing the benefit of HBM2 over other memory types? That would require otherwise identical systems built with equal amounts of HBM2 and, say, LPDDR5. I don't know if those exist outside of internal company test labs.

All I've found thus far is a simulation (still, it's interesting reading):

 