You seem to be presenting this as if it contradicts what I said, but it's actually a restatement/expansion of it. As I mentioned, HBM isn't suitable for Apple's Macs, for power-efficiency and other reasons: "I suspect [power efficiency is] one of the reasons (but not the only reason) Apple has stayed with LPDDR".
I went on to say that while Apple could design a server chip for which HBM was suitable, that would be inconsistent with their approach thus far, which is to share the same architecture across their chips (i.e., if their server chip is a scaled-up version of their existing chips, it's less likely to have a design suitable for HBM).
What's your basis for this expectation?
Sorry for being unclear. Let me restate: I was agreeing with you that HBM is unsuitable, but disagreeing that the primary issue is power efficiency; rather, the primary issues are the minimum stack size and, of course, the cost. For the analysis below I'm going to focus on size and power efficiency, but keep in mind that even if the minimum size weren't an issue, cost still would be.
From what I can gather, HBM3e is as efficient as, if not more efficient than, LPDDR5X/6 on a per-bandwidth basis - i.e. it uses X times more power but provides Y times more bandwidth than LPDDR, where Y >= X. I found a reference that backed that up, but I didn't link it because I didn't trust it, and when I checked its references I didn't see the quoted numbers. I'll keep looking. The best link I was able to find was this (
https://user.eng.umd.edu/~blj/papers/memsys2018-dramsim.pdf), but the problem is that it's old and the primary analysis focuses on a 4-core CPU system, which the authors admit doesn't really stress the bandwidth of an HBM2 memory system. So even though HBM2's performance is still the best there, its power-to-performance ratio looks pretty bad (and again, they were comparing the power of an HBM2 system using a 1,024-bit bus to LPDDR4 using a 64-bit bus, which, combined with the 4-core CPU issue, makes the link not a great reference).
However, the reasoning for why HBM should be at least as power efficient as LPDDR, if not more so, is similar to why Apple's R1 LPDDR5X solution uses less power per unit of bandwidth than Apple's standard LPDDR5X solution: they get double the bandwidth by running 4x the number of pins at half the data rate. As with other silicon, RAM power tends to scale linearly with width but super-linearly with clock speed, so lowering clocks and increasing width yields power savings. Heck, it's one of the ways LPDDR is able to provide higher bandwidth at lower power than regular DDR (increased parallelism within the DRAM package). HBM is like that, but provides even more bandwidth.
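To put rough numbers on the "wide and slow" intuition, here's a toy sketch in Python. The power model - dynamic power scaling as width x clock x voltage^2, with voltage assumed to drop ~20% at half the clock - is a simplifying assumption of mine, not a measured DRAM figure:

```python
# Toy "wide and slow" comparison, not measured data.
# Assumes dynamic power ~ width * clock * V^2 and that voltage can be
# lowered somewhat (here ~20%) when the clock is halved; real DRAM
# interfaces are more complicated, so treat this as illustrative only.

def rel_power(width, clock, volt):
    """Relative interface power under the assumed width * clock * V^2 model."""
    return width * clock * volt ** 2

def rel_bandwidth(width, clock):
    """Relative bandwidth: pins times per-pin data rate."""
    return width * clock

# Baseline: a "narrow, fast" interface (think a plain LPDDR5X setup).
base_bw, base_pw = rel_bandwidth(1.0, 1.0), rel_power(1.0, 1.0, 1.0)

# "Wide, slow" alternative: 4x the pins at half the rate (roughly the
# R1 / HBM approach), with an assumed 0.8x voltage from the lower clock.
wide_bw, wide_pw = rel_bandwidth(4.0, 0.5), rel_power(4.0, 0.5, 0.8)

print(f"bandwidth:           {wide_bw / base_bw:.1f}x")                           # 2.0x
print(f"power:               {wide_pw / base_pw:.2f}x")                           # 1.28x
print(f"power per bandwidth: {(wide_pw / wide_bw) / (base_pw / base_bw):.2f}x")   # 0.64x
```

Under those (assumed) numbers you get twice the bandwidth for about two-thirds the power per unit of bandwidth, which is the shape of the argument even if the exact figures are mine.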
So why doesn't one build an M4 Max with HBM3e if HBM is so efficient? Primarily because you can't build an HBM stack with less than a 1024-bit bus, and at that width an HBM3/HBM3e stack delivers roughly 800GB/s/1.2TB/s respectively. Yes, that will use more power than an LPDDR solution delivering 540GB/s on a 512-bit bus, but that's secondary to the fact that you can't build a smaller HBM3e stack - oh, and the HBM3(e) will cost a lot more. That only gets more ridiculous as you go further down the product stack, where the minimum HBM3e stack remains a 1024-bit bus and >1TB/s but your requirement is, say, 120GB/s. So I agree with your overall thesis that HBM is inappropriate for almost all of Apple's product stack, but it's not the power efficiency per se that's the problem. Beyond the cost issue (which is the biggest), it's the minimum amount of total power/bandwidth/stack size - at equal bus size, HBM is likely as efficient as LPDDR, if not more so.
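To illustrate why the granularity is the real problem, here's a quick back-of-envelope sketch using the rough per-pin rates discussed in this thread (HBM3e ~9.8Gbps, LPDDR5X ~8.5Gbps); exact figures vary by speed bin:

```python
# Peak bandwidth (GB/s) = bus width (bits) * per-pin rate (Gbps) / 8.
# Per-pin rates below are the approximate figures used in this thread.

def peak_gb_s(bus_bits, pin_rate_gbps):
    """Peak bandwidth in GB/s for a given bus width and per-pin data rate."""
    return bus_bits * pin_rate_gbps / 8

# An HBM3e stack can't be narrower than 1024 bits, so this is its floor:
print(f"HBM3e, one 1024-bit stack: {peak_gb_s(1024, 9.8):6.0f} GB/s")   # ~1254

# LPDDR5X can scale its bus down to match much smaller bandwidth targets:
for bits in (512, 256, 128):
    print(f"LPDDR5X, {bits:3d}-bit bus:     {peak_gb_s(bits, 8.5):6.0f} GB/s")
# ~544, ~272, ~136 GB/s - and a 128-bit part clocked a bit lower lands
# right around the ~120GB/s figure mentioned above.
```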
I don't understand how these memory technologies differ in detail, so I can only look at implementations. But based on those, I'm wondering if you're overly trivializing the performance advantage of HBM.
As the name says, and as you know, the purpose of HBM is to provide high bandwidth. For instance, the GPU in NVIDIA's GH200 uses HBM3e to deliver about 5 TB/s of bandwidth. If it were possible and practical to achieve that by taking the fastest commercially available LPDDR and expanding its bus width, then they wouldn't bother using HBM3e; they'd instead just use your expanded version of LPDDR. And yet they don't, which seems to indicate this isn't readily achievable with LPDDR, meaning the performance advantage of HBM is real and significant.
As an aside, there's also the question of latency. I've not been able to get a clear picture of this but, based on a few mentions in Chips and Cheese, it seems HBM's latency is inferior to LPDDR's.
With regard to latency, that's my understanding as well, but like you I've had difficulty nailing down hard numbers.
WRT LPDDR and HBM total performance, there is a nuance here. HBM has a very real advantage, but @leman is correct that you can build an LPDDR system with the same bandwidth as HBM3e - there's just a significant catch. Let's start with "can you do it?" Yes. You can see from above that an LPDDR5X memory system like the M4 Max's, which has a 512-bit bus, delivers around half the bandwidth of an HBM3e stack, which has a 1024-bit bus (540GB/s vs 1.2TB/s) - so LPDDR5X on the same bus size would deliver very similar performance, 1.08TB/s vs 1.2TB/s. Okay, the HBM is better, but not by much, and LPDDR6 probably catches up with it, if not surpasses it, at the same bus size. (HBM3/HBM3e data rate per pin is about 6.4/9.8Gbps, while LPDDR5X/6 is about 8.5/10.6Gbps.)
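Worked out at a fixed 1024-bit bus width, those per-pin rates give roughly the following (again, approximate figures that depend on the speed bin):

```python
# Same-bus-width comparison at 1024 bits, using the rough per-pin rates
# quoted above; real parts vary by bin.

BUS_BITS = 1024

for name, gbps_per_pin in [("HBM3", 6.4), ("HBM3e", 9.8),
                           ("LPDDR5X", 8.5), ("LPDDR6", 10.6)]:
    tb_s = BUS_BITS * gbps_per_pin / 8 / 1000
    print(f"{name:8s} @ 1024-bit: {tb_s:.2f} TB/s")
# HBM3 ~0.82, HBM3e ~1.25, LPDDR5X ~1.09, LPDDR6 ~1.36 TB/s
```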
So let's reverse my question from earlier: why use HBM at all if it's so expensive and the bandwidth gains are seemingly so minor? Beyond my suspicion that HBM is slightly more power efficient at those bandwidths, it's because I've left out a really important detail - space. Filling a 1024-bit bus with LPDDR5X modules requires 16 of them. That not only takes up a lot of room but, even worse, just as the minimum HBM stack size is no use for smaller chips, the minimum LPDDR5X module with full bandwidth is, I believe, 6GB. (I believe 2GB modules are offered, but at reduced bandwidth - single data package/rank - while the minimum Apple offers at the max bandwidth is 6GB, and I think that is indeed the floor.) Let's assume it is 6GB. That means that in order to provide 1.08TB/s of bandwidth from LPDDR5X, you would need a minimum of 96GB of RAM across 16 modules. HBM3e can provide that level of bandwidth in a single 24GB stack. That's the HBM advantage and why people build with it.
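Here's that arithmetic spelled out, under the same working assumptions from above (64-bit LPDDR5X packages and a 6GB minimum full-bandwidth package - my assumptions, not confirmed specs):

```python
# Minimum capacity required to match one HBM3e stack's bus width with LPDDR5X.
# 64-bit packages and a 6GB full-bandwidth minimum are working assumptions.

HBM3E_STACK_BITS   = 1024   # one HBM3e stack's interface width
LPDDR5X_PKG_BITS   = 64     # assumed interface width per LPDDR5X package
LPDDR5X_PKG_MIN_GB = 6      # assumed smallest full-bandwidth package

packages   = HBM3E_STACK_BITS // LPDDR5X_PKG_BITS
min_cap_gb = packages * LPDDR5X_PKG_MIN_GB

print(f"LPDDR5X packages to fill a 1024-bit bus:  {packages}")        # 16
print(f"minimum LPDDR5X capacity at that width:   {min_cap_gb} GB")   # 96 GB
print( "one HBM3e stack at the same width:        24 GB")
```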
So you can, in theory, build an LPDDR system with as much bandwidth as an HBM3e system, as long as you're willing to provide that much more base RAM and pay the cost in package size. In theory, if Apple wanted to provide 5TB/s of bandwidth (matching Grace Hopper, though not Blackwell), the system would have to have a minimum of 480GB of LPDDR RAM. This isn't completely crazy, as Grace Hopper in fact comes with that amount of LPDDR memory on top of its 100+GB of HBM3/e memory (Nvidia uses a smaller LPDDR bus since it's feeding the CPU, which doesn't need that level of bandwidth - GH's CPU bandwidth is 500GB/s as opposed to 5TB/s). However, as per my arguments in the rest of the thread, I'm not sure Apple will be building an individual chip quite that size, to be honest, and I'm not sure how you would distribute that over interconnects, or even what Apple's interconnect speed will be. Also, I'm not sure how costs compare at that point to just using HBM; it might depend on whether you're measuring $/GB or $/(TB/s). My suspicion is that LPDDR would be much better in $/GB but worse in $/(TB/s). Further, I'm not sure if there are limits to how many LPDDR modules/HBM stacks are possible in a single system, or if the limit is just space and money.
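And scaling the same back-of-envelope math up to ~5TB/s, counting in whole 1024-bit LPDDR5X groups as above (all figures are this thread's working assumptions, not vendor specs):

```python
# Scaling the LPDDR5X-vs-HBM3e capacity math up to ~5 TB/s (GH200-class).
# Uses the working assumptions above: ~1.08 TB/s and a 96GB minimum per
# 1024-bit LPDDR5X group (16 x 64-bit packages of at least 6GB each).
import math

TARGET_TBS   = 5.0
GROUP_TBS    = 1.08
GROUP_MIN_GB = 96

groups = math.ceil(TARGET_TBS / GROUP_TBS)

print(f"1024-bit LPDDR5X groups needed:  {groups}")                       # 5
print(f"LPDDR5X packages needed:         {groups * 16}")                  # 80
print(f"minimum LPDDR5X capacity:        {groups * GROUP_MIN_GB} GB")     # 480 GB
print(f"bandwidth actually delivered:    {groups * GROUP_TBS:.2f} TB/s")  # ~5.4
```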
*****
As an aside, this (which I assume is still in the development stage) looks intriguing:
HBM-PIM (en.wikipedia.org):

"In February 2021, Samsung announced the development of HBM with processing-in-memory (PIM). This new memory brings AI computing capabilities inside the memory, to increase the large-scale processing of data. A DRAM-optimised AI engine is placed inside each memory bank to enable parallel processing and minimise data movement. Samsung claims this will deliver twice the system performance and reduce energy consumption by more than 70%, while not requiring any hardware or software changes to the rest of the system."
Aye, there is a lot of cool stuff about doing in-memory computation. That's beyond me, but people seem excited by some of the possibilities.