You seem to be presenting this as if it contradicts what I said, but it's actually a restatement/expansion of it. As I mentioned, HBM isn't suitable for Apple's Macs, for power-efficiency and other reasons: "I suspect [power efficiency is] one of the reasons (but not the only reason) Apple has stayed with LPDDR".
I went on to say that while Apple could design a server chip for which HBM was suitable, that would be inconsistent with their approach thus far, which is to share the same architecture across their chips (i.e., if their server chip is a scaled-up version of their existing chips, it's less likely to have a design suitable for HBM).
What's your basis for this expectation?
Sorry for being unclear. Let me restate: I was agreeing with you that HBM is unsuitable, but disagreeing that the primary issue is power efficiency; rather, the primary issues are the minimum stack size and, of course, the cost. For the analysis below I'm going to focus on size and power efficiency, but keep in mind that even if the minimum size weren't an issue, cost still would be.
From what I can gather, HBM3e is as efficient as, if not more efficient than, LPDDR5X/6 on a per-bandwidth basis - i.e. it uses X times more power but provides Y times more bandwidth than LPDDR, where Y >= X. I found a reference that backed that up, but I didn't link it because I didn't trust it, and when I checked its references I didn't see the quoted numbers. I'll keep looking. The best link I was able to find was this (
https://user.eng.umd.edu/~blj/papers/memsys2018-dramsim.pdf), but the problem is that it's old and the primary analysis focuses on a 4-core CPU system, which the authors admit doesn't really stress the bandwidth of an HBM2 memory system. So even though HBM2's performance is still the best there, its power-to-performance ratio looks pretty bad (and again, they were comparing the power of an HBM2 system using a 1,024-bit bus to LPDDR4 using a 64-bit bus, which, combined with the 4-core CPU issue, makes the link not a great reference).
However, the reasoning for why HBM should be at least as power efficient as LPDDR, if not more so, is similar to why Apple's R1 LPDDR5X solution uses less power per unit of bandwidth than Apple's standard LPDDR5X solution: they get double the bandwidth by running 4x the number of pins at half the data rate. As with other silicon, RAM power tends to scale linearly with width but super-linearly with clock speed, so lowering clocks and increasing width yields power savings. Heck, it's one of the ways LPDDR is able to provide higher bandwidth at lower power than regular DDR (increased parallelism within the DRAM package). HBM is like that, but provides even more bandwidth.
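To put rough numbers on the "wide and slow" intuition, here's a toy sketch in Python. The power model - dynamic power scaling as width x clock x voltage^2, with voltage assumed to drop ~20% at half the clock - is a simplifying assumption of mine, not a measured DRAM figure:

```python
# Toy "wide and slow" comparison, not measured data.
# Assumes dynamic power ~ width * clock * V^2 and that voltage can be
# lowered somewhat (here ~20%) when the clock is halved; real DRAM
# interfaces are more complicated, so treat this as illustrative only.

def rel_power(width, clock, volt):
    """Relative interface power under the assumed width * clock * V^2 model."""
    return width * clock * volt ** 2

def rel_bandwidth(width, clock):
    """Relative bandwidth: pins times per-pin data rate."""
    return width * clock

# Baseline: a "narrow, fast" interface (think a plain LPDDR5X setup).
base_bw, base_pw = rel_bandwidth(1.0, 1.0), rel_power(1.0, 1.0, 1.0)

# "Wide, slow" alternative: 4x the pins at half the rate (roughly the
# R1 / HBM approach), with an assumed 0.8x voltage from the lower clock.
wide_bw, wide_pw = rel_bandwidth(4.0, 0.5), rel_power(4.0, 0.5, 0.8)

print(f"bandwidth:           {wide_bw / base_bw:.1f}x")                           # 2.0x
print(f"power:               {wide_pw / base_pw:.2f}x")                           # 1.28x
print(f"power per bandwidth: {(wide_pw / wide_bw) / (base_pw / base_bw):.2f}x")   # 0.64x
```

Under those (assumed) numbers you get twice the bandwidth for about two-thirds the power per unit of bandwidth, which is the shape of the argument even if the exact figures are mine.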
So why doesn't one build an M4 Max with HBM3e if HBM is so efficient? Primarily because you can't build an HBM stack with less than a 1024-bit bus, and at that width an HBM3/HBM3e stack delivers roughly 800GB/s/1.2TB/s respectively. Yes, that will use more power than an LPDDR solution delivering 540GB/s on a 512-bit bus, but that's secondary to the fact that you can't build a smaller HBM3e stack - oh, and the HBM3(e) will cost a lot more. That only gets more ridiculous as you go further down the product stack, where the minimum HBM3e stack remains a 1024-bit bus and >1TB/s but your requirement is, say, 120GB/s. So I agree with your overall thesis that HBM is inappropriate for almost all of Apple's product stack, but it's not the power efficiency per se that's the problem. Beyond the cost issue (which is the biggest), it's the minimum amount of total power/bandwidth/stack size - at equal bus size, HBM is likely as efficient as LPDDR, if not more so.
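To illustrate why the granularity is the real problem, here's a quick back-of-envelope sketch using the rough per-pin rates discussed in this thread (HBM3e ~9.8Gbps, LPDDR5X ~8.5Gbps); exact figures vary by speed bin:

```python
# Peak bandwidth (GB/s) = bus width (bits) * per-pin rate (Gbps) / 8.
# Per-pin rates below are the approximate figures used in this thread.

def peak_gb_s(bus_bits, pin_rate_gbps):
    """Peak bandwidth in GB/s for a given bus width and per-pin data rate."""
    return bus_bits * pin_rate_gbps / 8

# An HBM3e stack can't be narrower than 1024 bits, so this is its floor:
print(f"HBM3e, one 1024-bit stack: {peak_gb_s(1024, 9.8):6.0f} GB/s")   # ~1254

# LPDDR5X can scale its bus down to match much smaller bandwidth targets:
for bits in (512, 256, 128):
    print(f"LPDDR5X, {bits:3d}-bit bus:     {peak_gb_s(bits, 8.5):6.0f} GB/s")
# ~544, ~272, ~136 GB/s - and a 128-bit part clocked a bit lower lands
# right around the ~120GB/s figure mentioned above.
```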
I don't understand how these memory technologies differ in detail, so I can only look at implementations. But based on those, I'm wondering if you're overly trivializing the performance advantage of HBM.
As the name says, and as you know, the purpose of HBM is to provide high bandwidth. For instance, the GPU in NVIDIA's GH200 uses HBM3e to deliver about 5 TB/s of bandwidth. If it were possible and practical to achieve that by taking the fastest commercially available LPDDR and expanding its bus width, then they wouldn't bother using HBM3e; they'd instead just use your expanded version of LPDDR. And yet they don't, which seems to indicate this isn't readily achievable with LPDDR, meaning the performance advantage of HBM is real and significant.
As an aside, there's also the question of latency. I've not been able to get a clear picture of this but, based on a few mentions in Chips and Cheese, it seems HBM's latency is inferior to LPDDR's.
With regard to latency, that's my understanding as well, but like you I've had difficulty nailing down hard numbers.
WRT LPDDR and HBM total performance, there is a nuance here. HBM has a very real advantage, but @leman is correct that you can build an LPDDR system with the same bandwidth as HBM3e - there's just a significant catch. Let's start with "can you do it?" Yes. You can see from above that an LPDDR5X memory system like the M4 Max's, which has a 512-bit bus, delivers around half the bandwidth of an HBM3e stack, which has a 1024-bit bus (540GB/s vs 1.2TB/s) - so LPDDR5X on the same bus size would deliver very similar performance, 1.08TB/s vs 1.2TB/s. Okay, the HBM is better, but not by much, and LPDDR6 probably catches up with it, if not surpasses it, at the same bus size. (HBM3/HBM3e data rate per pin is about 6.4/9.8Gbps, while LPDDR5X/6 is about 8.5/10.6Gbps.)
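Worked out at a fixed 1024-bit bus width, those per-pin rates give roughly the following (again, approximate figures that depend on the speed bin):

```python
# Same-bus-width comparison at 1024 bits, using the rough per-pin rates
# quoted above; real parts vary by bin.

BUS_BITS = 1024

for name, gbps_per_pin in [("HBM3", 6.4), ("HBM3e", 9.8),
                           ("LPDDR5X", 8.5), ("LPDDR6", 10.6)]:
    tb_s = BUS_BITS * gbps_per_pin / 8 / 1000
    print(f"{name:8s} @ 1024-bit: {tb_s:.2f} TB/s")
# HBM3 ~0.82, HBM3e ~1.25, LPDDR5X ~1.09, LPDDR6 ~1.36 TB/s
```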
So let's reverse my question from earlier: why use HBM at all if it's so expensive and the bandwidth gains are seemingly so minor? Beyond my suspicion that HBM is slightly more power efficient at those bandwidths, it's because I've left out a really important detail - space. Filling a 1024-bit bus with LPDDR5X modules requires 16 of them. That not only takes up a lot of room but, even worse, just as the minimum HBM stack size is no use for smaller chips, the minimum LPDDR5X module with full bandwidth is, I believe, 6GB. (I believe 2GB modules are offered, but at reduced bandwidth - single data package/rank - while the minimum Apple offers at the max bandwidth is 6GB, and I think that is indeed the floor.) Let's assume it is 6GB. That means that in order to provide 1.08TB/s of bandwidth from LPDDR5X, you would need a minimum of 96GB of RAM across 16 modules. HBM3e can provide that level of bandwidth in a single 24GB stack. That's the HBM advantage and why people build with it.
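Here's that arithmetic spelled out, under the same working assumptions from above (64-bit LPDDR5X packages and a 6GB minimum full-bandwidth package - my assumptions, not confirmed specs):

```python
# Minimum capacity required to match one HBM3e stack's bus width with LPDDR5X.
# 64-bit packages and a 6GB full-bandwidth minimum are working assumptions.

HBM3E_STACK_BITS   = 1024   # one HBM3e stack's interface width
LPDDR5X_PKG_BITS   = 64     # assumed interface width per LPDDR5X package
LPDDR5X_PKG_MIN_GB = 6      # assumed smallest full-bandwidth package

packages   = HBM3E_STACK_BITS // LPDDR5X_PKG_BITS
min_cap_gb = packages * LPDDR5X_PKG_MIN_GB

print(f"LPDDR5X packages to fill a 1024-bit bus:  {packages}")        # 16
print(f"minimum LPDDR5X capacity at that width:   {min_cap_gb} GB")   # 96 GB
print( "one HBM3e stack at the same width:        24 GB")
```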
So you can, in theory, build an LPDDR system with as much bandwidth as an HBM3e system, as long as you're willing to provide that much more base RAM and pay the cost in package size. In theory, if Apple wanted to provide 5TB/s of bandwidth (matching Grace Hopper, though not Blackwell), the system would have to have a minimum of 480GB of LPDDR RAM. This isn't completely crazy, as Grace Hopper in fact comes with that amount of LPDDR memory on top of its 100+GB of HBM3/e memory (Nvidia uses a smaller LPDDR bus since it's feeding the CPU, which doesn't need that level of bandwidth - GH's CPU bandwidth is 500GB/s as opposed to 5TB/s). However, as per my arguments in the rest of the thread, I'm not sure Apple will be building an individual chip quite that size, to be honest, and I'm not sure how you would distribute that over interconnects, or even what Apple's interconnect speed will be. Also, I'm not sure how costs compare at that point to just using HBM; it might depend on whether you're measuring $/GB or $/(TB/s). My suspicion is that LPDDR would be much better in $/GB but worse in $/(TB/s). Further, I'm not sure if there are limits to how many LPDDR modules/HBM stacks are possible in a single system, or if the limit is just space and money.
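And scaling the same back-of-envelope math up to ~5TB/s, counting in whole 1024-bit LPDDR5X groups as above (all figures are this thread's working assumptions, not vendor specs):

```python
# Scaling the LPDDR5X-vs-HBM3e capacity math up to ~5 TB/s (GH200-class).
# Uses the working assumptions above: ~1.08 TB/s and a 96GB minimum per
# 1024-bit LPDDR5X group (16 x 64-bit packages of at least 6GB each).
import math

TARGET_TBS   = 5.0
GROUP_TBS    = 1.08
GROUP_MIN_GB = 96

groups = math.ceil(TARGET_TBS / GROUP_TBS)

print(f"1024-bit LPDDR5X groups needed:  {groups}")                       # 5
print(f"LPDDR5X packages needed:         {groups * 16}")                  # 80
print(f"minimum LPDDR5X capacity:        {groups * GROUP_MIN_GB} GB")     # 480 GB
print(f"bandwidth actually delivered:    {groups * GROUP_TBS:.2f} TB/s")  # ~5.4
```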
*****
As an aside, this (which I assume is still in the development stage) looks intriguing:
HBM-PIM (en.wikipedia.org):

"In February 2021, Samsung announced the development of HBM with processing-in-memory (PIM). This new memory brings AI computing capabilities inside the memory, to increase the large-scale processing of data. A DRAM-optimised AI engine is placed inside each memory bank to enable parallel processing and minimise data movement. Samsung claims this will deliver twice the system performance and reduce energy consumption by more than 70%, while not requiring any hardware or software changes to the rest of the system."
Aye, there is a lot of cool stuff about doing in-memory computation. That's beyond me, but people seem excited by some of the possibilities.