
will4laptop

macrumors newbie
Original poster
Oct 10, 2022
2
1
I was interested to read in this thread by @Amethyst that at least one version of the Apple Silicon Mac Pro's development samples did not feature socketed RAM. This led me to wonder:
  • Is there any technical reason why Apple Silicon couldn't use (socketed) DDR* SDRAM instead of (soldered) LPDDR* SDRAM?
The only performance difference I am aware of between the two is that LPDDR is more power-efficient. Obviously that makes it the better choice for almost all mobile devices. However, for a desktop tower form factor, the benefit of modularity would recommend socketed DDR. And my amateur understanding is that LPDDR and DDR of the same generation aren't so different in architecture that making a computer that uses DDR would mean a substantial additional development cost for Apple. (If anything, DDR would seem to be simpler!)

Of course, development samples are (by definition) not finished products, so I still retain hope that the final version might include socketed DDR (just as it will almost certainly include more than this sample's one PCIe slot) – that is, unless someone is aware of some limitation of Apple Silicon's design which would make socketed DDR infeasible.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
I was interested to read in this thread by @Amethyst that at least one version of the Apple Silicon Mac Pro's development samples did not feature socketed RAM. This led me to wonder:
  • Is there any technical reason why Apple Silicon couldn't use (socketed) DDR* SDRAM instead of (soldered) LPDDR* SDRAM?
The only performance difference I am aware of between the two is that LPDDR is more power-efficient. Obviously that makes it the better choice for almost all mobile devices.

"more power-efficient" substantively undercuts the technical differences.

1. The data path width is different: 16 bits wide for an LPDDR channel versus two 32-bit data channels for a DDR5 DIMM.

So it is somewhat like saying there is no difference between a physical x16 slot and a physical x4 slot when you have an x16 card.

The technical problem here is that Apple has an unusually high number of memory controller paths out of their designs. The 'lowly' M1/M2 have four, whereas most mainstream desktop chips have just two. Apple can pack in more paths because LPDDR is narrower.
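As a rough back-of-the-envelope illustration (a sketch in Python; the channel counts and transfer rates are approximate public figures for an M1-class part and a mainstream desktop, used here only for the arithmetic):

Code:
# Aggregate bandwidth = channels * channel width * transfer rate.
# Narrow LPDDR channels let Apple pack more of them along the die edge.
def bandwidth_gbps(channels: int, width_bits: int, mtps: int) -> float:
    return channels * (width_bits / 8) * mtps / 1000  # GB/s

# M1-style: 8 x 16-bit LPDDR4X channels at 4266 MT/s (approximate)
print(bandwidth_gbps(8, 16, 4266))   # ~68.3 GB/s
# Mainstream desktop: 2 x 64-bit DDR4 channels at 3200 MT/s
print(bandwidth_gbps(2, 64, 3200))   # 51.2 GB/s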

Technically Apple could build a completely different controller with a different width. Apple could even, conceptually, kick the memory controllers completely off the main die that holds the cores.


2. The ECC paths are grossly different. LPDDR doesn't have an ECC data path segregated from the data; the ECC bits would need to go up/down the regular data path. Again, the DDR implementation is wider, which (since die edge space is limited) squeezes down the number of memory channels that can straightforwardly attach to the die.


Again a completely different controller implementation.


3. The rest of the SoC is fundamentally designed around the ability to do a very high number of concurrent memory requests. Apple's whole setup compensates for 'slower' memory by going wider. The internal communications network is set up to provision this "poor man's HBM" implementation they have composed. If they switched to throwing a high number of disparate, concurrent memory requests at half the number of memory controllers, are the data return latencies going to be the same? Are the write latencies going to be the same?

[Over the last decade or so, when GPU vendors take their cores designed around GDDRx and provision them in an iGPU with a two-channel memory controller for a desktop with DIMM slots, does the performance generally go up or down?]
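A toy queuing model of that point (a sketch only; the latency numbers and the serial-service assumption are illustrative, not measurements):

Code:
# Toy model: many concurrent requests spread across fewer or more channels.
# Assumes each channel serves its share of the queue serially on top of a
# fixed DRAM access time. Real memory scheduling is far more complex.
def avg_latency_ns(outstanding: int, channels: int,
                   base_ns: float = 100.0, service_ns: float = 10.0) -> float:
    return base_ns + (outstanding / channels) * service_ns

for ch in (8, 4, 2):
    print(f"{ch} channels: {avg_latency_ns(64, ch):.0f} ns")
# 8 channels: 180 ns / 4 channels: 260 ns / 2 channels: 420 ns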


"But can just whip up a new memory controller" ....
Not directly a technical problem, but as they used to say at NASA back in the early 60's " No bucks , no Buck Rogers". You have to get funded to do bleeding edge high tech. The technical differences indirectly drive a higher cost.

There is also a cost in 'die space'. Apple could make a "does everything" controller and try to amortize it over every single SoC die sold (and the do-everything controller would be bigger, so less per limited die edge space). But that extra die space, when selling tens of millions of dies, is a substantively higher effective wafer cost if it goes unused in almost all of those tens of millions of placements.


And my amateur understanding is that LPDDR and DDR of the same generation aren't so different in architecture that making a computer that uses DDR would mean a substantial additional development cost for Apple. (If anything, DDR would seem to be simpler!)

That is the flaw. They are substantively different in architecture. And LPDDR can be put on SO-DIMMs too. DIMMs are out because you can't really do the "poor man's HBM" with DIMMs. It has little to do with LP versus regular DDR.

The "poor man's HBM" constraint is technically driven by the high performance iGPU characteristics . DIMMs slots do diddly squat to help that. Throwing both performance and power consumption efficiency out the window just to get modularity. ( So to a substantive degrees mostly a form over function argument. Not getting more on the functional dimensions throwing out . ) And not accounting for how the "memory only modularity" impacts on the rest of the SoC design.


Of course, development samples are (by definition) not finished products, so I still retain hope that the final version might include socketed DDR (just as it will almost certainly include more than this sample's one PCIe slot) – that is, unless someone is aware of some limitation of Apple Silicon's design which would make socketed DDR infeasible.

That is deeply wishful thinking. The primary purpose of putting beta hardware out there is to test it with a variety of workloads. Substituting a completely different memory subsystem would require another round of beta testing. Intel doesn't beta test their Xeon SP server hardware by shipping out boards with mainstream laptop chips on them. It is a completely different-sized chip package.

For DDR and slots to be in play, Apple would need an even bigger chip package in order not to backslide on memory throughput. If Apple had used LPDDR to implement a legacy "laptop"-like memory bandwidth, then that would be a different story. The technical challenge for regular DDR is to do a better job of building a "poor man's HBM" solution. And it doesn't (at least for the DDR5 generation).


The technically easier path for Apple to add DDR5 DIMMs as an 'add-on' to the foundation would be as a kind of RAM-SSD. Either make it so just the file system uses it (the file system already compresses some RAM data, so there is already a 'swap to something closer and faster than disk' mechanism in place), or, if files are memory-mapped in, larger files could more readily be loaded into the RAM-SSD. And implement it off the main die(s) so as not to cut into the primary edge-space budget. The bandwidth and latency would just have to be mostly much better than "disk" (SSD), which wouldn't be that hard to do.
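To put rough numbers on that "much better than disk" bar (a sketch; all three latency figures are order-of-magnitude assumptions, not measurements):

Code:
# Order-of-magnitude access latencies (assumed) for the proposed tiers.
tiers_ns = {
    "on-package LPDDR5": 100,      # ~100 ns
    "DIMM-backed RAM-SSD": 400,    # ~400 ns even over a narrower path
    "NVMe SSD swap": 100_000,      # ~100 us
}
ssd = tiers_ns["NVMe SSD swap"]
for name, ns in tiers_ns.items():
    print(f"{name}: ~{ns:,} ns ({ssd // ns:,}x the speed of SSD swap)")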


P.S. On PCIe slots: that is much easier to hide on a prototype board. The iMac Pro 'threw away' over x16 PCIe v3 lanes from the CPU package. They were present on the package, but the internal logic board just didn't use them. Apple has done that on several systems (MBP 13" models that had x16 PCIe lane provisioning for a dGPU, and Apple used none of it; several lanes left unused even after provisioning Thunderbolt controllers).

Frankly, few if any useful descriptions of that slot ever got offered up. If that was an x4 PCIe v3/4 slot, then there really wasn't anything substantial provisioned there anyway. Counting up the slot numbers is an incomplete measure. If Apple put six x4 PCIe v3 slots in a new Mac Pro, I highly doubt most of the folks complaining about one slot would be happy. (e.g., take the gross excess of TB controllers available in the quad SoC and pragmatically make a built-in TB PCIe expansion box.)
 
Last edited:
  • Like
Reactions: will4laptop

jscipione

macrumors 6502
Mar 27, 2017
429
243
“Unified architecture” is the technical reason that RAM cannot be socketed on a theoretical Apple Silicon Mac Pro like all other Apple Silicon Macs.
 
  • Like
Reactions: enc0re

Kimmo

macrumors 6502
Jul 30, 2011
266
318
My thought to keep all the benefits of on-SOC RAM, AND the expandability of traditional RAM, is to just treat the super-fast on-SOC RAM as a cache level. Only extremely memory-intensive cases will ever need to fetch from slower, socketed, user-expandable RAM.
With a handle like DaveEcc he's got to be right, right?

At least I hope so.

Welcome, Dave.
 

ArkSingularity

macrumors 6502a
Mar 5, 2022
928
1,130
“Unified architecture” is the technical reason that RAM cannot be socketed on a theoretical Apple Silicon Mac Pro like all other Apple Silicon Macs.
There isn't necessarily any reason it couldn't be socketed just because it's unified. The Unified Architecture refers to using one pool of memory for both the GPU and the CPU (along with anything else on the SOC, such as the ANE). While it happens to be very convenient to build it directly next to the SOC die, there is no technical reason (other than perhaps lots of logistical issues with bandwidth concerns) that using a unified architecture would prevent them from socketing it.
 
  • Like
Reactions: will4laptop

mattspace

macrumors 68040
Jun 5, 2013
3,344
2,975
Australia
There isn't necessarily any reason it couldn't be socketed just because it's unified. The Unified Architecture refers to using one pool of memory for both the GPU and the CPU (along with anything else on the SOC, such as the ANE). While it happens to be very convenient to build it directly next to the SOC die, there is no technical reason (other than perhaps lots of logistical issues with bandwidth concerns) that using a unified architecture would prevent them from socketing it.

Call it cynical, or call it realistic, but *if* they allow socketed RAM, it won't be available to the GPU if there are no user-upgradable socketed GPU options.

You gotta buy the GPU you want to use for the entire life of the machine, when you buy the machine. That's my pessimistic bet for the machine's paradigm.
 

ArkSingularity

macrumors 6502a
Mar 5, 2022
928
1,130
Call it cynical, or call it realistic, but *if* they allow socketed RAM, it won't be available to the GPU if there are no user-upgradable socketed GPU options.

You gotta buy the GPU you want to use for the entire life of the machine, when you buy the machine. That's my pessimistic bet for the machine's paradigm.
My guess (as someone who doesn't actually physically design the hardware for these things) is that it's probably a memory bandwidth issue. They could probably use socketed RAM if they wanted to, but they would probably have to design either their own socket or their own memory interface (or at least a different setup than they are using now) in order to do so. (I could be wrong about this; I know that servers can provide massive memory bandwidth, but they do so with several DIMMs. Not sure that would work so well in a MacBook.)

The fact that it's socketed isn't necessarily as much of an issue in itself; there is no reason you couldn't make a socket and just put the memory chips on a socket rather than soldering them directly next to the SOC. (I was wrong, see my edit.) The distance would be slightly greater, but the physical distance is actually much less of an issue than it might seem from a latency standpoint (assuming a signal is electrically transmitted at half the speed of light, it would travel about 1.84 inches every CPU clock cycle; a cold, uncached M1 RAM access takes about 300-350 CPU cycles, so the vast majority of the latency comes from other factors).
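To make that arithmetic concrete (a sketch; the 3.2 GHz clock and half-light-speed propagation are the assumptions stated above):

Code:
# How far a signal travels per CPU clock cycle at ~half the speed of light.
C = 299_792_458          # speed of light, m/s
signal_speed = C / 2     # rough propagation speed in a PCB trace (assumption)
clock_hz = 3.2e9         # M1-class performance-core clock (approximate)

inches_per_cycle = (signal_speed / clock_hz) / 0.0254
print(f"{inches_per_cycle:.2f} inches per cycle")   # ~1.84 inches

# A cold RAM access of ~300-350 cycles dwarfs the few extra cycles of trace
# length a socket would add.
print(f"{325 * inches_per_cycle:.0f} inches of travel per ~325-cycle access")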

My guess is that there are just some logistical factors that make it impractical. Those factors would probably be much less of an issue on a full-sized desktop, where they could design more server-grade setups for these things if they wanted to.

Edit: May have found another reason. As it turns out, LPDDR has to be soldered, and doesn't come in a modular form at all.
 
Last edited:

LeoI07

macrumors member
Jul 8, 2021
56
45
“Unified architecture” is the technical reason that RAM cannot be socketed on a theoretical Apple Silicon Mac Pro like all other Apple Silicon Macs.
To be honest, I don't understand why Apple couldn't just build DIMM slot support into the M2/M3 Ultra and Extreme chips. If the Max ends up being a quarter cut out of an Extreme, and the Ultra is a half cut out of an Extreme, the Ultra and Extreme could include additional parts of the die for DIMM slot support that would be cut out of the Max. Here's an example I made in MS Paint of what the possible structure could look like:
[Attached diagram: mxmaxultraextremediagram.png]

Where the Max is everything within the red lines, the Ultra is everything within the green lines, and the Extreme is everything within the blue lines.
 

will4laptop

macrumors newbie
Original poster
Oct 10, 2022
2
1
"more power-efficient" substantively undercuts the technical differences.

1. The data path width is different. 16 bits wide for LPDDR and two 32 bits wide for the data channels.

So somewhat like saying there is no difference between a physical x16 slot and a physical x4 slot if you have a x16 card.

The technical problem here is that Apple has unusually high number of memory controller paths out of their designs. The 'lowly' M1/M2 have four were as most mainstream desktop chips have just two. Apple can pack in more paths because the LPDDR is narrower.

Technically Apple could build a completely different controller with a different width. Apple could even conceptually kick the memory controllers completely off the main die with cores.
@deconstruct60 thank you very much for the in-depth response!

I wanted to try and better understand the first part of your response, especially in the context of the following excerpt from this guide published by Synopsys:
As such applications tend to have fewer memory devices on each channel and shorter interconnects, the LPDDR DRAMs can run faster than standard DRAMs, providing a higher performance. LPDDR DRAMs in low-power states help achieve the highest power efficiency and extend battery life. LPDDR DRAM channels are typically 16 or 32-bit wide, in contrast to the standard DDR DRAM channels which are 64 bits wide.
And from another guide by Synopsys:
Since SoCs for such applications tend to have fewer memory devices on each channel and shorter interconnects, the LPDDR DRAMs can run faster than the standard DDR DRAMs, (for example, LPDDR4/4X DRAMs run at up to 4267 Mbps and standard DDR4 DRAMs run up to 3200 Mbps), thereby providing higher performance.

So I assembled this comparison table from this page on DDR5 and this page on LPDDR5 (since I couldn't find a single table directly comparing the two):
                 DDR5                      LPDDR5
Device size      8 Gb to 64 Gb             2 Gb to 32 Gb
Speed            Up to 6,400 Mbps          Up to 6,400 Mbps
Channel width    64 bits [Standard] or     16 bits or
                 80 bits [ECC]             32 bits
Voltage (I/O)    1.1 V                     0.5 V or 0.3 V

At the end of it, I think I am finally following the logic here:
  1. LPDDR achieves similar speeds to DDR but on a narrower channel width
  2. Therefore, if you have a fixed channel width, you will get the best performance out of LPDDR (by striping, à la RAID 0)
So, for an example using the DDR5 vs LPDDR5 stats above, if you had one 64-bit-wide controller, you could have either:

                                     DDR5          LPDDR5
Controller width (hypothetical)      64 bits       64 bits
Modules per controller               1             2 or 4
Aggregate speed (theoretical max)    6,400 Mbps    12,800 Mbps or 25,600 Mbps

Is that essentially the logic?

(If so, then the next question in my mind is: why can't the DDR variant of each generation achieve similar performance per unit of channel width as its LPDDR counterpart? Is it the physics of a socketed design that makes it inherently more difficult to achieve performance equivalent to a soldered design? Since DDR modules don't have the power and heat constraints of their LPDDR counterparts, I would have thought that any disadvantages of a socketed design, such as voltage loss across the socket, could be compensated for.)
 
Last edited:
  • Like
Reactions: ArkSingularity

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
@deconstruct60 thank you very much for the in-depth response!

...

At the end of it, I think I am finally following the logic here:
  1. LPDDR achieves similar speeds to DDR but on a narrower channel width
  2. Therefore, if you have a fixed channel width, you will get the best performance out of LPDDR (by striping, à la RAID 0)

It really isn't "RAID 0" versus non-RAID. You are not going to find a high-end SoC with just one DDR5 controller. There are going to be multiple memory controllers. It is far closer to a "much wider stripe" versus a "wide stripe" issue. They are all striping.

Also, on that DDR5 page you seem to have skipped the part outlining that the 64 is really two 32s (or two 40s for ECC). That is new for DDR5. They are coupled/bundled, so you can't split them apart or skip it. DDR6 will/should decouple those two. So both are moving in a similar direction (more, thinner channels), but at different paces and with different error-resiliency requirements and options.

So, for an example using the DDR5 vs LPDDR5 stats above, if you had one 64-bit-wide controller, you could have either:

                                     DDR5          LPDDR5
Controller width (hypothetical)      64 bits       64 bits
Modules per controller               1             2 or 4
Aggregate speed (theoretical max)    6,400 Mbps    12,800 Mbps or 25,600 Mbps

Is that essentially the logic?

Apple doesn't have multiple modules per controller. They have multiple controllers per physical RAM module (stacking memory dies higher and accessing them concurrently). That row is mislabeled in the table above. DDR5 is not limited to one module (DIMM/package) sitting on a controller channel either. Putting more than one module on a single memory channel doesn't get you double speed; it gets you more capacity on that channel.
[I think you are trying to map 'banks' onto 'modules' in your table, and I don't think that is correct. Banks have an impact on burst length/size. The burst size doesn't shift the overall bandwidth; it is more about how big of a chunk trickles back from a single request.]
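A concrete illustration of that capacity-versus-bandwidth point (a sketch; the 32 GB DIMM size is a hypothetical):

Code:
# Adding DIMMs to one DDR5 channel grows capacity, not channel bandwidth.
channel_width_bits = 64
transfer_mtps = 6400
channel_gbps = (channel_width_bits / 8) * transfer_mtps / 1000   # 51.2 GB/s

for dimms in (1, 2):
    capacity_gb = dimms * 32          # hypothetical 32 GB DIMMs
    print(f"{dimms} DIMM(s): {capacity_gb} GB capacity, {channel_gbps} GB/s")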

And if the end goal is to count up aggregate bandwidth, then channels are a better unit than "controllers". One hand with five fingers can press 5 piano keys; one hand with three fingers can press fewer keys at the same time.

But overall, the notion of having more, narrower channel controllers than the 'norm' for DDR5 is the point (1 * 64 = 4 * 16). A wider, slower bus gets to a higher aggregate bandwidth (and the ability to field a more diverse set of concurrent requests). HBM does roughly the same with more expensive packaging (HBM's diversity isn't as high), and also doesn't really care much for end-to-end ECC.


Finally, you "aggregate speed" is somewhat hookey-pokey. You are comparing 1 wire of a 64 bus to 2 (or 4) wires of an aggregate 64-bit bus. Yeah they are different, but mainly because it is a different number of wires. That isn't really aggregate bandwidth. More natural aggregate bandwidth unit of measure is MB/s not Mb/s. Bytes requires counting more than just one wire of a bus; which gets into the correct mindset.

The aggregate bandwidth is calculated from the total number of wires across the bus(es). If both are 64 bits wide and both have a 6,400 Mb/s individual wire speed, then both have the same overall aggregate. Apple is using 4 RAM die stacks to match the DDR5 DIMMs. That buys more in space and power (distance) savings than it does in higher total bandwidth. If they throw double the number of RAM controllers at the problem, then they can pull ahead.
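The same point in code (a sketch; same per-wire rate and same total width give the same aggregate, and only adding controllers pulls ahead):

Code:
# Same total width + same per-wire rate => same aggregate bandwidth.
def aggregate_gbps(channels: int, width_bits: int, mtps: int) -> float:
    return channels * (width_bits / 8) * mtps / 1000  # GB/s

print(aggregate_gbps(1, 64, 6400))   # one 64-bit DDR5 interface: 51.2 GB/s
print(aggregate_gbps(4, 16, 6400))   # four 16-bit LPDDR5 channels: 51.2 GB/s
print(aggregate_gbps(8, 16, 6400))   # double the controllers: 102.4 GB/s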

DDR5 sets the minimum width wider. However, the upper limit for both DDR5/LPDDR5 is pragmatically just how much die edge space you want to throw at memory controllers as opposed to something else. Apple doubling up on the number of controllers/channels isn't a free lunch. LPDDR5 scales 'better' because you only take on 16-bit chunks at a time as you scale up. DDR5 makes you take 80-bit-wide chunks at a time (if you are going to implement the complete standard). That is a lot more die edge space disappearing, at the cost of something else, with each increment.
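A sketch of that die-edge-budget scaling (the 320-bit budget is an arbitrary assumption chosen so the division comes out even):

Code:
# With a fixed die-edge budget, LPDDR5 scales in 16-bit increments while
# full-standard DDR5 with ECC takes 80-bit (2 x 40) chunks, only 64 bits
# of which carry data.
edge_budget_bits = 320                     # arbitrary illustrative budget
lpddr5_channels = edge_budget_bits // 16   # 20 channels, all 320 bits data
ddr5_interfaces = edge_budget_bits // 80   # 4 ECC interfaces...
ddr5_data_bits = ddr5_interfaces * 64      # ...carrying 256 data bits
print(f"LPDDR5: {lpddr5_channels} channels, {lpddr5_channels * 16} data bits")
print(f"DDR5 (ECC): {ddr5_interfaces} interfaces, {ddr5_data_bits} data bits")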



(If so, then the next question in my mind is: why can't the DDR variant of each generation achieve similar performance per unit of channel width as its LPDDR counterpart?

"Channel width unit" being something other than bits? Your chart has 6,400 Mb/s and 6,400 Mb/s .


Different channel distance constraints. A super-fast 'clock' over a longer wire isn't 'free'. There is a decent chance DDR6 will get tighter distance metrics than DDR5. And if you can only stack memory dies so high (to hit footprint and volume constraints) and capacity is limited by fab tech, then overall capacity is different too.

If you get the wrong answer to a read/write request faster, does that really help? Or if you get substantially fewer answers (and have to page to persistent storage), is that going to help you go faster once you run out of 'runway'?

The myopic notion that 99.999% of the issue is the bit rate on a single wire doesn't cover the different areas that LPDDR and DDR address.

Is it the physics of a socketed design that makes it inherently more difficult to achieve performance equivalent to a soldered design? Since DDR modules don't have the power and heat constraints of their LPDDR counterparts, I would have thought that any disadvantages of a socketed design, such as voltage loss across the socket, could be compensated for.)

DDR5 also pushes some data-integrity tasks onto the DIMMs. The 'controller' on the DIMM is more complicated than on 'less smart' LPDDR5 packages. Some ECC duties are offloaded. And yes, that is because there is more power and space on the DIMM. It is also bigger and wider.


The width and length of the wires, and the complexity of the stuff on either end of the 'channel wires', are a bigger contribution to the different implementation demands of a DDR5 versus LPDDR5 controller.
 