
leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,678
Instead of integrating LPDDR5 on the SoC, I think they could shoot for 16 channels of DDR5-6400 on the Mac Pro. They would get the bandwidth and capacity (theoretically up to 4TB, with ~600GB/s of bandwidth)

I am not sure how this solution is better, except that it provides some (awkward, since all the slots have to be filled) upgradeability potential. It's also expensive and unwieldy. Also, 600GB/s is not nearly enough to offer competitive performance in this segment. Apple demonstrated that they are willing to offer 400GB/s in a lightweight workstation laptop, something that has so far been completely unprecedented! A pro desktop workstation would offer much more.
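As a sanity check on the numbers being thrown around, peak theoretical bandwidth is just channels × channel width × transfer rate. A quick sketch (the channel counts and data rates are the figures from the posts above, not from any official Mac Pro spec):

```python
def peak_bandwidth_gbs(channels: int, width_bits: int, mts: int) -> float:
    """Peak theoretical bandwidth in GB/s: channels x (width/8) bytes per transfer x MT/s."""
    return channels * (width_bits / 8) * mts / 1000

# 16 channels of DDR5-6400, assuming conventional 64-bit channels:
print(peak_bandwidth_gbs(16, 64, 6400))  # 819.2 GB/s
# The ~600GB/s figure quoted above is closer to 16 channels of DDR5-4800:
print(peak_bandwidth_gbs(16, 64, 4800))  # 614.4 GB/s
```

So 16 full-width channels of DDR5-6400 would actually land above 800GB/s; the ~600GB/s figure only works out if you assume a lower effective data rate.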
 

altaic

Suspended
Jan 26, 2004
712
484
HBM2 still has an advantage in bandwidth per area. If you want to go, say, 1.6TB/s bandwidth (4x M1 Max), you’d need 16x RAM modules with the current implementation which can get messy. HBM3 can probably do it with a much smaller package. But then again you lose RAM capacity, so there are drawbacks too.
Do you know what the highest density die (or maybe flip-chip is a better term) area of LPDDR5 and HBM2 vs the M1 Max is? I’ve been looking into some multi-die packaging patents recently, so I’m curious.

Edit: IIRC, the M1 Max is 420 sq mm. I’m wondering what the silicon area of state-of-the-art RAM packages is.
 

Kpjoslee

macrumors 6502
Sep 11, 2007
417
269
I am not sure how this solution is better, except that it provides some (awkward, since all the slots have to be filled) upgradeability potential. It's also expensive and unwieldy. Also, 600GB/s is not nearly enough to offer competitive performance in this segment. Apple demonstrated that they are willing to offer 400GB/s in a lightweight workstation laptop, something that has so far been completely unprecedented! A pro desktop workstation would offer much more.

Or Apple can pull some kind of proprietary 8 x 256-bit LPDDR5 modules effectively offering 1.6TB/s of bandwidth and 2TB of max capacity.
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,678
Do you know what the highest density die (or maybe flip-chip is a better term) area of LPDDR5 and HBM2 vs the M1 Max is? I’ve been looking into some multi-die packaging patents recently, so I’m curious.

Sorry, no idea unfortunately. I think the LPDDR5 modules Apple currently uses (up to 16GB per module) are the largest on the market. HBM3 has just been announced with 24GB per stack, but much higher bandwidth. With those chips Apple would need only two stacks to reach 1.6TB/s of bandwidth, but the max capacity would be 48GB.
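The two-stack arithmetic checks out against HBM3's announced headline figures: a 1024-bit interface per stack at 6.4 Gbps per pin (treat the pin rate as the spec ceiling, since shipping parts may clock lower):

```python
def hbm3_stack_gbs(pins: int = 1024, gbps_per_pin: float = 6.4) -> float:
    """Peak bandwidth of one HBM3 stack in GB/s (bits per second / 8)."""
    return pins * gbps_per_pin / 8

per_stack = hbm3_stack_gbs()
print(per_stack)      # 819.2 GB/s per stack
print(2 * per_stack)  # 1638.4 GB/s, i.e. ~1.6TB/s from just two stacks
print(2 * 24)         # 48 GB total capacity at 24GB per stack
```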

Or Apple can pull some kind of proprietary 8 x 256-bit LPDDR5 modules effectively offering 1.6TB/s of bandwidth and 2TB of max capacity.

A huge pluggable module with thousands of pins? Sounds like a disaster.
 

Kpjoslee

macrumors 6502
Sep 11, 2007
417
269
A huge pluggable module with thousands of pins? Sounds like a disaster.
Yep. It wouldn't be technically feasible. I would prefer to see 16-channel DDR5 instead lol. Maybe Apple could implement a huge Infinity Cache-like solution for the GPU on the Jade 2~4x die, like AMD, to compensate for the bandwidth.
 

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
Or Apple can pull some kind of proprietary 8 x 256-bit LPDDR5 modules effectively offering 1.6TB/s of bandwidth and 2TB of max capacity.
It'll have to be 256+32-bit modules to cater for ECC. High-capacity workstations need to guard against RAM corruption, since the probability of flipped bits is higher with more RAM.
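For context, that 256+32 split is the same 1:8 ECC overhead as classic 72-bit (64+8) SECDED DIMMs. A toy check (the 256-bit module itself is hypothetical, per the quoted post):

```python
# Classic SECDED ECC widens the bus by one check bit per eight data bits.
data_bits, ecc_bits = 256, 32
overhead = ecc_bits / data_bits
print(overhead)              # 0.125, the same 12.5% as 64+8 ECC DIMMs
print(data_bits + ecc_bits)  # 288 bits total per hypothetical module
```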
 

altaic

Suspended
Jan 26, 2004
712
484
I think the LPDDR5 modules Apple currently uses (up to 16GB per module) are the largest on the market. HBM3 has just been announced with 24GB per stack, but much higher bandwidth.
They definitely are, but I’m wondering about the dies, sans fan-out for interconnects: what sort of gains could be had if the SoC and RAM (and other modules) were more tightly connected? Just ruminating.
 

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,665
OBX said:
That so many missed. Next step, a ray tracer (please). I once saw a video (can't find it) where a 10W chip performed better at ray tracing than a monster NVIDIA card because the former was dedicated to the task. I believe that was 5-7 years ago...
BVH traversal and intersection testing both need to be accelerated. That is why Nvidia (and IMG Tech's proposed solution) are faster at RT than AMD, at least for real-time ray tracing. If it doesn't have to be realtime, then it doesn't matter as much, right?
 

Pressure

macrumors 603
May 30, 2006
5,182
1,545
Denmark
I am not sure how this solution is better, except that it provides some (awkward, since all the slots have to be filled) upgradeability potential. It's also expensive and unwieldy. Also, 600GB/s is not nearly enough to offer competitive performance in this segment. Apple demonstrated that they are willing to offer 400GB/s in a lightweight workstation laptop, something that has so far been completely unprecedented! A pro desktop workstation would offer much more.
We know the performance cluster cannot use more than 204GB/s anyway, so there could be a cheaper and more straightforward solution for the Mac Pro replacement.

LPDDR5X at 8533Mbps and up to 64GB per stack.

A base 18-core model (2 high efficiency and 16 performance cores) would have plenty of bandwidth (512-bit memory channel and 546GB/s memory bandwidth) leftover for the GPU as well.

As SoC size increases so does pin count which should leave room for extra memory channels to increase bandwidth.

A 26-core model (2 high efficiency and 24 performance cores) flanked by 6 stacks of memory would give a 768-bit memory channel and 819GB/s memory bandwidth.
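Those figures follow directly from bus width × per-pin rate. Here is the arithmetic behind both configurations, assuming the 8533 Mbps LPDDR5X quoted above with each stack contributing 128 bits (the 128-bit-per-stack figure is my assumption, not from the post):

```python
def lpddr5x_gbs(bus_bits: int, mbps: int = 8533) -> float:
    """Peak LPDDR5X bandwidth in GB/s for a given total bus width."""
    return bus_bits / 8 * mbps / 1000

print(lpddr5x_gbs(512))  # 546.112 -> the quoted 546GB/s (4 stacks x 128 bits)
print(lpddr5x_gbs(768))  # 819.168 -> the quoted 819GB/s (6 stacks x 128 bits)
```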
 

theorist9

macrumors 68040
May 28, 2015
3,881
3,060
HBM2 still has an advantage in bandwidth per area. If you want to go, say, 1.6TB/s bandwidth (4x M1 Max), you’d need 16x RAM modules with the current implementation which can get messy. HBM3 can probably do it with a much smaller package. But then again you lose RAM capacity, so there are drawbacks too.
What about using DDR5 in the desktops? Based on this it looks like DDR5 has twice the bandwidth per DIMM vs. LPDDR5 (except in "bursts"?), but I'm not sure if I'm interpreting the info properly:

"LPDDR5 also features dual-16-bit channels, as well as a burst length of up to 32 (mostly 16)....DDR5, on the other hand, features two 32-bit channels per DIMM...Both DDR5 and LPDDR5 will support speeds up to 6400 Mbps as per the JEDEC standard, although we won’t see them with the first wave of modules. LPDDR5 also increases the density up to 32Gb per channel with operating voltages as low as 1.05/0.9V for VDD and 0.5/0.35V for I/O....Lastly, LPDDR5 also has a more flexible bank structure. While DDR5 packs 32 banks, its mobile variant can vary from 4 to 16 banks, with up to four bank groups (although 1-2 is the norm)."

 
Last edited:

leman

macrumors Core
Original poster
Oct 14, 2008
19,521
19,678
What about using DDR5 in the desktops? Based on this it looks like DDR5 has twice the bandwidth per DIMM vs. LPDDR5 (except in "bursts"?), but I'm not sure if I'm interpreting the info properly:

"LPDDR5 also features dual-16-bit channels, as well as a burst length of up to 32 (mostly 16)....DDR5, on the other hand, features two 32-bit channels per DIMM...Both DDR5 and LPDDR5 will support speeds up to 6400 Mbps as per the JEDEC standard, although we won’t see them with the first wave of modules. LPDDR5 also increases the density up to 32Gb per channel with operating voltages as low as 1.05/0.9V for VDD and 0.5/0.35V for I/O....Lastly, LPDDR5 also has a more flexible bank structure. While DDR5 packs 32 banks, its mobile variant can vary from 4 to 16 banks, with up to four bank groups (although 1-2 is the norm)."


I think the confusion is because LPDDR5 does not have DIMMs… so implementations have more flexibility in how to arrange the chips (although I have no idea whether 32-bit LPDDR5 is a thing). But for the same bus width, LPDDR5 and DDR5 will have the same bandwidth. E.g. Apple uses 256- and 512-bit RAM widths in the M1 Pro and Max respectively, which is equivalent to quad- or octa-channel DDR5 (assuming "traditional" 64-bit channels).
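To illustrate the equivalence: peak bandwidth depends only on total bus width times transfer rate, not on how that bus is sliced into channels. A minimal sketch, assuming LPDDR5-6400 (the data rate generally reported for the M1 Pro/Max):

```python
def gbs(total_bits: int, mts: int) -> float:
    """Peak bandwidth in GB/s for a bus of total_bits at mts mega-transfers/s."""
    return total_bits / 8 * mts / 1000

m1_pro = gbs(256, 6400)        # one 256-bit LPDDR5 bus: 204.8 GB/s
quad_ddr5 = 4 * gbs(64, 6400)  # four "traditional" 64-bit DDR5 channels
assert abs(m1_pro - quad_ddr5) < 1e-9  # identical peak bandwidth
print(gbs(512, 6400))          # 409.6 GB/s, the M1 Max's marketed "400GB/s"
```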
 
  • Like
Reactions: theorist9

theorist9

macrumors 68040
May 28, 2015
3,881
3,060
I think the confusion is because LPDDR5 does not have DIMMs… so implementations have more flexibility in how to arrange the chips (although I have no idea whether 32-bit LPDDR5 is a thing). But for the same bus width, LPDDR5 and DDR5 will have the same bandwidth. E.g. Apple uses 256- and 512-bit RAM widths in the M1 Pro and Max respectively, which is equivalent to quad- or octa-channel DDR5 (assuming "traditional" 64-bit channels).
Then what advantage does DDR5 give over LPDDR5 that would cause a PC maker to choose the less efficient DDR5 for its desktops? Is it simply that DDR5 is less expensive and/or more available?
 

Kpjoslee

macrumors 6502
Sep 11, 2007
417
269
Thanks. Based on what you wrote, I looked into this, and it appears what's going on is that LPDDR5 doesn't need to be soldered, but its low voltage makes soldering the preferred implementation.


Using DDR5 on the Mac Pro would be a blessing, since there would be much less costly options than getting memory upgrades from Apple, while using special LPDDR5 modules probably means you can only upgrade the RAM through Apple.
 
  • Like
Reactions: theorist9

theorist9

macrumors 68040
May 28, 2015
3,881
3,060
TLDR: It looks like, when they say the 400 GB/s bandwidth of the M1 Max's unified memory is unusually high for a laptop, they're comparing it to typical CPU RAM bandwidth, not typical GPU RAM bandwidth. More specifically, the Max's unified memory gives the CPU an unusually high RAM bandwidth, and gives the GPU a typical RAM bandwidth (compared to other laptops in its class).

****
I've read that the tradeoff between DDR and GDDR RAM is that the former provides low latency (needed for CPUs) and the latter provides high bandwidth (needed for GPUs). So it seems Apple decided they couldn't accept the high latency of GDDR RAM for unified memory, and instead used DDR RAM and increased the bandwidth to the point it was comparable to what is available from mobile-workstation-class GDDR.

For instance, a comparably-equipped (64 GB RAM, 4 TB SSD, 120 Hz screen) 17" Dell 7760 workstation laptop, with an 8-core Xeon W11955M and A4000 mobile, costs about the same (within 10%) as a 16" M1 Max ($4900 for the MBP and $5440 for the Dell, based on current pricing on Dell's website*). And the A4000 mobile offers 384 GB/s GPU RAM bandwidth (see link below).**

So it sounds like the Max's 400 GB/s isn't unprecedented when it comes to GPU bandwidth (compared to other laptops in its class), but might instead be unprecedented compared to their CPU bandwidth (?).


*At the time I configured it, a couple of days ago, they were offering an automatic discount of 35% off; they always offer heavy discounts, but I don't know if this particular discount is typical. Interestingly, I tried configuring it, left, and came back, and the second time it offered me a higher discount--so maybe it tracks you and offers a bigger discount if you don't bite the first time :).

**You don't even need to go to a workstation-class laptop to get that GPU RAM bandwidth. For instance, an RTX3080 mobile also offers 384 GB/s memory bandwidth. Also, for comparison, the A4000 and RTX3080 desktop GPUs offer twice that: 768 GB/s
 
Last edited:

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,229
Much has been made about the M1 Max offering an unusually high (400 GB/s) memory bandwidth for a laptop. But what does this mean, i.e., what's the basis of comparison? I ask because the M1 has unified memory, so when they compare it to other comparably-priced workstation-class laptops, what bandwidth are they using as the basis for that comparison? I.e., are they comparing it to the CPU RAM bandwidth, the GPU RAM bandwidth, or the sum of the two? Without understanding more, I don't see how any of those three comparisons would be meaningful.

More specifically, I've read that the tradeoff between DDR and GDDR RAM is that the former provides low latency (needed for CPUs) and the latter provides high bandwidth (needed for GPUs). So it seems Apple decided they couldn't accept the high latency of GDDR RAM for unified memory, and instead used DDR RAM and increased the bandwidth to the point it was comparable to what is available from workstation-class GDDR.

For instance, a comparably-equipped (64 GB RAM, 4 TB SSD, 120 Hz screen) 17" Dell 7760 workstation laptop, with an 8-core Xeon and A4000, costs about the same (within 10%) as a 16" M1 Max ($4900 for the MBP and $5440 for the Dell, based on current pricing on Dell's website). And an A4000 mobile offers 384 GB/s GPU RAM bandwidth (see link below).

So it sounds like the Max's 400 GB/s isn't unprecedented when it comes to GPU bandwidth (compared to other laptops in its class), but might instead be unprecedented compared to their CPU bandwidth (?).


This is correct: it's mostly the CPU bandwidth that is unprecedented for such a device (though in practice it caps out at 240GB/s for the P-core cluster). What's interesting about the GPU bandwidth is that they achieved it with LPDDR5 rather than GDDR; the main reason GPUs don't do this is cost. There are other reasons, but that's the biggest.
 

theorist9

macrumors 68040
May 28, 2015
3,881
3,060
This is correct: it's mostly the CPU bandwidth that is unprecedented for such a device (though in practice it caps out at 240GB/s for the P-core cluster). What's interesting about the GPU bandwidth is that they achieved it with LPDDR5 rather than GDDR; the main reason GPUs don't do this is cost. There are other reasons, but that's the biggest.
So does that mean, for the P-core cluster, it's 120 for the Pro and 240 for the Max, or 200 for the Pro and 240 for the Max?

And, either way, has anyone uncovered any CPU-bound tasks that benefit from this high bandwidth, i.e., CPU-bound tasks where the Max is significantly faster than the Pro? If you do a lot of those, that could justify the added $200 to get the Max with the 24-core GPU, even if you don't need the added GPU power.
 

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,229
So does that mean, for the P-core cluster, it's 120 for the Pro and 240 for the Max, or 200 for the Pro and 240 for the Max?

And, either way, has anyone uncovered any CPU-bound tasks that benefit from this high bandwidth, i.e., CPU-bound tasks where the Max is significantly faster than the Pro? If you do a lot of those, that could justify the added $200 to get the Max with the 24-core GPU, even if you don't need the added GPU power.

It's 200 for the Pro and 240 for the Max, so basically the same wrt the CPU, not a huge change. You could probably run tests like SPEC 503.bwaves, 519.lbm, 549.fotonik3d and 554.roms on the two and see some differences, but otherwise they'd be pretty close to identical.
 
  • Like
Reactions: theorist9

theorist9

macrumors 68040
May 28, 2015
3,881
3,060
It's 200 for the Pro and 240 for the Max, so basically the same wrt the CPU, not a huge change. You could probably run tests like SPEC 503.bwaves, 519.lbm, 549.fotonik3d and 554.roms on the two and see some differences, but otherwise they'd be pretty close to identical.
BTW, I came across this (https://www.anandtech.com/show/17024/apple-m1-max-performance-review/2), which seems to say the ~240 figure for the Max is for all cores, and with the P-cores it's ~220.

For a single core, it says the Pro/Max can do ~100, by comparison with ~60 for the M1 (https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested) (there are other differences, including cache size and latency). I'm thus curious if anyone has uncovered a single-core CPU-bound task that would be faster on the Pro/Max than on the M1. Or would this also be something that could only be seen in synthetic benchmarks?

If the latter, then I'm guessing the only place you'd see a real-world benefit of the Pro/Max's large memory bandwidth compared to high-end AMD/Intel processors (which are 100 GB/s multicore?) (https://www.intel.com/content/www/u...0056722/processors/intel-core-processors.html) would be for multicore tasks, since that's what would be needed to get the CPU to saturate the memory bandwidth. [Of course, there's separately a benefit for unified memory.]
 

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,229
BTW, I came across this (https://www.anandtech.com/show/17024/apple-m1-max-performance-review/2), which says the ~240 figure for the Max is for all cores, and with the P-cores it's ~220.

For a single core, it says the Pro/Max can do ~100, by comparison with the ~60 for the M1 (https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested) (there are other differences, including cache size and latency). I'm thus curious if anyone has uncovered a single-core CPU-bound task that would be faster on the Pro/Max than on the M1. Or would this also be something that could only be seen in synthetic benchmarks?

That’s where I pulled my numbers from :) - you’re right I should’ve said (and meant to say) that the CPU can pull 240GBs, not the P-cores alone.

Well the SPEC tests I mentioned above as being memory-pressure intensive are all based on real programs and algorithms. You can look up their names. Presumably that will be reflected in their workloads. But even in them you only generally get small improvements over the M1 in single-core tasks. So nothing to write home about.

They’re listed in SPEC 2017 fp:


As you can see, the single-core results are generally the same or similar, at most 13% better (in 549.fotonik3d).

It should be noted that in one of the integer subtests (548.exchange2_r) the regular M1 outperformed the Max by a substantial amount in ST.
 
  • Like
Reactions: theorist9

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
BTW, I came across this (https://www.anandtech.com/show/17024/apple-m1-max-performance-review/2), which seems to say the ~240 figure for the Max is for all cores, and with the P-cores it's ~220.

For a single core, it says the Pro/Max can do ~100, by comparison with ~60 for the M1 (https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested) (there are other differences, including cache size and latency). I'm thus curious if anyone has uncovered a single-core CPU-bound task that would be faster on the Pro/Max than on the M1. Or would this also be something that could only be seen in synthetic benchmarks?

If the latter, then I'm guessing the only place you'd see a real-world benefit of the Pro/Max's large memory bandwidth compared to high-end AMD/Intel processors (which are 100 GB/s multicore?) (https://www.intel.com/content/www/u...0056722/processors/intel-core-processors.html) would be for multicore tasks, since that's what would be needed to get the CPU to saturate the memory bandwidth. [Of course, there's separately a benefit for unified memory.]
From the die shots shown for the M1 Pro & Max, it appears that the memory controllers are split between the two P/E CPU clusters. For the M1 Max, it looks like the 400GB/s of bandwidth is split between the two CPU clusters, so each cluster could potentially get 200GB/s of bandwidth independently, as long as the two clusters' SLCs do not need to sync their caches.

I would think that in theory, with proper process scheduling (e.g. threads of the same process kept within the same CPU cluster), all 400GB/s of memory bandwidth could be consumed by the CPU alone.
 

theorist9

macrumors 68040
May 28, 2015
3,881
3,060
That’s where I pulled my numbers from :) - you’re right I should’ve said (and meant to say) that the CPU can pull 240GBs, not the P-cores alone.

Well the SPEC tests I mentioned above as being memory-pressure intensive are all based on real programs and algorithms. You can look up their names. Presumably that will be reflected in their workloads. But even in them you only generally get small improvements over the M1 in single-core tasks. So nothing to write home about.

They’re listed in SPEC 2017 fp:


As you can see, the single-core results are generally the same or similar, at most 13% better (in 549.fotonik3d).

It should be noted that in one of the integer subtests (548.exchange2_r) the regular M1 outperformed the Max by a substantial amount in ST.
They noted that integer subtest "523.xalancbmk is showcasing a large performance improvement [in the Max over the M1], however I don’t think this is due to changes on the chip, but rather a change in Apple’s memory allocator in macOS 12. Unfortunately, we no longer have an M1 device available to us, so these are still older figures from earlier in the year on macOS 11."

So perhaps the opposite effect seen with 548.exchange2_r is due to the change in memory allocator as well.
 

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,229
They noted that integer subtest "523.xalancbmk is showcasing a large performance improvement [in the Max over the M1], however I don’t think this is due to changes on the chip, but rather a change in Apple’s memory allocator in macOS 12. Unfortunately, we no longer have an M1 device available to us, so these are still older figures from earlier in the year on macOS 11."

So perhaps the opposite effect seen with 548.exchange2_r is due to the change in memory allocator as well.

Possibly, I don’t know enough about either test.
 