
chumps

Cancelled
Original poster
Sep 2, 2020
71
62
After watching the latest keynote I looked up the memory bandwidth figures for x86 chips and they seem to be in the 50-90 GB/s range (the latter for Alder Lake w/ DDR5). GPUs hit 400 GB/s+. What keeps CPU memory bandwidth so low? The distance of the RAM slots from the CPU?
 

chabig

macrumors G4
Sep 6, 2002
11,450
9,321
I'm not an expert, but the obvious answer is the width of the channel. Because Apple's memory is in the SOC package, they can run as many data lines as they need, while traditional CPU packaging would need to run all of those additional data lines off the chip through external pins.
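To put rough numbers on that: peak theoretical bandwidth is just bus width times transfer rate. A minimal sketch, where the widths and data rates are illustrative assumptions rather than official specs:

```c
#include <stdio.h>

/* Peak theoretical bandwidth = bus width (bytes) x transfer rate (MT/s).
   The widths and data rates below are illustrative assumptions. */
static double peak_gb_per_s(int bus_bits, int megatransfers) {
    return (bus_bits / 8.0) * megatransfers / 1000.0;
}

int main(void) {
    /* Typical desktop: two 64-bit channels of DDR5-4800 */
    printf("128-bit DDR5-4800:   %6.1f GB/s\n", peak_gb_per_s(128, 4800));
    /* Wide on-package bus, roughly M1 Pro class (assumed) */
    printf("256-bit LPDDR5-6400: %6.1f GB/s\n", peak_gb_per_s(256, 6400));
    /* Roughly M1 Max class (assumed) */
    printf("512-bit LPDDR5-6400: %6.1f GB/s\n", peak_gb_per_s(512, 6400));
    return 0;
}
```

Those three cases land at about 76.8, 204.8, and 409.6 GB/s, which lines up with the 50-90 GB/s and 400 GB/s+ figures in the original question.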
 

parseckadet

macrumors 65816
Dec 13, 2010
1,495
1,276
Denver, CO
Yes, distance, sort of. The RAM is not integrated with the CPU in a traditional PC design; it's on a separate portion of the motherboard. That means data transfer speeds are dictated by the memory controller and its clock speed. This is why you will see specs on CPUs detailing the various amounts of cache they have (L1, L2, L3). This cache is local memory that the CPU tries to fill with data that it anticipates needing in the near term. These caches are typically much faster than going out to main memory (they operate at or relatively close to the CPU's clock speed), but they have sizes in the KB or MB range. So any time the CPU changes tasks and needs something that's not in cache it has to load it from RAM, which is slower.
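A rough way to see why misses hurt is the textbook average memory access time formula, AMAT = hit time + miss rate x miss penalty. A minimal sketch, where the 1 ns and 100 ns figures are assumptions for illustration, not measurements:

```c
#include <stdio.h>

/* Average memory access time: AMAT = hit time + miss rate * miss penalty.
   The 1 ns / 100 ns figures are illustrative assumptions, not measurements. */
int main(void) {
    double hit_ns  = 1.0;    /* assumed cache hit time   */
    double dram_ns = 100.0;  /* assumed DRAM penalty     */

    for (int pct = 1; pct <= 10; pct++) {
        double miss_rate = pct / 100.0;
        double amat = hit_ns + miss_rate * dram_ns;
        printf("miss rate %2d%% -> average access %5.1f ns\n", pct, amat);
    }
    return 0;
}
```

Even a few percent of misses multiplies the average access time several times over, which is why CPUs spend so much die area on cache.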

The same issues are true for the VRAM on a GPU. That sits on the graphics card right next to the GPU (these days anyway), but anything that makes the trip between the GPU and the CPU or system RAM has to go over the motherboard, and will thus be limited. Oftentimes calculations on the GPU stay on the GPU, so this isn't always a concern.

Chabig is right to point out the differences in channel width as well.

ETA: All of this starts to get into why the Apple Silicon chips aren't actually that magical. The biggest limitations and challenges in computing have always been the various bottlenecks. We all saw what happened when SSDs started to replace spinning disks for storage and the massive advantages that had. The bottleneck between the CPU and RAM was the next obvious one to take on. Apple is in a unique position to do so, and that's exactly what they've done. Intel may very well try to do the exact same thing. They're behind the eight ball right now, but they've been in that position before and have usually come back swinging.
 
Last edited:

fakestrawberryflavor

macrumors 6502
May 24, 2021
423
569
Typically, CPU RAM is lower bandwidth but also much lower latency. GPU RAM is super wide, higher bandwidth, but also much higher latency. Something about CPU cores needing lower latency to keep the pipeline fed in a serial format, where GPUs are so massively parallel that the latency penalty isn’t noticed. I’m not an expert, just my rudimentary understanding.
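A handy way to see why GPUs can shrug off latency is Little's Law: to sustain a given bandwidth at a given latency you need bandwidth x latency bytes in flight. A small sketch with assumed round numbers (the 100 ns latency and the two bandwidth targets are assumptions, not measured figures):

```c
#include <stdio.h>

/* Little's Law for memory: bytes in flight = bandwidth x latency.
   All numbers here are round illustrative assumptions. */
int main(void) {
    double latency_ns = 100.0;              /* assumed DRAM latency            */
    double line_bytes = 64.0;               /* size of one cache-line request  */
    double targets_gb_s[] = {50.0, 400.0};  /* CPU-ish vs GPU-ish bandwidth    */

    for (int i = 0; i < 2; i++) {
        /* GB/s * ns conveniently works out to plain bytes in flight. */
        double in_flight_bytes = targets_gb_s[i] * latency_ns;
        printf("%5.0f GB/s at %3.0f ns needs %6.0f bytes in flight (~%3.0f cache lines)\n",
               targets_gb_s[i], latency_ns, in_flight_bytes, in_flight_bytes / line_bytes);
    }
    return 0;
}
```

That works out to hundreds of outstanding cache-line requests to keep 400 GB/s busy. A CPU core can only track a handful of outstanding misses per thread, while a GPU with thousands of threads keeps that many requests in flight easily, which is roughly why the latency penalty mostly disappears there.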
 

mi7chy

macrumors G4
Oct 24, 2014
10,625
11,298
In addition to the other advantages already mentioned, GDDR is designed for much higher per-pin data rates than standard DDR. That's why the AMD APUs in the Sony PS5 and Xbox Series X/S consoles (which use GDDR6 as unified memory) perform better than their desktop counterparts with DDR, and why dGPUs exclusively use GDDR except for low-end budget ones that use DDR.
 

chumps

Cancelled
Original poster
Sep 2, 2020
71
62
Thanks for the replies! Very illuminating.

Typically, CPU RAM is lower bandwidth but also much lower latency. GPU RAM is super wide, higher bandwidth, but also much higher latency. Something about CPU cores needing lower latency to keep the pipeline fed in a serial format, where GPUs are so massively parallel that the latency penalty isn’t noticed. I’m not an expert, just my rudimentary understanding.

The M1 has a latency of around 100ns and traditional PCs have somewhere around 10-20ns, right? So it does seem like, as Andrei from AnandTech said, these chips are more like GPUs with CPUs bolted on?

Also something I don't understand is why the M1 or GPUs have higher latency. The memory is right there on the package in the case of the M1, and pretty nearby on GPUs. Shouldn't latency be better in these scenarios?
 
Last edited:

chumps

Cancelled
Original poster
Sep 2, 2020
71
62
Also, correct me if I'm wrong, but given the various cache levels there would be nothing to stop Apple from supporting traditional DDR memory via RAM slots, right? So L1 > L2 > L3 > on-chip memory > RAM slots > SSD etc. Not to say that I think they'll go that route, just wondering.
 

m1maverick

macrumors 65816
Nov 22, 2020
1,368
1,267
Also, correct me if I'm wrong, but given the various cache levels there would be nothing to stop Apple from supporting traditional DDR memory via RAM slots, right? So L1 > L2 > L3 > on-chip memory > RAM slots > SSD etc. Not to say that I think they'll go that route, just wondering.
I wouldn't be surprised if they go this route for a Mac Pro equivalent system. Either that, or the processor will be heeeeuuuuggggeeee!
 

Krevnik

macrumors 601
Sep 8, 2003
4,101
1,312
Also, correct me if I'm wrong, but given the various cache levels there would be nothing to stop Apple from supporting traditional DDR memory via RAM slots, right? So L1 > L2 > L3 > on-chip memory > RAM slots > SSD etc. Not to say that I think they'll go that route, just wondering.
The LPDDR that Apple uses is analogous to what you would find in RAM chips on DIMMs. But it's generally either/or. If you are going to support DIMMs, just use the DIMMs. You don't gain a whole lot by doing a two-tier RAM system to my knowledge.

The M1 has a latency of around 100ns and traditional PCs have somewhere around 10-20ns, right? So it does seem like, as Andrei from AnandTech said, these chips are more like GPUs with CPUs bolted on?

Also something I don't understand is why the M1 or GPUs have higher latency. The memory is right there on the package in the case of the M1, and pretty nearby on GPUs. Shouldn't latency be better in these scenarios?

The comment was more about how the GPU effectively dominates the M1 Pro/Max dies. The memory controllers are what they are to feed that GPU complex; the CPU doesn't really need that much bandwidth. This is also a very wide (256-bit or 512-bit) RAM bus, which is more like a GPU than a CPU.

Unfortunately, I'm not well versed enough as to why the latency is this high.
 

leman

macrumors Core
Oct 14, 2008
19,522
19,679
Two basic considerations. First, CPUs don't need massive bandwidth to function properly (they are specialized for complex logic and data reuse, not parallel data crunching). Second, cost: 128-bit RAM interfaces have established themselves as a reasonable compromise in modern consumer computing. CPU development has kind of grown around these two principles. While modern CPU cores can in principle utilize more memory bandwidth, the usual approach has been to increase the cache, since it's cheaper.

Apple is shaking up the scene a bit here. Already around a year ago I was saying that the Apple Silicon memory subsystem is more akin to a GPU's than to a CPU's. This year it became quite obvious. They are essentially building their CPU into a large GPU, while also using a massive amount of cache. It's expensive, but it will unlock new potential for the CPU.
 

m1maverick

macrumors 65816
Nov 22, 2020
1,368
1,267
Apple is shaking up the scene a bit here. Already around a year ago I was saying that the Apple Silicon memory subsystem is more akin to a GPU's than to a CPU's. This year it became quite obvious. They are essentially building their CPU into a large GPU, while also using a massive amount of cache. It's expensive, but it will unlock new potential for the CPU.
I would say it's more like moving the memory controller on chip as was done several years ago. The more you put on chip the tighter the integration and the higher the performance. The question becomes: How much can you put on chip?
 

sentential

macrumors regular
May 27, 2015
169
72
After watching the latest keynote I looked up the memory bandwidth figures for x86 chips and they seem to be in the 50-90 GB/s range (the latter for Alder Lake w/ DDR5). GPUs hit 400 GB/s+. What keeps CPU memory bandwidth so low? The distance of the RAM slots from the CPU?

Lots of good replies already, but the short of it is that they don't have to. x86 CPUs are straight number crunchers, other than the Xeon-SP platforms, which have the same 512-bit buses. Why? Because they handle large datasets that need the memory bandwidth just like GPUs do.

Plus, narrower buses reduce cost both on the CPU die and on the memory, since the two have to be paired together, and the lost width can often be made up with raw speed, at the cost of higher power.

So simply it’s a design choice based on use cases of what Intel and AMd think their end users will be.

Could Intel/AMD make a Xeon-SP/EPYC laptop in the same power envelope as Apple? Absolutely. They already do in the server space now.

Would it cost an astronomical amount of money, similar to the M1 Max? Also yes. They just don't think there's a market for it, and I'm inclined to agree.

The M1 Max/Pro is the worst of both worlds. Too large to scale clock frequencies, too expensive to have broad appeal, cannot be serviced whatsoever by the end user or by their IT support.

The real answer would have been a knock-off of Infinity Fabric from AMD, which is truly revolutionary, hence them beating Intel with a cudgel for the last couple of years.

With a chiplet system they could have done nothing more than fuse 4x M1 chiplets and hit the same power targets, with a similar GPU core target of ~28 but quadruple the CPU cores at 32.
 
Last edited:

throAU

macrumors G3
Feb 13, 2012
9,204
7,357
Perth, Western Australia
After watching the latest keynote I looked up the memory bandwidth figures for x86 chips and they seem to be in the 50-90 GB/s range (the latter for Alder Lake w/ DDR5). GPUs hit 400 GB/s+. What keeps CPU memory bandwidth so low? The distance of the RAM slots from the CPU?

Physical distance and physical space.

We're at the stage now where having to make trips out to memory slots on the motherboard is a significant electrical signalling problem due to physical distance.

This is why the M1 memory is on the CPU/GPU package. The connections between the CPU/GPU and the memory are much, much shorter and can be clocked higher without signalling issues. Just like your ADSL/VDSL internet has lower performance the further you are from the telco, it's the same with memory slots being far from your processor.

Also bus width. Apple gave the M1 Max a 512-bit memory bus. That's huge. Only a few GPUs have had that width in the past, and they've been ultra-high-end expensive cards. Vega and Fury had even bigger buses - but guess what? That was with on-package HBM (i.e., memory chips on the processor package - not separate sockets with chips in them), kinda like what Apple is doing with their SoC - due to packaging reasons.

PC CPU memory buses are 64 bits per channel (mainstream parts typically run 2x 64-bit channels) - and making a very wide bus like the M1 Max has (except on a motherboard, for memory sockets) is expensive and difficult - and uses a large amount of board space for running the physical traces. A traditional PC CPU/board already has a heap of traces for things like PCIe slots, etc.

It's essentially a packaging problem vs. physics. It isn't just to lock users down that Apple has gone to this packaging format - there are laws-of-physics issues they have attended to in the process.
 
Last edited:
  • Like
Reactions: yitwail

falainber

macrumors 68040
Mar 16, 2016
3,542
4,136
Wild West
I think there is also an issue of scalability. x86 CPUs are used for a variety of applications, including servers and workstations with terabytes of RAM. It would be technically impossible or prohibitively expensive to have that much unified RAM. Apple started with phone SoCs with very small RAM. Now they've managed to increase RAM size to 64GB, which is impressive. But it's not enough for a Mac Pro. It's interesting what they are going to do for desktop chips.
 

throAU

macrumors G3
Feb 13, 2012
9,204
7,357
Perth, Western Australia
It's interesting what they are going to do for desktop chips.

I would wager they will have multiple sockets/packages with locally connected memory (just like the laptops, the RAM will be on the same package as CPU/GPU). Upgrading will be a case of plugging in modules that contain RAM/CPU/GPU in the same package.

Think 4x M1 Max packages in 4 sockets on a board. It might not be exactly the same as an M1 Max (may be bigger, will target higher power consumption and clock speed), but it will certainly be modular and have the capacity to link multiple packages together.

I could be wrong here, but I very much suspect that just like AMD has done with Ryzen, systems in general are going to become more modular to get scalability. You can't just keep making the chips bigger due to cost. But you can design them to be very fast in small clusters and then link the clusters together.


There's been talk of moving compute into the memory modules for some time due to the concerns mentioned in my previous post. Apple have just done it in reverse; moved memory into the processing package.
 
  • Like
Reactions: singhs.apps

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
The M1 has a latency of around 100ns and traditional PCs have somewhere around 10-20ns, right? So it does seem like, as Andrei from AnandTech said, these chips are more like GPUs with CPUs bolted on?

Also something I don't understand is why the M1 or GPUs have higher latency. The memory is right there on the package in the case of the M1, and pretty nearby on GPUs. Shouldn't latency be better in these scenarios?
I'm not sure where you're getting this idea that traditional PCs have main memory latency of 10 to 20ns. DRAM random access latency isn't anywhere near that fast; that kind of figure can only come from a cache.

Here are a couple of test results from AnandTech showing that the M1 has very similar main memory latency to a Ryzen 9 CPU which also uses LPDDR4X-4266. Look at the far right side of each memory latency graph: what they're showing on the X axis is the effect of doing progressively larger tests to probe out each level of cache.
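For anyone curious, that kind of test can be approximated with a pointer-chasing loop run over progressively larger buffers. A rough, uncalibrated sketch; the buffer sizes and step count are arbitrary assumptions, and a serious test randomizes the chain so the hardware prefetcher can't hide DRAM latency:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Walk a chain of dependent loads; each step must wait for the previous
   one, so time per step approximates access latency for that working set. */
static double chase_ns_per_load(size_t bytes, size_t steps) {
    size_t n = bytes / sizeof(size_t);
    size_t *buf = malloc(n * sizeof(size_t));
    if (!buf) return 0.0;

    /* One element per 64-byte cache line, walked in order. A serious test
       shuffles this chain to defeat the prefetcher; this keeps it short. */
    size_t stride = 64 / sizeof(size_t);
    for (size_t i = 0; i < n; i++)
        buf[i] = (i + stride) % n;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile size_t idx = 0;
    for (size_t s = 0; s < steps; s++)
        idx = buf[idx];                    /* dependent load chain */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(buf);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}

int main(void) {
    /* Working sets roughly sized for L1, L2, last-level cache, then DRAM. */
    size_t sizes_kb[] = {32, 256, 8 * 1024, 256 * 1024};
    for (int i = 0; i < 4; i++)
        printf("%7zu KB working set: %5.1f ns per load\n",
               sizes_kb[i], chase_ns_per_load(sizes_kb[i] * 1024, 20 * 1000 * 1000));
    return 0;
}
```

As the working set outgrows each cache level, the nanoseconds per load jump, which is exactly the staircase shape those latency graphs show.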

 
  • Like
Reactions: crazy dave

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
It would be technically impossible or prohibitively expensive to have that much unified RAM.
Not really.

There's nothing stopping Apple from designing the AS Mac Pros with DDR5 DIMM sockets and going crazy with the RAM capacity. And I think Apple will likely go this route. They just have to give up the latency and power advantages of on-package LPDDR5 SDRAM.

The existing Intel Mac Pro already has a 384-bit data bus across its DDR4 DIMMs. I would think Apple will go to 768 bits next for the AS Mac Pro, and likely further increase the SoC's SLC cache. They have to keep all the high-performance CPU and GPU cores fed with data. This should mitigate the higher access latency of socketed DIMMs. And those DIMMs would likely have ECC memory. I don't think LPDDR5 memory modules have ECC at the moment.
 
  • Like
Reactions: crazy dave

falainber

macrumors 68040
Mar 16, 2016
3,542
4,136
Wild West
Not really.

There's nothing stopping Apple from designing the AS Mac Pros with DDR5 DIMM sockets and going crazy with the RAM capacity. And I think Apple will likely go this route. They just have to give up the latency and power advantages of on-package LPDDR5 SDRAM.

The existing Intel Mac Pro already has a 384-bit data bus across its DDR4 DIMMs. I would think Apple will go to 768 bits next for the AS Mac Pro, and likely further increase the SoC's SLC cache. They have to keep all the high-performance CPU and GPU cores fed with data. This should mitigate the higher access latency of socketed DIMMs. And those DIMMs would likely have ECC memory. I don't think LPDDR5 memory modules have ECC at the moment.
But if they go with DIMMs, that would mean they use the same memory interface as all the other companies, with the same bandwidth.
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
The M1 Max/Pro is the worst of both worlds. Too large to scale clock frequencies, too expensive to have broad appeal, cannot be serviced whatsoever by the end user or by their IT support.

The real answer would have been a knock-off of Infinity Fabric from AMD, which is truly revolutionary, hence them beating Intel with a cudgel for the last couple of years.

With a chiplet system they could have done nothing more than fuse 4x M1 chiplets and hit the same power targets, with a similar GPU core target of ~28 but quadruple the CPU cores at 32.
You do not have any idea what you're talking about.

"Too large to scale clock frequencies" - nonsense. They aren't building a single giant circuit the size of the entire die. They're designing much smaller blocks, with local clock trees, such as the GPU core, the two types of CPU core, or the NPU core. Then they're replicating these structures and interconnecting them. If they felt they wanted to increase frequency of any of those small blocks, under that system architecture, there's nothing getting in the way.

"too expensive to have broad appeal" - who cares? Apple built these to appeal to people who need a ton of compute power in a notebook. Also, I bet they end up selling a lot more of the base 14" model (and the base upgraded to 32GB RAM) than you might think.

"cannot be serviced" - for some reason, my employer and many others are perfectly willing to buy Mac laptops and give them to end users despite the fact that their hardware is more difficult to service.

With a chiplet system and something akin to Infinity Fabric, they would have ended up spending more power for the same nominal compute throughput, while also increasing latencies and thus hurting performance. As an armchair chip designer, you seem to be blissfully unaware that off-chip communication always costs more power than on-chip. The fact that Apple stuck with a single monolithic die for laptop computers shows that they're committed to minimizing power and maximizing performance for that power.
 

827538

Cancelled
Jul 3, 2013
2,322
2,833
Typically, CPU RAM is lower bandwidth but also much lower latency. GPU RAM is super wide, higher bandwidth, but also much higher latency. Something about CPU cores needing lower latency to keep the pipeline fed in a serial format, where GPUs are so massively parallel that the latency penalty isn’t noticed. I’m not an expert, just my rudimentary understanding.
Quite a good description.

OP, also look at the absolutely massive L1/L2/L3 caches Apple is including. You'd be surprised at how much die space is taken up by cache. The name of the game is as much latency as bandwidth. You can have all the bandwidth in the world, but if the cores (typically CPU cores) are waiting for data then everything stops, hence the push for very large caches to feed those powerful cores.

It also doesn't have a whole lot to do with physical distance, but rather with the time taken to fetch data from RAM > cache > core.
 
  • Like
Reactions: altaic

827538

Cancelled
Jul 3, 2013
2,322
2,833
I think there is also an issue of scalability. x86 CPUs are used for a variety of applications, including servers and workstations with terabytes of RAM. It would be technically impossible or prohibitively expensive to have that much unified RAM. Apple started with phone SoCs with very small RAM. Now they've managed to increase RAM size to 64GB, which is impressive. But it's not enough for a Mac Pro. It's interesting what they are going to do for desktop chips.

That's not really correct. Is it more work to include RAM on an SoC package than as separate sticks? Sure. Is it prohibitively expensive or hard? Not really, you just have to design for it. There are quite a lot of ways to skin this particular cat, and they can pack a lot of RAM in there if they choose. I don't think they will do it, but it's entirely possible. The Mac Pro is a much more interesting question, as they sort of have to go for a chiplet design in my opinion, given the M1 Max is already probably close to the limit of acceptable yields given its size (this has nothing to do with the DRAM). If they were to include, say, 32 HP cores + all the other bits, then they really are left with going down the chiplet route...
Or just surprising absolutely everyone and making a truly mammoth slab of silicon goodness for us all to awe and gawk at.
 
  • Like
Reactions: altaic

chumps

Cancelled
Original poster
Sep 2, 2020
71
62
I'm not sure where you're getting this idea that traditional PCs have main memory latency of 10 to 20ns. DRAM random access latency isn't anywhere near that fast; that kind of figure can only come from a cache.

Here are a couple of test results from AnandTech showing that the M1 has very similar main memory latency to a Ryzen 9 CPU which also uses LPDDR4X-4266. Look at the far right side of each memory latency graph: what they're showing on the X axis is the effect of doing progressively larger tests to probe out each level of cache.


Probably because of the CAS latency and memory timing numbers that get thrown around. It makes sense that the actual round-trip numbers would be a lot slower.
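Right, the advertised CAS latency does convert to roughly 10-20 ns, which is probably where those figures come from, but it's only one step of a full random access. A small sketch with assumed (typical) timings:

```c
#include <stdio.h>

/* CAS latency in ns = CL cycles / memory clock = CL * 2000 / data rate (MT/s),
   because DDR transfers twice per clock. The example timings are assumptions. */
static double cas_ns(int cl, int data_rate_mts) {
    return cl * 2000.0 / data_rate_mts;
}

int main(void) {
    printf("DDR4-3200 CL16: %.1f ns\n", cas_ns(16, 3200)); /* ~10 ns   */
    printf("DDR5-4800 CL40: %.1f ns\n", cas_ns(40, 4800)); /* ~16.7 ns */
    /* CAS is only one step; a full random load-to-use round trip through the
       controller, row activation, etc. lands closer to ~80-100+ ns.          */
    return 0;
}
```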
 

throAU

macrumors G3
Feb 13, 2012
9,204
7,357
Perth, Western Australia
There's nothing stopping Apple from designing the AS Mac Pros with DDR5 DIMM sockets and going crazy with the RAM capacity. And I think Apple will likely go this route. They just have to give up the latency and power advantages of on-package LPDDR5 SDRAM.

Bandwidth and latency.

As per my post above, it's likely better for them to put out self-contained modules that contain local resources to optimise for latency and throughput.

Memory traces to memory slots cost performance. My bet is that Apple aren't interested in using generic components if it means trading off their performance advantage.
 

leman

macrumors Core
Oct 14, 2008
19,522
19,679
There's nothing stopping Apple from designing the AS Mac Pros with DDR5 DIMM sockets and going crazy with the RAM capacity. And I think Apple will likely go this route. They just have to give up the latency and power advantages of on-package LPDDR5 SDRAM.

It's one thing to integrate wide-channel memory on the SoC package, and an entirely different (not to mention expensive) one to route all those wires through the mainboard. Besides, at some point it becomes impractical. Let's say Apple wants to hit over 1TB/s of memory bandwidth on their Mac Pro system; you'd need 16 or more DDR5 sockets, all of which have to be filled in order to get the full bandwidth. Only need 32GB of RAM in your system? Tough luck, 2GB DDR5 modules don't exist, so you'd have to pay for 16 8GB ones, and forget about mixing and matching brands...
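Rough arithmetic behind that socket count, assuming 64-bit DIMMs at a few plausible DDR5 data rates (the rates and the 1 TB/s target are assumptions for illustration):

```c
#include <stdio.h>
#include <math.h>

/* How many 64-bit DDR5 DIMMs does ~1 TB/s take? Per-DIMM peak is
   8 bytes x data rate; the data rates below are assumptions. */
int main(void) {
    double target_gb_s = 1024.0;            /* ~1 TB/s goal           */
    int rates[] = {4800, 5600, 6400};       /* DDR5 data rates, MT/s  */

    for (int i = 0; i < 3; i++) {
        double per_dimm = 8.0 * rates[i] / 1000.0;   /* GB/s per DIMM */
        printf("DDR5-%d: %4.1f GB/s per DIMM -> %2d DIMMs for %.0f GB/s\n",
               rates[i], per_dimm, (int)ceil(target_gb_s / per_dimm), target_gb_s);
    }
    return 0;
}
```

That lands around 20-27 fully populated sockets depending on the data rate, which is why the socketed route gets impractical quickly at GPU-class bandwidth.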

What makes much more sense is using HBM or some other fast, compact memory interface (since even LPDDR5 will become a space issue if you want to go to 1024 bits or more) and then adding 6- or 8-channel socketed RAM for total capacity.
 