Rumored pro chip

quarkysg · Jun 20, 2021

EntropyQ3 said:
When it comes to RAM, apart from the power and speed costs of socketed DIMMs, there is also the aspect of granularity. DIMMs present a 64-bit wide (total) path to DRAM, typical consumer PCs including Apples offering using two sets of DIMMs in parallel, for a (in total) 128-bit wide interface. More professionally oriented products use more channels, the widest workstation systems today being AMD:s Threadripper Pro line that can operate 8 DIMMs in parallel for 512-bits worth of DDR4-speed data, which offers a bandwidth of 200GB/s, nominal.
Going wider than that, though possible, is impractical even for dedicated workstations. It’s also worth noting the toral number of DRAM dies it allows.

The 2019 Mac Pros already sport 6 channels of DDR4 ECC DIMMs, for a total of 384 + ECC data lines. So going to 8 channels DDR5 is entirely plausible. Some server boards even have 12 DIMM channels.

EntropyQ3 said:
Going from 8 to 128 GPU units is a factor of 16. No socketed memory system will offer 1100GB/s in a Mac Pro.

IMHO, Apple need not shoot for such a high bandwidth. Traditional GPUs are still bottle-necked by the PCIe bus bandwidth. Even tho the GPU’s VRAM can supply TB/s bandwidth, the data still need to be send from main memory via PCIe bus. This will benefit algorithms that need to iterate multiple (maybe thousands) times the data set in VRAM before the computation is done. If data is sent to GPU to be transformed and read back by CPU, the PCIe bottleneck will negate all the VRAM bandwidth benefit. The programmer would have to synchronise the data handling and GPU processing to hide this bottleneck. This problem is not present for UMA. Apple would need a humongous cache tho., if the addressable RAM grows.

My thinking is that Apple would have socketed DIMMs for Apple Silicon Mac Pros, and maybe even for the 27” iMac successor. Would be very expensive stocking inventory they cannot sell for a product that have wild RAM requirements from their customers. iMacs would likely only have 4 channels DDR5 DIMMs, for about 200GB/s bandwidth.

PsykX · Jun 20, 2021

If it's an M1 "unleashed" in the form of a 10-core with 8 performance cores, we should expect it to score 13,000 in Geekbench 5, give or take.

Which means it would be better than a few iMac Pros and better than the first Mac Pro.

I can't wait.

EntropyQ3 · Jun 20, 2021

quarkysg said:
The 2019 Mac Pros already sport 6 channels of DDR4 ECC DIMMs, for a total of 384 + ECC data lines. So going to 8 channels DDR5 is entirely plausible. Some server boards even have 12 DIMM channels.

Going wider than eight is certainly possible, but as you imply yourself, isn’t found outside servers. Selling CPU instances isn’t really the application we envision Apples upcoming models to be made for though, the workstation market seems a better fit, and is the only one they are currently adressing.

quarkysg said:
IMHO, Apple need not shoot for such a high bandwidth. Traditional GPUs are still bottle-necked by the PCIe bus bandwidth. Even tho the GPU’s VRAM can supply TB/s bandwidth, the data still need to be send from main memory via PCIe bus. This will benefit algorithms that need to iterate multiple (maybe thousands) times the data set in VRAM before the computation is done. If data is sent to GPU to be transformed and read back by CPU, the PCIe bottleneck will negate all the VRAM bandwidth benefit. The programmer would have to synchronise the data handling and GPU processing to hide this bottleneck. This problem is not present for UMA. Apple would need a humongous cache tho., if the addressable RAM grows.

I’m not sure if I disagree or simply do not follow you.
The primary use case for GPU:s is to produce 3D graphics. In that area, the PCIe bottleneck demonstrably doesn’t matter a lot, it has been shown over and over that even the fastest graphics cards aren’t constrained by PCIe bandwidth unless you drop speed and width down to for-experimental-purposes levels. (Or PCI3 four lanes wide via Thunderbolt). Programmers aren’t complete idiots, they see to that their code works well within the constraints of the target hardware platforms.
The same goes for GPGPU - it works as long as your problem set fits within local memory, and you iterate over it there, not over PCIe. If you’re lucky, your problem may allow you to divide it into parts that you can compute sequentially, making the size of the local memory pool less of an issue. (Given the relative speed of present day solid state storage and PCIe, you might as well stream those chunks from SSD as from RAM.)

Again, you don’t do bulk work over PCIe. It would be a disaster in terms of performance.
Not having a PCIe bandwidth barrier potentially opens up new ways of doing things, as well as allowing the GPU to access a larger pool of memory. That is good, but current code is written to take that barrier into account, you can’t assume that this code will run faster on a lower bandwidth system just because nothing needs to cross PCIe.

quarkysg said:
My thinking is that Apple would have socketed DIMMs for Apple Silicon Mac Pros, and maybe even for the 27” iMac successor. Would be very expensive stocking inventory they cannot sell for a product that have wild RAM requirements from their customers. iMacs would likely only have 4 channels DDR5 DIMMs, for about 200GB/s bandwidth.

Well, we’ll see about that. For what it’s worth, I hope you are right when it comes to the iMacs.

quarkysg · Jun 20, 2021

EntropyQ3 said:
I’m not sure if I disagree or simply do not follow you.
The primary use case for GPU:s is to produce 3D graphics. In that area, the PCIe bottleneck demonstrably doesn’t matter a lot, it has been shown over and over that even the fastest graphics cards aren’t constrained by PCIe bandwidth unless you drop speed and width down to for-experimental-purposes levels. (Or PCI3 four lanes wide via Thunderbolt). Programmers aren’t complete idiots, they see to that their code works well within the constraints of the target hardware platforms.
The same goes for GPGPU - it works as long as your problem set fits within local memory, and you iterate over it there, not over PCIe. If you’re lucky, your problem may allow you to divide it into parts that you can compute sequentially, making the size of the local memory pool less of an issue. (Given the relative speed of present day solid state storage and PCIe, you might as well stream those chunks from SSD as from RAM.)

Again, you don’t do bulk work over PCIe. It would be a disaster in terms of performance.
Not having a PCIe bandwidth barrier potentially opens up new ways of doing things, as well as allowing the GPU to access a larger pool of memory. That is good, but current code is written to take that barrier into account, you can’t assume that this code will run faster on a lower bandwidth system just because nothing needs to cross PCIe.

Imagine this scenario for GPGPU compute. Say you have to process 1GB of data. For the traditional modular architecture design for GPUs with dedicated VRAMs. You would need to:

1. Fetch the data from disk to main memory;
2. Feed this dataset to the GPU via PCIe with max 32GB/s (and that’s ignoring the PCIe protocol overhead);
3. Get the GPU to crunch thru that;
4. Retrieve the crunched dataset back to main memory;
5. Use the result.

For Apple Silicon UMA, 2. and 4. is not required, and the GPU can get at the dataset at the full memory bus bandwidth, and in the case of the M1, this is clocked at more than 60GB/s by Anandtech, actual, with all overheads accounted for. With wider bus and memory speed like DDR5, bandwidth will increase.

Of course if the GPU tasks can be started while the copy is still going on, there’s potential to hide this bottleneck, but it will increase code complexities. I believe most of the x86 optimisations are heavily focused in this area. If the tasks at hand need to iterate the GPU’s VRAM big dataset many times, then the high bandwidth GPUs with dedicated VRAMs will win.

As for graphics workflows, Apple has implemented TBDR and GPU Tile Memory, making their implementations much more efficient, in terms of GPU utilization and power usage. So in my view, the raw TFLOPS advertised are probably misleading when comparing between the two architecture (i.e. the dGPU IMR model vs Apple’s TBDR UMA architecture)

nothingtoseehere · Jun 20, 2021

leman said:
Upgradeability in a business environment is rarely meaningful:

[…]
- Equipment is cheap, labor is expensive

My impression from all this is that upgradability is mostly a thing of a home enthusiast PC builder, who likes tinkering with the components, upgrade every time a new gaming GPU comes out and has a limited budget.

I fully agree for corporate environments. Being in charge myself, we want solutions out of the box at least concerning hardware, there is enough software configuration to do. As you write, labor is the most expensive part.

But I wonder if there is a third group not already covered with businesses on the one hand, home enthusiasts on the other hand. I think of creators, developers and that kind of people, working alone or with colleagues in very small and nonformal offices. They are there own IT department, they may want to keep the machine they are used to as long as possible, and they do not mind tinkering a little bit with hardware as well as with software.

I also wonder if this is the right thread to discuss that question 🤔

AVonGauss · Jun 20, 2021

Deleted.

diamond.g · Jun 20, 2021

quarkysg said:
Imagine this scenario for GPGPU compute. Say you have to process 1GB of data. For the traditional modular architecture design for GPUs with dedicated VRAMs. You would need to:

1. Fetch the data from disk to main memory;
2. Feed this dataset to the GPU via PCIe with max 32GB/s (and that’s ignoring the PCIe protocol overhead);
3. Get the GPU to crunch thru that;
4. Retrieve the crunched dataset back to main memory;
5. Use the result.

For Apple Silicon UMA, 2. and 4. is not required, and the GPU can get at the dataset at the full memory bus bandwidth, and in the case of the M1, this is clocked at more than 60GB/s by Anandtech, actual, with all overheads accounted for. With wider bus and memory speed like DDR5, bandwidth will increase.

Of course if the GPU tasks can be started while the copy is still going on, there’s potential to hide this bottleneck, but it will increase code complexities. I believe most of the x86 optimisations are heavily focused in this area. If the tasks at hand need to iterate the GPU’s VRAM big dataset many times, then the high bandwidth GPUs with dedicated VRAMs will win.

As for graphics workflows, Apple has implemented TBDR and GPU Tile Memory, making their implementations much more efficient, in terms of GPU utilization and power usage. So in my view, the raw TFLOPS advertised are probably misleading when comparing between the two architecture (i.e. the dGPU IMR model vs Apple’s TBDR UMA architecture)

Just for clarification CPUs can feed/read data directly to and from GPU ram without having to copy to system memory first. I know it still isn’t as fast, but it does save a step.

deconstruct60 · Jun 20, 2021

quarkysg said:
My thinking is that Apple would have socketed DIMMs for Apple Silicon Mac Pros, and maybe even for the 27” iMac successor. Would be very expensive stocking inventory they cannot sell for a product that have wild RAM requirements from their customers. iMacs would likely only have 4 channels DDR5 DIMMs, for about 200GB/s bandwidth.

Apple doesn't "have to" do DIMM slots.

There is no inventory problem if don't offer as many configurations. The Retina iMac 21.5" had 3 CPUs choices , 3 GPU choices. They didn't go full permutations , but there was about 8 different ways to combine them. For the iMac 24" that dropped down to just two.

The Mac Pro probably wouldn't drop down to just two , but it doesn't have to the exact same number of combinations. If this is a "half sized" Mac Pro then the permutations on GPUs won't be the same because there isn't the same number of slots. Same permutation lowering factors if there is much lower cap on max RAM capacity. ( peak at 128 or 256GB and start at 32GB ( or 64 ) )

Apple could off a set of standard configurations that they kept inventory. (
14 CPU 48 GPU 32GB RAM ( 16 x 2 GB )
20 CPU 64 GPU 64GB RAM ( 16 x 4 GB )
40 CPU 128 GPU 128GB RAM ( 32 x 4 GB )

folks who want more of one thing just get more of some others

14 CPU 48 GPU 64GB RAM ( 16 x 4 GB )
14 CPU 48 GPU 128 RAM ( 16 x 8 GB )
20 CPU 64 GPU 32GB RAM ( 16 x 2GB )
20 CPU 64 GPU 128GB RAM ( 16 x 8GB )
40 CPU 128 GPU 64 GB ( 32 x 2GB )
40 CPU 128 GPU 256GB RAM ( 32 x 8GB )

Thouse would just be build to order just like the XDR display where you order it and they make it. There is no 2-4 days delivery. There is no inventory of finished SoC if just make them "just in time". ( Apple could put a cores deactivated models of the 40 baseline also which would add more permutations without changing much ). The inventory overhead of 2, 4, 8 GB memory packages isn't that much more different than holding DIMMs in terms of cost. Apple would need a fixed set of TSMC 2.5 (or 3D) packing starts reserved ( but they could drive a prediction from that from Mac Pro 2019 sales ). That number of options is substantially different from the rest of the Mac line up on M-series.

The "delayed substantail amount of "just in time" models doesn't have to be permanent. They could adjust with some data if something on that list was far more popular then the standard configurations. Likewise if the 20 and 40 baseline SoC are chiplets it is relatively easy to just tweak that allocation of that same building block to different package groupings. ( if 40's don't sell as well as they though and 20's selll 40% then toss the chiplets into more 20's ) . Basically comes down to measuring what people buy and adjusting to what they make out of the same set of building blocks.

This is also probably priced high enough that it is not some very high run rate product. More so in the thousands per month zone than 100K's per month zone.

If there are 1-2 slots the Afterburner card, other M.2 SSD drive cards , etc as ways to flush out the permutations to double digit number of configuration options to choose from. If they eventually open the door for 3rd party GPU then more options.

Apple could use a Xeon W 3300 refresh of the Mac Pro 2019 full size chassis to "kick the can" out into 2023-2024 where they can cover more ground with a future "half size" model than they would with the initial iteration. Keeping the full sized refreshed would also tap down on the volume they would need for the half sized one. ( even more likely in the thousands per month zone).

thenewperson · Jun 20, 2021

Bug-Creator said:
I'm 99% sure that the 2-fan iMac were supposed to get an M1x which for some reason didn't make the cut. Would also have turned the 2 non TB USB-C into TB.

Leaning this way as well (mostly because it's what I wish happened with the iMacs).

leman · Jun 20, 2021

quarkysg said:
The 2019 Mac Pros already sport 6 channels of DDR4 ECC DIMMs, for a total of 384 + ECC data lines. So going to 8 channels DDR5 is entirely plausible. Some server boards even have 12 DIMM channels.

8 channels of DDR5 will “only” give you around 400-500GB/s bandwidth, definitely not enough to compete with high-end GPUs Mac Pro currently uses, UMA or no UMA. And besides, you still have to populate all 8 slots. With $100 per module, that’s $800 for a starting memory config with subpar performance. Not really worth it.

quarkysg said:
IMHO, Apple need not shoot for such a high bandwidth. Traditional GPUs are still bottle-necked by the PCIe bus bandwidth. Even tho the GPU’s VRAM can supply TB/s bandwidth, the data still need to be send from main memory via PCIe bus. This will benefit algorithms that need to iterate multiple (maybe thousands) times the data set in VRAM before the computation is done. If data is sent to GPU to be transformed and read back by CPU, the PCIe bottleneck will negate all the VRAM bandwidth benefit. The programmer would have to synchronise the data handling and GPU processing to hide this bottleneck. This problem is not present for UMA. Apple would need a humongous cache tho., if the addressable RAM grows.

It’s not that simple. UMA is amazing for certain workloads (like video processing), but not all GPU uses require constant data streaming between the CPU and the GPU. Apple can definitely get away with having less bandwidth, but their bandwidth-saving techniques won’t be able to overcome a difference of 500-600 G B/s…

theorist9 · Jun 20, 2021

leman said:
I am not sure how accurate this is. I have spent around a decade in a function as a manager for IT resources of a mid-size research department and I can assure you that concerns over modularity of upgradeability never even entered the equation. For the regular work machines, you buy what you need every couple of years — they are cheap. Even our supercomputer department never upgraded anything: they had a financial plan and wold just buy new hardware every five years or so. For them, it was more important that the systems were extensible, e.g. that they could buy a new server blade or a new storage unit to extend the existing system.

Upgradeability in a business environment is rarely meaningful:

- It puts a lot of pressure on the usually strained resources of IT departments that really have better things to do instead of tinkering with components
- You are putting yourself at a financial risk by investing into hardware outside of it's warranty or service window
- By the time you need to upgrade, computers likely became much faster anyway, so replacing the system is often the better choice
- Upgrading rarely save any meaningful amount of money anyway (if the cost of the new GPU is 60% of the computer cost, you are not saving anything, those few $$$$ are just peanuts)
- Should your demands change in an unpredictable manner, upgradeability does not help. It is never the case that you go "oh, I have misjudged how much RAM I need, I should have bought 64GB instead of 16GB". If you find yourself in such a situation, it's not just more RAM that you need. You likely need a bigger system overall.
- Equipment is cheap, labor is expensive

My impression from all this is that upgradability is mostly a thing of a home enthusiast PC builder, who likes tinkering with the components, upgrade every time a new gaming GPU comes out and has a limited budget.

1) I understand that was your experience in your business, but that was just one company. Recall that, in desiging the latest Mac Pro, Apple brought on a "Pro Workflow Team" (https://techcrunch.com/2018/04/05/apples-2019-imac-pro-will-be-shaped-by-workflows/ ) so that their pro users could tell them just what their needs were. And a key input from those pros is that they needed the machine to be modular, so that it could be upgraded as their needs changed. From Tom Boger, senior director of Mac Hardware Product Marketing:

".... modular was inherently ... a real need for our customers and that’s the direction we’re going."

And from Apple's white paper on the Mac Pro (https://www.apple.com/mac-pro/pdf/Mac_Pro_White_Paper_Feb_2020.pdf):

"The Mac Pro is engineered to provide unprecedented levels of access and capability. Every aspect of the hardware is designed to be flexible and accommodate change. The graphics, storage, and memory modules are easily expandable and configurable." [emphasis mine]

And it's not just Apple. The workstations produced by HP, Dell, Boxx, etc. are all modular and upgradeable. So essentially what you're arguing here is that what Apple, HP, Dell, and Boxx think many of their pro customers want, and what many of their pro customers are actually saying that want, isn't what they really want!

2) The pro community is quite diverse, so again, just because that was your experience doesn't mean it's applicable generally. Here's a quote from John Ternus, Apple's VP of Hardware Engineering (from techcruch article linked above):

“We said in the meeting last year that the pro community isn’t one thing. It’s very diverse. There’s many different types of pros and obviously they go really deep into the hardware and software and are pushing everything to its limit."

Many of the customers for the higher-end Macs are smaller specialty shops that do post production, video rendering, etc., and don't have IT departments or leasing arrangements, etc. For them it's simpler and more cost-effective to upgrade their existing machines than to buy new ones. [Buying a new machine means swapping everything out, and also making arrangements to sell the old machine, both of which are much more inconvenient than just, say, upgrading the RAM.] And AppleCare now offers warranties extendable beyond 3 years.

3) You wrote "Should your demands change in an unpredictable manner, upgradeability does not help. It is never the case that you go "oh, I have misjudged how much RAM I need, I should have bought 64GB instead of 16GB."

On the contrary, when I was doing my Ph.D. I specced out a G5 tower that was fine for my computational needs for the first two years (both local computation, and development work for programs I would then send to the university's clusters). But then my needs expanded (research, after all, can take you in unexpected directions), and I recall having to increase both the RAM and HD size. It was good I could do that, because my PI didn't have the budget to buy me a new computer.

polyphenol · Jun 20, 2021

One2Grift said:
Intel has been, in the past, an outstanding performer.

As with the well-known phrase "No-one ever got sacked of buying IBM" and that company's FUD policies, Intel managed to push their products by pushing the message that anything else was questionable.

deconstruct60 · Jun 20, 2021

EntropyQ3 said:
. Going from 8 to 128 GPU units is a factor of 16. No socketed memory system will offer 1100GB/s in a Mac Pro. And this is about the Mac Pro, no portables or anorectic iMacs are likely at the power levels these solutions will probably end up, which also means that we don’t need to constrain ourselves to mobile packaging.

The M1 has two memory 3D stacks attached (with 4 memory channels to each chip). Going from 8 GPU cores to 128 cores also means the whole combined chip area gets bigger. For example, if Apple went with a much bigger 32GPU chiplet than the M1 that had eight memory 3D then there would an increase of 4 for the GPU core count and a increase of 4 for the memory stacks ( 4 * 8 = 32 , 4 * 2 = 8 ).

What Apple would likely be throwing away is anything like the x64 PCI-e v4 lanes on a Xeon W-3300/3200 or AMD's 128 PCI-e lane count on Threadripper and EPYC. On the M1 about half of the whole circumfernce of the outer part of the whole die is just memory controllers ( eight). That is about the same where upper end Intel and AMD server packages go on a far smaller die. Apple doesn't particulary care about off chip PCI-e bandwith on the M1. If they carry that over to whatever they do for the first iteration (or two ) for the Mac Pro it is possible to double , triple what the "socketed CPU " packages do by making a tradeoff they don't do. ( using those off die connection point for even more point-to-point connection to RAM dies inside of close packed 3D memory modules).

This is also enabled by the SoC being a substantially lower TDP. Bring the memory into close proximity to something far cooler has a better chance of working. Likewise if the point-to-point the memory controllers will consume less power also.

It is a matter of going "all in" on lowering the chip die to memory power consumption. If Apple goes for the biggest 2.5D package that TSMC offers this is in the practical limits now. The footprint of this huge package might be a bit bigger than what Intel/AMD large sockets are , but the also would be "recovering" the DIMM space on the logic board. But if looking at the non-DIMM portions of the SoC package it would probably be in the same zone as the current larger sockets ( leveraging be a bit ahead on fab process used . ) . The key would be scaling the memory bandwidth with each chiplet.

If Apple is out to avoid ECC RAM for now, they probably don't care about the shrink in maximum RAM capacity.
So bandwidth to the chip package isn't the issue. The larger trad-off "penalty" cost is more likely in the area of capacity than bandwidth. There is some cost where sharing the expanded bandwidth with a large multiple of P cores but if Apple isn't looking for an absolute total "walk and chew gum at the same time" SoC that could work.

Similarly, if they are sharing the 20 core with the iMac larger screen model, there is a mixed track record of Apple providing ECC over most of that range also.

EntropyQ3 said:
So, picking options from the tried and true we can:
1. Use a wide "normal" RAM system. This gives access to really large amounts of RAM in a straightforward manner. Unfortunately it means feeding the system with data through an (in this context) itty bitty straw. This can to some extent be alleviated by beefing up the cache hierarchy. But only to some extent.

Not need to have a "itty bitter straw" if throw away bandwidth assigned to other duties.

EntropyQ3 said:
2. Use industry standard HBM. This can, now that Hynix has recently announced their first HBM3 modules, provide 665GB/s per stack. Going with four stacks of HBM, such as in the old AMD Fiji graphics cards, could provide in excess of 2TB/s. Which would be nice, and appropriate. The compromise of course being total memory capacity, which might initially be constrained to 128GB or so. This could conceivably be extended with regular DIMMed RAM, but as noted earlier, that would require some work on the software side of things to maintain the illusion of a homogenous RAM system while maintaining the performance benefits.

Don't need HBM if use their own customer stacked LPDDR4 ( or LPDDR5 ) memory stacks. But like HBM there will be a capacity cap put on main RAM size.

EntropyQ3 said:
Personally, I’m strongly in favour of speed vs. extendability. In computational chemistry, there are a fair number of codes that are ”GPU accellerated", but that actually scales with GPU memory bandwidth, not TFLOPs. (Even published as such which I wouldn’t let pass if I were doing the reviewing).

Which is why Apple might throw away lanes assigned to PCI-e v3 (or v4) for more lanes assigned to memory for their iGPU.

Mainstream, a future AMD HPC focused CPU will couple to future computational focused GPU with Infinity Fabric. Similarly , PCI-e v5 gets into the zone where really not all that "slow". [ At that point faster than the UPI links used to couple 2-4 socket Xeons together a generation ago. The sky didn't collapse on those in package-to-package data movement. ] . PCI-e v5 and v6 are not particularly aimed at affordabe, generic , mainstream cards anymore. ( layer CXL on top either of them and have standards based memory to memory coherence. ). I don't think Apple is going to follow that path at all, but there are folks making different trade-offs at substantially different power consumption trade-off.

EntropyQ3 said:
.... Comparatively, I just don’t see much of a case for prioritising TB+ RAM areas over speed.

"speed" can be as much fewer stalls and bubbles as much as it can be cranking up clocks and power consumption.
The was the major message Apple expound on in WWDC 2020. It is doubtful they are going to abandon that for the Mac Pro. But that path isn't "free". Part of the trade-off is some flexibility , capacity , and breath of 3rd party options.

P.S. Wouldn't be the uber, ultimate gamer GPU ( with GPU cores split over 4 chiplets ) aimed primary at goosing frame rates higher on larger monitors. But as a balanced computational GPU this would probably would work find. ( so that kind of "speed" also. )

deconstruct60 · Jun 20, 2021

leman said:
8 channels of DDR5 will “only” give you around 400-500GB/s bandwidth, definitely not enough to compete with high-end GPUs Mac Pro currently uses, UMA or no UMA. And besides, you still have to populate all 8 slots. With $100 per module, that’s $800 for a starting memory config with subpar performance. Not really worth it.

The Intel/AMD chips also have x64+ bandwidth of PCI-e v4 to work with also. That is approximately 128GB/s ( 4 * 32GB/s ) of longer distance bandwidth. Crank the distance down it wouldn't be hard to crank the bandwidth up. There is a trade-off though in throwing away almost all of the off-die bandwidth for uses other than memory.

Apple uses eight channels on the M1. Why would some larger SoC with chip(s) with much larger circumference be stuck with the same eight? if the circumference of the die doubles then they can double the channles. Quadruples then can do the same thing.

That looks nothing like what the generic DIMM market is aimed at.

The other presumption here is that they are actually trying to replace the highest or high end GPU options. Decent chance that isn't plan long term. Especially if uses are primarlly dropping in the top 2nd GPU almost solely to do GPGPU work. If Apple goes off on a Captain Ahab quest to subsume all dGPUs cards and loose the ability to get 32GB/s off of a second internal drive they could get into the "rob Peter to Pay Paul" zone for processing large data sets.

leman said:
It’s not that simple. UMA is amazing for certain workloads (like video processing), but not all GPU uses require constant data streaming between the CPU and the GPU. Apple can definitely get away with having less bandwidth, but their bandwidth-saving techniques won’t be able to overcome a difference of 500-600 G B/s…

Unified doesn't necessarily mean homogenous memory. It is cheaper to just do homogenious memory.
If Apple chucks ECC they can chuck having to implement support for that in the controller ( and no ... DDR doesn't get rid of ECC. on RAM module ECC is different that system ECC. ). Apple also get rid of inventory configurations with unified memory.
Customer A need 10GB VRAM
Customer B needs 30GB VRAM
Customer C needs 40GB VRAM

hand them all 64GB of Unified RAM and most can all allocate into a single SKU. Fewer SKU and lower inventory complexity ... Apple is happy.

diamond.g · Jun 20, 2021

deconstruct60 said:
The Intel/AMD chips also have x64+ bandwidth of PCI-e v4 to work with also. That is approximately 128GB/s ( 4 * 32GB/s ) of longer distance bandwidth. Crank the distance down it wouldn't be hard to crank the bandwidth up. There is a trade-off though in throwing away almost all of the off-die bandwidth for uses other than memory.

Apple uses eight channels on the M1. Why would some larger SoC with chip(s) with much larger circumference be stuck with the same eight? if the circumference of the die doubles then they can double the channles. Quadruples then can do the same thing.

That looks nothing like what the generic DIMM market is aimed at.

The other presumption here is that they are actually trying to replace the highest or high end GPU options. Decent chance that isn't plan long term. Especially if uses are primarlly dropping in the top 2nd GPU almost solely to do GPGPU work. If Apple goes off on a Captain Ahab quest to subsume all dGPUs cards and loose the ability to get 32GB/s off of a second internal drive they could get into the "rob Peter to Pay Paul" zone for processing large data sets.

Unified doesn't necessarily mean homogenous memory. It is cheaper to just do homogenious memory.
If Apple chucks ECC they can chuck having to implement support for that in the controller ( and no ... DDR doesn't get rid of ECC. on RAM module ECC is different that system ECC. ). Apple also get rid of inventory configurations with unified memory.
Customer A need 10GB VRAM
Customer B needs 30GB VRAM
Customer C needs 40GB VRAM

hand them all 64GB of Unified RAM and most can all allocate into a single SKU. Fewer SKU and lower inventory complexity ... Apple is happy.

Doesn’t DDR5 require ECC?

Bug-Creator · Jun 20, 2021

deconstruct60 said:
Why would some larger SoC with chip(s) with much larger circumference be stuck with the same eight? if the circumference of the die doubles then they can double the channles. Quadruples then can do the same thing.

The would still need to route inside the CPU and route more cores (of all kind) getting access to it.
Or the go the chiplet route which creates more/different problems.

Don't forget the M1 is (together with A12z) the biggest SoC Apple has ever done (counting cores) so I wouldn't be surprised if they hit a wall somewhere and what comes next might have to use creative solutions and might not even scale up as good in performance as core count suggests.

cmaier · Jun 20, 2021

Bug-Creator said:
The would still need to route inside the CPU and route more cores (of all kind) getting access to it.
Or the go the chiplet route which creates more/different problems.

Don't forget the M1 is (together with A12z) the biggest SoC Apple has ever done (counting cores) so I wouldn't be surprised if they hit a wall somewhere and what comes next might have to use creative solutions and might not even scale up as good in performance as core count suggests.

I don’t think they’re anywhere near the wall. If an algorithm can handle multiple cores, it’s quite easy to scale to 32 or 64 nowadays given modern on-chip bus architectures.

polyphenol · Jun 20, 2021

deconstruct60 said:
Apple also get rid of inventory configurations with unified memory.
Customer A need 10GB VRAM
Customer B needs 30GB VRAM
Customer C needs 40GB VRAM

hand them all 64GB of Unified RAM and most can all allocate into a single SKU. Fewer SKU and lower inventory complexity ... Apple is happy.

Ah! The mainframe days - when an engineer would flick a switch and the mainframe would increase processor power or see more memory! And the invoicing department would ramp up the bill massively. Despite there being no real hardware change.

leman · Jun 20, 2021

deconstruct60 said:
Apple uses eight channels on the M1. Why would some larger SoC with chip(s) with much larger circumference be stuck with the same eight? if the circumference of the die doubles then they can double the channles. Quadruples then can do the same thing.

It’s eight channels, but each channel is 16bit, as opposed to DDR’s 64-bit channels. M1 eighth channels are equivalent to dual-channel DDR RAM.

diamond.g said:
Doesn’t DDR5 require ECC?

No, DDR5 simple has in-die EXC to protect from data corruption with higher capacities. It’s not the same as usual ECC RAM where error correction also applies to data bus and memory controller as well.

Bug-Creator said:
Don't forget the M1 is (together with A12z) the biggest SoC Apple has ever done (counting cores) so I wouldn't be surprised if they hit a wall somewhere and what comes next might have to use creative solutions and might not even scale up as good in performance as core count suggests.

It’s still a rather small SoC by industry standards. There is no good reason to assume that Apple would have any issues here.

Jorbanead · Jun 20, 2021

theorist9 said:
1) I understand that was your experience in your business, but that was just one company. Recall that, in desiging the latest Mac Pro, Apple brought on a "Pro Workflow Team" so that their pro users could tell them just what their needs were. And a key input from those pros is that they needed the machine to be modular, so that it could be upgraded as their needs changed. From Apple's white paper on the Mac Pro (https://www.apple.com/mac-pro/pdf/Mac_Pro_White_Paper_Feb_2020.pdf):

"The Mac Pro is engineered to provide unprecedented levels of access and capability. Every aspect of the hardware is designed to be flexible and accommodate change. The graphics, storage, and memory modules are easily expandable and configurable." [emphasis mine]

It’s always been my understanding and assumption that the Mac Pro going forward will still be modular. For both the reasons you stated, but also because I don’t see Apple spending years in R&D for a completely new modular Mac Pro design that gets discontinued two/three years late

My guess is that by the time they decided they needed a new Mac Pro design, Apple was already in the middle of planning the transition away from Intel. So the reason why the current design took so long was due to it needing to support the current Intel roadmap as well as their own. I wouldn’t be surprised if the MPX module system was specifically designed with Apple silicon in mind.

All that being said, I don’t see any other Mac being modular in this way. If you are a pro that needs modularity, your only option will be Mac Pro.

bobcomer · Jun 20, 2021

polyphenol said:
Ah! The mainframe days - when an engineer would flick a switch and the mainframe would increase processor power or see more memory! And the invoicing department would ramp up the bill massively. Despite there being no real hardware change.

It's still that way with the IBM midrange power processor machines.

quarkysg · Jun 20, 2021

diamond.g said:
Just for clarification CPUs can feed/read data directly to and from GPU ram without having to copy to system memory first. I know it still isn’t as fast, but it does save a step.

Usually DMA will be utilised to fetch data from device to memory and vice-versa. The idea is that the CPU need not be wasting cycles shuttling data, but yeah, the CPU can also send it straigth to the GPU. It's better to use DMA to send data from memory to GPU tho, as CPU need not be tied up sending data.

For example, even if the dataset for the GPU is generated on the fly by the CPU via some algorithm and needed to be sent to the GPU for computation, it's better to generate the data in chunks in memory and DMA it to the GPU, to hide the data shuttling bottleneck, of course depending on how fast the CPU algo. is generating the data.

leman said:
8 channels of DDR5 will “only” give you around 400-500GB/s bandwidth, definitely not enough to compete with high-end GPUs Mac Pro currently uses, UMA or no UMA. And besides, you still have to populate all 8 slots. With $100 per module, that’s $800 for a starting memory config with subpar performance. Not really worth it.

It’s not that simple. UMA is amazing for certain workloads (like video processing), but not all GPU uses require constant data streaming between the CPU and the GPU. Apple can definitely get away with having less bandwidth, but their bandwidth-saving techniques won’t be able to overcome a difference of 500-600 G B/s…

I'm thinking that with the M1 vs GTX 1050/1060, which seems roughly equivalent in terms of performance, the M1 is trailing the GTX in terms of GPU to dataset bandwidth by 2:1 but the M1 is still matching the performance delivered.

In the case of game rendering for example, with high end GPU cards, they have gobs of high speed VRAMs as they need to pre-fill it with textures and have it rendered, as sending the textures to the VRAMs on-demand will kill performance due to the PCIe bottleneck. I think that's why you need to pair high end GPU cards with a beefy CPU as other than processing game logic, I would think a fair chunk of the CPU time will be spent determining when to shuttle data to the GPU VRAMs, and sending gobs of data to VRAMs before it's needed for the next scene for example.

I agree that not all real world usage will favour Apple's UMA/TBDR compared to the traditional dGPU/IMR, so it's a win-some/lose-some scenario I guess. At the end of the day, it's about how fast actual data is generated and sent to the GPU ALUs for processing. Will be interesting to see how Apple approaches this problem.

diamond.g · Jun 20, 2021

quarkysg said:
In the case of game rendering for example, with high end GPU cards, they have gobs of high speed VRAMs as they need to pre-fill it with textures and have it rendered, as sending the textures to the VRAMs on-demand will kill performance due to the PCIe bottleneck. I think that's why you need to pair high end GPU cards with a beefy CPU as other than processing game logic, I would think a fair chunk of the CPU time will be spent determining when to shuttle data to the GPU VRAMs, and sending gobs of data to VRAMs before it's needed for the next scene for example.

I agree that not all real world usage will favour Apple's UMA/TBDR compared to the traditional dGPU/IMR, so it's a win-some/lose-some scenario I guess. At the end of the day, it's about how fast actual data is generated and sent to the GPU ALUs for processing. Will be interesting to see how Apple approaches this problem.

Eventually DirectStorage (and SFS) will make it not matter as much (for games at least). For Pro applications I am not sure when Apple will have it’s equivalent ready. Apple has had pretty fast storage speeds, but stuff still has loading screens.

quarkysg · Jun 20, 2021

diamond.g said:
Eventually DirectStorage (and SFS) will make it not matter as much (for games at least). For Pro applications I am not sure when Apple will have it’s equivalent ready. Apple has had pretty fast storage speeds, but stuff still has loading screens.

Looks like DirectStorage is a device to device transfer mechanism, skipping the device to memory part, but still bottlenecked by PCIe when it comes to filling the GPU VRAMs. The CPU in this case is essentially blind to the data sent to the GPU then. This is beneficial to the textures loads I guess, as the CPU basically do not process textures and no system memory is used to store the textures.

But considering that disk I/O is an order or more magnitude slower than memory and even PCIe bandwidth, I think this API basically makes programmer's jobs easier. I'm sure programmers are using pre-fetching optimising techniques prior to DirectStorage.

In Apple's case, once the data is loaded into memory, it'll be in the SoC cache, and the GPU can suck it up as quicky as it can from the cache. Other processing cores in the SoC can also work on the same dataset as well. The drawback here is that if the algorirthm's dataset cache locality is no good, it's back to the raw bandwidth of the memory to cache of the SoC, and the GPU has to fight with the other processing cores for data. So for Apple's Pro and Workstation offerings, bandwidth has to go up. Question is, by how much compared to the M1 for them to think that it is enough.

theorist9 · Jun 20, 2021

How does one calculate total memory bandwidth? I found this table from https://www.rambus.com/blogs/get-ready-for-ddr5-dimm-chipsets/#3

So does this mean your total memory bandwidth with standard DDR5 would be?:

[Updated based on post from @leman.]

total memory bandwidth
=transfer Rate (Gbps/bit) x 32 bits/channel for data x 2 channels/DIMM x number of DIMMs ÷ 8 bits/byte

Thus with 1 DIMM:
DDR5-4800 = 4800 MT/s = 4.8 Gbps:
4.8 Gbps x 32 bits/channel for data x 2 channels/DIMM ÷ 8 bits/byte = 38.4 GBps/DIMM

DDR5-6400 = 6400 MT/s = 6.4 Gbps:
6.4 Gbps x 32 bits/channel for data x 2 channels/DIMM ÷ 8 bits/byte = 51.2 GBps/DIMM

And with 8 DIMMs:
DDR5-4800: 38.4 GBps/DIMM x 8 DIMMs = 307 Gbps
DDR5-6400: 51.2 GBps/DIMM x 8 DIMMs = 410 Gbps

And what would be the corresponding calculation for Apple's current M1, and for the expected next-gen AS?

Rumored pro chip

macrumors 65816

macrumors 68040

macrumors 6502a

macrumors 65816

macrumors 6502

macrumors 6502

macrumors G5

macrumors G5

macrumors 65816

macrumors Core

macrumors 601

macrumors 68020

macrumors G5

macrumors G5

macrumors G5

macrumors 68000

Suspended

macrumors 68020

macrumors Core

macrumors 65816

macrumors 601

macrumors 65816

macrumors G5

macrumors 65816

macrumors 601

Our Staff