
Andropov

macrumors 6502a
May 3, 2012
746
990
Spain
I think the surprise is that the CPU complex is capable of 200GB/s, while they weren’t able to get the GPU complex pushing more than 90GB/s. I would have expected that to be reversed.
I was very surprised that the M1 Pro's 200GB/s could be saturated with the CPU alone. About the GPU bandwidth, maybe it can go much higher than 90GB/s (I would expect it could), the review says that he couldn't find a workload capable of going much higher than those 90GB/s, not that the interconnect itself is limited.
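Out of curiosity, here is roughly what that kind of CPU bandwidth test looks like: a minimal STREAM-style triad in C (the array size and the 24-bytes-per-element accounting are my assumptions; a single thread won't get near 200GB/s, the review needed several cores running copies of this in parallel):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* STREAM-style triad: a[i] = b[i] + s * c[i].
   Each element moves ~24 bytes (read b, read c, write a). */
#define N (1L << 26)   /* 64M doubles per array (~512 MiB) -- far past any cache */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    /* print a[N/2] so the compiler can't discard the loop */
    printf("triad: %.1f GB/s (check: %g)\n", 24.0 * N / secs / 1e9, a[N / 2]);
    return 0;
}
```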
 

Krevnik

macrumors 601
Sep 8, 2003
4,101
1,312
I was very surprised that the M1 Pro's 200GB/s could be saturated with the CPU alone. About the GPU bandwidth, maybe it can go much higher than 90GB/s (I would expect it could), the review says that he couldn't find a workload capable of going much higher than those 90GB/s, not that the interconnect itself is limited.

Maybe I was a bit unclear, but I did try to emphasize that the reviewer couldn’t do it, not that the chip itself was limited. Makes me wonder why the bandwidth numbers were that low though. That seems to represent some actual efficiencies in play that are rather useful in this sort of SoC.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
Would it be possible if they pack them so you have one chip on either side of the board? Basically “3D stacking” them so you have four chips only taking up the 2D area of two.

Those are two different approaches. With one SoC package you can't really put them on the "other side" (of the package), since that is where the pin-outs are. Nor is it practical to stack one on top of another, because again the "bottom" of these dies has a relatively large number of connections that need to be routed out. (Part of the problem here is how to route out to 8 or 16 RAM stacks. Apple has some custom RAM packages here with two stacks of dies each to keep the footprint down, but I suspect they aren't going solely vertical.)

There is a limit to how much wattage you can stack underneath a logic chip. You can relatively easily put a chip on the bottom that doesn't produce much cast-off heat, one that handles communication between dies or perhaps caching. But putting a higher-end GPU core or CPU core down there would be a problem. The A-series stacks RAM dies on top of the package, but the A-X and M-series have consistently not done that. Once you have a substantive thermal dissipation problem, having more surface area helps.

Two different sides of the primary logic board (one SoC package on either side) would not be all that dense. Nor much of a net benefit, since you would now have substantive heat-sink and cooling issues on both sides. It would help with the awkwardness of the RAM package placements, but it would be a "balloon squeeze" into another set of problems.
The other issue here is that the inter-die communication addition they would need to make to the Max (Jade) die would pragmatically need to be at the outer edge. Off-chip I/O is not maximally transistor-dense, and putting relatively low-density structures in the middle of the chip wastes prime area. Once at the edge, it is just easier and more efficient to orient the two inter-die communication headers next to one another.


I don’t know much about this, just asking a probably stupid question here

It could be done if you dogmatically pursued it at all costs, but you pretty likely would not end up with maximum performance per watt. The farther apart the dies are (e.g., on either side of a relatively "thick" logic board), the higher the penalty in wattage consumed. If you pack a large heat source under another, you have a cooling problem and will pragmatically have to throttle clocks (i.e., a performance hit).

In a reasonable iMac (iMac Pro) or Mac Pro, I'm not sure what the space savings on a large package are supposed to "buy".
 
  • Like
Reactions: GubbyMan

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
The current Max won't work for a densely packed 4-tile setup (even if it did have an interconnect, which it doesn't appear to have at all). The placement of the RAM connections is wrong.
How do you know that? Did you see any die shots?


There were die shots in Apple's event.
[Image: die shots from Apple's October 18 event]


https://live.arstechnica.com/apple-october-18-unleashed-event/

Might as well, since it's just going to get x-rayed and released by 3rd parties in a couple of days or weeks anyway.

All you need to do is look at the outer edges. Memory and inter-die interconnect aren't going to be in the middle. The bottom of the Max basically has elements just like the upper right. Since those are duplicated, and the video en/decode blocks are duplicated... there's a very high probability that it is just more of the same (memory I/O, not an inter-die interconnect).

The inter-die connections would be at least as big as the memory ones. (Go look at UPI on Intel dies or Infinity Fabric on AMD dies.)
The top is Thunderbolt and some other stuff. Plus there would be little to no rational reason to put the interconnect on the Jade-Chop (M1 Pro) die. Pro -> Max is just adding stuff to the "bottom" of the die. (Covered in Anandtech's article, but relatively obvious here: the top "half" of the Max (Jade) exactly matches the Pro (Jade-Chop). That is why Jade-Chop is a 'chop'. Not a literal chop, but a very minor closing-off of some internal networks and just stopping there, yielding a smaller die mask and die product.)

The 4-die packaging problem is that the memory I/O is all up and down the right and left sides. If you put four of these Max dies in a 'square' pattern, the memory on the "inner" sides of the square will be much farther away, since the dies are close together. If you try to spread the dies out so that memory can go in the middle, then the communication overhead (in power) is much higher and you have uneven latencies. (Both are bad if you're trying to maximize perf/watt.)

[I suspect Apple needs the NPU cores to do part of the ProRes encoding. That would be why close placement and doubling would be needed for the second video encoder. NPU cores probably aren't the optimal fit for the work, but conveniently they are still an offload from the general CPU P cores. Perhaps a contributing reason why we don't see encode in Afterburner either.]

macOS caps out at 64 threads.

It’s a kernel implementation detail that can easily be changed. Or one can also go the Windows way and use CPU affinity groups.

Going from a '64-bit int' to a '128-bit int' might be conceptually easy, but the data-structure ramifications (caching and footprint), scaling, and real-time-constraint ramifications aren't really easy. Very significant work went into moving past the 64-thread limit in Linux, various Unixes, and Windows. It is a substantive amount of work.
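To make the data-structure point concrete: Linux's eventual answer was to stop exposing a single 64-bit word at all. A rough sketch of the Linux-side API (illustrative only; XNU would need its own analogue threaded through every scheduler structure):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Linux got past the 64-CPU ceiling by making the affinity mask an opaque
   bit array (cpu_set_t is 1024 bits by default) rather than one 64-bit int,
   which meant reworking the syscall ABI and the kernel structures behind it. */
int main(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(65, &mask);   /* a CPU number a single 64-bit word can't express */

    /* pid 0 = calling thread; fails harmlessly if CPU 65 doesn't exist */
    if (sched_setaffinity(0, sizeof mask, &mask) != 0)
        perror("sched_setaffinity");

    printf("mask size: %zu bytes (vs. 8 for one 64-bit word)\n", sizeof mask);
    return 0;
}
```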

It isn't a deep technical problem. It is a strategic-objective problem. Apple didn't have a technical problem coasting on the MP 2013 for 6 years; constraints of their own construction were the primary driver. There is no upside for iOS/iPadOS/etc. in more than 64 threads. There is little to no upside for the vast majority of the Mac lineup either. Forking the kernel for some small single-digit share of the Mac market probably won't happen.

macOS got APFS in part because it was cheaper to have one file system shared across iPhones and Macs. Basically the same primary driver here.

Apple could fork the kernel for a single-digit-share product. Apple could charge current market prices for memory and NAND storage. Neither one is likely to happen, any more than Apple is going back to allowing the cloning option for Macs.


(A recent rumor puts the iMac 27" on M1 Pro and Max also. If so, that is an even smaller base upon which to layer a forked kernel.)

[Image: die size comparison]


These came out later than my post, but they are quite illustrative of why Apple would be keen to get onto TSMC N3. The Max is in the 432mm² zone. Let's say a tweak that allowed for multi-chip solutions came in at approximately 460mm². Four times that would be 1,840mm². That's doable as a multi-chip package using TSMC 2.5D/3D tech, but it's pretty close to the current 2021 limits and probably on the expensive side. If Apple got that down to 340-360mm², four dies would be 1,360-1,440mm², which probably lowers overall SoC package costs. (Not that Apple would reduce the price to end users, but it would give them higher margins on a relatively super-low-volume SoC. They could also shrink a Jade-2 into a monolithic die size they could live with, at a higher margin.)

Longer term, Apple will probably go with a few exotic 3D packaging options for their SoCs if they can go that path for the Mac solutions. The vast majority of their product lineup doesn't need it (now, and even less so when they get to N3, N2, etc.).
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
I was very surprised that the M1 Pro's 200GB/s could be saturated with the CPU alone. About the GPU bandwidth, maybe it can go much higher than 90GB/s (I would expect it could), the review says that he couldn't find a workload capable of going much higher than those 90GB/s, not that the interconnect itself is limited.

Is Anandtech counting GPU core bandwidth or overall GPU bandwidth? Likely the former, not the latter (e.g., not including the display controllers). Typical overall GPU bandwidth includes the cores, A/V en/decode, and display driving.

In short, the measuring "tool" is at issue. It is not going to measure what it doesn't cover.

Also, the Thunderbolt implementations for v1, 2, and 3 had hard caps on encoded PCIe bandwidth, in part to make sure that "real time" video data was guaranteed bandwidth. For v1, the PCIe and video data were encoded and segregated onto different channels. That is just easier to do, since you don't have to implement any hardware flow regulation. Versions '1.5' and 2 had slightly better transistor budgets and provided slightly better balance with incrementally better hardware flow control. Same with v3 and v4.

That is probably part of what is happening here also. Apple has pragmatic flow regulators built in so that one subsystem doesn't diminish the real-time response latencies of another subsystem by temporarily hogging all the bandwidth.

If Anandtech is peepholing into a sub-subsection, then this isn't a very surprising result. For the iPhones, iPads, and low-end Macs (where a built-in-screen-only setup is the 95th-percentile use case), this sub-subsystem optimization is primarily all that's needed.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,679
Going from a '64-bit int' to a '128-bit int' might be conceptually easy, but the data-structure ramifications (caching and footprint), scaling, and real-time-constraint ramifications aren't really easy. Very significant work went into moving past the 64-thread limit in Linux, various Unixes, and Windows. It is a substantive amount of work.

If I understand it correctly, Windows solved it by introducing "processor groups". The core mask is still 64-bit, but there are multiple such masks, and you have to use their NUMA API if you want to take advantage of more than 64 cores in your application. Apple could do something similar. They already have this kind of API in place for Metal.
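A minimal sketch of that mechanism, for the curious (real Win32 calls; the choice of group 1 and processor 3 is just illustrative):

```c
#include <windows.h>
#include <stdio.h>

/* Windows keeps each group's mask at 64 bits and layers a group index on
   top; a thread reaches cores beyond its default group only by explicitly
   setting a GROUP_AFFINITY. */
int main(void) {
    WORD groups = GetActiveProcessorGroupCount();
    printf("processor groups: %u\n", (unsigned)groups);

    GROUP_AFFINITY ga = {0}, old;
    ga.Group = 1;                    /* second group, if this machine has one */
    ga.Mask  = (KAFFINITY)1 << 3;    /* logical processor 3 within that group */
    if (groups > 1 && SetThreadGroupAffinity(GetCurrentThread(), &ga, &old))
        printf("now running in group %u\n", (unsigned)ga.Group);
    return 0;
}
```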

Is Anandtech counting GPU core bandwidth or overall GPU bandwidth? Likely the former, not the latter (e.g., not including the display controllers). Typical overall GPU bandwidth includes the cores, A/V en/decode, and display driving.

The display engine etc. are not part of the GPU on Apple Silicon. It’s a separate coprocessor running its own firmware.
 

Homy

macrumors 68030
Jan 14, 2006
2,510
2,461
Sweden
About the memory bandwidth Apple says "up to" 200-400 GB/s. Does it mean you need 32 GB RAM with Pro (dual channel) and 64 GB with Max (quad channel) to have max bandwidth? Does M1 Pro 16 GB have max 100 GB/s and M1 Max 32 GB max 200 GB/s?
 

Krevnik

macrumors 601
Sep 8, 2003
4,101
1,312
About the memory bandwidth Apple says "up to" 200-400 GB/s. Does it mean you need 32 GB RAM with Pro (dual channel) and 64 GB with Max (quad channel) to have max bandwidth? Does M1 Pro 16 GB have max 100 GB/s and M1 Max 32 GB max 200 GB/s?

The M1 Pro always has the same max bandwidth whether it has 16 or 32GB, and the same goes for the M1 Max. The number of channels for a given die is always the same. It's the density of the RAM chips that differs: 2 8GB packages vs 2 16GB packages for the Pro, 4 8GB packages vs 4 16GB packages for the Max.

The "up to" is marketing speak for the fact that not all components on the SoC can saturate it. The CPU cores by themselves can only hit around 200GB/sec before they just can't move data any faster, for example.
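To make the "channels, not capacity" point concrete: the peak figures fall straight out of bus width times transfer rate (both chips use LPDDR5-6400; the Pro has a 256-bit bus, the Max 512-bit), with capacity nowhere in the formula:

```c
#include <stdio.h>

/* Peak DRAM bandwidth = transfer rate per pin * bus width. */
int main(void) {
    double gt_per_s = 6.4e9;               /* LPDDR5-6400: 6.4 GT/s per pin */
    int pro_bits = 256, max_bits = 512;    /* bus widths */
    printf("M1 Pro: %.1f GB/s\n", gt_per_s * pro_bits / 8 / 1e9);  /* 204.8 */
    printf("M1 Max: %.1f GB/s\n", gt_per_s * max_bits / 8 / 1e9);  /* 409.6 */
    return 0;
}
```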
 
  • Like
Reactions: jdb8167

Homy

macrumors 68030
Jan 14, 2006
2,510
2,461
Sweden
The M1 Pro always has the same max bandwidth whether it has 16 or 32GB, and the same goes for the M1 Max. The number of channels for a given die is always the same. It's the density of the RAM chips that differs: 2 8GB packages vs 2 16GB packages for the Pro, 4 8GB packages vs 4 16GB packages for the Max.

The "up to" is marketing speak for the fact that not all components on the SoC can saturate it. The CPU cores by themselves can only hit around 200GB/sec before they just can't move data any faster, for example.

I hope so. Then I'll be fine with 16. I've read some comments on Youtube that you have to have 32 and 64 GB to have max bandwidth.
 

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,229
I hope so. Then I'll be fine with 16. I've read some comments on Youtube that you have to have 32 and 64 GB to have max bandwidth.

The M1 Pro has max bandwidth at its base configuration, but just for clarity, that's the 200GB/s for the Pro. What those YouTuber comments are probably referencing is that the biggest bandwidth you can get is with the M1 Max, which is 400GB/s, and it comes with a minimum of 32 GB. It should be noted though that the CPU can actually only use about 240GB/s, so not much more than the Pro delivers anyway if CPU bandwidth is what you are really interested in.
 
  • Like
Reactions: Homy

Tagbert

macrumors 603
Jun 22, 2011
6,259
7,285
Seattle
The M1 Pro has max bandwidth at its base configuration, but just for clarity, that's the 200GB/s for the Pro. What those YouTuber comments are probably referencing is that the biggest bandwidth you can get is with the M1 Max, which is 400GB/s, and it comes with a minimum of 32 GB. It should be noted though that the CPU can actually only use about 240GB/s, so not much more than the Pro delivers anyway if CPU bandwidth is what you are really interested in.
The primary difference between the Pro and Max is GPU cores. As you can see, some of the other details don't make that much of a difference in practice. If you need a butt-load of GPU then go with the Max; otherwise the Pro may suit your needs. We don't all have to proclaim our Maxness all the time.
 

jdb8167

macrumors 601
Nov 17, 2008
4,859
4,599
I hope so. Then I'll be fine with 16. I've read some comments on Youtube that you have to have 32 and 64 GB to have max bandwidth.
You do need to have 32GB or 64GB to have max bandwidth because you need an M1 Max to get 400 GB/s and the Max does not have a 16 GB option.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
I hope so. Then I'll be fine with 16. I've read some comments on Youtube that you have to have 32 and 64 GB to have max bandwidth.

It is a secondary effect. You cannot have an M1 Max with less than 32GB of RAM. You cannot have a RAM capacity setting of 64GB without an M1 Max.

So on the BTO configuration page, if you pick an M1 Max, your memory is automatically boosted to 32GB of RAM (the 16GB option is greyed out; you cannot select it). If you leave it on an M1 Pro, go to the memory section, and try to select 64GB of RAM, it is greyed out (you have to go back and select a Max, which minimally boosts you to 32GB).

There is a corner case where you leave it on the M1 Pro and select 32GB of RAM. That just gets you more capacity, not more bandwidth.

To get the Max, which has the maximum bandwidth, you have to have either 32 or 64GB of memory. It is the 'floor' put under the Max's capacity that is the central issue. (Essentially, Apple has to put something on all of the memory channels of the Max, and there are twice as many of those in the Max versus the Pro. Second, Apple isn't mucking around with wimpy custom RAM packages that only have 4GB or less in them. [These packages start at 8GB, versus the M1's that start at 4GB. Very good chance there are two 4GB stacks in them to go along with the wider data path.])

The M1 Max couples both an upgraded CPU die and upgraded RAM packages. (Happens to make Apple more money too.)

P.S. This is a contributing reason why there are no DIMM slots: a user could leave all of them empty except one (or two), and that would cripple the system bandwidth.
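Boiled down, the BTO rules above form a simple predicate (a toy sketch; the names are mine):

```c
#include <stdbool.h>
#include <stdio.h>

typedef enum { M1_PRO, M1_MAX } chip_t;

/* The Max floors RAM at 32GB (every channel must be populated),
   and 64GB in turn requires a Max. */
static bool valid_config(chip_t chip, int ram_gb) {
    if (chip == M1_MAX)
        return ram_gb == 32 || ram_gb == 64;
    return ram_gb == 16 || ram_gb == 32;   /* M1 Pro */
}

int main(void) {
    printf("Pro/64GB: %d\n", valid_config(M1_PRO, 64));  /* 0 -- greyed out */
    printf("Max/16GB: %d\n", valid_config(M1_MAX, 16));  /* 0 -- floor is 32 */
    printf("Max/32GB: %d\n", valid_config(M1_MAX, 32));  /* 1 */
    return 0;
}
```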
 
  • Like
Reactions: Homy

Homy

macrumors 68030
Jan 14, 2006
2,510
2,461
Sweden
You do need to have 32GB or 64GB to have max bandwidth because you need an M1 Max to get 400 GB/s and the Max does not have a 16 GB option.

Yes but people said you have to have 64 GB with Max and 32 GB with Pro to get max bandwidth and you can't get that if Max has 32 and Pro has 16.
 

jdb8167

macrumors 601
Nov 17, 2008
4,859
4,599
Yes but people said you have to have 64 GB with Max and 32 GB with Pro to get max bandwidth and you can't get that if Max has 32 and Pro has 16.
If you have an M1 Max then you get the 400GB/s bandwidth with either 32 GB or 64 GB. The bandwidth comes from the twice-as-wide bus, not the size of the RAM chips. On the other hand, even if you have 32 GB on the M1 Pro, you only get 200 GB/s of bandwidth, because the Pro's memory bus is half the width of the M1 Max's.
 

fakestrawberryflavor

macrumors 6502
May 24, 2021
423
569
This guy hypothesizes that the 2 E cores in M1Pro/Max outperform the 4 E Cores in M1. I wonder if they borrow tech or are the E cores from A15 instead of A14, or something different altogether. Cool analysis tho either way
 
  • Like
Reactions: Tagbert

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
This guy hypothesizes that the 2 E cores in M1Pro/Max outperform the 4 E Cores in M1. I wonder if they borrow tech or are the E cores from A15 instead of A14, or something different altogether. Cool analysis tho either way

The E cores in the Pro/Max are less throttled than the ones on the M1. It is a trade-off: in many cases the E cores on the Pro/Max are consuming more power. The plain M1 keeps their power consumption throttled a bit to hit lower battery-consumption profiles.

First, the pragmatic memory bandwidth on the plain M1 is lower.

"...The two E-cores in the system clock at up to 2064MHz, and as opposed to the M1, there’s only two of them this time around, however, Apple still gives them their full 4MB of L2 cache, same as on the M1 and A-derivative chips. ..."


So for the plain M1, the E cores are more like "crabs in a barrel", with 4 cores fighting over 4MB of L2 at a lower base clock (when all E cores are 'lit up' and active). The Pro/Max de facto allow their smaller number of E cores to get a larger share of the pie (e.g., 2MB each is better than 1MB each).

The M1 is on LPDDR4X and the Pro/Max are on LPDDR5, so memory accesses on a cache miss are faster too. (This should even out when the M2 comes along for the entry model.)

Additionally, there is more base memory bandwidth to go around. The two P clusters are somewhat more isolated from the E cores' accesses.

"... It’s only when the E-cores, which are in their own cluster, are added in, when the bandwidth is able to jump up again, to a maximum of 243GB/s. ..."

On the Pro/Max, Apple can "spread out" the memory bandwidth sharing, keeping the CPU, GPU, NPU, etc. clusters farther apart from each other. (A hand-tuned assembly benchmark that leverages the Accelerate framework has a pretty good chance of applying a substantive amount of memory pressure to load/unload data.)

[The A15 also takes a bit of the memory throttle off the E-core cluster without substantive microarchitectural changes to the basic E-core logic.]

Second, the Pro/Max are allowed in some contexts to soak up relatively more power. If you look at the "on battery" scores, much of the gain is likely "faster RAM access and better caching". The "twice as many" P and GPU cores are consuming lots more power, so why keep the E cores throttled? Power is being spent anyway, and a relatively small amount more isn't going to make a big difference.

When off battery, that better memory bandwidth is used more for "race to sleep" than when on battery power with fewer cores active. [And if there is "too much" low-priority work normally targeted at the E cores, the scheduler can just light up an idle P core to offload to, for an even faster "race to sleep".]
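For what it's worth, software never targets the E cores directly on macOS; you tag work with a QoS class and the scheduler decides placement. A minimal sketch using the pthread QoS calls (whether BACKGROUND work may spill onto an idle P core is exactly the behavior being debated here):

```c
#include <pthread.h>
#include <pthread/qos.h>
#include <stdio.h>

int main(void) {
    /* Mark this thread as background priority; on Apple Silicon the
       scheduler steers such work toward the E-core cluster. */
    pthread_set_qos_class_self_np(QOS_CLASS_BACKGROUND, 0);

    /* ... low-priority work would go here ... */

    qos_class_t qc;
    int rel;
    pthread_get_qos_class_np(pthread_self(), &qc, &rel);
    printf("qos class: 0x%x\n", qc);   /* QOS_CLASS_BACKGROUND = 0x09 */
    return 0;
}
```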


That said... these two evaluations (Eclectic Light and Anandtech) seem to be a bit at odds over how the P-core loads are allocated. Anandtech has P cores going onto separate P-core clusters to increase the memory bandwidth throughput, while Eclectic Light has the scheduler keeping P-core allocation on a single cluster for as long as possible. There is a decent chance the P/E scheduler is more multi-dimensional than either of these two independent tests is measuring (i.e., it is adapting to the 'bias skew' these specific benchmark apps are attempting to measure).
 

thunng8

macrumors 65816
Feb 8, 2006
1,032
417
Yes but people said you have to have 64 GB with Max and 32 GB with Pro to get max bandwidth and you can't get that if Max has 32 and Pro has 16.
That is wrong as explained many times already.
 