I think there is an important difference in that the SLC cache is associated with the memory controller and not the CPU. It’s more of a coherency protocol/energy saving feature. And it has the same bandwidth as the DRAM.
Not sure what you mean by it not being associated with the 'CPU'. Do you mean not the 'cores'? Or that the memory controller has its own "communication agent" on the internal bus? Memory controllers are all inside modern CPUs ( in the package if chiplets , on the die if not ).
The bandwidth being the same for the SLC 'fronting' the memory controller is because they use the same internal bus system agent ( which can only handle 4 address requests at a time). But bandwidth isn't the major factor when talking about caches. Latency is the bigger issue. If the SLC latency is the same as the DRAM's ... what is it doing?
When stuff is delivered matters as much , if not more , than how much is delivered. How many register loads can the core actually do per clock? Any bandwidth beyond that is going where? Not to that core.
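Rough back-of-the-envelope on that point (every number below is an illustrative assumption, not a measured Apple figure):

```python
# Back-of-the-envelope: how much bandwidth can a single core actually sink?
# Every number below is an illustrative assumption, not a measured Apple figure.

loads_per_cycle = 3        # assumed number of load ports on the core
bytes_per_load = 16        # assumed widest load (a 128-bit register)
core_clock_ghz = 3.2       # assumed core clock

core_demand_gb_s = loads_per_cycle * bytes_per_load * core_clock_ghz
print(f"one core can consume at most ~{core_demand_gb_s:.0f} GB/s")

# Anything the memory system can stream beyond that is going to other agents
# (other cores, GPU, NPU), not to this core.
assumed_dram_gb_s = 200    # illustrative DRAM bandwidth, not a real spec
print(f"surplus beyond one core: ~{assumed_dram_gb_s - core_demand_gb_s:.0f} GB/s")
```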
If you're trying to say that one bank of the L2 has a minor latency difference from the other banks: if the cache is physically big enough and the cores are spread out into different corners around the cache... that wouldn't be surprising. But it really wouldn't add another 'cache level' if the gap is relatively small ( relative to the next level up).
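To put toy numbers on that (cycle counts made up purely for illustration, not measurements):

```python
# Toy numbers for the 'is it another cache level?' question; all cycle counts
# are made up for illustration, not measurements.

near_bank = 16    # assumed latency to the L2 bank nearest the core
far_bank = 20     # assumed latency to a bank in the far corner of the L2
next_level = 90   # assumed latency to the next level up (SLC)

bank_gap = far_bank - near_bank        # 4 cycles
level_gap = next_level - far_bank      # 70 cycles

# A few cycles of bank-to-bank skew next to a ~70-cycle jump to the next level
# reads as one slightly non-uniform L2, not as an extra cache level.
print(f"bank-to-bank gap: {bank_gap} cycles vs gap to next level: {level_gap} cycles")
```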
Of course, at the end of the day this is probably a translation problem. If you go strictly by the nominal nomenclature (by cache level) then SLC is obviously L4. But cache levels also have certain functional connotations to them, and from that perspective I’d describe Apple's L2 as a hybrid of a traditional L2/L3.
I still don't see where you are getting L3 from. If there is some command that moves a "remote" L2 cache line to a "fast"/"faster" L2 cache, then there is a path to calling that remote location a different cache level. If the request goes out and the L2 cache line never moves from where it was initially placed ( even if it takes very slightly longer), then those are the same cache level.
The documents outline that the L2 tracks multiple cores sharing the same L2 data. So the nominal mode appears to be that the data doesn't really 'move' or get duplicated. If it isn't duplicating or moving the data, then it is the same level.
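A minimal sketch of that reading (my own illustration of the idea, not Apple's actual structures):

```python
# Sketch of that reading: a line in the shared L2 stays in whichever bank it
# was filled into, and the tag simply tracks which cores are sharing it.
# Structure and names are my own illustration, not Apple's actual design.

class L2Line:
    def __init__(self, addr, home_bank):
        self.addr = addr
        self.home_bank = home_bank   # bank the line was originally filled into
        self.sharers = set()         # cores the L2 is tracking for this line

    def access(self, core_id):
        # No migration, no duplication: just record the extra sharer and
        # serve the data from wherever it already lives.
        self.sharers.add(core_id)
        return self.home_bank

line = L2Line(addr=0x4000, home_bank=1)
line.access(core_id=0)
line.access(core_id=3)
print(line.sharers)   # {0, 3} -- the same cache level for both cores
```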
There is some asymmetry. Each core is assigned a “fast” L2 portion and it costs extra to access the portions associated with other cores. Maynard’s document has more info, I don't remember the details. It’s not exactly like IBM's implementation but the L2 is not uniformly shared either.
Errr this one??
drive.google.com
There are four units, but that is mainly driven both by the size of the cache ( the bigger it is, the more chunks it is pragmatically going to need to be composed of) and by Apple chasing lower power ( shut down parts of the L2 cache (and cores) if they happen to not be needed at the moment).
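Something along these lines, purely as an illustration of the power-gating angle (the 1:1 core-to-unit pairing is an assumption made just for the example):

```python
# Purely illustrative: with the L2 split into four units, idle units (and the
# cores paired with them) can be shut down rather than keeping the whole
# cache powered. Assumes a 1:1 core-to-unit pairing just for the example.

L2_UNITS = 4

def gated_units(active_cores):
    """Return which L2 units could be powered down given the active cores."""
    needed = {core % L2_UNITS for core in active_cores}
    return set(range(L2_UNITS)) - needed

print(gated_units({0}))           # {1, 2, 3}: three of four units can be off
print(gated_units({0, 1, 2, 3}))  # set(): everything busy, nothing gated
```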
page 270
"...
All this is historically interesting, but raises a meta-issue. When cores share an L2, either they all run at
the same frequency, or the cost to access the L2 is substantially higher because of buffering across
clock domains. So what’s the right tradeoff? Later below we’ll see a very complicated (but clever!)
Apple patent for optimal OS scheduling given the constraint of multiple cores having to run at the same
frequency. But more generally, as we move from the simple world of one performance and one
efficiency cluster, is four the optimal number of CPUs sharing an L2 (and thus sharing frequency)?
There are clearly advantages to a large L2, and a shared L2; but there are also costs. Is a better tradeoff
perhaps three cores sharing an L2, and the future low-end being something like two performance
clusters (ie six performance cores)? Or two cores sharing a substantially smaller L2, and all the two-core
clusters (including the efficiency two-core clusters) sharing a large CPU L3 (distinct from the SLC),
operating on its ...
"
page 270-271 ( L2 cache snoops for all 4 processors )
"...
What Apple do, however, is neither exclusive nor inclusive. To implement this, while retaining the
advantages of snoop filtering by a higher level cache, the tags of lower level caches are duplicated in
the higher level caches. As far as I can tell this happens in both the L2 and SLC, and the idea seems to
be something like:
- core to core snooping will mostly occur within a core complex, and can be handled well by having the
L2 for that complex snoop for all 4 processors of the complex
- snooping between different types of IP (like NPU snooping changes in a GPU cache) will be filtered at
the SLC level
- left uncertain is whether snoops between the Performance Core Complex and the Efficiency Core
Complex are treated as a special case or handled by the SLC.
..."
page 227 ( L2 split into 4 big, addressable chunks )
" ...
Third within a Core Complex, the L2 is apparently split the same way, into four independently
addressable units, each of which is capable of providing close to 32B/clock cycle, so we do see total L2
aggregate bandwidth increase linearly with the number of cores. My guess is the drop in bandwidth by ~25%
has to do with writing the line into the L1D, something like
- for four cycles I can read 32B/cycle from L1D (16 B from each half of the L1D)
- during those same four cycles each of four quarters of a 128B line is transferred from L2
- during the fifth cycle I can’t access the L1D because the fully-accumulated line is transferred from a
line buffer into the cache. ..."
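Working through that fill pattern with the numbers in the quote:

```python
# Working through the fill pattern described in that passage, using its numbers.
peak_b_per_cycle = 32     # L1D supplies 32B/cycle (16B from each half)
fill_cycles = 4           # four cycles while the 128B line arrives from L2 in quarters
blocked_cycles = 1        # fifth cycle: line buffer -> L1D write blocks reads

effective = fill_cycles * peak_b_per_cycle / (fill_cycles + blocked_cycles)
drop = 1 - effective / peak_b_per_cycle
print(f"~{effective:.1f} B/cycle effective, a ~{drop:.0%} drop from peak")
# -> ~25.6 B/cycle, about a 20% drop -- in the ballpark of the ~25% the doc measures
```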
This is substantially different from what IBM is doing. What IBM is doing is more like taking cache eviction victims and trying to stuff them elsewhere in the L2 cache than where they would normally belong. I highly doubt that results in lower energy overhead at all ( the complexity is probably at least twice as high as what Apple is doing ). In some sense, IBM is trying as hard as possible to
never turn any of the L2 cache off, ever ( which is where the power savings would come from ). They are going way , way , way out of their way to find 'extra' stuff to push into the L2 cache so it is always as full as possible. Apple is NOT doing that.
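For contrast, a crude sketch of the IBM 'virtual L3' behavior as I understand it (not IBM's actual placement algorithm, just the shape of the idea):

```python
# Crude sketch of the Telum-style 'virtual L3' idea as I understand it: when a
# line falls out of one core's L2, try to park it in spare capacity of some
# other core's L2 instead of letting it drop to memory. Not IBM's actual
# placement algorithm, just the shape of the behaviour described above.

def place_victim(victim_addr, owner, l2_slices, slice_capacity):
    """l2_slices: one set of resident line addresses per core's L2 slice."""
    for core, resident in enumerate(l2_slices):
        if core != owner and len(resident) < slice_capacity:
            resident.add(victim_addr)   # stuffed into someone else's L2
            return core                 # that slice is now acting as 'virtual L3'
    return None                         # nowhere to park it -> off to memory

slices = [set(), {0x100, 0x140}, set(), set()]
print(place_victim(victim_addr=0x2000, owner=0, l2_slices=slices, slice_capacity=4))
# Note the power implication: every slice is kept as full (and as awake) as
# possible -- the opposite of gating idle slices off.
```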
[ P.S. IBM is saving power in a very large, macro sense: if you can push the workload of 4 large refrigerator-sized servers into just one refrigerator-sized server, then the net datacenter power consumed goes down. But at the single-system level... they aren't really doing that with this virtual L3/L4 thing. They could get some savings out of other components used in the server, though. But z series... it is server-consolidation focused. ]