With all the unenlightening back and forth about whether the A17 really is faster (no it isn't; no it is, but only at the cost of power; well, that's only realistically in short bursts; blah blah), an obvious data source is being omitted!
What do the numbers look like for an A17 running either JetStream 2.1 or Speedometer 2.0?
Browser benchmarks have the great advantages that
(a) they are easy to run, by anyone
(b) they correspond to a workload we KNOW Apple really cares about for the phone (as opposed to both GB and SPEC which each in their own way don't really match how people use phones)
(c) they will stress code that's written and compiled by Apple to be a best fit for the A17. If there is functionality in the A17 that increases IPC but requires special, or at least different, code (either generated by the compiler or otherwise exploited, e.g. in the JIT), we will see it there in a way we won't in GB6 (not yet recompiled for the A17) or SPEC (who knows how that's been compiled by the sites that claim to be running it on an iPhone?)
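On point (c): a compiler or JIT can only exploit new core features it can detect and target. Purely as illustration, here is a minimal Swift sketch of runtime feature detection on Apple platforms via the documented hw.optional.arm sysctl keys (which keys exist varies by OS release; the three below are just examples, and this is not how Apple's own JIT necessarily does it):

```swift
import Darwin

/// Returns true if the given `hw.optional.arm.*` sysctl reports 1.
/// Missing keys (older OS, different SoC) are treated as "feature not present".
func hasFeature(_ key: String) -> Bool {
    var value: Int32 = 0
    var size = MemoryLayout<Int32>.size
    let ok = sysctlbyname(key, &value, &size, nil, 0)
    return ok == 0 && value == 1
}

// A few ISA extensions a compiler/JIT could dispatch on (illustrative subset only).
let features = [
    "hw.optional.arm.FEAT_LSE",    // large-system atomics
    "hw.optional.arm.FEAT_BF16",   // bfloat16 arithmetic
    "hw.optional.arm.FEAT_I8MM",   // int8 matrix multiply
]

for f in features {
    print(f, hasFeature(f) ? "yes" : "no")
}
```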
So how about it, people? If you own an iPhone 15 Pro/Max (or, for that matter, something with an A16 or A15, AND you are running the latest iOS so that Safari is up to date!) go to
https://browserbench.org/JetStream/ and/or the Speedometer benchmark (also hosted at browserbench.org)
and give us your numbers (trying to be as honest as possible; e.g., don't run the test while you are also doing something else in the background!)
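Scores bounce around a bit run to run (thermals, background work), so if you post numbers it helps to do a few runs and report the median and spread. A trivial sketch, with made-up scores, of how I'd aggregate them:

```swift
// Hypothetical overall scores from, say, three JetStream 2.1 runs on the same device.
let runs: [Double] = [301.2, 296.8, 304.5]

let sorted = runs.sorted()
let median = sorted.count % 2 == 1
    ? sorted[sorted.count / 2]
    : (sorted[sorted.count / 2 - 1] + sorted[sorted.count / 2]) / 2
let mean = runs.reduce(0, +) / Double(runs.count)
let spread = sorted.last! - sorted.first!

print("median:", median, "mean:", mean, "spread:", spread)
```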
Latest patent findings show a possible four-way SoC package...?

In general, patent filings can be a mixed bag. Several companies have used patent filings to limit competitors, or to incentivize employees to innovate, without any specific intention of developing those ideas into products. Just because there is a patent does not mean the product is actively being developed.
A few months ago, I predicted that by the year 2030 or so, the SoC would just be a giant neural engine with a CPU attached to it.
Big core area reduced by 20%, small core area reduced by 20%. GPU single-core area increased by 5%, and GPU overall area increased by 20%.
I think we're going to start to see a disproportionate number of transistors devoted to the neural engine starting in 2025. That's the year Apple designs will have had time to respond to the LLM/AI boom that started in late 2022.
View attachment 2281675
Latest patent findings show a possible four-way SoC package...?
Why not make the central interconnect die a central SLC? UltraFusion seems to be all about cache pseudo-coherency. Furthermore, fast DRAM could be attached to the back side of that die for a massive amount of slow cache (but much faster and more efficient than going through a controller and transimpedance amplifiers to go off package to LPDDR).

That might make for a good fit for something that wanted to take on a subset of what MI300 covers. However, as a '4090 killer' ... it looks to have gobs of NUMA issues that won't provision a unified, single GPU out of that. You are separating the four "not-so-good chiplets" by about as much space as possible while still having them on the same package (the networking rectangle in the middle is filled with just multiple layers of 'wires'; that is basically 10x farther apart than in a dual UltraFusion set-up, and that is going to surface as quirks at the application software level).
And affordable? ... very probably not.
Get rid of that huge empty void in the middle and run multiple layers so that you had "express lanes" from top and bottom, similar to the vertical blue lines here:
with multiple layers (top/bottom express lanes on one layer, the bridge between chiplets on another). That would not leave a large 'hole' in the middle.
The patent is more about the layers of networking than the 'iron cross' of dies.
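To put a rough number on the "farther apart" point, here is a toy geometry sketch. Every dimension in it is an assumption chosen only to show where a figure of that order could come from, not a measurement of any real package or of the patent drawing:

```swift
// Toy geometry only -- all dimensions are assumptions, not measurements.
// Compare the edge-to-edge separation of two compute dies in a dual
// (UltraFusion-style) layout vs. two dies on opposite sides of a central
// interconnect die that is roughly the same size as the compute dies.

let computeDieSide = 20.0     // mm, ~400 mm^2 compute chiplet (assumed)
let ultraFusionGap = 2.0      // mm, near-abutted dies over a silicon bridge (assumed)
let centralDieSide = 20.0     // mm, central "networking rectangle" (assumed)

// Dual layout: a signal crosses only the bridge gap.
let dualSeparation = ultraFusionGap

// Iron-cross layout: a signal between opposite chiplets crosses the whole central die.
let crossSeparation = centralDieSide

print("dual UltraFusion separation ~\(dualSeparation) mm")
print("opposite chiplets via central die ~\(crossSeparation) mm")
print("ratio ~\(Int(crossSeparation / dualSeparation))x")   // ~10x with these assumptions
```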
Why not make the central interconnect die a central SLC? UltraFusion seems to be all about cache pseudo-coherency. Furthermore, fast DRAM could be attached to the back side of that die for a massive amount of slow cache (but much faster and more efficient than going through a controller and transimpedance amplifiers to go off package to LPDDR).
Probably not that simple, seeing that Apple's SLC blocks are associated with individual memory controllers. And it's not cache pseudo-coherency either (at least if we go by previous Ultras). A device will request the data from the appropriate memory controller/SLC block even if it means going to the other die. In other words, the SLC caches on different dies are not mirrored; the interconnect simply makes one big network out of everything.
I am not sure I share @deconstruct60's pessimism about NUMA issues. Yes, there will be latency costs, but Apple designs already have very high latency. Scheduling work for GPU is another issue of course, but Apple has some methods to approach this.
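To illustrate the point above about requests going to the owning memory controller/SLC block (possibly on the other die), here is a toy address-interleaved slice lookup. The constants and the hash are made up for illustration; Apple's actual interleaving scheme is not public:

```swift
// Toy model of address-interleaved SLC slices (NOT Apple's real hash).
// Each memory controller owns one SLC block; slices are spread across two dies.

let slicesPerDie = 8          // assumed
let dieCount = 2
let totalSlices = slicesPerDie * dieCount
let interleaveGranule = 256   // bytes per interleave step (assumed)

func sliceIndex(for address: UInt64) -> Int {
    // Simple round-robin interleave on the granule; real designs hash more bits.
    Int((address / UInt64(interleaveGranule)) % UInt64(totalSlices))
}

func owningDie(of slice: Int) -> Int { slice / slicesPerDie }

let addr: UInt64 = 0x1234_5600
let slice = sliceIndex(for: addr)
print("address 0x\(String(addr, radix: 16)) -> slice \(slice) on die \(owningDie(of: slice))")
// A requester on the other die goes over the interconnect to reach the owning slice;
// the slices form one big distributed cache, not two mirrored copies.
```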
By pseudo-coherency, I meant that multiple SLCs behave as a single cache, which is exactly what UltraFusion was designed for (as you described). I guess I have to mention the caveat that the M1 UltraFusion didn't meet those design criteria very well, hence the NUMA concerns. Perhaps in the incarnation I'm attempting to describe, one should call the per-die SLC a DLC (die-level cache), and the central die's SLC simply the system-level cache, with very fast DRAM backing (and with appropriate memory controllers). I generally don't want to get caught up in splitting hairs about definitions; I'm just trying to convey an architectural idea.
In general, patent filings can be a mixed bag. Several companies have used patent filings to limit competitors, or to incentivize employees to innovate, without any specific intention of developing those ideas into products. Just because there is a patent does not mean the product is actively being developed.

Maybe OTHER companies do that. I have seen very little evidence of this in Apple's technical patents. Most of the technical patents I've looked at …
So we’re talking 10-15% improvements for -20% real estate CPU wise? That’s very impressive.
What’s a bit of a disappointment is that we don’t seem to get much perf-to-space improvement on the GPU side. I guess the new RT features are taxing.
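For what those percentages imply in perf-per-area terms (taking the quoted rumor numbers entirely at face value), a quick back-of-the-envelope:

```swift
// Rough perf-per-area arithmetic using the figures quoted above (rumors, taken at face value).
let areaScale = 0.80            // CPU core area: -20%
let perfLow  = 1.10             // +10% performance
let perfHigh = 1.15             // +15% performance

print("CPU perf per unit area: +\(Int((perfLow / areaScale - 1) * 100))% to +\(Int((perfHigh / areaScale - 1) * 100))%")
// -> roughly +37% to +43%

// GPU: a single core ~5% larger; if per-core throughput were otherwise flat,
// perf per unit area per core would be slightly down (the RT hardware has to earn its keep).
let gpuAreaScale = 1.05
print("GPU perf per unit area (flat per-core perf assumed): \(Int((1 / gpuAreaScale - 1) * 100))%")
// -> about -4%
```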
That might make for a good fit for something that wanted to take on a subset of what MI300 covers. However, as a '4090 killer' ... it looks to have gobs of NUMA issues that won't provision a unified, single GPU out of that.
UltraFusion is all about making the internal network between two dies transparent. There is really nothing much to UltraFusion beyond transmitting the cache-coherency traffic over that network. UltraFusion's job is to ensure you do NOT have to do anything substantially different to handle coherency issues. Shipping data among the various subsystem units works the same across dies as it does on a single die.
Instead of being an approximately 800mm^2 single die, it is two approximately 400mm^2 dies. That's it. Primarily, it is more cost-effective to make.
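The cost argument is easy to sanity-check with a textbook defect-yield model. The defect density below is an assumed, illustrative figure, not a known N3-class number:

```swift
import Foundation

// Why "two ~400 mm^2 dies" beats "one ~800 mm^2 die" on cost:
// a simple Poisson defect model, yield = exp(-D * A).
// D (defects per cm^2) is assumed for illustration only.

let defectDensity = 0.1          // defects per cm^2 (assumed)

func yield(dieAreaMM2: Double) -> Double {
    exp(-defectDensity * dieAreaMM2 / 100.0)   // convert mm^2 -> cm^2
}

let big = yield(dieAreaMM2: 800)          // ~0.45
let small = yield(dieAreaMM2: 400)        // ~0.67

print("monolithic 800 mm^2 yield:", big)
print("two 400 mm^2 dies paired blind:", small * small)   // ~0.45, no better than monolithic
print("two 400 mm^2 dies, tested first:", small)          // known-good-die packaging is the actual win
```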
If the interconnection die is NOT 3D, then it is nowhere near as transparent. It introduces side effects because physics is physics; longer distances through more material bring issues. Note also that instead of taking one large area and slicing it in half (with no net increase in 2D footprint), UltraFusion does not introduce substantially more distance between the dies. If you create a 2D central die that is at least as big as, if not bigger than, the dies you are connecting, then the 2D footprint here is going way up.
Moving the 'SLC' farther away from the memory doesn't make much sense; it basically increases the round-trip time there as well. Part of the design trade-off here is that by bringing the memory packages much closer, you are trading off how many chiplets of that type you can bring into extremely close proximity to one another.
The AMD desktop/server approach is to make all CPU cores have uniformly slower access, so things stay uniform (the memory all hangs off the single I/O die, with a relatively much more limited set of memory lanes, i.e. lower concurrent bandwidth). AMD does a 180-degree turn on MI200/MI300/7900: the 'core' count is orders of magnitude higher, and the channel breadth to handle concurrent requests is much higher.
But it is plainly NOT in the diagrams. The problem with attaching all of the DRAM to a single die is that it is a single die. You are not going to get as many memory channels, because a single die of around the same size as the ones you are connecting isn't going to have more edge space than the aggregate of the others. Going off-die to RAM sucks up a disproportionately large amount of die space. The more you try to do in one place, the faster you run out of area.
This is why Apple is trying to spread the memory channels out across more dies: to get more concurrency. That comes in very handy when trying to feed hundreds and hundreds of cores in a timely fashion. This is mainly a GPU with some CPU cores and uncore stuff 'bolted on', not primarily a CPU with other stuff 'bolted on'.
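A toy "beachfront" comparison makes the edge-space argument concrete; die dimensions are assumed and squares are used for simplicity:

```swift
// Toy "beachfront" comparison (all dimensions assumed, square dies for simplicity).
// Off-package DRAM PHYs live along die edges, so usable edge length roughly
// bounds how many memory channels you can bring out.

let side = 20.0                        // mm, ~400 mm^2 die (assumed)
let perimeterPerDie = 4 * side         // 80 mm

let oneCentralDie = perimeterPerDie        // all DRAM hung off a single die
let fourComputeDies = 4 * perimeterPerDie  // channels spread across the chiplets

print("edge available on one central die: \(oneCentralDie) mm")
print("aggregate edge on four compute dies: \(fourComputeDies) mm")
// Even ignoring that part of each edge is consumed by die-to-die links,
// the aggregate beachfront (and hence concurrent channel count) is ~4x larger.
```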
I tested a bit with an Android phone, and I get better efficiency scores than this A17 Pro. Apple only goes for performance instead of efficiency; it is only when you activate low power mode that it beats my phone. I'm thinking iOS 17 is starting to be a heavy OS, IMO; that's why I hear that some XR/XS and 11 units are getting toasty with iOS 17 too.

If you can't even explain what you tested and what your "efficiency" score refers to, no one is going to take you seriously...
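For what it's worth, a minimally well-defined efficiency claim has to name the benchmark, the score, and the average power over the run. Something as simple as points-per-watt, with the (entirely hypothetical) numbers stated, would at least be checkable:

```swift
// What a minimally well-defined "efficiency" claim needs: a named benchmark score
// and an average device power over the run. The numbers below are hypothetical.

struct BenchmarkRun {
    let device: String
    let benchmark: String
    let score: Double        // in the benchmark's own units
    let avgPowerWatts: Double
}

func efficiency(_ r: BenchmarkRun) -> Double { r.score / r.avgPowerWatts }

let a = BenchmarkRun(device: "Phone A", benchmark: "Speedometer 2.0", score: 300, avgPowerWatts: 5.0)
let b = BenchmarkRun(device: "Phone B", benchmark: "Speedometer 2.0", score: 240, avgPowerWatts: 3.5)

for r in [a, b] {
    print("\(r.device): \(efficiency(r)) \(r.benchmark) points per watt")
}
```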
If the design already has high latency such that they had to put in mitigations to get around it ... cranking it even HIGHER probably makes those mitigations less effective. Piling even more Rube Goldberg mitigations on top of those mitigations has a decent chance of going sideways as well.
From the drawings, more like M3 Extreme...!

IF we take these numbers at face value (and that's always tricky; the numbers in patents are supposed to be "suggestive", not design specs), this implies the basically plausible:
- 64-core CPU (48P/16E)
- 160-core GPU (w/hardware ray-tracing)
- 64-core Neural Engine
- 960GB ECC LPDDR5X RAM (Sixteen 64GB chips)
- 2TB/s UMA bandwidth
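The 2TB/s figure is at least arithmetically plausible. Here's the back-of-the-envelope, using assumed (not leaked) numbers for per-die bus width and LPDDR5X data rate:

```swift
import Foundation

// Sanity-checking the "2 TB/s" figure (all inputs are assumptions, not specs).
// Bandwidth = data rate (MT/s) * bus width (bits) / 8, per the usual DDR arithmetic.

let mtPerSecond = 8533.0          // LPDDR5X-8533 (assumed)
let bitsPerDie = 512.0            // M3 Max-class memory interface per die (assumed)
let dieCount = 4.0                // four-way package per the patent drawing

let bytesPerSecond = mtPerSecond * 1e6 * bitsPerDie * dieCount / 8
print(String(format: "~%.2f TB/s", bytesPerSecond / 1e12))   // ~2.18 TB/s
```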