Apple moving to Tiling (chiplets) for ASi could definitely improve performance in regards to their GPU, just throw more GPU chiplets into the mix...
  • 32-core CPU (24P/8E)
  • 512-core GPU
  • 128-core Neural Engine
  • 960GB ECC LPDDR5X RAM
  • 2.16TB/s UMA bandwidth
 
What is “AT” ?
I’m guessing AnandTech forums.
Yes, must be. I don’t think Ars has a “cpu forum” — AnandTech is now just an archive, though; as a news site, it shut down at the end of August last year. Are those forums even moderated now?

I’m kind of half-expecting Apple to bring Anand Shimpi into a more visible role, I thought he did an outstanding job explaining the attitudes and priorities of the Apple silicon design team in an interview with Andru Edwards back in February 2023.

He’s been at Apple a full decade now (they hired him away from himself in 2014), he seems like the natural choice to introduce whatever it is they are doing with the Ultra in 2025.
 
I’m kind of half-expecting Apple to bring Anand Shimpi into a more visible role, I thought he did an outstanding job explaining the attitudes and priorities of the Apple silicon design team in an interview with Andru Edwards back in February 2023.
Here’s my old transcript of his comments in that interview. It is edited for clarity, removing interjections like "I think" and "kind of" and "sort of," but I've retained his rhetorical ", right? ..." and “So …” habits, as they capture the feel of the dialogue. It's a good overview of what Apple is trying to do.

February 2023 is after the launch of the M2 Pro/Max, but before the launch of the M2 Ultra in June. Similar to our current position, after the launch of M4 Pro/Max, but before Ultra.

General questions
“The silicon team doesn’t operate in a vacuum, right? When these products are being envisioned and designed, folks on the architecture team, the design team, they’re there, they’re aware of where we’re going, what’s important, both from a workload perspective, as well as the things that are most important to enable in all of these designs.

I think part of what you’re seeing now is this now decade-plus long, maniacal obsession with power-efficient performance and energy efficiency. If you look at the roots of Apple Silicon, it all started with the iPhone and the iPad. There we’re fitting into a very, very constrained environment, and so we had to build these building blocks, whether it’s our CPU or GPU, media engine, neural engine, to fit in something that’s way, way smaller from a thermal standpoint and a power delivery standpoint than something like a 16-inch MacBook Pro. So the fundamental building blocks are just way more efficient than what you’re typically used to seeing in a product like this.

The other thing you’re noticing is, for a lot of tasks that maybe used to be high-powered use cases, on Apple Silicon they actually don’t consume that much power. If you look, compared to what you might find in a competing PC product, depending on the workload, we might be a factor of two or a factor of four times lower power. That allows us to deliver a lot of workloads, that might have been high-power use cases in a different product, in something that is actually very quiet and cool and long-lasting.

The other thing that you’re noticing is that the single-thread performance, the snappiness of your machine, it’s really the same high-performance core regardless of if you’re talking about a MacBook Air or a 14-inch Pro or 16-inch Pro or the new Mac mini, and so all of these machines can accommodate one of those cores running full tilt, again we’ve turned a lot of those usages and usage cases into low-power workloads.

You can’t get around physics, though, right? So if you light up all the cores, all the GPUs, the 14-inch system just has less thermal capacity than the 16, right? So depending on your workload, that might drive you to a bigger machine, but really the chips are across the board incredibly efficient.”

Battery life question
“You can look at how chip design works at Apple. You have to remember we’re not a merchant silicon vendor, at the end of the day we ship product. So the story for the chip team actually starts at the product, right? There is a vision that the design team, that the system team has that they want to enable, and the job of the chip is to enable those features and enable that product to deliver the best performance within the constraints, within the thermal envelope of that chassis, that is humanly possible. So if you look at what we did going from the M1 family to the M2 Pro and M2 Max, at any given power point, we’re able to deliver more performance.

On the CPU we added two more efficiency cores, two more of our e-cores. That allowed us, or was part of what allowed us, to deliver more multi-thread performance, again, at every single power point where the M1 and M2 curves overlap we were able to deliver more performance at any given power point. The dynamic range of operations [is] a little bit longer, a little bit wider, so we do have a slight increase in terms of peak power, but in terms of efficiency, across the range, it is a step forward versus the M1 family, and that directly translates into battery life.

The same thing is true for the GPU, it’s counterintuitive, but a big GPU running a modest frequency and voltage, is actually a very efficient way to fill up a box. So that’s been our philosophy dating back to iPhone and iPad, and it continues on the Mac as well.

But really the thing that we see, that the iPhone and the iPad have enjoyed over the years, is this idea that every generation gets the latest of our IPs, the latest CPU IP, the latest GPU, media engine, neural engine, and so on and so forth, and so now the Mac gets to be on that cadence too.

If you look at how we’ve evolved things on the phone and iPad, those IPs tend to get more efficient over time. There is this relationship, if the fundamental chassis doesn’t change, any additional performance you draw, you deliver has to be done more efficiently, and so this is the first time the MacBook Pro gets to enjoy that and be on that same sort of cycle.

On the silicon side, the team doesn’t pull any punches, right? The goal across all the IPs is, one, make sure you can enable the vision of the product, that there’s a new feature, a new capability that we have to bring to the table in order for the product to have everything that we envisioned, that’s clearly something that you can’t pull back on.

And then secondly, it’s do the best you can, right? Get as much done in terms of performance and capability as you can in every single generation. The other thing is, Apple’s not a chip company. At the end of the day, we’re a product company. So we want to deliver, whether it’s features, performance, efficiency. If we’re not able to deliver something compelling, we won’t engage, right? We won’t build the chip. So each generation we’re motivated as much as possible to deliver the best that we can.”

Neural engine question
“There are really two things you need to think about, right? The first is the tradeoff between a general purpose compute engine and something a little more specialized. So, look at our CPU and GPU, these are big general purpose compute engines. They each have their strengths in terms of the types of applications you’d want to send to the CPU versus the GPU, whereas the neural engine is more focused in terms of the types of operations that it is optimized for. But if you have a workload that’s supported by the neural engine, then you get the most efficient, highest density place on the chip to execute that workload. So that’s the first part of it.

The second part of it is, well, what kind of workload are we talking about? Our investment in the neural engine dates back years ago, right? The first time we had a neural engine on an Apple Silicon chip was A11 Bionic, right? So that was five-ish years ago on the iPhone. Really, it was the result of us realizing that there were these emergent machine learning models that we wanted to start executing on device, and we brought this technology to the iPhone, and over the years we’ve been increasing its capabilities and its performance. Then, when we made the transition of the Mac to Apple Silicon, it got that IP just like it got the other IPs that we brought, things like the media engine, our CPU, GPU, Secure Enclave, and so on and so forth.

So when you’re going to execute these machine learning models, performing these inference-driven models, if the operations that you’re executing are supported by the neural engine, if they fit nicely on that engine, it’s the most efficient way to execute them. The reality is, the entire chip is optimized for machine learning, right? So a lot of models you will see executed on the CPU, the GPU, and the neural engine, and we have frameworks in place that kind of make that possible. The goal is always to execute it in the highest performance, most efficient place possible on the chip.”

Process node question
“You’re referring to the transistor. These are the building blocks all of our chips are built out of. The simplest way to think of them is like a little switch, and we integrate tons of these things into our designs. So if you’re looking at M2 Pro and M2 Max, you’re talking about tens of billions of these, and if you think about large collections of them, that’s how we build the CPU, the GPU, the neural engine, all the media blocks, every part of the chip is built out of these transistors. Moving to a new transistor technology is one of the ways in which we deliver more features, more performance, more efficiency, better battery life.

So you can imagine, if the transistors get smaller, you can cram more of them into a given area, that’s how you might add things like additional cores, which is the thing you get in M2 Pro and M2 Max—you get more CPU cores, more GPU cores, and so on and so forth. If the transistors themselves use less power, or they’re faster, that’s another method by which you might deliver, for instance, better battery life, better efficiency. Now, I mentioned this is one tool in the toolbox. What you choose to build with them, the underlying architecture, microarchitecture and design of the chip also contribute in terms of delivering that performance, those features, and that power efficiency.

If you look at the M2 Pro and M2 Max family, you’re looking at a second-generation 5 nanometer process. As we talked about earlier, the chip got more efficient. At every single operating point, the chip was able to deliver more performance at the same amount of power.”

Media engine question
“Going back to the point about transistors, taking that IP and integrating it on the latest kind of highly-integrated SOC and the latest transistor technology, that lets you run it at a very high speed and you get to extract a lot of performance out of it. The other thing is, and this is one of the things that is fairly unique about Apple Silicon, we built these highly-integrated SOCs, right?

So if you think about the traditional system architecture, in a desktop or a notebook, you have a CPU from one vendor, a GPU from another vendor, each with their own sort of DRAM, you might have accelerators kind of built into each one of those chips, you might have add-in cards as additional accelerators. But with Apple Silicon in the Mac, it’s all a single chip, all backed by a unified memory system, you get a tremendous amount of memory bandwidth as well as DRAM capacity, which is unusual, right?

In a machine like this a CPU is used to having large capacity, low bandwidth DRAM, and a GPU might have very low capacity, high bandwidth DRAM, but now the CPU gets access to GPU-like memory bandwidth, while the GPU gets access to CPU-like capacity, and that really enables things that you couldn’t have done before. Really, if you are trying to build a notebook, these are the types of chips that you want to build it out of. And the media engine comes along for the ride, right? The technology that we’ve refined over the years, building for iPhone and iPad, these are machines where the camera is a key part of that experience, and being able to bring some of that technology to the Mac was honestly pretty exciting. And it really enabled just a revolution in terms of the video editing and video workflows.

The addition of ProRes as a hardware accelerated encode and decode engine as a part of the media engine, that’s one of the things you can almost trace back directly to working with the Pro Workflows team, right? This is a codec that it makes sense to accelerate and integrate into hardware for the customers we're expecting to buy these machines. It was something that the team was able to integrate, and for those workflows, there’s nothing like it in the industry, on the market.”
 
Everyone has NOT switched to tiles!
AMD has (as I described above) for optionality reasons.
Intel did it for marketing reasons and, this being stupid, it's one more thing killing them.
Meanwhile:

nV and Apple both use chiplets as an idea, but make the chiplets as large as practical before gluing them together, not the crazy Ponte Vecchio level disaggregation Intel is championing.
QC, Mediatek, Ampere, AMZ, Google (anyone I left out) are not using them.

Chiplets are a technology, and like most technologies, have their place. My complaint is that you seem to have absolutely drunk the Intel koolaid that chiplets are the optimal design point for everything going forward, even something as small as an M1 class sort of chip (ie something targeting a tablet or low-end laptop).
It may be (unlike Intel I don't claim to predict the future ten years from now) that in ten years chiplets will in fact be the optimal design point for everything from a phone upward. But I doubt it, and that is certainly NOT the case today.

And I'm sorry, I can't take anyone who talks about the "M2/M3" lag seriously. This, like the claims about Apple's loss of talent, is the sort of copium nonsense you read on x86-support-group sites, not in genuine technical discussion.

You need to relax before you go off calling people stupid.

Intel laughed at AMD years ago, when Pat G. first came on, with numerous jabs about AMD 'gluing chips together'.... And AMD has been smoking Intel on a cost basis and with the faster and faster iterations that Tiling allows, and where is Pat now?

I'm probably one of the last folks anyone would call an Intel fanboy.
So your impressions seem to be off at least there, and from that I have similar doubts as to some of your other expertise.

Industry-wide, tile/chiplet/stack is the absolute, foregone conclusion.
We're reaching the current limits of litho, and outside of some groundbreaking new discoveries, the industry has no better idea of how to continue to advance using current CMOS technology.
Speed advances petered out years ago.
Density advances have slowed to a crawl.
Defects that before would have been a minor inconvenience, affecting a handful of transistors (xters) each, can now, because of Node scaling, affect tens to hundreds of xters.
So now redundancy means duplicates of larger and larger portions of critical systems, which affects the total transistor budget and all sorts of time and validation work.
This cost used to be a minor part of R&D, and is now rising rapidly.
Apple using the ARM ISA is probably a great help, as it doesn't have half the baggage to carry; however, that doesn't mean Apple, by virtue of its 800 lb gorilla size, is somehow immune to any of these industry-wide realities.
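To put a rough number on the yield/cost side of this, here's a back-of-the-envelope sketch using the classic Poisson die-yield model. The defect density and die sizes are made up purely for illustration and aren't figures for any real Apple/AMD/Intel product:

```python
import math

def die_yield(area_cm2: float, defects_per_cm2: float) -> float:
    """Classic Poisson yield model: probability a die has zero killer defects."""
    return math.exp(-area_cm2 * defects_per_cm2)

D0 = 0.1       # assumed defect density (defects/cm^2) -- illustrative only
big_die = 6.0  # a hypothetical ~600 mm^2 monolithic die, in cm^2
chiplet = 1.5  # four ~150 mm^2 chiplets covering the same total area

print(f"600 mm^2 monolithic die yield: {die_yield(big_die, D0):.0%}")  # ~55%
print(f"150 mm^2 chiplet yield:        {die_yield(chiplet, D0):.0%}")  # ~86%
```

The point is only directional: smaller dies throw away less silicon per defect, which is the cost lever tiling pulls. It says nothing about the packaging, latency, and power costs that chiplets add, which is the other half of this argument.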

And my original comment was exactly the opposite of what you're claiming: that chiplet-like technology does not have an inherent performance improvement, as was suggested.

As for Apple M series issues, all I know is what I've read and heard from folks in the industry and litho/fab Macheads. My admittedly limited understanding of the issue is that with that top tier talent loss, the M2 clock was goosed a bit, and for M3 the existing team yielded some improvements, but goosed the clocks quite a bit more.
This matches with a quick google: https://medium.com/@kellyshephard/m3-vs-m2-vs-m1-macbook-air-which-one-is-right-for-you-8e7c532e1de4
M1 to M2, Apple increased xter density by 25%, stayed on the same Node, and was that a massive improvement?
M2 to M3, Apple increased xter density again by 20%, switched from N5 to N3 with a significant clock-speed increase, and saw significant improvement. Is that surprising? No.
It's exactly what AMD/Intel have historically done on their Tick generations.

I'm absolutely sure IPC and other improvements took place in M3, however when you increase xter density by close to 50%, jump to a newer Node, and increase clock speeds, it's rather loopy to simply explain it all as innate Apple R&D IPC improvement.
I'll stick with the general industry feeling that Apple has mostly maintained its quite nice improvements based more on one-shot Node jumps and clock increases than on some wizardry by its R&D gurus.
I'm on Apple now because I'm willing to pay the Apple-tax for the improvements it has over x86 and the less dreadful OS experience; however, I'm not simple enough to believe Apple itself is somehow driving these fantastic advances in ARM vs x86.
I think the proof will be in the pudding in M5 or M6 latest.
ARM, as a simpler ISA, has less baggage and whatnot to delete for gains, and I think the M1 was simply a masterpiece in showing up and beating x86. To get there, though, I think they've already reaped all the low-hanging fruit, as x86 has done for years.
As Nodes slow down and last longer, AMD/Intel will eventually get to or very close to node parity. This means Apple, no matter its size, is not going to have the massive Node advantages that AMD has similarly had over Intel for years.
Additionally, clock speeds and advances will still slow more, for everyone.
Apple will hit the same hard wall of physics as everyone else, and we might see 10% improvement become the norm gen-on-gen for Apple too.
 
Yes, must be. I don’t think Ars has a “cpu forum” — AnandTech is now just an archive,
Sorry, yes, AT is the general shorthand for AnandTech.
Their CPU forum is one of the most educated still surviving on the internet, filled with a lot of folks with foundry experience.
Still a lot of x86 and Nuvia fanboy-ism, but there are a couple of Apple threads that survive quite nicely.
 
Apple moving to Tiling (chiplets) for ASi could definitely improve performance in regards to their GPU, just throw more GPU chiplets into the mix...
  • 32-core CPU (24P/8E)
  • 512-core GPU
  • 128-core Neural Engine
  • 960GB ECC LPDDR5X RAM
  • 2.16TB/s UMA bandwidth
A 512+ core GPU is already here, just not the cores you want to see, ofc. But still, I love these pipe dream posts, when the max we'll see is like 1/4 of that.
 
As for Apple M series issues, all I know is what I've read and heard from folks in the industry and litho/fab Macheads. My admittedly limited understanding of the issue is that with that top tier talent loss, the M2 clock was goosed a bit, and for M3 the existing team yielded some improvements, but goosed the clocks quite a bit more.
This matches with a quick google: https://medium.com/@kellyshephard/m3-vs-m2-vs-m1-macbook-air-which-one-is-right-for-you-8e7c532e1de4
M1 to M2, Apple increased xter density by 25%, stayed on the same Node, and was that a massive improvement?
M2 to M3, Apple increased xter density again by 20%, switched from N5 to N3 with a significant clock-speed increase, and saw significant improvement. Is that surprising? No.
It's exactly what AMD/Intel have historically done on their Tick generations.

I'm absolutely sure IPC and other improvements took place in M3, however when you increase xter density by close to 50%, jump to a newer Node, and increase clock speeds, it's rather loopy to simply explain it all as innate Apple R&D IPC improvement.

It seems to me like you are equating improvements with IPC improvements. Given Apple's leadership in IPC and the extremely low power consumption of their CPUs, wouldn't it make sense for them to look for performance gains elsewhere? The bottom line is that performance is steadily increasing and every year brings a new series of tweaks and optimizations to the chips. I don't see how the narrative of their development or innovation speed slowing down holds up to scrutiny.
 
That's a fair point.
I think Apple has done close to a miracle with the M series and at the same time has used Node/Process improvements it has paid top dollar for from TSMC to keep up some remarkable advances.

Actual 'improvement' in systems can be looked at in different ways: sometimes actual Architectural improvement by Apple/Intel/AMD design engineers, other times Process improvements intra-Node that allow more speed, as Intel did with all their +++ derivatives, and sometimes just jumping onto a newer Node that gives 15-20% more speed and/or transistor density, whether or not at exactly ISO.
I can certainly be wrong, however in my view far too many people are attributing Apple's continued success primarily to some wizardry in Cupertino. I think the reality is that there is certainly some of that, but a lot of the rest is based on Apple paying for the lion's share of future node development at TSMC and getting exclusive, often primary use of that advance.
Those advances generally seem to be about 15%, IIRC, when looked at from an industry-standard ISO comparison.
AMD was doing this to Intel all through the 2020s, until Intel finally jumped on N3B and started being fairly competitive.

I think Apple is mining the heck out of ARM to get some of these increases, however it's also a much simpler ISA than x86. As it is simpler, I don't think it's going to be able to be mined for years and years as Intel and AMD have.
With Nodes set to continue to slow, process improvements will likewise slow, whether ARM or x86. Apple still has headway before hitting the 5-6GHz redline x86 has, so it can continue to gas it for M5/M6. And I'll eat my hat if Apple somehow finds a way to not see power increases just as everyone else in the Foundry world has.

And I wouldn't have bothered dumping x86, even nice Linux Mint, if I didn't think Apple was going to be dominant going forward.
 
I'll stick with the general industry feeling that Apple has mostly maintained its quite nice improvements based more on one-shot Node jumps and clock increases than on some wizardry by its R&D gurus.
What "general industry feeling" would that be? Random idiots pretending to be experts on internet forums are not the same thing as industry knowledge. If you don't have the background to tell for yourself whether an idea is genuinely from an industry insider, maybe you shouldn't uncritically accept it as received wisdom.

While you should be equally skeptical of me, I'll unpack a few of the wrongheaded ideas you've repeated. You can decide for yourself whether what I'm saying makes more sense... hopefully taking some time to think things through a little more carefully.

The first bad idea is that clock speed must be a trivial knob to turn, needing no real expertise. And from the PC overclocker perspective, sure, it's easy to see how such ideas look natural. They've been doing it for ages, after all. Goose the voltage, clock number go up.

But Apple's design team is under different constraints than overclocker armchair quarterbacks, or even Intel's and AMD's designers. Apple believes in making extremely efficient devices, so their chip designers aren't allowed to just go for clock speed at any price. The review you linked to 'prove' your point showed that power efficiency went up from M1 to M2 to M3, not just performance. Guess what cannot happen if you just boost clocks on the same design and call it a day?
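For intuition, here's a first-order sketch of the usual dynamic-power relation (P ≈ alpha·C·V²·f). The 15% clock and 10% voltage bumps are hypothetical numbers for illustration, not measurements from any M-series or x86 part:

```python
# First-order CMOS dynamic power: P ≈ alpha * C * V^2 * f.
# Hypothetical numbers, purely illustrative.
f_ratio = 1.15        # 15% higher clock on the same design
v_ratio = 1.10        # assume ~10% more voltage is needed to hold that clock
power_ratio = f_ratio * v_ratio ** 2
print(f"~{(power_ratio - 1) * 100:.0f}% more power for 15% more performance")
# -> ~39% more power, so perf/W gets worse. Efficiency cannot improve
#    if the only change is a higher clock on an unchanged design.
```

Higher clocks on an unchanged design generally demand more voltage, so power rises faster than performance and efficiency falls - yet the M-series got more efficient each generation.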

The second very wrongheaded idea is that Apple hasn't been changing anything other than clock frequency. The core is getting wider (M3 has eight integer units, M1 and M2 have six). Dispatch width is increasing - 8 in M1/2, 9 in M3, 10 in M4. Performance has been going up by ratios not predicted by clock frequency alone.

Finally, it's crazy that you seemingly believe only improvements in IPC demonstrate design team talent. This is just so far away from reality, it's hard to know where to begin. It's an incredibly simpleminded and reductive view of the field.
 
That's a fair point.
I think Apple has done close to a miracle with the M series and at the same time has used Node/Process improvements it has paid top dollar for from TSMC to keep up some remarkable advances.

Actual 'improvement' in systems can be looked at in different ways: sometimes actual Architectural improvement by Apple/Intel/AMD design engineers, other times Process improvements intra-Node that allow more speed, as Intel did with all their +++ derivatives, and sometimes just jumping onto a newer Node that gives 15-20% more speed and/or transistor density, whether or not at exactly ISO.
I can certainly be wrong, however in my view far too many people are attributing Apple's continued success primarily to some wizardry in Cupertino. I think the reality is that there is certainly some of that, but a lot of the rest is based on Apple paying for the lion's share of future node development at TSMC and getting exclusive, often primary use of that advance.
Those advances generally seem to be about 15%, IIRC, when looked at from an industry-standard ISO comparison.
AMD was doing this to Intel all through the 2020s, until Intel finally jumped on N3B and started being fairly competitive.

I think Apple is mining the heck out of ARM to get some of these increases, however it's also a much simpler ISA than x86. As it is simpler, I don't think it's going to be able to be mined for years and years as Intel and AMD have.
With Nodes set to continue to slow, process improvements will likewise slow, whether ARM or x86. Apple still has headway before hitting the 5-6GHz redline x86 has, so it can continue to gas it for M5/M6. And I'll eat my hat if Apple somehow finds a way to not see power increases just as everyone else in the Foundry world has.

And I wouldn't have bothered dumping x86, even nice Linux Mint, if I didn't think Apple was going to be dominant going forward.
Obviously the foundry node and ARM architecture are important - no one would disagree (well, maybe Qualcomm's lawyers for the latter part, but that's a separate issue). Heck, even Apple's choice of 16KB page sizes for its OS is helpful, because it helps enable larger L1 caches while maintaining a lower associativity and simpler design. And no, Apple's CPU designs aren't magic any more than TSMC's fabs are or ARM's instruction set is. Heck, Qualcomm and ARM themselves are getting better and better - the X925 and Oryon-L may be behind the A18 Pro but are still damn good and closer than they've ever been to an Apple core - certainly much better than AMD/Intel have managed. I even have Qualcomm's original Oryon core down below in comparison to the M2, which is probably their best comparison given the node (N4 vs N5P).
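To make the page-size point concrete: in a virtually indexed, physically tagged (VIPT) L1, the index bits have to come from within the page offset to avoid aliasing, so capacity is bounded by page size × associativity. A quick sketch of the arithmetic - the 128 KB / 8-way figures are the commonly reported numbers for Apple's recent P-core L1D, used here as an assumption:

```python
def max_vipt_l1_bytes(page_bytes: int, ways: int) -> int:
    """Largest alias-free VIPT L1: all index bits must fall within the page offset."""
    return page_bytes * ways

# With 4 KB pages (typical x86 OS), an 8-way VIPT L1 tops out at 32 KB ...
print(max_vipt_l1_bytes(4 * 1024, 8) // 1024, "KB")   # 32 KB
# ... while 16 KB pages allow 128 KB at the same 8-way associativity.
print(max_vipt_l1_bytes(16 * 1024, 8) // 1024, "KB")  # 128 KB
```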

I would also contend though that this entire discussion has largely ignored Apple's efficiency cores which have their own trajectory in terms of performance, clocks, and power and are in many ways even more impressive than their high-performance cousins (and Qualcomm/ARM's little cores are also finally catching up here too).

BUT overall I think you are conflating a few things together. Let's try to tease things apart. A new foundry node allows a chipmaker to run a design at faster speeds at lower power. A new node's higher transistor density can also help a chipmaker make a new design - though that's not technically necessary (e.g. Zen 2 and Zen 3 were on the same node, and Zen 3 was probably AMD's most impressive chip improvement partially as a result). Further, merely moving to a new node doesn't magically equilibrate things. ARM designs for Android for years would lag Apple by mere months on new process nodes, but even then never came close to matching their performance or efficiency in ST. But for a more modern, PC-oriented comparison, we can compare Zen 5 to M2 (Zen 5 is on a marginally better node than M2 - N4P vs N5P) and Lunar Lake to M3 (same node, N3B).

(All subsequent plots are thumbnails which can be expanded by clicking on them)

Here are Cinebench R24 results (data courtesy of NotebookCheck, with some differences - I subtract idle power out to try to get power under load only, the closest wall-power measurements can get to package power):


[Chart: Cinebench R24 performance vs. measured power for recent Apple, Intel, AMD, and Qualcomm chips]
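(For anyone wanting to replicate the idle-power adjustment, it's literally just load wall power minus idle wall power; the numbers below are made up for illustration, not NotebookCheck's readings.)

```python
# Hypothetical wall-power readings in watts -- illustrative only.
idle_watts = 7.5
load_watts = 42.0
load_only_watts = load_watts - idle_watts  # rough proxy for package power under load
print(f"~{load_only_watts:.1f} W attributed to the benchmark load")
```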


Intel being on the same node as Apple neither allowed it to catch up to Apple in performance nor in power consumption (a factor of the die area x clock speed). Don't get me wrong, Lunar Lake is a massive improvement over Meteor Lake; Meteor Lake was so awful it wouldn't fit on this chart. However, with Intel prioritizing ST efficiency at every step of the way, sure they did better than the (much bigger) AMD chip, but didn't even catch Qualcomm's slightly worse version of the M2 on N4, never mind Apple on N3B. Similarly, AMD's Zen 5 HX 370 is on a slightly better node (N4P) than the M2 Pro (N5P), and while it manages only slightly worse ST performance, it has vastly worse power consumption with its higher clocks to achieve that. The end result of AMD and Intel "catching up" with Apple in terms of node is 2-3x worse efficiency at lower performance.

Now unlike in R23 (which was technically native but not very well optimized for ARM), Apple does particularly well in the R24 benchmark. So let's look at Geekbench 6. And since the assertion is that Apple's development has "slowed down", let's compare Apple's progress from M1 (Nov 2020) to M4 (May 2024) versus AMD with Zen 3 (Nov 2020) to Zen 5 (Jul/Aug 2024). (As an aside, with the simultaneous release of the M1 and Zen 3, November of 2020 must've been a really bad month for morale in Intel's offices.) I'll use AMD's desktop chips to give them the best performance matchup and just look at clocks, since Zen 5 desktop on N4X uses quite a bit of power (also, measuring wall power on GB is harder given the nature of the benchmark).

Yes, some people go overboard with praising Apple, but your posts are hewing the other way too much. It isn't just access to the latest fabs; Apple really does have an excellent design that they have continued to iterate on, which I will further demonstrate below.

Now here we have a problem. We COULD just take the top line GB 6 score and divide by clock speed, or do the same with SPEC, or like AMD pick an assortment of various benchmarks to create a similar composite. But to be particularly blunt, that's nonsensical. Yes, I know that marketing departments do that all the time, but "IPC" as a weighted geometric mean of a bunch of incredibly disparate sub-benchmarks is meaningless. Also, it's supposed to be "instructions per clock", but ARM and x86 have different instructions, which means IPC isn't particularly comparable between them. Since what people really mean is performance-per-clock, and the term IPC has been so polluted from its original definition, I prefer to use the term ICP (ISO-clock performance). Also, when really trying to discuss performance, it's best to break out each sub-test separately. So let's do that!
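For clarity, the per-sub-test calculation is nothing fancier than this sketch. The scores are placeholders and the clocks are approximate max boost figures, treated here as assumptions rather than the actual database entries used for the charts:

```python
# "ICP" (ISO-clock performance) = sub-test score / max clock.
# Scores are made-up placeholders; clocks are approximate max boost figures.
max_clock_ghz = {"M1": 3.2, "M4": 4.4, "Zen3": 4.9, "Zen5": 5.7}
clang_score   = {"M1": 2300, "M4": 3900, "Zen3": 2500, "Zen5": 3600}

icp = {chip: clang_score[chip] / max_clock_ghz[chip] for chip in clang_score}

print(f"M4/M1 ICP ratio:     {icp['M4'] / icp['M1']:.2f}")
print(f"Zen5/Zen3 ICP ratio: {icp['Zen5'] / icp['Zen3']:.2f}")
print(f"M1/Zen5 ICP ratio:   {icp['M1'] / icp['Zen5']:.2f}")
```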

Now some caveats - Geekbench even in the best of circumstances can show huge variance in results. The best way to handle this I've seen is to use violin plots generated from bootstrapping Geekbench data. Unfortunately that requires more time than I was willing to put in; the end result in the plots would be extremely busy because of how many points of comparison I'm doing; the clock speeds of the CPU even in GB may not stay at max boost during the whole test; and AMD especially could be biased because the database will include overclockers. So I tried to choose representative samples from the M1, M4, 7950X, and 9950X that best represent data from GB 6's average database and/or reviewers. There's little I can do about the clock speed issue though, and even GB's clock records in the JSON file are slightly suspect, as it isn't clear when or how they are recorded. So for those charts that plot ICP rather than raw performance, I simply use max clock speed.
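For reference, the bootstrapping approach mentioned above looks roughly like this - synthetic scores, purely illustrative, not actual Geekbench database pulls:

```python
import random

# Synthetic GB 6 sub-test scores for one hypothetical chip -- illustrative only.
scores = [2310, 2295, 2360, 2248, 2402, 2330, 2280, 2375, 2290, 2415]

def bootstrap_means(samples, n_resamples=10_000):
    """Resample with replacement to estimate the spread of the mean score."""
    means = []
    for _ in range(n_resamples):
        resample = [random.choice(samples) for _ in samples]
        means.append(sum(resample) / len(resample))
    return sorted(means)

means = bootstrap_means(scores)
lo, hi = means[int(0.025 * len(means))], means[int(0.975 * len(means))]
print(f"mean ~ {sum(scores) / len(scores):.0f}, 95% bootstrap interval ~ [{lo:.0f}, {hi:.0f}]")
# A violin plot is essentially this resampled distribution drawn per chip and sub-test.
```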

[Chart: Geekbench 6 sub-test ISO-clock performance ratios - M4/M1, Zen 5/Zen 3, and M1/Zen 5]

First up, ISO-clock performance ratio of the M4/M1, Zen 5/Zen 3, and M1/Zen 5. Here we see that AMD has achieved marginally better ICP growth than Apple ... until you look at the grey bars and realize that with the exception of 3 benchmarks, HTML5, Object Detection (more on this in a bit), and Background Blur, the M1 still has better ISO-clock performance than the Zen 5. This is really what @leman was trying to get at. Apple's ICP advantage is so *massive* that their 4-year-old chip on N5 has better ICP than AMD's latest N4X chip. That said, M1 is obviously not performance competitive with Zen 5 because its clock speeds are too low. So that Apple has made architectural decisions that have improved ICP in some key areas while also allowing them to dramatically increase clocks by nearly 40%! - all while keeping ST power (mostly) under control - is very understandable (and yes, TSMC absolutely deserves credit here as well, but as we discussed above it isn't just the fab).

Here's a chart which showcases this more dramatically with the bar chart replaced with an XY scatterplot:

[Chart: XY scatterplot of the same ICP ratios, with a blowup excluding Object Detection]

Since Object Detection is such an outlier for both AMD and Apple due to the introduction of AVX-512/VNNI (in Zen 4, but Zen 5 made improvements especially for desktops with full AVX-512) and SME (technically Apple had AMX cores previously, but with the introduction of SME it could officially use them in the instruction set as opposed to the Apple-only Accelerate framework), I made a smaller blowup chart excluding it. The X-axis is the ICP ratio of M1/Zen 5 and the Y-axis is the iso-clock ratios of Zen 5/Zen 3 (green) and M4/M1 (blue). What we can see from this chart is that those areas where AMD has gotten better than, or at least closest to, the M1 are also the areas that Apple has worked hardest to improve - in some cases these were the weakest, least performant aspects of the M1 chip. For Object Detection, well ... the M1 had no SME, and yet it took the introduction of AVX-512 vectors more than doubling AMD's performance to beat the SME-less M1 by 13% (the M1 had more than double the ICP advantage over the Zen 3 here). Meanwhile, with SME in the M4, Apple just pulls away again.

Apple is prioritizing its gains where it needs them the most - again, this may make a weighted geomean ICP average look less flattering, but the details reveal a far richer story. Apple is also on the record that they have their own set of user data and benchmarks that they target, and these may not always correlate with SPEC/GB 6. Lastly on this point, I talk about this more below, but just because ICP doesn't change doesn't mean that there weren't architectural changes that improved performance.

Here's another way to look at the data. Instead of ICP ratios, you can look at ICP deltas instead - the difference in points per clock Apple achieved. Because Apple's M1 ICP is already so high, a middling ratio of improvement to the M4 might look less impressive ... until you realize that a smaller percentage of a bigger number can be very competitive with a larger percentage improvement of a much smaller number:

[Chart: per-sub-test ICP deltas (points per clock) - M4 vs. M1 and Zen 5 vs. Zen 3]

Here we see that, in units of clock-normalized performance, Apple's gains are competitive with AMD's.
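A quick worked example of that percentage-versus-delta point, with hypothetical numbers rather than the charted data:

```python
# Hypothetical points-per-GHz numbers, purely illustrative.
m1_icp   = 700.0
m4_icp   = m1_icp * 1.15      # a 15% gain on a high base
zen3_icp = 420.0
zen5_icp = zen3_icp * 1.25    # a 25% gain on a much lower base

print(f"Apple delta: {m4_icp - m1_icp:.0f} pts/GHz")      # 105 pts/GHz
print(f"AMD delta:   {zen5_icp - zen3_icp:.0f} pts/GHz")  # 105 pts/GHz
# The smaller percentage on the bigger base matches the larger percentage
# on the smaller base -- which is what the delta chart captures.
```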

Finally let's put this all together and look at raw, non-clock normalized ST performance improvements of Apple's baseline core that goes into passively cooled tablets compared to AMD's top of the line, fastest clocked desktop processor (clocks charted as well at the end):

[Chart: raw Geekbench 6 sub-test performance ratios - M1/Zen 3 (blue) and M4/Zen 5 (green), with clock-speed ratios at the end]

Blue is M1/Zen 3 and green is M4/Zen 5, so chips released at the same time. Anything above 1, Apple's chip outperforms AMD's; anything below, AMD's outperforms Apple's. The final bars are just clock speed ratios, so as expected AMD's chips are clocked much faster than Apple's, BUT the M4 is closer in clocks to the 9950X than the M1 was to the 7950X. Apple wins more often than it loses, with nothing magical except that it's always doing so with lower clocks, even if increases in clocks have been a contributor to those performance improvements (which again still necessitated architectural changes). So yes, Zen 5 has slightly closed the ICP gap with M4 compared to Zen 3 and M1, but that's because they had so far to go while Apple prioritized designs to increase clocks, one of its major weaknesses (although very high clocks like AMD's/Intel's are a weakness as well, no question).

The key takeaway here is that Apple has ensured that in each and every test bar one (Text Processing; Clang and a couple of others are close), the M4/Zen 5 ratio (be it above or below 1) has improved on the M1/Zen 3 ratio. The green bar is always above the blue bar. And they've done it without busting their power budget. You don't get that from just fabs or even minor architectural improvements.

Some additional misconceptions:

I think Apple is mining the heck out of ARM to get some of these increases, however it's also a much simpler ISA than x86. As it is simpler, I don't think it's going to be able to be mined for years and years as Intel and AMD have.
I'll be honest, I'll give it a shot, but this is so wrong I'm not really sure where to begin. ARMv8 has a much simpler structure than x86-64 (i.e. all instructions have the same length), which allows it to be decoded more easily (a major reason why Apple and other ARM cores can go so wide more easily), and there is less historical baggage in the ARM v8/9 architecture (no 32-bit or lower instruction set hanging around). But the instruction set itself is no less "complex", and ARMv8/9 can attain similar if not greater levels of code density than x86. ARMv8 may be RISC, but it's pragmatic. It doesn't hew to classical RISC paradigms when performance would suffer. Further, the v8/9 instruction set is ever evolving, just as x86 is as well (although Intel pared back just how ambitiously they would evolve it recently, it is changing).
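As a toy illustration of why fixed-length encoding matters for wide decode - this is a cartoon of the boundary-finding problem, not a model of any real decoder:

```python
# ARMv8: every instruction is 4 bytes, so the offsets of the next N instructions
# are known immediately and can all be handed to decoders in parallel.
def arm_boundaries(start: int, n: int) -> list[int]:
    return [start + 4 * i for i in range(n)]

# x86-64: instruction length (1-15 bytes) is only known after partially decoding
# each instruction, so boundary finding is inherently serial (real decoders use
# predecode/length-marking tricks to mitigate this).
def x86_boundaries(lengths: list[int], start: int = 0) -> list[int]:
    offsets, pos = [], start
    for length in lengths:  # each step depends on the previous instruction's length
        offsets.append(pos)
        pos += length
    return offsets

print(arm_boundaries(0, 8))                      # [0, 4, 8, ..., 28]
print(x86_boundaries([3, 5, 2, 7, 1, 4, 6, 2]))  # hypothetical instruction lengths
```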

====

As @mr_roboto alluded to in his post above, you often can't just increase clock speed and maintain the same IPC/ICP - think about interactions with the memory hierarchy, for one. Cache accesses, or worse, cache misses and thus memory accesses are not necessarily tied to clock speed, which means that simply increasing clocks with no other changes to the architecture will, for any meaningful program, eventually lower the performance per clock, as the CPU will simply be waiting additional cycles for data. We know Apple made small changes to the performance core of the M2 and much larger changes to the architecture of the performance cores of the M3 and M4, but they also massively increased clocks. Much more so than Intel and even AMD. This huge increase in clock speed has likely eaten much of the ICP gains that would've happened otherwise (see Horizon Detection, where for the particular M1 and M4 chosen the M4 has very slightly lower ICP than the M1). On top of that we have my caveat above that simply using "max clocks", as many do, including myself above, may underestimate the true ICP of any particular chip.
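To make the memory-hierarchy point concrete, with round assumed numbers rather than measurements of any specific chip:

```python
# A fixed DRAM latency costs more core cycles as the clock rises, so per-clock
# performance falls unless caches and prefetchers improve alongside the clock.
dram_latency_ns = 100.0        # assumed round number, not a measurement
for clock_ghz in (3.2, 4.4):   # roughly M1-like vs M4-like max clocks
    stall_cycles = dram_latency_ns * clock_ghz
    print(f"{clock_ghz} GHz: a full miss costs ~{stall_cycles:.0f} cycles")
# ~320 cycles at 3.2 GHz vs ~440 cycles at 4.4 GHz for the same miss.
```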

====

Lastly, throughout all of your posts there seems to be this notion that the M1 was something magical compared to what came before and a feat that Apple has never been able to duplicate since. Don't get me wrong, Firestorm was a great core (still is). But when placed in its context, M1 and Firestorm were simply the first time Apple could showcase its massive efficiency advantage while also competing in performance with AMD and Intel. It was a natural progression in the evolution of Apple's design, which Apple has continued to iterate on. The M1 may have come as a massive shock to the PC world, who had largely ignored what Apple was doing over the last decade, but it was much less so to those who had paid attention. Since you're on the AT forums, you must've read Andrei's preview of the M1 where he laid this out (and he was much more explicit in other comments he made to this effect). I hope I've also done a good job of illustrating that while Apple has seemingly slowed down in some respects, this has been vastly overstated, and the M1 was not some outlier either in its antecedents or its descendants.

================

Now there is some possibility that Apple will hit a wall with its designs. Eventually going wider and wider may yield diminishing returns as most code will simply lack the instruction level parallelism to justify it. Apple may also suffer a design crisis of some other kind. I can't know the future. But again I hope that I've demonstrated that, so far, the "great slowdown" due to "brain drain" is more myth than reality - nor can Apple be accused of resting on its laurels.
 
Amazing post, very informative.

I would also like to add to the discussion, which focuses mainly on CPU performance and efficiency, how impressive the GPU progress in the M-series has been. Have you seen any iGPU from AMD, Intel, or Qualcomm provide this kind of performance at these power levels as, for example, the M4 Max? I mean, if memory serves me, even M3 Pro GPU beats 890M in TFLOPS (which of course is only part of the GPU performance story, but it represents a good indicator of raw processing power I assume).
 
I mean, if memory serves me, even M3 Pro GPU beats 890M in TFLOPS (which of course is only part of the GPU performance story, but it represents a good indicator of raw processing power I assume).

Isn't 890M more of a competitor to base M-series rather than Mx Pro? It has only 1280 shading units, which is similar to 1280 found in M3 and M4. AMD clocks its GPU almost twice as high though.
 
Isn't 890M more of a competitor to base M-series rather than Mx Pro? It has only 1280 shading units, which is similar to 1280 found in M3 and M4. AMD clocks its GPU almost twice as high though.
Agree, but as far as I know, this is the best iGPU AMD currently offers, thus this is a direct apples-to-apples 😀 comparison of Mx GPU (which is part of SoC, thus an iGPU) with the relevant SotA, as opposed to a comparison with a discrete GPU.

Of course, another way to view Mx GPU progress is how much Apple has closed the gap with, let's say, 3080, while using a fraction of power. For me, this level of performance is impressive.
 
  • Like
Reactions: crazy dave
Agreed, but as far as I know this is the best iGPU AMD currently offers, so it's a direct apples-to-apples 😀 comparison of the Mx GPU (which is part of the SoC, and thus an iGPU) with the relevant SotA, as opposed to a comparison with a discrete GPU.

Of course, another way to view Mx GPU progress is how much Apple has closed the gap with, say, a 3080, while using a fraction of the power. For me, this level of performance is impressive.

I think it would be great for us consumers if Apple could deliver a usable 5-6 TFLOPS in the base M-series Macs.
 
Yes, I know that marketing departments do that all the time, but "IPC" as a weighted geometric mean of a bunch of incredibly disparate sub-benchmarks is meaningless. Also, it's supposed to be "instructions per clock", but ARM and x86 have different instructions, which means IPC isn't particularly comparable between them. Since what people really mean is performance per clock, and the term IPC has been so polluted from its original definition, I prefer to use the term ICP (iso-clock performance).
Hmmmm ... probably early-morning/late-night tired thoughts, but since "iso-clock performance" is already taken as terminology, and it doesn't quite aptly describe what I'm going for here (it means same-clock performance), a more direct analog to IPC would probably just be PPC (performance per clock - yes, I know PPC is already taken too) or CNP (clock-normalized performance). But the traditional IPC acronym (instructions per clock) really is also the wrong term for what people usually compare, especially across architectures with obviously different instruction sets. So maybe in the future I'll use PPC or CNP.
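In case the distinction matters to anyone following along, the arithmetic behind what I'm calling clock-normalized performance is just dividing a score by the clock it was achieved at - the numbers below are placeholders purely to show the calculation (and the "max clocks" caveat above still applies):

```python
# Clock-normalized performance: benchmark score divided by clock (placeholder numbers)
def clock_normalized_perf(score: float, clock_ghz: float) -> float:
    return score / clock_ghz

chips = {
    "Chip A @ 3.2 GHz": (2350, 3.2),   # hypothetical single-thread score and max clock
    "Chip B @ 4.4 GHz": (3800, 4.4),
}
for name, (score, clock) in chips.items():
    print(f"{name}: {clock_normalized_perf(score, clock):.0f} points/GHz")
```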

Amazing post, very informative.

I would also like to add to the discussion, which focuses mainly on CPU performance and efficiency, how impressive the GPU progress in the M-series has been. Have you seen any iGPU from AMD, Intel, or Qualcomm that provides this kind of performance at these power levels - the M4 Max, for example? I mean, if memory serves, even the M3 Pro GPU beats the 890M in TFLOPS (which of course is only part of the GPU performance story, but I assume it's a good indicator of raw processing power).

Thanks so much! Indeed, Apple's GPUs are impressive. For iGPUs that compare with the Max, I think we'll have to wait and see what next year brings. Strix Halo is going to have a beefy iGPU (though I think slightly less so than the Max, but for many games that won't matter since, you know, the vast majority are Windows-native and so forth). The Qualcomm Elite V2 is rumored to split the product line in two: one base-M-like and one more Pro/Max-like with a much improved GPU over what shipped in V1. And Nvidia is rumored to be releasing its own ARM SOC for consumer PCs later in the year, but no word on its size as far as I know. They have a line of discrete mobile GPUs, the Max-Q line, that are structured more similarly to Apple's M-series GPUs. That design, integrated directly into an SOC, could be quite good.

What will also be interesting: when Apple announced the Pro and Max line of chips for the M1, pundits fell over themselves to say that only Apple could afford to build an SOC this big and make the economics work - that for AMD, Intel, etc., this strategy would never be profitable because they can only sell the SOC, not the whole device. Since your typical chip maker has to make a profit selling the SOC and the device maker has to make a profit on the device, the end cost simply wouldn't work for the average non-Mac PC consumer, who cares a lot less than Mac users about battery life and quiet operation. Since multiple chip makers are going for it anyway, we'll see if that ends up being true. I'm sure they've all got business plans that claim it'll work, but plans don't always come true. But then again ... pundit predictions about what can never be have a somewhat worse track record.
 
So in only 10 years we will have negative feature sizes. Macs will produce more power than they consume.

Seriously. Each new process requires a new and MUCH more expensive production line, and at some point the cost of replacing the line is no longer worth it. Over the last few years Apple has been getting a free ride: they just wait while chips get smaller. Apple does not need to reinvent much. They are basically running 1970s-vintage BSD Unix on a smaller and cheaper computer. The overall idea has been unchanged for 50+ years. Maybe someday AI will push us away from that model. But on the other hand, the same OS has survived and is older than many people reading this.

Feature size reduction can only continue for so long. In reality "zero" is a hard limit and atoms have finite size.
So much wrong here...

There is quite a lot of runway still known to be ahead of us. Yes, it's getting more expensive. But no, it's not going to end in the next few years. When it does - and you're right that ultimately silicon will be done - there will be further advances from using other materials. Someday we will reach an ultimate limit of our materials science and physics ("computronium", etc.), but we're a long way away from that. And who knows what will be possible at that point?

BTW, to claim that Apple's been getting a free ride is ludicrous. Without their investment, we might actually be getting close to done (though that argument is relatively weak - someone else would be making that money; still, the concentration into the hands of one interested party is significant, and probably helping push us along faster than we'd otherwise be going).

Saying that today's OSes are "basically" 1970 Unix is ridiculous. Sure, some of the concepts are the same. They were some very strong ideas that have carried us quite far. The implementations are in most ways completely different. Memory allocators and management, filesystems (layered!), the network stack (didn't exist in the 70s), thread management... I'm barely scratching the surface. It's all different. Plus, Apple has all that Mach stuff.

And then there's the "does not need to reinvent much". What arrant nonsense. They yanked the entire industry out of Intel's grubby paws and pointed it in a new direction, with an architecture focused on energy efficiency that *also* delivers unmatched performance.

What is the point of your confused message, anyway? Leaving aside all the errors, are you seriously trying to explain that there is a limit to progress? That concept is ancient and uninteresting. It is probably true, more or less, but that's like excitedly explaining that the heat death of the universe makes all this progress moot. Sure, take a long enough view, and that's true. So what?
 
Hmmmm ... probably early-morning/late-night tired thoughts, but since "iso-clock performance" is already taken as terminology, and it doesn't quite aptly describe what I'm going for here (it means same-clock performance), a more direct analog to IPC would probably just be PPC (performance per clock - yes, I know PPC is already taken too) or CNP (clock-normalized performance). But the traditional IPC acronym (instructions per clock) really is also the wrong term for what people usually compare, especially across architectures with obviously different instruction sets. So maybe in the future I'll use PPC or CNP.



Thanks so much! Indeed, Apple's GPUs are impressive. For iGPUs that compare with the Max, I think we'll have to wait and see what next year brings. Strix Halo is going to have a beefy iGPU (though I think slightly less so than the Max, but for many games that won't matter since, you know, the vast majority are Windows-native and so forth). The Qualcomm Elite V2 is rumored to split the product line in two: one base-M-like and one more Pro/Max-like with a much improved GPU over what shipped in V1. And Nvidia is rumored to be releasing its own ARM SOC for consumer PCs later in the year, but no word on its size as far as I know. They have a line of discrete mobile GPUs, the Max-Q line, that are structured more similarly to Apple's M-series GPUs. That design, integrated directly into an SOC, could be quite good.

What will also be interesting: when Apple announced the Pro and Max line of chips for the M1, pundits fell over themselves to say that only Apple could afford to build an SOC this big and make the economics work - that for AMD, Intel, etc., this strategy would never be profitable because they can only sell the SOC, not the whole device. Since your typical chip maker has to make a profit selling the SOC and the device maker has to make a profit on the device, the end cost simply wouldn't work for the average non-Mac PC consumer, who cares a lot less than Mac users about battery life and quiet operation. Since multiple chip makers are going for it anyway, we'll see if that ends up being true. I'm sure they've all got business plans that claim it'll work, but plans don't always come true. But then again ... pundit predictions about what can never be have a somewhat worse track record.
And here come the Strix Halo chips:


They are indeed comparing it to the M4 Pro GPU rather than the Maxes, and the most impressive results are against the cut-down M4 Pro; V-Ray is a bit of an outlier. I wonder how optimized it is for Macs relative to Blender/Redshift (Cinebench). Both V-Ray and Corona are made by the same developer, I believe.

EDIT: I'm not sure if those are all supposed to be GPU tests or if some of them mix CPU and GPU tests together in the same chart. Confusing. EDIT 2: actually, for the Mac comparisons they might all be CPU, NOT GPU.

V-Ray's default is CPU (and according to the link below it doesn't even work on the Mac's GPU), and a lot of people still run Blender Classroom on the CPU. nT is usually the way multithreaded CPU results are denoted in Cinebench. And Corona is purely CPU from what I can tell.


And V-Ray's requirement of SSE2 compatibility definitely makes it sound non-native, or at least non-optimized.
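If anyone with the apps installed wants to check directly, a quick way on macOS to see whether a given binary actually ships arm64 code, or is Intel-only and thus running under Rosetta 2, is to ask the system `file` tool - the path below is purely a hypothetical example, not V-Ray's real install location:

```python
# Check which architectures a macOS binary contains (path is a made-up example)
import subprocess

def macho_archs(path: str) -> str:
    """Return the `file` tool's description of the binary's architecture slices."""
    return subprocess.run(["file", path], capture_output=True, text=True).stdout

print(macho_archs("/Applications/SomeRenderer.app/Contents/MacOS/SomeRenderer"))
# "arm64" in the output means a native slice exists; only "x86_64" means Rosetta 2.
```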
 
Last edited:
  • Like
Reactions: novagamer
And here comes the Strix Halo chips:


They are indeed comparing it to the M4 Pro GPU rather than the Maxes, and the most impressive results are against the cut-down M4 Pro; V-Ray is a bit of an outlier. I wonder how optimized it is for Macs relative to Blender/Redshift (Cinebench). Both V-Ray and Corona are made by the same developer, I believe.

EDIT: I'm not sure if those are all supposed to be GPU tests or if some of them mix CPU and GPU tests together in the same chart. Confusing.
V-Ray and Corona for AS are CPU renderers.
 
  • Like
Reactions: crazy dave
V-Ray and Corona for AS are CPU renderers.
Yeah, I figured that out. In fact, they are ALL CPU benchmarks for the Mac as far as I can tell. I had naively thought that, with the new integrated GPU, that's what they would've shown off even when comparing against the Mac. Do you know how well optimized V-Ray and Corona are for AS? The link I found in my edit to the post above talks about Macs, including AS Macs, needing SSE compatibility for V-Ray, which makes me think it isn't very AS-optimized and may not even be native?

======

Fascinating that the GPU is on the I/O die. Again, I wonder what process that is using, if true, since on the desktop at least the I/O die was on the older N6 process. In fact, I don't know if the process node for any of the dies in the Strix Halo or Fire Range processors has been confirmed yet.

EDIT: the Strix Halo SOC die (with the I/O, NPU, and GPU) appears to be 5nm - unsure which specific 5nm node:

 
Last edited:
Yeah, I figured that out. In fact, they are ALL CPU benchmarks for the Mac as far as I can tell. I had naively thought that, with the new integrated GPU, that's what they would've shown off even when comparing against the Mac. Do you know how well optimized V-Ray and Corona are for AS? The link I found in my edit to the post above talks about Macs, including AS Macs, needing SSE compatibility for V-Ray, which makes me think it isn't very AS-optimized and may not even be native?

======

Fascinating that the GPU is on the I/O die. Again, I wonder what process that is using, if true, since on the desktop at least the I/O die was on the older N6 process. In fact, I don't know if the process node for any of the dies in the Strix Halo or Fire Range processors has been confirmed yet.

EDIT: the Strix Halo SOC die (with the I/O, NPU, and GPU) appears to be 5nm - unsure which specific 5nm node:

Maybe someone has time to look into whether they used the current versions of the V-Ray and Corona renderer Benchmark apps which are here:



(The full renderer apps include support for distributed/network rendering under macOS out to Nvidia GPUs in Windows and Linux workstations, so maybe that's the reason for the SSE stuff.)
 
  • Like
Reactions: crazy dave
Maybe someone has time to look into whether they used the current versions of the V-Ray and Corona renderer Benchmark apps which are here:



(The full renderer apps include support for distributed/network rendering under macOS out to Nvidia GPUs in Windows and Linux workstations, so maybe that's the reason for the SSE stuff.)
Based on what I could find online, I think Corona and V-Ray are AS native. Not sure how well optimized they are for it, though.


 
Looks like the Mac Studio has competition from Nvidia Project DIGITS, with a 20-core Grace ARM CPU, Blackwell GPU, 128GB of LPDDR5X unified memory, ConnectX fabric, 4TB of storage, running Linux, etc., for $3,000.

https://www.nvidia.com/en-us/project-digits/
https://nvidianews.nvidia.com/news/...ry-desk-and-at-every-ai-developers-fingertips
[Image: Nvidia Project DIGITS exploded view]
Good, the competition will only benefit Mac users. If Apple really did move their best CPU team to work on servers for a while, as has been rumored, they'd better get hiring some more people as replacements.

"In addition, using NVIDIA ConnectX® networking, two Project DIGITS AI supercomputers can be linked to run up to 405-billion-parameter models." $6,000 for that is not too bad at all.
 
Good, the competition will only benefit Mac users. If Apple really did move their best CPU team to work on servers for a while, as has been rumored, they'd better get hiring some more people as replacements.

"In addition, using NVIDIA ConnectX® networking, two Project DIGITS AI supercomputers can be linked to run up to 405-billion-parameter models." $6,000 for that is not too bad at all.
Not quite sure how big exactly the GPU is from the "1 Petaflop of FP4 compute" figure. I know they're marketing it for AI purposes, but I'd be very curious about its standard FP32 TFLOPS with that much RAM.
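Here's the kind of rough conversion I'm guessing at, with two big unconfirmed assumptions baked in: that the 1 petaflop figure includes 2:4 sparsity, and that throughput halves with each doubling of precision (which describes tensor-core scaling, not necessarily the plain FP32 shader rate):

```python
# Speculative conversion from "1 PFLOP FP4" to dense FP32 (both assumptions unconfirmed)
fp4_sparse_pflops = 1.0
fp4_dense = fp4_sparse_pflops / 2    # strip the assumed 2:4 sparsity factor
fp32_dense = fp4_dense / 8           # FP4 -> FP8 -> FP16 -> FP32, halving each step
print(f"Implied dense FP32: ~{fp32_dense * 1000:.0f} TFLOPS")
```

If either assumption is wrong the real number could easily differ, but it gives a sense of the ballpark.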

Hopefully more to come from Nvidia this year ... the CPU on this thing won't be terribly competitive, but the GPU could be ... and it could point to more exciting things in the long-rumored standard desktop/laptop ARM SOC Nvidia is said to be shipping later this year.*

(As for the Apple server chip stuff, my strong suspicion - perhaps wrong of course - is that there will continue to be significant overlap between the chips Apple uses internally for its servers and those it sells as Ultras to its customers)

*Edit: the CPU may be better than I thought. Originally I had read it would use Neoverse V2 cores like the big Blackwell superchip. However, apparently it will be 10 Cortex-X925 and 10 Cortex-A725 cores. Those should be much, much better!


I don't know if they will be competitive with the M4 Max, certainly not in ST unless clocked very high, and it isn't clear to me that they can be. According to AnandTech, the max clock speed may be 3.8 GHz:


The 3 nm process also enables Arm to push out higher clock speeds on the Cortex X925 core, up to 3.8 GHz, to be exact.

That may not be a hard limit, and even so, it's much more exciting than I had originally thought! The ST performance will be okay, and with 20 cores it should have decent multicore performance, especially since the focus should be on the GPU. Still, though, Nvidia and everyone else making the DIGITS announcement is being a little too coy with the GPU's specifications for my liking. Hopefully they're good.
 
Last edited: