I think you (and that guy on twitter someone else mentioned) don't seem to understand that CPU features have to be explicitly coded for. If you ship a binary containing zero SME instructions, and you never call a system API which uses SME on your behalf, you get nothing out of SME. And on the other side of things, if you ship a binary which uses SME instructions and users attempt to run it on a processor which lacks them, they'll experience your program crashing because it tried to execute an illegal instruction.
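To make the point above concrete, here is a minimal sketch of runtime feature dispatch: the program asks the OS whether a feature exists and only then selects the code path that uses it. The sysctl key `hw.optional.arm.FEAT_SME` is an assumption about what Apple reports on SME-capable systems, and `matmul_sme_stub` is a hypothetical stand-in for a real SME kernel (pure Python cannot emit SME instructions).

```python
import platform
import subprocess


def cpu_has_feature(sysctl_key: str) -> bool:
    """Return True only if `sysctl -n <key>` succeeds and prints 1."""
    if platform.system() != "Darwin":
        return False  # this sketch only knows how to ask macOS/iOS-style kernels
    try:
        out = subprocess.run(
            ["sysctl", "-n", sysctl_key],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return False  # unknown key or no sysctl binary: treat as "not present"
    return out.stdout.strip() == "1"


def matmul_generic(a, b):
    """Portable path: runs on any CPU."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matmul_sme_stub(a, b):
    """Hypothetical stand-in for a kernel built with SME instructions."""
    return matmul_generic(a, b)


def select_matmul(has_sme: bool):
    """Pick the kernel once, up front, based on what the CPU reports."""
    return matmul_sme_stub if has_sme else matmul_generic


# Assumed key name; on hardware without SME this falls back to the generic path.
matmul = select_matmul(cpu_has_feature("hw.optional.arm.FEAT_SME"))
print(matmul.__name__)
```

A binary that skips the check and calls the SME path unconditionally is the crash case described above; a binary that never contains an SME path gains nothing from the hardware.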


Come on, can you just look at the change log?

>
  • Introduce support for Arm Scalable Matrix Extensions (SME) instructions. Geekbench 6.3 includes SME implementations of the matrix multiplication kernels used by the Geekbench 6 machine learning workloads.
 
The "Next-generation ML accelerators" is listed with the cores, so it looks like it belongs to the core cluster to me just like AMX, or it is referring to AMX. What do you think?

I think at this time we can be certain this means new AMX units with SME/streaming mode SVE support. Better late than never :)


This Twitter user seems to be saying the same thing as speculated by @Gnattu

That it’s really the SME that’s helping here. Granted this person is comparing M3 Max with fans to fanless M4 and adjusting for frequency to say about 3% IPC increase.

That sounds like a conservative take. They are comparing a preliminary M4 benchmark in a 5mm thick iPad to a 1% top score of an actively cooled M3 Max. Of course such comparison would penalize the M4. If we look at the typical high scores of the M3 Air instead we consistently see 10-20% improvements in pretty much every subtest (Object Detection notwithstanding). The IPC improvement is there, and it's quite sizable.

It really looks like the M4 CPU is the "real deal" while M3 CPU is more of an alpha version.
 
Uh, unpacking the weights IS free!

There are at least two ways in which weights are "packed" (omit zero weights, and quantization). They are described here:

Later versions of the ANE include a few more ways to compress weights (mostly relevant to vision, but maybe also applicable to audio) based on symmetries in the weights (eg mirror symmetry, or rotational symmetry).

Could people please try to stick to some sort of confirmed reality when making claims in these threads.
Unpacking quantised weights IS NOT FREE. You can check it using coreml-quant and compare it to native weights. Of course, I'm not saying it is the main bottleneck, as it can be as low as an x-10% performance drop depending on the method of compression and the coprocessor used.

But yeah, try to stick to confirmed reality while sending patents which don't use the most popular quantisation methods, like a true loser xD
 
Agreed that comparison isn’t totally fair. It does seem like most of the ST improvement comes from:

- SME
- faster clocks (N3E process)

So while CPU and likely GPU arch haven’t changed much M3->M4, it’s an impressive bump when you factor in added E cores and supposed ANE improvements.

Also makes sense when you consider timeline:

- M3 co-developed with A17 Pro’s CPU and GPU + A16’s ANE all on N3B.
- M4 developed on N3E with essentially M3/A17’s CPU and GPU but backend design allowed for faster clocks and area shrink to squeeze in 2 more E cores. Use A17’s ANE. Add new display controller for new iPad Pro.
 
The clock increase is only around 7%. I do not think it sufficiently explains the improvements we are seeing. HTML5, PDF Rendering and other tests see 10-20% improvements and those do not benefit from SME. I have a suspicion that the M3 cores had a problem that prevented them from properly utilizing the wider integer backend, and that M4 fixes that.
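As a rough check on that arithmetic, here is a small sketch that backs out the implied per-clock gain from the figures in this post (a clock increase of about 7% and subtest gains of 10-20%). It assumes score scales as IPC times clock, which is a simplification; `implied_ipc_gain` is just an illustrative helper name.

```python
def implied_ipc_gain(score_ratio: float, clock_ratio: float) -> float:
    """Per-clock gain left over after dividing out the clock increase.

    Assumes score is proportional to IPC * clock, so IPC ratio = score ratio / clock ratio.
    """
    return score_ratio / clock_ratio - 1.0


CLOCK_RATIO = 1.07  # "only around 7%" faster clocks, from the post above

for score_gain in (0.10, 0.20):  # the 10-20% subtest improvements quoted above
    ipc = implied_ipc_gain(1.0 + score_gain, CLOCK_RATIO)
    print(f"score +{score_gain:.0%} at clock +7% -> implied IPC {ipc:+.1%}")
```

Under that assumption a 10% score gain leaves only a few percent of per-clock improvement, while a 20% gain leaves a double-digit one, so the clock bump alone does not cover the upper end of the range.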
 
By the way, what I find most hilarious is that Apple's passively cooled 5mm thin tablet has a faster CPU core than any desktop or server CPU currently shipping, including Intel's latest enthusiast class processors that draw 50-60 watts in single-core operation. That is an insane jump forward for Apple and a real slap in the face for the competitors.

P.S. And what is even more funny is that Apple now has user-accessible 512-bit vector support in the iPad (and soon presumably on the iPhone), while Intel is removing them from their consumer CPUs because they are too expensive...
 
By the way, what I find most hilarious is that Apple's passively cooled 5mm thin tablet has a faster CPU core than any desktop or server CPU currently shipping, including Intel's latest enthusiast class processors that draw 50-60 watts in single-core operation. That is an insane jump forward for Apple and a real slap in the face for the competitors.
I couldn’t believe my eyes when I saw that 253 Watt max multi-core spec for Intel’s flagship chip. And that will be reined in to 188 Watts soon too.

It just seems bizarre to me that a chip sold to consumers, even enthusiast consumers, has to be reined in severely to get down to 188 Watts.
 
By the way, what I find most hilarious is that Apple's passively cooled 5mm thin tablet has a faster CPU core than any desktop or server CPU currently shipping, including Intel's latest enthusiast class processors that draw 50-60 watts in single-core operation. That is an insane jump forward for Apple and a real slap in the face for the competitors.

P.S. And what is even more funny is that Apple now has user-accessible 512-bit vector support in the iPad (and soon presumably on the iPhone), while Intel is removing them from their consumer CPUs because they are too expensive...

Now one has to wonder whether Qualcomm's self-professed Apple Silicon "killer" will stack up to M4, or are they all talk no action?

I couldn’t believe my eyes when I saw that 253 Watt max multi-core spec for Intel’s flagship chip. And that will be reined in to 188 Watts soon too.

It just seems bizarre to me that a chip sold to consumers, even enthusiast consumers, has to be reined in severely to get down to 188 Watts.

I can't decide what's worse - Intel having to rein in their own wattages, or releasing products that effectively have no wattage limits at all depending on how the motherboard manufacturers configured their boards.
 
Now one has to wonder whether Qualcomm's self-professed Apple Silicon "killer" will stack up to M4, or are they all talk no action?
Seriously? We just did this a couple weeks ago. QC's new chip doesn't even stack up to the M3. It loses badly in SC, and depending on the power envelope, loses to either the M3 or the M3 Pro in MC. The only way it can win is if they price it at or lower than M3 - if it comes in a few hundred $ less than the M3 Air, it will certainly find a warm welcome. Even priced the same as the Air, if it comes in a similar package, it could do well in the Windows world (though it will lose in perf to the M3 Air in all ways, in that configuration).

I don't expect this, though. Previous Windows ARM products were terribly overpriced.
 
Wow the M3 air was completely shafted. Now it's another 1+ years for it to see the M4
Almost no chance of that.

My bet is Studio and maybe Pro M4 desktops at WWDC. Pro laptops at WWDC or July (yes July would be atypical). Air and Mini July or October/November.

Of course they could decide to be aggressive about getting off N3B, in which case you could actually see *everything* at WWDC. I imagine it depends a lot on the ramp for N3E.
 
It would be unprecedented for Apple to release new M4 MacBook Airs at WWDC. I got my first day order M3 MBA on March 11, 2024. That was less than 60 days ago. That would be less than 100 days at WWDC. I'm nearly certain that that's not going to happen. I do believe that it is possible that we will see mini, studio, and mac pro at WWDC though.
 
It would be unprecedented for Apple to release new M4 MacBook Airs at WWDC.
I agree, and I would bet against it. But not a lot! The M4 shipping in the new iPad is unprecedented (or as nearly as the "unprecedented" you mentioned) in at least two different ways.

My guess is we'll see it July or August, but any time through November is plausible.
 
Now one has to wonder whether Qualcomm's self-professed Apple Silicon "killer" will stack up to M4, or are they all talk no action?

I think Intel is about to have a bad, no good, terrible month ahead of them. Microsoft, Qualcomm and Asus have an event planned May 20th to unveil what is likely the new Snapdragon X laptops. Dell has one on tap. Nvidia announces earnings May 22. I'm guessing Apple shows M4 Ultras in the Studio and Mac Pro on June 10th with enough neural processing to shut the press up for a while.

M4 isn't of much interest to anyone watching Qualcomm right now other than as an indication of what Qualcomm might eventually be able to do as they further refine their product. Qualcomm compares to Apple Silicon for headlines, not because they're competing with each other.
 
I agree, and I would bet against it. But not a lot! The M4 shipping in the new iPad is unprecedented (or as nearly as the "unprecedented" you mentioned) in at least two different ways.

My guess is we'll see it July or August, but any time through November is plausible.
Or they skip it. Apple likely has a commitment to TSMC to consume a certain number of wafers. They might just keep them in the Air for the rest of the year, maybe update the iPad Air to M3 sometime in the future, and go to M5 when it's here.

It sounds like M4 was needed to drive the iPad tandem display, and maybe too late to get that requirement into M3. We may never see a laptop based on M4 Vanilla. M4 Pro might add NE and GPU to further press the AI point.
 

Come on, can you just look at the change log?

>
  • Introduce support for Arm Scalable Matrix Extensions (SME) instructions. Geekbench 6.3 includes SME implementations of the matrix multiplication kernels used by the Geekbench 6 machine learning workloads.
Fair enough, but I think it's reasonably clear I spent quite a bit of time looking at various sources, I just didn't realize that GB6 had a recent update.

Based on various social media postings it appears that Xcode quietly gained SME support sometime within the last year. I still wonder how Primate Labs tested - did Apple disclose to them and give them early access to M4 hardware?
 
I think Intel is about to have a bad, no good, terrible month ahead of them. Microsoft, Qualcomm and Asus have an event planned May 20th to unveil what is likely the new Snapdragon X laptops. Dell has one on tap. Nvidia announces earnings May 22. I'm guessing Apple shows M4 Ultras in the Studio and Mac Pro on June 10th with enough neural processing to shut the press up for a while.
...and AMD is likely to announce Zen 5 shipping at Computex.
M4 isn't of much interest to anyone watching Qualcomm right now other than as an indication of what Qualcomm might eventually be able to do as they further refine their product. Qualcomm compares to Apple Silicon for headlines, not because they're competing with each other.
You know, that's true right now. But...

There was a brief period of time, quite a few years ago, when MacBook Pros with bootcamp were the best Windows laptops you could buy, for a certain significant segment of the market. They didn't make a huge dent because bootcamp, and because it takes a few years for sea changes to take hold, and this state of affairs didn't last nearly long enough.

If WoA is reasonable, and Apple were to issue a new ARM Bootcamp, they could take a significant part of that market. And unlike the last time they had a "best" product, they're unlikely to be pushed out of position for years to come. Nobody has a chip that looks competitive with QC's SXE, and the SXE is a dog compared to the M3, never mind the M4.

It will be absolutely hilarious, though not very surprising, if 6-12 months from now Apple sells the best Windows laptop on the market.

Or they skip it. Apple likely has a commitment to TSMC to consume a certain number of wafers. They might just keep them in the Air for the rest of the year, maybe update the iPad Air to M3 sometime in the future, and go to M5 when it's here.
I think that's extraordinarily unlikely. If I were Apple I'd have been damn sure to negotiate that my commitment would be portable from N3B to N3E. And I doubt TSMC would care. They use the same machines and raw materials to produce both. I suspect that both Apple and TSMC would prefer to move everything to N3E as soon as possible. It improves everyone's logistics a lot.
It sounds like M4 was needed to drive the iPad tandem display, and maybe too late to get that requirement into M3. We may never see a laptop based on M4 Vanilla. M4 Pro might add NE and GPU to further press the AI point.
Yes about the tandem display. As for the rest, I bet you a virtual dollar you're wrong. :)

Based on various social media postings it appears that Xcode quietly gained SME support sometime within the last year. I still wonder how Primate Labs tested - did Apple disclose to them and give them early access to M4 hardware?
I was going to say that that seems unlikely - if PL had the iPad, they'd know what to call it instead of iPad 16,x. But... perhaps they got an early M4 Mac. Or else, perhaps Apple offered patches? Though that seems unlikely, just because I imagine PL being leery of accepting code from a vendor they test.

I'm very interested to know the answer to this, but I doubt we'll ever know.
 
Based on various social media postings it appears that Xcode quietly gained SME support sometime within the last year. I still wonder how Primate Labs tested - did Apple disclose to them and give them early access to M4 hardware?

There must have been some communication between Primate Labs and Apple. There is no other SME-capable hardware on the market, not even the recently announced ARM server chips ship with it, so they must have known. Another giveaway is that GB6 uses Apple-specific feature strings to test for SME - these feature strings are not currently present in any shipping macOS. I have difficulty believing that John Poole is a psychic that can guess the exact features Apple would ship and strings they would use to report the features.

As to testing, I don’t think it’s too complicated. Correctness of the algorithms can be verified using reference software implementation, and to check that it runs correctly they could have shared the code with Apple. No need to access M4 at all.
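A minimal sketch of that kind of check, assuming nothing about Primate Labs' actual process: a candidate kernel is compared against a plain reference implementation on random inputs. The candidate here accumulates rank-1 outer products, which is the general shape of SME's matrix instructions, but it is ordinary Python and all function names are illustrative.

```python
import random


def matmul_reference(a, b):
    """Textbook triple loop: the trusted reference."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_outer_product(a, b):
    """Candidate kernel: accumulate one outer product (column of a times row of b) per step."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for p in range(k):
        col = [a[i][p] for i in range(n)]
        row = b[p]
        for i in range(n):
            for j in range(m):
                c[i][j] += col[i] * row[j]
    return c


def max_abs_diff(x, y):
    """Largest element-wise difference between two equally shaped matrices."""
    return max(abs(u - v) for rx, ry in zip(x, y) for u, v in zip(rx, ry))


def check_against_reference(n, k, m, seed=0, tol=1e-9):
    """Run both kernels on the same random matrices and compare within a tolerance."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]
    b = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(k)]
    return max_abs_diff(matmul_reference(a, b), matmul_outer_product(a, b)) <= tol


print(check_against_reference(4, 5, 3))
```

This only establishes that the algorithm is right; confirming that the real instructions execute correctly still needs someone with the hardware, which is the step the post suggests Apple could have done.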
 
I was going to say that that seems unlikely - if PL had the iPad, they'd know what to call it instead of iPad 16,x. But... perhaps they got an early M4 Mac.
Doesn't even have to be anything Apple ever ships to consumers, Apple could have offered Primate Labs a VM running on one of the silicon bringup / validation boards Apple keeps in their labs:

 
Here's a fun table I made:

Rather than continually deal with IPC averages, I thought it might be fun to look at each individual test and see how they changed since the M1. I chose a GB 6.3 result from each processor (so run-to-run variation comes into play; I'm not trying to get exact numbers from averages) and compared the change in score with the change in clock speed. Given my results, I can say that IPC increases depend strongly on the workload.

For instance, since the M1, IPC for GB's HTML5 and Background Blur tests has increased roughly 40%; IPC for PDF Render, Photo Library, Object Remover and Ray Tracer has increased ~20%; and IPC for Clang, HDR and Photo Filter by 11-15%. Text Processing, Asset Compression and Structure from Motion are up about 7%, File Compression and Navigation about 3-4%, while Horizon Detection is completely flat. Object Detection prior to SME went up by 18% between the M1 and M2 but was flat between the M2 and M3. It's obviously unknown what it would've done in the M4 without SME.

Now, if you want, you could create "an average" of those by taking the geometric mean for the FP and INT tests and a weighted arithmetic mean over the two, but that would conceal everything that's interesting, which is why I don't like averages. This shows that Apple is in fact iterating quite strongly in the areas they care about for the CPU's performance, and leaving those they don't to clock speed. The average is brought down by the latter. Rather than the criticism that Apple "studies for the test", I would argue that they have their own design priorities for what's most important to improve for their users, and those are different from GB's.

[Attached image: table of per-test GB 6.3 IPC changes, M1 through M4]



I can't attach the full spreadsheet to check it for errors, but I think it's right. It only took me a bit to do this so anyone could do the same (and please do point out any errors beyond run-to-run stuff if you get very different results).

Also I should point out that clocks have increased by 38% since the M1. Often times big clock speed increases like we've been getting necessitate microarchitecture changes just to keep up as otherwise IPC falls as clocks rise (especially if you increase them by nearly 40%). Thus, part of this is that Apple has been so aggressive with clocks, particularly M2->M3 was a huge gain in clocks (I know the M2 Max could clock higher, but this is base M2) and there was even a pretty sizable jump in M3->M4. These have "eaten" IPC gains if you will. That they've also increased clocks by this much without spiking power to ungodly amounts is a testament to both their and TSMC's engineering efforts as well.
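To illustrate how much an average hides, here is a rough sketch that collapses the approximate per-test IPC changes quoted in this post into a single number, using a geometric mean per section and a weighted arithmetic mean across sections. Several things here are assumptions rather than anything from the table: the assignment of tests to INT and FP sections, the 65/35 weighting, the midpoints chosen for ranges like "11-15%", and leaving out Object Detection because its non-SME M4 figure is unknown.

```python
import math

# Approximate IPC ratios since the M1, read off the prose above (midpoints are assumptions).
INT_IPC = {
    "File Compression": 1.035, "Navigation": 1.035, "HTML5": 1.40,
    "PDF Render": 1.20, "Photo Library": 1.20, "Clang": 1.13,
    "Text Processing": 1.07, "Asset Compression": 1.07,
}
FP_IPC = {
    "Background Blur": 1.40, "Horizon Detection": 1.00, "Object Remover": 1.20,
    "HDR": 1.13, "Photo Filter": 1.13, "Ray Tracer": 1.20,
    "Structure from Motion": 1.07,
}
INT_WEIGHT, FP_WEIGHT = 0.65, 0.35  # assumed section weights


def geometric_mean(values):
    """Geometric mean via logs; values must be positive."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def composite(int_ratios, fp_ratios, w_int=INT_WEIGHT, w_fp=FP_WEIGHT):
    """Weighted arithmetic mean of the two section geometric means."""
    return w_int * geometric_mean(int_ratios) + w_fp * geometric_mean(fp_ratios)


overall = composite(INT_IPC.values(), FP_IPC.values())
all_ratios = {**INT_IPC, **FP_IPC}
spread = max(all_ratios.values()) - min(all_ratios.values())

print(f"single 'average' IPC change: {overall - 1:+.1%}")
print(f"range across tests: {min(all_ratios.values()) - 1:+.1%} to {max(all_ratios.values()) - 1:+.1%}")
```

The single figure lands somewhere in the middle and says nothing about which tests sit flat and which gained around 40%, which is the objection to averages made above.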

@name99 @mr_roboto @leman
 
There must have been some communication between Primate Labs and Apple. There is no other SME-capable hardware on the market, not even the recently announced ARM server chips ship with it, so they must have known. Another giveaway is that GB6 uses Apple-specific feature strings to test for SME - these feature strings are not currently present in any shipping macOS. I have difficulty believing that John Poole is a psychic that can guess the exact features Apple would ship and strings they would use to report the features.
Not necessarily. SME has apparently been part of Xcode for the past year (at least since October or so). It's been part of LLVM for longer, but was rolled into Xcode at around that time.
So the ability to generate an Apple binary with SME did not require any special permissions.

As to testing, I don’t think it’s too complicated. Correctness of the algorithms can be verified using reference software implementation, and to check that it runs correctly they could have shared the code with Apple. No need to access M4 at all.
I'm guessing Primate tested their code on a simulator?
 
This is a very interesting table!

I'm not surprised that the IPC seems to be roughly on-par between the M1 and the M2 (we already knew that the majority of the performance gains came from clock speed increases), but I'm still very curious about the obvious outlier of the HTML5 scores. That's a 20% bump for a chip that otherwise has very similar IPC to its predecessor (and this isn't just an isolated test result, various other HTML5 benchmarks have shown similar gains, including Antutu's HTML5 tests and Speedometer 2.1).

Similar gains could be seen in object detection too, though I know less about that one in terms of other benchmarks that could verify the increases (not that we have any reason to doubt Geekbench, but I think it's good to verify with other benchmarks as well just to make sure it isn't a difference in how something was compiled). Does anyone know what technical changes Apple might have made within the chip to contribute to higher IPC that was seen on these benchmarks?

Love the data that you've shared here, definitely going to bookmark this post and probably refer to it.
 
Absolutely. Just to reiterate, though: the exact numbers are subject to run-to-run variation, and looking at the data on GB's website I may have picked a particularly good M4 result. :) But it's more about the overall relationships.

As to why those tests in particular showed such good scaling between the M1 and M2, I'm not sure - possibly cache changes. You can take a look at the test characteristics here:


With the caveat that this was the test on the reference Intel chip, and that an Apple chip may change not only the absolute numbers but also their relationships to each other, these tests have low branch penalty misses, decent IPC, middling L3 cache misses, and maybe slightly higher than average L1/L2 cache misses. Changes in the caches/fabric might explain the difference (in other words, those working set sizes benefited from whatever fabric/cache changes Apple made, given the other characteristics of those tests), especially since we think the M2 P-cores were quite similar to the M1's. Of course there's also the ray tracing scores from M2 and M3, definitely another outlier of improvement for M3 - that could be down to core changes, since we know the core changed quite substantially. Most of the tests just kept up with the huge upswing in clocks, but that one benefited a lot (I also might've picked a low M3! or a high M2! Dunno). Truthfully these are all just guesses.
 