M4+ Chip Generation - Speculation Megathread [MERGED]

MrGunny94 · May 10, 2024

name99 said:
Apple are engaged in on-going work (that moves a little more each year) to split the OS up into more and more pieces that can run independently on separate cores. Obviously this is a goal that every OS vendor strives for in the age of multi-core; Apple's nothing special in this respect, just the techniques they will use will be optimal for the structure of Darwin.
There have been academic OSs in the past (like Barrelfish, from MS) that have pushed this idea, but moving a large commercial OS in this direction is obviously harder!

I've mentioned before that part of how Apple run faster is to run experiments in parallel. IMHO the M3 6E cluster was such an experiment – put it in a chip where it can't cause any harm, and see just how well it can get used (both by the OS and by lightweight threads in apps). Presumably the experiment was a big success, enough so that we see it as the new norm (and perhaps also justifying moving to 6E cores for M4 Max?)

Open questions then include
- does 6E make sense for an iPhone? I guess we'll see soon! Maybe it does?

- does going up to 8 E-cores now make sense? (There are two issues here. The presence of 8 E cores, is there enough work for them? And whether it's still feasible to have them all sharing a single set of L2 capacities like the L2 itself, the L2 TLB and page walkers, and AMX/SME. If those resources start to be overloaded, maybe better to dial it back to 4E+4E for the M4 Pro and slowly over the next few years work our way back to 6E+6E in four years or so?)

- does a dedicated OS-only E cluster make sense? The idea here is that we devote an E cluster (maybe only two E-cores, maybe no AMX/SME needed, and small L2) to running the most security critical elements of the OS and NOTHING ELSE. The idea is that if we have these cores isolated to this extent malicious apps won't be able to [or at least will have to work even harder to find some scheme] either modify the OS or eavesdrop on what it's doing. This will also allow us to make the other cores more aggressive in terms of things like variable timing and speculation without having to worry about this endless stream of micro-architectural security issues (Spectre, GoFetch and the rest of them). If you want to do crypto or anything involving passwords, call into the OS which will shunt the work to a security core, and given this fact, who CARES that an app can, with immense effort, sometimes read a few bytes from the memory range of some other app?

I think we'll be seeing more and more the approach you mentioned on the last point.

Especially with ARM coming to Windows and Intel now even planning some crazy hybrid core strategy with ARM in the future, it's possible that an OS-only E cluster will become more and more common and the P cores will be used just for heavy-duty applications.

Can't wait to see the Mac Lineup with M4 though

Xiao_Xi · May 12, 2024

What set of SME instructions could Apple have implemented? SME? SME2?
Can iPadOS access that information or can only macOS?
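
For anyone who wants to check on macOS once hardware is in hand, here is a minimal sketch that shells out to the `sysctl` command. The key names `hw.optional.arm.FEAT_SME` and `hw.optional.arm.FEAT_SME2` are an assumption based on Apple's `hw.optional.arm.FEAT_*` naming pattern, and the helper names are hypothetical. It only covers macOS; an iPadOS app would have to read the same keys from inside the app, which is not shown here.

```python
import subprocess

# Assumed key names, following Apple's hw.optional.arm.FEAT_* pattern.
SME_KEYS = ("hw.optional.arm.FEAT_SME", "hw.optional.arm.FEAT_SME2")


def query_feature(key):
    """Return True/False if the sysctl key reads as 1/0, or None if unreadable."""
    try:
        result = subprocess.run(
            ["sysctl", "-n", key],
            capture_output=True,
            text=True,
            timeout=5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None  # no sysctl binary on this system, or it did not answer
    if result.returncode != 0:
        return None  # key unknown on this OS or hardware
    value = result.stdout.strip()
    if value not in ("0", "1"):
        return None  # unexpected output, so report "unknown" rather than guess
    return value == "1"


def sme_report():
    """Map each assumed SME key to True, False, or None (unknown)."""
    return {key: query_feature(key) for key in SME_KEYS}


if __name__ == "__main__":
    for key, present in sme_report().items():
        print(key, "->", present)
```

A `None` result only means the key could not be read on that machine; it says nothing about what the silicon implements.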

Populus · May 12, 2024

Xiao_Xi said:
What set of SME instructions could Apple have implemented? SME? SME2?
Can iPadOS access that information or can only macOS?

That’s a pretty good question. IIRC, SME is part of ARM v9.2, right? And SME2 comes with ARM v9.4.

smalm · May 12, 2024

SME = 9.3
SME2 = 9.4

leman · May 12, 2024

Populus said:
That’s a pretty good question. IIRC, SME is part of ARM v9.2, right? And SME2 comes with ARM v9.4.

These are still optional extensions. Apple can implement v8 with SME subset on top. For instance, I don’t see them adopting SVE in the near future.

senttoschool · May 12, 2024

iPad16,4 - Geekbench

Benchmark results for an iPad16,4 with an ARM processor.

browser.geekbench.com

This score is insane. Almost 4000/15000.

Dulcimer · May 12, 2024

What do we think Apple will do with core counts for M4 Max? Trend for M4 (and M3 Pro) was adding more E-cores. M3 Max saw 4 more P-cores and 2 more GPU cores (highest configuration).

So it’s clear they want more distinction between Pro and Max lines. Do we have reason to believe they’ll continue this trend?

senttoschool · May 12, 2024

Dulcimer said:
What do we think Apple will do with core counts for M4 Max? Trend for M4 (and M3 Pro) was adding more E-cores. M3 Max saw 4 more P-cores and 2 more GPU cores (highest configuration).

So it’s clear they want more distinction between Pro and Max lines. Do we have reason to believe they’ll continue this trend?

I'm not expecting more GPU cores because N3E has reduced density vs N3B. I do expect a few more E cores though because it's a cheap way (in terms of area) to improve MT performance. I don't expect more P cores.

quarkysg · May 12, 2024

senttoschool said:
iPad16,4 - Geekbench

Benchmark results for an iPad16,4 with an ARM processor.

browser.geekbench.com

This score is insane. Almost 4000/15000.

Benched from a passively cooled device!

streetfunk · May 12, 2024

I would expect:
M4 Max: 6 E-cores / 12 P-cores
M4 miniPro: 6 E-cores / 6 P-cores
The question is, will there be a miniPro with 8 P-cores.

I would expect them to continue to have a clear distinction between miniPro and Max.

senttoschool · May 12, 2024

quarkysg said:
Benched from a passively cooled device!

I'm thinking 4200/27000 for M4 Max. Up from 3200/21000 today.

I was not expecting such a monstrous increase in ST.
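
To put that guess in relative terms, here is a quick sketch of the arithmetic. The scores are just the ones quoted in this post (today's 3200/21000 against the guessed 4200/27000); the function and variable names are made up for illustration.

```python
def uplift_percent(old, new):
    """Percentage increase going from old to new."""
    return (new - old) / old * 100.0


# Scores as quoted above: current Max figures vs. the guessed M4 Max figures.
current = {"single": 3200, "multi": 21000}
projected = {"single": 4200, "multi": 27000}

uplifts = {name: uplift_percent(current[name], projected[name]) for name in current}

if __name__ == "__main__":
    for name, pct in uplifts.items():
        print(f"{name}-core uplift: {pct:.2f}%")
```

That works out to roughly a 31% single-core and 29% multi-core jump, if the guess holds.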

Boil · May 12, 2024

senttoschool said:
I'm thinking 4200/27000 for M4 Max. Up from 3200/21000 today.

I was not expecting such a monstrous increase in ST.

M4 Extreme
4200/69000
Noice...!
;^p

Xiao_Xi · May 12, 2024

leman said:
Apple can implement v8 with SME subset on top. For instance, I don’t see them adopting SVE in the near future.

Is that possible?

SME builds on the Scalable Vector Extensions (SVE and SVE2), adding new capabilities to efficiently process matrices. Key features include:

Matrix tile storage

Load, store, insert, and extract tile vectors, including on-the-fly transposition

Outer product of SVE vectors

Streaming SVE mode

A new operating mode is added, Streaming SVE Mode. When in Streaming SVE Mode, the new SME storage and instructions are available, as well as a significant subset of the existing SVE2 instructions. When not in Streaming SVE mode, behavior is unchanged from SVE2. Applications can switch between operating modes depending on what is needed.

Scalable Matrix Extension for the Armv9-A Architecture

In this blog, read the details for Scalable Matrix Extension (SME). This is a new extension from the latest Arm Vision day announcement for Armv9-A.

community.arm.com

Pressure · May 12, 2024

It's a shame the "tech" influencers currently doing their "in-depth reviews" only have the ability to run Geekbench 6 when we could also get some Geekbench ML results at least.

leman · May 12, 2024

Xiao_Xi said:
Is that possible?

Scalable Matrix Extension for the Armv9-A Architecture

In this blog, read the details for Scalable Matrix Extension (SME). This is a new extension from the latest Arm Vision day announcement for Armv9-A.

community.arm.com

Of course it is. One can implement SME + streaming mode SVE without implementing the regular SVE extension. This is explicitly supported by the ARM spec.

senttoschool · May 13, 2024

Some interesting discoveries from Anandtech forums:

1. Geekbench6 does use AMX x86 instruction sets: https://www.geekbench.com/doc/geekbench6-benchmark-internals.pdf

2. Intel CPUs that have AMX do look to be working in GB6: https://browser.geekbench.com/v6/cpu/compare/5330257?baseline=4656765. See Object Remover scores.

Edit: Fixed first link

leman · May 13, 2024

senttoschool said:
Some interesting discoveries from Anandtech forums:

1. Geekbench6 does use AMX x86 instruction sets: https://browser.geekbench.com/v6/cpu/compare/5330257?baseline=4656765

2. Intel CPUs that have AMX do look to be working in GB6: https://browser.geekbench.com/v6/cpu/compare/5330257?baseline=4656765. See Object Remover scores.

Wait, so matrix operations on an iPad are almost as fast as on the latest and greatest Intel server CPU with AVX512 and AMX? Wow…

crazy dave · May 13, 2024

senttoschool said:
Some interesting discoveries from Anandtech forums:

1. Geekbench6 does use AMX x86 instruction sets: https://browser.geekbench.com/v6/cpu/compare/5330257?baseline=4656765

2. Intel CPUs that have AMX do look to be working in GB6: https://browser.geekbench.com/v6/cpu/compare/5330257?baseline=4656765. See Object Remover scores.

leman said:
Wait, so matrix operations on an iPad are almost as fast as on the latest and greatest Intel server CPU with AVX512 and AMX? Wow…

I'm a little tired so this is probably obvious, but how does this show it? Is it that the Intel Core i7-12700T has the AMX but the Intel Xeon w5-3435X doesn't? @senttoschool can you link to the Anandtech forum page where they found this?

I just naturally assumed given the scores that GB didn't use AMX, but ... wow ... if it does ...

Primate Labs needs to make it more clear which extensions it uses for every GB test.

EDIT: They actually do for some of them, maybe all of them, how the hell did I miss that:

https://www.geekbench.com/doc/geekbench6-benchmark-internals.pdf

Photo Library and Object Detection use AMX, they don't mention which they use for Object Remover, maybe they don't use them? And so it's possible Photo Library's small bump is also SME related? It just didn't have as pronounced an effect?

AgentMcGeek · May 13, 2024

Do we have any GPU benchmark already?

thenewperson · May 13, 2024

crazy dave said:
I'm a little tired so this is probably obvious, but how does this show it? Is it that the Intel Core i7-12700T has the AMX but the Intel Xeon w5-3435X doesn't? @senttoschool can you link to the Anandtech forum page where they found this?

I just naturally assumed given the scores that GB didn't use AMX, but ... wow ... if it does ...

Primate Labs needs to make it more clear which extensions it uses for every GB test.

EDIT: They actually do for some of them, maybe all of them, how the hell did I miss that:

https://www.geekbench.com/doc/geekbench6-benchmark-internals.pdf

Photo Library and Object Detection use AMX, they don't mention which they use for Object Remover, maybe they don't use them? And so it's possible Photo Library's small bump is also SME related? It just didn't have as pronounced an effect?

According to someone on anandtech forums Intel reserves AMX for their server CPUs so that’s where you’ll notice the difference. Link: https://forums.anandtech.com/posts/41209202/

senttoschool · May 13, 2024

crazy dave said:
EDIT: They actually do for some of them, maybe all of them, how the hell did I miss that:

https://www.geekbench.com/doc/geekbench6-benchmark-internals.pdf
Photo Library and Object Detection use AMX, they don't mention which they use for Object Remover, maybe they don't use them? And so it's possible Photo Library's small bump is also SME related? It just didn't have as pronounced an effect?

Yea my first link was meant to point to https://www.geekbench.com/doc/geekbench6-benchmark-internals.pdf

That's fixed in my post now.

crazy dave · May 13, 2024

AgentMcGeek said:
Do we have any GPU benchmark already?

Yeah it's about a 13-15% improvement - probably a mix of clocks/bandwidth, we'll know when they get out into the wild and we see what the GPU clocks are.

iPad16,6 - Geekbench

Benchmark results for an iPad16,6 with an ARM processor.

browser.geekbench.com

senttoschool said:
Yea my first link was meant to point to https://www.geekbench.com/doc/geekbench6-benchmark-internals.pdf

That's fixed in my post now.

I've been using that link for *days* and just skipped over the instructions or at least if I noticed them, the import didn't register. Rather embarrassing.

leman · May 13, 2024

senttoschool said:
Yea my first link was meant to point to https://www.geekbench.com/doc/geekbench6-benchmark-internals.pdf

Just as a quick comment — that document seems to be outdated since it does not mention SVE and SME instruction set support for ARM in the latest versions of GB6.

mr_roboto · May 13, 2024

quarkysg said:
Benched from a passively cooled device!

I'd caution against reading too much into that, thanks to the design of Geekbench. Quoting from the same GB document @crazy dave linked:

Geekbench inserts a pause (or gap) between each workload to minimize the effect thermal
issues have on workload performance. Without this gap, workloads that appear later in the
benchmark would have lower scores than workloads that appear earlier in the benchmark.
The default gap is 2 seconds for both single-core and multi-core workloads.

It doesn't say so here but each workload runs for even less time than the 2s gap. GB CPU is deliberately designed to avoid any thermal clockspeed limits. John Poole has given a rationale for this which goes beyond the paragraph above and makes sense given what he wanted GB to be, but it's something you do need to be aware of when interpreting results.
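
A rough way to see why this matters is to look at the fraction of wall-clock time the CPU is actually loaded. The sketch below assumes the 2 second default gap from the quote; the per-workload run lengths are hypothetical placeholders, not measured Geekbench timings, and the function name is made up.

```python
def duty_cycle(run_s, gap_s=2.0):
    """Fraction of time under load when each workload is followed by an idle gap."""
    if run_s < 0 or gap_s < 0 or run_s + gap_s == 0:
        raise ValueError("run and gap must be non-negative and not both zero")
    return run_s / (run_s + gap_s)


if __name__ == "__main__":
    # Hypothetical run lengths in seconds, all at or below the 2 s gap.
    for run_s in (0.5, 1.0, 2.0):
        print(f"run {run_s:.1f}s, gap 2.0s -> busy {duty_cycle(run_s):.0%} of the time")
```

Any workload shorter than the gap keeps the chip busy less than half the time, which gives a fanless chassis plenty of room to shed heat between workloads.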

Here's two examples to demonstrate this in action: 2020 M1 MBA (fanless) vs 2020 13" M1 MBP (fan), and 3rd gen iPad Pro 11 (fanless M1) vs the same 13" M1 MBP. Chose these because all three have the exact same SoC.

MacBook Air (Late 2020) vs MacBook Pro (13-inch Late 2020) - Geekbench

browser.geekbench.com

iPad Pro (11-inch 3rd generation) vs MacBook Pro (13-inch Late 2020) - Geekbench

browser.geekbench.com

If higher tier versions of M4 like Pro/Max/Ultra are in the works, we'll probably see slightly higher clock speeds, but if Apple follows their prior patterns we won't be seeing enormous increases in GB CPU from that - it's usually a fairly small bump.

crazy dave · May 13, 2024

leman said:
Just as a quick comment — that document seems to be outdated since it does not mention SVE and SME instruction set support for ARM in the latest versions of GB6.

As I wrote in the other place though, if GB6 supports AMX and AVX-512 for x86, SME support for ARM can hardly be considered "cheating". Hell, even AVX2 is almost ... ehk ... nevermind ...
