M4+ Chip Generation - Speculation Megathread [MERGED]

Pressure · May 16, 2024

So the M4 supports ~~both SVE and~~ SME as tested here.

We see that a single core reaches 111 FP32 GFLOPS when running Neon fmla instructions, 31 FP32 GLFLOPS when running streaming SVE fmla vector instructions, and over 2000 FP32 GFLOPS for outer-product fmopa instructions.

crazy dave · May 16, 2024

Pressure said:
So the M4 supports both SVE and SME as tested here.

Streaming SVE (and SME is a superset of streaming SVE) is supported on the 512b accelerator. Regular SVE2 on the 128b vector units is not supported.

leman · May 16, 2024

Pressure said:
So the M4 supports both SVE and SME as tested here.

It does not support SVE.

Also, their testing methodology is likely flawed. I’m sure they can get higher vector processing throughput with multiple accumulators.

Boil · May 16, 2024

Confused-User said:
If they don't ship MBPs at the same time as desktops, then perhaps you're right, and they'll use N4P for them. I still hope we'll get an avalanche next month, but I don't have any particular expectation that we will.

Cue Britney Spears "Oops!...I Did It Again"

Confused-User · May 16, 2024

Boil said:
Cue Britney Spears "Oops!...I Did It Again"

Arrrrrrgh. Thanks. :-(

name99 · May 16, 2024

SBeardsl said:
[I assume you are talking about N3P (not N4P) here. ]

I can't see Apple pushing that into any iPhone this year, they have the brand new N3E for that which would allow them to save the bump to N3P for next year's iPhone. The iPhone has a target date each year, it has to ship in that window due to the huge logistical/commercial weight it carries. I can't see them holding it up even two months past its traditional release.

What I do think is that we get desktops sometime this summer (come on WWDC!) with M4s on N3E (same family as the iPad Pro) then the MacBooks updated with "M5" whenever N3P is ready to ship with. If that is Oct/Nov it's right on time, about a year since the last updates. If its Dec/Jan, it's still fine, only a couple months delay.

Since N3P uses the same layout rules as N3E Apple could just use the M4 layouts and claim the incremental (+5% performance or power improvement) or it could roll in some improved cores to the package (more likely). As others have suggested the GPU cores are in line for updates, maybe the next round of low hanging fruit for 3D rendering, perhaps tweaks to improve their ability to contribute to computation. Add something to the media cores (AV1 encode?), maybe bump NE cores count up, and you have something worthy of a release.

More likely, I'd expect, is M4 stays as is, and M4 Pro and Max (and Ultra, has to be an Ultra, surely?!?) built on N3P.

Which, if correct, MIGHT mean
- an M4 iMac soon (WWDC?)
- MBPs (some of which will want an M4 Pro/Max) and the rest still a few months
- mini with M4? Maybe also a delay, given that Apple probably wants to release it at the same time as the version with M4 Pro.

As you point out, the 6 month delay may be worth it if you get a boosted GPU (eg the A18 Pro GPU) as part of the M4 P/M/U package...

But it's also true that we tend to project a consistency onto these products that may not be real.
We've had surprises before (no-one expected the 3-core A8X, no-one expected the M4, no-one expected M3 Ultra to be skipped).
So maybe the path will actually be JUST an M4 (no Pro/Max), no M4 upgrades, and a full M5 suite, more or less in synch with A18 Pro, on N3P, in say early December (or at least M5 early December, to catch gift season, and M5 Pro/Max a month or two later)?

name99 · May 16, 2024

DaniTheFox said:
I always thought certain tests were done in a fridge. But this is quite extreme.

Honestly I'd be curious to see them do a fridge and freezer run, since those are so much less extreme.
My guess is probably fridge, definitely freezer, can sustain 4.41GHz, without the liquid N2 craziness.

name99 · May 16, 2024

leman said:
It does not support SVE.

Also, their testing methodology is likely flawed. I’m sure they can get higher vector processing throughput with multiple accumulators.

Yeah, the SME number looks correct, the SSVE number looks wrong. My comments here:

RWT Forums - Real World Tech

content overridden

www.realworldtech.com

Once they have the basic SSVE flaw sorted out, they need to try the SSVE 2x and 4x instructions, which should allow doubling and quadrupling their vector throughput, to max out at something like the SME number.
(Of course, realistically for many cases, at that point you are limited by load/store bandwidth, not FMAC bandwidth, unless you are doing something like generating random numbers.)

leman · May 16, 2024

name99 said:
Yeah, the SME number looks correct, the SSVE number looks wrong. My comments here:

RWT Forums - Real World Tech

content overridden

www.realworldtech.com

Once they have the basic SSVE flaw sorted out, they need to try the SSVE 2x and 4x instructions, which should allow doubling and quadrupling their vector throughput, to max out at something like the SME number.
(Of course, realistically for many cases, at that point you are limited by load/store bandwidth, not FMAC bandwidth, unless you are doing something like generating random numbers.)

In their test code they accumulate to Z registers but I think they should be accumulating to ZA tile slices. That would mirror the AMX instruction semantics.

tenthousandthings · May 17, 2024

name99 said:
More likely, I'd expect, is M4 stays as is, and M4 Pro and Max (and Ultra, has to be an Ultra, surely?!?) built on N3P.

Which, if correct, MIGHT mean
- an M4 iMac soon (WWDC?)
- MBPs (some of which will want an M4 Pro/Max) and the rest still a few months
- mini with M4? Maybe also a delay, given that Apple probably wants to release it at the same time as the version with M4 Pro.

As you point out, the 6 month delay may be worth it if you get a boosted GPU (eg the A18 Pro GPU) as part of the M4 P/M/U package...

But it's also true that we tend to project a consistency onto these products that may not be real.
We've had surprises before (no-one expected the 3-core A8X, no-one expected the M4, no-one expected M3 Ultra to be skipped).
So maybe the path will actually be JUST an M4 (no Pro/Max), no M4 upgrades, and a full M5 suite, more or less in synch with A18 Pro, on N3P, in say early December (or at least M5 early December, to catch gift season, and M5 Pro/Max a month or two later)?

I was leaning toward M4 Pro/Max being on N3P around October/November. There is a precedent for changing the process node midstream, the A10 (iPhone 7) on TSMC 16nm in September 2016 and the A10X (iPad Pro 2) on TSMC 10nm in June 2017.

But I like your M5 idea better. It would mean the Studio and Pro get M3 Ultra at WWDC 2024, then M5 Ultra at WWDC 2025.

name99 · May 17, 2024

leman said:
In their test code they accumulate to Z registers but I think they should be accumulating to ZA tile slices. That would mirror the AMX instruction semantics.

Problem with that theory is that the ARM spec does not allow for accumulating to Z registers, only to ZA (which makes sense in terms of the HW).

But which then makes you wonder...
- Is the code they are providing NOT the actual code they're executing? So the code they're executing is in fact accumulating to only one register. Which would match Corsix' M1 numbers.
OR
- Does the assembler they are using allow accumulation to a Z register by "helpfully" behind the scenes adding extra data movement instructions (ie the instructions are more like ACLE, less like pure assembly)?

name99 · May 17, 2024

crazy dave said:
Looking up MacroScalar I came across this written about 13 years ago. It was a good read for multiple reasons, not just the technical description of how MacroScalar might work but also from a historical perspective on Apple's entrance into microarchitecture design:

What Is Macroscalar?

A couple months ago, a story made the nerd-press rounds about Apple’s trademark application and several patents for something called a “macroscalar” processor architecture. I’ve taken a stab at decoding the publicly available information about macroscalar architectures to give a coherent picture...

www.cs.cornell.edu

The nutshell version of something like MacroScalar is that rather than work with fixed sized registers and SIMD operations (even SVE-style) you write your code as standard scalar loops, but some of those loops are captured by the compiler and annotated as being amenable to SIMD'ization.
Everything after that happens in the CPU; at some point (possibly when the instructions are loaded to L2, possibly when they're loaded to L1, possibly even at Decode time) certain patterns of instructions are on-the-fly converted to a SIMD version appropriate for this hardware.

The primary win of this is that you get everything SVE gives you, without the various pain points; AND you don't burn up your 32b ISA space the way SVE has done.
The primary loss is that you give up perhaps even more control than SVE (eg for things like data rearrangement).

The idea admittedly sounds some combination of cool and insane. But others have suggested the same sort of idea, for example here's a version for POWER: https://libre-soc.org/openpower/sv/
Another way to look at this is that it's an implementation of "real" vector computing Cray-style ["hardware for loop"], rather than SIMD, and the implementation as SIMD or otherwise is a low-level detail unimportant to the developer.

Yet another way to look at this is to assume (who knows?) that Apple considered THE important part of MacroScalar to be not the "hardware for loop" part but the offloading to separate hardware part, and that's already present with AMX/SME so Mission Achieved, move on...

As I keep saying, the primary win in all these schemes is you get a better compiler target, and that's worth a lot. Hell, if Apple just adopts SVE128, and says they will NEVER change the width, so we NEON plus predication, no need to test for load/store boundary conditions, and no loop tails, that's worth a few percent on "most" code, including non-FP code, and getting a few percent on current CPUs any other way is not easy...

leman · May 17, 2024

name99 said:
Problem with that theory is that the ARM spec does not allow for accumulating to Z registers, only to ZA (which makes sense in terms of the HW).

But which then makes you wonder...
- Is the code they are providing NOT the actual code they're executing? So the code they're executing is in fact accumulating to only one register. Which would match Corsix' M1 numbers.
OR
- Does the assembler they are using allow accumulation to a Z register by "helpfully" behind the scenes adding extra data movement instructions (ie the instructions are more like ACLE, less like pure assembly)?

I was under impression that accumulating to Z registers is supported under streaming SVE. The ARM manual is really difficult to read. But usually it says if the instruction is unsupported in streaming mode. No such warning is given for FMLA

LaterWolf · May 18, 2024

May you now please make this M4 pro+ or m5

LaterWolf · May 18, 2024

NikkoTuason said:
Base stats:

40 high-performance cores

20 high-efficiency cores

100-core GPU

80-core Neural Engine

8GB Unified memory

W h a o w

Populus · May 18, 2024

I’ll just say that this M4 seems like quite a jump. I’m not sure what to expect from M5 but it will probably be an incremental upgrade, maybe using N3P to have some performance or efficiency gains.

If what’s been said here is true and N3P is heavily oriented towards efficiency, we might see the M5 aimed at the next Mac laptops like they said… but this is pure speculation on my part.

For me, the path should be quite clear: if next WWDC really entices me to get the new iPad Pro (with a better desktop UI experience, and/or allowing Mac apps to be installed), then I will go with it, hopefully as my main computer device. If not, if iPadOS is more of the same software wise, I’ll finally get an M4 or M4 Pro Mac mini when it’s released.

The only advantage I could see to waiting for the M5 generation would be the adoption of 12GB as base RAM, but we’ve been speculating about that for too long at this point, just like the implementation of LPDDR6, so I’m not sure waiting any longer is worth it.

leman · May 18, 2024

Populus said:
I’ll just say that this M4 seems like quite a jump. I’m not sure what to expect from M5 but it will probably be an incremental upgrade, maybe using N3P to have some performance or efficiency gains.

My personal expectation is 2x higher GPU performance.

thenewperson · May 18, 2024

leman said:
My personal expectation is 2x higher GPU performance.

Care to go through your reasoning?

Populus · May 18, 2024

leman said:
My personal expectation is 2x higher GPU performance.

That would be awesome! And makes sense. I get that, with M3, Apple showed us their interest in gaming by adding Ray Tracing to their new GPU architecture, and the smart caching thing.

With M4 the GPU uses the same architecture, and the 12-15% performance gains could mostly be due to the new process node and clocks.

So yeah, now that Apple has brought the gaming attention to their platforms, it seems reasonable that the M5 would just finish the job with another performance jump.

That would make the M5 quite compelling as well… hmm…

Dulcimer · May 18, 2024

leman said:
My personal expectation is 2x higher GPU performance.

I would love to see that personally. But I don’t see that happening within a generation unless you expect the GPU core count increase to account for most of that 2x.

E.g., M4+/5 Max goes from 40 to say 60 GPU cores + another M3-style uarch bump + further mem bandwidth increase (LPDDR5X) + faster clocks with N3E or N3P.

Seems unlikely unless they’re willing to make Max die area significantly larger.

leman · May 19, 2024

thenewperson said:
Care to go through your reasoning?

Dulcimer said:
I would love to see that personally. But I don’t see that happening within a generation unless you expect the GPU core count increase to account for most of that 2x.

Disclaimer: all I say here is obviously just speculation and to large degree wishful thinking. Call it a hunch if you want. It is not entirely baseless though. I think if one puts together the timeline of Apple GPU changes, there appears to be a clear trajectory. Here is the basic idea.

For M3 GPU, they have redesigned the instruction scheduler so that it can select two instructions from two threads per clock and execute them simultaneously. This means that M3 can execute a FP32+FP16, FP32+INT, or FP16+INT instruction pair at the same time. This is already quite useful for some code, and I think the next obvious step is to redesign the execution units themselves to enable simultaneous FP32+FP32. This could be done by making the FP units larger (e.g. shipping two FP32 units instead of FP32 and FP16), or by making the INT32 unit capable of executing FP32 instructions. That is essentially what Nvidia did with Volta and I think this is where Apple is heading as well. The basic machinery is already in place, and this effective doubling of execution units can be achieved with only minimal increase in GPU die size. And it would instantly double the peak GPU compute.

Populus · May 19, 2024

leman said:
Disclaimer: all I say here is obviously just speculation and to large degree wishful thinking. Call it a hunch if you want. It is not entirely baseless though. I think if one puts together the timeline of Apple GPU changes, there appears to be a clear trajectory. Here is the basic idea.

For M3 GPU, they have redesigned the instruction scheduler so that it can select two instructions from two threads per clock and execute them simultaneously. This means that M3 can execute a FP32+FP16, FP32+INT, or FP16+INT instruction pair at the same time. This is already quite useful for some code, and I think the next obvious step is to redesign the execution units themselves to enable simultaneous FP32+FP32. This could be done by making the FP units larger (e.g. shipping two FP32 units instead of FP32 and FP16), or by making the INT32 unit capable of executing FP32 instructions. That is essentially what Nvidia did with Volta and I think this is where Apple is heading as well. The basic machinery is already in place, and this effective doubling of execution units can be achieved with only minimal increase in GPU die size. And it would instantly double the peak GPU compute.

No idea of what FP or INT (integers?) are, but if this is possible, doubling the performance would be awesome and worth waiting for the next gen GPU, wether in M5 or M6.

Like you said, Apple looks like is quite interested in improving Apple Silicon GPU performance, and that would be a great way to achieve it. This is already having its fruits with more and more game developers releasing their games on Mac.

By the way, Gurman is saying that until mid 2025 we won’t be seeing new Mac Studios. I wonder if this machines could lead the M5 generation or will stay within the M4 family.

leman · May 19, 2024

Populus said:
No idea of what FP or INT (integers?) are, but if this is possible, that would be really great and worth waiting for the next gen GPU, wether in M5 or M6.

FP is floating point. INT is integer.

The current architecture of M3 GPUs is this (taken from an Apple developer video):

They describe are four distinct ALU pipelines (FP32, FP16, Integer, and Complex — although I suspect that Integer and Complex might be the same pipeline). The scheduler can select two instructions for the different ALUs and execute them simultaneously. Right now these instructions must be of different types, e.g. one integer type and one full-precision floating point (FP32) type. My idea is that they are likely to tweak the ALU pipelines so that they can execute more types of instructions simultaneously.

name99 · May 19, 2024

leman said:
Disclaimer: all I say here is obviously just speculation and to large degree wishful thinking. Call it a hunch if you want. It is not entirely baseless though. I think if one puts together the timeline of Apple GPU changes, there appears to be a clear trajectory. Here is the basic idea.

For M3 GPU, they have redesigned the instruction scheduler so that it can select two instructions from two threads per clock and execute them simultaneously. This means that M3 can execute a FP32+FP16, FP32+INT, or FP16+INT instruction pair at the same time. This is already quite useful for some code, and I think the next obvious step is to redesign the execution units themselves to enable simultaneous FP32+FP32. This could be done by making the FP units larger (e.g. shipping two FP32 units instead of FP32 and FP16), or by making the INT32 unit capable of executing FP32 instructions. That is essentially what Nvidia did with Volta and I think this is where Apple is heading as well. The basic machinery is already in place, and this effective doubling of execution units can be achieved with only minimal increase in GPU die size. And it would instantly double the peak GPU compute.

The biggest bottleneck to this right now is the register cache. Already it can be oversubscribed by microbenchmark code (though usually not by real code). And widening that could be expensive.

To my eyes, the most obvious quick win is to add a scalar lane to execute scalar operations (and not waste full SIMD on them). They have uniform registers but for whatever reason they never provided a uniform execution path.

Next up in terms of quick wins is probably allowing FP16[2] operations to execute on pairs of FP16s. Either splitting these into one done in FP16 HW, one done on FP32 HW or (slightly more work) rewiring FP32 HW to also support FP16[2].
Both of these match the route AMD and nV has already gone done, and both don't put that much additional pressure on the register cache.

The next stage, of allowing more general pairs of instructions to execute per cycle on whatever hardware is available, does put pressure on the register cache :-(
Maybe they can work around this to a good enough solution via heuristics that split the cache into three caches, eg an int register cache, an FP16 register cache, and an FP32 register cache?

BTW don't confuse what I am calling the register cache (the "REAL" register cache, holding values before they go into execution units) with the "register file cache" which is what is drawn in the diagram above, and which is an SRAM cache of register values which can spill to L2 if required, and is part of the M3 grand rearrangement of how SRAM is used by the GPU.

name99 · May 19, 2024

BTW, on another M4 topic, in fact SSVE works every bit as fast as expected on the M4, and can (when coded correctly) run as fast as a matrix outer product, so ~2000 FP32 GFLOPS.
Problem is the initial Jena results were written up by people who read the ARM manual but never bother to read my manual (or the similar work by Corsix or Dougall).

Details here: https://www.realworldtech.com/forum/?threadid=217524&curpostid=217736

I expect over the next year or so a whole bunch of code (BLIS, EIGEN, Mathematica, MAPLE, etc) to start coding to SME and SSVE, and some pretty spectacular results.

M4+ Chip Generation - Speculation Megathread [MERGED]

macrumors 603

macrumors 68000

macrumors Core

macrumors 68040

macrumors 6502a

macrumors 68030

macrumors 68030

macrumors 68030

macrumors Core

Contributor

macrumors 68030

macrumors 68030

macrumors Core

Suspended

Suspended

macrumors 604

macrumors Core

macrumors 65816

macrumors 604

macrumors 65816

macrumors Core

macrumors 604

macrumors Core

macrumors 68030

macrumors 68030

Our Staff