
Pressure

macrumors 603
May 30, 2006
5,179
1,544
Denmark
So the M4 supports both SVE and SME as tested here.

We see that a single core reaches 111 FP32 GFLOPS when running Neon fmla instructions, 31 FP32 GFLOPS when running streaming SVE fmla vector instructions, and over 2000 FP32 GFLOPS for outer-product fmopa instructions.
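As a rough sanity check on those figures, the sketch below converts each GFLOPS number into instructions per nanosecond. The lane counts assume 128-bit Neon vectors and a 512-bit streaming vector length; the 512-bit figure is an assumption, not something stated above, and the function name is made up for illustration.

```python
def instr_per_ns(gflops, fma_per_instr):
    """Convert a GFLOPS figure into instructions per nanosecond.

    Each fused multiply-add counts as 2 floating-point operations.
    """
    flops_per_instr = 2 * fma_per_instr
    return gflops / flops_per_instr

# FP32 FMAs per instruction (vector lengths are assumptions, see above).
NEON_FMLA = 4          # 128-bit vector / 32-bit lanes
SSVE_FMLA = 16         # assumed 512-bit streaming vector / 32-bit lanes
SME_FMOPA = 16 * 16    # outer product of two 16-lane vectors

rates = {
    "neon fmla": instr_per_ns(111, NEON_FMLA),
    "ssve fmla": instr_per_ns(31, SSVE_FMLA),
    "sme fmopa": instr_per_ns(2000, SME_FMOPA),
}
for name, rate in rates.items():
    print(f"{name}: {rate:.2f} instructions/ns")
```

Under these assumptions the streaming SVE figure works out to about one fmla per nanosecond, far below the Neon rate, which may suggest that test is bound by something other than peak FMA throughput.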
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,318
[I assume you are talking about N3P (not N4P) here. ]

I can't see Apple pushing that into any iPhone this year, they have the brand new N3E for that which would allow them to save the bump to N3P for next year's iPhone. The iPhone has a target date each year, it has to ship in that window due to the huge logistical/commercial weight it carries. I can't see them holding it up even two months past its traditional release.

What I do think is that we get desktops sometime this summer (come on WWDC!) with M4s on N3E (same family as the iPad Pro), then the MacBooks updated with "M5" whenever N3P is ready to ship. If that is Oct/Nov it's right on time, about a year since the last updates. If it's Dec/Jan, it's still fine, only a couple of months' delay.

Since N3P uses the same layout rules as N3E, Apple could just use the M4 layouts and claim the incremental improvement (+5% performance or power), or it could roll some improved cores into the package (more likely). As others have suggested, the GPU cores are in line for updates, maybe the next round of low-hanging fruit for 3D rendering, perhaps tweaks to improve their ability to contribute to computation. Add something to the media cores (AV1 encode?), maybe bump the NE core count up, and you have something worthy of a release.
More likely, I'd expect, is M4 stays as is, and M4 Pro and Max (and Ultra, has to be an Ultra, surely?!?) built on N3P.

Which, if correct, MIGHT mean
- an M4 iMac soon (WWDC?)
- MBPs (some of which will want an M4 Pro/Max) and the rest still a few months
- mini with M4? Maybe also a delay, given that Apple probably wants to release it at the same time as the version with M4 Pro.

As you point out, the 6 month delay may be worth it if you get a boosted GPU (eg the A18 Pro GPU) as part of the M4 P/M/U package...

But it's also true that we tend to project a consistency onto these products that may not be real.
We've had surprises before (no-one expected the 3-core A8X, no-one expected the M4, no-one expected M3 Ultra to be skipped).
So maybe the path will actually be JUST an M4 (no Pro/Max), no M4 upgrades, and a full M5 suite, more or less in synch with A18 Pro, on N3P, in say early December (or at least M5 early December, to catch gift season, and M5 Pro/Max a month or two later)?
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,318
I always thought certain tests were done in a fridge. But this is quite extreme.
Honestly I'd be curious to see them do a fridge and freezer run, since those are so much less extreme.
My guess is probably fridge, definitely freezer, can sustain 4.41GHz, without the liquid N2 craziness.
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,318
It does not support SVE.

Also, their testing methodology is likely flawed. I’m sure they can get higher vector processing throughput with multiple accumulators.
Yeah, the SME number looks correct, the SSVE number looks wrong. My comments here:

Once they have the basic SSVE flaw sorted out, they need to try the SSVE 2x and 4x instructions, which should allow doubling and quadrupling their vector throughput, to max out at something like the SME number.
(Of course, realistically for many cases, at that point you are limited by load/store bandwidth, not FMAC bandwidth, unless you are doing something like generating random numbers.)
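To make the multiple-accumulator point concrete, here is a toy model (a sketch, not the benchmark's code). It assumes each accumulator forms its own dependency chain, so a chain can issue one FMA per `latency_cycles`, and the number of pipes caps the total. The latency and pipe counts are hypothetical, not measured M4 values.

```python
def sustained_fma_per_cycle(accumulators, latency_cycles, pipes):
    """Simple model of FMA issue rate with independent accumulator chains.

    Each chain can start a new FMA only when its previous one has finished,
    so one chain contributes 1/latency FMAs per cycle. The pipes cap the total.
    """
    latency_bound = accumulators / latency_cycles
    return min(float(pipes), latency_bound)

# Hypothetical core: 4-cycle FMA latency, 4 FMA pipes (illustrative numbers only).
for n in (1, 4, 16, 32):
    print(n, sustained_fma_per_cycle(n, latency_cycles=4, pipes=4))
```

In this model a single accumulator reaches only a quarter of an FMA per cycle, and the rate stops improving once there are enough chains (latency times pipes) to keep every pipe busy.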
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
Yeah, the SME number looks correct, the SSVE number looks wrong. My comments here:

Once they have the basic SSVE flaw sorted out, they need to try the SSVE 2x and 4x instructions, which should allow doubling and quadrupling their vector throughput, to max out at something like the SME number.
(Of course, realistically for many cases, at that point you are limited by load/store bandwidth, not FMAC bandwidth, unless you are doing something like generating random numbers.)

In their test code they accumulate to Z registers but I think they should be accumulating to ZA tile slices. That would mirror the AMX instruction semantics.
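For readers unfamiliar with the tile idea, the outer-product form accumulates into a two-dimensional block rather than a vector. The plain-Python sketch below shows only that data flow (a matrix product built as a sum of outer products); it ignores predication and tile slices, and the function names are invented for illustration.

```python
def outer_accumulate(tile, col, row):
    """tile[i][j] += col[i] * row[j], the shape of an outer-product accumulate."""
    for i, a in enumerate(col):
        for j, b in enumerate(row):
            tile[i][j] += a * b

def matmul_by_outer_products(A, B):
    """Compute A @ B by accumulating one outer product per inner index k."""
    m, k_dim, n = len(A), len(B), len(B[0])
    tile = [[0.0] * n for _ in range(m)]   # plays the role of the accumulator tile
    for k in range(k_dim):
        col = [A[i][k] for i in range(m)]  # column k of A
        row = B[k]                         # row k of B
        outer_accumulate(tile, col, row)
    return tile

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_by_outer_products(A, B))
```

Each call to `outer_accumulate` does one multiply-add per tile element, which is why a single outer-product instruction can carry far more work than a vector fmla of the same width.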
 

tenthousandthings

Contributor
May 14, 2012
276
323
New Haven, CT
More likely, I'd expect, is M4 stays as is, and M4 Pro and Max (and Ultra, has to be an Ultra, surely?!?) built on N3P.

Which, if correct, MIGHT mean
- an M4 iMac soon (WWDC?)
- MBPs (some of which will want an M4 Pro/Max) and the rest still a few months
- mini with M4? Maybe also a delay, given that Apple probably wants to release it at the same time as the version with M4 Pro.

As you point out, the 6 month delay may be worth it if you get a boosted GPU (eg the A18 Pro GPU) as part of the M4 P/M/U package...

But it's also true that we tend to project a consistency onto these products that may not be real.
We've had surprises before (no-one expected the 3-core A8X, no-one expected the M4, no-one expected M3 Ultra to be skipped).
So maybe the path will actually be JUST an M4 (no Pro/Max), no M4 upgrades, and a full M5 suite, more or less in synch with A18 Pro, on N3P, in say early December (or at least M5 early December, to catch gift season, and M5 Pro/Max a month or two later)?
I was leaning toward M4 Pro/Max being on N3P around October/November. There is a precedent for changing the process node midstream, the A10 (iPhone 7) on TSMC 16nm in September 2016 and the A10X (iPad Pro 2) on TSMC 10nm in June 2017.

But I like your M5 idea better. It would mean the Studio and Pro get M3 Ultra at WWDC 2024, then M5 Ultra at WWDC 2025.
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,318
In their test code they accumulate to Z registers but I think they should be accumulating to ZA tile slices. That would mirror the AMX instruction semantics.
Problem with that theory is that the ARM spec does not allow for accumulating to Z registers, only to ZA (which makes sense in terms of the HW).

But which then makes you wonder...
- Is the code they are providing NOT the actual code they're executing? So the code they're executing is in fact accumulating to only one register. Which would match Corsix' M1 numbers.
OR
- Does the assembler they are using allow accumulation to a Z register by "helpfully" behind the scenes adding extra data movement instructions (ie the instructions are more like ACLE, less like pure assembly)?
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,318
Looking up MacroScalar I came across this written about 13 years ago. It was a good read for multiple reasons, not just the technical description of how MacroScalar might work but also from a historical perspective on Apple's entrance into microarchitecture design:


The nutshell version of something like MacroScalar is that rather than work with fixed sized registers and SIMD operations (even SVE-style) you write your code as standard scalar loops, but some of those loops are captured by the compiler and annotated as being amenable to SIMD'ization.
Everything after that happens in the CPU; at some point (possibly when the instructions are loaded to L2, possibly when they're loaded to L1, possibly even at Decode time) certain patterns of instructions are on-the-fly converted to a SIMD version appropriate for this hardware.

The primary win of this is that you get everything SVE gives you, without the various pain points; AND you don't burn up your 32b ISA space the way SVE has done.
The primary loss is that you give up perhaps even more control than SVE (eg for things like data rearrangement).

The idea admittedly sounds some combination of cool and insane. But others have suggested the same sort of idea, for example here's a version for POWER: https://libre-soc.org/openpower/sv/
Another way to look at this is that it's an implementation of "real" vector computing Cray-style ["hardware for loop"], rather than SIMD, and the implementation as SIMD or otherwise is a low-level detail unimportant to the developer.

Yet another way to look at this is to assume (who knows?) that Apple considered THE important part of MacroScalar to be not the "hardware for loop" part but the offloading to separate hardware part, and that's already present with AMX/SME so Mission Achieved, move on...

As I keep saying, the primary win in all these schemes is you get a better compiler target, and that's worth a lot. Hell, if Apple just adopts SVE128, and says they will NEVER change the width, so we get NEON plus predication, no need to test for load/store boundary conditions, and no loop tails, that's worth a few percent on "most" code, including non-FP code, and getting a few percent on current CPUs any other way is not easy...
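The "predication, no loop tails" point can be sketched in plain Python. The helper below is named after SVE's WHILELT instruction but only mimics its spirit: it builds a per-lane mask so the final partial chunk runs through the same loop body as the full ones. The vector length of 4 and the function names are assumptions for illustration.

```python
def whilelt(start, n, vl):
    """Build a predicate with lane i active while start + i < n."""
    return [start + i < n for i in range(vl)]

def predicated_add(a, b, vl=4):
    """Add two equal-length lists in fixed-width chunks with no scalar tail loop."""
    n = len(a)
    out = [0] * n
    for start in range(0, n, vl):
        pred = whilelt(start, n, vl)
        for lane, active in enumerate(pred):
            if active:                      # inactive lanes neither load nor store
                out[start + lane] = a[start + lane] + b[start + lane]
    return out

print(predicated_add([1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]))
```

With six elements and four lanes, the second pass simply runs with two lanes masked off, so there is no separate cleanup loop and no boundary test around the loads and stores.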
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
Problem with that theory is that the ARM spec does not allow for accumulating to Z registers, only to ZA (which makes sense in terms of the HW).

But which then makes you wonder...
- Is the code they are providing NOT the actual code they're executing? So the code they're executing is in fact accumulating to only one register. Which would match Corsix' M1 numbers.
OR
- Does the assembler they are using allow accumulation to a Z register by "helpfully" behind the scenes adding extra data movement instructions (ie the instructions are more like ACLE, less like pure assembly)?

I was under the impression that accumulating to Z registers is supported under streaming SVE. The ARM manual is really difficult to read, but it usually says so when an instruction is unsupported in streaming mode. No such warning is given for FMLA.
 

Populus

macrumors 603
Aug 24, 2012
5,941
8,411
Spain, Europe
I’ll just say that this M4 seems like quite a jump. I’m not sure what to expect from M5 but it will probably be an incremental upgrade, maybe using N3P to have some performance or efficiency gains.

If what’s been said here is true and N3P is heavily oriented towards efficiency, we might see the M5 aimed at the next Mac laptops like they said… but this is pure speculation on my part.

For me, the path should be quite clear: if next WWDC really entices me to get the new iPad Pro (with a better desktop UI experience, and/or allowing Mac apps to be installed), then I will go with it, hopefully as my main computer device. If not, if iPadOS is more of the same software-wise, I’ll finally get an M4 or M4 Pro Mac mini when it’s released.

The only advantage I could see to waiting for the M5 generation would be the adoption of 12GB as base RAM, but we’ve been speculating about that for too long at this point, just like the implementation of LPDDR6, so I’m not sure waiting any longer is worth it.
 

Populus

macrumors 603
Aug 24, 2012
5,941
8,411
Spain, Europe
My personal expectation is 2x higher GPU performance.
That would be awesome! And makes sense. I get that, with M3, Apple showed us their interest in gaming by adding Ray Tracing to their new GPU architecture, and the smart caching thing.

With M4 the GPU uses the same architecture, and the 12-15% performance gains could mostly be due to the new process node and clocks.

So yeah, now that Apple has brought the gaming attention to their platforms, it seems reasonable that the M5 would just finish the job with another performance jump.

That would make the M5 quite compelling as well… hmm…
 

Dulcimer

macrumors 6502a
Nov 20, 2012
967
1,148
My personal expectation is 2x higher GPU performance.
I would love to see that personally. But I don’t see that happening within a generation unless you expect the GPU core count increase to account for most of that 2x.

E.g., M4+/5 Max goes from 40 to say 60 GPU cores + another M3-style uarch bump + further mem bandwidth increase (LPDDR5X) + faster clocks with N3E or N3P.

Seems unlikely unless they’re willing to make Max die area significantly larger.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
Care to go through your reasoning?
I would love to see that personally. But I don’t see that happening within a generation unless you expect the GPU core count increase to account for most of that 2x.

Disclaimer: all I say here is obviously just speculation and to a large degree wishful thinking. Call it a hunch if you want. It is not entirely baseless though. I think if one puts together the timeline of Apple GPU changes, there appears to be a clear trajectory. Here is the basic idea.

For M3 GPU, they have redesigned the instruction scheduler so that it can select two instructions from two threads per clock and execute them simultaneously. This means that M3 can execute a FP32+FP16, FP32+INT, or FP16+INT instruction pair at the same time. This is already quite useful for some code, and I think the next obvious step is to redesign the execution units themselves to enable simultaneous FP32+FP32. This could be done by making the FP units larger (e.g. shipping two FP32 units instead of FP32 and FP16), or by making the INT32 unit capable of executing FP32 instructions. That is essentially what Nvidia did with Volta and I think this is where Apple is heading as well. The basic machinery is already in place, and this effective doubling of execution units can be achieved with only minimal increase in GPU die size. And it would instantly double the peak GPU compute.
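As a back-of-the-envelope illustration of that argument, the sketch below computes the best-case number of issue cycles for an instruction mix when at most two instructions issue per clock and each instruction type has a limited number of pipes. It ignores dependencies, register bandwidth and thread availability, and the instruction counts are invented; it is a toy model, not a description of Apple's scheduler.

```python
from math import ceil

def best_case_cycles(counts, pipes, issue_width=2):
    """Best-case cycles to issue an instruction mix.

    counts: instructions of each type, e.g. {"fp32": 100, "int": 20}
    pipes:  how many instructions of each type can issue per cycle
    Ignores dependencies, register bandwidth and thread availability.
    """
    total = sum(counts.values())
    width_bound = ceil(total / issue_width)
    pipe_bound = max(ceil(n / pipes[t]) for t, n in counts.items())
    return max(width_bound, pipe_bound)

mix = {"fp32": 100, "fp16": 0, "int": 20}
today = {"fp32": 1, "fp16": 1, "int": 1}      # one pipe per type, as described for M3
wider = {"fp32": 2, "fp16": 1, "int": 1}      # hypothetical: two FP32-capable pipes
print(best_case_cycles(mix, today), best_case_cycles(mix, wider))
```

For a pure FP32 mix the second configuration halves the cycle count in this model, which is the "double the peak compute" case; for mixed code the gain is smaller because the two-wide issue limit takes over.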
 

Populus

macrumors 603
Aug 24, 2012
5,941
8,411
Spain, Europe
Disclaimer: all I say here is obviously just speculation and to a large degree wishful thinking. Call it a hunch if you want. It is not entirely baseless though. I think if one puts together the timeline of Apple GPU changes, there appears to be a clear trajectory. Here is the basic idea.

For M3 GPU, they have redesigned the instruction scheduler so that it can select two instructions from two threads per clock and execute them simultaneously. This means that M3 can execute a FP32+FP16, FP32+INT, or FP16+INT instruction pair at the same time. This is already quite useful for some code, and I think the next obvious step is to redesign the execution units themselves to enable simultaneous FP32+FP32. This could be done by making the FP units larger (e.g. shipping two FP32 units instead of FP32 and FP16), or by making the INT32 unit capable of executing FP32 instructions. That is essentially what Nvidia did with Volta and I think this is where Apple is heading as well. The basic machinery is already in place, and this effective doubling of execution units can be achieved with only minimal increase in GPU die size. And it would instantly double the peak GPU compute.
No idea of what FP or INT (integers?) are, but if this is possible, doubling the performance would be awesome and worth waiting for the next gen GPU, whether in M5 or M6.

Like you said, Apple looks like it is quite interested in improving Apple Silicon GPU performance, and that would be a great way to achieve it. This is already bearing fruit, with more and more game developers releasing their games on Mac.

By the way, Gurman is saying that we won’t be seeing new Mac Studios until mid 2025. I wonder if these machines could lead the M5 generation or will stay within the M4 family.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
No idea of what FP or INT (integers?) are, but if this is possible, that would be really great and worth waiting for the next gen GPU, whether in M5 or M6.

FP is floating point. INT is integer.

The current architecture of M3 GPUs is this (taken from an Apple developer video):

[Image: M3 GPU architecture diagram, 57221-116366-IMG_4067-xl.jpg]


They describe four distinct ALU pipelines (FP32, FP16, Integer, and Complex, although I suspect that Integer and Complex might be the same pipeline). The scheduler can select two instructions for the different ALUs and execute them simultaneously. Right now these instructions must be of different types, e.g. one integer type and one full-precision floating point (FP32) type. My idea is that they are likely to tweak the ALU pipelines so that they can execute more types of instructions simultaneously.
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,318
Disclaimer: all I say here is obviously just speculation and to a large degree wishful thinking. Call it a hunch if you want. It is not entirely baseless though. I think if one puts together the timeline of Apple GPU changes, there appears to be a clear trajectory. Here is the basic idea.

For M3 GPU, they have redesigned the instruction scheduler so that it can select two instructions from two threads per clock and execute them simultaneously. This means that M3 can execute a FP32+FP16, FP32+INT, or FP16+INT instruction pair at the same time. This is already quite useful for some code, and I think the next obvious step is to redesign the execution units themselves to enable simultaneous FP32+FP32. This could be done by making the FP units larger (e.g. shipping two FP32 units instead of FP32 and FP16), or by making the INT32 unit capable of executing FP32 instructions. That is essentially what Nvidia did with Volta and I think this is where Apple is heading as well. The basic machinery is already in place, and this effective doubling of execution units can be achieved with only minimal increase in GPU die size. And it would instantly double the peak GPU compute.
The biggest bottleneck to this right now is the register cache. Already it can be oversubscribed by microbenchmark code (though usually not by real code). And widening that could be expensive.

To my eyes, the most obvious quick win is to add a scalar lane to execute scalar operations (and not waste full SIMD on them). They have uniform registers but for whatever reason they never provided a uniform execution path.

Next up in terms of quick wins is probably allowing FP16[2] operations to execute on pairs of FP16s. Either splitting these into one done on FP16 HW and one done on FP32 HW, or (slightly more work) rewiring FP32 HW to also support FP16[2].
Both of these match the route AMD and nV have already gone down, and neither puts that much additional pressure on the register cache.

The next stage, of allowing more general pairs of instructions to execute per cycle on whatever hardware is available, does put pressure on the register cache :-(
Maybe they can work around this to a good enough solution via heuristics that split the cache into three caches, eg an int register cache, an FP16 register cache, and an FP32 register cache?

BTW don't confuse what I am calling the register cache (the "REAL" register cache, holding values before they go into execution units) with the "register file cache" which is what is drawn in the diagram above, and which is an SRAM cache of register values which can spill to L2 if required, and is part of the M3 grand rearrangement of how SRAM is used by the GPU.
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,318
BTW, on another M4 topic, in fact SSVE works every bit as fast as expected on the M4, and can (when coded correctly) run as fast as a matrix outer product, so ~2000 FP32 GFLOPS.
Problem is the initial Jena results were written up by people who read the ARM manual but never bothered to read my manual (or the similar work by Corsix or Dougall).

Details here: https://www.realworldtech.com/forum/?threadid=217524&curpostid=217736

I expect over the next year or so a whole bunch of code (BLIS, EIGEN, Mathematica, MAPLE, etc) to start coding to SME and SSVE, and some pretty spectacular results.
 