M4+ Chip Generation - Speculation Megathread [MERGED]

HoxtonBridge · May 19, 2024

NikkoTuason said:
Base stats:

40 high-performance cores

20 high-efficiency cores

100-core GPU

80-core Neural Engine

8GB Unified memory

The base memory is laughable. 16gb ought to be the bare minimum for anything Mac....

leman · May 19, 2024

name99 said:
BTW, on another M4 topic, in fact SSVE works every bit as fast as expected on the M4, and can (when coded correctly) run as fast as a matrix outer product, so ~2000 FP32 GFLOPS.

I only got 250 GFLOPs, not 2000. I’ll play around with the code, maybe one can extract a bit more. At any rate this is very similar to what corsix reports for AMX, so I think the results are close to optimal.

name99 said:
Problem is the initial Jena results were written up by people who read the ARM manual but never bother to read my manual (or the similar work by Corsix or Dougall).

They did a quick and easy thing, which makes sense for an initial investigation. It did take me a while to understand how this stuff works. I agree that the low result should have made them suspicious.

There are two point that I find particularly interesting. First is that using these instructions in a truly scalable way is much tricker than I thought. The obvious thing - using SVE instructions - gives you bad performance. You have to use the ZA storage, and that’s much more involved. The way SME is designed is that results of operations are scattered across the tile storage and you have to collect them back for write out. Adds another layer of complexity if you primarily care about vector HPC. Not to mention that you need to work with pairs or even fours of SVE registers to get the best performance (I’ll have to experiment with different patterns). Not worth it unless you have many thousands of elements.

leman · May 19, 2024

name99 said:
The biggest bottleneck to this right now is the register cache. Already it can be oversubscribed by microbenchmark code (though usually not by real code). And widening that could be expensive.

That is true. I still think it might work well enough even with the current register cache. Apple can do 32-bit FP+Int most of the time without problems after all. And they are the only vendor who offers four-operand instructions. I mean, Nvidia does it and it works for them, even though they do get stalls here and there.

name99 said:
To my eyes, the most obvious quick win is to add a scalar lane to execute scalar operations (and not waste full SIMD on them). They have uniform registers but for whatever reason they never provided a uniform execution path.

Do you think it is worth the effort? I doubt it will improve performance. It’s still a second instruction. Maybe it could save some power, dies sound like extra complexity.

name99 said:
Next up in terms of quick wins is probably allowing FP16[2] operations to execute on pairs of FP16s. Either splitting these into one done in FP16 HW, one done on FP32 HW or (slightly more work) rewiring FP32 HW to also support FP16[2].
Both of these match the route AMD and nV has already gone done, and both don't put that much additional pressure on the register cache.

Doesn’t seem to work well for AMD. And Nvidia abandoned this path. Too situational, IMO. Would help with matmul though.

Xiao_Xi · May 19, 2024

leman said:
I only got 250 GFLOPs, not 2000. I’ll play around with the code, maybe one can extract a bit more. At any rate this is very similar to what corsix reports for AMX, so I think the results are close to optimal.

Jena's benchmark result.

Core Type	#Tiles	FP32 GFLOPS
P	1	502.2
P	2	1004.7
P	4	2009.6
E	1	116.3
E	2	116.0
E	4	116.1

Microbenchmarks | Hello SME documentation

By the way, they have moved their code to Github.

GitHub - scalable-analyses/sme

Contribute to scalable-analyses/sme development by creating an account on GitHub.

github.com

leman · May 19, 2024

Xiao_Xi said:
Jena's benchmark result.

Core Type #Tiles FP32 GFLOPS
P 1 502.2
P 2 1004.7
P 4 2009.6
E 1 116.3
E 2 116.0
E 4 116.1

Microbenchmarks | Hello SME documentation

By the way, they have moved their code to Github.

GitHub - scalable-analyses/sme

Contribute to scalable-analyses/sme development by creating an account on GitHub.

github.com

That is still matrix, not vector.

deconstruct60 · May 19, 2024

leman said:
Disclaimer: all I say here is obviously just speculation and to large degree wishful thinking. Call it a hunch if you want. It is not entirely baseless though. I think if one puts together the timeline of Apple GPU changes, there appears to be a clear trajectory. Here is the basic idea.

leman said:
This could be done by making the FP units larger (e.g. shipping two FP32 units instead of FP32 and FP16), or by making the INT32 unit capable of executing FP32 instructions. That is essentially what Nvidia did with Volta and I think this is where Apple is heading as well.

Volta isn't really primarily graphics oriented. It is a datacenter compute focused chip. Most folks are likely going to thinkg "2x GPU" primarily has impact on their screen, not embarrassingly parallel data into a file (or ephemeral memory buffer).

More FP can be helpful, but it isn't the whole general flow.

leman said:
The basic machinery is already in place, and this effective doubling of execution units can be achieved with only minimal increase in GPU die size. And it would instantly double the peak GPU compute.

Peak GPU compute isn't usually the norm. Apple has corner cases where they beat a 4090 now; but they are corner cases.

name99 · May 19, 2024

leman said:
That is true. I still think it might work well enough even with the current register cache. Apple can do 32-bit FP+Int most of the time without problems after all. And they are the only vendor who offers four-operand instructions. I mean, Nvidia does it and it works for them, even though they do get stalls here and there.

Do you think it is worth the effort? I doubt it will improve performance. It’s still a second instruction. Maybe it could save some power, dies sound like extra complexity.

Doesn’t seem to work well for AMD. And Nvidia abandoned this path. Too situational, IMO. Would help with matmul though.

It means you are not wasting a SIMD of execution just to calculate a single value.
If you calculate scalar values more than 3% of the time, it's a net win!

What do you mean by "Apple can do 32-bit FP+Int most of the time without problems after all."
You mean that, eg, Apple can sustain execution of 128 FMA32 + INT16 128 integer ops per cycle? I've never seen any claim of this.
As far as I can tell, Apple has four schedulers each feeding a 32 element SIMD, and each scheduler can feed a maximum of 1 instruction per cycle. Phillip Turner says something different on his benchmarking page, but I cannot make head or tail of what he is saying. He does not describe his mental model of the hardware, and I can't reverse engineer it from his description.

Are you referring to post-M1 improvements? I would be interested in seeing such numbers.
Something I COULD believe occurring in post-M1 designs is that shuffle becomes essentially free. Shuffle takes place between the register cache and execution.
For the M1 matrix instructions it happens automatically.
For OTHER instructions, an obvious improvement would be to provide fused instructions that pair a shuffle with the subsequent instruction so that the shuffle just happens as the data moves to the subsequent add or whatever. But that's specifically shuffle, it's not generic integer instructions. I don't know if they do that yet, but it's such an obvious win I'd be disappointed if it hasn't been taken up yet.

name99 · May 19, 2024

deconstruct60 said:
Volta isn't really primarily graphics oriented. It is a datacenter compute focused chip. Most folks are likely going to thinkg "2x GPU" primarily has impact on their screen, not embarrassingly parallel data into a file (or ephemeral memory buffer).

What Leman is referring to is that pre-Volta had separate INT and FP execution hardware, but in a single cycle only one of an FP or an INT instruction could be dispatched. With Volta this was tweaked so that both an INT and FP instruction could be dispatched. nV said at the time that about 30% of instructions were INT, so this basically improved performance by 1/.7=1.4x

BUT for Apple it's a slightly different tradeoff. nVidia has more time (their double-pumped lane stuff) to do easier (track dependencies in instruction bits) scheduling. AND their addressing is more lame than Apple's, so more cycles are required to calculate addresses that Apple can calculate in the load/store instruction. So the payoff for Apple is not as great.

But now that Apple are no longer enslaved to the exact size of the register file and the threadblock scratchpad, there's the opportunity to rebalance the design in all sorts of ways. I could see this resulting in ideas they have not used before (though PowerVR did) like running the 32-wide SIMD as a 16-wide SIMD that executes each instruction twice, which in turn makes it a lot easier to start doing things like effectively executing two instructions per "cycle", eg one to FP32 and one to FP16, or one to FP32 and one to INT.

name99 · May 19, 2024

leman said:
I only got 250 GFLOPs, not 2000. I’ll play around with the code, maybe one can extract a bit more. At any rate this is very similar to what corsix reports for AMX, so I think the results are close to optimal.

Did Corsix give performance numbers for the two and four vector variants? Presumably you have to use those to get max throughput. It wasn't clear to me from the link I gave if Eric had *measured* that he got 2000 GFLOPs or not.

It's certainly possible that, even though the double and quad vector variants allow for more compute with fewer instructions, the current implementation limits how many registers can be moved into the execution units per cycle, which in turn throttles how much you can do per cycle as non-outer-product.

OTOH, if you are doing eg long dots you're also throttled by the load/store bandwidth, so if you can offload some of the computes into the 2 or 4x variants, leaving more instruction bandwidth for loads, that's a net win!

leman · May 19, 2024

name99 said:
Did Corsix give performance numbers for the two and four vector variants? Presumably you have to use those to get max throughput.

I am not aware of any two/four vector variants on AMX.

name99 said:
It wasn't clear to me from the link I gave if Eric had *measured* that he got 2000 GFLOPs or not.

Sorry for confusion. Eric Fink on RRW is my alias. I got 250 GFLOPS using the four vector FMLA variant. I think that's a data movement limitation. The ALUs themselves are fully capable of 2000 GFLOPs when doing outer products on two 16-wide FP32 vectors. But they can only sustain 1/8 of that when doing regular vector ops.

name99 said:
If you calculate scalar values more than 3% of the time, it's a net win!

I can see this is a win in energy consumption, not in anything else though. You are still limited by how many instructions can be scheduled simultaneously. It would be a performance win if the scalar instructions were scheduled independently, and that requires a very different design. Anyway, don't seem like that's the direction Apple is moving towards.

name99 said:
What do you mean by "Apple can do 32-bit FP+Int most of the time without problems after all."
You mean that, eg, Apple can sustain execution of 128 FMA32 + INT16 128 integer ops per cycle? I've never seen any claim of this.
As far as I can tell, Apple has four schedulers each feeding a 32 element SIMD, and each scheduler can feed a maximum of 1 instruction per cycle.

I am referring to M3. The new GPU scheduler can issue two instructions per clock to different functional units within the same partition. So they can do FP+INT or FP32+FP16 simultaneously.

The caveat is that when doing dual-issue a full INT32 instructions can only be issued every other cycle. It is possible that the current register cache can only sustain 32bit + 16bit reads for each operand and lane. They would indeed need to widen that if they wanted to go double FP32+FP32.

name99 said:
Phillip Turner says something different on his benchmarking page, but I cannot make head or tail of what he is saying. He does not describe his mental model of the hardware, and I can't reverse engineer it from his description.

I have the same feeling about Philipp's results. The way he describes the hardware does not fit with my understanding.

name99 said:
Are you referring to post-M1 improvements? I would be interested in seeing such numbers.

Here you go, this is the test I did a while ago evaluating the peak compute of various instruction mixes. That's on the 30-core M3 Max

name99 said:
What Leman is referring to is that pre-Volta had separate INT and FP execution hardware, but in a single cycle only one of an FP or an INT instruction could be dispatched. With Volta this was tweaked so that both an INT and FP instruction could be dispatched. nV said at the time that about 30% of instructions were INT, so this basically improved performance by 1/.7=1.4x

BUT for Apple it's a slightly different tradeoff. nVidia has more time (their double-pumped lane stuff) to do easier (track dependencies in instruction bits) scheduling. AND their addressing is more lame than Apple's, so more cycles are required to calculate addresses that Apple can calculate in the load/store instruction. So the payoff for Apple is not as great.

But now that Apple are no longer enslaved to the exact size of the register file and the threadblock scratchpad, there's the opportunity to rebalance the design in all sorts of ways. I could see this resulting in ideas they have not used before (though PowerVR did) like running the 32-wide SIMD as a 16-wide SIMD that executes each instruction twice, which in turn makes it a lot easier to start doing things like effectively executing two instructions per "cycle", eg one to FP32 and one to FP16, or one to FP32 and one to INT.

As I said, they already implemented dual-issue in M3. And it seems to be a real dual-issue, not Nvidia's smart "even-odd issue with two clocks per instruction". Apple also has a nice patent describing this scheme, it's a fairly sophisticated mechanism. Much of my hypothesis is based on the assumption that they didn't go through all that trouble just to have dual FP32+FP16. Real performance wins are with FP32+FP32 or FP16+FP16 on modern shader code.

name99 · May 20, 2024

leman said:
I am not aware of any two/four vector variants on AMX.

Two vector variant was addded w/ M2, 4 vector w/ M3.

leman said:
Sorry for confusion. Eric Fink on RRW is my alias. I got 250 GFLOPS using the four vector FMLA variant. I think that's a data movement limitation. The ALUs themselves are fully capable of 2000 GFLOPs when doing outer products on two 16-wide FP32 vectors. But they can only sustain 1/8 of that when doing regular vector ops.

👍

leman said:
I can see this is a win in energy consumption, not in anything else though. You are still limited by how many instructions can be scheduled simultaneously. It would be a performance win if the scalar instructions were scheduled independently, and that requires a very different design. Anyway, don't seem like that's the direction Apple is moving towards.

leman said:
I am referring to M3. The new GPU scheduler can issue two instructions per clock to different functional units within the same partition. So they can do FP+INT or FP32+FP16 simultaneously.

The caveat is that when doing dual-issue a full INT32 instructions can only be issued every other cycle. It is possible that the current register cache can only sustain 32bit + 16bit reads for each operand and lane. They would indeed need to widen that if they wanted to go double FP32+FP32.

I guess this is courtesy of the new "mini-front-ends" they have added to the GPU...
https://patents.google.com/patent/US11422822B2
Nice to confirm it works as I interpreted it to.
This is one reason I think a scalar lane is feasible. They have the ability to schedule to it now, and it won't hurt register cache bandwidth (uses a different, uniforms, cache).

leman said:
Here you go, this is the test I did a while ago evaluating the peak compute of various instruction mixes. That's on the 30-core M3 Max

As I said, they already implemented dual-issue in M3. And it seems to be a real dual-issue, not Nvidia's smart "even-odd issue with two clocks per instruction". Apple also has a nice patent describing this scheme, it's a fairly sophisticated mechanism. Much of my hypothesis is based on the assumption that they didn't go through all that trouble just to have dual FP32+FP16. Real performance wins are with FP32+FP32 or FP16+FP16 on modern shader code.

Ha ha! I guess we both read that same patent

ader42 · May 20, 2024

HoxtonBridge said:
The base memory is laughable. 16gb ought to be the bare minimum for anything Mac....

Nah. You got choices, lots of people only need 8gb. Not me of course, but the people I know that are not creatives.

HoxtonBridge · May 22, 2024

ader42 said:
Nah. You got choices, lots of people only need 8gb. Not me of course, but the people I know that are not creatives.

Very few people work with only one application running. Run two Office apps, safari, and have Onedrive or similar running in the background, things will start to get very slow. There is no reason the base RAM of any Mac needs to be below 16 GB... even with the unnecessarily high price of Apple memory.

MayaUser · May 22, 2024

is not about the base specs...but the base price...i bet Apple would increase the price if they go 16gb+512ssd and they will lose some customers
Remember for a lot its just a symbol to own a mac...and those dont need anything but base model
If apple will keep the 999 price for 16gb ram and 512ssd...then yes, of course it is a must and it will be the starting base spec for the next decade..or maybe decades (joking)

Torty · May 22, 2024

HoxtonBridge said:
Very few people work with only one application running. Run two Office apps, safari, and have Onedrive or similar running in the background, things will start to get very slow. There is no reason the base RAM of any Mac needs to be below 16 GB... even with the unnecessarily high price of Apple memory.

Something wrong with your system if your 8GB machine can't handle that.

ader42 · May 22, 2024

HoxtonBridge said:
Very few people work with only one application running. Run two Office apps, safari, and have Onedrive or similar running in the background, things will start to get very slow. There is no reason the base RAM of any Mac needs to be below 16 GB... even with the unnecessarily high price of Apple memory.

There is no reason to take away peoples choices.

HoxtonBridge · May 23, 2024

Torty said:
Something wrong with your system if your 8GB machine can't handle that.

It would handle it, just not very well!

HoxtonBridge · May 23, 2024

ader42 said:
There is no reason to take away peoples choices.

Why not offer 2 or 4 GB then?

eldho · May 23, 2024

HoxtonBridge said:
Why not offer 2 or 4 GB then?

I assume that you are well aware of the logical answer to this frivolous question.

falainber · May 23, 2024

eldho said:
I assume that you are well aware of the logical answer to this frivolous question.

So 4GB is frivolous but 8GB is a "choice". Many people are arguing that nowadays 8GB is frivolous.

EugW · May 23, 2024

falainber said:
So 4GB is frivolous but 8GB is a "choice". Many people are arguing that nowadays 8GB is frivolous.

8 GB works OK, but 4 GB (on Intel machines) does not, with recent macOSes.

You can actually run Monterey on 4 GB, but you don't want to. In contrast, Monterey and runs well on 8 GB RAM for light business usage. And lots of people use 8 GB just fine on Sonoma.

However, my prediction/hope is that the next MacBook Air will ship with 12 GB base. 12 GB at the entry level would be great, and more than sufficient for the majority of the population. Jumping all the way up to 16 GB is not necessary for the MB Air. I agree the higher priced MacBook Pro should start at 16 GB or 18 GB though.

SoldOnApple · May 23, 2024

Apple must be holding back so they can push ahead once competing ARM systems make it to market. Like with iPhone, early on Samsung and others had faster phones sometimes, but then Apple pushed far ahead and no one else even bothers competing on real world benchmarks.

deconstruct60 · May 24, 2024

SoldOnApple said:
Apple must be holding back so they can push ahead once competing ARM systems make it to market. Like with iPhone, early on Samsung and others had faster phones sometimes, but then Apple pushed far ahead and no one else even bothers competing on real world benchmarks.

Not much sign of that. Qualcomm's earlier schedule had them releasing "X Elite" in Fall 2023 timeframe. Apple rushed the MBP 14/16" on a short cycle from to M3 in Fall 2023. Turns out Qualcomm had more fine tuning to do and Windows 11 (or pre-12 or 12) wasn't ready so it happen to slide to Q2 2024 (June). The iPad Pro moves to M4 about a month before the Surface Pro.

Apple likes to position iPad Pro as not a 'tablet' ( iPad being something inherently better than a tablet because generally 'tablets' are 'bad' ) , but iPad Pro and Surface Pro have competitive overlap.

Indications are that Apple is trying to arrive 'before' not 'after' the Windows on Arm moves. ( The Windows on Arm moves are not only issue. AMD and Intel have cranked up their focus on more efficient SoCs for Windows also. They aren't gong to remove the efficiency gap in one move , but the gap is going to shrink a bit. )

The MBP line up probably will get another bump to M4 in the Fall, but in some sense Apple has already revealed that elements of that move already with they iPad Pro move. ( cores counts are not grounded , but probably have decent idea of single thread uplift going to see; because it is basically the same across dies sizes. )

The Windows on Arm is mainly limited to laptops though. ( There is a Mini like "Windows developer device" from Qualcomm coming. So the Mini moving to M4/M4 Pro in 2024 is likely going to appear. If there is enough X Elite/Plus supply chain to go around there probably will be some smaller Mini-PC players that drop a 'headless' laptop. )

Where not going to be much Arm movement is in larger desktop space. Linux Arm workstation perhaps, but I doubt that is a creditable threat to Apple ( any more than hackintosh was. Somewhat similar in folks taking Ampere Computing server processors and making workstation with that to run deskside Linux. Linus is doing it. ). Higher end desktop is likely where Apple will be 'holding back". Not so much on competitive pressures though ( lots of movement coming in 2H 2024 in GPU space, AI/ML accelerator , and also in general Workstation space. ). Pushed to 'churn' the Pro and Max dies quicker in the laptops could/should possibly have side effects on the rest of the product line.

Apple is not so far ahead that they can 'hold back'. Arm is ramping up on making the Phone SoCs more competitive ( Arm is shifting from trying to hold onto the 'lowest price SoC" market. RISC-V is coming up from 'below' ) . Nvidia and AMD have synergy deals to move their GPUs into SoCs. Both are rumored to be working on Arm SoCs to move Qualcomm out of the PC laptop placements.

Vision Pro is going to need substantively better Mn and R2 .

ader42 · May 25, 2024

HoxtonBridge said:
Why not offer 2 or 4 GB then?

You likely know full well that 4GB isn’t enough to adequately (for a premium device) run MacOS and a single regular business app like Word whereas 8GB is enough for a great deal of users. Apple certainly does.

I’m pretty sure Apple know what they are doing and that if 8GB was too little people would either not buy them or they would return them - after all Apple have the most generous returns policy of any company I’ve known.

So to answer your question I’ll ask one back, why not offer only 8GB or 64GB?
The answer of course is that Apple likes to give people choices to buy what they themselves want/need.

The fact is, anyone moaning about the minimum being 8GB is in reality moaning about the price.

Let’s try to keep on topic though, M4… if the M4 had 16GB base, the Pro would have 36GB (minimum) and the Max 72GB (minimum). These jumps are a little large from a marketing perspective as it would increase the price and performance differentials between levels of chip, this would actually reduce consumer choice and increase pricing overall.

Apple knows what it is doing when it designs the chips not only from a tech point of view but from a marketing and Mac sales point of view. One has to look at the big picture and not focus on the base entry level chip.

HoxtonBridge · May 25, 2024

ader42 said:
You likely know full well that 4GB isn’t enough to adequately (for a premium device) run MacOS and a single regular business app like Word whereas 8GB is enough for a great deal of users. Apple certainly does.

How many people do you kn ow who run just one application like Word without at least Email and a browser open?

ader42 said:
I’m pretty sure Apple know what they are doing and that if 8GB was too little people would either not buy them or they would return them - after all Apple have the most generous returns policy of any company I’ve known.

So to answer your question I’ll ask one back, why not offer only 8GB or 64GB?
The answer of course is that Apple likes to give people choices to buy what they themselves want/need.

You assume my comment is embedded in the logic of choice; I was showing the limitations of that position. You have just posited an additional illustration of its silliness.

ader42 said:
The fact is, anyone moaning about the minimum being 8GB is in reality moaning about the price.

No! I haven't mentioned the price, although the price of Apple memory is universally regarded as absurdly high.

ader42 said:
Let’s try to keep on topic though, M4… if the M4 had 16GB base, the Pro would have 36GB (minimum) and the Max 72GB (minimum).

More illogical propositions. How did you get to that?

ader42 said:
These jumps are a little large from a marketing perspective

Well yes, but again I'm not especially interested in marketing.

ader42 said:
as it would increase the price and performance differentials between levels of chip, this would actually reduce consumer choice and increase pricing overall.

Apple knows what it is doing when it designs the chips not only from a tech point of view but from a marketing and Mac sales point of view. One has to look at the big picture and not focus on the base entry level chip.

I'm not looking at entry-level chips, I was querying the efficacy of 8gb or memory even for a basic user. You disagree; that's fine, but don't insult my intelligence with a bunch of straw man calculations you have pulled out of nowhere, and fan-boy nonsense about Apple's omniscience.

M4+ Chip Generation - Speculation Megathread [MERGED]

macrumors newbie

macrumors Core

macrumors Core

macrumors 68000

macrumors Core

macrumors G5

macrumors 68030

macrumors 68030

macrumors 68030

macrumors Core

macrumors 68030

macrumors 6502

macrumors newbie

macrumors 68040

macrumors 65816

macrumors 6502

macrumors newbie

macrumors newbie

macrumors regular

macrumors 68040

macrumors P6

macrumors 65816

macrumors G5

macrumors 6502

macrumors newbie

Our Staff