
leman

macrumors Core
Oct 14, 2008
19,521
19,675
BTW, on another M4 topic, in fact SSVE works every bit as fast as expected on the M4, and can (when coded correctly) run as fast as a matrix outer product, so ~2000 FP32 GFLOPS.

I only got 250 GFLOPs, not 2000. I’ll play around with the code, maybe one can extract a bit more. At any rate this is very similar to what corsix reports for AMX, so I think the results are close to optimal.

Problem is the initial Jena results were written up by people who read the ARM manual but never bothered to read my manual (or the similar work by Corsix or Dougall).

They did a quick and easy thing, which makes sense for an initial investigation. It did take me a while to understand how this stuff works. I agree that the low result should have made them suspicious.

There are two points that I find particularly interesting. First is that using these instructions in a truly scalable way is much trickier than I thought. The obvious thing - using SVE instructions - gives you bad performance. You have to use the ZA storage, and that's much more involved. The way SME is designed, results of operations are scattered across the tile storage and you have to collect them back for write-out. That adds another layer of complexity if you primarily care about vector HPC. Not to mention that you need to work with pairs or even fours of SVE registers to get the best performance (I'll have to experiment with different patterns). Not worth it unless you have many thousands of elements.
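To make the "scattered across the tile storage" point concrete, here is a toy Python sketch. This is my simplification, not real SME semantics (vector length, predication, and the actual FMOPA encoding are all elided): doing an elementwise FMA through an outer-product tile leaves the useful results on the diagonal of the ZA storage, and you need an explicit collection pass before write-out.

```python
# Toy model of SME-style ZA tile accumulation (illustrative only; the
# vector length and instruction semantics are heavily simplified).
VL = 4  # pretend vector length (a 512-bit SVE vector holds 16 FP32 lanes)

def fmopa(za, zn, zm):
    """Outer-product accumulate: za[i][j] += zn[i] * zm[j]."""
    for i in range(VL):
        for j in range(VL):
            za[i][j] += zn[i] * zm[j]
    return za

za = [[0.0] * VL for _ in range(VL)]
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

# Elementwise a[i]*b[i] done via the tile: feed one lane of `a` at a time,
# so the values we actually want land on the diagonal of the tile...
for i in range(VL):
    e = [0.0] * VL
    e[i] = a[i]
    fmopa(za, e, b)

# ...and a separate pass has to collect them back for write-out,
# stepping around the off-diagonal entries the outer product also wrote.
result = [za[i][i] for i in range(VL)]
print(result)  # [10.0, 40.0, 90.0, 160.0]
```

Note it took VL outer-product instructions to get VL useful products, which is exactly why naive vector-style use of the tile hardware performs badly.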
 
  • Like
Reactions: Dulcimer and name99

leman

macrumors Core
Oct 14, 2008
19,521
19,675
The biggest bottleneck to this right now is the register cache. Already it can be oversubscribed by microbenchmark code (though usually not by real code). And widening that could be expensive.

That is true. I still think it might work well enough even with the current register cache. Apple can do 32-bit FP+Int most of the time without problems after all. And they are the only vendor who offers four-operand instructions. I mean, Nvidia does it and it works for them, even though they do get stalls here and there.

To my eyes, the most obvious quick win is to add a scalar lane to execute scalar operations (and not waste full SIMD on them). They have uniform registers but for whatever reason they never provided a uniform execution path.

Do you think it is worth the effort? I doubt it will improve performance. It's still a second instruction. Maybe it could save some power, but it does sound like extra complexity.

Next up in terms of quick wins is probably allowing FP16[2] operations to execute on a pair of FP units: either splitting these into one half done in the FP16 HW and one half done on the FP32 HW, or (slightly more work) rewiring the FP32 HW to also support FP16[2].
Both of these match the route AMD and nV have already gone down, and neither puts that much additional pressure on the register cache.

Doesn’t seem to work well for AMD. And Nvidia abandoned this path. Too situational, IMO. Would help with matmul though.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
I only got 250 GFLOPs, not 2000. I’ll play around with the code, maybe one can extract a bit more. At any rate this is very similar to what corsix reports for AMX, so I think the results are close to optimal.
Jena's benchmark result.

Core Type | #Tiles | FP32 GFLOPS
P | 1 | 502.2
P | 2 | 1004.7
P | 4 | 2009.6
E | 1 | 116.3
E | 2 | 116.0
E | 4 | 116.1

By the way, they have moved their code to GitHub.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
Disclaimer: all I say here is obviously just speculation and to large degree wishful thinking. Call it a hunch if you want. It is not entirely baseless though. I think if one puts together the timeline of Apple GPU changes, there appears to be a clear trajectory. Here is the basic idea.


This could be done by making the FP units larger (e.g. shipping two FP32 units instead of FP32 and FP16), or by making the INT32 unit capable of executing FP32 instructions. That is essentially what Nvidia did with Volta and I think this is where Apple is heading as well.

Volta isn't really primarily graphics oriented. It is a datacenter compute focused chip. Most folks are likely going to think "2x GPU" primarily has an impact on their screen, not on embarrassingly parallel data crunched into a file (or an ephemeral memory buffer).

More FP can be helpful, but it isn't the whole general flow.

The basic machinery is already in place, and this effective doubling of execution units can be achieved with only minimal increase in GPU die size. And it would instantly double the peak GPU compute.

Peak GPU compute isn't usually the norm. Apple has corner cases where they beat a 4090 now; but they are corner cases.
 
  • Like
Reactions: Dulcimer

name99

macrumors 68020
Jun 21, 2004
2,410
2,318
That is true. I still think it might work well enough even with the current register cache. Apple can do 32-bit FP+Int most of the time without problems after all. And they are the only vendor who offers four-operand instructions. I mean, Nvidia does it and it works for them, even though they do get stalls here and there.



Do you think it is worth the effort? I doubt it will improve performance. It's still a second instruction. Maybe it could save some power, but it does sound like extra complexity.



Doesn’t seem to work well for AMD. And Nvidia abandoned this path. Too situational, IMO. Would help with matmul though.
It means you are not wasting a SIMD of execution just to calculate a single value.
If you calculate scalar values more than 3% of the time, it's a net win!
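To spell out the arithmetic behind that 3% figure (my reading of it, not anything Apple has published): a scalar op issued to a 32-wide SIMD lights up 1 of 32 lanes, while a dedicated scalar lane would cost roughly one lane's worth of datapath.

```python
# Back-of-the-envelope for the "3%" claim (assumptions are mine).
simd_width = 32

wasted_lanes_per_scalar_op = simd_width - 1   # 31 idle lanes per scalar op
break_even_fraction = 1 / simd_width          # lane's area cost ~= 1/32

print(f"lane utilization of a scalar op on SIMD: {1/simd_width:.1%}")
print(f"break-even scalar fraction: {break_even_fraction:.1%}")  # ~3.1%
```

So if more than ~3% of the dynamic instruction mix is scalar, the extra lane pays for its area, on this crude model.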

What do you mean by "Apple can do 32-bit FP+Int most of the time without problems after all."
You mean that, eg, Apple can sustain execution of 128 FMA32 + 128 INT16 integer ops per cycle? I've never seen any claim of this.
As far as I can tell, Apple has four schedulers, each feeding a 32-element SIMD, and each scheduler can feed a maximum of 1 instruction per cycle. Philip Turner says something different on his benchmarking page, but I cannot make head or tail of what he is saying. He does not describe his mental model of the hardware, and I can't reverse engineer it from his description.

Are you referring to post-M1 improvements? I would be interested in seeing such numbers.
Something I COULD believe occurring in post-M1 designs is that shuffle becomes essentially free. Shuffle takes place between the register cache and execution.
For the M1 matrix instructions it happens automatically.
For OTHER instructions, an obvious improvement would be to provide fused instructions that pair a shuffle with the subsequent instruction so that the shuffle just happens as the data moves to the subsequent add or whatever. But that's specifically shuffle, it's not generic integer instructions. I don't know if they do that yet, but it's such an obvious win I'd be disappointed if it hasn't been taken up yet.
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,318
Volta isn't really primarily graphics oriented. It is a datacenter compute focused chip. Most folks are likely going to think "2x GPU" primarily has an impact on their screen, not on embarrassingly parallel data crunched into a file (or an ephemeral memory buffer).
What Leman is referring to is that pre-Volta had separate INT and FP execution hardware, but in a single cycle only one of an FP or an INT instruction could be dispatched. With Volta this was tweaked so that both an INT and FP instruction could be dispatched. nV said at the time that about 30% of instructions were INT, so this basically improved performance by 1/.7=1.4x
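That 1/.7 arithmetic generalizes simply. The sketch below is the idealized version (it ignores dependency stalls and register-cache port limits): if fraction f of the stream is INT and INT can now co-issue with FP, the serialized portion shrinks to (1 - f) of the original.

```python
# Idealized dual-issue speedup: fraction f of instructions move off the
# critical issue stream because they now co-issue with the FP stream.
def dual_issue_speedup(f):
    return 1 / (1 - f)

# nV's quoted figure: ~30% INT instructions.
print(f"{dual_issue_speedup(0.30):.2f}x")  # ~1.43x
```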

BUT for Apple it's a slightly different tradeoff. nVidia has more time (their double-pumped lane stuff) to do easier (track dependencies in instruction bits) scheduling. AND their addressing is more lame than Apple's, so more cycles are required to calculate addresses that Apple can calculate in the load/store instruction. So the payoff for Apple is not as great.

But now that Apple are no longer enslaved to the exact size of the register file and the threadblock scratchpad, there's the opportunity to rebalance the design in all sorts of ways. I could see this resulting in ideas they have not used before (though PowerVR did) like running the 32-wide SIMD as a 16-wide SIMD that executes each instruction twice, which in turn makes it a lot easier to start doing things like effectively executing two instructions per "cycle", eg one to FP32 and one to FP16, or one to FP32 and one to INT.
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,318
I only got 250 GFLOPs, not 2000. I’ll play around with the code, maybe one can extract a bit more. At any rate this is very similar to what corsix reports for AMX, so I think the results are close to optimal.
Did Corsix give performance numbers for the two and four vector variants? Presumably you have to use those to get max throughput. It wasn't clear to me from the link I gave if Eric had *measured* that he got 2000 GFLOPs or not.

It's certainly possible that, even though the double and quad vector variants allow for more compute with fewer instructions, the current implementation limits how many registers can be moved into the execution units per cycle, which in turn throttles how much you can do per cycle as non-outer-product.

OTOH, if you are doing eg long dots you're also throttled by the load/store bandwidth, so if you can offload some of the computes into the 2 or 4x variants, leaving more instruction bandwidth for loads, that's a net win!
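Here is a toy issue-slot accounting for that long-dot scenario. Assumptions are mine: one AMX/SME instruction issued per cycle, two loads per input vector (one each for the a and b strips), and a w-vector FMLA retiring w vectors' worth of math in a single slot.

```python
# Toy issue-slot model for a long FP32 dot product on a 1-wide issue port.
def slots_per_input_vector(w):
    loads = 2        # one load each for the a and b strips of this vector
    math = 1 / w     # the 2x/4x FMLA variants amortize the math slot
    return loads + math

for w in (1, 2, 4):
    print(f"{w}-vector FMLA: {slots_per_input_vector(w):.2f} slots/vector")
# 1 -> 3.00, 2 -> 2.50, 4 -> 2.25
```

The win shrinks as w grows because the loads dominate, which is the "throttled by load/store bandwidth" point; multi-vector load forms would shift the numbers further.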
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
Did Corsix give performance numbers for the two and four vector variants? Presumably you have to use those to get max throughput.

I am not aware of any two/four vector variants on AMX.

It wasn't clear to me from the link I gave if Eric had *measured* that he got 2000 GFLOPs or not.

Sorry for confusion. Eric Fink on RRW is my alias. I got 250 GFLOPS using the four vector FMLA variant. I think that's a data movement limitation. The ALUs themselves are fully capable of 2000 GFLOPs when doing outer products on two 16-wide FP32 vectors. But they can only sustain 1/8 of that when doing regular vector ops.
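For the record, the ~2000 GFLOPS figure falls straight out of the outer-product shape. The ~4.0 GHz clock below is my assumption for the M4 P cluster; everything else follows from the geometry.

```python
# Peak FP32 throughput of a 16x16 outer-product accumulate per cycle.
lanes = 16                  # FP32 lanes in a 512-bit SVE vector
tile_fmas = lanes * lanes   # one 16x16 FP32 outer-product FMA tile/cycle
flops_per_fma = 2           # multiply + add
clock_hz = 4.0e9            # assumed M4 P-cluster clock

peak_gflops = tile_fmas * flops_per_fma * clock_hz / 1e9
print(f"outer product: ~{peak_gflops:.0f} GFLOPS")      # ~2048
print(f"1/8 of that:   ~{peak_gflops / 8:.0f} GFLOPS")  # ~256
```

The 1/8 figure lands right on the ~250 GFLOPS measured for regular vector FMLA.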

If you calculate scalar values more than 3% of the time, it's a net win!

I can see this is a win in energy consumption, not in anything else though. You are still limited by how many instructions can be scheduled simultaneously. It would be a performance win if the scalar instructions were scheduled independently, and that requires a very different design. Anyway, it doesn't seem like that's the direction Apple is moving towards.

What do you mean by "Apple can do 32-bit FP+Int most of the time without problems after all."
You mean that, eg, Apple can sustain execution of 128 FMA32 + INT16 128 integer ops per cycle? I've never seen any claim of this.
As far as I can tell, Apple has four schedulers each feeding a 32 element SIMD, and each scheduler can feed a maximum of 1 instruction per cycle.

I am referring to M3. The new GPU scheduler can issue two instructions per clock to different functional units within the same partition. So they can do FP+INT or FP32+FP16 simultaneously.

The caveat is that when doing dual-issue a full INT32 instruction can only be issued every other cycle. It is possible that the current register cache can only sustain 32-bit + 16-bit reads for each operand and lane. They would indeed need to widen that if they wanted to go double FP32+FP32.
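The operand-read arithmetic behind that guess looks like this (numbers are mine, per 32-lane partition, assuming an FMA reads three source operands per lane):

```python
# Register-cache read bandwidth per cycle for different dual-issue mixes.
lanes, sources = 32, 3  # 32-lane partition, 3 source operands per FMA

def operand_bytes_per_cycle(bits):
    return lanes * sources * bits // 8

current = operand_bytes_per_cycle(32) + operand_bytes_per_cycle(16)  # FP32+FP16
double_fp32 = 2 * operand_bytes_per_cycle(32)                        # FP32+FP32
print(current, double_fp32)  # 576 vs 768 bytes/cycle
print(f"extra read bandwidth needed: {double_fp32/current - 1:.0%}")
```

On these assumptions, going to dual FP32 asks for about a third more read bandwidth out of the register cache.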

Philip Turner says something different on his benchmarking page, but I cannot make head or tail of what he is saying. He does not describe his mental model of the hardware, and I can't reverse engineer it from his description.

I have the same feeling about Philip's results. The way he describes the hardware does not fit with my understanding.

Are you referring to post-M1 improvements? I would be interested in seeing such numbers.

Here you go, this is the test I did a while ago evaluating the peak compute of various instruction mixes. That's on the 30-core M3 Max

[attached chart: peak compute for various instruction mixes on the M3 Max]


What Leman is referring to is that pre-Volta had separate INT and FP execution hardware, but in a single cycle only one of an FP or an INT instruction could be dispatched. With Volta this was tweaked so that both an INT and FP instruction could be dispatched. nV said at the time that about 30% of instructions were INT, so this basically improved performance by 1/.7=1.4x

BUT for Apple it's a slightly different tradeoff. nVidia has more time (their double-pumped lane stuff) to do easier (track dependencies in instruction bits) scheduling. AND their addressing is more lame than Apple's, so more cycles are required to calculate addresses that Apple can calculate in the load/store instruction. So the payoff for Apple is not as great.

But now that Apple are no longer enslaved to the exact size of the register file and the threadblock scratchpad, there's the opportunity to rebalance the design in all sorts of ways. I could see this resulting in ideas they have not used before (though PowerVR did) like running the 32-wide SIMD as a 16-wide SIMD that executes each instruction twice, which in turn makes it a lot easier to start doing things like effectively executing two instructions per "cycle", eg one to FP32 and one to FP16, or one to FP32 and one to INT.

As I said, they already implemented dual-issue in M3. And it seems to be a real dual-issue, not Nvidia's smart "even-odd issue with two clocks per instruction". Apple also has a nice patent describing this scheme, it's a fairly sophisticated mechanism. Much of my hypothesis is based on the assumption that they didn't go through all that trouble just to have dual FP32+FP16. Real performance wins are with FP32+FP32 or FP16+FP16 on modern shader code.
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,318
I am not aware of any two/four vector variants on AMX.
The two-vector variant was added w/ M2, the four-vector w/ M3.

Sorry for confusion. Eric Fink on RRW is my alias. I got 250 GFLOPS using the four vector FMLA variant. I think that's a data movement limitation. The ALUs themselves are fully capable of 2000 GFLOPs when doing outer products on two 16-wide FP32 vectors. But they can only sustain 1/8 of that when doing regular vector ops.
👍
I can see this is a win in energy consumption, not in anything else though. You are still limited by how many instructions can be scheduled simultaneously. It would be a performance win if the scalar instructions were scheduled independently, and that requires a very different design. Anyway, it doesn't seem like that's the direction Apple is moving towards.

I am referring to M3. The new GPU scheduler can issue two instructions per clock to different functional units within the same partition. So they can do FP+INT or FP32+FP16 simultaneously.

The caveat is that when doing dual-issue a full INT32 instruction can only be issued every other cycle. It is possible that the current register cache can only sustain 32-bit + 16-bit reads for each operand and lane. They would indeed need to widen that if they wanted to go double FP32+FP32.
I guess this is courtesy of the new "mini-front-ends" they have added to the GPU...
https://patents.google.com/patent/US11422822B2
Nice to confirm it works as I interpreted it to.
This is one reason I think a scalar lane is feasible. They have the ability to schedule to it now, and it won't hurt register cache bandwidth (uses a different, uniforms, cache).

Here you go, this is the test I did a while ago evaluating the peak compute of various instruction mixes. That's on the 30-core M3 Max

[attached chart: peak compute for various instruction mixes on the M3 Max]




As I said, they already implemented dual-issue in M3. And it seems to be a real dual-issue, not Nvidia's smart "even-odd issue with two clocks per instruction". Apple also has a nice patent describing this scheme, it's a fairly sophisticated mechanism. Much of my hypothesis is based on the assumption that they didn't go through all that trouble just to have dual FP32+FP16. Real performance wins are with FP32+FP32 or FP16+FP16 on modern shader code.
Ha ha! I guess we both read that same patent :)
 

HoxtonBridge

macrumors newbie
May 19, 2024
15
22
Nah. You've got choices; lots of people only need 8GB. Not me of course, but the people I know that are not creatives.
Very few people work with only one application running. Run two Office apps, safari, and have Onedrive or similar running in the background, things will start to get very slow. There is no reason the base RAM of any Mac needs to be below 16 GB... even with the unnecessarily high price of Apple memory.
 

MayaUser

macrumors 68040
Nov 22, 2021
3,177
7,196
It's not about the base specs... but the base price... I bet Apple would increase the price if they went 16GB + 512GB SSD, and they would lose some customers.
Remember, for a lot of people it's just a symbol to own a Mac... and those don't need anything but the base model.
If Apple keeps the $999 price for 16GB RAM and a 512GB SSD... then yes, of course it's a must and it will be the starting base spec for the next decade... or maybe decades (joking)
 

Torty

macrumors 65816
Oct 16, 2013
1,239
944
Very few people work with only one application running. Run two Office apps, safari, and have Onedrive or similar running in the background, things will start to get very slow. There is no reason the base RAM of any Mac needs to be below 16 GB... even with the unnecessarily high price of Apple memory.
Something wrong with your system if your 8GB machine can't handle that.
 

ader42

macrumors 6502
Jun 30, 2012
436
390
Very few people work with only one application running. Run two Office apps, safari, and have Onedrive or similar running in the background, things will start to get very slow. There is no reason the base RAM of any Mac needs to be below 16 GB... even with the unnecessarily high price of Apple memory.

There is no reason to take away peoples choices.
 

EugW

macrumors G5
Jun 18, 2017
14,897
12,866
So 4GB is frivolous but 8GB is a "choice". Many people are arguing that nowadays 8GB is frivolous.
8 GB works OK, but 4 GB (on Intel machines) does not, with recent macOSes.

You can actually run Monterey on 4 GB, but you don't want to. In contrast, Monterey runs well on 8 GB RAM for light business usage. And lots of people use 8 GB just fine on Sonoma.

However, my prediction/hope is that the next MacBook Air will ship with 12 GB base. 12 GB at the entry level would be great, and more than sufficient for the majority of the population. Jumping all the way up to 16 GB is not necessary for the MB Air. I agree the higher priced MacBook Pro should start at 16 GB or 18 GB though.
 

SoldOnApple

macrumors 65816
Jul 20, 2011
1,281
2,186
Apple must be holding back so they can push ahead once competing ARM systems make it to market. Like with iPhone, early on Samsung and others had faster phones sometimes, but then Apple pushed far ahead and no one else even bothers competing on real world benchmarks.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
Apple must be holding back so they can push ahead once competing ARM systems make it to market. Like with iPhone, early on Samsung and others had faster phones sometimes, but then Apple pushed far ahead and no one else even bothers competing on real world benchmarks.

Not much sign of that. Qualcomm's earlier schedule had them releasing "X Elite" in the Fall 2023 timeframe. Apple rushed the MBP 14/16" on a short cycle to M3 in Fall 2023. Turns out Qualcomm had more fine tuning to do and Windows 11 (or pre-12 or 12) wasn't ready, so it happened to slide to Q2 2024 (June). The iPad Pro moved to M4 about a month before the Surface Pro.

Apple likes to position iPad Pro as not a 'tablet' ( iPad being something inherently better than a tablet because generally 'tablets' are 'bad' ) , but iPad Pro and Surface Pro have competitive overlap.

Indications are that Apple is trying to arrive 'before' not 'after' the Windows on Arm moves. ( The Windows on Arm moves are not the only issue. AMD and Intel have cranked up their focus on more efficient SoCs for Windows also. They aren't going to remove the efficiency gap in one move, but the gap is going to shrink a bit. )

The MBP lineup probably will get another bump to M4 in the Fall, but in some sense Apple has already revealed elements of that move with the iPad Pro. ( Core counts aren't nailed down, but we probably have a decent idea of the single-thread uplift we're going to see, because it is basically the same across die sizes. )

Windows on Arm is mainly limited to laptops though. ( There is a Mini-like "Windows developer device" from Qualcomm coming, so the Mini moving to M4/M4 Pro in 2024 is likely still going to appear. If there is enough X Elite/Plus supply to go around, there probably will be some smaller mini-PC players that drop a 'headless' laptop. )

Where there's not going to be much Arm movement is the larger desktop space. Linux Arm workstations perhaps, but I doubt that is a credible threat to Apple ( any more than hackintosh was. Somewhat similar to folks taking Ampere Computing server processors and making workstations with them to run deskside Linux. Linus is doing it. ). The higher-end desktop is likely where Apple will be 'holding back'. Not so much on competitive pressures though ( lots of movement coming in 2H 2024 in the GPU space, AI/ML accelerators, and also in the general workstation space ). Being pushed to 'churn' the Pro and Max dies quicker in the laptops could possibly have side effects on the rest of the product line.

Apple is not so far ahead that they can 'hold back'. Arm is ramping up on making the phone SoCs more competitive ( Arm is shifting away from trying to hold onto the 'lowest price SoC' market; RISC-V is coming up from 'below' ). Nvidia and AMD have synergy deals to move their GPUs into SoCs. Both are rumored to be working on Arm SoCs to move Qualcomm out of the PC laptop placements.

Vision Pro is going to need a substantively better Mn and R2.
 

ader42

macrumors 6502
Jun 30, 2012
436
390
Why not offer 2 or 4 GB then?
You likely know full well that 4GB isn’t enough to adequately (for a premium device) run MacOS and a single regular business app like Word whereas 8GB is enough for a great deal of users. Apple certainly does.

I’m pretty sure Apple know what they are doing and that if 8GB was too little people would either not buy them or they would return them - after all Apple have the most generous returns policy of any company I’ve known.

So to answer your question I’ll ask one back, why not offer only 8GB or 64GB?
The answer of course is that Apple likes to give people choices to buy what they themselves want/need.

The fact is, anyone moaning about the minimum being 8GB is in reality moaning about the price.

Let’s try to keep on topic though, M4… if the M4 had 16GB base, the Pro would have 36GB (minimum) and the Max 72GB (minimum). These jumps are a little large from a marketing perspective as it would increase the price and performance differentials between levels of chip, this would actually reduce consumer choice and increase pricing overall.

Apple knows what it is doing when it designs the chips not only from a tech point of view but from a marketing and Mac sales point of view. One has to look at the big picture and not focus on the base entry level chip.
 

HoxtonBridge

macrumors newbie
May 19, 2024
15
22
You likely know full well that 4GB isn’t enough to adequately (for a premium device) run MacOS and a single regular business app like Word whereas 8GB is enough for a great deal of users. Apple certainly does.

How many people do you know who run just one application like Word without at least email and a browser open?
I’m pretty sure Apple know what they are doing and that if 8GB was too little people would either not buy them or they would return them - after all Apple have the most generous returns policy of any company I’ve known.

So to answer your question I’ll ask one back, why not offer only 8GB or 64GB?
The answer of course is that Apple likes to give people choices to buy what they themselves want/need.
You assume my comment is embedded in the logic of choice; I was showing the limitations of that position. You have just posited an additional illustration of its silliness.
The fact is, anyone moaning about the minimum being 8GB is in reality moaning about the price.
No! I haven't mentioned the price, although the price of Apple memory is universally regarded as absurdly high.
Let’s try to keep on topic though, M4… if the M4 had 16GB base, the Pro would have 36GB (minimum) and the Max 72GB (minimum).
More illogical propositions. How did you get to that?
These jumps are a little large from a marketing perspective
Well yes, but again I'm not especially interested in marketing.
as it would increase the price and performance differentials between levels of chip, this would actually reduce consumer choice and increase pricing overall.

Apple knows what it is doing when it designs the chips not only from a tech point of view but from a marketing and Mac sales point of view. One has to look at the big picture and not focus on the base entry level chip.
I'm not looking at entry-level chips, I was querying the efficacy of 8GB of memory even for a basic user. You disagree; that's fine, but don't insult my intelligence with a bunch of straw-man calculations you have pulled out of nowhere, and fanboy nonsense about Apple's omniscience.
 
  • Haha
Reactions: ader42