Base stats:
- 40 high-performance cores
- 20 high-efficiency cores
- 100-core GPU
- 80-core Neural Engine
- 8GB Unified memory

The base memory is laughable. 16GB ought to be the bare minimum for anything Mac....
BTW, on another M4 topic: SSVE does in fact work every bit as fast as expected on the M4, and can (when coded correctly) run as fast as a matrix outer product, i.e. ~2000 FP32 GFLOPS.
Problem is the initial Jena results were written up by people who read the ARM manual but never bothered to read my manual (or the similar work by Corsix or Dougall).
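To make the outer-product claim concrete, here is a minimal plain-C sketch of what a single FP32 outer-product-accumulate (SME's FMOPA) computes. It assumes a 512-bit streaming vector length, i.e. the 16-wide FP32 vectors mentioned later in the thread; treat it as a reference model, not tuned code.

```c
/* Reference model of one SME FP32 outer-product-accumulate (FMOPA),
 * assuming a 512-bit streaming vector length => 16 FP32 lanes per vector.
 * One instruction updates a 16x16 accumulator tile:
 * 16 * 16 = 256 FMAs = 512 FLOPs.
 */
#include <stdio.h>

#define SVL_F32 16 /* FP32 elements per streaming vector (assumed SVL = 512 bits) */

/* za[i][j] += zn[i] * zm[j] for every (i, j) pair */
static void fmopa_f32_ref(float za[SVL_F32][SVL_F32],
                          const float zn[SVL_F32],
                          const float zm[SVL_F32])
{
    for (int i = 0; i < SVL_F32; i++)
        for (int j = 0; j < SVL_F32; j++)
            za[i][j] += zn[i] * zm[j];
}

int main(void)
{
    float za[SVL_F32][SVL_F32] = {{0}};
    float zn[SVL_F32], zm[SVL_F32];

    for (int i = 0; i < SVL_F32; i++) {
        zn[i] = (float)i;
        zm[i] = 2.0f;
    }

    fmopa_f32_ref(za, zn, zm);

    printf("FLOPs per outer product: %d\n", 2 * SVL_F32 * SVL_F32); /* 512 */
    printf("za[3][1] = %.1f (expect 6.0)\n", za[3][1]);
    return 0;
}
```

Sustaining one such instruction per cycle at roughly 3.9 GHz works out to about 2000 FP32 GFLOPS, which lines up with the table in the benchmark post below.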
The biggest bottleneck to this right now is the register cache. Already it can be oversubscribed by microbenchmark code (though usually not by real code). And widening that could be expensive.
To my eyes, the most obvious quick win is to add a scalar lane to execute scalar operations (and not waste full SIMD on them). They have uniform registers but for whatever reason they never provided a uniform execution path.
Next up in terms of quick wins is probably allowing FP16[2] operations, i.e. executing on pairs of FP16 values: either split these into one half done on the FP16 HW and one on the FP32 HW, or (slightly more work) rewire the FP32 HW to also support FP16[2].
Both of these match the route AMD and nV have already gone down, and neither puts that much additional pressure on the register cache.
Jena's benchmark result:
| Core Type | #Tiles | FP32 GFLOPS |
| --- | --- | --- |
| P | 1 | 502.2 |
| P | 2 | 1004.7 |
| P | 4 | 2009.6 |
| E | 1 | 116.3 |
| E | 2 | 116.0 |
| E | 4 | 116.1 |
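As a sanity check, the table is internally consistent with the assumption above of 512 FLOPs per FP32 outer-product instruction: the four-tile P-core figure implies roughly one outer product retired per cycle at ~3.9 GHz, and the single-tile figure is almost exactly a quarter of that. The implied clock is derived from the table, not measured.

```c
/* Back-of-envelope check of the table above, assuming 512 FLOPs per
 * FP32 outer-product instruction (16x16 FMAs at a 512-bit SVL).
 * The GFLOPS numbers come from the table; the clock is derived, not measured.
 */
#include <stdio.h>

int main(void)
{
    const double flops_per_op = 2.0 * 16 * 16; /* 512 */
    const double p_1tile      = 502.2;         /* P cores, 1 tile (GFLOPS)  */
    const double p_4tile      = 2009.6;        /* P cores, 4 tiles (GFLOPS) */

    /* Outer products retired per nanosecond with 4 tiles ~= implied GHz */
    double implied_ghz = p_4tile / flops_per_op;
    printf("Implied issue rate, 4 tiles: ~%.2f ops/ns (~%.2f GHz)\n",
           implied_ghz, implied_ghz);

    /* Single tile runs at ~1/4 of the 4-tile rate */
    printf("4-tile / 1-tile ratio: %.2f\n", p_4tile / p_1tile);
    return 0;
}
```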
By the way, they have moved their code to GitHub: https://github.com/scalable-analyses/sme
Disclaimer: all I say here is obviously just speculation and to a large degree wishful thinking. Call it a hunch if you want. It is not entirely baseless though. I think if one puts together the timeline of Apple GPU changes, there appears to be a clear trajectory. Here is the basic idea: effectively double the GPU's per-lane FP32 throughput.
This could be done by making the FP units larger (e.g. shipping two FP32 units instead of FP32 and FP16), or by making the INT32 unit capable of executing FP32 instructions. That is essentially what Nvidia did with Volta and I think this is where Apple is heading as well.
The basic machinery is already in place, and this effective doubling of execution units can be achieved with only minimal increase in GPU die size. And it would instantly double the peak GPU compute.
It means you are not wasting a SIMD of execution just to calculate a single value.

That is true. I still think it might work well enough even with the current register cache. Apple can do 32-bit FP+Int most of the time without problems after all. And they are the only vendor who offers four-operand instructions. I mean, Nvidia does it and it works for them, even though they do get stalls here and there.
Do you think it is worth the effort? I doubt it will improve performance. It's still a second instruction. Maybe it could save some power, but it does sound like extra complexity.
Doesn’t seem to work well for AMD. And Nvidia abandoned this path. Too situational, IMO. Would help with matmul though.
Volta isn't really primarily graphics oriented. It is a datacenter compute focused chip. Most folks are likely going to think "2x GPU" primarily has impact on their screen, not embarrassingly parallel data into a file (or ephemeral memory buffer).
Did Corsix give performance numbers for the two and four vector variants? Presumably you have to use those to get max throughput. It wasn't clear to me from the link I gave if Eric had *measured* that he got 2000 GFLOPs or not.

I only got 250 GFLOPs, not 2000. I'll play around with the code, maybe one can extract a bit more. At any rate this is very similar to what corsix reports for AMX, so I think the results are close to optimal.
If you calculate scalar values more than 3% of the time (one lane's worth of a 32-wide SIMD is 1/32 ≈ 3%), it's a net win!
What do you mean by "Apple can do 32-bit FP+Int most of the time without problems after all"?
You mean that, eg, Apple can sustain execution of 128 FMA32 + 128 INT16 integer ops per cycle? I've never seen any claim of this.
As far as I can tell, Apple has four schedulers each feeding a 32 element SIMD, and each scheduler can feed a maximum of 1 instruction per cycle.
Phillip Turner says something different on his benchmarking page, but I cannot make head or tail of what he is saying. He does not describe his mental model of the hardware, and I can't reverse engineer it from his description.
Are you referring to post-M1 improvements? I would be interested in seeing such numbers.
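For reference, a back-of-envelope of what that mental model implies, assuming the four schedulers and 32-wide SIMDs describe a single GPU core, an FMA counted as 2 FLOPs, a ~1.4 GHz GPU clock, and the 30-core configuration of the M3 Max mentioned later. The clock and core count are assumptions for illustration, not measurements.

```c
/* Peak FP32 throughput implied by "four schedulers x 32-wide SIMD,
 * one instruction per cycle", assuming that describes one GPU core.
 * Clock (~1.4 GHz) and core count (30, as in the M3 Max below) are
 * assumptions used only for illustration.
 */
#include <stdio.h>

int main(void)
{
    const double schedulers_per_core = 4;
    const double simd_width          = 32;
    const double flops_per_fma       = 2;    /* multiply + add        */
    const double clock_ghz           = 1.4;  /* assumed               */
    const int    gpu_cores           = 30;   /* assumed (M3 Max-like) */

    double gflops_per_core = schedulers_per_core * simd_width
                           * flops_per_fma * clock_ghz;   /* ~358 */

    printf("Per GPU core: ~%.0f FP32 GFLOPS\n", gflops_per_core);
    printf("%d cores:     ~%.1f FP32 TFLOPS\n",
           gpu_cores, gpu_cores * gflops_per_core / 1000.0);
    return 0;
}
```

Doubling the per-lane FP32 rate, as speculated above, would double both of those numbers.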
What Leman is referring to is that pre-Volta had separate INT and FP execution hardware, but in a single cycle only one of an FP or an INT instruction could be dispatched. With Volta this was tweaked so that both an INT and FP instruction could be dispatched. nV said at the time that about 30% of instructions were INT, so this basically improved performance by 1/.7=1.4x
BUT for Apple it's a slightly different tradeoff. nVidia has more time (their double-pumped lane stuff) to do easier (track dependencies in instruction bits) scheduling. AND their addressing is more lame than Apple's, so more cycles are required to calculate addresses that Apple can calculate in the load/store instruction. So the payoff for Apple is not as great.
But now that Apple are no longer enslaved to the exact size of the register file and the threadblock scratchpad, there's the opportunity to rebalance the design in all sorts of ways. I could see this resulting in ideas they have not used before (though PowerVR did) like running the 32-wide SIMD as a 16-wide SIMD that executes each instruction twice, which in turn makes it a lot easier to start doing things like effectively executing two instructions per "cycle", eg one to FP32 and one to FP16, or one to FP32 and one to INT.
I am not aware of any two/four vector variants on AMX.

Two vector variant was added w/ M2, 4 vector w/ M3.
👍 Sorry for confusion. Eric Fink on RRW is my alias. I got 250 GFLOPS using the four vector FMLA variant. I think that's a data movement limitation. The ALUs themselves are fully capable of 2000 GFLOPs when doing outer products on two 16-wide FP32 vectors. But they can only sustain 1/8 of that when doing regular vector ops.
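For scale, under the same assumptions as earlier (16 FP32 lanes per vector, ~3.9 GHz clock), the 250 vs ~2000 GFLOPS gap corresponds to retiring roughly one four-vector FMLA every other cycle, versus one full 16x16 outer product per cycle. This is simple arithmetic on the stated numbers, not a measurement.

```c
/* Rough accounting of the 250 vs ~2000 GFLOPS gap, assuming 16 FP32
 * lanes per vector and a ~3.9 GHz clock (both assumptions).
 */
#include <stdio.h>

int main(void)
{
    const double clock_ghz     = 3.9;            /* assumed */
    const double flops_outer   = 2.0 * 16 * 16;  /* 512 per outer product   */
    const double flops_fmla_x4 = 2.0 * 16 * 4;   /* 128 per 4-vector FMLA   */
    const double measured      = 250.0;          /* GFLOPS, from the post   */

    printf("Outer-product peak: ~%.0f GFLOPS\n", flops_outer * clock_ghz);
    printf("Implied FMLA rate:  ~%.2f four-vector FMLAs per cycle\n",
           measured / (flops_fmla_x4 * clock_ghz));
    return 0;
}
```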
I can see this is a win in energy consumption, not in anything else though. You are still limited by how many instructions can be scheduled simultaneously. It would be a performance win if the scalar instructions were scheduled independently, and that requires a very different design. Anyway, it doesn't seem like that's the direction Apple is moving towards.
I guess this is courtesy of the new "mini-front-ends" they have added to the GPU...

I am referring to M3. The new GPU scheduler can issue two instructions per clock to different functional units within the same partition. So they can do FP+INT or FP32+FP16 simultaneously.
The caveat is that when doing dual-issue, a full INT32 instruction can only be issued every other cycle. It is possible that the current register cache can only sustain 32-bit + 16-bit reads for each operand and lane. They would indeed need to widen that if they wanted to go double FP32+FP32.
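A little arithmetic on why FP32+FP32 would need a wider register cache than the current FP32+FP16 dual issue. The three-sources-per-FMA figure is an assumption for illustration; the four-operand instructions mentioned earlier would push the numbers higher still.

```c
/* Operand-read bandwidth per lane per cycle under the two dual-issue
 * schemes, assuming three source operands per FMA (an assumption;
 * the four-operand forms mentioned earlier would push this higher).
 */
#include <stdio.h>

int main(void)
{
    const int srcs_per_fma = 3;                              /* assumed  */
    int fp32_fp16 = srcs_per_fma * 32 + srcs_per_fma * 16;   /* 144 bits */
    int fp32_fp32 = srcs_per_fma * 32 + srcs_per_fma * 32;   /* 192 bits */

    printf("FP32+FP16 dual issue: %d source bits/lane/cycle\n", fp32_fp16);
    printf("FP32+FP32 dual issue: %d source bits/lane/cycle (%.0f%% more)\n",
           fp32_fp32,
           100.0 * (fp32_fp32 - fp32_fp16) / fp32_fp16);
    return 0;
}
```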
Ha ha! I guess we both read that same patent.

Here you go, this is the test I did a while ago evaluating the peak compute of various instruction mixes. That's on the 30-core M3 Max.
As I said, they already implemented dual-issue in M3. And it seems to be a real dual-issue, not Nvidia's smart "even-odd issue with two clocks per instruction". Apple also has a nice patent describing this scheme, it's a fairly sophisticated mechanism. Much of my hypothesis is based on the assumption that they didn't go through all that trouble just to have dual FP32+FP16. Real performance wins are with FP32+FP32 or FP16+FP16 on modern shader code.
Nah. You got choices, lots of people only need 8gb. Not me of course, but the people I know that are not creatives.
Very few people work with only one application running. Run two Office apps, safari, and have Onedrive or similar running in the background, things will start to get very slow. There is no reason the base RAM of any Mac needs to be below 16 GB... even with the unnecessarily high price of Apple memory.
Something wrong with your system if your 8GB machine can't handle that.
It would handle it, just not very well!
There is no reason to take away people's choices.

Why not offer 2 or 4 GB then?
I assume that you are well aware of the logical answer to this frivolous question.
So 4GB is frivolous but 8GB is a "choice". Many people are arguing that nowadays 8GB is frivolous.
8 GB works OK, but 4 GB (on Intel machines) does not, with recent macOSes.
Apple must be holding back so they can push ahead once competing ARM systems make it to market. Like with iPhone, early on Samsung and others had faster phones sometimes, but then Apple pushed far ahead and no one else even bothers competing on real world benchmarks.
You likely know full well that 4GB isn't enough to adequately (for a premium device) run MacOS and a single regular business app like Word whereas 8GB is enough for a great deal of users. Apple certainly does.
I'm pretty sure Apple know what they are doing and that if 8GB was too little people would either not buy them or they would return them - after all Apple have the most generous returns policy of any company I've known. So to answer your question I'll ask one back, why not offer only 8GB or 64GB? The answer of course is that Apple likes to give people choices to buy what they themselves want/need.

You assume my comment is embedded in the logic of choice; I was showing the limitations of that position. You have just posited an additional illustration of its silliness.
The fact is, anyone moaning about the minimum being 8GB is in reality moaning about the price.

No! I haven't mentioned the price, although the price of Apple memory is universally regarded as absurdly high.
Let's try to keep on topic though, M4... if the M4 had 16GB base, the Pro would have 36GB (minimum) and the Max 72GB (minimum).

More illogical propositions. How did you get to that?
These jumps are a little large from a marketing perspective...

Well yes, but again I'm not especially interested in marketing.

...as it would increase the price and performance differentials between levels of chip, this would actually reduce consumer choice and increase pricing overall.

I'm not looking at entry-level chips, I was querying the efficacy of 8GB of memory even for a basic user. You disagree; that's fine, but don't insult my intelligence with a bunch of straw man calculations you have pulled out of nowhere, and fan-boy nonsense about Apple's omniscience.
Apple knows what it is doing when it designs the chips not only from a tech point of view but from a marketing and Mac sales point of view. One has to look at the big picture and not focus on the base entry level chip.