If you want to apply statistical testing to it, the trick would be finding a suitable model. I suppose one would look at the Cauchy distribution, since it describes the ratio of (centered) normally distributed variables, but I am not knowledgeable in that particular domain and don't have an intuition for it.
Resampling with replacement creates non-independent observations within a sample (since the same data points can be picked several times), which violates the assumptions of most statistical tests.

I would not compare the performance ratios to 1; I would just compare the M3 and M4 data for each test. If the normalized GB results are normally distributed (which I don't know), you can fit a linear model with CPU_type × type_of_test as explanatory variables. An ANOVA can then be applied to this model to estimate the share of variance explained by the CPU model, the type of test, and their interaction.
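To make that concrete, here is a minimal sketch with Python and statsmodels; the column names (cpu, test, score) and the synthetic numbers are placeholders, not real GB6 data:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Hypothetical long-format data: one row per (CPU model, test, score).
# Real input would be the normalized GB6 subtest scores.
rows = []
for cpu, base in [("M3", 3000.0), ("M4", 3700.0)]:
    for test in ["clang", "ray_tracer", "file_compression"]:
        for score in rng.normal(base, 150.0, size=50):
            rows.append({"cpu": cpu, "test": test, "score": score})
df = pd.DataFrame(rows)

# Linear model with CPU_type x type_of_test as explanatory variables.
model = smf.ols("score ~ C(cpu) * C(test)", data=df).fit()

# ANOVA on this model: variance attributed to the CPU model, the type
# of test, and their interaction.
print(sm.stats.anova_lm(model, typ=2))
```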
 

I'm not applying a classical statistical test, so that's not a problem. The plots I've produced are, of course, plots of variance, and they are also easy to interpret.


The scores should be normally distributed, but as you can see there are severe artifacts, multiple peaks, and long tails in the data. The primary reason I use ratios is that I wanted all the results on one plot using the same scale. One could use some other normalization technique and plot the runtimes instead; I personally think the ratios are easier to interpret. And yes, absolutely, one could run a regression on the data, maybe even try to reformulate the model in terms of IPC (since that's the variable we are actually interested in). At any rate, if one goes that route one would need to think some more about modeling, maybe even set up a hierarchical model to account for tests coming from the same benchmark run.
 
I found a way to wait for the M4 MacBook Pro: I traded in my baseline M2 Air 15 for a refurbished M1 MacBook Pro at Best Buy.

I doubled my RAM and storage, and got a way nicer screen. With my trade-in, the Pro came to about $600. I figured I'd get an equal or higher trade-in value when trading it in for the M4 MacBook Pro.

Almost like a "down payment" to wait out the new M4 releases.

However, I agree with the viewpoint that the M3 did not fully tap into the 3nm process, and that the M4, hopefully with deeper machine-learning integration, will be a significant upgrade over the M1/M2.

Already being on an M3 makes the jump to the M4 a TBD. It would have to be really significantly better, because you'd have two purchases at near-full retail price (one for the M3 Mac and one for the M4 Mac). That's a lot of money for what could be a minor jump.

I'm not touching the M3 because I'm happy with the M1/M2 series.
 
I can't follow your decision :) So you lost your M2 Air 15 and paid an additional $600 for an M1 machine, only to wait for the M4?
 
I bought the Air for $900 open-box (excellent condition) and traded it in for $800 after six months of use, bringing the cost of the M1 Pro down to $600. So I paid $1,500 total across the two machines to end up with a 16-inch M1 Pro, while also doubling the specs to 16GB RAM and a 512GB SSD, plus getting the nicer screen. No-brainer.

Looking to get top trade-in value for my M1 Pro when the M4 Pro launches (cost TBD).
 
TBH, I would have kept the M2 machine, waited for the M4, and yes, probably skipped it too and seen what the M5 brings.
 
I'm not applying a classical statistical test, so that's not a problem. The plots I've produced are, of course, plots of variance, and they are also easy to interpret.
You mentioned a model; I just wanted to point out that the M4/M3 ratio was not the best choice of response variable.

maybe even set up a hierarchical model to account for tests coming from the same benchmark run.
Since there is a single observation per task per run, I don't think the "run" should be a (random) factor in the model; it's not grouping anything. However, the device used could be a random factor, but I don't think it's identified in the results.
That is, if we had time to spare. Your plot is excellent and a model wouldn't add much to the discussion.
 
TBH, I would have kept the M2 machine, waited for the M4, and yes, probably skipped it too and seen what the M5 brings.
I was pushing the M2 too much. The 8GB of RAM and 256GB SSD were real bottlenecks (especially since the base M2 Air's 256GB SSD uses a single NAND chip instead of two 128GB chips, which also hurts its speed).

Once I had all my apps open and Chrome got close to 10 tabs, it was game over. I needed the extra RAM, and the doubled storage plus the nicer screen were great side benefits.

On top of that, I've been trying to get into MacBook Pro ownership since the M series launched several years ago. The only way I could afford it was piecemeal, with the M2 Air as an in-between.
 
You mentioned a model; I just wanted to point out that the M4/M3 ratio was not the best choice of response variable.

Ah, yes, of course, that's very fair. Sorry, I was thinking in more general terms.

Since there is a single observation per task per run, I don't think the "run" should be a (random) factor in the model; it's not grouping anything. However, the device used could be a random factor, but I don't think it's identified in the results.

GB6 runs each test several times: I am using the JSON report summary, which lists five run times for each test (the first is a warmup and is discarded). In addition, tests run as part of the same benchmark instance are likely to be correlated (environment, other open apps, etc.).
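If one did want to account for that run-level grouping, here is a hedged sketch of the hierarchical idea as a mixed-effects model; the run column and all the numbers are invented for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Hypothetical data: each benchmark run yields one score per test, and
# scores from the same run share an offset (thermals, background load).
rows = []
for cpu, base in [("M3", 3000.0), ("M4", 3700.0)]:
    for run in range(40):
        run_effect = rng.normal(0.0, 80.0)  # shared by all tests in this run
        for test in ["clang", "ray_tracer", "file_compression"]:
            rows.append({
                "cpu": cpu,
                "run": f"{cpu}-{run}",
                "test": test,
                "score": base + run_effect + rng.normal(0.0, 100.0),
            })
df = pd.DataFrame(rows)

# Fixed CPU x test effects, plus a random intercept per benchmark run
# to absorb the within-run correlation.
result = smf.mixedlm("score ~ C(cpu) * C(test)", data=df, groups="run").fit()
print(result.summary())
```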


That is, if we had time to spare. Your plot is excellent and a model wouldn't add much to the discussion.

Thank you! Would be a nice exercise though :)
 
Resampling with replacement creates non-independent observations within a sample (since the same data points can be picked several times), which violates the assumptions of most statistical tests.


It's okay as long as you treat it like bootstrapping: https://en.m.wikipedia.org/wiki/Bootstrapping_(statistics)

Using the central limit theorem, you can show that you can reliably estimate things like variance and confidence intervals, and do hypothesis testing. Though, as I think we all agree, any of that is more than we really need for this discussion. :)
 
In that case we're not doing a "classical" test like a t-test or an F-test.
What Leman did indeed amounts to bootstrapping. Computing the percentage of bootstrap replicates for which the performance ratio is < 1 may be a good estimate of the p-value of a one-sided test (the probability of observing such a large, or larger, difference under the assumption that the true performance ratio is 1).
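As a concrete sketch of that estimate (the synthetic arrays below stand in for the real GB6 samples):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic per-test scores standing in for the real GB6 samples.
m3 = rng.normal(3000.0, 150.0, size=200)
m4 = rng.normal(3100.0, 150.0, size=200)

# Bootstrap: resample each group with replacement, record the mean ratio.
n_boot = 10_000
ratios = np.empty(n_boot)
for i in range(n_boot):
    ratios[i] = rng.choice(m4, m4.size).mean() / rng.choice(m3, m3.size).mean()

# Fraction of replicates with ratio < 1: the estimate of the one-sided
# p-value described above.
print("P(M4/M3 < 1) over bootstrap replicates:", (ratios < 1).mean())
```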

Data sampled with replacement are not independent, and it's unclear how this violation affects the applicability of the central limit theorem in any particular case.
 
Can't have an Air with a better SoC than the Pro.
Can. The M2 Air was introduced 6/6/22; the M2 Pro/Max were introduced 1/17/23.

That said, this is very suboptimal from a marketing perspective, so it's clearly desirable to have the Pro/Max out before the Air.

Also use the Air to get rid of M3 stock. The M3 is probably too hot to fit in the iPad, plus they need a new AI-marketed chip. Plus they wouldn't want to keep the M3 in production for so long. The M4 iPad Pro might have a longer cycle than the Air.
The M3 is definitely not too hot for the iPad; it is strictly better in performance per watt than the M2. The lack of the doubled INT8 performance in the NPU might be a factor. It's true that at this point they clearly benefit from shifting to N3E ASAP.

I think they originally planned to release the M3 for everything. Halfway in, they realized either that N3B was worse than anticipated or that N3E was better than anticipated. They decided to cut N3B short and use it only for the fastest-moving products (MBPs), since they were constrained by contracts with TSMC.
N3B hit its targets well, despite persistent early rumors to the contrary. However, N3E is clearly outperforming expectations - it was expected to ship in volume in H2, and yet here we are already.

It is extremely unlikely that Apple is constrained by contracts to stick with N3B over N3E. There's likely very little advantage to TSMC in such an arrangement, and Apple definitely had the whip hand in the N3B negotiations, as nobody else wanted any part of it.

Then it turned out things were moving slower than they had hoped for. They released it in more models (iMac and Air) to get rid of more M3 stock.

Combine this with the need to up their AI marketing game and the fact that they have to update the desktops… and we'll get new desktops next month and new MBPs either at the same time or shortly after.

The Air and iMac will not get the M4.
I doubt the sales are all that surprising to them, or having much of an impact on their strategy, which typically plays out over a somewhat longer period of time.

Your conclusions about the Air and iMac are plausible, but I would not bet on them. Really, I wouldn't bet in either direction.
 
Why are people believing the marketing that the M4 is magically a whole new powerhouse chip for AI tasks? The A17 Pro NPU can already do 35 TOPS (vs. 38 TOPS in the M4). What new ground is Apple breaking here that only the M4 is capable of?
It's a big step up from the M3, which apparently does not have the doubled INT8 performance of the A17's NPU. That alone is significant, though it's not new if you're considering iPhones and not just Macs.

However, there are other new things in the M4, some of which we just don't know about yet. The whole thing about SME, which took up many comments in another discussion here this week, is significant - not all AI processing happens in the NPU, after all. We don't have a good idea what performance wins will be had there, but from the one GB6 test that used it, they look substantial.

The scores should be normally distributed, but as you can see there are severe artifacts, multiple peaks, and long tails in the data. The primary reason I use ratios is that I wanted all the results on one plot using the same scale. One could use some other normalization technique and plot the runtimes instead; I personally think the ratios are easier to interpret. And yes, absolutely, one could run a regression on the data, maybe even try to reformulate the model in terms of IPC (since that's the variable we are actually interested in). At any rate, if one goes that route one would need to think some more about modeling, maybe even set up a hierarchical model to account for tests coming from the same benchmark run.
Trying to build a serious model based on IPC opens up a can of (nearly) infinite worms. Unless you're playing with toy architectures where every instruction executes in the same fixed number of cycles, you have to tackle a problem that is unsolvable in the general case: what instruction mix are you using? This is essentially the heart of the benchmark problem. You punted on that whole issue by just using GB6, and that was the smart thing to do.
 
Can. The M2 Air was introduced 6/6/22; the M2 Pro/Max were introduced 1/17/23.

That said, this is very suboptimal from a marketing perspective, so it's clearly desirable to have the Pro/Max out before the Air.

The M3 is definitely not too hot for the iPad; it is strictly better in performance per watt than the M2. The lack of the doubled INT8 performance in the NPU might be a factor. It's true that at this point they clearly benefit from shifting to N3E ASAP.

N3B hit its targets well, despite persistent early rumors to the contrary. However, N3E is clearly outperforming expectations - it was expected to ship in volume in H2, and yet here we are already.

It is extremely unlikely that Apple is constrained by contracts to stick with N3B over N3E. There's likely very little advantage to TSMC in such an arrangement, and Apple definitely had the whip hand in the N3B negotiations, as nobody else wanted any part of it.

I doubt the sales are all that surprising to them, or having much of an impact on their strategy, which typically plays out over a somewhat longer period of time.

Your conclusions about the Air and iMac are plausible, but I would not bet on them. Really, I wouldn't bet in either direction.
I just find it strange that they would decide from the start, "This generation (M3) we will only design with laptops in mind."

I mean, why? Doing that suggests they knew from the start that something about N3B was inferior to N3E, which sounds implausible. Supposedly they paid a lot of cash for N3B too. To get six months ahead of everyone else? That's not much at all.

It may well be that I'm wrong about all my assumptions, but I would love to learn the real reasons for their strategy over the last two years.
 
They needed N3B for the iPhone, so they were going to use it anyway, and I suspect they wanted to release new-generation Macs in 2023 instead of waiting for 2024.
 
I just find it strange that they would decide from the start, "This generation (M3) we will only design with laptops in mind."

Doing that suggests they knew from the start that something about N3B was inferior to N3E, which sounds implausible. Supposedly they paid a lot of cash for N3B too. To get six months ahead of everyone else? That's not much at all.

N3B is inferior to N3E for mass production of SoCs. Apple wanted to get the iPhone to 3nm ASAP, and N3B let them do so a year sooner - as well as over six months before any other smartphone.

As for being "laptops only": laptops make up over 80% of Apple's annual Mac sales. People replace laptops more often than desktops, and far more new customers will choose a laptop over a desktop. So it is important for the MacBook Air (especially) and the MacBook Pro to be updated annually, whereas the iMac, Mac mini, Mac Studio and (especially) the Mac Pro do not need to be. So, knowing M3/N3B was a "short term" option, it got shunted to the laptops and the most popular desktop (the iMac - and even there, the iMac skipped the M2).
 
I might be biased by my own perspective - I strongly dislike laptops.

Just take my 2018 15" MBP. The chassis is still flawless, the screen perfectly fine. It's a complete waste to have to trash it just because I need more performance. Having a battery is a bad thing too (unless you depend on the mobility, of course): batteries age and become unreliable, and eventually need replacement.

Whereas a desktop is great: superior cooling, longevity, separation of concerns. And almost half the price.
 
Trying to build a serious model based on IPC opens up a can of (nearly) infinite worms. Unless you're playing with toy architectures where every instruction executes in the same fixed number of cycles, you have to tackle a problem that is unsolvable in the general case: what instruction mix are you using? This is essentially the heart of the benchmark problem. You punted on that whole issue by just using GB6, and that was the smart thing to do.

What do you think about treating IPC as a test-specific variable and evaluating each subtest independently of the others?

I think my overall sentiment aligns with what @jeanlain said: there is probably not much to gain by trying to model these things in more detail.
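For what it's worth, the back-of-the-envelope version of that per-test view is simple: GB6 score is proportional to work done per unit time, so a subtest's IPC ratio is its score ratio divided by the clock ratio. A sketch, assuming roughly 4.4 GHz (M4) and 4.05 GHz (M3) P-core clocks, which are reported figures rather than anything verified here, and placeholder score ratios:

```python
# Assumed P-core clocks in GHz (reported figures, not verified here).
M4_CLOCK = 4.4
M3_CLOCK = 4.05
clock_ratio = M4_CLOCK / M3_CLOCK  # about 1.09

# Placeholder per-test M4/M3 score ratios (illustrative, not real data).
score_ratios = {"clang": 1.18, "ray_tracer": 1.12, "object_detection": 1.55}

# IPC ratio per test: performance ratio divided by clock ratio, treating
# score as proportional to instructions retired per unit time.
for test, r in score_ratios.items():
    print(f"{test:>18}: score ratio {r:.2f} -> IPC ratio {r / clock_ratio:.2f}")
```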
 
I might be biased by my own perspective - I strongly dislike laptops.
[...]
Whereas a desktop is great: superior cooling, longevity, separation of concerns. And almost half the price.
That's an entirely legitimate position to take for your use case. But you're letting it bias your perception of another rational actor's positions, which are based on a quite different perspective. Once you recognize that, it all makes perfect sense.
 
What do you think about treating IPC as a test-specific variable and evaluating each subtest independently of the others?

I think my overall sentiment aligns with what @jeanlain said: there is probably not much to gain by trying to model these things in more detail.
I think it's a reasonable thing to do from an analysis perspective, but I also agree with both of you that it doesn't buy you much. If you were planning on a big investment for a specific use case, then the story might be different.
 
In that case we're not doing a "classical" test like a t-test or an F-test.
What Leman did indeed amounts to bootstrapping. Computing the percentage of bootstrap replicates for which the performance ratio is < 1 may be a good estimate of the p-value of a one-sided test (the probability of observing such a large, or larger, difference under the assumption that the true performance ratio is 1).
Absolutely. I just wanted to clarify for anyone reading that what @leman did was still statistically valid for estimating variance (the violin plots), and even for doing statistical tests if he were so inclined; heck, you could even replace the classic t-test or F-test with bootstrapped versions of those tests. But you are right that the assumptions of classical test statistics would be violated; I didn't mean to seem like I was contradicting you on that.

Data sampled with replacement are not independent, and it's unclear how this violation affects the applicability of the central limit theorem in any particular case.
I'm not quite sure I understand. The central limit theorem should still apply; that's at the heart of the bootstrap method. Through the fog, I'm struggling to remember my training here, but I believe sampling with replacement does slow down the convergence, though it does still work. There's a proof at the very end of the Wikipedia article. Granted, it isn't the proof I remember (I remember a much longer one), but that was many, many years ago and my memory is not that great. It's also possible it's just a different proof. Lots of those going around.

What do you think about treating IPC as a test-specific variable and evaluating each subtest independently of the others?

I think my overall sentiment aligns with what @jeanlain said: there is probably not much to gain by trying to model these things in more detail.

Were I inclined to do so, that's what I would do. But I agree that it's not worth it, as the visual representation here is really what's important.
 
Hopefully, they have something more intense in the works for the Mac Pro, with better integration with external cards and storage. They seem to be working toward some changes, but until the Mac Pro starts to look as capable as it used to be, they'll sell more of the Mac Studio.
I think they will eventually drop the Mac Pro, and then, for those people needing very high performance, allow "clusters" of Mac Studios: maybe a dozen of them tied together with a very high-speed network. The fast network is easy; Macs already allow networking over Thunderbolt.

I've been working with Apple's new open-source OpenELM. It's an LLM that can run on-device. A comment in there by an Apple engineer says (paraphrasing) "using Linux for training because Slurm does not work yet on macOS." So Apple is working on cluster management and would be able to manage hundreds or even thousands of Macs the way they now manage thousands of Linux servers in their data centers.

So Apple might say to a customer who wants to color grade every frame of a full-length feature film in 10 seconds: "Go buy a dozen Mac Studios, and if that is not fast enough, buy two dozen."
 