> Sampling without replacement still makes sampling events non-independent, even if you allow sampling the same value (score) several times. The individuals you pick next depend on what you've already picked. The degree of this dependence depends on the sample size relative to the population size, but it's usually tolerable. In biology, we almost always sample without replacement.
No. The events are still independent. It's not just that the effect is tolerable; sampling without replacement is, in this case, actually better. Think of it this way: you have the following distributions:
A. The "true", ideal distribution of Geekbench scores across all M4 devices, which you would obtain by running Geekbench on an infinite number of devices.
- You sample this distribution by running Geekbench on a device. All Geekbench runs are independent of each other.
B. The Geekbench results stored in Geekbench Browser (this approximates distribution A as the number of results grows, but it's not identical to A for any finite number of results, which is what we actually have).
- You sample this distribution by picking a result from the pool of results stored in Geekbench Browser.
You're interested in A, not B. It's true that if you wanted to compute the statistical properties of B, you'd need to pick results randomly *with replacement* to recover the true distribution of B. But as it happens, you're interested in A, and B is simply an unordered collection of independent samples from A. You can pick them at random, take the first N results, or basically whatever you want, so long as you don't repeat results and don't pick them in a biased manner.
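If it helps to see that in code, here's a minimal Python sketch of the A-versus-B setup. The Gaussian and its parameters are made-up stand-ins for the true score distribution (not real Geekbench numbers); the only point is that any subset of the pool picked without looking at the values behaves like independent samples from A:

```python
import random
import statistics

random.seed(0)

# Hypothetical stand-in for A, the "true" score distribution.
# The Gaussian and the numbers (mean 3800, sd 150) are invented for illustration.
def run_geekbench():
    """One independent draw from A (one benchmark run on one device)."""
    return random.gauss(3800, 150)

# B: a finite pool of stored results, i.e. independent draws from A.
pool_B = [run_geekbench() for _ in range(1000)]

# Any subset picked without looking at the values is still a collection of
# independent draws from A, so its mean estimates the mean of A.
first_100 = pool_B[:100]
random_100 = random.sample(pool_B, 100)  # picked at random, without replacement
print(statistics.mean(first_100), statistics.mean(random_100))  # both land near 3800
```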
If you prefer another example, imagine you wanted to know the mean result of throwing a die. The "true" distribution is uniform, with probability 1/6 for each of the numbers 1 to 6 (what you'd get by throwing the die an infinite number of times):
Or in fancy math notation, p(x) = 1/6 ∀ x ∈ {1, …, 6}. If you want to approximate that distribution, you need to sample *with replacement*: x (in this case, the result of the die, 1 to 6) needs to be able to appear more than once. Otherwise, you're skewing the distribution (and you'd only be able to take 6 samples lol).
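As a quick Python sketch of what that means: sampling the true distribution is just throwing the die over and over, every throw is a fresh independent draw, and nothing stops a value from showing up many times:

```python
import random
from collections import Counter

random.seed(1)

# 1000 independent throws of a fair die: each throw is a sample from the
# true distribution p(x) = 1/6, and values repeat freely (with replacement).
throws = [random.randint(1, 6) for _ in range(1000)]

# Empirical frequencies: close to 1/6 ≈ 0.167, but not exact, due to sampling noise.
counts = Counter(throws)
for face in range(1, 7):
    print(face, counts[face] / len(throws))
```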
Now, if you take 1000 samples from the distribution above, you may end up with a dataset that, if plotted, looks like this, as there's sampling noise involved:
You got the above by throwing a die 1000 times. Each throw of the die is independent (the samples from A are independent of each other). Now there are two things you can do:
- Option 1: Take the 1000 results, sum them, and divide by 1000. Note that this process is mathematically identical to sampling 1000 times from B without replacement! The result will be the sample mean of A. This is not the true mean of A, but with enough samples the sample mean converges to the true mean for well-behaved distributions (by the law of large numbers).
- Option 2: Pick a random row each time (with replacement, so the same row can be picked more than once), 1000 times, sum them, and divide by 1000. By doing this, you're sampling B, creating a third distribution (C) composed of all the samples from B. This distribution would look even more distorted, relative to the distribution we're interested in (A), than B does, because you've introduced sampling noise a second time:
The result you obtain from Option 2 is not the sample mean of A, but the sample mean of B, or equivalently the sample mean of the sample mean of A. And that's not what we're interested in here.
If you have both enough samples of A and enough samples of B, the sample mean of the sample mean of A does converge to the true mean of A, but you're adding extra noise with this step.
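If you want to see that extra noise numerically, here's a rough Python sketch of the two options on the die example, using `random.choices` for the with-replacement pick. The exact numbers will vary run to run, but the Option 2 estimates scatter noticeably more around the true mean:

```python
import random
import statistics

random.seed(2)

def option1():
    """Mean of 1000 independent throws (equivalent to using the stored pool without replacement)."""
    throws = [random.randint(1, 6) for _ in range(1000)]
    return statistics.mean(throws)

def option2():
    """Resample the pool of 1000 throws with replacement 1000 times, then average (sampling B)."""
    throws = [random.randint(1, 6) for _ in range(1000)]
    resampled = random.choices(throws, k=1000)  # rows can repeat
    return statistics.mean(resampled)

# Repeat each procedure many times and measure how much the estimates
# scatter around the true mean of 3.5.
spread1 = statistics.stdev(option1() for _ in range(2000))
spread2 = statistics.stdev(option2() for _ in range(2000))
print(spread1, spread2)  # spread2 is noticeably larger: sampling noise added twice
```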
You could argue that picking 100 results from our 1000 samples of A is different from using all 1000, but it isn't (there's a short sketch after this list):
- Option 1: You can pick whichever 100 results from the 1000 samples of A you want (as long as you don't do it based on the value), because all the samples are independent of each other. For example, take the first 100, or randomly pick 100 results (without replacement) from the list. You'd still be computing the sample mean of A.
- Option 2: You can randomly pick 100 results from the 1000 samples of A with replacement, but then you'd be computing the sample mean of the sample mean of A. This is further from the true mean of A than the sample mean of A is.
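And a short sketch of those three ways of picking 100 results, with the same caveats as above (the die stands in for A). In a single run all three means land near 3.5; the difference is that the Option 2 pick carries the extra resampling noise, which only shows up when you repeat the whole procedure many times, as in the sketch further up:

```python
import random
import statistics

random.seed(3)

throws = [random.randint(1, 6) for _ in range(1000)]  # our 1000 samples of A

# Option 1: any 100 picked without regard to their values; still the sample mean of A.
first_100 = throws[:100]
random_100 = random.sample(throws, 100)        # random pick, without replacement

# Option 2: 100 picked with replacement; the sample mean of the sample mean of A.
resampled_100 = random.choices(throws, k=100)

print(statistics.mean(first_100), statistics.mean(random_100), statistics.mean(resampled_100))
```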