M3 vs M4

name99 · May 22, 2024

leman said:
It’s certainly not coming from Apple. One of the earlier such diagrams was done by Dougall Johnson who wrote a series of tests to measure these details. These newer diagrams are likely done by Geekerwan team using tools written by Johnson.

And one can get quite a lot of info by manipulating the data dependencies and measuring how long it takes an instruction chain to execute. Apple also provides a bunch of performance counters that lets you measure all kinds of code statistics.

Here's thing.
It took Dougall and I a fairly long time to put together this information, and most of it only makes sense IF you are willing to understand, really understand, the underlying micro-architecture.
What I am seeing since then is a bunch of people who are capable of compiling and executing microbenchmark code, but without the skill and/or the interest in UNDERSTANDING it. So we get numbers that are, in some technical sense, correct, but not especially important or interesting.

For example look at dougall's discussion of measuring the M1 load/store queues:

Apple M1: Load and Store Queue Measurements

Out-of-order processors have to keep track of multiple in-flight operations at once, and they use a variety of different buffers and queues to do so. I’ve been trying to characterise and meas…

dougallj.wordpress.com

Even after correcting his various errors and misunderstandings, dougall and I still disagree on a bunch of things. Relevant to the LSQ measurements, I think Apple is using, even with M1, virtual LSQ slots, which is a more important detail than the exact size of the LSQ (and they may well have graduated to virtual register allocation for the M4, again more important than any other detail).

[An OoO CPU allocates resources early in the pipeline at the stage usually called Rename, but which I call Allocate because that's more descriptive.
BUT "allocating" a resource often performs multiple roles...
One role is to ensure that a physical piece of hardware (a slot in a queue, a physical register, whatever) will be available when needed. A DIFFERENT role is tracking dependencies (consider the job of register rename and how the rename table works).
Traditionally a single allocation has been performed that executes both roles, but this is inefficient. dependency tracking generally needs to persist from Allocation until at least Execution, and often past Execution if you want to engage in some sort of Replay or Flush. But physical hardware allocation only needs to persist from Issue to Execution.
Part of what Apple has been doing, since the A7, is one by one splitting tasks and hardware that have always been bundled together into separately optimized pieces. Separating register dependency tracking ("virtual registers") from physical register allocation would be just one more step in this process.

It is details like this that really matter in a new design. But you won't capture this by mindlessly running microbenchmarks, especially if you don't ever tweak the benchmarks to see what behaves unexpectedly, what doesn't match your mental model.

Another issue (super-important if true) is whether 10-wide Decode (if real...) goes along with 10-wide Allocate or not. 10-wide Decode is easy. 10-wide Allocate is hard (much harder than 9-wide, which was much harder than 8-wide). But you often do NOT NEED 10-wide Allocate – usually at least one of the instructions of the 10 decoded is fused with a subsequent instruction...
What you do need, instead, is a decoupling queue between Decode and Allocate to absorb bursts in Decode activity smoothed down to continuous Allocate activity. But again, to discover such a thing you need some careful new microbenchmarks, not just mindless execution of a pre-existing suite.

The basic problem (and I see this even with well-meaning sites like ChipsAndCheese) is that CPUs are not all the same!!! There was a time when perhaps they were all similar, but now there is so much specialized detail that you can't usefully transfer your in-depth knowledge of Intel to just assume you are doing something useful when you microbenchmark Apple (and probably soon Qualcomm/Nuvia). You have to actually understand how Apple is different to know what to even test for. And if there is one thing people hate even more than admitting that they were ever wrong, it's having to learn something new.

So expect five years or so of basically garbage microbenchmarks for both Apple and Qualcomm, until a new generation of teens gets into the game, one that hasn't grown up thinking the world is an x86, :-(

name99 · May 22, 2024

Xiao_Xi said:
Would it be possible to get that information from Apple's LLVM fork?

Not even close

Xiao_Xi · May 22, 2024

name99 said:
Not even close

Doesn't Apple need to write a TableGen file to describe each of its SoCs?

name99 · May 22, 2024

Xiao_Xi said:
Doesn't Apple need to write a TableGen file to describe each of its SoCs?

First of all the public LLVM is only updated with any sort of Apple SoC info months after the SoC ships. Apple may update its internal LLVM, but we don't have access to that.

Secondly the values provided to LLVM are generally not especially interesting. The main useful things you can learn from LLVM are whatever new features have been provided (ie whether the chip supports ARM v 8.7A or whatever) and whether any new fusions have been added. Beyond that, there's generally not anything useful. I've just told you why microbenchmark numbers are mostly not useful; numbers to steer LLVM in a particular direction are even less useful.

leman · May 23, 2024

name99 said:
Here's thing.
It took Dougall and I a fairly long time to put together this information, and most of it only makes sense IF you are willing to understand, really understand, the underlying micro-architecture.
What I am seeing since then is a bunch of people who are capable of compiling and executing microbenchmark code, but without the skill and/or the interest in UNDERSTANDING it. So we get numbers that are, in some technical sense, correct, but not especially important or interesting.

For example look at dougall's discussion of measuring the M1 load/store queues:

Apple M1: Load and Store Queue Measurements

Out-of-order processors have to keep track of multiple in-flight operations at once, and they use a variety of different buffers and queues to do so. I’ve been trying to characterise and meas…

dougallj.wordpress.com

Even after correcting his various errors and misunderstandings, dougall and I still disagree on a bunch of things. Relevant to the LSQ measurements, I think Apple is using, even with M1, virtual LSQ slots, which is a more important detail than the exact size of the LSQ (and they may well have graduated to virtual register allocation for the M4, again more important than any other detail).

[An OoO CPU allocates resources early in the pipeline at the stage usually called Rename, but which I call Allocate because that's more descriptive.
BUT "allocating" a resource often performs multiple roles...
One role is to ensure that a physical piece of hardware (a slot in a queue, a physical register, whatever) will be available when needed. A DIFFERENT role is tracking dependencies (consider the job of register rename and how the rename table works).
Traditionally a single allocation has been performed that executes both roles, but this is inefficient. dependency tracking generally needs to persist from Allocation until at least Execution, and often past Execution if you want to engage in some sort of Replay or Flush. But physical hardware allocation only needs to persist from Issue to Execution.
Part of what Apple has been doing, since the A7, is one by one splitting tasks and hardware that have always been bundled together into separately optimized pieces. Separating register dependency tracking ("virtual registers") from physical register allocation would be just one more step in this process.

It is details like this that really matter in a new design. But you won't capture this by mindlessly running microbenchmarks, especially if you don't ever tweak the benchmarks to see what behaves unexpectedly, what doesn't match your mental model.

Another issue (super-important if true) is whether 10-wide Decode (if real...) goes along with 10-wide Allocate or not. 10-wide Decode is easy. 10-wide Allocate is hard (much harder than 9-wide, which was much harder than 8-wide). But you often do NOT NEED 10-wide Allocate – usually at least one of the instructions of the 10 decoded is fused with a subsequent instruction...
What you do need, instead, is a decoupling queue between Decode and Allocate to absorb bursts in Decode activity smoothed down to continuous Allocate activity. But again, to discover such a thing you need some careful new microbenchmarks, not just mindless execution of a pre-existing suite.

The basic problem (and I see this even with well-meaning sites like ChipsAndCheese) is that CPUs are not all the same!!! There was a time when perhaps they were all similar, but now there is so much specialized detail that you can't usefully transfer your in-depth knowledge of Intel to just assume you are doing something useful when you microbenchmark Apple (and probably soon Qualcomm/Nuvia). You have to actually understand how Apple is different to know what to even test for. And if there is one thing people hate even more than admitting that they were ever wrong, it's having to learn something new.

So expect five years or so of basically garbage microbenchmarks for both Apple and Qualcomm, until a new generation of teens gets into the game, one that hasn't grown up thinking the world is an x86, :-(

Thank you, Maynard! All of this makes a lot of sense to me. It is very similar to what we do in sciences - anybody can run some stats, it’s interpreting the results and putting together a coherent model of the world where the challenge lies.

This is also I am not fully satisfied with reviewers like Geekerwan. Their effort is much appreciated, I would still like to see a bit more methodological rigor and discussion instead of presenting numbers as facts.

MayaUser · May 23, 2024

So it is true that M4 has 12gb ram but 4 are disabled in the ipad, so on macs we can see the full 12gb ram M4?

killhippie · May 23, 2024

Does the M4 still suffer GoFetch? THey have hardly had time to change the architecture from the M3 and redesigning a CPU and fabricating it takes a long time, so I wonder if this chip will be vulnerable too.

thenewperson · May 23, 2024

MayaUser said:
So it is true that M4 has 12gb ram but 4 are disabled in the ipad, so on macs we can see the full 12gb ram M4?

I doubt it. It’s not like iPad’s won’t benefit from more RAM yet they decided to do that. So I wouldn’t be surprised if they simply continued on in Macs.

uller6 · May 23, 2024

MayaUser said:
So it is true that M4 has 12gb ram but 4 are disabled in the ipad, so on macs we can see the full 12gb ram M4?

Definitely not. They’re only putting 8Gb packages in iPads.

MayaUser · May 23, 2024

uller6 said:
Definitely not. They’re only putting 8Gb packages in iPads.

dont understand. only putting 8gb on ipads so that means on the mac 12?

MayaUser · May 23, 2024

is the m4 Apple's First Armv9.4-A?

MayaUser · May 23, 2024

impressive move from M3 to M4

Xiao_Xi · May 23, 2024

MayaUser said:
impressive move from M3 to M4

How much has M4 improved over M3 in GB 6.2?

MayaUser · May 23, 2024

It seems all OEM that presented arm laptops, all have fans...even for the x plus models not just for Elite X...so are they competing in the macbook pro territory even if they dont have M3 Max power or they play safe for sustain load ?!
But it is very strange not to offer a fanless Macbook Air competitor
Both surface laptops with fan

leman · May 23, 2024

MayaUser said:
is the m4 Apple's First Armv9.4-A?

Doesn't seem like it. It's not like Apple cares much about ARM versions.

MayaUser · May 23, 2024

Surface laptop under load editing in a quiet 20dba room , with the fan running was around 36dba

name99 · May 23, 2024

leman said:
Thank you, Maynard! All of this makes a lot of sense to me. It is very similar to what we do in sciences - anybody can run some stats, it’s interpreting the results and putting together a coherent model of the world where the challenge lies.

This is also I am not fully satisfied with reviewers like Geekerwan. Their effort is much appreciated, I would still like to see a bit more methodological rigor and discussion instead of presenting numbers as facts.

Agreed. We have to live with the data sources we have. While recognizing that they are highly imperfect...

As I keep saying to all the teens or students reading these fora – you want a job at a big tech company? It's not hard, just DO SOMETHING that exhibits some sort of deep technical knowledge and innovation.
Do what Geekerwan or ChipsAndCheese are doing, but in much more depth. Do what Dougall and I did for M1, but applied to M4.
There are like five people in the world doing this stuff!

name99 · May 23, 2024

MayaUser said:
is the m4 Apple's First Armv9.4-A?

Certainly not 9.4/8.9 since it doesn't have the memcpy instructions (which are 9.3/8.8).

A15, A16, M2, M3 are definitely considered 8.6/9.1 (although they're missing one crypto instruction, SM4, which appears to be a China-specific thing that Apple doesn't care about)

8.7/9.2 provide for SME (presumably optional, like SVE now is), so check. Also some 64B atomics which look useful and I wouldn't be surprised if Apple has them.

8.9/9.4 bring SME2

We don't even have an A17 entry in LLVM (at least not last time I checked) let alone an M4 entry. But my *guess* is that they will be marked as 8.7 with an additional FEAT_SME2 flag.

But a lot of the new stuff in 8.8, 8.9 looks like performance and security/OS stuff Apple would like. So I'm guessing they'll move to full 8.9 ASAP.
Maybe even something like 8.9 for the A18, then 9.4 (ie 8.9 plus SVE2) for an M5 this time next year? Well, a man can dream!

Xiao_Xi · May 23, 2024

name99 said:
A17 entry in LLVM

Code:

ARMCPUTestParams<AArch64::ExtensionBitset>(
            "apple-a17", "armv8.6-a", "crypto-neon-fp-armv8",
            AArch64::ExtensionBitset(
                {AArch64::AEK_CRC, AArch64::AEK_AES, AArch64::AEK_SHA2,
                 AArch64::AEK_SHA3, AArch64::AEK_FP, AArch64::AEK_SIMD,
                 AArch64::AEK_LSE, AArch64::AEK_RAS, AArch64::AEK_RDM,
                 AArch64::AEK_RCPC, AArch64::AEK_DOTPROD, AArch64::AEK_FP16,
                 AArch64::AEK_FP16FML, AArch64::AEK_SHA3, AArch64::AEK_BF16,
                 AArch64::AEK_I8MM, AArch64::AEK_JSCVT, AArch64::AEK_FCMA,
                 AArch64::AEK_PAUTH}),
            "8.6-A"),

llvm-project/llvm/unittests/TargetParser/TargetParserTest.cpp at main · llvm/llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies. - llvm/llvm-project

github.com

Confused-User · May 23, 2024

killhippie said:
Does the M4 still suffer GoFetch? THey have hardly had time to change the architecture from the M3 and redesigning a CPU and fabricating it takes a long time, so I wonder if this chip will be vulnerable too.

GoFetch is a very interesting vulnerability, but it's not a practical one. They should fix it, but in the real world it's not meaningful.

The M3 has a bit that can be set in order to disable DMP, and the M4 surely has that as well. (It turns out that, despite press to the contrary, the M1 and M2 also have an equivalent feature.) Between that and the necessary preconditions for an attack, it's not of significant concern.

For a reasonable overview of this see, for example, this.

uller6 said:
Definitely not. They’re only putting 8Gb packages in iPads.

You're not keeping up to date.

The 1-2TB iPads have two 8GB chips. But apparently the 256GB and 512GB iPads have 2x 6GB chips, not 2x 4GB, despite being advertised as the latter. The iPads only use 8GB of that though.

MayaUser said:
It seems all OEM that presented arm laptops, all have fans...even for the x plus models not just for Elite X...so are they competing in the macbook pro territory even if they dont have M3 Max power or they play safe for sustain load ?!
But it is very strange not to offer a fanless Macbook Air competitor
Both surface laptops with fan

It will be interesting to see if they try to build one without a fan, and if so what speed the chip will run at. At the lower power and performance tier QC was talking about when they introduced the SXE, the chip was already slower than the M3 even in multicore, and it gets utterly clobbered in SC. They may think it's a bad look to run at such low performance. But on the other hand, they don't want to be turning away sales...

mr_roboto · May 23, 2024

killhippie said:
Does the M4 still suffer GoFetch? THey have hardly had time to change the architecture from the M3 and redesigning a CPU and fabricating it takes a long time, so I wonder if this chip will be vulnerable too.

Please, I urge everyone, stop worrying about GoFetch.

The academic researchers behind GoFetch took a look at a common performance feature in recent CPU cores, one known to have potential side channel attacks against cryptography. They found that yes, it could indeed be exploited. But did they demonstrate a practical exploit you should worry about? Absolutely not. Their proof-of-concept implementation consists of a victim server process which will happily handle failed decryption requests at any arbitrary rate, forever, and an attacker client process. A real exploit would involve demonstrating that they could extract private keys from Apple's system APIs, such as Keychain. They have not done this.

And they probably can't. One trivial protection against this style of exploit is for your "server" code to simply monitor requests and stop serving any process making tons of failed requests, because that's not a pattern which legitimate users of cryptography APIs ever do.

Another protection mechanism is that Arm has an optional CPU feature, FEAT_DIT, which allows programs to request "Data Independent Timing" execution. DIT mode is intended to briefly disable vulnerable performance features, such as the one attacked by GoFetch, during execution of sensitive code. Apple implements DIT in their CPUs, and it's a good bet that they use it to protect all their system provided cryptography code.

Another issue with practical exploitation is that Apple does a lot of cryptography inside the Secure Enclave instead of relying on software running on the main application processors. GoFetch cannot extract anything from the Secure Enclave, it's strictly a demonstration that if a victim process and attacker process are running on the same M1 CPU cluster (not even different clusters!), the attacker can slowly extract key bits from the victim.

Moving on to fixes... Apple did make one mistake in M1 and M2. In these CPUs, DIT mode does not turn off the data-dependent prefetcher which GoFetch attacks. This was already fixed as of M3, and should remain fixed in M4.

Finally, even though M1 and M2 are vulnerable in a way which probably doesn't matter, that doesn't mean there's nothing Apple can do about it, despite all the UNPATCHABLE OMG SKY IS FALLING coverage. Hector Martin has discovered that Apple has a mode control bit (separate from the FEAT_DIT bit) to disable this special prefetcher. Presumably it's an oversight on their part that they didn't make DIT control this too, but the practical implication is that if Apple has any potential GoFetch vulns in system libraries, they can fix that even on M1 and M2.

EugW · May 23, 2024

uller6 said:
Definitely not. They’re only putting 8Gb packages in iPads.

Maybe I'm misunderstanding, but in the 8 GB iPad Pro model, the packaged chips are 48 Gigabit + 48 Gigabit (6 GB + 6 GB). Or at least some of them are.

BTW, Geekerwan claims they're custom LPDDR5, not LPDDR5X.

thenewperson said:
I doubt it. It’s not like iPad’s won’t benefit from more RAM yet they decided to do that. So I wouldn’t be surprised if they simply continued on in Macs.

I held off buying a used M1 MacBook Air for our kitchen. My thinking is that the base M4 MacBook Air might just get 12 GB base. Maybe.

Torty · May 23, 2024

EugW said:
Maybe I'm misunderstanding, but in the 8 GB iPad Pro model, the packaged chips are 48 Gigabit + 48 Gigabit (6 GB + 6 GB). Or at least some of them are.

BTW, Geekerwan claims they're custom LPDDR5, not LPDDR5X.

I held off buying a used M1 MacBook Air for our kitchen. My thinking is that the base M4 MacBook Air might just get 12 GB base. Maybe.

Maybe the MBP while the MBA gets 12GB too but only 8GB usable.

leman · May 23, 2024

EugW said:
Maybe I'm misunderstanding, but in the 8 GB iPad Pro model, the packaged chips are 48 Gigabit + 48 Gigabit (6 GB + 6 GB). Or at least some of them are.

We should keep in mind that the only evidence for this comes from a third-party chip model number decoder. The chips themselves are custom, made to order and there is no official spec sheet. I am not saying that the decoder is wrong, just that one should carefully weight the evidence.

altaic · May 24, 2024

leman said:
We should keep in mind that the only evidence for this comes from a third-party chip model number decoder. The chips themselves are custom, made to order and there is no official spec sheet. I am not saying that the decoder is wrong, just that one should carefully weight the evidence.

They could also be binned parts, such that areas w/defects are disabled. Either way, it doesn’t really matter unless you care about clicks, in which case: HALLELUJAH!

M3 vs M4

macrumors 68030

macrumors 68030

macrumors 68000

macrumors 68030

macrumors Core

macrumors 68040

macrumors 6502a

macrumors 65816

macrumors 65816

macrumors 68040

macrumors 68040

macrumors 68040

Attachments

macrumors 68000

macrumors 68040

macrumors Core

macrumors 68040

macrumors 68030

macrumors 68030

macrumors 68000

macrumors 6502a

macrumors 6502a

macrumors P6

macrumors 65816

macrumors Core

macrumors 6502a

Our Staff