
ArkSingularity

macrumors 6502a
Mar 5, 2022
928
1,130
Another way to look at it is that SMT is a way to reduce the resource waste your architecture is already suffering from. If SMT gives you 40% higher performance in a multithreaded scenario, that kind of means that around 30% of your execution resources are essentially wasted in a single-threaded scenario.

SMT is a cheap way to get some of it back, but now that the race for the highest IPC is back on, designers will try to improve the resource efficiency at the core itself (pun intended). It's not easy, but it can be done, as Apple very clearly illustrates with their 3.6 GHz CPUs reaching the same performance as Intel at 5 GHz. And if your core is already very good at utilising the resources, SMT becomes redundant.
I've thought about this also. One thing to consider as they widen these cores further is that the achievable parallelism on any given stream of instructions might not necessarily always be able to fully utilize the architecture 100% of the time. If the code is structured in such a way that it cannot keep an 8-wide pipeline full, SMT can utilize those resources whereas one thread might not.

Newer processors seem to be going crazy on execution resources in the hopes of being able to utilize them. The goal of course is to find any achievable parallelism and to exploit it, and SMT can help to fill the cracks where this isn't achievable.

All of that being said, I don't really miss SMT much. I wouldn't complain too much if Apple added it back in, but Apple's strategy has worked very well without it. Keeps things much simpler (I don't have to worry whether audio applications are going to complain if their threads aren't assigned ideally by the scheduler).
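
To make the arithmetic in the quoted post concrete, here is a back-of-the-envelope sketch with illustrative numbers, assuming SMT recovers essentially all of the idle issue slots:

```c
/* Back-of-the-envelope: relate an SMT throughput gain to the fraction of
 * execution resources a single thread leaves idle. Illustrative only;
 * assumes two SMT threads together reach ~full utilization. */
#include <stdio.h>

int main(void) {
    double smt_gain = 0.40;                              /* e.g. +40% multithreaded throughput */
    double single_thread_util = 1.0 / (1.0 + smt_gain);  /* one thread's share of full utilization */
    double idle_fraction      = smt_gain / (1.0 + smt_gain);

    printf("single-thread utilization ~ %.0f%%\n", single_thread_util * 100.0); /* ~71%% */
    printf("idle execution resources  ~ %.0f%%\n", idle_fraction * 100.0);      /* ~29%% */
    return 0;
}
```

So a 40% SMT gain corresponds to roughly 29% of single-thread issue slots going unused, in line with the "about 30%" figure above.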
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,317
It's a very different market segment, so your usual logic might not apply. And I also remember reading that some software licenses in that market are per CPU core, and with their SMT model IBM is somehow exploiting this. Don't ask me for details though, I don't understand how it works.
SMT is about optimizing your core for THROUGHPUT instead of latency.
The sensible way to optimize for throughput is via multiple cores. But the markets IBM sells in charge for SW by the core; so SMT in IBM's case is a legal workaround.

Needless to say NOTHING about this has any relevance to Apple...

(It also has zero real relevance to x86. So why do Intel continue to do it? Well, go ask them.
But the quality of their fans, what they demand, and the "technical" arguments they offer up in support of SMT, give you all the answer you really need.
SMT has become a tribal marker, and much of team x86 these days buys based on tribal affiliation.)
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,317
I've thought about this also. One thing to consider as they widen these cores further is that the achievable parallelism on any given stream of instructions might not necessarily always be able to fully utilize the architecture 100% of the time. If the code is structured in such a way that it cannot keep an 8-wide pipeline full, SMT can utilize those resources whereas one thread might not.

Newer processors seem to be going crazy on execution resources in the hopes of being able to utilize them. The goal of course is to find any achievable parallelism and to exploit it, and SMT can help to fill the cracks where this isn't achievable.

All of that being said, I don't really miss SMT much. I wouldn't complain too much if Apple added it back in, but Apple's strategy has worked very well without it. Keeps things much simpler (I don't have to worry whether audio applications are going to complain if their threads aren't assigned ideally by the scheduler).

I'm not going to go down this rabbit hole again because if there's one thing SMT advocates have shown repeatedly it's that THEY CAN NOT LEARN. You can explain technical details a thousand times and they will take nothing from the explanation.

The simple answer to your question is that MOST of what makes modern OoO speculative computers work well is various forms of locality. Caches, branch predictors, even register allocation and power generated by bit flips, all work well/at lower energy because of strong correlation between successive instructions and data values. All that goes out the window when you introduce SMT and start polluting your various core data structures with unrelated material.

What you are arguing for is optimizing a tiny part of the CPU (the actual execution units) at the expense of the vast bulk of the CPU (all the various storage that's making those caches and predictors work so well).
Even the so-called "inefficiency" of the CPU in that frequently it can't run all those units is not wasted; Apple is constantly twiddling how fast the CPU runs at A CYCLE-BY-CYCLE granularity, so that less work done for a few cycles means the next few cycles don't have to run slower. (This is not DVFS; DVFS is way too slow for cycle-by-cycle manipulation. This is something called clock dithering, where the average energy/cycle of parts of each core is tracked, and if that average energy goes too high, the clock to that part of the core is frozen for a cycle or two, to briefly reduce the current draw.)
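
Purely as an illustration of the clock-dithering idea described above (a toy model with made-up thresholds and energy numbers, not Apple's actual mechanism): track a rolling average of energy per cycle for a block of the core, and freeze that block's clock for a cycle when the average runs hot.

```c
/* Toy model of per-cycle clock dithering. Thresholds and energy numbers
 * are invented for illustration; the real mechanism is not public. */
#include <stdio.h>

#define ALPHA     0.5    /* smoothing factor for the rolling energy average (made up) */
#define HOT_LIMIT 0.7    /* hypothetical energy-per-cycle budget */

int main(void) {
    /* made-up per-cycle energy demand of one core block, normalized to [0, 1] */
    double demand[10] = {0.3, 0.9, 0.95, 0.9, 0.4, 0.2, 0.85, 0.9, 0.3, 0.1};
    double avg = 0.0;
    int gated = 0;

    for (int cycle = 0; cycle < 10; ++cycle) {
        int gate = (avg > HOT_LIMIT);            /* decide from recent history */
        double e = gate ? 0.05 : demand[cycle];  /* a frozen block burns almost nothing */
        avg = (1.0 - ALPHA) * avg + ALPHA * e;
        printf("cycle %d: %s (avg energy %.2f)\n", cycle, gate ? "gated" : "run  ", avg);
        gated += gate;
    }
    printf("%d of 10 cycles gated\n", gated);
    return 0;
}
```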

Basically if you think the constraint on performance in a modern CPU is insufficiently aggressive use of execution units, rather than energy and EVERYTHING energy related (like the cost of moving data rather than modifying it), you are living in the 1990s.
 

ArkSingularity

macrumors 6502a
Mar 5, 2022
928
1,130
I'm not going to go down this rabbit hole again because if there's one thing SMT advocates have shown repeatedly it's that THEY CAN NOT LEARN. You can explain technical details a thousand times and they will take nothing from the explanation.

The simple answer to your question is that MOST of what makes modern OoO speculative computers work well is various forms of locality. Caches, branch predictors, even register allocation and power generated by bit flips, all work well/at lower energy because of strong correlation between successive instructions and data values. All that goes out the window when you introduce SMT and start polluting your various core data structures with unrelated material.

What you are arguing for is optimizing a tiny part of the CPU (the actual execution units) at the expense of the vast bulk of the CPU (all the various storage that's making those caches and predictors work so well).
Even the so-called "inefficiency" of the CPU in that frequently it can't run all those units is not wasted; Apple is constantly twiddling how fast the CPU runs at A CYCLE-BY-CYCLE granularity, so that less work done for a few cycles means the next few cycles don't have to run slower. (This is not DVFS; DVFS is way too slow for cycle-by-cycle manipulation. This is something called clock dithering, where the average energy/cycle of parts of each core is tracked, and if that average energy goes too high, the clock to that part of the core is frozen for a cycle or two, to briefly reduce the current draw.)

Basically if you think the constraint on performance in a modern CPU is insufficiently aggressive use of execution units, rather than energy and EVERYTHING energy related (like the cost of moving data rather than modifying it), you are living in the 1990s.
I wasn't trying to attack anyone. Your expertise on this is obvious and I hope I did not come across otherwise.
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,317
I wasn't trying to attack anyone. Your expertise on this is obvious and I hope I did not come across otherwise.
Fair enough. Sorry, I didn't mean to attack you; it's just SMT is one of these zombie ideas that will not die, but lives on not for technical reasons but (as far as I can tell) for affiliational reasons.

It's just maddening to have to explain every month why it is a bad idea, and yet never seem to have any effect because the people who demand it demand it for non-technical reasons, and so are impervious to technical arguments.
 
  • Like
Reactions: ArkSingularity

ArkSingularity

macrumors 6502a
Mar 5, 2022
928
1,130
Fair enough. Sorry, I didn't mean to attack you; it's just SMT is one of these zombie ideas that will not die, but lives on not for technical reasons but (as far as I can tell) for affiliational reasons.

It's just maddening to have to explain every month why it is a bad idea, and yet never seem to have any effect because the people who demand it demand it for non-technical reasons, and so are impervious to technical arguments.
For what it's worth, I have read a lot of your technical deep dives in things around here, and I have appreciated and learned a lot from them.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
I've thought about this also. One thing to consider as they widen these cores further is that the achievable parallelism on any given stream of instructions might not necessarily always be able to fully utilize the architecture 100% of the time. If the code is structured in such a way that it cannot keep an 8-wide pipeline full, SMT can utilize those resources whereas one thread might not.

Newer processors seem to be going crazy on execution resources in the hopes of being able to utilize them. The goal of course is to find any achievable parallelism and to exploit it, and SMT can help to fill the cracks where this isn't achievable.

I think this was pretty much the background behind SMT on Intel CPUs. This attitude of "real-world ILP is limited, so going wider is not worth it" became somewhat of a dogma in the x86 world, so nobody was even trying to see whether it was actually true. Then Apple designs came along that suddenly could extract 30-50% more ILP from the same code and now everybody is scrambling to go wider. I'm sure we are still far from practical limits.
 
  • Like
Reactions: ArkSingularity

Basic75

macrumors 68020
May 17, 2011
2,101
2,448
Europe
Then Apple designs came along that suddenly could extract 30-50% more ILP from the same code and now everybody is scrambling to go wider. I'm sure we are still far from practical limits.
Lower clock frequencies are one important reason for Apple's designs being able to achieve higher IPC. The lower the frequency the easier it is to increase IPC -- the same absolute cache and memory latencies translate to fewer clock cycles. As for practical limits ... I'm not sure. AFAIK an average IPC of 2 on most real-world loads is still considered good, even on a modern 8-wide core.
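
A quick worked example of the cycles-versus-nanoseconds point, using round illustrative numbers rather than measured latencies of any particular chip:

```c
/* Same absolute latency, different clock: cycles = latency_ns * GHz.
 * Round illustrative numbers, not measurements of any particular CPU. */
#include <stdio.h>

int main(void) {
    double dram_ns = 80.0;              /* hypothetical absolute memory latency */
    double clocks_ghz[2] = {3.5, 5.0};

    for (int i = 0; i < 2; ++i)
        printf("%.1f GHz core: ~%.0f cycles to hide on a miss\n",
               clocks_ghz[i], dram_ns * clocks_ghz[i]);   /* ~280 vs ~400 cycles */
    return 0;
}
```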
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
Lower clock frequencies are one important reason for Apple's designs being able to achieve higher IPC. The lower the frequency the easier it is to increase IPC -- the same absolute cache and memory latencies translate to fewer clock cycles.

I'm not sure I follow? The x86 CPUs have similar or lower cache access latency (in cycles) than Apple designs, from what I understand?


As for practical limits ... I'm not sure. AFAIK an average IPC of 2 on most real-world loads is still considered good, even on a modern 8-wide core.

If that were the case, we wouldn't see Apple matching the performance of much higher clocked x86 CPUs.
 

Basic75

macrumors 68020
May 17, 2011
2,101
2,448
Europe
I'm not sure I follow? The x86 CPUs have similar or lower cache access latency (in cycles) than Apple designs, from what I understand?
That was a general statement. Intel has great caches, and their best CPUs have higher frequencies and higher performance than Apple. Perhaps Apple gets away with worse caches because they target lower clocks.
If that were the case, we wouldn't see Apple matching the performance of much higher clocked x86 CPUs.
Well, what's the current score? A 3.5 GHz Apple Avalanche matches a 4.5 GHz Intel Raptor Cove? But the Intel can clock all the way up to 6 GHz. Sure, caches aren't everything and Apple can exploit the lower clocks in other ways, too.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
That was a general statement. Intel has great caches, and their best CPUs have higher frequencies and higher performance than Apple. Perhaps Apple gets away with worse caches because they target lower clocks.

More likely because Apple has much more cache. What I mean is that I don't understand why a higher-clocked cache needs to have higher latency. I thought latency was connected to cache size and associativity first and foremost? Not that I know anything about the topic.

Well, what's the current score? A 3.5 GHz Apple Avalanche matches a 4.5 GHz Intel Raptor Cove? But the Intel can clock all the way up to 6 GHz. Sure, caches aren't everything and Apple can exploit the lower clocks in other ways, too.

More like Apple's 3.5 GHz ~ Intel's 5 GHz, but we are talking about IPC here. Apple's cores are somehow able to execute 30-40% more instructions per cycle per thread. This is only possible if the intrinsic ILP of the code is high enough. Can't extract something that's not there.
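
Spelling out the ratio behind that figure (pure arithmetic, not benchmark data): at equal single-thread performance, the required IPC advantage is just the inverse of the clock ratio.

```c
/* Equal single-thread performance at different clocks implies an IPC ratio
 * equal to the inverse frequency ratio. Pure arithmetic, not benchmark data. */
#include <stdio.h>

int main(void) {
    double apple_ghz = 3.5, x86_ghz = 5.0;
    /* perf = IPC * frequency; equal perf => IPC_a / IPC_x = f_x / f_a */
    double ipc_ratio = x86_ghz / apple_ghz;
    printf("required per-thread IPC advantage: ~%.0f%%\n", (ipc_ratio - 1.0) * 100.0);
    return 0;   /* ~43% for 3.5 GHz vs 5 GHz */
}
```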
 

Basic75

macrumors 68020
May 17, 2011
2,101
2,448
Europe
More likely because Apple has much more cache. What I mean is that I don't understand why a higher-clocked cache needs to have higher latency. I thought latency was connected to cache size and associativity first and foremost? Not that I know anything about the topic.
Sure, I didn't take into account Apple's larger L1 cache.
More like Apple's 3.5 GHz ~ Intel's 5 GHz, but we are talking about IPC here. Apple's cores are somehow able to execute 30-40% more instructions per cycle per thread. This is only possible if the intrinsic ILP of the code is high enough. Can't extract something that's not there.
True. And of course there are many factors to make that happen. AArch64 has more registers than x86-64; that'll help. And I believe that Apple has larger ROBs than Intel and AMD. As with caches, the lower clock frequency makes it easier to have a larger data structure with the same access latency measured in cycles. It's the old brainiac vs. speed demon debate for the nth time. Though each time it happens, energy efficiency becomes more important.
 

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
Lower clock frequencies are one important reason for Apple's designs being able to achieve higher IPC.
I would think power draw is the main reason for Apple's lower clock frequencies, rather than IPC.

It would probably be easier for AArch64 to go wider at the decoder compared to x86-64, due to the latter's variable-length instructions.
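
To illustrate why (a toy sketch, not a real decoder): with fixed 4-byte instructions every parallel decoder knows its byte offset up front, while with variable-length x86-64 encodings each instruction's start depends on the lengths of all the ones before it.

```c
/* Toy illustration of decode parallelism for fixed vs variable-length ISAs.
 * The instruction lengths below are made up; this is not a real decoder. */
#include <stdio.h>

int main(void) {
    /* fixed-width ISA: the i-th parallel decoder just looks at byte offset 4*i */
    for (int i = 0; i < 8; ++i)
        printf("fixed-width decoder %d reads offset %2d\n", i, 4 * i);

    /* variable-length ISA: each start offset depends on all previous lengths */
    int lengths[8] = {3, 1, 7, 2, 5, 4, 6, 2};   /* made-up instruction lengths */
    int offset = 0;
    for (int i = 0; i < 8; ++i) {
        printf("variable-length decoder %d reads offset %2d (needs %d prior lengths)\n",
               i, offset, i);
        offset += lengths[i];
    }
    return 0;
}
```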
 

Basic75

macrumors 68020
May 17, 2011
2,101
2,448
Europe
If that were the case, we wouldn't see Apple matching the performance of much higher clocked x86 CPUs.
That was you responding to my "2 is a good average IPC". It's been a while since I checked, but even a few years ago I was quite appalled by many real-world tasks barely getting an IPC of 1 on the best-at-the-time Intel cores! The big picture is that CPUs are much wider than what they can get out of most applications they are running. I hope you're right about there being possibilities for further improvement, but it ain't easy and I won't hold my breath.
 

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
What I mean is that I don't understand why a higher-clocked cache needs to have higher latency.
Higher frequencies mean less time for a signal to stabilize before it is read off the signal line, so additional clock cycles of delay may be needed before reading or writing. But that is probably offset by the higher clock.
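
A rough sketch of that timing point with made-up numbers: a cache array with a fixed access time in nanoseconds has to be spread over more clock cycles as the cycle gets shorter.

```c
/* Fixed array access time vs shrinking cycle time: more cycles of latency
 * at higher clocks. The 0.8 ns figure is invented for illustration. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double access_ns = 0.8;                 /* hypothetical fixed cache-array access time */
    double freqs_ghz[3] = {3.5, 5.0, 6.0};

    for (int i = 0; i < 3; ++i) {
        double cycle_ns = 1.0 / freqs_ghz[i];
        printf("%.1f GHz: %.2f ns cycle -> %d cycles of load-use latency\n",
               freqs_ghz[i], cycle_ns, (int)ceil(access_ns / cycle_ns));
    }
    return 0;   /* 3, 4, 5 cycles respectively */
}
```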
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
I would think power draw is the main reason for Apple's lower clock frequencies, rather than IPC.

That's also what I think. They can afford a lower clock because they are wider, not the other way around.

It would probably be easier for AArch64 to go wider at the decoder compared to x86-64, due to the latter's variable-length instructions.

Definitely cheaper in terms of die area and power consumption.

That was you responding to my "2 is a good average IPC". It's been a while since I checked, but even a few years ago I was quite appalled by many real-world tasks barely getting an IPC of 1 on the best-at-the-time Intel cores! The big picture is that CPUs are much wider than what they can get out of most applications they are running. I hope you're right about there being possibilities for further improvement, but it ain't easy and I won't hold my breath.

I suppose we will know soon enough. If Apple's 3nm chips are faster after normalising for clock, then there were still some IPC reserves to harness. If instead the performance comes mostly from clock increases, then we know that Apple has hit a wall.

As there is some evidence that Apple's upcoming cores are going to be wider, I remain optimistic.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
Recent Apple patents that describe branch prediction. It was discussed a couple of pages ago in this thread if I remember correctly.

That isn't necessarily 'wider'. When the branch prediction is wrong there is a pretty high 'back up and start over' penalty. Better prediction pragmatically lowers the overall aggregate overhead because you have to 'pay the cost' fewer times. That doesn't necessarily mean you can speculate even FARTHER ahead. If you move farther ahead and cross even more conditional boundaries, then you just have an even bigger 'mess' to clean up.

The big 'win' here is long loops over static data (e.g., micro benchmarks on fixed-in-stone data). Testing for data errors that pragmatically don't exist (in the static data) basically gives a 'free lunch' to speculate through to the extreme (pragmatically the error-handling code is 'dead', so it is dynamically just extracted from the program).

ILP in basic blocks is fixed. That really isn't a 'dogma'. The 'in real life' part is more conditional. In loops where the data is completely error free and/or non-random, the trick of throwing away the conditional checks and dynamically aggregating a bigger basic block than is physically present is a 'trick' that has legs. If you need to conditionally test all the data you are looping through for a dynamically given target (select c1, c2 from table1 where c3 = user_input_1), then that trick doesn't have nearly as much 'real life' added value.

IPC is a calculated average over whatever aggregate workload is thrown at the core. It isn't static across all workloads.
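
The two loop shapes being contrasted above, sketched in C purely for illustration:

```c
/* Illustrative sketch of the two loop shapes from the post above. */
#include <stdio.h>
#include <stddef.h>

/* (1) error check that never fires on clean data: trivially predicted, so
 *     iterations effectively merge into one big dynamic basic block */
static long sum_clean(const int *v, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; ++i) {
        if (v[i] < 0)        /* 'dead' error path on static, error-free data */
            return -1;
        s += v[i];
    }
    return s;
}

/* (2) data-dependent filter, like "where c3 = user_input_1": whether the
 *     branch is taken depends on each element, so the trick helps much less */
static size_t count_matches(const int *c3, size_t n, int key) {
    size_t hits = 0;
    for (size_t i = 0; i < n; ++i)
        if (c3[i] == key)
            ++hits;
    return hits;
}

int main(void) {
    int clean[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int col[8]   = {7, 3, 7, 1, 7, 2, 9, 7};
    printf("sum = %ld, matches = %zu\n", sum_clean(clean, 8), count_matches(col, 8, 7));
    return 0;
}
```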
 

leman

macrumors Core
Oct 14, 2008
19,521
19,674
That isn't necessarily 'wider'.

The patents specifically describe a type of execution unit optimised for executing speculative branches where the prediction confidence is high. The cost of branch misprediction when invoking this unit is higher, but one can speculate that the implementation cost of such a unit itself, in terms of transistor budget and energy expenditure, will be smaller. I don't think there is much point in discussing this type of unit specialisation unless you plan to have more of them.

ILP in basic blocks is fixed. That really isn't a 'dogma'. The 'in real life' part is more conditional. In loops where the data is completely error free and/or non-random, the trick of throwing away the conditional checks and dynamically aggregating a bigger basic block than is physically present is a 'trick' that has legs. If you need to conditionally test all the data you are looping through for a dynamically given target (select c1, c2 from table1 where c3 = user_input_1), then that trick doesn't have nearly as much 'real life' added value.

Sure, but if your OOE window is large enough you could perform the condition check many steps ahead, effectively unrolling the loop at runtime. The ILP of the basic block might be fixed but you can still execute multiple such blocks concurrently.
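
A minimal sketch of that idea (illustrative only): the per-element check below is a separate branch in every iteration, but it depends only on data[i], so a sufficiently large out-of-order window can have the loads, compares, and branches of many later iterations in flight while earlier ones are still completing, effectively unrolling the loop in hardware.

```c
/* Illustrative loop: each iteration's check is independent of earlier
 * iterations, so a big OOE window can run many iterations concurrently. */
#include <stdio.h>
#include <stddef.h>

int main(void) {
    int data[1024];
    for (size_t i = 0; i < 1024; ++i)
        data[i] = (int)(i % 7);

    long sum = 0;
    for (size_t i = 0; i < 1024; ++i) {
        /* well-predicted, data[i]-only check: later iterations' loads and
         * compares can issue long before this one's result is known */
        if (data[i] != 3)
            sum += data[i];
    }
    printf("sum = %ld\n", sum);
    return 0;
}
```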
 
Last edited:

name99

macrumors 68020
Jun 21, 2004
2,410
2,317
Lower clock frequencies are one important reason for Apple's designs being able to achieve higher IPC. The lower the frequency the easier it is to increase IPC -- the same absolute cache and memory latencies translate to fewer clock cycles. As for practical limits ... I'm not sure. AFAIK an average IPC of 2 on most real-world loads is still considered good, even on a modern 8-wide core.
Clock frequency is a choice.
L2 size is a choice.
Presence of SLC is a choice.
Algorithms to increase cache EFFECTIVE size (placement, replacement algorithms, dead block prediction, etc) are a choice.
High quality prefetchers are a choice.

In every one of these cases, Intel made its choices. Apple made better choices. It's that simple.

As far as how much ILP is available, this is not some single number known for all time. It depends on the details of the machine.
If the machine predicts branches more accurately, ILP goes up.
If it hits in cache more often because the prefetcher works better, ILP goes up.
If load/store aliasing is easily recovered from (replay rather than flush) ILP goes up.
Across a wide range of code (and a statement like this is inevitably handwaving) x86 gets about 2 ILP, Apple gets about 4.
If (again handwaving) you assume that Apple are spending half their time doing nothing (waiting for RAM or recovering from branch misprediction) then the other half of the time they are running 8-wide...
You can boost that ILP of 4 either by going to, say, 10-wide, or by reducing the fraction of time spent waiting for RAM and recovering from mispredicts. Both are good and sensible strategies.
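
The handwaving arithmetic above, made explicit with the same illustrative numbers: average IPC is roughly (1 - stall fraction) x issue width, so you can raise it either by going wider or by stalling less.

```c
/* Average IPC ~ (1 - stall fraction) * issue width. Illustrative numbers
 * taken from the handwaving estimate above, not measurements. */
#include <stdio.h>

int main(void) {
    double width = 8.0, stall_fraction = 0.5;

    printf("baseline          : IPC ~ %.1f\n", (1.0 - stall_fraction) * width); /* 4.0 */
    printf("go 10-wide        : IPC ~ %.1f\n", (1.0 - stall_fraction) * 10.0);  /* 5.0 */
    printf("stall 40%% instead : IPC ~ %.1f\n", (1.0 - 0.4) * width);           /* 4.8 */
    return 0;
}
```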

During the 90s a few studies were done of the form called Limit Studies, where you assume various perfect aspects of your CPU (perfect cache, or perfect branch prediction, or a ROB of unlimited size, or whatever). The conclusion of these studies was that there's substantial ILP available in this limit (something like ~50) across a wide range of code (and of course that number ranges widely).
Unfortunately I haven't seen any recent Limit Studies that target anything like a modern design, and that do a good job of varying the parameters to give a feel for the gap between how well a perfect design performs vs how well a stretch (but plausible in say 5 years) design might behave.

Point is, the number is (ASSUMING perfect everything...) large enough that wider keeps getting better for our current design points. BUT (and I keep stressing this) this all only holds true if, as you grow wider, you also approach closer and closer to perfect with your branch prediction, cache performance, etc.
There was a time when Intel put massive effort into these, including sponsoring vast amounts of very good academic work. But somewhere around Sandy Bridge they seemed to lose all interest in this and since then it's just pathetic minor tweaks. Meanwhile Apple looked at all that academic research and said "damn! the sky is the limit for the next thirty years"...
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,317
That was a general statement. Intel has great caches, and their best CPUs have higher frequencies and higher performance than Apple. Perhaps Apple gets away with worse caches because they target lower clocks.
Worse caches means what?
Your thinking is EXACTLY what I claim has destroyed Intel – the assumption that the sole important metric of a design (including a cache) is frequency.

Apple caches are better by every metric (size, hit rate, energy, parallelism, etc.) than Intel's, except two:
- timing (i.e. frequency)
- width (i.e. you can load wide AVX-512 vectors from Intel's L1D)
In both cases I'd argue that Intel has massively optimized for the wrong thing, and it shows.

And of course as you go to the further caches (L2 then SLC) there's no comparison. Apple is doing things with those caches in terms of energy saving, IO communication, managing GPU streams, etc that Intel hasn't even dreamed of.
 