M3 Chip Generation - Discussion Megathread

AgentMcGeek · Sep 15, 2023

MRMSFC said:
Well, we don’t really know if the clocks that the A-series runs at are the limit of stability or whether they’re conservative.

If the former is true then, yeah there’s difficulty. If the latter, then it’s basically just an overclock, something that enthusiasts do on PCs.

I don’t think it’s fair to just compare this complex engineering to enthusiast OC.

You have to balance power consumption, heat constraints, transistor allocation, cache (not) scaling etc.

I agree with the general sentiment that they seem to have favoured NPU and GPU this year. The fact that they’ve been able to add an additional GPU core in the same thermal and physical envelop is great in itself. Looking forward to the M3 lineup.

uller6 · Sep 15, 2023

scottrichardson said:
I believe we are seeing the fundamentally new way of doing computing - and that is with dedicated ASICs/silicon optimised for unique, tightly grouped tasks, and move away from a generalised monolithic CPU core/cores.

We already have encode/decode for ProRes, AV1, and silicon for encrypt/decrypt for file fault.

And now ray-tracing.

Why not other silicon that's optimal for other tasks we do in computing? Is there something better than a monolithic generalised CPU at say... managing photoshop layers and masks? Or handling virtual instruments and plugins in audio applications? What about just simple stuff like typing and rendering text on-screen? Could that be handled by dedicated circuitry? The list goes on.

Scott

Totally dude! It’s called heterogeneous computing! We’ve been going that direction for 30 years already adding more and more functionality to SOCs. Remember when the intel floating point unit was a separate piece of silicon?

quarkysg · Sep 15, 2023

uller6 said:
Totally dude! It’s called heterogeneous computing! We’ve been going that direction for 30 years already adding more and more functionality to SOCs. Remember when the intel floating point unit was a separate piece of silicon?

The north and south bridges used to be separate ICs connected to the CPU. Now the north bridge is inside the main CPU. So yeah, you need things to sit closer if you want performance.

leman · Sep 15, 2023

Yeah, it's interesting how computing is emphasising both specialisation/heterogenous processing (specialised processors for different class of tasks) AND tight integration (for fast communication between the processors). One might think that these are opposites, but they go hand in hand.

Mac_fan75 · Sep 15, 2023

sack_peak said:
I am disappointed that Apple did not take the opportunity to increase battery mAh and instead offered a lighter weight iPhone.

Would be nice if 221g 15 Pro Max had the same weight as the 240g 14 Pro Max but with more battery like say 5,000 mAh instead of just 4,422 mAh.

Longer battery life = Lesser recharge cycles = Longer battery health before replacement

iPhone Pro Max battery mAh

- 15 Pro Max: 4,422mAh
- 14 Pro Max: 4,323 mAh
- 13 Pro Max: 4,352 mAh
- 12 Pro Max: 3,687 mAh
- 11 Pro Max: 3,500 mAh
- Xs Max: 3,174 mAh

Like if say Steve Jobs was not that stubborn about phabtlets back in 2010 and knew how unpopular the iPhone mini was we'd have 2015 iPhones 6 4.7" by 2010 iPhone 4 3.5".

Yeah but the phone is already a heavy brick, I really welcome those changes to the weight.
The unpopularity of the Mini is purely because of the size, only small portion perhaps the battery but that wasn't so bad anyways. I would say people who buy a mini are probably also not the people who stare at their phone the whole day.

sack_peak · Sep 15, 2023

Mac_fan75 said:
Yeah but the phone is already a heavy brick, I really welcome those changes to the weight.
The unpopularity of the Mini is purely because of the size, only small portion perhaps the battery but that wasn't so bad anyways. I would say people who buy a mini are probably also not the people who stare at their phone the whole day.

If I get 2000 Nokia 3310 battery life then I do not mind the weight all too much.

Xiao_Xi · Sep 15, 2023

AgentMcGeek said:
I would question the assumption that clock=easy vs IPC=hard.

You can increase the clock speed by designing a longer pipeline. Intel has used this strategy extensively.

name99 · Sep 15, 2023

leman said:
I would argue that specialisation is as old as computing itself, and that it goes both ways.

Take something as basic as floating point numbers, which we usually take for granted. You can perfectly well do them using the general-purpose integer execution units. And that's what we were doing not so long ago — but it's slow and a waste of energy. So as soon as transistor budgets permitted for some opulence, hardware designers moved these computations to specialised hardware. And there are many examples like these, from specialised cryptography instructions to trigonometrical functions in hardware (something that would be unthinkable luxury in early 80-ties for example).

But we also have enough examples of the opposite developments too. Remember gaming sound cards? They have quickly vanished when CPUs became fast enough to do audio effect processing in real time (and now hardware raytracing allows to do really complex audio stuff on the GPU as well). Similarly, early GPUs contained configurable fixed-function hardware for computing visual effects (e.g. Nvidia's register combiners), this stuff has completely disappeared now, replaced by the programmable pipeline, even though the fixed-function hardware is more efficient.

At the end of the day, it's all about solving practical problems. If we had unlimited computing power, nobody would bother with specialised hardware. But at the same time one doesn't want to build a custom accelerator for every little thing. Tasks like raytracing, compression (and I consider encode/decode a subset of compression), deep learning inference are unique and important enough to justify dedicated hardware. Managing photoshop layers or rendering text is probably not, especially since it can be expressed as a subclass of parallel programs for which accelerators — GPUs — already exist.

To sum it up, no I don't believe that there will be a proliferation of task-specific accelerators. We will certainly have plenty of modules to do specific tasks, but these are all tasks that do not require programmability or flexible logic. But developers should definitely learn about different classes of problems and how to distribute work across available hardware.

What can be done, however, is to reduce the overhead of using accelerators.
AMX is a nice example of this, giving an accelerator that's fairly light weight in terms of latency (tens of cycles) while also fairly powerful.

I think there's scope for adding blit acceleration (copy lots of data) or basic pattern search in this same way; basically think of all the types of things people want to do under the term PiM (Processing in Memory) and provide such an accelerator, using the same machinery as AMX to couple it to the CPU. Such a PiM accelerator could be attached to either the L2 or the SLC (there are reasonable arguments for either option, and perhaps it's small enough that both should be provided).

On a very different track, while the GPU and ANE are powerful, getting to them right now is very high overhead; basically not worth doing if you have less than a million items to process. Can we reduce this overhead substantially? Consider specifically compute-only paths (so no worrying about how this has to interact with the rest of the graphics machinery) - could we put together a package of updates for a super streamlined path in
- the language (ie MSL) and APIs
- the OS call
- the driver
- the dispatch of the GPU call by firmware
that reduces this by a factor of 10? A factor of 100?

I don't see any INTRINSIC reason why not. For example you could start by providing something like the CUDA Graph setup where the app, at initialization time, provides a list of graphs (ie sets of functions and their mutual dependencies) that it will wish to execute; and these are compiled, validated, whatever else the machinery wants to do, so that they are sitting in RAM just locked and ready to go. Then each time the app wishes to make a call, that's transported by IPI to the GPU firmware, which fires off the appropriate code. None of this business of dynamically created queues of instructions that are re-submitted over and over – that probably makes sense for graphics use cases but seems like pointless (and extremely expensive) overhead for simple compute co-processor cases.

name99 · Sep 15, 2023

koyoot said:
13 Pro Max lasts longer on battery, albeit slightly, than 14 Pro Max, while doing the exact same, every day tasks.

2% is going to be lost in the noise of whether the cell phone/wifi signal is weaker or stronger where you happen to be doing those "every day tasks", or whether it's brighter or darker (meaning screen ramps up to slightly brighter or darker)!

name99 · Sep 15, 2023

MRMSFC said:
This is correct,

Apple moved away from Intel because they were stagnating, and the A-series was on track to outperform them. The M1 made a huge splash because it offered great performance at crazy low power draw, and pretty much everyone extrapolated performance gains from the previous ten years of A-series processors.

When it seems like those goals aren’t being realized, it makes people anxious whether Apple can continue to deliver competitive speed.

As for their chip design team, the people here knowledgeable seem to think Apple’s doing good. If they say so, I have no reason to doubt them.

As for Intel and AMD, they seem like they’re just going all out in speed, wattage be damned. If that’s what it takes to compete with the M-series cpu cores, then I don’t think it’s Apples chip designers we need to worry about.

I think it’s more prudent to focus on GPU. It’s what most people seem to tear into the M-series for anyway. And to their credit it is, I believe, Apple stands to gain the most.

Apple talked up the A16 purely in terms of energy, said nothing about performance (you can go back and look at the announcement) and yet we got about a 10% performance boost.

If (as seems the way to bet) the M3 is based on the A17 and ALL we get is the same sort of scaling as the A16 and A17, that's a 20% speed boost; more so than the (not unwelcome) 10% boost going from M1 to M2. This is even apart from whatever additional functionality Apple may add to the M1 (or Pro/Max) as appropriate (eg support for large pages or extra AMX functionality).

Hell it's even possible that there's 1% or 2% additional performance to be coaxed out of the A17 just by using an optimized compiler (eg one that is aware of additional fusion patterns used by the A17). Remember that GB6 clearly cannot be compiled to optimize for A17 as a target...

Point is, all in all it seems kinda premature to weep and wail about the failure of the grand Apple Silicon experiment just as it is about to deliver us an M3 that's, who knows, 20..25% faster than M2 on P cores; and, who knows what, faster on GPU, NPU, and media!

leman · Sep 15, 2023

name99 said:
What can be done, however, is to reduce the overhead of using accelerators.
AMX is a nice example of this, giving an accelerator that's fairly light weight in terms of latency (tens of cycles) while also fairly powerful.

I think there's scope for adding blit acceleration (copy lots of data) or basic pattern search in this same way; basically think of all the types of things people want to do under the term PiM (Processing in Memory) and provide such an accelerator, using the same machinery as AMX to couple it to the CPU. Such a PiM accelerator could be attached to either the L2 or the SLC (there are reasonable arguments for either option, and perhaps it's small enough that both should be provided).

On a very different track, while the GPU and ANE are powerful, getting to them right now is very high overhead; basically not worth doing if you have less than a million items to process. Can we reduce this overhead substantially? Consider specifically compute-only paths (so no worrying about how this has to interact with the rest of the graphics machinery) - could we put together a package of updates for a super streamlined path in
- the language (ie MSL) and APIs
- the OS call
- the driver
- the dispatch of the GPU call by firmware
that reduces this by a factor of 10? A factor of 100?

I don't see any INTRINSIC reason why not. For example you could start by providing something like the CUDA Graph setup where the app, at initialization time, provides a list of graphs (ie sets of functions and their mutual dependencies) that it will wish to execute; and these are compiled, validated, whatever else the machinery wants to do, so that they are sitting in RAM just locked and ready to go. Then each time the app wishes to make a call, that's transported by IPI to the GPU firmware, which fires off the appropriate code. None of this business of dynamically created queues of instructions that are re-submitted over and over – that probably makes sense for graphics use cases but seems like pointless (and extremely expensive) overhead for simple compute co-processor cases.

I was thinking about the same thing. What I’d like to see is heterogeneous code as first class citizen, where you can (asynchronously) call a GPU function from CPU code etc, with similar overhead as launching a thread. Synchronization/scheduling overhead is one problem, but there is also ease of use. Right now invoking the GPU requires a lot of interfacing code, would be great if most of it could be removed. That’s a big advantage of unified source models like CUDA where the runtime/compiler takes care of interfacing for you. But why not have it done at the OS level, maybe even with low level hardware support?

Persnally, I’m very exited by the idea of heterogenous computing and am looking forward to its evolution!

name99 · Sep 15, 2023

Xiao_Xi said:
You can increase the clock speed by designing a longer pipeline. Intel has used this strategy extensively.

That's a losing game in terms of energy and performance.
Firstly you lose time in the (fixed) settling time between stages.

Secondly there are stages (most obviously scheduling) that really HAVE to complete in one cycle. If I can't issue a dependent operation the next cycle after its predecessor then I might as well run the CPU at half the frequency, because I've lost all the latency benefit of being able to issue back to back.

name99 · Sep 15, 2023

leman said:
I was thinking about the same thing. What I’d like to see is heterogeneous code as first class citizen, where you can (asynchronously) call a GPU function from CPU code etc, with similar overhead as launching a thread. Synchronization/scheduling overhead is one problem, but there is also ease of use. Right now invoking the GPU requires a lot of interfacing code, would be great if most of it could be removed. That’s a big advantage of unified source models like CUDA where the runtime/compiler takes care of interfacing for you. But why not have it done at the OS level, maybe even with low level hardware support?

Persnally, I’m very exited by the idea of heterogenous computing and am looking forward to its evolution!

My guess is that a clean, easy GPU/ANE compute mechanism is going to be added to Swift that gives us what we want. Leave MSL complexities for people who care about graphics, but provide a parallel compute-only Swift API.

But Apple being Apple, they want to get the API just right so that they don't have to deprecate it in a year (I think they burned a lot of goodwill in the first two or three years of Swift, and don't want to repeat that experience!)

Gudi · Sep 15, 2023

koyoot said:
13 Pro Max lasts longer on battery, albeit slightly, than 14 Pro Max, while doing the exact same, every day tasks.

Not true. The iPhones do more and more sophisticated tasks: (AV1, USB 3, RTX). Saving battery life is not the only goal to optimize for. Otherwise we would all be using a Nokia 1110.

Xiao_Xi · Sep 15, 2023

name99 said:
Secondly there are stages (most obviously scheduling) that really HAVE to complete in one cycle. If I can't issue a dependent operation the next cycle after its predecessor then I might as well run the CPU at half the frequency, because I've lost all the latency benefit of being able to issue back to back.

Could you explain it a bit more? Intel and AMD design their CPUs to have a very high clock speed, so it's not a problem.

leman said:
I was thinking about the same thing. What I’d like to see is heterogeneous code as first class citizen, where you can (asynchronously) call a GPU function from CPU code etc, with similar overhead as launching a thread. Synchronization/scheduling overhead is one problem, but there is also ease of use. Right now invoking the GPU requires a lot of interfacing code, would be great if most of it could be removed. That’s a big advantage of unified source models like CUDA where the runtime/compiler takes care of interfacing for you.

How would this differ from the sender/receiver model proposed in C++?

New C++ Sender Library Enables Portable Asynchrony

NVIDIA is excited to announce the release of the stdexec library on GitHub and in the 22.11 release of the NVIDIA HPC Software Development Kit. The stdexec library is a […]

www.hpcwire.com

Gudi said:
The iPhones do more and more sophisticated tasks: (AV1, USB 3, RTX). Saving battery life is not the only goal to optimize for.

I think Apple didn't communicate well what they wanted to achieve with the A17 Pro. Apple has innovated in marketing almost as much as they have in their products, and they tend to explain their CPU improvements better.

name99 · Sep 15, 2023

Xiao_Xi said:
Could you explain it a bit more? Intel and AMD design their CPUs to have a very high clock speed, so it's not a problem.

Intel TRIED the "crank the frequency to the limit" strategy with P4. It was a failure.
IBM tried it with POWER6. It was a failure.
Just think what back-to-back issue of dependent instructions MEANS.

I'm not saying Apple couldn't go to a higher frequency if they wanted.
I am saying that
(a) you can't keep increasing frequency indefinitely by adding more pipeline stages
(b) the maximum frequency you can go is not the frequency at which performance will be maximized (because each stage no longer has time to engage in OoO smarts)
(c) there are certain operations in the pipeline which, if you split them over more than one pipeline stage, you might as well run at half the frequency because you have destroyed the performance of back-to-back instruction execution. The most important of these is scheduling, but another example is what value would it be if if took two cycles to execute the simplest single-cycle instructions (eg ADD)? You can boast that your CPU runs at 10GHz, but for practically every purpose, it might as well run at 5GHz and take one cycle, not two, for ADD, AND, NOT, etc.

(d) so even if you want to maximize performance you don't get there by raw maximizing of GHz.
(e) and you probably don't want to maximize performance, you want to maximize some combination of performance and energy...

Right now the highest single core GB6 results (that aren't obvious nonsense) are around 3300 for Intel 19-13900KS. M2 Pro/Max are at around 2800. If we assume basic 20% improvement (10% from A16, 10% from A17) that gives us M3 at 3360. So it's hardly clear that Apple are following the wrong path...

Gudi · Sep 15, 2023

Xiao_Xi said:
I think Apple didn't communicate well what they wanted to achieve with the A17 Pro. Apple has innovated in marketing almost as much as they have in their products, and they tend to explain their CPU improvements better.

It's a trinity: technology, design and marketing. You wouldn't pay premium for a BMW, if it wasn't well designed and didn't promise to be fun to drive. When you look at the Titanium iPhone, you'll see a lot of design refinements and a presentation geared towards professional film making à la Hollywood.

As for the chip, we knew the GPU needed the most attention and RTX was rumoured to be planned for last years chip. People are only disappointed, because they expect the M1 revolution to happen every year. M2 was a stopgap, M3 is a stopgap, M4 is going to be a stopgap year too! 😆

mr_roboto · Sep 15, 2023

AgentMcGeek said:
I would question the assumption that clock=easy vs IPC=hard. Redesigning an architecture to allow for higher clock speeds at equal or lower power is hard. It’s not always done just through die shrinks. Silicon experts feel free to agree/disagree.

Yes. Sometimes going faster means redesigning a lot of stuff.

MRMSFC said:
Well, we don’t really know if the clocks that the A-series runs at are the limit of stability or whether they’re conservative.

If the former is true then, yeah there’s difficulty. If the latter, then it’s basically just an overclock, something that enthusiasts do on PCs.

No. Enthusiasts never do what silicon device manufacturers do. They simply cannot - they don't have the expertise, and more importantly they don't have all the information they'd need even if they did.

But more importantly, you're missing the entire point. We're talking about a new generation of the microarchitecture. You're assuming that a clock speed increase must ipso facto be evidence that Apple just turned the clock speed to 11 like any overclocker would do. Instead, we might be seeing real world evidence that in engineering, sometimes if you want to go faster (in clock frequency), there are side effects and you lose some of the gains.

Apple probably isn't too interested in just boosting voltage to increase frequency. In a MacBook they might get away with the reduced efficiency, in a phone not so much. So maybe they decided to increase the number of pipeline stages in this generation's P core. That has a bunch of side effects! For example, it makes the penalty for restarting the pipeline after a flush (due to branch mispredict, etc) worse. These problems might, in turn, drive other design improvements to re-gain that lost IPC back.

When you interpret every clock frequency increase in a new generation design as evidence of "overclocking" the previous generation design, you're making some very bad assumptions. Especially when you know (because they told us) that there's microarchitectural changes, and furthermore it's on a new process node.

Xiao_Xi · Sep 15, 2023

mr_roboto said:
So maybe they decided to increase the number of pipeline stages in this generation's P core. That has a bunch of side effects! For example, it makes the penalty for restarting the pipeline after a flush (due to branch mispredict, etc) worse. These problems might, in turn, drive other design improvements to re-gain that lost IPC back.

Could an improved branch predictor allow Apple to mitigate the cons of a longer pipeline?

mr_roboto · Sep 15, 2023

Xiao_Xi said:
Could an improved branch predictor allow Apple to mitigate the cons of a longer pipeline?

It's one of many possibilities, yes.

Gudi · Sep 15, 2023

mr_roboto said:
You're assuming that a clock speed increase must ipso facto be evidence that Apple just turned the clock speed to 11 like any overclocker would do. Instead, we might be seeing real world evidence that in engineering, sometimes if you want to go faster (in clock frequency), there are side effects and you lose some of the gains.

There's a much more obvious explanation. A17 Pro is the world's first 3 nm chip, which means fewer electrons travel to represent one bit. Lower energy consumption and therefore lower heat production per cycle mean, you can safely increase clock speed and run more cycles per second within the same thermal envelope. This happens with every DIE shrink. The only time to worry is when they increase clock speed without switching to a smaller chip manufacturing process.

AgentMcGeek · Sep 15, 2023

Another thing we need to keep in mind is that the 3nm process node that Apple is (exclusively) using is not good. TSMC N3 has had lots of issues and was delayed by 6+ months. It brings limited improvements.

N3B and N3X are the real deal and coming next year. Actually it might be that the M3 series will use one of these. TSMC wants to move on from vanilla N3 ASAP because the yields also aren't as good as the newer nodes.

T'hain Esh Kelch · Sep 15, 2023

AgentMcGeek said:
N3B and N3X are the real deal and coming next year. Actually it might be that the M3 series will use one of these. TSMC wants to move on from vanilla N3 ASAP because the yields also aren't as good as the newer nodes.

N3B *is* N3.

Boil · Sep 15, 2023

AgentMcGeek said:
Another thing we need to keep in mind is that the 3nm process node that Apple is (exclusively) using is not good. TSMC N3 has had lots of issues and was delayed by 6+ months. It brings limited improvements.

N3B is the first 3nm process; Intel backed out of their deal with TSMC and Apple took up the slack...

I wouldn't say it was "not good", more like "less than optimal"...

AgentMcGeek said:
N3B and N3X are the real deal and coming next year. Actually it might be that the M3 series will use one of these. TSMC wants to move on from vanilla N3 ASAP because the yields also aren't as good as the newer nodes.

N3E is the follow-up to N3B; N3S & N3P after that; and N3X is the final 3nm variant, designed for higher power draw & clock speeds...

leman · Sep 15, 2023

mr_roboto said:
Yes. Sometimes going faster means redesigning a lot of stuff.

No. Enthusiasts never do what silicon device manufacturers do. They simply cannot - they don't have the expertise, and more importantly they don't have all the information they'd need even if they did.

But more importantly, you're missing the entire point. We're talking about a new generation of the microarchitecture. You're assuming that a clock speed increase must ipso facto be evidence that Apple just turned the clock speed to 11 like any overclocker would do. Instead, we might be seeing real world evidence that in engineering, sometimes if you want to go faster (in clock frequency), there are side effects and you lose some of the gains.

Apple probably isn't too interested in just boosting voltage to increase frequency. In a MacBook they might get away with the reduced efficiency, in a phone not so much. So maybe they decided to increase the number of pipeline stages in this generation's P core. That has a bunch of side effects! For example, it makes the penalty for restarting the pipeline after a flush (due to branch mispredict, etc) worse. These problems might, in turn, drive other design improvements to re-gain that lost IPC back.

When you interpret every clock frequency increase in a new generation design as evidence of "overclocking" the previous generation design, you're making some very bad assumptions. Especially when you know (because they told us) that there's microarchitectural changes, and furthermore it's on a new process node.

Thank you, this is most instructive! I’m always learning something from your posts.

M3 Chip Generation - Discussion Megathread

macrumors 6502

macrumors 65816

macrumors 65816

macrumors Core

macrumors regular

Suspended

macrumors 68000

macrumors 68030

macrumors 68030

macrumors 68030

macrumors Core

macrumors 68030

macrumors 68030

Suspended

macrumors 68000

macrumors 68030

Suspended

macrumors 6502a

macrumors 68000

macrumors 6502a

Suspended

macrumors 6502

macrumors 604

macrumors 68040

macrumors Core

Our Staff