
playtech1

macrumors 6502a
Oct 10, 2014
695
889
I am very curious about where Apple has decided to land on the voltage/frequency curve. If Apple is having trouble getting extra performance out of its latest chips, there must be a temptation to trade some battery life and heat for more performance so it can add a nice graph to the slide showing it beating its competitors.

Personally for the new MacBook Pros I would take a modest overall performance gain if it came with a material battery life improvement and kept the cool and quiet approach. More scope on the desktop to go wild, but I think Apple nailed the laptop requirements for a general user with M1 Pro.

My bet for tonight is that Apple focuses on new features rather than overall speed improvement. If the new chip's GPU has ray-tracing support then I think we are in for a graph showing how pitiful unaccelerated ray-tracing performance is compared to the latest and greatest.
 
  • Like
Reactions: Populus

MayaUser

macrumors 68040
Nov 22, 2021
3,177
7,196
My bet for tonight is that Apple focuses on new features rather than overall speed improvement. If the new chip's GPU has ray-tracing support then I think we are in for a graph showing how pitiful unaccelerated ray-tracing performance is compared to the latest and greatest.
100% we will have ray tracing support... and I'm 50% sure that Apple will brag about it in 3D modelling/motion or in gaming, comparing the M3 with the M2, because there we can have over twice the improvement
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
You know, everyone says that, but based on what data? My impression is that TSMC has an oversupply of wafers because everyone else except a couple of crypto-chip makers (very low wafer volume; the dice are tiny) passed on N3B. But like everyone else I'm just trying to read between lines that are exceptionally blurry.

There is a substantial material difference between 'idle wafer capacity' and 'total wafer capacity'. If the number of orders is just 5K wafers per month, that is a different indicator depending on whether the total wafer capacity is 25K versus 50K (operating at 20% versus operating at 10%). It boils down to how much 'runway' that 'oversupply' of wafer start slots is.

The A17 Pro is a big demand bubble. That idle capacity 'runway' could shrink really fast if there is a huge spike in demand.

It more so looks like Apple might have been playing a game of 'chicken' with TSMC. They will hold orders until later, trying to either skip paying for defects as much as possible (e.g., rumors of paying only for good dies rather than whole wafers) or just playing 'hardball' on price. Unless demand for the Apple products dramatically falls lower over time... it is just a balloon squeeze as to where the demand bubble shows up.

Recent rumblings from "Moore's Law Is Dead" are that Intel is sticking with N3B but sliding into 2024. [A re-spin to N3E wasn't worth the time/effort/cost given other issues they are juggling. Waiting even longer to release wasn't going to help them. If Zen 5 arrives before 'Arrow Lake' they are already in deep doo-doo. Pushing it out much longer just pushes them into an even deeper pit. And even on N3B they still needed to 'fix' stuff.]

This is suggestive that there was a very substantive wafer 'hole' that Apple didn't 'have to' fill. The lead time on making these N3-generation chips is several months long. If the rest of the parts for a Mac/iPhone were scheduled to show up on Apple's long-term plan 3-5 months later, then buying relatively expensive N3B silicon and sitting on large quantities of it for over a quarter just hurts Apple's numbers. And the wafer volumes Apple planned to buy on N3B would have been set in the context of another large consumer for the wafers (so probably spread-out demand bubbles of new products).


In the recent earnings call TSMC said that "N3" had only just now, in Q3, recognized substantive earnings to report. So either they were doing a fair amount of stuff for 'almost free' in Q1-Q2 or there just were not any orders. Either way, if the 'workload' isn't being spread out over the year, then it is just building a bigger demand bubble in the last part of the year.

My instinct is that the additional backlog in iPhone 15 Pro shipping time would not be meaningful, but I can't (yet) support that with evidence. And again, this all assumes that chip supply is even an issue, and I seriously doubt that.

It doesn't have to be N3B wafers. It can be other components which were planned for a different timeframe in an earlier long-term plan (e.g., the supply-constrained 2020-2022 era).
 
  • Like
Reactions: Chuckeee

SBeardsl

macrumors member
Aug 9, 2007
56
14
100% we will have ray tracing support... and I'm 50% sure that Apple will brag about it in 3D modelling/motion or in gaming, comparing the M3 with the M2, because there we can have over twice the improvement
Minor quibble, but I don't see it repeated often enough in this thread: M3 GPUs will be Apple's first pass at hardware ray tracing support, not its final act. Apple will have selected a collection of general operations and specific functions to speed up 3D rendering, but there will still be plenty of opportunity left to improve performance in future versions of Apple Silicon.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
They weren't conservative with new added features but conservative with raw CPU and GPU speed improvements.

A 33% wider int subsystem and a 300MHz higher clock is hardly conservative

It could be a Zen2 to Zen3 moment where Zen2 added new features but Zen3 was the chip that allowed AMD to actually lead in performance.

There is arguably a bigger change in execution width between the A16 and A17 than there was between Zen2 and Zen3.

One needs to keep in mind that already A14 is much wider and deeper than anything that Intel or AMD have built. It's arguably easier to achieve meaningful increases in IPC if your IPC is low to begin with. I am not at all surprised by AMD's claims that Zen5 will increase the IPC by 15% — that's where Apple was five years ago.

My bet for tonight is that Apple focuses on new features rather than overall speed improvement.

Well, I'm betting against your bet :)
 
  • Haha
Reactions: playtech1

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
I won't say I'm certain they're wrong (nobody outside AMD could do that) but it seems fairly improbable.

1) If they do a core design for 3nm, why would they want to redo the whole thing for 4nm? It's not a die shrink, it's a bunch of work. Not to mention the rest of the CCX.
2) What does this buy them? Larger and hotter dies?

AMD reportedly did the Zen5 design for both N4 and N3 all along. The primary motivator for the N3 branch is Zen5c. The Zen4c cores are denser than the plain Zen4. If 'denser' is a primary objective, why would you skip the denser N3? No good reason to. Will there also be a denser Zen5c on N4? Yes. But the consumer 5c/N4 can pay for itself if used widely.

But N3 isn't changing the microarchitecture of Zen5. It is just making the chiplet smaller so they can pack more cores into a package.

Ampere One is already at 192 cores on TSMC N5 ( versus AMD Bergamo's 128 cores )


Should AMD wait around until Ampere 'Two' shows up on N3? Some folks are pointing to Intel delivering their 'client' solution slower as a reason why AMD can slow-roll their Zen5c into 2025. But the primary cloud competition isn't Intel. Amazon is aggressively dumping x86 for its own Arm cores in AWS. Most major cloud vendors are doing the same with Ampere Computing offerings. The AMD/Intel efforts are both 'backstops' trying to stop the slide.

In the consumer space N4 buys AMD cheaper packages. They are still going to try to beat Intel on price point in some areas. Not by a huge amount, but enough to toss in retail discount programs if they have to, just to move more volume and show market-share growth. Pricing flexibility. (Works even better against Apple, which is normally rigid on pricing.)

But there are counterarguments, maybe.
1) Doing cache in N3E instead of N4 gets no area advantage at all and it costs more. I don't know if you can reasonably stack an N4 cache chiplet on top of an N3E CCX - I don't see why not, but I don't know enough about the tech and could be missing something blindingly obvious. If you can't, that could possibly be an argument in favor of N4.

3D cache packaging is more expensive than a plain design at N4. The lower-midrange mainstream options are not going to have 3D cache put on them and still hit the current (or lower) price points.

The vast bulk of what AMD is going to sell in the mainstream category is not the bleeding edge, high end stuff that there will be a flurry of tech spec porn hype about.


For the mainstream, AMD could very easily go N4 --> N3P (even more so if the Zen5c is on N3E, which is 'same family' general design rule compatible with N3P). The notion of skipping every possible iteration of N3 is very dubious for processor packages that are cost constrained. N2 is pretty unlikely to get any cheaper. The machines to make it are just that much more expensive. The added tooling and process complexity is higher, etc., etc. The cost/wafer is likely going up. Whereas 3rd-generation N3 has a pretty good chance of beating N2 on costs.

Similarly, N3 was 'six months late'. N2 is an even more complex and novel 'new tech' process. Pretty good chance TSMC hits the farther-out part of any range they submit (e.g., 2H 2025 more so means December than July-August). If trying to carefully control costs, jumping to N2 as fast as possible doesn't make much sense.
 

Populus

macrumors 603
Aug 24, 2012
5,941
8,411
Spain, Europe
Just 9 hours until we know what the M3 is all about!!!

Will Apple try to impress to counter the lower sales, or will they take a more conservative approach? We're about to find out!!

I will share my not-too-technical first impressions over here, regardless of the people disagreeing with them. And I’m looking forward to reading yours!
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
But underwhelming speed improvements for a brand-new node.

We've been over this multiple times. TSMC promised 10-15% higher speed at the same power and 25-30% lower power at the same speed compared to N5.

With the A17 Pro we got a 12% higher clock at 5 watts and 30-40% lower power draw at 3.4GHz compared to the M2. It's exactly what TSMC promised. So if you consider this underwhelming, well, that's just how the technology develops.
 

senttoschool

macrumors 68030
Nov 2, 2017
2,626
5,482
We've been over this multiple times. TSMC promised 10-15% higher speed at the same power and 25-30% lower power at the same speed compared to N5.

With the A17 Pro we got a 12% higher clock at 5 watts and 30-40% lower power draw at 3.4GHz compared to the M2. It's exactly what TSMC promised. So if you consider this underwhelming, well, that's just how the technology develops.
That's without any actual IPC improvement. I.e., Apple routinely improved ST by 10-20% on the same family node in the past. Node improvement isn't the only way to improve speed. That's why I said I might expect the M4 to have a more significant real-world speed improvement than the M3.

It's also the reason I gave you the Zen2 to Zen3 example. I believe Zen3 had as much as a 19% improvement in IPC despite using the same 7nm node. Maybe a slightly enhanced 7nm node, if I recall.

[Attached image: Zen3_arch_3.jpg]
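For anyone following along, the standard textbook decomposition behind this argument (nothing Apple- or AMD-specific) is:

```latex
% Iron law of processor performance: execution time factors into
% instruction count, cycles per instruction (the inverse of IPC), and cycle time.
T_{\text{exec}} = N_{\text{instr}} \times \mathrm{CPI} \times t_{\text{cycle}},
\qquad
\text{Perf} \;\propto\; \mathrm{IPC} \times f_{\text{clock}}
```

A node move mostly buys you the clock/power term; a microarchitecture change is what moves the IPC term.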
 
Last edited:
  • Like
Reactions: nquinn and Populus

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
If we have indeed reached the parallelisation limit on most codes, I am very sceptical about future architectures delivering major IPC improvements. This leaves increasing frequency as the primary way to improve performance, or maybe a radical paradigm change and rethinking how we write code in general.

There is a presumption there of trying to force all the cost into a single type of core. IPC is relative to a core.
On generic 'Dick, Jane, Spot' code there are caps to how much parallelism you can squeeze out of what is constructed as a serial set of events. (Speculatively chasing deeper and deeper past error/limit bound checking tends to open up holes, as it pulls/mutates data that really shouldn't be pulled/mutated.) The issue more so is why code for just one hyper-general processor. It is like trying to do floating point with no hardware support. You can flog the integer system to run the float emulation faster or... just put in float hardware. Another example is encryption. Flog the generic opcodes to go faster, or add an AES instruction?

IPC has a premise built into it that more instructions mean more work done by 'used to do everything' instructions.
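A toy sketch of the 'flog the generic opcodes vs. add an instruction' point. It uses population count instead of AES or float emulation purely because it fits in a few lines; the function names are mine, not anything from Apple or ARM:

```c
#include <stdint.h>

/* Generic route: grind through the bits with ordinary integer ops. */
static int popcount_generic(uint64_t x) {
    int n = 0;
    while (x) {
        n += (int)(x & 1u);
        x >>= 1;
    }
    return n;
}

/* Dedicated route: the compiler lowers this builtin to a single hardware
 * instruction (CNT on AArch64, POPCNT on x86) where one exists, instead
 * of flogging the generic ALU harder. */
static int popcount_hw(uint64_t x) {
    return __builtin_popcountll(x); /* GCC/Clang builtin */
}
```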
 
  • Like
Reactions: Confused-User

leman

macrumors Core
Oct 14, 2008
19,521
19,675
That's without any actual IPC improvement. I.e., Apple routinely improved ST by 10-20% on the same family node in the past. Node improvement isn't the only way to improve speed. That's why I said I might expect the M4 to have a more significant real-world speed improvement than the M3.

And we've been over that multiple times. There is a limit to how much IPC you can achieve (Amdahl's law). Apple's M1 was already extremely wide. I think it's only natural that they can't continue with the same rate of IPC improvements and will need to look elsewhere for gains.

Look at it like this. The marathon world record in the 1950s was at around 2:20. The world record today is at around 2 hours. That's a 20-minute improvement in 70 years. Do you think it's reasonable to expect that a non-assisted, non-modified human can reach the 1:40 mark by 2100? I kind of doubt it. There are limits. The improvement rate slows down as you approach the asymptote.
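For reference, the formula behind the asymptote argument, applied here to the instruction-level parallelism a wide core can extract: p is the fraction of the work that can be overlapped, s is how much faster that fraction runs.

```latex
% Amdahl's law: the non-overlappable fraction (1 - p) caps the total speedup,
% no matter how much the overlappable part is accelerated.
S(s) = \frac{1}{(1 - p) + \frac{p}{s}},
\qquad
\lim_{s \to \infty} S(s) = \frac{1}{1 - p}
```

With p = 0.9, for example, the ceiling is 10x no matter how wide the machine gets.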
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
There is a presumption there of trying to force all the cost into a single type of core. IPC is relative to a core.
On generic 'Dick, Jane, Spot' code there are caps to how much parallelism you can squeeze out of what is constructed as a serial set of events. (Speculatively chasing deeper and deeper past error/limit bound checking tends to open up holes, as it pulls/mutates data that really shouldn't be pulled/mutated.) The issue more so is why code for just one hyper-general processor. It is like trying to do floating point with no hardware support. You can flog the integer system to run the float emulation faster or... just put in float hardware. Another example is encryption. Flog the generic opcodes to go faster, or add an AES instruction?

IPC has a premise built into it that more instructions mean more work done by 'used to do everything' instructions.

Yep, and what you are saying goes into that "radically rethink how we write code" point I mentioned. The constant factor for many narrowly defined domains can be substantially improved with the help of specialised hardware. But this doesn't work if your domain is general enough.

For example, it seems to me like Apple is going after branchy code with A17. They have substantially increased the core's capability to handle flags and conditional branches (by up to 2x in floating-point code, as it seems!). Will a lot of code benefit from this? Probably not, but it is possible that we will see more noticeable increases on very branchy or less optimised code.
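To make "branchy floating-point code" concrete, here is a made-up example of the kind of loop whose throughput leans on flag handling and branch resources rather than raw FP width (my own illustration, not anything from Apple's documentation):

```c
#include <stddef.h>

/* Data-dependent FP compares and branches dominate here; a core that can
 * resolve more flag-setting ops and conditional branches per cycle gets
 * more out of this loop than one that merely has wider FP units. */
double clamped_sum(const double *v, size_t n, double lo, double hi) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double x = v[i];
        if (x < lo) {
            sum += lo;
        } else if (x > hi) {
            sum += hi;
        } else {
            sum += x;
        }
    }
    return sum;
}
```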
 

senttoschool

macrumors 68030
Nov 2, 2017
2,626
5,482
And we've been over that multiple times. There is a limit to how much IPC you can achieve (Amdahl's law). Apple's M1 was already extremely wide. I think it's only natural that they can't continue with the same rate of IPC improvements and will need to look elsewhere for gains.

Look at it like this. The marathon world record in the 1950s was at around 2:20. The world record today is at around 2 hours. That's a 20-minute improvement in 70 years. Do you think it's reasonable to expect that a non-assisted, non-modified human can reach the 1:40 mark by 2100? I kind of doubt it. There are limits. The improvement rate slows down as you approach the asymptote.
Sorry, I'm not going to buy that. Do I believe that IPC is harder and harder to gain? Absolutely. Do I believe that you're also making some excuses for Apple for little to no IPC lift? Yes. Do I expect A18 Pro to have more IPC gains than A17 Pro? Yes, I think A17 Pro is more about new features. It's the lob before the dunk. Hence, I said earlier that I might wait until M4 before I make an upgrade. It just depends on how good M3 looks.

PS. In computer science, we mainly cite Amdahl's law for MT performance - not IPC improvements in chips.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
Also, there’s the “problem” that architectures made for the N3B node don’t scale well (?) or are not very compatible (?) with the upcoming N3E process. Yeah, I’m not sure what that means, but I think it points towards a new, much better architecture for the silicon coming with the N3E process, such as the A18 and the A18 Pro chip. And I do expect a big improvement with that generation of silicon.

N3E is not going to give an inherently better architecture to anyone's design arising from the process itself.

It could be cheaper to make (not necessarily lower end-user prices). But that is a completely different dimension than performance (which is where most of this thread is harping on 'better'). N3E probably results in incrementally bigger dies, but that is offset by the wafer cost being incrementally lower. Neither of those creates an inherently better implementation architecture.

N3E can push the clocks a bit more because it is not as dense. That too isn't really an arch improvement in and of itself. Pretty likely you are going to get an architecture of "less stuff" rather than "better stuff" because you ran out of transistor budget (and/or die area) quicker. (If Apple ports their N3B implementation to N3E, likely it is mainly the same arch on a slightly bigger die with just some uplift on max clock. Not a much better arch. The single-thread drag-race folks will cheer, but it is just a 'hot rod' of the same stuff.)

The N3E 'compatible design rule' benefit is that it will be easier to port an architecture implementation to either N3P, N3S, or (extra extremely unlikely in Apple's case) N3X. N3X throws performance per Watt out the window, so it isn't a fit for Apple.

There is hype that lots of folks are going to skip off to TSMC N2 as fast as possible and that follow-ons to N3E are 'dead'. That is extremely likely not true where end-user package cost is an issue. However, if you are only going to do one stop on any N3 variant, then N3E doesn't have much over N3B if you don't have extremely tight wafer cost constraints.

Whether the M3 chips are based on the A17 Pro (and made using the N3B node) or on the upcoming A18 (and made using the N3E process) is the key, at least for me, to know if this M3 gen is going to be 1) as continuist and iterative as the A17 Pro, just higher clock speeds, more RAM and a more powerful GPU architecture with ray tracing, or 2) if they are going to introduce a new e-core, p-core and n-core architecture as well, with higher core counts, which would be a considerable improvement on the Apple Silicon horizon.

A quick shift to N3E for the A18 likely just means a clock boost of mostly the same stuff on the CPU/GPU core front. It would be quite easy for Apple to just port the A17 Pro over to N3E and slap an A18 label on it. And for the A18 Pro to just 'bin down' some cores for the A18.

N3P and N3S are going to bring some density improvements that N3E backslid on. But that really is just getting back to N3B-like densities. I would be skeptical of major core count increases coming. Maybe more optimized cores for Perf/Watt than the first two, but there is not a ton of extra transistor budget coming at all, and the A17 Pro really didn't shrink radically in size even on N3B. (I.e., Apple has already thrown a lot more 'stuff' onto the die.) So it is pretty likely we are going to get more specialized cores rather than a generic "just throw more general purpose cores" approach.

So M5-M6 for general compute core count increases, if any. Focusing on core counts is missing the forest for the trees.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
Sorry, I'm not going to buy that. Do I believe that IPC is harder and harder to gain? Absolutely. Do I believe that you're also making some excuses for Apple for little to no IPC lift?

I am not making any excuses. I am trying to understand why a chip with 33% wider execution backend and 50% more branch units only achieves 3-5% higher IPC. Either Apple has royally botched up something inside that CPU, or their IPC is already so high that expecting more substantial gains is unreasonable.

Also, AMD is hardly a good example. Zen4 IPC is on par with Apple A12. Of course it's easier for AMD to get notable IPC improvements. Hell, they got a decent IPC improvement simply from increasing the cache size. They are basically repeating steps Apple did years ago. At some point the bag of tricks is empty.

Yes. Do I expect A18 Pro to have more IPC gains than A17 Pro? Yes, I think A17 Pro is more about new features. It's the lob before the dunk. Hence, I said earlier that I might wait until M4 before I make an upgrade. It just depends on how good M3 looks.

I am not sure what you are basing all these speculations on. Analysis (however limited) on A17 u-arch is available. We know that this is Apple's first really new microarchitecture in many years, and we know that it's substantially wider than Firestorm and its iterations. It is extremely unlikely that Apple will have another massive micro-architecture update within a year. Maybe some minor tweaks that help extract more IPC from that wide core, sure, that cannot be discounted. But A17 is the basis for the next few years at least.

PS. In computer science, we mainly cite Amdahl's law for MT performance - not IPC improvements in chips.

What's the difference? It's ultimately the same thing. On the fundamental level, a modern super-scalar CPU core is a multi-device machine trying to concurrently chop away at a serial program in the most efficient way.
 
  • Like
Reactions: Confused-User

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
Yep, and what you are saying goes into that "radically rethink how we write code" point I mentioned. The constant factor for many narrowly defined domains can be substantially improved with the help of specialised hardware. But this doesn't work if your domain is general enough.

Apple tends to hide their accelerators behind libraries. Calling AppleMatMultiply(m1, m2, m3) isn't much of a rewrite from MyMatMultiply(m1, m2, m3). Even less so if you were calling AppleMatMultiply all along.

Similarly, a standard BLAS library call where Apple does the back end. Also where the compiler analyzes a large basic-block cluster and recomposes it the way it wants to (loop unrolling, etc.).
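A minimal sketch of that 'accelerator behind a library' pattern, assuming Apple's Accelerate framework on macOS. The cblas_dgemm entry point is a real, standard BLAS call; whether Apple routes it through AMX, NEON, or something else on a given chip is invisible to the caller:

```c
#include <Accelerate/Accelerate.h> /* Apple's BLAS/LAPACK umbrella header */

/* C = A * B for n x n row-major matrices, via the standard CBLAS interface.
 * The caller's code never changes if Apple re-targets the back end to a
 * newer accelerator in a future chip. */
void matmul(int n, const double *A, const double *B, double *C) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,   /* alpha, A, lda */
                     B, n,   /* B, ldb        */
                0.0, C, n);  /* beta, C, ldc  */
}
```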
 
  • Like
Reactions: killawat

leman

macrumors Core
Oct 14, 2008
19,521
19,675
Apple tends to hide their accelerators behind libraries. Calling AppleMatMultiply(m1, m2, m3) isn't much of a rewrite from MyMatMultiply(m1, m2, m3). Even less so if you were calling AppleMatMultiply all along.

This helps with future-proofing the hardware, but at the same time it also makes the accelerators much more limited. I'd love to use the AMX with a high-performance immutable data structure for example, but we are locked into the legacy API patterns (most of which are from the early nineties...)
 
  • Like
Reactions: killawat

Confused-User

macrumors 6502a
Oct 14, 2014
852
987
Yep, and what you are saying goes into that "radically rethink how we write code" point I mentioned. The constant factor for many narrowly defined domains can be substantially improved with the help of specialised hardware. But this doesn't work if your domain is general enough.

For example, it seems to me like Apple is going after branchy code with A17. [...]
I think we can think bigger than that!

One thing that we haven't really talked about in earlier discussions about Apple's annoyingly limiting soldered-on memory is that if they're soldering it on, they can use any damn memory they please! It doesn't have to be bog-standard DIMMs (or CAMMs or whatever). And in the volume Apple does for iPhones, I think it's quite plausible that Apple could start down the road to PIM. You want a real speed revolution? Many problems can be solved 2-20x faster with PIM. You're not going to get that speedup with IPC or clock boosts anytime in the next ten years. And yes, of course, it's not a general solution to all problems, it won't boost all code, and it requires rewrites. So what? The same is true of NPUs, RT support in the GPU, etc. All that matters is that it helps enough, for enough problems.
 
  • Like
Reactions: Chuckeee

senttoschool

macrumors 68030
Nov 2, 2017
2,626
5,482
I am not making any excuses. I am trying to understand why a chip with 33% wider execution backend and 50% more branch units only achieves 3-5% higher IPC. Either Apple has royally botched up something inside that CPU, or their IPC is already so high that expecting more substantial gains is unreasonable.
So you know as much as we do. Nothing much.

My theory is that Apple focused on just getting the new architecture out (A17 Pro, Zen2) before the optimizations kick in (A18 Pro, Zen3). That's the best I got.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
One thing that we haven't really talked about in earlier discussions about Apple's annoyingly limiting soldered-on memory is that if they're soldering it on, they can use any damn memory they please! It doesn't have to be bog-standard DIMMs (or CAMMs or whatever). And in the volume Apple does for iPhones, I think it's quite plausible that Apple could start down the road to PIM. You want a real speed revolution? Many problems can be solved 2-20x faster with PIM. You're not going to get that speedup with IPC or clock boosts anytime in the next ten years. And yes, of course, it's not a general solution to all problems, it won't boost all code, and it requires rewrites. So what? The same is true of NPUs, RT support in the GPU, etc. All that matters is that it helps enough, for enough problems.

Absolutely. I think it's only a matter of time until we see an NPU-in-RAM device for example, and Apple is positioned well to win that particular race.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
This helps with future-proofing the hardware, but at the same time it also makes the accelerators much more limited. I'd love to use the AMX with a high-performance immutable data structure for example, but we are locked into the legacy API patterns (most of which are from the early nineties...)

That really is a different issue than trying to flog generic instructions harder at lower Perf/Watt. That is more of a 'my corner-case code is more important than their larger corner-case code' issue. The accelerators are going to get allocated by how broad the usage patterns are. If future AI inference workload is projected to go way up, then NPUs will get a bigger transistor budget because it impacts more users.

As the cores tend to get smaller, that pragmatically opens up die area for other/bigger accelerators, but it won't open it up to everything. How widespread the use case is will be an issue.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
That really is a different issue than trying to flog generic instructions harder at lower Perf/Watt. That is more of a 'my corner-case code is more important than their larger corner-case code' issue. The accelerators are going to get allocated by how broad the usage patterns are. If future AI inference workload is projected to go way up, then NPUs will get a bigger transistor budget because it impacts more users.

As the cores tend to get smaller, that pragmatically opens up die area for other/bigger accelerators, but it won't open it up to everything. How widespread the use case is will be an issue.

The accelerators would have no problems supporting my use case. I also doubt that this will change in the future. It's the legacy API that is the problem. If we had an SME-like interface to AMX, it would be much more useful in many other domains.

The NPU is more complicated since it can benefit from more specialisation. I am ok with locking it behind APIs.
 