
leman

macrumors Core
Oct 14, 2008
19,521
19,675
I'm a little tired so this is probably obvious, but how does this show it? Is it that the Intel Core i7-12700T has the AMX but the Intel Xeon w5-3435X doesn't?

The Xeon has matrix accelerators, and it also has AVX-512. One would expect the Xeon to run circles around M4 in math-heavy tasks because of AVX-512 alone, especially since single-core boost for the w5-3435X is 4.7 GHz — 5% higher than M4 (let's ignore the tests that use the matrix coprocessor for now). But the passively cooled iPad is faster in every single single-core subtest. That is absolutely incredible.
 

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,664
OBX
The Xeon has matrix accelerators, and it also has AVX-512. One would expect the Xeon to run circles around M4 in math-heavy tasks because of AVX-512 alone, especially since single-core boost for the w5-3435X is 4.7 GHz — 5% higher than M4 (let's ignore the tests that use the matrix coprocessor for now). But the passively cooled iPad is faster in every single single-core subtest. That is absolutely incredible.
As an aside, does Xeon actually hit boost clocks in single core when using AVX512? I could have sworn AVX512 caused Intel CPUs to throttle. My searching is giving mixed results on this.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
As an aside, does Xeon actually hit boost clocks in single core when using AVX512? I could have sworn AVX512 caused Intel CPUs to throttle. My searching is giving mixed results on this.

Not hitting peak clocks when running AVX-512 is pretty much the only explanation I can muster for it doing so poorly on these tests. Apple only has four 128-bit SIMD units, so the most M4 can retire per clock is 512 bits' worth of ALU operations. But the Xeon should be able to do twice or even three times more.
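To put rough numbers on that, here's a back-of-envelope sketch. The port counts and clocks are illustrative assumptions on my part (two 512-bit FMA ports for the Xeon, a throttled AVX-512 clock around 3.5 GHz), not measured values:

```c
/* Peak FP32 FMA throughput in GFLOPS: units * lanes * 2 flops/FMA * GHz.
   All figures below are illustrative assumptions, not measurements. */
static double gflops(int units, int width_bits, double ghz) {
    return units * (width_bits / 32) * 2 * ghz;
}

/* M4, 4x128b NEON pipes at ~4.4 GHz:        gflops(4, 128, 4.4) ~ 141
   Xeon, 2x512b ports at 4.7 GHz boost:      gflops(2, 512, 4.7) ~ 301
   Same ports at a throttled ~3.5 GHz:       gflops(2, 512, 3.5) = 224 */
```

So even with an assumed AVX-512 downclock the Xeon is well ahead on paper, which is why losing every single-core subtest is so striking.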

I think what this demonstrates is that wide vector instructions are a dead end. What is even the point of having 512-bit processing paths if you have to slow your core down to a crawl to use them? That is just dumb design.
 

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,664
OBX
Not hitting peak clocks when running AVX-512 is pretty much the only explanation I can muster for it doing so poorly on these tests. Apple only has four 128-bit SIMD units, so the most M4 can retire per clock is 512 bits' worth of ALU operations. But the Xeon should be able to do twice or even three times more.

I think what this demonstrates is that wide vector instructions are a dead end. What is even the point of having 512-bit processing paths if you have to slow your core down to a crawl to use them? That is just dumb design.
Yeah it is very dumb.
Thanks!
 

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,226
As has been pointed out, AMD also saw huge gains in Object Detection from Zen 3 to Zen 4 (presumably AVX-512 VNNI), which no one complained about then. So I recapitulated my earlier chart with Zen 3 to Zen 4. I haven't had time to do what @leman did (haven't collected the data), so imagine error bars or fancy box plots instead of a chart generated by Numbers, but uhhh ... suddenly Zen 3 to Zen 4 doesn't look that much better than the M-series progression - a bit better perhaps, but not in every subtest. Maybe Zen 5 will indeed be better, but AMD has a lot further to go ...

[attached chart: 1715604762935.png]
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,318
These are still optional extensions. Apple can implement v8 with SME subset on top. For instance, I don’t see them adopting SVE in the near future.
As far as I can tell from various reading, the *current* "legal" situation (everything is of course negotiable, and this has changed relative to a few years ago, hence on-going confusion...) is

- SME does NOT require SVE. And M4 falls into this case, providing SME and SSVE, but not SVE.
This is significant because the initial SME announcement strongly hints that SME required SVE.

- v9 NO LONGER requires SVE
- so in principle (and perhaps in reality) M4 could be some flavor of v9, and Apple could be back to tracking the annual v9 updates

It still seems to me that TECHNICALLY there are more advantages than disadvantages to Apple adopting SVE (unless they're going to go far far outside the box and adopt MacroScalar...)
So maybe, like the AMX/SME case, Apple are still worried about some edge cases in SVE and want ARM to make changes/clarify issues/provide an extension before they're willing to commit?
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
As far as I can tell from various reading, the *current* "legal" situation (everything is of course negotiable, and this has changed relative to a few years ago, hence on-going confusion...) is

- SME does NOT require SVE

It seems so, but I would like Arm's documentation to be clearer. The Arm Architecture Reference Manual for A-profile architecture (page 165) says:
If FEAT_SME is implemented, this does not imply that FEAT_SVE and FEAT_SVE2 are implemented when the PE is not in Streaming SVE mode.
What happens when the PE is in Streaming SVE mode?
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
It still seems to me that TECHNICALLY there are more advantages than disadvantages to Apple adopting SVE (unless they're going to go far far outside the box and adopt MacroScalar...)

Assuming that Apple wants to keep their SIMD width at 128-bit, how much advantage is there really to adopting SVE? Masks are a great thing to have of course, especially for auto-vectorization, I'm just not sure how much of a priority target SVE is. I mean, if your array is so large that you benefit from vectorization, there is a good chance that you really want to use SSVE with its higher throughput. And for small data structures SVE doesn't have that much advantage over NEON.
 
  • Like
Reactions: crazy dave

name99

macrumors 68020
Jun 21, 2004
2,410
2,318
Assuming that Apple wants to keep their SIMD width at 128-bit, how much advantage is there really to adopting SVE? Masks are a great thing to have of course, especially for auto-vectorization, I'm just not sure how much of a priority target SVE is. I mean, if your array is so large that you benefit from vectorization, there is a good chance that you really want to use SSVE with its higher throughput. And for small data structures SVE doesn't have that much advantage over NEON.
Why would they want to keep their SIMD width at 128b? 256b seems a better match to transistor densities over the next few years. They can have both, NEON 128b and SVE 256b.

But even SVE at 128b is better than NEON.
What SVE gives you is a rather nicer compiler target than NEON, a way to map generic loops onto SIMD without having to worry about edge conditions like reading past the end of an array, or having to write a second loop to handle any possible elements left after the primary 128b-wide loop.

And NEON/SVE give you a LATENCY win. AMX costs you ~15 cycles to get from the core to the AMX unit. GPU is far, far worse; you only break even at around 1,000,000 elements. NEON/SVE are nice for small loops, 100 to 1000 elements.
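To make the edge-condition point concrete, here's a plain-C sketch (scalar code simulating the idea, not real intrinsics): the NEON-style version needs a full-width main loop plus a second loop for the leftover elements, while the SVE-style version folds everything into one loop, with a per-lane predicate deactivating the lanes past the end of the array.

```c
#include <stddef.h>

#define VL 4  /* simulated vector length in elements (128b / 32b floats) */

/* NEON-style: full-width main loop, then a scalar tail for the n % VL leftovers. */
void add1_neon_style(float *a, size_t n) {
    size_t i = 0;
    for (; i + VL <= n; i += VL)       /* "vector" body, whole vectors only */
        for (size_t j = 0; j < VL; j++)
            a[i + j] += 1.0f;
    for (; i < n; i++)                 /* tail loop for the remainder */
        a[i] += 1.0f;
}

/* SVE-style: one loop; a predicate masks off lanes past the end of the array. */
void add1_sve_style(float *a, size_t n) {
    for (size_t i = 0; i < n; i += VL)
        for (size_t j = 0; j < VL; j++)
            if (i + j < n)             /* predicate: is this lane active? */
                a[i + j] += 1.0f;
}
```

In real SVE the predicate comes from a whilelt-style instruction and the masked loads/stores never touch memory past the array, which is what removes both the second loop and the out-of-bounds worry.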
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
Why would they want to keep their SIMD width at 128b? 256b seems a better match to transistor densities over the next few years. They can have both, NEON 128b and SVE 256b.

Do you think that density improvements will be large enough to offset the increased power consumption from wider data buses and caches? I’m wondering whether Apple would prefer to invest in other areas instead. After all, their 4x128 bit solution is still very formidable. M4 even outperforms Xeons with AVX512 on FP workloads.

And even if they decide to widen the FP backend, something like 5-6x128 might be a more flexible solution.

And NEON/SVE give you a LATENCY win. AMX costs you ~15 cycles to get from the core to the AMX unit. GPU is far, far worse; you only break even at around 1,000,000 elements. NEON/SVE are nice for small loops, 100 to 1000 elements.

Oh, absolutely. I was always a big fan of the latency/throughput vector processor split.
 
  • Like
Reactions: crazy dave

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,226
Why would they want to keep their SIMD width at 128b? 256b seems a better match to transistor densities over the next few years. They can have both, NEON 128b and SVE 256b.

But even SVE at 128b is better than NEON.
What SVE gives you is a rather nicer compiler target than NEON, a way to map generic loops onto SIMD without having to worry about edge conditions like reading past the end of an array, or having to write a second loop to handle any possible elements left after the primary 128b-wide loop.

And NEON/SVE give you a LATENCY win. AMX costs you ~15 cycles to get from the core to the AMX unit. GPU is far, far worse; you only break even at around 1,000,000 elements. NEON/SVE are nice for small loops, 100 to 1000 elements.

Do you think that density improvements will be large enough to offset the increased power consumption from wider data buses and caches? I’m wondering whether Apple would prefer to invest in other areas instead. After all, their 4x128 bit solution is still very formidable. M4 even outperforms Xeons with AVX512 on FP workloads.

And even if they decide to widen the FP backend, something like 5-6x128 might be a more flexible solution.



Oh, absolutely. I was always a big fan of the latency/throughput vector processor split.
This is not something I know much about but I do know that one advantage of SVE is that a single code path can target multiple vector processor lengths - would there be any advantage to having both 128b and 256b SVE units? Obviously both the P-cores and E-cores would have to have both, or we run into the Alder Lake scheduling problem even with SVE, but would having a mix be useful in any way? Like smaller problems going to the 128b vector units and using less power, but getting more performance out of the 256b units when the problem size demands it? If so, what would be examples of that kind of problem-size split? I can get my head around 512b units or AMX accelerators really needing a beefy enough problem to be worth turning on, but I'm not sure about the optimal problem size for 128 vs 256, and why Apple opts for multiple 128 vs fewer 256.

Right now I think it's 4 128b units on the M3 P-core, and I believe 3 128b units on the E-core. In this scenario where they can add more, would a mix of 1 256b unit and 4 128b units on P, and 1 256b unit and 2 128b units on E, make sense or just be silly? Would that be better in any way over 6 128b units on P and 4 128b units on E?
 
Last edited:

leman

macrumors Core
Oct 14, 2008
19,521
19,675
This is not something I know much about but I do know that one advantage of SVE is that a single code path can target multiple vector processor lengths - would there be any advantage to having both 128b and 256b SVE units?

In one design? That sounds like a logistics nightmare to me. Length-agnostic vector instructions are nice if you want to write the algorithm once and ship everywhere. Still, I wouldn't mix vector widths in one design. That's exactly what the streaming mode does well — it allows the implementation to have two separate vector states and capabilities and does not mix those.

To be honest, it seems to me that length-agnostic vector instructions are not the panacea one hoped they would be. They make a lot of sense for HPC applications, where you want to apply repetitive operations to long arrays. But a lot of the utility of SIMD comes from flexible swizzling and shuffling, and those are very difficult to do in a length-agnostic way. The newest SVE revisions have introduced dedicated 128-bit and 256-bit operations to support these things. And pretty much all SVE implementations that matter are currently 128-bit, because 256-bit is expensive. Just compare Apple's 4x128-bit and x86 2-3x256-bit SIMD implementations — Apple ends up faster pretty much everywhere, because having more smaller units is simply more flexible. Going wider is good if you know that you will have wide data to process, as it saves you state and transistors. Real-world applications are more complex.

My argument is that the "wide"/HPC need is already well covered by SME/SSVE and for regular CPU SIMD flexibility is more valuable than theoretical throughput. And being length-agnostic is probably important for SME/SSVE, since it allows for scaling opportunities in the future. Now, it would be nice to have SVE, simply for the added flexibility of the masked load/stores. But I'd be surprised if Apple cared much about the length-agnostic aspect it offers. Frankly, I'd take 5x128bit units on my P core instead of 3x256 bit, because smaller independent units are more useful on scientific code.

Right now I think it's 4 128b units on M3 P-core and I believe 3 128b units on the E-core, would, in this scenario were they can add more, a mix of 1 256b unit and 4 128b units on P and 1 256b unit and 2 128b unit on E-core make sense or just be silly? Would that be better in any way over 6 128b units on P and 4 128b units on E?

Just think of the scheduling problems such design would bring. Your algorithm has to query the vector size somehow, since it needs to decide how many loops it has to do. But on your hypothetical CPU the vector size can change from instruction to instruction! So you either have to pin some SIMD units to threads, or figure out some other complex way to deal with this. I just don't see how that would work and still be sane.
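A toy simulation of that scheduling hazard, where get_vl() and the alternating widths are purely hypothetical stand-ins for a core whose effective vector length could change underneath you: a whilelt-style loop that re-checks the bound every iteration survives a varying VL, but the common optimization of precomputing the trip count from a single VL read does not.

```c
#include <stddef.h>

/* Hypothetical: the vector length the hardware hands us, which on the
   imagined mixed-width core could differ from one instruction to the next. */
static size_t vls[] = {4, 8, 4, 8, 4, 8, 4, 8};
static size_t calls = 0;
static size_t get_vl(void) { return vls[calls++ % 8]; }

/* whilelt-style loop: re-derives how many lanes are live on every iteration. */
size_t process_agnostic(size_t n) {
    size_t done = 0;
    while (done < n) {
        size_t vl = get_vl();
        size_t live = (n - done < vl) ? n - done : vl;
        done += live;               /* bookkeeping matches what was retired */
    }
    return done;                    /* exactly n, regardless of VL changes */
}

/* Precomputed trip count: reads VL once and assumes it never changes. */
size_t process_precomputed(size_t n) {
    size_t vl = get_vl();
    size_t done = 0;
    for (size_t i = 0; i < n / vl; i++)
        done += get_vl();           /* lanes actually retired may differ! */
    return done;                    /* can over- or under-run n */
}
```

The length-agnostic pattern works, but only because it pays the cost of re-querying the width constantly, which is exactly the kind of complexity a sane design avoids.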
 

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,226
(unless they're going to go far far outside the box and adopt MacroScalar...)
Looking up MacroScalar I came across this written about 13 years ago. It was a good read for multiple reasons, not just the technical description of how MacroScalar might work but also from a historical perspective on Apple's entrance into microarchitecture design:


The company’s 2008 and 2010 acquisitions of small chip design firms P. A. Semi and Intrinsity suggests, as unlikely as it may seem, that Apple might believe they can achieve significant power, performance, or functionality gains by going toe-to-toe with the microarchitectural establishment.

If Apple were to get into the CPU design business, this move would represent a major shift in consumer electronics: as far as I know, Apple would be the only company to sell “whole widgets” to consumers featuring their own in-house microarchitecture. And, if this move proves not to be mere NIHism and the iPads of the future are way more awesome because of their new Apple-designed processor cores, it could lead to a sea change in the computer architecture landscape and the way that consumer electronics companies compete.

The quote above is actually quite perspicacious.

In one design? That sounds like a logistics nightmare to me. Length-agnostic vector instructions are nice if you want to write the algorithm once and ship everywhere. Still, I wouldn't mix vector widths in one design. That's exactly what the streaming mode does well — it allows the implementation to have two separate vector states and capabilities and does not mix those.

To be honest, it seems to me that length-agnostic vector instructions are not the panacea one hoped they would be. They make a lot of sense for HPC applications, where you want to apply repetitive operations to long arrays. But a lot of the utility of SIMD comes from flexible swizzling and shuffling, and those are very difficult to do in a length-agnostic way. The newest SVE revisions have introduced dedicated 128-bit and 256-bit operations to support these things. And pretty much all SVE implementations that matter are currently 128-bit, because 256-bit is expensive. Just compare Apple's 4x128-bit and x86 2-3x256-bit SIMD implementations — Apple ends up faster pretty much everywhere, because having more smaller units is simply more flexible. Going wider is good if you know that you will have wide data to process, as it saves you state and transistors. Real-world applications are more complex.

My argument is that the "wide"/HPC need is already well covered by SME/SSVE and for regular CPU SIMD flexibility is more valuable than theoretical throughput. And being length-agnostic is probably important for SME/SSVE, since it allows for scaling opportunities in the future. Now, it would be nice to have SVE, simply for the added flexibility of the masked load/stores. But I'd be surprised if Apple cared much about the length-agnostic aspect it offers. Frankly, I'd take 5x128bit units on my P core instead of 3x256 bit, because smaller independent units are more useful on scientific code.



Just think of the scheduling problems such design would bring. Your algorithm has to query the vector size somehow, since it needs to decide how many loops it has to do. But on your hypothetical CPU the vector size can change from instruction to instruction! So you either have to pin some SIMD units to threads, or figure out some other complex way to deal with this. I just don't see how that would work and still be sane.

Yeah, I guess I was thinking about the P/E-core scheduling problems, but not about scheduling within a core being a problem; I assumed the core itself would know best how to do that, but that might've been naive.
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,318
Looking up MacroScalar I came across this written about 13 years ago. It was a good read for multiple reasons, not just the technical description of how MacroScalar might work but also from a historical perspective on Apple's entrance into microarchitecture design:


The nutshell version of something like MacroScalar is that rather than work with fixed-size registers and SIMD operations (even SVE-style), you write your code as standard scalar loops, but some of those loops are captured by the compiler and annotated as being amenable to SIMD'ization.
Everything after that happens in the CPU; at some point (possibly when the instructions are loaded to L2, possibly when they're loaded to L1, possibly even at Decode time) certain patterns of instructions are on-the-fly converted to a SIMD version appropriate for this hardware.

The primary win of this is that you get everything SVE gives you, without the various pain points; AND you don't burn up your 32b ISA space the way SVE has done.
The primary loss is that you give up perhaps even more control than SVE (eg for things like data rearrangement).

The idea admittedly sounds some combination of cool and insane. But others have suggested the same sort of idea, for example here's a version for POWER: https://libre-soc.org/openpower/sv/
Another way to look at this is that it's an implementation of "real" vector computing Cray-style ["hardware for loop"], rather than SIMD, and the implementation as SIMD or otherwise is a low-level detail unimportant to the developer.

Yet another way to look at this is to assume (who knows?) that Apple considered THE important part of MacroScalar to be not the "hardware for loop" part but the offloading to separate hardware part, and that's already present with AMX/SME so Mission Achieved, move on...

As I keep saying, the primary win in all these schemes is you get a better compiler target, and that's worth a lot. Hell, if Apple just adopts SVE128, and says they will NEVER change the width, so we get NEON plus predication, no need to test for load/store boundary conditions, and no loop tails, that's worth a few percent on "most" code, including non-FP code, and getting a few percent on current CPUs any other way is not easy...

BTW I think the question of increasing transistor densities is interesting. Here's what I wrote recently:

You can read this and view it as meaning that RAPID transistor density increases are over, so SVE256b is a luxury that will not make sense for most users, better to use that area in other ways.
Or you can view it as meaning that all the machinery we hope might make SVE256b less necessary (like eg 5 or 6 NEON units) will not come as easily as it did in the past, and widening vector length [with the obvious concomitant features, like not activating unused lanes to save power] is in fact easier than those alternatives.
I honestly don't know.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
You can read this and view it as meaning that RAPID transistor density increases are over, so SVE256b is a luxury that will not make sense for most users, better to use that area in other ways.
Or you can view it as meaning that all the machinery we hope might make SVE256b less necessary (like eg 5 or 6 NEON units) will not come as easily as it did in the past, and widening vector length [with the obvious concomitant features, like not activating unused lanes to save power] is in fact easier than those alternatives.
I honestly don't know.

One thing is clear — high-performance designs are likely to become more costly than ever. This is also where new packaging technologies will become indispensable. In the long run it might be more economically viable to combine multiple dies made on different nodes into a single semi-stacked package rather than trying to build a monolithic one.

Paradoxically, this puts Apple in an advantageous position, since they don't have to sell their chips on the free market.
 

Confused-User

macrumors 6502a
Oct 14, 2014
852
987
TSMC just announced mass production of N3P in 2H this year. That's impressively fast. I don't think Apple can get the iPhone 16 on it because of the ridiculous volume they need. But I dunno... Maybe the A18 goes on N3E and the A18 Pro goes on N3P?

If I'm right that M4 desktops are coming at WWDC, it's not clear when Apple starts using N3P. If I'm wrong, they could possibly start using it for their desktops starting sometime this summer.

The advantages of N3P are not major, but they're meaningful. More for power than performance, but either can benefit.

EDIT: Crap, wrote "N4P" 3x when I meant to write "N3P". Fixed. :-(
 
Last edited:

SBeardsl

macrumors member
Aug 9, 2007
56
14
If I'm right that M4 desktops are coming at WWDC, it's not clear when Apple starts using N4P. If I'm wrong, they could possibly start using it for their desktops starting sometime this summer.
[I assume you are talking about N3P (not N4P) here. ]

I can't see Apple pushing that into any iPhone this year; they have the brand-new N3E for that, which would allow them to save the bump to N3P for next year's iPhone. The iPhone has a target date each year; it has to ship in that window due to the huge logistical/commercial weight it carries. I can't see them holding it up even two months past its traditional release.

What I do think is that we get desktops sometime this summer (come on WWDC!) with M4s on N3E (same family as the iPad Pro), then the MacBooks updated with "M5" whenever N3P is ready to ship. If that is Oct/Nov, it's right on time, about a year since the last updates. If it's Dec/Jan, it's still fine, only a couple months' delay.

Since N3P uses the same layout rules as N3E, Apple could just use the M4 layouts and claim the incremental gain (+5% performance or power improvement), or it could roll in some improved cores (more likely). As others have suggested, the GPU cores are in line for updates, maybe the next round of low-hanging fruit for 3D rendering, perhaps tweaks to improve their ability to contribute to computation. Add something to the media cores (AV1 encode?), maybe bump the NE core count up, and you have something worthy of a release.
 
  • Like
Reactions: Confused-User

SBeardsl

macrumors member
Aug 9, 2007
56
14
If N3P is shipping this winter, it helps Apple avoid what seems to me an interesting dilemma.

Consider Apple's double interest in
(1) releasing Mac updates on a regular cycle, say around every 12 months and
(2) always being first out of the gate with the latest TSMC process

What would Apple do if they knew that N3P would not be ready to ship product until early spring 2025? Would Apple continue to hold the MacBooks that long, to keep them available to put out with N3P as soon as possible, even if it meant a gap between releases of more like 15-18 months? Or would Apple just rev them in Oct/Nov with M4s and update the desktops with N3P/M5 when ready, after less than a year?
 

Confused-User

macrumors 6502a
Oct 14, 2014
852
987
[I assume you are talking about N3P (not N4P) here. ]
Yes, damnit. :-(
I can't see Apple pushing that into any iPhone this year; they have the brand-new N3E for that, which would allow them to save the bump to N3P for next year's iPhone. The iPhone has a target date each year; it has to ship in that window due to the huge logistical/commercial weight it carries. I can't see them holding it up even two months past its traditional release.

What I do think is that we get desktops sometime this summer (come on WWDC!) with M4s on N3E (same family as the iPad Pro), then the MacBooks updated with "M5" whenever N3P is ready to ship. If that is Oct/Nov, it's right on time, about a year since the last updates. If it's Dec/Jan, it's still fine, only a couple months' delay.

Since N3P uses the same layout rules as N3E, Apple could just use the M4 layouts and claim the incremental gain (+5% performance or power improvement), or it could roll in some improved cores (more likely). As others have suggested, the GPU cores are in line for updates, maybe the next round of low-hanging fruit for 3D rendering, perhaps tweaks to improve their ability to contribute to computation. Add something to the media cores (AV1 encode?), maybe bump the NE core count up, and you have something worthy of a release.
As I said, I agree about the iPhones - the Pro going on N3P seems barely on the edge of plausibility to me.

If they don't ship MBPs at the same time as desktops, then perhaps you're right, and they'll use N3P for them. I still hope we'll get an avalanche next month, but I don't have any particular expectation that we will.

I suspect they'll take the power improvements for the MBPs if they do put them on N3P. The perf bump is only ~5%, while power at iso-clock is ~10% lower. If not, they'll take the clocks. I don't see them taping out a new M4 or Max with more cores.

You know, there are other weird options. *If* the M4 Ultra is still dual-chip, as opposed to the imagined monolithic chip we've discussed, then they'd be introducing both Max and Ultra M4s for the desktops. They wouldn't have to introduce Pros though. So they could always bring out MBP M4 Maxes, and defer the M4 Pro models until later in the year.

If N3P is shipping this winter, it helps Apple avoid what seems to me an interesting dilemma.

Consider Apple's double interest in
(1) releasing Mac updates on a regular cycle, say around every 12 months and
(2) always being first out of the gate with the latest TSMC process

What would Apple do if they knew that N3P would not be ready to ship product until early spring 2025? Would Apple continue to hold the MacBooks that long, to keep them available to put out with N3P as soon as possible, even if it meant a gap between releases of more like 15-18 months? Or would Apple just rev them in Oct/Nov with M4s and update the desktops with N3P/M5 when ready, after less than a year?
The iPhones *must* ship yearly. Macs... as we've seen, Apple feels they have more flexibility, presumably because competition from Intel/AMD is poor (and not felt as directly). I think it's way too situational to imagine a rule for predicting what they'll do.

M5 Extreme on N3X at WWDC 2025...?
That would be extremely surprising. N3X goes against Apple's entire design philosophy.

And honestly, it's pointless in most applications. I don't think it's useful for anything other than top single-threaded performance. And perhaps, if you're building a water-cooled supercomputer for millions of $$ you might find a use for it. But normal large-core-count chips are power-constrained for multicore perf. Making such a chip on N3X just makes the power problems worse. It also costs more in die area. So it's just a giant all-around fail.

After N3P I think they go directly to N2.

Edit: Fixed typo N4P -> N3P again, damnit. :-(
 
Last edited: