
theorist9

macrumors 68040
May 28, 2015
3,880
3,060
Hi @theorist9, it means (speculatively) that the future Max will be made from 2x Pro chips (Brava), given what Gurman’s code names imply. I’ll just point you back to my original posts for more, and say thanks to everyone for this conversation.

Post #585
Post #598
But I thought you were saying more than that...not only that you thought that M4 Max = 2 x M4 Pro, but that this will be a continuation ("maintenance") of what they are doing now ("maintain the economics of their existing three chip strategy...via the following format"). [See quote at bottom.] And my point is that this would not be a continuation of what they are doing now, since M3 Max != 2 x M3 Pro.

I have no opinion on what they will be doing. I'm only saying that IF they do M4 Max = 2 x M4 Pro, that would be a divergence from their current segmentation strategy.

Yes, I know this is a minor point. But I didn't want folks to think that Max = 2 X Pro is what they are doing now. They are currently doing Ultra = 2 x Max, but they are not currently doing Max = 2 x Pro.
They'll likely maintain the economics of their existing three-chip strategy (as Gurman’s code names suggest) via the following format:

M4 (Donan)
M4 Pro (Brava)
M4 Max (2x Brava)
M4 Ultra (2x Hidra)
M4 Extreme (4x Hidra)

And the A/S transition will finally be over, Amen! (or not?)
 
Last edited:

leman

macrumors Core
Oct 14, 2008
19,521
19,675
According to Apple, M4 is ARMv9.2. I previously claimed that M4 is not ARMv9 because I erroneously assumed that SVE and some other features are mandatory in v9. It does seem like LLVM engineers made the same mistake, so I don’t feel too bad about myself :)
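For anyone who wants to check what a given core actually reports, macOS exposes per-feature flags via sysctl. A minimal sketch in C; the specific hw.optional.arm.FEAT_* key names are my assumption here, and which keys exist varies by chip and macOS version, so a missing key simply shows up as "not reported":

// Minimal sketch: query macOS for per-feature ARM flags via sysctlbyname().
// The hw.optional.arm.FEAT_* keys below are assumptions; keys the chip/OS
// doesn't publish just fail the lookup and print "not reported".
#include <stdio.h>
#include <sys/sysctl.h>

static void check(const char *key) {
    int value = 0;
    size_t size = sizeof(value);
    if (sysctlbyname(key, &value, &size, NULL, 0) == 0)
        printf("%-32s %s\n", key, value ? "yes" : "no");
    else
        printf("%-32s not reported\n", key);
}

int main(void) {
    check("hw.optional.arm.FEAT_SME");
    check("hw.optional.arm.FEAT_SME2");
    check("hw.optional.arm.FEAT_SVE");   // expect "no" / "not reported" on current Apple cores
    check("hw.optional.arm.FEAT_BTI");
    return 0;
}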

 
Last edited:

Confused-User

macrumors 6502a
Oct 14, 2014
852
987
Not going to be much of a 'quirk' if Arm's v9 and almost everyone else's v9 implementations have SVE implemented.
If Arm is openly, freely contributing most of the Arm v9 specification optimizations, it probably isn't about lack of clarity. It is about having one less ifdef to chase for a single vendor going for the 'participation trophy' subset version of v9. Implementing a 'subset' of SVE before doing SVE is more so the quirky path.

For a compiler, if every single feature of v9 has to be a separate command-line flag, what you end up with is a large bucketload of flag bloat (and the code that goes along with that flag bloat, and the combinations/permutations of testing those flags for interactions).

Having a 'standard' where pragmatically every feature is 'optional' really isn't much of a standard. It absolutely doesn't lead to easier-to-maintain and easier-to-test source code. It is more so a political convenience to make 'participation trophies' easier to hand out.
Yes, but.

It is actually better than having no standard at all, by a lot. Imagine what things would look like if ARM didn't define anything at all. Then you'd have a number of completely incompatible implementations solving the same issue in very different ways, which would be even worse for compilers.

As Maynard is implying, ARM is not the omnipotent authority any more - they have to negotiate with Apple, and probably realistically QC as well. Maybe even others (Fujitsu?) in limited areas. And every month that R-V gains share, they have to tread more and more carefully. This may be the best they can do, given certain market realities and some unfortunate decisions from earlier days.

Hopefully, as time passes, they'll have better clarity and be able to reduce unnecessary choices.
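The ifdef-chasing point above is easy to make concrete. Here is a minimal sketch (mine, not from the thread) of the per-feature ladder portable ARM SIMD code ends up with; __ARM_FEATURE_SVE and __ARM_NEON are the standard ACLE feature-test macros, and which of them get defined depends entirely on the -march feature suffixes the build passes, which is exactly the flag bloat being complained about:

// Minimal sketch of the per-feature ifdef ladder that "everything optional"
// pushes onto portable code. Which of these ACLE feature-test macros the
// compiler defines depends on the -march=...+feature flags used for the build.
#include <stddef.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

float dot(const float *a, const float *b, size_t n) {
#if defined(__ARM_FEATURE_SVE)
    /* SVE path: vector-length agnostic, only compiled when built with +sve */
    svfloat32_t acc = svdup_f32(0.0f);
    for (size_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32((uint64_t)i, (uint64_t)n);
        acc = svmla_f32_m(pg, acc, svld1_f32(pg, a + i), svld1_f32(pg, b + i));
    }
    return svaddv_f32(svptrue_b32(), acc);
#elif defined(__ARM_NEON)
    /* NEON path: fixed 128-bit vectors, the baseline every AArch64 core has */
    float32x4_t acc = vdupq_n_f32(0.0f);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    float s = vaddvq_f32(acc);
    for (; i < n; ++i) s += a[i] * b[i];
    return s;
#else
    /* Scalar fallback */
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
#endif
}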
 

leman

macrumors Core
Oct 14, 2008
19,521
19,675
More interesting, in terms of the very murky roadmap going forward, the "official" features list includes all the SME variants but NOT SSVE... Make of that what you will.

Is SSVE even a separate feature? I was under the impression it was simply part of SME. This is why I don't get all this SSVE vs. SME discussion. For all intents and purposes, the two are inseparable, and one is not possible without the other. The particular implementation details, especially leaky abstractions like performance when using different destinations, are a different matter, IMO.

As to the rest, I fully agree with you and the others. The ARM spec is incredibly hard to follow. Their blog posts detail all these features, but when you actually look in the manual, almost everything is optional. What is even the difference between v8.7 and v9.0? Or between v8.6 and v8.7, for that matter?
 
  • Like
Reactions: altaic

leman

macrumors Core
Oct 14, 2008
19,521
19,675
It is actually better than having no standard at all, by a lot. Imagine what things would look like if ARM didn't define anything at all. Then you'd have a number of completely incompatible implementations solving the same issue in very different ways, which would be even worse for compilers.

I don't think there is any disagreement on this. What I don't get personally is why define profiles if the contents of profiles are optional. Wouldn't it be better to focus on optional features and feature bundles instead? Because right now, the statement "this CPU supports ARMv9" tells us very little about the actual feature set.
 
  • Like
Reactions: name99 and altaic

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,627
1,101
Is SSVE even a separate feature?
Maybe not. There are FEAT_SSVE_FP8DOT4, FEAT_SSVE_FP8DOT2 and FEAT_SSVE_FP8FMA (all from 2023), but not FEAT_SSVE.

By the way, I find it impressive that Apple has managed to include SME2 (from 2022). It shows how quickly Apple can add important features to its SoCs.

What I don't get personally is why define profiles if the contents of profiles are optional. Wouldn't it be better to focus on optional features and feature bundles instead? Because right now, the statement "this CPU supports ARMv9" tells us very little about the actual feature set.
I find it amusing that RISC-V wants to adopt profiles to become more like Arm, while Arm is becoming more and more like RISC-V with so many optional features.
 
  • Like
Reactions: Chuckeee

leman

macrumors Core
Oct 14, 2008
19,521
19,675
By the way, I find it impressive that Apple has managed to include SME2 (from 2022). It shows how quickly Apple can add important features to its SoCs.

The truth is that Apple has been working on this since at least 2017, maybe earlier. If I remember correctly, the first Apple core to feature the matrix coprocessor was A13, and Apple has been evolving this technology ever since. SME is simply the “legalization” of what Apple already had, and I am certain they had a big say in its design. SME2 reflects what Apple has shipped for years, not the other way around. It’s much easier to attach a new interface in front of technology you already have than to develop the technology from scratch.

This is also why I don’t expect others to ship SME any time soon. Getting the details down can be very tricky. ARM might ship SME with an on-core implementation though, skipping the co-processor architecture.

P.S. By the way, this is also evident from Apple’s low performance when using certain SVE instructions. SSVE prescribes that Z registers can be used as accumulators. The AMX hardware, however, is tightly integrated with the tile storage. So these instructions are likely microcoded, performing the accumulation on the tile slice and moving the data back and forth. The discrepancy in throughput fits this hypothesis.
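To make the two destinations concrete, here is a minimal, untested sketch in ACLE terms. The header, keyword attributes, and intrinsic names follow the ACLE SME spec as I read it and need a very recent clang with SME support, so treat the exact spellings as assumptions rather than verified code:

// Untested sketch of the two accumulation styles discussed above.
// Spellings follow the ACLE SME spec (arm_sme.h, __arm_streaming, __arm_inout);
// compiler support for these is recent and may differ in detail.
#include <arm_sve.h>
#include <arm_sme.h>

// (1) Outer-product accumulate into a ZA tile, the path that maps naturally
// onto Apple's AMX-style tile storage.
void za_accumulate(svfloat32_t zn, svfloat32_t zm)
    __arm_streaming __arm_inout("za")
{
    svbool_t pg = svptrue_b32();
    svmopa_za32_f32_m(0, pg, pg, zn, zm);   // ZA tile 0 += outer(zn, zm)
}

// (2) Multiply-accumulate with a Z register as the destination, executed in
// streaming mode: the case hypothesized above to be microcoded on Apple's
// hardware (bounced through the tile storage).
svfloat32_t z_accumulate(svfloat32_t acc, svfloat32_t zn, svfloat32_t zm)
    __arm_streaming
{
    svbool_t pg = svptrue_b32();
    return svmla_f32_m(pg, acc, zn, zm);    // acc += zn * zm, stays in a Z register
}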
 

altaic

macrumors 6502a
Jan 26, 2004
711
484
You’d think LLVM would be as close to ground truth as you get. Not that it really matters, but I wonder what the ratio of code bases that have it wrong vs right is.
 
Last edited:

name99

macrumors 68020
Jun 21, 2004
2,410
2,318
Is SSVE even a separate feature? I was under the impression it was simply part of SME. This is why I don't get all this SSVE vs. SME discussion. For all intents and purposes, the two are inseparable, and one is not possible without the other. The particular implementation details, especially leaky abstractions like performance when using different destinations, are a different matter, IMO.

As to the rest, I fully agree with you and the others. ARM spec is incredibly hard to follow. Their blog posts detail all these features, but when you ctually look in the manual almost everything is optional. What is even the difference between v8.7 and v9.0? Or between v8.6 and v8.7 for what matters?
The legalistic answer is what you say, that SSVE comes along with SME. To me that's a crazy answer because it does not answer the question "who ASKED for SSVE? what problem is it solving?"
SME includes the relevant functionality from SSVE but writes the results in a ZA register.

- Apple doesn't want the SSVE functionality. They've been happy with their scheme that writes vector results to ZA registers.
- QC seems unimpressed by/uninterested in SVE (and presumably SSVE)
- Fujitsu doesn't care about SME

So who looked at what Apple had done and said, OK, sure, we can make this a great extension called SME, including a whole bunch of vector instructions, but then, just for fun, let's fsck it up by adding a whole PARALLEL set of instructions called SSVE that either replicate what the ISA already has available (in SVE) or what's already available via the vector instructions in SME. No-one's asking for it, but let's just do it anyway!?!

This is the part I keep trying to get at, which legalistic replies don't answer -- WHY make such a patently stupid addition?

I can't help but think that the end result is basically what I keep suggesting:
Apple (probably with QC agreeing, maybe also Fujitsu) will school them that the very idea of SSVE is DUMB DUMB DUMB.
Keep SVE as an extension of NEON, use the vector instructions in SME if you want "high throughput but high latency vector instructions" and go burn SSVE in a fire somewhere.
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,318
By the way, I find it impressive that Apple has managed to include SME2 (from 2022). It shows how quickly Apple can add important features to its SoCs.
Ha ha. You have the causality UTTERLY backward.
No-one will admit it, but it's so obvious that Apple added the functionality (AMX), kept improving it, and ARM basically said "can we copy that; here's our suggested instruction encoding" which Apple then just adopted instead of the previous encoding.

All this makes sense and is par for the course. Same thing happened with Pointer Authentication or with Page Table indirection.
The weirdness, the part that makes ZERO sense to anyone who has thought about the issue for more than a minute, is: why add SSVE...
 
  • Like
Reactions: Xiao_Xi

theorist9

macrumors 68040
May 28, 2015
3,880
3,060
Is SSVE even a separate feature? I was under the impression it was simply part of SME.
According to this helpful Venn diagram from ARM (created by Martin Weidmann), you are correct. Streaming mode SVE, aka SSVE, is indeed part of SME:


I also found this summary page helpful:

[attached screenshot: Arm feature summary page]
 
  • Like
Reactions: SBeardsl

tenthousandthings

Contributor
May 14, 2012
276
323
New Haven, CT
But I thought you were saying more than that...not only that you thought that M4 Max = 2 x M4 Pro, but that this will be a continuation ("maintenance") of what they are doing now ("maintain the economics of their existing three chip strategy...via the following format"). [See quote at bottom.] And my point is that this would not be a continuation of what they are doing now, since M3 Max != 2 x M3 Pro.

I have no opinion on what they will be doing. I'm only saying that IF they do M4 Max = 2 x M4 Pro, that would be a divergence from their current segmentation strategy.

Yes, I know this is a minor point. But I didn't want folks to think that Max = 2 X Pro is what they are doing now. They are currently doing Ultra = 2 x Max, but they are not currently doing Max = 2 x Pro.
Also a minor point, but it's worth noting that M1-M2 Max != 2 x Pro. M1-M2 Max GPU = 2 x Pro GPU. The doubling was only for the GPU cores.

I guess the idea is that we will see monolithic generations like M3, where there is no need to design for advanced packaging (like UltraFusion), and heterogeneous-integration generations like M1 and M2, where there is a need to design for advanced packaging.

M4 will be heterogeneous. I think both A18/M4 and A18 Pro/M4 Pro will be monolithic, and M4 Max will continue in its hybrid role as both [1] the peak of Apple's line of monolithic SoCs and [2] the foundation for heterogeneous integration in the Ultra/Extreme. I think we will still see a base M4 Ultra that is simply 2x Max, but with next-generation advanced packaging/integration that allows for further configurations. Not 4x Max, but a more asymmetrical, workload-driven approach. I used to think we wouldn't see this kind of thing until 2030 or so, but I think now that it may be here.

So I don't think the hybrid role of the Max is going to change. I think there's a reason for it, and I don't think that reason has changed, even though I can't explain it.

There is a hint of this in the original M1-M2 approach, in the Pro/Max relationship (double the GPU). So the M4 Ultra/Extreme would double the GPU. Of course, advanced packaging is not limited to GPUs, or memory. But it will start there.
 
Last edited:
  • Like
Reactions: name99

treehuggerpro

macrumors regular
Oct 21, 2021
111
124
[attached image: M4 Family.jpg]


This illustrates the format of the initial speculation: marketing names vs. code names (as per Gurman).

If Gurman’s leak is correct: Brava = Mobile / Hidra = Desktop

@tenthousandthings, Apple’s patent already suggests this much:

[0023]
In some embodiments, I/O dies can be partitioned from the logic dies and packaged within the MCM routing substrate prior to mounting of the logic dies and memory dies or packages. I/O die partitioning can have the effect of reducing logic die area by off-loading I/O regions as well as electrostatic discharge (ESD) circuits from the logic dies. This individual logic die area reduction may furthermore reduce overall MCM area, offsetting the area increase due to expanding chip-to-chip placement.

The other thing I'd be interested to hear thoughts on is whether it's likely Apple could/would remove the current on-die memory. A primary claim for the Eliyan interconnects is that they break down the “Memory Wall” by providing high-bandwidth access to large banks of memory at good (thermally isolated) distances from the main die(s).

The combined options Apple may have to reclaim die area on the Hidra SoC, by off-loading I/O, memory(?) and salvaging unnecessary duplications from an x2 / x4-only design, may make space for the additional GPU cores, or it could just mean they'll stick to 40 and reduce the Hidra's die size in accordance with the last sentence of [0023].

M3 Max ~Die Area breakdown:

[attached image: M3 Die by Area.jpg]
 
Last edited:

theorist9

macrumors 68040
May 28, 2015
3,880
3,060
The other thing I'd be interested to hear thoughts on, is whether it's likely Apple could/would remove the current On-Die Memory? A primary claim for the Eliyan interconnects, is that they break down the “Memory Wall” by providing high bandwidth access to large banks of memory at good (thermally isolated) distances from the main Die(s).

The combined options Apple may have to reclaim die area on the Hidra SOC, by off-loading I/O, Memory(?) ...
The memory in AS isn't on-die. It's only the memory controllers that are on-die. The RAM chips themselves are off-die (but in the same package).

This is illustrated by the following graphic from https://www.apple.com/newsroom/2023...xt-generation-chips-for-next-level-workflows/ The RAM modules are the four black blocks located on either side of the die.

[attached image: Apple newsroom graphic of the package, with the RAM modules on either side of the die]


Those yellow blocks labelled "LPDDR5" in your post are the controllers, not the RAM itself. The cache is on-die, but the RAM is not.
 
Last edited:
  • Like
Reactions: Chuckeee

treehuggerpro

macrumors regular
Oct 21, 2021
111
124
The memory in AS isn't on-die. It's only the memory controllers that are on-die. The RAM chips themselves are off-die (but on the same SoC).
Yes. I'm referring to the LPDDR5 portion of the M3 Max layout above, occupying 10.7% of the Die.

[Ah, Ok - as per your edit. I assumed they needed some storage local to the GPU/CPU]
 
Last edited:

Chuckeee

macrumors 68040
Aug 18, 2023
3,065
8,727
Southern California
[attachment: M4 family chart]

This illustrates the format of the initial speculation. Marketing vs (Code Names - as per Gurman)

If Gurman’s leak is correct: Brava = Mobile / Hidra = Desktop

@tenthousandthings, Apple’s patent already suggests this much:



The other thing I'd be interested to hear thoughts on, is whether it's likely Apple could/would remove the current On-Die Memory? A primary claim for the Eliyan interconnects, is that they break down the “Memory Wall” by providing high bandwidth access to large banks of memory at good (thermally isolated) distances from the main Die(s).

The combined options Apple may have to reclaim die area on the Hidra SOC, by off-loading I/O, Memory(?) and salvaging unnecessary duplications from an x2 / x4 only design, may make space for the additional GPU cores, or it could just mean they'll stick to 40 and reduce the Hydra's die size in accordance with the last sentence of [0023].

M3 Max ~Die Area breakdown:

[attachment: M3 Max die area breakdown]
Just being a killjoy. I thought Apple examined a 4 chip M1 Hydra and determined it was too expensive to produce and the market at the necessary price point would be too small to be equitable. Yes there are relevant Apple patents but Apple (like most high technology companies) have loads of patents that have never led to an actual product.

So what has so drastically changed that now makes a 4 [large] die M4 Hydra any closer to being a real product?

I thought the whole point of chiplets was to use multiple medium-sized dies to get away from the dependence on large dies for high-performance chips.

Edited: spelling
 
Last edited:
  • Like
Reactions: tenthousandthings

Confused-User

macrumors 6502a
Oct 14, 2014
852
987
Just being a killjoy. I thought Apple examined a 4 chip M1 Hydra and determined it was too expensive to produce and the market at the necessary price point would be too small to be equitable. Yes there are relevant Apple patents but Apple (like most high technology companies) have loads of patents that have never led to an actual product.

So what has so drastically chaged that now make an 4 [large] dye M4 Hydra any closer to now being a real product?
Quite a lot has changed! Apple is clearly willing to invest more in their chip design, now that the M series is a proven hit and the design team has proven they can iterate, and they appear to have more design bandwidth. Thus the M3 generation has a Pro that's not simply a chopped-down Max, and we get an M4 half a year after the M3.

We also have them investing enormously in using their own silicon for their AI datacenter. This means that there is potentially a very large internal market for high-end chips.

I thought the whole point of chiplets was to use many multiple medium sized dyes to get away from the dependence of large dyes for high performance chips
"Dice" or sometimes "dies". Sort of. It's about using smaller chiplets. Not necessarily small.
 
  • Like
Reactions: wojtek.traczyk

tenthousandthings

Contributor
May 14, 2012
276
323
New Haven, CT
Just being a killjoy. I thought Apple examined a 4 chip M1 Hydra and determined it was too expensive to produce and the market at the necessary price point would be too small to be equitable. Yes there are relevant Apple patents but Apple (like most high technology companies) have loads of patents that have never led to an actual product.

So what has so drastically chaged that now make an 4 [large] die M4 Hydra any closer to now being a real product?

I thought the whole point of chiplets was to use many multiple medium sized dies to get away from the dependence of large dies for high performance chips

Edited: spelling
No doubt this was discussed ad infinitum back in the day. My impression, though, was that it had little to do with additional costs and/or diminishing markets (I think if either were the case, then "Jade 4C-Die" would never have made it onto the leaked roadmap in the first place), but rather that it was engineering roadblocks: possibly some make-or-break feature they couldn't make work, but more likely inherent performance and/or efficiency limitations that were compounded as they multiplied the dies. Chip-first bridge packaging worked with simple, matching two-way pairs of chips, but beyond that it became problematic. We know now that Apple was the first to bring InFO-LSI to market, so it was probably a bit of an engineering feat just to make that happen. The four-way bridge design (however that was constructed) didn't work as well, so it wasn't built.

Here's what Anand Shimpi said in September 2023, possibly referring to this experience: "At the end of the day, we’re a product company. So we want to deliver, whether it’s features, performance, efficiency. If we’re not able to deliver something compelling, we won’t engage, right? ... We won’t build the chip."

Edit to add that the removal of chip-first InFO-LSI packaging from TSMC's current public relations/press release site might be a consequence of this apparent failure beyond a simple two-way bridge. Note that "InFO-LSI" still appears as part of TSMC's chip-last CoWoS-L packaging. It's not inconceivable that M4's advanced packaging will switch to that approach.
 
Last edited:

MrGunny94

macrumors 65816
Dec 3, 2016
1,148
675
Malaga, Spain
I’m curious to see if Apple will answer with OLED panels, as the Windows competition is definitely rocking them hard with the X Elite launch.

At this point they have to first make them available on the Pro lineup, and that looks to be 2026.
 
  • Like
Reactions: Zorori

Pressure

macrumors 603
May 30, 2006
5,180
1,544
Denmark
I’m curious to see if Apple will answer with the OLED panels as the windows competition is definitely rocking them hard with the X Elite launch.

At this point they have to first make them available on the Pro lineup and that looks like to be 2026.
How so? Define "rocking hard".

Besides, those OLED panels have a peak brightness of around 400 nits, as far as I have seen.
 

Antony Newman

macrumors member
May 26, 2014
55
46
UK
Quite a lot has changed! Apple is clearly willing to invest more in their chip design, now that the M series is a proven hit and the design team has proven they can iterate, and they appear to have more design bandwidth. Thus the M3 generation has a Pro that's not simply a chopped-down Max, and we get an M4 half a year after the M3.

We also have them investing enormously in using their own silicon for their AI datacenter. This means that there is potentially a very large internal market for high-end chips.
<snip>

The jump to M4 included:
- architectural designs that can scale to the datacentre
- AMX ISA rolled into the ARM ISA for LLVM to more readily utilise (without depending on calling Apple libs)
- a minimum level of on-device TOPS for SLMs (small language models) at Apple’s chosen point for on-edge compute

The jump to GAA transistors in TSMC 2nm onwards:
- will have automated inter-tile signalling
- will enable a monumental level of architectural optimisation (for which 3D routing optimisation tools are in their infancy / likely not quite ready for 100B-transistor designs)
- will allow Apple to use tiles on different lithographies (TSMC NanoFlex)
- will allow P-core performance to jump by at least 15%, and E cores to use 30% less power
- will (if the reason for the ‘secret’ Apple + TSMC meeting pans out) mean that Apple has 100% of the first year’s 2nm wafers
- was planned for 2025 H1 HVM (high-volume manufacturing) until TSMC hit issues (which resulted in Backside Power Delivery being removed from this release and a two-quarter delay to TSMC 2nm)

If Apple previously hit a compute-scaling limit, where a 4x-SoC ‘Extreme’ did not provide additional HPC performance for their architectural design choices, it seems likely that x2 and x4 is what they are aiming for, and to have a platform that can compete effectively with high-end Nvidia-backed Windows offerings.

If Apple announce the Hidra at the end of 2025 Q2, perhaps what they actually release will be a 2nm design?
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,318
Just being a killjoy. I thought Apple examined a 4 chip M1 Hydra and determined it was too expensive to produce and the market at the necessary price point would be too small to be equitable. Yes there are relevant Apple patents but Apple (like most high technology companies) have loads of patents that have never led to an actual product.

So what has so drastically chaged that now make an 4 [large] die M4 Hydra any closer to now being a real product?

I thought the whole point of chiplets was to use many multiple medium sized dies to get away from the dependence of large dies for high performance chips

Edited: spelling
I consider it unlikely that Apple ever entertained business thoughts of an M1 Extreme, for technical reasons.
The M1 Ultra was impressive in many ways, but also disappointing in its scaling. This was surely expected *within* Apple; it was a learning step.
One aspect of scaling that had to be fixed immediately (and probably was fixed to a substantial extent with M2) was GPU scheduling - you want to schedule kernels that will use common data on a common GPU so that (as far as possible) data is not sloshing back and forth between the two L2 SRAM blocks of the two Max's.
There's a similar sort of concern for ANE scheduling. Some of this scheduling is done by the OS (at a very high level) or by the GPU or ANE ARM companion core, but the companion core needs access to an on-going stream of telemetry to make optimal decisions (along with, perhaps, augmented data structures in the GPU or ANE to hold tokens representing those decisions).
So point is, GPU and ANE needed better scheduling to really work well in an Ultra-style design, and that scheduling required hardware assistance.

On the CPU (and entire SoC) side, the cache coherence protocol also needed to be made more powerful, so that less overhead is spent simply keeping various caches informed about what other caches are doing. This protocol was designed and patented a few years ago but may not be implemented yet. (Cache protocols are HARD. I could believe something like an initial version was put on the M2 to test in the M2 Ultra, various edge cases were found, and elements of the design had to be refined).

So point is, technically I don't think anyone in Apple imagined that scaling M1 Max 4 ways was worth doing. More likely the expectation was "We learn what we can from the patches applied to M2 Ultra, and as soon as we have those elements working, we level up to an Extreme and see what the issues in that design are". So, again technically, nothing about business decisions, I could see it as plausible that an M4 is ready for an extreme design; maybe with internal testing to see if an octa-SoC design actually works.
The other thing that starts to kick in is that the obvious layout of an Extreme is a pretty hefty block of silicon; I think I calculated about 3x3 inches (so larger than two hands side by side). You start running into questions of how the geometry is best laid out, where the connections go, whether you use double-sided DRAM (e.g. ranks, as we discussed a few weeks ago). Again, the time seems probably right to deal with these questions in the M4 generation?
But for an Octa maybe other ideas spring into play. Could you stack two of these things with aggressive cooling?

On the business side, I don't think we can guess.
Nvidia sells DGXs in various sizes. If LLM upscaling doesn't all end in tears soon, there may be a market for similar Apple products (at similar prices). I've seen some ML researchers praising the Studio Ultra as a surprisingly nice training machine on a budget, and if word of that spreads, and people want to bump up their existing Metal/CoreML code to an Extreme with, say, a $15K budget, or even an Octa at, say, a $40K budget???
And of course how many of these things could Apple use internally if they want to build OpenAI-sized training centers and Meta-sized data warehouses?
Apple PCC (Private Cloud Compute) doesn't have the words AI or ML in it...
There's no obvious reason Apple doesn't expand this functionality, as I've said before, to any sort of use cases (by developer or user) where it makes sense to shunt some large computation into the cloud for a few seconds or a few minutes.
 