M4+ Chip Generation - Speculation Megathread [MERGED]

Xiao_Xi · Jun 4, 2024

thenewperson said:
…they used MaxTech as a source?? That’s low.

Which source would you choose to demonstrate that M4 supports the SME/SME2 instructions?

The main problem is not to use Vadim as a source, but to imply that M4 is compatible with Armv9.4 because it supports SME/SME2 instructions.

What features is the M4 missing to be considered ARMv9 compatible?

thenewperson · Jun 4, 2024

Xiao_Xi said:
Which source would you choose to demonstrate that M4 supports the SME/SME2 instructions?

@never_released or @felixclc_ would be much better sources.

Xiao_Xi said:
The main problem is not to use Vadim as a source, but to imply that M4 is compatible with Armv9.4 because it supports SME/SME2 instructions.

And they seem to have sourced Vadim for that, so it very much is a problem to use him as a source.

Xiao_Xi said:
What features is the M4 missing to be considered ARMv9 compatible?

According to never_released FEAT_SME isn't featured. Although they do note further down the line that it's optional for v9 anyway? I guess someone would have to read the Arm terms or something. Anyway here's where I'm getting all this from:

https://twitter.com/x/status/1790580044550504468

Xiao_Xi · Jun 4, 2024

thenewperson said:
And they seem to have sourced Vadim for that, so it very much is a problem to use him as a source.

Vadim said that M4 uses Arm v9 instructions, but did not say which version.

https://twitter.com/x/status/1788607053361189307

After reading some Arm documentation, I am under the impression that all Arm v9 extensions are optional. If that is true, is it far-fetched to say that M4 is an Arm v9 core if it is at least Armv8.5 compatible and uses a couple of optional Arm v9 extensions?

deconstruct60 · Jun 5, 2024

Xiao_Xi said:
Vadim said that M4 uses Arm v9 instructions, but did not say which version.

https://twitter.com/x/status/1788607053361189307

After reading some Arm documentation, I am under the impression that all Arm v9 extensions are optional. If that is true, is it far-fetched to say that M4 is an Arm v9 core if it is at least Armv8.5 compatible and uses a couple of optional Arm v9 extensions?

"new" features in Arm version generation tend to start off optional in the early releases. There is going to be a v9.1 , 9.2, 9.4 , ... 9.6, 9.7 , 9.8 , etc. There usually are a hard subset of those 'optional' that tend to go 'required' by x.6 , .x.7 , .x.8 . Arm lets implementors ease they way "into the pool" instead of given folks lots of work (complexity) upfront that have to deal with all at once. It gives folks room to order which features they work on first and to incrementally grow a more complex implementation.

The problem with hyping up Apple "v9" is that they are covering SME but not SVE2. There is a more than pretty good chance that SVE2 goes 'required' toward the end of the v9 "go required" sequence. If Apple is "v9" in version9.4 and then not 'v9' in version 9.8 , then the big hypetrain about them being 'v9' is a bit overblown. If Apple skips SVE2 that would pull them off pragmatically being "v9" in the future.

The part that should give folks pause is that Arm describes SME as being a 'superset' of SVE2. Most of Arm's presentations about SME2 is that it is build on "top of" SVE2. However, SME has some loopholes built in.

" ... SME – possible implementations

The spec leaves some key details for implementations to define
...

... SVE vs SSVE

By default, some SVE 2 instructions are not supported in the streaming mode ...
..."

https://mlir.llvm.org/OpenMeetings/2023-06-22-Targeting-SME.pdf

[ Also the MLIR AMX in the slide deck is Intel's AMX. Apple's AMX cover same general issue (matrix math). ]

The Arm picture for possible implementations has SVE and SME implemented inside the Arm core and alternative SVE inside the Arm core and SME as a co-prosessor. The latter looks like Apple's co-processor AMX (only no SVE in the Arm core. )

If what Apple has done is something akin to putting both SSVE and SME in the coprocessor ( e.g, layered on top of their AMX implementation. ) , then long term it is questionable they are going in the same direction as Arm.
Apple already had Apple AMX. Is putting a compatibility interface on top of it really moving 'toward' v9 and or just getting a inexpensive win in the intermediate term?

Arm v9.1-9.4 introduces lots more than just SME2.

AArch64 - Wikipedia

en.wikipedia.org

For example if there are 10 new v9 features. If one implementor does 4 of 10 and another does 1 of 10 ... which implementor is more so on the "v9" track?

Geekbench having some 'hook' for SME and better scores following out shouldn't be the primary completion indicator of implementing Arm v9.

Xiao_Xi · Jun 5, 2024

It seems that the feature that differentiates an Armv9 core from an Armv8 core is whether it implements ETE and TRBE or ETM. According to the Arm Architecture Reference Manual Supplement Armv9, "an implementation of the Armv9-A architecture cannot include an Embedded Trace Macrocell (ETM)."

Confused-User · Jun 5, 2024

deconstruct60 said:
"new" features in Arm version generation tend to start off optional in the early releases. There is going to be a v9.1 , 9.2, 9.4 , ... 9.6, 9.7 , 9.8 , etc. There usually are a hard subset of those 'optional' that tend to go 'required' by x.6 , .x.7 , .x.8 . Arm lets implementors ease they way "into the pool" instead of given folks lots of work (complexity) upfront that have to deal with all at once. It gives folks room to order which features they work on first and to incrementally grow a more complex implementation.

The problem with hyping up Apple "v9" is that they are covering SME but not SVE2. There is a more than pretty good chance that SVE2 goes 'required' toward the end of the v9 "go required" sequence. If Apple is "v9" in version9.4 and then not 'v9' in version 9.8 , then the big hypetrain about them being 'v9' is a bit overblown. If Apple skips SVE2 that would pull them off pragmatically being "v9" in the future.

Your phrasing implies arguments that aren't correct, I think. Making a 9.4-compliant core and not updating it to support 9.8 doesn't retroactively make that core not v9.

Really, this whole v8/v9 discussion is pretty silly. They are what they are, they have the features they have (though they sometimes don't make it easy to discern some of those features). I doubt anyone out there is going to make a big purchase decision based on v8 vs. v9.

l0stl0rd · Jun 6, 2024

Confused-User said:
Your phrasing implies arguments that aren't correct, I think. Making a 9.4-compliant core and not updating it to support 9.8 doesn't retroactively make that core not v9.

Really, this whole v8/v9 discussion is pretty silly. They are what they are, they have the features they have (though they sometimes don't make it easy to discern some of those features). I doubt anyone out there is going to make a big purchase decision based on v8 vs. v9.

Well said also who cares if they are v8 or v9 apple implements what they want as they control the complete software and hardware stack.
Apple chips are basically like SDXL they licensed the base from ARM after that they tune to their desires.

Confused-User · Jun 6, 2024

l0stl0rd said:
Apple chips are basically like SDXL they licensed the base from ARM after that they tune to their desires.

Say what? By SDXL are you referring to Stable Diffusion XL?

If so, you're wrong. Apple gets nothing from ARM except the ISA. Not one transistor. They don't tune, they create from scratch. That's why they have such a crushing performance lead.

name99 · Jun 6, 2024

What's new?

vol 6 - GPU

vol 7 - ANE

vol 1 - 100 pages added giving differences from M1 to M4

misc minor changes in the other volumes

GitHub - name99-org/AArch64-Explore

Contribute to name99-org/AArch64-Explore development by creating an account on GitHub.

github.com

treehuggerpro · Jun 6, 2024

If there are no new M4s at WWDC, could it be Apple is waiting on new tech to make use of the standard packaging approaches outlined in their 2022 patent for all the chip tiers above the base M4?

Assuming the rumoured three chips for M4 (Donan, Brava and Hidra) are correct . . .

Mobile chips:
Donan = M4
Brava = M4 Pro (~6E+6P / 20GPU / NPU)
x2 Brava = M4 Max (~12E+12P / 40GPU / 2NPU)

Workstation + Apple Server chips:
x2 Hidra = M4 Ultra (~32P / 80-96GPU / 2NPU)
x4 Hidra = M4 Extreme (~64P / 160-192GPU / 4NPU)

. . . both the Brava and Hidra chips (plus their memory) would require Eliyan’s new interconnect (due in the third quarter on N3) and explain Gurman’s no further M4 tiers until late 2024 into 2025 for the complete lineup . . ?

Boil · Jun 6, 2024

treehuggerpro said:
Workstation + Apple Server chips:
x2 Hidra = M4 Ultra (~32P / 80-96GPU / 2NPU)
x4 Hidra = M4 Extreme (~64P / 160-192GPU / 4NPU)

. . . both the Brava and Hidra chips (plus their memory) would require Eliyan’s new interconnect (due in the third quarter on N3) and explain Gurman’s no further M4 tiers until late 2024 into 2025 for the complete lineup . . ?

So, WWDC 2025 for the M4 Extreme Mac Pro Cube...?

treehuggerpro · Jun 6, 2024

Boil said:
So, WWDC 2025 for the M4 Extreme Mac Pro Cube...?

Hope so, I’m already with you on the Cube concept. There’s definitely a performance per dollar market in rendering applications for an x4 Mac without the full overhead of the Mac Pro’s case (given discrete GPUs have gone the way of the Dodo).

If the other M4s are going to be based on Eliyan's tech, it appears Apple will have what it needs to create the x4 configuration of their patent in the not too distant future. I’ve been watching Eliyan’s site for news / updates, but Gurman’s timeline on the rest of the M4 lineup does seem to make sense, if the M4 Max tier is also an x2 chip configuration.

Boil · Jun 8, 2024

treehuggerpro said:
If there are no new M4s at WWDC, could it be Apple is waiting on new tech to make use of the standard packaging approaches outlined in their 2022 patent for all the chip tiers above the base M4?

Assuming the rumoured three chips for M4 (Donan, Brava and Hidra) are correct . . .

Mobile chips:
Donan = M4
Brava = M4 Pro (~6E+6P / 20GPU / NPU)
x2 Brava = M4 Max (~12E+12P / 40GPU / 2NPU)

Workstation + Apple Server chips:
x2 Hidra = M4 Ultra (~32P / 80-96GPU / 2NPU)
x4 Hidra = M4 Extreme (~64P / 160-192GPU / 4NPU)

. . . both the Brava and Hidra chips (plus their memory) would require Eliyan’s new interconnect (due in the third quarter on N3) and explain Gurman’s no further M4 tiers until late 2024 into 2025 for the complete lineup . . ?

So a single Hidra SoC would be (16P / 40-48GPU / 1NPU)...?

treehuggerpro said:
Hope so, I’m already with you on the Cube concept. There’s definitely a performance per dollar market in rendering applications for an x4 Mac without the full overhead of the Mac Pro’s case (given discrete GPUs have gone the way of the Dodo).

Niche market AF, but Apple ASi GPU/NPU PCIe cards could be a thing, but they are background compute/render engines, no display output...?

treehuggerpro said:
If the other M4s are going to be based on Eliyan's tech, it appears Apple will have what it needs to create the x4 configuration of their patent in the not too distant future. I’ve been watching Eliyan’s site for news / updates, but Gurman’s timeline on the rest of the M4 lineup does seem to make sense, if the M4 Max tier is also an x2 chip configuration.

Be interesting to see Apple go to SoC chiplets on a package, with a mobile-class assembly & two desktop-class assemblies; utilizing a mobile-class SoC & a desktop-class SoC as the building blocks...

Boil said:
So, WWDC 2025 for the M4 Extreme Mac Pro Cube...?

Maybe the Mc Pro Cube has two daughter cards, each with a M4 Extreme SoC package (four Hidra desktop-class SoCs); maybe some new interconnect that allows each M4 Extreme to talk with the other M4 Extreme in a, well, extremely efficient & ultra-low latency way...?

Of course then macOS would need to be modified to address 128 CPU cores...

But the performance would be better than a top-end next-gen Threadripper & dual RTX5090Ti GPUs...!

128-core CPU
384-core GPU
128-core NPU
1.92TB LPDDR5X RAM (w/inline ECC)
4.32TB/s UMA bandwidth
64TB NVMe SSD (8@8TB NAND blades in RAID)

Priced at a low, low $29,999; we know you're going to love it...! ;^p

Antony Newman · Jun 8, 2024

treehuggerpro said:
<snip>
Assuming the rumoured three chips for M4 (Donan, Brava and Hidra) are correct . . .

Mobile chips:
Donan = M4
Brava = M4 Pro (~6E+6P / 20GPU / NPU)
x2 Brava = M4 Max (~12E+12P / 40GPU / 2NPU)

Workstation + Apple Server chips:
x2 Hidra = M4 Ultra (~32P / 80-96GPU / 2NPU)
x4 Hidra = M4 Extreme (~64P / 160-192GPU / 4NPU)
<snip>

<Edit : Caveat the following is speculation on potential capabilities of the M4 lineup>

[Donan] +) A basic M4 (4P.6E) + 38 TOPs NPU + GPU has enough performance for most peoples needs:
iPad Pro, Macbook Air, Macbook Pro, Mac Mini, Apple TV

If the 4P version is only 12% faster than the 3P version in real world iPad tests - it seems reasonable to estimate that at least 8P additional cores would be required to get a machine that is 2 x times a fast on the compute side.

[Brava] +) A 12P.6E M4 Max + 75 TOPs NPU + larger GPU would meet the needs of:
Macbook Pro (Top spec) : Mac Mini (Top spec) : Mac Studio (entry spec)

[Hidra] +) A 32P M4 Ultra + 120 TOPs NPU + huge GPU would meet the needs of:
Mac Studio (Top Spec) : Mac Pro (Entry spec)

Release Schedule

[Donan & Brava] I don't believe the Compute, NPU or GPU part of the M4 architecture is what is holding Apple back from release new hardware - Apple will need to contrast AI performance of real world tasks as part of their sales pitch - Which would means new h/w after MacOs 15 comes out of Beta in Oct 2024 in 4 months.

[Hidra] If Apple needs (speculation) a 6 months supply of Hidra to build their data centre - I speculate this would be critical to their long term AI success, with each month delay visible in the industry as 'leadership failure'. If Apple commit ALL their Hidra silicon (and Crazy amounts of 'in short supply' MEMORY) to their data centre ahead of the consumer market - Would consumers be 'happy' with Apples Brava offerings until mid 2025?

If a 48GB M4 Mac mini comes out in Q4 that is comparable in performance to an entry level M2 MacStudio - would people be willing to wait another 8 months for the M4 Mac Studio to come out? I think not.
Would Apple want existing Studio owners to Stay on an a Mac Studio? I think so.
I think it reasonably therefore that Apple will differentiate the Mini from the Studio through Brava GPU + Memory size - and over a version of both at the end of Q4.

DrWojtek · Jun 8, 2024

What makes you say this:
Donan = M4
Brava = M4 Pro (~6E+6P / 20GPU / NPU)
x2 Brava = M4 Max (~12E+12P / 40GPU / 2NPU)

Workstation + Apple Server chips:
x2 Hidra = M4 Ultra (~32P / 80-96GPU / 2NPU)
x4 Hidra = M4 Extreme (~64P / 160-192GPU / 4NPU)

And not this:

Donan = M4 (6E+4P / 10GPU / NPU)
Brava = M4 Pro (6E+6P / 20GPU / NPU)
Hidra = M4 Max (4E+12P / 40GPU / NPU)

Workstation + Apple Server chips:
x2 Hidra = M4 Ultra (8E+24P / 80-96GPU / 2NPU)
x4 Hidra = M4 Extreme (16E+48P / 160-192GPU / 4NPU)
?

It makes sense that the Max is a Hidra. Hidra = capable of having many heads, and just one head. Of course there’ll be a version with just 1 chip. Just like the M1/M2 gen.

Antony Newman · Jun 8, 2024

DrWojtek said:
What makes you say this: <snip>And not this:

Donan = M4 (6E+4P / 10GPU / NPU)
Brava = M4 Pro (6E+6P / 20GPU / NPU)
Hidra = M4 Max (4E+12P / 40GPU / NPU)

Workstation + Apple Server chips:
x2 Hidra = M4 Ultra (8E+24P / 80-96GPU / 2NPU)
x4 Hidra = M4 Extreme (16E+48P / 160-192GPU / 4NPU)

It makes sense that the Max is a Hidra. Hidra = capable of having many heads, and just one head. Of course there’ll be a version with just 1 chip. Just like the M1/M2 gen.

If Apple are running a data centre - the running cost is minimised when performance per watt is maximised for very large language models. This requires a large amount of coherent memory, and hardware acceleration for LLM.

<Edit1 : Both of the following points have been contested later in this thread.>
<Edit2 : LLVM comment from Apple confirms the M4 is ArmV9>

1) Apple M4 has the LLM specific matrix calculations in the CPU (part of the ARMv9.2a spec)
2) At higher performance levels the Performance cores are more efficient than the Efficiency cores

It follows that the (data centre focused?) Hidra will have as many performance cores as is necessary to saturate the SME (LLM matrix operations) part of each P-Core in the SoC.

It remains to be seen if Apple have early enough access to latest TSMCs tiling & packaging to create reticle busting designs for their Data Centre - but it could mean Apple end up with some rather large tiled solutions that Apple could use for more novel physical solutions (eg to maximise HBM that is within 30mm of the Hidra SoC).

If Apple reimagine their internal server designs for density of memory first - I wonder if we will see a return of tubular modules (trashcan style) with radial (or orthogonal) memory boards and liquid cooling; A system put together by a newer version of TrashCan assembly robot.

Nvidia have shied away from using Silicon Photonics to connect their supercomputers together - citing alarmingly high failure rates. If Apple end up with 1 x TB a node coherent memory in this iteration of their data centre - that (external to SoC) bandwidth need could also be greatly reduced (with an associated reduction in data centre heat generated)

leman · Jun 8, 2024

Antony Newman said:
If Apple are running a data centre - the running cost is minimised when performance per watt is maximised for very large language models. This requires a large amount of coherent memory, and hardware acceleration for LLM.

1) Apple M4 has the LLM specific matrix calculations in the CPU (part of the v9.4 spec)
2) At higher performance levels the Performance cores are more efficient than the Efficiency cores

It follows that the (data centre focused?) Hidra will have as many performance cores as is necessary to saturate the SME (LLM matrix operations) part of each P-Core in the SoC.

It remains to be been if Apple have early enough access to latest TSMCs tiling & packaging to create reticle busting designs for their Data Centre - but it could mean Apple end up with some rather large tiled solutions that Apple could use for more novel physical solutions (eg to maximise HBM that is within 30mm of the Hidra SoC).

If Apple reimagine their internal server designs for density of memory first - I wonder if we will see a return of tubular modules (trashcan style) with radial (or orthogonal) memory boards and liquid cooling; A system put together by a newer version of TrashCan assembly robot.

Nvidia have shied away from using Silicon Photonics to connect their supercomputers together - citing alarmingly high failure rates. If Apple end up with 1 x TB a node coherent memory in this iteration of their data centre - that (external to SoC) bandwidth need could also be greatly reduced (with an associated reduction in data centre heat generated)

It's important to clarify some misconceptions.

First, it is incorrect that P-cores are more efficient than E-cores under load. Apple's P-cores are 3-4 times faster than the E-cores but consume 6-10x more power. If your goal is to lower the power consumption on a massively parallel workload, a large amount of E-cores is strictly better.

Second, the matrix coprocessor is a separate hardware unit not part of a P-core or an E-core (Apple also does not implement ARMv9, but that is another matter). Access to the matrix unit is shared for a group of CPU cores, called the cluster (in M4, there is one large matrix unit shared by the four P-cores and one smaller matrix unit shared by the four E-cores). If you are primarily interested in using the matrix unit for LLM acceleration, a design that uses many P-cores would be very wasteful. A custom device that consists mostly of matrix accelerators and a few slower CPU cores to feed them would be preferable. Even then, the matrix unit is not as performant as the GPU (simply because the GPU is so big), nor as energy efficient as the NPU. In other words, relying on the matrix accelerator is probably not the best way for Apple to build a cost-efficient LLM acceleration at scale.

Antony Newman · Jun 8, 2024

leman said:
It's important to clarify some misconceptions.
<snip>

Thankyou for chiming in - this is the first time I’ve heard the M4 is not really ARMv9.

<Edit June 15th : Apple regard the M4 as ARMv9.2a and SVE being an optional ISA component>

Do you think Apples server processors will end up having 33% more E-cores than P-Cores in each cluster?

What do you think the bottle neck that Apple will be addressing in Hidra that would necessitate further design work for a (rumoured) mid 2025 release?

Xiao_Xi · Jun 8, 2024

Antony Newman said:
this is the first time I’ve heard the M4 is not really ARMv9

Not everyone on the forum shares the same definition for a core to be Armv9.

Confused-User · Jun 8, 2024

leman said:
It's important to clarify some misconceptions.

Yes, and we can start by noting that most of today's posts have been pure speculation, not based on any actual facts. (I'm not including your post in that.)

First, it is incorrect that P-cores are more efficient than E-cores under load. Apple's P-cores are 3-4 times faster than the E-cores but consume 6-10x more power. If your goal is to lower the power consumption on a massively parallel workload, a large amount of E-cores is strictly better.

Your conclusion is wrong, despite your premises being right. You can't know whether many E cores is better than that same area/transistors being used for P cores, without specifying the workload. Certainly, for traditional "embarrassingly parallel" jobs, the E-cores will do better. But for other things, you don't know until you actually benchmark (or simulate). This basically recapitulates the difference in philosophies between GB 5 and 6... not that they represent the only two points on that spectrum. (And not, in fact, that it's a spectrum at, since a spectrum sort of implies a single dimension, whereas for the P/E question you've got many variables.)

(Edit: You did say "massively parallel", but I take that to mean something a little different. If I interpreted that wrong, apologies. But in any case, the tasks to which an Apple AI datacenter might be put are of uncertain levels of parallelism, at the level of single jobs. Further, minimizing latency is going to be a high (top?) priority for client-facing functions, which is widely expected to be what any such datacenter would be for. So using faster but less efficient hardware might make business sense anyway.)

Second, the matrix coprocessor is a separate hardware unit not part of a P-core or an E-core (Apple also does not implement ARMv9, but that is another matter). Access to the matrix unit is shared for a group of CPU cores, called the cluster (in M4, there is one large matrix unit shared by the four P-cores and one smaller matrix unit shared by the four E-cores). If you are primarily interested in using the matrix unit for LLM acceleration, a design that uses many P-cores would be very wasteful. A custom device that consists mostly of matrix accelerators and a few slower CPU cores to feed them would be preferable. Even then, the matrix unit is not as performant as the GPU (simply because the GPU is so big), nor as energy efficient as the NPU. In other words, relying on the matrix accelerator is probably not the best way for Apple to build a cost-efficient LLM acceleration at scale.

Right. (Except it's 6 E-cores, not 4, but I know you know that already.) However if Apple is truly designing silicon for its own use, there is no reason to assume they'll use the same NPU as is seen on the M4. It should be relatively easy for them to be MUCH larger NPUs, if that's what they actually want. It's not at all clear that it is though; the NPU may not suit all the client-facing AI tasks, and it certainly isn't what they want for training- though they may choose not to do training in-house.

MrGunny94 · Jun 8, 2024

Confused-User said:
Yes, and we can start by noting that most of today's posts have been pure speculation, not based on any actual facts. (I'm not including your post in that.)

Your conclusion is wrong, despite your premises being right. You can't know whether many E cores is better than that same area/transistors being used for P cores, without specifying the workload. Certainly, for traditional "embarrassingly parallel" jobs, the E-cores will do better. But for other things, you don't know until you actually benchmark (or simulate). This basically recapitulates the difference in philosophies between GB 5 and 6... not that they represent the only two points on that spectrum. (And not, in fact, that it's a spectrum at, since a spectrum sort of implies a single dimension, whereas for the P/E question you've got many variables.)

(Edit: You did say "massively parallel", but I take that to mean something a little different. If I interpreted that wrong, apologies. But in any case, the tasks to which an Apple AI datacenter might be put are of uncertain levels of parallelism, at the level of single jobs. Further, minimizing latency is going to be a high (top?) priority for client-facing functions, which is widely expected to be what any such datacenter would be for. So using faster but less efficient hardware might make business sense anyway.)

Right. (Except it's 6 E-cores, not 4, but I know you know that already.) However if Apple is truly designing silicon for its own use, there is no reason to assume they'll use the same NPU as is seen on the M4. It should be relatively easy for them to be MUCH larger NPUs, if that's what they actually want. It's not at all clear that it is though; the NPU may not suit all the client-facing AI tasks, and it certainly isn't what they want for training- though they may choose not to do training in-house.

That’s pretty true especially if they wanna take some good use of the NPUs on larger dies.

Let’s not forget the discussion we had with the Ultra chips being used by them as servers

Boil · Jun 8, 2024

Antony Newman said:
What do you think the bottle neck that Apple will be addressing in Hidra that would necessitate further design work for a (rumoured) mid 2025 release?

It would seem a possible aspect to the delayed timeline for the M4-series of SoCs above the base M4 would be the Eliyan packaging tech...?

treehuggerpro said:
. . . both the Brava and Hidra chips (plus their memory) would require Eliyan’s new interconnect (due in the third quarter on N3) and explain Gurman’s no further M4 tiers until late 2024 into 2025 for the complete lineup . . ?

treehuggerpro · Jun 8, 2024

My speculation was based on the (seemingly) more reliable rumours we’ve had so far. Any, or all, of which could still be off or wrong.

What the rumours have speculated for the M4s is:

The previous three chip building block approach for Apple’s product stack will remain, but the focus / use of each chip tier may be changing.
The mid-tier (Pro or Brava) chip, is supposed to meet the requirements of the MacBook Pros, i.e. a mobile focused chip that can fill both the current Pro and Max roles, presumably x1 Brava for a Pro and x2 Brava for the Max.
The top-tier (Max or Hidra) chip, is supposed to be switching away from mobile to become an x2 / x4 (Hydra) only chip. The concept of the Hidra chip being for x2 (Ultra) and x4 (Extreme) configurations only, would suggest Apple can salvage some die space from any of the unnecessary duplications that might occur using a chip that also has to work in a standalone application.

The core counts per chip tier in my post above were just intended to fit approximately, and comfortably, within Apple’s current M-chip die sizes, while not disrupting the steps in their current product stack. The additional bit of speculation (using Gurman’s M4 timeline) is an assumption based on the availability of Eliyan’s new interconnect, due in the third quarter.

Along with low power consumption and design flexibility, Eliyan’s interconnect provides the packaging approach Apple’s patent outlines as a lower cost, higher yielding, faster way to build multi-chip-modules. Good-all-round it seems.

@tenthousandthings follows this stuff pretty closely, he might know some, or ten thousand, things that could counter my assumptions on Eliyan’s interconnect as Apple’s next step for the M series of chips?

Boil said:
So a single Hidra SoC would be (16P / 40-48GPU / 1NPU)...?

Yes. I’m hoping some of the potential die space saved by the Hidra format could go to a few more GPU cores (48).

Boil said:
Niche market AF, but Apple ASi GPU/NPU PCIe cards could be a thing, but they are background compute/render engines, no display output...?

Yes again on the Niche market point, but that's largely Apple's fault. There hasn't been a pull factor on the Mac for Architecture (or 3D in general) for some time. Developers have taken a more positive outlook since the arrival of Apple Silicon though. And an x4 Cube or a Multi slot Mac Pro would / should compare very favourably against traditional PC workstations in performance per dollar terms. Probably not a big enough pull factor to get Autodesk to bring Revit across to the Mac, but the nagging in Autodesk's forums would certainly go up.

Xiao_Xi · Jun 9, 2024

leman said:
Apple also does not implement ARMv9

You haven't explained why you think it's an Armv9 core. Can a core implement some Armv9 instructions without being an Armv9 core?

Could it be possible that Apple didn't have access to Armv9 until last year's deal?

Apple Signs New Deal With Arm to License Chip Designs Beyond 2040

Apple has signed a new deal with British chip design company Arm to license its chip technology that extends beyond 2040, reports Reuters. News of...

www.macrumors.com

leman · Jun 9, 2024

Antony Newman said:
Thankyou for chiming in - this is the first time I’ve heard the M4 is not really ARMv9.

Xiao_Xi said:
Not everyone on the forum shares the same definition for a core to be Armv9.

Xiao_Xi said:
You haven't explained why you think it's an Armv9 core. Can a core implement some Armv9 instructions without being an Armv9 core?

I had another look at ARM's Architecture Manual, and the matter is indeed extremely confusing. They define a series of features as part of ARMv9, but pretty much all of them are marked as optional. In fact, ARMv9 is pretty much defined as equivalent to ARMv8.5. So it is indeed possible to claim that M4 is ARMv9-compatible (even though it does not support most of the ARMv9 optional features) — but by that logic, M1 is also ARMv9-compatible.

The reason why I say M4 is not ARMv9 is because it lacks the core features associated with other ARMv9 implementations, such as SVE.

Antony Newman said:
Do you think Apples server processors will end up having 33% more E-cores than P-Cores in each cluster?

I don't know what an Apple server would accomplish, so I cannot speculate on a hypothetical product that I don't understand. The rumors of Appel using M2 Ultra chips for LLM inference make no sense to me. I can imagine them building a data center for Xcode Cloud like that, not for machine learning.

Antony Newman said:
What do you think the bottle neck that Apple will be addressing in Hidra that would necessitate further design work for a (rumoured) mid 2025 release?

It is really difficult for me to speculate on these things. If they are serious about mainstream ML, memory bandwidth and GPU-specific ML features will be important. SME and the NPU are great, but the GPU has tremendous performance potential for ML applications. On the GPU side, I have speculated that Apple might be aiming for 2x FP32 pipes per scheduler in the near future. If they indeed go that route, they could implement GEMM features for various data formats, resulting in 512 BF16 and 512 or 1024 INT8 GEMM OPS per GPU core.

Confused-User said:
Your conclusion is wrong, despite your premises being right. You can't know whether many E cores is better than that same area/transistors being used for P cores, without specifying the workload. Certainly, for traditional "embarrassingly parallel" jobs, the E-cores will do better. But for other things, you don't know until you actually benchmark (or simulate). This basically recapitulates the difference in philosophies between GB 5 and 6... not that they represent the only two points on that spectrum. (And not, in fact, that it's a spectrum at, since a spectrum sort of implies a single dimension, whereas for the P/E question you've got many variables.)

Quite a lot of testing has been done on P- and E-cores. So far, the conclusion is that E-cores use considerably less power for every tested workload, so I expect this to generalize. The difference in power consumption is just too large. Maybe it is possible to craft an artificial workload that exploits architectural differences between P- and E-cores to make the P-core more efficient, I doubt that such workload will be useful.

Confused-User said:
(Edit: You did say "massively parallel", but I take that to mean something a little different. If I interpreted that wrong, apologies. But in any case, the tasks to which an Apple AI datacenter might be put are of uncertain levels of parallelism, at the level of single jobs. Further, minimizing latency is going to be a high (top?) priority for client-facing functions, which is widely expected to be what any such datacenter would be for. So using faster but less efficient hardware might make business sense anyway.)

If we are talking about mainstream ML using CPU cores, that is always a massively parallel task (GEMM is trivially parallelizable, which is why we use GPUs to accelerate ML). But I think that in this day and age, building CPU clusters to accelerate ML is a nonsensical enterprise. Other hardware solutions (NPUs, GPUs) are simply better suited for the task.

Confused-User said:
Right. (Except it's 6 E-cores, not 4, but I know you know that already.)

Ah yes, thanks for pointing it out. I find it difficult to keep the cluster sizes right across models.

Confused-User said:
However if Apple is truly designing silicon for its own use, there is no reason to assume they'll use the same NPU as is seen on the M4. It should be relatively easy for them to be MUCH larger NPUs, if that's what they actually want. It's not at all clear that it is though; the NPU may not suit all the client-facing AI tasks, and it certainly isn't what they want for training- though they may choose not to do training in-house.

Using custom SoC with large NPUs would indeed be the most reasonable way to build an Apple LLM server. Of course, it would also be a huge investment and it is not clear whether the current-gen NPU will work well with future LLMs. This is why I find all these rumors rather dubious.

M4+ Chip Generation - Speculation Megathread [MERGED]

macrumors 68000

macrumors 65816

macrumors 68000

macrumors G5

macrumors 68000

macrumors 6502a

macrumors 6502

macrumors 6502a

macrumors 68030

macrumors regular

macrumors 68040

macrumors regular

macrumors 68040

macrumors member

macrumors 6502

macrumors member

macrumors Core

macrumors member

macrumors 68000

macrumors 6502a

macrumors 65816

macrumors 68040

macrumors regular

macrumors 68000

macrumors Core

Our Staff