> Using custom SoCs with large NPUs would indeed be the most reasonable way to build an Apple LLM server. Of course, it would also be a huge investment, and it is not clear whether the current-gen NPU will work well with future LLMs. This is why I find all these rumors rather dubious.

I’d tend to agree with you on this.
> I had another look at ARM's Architecture Manual, and the matter is indeed extremely confusing. They define a series of features as part of ARMv9, but pretty much all of them are marked as optional. In fact, ARMv9 is pretty much defined as equivalent to ARMv8.5.

It appears that the difference between an Armv8 core and an Armv9 core is which trace features they implement. According to the Arm Architecture Reference Manual Supplement Armv9 (page 25): "An implementation of the Armv9-A architecture cannot include an Embedded Trace Macrocell (ETM)."
Don't servers use ECC RAM? As far as I know this is not supported by current M processors.
What seems more likely is that Apple could have multiple M4 Ultras in a server, liquid cooled. Potentially even a higher-tier “binned” part that surpasses what Apple puts in Mac Studios.
optimizations around the fact that the GPU isn’t powering a display, freeing it to process more instructions
Could Apple be looking at building custom server SoCs long term? Sure, but it would need to reconcile priorities within its fabbing budget at TSMC. They can only fabricate so many chips.
How do you know that M-series are not using error-corrected RAM? Apple hasn't given us any information about this. They have a bunch of patents describing error correction with LPDDR5; whether these things are actually being used or not, we simply have no way of knowing.
They don't even need to liquid cool it. It's not like M2 Ultra draws that much power. I still don't really see the purpose of such a server. Certainly not for ML workloads. It seems to me that Nvidia systems would be better and more cost-effective for both training and inference.
The GPU doesn't really have anything to do with a display. It's a parallel processor with some graphics-related functionality. It takes data from memory, does some processing, and writes the results back to memory. What exactly happens to the result is not the GPU's business. Will the GPU have more time for compute if there is no display connected? Certainly. Will it meaningfully change the performance? Absolutely not.
If they use the older 5nm process, manufacturing won't be a problem. The bigger constraint is RAM.
> Don't servers use ECC RAM? As far as I know this is not supported by current M processors. Will this change, or is the integrated RAM less sensitive to errors?

Take a look at Amazon Graviton and Azure Arm chips; they have ECC memory built into the package.
> Thanks for the mention! With WWDC coming up, I'm just wondering if Apple will use it to say something more about M4. You never know.

My speculation was based on the (seemingly) more reliable rumours we’ve had so far, any or all of which could still be off or wrong.
What the rumours have speculated for the M4s is:
The core counts per chip tier in my post above were just intended to fit approximately, and comfortably, within Apple’s current M-chip die sizes, while not disrupting the steps in their current product stack. The additional bit of speculation (using Gurman’s M4 timeline) is an assumption based on the availability of Eliyan’s new interconnect, due in the third quarter.
- The previous three-chip building-block approach for Apple’s product stack will remain, but the focus / use of each chip tier may be changing.
- The mid-tier (Pro or Brava) chip is supposed to meet the requirements of the MacBook Pros, i.e. a mobile-focused chip that can fill both the current Pro and Max roles, presumably x1 Brava for a Pro and x2 Brava for a Max.
- The top-tier (Max or Hidra) chip is supposed to be switching away from mobile to become an x2 / x4 (Hidra) only chip. The concept of the Hidra chip being for x2 (Ultra) and x4 (Extreme) configurations only would suggest Apple can salvage some die space from any of the unnecessary duplication that occurs in a chip that also has to work in a standalone application.
Along with low power consumption and design flexibility, Eliyan’s interconnect provides the packaging approach Apple’s patent outlines as a lower cost, higher yielding, faster way to build multi-chip-modules. Good-all-round it seems.
@tenthousandthings follows this stuff pretty closely, he might know some, or ten thousand, things that could counter my assumptions on Eliyan’s interconnect as Apple’s next step for the M series of chips?
[...]
> The Apple Watch is SiP.

Thanks for the mention! With WWDC coming up, I'm just wondering if Apple will use it to say something more about M4. You never know.
In terms of your assumptions, I guess I'd point out that in 2022, TSMC spent (USD) $3.6 billion on advanced packaging capacity, Intel spent $4 billion, Samsung $2 billion. [Anandtech] Not to mention spending by members of TSMC's 3DFabric Alliance and other industry partners.
As I understand it, Eliyan has raised $120 million total in two rounds. One of their investors is Intel (both rounds), so it wouldn't be unreasonable to speculate (this is the speculation thread, after all!) that their activity on N3 at TSMC is related to Intel's upcoming products on N3 (Lunar Lake, etc.) The timing seems about right, but there are a lot of unknowns. It would, however, explain how a startup like Eliyan was able to access TSMC's N3 capacity at all.
I have no comment on the pending patent, but I'd be wary of reading anything into that. Keep in mind the massive scale of the industry-wide effort in this direction, for both standard packaging and advanced packaging. System-in-Package is a thing, but it's a really, really big thing. [!]
> That was sort of my point. If you're actually using CPU clusters (as opposed to GPU/NPU), then chances are, you're *not* doing the typical massively-parallel AI tasks. So in that case, it may be that you lose way too much in efficiency spreading the load widely for E cores to be as good as P cores.

Quite a lot of testing has been done on P- and E-cores. So far, the conclusion is that E-cores use considerably less power for every tested workload, so I expect this to generalize. The difference in power consumption is just too large. Maybe it is possible to craft an artificial workload that exploits architectural differences between P- and E-cores to make the P-core more efficient, but I doubt such a workload would be useful.
[...]
If we are talking about mainstream ML using CPU cores, that is always a massively parallel task (GEMM is trivially parallelizable, which is why we use GPUs to accelerate ML). But I think that in this day and age, building CPU clusters to accelerate ML is a nonsensical enterprise. Other hardware solutions (NPUs, GPUs) are simply better suited for the task.
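To make the "trivially parallelizable" point concrete, here is a minimal sketch of a row-blocked GEMM where each worker computes an independent slice of the output. NumPy and a thread pool stand in for real accelerator kernels; the function name and blocking scheme are purely illustrative.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_gemm(a: np.ndarray, b: np.ndarray, workers: int = 4) -> np.ndarray:
    """Row-blocked matrix multiply: each worker owns a disjoint slice of C = A @ B."""
    out = np.empty((a.shape[0], b.shape[1]), dtype=a.dtype)
    row_blocks = np.array_split(np.arange(a.shape[0]), workers)

    def compute(rows: np.ndarray) -> None:
        # No communication needed: this slice depends only on a[rows] and b
        out[rows] = a[rows] @ b

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(compute, row_blocks))
    return out
```

The same decomposition is why GPUs and NPUs fit the task so well: the blocks are independent, so what matters is raw arithmetic throughput, not coordination between cores.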
Timothy Prickett Morgan says (March 29, 2024 at 8:37 PM):

> They have done their PHY in 5 nanometer and 3 nanometer processes as far as I know. And they have one large scale customer, someone not quite a hyperscaler but more than a typical service provider.
That was sort of my point. If you're actually using CPU clusters (as opposed to GPU/NPU), then chances are, you're *not* doing the typical massively-parallel AI tasks. So in that case, it may be that you lose way too much in efficiency spreading the load widely for E cores to be as good as P cores.
My other point (which I think was pretty clear) was that latency may be a significant factor for customer-facing work, in which case P cores may be a better choice even if less efficient for the task.
I think without knowing what they want to do with these servers, we can't really know what's better.
I am skeptical about all this talk of Eliyan. Apple has established pretty clearly that they want to own their stack from the bottom to the top. They're trying to bring everything in-house that isn't already there. The last major piece of the puzzle for them is RF stuff; being able to dump Qualcomm and Broadcom is extremely important for them. (There are other smaller pieces, much of it analog stuff. But they're working on that too, I expect.)
I don't see them taking on a new dependency now. That's exactly the opposite of their general direction as a company. The one way I can imagine this happening is if they have an option to acquire Eliyan completely - which is not impossible. But I do think it's much more likely they will continue to build things on their own. They've certainly had good success with that strategy so far.
Yes, Apple’s vertical stack is very much to their advantage. Unless there is a particular deficiency Apple needs addressed though, I’d venture the type of IP Eliyan has sitting under their PHY, enabling its bandwidth and size, is not the kind of IP Apple would have a particular interest in re-spinning. That's the assumption I'm working with, because it is IP packaged within the existing industry standards, and that can basically slot in everywhere existing interconnects already sit.
@tenthousandthings would have better knowledge on just how specific and/or idiosyncratic proprietary solutions might get in this space, but, for the most part, where standards are present, there’s generally benefits to broad adoption.
I am skeptical about Apple using Eliyan tech for two reasons. First, Apple never cared about UCIe and multi-chip-module standardization because they are not interested in integrating third-party dies into their solutions. Second, Apple already has comparable tech with their UltraFusion interconnect. If I understand it correctly, the main advantage of Eliyan's IP (besides standardization) is that it does not require a bridge and communicates via a standard substrate. This is a great cost saver if you need a few GB/s die-to-die interconnect and don't want to pay for advanced packaging. I am not convinced that a no-bridge solution is viable for Apple with their multi-TB/s requirements. They will likely still need a bridge. And if they need a bridge, why license some third-party tech if you already have a perfectly fine solution?
> Yes, Apple’s vertical stack is very much to their advantage. Unless there is a particular deficiency Apple needs addressed though, I’d venture the type of IP Eliyan has sitting under their PHY, enabling its bandwidth and size, is not the kind of IP Apple would have a particular interest in re-spinning. That's the assumption I'm working with, because it is IP packaged within the existing industry standards, and that can basically slot in everywhere existing interconnects already sit.
> @tenthousandthings would have better knowledge on just how specific and/or idiosyncratic proprietary solutions might get in this space, but, for the most part, where standards are present, there’s generally benefits to broad adoption.

It's true that Eliyan's "magical" interference cancellation techniques apply to both standard packaging and advanced packaging. [EE Times Europe] But they are hardly alone in making progress, and TSMC and others have also referenced major, orders-of-magnitude, breakthroughs in this area (substrates) in the past few years.
Thanks for the vote(s) of confidence, but I'm best thought of as a competent historian. That's my field of training. I think the semiconductor industry is high art, in every sense of that word. So I'm interested. I'm okay at not jumping to conclusions and I'm good at evaluating secondary sources, but I make mistakes (one recently with regard to what is a local silicon interconnect), and I'm very much NOT a trained engineer or scientist. I have one small advantage, in that I have Chinese language skills that enable me to be a little more confident in that world, when talking about China and/or Taiwan. I also participated in the AppleSeed program throughout the early years of OS X.
This is me: https://www.earlymacintosh.org/
This is also me (with friends): https://www.chinesemac.org/
> I am skeptical about Apple using Eliyan tech for two reasons. First, Apple never cared about UCIe and multi-chip-module standardization because they are not interested in integrating third-party dies into their solutions. Second, Apple already has comparable tech with their UltraFusion interconnect. If I understand it correctly, the main advantage of Eliyan's IP (besides standardization) is that it does not require a bridge and communicates via a standard substrate. This is a great cost saver if you need a few GB/s die-to-die interconnect and don't want to pay for advanced packaging. I am not convinced that a no-bridge solution is viable for Apple with their multi-TB/s requirements. They will likely still need a bridge. And if they need a bridge, why license some third-party tech if you already have a perfectly fine solution?

Isn't that exactly what I described in patents like https://patents.google.com/patent/US20220013504A1
> How do you know that M-series are not using error-corrected RAM? Apple hasn't given us any information about this. They have a bunch of patents describing error correction with LPDDR5; whether these things are actually being used or not, we simply have no way of knowing.

Apple do have a bunch of patents on a particular idea. All LPDDR5 has to have internal error correction, but normally that is handled internally and that's the end of the story. Apple's patents are about how to propagate the information out of the DRAM (I don't know if the details of how they do this can be done on all DRAM, or whether they ask for a slight tweak by Micron et al.) so that the OS can track the patterns of where error correction is needed and respond appropriately (which may range from giving that page of DRAM more frequent refreshes to masking it out and never using it again).
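Purely as an illustration of that "track and respond" idea, here is a toy sketch of such an OS-side policy. The class name, thresholds, and action labels are mine, not anything from Apple's patents: the OS counts corrected errors per DRAM page and escalates from extra refreshes to retiring the page.

```python
from collections import Counter

# Illustrative thresholds -- not from Apple's patents
EXTRA_REFRESH_AT = 2   # corrected errors before boosting a page's refresh rate
RETIRE_AT = 5          # corrected errors before masking the page out entirely

class EccPolicy:
    """Toy OS-side tracker for corrected-error reports propagated out of DRAM."""

    def __init__(self) -> None:
        self.corrected = Counter()   # page address -> corrected-error count
        self.retired = set()         # pages never to be handed out again

    def on_corrected_error(self, page: int) -> str:
        """Called when the memory subsystem reports a corrected bit error."""
        self.corrected[page] += 1
        if self.corrected[page] >= RETIRE_AT:
            self.retired.add(page)
            return "retire"
        if self.corrected[page] >= EXTRA_REFRESH_AT:
            return "extra-refresh"
        return "log"
```

The point of the sketch is only the escalation pattern: a single corrected error is noise, a repeat offender gets more frequent refreshes, and a persistent one gets masked out.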
> If they use the older 5nm process, manufacturing won't be a problem. The bigger constraint is RAM.

Remember that patent I pointed out regarding ranks? Even if we never see ranks in a consumer Mac...
> I read the article on nextplatform when you first posted it.

Horror vacui. When I posted the initial question there were a bunch of links included for anyone curious about the subject. The intention, and subsequent invitations for the interested, was to review the material and pick apart the speculation (objectively).
The forum’s a void for tossing out such queries, right? The compulsion, however, I feel too often leans towards five-second assertions without review. It’s the kind of thing, I think, that should make us pause on the concerns around AI. Because, while AI is built on the premise of review and assemble, if humans can’t review before they assemble, we’re in real trouble! We’ll end up a bunch of budgies blathering at ourselves in the mirror because we like the reflection.
Anyway, name99 and tenthousandthings have sustained my faith, for curiosity’s sake. So here's another stab at filling the current vacuum with a little more (reasoned and simplified) speculation. Probably with many technical considerations overlooked, so the caveat is: this is a long way from my own discipline, and I happily invite further objective scrutiny.
UltraFusion provides about 2.5 TB/s off of an approx. 20 mm shoreline. Please check my math and assumptions! This rounds to ~1.1 Tbps/mm (22 Terabits / 20 mm). From the linked article, the table below shows Eliyan’s interconnect against some of the other (current) protocols and puts UltraFusion’s capacity in context.
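For anyone checking the math, the conversion can be sketched as follows. Note this uses a plain 8-bits-per-byte conversion with no overhead assumptions, which yields 20 Tbps and 1.0 Tbps/mm rather than the 22 Tbit / ~1.1 figure above.

```python
def tbps_per_mm(tb_per_s: float, shoreline_mm: float) -> float:
    """Linear bandwidth density (Tbps/mm) from aggregate TB/s and die-edge length."""
    return tb_per_s * 8 / shoreline_mm  # 8 bits per byte, no overhead assumed

# UltraFusion: ~2.5 TB/s across a ~20 mm shoreline
print(tbps_per_mm(2.5, 20.0))  # → 1.0 Tbps/mm
```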
[Attachment 2387827: table comparing Eliyan’s interconnect bandwidth with other current protocols]
In line with @name99's post, when Apple cooked up UltraFusion they were pushing the envelope, and were able to market it as ‘4x the bandwidth of the leading . . . interconnect’ at the time, a claim they subsequently removed with the marketing of the M2 Ultra. By current standards, however, UltraFusion sits around Intel’s MDFIO in the table above, and is now a relatively anaemic interconnect on Advanced Packaging. The tech in UltraFusion was conceived on 5nm and quite possibly wasn't going to meet an M3 Ultra’s requirements across a 22 mm shoreline, but I’ll leave that to the boffins.
This may, however, put a more rational narrative around the lack of an interconnect on the M3 Max, and make sense of why Apple’s speculated M4 release schedule is more likely what Gurman has suggested.
For the M1 Ultra, Apple (with TSMC) needed to come up with a solution that was not readily available. That has now changed, and perhaps more significantly, it has changed for Standard Packaging with Eliyan’s interconnect arriving in the third quarter.
As sardonically implied (apologies), Eliyan's IP is a leap in particular for Standard Packaging, all things considered (FoM). Its bandwidth (2.0-4.0 Tbps/mm) will allow Apple to market their next step in this direction with the Hidra x2 and x4 Macs as UltraFusion 2 (at 5TB/s minimum to 10TB/s) off of a 22 mm chip shoreline. And they could also continue to market it as UltraFusion at 2.5TB/s min. (in Standard Packaging) off of an 11 mm shoreline for an x2 Brava configuration as the new Max.
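Sketching that arithmetic (the 2.0-4.0 Tbps/mm density is the range quoted for Eliyan above; "UltraFusion 2" is this post's hypothetical name, and the plain 8-bits-per-byte conversion is my assumption):

```python
def aggregate_tb_per_s(density_tbps_per_mm: float, shoreline_mm: float) -> float:
    """Aggregate link bandwidth (TB/s) from linear density and shoreline length."""
    return density_tbps_per_mm * shoreline_mm / 8  # 8 bits per byte

# Hypothetical "UltraFusion 2": Eliyan's 2.0-4.0 Tbps/mm on a 22 mm shoreline
print(aggregate_tb_per_s(2.0, 22.0))  # → 5.5 TB/s (low end)
print(aggregate_tb_per_s(4.0, 22.0))  # → 11.0 TB/s (high end)
# The same low-end density on an 11 mm shoreline for an x2 Brava "Max"
print(aggregate_tb_per_s(2.0, 11.0))  # → 2.75 TB/s
```

Which lands in the same ballpark as the 5-10 TB/s and 2.5 TB/s minimum figures above.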
So again, my speculation is that Apple wants to move to Standard Packaging for their MCMs and is waiting (in part at least) for a new interconnect (from Eliyan), because it will allow them to finally execute an x4 Mac, and because Standard Packaging is cheaper, higher yielding, and faster to manufacture (according to Apple’s patent and Eliyan’s articles). They'll likely maintain the economics of their existing three-chip strategy (as Gurman’s code names suggest) via the following format:
M4 (Donan)
M4 Pro (Brava)
M4 Max (2x Brava)
M4 Ultra (2x Hidra)
M4 Extreme (4x Hidra)
And the A/S transition will finally be over, Amen! (or not?)
I read the article on nextplatform when you first posted it.
Assuming Eliyan can actually deliver, they might be an attractive option for Apple... except, as I said yesterday, Apple has shown an *intense* desire to bring everything in-house. Will they look at this as a physical packaging thing (which they're happy to contract out, at least to TSMC), and so a plausible exception to that strategy? Maybe. I'm not convinced, though I don't think it's impossible.
A minor nitpick: 2.5 TB/s = 20 Tbps. I don't know if signaling overhead, FEC, etc. is included in that number, but it's the only number we've got. So you're looking at 1 Tbps/mm. I am not sure that there is a definitive answer about whether it's 20 Tbps in each direction, or summed.
I wonder why you think Apple is stuck with that performance level. They built that more than two years ago. In the meantime, nVidia has recently started shipping an interconnect that, at 10 TB/s, is twice as fast as Apple's (or four times, depending on whether or not that 20 Tbps figure is unidirectional). It's hard to imagine nVidia doing a substantially better job than Apple on this; they're both working with similar building blocks, iso-process, and Apple is in fact on a newer process, though it's not actually clear to me that the process matters much for TSMC's packaging tech.
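The "twice or four times" spread comes straight from that direction ambiguity; a trivial sketch (the 10 TB/s and 2.5 TB/s inputs are the figures discussed above):

```python
def nvlink_vs_ultrafusion(nv_tb_s: float = 10.0, uf_tb_s: float = 2.5) -> tuple:
    """Ratio of the 10 TB/s interconnect to UltraFusion's quoted 2.5 TB/s,
    depending on whether Apple's figure is per-direction or a summed total."""
    if_per_direction = nv_tb_s / (uf_tb_s * 2)  # effective 5 TB/s total
    if_summed = nv_tb_s / uf_tb_s
    return if_per_direction, if_summed

print(nvlink_vs_ultrafusion())  # → (2.0, 4.0)
```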
Given Apple's importance to TSMC, it would be pretty surprising if they *weren't* collaborating on newer versions of UltraFusion. So if I had to bet, I'd still bet that Apple's going to do their own thing, relying on TSMC to some extent.
Of course I've bet wrong a few times recently (I bet on new Studios at WWDC, though I was clear that was a low-confidence bet). Could be my streak will continue. :-/
> I read the article on nextplatform when you first posted it. [...]

(a) When I did the math on Blackwell's nvLink, I may have got it incorrect, BUT my conclusion is that an individual nvLink is behind UltraFusion. However, Blackwell implements a number of such links. That's how they hit a larger aggregate bandwidth.