
mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
Or am I wildly off base and missing something fundamental?
SoCs are usually structured as a bunch of nodes (things like CPU clusters, GPU clusters, Apple Neural Engine, DRAM controller, all the peripherals, etc) which all attach to a common interconnect. The industry likes to call this interconnect a NoC, short for network-on-chip. NoCs do have the somewhat network-like property that any node can reach any other node simply by issuing a request to the NoC. The addressing used in the NoC is physical memory addresses, so the "network switches" do simple address decoding to route incoming requests to the appropriate port.
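
If it helps to see the address-decode idea in code, here's a toy model - the node names and address ranges are completely made up, nothing Apple-specific:

Code:
# Toy model of a NoC switch that routes requests by decoding physical addresses.
# Node names and address ranges below are invented for illustration only.

ROUTE_TABLE = [
    # (base address, size, output port)
    (0x00_0000_0000, 0x10_0000_0000, "dram_ctrl_0"),
    (0x10_0000_0000, 0x10_0000_0000, "dram_ctrl_1"),
    (0x20_0000_0000, 0x00_1000_0000, "gpu_cluster"),
    (0x20_1000_0000, 0x00_0100_0000, "io_subsystem"),
]

def route(phys_addr: int) -> str:
    """Pick the output port whose address range contains the request's address."""
    for base, size, port in ROUTE_TABLE:
        if base <= phys_addr < base + size:
            return port
    raise ValueError(f"no route for address {phys_addr:#x}")

print(route(0x15_0000_0000))   # -> dram_ctrl_1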

The NoC in a complex SoC is partitioned to avoid needing a single giant switch that serves all nodes in the system. Instead, there's always several switches, which connect both to each other and to leaf nodes. If there's a subsystem that needs less bandwidth, for example, you might connect its nodes to a secondary switch that connects to the main high bandwidth switch through a narrow, low-bandwidth port.

What travels across UltraFusion is almost certainly one or more switch-to-switch links to bridge one Max chip's NoC to the other. At minimum, I'd expect the biggest, highest bandwidth switch that's connected to the DRAM controllers gets linked to its counterpart in the other die. That might be all they did, in fact, as it's all that's necessary to make the system function.

The link is probably not 10K wide, it's more likely to be structured as a bunch of narrower links, allowing the system to forward many requests and responses simultaneously.
 
  • Like
Reactions: altaic

leman

macrumors Core
Oct 14, 2008
19,521
19,674
SoCs are usually structured as a bunch of nodes (things like CPU clusters, GPU clusters, Apple Neural Engine, DRAM controller, all the peripherals, etc) which all attach to a common interconnect. The industry likes to call this interconnect a NoC, short for network-on-chip. NoCs do have the somewhat network-like property that any node can reach any other node simply by issuing a request to the NoC. The addressing used in the NoC is physical memory addresses, so the "network switches" do simple address decoding to route incoming requests to the appropriate port.

The NoC in a complex SoC is partitioned to avoid needing a single giant switch that serves all nodes in the system. Instead, there's always several switches, which connect both to each other and to leaf nodes. If there's a subsystem that needs less bandwidth, for example, you might connect its nodes to a secondary switch that connects to the main high bandwidth switch through a narrow, low-bandwidth port.

What travels across UltraFusion is almost certainly one or more switch-to-switch links to bridge one Max chip's NoC to the other. At minimum, I'd expect the biggest, highest bandwidth switch that's connected to the DRAM controllers gets linked to its counterpart in the other die. That might be all they did, in fact, as it's all that's necessary to make the system function.

The link is probably not 10K wide, it's more likely to be structured as a bunch of narrower links, allowing the system to forward many requests and responses simultaneously.

To add to this, Apple patents describe using different types of networks, such as a network that supports operation ordering for the CPU and a high-bandwidth network without ordering guarantees for the GPU. These networks can feature different topologies (e.g. ring, mesh) to achieve their desirable properties under implementation constraints. Also, Apple Silicon features a distinct memory hierarchy, with each memory controller being associated with a specific SoC cache block (that is, data is never replicated across multiple SoC cache blocks, which massively simplifies synchronization). The patents describe extending these networks beyond the chip boundary, so I think @mr_roboto is spot on with the explanation. Of course, I cannot claim to understand the deep details of all these arrangements as I do not have any training in this area; therefore, treat all this with a grain of salt.
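
As a toy illustration of that one-home-per-line idea (the slice count and the interleave function here are invented, not anything from the patents):

Code:
# Toy sketch: every physical cache line maps to exactly one SLC slice (and its
# memory controller), so a line never lives in two SLC blocks at once.
# The interleave function and slice count are made up for illustration.

CACHE_LINE_BYTES = 128   # Apple Silicon cache line size
NUM_SLC_SLICES = 8       # assumed slice count, purely for illustration

def home_slice(phys_addr: int) -> int:
    line_index = phys_addr // CACHE_LINE_BYTES
    return line_index % NUM_SLC_SLICES   # simple interleave on low line-index bits

assert home_slice(0x1000) == home_slice(0x107F)   # same 128-byte line, same home
print([home_slice(0x1000 + i * CACHE_LINE_BYTES) for i in range(8)])  # consecutive lines spread across slices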

Edit: the patent in question if anyone is interested: https://patentscope.wipo.int/search/en/detail.jsf?docId=EP433580442&_cid=P21-LZZKMX-21966-1
 
Last edited:

Confused-User

macrumors 6502a
Oct 14, 2014
852
986
I appreciate both @leman and @mr_roboto taking the time to write about this. I am up against the limits of my ignorance, but not much in the places you've addressed.

I was already pretty much up to speed on the material you presented (thus my question, what besides the NoC...). I'm also expert on routing and switching, generally speaking, though not specifically in the intra-chip context. I still don't see how you get to 10k signals. I imagined they might have added additional links directly between, say, SLCs, though that seems very messy... and still probably doesn't get you anywhere near 10k signals.

The obvious first followup question is, how wide is the NoC? I vaguely imagine it being the width of one cache line, that being what most traffic to and from memory controllers and caches looks like (at least conceptually- I don't know if actual practice differs). If I'm wildly off on this, then maybe that alone explains the "10k signals".

I was suggesting that if you assume Apple wanted to keep UF (I'm tired of writing out "UltraFusion") from being a huge bottleneck, how do they solve the problem that its bandwidth (presumably 2 Gbps/wire, though I suggested one way it could be 4 Gbps) just can't keep up with the NoC, which must run at a higher speed? You really can't do anything about the extra latency, but you *can* do something about lower bandwidth. I was positing that they replicated enough links across UF that you can funnel as many bits around over UF as you can over the rest of the NoC. Each link is used round-robin, so that with enough of them you can have an available link every cycle. I figure that could triple or quadruple the number of wires that get used - maybe even more? - and maybe *that* gets you to 10K.
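
In toy form, the idea I had in mind looks something like this (both clock numbers are pure guesses on my part):

Code:
import math

# Toy model of the round-robin idea: if each UltraFusion link runs slower than
# the on-die NoC, replicate the link and rotate across the copies so that a new
# transfer can launch every NoC cycle. Both clocks below are assumptions.

NOC_CLOCK_GHZ = 4.0        # hypothetical on-die NoC clock
UF_LINK_CLOCK_GHZ = 2.0    # hypothetical per-link clock across UltraFusion

links_needed = math.ceil(NOC_CLOCK_GHZ / UF_LINK_CLOCK_GHZ)
print(links_needed)        # -> 2 replicated links in this made-up example

def pick_link(noc_cycle: int) -> int:
    """Round-robin: the NoC cycle number selects which replicated link is used."""
    return noc_cycle % links_needed

print([pick_link(c) for c in range(6)])   # -> [0, 1, 0, 1, 0, 1]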

Anyway... I don't know enough to know if that idea is actually sensible or not. That's what I was asking.
 
  • Love
Reactions: wojtek.traczyk

leman

macrumors Core
Oct 14, 2008
19,521
19,674
@Confused-User Thanks for clarifying what you have been wondering about. These are indeed good questions, and I'd be very curious to know the answer as well! I hope that someone with industry knowledge (like @mr_roboto) can shed light on these matters.
 
  • Love
Reactions: wojtek.traczyk

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
I appreciate both @leman and @mr_roboto taking the time to write about this. I am up against the limits of my ignorance, but not much in the places you've addressed.

I was already pretty much up to speed on the material you presented (thus my question, what besides the NoC...). I'm also expert on routing and switching, generally speaking, though not specifically in the intra-chip context. I still don't see how you get to 10k signals. I imagined they might have added additional links directly between, say, SLCs, though that seems very messy... and still probably doesn't get you anywhere near 10k signals.

The obvious first followup question is, how wide is the NoC? I vaguely imagine it being the width of one cache line, that being what most traffic to and from memory controllers and caches looks like (at least conceptually- I don't know if actual practice differs). If I'm wildly off on this, then maybe that alone explains the "10k signals".
In Apple Silicon one cache line is 128 bytes, or 1024 bits (documented in "Apple Silicon CPU Optimization Guide", which you can download from Apple after signing up for a developer account). Thus, 1K bit wide datapaths would be fully utilized by moving a whole cache line in one clock cycle. Furthermore, all signals are likely unidirectional, in which case you'll end up needing a second 1K bit wide datapath to move cachelines in the opposite direction.

Add some overhead on top of that: some Arm bus interface standards, such as AXI, move requests (address + other stuff) in their own channels. Let's handwave and assume data, address, and miscellaneous overhead amount to 2.5k pins per channel. Now we're getting into the zone of just four channels filling out UF's 10K pin budget.
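
As rough arithmetic (the overhead figure is, again, a hand-wave):

Code:
# Back-of-envelope pin budget per channel; the overhead number is an assumption.
cache_line_bits = 128 * 8                # 1024-bit cache line
data_pins = 2 * cache_line_bits          # unidirectional signals, one 1K path each way = 2048
overhead_pins = 450                      # assumed address/command/handshake signals per channel
pins_per_channel = data_pins + overhead_pins
print(pins_per_channel, 10_000 // pins_per_channel)   # -> 2498 pins/channel, 4 channels in 10K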

The natural next question is: why would they want more than one channel? The answer: in these giant SoCs, there's a lot of agents generating a lot of memory requests in parallel, and a lot of memory controllers that can handle requests in parallel (assuming the requests made by different agents are reasonably spread out over the set of memory controllers). You need some parallelism in the NoC to take full advantage of that. There's probably plenty of parallel traffic that would get bottlenecked if there was only a single link to transport it between the two chips.

On that note, I'd actually guess the data links are considerably narrower than 1 cache line. Memory parallelism is very important in such a big system, and the latency penalty of using 4- or 8-cycle long bursts to move a 1024-bit cacheline isn't that bad, so it's likely a better choice to have more channels at the cost of each channel being narrower.
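
Putting some made-up numbers on that trade-off (the data-pin budget and the clock are assumptions):

Code:
# Channel width vs. channel count, with an assumed data-pin budget and link clock.
# Narrower channels mean more of them, at the cost of a few extra beats per line.
CACHE_LINE_BITS = 1024
TOTAL_DATA_PINS = 8_000      # assumed share of the ~10K signals carrying data
CLOCK_GHZ = 2.0              # assumed link clock

for width in (1024, 512, 256, 128):
    channels = TOTAL_DATA_PINS // (2 * width)    # two directions per channel
    beats = CACHE_LINE_BITS // width             # cycles to move one cache line
    print(f"{width:4}-bit channels: {channels} per direction, "
          f"{beats} beats/line ({beats / CLOCK_GHZ:.1f} ns serialization)")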

Hope that clarifies things - in the absence of documentation all we can deal in is educated guesses, but I think we can be reasonably sure that 10K signals isn't crazy.
 

Confused-User

macrumors 6502a
Oct 14, 2014
852
986
Hope that clarifies things - in the absence of documentation all we can deal in is educated guesses, but I think we can be reasonably sure that 10K signals isn't crazy.
It does clarify one thing - I somehow missed that a cache line is 128 bytes and not 64!

However, it still doesn't address my main point, which is the notion that UF runs at lower bandwidth per wire than the NoC does within the chip, and that Apple may simply be replicating links over UF compared to within the chip in order to compensate for that. So however wide the NoC is on-chip, be it 1024 bits or less (and your point about the advantages of a narrower "bus", and multiples, makes sense to me), UF would be some smallish multiple of that (3x? 4x?).

BTW, since Apple's NoC is in fact a network and not (say) a ring bus, there's a good deal of parallelism just from that. SLC1 can xfer to the GPU while SLC2 is xferring to the CPU and SLC3 is xferring from the TB4 controller, say, as long as there's no shared resource that all are touching (like, no big crossbar at the middle).
 

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
However, it still doesn't address my main point, which is the notion that UF runs at lower bandwidth per wire than the NoC does within the chip, and that Apple may simply be replicating links over UF compared to within the chip in order to compensate for that.
I would think it would have to run at the same max clock as the IP cores reading the caches.
 

altaic

macrumors 6502a
Jan 26, 2004
711
484
To add to this, Apple patents describe using different types of networks, such as a network that supports operation ordering for the CPU and a high-bandwidth network without ordering guarantees for the GPU. These networks can feature different topologies (e.g. ring, mesh) to achieve their desirable properties under implementation constraints. Also, Apple Silicon features a distinct memory hierarchy, with each memory controller being associated with a specific SoC cache block (that is, data is never replicated across multiple SoC cache blocks, which massively simplifies synchronization). The patents describe extending these networks beyond the chip boundary, so I think @mr_roboto is spot on with the explanation. Of course, I cannot claim to understand the deep details of all these arrangements as I do not have any training in this area; therefore, treat all this with a grain of salt.

Edit: the patent in question if anyone is interested: https://patentscope.wipo.int/search/en/detail.jsf?docId=EP433580442&_cid=P21-LZZKMX-21966-1

Wow, that patent is rather detailed. That’ll take awhile to digest.
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
However, it still doesn't address my main point, which is the notion that UF runs at lower bandwidth per wire than the NoC does within the chip, and that Apple may simply be replicating links over UF compared to within the chip in order to compensate for that.
My guess is actually that UF in M1 and M2 runs at the same clock speed as any other NoC link - about 2GHz is a reasonable-sounding main NoC frequency.
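
For what it's worth, Apple's marketing numbers for M1 Ultra (2.5 TB/s over UltraFusion, >10,000 signals) are at least consistent with that ballpark if you naively treat every signal as a data wire:

Code:
# Naive consistency check: Apple's 2.5 TB/s marketing figure spread over ~10,000
# signals. Treating every signal as data overstates things, so the real per-wire
# rate would be somewhat higher than this.
bandwidth_bits_per_s = 2.5e12 * 8     # 2.5 TB/s in bits per second
signals = 10_000
print(bandwidth_bits_per_s / signals / 1e9)   # -> 2.0 Gbps per wire, i.e. ~2 GHz at 1 bit/clock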

BTW, since Apple's NoC is in fact a network and not (say) a ring bus, there's a good deal of parallelism just from that. SLC1 can xfer to the GPU while SLC2 is xferring to the CPU and SLC3 is xferring from the TB4 controller, say, as long as there's no shared resource that all are touching (like, no big crossbar at the middle).
Even better than that: if Alice wants to talk to Bob at the same time as Carol wants to talk to Dave, a true crossbar switch can accommodate both conversations simultaneously with no resource conflicts.
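
In toy form, with the same names:

Code:
# Toy crossbar check: a set of transfers can all proceed in the same cycle as
# long as no input port and no output port appears twice.
def conflict_free(transfers):
    """transfers: list of (source port, destination port) pairs."""
    sources = [src for src, _ in transfers]
    dests = [dst for _, dst in transfers]
    return len(set(sources)) == len(sources) and len(set(dests)) == len(dests)

print(conflict_free([("Alice", "Bob"), ("Carol", "Dave")]))   # True: both conversations at once
print(conflict_free([("Alice", "Bob"), ("Carol", "Bob")]))    # False: Bob's port is contended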
 

treehuggerpro

macrumors regular
Oct 21, 2021
111
124
Wow, that patent is rather detailed. That’ll take awhile to digest.

With details of bidirectional Lane bundles sharing a common power signal and a common clock signal . . .

Calculations based on the Tech Insights’ µm pitch and dimension findings . . .
[Attached image: M1 UltraFusion.jpg]


Speculation only: Half Slice –> (3tx+3rx+4aux+2clk) resolves to 10032 Lanes.
 
Last edited:

Confused-User

macrumors 6502a
Oct 14, 2014
852
986
My guess is actually that UF in M1 and M2 runs at the same clock speed as any other NoC link - about 2GHz is a reasonable-sounding main NoC frequency.
Wait, really? Ok, I think we've hit one of the main things I didn't know.

I was under the impression that the NoC would run at higher clocks when under load. If 2GHz really is a likely max speed, then that means my idea is definitely not necessary and does not explain the 10k figure... though your related explanation about using multiple paths to explicitly gain parallel traffic flows presumably does.

Thank you for clearing that up.
 

Pressure

macrumors 603
May 30, 2006
5,179
1,544
Denmark
With details of bidirectional Lane bundles sharing a common power signal and a common clock signal . . .

Calculations based on the Tech Insights’ µm pitch and dimension findings . . .

Speculation only: 6-pin bundles/Lane (tx+rx+tv+rv+clk+p) would resolve to 10184 Lanes . . .

Never Mind.
From the marketing material it was already clear that the UltraFusion silicon interposer had over 10,000 signal lanes.
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
Wait, really? Ok, I think we've hit one of the main things I didn't know.

I was under the impression that the NoC would run at higher clocks when under load. If 2GHz really is a likely max speed, then that means my idea is definitely not necessary and does not explain the 10k figure... though your related explanation about using multiple paths to explicitly gain parallel traffic flows presumably does.

Thank you for clearing that up.
To me it's more of a reasonable speed than a likely speed, but yeah, it wouldn't be weird if it was that slow. It fits Apple's observable design philosophy: they like to build big/wide designs that run at relatively low clock speeds because that often results in greater power efficiency. They are willing to accept the tradeoff that comes with this: reduced areal efficiency.

You can see this most clearly in their GPUs, which have worse perf/mm^2 than NVidia GPUs, because they're clocked far slower, but offer substantially better perf/W. This is also their CPU design philosophy, and it would not be surprising to see it applied to the interconnect fabric too.
 

treehuggerpro

macrumors regular
Oct 21, 2021
111
124
From the marketing material it was already clear that the UltraFusion silicon interposer had over 10,000 signal lanes.

[The patent] allowed us to see the die-to-die format UltraFusion was built / based on. Bidirectional bundles follow along the lines of the BoW Specification. (An earlier version obviously.)

__________

The relevance being UltraFusion already follows the footprint the UMI Standards follow from. The first of the UMI Standards was originally going to be called UMI-BoW2.1.

[Attached image: eliyan-uci-bow-umi.jpg]
 
Last edited:

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,225
To me it's more of a reasonable speed than a likely speed, but yeah, it wouldn't be weird if it was that slow. It fits Apple's observable design philosophy: they like to build big/wide designs that run at relatively low clock speeds because that often results in greater power efficiency. They are willing to accept the tradeoff that comes with this: reduced areal efficiency.

You can see this most clearly in their GPUs, which have worse perf/mm^2 than NVidia GPUs, because they're clocked far slower, but offer substantially better perf/W. This is also their CPU design philosophy, and it would not be surprising to see it applied to the interconnect fabric too.
For GPUs, one should add that Nvidia's MaxQ designs have very similar TFLOP/core count/clock/power trade off as Apple's chips (with differences in actual raster and ray tracing performance based on each GPU's strength). The 4070 MaxQ is similar to a cut down M3 Max (even lower clocks!) - although the 4050 MaxQ has a little bigger and more power hungry design than the M3 Pro. It seems 35W is as low as Nvidia wanted to make its dGPUs go.

Of course, the MaxQ designs aren't very popular with many gamers precisely because of that. I mean the 4070 mobile has a decent ~40% better performance than the 4070 MaxQ, but costs more than 3x the power to get it! I guess a lot of PC gamers would rather have harrier jump jets than pay for silicon die area and get cool, quiet operations. Yeah I'm being snarky, but I guess it's all about what you value and I suppose if you're wearing headphones anyway ... But I do wonder about TCO for regular gamers in some areas with particularly high electricity bills though*.

*one of the arguments for wanting 4xxx mobile over 4xxx MaxQ is "if I'm going to pay for 4xxx silicon, I should pay a tiny bit more and get lots more performance", but if you are gaming regularly I bet you pay for some of that difference over time (admittedly maybe not enough to equalize perf/$, but still), never mind intangibles like costs in comfort.
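
Back-of-envelope, taking the power numbers from the ~3x figure above and assuming my own hours and rate:

Code:
# Rough extra electricity cost of 4070 mobile vs. 4070 MaxQ while gaming.
# Power numbers follow the ~3x figure above; hours and price are assumptions.
maxq_watts, mobile_watts = 35, 115
hours_per_week = 15          # assumed gaming time
price_per_kwh = 0.40         # assumed rate in a high-cost region

extra_kwh_per_year = (mobile_watts - maxq_watts) / 1000 * hours_per_week * 52
print(f"{extra_kwh_per_year:.0f} kWh/year extra, "
      f"~{extra_kwh_per_year * price_per_kwh:.0f} per year at the assumed rate")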
 
Last edited:

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
[The patent] allowed us to see the die-to-die format UltraFusion was built / based on. Bidirectional bundles follow along the lines of the BoW Specification. (An earlier version obviously.)

__________

The relevance being UltraFusion already follows the footprint the UMI Standards follow from. The first of the UMI Standards was originally going to be called UMI-BoW2.1.
[citation needed]

It's not really helping anyone that you keep just deciding random chiplet interface standards must be related to what Apple's doing, without any kind of evidence.

Also... earlier you wrote:
With details of bidirectional Lane bundles sharing a common power signal and a common clock signal . . .

Calculations based on the Tech Insights’ µm pitch and dimension findings . . .

Speculation only: 6-pin bundles/Lane (tx+rx+tv+rv+clk+p) would resolve to 10184 Lanes . . .
Your calculations (assuming that was your text and arrows overlaid on the Tech Insights image) have at least one giant flaw, and another somewhat less serious error.

What on earth are "tv" and "rv"? Why do you think the patent describes bundles as containing clock and power signals in addition to data? (*) Did you not notice the language in the patent warning that its diagrams and pin counts are only examples to illustrate the principle, and that actual devices covered by the patent may have far larger bundles (hundreds of signals or more)?

* I'm pretty sure I read the sections that led you to believe this, but if so, you're misinterpreting. The patent's text and diagrams cover the fact that using common branches of the clock and power distribution trees to serve an entire transmitter group helps to minimize timing skew across that group.

But what is not implied is that Apple's moving lots of clock and power signals across the bridge. They don't need to do that. The two chips should have a common reference clock, and PLLs and/or delay lines to phase-shift receive flip-flops until they're sampling in the middle of the data eye. As for power, just... no. That's even more wrongheaded. Apple is probably providing lots of ground in the bridge, but only as shielding to reduce signal-to-signal interference. Both dies have their own connections to power from the package and there really isn't any good reason to burn precious bridge interface bumps on moving power from one die to the other.
 

komuh

macrumors regular
May 13, 2023
126
113
For GPUs, one should add that Nvidia's MaxQ designs have very similar TFLOP/core count/clock/power trade off as Apple's chips (with differences in actual raster and ray tracing performance based on each GPU's strength). The 4070 MaxQ is similar to a cut down M3 Max (even lower clocks!) - although the 4050 MaxQ has a little bigger and more power hungry design than the M3 Pro. It seems 35W is as low as Nvidia wanted to make its dGPUs go.

Of course, the MaxQ designs aren't very popular with many gamers precisely because of that. I mean the 4070 mobile has a decent ~40% better performance than the 4070 MaxQ, but costs more than 3x the power to get it! I guess a lot of PC gamers would rather have harrier jump jets than pay for silicon die area and get cool, quiet operations. Yeah I'm being snarky, but I guess it's all about what you value and I suppose if you're wearing headphones anyway ... But I do wonder about TCO for regular gamers in some areas with particularly high electricity bills though*.

*one of the arguments for wanting 4xxx mobile over 4xxx MaxQ is "if I'm going to pay for 4xxx silicon, I should pay a tiny bit more and get lots more performance", but if you are gaming regularly I bet you pay for some of that difference over time (admittedly maybe not enough to equalize perf/$, but still), never mind intangibles like costs in comfort.
PC gamers prefer peak performance because that's the only thing that matters for games; I don't want to play my game in 1080p on my 4K TV. It's the same for professionals: they prefer to run renders/exports etc. faster, since power is a lot cheaper than human time, at least for now.

These are two different use cases: one is that you want your mobile device to be quiet and have good battery life, the other is that you need power to do your compute and don't want to wait twice as long because you are power limited. Why don't F1 drivers use a Toyota Corolla? They would have a quieter and more power-efficient car, and they would save so much on gas/electricity and rubber if they did 😅
 
Last edited:

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,225
PC gamers prefer peak performance because that's the only thing that matters for games; I don't want to play my game in 1080p on my 4K TV. It's the same for professionals: they prefer to run renders/exports etc. faster, since power is a lot cheaper than human time, at least for now.

These are two different use cases: one is that you want your mobile device to be quiet and have good battery life, the other is that you need power to do your compute and don't want to wait twice as long because you are power limited. Why don't F1 drivers use a Toyota Corolla? They would have a quieter and more power-efficient car, and they would save so much on gas/electricity and rubber if they did 😅
Except that these are laptops and thus you are already in a thermally limited environment. The excuse of favoring raw performance above all else may work on a large desktop with sufficient cooling*. It does not on many if not most laptops. A lot of these designs, especially anything less than 16", can't actually handle the thermals of 150-200W for any length of time, bringing the performance down anyway. Sure you could build a 17" 10lbs monstrosity to run a 4090 mobile, and OEMs have, but gamers aren't exactly buying those in droves either. They want to have the svelte, lightweight laptop (at least compared to what would be required to adequately cool their processors) and have the performance promised by the benchmarks. Which means they are, in effect, kidding themselves. To turn your analogy back on itself, this is the equivalent of putting an F1 engine in a Smart Car and thinking you are still going to compete in the Grand Prix.

Okay I'm exaggerating a little in the above paragraph: of course you're still getting better performance, $ for $, from a laptop with a mobile chip than one with a MaxQ. I don't dispute that. I also understand why gamers prefer it and there's nothing wrong being an enthusiast for gaming over any other hobby, but the performance uplift doesn't turn a 1080p gaming laptop into a 4K one and there are significant trade offs in trying to do so. Truthfully the MaxQ/Apple design is simply more sensible for the form factor, even for gaming (as well as production and compute I should add).

*And even then, AMD, Nvidia, and especially Intel have all shown at various points over the last couple of years that particular design philosophy for desktop chips has negative consequences when taken too far.
 
Last edited:

komuh

macrumors regular
May 13, 2023
126
113
Except that these are laptops and thus you are already in a thermally limited environment. The excuse of favoring raw performance above all else may work on a large desktop with sufficient cooling*. It does not on most laptops. A lot of these designs can't actually handle the thermals of 200W for any length of time, bringing the performance down anyway. Sure you could build a 10lbs monstrosity, and OEMs have, but gamers aren't buying those in droves either. They want to have the svelte, lightweight laptop (at least compared to what would be required to adequately cool their processors) and have the performance promised by the benchmarks. Which means they are, in effect, kidding themselves. To turn your analogy back on itself, this is the equivalent of putting an F1 engine in a Smart Car.

Okay I'm exaggerating a little in the above paragraph: of course you're still getting better performance, $ for $, from a laptop with a mobile chip than one with a MaxQ. I don't dispute that. I also understand why gamers prefer it and there's nothing wrong being an enthusiast for gaming over any other hobby, but the performance uplift doesn't turn a 1080p gaming laptop into a 4K one and has significant trade offs in trying to do so. Truthfully the MaxQ/Apple design is simply more sensible for the form factor, even for gaming (as well as production and compute I should add).

*And even then, AMD, Nvidia, and especially Intel have all shown at various points over the last couple of years that particular design philosophy for desktop chips has negative consequences when taken too far.
That's why I started my response by saying PC gamers. On mobile devices, designing for perf/watt is useful. On my M1 Ultra/RTX 3090 PC I don't want to be limited by power but by the thermal performance of the system (which sadly is not the case for the M1 Ultra GPU, which is extremely underpowered compared to the possible thermal performance of the whole design), and on my MacBook Pro I want a smooth, high-quality 120Hz display and long battery life; I don't care about performance as long as it can run the web and videos well enough. (Also, undervolting is always possible; I have a 12700K on my server undervolted to about 3.5-5W of idle power usage, so close to my M1 Ultra.)
 

crazy dave

macrumors 65816
Sep 9, 2010
1,453
1,225
That's why I started my response by saying PC gamers. On mobile devices, designing for perf/watt is useful. On my M1 Ultra/RTX 3090 PC I don't want to be limited by power but by the thermal performance of the system (which sadly is not the case for the M1 Ultra GPU, which is extremely underpowered compared to the possible thermal performance of the whole design), and on my MacBook Pro I want a smooth, high-quality 120Hz display and long battery life; I don't care about performance as long as it can run the web and videos well enough. (Also, undervolting is always possible; I have a 12700K on my server undervolted to about 3.5-5W of idle power usage, so close to my M1 Ultra.)
Right, but my post is referring specifically to gaming laptops. Nvidia's mobile and MaxQ GPUs are only used in laptops as far as I know, and what I was saying is that Nvidia's MaxQ design philosophy is basically the same as Apple's and is more sensible, in my opinion, for the form factor, but is still disliked by the majority of gamers regardless.

That said, while I appreciate Apple wanting to build quiet workstations, I don't disagree that they could afford to push their clocks, especially GPU clocks, on desktop more. No question. Even while maintaining their reputation for quiet operation, they have a lot more thermal headroom to play with and should use it. Gurman's rumor is that the M4/M5 "Hidra" die will be the basis for a true "desktop-first" chip. We'll see of course. But hopefully he's right.
 
Last edited:
  • Like
Reactions: komuh

leman

macrumors Core
Oct 14, 2008
19,521
19,674
But what is not implied is that Apple's moving lots of clock and power signals across the bridge. They don't need to do that. The two chips should have a common reference clock, and PLLs and/or delay lines to phase-shift receive flip-flops until they're sampling in the middle of the data eye. As for power, just... no. That's even more wrongheaded. Apple is probably providing lots of ground in the bridge, but only as shielding to reduce signal-to-signal interference. Both dies have their own connections to power from the package and there really isn't any good reason to burn precious bridge interface bumps on moving power from one die to the other.

I don't know how relevant this is to the discussion, but if anyone is curious, Apple also has a lot of patents related to multi-die clock and power state management. There is a lot of information being exchanged between the dies — of course, it is information, not power wires themselves etc.

For those curious, you can go to https://patentscope.wipo.int/search and use this search pattern to find a lot of relevant patents:

Code:
ALLNAMES:(apple) fp:(die) fp:(clock)


PC gamers prefer peak performance because that's the only thing that matters for games; I don't want to play my game in 1080p on my 4K TV.

Gamers prefer sustained performance, not peak performance. You want stable high framerates in your game, not spikes. The main reason Nvidia designs their gaming hardware the way they do is that it is cheaper and the customers don't mind the occasional instability or excessive power consumption. That said, instabilities are quite rare nowadays, but you did get them quite a lot on older gaming GPUs — running memory at super high clocks could produce errors, but these are tolerable for games. Workstation GPUs instead were clocked more conservatively to facilitate stability.
 

komuh

macrumors regular
May 13, 2023
126
113
I don't know how relevant this is to the discussion, but if anyone is curious, Apple also has a lot of patents related to multi-die clock and power state management. There is a lot of information being exchanged between the dies — of course, it is information, not power wires themselves etc.

For those curious, you can go to https://patentscope.wipo.int/search and use this search pattern to find a lot of relevant patents:

Code:
ALLNAMES:(apple) fp:(die) fp:(clock)




Gamers prefer sustained performance, not peak performance. You want stable high framerates in your game, not spikes. The main reason Nvidia designs their gaming hardware the way they do is that it is cheaper and the customers don't mind the occasional instability or excessive power consumption. That said, instabilities are quite rare nowadays, but you did get them quite a lot on older gaming GPUs — running memory at super high clocks could produce errors, but these are tolerable for games. Workstation GPUs instead were clocked more conservatively to facilitate stability.
Peak stable performance. I feel like this is basic knowledge: peak in the sense of silicon and thermal limits, not peak like some sort of random boost.
 
  • Haha
Reactions: MayaUser

leman

macrumors Core
Oct 14, 2008
19,521
19,674
Peak stable performance. I feel like this is basic knowledge: peak in the sense of silicon and thermal limits, not peak like some sort of random boost.

I just think it is important to clarify, given that peak performance for CPUs usually means burst performance, and GPUs too have a history of some shenanigans here (boost clocks on some Nvidia GPUs, for example).
 

treehuggerpro

macrumors regular
Oct 21, 2021
111
124
Hi @mr_roboto. The main observations from the patent were that the bidirectional and Lane bundle formats provided reasonable leads on Apple’s approach. BoW was the most progressive (Open) option when Apple was developing UltraFusion and as a template it fits. I expect Apple did whatever was necessary in getting it to market though.

Yeah, I see a giant error (+ others), which after a closer look at BoW has the numbers nearer to a realistic fit with its options and the pin count (post adjusted). The noted need for very low latency, also assumes Apple may have used the half slice in favour of lower SerDes latency + additional Aux for yield at the pitch used. Whether that’s right or wrong thinking, I don’t know. The note on SerDes in 2.4. seems to suggest BoW is already somewhat mitigated around the problem. The v (validation) was for FEC/Aux and, obviously, the experiment was intended as an approximate.

The excess of pins and the patent seem to indicate Apple is working with the same analogue bridging requirements as others. I made reasonable attempts at finding anything referencing or demonstrating a PHY-less option, which only led back to further detail on the necessity of die-to-die PHY.

Die-to-die and die-to-memory constraints are considered the primary holdup on the broader adoption of chiplets. Eliyan stands out because they're first to market with the necessary leaps, and also because of their broad backing, in particular from the memory manufacturers as the industry standardises.

The UMI Standards are intended as universal interconnects, die-to-die and die-to-memory. They are however, targeting the memory wall issue in a couple ways, including with their nomenclature.

I’m happy for the analysis etc, but this isn’t a professional forum, narrative, unfortunately, is the dominant mode. If you wish for a particular standard, it is via the subject matter with the subject matter. Equally with counters, there’s no way to address unreferenced or un-researchable concepts.

This subject is pretty much exhausted I expect, until the next WWDC comes around.
 

Confused-User

macrumors 6502a
Oct 14, 2014
852
986
Hi @mr_roboto. The main observations from the patent were that the bidirectional and Lane bundle formats provided reasonable leads on Apple’s approach. BoW was the most progressive (Open) option when Apple was developing UltraFusion and as a template it fits. I expect Apple did whatever was necessary in getting it to market though.
You keep coming up with PHYs with SerDes and saying that they are a basis for or the next generation of UF.

Why?

The patent that's been discussed here at length is really clear on NOT taking that approach. You wrote "I made reasonable attempts at finding anything referencing or demonstrating a PHY-less option, which only led back to further detail on the necessity of die-to-die PHY." - but that's ignoring that patent.

I can't even parse the second to last paragraph. No idea what you're trying to say there.
 
  • Like
Reactions: mr_roboto