M4+ Chip Generation - Speculation Megathread [MERGED]

name99 · Jun 18, 2024

tenthousandthings said:
No doubt this was discussed ad infinitum back in the day. My impression, though, was it had little to do with additional costs and/or diminishing markets (I think if either were the case then "Jade 4C-Die" would never have made it onto the leaked roadmap in the first place), but rather it was engineering roadblocks, possibly some make-or-break feature they couldn't make work, but more likely inherent performance and/or efficiency limitations that were compounded as they multiplied the dies. Chip-first bridge packaging worked with simple, matching two-way pairs of chips, but beyond that it became problematic. We know now that Apple was the first to bring InFO-LSI to market, so it was probably a bit of an engineering feat just to make that happen. The four-way bridge design (however that was constructed) didn't work as well, so it wasn't built.

Here's what Anand Shimpi said in September 2023, possibly referring to this experience: "At the end of the day, we’re a product company. So we want to deliver, whether it’s features, performance, efficiency. If we’re not able to deliver something compelling, we won’t engage, right? ... We won’t build the chip."

Edit to add that the removal of chip-first InFO-LSI packaging from TSMC's current public relations/press release site might be a consequence of this apparent failure beyond a simple two-way bridge. Note that "InFO-LSI" still appears as part of TSMC's chip-last CoWoS-L packaging. It's not inconceivable that M4's advanced packaging will switch to that approach.

They COULD probably build an InFO-LSI 4-way design.
This patent shows one way to do it.

US20240039539A1 - Systems and methods for implementing a scalable system - Google Patents

Multi-chip systems and structures for modular scaling are described. In some embodiments an interfacing bar is utilized to couple adjacent chips. For example, a communication bar may utilized to coupled logic chips, and memory bar may be utilized to couple multiple memory chips to a logic chip.

patents.google.com

There are various options given, but Fig 30 is the one I'm thinking of (although that's a crazy scaled up version, that shows you how to go wild if you want!)

Imagine a central square, then put a Max chip on each side of the square. Inside the square put a "hub/router" chip.
Then an EMIB/InFO-LSI bridge connects each Max to the central router chip.
For the central hub there are other interesting additional options you can imagine.
Fab it on N7 for lower price?
Fill it up with an additional 256MB (or 1GB? whatever fits!) of additional SLC cache?

This option allows substantially unmodified Max chips, so that's an advantage. Other designs allow for more compact packing, but then you start to use up more than one edge for Fusion, and it becomes unclear that the single chip is what you want to sell as a Max...

BTW another interesting set of patents are
(2023) https://patents.google.com/patent/US20230214350A1
(2023) https://patents.google.com/patent/US20240085968A1
dealing with things like power and clock sync across multiple SoCs.
We've had stuff like this before for two-way designs (ie Ultra) but these are for four-way designs...

leman · Jun 18, 2024

altaic said:
Yeah, though the SLC wouldn’t go off die; 4 ns of latency is a lot of clock cycles.

They already have off-die SLC, kind of - Mx Ultra has half of the SLC on the other die.

The latency of communicating between Apple Silicon IP blocks is already very high. Going between CPU clusters for example is similar to multi-socket server systems. Doesn’t seem like it has a pronounced negative effect on performance.

altaic · Jun 18, 2024

leman said:
They already have off-die SLC, kind of - Mx Ultra has half of the SLC on the other die.

The latency of communicating between Apple Silicon IP blocks is already very high. Going between CPU clusters for example is similar to multi-socket server systems. Doesn’t seem like it has a pronounced negative effect on performance.

Shared caches (which are each sized for their respective die) across dies is pretty different to having only off-die SLC. I don’t remember the cache latencies off hand, though I do remember being initially surprised. I guess I’d be once again surprised if an extra 8 ns (round trip) would be acceptable.

Confused-User · Jun 18, 2024

leman said:
They already have off-die SLC, kind of - Mx Ultra has half of the SLC on the other die.

The latency of communicating between Apple Silicon IP blocks is already very high. Going between CPU clusters for example is similar to multi-socket server systems. Doesn’t seem like it has a pronounced negative effect on performance.

Is this well established?

You could presumably plot a curve showing scaling for a multicore workload running on 1-16 P cores on an M2 Ultra, and then look for a knee between 8 and 9 cores. Has this been done for enough different workloads that we can feel confident that UltraFusion isn't imposing any significant costs? I haven't seen anything on this but I haven't been paying attention consistently since it came out.

If so, then that suggests that there is some other bottleneck that they're going to want to address. We *should* see a penalty.

leman · Jun 18, 2024

altaic said:
Shared caches (which are each sized for their respective die) across dies is pretty different to having only off-die SLC.

Another thing to keep in mind is that, in Apple designs, SLC cache is tightly integrated with the memory controllers. So if SLC goes off-die, the controllers likely would too. I don’t think this is feasible with usual multi-die technology. Maybe using die stacking with more advanced direct interconnects. I think Apple did published a patent or two describing such arrangements, at this point it’s only academic.

Confused-User said:
Is this well established?

You could presumably plot a curve showing scaling for a multicore workload running on 1-16 P cores on an M2 Ultra, and then look for a knee between 8 and 9 cores. Has this been done for enough different workloads that we can feel confident that UltraFusion isn't imposing any significant costs? I haven't seen anything on this but I haven't been paying attention consistently since it came out.

If so, then that suggests that there is some other bottleneck that they're going to want to address. We *should* see a penalty.

Oh, I am certain that one can craft a synthetic workload that would run very slow on Apple. Anything that requires excessive inter-CPU synchronization will do. Then again, these kind of workloads will be very slow on any type of modern multi-core processor. It seems to me that Apple optimizes for the common case where synchronization is limited to a small number of threads (which can be allocated to the same cluster with fast synchronization and data sharing). They are making clusters larger, so there is some evolution here as well. Common workstation uses (where you actually want multiple cores) tend to be focused on parallel processing, so Apples design works very well there.

Boil · Jun 19, 2024

name99 said:
octa-SoC

64-core CPU (48P/16E)
384-core GPU
128-core Neural Engine
1.92TB LPDDR5X RAM (inline ECC)
4.32TB/s UMA bandwidth

name99 · Jun 19, 2024

Confused-User said:
Is this well established?

You could presumably plot a curve showing scaling for a multicore workload running on 1-16 P cores on an M2 Ultra, and then look for a knee between 8 and 9 cores. Has this been done for enough different workloads that we can feel confident that UltraFusion isn't imposing any significant costs? I haven't seen anything on this but I haven't been paying attention consistently since it came out.

If so, then that suggests that there is some other bottleneck that they're going to want to address. We *should* see a penalty.

Depends how they hash the addresses.
The smart thing would be to interleave the addresses as densely as possible (probably by DRAM page size granularity) across the two SLCs. In which case it would be very very difficult to detect.

name99 · Jun 19, 2024

leman said:
Another thing to keep in mind is that, in Apple designs, SLC cache is tightly integrated with the memory controllers. So if SLC goes off-die, the controllers likely would too. I don’t think this is feasible with usual multi-die technology. Maybe using die stacking with more advanced direct interconnects. I think Apple did published a patent or two describing such arrangements, at this point it’s only academic.

I don't think that's essential. There are many ways you could do it, the most trivial being a two level SLC, with the second level having large "lines" (eg 4kB or even 16kB) and duplicate tags situated in the first level SLC.
That would fit closely with the current design.

BUT the details of how data is routed across the SoC through the SLC to the memory controller have changed at least three times (and substantially redesigns, not minor tweaks). I could easily see another such redesign if it's considered to make sense.

Certainly an alternative design that envisages no standalone "Max-equivalent", only a base element that's always paired up with a "hub" chip could be refactored to place a faster "local" SLC on the "max-equivalent", but no memory controller, and the memory controllers sit with "global" SLC on the hub chip. But the geometry of that gets tricky unless you want to build a two layer SoC - which maybe you do, given what I said about the size of an Extreme built like a duplicated Ultra...

leman said:
Oh, I am certain that one can craft a synthetic workload that would run very slow on Apple. Anything that requires excessive inter-CPU synchronization will do. Then again, these kind of workloads will be very slow on any type of modern multi-core processor. It seems to me that Apple optimizes for the common case where synchronization is limited to a small number of threads (which can be allocated to the same cluster with fast synchronization and data sharing). They are making clusters larger, so there is some evolution here as well. Common workstation uses (where you actually want multiple cores) tend to be focused on parallel processing, so Apples design works very well there.

Also is the synthetic workload actually appropriately written for Apple Silicon (or ARM in general)?

In particular the most usual tests of these things on x86 are kinda dumb on ARM because the right way to solve the most common cross-CPU (and cross-cluster) synchronization would be via remote atomics rather than via locks or the equivalent...

Confused-User · Jun 19, 2024

name99 said:
Depends how they hash the addresses.
The smart thing would be to interleave the addresses as densely as possible (probably by DRAM page size granularity) across the two SLCs. In which case it would be very very difficult to detect.

I may be missing what you're saying here. Possibly because you are too terse, but also possibly because of a gap in my understanding.

Are you saying that you think that memory addresses (presumably hw addresses, but reflecting through into vm addresses) should be distributed page-by-page across both Max dies? This seems fairly fantastic to me - wouldn't that make it impossible to take advantage of memory locality to improve performance?

I am naively assuming that it is optimal for the working set of a process (thread) to be proximate to the core running it. If you spread adjacent pages across both chips, that's impossible (right?).

Thinking about this more, it's clear that if you have a set of threads that are large enough to require simultaneous use of both dies, in which individual threads frequently access the same pages, rather than mostly touching pages exclusive to themselves, this goes out the window. But that's a worst-case situation, and not (I thought, at least) all that common. Or am I wrong about that? And are there other considerations I'm missing?

testato · Jun 20, 2024

the miracle of ios 18 on neural engine ,hoping for the same results on M4 ne with sequoia

iPhone 15 Pro vs iPhone 15 Pro - Geekbench

browser.geekbench.com

Pressure · Jun 20, 2024

testato said:
the miracle of ios 18 on neural engine ,hoping for the same results on M4 ne with sequoia

iPhone 15 Pro vs iPhone 15 Pro - Geekbench

browser.geekbench.com

I'm guessing it will be like the M4 in the iPad Pro 🤷🏼‍♂️

Here is an example.

altaic · Jun 20, 2024

testato said:
the miracle of ios 18 on neural engine ,hoping for the same results on M4 ne with sequoia

iPhone 15 Pro vs iPhone 15 Pro - Geekbench

browser.geekbench.com

Same here, though my French isn’t great 😊

name99 · Jun 20, 2024

Confused-User said:
I may be missing what you're saying here. Possibly because you are too terse, but also possibly because of a gap in my understanding.

Are you saying that you think that memory addresses (presumably hw addresses, but reflecting through into vm addresses) should be distributed page-by-page across both Max dies? This seems fairly fantastic to me - wouldn't that make it impossible to take advantage of memory locality to improve performance?

I am naively assuming that it is optimal for the working set of a process (thread) to be proximate to the core running it. If you spread adjacent pages across both chips, that's impossible (right?).

Thinking about this more, it's clear that if you have a set of threads that are large enough to require simultaneous use of both dies, in which individual threads frequently access the same pages, rather than mostly touching pages exclusive to themselves, this goes out the window. But that's a worst-case situation, and not (I thought, at least) all that common. Or am I wrong about that? And are there other considerations I'm missing?

Suppose you have eight DRAM chips, let's say each 1GB in size. (yes, yes, I'm simplifying tremendously to make a point)

How do you distribute addresses across them?
The naive way is to route all physical addresses from 0GB to 1GB to the first DRAM, then from 1GB to 2GB to the second DRAM, and so on. But is this optimal? It will tend to locate large blocks of contiguous physical memory (as in when an app or the OS asked for a large block of memory) in one DRAM chip, so we can't stream data from multiple DRAM chips at once.

And alternative is to rearrange the address bits ("hash" the address) so that say the first 64kB goes to the first DRAM, the next 64kB to the second DRAM, and so on. This will enhance the likelihood that various common use cases (like sequential streaming) are spread across multiple DRAMs.

Now how small should we make these units? The optimal size is probably the size of a DRAM page (not an OS page). A DRAM page is the smallest unit of DRAM that can be active. For LPDDR5 they're probably 1, 2 or 4kB depending on the exact DRAM details. So that's probably the unit we should interleave.

Note that this is all about how PHYSICAL (not virtual) addresses are routed to DRAM. It's invisible to any hardware above the memory controller.

Now take those same ideas and apply them to an Apple style design with an SLC. Suppose we have, say, two SLC blocks (SRAM plus memory controller) like on say an M1 Pro. [Again this is a simplification, work with me and just assume there are two, one on the left side, one on the right.]

Once again, how do we split data across those SLCs? The dumb alternative is to associate say the upper half of the physical address range with the left SLC, the lower half of the physical address range with the right SLC, for the same sort of reasons - in many cases you'll have most of the activity hitting one SLC but not the other.
Once again a much better choice is to interleave a smaller unit between the SLCs, so that, eg, the first 128B of physical address space routes to one SLC, then the second to the other SLC and back. Once again, now, common patterns like sequential streaming will alternate between SLCs.

Obviously we want to take advantage of both these ideas. This means we probably don't want to alternate SLC at the 128B level because that doesn't work for DRAM; we need to alternate addresses between SLCs at the DRAM page granularity, then the memory controller can further hash addresses to stripe over as many independent DRAM elements as it controls.
The same ideas then scale up to an M1 Max, and then an M1 Ultra.
The exact details will depend on capacities, timing, simulations, etc. The memory controllers and NoCs are programmable, so all this stuff can be varied. The obvious way to vary it is by load - if the load is small it's certainly possible in principle to freeze the system, flush SLCs, move data between DRAMs, reprogram the hash pattern, and restart. This would allow you to, for example, power down half your DRAMs, memory controller, and SLC. Apple have patents for doing this, but it's tricky getting the sequencing correct (and knowing when you should reduce or increase capacity) so I don't know the extent to which they actually do this.

I hope this explains the issues. At the virtual level all addresses are still contiguous. At the physical level, when streaming you'll ping pong between the two Max SLCs and sets of DRAM. Is this slower than using just data on the one Max? Define "slower"...
You will have higher latency to the SLC for half the requests, but also twice the SLC capacity, twice the number of open pages in DRAM, and twice the DRAM bandwidth.
I'm not sure what the optimal details are, but Apple systems have lots of telemetry tracking these things (along with simulations) so I'm sure they do a reasonable job of toggling between use cases.

This is the patent for dynamic address hashing: https://patents.google.com/patent/US20220342806A1
(the actual patent describes a way of saving power by stripping bits off the hashed address as the data packets move through routing, but the background is address hashing)

My gut feeling is that most of the time either
- larger SLC (with half of it slightly higher latency, but still much better than DRAM) wins [small jobs] OR
- larger bandwidth wins [large jobs]
so in most cases it's a win.

BTW this business of address hashing has consequence people don't think of.
M3 Pro has THREE address controllers and SRAM block (as opposed to the four of M1 and M2 Pro).
OK, that's fine, but how do you spread addresses across these three (apart from the naive and dumb method of splitting the physical address range into three)???
You need address hashing circuits that are more sophisticated than just rearranging address lines, as described in httphttps://patents.google.com/patent/US20240104017A1

name99 · Jun 20, 2024

testato said:
the miracle of ios 18 on neural engine ,hoping for the same results on M4 ne with sequoia

iPhone 15 Pro vs iPhone 15 Pro - Geekbench

browser.geekbench.com

Of course those are very nice boosts

,
but do you know for sure that the boosted benchmarks (ie the Transformer text based ones) are running on ANE?
The 32bit GB6 benchmarks always run on GPU and in the past I think the same was true for all the text benchmarks.

If you run restricting GB6 to GPU+CPU (no ANE) do you get different results?

My guess is what we are seeing is combination of the stateful support in CoreML (so more efficient use/update of KV cache) and the fused dot product attention (which is an MPS optimization, not an ANE optimization). Plus of course various other tweaks and improvements.

Confused-User · Jun 20, 2024

name99 said:
Suppose you have eight DRAM chips, let's say each 1GB in size. (yes, yes, I'm simplifying tremendously to make a point)
[...]

Thanks for this fantastic explanation. I already knew most of this, but not all, and this tied it up really nicely, making the tradeoffs a lot clearer. I was looking at this as a straight tradeoff of latency vs bandwidth, but of course, as you say, the SLC factors in too, removing some of the latency penalty. Or, no, not removing it, but providing additional cache, which removes a different and bigger latency (to RAM).

And yet this still seems very nonobvious to me. (That's OK, I guess, some things can't be roughed out with intuition, they require tests or simulation.) For example, if you're doing a big encode job, all cores will be busy, and there I *think* they'd do better with better memory locality.

But nevermind that, I already know my intuition isn't good here. Instead this raises another question.

Why do any hashing on the chip? (ISTM that in any case, you don't need to hash, you could just bitshift the addresses, or at most do a simple permute... until you get to systems with non-PoT # of controllers, like the M3 Pro... oh well.) I mean, why not do all that work in the VM system? Is this a matter of performance wins from doing it in hardware? Or of being able to do the "dynamic" part of it when it's all done in HW? I would have thought the OS is better positioned to make smart decisions about this.

name99 · Jun 21, 2024

Confused-User said:
Thanks for this fantastic explanation. I already knew most of this, but not all, and this tied it up really nicely, making the tradeoffs a lot clearer. I was looking at this as a straight tradeoff of latency vs bandwidth, but of course, as you say, the SLC factors in too, removing some of the latency penalty. Or, no, not removing it, but providing additional cache, which removes a different and bigger latency (to RAM).

And yet this still seems very nonobvious to me. (That's OK, I guess, some things can't be roughed out with intuition, they require tests or simulation.) For example, if you're doing a big encode job, all cores will be busy, and there I *think* they'd do better with better memory locality.

But nevermind that, I already know my intuition isn't good here. Instead this raises another question.

Why do any hashing on the chip? (ISTM that in any case, you don't need to hash, you could just bitshift the addresses, or at most do a simple permute... until you get to systems with non-PoT # of controllers, like the M3 Pro... oh well.) I mean, why not do all that work in the VM system? Is this a matter of performance wins from doing it in hardware? Or of being able to do the "dynamic" part of it when it's all done in HW? I would have thought the OS is better positioned to make smart decisions about this.

Take a look at

https://lca.ece.utexas.edu/pubs/MICRO44_Dimitris_Kaseridis.pdf

(Fig 5 shows what you are interested in, some of the other elements you can ignore).

This shows the sorts of bit rearrangements that are of interest. And why.
These are done at the memory controller level, but you want something similar at the SoC level to decide where to route (ie which SLC) a particular address.

BTW in this context "hashing" refers to ANY rearranging of address bits. Shifts and permutes ARE the common cases.
In this context we're rarely using a "strong" hash, ie the sorts of hashes used by SW hash tables, but a weak hash is till a hash!
The functionality in each case IS the same as a hash table, you just sometimes have to squint a little.
eg for an L1 cache you generate a hash which defines the set, then you look at each of the buckets (called ways) associated with the hash value...

hoodlum90 · Jun 22, 2024

It looks like the M4 has 2 additional TB controllers, allowing for additional ports or more bandwidth.

https://twitter.com/x/status/1803197266963931349

tenthousandthings · Jun 22, 2024

hoodlum90 said:
It looks like the M4 has 2 additional TB controllers, allowing for additional ports or more bandwidth.

What would Thunderbolt 5 support look like?

hoodlum90 · Jun 22, 2024

tenthousandthings said:
What would Thunderbolt 5 support look like?

I still don’t think that is enough bandwidth for TB5. Maybe it could be supported by the M4 Ultra, depending on how that chip is designed

Confused-User · Jun 22, 2024

hoodlum90 said:
I still don’t think that is enough bandwidth for TB5. Maybe it could be supported by the M4 Ultra, depending on how that chip is designed

Say what? This has nothing to do with bandwidth, but, if it has the BW for four TB3/USB4 controllers, then it has the bandwidth for two TB5 controllers. But this is still a silly conversation. If they want more bandwidth, they will... just do that.

The extra controllers are probably there to allow the low-end 14" Pro to offer the same ports as the 14" with a Pro or Max chip. Currently that's not the case. If there really is a third display controller, this offers more flexibility in using it, too.

FWIW, I strongly doubt that the controller is TB5 capable. Seems way too early, and a mismatch for the chip's target audience. I can vaguely imagine them putting TB5 on the M4 Pro and Max but that would be surprising too. I'd expect to see it in the M5... maybe along with an 8K display.

Christopher Pelham · Jun 29, 2024

There is already a Razer laptop available for sale with a TB5 port. IMO it wouldn't be shocking to see Apple include it with its M4 computers and release a 120MHz 5K Apple Studio Display and perhaps a higher end 8K monitor. Remember, these are not supposed to come out until late in the year or early next year.

EugW · Jun 29, 2024

Christopher Pelham said:
There is already a Razer laptop available for sale with a TB5 port. IMO it wouldn't be shocking to see Apple include it with its M4 computers and release a 120MHz 5K Apple Studio Display and perhaps a higher end 8K monitor. Remember, these are not supposed to come out until late in the year or early next year.

Well, it won't be with all M4 computers. I doubt the M4 SoC was designed with TB5 in mind. They doubled the number of Thunderbolt controllers in M4 vs. M1/M2/M3, but I suspect the M4 controllers are still TB4. One step at a time...

If any of the Macs do get TB5 this iteration, it'd likely be one of the higher tier models using M4 Max/M4 Ultra, perhaps using a separate TB5 controller.

Antony Newman · Jun 29, 2024

The Four x TB ports on the M4 die shot look almost identical to the Two x TB4 ports in the M3.

Apple‘s 3m (9.8ft) TB4 cables are already eye wateringly expensive.

Perhaps the M5 chip in (2nd or 4th Q) 2025 will end up with TB5?

Christopher Pelham · Jun 29, 2024

Antony Newman said:
The Four x TB ports on the M4 die shot look almost identical to the Two x TB4 ports in the M3.

Apple‘s 3m (9.8ft) TB4 cables are already eye wateringly expensive.

Perhaps the M5 chip in (2nd or 4th Q) 2025 will end up with TB5?

Ah I see. Is this die shot of the iPad Pro M4? And is it thought that the M4 that goes into the iPad Pro is likely to be the very same one that will go into the rest of the lineup later?

I ask about this in part because I am interested in getting an Apple Studio Display. People always say get what you need when you need it but in this case, it would be an upgrade and is not a must have, and if a better one may come out in a few months I'll wait. But if it likely won't come before the M5 desktops are out, then I might as well grab the current one....

EugW · Jun 29, 2024

Antony Newman said:
The Four x TB ports on the M4 die shot look almost identical to the Two x TB4 ports in the M3.

Can you really tell that from a die shot?

Anyhow, I find it interesting that each Thunderbolt controller is bigger than two efficiency cores. Eyeballing it, it looks like the four M4 Thunderbolt controllers are equivalent in size to about nine M4 efficiency cores. That's a decent amount of real estate so obviously Apple deems them of significant importance even in this entry level chip.

Christopher Pelham said:
Ah I see. Is this die shot of the iPad Pro M4? And is it thought that the M4 that goes into the iPad Pro is likely to be the very same one that will go into the rest of the lineup later?

Yes and yes.

M4+ Chip Generation - Speculation Megathread [MERGED]

macrumors 68030

macrumors Core

macrumors 6502a

macrumors 6502a

macrumors Core

macrumors 68040

macrumors 68030

macrumors 68030

macrumors 6502a

macrumors member

macrumors 603

macrumors 6502a

macrumors 68030

macrumors 68030

macrumors 6502a

macrumors 68030

macrumors regular

Contributor

macrumors regular

macrumors 6502a

macrumors newbie

macrumors P6

macrumors member

Attachments

macrumors newbie

macrumors P6

Our Staff