With all the unenlightening back and forth about whether the A17 really is faster (no it isn't; no it is, but only at the cost of power; well, that's only realistically in short bursts; blah blah), an obvious data source is being omitted!
What do the numbers look like for an A17 running either JetStream 2.1 or Speedometer 2.0?
Browser benchmarks have the great advantages that
(a) they are easy to run, by anyone
(b) they correspond to a workload we KNOW Apple really cares about for the phone (as opposed to both GB and SPEC which each in their own way don't really match how people use phones)
(c) they will stress code that's written and compiled by Apple to be a best fit for the A17. If there is functionality in the A17 that increases IPC but requires special, or at least different, code (either generated by the compiler or otherwise exploited, e.g. in the JIT), we will see it there in a way we won't in GB6 (not yet recompiled for the A17) or SPEC (who knows how that's been compiled by the sites that claim to be running it on an iPhone?)
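On point (c): a compiler or JIT can only exploit new core features it can detect and target. Purely as illustration, here is a minimal Swift sketch of runtime feature detection on Apple platforms via the documented hw.optional.arm sysctl keys (which keys exist varies by OS release; the three below are just examples, and this is not how Apple's own JIT necessarily does it):

```swift
import Darwin

/// Returns true if the given `hw.optional.arm.*` sysctl reports 1.
/// Missing keys (older OS, different SoC) are treated as "feature not present".
func hasFeature(_ key: String) -> Bool {
    var value: Int32 = 0
    var size = MemoryLayout<Int32>.size
    let ok = sysctlbyname(key, &value, &size, nil, 0)
    return ok == 0 && value == 1
}

// A few ISA extensions a compiler/JIT could dispatch on (illustrative subset only).
let features = [
    "hw.optional.arm.FEAT_LSE",    // large-system atomics
    "hw.optional.arm.FEAT_BF16",   // bfloat16 arithmetic
    "hw.optional.arm.FEAT_I8MM",   // int8 matrix multiply
]

for f in features {
    print(f, hasFeature(f) ? "yes" : "no")
}
```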
So how about it, people? If you own an iPhone 15 Pro/Max (or, for that matter, something with an A16 or A15, AND you are running the latest iOS so that Safari is up to date!) go to
https://browserbench.org/JetStream/ and/or the Speedometer benchmark (also hosted at browserbench.org)
and give us your numbers (trying to be as honest as possible; e.g., don't run the test while you are also doing something else in the background!)
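Scores bounce around a bit run to run (thermals, background work), so if you post numbers it helps to do a few runs and report the median and spread. A trivial sketch, with made-up scores, of how I'd aggregate them:

```swift
// Hypothetical overall scores from, say, three JetStream 2.1 runs on the same device.
let runs: [Double] = [301.2, 296.8, 304.5]

let sorted = runs.sorted()
let median = sorted.count % 2 == 1
    ? sorted[sorted.count / 2]
    : (sorted[sorted.count / 2 - 1] + sorted[sorted.count / 2]) / 2
let mean = runs.reduce(0, +) / Double(runs.count)
let spread = sorted.last! - sorted.first!

print("median:", median, "mean:", mean, "spread:", spread)
```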
Latest patent findings show a possible four-way SoC package...?

In general, patent filings can be a mixed bag. Several companies have used patent filings to limit competitors, or to incentivize employees to innovate, without any specific intention of developing those ideas into products. Just because there is a patent does not mean the product is actively being developed.
A few months ago, I predicted that by the year 2030 or so, the SoC would just be a giant neural engine with a CPU attached to it.
Big core area reduced by 20%, small core area reduced by 20%. GPU single-core area increased by 5%, and GPU overall area increased by 20%.
I think we're going to start to see a disproportionate number of transistors devoted to the neural engine starting in 2025. That's the year Apple designs will have had time to respond to the LLM/AI boom that started in late 2022.
View attachment 2281675
Latest patent findings show a possible four-way SoC package...?
Why not make the central interconnect die a central SLC? UltraFusion seems to be all about cache pseudo-coherency. Furthermore, fast DRAM could be attached to the back side of that die for a massive amount of slow cache (but much faster and more efficient than going through a controller and transimpedance amplifiers to go off package to LPDDR).

That might make for a good fit for something that wanted to take on a subset of what MI300 covers. However, as a '4090 killer' ... it looks to have gobs of NUMA issues that won't provision a unified, single GPU out of that. You are separating the four "not-so-good chiplets" by about as much space as possible while still having them on the same package (the networking rectangle in the middle is filled with just multiple layers of 'wires'; that is basically 10x farther apart than in a dual UltraFusion set-up, and that is going to surface as quirks at the application software level).
And affordable? ... very probably not.
Get rid of that huge empty void in the middle and run multiple layers so that you had "express lanes" from top and bottom, similar to the vertical blue lines here:
with multiple layers (top/bottom express lanes on one layer, the bridge between chiplets on another). That would not leave a large 'hole' in the middle.
The patent is more about the layers of networking than the 'iron cross' of dies.
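To put a rough number on the "farther apart" point, here is a toy geometry sketch. Every dimension in it is an assumption chosen only to show where a figure of that order could come from, not a measurement of any real package or of the patent drawing:

```swift
// Toy geometry only -- all dimensions are assumptions, not measurements.
// Compare the edge-to-edge separation of two compute dies in a dual
// (UltraFusion-style) layout vs. two dies on opposite sides of a central
// interconnect die that is roughly the same size as the compute dies.

let computeDieSide = 20.0     // mm, ~400 mm^2 compute chiplet (assumed)
let ultraFusionGap = 2.0      // mm, near-abutted dies over a silicon bridge (assumed)
let centralDieSide = 20.0     // mm, central "networking rectangle" (assumed)

// Dual layout: a signal crosses only the bridge gap.
let dualSeparation = ultraFusionGap

// Iron-cross layout: a signal between opposite chiplets crosses the whole central die.
let crossSeparation = centralDieSide

print("dual UltraFusion separation ~\(dualSeparation) mm")
print("opposite chiplets via central die ~\(crossSeparation) mm")
print("ratio ~\(Int(crossSeparation / dualSeparation))x")   // ~10x with these assumptions
```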
Why not make the central interconnect die a central SLC? UltraFusion seems to be all about cache pseudo-coherency. Furthermore, fast DRAM could be attached to the back side of that die for a massive amount of slow cache (but much faster and more efficient than going through a controller and transimpedance amplifiers to go off package to LPDDR).
Probably not that simple, seeing that Apple's SLC blocks are associated with individual memory controllers. And it's not cache pseudo-coherency either (at least if we go by previous Ultras). A device will request the data from the appropriate memory controller/SLC block even if it means going to the other die. In other words, the SLC caches on different dies are not mirrored; the interconnect simply makes one big network out of everything.
I am not sure I share @deconstruct60's pessimism about NUMA issues. Yes, there will be latency costs, but Apple designs already have very high latency. Scheduling work for GPU is another issue of course, but Apple has some methods to approach this.
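To illustrate the point above about requests going to the owning memory controller/SLC block (possibly on the other die), here is a toy address-interleaved slice lookup. The constants and the hash are made up for illustration; Apple's actual interleaving scheme is not public:

```swift
// Toy model of address-interleaved SLC slices (NOT Apple's real hash).
// Each memory controller owns one SLC block; slices are spread across two dies.

let slicesPerDie = 8          // assumed
let dieCount = 2
let totalSlices = slicesPerDie * dieCount
let interleaveGranule = 256   // bytes per interleave step (assumed)

func sliceIndex(for address: UInt64) -> Int {
    // Simple round-robin interleave on the granule; real designs hash more bits.
    Int((address / UInt64(interleaveGranule)) % UInt64(totalSlices))
}

func owningDie(of slice: Int) -> Int { slice / slicesPerDie }

let addr: UInt64 = 0x1234_5600
let slice = sliceIndex(for: addr)
print("address 0x\(String(addr, radix: 16)) -> slice \(slice) on die \(owningDie(of: slice))")
// A requester on the other die goes over the interconnect to reach the owning slice;
// the slices form one big distributed cache, not two mirrored copies.
```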
By pseudo-coherency, I meant that multiple SLCs behave as a single cache, which is exactly what UltraFusion was designed for (as you described). I guess I have to mention the caveat that the M1 UltraFusion didn't meet those design criteria very well, hence the NUMA concerns. Perhaps in the incarnation I'm attempting to describe, one should call the per-die SLC a DLC (die-level cache), and the central die's SLC simply the system-level cache, with very fast DRAM backing (and with appropriate memory controllers). I generally don't want to get caught up in splitting hairs about definitions; I'm just trying to convey an architectural idea.
In general, patent filings can be a mixed bag. Several companies have used patent filings to limit competitors, or to incentivize employees to innovate, without any specific intention of developing those ideas into products. Just because there is a patent does not mean the product is actively being developed.

Maybe OTHER companies do that. I have seen very little evidence of this in Apple's technical patents. Most of the technical patents I've looked at …
So we’re talking 10-15% improvements for -20% real estate CPU wise? That’s very impressive.
What’s a bit of a disappointment is that we don’t seem to get much perf-to-space improvement on the GPU side. I guess the new RT features are taxing.
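For what those percentages imply in perf-per-area terms (taking the quoted rumor numbers entirely at face value), a quick back-of-the-envelope:

```swift
// Rough perf-per-area arithmetic using the figures quoted above (rumors, taken at face value).
let areaScale = 0.80            // CPU core area: -20%
let perfLow  = 1.10             // +10% performance
let perfHigh = 1.15             // +15% performance

print("CPU perf per unit area: +\(Int((perfLow / areaScale - 1) * 100))% to +\(Int((perfHigh / areaScale - 1) * 100))%")
// -> roughly +37% to +43%

// GPU: a single core ~5% larger; if per-core throughput were otherwise flat,
// perf per unit area per core would be slightly down (the RT hardware has to earn its keep).
let gpuAreaScale = 1.05
print("GPU perf per unit area (flat per-core perf assumed): \(Int((1 / gpuAreaScale - 1) * 100))%")
// -> about -4%
```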
That might make for a good fit for something that wanted to take on a subset of what MI300 covers. However, as a '4090 killer' ... it looks to have gobs of NUMA issues that won't provision a unified, single GPU out of that.
UltraFusion is all about making the internal network between two dies transparent. There is really nothing much to UltraFusion beyond transmitting the cache-coherency traffic over that network. UltraFusion's job is to ensure you do NOT have to do anything substantially different to handle coherency issues. Shipping data among the various subsystem units works the same across dies as it does on a single die.
Instead of being an approximately 800mm^2 single die, it is two approximately 400mm^2 dies. That's it. Primarily, it is more cost-effective to make.
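The cost argument is easy to sanity-check with a textbook defect-yield model. The defect density below is an assumed, illustrative figure, not a known N3-class number:

```swift
import Foundation

// Why "two ~400 mm^2 dies" beats "one ~800 mm^2 die" on cost:
// a simple Poisson defect model, yield = exp(-D * A).
// D (defects per cm^2) is assumed for illustration only.

let defectDensity = 0.1          // defects per cm^2 (assumed)

func yield(dieAreaMM2: Double) -> Double {
    exp(-defectDensity * dieAreaMM2 / 100.0)   // convert mm^2 -> cm^2
}

let big = yield(dieAreaMM2: 800)          // ~0.45
let small = yield(dieAreaMM2: 400)        // ~0.67

print("monolithic 800 mm^2 yield:", big)
print("two 400 mm^2 dies paired blind:", small * small)   // ~0.45, no better than monolithic
print("two 400 mm^2 dies, tested first:", small)          // known-good-die packaging is the actual win
```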
If the interconnection die is NOT 3D, then it is nowhere near as transparent. It introduces side effects because physics is physics; longer distances through more material bring issues. Note also that instead of taking one large area and slicing it in half (with no net increase in 2D footprint), UltraFusion does not introduce substantially more distance between the dies. If you create a 2D central die that is at least as big as, if not bigger than, the dies you are connecting, then the 2D footprint here is going way up.
Moving the 'SLC' farther away from the memory doesn't make much sense; it basically increases the round-trip time there as well. Part of the design trade-off here is that by bringing the memory packages much closer, you are trading off how many chiplets of that type you can bring into extremely close proximity to one another.
The AMD desktop/server approach is to make all CPU cores have uniformly slower access, so things stay uniform (the memory all hangs off the single I/O die, with a relatively much more limited set of memory lanes, i.e. lower concurrent bandwidth). AMD does a 180-degree turn on MI200/MI300/7900: the 'core' count is orders of magnitude higher, and the channel breadth to handle concurrent requests is much higher.
But it is plainly NOT in the diagrams. The problem with attaching all of the DRAM to a single die is that it is a single die. You are not going to get as many memory channels, because a single die of around the same size as the ones you are connecting isn't going to have more edge space than the aggregate of the others. Going off-die to RAM sucks up a disproportionately large amount of die space. The more you try to do in one place, the faster you run out of area.
This is why Apple is trying to spread the memory channels out across more dies: to get more concurrency. That comes in very handy when trying to feed hundreds and hundreds of cores in a timely fashion. This is mainly a GPU with some CPU cores and uncore stuff 'bolted on', not primarily a CPU with other stuff 'bolted on'.
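A toy "beachfront" comparison makes the edge-space argument concrete; die dimensions are assumed and squares are used for simplicity:

```swift
// Toy "beachfront" comparison (all dimensions assumed, square dies for simplicity).
// Off-package DRAM PHYs live along die edges, so usable edge length roughly
// bounds how many memory channels you can bring out.

let side = 20.0                        // mm, ~400 mm^2 die (assumed)
let perimeterPerDie = 4 * side         // 80 mm

let oneCentralDie = perimeterPerDie        // all DRAM hung off a single die
let fourComputeDies = 4 * perimeterPerDie  // channels spread across the chiplets

print("edge available on one central die: \(oneCentralDie) mm")
print("aggregate edge on four compute dies: \(fourComputeDies) mm")
// Even ignoring that part of each edge is consumed by die-to-die links,
// the aggregate beachfront (and hence concurrent channel count) is ~4x larger.
```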
I tested a bit with an Android phone, and I get better efficiency scores than this A17 Pro. Apple only goes for performance instead of efficiency; it is only when you activate low power mode that it beats my phone. I'm thinking iOS 17 is starting to be a heavy OS, IMO; that's why I hear that some XR/XS and 11 units are getting toasty with iOS 17 too.

If you can't even explain what you tested and what your "efficiency" score refers to, no one is going to take you seriously...
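For what it's worth, a minimally well-defined efficiency claim has to name the benchmark, the score, and the average power over the run. Something as simple as points-per-watt, with the (entirely hypothetical) numbers stated, would at least be checkable:

```swift
// What a minimally well-defined "efficiency" claim needs: a named benchmark score
// and an average device power over the run. The numbers below are hypothetical.

struct BenchmarkRun {
    let device: String
    let benchmark: String
    let score: Double        // in the benchmark's own units
    let avgPowerWatts: Double
}

func efficiency(_ r: BenchmarkRun) -> Double { r.score / r.avgPowerWatts }

let a = BenchmarkRun(device: "Phone A", benchmark: "Speedometer 2.0", score: 300, avgPowerWatts: 5.0)
let b = BenchmarkRun(device: "Phone B", benchmark: "Speedometer 2.0", score: 240, avgPowerWatts: 3.5)

for r in [a, b] {
    print("\(r.device): \(efficiency(r)) \(r.benchmark) points per watt")
}
```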
If the design already has high latency such that they had to put in mitigations to get around it ... cranking it even HIGHER probably makes those mitigations less effective. Piling even more Rube Goldberg mitigations on top of those mitigations has a decent chance of going sideways as well.
From the drawings, more like M3 Extreme...!

IF we take these numbers at face value (and that's always tricky; the numbers in patents are supposed to be "suggestive", not design specs), this implies the basically plausible:
- 64-core CPU (48P/16E)
- 160-core GPU (w/hardware ray-tracing)
- 64-core Neural Engine
- 960GB ECC LPDDR5X RAM (Sixteen 64GB chips)
- 2TB/s UMA bandwidth
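The 2TB/s figure is at least arithmetically plausible. Here's the back-of-the-envelope, using assumed (not leaked) numbers for per-die bus width and LPDDR5X data rate:

```swift
import Foundation

// Sanity-checking the "2 TB/s" figure (all inputs are assumptions, not specs).
// Bandwidth = data rate (MT/s) * bus width (bits) / 8, per the usual DDR arithmetic.

let mtPerSecond = 8533.0          // LPDDR5X-8533 (assumed)
let bitsPerDie = 512.0            // M3 Max-class memory interface per die (assumed)
let dieCount = 4.0                // four-way package per the patent drawing

let bytesPerSecond = mtPerSecond * 1e6 * bitsPerDie * dieCount / 8
print(String(format: "~%.2f TB/s", bytesPerSecond / 1e12))   // ~2.18 TB/s
```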