
diamond.g

macrumors G4
Mar 20, 2007
11,438
2,665
OBX
That would be the obvious conclusion, no?
I guess they have increased CPU clocks instead of adding (P) cores. I just think it is interesting that adding more GPU cores is cheaper than increasing clocks. Does Apple have differing clock domains, like AMD does for the frontend and shaders in their GPUs, or is it clocked monolithically like Nvidia (who used to have different frontend/shader clock domains)?
 

MRMSFC

macrumors 6502
Jul 6, 2023
371
381
I guess they have increased CPU clocks instead of adding (P) cores. I just think it is interesting that adding more GPU cores is cheaper than increasing clocks. Does Apple have differing clock domains, like AMD does for the frontend and shaders in their GPUs, or is it clocked monolithically like Nvidia (who used to have different frontend/shader clock domains)?
Perhaps the GPU cores scale better with count than with frequency? They were initially designed for mobile power consumption.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,678
I guess they have increased CPU clocks instead of adding (P) cores. I just think it is interesting that adding more GPU cores is cheaper than increasing clocks.

Graphical work is usually embarrassingly parallel, so it scales very well with an increasing number of processors. With CPUs you don't get the same kind of effect: adding more CPU cores will not accelerate basic workloads (you need your programs to be multithreaded for that).

It is entirely possible that six lower-clocked GPU cores use less energy than five higher-clocked ones, and that this is Apple's way of managing power consumption.
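As a rough illustration (made-up numbers, not anything Apple publishes): dynamic power scales roughly as cores × V² × f, and the voltage needed rises with clock, so a wider, slower GPU can match the throughput (≈ cores × clock) of a narrower, faster one at lower power. A minimal sketch of that arithmetic:

```swift
// Crude dynamic-power model: P ∝ cores × V² × f, with V rising with f.
// All constants below are illustrative assumptions, not Apple figures.
func relativePower(cores: Double, clockGHz: Double,
                   baseVolts: Double = 0.55, voltsPerGHz: Double = 0.25) -> Double {
    let v = baseVolts + voltsPerGHz * clockGHz
    return cores * v * v * clockGHz
}

func relativeThroughput(cores: Double, clockGHz: Double) -> Double {
    return cores * clockGHz
}

let narrowFast = (cores: 5.0, clockGHz: 1.56)   // hypothetical higher-clocked config
let wideSlow   = (cores: 6.0, clockGHz: 1.30)   // hypothetical wider, lower-clocked config

print(relativeThroughput(cores: narrowFast.cores, clockGHz: narrowFast.clockGHz)) // ≈ 7.8
print(relativeThroughput(cores: wideSlow.cores, clockGHz: wideSlow.clockGHz))     // ≈ 7.8
print(relativePower(cores: narrowFast.cores, clockGHz: narrowFast.clockGHz))      // ≈ 6.9
print(relativePower(cores: wideSlow.cores, clockGHz: wideSlow.clockGHz))          // ≈ 6.0
```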

Does Apple have differing clock domains, like AMD does for the frontend and shaders in their GPUs, or is it clocked monolithically like Nvidia (who used to have different frontend/shader clock domains)?

We don't know. Personally, I'd be surprised if the rasteriser and the compute complex ran on the same clock; it just doesn't make much sense.
 

caribbeanblue

macrumors regular
May 14, 2020
138
132
- 16 core NPU. It's not widely remarked upon how (very unlike the GPU or CPU) the NPU core count remains the same from A to Max. Kinda strange that, and not sure what it means. Maybe that, for now anyway, Apple see this as mainly about facilitating language UI, and there's only one person talking whether you have an iPhone or a Mac Studio?
The M1 Max seemed to have a 2nd ANE on the bottom left that wasn't utilized, and it was seemingly omitted from the M2 Max. So maybe at first they were planning to scale the ANE by 2x for the Max and Ultra variants, but late in the M1 Max design process decided to use that die area for other features in later generations?
 


name99

macrumors 68020
Jun 21, 2004
2,410
2,322
The M1 Max seemed to have a 2nd ANE on the bottom left that wasn't utilized, and it was seemingly omitted from the M2 Max. So maybe at first they were planning to scale the ANE by 2x for the Max and Ultra variants, but late in the M1 Max design process decided to use that die area for other features in later generations?
Following up on all this, it's remarkable how quickly we all settle into an idea that none of us imagined 3 years ago (Pro/Max/Ultra design) and start to assume it will be THE design going forward always...

Consider the following alternative.
- Give the Pro UltraFusion connector(s). For the sake of numbers, let's say this Pro has 12+4 cores, two SLC+memory controllers, and 20 GPU cores (and NPU, media, etc.)

- No Max exists. Instead, the money that would have been used on Max masks is used to create a "GPU" chip of essentially the same size; so it has, let's say, two SLC+memory controllers on it and sixty GPU cores.

Now multiple mix-and-match options exist. At the low end, just a standard Pro. In the mid-range, if I mainly want CPU, I can join two Pros together. If I mainly want GPU, I can join a Pro and a GPU. At the high end, I can get anything from four Pros to a Pro and three GPUs.
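To make the combinatorics concrete, here is a small sketch that just enumerates those hypothetical packages, using the made-up per-die figures above (these are not real products):

```swift
// Hypothetical chiplets from the post above: a "Pro" die (12P+4E CPU, 20 GPU
// cores, 2 SLC+memory controllers) and a "GPU" die (60 GPU cores, 2 SLC+MCs).
struct Chiplet {
    let pCores: Int, eCores: Int, gpuCores: Int, memControllers: Int
}

let pro = Chiplet(pCores: 12, eCores: 4, gpuCores: 20, memControllers: 2)
let gpuDie = Chiplet(pCores: 0, eCores: 0, gpuCores: 60, memControllers: 2)

// Every package of up to four dies that contains at least one Pro.
for proCount in 1...4 {
    for gpuCount in 0...(4 - proCount) {
        let dies = Array(repeating: pro, count: proCount)
                 + Array(repeating: gpuDie, count: gpuCount)
        let p = dies.reduce(0) { $0 + $1.pCores }
        let e = dies.reduce(0) { $0 + $1.eCores }
        let g = dies.reduce(0) { $0 + $1.gpuCores }
        let m = dies.reduce(0) { $0 + $1.memControllers }
        print("\(proCount)×Pro + \(gpuCount)×GPU: \(p)P+\(e)E CPU, \(g) GPU cores, \(m) memory controllers")
    }
}
```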

In terms of CONFIGURATION, this gets us to something like the world of a nice CPU+iGPU along with an eGPU, only with the Apple advantages of common memory, memory bandwidth scaling as required, and the GPU hardware all presenting as a single GPU. Or to put it differently, this is presumably where Intel is kinda sorta hoping to be headed with MTL. Hell, I'm not fussy. If GPUs are such that they don't take serious advantage of N3 (don't run at the highest frequencies, will not be in mobile designs anyway), make the Apple GPU chiplet on N4 and be even more Intel/MTL-like.

You can raise the boring obvious complaint about geometry (how do you lay this out? where does the memory go?) but honestly that's just not interesting! There are at least two alternatives. One is on the back of the package (as is done with the A16, and maybe the A17; I haven't seen any photos); the other is via the memory-spine design I have described on a few occasions.

My guess is that as of M3 there's still basic work being done on scaling, enough so that it will match what we expect though (probably?) with an Extreme. But for M4, once we have all this plumbing working correctly, I think there are real advantages to the alternative design I am suggesting.
 

Boil

macrumors 68040
Oct 23, 2018
3,478
3,173
Stargate Command
A GPU-specific SoC you say...? Sounds oddly familiar...

(...I know I am not the only one to think along the lines of a GPU-specific SoC, but I have been going on about it for a while now...)
 

Confused-User

macrumors 6502a
Oct 14, 2014
854
988
Now multiple mix-and-match options exist. At the low end, just a standard Pro. In the mid-range, if I mainly want CPU, I can join two Pros together. If I mainly want GPU, I can join a Pro and a GPU. At the high end, I can get anything from four Pros to a Pro and three GPUs.
I think we've talked about this before.

I wonder if this would actually work. You also say:
You can raise the boring obvious complaint about geometry (how do you lay this out? where does the memory go?) but honestly that's just not interesting! There are at least two alternatives. One is on the back of the package (as is done with the A16, and maybe the A17; I haven't seen any photos); the other is via the memory-spine design I have described on a few occasions.
Are you saying that instead of shoreline they'd use the back of the package for memory busses? I imagine that would make BSPD harder once they get there in a couple more nodes. But regardless, how do you distribute all those memory busses across the chiplets without spending a ton of energy (and delay) shipping all those bits constantly between the chiplets?

I imagine it would come down to how local memory can be for GPU work patterns, and I don't know enough about GPU code to have an opinion.

The memory-spine thing is fascinating. I don't think it fully eliminates the problems of possible nonlocality though, from a performance/efficiency standpoint at least (it obviates NUMA issues, if I remember right).
 

caribbeanblue

macrumors regular
May 14, 2020
138
132
Maybe you're right. Notebookcheck mentioned this behavior in an article, and this Reddit post said the same; I should have verified it further. It seemed plausible to me because 4 E cores should have (roughly) a multicore performance score in that ballpark.
No, I have to verify my hypothesis as well; yours sounds just as plausible as mine, if not more so. I'll comment again if I have any more evidence. FWIW, I saw Geekerwan say the A17 changes the low power mode strategy, turning off the P-cores, which, if true, might explain the much lower power/performance point for low power mode in their charts compared to older chips.
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,322
I think we've talked about this before.

I wonder if this would actually work. You also say:

Are you saying that instead of shoreline they'd use the back of the package for memory busses? I imagine that would make BSPD harder once they get there in a couple more nodes. But regardless, how do you distribute all those memory busses across the chiplets without spending a ton of energy (and delay) shipping all those bits constantly between the chiplets?

I imagine it would come down to how local memory can be for GPU work patterns, and I don't know enough about GPU code to have an opinion.

The memory-spine thing is fascinating. I don't think it fully eliminates the problems of possible nonlocality though, from a performance/efficiency standpoint at least (it obviates NUMA issues, if I remember right).
(a) I referred to the A16. Don't you think I made that reference for a REASON?
It's certainly possible that mounting memory this way results in lower energy transfers than mounting it on the side. The details would depend on things like the substrate (unclear whether it is silicon, "real" glass, standard thin ABF, or something else altogether) and the capacitance of the TSV's. But the fact that it was adopted for A16 certainly suggests that it both works (technology is reliable) and that it is low power.

BSPD only becomes an issue if you mount the memory on the BSPD side. There are two sides, so you don't have to do that...

(b) This sort of design is targeted at the Studio and higher. So while efficiency is nice, it is not the ONLY consideration. If you can get Nvidia performance at a third of Nvidia power, you're already way ahead.
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,322
Now THIS is interesting!
Check out https://patents.google.com/patent/US20230298996A1 and https://patents.google.com/patent/US20230299001A1 and https://patents.google.com/patent/US20230299068A1
All three are patents related to backside metal.

I've mentioned before that the lousy scaling for N3, especially for SRAM, comes as absolutely no surprise, because the limiting variable (especially for SRAM) is wiring density, not transistor size. And BSPD is hoped to fix that by moving power delivery (and hence that wiring) to the backside, relieving topside metal congestion.
The next step after BSPD, as suggested by IMEC, is backside clock delivery, moving more wires to the backside.

However these patents are substantially more ambitious.
The middle one simply talks about using both the top side and the backside for power delivery. Worthy but not exciting.
However, the first and third ones talk about using the backside for genuine logic/control signals!

It's unclear what the expected timeframe would be. Obviously this means Apple is thinking about how to design in the presence of backside metal layers; but this may also mean that TSMC's internal roadmap is substantially more ambitious than what they've said publicly.
Publicly, all they've announced is what looks like a somewhat cautious version of power delivery, arriving maybe two years after Intel claims (...) it will deliver BSPD. Maybe the reason for the later delivery is that what they are implementing is substantially more ambitious: not just the basic hacks (as described by IMEC) to get BSPD, but a full process stack allowing for more or less full double-sided metal layers?
 

Confused-User

macrumors 6502a
Oct 14, 2014
854
988
(a) I referred to the A16. Don't you think I made that reference for a REASON?
It's certainly possible that mounting memory this way results in lower energy transfers than mounting it on the side. The details would depend on things like the substrate (unclear whether it is silicon, "real" glass, standard thin ABF, or something else altogether) and the capacitance of the TSV's. But the fact that it was adopted for A16 certainly suggests that it both works (technology is reliable) and that it is low power.

BSPD only becomes an issue if you mount the memory on the BSPD side. There are two sides, so you don't have to do that...
Right, but I was thinking more about nonlocality. Basically the design you're describing (1 Mx + 3 GPU chiplets) is similar to the Zen 1 generation epyc chips - each chiplet has some memory busses, and that gives you NUMA. AMD chose to move memory to the I/O chiplet in succeeding generations to avoid the problems that presented. Sure, Apple may be smarter, but those definitely are issues that would need to be addressed. And what I was asking was, if you had that NUMA architecture with GPU chiplets, would that likely be a bigger or smaller issue than NUMA is with CPU chiplets? That is, is the GPU workload more or less sensitive to NUMA? I have no feel for that at all.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,678
I see, can your program see the number of ALUs and the amount of RAM???

Apple APIs tell you the amount of RAM (if you trust them). As to ALUs… the phone/Mac GPU cores have 128 ALUs each, so it’s easy to calculate there, but I don’t have any code to check for it. I’d need to think about that…
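For the RAM side, here is a minimal Swift sketch of what the public API exposes; the ALU count is not queryable, so the core count and the 128-ALUs-per-core figure below are assumptions you have to supply yourself:

```swift
import Foundation
import Metal

// What Apple's public APIs will tell you about GPU memory. There is no
// public API for ALU or GPU core counts; those are assumed by hand below.
if let device = MTLCreateSystemDefaultDevice() {
    print("GPU: \(device.name)")
    // Unified memory, so total system RAM is the relevant pool.
    print("System RAM: \(ProcessInfo.processInfo.physicalMemory / 1_048_576) MB")
    // Metal's hint for how much memory the GPU should keep resident
    // (available on macOS and recent iOS releases).
    print("Recommended working set: \(device.recommendedMaxWorkingSetSize / 1_048_576) MB")

    let assumedGPUCores = 10        // not queryable; read off the spec sheet
    let assumedALUsPerCore = 128    // the figure quoted above, assumed here
    print("Estimated ALUs: \(assumedGPUCores * assumedALUsPerCore)")
}
```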
 
  • Like
Reactions: TigeRick

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,665
OBX
Right, but I was thinking more about nonlocality. Basically the design you're describing (1 Mx + 3 GPU chiplets) is similar to the Zen 1 generation epyc chips - each chiplet has some memory busses, and that gives you NUMA. AMD chose to move memory to the I/O chiplet in succeeding generations to avoid the problems that presented. Sure, Apple may be smarter, but those definitely are issues that would need to be addressed. And what I was asking was, if you had that NUMA architecture with GPU chiplets, would that likely be a bigger or smaller issue than NUMA is with CPU chiplets? That is, is the GPU workload more or less sensitive to NUMA? I have no feel for that at all.
I feel like I am talking out of turn, but for GPU rasterization they tend to care more about bandwidth than latency. For GPGPU, I've read that whether it is more latency sensitive depends on the workload.

Now, for Apple with TBDR, I think it is less bandwidth sensitive, but that may only apply to rasterization.

I also assume that if the local (L1-L3) caches are large enough, they can blunt some of the latency/bandwidth requirements, again depending on the dataset.
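One way to make "depends on the workload" concrete is a quick roofline-style check (the peak numbers below are illustrative, roughly M2 Max class, not exact specs): a kernel doing few FLOPs per byte moved is bandwidth-bound, which is where TBDR and big caches help by cutting off-chip traffic, while one that reuses data heavily is compute-bound.

```swift
// Back-of-the-envelope roofline check. Peak figures are illustrative
// assumptions, roughly M2 Max class; they are not exact specifications.
let peakFlops: Double = 13.6e12        // FP32 FLOP/s
let peakBandwidth: Double = 400e9      // bytes/s of memory bandwidth
let machineBalance = peakFlops / peakBandwidth   // ≈ 34 FLOPs per byte

// A kernel whose arithmetic intensity (FLOPs per byte moved) sits below
// the machine balance is limited by bandwidth, not by ALU throughput.
func isBandwidthBound(flops: Double, bytesMoved: Double) -> Bool {
    return flops / bytesMoved < machineBalance
}

// Streaming pass (SAXPY-like): about 2 FLOPs per 12 bytes touched.
print(isBandwidthBound(flops: 2, bytesMoved: 12))    // true  -> bandwidth-bound
// Blocked matrix multiply with heavy reuse: say 200 FLOPs per byte.
print(isBandwidthBound(flops: 200, bytesMoved: 1))   // false -> compute-bound
```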
 

Gudi

Suspended
May 3, 2013
4,590
3,267
Berlin, Berlin
Apple are willing to play along with this [gamer] fantasy to the extent that it doesn't get in the way of real customers, but no more than that. Which means they care a LOT about making sure the A17 works well when playing Candy Crush. But they are not going to increase weight, cost, or anything else, PURELY to make Genshin Impact behave differently.
They just did with hardware ray tracing on the A17 Pro. It's a significant multiyear investment, and it burns extra battery life compared to simply deactivating a graphics feature, which is just pure unnecessary eye candy for triple-A games. Apple is all about advancing their own platform in ways no one else can. Next to programming their own OS, designing their own chips is a major vector for achieving this uniqueness. And a gaming GPU is just one subplot in this chip design endeavour. It doesn't even need to be that useful on a phone. The same GPU technology will go into iPads, MacBooks, Apple TVs and iMacs, where eventually it will be properly cooled. It doesn't matter whether console gaming on a phone stays a fantasy for now. Maybe it will help push iPad sales, maybe it will take off after the next die shrink. The groundwork has been laid for others to build upon.
 
  • Like
Reactions: tenthousandthings

leman

macrumors Core
Oct 14, 2008
19,521
19,678
They just did with hardware ray tracing on the A17 Pro. It's a significant multiyear investment, and it burns extra battery life compared to simply deactivating a graphics feature, which is just pure unnecessary eye candy for triple-A games. Apple is all about advancing their own platform in ways no one else can. Next to programming their own OS, designing their own chips is a major vector for achieving this uniqueness. And a gaming GPU is just one subplot in this chip design endeavour. It doesn't even need to be that useful on a phone. The same GPU technology will go into iPads, MacBooks, Apple TVs and iMacs, where eventually it will be properly cooled. It doesn't matter whether console gaming on a phone stays a fantasy for now. Maybe it will help push iPad sales, maybe it will take off after the next die shrink. The groundwork has been laid for others to build upon.

The fun thing about RT is that it sends a signal that Apple is serious about games while at the same time improving their value proposition in a much more tangible domain: professional rendering. There are a lot of creative users who already gravitate towards a Mac; if Apple can reach at least 60-70% of Nvidia's performance here, they could improve their footing in this market.
 

PgR7

Cancelled
Sep 24, 2023
45
13
They just did with hardware ray tracing on the A17 Pro. It's a significant multiyear investment, and it burns extra battery life compared to simply deactivating a graphics feature, which is just pure unnecessary eye candy for triple-A games. Apple is all about advancing their own platform in ways no one else can. Next to programming their own OS, designing their own chips is a major vector for achieving this uniqueness. And a gaming GPU is just one subplot in this chip design endeavour. It doesn't even need to be that useful on a phone. The same GPU technology will go into iPads, MacBooks, Apple TVs and iMacs, where eventually it will be properly cooled. It doesn't matter whether console gaming on a phone stays a fantasy for now. Maybe it will help push iPad sales, maybe it will take off after the next die shrink. The groundwork has been laid for others to build upon.
You can use that for DLSS, so "free" performance, and if Nvidia is going that route it's because it's more efficient.
So...
  • M3 Max/Ultra/Extreme
  • High Power Mode
  • CPU @ 4.5GHz
  • GPU @ 2.0GHz
...???
4.5 GHz is too big of a jump; I would say 4.2 max, probably 4.1. The current M2 Max/Ultra is at 3.7.
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,322
They just did with hardware ray tracing on the A17 Pro. It's a significant multiyear investment, and it burns extra battery life compared to simply deactivating a graphics feature, which is just pure unnecessary eye candy for triple-A games. Apple is all about advancing their own platform in ways no one else can. Next to programming their own OS, designing their own chips is a major vector for achieving this uniqueness. And a gaming GPU is just one subplot in this chip design endeavour. It doesn't even need to be that useful on a phone. The same GPU technology will go into iPads, MacBooks, Apple TVs and iMacs, where eventually it will be properly cooled. It doesn't matter whether console gaming on a phone stays a fantasy for now. Maybe it will help push iPad sales, maybe it will take off after the next die shrink. The groundwork has been laid for others to build upon.
The fact that you think ray tracing exists as "pure unnecessary eye candy for triple-A games" speaks volumes about how little you know of where Apple is going.

Ray tracing is an essential element of a variety of graphics algorithms that are relevant to AR, for example casting realistic shadows... I suspect that is vastly more important in Apple's calculations than game eye-candy, which is just a minor side benefit.
 

Confused-User

macrumors 6502a
Oct 14, 2014
854
988
Ray tracing is an essential element of a variety of graphics algorithms that are relevant to AR, for example casting realistic shadows... I suspect that is vastly more important in Apple's calculations than game eye-candy, which is just a minor side benefit.
I think you're *probably* right about this. But there are a few counter-indications that Apple may finally be thinking about someday, maybe, getting serious about gaming. I wouldn't bet on it, but it no longer seems entirely insane to think so.
 