
diamond.g

macrumors G4
Mar 20, 2007
11,438
2,665
OBX
That would be the obvious conclusion, no?
I guess they have increased CPU clocks instead of adding (P) cores. I just think it is interesting that adding more GPU cores is cheaper than increasing clocks. Does Apple have differing clock domains, like AMD does for the frontend and shaders in their GPUs, or is it clocked monolithically like Nvidia (who used to have different frontend/shader clock domains)?
 

MRMSFC

macrumors 6502
Jul 6, 2023
371
381
I guess they have increased CPU clocks instead of adding (P) cores. I just think it is interesting that adding more GPU cores is cheaper than increasing clocks. Does Apple have differing clock domains, like AMD does for the frontend and shaders in their GPUs, or is it clocked monolithically like Nvidia (who used to have different frontend/shader clock domains)?
Perhaps the GPU cores scale better with count than with frequency? They were initially designed for mobile power consumption.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,678
I guess they have increased CPU clocks instead of adding (P) cores. I just think it is interesting that adding more GPU cores is cheaper than increasing clocks.

Graphical work is usually embarrassingly parallel, so it scales very well with an increasing number of processors. With CPUs you don't get the same kind of effect: adding more CPU cores will not accelerate basic workloads (you need your programs to be multithreaded for that).

It is entirely possible that six lower-clocked GPU cores use less energy than five higher-clocked ones, and that this is Apple's way of managing power consumption.
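As a rough illustration (made-up numbers, not anything Apple publishes): dynamic power scales roughly as cores × V² × f, and the voltage needed rises with clock, so a wider, slower GPU can match the throughput (≈ cores × clock) of a narrower, faster one at lower power. A minimal sketch of that arithmetic:

```swift
// Crude dynamic-power model: P ∝ cores × V² × f, with V rising with f.
// All constants below are illustrative assumptions, not Apple figures.
func relativePower(cores: Double, clockGHz: Double,
                   baseVolts: Double = 0.55, voltsPerGHz: Double = 0.25) -> Double {
    let v = baseVolts + voltsPerGHz * clockGHz
    return cores * v * v * clockGHz
}

func relativeThroughput(cores: Double, clockGHz: Double) -> Double {
    return cores * clockGHz
}

let narrowFast = (cores: 5.0, clockGHz: 1.56)   // hypothetical higher-clocked config
let wideSlow   = (cores: 6.0, clockGHz: 1.30)   // hypothetical wider, lower-clocked config

print(relativeThroughput(cores: narrowFast.cores, clockGHz: narrowFast.clockGHz)) // ≈ 7.8
print(relativeThroughput(cores: wideSlow.cores, clockGHz: wideSlow.clockGHz))     // ≈ 7.8
print(relativePower(cores: narrowFast.cores, clockGHz: narrowFast.clockGHz))      // ≈ 6.9
print(relativePower(cores: wideSlow.cores, clockGHz: wideSlow.clockGHz))          // ≈ 6.0
```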

Does Apple have differing clock domains, like AMD does for the frontend and shaders in their GPUs, or is it clocked monolithically like Nvidia (who used to have different frontend/shader clock domains)?

We don't know. Personally, I'd be surprised if the rasteriser and the compute complex ran on the same clock; it just doesn't make much sense.
 

caribbeanblue

macrumors regular
May 14, 2020
138
132
- 16 core NPU. It's not widely remarked upon how (very unlike the GPU or CPU) the NPU core count remains the same from A to Max. Kinda strange that, and not sure what it means. Maybe that, for now anyway, Apple see this as mainly about facilitating language UI, and there's only one person talking whether you have an iPhone or a Mac Studio?
The M1 Max seemed to have a 2nd ANE on the bottom left that wasn't utilized, and it was seemingly omitted from the M2 Max. So maybe at first they were planning to scale the ANE by 2x for the Max and Ultra variants, but late in the M1 Max design process decided to use that die area for other features in later generations?
 


name99

macrumors 68020
Jun 21, 2004
2,410
2,322
The M1 Max seemed to have a 2nd ANE on the bottom left that wasn't utilized, and it was seemingly omitted from the M2 Max. So maybe at first they were planning to scale the ANE by 2x for the Max and Ultra variants, but late in the M1 Max design process decided to use that die area for other features in later generations?
Following up on all this, it's remarkable how quickly we all settle into an idea that none of us imagined 3 years ago (Pro/Max/Ultra design) and start to assume it will be THE design going forward always...

Consider the following alternative.
- Give the Pro UltraFusion connector(s). For the sake of numbers, let's say this Pro has 12+4 cores, two SLC+memory controllers, and 20 GPU cores (and NPU, media, etc.)

- No Max exists. Instead, the money that would have been used on Max masks is used to create a "GPU" chip of essentially the same size; so it has, let's say, two SLC+memory controllers on it and sixty GPU cores.

Now multiple mix-and-match options exist. At the low end, just a standard Pro. In the mid-range, if I mainly want CPU, I can join two Pros together. If I mainly want GPU, I can join a Pro and a GPU. At the high end, I can get anything from four Pros to a Pro and three GPUs.
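To make the combinatorics concrete, here is a small sketch that just enumerates those hypothetical packages, using the made-up per-die figures above (these are not real products):

```swift
// Hypothetical chiplets from the post above: a "Pro" die (12P+4E CPU, 20 GPU
// cores, 2 SLC+memory controllers) and a "GPU" die (60 GPU cores, 2 SLC+MCs).
struct Chiplet {
    let pCores: Int, eCores: Int, gpuCores: Int, memControllers: Int
}

let pro = Chiplet(pCores: 12, eCores: 4, gpuCores: 20, memControllers: 2)
let gpuDie = Chiplet(pCores: 0, eCores: 0, gpuCores: 60, memControllers: 2)

// Every package of up to four dies that contains at least one Pro.
for proCount in 1...4 {
    for gpuCount in 0...(4 - proCount) {
        let dies = Array(repeating: pro, count: proCount)
                 + Array(repeating: gpuDie, count: gpuCount)
        let p = dies.reduce(0) { $0 + $1.pCores }
        let e = dies.reduce(0) { $0 + $1.eCores }
        let g = dies.reduce(0) { $0 + $1.gpuCores }
        let m = dies.reduce(0) { $0 + $1.memControllers }
        print("\(proCount)×Pro + \(gpuCount)×GPU: \(p)P+\(e)E CPU, \(g) GPU cores, \(m) memory controllers")
    }
}
```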

In terms of CONFIGURATION, this gets us to something like the world of a nice CPU+iGPU along with an eGPU, only with the Apple advantages of common memory, memory bandwidth scaling as required, and the GPU hardware all presenting as a single GPU. Or to put it differently, this is presumably where Intel is kinda sorta hoping to be headed with MTL. Hell, I'm not fussy. If GPUs are such that they don't take serious advantage of N3 (don't run at the highest frequencies, will not be in mobile designs anyway), make the Apple GPU chiplet on N4 and be even more Intel/MTL-like.

You can raise the boring obvious complaint about geometry (how do you lay this out? where does the memory go?) but honestly that's just not interesting! There are at least two alternatives. One is on the back of the package (as is done with the A16, and maybe the A17; I haven't seen any photos); the other is via the memory-spine design I have described on a few occasions.

My guess is that as of M3 there's still basic work being done on scaling, enough so that it will match what we expect though (probably?) with an Extreme. But for M4, once we have all this plumbing working correctly, I think there are real advantages to the alternative design I am suggesting.
 

Boil

macrumors 68040
Oct 23, 2018
3,478
3,173
Stargate Command
A GPU-specific SoC you say...? Sounds oddly familiar...

(...I know I am not the only one to think along the lines of a GPU-specific SoC, but I have been going on about it for a while now...)
 

Confused-User

macrumors 6502a
Oct 14, 2014
854
988
Now multiple mix-and-match options exist. At the low end, just a standard Pro. In the mid-range, if I mainly want CPU, I can join two Pros together. If I mainly want GPU, I can join a Pro and a GPU. At the high end, I can get anything from four Pros to a Pro and three GPUs.
I think we've talked about this before.

I wonder if this would actually work. You also say:
You can raise the boring obvious complaint about geometry (how do you lay this out? where does the memory go?) but honestly that's just not interesting! There are at least two alternatives. One is on the back of the package (as is done with the A16, and maybe the A17; I haven't seen any photos); the other is via the memory-spine design I have described on a few occasions.
Are you saying that instead of shoreline they'd use the back of the package for memory busses? I imagine that would make BSPD harder once they get there in a couple more nodes. But regardless, how do you distribute all those memory busses across the chiplets without spending a ton of energy (and delay) shipping all those bits constantly between the chiplets?

I imagine it would come down to how local memory can be for GPU work patterns, and I don't know enough about GPU code to have an opinion.

The memory-spine thing is fascinating. I don't think it fully eliminates the problems of possible nonlocality though, from a performance/efficiency standpoint at least (it obviates NUMA issues, if I remember right).
 

caribbeanblue

macrumors regular
May 14, 2020
138
132
Maybe you're right. Notebookcheck mentioned this behavior in an article, and this Reddit post said the same; I should have verified it further. It seemed plausible to me because 4 E cores should have (roughly) a multicore performance score in that ballpark.
No, I have to verify my hypothesis as well; yours sounds just as plausible as mine, if not more so. I'll comment again if I have any more evidence. FWIW, I saw Geekerwan say the A17 changes the low power mode strategy, turning off the P-cores, which, if true, might explain the much lower power/performance point for low power mode in their charts compared to older chips.
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,322
I think we've talked about this before.

I wonder if this would actually work. You also say:

Are you saying that instead of shoreline they'd use the back of the package for memory busses? I imagine that would make BSPD harder once they get there in a couple more nodes. But regardless, how do you distribute all those memory busses across the chiplets without spending a ton of energy (and delay) shipping all those bits constantly between the chiplets?

I imagine it would come down to how local memory can be for GPU work patterns, and I don't know enough about GPU code to have an opinion.

The memory-spine thing is fascinating. I don't think it fully eliminates the problems of possible nonlocality though, from a performance/efficiency standpoint at least (it obviates NUMA issues, if I remember right).
(a) I referred to the A16. Don't you think I made that reference for a REASON?
It's certainly possible that mounting memory this way results in lower energy transfers than mounting it on the side. The details would depend on things like the substrate (unclear whether it is silicon, "real" glass, standard thin ABF, or something else altogether) and the capacitance of the TSV's. But the fact that it was adopted for A16 certainly suggests that it both works (technology is reliable) and that it is low power.

BSPD only becomes an issue if you mount the memory on the BSPD side. There are two sides, so you don't have to do that...

(b) This sort of design is targeted at the Studio and higher. So while efficiency is nice, it is not the ONLY consideration. If you can get Nvidia performance at a third of Nvidia power, you're already way ahead.
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,322
Now THIS is interesting!
Check out https://patents.google.com/patent/US20230298996A1 and https://patents.google.com/patent/US20230299001A1 and https://patents.google.com/patent/US20230299068A1
All three are patents related to backside metal.

I've mentioned before that the lousy scaling for N3, especially for SRAM, comes as absolutely no surprise, because the limiting variable (especially for SRAM) is wiring density, not transistor size. And BSPD is hoped to fix that by moving power delivery (and hence that wiring) to the backside, relieving topside metal congestion.
The next step after BSPD, as suggested by IMEC, is backside clock delivery, moving more wires to the backside.

However these patents are substantially more ambitious.
The middle one simply talks about using both the top side and the backside for power delivery. Worthy but not exciting.
However, the first and third ones talk about using the backside for genuine logic/control signals!

It's unclear what the expected timeframe would be. Obviously this means Apple is thinking about how to design in the presence of backside metal layers; but this may also mean that TSMC's internal roadmap is substantially more ambitious than what they've said publicly.
Publicly, all they've announced is what looks like a somewhat cautious version of power delivery, arriving maybe two years after Intel claims (...) it will deliver BSPD. Maybe the reason for the later delivery is that what they are implementing is substantially more ambitious: not just the basic hacks (as described by IMEC) to get BSPD, but a full process stack allowing for more or less full double-sided metal layers?
 

Confused-User

macrumors 6502a
Oct 14, 2014
854
988
(a) I referred to the A16. Don't you think I made that reference for a REASON?
It's certainly possible that mounting memory this way results in lower energy transfers than mounting it on the side. The details would depend on things like the substrate (unclear whether it is silicon, "real" glass, standard thin ABF, or something else altogether) and the capacitance of the TSV's. But the fact that it was adopted for A16 certainly suggests that it both works (technology is reliable) and that it is low power.

BSPD only becomes an issue if you mount the memory on the BSPD side. There are two sides, so you don't have to do that...
Right, but I was thinking more about nonlocality. Basically the design you're describing (1 Mx + 3 GPU chiplets) is similar to the Zen 1 generation epyc chips - each chiplet has some memory busses, and that gives you NUMA. AMD chose to move memory to the I/O chiplet in succeeding generations to avoid the problems that presented. Sure, Apple may be smarter, but those definitely are issues that would need to be addressed. And what I was asking was, if you had that NUMA architecture with GPU chiplets, would that likely be a bigger or smaller issue than NUMA is with CPU chiplets? That is, is the GPU workload more or less sensitive to NUMA? I have no feel for that at all.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,678
I see, can your program see the number of ALUs and the amount of RAM???

Apple APIs tell you the amount of RAM (if you trust them). As to ALUs… the phone/Mac GPU cores have 128 ALUs each, so it’s easy to calculate there, but I don’t have any code to check for it. I’d need to think about that…
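For the RAM side, here is a minimal Swift sketch of what the public API exposes; the ALU count is not queryable, so the core count and the 128-ALUs-per-core figure below are assumptions you have to supply yourself:

```swift
import Foundation
import Metal

// What Apple's public APIs will tell you about GPU memory. There is no
// public API for ALU or GPU core counts; those are assumed by hand below.
if let device = MTLCreateSystemDefaultDevice() {
    print("GPU: \(device.name)")
    // Unified memory, so total system RAM is the relevant pool.
    print("System RAM: \(ProcessInfo.processInfo.physicalMemory / 1_048_576) MB")
    // Metal's hint for how much memory the GPU should keep resident
    // (available on macOS and recent iOS releases).
    print("Recommended working set: \(device.recommendedMaxWorkingSetSize / 1_048_576) MB")

    let assumedGPUCores = 10        // not queryable; read off the spec sheet
    let assumedALUsPerCore = 128    // the figure quoted above, assumed here
    print("Estimated ALUs: \(assumedGPUCores * assumedALUsPerCore)")
}
```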
 
  • Like
Reactions: TigeRick

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,665
OBX
Right, but I was thinking more about nonlocality. Basically the design you're describing (1 Mx + 3 GPU chiplets) is similar to the Zen 1 generation epyc chips - each chiplet has some memory busses, and that gives you NUMA. AMD chose to move memory to the I/O chiplet in succeeding generations to avoid the problems that presented. Sure, Apple may be smarter, but those definitely are issues that would need to be addressed. And what I was asking was, if you had that NUMA architecture with GPU chiplets, would that likely be a bigger or smaller issue than NUMA is with CPU chiplets? That is, is the GPU workload more or less sensitive to NUMA? I have no feel for that at all.
I feel like I am talking out of turn, but for GPU rasterization they tend to care more about bandwidth than latency. For GPGPU, I've read that whether it is more latency sensitive depends on the workload.

Now, for Apple with TBDR, I think it is less bandwidth sensitive, but that may only apply to rasterization.

I also assume that if the local (L1-L3) caches are large enough, they can blunt some of the latency/bandwidth requirements, again depending on the dataset.
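One way to make "depends on the workload" concrete is a quick roofline-style check (the peak numbers below are illustrative, roughly M2 Max class, not exact specs): a kernel doing few FLOPs per byte moved is bandwidth-bound, which is where TBDR and big caches help by cutting off-chip traffic, while one that reuses data heavily is compute-bound.

```swift
// Back-of-the-envelope roofline check. Peak figures are illustrative
// assumptions, roughly M2 Max class; they are not exact specifications.
let peakFlops: Double = 13.6e12        // FP32 FLOP/s
let peakBandwidth: Double = 400e9      // bytes/s of memory bandwidth
let machineBalance = peakFlops / peakBandwidth   // ≈ 34 FLOPs per byte

// A kernel whose arithmetic intensity (FLOPs per byte moved) sits below
// the machine balance is limited by bandwidth, not by ALU throughput.
func isBandwidthBound(flops: Double, bytesMoved: Double) -> Bool {
    return flops / bytesMoved < machineBalance
}

// Streaming pass (SAXPY-like): about 2 FLOPs per 12 bytes touched.
print(isBandwidthBound(flops: 2, bytesMoved: 12))    // true  -> bandwidth-bound
// Blocked matrix multiply with heavy reuse: say 200 FLOPs per byte.
print(isBandwidthBound(flops: 200, bytesMoved: 1))   // false -> compute-bound
```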
 

Gudi

Suspended
May 3, 2013
4,590
3,267
Berlin, Berlin
Apple are willing to play along with this [gamer] fantasy to the extent that it doesn't get in the way of real customers, but no more than that. Which means they care a LOT about making sure the A17 works well when playing Candy Crush. But they are not going to increase weight, cost, or anything else, PURELY to make Genshin Impact behave differently.
They just did with hardware ray tracing on the A17 Pro. It's a significant multiyear investment, and it burns extra battery life compared to simply deactivating a graphics feature, which is just pure unnecessary eye candy for triple-A games. Apple is all about advancing their own platform in ways no one else can. Next to programming their own OS, designing their own chips is a major vector for achieving this uniqueness. And a gaming GPU is just one subplot in this chip design endeavour. It doesn't even need to be that useful on a phone. The same GPU technology will go into iPads, MacBooks, Apple TVs and iMacs, where eventually it will be properly cooled. It doesn't matter whether console gaming on a phone stays a fantasy for now. Maybe it will help push iPad sales, maybe it will take off after the next die shrink. The groundwork has been laid for others to build upon.
 
  • Like
Reactions: tenthousandthings

leman

macrumors Core
Oct 14, 2008
19,521
19,678
They just did with hardware ray tracing on the A17 Pro. It's a significant multiyear investment, and it burns extra battery life compared to simply deactivating a graphics feature, which is just pure unnecessary eye candy for triple-A games. Apple is all about advancing their own platform in ways no one else can. Next to programming their own OS, designing their own chips is a major vector for achieving this uniqueness. And a gaming GPU is just one subplot in this chip design endeavour. It doesn't even need to be that useful on a phone. The same GPU technology will go into iPads, MacBooks, Apple TVs and iMacs, where eventually it will be properly cooled. It doesn't matter whether console gaming on a phone stays a fantasy for now. Maybe it will help push iPad sales, maybe it will take off after the next die shrink. The groundwork has been laid for others to build upon.

The fun thing about RT is that it sends a signal that Apple is serious about games while at the same time improving their value proposition in a much more tangible domain: professional rendering. There are a lot of creative users who already gravitate towards a Mac; if Apple can reach at least 60-70% of Nvidia's performance here, they could improve their footing in this market.
 

PgR7

Cancelled
Sep 24, 2023
45
13
They just did with hardware ray tracing on the A17 Pro. It's a significant multiyear investment, and it burns extra battery life compared to simply deactivating a graphics feature, which is just pure unnecessary eye candy for triple-A games. Apple is all about advancing their own platform in ways no one else can. Next to programming their own OS, designing their own chips is a major vector for achieving this uniqueness. And a gaming GPU is just one subplot in this chip design endeavour. It doesn't even need to be that useful on a phone. The same GPU technology will go into iPads, MacBooks, Apple TVs and iMacs, where eventually it will be properly cooled. It doesn't matter whether console gaming on a phone stays a fantasy for now. Maybe it will help push iPad sales, maybe it will take off after the next die shrink. The groundwork has been laid for others to build upon.
You can use that for DLSS, so "free" performance, and if Nvidia is going that route it's because it's more efficient.
So...
  • M3 Max/Ultra/Extreme
  • High Power Mode
  • CPU @ 4.5GHz
  • GPU @ 2.0GHz
...???
4.5 GHz is too big of a jump; I would say 4.2 max, probably 4.1. The current M2 Max/Ultra is at 3.7.
 

name99

macrumors 68020
Jun 21, 2004
2,410
2,322
They just did with hardware ray tracing on the A17 Pro. It's a significant multiyear investment, and it burns extra battery life compared to simply deactivating a graphics feature, which is just pure unnecessary eye candy for triple-A games. Apple is all about advancing their own platform in ways no one else can. Next to programming their own OS, designing their own chips is a major vector for achieving this uniqueness. And a gaming GPU is just one subplot in this chip design endeavour. It doesn't even need to be that useful on a phone. The same GPU technology will go into iPads, MacBooks, Apple TVs and iMacs, where eventually it will be properly cooled. It doesn't matter whether console gaming on a phone stays a fantasy for now. Maybe it will help push iPad sales, maybe it will take off after the next die shrink. The groundwork has been laid for others to build upon.
The fact that you think ray tracing exists as "pure unnecessary eye candy for triple-A games" speaks volumes about how little you know of where Apple is going.

Ray tracing is an essential element of a variety of graphics algorithms that are relevant to AR, for example casting realistic shadows... I suspect that is vastly more important in Apple's calculations than game eye-candy, which is just a minor side benefit.
 

Confused-User

macrumors 6502a
Oct 14, 2014
854
988
Ray tracing is an essential element of a variety of graphics algorithms that are relevant to AR, for example casting realistic shadows... I suspect that is vastly more important in Apple's calculations than game eye-candy, which is just a minor side benefit.
I think you're *probably* right about this. But there are a few counter-indications that Apple may finally be thinking about someday, maybe, getting serious about gaming. I wouldn't bet on it, but it no longer seems entirely insane to think so.
 