
Homy

macrumors 68030
Jan 14, 2006
2,525
2,508
Sweden
The major assumption in those scaling M-series GPU core counts is that the memory bandwidth will also scale. The cores can't compute with data they don't have local on the die (keeping all of those cores "fed" with data becomes a larger and larger issue).

They have already had to go pretty wide with just 8 GPU "cores". When you get to very high double-digit multiples of that, just how wide can they go? A 2-5x bandwidth increase is one thing; going up 12-16x can turn into an issue if the starting point is already wider than normal.

128 GPU cores without LPDDR5 is going to be tough to reach with a linear performance increase on tasks that aren't ultra embarrassingly parallel and cacheable. It probably would be better than the 64 GPU core model, but not necessarily double on most real-world apps.

Yes, the next-gen Apple Silicon needs to be better in different aspects, like memory bandwidth, but I wonder if they need as high a bandwidth as dGPUs. As I understand it, TBDR is more memory efficient. M1's LPDDR4X peaks at 68.25 GB/s. The RX 560X's GDDR5 has almost double the bandwidth at 112 GB/s, yet the M1 GPU is much faster than the 560X. LPDDR5 is 50% faster than LPDDR4X, so M2 would have at least ~102 GB/s. They can also go with a 256-bit RAM interface, larger caches and higher-clocked GPUs.
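For anyone who wants to sanity-check those numbers, the peak figure is just bus width times data rate. A quick sketch in Swift (the LPDDR5-6400 data rate and the 256-bit bus are my assumptions, not confirmed specs):

```swift
// Peak bandwidth = (bus width in bytes) × (transfers per second)
func peakBandwidthGBps(busWidthBits: Double, megatransfersPerSec: Double) -> Double {
    (busWidthBits / 8) * megatransfersPerSec * 1e6 / 1e9
}

print(peakBandwidthGBps(busWidthBits: 128, megatransfersPerSec: 4266)) // ≈ 68.3 GB/s (M1, LPDDR4X-4266)
print(peakBandwidthGBps(busWidthBits: 128, megatransfersPerSec: 6400)) // ≈ 102.4 GB/s (hypothetical LPDDR5-6400)
print(peakBandwidthGBps(busWidthBits: 256, megatransfersPerSec: 6400)) // ≈ 204.8 GB/s (hypothetical 256-bit bus)
```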
 
Last edited:
  • Like
Reactions: BenRacicot

diamond.g

macrumors G4
Mar 20, 2007
11,458
2,687
OBX
Yes, the next-gen Apple Silicon needs to be better in many aspects, like memory bandwidth, but I wonder if they need as high a bandwidth as dGPUs. As I understand it, TBDR is more memory efficient. M1's LPDDR4X peaks at 68.25 GB/s. The RX 560X's GDDR5 has almost double the bandwidth at 112 GB/s, yet the M1 GPU is much faster than the 560X. LPDDR5 is 50% faster than LPDDR4X, so M2 would have at least ~102 GB/s. They can also go with a 256-bit RAM interface, larger caches and higher-clocked GPUs.
With a faster clock, smaller process node, and more ROPs I would hope the M1 is faster and uses less power than the 560X.
 

MrGunny94

macrumors 65816
Dec 3, 2016
1,149
675
Malaga, Spain
I believe we will see a big jump with LPDDR5 on the M2 (assuming Apple doesn't just launch a new architecture, like a "P1" chip for the high-end models, and simply bump the current M1 with more big cores).

Honestly, I think the M1 at its core is vastly better than anything currently available in the MacBook space, especially when compared to the 5300M in the 16" lol. I think we are getting some very good performance from these very first SoCs for the Mac.
 
  • Like
Reactions: BenRacicot

adcx64

macrumors 65816
Nov 17, 2008
1,270
124
Philadelphia
Then why does Activity Monitor show a hard limit on RAM for the CPU and GPU? If you have 4GB of RAM, then only 2.5GB is allowed for the CPU even if the GPU isn't using close to 1.5GB. It doesn't seem unified to me.

Was Apple lying when they introduced Apple Silicon and said that unified memory was a new thing? Everything I've read about this since it was announced says Apple Silicon is a fundamental shift in how memory is used. I've heard developers talk about unified memory making it faster to do certain things like image editing too.

That is just a driver/OS provisioning thing. M1 machines also reserve some memory for GPU operation as well as system operations; it's just that these provisions are not directly advertised. But the Intel GPU can address any of the system memory, and that's what ultimately matters on the technical level.



They didn't lie, but neither did they claim to have invented unified memory. Unified memory is usually found in two kinds of systems: low-end machines, where it serves to save money and power, and custom supercomputers, where it enables heterogeneous workflows. Apple Silicon's innovation is in bringing unified memory across the entire range of consumer hardware, from entry level to professional workstation. This simplifies the programming model, enables more efficient processing and improves the user experience.


Thank you again @leman. People on MR need to google Grand Central Dispatch. Idk how you have the patience to explain to these people that they can't get RT cores from M1 compute cores. Asinine.
 

diamond.g

macrumors G4
Mar 20, 2007
11,458
2,687
OBX
Thank you again @leman. People on MR need to google Grand Central Dispatch. Idk how you have the patience to explain to these people that they can't get RT cores from M1 compute cores. Asinine.
To be fair, you can use the compute cores to do RT, you just tend not to be able to process other shader instructions at the same time. If you have the space and power budget and are advertising it, doing it Nvidia's way is the "faster" option.
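To make the "compute cores can do RT" point concrete, here's a toy ray/sphere intersection in plain Swift. It's just a sketch of the kind of math RT hardware accelerates (along with BVH traversal), not Apple's, AMD's or Nvidia's actual implementation:

```swift
import simd

// Minimal ray/sphere intersection, the kind of arithmetic a "software" ray tracer
// runs on ordinary compute units. Dedicated RT hardware accelerates exactly this
// sort of test (plus BVH traversal) so the shader cores are free for other work.
struct Ray { var origin: SIMD3<Float>; var direction: SIMD3<Float> }   // direction assumed normalized
struct Sphere { var center: SIMD3<Float>; var radius: Float }

func intersect(_ ray: Ray, _ sphere: Sphere) -> Float? {
    let oc = ray.origin - sphere.center
    let b = simd_dot(oc, ray.direction)
    let c = simd_dot(oc, oc) - sphere.radius * sphere.radius
    let discriminant = b * b - c
    guard discriminant >= 0 else { return nil }        // ray misses the sphere
    let t = -b - discriminant.squareRoot()             // nearest hit distance
    return t >= 0 ? t : nil
}

let ray = Ray(origin: .init(0, 0, -5), direction: .init(0, 0, 1))
let sphere = Sphere(center: .zero, radius: 1)
if let t = intersect(ray, sphere) { print("hit at distance", t) } else { print("miss") }   // hit at distance 4.0
```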
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
I am not even sure that this is such a big limitation in practical terms. A single 5K frame is around 60Mb uncompressed. To sustain a 60 FPS stream of such frames you need less than 3.6GB/s. The M1 can likely support multiple 5K monitors before the bandwidth impact is felt. Not to mention that the display image will be compressed in practice, saving you tons of bandwidth.

I think you mean 60MB, not Mb. (5120 × 2880 × 30 bits = 442 Mb ≈ 55 MB)
And yet the M1 Macs are capped at support for just one 6K XDR display.

Additionally, you have to prep the next frame buffers while the display controller is reading the finished ones out (the display controller writes to the wires going out of the system, not into memory). And if the compression scheme isn't exactly the same as DisplayPort's, then all of that has to be decompressed and then recompressed into the different encoding scheme.

Reading and writing don't happen at the same time, so pragmatically you're looking at about twice what you highlight above. And if the display controller and the GPU cores write on the same memory bus, you'll have that read/write contention.
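For reference, the scan-out arithmetic as a quick sketch (30 bits per pixel and 60 Hz are the assumptions here):

```swift
// Bytes per frame = width × height × bits-per-pixel ÷ 8; multiply by refresh rate for GB/s.
func framebufferGBps(width: Double, height: Double, bitsPerPixel: Double, hz: Double) -> Double {
    (width * height * bitsPerPixel / 8) * hz / 1e9
}

let scanOut = framebufferGBps(width: 5120, height: 2880, bitsPerPixel: 30, hz: 60)
print(scanOut)        // ≈ 3.3 GB/s just to read one 5K frame out 60 times a second
print(scanOut * 2)    // ≈ 6.6 GB/s once the GPU is also writing the next frame
```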



That's a great reference and analysis! Just to add to this: M1 uses a 128-bit memory interface, with 8 independent 16-bit memory channels. Having this many memory channels is one of the many features that allows M1 to have high memory-level parallelism; multiple memory requests can be in flight simultaneously, so memory is used efficiently (having a wider bus means that you might fetch more memory than you actually need, wasting the channel).

Depends on what is going on. On a dense set of vector/matrix ops, fetching too much isn't all that likely even when limited to 16-bit wide sources. However, yes: keeping 8 CPU and 8 GPU cores "happy" with 8 to 16 different memory address request streams will work better on the "wider" setup Apple has. And if you're doing vectors over a relatively tiny 16-bit wide path, it is much better if no other requests have to time-slice with you to get the much longer stream needed to get some work done. (Four 16-bit requests only amount to one 64-bit word... not a decent-sized data vector.)
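To make the "many independent request streams" point concrete, a crude Swift sketch: each worker streams through its own buffer, so more workers means more independent streams in flight at the memory controllers. This is only an illustration, not a rigorous benchmark:

```swift
import Foundation

// Each worker does a pure streaming read over its own 128 MiB array, so the number
// of independent memory request streams scales with the worker count.
let workers = 8
let perWorker = 16 * 1024 * 1024                       // 16M Doubles = 128 MiB per worker
let arrays = (0..<workers).map { _ in [Double](repeating: 1.0, count: perWorker) }
let sums = UnsafeMutableBufferPointer<Double>.allocate(capacity: workers)
defer { sums.deallocate() }

let start = DispatchTime.now()
DispatchQueue.concurrentPerform(iterations: workers) { i in
    sums[i] = arrays[i].reduce(0, +)                   // streaming read, each worker has its own slot
}
let secs = Double(DispatchTime.now().uptimeNanoseconds - start.uptimeNanoseconds) / 1e9

let bytesRead = Double(workers * perWorker * MemoryLayout<Double>.stride)
print(String(format: "%.1f GB/s aggregate read bandwidth", bytesRead / secs / 1e9),
      "(checksum \(sums.reduce(0, +)))")
```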


Careful measurements of the M1's RAM show that it performs just as you'd expect regular LPDDR4X-4267 RAM to perform. Peak bandwidth and latency are identical to Intel's Tiger Lake platform, for example (as measured by AnandTech).

LPDDR4 for 8 concurrent cores requests

[AnandTech M1 memory bandwidth chart]


Pragmatically, that is probably lower effective latency than if it had 8+ x86 and GPU core reads stacked up on top of each other. Not sure that test is really pointing at total bisection bandwidth. In fact, probably not, since memory copies are actually higher than that rate.

"... Most importantly, memory copies land in at 60 to 62GB/s depending if you’re using scalar or vector instructions. ..."

(unless copying a single small fragment out of the L2 cache. )
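If you want to reproduce the ballpark of that copy number on your own machine, here is a crude sketch (not AnandTech's methodology; the 512 MiB size is arbitrary as long as it dwarfs the caches):

```swift
import Foundation

// Crude copy-bandwidth probe. Large buffers so we measure DRAM, not cache.
let bytes = 512 * 1024 * 1024                       // 512 MiB, far larger than any on-die cache
let src = UnsafeMutableRawPointer.allocate(byteCount: bytes, alignment: 16384)
let dst = UnsafeMutableRawPointer.allocate(byteCount: bytes, alignment: 16384)
defer { src.deallocate(); dst.deallocate() }

memset(src, 1, bytes)                               // touch the pages so they are actually mapped
memset(dst, 0, bytes)

let start = DispatchTime.now()
memcpy(dst, src, bytes)
let seconds = Double(DispatchTime.now().uptimeNanoseconds - start.uptimeNanoseconds) / 1e9

// A copy reads and writes every byte, so count the traffic twice.
print(String(format: "%.1f GB/s", Double(bytes) * 2 / seconds / 1e9))
```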




LPDDR5 would give them a nice boost in bandwidth, but they would still need to use multi-channel controllers. That's why the conservative prediction for the upcoming prosumer chip is at least a 256-bit RAM interface (double that of M1)

LPDDR5 would be a reduction of channels. It just gives them an outlet to grow closer to the 3080/6900 in future iterations. It is a growth path because they have used up the "go wider" approach.



True, but let's not forget that Apple has large system-level caches, something that GPUs lack. An RTX 3080 has only 5MB of L2 cache; even M1 has 16GB LLC.

Again, probably not the right number: 16MB LLC. Pointing at the system RAM and claiming it is some kind of "cache" is just wrong. A memory cache is some subset of the whole, not the whole thing.

If you go back to the TechInsights article, the M1's system cache is actually smaller than the A14's. It is nowhere near single-GB capacity (let alone 16).

As for the idea that no GPUs have large caches: not really true.

"..Navi 21 will have a 128MB Infinity Cache. Meanwhile AMD isn’t speaking about other GPUs, but those will presumably include smaller caches as fast caches eat up a lot of die space.
...
On that note, let's talk about the size of the infinity Cache. In short, based on currently available data it would appear that AMD has dedicated a surprisingly large portion of their transistor budget to it. Doing some quick paper napkin math and assuming AMD is using standard 6T SRAM, Navi 21’s Infinity Cache would be at least 6 billion transistors in size, which is a significant number of transistors even on TSMC’s 7nm process (for reference, the entirety of Navi 10 is 10.3B transistors)... "
https://www.anandtech.com/show/1620...-starts-at-the-highend-coming-november-18th/2

Even the 6700 has 96MB of Infinity Cache.
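The quote's napkin math is easy to reproduce: roughly 6 transistors per SRAM bit, ignoring tag arrays and control logic, so treat these as floor estimates.

```swift
// 6T SRAM: roughly 6 transistors per stored bit, not counting tag arrays or control logic.
func sramTransistorsBillions(megabytes: Double, transistorsPerBit: Double = 6) -> Double {
    megabytes * 1024 * 1024 * 8 * transistorsPerBit / 1e9
}

print(sramTransistorsBillions(megabytes: 128)) // ≈ 6.4B for Navi 21's 128 MB Infinity Cache
print(sramTransistorsBillions(megabytes: 96))  // ≈ 4.8B for the 6700's 96 MB
print(sramTransistorsBillions(megabytes: 16))  // ≈ 0.8B for a 16 MB SLC
```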



So is Apple "way out in front of everyone else"? Not really. It is going to be tough for them to out-compete the 6900, which already has higher baseline VRAM bandwidth and a larger "system cache", let alone once it gets to 6nm or 5nm iterations.











Large caches help to compensate for the limited system RAM bandwidth in both compute and graphical workloads, and on the graphics side Apple also has TBDR. And again, let's not forget compression: the A14/M1 has hardware memory compression between the cache and the RAM for GPU compute workloads, which allows it to save memory bandwidth.


TBDR isn't going to help on compute tasks (e.g., de-Bayering RED RAW, or anything the M-series fixed-function media logic doesn't handle). TBDR is why Apple is trying to herd developers into fully optimizing for their specific GPUs, though (you can't do optimization for any other GPU implementation on M-series Macs at the moment).

AMD and Nvidia have various texture and frame compression features too. Again, they're sitting on a faster baseline VRAM bandwidth and adding in the same tricks... Apple isn't going to pass them up at the top levels. Apple can eat into old "discrete exclusive" territory, but blow past the entire spectrum? Probably not.



At the same time, Apple is less reliant on PCIe lanes. GPU, SSD, all the usual culprits run on some sort of internal chip magic, so you are pretty much left with Thunderbolt for your PCIe requirements. Of course, it remains to be seen how (and whether) they will solve modularity issues for a new Mac Pro. But even if it is modular, I doubt that it will allow much in terms of third-party PCIe device expansion.

That's the same "paint ourselves into a corner" approach Apple already said was a problem before. Only one internal storage drive is good enough for all high-end users? Nope. Only an embedded proprietary GPU is good enough? Nope. Thunderbolt as a panacea for high-bandwidth I/O cards? Nope. (A single RX 6900 is whipping the Vega II Duo from a couple of years ago on intense computational tasks, so what would two RX 6900s do to a single Apple iGPU... more than likely leave it in the dust, let alone what a 7900 or 8900 would do in several years.)

For the laptops? Yes. Apple never used the full PCIe lane allocation anyway; they didn't even use it all in the iMac Pro. If Apple kneecaps the large-screen iMac to fit inside the chin so they can maximize how thin the system is, then yeah, I can see them hand-waving away the lanes. But that is basically the retreat back to desktops built from laptop-focused CPUs and GPUs.
 

BenRacicot

macrumors member
Original poster
Aug 28, 2010
80
45
Providence, RI
@leman for clarification, RDNA2 doesn't have "dedicated" RT cores either. The ray accelerators are part of the CU, so they cannot accelerate RT and raster graphics at the same time. That provides better hardware utilization, but one or the other suffers.

EDIT: I suspect that is a similar route to the one Apple will take if they add hardware acceleration to their GPUs. Intel is going the Nvidia route with dedicated RT cores (which is why none of the Xe iGPUs have it yet).
I had thought any processor can crunch RT if the software level allows it, is this true? If so, then why couldn't the SoC offload certain efforts onto the Neural Engine?

DDR6 is in the research labs. It "exists" in the sense that there are prototypes that people can point to concretely in JEDEC standards discussions.
That's true! And it does look like hundreds of sites have labelled their articles LPDDR6 on purpose, ugh! That being said, it has been at JEDEC under JESD8-30A since at least 2019, from what I could find.

@leman I wasn't being sarcastic, I actually would appreciate being educated on this stuff. Your latest answers have been very helpful.
 

diamond.g

macrumors G4
Mar 20, 2007
11,458
2,687
OBX
I had thought any processor can crunch RT if the software level allows it, is this true? If so, then why couldn't the SoC offload certain efforts onto the Neural Engine?


That's true! And it does look like hundreds of sites have labelled their articles LPDDR6 on purpose, ugh! That being said, it has been at JEDEC under JESD8-30A since at least 2019, from what I could find.

@leman I wasn't being sarcastic, I actually would appreciate being educated on this stuff. Your latest answers have been very helpful.
I am not sure how suited the NPU is for BVH intersection acceleration, but it would be great for DLSS upscaling (replicating Tensor Cores) if Apple wanted to go that route.
 

Fomalhaut

macrumors 68000
Oct 6, 2020
1,993
1,724
Leman, you gotta chill out man. Are you ok?
I'm legit thinking of not using MR forums because of the tone of your comments. We don't need to be attacked. I'm SURE I'm not the only one who feels this way about your replies.
Actually, I think you ARE the only one who feels this way about @leman's posts :) I don't see anything wrong with his tone, and I have found that he/she provides well-reasoned and balanced answers.
 
Last edited:

Kung gu

Suspended
Oct 20, 2018
1,379
2,434
LPDDR4 for 8 concurrent cores requests
"Besides the additional cores on the part of the CPUs and GPU, one main performance factor of the M1 that differs from the A14 is the fact that’s it’s running on a 128-bit memory bus rather than the mobile 64-bit bus. Across 8x 16-bit memory channels and at LPDDR4X-4266-class memory, this means the M1 hits a peak of 68.25GB/s memory bandwidth."

The 2020 Mac Mini Unleashed: Putting Apple Silicon M1 To The Test (anandtech.com)

M1 uses LPDDR4X.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
I had thought any processor can crunch RT if the software level allows it, is this true? If so, then why couldn't the SoC offload certain efforts onto the Neural Engine?

The Neural cores are optimized for certain matrix math on limited-precision data types. While there is some matrix math in ray tracing, doing "rays" with undersized data types (16-bit or less) is not going to produce generally good results. AI/ML throws away precision to get stuff fast, for "good enough" results where precision doesn't matter as much. Imprecision that throws away color to go faster isn't going to help much if you're shooting for high-fidelity, realistic colors. And for RT, throwing away precision on vertex locations only gets you to a weaker bounding box faster (as opposed to a tighter fit).

The other issue is where a significant amount of the data lives. If there is a decent-sized chunk in the GPU L2 cache that has to be copied over to the NPU and back at the end of its outsourced subset of the computation, that is more overhead. If all the data is sitting in RAM and the NPU is just as far from the data as the GPU is, then that is a different trade-off.

There are probably some specific, narrow tasks that could be shared, but RT probably has some substantive cache-locality hiccups.
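To make the precision point concrete, a tiny illustration (assumes a toolchain where Float16 is available, e.g. Swift on Apple Silicon); this isn't how any shipping ray tracer works, it just shows what 16-bit floats do to ordinary scene coordinates:

```swift
// A half-precision float has a 10-bit mantissa, so above 2048 its spacing is 2.0.
let vertex: Float = 2049.25          // an unremarkable world-space coordinate
let squeezed = Float16(vertex)       // ML-style 16-bit precision
print(Float(squeezed))               // 2050.0: the vertex just moved by 0.75 units
```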


That's true! And it does look like hundreds of sites have labelled their articles LPDDR6 on purpose, ugh! That being said, it has been at JEDEC under JESD8-30A since at least 2019, from what I could find.

JESD8-30A is more about GDDR6:
"... is primarily used to communicate with GDDR6 SGRAM devices. ..."

DDR and GDDR are two different standards. They work substantively differently; not completely detached, but different enough to require a separate standard. There will be some overlapping implementation techniques/tools, but they are different. LPDDR is a minor variation of DDR, but is also technically a separate standard (usually LP and 'plain' arrive at about the same time, so they are more tightly coupled; DDR5 had some bigger power tweaks to it, so its final version slid past the final version of LPDDR5).

GDDR6 does not in any way mean that there is a completed DDR6. GDDR is on a different update cadence than DDR (and hence LPDDR). Going forward all three are likely to stay decoupled, with smartphones driving a bigger wedge between DDR and LPDDR.
 
  • Like
Reactions: BenRacicot

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
The other issue is where a significant amount of the data lives. If there is a decent-sized chunk in the GPU L2 cache that has to be copied over to the NPU and back at the end of its outsourced subset of the computation, that is more overhead. If all the data is sitting in RAM and the NPU is just as far from the data as the GPU is, then that is a different trade-off.
The way I understand cache design, the M1's CPU, NPU, and GPU each have their own caches (at least L1/L2), and the M1 has an overall System Level Cache that all processing cores see. This is how the unified memory comes about. If data is not found in L1/L2, cache lines could be flushed and the required data fetched from the SLC into L1/L2. If it's not found in the SLC, it'll be fetched from RAM and stored in the SLC for subsequent fetches. If any of the processors changes the value of memory, it'll either be sent back to the L1/L2/SLC or delayed until it's needed by other cores, depending on the caching protocol employed.

So the 'copying' between the processing cores is mainly fetches from the SLC, which is a lot faster compared to RAM. In theory, this should not be considered copying though.
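In API terms, that is what makes zero-copy sharing possible. A minimal Metal-in-Swift sketch of the programming model (not a claim about Apple's internal implementation): one .storageModeShared buffer that the CPU fills and that a compute encoder could bind directly, with no staging copy or didModifyRange() call of the kind .managed buffers need on discrete-GPU Macs.

```swift
import Metal

guard let device = MTLCreateSystemDefaultDevice() else { fatalError("no Metal device") }

let count = 1024
let buffer = device.makeBuffer(length: count * MemoryLayout<Float>.stride,
                               options: .storageModeShared)!

// The CPU writes straight into the allocation the GPU will read; no VRAM upload step.
let values = buffer.contents().bindMemory(to: Float.self, capacity: count)
for i in 0..<count { values[i] = Float(i) }

// A compute command encoder could now bind `buffer` directly with setBuffer(_:offset:index:).
print(device.hasUnifiedMemory, values[42])   // "true 42.0" on Apple Silicon
```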
 

leman

macrumors Core
Oct 14, 2008
19,530
19,709
I think you mean 60MB, not Mb. (5120 × 2880 × 30 bits = 442 Mb ≈ 55 MB)

Yeah, sorry, it was a typo plus writing a long post while being tired... it's 60MB of course.

Depends on what is going on. On a dense set of vector/matrix ops, fetching too much isn't all that likely even when limited to 16-bit wide sources.

LPDDR4 has a prefetch size of 16n, so it fetches data in blocks of 32 bytes per request per channel (it might be more complex than that; I have little practical understanding of how memory works)

Again, probably not the right number: 16MB LLC.

Yeah, sorry, it's another typo. It's a 16MB cache of course.

As for the idea that no GPUs have large caches: not really true.

"..Navi 21 will have a 128MB Infinity Cache. Meanwhile AMD isn't speaking about other GPUs, but those will presumably include smaller caches as fast caches eat up a lot of die space.

Which is exactly my point! The Nvidia 3000 series uses faster (and hotter) RAM with 50% or higher RAM bandwidth than the Navi 2 series, but thanks to the non-traditionally large cache AMD can get competitive performance with much slower VRAM. Apple uses the same idea to compensate for slower RAM.

Just some days ago AMD announced that their future CPUs will come with absolutely humongous amounts of L3 cache stacked on die. Given that Apple has had similar patents filed for years and that they mentioned their intent to use "advanced packaging technologies" offered by TSMC, I suspect that they will use something similar in their prosumer product line. In fact, I wouldn't be surprised if the upcoming Mac chips use 64MB of LLC or something ridiculous like that.

That's the same "paint ourselves into a corner" approach Apple already said was a problem before. Only one internal storage drive is good enough for all high-end users? Nope. Only an embedded proprietary GPU is good enough? Nope. Thunderbolt as a panacea for high-bandwidth I/O cards? Nope. (A single RX 6900 is whipping the Vega II Duo from a couple of years ago on intense computational tasks, so what would two RX 6900s do to a single Apple iGPU... more than likely leave it in the dust, let alone what a 7900 or 8900 would do in several years.)

These are all good points and it remains to be seen how Apple will approach solving them.
 

leman

macrumors Core
Oct 14, 2008
19,530
19,709
The way I understand cache design, the M1's CPU, NPU, and GPU each have their own caches (at least L1/L2), and the M1 has an overall System Level Cache that all processing cores see. This is how the unified memory comes about. If data is not found in L1/L2, cache lines could be flushed and the required data fetched from the SLC into L1/L2. If it's not found in the SLC, it'll be fetched from RAM and stored in the SLC for subsequent fetches. If any of the processors changes the value of memory, it'll either be sent back to the L1/L2/SLC or delayed until it's needed by other cores, depending on the caching protocol employed.

So the 'copying' between the processing cores is mainly fetches from the SLC, which is a lot faster compared to RAM. In theory, this should not be considered copying though.

Precisely. Sharing the LLC as a "gateway" to the RAM is really what enables coherent unified memory. As you say, if the GPU and CPU need to access the same data, they are essentially accessing it through the LLC, which is significantly faster than reaching out to RAM directly.
 
  • Like
Reactions: BenRacicot

leman

macrumors Core
Oct 14, 2008
19,530
19,709
Pretty sure the LPDDR4X ICs are custom. I did a quick analysis in this post a while ago.

That's a very cool analysis! Can't believe I didn't see it before. Also, the number of BGA pads on those chips is insane. It's surprising that the RAM has the same performance as standard LPDDR4X given all that custom technology. I think your idea that it's done for power efficiency is spot on though; it's quite incredible that Apple got RAM power usage to under a watt under load...
 

cmaier

Suspended
Jul 25, 2007
25,405
33,474
California
That's a very cool analysis! Can't believe I didn't see it before. Also, the number of BGA pads on those chips is insane. It's surprising that the RAM has the same performance as standard LPDDR4X given all that custom technology. I think your idea that it's done for power efficiency is spot on though; it's quite incredible that Apple got RAM power usage to under a watt under load...

My guess: they are using differential signaling at half voltage, and the rest of the pads are power/ground for shielding and better IR characteristics.

Just guessing.
 
  • Like
Reactions: Andropov

jdb8167

macrumors 601
Nov 17, 2008
4,867
4,603
Pretty sure the LPDDR4X ICs are custom. I did a quick analysis in this post a while ago.
This is why the posts that claim Apple is overcharging for RAM are not credible. They go and look at a spot price for LPDDR4X RAM and draw conclusions. Apple doesn’t always use the same playbook as other companies.

I missed this analysis when you originally posted it. Very interesting.
 

vigilant

macrumors 6502a
Aug 7, 2007
715
288
Nashville, TN
The reason I am asking is that unified memory is pretty much the default assumption across the entire range of Apple Silicon Macs (Apple's communication has been fairly clear on this so far). So every time someone suggests that they are going to use separate VRAM, I have to ask why that person thinks Apple would break their elegant design and what kind of benefit that would bring, in their opinion.
I know I'm late to the conversation, but I agree that unified memory is here to stay. As this continues to scale, I can see the benefit of a more elaborate caching mechanism to help keep memory and processors busy... but even now, 32GB or 64GB of unified memory should demolish just about every workload situation I can think of.
 

Krevnik

macrumors 601
Sep 8, 2003
4,101
1,312
Thank you again @leman. People on MR need to google Grand Central Dispatch. Idk how you have the patience to explain to these people that they can't get RT cores from M1 compute cores. Asinine.

TBH, I don't get what GCD has to do with any of this. We aren't talking about CPU behavior here.

To be fair, you can use the compute cores to do RT, you just tend not to be able to process other shader instructions at the same time. If you have the space and power budget and are advertising it, doing it Nvidia's way is the "faster" option.

Makes me wonder if you also miss out on other possible optimizations.
 

deconstruct60

macrumors G5
Mar 10, 2009
12,493
4,053
Yeah, sorry, it was a typo plus writing a long post while being tired... it's 60MB of course.



LPDDR4 has a prefetch size of 16n, so it fetches data in blocks of 32 bytes per request per channel (it might be more complex than that; I have little practical understanding of how memory works)



Yeah, sorry, it's another typo. It's a 16MB cache of course.

The problem is that those three things are in substantive conflict with each other. If the frame is refreshing at 60/sec, then 60 60MB frames is 3600MB of data. The cache is 16MB, so one second of framebuffer is 225x as big as the cache. On top of that, most caches use least-recently-used (LRU) retention, whereas as soon as the display controller has pumped out the most recently constructed framebuffer, it is basically garbage (useless). So the retention policies are exactly backwards (dump MRU for the framebuffer versus LRU for mainstream code). If you let the framebuffer go through the L3 cache, you'll churn it like butter. The other problem with the framebuffer is that it is very latency sensitive: if you only send half or 3/4 of the frames to a display, it won't look right. It doesn't need to consume the whole memory bandwidth, but it will want a very consistent slice of whatever pie is being shared.

So if you're fetching 32 bytes through a 16-bit wide "straw", the other read/write requests that arrive in the interim have to wait in line. The framebuffer traffic is probably set to bypass the L3/System Cache, so it will still suck up more than its "fair share" of request queue space. Apple offsets that by going wider (more queues to send requests to): either designate 'carpool' lanes for the CPU (or GPU), or dilute the framebuffer queue pressure across more outlets (and even out the latencies).

The system is set up as much to keep latency from sagging "too far" under higher bandwidth load as it is for low latency. (See the AnandTech chart where the parallel read requests all sag down to a flat line once the data pulls are directly hitting the RAM; in AnandTech's review of Intel's Tiger Lake the latency is slightly lower, but a heavy data pull drops to a lower bandwidth.)

Once you put three 6-8K displays on, you're up into the double-digit GB/s range with little to no cache help to offload it. Framebuffers are not a "share friendly" memory workload.
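The churn argument is easy to sanity-check with a toy reuse-distance model: an LRU cache only helps if a line is touched again before more unique lines than the cache holds have gone by, and a sequential framebuffer sweep has a reuse distance of a whole frame.

```swift
// Reuse-distance view of the framebuffer-vs-LRU-cache argument above.
let lineSize   = 64
let cacheLines = 16 * 1024 * 1024 / lineSize   // 16 MB SLC  -> 262,144 lines
let frameLines = 60 * 1024 * 1024 / lineSize   // one ~60 MB 5K frame -> 983,040 lines

// A framebuffer line isn't touched again until the next frame, i.e. frameLines
// accesses later. LRU only retains the most recent cacheLines lines, so:
print("framebuffer lines can hit in the SLC:", frameLines <= cacheLines)              // false
print("cache-fulls of evictions per second at 60 Hz:", 60 * frameLines / cacheLines)  // 225
```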








Which is exactly my point! The Nvidia 3000 series uses faster (and hotter) RAM with 50% or higher RAM bandwidth than the Navi 2 series, but thanks to the non-traditionally large cache AMD can get competitive performance with much slower VRAM. Apple uses the same idea to compensate for slower RAM.

But that is with a shared GDDR6 baseline. The framebuffer subset can't use the cache, so that won't help much there. Apple is also using a slower baseline RAM store, so the "trick" (for non-framebuffer work) will be used to make up the deficit, but that just gets them back to the baseline the other two started from. Apple catches up to where they start from.




Just some days ago AMD announced that their future CPUs will come with absolutely humongous amounts of L3 cache stacked on die.

Not that humongous; basically likely just doubling the L3. It is layered on top of the L3 that is on the main die, not putting the cache over hotspots on the die.

You can 4x or 6x the L3, and a single 6-8K framebuffer will still churn an LRU cache like butter.



Given that Apple has had similar patents filed for years and that they mentioned their intent to use "advanced packaging technologies" offered by TSMC, I suspect that they will use something similar in their prosumer product line.

That is likely to be as much 2D (or 2.5D) advanced packaging as 3D. TSMC has some InFO-R stuff coming online in mid-2021 that is pretty large.

[Slide from TSMC's "Advanced Packaging Technology Leadership" presentation]

https://www.anandtech.com/show/16051/3dfabric-the-home-for-tsmc-2-5d-and-3d-stacking-roadmap


A 2.5x reticle gives them plenty of room for several smallish chiplets (or a couple of mid-size ones), and the 100x100mm substrate still perhaps has space to sprinkle LPDDR4 modules around the outside (if they keep the heat way down).

In 2Q21 the more 3D variant is supposed to get up into the 3x reticle zone (and go to 4x in 2023).



In fact, I wouldn't be surprised if the upcoming Mac chips use 64MB of LLC or something ridiculous like that.

If they dramatically crank up the core count (CPU and/or GPU) and also climb into the 64+GB system RAM range, then 64MB is far from the "ridiculous" range. The only way it would be overkill is if all those cores were only pulling from a narrow subset of memory; otherwise the added requests make the queues longer at the memory controllers and the wider range of data means more cache-set affinity overlaps.

It isn't that hard to do: just throw the total additional transistor budget at the "cores" to crank up the core count.
 
Last edited:

AgentMcGeek

macrumors 6502
Jan 18, 2016
374
305
London, UK
They kind of already do. From what we know, the SSD is connected directly to the M1's internal bus and the on-chip controller emulates the NVMe protocol to communicate with the OS. The next logical step is to drop NVMe altogether and go fully custom, potentially exposing the SSD storage as byte-addressable physical RAM in a common address space. This would allow the kernel to map SSD storage directly, eliminating the need for any logical layer and dramatically improving latency. But I have no idea about these things and I don't know whether there are any special requirements that would make such a direct-mapping approach non-viable.

I think what you've described here is partially what the next NVMe protocol will attempt to do:

The NVMe Key Value (NVMe-KV) Command Set provides access to data on an NVMe SSD namespace using a key as opposed to logical block addresses. The NVMe-KV command set allows users to access a variable-sized value (e.g., an object) using a key, without the overhead of the host maintaining a translation table that defines an object as a sequence of logical block addresses.

Source: https://nvmexpress.org/everything-y...0-specifications-and-new-technical-proposals/
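A toy sketch of the difference, with hypothetical types (this is not the NVMe or NVMe-KV API): in the block model the host keeps a translation table from object to logical block addresses, while in the key-value model the drive resolves a key to a variable-sized value itself.

```swift
// Hypothetical toy models; not real NVMe APIs.

// Block model: the host tracks which logical block addresses make up each object.
struct BlockDrive {
    static let blockSize = 4096
    var blocks: [Int: [UInt8]] = [:]            // LBA -> block contents
    var hostTranslation: [String: [Int]] = [:]  // object name -> its LBAs (host-maintained)
    var nextLBA = 0

    mutating func writeObject(_ name: String, _ data: [UInt8]) {
        var lbas: [Int] = []
        var offset = 0
        while offset < data.count {              // split the object into fixed-size blocks
            let end = min(offset + Self.blockSize, data.count)
            blocks[nextLBA] = Array(data[offset..<end])
            lbas.append(nextLBA)
            nextLBA += 1
            offset = end
        }
        hostTranslation[name] = lbas             // the bookkeeping a KV command set removes
    }
    func readObject(_ name: String) -> [UInt8]? {
        guard let lbas = hostTranslation[name] else { return nil }
        return lbas.flatMap { lba in blocks[lba] ?? [] }
    }
}

// Key-value model: the drive maps a key straight to a variable-sized value.
struct KeyValueDrive {
    var store: [String: [UInt8]] = [:]
    mutating func put(_ key: String, _ value: [UInt8]) { store[key] = value }
    func get(_ key: String) -> [UInt8]? { store[key] }
}

var blockDrive = BlockDrive()
blockDrive.writeObject("photo.raw", Array(repeating: 7, count: 10_000))
var kvDrive = KeyValueDrive()
kvDrive.put("photo.raw", Array(repeating: 7, count: 10_000))
print(blockDrive.readObject("photo.raw")?.count ?? 0,   // 10000, via 3 LBAs plus a host-side table
      kvDrive.get("photo.raw")?.count ?? 0)             // 10000, via one key lookup
```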
 
  • Like
Reactions: jdb8167

theluggage

macrumors G3
Jul 29, 2011
8,023
8,467
This is why the posts that claim Apple is overcharging for RAM are not credible. They go and look at a spot price for LPDDR4X RAM and draw conclusions. Apple doesn’t always use the same playbook as other companies.
I don't think that proves anything. Apple charge $200 for an 8-to-16GB RAM upgrade whether that means a different M1 package, different LPDDR SMT chips on a logic board or just plugging bog-standard SODIMMs into a 5k iMac or Intel Mini. Upgrade prices have little or nothing to do with bill-of-materials. Plus, Apple are dealing in huge volumes, have huge negotiating power with suppliers and sure as heck aren't paying "spot prices" for their components. Even custom chips can be economical on that scale (or the M1 wouldn't have happened...)

Meanwhile, interesting discussion, and I look forward to seeing how Apple are going to scale the M1 to replace the higher-end 16" MBPs and the 5K iMac/iMac Pro, particularly when it comes to GPU power and RAM capacity. I don't think it is obvious and I really hope they have a plan.
 
  • Like
Reactions: BenRacicot