
diamond.g

macrumors G4
Mar 20, 2007
11,437
2,665
OBX
ML upscaling is not a magical solution. It requires explicit game support (as the game needs to provide additional data to the NN), it requires the training infrastructure, and it requires local ML performance. Apple's ML accelerators are fast, but they are optimized for mobile use — tensor cores in something like an RTX 2060 are still around 5 times faster (and an RTX 3080 is a whopping 25 times faster for ML). And of course, you need a lot of training data, which Nvidia has been collecting for years. Apple — not so much.

Generally I'd wager that tweaking your engine to take explicit advantage of TBDR is not more complicated than supporting ML upscaling. And you don't need to deal with the time-consuming and potentially expensive training step.

Not to mention that ML upscaling makes little sense for the resolutions you are talking about. The A14 already offers 2x the performance per watt of Nvidia's best. Apple GPUs won't have any problems dealing with 1080p resolutions. Where ML upscaling becomes really interesting is getting to resolutions that the GPU can't realistically do on its own. But for that you need ML performance and bandwidth... which means needing a beefy GPU in the first place. I am not sure this tech is suitable for entry level; I have a suspicion that it will stay reserved for the high end for some time.



API doesn't matter. Nvidia's upscaling technology is based on a) custom API extensions and b) data for training the NN model and c) excellent ML performance of their GPUs. They use supercomputers to gather high-resolution game image data and build the upscaling network. They also seem to have collected plenty of data to quickly support new games.

But there is nothing special about any of this. If you want, you can build an autoencoder network for your iOS game, train it with high-res output and roll your own ML upscaling. Not sure whether it will be worth the effort though.

By the way, this kind of stuff is obligatory for RT. Ray tracing produces "grainy" images (since you are casting individual rays), so you need to use a filter that produces a smooth image.
Is that because Ray Tracing isn’t done at full resolution?
 

onfire23

macrumors member
Oct 20, 2020
37
26
Seems that the iPhone 12 uses LPDDR4X RAM according to iFixit. Is Apple using quad channel for increased bandwidth?
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,522
19,679
Is that because Ray Tracing isn’t done at full resolution?

I think this is where the concept of resolution starts getting a bit fuzzy... RT works by simulating light rays in reverse: you basically cast a ray and trace its path back until it hits a light source. To get an accurate image, you need to cast multiple rays for each pixel (ideally A LOT of them), which becomes extremely expensive very quickly. Denoising techniques let you get away with fewer rays per pixel, improving performance.
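
To make that concrete, here is a toy Python sketch (made-up numbers, nothing to do with how Metal or any GPU actually implements RT) showing why low ray counts look grainy: the noise of a Monte Carlo pixel estimate only shrinks with the square root of the number of rays per pixel, which is why a good denoiser is such a big win.

```python
# Toy Monte Carlo estimate of a single pixel's brightness, to illustrate why
# path tracing looks "grainy" at low ray counts. Purely illustrative numbers.
import numpy as np

rng = np.random.default_rng(0)
TRUE_BRIGHTNESS = 0.5  # pretend the converged value of this pixel is 0.5

def render_pixel(rays_per_pixel: int) -> float:
    """Average the contributions of randomly sampled light paths.
    Each 'ray' returns a noisy sample whose expectation is TRUE_BRIGHTNESS."""
    samples = rng.uniform(0.0, 2 * TRUE_BRIGHTNESS, size=rays_per_pixel)
    return samples.mean()

for n in (1, 4, 16, 256):
    estimates = np.array([render_pixel(n) for _ in range(10_000)])
    # The standard deviation of the per-pixel estimate shrinks like 1/sqrt(n):
    print(f"{n:4d} rays/pixel -> noise (std dev) ~ {estimates.std():.3f}")
```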

Seems that the iPhone 12 uses LPDDR4X RAM according to iFixit. Is Apple using quad channel for increased bandwidth?

Interesting! According to GB4 memory benchmark tests, RAM bandwidth of A14 devices is 50% higher. That's quite a big jump! But we also don't know how GB does memory tests... maybe we are seeing some effects of the increased cache in the A14...

At any rate, the numbers are not bad — they are comparable with the 13" MBP that uses quad-channel 3733 MHz LPDDR4X (at the same time, in multi-core Ice Lake bandwidth seems to be significantly higher).
 
Last edited:

PortoMavericks

macrumors 6502
Jun 23, 2016
288
353
Gotham City
ML upscaling is not a magical solution. It requires explicit game support (as the game needs to provide additional data to the NN), it requires the training infrastructure, and it requires local ML performance. Apple's ML accelerators are fast, but they are optimized for mobile use — tensor cores in something like an RTX 2060 are still around 5 times faster (and an RTX 3080 is a whopping 25 times faster for ML). And of course, you need a lot of training data, which Nvidia has been collecting for years. Apple — not so much.

Generally I'd wager that tweaking your engine to take explicit advantage of TBDR is not more complicated than supporting ML upscaling. And you don't need to deal with the time-consuming and potentially expensive training step.

Not to mention that ML upscaling makes little sense for the resolutions you are talking about. The A14 already offers 2x the performance per watt of Nvidia's best. Apple GPUs won't have any problems dealing with 1080p resolutions. Where ML upscaling becomes really interesting is getting to resolutions that the GPU can't realistically do on its own. But for that you need ML performance and bandwidth... which means needing a beefy GPU in the first place. I am not sure this tech is suitable for entry level; I have a suspicion that it will stay reserved for the high end for some time.



API doesn't matter. Nvidia's upscaling technology is based on a) custom API extensions and b) data for training the NN model and c) excellent ML performance of their GPUs. They use supercomputers to gather high-resolution game image data and build the upscaling network. They also seem to have collected plenty of data to quickly support new games.

But there is nothing special about any of this. If you want, you can build an autoencoder network for your iOS game, train it with high-res output and roll your own ML upscaling. Not sure whether it will be worth the effort though.

By the way, this kind of stuff is obligatory for RT. Ray tracing produces "grainy" images (since you are casting individual rays), so you need to use a filter that produces a smooth image.


DLSS 2.0 already uses general ML samples.

It literally came out 8 months ago and they improved it a lot from 1.0. Expect new options for internal/target resolutions.

Regarding the Tensor Cores, sure, you pay a premium for that. Like you said, the Tensor Cores aren't just for upscaling; they have broader uses, like medical applications and biological research. Those GPUs are not toys.

The thing is, the next Nintendo Switch is rumored to have DLSS 2.0. If they're including Tensor Cores on a low-end Tegra, or doing something half-baked that still works, this sounds very low end to me.
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,522
19,679
DLSS 2.0 already uses general ML samples.

It literally came out 8 months ago and they improved it a lot from 1.0. Expect new options for internal/target resolutions.

Regarding the Tensor Cores, sure, you pay a premium for that. Like you said, the Tensor Cores aren't just for upscaling; they have broader uses, like medical applications and biological research. Those GPUs are not toys.

The thing is, the next Nintendo Switch is rumored to have DLSS 2.0. If they're including Tensor Cores on a low-end Tegra, or doing something half-baked that still works, this sounds very low end to me.

You are certainly right about one thing: this is exciting technology! And it could be a good reason for including ML coprocessors in consumer computer hardware.

I agree with you that it might be a good fit for Apple because of the unified memory architecture and the fact that Apple already has good ML acceleration. What makes me a bit skeptical is the question of the minimal ML performance required for it to be useful, and the data needed to train the model. A gaming company such as Nvidia or AMD is probably in a better position to collect this data, and Nvidia has been doing it for a long time...

Then again, Apple already includes a bunch of interesting filters with Metal Performance Shaders; I can see them adding a temporal upscaling of some sort... has the algorithm been published? It's not my area of expertise, unfortunately...
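
Just to illustrate the "roll your own upscaling" idea from earlier, here is a minimal PyTorch sketch of training a small convolutional network on (low-res, high-res) frame pairs. The layer sizes, loss and data are placeholders I made up; this is not DLSS and not anything Apple ships, just the general shape of the approach.

```python
# Minimal sketch of a learned 2x upscaler: train a small convolutional network
# to map low-res frames to high-res ones. Everything here is illustrative.
import torch
import torch.nn as nn

class TinyUpscaler(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            # PixelShuffle turns 12 channels into 3 channels at 2x resolution.
            nn.Conv2d(32, 3 * 4, kernel_size=3, padding=1),
            nn.PixelShuffle(2),
        )

    def forward(self, low_res):
        return self.net(low_res)

model = TinyUpscaler()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()

# Placeholder training pair: in practice you would render the same frames at
# low and high resolution and feed (low, high) pairs from your game.
low = torch.rand(8, 3, 270, 480)    # e.g. 480x270 input frames
high = torch.rand(8, 3, 540, 960)   # matching 960x540 ground truth

for step in range(10):
    pred = model(low)
    loss = loss_fn(pred, high)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```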
 
Last edited:

JMacHack

Suspended
Mar 16, 2017
1,965
2,424
API doesn't matter. Nvidia's upscaling technology is based on a) custom API extensions and b) data for training the NN model and c) excellent ML performance of their GPUs. They use supercomputers to gather high-resolution game image data and build the upscaling network. They also seem to have collected plenty of data to quickly support new games.

But there is nothing special about any of this. If you want, you can build an autoencoder network for your iOS game, train it with high-res output and roll your own ML upscaling. Not sure whether it will be worth the effort though.

By the way, this kind of stuff is obligatory for RT. Ray tracing produces "grainy" images (since you are casting individual rays), so you need to use a filter that produces a smooth image.
I see, so (in theory) they're halfway there with the ML cores on their SoCs. If they had dedicated RT cores, they could do HWRT then, right? (which I assume they're working on, since Metal has support for it)
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,522
19,679
I see, so (in theory) they're halfway there with the ML cores on their SoCs. If they had dedicated RT cores, they could do HWRT then, right? (which I assume they're working on, since Metal has support for it)

I don't think that having a good denoising filter qualifies as being "halfway there" with respect to hardware ray tracing ;) Denoising is part of Metal Performance Shaders and it's fast. Hardware-accelerated RT is a completely different challenge. In the meantime you can already use RT with Metal, it just won't be too fast. But you can literally download their demo application, compile it on your Big Sur machine and run it ;)
 

diamond.g

macrumors G4
Mar 20, 2007
11,437
2,665
OBX
I see, so (in theory) they're halfway there with the ML cores on their SoCs. If they had dedicated RT cores, they could do HWRT then, right? (which I assume they're working on, since Metal has support for it)
The ML upscaling is a side issue for RT. In essence it allows you to render at lower resolutions for performance gains, yet still have output that looks close to native.

@leman So there really isn't any reason why they cannot use the existing cores to "HW accelerate" RT. You just potentially take a hit in rendering output, right? As RT takes up power that could be used towards rendering pixels. Or does TBDR let you cheat that part?
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,522
19,679
@leman So there really isn't any reason why they cannot use the existing cores to "HW accelerate" RT. You just potentially take a hit in rendering output, right? As RT takes up power that could be used towards rendering pixels. Or does TBDR let you cheat that part?

I don't think that I am qualified to answer these questions. I wrote a ray-tracer for fun when I was a teenager (who didn't, right?), but I have no clue how that stuff would work in hardware.

If you want my layman opinion... it kind of depends on what you mean by "HW accelerate". In the end, we don't care much about the engineering details as long as the result is fast, right? As I understand it, currently ray tracing in Metal is done in "software", using the general-purpose GPU shader units (the same ones that do pixel shading, vertex shading and any compute work). To make it "fast", however, they would need to add some sort of fixed-function hardware dedicated to RT, like Nvidia did. I have no idea how Nvidia's RT cores work or what that would entail. I don't even know whether one needs dedicated "RT cores" or just some sort of additional instructions for the general-purpose cores.
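
For a feel of what the per-ray work actually is, here is the classic Möller–Trumbore ray/triangle test written out in plain Python. A "software" ray tracer ends up evaluating something like this (plus BVH traversal to cull triangles) for an enormous number of rays every frame, and that is the part dedicated RT hardware is meant to offload. Purely illustrative, of course, not how Metal or anyone's GPU implements it.

```python
import numpy as np

def ray_triangle_intersect(origin, direction, v0, v1, v2, eps=1e-8):
    """Möller–Trumbore ray/triangle intersection.
    Returns the distance t along the ray, or None if there is no hit."""
    edge1 = v1 - v0
    edge2 = v2 - v0
    pvec = np.cross(direction, edge2)
    det = np.dot(edge1, pvec)
    if abs(det) < eps:          # ray is parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    tvec = origin - v0
    u = np.dot(tvec, pvec) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    qvec = np.cross(tvec, edge1)
    v = np.dot(direction, qvec) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = np.dot(edge2, qvec) * inv_det
    return t if t > eps else None

# One ray cast straight down the z-axis against a triangle in the z=1 plane:
hit = ray_triangle_intersect(
    np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]),
    np.array([-1.0, -1.0, 1.0]), np.array([1.0, -1.0, 1.0]), np.array([0.0, 1.0, 1.0]),
)
print(hit)  # ~1.0
```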
 

EntropyQ3

macrumors 6502a
Mar 20, 2009
718
824
Interesting! According to GB4 memory benchmark tests, RAM bandwidth of A14 devices is 50% higher. That's quite a big jump! But we also don't know how GB does memory tests... maybe we are seeing some effects of the increased cache in the A14...

At any rate, the numbers are not bad — they are comparable with the 13" MBP that uses quad-channel 3733 MHz LPDDR4X (at the same time, in multi-core Ice Lake bandwidth seems to be significantly higher).
We DO know how Geekbench does memory tests! John Poole is quite open.
The results from the iPhone 12 are consistent though. In bandwidth they show results that are close to the 2018 iPad Pro, for instance, much closer than to earlier iPhones. Mem-copy is 20% or so lower, and latencies are better.
I can't really reconcile the results with a 64-bit interface to LPDDR4-4266, which is what the part number indicates. While it's within the theoretical limits of such an interface, the utilization would have to be incredibly good. It matches the numbers from the Huawei Mate 40 series, which we know has an LPDDR5/LPDDR4X interface, and where the Pro+ model has been assumed to use LPDDR5. God knows what's going on. But something has changed, and not by a small margin.
 

PortoMavericks

macrumors 6502
Jun 23, 2016
288
353
Gotham City
You are certainly right about one thing: this is exciting technology! And it could be a good reason for including ML coprocessors in consumer computer hardware.

I agree with you that it might be a good fit for Apple because of the unified memory architecture and the fact that Apple already has good ML acceleration. What makes me a bit skeptical is the question of the minimal ML performance required for it to be useful, and the data needed to train the model. A gaming company such as Nvidia or AMD is probably in a better position to collect this data, and Nvidia has been doing it for a long time...

Then again, Apple already includes a bunch of interesting filters with Metal Performance Shaders; I can see them adding a temporal upscaling of some sort... has the algorithm been published? It's not my area of expertise, unfortunately...
That’s something I definitely agree with.

Apple and Qualcomm are both in a good position to offer something in that field, since AMD runs the ML stuff on the shaders (please correct me if I'm wrong at any point along the way).
 

EntropyQ3

macrumors 6502a
Mar 20, 2009
718
824
Andrei Frumusanu, who writes the architectural articles at Anandtech, just said that the raw memory subsystem, 64-bit LPDDR4-4266, is unchanged, and that the GB4 score change is due to changes on the SoC. Which makes it difficult, without deeper knowledge, to predict how these changes affect codes in general (as opposed to a straightforward memory bandwidth increase).
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,522
19,679
Andrei Frumusanu, who writes the architectural articles at Anandtech, just said that the raw memory subsystem, 64-bit LPDDR4-4266, is unchanged, and that the GB4 score change is due to changes on the SoC. Which makes it difficult, without deeper knowledge, to predict how these changes affect codes in general (as opposed to a straightforward memory bandwidth increase).

Where did he say it? Can't find anything on Anandtech or on Twitter...
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,867
Interesting! According to GB4 memory benchmark tests, RAM bandwidth of A14 devices is 50% higher. That's quite a big jump! But we also don't know how GB does memory tests... maybe we are seeing some effects of the increased cache in the A14...

At any rate, the numbers are not bad — they are comparable with the 13" MBP that uses quad-channel 3733 MHz LPDDR4X (at the same time, in multi-core Ice Lake bandwidth seems to be significantly higher).

It's only dual channel 3733 MHz LPDDR4X in that MBP. Intel doesn't do quad channel on notebook parts AFAIK.

Who has published GB4 numbers for A14? I haven't been able to find any.
 

PortoMavericks

macrumors 6502
Jun 23, 2016
288
353
Gotham City

Imagination Announces B-Series GPU IP: Scaling up with Multi-GPU

by Andrei Frumusanu on October 13, 2020 4:00 AM EST


Beyond confirming that the new C-Series will have ray tracing capabilities, Imagination further confirms that this will be an implementation using the company’s fullest of capabilities, including BVH processing and Coherency Sorting in hardware, a capability the company denotes as a “Level 4” ray tracing implementation, which would be more advanced than what current generation Nvidia and AMD GPUs are able to achieve at a “Level 3”.


The question is, if their IP is so wonderful, why is no one working with them? This is really puzzling; their claims do not fit reality.
 
Last edited:
  • Like
Reactions: ETN3 and diamond.g

leman

macrumors Core
Original poster
Oct 14, 2008
19,522
19,679
It's only dual channel 3733 MHz LPDDR4X in that MBP. Intel doesn't do quad channel on notebook parts AFAIK.

It's a really confusing subject since DDR4 works so much differently from LPDDR4. I spent some time trying to understand it, and while I'm not sure that I got it 100% right, the gist of it is that the LPDDR4 memory channel width is only 32 bits as opposed to DDR4's 64 bits. So LPDDR4 has more bandwidth per bit, but fewer bits, so to speak. To get the most out of it, high-performance implementations use quad-channel configurations. This includes Intel Ice Lake and, I think, chips like the A12X. That's why LPDDR4 can outperform DDR4, and that's one reason why such configurations are more expensive.

Who has published GB4 numbers for A14? I haven't been able to find any.

It's in the GB database. Look for iPhone13
 

EntropyQ3

macrumors 6502a
Mar 20, 2009
718
824
Where did he say it? Can't find anything on Anandtech or on Twitter...
I put my questions elsewhere.
He is going to do his now almost traditional article, which typically includes quite robust research into the memory subsystem, and I'll wait for that rather than trying to pump him for more info. But he confirmed that the memory system is what it physically looks like.
Now, improvements in the SoC's access to that memory are obviously also useful, but they are harder to evaluate for different workloads even when you know exactly what has changed and how it affects the benchmark code (and we don't). It's far easier to extrapolate what, say, a doubling of the bus width means for different codes.
 

thenewperson

macrumors 6502a
Mar 27, 2011
992
912
The question is, if their IP is so wonderful, why is no one working with them? This is really puzzling; their claims do not fit reality
Apple is, though it's unknown whether they plan to use these announced IPs. However for everyone else, I'm guessing it requires moving from TBIM to TBDR architectures which I can't see them doing any time soon.
 

EntropyQ3

macrumors 6502a
Mar 20, 2009
718
824
Apple is, though it's unknown whether they plan to use these announced IPs. However for everyone else, I'm guessing it requires moving from TBIM to TBDR architectures which I can't see them doing any time soon.
Img Tech has challenges in selling its IP for several reasons.
One is that ARM has been offering their own Mali (ex-Falanx) solutions. The smallest IP buyers are basically out of the market today; the cost of producing SoCs on reasonably modern lithography requires fairly large production volumes to be viable. Small players find it easier to mix and match ARM IP building blocks (with traffic between functional blocks as part of the IP stack) rather than having to integrate a large and rather demanding block from another supplier. The large mobile SoC producers prefer to have their own solutions, because there are advantages to not being dependent on an outside source for something that can otherwise be a competitive advantage, plus issues with coordination between development of the CPU/GPU/NPU/buses/et cetera. And of course there is the risk that a competitor simply buys the supplier of critical IP.

Personal Opinion: Img depended too much on Apple lining their coffers, and rather than pushing hard in developing their core product and courting other customers, they diversified into doomed side projects and lost momentum in their main market. Mismanagement, basically.
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,867
It's a really confusing subject since DDR4 works so much differently from LPDDR4. I spent some time trying to understand it, and while I'm not sure that I got it 100% right, the gist of it is that the LPDDR4 memory channel width is only 32 bits as opposed to DDR4's 64 bits. So LPDDR4 has more bandwidth per bit, but fewer bits, so to speak. To get the most out of it, high-performance implementations use quad-channel configurations. This includes Intel Ice Lake and, I think, chips like the A12X. That's why LPDDR4 can outperform DDR4, and that's one reason why such configurations are more expensive.

It's in the GB database. Look for iPhone13

Ah, thank you. I was searching on things like "A14" and getting nowhere.

About LPDDR4 - If you look at ark.intel.com, the processors which support LPDDR4 also support 2-channel DDR4. I interpret this to mean (given Intel's norms) that these chips have a total of 128 data bits, and two command channels (one per 64-bit DDR4 DIMM).

The stuff you're having trouble with is that, unlike normal DDR4, the LPDDR4 spec focuses exclusively on the PoP packaging used in cellphones and some tablets. DDR3/4 memory chips traditionally have only one command channel per chip package. A LPDDR4 PoP package has 64 data bits, and four command channels. Each command channel governs 16 data bits, not 32. SoC designers can choose a variety of ways to utilize the DQ (data) bits and CA (command-address) channels. You can gang all the CAs up and run them in lockstep, meaning it functions as if it's a 1-channel 64-bit wide memory. You can run each CA channel independently, so it's like four 16-bit wide memories, or 2x32. There's a total of eight ways to do it, so it's easy to get confused because you might find conflicting info based on how individual implementations did it.

Some of these modes reduce bandwidth per bit, but none of them can ever increase it. The purpose of more channels in a single PoP package is to increase the number of active pages (better for random access performance).

You can still make the calculation "Mbps per pin * # data pins" to figure out the theoretical max bandwidth. iFixit's teardown found this LPDDR4 PoP in the iPhone 12:


It's 64 bits with a maximum clock frequency of 2133 MHz = 4266 Mbps per pin. 64*4266 = 273,024 Mbps = 34.128 GB/s. GB4 measures 29.0 GB/s, which is about 85% of theoretical max.

That's actually pretty reasonable. If you're doing nothing but stupidly linear access patterns, it's possible to achieve numbers as high as 90% on plain old DDR3. Now, is it impressive that GB4 can do this while the entire rest of the system in the SoC is also injecting its own memory accesses in the mix? Yes. Quite impressive. Speaks well of the hit rate on Apple's whole-SoC last-level cache.
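
For the record, here is the same back-of-the-envelope math as a tiny Python snippet, just to make the unit conversions explicit (decimal GB/s, matching the figures above):

```python
# Theoretical peak bandwidth: data pins * Mbit/s per pin, converted to GB/s.
def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_mbps_per_pin: float) -> float:
    megabits_per_s = bus_width_bits * data_rate_mbps_per_pin
    return megabits_per_s / 8 / 1000  # Mbit/s -> MB/s -> GB/s (decimal units)

iphone12_peak = peak_bandwidth_gb_s(64, 4266)   # ~34.1 GB/s theoretical
measured = 29.0                                 # GB4 memory bandwidth result
print(f"peak {iphone12_peak:.1f} GB/s, measured {measured} GB/s, "
      f"efficiency {measured / iphone12_peak:.0%}")
```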
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,522
19,679
It's 64 bits with a maximum clock frequency of 2133 MHz = 4266 Mbps per pin. 64*4266 = 273,024 Mbps = 34.128 GB/s. GB4 measures 29.0 GB/s, which is about 85% of theoretical max.

What I mean is that some newer CPUs can run LPDDR4 over a 128-bit bus, with corresponding performance improvements. E.g. the MBP 13" with Ice Lake:

https://browser.geekbench.com/v4/cpu/15830945 (43.2 GB/sec)

Compare this to the 128-bit DDR4

https://browser.geekbench.com/v4/cpu/15839891 (25.6GB/sec)
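
Running the same back-of-the-envelope arithmetic as in the previous post, and assuming that MBP really is a 128-bit LPDDR4X-3733 configuration (an assumption, not a spec-sheet fact), the theoretical peak is about 59.7 GB/s, so the 43.2 GB/s result would be roughly 72% of peak:

```python
# Same arithmetic as before, assuming a 128-bit LPDDR4X-3733 interface
# for the Ice Lake MBP (an assumption based on this thread, not a spec sheet).
def peak_bandwidth_gb_s(bus_width_bits, data_rate_mbps_per_pin):
    return bus_width_bits * data_rate_mbps_per_pin / 8 / 1000  # -> GB/s

peak = peak_bandwidth_gb_s(128, 3733)   # ~59.7 GB/s theoretical
print(f"{peak:.1f} GB/s peak, 43.2 GB/s measured -> {43.2 / peak:.0%} of peak")
```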
 
  • Love
Reactions: Andropov

PortoMavericks

macrumors 6502
Jun 23, 2016
288
353
Gotham City



Microsoft is claiming that the AMD ML solution uses a very small area on the die.

Also, they're claiming it's a DirectX feature, but it's instead an open solution, with Sony on board as well. When I say open, I mean Nvidia could use it on their Tensor Cores if they want to.
 

diamond.g

macrumors G4
Mar 20, 2007
11,437
2,665
OBX
Word on the street is that AMD's RT implementation is different from Nvidia's.
I wonder if an Apple Core is the same as a Shader Engine for AMD.
 

  • Like
Reactions: PortoMavericks

Unregistered 4U

macrumors G4
Jul 22, 2002
10,610
8,629
Word on the street is that AMD's RT implementation is different from Nvidia's.
I wonder if an Apple Core is the same as a Shader Engine for AMD.
And they announced that there are benefits (one of which is the CPU having direct access to GPU memory) that are only possible when you pair an AMD CPU with an AMD GPU. It's not like Apple's solution, but I think it's the next best thing.
 
  • Like
Reactions: 2Stepfan