
diamond.g

macrumors G4
Mar 20, 2007
11,437
2,665
OBX
ML upscaling is not a magical solution. It requires explicit game support (as the game needs to provide additional data to the NN), it requires the training infrastructure, and it requires local ML performance. Apple's ML accelerators are fast, but they are optimized for mobile use — tensor cores in something like an RTX 2060 are still around 5 times faster (and an RTX 3080 is a whopping 25 times faster for ML). And of course, you need a lot of training data, which Nvidia has been collecting for years. Apple — not so much.

Generally I'd wager that tweaking your engine to take explicit advantage of TBDR is not more complicated than supporting ML upscaling. And you don't need to deal with the time-consuming and potentially expensive training step.

Not to mention that ML upscaling makes little sense for the resolutions you are talking about. The A14 already offers 2x the performance per watt of Nvidia's best. Apple GPUs won't have any problems dealing with 1080p resolutions. Where ML upscaling becomes really interesting is getting to resolutions that the GPU can't realistically do on its own. But for that you need ML performance and bandwidth... which means needing a beefy GPU in the first place. I am not sure this tech is suitable for entry level; I have a suspicion that it will stay reserved for the high end for some time.



API doesn't matter. Nvidia's upscaling technology is based on a) custom API extensions and b) data for training the NN model and c) excellent ML performance of their GPUs. They use supercomputers to gather high-resolution game image data and build the upscaling network. They also seem to have collected plenty of data to quickly support new games.

But there is nothing special about any of this. If you want, you can build an autoencoder network for your iOS game, train it with high-res output and roll your own ML upscaling. Not sure whether it will be worth the effort though.

By the way, this kind of stuff is obligatory for RT. Ray tracing produces "grainy" images (since you are casting individual rays), so you need to use a filter that produces a smooth image.
Is that because Ray Tracing isn’t done at full resolution?
 

onfire23

macrumors member
Oct 20, 2020
37
26
Seems that the iPhone 12 uses LPDDR4X RAM according to iFixit. Is Apple using quad channel for increased bandwidth?
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,522
19,679
Is that because Ray Tracing isn’t done at full resolution?

I think this is where the concept of resolution starts getting a bit fuzzy... RT works by simulating light rays in reverse: you basically cast a ray and trace its path back until it hits a light source. To get an accurate image, you need to cast multiple rays for each pixel (ideally A LOT of them), which becomes extremely expensive very quickly. Denoising techniques let you get away with fewer rays per pixel, improving performance.
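
To make that concrete, here is a toy Python sketch (made-up numbers, nothing to do with how Metal or any GPU actually implements RT) showing why low ray counts look grainy: the noise of a Monte Carlo pixel estimate only shrinks with the square root of the number of rays per pixel, which is why a good denoiser is such a big win.

```python
# Toy Monte Carlo estimate of a single pixel's brightness, to illustrate why
# path tracing looks "grainy" at low ray counts. Purely illustrative numbers.
import numpy as np

rng = np.random.default_rng(0)
TRUE_BRIGHTNESS = 0.5  # pretend the converged value of this pixel is 0.5

def render_pixel(rays_per_pixel: int) -> float:
    """Average the contributions of randomly sampled light paths.
    Each 'ray' returns a noisy sample whose expectation is TRUE_BRIGHTNESS."""
    samples = rng.uniform(0.0, 2 * TRUE_BRIGHTNESS, size=rays_per_pixel)
    return samples.mean()

for n in (1, 4, 16, 256):
    estimates = np.array([render_pixel(n) for _ in range(10_000)])
    # The standard deviation of the per-pixel estimate shrinks like 1/sqrt(n):
    print(f"{n:4d} rays/pixel -> noise (std dev) ~ {estimates.std():.3f}")
```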

Seems that the iPhone 12 uses LPDDR4X RAM according to iFixit. Is Apple using quad channel for increased bandwidth?

Interesting! According to GB4 memory benchmark tests, RAM bandwidth of A14 devices is 50% higher. That's quite a big jump! But we also don't know how GB does memory tests... maybe we are seeing some effects of the increased cache in the A14...

At any rate, the numbers are not bad — they are comparable with the 13" MBP that uses quad-channel 3733 MHz LPDDR4X (at the same time, in multi-core Ice Lake bandwidth seems to be significantly higher).
 
Last edited:

PortoMavericks

macrumors 6502
Jun 23, 2016
288
353
Gotham City
ML upscaling is not a magical solution. It requires explicit game support (as the game needs to provide additional data to the NN), it requires the training infrastructure, and it requires local ML performance. Apple's ML accelerators are fast, but they are optimized for mobile use — tensor cores in something like an RTX 2060 are still around 5 times faster (and an RTX 3080 is a whopping 25 times faster for ML). And of course, you need a lot of training data, which Nvidia has been collecting for years. Apple — not so much.

Generally I'd wager that tweaking your engine to take explicit advantage of TBDR is not more complicated than supporting ML upscaling. And you don't need to deal with the time-consuming and potentially expensive training step.

Not to mention that ML upscaling makes little sense for the resolutions you are talking about. The A14 already offers 2x the performance per watt of Nvidia's best. Apple GPUs won't have any problems dealing with 1080p resolutions. Where ML upscaling becomes really interesting is getting to resolutions that the GPU can't realistically do on its own. But for that you need ML performance and bandwidth... which means needing a beefy GPU in the first place. I am not sure this tech is suitable for entry level; I have a suspicion that it will stay reserved for the high end for some time.



API doesn't matter. Nvidia's upscaling technology is based on a) custom API extensions and b) data for training the NN model and c) excellent ML performance of their GPUs. They use supercomputers to gather high-resolution game image data and build the upscaling network. They also seem to have collected plenty of data to quickly support new games.

But there is nothing special about any of this. If you want, you can build an autoencoder network for your iOS game, train it with high-res output and roll your own ML upscaling. Not sure whether it will be worth the effort though.

By the way, this kind of stuff is obligatory for RT. Ray tracing produces "grainy" images (since you are casting individual rays), so you need to use a filter that produces a smooth image.


DLSS 2.0 already uses general ML samples.

It literally came out 8 months ago and they improved it a lot from 1.0. Expect new options for internal/target resolutions.

Regarding the Tensor Cores, sure, you pay a premium for that. Like you said, the Tensor Cores aren't just for upscaling; they have broader uses, like medical applications and biological research. Those GPUs are not toys.

The thing is, the next Nintendo Switch is rumored to have DLSS 2.0. If they're including Tensor Cores on a low-end Tegra, or doing something half-baked that still works, this sounds very low end to me.
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,522
19,679
DLSS 2.0 already uses general ML samples.

It literally came out 8 months ago and they improved it a lot from 1.0. Expect new options for internal/target resolutions.

Regarding the Tensor Cores, sure, you pay a premium for that. Like you said, the Tensor Cores aren't just for upscaling; they have broader uses, like medical applications and biological research. Those GPUs are not toys.

The thing is, the next Nintendo Switch is rumored to have DLSS 2.0. If they're including Tensor Cores on a low-end Tegra, or doing something half-baked that still works, this sounds very low end to me.

You are certainly right about one thing: this is exciting technology! And it could be a good reason for including ML coprocessors in consumer computer hardware.

I agree with you that it might be a good fit for Apple because of the unified memory architecture and the fact that Apple already has good ML acceleration. What makes me a bit skeptical is the question of the minimal ML performance required for it to be useful, and the data needed to train the model. A gaming company such as Nvidia or AMD is probably in a better position to collect this data, and Nvidia has been doing it for a long time...

Then again, Apple already includes a bunch of interesting filters with Metal Performance Shaders; I can see them adding a temporal upscaling of some sort... has the algorithm been published? It's not my area of expertise, unfortunately...
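
Just to illustrate the "roll your own upscaling" idea from earlier, here is a minimal PyTorch sketch of training a small convolutional network on (low-res, high-res) frame pairs. The layer sizes, loss and data are placeholders I made up; this is not DLSS and not anything Apple ships, just the general shape of the approach.

```python
# Minimal sketch of a learned 2x upscaler: train a small convolutional network
# to map low-res frames to high-res ones. Everything here is illustrative.
import torch
import torch.nn as nn

class TinyUpscaler(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            # PixelShuffle turns 12 channels into 3 channels at 2x resolution.
            nn.Conv2d(32, 3 * 4, kernel_size=3, padding=1),
            nn.PixelShuffle(2),
        )

    def forward(self, low_res):
        return self.net(low_res)

model = TinyUpscaler()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()

# Placeholder training pair: in practice you would render the same frames at
# low and high resolution and feed (low, high) pairs from your game.
low = torch.rand(8, 3, 270, 480)    # e.g. 480x270 input frames
high = torch.rand(8, 3, 540, 960)   # matching 960x540 ground truth

for step in range(10):
    pred = model(low)
    loss = loss_fn(pred, high)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```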
 
Last edited:

JMacHack

Suspended
Mar 16, 2017
1,965
2,424
API doesn't matter. Nvidia's upscaling technology is based on a) custom API extensions and b) data for training the NN model and c) excellent ML performance of their GPUs. They use supercomputers to gather high-resolution game image data and build the upscaling network. They also seem to have collected plenty of data to quickly support new games.

But there is nothing special about any of this. If you want, you can build an autoencoder network for your iOS game, train it with high-res output and roll your own ML upscaling. Not sure whether it will be worth the effort though.

By the way, this kind of stuff is obligatory for RT. Ray tracing produces "grainy" images (since you are casting individual rays), so you need to use a filter that produces a smooth image.
I see, so (in theory) they're halfway there with the ML cores on their SoCs. If they had dedicated RT cores, they could do HWRT then, right? (which I assume they're working on, since Metal has support for it)
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,522
19,679
I see, so (in theory) they're halfway there with the ML cores on their SoCs. If they had dedicated RT cores, they could do HWRT then, right? (which I assume they're working on, since Metal has support for it)

I don't think that having a good denoising filter qualifies as being "halfway there" with respect to hardware ray tracing ;) Denoising is part of Metal Performance Shaders and it's fast. Hardware-accelerated RT is a completely different challenge. In the meantime you can already use RT with Metal, it just won't be too fast. But you can literally download their demo application, compile it on your Big Sur machine and run it ;)
 

diamond.g

macrumors G4
Mar 20, 2007
11,437
2,665
OBX
I see, so (in theory) they're halfway there with the ML cores on their SoCs. If they had dedicated RT cores, they could do HWRT then, right? (which I assume they're working on, since Metal has support for it)
The ML upscaling is a side issue for RT. In essence it allows you to render at lower resolutions for performance gains, yet still have output that looks close to native.

@leman So there really isn't any reason why they cannot use the existing cores to "HW accelerate" RT. You just potentially take a hit in rendering output, right? As RT takes up power that could be used towards rendering pixels. Or does TBDR let you cheat that part?
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,522
19,679
@leman So there really isn't any reason why they cannot use the existing cores to "HW accelerate" RT. You just potentially take a hit in rendering output, right? As RT takes up power that could be used towards rendering pixels. Or does TBDR let you cheat that part?

I don't think that I am qualified to answer these questions. I wrote a ray-tracer for fun when I was a teenager (who didn't, right?), but I have no clue how that stuff would work in hardware.

If you want my layman opinion... it kind of depends on what you mean by "HW accelerate". In the end, we don't care much about the engineering details as long as the result is fast, right? As I understand it, currently ray tracing in Metal is done in "software", using the general-purpose GPU shader units (the same ones that do pixel shading, vertex shading and any compute work). To make it "fast", however, they would need to add some sort of fixed-function hardware dedicated to RT, like Nvidia did. I have no idea how Nvidia's RT cores work or what that would entail. I don't even know whether one needs dedicated "RT cores" or just some sort of additional instructions for the general-purpose cores.
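
For a feel of what the per-ray work actually is, here is the classic Möller–Trumbore ray/triangle test written out in plain Python. A "software" ray tracer ends up evaluating something like this (plus BVH traversal to cull triangles) for an enormous number of rays every frame, and that is the part dedicated RT hardware is meant to offload. Purely illustrative, of course, not how Metal or anyone's GPU implements it.

```python
import numpy as np

def ray_triangle_intersect(origin, direction, v0, v1, v2, eps=1e-8):
    """Möller–Trumbore ray/triangle intersection.
    Returns the distance t along the ray, or None if there is no hit."""
    edge1 = v1 - v0
    edge2 = v2 - v0
    pvec = np.cross(direction, edge2)
    det = np.dot(edge1, pvec)
    if abs(det) < eps:          # ray is parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    tvec = origin - v0
    u = np.dot(tvec, pvec) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    qvec = np.cross(tvec, edge1)
    v = np.dot(direction, qvec) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = np.dot(edge2, qvec) * inv_det
    return t if t > eps else None

# One ray cast straight down the z-axis against a triangle in the z=1 plane:
hit = ray_triangle_intersect(
    np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]),
    np.array([-1.0, -1.0, 1.0]), np.array([1.0, -1.0, 1.0]), np.array([0.0, 1.0, 1.0]),
)
print(hit)  # ~1.0
```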
 

EntropyQ3

macrumors 6502a
Mar 20, 2009
718
824
Interesting! According to GB4 memory benchmark tests, RAM bandwidth of A14 devices is 50% higher. That's quite a big jump! But we also don't know how GB does memory tests... maybe we are seeing some effects of the increased cache in the A14...

At any rate, the numbers are not bad — they are comparable with the 13" MBP that uses quad-channel 3733 MHz LPDDR4X (at the same time, in multi-core Ice Lake bandwidth seems to be significantly higher).
We DO know how Geekbench does memory tests! John Poole is quite open.
The results from the iPhone 12 are consistent though. In bandwidth they show results that are close to the 2018 iPad Pro, for instance, much closer than to earlier iPhones. Mem-copy is 20% or so lower, and latencies are better.
I can't really reconcile the results with a 64-bit interface to LPDDR4-4266, which is what the part number indicates. While it's within the theoretical limits of such an interface, the utilization would have to be incredibly good. It matches the numbers from the Huawei Mate 40 series, which we know has an LPDDR5/LPDDR4X interface, and where the Pro+ model has been assumed to use LPDDR5. God knows what's going on. But something has changed, and not by a small margin.
 

PortoMavericks

macrumors 6502
Jun 23, 2016
288
353
Gotham City
You are certainly right about one thing: this is exciting technology! And it could be a good reason for including ML coprocessors in consumer computer hardware.

I agree with you that it might be a good fit for Apple because of the unified memory architecture and the fact that Apple already has good ML acceleration. What makes me a bit skeptical is the question of the minimal ML performance required for it to be useful, and the data needed to train the model. A gaming company such as Nvidia or AMD is probably in a better position to collect this data, and Nvidia has been doing it for a long time...

Then again, Apple already includes a bunch of interesting filters with Metal Performance Shaders; I can see them adding a temporal upscaling of some sort... has the algorithm been published? It's not my area of expertise, unfortunately...
That’s something I definitely agree with.

Apple and Qualcomm are both in a good position to offer something in that field, since AMD runs the ML stuff on the shaders (please correct me if I'm wrong at any point along the way).
 

EntropyQ3

macrumors 6502a
Mar 20, 2009
718
824
Andrei Frumusanu, who writes the architectural articles at Anandtech, just said that the raw memory subsystem, 64-bit LPDDR4-4266, is unchanged, and that the GB4 score change is due to changes on the SoC. Which makes it difficult, without deeper knowledge, to predict how these changes affect codes in general (as opposed to a straightforward memory bandwidth increase).
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,522
19,679
Andrei Frumusanu, who writes the architectural articles at Anandtech, just said that the raw memory subsystem, 64-bit LPDDR4-4266, is unchanged, and that the GB4 score change is due to changes on the SoC. Which makes it difficult, without deeper knowledge, to predict how these changes affect codes in general (as opposed to a straightforward memory bandwidth increase).

Where did he say it? Can't find anything on Anandtech or on Twitter...
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,867
Interesting! According to GB4 memory benchmark tests, RAM bandwidth of A14 devices is 50% higher. That's quite a big jump! But we also don't know how GB does memory tests... maybe we are seeing some effects of the increased cache in the A14...

At any rate, the numbers are not bad — they are comparable with the 13" MBP that uses quad-channel 3733 MHz LPDDR4X (at the same time, in multi-core Ice Lake bandwidth seems to be significantly higher).

It's only dual channel 3733 MHz LPDDR4X in that MBP. Intel doesn't do quad channel on notebook parts AFAIK.

Who has published GB4 numbers for A14? I haven't been able to find any.
 

PortoMavericks

macrumors 6502
Jun 23, 2016
288
353
Gotham City

Imagination Announces B-Series GPU IP: Scaling up with Multi-GPU

by Andrei Frumusanu on October 13, 2020 4:00 AM EST


Beyond confirming that the new C-Series will have ray tracing capabilities, Imagination further confirms that this will be an implementation using the company’s fullest of capabilities, including BVH processing and Coherency Sorting in hardware, a capability the company denotes as a “Level 4” ray tracing implementation, which would be more advanced than what current generation Nvidia and AMD GPUs are able to achieve at a “Level 3”.


The question is, if their IP is so wonderful, why is no one working with them? This is really puzzling; their claims do not fit reality.
 
Last edited:
  • Like
Reactions: ETN3 and diamond.g

leman

macrumors Core
Original poster
Oct 14, 2008
19,522
19,679
It's only dual channel 3733 MHz LPDDR4X in that MBP. Intel doesn't do quad channel on notebook parts AFAIK.

It's a really confusing subject since DDR4 works so much differently from LPDDR4. I spent some time trying to understand it, and while I'm not sure that I got it 100% right, the gist of it is that the LPDDR4 memory channel width is only 32 bits as opposed to DDR4's 64 bits. So LPDDR4 has more bandwidth per bit, but fewer bits, so to speak. To get the most out of it, high-performance implementations use quad-channel configurations. This includes Intel Ice Lake and, I think, chips like the A12X. That's why LPDDR4 can outperform DDR4, and that's one reason why such configurations are more expensive.

Who has published GB4 numbers for A14? I haven't been able to find any.

It's in the GB database. Look for iPhone13
 

EntropyQ3

macrumors 6502a
Mar 20, 2009
718
824
Where did he say it? Can't find anything on Anandtech or on Twitter...
I put my questions elsewhere.
He is going to do his now almost traditional article, which typically includes quite robust research into the memory subsystem, and I'll wait for that rather than trying to pump him for more info. But he confirmed that the memory system is what it physically looks like.
Now, improvements in the SoC's access to that memory are obviously also useful, but they are harder to evaluate for different workloads even when you know exactly what has changed and how it affects the benchmark code (and we don't). It's far easier to extrapolate what, say, a doubling of the bus width means for different codes.
 

thenewperson

macrumors 6502a
Mar 27, 2011
992
912
The question is, if their IP is so wonderful, why is no one working with them? This is really puzzling; their claims do not fit reality
Apple is, though it's unknown whether they plan to use these announced IPs. However for everyone else, I'm guessing it requires moving from TBIM to TBDR architectures which I can't see them doing any time soon.
 

EntropyQ3

macrumors 6502a
Mar 20, 2009
718
824
Apple is, though it's unknown whether they plan to use these announced IPs. However for everyone else, I'm guessing it requires moving from TBIM to TBDR architectures which I can't see them doing any time soon.
Img Tech has challenges in selling its IP for several reasons.
One is that ARM has been offering their own Mali (ex-Falanx) solutions. The smallest IP buyers are basically out of the market today; the cost of producing SoCs on reasonably modern lithography requires fairly large production volumes to be viable. Small players find it easier to mix and match ARM IP building blocks (with traffic between functional blocks as part of the IP stack) rather than having to integrate a large and rather demanding block from another supplier. The large mobile SoC producers prefer to have their own solutions, because there are advantages to not being dependent on an outside source for something that can otherwise be a competitive advantage, plus issues with coordination between development of the CPU/GPU/NPU/buses/et cetera. And of course there is the risk that a competitor simply buys the supplier of critical IP.

Personal Opinion: Img depended too much on Apple lining their coffers, and rather than pushing hard in developing their core product and courting other customers, they diversified into doomed side projects and lost momentum in their main market. Mismanagement, basically.
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,867
It's a really confusing subject since DDR4 works so much differently from LPDDR4. I spent some time trying to understand it, and while I'm not sure that I got it 100% right, the gist of it is that the LPDDR4 memory channel width is only 32 bits as opposed to DDR4's 64 bits. So LPDDR4 has more bandwidth per bit, but fewer bits, so to speak. To get the most out of it, high-performance implementations use quad-channel configurations. This includes Intel Ice Lake and, I think, chips like the A12X. That's why LPDDR4 can outperform DDR4, and that's one reason why such configurations are more expensive.

It's in the GB database. Look for iPhone13

Ah, thank you. I was searching on things like "A14" and getting nowhere.

About LPDDR4 - If you look at ark.intel.com, the processors which support LPDDR4 also support 2-channel DDR4. I interpret this to mean (given Intel's norms) that these chips have a total of 128 data bits, and two command channels (one per 64-bit DDR4 DIMM).

The stuff you're having trouble with is that, unlike normal DDR4, the LPDDR4 spec focuses exclusively on the PoP packaging used in cellphones and some tablets. DDR3/4 memory chips traditionally have only one command channel per chip package. A LPDDR4 PoP package has 64 data bits, and four command channels. Each command channel governs 16 data bits, not 32. SoC designers can choose a variety of ways to utilize the DQ (data) bits and CA (command-address) channels. You can gang all the CAs up and run them in lockstep, meaning it functions as if it's a 1-channel 64-bit wide memory. You can run each CA channel independently, so it's like four 16-bit wide memories, or 2x32. There's a total of eight ways to do it, so it's easy to get confused because you might find conflicting info based on how individual implementations did it.

Some of these modes reduce bandwidth per bit, but none of them can ever increase it. The purpose of more channels in a single PoP package is to increase the number of active pages (better for random access performance).

You can still make the calculation "Mbps per pin * # data pins" to figure out the theoretical max bandwidth. iFixit's teardown found this LPDDR4 PoP in the iPhone 12:


It's 64 bits with a maximum clock frequency of 2133 MHz = 4266 Mbps per pin. 64*4266 = 273,024 Mbps = 34.128 GB/s. GB4 measures 29.0 GB/s, which is about 85% of theoretical max.

That's actually pretty reasonable. If you're doing nothing but stupidly linear access patterns, it's possible to achieve numbers as high as 90% on plain old DDR3. Now, is it impressive that GB4 can do this while the entire rest of the system in the SoC is also injecting its own memory accesses in the mix? Yes. Quite impressive. Speaks well of the hit rate on Apple's whole-SoC last-level cache.
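
For the record, here is the same back-of-the-envelope math as a tiny Python snippet, just to make the unit conversions explicit (decimal GB/s, matching the figures above):

```python
# Theoretical peak bandwidth: data pins * Mbit/s per pin, converted to GB/s.
def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_mbps_per_pin: float) -> float:
    megabits_per_s = bus_width_bits * data_rate_mbps_per_pin
    return megabits_per_s / 8 / 1000  # Mbit/s -> MB/s -> GB/s (decimal units)

iphone12_peak = peak_bandwidth_gb_s(64, 4266)   # ~34.1 GB/s theoretical
measured = 29.0                                 # GB4 memory bandwidth result
print(f"peak {iphone12_peak:.1f} GB/s, measured {measured} GB/s, "
      f"efficiency {measured / iphone12_peak:.0%}")
```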
 

leman

macrumors Core
Original poster
Oct 14, 2008
19,522
19,679
It's 64 bits with a maximum clock frequency of 2133 MHz = 4266 Mbps per pin. 64*4266 = 273,024 Mbps = 34.128 GB/s. GB4 measures 29.0 GB/s, which is about 85% of theoretical max.

What I mean is that some newer CPUs can run LPDDR4 over a 128-bit bus, with corresponding performance improvements. E.g. the MBP 13" with Ice Lake:

https://browser.geekbench.com/v4/cpu/15830945 (43.2 GB/sec)

Compare this to the 128-bit DDR4

https://browser.geekbench.com/v4/cpu/15839891 (25.6GB/sec)
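
Running the same back-of-the-envelope arithmetic as in the previous post, and assuming that MBP really is a 128-bit LPDDR4X-3733 configuration (an assumption, not a spec-sheet fact), the theoretical peak is about 59.7 GB/s, so the 43.2 GB/s result would be roughly 72% of peak:

```python
# Same arithmetic as before, assuming a 128-bit LPDDR4X-3733 interface
# for the Ice Lake MBP (an assumption based on this thread, not a spec sheet).
def peak_bandwidth_gb_s(bus_width_bits, data_rate_mbps_per_pin):
    return bus_width_bits * data_rate_mbps_per_pin / 8 / 1000  # -> GB/s

peak = peak_bandwidth_gb_s(128, 3733)   # ~59.7 GB/s theoretical
print(f"{peak:.1f} GB/s peak, 43.2 GB/s measured -> {43.2 / peak:.0%} of peak")
```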
 
  • Love
Reactions: Andropov

PortoMavericks

macrumors 6502
Jun 23, 2016
288
353
Gotham City



Microsoft is claiming that the AMD ML solution uses a very small area on the die.

Also, they're claiming it's a DirectX feature, but it's instead an open solution, with Sony on board as well. When I say open, I mean Nvidia could use it on their Tensor Cores if they want to.
 

diamond.g

macrumors G4
Mar 20, 2007
11,437
2,665
OBX
Word on the street is that AMD's RT implementation is different from Nvidia's.
I wonder if an Apple Core is the same as a Shader Engine for AMD.
 

  • Like
Reactions: PortoMavericks

Unregistered 4U

macrumors G4
Jul 22, 2002
10,610
8,629
Word on the street is that AMD's RT implementation is different from Nvidia's.
I wonder if an Apple Core is the same as a Shader Engine for AMD.
And they announced that there are benefits (one of which is the CPU having direct access to GPU memory) that are only possible when you pair an AMD CPU with an AMD GPU. It's not like Apple's solution, but I think it's the next best thing.
 
  • Like
Reactions: 2Stepfan