But even disregarding all that — modern desktop GPUs already use tiling! So it must be worth it for them. Here is a screenshot from a tool I wrote to study rasterization patterns on Macs:
[Attachment: screenshot showing the rasterization pattern on an AMD Navi GPU]
Here, I am drawing two large overlapping triangles, yellow over red, while limiting the number of pixels shaded, on a MacBook Pro with an AMD Navi GPU. As you can see, the GPU tiles the screen into large rectangular areas which are rasterized in tile order (left to right, top to bottom). Inside the large tiles you get smaller square block artifacts. These blocks correspond to the invoked pixel shader groups: since Navi SIMD units are 32 wide, they need to receive work in batches of 32 items to achieve good hardware utilization. The blocks themselves are 8x8 pixels, so once the rasterizer has output an 8x8 block (64 pixels in total), the pixel shader is executed. The tiling capability of Navi is fairly limited: once I go over a certain number of triangles (I think it's about 512, but I need to verify that), it resets. I am sure the details are much more complex, but this gives us a general idea of how this stuff works.

Nvidia is similar (I can't test it since I don't have one, but I have seen results from other people), just with different tile and block sizes. Intel (at least in the Skylake generation) does not use tiling (see the other screenshot I have attached), and it also uses different SIMD dispatch blocks (4x4 pixels, which makes sense given that Intel Gen uses 4-wide SIMD).
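To make the traversal order concrete, here is a toy model in plain Python. It has nothing to do with the actual hardware or driver: the tile size is a made-up placeholder, it ignores triangle coverage and just walks the whole screen, and only the 8x8 dispatch block matches what I measured. Capping the number of shaded pixels, like the tool does, is what exposes this order on screen:

```python
# Toy model of the rasterization order described above. Not real driver
# behaviour: the tile size is a made-up placeholder, only the 8x8
# dispatch block matches what I measured on Navi.

TILE_W, TILE_H = 256, 256     # large rectangular tiles (placeholder size)
BLOCK = 8                     # 8x8 pixel shader dispatch blocks
SCREEN_W, SCREEN_H = 1024, 512

def shading_order(budget):
    """Yield (x, y) pixels in the order a tiled immediate-mode
    rasterizer would shade them, stopping once `budget` is reached."""
    shaded = 0
    # Tiles are walked left to right, top to bottom...
    for ty in range(0, SCREEN_H, TILE_H):
        for tx in range(0, SCREEN_W, TILE_W):
            # ...and inside each tile, pixels are dispatched to the
            # pixel shader in 8x8 blocks.
            for by in range(ty, min(ty + TILE_H, SCREEN_H), BLOCK):
                for bx in range(tx, min(tx + TILE_W, SCREEN_W), BLOCK):
                    for y in range(by, by + BLOCK):
                        for x in range(bx, bx + BLOCK):
                            if shaded >= budget:
                                return
                            shaded += 1
                            yield (x, y)

# Limiting the pixel budget exposes the traversal order: the first
# pixels all come from the top-left tile's first 8x8 block.
print(list(shading_order(budget=4)))   # [(0, 0), (1, 0), (2, 0), (3, 0)]
```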
As you can see, tiling must be cheap and beneficial enough that big players like Nvidia and AMD use it as well. But while they use tiled rendering, it is not deferred rendering: you can see the bottom red triangle being drawn before the yellow triangle paints over it. Every pixel in the overlap gets shaded twice. On a TBDR GPU (iPad, iPhone, etc.) you won't ever see any red pixels, because the pixel shader is never called for the occluded red fragments in the first place.
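The difference in shader work is easy to see with another toy sketch (again just illustrative Python, with made-up squares standing in for the triangles): an immediate-mode renderer, tiled or not, invokes the pixel shader for every covered pixel of every triangle, while a TBDR GPU resolves visibility per tile first, so occluded fragments never reach the pixel shader at all.

```python
def immediate_mode(pixels_red, pixels_yellow):
    # Every covered pixel of every triangle invokes the pixel shader,
    # in submission order: red first, then yellow on top of it.
    invocations = len(pixels_red) + len(pixels_yellow)
    framebuffer = {p: "red" for p in pixels_red}
    framebuffer.update({p: "yellow" for p in pixels_yellow})
    return invocations, framebuffer

def tbdr(pixels_red, pixels_yellow):
    # Per-pixel visibility is resolved before shading; only the
    # surviving (frontmost) fragment of each pixel is ever shaded.
    framebuffer = {p: "red" for p in pixels_red}
    framebuffer.update({p: "yellow" for p in pixels_yellow})
    return len(framebuffer), framebuffer

# Two overlapping "triangles" (squares, to keep the toy simple).
red = {(x, y) for x in range(4) for y in range(4)}
yellow = {(x, y) for x in range(2, 6) for y in range(2, 6)}

imr_invocations, imr_fb = immediate_mode(red, yellow)
tbdr_invocations, tbdr_fb = tbdr(red, yellow)
print(imr_invocations)      # 32: pixels in the overlap are shaded twice
print(tbdr_invocations)     # 28: every visible pixel is shaded exactly once
print(imr_fb == tbdr_fb)    # True: the final image is identical either way
```

The final image is the same either way; the saving on a TBDR GPU is entirely in the shading (and bandwidth) spent on fragments that end up hidden.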