
Realityck

macrumors G4
Nov 9, 2015
11,414
17,205
Silicon Valley, CA
We have very little to go on regarding exactly how good Apple GPUs will be. That said, they seem to be leading the industry more often than not in phones and tablets, so that has to be a good sign. At the very least it's not a bad one.

The rumoured Apple games console and controller: it makes a lot of logical sense that Apple would want to move into this space. It's a missing component of the Apple ecosystem that now overlaps substantially with a lot of their existing gear and services.

Touch-friendly interface: it has been suggested Apple will bring touchscreens to Macs, and beyond that it has been suggested Apple has ideas about an AR interface. They have fancy LIDAR cameras already in use, and rumours persist of AR glasses, so it's not a huge stretch to think this is coming. While the haters will complain that AR and VR already exist elsewhere and no-one really uses them, it's standard form for Apple to tie a few existing technologies together into something so usable it's brilliant.
You brought up good points worth commenting on...

Yes, there is a lot of assumption about how fluid a highly complex, interactive game will be, as opposed to a beautiful walkthrough like the demo was.

The principal difference with iOS/iPadOS games is the control interface: touch vs. a console game controller vs. a keyboard and gaming mouse. The first doesn't demand much thought, I mean how much can you really do holding an iPhone or an iPad as far as user input goes?

A programmable console game controller is a major step up (and currently missing), but what is Apple offering, an Apple TV remote?

The traditional keyboard and gaming mouse with more involved games is where the Mac is more challenged, because the computers Apple makes are more akin to a laptop than a desktop in most respects. A desktop doesn't have to be thin or use mobile GPUs, and it should be easier to upgrade, not just sold as a sealed box.

Apple also needs to persuade game developers by being a proponent of games across the Apple lineup, not pitch iOS/tvOS/iPadOS Apple Arcade as the only solution. I would also rant about how poor the App Store is at listing more advanced games that would use a game controller or keyboard and gaming mouse.

So there is a lot more to this than just claiming ARM will be fast and offer respectable graphics at lower temperatures. Does anyone really want to say their game box runs super cool as something to boast about? :p
 
Last edited:

leman

macrumors Core
Oct 14, 2008
19,521
19,677
Yes you can, not to mention you need to keep all of that in VRAM. In this case, Apple Silicon will use unified memory and likely the LPDDR5, which is slow.

I get you: you'll come back and say something about the vertex data and that the cache will have zero impact because it's all on the same silicon.

It’s a pipe dream, the Mac line will get a huge performance boost over the current Intel line but it will never, ever reach the level of performance of a discrete GPU for gaming.

Like, never.

Not sure what you are trying to say. Is this about the speed of the system RAM? LPDDR5 is comparable to GDDR4, so that has the lower end covered. The higher end can use HBM or something similar. The iPad Pro already offers performance competitive with the GTX 1050, so I don't see why an improved architecture with more cores and faster RAM wouldn't be able to compete with mid-range discrete GPUs.
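For a rough sense of the bandwidth side of this, here is a back-of-the-envelope comparison; the transfer rates and bus widths below are illustrative assumptions on my part, not figures quoted anywhere in this thread:

```swift
// Peak bandwidth (GB/s) = transfer rate (GT/s) x bus width (bits) / 8
func peakBandwidthGBps(gigaTransfersPerSec: Double, busWidthBits: Double) -> Double {
    gigaTransfersPerSec * busWidthBits / 8.0
}

let lpddr5 = peakBandwidthGBps(gigaTransfersPerSec: 6.4, busWidthBits: 128)    // ~102 GB/s
let gddr6  = peakBandwidthGBps(gigaTransfersPerSec: 14.0, busWidthBits: 128)   // ~224 GB/s
let hbm2   = peakBandwidthGBps(gigaTransfersPerSec: 2.4, busWidthBits: 1024)   // ~307 GB/s per stack
print(lpddr5, gddr6, hbm2)
```

Under those assumed numbers, a wide LPDDR5 setup sits below a typical GDDR6 card while HBM closes the gap, which is why the "higher end can use HBM" point matters.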


Now, mesh shaders have the advantage that the produced geometry is immediately fed to the rasterizer. I don't think that Apple currently has an answer to that. Also, an Apple GPU needs to marshal the geometry through memory anyway, since that's where the binning data is stored. So yes, if you want to produce excessive amounts of small geometry on the GPU, with little overdraw, mesh shaders on Nvidia hardware would do wonders. Except of course that mesh shaders can only produce a limited amount of data (it has to stay in cache after all), so we are back to square one...
 

jerwin

Suspended
Jun 13, 2015
2,895
4,652
Will keep my iMac with Mojave as long as I can, then switch back to Windows eventually.
Loss of 32-bit is annoying, particularly if you still enjoy playing the classics. Shadow of the Tomb Raider and a couple of other titles require Catalina. They are ports, though, so Boot Camp remains a possibility.
 
  • Like
Reactions: Rashy

theorist9

macrumors 68040
May 28, 2015
3,880
3,060
I wouldn't call it a microarchitecture difference*; it's an algorithmic difference. I'm quite excited to get TBDR on the desktop, especially one with programmable GPU cache as offered by Apple. It allows one to utilize the hardware in much more efficient ways.
*IMR vs. TBDR

I wasn't sure myself, but listed it as microarchitecture because that's how Rys Sommefeldt at Imagination described it in a blog (https://www.imgtec.com/blog/a-look-at-the-powervr-graphics-architecture-tile-based-rendering/). I think he did so because, at least in their design, the algorithm is implemented by the microarchitecture (as he puts it, it's "baked into the hardware"). I take this to mean that they've designed the microarchitecture specifically to execute this algorithm, as is the case with a Field Programmable Gate Array (FPGA) or a programmable Application-Specific Integrated Circuit (ASIC).

So it sounds like it is an algorithmic difference rather than a microarchitectural difference, but that the differences in microarchitecture follow from this.

****
Having said that, NVIDIA and AMD have world-class graphics experts. Doesn't the fact that they use IMR (even though TBDR is more efficient) indicate (or at least strongly suggest) that TBDR in fact can't do everything IMR can do?
 
Last edited:

Waragainstsleep

macrumors 6502a
Oct 15, 2003
612
221
UK
Having said that, NVIDIA and AMD have world-class graphics experts. Doesn't the fact that they use IMR (even though TBDR is more efficient) indicate (or at least strongly suggest) that TBDR in fact can't do everything IMR can do?

Nvidia and AMD have been doing what they are doing for a very long time. Much like Intel. You only change your entire architecture when you hit a wall you can't get through. Unless you're Apple maybe.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,677
*IMR vs. TBDR

I wasn't sure myself, but listed it as a microarchitectural difference because that's how one of Imagination's engineers described it in an article. But let's suppose it is, more precisely, algorithmic (which I take to mean a difference in software/firmware rather than hardware). This raises some interesting questions:

1) Even if the difference between IMR and TBDR is algorithmic, does this necessitate corresponding microarchitecture differences—i.e., is the microarchitecture needed to run IMR different from TBDR?

2) If not, and if TBDR is uniformly superior to IMR, why have NVIDIA, AMD, and Intel all stayed with IMR?

I.e., NVIDIA and AMD have world-class graphics experts. Doesn't the fact that they've stuck with IMR (even though TBDR is more efficient) indicate that TBDR can't do everything IMR can do?

That’s an excellent question! First of all, let me clarify what I mean by “algorithmic”. I am merely referring to the fact that TBDR and IMR use different approaches (algorithms) to rendering. These algorithms themselves are indeed partially encoded in fixed-function hardware (I suppose this is what you mean by microarchitecture). I prefer not to use this term since it can mean any kind of hardware difference (e.g. two IMR GPUs can also have different microarchitectures).

Now, about TBDR's "superiority". TBDR has two decisive advantages. First, it does perfect hidden surface removal, meaning that you don't waste any fragment shading work (and fragment shading is usually the more expensive part). Second, it offers strong guarantees of memory coherence - since images are generated in tiles, you only need to consider the region of memory for a given tile for fragment shader dispatch. This is important as it optimizes external memory accesses, making the GPU less reliant on memory bandwidth. The combination of these two properties enables some interesting capabilities. For example, they guarantee that all shader invocations target a different pixel. This enables things like programmable blending (stuff that IMR can't do efficiently because of data races). Since the hardware knows which primitives are visible, it can do crazy stuff like data reordering to improve SIMD coherency (Apple mentioned it during WWDC). Also, since tiles are an actual physical resource, Apple exposes them to the developers, which in turn means that you can use ultra-fast on-chip cache to store custom data, run custom processing steps, and do custom multisampling - without ever needing to touch VRAM. Some traditionally heavy things like multisampling or G-buffer deferred rendering are much cheaper on Apple hardware if its features are properly utilized.
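To make the "tiles are a physical resource" point concrete, here is a minimal Metal sketch (host-side Swift, with sizes and names of my own choosing) of a G-buffer attachment that lives only in tile memory via memoryless storage, so the intermediate data is never written out to system memory:

```swift
import Metal

// Minimal sketch: a G-buffer attachment kept entirely in on-chip tile memory.
// .memoryless storage is only valid on TBDR (Apple-family) GPUs.
let device = MTLCreateSystemDefaultDevice()!

let gBufferDesc = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .rgba16Float,
                                                           width: 1920,
                                                           height: 1080,
                                                           mipmapped: false)
gBufferDesc.usage = .renderTarget
gBufferDesc.storageMode = .memoryless            // no backing allocation in RAM
let gBufferAlbedo = device.makeTexture(descriptor: gBufferDesc)!

let passDesc = MTLRenderPassDescriptor()
passDesc.colorAttachments[1].texture = gBufferAlbedo
passDesc.colorAttachments[1].loadAction = .clear     // produced on-tile...
passDesc.colorAttachments[1].storeAction = .dontCare // ...and discarded after the lighting step
```

The later lighting step then reads that attachment straight from tile memory in the fragment function, which is exactly the kind of guarantee an IMR can't give cheaply.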

A key disadvantage of TBDR is that it has to "sort" the data before drawing. This means a lot of prior processing and culling (not a big deal), but more importantly, this intermediate data has to be stored somewhere. A TBDR needs to do a round trip through memory, pre-processing all the geometry data before it can actually start drawing. An IMR can dispatch the geometry to the rasterizer at any time. For example, new techniques like mesh shaders are not a natural fit for TBDR. Mesh shaders generate the geometry on the GPU directly and let it be drawn without touching the VRAM. But a TBDR needs to sort the geometry, meaning a trip through memory. Then of course, TBDR is much more complicated - after all, all an IMR has to do is draw stuff, while a TBDR has to bin it, sort it, determine the visible pixels and only then draw. It's much more difficult to do properly, and the additional complexity tends to introduce bottlenecks.

So, why don't Nvidia or AMD use TBDR? I would say it's because they come from a different historical perspective. TBDR is associated with low-end mobile graphics, where optimizing the hell out of hardware resources is really important in order to get remotely acceptable performance. And since low-end mobile graphics is slow anyway, games written for it are low-poly, which means the "tricky" geometry sorting step can be done without much worry. Nvidia and co., however, come from the desktop, where power concerns are less of a worry. They can afford to brute-force the entire process, as it's OK for them to use power-hungry memory and a lot of shader cores. And of course, IMRs are simpler, meaning they are easier to scale to high performance. Finally, modern desktop GPUs have started to borrow some aspects of TBDR; for example, they use a limited form of tiling (without the sorting step) to optimize memory coherence. This was responsible for a big performance boost on Maxwell and Navi for example, and I'm sure it's coming to Intel Xe as well.

So basically, TBDR is a difficult engineering puzzle to get right. Apple is currently one of the few companies that has the required know-how, which they have inherited from Imagination. If they manage to solve the weak point of TBDR, which is the geometry front-end, they can certainly deliver very good performance with very low power consumption.

TLDR: TBDR is an inherently more efficient approach to rasterization and it offers strong guarantees that can be utilized for implementing advanced rendering techniques. Its drawbacks are high engineering complexity and the need to pre-process the geometry. Desktop GPUs don't use TBDR because they can afford to be more wasteful and because IMR is simpler to implement.
 

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,664
OBX
That's an excellent question! First of all, let me clarify what I mean by "algorithmic". [...]

TLDR: TBDR is an inherently more efficient approach to rasterization and it offers strong guarantees that can be utilized for implementing advanced rendering techniques. Its drawbacks are high engineering complexity and the need to pre-process the geometry.
Nice write up!
I was always under the impression that TBDR was patented to the gills by Imagination so Nvidia and ATI had no choice but to do IMR.

Wikipedia makes it seem like we would have gotten more desktop cards but STMicro stopped making them in 2001.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,677
Nice write up!
I was always under the impression that TBDR was patented to the gills by Imagination so Nvidia and ATI had no choice but to do IMR.

That's another aspect of the story, yes. I can't comment much on it since I still have difficulty understanding how patents work. Like, couldn't Nvidia do TBDR at all because of patents, or does it just mean they can't use the same approach as Imagination does?
 

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,664
OBX
That's another aspect of the story, yes. I can't comment much on it since I still have difficulty understanding how patents work. Like, couldn't Nvidia do TBDR at all because of patents, or does it just mean they can't use the same approach as Imagination does?
Yeah. I am not a lawyer, but my from-the-hip take is that they cannot use the same method. Looks like HSR and HST are what they patented, so others would have to find a different way of doing it.
 

jerwin

Suspended
Jun 13, 2015
2,895
4,652
Is that a bad thing? I don’t think so, with much more performance available on the entry line it can entice developers. I’m sure if I were running a studio I’d consider very seriously to bring my game to the Mac. The iPad not so much because of mandatory touch controls and in my opinion, this is something that Apple should rethink.

The Mac is distinguished across the line by high-definition displays. The iMac 27 has a stupidly high resolution. It's not a gimmick. It's actually useful, particularly for text. When I had better eyes, I could confidently compare it to a finely printed book full of glossy photographs that moved.

Imagine if your games could look like that! And yet, they don't. Sure, you might be able to find a 32-bit game (reinstalling Mojave), crank up all the settings to the ultra of the day, and play it at 5120x2880, and it would look pretty good. Crisp, even. A recent title? Perish the thought. Halving the resolution helps. Scaling the resolution by 1/3 helps even more, but then everything becomes a tad blurry. I'm ready for the next generation of improvements, while Apple is saying that my years-old graphics adapter is good enough, maybe even too good.
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,060
That's an excellent question! First of all, let me clarify what I mean by "algorithmic". [...]

TLDR: TBDR is an inherently more efficient approach to rasterization and it offers strong guarantees that can be utilized for implementing advanced rendering techniques. Its drawbacks are high engineering complexity and the need to pre-process the geometry.
Thanks! Really nice of you to take the time to write this up!

So is the additional front-end algorithmic complexity of TBDR hidden from the developers, or would it be something that increases the challenge of writing higher-end games?

Given the complexity of these systems, it seems the only way this question (whether it will be more difficult to produce games with high geometrical complexity with TBDR) will be answered is when devs try to write high-complexity games for AS and report on their experiences.

BTW, you'll see that a few minutes after my initial post (but before I saw your reply) I edited my post in a way that comports with what you wrote in your first paragraph.
 
  • Like
Reactions: unsui_grep

leman

macrumors Core
Oct 14, 2008
19,521
19,677
Thanks! Really nice of you to take the time to write this up!

Thank you for your kind words! And sorry for the typos, I wrote it all on my iPhone...

So is the additional front-end algorithmic complexity of TBDR hidden from the developers, or would it be something that increases the challenge of writing higher-end games?

I don't think it matters at all for developers. The same drawing code will produce identical results on IMR or TBDR. Developers who take advantage of special features exposed by TBDR might see improved performance. The rest is up to the given implementation.
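In Metal terms, "taking advantage" would typically sit behind a capability check; a minimal sketch of that idea, assuming you only enable the TBDR-specific path on Apple-family GPUs:

```swift
import Metal

guard let device = MTLCreateSystemDefaultDevice() else { fatalError("No Metal device") }

if device.supportsFamily(.apple4) {
    // Apple-family (TBDR) GPU: tile shaders, imageblocks, memoryless
    // attachments and programmable blending are available.
    print("Enabling TBDR-specific render path")
} else {
    // IMR GPU (e.g. a discrete AMD card in an Intel Mac): the same portable
    // drawing code still works, just without the tile-memory tricks.
    print("Using the portable render path")
}
```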


Given the complexity of these systems, it seems the only way this question (whether it will be more difficult to produce games with high geometrical complexity with TBDR) will be answered is when devs try to write high-complexity games for AS and report on their experiences.

Pretty much, yes. That's why I am excited about AS, it will show whether TBDR is viable on the desktop.


BTW, you'll see that a few minutes after my initial post (but before I saw your reply) I edited my post in a way that comports with what you wrote in your first paragraph.

I don't think it's in my competence to comment on that :) As to my personal thoughts, I believe that the big GPU companies never had to bother with these optimizations because they always had a higher power budget. I also don't believe that IMR or TBDR can do "different things"; after all, they are two different approaches to achieving the same thing. They just have different algorithmic complexity, and unique particularities that might enable some special features (like programmable blending or tile shaders on TBDR).
 
  • Like
Reactions: unsui_grep

theorist9

macrumors 68040
May 28, 2015
3,880
3,060
It's curious that the multi-core result is more than 4x the single-core result, given that others have reported the DTK uses only the four high-power cores, leaving the four low-power cores disabled (https://www.eejournal.com/article/whats-inside-apple-silicon-processors/). Maybe whatever they did to get GB5 running natively also allowed them to turn on all eight cores (?). Alas, the screenshot cuts off just above where GB reports the core count.

Is it possible that Rosetta 2 simply doesn't make use of the efficiency cores? Since there aren't any x86 Apple devices that have any.

Interesting idea, don't know.

I wouldn't read too much into that. The multi-core benchmark rarely scales linearly with the single-core number. Usually, it's less, but there are a couple of examples with multi-core > single-core x cores, for instance this one.
Yes, but outside of pathological cases, the non-linearity is always in the direction of the x-core result being less than x times faster than the single-core result. So unless there's an error in the GB benchmark such that it commonly mismeasures single- vs. multi-core results, I was thinking what you found probably represents an outlier, and is thus not a likely explanation for what we're seeing with the 9to5Mac result (https://9to5mac.com/2020/07/23/apple-silicon-benchmarks-apps/).

And, indeed, in revisiting that result, I just noticed that the number of cores is listed in their other screenshot (the one for the Metal score, which I wasn't looking at and thus missed). It is, as I originally speculated, using eight cores for the native GB test:

[Attached screenshot: Geekbench 5 result showing eight cores]


Basically, what the native scores tell us is that, at least for the GB5 CPU test, the 2.49 GHz A12Z in the DTK Mac mini running macOS performs essentially the same (1098/4555) as the 2.49 GHz A12Z in a 2020 iPad Pro running iPadOS (1118/4626).
 
Last edited:
  • Like
Reactions: unsui_grep

theorist9

macrumors 68040
May 28, 2015
3,880
3,060
Thank you for your kind words! And sorry for the typos, I wrote it all on my iPhone...

I don't think it matters at all for developers. The same drawing code will produce identical results on IMR or TBDR. [...]

Pretty much, yes. That's why I am excited about AS, it will show whether TBDR is viable on the desktop. [...]
More questions!:

I. Apple makes the broad claim that "Apple GPUs are also way, way more power-efficient than both integrated and discrete GPUs" (https://forums.macrumors.com/thread...at-egpus-are-also-dead.2246098/#post-28682372), but that is in the context of games and other graphics intensive apps, to which TBDR would be applicable. But, as you know, GPUs are used for much more than games. In other areas, where TBDR might, or definitely wouldn't, apply, would Apple's GPU design lose its efficiency advantage over NVIDIA/AMD? For instance:

1) Rendering of web pages. Here there's probably not much overdraw, so I'd guess TBDR wouldn't help much.
2) Generalized rendering, e.g., if you are doing work that requires a lot of screen area, and want your GPU to be able to drive, say, twin 8k monitors.
3) Graphics content creation, i.e., photo and video editing. Would TBDR play a significant role here? Sure, as part of photo and video editing, you need to display the scenes, but most of the processing power here goes to the editing, not to displaying what's been edited. For instance, on Pixar's site they said it took their 24,000-core rendering farm two years to render Monsters U: https://sciencebehindpixar.org/pipeline/rendering So, to the extent GPUs are used in this process (and they are), would TBDR come into play in any significant way?

Also worth keeping in mind:
4) Non-graphical uses, e.g., HPC/ML. Here TBDR would obviously be irrelevant. While MBPs and iMacs wouldn't be used for production ML, a future AS Mac Pro might be. [Though one can do lightweight ML on an MBP or iMac.]

II. In the same video cited above, Apple claims that "the performance characteristics of Apple GPUs are in line with discrete ones, not the integrated ones." To achieve this, will Apple need (relative to the performance of its current AS chips) to up its game in the GPU area much more than in the CPU area?

I know that one shouldn't use GB to make cross-platform (e.g., AS vs x86) comparisons. For instance, we now know that, on macOS, the single-core GB5 score of the A12Z (1098) is within spitting distance of the single-core GB5 score of the fastest Intel x86 iMac (1242). Yet this doesn't mean processing times for actual, real-world applications would be anywhere near that close on the two machines.

Having said that, it is striking that, by contrast to the GB5 CPU scores, the GB5 GPU scores show much more disparity: running natively on macOS, the Metal score for the A12Z is 12,610, while those for the fastest mobile and consumer desktop dGPUs currently used on the Mac (the Radeon Pro 5600M in the 16" MBP and the Radeon Pro Vega 48 in the 27" iMac) are 40,934 and 49,596, respectively.

If the GB5 scores mean anything (and I don't know if they do), it suggests that current AS (we don't know what the new AS will be like) has far more of a gap (vs. what's currently available for the MBP/iMac) in GPU performance than CPU performance.
 
Last edited:

Alex W.

macrumors 6502
Apr 18, 2020
353
190
It's done. Intel Macs are the last good MacBooks until the industry changes over. ARM Macs will not be using dedicated graphics, so it's even dead for the professional community, IMO.


Three years from now, maybe we'll see, but Apple graphics are low-end.
 

Waragainstsleep

macrumors 6502a
Oct 15, 2003
612
221
UK
It's done. Intel Macs are the last good MacBooks until the industry changes over. ARM Macs will not be using dedicated graphics, so it's even dead for the professional community, IMO.

Three years from now, maybe we'll see, but Apple graphics are low-end.

Apple's unreleased silicon is 5 years ahead of Nvidia and even the new MacBook/Air will blow away the PS5 and new Xbox.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,677
I. Apple makes the broad claim that "Apple GPUs are also way, way more power-efficient than both integrated and discrete GPUs" (https://forums.macrumors.com/thread...at-egpus-are-also-dead.2246098/#post-28682372), but that is in the context of games and other graphics intensive apps, to which TBDR would be applicable. But, as you know, GPUs are used for much more than games. In other areas, where TBDR might, or definitely wouldn't, apply, would Apple's GPU design lose its efficiency advantage over NVIDIA/AMD? For instance:

1) Rendering of web pages. Here there's probably not much overdraw, so I'd guess TBDR wouldn't help much.
2) Generalized rendering, e.g., if you are doing work that requires a lot of screen area, and want your GPU to be able to drive, say, twin 8k monitors.
3) Graphics content creation, i.e., photo and video editing. Would TBDR play a significant role here? Sure, as part of photo and video editing, you need to display the scenes, but most of the processing power here goes to the editing, not to displaying what's been edited. For instance, on Pixar's site they said it took their 24,000-core rendering farm two years to render Monsters U: https://sciencebehindpixar.org/pipeline/rendering So, to the extent GPUs are used in this process (and they are), would TBDR come into play in any significant way?

Apple marketing claims obviously need to be taken with a big grain of salt.

The efficiency of TBDR only comes into play — as you say — when there is potential overdraw. If there is no overdraw, IMR performs just as efficiently. A few caveats: additional memory coherence and ordering properties allow for things like programmable blending, which can be useful for 2D GUI compositing. Also, exploiting tile memory might make blurring and other effects more efficient.

As to photo and video editing, Apple GPUs are uniquely suited for this because — funnily enough — they don't have dedicated VRAM. So you don't have to pay for the transfer of data between the CPU and the GPU. This reduces the latency for one-off tasks, especially if we are talking about rapid video workflows. One just needs fast enough system RAM.
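As a small illustration of that "no transfer" point, here is a minimal Metal sketch (sizes and setup assumed for the example) of a buffer that both the CPU and the GPU touch without any explicit upload or readback:

```swift
import Metal

guard let device = MTLCreateSystemDefaultDevice() else { fatalError("No Metal device") }

// One shared allocation: on unified memory there is no staging copy into VRAM.
let samples = [Float](repeating: 0, count: 1_000_000)
let buffer = device.makeBuffer(bytes: samples,
                               length: samples.count * MemoryLayout<Float>.stride,
                               options: .storageModeShared)!

// ... encode and commit a compute pass that writes its results into `buffer` ...

// The CPU reads the GPU's output straight from the same memory.
let output = buffer.contents().bindMemory(to: Float.self, capacity: samples.count)
print(output[0])
```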


II. In the same video cited above, Apple claims that "the performance characteristics of Apple GPUs are in line with discrete ones, not the integrated ones." To achieve this, will Apple need (relative to the performance of its current AS chips) to up its game in the GPU area much more than in the CPU area?

[...]

Having said that, it is striking that, by contrast to the GB5 CPU scores, the GB5 GPU scores show much more disparity: running natively on macOS, the Metal score for the A12Z is 12,610, while those for the fastest mobile and consumer desktop dGPUs currently used on the Mac (the Radeon Pro 5600M in the 16" MBP and the Radeon Pro Vega 48 in the 27" iMac) are 40,934 and 49,596, respectively.

If the GB5 scores mean anything (and I don't know if they do), it suggests that current AS (we don't know what the new AS will be like) has far more of a gap (vs. what's currently available for the MBP/iMac) in GPU performance than CPU performance.

Actually, my conclusion from all this is much more optimistic. Warning: wild conjectures ahead. Please don't take it even remotely seriously!

Let us rephrase the A12Z GPU capabilities into more common marketing slang. Apple does not publish this data, but looking at Metal device info and the overall performance, it seems likely that Apple GPU ALUs are 32-wide (1024-bit) and a GPU core probably(?) contains 2 such ALUs (according to this guy, whom we will trust for now). So a full A12Z would then contain 512 "shader cores". The GPU most likely runs at a relatively low frequency to improve power consumption.

Let's further assume that the A12Z GPU uses 10 watts in sustained compute tasks (it's probably less than that, but let's be conservative). A 5600M Pro, with its 2560 low-clocked stream processors, is about 4 times faster while consuming 5x the power. If we scaled this GPU down to 512 "shader cores" while keeping the per-core performance, its expected Geekbench score would be around 8200 at the same 10 watts power consumption. Note that the relative efficiency improves with the number of cores, since we can clock them down for large energy savings. The 9300 for example has 1280 "shader cores" and is supposed to have the same TDP of 50 watts. Let's scale it down: at 512 "cores" the Geekbench score would be 8800 and power consumption 20 watts. On the other hand, if we scaled the Apple GPU up by a factor of 5x to match the total TDP of 50W (which would give it 40 cores, or 2560 "shader cores", same as the 5600M Pro actually), we would expect a Geekbench score of around 62,000.
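For what it's worth, the linear-scaling arithmetic behind those ballpark figures looks like this (the Metal scores are the ones quoted earlier in the thread; the wattages and "shader core" counts are the assumptions above, not measurements):

```swift
// Assumed inputs: A12Z ~12,610 Metal score with 512 "shader cores" at ~10 W,
// Radeon Pro 5600M ~40,934 with 2560 stream processors at ~50 W.
let a12zScore = 12_610.0
let pro5600MScore = 40_934.0

// Scale the 5600M down to 512 "cores" at the same per-core performance:
let naviScaledDown = pro5600MScore * 512.0 / 2560.0   // ~8,187 -> "around 8200"

// Scale the Apple GPU up 5x into a ~50 W budget (2560 "cores"):
let appleScaledUp = a12zScore * 5.0                   // ~63,050 -> "around 62,000"

print(naviScaledDown, appleScaledUp)
```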

Of course, all these numbers don't mean much in practical terms. But there is an important point to this little exercise.
What we see here is that the A12 GPU cores are — with reasonable certainty — at least as efficient as, and probably slightly more efficient than, the state of the art Navi core (Nvidia Turing is in a similar ballpark). What I am trying to say here is that on the fundamental architecture level, Apple has caught up in compute performance with the GPU titans. If they can build larger versions of their GPUs and retain their energy efficiency (which is the big question), the GPGPU compute performance of Apple Silicon will be on par with or superior to state of the art discrete GPU offerings in comparable power brackets.
Apple's unreleased silicon is 5 years ahead of Nvidia and even the new MacBook/Air will blow away the PS5 and new Xbox.

I really would not go that far. In the end, everyone is cooking with the same water. One can use smart algorithms and stuff to optimize rasterization, but when we get to the practical compute performance, everyone's technology is pretty much comparable. There is a reason why ALU performance of all modern GPUs appears to be so close to each other — it's limited by the process.
 

theorist9

macrumors 68040
May 28, 2015
3,880
3,060
Apple marketing claims obviously need to be taken with a big grain of salt. The efficiency of TBDR only comes into play — as you say — when there is potential overdraw. [...] What I am trying to say here is that on the fundamental architecture level, Apple has caught up in compute performance with the GPU titans. If they can build larger versions of their GPUs and retain their energy efficiency (which is the big question), the GPGPU compute performance of Apple Silicon will be on par with or superior to state of the art discrete GPU offerings in comparable power brackets.
Thanks again leman! I really enjoy the clarity and reasonableness of your answers.

It will be a lot of fun to see how they actually perform, once they're released into the wild.
 
  • Like
Reactions: leman

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,664
OBX
Apple marketing claims obviously need to be taken with a big grain of salt. [...] What we see here is that the A12 GPU cores are — with reasonable certainty — at least as efficient as, and probably slightly more efficient than, the state of the art Navi core (Nvidia Turing is in a similar ballpark). [...] There is a reason why ALU performance of all modern GPUs appears to be so close to each other — it's limited by the process.
I would argue, from AMD's perspective, that comparing compute between Vega and Navi isn't quite fair, as Vega was really focused on GPGPU instead of rendering. AMD is supposed to be coming out with compute-specific cards this year or next called CDNA, and will branch off better-performing rendering to RDNA(2).

See, this all goes back to how I was confused about Apple saying that the A12X was as powerful as the X1S. Which, according to B3D, was basically talking about TFLOPS, not what it could actually render. Using Fortnite as an example of the power isn't apples to apples either, since the "mobile" version is missing subtle things that console players get (like proper shadow detail, moving foliage, etc.) even when using Epic quality. Maybe when they move it to UE5 next year the mobile version will align with the console (and desktop) versions.
 

leman

macrumors Core
Oct 14, 2008
19,521
19,677
I would argue, from AMD's perspective, that comparing compute between Vega and Navi isn't quite fair, as Vega was really focused on GPGPU instead of rendering. AMD is supposed to be coming out with compute-specific cards this year or next called CDNA, and will branch off better-performing rendering to RDNA(2).

The compute performance between Navi and Vega is practically identical. The Vega Pro 20, with 20 CUs @ 1300 MHz, has a peak theoretical performance of 3.3 TFLOPS, while the 5300M, with 20 CUs @ 1250 MHz, has a peak theoretical performance of 3.1 TFLOPS. As you can see, the difference is just the clocks — the ALU setup is identical. Navi simply incorporates some adjustments to make it better at rendering (I have no idea whether there are any scheduler etc. changes that would result in different practical compute performance).
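Those peak figures come straight from the usual FP32 formula (CUs x 64 lanes x 2 FMA ops per clock); a quick sanity check below, noting that 1250 MHz actually works out closer to 3.2 TFLOPS, so the 3.1 figure implies a slightly lower sustained clock:

```swift
// Peak FP32 throughput for a GCN/RDNA-style GPU:
// TFLOPS = CUs x 64 lanes x 2 ops (FMA) x clock (GHz) / 1000
func peakTFLOPS(computeUnits: Double, clockGHz: Double) -> Double {
    computeUnits * 64 * 2 * clockGHz / 1000
}

print(peakTFLOPS(computeUnits: 20, clockGHz: 1.30))  // ~3.33 (Vega 20)
print(peakTFLOPS(computeUnits: 20, clockGHz: 1.25))  // ~3.20 (5300M)
```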
 

diamond.g

macrumors G4
Mar 20, 2007
11,438
2,664
OBX
The compute performance between Navi and Vega is practically identical. The Vega Pro 20, with 20 CUs @ 1300 MHz, has a peak theoretical performance of 3.3 TFLOPS, while the 5300M, with 20 CUs @ 1250 MHz, has a peak theoretical performance of 3.1 TFLOPS. As you can see, the difference is just the clocks — the ALU setup is identical. Navi simply incorporates some adjustments to make it better at rendering (I have no idea whether there are any scheduler etc. changes that would result in different practical compute performance).
I spent some time looking at the RDNA & Vega shader ISA docs. The ALUs are similar, but they are packaged (for lack of a better term) differently. According to the document, technically a WGP consists of 2 CUs. The WGP is the smallest block, not the CU, though the card can operate in CU mode. It is said that CU mode could be faster (more parallelism), so it isn't clear why they bothered pairing them into WGPs.

EDIT: assuming the math works, Vega is 0.00254 TFLOPS per MHz while RDNA is 0.00248, so even boosting the clock by 50 MHz for them to match isn't enough to bridge the 200 GFLOPS difference.
 
Last edited:

iPadified

macrumors 68020
Apr 25, 2017
2,014
2,257
The compute performance between Navi and Vega is practically identical. The Vega Pro 20, with 20 CUs @ 1300 MHz, has a peak theoretical performance of 3.3 TFLOPS, while the 5300M, with 20 CUs @ 1250 MHz, has a peak theoretical performance of 3.1 TFLOPS. As you can see, the difference is just the clocks — the ALU setup is identical. Navi simply incorporates some adjustments to make it better at rendering (I have no idea whether there are any scheduler etc. changes that would result in different practical compute performance).
Are strong GPGPU (compute) capabilities really necessary in the Apple GPU when Apple obviously has capable coprocessors, such as the Neural Engine, that can deal with compute (or rather AI acceleration)?

I do agree that scaling the Apple GPU up is possible, and it is something that will draw power in the high-end chip, but if coprocessors do the compute better, the upscaling of the GPU might be more modest.
 