
quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
Actually, re-reading the post by @Lone Deranger, from what I understand of OS design, there can only be one TLB as far as the OS is concerned. There cannot be more than one, unless we're talking about a NUMA design. I have to admit I know next to nothing about the M1's memory controller architecture, but AFAIK, to the OS, there should only be one TLB for it to manage. The memory controller then uses the TLB to fetch data from physical memory based on the virtual address provided.

I may be wrong tho. Happy to be corrected.
 
  • Like
Reactions: Irishman

leman

macrumors Core
Oct 14, 2008
19,523
19,679
Here's the full text I pulled out of the Twitter feed. It's a bit of a read, but interesting none the less.

Problem: Apple shows that the M1 Ultra GPU can use up to 105W of power. However, the highest we could ever get it to reach was around 86W. No, the Mac Studio cooling wasn't a problem, because the GPU stayed cool, around 55-58°C, compared to past Apple designs that were allowed to run up to 100°C. This makes it pretty clear that the Mac Studio cooling system is OVERKILL for most apps, which means there was a disconnect between Apple's Mac Studio cooling system engineers and the M1 Ultra chip designers/engineers. Something has gone terribly wrong in terms of chip performance.

Culprit: Each cluster of GPU cores within an M1/M1 Pro/M1 Max/M1 Ultra chip comes with a 32MB TLB, or Translation Lookaside Buffer, a memory cache that stores recent translations of virtual memory to physical memory and is used to reduce the time it takes to access a memory location.

Hishnash: "If an application has not been optimized for the M1 GPU architecture's tile memory, (not just Metal optimized) then every read/write needs to go all the way out to system memory. If the GPU compute task is issuing MANY little reads, then this will saturate the TLB. The issue is if GPU data hits the TLB and the page table being read/written to is not loaded, then that entire thread group on the GPU needs to pause while the page table is loaded into the TLB. If your application is using MANY reads/writes per second, this results in a lot of STALLED GPU thread groups. Unlike a CPU, when a GPU is waiting for data, it can't just switch to work on something else. So the GPU sits there and waits for the TLB buffer to clear in order to get more work to process."

This is why we only saw 86W peak GPU usage in an app that was considered to be decently optimized. However, apps that CLAIM Apple Silicon support but have NOT been rewritten to take advantage of Apple's TBDR tile memory system will be severely limited by the 32MB TLB if there are many reads/writes.
The problem is that ALMOST ALL apps out there haven't been optimized for Apple's TBDR tile memory system. Many software developers simply get it to work using the traditional immediate-mode (IMR) model and call it good to go, unaware of the 32MB TLB limitation that bottlenecks performance.

Hishnash: "What apps should be doing is loading as much data as possible into the tile mem and flushing it out in large chunks when needed. I bet a lot of the writes (over 95%) are for temporary values that could've been stored in tile mem and never needed to be written at all."

Hishnash: "I expect that the people building the M1 family of chips didn't expect applications to be running on it that are not TBDR optimized. So they thought 32MB would be enough."

WRONG. Most apps aren't optimized for tile memory, even if they claim to support Apple Silicon.

Keep in mind that in the 5-7 years since Apple started engineering the M1 family, reliance on GPU performance has skyrocketed, so the chip designers probably didn't anticipate so many reads/writes hitting the 32MB TLB.

What does this mean? The M1 family of chips, including the M1 Ultra, has a major limitation that can't be fixed unless apps are properly optimized. Here's the problem.

Hishnash: "The effort needed to optimize for tile memory is MASSIVE. It requires going all the way back to the drawing board, re-considering everything, like the concept that there is a local on-die memory pool you can read/write from with very very low perf impact is unthinkable in the current desktop GPU space. It’s a matter of a complete rewrite at a concept/algorithmic level."

Why is this such a big problem for M1 Ultra?
With the M1 and M1 Pro chips, there wasn't enough GPU performance to hit that 32MB TLB limit. However, the M1 Max is where you see GPU scaling fall off a cliff due to the TLB, especially the 32-core GPU model.
This problem scales linearly, so if, for example, 26 cores is the sweet spot for the M1 Max, with the remaining 6 cores bottlenecked by the TLB, then the M1 Ultra will have 12 GPU cores bottlenecked, because it features two 32-core M1 Max dies. No wonder it scales poorly.

The solution from hishnash: "Increasing the TLB will help a lot for applications that are not optimized. This is important because many apps will NEVER be optimized, and even fewer games."
This is why gaming performance is so poor on M1 Ultra, apart from the Rosetta bottleneck.

Hishnash: "For game engines that are not TBDR aware/optimized, they might be currently bottlenecked on reads.. and depending on the post-processing effects, might have some large bottlenecks on writes if they're not using tile memory and tile compute shaders where possible."

The reason World of Warcraft runs so well and compares well to the RTX 3080/3090 is because APPLE helped them optimize the game PROPERLY to take advantage of the new TBDR tile-based architecture. (WoW Metal Update Released one week after M1 event proves they got help from Apple.)

The solution from our source: Future M-chip families (Hopefully and probably M2) will see a big increase in the TLB to solve this problem since developers are likely to be slow in optimizing apps. Apple will likely release white papers at WWDC on how to optimize apps properly.

This means that the M2 Ultra will see a HUGE boost in GPU performance over the M1 Ultra if the 32MB TLB bottleneck is removed. And that performance boost will be on top of higher clock speeds and potentially higher GPU core counts.

The only hope for the M1 Ultra is that developers finally decide to completely rethink and rewrite their apps to support the TBDR tile-based memory architecture. (Good luck)
Oh, and by the way, expect hardware ray-tracing support on future M-chip families. (Hopefully M2 Pro+)

I believe this is massively blown out of proportion. Yes, GPUs access RAM through a cache layer (duh!). Yes, a cache miss causes a stall (duh!). None of this is new or surprising; that's just how computers operate. And no, the cache size is not a "limitation overlooked by the engineers": Apple GPUs have more cache than most GPUs out there. I mean, the 3090 Ti has only 6MB of cache and has to deal with the same stalls, but somehow it works well.

Anyway, all these things are measurable and testable. I mean, it is entirely possible that M1 Ultra is bandwidth/latency starved for some tasks. But I want to see a shader trace showing this.

Hishnash: "If an application has not been optimized for the M1 GPU architecture's tile memory, (not just Metal optimized) then every read/write needs to go all the way out to system memory. If the GPU compute task is issuing MANY little reads, then this will saturate the TLB. The issue is if GPU data hits the TLB and the page table being read/written to is not loaded, then that entire thread group on the GPU needs to pause while the page table is loaded into the TLB. If your application is using MANY reads/writes per second, this results in a lot of STALLED GPU thread groups. Unlike a CPU, when a GPU is waiting for data, it can't just switch to work on something else. So the GPU sits there and waits for the TLB buffer to clear in order to get more work to process."

This is simply false. GPUs are champions at hiding memory stalls. Memory stalling is a fundamental reality of GPU operation, which is why all contemporary GPUs are designed to deal with it. GPUs run many jobs in parallel; when one job is waiting for a memory fetch, the GPU switches to running another. It's exactly the other way around from what this Twitter user claims: the CPU has to wait, the GPU will do something else.
 

leman

macrumors Core
Oct 14, 2008
19,523
19,679
Actually, re-reading the post by @Lone Deranger, from what I understand of OS design, there can only be one TLB as far as the OS is concerned. There cannot be more than one, unless we're talking about a NUMA design. I have to admit I know next to nothing about the M1's memory controller architecture, but AFAIK, to the OS, there should only be one TLB for it to manage. The memory controller then uses the TLB to fetch data from physical memory based on the virtual address provided.

I may be wrong tho. Happy to be corrected.

I think you are spot on. The GPU would only need its own TLB if it used its own memory mapping, but at that point all you do is massively complicate your cache hierarchies and mess everything up, especially if you want to have unified memory. It makes much more sense to use the same virtual address space everywhere and have the memory controller figure it out. And Apple's memory controllers are second to none at handling parallel memory requests.

I think the folks on Twitter are fundamentally confused anyway. TLB capacity is measured in the number of entries, not in bytes. Maybe they mean the amount of RAM mapped by the TLB? But the A14 already has an absolutely humongous 3072-entry TLB, which covers 48MB of memory (incidentally, the LLC on M1 Max is 48MB), so I am not sure where the 32MB they talk about comes from.
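To spell the arithmetic out (assuming Apple Silicon's 16 KB page size): 3072 entries × 16 KB per page = 49,152 KB = 48 MB of address space covered by the TLB.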

Or maybe there is just something I am not getting (like an additional level of indirection on the GPU itself).
 
Last edited:

diamond.g

macrumors G4
Mar 20, 2007
11,437
2,666
OBX
Actually, re-reading the post by @Lone Deranger, from what I understand of OS design, there can only be one TLB as far as the OS is concerned. There cannot be more than one, unless we're talking about a NUMA design. I have to admit I know next to nothing about the M1's memory controller architecture, but AFAIK, to the OS, there should only be one TLB for it to manage. The memory controller then uses the TLB to fetch data from physical memory based on the virtual address provided.

I may be wrong tho. Happy to be corrected.
M1 Ultra is NUMA so there should be 2 of everything that exists in a single M1 Max. Now how the OS sees those two of everything is up for debate.
 
  • Like
Reactions: Irishman

quarkysg

macrumors 65816
Oct 12, 2019
1,247
841
M1 Ultra is NUMA so there should be 2 of everything that exists in a single M1 Max.
AFAIK, macOS is not NUMA capable, so I don't think it is. The UltraFusion interconnect likely combines the memory controllers of both M1 Max dies and presents them to macOS as one cohesive unit. Apple's Peek Performance keynote said that it is presented to the OS as one chip, instead of two.
 
  • Like
Reactions: Irishman

diamond.g

macrumors G4
Mar 20, 2007
11,437
2,666
OBX
AFAIK, macOS is not NUMA capable, so I don't think it is. The UltraFusion interconnect likely combines the memory controllers of both M1 Max dies and presents them to macOS as one cohesive unit. Apple's Peek Performance keynote said that it is presented to the OS as one chip, instead of two.
macOS is totes NUMA aware, as the pre-trashcan Mac Pro used to come with a dual socket option. Admittedly they are not doing that in this case (as they say it shows up as 1 SoC instead of 2), but they could have if they wanted.

It also, to me, highlights something that folks keep saying. If, according to Apple, you don't have to write your app as though you are referencing two CPUs, then why are folks insisting that apps have to be further optimized to take advantage of something that is supposed to be transparent to them? If your app is taking advantage of the M1 Max already, then why does it need to be rewritten to take advantage of the Ultra?
 
  • Like
Reactions: Irishman

leman

macrumors Core
Oct 14, 2008
19,523
19,679
M1 Ultra is NUMA so there should be 2 of everything that exists in a single M1 Max. Now how the OS sees those two of everything is up for debate.

Is it though? If the chip-to-chip communication is handled at the cache level, it doesn’t have to be NUMA. Maybe you’d get slightly higher latency, maybe not even that. The bandwidth of the interconnect is definitely high enough: the M1 Max SLC bandwidth is less than 1TB/s, so you can probably fetch data from the other die's cache without any loss of performance.
 
  • Like
Reactions: Irishman

jmho

macrumors 6502a
Jun 11, 2021
502
996
This is why Apple needs to make its own 3D software as soon as possible.

It's a chicken-and-egg problem, where 3D graphics on the Mac doesn't have the user base because the software runs poorly, and nobody is motivated to optimise their software because the Mac doesn't have the user base.

The other reason is that Apple really needs to show whether good performance is even possible. Nobody wants vague tweets from some random dude about 32MB TLBs; they want conclusive demos and code samples from Apple themselves that show how to run the GPU at as close to 100% as possible.
 

diamond.g

macrumors G4
Mar 20, 2007
11,437
2,666
OBX
Is it though? If the chip-to-chip communication is handled at the cache level, it doesn’t have to be NUMA. Maybe you’d get slightly higher latency, maybe not even that. The bandwidth of the interconnect is definitely high enough: the M1 Max SLC bandwidth is less than 1TB/s, so you can probably fetch data from the other die's cache without any loss of performance.
It seems like we shouldn't be seeing scaling issues (especially on the GPU side for 3D rendering) if it were not NUMA internally (we all know it isn't exposed that way to the OS/user).
 
  • Like
Reactions: Irishman

leman

macrumors Core
Oct 14, 2008
19,523
19,679
It seems like we shouldn't be seeing scaling issues (especially on the GPU side for 3D rendering) if it were not NUMA internally (we all know it isn't exposed that way to the OS/user).

Maybe. Or maybe the current software does not submit enough parallel work to fully utilise the Ultra. Or maybe there is a bug on Apple's side somewhere. I don't think we know enough to pinpoint the exact location of the problem.
 
  • Like
Reactions: Irishman

thenewperson

macrumors 6502a
Mar 27, 2011
992
912
Am I correct in understanding that the issue here is software isn't optimised for Apple's GPUs generally? Seems like a weird thing to frame as an inherent ASi limitation.
 

diamond.g

macrumors G4
Mar 20, 2007
11,437
2,666
OBX
BTW, here is a discussion of the "32MB TLB story" on Hacker News. It seems that most GPU folks are as confused by these tweets as I am.

Yeah, a lot of folks there are really confused, since every GPU would exhibit the same issues given that they all work (more or less) similarly. It is also curious that, with as much bandwidth as Apple says these chips have, falling back to IMR-style rendering would have such a large impact on performance. Especially considering the Ultra has more bandwidth than the W6900X they sell.

Am I correct in understanding that the issue here is software isn't optimised for Apple's GPUs generally? Seems like a weird thing to frame as an inherent ASi limitation.
That is what is being said. Weirdly, if your app is optimized enough to saturate the M1 Max, apparently that isn't enough to guarantee it will also saturate an M1 Ultra.
 
  • Like
Reactions: Irishman

leman

macrumors Core
Oct 14, 2008
19,523
19,679
It is also curious that, with as much bandwidth as Apple says these chips have, falling back to IMR-style rendering would have such a large impact on performance. Especially considering the Ultra has more bandwidth than the W6900X they sell.

What do you mean? Apple GPUs don't do IMR-style rendering at all. And yes, I can imagine that in an IMR-like scenario (e.g. where Apple is forced to flush tiles early, such as when you compute depth values on the fly or use transparency) the Apple GPU will be less efficient than a standard IMR GPU due to having higher overhead. But this is something that can be tested, I suppose.
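To make the "compute depth values on the fly" case concrete, this is the kind of fragment function I have in mind; a minimal MSL sketch with made-up names. Writing to a [[depth(...)]] output means visibility can no longer be settled from the binned geometry up front, so the tile pipeline loses part of its advantage:

Code:
#include <metal_stdlib>
using namespace metal;

struct FragOut {
    float4 color [[color(0)]];
    float  depth [[depth(any)]];   // shader-computed depth forces late depth resolution
};

fragment FragOut displaced_fragment(float4 pos [[position]],
                                    constant float &bias [[buffer(0)]])
{
    FragOut out;
    out.color = float4(1.0, 0.0, 0.0, 1.0);
    out.depth = pos.z + bias;      // depth is only known after the shader has run
    return out;
}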
 
  • Like
Reactions: Irishman

diamond.g

macrumors G4
Mar 20, 2007
11,437
2,666
OBX
What do you mean? Apple GPUs don't do IMR-style rendering at all. And yes, I can imagine that in an IMR-like scenario (e.g. where Apple is forced to flush tiles early, such as when you compute depth values on the fly or use transparency) the Apple GPU will be less efficient than a standard IMR GPU due to having higher overhead. But this is something that can be tested, I suppose.
I was referring to IMR-like scenarios. Correct me if I am off base here, but the concern with falling back to IMR-like behaviour is that you don't have enough bandwidth to keep up, which shouldn't be true on M1x Macs.
 
  • Like
Reactions: Irishman

leman

macrumors Core
Oct 14, 2008
19,523
19,679
I was referring to IMR-like scenarios. Correct me if I am off base here, but the concern with falling back to IMR-like behaviour is that you don't have enough bandwidth to keep up, which shouldn't be true on M1x Macs.

I think lack of bandwidth is your least concern here. These scenarios (high overdraw etc.) are pretty much the worst case for all GPUs, since they defeat many common GPU optimisations and result in poor hardware utilisation. That's why a lot of rendering optimisation is about avoiding them in the first place. But again, I can imagine that the M1 takes a larger hit because it's just not what it has been designed to do. There is additional overhead from setting up the TBDR functionality, and it does not save you anything; you probably end up doing more work. But I have no idea how they do it in practice.
 
Last edited:
  • Like
Reactions: Irishman

Lone Deranger

macrumors 68000
Apr 23, 2006
1,900
2,145
Tokyo, Japan
This is why Apple needs to make its own 3D software as soon as possible.

It's is a chicken and egg problem where 3D graphics on Mac doesn't have the user-base because the software runs poorly, and nobody is motivated to optimise their software because Mac doesn't have the user-base.

The other reason is that Apple really needs to show if good performance is even possible. Nobody wants vague tweets from some random dude about 32MB TLBs, they want conclusive demos and code samples from Apple themselves that show how to run the GPU at as close to 100% as possible.

Wouldn't that be something. I'd like that, if only to see what Apple would do GUI-wise for a 3D app.
Maybe they can buy the old XSI code off of Autodesk and patch that up? ;)
 

Irishman

macrumors 68040
Nov 2, 2006
3,449
859
Why? The original-source-tweet guy contradicted himself. It’s clickbait. I’m starting to worry about you, Xiao_Xi.

Why do you think so? Max and Vadim Yuryev have experience in the Apple media space going back before their Max Tech YouTube channel. One or both of them used to work for or with Apple Insider, as I have seen at least one of them featured in a video.
 

jmho

macrumors 6502a
Jun 11, 2021
502
996
The number of people with the specialised knowledge required to understand this issue is probably only a few hundred.

You have to be a) an experienced real time graphics programmer, who is b) a Metal API expert, and c) has a very deep understanding of how the hardware itself works.

And most of those people work at Apple, and won't be out there speculating on twitter.

There are plenty of people with partial competence in each of those areas, but that's why it's so easy to spread misinformation.

I like Max Tech, but they absolutely don't have any of those qualifications, so the question is whether or not we trust their source, HishNash, who definitely seems to have some competence, but who knows if it's high enough to correctly diagnose this issue.
 
  • Love
Reactions: ader42

diamond.g

macrumors G4
Mar 20, 2007
11,437
2,666
OBX
The number of people with the specialised knowledge required to understand this issue is probably only a few hundred.

You have to be a) an experienced real time graphics programmer, who is b) a Metal API expert, and c) has a very deep understanding of how the hardware itself works.

And most of those people work at Apple, and won't be out there speculating on twitter.

There are plenty of people with partial competence in each of those areas, but that's why it's so easy to spread misinformation.

I like Max Tech, but they absolutely don't have any of those qualifications, so the question is whether or not we trust their source, HishNash, who definitely seems to have some competence, but who knows if it's high enough to correctly diagnose this issue.
Yeah, unless (until?) Apple says something it feels like we are just grasping at straws to be honest.
 

Fallinangel

macrumors regular
Dec 21, 2005
200
20
Max and Vadim Yuryev have experience in the Apple media space going back before their Max Tech YouTube channel. One or both of them used to work for or with Apple Insider, as I have seen at least one of them featured in a video.
That by no means makes them experts on integrated circuits, SoCs, or SiP processors!
If you read a little into the comments on Hacker News (cf. link above), where much more tech-savvy people congregate, it seems that they jumped to conclusions without understanding much about what they were writing/talking about.
At the rate that Max Tech pumps out videos, I honestly don't even see it as seriously researched and/or vetted tech journalism. It's clearly just about click-bait and fast-food pseudo-news propagation. More serious outlets that do rigorous testing (e.g. Ars Technica, AnandTech) can't put out a video or article every day or two.
This is also not the first thing they got wrong in their reporting about the Mac Studio. They just jump on the latest half-truth on Twitter and mouth-breathe about it in their videos.
 

Irishman

macrumors 68040
Nov 2, 2006
3,449
859
That by no means makes them experts on integrated circuits, SoCs, or SiP processors!
If you read a little into the comments on Hacker News (cf. link above), where much more tech-savvy people congregate, it seems that they jumped to conclusions without understanding much about what they were writing/talking about.
At the rate that Max Tech pumps out videos, I honestly don't even see it as seriously researched and/or vetted tech journalism. It's clearly just about click-bait and fast-food pseudo-news propagation. More serious outlets that do rigorous testing (e.g. Ars Technica, AnandTech) can't put out a video or article every day or two.
This is also not the first thing they got wrong in their reporting about the Mac Studio. They just jump on the latest half-truth on Twitter and mouth-breathe about it in their videos.

Well, the difference between you and me is that I know that I’m ignorant about ICs, SIPs, and SOCs.

By no means did I claim that Max and Vadim are technical chip experts. I don’t have their budget nor their bravery to disassemble a Mac Studio just to see what’s inside. To my eye, they make Mac videos for non-tech-savvy Mac users like me out there.

I think that you might be letting the perfect be the enemy of the good here. Making mistakes about the Mac Studio in videos is one thing. It doesn’t make them bad guys, even if I grant that you’re correct about this having happened numerous times. No one is always right, not even Ars and AnandTech, and they’ll be the first to tell you that.
 

sunny5

macrumors 68000
Jun 11, 2021
1,838
1,706
This is why Apple needs to make its own 3D software as soon as possible.

It's is a chicken and egg problem where 3D graphics on Mac doesn't have the user-base because the software runs poorly, and nobody is motivated to optimise their software because Mac doesn't have the user-base.

The other reason is that Apple really needs to show if good performance is even possible. Nobody wants vague tweets from some random dude about 32MB TLBs, they want conclusive demos and code samples from Apple themselves that show how to run the GPU at as close to 100% as possible.
But how? Apple doesn't have such a thing unless they acquire Blender or something similar.
 

jmho

macrumors 6502a
Jun 11, 2021
502
996
But how? Apple doesn't have such a thing unless they acquire Blender or something similar.
It doesn't even have to be particularly good or feature rich. It just has to be the performance benchmark that other software can be measured against.

AMD have their own "ProRender" software, so I don't see why Apple couldn't also have something similar.

I'd eventually love to see a full Logic / FCP style 3D suite, but just for starters I want to see a hardware accelerated renderer that can take advantage of an M1 Ultra to the fullest and prove to 3rd party vendors that optimisation is going to be worth their time and money.

Currently the biggest question mark is: is the hardware slow, or is it the software not being optimised?

Apple could (hopefully) put that question to rest in a fairly short amount of time.
 

Xiao_Xi

macrumors 68000
Oct 27, 2021
1,628
1,101
Apple needs to make its own 3D software as soon as possible
Apple doesn't need to create a full 3D application, just a renderer optimized for Metal that people can use with different 3D programs.

Apple doesn't have such a thing unless they acquire Blender or something similar.
Why would Apple buy Blender instead of forking its code?
 